In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from scipy.stats import chi2_contingency, chisquare

# Q1. Marital Status and Drinking
A national survey was conducted to obtain information on the alcohol consumption patterns of U.S. adults by marital status.
A random sample of 1772 residents, aged 18 and older, yielded the data displayed in the table below:

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/032/895/original/Screenshot_2023-04-27_at_1.36.46_PM.png?1682588952)

Test, at the 5% significance level, whether marital status and alcohol consumption are associated.

Choose the correct option below:

In [6]:
data = [
  [67,213,74],
  [411,633,129],
  [85,51,7],
  [27,60,15],
]
chi_stat, p_val, dof, expected = chi2_contingency(data)
print('Chi Stat:', chi_stat)
print('P Value:', p_val)
print('Degrees of Freedom:', dof)
print('Expected:', expected)

alpha = 0.05 # 5% significance level
if p_val <= alpha:
  print('Dependent (reject H0): Marital status and alcohol consumption are associated.')
else:
  print('Independent (fail to reject H0): Marital status and alcohol consumption are not associated.')

Chi Stat: 94.26880078578765
P Value: 3.925170647869838e-18
Degrees of Freedom: 6
Expected: [[117.86681716 191.18397291  44.94920993]
 [390.55869074 633.49943567 148.94187359]
 [ 47.61286682  77.22968397  18.15744921]
 [ 33.96162528  55.08690745  12.95146727]]
Dependent (reject H0): Marital status and alcohol consumption are associated.
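The expected counts printed above come from the row and column totals of the observed table: each cell is (row total × column total) / grand total. The sketch below rebuilds them by hand to show what `chi2_contingency` does for a table of this shape; the names `expected_manual`, `chi_stat_manual`, `dof_manual` and `p_val_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from Q1 (rows and columns as in the notebook's data list).
observed = np.array([
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected_manual = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic and its degrees of freedom.
chi_stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution.
p_val_manual = chi2.sf(chi_stat_manual, dof_manual)

print(chi_stat_manual, p_val_manual, dof_manual)
```

These values should agree with the `chi2_contingency` output, since no continuity correction is applied when the table has more than one degree of freedom.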


# Q2. Internet Use
A random sample of adults yielded the following data on age and Internet usage.

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/032/904/original/Screenshot_2023-04-27_at_2.34.39_PM.png?1682592271)

At 1% significance level, does the data provide sufficient evidence to conclude that an association exists between age and Internet usage?

Choose the correct option below:


In [8]:
data = [
  [6,38,31],
  [14,31,4],
  [50,50,5]
]
chi_stat, p_val, dof, expected = chi2_contingency(data)
print('Chi Stat:', chi_stat)
print('P Value:', p_val)
print('Degrees of Freedom:', dof)
print('Expected:', expected)

alpha = 0.01 # 1% significance level
if p_val < alpha:
  print('Dependent (reject H0): Age and Internet usage are associated.')
else:
  print('Independent (fail to reject H0): Age and Internet usage are not associated.')

Chi Stat: 60.74604310295546
P Value: 2.0217185191724964e-12
Degrees of Freedom: 4
Expected: [[22.92576419 38.97379913 13.10043668]
 [14.97816594 25.4628821   8.55895197]
 [32.09606987 54.56331878 18.34061135]]
Dependent (reject H0): Age and Internet usage are associated.
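The same decision can be reached by comparing the statistic with a critical value instead of comparing the p-value with alpha. The sketch below is one way to do that with `scipy.stats.chi2.ppf`; the statistic and degrees of freedom are copied from the output above, and `critical_value` and `reject_h0` are illustrative names.

```python
from scipy.stats import chi2

chi_stat = 60.74604310295546  # statistic reported in the Q2 output
dof = 4                       # (3 - 1) * (3 - 1)
alpha = 0.01                  # 1% significance level

# Value that a chi-square variable with `dof` degrees of freedom
# exceeds with probability alpha.
critical_value = chi2.ppf(1 - alpha, dof)

# Reject H0 when the observed statistic lies beyond the critical value.
reject_h0 = chi_stat > critical_value

print('Critical value:', critical_value)
print('Reject H0:', reject_h0)
```

Both approaches are equivalent: the statistic exceeds the critical value exactly when the p-value is below alpha.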


# Q3. Income and Residence
The U.S. Census Bureau compiles information on the money income of people by type of residence and publishes its finding in Current Population Reports.

Independent simple random samples of people were taken from the following types of residence:

- Inside Principal Cities (IPC),
- Outside Principal Cities but within Metropolitan Areas (OPC), and
- Outside Metropolitan Areas (OMA).

The Census gave the following data on income levels:

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/032/905/original/Screenshot_2023-04-27_at_2.43.31_PM.png?1682592447)

At the 5% significance level, can you conclude that the type of residence is related to income level?

Choose the correct option below:

In [10]:
data = [
  [75,106,46],
  [106,161,61],
  [98,183,52],
  [48,102,14],
]
chi_stat, p_val, dof, expected = chi2_contingency(data)
print('Chi Stat:', chi_stat)
print('P Value:', p_val)
print('Degrees of Freedom:', dof)
print('Expected:', expected)

alpha = 0.05 # 5% significance level
if p_val < alpha:
  print('Dependent (reject H0): Type of residence and income level are associated.')
else:
  print('Independent (fail to reject H0): Type of residence and income level are not associated.')

Chi Stat: 15.727554171801787
P Value: 0.015293451318673136
Degrees of Freedom: 6
Expected: [[ 70.55988593 119.11026616  37.32984791]
 [101.95437262 172.10646388  53.9391635 ]
 [103.50855513 174.73003802  54.76140684]
 [ 50.97718631  86.05323194  26.96958175]]
Dependent (reject H0): Type of residence and income level are associated.
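A significant result says that some association exists, but not which cells drive it. One common follow-up, which is not part of the original notebook, is to inspect the Pearson residuals (observed − expected) / √expected. The sketch below assumes the same table as above; `residuals` is an illustrative name.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Q3 (rows and columns as in the notebook's data list).
observed = np.array([
    [75, 106, 46],
    [106, 161, 61],
    [98, 183, 52],
    [48, 102, 14],
])

chi_stat, p_val, dof, expected = chi2_contingency(observed)

# Pearson residual per cell; the squared residuals sum to the chi-square statistic.
residuals = (observed - expected) / np.sqrt(expected)

print(np.round(residuals, 2))
print('Sum of squared residuals:', (residuals ** 2).sum())
```

Cells with a large absolute residual contribute most to the statistic. Note also that the p-value of about 0.015 leads to rejection at the 5% level but would not at the 1% level.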


# Q4. Observation representing sample

According to a survey conducted on car owners, it was determined that

- 60% of owners have only one car,
- 28% have two cars, and
- 12% have three or more cars.

Suppose Ram conducted his own survey within his residential society, and found that

- 73 owners have only one car,
- 38 owners have two cars, and
- 18 owners have three or more cars.

Determine whether Ram's survey supports the original one, with a significance level of 0.05.

In [18]:
observed_counts = np.array([73, 38, 18])
expected_counts = np.array([0.60 * 129, 0.28 * 129, 0.12 * 129])

chi_squared_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)

alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis: Thus concluding that Ram's survey results are not consistent with the original survey.")
else:
    print("Fail to reject the null hypothesis: Thus concluding that Ram's survey results are consistent with the original survey.")

print(f"Chi-Square Statistic: {chi_squared_stat}")
print(f"P-value: {p_value}")

Fail to reject the null hypothesis: Thus concluding that Ram's survey results are consistent with the original survey.
Chi-Square Statistic: 0.7582133628645247
P-value: 0.6844725882551137
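For a goodness-of-fit test, the statistic is the sum of (observed − expected)² / expected over the categories, with (number of categories − 1) degrees of freedom. The sketch below recomputes the Q4 result by hand to show what `chisquare` returns; `proportions`, `chi_stat`, `dof` and `p_value` are illustrative names.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([73, 38, 18])             # Ram's survey counts
proportions = np.array([0.60, 0.28, 0.12])    # shares from the original survey

# Expected counts scale the reference proportions to Ram's sample size.
expected = proportions * observed.sum()

# Pearson chi-square statistic for goodness of fit.
chi_stat = ((observed - expected) ** 2 / expected).sum()

# k categories with a fixed total leave k - 1 degrees of freedom.
dof = len(observed) - 1

# Upper-tail probability of the chi-square distribution.
p_value = chi2.sf(chi_stat, dof)

print(f"Chi-Square Statistic: {chi_stat}")
print(f"P-value: {p_value}")
```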


# Q5. Distribution of smartphone brands
A mobile retail store owner is interested in the distribution of popular smartphone brands among a group of 200 people.

They expect that 30% of people would prefer Brand A, 40% would prefer Brand B and 30% would prefer Brand C.

However, upon surveying the group, the results are as follows: 70 prefer Brand A, 80 prefer Brand B, and 50 prefer Brand C.

Conduct an appropriate test to see if the distribution of preferences matches the store owner's expectations at a 5% significance level.

Choose the correct option below:

- a)

  P-value: 0.2048
  
  We fail to reject the null hypothesis
  
  Thus concluding that the observed distribution matches the expectations.
- b)

  P-value: 0.2048
  
  The null hypothesis is rejected
  
  Thus concluding that the observed distribution does NOT match the expectations.
- c)

  P-value: 0.1888
  
  We fail to reject the null hypothesis
  
  Thus concluding that the observed distribution matches the expectations.
- d)
  
  P-value: 0.1888
  
  The null hypothesis is rejected
  
  Thus concluding that the observed distribution does not match the expectations.

In [20]:
observed_counts = np.array([70, 80, 50])
expected_counts = np.array([0.30 * 200, 0.40 * 200, 0.30 * 200])

chi_squared_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)

alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis: Thus concluding that the observed distribution does not match the expectations.")
else:
    print("Fail to reject the null hypothesis: Thus concluding that the observed distribution matches the expectations.")

print(f"Chi-Square Statistic: {chi_squared_stat}")
print(f"P-value: {p_value}")

Fail to reject the null hypothesis: Thus concluding that the observed distribution matches the expectations.
Chi-Square Statistic: 3.3333333333333335
P-value: 0.18887560283756186


# Q6. Dof Politics
In a social science survey, researchers investigate the relationship between two categorical variables.

Those variables, along with their categories are:

- Variable A: PoliticalOpinions
  - Strongly Agree,
  - Agree,
  - Disagree,
  - Strongly Disagree
- Variable B: DemographicInfo (Age Group)
  - 18-25,
  - 26-35,
  - 36-50

The goal is to determine if there is a significant association between the opinions on the political issue and demographic characteristics, specifically age groups.

In this scenario, what are the degrees of freedom for the chi-square test of independence?

In [11]:
(4-1)*(3-1)

6
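The calculation above uses the rule (rows − 1) × (columns − 1), with 4 opinion categories and 3 age groups. The sketch below wraps the rule in a small helper and cross-checks it against `chi2_contingency`. The helper name `independence_dof` is hypothetical, and the table of counts is a placeholder used only to obtain a 4 × 3 shape, not survey data.

```python
import numpy as np
from scipy.stats import chi2_contingency

def independence_dof(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)

dof_formula = independence_dof(4, 3)

# Placeholder 4 x 3 table (NOT real data); only its shape matters here.
placeholder_table = np.full((4, 3), 10)
dof_scipy = chi2_contingency(placeholder_table)[2]

print(dof_formula, dof_scipy)
```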

# Q7. Time spent on website
Suppose you are interested in the distribution of time spent on a website by its users. You expect that:

- 20% of users spend less than 5 minutes,
- 50% spend between 5 and 10 minutes, and
- 30% spend more than 10 minutes.

After collecting data from 200 users, you find that

- 30 users spent less than 5 minutes,
- 85 users spent between 5 and 10 minutes, and
- 85 users spent more than 10 minutes.

Conduct an appropriate test to see if the distribution of browsing times matches your expectations at a 5% significance level. Choose the correct option from below:

```
a)

P-value: 0.0005

We fail to reject the null hypothesis

Thus concluding that the distribution of browsing times matches expectations.
```

```
b)

P-value: 0.0005

The null hypothesis is rejected, 

Thus concluding that the distribution of browsing times does not match expectations.
```

```
c)

P-value: 0.33

We fail to reject the null hypothesis

Thus concluding that the distribution of browsing times matches expectations.
```

```
d)

P-value: 0.33

The null hypothesis is rejected, 

Thus concluding that the distribution of browsing times does not match expectations.
```

In [13]:
observed_counts = np.array([30, 85, 85])
expected_counts = np.array([0.20 * 200, 0.50 * 200, 0.30 * 200])

chi_squared_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)

alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis: The distribution of browsing times does not match expectations.")
else:
    print("Fail to reject the null hypothesis: The distribution of browsing times matches expectations.")

print(f"Chi-Square Statistic: {chi_squared_stat}")
print(f"P-value: {p_value}")

Reject the null hypothesis: The distribution of browsing times does not match expectations.
Chi-Square Statistic: 15.166666666666666
P-value: 0.0005088621855732918
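To see which category is responsible for the rejection, the statistic can be split into its per-category terms (observed − expected)² / expected. This breakdown is a sketch added for illustration and is not part of the original notebook; `labels` and `contributions` are illustrative names.

```python
import numpy as np

labels = ["< 5 min", "5-10 min", "> 10 min"]
observed = np.array([30, 85, 85])
expected = np.array([0.20, 0.50, 0.30]) * 200

# Each category's share of the chi-square statistic.
contributions = (observed - expected) ** 2 / expected

for label, obs, exp, part in zip(labels, observed, expected, contributions):
    print(f"{label}: observed={obs}, expected={exp:.0f}, contribution={part:.4f}")

print(f"Total: {contributions.sum():.4f}")
```

The "more than 10 minutes" category, where 85 users were observed against 60 expected, accounts for most of the total.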


# Q8. Help to choose the right test
Five years ago, a telecom company surveyed smartphone owners in a certain town and found that 73% of the population owned a smartphone. It has used this figure to make business decisions ever since.

A new marketing manager has now joined, and he believes this value is no longer valid. He therefore conducts a survey of 500 people and finds that 420 of them say they own a smartphone.

Which statistical test would you use to compare these two survey data?
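The source does not give the answer here. One common choice for comparing a single sample proportion (420 of 500) with a reference value (73%) is a one-sample z-test for a proportion; a chi-square goodness-of-fit test on the two categories "owns" and "does not own" is equivalent. The sketch below assumes a two-sided alternative, and all variable names are illustrative.

```python
import math
from scipy.stats import norm, chisquare

p0 = 0.73          # proportion from the survey five years ago (H0 value)
n = 500            # size of the new survey
successes = 420    # respondents who own a smartphone

p_hat = successes / n

# Standard error of the sample proportion under H0.
standard_error = math.sqrt(p0 * (1 - p0) / n)

# One-sample z-test for a proportion, two-sided.
z_stat = (p_hat - p0) / standard_error
p_value = 2 * norm.sf(abs(z_stat))

# Equivalent chi-square goodness-of-fit test with two categories.
chi_stat, chi_p = chisquare(
    f_obs=[successes, n - successes],
    f_exp=[p0 * n, (1 - p0) * n],
)

print(f"z = {z_stat:.4f}, p-value = {p_value:.3g}")
print(f"chi-square = {chi_stat:.4f}, p-value = {chi_p:.3g}")
```

With two categories the chi-square statistic equals the square of the z statistic, so both tests lead to the same two-sided p-value.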
