In [14]:
import numpy as np
import pandas as pd
from scipy.stats import chisquare, chi2

## Goodness of fit

In [15]:
# Coin toss example 1
p_val = 1 - chi2.cdf(0.72, df=1)
p_val

0.3961439091520741

In [16]:
alpha = 0.05
if p_val < alpha:
  print('Reject H0')
else:
  print('Fail to Reject H0')

Fail to Reject H0


In [17]:
# Coin toss example 2
p_val = 1 - chi2.cdf(32, df=1)
p_val

1.5417257914762672e-08

In [18]:
# Coin toss example 1, letting chisquare compute the statistic from the counts
chi_stat, p_val = chisquare([28, 22], [25, 25])
p_val

0.3961439091520741

In [19]:
alpha = 0.05
if p_val < alpha:
  print('Reject H0')
else:
  print('Fail to Reject H0')

Fail to Reject H0


In [20]:
# Coin toss example 2, letting chisquare compute the statistic from the counts
chi_stat, p_val = chisquare([45, 5], [25, 25])
p_val

1.5417257900280013e-08
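
The statistics 0.72 and 32 used with `chi2.cdf` above are the Pearson chi-square statistics for the two sets of coin-toss counts. The sketch below shows one way to compute them by hand; the helper name `chi_square_gof` is hypothetical, not part of SciPy. It uses `chi2.sf` (the upper-tail probability) rather than `1 - chi2.cdf(...)`, which avoids subtracting two nearly equal numbers when the p-value is tiny and may explain the small difference in trailing digits between the two example 2 outputs.

```python
import numpy as np
from scipy.stats import chi2, chisquare


def chi_square_gof(observed, expected):
    """Return (statistic, p_value) for a chi-square goodness-of-fit test.

    Hypothetical helper written for illustration only.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    # Pearson statistic: sum over categories of (O - E)^2 / E
    statistic = ((observed - expected) ** 2 / expected).sum()
    # k categories with a fixed total leave k - 1 degrees of freedom
    dof = observed.size - 1
    # sf(x) is the upper-tail probability, i.e. 1 - cdf(x)
    p_value = chi2.sf(statistic, df=dof)
    return statistic, p_value


stat1, p1 = chi_square_gof([28, 22], [25, 25])
stat2, p2 = chi_square_gof([45, 5], [25, 25])
print(stat1, p1)
print(stat2, p2)
```

The p-values should agree with those returned by `chisquare` for the same counts.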

## Test of Independence

Imagine you are running a Marketing Campaign in your company.

There are 2 modes through which customers can purchase the company's products: Offline and Online.

Your goal is to run a campaign that aims at increasing the number of online purchases.

In [21]:
from scipy.stats import chi2_contingency

In [22]:
observed = [[527, 72], [206, 102]]
chi2_contingency(observed)

Chi2ContingencyResult(statistic=57.04098674049609, pvalue=4.268230756875865e-14, dof=1, expected_freq=array([[484.08710033, 114.91289967],
       [248.91289967,  59.08710033]]))

In [23]:
# Unpack the result; the third value is the degrees of freedom
chi_stat, p_val, dof, exp_freq = chi2_contingency(observed)

In [24]:
alpha = 0.05
if p_val < alpha:
  print('Reject H0')
else:
  print('Fail to Reject H0')

Reject H0
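
The expected frequencies reported by `chi2_contingency` come from the row and column totals: each expected cell is (row total × column total) / grand total. The sketch below recomputes them for the same 2×2 table. It also compares the plain Pearson statistic with SciPy's result when `correction=False`; by default SciPy applies Yates' continuity correction when the degrees of freedom equal 1, so the 57.04 shown above is the corrected value.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[527, 72], [206, 102]])

# Marginal totals, kept 2-D so that they broadcast into a full table
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Plain Pearson statistic, without any continuity correction
pearson_stat = ((observed - expected) ** 2 / expected).sum()

scipy_expected = chi2_contingency(observed)[3]
uncorrected_stat = chi2_contingency(observed, correction=False)[0]
corrected_stat = chi2_contingency(observed)[0]

print(expected)
print(pearson_stat, uncorrected_stat, corrected_stat)
```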


In [25]:
!wget --no-check-certificate 'https://drive.google.com/uc?id=12muEOrUvEtKAPVhKr4rSlrsqwjGuMJfu' -O aerofit.csv


--2024-10-05 20:17:41--  https://drive.google.com/uc?id=12muEOrUvEtKAPVhKr4rSlrsqwjGuMJfu
Resolving drive.google.com (drive.google.com)... 2404:6800:4007:82c::200e, 142.250.195.238
Connecting to drive.google.com (drive.google.com)|2404:6800:4007:82c::200e|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=12muEOrUvEtKAPVhKr4rSlrsqwjGuMJfu [following]
--2024-10-05 20:17:42--  https://drive.usercontent.google.com/download?id=12muEOrUvEtKAPVhKr4rSlrsqwjGuMJfu
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 2404:6800:4007:80a::2001, 172.217.160.129
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|2404:6800:4007:80a::2001|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7461 (7.3K) [application/octet-stream]
Saving to: ‘aerofit.csv’


2024-10-05 20:17:45 (11.7 MB/s) - ‘aerofit.csv’ saved [7461/7461]



In [26]:
df = pd.read_csv('aerofit.csv')
df

Unnamed: 0,Product,Age,Gender,Education,MaritalStatus,Usage,Fitness,Income,Miles
0,KP281,18,Male,14,Single,3,4,29562,112
1,KP281,19,Male,15,Single,2,3,31836,75
2,KP281,19,Female,14,Partnered,4,3,30699,66
3,KP281,19,Male,12,Single,3,3,32973,85
4,KP281,20,Male,13,Partnered,4,2,35247,47
...,...,...,...,...,...,...,...,...,...
175,KP781,40,Male,21,Single,6,5,83416,200
176,KP781,42,Male,18,Single,5,4,89641,200
177,KP781,45,Male,16,Single,5,5,90886,160
178,KP781,47,Male,18,Partnered,4,5,104581,120
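
A DataFrame like this can feed a test of independence once two categorical columns are cross-tabulated. The sketch below is an illustration only: it rebuilds just the nine rows visible in the preview above as a stand-in (the name `preview` is hypothetical), which is far too little data for a meaningful test. With the full data, the same calls would presumably take `df['Gender']` and `df['Product']`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Stand-in built from the rows visible in the preview, not the full dataset
preview = pd.DataFrame({
    "Product": ["KP281"] * 5 + ["KP781"] * 4,
    "Gender": ["Male", "Male", "Female", "Male", "Male",
               "Male", "Male", "Male", "Male"],
})

# Contingency table of counts: rows are Gender, columns are Product
table = pd.crosstab(preview["Gender"], preview["Product"])
print(table)

# chi2_contingency accepts the table directly as array-like input
stat, p_value, dof, expected = chi2_contingency(table)
print("dof:", dof)
```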


A marketing manager wants to determine if there is a relationship between the type of advertising (online, print, or TV) and the purchase decision (buy or not buy) of a product.

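The manager collects data from 300 customers and records their advertising exposure and purchase decisions.
What statistical test should the manager use to analyze this data?

Answer : Chi-square test of independence, since both variables (advertising type and purchase decision) are categorical.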

In [27]:
# Goodness of fit for a survey of 200 people with expected shares of 30%, 40% and 30%
observed_values = [70, 80, 50]
expected_values = [0.3*200, 0.4*200, 0.3*200]
chi_stat, p_val = chisquare(observed_values, expected_values)
p_val

0.1888756028375618

In [28]:
alpha = 0.05
if p_val < alpha:
  print('Reject H0')
else:
  print('Fail to Reject H0')

Fail to Reject H0


## Assignment problems

### Marital Status and Drinking
A national survey was conducted to obtain information on the alcohol consumption patterns of U.S. adults by marital status.
A random sample of 1772 residents, aged 18 and older, yielded the data displayed in Table below:

![Screenshot from 2024-09-30 12-23-16.png](<attachment:Screenshot from 2024-09-30 12-23-16.png>)

Test, at the 5% significance level, whether marital status and alcohol consumption are associated.

Choose the correct option below :

In [29]:
#H0: Marital status and alcohol consumption are not associated.
#Ha: Marital status and alcohol consumption are associated. 

#Chi Squared Test for Independence

from scipy.stats import chi2_contingency

observed = [[67,213,74], [411,633,129], [85,51,7], [27,60,15]]
test_statistic, p_value, dof, expected_values = chi2_contingency(observed)

print("Test statistic:", test_statistic)
print("P-value:", p_value)

alpha = 0.05

if(p_value < alpha):
  print("Reject H0 ")
else:
  print('Fail to reject H0')

Test statistic: 94.26880078578765
P-value: 3.925170647869838e-18
Reject H0 
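
Before relying on the chi-square approximation, it can help to look at the expected counts that `chi2_contingency` returns. A commonly cited rule of thumb, taken here as an assumption rather than a requirement stated in this notebook, is that every expected cell count should be at least 5. The sketch below checks that for the marital status table and confirms that the counts add up to the 1772 respondents mentioned in the problem.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[67, 213, 74], [411, 633, 129], [85, 51, 7], [27, 60, 15]])

# Sanity check: the table should account for every respondent
print("Total respondents:", observed.sum())

stat, p_value, dof, expected = chi2_contingency(observed)

# (rows - 1) * (columns - 1) degrees of freedom
print("Degrees of freedom:", dof)
print(expected.round(2))

# Rule-of-thumb check (an assumption, see the note above)
print("All expected counts >= 5:", bool(expected.min() >= 5))
```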


### Internet Use
A random sample of adults yielded the following data on age and Internet usage.

![Screenshot from 2024-09-30 12-25-07.png](<attachment:Screenshot from 2024-09-30 12-25-07.png>)

At the 1% significance level, do the data provide sufficient evidence to conclude that an association exists between age and Internet usage?

Choose the correct option below :

In [30]:
#H0: Age and Internet usage are not associated
#Ha: Age and Internet usage are associated

#chi-squared test for independence

from scipy.stats import chi2_contingency

observed = [[6,38, 31], [14, 31, 4], [50, 50, 5]]
test_statistic, p_value, dof, expected_values = chi2_contingency(observed)

print("Test statistic:", test_statistic)
print("p-value:", p_value)

alpha = 0.01  # the question specifies a 1% significance level

if(p_value < alpha):
  print("Reject H0 ")
else:
  print('Fail to reject H0')

Test statistic: 60.74604310295546
p-value: 2.0217185191724964e-12
Reject H0 


### Income and Residence
The U.S. Census Bureau compiles information on the money income of people by type of residence and publishes its finding in Current Population Reports.

Independent simple random samples of people were drawn from the following types of residence:

* Inside Principal Cities (IPC),

* Outside Principal Cities but within Metropolitan Areas (OPC), and

* Outside Metropolitan Areas (OMA),

The Census gave the following data on income levels:

![Screenshot from 2024-09-30 12-25-38.png](<attachment:Screenshot from 2024-09-30 12-25-38.png>)


At the 5% significance level, can you conclude that the type of residence is related to income level?

Choose the correct option below :

In [31]:
#H0: Income level and type of residence are not associated
#Ha: Income level and type of residence are associated

#Chi-Squared Test for Independence

from scipy.stats import chi2_contingency

observed = [[75,106,46], [106, 161, 61], [98, 183, 52], [48, 102, 14]]
test_statistic, p_value, dof, expected_values = chi2_contingency(observed)

print("Test statistic:", test_statistic)
print("p-value:", p_value)

alpha = 0.05

if(p_value < alpha):
  print("Reject H0 ")
else:
  print('Fail to reject H0')

Test statistic: 15.727554171801787
p-value: 0.015293451318673136
Reject H0 


### Observation representing sample

According to a survey conducted on car owners, it was determined that

* 60% of owners have only one car,

* 28% have two cars, and

* 12% have three or more cars.

Suppose Ram conducted his own survey within his residential society, and found that

* 73 owners have only one car,

* 38 owners have two cars, and

* 18 owners have three or more cars.

Determine whether Ram's survey supports the original one, with a significance level of 0.05.

In [32]:
#H0: Ram's survey follows the distribution of the original survey (60% / 28% / 12%)
#Ha: Ram's survey does not follow the distribution of the original survey

#Chi-Square Goodness of Fit Test

observed_values = np.array([73, 38, 18])

expected_values = np.array([0.60, 0.28, 0.12])

observed_sum = observed_values.sum()

expected_sum = observed_sum * expected_values

chi_stats, p_value = chisquare(f_obs = observed_values, f_exp=expected_sum)

print('Chisquare', chi_stats)
print('P_value', p_value)

alpha = 0.05

if(p_value < alpha):
  print("Reject H0 ")
else:
  print('Fail to reject H0')

Chisquare 0.7582133628645247
P_value 0.6844725882551137
Fail to reject H0


### Distribution of smartphone brands
A Mobile Retail store owner is interested in the distribution of popular smartphone brands among a group of 200 people.

They expect that 30% of people would prefer Brand A, 40% would prefer Brand B and 30% would prefer Brand C.

However, upon surveying the group, the results are as follows: 70 prefer Brand A, 80 prefer Brand B, and 50 prefer Brand C.

Conduct an appropriate test to see if the distribution of preferences matches the store owner's expectations at a 5% significance level.

In [33]:
#H0: the observed brand preferences follow the expected distribution (30% / 40% / 30%)
#H1: the observed brand preferences do not follow the expected distribution

#Chi-Square Goodness of Fit Test

observed_values = np.array([70,80,50])
observed_sum = observed_values.sum()

expected_values = np.array([0.30, 0.40, 0.30])
expected_sum = expected_values * observed_sum

chi_stats, p_value = chisquare(f_obs= observed_values, f_exp= expected_sum)
print('Chisquare', chi_stats)
print('P_value', p_value)

alpha = 0.05

if(p_value < alpha):
  print("Reject H0 ")
else:
  print('Fail to reject H0')

Chisquare 3.3333333333333335
P_value 0.1888756028375618
Fail to reject H0


### Dof Politics

In a social science survey, researchers investigate the relationship between two categorical variables.

Those variables, along with their categories are:

Variable A: PoliticalOpinions
* Strongly Agree,
* Agree,
* Disagree,
* Strongly Disagree

Variable B: DemographicInfo (Age Group)
* 18-25,
* 26-35,
* 36-50

The goal is to determine if there is a significant association between the opinions on the political issue and demographic characteristics, specifically age groups.

In this scenario, what are the degrees of freedom for the chi-square test of independence?

Answer : 6, since degrees of freedom = (rows - 1) × (columns - 1) = (4 - 1) × (3 - 1)
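
For a contingency table with r rows and c columns, the test of independence has (r - 1) × (c - 1) degrees of freedom. The sketch below applies that formula to the 4 opinion categories and 3 age groups, then cross-checks it against `chi2_contingency` using a table of ones. That table is a placeholder with no real data behind it; only its shape matters for the degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

n_opinions = 4      # Strongly Agree, Agree, Disagree, Strongly Disagree
n_age_groups = 3    # 18-25, 26-35, 36-50

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof_by_formula = (n_opinions - 1) * (n_age_groups - 1)

# Placeholder counts, used only so that SciPy can report the dof
placeholder = np.ones((n_opinions, n_age_groups))
dof_from_scipy = chi2_contingency(placeholder)[2]

print(dof_by_formula, dof_from_scipy)
```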

### Time spent on website
Suppose you are interested in the distribution of time spent on a website by its users. You expect that:

* 20% of users spend less than 5 minutes,
* 50% spend between 5 and 10 minutes, and
* 30% spend more than 10 minutes.

After collecting data from 200 users, you find that

* 30 users spent less than 5 minutes,
* 85 users spent between 5 and 10 minutes, and
* 85 users spent more than 10 minutes.

Conduct an appropriate test to see if the distribution of browsing times matches your expectations at a 5% significance level.

In [34]:
#H0: the observed browsing times follow the expected distribution (20% / 50% / 30%)
#H1: the observed browsing times do not follow the expected distribution

#Chi-Square Goodness of Fit Test

observed_values = np.array([30,85,85])
observed_sum = observed_values.sum()

expected_values = np.array([0.20, 0.50, 0.30])
expected_sum = observed_sum * expected_values

chi_stats, p_value = chisquare(f_obs= observed_values, f_exp= expected_sum)
print('Chisquare', chi_stats)
print('P_value', p_value)

alpha = 0.05

if(p_value < alpha):
  print("Reject H0 ")
else:
  print('Fail to reject H0')

Chisquare 15.166666666666666
P_value 0.0005088621855732918
Reject H0 
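
When the test rejects H0, the per-category terms (O - E)² / E show which category contributes most to the statistic. The sketch below breaks the 15.17 above into its three parts; the category labels are written out here for readability and are not variables from the notebook.

```python
import numpy as np

observed = np.array([30, 85, 85])
expected = np.array([0.20, 0.50, 0.30]) * observed.sum()

# Each category's share of the chi-square statistic
contributions = (observed - expected) ** 2 / expected

labels = ["< 5 min", "5-10 min", "> 10 min"]
for label, obs, exp, contrib in zip(labels, observed, expected, contributions):
    print(f"{label}: observed={obs}, expected={exp:.0f}, contribution={contrib:.3f}")

# The contributions add up to the test statistic
print("Chi-square statistic:", contributions.sum())
```

The largest term belongs to the "more than 10 minutes" group, where 85 users were observed against 60 expected.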


### Help to choose the right test

A telecom company surveyed smartphone owners in a certain town 5 years ago and found that 73% of the population owned a smartphone. It has used this figure for its business decisions ever since.

A new marketing manager has now joined, and he believes this value is no longer valid. He conducts a survey of 500 people and finds that 420 of them say they own a smartphone.

Which statistical test would you use to compare these two survey data?


Answer : Test of proportions, z-test
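
A minimal sketch of that test, assuming a two-sided alternative (the manager only claims the old value is no longer valid) and the usual normal approximation with the standard error computed under H0. The variable names are illustrative.

```python
import math
from scipy.stats import norm

p0 = 0.73          # proportion from the survey 5 years ago (H0)
n = 500            # size of the new survey
successes = 420    # respondents who own a smartphone

p_hat = successes / n

# Standard error of the sample proportion, computed under H0
se = math.sqrt(p0 * (1 - p0) / n)

# z statistic and two-sided p-value
z = (p_hat - p0) / se
p_value = 2 * norm.sf(abs(z))

print("Sample proportion:", p_hat)
print("z statistic:", z)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```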