# Lab | Goodness of Fit and Independence Tests

## Question 1
A researcher gathers information on the physical activity patterns of fifth-grade children at a public primary school. He defines three categories of physical activity (Low, Medium, High). He also asks about regular consumption of sugary drinks at school and defines two categories (Yes = consumed, No = not consumed). We would like to test, at the 5% significance level, whether there is an association between physical activity patterns and sugary drink consumption among the children of this school. The results are in the following table:

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table4.png)

In [11]:
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    'Yes': [32, 14, 6],
    'No': [12, 22, 18]
}, index=['Low', 'Medium', 'High'])

data

|        | Yes | No |
|--------|-----|----|
| Low    | 32  | 12 |
| Medium | 14  | 22 |
| High   | 6   | 18 |


In [3]:
# chi-squared test
chi2, p, dof, expected = chi2_contingency(data)

print(f"Chi-squared Statistic: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

Chi-squared Statistic: 16.868686868686872
p-value: 0.0002172757151178007
Degrees of Freedom: 2
Expected Frequencies:
[[22. 22.]
 [18. 18.]
 [12. 12.]]
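
The expected frequencies and the statistic above can be reproduced by hand, which makes it clearer what `chi2_contingency` computes. The following is a minimal sketch using the same counts as the `data` frame; the names `expected_manual`, `chi2_manual`, and `dof_manual` are introduced here for illustration only. Each expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells.

```python
import numpy as np

# Same counts as the `data` DataFrame above (rows: Low, Medium, High; columns: Yes, No)
observed = np.array([[32, 12],
                     [14, 22],
                     [6, 18]])

# Marginal totals, kept 2-D so they broadcast into a full table
row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Pearson chi-squared statistic and its degrees of freedom
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected_manual)
print(chi2_manual, dof_manual)
```

Because the table has more than one degree of freedom, `chi2_contingency` applies no continuity correction here, so the manual statistic should agree with the one reported above.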


In [4]:
if p < 0.05:
    print("Reject the null hypothesis: There is a significant association between physical activity and sugary drink consumption.")
else:
    print("Fail to reject the null hypothesis: No significant association between physical activity and sugary drink consumption.")

Reject the null hypothesis: There is a significant association between physical activity and sugary drink consumption.
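
The same decision can be reached by comparing the statistic with a critical value instead of comparing the p-value with 0.05. This is a small sketch of that alternative; `critical_value` and `reject` are illustrative names, and the statistic is copied from the output above rather than recomputed.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 2
chi2_stat = 16.868686868686872  # statistic reported by chi2_contingency above

# Critical value: the point that leaves probability alpha in the upper tail
critical_value = chi2.ppf(1 - alpha, df=dof)

# Reject H0 (independence) when the statistic exceeds the critical value
reject = chi2_stat > critical_value

print(f"Critical value: {critical_value:.3f}")
print(f"Reject H0: {reject}")
```

For 2 degrees of freedom the critical value at the 5% level is about 5.99, so both approaches lead to the same conclusion.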


## [OPTIONAL] Question 2
The following table indicates the number of 6-point scores in an American rugby match in the 1979 season.

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table1.png)

Based on these results, we create a Poisson distribution with parameter λ set to the sample mean, λ = 2.435. At the 0.05 significance level, is there reason to believe that the number of scores follows a Poisson distribution?

Check [here](https://www.geeksforgeeks.org/how-to-create-a-poisson-probability-mass-function-plot-in-python/) for how to create a Poisson distribution and how to calculate the expected frequencies using the probability mass function (PMF).

A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.
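
To make the PMF concrete, the sketch below evaluates the Poisson formula P(X = k) = e^(−λ) λ^k / k! directly and compares it with `scipy.stats.poisson.pmf`. The variable names `lam`, `pmf_manual`, and `pmf_scipy` are chosen here for illustration.

```python
import math

import numpy as np
from scipy.stats import poisson

lam = 2.435  # the Poisson parameter given in the question

# Direct evaluation of exp(-lam) * lam**k / k! for k = 0..7
pmf_manual = np.array(
    [math.exp(-lam) * lam**k / math.factorial(k) for k in range(8)]
)

# The same probabilities from SciPy
pmf_scipy = poisson.pmf(np.arange(8), mu=lam)

print(np.allclose(pmf_manual, pmf_scipy))
```

Multiplying these probabilities by the total number of observations gives the expected frequency for each value of k.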

In [5]:
import numpy as np
import pandas as pd
from scipy.stats import poisson, chisquare

# Observed data
observed_freq = np.array([35, 99, 104, 63, 41, 10, 5, 1])
k_values = np.arange(8)  # 0 through 7
lambda_ = 2.435
total_obs = observed_freq.sum()
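
Since λ is described as the sample mean, a quick sanity check is to recompute the weighted mean from the counts and compare it with 2.435. This is a minimal sketch; `sample_mean` is an illustrative name, and the counts are the ones entered in the cell above.

```python
import numpy as np

observed_freq = np.array([35, 99, 104, 63, 41, 10, 5, 1])
k_values = np.arange(8)  # 0 through 7
lambda_ = 2.435

# Weighted mean: sum of k * count, divided by the total number of observations
sample_mean = (k_values * observed_freq).sum() / observed_freq.sum()

print(f"Sample mean from counts: {sample_mean:.3f}")
print(f"Stated lambda:           {lambda_}")
```

With the counts as entered, the mean comes out near 2.09 rather than 2.435, so it may be worth re-checking the counts against the table before interpreting the test.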

In [7]:
# Poisson PMF for each k value
poisson_probs = poisson.pmf(k_values, mu=lambda_)

# Expected frequencies = probability * total observations
expected_freq = poisson_probs * total_obs

# Optional: round expected frequencies for readability
expected_freq_rounded = np.round(expected_freq, 2)

# Display in a DataFrame for clarity
df = pd.DataFrame({
    "k": k_values,
    "Observed": observed_freq,
    "Expected": expected_freq_rounded
})
df

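|   | k | Observed | Expected |
|---|---|----------|----------|
| 0 | 0 | 35       | 31.36    |
| 1 | 1 | 99       | 76.36    |
| 2 | 2 | 104      | 92.97    |
| 3 | 3 | 63       | 75.46    |
| 4 | 4 | 41       | 45.94    |
| 5 | 5 | 10       | 22.37    |
| 6 | 6 | 5        | 9.08     |
| 7 | 7 | 1        | 3.16     |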


In [9]:
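# Merge k=6 and k=7 into one bin so that every expected count is at least 5
# (the expected count for k=7 alone is about 3.16)
observed_adj = np.append(observed_freq[:-2], observed_freq[-2:].sum())
expected_adj = np.append(expected_freq[:-2], expected_freq[-2:].sum())

# Rescale the expected counts to the observed total: the PMF over k = 0..7
# leaves out the tail P(X >= 8), and chisquare needs both arrays to have the same sum
expected_adj *= observed_adj.sum() / expected_adj.sum()

# Chi-squared goodness-of-fit test (default ddof=0, i.e. 7 - 1 = 6 degrees of freedom)
chi_stat, p_value = chisquare(f_obs=observed_adj, f_exp=expected_adj)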

print(f"Chi-squared Statistic: {chi_stat}")
print(f"p-value: {p_value}")

Chi-squared Statistic: 20.969683346796458
p-value: 0.0018577672092151857
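
The test above uses the default 6 degrees of freedom. When λ is estimated from the same sample, as the question states, a common adjustment is to subtract one further degree of freedom, and to treat the last bin as "6 or more" using the Poisson tail probability instead of rescaling. The sketch below shows one way to do this under those assumptions; `observed_binned`, `expected_binned`, `chi_stat_adj`, and `p_value_adj` are illustrative names, and this is an alternative rather than the method used above.

```python
import numpy as np
from scipy.stats import poisson, chisquare

observed_freq = np.array([35, 99, 104, 63, 41, 10, 5, 1])
lambda_ = 2.435
n = observed_freq.sum()

# Bins k = 0..5, plus one open-ended bin for k >= 6
observed_binned = np.append(observed_freq[:6], observed_freq[6:].sum())

# P(X = k) for k = 0..5, and P(X >= 6) = P(X > 5) via the survival function
probs = np.append(poisson.pmf(np.arange(6), mu=lambda_), poisson.sf(5, mu=lambda_))
expected_binned = probs * n  # probabilities sum to 1, so this sums to n

# ddof=1 removes one degree of freedom for the estimated parameter:
# dof = 7 bins - 1 - 1 = 5
chi_stat_adj, p_value_adj = chisquare(observed_binned, expected_binned, ddof=1)

print(f"Chi-squared Statistic: {chi_stat_adj}")
print(f"p-value (5 dof): {p_value_adj}")
```

Fewer degrees of freedom give a smaller p-value for the same statistic, so this adjustment makes the test somewhat stricter than the default.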


In [10]:
alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis: The data does not follow a Poisson distribution.")
else:
    print("Fail to reject the null hypothesis: The data is consistent with a Poisson distribution.")

Reject the null hypothesis: The data does not follow a Poisson distribution.
