# Lab | Goodness of Fit and Independence Tests

## Question 1
A researcher gathers information about the physical activity patterns of fifth-grade children at a public primary school. He defines three categories of physical activity (Low, Medium, High). He also asks about regular consumption of sugary drinks at school and defines two categories (Yes = consumed, No = not consumed). We would like to evaluate, at the 5% significance level, whether there is an association between physical activity pattern and sugary drink consumption among the children of this school. The results are in the following table:

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table4.png)
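
For reference, the test that applies here can be written out as follows. The notation is standard but not part of the original lab: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, and $n$ is the grand total. The null hypothesis is that physical activity level and sugary drink consumption are independent.

$$
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{df} = (3 - 1)(2 - 1) = 2
$$

Independence is rejected at the 5% level when the p-value of $\chi^2$ under a chi-square distribution with 2 degrees of freedom is below 0.05.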

In [1]:
import scipy.stats as stats
import numpy as np

# Observed values from the contingency table
observed = np.array([
    [32, 12],
    [14, 22],
    [6, 9]
])

# Perform Chi-square test of independence
chi2_stat, p_val, dof, expected = stats.chi2_contingency(observed)

chi2_stat, p_val, dof, expected


(np.float64(10.712198008709638),
 np.float64(0.004719280137040844),
 2,
 array([[24.08421053, 19.91578947],
        [19.70526316, 16.29473684],
        [ 8.21052632,  6.78947368]]))
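
The result above gives a statistic of about 10.71 with 2 degrees of freedom and a p-value of about 0.0047. This is below 0.05, so the hypothesis of independence is rejected: physical activity pattern and sugary drink consumption appear to be associated in this sample. As a cross-check, the sketch below recomputes the expected counts and the statistic by hand from the marginal totals. The variable names (`row_totals`, `expected_manual`, and so on) are illustrative and not part of the original lab.

```python
import numpy as np
from scipy import stats

# Same observed table as in the cell above (rows: Low, Medium, High activity)
observed = np.array([[32, 12], [14, 22], [6, 9]])

# Marginal totals, kept 2-D so they broadcast into a full table
row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected counts under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Pearson chi-square statistic and its degrees of freedom
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

alpha = 0.05
reject_independence = p_manual < alpha
print(chi2_manual, p_manual, dof_manual, reject_independence)
```

The hand computation should agree with `stats.chi2_contingency`, since SciPy applies the Yates continuity correction only when there is a single degree of freedom.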

## [OPTIONAL] Question 2
The following table shows the number of 6-point scores per match in American rugby during the 1979 season.

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table1.png)

Based on these results, we fit a Poisson distribution whose parameter is the sample mean, λ = 2.435. At the 0.05 significance level, is there any reason to believe that the number of scores is a Poisson variable?

Check [here](https://www.geeksforgeeks.org/how-to-create-a-poisson-probability-mass-function-plot-in-python/) for how to create a Poisson distribution and how to calculate the expected observations using the probability mass function (pmf).

A Poisson distribution is a discrete probability distribution. It gives the probability that an event happens a certain number of times (k) within a given interval of time or space. It has a single parameter, λ (lambda), which is the mean number of events.
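
As a minimal sketch of how the pmf turns into expected counts, the snippet below evaluates P(X = k) = λ^k e^(−λ) / k! by hand, compares it with `scipy.stats.poisson.pmf`, and scales the probabilities by the number of observations. The value `n_matches = 448` is the total of the observed counts used in this lab's solution. The names `lam`, `pmf_formula` and `expected_counts` are illustrative.

```python
import math
import numpy as np
from scipy.stats import poisson

lam = 2.435        # sample mean quoted in the question
n_matches = 448    # total of the observed counts used in this lab (assumption)

# P(X = k) for k = 0..6, once from the formula and once from SciPy
pmf_formula = np.array(
    [lam**k * math.exp(-lam) / math.factorial(k) for k in range(7)]
)
pmf_scipy = poisson.pmf(np.arange(7), mu=lam)

# P(X >= 7) is the survival function evaluated at 6, i.e. P(X > 6)
tail_prob = poisson.sf(6, mu=lam)

# Probabilities for the bins 0, 1, ..., 6, "7 or more", then expected counts
probs = np.append(pmf_scipy, tail_prob)
expected_counts = n_matches * probs
print(np.round(expected_counts, 2))
```

Using the survival function for the last bin keeps the probabilities summing to 1, so the expected counts add up to the sample size, which `scipy.stats.chisquare` requires.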

In [21]:
import numpy as np
from scipy.stats import poisson, chisquare

# Observed frequencies from the image
observed = np.array([35, 99, 104, 110, 62, 25, 10, 3])
total = np.sum(observed)

# Calculate lambda (mean number of scores)
score_values = np.array([0, 1, 2, 3, 4, 5, 6, 7])  # using 7 for "7 or more"
mean_lambda = np.sum(observed * score_values) / total

# Calculate expected probabilities for k = 0 to 6
expected_probs = poisson.pmf(k=np.arange(0, 7), mu=mean_lambda)
# Add probability for 7 or more
expected_probs = np.append(expected_probs, 1 - expected_probs.sum())

# Expected frequencies
expected = expected_probs * total

# Merge the last two bins if the final expected frequency is below 5
# (a common rule of thumb for the chi-square approximation)
if expected[-1] < 5:
    expected[-2] += expected[-1]
    observed[-2] += observed[-1]
    expected = expected[:-1]
    observed = observed[:-1]

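# Perform chi-square goodness-of-fit test.
# Note: chisquare uses k - 1 degrees of freedom by default; because lambda
# was estimated from the same data, passing ddof=1 would be the stricter choice.
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)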

chi2_stat, p_value, mean_lambda


(np.float64(6.490217386995407),
 np.float64(0.4838104817385577),
 np.float64(2.435267857142857))
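
With a p-value of about 0.48, well above 0.05, the test gives no reason to reject the hypothesis that the number of scores follows a Poisson distribution. One caveat: `chisquare` uses k − 1 degrees of freedom by default, while λ was estimated from the same data, which conventionally costs one more degree of freedom. The sketch below, which is an adjustment and not part of the original solution, reruns the test with `ddof=1` for comparison. The names `lam_hat`, `p_default` and `p_adj` are illustrative.

```python
import numpy as np
from scipy.stats import poisson, chisquare

# Observed frequencies for 0, 1, ..., 6 and "7 or more" scores
observed = np.array([35, 99, 104, 110, 62, 25, 10, 3])
total = observed.sum()

# Estimate lambda from the data, treating "7 or more" as exactly 7
lam_hat = (observed * np.arange(8)).sum() / total

# Expected frequencies: pmf for k = 0..6 plus the tail P(X >= 7)
probs = np.append(poisson.pmf(np.arange(7), mu=lam_hat),
                  poisson.sf(6, mu=lam_hat))
expected = probs * total

# Default test: degrees of freedom = k - 1 = 7
chi2_default, p_default = chisquare(observed, f_exp=expected)

# Adjusted test: one estimated parameter, so degrees of freedom = k - 1 - 1 = 6
chi2_adj, p_adj = chisquare(observed, f_exp=expected, ddof=1)

print(chi2_default, p_default)
print(chi2_adj, p_adj)
```

The statistic is the same in both calls; only the reference distribution changes. For this data the adjusted p-value is smaller but still above 0.05, so the conclusion does not change.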