# Chi-Squared Goodness-of-Fit Test

<div class="alert alert-info">Learning Goals:</div>

1. Understand the Chi-Squared Goodness-of-Fit Test and its purpose in statistical analysis.
2. Learn how to perform the Chi-Squared Goodness-of-Fit Test to assess the fit between observed and expected frequencies.
3. Explore the assumptions and limitations of the Chi-Squared Goodness-of-Fit Test.
4. Utilize Python code examples to conduct the Chi-Squared Goodness-of-Fit Test and interpret the results.

## Introduction
The Chi-Squared Goodness-of-Fit Test is a statistical test used to determine if an observed sample fits a specified theoretical distribution. It compares the observed frequencies with the expected frequencies based on the hypothesized distribution.

### Applications of the Chi-Squared Goodness-of-Fit Test
- Testing Hypotheses: Assessing whether observed data follows a specific distribution.
- Model Evaluation: Checking if a theoretical model adequately represents observed data.

### Performing the Chi-Squared Goodness-of-Fit Test
Chi-Squared Statistic

The Chi-Squared statistic measures the discrepancy between observed and expected frequencies. It is calculated using the formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

<div class="alert alert-success"><b>O</b> represents the observed frequencies, <b>E</b> represents the expected frequencies based on the hypothesized distribution, and the sum is taken over all categories or bins.</div>
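As a minimal sketch, the formula can be evaluated by hand in plain Python. The helper name `chi_squared_statistic` and the die-roll counts are hypothetical, chosen only for illustration (60 rolls of a die assumed to be fair, so 10 rolls are expected per face).

```python
def chi_squared_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, 10 expected per face if the die is fair
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

statistic = chi_squared_statistic(observed, expected)
print("Chi-Squared Statistic:", round(statistic, 2))  # prints 4.2
```

Each category contributes its own squared deviation scaled by the expected count, so a deviation of the same size matters more in a category where few observations are expected.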

### Degrees of Freedom
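The degrees of freedom (df) for the Chi-Squared Goodness-of-Fit Test depend on the number of categories and on the constraints imposed by the test. For a test of fit to a fully specified distribution, df = number of categories - 1. Each parameter of the hypothesized distribution that is estimated from the data reduces df by one more.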

### P-value and Interpretation
The p-value associated with the Chi-Squared Goodness-of-Fit Test is the probability of obtaining a Chi-Squared statistic at least as large as the one observed if the null hypothesis (the data follow the hypothesized distribution) is true. A smaller p-value indicates stronger evidence against the null hypothesis.
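The link between the statistic, the degrees of freedom, and the p-value can be sketched with `scipy.stats.chi2`. The statistic (4.2) and the number of categories (6) below are hypothetical values used only for illustration.

```python
from scipy.stats import chi2

n_categories = 6   # hypothetical number of categories
statistic = 4.2    # hypothetical Chi-Squared statistic

# Fully specified distribution: df = number of categories - 1
df = n_categories - 1

# The p-value is the upper-tail probability of the Chi-Squared distribution
p_value = chi2.sf(statistic, df)

# Smallest statistic that would be significant at the 0.05 level
critical_value = chi2.ppf(0.95, df)

print("df:", df)
print("p-value:", round(p_value, 3))
print("critical value at alpha = 0.05:", round(critical_value, 2))
```

Comparing the statistic to the critical value and comparing the p-value to the significance level are two views of the same decision.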

### Assumptions and Limitations of the Chi-Squared Goodness-of-Fit Test

Assumptions
- Independence: The observations within each category or bin are assumed to be independent.
- Expected Frequencies: The expected frequencies should be sufficiently large (typically at least 5) to satisfy the asymptotic properties of the Chi-Squared distribution.

Limitations
- Sample Size: The Chi-Squared Goodness-of-Fit Test may not be reliable with small sample sizes.
- Cell Frequency: If any category has an expected frequency below a certain threshold (e.g., 5), the Chi-Squared approximation, and therefore the p-value, may be unreliable.
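The expected-frequency rule of thumb is easy to check before running the test. The sketch below is only an illustration: the helper name `small_expected_cells` and the expected counts are hypothetical, and the threshold of 5 is the conventional value mentioned above rather than a strict requirement.

```python
def small_expected_cells(expected, threshold=5):
    """Return the indices of categories whose expected count is below the threshold."""
    return [i for i, e in enumerate(expected) if e < threshold]


# Hypothetical expected counts for four categories
expected = [30.0, 12.5, 4.0, 3.5]

flagged = small_expected_cells(expected)
if flagged:
    print("Expected counts below 5 in categories:", flagged)
else:
    print("All expected counts are at least 5.")
```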

## Performing the Chi-Squared Goodness-of-Fit Test

```python
from scipy.stats import chisquare

observed = [10, 15, 25, 20]  # Observed counts in each category
expected = [12, 15, 20, 23]  # Expected counts under the hypothesized distribution

# chisquare returns the test statistic and the p-value (df = number of categories - 1)
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print("Chi-Squared Statistic:", round(chi2, 2))
print("p-value:", round(p_value, 2))
```

```
Chi-Squared Statistic: 1.97
p-value: 0.58
```


### Interpreting the Results
Compare the p-value to a predetermined significance level (e.g., 0.05) to determine the statistical significance. If the p-value is below the significance level, we reject the null hypothesis and conclude that the observed data does not fit the hypothesized distribution.
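The decision rule can be written out explicitly. This is a minimal sketch with hypothetical counts (100 observations, four categories assumed equally likely under the null hypothesis); the helper name `decide` is introduced here for illustration only.

```python
from scipy.stats import chisquare


def decide(p_value, alpha=0.05):
    """Return the test decision for a given significance level."""
    return "reject" if p_value < alpha else "fail to reject"


# Hypothetical data: 100 observations, 25 expected in each of four categories
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
decision = decide(p_value)

print("Chi-Squared Statistic:", round(statistic, 2))
print("Decision at alpha = 0.05:", decision, "the null hypothesis")
```

Failing to reject the null hypothesis does not prove that the data follow the hypothesized distribution; it only means the data show no significant departure from it.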

## Chi-squared in biology

Gregor Mendel developed the principle of independent assortment, which 
states that the alleles of different genes segregate independently into gametes. 
A dihybrid cross then produces offspring with a phenotypic ratio of **9:3:3:1**. 
For example, consider a cross between two BbEe flies (BbEe x BbEe). The `B` gene 
codes for body color and the `E` gene codes for eye color. The dominant body color 
is brown and the recessive body color is black; the dominant eye color is red 
and the recessive eye color is brown. The table below shows all 16 possible 
combinations of parental gametes, which give a 9:3:3:1 ratio of offspring phenotypes.

|           |     BE      |     Be      |     bE      |     be      |
|-----------|-------------|-------------|-------------|-------------|
|     BE    |     BEBE    |     BEBe    |     BEbE    |     BEbe    |
|     Be    |     BeBE    |     BeBe    |     BebE    |     Bebe    |
|     bE    |     bEBE    |     bEBe    |     bEbE    |     bEbe    |
|     be    |     beBE    |     beBe    |     bebE    |     bebe    |
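The 9:3:3:1 ratio can be checked by enumerating the Punnett square directly. The following is a minimal sketch in plain Python; the `phenotype` helper is a name chosen for this illustration, and it assumes complete dominance of `B` (brown body) and `E` (red eyes) as described above.

```python
from collections import Counter
from itertools import product

# The four gametes a BbEe parent can produce
gametes = ["BE", "Be", "bE", "be"]


def phenotype(gamete_1, gamete_2):
    """Return the 'body, eyes' phenotype of the offspring from two gametes."""
    body = "Brown" if "B" in (gamete_1[0], gamete_2[0]) else "Black"
    eyes = "red" if "E" in (gamete_1[1], gamete_2[1]) else "brown"
    return f"{body}, {eyes}"


# Count the phenotypes over all 16 cells of the Punnett square
counts = Counter(phenotype(g1, g2) for g1, g2 in product(gametes, repeat=2))

for name, count in counts.most_common():
    print(f"{name}: {count}/16")
```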

However, we now know that some genes are linked, meaning that their alleles 
tend to be inherited together rather than assorting independently. We can test 
whether two genes are linked using the chi-square goodness-of-fit test. 


Let's say we take flies that are heterozygous for both the B and E genes. 
We perform dihybrid crosses and obtain `100` offspring with the following phenotypes:

|     Phenotype       |     # individuals    |
|---------------------|----------------------|
|     Brown, red      |     50               |
|     Brown, brown    |     20               |
|     Black, red      |     20               |
|     Black, brown    |     10               |
|                     |     Total = 100      |

We can use the chi-square test to see if these genes assort independently. 
If we expect a 9:3:3:1 ratio, we can calculate the expected number of offspring 
with each phenotype:

|     Phenotype       |     # individuals         |
|---------------------|---------------------------|
|     Brown, red      |     9/16 * 100 = 56.25    |
|     Brown, brown    |     3/16 * 100 = 18.75    |
|     Black, red      |     3/16 * 100 = 18.75    |
|     Black, brown    |     1/16 * 100 = 6.25     |
|                     |     Total = 100           |
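The expected counts in the table can be derived from the ratio in a few lines. This is a small sketch; the variable names are chosen for illustration.

```python
ratio = [9, 3, 3, 1]   # Mendelian phenotypic ratio under independent assortment
n_offspring = 100      # total number of offspring scored

# Scale each part of the ratio to the total number of offspring
expected = [part / sum(ratio) * n_offspring for part in ratio]

print(expected)  # prints [56.25, 18.75, 18.75, 6.25]
```

Deriving the expected counts from the observed total keeps the two sets of counts on the same scale, which the test requires.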

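```python
from scipy.stats import chisquare

observed = [50, 20, 20, 10]  # Observed offspring counts for each phenotype
expected = [56.25, 18.75, 18.75, 6.25]  # Expected counts under a 9:3:3:1 ratio

# Four phenotype classes, so df = 4 - 1 = 3
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print("Chi-Squared Statistic:", round(chi2, 2))
print("p-value:", round(p_value, 2))
```

```
Chi-Squared Statistic: 3.11
p-value: 0.37
```

The p-value (0.37) is above the 0.05 significance level, so we fail to reject the null hypothesis: the observed counts are consistent with a 9:3:3:1 ratio, and these data provide no evidence that the B and E genes are linked.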


## End of chapter question

Let's suppose that in aardvarks, front foot hair length is controlled by 
a single gene with two alleles: `A`, the dominant allele, and `a`, the 
recessive allele. Individuals with at least one `A` allele have long front 
foot hair, while individuals with two `a` alleles have shorter front foot hair.

<div style="display:flex; justify-content:center;">
    <img src="../images/aardvark.jpg" alt="Image" width="400" height="300" style="margin-left: 10px;">
</div>

We take a random sample of 230 aardvarks in a national park in Africa and genotype
them. We find that 43 are `AA`, 32 are `Aa`, and 10 are `aa`. If having the short hair allele does not impact fitness (survival and reproduction), then the numbers of individuals with 0, 1, or 2 copies of the `a` allele should follow a 
binomial distribution.

Test this using the chi-square goodness of fit test.