# Goodness of Fit
<div style="font-size:16px; line-height:25pt">
A goodness-of-fit test checks whether sample data <b>fits</b> a hypothesized distribution for the population. In other words, it tells you whether your sample is consistent with the data you would expect to find if the population followed that distribution.
<br>A measure of goodness of fit typically <b>sums</b> the discrepancies (a kind of difference) between the <b>observed</b> values and the values <b>expected</b> under the model.
<br>Several <b>tests, each with its own measure of fit,</b> help us assess whether a given distribution fits a data set:
<br>1. Chi-square.
<br>2. Kolmogorov-Smirnov.
<br>3. Anderson-Darling.
<br>4. Shapiro-Wilk.
</div>
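As a rough illustration, each of the tests listed above has a counterpart in `scipy.stats`. The sketch below applies them to a synthetic normal sample; the sample, the seed, the bin cut points, and all variable names are assumptions made for this example only.

```python
import numpy as np
import scipy.stats as st

# Synthetic data (an assumption for illustration): 200 draws from N(0, 1)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# 1. Chi-square: compare binned counts with the counts expected under N(0, 1)
cuts = [-1.0, 0.0, 1.0]
observed = np.bincount(np.digitize(sample, cuts), minlength=4)
probs = np.diff(st.norm.cdf([-np.inf, *cuts, np.inf]))
chi2_res = st.chisquare(observed, probs * sample.size)

# 2. Kolmogorov-Smirnov against a fully specified N(0, 1)
ks_res = st.kstest(sample, "norm")

# 3. Anderson-Darling test for normality
ad_res = st.anderson(sample, dist="norm")

# 4. Shapiro-Wilk test of normality
sw_res = st.shapiro(sample)

print(chi2_res.pvalue, ks_res.pvalue, ad_res.statistic, sw_res.pvalue)
```

Note that the chi-square test works on category counts, while the other three work on the raw continuous sample, which is why only the first needs binning.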

## Example of Chi-Squared goodness-of-fit test

Question: 256 visual artists were surveyed to find out their zodiac signs. Are the artists' zodiac signs equally distributed?
The observed counts are loaded into a DataFrame below.
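For this question, the hypotheses and the test statistic can be written out as follows. The symbols $p_i$, $O_i$, $E_i$, $k$ and $n$ are notation introduced here for illustration; the values $k = 12$ signs and $n = 256$ artists come from the question.

$$
H_0:\ p_1 = p_2 = \dots = p_{12} = \tfrac{1}{12}
\qquad
H_1:\ \text{at least one } p_i \neq \tfrac{1}{12}
$$

$$
E_i = \frac{n}{k} = \frac{256}{12} \approx 21.33,
\qquad
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
\text{df} = k - 1 = 11
$$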


In [72]:
import pandas as pd
zodiac_signs_list = ['Aries','Taurus','Gemini','Cancer','Leo','Virgo','Libra','Scorpio','Sagittarius','Capricorn','Aquarius','Pisces']
observed_values_list = [29,24,22,19,21,18,19,20,23,18,20,23]

question_table=pd.DataFrame(zodiac_signs_list,columns=['Zodiac signs'])
question_table['Observed values']=observed_values_list
question_table

Unnamed: 0,Zodiac signs,Observed values
0,Aries,29
1,Taurus,24
2,Gemini,22
3,Cancer,19
4,Leo,21
5,Virgo,18
6,Libra,19
7,Scorpio,20
8,Sagittarius,23
9,Capricorn,18
10,Aquarius,20
11,Pisces,23


In [73]:
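# Under H0 every sign is equally likely, so each expected count is the total divided by the number of signs
Expect = sum(observed_values_list)/len(observed_values_list)
Expected_value_list = [Expect] * len(observed_values_list)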

question_table['Expected Value'] = Expected_value_list
question_table

Unnamed: 0,Zodiac signs,Observed values,Expected Value
0,Aries,29,21.333333
1,Taurus,24,21.333333
2,Gemini,22,21.333333
3,Cancer,19,21.333333
4,Leo,21,21.333333
5,Virgo,18,21.333333
6,Libra,19,21.333333
7,Scorpio,20,21.333333
8,Sagittarius,23,21.333333
9,Capricorn,18,21.333333
10,Aquarius,20,21.333333
11,Pisces,23,21.333333


In [74]:
O = observed_values_list
E = Expected_value_list
# Per-category contribution to the chi-square statistic: (O - E)^2 / E
Chi = [(O[i]-E[i])**2/E[i] for i in range(len(O))]

question_table['Chi values'] = Chi
question_table 

Unnamed: 0,Zodiac signs,Observed values,Expected Value,Chi values
0,Aries,29,21.333333,2.755208
1,Taurus,24,21.333333,0.333333
2,Gemini,22,21.333333,0.020833
3,Cancer,19,21.333333,0.255208
4,Leo,21,21.333333,0.005208
5,Virgo,18,21.333333,0.520833
6,Libra,19,21.333333,0.255208
7,Scorpio,20,21.333333,0.083333
8,Sagittarius,23,21.333333,0.130208
9,Capricorn,18,21.333333,0.520833
10,Aquarius,20,21.333333,0.083333
11,Pisces,23,21.333333,0.130208
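The per-category contributions computed above with a list comprehension could also be sketched with element-wise pandas arithmetic. This is only an alternative formulation; the variable names below are assumptions and do not appear in the notebook.

```python
import pandas as pd

observed = pd.Series([29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23])
# Equal expected count per sign under the null hypothesis
expected = pd.Series(observed.sum() / len(observed), index=observed.index)

# Per-category contribution (O - E)^2 / E, computed element-wise
contributions = (observed - expected) ** 2 / expected
chi_square_statistic = contributions.sum()
print(round(chi_square_statistic, 5))  # 5.09375
```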


In [75]:
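import scipy.stats as st

# Summing the per-category contributions gives the chi-square statistic
print(sum(Chi))
# scipy returns the same statistic together with its p-value (df = 12 - 1 = 11)
st.chisquare(O,E)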

5.093750000000001


Power_divergenceResult(statistic=5.093750000000001, pvalue=0.9265413913115148)
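To see where this p-value comes from, the sketch below recomputes it from the upper tail of a chi-square distribution with 11 degrees of freedom, and also shows the equivalent critical-value check. The 5% level used for the critical value is an assumption chosen for illustration.

```python
import scipy.stats as st

statistic = 5.09375      # chi-square statistic from the zodiac example
df = 12 - 1              # number of categories minus one

# Upper-tail probability of a chi-square(11) variable beyond the statistic
p_value = st.chi2.sf(statistic, df)

# Equivalent check: compare the statistic with the 5% critical value
critical_value = st.chi2.ppf(1 - 0.05, df)

print(p_value)
print(statistic > critical_value)  # False, so H0 is not rejected at 5%
```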

In [77]:
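# The p-value is P(X >= statistic) for X ~ chi-square with 11 degrees of freedom, assuming H0 is true
v = st.chisquare(O,E)[1]

# H0 (signs are equally distributed) is rejected only when the p-value falls below the significance level
if v > 0.01 :
    print("Fail to reject H0 at 1% sig level")
if v > 0.05 :
    print("Fail to reject H0 at 5% sig level")
if v > 0.1 :
    print("Fail to reject H0 at 10% sig level")

Fail to reject H0 at 1% sig level
Fail to reject H0 at 5% sig level
Fail to reject H0 at 10% sig level

The p-value (about 0.93) is far above every common significance level, so the data give no evidence that the artists' zodiac signs are unequally distributed.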


### Question: According to a particular genetic theory, the number of colour strains (pink, white, blue) in a certain flower should appear in the ratio 3:2:5. In 100 randomly chosen plants, the corresponding numbers of each colour were 24, 14 and 62. Test at the 1% level whether the differences between the observed and expected frequencies are significant.

In [78]:
Colors = ['pink' , 'white' , 'blue']
O = [24,14,62]
E = [30,20,50]  # expected counts: 100 plants split in the ratio 3:2:5

qt = pd.DataFrame(Colors , columns=["Colors"])
qt['Observed Values'] = O
qt['Expected Values'] = E
qt

Unnamed: 0,Colors,Observed Values,Expected Values
0,pink,24,30
1,white,14,20
2,blue,62,50


In [79]:
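# Run the test once and keep the p-value
v = st.chisquare(O,E)[1]

# H0 (colours follow the 3:2:5 ratio) is rejected only when the p-value falls below the significance level
if v > 0.01 :
    print("Fail to reject H0 at 1% sig level")
if v > 0.05 :
    print("Fail to reject H0 at 5% sig level")
if v > 0.1 :
    print("Fail to reject H0 at 10% sig level")

Fail to reject H0 at 1% sig level
Fail to reject H0 at 5% sig level

The p-value (about 0.053) exceeds 0.01, so at the 1% level the differences between the observed and expected frequencies are not significant.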
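Rather than typing the expected counts by hand, they could be derived from the theoretical ratio. The sketch below is one way to do that under the question's numbers; the names `ratio`, `alpha`, `total` and `result` are assumptions introduced for this example.

```python
import scipy.stats as st

ratio = [3, 2, 5]          # theoretical pink : white : blue ratio
observed = [24, 14, 62]    # counts from the 100 sampled plants
alpha = 0.01               # significance level from the question

total = sum(observed)
# Scale the ratio so the expected counts sum to the sample size
expected = [total * r / sum(ratio) for r in ratio]

result = st.chisquare(observed, expected)
print(expected)
print(result.statistic, result.pvalue)
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```

With three categories the test has 2 degrees of freedom, and the statistic works out to 36/30 + 36/20 + 144/50 = 5.88.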
