# Goodness of Fit
<div style="font-size:16px; line-height:25pt">
A goodness-of-fit test is used to check whether sample data <b>fits</b> a distribution from a certain population. In other words, it tells you whether your sample data represents the data you would expect to find in the actual population.
<br>A measure of goodness of fit summarizes the discrepancy (a kind of difference) between the <b>observed</b> values and the values <b>expected</b> under the model. For the chi-square test it is a <b>sum</b> of such discrepancies.
<br>There are many <b>tests and measures of fit</b> that help us assess whether a given distribution fits the data set, including:
<br>1. Chi-square.
<br>2. Kolmogorov-Smirnov.
<br>3. Anderson-Darling.
<br>4. Shapiro-Wilk.
</div>

Let's start with the <b>chi-square</b> test.
<img src="pronounce.PNG" width=800px, height=100px>
<img src="formula.PNG" width=500px, height=100px>
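
In case the formula image does not render, here is the standard Pearson chi-square statistic written out, which is presumably what the image shows. The symbols are notation assumed here: $O_i$ is the observed count in category $i$, $E_i$ is the expected count, and $k$ is the number of categories.

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
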
<div style="font-size:16px; line-height:25pt">
The chi-square test is the most common goodness-of-fit test. It can be used for discrete distributions such as the binomial and Poisson distributions, whereas the Kolmogorov-Smirnov and Anderson-Darling goodness-of-fit tests, in their standard form, apply to continuous distributions.

<br><b>Tip:</b> A chi-square test is always carried out as a hypothesis test. In this notebook, "accepting" H0 means that we fail to reject it.
<br>There are two main situations when a Chi-squared test is used:
<br><b>1) A chi-squared goodness of fit test</b>
<br>   Two Standard Hypotheses are:
<br>   H0 = Particular Distribution fits the data
<br>   H1 = Particular Distribution does not fit the data

<br><b>2) A chi-squared test for independence (or for association)</b>
<br>   Two Standard Hypotheses are:
<br>   H0 = Factors are Independent
<br>   H1 = Factors are not Independent
</div>

# Solving part
<div style="font-size:16px; line-height:25pt">
1) State the hypotheses (H0 and H1).
<br>2) Find the <b>expected value</b> for each entry.
<br>3) Work out the <b>chi-squared</b> statistic: for each entry, square the difference between the observed and expected values and divide by the expected value, then sum the results (as shown in the formula).
<br>4) Calculate the degrees of freedom with the following formula:
<img src=degree_of_freedom.PNG width=500px height=50px>
<br>5) Find the critical value in the following table (with alpha 0.05). We usually work at a 95% confidence level, so our alpha is 5% (100% - 95%), which is 0.05.
<img src=critical_value.PNG width=500px height=100px>
<br>6) Accept or reject the null hypothesis.
<br><b>Accept</b> (fail to reject) the null hypothesis IF the value of <b>chi-squared</b> is <b>less than</b> the critical value in the table.
<br><b>Reject</b> the null hypothesis IF the value of <b>chi-squared</b> is <b>more than</b> the critical value in the table.
    </div>
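
The six steps can be sketched in plain Python. This is only a minimal illustration, not the notebook's own solution: the function names, the small `CRITICAL_VALUES_05` lookup (a few rows copied from a standard chi-square table for alpha 0.05) and the sample numbers are all assumptions made for this example. The degrees of freedom are taken as the number of categories minus one, which is the usual goodness-of-fit case.

```python
# A few critical values from a standard chi-square table for alpha = 0.05,
# keyed by degrees of freedom (assumed lookup, only a handful of rows).
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def goodness_of_fit_decision(observed, expected, critical_values=CRITICAL_VALUES_05):
    """Return (statistic, degrees of freedom, critical value, reject_h0)."""
    statistic = chi_square_statistic(observed, expected)
    degrees_of_freedom = len(observed) - 1  # goodness of fit: categories - 1
    critical_value = critical_values[degrees_of_freedom]
    reject_h0 = statistic > critical_value
    return statistic, degrees_of_freedom, critical_value, reject_h0


# Hypothetical data: 60 observations over 3 categories, expected to be equal.
observed = [10, 20, 30]
expected = [20, 20, 20]
statistic, dof, critical, reject_h0 = goodness_of_fit_decision(observed, expected)
print(statistic, dof, critical, reject_h0)  # 10.0 2 5.991 True
```

Here the statistic (10.0) is larger than the critical value (5.991), so H0 would be rejected for this made-up data.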

## Example of Chi-Squared test for Independence

<img src="example1.PNG">

Step 1. State Hypothesis(H0 and H1)
<img src=example1.1.PNG>

Step 2. Find out Expected Values for each entry
<img src=example1.2.PNG>

Step 3. Calculate <b>Chi-Squared</b>
<img src=example1.3.PNG>

Step 4. Calculate degrees of freedom with given formula
<img src=example1.4.PNG>

Step 5. Find critical value from the table
<img src=example1.5.PNG>
So, our critical value is 3.841

Step 6. Reject or Accept Null Hypothesis
<br>Our chi-square value is 4.102 and the critical value is 3.841. Since 4.102 is larger than 3.841, we reject the Null Hypothesis, which states "Gender and Preference for cats and dogs are INDEPENDENT", and accept the Alternative Hypothesis.
<br>So, to conclude, Gender and Preference for cats and dogs are not INDEPENDENT.
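
The same procedure can be sketched in code for any contingency table. The table below is hypothetical (it is not the data from the example images), and the function names are made up for this illustration. It assumes the usual rules for an independence test: each expected count is row total times column total divided by the grand total, and the degrees of freedom are (rows - 1) * (columns - 1).

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Hypothetical 2x2 table (NOT the data from the example images).
hypothetical_table = [[30, 20],
                      [20, 30]]
statistic, dof = chi_square_independence(hypothetical_table)
critical_value = 3.841  # alpha = 0.05, 1 degree of freedom
print(statistic, dof, statistic > critical_value)  # 4.0 1 True
```
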
<br>
<br><b>Please watch the video below if you have any issues understanding the Chi-Square test</b>

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/b3o_hjWKgQw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


## Example of Chi-Squared goodness-of-fit test

Question: 256 visual artists were surveyed to find out their zodiac signs. Are the zodiac signs of artists distributed uniformly (that is, is every sign equally likely)?
The DataFrame is provided.


In [95]:
import pandas as pd

zodiac_signs_list = ['Aries','Taurus','Gemini','Cancer','Leo','Virgo','Libra','Scorpio','Sagittarius','Capricorn','Aquarius','Pisces']
observed_values_list = [29,24,22,19,21,18,19,20,23,18,20,23]
expected_values_list = [] #Question: Fill this up
chi_square_list = [] #Question: Fill this up
question_table=pd.DataFrame(zodiac_signs_list,columns=['Zodiac signs'])

question_table['Observed values']=observed_values_list
question_table

Unnamed: 0,Zodiac signs,Observed values
0,Aries,29
1,Taurus,24
2,Gemini,22
3,Cancer,19
4,Leo,21
5,Virgo,18
6,Libra,19
7,Scorpio,20
8,Sagittarius,23
9,Capricorn,18


In [96]:
#Step 1. Create a Hypothesis test. Our Null Hypothesis: the data is distributed uniformly (every sign is equally likely).
null_hypothesis = None #assigned to None for now because we do not yet know whether we accept (True) or reject (False) H0

#Step 2. Find Expected Values and add them to the DataFrame
all_observed = sum(observed_values_list) #sum of all observed values
expected_value = all_observed/12 #the table has 12 rows, so under a uniform distribution each sign is expected to get the total divided by 12
#Now we add the expected value for each row (we can use a "for loop")
for i in range(12):
    expected_values_list.append(round(expected_value, 1)) #rounded to 1 decimal place

question_table['Expected values']=expected_values_list #adding Expected values to DataFrame
question_table

Unnamed: 0,Zodiac signs,Observed values,Expected values
0,Aries,29,21.3
1,Taurus,24,21.3
2,Gemini,22,21.3
3,Cancer,19,21.3
4,Leo,21,21.3
5,Virgo,18,21.3
6,Libra,19,21.3
7,Scorpio,20,21.3
8,Sagittarius,23,21.3
9,Capricorn,18,21.3


In [97]:
#Step 3. Now let's calculate Chi-Squared for each row and add it to the DataFrame
for i in range(12):
    difference = observed_values_list[i]-expected_values_list[i]
    chi_square_term = (difference*difference)/expected_values_list[i]
    chi_square_list.append(chi_square_term)

question_table['Chi-Square values']=chi_square_list #adding Chi-Square values to DataFrame
chi_square_sum = sum(chi_square_list)
#Step 4. Calculate degrees of freedom
degree_of_freedom = len(chi_square_list) - 1
#Step 5. Find critical value for alpha 0.05 and 11 degrees of freedom. (Can look it up in the table and assign it to a variable)
critical_value = 19.675
#Step 6. Accept or Reject Null Hypothesis
if chi_square_sum < critical_value:
    null_hypothesis = True #True means accept H0
else:
    null_hypothesis = False #False means reject H0

null_hypothesis #if True, we accept H0; if False, we reject H0

True

In [98]:
question_table #Print the whole table

Unnamed: 0,Zodiac signs,Observed values,Expected values,Chi-Square values
0,Aries,29,21.3,2.783568
1,Taurus,24,21.3,0.342254
2,Gemini,22,21.3,0.023005
3,Cancer,19,21.3,0.248357
4,Leo,21,21.3,0.004225
5,Virgo,18,21.3,0.511268
6,Libra,19,21.3,0.248357
7,Scorpio,20,21.3,0.079343
8,Sagittarius,23,21.3,0.135681
9,Capricorn,18,21.3,0.511268
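
As a quick cross-check, the sketch below recomputes the statistic in plain Python with the expected value left unrounded (256 / 12) and compares it with the version that uses the rounded 21.3 from the table. The variable names are chosen for this illustration only. The two results differ slightly, and both are far below the critical value 19.675, so the decision does not change.

```python
observed_values = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]

# Equal expected count per sign, kept unrounded (256 / 12 = 21.333...).
exact_expected = sum(observed_values) / len(observed_values)

exact_chi_square = sum(
    (o - exact_expected) ** 2 / exact_expected for o in observed_values
)
# Same calculation with the expected value rounded to 21.3, as in the table.
rounded_chi_square = sum((o - 21.3) ** 2 / 21.3 for o in observed_values)

critical_value = 19.675  # alpha = 0.05, 11 degrees of freedom
print(round(exact_chi_square, 3), round(rounded_chi_square, 3))
print(exact_chi_square < critical_value)  # True: H0 is not rejected
```
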


# Questions

### Question:  The table shows the number of employees absent for just one day during a particular period of time
| Day of the week 	| Monday 	| Tuesday 	| Wednesday 	|  Thursday  	| Friday 	|  	|
|------:	|:---:	|:---:	|:--:	|:---:	|:--:	|:--:	|
|   Number of absentees  	| 121 	|  87 	| 87 	| 91 	| 114 	| Total 500 	|

Question 1. Create a <b>Hypothesis test</b> -->Create a boolean variable for Null Hypothesis and assign it to <b>None</b> for now.
<br>Tip: Null Hypothesis variable can be True(accept) and False(reject). 

In [9]:
#Precode for Question 1
import pandas as pd

day_list = [] #Question: fill the list with the weekdays
observed_absentees = [] #Question: Fill up the list
expected_absentees = [] #Question: Fill up the list
chi_square_list = [] #Question: Fill up the list

#Creating the DataFrame
question1_table = pd.DataFrame(day_list,columns=['Weekdays'])

#Adding Observed Values into the DataFrame
question1_table['Observed values']=observed_absentees

#Step 1. Create a Hypothesis test.
#Create and Assign the variable according to the question 1

Question 2. Find out <b>Expected values</b> and add them to the DataFrame table

In [10]:
#Precode for Question 2
#Step 2. Find Expected Values and add it to dataframe
observed_absentees_sum  #we need the sum of all observed values
expected_absentees  #the table has 5 days, so to find the expected values we divide the sum of observed values by 5
#Now we need to add the expected value for each day (we can use a "for loop")
#Also round the number to 1 decimal place

#Add Expected values to DataFrame

#Display the DataFrame

NameError: name 'observed_absentees_sum' is not defined

Question 3. Find <b>Chi-Squared values</b> for each unit in table. 

In [11]:
#Precode for Question 3
#Step  3. Now calculate Chi-Squared for each unit 
#Use for loop

Question 4. Add all the chi-squared values to DataFrame. Calculate sum of Chi-Squared

In [13]:
#Precode for Question 4
#Display the DataFrame

chi_square_sum #Calculate the sum of Chi-Squared

NameError: name 'chi_square_sum' is not defined

Question 5. Calculate <b>degrees of freedom</b> and Find <b>critical value</b> from the table provided. 

In [None]:
#Precode for Question 5
#Step 4. Calculate degrees of freedom
degree_of_freedom
#Step 5. Find critical value. (Can look it up on the table and assign it to a variable)
critical_value

Question 6. Accept or Reject Null Hypothesis

In [14]:
#Precode for Question 6
#Step 6. Accept or Reject Null Hypothesis
#Assign Null Hypothesis to True or False according sum of Chi-Square and critical value

### Question: According to a particular genetic theory, the number of colour strains (pink, white, blue) in a certain flower should appear in the ratio 3:2:5. In 100 randomly chosen plants, the corresponding numbers of each colour were 24, 14 and 62. Test at the 1% level whether the differences between the observed and expected frequencies are significant.

Question 7. Create a list of colours and a list of Observed Frequencies and add them to the DataFrame (pandas)

In [None]:
#Precode for Question 7
import pandas as pd

color_list = [] #Question: fill the list with the colours
observed_strains = [] #Question: Fill up the list
expected_strains = [] #Question: Fill up the list
chi_square_list = [] #Question: Fill up the list

#Creating the DataFrame
question2_table

#Add Observed Values into the DataFrame




Question 8. Create a <b>Hypothesis test</b> (just like in Question 1)

In [None]:
#Precode for Question 8
#Step 1. Create a Hypothesis test.
#Create and Assign the variable according to the question 1

Question 9. Find out <b>Expected values</b> and add them to a list and also add the list to the DataFrame table 

In [None]:
#Precode for Question 9

#Step 2. Find Expected Values and add them to dataframe
observed_strains_sum  #we need the sum of all observed values
expected_strains  #split the sum according to the ratio 3:2:5 (the sum of observed values corresponds to 100%)
#Now we need to add the expected value for each colour (we can use a "for loop")
#Also round the number to 1 decimal place

#Add Expected values to DataFrame

#Display the DataFrame

Question 10. Find <b>Chi-Squared values</b> for each unit in table. 

In [None]:
#Precode for Question 10

#Step  3. Now calculate Chi-Squared for each unit 
#Use for loop

Question 11. Add all the chi-squared values to list and also add it to <b>DataFrame</b>. Calculate <b>sum of Chi-Squared</b>.

In [None]:
#Precode for Question 11

#Display the DataFrame

chi_square_sum #Calculate the sum of Chi-Squared

Question 12. Calculate <b>degrees of freedom</b> and find the <b>critical value (use alpha 0.01, the 1% level)</b> from the table provided.

In [None]:
#Precode for Question 12

#Step 4. Calculate degrees of freedom
degree_of_freedom
#Step 5. Find critical value. (Can look it up on the table and assign it to a variable)
critical_value

Question 13. Accept or Reject Null Hypothesis

In [None]:
#Precode for Question 13
#Step 6. Accept or Reject Null Hypothesis
#Assign Null Hypothesis to True or False according sum of Chi-Square and critical value

# Hints
1) In most cases, the expected value is the sum of all observed values divided by the number of rows. If the question gives a ratio, split the sum of all observed values according to that ratio.
<br>2) We accept the null hypothesis only when the critical value is larger than the chi-squared value.
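
The first hint can be illustrated with a small helper. This is a sketch only: `expected_from_ratio` is a name made up for this example, and the numbers are hypothetical rather than taken from the questions above.

```python
def expected_from_ratio(total, ratio):
    """Split `total` into expected counts proportional to `ratio`."""
    ratio_sum = sum(ratio)
    return [total * part / ratio_sum for part in ratio]


# Hypothetical example: 200 observations expected in the ratio 1:1:2.
print(expected_from_ratio(200, [1, 1, 2]))  # [50.0, 50.0, 100.0]

# With equal ratio parts this reduces to "sum divided by number of rows".
print(expected_from_ratio(200, [1, 1, 1, 1]))  # [50.0, 50.0, 50.0, 50.0]
```
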
