In [19]:
import pandas as pd

## Question

Is gender independent of education level? <br>
A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table:<br>

In [17]:
d1 = {'High School': pd.Series([60, 40, 100], index=['Female', 'Male', 'Total']),
      'Bachelors': pd.Series([54, 44, 98], index=['Female', 'Male', 'Total']),
      'Masters': pd.Series([46, 53, 99], index=['Female', 'Male', 'Total']),
      'Ph.d': pd.Series([41, 57, 98], index=['Female', 'Male', 'Total']),
      'Total': pd.Series([201, 194, 395], index=['Female', 'Male', 'Total'])}

df1 = pd.DataFrame(d1, columns = [ 'High School', 'Bachelors', 'Masters', 'Ph.d', 'Total'])
df1

Unnamed: 0,High School,Bachelors,Masters,Ph.d,Total
Female,60,54,46,41,201
Male,40,44,53,57,194
Total,100,98,99,98,395


Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between an individual's gender and the level of education they have obtained?

## Solution

* To test the independence of two categorical variables (here gender & education level), we perform a **Chi-square test of independence**.

**Null hypothesis :-** Gender & education level are independent of each other.<br>
**Alternate hypothesis :-** Gender & education level are dependent on each other.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
   &emsp; &emsp; &emsp; $O$ = Observed value<br>
   &emsp; &emsp; &emsp; $E$ = Expected value
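
As a minimal sketch of the formula above, the statistic can be written as a small NumPy function. The helper name `chi_square_statistic` and the use of NumPy are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())

# Contribution of a single cell (High School, Female):
# observed count 60, expected count = row total * column total / grand total.
single_cell = chi_square_statistic([60], [201 * 100 / 395])
print(round(single_cell, 3))
```

Passing the full 2 x 4 arrays of observed and expected counts to the same function gives the overall statistic.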

***
**1. <u>Observed value</u> :-**

In [18]:
d1 = {'High School': pd.Series([60, 40, 100], index=['Female', 'Male', 'Total']),
      'Bachelors': pd.Series([54, 44, 98], index=['Female', 'Male', 'Total']),
      'Masters': pd.Series([46, 53, 99], index=['Female', 'Male', 'Total']),
      'Ph.d': pd.Series([41, 57, 98], index=['Female', 'Male', 'Total']),
      'Total': pd.Series([201, 194, 395], index=['Female', 'Male', 'Total'])}

df1 = pd.DataFrame(d1, columns = [ 'High School', 'Bachelors', 'Masters', 'Ph.d', 'Total'])
df1

Unnamed: 0,High School,Bachelors,Masters,Ph.d,Total
Female,60,54,46,41,201
Male,40,44,53,57,194
Total,100,98,99,98,395


***
**2. <u>Expected value</u>**

For each cell, $E = \frac{\text{row total}}{\text{grand total}} \times \text{column total}$:<br>
**a.** For High School - Female = $\frac{201}{395} \times 100 = 50.886$<br>
**b.** For High School - Male = $\frac{194}{395} \times 100 = 49.114$<br>
**c.** For Bachelors - Female = $\frac{201}{395} \times 98 = 49.868$<br>
**d.** For Bachelors - Male = $\frac{194}{395} \times 98 = 48.132$<br>
**e.** For Masters - Female = $\frac{201}{395} \times 99 = 50.377$<br>
**f.** For Masters - Male = $\frac{194}{395} \times 99 = 48.623$<br>
**g.** For Ph.d - Female = $\frac{201}{395} \times 98 = 49.868$<br>
**h.** For Ph.d - Male = $\frac{194}{395} \times 98 = 48.132$<br>
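
The eight hand calculations above all follow the same rule, so they can be reproduced in one step from the margins of the observed table. The sketch below is illustrative only: the variable names and the use of `np.outer` are assumptions, not the notebook's own method.

```python
import numpy as np
import pandas as pd

levels = ['High School', 'Bachelors', 'Masters', 'Ph.d']
observed = pd.DataFrame([[60, 54, 46, 41], [40, 44, 53, 57]],
                        index=['Female', 'Male'], columns=levels)

row_totals = observed.sum(axis=1)   # Female: 201, Male: 194
col_totals = observed.sum(axis=0)   # 100, 98, 99, 98
grand_total = row_totals.sum()      # 395

# E[i, j] = row_total[i] * col_total[j] / grand_total
expected = pd.DataFrame(np.outer(row_totals, col_totals) / grand_total,
                        index=observed.index, columns=observed.columns)
print(expected.round(3))
```

Computing the expected counts this way keeps full floating-point precision and avoids transcription slips when the values are copied by hand.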



In [20]:
# Expected counts: (row total / grand total) * column total, rounded to 3 decimals
d1 = {'High School': pd.Series([50.886, 49.114, 100], index=['Female', 'Male', 'Total']),
      'Bachelors': pd.Series([49.868, 48.132, 98], index=['Female', 'Male', 'Total']),
      'Masters': pd.Series([50.377, 48.623, 99], index=['Female', 'Male', 'Total']),
      'Ph.d': pd.Series([49.868, 48.132, 98], index=['Female', 'Male', 'Total']),
      'Total': pd.Series([201, 194, 395], index=['Female', 'Male', 'Total'])}

df1 = pd.DataFrame(d1, columns = [ 'High School', 'Bachelors', 'Masters', 'Ph.d', 'Total'])
df1

Unnamed: 0,High School,Bachelors,Masters,Ph.d,Total
Female,50.886,49.868,50.377,49.868,201
Male,49.114,48.132,48.623,48.132,194
Total,100.0,98.0,99.0,98.0,395


***
**3. $\chi^2$**

In [23]:
# Sum of (O - E)^2 / E over the eight cells (Total row and column excluded)
chi_sqr = ((60 - 50.886)**2 / 50.886 + (40 - 49.114)**2 / 49.114
           + (54 - 49.868)**2 / 49.868 + (44 - 48.132)**2 / 48.132
           + (46 - 50.377)**2 / 50.377 + (53 - 48.623)**2 / 48.623
           + (41 - 49.868)**2 / 49.868 + (57 - 48.132)**2 / 48.132)

round(chi_sqr, 3)

8.006
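
As a cross-check on the hand calculation, SciPy's `scipy.stats.chi2_contingency` performs the same test directly on the observed counts. This is a sketch that assumes SciPy is installed; the original notebook does not use it. The result should land close to the hand-computed statistic, with small differences coming from the rounding of the expected counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts without the 'Total' row and column.
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

# Returns the statistic, the p-value, the degrees of freedom,
# and the array of expected counts under independence.
stat, p_value, dof, expected = chi2_contingency(observed)
print(round(stat, 3), dof)
print(expected.round(3))
```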

***
**4. Degrees of freedom of $\chi^2$**<br>
    &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; = (number of rows - 1) $\times$ (number of columns - 1)<br>
    &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; = (2 - 1) $\times$ (4 - 1)<br>
    &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; = 3<br>
The Total row and Total column are not counted: there are 2 gender rows and 4 education-level columns.

***
**5. Critical value of $\chi^2$ at $\alpha$ = 0.05 and 3 degrees of freedom**<br>
From the chi-square distribution table, the critical value is $7.815$.
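
Instead of reading the table, the critical value can also be obtained from the chi-square quantile function. The sketch below assumes SciPy is available and uses `scipy.stats.chi2.ppf`; the variable names are illustrative.

```python
from scipy.stats import chi2

alpha = 0.05
dof = (2 - 1) * (4 - 1)   # (rows - 1) * (columns - 1)

# Value that a chi-square variable with `dof` degrees of freedom
# exceeds with probability alpha (the upper-tail critical value).
critical_value = chi2.ppf(1 - alpha, dof)
print(round(critical_value, 3))
```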


***
**Result :-** Our computed $\chi^2$ value (about 8.01) exceeds the critical value of 7.815, so it falls into the rejection region of the chi-square distribution.<br>
Therefore we reject the null hypothesis: at the 5% level of significance, the data indicate that gender and education level are not independent.
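
The same decision can be reached through a p-value rather than a critical value: reject the null hypothesis when the upper-tail probability of the statistic is below $\alpha$. The following is a sketch under the assumption that NumPy and SciPy are available; it recomputes the statistic from the observed counts so that it runs on its own.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]], dtype=float)

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
stat = ((observed - expected) ** 2 / expected).sum()

alpha = 0.05
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of a statistic at least this large under the null hypothesis.
p_value = chi2.sf(stat, dof)
reject_null = p_value < alpha
print(round(p_value, 4), reject_null)
```

Comparing the p-value with $\alpha$ and comparing the statistic with the critical value are equivalent ways of stating the same test.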