In [27]:
import pandas as pd
import numpy as np
from scipy import stats

In [28]:
df=pd.read_csv('../datasets/education_districtwise.csv')
df.head(10)

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0
5,DISTRICT323,STATE1,12,523,96,1070144.0,64.32
6,DISTRICT114,STATE1,6,110,49,147104.0,80.48
7,DISTRICT438,STATE1,7,134,54,143388.0,74.49
8,DISTRICT610,STATE1,10,388,80,409576.0,65.97
9,DISTRICT476,STATE1,11,361,86,555357.0,69.9


In [29]:
df=df.dropna()

In [30]:
df.isnull().sum()

DISTNAME      0
STATNAME      0
BLOCKS        0
VILLAGES      0
CLUSTERS      0
TOTPOPULAT    0
OVERALL_LI    0
dtype: int64

In [31]:
state21=df[df['STATNAME']=='STATE21']
state28=df[df['STATNAME']=='STATE28']

In [32]:
state21.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
133,DISTRICT607,STATE21,14,1357,127,3464228.0,72.03
134,DISTRICT50,STATE21,12,594,86,4138605.0,70.11
135,DISTRICT61,STATE21,16,1919,159,3683896.0,70.43
136,DISTRICT191,STATE21,10,1141,69,4773138.0,58.67
137,DISTRICT328,STATE21,7,1116,85,2335398.0,55.08


In [33]:
# Draw 20 districts per state with replacement; random_state makes each draw reproducible.
sampled_state21 = state21.sample(n=20, replace=True, random_state=13490)
sampled_state28 = state28.sample(n=20, replace=True, random_state=42)

In [34]:
sampled_state21_mean=sampled_state21['OVERALL_LI'].mean()
sampled_state28_mean=sampled_state28['OVERALL_LI'].mean()
print(f"Sampled State 21 Mean: {sampled_state21_mean}")
print(f"Sampled State 28 Mean: {sampled_state28_mean}")

Sampled State 21 Mean: 70.82900000000001
Sampled State 28 Mean: 63.93449999999999


**Conduct a hypothesis test**

1.   State the null hypothesis and the alternative hypothesis.
2.   Choose a significance level.
3.   Find the p-value. 
4.   Reject or fail to reject the null hypothesis.


#### Step 1: State the null hypothesis and the alternative hypothesis
*   $H_0$: There is no difference in the mean district literacy rates between STATE21 and STATE28.
*   $H_A$: There is a difference in the mean district literacy rates between STATE21 and STATE28.
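
In symbols, the two hypotheses can be written as follows. The symbols $\mu_{21}$ and $\mu_{28}$ are notation introduced here for illustration; they stand for the population mean district literacy rates of STATE21 and STATE28.

$$
H_0: \mu_{21} = \mu_{28} \qquad H_A: \mu_{21} \neq \mu_{28}
$$

Because $H_A$ allows a difference in either direction, this is a two-sided test.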

#### Step 2: Choose a significance level
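The significance level is the threshold below which a p-value is treated as statistically significant. This test uses the standard level of 5%, or 0.05.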

#### Step 3: Find the p-value
#### `scipy.stats.ttest_ind()`

For a two-sample $t$-test, you can use `scipy.stats.ttest_ind()` to compute your p-value. This function includes the following arguments:

*   `a`: Observations from the first sample 
*   `b`: Observations from the second sample
*   `equal_var`: A boolean (`True` or `False`) that indicates whether the two populations are assumed to have equal variances. In this example, you don’t have access to data for the entire population, so you can’t verify that assumption. To avoid making a wrong assumption, set this argument to `False`, which makes SciPy run Welch’s $t$-test instead of the pooled-variance test.

In this example, pass the following values:

*   `a`: Your first sample is the district literacy rate data for STATE21, which is stored in the `OVERALL_LI` column of your variable `sampled_state21`.
*   `b`: Your second sample is the district literacy rate data for STATE28, which is stored in the `OVERALL_LI` column of your variable `sampled_state28`.
*   `equal_var`: Set to `False` because you don’t want to assume that the two populations have the same variance.
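
To make the effect of `equal_var=False` more concrete, here is a minimal sketch that computes Welch's $t$ statistic, its approximate degrees of freedom, and the two-sided p-value by hand, then compares them with `scipy.stats.ttest_ind()`. The arrays `sample_a` and `sample_b` and the helper `welch_t_test` are hypothetical names with made-up values; they are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical literacy rates, invented for illustration only.
sample_a = np.array([72.0, 70.1, 70.4, 58.7, 55.1, 68.3, 74.9, 66.0])
sample_b = np.array([61.2, 59.8, 65.4, 63.0, 57.7, 66.1, 60.5, 62.9])


def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size

    # Welch's statistic: difference in means over the unpooled standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )

    # Two-sided p-value from the t distribution's survival function.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value


manual = welch_t_test(sample_a, sample_b)
reference = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)

print(manual)
print(reference)
```

The hand-computed statistic and p-value should agree with SciPy's result up to floating-point error, which suggests that `equal_var=False` simply avoids pooling the two sample variances.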

In [35]:
stats.ttest_ind(a=sampled_state21['OVERALL_LI'], b=sampled_state28['OVERALL_LI'], equal_var=False)

TtestResult(statistic=3.4540442484973974, pvalue=0.0014014047341483517, df=36.96564575115564)

#### Step 4: Reject or fail to reject the null hypothesis

To draw a conclusion, compare your p-value with the significance level.

*   If the p-value is less than the significance level, you can conclude that there is a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you will reject the null hypothesis $H_0$.
*   If the p-value is greater than the significance level, you can conclude that there is *not* a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you will fail to reject the null hypothesis $H_0$.

Your p-value of 0.0014, or 0.14%, is less than the significance level of 0.05, or 5%. Therefore, you will *reject* the null hypothesis and conclude that there is a statistically significant difference between the mean district literacy rates of the two states: STATE21 and STATE28. 
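
The comparison in Step 4 can also be written as a short check. This is a sketch only: `alpha`, `p_value`, and the helper `decide` are illustrative names, and the p-value is copied from the `TtestResult` output shown above.

```python
alpha = 0.05                      # significance level chosen in Step 2
p_value = 0.0014014047341483517   # p-value reported by stats.ttest_ind() above


def decide(p_value, alpha):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decision = decide(p_value, alpha)
print(decision)
```

In the notebook itself, you could store the test result in a variable and pass its `pvalue` attribute to the same check instead of copying the number by hand.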