# Two-sample hypothesis test

Now imagine that the department of education asks you to collect data on mean district literacy rates for two of the nation’s largest states: STATE21 and STATE28. STATE28 has almost 40 districts, and STATE21 has more than 70. Due to limited time and resources, you are only able to survey 20 randomly chosen districts in each state. The department asks you to determine if the difference between the two mean district literacy rates is statistically significant, or due to chance. This will help the department decide how to distribute government funding to improve literacy. If there is a statistically significant difference, the state with the lower literacy rate may receive more funding.

You can use Python to simulate taking a random sample of 20 districts in each state, and conduct a two-sample t-test based on the sample data.

### Let's import the data

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
df = pd.read_csv('education_districtwise.csv')
df.head()

| | DISTNAME | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI |
|---|---|---|---|---|---|---|---|
| 0 | DISTRICT32 | STATE1 | 13 | 391 | 104 | 875564.0 | 66.92 |
| 1 | DISTRICT649 | STATE1 | 18 | 678 | 144 | 1015503.0 | 66.93 |
| 2 | DISTRICT229 | STATE1 | 8 | 94 | 65 | 1269751.0 | 71.21 |
| 3 | DISTRICT259 | STATE1 | 13 | 523 | 104 | 735753.0 | 57.98 |
| 4 | DISTRICT486 | STATE1 | 8 | 359 | 64 | 570060.0 | 65.0 |


In [7]:
df.shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 680 entries, 0 to 679
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    680 non-null    object 
 1   STATNAME    680 non-null    object 
 2   BLOCKS      680 non-null    int64  
 3   VILLAGES    680 non-null    int64  
 4   CLUSTERS    680 non-null    int64  
 5   TOTPOPULAT  634 non-null    float64
 6   OVERALL_LI  634 non-null    float64
dtypes: float64(2), int64(3), object(2)
memory usage: 37.3+ KB


In [8]:
df.duplicated().sum()

0

In [9]:
df.isna().sum()

DISTNAME       0
STATNAME       0
BLOCKS         0
VILLAGES       0
CLUSTERS       0
TOTPOPULAT    46
OVERALL_LI    46
dtype: int64

In [10]:
# Drop every row that contains a missing value
df = df.dropna()
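
Called with no arguments, dropna() removes any row that has a missing value in any column. Since the test below only uses OVERALL_LI, a narrower option is to drop rows only when that column is missing. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from education_districtwise.csv), using the subset parameter of dropna():

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are NOT from education_districtwise.csv.
toy = pd.DataFrame({
    "DISTNAME": ["A", "B", "C", "D"],
    "TOTPOPULAT": [1000.0, np.nan, 3000.0, np.nan],
    "OVERALL_LI": [60.0, 70.0, np.nan, np.nan],
})

# Default behavior: drop a row if ANY column is missing.
dropped_any = toy.dropna()

# Narrower behavior: drop a row only if the column used in the test is missing.
dropped_li = toy.dropna(subset=["OVERALL_LI"])

print(len(toy), len(dropped_any), len(dropped_li))
```

Whether the two approaches differ on the real data depends on whether the same rows are missing both TOTPOPULAT and OVERALL_LI, which is worth checking before choosing one.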

In [13]:
state_28 = df[df['STATNAME'] == 'STATE28']
state_21 = df[df['STATNAME'] == 'STATE21']

In [15]:
state_28['STATNAME'].value_counts()

STATE28    38
Name: STATNAME, dtype: int64

### Simulate random sampling

In [27]:
sample_state21 = state_21.sample(n=20, replace=True, random_state=12345)
sample_state28 = state_28.sample(n=20, replace=True, random_state=98765)
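
To see what the replace and random_state arguments do, here is a small sketch on a hypothetical stand-in for one state's districts (38 rows with made-up literacy rates, not the real data). It checks that a fixed seed reproduces the same draw, and compares sampling with and without replacement:

```python
import pandas as pd

# Hypothetical stand-in for one state's districts (38 rows, made-up literacy rates).
toy_state = pd.DataFrame({
    "DISTNAME": [f"D{i}" for i in range(38)],
    "OVERALL_LI": [50.0 + i for i in range(38)],
})

# Same seed, same sample: random_state makes the draw reproducible.
s1 = toy_state.sample(n=20, replace=True, random_state=12345)
s2 = toy_state.sample(n=20, replace=True, random_state=12345)
same_draw = s1.index.equals(s2.index)

# With replace=True a district can be drawn more than once;
# with replace=False every sampled district is distinct.
without = toy_state.sample(n=20, replace=False, random_state=12345)
n_unique_with = s1.index.nunique()
n_unique_without = without.index.nunique()

print(same_draw, n_unique_with, n_unique_without)
```

Sampling without replacement is closer to surveying 20 distinct districts. The cell above samples with replacement, so the numbers reported in the rest of this notebook depend on that choice.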

### Compute the sample means

You now have two random samples of 20 districts, one sample for each state. Next, use mean() to compute the mean district literacy rate for both STATE21 and STATE28.

In [28]:
sample_state21['OVERALL_LI'].mean()

70.511

In [29]:
sample_state28['OVERALL_LI'].mean()

64.32199999999999

STATE21 has a mean district literacy rate of about 70.5%, while STATE28 has a mean district literacy rate of about 64.3%.

Based on your sample data, the observed difference between the mean district literacy rates of STATE21 and STATE28 is 6.2 percentage points (70.5% - 64.3%).
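
The same comparison can be done in one step by stacking the two samples and grouping by state. This is a sketch on made-up literacy rates (not the sampled values above); the frame names sample_a and sample_b are hypothetical:

```python
import pandas as pd

# Made-up literacy rates for illustration; not the sampled values from the notebook.
sample_a = pd.DataFrame({"STATNAME": "STATE21", "OVERALL_LI": [70.0, 72.0, 68.0]})
sample_b = pd.DataFrame({"STATNAME": "STATE28", "OVERALL_LI": [64.0, 66.0, 62.0]})

# Stack both samples, then compute one mean per state.
combined = pd.concat([sample_a, sample_b], ignore_index=True)
means = combined.groupby("STATNAME")["OVERALL_LI"].mean()

# Observed difference in sample means, in percentage points.
diff = means["STATE21"] - means["STATE28"]

print(means)
print(round(diff, 1))
```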

# Conduct a hypothesis test

$H_0$: There is no difference in the mean district literacy rates between STATE21 and STATE28.

$H_A$: There is a difference in the mean district literacy rates between STATE21 and STATE28.

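For a two-sample $t$-test, you can use scipy.stats.ttest_ind() to compute your p-value. The call below passes three arguments: a is the set of observations from the first sample, b is the set of observations from the second sample, and equal_var states whether to assume the two populations have equal variances. Setting equal_var=False runs Welch's t-test, which does not make that assumption.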

In [30]:
stats.ttest_ind(a=sample_state21['OVERALL_LI'], b=sample_state28['OVERALL_LI'], equal_var=False)

Ttest_indResult(statistic=3.231836856172072, pvalue=0.00287993729460301)

Your p-value is about 0.0029, or 0.29%.

This means that, if the null hypothesis is true, there is only a 0.29% probability of observing an absolute difference of 6.2 percentage points or greater between the two sample means. In other words, a difference this large would be very unlikely to arise from sampling variability alone.
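
To make the link between the test statistic and the p-value concrete, the sketch below recomputes Welch's t statistic and its two-sided p-value by hand and compares them with ttest_ind(equal_var=False). It uses synthetic samples drawn with made-up parameters, not the notebook's data, so the numbers it produces are not the ones reported above:

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only (made-up means and spreads).
rng = np.random.default_rng(0)
a = rng.normal(loc=70.0, scale=8.0, size=20)
b = rng.normal(loc=64.0, scale=10.0, size=20)

# Estimated variance of each sample mean (sample variance uses ddof=1).
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic and the Welch-Satterthwaite degrees of freedom.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value: probability of |t| at least this large under the null.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Compare with SciPy's implementation.
result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(np.isclose(t_manual, result.statistic), np.isclose(p_manual, result.pvalue))
```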

# Conclusion

Your p-value of 0.0029, or 0.29%, is less than the significance level of 0.05, or 5%. So, you reject the null hypothesis, and conclude that there is a statistically significant difference between the mean district literacy rates of the two states STATE21 and STATE28.
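
The decision rule can be written out explicitly. This sketch copies the p-value from the ttest_ind output shown earlier and uses the 0.05 significance level stated above; the variable names are illustrative:

```python
# p-value copied from the ttest_ind output shown earlier in this notebook.
p_value = 0.00287993729460301

# Significance level used in the conclusion.
alpha = 0.05

# Reject the null hypothesis when the p-value falls below the significance level.
reject_null = p_value < alpha

print(f"p = {p_value:.4f}, alpha = {alpha}, reject H0: {reject_null}")
```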

Your analysis will help the education department decide how to distribute government resources. Since there is a statistically significant difference in mean district literacy rates, the state with the lower literacy rate, STATE28, will likely receive more resources to improve literacy.