## Hypothesis testing with Python

In [1]:
!mamba install pandas
!mamba install statsmodels

mambajs 0.19.13

Specs: xeus-python, numpy, matplotlib, pillow, ipywidgets>=8.1.6, ipyleaflet, scipy, pandas
Channels: emscripten-forge, conda-forge

Solving environment...
Solving took 1.5540999999996274 seconds
  Name                          Version                       Build                         Channel                       
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ pandas                        3.0.0                         np22py313h9d9dc1e_0           emscripten-forge
+ python-tzdata                 2025.3                        pyhd8ed1ab_0                  conda-forge
- pip

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

**Introduction**
You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health.

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Would Michigan be affected by this new policy?

Notes:

- For your analysis, you'll default to a 5% level of significance.
- Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
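To make the Welch's t-test note concrete, here is a minimal sketch that computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `scipy.stats.ttest_ind(..., equal_var=False)`. The arrays `group_a` and `group_b` are made-up values used only for illustration; they are not taken from the AQI dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI-like samples (illustrative only, not from the lab data)
group_a = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 18.0])
group_b = np.array([8.0, 7.0, 9.0, 6.0, 10.0, 8.0, 7.0, 9.0])

# Sample variance (ddof=1) of each group divided by its sample size
va = group_a.var(ddof=1) / group_a.size
vb = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over the unpooled standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual = (va + vb) ** 2 / (
    va ** 2 / (group_a.size - 1) + vb ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# SciPy's Welch test should agree with the manual calculation
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Because the variances are not pooled, the degrees of freedom are generally not an integer, which is why the test outputs later in this lab report fractional `df` values.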

In [14]:
# Use read_csv() to import your data

aqi = pd.read_csv('c4_epa_air_quality.csv')

**Data Exploration**

In [20]:
# Explore your dataframe `aqi` here:
print('Show Top 5 rows')
print(aqi.head())

print('Summary Statistics of dataframe')
print(aqi.describe(include='all'))

print("For a more thorough examination of observations by state use value_counts()")
print(aqi['state_name'].value_counts())

Show Top 5 rows
   Unnamed: 0  date_local    state_name   county_name      city_name  \
0           0  2018-01-01       Arizona      Maricopa        Buckeye   
1           1  2018-01-01          Ohio       Belmont      Shadyside   
2           2  2018-01-01       Wyoming         Teton  Not in a city   
3           3  2018-01-01  Pennsylvania  Philadelphia   Philadelphia   
4           4  2018-01-01          Iowa          Polk     Des Moines   

                                     local_site_name   parameter_name  \
0                                            BUCKEYE  Carbon monoxide   
1                                          Shadyside  Carbon monoxide   
2  Yellowstone National Park - Old Faithful Snow ...  Carbon monoxide   
3                             North East Waste (NEW)  Carbon monoxide   
4                                          CARPENTER  Carbon monoxide   

    units_of_measure  arithmetic_mean  aqi  
0  Parts per million         0.473684    7  
1  Parts per million  

**Question 1: From preceding data exploration, what do you recognize?**
You have county-level data for the first hypothesis.


## Statistical Tests
Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw your conclusion.
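The steps above can be sketched as a small helper. The function name `welch_decision` and the sample values are hypothetical, introduced only for illustration; the sketch assumes a two-sample comparison with Welch's t-test, which is the procedure this lab uses for two-sample tests.

```python
from scipy import stats


def welch_decision(sample_a, sample_b, alpha=0.05, alternative="two-sided"):
    """Hypothetical helper: run Welch's t-test and compare the p-value to alpha."""
    # Steps 1-3 are choices made by the analyst: the hypotheses determine
    # `alternative`, `alpha` is the significance level, and Welch's t-test
    # is the test procedure.
    result = stats.ttest_ind(
        a=sample_a, b=sample_b, equal_var=False, alternative=alternative
    )
    # Step 4: the p-value; step 5: the conclusion.
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_null": bool(result.pvalue < alpha),
    }


# Made-up samples: one pair with clearly different means, one identical pair
clearly_different = welch_decision([30, 32, 29, 31, 33, 30], [10, 12, 11, 9, 10, 11])
identical = welch_decision([10, 12, 11, 9, 10, 11], [10, 12, 11, 9, 10, 11])
print(clearly_different)
print(identical)
```

Keeping the significance level as an explicit argument makes it harder to change the threshold after seeing the p-value.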

**Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.**
Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [22]:
# Create dataframes for each sample being compared in your test
ca_la = aqi[(aqi['state_name']=='California') & (aqi['county_name'] == 'Los Angeles')]
ca_other = aqi[(aqi['state_name']=='California') & (aqi['county_name'] != 'Los Angeles')]

**Formulate your hypothesis:**
Formulate your null and alternative hypotheses:

Null Hypothesis: There is no difference in the mean AQI between Los Angeles County and the rest of California.
Alternative hypothesis: There is a difference in the mean AQI between Los Angeles County and the rest of California.

In [25]:
# For this analysis, the significance level is 5%

significance_level = 0.05
significance_level


0.05

**Determine the appropriate test procedure:**
Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a two-sample t-test.

In [26]:
# Compute your p-value here
stats.ttest_ind(a=ca_la['aqi'], b = ca_other['aqi'], equal_var = False)

TtestResult(statistic=np.float64(2.1107010796372014), pvalue=np.float64(0.04983905684241102), df=np.float64(17.08246830361151))

**Question 2. What is your p-value for hypothesis 1, and what does this indicate for your null hypothesis?**
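The p-value is approximately 0.0498, which is just below the significance level of 0.05, so you reject the null hypothesis in favor of the alternative hypothesis. Because the p-value is so close to the threshold, the evidence is borderline.
Therefore, a metropolitan strategy may make sense in this case.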
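When a p-value sits this close to the threshold, a confidence interval for the difference in means can give extra context. The following is a minimal sketch, not part of the original lab: the helper `welch_confidence_interval` and the sample values are hypothetical, and the interval is built from the same unpooled standard error and Welch-Satterthwaite degrees of freedom that Welch's t-test uses.

```python
import numpy as np
from scipy import stats


def welch_confidence_interval(sample_a, sample_b, confidence=0.95):
    """Hypothetical helper: CI for mean(a) - mean(b) without pooling variances."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Critical value for a two-sided interval at the requested confidence
    margin = stats.t.ppf(0.5 + confidence / 2, df) * se
    diff = a.mean() - b.mean()
    return diff - margin, diff + margin


# Made-up samples for illustration only
la_like = [14.0, 9.0, 17.0, 12.0, 20.0, 11.0]
rest_like = [8.0, 10.0, 7.0, 9.0, 11.0, 6.0, 9.0, 8.0]

low, high = welch_confidence_interval(la_like, rest_like)
pvalue = stats.ttest_ind(a=la_like, b=rest_like, equal_var=False).pvalue
print(low, high, pvalue)
```

A 95% interval that excludes 0 corresponds to a two-sided p-value below 0.05, so the two summaries always agree; the interval additionally shows how large or small the difference could plausibly be.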

**Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?**

In [29]:
# Create dataframes for each sample being compared in your test
aqi_ny = aqi[aqi['state_name'] == 'New York']
aqi_ohio = aqi[aqi['state_name'] == 'Ohio']

**Formulate your hypothesis:**
Formulate your null and alternative hypotheses:

Null hypothesis: The mean AQI of New York is greater than or equal to that of Ohio.
Alternative hypothesis: The mean AQI of New York is below that of Ohio.

**Significance Level (remains at 5%)**

**Determine the appropriate test procedure:**
Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a one-sided two-sample t-test.

In [32]:
# Compute your p-value here
# The alternative hypothesis is mean(a) < mean(b), so use alternative='less'.
# alternative='less' is an argument you can pass to stats.ttest_ind (and other SciPy tests)
# to run a one-sided t-test instead of the default two-sided test.

tstat,pvalue = stats.ttest_ind(a=aqi_ny['aqi'], b =aqi_ohio['aqi'], alternative='less', equal_var = False)
print(tstat)
print(pvalue)

-2.025951038880333
0.030446502691934704


**Question 3. What is your p-value for hypothesis 2, and what does this indicate for your null hypothesis?**
The p-value is approximately 0.030, which is less than the significance level of 0.05, and the t-statistic is negative (-2.03), so you **reject the null hypothesis in favor of the alternative hypothesis.**
Therefore, you can conclude at the 5% significance level that New York has a lower mean AQI than Ohio.
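As a rough illustration of how the `alternative` argument changes the p-value, the sketch below runs the same Welch's t-test three ways on made-up data. The lists `ny_like` and `ohio_like` are hypothetical values, not the lab's samples.

```python
from scipy import stats

# Made-up samples for illustration only (first group has the lower mean)
ny_like = [5.0, 7.0, 6.0, 4.0, 8.0, 5.0, 6.0]
ohio_like = [9.0, 7.0, 10.0, 8.0, 11.0, 9.0]

two_sided = stats.ttest_ind(a=ny_like, b=ohio_like, equal_var=False)
less = stats.ttest_ind(a=ny_like, b=ohio_like, equal_var=False, alternative="less")
greater = stats.ttest_ind(a=ny_like, b=ohio_like, equal_var=False, alternative="greater")

# The t statistic is the same in all three cases; only the p-value changes
print(two_sided.statistic, less.statistic, greater.statistic)
print(two_sided.pvalue, less.pvalue, greater.pvalue)
```

When the t-statistic is negative, the `'less'` p-value is half the two-sided p-value, and the `'greater'` p-value is its complement. This is why the order of `a` and `b` and the direction of `alternative` have to match the alternative hypothesis you wrote down.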

**Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?**

In [33]:
aqi_michigan = aqi[aqi['state_name'] == 'Michigan']

**Formulate your null and alternative hypotheses here:**

Null hypothesis: The mean AQI of Michigan is less than or equal to 10.
Alternative hypothesis: The mean AQI of Michigan is greater than 10.

**Significance Level (remains at 5%)**

**Determine the appropriate test procedure:**
Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a one-sample t-test.

In [34]:
# Compute your p-value here
tstat, pvalue = stats.ttest_1samp(aqi_michigan['aqi'], 10, alternative = 'greater')
print(tstat)
print(pvalue)

-1.7395913343286131
0.9399405193140109


**Question 4. What is your p-value for hypothesis 3, and what does this indicate for your null hypothesis?**
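Because the p-value (0.940) is greater than 0.05 (your significance level is 5%), you fail to reject the null hypothesis. The negative t-statistic (-1.74) indicates that the sample mean is below 10, which is why the one-sided p-value for a "greater than" alternative is so large.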

Therefore, you cannot conclude at the 5% significance level that Michigan's mean AQI is greater than 10. This implies that Michigan would most likely not be affected by the new policy.
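To show where a p-value above 0.5 comes from in a one-sided, one-sample test, here is a minimal sketch that computes the t-statistic by hand and compares it with `scipy.stats.ttest_1samp`. The array `michigan_like` contains made-up values with a mean below the threshold; it is not the Michigan data.

```python
import numpy as np
from scipy import stats

# Made-up sample for illustration only (mean is 8.5, below the threshold)
michigan_like = np.array([8.0, 9.0, 11.0, 7.0, 10.0, 8.0, 9.0, 6.0])
threshold = 10

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
standard_error = michigan_like.std(ddof=1) / np.sqrt(michigan_like.size)
t_manual = (michigan_like.mean() - threshold) / standard_error

# For alternative='greater', the p-value is the upper-tail probability
p_manual = stats.t.sf(t_manual, df=michigan_like.size - 1)

result = stats.ttest_1samp(michigan_like, threshold, alternative="greater")
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Whenever the sample mean falls below the hypothesized value, the t-statistic is negative and the upper-tail p-value exceeds 0.5, so a "greater than" alternative cannot be supported.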

**Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**
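Yes, at the 5% significance level the results indicated that the mean AQI in Los Angeles County was statistically different from the rest of California, although the p-value (about 0.0498) was only just below the threshold.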

**Question 6. Did New York or Ohio have a lower AQI?**
Using a 5% significance level, you can conclude that New York has a lower AQI than Ohio based on the results.

**Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**
Based on the tests, you would fail to reject the null hypothesis, meaning you can't conclude that the mean AQI is greater than 10. Thus, it is unlikely that Michigan would be affected by the new policy.

**Conclusion**
**What are key takeaways from this lab?**

Even with small sample sizes, the variation within the data is enough to allow you to make statistically significant conclusions. You identified at the 5% significance level that the Los Angeles mean AQI was statistically different from the rest of California, and that New York does have a lower mean AQI than Ohio. However, you were unable to conclude at the 5% significance level that Michigan's mean AQI was greater than 10.

**What would you consider presenting to your manager as part of your findings?**

For each test, you would present the null and alternative hypotheses, then describe your conclusion and the resulting p-value that drove that conclusion. As the setup of t-tests has a few key configurations that dictate how you interpret the result, you would specify the type of test you chose, whether the test was one-tailed or two-tailed, and how you performed the t-test with `scipy.stats`.

**What would you convey to external stakeholders?**

In answer to the research questions posed, you would convey the level of significance (5%) and your conclusion. Additionally, providing the sample statistics being compared in each case will likely provide important context for stakeholders to quickly understand the difference between your results.
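One possible way to produce the sample statistics mentioned above is a grouped summary in pandas. This is a sketch on a tiny made-up DataFrame, `aqi_demo`, that only mimics the `state_name` and `aqi` columns of the lab data.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the lab's dataframe
aqi_demo = pd.DataFrame(
    {
        "state_name": ["New York", "New York", "New York", "Ohio", "Ohio", "Ohio"],
        "aqi": [2, 4, 3, 5, 9, 7],
    }
)

# Sample size, mean, and standard deviation of AQI for each state
summary = aqi_demo.groupby("state_name")["aqi"].agg(["count", "mean", "std"])
print(summary)
```

Reporting the sample size alongside each mean helps stakeholders judge how much data stands behind each comparison.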

# Introduction
Throughout the following programming activity, you will learn to use Python to conduct a two-sample hypothesis test. Before beginning the activity, watch the associated instructional video and complete the in-video question. All of the code you will be implementing and related instructions are contained in this notebook.

In [4]:
education_districtwise = pd.read_csv('education_districtwise.csv')
education_districtwise = education_districtwise.dropna()
education_districtwise.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0


In [8]:
# Get the sample data for states 21 and 28
state21 = education_districtwise[education_districtwise['STATNAME'] == 'STATE21']
state28 = education_districtwise[education_districtwise['STATNAME'] == 'STATE28']

sampled_state21 = state21.sample(n = 20, replace = True, random_state = 13490)
sampled_state28 = state28.sample(n = 20, replace = True, random_state = 39103)
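The sampling calls above fix `random_state`, which makes the draw reproducible. The sketch below illustrates that behavior on a small made-up DataFrame, `districts`; the column names mirror the education data, but the values are invented.

```python
import pandas as pd

# Made-up districts for illustration only
districts = pd.DataFrame(
    {
        "DISTNAME": [f"DISTRICT{i}" for i in range(10)],
        "OVERALL_LI": [60.0 + i for i in range(10)],
    }
)

# Same seed, same rows: the two samples are identical
first = districts.sample(n=5, replace=True, random_state=13490)
second = districts.sample(n=5, replace=True, random_state=13490)
print(first.equals(second))

# With replace=True the same district can be drawn more than once
print(first["DISTNAME"].duplicated().any())
```

Sampling with replacement means the 20 rows drawn per state in this activity are not guaranteed to be 20 distinct districts.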

In [9]:
sampled_state21.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
169,DISTRICT373,STATE21,7,713,49,1218002.0,64.95
136,DISTRICT191,STATE21,10,1141,69,4773138.0,58.67
193,DISTRICT398,STATE21,23,2618,283,4616509.0,72.69
201,DISTRICT504,STATE21,13,1469,106,2494533.0,70.38
180,DISTRICT19,STATE21,10,1374,113,2398709.0,74.37


In [10]:
# Get the mean
sampled_state21['OVERALL_LI'].mean()

np.float64(70.82900000000001)

In [11]:
# Get the mean
sampled_state28['OVERALL_LI'].mean()

np.float64(64.60100000000001)

In [12]:
sampled_state21['OVERALL_LI'].mean() - sampled_state28['OVERALL_LI'].mean()

np.float64(6.227999999999994)

**Hypothesis**
Null: No difference in mean literacy rate between state 21 and state 28
Alternative: There is a difference in mean literacy rate between state 21 and state 28

Significance level - 5%

`stats.ttest_ind` is a SciPy function used to run an independent two-sample t-test, a statistical test that compares the means of two independent groups to see if they are significantly different.

stats.ttest_ind(group1, group2)
group1 = list/array of sample values from group A

group2 = list/array of sample values from group B
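As a small usage sketch, the call below passes two plain Python lists and compares the default pooled-variance test with Welch's version (`equal_var=False`). The lists `group1` and `group2` hold made-up literacy-like values with different sizes and spreads; they are not drawn from the dataset.

```python
from scipy import stats

# Made-up samples for illustration only: different sizes and very different spreads
group1 = [70, 72, 68, 75, 71]
group2 = [60, 80, 55, 90, 65, 50, 85, 58, 62, 77]

# Default: Student's t-test, which pools the variances (equal_var=True)
student = stats.ttest_ind(group1, group2)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group1, group2, equal_var=False)

# The result exposes named attributes and also unpacks as (statistic, pvalue)
tstat, pvalue = welch
print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
```

With unequal group sizes and variances the two versions give different statistics and p-values, which is why the choice of `equal_var` should be stated when reporting results.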

In [13]:
stats.ttest_ind(a=sampled_state21['OVERALL_LI'], b= sampled_state28['OVERALL_LI'], equal_var = False)

TtestResult(statistic=np.float64(2.8980444277268735), pvalue=np.float64(0.006421719142765242), df=np.float64(35.20796133045557))

p-value: 0.0064 (0.64%)
p-value < significance level (0.05)
Conclusion: Reject the null hypothesis