# Activity: Explore hypothesis testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

For your analysis, you'll default to a 5% level of significance.

## Step 1: Imports

#### Import Packages

In [15]:
# Import relevant packages
import numpy as np
import pandas as pd
from scipy import stats

You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [16]:
# RUN THIS CELL TO IMPORT YOUR DATA.
aqi = pd.read_csv('c4_epa_air_quality.csv',index_col=[0])

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [33]:
# Explore your dataframe `aqi` here:
print("aqi dataframe")
display(aqi.head(5))

print("aqi information")
display(aqi.info())

print("aqi state information")
display(aqi['state_name'].value_counts())


aqi['date_local'] = pd.to_datetime(aqi['date_local'])
print("aqi statistics")
display(aqi.describe(include='all'))

aqi dataframe


Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3


aqi information
<class 'pandas.core.frame.DataFrame'>
Int64Index: 260 entries, 0 to 259
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date_local        260 non-null    datetime64[ns]
 1   state_name        260 non-null    object        
 2   county_name       260 non-null    object        
 3   city_name         260 non-null    object        
 4   local_site_name   257 non-null    object        
 5   parameter_name    260 non-null    object        
 6   units_of_measure  260 non-null    object        
 7   arithmetic_mean   260 non-null    float64       
 8   aqi               260 non-null    int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 20.3+ KB


None

aqi state information


California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania            10
Michigan                 9
Colorado                 9
Minnesota                7
New Jersey               6
Indiana                  5
North Carolina           4
Massachusetts            4
Maryland                 4
Oklahoma                 4
Virginia                 4
Nevada                   4
Connecticut              4
Kentucky                 3
Missouri                 3
Wyoming                  3
Iowa                     3
Hawaii                   3
Utah                     3
Vermont                  3
Illinois                 3
New Hampshire            2
District Of Columbia     2
New Mexico               2
Montana                  2
Oregon                   2
Alaska                   2
Georgia                  2
Washington               2
Idaho                    2
Nebraska                 2
R

aqi statistics


Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
count,260,260,260,260,257,260,260,260.0,260.0
unique,1,52,149,190,253,1,1,,
top,2018-01-01 00:00:00,California,Los Angeles,Not in a city,Kapolei,Carbon monoxide,Parts per million,,
freq,260,66,14,21,2,260,260,,
first,2018-01-01 00:00:00,,,,,,,,
last,2018-01-01 00:00:00,,,,,,,,
mean,,,,,,,,0.403169,6.757692
std,,,,,,,,0.317902,7.061707
min,,,,,,,,0.0,0.0
25%,,,,,,,,0.2,2.0


#### **Question 1: From the preceding data exploration, what do you recognize?**

- We will take a look at county-level data for the 1st hypothesis and state-level for the 2nd hypothesis.
- 'date_local' column
    - It was originally of object type, so it needs to be converted to datetime format.
    - It contains a single date, 2018-01-01.
- 'state_name' column
    - Although the dataset has 260 rows, `state_name` has only 52 unique values; California occurs most often, with 66 rows.
- 'local_site_name' column
    - Some sites occur more than once (for example, Kapolei appears twice), and it is worth checking why they are recorded more than once.
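
One way to follow up on the repeated site names is sketched below. The small `sites` frame is made-up illustrative data, not rows from `aqi`; only the fact that Kapolei appears twice mirrors the summary output above. Rows with a missing `local_site_name` are dropped first so that missing values are not treated as duplicates of each other.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `aqi`; the values are illustrative only.
sites = pd.DataFrame({
    "state_name": ["Hawaii", "Hawaii", "Ohio", "Iowa", "Texas"],
    "local_site_name": ["Kapolei", "Kapolei", "Shadyside", None, None],
    "aqi": [3, 2, 5, 3, 4],
})

# Drop rows without a site name, then keep every row whose name is repeated.
named = sites.dropna(subset=["local_site_name"])
repeated = named[named.duplicated(subset="local_site_name", keep=False)]
print(repeated)
```

Applied to the real dataframe, the same two lines would list every repeated site next to its state and AQI reading, which makes it easier to judge whether the repeats are separate monitors or duplicate records.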

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
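
Step 5 reduces to comparing the p-value with the significance level. The helper below is a minimal sketch of that decision rule; the function name `decide` and the example p-values are hypothetical and are not part of the lab.

```python
def decide(p_value: float, significance_level: float = 0.05) -> str:
    """Return the conclusion of a hypothesis test for a given p-value."""
    if p_value < significance_level:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Hypothetical p-values, used only to show both branches of the rule.
print(decide(0.03))
print(decide(0.40))
```

Note that "fail to reject" is not the same as accepting the null hypothesis; it only means the data do not provide enough evidence against it at the chosen level.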

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [29]:
# Create dataframes for each sample being compared in your test
aqi_ca = aqi[aqi['state_name'] == 'California']
aqi_ca_la = aqi_ca[aqi_ca['county_name'] == 'Los Angeles']
aqi_ca_others = aqi_ca[aqi_ca['county_name'] != 'Los Angeles']
print(f"California: {len(aqi_ca)} | Los Angeles: {len(aqi_ca_la)} | Others: {len(aqi_ca_others)}")

California: 66 | Los Angeles: 14 | Others: 52


#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [31]:
# For this analysis, the significance level is 5%
significance_level = 0.05

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample  𝑡-test**.
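
When `stats.ttest_ind` is called with `equal_var=False`, SciPy performs Welch's version of the two-sample t-test, which does not assume the two groups share a variance. As a rough sketch of what that statistic measures, the snippet below computes the Welch t statistic and its approximate degrees of freedom with the standard library only. The function name `welch_t` and the two small samples are hypothetical, chosen purely for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up samples, not AQI data.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The sign of the statistic follows the order of the arguments: a negative value means the first sample has the smaller mean.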

#### Compute the P-value

In [32]:
# Compute your p-value here
stats.ttest_ind(a=aqi_ca_la['aqi'], b = aqi_ca_others['aqi'], equal_var = False)

Ttest_indResult(statistic=2.1107010796372014, pvalue=0.049839056842410995)

#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

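Since the p-value (0.0498) is less than the selected significance level (0.05), we reject the null hypothesis. The difference in mean AQI between Los Angeles County and the rest of California is unlikely to be due to chance alone, so a metropolitan-focused strategy may make sense. Note that the p-value is only marginally below the threshold.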

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [34]:
# Create dataframes for each sample being compared in your test
aqi_newyork = aqi[aqi['state_name'] == 'New York']
aqi_ohio = aqi[aqi['state_name'] == 'Ohio']

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample  𝑡-test**.
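
Because the t distribution is symmetric, a one-sided p-value can be derived from the two-sided one: it is half the two-sided value when the t statistic points in the direction of the alternative, and one minus that half otherwise. The sketch below illustrates this relationship; the function name `one_sided_p` and the input numbers are hypothetical.

```python
def one_sided_p(two_sided_p: float, t_stat: float, alternative: str) -> float:
    """Convert a two-sided t-test p-value into a one-sided p-value."""
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    # Does the sign of the statistic agree with the stated alternative?
    if alternative == "less":
        points_toward_alternative = t_stat < 0
    else:
        points_toward_alternative = t_stat > 0
    half = two_sided_p / 2
    return half if points_toward_alternative else 1 - half


# Hypothetical inputs: a two-sided p-value of 0.10 with a negative t statistic.
print(one_sided_p(0.10, -1.7, "less"))
print(one_sided_p(0.10, -1.7, "greater"))
```

This is why the direction of the alternative hypothesis has to be fixed before looking at the data: choosing it afterwards would effectively halve the p-value.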

#### Compute the P-value

In [45]:
# Compute your p-value here
t_score, p_value = stats.ttest_ind(a=aqi_newyork['aqi'], b=aqi_ohio['aqi'], alternative='less')
print(f"T-score: {t_score:.3f} | P-value: {p_value:.3f}")

T-score: -1.892 | P-value: 0.037


#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

Since the p-value (0.037) is less than the significance level (0.05) and the t-score is negative, we reject the null hypothesis in favor of the alternative: the mean AQI of New York is below that of Ohio.

### Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [39]:
# Create dataframes for each sample being compared in your test
aqi_michigan = aqi[aqi['state_name'] == 'Michigan']

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample  𝑡-test**. 
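
The one-sample t statistic is the distance between the sample mean and the hypothesized value, measured in standard errors. The snippet below is a minimal standard-library sketch of that formula; the function name `one_sample_t` and the `readings` list are hypothetical and do not come from the Michigan data.

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return the t statistic for a one-sample t-test against `popmean`."""
    # Standard error of the mean: sample standard deviation over sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - popmean) / standard_error


# Made-up readings, not AQI data.
readings = [8, 9, 10, 7, 6]
t_stat = one_sample_t(readings, popmean=10)
print(f"t = {t_stat:.3f}")
```

A negative statistic means the sample mean lies below the hypothesized value, so a test whose alternative is "greater" will produce a large p-value in that situation.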

#### Compute the P-value

In [44]:
# Compute your p-value here
t_score, p_value = stats.ttest_1samp(a = aqi_michigan['aqi'], popmean = 10, alternative='greater')
print(f"T-score: {t_score:.3f} | P-value: {p_value:.3f}")

T-score: -1.740 | P-value: 0.940


#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

Since the p-value (0.940) is greater than the significance level (0.05), we fail to reject the null hypothesis. The negative t-score (-1.740) shows that the sample mean is below 10, so we cannot conclude that Michigan's mean AQI is greater than 10. This suggests that Michigan would most likely not be affected by the new policy.

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

Yes. The results showed that the mean AQI in Los Angeles County is significantly different from that of the rest of California.

#### **Question 6. Did New York or Ohio have a lower AQI?**

At the 5% significance level, we can conclude that New York has a lower mean AQI than Ohio.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



At the 5% significance level, we cannot conclude that Michigan's mean AQI is greater than 10, so Michigan would most likely not be affected by the new policy.