# Activity: Explore hypothesis testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, then use the results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
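
For reference, Welch's t-test is usually written as follows, where $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and size of group $i$. This is the standard textbook form rather than anything specific to this lab; the degrees of freedom $\nu$ come from the Welch-Satterthwaite approximation:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
$$

Unlike the pooled-variance test, the two group variances are never combined into a single estimate.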

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [1]:
# Import the relevant Python libraries and modules needed in this lab.

### YOUR CODE HERE ###

# Data handling
import numpy as np
import pandas as pd
from io import StringIO

# Hypothesis testing
from scipy import stats

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Access to the Google Cloud Storage bucket that holds the dataset
from google.cloud import storage

import warnings

sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None
warnings.filterwarnings('ignore')

In [2]:
# Connect to Google Cloud Storage using the default credentials
storage_client = storage.Client()

BUCKET_NAME = 'heidless-jupyter-bucket-0'
bucket = storage_client.get_bucket(BUCKET_NAME)

my_prefix = 'air-quality/'
my_file = 'c4_epa_air_quality.csv'
full_file = my_prefix + my_file

# Collect the names of the blobs under the prefix that match the target file
AllCSV = [blob.name for blob in bucket.list_blobs(prefix=my_prefix)
          if blob.name == full_file]
AllCSV


['air-quality/c4_epa_air_quality.csv']

In [3]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE ###
# Download each matching CSV from the bucket and load it into a dataframe
all_dataframes = []

for csv in AllCSV:
    blob = bucket.get_blob(csv)
    if blob is not None and blob.exists(storage_client):
        bt = blob.download_as_string()
        s = str(bt, 'ISO-8859-1')
        s = StringIO(s)
        df = pd.read_csv(s, encoding='ISO-8859-1', low_memory=False)

        all_dataframes.append(df)
        print(csv)

# Only one file matched, so the first dataframe is the AQI dataset

aqi = all_dataframes[0]
aqi.head()


air-quality/c4_epa_air_quality.csv


Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3



You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE ###
aqi = pd.read_csv('c4_epa_air_quality.csv')

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [33]:
# Descriptive statistics for the Michigan observations
aqi_mich = aqi[aqi['state_name'] == 'Michigan']
print(aqi_mich.shape)
print(aqi_mich.describe())

(9, 10)
       Unnamed: 0  arithmetic_mean        aqi
count    9.000000         9.000000   9.000000
mean   172.666667         0.465400   8.111111
std     64.625846         0.176400   3.257470
min     65.000000         0.200000   2.000000
25%    123.000000         0.378947   7.000000
50%    192.000000         0.415789   8.000000
75%    226.000000         0.516667  10.000000
max    248.000000         0.811111  13.000000


In [31]:
# Descriptive statistics for the New York observations
aqi_ny = aqi[aqi['state_name'] == 'New York']
print(aqi_ny.shape)
print(aqi_ny.describe())


(10, 10)
       Unnamed: 0  arithmetic_mean        aqi
count   10.000000        10.000000  10.000000
mean   165.800000         0.233684   2.500000
std     43.898367         0.041234   0.527046
min     90.000000         0.200000   2.000000
25%    134.750000         0.200000   2.000000
50%    177.500000         0.210527   2.500000
75%    192.250000         0.268421   3.000000
max    234.000000         0.300000   3.000000


In [32]:
# Descriptive statistics for the Ohio observations
aqi_ohio = aqi[aqi['state_name'] == 'Ohio']
print(aqi_ohio.shape)
print(aqi_ohio.describe())

(12, 10)
       Unnamed: 0  arithmetic_mean        aqi
count   12.000000        12.000000  12.000000
mean   128.666667         0.225379   3.333333
std     96.437765         0.086912   1.302678
min      1.000000         0.083333   2.000000
25%     43.750000         0.173143   2.750000
50%    134.500000         0.238158   3.000000
75%    219.000000         0.265790   3.500000
max    252.000000         0.394737   6.000000


In [29]:
# Descriptive statistics for all California observations (Los Angeles County included)
aqi_cali = aqi[aqi['state_name'] == 'California']
print(aqi_cali.shape)
print(aqi_cali.describe())


(66, 10)
       Unnamed: 0  arithmetic_mean        aqi
count   66.000000        66.000000  66.000000
mean   137.272727         0.684871  12.121212
std     69.667020         0.322950   7.301244
min     16.000000         0.100000   1.000000
25%     76.250000         0.420834   7.000000
50%    144.000000         0.641667  11.000000
75%    198.750000         0.971491  16.000000
max    250.000000         1.742105  40.000000


In [30]:
# Descriptive statistics for the Los Angeles County observations
aqi_la = aqi[aqi['county_name'] == 'Los Angeles']
print(aqi_la.shape)
print(aqi_la.describe())

(14, 10)
       Unnamed: 0  arithmetic_mean        aqi
count   14.000000        14.000000  14.000000
mean   133.285714         0.861487  16.285714
std     67.480400         0.364611   8.739201
min     33.000000         0.389474   6.000000
25%     84.250000         0.660088  10.250000
50%    125.500000         0.881579  16.500000
75%    175.750000         1.011842  18.750000
max    250.000000         1.742105  40.000000


In [34]:
# Explore your dataframe `aqi` here:

# Isolate every observation outside Los Angeles County.
# Note: this filter spans all states, not only the rest of California.
aqi_rest = aqi[aqi['county_name'] != 'Los Angeles']
print(aqi_rest.shape)
print(aqi_rest.describe())

(246, 10)
       Unnamed: 0  arithmetic_mean         aqi
count  246.000000       246.000000  246.000000
mean   129.284553         0.377086    6.215447
std     75.734124         0.295121    6.571299
min      0.000000         0.000000    0.000000
25%     64.250000         0.200000    2.000000
50%    129.500000         0.265790    3.000000
75%    195.750000         0.477631    8.000000
max    259.000000         1.921053   50.000000


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referring to the material on descriptive statistics.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider using `pandas` or `numpy` to explore the `aqi` dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

Any of the following functions may be useful:
- `pandas`: `describe()`, `value_counts()`, `shape`, `head()`
- `numpy`: `unique()`, `mean()`
    
</details>
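
As a rough illustration of the functions listed in the hints, the sketch below runs them on a tiny made-up dataframe. The dataframe `toy` and its values are hypothetical stand-ins for `aqi`, which has the same `state_name` and `aqi` columns.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `aqi` dataframe (values are made up).
toy = pd.DataFrame({
    "state_name": ["Ohio", "Ohio", "New York", "New York", "Michigan"],
    "county_name": ["Belmont", "Franklin", "Kings", "Erie", "Wayne"],
    "aqi": [5, 3, 2, 3, 8],
})

# `shape` is an attribute, so it takes no parentheses.
print(toy.shape)

# Number of observations per state.
print(toy["state_name"].value_counts())

# Per-state count, mean, and standard deviation of the AQI.
summary = toy.groupby("state_name")["aqi"].agg(["count", "mean", "std"])
print(summary)
```

Grouping by `state_name` gives the sample size of every state in one table, which helps when judging whether a research question is answerable with the available data.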

#### **Question 1: From the preceding data exploration, what do you recognize?**

[Write your response here. Double-click (or enter) to edit.]

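Mean AQI, Los Angeles County: 16.286 (14 observations)
Mean AQI, all observations outside Los Angeles County (nationwide): 6.215 (246 observations)

The sample mean for Los Angeles County is well above that of the other observations. Note that the second group is not limited to California, so it is not yet the comparison that Hypothesis 1 asks for.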
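
Hypothesis 1 compares Los Angeles County with the rest of California. As a rough sketch, that group's mean can be backed out from the two `describe()` outputs above, assuming every Los Angeles County row also belongs to California. The variable names are illustrative only.

```python
# Summary values copied from the describe() outputs above.
n_ca, mean_ca = 66, 12.121212   # all California rows
n_la, mean_la = 14, 16.285714   # Los Angeles County rows

# Assumption: every Los Angeles County row is also a California row.
n_rest = n_ca - n_la

# Remove the Los Angeles total from the California total, then average.
mean_rest = (n_ca * mean_ca - n_la * mean_la) / n_rest
print(n_rest, round(mean_rest, 2))
```

Under that assumption, the rest of California has 52 observations with a mean AQI of about 11.0, a much smaller gap from Los Angeles County than the nationwide figure suggests.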

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
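
The final step can be sketched as a small helper. The function name `decide` is hypothetical and not part of the lab; it only encodes the usual rule of rejecting the null hypothesis when the p-value falls below the significance level.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the conclusion of a hypothesis test at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Reject only when the p-value is strictly below the significance level.
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.03))
print(decide(0.94))
```

Keeping the rule in one place makes it harder to flip the comparison by accident when several tests share the same significance level.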

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [None]:
# Create dataframes for each sample being compared in your test

### YOUR CODE HERE ###

ca_la = aqi[aqi['county_name']=='Los Angeles']
ca_other = aqi[(aqi['state_name']=='California') & (aqi['county_name']!='Los Angeles')]

In [36]:
ca_la.head()

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
33,33,2018-01-01,California,Los Angeles,Lancaster,Lancaster-Division Street,Carbon monoxide,Parts per million,0.394737,7
42,42,2018-01-01,California,Los Angeles,Santa Clarita,Santa Clarita,Carbon monoxide,Parts per million,0.394737,7
61,61,2018-01-01,California,Los Angeles,Pasadena,Pasadena,Carbon monoxide,Parts per million,0.789474,16
76,76,2018-01-01,California,Los Angeles,Los Angeles,LAX Hastings,Carbon monoxide,Parts per million,0.863158,17
109,109,2018-01-01,California,Los Angeles,Los Angeles,Los Angeles-North Main Street,Carbon monoxide,Parts per million,0.994737,17


In [37]:
ca_other.head()

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
16,16,2018-01-01,California,San Bernardino,Ontario,Ontario Near Road (Etiwanda),Carbon monoxide,Parts per million,0.747368,11
18,18,2018-01-01,California,Sacramento,Arden-Arcade,Sacramento-Del Paso Manor,Carbon monoxide,Parts per million,0.752632,16
26,26,2018-01-01,California,Orange,La Habra,La Habra,Carbon monoxide,Parts per million,0.673684,13
27,27,2018-01-01,California,Alameda,Not in a city,Berkeley- Aquatic Park,Carbon monoxide,Parts per million,1.088889,15
34,34,2018-01-01,California,Fresno,Fresno,Fresno - Garland,Carbon monoxide,Parts per million,1.0,15


In [35]:
# Create dataframes for each sample being compared in your test

### YOUR CODE HERE ###

ca_la = aqi[aqi['county_name']=='Los Angeles']
ca_other = aqi[(aqi['state_name']=='California') & (aqi['county_name']!='Los Angeles')]

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating two dataframes, one for Los Angeles, and one for all other California observations.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

For your first dataframe, filter to `county_name` of `Los Angeles`. For your second dataframe, filter to `state_name` of `California` and `county_name` not equal to `Los Angeles`.
    
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [38]:
# For this analysis, the significance level is 5%

### YOUR CODE HERE
significance_level = 0.05
significance_level


0.05

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample t-test**.

#### Compute the P-value

In [39]:
# Compute your p-value here

### YOUR CODE HERE ###
stats.ttest_ind(a=ca_la['aqi'], b=ca_other['aqi'], equal_var=False)

TtestResult(statistic=2.1107010796372014, pvalue=0.049839056842410995, df=17.08246830361151)

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a two-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_ind()`, a is the aqi column from our "Los Angeles" dataframe, and b is the aqi column from the "Other California" dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  Be sure to set `equal_var=False`.

</details>

#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

p-value: 0.0498 (0.049839056842410995)
significance level: 0.05

The p-value is below the significance level, although only narrowly, so you reject the null hypothesis.

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [40]:
# Create dataframes for each sample being compared in your test

### YOUR CODE HERE ###
ny = aqi[aqi['state_name']=='New York']
ohio = aqi[aqi['state_name']=='Ohio']

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the materials on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating two dataframes, one for New York, and one for Ohio observations.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

For your first dataframe, filter to `state_name` of `New York`. For your second dataframe, filter to `state_name` of `Ohio`.
    
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample t-test** with a one-sided alternative.
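
To show what the direction changes, the sketch below runs Welch's t-test twice on made-up readings, once two-sided and once with `alternative="less"`. The lists `region_a` and `region_b` are hypothetical and are not taken from the lab data.

```python
from scipy import stats

# Made-up AQI readings for two hypothetical regions (not from the lab data).
region_a = [2, 3, 2, 3, 2, 3, 2, 3]
region_b = [3, 4, 2, 6, 3, 3, 4, 5]

# Welch's t-test with the default two-sided alternative.
two_sided = stats.ttest_ind(region_a, region_b, equal_var=False)

# Same test, but the alternative is that the mean of region_a is smaller.
one_sided = stats.ttest_ind(region_a, region_b, equal_var=False, alternative="less")

# The t-statistic is identical; only the p-value changes.
print(two_sided.statistic, two_sided.pvalue)
print(one_sided.statistic, one_sided.pvalue)
```

Because the t-statistic here is negative, the `alternative="less"` p-value is half of the two-sided one. Had the statistic been positive, the one-sided p-value would instead exceed 0.5.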

#### Compute the P-value

In [46]:
# Compute your p-value here

### YOUR CODE HERE ###
# One-sided Welch's t-test: is the mean AQI in New York lower than in Ohio?
tstat, pvalue = stats.ttest_ind(a=ny['aqi'], b=ohio['aqi'], alternative='less', equal_var=False)
print(tstat)
print(pvalue)

-2.025951038880333
0.03044650269193468
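
As a sanity check, the t-statistic can be approximately reproduced by hand from the New York and Ohio `describe()` outputs in Step 2. This is a sketch using the rounded summary values printed there, so it agrees with the `scipy` result only to a few decimal places.

```python
import math

# Summary statistics copied from the describe() outputs in Step 2.
mean_ny, sd_ny, n_ny = 2.500000, 0.527046, 10
mean_oh, sd_oh, n_oh = 3.333333, 1.302678, 12

# Per-group variance of the sample mean.
v_ny = sd_ny**2 / n_ny
v_oh = sd_oh**2 / n_oh

# Welch's t-statistic: difference in means over the unpooled standard error.
t_stat = (mean_ny - mean_oh) / math.sqrt(v_ny + v_oh)

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (v_ny + v_oh) ** 2 / (v_ny**2 / (n_ny - 1) + v_oh**2 / (n_oh - 1))

print(round(t_stat, 3), round(df, 1))
```

The hand-computed statistic is close to the `tstat` value printed above, and the degrees of freedom land near 15 rather than the 20 a pooled-variance test would use.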


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a two-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_ind()`, `a` is the aqi column from the "New York" dataframe, and `b` is the aqi column from the "Ohio" dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  You can assign `tstat`, `pvalue` to the output of `ttest_ind()`. Be sure to include `alternative='less'` as part of your code.

</details>

#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

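p-value: 0.0304 (0.03044650269193468)
significance level: 0.05

The p-value is below the significance level, so you reject $H_0$ in favor of the alternative that the mean AQI of New York is below that of Ohio.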

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [44]:
# Create dataframes for each sample being compared in your test

### YOUR CODE HERE ###
michigan = aqi[aqi['state_name']=='Michigan']

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating one dataframe which only includes Michigan.
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample t-test**.

#### Compute the P-value

In [45]:
# Compute your p-value here

### YOUR CODE HERE ###
tstat, pvalue = stats.ttest_1samp(michigan['aqi'], 10, alternative='greater')
print(tstat)
print(pvalue)

-1.7395913343286131
0.9399405193140109
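
The one-sample t-statistic is the gap between the sample mean and the threshold, measured in standard errors. As a sketch, it can be approximately reproduced from the rounded Michigan values in the Step 2 `describe()` output.

```python
import math

# Michigan summary statistics copied from the describe() output in Step 2.
mean_mi, sd_mi, n_mi = 8.111111, 3.257470, 9
mu0 = 10  # policy threshold used as the hypothesized mean

# Standard error of the sample mean.
se = sd_mi / math.sqrt(n_mi)

# One-sample t-statistic.
t_stat = (mean_mi - mu0) / se
print(round(t_stat, 4))
```

The negative sign shows that the sample mean sits below the threshold, which is why the `alternative='greater'` p-value is so large.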


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a one-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_1samp()`, you are comparing the aqi column from your Michigan data relative to 10, the new policy threshold.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  You can assign `tstat`, `pvalue` to the output of `ttest_1samp()`. Be sure to include `alternative='greater'` as part of your code.

</details>

#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

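p-value: 0.94 (0.9399405193140109)
significance level: 0.05

The p-value is above the significance level, so you fail to reject the null hypothesis.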

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

Yes. At the 5% significance level, the results indicate that the mean AQI in Los Angeles County differs from that of the rest of California, although the p-value (0.0498) is only just below the threshold.


#### **Question 6. Did New York or Ohio have a lower AQI?**

Using a 5% significance level, you can conclude that New York has a lower AQI than Ohio based on the results.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



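Based on the test, you fail to reject the null hypothesis, so you cannot conclude that Michigan's mean AQI is greater than 10. The sample mean of about 8.1 is below the threshold, so the data give no evidence that Michigan would be affected by the new policy. Keep in mind that failing to reject $H_0$ is not proof that the mean is below 10.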
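
The question asks whether Michigan can be ruled out, which corresponds more directly to the opposite one-sided alternative (mean AQI below 10). The sketch below is a supplementary check and not part of the lab's prescribed procedure. It relies on the fact that, for the same t-statistic, the `'less'` and `'greater'` p-values sum to 1.

```python
# p-value reported in Step 3 for alternative='greater'.
p_greater = 0.9399405193140109
alpha = 0.05

# For a continuous test statistic, the two one-sided p-values sum to 1.
p_less = 1 - p_greater

print(round(p_less, 4), p_less < alpha)
```

Under this framing the p-value is about 0.06, which is still above 0.05. With only nine Michigan observations, the data point toward a mean below 10 but do not establish it at the 5% level.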


# Conclusion

**What are key takeaways from this lab?**

**What would you consider presenting to your manager as part of your findings?**

**What would you convey to external stakeholders?**


**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.