In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ER-131] Homework 2: Pandas EPA Air Quality

---

## Table of Contents
[Introduction](#intro)<br>
1 - [Downloading the Data](#data)<br>
2 - [Preparing the Data](#prep)<br>
3 - [Exploring Data with Pandas](#explore)<br>
4 - [California Data](#cadata)<br>

# Introduction <a id='intro'></a>

In this homework, we will investigate air quality data retrieved from the EPA and use Pandas to analyze particulate matter (PM2.5) levels.

### Topics Covered

As we clean and explore these data, you will gain practice with:
* Manipulating tables and parts of the table (column, index)
* Identifying the type of data collected, missing values, anomalies, etc.
* Performing numeric operations (mean, variance)
* Merging and analyzing data sets

----

## Section 1: Downloading the Data<a id='data'></a>

Run the cell below to import some of the packages we will need for this assignment:

In [None]:
#Run this cell
from pathlib import Path
import sys
import math
import zipfile
%matplotlib inline
import matplotlib.pyplot as plt

**Question 1.1:** Import the Pandas and NumPy libraries `as` their commonly used abbreviations (i.e., `pd`, `np`).  

In [None]:
## YOUR CODE HERE

We'll be working with air quality data from the EPA. Have a look at the description of the data  [here](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files).

For this homework, we'll focus on PM2.5 Mass data from 2018. Although it's possible to download the dataset entirely from within the notebook environment, the full dataset is too large (over 4 million rows, 1.3 GB!) to load and process in DataHub given its memory constraints. Instead, we'll work with a preprocessed, reduced version of the dataset that omits readings from states we will not be working with.

For your future reference, the raw data can be downloaded from [this website](https://aqs.epa.gov/aqsweb/airdata/download_files.html). Alternatively, you can directly download a zipfile using a URL in the following form:

https://aqs.epa.gov/aqsweb/airdata/hourly_TYPE_YEAR.zip

...replacing "TYPE" and "YEAR" with the measurement and year, respectively, that you want to analyze.

| Measurement | TYPE |
| --- | --- |
| Ozone | 44201 |
| SO2 | 42401 |
| CO | 42101 |
| NO2 | 42602 |
| PM2.5 FRM/FEM Mass | 88101 |
| PM2.5 non FRM/FEM Mass | 88502 |
| PM10 Mass | 81102 |
| PM2.5 Speciation | SPEC |
| PM10 Speciation | PM10SPEC |
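
As a small illustration of the URL pattern above, the sketch below builds a download URL from a measurement code and a year. The helper name `hourly_zip_url` and the constant `BASE_URL` are made up for this example; the code only assembles the string and does not download anything.

```python
# Hypothetical helper: fills TYPE and YEAR into the URL pattern shown above.
BASE_URL = "https://aqs.epa.gov/aqsweb/airdata"


def hourly_zip_url(measurement_type, year):
    """Return the URL of one hourly zip file for the given TYPE code and year."""
    return f"{BASE_URL}/hourly_{measurement_type}_{year}.zip"


# 88101 is the PM2.5 FRM/FEM Mass code from the table above.
pm25_2018_url = hourly_zip_url(88101, 2018)
print(pm25_2018_url)
```

Keep in mind that the full hourly files are large, which is why this homework uses the reduced file instead.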

Let's start by using Python to unzip the folder and see how many files it contains:

In [None]:
air_quality_path = Path('data/reduced_PM25_2018.zip')
zf = zipfile.ZipFile(air_quality_path, 'r')
print([f.filename for f in zf.filelist])

We see that there is only one CSV file within the zip file. Next, we want to get a sense of how the data inside that CSV are structured.

**Question 1.2:** Assign the name of the CSV file inside the zip to `f_name`, then run the cell to print the first two lines of the file.

In [None]:
f_name = ...  # REPLACE ELLIPSIS WITH YOUR CODE
with zf.open(f_name) as f:
    for i in range(2):
        print(f.readline().rstrip().decode())

**Question 1.3:** In the code above, what does `i` represent in the `for` statement? (In other words, what is the `for` statement iterating over?)

*YOUR ANSWER HERE*

#### We can then organize this data and read it more easily by putting it in a table! We will go over this in the next section.

----

## Section 2: Preparing the Data<a id='prep'></a>

We can see that the file contains a fairly descriptive header, and the columns are explained in detail in the documentation linked in Section 1. Let's extract the data into a Pandas dataframe. We will keep using `zf` to read the file and extract the information.

In [None]:
with zf.open(f_name) as fh:
    PM25_2018 = pd.read_csv(fh, low_memory = False)

In [None]:
PM25_2018.head()

**Question 2.1:** Look through the dataframe. Name three data types that it holds. 

*YOUR ANSWER HERE*

**Question 2.2:** Identify at least one issue relating to questionable or missing data in the dataframe, and outline (in one sentence) how this data-related issue could impact an analyst's ability to draw conclusions from the data. 

*YOUR ANSWER HERE*

**Question 2.3:** Answer the questions below. (*Hint*: the `.shape` attribute might be helpful for answering the first two questions.)

1. How many records are present?
2. How many fields are reported?
3. What does each row represent?
4. After reading up on the data formats [here](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files), what does MDL stand for and what is it?
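
For reference, here is a minimal sketch of how `.shape` behaves, using a made-up three-row table rather than the homework data. The name `toy` and its values are invented for illustration.

```python
import pandas as pd

# A tiny invented table; the real dataframe is far larger.
toy = pd.DataFrame({
    "State Name": ["California", "California", "Oregon"],
    "Sample Measurement": [12.0, 8.5, 3.1],
})

# .shape is an attribute (no parentheses) holding (number of rows, number of columns).
n_records, n_fields = toy.shape
print(n_records, n_fields)
```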

In [None]:
# use this cell for scratch work
...

*YOUR ANSWER HERE*

**Question 2.4:** What percentage of the records in the `PM25_2018` dataframe have a sample measurement smaller than their MDL value? Are you more or less confident in those values than you are in the sample measurement values in the rest of the dataset? Why?

In [None]:
# use this cell for scratch work
...

*YOUR ANSWER HERE*

**Question 2.5:** How many unique states are represented in `PM25_2018`? Which ones are they?

In [None]:
# scratch work here
...

*YOUR ANSWER HERE*

**Question 2.6:** We can see that there are a lot of columns that are unnecessary for this data analysis. Let's make a new dataframe with the information we need. Use `pd.DataFrame` to create a new table with 7 columns, named and ordered as follows:
1. `State`: The name of the state where the monitoring site is located.
1. `County`: The name of the county where the monitoring site is located.
1. `Date`: The column of dates corresponding to the `Date Local` column.
1. `Time`: The time of day that sampling began on a 24-hour clock, corresponding to the `Time Local` column.
1. `Measurement`: The measured value in the standard units of measure for the parameter corresponding to the `Sample Measurement` column.
1. `Units`: The unit of measure for the parameter corresponding to the `Units of Measure` column.
1. `MDL`: The method detection limit for the measurement.
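
One way to build a smaller table with new column names is to pass `pd.DataFrame` a dictionary that maps each new name to an existing column. The sketch below does this on an invented table with only a few columns; `raw` and `small` are hypothetical names, and this is just one of several possible approaches.

```python
import pandas as pd

# Invented stand-in for a wide table with more columns than we need.
raw = pd.DataFrame({
    "State Name": ["California", "Oregon"],
    "Date Local": ["2018-01-01", "2018-01-01"],
    "Sample Measurement": [12.0, 3.1],
    "Site Num": [1, 2],
})

# Keys become the new column names, in the order written here.
small = pd.DataFrame({
    "State": raw["State Name"],
    "Date": raw["Date Local"],
    "Measurement": raw["Sample Measurement"],
})
print(small.columns.tolist())
```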

In [None]:
# YOUR CODE HERE

state_table = ...

In [None]:
# run this cell
state_table.iloc[80:90,:]

**Question 2.7a:** Within the `state_table` dataframe, find any rows where `Measurement` is lower than `MDL`. Replace the value in `Measurement` in those rows with `np.nan` (the `.loc` indexer is helpful here!).  

*Hint / Warning*: You may get a "SettingWithCopyWarning".  It's ok to ignore.  
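
As a generic illustration of conditional assignment with `.loc`, the sketch below uses an invented table with made-up column names (`reading`, `limit`). The pattern is `df.loc[row_mask, column] = value`.

```python
import numpy as np
import pandas as pd

# Invented data: the second reading falls below its limit.
toy = pd.DataFrame({"reading": [5.0, 1.0, 7.5], "limit": [2.0, 2.0, 2.0]})

# Boolean mask that is True for rows meeting the condition.
below_limit = toy["reading"] < toy["limit"]

# Assign NaN only to the masked rows of one column.
toy.loc[below_limit, "reading"] = np.nan
print(toy)
```

The warning mentioned above typically appears when the table being modified was itself created as a slice of another dataframe; creating it with `.copy()` is one common way to avoid it.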

In [None]:
# YOUR CODE HERE


In [None]:
# run this cell
state_table.iloc[80:90,:]

**Question 2.7b:** We could have used a different strategy to 'clean' the measurements with values below the MDL. Why might it be a good idea to replace these measurements with "NaN" rather than setting them to a numeric value, e.g., 0 or to the MDL? Similarly, why might we want to use "NaN" instead of removing these rows from the table entirely?

*YOUR ANSWER HERE*

<br>

----

## Section 3: Exploring Data with Pandas<a id='explore'></a>

The air quality in Los Angeles, CA is notoriously poor. In this section we will analyze the EPA data to examine how air pollution varies over the course of the day.

**Question 3.1:** Using the table from Question 2.6, create a new table containing just data from Los Angeles County in California.

In [None]:
# YOUR CODE HERE
lacounty = ... 

In [None]:
lacounty.head(10)

In [None]:
assert len(lacounty) == 15313

**Question 3.2:** Below, output all the measurements in `lacounty` taken at midnight and all the measurements taken at noon. What do you notice?  You might try using the `.describe` method to explore your midnight and noon outputs separately.

In [None]:
# measurements at midnight

# YOUR CODE HERE
la_midnight = ...

In [None]:
# Measurements at noon

# YOUR CODE HERE
la_noon = ...

*YOUR OBSERVATIONS HERE*

**Question 3.3a:** We can also visualize this data and see how the PM2.5 concentrations fluctuate throughout the day. Start by using the `groupby` method to find the mean PM2.5 measurement for each hour of the day. Your output should be a dataframe in which the indices are the 24 hours of the day (00:00 - 23:00) and the columns are `Measurement` and `MDL`.
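
For a feel of how `groupby` aggregation works, here is a minimal sketch on an invented four-row table with only two distinct times. The names `toy` and `hourly_mean` are hypothetical, and the values are made up.

```python
import pandas as pd

# Invented data: two readings at each of two times of day.
toy = pd.DataFrame({
    "Time": ["00:00", "01:00", "00:00", "01:00"],
    "Measurement": [10.0, 20.0, 14.0, 22.0],
    "MDL": [2.0, 2.0, 2.0, 2.0],
})

# Group rows by time, keep only the numeric columns, and average each group.
hourly_mean = toy.groupby("Time")[["Measurement", "MDL"]].mean()
print(hourly_mean)
```

The grouping column becomes the index of the result. Selecting the numeric columns explicitly is a cautious choice here: depending on your pandas version, averaging text columns may raise an error or silently drop them.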

In [None]:
# YOUR CODE HERE
la_hr = ...

In [None]:
assert la_hr.shape == (24, 2)

**Question 3.3b:** Now, run the code in the cell below to plot the mean PM2.5 for each hour. The x-axis should be the hour of the day (00:00 - 23:00). Label the y-axis. Are there any noticeable trends in this plot?

In [None]:
plt.plot(la_hr.index, la_hr['Measurement'])
plt.xticks(rotation=270)
plt.title('Average hourly PM2.5 concentration in Los Angeles County, 2018')
plt.xlabel('Hour')
# ADD A Y-AXIS LABEL
plt.show()

*YOUR OBSERVATIONS HERE*

**Question 3.4a:** Use the `groupby` method on the `lacounty` dataframe once again, but this time, use the standard deviation (`std()`) aggregation function. 

In [None]:
# YOUR CODE HERE
la_hr_stdev = ...

**Question 3.4b:** Run the code below to plot the standard deviation on the same graph as the mean data. Label the y-axis and title the plot. Record your observations.

In [None]:
plt.plot(la_hr.index, la_hr['Measurement'], label = 'Mean')
plt.plot(la_hr_stdev.index, la_hr_stdev['Measurement'], 'k:', label = 'Standard Deviation')
plt.xticks(rotation=270)
plt.xlabel('Hour')
plt.legend()
# LABEL THE Y-AXIS
# ADD A TITLE
plt.show()

*YOUR OBSERVATIONS HERE*

**Question 3.5:** Do the data support the hypothesis that PM2.5 concentrations follow a diurnal pattern? Why or why not? What are some of the limitations of either our data or the methods we've used to explore it so far in allowing us to observe hourly trends?

*YOUR ANSWER HERE*

**Question 3.6:** In Susan Athey's essay "Beyond Prediction", Athey defines the distinction between prediction problems and causal inference problems. Thinking about this air quality dataset, can you come up with one question that poses a prediction problem (also referred to as a resource allocation problem in the essay) and another that poses a causal inference problem? The two questions you come up with should be air quality related, but you don't have to limit yourself to this dataset (e.g., it's totally fair to come up with a question that would also incorporate census or demographic data).

*YOUR ANSWER HERE*

----

## Section 4: California Data<a id='cadata'></a>

Let's explore the dynamics of wildfire and air quality. In this section, we will use data analysis to see how a major wildfire impacted PM2.5 concentrations in Alameda County. 

<br>**Question 4.1:** Create a dataframe called `PM25_2018_CA` that is a subset of `state_table` and just has PM2.5 2018 data for California.

In [None]:
# YOUR CODE HERE
PM25_2018_CA = ...

In [None]:
PM25_2018_CA.head()

<br>**Question 4.2:** Use `groupby` to find the maximum PM2.5 concentration in each county in 2018. 

In [None]:
# YOUR CODE HERE


----
The Camp Fire, which started on November 8, 2018, was described as the [‘deadliest, most destructive wildfire in California history’](https://www.washingtonpost.com/nation/2018/11/25/camp-fire-deadliest-wildfire-californias-history-has-been-contained/?noredirect=on).

UC Berkeley students could smell and see the effects of the fires in Butte County; classes were cancelled on November 15 due to poor air quality.

**Question 4.3:** Using `PM25_2018_CA`, create a table containing just information from Alameda County on November 15, 2018.

In [None]:
# YOUR CODE HERE
ac_nov15 = ...

In [None]:
ac_nov15.head()

**Question 4.4:** Using `PM25_2018_CA`, create a table containing just information from Alameda County on November 7, 2018.

In [None]:
# YOUR CODE HERE
ac_nov7 = ...

In [None]:
ac_nov7.head()

**Question 4.5:** Merge `ac_nov15` and `ac_nov7` on `Time` to compare their PM2.5 concentrations side by side.

*Note:* If two dataframes share column names other than the column being merged on, pandas appends '_x' to those names from the first dataframe and '_y' to those names from the second dataframe. The rename operation below is meant to clarify which day each column comes from. Be sure that it's renaming correctly!
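
The suffix behavior can be seen on a small invented example. The tables `day_a` and `day_b` below are hypothetical and hold made-up values.

```python
import pandas as pd

# Two invented tables that share the column names "Time" and "Measurement".
day_a = pd.DataFrame({"Time": ["00:00", "01:00"], "Measurement": [50.0, 55.0]})
day_b = pd.DataFrame({"Time": ["00:00", "01:00"], "Measurement": [8.0, 9.0]})

# "Time" is the merge key, so it appears once; "Measurement" overlaps and gets suffixes.
merged = day_a.merge(day_b, on="Time")
print(merged.columns.tolist())  # ['Time', 'Measurement_x', 'Measurement_y']
```

The order of the two dataframes in the merge determines which one receives '_x' and which receives '_y'.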

In [None]:
ac_merge = ... # YOUR CODE HERE
ac_merge.rename(columns={'Measurement_x':'Nov15 PM2.5', 'Measurement_y':'Nov7 PM2.5'}, inplace = True)
ac_merge.head()

**Question 4.6:** Calculate the mean PM2.5 measurement on both days. Using [EPA's air quality index](https://www.airnow.gov/aqi/aqi-basics/), comment on the relative level of health concern for each day.

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

----

## Submission

Congrats, you're done with homework 2!

Before you submit, click **Kernel** -> **Restart & Clear Output**. Then, click **Cell** -> **Run All**. Finally, go to the toolbar, click **File** -> **Download as** -> **.html**, and submit the file through bCourses.

----

## Bibliography


- Washington Post - Camp Fire. https://www.washingtonpost.com/nation/2018/11/25/camp-fire-deadliest-wildfire-californias-history-has-been-contained/

---
Adapted from a notebook developed by: Melissa Ly

Data Science Modules: http://data.berkeley.edu/education/modules