In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ERG-131] Homework 2: Pandas EPA Air Quality

---

## Table of Contents
[Introduction](#intro)<br>
1 - [Downloading the Data](#data)<br>
2 - [Preparing the Data](#prep)<br>
3 - [Exploring Data with Pandas](#explore)<br>
4 - [California Data](#cadata)<br>

# Introduction <a id='intro'></a>

In this homework, we will investigate air quality data retrieved from the EPA. The main goal of this assignment is to understand how PM2.5 FRM/FEM Mass affects air quality. We will accomplish this by analyzing EPA data with pandas (a powerful Python data analysis toolkit). To give you a sense of how we think about each discovery and what next steps it leads to, we will provide comments and insights along the way.

### Topics Covered

As we clean and explore these data, you will gain practice with:
* Manipulating tables and parts of the table (column, index)
* Identifying the type of data collected, missing values, anomalies, etc.
* Computing numeric operations (mean, variance)
* Merging and analyzing data sets

----

## Section 1: Downloading the Data<a id='data'></a>

In [None]:
#Run this cell
from pathlib import Path
import sys
import math
import zipfile
%matplotlib inline
import matplotlib.pyplot as plt

To start the assignment, run the cell above to set up some imports that we will need for this assignment.

In many of these assignments (and future adventures as a data scientist) we will use os, zipfile, pandas, numpy, matplotlib.pyplot, and seaborn.  

**Question 1.1:** Import each of these libraries `as` their commonly used abbreviations (e.g., `pd`, `np`).  

In [None]:
# YOUR CODE HERE

For this homework, we'll be working with air quality data from the EPA; we want to read the description of the data and download the data from the website.

A description of the data is [here](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files).

We can then download the data. [Here is the site](https://aqs.epa.gov/aqsweb/airdata/download_files.html).

To download the data, use a link like this:

https://aqs.epa.gov/aqsweb/airdata/hourly_TYPE_YEAR.zip

...where we can fill in "TYPE" with the measurement we want and "YEAR" with the year.

**Measurement | (TYPE)**  
Ozone | (44201)  
SO2 | (42401)  
CO | (42101)  
NO2 | (42602)  
PM2.5 FRM/FEM Mass | (88101)  
PM2.5 non FRM/FEM Mass | (88502)  
PM10 Mass | (81102)  
PM2.5 Speciation | (SPEC)  
PM10 Speciation | (PM10SPEC)
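
As a minimal sketch of how the link template and the codes in the table fit together, the cell below fills in "TYPE" and "YEAR" with an f-string. The names `BASE_URL` and `hourly_url` are hypothetical helpers introduced only for this illustration; they are not part of the assignment.

```python
# Hypothetical helper (not part of the assignment): fill in TYPE and YEAR.
BASE_URL = "https://aqs.epa.gov/aqsweb/airdata"


def hourly_url(measurement_type, year):
    """Return the hourly-data download link for one measurement type and year."""
    return f"{BASE_URL}/hourly_{measurement_type}_{year}.zip"


# PM2.5 FRM/FEM Mass has the code 88101 in the table above.
example_url = hourly_url("88101", 2018)
print(example_url)
```

This only builds the link as a string; it does not download anything.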


We'll focus on PM2.5 Mass (88101) from 2018 in the problem set. Although it's possible to download the dataset exclusively through the notebook environment, the dataset is too large (over 4 million rows, 1.3GB in size!) to load and process in datahub given the memory constraint. Because of this, we'll be using a reduced version of this dataset which removes readings from certain states that we will not be working with.

<br>
Let's start by using Python to unzip the file and see how this data is laid out:

In [None]:
air_quality_path = Path('data/reduced_PM25_2018.zip')
zf = zipfile.ZipFile(air_quality_path, 'r')
print([f.filename for f in zf.filelist])

We see that there is only one CSV file within the zip file. From here, we want to then get a sense of the structure of the data within the CSV.
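
To see the same pattern on made-up data, here is a small sketch that builds a tiny zip archive in memory, lists its contents, and reads the first line of its only member. The file name `toy.csv` and all variable names are assumptions for illustration, chosen so they do not overwrite `zf`.

```python
import io
import zipfile

# Build a tiny zip archive in memory (toy data, not the EPA file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as toy_out:
    toy_out.writestr("toy.csv", '"a","b"\n1,2\n')

# Re-open the archive for reading and peek at its single member.
with zipfile.ZipFile(buffer, "r") as toy_zf:
    names = toy_zf.namelist()          # list of member file names
    with toy_zf.open(names[0]) as fh:  # members are opened in binary mode
        first_line = fh.readline().rstrip().decode()

print(names)
print(first_line)
```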

**Question 1.2:** Load the CSV file in the zip.

In [None]:
f_name = ... # YOUR CODE HERE
with zf.open(f_name) as f:
    for i in range(2):
        print(f.readline().rstrip().decode())

**Question 1.3:** Answer the following questions with `True` or `False`.

In [None]:
# Are all the files CSV files?
all_files_appear_to_be_csv = ... # YOUR ANSWER HERE

# Do all the files have a header line?
all_files_contain_headers = ... # YOUR ANSWER HERE

# Do all the strings in the file have quotes around them?
strings_appear_quoted = ... # YOUR ANSWER HERE

#### We can then organize this data and read it better by putting it in a table! We will go over this in the next section.

----

## Section 2: Preparing the Data<a id='prep'></a>

We can see that the file contains a fairly descriptive header; the fields are explained in detail in the file format documentation linked in Section 1. Let's extract it. We are going to pretend there are multiple files in the zip file, and keep using `zf` to read the file and extract the information.  

In [None]:
with zf.open(f_name) as fh:
    PM25_2018 = pd.read_csv(fh, low_memory = False)

In [None]:
PM25_2018.head()

**Question 2.1:** Look through the table and see what data types are within the table. For this question, identify at least one issue relating to bad or missing data in the dataset, and outline (in one sentence) how this data-related issue could impact an analyst's ability to draw conclusions from the data.

Answers can vary, but one answer is: the air quality uncertainty field ("Uncertainty") contains a lot of NaNs, which would make it difficult to compare air quality at different times or in different locations with any certainty.

**Question 2.2:** Find the dimensions of the table to figure out how much data we are working with.<br>
*Hint*: the attribute `.shape` is helpful here
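
For instance, on a toy table with made-up values (the name `toy_shape` is hypothetical), `.shape` returns a `(rows, columns)` tuple. A minimal sketch:

```python
import pandas as pd

# Toy table (made-up values) with 3 rows and 2 columns.
toy_shape = pd.DataFrame({"State": ["A", "B", "C"], "Measurement": [1.0, 2.5, 4.0]})

# .shape is an attribute, so it is accessed without parentheses.
n_rows, n_cols = toy_shape.shape
print(n_rows, n_cols)
```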

In [None]:
# YOUR CODE HERE

**Question 2.3:** With this information, we can answer the questions below.

1. How many records are there?
2. How many fields are reported?
3. What does each row represent?
4. After reading up on the data formats [here](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files), what does MDL stand for and what is it?

In [None]:
# use this cell for scratch work
...


*YOUR ANSWER HERE*

**Question 2.4:** How many records in the PM25_2018 dataframe have a smaller sample measurement than they do an MDL value? Are you more or less confident in those values than you are in the sample measurement values in the rest of the dataset?

In [None]:
# use this cell for scratch work

*YOUR ANSWER HERE*

**Question 2.5:** Create an array of all the unique state names in `PM25_2018`.

In [None]:
# YOUR CODE HERE

**Question 2.6:** We can see that there are a lot of columns that are unneeded for this data analysis. Let's make a new dataframe with the information we need. Use pd.DataFrame to create a new table with 6 columns:
1. `Date`: The column of dates corresponding to the `Date Local` column.
1. `Time`: The time of day that sampling began on a 24-hour clock corresponding to the `Time Local` column.
1. `Measurement`: The measured value in the standard units of measure for the parameter corresponding to the `Sample Measurement` column.
1. `Units`: The unit of measure for the parameter corresponding to the `Units of Measure` column.
1. `State`: The name of the state where the monitoring site is located.
1. `County`: The name of the county where the monitoring site is located.
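
As a rough sketch of the general pattern (on a toy table with made-up values, not the EPA data), `pd.DataFrame` accepts a dictionary that maps new column names to existing columns. The names `wide` and `narrow` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for a wide table (made-up values).
wide = pd.DataFrame({
    "Site Num": [1, 2],
    "Date Local": ["2018-01-01", "2018-01-02"],
    "Sample Measurement": [3.2, 4.1],
})

# Build a narrower table, giving each kept column a shorter name.
narrow = pd.DataFrame({
    "Date": wide["Date Local"],
    "Measurement": wide["Sample Measurement"],
})
print(narrow)
```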

In [None]:
state_table = ... 
# YOUR CODE HERE

<br>

----

## Section 3: Exploring Data with Pandas<a id='explore'></a>

According to research published in the International Journal of Environmental Research and Public Health, PM2.5 concentrations tend to be higher in cold seasons and lower in warm seasons [(Link to paper)](https://www.ncbi.nlm.nih.gov/pubmed/26426035).

In this section we will analyze our data and see whether this claim proves true.

**Question 3.1:** Using the table from Question 2.6, create a new table containing just data from New York in Queens County. There should be 8481 rows in this table.

In [None]:
queens = ... # YOUR CODE HERE

In [None]:
queens.shape

In [None]:
queens.head()

**Question 3.2:** Within the `queens` dataframe, find any rows where "Measurement" is lower than "MDL" and replace the value in "Measurement" in those rows with `np.nan` (the `.loc` indexer is helpful here!).  

*Hint / Warning*: You may get a "SettingWithCopyWarning".  It's ok to ignore.  
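
The sketch below shows, on a toy table with made-up values and hypothetical column names (`value`, `limit`), how `.loc` can assign to selected rows of one column. It also shows one common way to avoid the warning: calling `.copy()` on a filtered table before modifying it.

```python
import numpy as np
import pandas as pd

toy_loc = pd.DataFrame({"value": [5.0, 1.0, 3.0], "limit": [2.0, 2.0, 2.0]})

# Filtering returns a new object; .copy() makes it explicitly independent,
# which is one common way to avoid SettingWithCopyWarning later on.
subset = toy_loc[toy_loc["value"] > 0].copy()

# Boolean mask: True where "value" is below "limit".
below = subset["value"] < subset["limit"]

# .loc[rows, column] assigns only to the selected rows of one column.
subset.loc[below, "value"] = np.nan
print(subset)
```

Because `subset` is a copy, the original `toy_loc` table is left unchanged.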

In [None]:
# YOUR CODE HERE

In [None]:
queens.head()

**Question 3.3:** Below, output all the measurements in `queens` taken at noon in January and all the measurements taken at noon in June. What do you notice? You might try using the `.describe` method to explore your June and January outputs separately.<br>
*Note*: There are a lot of ways to extract the month from the date in Pandas, and we'll explore some of them in the next homework. For now, one approach (you're free to use another) is to use the [`.str.contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) method. For instance, if a Date cell contains the substring "2018-02", that means the date is in February 2018.
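
Here is a minimal sketch of `.str.contains` combined with a second condition, using a toy table. The values, and the assumption that times are stored as strings like "12:00", are made up for illustration; check how the columns actually look in your own table.

```python
import pandas as pd

toy_dates = pd.DataFrame({
    "Date": ["2018-02-03", "2018-03-01", "2018-02-28"],
    "Time": ["12:00", "12:00", "13:00"],
})

# True for rows whose Date string contains "2018-02".
in_feb = toy_dates["Date"].str.contains("2018-02")

# True for rows whose Time string equals "12:00".
at_noon = toy_dates["Time"] == "12:00"

# Combine both boolean masks with & to keep rows that satisfy both.
feb_noon = toy_dates[in_feb & at_noon]
print(feb_noon)
```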

In [None]:
# noon in january
queens_jan = ... # YOUR CODE HERE
queens_jan

In [None]:
# noon in june
queens_jun = ... # YOUR CODE HERE
queens_jun

*YOUR ANSWER HERE*

**Question 3.3:** We can also visualize this data and see how the PM2.5 concentrations fluctuate throughout the year. Run the code  below to plot all of the measurement data throughout the year.

In order to better plot the x-axis, we have to convert the "Date" column in `queens` to `datetime` format. Otherwise the dates are read as strings, and while they will be plotted correctly, Python will not be able to label them correctly. 

Are there any noticeable trends in this plot? Are there any aspects of the plot that make it difficult for you to determine trends?
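
To illustrate the conversion on its own, the sketch below parses two made-up date strings with `pd.to_datetime`. Once parsed, date components such as the month become available through the `.dt` accessor. The names `date_strings` and `parsed` are hypothetical.

```python
import pandas as pd

# Two made-up dates stored as plain strings.
date_strings = pd.Series(["2018-01-15", "2018-06-30"])

# Convert the strings to datetime values.
parsed = pd.to_datetime(date_strings)

# Datetime values expose components such as the month number.
months = parsed.dt.month.tolist()
print(months)
```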

In [None]:
plt.plot(pd.to_datetime(queens["Date"]), queens["Measurement"])

plt.title("PM2.5 Concentrations in Queens, 2018")
plt.ylabel("PM2.5 Micrograms/cubic meter (LC)")

*YOUR ANSWER HERE*

**Question 3.4:** Let's try plotting values only in the months of January and February, and then only in the months of June and July. Create the dataframe `queens_winter` containing January and February values and `queens_summer` containing June and July values, and then run the corresponding cells. What do you notice about the two plots?

In [None]:
queens_winter = ... # YOUR CODE HERE

plt.plot(pd.to_datetime(queens_winter["Date"]), queens_winter["Measurement"], 'ro', markersize=1)

plt.title("PM2.5 Concentrations in Queens, January and February 2018")
plt.ylabel("PM2.5 Micrograms/cubic meter (LC)")
plt.xticks(rotation = 30)
plt.ylim((0,50))

In [None]:
queens_summer = ... # YOUR CODE HERE

plt.plot(pd.to_datetime(queens_summer["Date"]), queens_summer["Measurement"], 'ro', markersize = 1)

plt.title("PM2.5 Concentrations in Queens, June and July 2018")
plt.ylabel("PM2.5 Micrograms/cubic meter (LC)")
plt.xticks(rotation = 30)
plt.ylim((0,50))

*YOUR ANSWER HERE*

**Question 3.4:** Do the data support the observation that PM2.5 concentrations are on average higher in colder months than warmer months? Why or why not? What are some of the limitations of either our data or the methods we've used to explore it so far in allowing us to observe seasonal trends?

*YOUR ANSWER HERE*

**Question 3.5:** In Susan Athey's essay "Beyond Prediction", Athey defines the distinction between prediction problems and causal inference problems. Thinking about this air quality dataset, can you come up with one question that poses a prediction problem (also referred to as a resource allocation problem in the essay) and another that poses a causal inference problem? The two questions you come up with should be air quality related, but you don't have to limit yourself to this dataset (e.g., it's totally fair to come up with a question that would also incorporate census or demographic data).

*YOUR ANSWER HERE*

----

## Section 4: California Data<a id='cadata'></a>

Let's explore data that hits a little closer to home. In this section, we will look at air quality trends in California, and more specifically in Butte County. California is known for its wildfires, and last year five California cities made it to the [top 10 worst cities for air quality in the United States and Canada](https://www.theguardian.com/cities/datablog/2017/feb/13/most-polluted-cities-world-listed-region). We will use data analysis to see how the fires have impacted PM2.5 concentrations.

<br>**Question 4.1:** Create a dataframe called `PM25_2018_CA` that is a subset of `state_table` and just has PM2.5 2018 California data.

In [None]:
PM25_2018_CA = ... # YOUR CODE HERE

In [None]:
PM25_2018_CA.head()

<br>**Question 4.3:** Find the mean PM2.5 concentrations in each county. 

*hint: `groupby` is a helpful operation*
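
As a small sketch of the idea on a toy table (made-up values and county names), `groupby` splits the rows by a key column and then applies an aggregate such as `mean` to each group:

```python
import pandas as pd

toy_counties = pd.DataFrame({
    "County": ["X", "X", "Y"],
    "Measurement": [2.0, 4.0, 10.0],
})

# Group rows by county, then average the Measurement column within each group.
county_means = toy_counties.groupby("County")["Measurement"].mean()
print(county_means)
```

The result is a Series indexed by the group key, so `county_means["X"]` looks up a single county.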

In [None]:
# YOUR CODE HERE

----
The Camp Fire, which started on November 8, 2018, was described as the [‘deadliest, most destructive wildfire in California history’](https://www.washingtonpost.com/nation/2018/11/25/camp-fire-deadliest-wildfire-californias-history-has-been-contained/?noredirect=on).

UC Berkeley students could smell and see the effects of the fires in Butte County. November 9, 2018 was one of the peak days that the fires were burning and we will analyze its effects on PM2.5 concentrations on this day.

**Question 4.4:** Using `PM25_2018_CA`, create a table containing just information from Napa County on November 9, 2018 (although the fires occurred in Butte County, you'll notice from the list of unique counties that we don't have measurements for Butte - so we'll look at values in a nearby county).

In [None]:
napa_nov9 = ... # YOUR CODE HERE

In [None]:
napa_nov9.head()

**Question 4.5:** Using `PM25_2018_CA`, create a table containing just information from Napa County on November 1, 2018.

In [None]:
napa_nov1 = ... # YOUR CODE HERE

In [None]:
napa_nov1.head()

**Question 4.6:** Merge `napa_nov9` and `napa_nov1` on `Time` to compare their PM2.5 concentrations side by side.

*Note:* If the two dataframes share column names other than the merge key, pandas appends '_x' to those column names from the first dataframe and '_y' to those from the second. The rename operation is meant to clarify things. Be sure that it's renaming correctly!
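
The suffix behavior can be checked on two toy tables with made-up values (the names `day_a` and `day_b` are hypothetical). This is a sketch using the default suffixes of `pd.merge`:

```python
import pandas as pd

day_a = pd.DataFrame({"Time": ["00:00", "01:00"], "Measurement": [10.0, 12.0]})
day_b = pd.DataFrame({"Time": ["00:00", "01:00"], "Measurement": [3.0, 4.0]})

# Both tables have a non-key column called "Measurement", so pandas
# disambiguates them with the default suffixes "_x" (left) and "_y" (right).
merged = pd.merge(day_a, day_b, on="Time")
print(merged.columns.tolist())
```

Only the overlapping non-key column receives suffixes; the key column `Time` appears once.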

In [None]:
napa_merge = ... # YOUR CODE HERE
napa_merge.rename(columns={'Measurement_x':'Nov9 PM2.5', 'Measurement_y':'Nov1 PM2.5'}, inplace = True)
napa_merge.head()

**Question 4.7:** Calculate the mean PM2.5 measurements of both days. How do the PM2.5 concentrations compare on these two dates?

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

----

## Submission

Congrats, you're done with homework 2!

Before you submit, click **Kernel** --> **Restart & Clear Output**. Then, click **Cell** --> **Run All**. Then, go to the toolbar and click **File** -> **Download as** -> **.html** and submit the file through bCourses.

----

## Bibliography

- Yao, Ling, et al. - PM2.5 observations during the day vs at night. https://www.ncbi.nlm.nih.gov/pubmed/26426035
- Guardian News and Media - Air quality rankings in cities. https://www.theguardian.com/cities/datablog/2017/feb/13/most-polluted-cities-world-listed-region
- Washington Post - Camp Fire. https://www.washingtonpost.com/nation/2018/11/25/camp-fire-deadliest-wildfire-californias-history-has-been-contained/

---
Notebook developed by: Melissa Ly

Data Science Modules: http://data.berkeley.edu/education/modules