# Reshaping data: EPA air quality spreadsheets

One of my favorite tasks for Python is reshaping data -- taking data that's spread across dozens or hundreds of identically formatted worksheets or even separate spreadsheets, say, and pulling it all into a nice, clean CSV.

That's what we're going to do in this notebook. We've got a spreadsheet of EPA air quality data for core-based statistical areas in Ohio ([thanks, Becca!](https://twitter.com/_becca_king_)). The spreadsheet has a few dozen worksheets, one for each year.

Our mission: Turn it into a single CSV with nice, flat data, and the year appended to each row.

So right now, the headers on each sheet look like this:

`CBSA Code	CBSA	CO 2nd Max 1-hr	CO 2nd Max 8-hr	NO2 98th Percentile 1-hr	NO2 Mean 1-hr	Ozone 2nd Max 1-hr	Ozone 4th Max 8-hr	SO2 99th Percentile 1-hr	SO2 2nd Max 24-hr	SO2 Mean 1-hr	PM2.5 98th Percentile 24-hr	PM2.5 Weighted Mean 24-hr	PM10 2nd Max 24-hr	PM10 Mean 24-hr	Lead Max 3-Mo Avg`

In our CSV, the headers will be the same but with a `Year` column added.

There is a way to do this whole thing in pandas, I'm pretty sure, but I like [`openpyxl`](https://openpyxl.readthedocs.io/) for this task, so that's what we'll use.

First, let's import our dependencies: `pandas` and the `load_workbook` function from the `openpyxl` package.

In [9]:
import pandas as pd
from openpyxl import load_workbook

Next, load the spreadsheet up. (You might get a warning here, which is fine, [we can ignore it](https://bitbucket.org/openpyxl/openpyxl/issues/537/userwarning-unknown-extension-is-not).)

In [2]:
wb = load_workbook('../data/epa.xlsx')

  warn(msg)


### Noodle around

With `openpyxl`, you can get a list of worksheet names using the attribute `sheetnames`:

In [3]:
wb.sheetnames

['conreport1980.csv',
 'conreport1981.csv',
 'conreport1982.csv',
 'conreport1983.csv',
 'conreport1984.csv',
 'conreport1985.csv',
 'conreport1986.csv',
 'conreport1987.csv',
 'conreport1988.csv',
 'conreport1989.csv',
 'conreport1990.csv',
 'conreport1991.csv',
 'conreport1992.csv',
 'conreport1993.csv',
 'conreport1994.csv',
 'conreport1995.csv',
 'conreport1996.csv',
 'conreport1997.csv',
 'conreport1998.csv',
 'conreport1999.csv',
 'conreport2000.csv',
 'conreport2001.csv',
 'conreport2002.csv',
 'conreport2003.csv',
 'conreport2004.csv',
 'conreport2005.csv',
 'conreport2006.csv',
 'conreport2007.csv',
 'conreport2008.csv',
 'conreport2009.csv',
 'conreport2010.csv',
 'conreport2011.csv',
 'conreport2012.csv',
 'conreport2013.csv',
 'conreport2014.csv',
 'conreport2015.csv',
 'conreport2016.csv',
 'conreport2017.csv',
 'conreport2018.csv']

... and you can access the data inside a worksheet by passing the name of the worksheet to your workbook variable inside square brackets, just like you would access a value in a dictionary.

Let's check out the first worksheet:

In [10]:
first_sheet = wb.sheetnames[0]
wb[first_sheet]
# this is the same as if we'd hardcoded it: wb['conreport1980.csv']

<Worksheet "conreport1980.csv">

The `values` attribute of a sheet returns the actual data in that worksheet. It returns what's called a `generator object`; we don't need to worry about what that means right now. Just know that you can use the [`list()`](https://docs.python.org/3/library/functions.html#func-list) function to turn it into a list.
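If you're curious anyway: a generator hands over its items one at a time instead of building the whole collection up front, and once it has handed everything over, it's empty. Here's a tiny sketch using a made-up `fake_rows()` function (a stand-in for illustration, not part of `openpyxl`):

```python
def fake_rows():
    # hypothetical stand-in for a worksheet's rows, yielded one at a time
    yield ('CBSA Code', 'CBSA')
    yield (10420, 'Akron, OH')

gen = fake_rows()

# list() pulls every item out of the generator
rows = list(gen)

# the generator is now used up, so a second pass finds nothing
leftover = list(gen)

print(rows)
print(leftover)
```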

In [7]:
wb[first_sheet].values

<generator object Worksheet.values at 0x10d296f48>

In [8]:
list(wb[first_sheet].values)

[('CBSA Code',
  'CBSA',
  'CO 2nd Max 1-hr',
  'CO 2nd Max 8-hr',
  'NO2 98th Percentile 1-hr',
  'NO2 Mean 1-hr',
  'Ozone 2nd Max 1-hr',
  'Ozone 4th Max 8-hr',
  'SO2 99th Percentile 1-hr',
  'SO2 2nd Max 24-hr',
  'SO2 Mean 1-hr',
  'PM2.5 98th Percentile 24-hr',
  'PM2.5 Weighted Mean 24-hr',
  'PM10 2nd Max 24-hr',
  'PM10 Mean 24-hr',
  'Lead Max 3-Mo Avg'),
 (10420,
  'Akron, OH',
  17.6,
  7.8,
  '.',
  '.',
  0.12,
  0.099,
  188,
  66,
  19,
  '.',
  '.',
  '.',
  '.',
  '.'),
 (11780,
  'Ashtabula, OH',
  '.',
  '.',
  '.',
  '.',
  0.11,
  0.092,
  101,
  34,
  10,
  '.',
  '.',
  '.',
  '.',
  '.'),
 (15940,
  'Canton-Massillon, OH',
  8,
  4.5,
  '.',
  29,
  0.1,
  0.087,
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.'),
 (17060,
  'Chillicothe, OH',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.'),
 (17140,
  'Cincinnati, OH-KY-IN',
  10,
  6,
  70,
  26,
  0.16,
  0.11,
  190,
  55,
  14,
  '.',
  '.',
  '.',
  

Each item in the list represents a row's worth of data. It's stored in a data structure called a [tuple](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences). You can't add things to a tuple, though, and we need to add the year, so we're going to use `list()` again to convert each row of values into a list.
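Here's a minimal sketch of that conversion, using a shortened, made-up row (the real rows have 16 values) and a hardcoded year:

```python
# a shortened stand-in for one row of worksheet values
row = (10420, 'Akron, OH', 17.6)

# tuples have no append() method, so convert to a list first
row_as_list = list(row)
row_as_list.append(1980)

print(row_as_list)
```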

Where are we going to get the year from, though? _From the sheet name, exactly._ Let's test it out on our `first_sheet` variable.

In [11]:
print(first_sheet)

conreport1980.csv


OK, so we need to get the four digits directly to the left of the period. As in most things Python, there are a dozen ways we could go about this ([regular expressions](https://docs.python.org/3/library/re.html), anyone? Anyone? OK fine). I like to use splitting and list slicing. First, let's split the sheet name on the period:

In [12]:
first_sheet.split('.')

['conreport1980', 'csv']

`split()` returns a list, and we want the first (`[0]`) item in that list. Then we just need to grab the _last four_ characters from that string.

Remember: You can [slice](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Lists) strings just like you would a list, and negative indexing is allowed. To get the last four characters in a string, then, you'd say: `[-4:]`. In other words, starting from the fourth character from the end of the string, take everything until the end of the string.

In [13]:
first_sheet.split('.')[0][-4:]

'1980'
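For anyone who did raise a hand for regular expressions, that route might look something like the sketch below. The pattern is my assumption about the sheet names (four digits sitting right before a period), and the sheet name is hardcoded so the example runs on its own:

```python
import re

sheet_name = 'conreport1980.csv'

# look for four digits that sit directly before a literal period
match = re.search(r'(\d{4})\.', sheet_name)

# re.search() returns None when nothing matches, so check before using it
year = int(match.group(1)) if match else None

print(year)
```

The split-and-slice version is shorter, but the regex would also complain (by returning `None`) if a sheet name didn't follow the expected pattern.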

Cool. Now let's grab a list of headers for our CSV and append a `'Year'` value. We'll grab the headers from the first sheet and then, as we cruise through each sheet, check to make sure that the headers in _that_ worksheet match the headers we've extracted.

This one is a little tangled, but let's work inside out to parse it out.

In [14]:
# get the values from the first worksheet and turn 'em into a list
list_of_values = list(wb[first_sheet].values)

# get the first [0] item in the list -- the headers -- and
# turn the tuple into a list
headers = list(list_of_values[0])

# append the fieldname 'Year' to our list of headers
headers.append('Year')

### Write an extraction function

We're going to write a function that will take the name of a worksheet, extract the data and return it as a list of dictionaries.

We're also going to see a few new tools:
- the [`assert`](https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement) statement for making sure that the headers on each sheet match
- [`zip()`](https://docs.python.org/3/library/functions.html#zip) for marrying up each piece of data in a row to its correct field name
- [`dict()`](https://docs.python.org/3/library/functions.html#func-dict) for turning our zipped object into a dictionary

👉 For more details on writing your own functions, [see this notebook](../reference/Functions.ipynb).
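Before the real function, here's a quick sketch of how those three pieces fit together. The headers and row are shortened, made-up stand-ins, and the `demo_` names are just for illustration:

```python
# shortened, made-up headers and row for illustration
demo_headers = ['CBSA Code', 'CBSA', 'Year']
demo_row = [10420, 'Akron, OH', 1980]

# assert does nothing when the condition is true and raises
# an AssertionError when it's false
assert len(demo_headers) == len(demo_row)

# zip() pairs items by position; dict() turns the pairs into key/value entries
demo_record = dict(zip(demo_headers, demo_row))

print(demo_record)
```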

In [16]:
# define a function that accepts a worksheet name
def extract_data(ws_name):
    
    # "open" the worksheet and turn the values into a list
    data = list(wb[ws_name].values)
    
    # grab the year from the worksheet name
    # the int() call is just another check that a number is returned
    year = int(ws_name.split('.')[0][-4:])
    
    # the column names are the first item in the data list
    # and we turn the tuple into a list
    colnames = list(data[0])
    
    # ... and append the 'Year' fieldname
    colnames.append('Year')
    
    # check that the column names for this worksheet match the
    # headers we defined above
    assert colnames == headers, f'Unexpected headers in {ws_name}'
    
    # create an empty list to hold the output data
    ls = []
    
    # loop over the data, skipping the header row
    for row in data[1:]:
        
        # turn the tuple of row data into a list
        row = list(row)
        
        # append the year
        row.append(year)
        
        # append a correctly formatted dictionary to the list
        # use `zip` to match headers and row data and `dict`
        # to turn the whole thing into a dictionary 
        data_as_dict = dict(zip(colnames, row))
        ls.append(data_as_dict)

    # return the list of data
    return ls

Next, we'll create an empty dataframe to hold our data.

In [17]:
df = pd.DataFrame(columns=headers)
df.head()

Unnamed: 0,CBSA Code,CBSA,CO 2nd Max 1-hr,CO 2nd Max 8-hr,NO2 98th Percentile 1-hr,NO2 Mean 1-hr,Ozone 2nd Max 1-hr,Ozone 4th Max 8-hr,SO2 99th Percentile 1-hr,SO2 2nd Max 24-hr,SO2 Mean 1-hr,PM2.5 98th Percentile 24-hr,PM2.5 Weighted Mean 24-hr,PM10 2nd Max 24-hr,PM10 Mean 24-hr,Lead Max 3-Mo Avg,Year


Finally, we'll loop over the sheet names in the spreadsheet, call the extraction function on each sheet and append the data to our empty dataframe.

In [18]:
# loop over sheet names
for sheet in wb.sheetnames:
    
    # use our extraction function to get the data
    annual_vals = extract_data(sheet)
    
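    # add this year's rows to our dataframe
    # (DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job)
    df = pd.concat([df, pd.DataFrame(annual_vals)])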
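
Growing a dataframe inside a loop makes pandas copy the accumulated data on every pass. An alternative, sketched here with made-up stand-ins for what `extract_data()` returns, is to gather all the rows in a plain list and build the dataframe once at the end:

```python
import pandas as pd

# hypothetical stand-ins for the lists of dictionaries that
# extract_data() would return for two worksheets
rows_1980 = [{'CBSA Code': 10420, 'CBSA': 'Akron, OH', 'Year': 1980}]
rows_1981 = [{'CBSA Code': 10420, 'CBSA': 'Akron, OH', 'Year': 1981}]

# collect every row in one ordinary Python list
all_rows = []
for annual_vals in (rows_1980, rows_1981):
    all_rows.extend(annual_vals)

# build the dataframe in a single step
df_all = pd.DataFrame(all_rows)

print(df_all.shape)
```

For a few dozen small worksheets the difference is negligible, so this is a matter of taste more than speed.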

In [19]:
df.head()

Unnamed: 0,CBSA Code,CBSA,CO 2nd Max 1-hr,CO 2nd Max 8-hr,NO2 98th Percentile 1-hr,NO2 Mean 1-hr,Ozone 2nd Max 1-hr,Ozone 4th Max 8-hr,SO2 99th Percentile 1-hr,SO2 2nd Max 24-hr,SO2 Mean 1-hr,PM2.5 98th Percentile 24-hr,PM2.5 Weighted Mean 24-hr,PM10 2nd Max 24-hr,PM10 Mean 24-hr,Lead Max 3-Mo Avg,Year
0,10420,"Akron, OH",17.6,7.8,.,.,0.12,0.099,188,66,19,.,.,.,.,.,1980
1,11780,"Ashtabula, OH",.,.,.,.,0.11,0.092,101,34,10,.,.,.,.,.,1980
2,15940,"Canton-Massillon, OH",8,4.5,.,29,0.1,0.087,.,.,.,.,.,.,.,.,1980
3,17060,"Chillicothe, OH",.,.,.,.,.,.,.,.,.,.,.,.,.,.,1980
4,17140,"Cincinnati, OH-KY-IN",10,6,70,26,0.16,0.11,190,55,14,.,.,.,.,.,1980


One last bit of cleanup: Periods mean "no observation" or "no data." That's fine, I guess, but it would make more sense to have those values represented as nulls. Let's fix that.

We'll import numpy, a Python package for scientific computing that pandas is built on, and use its `nan` value.

In [20]:
import numpy as np
df = df.replace('.', np.nan)

In [21]:
df.head()

Unnamed: 0,CBSA Code,CBSA,CO 2nd Max 1-hr,CO 2nd Max 8-hr,NO2 98th Percentile 1-hr,NO2 Mean 1-hr,Ozone 2nd Max 1-hr,Ozone 4th Max 8-hr,SO2 99th Percentile 1-hr,SO2 2nd Max 24-hr,SO2 Mean 1-hr,PM2.5 98th Percentile 24-hr,PM2.5 Weighted Mean 24-hr,PM10 2nd Max 24-hr,PM10 Mean 24-hr,Lead Max 3-Mo Avg,Year
0,10420,"Akron, OH",17.6,7.8,,,0.12,0.099,188.0,66.0,19.0,,,,,,1980
1,11780,"Ashtabula, OH",,,,,0.11,0.092,101.0,34.0,10.0,,,,,,1980
2,15940,"Canton-Massillon, OH",8.0,4.5,,29.0,0.1,0.087,,,,,,,,,1980
3,17060,"Chillicothe, OH",,,,,,,,,,,,,,,1980
4,17140,"Cincinnati, OH-KY-IN",10.0,6.0,70.0,26.0,0.16,0.11,190.0,55.0,14.0,,,,,,1980


_Now_ we're ready to export to a CSV using the [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) method.

In [22]:
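# index=False leaves out the dataframe's row index, so the CSV has only our data columns
df.to_csv('parsed-ohio-air-quality-data.csv', index=False)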
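
For comparison, here is one way the all-pandas route mentioned near the top might look. This is a sketch, not what this notebook does. So that it can run on its own, it first builds a tiny two-sheet workbook in memory with made-up data (`demo_wb`, `buf` and the other names are hypothetical). It then relies on `pd.read_excel()` with `sheet_name=None`, which returns a dictionary of dataframes keyed by sheet name.

```python
import io

import pandas as pd
from openpyxl import Workbook

# build a small stand-in workbook in memory (made-up data)
demo_wb = Workbook()
ws_1980 = demo_wb.active
ws_1980.title = 'conreport1980.csv'
ws_1980.append(['CBSA Code', 'CBSA'])
ws_1980.append([10420, 'Akron, OH'])

ws_1981 = demo_wb.create_sheet('conreport1981.csv')
ws_1981.append(['CBSA Code', 'CBSA'])
ws_1981.append([10420, 'Akron, OH'])

# save to an in-memory buffer instead of a file on disk
buf = io.BytesIO()
demo_wb.save(buf)
buf.seek(0)

# sheet_name=None reads every worksheet into a dict of dataframes
sheets = pd.read_excel(buf, sheet_name=None, engine='openpyxl')

frames = []
for name, frame in sheets.items():
    # same split-and-slice trick used above to pull the year from the name
    frame['Year'] = int(name.split('.')[0][-4:])
    frames.append(frame)

# stack the per-sheet dataframes into one
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

With the real spreadsheet you would pass the file path instead of `buf`. Passing `na_values='.'` to `read_excel()` should also turn the periods into nulls on the way in, though this version skips the header check that our `assert` gave us.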