# Reshaping data: EPA air quality spreadsheets

One of my favorite tasks for Python is reshaping data -- taking data that's spread across dozens or hundreds of identically formatted worksheets or even separate spreadsheets, say, and pulling it all into a nice, clean CSV. Or taking a jankily formatted data set and tidying it up. Pandas has a number of tools and strategies for reshaping data [that you can read about here](https://pandas.pydata.org/pandas-docs/stable/reshaping.html).

In this notebook, we're going to do something a little more bespoke. We've got a spreadsheet of EPA air quality data for core-based statistical areas in Ohio ([thanks, Becca!](https://twitter.com/_becca_king_)). The spreadsheet has a few dozen worksheets, one for each year.

Our mission: Turn it into a single CSV with nice, flat data, and the year appended to each row.

So right now, the headers on each sheet look like this:

`CBSA Code	CBSA	CO 2nd Max 1-hr	CO 2nd Max 8-hr	NO2 98th Percentile 1-hr	NO2 Mean 1-hr	Ozone 2nd Max 1-hr	Ozone 4th Max 8-hr	SO2 99th Percentile 1-hr	SO2 2nd Max 24-hr	SO2 Mean 1-hr	PM2.5 98th Percentile 24-hr	PM2.5 Weighted Mean 24-hr	PM10 2nd Max 24-hr	PM10 Mean 24-hr	Lead Max 3-Mo Avg`

In our CSV, the headers will be the same but with a `Year` column added.

There is a way to do this whole thing in pandas, I'm pretty sure, but I like [`openpyxl`](https://openpyxl.readthedocs.io/) for this task, so that's what we'll use.

First, let's import our dependencies: `pandas` and the `load_workbook` function from the `openpyxl` package.

In [None]:
# import pandas

# import load_workbook from openpyxl


Next, load the spreadsheet up. (You might get a warning here, which is fine, [we can ignore it](https://bitbucket.org/openpyxl/openpyxl/issues/537/userwarning-unknown-extension-is-not).)
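
If you want to see `load_workbook` in action without the EPA file, here's a minimal round-trip sketch. The file name `demo.xlsx` and the temp directory are made up for illustration; swap in the path to the real spreadsheet when you fill in the cell below.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# write a tiny throwaway workbook so there is something to load
# ('demo.xlsx' is a hypothetical name, not the EPA file)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.xlsx')

demo = Workbook()
demo.active.title = 'conreport1980.csv'
demo.save(path)

# load_workbook takes a file path and returns a Workbook object
wb = load_workbook(path)
print(wb.sheetnames)
```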

In [None]:
# load the epa workbook


### Noodle around

With `openpyxl`, you can get a list of worksheet names using the attribute `sheetnames`:

In [None]:
# check out the .sheetnames attribute


... and you can access the data inside a worksheet by passing the name of the worksheet to your workbook variable inside square brackets, just like you would access a value in a dictionary.
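
As a quick sketch of that dictionary-style access, here's a throwaway in-memory workbook. The sheet titles mimic the `conreport1980.csv` naming used in this notebook, but the workbook itself is invented for illustration and is not the EPA file.

```python
from openpyxl import Workbook

# build a small in-memory workbook (hypothetical stand-in for the EPA file)
demo_wb = Workbook()

# a new Workbook starts with one sheet; rename it, then add a second
demo_wb.active.title = 'conreport1980.csv'
demo_wb.create_sheet('conreport1981.csv')

# .sheetnames is a list of worksheet titles, in order
sheet_names = demo_wb.sheetnames
print(sheet_names)

# square brackets fetch a worksheet by title, like a dictionary lookup
first_sheet_name = sheet_names[0]
first_sheet = demo_wb[first_sheet_name]
print(first_sheet.title)
```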

Let's check out the first worksheet:

In [None]:
# get the name of the first [0] sheet

# grab the first sheet

# this is the same as if we'd hardcoded it: wb['conreport1980.csv']

The `values` attribute of a sheet returns the actual data in that worksheet. It returns what's called a generator object; we don't need to worry about what that means right now. Just know that you can use the [`list()`](https://docs.python.org/3/library/functions.html#func-list) function to turn it into a list.
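
Here's a small sketch of what that looks like on a toy worksheet. The rows are invented (there is no CBSA with code 99999 as far as this example is concerned); only the `.values` and `list()` behavior is the point.

```python
from openpyxl import Workbook

# toy worksheet with a header row and one made-up data row
demo_wb = Workbook()
ws = demo_wb.active
ws.append(['CBSA Code', 'CBSA', 'CO 2nd Max 1-hr'])
ws.append([99999, 'Example, OH', '.'])

# .values on its own is a generator object, not the data itself
print(ws.values)

# list() walks the generator and collects one tuple per row
rows = list(ws.values)
print(rows)
```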

In [None]:
# get the .values attribute of that sheet


In [None]:
# same, but turn that generator object into a list


Each item in the list represents a row's worth of data. It's stored in a data structure called a [tuple](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences). You can't add things to a tuple, though, and we need to add the year, so we're going to use `list()` again to convert each row of values into a list.
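
To see why the conversion matters, here's a quick sketch with a made-up row: the tuple has no `append()` method, but the list built from it does.

```python
# a made-up row, shaped like the tuples that list(sheet.values) gives back
row = (99999, 'Example, OH', '.')

# tuples are immutable, so there's no .append() to call
try:
    row.append(1980)
except AttributeError as err:
    print('nope:', err)

# convert to a list first, then append the year
row_as_list = list(row)
row_as_list.append(1980)
print(row_as_list)
```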

Where are we going to get the year from, though? _From the sheet name, exactly._ Let's test it out on our `first_sheet` variable.

In [None]:
# print name of first sheet


OK, so we need to get the four digits directly to the left of the period. As in most things Python, there are a dozen ways we could go about this ([regular expressions](https://docs.python.org/3/library/re.html), anyone? Anyone? OK fine). I like to use splitting and list slicing. First, let's get the bit before the period:

In [None]:
# see what it looks like splitting the sheet name on a period


`split()` returns a list, and we want the first (`[0]`) item in that list:

In [None]:
# look at the first [0] item in that list


Then we just need to grab the _last four_ characters from that string.

Remember: You can [slice](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Lists) strings just like you would a list, and negative indexing is allowed. To get the last four characters in a string, then, you'd say: `[-4:]`. In other words, starting from the fourth character from the end of the string, take everything until the end of the string.
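
Put together, the split-then-slice steps look something like this. The sketch uses the sheet name `conreport1980.csv` as a plain string, so it runs without the workbook; the variable names are just for illustration.

```python
sheet_name = 'conreport1980.csv'

# split on the period: ['conreport1980', 'csv']
parts = sheet_name.split('.')

# the bit before the period is the first [0] item
stem = parts[0]

# the last four characters of that string are the year
year = stem[-4:]
print(year)

# or chain it all together, and int() it to be sure it's a number
year_as_int = int(sheet_name.split('.')[0][-4:])
print(year_as_int)
```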

In [None]:
# split on period
# get the first [0] item in the list that gets returned
# get the last four [-4:] characters from that string


Cool. Now let's grab a list of headers for our CSV and append a `'Year'` value. We'll grab the headers from the first sheet and then, as we cruise through each sheet, check to make sure that the headers in _that_ worksheet match the headers we've extracted.

This one is a little tangled, but let's work inside out to parse it out.

In [None]:
# get the values from the first worksheet and turn 'em into a list


# get the first [0] item in the list -- the headers -- and
# turn the tuple into a list


# append the fieldname 'Year' to our list of headers


### Write an extraction function

We're going to write a function that will take the name of a worksheet, extract the data and return it as a list of dictionaries.

We're also going to see a few new tools, one statement and two built-in functions:
- [`assert`](https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement) (a statement, so no parentheses needed) for making sure that the headers on each sheet match
- [`zip()`](https://docs.python.org/3/library/functions.html#zip) for marrying up each piece of data in a row to its correct field name
- [`dict()`](https://docs.python.org/3/library/functions.html#func-dict) for turning our zipped object into a dictionary
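
Here's a tiny sketch of those three working together, using a shortened header list and a made-up row rather than the real EPA columns.

```python
# shortened, made-up versions of the real headers and a data row
headers = ['CBSA Code', 'CBSA', 'Year']
sheet_headers = ['CBSA Code', 'CBSA', 'Year']
row = [99999, 'Example, OH', 1980]

# assert does nothing if the condition is true and raises
# an AssertionError (with our message) if it's false
assert sheet_headers == headers, 'headers do not match'

# zip() pairs up items by position
pairs = list(zip(headers, row))
print(pairs)

# dict() turns those pairs into a dictionary
record = dict(zip(headers, row))
print(record)
```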

👉 For more details on writing your own functions, [see this notebook](../reference/Functions.ipynb).

In [None]:
# define a function that accepts a worksheet name

    
    # "open" the worksheet and turn the values into a list

    
    # grab the year from the worksheet name
    # the int() call is just another check that a number is returned

    
    # the column names are the first item in the data list
    # and we turn the tuple into a list

    
    # ... and append the 'Year' fieldname

    
    # using our friend `assert`,
    # check that the column names for this worksheet match the
    # headers we defined above

    
    # create an empty list to hold the output data

    
    # loop over the data, skipping the header row

        
        # turn the tuple of row data into a list

        
        # append the year

        
        # append a correctly formatted dictionary to the list
        # use `zip` to match headers and row data and `dict`
        # to turn the whole thing into a dictionary 



    # return the list of data
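
For comparison, here's one way the steps in those comments might come together. It's a sketch, not the only answer: the function name `extract_data` is made up, and it runs against a tiny invented workbook (two columns, one fake row) instead of the EPA file. Like the notebook, it leans on the global `wb` and `headers` variables.

```python
from openpyxl import Workbook

# hypothetical stand-in workbook; the real `wb` comes from load_workbook()
wb = Workbook()
ws = wb.active
ws.title = 'conreport1980.csv'
ws.append(['CBSA Code', 'CBSA'])
ws.append([99999, 'Example, OH'])

# headers from the first sheet, plus 'Year'
headers = list(list(wb[wb.sheetnames[0]].values)[0])
headers.append('Year')


def extract_data(sheet_name):
    """Return one worksheet's rows as a list of dictionaries."""
    # "open" the worksheet and turn the values into a list
    data = list(wb[sheet_name].values)

    # grab the year from the worksheet name
    year = int(sheet_name.split('.')[0][-4:])

    # column names are the first row; add the 'Year' fieldname
    col_names = list(data[0])
    col_names.append('Year')

    # make sure this sheet's columns match the headers defined above
    assert col_names == headers, f'unexpected headers in {sheet_name}'

    output = []
    # loop over the data, skipping the header row
    for row in data[1:]:
        row_data = list(row)
        row_data.append(year)
        output.append(dict(zip(headers, row_data)))

    return output


print(extract_data('conreport1980.csv'))
```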


Next, we'll create an empty dataframe to hold our data.

In [None]:
# create an empty data frame with the columns defined above

# check the results with head()


Finally, we'll loop over the sheet names in the spreadsheet, call the extraction function on each sheet and append the data to our empty dataframe.
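
One caveat if you're on a recent pandas: the `DataFrame.append()` method was removed in pandas 2.0. The sketch below swaps in a different approach, collecting one small dataframe per sheet and stitching them together with `pd.concat()`. The `extract_data` function here is a hypothetical stub that returns fake records, and the sheet names and headers are shortened stand-ins.

```python
import pandas as pd

# shortened stand-ins for the real headers and sheet names
headers = ['CBSA Code', 'CBSA', 'Year']
sheet_names = ['conreport1980.csv', 'conreport1981.csv']


def extract_data(sheet_name):
    """Hypothetical stub: returns one fake record per sheet."""
    year = int(sheet_name.split('.')[0][-4:])
    return [{'CBSA Code': 99999, 'CBSA': 'Example, OH', 'Year': year}]


# collect one dataframe per sheet ...
frames = []
for name in sheet_names:
    records = extract_data(name)
    frames.append(pd.DataFrame(records, columns=headers))

# ... then combine them, renumbering the index from zero
df = pd.concat(frames, ignore_index=True)
print(df.head())
```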

In [None]:
# loop over sheet names

    
    # use our extraction function to get the data

    
    # append the data to our dataframe


In [None]:
# check the results with head()


One last bit of cleanup: Periods mean "no observation" or "no data." That's fine, I guess, but it would make more sense to have those values represented as nulls. Let's fix that.

We'll import numpy, a Python package for scientific computing that pandas is built on, and use its `nan` value.
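
A minimal sketch of the idea on a made-up dataframe: `replace()` swaps cells whose entire value is a period, so text that merely contains punctuation is left alone.

```python
import numpy as np
import pandas as pd

# made-up data: one cell holds a lone period meaning "no observation"
df = pd.DataFrame({
    'CBSA': ['Example, OH', 'Sample, OH'],
    'CO 2nd Max 1-hr': ['.', 3.2],
    'Year': [1980, 1980],
})

# replace() returns a new dataframe; assign it to keep the result
cleaned = df.replace('.', np.nan)
print(cleaned)
```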

In [None]:
# import numpy as np

# replace periods with np.nan in our data frame


In [None]:
# check the results with head()


_Now_ we're ready to export to a CSV using the [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) method.
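
Here's a sketch of the export on a one-row, made-up dataframe. It writes to an in-memory buffer so you can see the text that comes out; pass a file name instead of the buffer to write to disk. Using `index=False` is an assumption on my part: it leaves the row numbers out of the CSV, which is usually what you want.

```python
import io

import pandas as pd

# made-up single row with a null in the middle
df = pd.DataFrame({
    'CBSA': ['Example, OH'],
    'CO 2nd Max 1-hr': [float('nan')],
    'Year': [1980],
})

# write to an in-memory text buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# nulls come out as empty fields; values with commas get quoted
csv_text = buffer.getvalue()
print(csv_text)
```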

In [None]:
# export to csv 'parsed-ohio-air-quality-data.csv'
