# Reshaping data: Portland housing developments

In this notebook, we're going to work with some data on Portland (Oregon) housing developments since 2014. Right now, the data are scattered across a jillion spreadsheets. Our goal is to parse them all into one clean CSV. (Thanks to [Kelly Kenoyer of the Portland Mercury](https://twitter.com/Kelly_Kenoyer) for donating this data.)

The spreadsheets, a mixture of `xls` and `xlsx` files, live in `../data/portland/`. A few things to note:
- Some of the spreadsheets have extra columns
- Some of the spreadsheets have other worksheets in addition to the data worksheet (pivot tables, mostly) -- but these are not always in the same position
- Some of the spreadsheets have columns of mostly blank data that the city once used to manually aggregate data by category -- we don't want these columns
- Some of the spreadsheets have blank rows

Our strategy:
- Get a list of Excel files in that directory using the [`glob`](https://docs.python.org/3/library/glob.html) module
- Create an empty pandas data frame
- Loop over the list of spreadsheet files and ...
    - Read in the file to a data frame
    - Find the correct worksheet
    - Drop empty columns and rows
    - Append to the main data frame
    
First, we'll import `glob` and pandas.

In [None]:
# import glob and pandas


Next, we'll use `glob` to get a list of the files we're going to loop over. We'll use the asterisk `*`, which means "match everything."
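
Here's a minimal sketch of the `*` pattern in action. So that it runs anywhere, it uses a throwaway temp directory as a stand-in for `../data/portland/`, and the file names are just examples. `glob` doesn't promise any particular order, so the sketch sorts the results.

```python
import glob
import os
import tempfile

# stand-in for `../data/portland/` (an assumption so the sketch runs anywhere)
with tempfile.TemporaryDirectory() as data_dir:

    # create two empty files with example names
    for name in ['08_2014 New Res Units.xls', '2018 04 New Residential Units.xlsx']:
        open(os.path.join(data_dir, name), 'w').close()

    # the asterisk matches every (non-hidden) file name in the directory
    files = sorted(glob.glob(os.path.join(data_dir, '*')))

    # strip off the directory so the list is easier to read
    file_names = [os.path.basename(f) for f in files]

print(file_names)
```

With the real data, the pattern would presumably be `'../data/portland/*'`.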

In [None]:
# use glob to find everything in the `../data/portland/` directory


In [None]:
# print that list to make sure we have what we think we have


Now we'll create an empty data frame. This will be the container we stuff the data into as we loop over the files.
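
As a quick sketch, an empty data frame is just the `DataFrame` constructor called with no arguments. The name `housing` matches what the later cells call the main data frame.

```python
import pandas as pd

# an empty container: no rows, no columns yet
housing = pd.DataFrame()

print(housing.shape)
```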

In [None]:
# create an empty data frame


Let's take a look at what we're dealing with. We're going to loop over the spreadsheets, and for each one, we're going to look at:
- The names of the worksheets in that spreadsheet
- The columns in each worksheet

This will help us decide, later, which worksheets we need to target.

[According to the `read_excel()` documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html), you can pass `None` as the `sheet_name` argument and pandas will read in _all_ of the sheets as one big dictionary -- the keys are the names of the worksheets, the values are the associated data frames. We're going to take advantage of that.
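
To make the shape of that dictionary concrete, here's a small sketch. `read_all_sheets()` is a hypothetical helper name, and because the sketch shouldn't depend on the real files, it loops over a hand-built dictionary with the same structure. The `pivot` sheet and the column names are invented for illustration.

```python
import pandas as pd


def read_all_sheets(path):
    """Read every worksheet in an Excel file into {sheet name: data frame}."""
    return pd.read_excel(path, sheet_name=None)


# stand-in for what read_all_sheets() would hand back for one file
# (the sheet contents here are invented)
sheets = {
    'pivot': pd.DataFrame({'type': ['house'], 'total': [1]}),
    '08_2014 New Res Units': pd.DataFrame({'indate': ['2014-08-01'], 'units': [1]}),
}

# the keys are worksheet names, the values are data frames
for sheet_name, df in sheets.items():
    print(sheet_name, list(df.columns))

sheet_names = list(sheets.keys())
```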

Later, our logic will go like this:
- Read in every worksheet as a data frame
- Target the worksheet whose name matches the pattern for the data we need

👉 For a refresher on _for loops_ and dictionaries, [check out this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops).

In [None]:
# loop over the excel file paths

    
    # load every worksheet in the file into a dictionary of data frames
    # by specifying `None` as the sheet name

    
    # print the name of the file

    
    # print the worksheet names
    # -- the .keys() in the dictionary

    
    # print a divider to make scanning easier

    
    # and an empty line


OK. So it looks like our target sheets are called a few different things: `nrs`, `04_2016 New Res Units`, `2018 04 New Residential Units`, etc.

Can we come up with a list of patterns to match all of them? I think we can.

In [None]:
# the items in this list are lowercased,
# because we're gonna match on .lower()'d versions of the sheet names
target_sheet_name_fragments = ['new res', 'nrs', 'lus stats']

So now, we need to write some logic that says: Pick the worksheet that has one of our `target_sheet_name_fragments` in the name. A nested pair of _for loops_ will do the trick for us.
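
Here's one way the nested loops could look, pulled out into a function so it's easy to try on a plain list of sheet names. `find_target_sheet()` is a hypothetical name, and stopping at the first match is an assumption of this sketch.

```python
target_sheet_name_fragments = ['new res', 'nrs', 'lus stats']


def find_target_sheet(sheet_names, fragments):
    """Return the first sheet name containing one of the fragments, or None."""

    # start off with no match
    match = None

    # loop over the worksheet names
    for name in sheet_names:

        # loop over the word fragments
        for fragment in fragments:

            # if this fragment exists in the lowercased worksheet name
            if fragment in name.lower():
                match = name
                break

        # stop looking once we have a winner
        if match is not None:
            break

    return match


print(find_target_sheet(['Pivot', '04_2016 New Res Units'], target_sheet_name_fragments))
print(find_target_sheet(['Sheet1', 'Sheet2'], target_sheet_name_fragments))
```

A return value of `None` is the signal that a file needs a closer look.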

In [None]:
# loop over the excel file paths

    
    # load every worksheet in the file into a dictionary of data frames
    # by specifying `None` as the sheet name

        
    # start off with no match -- None

    
    # loop over the worksheet names

        
        # loop over the word fragments

            
            # if this fragment exists in the lowercased worksheet name

                
                # we've got a winner

    # if, when we get to the end of this, `match` is still None

        # print something to let us know about it

        
        # and the names of the sheets

        
        # and break out of the loop

    
    # otherwise, grab a handle to the worksheet we want

    
    # print a status message to let us know what's up


Scanning through that list, I feel comfortable that we're grabbing the correct data. Let's take a look at the columns in each worksheet we'll be parsing.

In [None]:
# loop over the excel file paths

    
    # load every worksheet in the file into a dictionary of data frames
    # by specifying `None` as the sheet name

        
    # start off with no match - None

    
    # loop over the worksheet names

        
        # loop over the word fragments

            
            # if this fragment exists in the lowercased worksheet name

                
                # we've got a winner

    # if, when we get to the end of this, `match` is still None

        # print something to let us know about it

        
        # and the names of the sheets

        
        # and break out of the loop


    # otherwise, grab a handle to the worksheet we want

    
    # print a status message to let us know what's up

    
    # print a sorted list of column names

    
    # print a divider to make scanning our results easier

    
    # print an empty line


I notice that some columns have names like `Unnamed: 4`. That's the placeholder pandas assigns when a column has no header. Let's take a look at one of those:

In [None]:
test = pd.read_excel('../data/portland/08_2014 New Res Units.xls', sheet_name='08_2014 New Res Units')

In [None]:
test.head(20)

Looks like they're using those columns to total up the valuations for groups of housing types. I'm noticing, too, that there are some blank rows -- probably used as dividers between groups -- so we'll want to drop those as well.

We'll keep that in mind as we roll through these sheets.
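
A minimal sketch of that cleanup, run on a tiny invented data frame rather than the real worksheet: one header-less subtotal column and one blank divider row. The `units` column and all of the values are made up.

```python
import pandas as pd

# invented sample: a header-less subtotal column and a blank divider row
df = pd.DataFrame({
    'indate': ['2014-08-01', None, '2014-08-04'],
    'units': [1, None, 2],
    'Unnamed: 4': [None, None, 250000],
})

# columns pandas had to name itself because the header cell was empty
cols_to_drop = [col for col in df.columns if str(col).startswith('Unnamed')]

# drop those columns
cleaned = df.drop(columns=cols_to_drop)

# drop a row only if every remaining value in it is null
cleaned = cleaned.dropna(how='all')

print(cleaned)
```

Dropping the columns first means a row that holds nothing but a subtotal would also count as empty.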

Here's the pandas documentation on the methods we'll be using here:
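- [`append()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html) (deprecated in pandas 1.4 and removed in 2.0; on current versions, `pd.concat()` does the same job)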
- [`drop()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
- [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)
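
One caveat: `DataFrame.append()` was removed in pandas 2.0. If that's the version you have, a rough equivalent is to collect each cleaned data frame in a plain list and stitch them together once with `pd.concat()`. This is a sketch of that swapped-in approach, with two invented data frames standing in for cleaned worksheets.

```python
import pandas as pd

# invented stand-ins for two cleaned worksheets with slightly different columns
sheet_a = pd.DataFrame({'indate': ['2014-08-01'], 'units': [1]})
sheet_b = pd.DataFrame({'indate': ['2018-04-02'], 'units': [3], 'extra': ['x']})

# collect the pieces in a regular Python list
frames = []
for df in [sheet_a, sheet_b]:
    frames.append(df)  # this is list.append, not DataFrame.append

# stack them into one data frame with a fresh 0..n-1 index
housing = pd.concat(frames, ignore_index=True, sort=False)

print(housing)
```

Columns that only exist in some worksheets are kept and filled with nulls for the rest.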

In [None]:
# loop over the excel file paths

    
    # load every worksheet in the file into a dictionary of data frames
    # by specifying `None` as the sheet name


    # start off with no match

    
    # loop over the worksheet names

        
        # loop over the word fragments

            
            # if this fragment exists in the lowercased worksheet name

                
                # we've got a winner


    # if, when we get to the end of this, `match` is still None

        # print something to let us know about it

        
        # and the names of the sheets

        
        # and break out of the loop

                
    # otherwise, grab a handle to the worksheet we want

    
    # print a status message to let us know what's up

    
    # get a list of columns we want to drop

    
    # drop those bad boys


    # drop empty rows in place, but only if _all_ of the values are nulls

    
    # append to our `housing` data frame


In [None]:
# check it out with head()


In [None]:
# check the len()


In [None]:
# check dtypes


One last thing I'd do, before writing out to file, is parse the date columns as dates:
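
A small sketch of the conversion with `pd.to_datetime()`, using a two-row stand-in for `housing`. The values and the `units` column are invented.

```python
import pandas as pd

# tiny stand-in for the combined data frame
housing = pd.DataFrame({'indate': ['2014-08-01', '2018-04-02'], 'units': [1, 3]})

# overwrite the text column with real datetimes
housing['indate'] = pd.to_datetime(housing['indate'])

print(housing.dtypes)
```

If some cells turn out to hold junk values, `pd.to_datetime()` also accepts `errors='coerce'`, which turns anything it can't parse into `NaT` instead of raising an error.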

In [None]:
# convert "indate" column to datetime


# convert the other date column to datetime, too


In [None]:
# check it out with head()


Now we can use the `to_csv()` method to write out to a new file:
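
Here's a sketch of the write-out step. It saves into a temp directory (an assumption, so nothing lands in your working folder) and reads the file back to confirm that `index=False` kept the row index out of the CSV. The data are invented.

```python
import os
import tempfile

import pandas as pd

# tiny stand-in for the combined data frame
housing = pd.DataFrame({'indate': ['2014-08-01', '2018-04-02'], 'units': [1, 3]})

with tempfile.TemporaryDirectory() as out_dir:
    csv_path = os.path.join(out_dir, 'portland-developments.csv')

    # index=False leaves the 0, 1, 2 ... row labels out of the file
    housing.to_csv(csv_path, index=False)

    # read it back to see what actually got written
    round_trip = pd.read_csv(csv_path)

print(list(round_trip.columns))
```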

In [None]:
# write out to 'portland-developments.csv'
# specify no index
