# Melting Nashville library data

One reshaping task that comes up frequently is "melting" or "unpivoting" data -- turning "wide" data into "long" data that's tidier and easier to analyze.

A good example of data that needs melting is the city of Nashville's ["Public Library Visits By Branch"](https://data.nashville.gov/Libraries/Nashville-Public-Library-Visits-by-Branch/3iet-ewuy) dataset.

The header row looks like this: `Month,Year,Bellevue Library,Bordeaux Library,Donelson Library,East Library,Edgehill Library,Edmonson Pike Library,Goodlettsville Library,Green Hills Library,Hadley Park Library,Hermitage Library,Inglewood Library,Looby Library,Madison Library,Main Library,North Library,Old Hickory Library,Pruitt Library,Richland Park Library,Southeast Library,Thompson Lane Library,Watkins Library,Notes`. Each row has multiple observations, one for each branch in the library system.

It'd be a lot easier to analyze if the data looked more like this: `Month,Year,Library,Visits`. This is sometimes called ["tidy data,"](http://vita.had.co.nz/papers/tidy-data.html) where each row represents _one_ observation: here, one library's visit count for one month.

To tidy up this untidy data, we're going to use a pandas method called [`melt()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html), which will take the values in each column of library data and "melt" them into rows of their own. We just need to tell it which ones are the "identifying" columns (month, year and notes) and which are the "value" columns we want to melt. We'll also specify the names of the new columns that are created, but that step is optional.

Let's start by importing pandas.

In [1]:
import pandas as pd

Next, let's read in the data live from the city's open data portal. (If this fails for some reason, we've got a cached copy at `../data/nashville-library.csv`.)

In [2]:
# https://data.nashville.gov/api/views/3iet-ewuy/rows.csv?accessType=DOWNLOAD
df = pd.read_csv('https://data.nashville.gov/api/views/3iet-ewuy/rows.csv?accessType=DOWNLOAD')
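If the live download turns out to be unreliable, one option is to wrap the read in a small helper that falls back to the cached copy. This is only a sketch: `read_csv_with_fallback` is a hypothetical helper (not part of pandas), and it assumes that a failed download or a missing file surfaces as an `OSError`, which covers the usual network and file errors but not, say, a malformed CSV.

```python
import pandas as pd


def read_csv_with_fallback(primary, fallback):
    """Try to read `primary`; if that fails, read `fallback` instead."""
    try:
        return pd.read_csv(primary)
    except OSError as err:
        # Network failures (urllib's URLError/HTTPError) and missing local
        # files (FileNotFoundError) are all subclasses of OSError
        print(f'Could not read {primary} ({err}); using {fallback}')
        return pd.read_csv(fallback)


# Usage, with the URL and cached path mentioned above:
# df = read_csv_with_fallback(
#     'https://data.nashville.gov/api/views/3iet-ewuy/rows.csv?accessType=DOWNLOAD',
#     '../data/nashville-library.csv')
```

The usage lines are commented out so the block doesn't hit the network just by being run.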

In [20]:
df.head()

,Month,Year,Bellevue Library,Bordeaux Library,Donelson Library,East Library,Edgehill Library,Edmonson Pike Library,Goodlettsville Library,Green Hills Library,...,Madison Library,Main Library,North Library,Old Hickory Library,Pruitt Library,Richland Park Library,Southeast Library,Thompson Lane Library,Watkins Library,Notes
0,8,2011,18641,8748,15560,8976,7018,25585,29273,54087,...,21055,61603,10029,4975,11284,13727,14675,10737,6389,
1,7,2011,18652,11985,15463,6254,6096,29813,26081,45639,...,22792,62337,9607,5096,16867,13451,13150,9462,8690,
2,9,2011,14720,9211,13867,8269,5168,21028,22286,44155,...,19260,56053,7048,4861,10478,10996,12073,8768,2569,
3,10,2011,15645,8998,14807,8501,6187,22850,21822,50397,...,23682,59960,7978,6390,10991,11201,13106,9881,4282,
4,11,2011,14203,6686,14579,9043,5692,20923,22341,39374,...,24809,53132,6713,5668,11577,10454,11791,8813,4019,


According to the `melt()` documentation, we can hand off a list to the `id_vars` argument -- the columns that we _don't_ want to melt, because we need them as identifiers. We can also hand off a list to the `value_vars` argument -- the columns we want to unpivot.

👉 For a refresher on lists, [check out this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Lists).
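To see what `id_vars` and `value_vars` do before we point them at the real data, here's a minimal sketch on a made-up table. The branch names and numbers are invented for illustration and have nothing to do with the Nashville file.

```python
import pandas as pd

# Made-up numbers for two hypothetical branches, in the same "wide" layout
toy = pd.DataFrame({
    'Month': [1, 2],
    'Year': [2020, 2020],
    'Branch A': [100, 110],
    'Branch B': [200, 210],
})

toy_melted = pd.melt(toy,
                     id_vars=['Month', 'Year'],
                     value_vars=['Branch A', 'Branch B'])

# Each (Month, Year) pair now appears once per branch; because we didn't
# pass var_name or value_name, the new columns get the default names
# 'variable' and 'value'
print(toy_melted)
```

Two rows and two value columns go in, and four rows come out: one per month per branch.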

Writing out `Month`, `Year` and `Notes` in a list isn't that big of a deal for the ID columns. For the value columns ... We _could_ take the time to write out the names of all those columns. But that's inefficient! Let's take a look at the column names with the data frame's `columns` attribute:

In [21]:
df.columns

Index(['Month', 'Year', 'Bellevue Library', 'Bordeaux Library',
       'Donelson Library', 'East Library', 'Edgehill Library',
       'Edmonson Pike Library', 'Goodlettsville Library',
       'Green Hills Library', 'Hadley Park Library', 'Hermitage Library',
       'Inglewood Library', 'Looby Library', 'Madison Library', 'Main Library',
       'North Library', 'Old Hickory Library', 'Pruitt Library',
       'Richland Park Library', 'Southeast Library', 'Thompson Lane Library',
       'Watkins Library', 'Notes'],
      dtype='object')

Sure looks iterable. What happens if we try to grab a single item by its position?

In [22]:
# get the fourth [3] item out
df.columns[3]

'Bordeaux Library'

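Sure enough, we can index it like a list, so we can slice it, too. Our columns of interest start in position three (`[2]`, which is 'Bellevue Library') and run through the second from the end (`[-2]`, 'Watkins Library'). Because a slice stops just _before_ its end index, `[2:-1]` gives us exactly those columns and leaves off the last one, `Notes`. Let's slice those out and save as a new variable: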

In [23]:
library_columns = df.columns[2:-1]

In [24]:
print(library_columns)

Index(['Bellevue Library', 'Bordeaux Library', 'Donelson Library',
       'East Library', 'Edgehill Library', 'Edmonson Pike Library',
       'Goodlettsville Library', 'Green Hills Library', 'Hadley Park Library',
       'Hermitage Library', 'Inglewood Library', 'Looby Library',
       'Madison Library', 'Main Library', 'North Library',
       'Old Hickory Library', 'Pruitt Library', 'Richland Park Library',
       'Southeast Library', 'Thompson Lane Library', 'Watkins Library'],
      dtype='object')
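Slicing by position works as long as the identifier columns stay put at the start and end of the file. If that layout might change, a name-based selection is one possible alternative. The sketch below compares the two on a shortened, illustrative stand-in for `df.columns`; the variable names are ours, not anything pandas requires.

```python
import pandas as pd

# A shortened, illustrative stand-in for df.columns
columns = pd.Index(['Month', 'Year', 'Bellevue Library',
                    'Watkins Library', 'Notes'])

id_columns = ['Month', 'Year', 'Notes']

# Positional slice: start at position 2, stop *before* the last item
by_position = columns[2:-1]

# Name-based alternative: keep every column that isn't an identifier
by_name = [col for col in columns if col not in id_columns]

print(list(by_position))
print(by_name)
```

Both approaches pick out the same library columns here. The name-based version would keep working if a new identifier column were added, provided we also add it to `id_columns`.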


Perfect. Now we're ready to melt. We'll pass the `melt()` method five arguments:
- `df`, the data frame we're melting
- `id_vars=['Month', 'Year', 'Notes']`, the "identification" columns
- `value_vars=library_columns`, the "value" columns we're melting
- `var_name='Library'`, the name of our new column that will hold the library name (default is "variable" if you don't specify)
- `value_name='Visits'`, the name of our new column with the visit counts (default is "value" if you don't specify)

We'll save the whole thing under a new variable, `melted`.

In [28]:
melted = pd.melt(df,
                 id_vars=['Month', 'Year', 'Notes'],
                 value_vars=library_columns,
                 var_name='Library',
                 value_name='Visits')

In [29]:
melted.head()

,Month,Year,Notes,Library,Visits
0,8,2011,,Bellevue Library,18641
1,7,2011,,Bellevue Library,18652
2,9,2011,,Bellevue Library,14720
3,10,2011,,Bellevue Library,15645
4,11,2011,,Bellevue Library,14203
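A quick sanity check after melting is to confirm that nothing was lost or duplicated: the melted frame should have one row for every combination of original row and melted column, and the visit totals should match. Here's a minimal sketch on made-up data (the branch names, note text and numbers are invented); the same two assertions could be run against `df`, `library_columns` and `melted`.

```python
import pandas as pd

# Made-up wide table: 3 months x 2 hypothetical branches
wide_df = pd.DataFrame({
    'Month': [1, 2, 3],
    'Year': [2020, 2020, 2020],
    'Notes': [None, None, 'BRANCH B CLOSED'],
    'Branch A': [100, 110, 120],
    'Branch B': [200, 210, 0],
})
value_columns = wide_df.columns[3:]

long_df = pd.melt(wide_df,
                  id_vars=['Month', 'Year', 'Notes'],
                  value_vars=value_columns,
                  var_name='Library',
                  value_name='Visits')

# One output row per (original row, melted column) pair
assert len(long_df) == len(wide_df) * len(value_columns)

# Melting rearranges values; it shouldn't change their total
assert long_df['Visits'].sum() == wide_df[value_columns].sum().sum()
```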


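One thing that remains untidy is the `Notes` field. In the original data, each note was attached to a whole month's row, even when it described only one branch. After melting, that note gets repeated on the row for every library that month.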

Here, for instance, is the data for November 2013 from our original data frame -- see the note at the end about why Old Hickory's number was 0 that month:

In [33]:
df[(df['Month'] == 11) & (df['Year'] == 2013)]

,Month,Year,Bellevue Library,Bordeaux Library,Donelson Library,East Library,Edgehill Library,Edmonson Pike Library,Goodlettsville Library,Green Hills Library,...,Madison Library,Main Library,North Library,Old Hickory Library,Pruitt Library,Richland Park Library,Southeast Library,Thompson Lane Library,Watkins Library,Notes
28,11,2013,11929,8220,12289,7191,4708,15622,19009,17542,...,22793,58710,7031,0,8303,10559,11693,7693,4658,OLD HICKORY LIBRARY CLOSED FOR RENOVATION


... and here's the same thing in our melted data:

In [34]:
melted[(melted['Month'] == 11) & (melted['Year'] == 2013)]

,Month,Year,Notes,Library,Visits
28,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Bellevue Library,11929
86,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Bordeaux Library,8220
144,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Donelson Library,12289
202,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,East Library,7191
260,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Edgehill Library,4708
318,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Edmonson Pike Library,15622
376,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Goodlettsville Library,19009
434,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Green Hills Library,17542
492,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Hadley Park Library,4067
550,11,2013,OLD HICKORY LIBRARY CLOSED FOR RENOVATION,Hermitage Library,13612
