# Code Louisville intro to Pandas

Pandas is a library that lets us work with data in a dataframe format, which is really useful for doing data analysis quickly. When importing, it is conventional to import it as `pd`, which allows it to be referenced later by typing `pd` rather than `pandas`.

In [39]:
import pandas as pd
import os

I'm using the `os` module to find the current working directory.

In [40]:
os.getcwd()

'C:\\Users\\natek\\Documents\\GitHub\\code_lou_python'

I downloaded the data from here: http://greaterlouisvilleproject.org/deep-drivers-of-change/education/ and saved it in my current working directory. Now I'm using the pandas `read_excel` function to read in the sheet named 'Edu County' and save it as a dataframe named edu_df. Then I use the `.head()` method to show the first 6 rows of the dataframe. Pandas also has a `pd.read_csv()` function. You can specify either an absolute file path or, more commonly, a relative file path, e.g. `pd.read_csv('data/my_data.csv')`.

In [41]:
edu_df = pd.read_excel('GLP-Codebook.xlsx', 'Edu County', index_col=None, na_values=['NA'])
edu_df.head(n = 6)

Unnamed: 0,city,county,state,display_name,FIPS,year,current,baseline,under_5_per,five_to_17_per,...,per_25_64_grad,per_25_34_assoc_plus,per_25_34_bach_plus,per_25_34_grad,bach_plus_per_all,bach_plus_per_white,bach_plus_per_black,bach_plus_per_hispanic,per_high_wage,enrolled_3_4
0,Birmingham,Jefferson,AL,BIR,1073,2005,1,1,27.561629,24.424063,...,14.957714,36.986318,32.212249,9.140147,24.863193,34.561143,15.130703,9.632123,37.918348,42.8
1,Jacksonville,Duval,FL,JAC,12031,2005,0,1,18.102449,14.348973,...,11.562865,35.05149,27.663203,6.819745,23.493214,27.910648,16.086584,23.423926,37.46739,51.2
2,Indianapolis,Marion,IN,IND,18097,2005,1,1,26.649174,19.114679,...,12.496485,39.578506,31.695338,6.908782,24.614262,30.634469,15.507141,8.733152,37.573134,44.5
3,Louisville,Jefferson,KY,LOU,21111,2005,1,1,27.751727,15.684337,...,15.934613,41.795956,33.119233,9.321286,25.140119,29.726655,13.401547,24.382419,37.728885,45.3
4,Grand Rapids,Kent,MI,GR,26081,2005,1,0,19.607122,15.606313,...,14.104029,39.952588,31.835959,7.23096,26.851961,30.713895,15.134037,6.954463,36.64657,45.9
5,Kansas City,Jackson,MO,KC,29095,2005,1,1,22.583039,19.766415,...,14.550455,36.692679,30.443096,7.133329,24.883282,31.588362,14.994156,11.896291,38.386257,44.2
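Since `pd.read_csv()` is only mentioned in passing above, here is a minimal sketch of it. The CSV text and the name `toy_df` are made up for illustration; `io.StringIO` stands in for a real file such as `'data/my_data.csv'` so the example runs without any file on disk.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file such as 'data/my_data.csv'.
csv_text = """city,year,child_per
Louisville,2005,19.2
Louisville,2006,NA
Omaha,2005,18.1
"""

# read_csv accepts a path or a file-like object; na_values marks 'NA' as missing.
toy_df = pd.read_csv(io.StringIO(csv_text), na_values=['NA'])

print(toy_df.head(n=2))  # first two rows
print(toy_df.dtypes)     # the column types pandas inferred
```

With a real file you would pass the path instead of the `StringIO` object; the rest stays the same.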


Using `.tail()` shows the last n rows of the dataframe. 

In [42]:
edu_df.tail(n = 6)

Unnamed: 0,city,county,state,display_name,FIPS,year,current,baseline,under_5_per,five_to_17_per,...,per_25_64_grad,per_25_34_assoc_plus,per_25_34_bach_plus,per_25_34_grad,bach_plus_per_all,bach_plus_per_white,bach_plus_per_black,bach_plus_per_hispanic,per_high_wage,enrolled_3_4
246,St. Louis,St. Louis both city and county,MO,STL,MERGED,2011,1,0,24.618629,20.95443,...,21.414279,49.728091,41.76624,14.455011,33.823896,43.948022,16.514943,24.942241,44.121865,61.120777
247,St. Louis,St. Louis both city and county,MO,STL,MERGED,2012,1,0,27.639459,23.246344,...,21.789192,54.045574,45.1478,15.853368,35.297808,45.292258,17.283133,25.286236,44.119923,54.407214
248,St. Louis,St. Louis both city and county,MO,STL,MERGED,2013,1,0,24.555998,21.949645,...,22.162962,53.763148,45.185529,15.40588,37.00755,47.702164,16.824685,32.867665,45.069622,56.264128
249,St. Louis,St. Louis both city and county,MO,STL,MERGED,2014,1,0,19.859922,20.170396,...,23.234263,55.968555,48.008356,17.986457,36.912506,47.231126,18.283916,34.87269,47.558253,56.556415
250,St. Louis,St. Louis both city and county,MO,STL,MERGED,2015,1,0,19.616353,19.718017,...,23.078656,54.385359,46.610637,17.854147,36.897841,47.847756,17.188374,32.551582,45.101145,63.928063
251,St. Louis,St. Louis both city and county,MO,STL,MERGED,2016,1,0,19.540549,17.330064,...,,,,,,,,,,


More generally, `.shape` will give the dimensions of the dataframe

In [43]:
edu_df.shape

(252, 23)

And passing the pandas dataframe to the list function will produce a list of all the column names.

In [44]:
list(edu_df)

['city',
 'county',
 'state',
 'display_name',
 'FIPS',
 'year',
 'current',
 'baseline',
 'under_5_per',
 'five_to_17_per',
 'child_per',
 'per_25_64_assoc_plus',
 'per_25_64_bach_plus',
 'per_25_64_grad',
 'per_25_34_assoc_plus',
 'per_25_34_bach_plus',
 'per_25_34_grad',
 'bach_plus_per_all',
 'bach_plus_per_white',
 'bach_plus_per_black',
 'bach_plus_per_hispanic',
 'per_high_wage',
 'enrolled_3_4']

You can use the `.iloc` indexer to pull data based on its integer position (row first, then column).

In [45]:
edu_df.iloc[3, 8]

27.7517270253297

While calling `.iloc[3, 8]` will give you Louisville's under-5 child poverty rate in 2005 (the `under_5_per` column), I don't recommend doing it this way. It's better to select by column name and then filter down to the row(s) you want. It's way too easy to make a mistake with numerical indices. Selecting by column name and filtering data based on values is covered a bit later in this intro.

You can pull more than one row and column using the `:` operator. The first index is included, while the second one is excluded, so in the example below asking for rows `1:4` includes row 1 but not row 4.

In [46]:
edu_df.iloc[1:4, 1:9]

Unnamed: 0,county,state,display_name,FIPS,year,current,baseline,under_5_per
1,Duval,FL,JAC,12031,2005,0,1,18.102449
2,Marion,IN,IND,18097,2005,1,1,26.649174
3,Jefferson,KY,LOU,21111,2005,1,1,27.751727


The `:` operator can also be used to select everything up to a certain index, again exclusive of the index you use. So `:4` gives rows 0 to 3.

In [47]:
edu_df.iloc[:4, :9]

Unnamed: 0,city,county,state,display_name,FIPS,year,current,baseline,under_5_per
0,Birmingham,Jefferson,AL,BIR,1073,2005,1,1,27.561629
1,Jacksonville,Duval,FL,JAC,12031,2005,0,1,18.102449
2,Indianapolis,Marion,IN,IND,18097,2005,1,1,26.649174
3,Louisville,Jefferson,KY,LOU,21111,2005,1,1,27.751727
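One thing to watch: the "end is excluded" rule applies to position-based slicing with `.iloc`. Label-based slicing with `.loc` includes the end label. Here is a small sketch on a made-up dataframe (`toy_df` and its values are invented for illustration):

```python
import pandas as pd

# Made-up data with the default 0, 1, 2, ... index.
toy_df = pd.DataFrame({
    "city": ["Birmingham", "Jacksonville", "Indianapolis", "Louisville", "Omaha"],
    "under_5_per": [27.6, 18.1, 26.6, 27.8, 18.1],
})

by_position = toy_df.iloc[1:4]  # positions 1, 2, 3 (4 is excluded)
by_label = toy_df.loc[1:4]      # labels 1, 2, 3 and 4 (4 is included)

print(len(by_position))
print(len(by_label))
```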


Pandas makes it easy to get summary statistics for the whole dataframe. Note, though, that Pandas guesses what kind of data it is dealing with. Here it treats year as an ordinary number, and a mean year of 2010.5 doesn't make much sense.

In [48]:
edu_df.describe()

Unnamed: 0,year,current,baseline,under_5_per,five_to_17_per,child_per,per_25_64_assoc_plus,per_25_64_bach_plus,per_25_64_grad,per_25_34_assoc_plus,per_25_34_bach_plus,per_25_34_grad,bach_plus_per_all,bach_plus_per_white,bach_plus_per_black,bach_plus_per_hispanic,per_high_wage,enrolled_3_4
count,252.0,252.0,252.0,252.0,252.0,252.0,231.0,231.0,231.0,231.0,231.0,231.0,231.0,231.0,231.0,218.0,231.0,231.0
mean,2010.5,0.809524,0.714286,26.850341,22.353014,23.649377,42.19217,34.272072,16.577728,43.794477,36.270272,10.686201,30.168807,37.769598,17.735933,16.337495,40.864246,46.59705
std,3.458922,0.393458,0.452653,6.223837,6.039708,5.808153,5.674435,5.70377,2.788693,6.474984,6.83333,2.804736,5.25749,7.684775,4.378983,7.008783,3.355405,6.971813
min,2005.0,0.0,0.0,12.879569,9.924522,10.929946,33.932802,24.39351,11.130579,30.050042,23.672387,5.602442,21.448968,24.036965,9.645483,5.323617,34.870455,27.8
25%,2007.75,1.0,0.0,22.547792,18.652877,19.921239,37.672312,30.204792,14.526348,38.763829,30.567053,8.329107,26.32025,32.409471,14.798472,11.301737,38.696424,41.5
50%,2010.5,1.0,1.0,26.817761,21.711095,23.168039,41.253832,33.489888,16.477529,43.474687,36.016016,10.604136,29.321771,36.455164,16.757468,14.59189,40.217741,46.1
75%,2013.25,1.0,1.0,30.624183,25.348102,27.064674,45.109003,37.518617,18.018369,48.370058,41.17665,12.829775,32.619356,40.846901,19.318046,20.033535,42.363616,51.4
max,2016.0,1.0,1.0,45.911573,43.958317,42.666296,60.687238,52.225352,25.421147,59.661496,53.317076,17.986457,47.782932,59.818743,32.440581,41.113537,52.519105,63.928063


The type of a variable can be changed using the `.astype()` method. Here we make year a categorical variable, and it drops out of the `edu_df.describe()` output because, by default, `.describe()` summarizes only the numeric columns of a dataframe.

In [49]:
edu_df['year'] = edu_df['year'].astype('category')
edu_df.describe()

Unnamed: 0,current,baseline,under_5_per,five_to_17_per,child_per,per_25_64_assoc_plus,per_25_64_bach_plus,per_25_64_grad,per_25_34_assoc_plus,per_25_34_bach_plus,per_25_34_grad,bach_plus_per_all,bach_plus_per_white,bach_plus_per_black,bach_plus_per_hispanic,per_high_wage,enrolled_3_4
count,252.0,252.0,252.0,252.0,252.0,231.0,231.0,231.0,231.0,231.0,231.0,231.0,231.0,231.0,218.0,231.0,231.0
mean,0.809524,0.714286,26.850341,22.353014,23.649377,42.19217,34.272072,16.577728,43.794477,36.270272,10.686201,30.168807,37.769598,17.735933,16.337495,40.864246,46.59705
std,0.393458,0.452653,6.223837,6.039708,5.808153,5.674435,5.70377,2.788693,6.474984,6.83333,2.804736,5.25749,7.684775,4.378983,7.008783,3.355405,6.971813
min,0.0,0.0,12.879569,9.924522,10.929946,33.932802,24.39351,11.130579,30.050042,23.672387,5.602442,21.448968,24.036965,9.645483,5.323617,34.870455,27.8
25%,1.0,0.0,22.547792,18.652877,19.921239,37.672312,30.204792,14.526348,38.763829,30.567053,8.329107,26.32025,32.409471,14.798472,11.301737,38.696424,41.5
50%,1.0,1.0,26.817761,21.711095,23.168039,41.253832,33.489888,16.477529,43.474687,36.016016,10.604136,29.321771,36.455164,16.757468,14.59189,40.217741,46.1
75%,1.0,1.0,30.624183,25.348102,27.064674,45.109003,37.518617,18.018369,48.370058,41.17665,12.829775,32.619356,40.846901,19.318046,20.033535,42.363616,51.4
max,1.0,1.0,45.911573,43.958317,42.666296,60.687238,52.225352,25.421147,59.661496,53.317076,17.986457,47.782932,59.818743,32.440581,41.113537,52.519105,63.928063


But we can select it on its own to describe it. 

In [50]:
edu_df['year'].describe()

count      252
unique      12
top       2016
freq        21
Name: year, dtype: int64

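The `.describe()` method returns something different depending on the type of data it's passed. For a numeric column it gives the mean, standard deviation, and quartiles rather than counts of unique values.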

In [51]:
edu_df['child_per'].describe()

count    252.000000
mean      23.649377
std        5.808153
min       10.929946
25%       19.921239
50%       23.168039
75%       27.064674
max       42.666296
Name: child_per, dtype: float64
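To make the difference concrete, here is a small sketch on a made-up dataframe (`toy_df` and its numbers are invented). It compares the summary of the same column before and after the conversion to a category, and shows `include="all"`, which asks `.describe()` to report every column regardless of type.

```python
import pandas as pd

# Made-up data for illustration.
toy_df = pd.DataFrame({"year": [2005, 2005, 2006, 2006],
                       "child_per": [25.3, 19.2, 20.3, 22.8]})

numeric_summary = toy_df["year"].describe()   # count, mean, std, min, quartiles, max
toy_df["year"] = toy_df["year"].astype("category")
category_summary = toy_df["year"].describe()  # count, unique, top, freq

print(list(numeric_summary.index))
print(list(category_summary.index))

# By default only numeric columns are summarized; include="all" keeps every column.
print(list(toy_df.describe().columns))
print(list(toy_df.describe(include="all").columns))
```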

Pandas also makes it easy to create new variables by performing mathematical operations on already existing variables. 

In [52]:
edu_df['bach_plus_race_gap'] = edu_df['bach_plus_per_white'] - edu_df['bach_plus_per_black']
edu_df.head(n = 6)

Unnamed: 0,city,county,state,display_name,FIPS,year,current,baseline,under_5_per,five_to_17_per,...,per_25_34_assoc_plus,per_25_34_bach_plus,per_25_34_grad,bach_plus_per_all,bach_plus_per_white,bach_plus_per_black,bach_plus_per_hispanic,per_high_wage,enrolled_3_4,bach_plus_race_gap
0,Birmingham,Jefferson,AL,BIR,1073,2005,1,1,27.561629,24.424063,...,36.986318,32.212249,9.140147,24.863193,34.561143,15.130703,9.632123,37.918348,42.8,19.43044
1,Jacksonville,Duval,FL,JAC,12031,2005,0,1,18.102449,14.348973,...,35.05149,27.663203,6.819745,23.493214,27.910648,16.086584,23.423926,37.46739,51.2,11.824065
2,Indianapolis,Marion,IN,IND,18097,2005,1,1,26.649174,19.114679,...,39.578506,31.695338,6.908782,24.614262,30.634469,15.507141,8.733152,37.573134,44.5,15.127328
3,Louisville,Jefferson,KY,LOU,21111,2005,1,1,27.751727,15.684337,...,41.795956,33.119233,9.321286,25.140119,29.726655,13.401547,24.382419,37.728885,45.3,16.325108
4,Grand Rapids,Kent,MI,GR,26081,2005,1,0,19.607122,15.606313,...,39.952588,31.835959,7.23096,26.851961,30.713895,15.134037,6.954463,36.64657,45.9,15.579857
5,Kansas City,Jackson,MO,KC,29095,2005,1,1,22.583039,19.766415,...,36.692679,30.443096,7.133329,24.883282,31.588362,14.994156,11.896291,38.386257,44.2,16.594206
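A detail worth knowing about this kind of column arithmetic: it works row by row, and a missing value in either input gives a missing value in the result. A minimal sketch with made-up numbers (the column names mirror the ones above, but the values are invented):

```python
import pandas as pd

# Made-up values; None becomes NaN (missing) in a numeric column.
toy_df = pd.DataFrame({"bach_plus_per_white": [29.7, 34.6, None],
                       "bach_plus_per_black": [13.4, 15.1, 16.0]})

# Subtraction is applied element-wise, row by row.
toy_df["bach_plus_race_gap"] = (
    toy_df["bach_plus_per_white"] - toy_df["bach_plus_per_black"]
)

print(toy_df)
```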


Filtering data can be done by putting a condition inside brackets. So suppose we just want Louisville in the year 2005. We can filter to that, and then select the column for under age 5 child poverty. This is a better idea than using `.iloc` because it won't silently break if the underlying dataframe changes, and it's harder to make a mistake with column names and variable values (`city == "Louisville"`) than with index values.

In [53]:
filtered_df = edu_df[(edu_df.city == "Louisville") & (edu_df.year == 2005)]
filtered_df['under_5_per']

3    27.751727
Name: under_5_per, dtype: float64

We can also combine these operations to avoid creating a new dataframe.

In [75]:
edu_df[(edu_df.city == "Louisville") & (edu_df.year == 2005)]['under_5_per'] 

3    27.751727
Name: under_5_per, dtype: float64
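Another common way to write the same thing is `.loc[rows, columns]`, which takes the row condition and the column name in a single step. This is a sketch on made-up data (`toy_df`, `mask`, and the values are invented), not a change to the cells above.

```python
import pandas as pd

# Made-up data for illustration.
toy_df = pd.DataFrame({"city": ["Louisville", "Louisville", "Omaha"],
                       "year": [2005, 2006, 2005],
                       "under_5_per": [27.8, 25.1, 20.4]})

# Build the row condition once, then select rows and a column together.
mask = (toy_df["city"] == "Louisville") & (toy_df["year"] == 2005)
result = toy_df.loc[mask, "under_5_per"]

print(result)
```

Naming the mask also makes it easy to reuse the same filter for several columns.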

## Joining Data

Pandas also allows us to merge datasets together relatively painlessly. To start with, we'll need another dataset. Let's read in another sheet from the same Excel document.

In [55]:
jobs_df = pd.read_excel('GLP-Codebook.xlsx', 'Jobs County', index_col=None, na_values=['NA'])
jobs_df.head(n = 6)

Unnamed: 0,city,county,state,FIPS,year,current,baseline,median_earnings,income_inequality,median_household_income,personal_income_per_cap,unemployment
0,Birmingham,Jefferson,AL,1073,2005,1,1,26654.0,,41821.0,37711.0,4.4
1,Birmingham,Jefferson,AL,1073,2006,1,1,27026.0,,41731.0,40093.0,4.0
2,Birmingham,Jefferson,AL,1073,2007,1,1,29234.0,,44908.0,41109.0,3.9
3,Birmingham,Jefferson,AL,1073,2008,1,1,30284.0,,46269.0,42313.0,5.4
4,Birmingham,Jefferson,AL,1073,2009,1,1,28820.0,,43312.0,40596.0,10.8
5,Birmingham,Jefferson,AL,1073,2010,1,1,26796.0,16.989401,41740.0,42248.0,10.3


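Pandas has a `merge()` function that takes the two dataframes, the type of join (left, right, inner, outer), and the names of the columns to join on. When the key columns have the same names in both dataframes, as they do here, `on=['FIPS', 'year']` is an equivalent shorthand for passing both `left_on` and `right_on`.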

In [56]:
df = pd.merge(edu_df, jobs_df,  how='outer', left_on=['FIPS','year'], right_on = ['FIPS','year'])
df.head(n = 6)

Unnamed: 0,city_x,county_x,state_x,display_name,FIPS,year,current_x,baseline_x,under_5_per,five_to_17_per,...,city_y,county_y,state_y,current_y,baseline_y,median_earnings,income_inequality,median_household_income,personal_income_per_cap,unemployment
0,Birmingham,Jefferson,AL,BIR,1073,2005,1,1,27.561629,24.424063,...,Birmingham,Jefferson,AL,1.0,1.0,26654.0,,41821.0,37711.0,4.4
1,Jacksonville,Duval,FL,JAC,12031,2005,0,1,18.102449,14.348973,...,Jacksonville,Duval,FL,0.0,1.0,28666.0,,44694.0,34993.0,4.0
2,Indianapolis,Marion,IN,IND,18097,2005,1,1,26.649174,19.114679,...,Indianapolis,Marion,IN,1.0,1.0,27555.0,,42129.0,34376.0,5.6
3,Louisville,Jefferson,KY,LOU,21111,2005,1,1,27.751727,15.684337,...,Louisville,Jefferson,KY,1.0,1.0,27376.0,,40973.0,35987.0,5.9
4,Grand Rapids,Kent,MI,GR,26081,2005,1,0,19.607122,15.606313,...,Grand Rapids,Kent,MI,1.0,0.0,26750.0,,46637.0,35715.0,5.8
5,Kansas City,Jackson,MO,KC,29095,2005,1,1,22.583039,19.766415,...,Kansas City,Jackson,MO,1.0,1.0,28503.0,,43284.0,31837.0,6.4


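Notice that pandas even renamed the duplicated columns that were not used as join keys, adding the suffix `_x` for the left dataframe and `_y` for the right one. So city was in both datasets and now there is city_x and city_y.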

In [57]:
list(df)

['city_x',
 'county_x',
 'state_x',
 'display_name',
 'FIPS',
 'year',
 'current_x',
 'baseline_x',
 'under_5_per',
 'five_to_17_per',
 'child_per',
 'per_25_64_assoc_plus',
 'per_25_64_bach_plus',
 'per_25_64_grad',
 'per_25_34_assoc_plus',
 'per_25_34_bach_plus',
 'per_25_34_grad',
 'bach_plus_per_all',
 'bach_plus_per_white',
 'bach_plus_per_black',
 'bach_plus_per_hispanic',
 'per_high_wage',
 'enrolled_3_4',
 'bach_plus_race_gap',
 'city_y',
 'county_y',
 'state_y',
 'current_y',
 'baseline_y',
 'median_earnings',
 'income_inequality',
 'median_household_income',
 'personal_income_per_cap',
 'unemployment']
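If `_x` and `_y` are too cryptic, `merge()` accepts a `suffixes` argument, and `indicator=True` adds a `_merge` column showing whether each row came from the left dataframe, the right one, or both. Here is a sketch on two made-up dataframes (`edu_toy`, `jobs_toy`, and all values are invented for illustration):

```python
import pandas as pd

# Two made-up dataframes that share the key columns FIPS and year.
edu_toy = pd.DataFrame({"FIPS": [21111, 21111, 1073],
                        "year": [2005, 2006, 2005],
                        "city": ["Louisville", "Louisville", "Birmingham"],
                        "child_per": [19.2, 22.8, 25.3]})
jobs_toy = pd.DataFrame({"FIPS": [21111, 1073, 1073],
                         "year": [2005, 2005, 2006],
                         "city": ["Louisville", "Birmingham", "Birmingham"],
                         "unemployment": [5.9, 4.4, 4.0]})

# Outer join on the shared keys, with readable suffixes for the
# duplicated 'city' column and a column recording each row's origin.
merged = pd.merge(edu_toy, jobs_toy, how="outer", on=["FIPS", "year"],
                  suffixes=("_edu", "_jobs"), indicator=True)

print(merged)
```

Rows marked `left_only` or `right_only` in `_merge` are the ones an inner join would have dropped.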

That's more columns than we need for this example workbook. The pandas `.filter()` method can be used to select a subset of the columns.

In [58]:
df_sel = df.filter(items = ['city_x', 'year', 'current_x', 'child_per', 'per_25_64_bach_plus', 'per_high_wage'])
df_sel.head(n = 6)

Unnamed: 0,city_x,year,current_x,child_per,per_25_64_bach_plus,per_high_wage
0,Birmingham,2005,1,25.315549,29.910793,37.918348
1,Jacksonville,2005,0,15.413964,26.062117,37.46739
2,Indianapolis,2005,1,21.445334,27.932248,37.573134
3,Louisville,2005,1,19.195906,30.168911,37.728885
4,Grand Rapids,2005,1,16.747064,30.620225,36.64657
5,Kansas City,2005,1,20.627214,29.683065,38.386257


In this data, current is an indicator variable that takes 1 if the city is from the current peer city list, and 0 otherwise. There is an older peer city list called baseline. If you work for the city, you should instead keep the `baseline_x` column and filter to the rows where that variable is equal to 1.

In [59]:
df_sel = df_sel[(df_sel.current_x == 1)]
df_sel['current_x'].describe()

count    204.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
Name: current_x, dtype: float64

Some of these names are kind of unwieldy. Let's rename things. We already showed how to select columns, but now that `current_x` only takes the value of 1, we can drop it from the dataframe.

In [60]:
df_sel = df_sel.rename(columns = {"city_x": "city",
                                  "per_25_64_bach_plus" :"bach", 
                                  "child_per":"child_pov", 
                                  "per_high_wage":"high_wage_jobs"})
df_sel = df_sel.drop('current_x', axis = 1)
list(df_sel)

['city', 'year', 'child_pov', 'bach', 'high_wage_jobs']

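Pandas easily allows us to look for correlations across all of the numeric columns using the `.corr()` method. In the pandas version used here, the non-numeric columns (`city` and the categorical `year`) are left out automatically; pandas 2.0 and later raise an error instead unless you pass `numeric_only=True`.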

In [61]:
df_sel.corr()

Unnamed: 0,child_pov,bach,high_wage_jobs
child_pov,1.0,-0.382622,-0.243805
bach,-0.382622,1.0,0.809804
high_wage_jobs,-0.243805,0.809804,1.0
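A version-independent way to get the same table is to pick out the numeric columns yourself before calling `.corr()`. This is a sketch on made-up data (`toy_df` is invented, and the numbers are chosen so the two columns are perfectly negatively related):

```python
import pandas as pd

# Made-up data: bach falls on a straight line as child_pov rises.
toy_df = pd.DataFrame({"city": ["A", "B", "C", "D"],
                       "child_pov": [25.0, 20.0, 15.0, 10.0],
                       "bach": [30.0, 34.0, 38.0, 42.0]})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy_df.select_dtypes(include="number").corr()

print(corr)
```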


## Reshaping Data

A common operation in data science is to transform data from wide to long and vice versa. The data is currently in a long format. It's 204 rows and 5 columns. 

In [62]:
df_sel.shape

(204, 5)

In [63]:
df_T = df_sel.T
df_T

Unnamed: 0,0,2,3,4,5,6,7,8,10,11,...,242,243,244,245,246,247,248,249,250,251
city,Birmingham,Indianapolis,Louisville,Grand Rapids,Kansas City,Omaha,Greensboro,Charlotte,Columbus,Cincinnati,...,St. Louis,St. Louis,St. Louis,St. Louis,St. Louis,St. Louis,St. Louis,St. Louis,St. Louis,St. Louis
year,2005,2005,2005,2005,2005,2005,2005,2005,2005,2005,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
child_pov,25.3155,21.4453,19.1959,16.7471,20.6272,18.1333,22.11,15.9939,20.1423,19.6714,...,17.6927,18.774,21.5629,20.5231,21.9453,24.3908,22.6118,19.9176,19.6283,17.9376
bach,29.9108,27.9322,30.1689,30.6202,29.6831,37.4143,33.606,40.6449,36.5409,33.4899,...,37.9208,38.0942,39.0094,40.1169,39.3952,41.0819,41.9703,43.0362,42.7386,
high_wage_jobs,37.9183,37.5731,37.7289,36.6466,38.3863,39.7768,38.9323,41.225,42.8394,40.0404,...,41.1653,42.8837,44.5627,44.287,44.1219,44.1199,45.0696,47.5583,45.1011,


In [64]:
df_wide = df_sel.pivot(index = 'city', columns = 'year')
df_wide

Unnamed: 0_level_0,child_pov,child_pov,child_pov,child_pov,child_pov,child_pov,child_pov,child_pov,child_pov,child_pov,...,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs,high_wage_jobs
year,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
city,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Birmingham,25.315549,20.320927,22.245694,19.170431,23.482029,28.440403,27.63346,28.042259,27.24596,30.323325,...,40.247859,38.780646,39.293683,40.61958,43.411498,42.614665,42.250147,41.483709,42.648328,
Charlotte,15.99388,14.578691,13.138992,13.936343,19.776212,21.160727,23.680119,21.769929,19.922452,20.96877,...,43.503502,43.385801,44.98427,42.191889,43.84788,44.341902,43.212731,45.363812,45.937242,
Cincinnati,19.671406,21.076115,18.856582,20.095994,20.699191,28.851469,27.888509,30.485841,26.066482,23.984059,...,39.715397,40.528561,41.732643,43.167148,41.825659,41.397308,42.696113,43.524812,44.372491,
Columbus,20.142266,21.630252,21.528461,19.593149,25.802145,25.183748,26.396702,25.150565,25.526799,24.219905,...,42.983517,44.419126,44.932657,41.897663,42.419771,43.455367,44.363557,45.464854,44.912075,
Grand Rapids,16.747064,16.929896,18.602906,19.239829,21.006811,23.955846,19.906764,24.451626,18.20052,20.772052,...,36.044141,36.949593,36.302497,38.798805,38.204896,37.850412,39.541624,39.065046,37.664578,
Greensboro,22.110041,21.160935,21.469246,18.762812,22.459487,27.061878,23.612493,24.700583,27.774465,25.507547,...,38.45175,39.007671,41.358646,38.079074,39.722266,39.800763,39.911456,40.217741,41.336777,
Greenville,18.752905,19.712468,16.139407,19.464235,22.44937,21.304798,25.776793,24.406126,26.838484,22.050896,...,40.234581,38.024864,40.290231,40.618761,37.943205,40.472862,40.193497,41.296585,41.351745,
Indianapolis,21.445334,23.184057,22.704913,23.741446,28.478428,30.75317,32.428969,32.820737,30.359006,32.330276,...,36.92088,38.092754,39.79196,36.332633,36.421052,36.191032,37.752605,36.58739,37.400165,
Kansas City,20.627214,24.379493,22.663317,20.821379,23.765657,25.384729,29.033423,29.063901,24.839133,23.858601,...,38.505016,40.787834,37.985223,39.851821,38.957568,39.455111,37.819542,40.351286,38.833851,
Knoxville,19.177288,15.811311,15.287721,18.151228,19.008549,15.373321,19.122892,20.845092,20.657517,24.292448,...,40.735296,43.497463,42.578596,43.946215,43.276737,43.272303,42.745922,41.886823,44.98753,


In [65]:
df_wide.shape

(17, 36)

Pivoting the dataframe resulted in a dataframe with hierarchical (multi-level) columns: the variable name is on the top level and the year is below it. So now selecting a name at the top level can return more than one column.

In [66]:
df_wide['child_pov']

year,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Birmingham,25.315549,20.320927,22.245694,19.170431,23.482029,28.440403,27.63346,28.042259,27.24596,30.323325,26.253143,21.315103
Charlotte,15.99388,14.578691,13.138992,13.936343,19.776212,21.160727,23.680119,21.769929,19.922452,20.96877,18.597405,18.001947
Cincinnati,19.671406,21.076115,18.856582,20.095994,20.699191,28.851469,27.888509,30.485841,26.066482,23.984059,23.055194,23.519682
Columbus,20.142266,21.630252,21.528461,19.593149,25.802145,25.183748,26.396702,25.150565,25.526799,24.219905,24.601604,24.422872
Grand Rapids,16.747064,16.929896,18.602906,19.239829,21.006811,23.955846,19.906764,24.451626,18.20052,20.772052,20.186184,14.705107
Greensboro,22.110041,21.160935,21.469246,18.762812,22.459487,27.061878,23.612493,24.700583,27.774465,25.507547,23.187433,26.832577
Greenville,18.752905,19.712468,16.139407,19.464235,22.44937,21.304798,25.776793,24.406126,26.838484,22.050896,17.393454,14.364423
Indianapolis,21.445334,23.184057,22.704913,23.741446,28.478428,30.75317,32.428969,32.820737,30.359006,32.330276,32.170614,28.464493
Kansas City,20.627214,24.379493,22.663317,20.821379,23.765657,25.384729,29.033423,29.063901,24.839133,23.858601,27.073061,24.749306
Knoxville,19.177288,15.811311,15.287721,18.151228,19.008549,15.373321,19.122892,20.845092,20.657517,24.292448,20.664445,17.292304


And we can call down multiple index levels, making it easy to select all our cities for 2016

In [67]:
df_wide['child_pov'][2016]

city
Birmingham       21.315103
Charlotte        18.001947
Cincinnati       23.519682
Columbus         24.422872
Grand Rapids     14.705107
Greensboro       26.832577
Greenville       14.364423
Indianapolis     28.464493
Kansas City      24.749306
Knoxville        17.292304
Louisville       20.652497
Memphis          34.527179
Nashville        22.310971
Oklahoma City    25.172032
Omaha            16.243179
St. Louis        17.937648
Tulsa            24.687908
Name: 2016, dtype: float64
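Here is a small sketch of the same idea on made-up data (`long_toy` and its values are invented). It also shows two alternatives to chained brackets: passing a tuple with one label per level, and `.xs()`, which pulls one year across every variable.

```python
import pandas as pd

# Made-up long-format data: one row per city and year.
long_toy = pd.DataFrame({"city": ["Louisville", "Louisville", "Omaha", "Omaha"],
                         "year": [2015, 2016, 2015, 2016],
                         "child_pov": [21.0, 20.7, 17.0, 16.2],
                         "bach": [41.0, 42.0, 44.0, 45.0]})

# Pivoting without `values` keeps every remaining column, producing
# two column levels: variable name on top, year underneath.
wide_toy = long_toy.pivot(index="city", columns="year")
print(wide_toy.shape)

two_step = wide_toy["child_pov"][2016]    # chained selection, as above
one_step = wide_toy[("child_pov", 2016)]  # one tuple, one label per level

# Cross-section: the 2016 value of every variable.
all_2016 = wide_toy.xs(2016, axis=1, level="year")
print(all_2016)
```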

You can also sort by using the `.sort_values()` method. Note the parentheses: without them, Python returns the method object itself instead of calling it, and nothing gets sorted.

In [68]:
df_wide['child_pov'][2016].sort_values()

city
Greenville       14.364423
Grand Rapids     14.705107
Omaha            16.243179
Knoxville        17.292304
St. Louis        17.937648
Charlotte        18.001947
Louisville       20.652497
Birmingham       21.315103
Nashville        22.310971
Cincinnati       23.519682
Columbus         24.422872
Tulsa            24.687908
Kansas City      24.749306
Oklahoma City    25.172032
Greensboro       26.832577
Indianapolis     28.464493
Memphis          34.527179
Name: 2016, dtype: float64

By default the values are sorted in ascending order. But what if we want the whole table ordered by child poverty in 2016, with the highest rates first?

In [69]:
df_wide['child_pov'].sort_values(by=[2016], ascending = False)

year,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Memphis,26.706115,27.671811,30.589489,27.735371,32.033956,29.934006,31.656765,32.757354,35.827601,35.546465,32.350763,34.527179
Indianapolis,21.445334,23.184057,22.704913,23.741446,28.478428,30.75317,32.428969,32.820737,30.359006,32.330276,32.170614,28.464493
Greensboro,22.110041,21.160935,21.469246,18.762812,22.459487,27.061878,23.612493,24.700583,27.774465,25.507547,23.187433,26.832577
Oklahoma City,26.731978,25.148822,23.022413,23.690517,25.306982,28.927845,26.94586,28.766366,27.939984,26.871078,24.711061,25.172032
Kansas City,20.627214,24.379493,22.663317,20.821379,23.765657,25.384729,29.033423,29.063901,24.839133,23.858601,27.073061,24.749306
Tulsa,22.225332,24.058133,23.281382,20.012369,22.150032,24.413512,21.472337,25.154074,24.703484,21.455671,24.531516,24.687908
Columbus,20.142266,21.630252,21.528461,19.593149,25.802145,25.183748,26.396702,25.150565,25.526799,24.219905,24.601604,24.422872
Cincinnati,19.671406,21.076115,18.856582,20.095994,20.699191,28.851469,27.888509,30.485841,26.066482,23.984059,23.055194,23.519682
Nashville,23.102665,25.941843,24.676335,27.728797,27.305119,32.226072,30.487724,29.363647,30.490555,33.14479,27.535363,22.310971
Birmingham,25.315549,20.320927,22.245694,19.170431,23.482029,28.440403,27.63346,28.042259,27.24596,30.323325,26.253143,21.315103
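For a single column (a Series), the sorting options can be sketched like this. The Series `pov_2016` and its values are made up for illustration.

```python
import pandas as pd

# Made-up child poverty rates indexed by city.
pov_2016 = pd.Series({"Louisville": 20.7, "Memphis": 34.5, "Omaha": 16.2},
                     name=2016)

lowest_first = pov_2016.sort_values()                  # ascending by value
highest_first = pov_2016.sort_values(ascending=False)  # descending by value
alphabetical = pov_2016.sort_index()                   # by city name instead

print(highest_first)
```

All three calls return a new Series and leave `pov_2016` itself unchanged.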


Reshaping hierarchical data frames is difficult, so I'm going to cut down to just the child poverty data

In [70]:
df_wide = df_wide['child_pov']
df_wide.head(n = 6)

year,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Birmingham,25.315549,20.320927,22.245694,19.170431,23.482029,28.440403,27.63346,28.042259,27.24596,30.323325,26.253143,21.315103
Charlotte,15.99388,14.578691,13.138992,13.936343,19.776212,21.160727,23.680119,21.769929,19.922452,20.96877,18.597405,18.001947
Cincinnati,19.671406,21.076115,18.856582,20.095994,20.699191,28.851469,27.888509,30.485841,26.066482,23.984059,23.055194,23.519682
Columbus,20.142266,21.630252,21.528461,19.593149,25.802145,25.183748,26.396702,25.150565,25.526799,24.219905,24.601604,24.422872
Grand Rapids,16.747064,16.929896,18.602906,19.239829,21.006811,23.955846,19.906764,24.451626,18.20052,20.772052,20.186184,14.705107
Greensboro,22.110041,21.160935,21.469246,18.762812,22.459487,27.061878,23.612493,24.700583,27.774465,25.507547,23.187433,26.832577


One final note before reshaping: you can use `.T` to transpose a dataframe.

In [71]:
df_wide.T

city,Birmingham,Charlotte,Cincinnati,Columbus,Grand Rapids,Greensboro,Greenville,Indianapolis,Kansas City,Knoxville,Louisville,Memphis,Nashville,Oklahoma City,Omaha,St. Louis,Tulsa
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2005,25.315549,15.99388,19.671406,20.142266,16.747064,22.110041,18.752905,21.445334,20.627214,19.177288,19.195906,26.706115,23.102665,26.731978,18.133262,19.578897,22.225332
2006,20.320927,14.578691,21.076115,21.630252,16.929896,21.160935,19.712468,23.184057,24.379493,15.811311,22.752912,27.671811,25.941843,25.148822,15.578458,20.407225,24.058133
2007,22.245694,13.138992,18.856582,21.528461,18.602906,21.469246,16.139407,22.704913,22.663317,15.287721,21.557909,30.589489,24.676335,23.022413,17.420406,17.692664,23.281382
2008,19.170431,13.936343,20.095994,19.593149,19.239829,18.762812,19.464235,23.741446,20.821379,18.151228,23.133498,27.735371,27.728797,23.690517,16.002963,18.774046,20.012369
2009,23.482029,19.776212,20.699191,25.802145,21.006811,22.459487,22.44937,28.478428,23.765657,19.008549,22.546513,32.033956,27.305119,25.306982,17.70067,21.562939,22.150032
2010,28.440403,21.160727,28.851469,25.183748,23.955846,27.061878,21.304798,30.75317,25.384729,15.373321,25.29605,29.934006,32.226072,28.927845,21.855716,20.523139,24.413512
2011,27.63346,23.680119,27.888509,26.396702,19.906764,23.612493,25.776793,32.428969,29.033423,19.122892,27.529822,31.656765,30.487724,26.94586,20.427377,21.945286,21.472337
2012,28.042259,21.769929,30.485841,25.150565,24.451626,24.700583,24.406126,32.820737,29.063901,20.845092,26.873324,32.757354,29.363647,28.766366,20.95135,24.390811,25.154074
2013,27.24596,19.922452,26.066482,25.526799,18.20052,27.774465,26.838484,30.359006,24.839133,20.657517,22.447762,35.827601,30.490555,27.939984,20.49595,22.611822,24.703484
2014,30.323325,20.96877,23.984059,24.219905,20.772052,25.507547,22.050896,32.330276,23.858601,24.292448,24.358769,35.546465,33.14479,26.871078,18.929419,19.9176,21.455671


Every pandas dataframe has an index that isn't strictly part of the data. By default it's 0, 1, 2, 3, etc. However, when I cast the data from long to wide, I set the index to the city values. Now we need to undo that before melting/gathering the data.

In [72]:
df_wide.reset_index(level=0, inplace = True)
df_wide.head(n = 6)

year,city,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Birmingham,25.315549,20.320927,22.245694,19.170431,23.482029,28.440403,27.63346,28.042259,27.24596,30.323325,26.253143,21.315103
1,Charlotte,15.99388,14.578691,13.138992,13.936343,19.776212,21.160727,23.680119,21.769929,19.922452,20.96877,18.597405,18.001947
2,Cincinnati,19.671406,21.076115,18.856582,20.095994,20.699191,28.851469,27.888509,30.485841,26.066482,23.984059,23.055194,23.519682
3,Columbus,20.142266,21.630252,21.528461,19.593149,25.802145,25.183748,26.396702,25.150565,25.526799,24.219905,24.601604,24.422872
4,Grand Rapids,16.747064,16.929896,18.602906,19.239829,21.006811,23.955846,19.906764,24.451626,18.20052,20.772052,20.186184,14.705107
5,Greensboro,22.110041,21.160935,21.469246,18.762812,22.459487,27.061878,23.612493,24.700583,27.774465,25.507547,23.187433,26.832577
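As an alternative to `inplace = True`, `.reset_index()` can return a new dataframe that you assign to a name. This is a sketch on a made-up wide table (`wide_toy` and `flat_toy` are invented names); the original is left untouched.

```python
import pandas as pd

# Made-up wide table with city names as the index.
wide_toy = pd.DataFrame({2015: [21.0, 17.0], 2016: [20.7, 16.2]},
                        index=pd.Index(["Louisville", "Omaha"], name="city"))

# reset_index() moves 'city' into an ordinary column and restores 0, 1, 2, ...
flat_toy = wide_toy.reset_index()

print(flat_toy)
```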


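And now we're ready to put our data back into long format using `pd.melt()`. The new variable column is named `year` automatically because that is the name of the column index in `df_wide`.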

In [73]:
df_long = pd.melt(frame = df_wide, 
                  col_level = 0,
                  id_vars = ['city'],
                  value_vars = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016],
                  value_name = "child_pov")
df_long

Unnamed: 0,city,year,child_pov
0,Birmingham,2005,25.315549
1,Charlotte,2005,15.993880
2,Cincinnati,2005,19.671406
3,Columbus,2005,20.142266
4,Grand Rapids,2005,16.747064
5,Greensboro,2005,22.110041
6,Greenville,2005,18.752905
7,Indianapolis,2005,21.445334
8,Kansas City,2005,20.627214
9,Knoxville,2005,19.177288
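Two things are worth sketching here on a small made-up table (`wide_toy` and its values are invented). First, if `value_vars` is left out, `pd.melt()` melts every column not listed in `id_vars`, so the years don't have to be typed out. Second, when the column index has no name, `var_name` sets the name of the new variable column. Pivoting the result takes us back to wide format.

```python
import pandas as pd

# Made-up wide table: one row per city, one column per year.
wide_toy = pd.DataFrame({"city": ["Louisville", "Omaha"],
                         2015: [21.0, 17.0],
                         2016: [20.7, 16.2]})

# Wide to long: every non-id column is melted into (year, child_pov) pairs.
long_toy = pd.melt(wide_toy, id_vars=["city"],
                   var_name="year", value_name="child_pov")
print(long_toy)

# Long back to wide, to check the round trip.
back_to_wide = long_toy.pivot(index="city", columns="year", values="child_pov")
print(back_to_wide)
```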
