# Pandas Munging Explorations

Here's what the data looks like: several separate data sets, one per year, each containing the same 4 records that I want to extract.

```
2007_data.csv
GEOBASID|STNAME|YEAR|AAWDT
60150010|COLUMBIA ST ON RP|2007|9500
65700010|SENECA ST OFF RP|2007|6800
96200010|WESTERN ST ON RP|2007|7200
104300010|WESTERN ST OFF RP|2007|6500

2008_data.csv
GEOBASID|STNAME|YEAR|AAWDT
60150010|COLUMBIA ST ON RP|2008|9510
65700010|SENECA ST OFF RP|2008|6810
96200010|WESTERN ST ON RP|2008|7210
104300010|WESTERN ST OFF RP|2008|6510

2009_data.csv
GEOBASID|STNAME|YEAR|AAWDT
60150010|COLUMBIA ST ON RP|2009|9520
65700010|SENECA ST OFF RP|2009|6820
96200010|WESTERN ST ON RP|2009|7220
104300010|WESTERN ST OFF RP|2009|6520

2010_data.csv
GEOBASID|STNAME|YEAR|AAWDT
60150010|COLUMBIA ST ON RP|2010|9530
65700010|SENECA ST OFF RP|2010|6830
96200010|WESTERN ST ON RP|2010|7230
104300010|WESTERN ST OFF RP|2010|6530
```

I'd like to end up with a data frame that looks similar to this.

```
GEOBASID |STNAME           |2007|2008|2009|2010
60150010 |COLUMBIA ST ON RP|9500|9510|9520|9530
65700010 |SENECA ST OFF RP |6800|6810|6820|6830
96200010 |WESTERN ST ON RP |7200|7210|7220|7230
104300010|WESTERN ST OFF RP|6500|6510|6520|6530
```



In [2]:
# STEP 1, READ DATA INTO DATAFRAMES

import pandas as pd

data_2007 = pd.read_csv('./2007_data.csv', sep='|')
data_2008 = pd.read_csv('./2008_data.csv', sep='|')
data_2009 = pd.read_csv('./2009_data.csv', sep='|')
data_2010 = pd.read_csv('./2010_data.csv', sep='|')

In [3]:
# CHANGE THE AAWDT COLUMN NAME TO BE THE YEAR
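def set_year(dataframe):
    # Every row in a file shares the same year, so read it from the first row
    year_idx = dataframe.columns.get_loc('YEAR')
    year = dataframe.iloc[0, year_idx]
    # rename(..., inplace=True) returns None, so don't assign its result
    dataframe.rename(columns={'AAWDT': year}, inplace=True)
    return dataframe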

set_year(data_2007)
set_year(data_2008)
set_year(data_2009)
set_year(data_2010)
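
As an alternative to mutating each frame and then dropping `YEAR` in a separate cell, here is a minimal sketch of a helper that returns a new frame with both steps done. The names `year_column` and `sample` are hypothetical, and `sample` is a two-row stand-in for `data_2007` so the sketch runs without the CSV files.

```python
import pandas as pd

def year_column(dataframe):
    """Return a new frame with AAWDT renamed to the file's year and YEAR dropped."""
    year = dataframe['YEAR'].iloc[0]  # assumes every row in the frame shares one year
    return dataframe.rename(columns={'AAWDT': year}).drop(columns='YEAR')

# Two-row stand-in for data_2007 (hypothetical, built inline)
sample = pd.DataFrame({
    'GEOBASID': [60150010, 65700010],
    'STNAME': ['COLUMBIA ST ON RP', 'SENECA ST OFF RP'],
    'YEAR': [2007, 2007],
    'AAWDT': [9500, 6800],
})

renamed = year_column(sample)
print(list(renamed.columns))
```

Because nothing is modified in place, `sample` keeps its original columns and the helper can be re-run safely.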

In [4]:
data_2007.drop('YEAR', axis=1, inplace=True)

In [5]:
data_2007

Unnamed: 0,GEOBASID,STNAME,2007
0,60150010,COLUMBIA ST ON RP,9500
1,65700010,SENECA ST OFF RP,6800
2,96200010,WESTERN ST ON RP,7200
3,104300010,WESTERN ST OFF RP,6500


In [6]:
data_2008.drop('YEAR', axis=1, inplace=True)

In [7]:
data_2008

Unnamed: 0,GEOBASID,STNAME,2008
0,60150010,COLUMBIA ST ON RP,9510
1,65700010,SENECA ST OFF RP,6810
2,96200010,WESTERN ST ON RP,7210
3,104300010,WESTERN ST OFF RP,6510


In [8]:
data_2009.drop('YEAR', axis=1, inplace=True)

In [9]:
data_2009

Unnamed: 0,GEOBASID,STNAME,2009
0,60150010,COLUMBIA ST ON RP,9520
1,65700010,SENECA ST OFF RP,6820
2,96200010,WESTERN ST ON RP,7220
3,104300010,WESTERN ST OFF RP,6520


In [10]:
data_2010.drop('YEAR', axis=1, inplace=True)

In [11]:
data_2010

Unnamed: 0,GEOBASID,STNAME,2010
0,60150010,COLUMBIA ST ON RP,9530
1,65700010,SENECA ST OFF RP,6830
2,96200010,WESTERN ST ON RP,7230
3,104300010,WESTERN ST OFF RP,6530


In [12]:
# NOW HOW DO WE MERGE THESE SEPARATE DATAFRAMES INTO A SINGLE ONE?

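from functools import reduce  # reduce is not a builtin in Python 3

dfs = [data_2007, data_2008, data_2009, data_2010]

# With no `on` argument, pd.merge joins on the columns both frames share (GEOBASID and STNAME)
df_final = reduce(lambda left, right: pd.merge(left, right), dfs)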

df_final


Unnamed: 0,GEOBASID,STNAME,2007,2008,2009,2010
0,60150010,COLUMBIA ST ON RP,9500,9510,9520,9530
1,65700010,SENECA ST OFF RP,6800,6810,6820,6830
2,96200010,WESTERN ST ON RP,7200,7210,7220,7230
3,104300010,WESTERN ST OFF RP,6500,6510,6520,6530
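
The merge above relies on `pd.merge` picking the shared columns as join keys. A sketch of the same reduction with the keys spelled out might look like this; the small `frames` list is a hypothetical stand-in for the renamed per-year frames, and `keys` is a name introduced here for illustration.

```python
from functools import reduce

import pandas as pd

keys = ['GEOBASID', 'STNAME']
ids = [60150010, 65700010]
names = ['COLUMBIA ST ON RP', 'SENECA ST OFF RP']

# Two-row stand-ins for the per-year frames after renaming AAWDT and dropping YEAR
frames = [
    pd.DataFrame({'GEOBASID': ids, 'STNAME': names, 2007: [9500, 6800]}),
    pd.DataFrame({'GEOBASID': ids, 'STNAME': names, 2008: [9510, 6810]}),
    pd.DataFrame({'GEOBASID': ids, 'STNAME': names, 2009: [9520, 6820]}),
]

# Naming the keys keeps the join from changing if the frames ever share another column
merged = reduce(lambda left, right: pd.merge(left, right, on=keys, how='inner'), frames)
print(merged)
```

With `how='inner'`, a record missing from any one year drops out of the result; `how='outer'` would keep it and leave a gap for that year.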


In [24]:
# BELOW IS THE ADVICE FROM JAKE VANDERPLAS

# READ DATA FILES IN ONE SHOT
data = pd.concat([pd.read_csv('{0}_data.csv'.format(year), sep='|') for year in range(2007, 2011)])

data


Unnamed: 0,GEOBASID,STNAME,YEAR,AAWDT
0,60150010,COLUMBIA ST ON RP,2007,9500
1,65700010,SENECA ST OFF RP,2007,6800
2,96200010,WESTERN ST ON RP,2007,7200
3,104300010,WESTERN ST OFF RP,2007,6500
0,60150010,COLUMBIA ST ON RP,2008,9510
1,65700010,SENECA ST OFF RP,2008,6810
2,96200010,WESTERN ST ON RP,2008,7210
3,104300010,WESTERN ST OFF RP,2008,6510
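0,60150010,COLUMBIA ST ON RP,2009,9520
1,65700010,SENECA ST OFF RP,2009,6820
2,96200010,WESTERN ST ON RP,2009,7220
3,104300010,WESTERN ST OFF RP,2009,6520
0,60150010,COLUMBIA ST ON RP,2010,9530
1,65700010,SENECA ST OFF RP,2010,6830
2,96200010,WESTERN ST ON RP,2010,7230
3,104300010,WESTERN ST OFF RP,2010,6530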
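
Note that the index in the concatenated output restarts at 0 for each file. That is harmless for the pivot that follows, but if a unique row index is wanted, a small sketch with `ignore_index=True` shows the difference. The frames `df_2007` and `df_2008` are hypothetical two-row stand-ins for the yearly files.

```python
import pandas as pd

# Two-row stand-ins for two of the yearly files (hypothetical, built inline)
df_2007 = pd.DataFrame({'GEOBASID': [60150010, 65700010],
                        'YEAR': [2007, 2007],
                        'AAWDT': [9500, 6800]})
df_2008 = pd.DataFrame({'GEOBASID': [60150010, 65700010],
                        'YEAR': [2008, 2008],
                        'AAWDT': [9510, 6810]})

stacked = pd.concat([df_2007, df_2008])                        # keeps each file's own index
renumbered = pd.concat([df_2007, df_2008], ignore_index=True)  # builds one fresh index

print(list(stacked.index))
print(list(renumbered.index))
```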


In [20]:
# PIVOT TABLE TO MAKE YEAR VALUES BECOME INDIVIDUAL COLUMNS
pivoted_data = data.pivot_table(values='AAWDT', index=['GEOBASID', 'STNAME'], columns='YEAR')

pivoted_data.head()

Unnamed: 0_level_0,YEAR,2007,2008,2009,2010
GEOBASID,STNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
60150010,COLUMBIA ST ON RP,9500,9510,9520,9530
65700010,SENECA ST OFF RP,6800,6810,6820,6830
96200010,WESTERN ST ON RP,7200,7210,7220,7230
104300010,WESTERN ST OFF RP,6500,6510,6520,6530


In [23]:
# TO TURN THE INDICES INTO NORMAL COLUMNS, RESET THE INDEX
pivoted_data.reset_index()


YEAR,GEOBASID,STNAME,2007,2008,2009,2010
0,60150010,COLUMBIA ST ON RP,9500,9510,9520,9530
1,65700010,SENECA ST OFF RP,6800,6810,6820,6830
2,96200010,WESTERN ST ON RP,7200,7210,7220,7230
3,104300010,WESTERN ST OFF RP,6500,6510,6520,6530
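
After `reset_index()` the column axis still carries the name `YEAR`, which is why it shows up in the corner of the output above. A minimal sketch of clearing it, using a hypothetical four-row `long_data` frame in place of the concatenated data:

```python
import pandas as pd

# Long-format stand-in for the concatenated yearly data (hypothetical, built inline)
long_data = pd.DataFrame({
    'GEOBASID': [60150010, 60150010, 65700010, 65700010],
    'STNAME': ['COLUMBIA ST ON RP', 'COLUMBIA ST ON RP',
               'SENECA ST OFF RP', 'SENECA ST OFF RP'],
    'YEAR': [2007, 2008, 2007, 2008],
    'AAWDT': [9500, 9510, 6800, 6810],
})

wide = long_data.pivot_table(values='AAWDT',
                             index=['GEOBASID', 'STNAME'],
                             columns='YEAR').reset_index()
wide.columns.name = None  # drop the leftover 'YEAR' label on the column axis

print(list(wide.columns))
```

One thing to keep in mind: `pivot_table` aggregates with the mean by default, so a record that appeared twice for the same year would be averaged silently rather than raising an error.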
