# Lecture 3

The documentation covers:  
- Reshaping and Pivot Tables

## Reshaping and Pivot Tables

In [9]:
import numpy as np
import pandas as pd

home = pd.read_csv('data_processed/home.csv')
so = pd.read_csv('data_input/stackoverflow_qa.csv')

In [83]:
# add year and month columns derived from creationdate
so['questionyear'] = pd.DatetimeIndex(so['creationdate']).year
so['questionmonth'] = pd.DatetimeIndex(so['creationdate']).month
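
An equivalent way to derive these columns is to parse the timestamps once and read the parts through the `.dt` accessor. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `so`, with only the `creationdate` column), so it runs without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for the `so` frame; only creationdate matters here.
toy = pd.DataFrame({
    "creationdate": [
        "2011-09-28 01:58:38",
        "2012-06-01 04:40:16",
        "2013-01-15 10:00:00",
    ],
})

# Parse the strings once, then read the components through the .dt accessor.
created = pd.to_datetime(toy["creationdate"])
toy["questionyear"] = created.dt.year
toy["questionmonth"] = created.dt.month

print(toy[["questionyear", "questionmonth"]])
```

Parsing once into `created` avoids converting the same column twice.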

In [27]:
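# top 20 answerers, ranked by summed ans_rep and then by summed answercount
# numeric_only=True sums only the numeric columns (text and date columns are dropped)
top20 = so.groupby('ans_name').sum(numeric_only=True).sort_values(by=['ans_rep','answercount'], ascending=False).head(20)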
top20.head()

ans_name,id,score,viewcount,answercount,commentcount,favoritecount,quest_rep,ans_rep,questionyear,questionmonth
jezrael,228737021995,7860,1785324,7988,5177,856.0,6209834.0,1015956000.0,10962131,36140
unutbu,29959502937,4003,3511034,1462,926,1612.0,3359192.0,454985400.0,1972607,6278
EdChum,61904922571,4812,3580513,2692,3123,898.0,3354967.0,231571400.0,3752567,11192
piRSquared,81847238788,3762,542888,3192,1978,537.0,3896510.0,199053900.0,3910186,12830
MaxU,65214217669,2348,495154,2404,2121,360.0,2862882.0,131291800.0,3161929,9881


In [28]:
criteria = so['ans_name'].isin(top20.index)
so_selected = so[criteria]
so_selected.head()

,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep,questionyear,questionmonth
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0,2011,9
121,10844493,2012-06-01 04:40:16,5,4397,DataFrame.apply in python pandas alters both o...,1,0,1.0,MikeGruz,28.0,BrenBarn,136870.0,2012,6
127,10943478,2012-06-08 05:24:57,12,11115,pandas reindex DataFrame with datetime objects,1,0,4.0,BFTM,895.0,BrenBarn,136870.0,2012,6
130,10972410,2012-06-10 21:12:43,19,46428,pandas: combine two columns in a DataFrame,5,0,5.0,BFTM,895.0,BrenBarn,136870.0,2012,6
145,11067027,2012-06-16 21:05:01,115,85762,Python Pandas - Re-ordering columns in a dataf...,11,2,28.0,pythOnometrist,1068.0,BrenBarn,136870.0,2012,6
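
The filtering step above builds a boolean mask with `isin` and uses it to index the frame. A minimal sketch of the same pattern on made-up data (`toy` and `top_names` are hypothetical stand-ins for `so` and `top20.index`):

```python
import pandas as pd

# Hypothetical stand-in for the `so` frame.
toy = pd.DataFrame({
    "ans_name": ["HYRY", "BrenBarn", "someone", "HYRY"],
    "score": [9, 5, 1, 3],
})

# Hypothetical stand-in for top20.index.
top_names = ["HYRY", "BrenBarn"]

# isin returns one boolean per row: True where ans_name is in top_names.
mask = toy["ans_name"].isin(top_names)

# Indexing with the mask keeps only the rows marked True.
selected = toy[mask]
print(selected)
```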


### Pivot

In [90]:
# keep only the questions answered by one user (HYRY), with a few selected columns
hy = so.loc[so['ans_name'] == 'HYRY', ['title', 'questionyear', 'viewcount', 'commentcount', 'quest_name']]
hy.head()

,title,questionyear,viewcount,commentcount,quest_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson
216,Pandas xaxis auto-format issue,2012,613,0,joelhoro
367,Grouping data by multiple dates in pandas,2012,273,0,user1074057
722,Change Categorical Variable levels to What I p...,2012,1382,0,Tom Bennett
932,Plot key count per unique value count in pandas,2013,9115,0,monkut


In [91]:
hy.head(3)

,title,questionyear,viewcount,commentcount,quest_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson
216,Pandas xaxis auto-format issue,2012,613,0,joelhoro
367,Grouping data by multiple dates in pandas,2012,273,0,user1074057


In [95]:
# each (index, columns) pair must be unique, otherwise pivot raises a ValueError
hy.pivot(index='title', columns='questionyear', values='viewcount').head()

questionyear,2011,2012,2013,2014,2015,2016,2017
title,,,,,,,
"3D animation with matplotlib, connect points to create moving stick figure",,,,2970.0,,,
Add multi-index to pandas dataframe and keep current index,,,4723.0,,,,
Adding means to a pandas dataframe,,,,,53.0,,
Adding values for missing data combinations in Pandas,,,,,209.0,,
"Aggregating overlapping ""all-previous-events"" features from time series data - in Python",,,,181.0,,,

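The uniqueness requirement of `pivot` can be seen on a small made-up frame (`toy` is hypothetical, not part of the lecture data). When the same `title` and `questionyear` pair appears twice, `pivot` cannot decide which value goes into the cell and raises a `ValueError`. One possible fallback, shown here as a sketch, is `pivot_table`, which aggregates the duplicates:

```python
import pandas as pd

# Hypothetical data: title "a" appears twice for the year 2012.
toy = pd.DataFrame({
    "title": ["a", "a", "b"],
    "questionyear": [2012, 2012, 2013],
    "viewcount": [10, 30, 5],
})

# pivot refuses to reshape because ("a", 2012) is not unique.
try:
    toy.pivot(index="title", columns="questionyear", values="viewcount")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table resolves duplicates with an aggregation function (sum here).
summed = toy.pivot_table(
    index="title", columns="questionyear", values="viewcount", aggfunc="sum"
)
print(summed)
```

Whether summing is the right aggregation depends on the question being asked; `aggfunc` also accepts other reducers such as `"mean"` or `"max"`.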

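If the `values` argument is omitted and the DataFrame has more than one column that is not used as `index` or `columns`, the result has hierarchical columns: the top level holds the names of the remaining value columns, and the second level holds the unique values of the `columns` argument.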

In [102]:
pivoted = hy.head().pivot(index='title', columns='questionyear')
pivoted

,viewcount,viewcount,viewcount,commentcount,commentcount,commentcount,quest_name,quest_name,quest_name
questionyear,2011,2012,2013,2011,2012,2013,2011,2012,2013
title,,,,,,,,,
Change Categorical Variable levels to What I provide/Combine levels two categorical variables,,1382.0,,,0.0,,,Tom Bennett,
Grouping data by multiple dates in pandas,,273.0,,,0.0,,,user1074057,
Pandas xaxis auto-format issue,,613.0,,,0.0,,,joelhoro,
Plot key count per unique value count in pandas,,,9115.0,,,0.0,,,monkut
"Using pandas, how do I subsample a large DataFrame by group in an efficient manner?",2488.0,,,0.0,,,Uri Laserson,,


In [104]:
# select whole groups of columns by their top-level name
pivoted[['viewcount', 'quest_name']]

,viewcount,viewcount,viewcount,quest_name,quest_name,quest_name
questionyear,2011,2012,2013,2011,2012,2013
title,,,,,,
Change Categorical Variable levels to What I provide/Combine levels two categorical variables,,1382.0,,,Tom Bennett,
Grouping data by multiple dates in pandas,,273.0,,,user1074057,
Pandas xaxis auto-format issue,,613.0,,,joelhoro,
Plot key count per unique value count in pandas,,,9115.0,,,monkut
"Using pandas, how do I subsample a large DataFrame by group in an efficient manner?",2488.0,,,Uri Laserson,,

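Besides selecting by top-level name, hierarchical columns can be sliced along the second level or addressed cell by cell with a tuple. The sketch below uses a small made-up frame (`toy`, with shortened hypothetical titles `q1` to `q3`) rather than the lecture data:

```python
import pandas as pd

# Hypothetical data with two value columns besides title and questionyear.
toy = pd.DataFrame({
    "title": ["q1", "q2", "q3"],
    "questionyear": [2011, 2012, 2012],
    "viewcount": [2488, 613, 273],
    "commentcount": [0, 1, 2],
})

# Omitting `values` yields two column levels: value name, then questionyear.
wide = toy.pivot(index="title", columns="questionyear")

# Top-level selection: all years for one value column.
views = wide["viewcount"]

# Cross-section on the second level: every value column for one year.
year_2012 = wide.xs(2012, axis=1, level="questionyear")

# A single cell is addressed with a (top level, second level) tuple.
cell = wide.loc["q2", ("viewcount", 2012)]

print(views)
print(year_2012)
print(cell)
```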

### Stacking and Unstacking