# Pivot Tables

When huge amounts of data are spread across various files, accessing that data is a lot easier when it is organized properly. In this chapter we will learn how to aggregate data, wrangling it as needed.

## 1. Index

Dealing with higher dimensional data is always a challenge. Indexing the data hierarchically, that is, with more than one index level on an axis, lets us work with higher dimensional data in a lower dimensional form. This is known as hierarchical indexing. Let's begin with a simple example.


In [1]:
import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9), index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd' ], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
print(data)


a  1   -0.080382
   2    0.135079
   3   -0.237350
b  1    0.884212
   3    0.654550
c  1   -1.882738
   2   -0.730928
d  2    2.684121
   3   -0.092498
dtype: float64
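With two index levels, subsets can be selected by giving only part of the index. The sketch below rebuilds the same index with fixed values instead of random ones (an assumption made only so the results are reproducible) and selects on the outer and then the inner level.

```python
import numpy as np
import pandas as pd

# Fixed values 0.0 .. 8.0 replace the random numbers for reproducibility.
data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Partial indexing on the outer level: all entries under 'b'.
outer_b = data.loc['b']

# Selection on the inner level: every entry whose second label is 2.
inner_2 = data.loc[:, 2]

print(outer_b)
print(inner_2)
```

The first selection is indexed by the inner labels that exist under 'b', and the second by the outer labels that have an inner label of 2.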


Hierarchical indexing plays a vital role in reshaping data and in group-based operations such as forming a pivot table. Below are some of the methods used to reorganize the data in pandas.

* data.unstack() : Rearranges the data into a DataFrame, moving the inner index level to the columns.
* data.unstack().stack() : stack is the inverse operation of unstack.
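A minimal sketch of the round trip, using fixed values in place of the random ones (an assumption for reproducibility). Index pairs that do not exist in the Series, such as ('b', 2), appear as NaN after unstack. The explicit dropna() is a precaution, since whether stack itself discards those entries depends on the pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Inner level becomes the columns; missing (outer, inner) pairs become NaN.
wide = data.unstack()

# Back to a Series with a two-level index; drop the NaN fill entries.
restored = wide.stack().dropna()

print(wide)
print(restored)
```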

With a DataFrame, either axis can have a hierarchical index:


In [2]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
print(frame)


     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


Below are the methods used to rearrange the order of the levels on an axis or sort the data by the values in one specific level. 

* frame.swaplevel() : swaplevel takes two level numbers or names and returns a new object with the levels interchanged.
* frame.sort_index() : With the level argument, this method sorts the data using only the values in a single level. <br>
**Note:** When swapping levels, it is not uncommon to also use sort_index so that the result is lexicographically sorted by the indicated level. For instance, frame.swaplevel(0, 1).sort_index(level = 0).
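The two methods can be chained. In the sketch below the level names key1 and key2 are hypothetical; they are added only so that the levels can be referred to by name rather than by number.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
# Hypothetical level names, not part of the original frame.
frame.index.names = ['key1', 'key2']

# Interchange the two row levels; the data itself is unchanged.
swapped = frame.swaplevel('key1', 'key2')

# Sort on the new outer level (the former key2).
sorted_frame = swapped.sort_index(level=0)

print(sorted_frame)
```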

Aggregation Methods:
* df.count() : Counts the total number of items.
* df.first(), df.last() : Get the first and last item respectively (as group aggregations, e.g. df.groupby('key').first()).
* df.mean(), df.median() : Get the mean and median respectively.
* df.min(), df.max() : Get the minimum and maximum values respectively.
* df.std(), df.var() : Get the standard deviation and variance respectively.
* df.mad() : Gets the mean absolute deviation (removed in pandas 2.0, where it can be computed as (df - df.mean()).abs().mean()).
* df.prod() : Gets the product of all the items.
* df.sum() : Gets the sum of all the items.
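A short sketch of several of these aggregations on a small frame (the values are chosen here for illustration). Because df.mad() is not available in pandas 2.0 and later, the mean absolute deviation is computed by hand.

```python
import pandas as pd

numeric = pd.DataFrame({'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

totals = numeric.sum()    # column-wise sums
means = numeric.mean()    # column-wise means

# Several aggregations at once; the result has one row per aggregation.
summary = numeric.agg(['count', 'min', 'max', 'median'])

# Mean absolute deviation of data1, computed without df.mad().
mad_data1 = (numeric['data1'] - numeric['data1'].mean()).abs().mean()

print(summary)
print(totals, means, mad_data1)
```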

Groupby: Split, Apply, Combine
* Split : Breaks up and groups a DataFrame depending on the value of the specified key.
* Apply : Computes some function, usually an aggregation, transformation or filtering, within the individual groups.
* Combine : Merges the results of these operations into an output array.
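The three steps can be written out by hand to see what groupby does in a single call. This is only an illustrative sketch, not how pandas implements it internally.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Split: one sub-frame per distinct key.
pieces = {key: group for key, group in df.groupby('key')}

# Apply: reduce each piece to a single number.
sums = {key: group['data2'].sum() for key, group in pieces.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(sums, name='data2')

# The same result in one step.
builtin = df.groupby('key')['data2'].sum()

print(manual)
print(builtin)
```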

**Example:**

In [3]:
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'key' : [ 'A', 'B', 'C', 'A', 'B', 'C'], 'data1': range(6), 'data2': rng.randint(0, 10, 6)}, columns = ['key', 'data1', 'data2'])
print(df)


  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9


In [4]:
#Aggregation

df.groupby('key').aggregate(['min', 'median', 'max'])

#Filtering

def filter_func(x):
    return x['data2'].std() > 4

print('****Original Dataframe****')
print(df)

print('****Grouping Example****')
print(df.groupby('key').std())

print('****Filtering Example****')
print(df.groupby('key').filter(filter_func))



****Original Dataframe****
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
****Grouping Example****
       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641
****Filtering Example****
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9


While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.

In [5]:
# Transformation

df.groupby('key').transform(lambda x: x - x.mean() )



   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0
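The difference between the two operations shows up in the shape of the result. A minimal sketch on the same kind of frame: aggregation yields one row per group, while transformation yields one row per input row.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

grouped = df.groupby('key')[['data1', 'data2']]

aggregated = grouped.mean()              # reduced: one row per key
transformed = grouped.transform('mean')  # full size: group mean repeated per row

print(aggregated.shape, transformed.shape)
```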


**Apply()** : <br>

This method lets us apply an arbitrary function to each group. The function should take a DataFrame and return either a pandas object (e.g., a DataFrame or Series) or a scalar.



In [6]:
def norm_by_data2(x):
    x['data1'] /= x['data2'].sum()
    return x

print('****Original Dataframe****')
print(df)


print('****Apply() Example****')
print(df.groupby('key').apply(norm_by_data2))

****Original Dataframe****
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
****Apply() Example****
  key     data1  data2
0   A  0.000000      5
1   B  0.142857      0
2   C  0.166667      3
3   A  0.375000      3
4   B  0.571429      7
5   C  0.416667      9
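The example above returns a DataFrame from the applied function. A function that returns a scalar works as well, in which case the result has one entry per group. The helper name data2_range is hypothetical, and selecting the value columns before apply is a choice made here to keep the grouping column out of the frame passed to the function.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def data2_range(group):
    # One scalar per group: the spread of data2 within the group.
    return group['data2'].max() - group['data2'].min()

ranges = df.groupby('key')[['data1', 'data2']].apply(data2_range)
print(ranges)
```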


To combine datasets by linking rows using one or more keys, we use **Merge** or **Join** operations. <br>
**Example:** 



In [7]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a',  'b'], 'data1': range(7)})
df2 = pd.DataFrame ({'key': [ 'a', 'b', 'd'], 'data2': range(3)})
pd.merge(df1, df2) 

   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0


Here **merge** used the overlapping column names as the keys, because we did not specify which column to **join** on. It is good practice to specify the key explicitly with the on argument:



In [8]:
pd.merge(df1, df2, on = 'key')


   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0


By default, merge performs an inner join. This can be changed with the how argument.

Options for how : <br>
* 'inner' : Use only the key combinations observed in both tables.
* 'outer' : Use all the key combinations observed in both tables together.
* 'left' : Use all key combinations found in the left table.
* 'right' : Use all key combinations found in the right table.
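One quick way to compare the options is to count the rows each one produces. The sketch below redefines df1 and df2 as in the merge example so that it runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(df1, df2, on='key', how=how))
    for how in ['inner', 'outer', 'left', 'right']
}
print(row_counts)

# In a left join, key 'c' has no match in df2, so its data2 value is missing.
left = pd.merge(df1, df2, on='key', how='left')
print(left[left['key'] == 'c'])
```

The key 'c' appears only in df1 and 'd' only in df2, which is why the left and right joins each gain one row over the inner join and the outer join gains two.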


In [9]:
pd.merge(df1, df2, how = 'outer')

   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0


Concatenation is another operation used for combining data. For NumPy arrays it can be performed with the np.concatenate function, as below:

In [10]:
arr = np.arange(12).reshape((3, 4))
print(arr)
np.concatenate([arr, arr], axis =1)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

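The concat function in pandas provides a consistent way to do the same for Series and DataFrame objects, especially when the objects do not share the same labels along the other axis, in which case the missing entries are filled with NaN.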



In [11]:
s1 = pd.Series([0,1], index = ['a', 'b'])
s2 = pd.Series([ 2, 3, 4], index = ['c', 'd', 'e'])
s3 = pd.Series([5, 6], index = ['f', 'g'] )
print('S1 : ')
print(s1)
print('S2 : ')
print(s2)
print('S3 : ')
print(s3)


S1 : 
a    0
b    1
dtype: int64
S2 : 
c    2
d    3
e    4
dtype: int64
S3 : 
f    5
g    6
dtype: int64


In [12]:
pd.concat([s1, s2, s3], axis = 1, sort=True)

     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0


The **join** argument defines how the other axis is handled. With join = 'inner', only the labels shared by all objects are kept:

In [13]:
s4 = pd.concat([s1, s3])
pd.concat([s1, s4], axis = 1, join = 'inner')


   0  1
a  0  0
b  1  1


**Pivot Tables:**

A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input and groups the entries into a two dimensional table that provides a multidimensional summarization of the data. <br>

Example: 



In [14]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')

titanic.head()
titanic.pivot_table('survived', index = 'sex', columns = 'class')

class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447


The aggfunc keyword controls what type of aggregation is applied, which is mean by default. As with GroupBy, the aggregation specification can be a string representing one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or a function that implements an aggregation (np.sum, min, sum, etc.).

Additionally, it can be specified as a dictionary mapping a column to any of the above desired options:

In [15]:
titanic.pivot_table(index = 'sex', columns = 'class', aggfunc = {'survived' : sum, 'fare' : 'mean'})

              fare                        survived
class        First     Second      Third     First  Second  Third
sex
female  106.125798  21.970121  16.118810        91      70     72
male     67.226127  19.741782  12.661633        45      17     47
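A pivot table with the default mean aggregation can be read as a groupby on two keys followed by unstack. The sketch below checks this on a small hypothetical frame (the toy rows are made up and merely mimic the titanic column names, so no dataset download is needed).

```python
import pandas as pd

# Hypothetical toy data, not taken from the titanic dataset.
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'class':    ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

# Pivot table: mean of survived for each (sex, class) combination.
pivoted = toy.pivot_table(values='survived', index='sex', columns='class')

# The same table built from groupby and unstack.
grouped = toy.groupby(['sex', 'class'])['survived'].mean().unstack()

print(pivoted)
print(grouped)
```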
