# Reshaping and Aggregation Functions
from [Reshaping DataFrames in Pandas](https://towardsdatascience.com/reshaping-dataframes-in-pandas-f6bfbb2c5b0f) by [Anirudh Nanduri](https://medium.com/@nvpsani)  
and [Meet the hardest functions of Pandas, Part II](https://towardsdatascience.com/meet-the-hardest-functions-of-pandas-part-ii-f8029a2b0c9b) by [Bex t.](https://ibexorigin.medium.com)

<a id='Contents'></a>
## Contents:

### Type 1: Reshaping without aggregation
> [Pivot](#Pivot) Converts rows of categorical values into separate columns  
> [Melt](#Melt) Converts columns into rows  
> [Stack](#Stack) Converts columns into index - for MultiIndex DataFrames  
> [Unstack](#Unstack) Converts index into columns - for MultiIndex DataFrames  

*Note:* Transpose still needs to be added to this list.

### Type 2: Reshaping with aggregation
> [Group by](#group_by)  
> [Pivot Table](#pivot_table)  
> [Crosstab](#Crosstab)  
> [crosstab vs. pivot_table](#crosstab_vs_pivot_table)  


In [1]:
# Create a DataFrame to work with

import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': pd.Index(pd.date_range(start='2/2/2019', periods=3)).repeat(3),
                   'Class':['1A','2B','3C','1A','2B','3C','1A','2B','3C'],
                   'Numbers':np.random.randn(9)})

df['Numbers2']= df['Numbers']*2

## Type 1: Reshaping without aggregation
[Return to top](#Contents)
<a id='Pivot'></a>
### Pivot
Pivot rearranges the table by converting categorical values into separate columns:

In [2]:
df

Unnamed: 0,Date,Class,Numbers,Numbers2
0,2019-02-02,1A,1.643669,3.287337
1,2019-02-02,2B,0.923707,1.847414
2,2019-02-02,3C,-0.450652,-0.901305
3,2019-02-03,1A,-1.715478,-3.430955
4,2019-02-03,2B,-1.407965,-2.81593
5,2019-02-03,3C,-0.136307,-0.272613
6,2019-02-04,1A,-0.678125,-1.356249
7,2019-02-04,2B,2.326845,4.65369
8,2019-02-04,3C,-0.916117,-1.832234


In [3]:
df.pivot(index='Date', columns='Class', values='Numbers')

Class,1A,2B,3C
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,1.643669,0.923707,-0.450652
2019-02-03,-1.715478,-1.407965,-0.136307
2019-02-04,-0.678125,2.326845,-0.916117


If the `values` parameter is omitted, pandas pivots every column other than the ones given as `index` and `columns`. The result has MultiIndex columns, with one block per value column.  

In [4]:
df.pivot(index='Date', columns='Class')

Unnamed: 0_level_0,Numbers,Numbers,Numbers,Numbers2,Numbers2,Numbers2
Class,1A,2B,3C,1A,2B,3C
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2019-02-02,1.643669,0.923707,-0.450652,3.287337,1.847414,-0.901305
2019-02-03,-1.715478,-1.407965,-0.136307,-3.430955,-2.81593,-0.272613
2019-02-04,-0.678125,2.326845,-0.916117,-1.356249,4.65369,-1.832234


The form below gives the same result as the first example, but it is slower: it first pivots all of the value columns, then selects a subset of the result.  Specifying the `values` parameter, as in the first example, is preferred.

In [5]:
df.pivot(index='Date', columns='Class')['Numbers']

Class,1A,2B,3C
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,1.643669,0.923707,-0.450652
2019-02-03,-1.715478,-1.407965,-0.136307
2019-02-04,-0.678125,2.326845,-0.916117
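
As a quick check that the two forms agree, the sketch below uses a small frame with fixed values in place of the notebook's random data. The names `demo`, `direct`, and `subset` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Small deterministic frame standing in for the random data (assumed values)
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', '2019-02-02', '2019-02-03', '2019-02-03']),
    'Class': ['1A', '2B', '1A', '2B'],
    'Numbers': [1.0, 2.0, 3.0, 4.0],
    'Numbers2': [2.0, 4.0, 6.0, 8.0],
})

# Preferred: pivot only the column of interest
direct = demo.pivot(index='Date', columns='Class', values='Numbers')

# Alternative: pivot every remaining column, then keep the 'Numbers' block
subset = demo.pivot(index='Date', columns='Class')['Numbers']

# Compare the two results cell by cell
print(direct.equals(subset))
```

To compare speed on your own data, timing both forms with `%timeit` in a notebook is one option.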


[Return to top](#Contents)
<a id='Melt'></a>
### Melt

Melt is the opposite of pivot.  
It converts multiple columns into a single column of categorical names and a second column that contains their values.

In [6]:
df

Unnamed: 0,Date,Class,Numbers,Numbers2
0,2019-02-02,1A,1.643669,3.287337
1,2019-02-02,2B,0.923707,1.847414
2,2019-02-02,3C,-0.450652,-0.901305
3,2019-02-03,1A,-1.715478,-3.430955
4,2019-02-03,2B,-1.407965,-2.81593
5,2019-02-03,3C,-0.136307,-0.272613
6,2019-02-04,1A,-0.678125,-1.356249
7,2019-02-04,2B,2.326845,4.65369
8,2019-02-04,3C,-0.916117,-1.832234


In [7]:
df.melt(id_vars=['Date','Class'])

Unnamed: 0,Date,Class,variable,value
0,2019-02-02,1A,Numbers,1.643669
1,2019-02-02,2B,Numbers,0.923707
2,2019-02-02,3C,Numbers,-0.450652
3,2019-02-03,1A,Numbers,-1.715478
4,2019-02-03,2B,Numbers,-1.407965
5,2019-02-03,3C,Numbers,-0.136307
6,2019-02-04,1A,Numbers,-0.678125
7,2019-02-04,2B,Numbers,2.326845
8,2019-02-04,3C,Numbers,-0.916117
9,2019-02-02,1A,Numbers2,3.287337


`value_vars` can be used when you only want to convert specific columns.

In [8]:
df.melt(id_vars=['Date','Class'], value_vars=['Numbers'])

Unnamed: 0,Date,Class,variable,value
0,2019-02-02,1A,Numbers,1.643669
1,2019-02-02,2B,Numbers,0.923707
2,2019-02-02,3C,Numbers,-0.450652
3,2019-02-03,1A,Numbers,-1.715478
4,2019-02-03,2B,Numbers,-1.407965
5,2019-02-03,3C,Numbers,-0.136307
6,2019-02-04,1A,Numbers,-0.678125
7,2019-02-04,2B,Numbers,2.326845
8,2019-02-04,3C,Numbers,-0.916117


`value_name` and `var_name` can be used to specify the resulting column names, rather than using the default names of `variable` and `value`.

In [9]:
df.melt(id_vars=['Date','Class'], value_vars=['Numbers'], value_name='Numbers_Value', var_name='Num_Var')

Unnamed: 0,Date,Class,Num_Var,Numbers_Value
0,2019-02-02,1A,Numbers,1.643669
1,2019-02-02,2B,Numbers,0.923707
2,2019-02-02,3C,Numbers,-0.450652
3,2019-02-03,1A,Numbers,-1.715478
4,2019-02-03,2B,Numbers,-1.407965
5,2019-02-03,3C,Numbers,-0.136307
6,2019-02-04,1A,Numbers,-0.678125
7,2019-02-04,2B,Numbers,2.326845
8,2019-02-04,3C,Numbers,-0.916117
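
To illustrate that melt and pivot undo each other, here is a minimal round-trip sketch. It uses fixed values and illustrative names (`wide`, `long_df`, `restored`), and it assumes pandas 1.1 or later, where `pivot` accepts a list for `index`.

```python
import pandas as pd

# Wide frame with fixed values (assumed data, not the notebook's random numbers)
wide = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', '2019-02-02', '2019-02-03', '2019-02-03']),
    'Class': ['1A', '2B', '1A', '2B'],
    'Numbers': [1.0, 2.0, 3.0, 4.0],
    'Numbers2': [2.0, 4.0, 6.0, 8.0],
})

# Wide -> long: one row per (Date, Class, variable)
long_df = wide.melt(id_vars=['Date', 'Class'])

# Long -> wide: spread 'variable' back into columns, then flatten the index
restored = (long_df
            .pivot(index=['Date', 'Class'], columns='variable', values='value')
            .reset_index())
restored.columns.name = None  # drop the leftover 'variable' label on the columns

print(restored)
```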


[Return to top](#Contents)
<a id='Stack'></a>
### Stack and Unstack

*stack* and *unstack* are similar to *melt* and *pivot*, except they work with MultiIndex objects.  
- stack: columns to index  
- unstack: index to columns
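
The two directions can be sketched as a round trip on a small frame. The data and the names `indexed`, `stacked`, and `unstacked` are assumptions made for illustration.

```python
import pandas as pd

# Frame with a (Date, Class) MultiIndex and fixed values (assumed data)
indexed = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', '2019-02-02', '2019-02-03', '2019-02-03']),
    'Class': ['1A', '2B', '1A', '2B'],
    'Numbers': [1.0, 2.0, 3.0, 4.0],
    'Numbers2': [2.0, 4.0, 6.0, 8.0],
}).set_index(['Date', 'Class'])

# stack: the column labels become the innermost index level (result is a Series)
stacked = indexed.stack()

# unstack: the innermost index level moves back to the columns
unstacked = stacked.unstack()

print(stacked.index.nlevels)      # number of index levels after stacking
print(list(unstacked.columns))    # column labels after unstacking
```

`unstack` also accepts a level name or position, for example `stacked.unstack('Class')`, to choose which index level becomes the columns.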

### Stack: columns to index

In [10]:
# see what df looks like with a multiIndex
df.set_index(['Date','Class'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Numbers,Numbers2
Date,Class,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,1A,1.643669,3.287337
2019-02-02,2B,0.923707,1.847414
2019-02-02,3C,-0.450652,-0.901305
2019-02-03,1A,-1.715478,-3.430955
2019-02-03,2B,-1.407965,-2.81593
2019-02-03,3C,-0.136307,-0.272613
2019-02-04,1A,-0.678125,-1.356249
2019-02-04,2B,2.326845,4.65369
2019-02-04,3C,-0.916117,-1.832234


In [11]:
df.set_index(['Date','Class']).stack()

Date        Class          
2019-02-02  1A     Numbers     1.643669
                   Numbers2    3.287337
            2B     Numbers     0.923707
                   Numbers2    1.847414
            3C     Numbers    -0.450652
                   Numbers2   -0.901305
2019-02-03  1A     Numbers    -1.715478
                   Numbers2   -3.430955
            2B     Numbers    -1.407965
                   Numbers2   -2.815930
            3C     Numbers    -0.136307
                   Numbers2   -0.272613
2019-02-04  1A     Numbers    -0.678125
                   Numbers2   -1.356249
            2B     Numbers     2.326845
                   Numbers2    4.653690
            3C     Numbers    -0.916117
                   Numbers2   -1.832234
dtype: float64

One possible use of `stack` is to create a nested lookup table.  Using the example above, you can look up values:

In [12]:
df.set_index(['Date','Class']).stack()['2019-02-03']['2B']['Numbers2']

-2.815929807593747
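
The chained brackets above perform three separate lookups. A single `.loc` call with a tuple key is an alternative; the sketch below shows it on assumed data with illustrative names (`lookup`, `value`).

```python
import pandas as pd

# Stacked Series with a three-level index: Date, Class, original column name
lookup = (pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', '2019-02-02', '2019-02-03', '2019-02-03']),
    'Class': ['1A', '2B', '1A', '2B'],
    'Numbers': [1.0, 2.0, 3.0, 4.0],
    'Numbers2': [2.0, 4.0, 6.0, 8.0],
}).set_index(['Date', 'Class']).stack())

# One lookup with a full-depth tuple key instead of three chained selections
value = lookup.loc[(pd.Timestamp('2019-02-03'), '2B', 'Numbers2')]
print(value)
```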

[Return to top](#Contents)
<a id='Unstack'></a>
### Unstack: index to columns

In [13]:
df.set_index(['Date','Class'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Numbers,Numbers2
Date,Class,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,1A,1.643669,3.287337
2019-02-02,2B,0.923707,1.847414
2019-02-02,3C,-0.450652,-0.901305
2019-02-03,1A,-1.715478,-3.430955
2019-02-03,2B,-1.407965,-2.81593
2019-02-03,3C,-0.136307,-0.272613
2019-02-04,1A,-0.678125,-1.356249
2019-02-04,2B,2.326845,4.65369
2019-02-04,3C,-0.916117,-1.832234


In [14]:
df.set_index(['Date', 'Class']).unstack()

Unnamed: 0_level_0,Numbers,Numbers,Numbers,Numbers2,Numbers2,Numbers2
Class,1A,2B,3C,1A,2B,3C
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2019-02-02,1.643669,0.923707,-0.450652,3.287337,1.847414,-0.901305
2019-02-03,-1.715478,-1.407965,-0.136307,-3.430955,-2.81593,-0.272613
2019-02-04,-0.678125,2.326845,-0.916117,-1.356249,4.65369,-1.832234


[Return to top](#Contents)
<a id='Type2'></a>
## Type 2: Reshaping with aggregation

Create a new dataframe for examples:

In [15]:
df = pd.DataFrame({'Date': pd.Index(pd.date_range(start='2/2/2019', periods=2)).repeat(4),
                  'Class':['1A','2B','3C','1A','2B','3C','1A','2B'],
                  'Numbers': np.random.randn(8)})
df

Unnamed: 0,Date,Class,Numbers
0,2019-02-02,1A,-0.077458
1,2019-02-02,2B,1.059071
2,2019-02-02,3C,0.2586
3,2019-02-02,1A,0.746885
4,2019-02-03,2B,0.974775
5,2019-02-03,3C,0.919365
6,2019-02-03,1A,1.005324
7,2019-02-03,2B,0.908988


[Return to top](#Contents)
<a id='group_by'></a>
### Group by

In [16]:
grps = df.groupby('Date')
for date, group in grps:
    print(date)
    print(group)

2019-02-02 00:00:00
        Date Class   Numbers
0 2019-02-02    1A -0.077458
1 2019-02-02    2B  1.059071
2 2019-02-02    3C  0.258600
3 2019-02-02    1A  0.746885
2019-02-03 00:00:00
        Date Class   Numbers
4 2019-02-03    2B  0.974775
5 2019-02-03    3C  0.919365
6 2019-02-03    1A  1.005324
7 2019-02-03    2B  0.908988


In [17]:
df.groupby('Date')['Numbers'].mean()

Date
2019-02-02    0.496775
2019-02-03    0.952113
Name: Numbers, dtype: float64

#### Some methods to get the result as a DataFrame

In [18]:
df.groupby('Date')[['Numbers']].mean()
# or (naming the column explicitly so the string column 'Class' is not averaged)
df.groupby('Date').agg({'Numbers': 'mean'})

Unnamed: 0_level_0,Numbers
Date,Unnamed: 1_level_1
2019-02-02,0.496775
2019-02-03,0.952113


**Keeping a default integer index**  
(`as_index=False` returns 'Date' as a regular column instead of using it as the index)

In [19]:
df.groupby('Date', as_index=False)['Numbers'].mean()

Unnamed: 0,Date,Numbers
0,2019-02-02,0.496775
1,2019-02-03,0.952113
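
For comparison, calling `reset_index()` on the default result should give the same flat layout. This is a minimal sketch on assumed data; `sales`, `flat_a`, and `flat_b` are illustrative names.

```python
import pandas as pd

# Two dates with two values each (assumed data)
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', '2019-02-02', '2019-02-03', '2019-02-03']),
    'Numbers': [1.0, 3.0, 5.0, 7.0],
})

# Option 1: ask groupby not to use the key as the index
flat_a = sales.groupby('Date', as_index=False)['Numbers'].mean()

# Option 2: aggregate with the key as index, then move it back to a column
flat_b = sales.groupby('Date')['Numbers'].mean().reset_index()

print(flat_a)
print(flat_b)
```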


#### Grouping by multiple columns:

In [20]:
df.groupby(['Date','Class'],as_index=False)['Numbers'].mean()

Unnamed: 0,Date,Class,Numbers
0,2019-02-02,1A,0.334714
1,2019-02-02,2B,1.059071
2,2019-02-02,3C,0.2586
3,2019-02-03,1A,1.005324
4,2019-02-03,2B,0.941882
5,2019-02-03,3C,0.919365


In [21]:
df['Numbers2'] = df['Numbers']*2

#### Multiple aggregations on a column

In [22]:
df.groupby('Date').agg({'Numbers':'sum', 'Numbers2':['mean','max']})

Unnamed: 0_level_0,Numbers,Numbers2,Numbers2
Unnamed: 0_level_1,sum,mean,max
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
2019-02-02,1.987098,0.993549,2.118142
2019-02-03,3.808453,1.904226,2.010647


#### Renaming aggregation columns

In [23]:
df.groupby(['Date'], as_index=False).agg(Numbers_Avg=('Numbers','mean'),Numbers_Sum=('Numbers','sum'))

Unnamed: 0,Date,Numbers_Avg,Numbers_Sum
0,2019-02-02,0.496775,1.987098
1,2019-02-03,0.952113,3.808453


#### Using Lambda Expressions

In [24]:
df.groupby('Date', as_index=False).agg(Num_As_Pct=('Numbers', lambda x: np.round(x.mean()*100,2)))

Unnamed: 0,Date,Num_As_Pct
0,2019-02-02,49.68
1,2019-02-03,95.21


#### Using Custom Functions

In [25]:
def sign(numbers):
    # Label a group by the sign of its mean value
    if numbers.mean() >= 0:
        return 'positive'
    else:
        return 'negative'

df.groupby('Date').agg({'Numbers':['mean',sign]})

Unnamed: 0_level_0,Numbers,Numbers
Unnamed: 0_level_1,mean,sign
Date,Unnamed: 1_level_2,Unnamed: 2_level_2
2019-02-02,0.496775,positive
2019-02-03,0.952113,positive


#### Including Missing Values

In [26]:
# insert some missing values
df.iloc[4:6,0]=np.nan
df

Unnamed: 0,Date,Class,Numbers,Numbers2
0,2019-02-02,1A,-0.077458,-0.154916
1,2019-02-02,2B,1.059071,2.118142
2,2019-02-02,3C,0.2586,0.5172
3,2019-02-02,1A,0.746885,1.493771
4,NaT,2B,0.974775,1.94955
5,NaT,3C,0.919365,1.838731
6,2019-02-03,1A,1.005324,2.010647
7,2019-02-03,2B,0.908988,1.817977


In [27]:
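# numeric_only=True skips the string column 'Class' (newer pandas versions raise an error otherwise)
df.groupby('Date', dropna=False).mean(numeric_only=True)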

Unnamed: 0_level_0,Numbers,Numbers2
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-02-02,0.496775,0.993549
2019-02-03,0.957156,1.914312
NaT,0.94707,1.894141
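
To make the effect of `dropna` visible, the sketch below runs the same aggregation with and without it. The data and the names `df_na`, `kept`, and `dropped` are assumptions for illustration; `dropna` is available in pandas 1.1 and later.

```python
import pandas as pd

# Two rows have a missing date (None becomes NaT)
df_na = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', None, '2019-02-02', None]),
    'Numbers': [1.0, 2.0, 3.0, 4.0],
})

# dropna=False keeps the missing key as its own group
kept = df_na.groupby('Date', dropna=False)['Numbers'].mean()

# Default behavior (dropna=True) silently excludes rows with a missing key
dropped = df_na.groupby('Date')['Numbers'].mean()

print(len(kept), len(dropped))
```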


Sorting options

[Return to top](#Contents)
<a id='pivot_table'></a>
### Pivot_Table

Works similarly to `pivot` (categorical values become separate columns), with the addition of aggregation.
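
The aggregation step matters when an index/column pair occurs more than once. In the sketch below (assumed data, illustrative names `dup`, `pivot_error`, `table`), `pivot` raises a `ValueError` on the duplicate pair, while `pivot_table` averages it.

```python
import pandas as pd

# ('2019-02-02', '1A') appears twice, so there is no single value to place in the cell
dup = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02', '2019-02-02', '2019-02-02']),
    'Class': ['1A', '1A', '2B'],
    'Numbers': [1.0, 3.0, 5.0],
})

# pivot cannot reshape duplicate entries
try:
    dup.pivot(index='Date', columns='Class', values='Numbers')
    pivot_error = None
except ValueError as err:
    pivot_error = str(err)

# pivot_table resolves the duplicates by aggregating (mean by default)
table = dup.pivot_table(index='Date', columns='Class', values='Numbers')

print(pivot_error)
print(table)
```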

In [28]:
df

Unnamed: 0,Date,Class,Numbers,Numbers2
0,2019-02-02,1A,-0.077458,-0.154916
1,2019-02-02,2B,1.059071,2.118142
2,2019-02-02,3C,0.2586,0.5172
3,2019-02-02,1A,0.746885,1.493771
4,NaT,2B,0.974775,1.94955
5,NaT,3C,0.919365,1.838731
6,2019-02-03,1A,1.005324,2.010647
7,2019-02-03,2B,0.908988,1.817977


#### If not specified, the default aggregation function is `mean`:

In [29]:
df.pivot_table(index='Date', columns='Class')

Unnamed: 0_level_0,Numbers,Numbers,Numbers,Numbers2,Numbers2,Numbers2
Class,1A,2B,3C,1A,2B,3C
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2019-02-02,0.334714,1.059071,0.2586,0.669427,2.118142,0.5172
2019-02-03,1.005324,0.908988,,2.010647,1.817977,


#### The aggregation function can be specified with `aggfunc`:

In [30]:
df.pivot_table(index='Date', columns='Class', aggfunc='sum')

Unnamed: 0_level_0,Numbers,Numbers,Numbers,Numbers2,Numbers2,Numbers2
Class,1A,2B,3C,1A,2B,3C
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2019-02-02,0.669427,1.059071,0.2586,1.338855,2.118142,0.5172
2019-02-03,1.005324,0.908988,,2.010647,1.817977,


[Return to top](#Contents)
<a id='Crosstab'></a>
### Crosstab

- Crosstab works with categorical data.
- Crosstab always returns a DataFrame.  
- **By default, it takes two or more columns and returns the frequency (count) of each combination.**

In [31]:
df

Unnamed: 0,Date,Class,Numbers,Numbers2
0,2019-02-02,1A,-0.077458,-0.154916
1,2019-02-02,2B,1.059071,2.118142
2,2019-02-02,3C,0.2586,0.5172
3,2019-02-02,1A,0.746885,1.493771
4,NaT,2B,0.974775,1.94955
5,NaT,3C,0.919365,1.838731
6,2019-02-03,1A,1.005324,2.010647
7,2019-02-03,2B,0.908988,1.817977


In [32]:
pd.crosstab(df.Date, df.Class)
#or
pd.crosstab(index=df['Date'], columns=df['Class'])

Class,1A,2B,3C
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,2,1,1
2019-02-03,1,1,0
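
One way to read the default frequency table is as a `groupby` count spread into columns. The following sketch, on assumed data with illustrative names (`obs`, `counts`, `by_hand`), builds the same table both ways.

```python
import pandas as pd

# Five observations across two dates and three classes (assumed data)
obs = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02'] * 3 + ['2019-02-03'] * 2),
    'Class': ['1A', '1A', '2B', '1A', '3C'],
})

# Frequency of each (Date, Class) combination
counts = pd.crosstab(obs['Date'], obs['Class'])

# The same counts by hand: group sizes, with Class moved to the columns
# and missing combinations filled with 0
by_hand = obs.groupby(['Date', 'Class']).size().unstack(fill_value=0)

print(counts)
print(by_hand)
```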


#### Other aggregations can be applied by passing `values` and `aggfunc`:

In [33]:
pd.crosstab(df.Date, df.Class, values=df.Numbers, aggfunc='sum')
# or
pd.crosstab(index=df['Date'], columns=df['Class'], values=df['Numbers'], aggfunc='sum')

Class,1A,2B,3C
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,0.669427,1.059071,0.2586
2019-02-03,1.005324,0.908988,


In [34]:
pd.crosstab(df.Date, df.Class, values=df.Numbers, aggfunc='mean')
# or
pd.crosstab(index=df['Date'],
           columns=df['Class'],
           values=df['Numbers'],
           aggfunc=np.mean)

Class,1A,2B,3C
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,0.334714,1.059071,0.2586
2019-02-03,1.005324,0.908988,


#### Crosstab Subtotal Summaries
Crosstab also provides subtotal summaries when `margins=True` is used.  
By default, the subtotal row and column are named "All", but the name can be changed with `margins_name`, for example `margins_name='Total_Number'`.  

In [35]:
pd.crosstab(index=df['Date'], columns=df['Class'], margins=True, margins_name='Total_Number')

Class,1A,2B,3C,Total_Number
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019-02-02 00:00:00,2,1,1,4
2019-02-03 00:00:00,1,1,0,2
Total_Number,3,2,1,6


#### Normalizing values with Crosstab
Crosstab can also normalize values.  `normalize` can be set to `'all'`, `'index'`, or `'columns'`.  
This example uses `normalize='all'`, which expresses each count as a fraction of the grand total, so the whole table sums to 1.0:

In [36]:
pd.crosstab(index=df['Date'], columns=df['Class'], normalize='all')

Class,1A,2B,3C
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-02-02,0.333333,0.166667,0.166667
2019-02-03,0.166667,0.166667,0.0
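
The other two settings normalize within rows or within columns instead. A minimal sketch on assumed data follows; `obs`, `row_share`, and `col_share` are illustrative names.

```python
import pandas as pd

# Five observations across two dates and three classes (assumed data)
obs = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-02'] * 3 + ['2019-02-03'] * 2),
    'Class': ['1A', '1A', '2B', '1A', '3C'],
})

# normalize='index': each row (date) sums to 1.0
row_share = pd.crosstab(obs['Date'], obs['Class'], normalize='index')

# normalize='columns': each column (class) sums to 1.0
col_share = pd.crosstab(obs['Date'], obs['Class'], normalize='columns')

print(row_share.sum(axis=1))
print(col_share.sum(axis=0))
```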


[Return to Top](#Contents)
<a id='crosstab_vs_pivot_table'></a>
### crosstab vs pivot_table
Crosstab and Pivot_Table produce similar results.  Both are generally slower than groupby.  
Some of the differences are:  

- Crosstab accepts array-like inputs (lists, numpy arrays, DataFrame columns, etc.)
- Crosstab has an optional `normalize` parameter
- Crosstab can set the row and column names using `rownames` and `colnames`
- Pivot_table can only work with DataFrames
- Pivot_table has `fill_value=xx`
    - (Crosstab can be followed with `.fillna(0)`)
- Both have options for `margins` and `margins_name`
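
A few of these differences can be seen side by side in the sketch below. It uses assumed data and illustrative names (`dates`, `classes`, `ct`, `frame`, `pt`, `ct_sum`), and it passes numpy arrays to `crosstab`, following the style of the pandas documentation examples.

```python
import numpy as np
import pandas as pd

# crosstab works directly on arrays; rownames/colnames label the axes
dates = np.array(['d1', 'd1', 'd2'])
classes = np.array(['1A', '2B', '1A'])
ct = pd.crosstab(dates, classes, rownames=['Date'], colnames=['Class'])

# pivot_table needs a DataFrame; fill_value replaces missing combinations
frame = pd.DataFrame({'Date': dates, 'Class': classes, 'Numbers': [1.0, 2.0, 3.0]})
pt = frame.pivot_table(index='Date', columns='Class', values='Numbers',
                       aggfunc='sum', fill_value=0)

# crosstab has no fill_value here, so fillna(0) is chained on afterwards
ct_sum = pd.crosstab(frame['Date'], frame['Class'],
                     values=frame['Numbers'], aggfunc='sum').fillna(0)

print(ct)
print(pt)
print(ct_sum)
```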

[Return to Top](#Contents)

In [37]:
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")

numpy: 1.21.2
pandas: 1.3.4
