# Pandas reference

Quick reference for common data processing tasks in pandas.

## Environment setup

### Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Set theme for plots

In [2]:
plt.style.use("ggplot")

### Show more data in dataframes

In [3]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200
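
The settings above apply to the whole session. When a larger or smaller limit is needed for a single display only, `pd.option_context` can scope the change to a `with` block. This is a minimal sketch; the `wide` dataframe and the limit of 10 rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical dataframe, used only to have something to display.
wide = pd.DataFrame({"x": range(50)})

before = pd.get_option("display.max_rows")

# The option is changed only inside the `with` block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(wide)  # truncated to 10 rows

# On exit the previous value is restored.
after = pd.get_option("display.max_rows")
```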

## Grouping

Dataframe for examples that follow:

In [4]:
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ]
)

In [5]:
df

|   | Cat | Val1 | Val2 |
|---|-----|------|------|
| 0 | C1  | 1.0  | 2.0  |
| 1 | C1  | 3.0  | 4.0  |
| 2 | C2  | 5.0  | 6.0  |
| 3 | C2  | 7.0  | 8.0  |


### agg: reduce group-by-group and series-by-series

`df.groupby().agg(func)` calls `func(series)` once for each column of each group, passing that column's values within the group as `series`.

`func` should return a scalar.

In [6]:
# Pass built-in reductions by name; recent pandas versions warn when given np.mean.
df.groupby("Cat").agg("mean")

| Cat | Val1 | Val2 |
|-----|------|------|
| C1  | 2.0  | 3.0  |
| C2  | 6.0  | 7.0  |
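
To illustrate the "one call per column per group" behaviour with a custom function, here is a minimal sketch. The helper `value_range` is a hypothetical name introduced for this example; it receives a series and returns a scalar, as described above.

```python
import pandas as pd

df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

def value_range(series):
    # `series` holds one column's values within one group.
    # Returning a scalar gives one cell of the result.
    return series.max() - series.min()

ranges = df.groupby("Cat").agg(value_range)
print(ranges)
```

The result has one row per group and one column per value column, just like the built-in reductions.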


Multiple aggregations can be specified:

In [17]:
# "var" is the pandas sample variance (ddof=1), which matches the output below.
df.groupby("Cat").agg(["mean", "var"])

| Cat | Val1 mean | Val1 var | Val2 mean | Val2 var |
|-----|-----------|----------|-----------|----------|
| C1  | 2.0       | 2.0      | 3.0       | 2.0      |
| C2  | 6.0       | 2.0      | 7.0       | 2.0      |
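
Passing a list of aggregations produces two-level column labels such as `("Val1", "mean")`. The sketch below shows one way to select a single column from that result and one way to flatten the labels. The underscore-joined names are an arbitrary choice made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

stats = df.groupby("Cat").agg(["mean", "var"])

# Select one column with a (top level, second level) tuple.
val1_mean = stats[("Val1", "mean")]

# Flatten the two-level labels into single strings like "Val1_mean".
flat = stats.copy()
flat.columns = ["_".join(pair) for pair in stats.columns]
print(flat)
```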


Use keyword arguments to name the resulting columns. Each keyword is an output column name and each value is a `(column, function)` pair:

In [25]:
df.groupby("Cat").agg(val1_mean=("Val1", "mean"), val2_mean=("Val2", "mean"))

| Cat | val1_mean | val2_mean |
|-----|-----------|-----------|
| C1  | 2.0       | 3.0       |
| C2  | 6.0       | 7.0       |
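
The same keyword form can also apply different functions to different columns, or several functions to one column. A small sketch follows; the output names `val1_min`, `val1_max` and `val2_sum` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

summary = df.groupby("Cat").agg(
    val1_min=("Val1", "min"),  # two reductions of the same column
    val1_max=("Val1", "max"),
    val2_sum=("Val2", "sum"),  # a different reduction of another column
)
print(summary)
```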


### apply: reduce group-by-group

`df.groupby().apply(func)` will call `func(group)` once for each group, where `group` is a dataframe containing the rows within each group.

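`func` can return:
- a scalar - making the result of `apply()` a series with one value per group
- a series - making the result of `apply()` a dataframe with one row per group, provided every returned series has the same index
- a dataframe - making the result of `apply()` a dataframe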

In [19]:
# Select the value columns so that `group` holds only numeric data.
df.groupby("Cat")[["Val1", "Val2"]].apply(lambda group: group.mean())

| Cat | Val1 | Val2 |
|-----|------|------|
| C1  | 2.0  | 3.0  |
| C2  | 6.0  | 7.0  |
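
The other two return types can be sketched in the same way. This is an illustrative example rather than a complete description of `apply()`; the variable names are assumptions, and the value columns are selected explicitly so that each `group` contains only `Val1` and `Val2`.

```python
import pandas as pd

df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

groups = df.groupby("Cat")[["Val1", "Val2"]]

# Scalar per group -> a series indexed by Cat.
dot_products = groups.apply(lambda group: (group["Val1"] * group["Val2"]).sum())

# Dataframe per group -> a dataframe whose index combines Cat with the
# index of each returned dataframe ("min" and "max" here).
extremes = groups.apply(lambda group: group.agg(["min", "max"]))

print(dot_products)
print(extremes)
```

Because `func` sees the whole group at once, it can combine several columns, which `agg()` cannot do.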


### transform: map group-by-group and series-by-series, keeping every row

`df.groupby().transform(func)` calls `func(series_in_group)` once for each column of each group. In contrast to `apply()`, the result of `transform()` always has one row for each row of the original dataframe, with the same index; the grouping column is not part of the result.

`func(series_in_group)` should return either a series of the same length as `series_in_group`, or a scalar, which pandas broadcasts to a series of length `len(series_in_group)`.

In [20]:
df.groupby("Cat").transform(lambda s: s.mean())

|   | Val1 | Val2 |
|---|------|------|
| 0 | 2.0  | 3.0  |
| 1 | 2.0  | 3.0  |
| 2 | 6.0  | 7.0  |
| 3 | 6.0  | 7.0  |
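
The example above returns a scalar that is broadcast across each group. The sketch below covers the other case, where `func` returns a series of the same length, by centering each value on its group mean. It also shows a common use of `transform()`: attaching a per-group statistic to the original rows. The column name `Val1_group_mean` is a hypothetical choice for this example.

```python
import pandas as pd

df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

# Series in, series of the same length out: subtract the group mean.
centered = df.groupby("Cat").transform(lambda s: s - s.mean())

# The result is aligned with df's index, so it can be added as a new column.
with_mean = df.assign(
    Val1_group_mean=df.groupby("Cat")["Val1"].transform("mean")
)

print(centered)
print(with_mean)
```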
