# Pandas reference

Quick reference on getting common data processing tasks done with Pandas.

## Setup

### Import libraries

In [1]:
import numpy as np
import pandas as pd

### Show more data in dataframes

In [2]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200

### Set floating point precision

In [3]:
pd.options.display.precision = 2

## Example dataframes

In [4]:
def example_df():
    return pd.DataFrame(
        columns=("Cat", "Val1", "Val2"),
        data=[
            ["C1", 1.0, 2.0],
            ["C1", 3.0, 4.0],
            ["C2", 5.0, 6.0],
            ["C2", 7.0, 8.0],
        ]
    )

## Filtering

### Filter with []

Select rows with `[]`:

In [5]:
df = example_df()
df[df["Cat"] == "C1"]

|   | Cat | Val1 | Val2 |
|---|-----|------|------|
| 0 | C1 | 1.0 | 2.0 |
| 1 | C1 | 3.0 | 4.0 |


Select a single column as a `pd.Series` with `[]`:

In [6]:
df = example_df()
df["Cat"]

0    C1
1    C1
2    C2
3    C2
Name: Cat, dtype: object

Select one or more columns as a `pd.DataFrame` by passing a list to `[]`:

In [7]:
df = example_df()
df[["Cat"]]

|   | Cat |
|---|-----|
| 0 | C1 |
| 1 | C1 |
| 2 | C2 |
| 3 | C2 |


Selection of rows and of columns can be combined:

In [8]:
df = example_df()
df[df["Cat"] == "C1"]["Val1"]

0    1.0
1    3.0
Name: Val1, dtype: float64

In [9]:
df = example_df()
df[df["Cat"] == "C1"][["Val1"]]

|   | Val1 |
|---|------|
| 0 | 1.0 |
| 1 | 3.0 |


Note that chaining `[]` does not work for the purpose of modifying or inserting data:

In [39]:
df = example_df()
df[df["Cat"] == "C1"]["Val1"] = 5

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["Cat"] == "C1"]["Val1"] = 5


`df[][]=` translates to a `df.__getitem__()` call on the dataframe, followed by a `.__setitem__()` call on the resulting object. The problem is that `df.__getitem__()` might return either a view or a copy of the dataframe, so the original dataframe might or might not be modified. With a boolean mask, as in the example above, the intermediate result is a copy, so the assignment is lost.

Instead, `df.loc[]` can be used to select rows and columns at the same time. `df.loc[]` will return a view or a copy just like `df[]`, but `df.loc[]=` is just a single method call on the `loc` attribute of the original dataframe, free of the ambiguity of `[][]=`, so that it will always correctly modify the dataframe.
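As a minimal sketch of the point above, the failed chained assignment can be rewritten as a single `loc[]` assignment. The dataframe is rebuilt inline (same data as `example_df()`) so that the snippet runs on its own; the name `selected` is only illustrative.

```python
import pandas as pd

# Same data as example_df(), rebuilt inline so the snippet is self-contained.
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

# Read: select rows and a column in one step.
selected = df.loc[df["Cat"] == "C1", "Val1"]

# Write: a single loc[] assignment on the original dataframe.
df.loc[df["Cat"] == "C1", "Val1"] = 5
```

After the assignment, `df["Val1"]` holds the new value in the `C1` rows, while `selected`, taken with a boolean mask beforehand, keeps the old values.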

### Filter with loc[]

Select rows:

In [34]:
df = example_df()
df.loc[df["Cat"] == "C1"]

|   | Cat | Val1 | Val2 |
|---|-----|------|------|
| 0 | C1 | 1.0 | 2.0 |
| 1 | C1 | 3.0 | 4.0 |


Select a single column as a `pd.Series`:

In [35]:
df = example_df()
df.loc[:, "Val1"]

0    1.0
1    3.0
2    5.0
3    7.0
Name: Val1, dtype: float64

Select one or more columns as a `pd.DataFrame`:

In [36]:
df = example_df()
df.loc[:, ["Val1"]]

|   | Val1 |
|---|------|
| 0 | 1.0 |
| 1 | 3.0 |
| 2 | 5.0 |
| 3 | 7.0 |


Modify a subset of a dataframe. Assigning to a column that does not exist yet (`Val3` here) creates it:

In [37]:
df = example_df()
df.loc[df["Cat"] == "C1", "Val3"] = 9
df.loc[df["Cat"] == "C2", "Val3"] = 10
df

|   | Cat | Val1 | Val2 | Val3 |
|---|-----|------|------|------|
| 0 | C1 | 1.0 | 2.0 | 9.0 |
| 1 | C1 | 3.0 | 4.0 | 9.0 |
| 2 | C2 | 5.0 | 6.0 | 10.0 |
| 3 | C2 | 7.0 | 8.0 | 10.0 |


### Boolean masks for [] and loc[]

Boolean masks can be combined with `&` (and), `|` (or) and `~` (negation) and passed to `[]` and to `loc[]`. Each condition has to be enclosed in parentheses, since `&` and `|` bind more tightly in Python than comparison operators like `>=`:

In [12]:
df = example_df()
df[(df["Val2"] >= 4.0) & (df["Val2"] <= 6.0)]

|   | Cat | Val1 | Val2 |
|---|-----|------|------|
| 1 | C1 | 3.0 | 4.0 |
| 2 | C2 | 5.0 | 6.0 |


The condition inside `[]` translates to a boolean vector:

In [42]:
(df["Val2"] >= 4.0) & (df["Val2"] <= 6.0)

0    False
1     True
2     True
3    False
Name: Val2, dtype: bool

Use the `isin()` series method to select rows whose value belongs to a given set:

In [13]:
df = example_df()
df[df["Val2"].isin([4.0, 8.0])]

|   | Cat | Val1 | Val2 |
|---|-----|------|------|
| 1 | C1 | 3.0 | 4.0 |
| 3 | C2 | 7.0 | 8.0 |
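The examples in this section only use `&`. A short sketch of the other two operators, `|` and `~`, on the same data (rebuilt inline; the names `mask_or` and `mask_not` are only illustrative):

```python
import pandas as pd

# Same data as example_df(), rebuilt inline so the snippet is self-contained.
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

# "or": rows where Val2 lies outside the closed interval [4.0, 6.0].
mask_or = (df["Val2"] < 4.0) | (df["Val2"] > 6.0)
outside = df[mask_or]

# "not": rows whose category is not in the given list.
mask_not = ~df["Cat"].isin(["C1"])
not_c1 = df.loc[mask_not]
```

Here `outside` keeps rows 0 and 3, and `not_c1` keeps the two `C2` rows.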


## Grouping

### Reduce group-by-group and series-by-series with agg()

`df.groupby().agg(func)` reduces each series of every group with `func`, which can be a callable (called as `func(series)`) or the name of a built-in reduction such as `"mean"`.

`func` should return a scalar.

In [14]:
df = example_df()
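# The string "mean" selects pandas' built-in reduction; np.mean also works
# but triggers a FutureWarning in recent pandas releases.
df.groupby("Cat").agg("mean")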

| Cat | Val1 | Val2 |
|-----|------|------|
| C1 | 2.0 | 3.0 |
| C2 | 6.0 | 7.0 |
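`func` does not have to be a built-in reduction. As a sketch, the hypothetical helper `value_range` below (not part of pandas) takes one series per group and column and returns a scalar:

```python
import pandas as pd

# Same data as example_df(), rebuilt inline so the snippet is self-contained.
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)


def value_range(series):
    """Hypothetical reduction: spread between the largest and smallest value."""
    return series.max() - series.min()


# One row per group, one column per aggregated series.
ranges = df.groupby("Cat").agg(value_range)
```

With this data every group spans 2.0 in both `Val1` and `Val2`.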


Multiple aggregations can be specified:

In [15]:
df = example_df()
df.groupby("Cat").agg(["mean", "var"])

| Cat | Val1 mean | Val1 var | Val2 mean | Val2 var |
|-----|-----------|----------|-----------|----------|
| C1 | 2.0 | 2.0 | 3.0 | 2.0 |
| C2 | 6.0 | 2.0 | 7.0 | 2.0 |
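The variances shown above are sample variances (`ddof=1`), which is the default of pandas' groupby `var`. Calling `np.var` directly on an array defaults to the population variance (`ddof=0`). A small sketch of the difference, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# Same data as example_df(), rebuilt inline so the snippet is self-contained.
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

# pandas' built-in groupby variance: sample variance (ddof=1).
sample_var = df.groupby("Cat")["Val1"].agg("var")

# Population variance has to be requested explicitly.
population_var = df.groupby("Cat")["Val1"].var(ddof=0)

# NumPy on a plain array defaults to ddof=0.
numpy_var_c1 = np.var(df.loc[df["Cat"] == "C1", "Val1"].to_numpy())
```

For the `C1` values `[1.0, 3.0]` the sample variance is 2.0 and the population variance is 1.0.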


Use keyword arguments to rename the resulting columns:

In [16]:
df = example_df()
df.groupby("Cat").agg(val1_mean=("Val1", "mean"), val2_mean=("Val2", "mean"))

| Cat | val1_mean | val2_mean |
|-----|-----------|-----------|
| C1 | 2.0 | 3.0 |
| C2 | 6.0 | 7.0 |


This keyword-based form of `agg()` has a slightly different syntax when applied to a single series: each keyword maps directly to a function instead of a `(column, function)` tuple:

In [17]:
df = example_df()
df.groupby("Cat")["Val1"].agg(val1_mean="mean", val1_var="var")

| Cat | val1_mean | val1_var |
|-----|-----------|----------|
| C1 | 2.0 | 2.0 |
| C2 | 6.0 | 2.0 |


### Reduce group-by-group with apply()

`df.groupby().apply(func)` will call `func(group)` once for each group, where `group` is a dataframe containing the rows within each group.

`func` can return:
- a scalar - making the result of `apply()` a series
- a series - making the result of `apply()` a dataframe with one row per group, provided every returned series has the same index
- a dataframe - making the result of `apply()` a dataframe

In [28]:
df = example_df()
df.groupby("Cat").apply(lambda group: group[["Val1", "Val2"]].mean())

| Cat | Val1 | Val2 |
|-----|------|------|
| C1 | 2.0 | 3.0 |
| C2 | 6.0 | 7.0 |
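The return types listed above can be compared side by side. This is a sketch with illustrative variable names; the value columns are selected before `apply()` so that each `group` only contains `Val1` and `Val2`:

```python
import pandas as pd

# Same data as example_df(), rebuilt inline so the snippet is self-contained.
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

grouped = df.groupby("Cat")[["Val1", "Val2"]]

# func returns a scalar: one value per group, collected into a series.
scalar_result = grouped.apply(lambda group: group["Val1"].sum())

# func returns a series: one row per group, collected into a dataframe.
series_result = grouped.apply(lambda group: group.mean())

# func returns a dataframe: the per-group dataframes are concatenated.
frame_result = grouped.apply(lambda group: group.head(1))
```

`scalar_result` is a series indexed by `Cat`, while `series_result` and `frame_result` are both dataframes.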


### Transform group-by-group and series-by-series with transform()

`df.groupby().transform(func)` will call `func(series_in_group)` once for each series in each group. In contrast to `apply()`, the result of `transform()` has the same index and number of rows as the original dataframe; the grouping column itself is left out.

`func(series_in_group)` should either return a series of the same dimensions as `series_in_group` or a scalar, in which case pandas will take care of making a series of length `len(series_in_group)` out of it.

In [18]:
df = example_df()
df.groupby("Cat").transform(lambda s: s.mean())

|   | Val1 | Val2 |
|---|------|------|
| 0 | 2.0 | 3.0 |
| 1 | 2.0 | 3.0 |
| 2 | 6.0 | 7.0 |
| 3 | 6.0 | 7.0 |
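The example above returns a scalar from `func`. A sketch of the other case, where `func` returns a series of the same length as its input, plus a common use of `transform()`: attaching a per-group statistic to the original rows. The column name `Val1_group_mean` is only illustrative.

```python
import pandas as pd

# Same data as example_df(), rebuilt inline so the snippet is self-contained.
df = pd.DataFrame(
    columns=("Cat", "Val1", "Val2"),
    data=[
        ["C1", 1.0, 2.0],
        ["C1", 3.0, 4.0],
        ["C2", 5.0, 6.0],
        ["C2", 7.0, 8.0],
    ],
)

# func returns a series: subtract the group mean from every value.
centered = df.groupby("Cat")[["Val1", "Val2"]].transform(lambda s: s - s.mean())

# The result is aligned with the original index, so it can be
# assigned back as a new column.
df["Val1_group_mean"] = df.groupby("Cat")["Val1"].transform("mean")
```

Because the result keeps the original index, no merge or join is needed to bring the group-level value back onto each row.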
