# Pandas reference

Quick reference on getting common data processing tasks done with Pandas.

## Setup

### Import libraries

In [1]:
import numpy as np
import pandas as pd

### Show more data in dataframes

In [2]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200

### Set floating point precision

In [3]:
pd.options.display.precision = 3
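
The settings above apply to the whole session. When an option is only needed for a single output, `pd.option_context` can scope it to a `with` block. The sketch below is a minimal illustration; the `demo` frame and its values are made up for this example.

```python
import pandas as pd

# Hypothetical frame, used only to show the effect on rendering.
demo = pd.DataFrame({"value": [1.23456789, 2.3456789]})

precision_before = pd.get_option("display.precision")

# Inside the block the precision is 2; the previous value is restored on exit.
with pd.option_context("display.precision", 2):
    precision_inside = pd.get_option("display.precision")
    print(demo)

precision_after = pd.get_option("display.precision")
```

This keeps the global display configuration untouched, which helps when one cell needs a different format than the rest of the notebook.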

## Datasets

### USD exchange rates (yearly averages)

In [4]:
def usd_exchange_rates_df():
    return pd.DataFrame(
        columns=("Year", "Currency", "Currency/USD", "USD/Currency"),
        data=[
            [pd.to_datetime("2016-12-31"), "EUR", 1.064, 0.940],
            [pd.to_datetime("2017-12-31"), "EUR", 1.083, 0.923],
            [pd.to_datetime("2018-12-31"), "EUR", 1.179, 0.848],
            [pd.to_datetime("2019-12-31"), "EUR", 1.120, 0.893],
            [pd.to_datetime("2020-12-31"), "EUR", 1.140, 0.877],
            [pd.to_datetime("2016-12-31"), "GBP", 1.299, 0.770],
            [pd.to_datetime("2017-12-31"), "GBP", 1.238, 0.808],
            [pd.to_datetime("2018-12-31"), "GBP", 1.333, 0.750],
            [pd.to_datetime("2019-12-31"), "GBP", 1.276, 0.784],
            [pd.to_datetime("2020-12-31"), "GBP", 1.284, 0.779],
        ]
    )

## Filtering

### Filter with []

Select rows with `[]`:

In [5]:
df = usd_exchange_rates_df()
df[df["Currency"] == "EUR"]

|  | Year | Currency | Currency/USD | USD/Currency |
|---|---|---|---|---|
| 0 | 2016-12-31 | EUR | 1.064 | 0.94 |
| 1 | 2017-12-31 | EUR | 1.083 | 0.923 |
| 2 | 2018-12-31 | EUR | 1.179 | 0.848 |
| 3 | 2019-12-31 | EUR | 1.12 | 0.893 |
| 4 | 2020-12-31 | EUR | 1.14 | 0.877 |


Select a single column as a `pd.Series` with `[]`:

In [6]:
df = usd_exchange_rates_df()
df["Currency/USD"]

0    1.064
1    1.083
2    1.179
3    1.120
4    1.140
5    1.299
6    1.238
7    1.333
8    1.276
9    1.284
Name: Currency/USD, dtype: float64

Select one or more columns as a `pd.DataFrame` by passing a list to `[]`:

In [7]:
df = usd_exchange_rates_df()
df[["Currency/USD"]]

|  | Currency/USD |
|---|---|
| 0 | 1.064 |
| 1 | 1.083 |
| 2 | 1.179 |
| 3 | 1.12 |
| 4 | 1.14 |
| 5 | 1.299 |
| 6 | 1.238 |
| 7 | 1.333 |
| 8 | 1.276 |
| 9 | 1.284 |
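
The difference between the two selections is the type and dimensionality of the result. A small sketch on a made-up two-row frame (the column name `Rate` is an assumption for this example):

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({"Currency": ["EUR", "GBP"], "Rate": [1.064, 1.299]})

as_series = df["Rate"]    # string key   -> pd.Series (1-D)
as_frame = df[["Rate"]]   # list of keys -> pd.DataFrame (2-D), even for one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```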


Row and column selection can be combined by chaining `[]`:

In [8]:
df = usd_exchange_rates_df()
df[df["Currency"] == "EUR"]["Currency/USD"]

0    1.064
1    1.083
2    1.179
3    1.120
4    1.140
Name: Currency/USD, dtype: float64

Note that chained `[]` indexing cannot be relied on to modify or insert data:

In [9]:
df = usd_exchange_rates_df()
df[df["Currency"] == "EUR"]["Currency/USD"] = 5

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["Currency"] == "EUR"]["Currency/USD"] = 5


`df[mask][col] = value` translates to a `df.__getitem__(mask)` call on the dataframe, followed by a `.__setitem__(col, value)` call on the object that the first call returns. The problem is that `df.__getitem__()` may return either a view or a copy of the dataframe, so the assignment may or may not reach the original dataframe.

Instead, use `df.loc[]` to select rows and columns at the same time. Reading through `df.loc[]` returns a view or a copy just like `df[]`, but `df.loc[rows, cols] = value` is a single `__setitem__()` call on the `loc` indexer of the original dataframe. It is free of the ambiguity of `[][]=` and always modifies the dataframe itself.
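
The two-step decomposition can be spelled out explicitly. The sketch below uses a made-up frame (the `Rate` column is an assumption) and silences the warning so that only the effect on the original frame is visible. With a boolean mask, the first step produces a new object, so the chained write is lost.

```python
import warnings

import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({"Currency": ["EUR", "EUR", "GBP"], "Rate": [1.064, 1.083, 1.299]})
mask = df["Currency"] == "EUR"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning, if any
    subset = df[mask]                # step 1: __getitem__, returns a new object here
    subset["Rate"] = 5               # step 2: __setitem__ on that new object

# The original frame still holds the old EUR values.
chained_reached_original = bool((df.loc[mask, "Rate"] == 5).all())

# A single __setitem__ call on the loc indexer of the original frame.
df.loc[mask, "Rate"] = 5
loc_reached_original = bool((df.loc[mask, "Rate"] == 5).all())

print(chained_reached_original, loc_reached_original)
```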

### Filter with loc[]

Select rows:

In [10]:
df = usd_exchange_rates_df()
df.loc[df["Currency"] == "EUR"]

|  | Year | Currency | Currency/USD | USD/Currency |
|---|---|---|---|---|
| 0 | 2016-12-31 | EUR | 1.064 | 0.94 |
| 1 | 2017-12-31 | EUR | 1.083 | 0.923 |
| 2 | 2018-12-31 | EUR | 1.179 | 0.848 |
| 3 | 2019-12-31 | EUR | 1.12 | 0.893 |
| 4 | 2020-12-31 | EUR | 1.14 | 0.877 |


Select a single column as a `pd.Series`:

In [11]:
df = usd_exchange_rates_df()
df.loc[:, "Currency/USD"]

0    1.064
1    1.083
2    1.179
3    1.120
4    1.140
5    1.299
6    1.238
7    1.333
8    1.276
9    1.284
Name: Currency/USD, dtype: float64

Select one or more columns as a `pd.DataFrame`:

In [12]:
df = usd_exchange_rates_df()
df.loc[:, ["Currency/USD"]]

|  | Currency/USD |
|---|---|
| 0 | 1.064 |
| 1 | 1.083 |
| 2 | 1.179 |
| 3 | 1.12 |
| 4 | 1.14 |
| 5 | 1.299 |
| 6 | 1.238 |
| 7 | 1.333 |
| 8 | 1.276 |
| 9 | 1.284 |
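
Row and column selectors can also be passed to `loc[]` together. The sketch below, on a made-up frame with assumed column names `Rate` and `Inverse`, combines a boolean row mask with a list of columns, and also shows that label slices in `loc[]` include both endpoints.

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP"],
    "Rate": [1.064, 1.083, 1.299],
    "Inverse": [0.940, 0.923, 0.770],
})

# Boolean mask for the rows, list of labels for the columns.
eur_rates = df.loc[df["Currency"] == "EUR", ["Rate", "Inverse"]]

# Label slices are inclusive on both ends: rows 0..1, columns Currency..Rate.
label_slice = df.loc[0:1, "Currency":"Rate"]

print(eur_rates)
print(label_slice)
```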


Modify a subset of a dataframe:

In [13]:
df = usd_exchange_rates_df()
df.loc[df["Currency"] == "EUR", "Currency/USD"] = 2
df.loc[df["Currency"] == "GBP", "USD/Currency"] = 0.5
df

|  | Year | Currency | Currency/USD | USD/Currency |
|---|---|---|---|---|
| 0 | 2016-12-31 | EUR | 2.0 | 0.94 |
| 1 | 2017-12-31 | EUR | 2.0 | 0.923 |
| 2 | 2018-12-31 | EUR | 2.0 | 0.848 |
| 3 | 2019-12-31 | EUR | 2.0 | 0.893 |
| 4 | 2020-12-31 | EUR | 2.0 | 0.877 |
| 5 | 2016-12-31 | GBP | 1.299 | 0.5 |
| 6 | 2017-12-31 | GBP | 1.238 | 0.5 |
| 7 | 2018-12-31 | GBP | 1.333 | 0.5 |
| 8 | 2019-12-31 | GBP | 1.276 | 0.5 |
| 9 | 2020-12-31 | GBP | 1.284 | 0.5 |


### Boolean masks for [] and loc[]

Boolean masks can be combined with `&` (and), `|` (or) and `~` (negation) and passed to `[]` and to `loc[]`. Each condition has to be enclosed in parentheses, since `&` and `|` have higher precedence in Python than comparison operators like `>=`:

In [14]:
df = usd_exchange_rates_df()
df[(df["Year"] >= pd.to_datetime("2018-12-31")) &
   (df["Year"] <= pd.to_datetime("2020-12-31"))]

|  | Year | Currency | Currency/USD | USD/Currency |
|---|---|---|---|---|
| 2 | 2018-12-31 | EUR | 1.179 | 0.848 |
| 3 | 2019-12-31 | EUR | 1.12 | 0.893 |
| 4 | 2020-12-31 | EUR | 1.14 | 0.877 |
| 7 | 2018-12-31 | GBP | 1.333 | 0.75 |
| 8 | 2019-12-31 | GBP | 1.276 | 0.784 |
| 9 | 2020-12-31 | GBP | 1.284 | 0.779 |
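
The example above only uses `&`. A minimal sketch of `|` and `~` on a made-up frame (the `Rate` column is an assumption for this example):

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP", "GBP"],
    "Rate": [1.064, 1.179, 1.299, 1.238],
})

# OR: rows that are GBP, or whose rate exceeds 1.1 (each condition parenthesized).
gbp_or_high = df[(df["Currency"] == "GBP") | (df["Rate"] > 1.1)]

# NOT: every row that is not EUR; the same mask works with loc[].
not_eur = df.loc[~(df["Currency"] == "EUR")]

print(gbp_or_high.index.tolist())  # [1, 2, 3]
print(not_eur.index.tolist())      # [2, 3]
```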


The condition inside `[]` evaluates to a boolean series with one entry per row:

In [15]:
((df["Year"] >= pd.to_datetime("2018-12-31")) &
 (df["Year"] <= pd.to_datetime("2020-12-31")))

0    False
1    False
2     True
3     True
4     True
5    False
6    False
7     True
8     True
9     True
Name: Year, dtype: bool

Use the `isin()` series method to select rows whose value belongs to a given set:

In [16]:
df = usd_exchange_rates_df()
df[df["Year"].isin([
    pd.to_datetime("2018-12-31"),
    pd.to_datetime("2019-12-31"),
    pd.to_datetime("2020-12-31")
])]

|  | Year | Currency | Currency/USD | USD/Currency |
|---|---|---|---|---|
| 2 | 2018-12-31 | EUR | 1.179 | 0.848 |
| 3 | 2019-12-31 | EUR | 1.12 | 0.893 |
| 4 | 2020-12-31 | EUR | 1.14 | 0.877 |
| 7 | 2018-12-31 | GBP | 1.333 | 0.75 |
| 8 | 2019-12-31 | GBP | 1.276 | 0.784 |
| 9 | 2020-12-31 | GBP | 1.284 | 0.779 |
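
Since `isin()` returns a boolean series, it combines with the other mask tools. The sketch below, on a made-up frame, matches on the calendar year through the `.dt` accessor instead of on full timestamps, and uses `~` to select the complement.

```python
import pandas as pd

# Hypothetical miniature dataset with a datetime column.
df = pd.DataFrame({
    "Year": pd.to_datetime(["2016-12-31", "2017-12-31", "2018-12-31", "2019-12-31"]),
    "Rate": [1.064, 1.083, 1.179, 1.120],
})

wanted_years = [2018, 2019]

# Compare the extracted year (an integer) against a list of integers.
in_years = df[df["Year"].dt.year.isin(wanted_years)]

# Negating the mask selects every other row.
not_in_years = df[~df["Year"].dt.year.isin(wanted_years)]

print(in_years.index.tolist())      # [2, 3]
print(not_in_years.index.tolist())  # [0, 1]
```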


## Grouping

### Reduce group-by-group and series-by-series with agg()

`df.groupby().agg(func)` will call `func(series)` once for each series of every group.

`func` should return a scalar.

In [17]:
df = usd_exchange_rates_df()
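# Select the rate columns explicitly so that "Year" is not aggregated as well.
df.groupby("Currency")[["Currency/USD", "USD/Currency"]].agg(np.mean)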

| Currency | Currency/USD | USD/Currency |
|---|---|---|
| EUR | 1.117 | 0.896 |
| GBP | 1.286 | 0.778 |
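
`func` does not have to be a NumPy function: any callable that maps a series to a scalar works. A minimal sketch with a hypothetical helper `value_range` on a made-up frame:

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP", "GBP"],
    "Rate": [1.0, 3.0, 10.0, 14.0],
})


def value_range(series: pd.Series) -> float:
    """Reduce one column of one group to a scalar: max minus min."""
    return series.max() - series.min()


# value_range receives the "Rate" values of each group as a series.
ranges = df.groupby("Currency")[["Rate"]].agg(value_range)
print(ranges)
```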


Multiple aggregations can be specified:

In [18]:
df = usd_exchange_rates_df()
df.groupby("Currency")[["Currency/USD", "USD/Currency"]].agg([np.mean, np.var])

| Currency | Currency/USD mean | Currency/USD var | USD/Currency mean | USD/Currency var |
|---|---|---|---|---|
| EUR | 1.117 | 0.002 | 0.896 | 0.001335 |
| GBP | 1.286 | 0.001 | 0.778 | 0.0004462 |
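
Aggregations can also be named by string, and a dict can assign different aggregations to different columns. The sketch below uses a made-up frame with assumed column names. Note that the string `"var"` is the pandas sample variance (`ddof=1`).

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP", "GBP"],
    "Rate": [1.0, 3.0, 10.0, 14.0],
    "Inverse": [1.0, 0.5, 0.1, 0.3],
})

# A list of names yields one sub-column per aggregation.
by_name = df.groupby("Currency")[["Rate"]].agg(["mean", "var"])

# A dict maps each column to its own aggregation or list of aggregations.
per_column = df.groupby("Currency").agg({"Rate": "mean", "Inverse": ["min", "max"]})

print(by_name)
print(per_column)
```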


Use keyword arguments to rename the resulting columns:

In [19]:
df = usd_exchange_rates_df()
df.groupby("Currency")[["Currency/USD", "USD/Currency"]].agg(
    avg_cur2usd=("Currency/USD", np.mean),
    avg_usd2cur=("USD/Currency", np.mean),
)

| Currency | avg_cur2usd | avg_usd2cur |
|---|---|---|
| EUR | 1.117 | 0.896 |
| GBP | 1.286 | 0.778 |
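
The `(column, function)` tuples can also be written as `pd.NamedAgg`, which spells out the two fields. Result names that are not valid Python identifiers can be passed by unpacking a dict. A sketch on a made-up frame; the result column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP", "GBP"],
    "Rate": [1.0, 3.0, 10.0, 14.0],
})

# pd.NamedAgg is a named tuple equivalent to ("Rate", "mean").
summary = df.groupby("Currency").agg(
    avg_rate=pd.NamedAgg(column="Rate", aggfunc="mean"),
    max_rate=pd.NamedAgg(column="Rate", aggfunc="max"),
)

# Names containing spaces cannot be keywords, so unpack a dict instead.
spaced = df.groupby("Currency").agg(**{"average rate": ("Rate", "mean")})

print(summary)
print(spaced)
```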


When aggregating a single series, named aggregation takes `name=func` keyword arguments instead of `(column, func)` tuples:

In [20]:
df = usd_exchange_rates_df()
df.groupby("Currency")["Currency/USD"].agg(average=np.mean)

| Currency | average |
|---|---|
| EUR | 1.117 |
| GBP | 1.286 |


### Reduce group-by-group with apply

`df.groupby().apply(func)` will call `func(group)` once for each group, where `group` is a dataframe containing the rows within each group.

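`func` can return:
- a scalar - making the result of `apply()` a series with one entry per group
- a series - making the result of `apply()` a dataframe with one row per group when every group returns the same index (as in the example below), or a series with a hierarchical index otherwise
- a dataframe - making the result of `apply()` a dataframe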

In [21]:
df = usd_exchange_rates_df()
df.groupby("Currency")[["Currency/USD", "USD/Currency"]].apply(lambda group: group.mean())

| Currency | Currency/USD | USD/Currency |
|---|---|---|
| EUR | 1.117 | 0.896 |
| GBP | 1.286 | 0.778 |
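
The scalar and dataframe return types can be sketched as follows. The frame and its `Low` and `High` columns are made up for this example.

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP", "GBP"],
    "Low": [1.0, 2.0, 3.0, 4.0],
    "High": [2.0, 4.0, 4.0, 7.0],
})
groups = df.groupby("Currency")[["Low", "High"]]

# func returns a scalar per group -> the result is a series indexed by group.
avg_spread = groups.apply(lambda group: (group["High"] - group["Low"]).mean())

# func returns a dataframe per group -> the results are stacked into a dataframe.
top_rows = groups.apply(lambda group: group.nlargest(1, "High"))

print(avg_spread)
print(top_rows)
```

Unlike `agg()`, `func` sees all selected columns of a group at once, which is what allows it to combine `High` and `Low`.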


### Transform group-by-group and series-by-series with transform()

`df.groupby().transform(func)` will call `func(series_in_group)` once for each series in each group. In contrast to `apply()`, the result of `transform()` has the same index as the original dataframe, so it lines up with it row by row.

`func(series_in_group)` should return either a series of the same length as `series_in_group` or a scalar, in which case pandas broadcasts it to a series of length `len(series_in_group)`.

In [22]:
df = usd_exchange_rates_df()
df.groupby("Currency")[["Currency/USD", "USD/Currency"]].transform(lambda s: s.mean())

|  | Currency/USD | USD/Currency |
|---|---|---|
| 0 | 1.117 | 0.896 |
| 1 | 1.117 | 0.896 |
| 2 | 1.117 | 0.896 |
| 3 | 1.117 | 0.896 |
| 4 | 1.117 | 0.896 |
| 5 | 1.286 | 0.778 |
| 6 | 1.286 | 0.778 |
| 7 | 1.286 | 0.778 |
| 8 | 1.286 | 0.778 |
| 9 | 1.286 | 0.778 |
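
Because the result is aligned with the original rows, it can be used directly in row-wise arithmetic or stored as a new column. A minimal sketch on a made-up frame; the `Deviation` column name is an assumption:

```python
import pandas as pd

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "Currency": ["EUR", "EUR", "GBP", "GBP"],
    "Rate": [1.0, 3.0, 10.0, 14.0],
})

# One group mean per original row, in the original row order.
group_mean = df.groupby("Currency")["Rate"].transform("mean")

# Row-wise deviation of each rate from the mean of its own currency.
df["Deviation"] = df["Rate"] - group_mean

print(df)
```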
