# `.groupby()` and `.agg()` - 1

The real power of data analysis with `DataFrames` comes into focus when we start using the `.groupby()` and `.agg()` methods.  The operations they perform are known as *grouping* and *aggregating*.

Talking about grouping in the abstract can be confusing; I think it's best to see grouping in action by doing meaningful calculations.

The purpose of this chapter is to introduce grouping and aggregation by way of the following finance task: calculating monthly returns and monthly volatilities for several ETFs.

## Loading Packages

Let's begin by loading the packages that we will need.

In [1]:
import numpy as np
import pandas as pd
import yfinance as yf

## Reading-In Data

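Our analysis will be on the set of July 2021 prices for `SPY`, `IWM`, `QQQ`, and `DIA`.  The download starts on June 30 so that a return can be calculated for the first trading day of July.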

Let's read-in that data with **yfinance**.

In [2]:
pd.options.display.max_rows = 25
df_etf = yf.download(
    ['SPY', 'QQQ', 'IWM', 'DIA'], start='2021-06-30', end='2021-07-31',
    auto_adjust=False, rounding=True
)
df_etf.head()

[*********************100%***********************]  4 of 4 completed


Price,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,High,High,...,Low,Low,Open,Open,Open,Open,Volume,Volume,Volume,Volume
Ticker,DIA,IWM,QQQ,SPY,DIA,IWM,QQQ,SPY,DIA,IWM,...,QQQ,SPY,DIA,IWM,QQQ,SPY,DIA,IWM,QQQ,SPY
Date,,,,,,,,,,,,,,,,,,,,,
2021-06-30,320.48,217.77,345.63,404.51,344.95,229.37,354.43,428.06,345.51,230.32,...,353.83,427.18,342.38,228.65,354.83,427.21,3778900,26039000,32724000,64827900
2021-07-01,321.79,219.69,345.76,406.75,346.36,231.39,354.57,430.43,346.4,231.85,...,352.68,428.8,345.78,230.81,354.07,428.87,3606900,18089100,29290000,53441000
2021-07-02,323.26,217.6,349.73,409.86,347.94,229.19,358.64,433.72,348.29,232.08,...,356.28,430.52,347.04,232.0,356.52,431.67,3013500,21029700,32727200,57697700
2021-07-06,321.29,214.44,351.24,409.11,345.82,225.86,360.19,432.93,348.11,229.46,...,356.49,430.01,347.75,229.36,359.26,433.78,3910600,27771300,38842400,68710400
2021-07-07,322.32,212.44,351.99,410.56,346.92,223.76,360.95,434.46,347.14,226.67,...,358.94,431.51,345.65,225.54,362.45,433.66,3347000,28521500,35265200,63549500


This data is not as tidy as we would like.  Let's use method chaining to perform a series of data munging operations.

In [3]:
df_etf = (
    df_etf[[('Close','DIA'), ('Close', 'IWM'), ('Close', 'QQQ'), ('Close', 'SPY')]] # grab close prices
    .stack() # move the ticker level from the columns into the rows (wide to long)
    .reset_index()
    .rename_axis(None, axis=1) # remove the name from the column index
    .rename(columns={'Date':'date', 'Ticker':'symbol', 'Close':'close'}) # columns in snake-case
    .sort_values(by=['symbol', 'date'])
)
df_etf

Unnamed: 0,date,symbol,close
0,2021-06-30,DIA,344.95
4,2021-07-01,DIA,346.36
8,2021-07-02,DIA,347.94
12,2021-07-06,DIA,345.82
16,2021-07-07,DIA,346.92
...,...,...,...
71,2021-07-26,SPY,441.02
75,2021-07-27,SPY,439.01
79,2021-07-28,SPY,438.83
83,2021-07-29,SPY,440.65
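
To see what the `.stack()` step does in isolation, here is a minimal sketch on a made-up two-symbol table.  The symbols `AAA` and `BBB`, the prices, and the names `df_wide` and `df_long` are all hypothetical; the real `df_etf` has an extra column level that this toy version leaves out.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per symbol (made-up prices).
df_wide = pd.DataFrame(
    {'AAA': [10.0, 11.0], 'BBB': [20.0, 19.0]},
    index=pd.to_datetime(['2021-07-01', '2021-07-02']),
)
df_wide.index.name = 'date'
df_wide.columns.name = 'symbol'

df_long = (
    df_wide
    .stack()                          # Series indexed by (date, symbol)
    .rename('close')                  # name the values so reset_index() labels the column
    .reset_index()                    # date and symbol become ordinary columns
    .sort_values(['symbol', 'date'])
    .reset_index(drop=True)
)
print(df_long)
```

The long layout, with one row per (`symbol`, `date`) pair, is what makes the later `.groupby('symbol')` calls possible.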


## Daily Returns with `groupby()`

Our ultimate goal is to calculate monthly returns and monthly volatilities for each ETF in `df_etf`.  These quantities are both functions of daily returns.  So our first order of business is to calculate daily returns. 

In a previous tutorial we calculated daily returns in a simple vectorized fashion.  Unfortunately, we can't use the exact same approach here because there are multiple ETFs in the data set.

To overcome this challenge we will use our first application of `.groupby()`.

Here is the `.groupby()` code that calculates daily returns for each ETF.

In [4]:
# sorting values to get everything in the right order
df_etf.sort_values(['symbol', 'date'], inplace=True)

# vectorized return calculation
df_etf['ret'] = \
    df_etf['close'].groupby(df_etf['symbol']).pct_change()
df_etf

Unnamed: 0,date,symbol,close,ret
0,2021-06-30,DIA,344.95,
4,2021-07-01,DIA,346.36,0.004088
8,2021-07-02,DIA,347.94,0.004562
12,2021-07-06,DIA,345.82,-0.006093
16,2021-07-07,DIA,346.92,0.003181
...,...,...,...,...
71,2021-07-26,SPY,441.02,0.002455
75,2021-07-27,SPY,439.01,-0.004558
79,2021-07-28,SPY,438.83,-0.000410
83,2021-07-29,SPY,440.65,0.004147
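
To make the need for `.groupby()` concrete, here is a small sketch that compares a plain `.pct_change()` with the grouped version.  The data in `df_toy` is made up, and the column names `ret_naive` and `ret_grouped` are only for illustration.

```python
import pandas as pd

# Made-up prices for two hypothetical symbols, already sorted by symbol and date.
df_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'close':  [100.0, 110.0, 99.0, 50.0, 55.0, 66.0],
})

# Naive: treats the whole column as one price series, so the first BBB row
# is compared against the last AAA price.
df_toy['ret_naive'] = df_toy['close'].pct_change()

# Grouped: the calculation restarts within each symbol, so the first row
# of every symbol is NaN.
df_toy['ret_grouped'] = df_toy.groupby('symbol')['close'].pct_change()

print(df_toy)
```

In the naive column the first `BBB` row shows a meaningless return of roughly -49%, because it mixes prices from two different symbols; the grouped column correctly leaves that row missing.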


--- 

**Code Challenge:** If the `.groupby()` worked correctly, we should see a `NaN` value in the `ret` column for the first trade date of each ETF.  Use `DataFrame.query()` to confirm this.

In [5]:
#| code-fold: true
#| code-summary: "Solution"
df_etf.query('ret.isnull()')

Unnamed: 0,date,symbol,close,ret
0,2021-06-30,DIA,344.95,
1,2021-06-30,IWM,229.37,
2,2021-06-30,QQQ,354.43,
3,2021-06-30,SPY,428.06,
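
If we would rather have the check pass or fail than inspect the rows by eye, one possible approach is sketched below.  It uses a boolean mask instead of `.query()`, and the toy data and the names `missing`, `first_dates`, and `ok` are hypothetical.

```python
import pandas as pd

# Made-up prices for two hypothetical symbols.
df_toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-06-30', '2021-07-01', '2021-06-30', '2021-07-01']),
    'symbol': ['AAA', 'AAA', 'BBB', 'BBB'],
    'close': [100.0, 101.0, 50.0, 49.0],
})
df_toy['ret'] = df_toy.groupby('symbol')['close'].pct_change()

missing = df_toy[df_toy['ret'].isna()]                 # rows with no return
first_dates = df_toy.groupby('symbol')['date'].min()   # earliest date per symbol

# Exactly one missing return per symbol, and it falls on that symbol's first date.
ok = bool(
    missing['symbol'].is_unique
    and set(missing['symbol']) == set(df_toy['symbol'])
    and missing.set_index('symbol')['date'].eq(first_dates).all()
)
print(ok)
```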


---

## Monthly Return for Each `symbol`

We'll now proceed to calculate monthly returns and monthly volatilities for each of the ETFs in our data set.  This amounts to first grouping by `symbol`, and then performing an aggregation calculation on `ret` (daily returns).  

Let's start with monthly returns.  As a preliminary step we'll calculate the daily growth factor in a separate column.

In [6]:
df_etf['daily_factor'] = 1 + df_etf['ret']
df_etf.head()

Unnamed: 0,date,symbol,close,ret,daily_factor
0,2021-06-30,DIA,344.95,,
4,2021-07-01,DIA,346.36,0.004088,1.004088
8,2021-07-02,DIA,347.94,0.004562,1.004562
12,2021-07-06,DIA,345.82,-0.006093,0.993907
16,2021-07-07,DIA,346.92,0.003181,1.003181


Recall that the monthly growth factor is the product of the daily growth factors.  Here is a way to write all that logic in a single line using `.groupby()` and `.agg()`:

In [7]:
df_grouped_factor = \
    df_etf.groupby(['symbol'])['daily_factor'].agg(['prod']).reset_index() # 'prod' skips the NaN in each symbol's first row
df_grouped_factor

Unnamed: 0,symbol,prod
0,DIA,1.013132
1,IWM,0.963727
2,QQQ,1.028609
3,SPY,1.024412


Notice that **pandas** isn't very sophisticated about the name that it gives to the column that stores the aggregation calculation.  It just gave it the name `prod`, which is the name of the function that was used in the aggregation calculation.  Let's make `df_grouped_factor` a bit more readable by renaming that column.

In [8]:
df_grouped_factor.rename(columns={'prod': 'monthly_factor'}, inplace=True)
df_grouped_factor

Unnamed: 0,symbol,monthly_factor
0,DIA,1.013132
1,IWM,0.963727
2,QQQ,1.028609
3,SPY,1.024412
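
As an aside, **pandas** also supports *named aggregation*, which lets us choose the output column name inside `.agg()` itself and skip the separate rename.  The sketch below uses made-up growth factors; `df_toy` and `df_named` are hypothetical names.

```python
import pandas as pd

# Made-up daily growth factors for two hypothetical symbols.
df_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'BBB', 'BBB'],
    'daily_factor': [1.01, 1.02, 0.99, 1.03],
})

# Named aggregation: output_column=(input_column, aggregation)
df_named = (
    df_toy
    .groupby('symbol')
    .agg(monthly_factor=('daily_factor', 'prod'))
    .reset_index()
)
print(df_named)
```

Either style works; the rename approach above keeps each step visible, while named aggregation is more compact.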


And finally, recall that the monthly return is calculated by subtracting one from the monthly growth factor.

In [9]:
df_grouped_factor['monthly_return'] = df_grouped_factor['monthly_factor'] - 1
df_grouped_factor[['symbol', 'monthly_return']]

Unnamed: 0,symbol,monthly_return
0,DIA,0.013132
1,IWM,-0.036273
2,QQQ,0.028609
3,SPY,0.024412
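
A handy sanity check on this calculation: because each daily growth factor is a ratio of consecutive prices, their product telescopes to the last price divided by the first.  The sketch below verifies this on a made-up price series; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for a single hypothetical symbol.
close = pd.Series([100.0, 102.0, 99.0, 105.0])

daily_factor = 1 + close.pct_change()   # first entry is NaN
monthly_factor = daily_factor.prod()    # NaN is skipped by default
monthly_return = monthly_factor - 1

# The product of daily factors equals last price / first price.
direct_factor = close.iloc[-1] / close.iloc[0]
print(monthly_factor, direct_factor, np.isclose(monthly_factor, direct_factor))
```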


## Monthly Volatility for Each `symbol`

Now let's calculate the (realized/historical) volatility for each of the ETFs.

We once again use `.groupby()` and `.agg()` to do most of the work in a single line of code.

In [10]:
df_grouped_vol = \
    df_etf.groupby(['symbol'])['ret'].agg(['std']).reset_index() # sample standard deviation of daily returns

df_grouped_vol

Unnamed: 0,symbol,std
0,DIA,0.007733
1,IWM,0.014032
2,QQQ,0.006832
3,SPY,0.007152


Again, let's rename our aggregation column to something more descriptive.

In [11]:
df_grouped_vol.rename(columns={'std':'daily_vol'}, inplace=True)
df_grouped_vol

Unnamed: 0,symbol,daily_vol
0,DIA,0.007733
1,IWM,0.014032
2,QQQ,0.006832
3,SPY,0.007152


What we have calculated is a daily volatility, but when practitioners talk about volatility, they typically annualize it.  A daily volatility is annualized by multiplying it by $\sqrt{252}$, where 252 is the approximate number of trading days in a year.

In [12]:
df_grouped_vol['ann_vol'] = df_grouped_vol['daily_vol'] * np.sqrt(252)
df_grouped_vol

Unnamed: 0,symbol,daily_vol,ann_vol
0,DIA,0.007733,0.122752
1,IWM,0.014032,0.222744
2,QQQ,0.006832,0.108455
3,SPY,0.007152,0.113542
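
The square root comes from the usual simplifying assumption that daily returns are uncorrelated and share the same variance, so that variances add across days: $\sigma_{\text{annual}}^2 = 252\,\sigma_{\text{daily}}^2$.  Here is a minimal sketch of that scaling; the helper `annualize_vol` and the constant `TRADING_DAYS_PER_YEAR` are hypothetical names, not part of any library.

```python
import numpy as np

# Common convention (an assumption, not an exact count for every year).
TRADING_DAYS_PER_YEAR = 252

def annualize_vol(daily_vol, periods_per_year=TRADING_DAYS_PER_YEAR):
    """Scale a per-period volatility to an annual one using the square-root-of-time rule."""
    return daily_vol * np.sqrt(periods_per_year)

# A made-up daily volatility of 1% corresponds to roughly 15.9% annualized.
ann = annualize_vol(0.01)
print(round(ann, 4))
```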


--- 

**Code Challenge:** Use `.groupby()` and `.agg()` to calculate the average daily return for each of the ETFs.

In [13]:
(
    df_etf
    .groupby(['symbol'])[['ret']].agg('mean') # missing returns are skipped
    .reset_index()
    .rename(columns={'ret':'daily_avg_ret'})
)

Unnamed: 0,symbol,daily_avg_ret
0,DIA,0.00065
1,IWM,-0.001665
2,QQQ,0.001366
3,SPY,0.001174
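
Once the individual aggregations are familiar, several of them can be requested in a single `.agg()` call.  The sketch below does this on made-up returns using named aggregation; `df_toy`, `summary`, and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up daily returns for two hypothetical symbols; the first row of each is missing.
df_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'ret': [np.nan, 0.01, 0.03, np.nan, -0.02, 0.02],
})

summary = (
    df_toy
    .groupby('symbol')['ret']
    .agg(daily_avg_ret='mean', daily_vol='std', n_obs='count')  # NaN rows are ignored
    .reset_index()
)
print(summary)
```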


---

## Related Reading

*Python Data Science Handbook (VanderPlas)* - Section 3.8 - Aggregation and Grouping

*Python for Data Analysis (McKinney)* - Chapter 9 (pp 251-274) Data Aggregation and Grouping Operations

*Options, Futures, and Other Derivatives (Hull)* - Chapter 15 (pp 325-329) The Black-Scholes-Merton Model