# `.groupby()` and `.agg()` - Part 1

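The real power of data analysis with `DataFrames` comes into focus when we start using the `.groupby()` and `.agg()` methods.

These methods perform *grouping* and *aggregating*: splitting the rows of a `DataFrame` into groups that share a key, and then reducing each group to one or more summary values.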

Talking about grouping in the abstract can be confusing; I think it's best to see grouping in action by doing meaningful calculations.

The purpose of this tutorial is to introduce grouping and aggregation by way of the following finance task: calculating monthly returns and volatilities for several ETFs.

### Loading Packages

Let's load the packages that we will need for this tutorial.

In [1]:
import numpy as np
import pandas as pd
import pandas_datareader as pdr

### Reading-In Data

Our analysis will use July 2021 prices for `SPY`, `IWM`, `QQQ`, and `DIA`.  We also pull the last trading day of June so that a return can be computed for the first trading day of July.

Let's read in that data with `pandas_datareader`.

In [2]:
pd.options.display.max_rows = 25
df_etf = pdr.get_data_yahoo(['SPY', 'QQQ', 'IWM', 'DIA'], start='2021-06-30', end='2021-07-31')
df_etf = df_etf.round(2)
df_etf.head()

Attributes,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,High,High,...,Low,Low,Open,Open,Open,Open,Volume,Volume,Volume,Volume
Symbols,SPY,QQQ,IWM,DIA,SPY,QQQ,IWM,DIA,SPY,QQQ,...,IWM,DIA,SPY,QQQ,IWM,DIA,SPY,QQQ,IWM,DIA
Date,,,,,,,,,,,,,,,,,,,,,
2021-06-30,428.06,354.43,229.37,344.75,428.06,354.43,229.37,344.95,428.78,355.23,...,227.76,342.35,427.21,354.83,228.65,342.38,64827900.0,32724000.0,26039000.0,3778900.0
2021-07-01,430.43,354.57,231.39,346.16,430.43,354.57,231.39,346.36,430.6,355.09,...,229.71,344.92,428.87,354.07,230.81,345.78,53441000.0,29290000.0,18089100.0,3606900.0
2021-07-02,433.72,358.64,229.19,347.73,433.72,358.64,229.19,347.94,434.1,358.97,...,228.56,346.18,431.67,356.52,232.0,347.04,57697700.0,32727200.0,21029700.0,3013500.0
2021-07-06,432.93,360.19,225.86,345.62,432.93,360.19,225.86,345.82,434.01,360.48,...,223.87,343.6,433.78,359.26,229.36,347.75,68710400.0,38842400.0,27771300.0,3910600.0
2021-07-07,434.46,360.95,223.76,346.71,434.46,360.95,223.76,346.92,434.76,362.76,...,221.8,344.43,433.66,362.45,225.54,345.65,63549500.0,35265200.0,28521500.0,3347000.0


This data is not as tidy as we would like.  Let's use method chaining to perform a series of data munging operations.

In [3]:
df_etf = \
    (
    df_etf
        .stack(level='Symbols') #pivot the table
        .reset_index() #turn date into a column 
        .sort_values(by=['Symbols', 'Date']) #sort
        .rename(columns={'Date':'date', 'Symbols':'symbol', 'Adj Close':'adj_close','Close':'close', 
                         'High':'high', 'Low':'low', 'Open':'open', 'Volume':'volume'}) #renaming columns
        [['date', 'symbol','open', 'high', 'low', 'close', 'volume', 'adj_close']] #reordering columns
    )
df_etf

Attributes,date,symbol,open,high,low,close,volume,adj_close
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75
7,2021-07-01,DIA,345.78,346.40,344.92,346.36,3606900.0,346.16
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73
15,2021-07-06,DIA,347.75,348.11,343.60,345.82,3910600.0,345.62
19,2021-07-07,DIA,345.65,347.14,344.43,346.92,3347000.0,346.71
...,...,...,...,...,...,...,...,...
68,2021-07-26,SPY,439.31,441.03,439.26,441.02,43719200.0,441.02
72,2021-07-27,SPY,439.91,439.94,435.99,439.01,67397100.0,439.01
76,2021-07-28,SPY,439.68,440.30,437.31,438.83,52472400.0,438.83
80,2021-07-29,SPY,439.82,441.80,439.81,440.65,47435300.0,440.65
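
To make the reshaping step easier to follow, here is a minimal sketch of the same chain applied to a tiny hand-built `DataFrame`. The prices are copied from the output above, but the volumes and the names `df_wide` and `df_long` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical two-level columns mimicking the (Attributes, Symbols) layout.
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["SPY", "QQQ"]], names=["Attributes", "Symbols"]
)
dates = pd.to_datetime(["2021-06-30", "2021-07-01"])
df_wide = pd.DataFrame(
    [[428.06, 354.43, 100.0, 200.0], [430.43, 354.57, 110.0, 210.0]],
    index=pd.Index(dates, name="Date"),
    columns=columns,
)

# stack() moves the 'Symbols' column level into the row index,
# leaving one row per (Date, symbol) pair.
df_long = (
    df_wide
    .stack(level="Symbols")
    .reset_index()  # turn Date and Symbols into ordinary columns
    .rename(columns={"Date": "date", "Symbols": "symbol",
                     "Close": "close", "Volume": "volume"})
    .sort_values(["symbol", "date"])
)
print(df_long)
```

The wide table has one row per date and one column per (attribute, symbol) pair; the long table has one row per (date, symbol) pair, which is the shape that `.groupby()` works with most naturally.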


### Calculating Daily Returns with `groupby()`

Our ultimate goal is to calculate monthly returns and monthly volatilities for each ETF in `df_etf`.  These quantities are both functions of daily returns.  So, our first order of business is to calculate daily returns. 

In a previous tutorial we calculated daily returns in a simple vectorized fashion.  Unfortunately, we can't use the exact same approach here because there are multiple ETFs in the data set.

To overcome this challenge we will use our first application of `groupby()`.

Here is the `.groupby()` code that calculates daily returns for each ETF.

In [4]:
# sorting values to get everything in the right order
df_etf.sort_values(['symbol', 'date'], inplace=True)

# vectorized return calculation
df_etf['ret'] = \
    df_etf['close'].groupby(df_etf['symbol']).pct_change()
df_etf.head()

Attributes,date,symbol,open,high,low,close,volume,adj_close,ret
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75,
7,2021-07-01,DIA,345.78,346.4,344.92,346.36,3606900.0,346.16,0.004088
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73,0.004562
15,2021-07-06,DIA,347.75,348.11,343.6,345.82,3910600.0,345.62,-0.006093
19,2021-07-07,DIA,345.65,347.14,344.43,346.92,3347000.0,346.71,0.003181
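
To see why the grouping matters, the following sketch compares an ungrouped `pct_change()` with the grouped version on a small made-up data set. The tickers `AAA` and `BBB` and their prices are hypothetical.

```python
import pandas as pd

# Hypothetical prices for two made-up tickers, already sorted by symbol then date.
df = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "close": [100.0, 110.0, 99.0, 50.0, 55.0, 66.0],
})

# Naive version: the first BBB row is compared against the last AAA row.
df["ret_naive"] = df["close"].pct_change()

# Grouped version: pct_change restarts within each symbol.
df["ret"] = df["close"].groupby(df["symbol"]).pct_change()
print(df)
```

In the naive column, the first `BBB` row holds a meaningless "return" from 99 to 50, while the grouped column correctly holds `NaN` there.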


**Code Challenge:** If the `.groupby()` worked correctly, we should see a `NaN` value in the `ret` column for the first trade-date of each ETF.  Use `DataFrame.query()` to confirm this.

In [5]:
df_etf.query('ret.isnull()')

Attributes,date,symbol,open,high,low,close,volume,adj_close,ret
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75,
2,2021-06-30,IWM,228.65,230.32,227.76,229.37,26039000.0,229.37,
1,2021-06-30,QQQ,354.83,355.23,353.83,354.43,32724000.0,354.43,
0,2021-06-30,SPY,427.21,428.78,427.18,428.06,64827900.0,428.06,


### Monthly Return for Each `symbol`

We'll now proceed to calculate monthly returns and volatilities for each of the ETFs in our data set.  This amounts to first grouping by `symbol`, and then performing an aggregation calculation on the daily returns.

Let's start with monthly returns.  As a preliminary step we'll calculate the daily growth factor in a separate column.

In [6]:
df_etf['daily_factor'] = 1 + df_etf['ret']
df_etf.head()

Attributes,date,symbol,open,high,low,close,volume,adj_close,ret,daily_factor
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75,,
7,2021-07-01,DIA,345.78,346.4,344.92,346.36,3606900.0,346.16,0.004088,1.004088
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73,0.004562,1.004562
15,2021-07-06,DIA,347.75,348.11,343.6,345.82,3910600.0,345.62,-0.006093,0.993907
19,2021-07-07,DIA,345.65,347.14,344.43,346.92,3347000.0,346.71,0.003181,1.003181


Recall that the monthly growth factor is the product of the daily growth factors.  Here is a way to write all that logic in a single line using `.groupby()` and `.agg()`:

In [7]:
df_grouped_factor = \
    df_etf.groupby(['symbol'])['daily_factor'].agg(['prod']).reset_index()  # 'prod' skips the NaN in each group's first row
df_grouped_factor

,symbol,prod
0,DIA,1.013132
1,IWM,0.963727
2,QQQ,1.028609
3,SPY,1.024412


Notice that `pandas` isn't very sophisticated about the name that it gives to the column that stores the aggregation calculation.  It just gave it the name `prod`, which is the name of the function that was used in the aggregation calculation.  Let's make `df_grouped_factor` a bit more readable by renaming that column.

In [8]:
df_grouped_factor.rename(columns={'prod': 'monthly_factor'}, inplace=True)
df_grouped_factor

,symbol,monthly_factor
0,DIA,1.013132
1,IWM,0.963727
2,QQQ,1.028609
3,SPY,1.024412
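
As an aside, the aggregation and the rename can be combined into one step with what `pandas` calls named aggregation, where the keyword passed to `.agg()` becomes the output column name. The sketch below uses hypothetical tickers and growth factors rather than the ETF data.

```python
import pandas as pd

# Hypothetical daily growth factors for two made-up tickers.
df = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "daily_factor": [1.10, 0.90, 1.10, 1.20],
})

# Named aggregation: the keyword becomes the output column name.
df_factor = (
    df.groupby("symbol")["daily_factor"]
    .agg(monthly_factor="prod")
    .reset_index()
)
print(df_factor)
```

Whether this reads better than a separate `.rename()` is a matter of taste; the two approaches should give the same table.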


And finally, recall that the monthly return is calculated by subtracting one from the monthly growth factor.

In [9]:
df_grouped_factor['monthly_return'] = df_grouped_factor['monthly_factor'] - 1
df_grouped_factor[['symbol', 'monthly_return']]

,symbol,monthly_return
0,DIA,0.013132
1,IWM,-0.036273
2,QQQ,0.028609
3,SPY,0.024412
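
A quick sanity check on this approach: because each daily factor is $P_t / P_{t-1}$, the product of the daily factors telescopes to $P_T / P_0$, the last price divided by the first. The sketch below verifies this on a short hypothetical price series for a single made-up ticker.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for a single made-up ticker.
close = pd.Series([100.0, 102.0, 99.0, 105.0])

daily_factor = 1 + close.pct_change()  # first entry is NaN
monthly_factor = daily_factor.prod()   # pandas skips the NaN by default
monthly_return = monthly_factor - 1

# The product telescopes to last price / first price.
assert np.isclose(monthly_factor, close.iloc[-1] / close.iloc[0])
print(monthly_return)
```

The same check can be applied to any row of the results above by dividing the last `close` of the month by the `close` on 2021-06-30.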


### Monthly Volatility for Each `symbol`

Now let's calculate the (realized/historical) volatility for each of the ETFs.

We once again use `.groupby()` and `.agg()` to do this all in a single line of code.

In [10]:
df_grouped_vol = \
    df_etf.groupby(['symbol'])['ret'].agg(['std']).reset_index()  # sample standard deviation of daily returns

df_grouped_vol

,symbol,std
0,DIA,0.007733
1,IWM,0.014032
2,QQQ,0.006832
3,SPY,0.007152


Again, let's rename our aggregation column to something more descriptive.

In [11]:
df_grouped_vol.rename(columns={'std':'daily_vol'}, inplace=True)
df_grouped_vol

,symbol,daily_vol
0,DIA,0.007733
1,IWM,0.014032
2,QQQ,0.006832
3,SPY,0.007152


What we have calculated is a daily volatility, but when practitioners talk about volatility, they typically annualize it.  A daily volatility is annualized by multiplying by $\sqrt{252}$, where 252 is the approximate number of trading days in a year.

In [12]:
df_grouped_vol['ann_vol'] = df_grouped_vol['daily_vol'] * np.sqrt(252)
df_grouped_vol

,symbol,daily_vol,ann_vol
0,DIA,0.007733,0.122752
1,IWM,0.014032,0.222744
2,QQQ,0.006832,0.108455
3,SPY,0.007152,0.113542
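
One detail worth knowing is that the `pandas` standard deviation is the sample standard deviation (`ddof=1`), whereas `np.std()` called directly on an array defaults to `ddof=0`. The sketch below, which uses hypothetical tickers and returns, cross-checks the grouped result against NumPy and then annualizes it under the 252-trading-day assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns for two made-up tickers.
df = pd.DataFrame({
    "symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "ret": [0.01, -0.02, 0.015, 0.005, 0.02, -0.03, 0.01, -0.01],
})

# The "std" aggregation is the sample standard deviation (ddof=1).
daily_vol = df.groupby("symbol")["ret"].agg("std")

# Annualize under the assumption of 252 trading days per year.
ann_vol = daily_vol * np.sqrt(252)

# Cross-check one group against NumPy with ddof=1 set explicitly.
aaa = df.loc[df["symbol"] == "AAA", "ret"].to_numpy()
assert np.isclose(daily_vol["AAA"], np.std(aaa, ddof=1))
print(ann_vol)
```

The $\sqrt{252}$ scaling rests on the simplifying assumption that daily returns are uncorrelated, so that variances add across days.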


**Code Challenge:** Use `.groupby()` and `.agg()` to calculate the average daily return for each of the ETFs.

In [13]:
(
df_etf
    .groupby(['symbol'])[['ret']].agg('mean')
    .reset_index()
    .rename(columns={'ret':'daily_avg_ret'})
)

Attributes,symbol,daily_avg_ret
0,DIA,0.00065
1,IWM,-0.001665
2,QQQ,0.001366
3,SPY,0.001174
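
Several statistics can also be computed in a single `.agg()` call by pairing each output name with a `(column, function)` tuple. The following is a minimal sketch on hypothetical tickers and returns; the output column names simply mirror the ones used in this tutorial.

```python
import pandas as pd

# Hypothetical daily returns for two made-up tickers.
df = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "ret": [0.01, -0.01, 0.03, 0.02, 0.00, -0.02],
})

# Each keyword names an output column; each tuple is (input column, function).
summary = (
    df.groupby("symbol")
    .agg(
        daily_avg_ret=("ret", "mean"),
        daily_vol=("ret", "std"),
    )
    .reset_index()
)
print(summary)
```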


### Related Reading

*Python Data Science Handbook (VanderPlas)* - 3.8 - Aggregation and Grouping

*Python for Data Analysis (McKinney)* - Chapter 9 (pp 251-274) Data Aggregation and Group Operations

*Options, Futures, and Other Derivatives (Hull)* - Chapter 15 (pp 325-329) The Black-Scholes-Merton Model