# `.groupby()` and `.agg()` - Part 1

The real power of data analysis with `DataFrames` comes into focus when we start utilizing the `.groupby()` and `.agg()` methods.  

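Using the two together is known as *grouping* and *aggregating*: the rows are split into groups, and each group is then reduced to a summary value.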

Talking about grouping in the abstract can be confusing; I think it's best to see grouping in action by doing meaningful calculations.

The purpose of this tutorial is to introduce grouping and aggregation by way of the following finance task: calculating monthly returns and volatilities for several ETFs.

### Loading Packages

Let's load the packages that we will need for this tutorial.

In [None]:
##> import numpy as np
##> import pandas as pd
##> import pandas_datareader as pdr




### Reading-In Data

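Our analysis will be on the set of July 2021 prices for `SPY`, `IWM`, `QQQ`, and `DIA`.  The request below starts on June 30 so that a return can be computed for the first trading day of July.

Let's read in that data with `pandas_datareader`.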

In [None]:
##> pd.options.display.max_rows = 25
##> df_etf = pdr.get_data_yahoo(['SPY', 'QQQ', 'IWM', 'DIA'], start='2021-06-30', end='2021-07-31')
##> df_etf = df_etf.round(2)
##> df_etf.head()




This data is not as tidy as we would like.  Let's use method chaining to perform a series of data munging operations.

In [None]:
##> df_etf = \
##>     (
##>     df_etf
##>         .stack(level='Symbols') #move the Symbols column level into the rows (wide to long)
##>         .reset_index() #turn date into a column 
##>         .sort_values(by=['Symbols', 'Date']) #sort
##>         .rename(columns={'Date':'date', 'Symbols':'symbol', 'Adj Close':'adj_close','Close':'close', 
##>                          'High':'high', 'Low':'low', 'Open':'open', 'Volume':'volume'}) #renaming columns
##>         [['date', 'symbol','open', 'high', 'low', 'close', 'volume', 'adj_close']] #reordering columns
##>     )
##> df_etf
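
Because the chain above depends on a live download, it can help to see the same reshaping on a tiny table. The following is a minimal sketch that assumes the downloaded data has two column levels named `Attributes` and `Symbols`; the tickers `AAA` and `BBB`, the prices, and the names `df_wide` and `df_long` are made up for illustration.

```python
import pandas as pd

# Made-up prices and volumes for two hypothetical tickers, laid out "wide":
# the columns are a two-level MultiIndex of (attribute, symbol).
dates = pd.to_datetime(['2021-07-01', '2021-07-02'])
columns = pd.MultiIndex.from_product(
    [['Close', 'Volume'], ['AAA', 'BBB']],
    names=['Attributes', 'Symbols'],
)
df_wide = pd.DataFrame(
    [[100.0, 50.0, 1000, 2000],
     [101.0, 49.0, 1100, 2100]],
    index=pd.Index(dates, name='Date'),
    columns=columns,
)

# Move the 'Symbols' column level into the row index, then turn the
# (Date, Symbols) index back into ordinary columns.
df_long = (
    df_wide
        .stack(level='Symbols')
        .reset_index()
        .sort_values(by=['Symbols', 'Date'])
        .rename(columns={'Date': 'date', 'Symbols': 'symbol',
                         'Close': 'close', 'Volume': 'volume'})
        [['date', 'symbol', 'close', 'volume']]
)
print(df_long)
```

The result has one row per (symbol, date) pair, which is the layout that the grouping steps in the rest of this tutorial rely on.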




### Calculating Daily Returns with `groupby()`

Our ultimate goal is to calculate monthly returns and monthly volatilities for each ETF in `df_etf`.  These quantities are both functions of daily returns.  So, our first order of business is to calculate daily returns. 

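In a previous tutorial we calculated daily returns in a simple vectorized fashion.  Unfortunately, we can't use the exact same approach here because there are multiple ETFs in the data set: a plain `pct_change()` on the `close` column would compute the first return of each ETF from the last price of the ETF stacked above it.

To overcome this challenge, we will use our first application of `.groupby()`.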

Here is the `.groupby()` code that calculates daily returns for each ETF.

In [None]:
##> # sorting values to get everything in the right order
##> df_etf.sort_values(['symbol', 'date'], inplace=True)
##> 
##> # vectorized return calculation
##> df_etf['ret'] = \
##>     df_etf['close'].groupby(df_etf['symbol']).pct_change()
##> df_etf.head()
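
To see why the grouping matters, here is a small sketch on made-up data (the tickers `AAA` and `BBB`, the prices, and the column `ret_naive` are hypothetical and not part of the ETF data set). It compares a plain `pct_change()` with the grouped version used above.

```python
import pandas as pd

# Made-up closing prices for two hypothetical tickers, sorted by symbol and date.
df_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'date': pd.to_datetime(['2021-07-01', '2021-07-02', '2021-07-06'] * 2),
    'close': [100.0, 110.0, 99.0, 50.0, 55.0, 66.0],
})

# Naive approach: treats the column as one long price series, so the first
# BBB row is compared against the last AAA price.
df_toy['ret_naive'] = df_toy['close'].pct_change()

# Grouped approach: pct_change() restarts within each symbol, so the first
# row of every symbol has no prior price and gets NaN.
df_toy['ret'] = df_toy['close'].groupby(df_toy['symbol']).pct_change()
print(df_toy)
```

In the naive column the first `BBB` row holds a spurious return of roughly -49%, computed from 99 to 50 across two different tickers. In the grouped column that row is `NaN`, as it should be.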




**Code Challenge:** If the `.groupby()` worked correctly, we should see a `NaN` value in the `ret` column for the first trade-date of each ETF.  Use `DataFrame.query()` to confirm this.

### Monthly Return for Each `symbol`

We'll now proceed to calculate monthly returns and volatilities for each of the ETFs in our data set.  This amounts to first grouping by `symbol`, and then aggregating the daily returns within each group.  

Let's start with monthly returns.  As a preliminary step we'll calculate the daily growth factor in a separate column.

In [None]:
##> df_etf['daily_factor'] = 1 + df_etf['ret']
##> df_etf.head()




Recall that the monthly growth factor is the product of the daily growth factors.  Here is a way to write all that logic in a single line using `.groupby()` and `.agg()`:

In [None]:
##> df_grouped_factor = \
##>     df_etf.groupby(['symbol'])['daily_factor'].agg(['prod']).reset_index()
##> df_grouped_factor
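
The product works because the daily factors telescope: $\prod_{t=1}^{T}(1+r_t) = \prod_{t=1}^{T} P_t/P_{t-1} = P_T/P_0$, where $P_t$ is the closing price on day $t$. The sketch below checks this on made-up prices for two hypothetical tickers; the names `df_toy`, `monthly_factor`, and `price_ratio` are for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for two hypothetical tickers.
df_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'close': [100.0, 110.0, 99.0, 50.0, 55.0, 66.0],
})

# Daily growth factor per symbol; the first row of each symbol is NaN.
df_toy['daily_factor'] = 1 + df_toy['close'].groupby(df_toy['symbol']).pct_change()

# Product of the daily factors per symbol (NaN values are skipped).
monthly_factor = df_toy.groupby('symbol')['daily_factor'].prod()

# Last price divided by first price per symbol.
grouped_close = df_toy.groupby('symbol')['close']
price_ratio = grouped_close.last() / grouped_close.first()

print(monthly_factor)
print(price_ratio)
print(np.allclose(monthly_factor, price_ratio))
```

Both calculations give the same growth factor per ticker, up to floating-point rounding.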




Notice that `pandas` isn't very sophisticated about the name that it gives to the column that stores the aggregation calculation.  It just gave it the name `prod`, which is the name of the function that was used in the aggregation calculation.  Let's make `df_grouped_factor` a bit more readable by renaming that column.

In [None]:
##> df_grouped_factor.rename(columns={'prod': 'monthly_factor'}, inplace=True)
##> df_grouped_factor
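
As an alternative to renaming afterwards, `pandas` also supports *named aggregation*, where the output column name is given as a keyword argument to `.agg()`. The following is a sketch on made-up data, not the approach used in this tutorial; `df_toy` and `df_named` are hypothetical names.

```python
import pandas as pd

# Made-up daily growth factors for two hypothetical tickers.
df_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'BBB', 'BBB'],
    'daily_factor': [1.10, 0.90, 1.10, 1.20],
})

# Named aggregation: new_column_name=(source_column, aggregation_function)
df_named = (
    df_toy
        .groupby('symbol')
        .agg(monthly_factor=('daily_factor', 'prod'))
        .reset_index()
)
print(df_named)
```

The aggregated column comes out already named `monthly_factor`, so no separate `.rename()` step is needed.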




And finally, recall that the monthly return is calculated by subtracting one from the monthly growth factor.

In [None]:
##> df_grouped_factor['monthly_return'] = df_grouped_factor['monthly_factor'] - 1
##> df_grouped_factor[['symbol', 'monthly_return']]




### Monthly Volatility for Each `symbol`

Now let's calculate the (realized/historical) volatility for each of the ETFs.

We once again use `.groupby()` and `.agg()` to do this all in a single line of code.

In [None]:
##> df_grouped_vol = \
##>     df_etf.groupby(['symbol'])['ret'].agg(['std']).reset_index()
##> 
##> df_grouped_vol




Again, let's rename our aggregation column to something more descriptive.

In [None]:
##> df_grouped_vol.rename(columns={'std':'daily_vol'}, inplace=True)
##> df_grouped_vol




What we have calculated is a daily volatility, but when practitioners talk about volatility, they typically annualize it.  A daily volatility is annualized by multiplying by $\sqrt{252}$, where 252 is the approximate number of trading days in a year.

In [None]:
##> df_grouped_vol['ann_vol'] = df_grouped_vol['daily_vol'] * np.sqrt(252)
##> df_grouped_vol
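
One detail worth knowing is that the grouped `std` in `pandas` is the sample standard deviation (it divides by $n-1$), whereas `np.std` on a plain array divides by $n$ unless `ddof=1` is passed. The sketch below illustrates this on made-up returns for two hypothetical tickers; `df_toy`, `df_vol`, and `manual_aaa` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up daily returns for two hypothetical tickers.
df_toy = pd.DataFrame({
    'symbol': ['AAA'] * 4 + ['BBB'] * 4,
    'ret': [0.01, -0.02, 0.015, 0.005, 0.02, -0.03, 0.01, -0.01],
})

# Daily volatility per symbol, named directly via named aggregation.
df_vol = (
    df_toy
        .groupby('symbol')['ret']
        .agg(daily_vol='std')
        .reset_index()
)

# Annualize assuming 252 trading days per year.
df_vol['ann_vol'] = df_vol['daily_vol'] * np.sqrt(252)

# Manual check for AAA using numpy with ddof=1 (sample standard deviation).
manual_aaa = np.std([0.01, -0.02, 0.015, 0.005], ddof=1)
print(df_vol)
print(np.isclose(df_vol.loc[0, 'daily_vol'], manual_aaa))
```

With only a month of data the difference between dividing by $n$ and $n-1$ is small but visible, so it is worth being consistent about which one is used.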




**Code Challenge:** Use `.groupby()` and `.agg()` to calculate the average daily return for each of the ETFs.

### Related Reading

*Python Data Science Handbook (VanderPlas)* - Section 3.8 - Aggregation and Grouping

*Python for Data Analysis (McKinney)* - Chapter 9 (pp 251-274) Data Aggregation and Grouping Operations

*Options, Futures, and Other Derivatives (Hull)* - Chapter 15 (pp 325-329) The Black-Scholes-Merton Model