# Tutorial 09 - Grouping and Aggregating - Part 1

The real power of data analysis with `DataFrames` comes into focus when we start using the `.groupby()` and `.agg()` methods.  This is known as *grouping* and *aggregating*.

Talking about grouping in the abstract can be confusing; I think it's best to see grouping in action by doing meaningful calculations.

The purpose of this tutorial is to introduce grouping and aggregation by way of the following finance task: calculating monthly returns and volatilities for several assets.

## Loading Packages

Let's load the packages that we will need for this tutorial.

In [None]:
##> import numpy as np
##> import pandas as pd




## Reading-In Data

Our analysis will use the set of December 2018 prices for `SPY`, `IWM`, `QQQ`, and `DIA`.

Let's read that data in from the CSV file.

In [None]:
##> df_etf = pd.read_csv("../data/index_etf_dec_2018.csv")
##> df_etf.head()




**Coding Challenge:** Use `DataFrame` masking to isolate all the `QQQ` rows.

## Calculating Daily Returns

The following code calculates the daily return for each day and each symbol.  It's similar to the code we used in a previous tutorial, except that there is some logic to deal with the fact that there are different symbols in the same data set.  For the purposes of this tutorial you can simply run the code.  If you are curious about it, I encourage you to analyze it on your own.

In [None]:
df_etf['return'] = np.nan
for ix in range(1, df_etf.shape[0]):
    
    # grabbing symbols from df_etf
    curr_sym = df_etf.at[ix, 'symbol']
    prev_sym = df_etf.at[ix-1, 'symbol']
    
    # grabbing prices from df_etf
    curr_px = df_etf.at[ix, 'close']
    prev_px = df_etf.at[ix-1, 'close']
    
    # calculating return
    if curr_sym == prev_sym:
        df_etf.at[ix, 'return'] = (curr_px / prev_px) - 1

df_etf.head()
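For readers who are curious, the same per-symbol logic can also be written without an explicit loop. The sketch below is an alternative, not the method used above; it runs on a small hypothetical data set (`df_toy` and its prices are made up for illustration) and assumes the rows are sorted by date within each symbol.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices (not the tutorial's CSV), sorted by date within symbol.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'SPY', 'QQQ', 'QQQ', 'QQQ'],
    'date': ['2018-11-30', '2018-12-03', '2018-12-04'] * 2,
    'close': [100.0, 102.0, 99.96, 50.0, 49.0, 49.98],
})

# Percentage change of close computed separately within each symbol.
# The first row of each symbol has no previous price, so its return is NaN.
df_toy['return'] = df_toy.groupby('symbol')['close'].pct_change()

print(df_toy)
```

Because the percentage change is taken within each group, a price from one symbol is never compared with a price from another, which is what the `curr_sym == prev_sym` check accomplishes in the loop.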

The first date in our data set is November 30, 2018.  As a simple sanity check, let's utilize `DataFrame` masking to make sure that all those returns are set as `NaN`, since the return is not defined on the first day.

In [None]:
##> df_etf[df_etf.date == '2018-11-30']



Our sanity check looks good.
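A check like this can also be expressed as assertions rather than inspected by eye. The following is a minimal sketch on hypothetical data (`df_toy` is made up, but it uses the same column names as `df_etf`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as df_etf.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'QQQ', 'QQQ'],
    'date': ['2018-11-30', '2018-12-03', '2018-11-30', '2018-12-03'],
    'close': [100.0, 101.0, 50.0, 49.5],
    'return': [np.nan, 0.01, np.nan, -0.01],
})

# Rows for the first date, selected with a boolean mask.
first_day = df_toy[df_toy['date'] == '2018-11-30']

# Every return on the first date should be missing...
assert first_day['return'].isna().all()

# ...and every return after the first date should be present.
later_days = df_toy[df_toy['date'] != '2018-11-30']
assert later_days['return'].notna().all()
```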

We'll now proceed to calculating monthly returns and volatility for each of the ETFs in our data set.  This amounts to first grouping on `symbol`, and then aggregating the `return` column.

Let's start with monthly return.

## Monthly Return for Each `symbol`

If you take a look at our data set, we now have daily returns for each day, for each of the four ETFs.  Given a bit of time, you could probably come up with a `for` loop to iterate through the `DataFrame` and produce the monthly returns for each ETF (in fact, you could modify the returns `for` loop from the previous section to do just that).  As a one-off solution, that would be fine.  But grouping and aggregating are such ubiquitous operations that it would be inconvenient to have to write a `for` loop every time.


Let's begin by calculating the daily growth factor in a separate column.

In [None]:
##> df_etf['growth_factor'] = 1 + df_etf['return']
##> df_etf.head()




Recall that the monthly growth factor is the product of the daily growth factors.  Here is a way to write all that logic in a single line using `.groupby()` and `.agg()`:

In [None]:
##> df_grouped_factor = \
##>     df_etf.groupby(['symbol'])['growth_factor'].agg(['prod']).reset_index()
##> df_grouped_factor
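To make the grouping step less abstract, here is a small sketch that computes the same kind of per-symbol product twice: once with a manual loop over masks, and once with `.groupby()`. The data in `df_toy` is hypothetical and only meant to illustrate the mechanics.

```python
import pandas as pd

# Hypothetical daily growth factors; the first day of each symbol is NaN,
# just as in df_etf, because no return is defined on that day.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'SPY', 'QQQ', 'QQQ', 'QQQ'],
    'growth_factor': [float('nan'), 1.02, 0.98, float('nan'), 0.98, 1.02],
})

# Manual version: mask each symbol, then multiply its growth factors.
manual = {}
for sym in df_toy['symbol'].unique():
    factors = df_toy[df_toy['symbol'] == sym]['growth_factor']
    manual[sym] = factors.prod()  # Series.prod() skips NaN by default

# Grouped version: split by symbol, apply the product, combine the results.
grouped = df_toy.groupby('symbol')['growth_factor'].prod()

print(manual)
print(grouped)
```

Both versions skip the missing first-day value, so the product is taken only over days that actually have a return.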




Notice that `pandas` isn't very sophisticated about the name it gives to the column that stores the aggregation result.  It just gave it the name `prod`, which is the name of the function that was used in the aggregation.  Let's make `df_grouped_factor` a bit more readable by renaming that column.

In [None]:
##> df_grouped_factor.rename(columns={'prod': 'monthly_factor'}, inplace=True)
##> df_grouped_factor
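As an aside, `pandas` also supports *named aggregation*, where the output column name is given as a keyword argument to `.agg()`, so no separate rename is needed. A minimal sketch on hypothetical data (`df_toy` and `df_named` are illustrative names, not part of the tutorial's data set):

```python
import pandas as pd

# Hypothetical daily growth factors for two symbols.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'QQQ', 'QQQ'],
    'growth_factor': [1.02, 0.98, 0.98, 1.02],
})

# The keyword (monthly_factor) becomes the name of the output column,
# and the value ('prod') is the aggregation to apply.
df_named = (
    df_toy.groupby('symbol')['growth_factor']
    .agg(monthly_factor='prod')
    .reset_index()
)

print(df_named.columns.tolist())  # ['symbol', 'monthly_factor']
```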




And finally, recall that the monthly return is obtained by subtracting one from the monthly growth factor.

In [None]:
##> df_grouped_factor['monthly_return'] = df_grouped_factor['monthly_factor'] - 1
##> df_grouped_factor[['symbol', 'monthly_return']]
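The calculation in this section can be summarized in symbols. The notation is introduced here only for illustration: let $r_1, \dots, r_n$ be the daily returns of one ETF during the month, let $G$ be its monthly growth factor, and let $R$ be its monthly return. Then

$$
G = \prod_{t=1}^{n} (1 + r_t), \qquad R = G - 1.
$$

For example, two daily returns of $2\%$ and $-2\%$ give $G = 1.02 \times 0.98 = 0.9996$, so $R = -0.0004$, or $-0.04\%$.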




## Monthly Volatility for Each `symbol`

Now let's calculate the (realized/historical) volatility for each of the ETFs.  Recall that volatility is the standard deviation of the daily returns.  If we were to do this in a brute force manner, we could use `DataFrame` masking to separate out the returns for each of the four ETFs, and then calculate four separate standard deviations.

However, once again with the power of `.groupby()` and `.agg()` we can do all of this with a single line of code.

In [None]:
##> df_grouped_vol = \
##>     df_etf.groupby(['symbol'])['return'].agg(['std']).reset_index()
##> 
##> df_grouped_vol
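For comparison, here is a sketch of the brute force approach next to the grouped one, using hypothetical returns (`df_toy` is made up for illustration). Both rely on the `pandas` `.std()` method, which computes the sample standard deviation (`ddof=1`) and skips `NaN` values by default; note that NumPy's `np.std` applied to a plain array defaults to `ddof=0` instead.

```python
import pandas as pd

# Hypothetical daily returns; the first day of each symbol is NaN.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'SPY', 'QQQ', 'QQQ', 'QQQ'],
    'return': [float('nan'), 0.01, -0.01, float('nan'), 0.02, -0.02],
})

# Brute force: one mask and one standard deviation per symbol.
spy_vol = df_toy[df_toy['symbol'] == 'SPY']['return'].std()
qqq_vol = df_toy[df_toy['symbol'] == 'QQQ']['return'].std()

# Grouped: every symbol handled in a single expression.
grouped_vol = df_toy.groupby('symbol')['return'].std()

print(spy_vol, qqq_vol)
print(grouped_vol)
```

With four symbols the brute force version needs four masks; the grouped version stays the same length no matter how many symbols the data contains.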




Again, let's rename our aggregation column to something more descriptive.

In [None]:
##> df_grouped_vol.rename(columns={'std':'daily_vol'}, inplace=True)



What we have calculated is a daily volatility, but when practitioners talk about volatility, they typically annualize it.  A daily volatility is annualized by multiplying by $\sqrt{252}$, where 252 is the approximate number of trading days in a year.

In [None]:
##> df_grouped_vol['ann_vol'] = df_grouped_vol['daily_vol'] * np.sqrt(252)
##> df_grouped_vol
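The $\sqrt{252}$ factor can be motivated as follows. This is a sketch of the usual argument, and it rests on two assumptions: daily returns are uncorrelated with a common variance $\sigma_{\text{daily}}^2$, and there are about 252 trading days in a year. Under those assumptions variances add across days, so

$$
\sigma_{\text{ann}}^2 \approx 252 \, \sigma_{\text{daily}}^2
\quad \Longrightarrow \quad
\sigma_{\text{ann}} \approx \sqrt{252} \, \sigma_{\text{daily}}.
$$

For example, a daily volatility of $1\%$ corresponds to an annualized volatility of roughly $15.87\%$, since $\sqrt{252} \approx 15.87$.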




**Coding Challenge:** Use `.groupby()` and `.agg()` to calculate the average daily return for each of the ETFs.

## Related Reading

*Python Data Science Handbook (VanderPlas)* - 3.8 - Aggregation and Grouping

*Python for Data Analysis (McKinney)* - Chapter 9 (pp 251-274) Data Aggregation and Grouping Operations

*Options, Futures, and Other Derivatives (Hull)* - Chapter 15 (pp 325-329) The Black-Scholes-Merton Model