<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Lecture-overview" data-toc-modified-id="Lecture-overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Lecture overview</a></span></li><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Grouping-your-data:-the-.groupby()-function" data-toc-modified-id="Grouping-your-data:-the-.groupby()-function-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Grouping your data: the <code>.groupby()</code> function</a></span></li><li><span><a href="#The-.apply()-and-.transform()-methods" data-toc-modified-id="The-.apply()-and-.transform()-methods-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The <code>.apply()</code> and <code>.transform()</code> methods</a></span></li><li><span><a href="#Winsorizing-outliers" data-toc-modified-id="Winsorizing-outliers-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Winsorizing outliers</a></span></li></ul></div>

# Lecture overview

In this lecture we introduce a set of Pandas functions that are very useful for describing subsamples of your data (this is often called "subsample analysis"). Looking at subsamples individually is important because patterns that show up in your overall dataset may look quite different, or even reverse, once you restrict attention to a subset of the data. Simpson's Paradox is the classic example: https://en.wikipedia.org/wiki/Simpson%27s_paradox.

We finish the lecture with a discussion of the impact of outliers on your descriptive statistics, and a method of mitigating that impact called "winsorization".

# Preliminaries

In [None]:
import pandas as pd
import numpy as np
import pandas_datareader as pdr
pd.options.display.max_rows = 20

We'll use data on the Fama-French 5-industry portfolio returns for this lecture:

In [None]:
raw = pdr.DataReader(name = '5_Industry_Portfolios', data_source = 'famafrench', 
                     start = '2011-01-01', end = '2020-12-31')
raw

Extract equal-weighted *annual* industry returns, and turn them to decimal (they are in percentage points):

Let's take a look at the data:

Calculate cumulative products of gross returns (i.e. compound returns over time) and plot them:
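A minimal sketch of the compounding step, using a tiny made-up table of annual returns in place of the Fama-French data. The name `ew`, the two industries and all the numbers are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical annual returns in decimal form (NOT the Fama-French data)
ew = pd.DataFrame(
    {"Cnsmr": [0.10, 0.20], "Manuf": [0.30, -0.10]},
    index=pd.Index(["2019", "2020"], name="Date"),
)

# Gross returns are 1 + r; their running product is the compound gross return
cumret = (1 + ew).cumprod()
print(cumret)

# cumret.plot() would then draw one line per industry (requires matplotlib)
```

Each row of `cumret` shows the value of one unit of currency invested at the start of the sample.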

Stack industry returns on top of each other for the purpose of this class:

And bring date and industry names as data inside the dataframe:
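The stacking and index-reset steps might look like the sketch below. It runs on a small made-up `ew` table rather than the real data, and the column names "Date", "Industry" and "ewret" are assumed from how they are used later in the lecture:

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per industry
ew = pd.DataFrame(
    {"Cnsmr": [0.10, 0.20], "Manuf": [0.30, -0.10]},
    index=pd.Index(["2019", "2020"], name="Date"),
)

# .stack() moves the column labels into a second index level (a long Series)
stacked = ew.stack().rename("ewret").rename_axis(["Date", "Industry"])

# .reset_index() turns both index levels into ordinary columns
ew_long = stacked.reset_index()
print(ew_long)
```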

**Challenge:**

Do the same for value-weighted annual returns (i.e. create a "vw_long" dataframe, using the same steps we used for "ew_long"):

Merge the EW returns and VW returns into a single dataframe called "ireturns":
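One way to do the merge, sketched on two tiny made-up long tables (the values are hypothetical; only the column names follow the lecture):

```python
import pandas as pd

# Hypothetical long-format tables with matching keys
ew_long = pd.DataFrame(
    {"Date": ["2019", "2020"], "Industry": ["Cnsmr", "Cnsmr"], "ewret": [0.10, 0.20]}
)
vw_long = pd.DataFrame(
    {"Date": ["2019", "2020"], "Industry": ["Cnsmr", "Cnsmr"], "vwret": [0.05, 0.15]}
)

# Match rows on the (Date, Industry) pair and keep only pairs found in both tables
ireturns = ew_long.merge(vw_long, how="inner", on=["Date", "Industry"])
print(ireturns)
```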

# Grouping your data: the ``.groupby()`` function

The ``.groupby()`` function can be used to tell Python that you want to split your data into groups. The parameters of the ``.groupby()`` function tell Python *how* those groups should be created. The purpose is usually to apply some function (e.g. the ``.mean()`` function) to each of these groups separately.

Abbreviated syntax:
```python
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, dropna=True)
```

The most important parameter is ``by``. This is where you tell Python which column (or index) in your DataFrame contains the information based on which you want to group your data. Python will split your DataFrame into "mini" dataframes, one for each unique value of the variable(s) you supplied to the ``by`` parameter.
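To make the "mini" dataframes visible, you can loop over a grouped object. The sketch below uses a small made-up stand-in for ``ireturns`` (hypothetical numbers, two industries only):

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# Iterating over a groupby object yields (group label, mini dataframe) pairs
group_names = []
for name, mini_df in ireturns.groupby("Industry"):
    group_names.append(name)
    print(name)
    print(mini_df)
```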

For example, the line below splits ``ireturns`` into 5 different dataframes, one for each unique entry found in the "Industry" column, and then applies the ``.mean()`` function for each of these 5 dataframes separately. Finally, these subsample means are all collected into a new dataframe ``ind_means``: 
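A sketch of such a line, run on a made-up two-industry stand-in for ``ireturns``. Passing ``numeric_only=True`` is an assumption made here so that the text "Date" column of the toy data is skipped:

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# One row of means per unique value of "Industry"
ind_means = ireturns.groupby("Industry").mean(numeric_only=True)
print(ind_means)
```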

If you don't want the ``by`` variable (i.e. "Industry" in the example above) to be the index of the resulting dataframe:
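This is what the ``as_index`` parameter is for. A sketch on the same kind of made-up data (hypothetical values):

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# as_index=False keeps "Industry" as a regular column and uses a 0, 1, ... index
ind_means = ireturns.groupby("Industry", as_index=False).mean(numeric_only=True)
print(ind_means)
```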

Another example, with a different ``by`` variable and a different function applied to each group (i.e. median instead of mean):
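For instance, grouping by "Date" gives the cross-industry median return in each year. The data below is made up, and ``numeric_only=True`` is again an assumption that skips the text "Industry" column:

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# One row per year, holding the median across industries
year_medians = ireturns.groupby("Date").median(numeric_only=True)
print(year_medians)
```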

You can group by more than one variable:
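To do so, pass a list of column names to ``by``. A sketch on made-up data (hypothetical values), producing a dataframe called ``twodim``:

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# One group per unique (Date, Industry) pair; the result has a two-level index
twodim = ireturns.groupby(["Date", "Industry"]).mean()
print(twodim)
```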

The example above did not really change the ``ireturns`` dataframe, since each "Date" x "Industry" pair has a single entry for both "ewret" and "vwret". Since the mean of a single number is the number itself, the ``twodim`` dataframe will contain the same values as ``ireturns``. Note that this would not necessarily be the case if we used a different function instead of ``.mean()``, for example ``.count()``:
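With ``.count()``, each cell reports how many non-missing observations fall in the group, so the made-up data below (hypothetical values) yields a table of ones rather than returns:

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# Number of non-missing entries per (Date, Industry) pair
counts = ireturns.groupby(["Date", "Industry"]).count()
print(counts)
```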

You can specify which variable(s) you want to apply the function to, in brackets, right before the function name (if you leave this out (like above), the function will be applied to all the columns in the dataframe):
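A sketch of both forms on made-up data (hypothetical values): single brackets with one column name give a Series, while a list of names gives a DataFrame.

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# Mean of a single column per industry (returns a Series)
ew_means = ireturns.groupby("Industry")["ewret"].mean()

# Mean of several columns per industry (returns a DataFrame)
both_means = ireturns.groupby("Industry")[["ewret", "vwret"]].mean()
print(ew_means)
print(both_means)
```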

# The ``.apply()`` and ``.transform()`` methods

The ``.apply()`` and ``.transform()`` methods do similar things: they tell Python to apply a given function to some data from a dataframe. As the examples above show, many Pandas functions, like ``.mean()`` and ``.median()``, can do this without the help of ``.apply()`` or ``.transform()`` (we just add the name of the function after the ``.groupby()`` statement, as we did above). But what if the function we want to apply is not a built-in Pandas function that can be called with a dot after the name of a dataframe? This is where ``.apply()`` and ``.transform()`` come in handy. These methods are especially useful when we want to apply a particular function, separately, to each group we created with a ``.groupby()`` statement.

Here is their syntax:

Syntax for ``.transform()``:
```python
DataFrame.transform(func, axis=0, *args, **kwargs)
```
Syntax for ``.apply()``:
```python
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
```

The most important argument is ``func`` which is where we tell Python which function we want to apply to the data. 

The main difference between ``.transform()`` and ``.apply()`` shows up when they are used after ``.groupby()``. ``.transform()`` returns a result of the same length as the dataframe to which it is applied, with each group's result repeated on every row of that group. ``.apply()``, when given a function that reduces each group to a single value, returns a DataFrame or Series with one row per group.
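The difference in output length can be seen in the sketch below, which applies a median to each industry of a made-up stand-in for ``ireturns`` (hypothetical values):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

grouped = ireturns.groupby("Industry")["ewret"]

# .apply() with a reducing function: one value per group
per_group = grouped.apply(np.median)

# .transform(): one value per original row (group result repeated within group)
per_row = grouped.transform(lambda s: s.median())

print(len(per_group), len(per_row))
```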

We usually add the results of ``.transform()`` as a new column to the same dataframe:
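Because the output lines up row by row with the original data, it can be assigned directly. In this sketch the data is made up and the new column name ``ew_ind_median`` is a hypothetical choice:

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# Each row receives the median "ewret" of its own industry
ireturns["ew_ind_median"] = (
    ireturns.groupby("Industry")["ewret"].transform(lambda s: s.median())
)
print(ireturns)
```

A column like this makes it easy to compare each observation with its group benchmark, for example through ``ireturns["ewret"] - ireturns["ew_ind_median"]``.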

Note, also, that with ``.transform()`` you can pass the name of the function you want as a **string** to the ``func`` argument, whereas with ``.apply()`` this is not guaranteed to work (whether it does depends on your version of Pandas):
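A sketch of the string form with ``.transform()``, on made-up data (hypothetical values):

```python
import pandas as pd

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# The string "median" is resolved to the groupby median method
ind_medians = ireturns.groupby("Industry")[["ewret", "vwret"]].transform("median")
print(ind_medians)
```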

The line below, by contrast, may not work, depending on your version of Pandas. The safe choice is to pass the function itself, specifying which package it belongs to (which is why we used ``.apply(np.median)`` above):

In [None]:
#ireturns.groupby('Industry')[['ewret','vwret']].apply('median') #this gives an error

We are not restricted to applying functions that come with a package that we have installed. We can also use a function that we created ourselves.

For example, below, we create a function that can take in a Series or a DataFrame of returns, and compounds them:
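One possible version of such a function is sketched here. The name ``compound`` and the sample numbers are hypothetical:

```python
import pandas as pd

def compound(x):
    """Compound simple returns: product of (1 + r), minus 1.

    Works on a Series (returns a number) or on a DataFrame
    (returns one compounded value per column).
    """
    return (1 + x).prod() - 1

# Quick check on two made-up annual returns of 10% and 20%
r = pd.Series([0.10, 0.20])
total = compound(r)
print(total)
```

Compounding 10% and 20% gives 1.1 x 1.2 - 1, i.e. roughly 32%, which is more than the 30% you would get by simply adding the two returns.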

Now we can apply that function to the returns of each industry:
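A sketch of the group-wise call, on a made-up stand-in for ``ireturns``. The helper ``compound`` is a hypothetical name and is defined again here so that the block runs on its own:

```python
import pandas as pd

def compound(x):
    """Compound simple returns: product of (1 + r), minus 1."""
    return (1 + x).prod() - 1

# Made-up stand-in for `ireturns` (NOT the Fama-French data)
ireturns = pd.DataFrame({
    "Date":     ["2019", "2020", "2019", "2020"],
    "Industry": ["Cnsmr", "Cnsmr", "Manuf", "Manuf"],
    "ewret":    [0.10, 0.20, 0.30, -0.10],
    "vwret":    [0.05, 0.15, 0.25, -0.05],
})

# compound() receives each industry's mini dataframe and returns one value per column
ind_compound = ireturns.groupby("Industry")[["ewret", "vwret"]].apply(compound)
print(ind_compound)
```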

Let's see if it worked:

# Winsorizing outliers 

"Winsorizing" a variable means replacing its most extreme values with less extreme values. For example, winsorizing a variable "at the 5 and 95 percentiles", means that the values of that variable that are smaller than the 5th percentile will be made equal to the 5th percentile and the values that are larger than the 95th percentile will be made equal to the 95th percentile.

You can pick other values for the percentiles at which you want to winsorize, but (1, 99) and (5, 95) are by far the most common ones.

To winsorize a variable in a Pandas dataframe, we use the ``.clip()`` function as below. This also requires us to use the ``.quantile()`` function to calculate the 5th and 95th percentiles. First, let's sort the returns so we can easily see their most extreme values (top and bottom):
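A sketch of the sorting step on a few made-up returns (hypothetical values, with one very low and one very high observation):

```python
import pandas as pd

# Made-up returns with two extreme values (NOT the Fama-French data)
ireturns = pd.DataFrame({"ewret": [0.05, 0.90, 0.02, -0.60, 0.08, 0.03]})

# Ascending sort: the most negative returns come first, the largest last
sorted_returns = ireturns.sort_values("ewret")
print(sorted_returns)
```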

Let's calculate the 5th and 95th percentiles:
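A sketch using ``.quantile()`` on 21 made-up returns (hypothetical values). The names ``q5`` and ``q95`` are chosen here for illustration:

```python
import pandas as pd

# 21 made-up returns: one large loss, 0.01 through 0.19, and one large gain
ireturns = pd.DataFrame(
    {"ewret": [-0.60] + [k / 100 for k in range(1, 20)] + [0.90]}
)

# .quantile() takes a fraction between 0 and 1, so 0.05 is the 5th percentile
q5 = ireturns["ewret"].quantile(0.05)
q95 = ireturns["ewret"].quantile(0.95)
print(q5, q95)
```

With 21 observations, the 5th and 95th percentiles fall on the second-smallest and second-largest values, so only the two extremes lie outside them.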

And now let's create a version of ``ewret`` that is winsorized at the 5 and 95 percentiles:
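A sketch of the clipping step on 21 made-up returns (hypothetical values). The new column name ``ewret_w`` is an assumption, and the percentiles are recomputed so that the block runs on its own:

```python
import pandas as pd

# 21 made-up returns: one large loss, 0.01 through 0.19, and one large gain
ireturns = pd.DataFrame(
    {"ewret": [-0.60] + [k / 100 for k in range(1, 20)] + [0.90]}
)

q5 = ireturns["ewret"].quantile(0.05)
q95 = ireturns["ewret"].quantile(0.95)

# Values below q5 are raised to q5; values above q95 are lowered to q95
ireturns["ewret_w"] = ireturns["ewret"].clip(lower=q5, upper=q95)

# Compare the extremes before and after winsorizing
print(ireturns[["ewret", "ewret_w"]].describe())
```

Winsorizing keeps every observation in the sample: the number of rows is unchanged, and only the values outside the two percentiles are altered.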

Let's see if it worked: