<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Vizualization-tools-(the-.plot()-function)" data-toc-modified-id="Vizualization-tools-(the-.plot()-function)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Vizualization tools (the <code>.plot()</code> function)</a></span><ul class="toc-item"><li><span><a href="#Line-plots" data-toc-modified-id="Line-plots-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Line plots</a></span></li><li><span><a href="#Scatter-plots" data-toc-modified-id="Scatter-plots-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Scatter plots</a></span></li><li><span><a href="#Histograms" data-toc-modified-id="Histograms-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Histograms</a></span></li><li><span><a href="#Box-plots" data-toc-modified-id="Box-plots-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Box plots</a></span></li></ul></li><li><span><a href="#Single-variable-statistics" data-toc-modified-id="Single-variable-statistics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Single-variable statistics</a></span><ul class="toc-item"><li><span><a href="#The-.describe()-function" data-toc-modified-id="The-.describe()-function-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>The <code>.describe()</code> function</a></span></li><li><span><a href="#Calculating-individual-statistics" data-toc-modified-id="Calculating-individual-statistics-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Calculating individual statistics</a></span><ul class="toc-item"><li><span><a href="#.mean()" data-toc-modified-id=".mean()-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span><code>.mean()</code></a></span></li><li><span><a href="#.var()" data-toc-modified-id=".var()-3.2.2"><span 
class="toc-item-num">3.2.2&nbsp;&nbsp;</span><code>.var()</code></a></span></li><li><span><a href="#.std()" data-toc-modified-id=".std()-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span><code>.std()</code></a></span></li><li><span><a href="#.median()" data-toc-modified-id=".median()-3.2.4"><span class="toc-item-num">3.2.4&nbsp;&nbsp;</span><code>.median()</code></a></span></li></ul></li><li><span><a href="#Calculating-row-level-statistics" data-toc-modified-id="Calculating-row-level-statistics-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Calculating row-level statistics</a></span></li><li><span><a href="#Creating-your-own-list-of-summary-statistics-with-the-.agg()-function" data-toc-modified-id="Creating-your-own-list-of-summary-statistics-with-the-.agg()-function-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Creating your own list of summary statistics with the <code>.agg()</code> function</a></span></li></ul></li><li><span><a href="#Two-variable-statistics" data-toc-modified-id="Two-variable-statistics-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Two-variable statistics</a></span><ul class="toc-item"><li><span><a href="#Covariance:-.cov()" data-toc-modified-id="Covariance:-.cov()-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Covariance: <code>.cov()</code></a></span></li><li><span><a href="#Correlation:-.corr()" data-toc-modified-id="Correlation:-.corr()-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Correlation: <code>.corr()</code></a></span></li><li><span><a href="#Autocorrelation:-.autocorr()" data-toc-modified-id="Autocorrelation:-.autocorr()-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Autocorrelation: <code>.autocorr()</code></a></span></li></ul></li><li><span><a href="#Rolling-statistics" data-toc-modified-id="Rolling-statistics-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Rolling statistics</a></span><ul class="toc-item"><li><span><a href="#Fixed-window-rolling-statistics" 
data-toc-modified-id="Fixed-window-rolling-statistics-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Fixed-window rolling statistics</a></span></li><li><span><a href="#Expanding-window-rolling-statistics" data-toc-modified-id="Expanding-window-rolling-statistics-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Expanding-window rolling statistics</a></span></li></ul></li></ul></div>

# Preliminaries

```python
import pandas as pd
import pandas_datareader as pdr
```

```python
# Download data on Fama-French three factors (we will use this data in all our examples)
ff3 = pdr.DataReader('F-F_Research_Data_Factors', 'famafrench',
                     '1970-01-01', '2020-12-31'
                    )[0]/100
ff3
```

```python
# Rename for convenience
ff3.rename(columns = {'Mkt-RF': 'MKT'}, inplace = True)
ff3.head(2)
```

# Vizualization tools (the ``.plot()`` function)

There are many different ways to visualize the data from a Pandas dataframe (e.g. the ``matplotlib`` and ``seaborn`` packages are very popular). However, for the purpose of this class, the ``.plot()`` function that comes with the Pandas package will be sufficient.

Below we work through some examples of the most common types of plots used for financial data: line plots, scatter plots, histograms, and box plots.

Abbreviated syntax:
```python
DataFrame.plot(kind = 'line', x = None, y = None, 
               title = None, xlabel = None, ylabel = None,
               legend = True, grid = False, layout = None, 
               sharex = True, sharey = False, figsize = None)
```

More detail on the ``.plot()`` function can be found here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

## Line plots

Note that, by default, ``plot()`` creates a "line" plot, using the index of the dataframe for the x axis (in our case, the Date):
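A minimal sketch of the default call. The dataframe built here is a synthetic stand-in for ``ff3`` (random numbers with the same column names, an assumption made only so the cell runs without a download); with the real data, skip the setup lines.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Default: one line per column, with the index (Date) on the x axis
ax = ff3.plot()
```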

You can specify which variables you want plotted by subsetting the overall dataframe first:
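For instance, the sketch below plots only two of the columns. The data is a synthetic stand-in for ``ff3`` (an assumption, so the cell runs on its own).

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Subset first (double brackets keep a dataframe), then plot
ax = ff3[['MKT', 'SMB']].plot()
```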

Below, we show more of the functionality of ``.plot()`` through a more involved example:
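One possible version of such an example is sketched here, using several of the parameters from the abbreviated syntax. The title and label text are our own illustrative choices, and the data is a synthetic stand-in for ``ff3`` (an assumption).

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Line plot of two columns with a title, axis labels, a grid and a custom size
ax = ff3[['MKT', 'RF']].plot(kind='line',
                             title='Monthly returns',    # illustrative text
                             xlabel='Date',
                             ylabel='Return',
                             legend=True, grid=True,
                             figsize=(10, 4))
```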

## Scatter plots
To create a scatter plot, we need to change the ``kind`` parameter to "scatter" and also specify what is on the x axis and what is on the y axis:
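A short sketch, with 'MKT' on the x axis and 'SMB' on the y axis (the choice of columns is ours, and the data is a synthetic stand-in for ``ff3``):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# x and y must be column names from the dataframe
ax = ff3.plot(kind='scatter', x='MKT', y='SMB')
```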

## Histograms
For a histogram, we use ``kind='hist'`` and then use ``subplots=True`` to specify that we want each variable to have its own histogram, in a separate subplot:
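A minimal sketch on a synthetic stand-in for ``ff3`` (an assumption); the ``bins`` value is an illustrative choice:

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# One histogram per column, stacked vertically by default
axes = ff3.plot(kind='hist', subplots=True, bins=20)
```

With ``subplots=True``, the call returns an array of axes (one per column) rather than a single one.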

We can change the position of the subplots relative to each other using the ``layout`` parameter:
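For example, the sketch below arranges the four histograms in a 2-by-2 grid. The data is again a synthetic stand-in for ``ff3`` (an assumption), and the figure size is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# layout = (number of rows, number of columns) in the grid of subplots
axes = ff3.plot(kind='hist', subplots=True, layout=(2, 2), figsize=(8, 6))
```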

We can create a continuous approximation of the histogram using ``kind='density'``:

## Box plots

For box plots, we use ``kind='box'``:
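A minimal sketch on a synthetic stand-in for ``ff3`` (an assumption); it draws one box per column:

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# One box-and-whisker per column, all on the same axes
ax = ff3.plot(kind='box')
```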

# Single-variable statistics

We start by looking at statistics that describe a single variable (as opposed to the relationship between two variables). 

Since our data will almost always be in a Pandas dataframe, we will use Pandas methods to calculate sample statistics, but many other packages can also do this (e.g. the ``numpy`` package calculates descriptive statistics for data stored in a Numpy array).

## The ``.describe()`` function

We can use the ``.describe()`` function to get some standard descriptive statistics for the entire dataset.

Syntax:
```python
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
```

The default is for ``.describe()`` to produce summary statistics only for numerical data types in the dataframe. You can change this with the ``include`` and ``exclude`` parameters. 

The ``percentiles`` parameter allows you to specify which percentiles you want ``.describe()`` to calculate (default is 25th, 50th and 75th percentiles). For example, below, we only ask for the 50th percentile (the median):
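A sketch of that call, run on a synthetic stand-in for ``ff3`` (random numbers with the same column names, an assumption so the cell runs without a download):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Percentiles are given as fractions between 0 and 1
summary = ff3.describe(percentiles=[0.5])
print(summary)
```

The rows of the resulting table are ``count``, ``mean``, ``std``, ``min``, ``50%`` and ``max``.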

## Calculating individual statistics

Each individual statistic produced by ``.describe()`` has its own function that can be applied either to the entire dataframe or to subsets of it. Below I only show examples for mean, variance, standard deviation and median (but you can also use ``.count()``, ``.min()``, ``.max()``, ``.sum()`` and many others).
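The sketch below applies the four functions to the whole dataframe, to a single column, and to a subset of columns. The data is a synthetic stand-in for ``ff3`` (an assumption), and the variable names are our own.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

col_means = ff3.mean()                        # one mean per column (a Series)
mkt_var = ff3['MKT'].var()                    # sample variance of one column
mkt_std = ff3['MKT'].std()                    # sample standard deviation
two_medians = ff3[['SMB', 'HML']].median()    # medians for a subset of columns
print(col_means, mkt_var, mkt_std, two_medians, sep='\n')
```

Both ``.var()`` and ``.std()`` use the sample (n - 1) denominator by default, so the standard deviation is the square root of the variance.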

### ``.mean()``

### ``.var()``

### ``.std()``

### ``.median()``

**Challenge:**

Note that the output of ``.describe()`` is also a dataframe. So we can use ``.loc[]`` to access specific numbers in that output table.

Use the space below to calculate and print the interquartile range (IQR = percentile 75 minus percentile 25) for the 'MKT' variable:

## Calculating row-level statistics

All statistical functions in Pandas (e.g. ``.mean()``, ``.median()``, etc.) have an ``axis`` argument that allows you to specify whether you want that statistic to be calculated column-wise (``axis=0``, the default) or row-wise (``axis=1``).

For example, if we want to know, each month, which of the columns in ``ff3`` had the highest return, we would use:
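A sketch on a synthetic stand-in for ``ff3`` (an assumption). Note that ``.max(axis=1)`` returns the highest return itself, while ``.idxmax(axis=1)`` returns the name of the column that achieved it; we show both since either may be what is needed.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

highest_return = ff3.max(axis=1)     # value of the largest return in each row
best_column = ff3.idxmax(axis=1)     # label of the column holding that value
print(highest_return.head())
print(best_column.head())
```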

As usual, we can also calculate row-wise statistics using only a subset of the columns:
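For instance, the sketch below averages only the 'SMB' and 'HML' columns in each row (the choice of columns is ours, and the data is a synthetic stand-in for ``ff3``):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Row-wise mean across the two selected columns
avg_smb_hml = ff3[['SMB', 'HML']].mean(axis=1)
print(avg_smb_hml.head())
```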

## Creating your own list of summary statistics with the ``.agg()`` function

If we want a different selection of summary statistics than the one offered by the ``.describe()`` function, we can use the ``.agg()`` function to specify exactly which statistics we want:

Syntax:
```python
DataFrame.agg(func=None, axis=0, *args, **kwargs)
```

If you want the same stats for all variables, just provide a list of the names of the functions you want to be used (e.g. use 'mean' for the ``.mean()`` function, 'std' for the ``.std()`` function etc.).
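A short sketch, using a synthetic stand-in for ``ff3`` (an assumption) and an illustrative choice of four statistics:

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Same statistics for every column: one row per statistic in the output
stats = ff3.agg(['mean', 'std', 'min', 'max'])
print(stats)
```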

You can also specify different functions (stats) for each variable:
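This is done by passing a dictionary that maps column names to lists of function names. The sketch below uses a synthetic stand-in for ``ff3`` (an assumption) and an illustrative choice of statistics.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Keys are column names, values are the statistics wanted for that column
stats = ff3.agg({'MKT': ['mean', 'std'], 'SMB': ['median']})
print(stats)
```

Statistics that were not requested for a given column show up as ``NaN`` in the output table.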

**Challenge:**

Create a table that shows just the mean and standard deviation for the SMB and HML variables

# Two-variable statistics

These are statistics that describe the relation between two variables. The most commonly used ones are the **covariance** and the **correlation**. Both of these try to quantify the strength of the **linear** relation between the two variables. The main difference between them is that the correlation coefficient is bounded between -1 and 1, so it is easier to interpret.

*If two variables are tightly related to each other, but not in a linear fashion (e.g. $Y = X^4$), the covariance and correlation will **underestimate** the strength of that relation.*
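To illustrate this point, the sketch below uses made-up data (not ``ff3``): ``y`` is an exact function of ``x``, yet when ``x`` is spread symmetrically around zero the correlation is essentially zero. Restricting ``x`` to positive values gives a high correlation that still falls short of 1.

```python
import numpy as np
import pandas as pd

# Made-up data: x is an evenly spaced grid on [-1, 1], y is exactly x^4
x = pd.Series(np.linspace(-1, 1, 201))
y = x ** 4

corr_symmetric = x.corr(y)                 # essentially zero
corr_positive = x[x > 0].corr(y[x > 0])    # high, but below 1
print(corr_symmetric, corr_positive)
```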

## Covariance: ``.cov()``

The ``cov()`` function produces a covariance matrix for the variables (columns) in the dataframe. The numbers on the diagonal are actually variances. Each number on the off-diagonal is the covariance between the two variables specified in the column/row headers.
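A minimal sketch on a synthetic stand-in for ``ff3`` (an assumption, so the cell runs without a download):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# 4 x 4 covariance matrix, with variances on the diagonal
cov_matrix = ff3.cov()
print(cov_matrix)
```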

The output table above is a dataframe, so we can access individual numbers in it using the ``.loc[]`` operator. 

For example, below, we extract the covariance between the 'MKT' and 'SMB' variables:

Remember, if you want to use these estimates later on, you need to store them as new variables:
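The sketch below extracts two entries of the covariance matrix with ``.loc[]`` and stores them. The variable names are our own, and the data is a synthetic stand-in for ``ff3`` (an assumption).

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

cov_matrix = ff3.cov()
cov_mkt_smb = cov_matrix.loc['MKT', 'SMB']   # off-diagonal entry: a covariance
var_mkt = cov_matrix.loc['MKT', 'MKT']       # diagonal entry: a variance
print(cov_mkt_smb, var_mkt)
```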

## Correlation: ``.corr()``

Just like with covariance, we can calculate a correlation matrix for the entire dataset:
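A minimal sketch on a synthetic stand-in for ``ff3`` (an assumption):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# 4 x 4 correlation matrix; every diagonal entry equals 1
corr_matrix = ff3.corr()
print(corr_matrix)
```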

Or we can extract the correlation of a particular pair of variables in your dataset:
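Two equivalent ways of doing this are sketched below, on a synthetic stand-in for ``ff3`` (an assumption): reading one entry from the correlation matrix, or calling ``.corr()`` on one column with the other column as its argument.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

corr_from_matrix = ff3.corr().loc['MKT', 'SMB']
corr_from_series = ff3['MKT'].corr(ff3['SMB'])
print(corr_from_matrix, corr_from_series)
```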

**Challenge:**

Calculate the correlation between 'MKT' returns in the current month and the SMB return from 12 months ago.

## Autocorrelation: ``.autocorr()``

The autocorrelation of a variable is the correlation between its current value and a value from the past. So there is no single autocorrelation for a given variable: there is one autocorrelation for every "lag" between the current value and the value from the past. For example, below, we calculate the "1-month autocorrelation" and "12-month autocorrelation" for the market portfolio returns:
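A sketch of the two calls, on a synthetic stand-in for ``ff3`` (an assumption; the random numbers have no real autocorrelation structure, so the values themselves are not meaningful):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# lag is measured in rows, which are months in this dataset
ac_1m = ff3['MKT'].autocorr(lag=1)
ac_12m = ff3['MKT'].autocorr(lag=12)
print(ac_1m, ac_12m)
```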

And below we verify that the autocorrelation is nothing but the correlation between the current value and a lagged value:
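One way to run this check is sketched below, on a synthetic stand-in for ``ff3`` (an assumption): we lag the series with ``.shift()`` and correlate it with the original.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

ac_12m = ff3['MKT'].autocorr(lag=12)
# .shift(12) moves each value 12 rows down, so row t holds the value from t - 12
manual_12m = ff3['MKT'].corr(ff3['MKT'].shift(12))
print(ac_12m, manual_12m)
```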

# Rolling statistics

These are statistics that are re-calculated at each point in time, using either 
- a fixed number of data points from the past 
    - these are called "fixed window" rolling statistics
    - can be calculated with the "rolling" Pandas function
- all the data from the past (expanding window) 
    - these are called "expanding window" rolling statistics
    - can be calculated with the "expanding" Pandas function

Both the "expanding" and the "rolling" functions should be followed by the name of the statistic that you want to calculate.

## Fixed-window rolling statistics 

We use the ``.rolling()`` function to calculate summary statistics at each point in time "t" using only the "w" most recent observations, i.e. those from "t - w + 1" to "t", where "w" is referred to as the "window" length.

Syntax:
```python
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')
```

As an example, below, we calculate 60-month rolling means (i.e. "w" is 60) for all the variables in ``ff3``:
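A minimal sketch on a synthetic stand-in for ``ff3`` (an assumption, so the cell runs without a download):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Each row holds the mean of the current month and the 59 months before it
roll_means = ff3.rolling(60).mean()
print(roll_means.tail())
```

The first 59 rows of the result are ``NaN``, since a full 60-month window is not yet available at those dates.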

We can calculate rolling versions for all summary statistics that the pandas package knows how to calculate. For example, below, we calculate the rolling, 36-month standard deviations of market returns, and we plot these over time:
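A sketch of that calculation, on a synthetic stand-in for ``ff3`` (an assumption); the plot title is an illustrative choice:

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Rolling 36-month standard deviation of market returns
roll_vol = ff3['MKT'].rolling(36).std()
ax = roll_vol.plot(title='36-month rolling volatility of MKT')
```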

We can even calculate rolling versions of two-variable summary statistics (like correlation and covariance). However, we have to remember that ``.corr()`` and ``.cov()`` produce matrices not single numbers. So if we want rolling correlations between, say, market returns and the risk-free rate, the cell below will produce a correlation matrix at each point in time:
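A sketch of that call, on a synthetic stand-in for ``ff3`` (an assumption). The output stacks one 2-by-2 correlation matrix per date, so it has two rows for every month in the sample.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# One correlation matrix per date, indexed by (date, variable)
roll_corr_matrix = ff3[['MKT', 'RF']].rolling(60).corr()
print(roll_corr_matrix.tail(4))
```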

Instead, we need to supply one of the variables as a parameter to the ``.corr()`` function:
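A sketch of this version, on a synthetic stand-in for ``ff3`` (an assumption). The result is a single series, with one correlation per date.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Rolling 60-month correlation between MKT and RF
roll_corr = ff3['MKT'].rolling(60).corr(ff3['RF'])
print(roll_corr.tail())
```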

## Expanding-window rolling statistics

With expanding-window summary statistics, at each point in time, we use all the available data up to that point to calculate the statistic. We use the ``.expanding()`` function for this purpose, which also gives us the option to specify that we want to calculate the statistic only if we have a minimum number of observations available at that point (see the ``min_periods`` parameter below):

Syntax:
```python
DataFrame.expanding(min_periods=1, center=None, axis=0, method='single')
```

Note that, if we don't supply a large enough ``min_periods``, in the beginning of the sample, the statistics will be calculated using a very low number of observations (starting with 1), so they will be quite volatile:
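A sketch with the default ``min_periods``, on a synthetic stand-in for ``ff3`` (an assumption). Chaining ``.plot()`` onto the result makes the early volatility easy to see.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Row t holds the mean of all observations from the start of the sample to t
exp_means = ff3.expanding().mean()
print(exp_means.head())
```

The first row is just the first observation, and the last row equals the full-sample mean.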

This looks a lot more stable if we make sure each statistic is calculated using at least 36 observations:
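A sketch of the same calculation with ``min_periods=36``, on a synthetic stand-in for ``ff3`` (an assumption):

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# No value is reported until at least 36 observations are available
exp_means_36 = ff3.expanding(min_periods=36).mean()
print(exp_means_36.iloc[33:38])
```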

As another example, let's look at the behavior of market volatility over time:
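A sketch of an expanding-window standard deviation of market returns, on a synthetic stand-in for ``ff3`` (an assumption); the 36-observation minimum is carried over from the previous example.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Standard deviation of MKT using all data up to each date
exp_vol = ff3['MKT'].expanding(min_periods=36).std()
print(exp_vol.tail())
```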

Finally, below, we see that the correlation between market returns and T-bill yields, while changing over time, is negative throughout (when we do not restrict ourselves to just the prior 60 observations):
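A sketch of the expanding-window correlation is given here. It runs on a synthetic stand-in for ``ff3`` (an assumption), so it shows only the mechanics of the call and will not reproduce the sign pattern described above; the 36-observation minimum is also our own choice.

```python
import numpy as np
import pandas as pd

# Assumption: random numbers standing in for the real ff3 data
rng = np.random.default_rng(0)
ff3 = pd.DataFrame(rng.normal(0.005, 0.04, size=(120, 4)),
                   index=pd.period_range('1970-01', periods=120, freq='M'),
                   columns=['MKT', 'SMB', 'HML', 'RF'])

# Correlation between MKT and RF using all data up to each date
exp_corr = ff3['MKT'].expanding(min_periods=36).corr(ff3['RF'])
print(exp_corr.tail())
```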