# MSDM5058 Tutorial 5 - Mean-Variance Analysis

## Contents
1. Mean, variance and correlation
2. Portfolio theory
3. Autoregressive model

---

# 1. Recap: Mean, variance and correlation

A quick refresher on the basics.

## 1.1. Population mean and population variance

The most common form of mean and variance. 

$$
\begin{align*}
\mu &= \langle X\rangle = \int_{-\infty}^\infty xf(x)\mathrm{d}x \\
\sigma^2 &= \langle (X-\mu)^2 \rangle = \int_{-\infty}^\infty (x-\mu)^2 f(x)\mathrm{d}x \\
&= \langle X^2\rangle - \mu^2 = \int_{-\infty}^\infty x^2 f(x)\mathrm{d}x - \mu^2
\end{align*}
$$
 

Commonly used notation: 
- Population mean: $\mu$, $\langle X\rangle$, $\text{E}(X)$

- Population variance: $\sigma^2$, $\text{var}(X)$


## 1.2. Sample mean and sample variance

In practice, the exact population mean and variance of a distribution cannot be measured, because doing so would take infinitely many samples. Instead, we estimate them from a finite set of samples $X = \{x_1,x_2,\dots,x_n\}$. The resulting estimates are called the sample mean $\hat{\mu}$ and the sample variance $\hat{\sigma}^2$ of $X$.

$$
\begin{align*}
\hat{\mu} =& \frac{1}{n} \sum_{i=1}^n x_i \\
\hat{\sigma}^2 =& \frac{1}{n-1} \sum_{i=1}^n (x_i-\hat{\mu})^2 \,.
\end{align*}
$$

Using $n-1$ instead of $n$ in the leading factor of $\hat{\sigma}^2$ removes the bias that arises from estimating $\mu$ with $\hat{\mu}$ from the same samples. This is known as Bessel's correction.
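As a minimal sketch of the two estimators above, the following uses only the Python standard library. The data values and the function names `sample_mean` and `sample_variance` are made up for illustration.

```python
import statistics

def sample_mean(xs):
    """Sample mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with Bessel's correction (divide by n - 1)."""
    n = len(xs)
    mu_hat = sample_mean(xs)
    return sum((x - mu_hat) ** 2 for x in xs) / (n - 1)

# Hypothetical data, chosen only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat = sample_mean(data)
var_hat = sample_variance(data)

# statistics.variance uses the same n - 1 convention, so the two should agree.
print(mu_hat, var_hat, statistics.variance(data))
```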

Commonly used notation: 

- Sample mean: $\hat{\mu}$, $\bar{x}$

- Sample variance: $\hat{\sigma}^2$, $s^2$

The hat $\hat{\ }$ usually marks a quantity estimated from samples.


## 1.3. Standard error of the mean

The sample mean $\hat{\mu}$ is used to estimate the population mean $\mu$. To quantify how much $\hat{\mu}$ differs from $\mu$, we use the standard error of the mean $\delta$.

$$\delta = \sqrt{\frac{\hat{\sigma}^2}{n}}$$

By the central limit theorem, for large $n$ the standardized error $(\hat\mu-\mu)/\delta$ approximately follows the standard normal distribution, i.e. for $z>0$,

$$
P(\hat\mu-z\delta\leq\mu\leq\hat\mu+z\delta) =
C(z) \equiv \frac{1}{\sqrt{2\pi}}\int_{-z}^z e^{-x^2/2}\mathrm{d}x\,.
$$

The range $[\hat\mu-z\delta, \hat\mu+z\delta]$ is the $100\times C(z)\%$ confidence interval of $\mu$. Notice that $\hat\mu$ and $\delta$ are still random variables to be measured in the formula. Once $\hat\mu$ and $\delta$ are determined with a particular set of samples $\{x_1,x_2,\dots,x_n\}$, the variable range $[\hat\mu-z\delta, \hat\mu+z\delta]$ collapses to a fixed range $[\hat\mu-z\delta, \hat\mu+z\delta]_\text{fixed}$, into which $\mu$ either falls or does not fall.
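The sketch below computes a confidence interval from a sample. It uses the fact that the probability mass of a standard normal within $[-z, z]$ equals $\mathrm{erf}(z/\sqrt{2})$. The data and the function names are hypothetical. With so few samples the normal approximation is crude (a Student's $t$ interval would be the usual choice), so treat this purely as an illustration of the formula.

```python
import math

def coverage(z):
    """C(z): probability mass of a standard normal within [-z, z]."""
    return math.erf(z / math.sqrt(2.0))

def confidence_interval(xs, z):
    """Return (lower, upper) = mu_hat -/+ z * delta for the sample xs."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / (n - 1)
    delta = math.sqrt(var_hat / n)  # standard error of the mean
    return mu_hat - z * delta, mu_hat + z * delta

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
low, high = confidence_interval(data, z=1.96)
print(f"C(1.96) = {coverage(1.96):.4f}, interval = [{low:.3f}, {high:.3f}]")
```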



## 1.4. Covariance and covariance matrix
 
Between two random variables $X$ and $Y$, the covariance is defined as 

$$
\begin{align*}
\sigma_{XY}^2 &= \langle (X-\langle X\rangle)(Y-\langle Y \rangle)\rangle = \int^\infty_{-\infty}\int^\infty_{-\infty} (x-\mu_x)(y-\mu_y)f_{XY}(x,y) \mathrm{d}x\mathrm{d}y\\
&= \langle XY \rangle - \langle X\rangle \langle Y \rangle = \int^\infty_{-\infty}\int^\infty_{-\infty} xyf_{XY}(x,y)\mathrm{d}x\mathrm{d}y - \mu_x\mu_y
\end{align*}
$$

It also has its sample version. For two samples $X=\{x_1,x_2,...,x_n\}$ and $Y=\{y_1,y_2,...,y_n\}$:

$$
q_{XY} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\hat{\mu}_x)(y_i-\hat{\mu}_y) 
$$


Commonly used notation:

- Covariance: $\sigma_{XY}^2$, $\text{cov}(X,Y)$

- Sample covariance: $q_{XY}$. In some cases, $\text{cov}(X,Y)$ also means the sample version.
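A minimal sketch of the sample covariance follows. It also assembles a covariance matrix, taken here to mean the table of sample covariances between every pair of variables, with the variances on the diagonal. The helper names and data are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance q_XY with the 1/(n - 1) factor."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("samples must have the same length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(columns):
    """Matrix whose (i, j) entry is the sample covariance of columns i and j."""
    return [[sample_covariance(a, b) for b in columns] for a in columns]

# Hypothetical samples: y is exactly 2 * x, so cov(x, y) = 2 * var(x).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

cov_xy = sample_covariance(x, y)
matrix = covariance_matrix([x, y])
print(cov_xy, matrix)
```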


## 1.5. Algebra

These rules apply to both the population and the sample versions. Here $X$, $Y$, $W$, $Z$ are random variables, and $a$, $b$ are constants.

For mean:
1. $\text{E}(X+a) = \text{E}(X) + a$
2. $\text{E}(aX) = a\cdot\text{E}(X)$
3. $\text{E}(X+Y) = \text{E}(X) + \text{E}(Y)$ 


For variance:
1. $\text{var}(X+a) = \text{var}(X)$
2. $\text{var}(aX) = a^2\cdot\text{var}(X)$
3. $\text{var}(X+Y) = \text{var}(X) + 2\text{cov}(X,Y) + \text{var}(Y)$

For covariance:
1. $\text{cov}(X,X) = \text{var}(X)$
2. $\text{cov}(X+a, Y+b) = \text{cov}(X,Y)$
3. $\text{cov}(aX, bY) = ab\cdot\text{cov}(X,Y)$
4. $\text{cov}(X+W, Y+Z) = \text{cov}(X,Y) + \text{cov}(X,Z) + \text{cov}(W,Y) + \text{cov}(W,Z)$
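The rules can be checked numerically on the sample versions. The sketch below verifies the third variance rule and the third covariance rule on made-up data; the helper `cov` and the numbers are assumptions for illustration only.

```python
def cov(xs, ys):
    """Sample covariance; cov(xs, xs) is the sample variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical samples and constants.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
a, b = 3.0, -2.0

# var(X + Y) = var(X) + 2 cov(X, Y) + var(Y)
x_plus_y = [xi + yi for xi, yi in zip(x, y)]
lhs_var = cov(x_plus_y, x_plus_y)
rhs_var = cov(x, x) + 2 * cov(x, y) + cov(y, y)

# cov(aX, bY) = ab cov(X, Y)
lhs_cov = cov([a * xi for xi in x], [b * yi for yi in y])
rhs_cov = a * b * cov(x, y)

print(lhs_var, rhs_var, lhs_cov, rhs_cov)
```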




---

# 2. Modern portfolio theory

Markowitz pioneered modern portfolio theory (MPT) in 1952, for which he won the Nobel Memorial Prize in Economic Sciences in 1990. This section discusses Markowitz's version of portfolio theory and ignores its descendants.

Modern portfolio theory is also known as **Mean-Variance Analysis** (MVA).
 

## 2.1. Terminology

Let $P(t)$ be a discrete-time series of a stock's price. For example if time is measured in days, then $P(t)$ is its price on the $t$-th day. 

### 2.1.1. Return

Return is the percentage gain/loss of money due to the change in price of an investment. There are two common types of return:

1. **Ordinary return**
    $$
    R(t,\Delta t) = \frac{P(t+\Delta t)-P(t)}{P(t)}
    $$

2. **Log return**
    $$
    R_L(t,\Delta t) = \log P(t+\Delta t) - \log P(t) = \log \left(\frac{P(t+\Delta t)}{P(t)}\right)
    $$

Which return to use is a modelling choice. The two differ in how they model the stock's growth. The ordinary return models a stock growing linearly, like simple interest:

$$
P(t+\Delta t) = [1+R(t,\Delta t)] P(t)
$$

On the other hand, the log return models a stock growing exponentially, like compound interest (over the long term):

$$
P(t+\Delta t) = P(t) e^{R_L(t,\Delta t)} \approx
P(t) \left[1+\frac{R_L(t,\Delta t)}{\Delta t}\right]^{\Delta t}
$$

The last approximation comes from the limit definition $e = \lim_{n\to \infty}\left(1+\frac{1}{n}\right)^n$ with the substitution $n=\frac{\Delta t}{R_L(t,\Delta t)}$, so it is accurate when $\Delta t$ is large compared with $R_L(t,\Delta t)$.
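To make the two definitions concrete, here is a small sketch that computes both returns from a made-up price series with $\Delta t = 1$. It also illustrates two standard consequences of the definitions: log returns add up over time, while ordinary returns compound multiplicatively. The function names are hypothetical.

```python
import math

def ordinary_returns(prices):
    """R(t, 1) = (P(t+1) - P(t)) / P(t) for consecutive prices."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def log_returns(prices):
    """R_L(t, 1) = log(P(t+1) / P(t)) for consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

prices = [100.0, 102.0, 99.0, 105.0]  # hypothetical daily prices

r = ordinary_returns(prices)
r_log = log_returns(prices)

# Log returns are additive: their sum is the log return over the whole period.
total_log = sum(r_log)

# Ordinary returns compound: the product of (1 + R) is the total growth factor.
growth = 1.0
for ri in r:
    growth *= 1.0 + ri

print(r, r_log, total_log, growth)
```

For small price changes the two returns are close, since $R_L = \log(1+R) \approx R$.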

### 2.1.2. Volatility

In finance, the term "volatility" describes the dispersion of returns, i.e. a return is "volatile" if it swings widely over a period of time. Volatility is therefore used as a measure of the stock's risk.

The simplest method to measure the volatility of a stock is by the sample variance (or standard deviation) of the return (or log return). 

$$
\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N\left[R(t_i,\Delta t) - \langle R(t, \Delta t)\rangle\right]^2
$$

where $\langle R(t, \Delta t)\rangle$ is the average of the returns over a time window of $N$ data points. The window has to be chosen subjectively, so the volatility is often hard to pin down.
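The dependence on the window can be seen by computing the sample variance over sliding windows of different lengths. This is only a sketch: the returns are invented and `rolling_variance` is a hypothetical helper, not a standard function.

```python
def rolling_variance(returns, window):
    """Sample variance of every consecutive slice of length `window`."""
    out = []
    for start in range(len(returns) - window + 1):
        chunk = returns[start:start + window]
        mean = sum(chunk) / window
        out.append(sum((r - mean) ** 2 for r in chunk) / (window - 1))
    return out

# Hypothetical daily returns.
returns = [0.01, -0.02, 0.015, 0.03, -0.01, 0.005]

vol_3 = rolling_variance(returns, 3)               # one value per 3-day window
vol_all = rolling_variance(returns, len(returns))  # a single full-sample value

print(vol_3)
print(vol_all)
```

Different windows give different volatility estimates for the same stock, which is the subjectivity mentioned above.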


### 2.1.3. Dividend

A dividend is a distribution of profits by a corporation to its shareholders. The dividend is usually proportional to the profit earned by the corporation and is allocated as a fixed amount per share. Let the dividend from holding one share of stock be $d$. We may then modify the numerator of the return to include the dividend as part of the absolute profit.

$$R(t,\Delta t) = \frac{P(t+\Delta t) + d -P(t)}{P(t)}$$
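As a worked example with made-up numbers, suppose $P(t) = 100$, $P(t+\Delta t) = 104$ and a dividend of $d = 2$ per share is paid within the period. Then

$$
R(t,\Delta t) = \frac{104 + 2 - 100}{100} = 0.06\,,
$$

compared with $0.04$ if the dividend is ignored.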

## 2.2. Mathematical model

### 2.2.1. Portfolio

As the idiom "Don't put all your eggs in one basket" says, we should diversify our investments so that we do not lose everything when conditions turn adverse. In finance, diversification is done by constructing a portfolio: a combination of stocks chosen so that the overall risk is reduced, because losses in some stocks are offset by gains in others that are negatively correlated with them.

Assume we have a portfolio consisting of $n$ stocks with (expected) returns $\{R_1,...,R_n\}$ and volatilities $\{\sigma_1^2,...,\sigma_n^2\}$, and that we buy each stock according to the weightings $\{w_1,...,w_n\}$, where $0\leq w_i \leq 1$ and $\sum_{i=1}^n w_i = 1$. Define the measures of a portfolio:

- Portfolio's expected return = weighted sum of the expected return of each stock

 $$R_P = \sum_{i=1}^n w_i R_i $$

- Portfolio's volatility (i.e. risk) = variance of the weighted sum of returns, which includes the covariance between every pair of stocks
 $$\sigma_P^2 = \text{var}(R_P) = \sum_{i=1}^n w_i^2\text{var}(R_i) + \sum_{i=1}^n\sum_{j\neq i} w_iw_j\text{cov}(R_i, R_j)$$
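The two portfolio measures can be sketched as follows. The weights, expected returns and covariance matrix are invented for illustration. The double sum runs over all pairs $(i, j)$, with the variances on the diagonal of the matrix, which is equivalent to the formula above.

```python
def portfolio_return(weights, returns):
    """R_P = sum_i w_i R_i."""
    return sum(w * r for w, r in zip(weights, returns))

def portfolio_variance(weights, cov):
    """sigma_P^2 = sum_i sum_j w_i w_j cov[i][j], with cov[i][i] = var(R_i)."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

# Hypothetical two-stock portfolio with negatively correlated stocks.
weights = [0.6, 0.4]
returns = [0.08, 0.12]
cov = [[0.04, -0.01],
       [-0.01, 0.09]]

r_p = portfolio_return(weights, returns)
var_p = portfolio_variance(weights, cov)
print(r_p, var_p)
```

In this example the portfolio variance is lower than the variance of either stock alone, which is the diversification effect.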





### 2.2.2. Optimization

In a simple scenario, our goal is to find the weightings of the stocks such that

1. The risk $\sigma_P^2$ is minimized given that the expected return $R_P$ is at least a certain level; or
2. The expected return $R_P$ is maximized given that the risk $\sigma_P^2$ is at most at a certain level.


These are optimization problems that can be solved by the method of Lagrange multipliers when only the equality constraints matter. For example, suppose we are in the first case: the investor wants to find the least risky portfolio whose expected return is $\mu$. This is an optimization problem over the weightings $w_i$ subject to the constraints:

$$
w^* = \underset{w_i}{\text{argmin}} \ \sigma_P^2(w_i) \quad \text{subject to} \quad \begin{cases}R_P(w_i) = \mu  \\ \sum_i w_i=1 \\ 0\leq w_i\leq 1\end{cases}
$$

In reality, we can also impose more complicated constraints, in which case this becomes a general non-linear optimization problem.
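As a sketch of how the Lagrange multiplier method applies to the first case, ignore the bounds $0\leq w_i\leq 1$ for the moment. The notation here is an assumption for illustration: $\mathbf{w}$ is the vector of weightings, $\mathbf{R}$ the vector of expected returns, $\Sigma$ the covariance matrix of the returns (assumed invertible), $\mathbf{1}$ a vector of ones, and $\lambda$, $\gamma$ the two multipliers.

$$
\begin{align*}
\mathcal{L}(\mathbf{w},\lambda,\gamma) &= \mathbf{w}^\intercal \Sigma\, \mathbf{w} - \lambda\,(\mathbf{w}^\intercal \mathbf{R} - \mu) - \gamma\,(\mathbf{w}^\intercal \mathbf{1} - 1) \\
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= 2\Sigma\,\mathbf{w} - \lambda \mathbf{R} - \gamma \mathbf{1} = \mathbf{0} \\
\Rightarrow\ \mathbf{w}^* &= \frac{1}{2}\Sigma^{-1}(\lambda \mathbf{R} + \gamma \mathbf{1})
\end{align*}
$$

The multipliers $\lambda$ and $\gamma$ follow from substituting $\mathbf{w}^*$ back into the two equality constraints. If any component of the result falls outside $[0,1]$, the bounds are active and the problem needs an inequality-constrained method instead.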

### 2.2.3. Efficient frontier

The traditional efficient frontier model is the simplest case of this optimization problem: 

- Only 2 stocks $\{R_1,\sigma^2_1\}, \{R_2,\sigma^2_2\}$ are considered. 
- The only free variable we can tune is one of the weightings, $w_1$ (then $w_2=1-w_1$ is not free). 
- Risk $\sigma^2_P$ is minimized.
- No constraint on the expected return $R_P$.

Under these conditions, the portfolio's risk $\sigma_P^2(w_1)$ is given by the formula in the lecture notes (valid for $R_1\neq R_2$):

$$
\begin{align*}
\sigma_P^2(w_1) = aR_P^2+bR_P+c
\end{align*}
$$

with $\sigma_{12} \equiv \text{cov}(R_1,R_2)$ and

$$
\begin{align*}
R_P &= w_1R_1+(1-w_1)R_2 \\[0.5em]
a &= \frac{\sigma_1^2+\sigma_2^2-2\sigma_{12}}{(R_1-R_2)^2}\\[0.5em]
b &= \frac{-2(R_2\sigma_1^2+R_1\sigma_2^2-R_1\sigma_{12}-R_2\sigma_{12})}{(R_1-R_2)^2}\\[0.5em]
c &= \frac{R_2^2\sigma_1^2+R_1^2\sigma_2^2-2R_1R_2\sigma_{12}}{(R_1-R_2)^2}
\end{align*}
$$

This risk function is quadratic in the return $R_P$, and the coefficients $a,b,c$ are symmetric under exchanging the two stocks. So people commonly plot it as a parabola of $\sigma_P^2$ vs $R_P$ (rather than vs $w_1$ directly, which also gives a parabola). Provided $a>0$ (its numerator is the variance of the difference between the two stocks' returns), the parabola opens upward and has exactly one minimum, at

$$R_P^*= -\frac{b}{2a}$$ 

and the corresponding weighting and minimum risk are

$$
\begin{align*}
w_1^* &= \frac{R_P^*-R_2}{R_1-R_2} \\[0.5em]
\sigma_P^2(w_1^*) &= c-\frac{b^2}{4a}
\end{align*}
$$
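The formulas above can be evaluated directly. The sketch below uses invented inputs and a hypothetical helper `two_stock_frontier`. As a sanity check, the minimizing weighting should agree with the closed form $w_1^* = (\sigma_2^2-\sigma_{12})/(\sigma_1^2+\sigma_2^2-2\sigma_{12})$, which follows from substituting $a$ and $b$ into $R_P^*$.

```python
def two_stock_frontier(r1, r2, var1, var2, cov12):
    """Coefficients (a, b, c) of sigma_P^2 = a * R_P^2 + b * R_P + c."""
    denom = (r1 - r2) ** 2  # requires r1 != r2
    a = (var1 + var2 - 2 * cov12) / denom
    b = -2 * (r2 * var1 + r1 * var2 - (r1 + r2) * cov12) / denom
    c = (r2 ** 2 * var1 + r1 ** 2 * var2 - 2 * r1 * r2 * cov12) / denom
    return a, b, c

# Hypothetical inputs: expected returns, variances and covariance.
r1, r2 = 0.08, 0.12
var1, var2, cov12 = 0.04, 0.09, -0.01

a, b, c = two_stock_frontier(r1, r2, var1, var2, cov12)

r_star = -b / (2 * a)                # return at the minimum-risk point
w1_star = (r_star - r2) / (r1 - r2)  # weighting of stock 1 at that point
var_star = c - b ** 2 / (4 * a)      # minimum portfolio variance

print(r_star, w1_star, var_star)
```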

---
# 3. Supplementary: Autoregressive model

Let $X = \{X_t | t\in \mathbb{N} \}$ be a stock's daily return or daily log-return. We may fit $X$ with an autoregressive model to estimate its population properties. The model assumes that each data point in a time series depends on its previous values linearly. The $k$-th-order autoregressive model, i.e. AR($k$), suggests that

$$X_t = \beta_0 + \sum_{i=1}^k \beta_i X_{t-i} + \varepsilon_t$$

for some constants $\{\beta_i\}$. The last term $\varepsilon_t$ corresponds to white noise, which is formally a set of random variables independently drawn from the same distribution with mean $\langle \varepsilon \rangle = 0$ and finite variance $\langle \varepsilon^2\rangle = \sigma_{\varepsilon}^2$. The noise distribution need not be normal, but if it is, the noise is called Gaussian white noise.

According to Gauss's ordinary least squares (OLS) method, we get $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k )^\intercal$ by minimizing the squared residual $\lVert\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\rVert^2$ with $\mathbf{y} = (X_{k+1}, X_{k+2}, \dots, X_t)^\intercal$ and

$$
\mathbf{X} = \begin{pmatrix}
1 & X_k & X_{k-1} & \cdots & X_1\\
1 & X_{k+1} & X_k & \cdots & X_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{t-1} & X_{t-2} & \cdots & X_{t-k} \\
\end{pmatrix}\,.
$$

The general solution is

$$\boldsymbol{\beta} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}\,.$$
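A minimal pure-Python sketch of this fit is given below. It builds each row so that $\beta_i$ multiplies the value $i$ steps back, then solves the normal equations $\mathbf{X}^\intercal\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\intercal\mathbf{y}$ by Gaussian elimination. The helpers `solve_linear` and `fit_ar` and the AR(2) coefficients are hypothetical. The test series is noise-free, so the fit should recover the coefficients almost exactly. In practice a linear-algebra library would be the usual choice.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_ar(series, k):
    """OLS estimate of (beta_0, ..., beta_k) from the normal equations."""
    # Row for target series[t] is (1, series[t-1], ..., series[t-k]),
    # so beta_i multiplies the value i steps back (0-based list indices).
    rows = [[1.0] + [series[t - i] for i in range(1, k + 1)]
            for t in range(k, len(series))]
    y = series[k:]
    p = k + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)

# Noise-free AR(2) series with hypothetical coefficients.
true_beta = [1.0, 0.5, -0.3]
series = [0.0, 1.0]
for _ in range(20):
    series.append(true_beta[0] + true_beta[1] * series[-1]
                  + true_beta[2] * series[-2])

beta_hat = fit_ar(series, k=2)
print(beta_hat)
```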

## 3.1. The AR(1) model

The first-order autoregressive model, i.e. AR(1), is the simplest yet nontrivial model.

$$X_t = c + m X_{t-1} + \varepsilon_t$$

### 3.1.1. Fitting

We may estimate the values of $c$ and $m$ with a much simpler version of the OLS method. Define $\overline{X}_t \equiv \frac{1}{t-1} \sum_{i=2}^t X_i$ and $\overline{X}_{t-1} \equiv \frac{1}{t-1} \sum_{i=1}^{t-1} X_i$. The OLS estimates are

$$
\hat m =  \frac{\sum_{i=1}^{t-1} (X_i-\overline{X}_{t-1}) (X_{i+1}-\overline{X}_t)}{\sum_{i=1}^{t-1} (X_i-\overline{X}_{t-1})^2} \quad \text{and}\quad \hat c= \overline{X}_t - \hat m \overline{X}_{t-1}
$$
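The AR(1) estimates can be sketched in a few lines. The function name `fit_ar1` and the test series are assumptions. The series is generated without noise from $c=2$, $m=0.5$, so the estimates should reproduce those values.

```python
def fit_ar1(series):
    """Return (m_hat, c_hat) for X_t = c + m X_{t-1} + noise."""
    lagged = series[:-1]   # X_1, ..., X_{t-1}
    current = series[1:]   # X_2, ..., X_t
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    num = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current))
    den = sum((a - mean_lag) ** 2 for a in lagged)
    m_hat = num / den
    c_hat = mean_cur - m_hat * mean_lag
    return m_hat, c_hat

# Hypothetical noise-free AR(1) series with c = 2 and m = 0.5.
series = [0.0]
for _ in range(10):
    series.append(2.0 + 0.5 * series[-1])

m_hat, c_hat = fit_ar1(series)
print(m_hat, c_hat)
```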


Finally, you should check if the measured noise $\hat\varepsilon_t = X_t -\hat{c}-\hat{m}X_{t-1}$ violates the assumptions: 

1. Whether its mean $\hat\mu_\varepsilon \approx 0$ 
2. Whether its autocorrelation $\hat{B}_\varepsilon(n) = E(\hat\varepsilon_{t+n}\hat\varepsilon_t) - \hat\mu_\varepsilon^2$ decays sufficiently fast.


Ideally, white noise $\varepsilon$ is so independent of time that its autocorrelation vanishes once $n\ne0$.

$$
B_\varepsilon(n) = \begin{cases}
\sigma_\varepsilon^2 &(n=0) \\
0 &(n\neq0)
\end{cases}
$$

Large deviations from these assumptions degrade the quality of the estimation.
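The two checks can be sketched as follows. The helper names are hypothetical, and the residuals are an artificial sequence that alternates in sign: its mean is zero, yet its lag-1 autocorrelation is as large in magnitude as its lag-0 value, so it would fail the white-noise check.

```python
def residuals(series, c_hat, m_hat):
    """Measured noise: X_t - c_hat - m_hat * X_{t-1} for each step."""
    return [series[i] - c_hat - m_hat * series[i - 1]
            for i in range(1, len(series))]

def autocovariance(values, lag):
    """Sample estimate of E(e_{t+lag} e_t) - mean^2."""
    n = len(values) - lag
    mean = sum(values) / len(values)
    return sum(values[i + lag] * values[i] for i in range(n)) / n - mean ** 2

# Artificial residuals that alternate in sign (clearly not white noise).
resid = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

mean_resid = sum(resid) / len(resid)
b0 = autocovariance(resid, 0)  # lag 0: the variance-like term
b1 = autocovariance(resid, 1)  # lag 1: should be near 0 for white noise

print(mean_resid, b0, b1)
```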

### 3.1.2. Population mean

We can calculate $X$'s population mean $\mu$ analytically.

$$
\begin{align*} E(X_t)
&= E(c + m X_{t-1} + \varepsilon_t) \\
&= c + m E(X_{t-1})
\end{align*}
$$

Each data point $X_t$ is a particular realization of $X$. The population mean $\mu$, by definition, does not depend on any particular realization, so it does not change with time if it exists. Hence, we may rewrite the equation as

$$
\begin{align*}
&\mu = c+m\mu \\ \Rightarrow\,
&\mu = \frac{c}{1-m}\,.
\end{align*}
$$

At $m=1$, $\mu$ diverges, so the AR(1) model has a population mean if and only if $m\neq1$.
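The recursion $E(X_t) = c + m E(X_{t-1})$ can also be iterated numerically. This sketch, with made-up values of $c$, $m$ and the starting mean, shows that the iteration settles at $c/(1-m)$ when $|m|<1$. For $|m|>1$ the fixed point $c/(1-m)$ still exists, but the iteration moves away from it unless it starts exactly there.

```python
def iterate_mean(c, m, mu0, steps):
    """Apply mu -> c + m * mu repeatedly, starting from mu0."""
    mu = mu0
    for _ in range(steps):
        mu = c + m * mu
    return mu

c, m = 1.0, 0.5  # hypothetical AR(1) parameters with |m| < 1

mu_limit = iterate_mean(c, m, mu0=10.0, steps=100)
mu_formula = c / (1 - m)

# With |m| > 1 the same iteration runs away from its fixed point.
mu_runaway = iterate_mean(1.0, 2.0, mu0=10.0, steps=50)

print(mu_limit, mu_formula, mu_runaway)
```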

### 3.1.3. Population variance

We can similarly calculate $X$'s population variance $\sigma^2$.

$$
\begin{align*}
\mathrm{var}(X_t) &= \mathrm{var}(c+mX_{t-1}+\varepsilon_t) \\
&= \mathrm{var}(mX_{t-1}+\varepsilon_t) \\
&= m^2 \mathrm{var}(X_{t-1}) + \sigma_\varepsilon^2+2m\,\mathrm{cov}(X_{t-1}, \varepsilon_t)
\end{align*}
$$

Before reusing the time-independence argument, we need to calculate the covariance.

$$
\begin{align*}
\mathrm{cov}(X_{t-1}, \varepsilon_t)
&= E(X_{t-1} \varepsilon_t) - \mu\cdot0 \\
&= E(c \varepsilon_t + mX_{t-2}\varepsilon_t + \varepsilon_t \varepsilon_{t-1}) \\
&= 0+m E(X_{t-2} \varepsilon_t)+0 \\
&= m^{t-2} E(X_1\varepsilon_t) \\
&= 0
\end{align*}
$$

In the third line, $E(c\varepsilon_t) = c\,E(\varepsilon_t) = 0$, and $E(\varepsilon_t \varepsilon_{t-1}) = 0$ because the white noise at each time step is independent. The covariance on one day therefore depends recursively on its value one day before. The recursion stops at $E(X_1 \varepsilon_t) = X_1 E(\varepsilon_t) = 0$ because $X_1$ is, after all, a constant. Finally, we conclude that

$$
\begin{align*}
&\sigma^2 = m^2 \sigma^2 + \sigma_\varepsilon^2 \\ \Rightarrow\,
&\sigma^2 = \frac{\sigma_\varepsilon^2}{1-m^2}\,.
\end{align*}
$$

In addition to the divergence at $|m|=1$, the formula gives an absurd negative $\sigma^2$ if $|m|>1$. Therefore, the AR(1) model has a population variance if and only if $|m|<1$.
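Both results can be checked with a quick simulation. The sketch below assumes Gaussian white noise and invented parameters with $|m|<1$, and compares the sample mean and variance of a long simulated series with $c/(1-m)$ and $\sigma_\varepsilon^2/(1-m^2)$. The agreement is only statistical, so expect small deviations.

```python
import random

def simulate_ar1(c, m, sigma_eps, n, seed=0):
    """Simulate n steps of X_t = c + m X_{t-1} + eps_t with Gaussian noise."""
    rng = random.Random(seed)
    x = c / (1 - m)  # start at the stationary mean to avoid a long transient
    out = []
    for _ in range(n):
        x = c + m * x + rng.gauss(0.0, sigma_eps)
        out.append(x)
    return out

c, m, sigma_eps = 1.0, 0.5, 1.0  # hypothetical parameters with |m| < 1

xs = simulate_ar1(c, m, sigma_eps, n=50_000)
mean_hat = sum(xs) / len(xs)
var_hat = sum((x - mean_hat) ** 2 for x in xs) / (len(xs) - 1)

mean_theory = c / (1 - m)                   # population mean
var_theory = sigma_eps ** 2 / (1 - m ** 2)  # population variance

print(mean_hat, mean_theory)
print(var_hat, var_theory)
```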