## Portfolio Optimization

Portfolio optimization models look for the optimal way to make investments. Usually investors expect 
either a maximum return for a given level of risk or a given return for a minimum risk so these models
are typically based on two criteria: maximization of the expected return and/or minimization of the risk.

There is a variety of measures of risk and the most popular is the variance in return. 

### Some Notation

* expected return: 
    $$\mathbb{E}(R_{p}) = \sum _{i}w_{i} \mathbb{E}(R_{i})$$
  where $R_{p}$ is the return on the portfolio, $R_{i}$ is the return on
  asset $i$ and $w_{i}$ is the weighting of component asset $i$
  (that is, the proportion of asset $i$ in the portfolio) and
  $\sum_{i}w_i = 1$ and $0 \le w_i \le 1$;
* portfolio return variance:
  $$ \sigma _{p}^{2} = \sum _{i}\sum _{j}w_{i}w_{j}\sigma _{ij}$$ 
  where
  $\sigma _{ij}=\sigma _{i}\sigma _{j}\rho _{ij}$ is the (sample)
  covariance of the periodic returns on assets $i$ and $j$, $\rho_{ij}$ is their correlation coefficient and $\Sigma = (\sigma_{ij})$ is the covariance matrix. In matrix notation:
  $$\sigma_p^2 = \mathbf{w^T}\Sigma\mathbf{w}$$
  where $\mathbf{w} = (w_1, w_2,\ldots,w_N)$ is the vector of weights.
* portfolio return volatility (standard deviation):
  $$ \sigma _{p}= \sqrt{\sigma _{p}^{2}}$$
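As a quick numerical check of the definitions above, the following minimal sketch evaluates them for an equally weighted portfolio. The expected returns and the covariance matrix are the annualized figures printed further down in this lesson (order: AAPL, AMZN, FB, GOOG, NFLX); the equal weights and the variable names are assumptions made only for illustration.

```python
import numpy as np

# Annualized expected returns and covariance matrix, copied from the
# outputs shown in this lesson (order: AAPL, AMZN, FB, GOOG, NFLX).
exp_returns = np.array([0.239188, 0.415127, 0.263797, 0.172818, 0.528046])
cov = np.array([
    [0.051902, 0.025037, 0.025737, 0.022454, 0.027760],
    [0.025037, 0.085839, 0.041025, 0.039501, 0.048412],
    [0.025737, 0.041025, 0.069550, 0.036127, 0.044528],
    [0.022454, 0.039501, 0.036127, 0.051797, 0.040390],
    [0.027760, 0.048412, 0.044528, 0.040390, 0.178298],
])

# Equal weights: an arbitrary choice, used only to illustrate the formulas.
w = np.full(5, 1 / 5)

port_return = w @ exp_returns   # E(R_p) = sum_i w_i E(R_i)
port_var = w @ cov @ w          # sigma_p^2 = w^T Sigma w
port_vol = np.sqrt(port_var)    # sigma_p

print(f"return: {port_return:.4f}, variance: {port_var:.4f}, "
      f"volatility: {port_vol:.4f}")
```

The matrix product `w @ cov @ w` is the double sum over $i$ and $j$ written in compact form.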

### The Markowitz Mean/Variance Portfolio Model
The portfolio model, introduced by Markowitz, assumes an investor has two considerations when constructing an investment portfolio: expected return and variance in return (i.e., risk, since it measures the variability in realized return around the expected return). 

The Markowitz model requires two major kinds of information: 

* the estimated expected return for each candidate investment;
* the covariance matrix of returns. 

The latter characterizes not only the individual variability of the return on each investment, but also how each investment’s return tends to move with other investments. 

Throughout this lesson we will use real market data stored in [portfolio_data.csv](https://drive.google.com/file/d/1srCzNlKVY_LHRpkaKoUynnmI0KImfT6Y/view?usp=sharing).
The sample includes, for each entry, a date and the corresponding closing price of five company stocks:

In [1]:
import pandas as pd

df = pd.read_csv("portfolio_data.csv", index_col="date")
print (df.head())

                 AAPL     AMZN     FB    GOOG       NFLX
date                                                    
2014-03-27  71.865678  338.470  60.97  558.46  52.025714
2014-03-28  71.785450  338.290  60.01  559.99  51.267143
2014-03-31  71.769404  336.365  60.24  556.97  50.290000
2014-04-01  72.425937  342.990  62.62  567.16  52.098571
2014-04-02  72.546280  341.960  62.72  567.00  51.840000


<img src="portfolio_sample.png">

With $\tt{pandas}$ the main characteristics of these time series can be easily computed (e.g. daily returns, covariance matrix):

In [2]:
# returns daily and annualized
daily_returns = df.pct_change()
returns = daily_returns.mean()*252
print (returns)

AAPL    0.239188
AMZN    0.415127
FB      0.263797
GOOG    0.172818
NFLX    0.528046
dtype: float64


In [3]:
# covariance
covariance = daily_returns.cov()*252
print (covariance)

          AAPL      AMZN        FB      GOOG      NFLX
AAPL  0.051902  0.025037  0.025737  0.022454  0.027760
AMZN  0.025037  0.085839  0.041025  0.039501  0.048412
FB    0.025737  0.041025  0.069550  0.036127  0.044528
GOOG  0.022454  0.039501  0.036127  0.051797  0.040390
NFLX  0.027760  0.048412  0.044528  0.040390  0.178298


In our sample the covariances are rather small and all positive, i.e. all the stocks are positively correlated.

Simulating a large number of sets of weights to construct portfolios with the five stocks shown before, we can see how these portfolios are distributed in terms of return and volatility. In this case no attempt at optimization whatsoever has been made.

In principle investors may use short sales in their portfolios (a portfolio is short in those stocks with negative weights). Although short selling extends the set of possible portfolios, we are not going to consider it here.
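A minimal sketch of such a simulation is given here. The lesson does not say how the random weights were drawn, so sampling them from a flat Dirichlet distribution (which gives non-negative weights summing to one, i.e. no short selling) is an assumption, as are the number of portfolios and the variable names. The inputs are the annualized figures printed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Annualized figures from the outputs above (AAPL, AMZN, FB, GOOG, NFLX).
exp_returns = np.array([0.239188, 0.415127, 0.263797, 0.172818, 0.528046])
cov = np.array([
    [0.051902, 0.025037, 0.025737, 0.022454, 0.027760],
    [0.025037, 0.085839, 0.041025, 0.039501, 0.048412],
    [0.025737, 0.041025, 0.069550, 0.036127, 0.044528],
    [0.022454, 0.039501, 0.036127, 0.051797, 0.040390],
    [0.027760, 0.048412, 0.044528, 0.040390, 0.178298],
])

n_portfolios = 5000  # arbitrary sample size

# Each row is one long-only weight vector (entries >= 0, row sum = 1).
weights = rng.dirichlet(np.ones(len(exp_returns)), size=n_portfolios)

# Return and volatility of every simulated portfolio.
sim_returns = weights @ exp_returns
sim_vols = np.sqrt(np.einsum("ij,jk,ik->i", weights, cov, weights))

print("volatility range:", sim_vols.min(), sim_vols.max())
print("return range:    ", sim_returns.min(), sim_returns.max())
```

Plotting `sim_returns` against `sim_vols` as a scatter plot gives the kind of cloud of portfolios described in this section.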

<img src="return_variance.png" width=500>

### Optimization

The Markowitz model states that the weights $w_i$ should be chosen such that the portfolio has the minimum volatility (variance). So the application of the Markowitz model reduces to an optimization problem

$$\underset{\mathbf{w}}{\min}\{\sigma_P^2\}= \underset{\mathbf{w}}{\min}\{\mathbf{w^T}\Sigma\mathbf{w}\}$$

with the constraint $\sum_{i}w_i = 1$ and $0 \le w_i \le 1$.

We have already seen how to solve minimization problems in $\tt{python}$ so we just need to repeat the usual steps seen before.

In [4]:
# markowitz
import numpy as np
from scipy.optimize import minimize

def sum_weights(w): 
    return np.sum(w) - 1

def markowitz(w, cov):
    return w.T.dot(cov.dot(w))

num_assets = 5
constraints = ({'type': 'eq', 'fun': sum_weights},) 
bounds = tuple((0, 1) for asset in range(num_assets))
weights = [1./num_assets for _ in range(num_assets)]

opts = minimize(markowitz, weights, args=(covariance,),
                bounds=bounds, constraints=constraints)
print (opts)

     fun: 0.03607029963784209
     jac: array([0.0723022 , 0.07233166, 0.0718961 , 0.07199311, 0.07223676])
 message: 'Optimization terminated successfully.'
    nfev: 63
     nit: 9
    njev: 9
  status: 0
 success: True
       x: array([0.44544146, 0.06252825, 0.12333117, 0.3662157 , 0.00248342])


In [5]:
print ("Expected portfolio return: {:.3f}".format(np.dot(opts.x, returns)))

Expected portfolio return: 0.230


The solution recommends about 44% of the portfolio be invested in AAPL, about 6% in AMZN, 12% in FB and so on.
The expected return is about 23%, with a variance of about 0.036 or,
equivalently, a standard deviation of 0.19.

In this example we based the model simply on statistical data derived from daily returns. However, rather than using historical series to estimate the
expected return of an asset, one could base this estimate on information about its expected future performance.

### Efficient Frontier and Parametric Analysis
There is no precise way for an investor to determine the “correct” trade off between risk and return. If an investor wants a higher expected return, she generally has to “pay for it” with higher risk. 
Thus, one is frequently interested in looking at the relative distribution of the two. 

In finance terminology, we
would like to trace out the **efficient frontier** of return and risk. 
To determine it we need to solve for the minimum variance
portfolio over a range of values for the expected return.

In [6]:
def efficient_frontier(w, asset_returns, target_return): 
    portfolio_return = asset_returns.dot(w) 
    return (portfolio_return - target_return)

results = []
bounds = tuple((0, 1) for asset in range(num_assets))
    
for eff in np.arange(0.20, 0.45, 0.005):
    constraints = ({'type': 'eq', 'fun': efficient_frontier, 
                    'args':(returns, eff,)},
                   {'type': 'eq', 'fun': sum_weights})
    weights = [1./num_assets for _ in range(num_assets)]
    opts = minimize(markowitz, weights, args=(covariance,),
                    bounds=bounds, constraints=constraints) 
    
    results.append((np.sqrt(opts.x.T.dot(covariance.dot(opts.x))),
                    np.sum(returns*opts.x))) 

<img src="efficient_frontier.png">

### Criticisms of the Markowitz Model

Despite the significant utility of the Markowitz theory, there are some major limitations in this model:

* it is difficult to forecast asset returns with accuracy using historical data. As return estimates have a large impact on the asset allocations, small changes in return assumptions can lead to inefficient portfolios. Therefore, the model tends to lead to highly concentrated portfolios (out-of-sample weights) that do not offer as many diversification benefits;
* the model assumes that asset correlations are linear. In reality, asset correlations move dynamically, changing with the market cycles. During the global financial crisis, asset correlations approached 1, so if anything, diversification seemed to have an insignificant impact on the portfolios;
* last but not least, the model assumes normality in return distributions. Therefore, it does not factor in extreme market moves, which tend to make return distributions either skewed, fat-tailed or both.

### Portfolios with a Risk-Free Asset
When one of the assets of the portfolio is risk free, the efficient frontier has a particularly simple form: a line, the *capital allocation line* (CAL). 

The slope of this line measures the trade off between risk and return: a higher slope means that investors receive a higher expected
return in exchange for taking on more risk. 
The CAL aids investors in choosing how much to invest in a risk-free asset and one or more risky assets.

The simplest example is a portfolio containing two assets: one risk-free (e.g. treasury bill) and one risky (e.g. a stock).

Assume that the expected return of the treasury bill is $\mathbb{E}[R_f] = 3\%$ and its risk is 0%. Further, assume that the expected return of the stock is $\mathbb{E}[R_r] = 10\%$ and its standard deviation is $\sigma_r = 20\%$. The question that needs to be answered for any individual investor is how much to invest in each of these assets.

The expected return ($\mathbb{E}[R_p]$) of this portfolio is calculated as follows:

$$\mathbb{E}[R_p] = \mathbb{E}[R_f]\cdot w_f + \mathbb{E}[R_r] \cdot ( 1 - w_f )$$
where $w_f$ is the relative allocation to the risk-free asset.

The calculation of risk for this portfolio is simple because the standard deviation of the treasury bill is 0%. Thus, risk is calculated as:

$$\sigma_p = ( 1 - w_f ) \cdot \sigma_r$$

In this very simple example, if an investor were to invest 100% into the risk-free asset ($w_f = 1$),
the expected return would be 3% and the risk of the portfolio would be 0%. Likewise, investing
100% into the stock ($w_f = 0$) would give an investor an expected return of 10% and a portfolio risk
of 20%. If the investor allocated 25% to the risk-free asset and 75% to the risky asset, the portfolio
expected return and risk calculations would be:

$$\mathbb{E}[R_p] = ( 3\% \cdot 25\% ) + ( 10\% \cdot 75\% ) = 0.75\% + 7.5\% = 8.25\%$$

$$\sigma_p = 75\%\cdot 20\% = 15\%$$
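The worked example above can be reproduced with a few lines of code. This is only a sketch using the numbers of the example (3% risk-free return, 10% expected stock return, 20% stock volatility); the grid of allocations and the variable names are arbitrary choices.

```python
import numpy as np

rf = 0.03            # expected return of the treasury bill
r_risky = 0.10       # expected return of the stock
sigma_risky = 0.20   # standard deviation of the stock

# Fraction of wealth in the risk-free asset: 0%, 25%, 50%, 75%, 100%.
w_f = np.linspace(0.0, 1.0, 5)

exp_ret = rf * w_f + r_risky * (1 - w_f)   # E[R_p]
risk = (1 - w_f) * sigma_risky             # sigma_p

# Slope of the capital allocation line: extra return per unit of risk.
slope = (r_risky - rf) / sigma_risky

for wf, er, s in zip(w_f, exp_ret, risk):
    print(f"w_f = {wf:.2f} -> expected return = {er:.4f}, risk = {s:.4f}")
print(f"CAL slope = {slope:.2f}")
```

Every (risk, return) pair printed by the loop lies on the same straight line, which starts at the risk-free return and has the slope computed in the last step.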

Applying the same steps to our example we can consider an additional risk-free asset with an expected return of 10% and repeat the minimisation to determine the efficient frontier of
the resulting portfolio. Notice how the objective function is almost the same while the constraint on the target return now includes also the risk-free asset.

In [7]:
num_assets = 6
    
def markowitz_with_rf(w, cov):
    return w[:-1].T.dot(cov.dot(w[:-1]))

def efficient_frontier_with_rf(w, asset_returns, target_return, risk_free): 
    portfolio_return = np.sum(asset_returns*w[:-1]) + risk_free*w[-1] 
    return (portfolio_return - target_return)

rf_asset_return = 0.10
results_rf = []
bounds = tuple((0, 1) for asset in range(num_assets))
for eff in np.arange(0.10, 0.45, 0.01):
    constraints = ({'type': 'eq', 'fun': efficient_frontier_with_rf, 
                    "args":(returns, eff, rf_asset_return)},
                    {'type': 'eq', 'fun': sum_weights})
    weights = [1./num_assets for _ in range(num_assets)]
    opts = minimize(markowitz_with_rf, weights, 
                    args=(covariance,),
                    bounds=bounds, constraints=constraints)
    results_rf.append((np.sqrt(opts.x[:-1].T.dot(covariance.dot(opts.x[:-1]))), 
                     np.sum(returns*opts.x[:-1])+opts.x[-1]*rf_asset_return))

<img src="cal.png">

The efficient frontier has become a straight line, tangent to the
frontier of the risky assets only. 

When the target is 10% the entire investment is allocated to the
risk-free asset; as the target increases, the fraction of the risky assets grows proportionally to the volatility.

### The Sharpe Ratio
The goal of an investor who is seeking to earn the highest possible expected return for any 
level of volatility is to find the portfolio that generates the steepest possible line
when combined with the risk-free investment. The slope of this line is called the 
*Sharpe ratio* of the portfolio.

For some portfolio $p$ of risky assets let

* $R_p$ be its expected return;
* $\sigma_p$ be its standard deviation in return;
* $r_0$ be the return of a reference risk-free asset.

A plausible single measure (as opposed to the two measures, risk and return) of attractiveness of a portfolio $p$ is the Sharpe ratio defined as

$$ \cfrac{R_p - r_0}{\sigma_p} $$

In words, it measures how much additional return we achieved for the additional risk we took on, relative to putting all our money in a risk-free asset.
The portfolio that maximizes this ratio has a certain well-defined appeal. Suppose:

* $R_\textrm{target}$ is our desired target return;
* $w_p$ is the fraction of our wealth we place in portfolio $p$ (the rest is placed in the risk-free asset).

To meet our return target, we must have:

$$ ( 1 - w_p ) \cdot r_0 + w_p \cdot R_p = R_\textrm{target} $$

The standard deviation of our total investment is: $w_p\cdot \sigma_p$.
Solving for $w_p$ in the return constraint, we get:

$$ w_p = \cfrac{R_\textrm{target} - r_0}{R_p - r_0} $$

Thus, the standard deviation of the portfolio is:

$$ w_p\cdot \sigma_p = \left(\cfrac{R_\textrm{target} - r_0}{R_p - r_0}\right)\cdot \sigma_p $$

Minimizing the portfolio standard deviation means:

$$ \min\left[\left(\cfrac{R_\textrm{target} - r_0}{R_p - r_0}\right)\cdot \sigma_p\right]\implies\max\left[\cfrac{R_p - r_0}{\sigma_p}\right]$$

So, regardless of our risk/return preference, the money we invest in risky assets should be invested in the risky portfolio that maximizes the Sharpe ratio, since it is also the one that minimizes the standard deviation for any target return.
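One possible way to search for this portfolio numerically is sketched here: since $\tt{scipy}$ only minimizes, we minimize the negative of the Sharpe ratio under the usual long-only constraints. The inputs are the annualized figures printed earlier in this lesson and the 10% risk-free return used in the previous section; the function and variable names are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Annualized figures from the outputs above (AAPL, AMZN, FB, GOOG, NFLX).
exp_returns = np.array([0.239188, 0.415127, 0.263797, 0.172818, 0.528046])
cov = np.array([
    [0.051902, 0.025037, 0.025737, 0.022454, 0.027760],
    [0.025037, 0.085839, 0.041025, 0.039501, 0.048412],
    [0.025737, 0.041025, 0.069550, 0.036127, 0.044528],
    [0.022454, 0.039501, 0.036127, 0.051797, 0.040390],
    [0.027760, 0.048412, 0.044528, 0.040390, 0.178298],
])
r0 = 0.10  # risk-free return, same value as in the previous section


def neg_sharpe(w, mu, sigma, r0):
    """Negative Sharpe ratio (minimizing it maximizes the ratio)."""
    port_return = w @ mu
    port_vol = np.sqrt(w @ sigma @ w)
    return -(port_return - r0) / port_vol


n = len(exp_returns)
w0 = np.full(n, 1 / n)  # start from equal weights
res = minimize(
    neg_sharpe, w0, args=(exp_returns, cov, r0),
    bounds=[(0, 1)] * n,
    constraints=({"type": "eq", "fun": lambda w: np.sum(w) - 1},),
)

w_sharpe = res.x
sharpe = -res.fun
print("weights:", np.round(w_sharpe, 2))
print("Sharpe ratio:", round(sharpe, 3))
```

The starting point and the solver defaults are arbitrary; it is worth checking `res.success` before trusting the weights.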

In [None]:
# implement sharpe ratio
num_assets = 5
rf_asset_return = 0.10

In [None]:
# print sharpe ratio















<img src="sharpe_ratio.png">

Notice that the relative proportions of the stocks are the same as in the previous case where we explicitly included the risk-free asset (0.12, 0.54, 0., 0., 0.33).

So the optimization using the Sharpe ratio gives us a portfolio that is
on the minimum volatility efficient frontier, and gives the maximum
return relative to putting all our money in the risk-free asset.

Usually, any Sharpe ratio greater than 1.0 is considered acceptable to good by investors. A ratio higher than 2.0 is rated as very good. A ratio of 3.0 or higher is considered excellent. A ratio under 1.0 is sub-optimal.

## Capital Asset Pricing Model

The Capital Asset Pricing Model (CAPM) describes the relationship between expected return of assets and *systematic risk* of the market.

Indeed we can divide a security’s total risk into unsystematic risk, the portion peculiar to the company that can be diversified away, and systematic risk, the nondiversifiable portion that is related to the movement of the stock market and is therefore unavoidable.

The assumption of CAPM is that investors are risk averse and want to maximize return. This
notion implies that investors demand compensation for taking on risk, which in the financial market means that higher-risk securities are priced to yield higher expected returns than lower risk securities.

CAPM states that the expected return of an asset is equal to the risk-free return plus a risk premium. Mathematically, we can summarize CAPM with the following formula

$$r_i = r_f + \beta_i(r_m-r_f)$$
where:
* $r_i$ is the expected return of the $i^{th}$ security;
* $r_f$ is the risk-free rate with zero standard deviation (an example of a risk-free asset includes Treasury Bills as they are backed by the U.S. government);
* $r_m - r_f$ is the risk premium ($r_m$ denotes the market return including all securities in the market, it can be represented with an index like S\&P 500);
* $\beta_i$ is a measure of $i^{th}$ asset volatility in relation to the overall market. 
$\beta$ is used in the CAPM to describe the relationship between systematic risk, or market risk, and the expected return of an asset.
	
Therefore the key point in CAPM is the determination of $\beta$, which can be measured as the slope of the *regression line* of the individual stock returns against the market returns.

Given two sets of measurements $X$ and $y$ the linear regression determines the parameters $\alpha$ and $\beta$ such that

$$\hat{y}=\beta X + \alpha$$

by minimizing the sum of the squared differences between the predicted $\hat{y}$ and true $y$ values.

<img src="linear_regression.png">

In our case the regressed line estimates the stock returns given the global market returns and in particular 

$$\beta \approx \cfrac{\textrm{cov}(X,y)}{\textrm {var}(X)}$$

so it provides insights about how *volatile*, or how risky, a stock is relative to the rest of the market.
In CAPM the $\beta$ calculation is used to help investors understand whether a stock moves in the same direction as the rest of the market, but for it to provide any useful insight, the market that is used as a benchmark should be related to the stock.

Those who use CAPM pick individual stocks or portfolios, and compare them to different indexes. The point is to find stocks that have high $\beta$, and portfolios that have high $\alpha$. High $\beta$ values mean that the stock fares better than the index in a rising market and worse in a falling one (conversely, low $\beta$ gives lower performance in a rising market and better returns in a falling one), so those stocks have a chance of beating the market. $\alpha$ values above zero mean that your portfolio outperforms the market whatever it does.

This risk vs expected return relationship is called the security market line (SML). 

<img src ="sel.png">

In the freely competitive financial markets described by CAPM, no security can sell for long at prices low enough to yield more than its appropriate return on the SML. The security would then be very attractive compared with other securities of similar risk, and investors would bid its price up until its expected return fell to the appropriate position on the SML. Conversely, investors would sell off any stock selling at a price high enough to put its expected return below its appropriate position. The resulting reduction in price would continue until the stock’s expected return rose to the level justified by its systematic risk.

In order to see an example of CAPM application we use [capm.csv](https://drive.google.com/file/d/1G4U8foyhq9agGPs8cg83-aY4qU8K3jJI/view?usp=sharing) file which contains the historical series of S&P500 and of some stocks.
As usual with $\tt{pandas}$ we can inspect the file and compute some useful quantities like daily and annualized expected returns.

In [None]:
# check capm.csv and define daily returns
import pandas as pd

### Calculate $\beta$ and CAPM for each Stock
To recap $\beta$ is the slope of the regression line of the market return vs stock return plot (and $\alpha$ is the intercept of this line with the $y$ axis).

These quantities can be calculated with $\tt{numpy.polyfit}$ passing as inputs the market and the stock return lists. 

With the daily returns and $\beta$ of each stock, we can calculate the capital asset pricing model. First, we can calculate the average daily return of the market and annualize this return by multiplying it by the number of trading days in a year.
Assuming a risk-free rate of 1%, we can then calculate CAPM using its definition.
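The procedure just described can be sketched as follows. Since the content of capm.csv is not reproduced here, the example uses synthetic daily returns generated with a known $\beta$; the generating parameters, the random seed and all variable names are hypothetical and serve only to show the mechanics of $\tt{numpy.polyfit}$ and of the CAPM formula.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic daily returns (hypothetical values, NOT taken from capm.csv).
n_days = 1000
true_beta, true_alpha = 1.3, 0.0002
market = rng.normal(0.0005, 0.01, n_days)
stock = true_alpha + true_beta * market + rng.normal(0.0, 0.005, n_days)

# A degree-1 polynomial fit returns the slope first, then the intercept.
beta, alpha = np.polyfit(market, stock, 1)

# The same slope from the covariance/variance ratio.
beta_cov = np.cov(market, stock)[0, 1] / np.var(market, ddof=1)

# CAPM expected return with a 1% risk-free rate.
rf = 0.01
rm = market.mean() * 252          # annualized market return
expected_return = rf + beta * (rm - rf)

print(f"beta = {beta:.3f} (cov/var: {beta_cov:.3f}), alpha = {alpha:.5f}")
print(f"CAPM expected return = {expected_return:.3f}")
```

With real data, `market` and `stock` would be replaced by the corresponding columns of the daily returns table.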

In [None]:
# implement CAPM for each asset
rm = np.mean(daily_returns.iloc[1:]['SP500'])*252 
rf = 0.01
betas = {}
alphas = {}
ERs = {}












<img src="capm_fit.png">

Now that we have the $\beta$ of each individual stock we can calculate the CAPM for a portfolio made of the same stocks (we assume equal weights for simplicity).
It is enough to perform a weighted sum of the expected returns given by the model for each stock.

In [None]:
# print portfolio expected return 

The expected return of the portfolio is roughly 10\% and this is what an investor should expect according to CAPM.

### Criticism of CAPM 
As we have seen the whole model is about plotting a line in a scatter plot. It’s not a very complex model. The assumptions behind the model are even more simplistic. For example:

* all investors are rational and they avoid risk;
* everyone has full information about the market;
* everyone has similar investment horizons and expectations about future movements;
* stocks are all correctly priced.

Moreover, this is a model from the 1960s. Market dynamics were different back then. And of course, this is a retrospective model. We cannot know how future stock prices move and how the market behaves.

## Risk Parity Portfolio

An alternative approach to Markowitz theory is *risk parity*. A risk parity portfolio is an investment allocation strategy which focuses on the allocation of risk, rather than the allocation of capital. 

A risk parity (equal risk) portfolio is characterised by having equal risk contributions to the total risk from each individual asset. 
Risk parity allocation is also referred to as equally-weighted risk contributions portfolio method. Equally-weighted risk contributions is not about having the same volatility, it is about having each asset contributing in the same way to the portfolio overall volatility. 

For this we will have to define the contribution of each asset to the portfolio risk. **This allocation strategy has gained popularity in the last decades since it is believed to provide better risk adjusted return than capital based allocation strategies.**

Consider a portfolio of $N$ assets: $x_1,\ldots , x_N$ where as usual the weight of the asset $x_i$ is denoted by $w_i$. The $w_i$ form the allocation vector $\mathbf{w}$. Let us further denote the covariance matrix of the assets as $\Sigma$. The volatility of the portfolio is then defined as:

$$\sigma_p={\sqrt {\mathbf{w^T}\Sigma \mathbf{w}}} = \sum _{i=1}^{N}\sigma _{i}\qquad\textrm{with}~\sigma _{i} = w_{i}\cdot \cfrac{\partial\sigma_p}{\partial w_{i}}={\cfrac {w_{i}(\Sigma \mathbf{w})_{i}}{\sqrt {\mathbf{w^T}\Sigma \mathbf{w}}}}$$
so that $\sigma _{i}$ can be interpreted as the contribution of asset $i$ in the portfolio to the overall risk of the portfolio.
Equal risk contribution then means $\sigma _{i} =\sigma _{j}$ for all
$i,j$ or equivalently $\sigma _{i}=\sigma_p/N$. So

$$\sigma _{i} = \cfrac{\sigma_p}{N}={\cfrac {w_{i}(\Sigma \mathbf{w})_{i}}{\sqrt {\mathbf{w^T}\Sigma \mathbf{w}}}}\implies w_{i} = \frac {\sigma_p^{2}}{(\Sigma \mathbf{w})_{i}N}$$

Since we want the previous expression to be true for each $i$, the solution for the weights can be found by solving the minimisation problem

$$\underset{\mathbf{w}}{\min } \sum _{i=1}^{N}\left[w_{i}-{\frac {\sigma_p^{2}}{(\Sigma \mathbf{w})_{i}N}}\right]^{2}$$
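A minimal sketch of this minimisation, under the same long-only constraints used for the Markowitz model, could look as follows. The covariance matrix is the annualized one printed earlier in this lesson; the function names, the starting point and the solver tolerance are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Annualized covariance matrix from the outputs above.
cov = np.array([
    [0.051902, 0.025037, 0.025737, 0.022454, 0.027760],
    [0.025037, 0.085839, 0.041025, 0.039501, 0.048412],
    [0.025737, 0.041025, 0.069550, 0.036127, 0.044528],
    [0.022454, 0.039501, 0.036127, 0.051797, 0.040390],
    [0.027760, 0.048412, 0.044528, 0.040390, 0.178298],
])


def risk_contributions(w, sigma):
    """sigma_i = w_i * (Sigma w)_i / sqrt(w^T Sigma w) for every asset."""
    return w * (sigma @ w) / np.sqrt(w @ sigma @ w)


def risk_parity_objective(w, sigma):
    """Sum over i of [w_i - sigma_p^2 / ((Sigma w)_i * N)]^2."""
    n = len(w)
    port_var = w @ sigma @ w
    return np.sum((w - port_var / ((sigma @ w) * n)) ** 2)


n = cov.shape[0]
w0 = np.full(n, 1 / n)  # start from equal capital weights
res = minimize(
    risk_parity_objective, w0, args=(cov,), method="SLSQP",
    bounds=[(0, 1)] * n,
    constraints=({"type": "eq", "fun": lambda w: np.sum(w) - 1},),
    options={"ftol": 1e-14, "maxiter": 500},
)

w_rp = res.x
print("weights:           ", np.round(w_rp, 3))
print("risk contributions:", np.round(risk_contributions(w_rp, cov), 4))
```

The tight `ftol` is there because the objective takes very small values; if the optimisation has converged, the printed risk contributions should be nearly identical across assets.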

In [None]:
# implement risk-parity
num_assets = 5

In [None]:
# check sigma_i for each asset

### Risk Budget Allocation
The same technique can be used if we would like to calculate a portfolio with risk budget allocation. If we consider the previous equation

$$\sigma _{i}=\cfrac{\sigma_p}{N}$$

where we set the risk contribution fraction to every asset to $1/N$;
now we can simply replace $1/N$ with the desired fraction ($f_i$) for each asset:

$$\sigma _{i}=f_i \cdot \sigma_p$$
so that the relation to minimise becomes:

$$\underset{\mathbf{w}}{\min } \sum _{i=1}^{N}\left[w_{i}-{\frac {f_i \cdot \sigma_p^{2}}{(\Sigma \mathbf{w})_{i}}}\right]^{2} $$
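As a rough sketch of this objective, the example below asks for a portfolio in which the five stocks contribute 30%, 30%, 20%, 10% and 10% of the total risk. The budget vector `f` is a made-up example, the covariance matrix is the annualized one printed earlier in this lesson, and the function names and solver settings are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Annualized covariance matrix from the outputs above.
cov = np.array([
    [0.051902, 0.025037, 0.025737, 0.022454, 0.027760],
    [0.025037, 0.085839, 0.041025, 0.039501, 0.048412],
    [0.025737, 0.041025, 0.069550, 0.036127, 0.044528],
    [0.022454, 0.039501, 0.036127, 0.051797, 0.040390],
    [0.027760, 0.048412, 0.044528, 0.040390, 0.178298],
])

# Hypothetical risk budget: desired fraction of total risk per asset.
f = np.array([0.30, 0.30, 0.20, 0.10, 0.10])


def risk_budget_objective(w, sigma, f):
    """Sum over i of [w_i - f_i * sigma_p^2 / (Sigma w)_i]^2."""
    port_var = w @ sigma @ w
    return np.sum((w - f * port_var / (sigma @ w)) ** 2)


n = cov.shape[0]
w0 = np.full(n, 1 / n)
res = minimize(
    risk_budget_objective, w0, args=(cov, f), method="SLSQP",
    bounds=[(0, 1)] * n,
    constraints=({"type": "eq", "fun": lambda w: np.sum(w) - 1},),
    options={"ftol": 1e-14, "maxiter": 500},
)

w_rb = res.x
port_vol = np.sqrt(w_rb @ cov @ w_rb)
# Fraction of the total risk actually carried by each asset.
risk_fractions = w_rb * (cov @ w_rb) / port_vol ** 2

print("weights:       ", np.round(w_rb, 3))
print("risk fractions:", np.round(risk_fractions, 3))
```

Setting every entry of `f` to $1/N$ recovers the risk parity objective of the previous section.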

Translating it to $\tt{python}$ we get:

In [None]:
# implement risk budget

In [None]:
# print risk budget for each asset

## Portfolio Diversification 

Diversification allows us to combine risky stocks so that their combination (the portfolio) is less risky than any of its components. 

Suppose there are two companies located on an isolated island whose chief industry is tourism. 
One company manufactures suntan lotion. Its stock predictably performs well in sunny years and poorly in rainy ones. The other company produces disposable umbrellas. Its stock performs equally poorly in sunny years and well in rainy ones. 

Each company earns a 12% average return. In purchasing either stock, investors incur a great amount of risk because of variability in the stock price driven by fluctuations in weather conditions. Investing half the funds in the suntan lotion stock and half in the stock of the umbrella manufacturer, however, results in a return of 12% regardless of which weather condition prevails. Portfolio diversification thus transforms two
risky stocks, each with an average return of 12%, into a riskless portfolio certain of earning the expected 12%.
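The island example can be checked numerically. The text only fixes the 12% average return, so the returns in sunny and rainy years used below (+30% and -6%) and the equal probability of the two kinds of year are hypothetical numbers chosen to reproduce that average.

```python
import numpy as np

# Rows: sunny year, rainy year. Columns: suntan lotion, umbrellas.
# Hypothetical returns chosen so that each stock averages 12%.
state_returns = np.array([
    [0.30, -0.06],
    [-0.06, 0.30],
])
probs = np.array([0.5, 0.5])   # sunny and rainy years equally likely
w = np.array([0.5, 0.5])       # half of the funds in each stock

# Mean and standard deviation of each stock on its own.
mean_each = probs @ state_returns
std_each = np.sqrt(probs @ (state_returns - mean_each) ** 2)

# Portfolio return in each state, then its mean and standard deviation.
port_states = state_returns @ w
port_mean = probs @ port_states
port_std = np.sqrt(probs @ (port_states - port_mean) ** 2)

print("single stocks: mean", mean_each, "std", std_each)
print("portfolio:     mean", port_mean, "std", port_std)
```

Each stock alone has a sizeable standard deviation, while the 50/50 portfolio earns the same return in both kinds of year.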

Unfortunately, the perfect negative relationship between the returns on these two stocks is very rare in the real world. To some extent, corporate securities move together, so complete elimination of risk through simple portfolio diversification is impossible. However, as long as some lack of parallelism in the returns of securities exists, diversification will always reduce risk.

Empirical studies have demonstrated that unsystematic risk can be virtually eliminated in
portfolios of 30 to 40 randomly selected stocks. Of course, if investments are made in closely related industries, more securities are required to eradicate unsystematic risk.

As we have already seen, no measure of unsystematic risk appears in the risk premium of
the CAPM model, since it is assumed that diversification has eliminated it.

In the Markowitz model instead diversification is achieved by seeking to combine in a portfolio assets with returns that are less than perfectly positively correlated, in an effort to lower portfolio risk (variance) without sacrificing return, through the reduction of the off-diagonal terms of the covariance matrix $\Sigma$.

### Maximum Diversification Portfolio

Diversification is most often either pursued in tandem with other objectives, such as return maximization, or pursued simply by including more asset classes in our portfolio.

But it does not have to be this way; diversification can be pursued explicitly as the sole objective in portfolio construction.

In a 2008 paper, the diversification ratio $D$ of a portfolio has been defined as:

$$D=\cfrac{\mathbf{w^T}\boldsymbol{\sigma}}{\sqrt {\mathbf{w^T}\Sigma \mathbf{w}}}$$

where $\boldsymbol{\sigma}$ is the vector of volatilities and $\Sigma$ is the covariance matrix. The term in the denominator is the volatility of the portfolio and the term in the numerator is the weighted average volatility of the assets. More diversification within a portfolio decreases the denominator and leads to a higher diversification ratio.
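To make the definition concrete, the sketch below computes $D$ for an equally weighted portfolio and then looks for the long-only weights that maximize it. The covariance matrix is the annualized one printed earlier in this lesson; the function names and the use of $\tt{scipy.optimize.minimize}$ on $-D$ are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Annualized covariance matrix from the outputs above.
cov = np.array([
    [0.051902, 0.025037, 0.025737, 0.022454, 0.027760],
    [0.025037, 0.085839, 0.041025, 0.039501, 0.048412],
    [0.025737, 0.041025, 0.069550, 0.036127, 0.044528],
    [0.022454, 0.039501, 0.036127, 0.051797, 0.040390],
    [0.027760, 0.048412, 0.044528, 0.040390, 0.178298],
])


def diversification_ratio(w, sigma):
    """Weighted average asset volatility divided by portfolio volatility."""
    asset_vols = np.sqrt(np.diag(sigma))
    return (w @ asset_vols) / np.sqrt(w @ sigma @ w)


n = cov.shape[0]
w_equal = np.full(n, 1 / n)
d_equal = diversification_ratio(w_equal, cov)

# Maximize D by minimizing -D, with long-only weights summing to one.
res = minimize(
    lambda w: -diversification_ratio(w, cov), w_equal,
    bounds=[(0, 1)] * n,
    constraints=({"type": "eq", "fun": lambda w: np.sum(w) - 1},),
)
w_md = res.x
d_max = diversification_ratio(w_md, cov)

print(f"D (equal weights) = {d_equal:.3f}")
print(f"D (maximized)     = {d_max:.3f}, weights = {np.round(w_md, 3)}")
```

A portfolio holding a single asset has $D = 1$, so values above 1 measure how much risk is removed by combining imperfectly correlated assets.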

# Machine Learning

In this lesson we will see how machine learning techniques can be successfully applied to solve financial problems. We will first take a quick tour of the theory behind neural networks and then we will see a few examples of practical applications to regression and classification problems. 

**Disclaimer**: this lecture just scratches the surface of machine learning, a topic which has seen huge development in recent years, leading to thousands of applications in many different fields.

## Neural Network Definition
Artificial Neural Networks (ANN or simply NN) are information processing models inspired by the working principles of the human brain. Their most essential property is the ability to learn from sample sets. 

The basic units of the ANN architecture are neurons, which are connected to other neurons. 

![Model of an artificial neuron.](neuron.jpeg)

A neuron takes real-valued inputs ($x_i$), each with an associated weight ($w_i$). All inputs injected into a neuron are individually weighted and added together (sometimes a bias $w_0$ is also added), and the result is passed into the activation function, which produces the neuron output

$$ \textrm{Inputs} \rightarrow \Sigma = \sum_{i=1}^{N} x_i w_i +w_0 \rightarrow f(\Sigma) \rightarrow \textrm{Output}$$  

There are many different types of activation function but one of the simplest is the *step function* which returns just 0 or 1 according to the input value (another is the *sigmoid* which can be thought of as the continuous version of the step function). 

![Sigmoid function.](sigmoid.png)

Other commonly used activation functions are Rectified Linear Unit (ReLU) and hyperbolic tangent (tanh).
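The activation functions mentioned so far, and the computation performed by a single neuron, can be sketched in a few lines. The input values, weights and bias are hypothetical numbers, and placing the threshold of the step function at zero is an assumption.

```python
import numpy as np


def step(z):
    # 1 when the weighted sum is non-negative, 0 otherwise
    # (threshold at zero is an assumption).
    return np.where(z >= 0, 1.0, 0.0)


def sigmoid(z):
    # Smooth version of the step function, with values in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, z)


def neuron_output(x, w, w0, activation):
    # Weighted sum of the inputs plus bias, passed through the activation.
    return activation(np.dot(x, w) + w0)


x = np.array([0.5, -1.0, 2.0])    # hypothetical inputs
w = np.array([0.4, 0.3, -0.2])    # hypothetical weights
w0 = 0.1                          # hypothetical bias

for name, f in [("step", step), ("sigmoid", sigmoid),
                ("relu", relu), ("tanh", np.tanh)]:
    print(f"{name:8s} -> {float(neuron_output(x, w, w0, f)):.4f}")
```

The same weighted sum gives quite different outputs depending on the activation, which is why the choice of activation function matters.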

For a deeper discussion of activation functions see [this article](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0).

## Training of a neuron

When teaching children how to recognize a bus, we just tell them, showing an example: “This is a bus. That is not a bus.” until they learn the concept of what a bus is. 
Furthermore, if the child sees new objects that she hasn’t seen before, we could expect her to recognize correctly whether the new object is a bus or not.

This is exactly the idea behind neurons.
Similarly, inputs from a *training* set are presented to the neuron one after the other together with the correct output and the neuron weights are modified accordingly.

When an entire pass through all of the input training vectors is completed (an *epoch*) the neuron has learnt. In practice we can present the same set to the neuron many times to make it learn better.

At this time, if an input vector $\mathbf{x}$ (already in the training set) is given to the neuron, it will output the correct value. If $\mathbf{x}$ is not in the training set, the network will respond with an output similar to other training vectors close to $\mathbf{x}$.

This kind of training is called *supervised* because we have a set of training data with known targets and, during training, we want our model to learn to predict the target from the other variables. 
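A toy version of this supervised training is sketched below: a single sigmoid neuron learns the logical OR of two inputs. The data set, the learning rate, the number of epochs and the update rule (the whole training set is used for each update, moving the weights against the average error) are assumptions chosen to keep the example short; they are not the only possible choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Training set: the four input pairs and the known target (logical OR).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)        # weights
w0 = 0.0               # bias
learning_rate = 0.5    # arbitrary choice
losses = []

for epoch in range(5000):
    output = sigmoid(X @ w + w0)          # neuron output for every sample
    error = output - y                    # difference from the targets
    losses.append(np.mean(error ** 2))    # track the mean squared error
    # Move weights and bias in the direction that reduces the error.
    w -= learning_rate * (X.T @ error) / len(y)
    w0 -= learning_rate * error.mean()

predictions = (sigmoid(X @ w + w0) > 0.5).astype(int)
print("predictions:", predictions, "targets:", y.astype(int))
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Each pass of the loop is one epoch: the neuron sees the whole training set, compares its outputs with the known targets and adjusts its weights accordingly.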

Unfortunately a single neuron is not very useful, since the interesting problems we would like
to face cannot be solved with such a simple architecture. The next step is then to put together more neurons in *layers*.

### Multilayered Neural Networks

![A multilayered neural network.](multilayer.jpeg)

In a multilayered NN each neuron of the *input layer* is connected to each neuron in the next hidden layer, and from there to each neuron in the output layer. We should note that there can be any number of neurons per layer and there are usually multiple hidden layers to pass through before ultimately reaching the output layer.

### Training a Multilayered Neural Network

The training of a multilayered NN follows these steps:

* present a training sample to the neural network (initialized with random weights $w_i$);
* compute the network output obtained by calculating activations of each layer;
* calculate the error (loss) as the difference between the NN predicted output and the target output;
* having calculated the error, re-adjust the weights of the network such that the difference with the target decreases;
* continue the process for all samples several times (epochs).

<img src="training_nn.png">

The NN error is computed by the *loss function*. Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. Two common choices are:

* Mean Absolute Error (MAE): the average of the absolute value of the differences between the predictions and true values;
* Root Mean Squared Error (RMSE): the square root of the average of the squared differences between the predictions and true values.

The mean absolute error is easily interpretable, as it represents how far off we are on average from the correct value. The root mean squared error penalizes larger errors more heavily and is commonly used in regression tasks. Either metric may be appropriate depending on the situation and you can use both for comparison. [Here](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) is a discussion of the merits of these metrics.
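A small numerical example may help to compare the two metrics. The predictions and true values below are made-up numbers.

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))          # Mean Absolute Error
rmse = np.sqrt(np.mean(errors ** 2))   # Root Mean Squared Error

print(f"MAE  = {mae:.4f}")
print(f"RMSE = {rmse:.4f}")
```

The RMSE is larger than the MAE here because the single error of size 1 weighs more once it is squared.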

The error or loss is a function of the internal parameters of the model (i.e. the weights and biases). For accurate predictions, one needs to minimize the calculated error.
In a neural network, this is done using *back propagation* (see [this article](https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd)). The current error is propagated backwards to the previous layers, where it is used to modify the weights and biases in such a way that the error is minimized.

<img src="loss_function.png">

The weights are modified using an *optimization function*, or optimizer (we will use *Adam* in the following, but there are others).
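
To give an idea of what such an optimizer does, here is a minimal sketch of the standard Adam update rule applied to a toy one-parameter loss $(w - 3)^2$. The helper name `adam_step`, the toy loss and the learning rate of 0.1 are choices made for this illustration; they are not the settings used by $\tt{keras}$ or by the examples later in this lesson.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameter w given its gradient at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad**2    # running mean of the squared gradients
    m_hat = m / (1 - beta1**t)               # bias corrections for the zero start
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy loss (w - 3)^2, whose gradient is 2 (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)

print(f"w after 500 steps: {w:.4f}")
```

The step size is rescaled by the running average of the squared gradients, so each parameter effectively gets its own learning rate.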

A common mistake to avoid is to *overtrain* a NN. Overtraining happens when the NN learns the training sample too well but its performance degrades substantially on an independent testing sample.

So it is usually necessary to split the available sample in two parts, training and testing (e.g. 80% and 20%), and to use the former to perform the training and the latter to cross-check the performance. **Performance is usually measured using the loss function value at the end of the training.**
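
A random 80/20 split can be sketched with $\tt{numpy}$ alone, as below. The sample and the variable names are illustrative assumptions; $\tt{scikit-learn}$ also offers `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import numpy as np

rng = np.random.default_rng(42)

# illustrative sample of (x, f(x)) pairs
x = np.linspace(-2.0, 2.0, 100)
y = x**3 + 2

# shuffle the indices, then cut at 80% of the sample
indices = rng.permutation(len(x))
n_train = int(0.8 * len(x))
train_idx, test_idx = indices[:n_train], indices[n_train:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(len(x_train), len(x_test))
```

Shuffling before cutting matters: splitting an ordered sample at 80% would leave the last part of the input range entirely out of the training.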

### Neural Network Design

There is no rule to guide the developer in the design of a neural network in terms of number of layers and neurons per layer. The most common strategy is trial and error, where you finally pick the solution giving the best accuracy. In general a larger number of nodes is better at capturing highly structured data with a lot of features, although it may require a larger training sample to work correctly.

Anyway, as a rule of thumb, a NN with just one hidden layer and a number of neurons equal to the average of the number of inputs and outputs is sufficient in most cases. In the following we will use more complex networks just for illustration; no serious attempt to optimize the layout has been made.

## Regression and Classification
The two main categories of problems that can be solved with neural networks are *classification* and *regression*. Let's see their characteristics and differences.

### Classification 
Classification is a process of finding a function which helps in dividing the dataset into classes based on different parameters. 

The task of the classification algorithm is to find the mapping function to map the input ($x$) to the **discrete** output ($y$).
We try to find the decision boundary, which can divide the dataset into different classes.

Example: the best example to understand the classification problem is email spam detection. The model is trained on the basis of millions of emails on different parameters, and whenever it receives a new email, it identifies whether the email is spam or not.
Classification algorithms can also be used in speech recognition, car plate identification, etc.

### Regression
Regression is a process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables such as market trends, house prices, etc.

The task of the regression algorithm is to find the mapping function to map the input variable ($x$) to the **continuous** output variable ($y$).
We try to find the best fit line, which can predict the output more accurately.

Example: suppose we want to do weather forecasting; for this we will use a regression algorithm. In weather prediction, the model is trained on past data, and once the training is completed, it can predict the weather for future days.
In general, whenever we are dealing with function approximation this kind of algorithm can be applied.

### Technical Note
Neural network training and testing are performed using two modules: $\tt{keras}$ (which in turn is based on a Google open-source library called $\tt{tensorflow}$) and $\tt{scikit-learn}$, which provides many useful utilities for the training.

In order to hide as much as possible the many little details that have to be set when developing a NN, I have developed a simple class ($\tt{FinNN}$), which still relies on $\tt{keras}$ but should make the whole process easier.

## Function approximation 

As a first practical example let's try to design an ANN which is capable of learning the functional form underlying a set of data.

Let's generate a sample with $x$ (input), $f(x)$ (target output) pairs where $f(x) = x^3 +2$ and let's start to code the NN structure. 

We start by importing the necessary modules.

In [None]:
from finnn import FinNN
import numpy as np

Then we generate the training sample (i.e. the $x$, $f(x)$ pairs) and apply a simple transformation on the sample in order to have all the inputs and outputs in the $[0, 1]$ range. This is usually done to provide the NN with *normalized* data; in fact the NN can be fooled by very large or very small numbers, giving unstable results.
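
The generation and normalization just described can be sketched with plain $\tt{numpy}$ as follows. The interval $[-2, 2]$, the sample size and the helper `min_max_scale` are assumptions made only for this illustration; they are not necessarily what the $\tt{FinNN}$ class does internally.

```python
import numpy as np

# sample x on an interval chosen only for illustration
x = np.linspace(-2.0, 2.0, 1000)
y = x**3 + 2

def min_max_scale(values):
    """Linearly map the values onto the [0, 1] range."""
    return (values - values.min()) / (values.max() - values.min())

x_scaled = min_max_scale(x)
y_scaled = min_max_scale(y)

print(x_scaled.min(), x_scaled.max())
print(y_scaled.min(), y_scaled.max())
```

To read a prediction in the original units the transformation has to be inverted, so the minimum and maximum of the training sample must be kept. The `MinMaxScaler` class in `sklearn.preprocessing` takes care of this bookkeeping.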

In [None]:
# define the dataset 

Next we can define the structure of the neural network. There is no predefined rule to decide the number of layers and nodes: you need to go by trial and error. Here the problem is quite simple so there is no need to use a complicated NN.

In the end I have decided to use two layers with 15 and 5 neurons and a *tanh* activation function. The $\tt{inputs}$ parameter has to be set to 1 since we have just one single input, the $x$ value. 
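
To make the layout concrete, the sketch below builds the weight matrices of a network with one input, two hidden layers of 15 and 5 *tanh* neurons and one output, and pushes a few inputs through it. It uses plain $\tt{numpy}$ with random, untrained weights and assumes a linear output layer; it only illustrates the shapes involved and is not the $\tt{FinNN}$ or $\tt{keras}$ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_sizes = [1, 15, 5, 1]  # inputs, two hidden layers, one output

# one weight matrix and one bias vector per pair of consecutive layers
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Propagate a batch of inputs of shape (n_samples, 1) through the network."""
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = np.tanh(activation @ W + b)   # hidden layers use tanh
    return activation @ weights[-1] + biases[-1]   # linear output layer (assumption)

x = np.linspace(0.0, 1.0, 4).reshape(-1, 1)
n_parameters = sum(W.size + b.size for W, b in zip(weights, biases))
print(forward(x).shape, n_parameters)
```

Even this small layout has 116 trainable parameters (30 + 80 + 6), which gives a feeling for how quickly the number of weights grows with the number of neurons.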

In [None]:
# design the neural network model

# define the loss function (mean squared error) 
# and optimization algorithm (Adam)

# fit the model on the training dataset 2000

After the training is completed we can evaluate how good it is. To do this we can compute the residuals, or the square root of the sum of the squared differences between the true values and the ones predicted by the NN. We will also plot the true function and the predicted one in order to have a graphical representation of the goodness of our training.
To have a numerical estimate of the agreement we also compute the *mean squared error*, defined as:

$$\textrm{MSE} = \cfrac{\sum_{i=1}^n{\big(y_{i}^{pred} - y_i^{truth}\big)^2}}{n}$$

where $y_i^{pred}$ is the NN prediction and $y_i^{truth}$ is the target value for sample $i$.

A *perfect* prediction would lead to $\textrm{MSE}=0$ so the lower this number the better the agreement. 

In [None]:
from sklearn.metrics import mean_squared_error

trainer.fullPrediction()
# report model error computing the mean squared error
print('MSE: %.7f' % mean_squared_error(trainer.y, trainer.predictions))

To get an idea of what is going on, the picture below shows the actual function we want to approximate and the predictions of our NN obtained with four different numbers of epochs (5, 100, 800, 5000).

<img src="training_vs_epoch.png">

It is clear how the agreement improves with a higher number of epochs, which means that the NN has more opportunities to adapt the weights and reduce the loss (or error, or distance) to the target values. Even in the case of 5000 epochs, zooming in you could see discrepancies not visible at the scale of the plot. Remember that increasing the number of epochs too much may lead to overfitting. So in this case, if we need more accuracy, we need either to increase the training sample or to change the NN design.

To check if this is the case we can *evaluate* our NN with both the training and the testing samples. If the losses are comparable the NN is fine; if instead the training loss is much smaller than the testing loss, the NN has overfitted.
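
The same train-versus-test comparison can be demonstrated without a neural network. The sketch below uses polynomial fits as a stand-in model (an assumption made purely to keep the example short and fast): a degree-9 polynomial fitted to 10 noisy points passes almost exactly through them, yet does much worse on fresh points. The noise level and sample sizes are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sample(n):
    """Draw n points of x^3 on [0, 1] with a small Gaussian noise."""
    x = rng.uniform(0.0, 1.0, n)
    return x, x**3 + rng.normal(scale=0.05, size=n)

x_train, y_train = noisy_sample(10)
x_test, y_test = noisy_sample(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.2e}, "
          f"test MSE {results[degree][1]:.2e}")
```

A training loss far below the testing loss, as for the degree-9 fit, is the signature of overfitting described above.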

In [None]:
# evaluate for overfitting

Since the two numbers are in good agreement we can be confident that our NN didn't overfit.


### Black-Scholes Call Options

The first financial application of a NN concerns the pricing of European call options: essentially we will create a neural network capable of approximating the famous Black-Scholes pricing formula

$$ P_\textrm{call} = F_\textrm{BS}(K, r, \sigma, \mathrm{ttm})$$

As before we generate the training sample, this time made of a grid of volatility-rate pairs $(\sigma, r)$ (for simplicity we set moneyness and time to maturity to 1). The target values are the prices of a call computed using the pricing function in the $\tt{finmarkets.py}$ library with the corresponding inputs.
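
For reference, the textbook Black-Scholes price of a European call without dividends can be written in a few lines of standard-library Python. The function `bs_call` below is a hypothetical stand-in written for this illustration; it is not the `call` function of $\tt{finmarkets.py}$, whose internals and exact argument conventions are not shown here.

```python
from math import erf, exp, log, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, r, sigma, ttm):
    """Textbook Black-Scholes price of a European call without dividends."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * ttm) / (sigma * sqrt(ttm))
    d2 = d1 - sigma * sqrt(ttm)
    return S * normal_cdf(d1) - K * exp(-r * ttm) * normal_cdf(d2)

# approximate corners of the (rate, volatility) grid, spot = strike = 1, ttm = 1
low = bs_call(1.0, 1.0, 0.01, 0.1, 1.0)
high = bs_call(1.0, 1.0, 0.11, 0.6, 1.0)
print(f"{low:.4f} {high:.4f}")
```

With spot and strike both equal to 1 the result is a price per unit of strike, which gives a rough idea of the range of target values the NN will have to learn.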

In [None]:
from finmarkets import call

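data = []
# grid of interest rates and volatilities used as NN inputs
rates = np.arange(0.01, 0.11, 0.001)
sigmas = np.arange(0.1, 0.6, 0.005)

for r in rates:
    for sigma in sigmas:
        # moneyness and time to maturity are both fixed to 1
        call_price = call(1, r, sigma, 1)
        data.append([r, sigma, call_price])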

Since it takes some time to generate data samples, it is always advisable to save them to a file, since we may need to load them many times during the NN development.
This can be done with $\tt{pandas}$.

In [None]:
import pandas as pd

df = pd.DataFrame()

data = np.array(data)
df['rate'] = data[:, 0]
df['vol'] = data[:, 1]
df['price'] = data[:, 2]

df.to_csv("bs_training_sample.csv")

In [None]:
print(df.describe())

Following the previous example we will use the $\tt{FinNN}$ utility class to develop the NN, and we will also *normalize* the data to get better results.
**Beware that this time we have TWO input parameters (rate and volatility)** and not just one.

In [None]:
# reload data 

In [None]:
# define NN

# define the NN architecture 20, 8, relu

# fit 3000

In [None]:
# evaluate 

# when the training takes some time it is useful
# to save the model weights in a file to use it later on

As you can see the training and test samples give roughly the same MSE value so we are reasonably sure that there hasn't been *overfitting*.

Again we can evaluate graphically how good it is.
<img src="vol_rate.png">

In general, to judge whether the accuracy we have reached is sufficient you have to:

* take $\sqrt{\mathrm{MSE}}$, if you are using the MSE metric, to get the *real* error;
* apply this error to a typical output and check if the accuracy is enough.

In this example we know that our prices go from 0.04 to 0.28 and the final accuracy is 0.0007. But since we are working with the moneyness we need to multiply those values by a typical strike, say 100. So in the worst case we are able to price our call as $4 \pm 0.07$, which is not bad for our study but may not be ideal for deciding whether to invest in this call or not.
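
The arithmetic behind this check can be spelled out explicitly. The sketch below takes the figures quoted in the text (an error of 0.0007, already square-rooted, prices between 0.04 and 0.28, and a strike of 100) and converts them to absolute and relative errors; the variable names are chosen for the illustration.

```python
rmse = 0.0007                # error in moneyness units, as quoted in the text
strike = 100.0               # typical strike used to convert to currency units
price_range = (0.04, 0.28)   # lowest and highest price in the sample

abs_error = rmse * strike
relative_errors = []
for p in price_range:
    price = p * strike
    relative_errors.append(abs_error / price)
    print(f"price {price:.0f} +/- {abs_error:.2f} ({abs_error / price:.2%})")
```

The absolute error is the same everywhere, so the relative error is largest for the cheapest calls (about 1.75% at the low end of the range).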

We can also compare the prediction in a practical case; let's say we want to know the price of a call (with moneyness 1 and time to maturity 1 year) when the interest rate is 0.015 and the volatility 0.234:

In [None]:
# check a value
import numpy as np
from finmarkets import call

It is very important to remember that a **NN cannot extrapolate**. Indeed if you try to predict the price of a call from a rate and a volatility outside the training *phase space* (with values that aren't in the intervals used in the training), say $r = 0.22$ and $\sigma = 0.01$...

In [None]:
# check another value
                 
# here we compare the prediction with the BS call price
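
Since a prediction outside the training intervals cannot be trusted, a simple guard can be placed in front of the NN. The sketch below is only a hypothetical helper (the name `inside_training_range` and the dictionary layout are assumptions); the intervals are those used to build the training grid.

```python
# intervals spanned by the training grid of rates and volatilities
training_ranges = {"rate": (0.01, 0.11), "vol": (0.1, 0.6)}

def inside_training_range(**inputs):
    """Return True only if every input lies within its training interval."""
    return all(training_ranges[name][0] <= value <= training_ranges[name][1]
               for name, value in inputs.items())

print(inside_training_range(rate=0.015, vol=0.234))  # point used in the check above
print(inside_training_range(rate=0.22, vol=0.01))    # point outside the grid
```

The first point lies inside the training region and the second does not, so only the first prediction should be compared with the Black-Scholes price with any confidence.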