$$
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\price}{{p}}
\newcommand{\ret}{{r}}
\newcommand{\tp}{{(t)}}
\newcommand{\aapl}{{\text{AAPL}}}
\newcommand{\ba}{{\text{BA}}}
\newcommand{\spy}{{\text{SPY}}}
$$

# Assignment: Using Machine Learning for Hedging

Welcome to the first assignment !

# Problem description

We will solve a Regression task that is very common in Finance
- Given the return of "the market", predict the return of a particular stock

That is
- Given the return of a proxy for "the market" at time $t$, predict the return of, e.g., Apple at time $t$.

As we will explain, being able to predict the relationship between two financial instruments opens up possibilities
- Use one instrument to "hedge" or reduce the risk of holding the other
- Create strategies whose returns are independent of "the market"
    - Hopefully make a profit regardless of whether the market goes up or down

## Goal

You will create models of increasing complexity in order to explain the return of Apple (ticker $\aapl$)
- The first model will have a single feature: return of the market proxy, ticker $\spy$
- Subsequent models will add the return of other tickers as additional features

## Objectives
- Learn how to solve a Regression task
- Become facile in the `sklearn` toolkit for Machine Learning

## How to report your answers
I will mix explanation of the topic with tasks that you must complete. Look for 
the string "**Question**" to find a task that you must perform.
Most of the tasks will require you to assign values to variables and execute a `print` statement.

**Motivation**

If you **do not change** the print statement then the GA (or a machine) can automatically find your answer to each part by searching for the string.


# Standard imports

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline

# API for students

We will define some utility routines.

This will simplify problem solving

`helper = helper.HELPER()`
- `getData`: get the data for a list of tickers, used as follows    
    > `tickers` is a list of tickers      
    > `index` is the ticker of index     
    > `DATA_DIR` is the data directory       
    > `attrs` is a list of data attributes to retain       
    > `data = helper.getData(tickers, index, DATA_DIR, attrs)`

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

# Import the helper module
import helper
%aimport helper

helper = helper.HELPER()

# Get The data

The data are the daily prices of a number of individual equities and equity indices.
The prices are arranged in ascending date order (a timeseries).
- The directory `/resource/asnlib/publicdata/data` contains many `.csv` files, each holding the prices of an equity or an index

The first step in our Recipe is Get the Data.

We have provided a utility method `getData` to simplify this for you
      
**Question:**
- Get the adjusted close price of AAPL and SPY 

**Hint:**
- look up the Pandas `read_csv()` method
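
As a rough illustration of the hint, the sketch below reads two small CSV sources with `read_csv()` and joins them on the date. It is not the implementation of `helper.getData`: the function `load_prices`, the in-memory CSV text, and the column layout (`Dt`, `Adj Close`, `Volume`) are assumptions made for the example, and the real files may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical file contents (made-up numbers); the real .csv files may differ
aapl_csv = io.StringIO("Dt,Adj Close,Volume\n2017-01-03,110.95,100\n2017-01-04,110.83,120\n")
spy_csv = io.StringIO("Dt,Adj Close,Volume\n2017-01-03,213.84,300\n2017-01-04,215.11,310\n")

def load_prices(source, ticker, attr="Adj Close"):
    """Hypothetical helper: read one attribute, indexed by date, and prefix it with the ticker."""
    df = pd.read_csv(source, index_col="Dt", usecols=["Dt", attr])
    return df.rename(columns={attr: f"{ticker}_{attr.replace(' ', '_')}"})

# One column per ticker, aligned on the date index
data_sketch = load_prices(aapl_csv, "AAPL").join(load_prices(spy_csv, "SPY"))
print(data_sketch)
```

In the assignment itself, `helper.getData` does this work for you; the sketch only shows the kind of steps such a routine might take.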

In [None]:
DATA_DIR = './data'
if not os.path.isdir(DATA_DIR):
    DATA_DIR  = "../resource/asnlib/publicdata/data"

ticker = "AAPL"
index_ticker = "SPY"
dateAttr = "Dt"
priceAttr = "Adj Close"

### BEGIN SOLUTION
data = helper.getData([ticker], index_ticker, DATA_DIR, [priceAttr])
### END SOLUTION

# Have a look at the data

We will not go through all steps in the Recipe, nor in depth.

But here's a peek

In [None]:
data.head()

Expected outputs should be similar to this:   
data:    
<table> 
    <tr> 
        <td>Dt</td><td>AAPL_Adj_Close</td><td>SPY_Adj_Close</td>
    </tr>
    <tr> 
        <td>2017-01-03</td><td>110.9539</td><td>213.8428</td> 
    </tr>
    <tr> 
        <td>2017-01-04</td><td>110.8297</td><td>215.1149</td>
    </tr>
    <tr> 
        <td>2017-01-05</td><td>111.3933</td><td>214.9440</td>
    </tr>
    <tr> 
        <td>2017-01-06</td><td>112.6351</td><td>215.7131</td> 
    </tr>
    <tr> 
        <td>2017-01-09</td><td>113.6668</td> <td>215.0010</td>
    </tr>
</table>   

In [None]:
# Print the Start time and End time
print("Start time: ", data.index.min())
print("End time: ", data.index.max())

As you can see, our data has two attributes for each date
- Price (adjusted close) of ticker $\aapl$
- Price (adjusted close) of the market proxy $\spy$

# Prepare the data

In Finance, it is very typical to work with *relative changes* (e.g., percent price change)
rather than *absolute changes* (price change) or *levels* (prices).

Without going into too much detail
- Relative changes are more consistent over time than either absolute changes or levels
- The consistency can facilitate the use of data over a longer time period

For example, let's suppose that prices are given in units of USD (dollar)
- A price change of 1 USD is more likely for a stock with price level 100 than price level 10
    - A relative change of $1/100 = 1\%$ is more likely than a change of $1/10 = 10\%$
    - So relative changes are less dependent on price level than either price changes or price levels
    
    
To compute the *return* (percent change in prices)
 for ticker $\aapl$ (Apple) on date $t$

$$
\begin{array}{l}
\ret_\aapl^\tp = \frac{\price _\aapl^\tp}{\price _\aapl^{(t-1)}} -1 \\
\text{where} \\
\price_\aapl^\tp \text{ denotes the price of ticker } \aapl \text{ on date } t \\
\ret_\aapl^\tp \text{ denotes the return of ticker } \aapl \text{ on date } t
\end{array}
$$
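
The return formula can be checked on a few made-up prices. The sketch below computes $p^{(t)} / p^{(t-1)} - 1$ by hand and compares it with the Pandas `pct_change()` method; the numbers are illustrative only and are not taken from the assignment data.

```python
import pandas as pd

# Toy prices (made-up numbers, not the assignment data)
prices = pd.DataFrame(
    {"AAPL_Adj_Close": [100.0, 102.0, 99.96], "SPY_Adj_Close": [200.0, 201.0, 203.01]},
    index=["2018-01-02", "2018-01-03", "2018-01-04"],
)

# Return on date t: p(t) / p(t-1) - 1, using shift(1) to line up the prior day's price
manual = prices / prices.shift(1) - 1

# The built-in equivalent
builtin = prices.pct_change()

print(manual)
```

The first row is `NaN` in both versions, because the first date has no prior price from which to compute a return.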


# Transformations: transform the training data

Our first task is to transform the data from price levels (Adj Close)
to Percent Price Changes.

Moreover, the date range for the training data is specified to be in the range
from `start_dt` (start date) to `end_dt`, inclusive on both sides.

**Note**

We will need to apply **identical** transformations to both the training and test data examples.

In the cells that immediately follow, we will do this only for the **training data**

You will need to repeat these steps for the test data in a subsequent step.

You are well-advised to create subroutines or functions to accomplish these tasks !
- You will apply them first to transform training data
- You will apply them a second time to transform the test data

We will achieve this in several steps

## Create Dataframe of price levels for the training examples

- The Dataframe should have two columns: the price level for the ticker and for the index
- The minimum date in the Dataframe should be **the day before** `start_dt`
- The maximum date in the Dataframe should be `end_dt`

The reason we are adding one day prior to `start_dt`
- We want to have returns (percent price changes) from `start_dt` onwards
- In order to compute a return for `start_dt`, we need the level from the prior day
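
The effect of the extra day can be seen on a toy price series. This is a sketch under stated assumptions: the index holds date strings in ascending order, `start_dt` is present in the index, and it is not the first row. The prices are made up.

```python
import pandas as pd

# Toy prices with string dates in ascending order (made-up numbers)
prices = pd.DataFrame(
    {"AAPL_Adj_Close": [10.0, 11.0, 12.1, 12.1]},
    index=["2017-12-28", "2017-12-29", "2018-01-02", "2018-01-03"],
)
start_dt, end_dt = "2018-01-02", "2018-01-03"

# Position of start_dt in the index, then step back one row to include the prior trading day
start_pos = prices.index.get_loc(start_dt)
window = prices.iloc[start_pos - 1:].loc[:end_dt]

# Returns are defined from start_dt onwards; only the extra leading row is NaN
returns = window.pct_change().loc[start_dt:]
print(returns)
```

Without the extra row, the return for `start_dt` itself would be undefined and the training set would lose its first example.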

**Question:**
- Complete the function `getRange()` to select the data between "2017-12-29" and "2018-09-28" as training data
- Set variable train_data_price to be a pandas DataFrame with two columns AAPL_Adj_Close, SPY_Adj_Close and with dates as the index, having minimum date equal to THE DAY BEFORE start_dt and maximum date equal to end_dt

In [None]:
start_dt = "2018-01-02"
end_dt = "2018-09-28"
train_data_price = None

def getRange(df, start_dt, end_dt):
    '''
    Return the rows of df from the trading day before start_dt through end_dt (inclusive),
    with two columns AAPL_Adj_Close, SPY_Adj_Close and with dates as the index
    
    Parameters
    ----------
    df: DataFrame
    - Price levels, with dates (ascending) as the index
    
    start_dt: String
    - Start date, but you need to add one more day prior to this start_dt
    
    end_dt: String
    - End date
    '''
    ### BEGIN SOLUTION
    # Use the DataFrame that was passed in, not the global `data`
    dates = df.index.to_list()
    
    # Find the position of start_dt in the list of dates (so we can find the day before)
    start_pos = dates.index(start_dt)

    # Return the slice of the date range
    return df[ dates[ start_pos -1 ]:end_dt]
    ### END SOLUTION

train_data_price = getRange(data, start_dt, end_dt)
print(train_data_price.head())

In [None]:
### BEGIN HIDDEN TESTS
assert train_data_price.index.min() == '2017-12-29'
assert train_data_price.index.max() == '2018-09-28'
### END HIDDEN TESTS

## Create Dataframe of returns for training examples

Create a new Dataframe with percent price changes of the columns, rather than the levels

**Question:**
- Complete function `getReturns()` to compute the returns of the tickers
- Set variable train_data_ret to a DataFrame with two columns containing the day over day percent changes in AAPL_Adj_Close, SPY_Adj_Close with dates as the index
- Please continue to name these columns with their original names (AAPL_Adj_Close, SPY_Adj_Close). We will rename the columns to reflect the data in the next step

**Hint:**
- look up the Pandas `pct_change()` method    

In [None]:
train_data_ret = None

def getReturns(df, start_dt, end_dt):
    '''
    Return the day over day percent changes of adjusted price
    '''
    ### BEGIN SOLUTION
    df_range = getRange(df, start_dt, end_dt)
    return df_range.pct_change()
    ### END SOLUTION

train_data_ret = getReturns(train_data_price, start_dt, end_dt)
print(train_data_ret.head())

In [None]:
### BEGIN HIDDEN TESTS
def getReturns_(df, start_dt, end_dt):
    df_range = getRange(df, start_dt, end_dt)
    return df_range.pct_change()

train_data_ret_test = getReturns_(train_data_price, start_dt, end_dt)
assert np.allclose(train_data_ret.values[1:,:], train_data_ret_test.values[1:,:])
### END HIDDEN TESTS

In [None]:
## Rename the columns to indicate that they have been transformed from price (Adj_close) to Return
def renamePriceToRet(df):
    rename_map = { }
    rename_map = { orig:  orig.replace( priceAttr.replace(" ", "_"), "Ret") for orig in df.columns.to_list() }
    return df.rename(columns = rename_map)

train_data_ret = renamePriceToRet( train_data_ret )

## Drop the first date (the day before `start_dt`) since it has an undefined return
train_data_ret = train_data_ret[ start_dt:]
print(train_data_ret.head())

Expected outputs should be similar to this:   
data:    
<table> 
    <tr> 
        <td>Dt</td><td>AAPL_Ret</td><td>SPY_Ret</td>
    </tr>
    <tr> 
        <td>2018-01-02</td><td>0.017905</td><td>0.006325</td>
    </tr>
    <tr> 
        <td>2018-01-03</td><td>-0.000174</td><td>0.006325</td>
    </tr>
    <tr> 
        <td>2018-01-04</td><td>0.004645</td><td>0.004215</td> 
    </tr>
    <tr> 
        <td>2018-01-05</td><td>0.011385</td><td>0.006664</td>
    </tr>
    <tr> 
        <td>2018-01-08</td><td>-0.003714</td><td>0.001829</td>
    </tr>
</table>   

## Remove the target (AAPL_Ret)

The only feature is the return of the market proxy $\spy$.

Predicting the target given the target as a feature would be cheating !

In [None]:
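# Despite its name, targetAttr is the feature column (SPY_Ret); the remaining column (AAPL_Ret) is the target
targetAttr = index_ticker + "_Ret"
X_train, y_train =  train_data_ret[ [targetAttr] ], train_data_ret.drop(columns=[ targetAttr ] )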

# Transformations: transform the test data

The test data will be returns from `test_start_dt` to `test_end_dt` inclusive.

We will apply identical transformations as we did to the training data, but with a different date range.

Test data is from "2018-10-01" to "2018-12-31"

**Question:**
- Repeat the steps of transforming training data to transform test data
    - select the testing time
    - create returns
    - rename columns
    - drop the first date
    - remove the target

In [None]:
test_start_dt = '2018-10-01'
test_end_dt = '2018-12-31'
test_data_ret = None
X_test = None
y_test = None

### BEGIN SOLUTION
test_data_price = getRange(data, test_start_dt, test_end_dt, )
test_data_ret = getReturns(test_data_price, test_start_dt, test_end_dt)
test_data_ret = renamePriceToRet( test_data_ret )
test_data_ret = test_data_ret[ test_start_dt:]
targetAttr = index_ticker + "_Ret"
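# Note: here X_test holds the ticker (AAPL) returns and y_test the index (SPY) returns,
# the order that compute_hedged_series() expects; this is the reverse of X_train / y_train
X_test, y_test = test_data_ret.drop(columns=[ targetAttr ] ), test_data_ret[ [targetAttr] ]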
### END SOLUTION

print("test data length", test_data_ret.shape[0])
print("X test length", X_test.shape[0])
print("y test length", y_test.shape[0])
print(test_data_ret.head())

Expected outputs should be similar to this:        
test data length 63      
X test length 63      
y test length 63       
data:    
<table> 
    <tr> 
        <td>Dt</td><td>AAPL_Ret</td><td>SPY_Ret</td>
    </tr>
    <tr> 
        <td>2018-10-01</td><td>0.006733</td><td>0.003474</td>
    </tr>
    <tr> 
        <td>2018-10-02</td><td>0.008888</td><td>-0.000583</td>
    </tr>
    <tr> 
        <td>2018-10-03</td><td>0.012169</td><td>0.000548</td> 
    </tr>
    <tr> 
        <td>2018-10-04</td><td>-0.017581</td><td>-0.007815</td>
    </tr>
    <tr> 
        <td>2018-10-05</td><td>-0.016229</td><td>-0.005597</td>
    </tr>
</table>   

In [None]:
### BEGIN HIDDEN TESTS
test_data_price_ = getRange(data, test_start_dt, test_end_dt, )
test_data_ret_ = getReturns_(test_data_price_, test_start_dt, test_end_dt)
test_data_ret_ = renamePriceToRet( test_data_ret_ )
test_data_ret_ = test_data_ret_[ test_start_dt:]
targetAttr = index_ticker + "_Ret"
X_test_, y_test_ = test_data_ret_.drop(columns=[ targetAttr ] ), test_data_ret_[ [targetAttr] ]
assert np.allclose(X_test_, X_test)
assert np.allclose(y_test_, y_test)
### END HIDDEN TESTS

# Train a model (Regression)

Use Linear Regression to predict the return of a ticker from the return of the market proxy $\spy$.
For example, for ticker $\aapl$

$$
\ret_\aapl^\tp = \beta_0 + \beta_{\aapl, \spy} * \ret_\spy^\tp + \epsilon_{\aapl}^\tp
$$

Each example corresponds to one day (time $t$)
- has features
    - constant 1, corresponding to the intercept parameter
    - return of the market proxy $\spy$
       $$\x^\tp = \begin{pmatrix}
        1 \\
        \ret_\spy^\tp
        \end{pmatrix}$$

- has target
    - return of the ticker
    $$\y^\tp = \ret_\aapl^\tp$$

 
You will use Linear Regression to solve for parameters $\beta_0$,  $\beta_{\aapl, \spy}$ 

- In the lectures we used the symbol $\Theta$ to denote the parameter vector; here we use $\boldsymbol{\beta}$
- In Finance the symbol $\beta$ is often used to denote the relationship between returns.
- Rather than explicitly creating a constant 1 feature
    - you may create the model object with the option to fit an intercept (in `sklearn`, `LinearRegression(fit_intercept=True)`, which is the default)
    - if you do so, the feature vector you pass will be
   $$\x^\tp = \begin{pmatrix}
        \ret_\spy^\tp
        \end{pmatrix}$$  
    


- Use the entire training set
- Do not use cross-validation
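
To see how the intercept option plays out, here is a minimal sketch on synthetic data. The "true" parameters (`true_beta_0`, `true_beta`) and the simulated returns are assumptions invented for the illustration; they have nothing to do with the actual $\aapl$ and $\spy$ data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic returns generated from known parameters (illustrative values only)
rng = np.random.default_rng(0)
true_beta_0, true_beta = 0.001, 1.2
r_spy = rng.normal(0.0, 0.01, size=200)
r_stock = true_beta_0 + true_beta * r_spy + rng.normal(0.0, 0.002, size=200)

# One feature per example (no constant column); the model fits the intercept itself
X = r_spy.reshape(-1, 1)
y = r_stock.reshape(-1, 1)
model = LinearRegression(fit_intercept=True).fit(X, y)

# With a 2-D target, intercept_ has shape (1,) and coef_ has shape (1, 1)
beta_0_hat = model.intercept_[0]
beta_hat = model.coef_[0][0]
print("beta_0 = {:.4f}, beta = {:.3f}".format(beta_0_hat, beta_hat))
```

The estimates should land close to the parameters used to generate the data, which is a useful sanity check before fitting real returns.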

**Questions:**
- Complete the function `createModel()` to build your linear regression model
- Complete the function `regress()` to do regression and return intercept and coefficients
- Replace the 0 values in the following cell with your answers, and execute the print statements


In [None]:
from sklearn import datasets, linear_model

beta_0 = 0    # The regression parameter for the constant
beta_SPY = 0  # The regression parameter for the return of SPY
ticker = "AAPL"

def createModel():
    '''
    Build your linear regression model using sklearn
    '''
    ### BEGIN SOLUTION
    model = linear_model.LinearRegression()
    return model
    ### END SOLUTION

def regress(model, X, y):
    '''
    Fit the regression of the ticker returns on the index returns
    
    Arguments:
    model: model you build with method "createModel()"
    X: index returns (the feature)
    y: ticker returns (the target)
    '''
    ### BEGIN SOLUTION
    _= model.fit( X.values, y.values )
    
    return model.intercept_[0], model.coef_[0][0]
    ### END SOLUTION

# Assign to answer variables
### BEGIN SOLUTION
regr = createModel()

beta_0, beta_SPY = regress(regr, X_train, y_train)

### END SOLUTION

print("{t:s}: beta_0={b0:3.3f}, beta_SPY={b1:3.3f}".format(t=ticker, b0=beta_0, b1=beta_SPY))

Your expected outputs should be:
<table> 
    <tr> 
        <td>  
            beta_0
        </td>
        <td>
         0.001
        </td>
    </tr>
    <tr> 
        <td>
            beta_SPY
        </td>
        <td>
         1.071
        </td>
    </tr>

</table>

In [None]:
### BEGIN HIDDEN TESTS
def createModel_():
    model = linear_model.LinearRegression()
    return model

def regress_(model, X, y):
    _= model.fit( X.values, y.values )
    
    return model.intercept_[0], model.coef_[0][0]

model_test = createModel_()
aapl_beta_0, aapl_beta_1 = regress_(model_test, X_train, y_train)
assert np.allclose( [beta_0, beta_SPY], [aapl_beta_0, aapl_beta_1] )
### END HIDDEN TESTS

## Train the model using Cross validation

Use 5-fold cross validation

**Question:**
- Complete the function `compute_cross_val_avg()` to compute the average score of 5-fold cross validation
- Replace the 0 values in the following cell with your answers, and execute the print statements

**Hint:**  
- You can use the `cross_val_score` in `sklearn.model_selection`
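
A minimal sketch of 5-fold cross validation on synthetic data follows. The data are made up for the illustration. For a regressor such as `LinearRegression`, `cross_val_score` uses the estimator's own `score` method by default, which is the $R^2$ coefficient of determination, so the averaged value is an $R^2$ and not an error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic feature and target (illustrative values only)
rng = np.random.default_rng(1)
X = rng.normal(0.0, 0.01, size=(250, 1))
y = 1.1 * X[:, 0] + rng.normal(0.0, 0.005, size=250)

# cv=5 splits the examples into 5 folds; each fold serves once as the validation set
scores = cross_val_score(LinearRegression(), X, y, cv=5)
avg = scores.mean()
print(scores, avg)
```

Note that `cross_val_score` fits clones of the model, so the model object you pass in is not left fitted by this call.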

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_avg = 0 # average score of cross validation
k = 5             # 5-fold cross validation

def compute_cross_val_avg(model, X, y, k):
    '''
    Compute the average score of k-fold cross validation
    
    Arguments:
    model: model you build with method "createModel()"
    X: index returns (the feature)
    y: ticker returns (the target)
    k: number of folds for cross validation
    '''
    ### BEGIN SOLUTION
    cross_val_score_ = cross_val_score(model, X, y, cv=k)
    return np.mean(cross_val_score_)
    ### END SOLUTION

    
cross_val_avg = compute_cross_val_avg(regr, X_train, y_train, k)
print("{t:s}: Avg cross val score = {sc:3.2f}".format(t=ticker, sc=cross_val_avg) )

In [None]:
### BEGIN HIDDEN TESTS
def compute_cross_val_avg_(model, X, y, k):
    cross_val_score_ = cross_val_score(model, X, y, cv=k)
    return np.mean(cross_val_score_)
cross_val_avg_ = compute_cross_val_avg_(regr, X_train, y_train, 5)
assert np.allclose(cross_val_avg, cross_val_avg_)
### END HIDDEN TESTS

## Evaluate Loss (in sample RMSE) and Performance (Out of sample RMSE)

**Question:**
- Complete the function `computeRMSE()` to calculate the Root Mean Squared Error (RMSE)
- Replace the 0 values in the following cell with your answers, and execute the print statements
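
RMSE is the square root of the average squared difference between target and prediction. The sketch below, using made-up numbers, computes it by hand and via `sklearn.metrics.mean_squared_error`, to show that the two agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up target and predicted returns (illustrative values only)
target = np.array([0.010, -0.020, 0.005, 0.000])
predicted = np.array([0.012, -0.015, 0.000, 0.001])

# By hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((target - predicted) ** 2))

# Via sklearn: mean_squared_error returns the MSE, so take its square root
rmse_sklearn = np.sqrt(mean_squared_error(target, predicted))

print("{:.6f} {:.6f}".format(rmse_manual, rmse_sklearn))
```

Because RMSE is expressed in the same units as the target (here, returns), it can be compared directly with the typical size of a daily return.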

In [None]:
from sklearn.metrics import mean_squared_error

rmse_in_sample = 0 # in sample loss
rmse_out_sample = 0 # out of sample performance

# Predicted  in-sample returns of AAPL using SPY index
aapl_predicted_in_sample = regr.predict(X_train)
# Predicted out-of-sample returns of AAPL using SPY index
aapl_predicted_out_sample = regr.predict(X_test)

def computeRMSE( target, predicted ):
    '''
    Calculate the RMSE
    
    Arguments:
    target: real ticker returns
    predicted: predicted ticker returns
    '''
    ### BEGIN SOLUTION
    rmse = np.sqrt( mean_squared_error(target,  predicted))
    return rmse
    ### END SOLUTION
    
rmse_in_sample = computeRMSE(y_train, aapl_predicted_in_sample)
rmse_out_sample = computeRMSE(y_test, aapl_predicted_out_sample)

print("In Sample Root Mean squared error: {:.3f}".format( rmse_in_sample ) )
print("Out of Sample Root Mean squared error: {:.3f}".format( rmse_out_sample ) )

In [None]:
### BEGIN HIDDEN TESTS
def computeRMSE_( target, predicted ):
    rmse = np.sqrt( mean_squared_error(target,  predicted))
    return rmse
rmse_in_sample_test = computeRMSE_(y_train, regr.predict(X_train))
rmse_out_sample_test = computeRMSE_(y_test, regr.predict(X_test))
assert np.allclose(rmse_in_sample, rmse_in_sample_test)
assert np.allclose(rmse_out_sample, rmse_out_sample_test)
### END HIDDEN TESTS

## Hedged returns

Why is being able to predict the return of a ticker, given the return of another instrument (e.g., the market proxy) useful ?
- It **does not** allow us to predict the future
    - To predict $\ret_\aapl^\tp$, we require the same day return of the proxy $\ret_\spy^\tp$
- It **does** allow us to predict how much $\aapl$ will outperform the market proxy

Consider an investment that goes long (i.e., holds a positive quantity of) $\aapl$
- Since the relationship between returns is positive
    - You will likely make money if the market goes up
    - You will likely lose money if the market goes down
    
Consider instead a *hedged* investment
- Go long 1 USD of $\aapl$
- Go short (hold a negative quantity) $\beta_{\aapl,\spy}$ USD of the market proxy $\spy$

Your *hedged return* on this long/short portfolio will be
$$
{\ret'}_{\aapl}^\tp = \ret_\aapl^\tp - \beta_{\aapl, \spy} * \ret_\spy^\tp
$$

As long as
$$
\ret_\aapl^\tp \gt \beta_{\aapl, \spy} * \ret_\spy^\tp
$$
you will make a profit, regardless of whether the market proxy rises or falls !

That is: you make money as long as $\aapl$ *outperforms* the market proxy.


This hedged portfolio is interesting
- Because, to the extent that $\beta_{\aapl, \spy}$ is estimated well and remains stable, your returns are approximately independent of the market
- The volatility of your returns is likely much lower than the volatility of the long-only investment
- There is a belief that it is difficult to predict the market return $\ret_\spy^\tp$
- But you might be able to discover a ticker (e.g., $\aapl$) that will outperform the market

This is a real world application of the Regression task in Finance.
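
The claims about the hedged portfolio can be illustrated on simulated returns. This is a sketch under strong assumptions: the stock return is generated as `beta` times the market return plus independent noise, and the hedge uses the same `beta` that generated the data. In practice $\beta_{\aapl, \spy}$ must be estimated and may drift, so the hedge will be less clean.

```python
import numpy as np

# Simulated returns (illustrative values only)
rng = np.random.default_rng(2)
beta = 1.1
r_spy = rng.normal(0.0, 0.01, size=5000)
r_stock = beta * r_spy + rng.normal(0.0003, 0.004, size=5000)

# Long 1 USD of the stock, short beta USD of the market proxy
hedged = r_stock - beta * r_spy

# Correlation with the market, before and after hedging
corr_long = np.corrcoef(r_stock, r_spy)[0, 1]
corr_hedged = np.corrcoef(hedged, r_spy)[0, 1]

print("corr long-only: {:.2f}, corr hedged: {:.2f}".format(corr_long, corr_hedged))
print("vol long-only: {:.4f}, vol hedged: {:.4f}".format(r_stock.std(), hedged.std()))
```

In this simulation the hedged returns are nearly uncorrelated with the market and have a smaller standard deviation than the long-only returns.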

## Compute the hedged return on the test data examples
$$
{\ret'}_{\aapl}^\tp = \ret_\aapl^\tp - \beta_{\aapl, \spy} * \ret_\spy^\tp
$$
for all dates $t$ in the **test set**.  

**Question:**
- Complete the function `compute_hedged_series` to get the hedged series

In [None]:
hedged_series = pd.DataFrame()

def compute_hedged_series(model, X, y):
    '''
    Compute the hedged series
    
    Arguments:
    model: model you build with method "createModel()"
    X: ticker returns in test dataset
    y: index returns in test dataset
    '''
    ### BEGIN SOLUTION
    hedged_series = X.values - model.coef_[0][0] * y.values
    hedged_series = pd.DataFrame({'Return':hedged_series.flatten()}, index=X.index)
    return hedged_series
    ### END SOLUTION

hedged_series = compute_hedged_series(regr, X_test, y_test)
print(hedged_series.head())

In [None]:
### BEGIN HIDDEN TESTS
def compute_hedged_series_(model, X, y):
    hedged_series = X.values - model.coef_[0][0] * y.values
    hedged_series = pd.DataFrame({'Return':hedged_series.flatten()}, index=X.index)
    return hedged_series

hedged_series_test = compute_hedged_series_(regr, X_test, y_test)
assert np.allclose(hedged_series_test, hedged_series)
### END HIDDEN TESTS