$$
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\price}{{p}}
\newcommand{\ret}{{r}}
\newcommand{\tp}{{(t)}}
\newcommand{\aapl}{{\text{AAPL}}}
\newcommand{\ba}{{\text{BA}}}
\newcommand{\spy}{{\text{SPY}}}
$$

# Problem description

We will solve a Regression task that is very common in Finance
- Given the return of "the market", predict the return of a particular stock

That is
- Given the return of a proxy for "the market" at time $t$, predict the return of, e.g., Apple at time $t$.

As we will explain
being able to predict the relationship between two financial instruments opens up possibilities
- Use one instrument to "hedge" or reduce the risk of holding the other
- Create strategies whose returns are independent of "the market"
    - Hopefully make a profit regardless of whether the market goes up or down
    

## Goal

You will create models of increasing complexity in order to explain the return of Apple (ticker \aapl)
- The first model will have a single feature: return of the market proxy, ticker $\spy$
- Subsequent models will add the return of other tickers as additional features


## Learning objectives
- Learn how to solve a Regression taks
- Become facile in the `sklearn` toolkit for Machine Learning

# Standard imports

In [1]:
# %load "./assignment_1_answers.py"
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline

In [3]:
DATA_DIR = "./data"

# API for students

We will define some utility routines.

This will simplify problem solving


In [4]:
import pdb

ticker = "AAPL"
index_ticker = "SPY"
dateAttr = "Dt"
priceAttr = "Close"
retAttr = "Return"

def attrRename(df, ticker):
    """
    Rename attributes of DataFrame
    - prepend the string "T_" to original attribute name, where T is the string of ticker
    """
    rename_map = { orig:  ticker + "_" + orig.replace(" ", "_") for orig in df.columns.to_list() }
    
    return df.rename(columns=rename_map)

def getData(tickers, indx, attrs= [priceAttr]):
    """
    Return DataFrame with data for a list of tickers plus and index
    
    Parameters
    ----------
    tickers: List
    - List of tickers
    
    indx: String
    - Ticker of index
    
    attrs: List
    - List of data attributes to retain
    
    Returns
    -------
    DataFrame
    - attributes:
    -- each original attribute, prepended with "T_" where T is ticker or index string
    -- This is necessary to distinguish between attributes of different tickers
    """
    dateAttr = "Dt"
    
    use_cols =  attrs.copy()
    use_cols.insert(0, dateAttr)
   
    # Read the CSV files
    dfs = []
    for ticker_num, ticker in enumerate(tickers):
        ticker_file = os.path.join(DATA_DIR, "{t}.csv".format(t=ticker) )
        ticker_df = pd.read_csv(ticker_file, index_col=dateAttr, usecols=use_cols)
        
        # Rename attributes with ticker name
        ticker_df = attrRename(ticker_df, ticker)
        
        dfs.append(ticker_df)
        
          
    index_file   = os.path.join(DATA_DIR, "{t}.csv".format(t=indx) )
    index_df   = pd.read_csv(index_file, index_col=dateAttr, usecols=use_cols)
    index_df = attrRename(index_df, indx)
    
    dfs.append(index_df)
    
    data_df = pd.concat( dfs, axis=1)
   
    return data_df



# Get the data

The first step in our Recipe is Get the Data.

We have provided a utility method `getData` to simplify this for you

In [5]:
data = getData([ticker], index_ticker)

## Have a look at the data

We will not go through all steps in the Recipe, nor in depth.

But here's a peek

In [6]:
data.head()

Unnamed: 0_level_0,AAPL_Close,SPY_Close
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2017-01-03,116.15,225.24
2017-01-04,116.02,226.58
2017-01-05,116.61,226.4
2017-01-06,117.91,227.21
2017-01-09,118.99,226.46


In [7]:
data.index.min(), data.index.max()

('2017-01-03', '2019-10-31')

As you can see, our data has two attributes for each date
- Price (adjusted close) of ticker $\aapl$
- Price (adjusted close) of the market proxy $\spy$

# Prepare the data

In Finance, it is very typical to work with *relative changes* (e.g., percent price change)
rather than *absolute changes* (price change) or *levels* (prices).

Without going into too much detail
- Relative changes are more consistent over time than either absolute changes or levels
- The consistency can facilitate the use of data over a longer time period

For example, let's suppose that prices are given in units of USD (dollar)
- A price change of 1 USD is more likely for a stock with price level 100 than price level 10
    - A relative change of $1/100 = 1 %$ is more likely than a change of $1/10 = 10%$
    - So relative changes are less dependent on price level than either price changes or price levels
    
    
To compute the *return* (percent change in prices)
 for ticker $\aapl$ (Apple) on date $t$

$$
\begin{array}[lll]\\
\ret_\aapl^\tp = \frac{\price _\aapl^\tp}{\price _\aapl^{(t-1)}} -1 \\
\text{where} \\
\price_\aapl^\tp \text{ denotes the price of ticker } \aapl \text{ on date } t \\
\ret_\aapl^\tp \text{ denotes the return of ticker } \aapl \text{ on date } t
\end{array}
$$

Moreover: our Problem Description is to predict returns (percent price changes) of a ticker
given the returns of a market proxy (the "index" with tickr $\spy$)

    

# Transformations: transform the training data

Our first task is to transform the data from price levels (Adj Close)
to Percent Price Changes.

Moreover, the date range for the training data is specified to be in the range
from `start_dt` (start date) to `end_dt`, inclusive on both sides.

**Note**

We will need to apply **identical** transformations to both the training and test data examples.

In the cells that immediately follow, we will do this only for the **training data**

You will need to repeat these steps for the test data in a subsequent step.

You are well-advised to create subroutines or functions to accomplish these tasks !
- You will apply them first to transform training data
- You will apply them a second time to transform the test data

We will achieve this is several steps

In [8]:

start_dt = "2018-01-02"
train_size, test_size = 188, 63

# Find the position of end_dt in the list of dates (so we can find the day after)
dates = data.index.to_list()
start_pos = dates.index(start_dt)

end_dt = dates[ start_pos + train_size-1 ]

print("Train data return range from {st:s} to {e:s}".format(st=start_dt, e=end_dt))

Train data return range from 2018-01-02 to 2018-09-28


## Create Dataframe of price levels for the training examples

- The Dataframe should have two columns: the price level for the ticker and for the index
- The minimum date in the Dataframe should be **the day before** `start_dt`
- The maximum date in the Dataframe shold be `end_dt`

The reason we are adding one day prior to `start_dt`
- We want to have returns (percent price changes) from `start_dt` onwards
- In order to compute a return for `start_dt`, we need the level from the prior day


In [9]:
# Set variable train_data_price to be a DataFrame with two columns
## AAPL_Adj_Close, SPY_Adj_Close
## with dates as the index
## Having minimum date equal to THE DAY BEFORE start_dt
## Having maximum date equal to end_dt


### BEGIN SOLUTION
def getRange(df, start_dt, end_dt):
    dates = data.index.to_list()
    
    # Find the position of start_dt in the list of dates (so we can find the day before)
    start_pos = dates.index(start_dt)

    # Return the slice of the date range
    return data[ dates[ start_pos -1 ]:end_dt]

train_data_price = getRange(data, start_dt, end_dt)
### END SOLUTION

In [10]:
train_data_price

Unnamed: 0_level_0,AAPL_Close,SPY_Close
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2017-12-29,169.23,266.86
2018-01-02,172.26,268.77
2018-01-03,172.23,270.47
2018-01-04,173.03,271.61
2018-01-05,175.00,273.42
...,...,...
2018-09-24,220.79,291.02
2018-09-25,222.19,290.75
2018-09-26,220.42,289.88
2018-09-27,224.95,290.69


## Create Dataframe of returns for training examples

Create a new Dataframe with percent price changes of the columns, rather than the levles


In [11]:
# Set variable train_data_ret to a DataFrame with two columns
## containing the day over day percent changes in AAPL_Adj_Close, SPY_Adj_Close
## with dates as the index
## Having minimum date equal to start_dt
## Having maximum date equal to end_dt

## Please continue to name these columns with their original names (AAPL_Adj_Close, SPY_Adj_Close)
## even though the data in the columns will be returns rather than prices.
## We will rename the columns to reflect the data in the next step

### BEGIN SOLUTION
def getReturns(df, start_dt, end_dt):
    df_range = getRange(df, start_dt, end_dt)
    return df_range.pct_change()

train_data_ret = getReturns(train_data_price, start_dt, end_dt)
### END SOLUTION

In [12]:
train_data_ret.head()

Unnamed: 0_level_0,AAPL_Close,SPY_Close
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2017-12-29,,
2018-01-02,0.017905,0.007157
2018-01-03,-0.000174,0.006325
2018-01-04,0.004645,0.004215
2018-01-05,0.011385,0.006664


In [13]:
## Rename the columns to indicate that they have been transformed from price (Adj_close) to Return
def renamePriceToRet(df):
    rename_map = { }
    rename_map = { orig:  orig.replace( priceAttr.replace(" ", "_"), "Ret") for orig in df.columns.to_list() }
    return df.rename(columns = rename_map)

train_data_ret = renamePriceToRet( train_data_ret )

In [14]:
train_data_ret.head()

Unnamed: 0_level_0,AAPL_Ret,SPY_Ret
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2017-12-29,,
2018-01-02,0.017905,0.007157
2018-01-03,-0.000174,0.006325
2018-01-04,0.004645,0.004215
2018-01-05,0.011385,0.006664


In [15]:
## Drop the first date (the day before `start_dt`) since it has an undefied return
train_data_ret = train_data_ret[ start_dt:]

train_data_ret.head()

Unnamed: 0_level_0,AAPL_Ret,SPY_Ret
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-02,0.017905,0.007157
2018-01-03,-0.000174,0.006325
2018-01-04,0.004645,0.004215
2018-01-05,0.011385,0.006664
2018-01-08,-0.003714,0.001829


## Remove the target (AAPL_Ret)

The only feature is the return of the market proxy $\spy$.

Predicting the target given the target as a feature would be cheating !


In [16]:
targetAttr = index_ticker + "_Ret"
X_train, y_train =  train_data_ret[ [targetAttr] ], train_data_ret.drop(columns=[ targetAttr ] )

In [17]:
X_train.head()

Unnamed: 0_level_0,SPY_Ret
Dt,Unnamed: 1_level_1
2018-01-02,0.007157
2018-01-03,0.006325
2018-01-04,0.004215
2018-01-05,0.006664
2018-01-08,0.001829


In [18]:
y_train.head()

Unnamed: 0_level_0,AAPL_Ret
Dt,Unnamed: 1_level_1
2018-01-02,0.017905
2018-01-03,-0.000174
2018-01-04,0.004645
2018-01-05,0.011385
2018-01-08,-0.003714


# Transformations: transform the test data

The test data will be returns from `test_start_dt` to `test_end_dt` inclusive.

We will apply identical transformations as we did to the training data, but with a different date range.


In [19]:
# Find the position of end_dt in the list of dates (so we can find the day after)
dates = data.index.to_list()
end_pos = dates.index(end_dt)

test_start_dt = dates[ end_pos+1 ]
test_end_dt   = dates[ end_pos + test_size]

print("Test data return range from {st:s} to {e:s}".format(st=test_start_dt, e=test_end_dt))

Test data return range from 2018-10-01 to 2018-12-31


## Create Dataframe of price levels for the test data

Use the same details as specified for the training data


In [20]:
# Set variable test_data_price to be a DataFrame with two columns
## AAPL_Adj_Close, SPY_Adj_Close
## with dates as the index
## Having minimum date equal to THE DAY BEFORE test_start_dt
## Having maximum date equal to test_end_dt


### BEGIN SOLUTION

test_data_price = getRange(data, test_start_dt, test_end_dt)
### END SOLUTION

In [21]:
test_data_price

Unnamed: 0_level_0,AAPL_Close,SPY_Close
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-09-28,225.74,290.72
2018-10-01,227.26,291.73
2018-10-02,229.28,291.56
2018-10-03,232.07,291.72
2018-10-04,227.99,289.44
...,...,...
2018-12-24,146.83,234.34
2018-12-26,157.17,246.18
2018-12-27,156.15,248.07
2018-12-28,156.23,247.75


## Create Dataframe of returns for test examples

Use the same details as specified for the training data

In [22]:
# Set variable test_data_ret to a DataFrame with two columns
## containing the day over day percent changes in AAPL_Adj_Close, SPY_Adj_Close
## with dates as the index
## Having minimum date equal to test_start_dt
## Having maximum date equal to test_end_dt

## Please continue to name these columns with their original names (AAPL_Adj_Close, SPY_Adj_Close)
## even though the data in the columns will be returns rather than prices.
## We will rename the columns to reflect the data in the next step

### BEGIN SOLUTION
test_data_ret = getReturns(test_data_price, test_start_dt, test_end_dt)
### END SOLUTION

In [23]:
test_data_ret.head()

Unnamed: 0_level_0,AAPL_Close,SPY_Close
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-09-28,,
2018-10-01,0.006733,0.003474
2018-10-02,0.008888,-0.000583
2018-10-03,0.012169,0.000549
2018-10-04,-0.017581,-0.007816


In [24]:
## Rename the columns to indicate that they have been transformed from price (Adj_close) to Return

test_data_ret = renamePriceToRet( test_data_ret )

In [25]:
test_data_ret.head()

Unnamed: 0_level_0,AAPL_Ret,SPY_Ret
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-09-28,,
2018-10-01,0.006733,0.003474
2018-10-02,0.008888,-0.000583
2018-10-03,0.012169,0.000549
2018-10-04,-0.017581,-0.007816


In [26]:
## Drop the first date (the day before `test_start_dt`) since it has an undefined return
test_data_ret = test_data_ret[ test_start_dt:]

test_data_ret.head()

Unnamed: 0_level_0,AAPL_Ret,SPY_Ret
Dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-10-01,0.006733,0.003474
2018-10-02,0.008888,-0.000583
2018-10-03,0.012169,0.000549
2018-10-04,-0.017581,-0.007816
2018-10-05,-0.016229,-0.005597


## Remove the target (AAPL_Ret) 

The only feature is the return of the market proxy $\spy$.

Predicting the target given the target as a feature would be cheating !


In [27]:
targetAttr = index_ticker + "_Ret"
X_test, y_test = test_data_ret.drop(columns=[ targetAttr ] ), test_data_ret[ [targetAttr] ]

# Train a model

Use Linear Regression to predict the return of a ticker from the return of the market proxy $\spy$.
For example, for ticker $\aapl$

$$
\ret_\aapl^\tp = \beta_0 + \beta_{\aapl, \spy} * \ret_\spy^\tp + \epsilon_{\aapl}^\tp
$$

Each example corresponds to one day (time $t$)
- has features
    - constant 1, corresponding to the intercept parameter
    - return of the market proxy $\spy$
       $$\x^\tp = \begin{pmatrix}
        1 \\
        \ret_\spy^\tp
        \end{pmatrix}$$

- has target
    - return of the ticker
    $$\y^\tp = \ret_\aapl^\tp$$

 
You will use Linear Regression to solve for parameters $\beta_0$,  $\beta_{\aapl, \spy}$ 

- In the lectures we used the symbol $\Theta$ to denote the parameter vector; here we use $\mathbf{\beta}$
- In Finance the symbol $\beta$ is often used to denote the relationship between returns.
- Rather than explicitly creating a constant 1 feature
    - you may invoke the model object with the option including an intercept
    - if you do so, the feature vector you pass will be
   $$\x^\tp = \begin{pmatrix}
        \ret_\spy^\tp
        \end{pmatrix}$$  
    


- Use the entire training set
- Do not use cross-validation



In [28]:
beta_0 = 0    # The regression parameter for the constant (i.e., the intercept)
beta_SPY = 0  # The regression parameter for the return of SPY

### BEGIN SOLUTION
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

def createModel():
    model = linear_model.LinearRegression()
    return model

def regress(model, X, y):
    _= model.fit( X.values, y.values )
    
    return model.intercept_[0], model.coef_[0][0]

def computeRMSE( target, predicted ):
    rmse = np.sqrt( mean_squared_error(target,  predicted))
    return rmse

regr = createModel()

aapl_beta_0, aapl_beta_1 = regress(regr, X_train, y_train)

# Assign to answer variables
beta_0, beta_SPY = aapl_beta_0, aapl_beta_1
### END SOLUTION
print("{t:s}: beta_0={b0:3.3f}, beta_SPY={b1:3.3f}".format(t=ticker, b0=aapl_beta_0, b1=beta_SPY))


AAPL: beta_0=0.001, beta_SPY=1.071


# Train the model using Cross valiation

Use 5-fold cross validation

In [29]:
# Set variable `scores` 
## A numpy.ndarray containing the 5 scores from 5-fold cross validation
### BEGIN SOLUTION
from sklearn.model_selection import cross_val_score

scores = cross_val_score(regr, X_train, y_train, cv=5)
### END SOLUTION

print("Scores: min={mn:4.2f}, max={mx:4.2f}, avg={a:4.2f}".format(mn=scores.min(), mx=scores.max(), a=scores.mean()))

Scores: min=-0.01, max=0.62, avg=0.33


# Evaluate Loss (in sample RMSE)

In [30]:
## BEGIN SOLUTION
y_train_pred = regr.predict( X_train )
rmse = np.sqrt( mean_squared_error(y_train,  y_train_pred))
print("Root Mean squared error: {:.3f}".format( rmse ) )

# Explained variance score: 1 is perfect prediction
print("R-squared: {:.2f}".format(r2_score(y_train, y_train_pred)) )
### END SOLUTION

Root Mean squared error: 0.011
R-squared: 0.42


# Evaluate Performance Metric (out of sample RMSE)

Use test examples

In [31]:
## BEGIN SOLUTION
y_test_pred = regr.predict( X_test )
rmse = np.sqrt( mean_squared_error(y_test,  y_test_pred))
print("Root Mean squared error: {:.3f}".format( rmse ) )

### END SOLUTION

Root Mean squared error: 0.017


## Hedged returns

Why is being able to predict the return of a ticker, given the return of another instrument (e.g., the market proxy) useful ?
- It **does not** allow us to predict the future
    - To predict $\ret_\aapl^\tp$, we require the same day return of the proxy $\ret_\spy$
- It **does** allow us to predict how much $\aapl$ will outperform the market proxy

Consider an investment that goes long (i.e, holds a positive quantity of $\aapl$
- Since the relationship between returns is positive
    - You will likely make money if the market goes up
    - You will likely lose money if the market goes down
    
Consider instead a *hedged* investment
- Go long 1 USD of $\aapl$
- Go short (hold a negative quantity) $\beta_{\aapl,\spy}$ USD of the market proxy $\spy$

Your *hedged return* on this long/short portfolio will be
$$
{\ret'}_{\aapl}^\tp = \ret_\aapl^\tp - \beta_{\aapl, \spy} * \ret_\spy^\tp
$$

As long as
$$
\ret_\aapl^\tp \gt \beta_{\aapl, \spy} * \ret_\spy^\tp
$$
you will make a profit, regardless of whether the market proxy rises or falls !

That is: you make money as long as $\aapl$ *outperforms* the market proxy.



This hedged portfolio is interesting
- Because your returns are independent of the market
- The volatility of your returns is likely much lower than the volatility of the long-only investment
- There is a belief that it is difficult to predict the market $\ret_\spy$
- But you might be able to discover a ticker (e.g., $\aapl$) that will outpeform the market

This is a real world application of the Regression task in Finance.

## Compute the hedged return on the test data examples
$$
{\ret'}_{\aapl}^\tp = \ret_\aapl^\tp - \beta_{\aapl, \spy} * \ret_\spy^\tp
$$
for all dates $t$ in the test set.  


In [32]:
# Set variable y_resid to 
## a numpy.ndarray
## containing the values of the hedged returns
## for all days in the test examples

### BEGIN SOLUTION
y_resid = y_test.values - X_test.values * beta_SPY
### END SOLUTION

In [33]:
y_resid.shape
y_test.shape
X_test.shape

(63, 1)

(63, 1)

(63, 1)

# OLD version follows

# Assignment: Using Machine Learning for Hedging

Welcome to the first assignment.  

We will show how Machine Learning can be used in Finance to build multi-asset portfolios that have better risk/return characteristics than a portfolio consisting of a single asset.

# Objectives
We will be using Linear Regression to establish the relationship between the returns of individual equities and "the market".

The purpose of the assignment is two-fold
- to get you up to speed with Machine Learning in general, and `sklearn` in particular
- to get you up to speed with the other programming tools (e.g., Pandas) that will help you in data preparation, etc.

# How to report your answers
I will mix explanation of the topic with tasks that you must complete. Look for 
the string "**Queston**" to find a task that you must perform.
Most of the tasks will require you to assign values to variables and execute a `print` statement.

**Motivation**

If you *do not change* the print statement then the GA (or a machine) can automatically find your answer to each part by searching for the string.


# The data

The data are the daily prices of a number of individual equities and equity indices.
The prices are arranged in a series in ascending date order (a timeseries).
- There is a separate `.csv` file for each equity or index in the directory `data/assignment_1`

## Reading the data

You should get the price data into some sort of data structure.  Pandas DataFrame is super useful
so I recommend that's what you use (not required though).

**Hints**: 
- look up the Pandas `read_csv` method
- it will be very convenient to use dates as the index of your DataFrame

## Preliminary data preparation

In the rest of the assignment we will *not* be working with prices but with *returns* (percent change in prices).
For example, for ticker $\aapl$ (Apple)

$$
\begin{array}[lll]\\
\ret_\aapl^\tp = \frac{\price _\aapl^\tp}{\price _\aapl^{(t-1)}} -1 \\
\text{where} \\
\price_\aapl^\tp \text{ denotes the price of ticker } \aapl \text{ on date } t \\
\ret_\aapl^\tp \text{ denotes the return of ticker } \aapl \text{ on date } t
\end{array}
$$

- You will want to convert the price data into return data
- We only want the returns for the year 2018; discard any other return

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [34]:
import numpy as np
np.array([])

array([], dtype=float64)

In [35]:
ticker_returns = np.array([]) # Returns of the ticker for year 2018
idx_returns = np.array([])    # Returns of the index for year 2018

num_returns = 0  # Number of returns in year 2018
first_return = 0 # The return on the earliest date in 2018
last_return  = 0 # The return on the latest date in 2018
avg_return  = 0  # The average return over the  year 2018

### BEGIN SOLUTION

import pandas as pd
import numpy as np

DATA_DIR = "./data"
DATE_ATTR="Dt"
PRICE_ATTR = "Close"
RET_ATTR = "Return"

def read_ticker(ticker):
    df = pd.read_csv( DATA_DIR + "/" + ticker + ".csv", index_col=DATE_ATTR)
    return df


def add_ret(df):
    df[RET_ATTR] = df[PRICE_ATTR].pct_change()
    return df

def select_yr(df, year):
    df_yr = df[ df[RET_ATTR].notnull() ][ ( str(year) + "-01-01"):(str(year) + "-12-31")]
    return df_yr

def summarize(df):
    num_returns = df.shape[0]
    first_return = df[RET_ATTR].iloc[0]  # The return on the earliest date 
    last_return  = df[RET_ATTR].iloc[-1] # The return on the latest date
    avg_return   = df[RET_ATTR].mean()  # The average return
    
    return num_returns, first_return, last_return, avg_return

# Read the raw data
aapl, spy = read_ticker("AAPL"), read_ticker("SPY")

# Add a return attribute
_= add_ret(aapl)
_= add_ret(spy)
aapl_2018, spy_2018 = select_yr(aapl, 2018), select_yr(spy, 2018)

# Assign to answer variables
ticker_returns = aapl_2018[ RET_ATTR ].values
num_returns, first_return, last_return, avg_return = summarize( aapl_2018 )
### END SOLUTION

print("There are {num:d} returns. First={first:.3%}, Last={last:.3%}, Avg={avg:.3%}".format(num=num_returns, first=first_return, last=last_return, avg=avg_return))

There are 251 returns. First=1.790%, Last=0.967%, Avg=-0.012%


In [36]:
assert np.allclose( [num_returns], [251], rtol=1e-05, atol=1e-03 )
assert np.allclose( [first_return * 100], [ 1.790 ], rtol=1e-05, atol=1e-03 )
assert np.allclose( [last_return * 100], [ 0.967 ], rtol=1e-05, atol=1e-03 )
assert np.allclose( [avg_return * 100], [ - 0.012 ], rtol=1e-05, atol=1e-03 )

assert np.allclose( ticker_returns, aapl_2018[ RET_ATTR ].values )

# Split into Train and Test datasets

In general, you will split the data into two sets by choosing the members of each set at random.

To facilitate grading for this assignment, we will *use a specific test set*
- the training set are the returns for the months of January through September (inclusive), i.e., 9 months
- the test set are the returns for the months of October through December (inclusive), i.e., 3 months

Thus, you will be using the early part of the data for training, and the latter part of the data for testing.

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [37]:
train_num_returns = 0  # Number of returns in train set
train_first_return = 0 # The return on the earliest date in train set
train_last_return  = 0 # The return on the latest date in train set
train_avg_return  = 0  # The average return over the  year train set

test_num_returns = 0  # Number of returns in test set
test_first_return = 0 # The return on the earliest date in test set
test_last_return  = 0 # The return on the latest date in test set
test_avg_return  = 0  # The average return over the  year test set

### BEGIN SOLUTION

def split(df, year):
    df_yr = select_yr(df, year)
    df_train = df_yr[str(year) + "-01-01":str(year) + "-09-30"]
    df_test  = df_yr[str(year) + "-10-01":str(year) + "-12-31"]
    
    return df_train, df_test

# Check

aapl_train, aapl_test = split(aapl, 2018)
spy_train, spy_test = split(spy, 2018)

# Assign to answer variables
train_num_returns, train_first_return, train_last_return, train_avg_return = summarize( aapl_train )

test_num_returns, test_first_return, test_last_return, test_avg_return = summarize( aapl_test )

### END SOLUTION

print("Train set: There are {num:d} returns. First={first:.2%}, Last={last:.2%}, Avg={avg:.2%}".format(num=train_num_returns, 
                                                                                                         first=train_first_return, 
                                                                                                         last=train_last_return, 
                                                                                                         avg=train_avg_return))

print("Test set: There are {num:d} returns. First={first:.2%}, Last={last:.2%}, Avg={avg:.2%}".format(num=test_num_returns, 
                                                                                                         first=test_first_return, 
                                                                                                         last=test_last_return, 
                                                                                                         avg=test_avg_return))

Train set: There are 188 returns. First=1.79%, Last=0.35%, Avg=0.16%
Test set: There are 63 returns. First=0.67%, Last=0.97%, Avg=-0.54%


# $\aapl$ regression

Use Linear Regression to predict the return of a ticker from the return of the $\spy$ index.
For example, for ticker $\aapl$

$$
\ret_\aapl^\tp =  \beta_{\aapl, \spy} * \ret_\spy^\tp + \epsilon_{\aapl}^\tp
$$

That is
- each example is a pair consisting of one day's return 
    - of the ticker (e.g., $\aapl$).  This is the target (e.g, $\y$ in our lectures)
    - of the index $\spy$. This is a feature vector of length 1 (e.g., $\x$ in our lectures)

You will use Linear Regression to solve for parameter $\beta_{\aapl, \spy}$ 

- In the lectures we used the symbol $\Theta$ to denote the parameter vector; here we use $\mathbf{\beta}$
- In Finance the symbol $\beta$ is often used to denote the relationship between returns. 
- You may should add an "intercept" so that the feature vector is length 2 rather than length 1
    - $\x^\tp = \begin{pmatrix}
        1 \\
        \ret_\spy^\tp
        \end{pmatrix}$




- Report the $\mathbf{\beta}$ parameter vector you obtain for $\aapl$
    - you will subsequently do this for another ticker in a different part of the assignment
        - so think ahead: you may want to parameterize your code
        - change the assignment to `ticker` when you report the next part

        
**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements


In [38]:
beta_0 = 0    # The regression parameter for the constant
beta_SPY = 0  # The regression parameter for the return of SPY
ticker = "AAPL"

### BEGIN SOLUTION
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

def createModel():
    model = linear_model.LinearRegression()
    return model

def regress(model, dep_df, ind_df):
    _= model.fit( ind_df[ [RET_ATTR] ].values, dep_df[ [RET_ATTR] ].values )
    
    return model.intercept_[0], model.coef_[0][0]

def computeRMSE( target, predicted ):
    rmse = np.sqrt( mean_squared_error(target,  predicted))
    return rmse

regr = createModel()

aapl_beta_0, aapl_beta_1 = regress(regr, aapl_train, spy_train)

# Assign to answer variables
beta_0, beta_SPY = aapl_beta_0, aapl_beta_1
### END SOLUTION
print("{t:s}: beta_0={b0:3.3f}, beta_SPY={b1:3.3f}".format(t=ticker, b0=aapl_beta_0, b1=aapl_beta_1))


AAPL: beta_0=0.001, beta_SPY=1.071


In [39]:
assert np.allclose( [beta_0, beta_SPY], [aapl_beta_0, aapl_beta_1] )

- Report the average of the cross validation scores, using 5 fold cross validation

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements


In [40]:
cross_val_avg = 0

print("{t:s}: Avg cross val score = {sc:3.2f}".format(t=ticker, sc=cross_val_avg) )

AAPL: Avg cross val score = 0.00


## $\aapl$ hedged returns

- Compute the series
$$
{\ret'}_{\aapl}^\tp = \ret_\aapl^\tp - \beta_{\aapl, \spy} * \ret_\spy^\tp
$$
for all dates $t$ in the test set.  
- Sort the dates in ascending order and plot the timeseries ${\ret}'_{\aapl}$

${\ret}'_{\aapl}$ is called the "hedged return" of $\aapl$
- It is the daily return you would realize if you created a portfolio that was
    - long 1 dollar of $\aapl$
    - short $\beta_{\aapl, \spy}$ dollars of the index $\spy$
- It represents the outperformance of $\aapl$ relative to the index $\spy$
    - $\spy$ is the proxy for "the market" (it tracks the S&P 500 index)
    - The hedged return is the *value added* by going long $\aapl$ rather than just going "long the market"
    - Sometimes referred to as the "alpha" ($\alpha_\aapl$)
- So **if** you are able to correctly forecast that $\aapl$ will have positive outperformance (i.e, have $\alpha_\aapl > 0$ most days)
    - then you can earn a positive return regardless of whether the market ($\spy$) goes up or down !
    - this is much lower risk than just holding $\aapl$ long
    - people will pay you very well if you can really forecast correctly !

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [41]:
hedged_num_returns = 0  # Number of returns in hedged series
hedged_first_return = 0 # The return on the earliest date in hedged series
hedged_last_return  = 0 # The return on the latest date in hedged series
hedged_avg_return  = 0  # The average return over the hedged series


ticker="AAPL"
print("{t:s} hedged returns: There are {num:d} returns. First={first:.2%}, Last={last:.2%}, Avg={avg:.2%}".format(t=ticker,
                                                                                                                    num=hedged_num_returns,
                                                                                                                    first=hedged_first_return, 
                                                                                                                    last=hedged_last_return, 
                                                                                                                    avg=hedged_avg_return))


AAPL hedged returns: There are 0 returns. First=0.00%, Last=0.00%, Avg=0.00%


# $\ba$ regression

Repeat the regression you carried out for $\aapl$ but this time instead for the ticker $\ba$ (Boeing)

**Motivation**

The idea is to encourage you to build re-usable pieces of code.

So if you created some functions in solving Part 1, you may reuse these functions to easily solve part 2,
particulary if you treated the ticker (e.g., $\aapl$ or $\ba$) as a parameter to your functions.

If you simply copy and paste the code from Part 1 you will only get partial credit.


**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [42]:
beta_0 = 0    # The regression parameter for the constant
beta_SPY = 0  # The regression parameter for the return of SPY
ticker = "BA"

print("{t:s}: beta_0={b0:3.2f}, beta_SPY={b1:3.2f}".format(t=ticker, b0=beta_0, b1=beta_SPY))


BA: beta_0=0.00, beta_SPY=0.00


**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [43]:
cross_val_avg = 0

print("{t:s}: Avg cross val score = {sc:3.2f}".format(t=ticker, sc=cross_val_avg) )

BA: Avg cross val score = 0.00


**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [44]:
hedged_num_returns = 0  # Number of returns in hedged series
hedged_first_return = 0 # The return on the earliest date in hedged series
hedged_last_return  = 0 # The return on the latest date in hedged series
hedged_avg_return  = 0  # The average return over the hedged series

ticker="BA"
print("{t:s} hedged returns: There are {num:d} returns. First={first:.2%}, Last={last:.2%}, Avg={avg:.2%}".format(t=ticker,
                                                                                                                    num=hedged_num_returns,
                                                                                                                    first=hedged_first_return, 
                                                                                                                    last=hedged_last_return, 
                                                                                                                    avg=hedged_avg_return))


BA hedged returns: There are 0 returns. First=0.00%, Last=0.00%, Avg=0.00%


# Returns to prices

- You have already computed the predicted returns of $\aapl$ for each date in the test set.
- Create the predicted *price* timeseries for $\aapl$ for the date range in the test set
- Plot (on the same graph) the actual price timeseries of $\aapl$ and the predicted price timeseries.

There is a particular reason that we choose to perform the Linear Regression on returns rather than prices.

It is beyond the scope of this lecture to explain why, but we want to show that we can easily convert
back into prices.

**Question**
Replace the 0 values in the following cell with your answers, and execute the print statements

In [45]:
num_prices = 0  # Number of prices in price series
first_price = 0 # The price on the earliest date in price series
last_price  = 0 # The price on the latest date in price series
avg_price  = 0  # The average price over the price series

ticker="AAPL"
print("{t:s} predicted prices: There are {num:d} prices. First={first:3.2f}, Last={last:3.2f}, Avg={avg:3.2f}".format(t=ticker,
                                                                                                                    num=num_prices,
                                                                                                                    first=first_price, 
                                                                                                                    last=last_price, 
                                                                                                                    avg=avg_price))
 

AAPL predicted prices: There are 0 prices. First=0.00, Last=0.00, Avg=0.00


# Extra credit

The data directory has the prices of many other indices.
- Any ticker in the directory beginning with the letter "X" is an index

Choose *one* index (we'll call it $I$) other than $\spy$ to use as a second feature and compute the Linear Regression

$$
\ret_\aapl^\tp = \beta^T \x + \epsilon_{\aapl}^\tp
$$

where $\x$ is the feature vector
  - $\x^\tp = \begin{pmatrix}
        1 \\
        \ret_\spy^\tp \\
        \ret_I^\tp \\
        \end{pmatrix}$

That is, predict the returns of $\aapl$ in terms of a constant, the returns of $\spy$ and the returns of another index $I$.

**Question**
There is no specified format.  Treat this like an interview question and show off your analytical
and explanatory skills. Be sure to explain how you came about choosing the second index.

In [46]:
print("Done")

Done
