# Predicting Stock Returns

Our objective in this chapter is to predict stock returns using linear regression and k-nearest neighbors (KNN).  Specifically, we will try to predict the daily returns of MSFT from the returns of various correlated assets including stock indices, currencies, and other stocks.

In the related homework, you will test a simple trading strategy that is based on our predictions and see how it performs relative to a buy-and-hold strategy.

## Import Packages

Let's begin by loading the packages that we will need.

In [1]:
import numpy as np
import pandas as pd
import yfinance as yf
import sklearn

## Reading-In Data

Next, let's read in our data.  We will start with the stocks, whose data we will get from Yahoo Finance.

In [2]:
stock_tickers = ['MSFT', 'IBM', 'GOOGL'] # define tickers
df_stock = yf.download(
    stock_tickers, start='2005-01-01', end='2021-07-31', auto_adjust=False,
)
df_stock = df_stock['Adj Close'] # select only the adjusted close price
df_stock.columns = df_stock.columns.str.lower() # clean-up column names
df_stock.rename_axis('trade_date', inplace=True) # clean-up index name
df_stock.rename_axis('', axis=1, inplace=True) # clean-up index name
df_stock

[*********************100%***********************]  3 of 3 completed


trade_date,googl,ibm,msft
2005-01-03,5.038075,50.237438,18.454884
2005-01-04,4.834026,49.697815,18.523891
2005-01-05,4.809422,49.595032,18.482494
2005-01-06,4.686148,49.440849,18.461786
2005-01-07,4.817872,49.224987,18.406570
...,...,...,...
2021-07-26,133.116898,114.338173,279.157196
2021-07-27,130.996490,114.322159,276.733063
2021-07-28,135.161774,113.537323,276.424042
2021-07-29,134.847443,113.665466,276.694489


Next, we'll grab currency data, also from Yahoo Finance.

In [3]:
currency_tickers = ['JPY=X', 'GBPUSD=X']
df_currency = yf.download(
    currency_tickers, start='2005-01-01', end='2021-07-31',
    auto_adjust=False, ignore_tz=True
)
df_currency = df_currency['Adj Close']
df_currency.columns = df_currency.columns.str.lower()
df_currency.rename_axis('trade_date', inplace=True)
df_currency.rename_axis('', axis=1, inplace=True)
df_currency

[*********************100%***********************]  2 of 2 completed


trade_date,gbpusd=x,jpy=x
2005-01-03,1.904617,102.739998
2005-01-04,1.883594,104.339996
2005-01-05,1.885512,103.930000
2005-01-06,1.876490,104.889999
2005-01-07,1.871293,104.889999
...,...,...
2021-07-26,1.375781,110.543999
2021-07-27,1.382915,110.302002
2021-07-28,1.388272,109.806000
2021-07-29,1.390685,109.890999


Finally, we'll grab index data from Yahoo Finance.

In [4]:
index_tickers = ['SPY', 'DIA', '^VIX'] 
df_index = yf.download(
    index_tickers, start='2005-01-01', end='2021-07-31', auto_adjust=False
)
df_index = df_index['Adj Close']
df_index.columns = df_index.columns.str.lower().str.replace('^', '', regex=False) # strip the literal '^' from '^vix'
df_index.rename_axis('trade_date', inplace=True)
df_index.rename_axis('', axis=1, inplace=True)
df_index

[*********************100%***********************]  3 of 3 completed


trade_date,dia,spy,vix
2005-01-03,67.762733,82.074066,14.080000
2005-01-04,67.118683,81.071114,13.980000
2005-01-05,66.746178,80.511726,14.090000
2005-01-06,66.954475,80.921043,13.580000
2005-01-07,66.828262,80.805023,13.490000
...,...,...,...
2021-07-26,326.679779,416.758026,17.580000
2021-07-27,325.945404,414.858643,19.360001
2021-07-28,324.774078,414.688538,18.309999
2021-07-29,326.131317,416.408417,17.700001


## Join and Clean Data

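Now we can join our price data and convert it into returns.  For VIX we use daily differences rather than percent changes, as these are more stationary.  Notice that we are implicitly adding a time series component to our regression by including the current-day and previous-day `msft` returns (`msft_lag_0` and `msft_lag_1`) as features.  Also notice that the raw currency levels `gbpusd=x` and `jpy=x` stay in the DataFrame alongside their percent changes (`dexusuk` and `dexjpus`), so they are part of the feature set used below.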

In [5]:
df_data = \
    (
    df_stock
        .merge(df_index, how='left', left_index=True, right_index=True) # join index data
        .merge(df_currency, how='left', left_index=True, right_index=True) # join currency data
        .dropna()
        .assign(msft = lambda df: df['msft'].pct_change())   # percent change
        .assign(msft_lag_0 = lambda df: df['msft'].shift(0)) # current-day msft return (feature)
        .assign(msft_lag_1 = lambda df: df['msft'].shift(1)) # previous-day msft return (feature)
        .assign(ibm = lambda df: df['ibm'].pct_change())     # percent change
        .assign(googl = lambda df: df['googl'].pct_change()) # percent change
        .assign(spy = lambda df: df['spy'].pct_change())     # percent change
        .assign(dia = lambda df: df['dia'].pct_change())     # percent change
        .assign(vix = lambda df: df['vix'].diff())           # absolute change
        .assign(dexjpus = lambda df: df['jpy=x'].pct_change()) # percent change
        .assign(dexusuk = lambda df: df['gbpusd=x'].pct_change()) # percent change
        .dropna()
    )
df_data

trade_date,googl,ibm,msft,dia,spy,vix,gbpusd=x,jpy=x,msft_lag_0,msft_lag_1,dexjpus,dexusuk
2005-01-05,-0.005090,-0.002068,-0.002235,-0.005550,-0.006900,0.110001,1.885512,103.930000,-0.002235,0.003739,-0.003929,0.001018
2005-01-06,-0.025632,-0.003109,-0.001120,0.003121,0.005084,-0.510000,1.876490,104.889999,-0.001120,-0.002235,0.009237,-0.004785
2005-01-07,0.028109,-0.004366,-0.002991,-0.001885,-0.001434,-0.090000,1.871293,104.889999,-0.002991,-0.001120,0.000000,-0.002769
2005-01-10,0.006242,-0.001044,0.004874,0.003401,0.004729,-0.260000,1.876912,104.169998,0.004874,-0.002991,-0.006864,0.003003
2005-01-11,-0.007792,-0.007108,-0.002612,-0.006403,-0.006891,-0.040000,1.878605,103.419998,-0.002612,0.004874,-0.007200,0.000902
...,...,...,...,...,...,...,...,...,...,...,...,...
2021-07-26,0.007668,0.010117,-0.002140,0.002396,0.002455,0.379999,1.375781,110.543999,-0.002140,0.012337,0.003668,-0.001168
2021-07-27,-0.015929,-0.000140,-0.008684,-0.002248,-0.004558,1.780001,1.382915,110.302002,-0.008684,-0.002140,-0.002189,0.005186
2021-07-28,0.031797,-0.006865,-0.001117,-0.003594,-0.000410,-1.050001,1.388272,109.806000,-0.001117,-0.008684,-0.004497,0.003873
2021-07-29,-0.002326,0.001129,0.000978,0.004179,0.004147,-0.609999,1.390685,109.890999,0.000978,-0.001117,0.000774,0.001738
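To make the three transformations used above concrete, here is a minimal sketch on a hypothetical three-day price series (the numbers are made up and are not the chapter's data).  It shows how `.pct_change()`, `.diff()`, and `.shift(1)` each leave a missing value at the start of the series, which is why the final `.dropna()` removes the first couple of rows.

```python
import pandas as pd

# Hypothetical toy prices; not the chapter's data.
prices = pd.Series([100.0, 102.0, 99.96], name="price")

returns = prices.pct_change()  # (p_t / p_{t-1}) - 1, used for stocks and currencies
changes = prices.diff()        # p_t - p_{t-1}, used for VIX
lagged = returns.shift(1)      # previous day's return, aligned to the current row

toy = pd.DataFrame({"ret": returns, "chg": changes, "ret_lag_1": lagged})
print(toy)
print(toy.dropna())  # only rows where every column is defined survive
```

Note that `.shift(0)` leaves a series unchanged, so `msft_lag_0` is simply a copy of the current-day `msft` return under a different name.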


## Training Set and Testing Set

We'll train our models on data prior to 2016, and then we'll use data from 2016 onward for testing.  So let's separate out these two subsets of data.

In [6]:
df_train = df_data.query('trade_date < "2016-01-01"')
df_train

trade_date,googl,ibm,msft,dia,spy,vix,gbpusd=x,jpy=x,msft_lag_0,msft_lag_1,dexjpus,dexusuk
2005-01-05,-0.005090,-0.002068,-0.002235,-0.005550,-0.006900,0.110001,1.885512,103.930000,-0.002235,0.003739,-0.003929,0.001018
2005-01-06,-0.025632,-0.003109,-0.001120,0.003121,0.005084,-0.510000,1.876490,104.889999,-0.001120,-0.002235,0.009237,-0.004785
2005-01-07,0.028109,-0.004366,-0.002991,-0.001885,-0.001434,-0.090000,1.871293,104.889999,-0.002991,-0.001120,0.000000,-0.002769
2005-01-10,0.006242,-0.001044,0.004874,0.003401,0.004729,-0.260000,1.876912,104.169998,0.004874,-0.002991,-0.006864,0.003003
2005-01-11,-0.007792,-0.007108,-0.002612,-0.006403,-0.006891,-0.040000,1.878605,103.419998,-0.002612,0.004874,-0.007200,0.000902
...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-24,-0.003474,-0.002093,-0.002687,-0.003356,-0.001650,0.170000,1.487697,120.934998,-0.002687,0.008491,-0.000785,0.003615
2015-12-28,0.021415,-0.004629,0.005029,-0.001370,-0.002285,1.170000,1.493206,120.231003,0.005029,-0.002687,-0.005821,0.003703
2015-12-29,0.014983,0.015769,0.010724,0.011430,0.010672,-0.830000,1.489403,120.349998,0.010724,0.005029,0.000990,-0.002547
2015-12-30,-0.004610,-0.003148,-0.004244,-0.006667,-0.007087,1.210001,1.482228,120.528999,-0.004244,0.010724,0.001487,-0.004817


In [7]:
df_test = df_data.query('trade_date >= "2016-01-01"')
df_test

trade_date,googl,ibm,msft,dia,spy,vix,gbpusd=x,jpy=x,msft_lag_0,msft_lag_1,dexjpus,dexusuk
2016-01-04,-0.023869,-0.012135,-0.012257,-0.015518,-0.013980,2.490002,1.473709,120.310997,-0.012257,-0.014740,-0.001154,-0.005541
2016-01-05,0.002752,-0.000735,0.004562,0.000583,0.001691,-1.360001,1.471410,119.467003,0.004562,-0.012257,-0.007015,-0.001560
2016-01-06,-0.002889,-0.005006,-0.018165,-0.014294,-0.012614,1.250000,1.467394,119.101997,-0.018165,0.004562,-0.003055,-0.002729
2016-01-07,-0.024140,-0.017090,-0.034783,-0.023558,-0.023992,4.400000,1.462994,118.610001,-0.034783,-0.018165,-0.004131,-0.002999
2016-01-08,-0.013617,-0.009258,0.003067,-0.010428,-0.010977,2.020000,1.462694,117.540001,0.003067,-0.034783,-0.009021,-0.000205
...,...,...,...,...,...,...,...,...,...,...,...,...
2021-07-26,0.007668,0.010117,-0.002140,0.002396,0.002455,0.379999,1.375781,110.543999,-0.002140,0.012337,0.003668,-0.001168
2021-07-27,-0.015929,-0.000140,-0.008684,-0.002248,-0.004558,1.780001,1.382915,110.302002,-0.008684,-0.002140,-0.002189,0.005186
2021-07-28,0.031797,-0.006865,-0.001117,-0.003594,-0.000410,-1.050001,1.388272,109.806000,-0.001117,-0.008684,-0.004497,0.003873
2021-07-29,-0.002326,0.001129,0.000978,0.004179,0.004147,-0.609999,1.390685,109.890999,0.000978,-0.001117,0.000774,0.001738
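Because this is time series data, we split chronologically rather than at random, so that the model never sees data from the testing period.  The sketch below illustrates the same idea on a hypothetical toy DataFrame using boolean masks on the index instead of `.query()`; the names `toy`, `toy_train`, and `toy_test` are ours and only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame indexed by weekdays around the split date.
idx = pd.date_range("2015-12-28", "2016-01-06", freq="B", name="trade_date")
toy = pd.DataFrame({"ret": np.linspace(-0.01, 0.01, len(idx))}, index=idx)

split_date = pd.Timestamp("2016-01-01")
toy_train = toy.loc[toy.index < split_date]   # strictly before the split date
toy_test = toy.loc[toy.index >= split_date]   # split date onward

# Every row lands in exactly one subset, and training ends before testing begins.
print(len(toy_train), len(toy_test))
print(toy_train.index.max(), toy_test.index.min())
```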


## Training

In order to train our models, we first put our training features into `X_train` and our training labels into `y_train`.

In [8]:
X_train = df_train.drop(columns=['msft'])[0:len(df_train)-1]
X_train

trade_date,googl,ibm,dia,spy,vix,gbpusd=x,jpy=x,msft_lag_0,msft_lag_1,dexjpus,dexusuk
2005-01-05,-0.005090,-0.002068,-0.005550,-0.006900,0.110001,1.885512,103.930000,-0.002235,0.003739,-0.003929,0.001018
2005-01-06,-0.025632,-0.003109,0.003121,0.005084,-0.510000,1.876490,104.889999,-0.001120,-0.002235,0.009237,-0.004785
2005-01-07,0.028109,-0.004366,-0.001885,-0.001434,-0.090000,1.871293,104.889999,-0.002991,-0.001120,0.000000,-0.002769
2005-01-10,0.006242,-0.001044,0.003401,0.004729,-0.260000,1.876912,104.169998,0.004874,-0.002991,-0.006864,0.003003
2005-01-11,-0.007792,-0.007108,-0.006403,-0.006891,-0.040000,1.878605,103.419998,-0.002612,0.004874,-0.007200,0.000902
...,...,...,...,...,...,...,...,...,...,...,...
2015-12-23,0.001799,0.004422,0.010344,0.012384,-1.030001,1.482338,121.029999,0.008491,0.009484,-0.001452,-0.004936
2015-12-24,-0.003474,-0.002093,-0.003356,-0.001650,0.170000,1.487697,120.934998,-0.002687,0.008491,-0.000785,0.003615
2015-12-28,0.021415,-0.004629,-0.001370,-0.002285,1.170000,1.493206,120.231003,0.005029,-0.002687,-0.005821,0.003703
2015-12-29,0.014983,0.015769,0.011430,0.010672,-0.830000,1.489403,120.349998,0.010724,0.005029,0.000990,-0.002547


Notice that the label we are predicting is the *next* day `msft` return; the features we are using to predict are the *current* day returns of the various correlated assets. 

In [9]:
y_train = df_train[['msft']][1:len(df_train)]
y_train

trade_date,msft
2005-01-06,-0.001120
2005-01-07,-0.002991
2005-01-10,0.004874
2005-01-11,-0.002612
2005-01-12,0.001871
...,...
2015-12-24,-0.002687
2015-12-28,0.005029
2015-12-29,0.010724
2015-12-30,-0.004244
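The one-row offset between features and labels is easy to get wrong, so here is a small sketch of it on a hypothetical four-day DataFrame.  It compares the slicing approach used above with an alternative that uses `.shift(-1)`; the toy values and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy returns; not the chapter's data.
toy = pd.DataFrame(
    {"msft": [0.01, -0.02, 0.03, 0.00], "spy": [0.005, -0.01, 0.02, 0.001]},
    index=pd.date_range("2021-01-04", periods=4, freq="B", name="trade_date"),
)

# Slicing, as in the chapter: drop the last feature row and the first label row.
X_slice = toy.iloc[:-1]
y_slice = toy["msft"].iloc[1:]

# Alternative: shift the label up one row so it carries the feature row's date.
y_shift = toy["msft"].shift(-1).dropna()

print(X_slice.index[0], y_slice.index[0])  # features are dated one day before labels
print((y_slice.to_numpy() == y_shift.to_numpy()).all())  # same values either way
```

With slicing, the features and labels carry different dates on the same row position.  This is fine for scikit-learn, which pairs rows by position rather than by index, but it is worth keeping in mind when joining predictions back onto the original DataFrame.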


### Linear Regression

Let's first fit a simple linear regression to our training data.

In [10]:
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

Recall that the `.score()` of a Linear Regression gives the $R^2$.

In [11]:
print("LR R^2:", linear_regression.score(X_train, y_train))

LR R^2: 0.01795965729568516
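As a reminder of what `.score()` computes, the sketch below reproduces $R^2 = 1 - SS_{res}/SS_{tot}$ by hand on synthetic data and compares it with the value reported by scikit-learn.  The data and coefficients here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: three features, one of which has no effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.0]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)     # squared error of the model
ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))   # the two values agree
```

Read this way, an in-sample $R^2$ of about 0.018 means our regression reduces squared error by less than 2% relative to always predicting the average return.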


We can also examine the coefficients of our model.

In [12]:
np.round(linear_regression.coef_, 3)

array([[ 0.002, -0.016,  0.214, -0.328,  0.   , -0.003,  0.   ,  0.012,
        -0.048, -0.002, -0.002]])
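The coefficients are returned in the same order as the columns of `X_train`, so the entries 0.214 and -0.328 above belong to `dia` and `spy`.  A raw array is hard to read, though, so one option is to wrap the coefficients in a `pd.Series` indexed by feature name.  The sketch below does this on synthetic data; the column names are borrowed from the chapter but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features and label; only the column names mirror the chapter.
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["spy", "dia", "vix"])
noise = rng.normal(scale=0.1, size=300)
y_toy = pd.DataFrame({"msft": 0.3 * X_toy["spy"] - 0.1 * X_toy["vix"] + noise})

model = LinearRegression().fit(X_toy, y_toy)

# With a 2-D label, coef_ has shape (1, n_features), so flatten it first.
coefs = pd.Series(model.coef_.ravel(), index=X_toy.columns).round(3)
print(coefs)
```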

--- 

**Code Challenge:** Implement a `Lasso` regression:

1. Experiment with the `alpha` parameter.
1. Examine the coefficients (`.coef_` attribute) of the model.

Does using a Lasso over a Linear Regression seem like a good idea?

In [13]:
#| code-fold: true
#| code-summary: "Solution"
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(lasso.score(X_train, y_train))
print(lasso.coef_)

# The model is very sensitive to `alpha`: anything but tiny values
# shrinks all of the coefficients to zero.  Moreover, the R^2 does not
# improve, so Lasso does not appear to be an improvement over
# linear regression here.

0.0
[-0. -0. -0. -0.  0. -0. -0. -0. -0. -0. -0.]


---
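One likely reason that `alpha=0.1` zeroes out every coefficient is scale: daily returns are on the order of 0.01, so the covariance between a feature and the label is tiny compared with the penalty.  The sketch below sweeps `alpha` on synthetic return-sized data and also computes the smallest `alpha` at which all coefficients vanish, using scikit-learn's Lasso objective $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$.  The data are made up, so the specific thresholds will differ from the chapter's data.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic return-sized data: only the first feature matters.
rng = np.random.default_rng(2)
n = 500
X = rng.normal(scale=0.01, size=(n, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.01, size=n)

nonzero = {}
for alpha in [1e-1, 1e-3, 1e-5, 1e-7]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    nonzero[alpha] = int(np.count_nonzero(lasso.coef_))
    print(f"alpha={alpha:g}: nonzero coefficients = {nonzero[alpha]}")

# Smallest alpha for which every coefficient is zero (model with intercept).
Xc, yc = X - X.mean(axis=0), y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / n
print("alpha_max:", alpha_max)
```

If the goal were to keep using Lasso, a cross-validated search over very small `alpha` values (for example with `LassoCV`) would be a more systematic way to experiment than trying values by hand.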

### KNN

Next, let's fit a KNN model to our data.  

In [14]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X_train, y_train)
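KNN finds neighbors by distance, so features on a large scale dominate features on a small scale.  In our `X_train`, the `jpy=x` level is around 100 while the return columns are around 0.01, which suggests the neighbors may be chosen almost entirely by the currency levels.  The sketch below illustrates this effect on synthetic data and shows one common remedy, standardizing the features inside a pipeline.  This is an illustration under made-up data, not a claim about how scaling would change the results in this chapter.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

def make_toy(n):
    """Synthetic data in which y depends only on the small-scale feature."""
    small = rng.normal(scale=0.01, size=n)             # return-sized feature
    large = rng.normal(loc=110.0, scale=10.0, size=n)  # level-sized, unrelated to y
    y = 2.0 * small + rng.normal(scale=0.002, size=n)
    return np.column_stack([small, large]), y

X_tr, y_tr = make_toy(400)
X_te, y_te = make_toy(400)

knn_raw = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=10)
).fit(X_tr, y_tr)

score_raw = knn_raw.score(X_te, y_te)        # neighbors chosen by the large feature
score_scaled = knn_scaled.score(X_te, y_te)  # both features contribute to distance
print("raw:", score_raw, "scaled:", score_scaled)
```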

As you can see, the in-sample $R^2$ is higher for KNN than for linear regression.

In [15]:
print("KNN R^2:", knn.score(X_train, y_train))

KNN R^2: 0.12123948996547762
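A higher in-sample $R^2$ for KNN should be read with caution, because each training point counts as one of its own neighbors.  The sketch below fits KNN to pure noise, where there is nothing to learn, and reports in-sample and out-of-sample scores for a few values of `n_neighbors`.  The data are synthetic; with `n_neighbors=1` the in-sample score is perfect, and for noise the in-sample score is roughly $1/k$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Pure noise: the labels are independent of the features.
rng = np.random.default_rng(4)
X_tr, X_te = rng.normal(size=(500, 4)), rng.normal(size=(500, 4))
y_tr, y_te = rng.normal(size=500), rng.normal(size=500)

scores = {}
for k in [1, 10, 100]:
    knn_toy = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (knn_toy.score(X_tr, y_tr), knn_toy.score(X_te, y_te))
    print(f"k={k}: in-sample R^2 = {scores[k][0]:.3f}, "
          f"out-of-sample R^2 = {scores[k][1]:.3f}")
```

Seen in this light, an in-sample $R^2$ near 0.12 with `n_neighbors=10` is close to what we might get from noise alone, so the testing data is the more meaningful comparison.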


### Mean-Squared Error

Another goodness-of-fit metric is the mean squared error (MSE).  As you can see, the models are close on this metric.

In [16]:
from sklearn.metrics import mean_squared_error # import the metric explicitly
print("LR MSE: ",
      mean_squared_error(y_train, linear_regression.predict(X_train)))
print("KNN MSE:",
      mean_squared_error(y_train, knn.predict(X_train)))

LR MSE:  0.00029402930819746436
KNN MSE: 0.0002631066501047995
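MSE values for daily returns are hard to interpret because they are in squared-return units.  Taking the square root gives the root mean squared error (RMSE), which is in the same units as the returns: the values above correspond to roughly 0.017 for linear regression and 0.016 for KNN, that is, typical errors of about 1.7% and 1.6% per day.  The sketch below computes MSE by hand on a made-up example and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up daily returns and a naive forecast of zero for every day.
y_true = np.array([0.010, -0.020, 0.005, 0.000])
y_pred = np.array([0.000, 0.000, 0.000, 0.000])

mse_manual = np.mean((y_true - y_pred) ** 2)   # average squared error
mse_sklearn = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse_manual)                     # back in return units

print(mse_manual, mse_sklearn, rmse)
```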


## Testing

Let's now test the models on the data from 2016 onward.

In [17]:
X_test = df_test.drop(columns=['msft'])[0:len(df_test)-1]
X_test

trade_date,googl,ibm,dia,spy,vix,gbpusd=x,jpy=x,msft_lag_0,msft_lag_1,dexjpus,dexusuk
2016-01-04,-0.023869,-0.012135,-0.015518,-0.013980,2.490002,1.473709,120.310997,-0.012257,-0.014740,-0.001154,-0.005541
2016-01-05,0.002752,-0.000735,0.000583,0.001691,-1.360001,1.471410,119.467003,0.004562,-0.012257,-0.007015,-0.001560
2016-01-06,-0.002889,-0.005006,-0.014294,-0.012614,1.250000,1.467394,119.101997,-0.018165,0.004562,-0.003055,-0.002729
2016-01-07,-0.024140,-0.017090,-0.023558,-0.023992,4.400000,1.462994,118.610001,-0.034783,-0.018165,-0.004131,-0.002999
2016-01-08,-0.013617,-0.009258,-0.010428,-0.010977,2.020000,1.462694,117.540001,0.003067,-0.034783,-0.009021,-0.000205
...,...,...,...,...,...,...,...,...,...,...,...
2021-07-23,0.035769,0.004477,0.006633,0.010288,-0.490000,1.377390,110.139999,0.012337,0.016844,-0.001043,0.004365
2021-07-26,0.007668,0.010117,0.002396,0.002455,0.379999,1.375781,110.543999,-0.002140,0.012337,0.003668,-0.001168
2021-07-27,-0.015929,-0.000140,-0.002248,-0.004558,1.780001,1.382915,110.302002,-0.008684,-0.002140,-0.002189,0.005186
2021-07-28,0.031797,-0.006865,-0.003594,-0.000410,-1.050001,1.388272,109.806000,-0.001117,-0.008684,-0.004497,0.003873


In [18]:
y_test = df_test[['msft']][1:len(df_test)]
y_test

trade_date,msft
2016-01-05,0.004562
2016-01-06,-0.018165
2016-01-07,-0.034783
2016-01-08,0.003067
2016-01-11,-0.000574
...,...
2021-07-26,-0.002140
2021-07-27,-0.008684
2021-07-28,-0.001117
2021-07-29,0.000978


In terms of $R^2$, the Linear Regression performs better than KNN on the testing data.

In [19]:
print("LR R^2: ", linear_regression.score(X_test, y_test))
print("KNN R^2:", knn.score(X_test, y_test))

LR R^2:  0.03710153632078805
KNN R^2: -0.022695602656459313
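A negative $R^2$ means the model's squared error on the testing data is larger than that of simply predicting the mean of the testing labels.  A useful reference point is a baseline that always predicts the training mean, which scikit-learn provides as `DummyRegressor`.  The sketch below shows, on synthetic data, that such a baseline scores zero in-sample and at most zero out-of-sample; the data are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic return-sized labels; the features are ignored by the baseline.
rng = np.random.default_rng(5)
X_tr, X_te = rng.normal(size=(300, 2)), rng.normal(size=(300, 2))
y_tr = rng.normal(loc=0.001, scale=0.01, size=300)
y_te = rng.normal(loc=0.000, scale=0.01, size=300)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
r2_in = baseline.score(X_tr, y_tr)   # zero: predicting the mean explains nothing
r2_out = baseline.score(X_te, y_te)  # at most zero: the training mean is not the testing mean

print(r2_in, r2_out)
```

Any model we would consider trading on should at least beat this baseline on the testing data.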


On the testing data, the models are again quite similar from a mean squared error perspective.

In [20]:
from sklearn.metrics import mean_squared_error # import the metric explicitly
print("LR MSE: ",
      mean_squared_error(y_test, linear_regression.predict(X_test)))
print("KNN MSE:",
      mean_squared_error(y_test, knn.predict(X_test)))

LR MSE:  0.0002819204831847927
KNN MSE: 0.000299428080246605
