# Predicting Stock Returns

Our objective in this tutorial is to predict stock returns using linear regression and k-nearest neighbors.  Specifically, we will try to predict the daily returns of MSFT from the returns of various correlated assets including stock indices, currencies, and other stocks.

In the related homework, you will test a simple trading strategy that is based on our predictions and see how it performs relative to a buy-and-hold strategy.

### Import Packages

Let's begin by loading the packages that we will need.

In [1]:
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import sklearn.metrics # ensures sklearn.metrics.mean_squared_error is available below

### Load Data

Next, let's load our data.  We will start with the stocks, whose data we will get from Yahoo.

In [2]:
stock_tickers = ['MSFT', 'IBM', 'GOOGL'] # define tickers
df_stock = pdr.get_data_yahoo(stock_tickers, start='2005-01-01', end='2021-07-31') # grab the data
df_stock = df_stock['Adj Close'] # select only the adjusted close price
df_stock.columns = df_stock.columns.str.lower() # clean-up column names
df_stock.rename_axis('trade_date', inplace=True) # clean-up index name
df_stock.rename_axis('', axis=1, inplace=True) # clean-up index name
df_stock

trade_date,msft,ibm,googl
2005-01-03,19.108891,62.729687,101.456459
2005-01-04,19.180344,62.055878,97.347343
2005-01-05,19.137478,61.927532,96.851852
2005-01-06,19.116039,61.735004,94.369370
2005-01-07,19.058872,61.465485,97.022018
...,...,...,...
2021-07-26,289.049988,142.770004,2680.699951
2021-07-27,286.540009,142.750000,2638.000000
2021-07-28,286.220001,141.770004,2721.879883
2021-07-29,286.500000,141.929993,2715.550049


Next we'll grab the currency data from FRED.

In [3]:
currency_tickers = ['DEXJPUS', 'DEXUSUK']
df_currency = pdr.get_data_fred(currency_tickers, start='2005-01-01', end='2021-07-31')
df_currency.columns = df_currency.columns.str.lower()
df_currency.rename_axis('trade_date', inplace=True)
df_currency.rename_axis('', axis=1, inplace=True)
df_currency

trade_date,dexjpus,dexusuk
2005-01-03,102.83,1.9058
2005-01-04,104.27,1.8834
2005-01-05,103.95,1.8875
2005-01-06,104.87,1.8751
2005-01-07,104.93,1.8702
...,...,...
2021-07-26,110.31,1.3829
2021-07-27,109.64,1.3884
2021-07-28,110.06,1.3884
2021-07-29,109.53,1.3966


Finally, we'll grab the index data from Yahoo.

In [4]:
index_tickers = ['SPY', 'DIA', '^VIX'] 
df_index = pdr.get_data_yahoo(index_tickers, start='2005-01-01', end='2021-07-31')
df_index = df_index['Adj Close']
df_index.columns = df_index.columns.str.lower().str.replace('^', '', regex=False) # strip the literal caret from '^vix'
df_index.rename_axis('trade_date', inplace=True)
df_index.rename_axis('', axis=1, inplace=True)
df_index

trade_date,spy,dia,vix
2005-01-03,86.852043,72.892433,14.080000
2005-01-04,85.790764,72.199684,13.980000
2005-01-05,85.198761,71.798958,14.090000
2005-01-06,85.631920,72.023048,13.580000
2005-01-07,85.509186,71.887222,13.490000
...,...,...,...
2021-07-26,441.019989,351.410004,17.580000
2021-07-27,439.010010,350.619995,19.360001
2021-07-28,438.829987,349.359985,18.309999
2021-07-29,440.649994,350.820007,17.700001


### Join and Clean Data

Now we can join our price data and convert it into returns (and into day-over-day differences for VIX), since these are closer to stationary than price levels.  Notice that we are implicitly adding a time series component to our regression by including the current and one-day-lagged `msft` returns (`msft_lag_0` and `msft_lag_1`) as features.

In [5]:
df_data = \
    (
    df_stock
        .merge(df_index, how='left', left_index=True, right_index=True) # join index data
        .merge(df_currency, how='left', left_index=True, right_index=True) # join currency data
        .dropna()
        .assign(msft = lambda df: df['msft'].pct_change())   # daily percent change (return)
        .assign(msft_lag_0 = lambda df: df['msft'].shift(0)) # same-day msft return, kept as a feature
        .assign(msft_lag_1 = lambda df: df['msft'].shift(1)) # previous day's msft return
        .assign(ibm = lambda df: df['ibm'].pct_change())     # daily percent change
        .assign(googl = lambda df: df['googl'].pct_change()) # daily percent change
        .assign(spy = lambda df: df['spy'].pct_change())     # daily percent change
        .assign(dia = lambda df: df['dia'].pct_change())     # daily percent change
        .assign(vix = lambda df: df['vix'].diff())           # absolute day-over-day change
        .assign(dexjpus = lambda df: df['dexjpus'].pct_change()) # daily percent change
        .assign(dexusuk = lambda df: df['dexusuk'].pct_change()) # daily percent change
        .dropna()
    )
df_data

trade_date,msft,ibm,googl,spy,dia,vix,dexjpus,dexusuk,msft_lag_0,msft_lag_1
2005-01-05,-0.002235,-0.002068,-0.005090,-0.006901,-0.005550,0.110001,-0.003069,0.002177,-0.002235,0.003739
2005-01-06,-0.001120,-0.003109,-0.025632,0.005084,0.003121,-0.510000,0.008850,-0.006570,-0.001120,-0.002235
2005-01-07,-0.002991,-0.004366,0.028109,-0.001433,-0.001886,-0.090000,0.000572,-0.002613,-0.002991,-0.001120
2005-01-10,0.004874,-0.001044,0.006242,0.004728,0.003402,-0.260000,-0.005813,0.002620,0.004874,-0.002991
2005-01-11,-0.002612,-0.007107,-0.007792,-0.006891,-0.006404,-0.040000,-0.008627,0.002400,-0.002612,0.004874
...,...,...,...,...,...,...,...,...,...,...
2021-07-26,-0.002140,0.010118,0.007668,0.002455,0.002396,0.379999,-0.001900,0.005819,-0.002140,0.012337
2021-07-27,-0.008684,-0.000140,-0.015929,-0.004558,-0.002248,1.780001,-0.006074,0.003977,-0.008684,-0.002140
2021-07-28,-0.001117,-0.006865,0.031797,-0.000410,-0.003594,-1.050001,0.003831,0.000000,-0.001117,-0.008684
2021-07-29,0.000978,0.001129,-0.002326,0.004147,0.004179,-0.609999,-0.004816,0.005906,0.000978,-0.001117
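To make the mechanics of the cell above easier to follow, here is a minimal sketch of the same pattern on made-up numbers.  The names `toy_stock`, `toy_vol`, and `toy` are hypothetical and the values are invented for illustration; only the chain of `merge`, `dropna`, `pct_change`, `shift`, and `diff` mirrors what we did with the real data.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-08"]
)
toy_stock = pd.DataFrame({"stock": [100.0, 102.0, 101.0, 104.0, 103.0]}, index=dates)
# the second table has no observation for 2021-01-05 (e.g. a calendar mismatch)
toy_vol = pd.DataFrame({"vol": [14.0, 15.0, 14.2, 14.5]}, index=dates[[0, 2, 3, 4]])

toy = (
    toy_stock
    .merge(toy_vol, how="left", left_index=True, right_index=True)  # NaN where vol is missing
    .dropna()                                                       # drop the incomplete date
    .assign(stock=lambda df: df["stock"].pct_change())              # price level -> return
    .assign(stock_lag_1=lambda df: df["stock"].shift(1))            # previous row's return
    .assign(vol=lambda df: df["vol"].diff())                        # level -> absolute change
    .dropna()                                                       # drop rows made NaN by the changes and lag
)
print(toy)
```

Notice that the first `dropna()` runs before the returns are computed, so the return on the row after a dropped date spans more than one day.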


### Training Set and Testing Set

We'll train our models on data prior to 2016, and then we'll use data from 2016 onward for testing.  So let's separate out these two subsets of data.

In [6]:
df_train = df_data.query('trade_date < "2016-01-01"')
df_train

trade_date,msft,ibm,googl,spy,dia,vix,dexjpus,dexusuk,msft_lag_0,msft_lag_1
2005-01-05,-0.002235,-0.002068,-0.005090,-0.006901,-0.005550,0.110001,-0.003069,0.002177,-0.002235,0.003739
2005-01-06,-0.001120,-0.003109,-0.025632,0.005084,0.003121,-0.510000,0.008850,-0.006570,-0.001120,-0.002235
2005-01-07,-0.002991,-0.004366,0.028109,-0.001433,-0.001886,-0.090000,0.000572,-0.002613,-0.002991,-0.001120
2005-01-10,0.004874,-0.001044,0.006242,0.004728,0.003402,-0.260000,-0.005813,0.002620,0.004874,-0.002991
2005-01-11,-0.002612,-0.007107,-0.007792,-0.006891,-0.006404,-0.040000,-0.008627,0.002400,-0.002612,0.004874
...,...,...,...,...,...,...,...,...,...,...
2015-12-24,-0.002687,-0.002093,-0.003474,-0.001651,-0.003356,0.170000,-0.005127,0.005382,-0.002687,0.008491
2015-12-28,0.005030,-0.004629,0.021414,-0.002285,-0.001370,1.170000,-0.000166,-0.004015,0.005030,-0.002687
2015-12-29,0.010724,0.015769,0.014983,0.010672,0.011430,-0.830000,0.001164,-0.005980,0.010724,0.005030
2015-12-30,-0.004244,-0.003148,-0.004610,-0.007088,-0.006667,1.210001,0.001328,0.002569,-0.004244,0.010724


In [7]:
df_test = df_data.query('trade_date >= "2016-01-01"') # 2016 onward, matching the description above
df_test

trade_date,msft,ibm,googl,spy,dia,vix,dexjpus,dexusuk,msft_lag_0,msft_lag_1
2016-01-04,-0.012257,-0.012135,-0.023869,-0.013979,-0.015518,2.490002,-0.008065,-0.004069,-0.012257,-0.014740
2016-01-05,0.004562,-0.000735,0.002752,0.001691,0.000584,-1.360001,-0.002934,-0.001498,0.004562,-0.012257
2016-01-06,-0.018165,-0.005006,-0.002889,-0.012614,-0.014295,1.250000,-0.003447,-0.002591,-0.018165,0.004562
2016-01-07,-0.034783,-0.017090,-0.024140,-0.023991,-0.023558,4.400000,-0.004555,-0.003213,-0.034783,-0.018165
2016-01-08,0.003067,-0.009258,-0.013617,-0.010977,-0.010427,2.020000,-0.002203,-0.003841,0.003067,-0.034783
...,...,...,...,...,...,...,...,...,...,...
2021-07-26,-0.002140,0.010118,0.007668,0.002455,0.002396,0.379999,-0.001900,0.005819,-0.002140,0.012337
2021-07-27,-0.008684,-0.000140,-0.015929,-0.004558,-0.002248,1.780001,-0.006074,0.003977,-0.008684,-0.002140
2021-07-28,-0.001117,-0.006865,0.031797,-0.000410,-0.003594,-1.050001,0.003831,0.000000,-0.001117,-0.008684
2021-07-29,0.000978,0.001129,-0.002326,0.004147,0.004179,-0.609999,-0.004816,0.005906,0.000978,-0.001117
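As an aside, a date-based split can also be written with label slicing on the `DatetimeIndex`.  The sketch below uses a hypothetical `toy` frame with made-up values; it is an alternative to `.query()`, not what the cells above use.  The point is that the two subsets are disjoint and ordered in time, so the testing set never contains dates that precede the training set.

```python
import numpy as np
import pandas as pd

# business days around the turn of the year (holidays are not removed here)
idx = pd.bdate_range("2015-12-28", "2016-01-06")
toy = pd.DataFrame({"ret": np.arange(len(idx)) / 100}, index=idx).rename_axis("trade_date")

train = toy.loc[:"2015-12-31"]   # everything up to the end of 2015
test = toy.loc["2016-01-01":]    # everything from 2016 onward

print(len(train), len(test))
```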


### Training (Fitting the Models)

To train our models, we first put our training features into `X_train` and our training labels into `y_train`.

In [8]:
X_train = df_train.drop(columns=['msft'])[0:len(df_train)-1]
X_train

trade_date,ibm,googl,spy,dia,vix,dexjpus,dexusuk,msft_lag_0,msft_lag_1
2005-01-05,-0.002068,-0.005090,-0.006901,-0.005550,0.110001,-0.003069,0.002177,-0.002235,0.003739
2005-01-06,-0.003109,-0.025632,0.005084,0.003121,-0.510000,0.008850,-0.006570,-0.001120,-0.002235
2005-01-07,-0.004366,0.028109,-0.001433,-0.001886,-0.090000,0.000572,-0.002613,-0.002991,-0.001120
2005-01-10,-0.001044,0.006242,0.004728,0.003402,-0.260000,-0.005813,0.002620,0.004874,-0.002991
2005-01-11,-0.007107,-0.007792,-0.006891,-0.006404,-0.040000,-0.008627,0.002400,-0.002612,0.004874
...,...,...,...,...,...,...,...,...,...
2015-12-23,0.004423,0.001799,0.012383,0.010344,-1.030001,0.000000,0.003511,0.008491,0.009484
2015-12-24,-0.002093,-0.003474,-0.001651,-0.003356,0.170000,-0.005127,0.005382,-0.002687,0.008491
2015-12-28,-0.004629,0.021414,-0.002285,-0.001370,1.170000,-0.000166,-0.004015,0.005030,-0.002687
2015-12-29,0.015769,0.014983,0.010672,0.011430,-0.830000,0.001164,-0.005980,0.010724,0.005030


Notice that the label we are predicting is the *next* day `msft` return; the features we are using to predict are the *current* day returns of the various correlated assets. 

In [9]:
y_train = df_train[['msft']][1:len(df_train)]
y_train

trade_date,msft
2005-01-06,-0.001120
2005-01-07,-0.002991
2005-01-10,0.004874
2005-01-11,-0.002612
2005-01-12,0.001871
...,...
2015-12-24,-0.002687
2015-12-28,0.005030
2015-12-29,0.010724
2015-12-30,-0.004244
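The one-row offset between features and labels is easy to get wrong, so here is a small sketch on a hypothetical four-row frame (made-up values) that checks the alignment.  Dropping the last row of the features and the first row of the labels pairs each day's features with the following day's `msft` return, which is the same thing as shifting the label column up by one row.

```python
import pandas as pd

toy = pd.DataFrame(
    {"msft": [0.01, -0.02, 0.03, 0.00], "spy": [0.005, -0.01, 0.02, 0.001]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
)
toy["msft_lag_0"] = toy["msft"]  # same-day return kept as a feature

X = toy.drop(columns=["msft"]).iloc[:-1]  # features for days 1..n-1
y = toy[["msft"]].iloc[1:]                # labels for days 2..n

# equivalent view of the labels: next-day return aligned to the current day
y_alt = toy["msft"].shift(-1).dropna()

print(X.index[0].date(), "->", y.index[0].date())
```

Notice that `X` and `y` carry different dates in their indexes.  scikit-learn pairs rows by position rather than by index label, so this is fine for fitting, but it matters if you later join predictions back to dates.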


#### Linear Regression

Let's first fit a simple linear regression to our training data.

In [10]:
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

LinearRegression()

Recall that the `.score()` of a Linear Regression gives the $R^2$.

In [11]:
linear_regression.score(X_train, y_train)

0.017817633697614133
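As a reminder of what this number means, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the labels from their mean.  The sketch below checks this against `.score()` on synthetic data; the arrays `X_toy` and `y_toy` and the coefficients used to generate them are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# weak linear signal plus noise, so R^2 is well below 1
y_toy = X_toy @ np.array([0.5, -0.2, 0.0]) + rng.normal(size=200)

model = LinearRegression().fit(X_toy, y_toy)

ss_res = np.sum((y_toy - model.predict(X_toy)) ** 2)  # unexplained variation
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)          # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_toy, y_toy))
```

An $R^2$ near zero, like the one above for our `msft` model, says the fitted model explains only a small fraction of the variation in next-day returns.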

We can also examine the coefficients of our model.

In [12]:
np.round(linear_regression.coef_, 3)

array([[-0.027,  0.004, -0.432,  0.288,  0.   ,  0.105, -0.006,  0.031,
        -0.027]])

**Code Challenge:** Implement a `Lasso` regression:

1. Experiment with the `alpha` parameter.  
1. Examine the coefficients (`.coef_` attribute) of the model.

Does using a Lasso over a Linear Regression seem like a good idea?

In [13]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(lasso.score(X_train, y_train))
print(lasso.coef_)

# The model seems very sensitive to `alpha`; anything but tiny values makes all the coefficients zero.
# Moreover, the R^2 doesn't seem to improve, so Lasso isn't an improvement over linear regression.

0.0
[-0. -0. -0. -0.  0.  0. -0. -0. -0.]
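One way to see why `alpha=0.1` zeroes everything out: scikit-learn's `Lasso` minimizes the squared error divided by twice the number of samples, plus `alpha` times the sum of absolute coefficients.  Daily returns are on the order of 0.01, so the squared-error term is tiny and even a modest `alpha` dominates it.  The sketch below sweeps `alpha` on synthetic return-sized data; `X_toy`, `y_toy`, and the generating coefficient are assumptions for illustration, not our actual features.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_obs = 500
# synthetic features and target on the scale of daily returns (about 1%)
X_toy = rng.normal(scale=0.01, size=(n_obs, 4))
y_toy = 0.5 * X_toy[:, 0] + rng.normal(scale=0.005, size=n_obs)

nonzero_by_alpha = {}
for alpha in (1e-6, 1e-5, 1e-4, 1e-1):
    model = Lasso(alpha=alpha).fit(X_toy, y_toy)
    nonzero_by_alpha[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha:g}  nonzero coefficients={nonzero_by_alpha[alpha]}")
```

When experimenting on the real data, it is probably worth trying values of `alpha` several orders of magnitude below 0.1.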


#### KNN

Next, let's fit a KNN regressor to our training data.  As you can see, the in-sample $R^2$ is higher for KNN than for Linear Regression.

In [14]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X_train, y_train)
knn.score(X_train, y_train)

0.10833769997186615
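A word of caution about in-sample scores for KNN: when we predict on the training data, each point counts among its own nearest neighbors, so its own label is part of the average.  A back-of-the-envelope argument suggests that this alone produces an in-sample $R^2$ of roughly $1/k$ even when the labels are pure noise.  The sketch below illustrates the effect on synthetic data where there is nothing to learn; the arrays and the choice of `k` values are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
# features and labels are independent noise, so there is no real signal
X_fit, y_fit = rng.normal(size=(500, 3)), rng.normal(size=500)
X_new, y_new = rng.normal(size=(500, 3)), rng.normal(size=500)

scores = {}
for k in (1, 10, 100):
    knn_toy = KNeighborsRegressor(n_neighbors=k).fit(X_fit, y_fit)
    scores[k] = (knn_toy.score(X_fit, y_fit), knn_toy.score(X_new, y_new))
    print(f"k={k:<3d} in-sample R^2={scores[k][0]: .3f}  out-of-sample R^2={scores[k][1]: .3f}")
```

With `k=1` every training point is its own nearest neighbor, so the in-sample fit is perfect even though the model has learned nothing.  This is why the out-of-sample comparison later in the tutorial is the one that matters.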

#### Mean-Squared Error

Another goodness-of-fit metric is the mean squared error (MSE).  On the training data the two models are close on this metric, with KNN slightly lower.

In [15]:
sklearn.metrics.mean_squared_error(y_train, linear_regression.predict(X_train))

0.0002835384335823397

In [16]:
sklearn.metrics.mean_squared_error(y_train, knn.predict(X_train))

0.0002574069139381872
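MSE and $R^2$ are closely linked: MSE is the average squared residual, and $R^2 = 1 - MSE / Var(y)$ when the variance is computed with the same labels.  Our numbers are consistent with this, since both in-sample (MSE, $R^2$) pairs imply a label variance of about 0.00029.  The sketch below verifies both relationships on synthetic data; `y_true` and `y_pred` are made-up arrays, not our model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
y_true = rng.normal(scale=0.017, size=1000)                   # return-sized labels
y_pred = 0.1 * y_true + rng.normal(scale=0.002, size=1000)    # weak, noisy predictions

mse_manual = np.mean((y_true - y_pred) ** 2)     # average squared residual
mse_sklearn = mean_squared_error(y_true, y_pred)
r2_from_mse = 1 - mse_manual / np.var(y_true)    # np.var uses the population variance by default

print(mse_manual, mse_sklearn)
print(r2_from_mse, r2_score(y_true, y_pred))
```

Because returns are small numbers, their MSE is tiny in absolute terms, so differences between models are easier to read from $R^2$ than from MSE.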

### Testing the Model

Let's now test the models with the data from 2016 onward.

In [17]:
X_test = df_test.drop(columns=['msft'])[0:len(df_test)-1]
X_test

trade_date,ibm,googl,spy,dia,vix,dexjpus,dexusuk,msft_lag_0,msft_lag_1
2016-01-04,-0.012135,-0.023869,-0.013979,-0.015518,2.490002,-0.008065,-0.004069,-0.012257,-0.014740
2016-01-05,-0.000735,0.002752,0.001691,0.000584,-1.360001,-0.002934,-0.001498,0.004562,-0.012257
2016-01-06,-0.005006,-0.002889,-0.012614,-0.014295,1.250000,-0.003447,-0.002591,-0.018165,0.004562
2016-01-07,-0.017090,-0.024140,-0.023991,-0.023558,4.400000,-0.004555,-0.003213,-0.034783,-0.018165
2016-01-08,-0.009258,-0.013617,-0.010977,-0.010427,2.020000,-0.002203,-0.003841,0.003067,-0.034783
...,...,...,...,...,...,...,...,...,...
2021-07-23,0.004477,0.035769,0.010288,0.006633,-0.490000,0.003815,-0.000073,0.012337,0.016844
2021-07-26,0.010118,0.007668,0.002455,0.002396,0.379999,-0.001900,0.005819,-0.002140,0.012337
2021-07-27,-0.000140,-0.015929,-0.004558,-0.002248,1.780001,-0.006074,0.003977,-0.008684,-0.002140
2021-07-28,-0.006865,0.031797,-0.000410,-0.003594,-1.050001,0.003831,0.000000,-0.001117,-0.008684


In [18]:
y_test = df_test[['msft']][1:len(df_test)]
y_test

trade_date,msft
2016-01-05,0.004562
2016-01-06,-0.018165
2016-01-07,-0.034783
2016-01-08,0.003067
2016-01-11,-0.000573
...,...
2021-07-26,-0.002140
2021-07-27,-0.008684
2021-07-28,-0.001117
2021-07-29,0.000978


In terms of $R^2$, the Linear Regression performs better than KNN on the testing data.

In [19]:
linear_regression.score(X_test, y_test)

0.02720698181687098

In [20]:
knn.score(X_test, y_test)

0.00854090187736889
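One possible contributor to KNN's weaker out-of-sample fit is feature scale.  KNN measures closeness with Euclidean distance by default, and our `vix` feature is an absolute change (often larger than 1) while the other features are returns on the order of 0.01, so `vix` may dominate the neighbor search.  The sketch below shows the general effect on synthetic data, using a `StandardScaler` in a pipeline.  The data-generating setup is an assumption chosen to make the effect visible; nothing in this tutorial establishes that scaling helps on the real data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_toy(n):
    small = rng.normal(scale=0.01, size=n)  # return-sized feature that drives the label
    big = rng.normal(scale=1.0, size=n)     # large-scale feature unrelated to the label
    X = np.column_stack([small, big])
    y = small + rng.normal(scale=0.001, size=n)
    return X, y

X_fit, y_fit = make_toy(1000)
X_new, y_new = make_toy(1000)

raw_knn = KNeighborsRegressor(n_neighbors=10).fit(X_fit, y_fit)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)).fit(X_fit, y_fit)

raw_score = raw_knn.score(X_new, y_new)        # neighbors chosen mostly by the large feature
scaled_score = scaled_knn.score(X_new, y_new)  # both features contribute comparably
print(raw_score, scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking information from the testing set.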

On the testing data, the models are again quite similar from a mean squared error perspective.

In [21]:
sklearn.metrics.mean_squared_error(y_test, linear_regression.predict(X_test))

0.00028311336436085083

In [22]:
sklearn.metrics.mean_squared_error(y_test, knn.predict(X_test))

0.000288545780704639