# <span style='color:red'>Quantitative Investing with Python</span>

### Professor Juhani Linnainmaa

Dartmouth College (George J. Records Professor of Investments) and Kepos Capital (Co-Director of Research)

*First version:* January 26, 2024

--- 

# **Topic 6:** Machine Learning in Quantitative Finance--Linear Models

### Goal

We want to create a model that predicts monthly stock returns using

- Past monthly returns
- Size (market capitalization)
- Book-to-market ratio
- Asset growth (investment)
- Gross profitability

The goal is to get these predictions and then see how well we would do by buying stocks with high predicted returns and selling those with low predicted returns.


### How

- Work with 500 randomly selected stocks
  - We use the same CRSP/Compustat database we constructed in Topic #4


- We divide the sample into three parts: 
  1. training sample
  2. validation sample, and 
  3. testing sample
   
   
  In this topic we won't touch the testing sample
  
  
- We **train** different models using the training sample
  - We use the *validation* sample to compare different models
  
  
- Once we are satisfied that we have made a reasonable choice, we will use the testing sample


### Plan:

1. Create the sample by selecting stocks that appear somewhere in the training, validation, and testing samples we choose


2. Define **target** variable (next month's return) and **features**


3. Linear models:
   - Linear regression
   - Ridge regression
     - **What is it?**
     - Train the model with cross validation

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

import statsmodels.api as sm

from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

## Step 1: Constructing sample

Construct a random sample of 500 stocks (`number_of_stocks`), starting from the CRSP-Compustat file we created before

In [None]:
# parameters
number_of_stocks = 500

# sample dates
first_train_date, last_train_date = '1963-06', '1990-12'
first_validation_date, last_validation_date = '1991-01', '2000-12'
first_test_date, last_test_date = '2001-01', '2010-12'

# create random sample
cs_crsp = pd.read_pickle('data/cs_crsp.pkl')

train_dates = pd.period_range(first_train_date, last_train_date, freq='M')
validation_dates = pd.period_range(first_validation_date, last_validation_date, freq='M')
test_dates  = pd.period_range(first_test_date, last_test_date, freq='M')

all_dates = train_dates.union(validation_dates).union(test_dates)

# limit the sample to contain the dates determined above
idx = pd.IndexSlice # for slicing a MultiIndex
cs_crsp = cs_crsp.loc[idx[:, all_dates], :]

# randomly select number_of_stocks stocks that show up somewhere in the sample period
permnos = pd.Series(cs_crsp.index.get_level_values(0).unique(), name='PERMNO')
sample_permnos = permnos.sample(n=number_of_stocks, random_state=42).values

cs_crsp = cs_crsp.loc[sample_permnos,:]

# save 
cs_crsp.to_pickle('data/ml_crsp.pkl')

## Step 2: Determine the target variables and features

In [None]:
print('Here is the raw data:\n\n', cs_crsp)

### Define target variable

Each stock's return in the following month (month t+1)
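As a minimal sketch of how such a target can be built, the toy panel below shifts returns back by one month *within* each stock. The PERMNOs and returns are made up for illustration (they are not CRSP data), and the sketch assumes each stock has one row per consecutive month.

```python
import pandas as pd

# Toy panel: two hypothetical stocks, three months each (illustrative numbers only)
dates = pd.period_range('2000-01', '2000-03', freq='M')
index = pd.MultiIndex.from_product([[10001, 10002], dates], names=['PERMNO', 'date'])
toy = pd.DataFrame({'ret': [0.01, 0.02, 0.03, -0.01, -0.02, -0.03]}, index=index)

# Next month's return: shift by -1 within each stock, so values never cross from one stock to another
toy['retnm'] = toy.groupby(level='PERMNO')['ret'].shift(-1)

print(toy)
```

The last month of each stock has a missing target because there is no following month to look at.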

### Define features

1-12: a stock's return in month t-k+1, for k = 1, ..., 12 (the current month and the 11 months before it)

13: log-size

14: log-book-to-market

15: log-asset growth

16: gross profitability


**Note:** We often take 'logs' in linear models so that the variables have nicer distributions
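To illustrate the note above, here is a small sketch with simulated data. The "market values" are drawn from a lognormal distribution chosen for illustration; they are an assumption, not the CRSP data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical market values: a few very large firms and many small ones
me = pd.Series(rng.lognormal(mean=5.0, sigma=2.0, size=10_000))
log_me = np.log(me)

# Skewness near zero indicates a roughly symmetric distribution
print(f'Skewness of raw values: {me.skew():.2f}')
print(f'Skewness of log values: {log_me.skew():.2f}')
```

The raw values are heavily right-skewed, so a handful of observations would dominate a linear fit; the logged values are close to symmetric.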


### Normalize variables by demeaning

- We "cross-sectionally demean" both the target variable and the features so that it makes sense to estimate pooled regressions
- A pooled regression refers to a sample that has both a time dimension and some cross-sectional dimension (multiple stocks each month in our case)
- This does *not* introduce a lookahead bias: the features for month t are demeaned using only the cross-section of stocks in month t
- In terms of investing, this means that we are trying to model which stocks do poorly or well *relative* to other stocks
  - Known as 'relative value' investing
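A minimal sketch of cross-sectional demeaning on a toy panel follows. The numbers are made up, and the sketch uses `groupby(...).transform('mean')`, which is one of several ways to subtract a per-month mean in pandas.

```python
import pandas as pd

# Toy panel: three hypothetical stocks, two months (illustrative numbers only)
dates = pd.period_range('2000-01', '2000-02', freq='M')
index = pd.MultiIndex.from_product([[10001, 10002, 10003], dates], names=['PERMNO', 'date'])
toy = pd.DataFrame({'ret': [0.05, -0.10, 0.02, -0.04, -0.01, -0.07]}, index=index)

# Subtract each month's cross-sectional mean; transform keeps the result aligned with the original rows
toy['ret_demeaned'] = toy['ret'] - toy.groupby(level='date')['ret'].transform('mean')

print(toy)

# After demeaning, the average across stocks is zero in every month
print(toy.groupby(level='date')['ret_demeaned'].mean())
```

In the second month every stock loses money, but after demeaning stock 10002 has a positive value: it did well *relative* to the other stocks.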

In [None]:
# load data

cs_crsp = pd.read_pickle('data/ml_crsp.pkl')

# the TARGET variable is the return next month

cs_crsp['retnm'] = cs_crsp.groupby(level='PERMNO')['ret'].shift(-1)

# The FIRST set of features consist of monthly returns over the past year
for lag in range(12):
    cs_crsp['x0_retlag' + str(lag)] = cs_crsp.groupby(level='PERMNO')['ret'].shift(lag)
    
# The SECOND set of features are (a) log-size, (b) log-BE/ME, (c) log-asset growth, and (d) gross profitability

# (1) log-size
cs_crsp['x1_logme'] = np.log(cs_crsp['me'])

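# (2) log-book-to-market
cs_crsp['beme'] = cs_crsp['be'] / cs_crsp['me']
cs_crsp['x2_logbeme'] = np.log(cs_crsp['beme'])
# np.log returns -inf for zero book equity, which dropna() would not remove; set non-positive BE/ME to missing
cs_crsp.loc[cs_crsp['beme'] <= 0, 'x2_logbeme'] = np.nan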

# (3) asset growth
cs_crsp['at_lag12'] = cs_crsp.groupby(level='PERMNO')['at'].shift(12)
cs_crsp['x3_log_asset_growth'] = np.log(cs_crsp['at'] / cs_crsp['at_lag12'])
bad_data = (cs_crsp['at'] <= 0) | (cs_crsp['at_lag12'] <= 0) 
cs_crsp.loc[bad_data, 'x3_log_asset_growth'] = np.nan

# (4) gross profitability
cs_crsp['x4_gross_profitability'] = (cs_crsp['sale'] - cs_crsp['cogs']) / cs_crsp['at']
bad_data = cs_crsp['at'] <= 0 
cs_crsp.loc[bad_data, 'x4_gross_profitability'] = np.nan

# Keep only the variables we need
# Note: np.log above emits a RuntimeWarning (not an error) when it is given non-positive numbers. This is fine because we want those observations to be missing.
target_var = ['retnm']
features = [c for c in cs_crsp.columns if c.startswith('x')]
cs_crsp = cs_crsp[target_var + features]

# Normalize variable by cross-sectionally demeaning

cs_crsp = cs_crsp.sub(cs_crsp.groupby(level='date').mean(), level='date')

## Step 3: Estimate Linear Regression using Training Data

- We first use statsmodels.api, similar to Topic #4, to estimate a linear regression using the **training data**
- This package is useful for its summary output, that is, when we want to *interpret* the model
- We do this so that we can compare the results to what we know about stock returns from the academic literature

In [None]:
# limit the sample to contain all TRAINING data observations and drop all observations with any missing values
train_data = cs_crsp.loc[idx[:, train_dates], :].dropna()

y = train_data['retnm']
X = train_data[features]
X = sm.add_constant(X)

# Specify the model
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

# Print the summary
print(results.summary())

## Interpreting the results:

We see:

- Short-term reversals
- Momentum, with some of the variables statistically significant in isolation
- Value, investment, and profitability effects


Put differently, the estimates are consistent with what we know about stock returns from the academic literature

## Step 3.1: Get predicted values from the model 

#### Questions:

- Do they correlate with realized returns?
- How well does a strategy that buys all stocks with predicted positive returns and sells those with predicted negative returns do?

In [None]:
train_data['pred_retnm'] = results.fittedvalues

print('Correlation:\n')
train_data[['retnm', 'pred_retnm']].corr().round(3)

Create a variable "position" that indicates which stocks we want to buy and which to sell

In [None]:
train_data['position'] = train_data['pred_retnm'].apply(lambda x: "buys" if x>0 else "sells" if x<0 else np.nan)
portfolio_returns = train_data.reset_index(level='date').groupby(['date','position'])['retnm'].mean()
portfolio_returns = portfolio_returns.unstack(level='position')
strategy = portfolio_returns['buys'] - portfolio_returns['sells']
strategy = strategy.shift(1) # undo timing, that is, the fact that we compute returns based on the return NEXT month
strategy

In [None]:
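def analyze_returns(r=None, name=None, start_date='1962-05', end_date='2023-09'):
    """Print the annualized Sharpe ratio of the monthly returns r and plot their cumulative sum."""
    r = r.loc[start_date:end_date]
    start_date, end_date = r.index.min(), r.index.max()
    # annualize: the monthly mean scales with 12 and the monthly standard deviation with sqrt(12)
    sharpe = np.sqrt(12) * r.mean() / r.std()
    print(f'Analysis of a strategy: "{name}"')
    print(f'Start: {start_date}, End: {end_date}')
    print(f'Sharpe ratio: {sharpe:.2f}')
    r.cumsum().plot(figsize=(12,8))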

In [None]:
analyze_returns(r=strategy, name='Trade predictions of the linear regression in the training sample')

### Question: Is this a strategy we could have implemented in real time?

- No. Even if we had known which model and which predictors to use, we evaluate the strategy on the same data we used to fit the model
- If the model fit noise in the training sample, it will still look like we made money
- We can use the **validation sample** to examine how good the model is
  - That data is also noisy but the noise is something that the model didn't see when it was fit
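The point about fitting noise can be sketched with simulated data. In the example below, features and targets are independent random numbers, so there is nothing to predict; the sample sizes and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features = 200, 50

# Features and targets are independent noise in both samples
X_fit, y_fit = rng.standard_normal((n_obs, n_features)), rng.standard_normal(n_obs)
X_new, y_new = rng.standard_normal((n_obs, n_features)), rng.standard_normal(n_obs)

# Ordinary least squares on the fitting sample
beta, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

# Correlation between predicted and realized values, in and out of sample
corr_in = np.corrcoef(y_fit, X_fit @ beta)[0, 1]
corr_out = np.corrcoef(y_new, X_new @ beta)[0, 1]

print(f'In-sample correlation:     {corr_in:.3f}')
print(f'Out-of-sample correlation: {corr_out:.3f}')
```

The in-sample correlation is clearly positive even though the model has learned nothing useful, while the correlation on the new data should be close to zero.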

In [None]:
validation_data = cs_crsp.loc[idx[:, validation_dates], :].dropna()

y = validation_data['retnm']
X = validation_data[features]
X = sm.add_constant(X)

# we can feed features from the validation sample to the estimated model using the "predict" method
validation_data['pred_retnm'] = results.predict(X)

print('Correlation:\n')
validation_data[['retnm', 'pred_retnm']].corr().round(3)

In [None]:
validation_data['position'] = validation_data['pred_retnm'].apply(lambda x: "buys" if x>0 else "sells" if x<0 else np.nan)
portfolio_returns = validation_data.reset_index(level='date').groupby(['date','position'])['retnm'].mean()
portfolio_returns = portfolio_returns.unstack(level='position')
strategy = portfolio_returns['buys'] - portfolio_returns['sells']

analyze_returns(r=strategy, name='Trade predictions of the linear regression in the validation sample')

## Step 3.2: Ridge regression

- Ridge regression is a technique for preventing overfitting to data


- When you estimate a linear regression, you're searching for best slopes (or coefficients) for the least squares line
  - These slopes can be sensitive to outliers; those outliers 'pull' slopes towards them


- A ridge regression is different from a linear regression in that there is a single penalty parameter that makes coefficient estimates less sensitive to the data
  - You can think about the penalty parameter as being a "budget" for how large all the slopes can be together
  - This penalty shrinks all slopes towards zero
  
  
- The penalty parameter -- known as the **L2 penalty** for its mathematical definition and often designated as 'alpha' -- is a **hyperparameter**
  - We need to specify this parameter before we fit the model to the data
  - How do we know how to set this parameter?
  - Typical solution: 
    - Split the sample into training and validation samples 
    - Train models for many different choices of the penalty parameter using the training sample
    - Pick the model that performs the best in the validation sample


- In the code below I use cross validation to do this


- Note: We should always normalize features when we estimate a ridge regression, because the penalty applies to the size of the slopes and the size of a slope depends on the units of its feature
  - The 'scales' of the variables should be comparable -- unless there is a really good reason to deviate from this principle
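The ideas above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the scikit-learn implementation: `ridge_fit` is a hypothetical helper that uses the textbook closed-form ridge solution without an intercept, and the data are simulated with features that already have comparable scales.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate (X'X + alpha * I)^(-1) X'y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
n_obs, n_features = 300, 20
true_beta = np.zeros(n_features)
true_beta[:3] = [0.5, -0.3, 0.2]  # only three features matter in this simulated example

def simulate(n):
    X = rng.standard_normal((n, n_features))          # features with unit scale
    y = X @ true_beta + 3.0 * rng.standard_normal(n)  # noisy target
    return X, y

X_fit, y_fit = simulate(n_obs)  # sample used to estimate the slopes
X_val, y_val = simulate(n_obs)  # sample used to choose the penalty

rows = []
for alpha in np.logspace(-2, 4, 13):
    beta = ridge_fit(X_fit, y_fit, alpha)
    mse_val = np.mean((y_val - X_val @ beta) ** 2)
    rows.append((alpha, np.linalg.norm(beta), mse_val))
    print(f'alpha={alpha:10.2f}  size of slopes={np.linalg.norm(beta):.3f}  validation MSE={mse_val:.3f}')

# Pick the penalty with the lowest error in the validation sample
best_alpha = min(rows, key=lambda row: row[2])[0]
print(f'Chosen alpha: {best_alpha:.2f}')
```

The overall size of the slopes falls as the penalty grows, which is the "budget" interpretation. The chosen penalty is the one that predicts best on data the model did not see when it was fit.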

In [None]:
# limit the sample to contain all TRAINING data observations and drop all observations with any missing values
train_data = cs_crsp.loc[idx[:, train_dates], :].dropna()

y_train = train_data['retnm']
X_train = train_data[features]

# Scale the features to have mean zero and unit standard deviation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Define the RidgeCV model with 5-fold cross-validation
# alphas is a numpy array of values for alpha (the regularization strength) that we want to test
# np.logspace gives us a wide range of values
model = RidgeCV(alphas=np.logspace(-6, 6, 50), cv=KFold(n_splits=5, shuffle=True, random_state=42))

# Fit the model
model.fit(X_train_scaled, y_train)

# Get the predicted values
train_data['pred_retnm'] = model.predict(X_train_scaled)

# scikit-learn has no summary function like the one in statsmodels
# But we can print out the alpha (regularization strength) that was chosen and the coefficients
print(f'Chosen alpha from cross validation: {model.alpha_:,.2f}\n')

print('Coefficients')
pd.Series(model.coef_, index=features)

In [None]:
validation_data = cs_crsp.loc[idx[:, validation_dates], :].dropna()

X_validation = validation_data[features]
X_validation_scaled = scaler.transform(X_validation)

# we can feed features from the validation sample to the estimated model using the "predict" method
validation_data['pred_retnm'] = model.predict(X_validation_scaled)

print('Correlation:\n')
validation_data[['retnm', 'pred_retnm']].corr().round(3)

In [None]:
validation_data['position'] = validation_data['pred_retnm'].apply(lambda x: "buys" if x>0 else "sells" if x<0 else np.nan)
portfolio_returns = validation_data.reset_index(level='date').groupby(['date','position'])['retnm'].mean()
portfolio_returns = portfolio_returns.unstack(level='position')
strategy = portfolio_returns['buys'] - portfolio_returns['sells']

analyze_returns(r=strategy, name='Ridge regression predictions in the validation sample')