# <span style='color:red'>Quantitative Investing with Python</span>

### Professor Juhani Linnainmaa

Dartmouth College and Kepos Capital

*Last revised:* January 20, 2025

--- 

# **Topic 4:** Replicating Academic Factors and Measuring Alphas

1. We will merge accounting data from Compustat and stock data from CRSP
   - I'll use the processed CRSP file we created and saved previously
   - By having the merged dataset, we can construct factors such as HML, which is based on sorting stocks into portfolios by their book-to-market ratios<br><br>


2. I provide some general code for constructing factors based on arbitrary signals, such as BE/ME or the signal underneath the momentum factor 
   - I'll replicate the factors more carefully than I did the short-term reversals factor
   - The construction involves several additional details. These are not *that* important in practice, but it is useful to think about them; at the very least, they highlight the fact that many decisions go into constructing trading strategies / factors<br><br>


3. We will then estimate linear regressions to assess strategies'/factors' alphas in asset pricing models such as the Capital Asset Pricing Model or the Fama-French five-factor model
   - I'll use the same Fama-French factors we downloaded and pickled previously
   - I'll discuss the meaning of alphas in class, not in this notebook

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

import requests
from io import BytesIO, StringIO
import zipfile

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Read and process annual fundamentals from Compustat
</div>

- Compustat is a well-known provider of fundamental data
  - Fundamental in this context means accounting data, that is, income statement and balance sheet information
- I downloaded all data for U.S. firms. I look at *annual* reports, which are still the standard in the academic literature
  - The same ideas of course apply if we use quarterly data or any other data sources (such as CapitalIQ or Datastream) besides Compustat

In [None]:
compustat_url = 'https://dl.dropboxusercontent.com/scl/fi/vd2ci1fw093kbx9375m2z/Compustat_September2023.csv.zip?rlkey=g68xz4deyiq5n7cx5n7q1ma0u'
response = requests.get(compustat_url)
with zipfile.ZipFile(BytesIO(response.content)) as z:
    with z.open('Compustat_September2023.csv') as f:
        df = pd.read_csv(f)

df.tail(3)

## Pre-process the data a bit

1. Rename the stock identifier variable to be consistent with the CRSP name
2. Compute the book value of equity using the Fama-French definition
   - Fama and French (1993) don't provide all the details
   - Cohen, Polk, and Vuolteenaho (2003), who were Ph.D. students at Chicago, write:
   
     > Book equity is defined as the stockholders' equity, plus balance sheet deferred taxes (data item 74) and investment tax credit (data item 208; if available), plus postretirement benefit liabilities (data item 330; if available), minus the book value of preferred stock. Depending on availability, we use redemption (data item 56), liquidation (data item 10), or par value (data item 130) in that order for the book value of preferred stock. Stockholders' equity used in the above formula is calculated as follows. We prefer the stockholders' equity number reported by Moody's or COMPUSTAT (data item 216). If neither one is available, we measure stockholders' equity as the book value of common equity (data item 60), plus the par value of preferred stock. (Note that the preferred stock is added at this stage, because it is later subtracted in the book equity formula.) If common equity is not available, we compute stockholders' equity as the book value of assets (data item 6) minus total liabilities (data item 181), all from COMPUSTAT.
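
The fallback logic in the quoted definition can be sketched on toy data. This is a minimal illustration, not the full definition: the four firms and their numbers are made up, the column names simply mirror the Compustat mnemonics used in the next cell, and the deferred-tax adjustment is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy balance-sheet data for four hypothetical firms (all values are made up)
toy = pd.DataFrame({
    'seq':    [100.0, np.nan, np.nan, 50.0],    # stockholders' equity
    'ceq':    [90.0, 80.0, np.nan, 45.0],       # common equity
    'pstk':   [10.0, 5.0, np.nan, np.nan],      # preferred stock, par value
    'at':     [300.0, 200.0, 150.0, 120.0],     # total assets
    'lt':     [200.0, 115.0, 110.0, 70.0],      # total liabilities
    'pstkrv': [12.0, np.nan, np.nan, np.nan],   # preferred stock, redemption value
    'pstkl':  [11.0, 6.0, np.nan, np.nan],      # preferred stock, liquidation value
})

# Stockholders' equity: reported value, else common equity + preferred par, else assets - liabilities
se = toy['seq'].combine_first(toy['ceq'] + toy['pstk']).combine_first(toy['at'] - toy['lt'])

# Preferred stock: redemption, then liquidation, then par value; zero if none is available
pref = toy['pstkrv'].combine_first(toy['pstkl']).combine_first(toy['pstk']).fillna(0)

toy_be = se - pref
print(toy_be)
```

Each `combine_first` call only fills the rows that are still missing, which is what makes the "in that order" preference in the quote easy to express.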
   

In [None]:
# rename stock identifier to be consistent with CRSP
df = df.rename(columns = {'LPERMNO': 'permno'})
df['datadate'] = pd.to_datetime(df['datadate'], format='%Y-%m-%d').dt.to_period('M').dt.to_timestamp()
df = df.set_index(['permno', 'datadate'])

# compute book value of equity using the Fama-French rules
be = df['seq'].combine_first(df['ceq'] + df['pstk']).combine_first(df['at'] - df['lt'])

# 1. compute preferred stock
pref = df['pstkrv'].combine_first(df['pstkl']).combine_first(df['pstk'])

# 2. adjust book value of equity for preferred stock (if exists)
pref_not_missing = pref.notnull()
be.loc[pref_not_missing] -= pref

# 3. investment tax credit (only for fiscal years ending in 1993 or before)
df['txditc'] = df['txditc'].replace({np.nan: 0})
before_1993 = df.index.get_level_values('datadate') <= '1993-12'
be.loc[before_1993] += df.loc[before_1993, 'txditc']

df['be'] = be

df.tail(5)

## Keep only the necessary data

- PERMNO and datadate (which are in the index)
- at (total assets)
- sale (revenue) and cogs (cost of goods sold)
- be (book value of equity)
- **Note:** I don't use at, sale, and cogs in this notebook, but I'll leave them in for reasons to be discussed in class

In [None]:
cs_vars = ['at', 'sale', 'cogs', 'be']

compustat = df[cs_vars].copy().dropna(how='all')
compustat.head(10)

### Load CRSP data from Topic #3 and filter Compustat data so that it only includes the PERMNOs we have in the CRSP data

In [None]:
cs_crsp = pd.read_pickle('/home/jovyan/data/crsp.pkl')
cs_crsp = cs_crsp.reset_index(level='date')
cs_crsp = cs_crsp.set_index('date', append=True)
display(cs_crsp.tail(5))

ok = compustat.index.get_level_values('permno').isin(cs_crsp.index.get_level_values('permno').unique())
print(f'Compustat shape before filtering: {compustat.shape}')
compustat = compustat[ok]
print(f'Compustat shape after filtering: {compustat.shape}')

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Merge Compustat data with CRSP
</div>

- We have monthly stock returns but *annual* Compustat data
- Moreover, Compustat dates each record by the fiscal-year end, *not* by the date on which the information became available to investors
  - Variable ```datadate``` tells the end date of the fiscal year
- **THIS IS A PROBLEM!**
  - I need to lag Compustat data appropriately relative to CRSP
  - What does this mean? 
    - Companies announce their earnings with some lag 
    - A conservative assumption from Fama-French is that accounting data from a fiscal year that ended in year t is available at the end of June in year t+1
  - We could do this in many different ways but I use the following logic:
    1. Reindex Compustat data so that we have data for all months over which the same stock is present in either CRSP or Compustat
    2. Lag fundamental information by six months (the Fama-French assumption)
    3. Forward fill the data for up to 23 months
- We now have monthly observations that we can merge directly with CRSP
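
The lag-and-fill logic can be checked on a small example. The sketch below assumes two hypothetical firms with made-up book equity values on a complete monthly grid; the point is that both the six-month lag and the forward fill are applied within each firm.

```python
import numpy as np
import pandas as pd

# Two hypothetical firms on a complete monthly grid, January 2020 to December 2023
months = pd.date_range('2020-01-01', periods=48, freq='MS')
idx = pd.MultiIndex.from_product([[1, 2], months], names=['permno', 'date'])
toy = pd.DataFrame({'be': np.nan}, index=idx)

# Firm 1 reports for fiscal years ending December 2020 and December 2021
toy.loc[(1, pd.Timestamp('2020-12-01')), 'be'] = 100.0
toy.loc[(1, pd.Timestamp('2021-12-01')), 'be'] = 110.0
# Firm 2 reports only once, for the fiscal year ending December 2020
toy.loc[(2, pd.Timestamp('2020-12-01')), 'be'] = 500.0

# Step 2: lag by six months within each firm
lagged = toy.groupby(level='permno').shift(6)
# Step 3: forward fill by up to 23 months, again within each firm
filled = lagged.groupby(level='permno').ffill(limit=23)

print(filled.loc[1].loc['2021-04':'2021-08'])
print(filled.loc[2].loc['2023-04':'2023-07'])
```

Firm 1's December 2020 value first appears in June 2021 and is replaced by the next report in June 2022. Firm 2's single report is carried through May 2023 and then goes missing, and its first months stay missing rather than inheriting firm 1's last value.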

In [None]:
# rename Compustat datadate variable to 'date'
compustat = compustat.rename_axis(index={'datadate': 'date'})

# reindex Compustat data to cover all dates seen in Compustat and CRSP 
cs_index = compustat.index
crsp_index = cs_crsp.index
combined_index = cs_index.union(crsp_index)
compustat = compustat.reindex(combined_index)

# there might be some gaps in dates
# in the code below, I create an index that covers every month from the first time we see a firm to the last time
min_max_dates = compustat.reset_index(level='date').groupby(level='permno').agg(min_date=('date', 'min'), max_date=('date', 'max'))

# Create list for multi-index to cover all months for each firm
multi_index_list = []

for index, row in min_max_dates.iterrows():
    months_range = pd.date_range(start=row['min_date'], end=row['max_date'], freq='MS')
    for month in months_range:
        multi_index_list.append((index, month))

new_index = pd.MultiIndex.from_tuples(multi_index_list, names=['permno', 'date'])
compustat = compustat.reindex(new_index)

print('Data without shifting and ffill')
display(compustat.tail(25))
# we want to lag data by six months (so that if a firm's fiscal year ends in December, we see this info next June)
# we also want to forward-fill data. If we forward fill exactly 11 months we'd cover all months until the next fiscal year end
# however, Compustat data might have gaps, so I forward fill by 23 months (it'll cover one missed annual report)

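# forward fill within each stock so that one firm's data never spills into the next firm's first months
compustat = compustat.groupby(level='permno').shift(6).groupby(level='permno').ffill(limit=23)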
print('Data after shifting and ffill')
display(compustat.tail(25))

### Merge Compustat data with CRSP and pickle

In [None]:
cs_crsp = cs_crsp.merge(compustat, left_index=True, right_index=True, how='left')
display(cs_crsp.tail(12))

cs_crsp.to_pickle('/home/jovyan/data/cs_crsp.pkl')

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Define three functions for replicating academic factors and analyzing returns
</div>


## Function 1: Assign stocks into portfolios based on breakpoints

- I let the function take a number of inputs so that we can flexibly create all kinds of factors
- Breakpoints are computed from NYSE stocks only (```exchcd == 1```) and then applied to all stocks

Note:

- Fama and French update their portfolio sorts only in June
- If we do similar "annual" sorts:
  - Set non-June assignments to missing
  - Copy the previous group assignment forward to fill non-June months
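
The two annual-sort steps can be sketched on a single hypothetical stock. The group numbers below are made up; only the June values should survive.

```python
import numpy as np
import pandas as pd

# One hypothetical stock with a group assignment in every month of 2021 and 2022
months = pd.date_range('2021-01-01', periods=24, freq='MS')
idx = pd.MultiIndex.from_product([[10001], months], names=['permno', 'date'])
groups = np.ones(24)
groups[5] = 3.0    # June 2021
groups[17] = 2.0   # June 2022
toy = pd.DataFrame({'group': groups}, index=idx)

# Step 1: keep only the June assignments
non_june = toy.index.get_level_values('date').month != 6
toy.loc[non_june, 'group'] = np.nan

# Step 2: carry each June assignment forward for up to 11 months, within each stock
toy['group'] = toy.groupby(level='permno')['group'].ffill(limit=11)
print(toy)
```

The months before the first June stay missing, so the stock does not enter any portfolio until its first June sort.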
  
## Function 2: Compute portfolio returns for the sorts defined in ```sort_groups```

This function computes value-weighted portfolio returns   
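
As a small worked example of value weighting, assume four hypothetical stocks in two groups in a single month (all numbers are made up). The value-weighted return is the sum of market cap times return divided by the sum of market caps:

```python
import pandas as pd

toy = pd.DataFrame({
    'date':  ['2020-01', '2020-01', '2020-01', '2020-01'],
    'group': [1, 1, 2, 2],
    'me':    [100.0, 300.0, 50.0, 150.0],   # market value of equity
    'retnm': [0.02, -0.02, 0.10, 0.02],     # return over the next month
})

toy['me_x_retnm'] = toy['me'] * toy['retnm']
sums = toy.groupby(['date', 'group'])[['me', 'me_x_retnm']].sum()
vw = sums['me_x_retnm'] / sums['me']
print(vw)
```

In group 1 the larger stock's -2% dominates, so the portfolio return is (2 - 6) / 400 = -1% even though the equal-weighted average is zero.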

## Function 3: Analysis function for measuring Sharpe ratios 

- This is from the previous notebook
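
For reference, here is the Sharpe ratio computation on its own, as a sketch that assumes monthly returns (hence the square root of 12) and a long-short factor, whose returns are already excess returns. The six return values are made up.

```python
import numpy as np
import pandas as pd

# Six hypothetical monthly returns on a long-short strategy
r = pd.Series([0.01, 0.03, -0.02, 0.02, 0.00, 0.02])

# Annualized Sharpe ratio: scale the monthly mean-to-volatility ratio by sqrt(12)
sharpe = np.sqrt(12) * r.mean() / r.std()
print(f'Sharpe ratio: {sharpe:.2f}')
```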

In [None]:
def portfolio_sort(df=None, col=None, percentiles=None, id_col=None, annual=True):
    sortvar = df.loc[df['exchcd']==1, col]
    
    df[id_col] = np.nan  
    group_id = 1
    grp = sortvar.dropna().groupby(level='date')
    
    for pct in percentiles:
        breakpoint = grp.apply(lambda x: np.percentile(x, pct))
        breakpoint.name = 'breakpoint'
        df_merged = df.merge(breakpoint, left_on='date', right_index=True, how='left')
        assigned = df_merged[id_col].isnull() & (df_merged[col] <= df_merged['breakpoint'])         
        df.loc[assigned[assigned].index, id_col] = group_id
        group_id += 1
    
    # assign firms above the last breakpoint to the final group
    assigned = df_merged[id_col].isnull() & (df_merged[col] > df_merged['breakpoint']) 
    df.loc[assigned[assigned].index, id_col] = group_id
    
    if annual:
        nonJune = df.index.get_level_values(level='date').month != 6
        df.loc[nonJune, id_col] = np.nan
        df[id_col] = df.groupby(level='permno')[id_col].ffill(limit=11)
    
    return df


def compute_portfolio_returns(df=None, sort_groups=None):
    
    display(df.shape)

    df['retnm'] = df['ret'].groupby(level='permno').shift(-1)
    df['me_x_retnm'] = df['me'] * df['retnm']

    # require me, sort variables, and return next month
    # NOTE: sort_variables is not an argument; it is read from the notebook's global scope
    ok = df['me'].notnull()
    for required_var in ['retnm'] + sort_variables:
        ok = ok & df[required_var].notnull()
    df = df[ok]

    display(df.shape)

    sums = df.reset_index().groupby(by=['date'] + sort_groups)[['me', 'me_x_retnm']].sum()
    portfolio_returns = sums['me_x_retnm'] / sums['me']
    portfolio_returns = portfolio_returns.unstack(level=sort_groups)
    
    # because we used return as of NEXT MONTH, undo the timing so that the date in the index corresponds to the return realization
    portfolio_returns = portfolio_returns.shift(1)
    
    return portfolio_returns


def analyze_returns(r=None, name=None, start_date='1964-01', end_date='2023-09'):
    r = r.loc[start_date:end_date]
    ir = np.sqrt(12) * r.mean() / r.std()
    print(f'Start: {start_date}, End: {end_date}')
    print(f'Sharpe ratio: {ir:.2f}')
    r.cumsum().plot(title=f'Analysis of a strategy: "{name}"', figsize=(12,8))

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Fama-French Value Factors (HML)
</div>

- With the functions I defined above, I just need to define
  - What variable is our signal
    - book value of equity-to-market value of equity (```beme```)
  - What percentiles do we use for size and book-to-market sorts?
    - Fama and French use the ```50```th percentile for size and the ```30```th and ```70```th percentiles for the signal variable
  - Do we re-sort only annually at the end of June? 
    - Fama and French re-sort annually

Fama and French construct the factor by assigning stocks into six portfolios: small-value, small-neutral, small-growth, big-value,...

The return on HML is then:

```HML = (1/2) * (small-value + big-value) - (1/2) * (small-growth + big-growth)```
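
The formula can be checked on toy numbers. The sketch below assumes the same column layout that the sorting code produces, a two-level column index of (size group, book-to-market group), with made-up returns for two months.

```python
import pandas as pd

# Columns are (size group, book-to-market group): size 1 = small, 2 = big; B/M 1 = growth, 3 = value
cols = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]], names=['me_group', 'beme_group'])
toy_returns = pd.DataFrame(
    [[0.01, 0.02, 0.03, 0.00, 0.01, 0.02],
     [-0.02, 0.00, 0.01, -0.01, 0.00, 0.03]],
    index=pd.to_datetime(['2020-01-01', '2020-02-01']),
    columns=cols,
)

value = toy_returns.loc[:, [(1, 3), (2, 3)]].mean(axis=1)    # small-value and big-value
growth = toy_returns.loc[:, [(1, 1), (2, 1)]].mean(axis=1)   # small-growth and big-growth
toy_hml = value - growth
print(toy_hml)
```

Averaging across the two size groups before differencing is what keeps the factor roughly neutral with respect to size.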

#### Notes on the code below:

- I create a list ```sort_variables``` to indicate by which variables I sort
- I create a dictionary ```percentiles``` (with the sort variables as keys) to indicate what breakpoints I want to use
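
To make the role of the breakpoints concrete, here is a minimal single-date sketch. It assumes eight hypothetical stocks with made-up book-to-market ratios; as in ```portfolio_sort```, the breakpoints come from NYSE stocks only (```exchcd == 1```) and are then applied to every stock.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'exchcd': [1, 1, 1, 1, 1, 3, 3, 3],                  # 1 = NYSE
    'beme':   [0.2, 0.4, 0.6, 0.8, 1.0, 0.1, 0.5, 2.0],
})

# Breakpoints from NYSE stocks only
nyse = toy.loc[toy['exchcd'] == 1, 'beme']
bp30, bp70 = np.percentile(nyse, [30, 70])

# Apply the breakpoints to all stocks: 1 = low, 2 = middle, 3 = high
toy['beme_group'] = np.where(toy['beme'] <= bp30, 1,
                             np.where(toy['beme'] <= bp70, 2, 3))
print(bp30, bp70)
print(toy)
```

Using NYSE breakpoints keeps the many small non-NYSE stocks from dragging the cutoffs toward their own distribution.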

In [None]:
# start from the original data (we will modify it)
cs_crsp = pd.read_pickle('/home/jovyan/data/cs_crsp.pkl')

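# construct BE/ME - set firms with negative BEs to missing
cs_crsp['beme'] = cs_crsp['be'] / cs_crsp['me']
# lag the ratio by a further 24 months within each stock
# (be is already lagged six months relative to the fiscal-year end by the merge above)
cs_crsp['beme'] = cs_crsp.groupby(level='permno')['beme'].shift(24)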

negative_be = cs_crsp['be'] < 0
cs_crsp.loc[negative_be, 'beme'] = np.nan

sort_variables = ['me', 'beme']
percentiles = {'me': [50],
              'beme': [30, 70]}

for sortvar in sort_variables:
    cs_crsp = portfolio_sort(df=cs_crsp, col=sortvar, percentiles=percentiles[sortvar], id_col=sortvar + '_group', annual=True)
    
sort_groups = [sortvar + '_group' for sortvar in sort_variables]

portfolio_returns = compute_portfolio_returns(cs_crsp, sort_groups=sort_groups)

hml = portfolio_returns.loc[:,[(1,3), (2,3)]].mean(axis=1) - portfolio_returns.loc[:,[(1,1), (2,1)]].mean(axis=1)
hml.name = 'hml'

In [None]:
analyze_returns(hml, 'Our HML', end_date='2007-06')

## Read Fama and French data from the pickle file we created previously

In [None]:
ff_data = pd.read_pickle('/home/jovyan/data/ff_data.pkl')
ff_data.tail()

In [None]:
merged_data = pd.concat([hml, ff_data], axis=1).dropna()
merged_data[['hml', 'Mkt-RF', 'SMB', 'HML']].corr().round(3)

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Carhart's Momentum Factor (UMD)
</div>

Momentum in stock returns is typically defined by sorting on stocks' returns from month t-12 to month t-2 

I construct the equivalent of Fama and French's UMD factor (a close relative of Carhart's ```PR1YR```)

Note: This factor is rebalanced monthly, and so I set ```annual=False```
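
The timing of the signal is easy to get wrong, so here is a small sketch on two hypothetical stocks with constant made-up returns. It assumes, as the cell below does, that the signal is the sum of monthly returns and that at least six observations are required. The signal stored at date d covers months d-11 through d-1; because portfolios formed at d earn the return of month d+1, this is the t-12 to t-2 window relative to the holding month t.

```python
import numpy as np
import pandas as pd

# Two hypothetical stocks, 14 months each, with constant returns of 1% and 2% per month
months = pd.date_range('2020-01-01', periods=14, freq='MS')
idx = pd.MultiIndex.from_product([[1, 2], months], names=['permno', 'date'])
toy = pd.DataFrame({'ret': np.r_[np.full(14, 0.01), np.full(14, 0.02)]}, index=idx)

# Rolling 11-month sum within each stock, then skip the most recent month
signal = toy.groupby(level='permno')['ret'].transform(
    lambda x: x.rolling(window=11, min_periods=6).sum().shift(1))
print(signal.loc[2].head(8))
```

Because the rolling window is computed inside each stock, the second stock's first six months are missing instead of borrowing returns from the first stock.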

In [None]:
# start from the original data (we will modify it)
cs_crsp = pd.read_pickle('/home/jovyan/data/cs_crsp.pkl')

# sum of returns from month t-12 to t-2, computed within each stock so that
# the rolling window never spans two different PERMNOs; require at least 6 observations
cs_crsp['r12_2'] = cs_crsp.groupby(level='permno')['ret'].transform(
    lambda x: x.rolling(window=11, min_periods=6).sum().shift(1))

sort_variables = ['me', 'r12_2']
percentiles = {'me': [50],
              'r12_2': [30, 70]}

for sortvar in sort_variables:
    cs_crsp = portfolio_sort(df=cs_crsp, col=sortvar, percentiles=percentiles[sortvar], id_col=sortvar + '_group', annual=False)
    
sort_groups = [sortvar + '_group' for sortvar in sort_variables]

portfolio_returns = compute_portfolio_returns(cs_crsp, sort_groups=sort_groups)

umd = portfolio_returns.loc[:,[(1,3), (2,3)]].mean(axis=1) - portfolio_returns.loc[:,[(1,1), (2,1)]].mean(axis=1)
# compute_portfolio_returns already dates each return by its realization month, so no further shift is applied
umd.name = 'umd'
analyze_returns(umd, 'Momentum Factor')

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Measuring Alphas
</div>

We measure alphas by

```Running a linear regression of strategy returns against some factors```

- In the CAPM, there is only one factor on the RHS: MKTRF
- In the Fama-French three-factor model, there are three factors: MKTRF, SMB, and HML
- In the Fama-French five-factor model, there are five factors: MKTRF, SMB, HML, RMW, and CMA

**Alphas** measure stocks', managers', or strategies' *abnormal* returns

That is, how profitable an investment is when we 'expunge' from returns any exposures to the factors of the factor model

Notes:

- I use statsmodels.api for running the linear regression
  - This is a well-known (and well-maintained) package
  - The benefit of this package is that it gives a nice, easy summary of the results
- Other packages, such as sklearn, are better for estimating more complex models
  - They don't provide similar summary statistics -- because such summary statistics are often hard to compute for more complicated models
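
To see what the regression recovers, here is a minimal simulated CAPM example. It is a sketch under stated assumptions: the data are simulated with a known alpha and beta (all parameter values are made up), and it uses plain NumPy least squares instead of statsmodels so that the block runs without extra packages.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600  # 50 years of simulated monthly observations

# Simulated market excess return and a strategy with a known alpha and beta
mktrf = rng.normal(0.005, 0.045, n)
true_alpha, true_beta = 0.004, 0.5
strategy = true_alpha + true_beta * mktrf + rng.normal(0.0, 0.01, n)

# OLS of strategy returns on a constant and the market factor
X = np.column_stack([np.ones(n), mktrf])
coef, *_ = np.linalg.lstsq(X, strategy, rcond=None)
alpha_hat, beta_hat = coef

print(f'alpha: {alpha_hat:.4f} per month, beta: {beta_hat:.2f}')
```

The intercept is the alpha: the average monthly return that remains after removing the part explained by the strategy's market exposure.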

In [None]:
import statsmodels.api as sm

regression_data = pd.concat([umd, ff_data], axis=1).dropna()

y = regression_data['umd']
X = regression_data[['Mkt-RF', 'SMB', 'HML']]
X = sm.add_constant(X)

# Create a model. This is an OBJECT that comes with methods. We are NOT estimating the model yet, just creating it.
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

# Print the summary
display(results.summary())

## Question:

What if we repeat this computation for our HML factor?