# <span style='color:red'>Quantitative Investing with Python</span>

### Professor Juhani Linnainmaa

Dartmouth College and Kepos Capital (Co-Director of Research)

--- 

# **Topic 3:** Reversals and Momentum

The goal in this section of the course is to
1. Get familiar with constructing trading strategies
2. Do so by replicating some academic factors

In this part I will use monthly CRSP data -- CRSP stands for Center for Research in Security Prices.
- Most universities and colleges subscribe to data such as CRSP through Wharton's WRDS service
- I'm providing the full monthly CRSP file from May 1962 through September 2023 
  - However, I only include a fraction of the fields available
  
I will consider "price-based" factors, that is, trading rules that are based only on past security price information

The major price-based factors include
- Size
- Short- and long-term reversals
- Momentum
- Idiosyncratic volatility
- Betting against beta

When you construct a factor -- as discussed in Lecture 1 -- you need to make *many* choices 
- Moreover, the data may change over time and so, in practice, it is *very* difficult to replicate a factor perfectly unless you have the original data and code
- Sometimes the original papers (and industry reports) do not provide enough details for replicating the factors
  - For example, Li, Novy-Marx, and Velikov (2019) (https://cfr.pub/published/papers/li2020liquidity.pdf) struggled to replicate a famous factor paper until they figured out that the authors had used an unreported rule:
  
  
  ```
  "Finally, while not noted by PS, they delete zero-volume observations when estimating Eq. (1), and doing so here is crucial to generating a high correspondence between our results and those reported in their paper.
  
  Determining this fact required implementing numerous variations on the methodology described in PS. This involved labor far beyond what could reasonably be expected for casual replication, and was only possible because of the public aggregate liquidity series maintained by PS, which allowed us to infer which variations were important for generating a close correspondence." (p. 227)
  ```


In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta

import requests
from io import BytesIO, StringIO
import zipfile

# Read and process monthly stock data from CRSP

- I downloaded it from WRDS, zipped it, and put it on Dropbox
- The code below downloads the file and unzips it into a DataFrame

In [None]:
crsp_url = 'https://dl.dropboxusercontent.com/scl/fi/jnzk25egqsup3j4ibyvni/CRSP_September2023.csv.zip?rlkey=9im4ectsyl9ls7o9aw7x67odg'
response = requests.get(crsp_url)
with zipfile.ZipFile(BytesIO(response.content)) as z:
    with z.open('CRSP_September2023.csv') as f:
        df = pd.read_csv(f)
        
df.tail(3)

### Pre-processing

- I don't really need company names, but I'll put them into a Series just in case I want to look something up based on PERMNO
- I also don't need TICKER

In [None]:
company_names = df[['PERMNO','COMNAM']]
company_names = company_names.groupby('PERMNO').last().squeeze()
df = df.drop(columns=['COMNAM','TICKER'])
company_names.name = 'Company names'
company_names.tail(5)

### Pre-processing 2

1. change dates to datetime
2. set permno-date as the index (it becomes a multi-index)
3. the typical universe in equities is to keep common stock traded on NYSE, Nasdaq, and AMEX -- filter based on SHRCD and EXCHCD
4. drop SHRCD because we don't need it anymore (the code below keeps EXCHCD)

In [None]:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['PERMNO', 'date'])

# we want to keep SHRCD = 10 or 11 and EXCHCD=1,2, or 3 -- the raw input data should already have these filters
df = df[(df['SHRCD'].isin([10,11])) & (df['EXCHCD'].isin([1,2,3]))]

# drop the SHRCD variable - we don't need it anymore
df = df.drop(columns=['SHRCD'])

df.tail(3)

### Convert the two return variables, DLRET and RET, into floats

- Some entries are non-numeric strings; `errors='coerce'` turns them into NaN

In [None]:
# what *is* DLRET?
print(df['DLRET'].describe())

col_list = ['DLRET', 'RET']
for col in col_list:
    df[col] = pd.to_numeric(df[col], errors='coerce')

### Compute returns inclusive of delisting returns

- Our return variable is either 'normal return', 'delisting return', or, if both exist, the compounded return
- We need to be careful with missing values: I fill in zeros for NaNs in the computation and then, if neither return exists, put the NaN back in

In [None]:
df['ret'] = (1 + df['RET'].fillna(0)) * (1 + df['DLRET'].fillna(0)) - 1

neither_return_exists = (df['RET'].isnull()) & (df['DLRET'].isnull())

df.loc[neither_return_exists, 'ret'] = np.nan

# drop the original return variables - we don't need them anymore
df = df.drop(columns=['RET', 'DLRET'])

df.tail(3)
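
The compounding rule can be checked on a toy example. The four rows below are made-up numbers (not CRSP data) that cover each case: both returns present, only RET, only DLRET, and neither. This is a minimal sketch of the same logic as the cell above; it uses `Series.where` in place of the `.loc` assignment, which is a stylistic assumption rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up returns: both present, RET only, DLRET only, neither
toy = pd.DataFrame({
    'RET':   [0.10, 0.05, np.nan, np.nan],
    'DLRET': [-0.50, np.nan, -0.30, np.nan],
})

# Treat a missing return as zero while compounding ...
ret = (1 + toy['RET'].fillna(0)) * (1 + toy['DLRET'].fillna(0)) - 1

# ... then restore NaN where neither return exists
neither = toy['RET'].isna() & toy['DLRET'].isna()
toy['ret'] = ret.where(~neither)

print(toy)
```

In the first row the stock gains 10% and then loses 50% on delisting, so the compounded return is 1.10 × 0.50 − 1 = −45%.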

### Compute market cap in millions
- Note that CRSP reports PRC as a negative number when it is the bid-ask midpoint rather than a trade price, hence the absolute value


In [None]:
df['me'] = np.abs(df['PRC']) * df['SHROUT'] / 1_000

# Shares outstanding is sometimes zero or PRC missing -> set me to missing
df['me'] = df['me'].replace({0: np.nan})

# Drop PRC and SHROUT -> we don't need them anymore
df = df.drop(columns=['PRC', 'SHROUT'])

df.tail(3)
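
A quick units check on made-up numbers may help here. CRSP reports SHROUT in thousands of shares, so |PRC| × SHROUT is in thousands of dollars, and dividing by 1,000 gives millions. The rows below are illustrative values, not CRSP data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PRC':    [250.0, -12.5, 40.0, np.nan],   # negative = bid-ask midpoint
    'SHROUT': [3_000_000, 8_000, 0, 5_000],   # thousands of shares
})

toy['me'] = toy['PRC'].abs() * toy['SHROUT'] / 1_000   # millions of dollars
toy['me'] = toy['me'].replace({0: np.nan})             # zero shares -> missing

print(toy)
```

The first row is a $250 stock with 3 billion shares, that is, a market cap of $750,000M. The last two rows end up missing: one because shares outstanding is zero, the other because PRC is missing.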

What stock has a market cap of \\$794,196.8M at the end of September 2023?

In [None]:
company_names.loc[93436]

### Save finished file into a PKL file

- Be careful with pickle files - they are very convenient but not efficient and break between Python and Pandas versions!

In [None]:
df.to_pickle('data/crsp.pkl')

In [None]:
df = pd.read_pickle('data/crsp.pkl')

In [None]:
df

## Trading strategies

A trading strategy is a systematic rule based on data we know at the time we make the trading decisions

Your strategy could be "value investing:"

1. Buy value stocks, sell growth stocks (this is known as a long-short portfolio)
2. Look at the data once a month to rebalance your portfolios as stocks' characteristics change

### Strategy 1: Short-term reversals

**Short-term reversals** is the empirical finding that, at short horizons, stock returns tend to **reverse**

In monthly data, a strategy that trades short-term reversals is simple:

- Rank stocks by their prior-month returns (e.g., their returns in December)
- At the end of the month, buy stocks with the lowest returns and sell stocks with the highest returns

To get started, I'll create the following strategy:

1. Every month, based on the entire universe, identify the bottom and top 10% of stocks based on their returns
2. Create two portfolios:
   - an equal-weighted 'long' portfolio that buys the bottom stocks
   - an equal-weighted 'short' portfolio that sells the top stocks
3. The return on the strategy will be the return on these stocks the *following* month

We need to pay some attention to timing

### Determine top and bottom deciles

- I group stocks by month and compute the 10th and 90th percentiles
- I get a new dataframe that shows how low or high a stock's return must be to be in the tails of the distribution

In [None]:
grp = df['ret'].dropna().groupby(level=1)
p10 = grp.apply(lambda x: np.percentile(x, 10))
p90 = grp.apply(lambda x: np.percentile(x, 90))
breakpoints = pd.DataFrame({'p10': p10, 'p90': p90})
breakpoints.tail(5)

### Merge breakpoints back into our original dataframe
- We need to specify what we are merging on.
- On the "left" we are merging by (level) "date"; on the "right" we are merging by the index (which is also date)
- We also need to specify what observations we want to keep: those on the left, those on the right, the intersection (inner), or the union (outer)

In [None]:
df = df.merge(breakpoints, left_on='date', right_index=True, how='left')
df.tail(5)
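
To see what the `how` argument does, here is a small sketch on made-up data. The frames `left` and `right` and their values are hypothetical; the only point is to count how many rows survive under each option.

```python
import pandas as pd

# Three months of returns on the left; breakpoints for two months on the right,
# only one of which (2023-08) also appears on the left
left = pd.DataFrame({'date': ['2023-07', '2023-08', '2023-09'],
                     'ret':  [0.01, -0.02, 0.03]})
right = pd.DataFrame({'p10': [-0.15, -0.12]},
                     index=pd.Index(['2023-08', '2023-10'], name='date'))

counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = left.merge(right, left_on='date', right_index=True, how=how)
    counts[how] = len(merged)
    print(f'{how:>5}: {len(merged)} rows')
```

With `how='left'` every stock-month is kept even if it has no breakpoints, which is why the merge above does not change the number of rows in `df`.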

### Create 0/1 variables to indicate stocks that we want to hold LONG or SHORT *next month*

- I also create a -1 / +1 variable called 'position' for later use

In [None]:
df['long'] = (df['ret'] <= df['p10']).astype('int')
df['short'] = (df['ret'] >= df['p90']).astype('int')
df['position'] = df['long'] - df['short']
df
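
The breakpoint-and-flag logic is easier to follow for a single month. The sketch below uses ten made-up returns; in the real data the same comparison runs month by month against the merged `p10` and `p90` columns.

```python
import numpy as np
import pandas as pd

# Ten made-up returns for a single month
ret = pd.Series([-0.20, -0.08, -0.03, -0.01, 0.00,
                  0.01,  0.02,  0.04,  0.09, 0.25])

p10 = np.percentile(ret, 10)
p90 = np.percentile(ret, 90)

long = (ret <= p10).astype(int)     # biggest losers: buy next month
short = (ret >= p90).astype(int)    # biggest winners: sell next month
position = long - short

print(f'p10 = {p10:.3f}, p90 = {p90:.3f}')
print(position.tolist())
```

`np.percentile` interpolates linearly between neighboring observations by default, so with ten stocks exactly one lands in each tail.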

### Create equal-weight portfolios

- Each stock in a portfolio gets a weight of 1/N
- So I need to compute how many stocks we have in the two portfolios each month
- I can groupby 'date' (month) and take sums of my long and short variables
- I can then get weights by dividing the original dataframe with these counts
  - The columns align: we divide long by long and short by short

In [None]:
# how many stocks do we have in each portfolio each month?
ns = df[['long','short']].groupby(level='date').sum()
print(ns.tail(5))

weights = df[['long','short']] / ns

In [None]:
# verify that the weights sum up to 1
weights.groupby(level=1).sum().head(3)

### Merge weights back into the main dataframe

- I drop, on the fly, some columns that we don't need and that have conflicting names (namely, "long" and "short") 

In [None]:
df = df.drop(columns=['p10', 'p90', 'long','short']).merge(weights,left_index=True,right_index=True)
df.tail(10)

### Compute returns for the two portfolios

- Timing is important here: the weights on each row are as of month t, computed from the same return that appears on that row
- To pair each weight with the *following* month's return, we can either pull next month's 'ret' onto the current row or lag the weights by one month
- I lag the weights by one month (`shift(1)`)
  - **Important**: I need to shift within each PERMNO so that when PERMNO changes, I don't grab the weights from the previous row
  - I could alternatively reshape the dataframe so that I only had dates in the index
    - But this would be slower and take more memory 

In [None]:
w = df[['long', 'short']].groupby(level='PERMNO').shift(1)
w.head(5)
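
The difference between a plain shift and a shift within each PERMNO shows up at the boundary between two securities. The sketch below uses two hypothetical PERMNOs with three months each; all values are made up.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[10001, 10002], pd.to_datetime(['2023-07-31', '2023-08-31', '2023-09-29'])],
    names=['PERMNO', 'date'])
toy = pd.DataFrame({'long': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=idx)

toy['naive'] = toy['long'].shift(1)                              # leaks across PERMNOs
toy['by_permno'] = toy['long'].groupby(level='PERMNO').shift(1)  # stays within a PERMNO

print(toy)
```

In the `naive` column the first row of PERMNO 10002 picks up 0.3, the last weight of PERMNO 10001; the grouped version leaves it missing. Note that `shift` moves by rows rather than by calendar months, so a gap in a security's history would still pair a weight with a return more than one month later.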

### Return computation

- Portfolio return is the sum of weights * returns each month 
- I multiply the (now lagged) weights by returns and take the sum of these products
- min_count = 1 is used to make sure that if there are any months without anything to sum, the result is a missing value (NaN)
  - By default, a sum of missing values equals zero

In [None]:
portfolio_returns = w.mul(df['ret'], axis=0).groupby(level='date').sum(min_count=1)
strev = portfolio_returns['long'] - portfolio_returns['short']
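
The effect of `min_count=1` can be seen on a tiny made-up series in which one month has nothing but missing values.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2023-08-31', '2023-08-31', '2023-09-29', '2023-09-29'])
products = pd.Series([np.nan, np.nan, 0.01, np.nan],
                     index=pd.Index(dates, name='date'))

default_sum = products.groupby(level='date').sum()             # all-NaN month -> 0.0
strict_sum = products.groupby(level='date').sum(min_count=1)   # all-NaN month -> NaN

print(pd.DataFrame({'default': default_sum, 'min_count=1': strict_sum}))
```

Without `min_count=1`, the first month of the sample (where every lagged weight is missing) would show up as a portfolio return of exactly zero rather than as missing.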

### Analyze the strategy's performance

I define a helper function for doing some analysis
- It just means that I don't have to rewrite the same code
- It is good to write modular code

In [None]:
def analyze_returns(r=None, name=None, start_date='1962-05', end_date='2023-09'):
    r = r.loc[start_date:end_date]
    ir = np.sqrt(12) * r.mean() / r.std()
    print(f'Analysis of a strategy: "{name}"')
    print(f'Start: {start_date}, End: {end_date}')
    print(f'Sharpe ratio: {ir:.2f}')
    r.cumsum().plot(figsize=(12,8))
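
In symbols, the statistic that `analyze_returns` prints is the annualized ratio of the mean monthly return to its sample standard deviation (pandas' `std` divides by $T-1$ by default). The notation below is mine, not the notebook's:

$$
\text{SR} = \sqrt{12}\,\frac{\bar r}{s_r}, \qquad
\bar r = \frac{1}{T}\sum_{t=1}^{T} r_t, \qquad
s_r = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\left(r_t-\bar r\right)^2}
$$

The factor $\sqrt{12}$ arises because annualizing multiplies the mean by 12 and the standard deviation by $\sqrt{12}$. No risk-free rate is subtracted because the inputs are long-short returns.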

In [None]:
analyze_returns(strev, 'Short-term reversals (deciles, equal-weighted)', end_date='1995-12')

### What issues might this strategy have?

- A *huge* amount of turnover
- Trading all stocks the same way independent of market caps
- Additional practical issue: we cannot implement this specific rule in real life
  - There is no gap between observing the signal and when we assume we got into the positions

### Define a *value-weighted* strategy instead

- The amount we invest in each stock is proportional to its market capitalization
- This is how almost all academic factors are constructed and it is a far fairer representation of how well the strategy might perform
  - But it is still "gross" of trading costs
- I'll construct this portfolio slightly differently
  1. I create a new column that contains next month's return for each stock (shift = -1 now)
  2. I take the product of market caps and these future returns
  3. I take the sum of these products (and market caps) separately for the stocks that belong to the long and short portfolios
  4. The value-weighted return is sum(me * retnm) / sum(me)
  
I end up with two series: long_return and short_return

- I *could* have again computed weights based on market caps

In [None]:
# add next month's return
df['retnm'] = df['ret'].groupby(level='PERMNO').shift(-1)
df['me_x_retnm'] = df['me'] * df['retnm']

# long portfolio
long = df['position'] == 1
long_sums = df.dropna().loc[long,['me','me_x_retnm']].groupby(level='date').sum()
long_return = long_sums['me_x_retnm'] / long_sums['me']

# short portfolio
short = df['position'] == -1
short_sums = df.dropna().loc[short,['me','me_x_retnm']].groupby(level='date').sum()
short_return = short_sums['me_x_retnm'] / short_sums['me']
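
As a sanity check, the ratio-of-sums formula and the explicit-weights route give the same number. The sketch below uses three made-up stocks in one portfolio for one month; the column names follow the notebook, the values do not come from CRSP.

```python
import numpy as np
import pandas as pd

# Three made-up stocks in one portfolio for one month
toy = pd.DataFrame({
    'me':    [900.0, 90.0, 10.0],     # market cap in millions
    'retnm': [0.01, 0.05, -0.20],     # next month's return
})

# Route 1: ratio of sums, sum(me * retnm) / sum(me)
vw_ratio = (toy['me'] * toy['retnm']).sum() / toy['me'].sum()

# Route 2: explicit weights proportional to market cap
w = toy['me'] / toy['me'].sum()
vw_weights = (w * toy['retnm']).sum()

# Equal-weighted return for comparison
ew = toy['retnm'].mean()

print(f'value-weighted: {vw_ratio:.4f}, equal-weighted: {ew:.4f}')
```

The smallest stock's −20% return drags the equal-weighted average below zero but barely moves the value-weighted return, which is the reason value weighting gives a fairer picture when tiny stocks drive the results.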

### Define the value-weighted strategy

- I need to SHIFT strategy returns so that the return each month corresponds to what the index says
- This undoes my 'retnm' convention from above

In [None]:
strev_vw = long_return - short_return
strev_vw = strev_vw.shift(1)
analyze_returns(strev_vw, 'Short-term reversals (deciles, value-weighted)', end_date='1995-12')

### How highly are the equal- and value-weighted strategies correlated?

In [None]:
pd.concat([strev, strev_vw], axis=1).corr()

## Compare to Fama and French's computation

- Did we do this right? 

### Get Fama-French factors from Ken French's website 

- I write a helper function that I can use to download the data
- There are some file-specific issues that I need to control for 
- I get both Fama-French factors (for later use) and portfolios formed based on short-term reversals

In [None]:
def download_french_data(url=None, csvname=None, skiplines=None):
    
    response = requests.get(url)

    # If the request is NOT successful, raise an exception
    if response.status_code != 200:
        raise Exception(f"Failed to download zip file. Status code: {response.status_code}")

    with zipfile.ZipFile(BytesIO(response.content)) as zip_file:

        # Check if the file exists in the zip archive
        if csvname in zip_file.namelist():
            # Read the CSV file directly from the zip archive
            with zip_file.open(csvname) as csv_file:
                lines = csv_file.readlines()

            # Remove rows from the beginning
            lines = lines[skiplines:]

            # Create a DataFrame from the trimmed lines using StringIO
            # First need to decode byte strings into unicode
            lines = [line.decode("utf-8") for line in lines]

            # at some point the file switches from monthly factors to annual factors and other stuff
            # we can delete what ever comes after
            # default to keeping every line in case no marker is found
            end = len(lines)
            for idx, line in enumerate(lines):
                if ('Annual Factors' in line) or (len(line.strip())==0):
                    end = idx
                    break

            lines = lines[:end]
            clean_csv = '\n'.join(lines)
            df = pd.read_csv(StringIO(clean_csv))                
            print(f'File {csvname} read successfully!')
            return df
        else:
            print(f'Zip file found but file {csvname} not found in the archive.')   
            return pd.DataFrame()
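
The trimming step can be tried offline. The sketch below runs the same idea on an in-memory string; the text and numbers are made up to mimic the layout described above (a few description lines, a monthly block, then an annual block) and are not a copy of any actual file.

```python
from io import StringIO
import pandas as pd

# Made-up text: description, monthly block, blank line, annual block
raw = (
    "This file was created using made-up numbers\n"
    "for illustration only\n"
    "\n"
    ",Mkt-RF,SMB\n"
    "196307,-0.39,-0.41\n"
    "196308,5.07,-0.80\n"
    "\n"
    " Annual Factors: January-December\n"
    ",Mkt-RF,SMB\n"
    "1963,-1.00,2.00\n"
)

lines = raw.splitlines()[3:]      # plays the role of skiplines=3
end = len(lines)                  # keep everything if no marker is found
for idx, line in enumerate(lines):
    if 'Annual Factors' in line or not line.strip():
        end = idx
        break

monthly = pd.read_csv(StringIO('\n'.join(lines[:end])))
print(monthly)
```

Because the header row starts with a comma, pandas names the first column `Unnamed: 0`; that column holds the YYYYMM date.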

### Read FF5 factors and clean the data

- Convert returns to decimals and date from YYYYMM to datetime

In [None]:
# Specify the file we want to read -- the CSV file inside has almost the same name 
url = 'https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip'
csvname = 'F-F_Research_Data_5_Factors_2x3.csv'

df_ff5 = download_french_data(url=url, csvname=csvname, skiplines=3)

print('\nData before processing:\n')
print(df_ff5.head(3))

df_ff5['date'] = df_ff5['Unnamed: 0'].apply(lambda x: datetime.strptime(str(x), '%Y%m'))
ff_data = df_ff5.loc[:,'Mkt-RF':'date'].set_index('date') / 100

print('\nData after processing:\n')
print(ff_data.head(3))
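
A vectorized alternative to the row-by-row `strptime` call is `pd.to_datetime` with an explicit format. This is a sketch on two made-up rows, not a replacement the notebook itself uses.

```python
import pandas as pd

toy = pd.DataFrame({'Unnamed: 0': [196307, 196308], 'Mkt-RF': [-0.39, 5.07]})

# YYYYMM integers -> strings -> timestamps on the first day of the month
toy['date'] = pd.to_datetime(toy['Unnamed: 0'].astype(str), format='%Y%m')

# Percent -> decimal, with the date as the index
clean = toy.drop(columns='Unnamed: 0').set_index('date') / 100

print(clean)
```

Either route stamps each month with its first day.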

### Read returns on portfolios formed based on short-term reversals

- Clean the data the same way as above: convert returns to decimals and dates from YYYYMM to datetime

In [None]:
url = 'https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/10_Portfolios_Prior_1_0_CSV.zip'
csvname = '10_Portfolios_Prior_1_0.CSV'    

df_rev = download_french_data(url=url, csvname=csvname, skiplines=10)

print('Data before processing:\n')
print(df_rev.head(3))

df_rev['date'] = df_rev['Unnamed: 0'].apply(lambda x: datetime.strptime(str(x), '%Y%m'))
df_rev = df_rev.drop(columns='Unnamed: 0')
rev_data = df_rev.set_index('date') / 100

print('\nData after processing:\n')
print(rev_data.head(3))

In [None]:
strev_ff = rev_data['Lo PRIOR'] - rev_data['Hi PRIOR']
analyze_returns(strev_ff, 'Short-term reversals (deciles, value-weighted, FF)', end_date='1995-12')

### Correlation between our strategy and that of Fama and French

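- There is the small issue that the date stamps differ: our dates come from CRSP, whereas parsing YYYYMM stamps each month with its first day
- I map both indices to the same monthly timestamp so I can merge the two series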

In [None]:
ours = strev_vw.copy()
ours.index = ours.index.to_period('M').to_timestamp('M')
ours.name = 'Our strategy'

theirs = strev_ff.copy()
theirs.index = theirs.index.to_period('M').to_timestamp('M')
theirs.name = 'FF\'s strategy'
pd.concat([ours, theirs], axis=1).corr().round(3)
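
The reason for the index conversion is visible on a toy example. The two series below are made up: one is stamped with dates late in each month, the other with the first day of the month. The sketch stops at `to_period('M')` rather than converting back to timestamps as the cell above does; either way the goal is one shared label per month.

```python
import pandas as pd

ours = pd.Series([0.010, -0.020, 0.030],
                 index=pd.to_datetime(['2023-07-31', '2023-08-31', '2023-09-29']),
                 name='Our strategy')
theirs = pd.Series([0.012, -0.015, 0.025],
                   index=pd.to_datetime(['2023-07-01', '2023-08-01', '2023-09-01']),
                   name="FF's strategy")

# Without alignment the two series share no index labels
unaligned = pd.concat([ours, theirs], axis=1)

# Map both indices to the same monthly label before combining
ours.index = ours.index.to_period('M')
theirs.index = theirs.index.to_period('M')
aligned = pd.concat([ours, theirs], axis=1)

print(len(unaligned), len(aligned))
```

The unaligned frame has six rows, each with one missing value, so a correlation computed from it would have no overlapping observations; the aligned frame has three complete rows.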

### Question: What is the difference?