## Pre-Processing & Investable Universe

Notes:
Program 2a does some preprocessing and cleaning, but its most important job is to limit the investable universe that we examine.
* Working back from the end goal, we (arbitrarily) want ~2000 stocks (sorted by market cap) to divide into quintiles at the end.
* So we create a universe of the top 3000 stocks every year, to leave a margin.
  * Because we use the previous year's beta to predict the next year's returns, each stock must exist in both years.
  * Stocks enter and leave the top 3000 for various reasons.
  * So the total number of stocks that come out of permscreen is much larger than 3000, but in every year you are guaranteed 3000 stocks to work with that have a beta signal that year as well as return data the year after.
* **Very specific to asam**: since we invest in December, we pick stocks that have a return at the end of December, for predicting the next year's returns.
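
The selection logic in the notes above can be sketched on toy data. This is only one reading of the notes, not the code in `util.py`: the helper `december_universe`, the `year` and `rank` columns, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def december_universe(df, n):
    """Top-n permnos by market cap each December that also appear the next December."""
    dec = df[df["jdate"].dt.month == 12].copy()
    dec["year"] = dec["jdate"].dt.year
    # Rank by market cap within each year (1 = largest) and keep the top n
    dec["rank"] = dec.groupby("year")["me"].rank(method="first", ascending=False)
    top = dec[dec["rank"] <= n]
    # Require a December observation one year later, so next-year returns exist
    have = set(zip(dec["permno"], dec["year"]))
    keep = [(p, y + 1) in have for p, y in zip(top["permno"], top["year"])]
    return top.loc[keep, ["permno", "year"]].reset_index(drop=True)

# Hypothetical data: permno 2 disappears after 2019
toy = pd.DataFrame({
    "permno": [1, 2, 3, 1, 3],
    "jdate": pd.to_datetime(["2019-12-31"] * 3 + ["2020-12-31"] * 2),
    "me": [300.0, 200.0, 100.0, 310.0, 120.0],
})
universe = december_universe(toy, n=2)
```

With `n=2`, permno 2 is in the 2019 top two but has no 2020 observation, so only permno 1 survives for 2019. This is why the screen starts wider than the number of stocks ultimately needed.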

### Import Packages

In [2]:
import pandas as pd
import numpy as np
import datetime as dt
import psycopg2 
import matplotlib.pyplot as plt
from dateutil.relativedelta import *
from pandas.tseries.offsets import *
from scipy import stats
import statsmodels.api as sm
import statistics
import sys
sys.path.insert(0, "../")
import util
import multiprocessing as mp
from timeit import default_timer as timer

### Set Local Macro Variables

In [3]:
# Number of top stocks (by market equity) to keep
numstocks = 3000

### Read in util.py Function Bank

In [4]:
from IPython.lib.backgroundjobs import BackgroundJobFunc

with open('util.py') as code:
    job = BackgroundJobFunc(exec, code.read())

result = job.run()

### Import Data

#### CRSP Monthly Stock File

In [5]:
crsp_m = pd.read_csv('qcrspmsf_raw.csv.gz', compression='gzip')

crsp_m = crsp_m[['permno', 'permco', 'date', 'ret', 'retx', 'shrout', 'prc']]

crsp_m.head()

Unnamed: 0,permno,permco,date,ret,retx,shrout,prc
0,10000.0,7952.0,1985-12-31,,,,
1,10000.0,7952.0,1986-01-31,,,3680.0,-4.375
2,10000.0,7952.0,1986-02-28,-0.257143,-0.257143,3680.0,-3.25
3,10000.0,7952.0,1986-03-31,0.365385,0.365385,3680.0,-4.4375
4,10000.0,7952.0,1986-04-30,-0.098592,-0.098592,3793.0,-4.0


#### CRSP Monthly Stock Event - Name History

In [6]:
qcrspmse_raw = pd.read_csv('qcrspmse_raw.csv.gz', compression='gzip')

qcrspmse_raw = qcrspmse_raw[['permno', 'shrcd', 'exchcd', 'namedt', 'nameendt']]

qcrspmse_raw = qcrspmse_raw[qcrspmse_raw['exchcd'].isin([1, 2, 3])]
qcrspmse_raw = qcrspmse_raw[qcrspmse_raw['shrcd'].isin([10, 11])]

qcrspmse_raw.head()

Unnamed: 0,permno,shrcd,exchcd,namedt,nameendt
0,10000.0,10.0,3.0,1986-01-07,1986-12-03
1,10000.0,10.0,3.0,1986-12-04,1987-03-09
2,10000.0,10.0,3.0,1987-03-10,1987-06-11
3,10001.0,11.0,3.0,1986-01-09,1993-11-21
4,10001.0,11.0,3.0,1993-11-22,2004-06-09


##### Join monthly stock data with name history

In [7]:
crsp_m = crsp_m.merge(qcrspmse_raw, on='permno', how='left')

crsp_m = crsp_m[(crsp_m.namedt <= crsp_m.date) & (crsp_m.date <= crsp_m.nameendt)]

crsp_m[['permco','permno','shrcd','exchcd']] = crsp_m[['permco','permno','shrcd','exchcd']].astype(int)

crsp_m.head()

Unnamed: 0,permno,permco,date,ret,retx,shrout,prc,shrcd,exchcd,namedt,nameendt
3,10000,7952,1986-01-31,,,3680.0,-4.375,10,3,1986-01-07,1986-12-03
6,10000,7952,1986-02-28,-0.257143,-0.257143,3680.0,-3.25,10,3,1986-01-07,1986-12-03
9,10000,7952,1986-03-31,0.365385,0.365385,3680.0,-4.4375,10,3,1986-01-07,1986-12-03
12,10000,7952,1986-04-30,-0.098592,-0.098592,3793.0,-4.0,10,3,1986-01-07,1986-12-03
15,10000,7952,1986-05-30,-0.222656,-0.222656,3793.0,-3.109375,10,3,1986-01-07,1986-12-03
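
The merge-then-filter pattern in the cell above can be seen on a small example. The frames `monthly` and `names` and their values are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

monthly = pd.DataFrame({
    "permno": [10000, 10000],
    "date": pd.to_datetime(["1986-02-28", "1987-01-30"]),
})
names = pd.DataFrame({
    "permno": [10000, 10000],
    "exchcd": [3, 3],
    "namedt": pd.to_datetime(["1986-01-07", "1986-12-04"]),
    "nameendt": pd.to_datetime(["1986-12-03", "1987-03-09"]),
})

# The merge yields one row per (month, name record) pair for the permno ...
pairs = monthly.merge(names, on="permno", how="left")
# ... and the date filter keeps only the name record in effect on that date.
matched = pairs[(pairs["namedt"] <= pairs["date"]) & (pairs["date"] <= pairs["nameendt"])]
```

Note that rows with no matching name record get missing `namedt`/`nameendt` from the left merge, and the comparison then drops them, so the result behaves like an inner join on the date range. The intermediate frame also grows with the number of name records per permno, which explains the gaps in the row index of the output above.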


#### CRSP Monthly Stock Event - Delisting

In [8]:
qdlret_raw = pd.read_csv('qdlret_raw.csv.gz', compression='gzip')

# .copy() makes dlret an independent frame, avoiding SettingWithCopyWarning below
dlret = qdlret_raw[['permno', 'dlret', 'dlstdt']].copy()

dlret['permno'] = dlret['permno'].astype(int)

dlret['date'] = pd.to_datetime(dlret['dlstdt'])

dlret.head()


Unnamed: 0,permno,dlret,dlstdt,date
0,10000,0.0,1987-06-11,1987-06-11
1,10001,0.011583,2017-08-03,2017-08-03
2,10002,0.046007,2013-02-15,2013-02-15
3,10003,0.01373,1995-12-15,1995-12-15
4,10005,0.125,1991-07-11,1991-07-11


##### Format Dates

In [9]:
# Set all timestamps to common month end
crsp_m = util.monthEnd(crsp_m)
dlret = util.monthEnd(dlret)

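
The body of `util.monthEnd` is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it replaces `date` with a month-end `jdate` column (as the later outputs suggest), is given below; the name `month_end` and its `date_col` parameter are hypothetical.

```python
import pandas as pd

def month_end(df, date_col="date"):
    """Return a copy of df with date_col replaced by a month-end 'jdate' column."""
    out = df.copy()
    # MonthEnd(0) rolls each date forward to its month end and leaves
    # dates that already fall on a month end unchanged
    out["jdate"] = pd.to_datetime(out[date_col]) + pd.offsets.MonthEnd(0)
    return out.drop(columns=[date_col])

toy = pd.DataFrame({"permno": [10000, 10000],
                    "date": ["1986-05-30", "1987-06-11"]})
aligned = month_end(toy)
```

Aligning both the monthly file and the delisting file to the same month-end stamp is what lets them be joined on `permno` and `jdate` even though a delisting can occur mid-month.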


### Combine Returns

In [10]:
crsp = util.comebineRet(crsp_m,dlret)
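
The implementation of `util.comebineRet` is not shown here. A common way to fold delisting returns into monthly returns is to compound the two, and the sketch below assumes that approach; the function name `combine_returns`, the zero fill for missing delisting returns, and the toy data are assumptions rather than the notebook's actual code.

```python
import pandas as pd

def combine_returns(crsp_m, dlret):
    """Attach delisting returns and compound them with the monthly return."""
    out = crsp_m.merge(dlret[["permno", "jdate", "dlret"]],
                       on=["permno", "jdate"], how="left")
    # Months without a delisting event get a delisting return of zero
    out["dlret"] = out["dlret"].fillna(0)
    out["retadj"] = (1 + out["ret"]) * (1 + out["dlret"]) - 1
    return out

monthly = pd.DataFrame({
    "permno": [1, 1],
    "jdate": pd.to_datetime(["2020-01-31", "2020-02-29"]),
    "ret": [0.10, -0.20],
})
delist = pd.DataFrame({
    "permno": [1],
    "jdate": pd.to_datetime(["2020-02-29"]),
    "dlret": [-0.50],
})
combined = combine_returns(monthly, delist)
```

In the toy data the February return of -20% compounds with a -50% delisting return to give -60%, while January is unchanged.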

### Prune Stocks

In [11]:
# Prune stocks, map all equity to largest permno
crsp=util.maxCap(crsp)
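
One plausible reading of "map all equity to largest permno" is: sum market equity across all share classes (permnos) of a company (permco) on each date, then keep only the largest permno and give it the company total. The sketch below follows that reading; `max_cap` and the toy data are hypothetical, and the real `util.maxCap` may differ.

```python
import pandas as pd

def max_cap(df):
    """Keep the largest permno per (jdate, permco) and assign it the company's total me."""
    groups = df.groupby(["jdate", "permco"])["me"]
    total = groups.transform("sum")   # company-level market equity on each row
    largest = groups.idxmax()         # row label of the biggest share class
    out = df.loc[largest].copy()
    out["me"] = total.loc[largest]
    return out.reset_index(drop=True)

toy = pd.DataFrame({
    "permno": [1, 2, 3],
    "permco": [100, 100, 200],
    "jdate": pd.to_datetime(["2020-12-31"] * 3),
    "prc": [-5.0, 10.0, 8.0],   # CRSP stores a negative price for bid/ask averages
    "shrout": [10, 15, 10],
})
# Market equity uses the absolute price because of that sign convention
toy["me"] = toy["prc"].abs() * toy["shrout"]
pruned = max_cap(toy)
```

Here permco 100 has two share classes worth 50 and 150, so permno 2 is kept with a combined market equity of 200.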

In [12]:
# Keep only the top X stocks by market cap per jdate, so a stock may be
# absent from earlier periods in which it ranked outside the top X
crsp = util.topX(crsp, 'me', numstocks)

crsp.head()

Unnamed: 0,permno,permco,ret,retx,shrcd,exchcd,namedt,nameendt,jdate,retadj,retxadj,me
3130349,10001,7953,-0.118436,-0.12237,11,2,2010-07-09,2017-08-03,2010-11-30,-0.118436,-0.12237,78653.359701
3138140,10001,7953,0.028992,0.024715,11,2,2010-07-09,2017-08-03,2011-01-31,0.028992,0.024715,84450.517908
3142018,10001,7953,0.022727,0.018553,11,2,2010-07-09,2017-08-03,2011-02-28,0.022727,0.018553,86017.316414
3145894,10001,7953,0.072404,0.068306,11,2,2010-07-09,2017-08-03,2011-03-31,0.072404,0.068306,91892.816414
3149754,10001,7953,-0.038789,-0.042626,11,2,2010-07-09,2017-08-03,2011-04-30,-0.038789,-0.042626,91535.726269
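
The per-date cut performed by `util.topX` can be approximated with a grouped rank. This is a sketch under the assumption that the ranking is done within each `jdate`; the function `top_x` and the toy frame are illustrative only.

```python
import pandas as pd

def top_x(df, col, n):
    """Keep the n rows with the largest `col` within each jdate."""
    # rank 1 = largest value; method="first" breaks ties by row order
    rank = df.groupby("jdate")[col].rank(method="first", ascending=False)
    return df[rank <= n]

toy = pd.DataFrame({
    "permno": [1, 2, 3, 1, 2, 3],
    "jdate": pd.to_datetime(["2020-11-30"] * 3 + ["2020-12-31"] * 3),
    "me": [10.0, 30.0, 20.0, 40.0, 15.0, 25.0],
})
kept = top_x(toy, "me", 2)
```

Permno 1 is outside the top two in November but inside it in December, which illustrates how a stock can be missing from some months of the pruned file. Rows with a missing `me` receive no rank and are dropped by the comparison.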


In [13]:
# Export the monthly permno list for the daily calculation here.
# This effectively reduces the previous list to AT MOST 3k stocks per year,
# based on the 3k that exist in December of that year.
# If a stock did not exist in January of that year, January will have
# fewer than 3k stocks.
crsp['month']=crsp['jdate'].dt.month
permlist=crsp[crsp['month']==12][['permno','jdate']].drop_duplicates()

permlist.to_csv("qpermlist.csv.gz",
                index=False,
                compression="gzip")

# Dump pruned crsp that matches permlist
crsp.to_csv("qlocalcrsp_matched.csv.gz",
            index=False,
            compression="gzip")

Notes:
* Filter to keep only stocks with a December return
* This makes each year comparable against the prior year
* Delisting events are used to check whether a stock still exists