<a href="https://colab.research.google.com/github/kerryback/2022-BUSI520/blob/main/wrds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# uncomment and execute the following if necessary

# !pip install wrds

In [3]:
import numpy as np
import pandas as pd
import wrds       
conn = wrds.Connection() 

WRDS recommends setting up a .pgpass file.
Created .pgpass file successfully.
Loading library list...
Done


### Annual Compustat

This example pulls data from the annual Compustat table, comp.funda. datadate is the end of the fiscal year and at is total assets. We impose the standard filters on indfmt, datafmt, popsrc, and consol. See https://wrds-web.wharton.upenn.edu/wrds/demo/demoform_compustat.cfm for a full list of Compustat variable definitions. Quarterly data is also available.

In [19]:
comp = conn.raw_sql(
    """
    select gvkey, datadate, at
    from comp.funda
    where datadate >= '2000-01-01' and at>0 
    and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'
    order by gvkey, datadate
    """, 
    date_cols=['datadate']
)

# convert string or float to int
comp.gvkey = comp.gvkey.astype(int)

### Days between datadates

In [97]:
# days between consecutive datadates within each firm (NaN for a firm's first report);
# groupby(...).diff() keeps the original index, so the result aligns with comp
comp["days"] = comp.groupby("gvkey").datadate.diff().dt.days
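
As a quick check of the per-firm day count, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical, not Compustat output). It uses the built-in groupby `diff` and the `.dt.days` accessor rather than string parsing.

```python
import pandas as pd

# Toy data, invented for illustration: firm 1 has a multi-year gap
toy = pd.DataFrame({
    "gvkey": [1, 1, 1, 2, 2],
    "datadate": pd.to_datetime(
        ["2000-12-31", "2001-12-31", "2007-12-31", "2000-06-30", "2001-06-30"]
    ),
})

# Difference within each firm, then extract whole days
# (float column, NaN on each firm's first row)
toy["days"] = toy.groupby("gvkey")["datadate"].diff().dt.days
print(toy)
```

The first row of each firm has no earlier report, so its day count is missing and nothing leaks across firms.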


### Investment rate

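The investment rate is the percent change in total assets (at) from one annual report to the next.  Set it to missing if more than 720 days separate consecutive datadates.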

In [98]:
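# investment rate = percent change in total assets within each firm
comp["inv"] = comp.groupby("gvkey")["at"].pct_change()
# set to missing when the previous report is more than 720 days back
comp["inv"] = comp.inv.mask(comp.days > 720)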
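
The effect of the 720-day rule can be seen in a small sketch with invented numbers (the `toy` frame and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: the third report follows a long gap
toy = pd.DataFrame({
    "gvkey": [1, 1, 1, 1],
    "at": [100.0, 110.0, 220.0, 231.0],
    "days": [np.nan, 365.0, 2191.0, 366.0],
})

# Percent change in assets within the firm
toy["inv"] = toy.groupby("gvkey")["at"].pct_change()

# Keep the growth rate only where the gap is not more than 720 days;
# a missing day count compares as False, so the first row is left alone
toy["inv"] = toy["inv"].where(~(toy["days"] > 720))
print(toy)
```

Without the mask, the third row would report asset growth measured over roughly six years as if it were a one-year rate.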


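Example: gvkey 1019 has a gap of 2191 days between its 2001 and 2007 rows, so inv is set to missing for 2007.  The date column in the output is the one defined in the next section.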

In [101]:
comp[comp.gvkey==1019].head()

|    | gvkey | datadate | at | date | days | inv |
| --- | --- | --- | --- | --- | --- | --- |
| 38 | 1019 | 2000-12-31 | 28.638 | 2001-06-30 | NaN | NaN |
| 39 | 1019 | 2001-12-31 | 30.836 | 2002-06-30 | 365.0 | 0.076751 |
| 40 | 1019 | 2007-12-31 | 34.18 | 2008-06-30 | 2191.0 | NaN |
| 41 | 1019 | 2008-12-31 | 33.486 | 2009-06-30 | 366.0 | -0.020304 |
| 42 | 1019 | 2009-12-31 | 32.678 | 2010-06-30 | 365.0 | -0.024129 |


### Use Fama-French lagging

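Shift all annual reports in a calendar year to June 30 of the following year.  This leaves at least six months between the fiscal year end and the date on which the data is treated as known.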

In [21]:
# define date as June 30 of year following datadate
comp['date'] = pd.to_datetime(comp.datadate.map(lambda d: str(d.year+1)+'-06-30'))
    
# if two annual reports in one calendar year (due to change of fiscal year), keep last one
comp = comp.drop_duplicates(subset=['gvkey', 'date'], keep='last') 
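
A minimal sketch of the shift and the duplicate rule on invented data (the `toy` frame is hypothetical): a firm that changes its fiscal year end files two annual reports in calendar year 2000, and only the later one survives.

```python
import pandas as pd

# Toy data, invented for illustration, already sorted by gvkey and datadate
toy = pd.DataFrame({
    "gvkey": [1, 1, 1],
    "datadate": pd.to_datetime(["2000-03-31", "2000-12-31", "2001-12-31"]),
    "at": [10.0, 12.0, 15.0],
})

# June 30 of the year following datadate
toy["date"] = pd.to_datetime(toy["datadate"].map(lambda d: f"{d.year + 1}-06-30"))

# Both 2000 reports map to 2001-06-30; keep the later one
toy = toy.drop_duplicates(subset=["gvkey", "date"], keep="last")
print(toy)
```

Note that keep='last' picks the most recent report only because the rows are sorted by datadate within gvkey, as the SQL query's order by clause ensures.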

### Assign permnos if merging with CRSP

In [29]:
link = conn.raw_sql(
    """
    select distinct gvkey, lpermno as permno, linkdt, linkenddt
    from crsp.Ccmxpf_linktable
    where linktype in ('LU', 'LC')
    and LINKPRIM in ('P', 'C')
    """
)

# convert strings or floats to ints
link['gvkey'] = link.gvkey.astype(int)
link['permno'] = link.permno.astype(int)

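# convert link dates to datetimes so they compare cleanly with datadate,
# and fill in missing end dates with a future date
link['linkdt'] = pd.to_datetime(link.linkdt)
link['linkenddt'] = pd.to_datetime(link.linkenddt).fillna(pd.Timestamp('2100-01-01'))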

# merge with Compustat data and keep rows with Compustat datadate between link date and link end date
comp = comp.merge(link, on='gvkey', how='inner')
comp = comp[(comp.datadate>=comp.linkdt) & (comp.datadate<=comp.linkenddt)]

comp = comp.drop(columns=['gvkey', 'datadate', 'linkdt', 'linkenddt'])
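
The link-window filter can be illustrated with a small sketch on invented data (the frames `comp_toy` and `link_toy` and the permno values are hypothetical). One gvkey maps to two permnos over different date ranges, and each annual report keeps only the link that was active on its datadate.

```python
import pandas as pd

# Toy data, invented for illustration
comp_toy = pd.DataFrame({
    "gvkey": [1, 1],
    "datadate": pd.to_datetime(["2001-12-31", "2005-12-31"]),
    "at": [10.0, 20.0],
})
link_toy = pd.DataFrame({
    "gvkey": [1, 1],
    "permno": [10001, 10002],
    "linkdt": pd.to_datetime(["2000-01-01", "2004-01-01"]),
    "linkenddt": pd.to_datetime(["2003-12-31", None]),  # None = link still active
})

# An open-ended link gets a far-future end date
link_toy["linkenddt"] = link_toy["linkenddt"].fillna(pd.Timestamp("2100-01-01"))

# The merge produces every report/link pair for the gvkey (4 rows here);
# the date filter keeps only pairs whose link window contains datadate
merged = comp_toy.merge(link_toy, on="gvkey", how="inner")
merged = merged[(merged.datadate >= merged.linkdt) & (merged.datadate <= merged.linkenddt)]
print(merged[["datadate", "at", "permno"]])
```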

### Merge CRSP with Compustat

* Change dates to monthly period format before merging, because the Compustat date is the last calendar day of the month and the CRSP date is the last trading day of the month.
* Merge keeping all rows of the CRSP data.  The Compustat columns will be NaN in 11 months of each year.

In [30]:
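# crsp is not defined in this notebook; it is assumed to be a DataFrame of
# monthly CRSP data with (at least) permno and date columns
crsp.date = crsp.date.dt.to_period('M')
comp.date = comp.date.dt.to_period('M')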

df = crsp.merge(comp, on=['permno', 'date'], how='left')
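
To see why the period conversion matters, here is a minimal sketch on invented data (the frames `crsp_toy` and `comp_toy` are hypothetical). June 30, 2001 fell on a Saturday, so the last trading day of that month was June 29 and an exact-date merge finds no match.

```python
import pandas as pd

# Toy data, invented for illustration
crsp_toy = pd.DataFrame({
    "permno": [10001, 10001],
    "date": pd.to_datetime(["2001-06-29", "2001-07-31"]),  # last trading days
    "ret": [0.01, -0.02],
})
comp_toy = pd.DataFrame({
    "permno": [10001],
    "date": pd.to_datetime(["2001-06-30"]),  # calendar month end
    "at": [12.0],
})

# Merging on exact dates misses the June observation
exact = crsp_toy.merge(comp_toy, on=["permno", "date"], how="left")

# Merging on monthly periods matches it
crsp_toy["date"] = crsp_toy["date"].dt.to_period("M")
comp_toy["date"] = comp_toy["date"].dt.to_period("M")
monthly = crsp_toy.merge(comp_toy, on=["permno", "date"], how="left")

print(exact)
print(monthly)
```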

### Fill Compustat data into months

* Group by permno when filling forward so we don't fill from one stock into another.
* A limit of 11 months on the forward fill is the right limit if we have shifted to June 30, because the next report, if there is one, then arrives 12 months later.  Without the shift, the limit should probably be longer, because a firm might change its fiscal year and go more than 12 months between annual reports.

In [31]:
df[['at', 'inv']] = df.groupby('permno')[['at', 'inv']].ffill(limit=11)
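
A minimal sketch of a grouped, limited forward fill on invented data (the `toy` frame is hypothetical, and the limit is shortened to 2 so the effect is visible in a few rows).

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: stock 1 reports once, stock 2 reports late
toy = pd.DataFrame({
    "permno": [1, 1, 1, 1, 2, 2],
    "at": [10.0, np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Fill forward within each stock, at most 2 rows past the last observation
toy["at_filled"] = toy.groupby("permno")["at"].ffill(limit=2)
print(toy)
```

The fourth row stays missing because it is three rows past the last report, and the first row of stock 2 stays missing because the fill never crosses from stock 1 into stock 2.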

### Check result

In [32]:
df.head()

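|    | permno | permco | date | ret | me | exchcd | at | inv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10001 | 7953 | 2000-01 | -0.044118 | 19906.25 | 3.0 | NaN | NaN |
| 1 | 10001 | 7953 | 2000-02 | 0.015385 | 20212.5 | 3.0 | NaN | NaN |
| 2 | 10001 | 7953 | 2000-03 | -0.015758 | 19712.0 | 3.0 | NaN | NaN |
| 3 | 10001 | 7953 | 2000-04 | 0.011719 | 19943.0 | 3.0 | NaN | NaN |
| 4 | 10001 | 7953 | 2000-05 | -0.023166 | 19481.0 | 3.0 | NaN | NaN |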


In [33]:
df.tail()

|    | permno | permco | date | ret | me | exchcd | at | inv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1159141 | 93436 | 53453 | 2021-11 | 0.027612 | 1149642000.0 | 3.0 | 52148.0 | 0.519951 |
| 1159142 | 93436 | 53453 | 2021-12 | -0.076855 | 1092218000.0 | 3.0 | 52148.0 | 0.519951 |
| 1159143 | 93436 | 53453 | 2022-01 | -0.113609 | 968131900.0 | 3.0 | 52148.0 | 0.519951 |
| 1159144 | 93436 | 53453 | 2022-02 | -0.070768 | 899619000.0 | 3.0 | 52148.0 | 0.519951 |
| 1159145 | 93436 | 53453 | 2022-03 | 0.238009 | 1113736000.0 | 3.0 | 52148.0 | 0.519951 |


### Coalescing Compustat data

Often one fills in a missing Compustat variable with another variable or with some calculation involving other variables.  An example is the calculation of preferred stock by Fama and French when they calculate book equity.  They use pstkrv if it is not missing; if it is missing, they use pstkl; if both of those are missing, they use pstk; if all three of those are missing they set it to zero.  

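Here is an implementation using bfill.  axis=1 means fill across the columns within each row, so each missing value is replaced by the next non-missing value to its right.  The result is a DataFrame of the same shape that you started with.  The .iloc[:,0] selects its first (zero-th) column, which now holds the first non-missing value among pstkrv, pstkl, and pstk.  The final fillna(0) handles rows in which all three are missing.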

In [39]:
comp = conn.raw_sql(
    """
    select gvkey, datadate, pstkrv, pstkl, pstk
    from comp.funda 
    where datadate >= '2000-01-01' 
    and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'
    order by gvkey, datadate
    """, 
    date_cols=['datadate']
)

comp['preferred'] = comp[['pstkrv','pstkl','pstk']].bfill(axis=1).iloc[:, 0].fillna(0)
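
The coalescing logic can be checked on a small invented frame (the `toy` data is hypothetical, with one row for each case in the rule). As a cross-check, the sketch also computes the same result with chained `combine_first` calls, which is an alternative and not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one row per case of the rule
toy = pd.DataFrame({
    "pstkrv": [11.225, np.nan, np.nan, np.nan],
    "pstkl": [9.6, 49.895, np.nan, np.nan],
    "pstk": [9.6, 49.895, 38.275, np.nan],
})

# Fill each row from the right, take the first column, then default to zero
toy["preferred"] = toy[["pstkrv", "pstkl", "pstk"]].bfill(axis=1).iloc[:, 0].fillna(0)

# Alternative: take pstkrv, fall back to pstkl, then pstk, then zero
alt = toy["pstkrv"].combine_first(toy["pstkl"]).combine_first(toy["pstk"]).fillna(0)

print(toy)
```

The column order passed to bfill sets the priority: the leftmost column wins whenever it is not missing.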

In [40]:
comp[comp.preferred != 0].head()

|    | gvkey | datadate | pstkrv | pstkl | pstk | preferred |
| --- | --- | --- | --- | --- | --- | --- |
| 23 | 1010 | 2000-12-31 | 11.225 | 9.6 | 9.6 | 11.225 |
| 24 | 1010 | 2001-12-31 | 48.07 | 48.07 | 48.1 | 48.07 |
| 25 | 1010 | 2002-12-31 | 70.209 | 70.209 | 70.2 | 70.209 |
| 65 | 1021 | 2006-06-30 | 4.744 | 4.744 | 4.744 | 4.744 |
| 77 | 1037 | 2000-03-31 | 2.583 | 2.583 | 2.583 | 2.583 |


In [38]:
comp[(comp.pstkrv.isna()) & (comp.preferred != 0)].head()

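|    | gvkey | datadate | pstkrv | pstkl | pstk | preferred |
| --- | --- | --- | --- | --- | --- | --- |
| 1303 | 1429 | 2000-08-31 | NaN | NaN | 38.275 | 38.275 |
| 1304 | 1429 | 2001-08-31 | NaN | NaN | 38.275 | 38.275 |
| 2857 | 1932 | 2000-12-31 | NaN | NaN | 44.865 | 44.865 |
| 2858 | 1932 | 2001-12-31 | NaN | NaN | 43.629 | 43.629 |
| 2859 | 1932 | 2002-12-31 | NaN | 49.895 | 49.895 | 49.895 |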
