# Preparing CRSP data for backtesting (daily data)

&copy; **Johannes Ruf** (comments welcome under j.ruf@lse.ac.uk, February 2023)

This notebook considers daily data instead of monthly data as the previous notebook did. We again construct a dataframe `df` that can be used to backtest systematic trading strategies. We consider trading strategies that are functions of the stock capitalizations only (and don't depend on other characteristics, e.g., industries).

The dataframe `df` will have three components: a matrix of returns, a matrix of market capitalizations, and a matrix of flags that tag problematic returns.

A flag value of 0 implies no special issues. The remaining flag values are constructed as follows.

In [None]:
FLAG_PROBLEMATIC_INTERMEDIATE_RETURN = 2

FLAG_TEMPORARY_DELISTING = 3

FLAG_DELRET_MISSING = 4

FLAG_MISSING_RETURN_IMPUTED = 5
# If return was missing but the trading days before and after have 'good' returns.
# The missing returns are replaced by 0.0 on those days.

FLAG_RETURN_BASED_ON_BA = 1    
# if return is based on a bid-ask average and not corresponding to a missing delisting or a problematic intermediate return

Moreover, we add 10 to the flag value if the corresponding return is larger than `CUTOFF_LARGE_RETURN` or smaller than `CUTOFF_SMALL_RETURN`, as defined below.

In [None]:
CUTOFF_LARGE_RETURN = 1   # 1 corresponds to doubling over the period, i.e., a period return of 100%.
CUTOFF_SMALL_RETURN = -0.5
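Since the base flags take values 0 to 5 and the large/small-return marker adds 10, a stored flag can be split back into its two parts. The following is a minimal sketch with made-up flag values; the names `base_flag` and `is_extreme` are illustrative and not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical flag values, as they might appear in the final `bcktstflg` column.
flags = pd.Series([0, 1, 4, 10, 13, 15])

base_flag = flags % 10      # 0 to 5: the flag values defined above
is_extreme = flags >= 10    # True if 10 was added for a very large/small return

decoded = pd.DataFrame({'flag': flags, 'base_flag': base_flag, 'is_extreme': is_extreme})
print(decoded)
```

For instance, a stored value of 13 decodes to a temporary delisting (3) whose return also lies outside the cutoffs.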

If securities are 'temporarily delisted' (here defined as: the previous capitalization is available but returns are missing), we replace the first missing return in a consecutive sequence of missing returns by `TEMPORARY_DELISTING_RETURN`.

If delisting returns are missing, we set them to `MISSING_DELIST_RETURN`. 

Of course, other values or methods to handle these returns are possible, too.  When backtesting trading strategies, robustness checks with respect to these assumptions are recommended (easily implemented thanks to the backtesting flags constructed below).

In [None]:
TEMPORARY_DELISTING_RETURN = -0.1
MISSING_DELIST_RETURN = -0.3
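One way such a robustness check might look is sketched below on toy data. It assumes that a return flagged with `FLAG_DELRET_MISSING` was built as $(1+r)(1+\text{MISSING\_DELIST\_RETURN})-1$, which is how the cleaning step further down in this notebook constructs it. The value `ALTERNATIVE_DELIST_RETURN` is a hypothetical alternative, not a recommendation.

```python
import pandas as pd

MISSING_DELIST_RETURN = -0.3        # value used in this notebook
FLAG_DELRET_MISSING = 4             # flag value used in this notebook
ALTERNATIVE_DELIST_RETURN = -0.5    # hypothetical alternative assumption

# Toy data: one security whose last return carries an imputed delisting return.
ret = pd.Series([0.01, -0.02, (1 + 0.02) * (1 + MISSING_DELIST_RETURN) - 1])
flag = pd.Series([0, 0, FLAG_DELRET_MISSING])

# Undo the original assumption and apply the alternative one, on flagged returns only.
# `flag % 10` strips the marker for very large/small returns.
rescaled = (1 + ret) / (1 + MISSING_DELIST_RETURN) * (1 + ALTERNATIVE_DELIST_RETURN) - 1
ret_alt = ret.mask(flag % 10 == FLAG_DELRET_MISSING, other=rescaled)
print(ret_alt)
```

The same pattern applies to the wide dataframes `df['dlyret']` and `df['bcktstflg']` constructed below, since `mask` works elementwise on dataframes as well.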

## Preparations

In [None]:
import pandas as pd

import wrds
WRDS_LOGIN = 'xxx'    # update to your login info on CRSP

DATAPATH = '~/Desktop/YOUR_FOLDER_NAME/'

In [None]:
db = wrds.Connection(wrds_username=WRDS_LOGIN)

## Loading the data

The trading strategies we will consider may depend on the stocks' capitalizations. To avoid 'anticipatory' strategies, on any day we are only allowed to use the previous day's capitalizations. This is why we load `dlyprevcap` rather than the same-day capitalization.

The next cell takes around 50 minutes on my computer, with a standard broadband connection.

In [None]:
%%time

df = db.raw_sql("""SELECT d.dlycaldt, d.permno, d.dlyprevcap, d.dlyret, d.dlyprcflg 
                   FROM crsp.StkDlySecurityData AS d
                   JOIN crsp.StkSecurityInfoHist AS s
                   ON 
                   d.permno = s.permno AND s.secinfostartdt <= d.dlycaldt 
                   AND d.dlycaldt <= s.secinfoenddt
                   WHERE 
                   s.sharetype = 'NS' AND s.securitysubtype = 'COM' 
                   AND s.issuertype IN ('ACOR','CORP') AND s.usincflg = 'Y'
                   """, date_cols='dlycaldt')
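To illustrate why the query loads `dlyprevcap`, here is a minimal sketch of a non-anticipatory capitalization-weighted portfolio return. The numbers and the `permno` values are made up, and the wide layout (dates in rows, one column per security) anticipates the pivoting step further down.

```python
import pandas as pd

# Toy data (hypothetical numbers): rows are dates, columns are permnos.
dates = pd.to_datetime(['2020-01-02', '2020-01-03'])
prevcap = pd.DataFrame({10001: [100.0, 110.0], 10002: [300.0, 290.0]}, index=dates)
ret = pd.DataFrame({10001: [0.10, 0.00], 10002: [-0.02, 0.04]}, index=dates)

# The weights on each day use only the previous day's capitalizations.
weights = prevcap.div(prevcap.sum(axis=1), axis=0)
portfolio_ret = (weights * ret).sum(axis=1)
print(portfolio_ret)
```

Because the weights on a given day are built from information available at the previous close, the strategy could have been implemented in real time.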

It turns out that the above query does not return the delisting returns (see the warning in Notebook 4). Hence we need to add those returns by hand. The next query returns the delisting returns provided the `StkSecurityInfoHdr` states that the last information of the security satisfies the criteria to be a member of the investment universe.

In [None]:
%%time

df_delists = db.raw_sql("""SELECT d.dlycaldt, d.permno, d.dlyprevcap, d.dlyret, d.dlyprcflg 
                   FROM crsp.StkDlySecurityData AS d
                   JOIN crsp.StkSecurityInfoHdr AS h
                   ON d.permno = h.permno 
                   WHERE 
                   h.sharetype = 'NS' AND h.securitysubtype = 'COM' 
                   AND h.issuertype IN ('ACOR','CORP') AND h.usincflg = 'Y'
                   AND d.dlydelflg = 'Y'
                   """, date_cols='dlycaldt')

In [None]:
db.close()

In [None]:
%%time

df = pd.concat([df, df_delists])

In [None]:
df.info()

For temporary backups, we can store and load intermediate results using the next cells.

In [None]:
%%time

with pd.HDFStore(DATAPATH + 'tmp_daily.h5') as store:
    df = store['df']

## Outline of the cleaning steps

We now proceed with the following steps:
1) First, we do some preliminary cleaning steps and add a column to flag critical returns.
2) We pivot the data so that each row corresponds to one date, and each column to a `permno`.
3) We check and clean the beginning and end of each time series.
4) We check and clean for 'temporary delistings' and missing intermediate returns.
5) We flag very large/small returns.
6) We store the data.

## Preliminary cleaning steps

We check whether returns are based on bid-ask averages, and whether CRSP tagged returns as problematic (in particular, if intermediate returns are missing).

In contrast to the monthly data, we do *not* remove entries where `dlyprevcap` is missing. The reason is that even large securities often have days with missing data. (See, for example, the case study with IBM in Notebook 2.) For such cases it seems better not to remove these securities from the investment universe. We handle these cases below when treating missing intermediate returns.

#### A new column and prices based on bid-ask-spreads

We now add a new column called `bcktstflg` ('backtesting flag') to the dataframe, where we flag all problematic returns. 

We first flag all returns that are based on a bid-ask-average instead of a trading price. (Those returns might be re-tagged if they correspond to a missing delisting return.)

In [None]:
%%time

bl = (df['dlyprcflg']=='BA')

In [None]:
print('This step tags {:_} ({:.2f}%) rows with the BA flag.'.format(
    bl.sum(), 100 * bl.sum() / len(df)))

In [None]:
df['bcktstflg'] = 0
df.loc[bl, 'bcktstflg'] = FLAG_RETURN_BASED_ON_BA

#### Intermediate missing and problematic returns

What kind of price flags are there?

In [None]:
%%time

df['dlyprcflg'].value_counts(normalize=True)

What kind of price flags are there when returns are missing?

In [None]:
df.loc[df['dlyret'].isnull(), 'dlyprcflg'].value_counts()

Let's now find all problematic returns:

In [None]:
%%time

bl = df['dlyret'].isnull() | df['dlyprevcap'].isnull() | df['dlyprcflg'].isin(['NT', 'MP', 'HA', 'SU', 'DM'])
df = df.drop('dlyprcflg', axis=1)

The above code captures all problematic returns. We have included the flag values `HA` and `SU` for which CRSP doesn't provide any explanations [here](https://www.crsp.org/files/appendix/FlagType_PC.html), see also the warning in Notebook 2.
Some of the above tagged returns will be removed below, for example, when appearing at the beginning of a time series. 

In [None]:
print('This step tags {:_} ({:.2f}%) rows.'.format(bl.sum(), 100 * bl.sum() / len(df)))

In [None]:
df.loc[bl, 'bcktstflg'] = FLAG_PROBLEMATIC_INTERMEDIATE_RETURN   

In [None]:
df.info()

## Pivoting the data

In [None]:
%%time

df = df.pivot(index='dlycaldt', columns='permno')
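The effect of this pivot can be seen on a small toy example (made-up values). Since no `values` argument is given, all remaining columns are pivoted, and the resulting columns form a two-level index: the variable name on the first level and `permno` on the second.

```python
import pandas as pd

# Toy long-format data: security 10002 has no row on the second date.
long_df = pd.DataFrame({
    'dlycaldt': pd.to_datetime(['2020-01-02', '2020-01-02', '2020-01-03']),
    'permno': [10001, 10002, 10001],
    'dlyret': [0.01, -0.02, 0.03],
    'bcktstflg': [0, 1, 0],
})

wide = long_df.pivot(index='dlycaldt', columns='permno')

print(wide['dlyret'])      # one row per date, one column per permno
print(wide['bcktstflg'])   # NaN where the security had no row on that date
```

Note that a missing `bcktstflg` after pivoting means there was no row for that security and date, that is, the security was not in the investment universe. The cleaning steps below use `df['bcktstflg'].isnull()` and `.notnull()` in exactly this sense.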

In [None]:
%%time

# check that index is sorted
assert df['dlyret'].index.is_monotonic_increasing

In [None]:
df.info()

## Cleaning the beginning and end of each time series

####  Beginning of each time series

Let's first clean the *beginning* of each time series. We remove all entries of a security that precede its first valid (non-missing, unflagged) return. Note that the following code cells only change the beginning of each time series. The intuition behind this cleaning step is that in real time we would only start investing in such securities as soon as they are sufficiently well traded.

In [None]:
%%time

mask = df['dlyret'].isnull() | df['dlyprevcap'].isnull() | df['bcktstflg'].gt(0)

mask = mask.cummin()

In [None]:
%%time

for c in df.columns.levels[0]:
    df[c] = df[c].mask(mask)
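The role of `cummin` in the mask above can be checked on a single toy column. This is only an illustration; the names `problem` and `leading` do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy column: True marks a missing or flagged entry.
problem = pd.Series([True, True, False, True, False])

# The cumulative minimum stays True only while every entry so far has been True,
# so it selects exactly the leading block of problematic entries.
leading = problem.cummin().astype(bool)
print(leading.tolist())   # [True, True, False, False, False]
```

The problematic entry in the fourth position is left untouched, since it occurs after the first valid return.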

There are quite a few time series that correspond to assets in which we would never start investing according to the rule above. Many of these price time series have never observed trading prices (i.e., `dlyprcflg` is set to `BA` throughout).

We remove them from the dataframe in the following:

In [None]:
bl = mask.iloc[-1]

In [None]:
print('There are {} ({:.2f}%) time series without a valid return after removing missing data at the beginning.'.format(
            bl.sum(), 100 * bl.mean()))

In [None]:
%%time

df = df.loc[:, df.columns.get_level_values('permno').isin(bl.index[~bl])]

In [None]:
df.info()

#### Delisting and cleaning the end of the time series

We now remove missing returns at the end of each time series and set missing delisting returns to `MISSING_DELIST_RETURN`. Note that a time series need not have a regular or missing delisting return; for example, if the security is still traded or fell out of the investment universe because of a change in status (e.g. change of `usincflg`).

To understand the values of the related column `dlydelflg`, see [here](https://www.crsp.org/files/appendix/FlagType_DE.html).

In [None]:
%%time

mask = df['dlyret'].isnull() | df['dlyprevcap'].isnull() | df['bcktstflg'].gt(0)

mask = mask[::-1].cummin()[::-1]

mask = mask.mask(df['bcktstflg'].isnull()[::-1].cummin()[::-1], other=False)
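The three lines above can be traced on a toy column (made-up values): reversing the series before and after `cummin` selects the trailing block of problematic entries, and the last line then excludes dates on which the security had already left the dataframe (missing flag).

```python
import numpy as np
import pandas as pd

# Toy column; a NaN flag means there is no row for that date.
flag = pd.Series([0, 0, 2, 0, 2, 2, np.nan])
ret = pd.Series([0.01, 0.02, np.nan, 0.01, np.nan, np.nan, np.nan])

problem = ret.isnull() | flag.gt(0)

# Reverse, take the cumulative minimum, reverse back: trailing problematic block.
trailing = problem[::-1].cummin()[::-1].astype(bool)

# Dates after the security's last row are not part of the block to be modified.
after_end = flag.isnull()[::-1].cummin()[::-1].astype(bool)
to_modify = trailing.mask(after_end, other=False)

print(to_modify.tolist())   # [False, False, False, False, True, True, False]
```

The intermediate problematic entry in the third position is not selected here; it is handled in the section on temporary delistings and missing returns.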

In [None]:
print('There are {} ({:.2f}%) time series whose return series at the end are being modified.'.format(
    mask.any().sum(), 100 * mask.any().mean()))

print('There are {} returns being modified.'.format(mask.sum().sum()))

If there is no `dlyprevcap` corresponding to the first of the problematic days, then we need to modify the return on the last non-problematic day:

In [None]:
%%time

mask = mask | (mask.shift(-1, fill_value=False) & ~mask & df['dlyprevcap'].isnull().shift(-1, fill_value=False))

In [None]:
mask_first_return = mask & ~mask.shift(1, fill_value=False)   
# the first of the problematic returns at the end of each problematic time series

mask_others = mask & ~mask_first_return

In [None]:
%%time

df['dlyret'] = df['dlyret'].mask(mask_first_return, 
                                 other=df['dlyret'].fillna(0).add(1).multiply(1 + MISSING_DELIST_RETURN).subtract(1))
df['bcktstflg'] = df['bcktstflg'].mask(mask_first_return, other=FLAG_DELRET_MISSING)
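The replacement value above compounds whatever return is present on the selected day with the assumed delisting return. A small numerical check (toy numbers) of the two situations that can occur:

```python
import numpy as np
import pandas as pd

MISSING_DELIST_RETURN = -0.3   # value used in this notebook

# Two toy situations: the selected day has a valid return, or the return is missing.
last_ret = pd.Series([0.02, np.nan], index=['valid return', 'missing return'])

# Same formula as above: (1 + r) * (1 + MISSING_DELIST_RETURN) - 1, with r = 0 if missing.
combined = last_ret.fillna(0).add(1).multiply(1 + MISSING_DELIST_RETURN).subtract(1)
print(combined)
```

With a valid return of 2% the combined return is $1.02 \times 0.7 - 1 = -28.6\%$; with a missing return it equals `MISSING_DELIST_RETURN` itself.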

In [None]:
%%time

for c in df.columns.levels[0]:
    df[c] = df[c].mask(mask_others)

In [None]:
%%time

# Check that each return time series has at least one value
assert df['dlyret'].notnull().any().all()

In [None]:
%%time

# check that if a return is provided then also the previous capitalization is provided
assert not (df['dlyret'].notnull() & df['dlyprevcap'].isnull()).any().any()

## Temporary delistings and missing returns

We now take care of missing returns for securities in the investment universe. Note that by the above manipulations, returns on delisting dates always exist.

We distinguish three cases: (a) the previous return exists (security was in the investment universe) and is not problematic, and the following return exists; (b) the previous return exists (security was in the investment universe) and is not problematic, but the following return does not exist; (c) the previous return is problematic or the security was not in the investment universe.

In case (a), we consider this situation as a non-trade day for the specific security. We populate `dlyprevcap` with the value on the day after and `dlyret` with 0. (For an example of this situation, recall the missing price observations for IBM in Notebook 2.)  In case (b), we consider this as a temporary delisting.  In case (c), we assume any temporary delisting effects are already taken into account by the previous returns and we remove the security from the investment universe for that day.

The following `mask` captures all missing returns that follow a problematic (or missing) return (case (c)). The corresponding entries will be removed.

In [None]:
%%time

mask = df['dlyret'].isnull() & df['bcktstflg'].notnull() & \
      (df['dlyret'].isnull() | df['bcktstflg'].eq(FLAG_PROBLEMATIC_INTERMEDIATE_RETURN)).shift(1, fill_value=False) 

# Note that the second `df['dlyret'].isnull()` is required to capture the case when the security was not in the
# investment universe on the previous day.

In [None]:
print("There are {} ({:.2f}%) time series for which we remove missing returns from the investment universe.".format(
      mask.any().sum(), 100 * mask.any().mean()))
print("In total, we remove {} returns.".format(mask.sum().sum()))

In [None]:
%%time

for c in df.columns.levels[0]:
    df[c] = df[c].mask(mask)

In [None]:
%%time

# Check that each return time series has at least one value
assert df['dlyret'].notnull().any().all()

We now take care of cases (a) and (b):

In [None]:
%%time

mask = df['dlyret'].isnull() & df['bcktstflg'].notnull()
mask_tmp_delist = mask & df['dlyret'].isnull().shift(-1, fill_value=False)    # case (b)
mask_fillna = mask & ~mask_tmp_delist    # case (a)
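The split into cases (a) and (b) can be followed on a toy column (made-up values). The column is assumed to be in the state reached after the case (c) entries have been removed, so the second of two consecutive missing returns already has a missing flag.

```python
import numpy as np
import pandas as pd

# Toy column after the removal of case (c) entries.
ret = pd.Series([0.01, np.nan, 0.02, np.nan, np.nan, 0.03])
flag = pd.Series([0, 2, 0, 2, np.nan, 0])

# Missing returns of securities that are still in the investment universe.
missing = ret.isnull() & flag.notnull()

# Case (b): the following return is missing as well, so this is a temporary delisting.
tmp_delist = missing & ret.isnull().shift(-1, fill_value=False)

# Case (a): an isolated missing return, to be imputed with 0.
fill_zero = missing & ~tmp_delist

print(tmp_delist.tolist())   # [False, False, False, True, False, False]
print(fill_zero.tolist())    # [False, True, False, False, False, False]
```

The names `missing`, `tmp_delist`, and `fill_zero` mirror `mask`, `mask_tmp_delist`, and `mask_fillna` above.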

In [None]:
print("There are {} ({:.2f}%) time series that have temporary delistings.".format(
      mask_tmp_delist.any().sum(), 100 * mask_tmp_delist.any().mean()))
print("In total, we have {} temporary delistings.".format(mask_tmp_delist.sum().sum()))

print("There are {} ({:.2f}%) time series that have missing return sequences of exactly length one.".format(
      mask_fillna.any().sum(), 100 * mask_fillna.any().mean()))
print("In total, we have {} missing return sequences of exactly length one.".format(mask_fillna.sum().sum()))

In [None]:
%%time

df['dlyprevcap'] = df['dlyprevcap'].mask(mask, other=df['dlyprevcap'].bfill())

In [None]:
%%time

df['dlyret'] = df['dlyret'].mask(mask_tmp_delist, other=TEMPORARY_DELISTING_RETURN)
df['bcktstflg'] = df['bcktstflg'].mask(mask_tmp_delist, other=FLAG_TEMPORARY_DELISTING)

In [None]:
%%time

df['dlyret'] = df['dlyret'].mask(mask_fillna, other=0.)
df['bcktstflg'] = df['bcktstflg'].mask(mask_fillna, other=FLAG_MISSING_RETURN_IMPUTED)

For an example of a security with several temporary delistings and imputed missing returns, inspect the security with `permno` equal to 12204.

## Flagging very large/small returns

In [None]:
%%time

mask = df['dlyret'].gt(CUTOFF_LARGE_RETURN) | df['dlyret'].lt(CUTOFF_SMALL_RETURN)

In [None]:
print('There are {} very large/small returns, which will be flagged.'.format(mask.sum().sum()))

print('These very large/small returns appear in {} ({:.2f}%) time series.'.format(
            mask.any().sum(), 100 * mask.any().mean()))

In [None]:
%%time

df['bcktstflg'] = df['bcktstflg'].mask(mask, other=df['bcktstflg'].add(10))

## Storing

In [None]:
%%time

with pd.HDFStore(DATAPATH + 'CRSP_daily.h5') as store:
    store['df'] = df

## Some quick summary statistics:

In [None]:
%%time

df['bcktstflg'].stack().value_counts().sort_index()

In [None]:
%%time

df['bcktstflg'].stack().value_counts(normalize=True).sort_index()

In [None]:
del df