## Class02
> *"Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data."* -- Wiki

Data cleaning is an important prerequisite for data analysis. Raw data usually contain inaccurate and irrelevant observations, so an analysis built on uncleaned data is not reliable.

In [1]:
import pandas as pd
import numpy as np

### CRSP
You should consider the following points before you use CRSP:
* Data type
* Stock exchanges
* Share types
* Duplicated observations
* Negative price
* Adjusted price
* Industry

In [2]:
file_path = '/Users/ml/Google Drive/af/teaching/database/data/'
msf_raw = pd.read_csv(file_path+'msf_2010_2017.txt',sep='\t',low_memory=False)

In [3]:
msf_raw[msf_raw.columns[:10]].head()

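|   | PERMNO | date     | SHRCD | EXCHCD | SICCD | NCUSIP   | COMNAM     | PERMCO | HSICCD | CUSIP    |
|---|--------|----------|-------|--------|-------|----------|------------|--------|--------|----------|
| 0 | 10001  | 20100129 | 11.0  | 2.0    | 4925  | 29269V10 | ENERGY INC | 7953   | 4925   | 36720410 |
| 1 | 10001  | 20100226 | 11.0  | 2.0    | 4925  | 29269V10 | ENERGY INC | 7953   | 4925   | 36720410 |
| 2 | 10001  | 20100331 | 11.0  | 2.0    | 4925  | 29269V10 | ENERGY INC | 7953   | 4925   | 36720410 |
| 3 | 10001  | 20100430 | 11.0  | 2.0    | 4925  | 29269V10 | ENERGY INC | 7953   | 4925   | 36720410 |
| 4 | 10001  | 20100528 | 11.0  | 2.0    | 4925  | 29269V10 | ENERGY INC | 7953   | 4925   | 36720410 |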


In [4]:
msf = msf_raw.copy()
msf.columns = msf.columns.str.lower()

#### Data type
It is important to identify the data type of each variable, especially for variables with mixed types. For example, return and price data should be numeric. However, pandas stores a variable as text (dtype `object`) if it contains both numbers and strings, and you cannot perform calculations on a string variable.

Return data in CRSP contain missing-value codes, such as 'A', 'B', 'C', etc., so returns are stored as strings after you import the raw data from CRSP.

In [5]:
msf.dtypes

permno      int64
date        int64
shrcd     float64
exchcd    float64
siccd      object
ncusip     object
comnam     object
permco      int64
hsiccd     object
cusip      object
prc       float64
vol       float64
ret        object
shrout    float64
cfacpr    float64
retx       object
dtype: object

> **ret**, **retx**, **siccd**, and **hsiccd** are stored as strings (`object`), although they should be numeric.

In [6]:
msf[msf['ret'].str.extract('([A-Z])',expand=False).notnull()]['ret'].value_counts()

B    4393
C    4023
Name: ret, dtype: int64

> **ret** contains 'B' and 'C', which are not valid returns.

In [7]:
msf[msf['siccd'].str.extract('([A-Z])',expand=False).notnull()]['siccd'].value_counts()

Z    111
Name: siccd, dtype: int64

> **siccd** contains 'Z', which is not a valid SIC code.

In [8]:
msf_1 = msf.copy()
for i in ['ret','retx','siccd','hsiccd']:
    msf_1[i] = pd.to_numeric(msf_1[i],errors='coerce')

In [9]:
msf_1.dtypes

permno      int64
date        int64
shrcd     float64
exchcd    float64
siccd     float64
ncusip     object
comnam     object
permco      int64
hsiccd    float64
cusip      object
prc       float64
vol       float64
ret       float64
shrout    float64
cfacpr    float64
retx      float64
dtype: object

> After conversion, the four variables are numeric. Non-numeric codes such as 'B', 'C', and 'Z' become `NaN` because of `errors='coerce'`.
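
Before coercing, it can be useful to count how many values will be turned into `NaN`. The sketch below uses a small made-up column (`toy` is hypothetical, not CRSP data) to show what `pd.to_numeric(..., errors='coerce')` does and how to list the values it could not parse.

```python
import pandas as pd

# Toy column mimicking CRSP 'ret': numeric strings mixed with letter codes.
# The values are made up for illustration.
toy = pd.DataFrame({'ret': ['0.0125', 'B', '-0.0340', 'C', '0.0000']})

# errors='coerce' replaces anything that cannot be parsed with NaN.
toy['ret_num'] = pd.to_numeric(toy['ret'], errors='coerce')

# Values that were present as text but are missing after the conversion.
coerced = toy[toy['ret'].notna() & toy['ret_num'].isna()]
n_coerced = len(coerced)

print(toy.dtypes)
print(coerced['ret'].value_counts())
```

Comparing `n_coerced` with the counts of letter codes found earlier is one way to confirm that only the expected codes were lost.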

#### Stock exchanges
Conventionally, we use the three main stock exchanges in the US market, and **exchcd** allows us to identify them.

| Stock exchange | exchcd |
|----------------|--------|
| NYSE           | 1      |
| AMEX           | 2      |
| NASDAQ         | 3      |

In [10]:
msf_2 = msf_1[msf_1['exchcd'].isin([1,2,3])].copy()

#### Share type
CRSP includes different types of securities, for example common stocks and ETFs. Common stocks are the most widely used, and **shrcd** = 10 or 11 helps us keep only common shares.

In [11]:
msf_3 = msf_2[msf_2['shrcd'].isin([10,11])].copy()
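
The exchange and share-type filters can also be written as named boolean masks, which makes it easy to see how many rows each condition removes. This is a minimal sketch on a made-up frame; the codes other than 1, 2, 3 (exchanges) and 10, 11 (share types) are placeholders used only to create rows that get dropped.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook.
toy = pd.DataFrame({
    'permno': [1, 2, 3, 4, 5],
    'exchcd': [1.0, 3.0, 4.0, 2.0, 1.0],
    'shrcd':  [11.0, 10.0, 11.0, 73.0, 12.0],
})

# One mask per condition, so each can be inspected on its own.
on_main_exchange = toy['exchcd'].isin([1, 2, 3])
is_common = toy['shrcd'].isin([10, 11])

kept = toy[on_main_exchange & is_common].copy()

print('rows on NYSE/AMEX/NASDAQ:', int(on_main_exchange.sum()))
print('rows with shrcd 10 or 11:', int(is_common.sum()))
print('rows kept after both filters:', len(kept))
```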

#### Duplicated observations
CRSP contains duplicate observations, so we have to remove them. Here, a duplicate is a repeated (**permno**, **date**) pair, and we keep the last occurrence.

In [12]:
msf_4 = msf_3.drop_duplicates(['permno','date'],keep='last').copy()
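
It is often worth looking at the duplicated rows before dropping them. The sketch below, on a made-up frame, uses `duplicated(..., keep=False)` to flag every row that shares a (`permno`, `date`) pair and then applies the same `drop_duplicates` call as above.

```python
import pandas as pd

# Hypothetical data: the first two rows share the same permno and date.
toy = pd.DataFrame({
    'permno': [10001, 10001, 10001, 10002],
    'date':   [20100129, 20100129, 20100226, 20100129],
    'prc':    [10.0, 10.5, 11.0, 20.0],
})

# keep=False marks all members of a duplicated group, not just the extras.
dup_mask = toy.duplicated(['permno', 'date'], keep=False)
n_dup_rows = int(dup_mask.sum())
print(toy[dup_mask])

# keep='last' retains the last row of each duplicated group.
deduped = toy.drop_duplicates(['permno', 'date'], keep='last')
print(deduped)
```

Note that `keep='last'` depends on row order, so the result can change if the data are sorted differently beforehand.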

#### Negative price
CRSP uses the average of the bid and ask prices in place of the price if there is no valid closing price for a given date. To distinguish these cases, CRSP puts a negative sign (-) in front of the bid/ask average. It does not mean the price is negative, so we need to convert these values to positive.

In [13]:
msf_4['price'] = msf_4['prc'].abs()
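
Taking the absolute value discards the information carried by the sign. If you may later want to separate closing prices from bid/ask averages, one option is to store a flag first. The column name `is_bidask_avg` is an assumption for illustration, and the numbers are made up.

```python
import pandas as pd

# Hypothetical prices: a closing price, a bid/ask average, and a missing value.
toy = pd.DataFrame({'prc': [25.10, -12.375, float('nan')]})

# A negative sign marks a bid/ask average; NaN compares as False.
toy['is_bidask_avg'] = toy['prc'] < 0

# Same conversion as in the notebook.
toy['price'] = toy['prc'].abs()

print(toy)
```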

#### Adjusted price
CRSP returns (**ret**) are already adjusted for dividends and stock splits, and they also account for the effect of reinvestment. However, the price in CRSP is not adjusted for those corporate events. To adjust the price, we can use the formula below:
$$p_{adj} = \frac{price}{cfacpr}$$

In [14]:
msf_4['p_adj'] = msf_4['price'] / msf_4['cfacpr']
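
To see why the adjustment matters, consider a made-up 2-for-1 split between the second and third month. The sketch below assumes `cfacpr` is 2 before the split and 1 after it; all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical stock with a 2-for-1 split before the last observation.
toy = pd.DataFrame({
    'date':   [20100129, 20100226, 20100331],
    'price':  [100.0, 102.0, 51.5],
    'cfacpr': [2.0, 2.0, 1.0],
})

# Same formula as above: p_adj = price / cfacpr.
toy['p_adj'] = toy['price'] / toy['cfacpr']

# Month-to-month changes in the raw and the adjusted price.
toy['raw_change'] = toy['price'].pct_change()
toy['adj_change'] = toy['p_adj'].pct_change()

print(toy)
```

The raw price appears to fall by about half in the split month, while the adjusted price changes by roughly 1%.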

#### Time periods

In [15]:
msf_p1 = msf_4[msf_4['date']<=20131231].copy()
msf_p2 = msf_4[msf_4['date']>20131231].copy()
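
Comparing integer dates works because `YYYYMMDD` integers sort in calendar order. If you prefer real dates, for example to extract the year or month later, the integers can be parsed with `pd.to_datetime`. A minimal sketch on made-up dates, with `date_dt` as an assumed column name:

```python
import pandas as pd

# Hypothetical YYYYMMDD integer dates, as stored in the notebook's 'date' column.
toy = pd.DataFrame({'date': [20131129, 20131231, 20140131]})

# Convert to strings first so the explicit format can be applied.
toy['date_dt'] = pd.to_datetime(toy['date'].astype(str), format='%Y%m%d')

# Same split as above, expressed with a timestamp cutoff.
cutoff = pd.Timestamp('2013-12-31')
p1 = toy[toy['date_dt'] <= cutoff]
p2 = toy[toy['date_dt'] > cutoff]

print(len(p1), len(p2))
```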

#### Industry
For industry-related research, we need to know the industry classification. **siccd** indicates the industry group.

In [16]:
msf_5 = msf_4[(msf_4['siccd']<6000)|(msf_4['siccd']>6999)].copy()

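> The code above excludes financial firms (SIC codes for financial firms are between 6000 and 6999). Rows with a missing **siccd** are also dropped, because comparisons with `NaN` evaluate to `False`.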
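
Whether firms with a missing SIC code should stay in the sample is a research decision. The sketch below, on made-up data, contrasts the comparison-based filter with an alternative based on `Series.between` that keeps rows whose **siccd** is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical firms: utility, bank, missing SIC code, software.
toy = pd.DataFrame({
    'permno': [1, 2, 3, 4],
    'siccd':  [4925.0, 6020.0, np.nan, 7370.0],
})

# Comparison-based filter: NaN fails both comparisons, so permno 3 is dropped.
drop_missing = toy[(toy['siccd'] < 6000) | (toy['siccd'] > 6999)]

# between() is inclusive on both ends and returns False for NaN,
# so negating it keeps rows with a missing SIC code.
is_financial = toy['siccd'].between(6000, 6999)
keep_missing = toy[~is_financial]

print(drop_missing['permno'].tolist())
print(keep_missing['permno'].tolist())
```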

### Summary statistics
Descriptive statistics can help us to check the distribution of variables and any extreme values.

In [17]:
msf_5['ret'].describe()

count    293470.000000
mean          0.011176
std           0.162939
min          -0.961852
25%          -0.059090
50%           0.005038
75%           0.069364
max          15.984456
Name: ret, dtype: float64
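
A maximum monthly return of roughly 1,600% is worth a closer look. As a sketch on made-up returns, `describe` accepts extra percentiles, and `idxmax` locates the row with the largest value. The cutoff of 1.0 (a 100% monthly return) is an arbitrary threshold chosen only for illustration.

```python
import pandas as pd

# Hypothetical monthly returns with one extreme value.
toy_ret = pd.Series([-0.50, -0.02, 0.00, 0.01, 0.03, 15.98], name='ret')

# Add the 1st and 99th percentiles to the default summary.
stats = toy_ret.describe(percentiles=[0.01, 0.5, 0.99])
print(stats)

# Index label of the largest return, and all returns above the cutoff.
max_label = toy_ret.idxmax()
extreme = toy_ret[toy_ret.abs() > 1.0]
print(max_label, extreme.tolist())
```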

### Convert 8-digit CUSIP to 6-digit CUSIP
8-digit CUSIP: stock (issue) level

6-digit CUSIP: firm level

In [18]:
msf_5['cusip6'] = msf_5['cusip'].str[:6]
msf_5[['cusip','date','cusip6','ret']].head()

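|   | cusip    | date     | cusip6 | ret       |
|---|----------|----------|--------|-----------|
| 0 | 36720410 | 20100129 | 367204 | -0.018932 |
| 1 | 36720410 | 20100226 | 367204 | -0.000656 |
| 2 | 36720410 | 20100331 | 367204 | 0.020643  |
| 3 | 36720410 | 20100430 | 367204 | 0.124385  |
| 4 | 36720410 | 20100528 | 367204 | 0.004829  |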
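
Slicing with `.str[:6]` only works if **cusip** is stored as text. If a file happens to contain only all-digit CUSIPs, pandas will read the column as integers and silently drop leading zeros. The sketch below uses a small in-memory file with made-up values to show the difference, and passes `dtype={'cusip': str}` to `read_csv` as one way to guard against it.

```python
import io
import pandas as pd

# Hypothetical tab-separated file; '00123410' is a made-up CUSIP with leading zeros.
raw = "cusip\tret\n00123410\t0.01\n36720410\t0.02\n"

# Default parsing: the all-digit column becomes integers and loses the zeros.
as_default = pd.read_csv(io.StringIO(raw), sep='\t')

# Forcing text keeps all eight characters.
as_text = pd.read_csv(io.StringIO(raw), sep='\t', dtype={'cusip': str})
as_text['cusip6'] = as_text['cusip'].str[:6]

print(as_default['cusip'].tolist())
print(as_text[['cusip', 'cusip6']])
```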


### Annual return (December to December)

In [19]:
annual = msf_5[['permno','date','p_adj']].copy()
# Split the YYYYMMDD date into year-month (YYYYMM) and year (YYYY)
annual['yrm'] = annual['date'] // 100
annual['year'] = annual['yrm'] // 100
# Keep December observations only
annual = annual[annual['yrm']%100==12]
annual = annual.sort_values(['permno','year']).reset_index(drop=True)
# Return relative to the previous December's adjusted price
annual['p_adj_l1'] = annual.groupby('permno')['p_adj'].shift(1)
annual['ret'] = annual['p_adj'] / annual['p_adj_l1'] - 1
# Set the return to missing unless the previous observation is exactly one year earlier
annual['year_l1'] = annual.groupby('permno')['year'].shift(1)
annual['yr_diff'] = annual['year'] - annual['year_l1']
annual['ret'] = np.where(annual['yr_diff']!=1,np.nan,annual['ret'])
annual.head()

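|   | permno | date     | p_adj | yrm    | year | p_adj_l1 | ret       | year_l1 | yr_diff |
|---|--------|----------|-------|--------|------|----------|-----------|---------|---------|
| 0 | 10001  | 20101231 | 10.52 | 201012 | 2010 | NaN      | NaN       | NaN     | NaN     |
| 1 | 10001  | 20111230 | 11.42 | 201112 | 2011 | 10.52    | 0.085551  | 2010.0  | 1.0     |
| 2 | 10001  | 20121231 | 9.33  | 201212 | 2012 | 11.42    | -0.183012 | 2011.0  | 1.0     |
| 3 | 10001  | 20131231 | 8.03  | 201312 | 2013 | 9.33     | -0.139335 | 2012.0  | 1.0     |
| 4 | 10001  | 20141231 | 11.02 | 201412 | 2014 | 8.03     | 0.372354  | 2013.0  | 1.0     |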
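
The calculation above is based on adjusted prices. A different approach, shown here only as a sketch and not as the notebook's method, is to compound the monthly **ret** values within each calendar year. The data are made up and cover only two months per year to keep the arithmetic easy to follow; `ret_annual` and `n_months` are assumed names.

```python
import pandas as pd

# Hypothetical monthly returns for one stock (two months per year for brevity).
toy = pd.DataFrame({
    'permno': [1, 1, 1, 1],
    'date':   [20101130, 20101231, 20110131, 20110228],
    'ret':    [0.10, -0.05, 0.02, 0.03],
})

toy['year'] = toy['date'] // 10000
toy['gross'] = 1 + toy['ret']

# Compound gross returns within each permno-year: prod(1 + ret) - 1.
grouped = toy.groupby(['permno', 'year'])
annual_ret = (grouped['gross'].prod() - 1).rename('ret_annual')

# Count non-missing months, since prod() skips NaN values silently.
n_months = grouped['ret'].count().rename('n_months')

out = pd.concat([annual_ret, n_months], axis=1).reset_index()
print(out)
```

With real data, `n_months` can be used to keep only stock-years that have all 12 monthly returns.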
