## Read data
The data used in this tutorial come from CRSP via WRDS. The sample period runs from 1980 to 2016, and the data file is named 'crsp_monthly.txt'.
I store the data as tab-delimited text because:
- CUSIPs can lose their leading zeros in other formats (e.g. Excel or CSV)
- The text file is smaller than the same data in other formats
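
Leading zeros can also be lost at read time, because pandas infers a numeric type for columns that look like numbers. Here is a minimal sketch of the effect, assuming a made-up two-row sample (the first CUSIP is invented for illustration) and an explicit `dtype` for the `cusip` column, which the tutorial code itself does not set:

```python
import io
import pandas as pd

# Hypothetical two-row sample in tab-delimited form; values are for illustration only.
sample = "permno\tcusip\n10001\t00103110\n10002\t68391610\n"

# Default parsing infers an integer column, so the leading zeros disappear.
as_numbers = pd.read_csv(io.StringIO(sample), sep="\t")

# Forcing a string dtype keeps every CUSIP exactly as it is stored in the file.
as_text = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"cusip": str})

print(as_numbers["cusip"].tolist())
print(as_text["cusip"].tolist())
```

If CUSIPs are later used as merge keys, passing `dtype={'cusip': str, 'ncusip': str}` to `pd.read_csv` is one way to keep them intact.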

### Import required libraries

In [1]:
import pandas as pd

### Read CRSP monthly data

In [2]:
data_path = '/users/ml/git/'    # change to your data folder
crsp_monthly_raw = pd.read_csv(data_path + 'crsp_monthly.txt', sep='\t', engine='python')
print('number of observations from raw CRSP data: %s' % len(crsp_monthly_raw))

number of observations from raw CRSP data: 3234134


### Keep stocks in NYSE, AMEX and NASDAQ

In [3]:
crsp = crsp_monthly_raw.copy()
exchanges = crsp['exchcd'].isin([1, 2, 3])    # 1 = NYSE, 2 = AMEX, 3 = NASDAQ
crsp = crsp[exchanges]
print('number of observations (NYSE, AMEX, NASDAQ): %s' % len(crsp))

number of observations (NYSE, AMEX, NASDAQ): 3048750


### Keep common stocks

In [4]:
share_types = crsp['shrcd'].isin([10, 11])    # share codes 10 and 11 = common stocks
crsp = crsp[share_types]
print('number of observations (common stocks): %s' % len(crsp))

number of observations (common stocks): 2408278


### Remove duplicates

In [5]:
crsp = crsp.drop_duplicates(subset=['permno', 'date'], keep='last')
print('number of observations (delete duplicates): %s' % len(crsp))

number of observations (delete duplicates): 2399080
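
To see what `keep='last'` does before applying it to the full sample, it can help to look at the duplicated rows first. This is a minimal sketch on a hypothetical toy panel (the values are made up), not the CRSP data itself:

```python
import pandas as pd

# Hypothetical toy panel: permno 10000 has two rows for the same month.
toy = pd.DataFrame({
    "permno": [10000, 10000, 10001],
    "date": [19860131, 19860131, 19860131],
    "prc": [4.375, 4.5, 6.125],
})

# keep=False flags every row whose (permno, date) key appears more than once.
dupes = toy[toy.duplicated(subset=["permno", "date"], keep=False)]
print(dupes)

# keep='last' retains the final occurrence of each key, in the current row order.
deduped = toy.drop_duplicates(subset=["permno", "date"], keep="last")
print(deduped)
```

Because "last" refers to the current row order, the result depends on how the rows are ordered when the duplicates are dropped.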


### Sort data

In [6]:
crsp = crsp.sort_values(['permno', 'date']).reset_index(drop=True)
crsp.head(10)

Unnamed: 0,permno,cusip,date,ret,prc,shrout,exchcd,shrcd,vol,bid,ask,vwretd,siccd,ncusip,cfacpr,cfacshr,dlret,dlstcd,dlpdt
0,10000,68391610,19860131,C,-4.375,3680.0,3.0,10.0,1771.0,,,0.009829,3990,68391610,1.0,1.0,,,
1,10000,68391610,19860228,-0.257143,-3.25,3680.0,3.0,10.0,828.0,,,0.0725,3990,68391610,1.0,1.0,,,
2,10000,68391610,19860331,0.365385,-4.4375,3680.0,3.0,10.0,1078.0,,,0.053885,3990,68391610,1.0,1.0,,,
3,10000,68391610,19860430,-0.098592,-4.0,3793.0,3.0,10.0,957.0,,,-0.007903,3990,68391610,1.0,1.0,,,
4,10000,68391610,19860530,-0.222656,-3.10938,3793.0,3.0,10.0,1074.0,,,0.050844,3990,68391610,1.0,1.0,,,
5,10000,68391610,19860630,-0.005025,-3.09375,3793.0,3.0,10.0,1069.0,,,0.014246,3990,68391610,1.0,1.0,,,
6,10000,68391610,19860731,-0.080808,-2.84375,3793.0,3.0,10.0,1163.0,,,-0.0597,3990,68391610,1.0,1.0,,,
7,10000,68391610,19860829,-0.615385,-1.09375,3793.0,3.0,10.0,3049.0,,,0.066181,3990,68391610,1.0,1.0,,,
8,10000,68391610,19860930,-0.057143,-1.03125,3793.0,3.0,10.0,3551.0,,,-0.079021,3990,68391610,1.0,1.0,,,
9,10000,68391610,19861031,-0.242424,-0.78125,3843.0,3.0,10.0,1903.0,,,0.049303,3990,68391610,1.0,1.0,,,
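
In the preview above, `ret` holds the letter code `C` in the first row, `prc` is negative, and `date` is an integer in YYYYMMDD form. CRSP reports a negative price when the value is a bid/ask average and not a closing trade, and letter codes in `ret` stand for returns that are not available. The tutorial does not clean these fields at this stage. The sketch below shows one possible treatment on the first three preview rows; the new column names `ret_num`, `prc_abs` and `date_dt` are my own choices for illustration.

```python
import pandas as pd

# First three rows of the preview, restricted to a few columns.
preview = pd.DataFrame({
    "permno": [10000, 10000, 10000],
    "date": [19860131, 19860228, 19860331],
    "ret": ["C", "-0.257143", "0.365385"],
    "prc": [-4.375, -3.25, -4.4375],
})

# Letter codes such as 'C' cannot be parsed as numbers, so they become NaN.
preview["ret_num"] = pd.to_numeric(preview["ret"], errors="coerce")

# The sign of prc is only a flag; the magnitude is the price.
preview["prc_abs"] = preview["prc"].abs()

# Parse YYYYMMDD integers into timestamps.
preview["date_dt"] = pd.to_datetime(preview["date"].astype(str), format="%Y%m%d")

print(preview[["permno", "date_dt", "ret_num", "prc_abs"]])
```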


### Save clean data for future use

In [7]:
crsp.to_csv(data_path + 'crsp_monthly_clean.txt', sep='\t', index=False)
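
The `index=False` argument matters when the file is read back: without it, pandas writes the row index as an unlabeled first column, which reappears as `Unnamed: 0` on reload. Here is a minimal sketch using an in-memory buffer and a hypothetical toy frame; the helper `round_trip_columns` is made up for this illustration:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the clean CRSP data.
toy = pd.DataFrame({"permno": [10000, 10001], "prc": [4.375, 6.125]})

def round_trip_columns(frame, **to_csv_kwargs):
    """Write frame to an in-memory tab-delimited buffer and return the columns read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, sep="\t", **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer, sep="\t").columns.tolist()

# Default behaviour (index=True) adds an extra column on reload.
cols_default = round_trip_columns(toy)

# index=False writes only the data columns.
cols_no_index = round_trip_columns(toy, index=False)

print(cols_default)
print(cols_no_index)
```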