## Basic statistics
In the previous tutorial, we imported the data and did some data cleaning.
This tutorial uses the cleaned data to compute summary statistics of selected variables and to present cross-sectional statistics by date.

### Import required libraries

In [1]:
import pandas as pd

pd.set_option('display.width', 180)

### Read data

In [2]:
data_path = '/users/ml/git/'
crsp_monthly = pd.read_csv(data_path + 'crsp_monthly_clean.txt', sep='\t', engine='python')

### Check data type of each variable
<p>If a column contains mixed types (e.g. both numeric and string values), pandas reads it as object. We need to set the data type of such variables explicitly.</p>
<p>For example, returns must be numeric, otherwise we cannot do any calculation with them. However, CRSP return data contains missing codes, i.e. letters (e.g. 'A', 'B' and 'S') instead of numeric values, which indicate why the return is missing. Therefore, we need to convert these missing codes to the numeric missing value, NaN.</p>
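
The conversion described above can be sketched on a toy column. The values below are made up for illustration and do not come from the CRSP file; the letters simply mirror the missing codes mentioned in the text.

```python
import pandas as pd

# Toy return column mixing numeric strings with letter codes (illustrative values)
raw_ret = pd.Series(['0.05', '-0.02', 'A', 'B', '0.10'])

# errors='coerce' turns every entry that cannot be parsed as a number into NaN
ret = pd.to_numeric(raw_ret, errors='coerce')

n_missing = int(ret.isna().sum())  # how many codes became NaN
mean_ret = ret.mean()              # NaN is skipped by default
```

After the conversion the column is `float64`, so statistics such as the mean are computed over the valid returns only.
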
<p>Another example is the date: after import it is stored as an integer, not in date format. This causes problems when you compute the difference between dates. For example, 20100101 is one day after 20091231, but without converting to date format the subtraction returns 20100101 - 20091231 = 8870.</p>
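
A small sketch of the date problem, using the two dates from the example above. The variable names are only for illustration.

```python
import pandas as pd

dates_int = pd.Series([20091231, 20100101])

# Integer subtraction treats the dates as plain numbers
int_diff = int(dates_int.iloc[1] - dates_int.iloc[0])

# Convert to datetime first, then subtract to get a Timedelta
dates = pd.to_datetime(dates_int, format='%Y%m%d')
day_diff = (dates.iloc[1] - dates.iloc[0]).days
```

Here `int_diff` is 8870, while `day_diff` is the intended 1 day.
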

In [3]:
crsp_monthly.dtypes

permno       int64
cusip       object
date         int64
ret         object
prc        float64
shrout     float64
exchcd     float64
shrcd      float64
vol        float64
bid        float64
ask        float64
vwretd     float64
siccd       object
ncusip      object
cfacpr     float64
cfacshr    float64
dlret       object
dlstcd     float64
dlpdt      float64
dtype: object

### Convert returns and SIC code to numeric variables

In [4]:
# Entries that cannot be parsed as numbers (e.g. CRSP missing codes) become NaN
for col in ['ret', 'siccd', 'dlret']:
    crsp_monthly[col] = pd.to_numeric(crsp_monthly[col], errors='coerce')

### Convert date to date format

In [5]:
crsp_monthly['date'] = pd.to_datetime(crsp_monthly['date'], format='%Y%m%d')
# Year-month key in YYYYMM form, e.g. 198001
crsp_monthly['yr_mo'] = crsp_monthly['date'].dt.year * 100 + crsp_monthly['date'].dt.month

### Compute summary statistics

#### Pooled statistics

In [6]:
stats = crsp_monthly[['ret', 'vol', 'vwretd']].describe().T
for i in stats.columns:
    stats[i] = stats[i].apply(lambda x: format(x, '.3f'))

stats

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ret | 2359429.0 | 0.012 | 0.19 | -0.981 | -0.069 | 0.0 | 0.073 | 24.0 |
| vol | 2301218.0 | 88576.943 | 677522.224 | 0.0 | 1278.0 | 6584.0 | 36118.0 | 201242689.0 |
| vwretd | 2399080.0 | 0.01 | 0.045 | -0.225 | -0.017 | 0.014 | 0.04 | 0.128 |


#### Cross-sectional statistics
- For each month, we compute summary statistics across stocks.
- Then we compute the time-series average of the cross-sectional statistics.
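
The two steps above can be sketched on a toy panel. The data below is made up, and the sketch uses `agg` with an explicit list of statistics rather than `describe()`, which is a simplification for illustration.

```python
import pandas as pd

# Toy panel: two months, three stocks each (made-up returns)
panel = pd.DataFrame({
    'yr_mo': [198001, 198001, 198001, 198002, 198002, 198002],
    'ret':   [0.10, 0.20, 0.30, -0.10, 0.00, 0.10],
})

# Step 1: one row of summary statistics per month (cross section)
cs = panel.groupby('yr_mo')['ret'].agg(['count', 'mean', 'std', 'min', 'median', 'max'])

# Step 2: average each statistic over the months (time series)
ts_avg = cs.mean()
```

`cs` has one row per month and `ts_avg` has one value per statistic, which is the layout of the tables in this section.
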

##### First, we take returns as an example.

In [7]:
# One row of cross-sectional statistics per month. Recent pandas versions
# already return this layout, so no .unstack() is needed.
stats_cs_ret = crsp_monthly.groupby('yr_mo')['ret'].describe()
stats_cs_ret.head()

| yr_mo | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 198001 | 4482.0 | 0.104479 | 0.208856 | -0.445783 | 0.0 | 0.056639 | 0.160536 | 3.333333 |
| 198002 | 4504.0 | -0.006683 | 0.143766 | -0.435484 | -0.085312 | -0.026953 | 0.035595 | 2.142857 |
| 198003 | 4505.0 | -0.165808 | 0.11379 | -0.695652 | -0.239766 | -0.162162 | -0.089796 | 0.517241 |
| 198004 | 4512.0 | 0.050782 | 0.129045 | -0.625 | -0.015625 | 0.040541 | 0.117117 | 1.111111 |
| 198005 | 4502.0 | 0.07065 | 0.126001 | -0.785714 | 0.0 | 0.059594 | 0.125 | 1.307692 |


In [8]:
print('number of months: %s' % len(stats_cs_ret))

number of months: 444


In [9]:
stats_cs_ret = pd.DataFrame({'ret': stats_cs_ret.mean()}).T
stats_cs_ret

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ret | 5314.029279 | 0.011879 | 0.170991 | -0.772804 | -0.06528 | 0.001107 | 0.071288 | 3.330868 |


##### Add more variables to present results

In [10]:
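summaries = []
for col in ['ret', 'vol', 'vwretd']:
    # Cross-sectional statistics per month (one row per month in recent pandas)
    summary = crsp_monthly.groupby('yr_mo')[col].describe()
    # Time-series average of each statistic, stored as a single row
    summaries.append(pd.DataFrame({col: summary.mean()}).T)
stats_cs = pd.concat(summaries)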

for i in stats_cs.columns:
    stats_cs[i] = stats_cs[i].apply(lambda x: format(x, '.3f'))

stats_cs

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ret | 5314.029 | 0.012 | 0.171 | -0.773 | -0.065 | 0.001 | 0.071 | 3.331 |
| vol | 5182.923 | 101608.47 | 446870.141 | 1.255 | 3439.077 | 17548.838 | 66896.845 | 16666372.5 |
| vwretd | 5403.333 | 0.01 | 0.0 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
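
In the table above, the vwretd row has a standard deviation of 0 and identical min, mean and max. This is presumably because vwretd is a market-level return that takes the same value for every stock in a given month, so it has no cross-sectional variation. The sketch below illustrates this on a made-up panel and shows one possible alternative, which is to keep one market observation per month and summarize it over time. The data and the names `panel` and `market` are assumptions for illustration only.

```python
import pandas as pd

# Toy panel: vwretd repeats the same value for all stocks within a month
panel = pd.DataFrame({
    'yr_mo':  [198001, 198001, 198001, 198002, 198002, 198002],
    'ret':    [0.10, 0.20, 0.30, -0.10, 0.00, 0.10],
    'vwretd': [0.05, 0.05, 0.05, -0.01, -0.01, -0.01],
})

# Cross-sectional standard deviation per month: zero (up to floating-point
# error) for vwretd, positive for ret
cs_std = panel.groupby('yr_mo')[['ret', 'vwretd']].std()

# Keep one market observation per month and summarize it over time instead
market = panel.drop_duplicates('yr_mo').set_index('yr_mo')['vwretd']
market_ts_std = market.std()
```

For a market-level variable, the time-series statistics of `market` are usually the more informative summary.
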
