# Stata for research: an introduction to Stata
This session introduces Stata and covers:
 - Reading data into Stata
 - Generating and replacing variables
 - Summary statistics
 - If conditions
 - Working with groups
 - Loops
 - Macros
 - Ranking variables into groups
 - t-tests

In [1]:
import pandas as pd
import ipystata

IPyStata is loaded in batch mode.


### Change your working directory

In [2]:
cd "/Users/ml/LUBS/PhD data management/data"

/Users/ml/LUBS/PhD data management/data


### Check your working directory
<font color='blue'>pwd</font> shows your current working directory, i.e. the folder where Stata looks for data and saves output by default.
You can also type <font color='blue'>pwd</font> in the Command window.

In [3]:
pwd

'/Users/ml/LUBS/PhD data management/data'

### Set up
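<font color='blue'>cd</font>: change the working directory to the folder you are about to use

<font color='blue'>capture log close</font>: close any open log file (<font color='blue'>capture</font> suppresses the error if no log is open)

<font color='blue'>log using</font>: start a new log file; the <font color='blue'>replace</font> option overwrites an existing file of the same name

<font color='blue'>clear all</font>: remove the data and other stored objects from memory

<font color='blue'>set more off</font>: stop Stata from pausing when it displays long output

<font color='blue'>cls</font>: clear the Results window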

In [4]:
%%stata

cd "/Users/ml/LUBS/PhD data management/data"
capture log close
log using class06.log, replace
clear all
set more off
cls


/Users/ml/LUBS/PhD data management/data


### Read CRSP monthly file
The data is stored in .dta format in the working directory, so you can read it with <font color='blue'>use</font>.

In [5]:
%%stata -o crsp_month_raw

use crsp_month_raw
// use "/Users/ml/LUBS/PhD data management/data/crsp_month_raw"





### Check variables in the data
<font color='blue'>describe</font> or <font color='blue'>des</font> will show a list of variables with their data type. 

In [6]:
%%stata -d crsp_month_raw

des


Contains data from /Users/ml/.ipython/stata/data_input.dta
  obs:       590,309                          
 vars:            13                          07 Nov 2017 12:50
 size:    47,815,029                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
index           long    %12.0g                
permno          long    %12.0g                
date            long    %12.0g                
shrcd           double  %10.0g                
exchcd          double  %10.0g                
ncusip          str8    %8s                

To view the data in Stata, use <font color='blue'>browse</font> or <font color='blue'>br</font>, or click the Data Browser icon in the toolbar.

In [7]:
crsp_month_raw.head()

Unnamed: 0,permno,date,shrcd,exchcd,ncusip,permco,cusip,prc,vol,ret,shrout,cfacpr
0,10001,20100129,11.0,2.0,29269V10,7953,36720410,10.06,3104.0,-0.018932,4361.0,1.0
1,10001,20100226,11.0,2.0,29269V10,7953,36720410,10.0084,1510.0,-0.000656,4361.0,1.0
2,10001,20100331,11.0,2.0,29269V10,7953,36720410,10.17,2283.0,0.020643,4361.0,1.0
3,10001,20100430,11.0,2.0,29269V10,7953,36720410,11.39,3350.0,0.124385,6070.0,1.0
4,10001,20100528,11.0,2.0,29269V10,7953,36720410,11.4,3451.0,0.004829,6071.0,1.0


### This is a monthly file, so you can generate a date variable in 'yyyymm' format
Use <font color='blue'>generate</font> or <font color='blue'>gen</font> to create a new variable 

In [8]:
%%stata -d crsp_month_raw -o crsp_month_raw

gen yrm = int(date/100)





In [9]:
crsp_month_raw.head()

Unnamed: 0,permno,date,shrcd,exchcd,ncusip,permco,cusip,prc,vol,ret,shrout,cfacpr,yrm
0,10001,20100129,11.0,2.0,29269V10,7953,36720410,10.06,3104.0,-0.018932,4361.0,1.0,201001.0
1,10001,20100226,11.0,2.0,29269V10,7953,36720410,10.0084,1510.0,-0.000656,4361.0,1.0,201002.0
2,10001,20100331,11.0,2.0,29269V10,7953,36720410,10.17,2283.0,0.020643,4361.0,1.0,201003.0
3,10001,20100430,11.0,2.0,29269V10,7953,36720410,11.39,3350.0,0.124385,6070.0,1.0,201004.0
4,10001,20100528,11.0,2.0,29269V10,7953,36720410,11.4,3451.0,0.004829,6071.0,1.0,201005.0
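
The arithmetic behind <font color='blue'>int(date/100)</font> can be checked on the Python side of the notebook. This is a minimal sketch, assuming **date** is a positive integer in yyyymmdd form as in the output above; the helper names `to_yrm` and `split_yrm` are made up for illustration.

```python
def to_yrm(date):
    """Drop the day digits of a yyyymmdd integer, leaving yyyymm."""
    # Integer division matches int(date/100) for positive dates.
    return date // 100


def split_yrm(yrm):
    """Split a yyyymm integer into (year, month)."""
    return yrm // 100, yrm % 100


print(to_yrm(20100129))   # 201001
print(split_yrm(201001))  # (2010, 1)
```

The same split into year and month is what the later cell does with <font color='blue'>int(yrm/100)</font> and <font color='blue'>mod(yrm,100)</font>.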


### Convert string to numeric
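After importing the CRSP data, **ret** is stored as a string and has to be converted to numeric. The <font color='blue'>real()</font> function does this; entries that cannot be read as a number become missing, which is why Stata reports missing values generated in the output below.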

In [10]:
%%stata -d crsp_month_raw -o crsp_month_raw

gen ret_temp = real(ret)
drop ret
rename ret_temp ret
des


(14,875 missing values generated)

Contains data from /Users/ml/.ipython/stata/data_input.dta
  obs:       590,309                          
 vars:            14                          07 Nov 2017 12:50
 size:    47,224,720                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
index           long    %12.0g                
permno          long    %12.0g                
date            long    %12.0g                
shrcd           double  %10.0g                
exchcd          double  %10.0g                
ncusip  
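
The behaviour of <font color='blue'>real()</font> on non-numeric text can be mimicked in plain Python. This is only a sketch of the idea: the helper `to_number` and the sample strings (including the letter code "C") are assumptions for illustration, not values taken from the file.

```python
def to_number(text):
    """Return the numeric value of text, or None when it is not a number.

    None plays the role of Stata's missing value here.
    """
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


raw = ["-0.018932", "0.124385", "C", ""]  # "C" and "" are made-up non-numeric entries
print([to_number(x) for x in raw])
```

Every entry that fails to convert ends up missing, which is what the "missing values generated" message counts.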

### Select stocks from NYSE/NASDAQ/AMEX
**exchcd** indicates the stock exchange:
- 1 - NYSE
- 2 - AMEX
- 3 - NASDAQ

Stata function: <font color='blue'>inlist()</font>, which is true when the first argument equals any of the remaining arguments

In [11]:
%%stata -d crsp_month_raw -o crsp_month_raw

keep if inlist(exchcd, 1,2,3)
// equivalent to keep if exchcd==1 | exchcd==2 | exchcd==3
count


(108,567 observations deleted)
. count
  481,742


### Keep common stocks
**shrcd** indicates share type. 10 and 11 indicate common shares.

In [12]:
%%stata -d crsp_month_raw -o crsp_month_raw

keep if inlist(shrcd,10,11)
// equivalent to keep if shrcd==10 | shrcd==11
count


(161,155 observations deleted)
. count
  320,587


### Remove duplicates

In [13]:
%%stata -d crsp_month_raw -o crsp_month_raw

duplicates drop permno yrm, force
count


Duplicates in terms of permno yrm

(782 observations deleted)
  319,805


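### Compute market value (in thousands of dollars)
$$\text{Market value} = \big|prc\big| \times shrout$$
where **prc** is the stock price and **shrout** is the number of shares outstanding in thousands, so the product is in thousands of dollars. Divide by 1,000 to express it in millions.

A negative price indicates that there was no valid closing price on that date; CRSP then reports the average of the bid and ask prices with a negative sign. Therefore, you have to use the absolute value of the price to make sure all prices are non-negative.

To guard against data errors, we keep only observations with positive market value. Note that Stata treats a missing value as larger than any number, so <font color='blue'>keep if mv > 0</font> does not drop observations with missing **mv**.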

In [14]:
%%stata -d crsp_month_raw -o crsp_month_raw

gen mv = abs(prc) * shrout
keep if mv > 0


(1,871 missing values generated)
(0 observations deleted)
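
A worked example of the market value calculation, written in plain Python. It assumes **shrout** is in thousands of shares; the negative price is a hypothetical bid/ask average, and `market_value` is an illustrative helper rather than anything from the notebook.

```python
import math


def market_value(prc, shrout):
    """Return |prc| * shrout, or None when either input is missing.

    With shrout in thousands of shares the result is in thousands of dollars.
    """
    if prc is None or shrout is None:
        return None
    return abs(prc) * shrout


mv = market_value(-10.06, 4361)  # hypothetical negative (bid/ask average) price
print(round(mv, 2), round(mv / 1000, 2))  # thousands, then millions of dollars
```

Dividing by 1,000 is the only step needed to move from thousands to millions of dollars.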


### Adjusted price
$$\text{adjusted price} = \frac{\big|prc\big|}{cfacpr}$$
where **prc** is the stock price and **cfacpr** is the cumulative factor to adjust prices.

In [15]:
%%stata -d crsp_month_raw -o crsp_month_raw

gen p_adj = abs(prc) / cfacpr


(1,871 missing values generated)
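
To see what the adjustment does, consider a hypothetical 2-for-1 split. The sketch below assumes the factor is 2 before the split and 1 after it; the prices and the helper name `adjusted_price` are invented for illustration.

```python
def adjusted_price(prc, cfacpr):
    """Return |prc| / cfacpr, or None when an input is missing or the factor is zero."""
    if prc is None or not cfacpr:
        return None
    return abs(prc) / cfacpr


# Hypothetical 2-for-1 split: price 100 with factor 2 before, price 50 with factor 1 after.
before = adjusted_price(100.0, 2.0)
after = adjusted_price(50.0, 1.0)
print(before, after)  # 50.0 50.0
```

Both observations end up on the same scale, so the adjusted series can be compared across the split.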


In [16]:
crsp_month_raw.head()

Unnamed: 0,permno,date,shrcd,exchcd,ncusip,permco,cusip,prc,vol,shrout,cfacpr,yrm,ret,mv,p_adj
0,10001,20100129,11.0,2.0,29269V10,7953,36720410,10.06,3104.0,4361.0,1.0,201001.0,-0.018932,43871.660156,10.06
1,10001,20100226,11.0,2.0,29269V10,7953,36720410,10.0084,1510.0,4361.0,1.0,201002.0,-0.000656,43646.632812,10.0084
2,10001,20100331,11.0,2.0,29269V10,7953,36720410,10.17,2283.0,4361.0,1.0,201003.0,0.020643,44351.371094,10.17
3,10001,20100430,11.0,2.0,29269V10,7953,36720410,11.39,3350.0,6070.0,1.0,201004.0,0.124385,69137.304688,11.39
4,10001,20100528,11.0,2.0,29269V10,7953,36720410,11.4,3451.0,6071.0,1.0,201005.0,0.004829,69209.398438,11.4


### Save the data
If you have a big file, keep only the variables you will use later to reduce the size of the data.

In [17]:
%%stata -d crsp_month_raw -o crsp_month_raw

drop shrcd prc shrout cfacpr
save crspm, replace


file crspm.dta saved


### Basic data management
Pros and cons of Stata in empirical finance.
- Pros
 - Easy to learn.
 - Less coding. Many tasks need only a one-line command, compared with multiple lines in other languages.
 - Powerful statistics and regression packages (fantastic).


- Cons
 - Weak at merging and joining datasets (no built-in SQL support).
 - Not a good choice for large datasets, because jobs take a long time to finish even with the MP version.
 - Cannot keep multiple tables in memory, i.e. you cannot open different tables at the same time (Stata 16 and later relax this with frames).

In [18]:
%%stata -o crspm
capture log close
log using class06.log, replace
clear all
set more off

use crspm





### Summary statistics

In [19]:
%%stata -d crspm

su ret
su ret, d
tabstat ret, stat(mean sd min max p1 p99)


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         ret |    316,489    .0113973    .1552864   -.961852   15.98446

                             ret
-------------------------------------------------------------
      Percentiles      Smallest
 1%     -.344353       -.961852
 5%     -.192179       -.935356
10%     -.131313       -.928773       Obs             316,489
25%        -.055       -.917722       Sum of Wgt.     316,489

50%      .006053                      Mean           .0113973
                        Largest       Std. Dev.      .1552864
75%       .06689       7.634921
90%       .14795       8.298913       Variance       .0241139
95%      .220057       9.564357       Skewness       8.786649
99%       .46063       15.98446       Kurtosis       531.6552

    variable |      mean        sd       min       max        p1       p99
-------------+----------------------------------

In [20]:
%%stata -d crspm

bysort exchcd: su mv


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          mv |    111,708     9696845    2.70e+07    4625.13   4.42e+08

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          mv |     18,902    188072.5    745313.2   550.0957   1.91e+07

-------------------------------------------------------------------------------------------------------------------------------------

### Count unique identifiers

In [21]:
%%stata -d crspm

egen permno_id = group(permno)
tabstat permno_id, stat(max)


    variable |       max
-------------+----------
   permno_id |      5520
------------------------


In [22]:
%%stata -d crspm

codebook permno


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
permno                                                                                                                                                                                       (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (long)

                 range:  [10001,93436]                units:  1
         unique values:  5,520                    missing .:  0/319,805

                  mean:   64522.4
              std. dev:   30667.6

           percentiles:        10%       25%       50%       75%       90%
                             13412     33823     80236     89262     9173
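
Both approaches above count distinct **permno** values. The logic of <font color='blue'>egen ... group()</font> can be sketched in plain Python: each distinct value gets a consecutive id, so the largest id equals the number of distinct values. The sample list and the helper `group_ids` are made up for illustration.

```python
def group_ids(values):
    """Map each distinct value to 1, 2, ... in sorted order."""
    lookup = {v: k for k, v in enumerate(sorted(set(values)), start=1)}
    return [lookup[v] for v in values]


permno = [10001, 10001, 10002, 10051, 10002]  # made-up sample
ids = group_ids(permno)
print(ids, max(ids), len(set(permno)))  # [1, 1, 2, 3, 2] 3 3
```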

### Sort
- <font color='blue'>sort</font>: ascending sort
- <font color='blue'>gsort</font>: ascending or descending sort (prefix a variable with <font color='blue'>-</font> for descending order)

In [23]:
%%stata -d crspm -o crspm

sort permno yrm
gsort permno -yrm





In [24]:
crspm.head(10)

Unnamed: 0,permno,date,exchcd,ncusip,permco,cusip,vol,yrm,ret,mv,p_adj
0,10001,20161230,2.0,36720410,7953,36720410,7960.0,201612.0,0.01,132026.0,12.55
1,10001,20161130,2.0,36720410,7953,36720410,13525.0,201611.0,0.012146,131500.0,12.5
2,10001,20161031,2.0,36720410,7953,36720410,41174.0,201610.0,0.619948,129922.007812,12.35
3,10001,20160930,2.0,36720410,7953,36720410,2995.0,201609.0,0.04212,80657.71875,7.67
4,10001,20160831,2.0,36720410,7953,36720410,4277.0,201608.0,0.03662,77397.757812,7.36
5,10001,20160729,2.0,36720410,7953,36720410,7696.0,201607.0,0.026466,74635.195312,7.1
6,10001,20160630,2.0,36720410,7953,36720410,5332.0,201606.0,-0.021008,73478.875,6.99
7,10001,20160531,2.0,36720410,7953,36720410,5579.0,201605.0,-0.021918,75055.679688,7.14
8,10001,20160429,2.0,36720410,7953,36720410,7573.0,201604.0,-0.055698,76737.601562,7.3
9,10001,20160331,2.0,36720410,7953,36720410,1949.0,201603.0,-0.006361,82067.476562,7.81


### Count by group

In [25]:
%%stata -d crspm -o crspm

sort exchcd
by exchcd: count


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 1
  112,157
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 2
  19,057
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 3
  188,591


In [26]:
%%stata -d crspm -o crspm

sort exchcd
by exchcd: codebook permno


codebook may not be combined with by
r(190);

end of do-file
r(190);


In [27]:
%%stata -d crspm -o crspm

egen exch_permno_id = tag(exchcd permno)
sort exchcd permno yrm





In [28]:
crspm[70:80]

Unnamed: 0,permno,date,exchcd,ncusip,permco,cusip,vol,yrm,ret,mv,p_adj,exch_permno_id
70,10051,20151130,1.0,41043F20,7999,41043F20,35370.0,201511.0,0.07975,549480.9,15.57,0
71,10051,20151231,1.0,41043F20,7999,41043F20,34264.0,201512.0,0.056519,580537.0,16.450001,0
72,10051,20160129,1.0,41043F20,7999,41043F20,40408.0,201601.0,-0.179939,476075.6,13.49,0
73,10051,20160229,1.0,41043F20,7999,41043F20,28413.0,201602.0,,,,0
74,10092,20100129,1.0,35086510,8035,35086510,3681.0,201001.0,-0.075472,19627.44,1.47,0
75,10092,20100226,1.0,35086510,8035,35086510,3586.0,201002.0,-0.122449,17224.08,1.29,0
76,10092,20100331,1.0,35086510,8035,35086510,15849.0,201003.0,-0.108527,15354.8,1.15,0
77,10092,20100430,1.0,35086510,8035,35086510,7660.0,201004.0,-0.06087,14420.16,1.08,1
78,10092,20100528,1.0,35086510,8035,35086510,12532.0,201005.0,-0.323519,9754.972,0.7306,0
79,10104,20130731,1.0,68389X10,8045,68389X10,6737296.0,201307.0,0.05731,149806600.0,32.349998,0


In [29]:
%%stata -d crspm -o crspm

bysort exchcd: count if exch_permno_id==1


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 1
  1,807
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 2
  403
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> exchcd = 3
  3,473
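
The idea behind <font color='blue'>egen ... tag()</font> is to flag exactly one row per group, so that counting the flags counts the groups. The sketch below flags the first row of each (exchcd, permno) pair; which row carries the flag does not matter for the count. The rows and the helper `tag_first` are invented for illustration.

```python
from collections import Counter


def tag_first(rows):
    """Return 1 for the first row of each distinct pair and 0 for the rest."""
    seen = set()
    tags = []
    for pair in rows:
        tags.append(0 if pair in seen else 1)
        seen.add(pair)
    return tags


rows = [(1, 10051), (1, 10051), (1, 10092), (2, 10001), (2, 10001)]  # made-up (exchcd, permno)
tags = tag_first(rows)
# Count tagged rows per exchange, i.e. the number of distinct stocks on each exchange.
stocks_per_exchange = Counter(exch for (exch, _), t in zip(rows, tags) if t == 1)
print(tags, dict(stocks_per_exchange))  # [1, 0, 1, 1, 0] {1: 2, 2: 1}
```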


In [30]:
%%stata -d crspm -o crspm

codebook permno if exchcd==1
codebook permno if exchcd==2
codebook permno if exchcd==3


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
permno                                                                                                                                                                                       (unlabeled)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (long)

                 range:  [10051,93426]                units:  1
         unique values:  1,807                    missing .:  0/112,157

                  mean:   58531.5
              std. dev:   30177.5

           percentiles:        10%       25%       50%       75%       90%
                             13627     24985     68144     86560     9120

### Correlation

In [31]:
%%stata -d crspm

cor vol mv


(obs=317,934)

             |      vol       mv
-------------+------------------
         vol |   1.0000
          mv |   0.3633   1.0000


### Lead and Lag of variables
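There are two ways to generate lagged and lead values:
1. Use explicit subscripting, <font color='blue'>[_n-1]</font> or <font color='blue'>[_n+1]</font>, after sorting the data. This picks up the previous or next row within each group.
2. Use the time-series operators <font color='blue'>L1.</font> or <font color='blue'>F1.</font>. You first have to declare the data to be panel or time-series data with <font color='blue'>xtset</font> or <font color='blue'>tsset</font>. If your date variable is not in a Stata date format, convert it first.

The two methods differ when a stock has gaps in its time series: subscripting takes the adjacent row whatever its date, whereas the operators return missing when the adjacent month is absent. This is why the two methods generate different numbers of missing values in the output below.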

In [32]:
%%stata -d crspm -o crspm

// Method 1
sort permno yrm
by permno: gen mv_lag1 = mv[_n-1]
by permno: gen mv_lead1 = mv[_n+1]

// Method 2
gen year = int(yrm/100)
gen month = mod(yrm,100)
gen yrm1 = ym(year,month)
format yrm1 %tm
xtset permno yrm1
gen mv_lag1_1 = L1.mv
gen mv_lead1_1 = F1.mv


. sort permno yrm
(5,568 missing values generated)
(7,351 missing values generated)
. gen year = int(yrm/100)
       panel variable:  permno (unbalanced)
        time variable:  yrm1, 2010m1 to 2016m12, but with gaps
                delta:  1 month
(5,605 missing values generated)
(7,386 missing values generated)


In [33]:
crspm[['permno','yrm','yrm1','mv','mv_lag1','mv_lag1_1','mv_lead1','mv_lead1_1']][80:90]

Unnamed: 0,permno,yrm,yrm1,mv,mv_lag1,mv_lag1_1,mv_lead1,mv_lead1_1
80,10001,201609.0,2016-09-01,80657.71875,77397.757812,77397.757812,129922.007812,129922.007812
81,10001,201610.0,2016-10-01,129922.007812,80657.71875,80657.71875,131500.0,131500.0
82,10001,201611.0,2016-11-01,131500.0,129922.007812,129922.007812,132026.0,132026.0
83,10001,201612.0,2016-12-01,132026.0,131500.0,131500.0,,
84,10002,201001.0,2010-01-01,69125.28125,,,79882.023438,79882.023438
85,10002,201002.0,2010-02-01,79882.023438,69125.28125,69125.28125,85549.148438,85549.148438
86,10002,201003.0,2010-03-01,85549.148438,79882.023438,79882.023438,109185.414062,109185.414062
87,10002,201004.0,2010-04-01,109185.414062,85549.148438,85549.148438,78348.976562,78348.976562
88,10002,201005.0,2010-05-01,78348.976562,109185.414062,109185.414062,65264.300781,65264.300781
89,10002,201006.0,2010-06-01,65264.300781,78348.976562,78348.976562,54151.730469,54151.730469
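
The two methods agree in the rows shown above because these stocks have no gaps. A small plain-Python sketch shows how they diverge once a month is absent. Months are numbered as consecutive integers, the data is made up, and `row_lag` and `calendar_lag` are illustrative names.

```python
def row_lag(values):
    """Value from the previous row, in the spirit of mv[_n-1]."""
    return [None] + values[:-1]


def calendar_lag(months, values):
    """Value from exactly one month earlier, in the spirit of L1.mv; None if absent."""
    by_month = dict(zip(months, values))
    return [by_month.get(m - 1) for m in months]


months = [1, 2, 4, 5]              # month 3 is missing
mv = [100.0, 110.0, 130.0, 125.0]  # made-up market values
print(row_lag(mv))                 # [None, 100.0, 110.0, 130.0]
print(calendar_lag(months, mv))    # [None, 100.0, None, 130.0]
```

For month 4 the row-based lag silently uses the value from month 2, while the calendar-based lag is missing.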


### Loop and macro
- <font color='blue'>forvalues i=1/10</font>: loop over each value from 1 to 10
- <font color='blue'>forvalues i=1(2)10</font>: loop over 1, 3, 5, 7, 9
- <font color='blue'>foreach i in var1 var2 ...</font>: loop over the items of a list exactly as typed
- <font color='blue'>foreach i of varlist var1 var2 ...</font>: loop over variables; Stata checks that they exist and expands abbreviations and wildcards
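
For readers more used to Python, the <font color='blue'>forvalues</font> range written as first(step)last corresponds to a `range` whose end point is inclusive. The helper below is a made-up illustration for positive steps only.

```python
def forvalues(first, last, step=1):
    """Values visited by a forvalues loop from first to last (inclusive) with a positive step."""
    return list(range(first, last + 1, step))


print(forvalues(1, 10))       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(forvalues(1, 10, 2))    # [1, 3, 5, 7, 9]
print(forvalues(10, 90, 10))  # [10, 20, 30, 40, 50, 60, 70, 80, 90]
```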

In [34]:
%%stata -d crspm -o crspm

foreach i in mv vol p_adj {
    gen ln`i' = log(`i')
}              


(1,871 missing values generated)
(328 missing values generated)
(1,871 missing values generated)


In [35]:
crspm.head()

Unnamed: 0,permno,date,exchcd,ncusip,permco,cusip,vol,yrm,ret,mv,...,mv_lag1,mv_lead1,year,month,yrm1,mv_lag1_1,mv_lead1_1,lnmv,lnvol,lnp_adj
0,10001,20100129,2.0,29269V10,7953,36720410,3104.0,201001.0,-0.018932,43871.660156,...,,43646.632812,2010.0,1.0,2010-01-01,,43646.632812,10.689024,8.040447,2.308567
1,10001,20100226,2.0,29269V10,7953,36720410,1510.0,201002.0,-0.000656,43646.632812,...,43871.660156,44351.371094,2010.0,2.0,2010-02-01,43871.660156,44351.371094,10.683882,7.319865,2.303425
2,10001,20100331,2.0,29269V10,7953,36720410,2283.0,201003.0,0.020643,44351.371094,...,43646.632812,69137.304688,2010.0,3.0,2010-03-01,43646.632812,69137.304688,10.699899,7.733246,2.319442
3,10001,20100430,2.0,29269V10,7953,36720410,3350.0,201004.0,0.124385,69137.304688,...,44351.371094,69209.398438,2010.0,4.0,2010-04-01,44351.371094,69209.398438,11.143849,8.116715,2.432736
4,10001,20100528,2.0,29269V10,7953,36720410,3451.0,201005.0,0.004829,69209.398438,...,69137.304688,66028.796875,2010.0,5.0,2010-05-01,69137.304688,66028.796875,11.144892,8.14642,2.433613


In [36]:
%%stata -d crspm -o crspm

local exchange NYSE AMEX NASDAQ
forvalues i=1/3 {
    local j: word `i' of `exchange'
    di "Summary statistics of `j':" 
    su ret if exchcd==`i'
}


Summary statistics of NYSE:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         ret |    111,302    .0121042    .1178818   -.878425   4.169352
Summary statistics of AMEX:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         ret |     18,808    .0062267    .2073981    -.82699   7.634921
Summary statistics of NASDAQ:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         ret |    186,379    .0114968    .1682433   -.961852   15.98446


### Standardize
A standardized variable has mean 0 and standard deviation 1.

$$\text{Standardized score} = \frac{x-\bar{x}}{\sigma}$$

The standardized value tells you how many standard deviations an observation is from the mean. For example, 3 means the value is 3 standard deviations above the mean, while -3 means the value is 3 standard deviations below the mean.

Another reason to standardize variables is to put variables with different units on the same scale.
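
The formula can be checked on a small made-up sample in plain Python. This sketch uses the sample standard deviation (n - 1 in the denominator); `standardize` is an illustrative helper name.

```python
import statistics


def standardize(values):
    """Return (x - mean) / sd for each x, using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
z = standardize(x)
# The standardized values should have mean close to 0 and standard deviation close to 1.
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```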

In [37]:
%%stata -d crspm -o crspm

foreach i in mv vol {
    egen `i'_mean = mean(`i')
    egen `i'_std = sd(`i')
    gen `i'_stdd = (`i'-`i'_mean) / `i'_std
}


(1,871 missing values generated)
(266 missing values generated)


### Rank stocks into groups
To examine the relation between two variables, we usually rank stocks into equal-sized groups based on one variable and then check the pattern of the other variable across the groups.

For example, we rank stocks into 10 groups based on lagged market value. First, in each month, we need to calculate the decile cut-offs:

$$\text{size 1 (small)}: mv\_lag<=p10$$
$$\text{size 2}: p10<mv\_lag<=p20$$
$$\text{size 3}: p20<mv\_lag<=p30$$
$$\text{size 4}: p30<mv\_lag<=p40$$
$$\text{size 5}: p40<mv\_lag<=p50$$
$$\text{size 6}: p50<mv\_lag<=p60$$
$$\text{size 7}: p60<mv\_lag<=p70$$
$$\text{size 8}: p70<mv\_lag<=p80$$
$$\text{size 9}: p80<mv\_lag<=p90$$
$$\text{size 10 (large)}: mv\_lag>p90$$
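
The ten rules above amount to counting how many cut-offs lie strictly below the value. A minimal plain-Python sketch, with made-up cut-offs for a single month and an illustrative helper name `size_group`:

```python
from bisect import bisect_left


def size_group(mv_lag, cutoffs):
    """Return 1 if mv_lag <= p10, 2 if p10 < mv_lag <= p20, ..., 10 if mv_lag > p90.

    cutoffs holds p10..p90 in ascending order.
    """
    return bisect_left(cutoffs, mv_lag) + 1


cutoffs = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # made-up p10..p90
print([size_group(v, cutoffs) for v in (5, 10, 11, 90, 91)])  # [1, 1, 2, 9, 10]
```

A value equal to a cut-off falls into the lower group, matching the "<=" in the rules.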

In [38]:
%%stata -d crspm -o crspm

keep if !missing(mv_lag1)
forvalues i=10(10)90 {
    egen p`i' = pctile(mv_lag1), p(`i') by(yrm)
}


(5,568 observations deleted)


In [39]:
crspm.head()

Unnamed: 0,permno,date,exchcd,ncusip,permco,cusip,vol,yrm,ret,mv,...,vol_stdd,p10,p20,p30,p40,p50,p60,p70,p80,p90
0,10001,20100226,2.0,29269V10,7953,36720410,1510.0,201002.0,-0.000656,43646.632812,...,-0.197257,23660.601562,51635.199219,96942.0,171888.609375,293775.75,511141.03125,937931.625,1889172.0,4597581.0
1,10001,20100331,2.0,29269V10,7953,36720410,2283.0,201003.0,0.020643,44351.371094,...,-0.196637,24346.140625,53107.558594,103611.75,183042.609375,309665.28125,532357.0,993061.5,1983601.125,4872902.5
2,10001,20100430,2.0,29269V10,7953,36720410,3350.0,201004.0,0.124385,69137.304688,...,-0.19578,27283.166016,58872.101562,113963.929688,199936.0,333082.09375,579908.6875,1076586.0,2145456.75,5265974.0
3,10001,20100528,2.0,29269V10,7953,36720410,3451.0,201005.0,0.004829,69209.398438,...,-0.195699,30234.960938,63093.851562,124838.039062,221253.75,362653.6875,619696.875,1140228.375,2224883.25,5447492.5
4,10001,20100630,2.0,29269V10,7953,36720410,3537.0,201006.0,-0.043421,66028.796875,...,-0.19563,28525.480469,57432.796875,114687.5,200910.75,333614.875,575103.3125,1046152.625,2037284.125,5214371.0


In [40]:
%%stata -d crspm -o crspm

gen mv_rank = 1 if mv_lag1<=p10
forvalues i=20(10)90 {
    local j = `i' - 10
    replace mv_rank = `i'/10 if mv_lag1>p`j' & mv_lag1<=p`i'
}
replace mv_rank = 10 if mv_lag1>p90


(282,779 missing values generated)
(31,418 real changes made)
(31,430 real changes made)
(31,419 real changes made)
(31,415 real changes made)
(31,432 real changes made)
(31,432 real changes made)
(31,417 real changes made)
(31,431 real changes made)
(31,385 real changes made)


In [41]:
%%stata -d crspm

tabstat ret, by(mv_rank) stat(mean sd n) nototal


Summary for variables: ret
     by categories of: mv_rank 

 mv_rank |      mean        sd         N
---------+------------------------------
       1 |  .0130232  .2884693     30942
       2 |  .0063831  .1753804     31260
       3 |  .0096091  .1682871     31228
       4 |  .0115563  .1499232     31248
       5 |  .0117593  .1414125     31247
       6 |  .0136738  .1301887     31285
       7 |  .0126616  .1217552     31299
       8 |  .0129353  .1053568     31286
       9 |  .0125934  .0907003     31293
      10 |  .0120801  .0743108     31321
----------------------------------------


If we also want to check the results for 5 groups rather than 10 groups, we can use <font color='blue'>recode</font>.

In [42]:
%%stata -d crspm -o crspm

recode mv_rank (1 2 = 1) (3 4 = 2) (5 6 = 3) (7 8 =4) (9 10 = 5), gen(mv_rank_1)


(282779 differences between mv_rank and mv_rank_1)


#### How to determine the number of groups and cut-off points
There is a trade-off between the number of groups and the number of observations per group. More groups make the difference between the top and bottom groups more apparent, but each group contains fewer observations. Fewer groups guarantee adequate observations in each group, but the difference between the top and bottom groups may become less clear.

The cut-off points depend on the number of groups and the distribution of the ranking variable. See the number-of-analysts example.

### t-test
- One-sample t-test
- Two-sample t-test

#### Is the mean return significantly different from 0?

In [43]:
%%stata -d crspm

ttest ret = 0


One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     ret | 312,409    .0116269    .0002776     .155158    .0110828     .012171
------------------------------------------------------------------------------
    mean = mean(ret)                                              t =  41.8844
Ho: mean = 0                                     degrees of freedom =   312408

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
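
The t statistic in this output can be reproduced by hand: it is the sample mean divided by its standard error, sd / sqrt(n). The sketch below plugs in the rounded figures from the table above, so it matches the reported values only approximately; `one_sample_t` is an illustrative helper.

```python
import math


def one_sample_t(mean, sd, n, mu0=0.0):
    """Return the t statistic and standard error for H0: population mean equals mu0."""
    se = sd / math.sqrt(n)
    return (mean - mu0) / se, se


# Mean, standard deviation and number of observations copied from the output above.
t, se = one_sample_t(mean=0.0116269, sd=0.155158, n=312409)
print(round(t, 2), round(se, 7))
```

With more than 300,000 observations the standard error is tiny, which is why a mean monthly return of about 1.2% produces such a large t statistic.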


#### t-test by group

In [44]:
%%stata -d crspm

bysort mv_rank: ttest ret = 0


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> mv_rank = 1

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     ret |  30,942    .0130232    .0016399    .2884693    .0098089    .0162376
------------------------------------------------------------------------------
    mean = mean(ret)                                              t =   7.9413
Ho: mean = 0                                     degrees of freedom =    30941

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

------------------------------------------------------

#### Is the return of size group 1 significantly different from the return of size group 10?

In [45]:
%%stata -d crspm

ttest ret if inlist(mv_rank,1,10), by(mv_rank) 


Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |  30,942    .0130232    .0016399    .2884693    .0098089    .0162376
      10 |  31,321    .0120801    .0004199    .0743108    .0112571    .0129031
---------+--------------------------------------------------------------------
combined |  62,263    .0125488    .0008419    .2100747    .0108987    .0141989
---------+--------------------------------------------------------------------
    diff |            .0009431    .0016838               -.0023572    .0042435
------------------------------------------------------------------------------
    diff = mean(1) - mean(10)                                     t =   0.5601
Ho: diff = 0                                     degrees of freedom =    62261

    Ha: dif
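
The equal-variance test above pools the two group variances before computing the standard error of the difference. A plain-Python sketch using the rounded group statistics from the table, with `pooled_t` as an illustrative helper:

```python
import math


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Return the equal-variance two-sample t statistic, its standard error and the degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, se, df


# Group 1 and group 10 means, standard deviations and counts from the output above.
t, se, df = pooled_t(0.0130232, 0.2884693, 30942, 0.0120801, 0.0743108, 31321)
print(round(t, 3), round(se, 7), df)
```

The two groups have very different standard deviations here, so it may be worth also running the test with the <font color='blue'>unequal</font> option of <font color='blue'>ttest</font> as a robustness check.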

### Trimming and winsorizing
- Trimming or truncation: simply delete or exclude outliers.
  For example, take the sample [-500,1,2,3,4,5,6,7,8,9,10,1000]. Trimming at the 10th and 90th percentiles deletes -500 and 1000.
- Winsorizing: replace outliers with the cut-off values.
  Winsorizing the above sample at the same percentiles gives [1,1,2,3,4,5,6,7,8,9,10,10].
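
The difference between the two treatments is easy to verify on the sample above. In this plain-Python sketch the cut-offs 1 and 10 are taken as given from the example rather than computed, and `trim` and `winsorize` are illustrative helpers.

```python
def trim(values, low, high):
    """Drop observations outside [low, high]."""
    return [v for v in values if low <= v <= high]


def winsorize(values, low, high):
    """Pull observations outside [low, high] back to the nearest cut-off."""
    return [min(max(v, low), high) for v in values]


sample = [-500, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
low, high = 1, 10  # cut-offs taken from the example, not computed here
print(trim(sample, low, high))       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(winsorize(sample, low, high))  # [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]
```

Trimming shrinks the sample from 12 to 10 observations, while winsorizing keeps all 12.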

In [46]:
%%stata -d crspm

su ret, d


                             ret
-------------------------------------------------------------
      Percentiles      Smallest
 1%     -.345029       -.961852
 5%     -.192308       -.935356
10%     -.131025       -.928773       Obs             312,409
25%     -.054492       -.917722       Sum of Wgt.     312,409

50%      .006412                      Mean           .0116269
                        Largest       Std. Dev.       .155158
75%      .067176       7.634921
90%       .14811       8.298913       Variance        .024074
95%      .219886       9.564357       Skewness       8.847758
99%      .459228       15.98446       Kurtosis       539.2814


In [47]:
%%stata -d crspm

qui su ret, d
return list


scalars:
                  r(N) =  312409
              r(sum_w) =  312409
               r(mean) =  .0116269143563768
                r(Var) =  .0240740046461512
                 r(sd) =  .1551579989757253
           r(skewness) =  8.847757637704198
           r(kurtosis) =  539.2813501115215
                r(sum) =  3632.352687161329
                r(min) =  -.9618520140647888
                r(max) =  15.98445606231689
                 r(p1) =  -.3450289964675903
                 r(p5) =  -.1923079937696457
                r(p10) =  -.1310250014066696
                r(p25) =  -.054492000490427
                r(p50) =  .0064119999296963
                r(p75) =  .0671759992837906
                r(p90) =  .1481100022792816
                r(p95) =  .2198860049247742
                r(p99) =  .4592280089855194


In [48]:
%%stata -d crspm

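// Run "qui su ret, d" first so that r(p1) and r(p99) are defined.
// Uncomment the lines below to apply them.

// trimming
// keep if ret>=r(p1) & ret<=r(p99)

// winsorizing (missing values compare as larger than any number,
// so exclude them explicitly)
// replace ret = r(p99) if ret>r(p99) & !missing(ret)
// replace ret = r(p1) if ret<r(p1)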


### Another example of ranking variables
After this example, you should have a better understanding of ranking variables into groups and of the issues that arise when grouping.

In [49]:
%%stata -o numest

use numest, clear

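// Deciles of the number of analysts (numest), displayed once per year
qui: levelsof year, local(year_list)

foreach i in `year_list' {
    // There is no if-condition on year here, so each pass uses the
    // pooled sample and prints the same cut-offs
    _pctile numest, nq(10)
    di `i'
    return list
}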


1985

scalars:
                 r(r1) =  1
                 r(r2) =  1
                 r(r3) =  2
                 r(r4) =  3
                 r(r5) =  4
                 r(r6) =  5
                 r(r7) =  7
                 r(r8) =  11
                 r(r9) =  18
1986

scalars:
                 r(r1) =  1
                 r(r2) =  1
                 r(r3) =  2
                 r(r4) =  3
                 r(r5) =  4
                 r(r6) =  5
                 r(r7) =  7
                 r(r8) =  11
                 r(r9) =  18
1987

scalars:
                 r(r1) =  1
                 r(r2) =  1
                 r(r3) =  2
                 r(r4) =  3
                 r(r5) =  4
                 r(r6) =  5
                 r(r7) =  7
                 r(r8) =  11
                 r(r9) =  18
1988

scalars:
                 r(r1) =  1
                 r(r2) =  1
                 r(r3) =  2
                 r(r4) =  3
                 r(r5) =  4

Clearly, the 10th and 20th percentiles are the same (both equal 1). If you rank stocks into 10 groups, group 1 (numest<=p10) and group 2 (p10<numest<=p20) are therefore problematic: no observation can satisfy 1<numest<=1, so group 2 is empty.

When your data has little variation, reduce the number of groups. In this case, you can rank stocks into 5 groups instead.
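
The loop above calls `_pctile` without a condition on year, so it reports the same pooled cut-offs on every pass. The sketch below shows one way to compute the cut-offs and a 5-group rank separately for each year. It runs on simulated data, because the `numest` file is not reproduced here: the seed, the sample size, the skewed count formula, and the names `grp5` and `tmp` are all assumptions made for illustration.

```stata
clear
set seed 12345
set obs 600

// Simulated firm-year data: three years, skewed analyst counts (illustrative)
gen year = 1985 + mod(_n, 3)
gen numest = ceil(20 * runiform()^3)

levelsof year, local(year_list)
gen grp5 = .

foreach y of local year_list {
    // Quintile cut-offs for this year only
    _pctile numest if year == `y', nq(5)
    di "year `y': p20=" r(r1) " p40=" r(r2) " p60=" r(r3) " p80=" r(r4)

    // Rank this year's observations into 5 groups
    xtile tmp = numest if year == `y', nq(5)
    replace grp5 = tmp if year == `y'
    drop tmp
}

// Group sizes can be unequal when many observations share a value
tab year grp5
```

Checking the `tab` output for empty or very small groups is a quick way to spot tied cut-offs before using the groups in later analysis.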

### Transpose data
When the raw data is not in a panel (long) structure, you need to transpose or reshape it.

In the following example, the raw data are firms' total assets from Datastream, and -999 indicates a missing value.

In [50]:
%%stata -o ta

import excel using "ds_at.xlsx", firstrow clear





Below is what the raw data looks like. Each column is a firm, except for the first column, which holds the date (here, the year).

In [51]:
ta.head()

Unnamed: 0,Name,ds134982,ds135084,ds135090,ds135092,ds135116,ds135127,ds135132,ds135176,ds135196,ds135197,ds135206,ds135215,ds135225,ds135229
0,1996,303300,45956,1703537,126980,468797,6398,-999,15063,569800,67863,58758,77667,638993,266600
1,1997,283000,70909,1906801,355475,730969,14047,43481,20745,674900,203660,70645,66821,756237,464200
2,1998,297000,99051,3755755,429206,819500,17799,45186,43189,698600,443032,98000,46693,-999,677400
3,1999,307700,639109,4319100,577164,1023800,12223,86489,132330,-999,823303,131634,67425,-999,852900
4,2000,359800,656345,7274600,731618,3012700,16022,547683,200842,-999,779713,158692,86624,-999,1665900


In [52]:
%%stata -d ta -o ta

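// Recode the Datastream missing-value flag (-999) to Stata's missing value
foreach i of varlist _all {
    replace `i'=. if `i'==-999
}

// Wide -> long: one row per date (Name) and firm (dscode)
reshape long ds, i(Name) j(dscode)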


(0 real changes made)
(0 real changes made)
(0 real changes made)
(3 real changes made, 3 to missing)
(12 real changes made, 12 to missing)
(10 real changes made, 10 to missing)
(0 real changes made)
(0 real changes made)
(15 real changes made, 15 to missing)
(15 real changes made, 15 to missing)
(17 real changes made, 17 to missing)
(11 real changes made, 11 to missing)
(7 real changes made, 7 to missing)
(6 real changes made, 6 to missing)
(18 real changes made, 18 to missing)
(0 real changes made)
(note: j = 134982 135084 135090 135092 135116 135127 135132 135176 135196 135197 135206 135215 135225 135229)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                       20   ->     280
Number of variables                  16   ->       4
j variable (14 values)                    ->   dscode
xij variables:
         ds134982 ds135084 ... ds135229   ->   ds
---------------------------

In [53]:
ta.head()

Unnamed: 0,Name,dscode,ds
0,1996,134982.0,303300.0
1,1996,135084.0,45956.0
2,1996,135090.0,1703537.0
3,1996,135092.0,126980.0
4,1996,135116.0,468797.0
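
The same wide-to-long steps can be tried on a small hand-typed dataset. This is a sketch under stated assumptions: the firm codes 101 and 102 and all values are made up, and `mvdecode` is used as a built-in alternative to the `foreach` loop for recoding -999 to missing.

```stata
clear
input year ds101 ds102
1996 10 -999
1997 12 40
end

// Recode the -999 flag to missing in all ds* columns
mvdecode ds*, mv(-999)

// Wide -> long: the numeric suffix of each ds* column becomes dscode
reshape long ds, i(year) j(dscode)

rename ds asset_tot
sort dscode year
list, clean
```

After the reshape there is one row per firm and year, which is the panel layout that by-group commands expect.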


In [54]:
%%stata -d ta -o ta

// Give the variables meaningful names, then sort as a firm-year panel
ren Name year
ren ds asset_tot
order dscode
sort dscode year





In [55]:
ta[:25]

Unnamed: 0,dscode,year,asset_tot
0,134982.0,1996,303300.0
1,134982.0,1997,283000.0
2,134982.0,1998,297000.0
3,134982.0,1999,307700.0
4,134982.0,2000,359800.0
5,134982.0,2001,305400.0
6,134982.0,2002,304700.0
7,134982.0,2003,293800.0
8,134982.0,2004,286100.0
9,134982.0,2005,285800.0
