# Data Preprocessing Example

---

In [25]:
import numpy as np
import sqlite3       
import pandas as pd
pd.options.display.max_rows = 10
np.set_printoptions(precision=4, suppress=True)

---

### Q1.1

In this question, connect to the database ```hk_stocks.db``` used in lecture 6, as follows:
```python
conn = sqlite3.connect('../data/hk_stocks.db')
c = conn.cursor()
```
More details on this database can be found in the related lecture notes.

After you create the connection, please use the SQL ```SELECT``` statement to extract the following variables:
* ```key```: identifier for each observation;
* ```id```: firm identifier;
* ```eom```: end of month;
* ```ret_exc_lead1m```: stock returns;
* ```be_me```: book-to-market equity;
* ```bev_mev```: book-to-market enterprise value;
* ```ret_12_1```: price momentum $t-12$ to $t-1$;
* ```ret_9_1```: price momentum $t-9$ to $t-1$.

Finally, using the data extracted by the ```SELECT``` statement, you need to construct a ```pd.DataFrame```, named ```data_df```. The first five rows of ```data_df``` are as follows:
```python
        key           id                  eom  ret_exc_lead1m     be_me  bev_mev  ret_12_1   ret_9_1  
0  13581256  310108801.0  1990-07-31 00:00:00       -0.094007  0.552603  0.573481  0.597304  0.434458 
1  13581257  310108801.0  1990-08-31 00:00:00       -0.145700  0.605826  0.626845  0.720185  0.510205   
2  13581258  310108801.0  1990-09-30 00:00:00        0.151076  0.704216  0.724663  0.358864  0.301394   
3  13581259  310108801.0  1990-10-31 00:00:00        0.017782  0.614900  0.635911  0.182734  0.072359 
4  13581260  310108801.0  1990-11-30 00:00:00        0.020163  0.600821  0.621840  0.371875  0.160488 
```

In [26]:
conn = sqlite3.connect('../data/hk_stocks.db')
c = conn.cursor()

In [27]:
query = """
SELECT stock_returns.key, 
       stock_returns.id, 
       stock_returns.eom, 
       ret_exc_lead1m, be_me, bev_mev, ret_12_1, ret_9_1
FROM stock_returns, value_signals, momentum_signals
WHERE value_signals.key = stock_returns.key AND
      momentum_signals.key = stock_returns.key
"""

stock_returns = c.execute(query)
data_df = pd.DataFrame(stock_returns.fetchall(), 
                       columns=['key', 'id', 'eom', 'ret_exc_lead1m', 'be_me', 'bev_mev', 'ret_12_1', 'ret_9_1'])

In [28]:
print(data_df.head())

        key           id                  eom  ret_exc_lead1m     be_me  \
0  13581256  310108801.0  1990-07-31 00:00:00       -0.094007  0.552603   
1  13581257  310108801.0  1990-08-31 00:00:00       -0.145700  0.605826   
2  13581258  310108801.0  1990-09-30 00:00:00        0.151076  0.704216   
3  13581259  310108801.0  1990-10-31 00:00:00        0.017782  0.614900   
4  13581260  310108801.0  1990-11-30 00:00:00        0.020163  0.600821   

    bev_mev  ret_12_1   ret_9_1  
0  0.573481  0.597304  0.434458  
1  0.626845  0.720185  0.510205  
2  0.724663  0.358864  0.301394  
3  0.635911  0.182734  0.072359  
4  0.621840  0.371875  0.160488  
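
As an alternative to the cursor-based approach, the same three-table join can be sketched with explicit `JOIN ... ON` clauses and `pd.read_sql_query`, which takes the column names from the query itself. The in-memory database below is a toy stand-in: the table names come from the query above, but the assignment of columns to tables and all values are assumptions made for illustration only.

```python
import sqlite3
import pandas as pd

# Toy in-memory database; the schema below is an assumption, not the real hk_stocks.db.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE stock_returns (key INTEGER PRIMARY KEY, id REAL, eom TEXT, ret_exc_lead1m REAL);
CREATE TABLE value_signals (key INTEGER PRIMARY KEY, be_me REAL, bev_mev REAL);
CREATE TABLE momentum_signals (key INTEGER PRIMARY KEY, ret_12_1 REAL, ret_9_1 REAL);
INSERT INTO stock_returns VALUES (1, 101.0, '1990-07-31 00:00:00', -0.09),
                                 (2, 101.0, '1990-08-31 00:00:00', NULL);
INSERT INTO value_signals VALUES (1, 0.55, 0.57), (2, 0.60, NULL);
INSERT INTO momentum_signals VALUES (1, 0.59, 0.43), (2, NULL, 0.51);
""")

# Explicit joins on the shared key, equivalent to the comma-separated FROM + WHERE form.
query_demo = """
SELECT r.key, r.id, r.eom, r.ret_exc_lead1m,
       v.be_me, v.bev_mev, m.ret_12_1, m.ret_9_1
FROM stock_returns AS r
JOIN value_signals AS v ON v.key = r.key
JOIN momentum_signals AS m ON m.key = r.key
ORDER BY r.key
"""

# read_sql_query runs the query and labels the columns from the SELECT list.
demo_df = pd.read_sql_query(query_demo, conn_demo)
conn_demo.close()
print(demo_df)
```

With this form there is no need to repeat the column list when building the `pd.DataFrame`, so the query and the column labels cannot drift apart.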


### Q1.2

Building on the dataset ```data_df``` from Q1.1, you now need to handle the missing data.

First, print the missing rate (the fraction of missing values) of each variable in ```data_df```.

Second, print the fraction of observations for which ```bev_mev``` is missing but ```be_me``` is not, and likewise the fraction for which ```ret_12_1``` is missing but ```ret_9_1``` is not.

Third, impute the missing values of ```bev_mev``` with the corresponding non-missing values of ```be_me```, and the missing values of ```ret_12_1``` with those of ```ret_9_1```. If both variables in a pair are missing, leave the entry as NaN.

Finally, remove every row in which the stock return ```ret_exc_lead1m``` is missing. What are the missing rates of all columns after these steps?

In [29]:
pd.isnull(data_df).mean()

key               0.000000
id                0.000000
eom               0.000000
ret_exc_lead1m    0.006100
be_me             0.078330
bev_mev           0.127977
ret_12_1          0.085896
ret_9_1           0.066911
dtype: float64

In [34]:
# Fraction of rows where the first variable is missing but its fallback is observed
print((data_df.bev_mev.isnull() & data_df.be_me.notnull()).mean())
print((data_df.ret_12_1.isnull() & data_df.ret_9_1.notnull()).mean())

0.0643076468922931
0.01898475364100281


In [8]:
data_df.bev_mev = data_df.bev_mev.combine_first(data_df.be_me)
data_df.ret_12_1 = data_df.ret_12_1.combine_first(data_df.ret_9_1)

In [9]:
# Keep only the rows with an observed stock return
data_df = data_df[data_df.ret_exc_lead1m.notnull()]

In [10]:
pd.isnull(data_df).mean()

key               0.000000
id                0.000000
eom               0.000000
ret_exc_lead1m    0.000000
be_me             0.077637
bev_mev           0.063115
ret_12_1          0.065985
ret_9_1           0.065985
dtype: float64
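
To see what the imputation and row-filtering steps do on a small scale, here is a minimal sketch on made-up data. The frame `toy` and its values are purely illustrative; only the column names are borrowed from ```data_df```.

```python
import numpy as np
import pandas as pd

# Made-up data: four observations with gaps in different places.
toy = pd.DataFrame({
    "ret_exc_lead1m": [0.01, np.nan, -0.02, 0.03],
    "bev_mev":        [np.nan, 0.6, np.nan, 0.7],
    "be_me":          [0.5, 0.4, np.nan, 0.9],
})

# Fill gaps in bev_mev from be_me; rows where both are missing stay NaN.
toy["bev_mev"] = toy["bev_mev"].combine_first(toy["be_me"])

# Drop rows whose return is missing, then recompute the missing rates.
toy = toy.dropna(subset=["ret_exc_lead1m"])
print(toy.isnull().mean())
```

In the toy data, the first row's `bev_mev` is filled with 0.5 from `be_me`, the third row stays NaN because both values are missing, and the second row is dropped because its return is missing.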

### Q1.3

After you finish Q1.2, you will find that ```data_df``` still contains many missing values. In this question, we impute the remaining missing entries with the column means per period. In particular, you need to do the following:
* Keep only the five columns ```['id', 'eom', 'ret_exc_lead1m', 'bev_mev', 'ret_12_1']``` and use ```eom``` and ```id``` as the index.
* In each month, impute the missing data in ```'bev_mev'``` and ```'ret_12_1'``` using the column means computed from that month's observations. Next, standardize ```'bev_mev'``` and ```'ret_12_1'``` to have zero mean and unit standard deviation within each month.
* Do you detect any outliers in ```'bev_mev'``` and ```'ret_12_1'```?

In [11]:
data_df2 = data_df[['id', 'eom', 'ret_exc_lead1m', 'bev_mev', 'ret_12_1']].set_index(['eom', 'id'])
data_df2 = data_df2.sort_index(level=0)

In [12]:
fill_mean = lambda g: g.fillna(g.mean())   # g is a group

In [13]:
data_df2 = data_df2.groupby(level=0).apply(fill_mean)
data_df2

                                                     ret_exc_lead1m   bev_mev  ret_12_1
eom                 eom                 id
1990-07-31 00:00:00 1990-07-31 00:00:00 301510501.0       -0.057822  0.123895  0.369475
1990-07-31 00:00:00 1990-07-31 00:00:00 301549801.0       -0.084795  0.587863  0.235526
1990-07-31 00:00:00 1990-07-31 00:00:00 301553001.0       -0.065578  0.456535  0.308667
1990-07-31 00:00:00 1990-07-31 00:00:00 301565201.0       -0.253837  0.814995  0.167575
1990-07-31 00:00:00 1990-07-31 00:00:00 301569701.0       -0.039236  0.595812  0.760345
...                 ...                 ...                     ...       ...       ...
2021-12-31 00:00:00 2021-12-31 00:00:00 335161801.0       -0.000060  0.471711  0.220724
2021-12-31 00:00:00 2021-12-31 00:00:00 335163701.0       -0.339582  0.016337  0.220724
2021-12-31 00:00:00 2021-12-31 00:00:00 335170601.0        0.223572  0.068281  0.220724
2021-12-31 00:00:00 2021-12-31 00:00:00 335183201.0       -0.198522  3.395933  0.220724


In [14]:
signals_mean_bymonth = data_df2.groupby(level=0)[['bev_mev', 'ret_12_1']].mean()
signals_std_bymonth = data_df2.groupby(level=0)[['bev_mev', 'ret_12_1']].std()
print(signals_mean_bymonth.head())
print(signals_std_bymonth.head())

                      bev_mev  ret_12_1
eom                                    
1990-07-31 00:00:00  1.315702  0.496676
1990-08-31 00:00:00  1.586242  0.598938
1990-09-30 00:00:00  2.026636  0.164139
1990-10-31 00:00:00  1.990379  0.055248
1990-11-30 00:00:00  2.523066  0.144498
                      bev_mev  ret_12_1
eom                                    
1990-07-31 00:00:00  2.686950  0.682272
1990-08-31 00:00:00  3.446855  0.615510
1990-09-30 00:00:00  5.321064  0.412677
1990-10-31 00:00:00  5.479092  0.389590
1990-11-30 00:00:00  8.170042  0.518468


In [15]:
data_df2[['bev_mev', 'ret_12_1']] = (data_df2[['bev_mev', 'ret_12_1']] - signals_mean_bymonth) / signals_std_bymonth
data_df2.head()

                                                     ret_exc_lead1m    bev_mev   ret_12_1
eom                 eom                 id
1990-07-31 00:00:00 1990-07-31 00:00:00 301510501.0       -0.057822  -0.443554  -0.186437
1990-07-31 00:00:00 1990-07-31 00:00:00 301549801.0       -0.084795  -0.270879  -0.382764
1990-07-31 00:00:00 1990-07-31 00:00:00 301553001.0       -0.065578  -0.319755  -0.275563
1990-07-31 00:00:00 1990-07-31 00:00:00 301565201.0       -0.253837  -0.186348  -0.482360
1990-07-31 00:00:00 1990-07-31 00:00:00 301569701.0       -0.039236  -0.267921   0.386458
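
The per-month imputation and standardization can also be sketched with `groupby(...).transform`, which returns values aligned to the input index. Depending on the pandas version, `groupby(...).apply` may prepend the group key as an extra index level (the repeated `eom` level in the output above), whereas `transform` keeps the original `(eom, id)` index. The toy data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up panel: two months, three firms each, one gap per month.
idx = pd.MultiIndex.from_tuples(
    [("1990-07-31", 1.0), ("1990-07-31", 2.0), ("1990-07-31", 3.0),
     ("1990-08-31", 1.0), ("1990-08-31", 2.0), ("1990-08-31", 3.0)],
    names=["eom", "id"],
)
toy = pd.DataFrame({"bev_mev": [1.0, np.nan, 3.0, 2.0, 4.0, np.nan]}, index=idx)

# Step 1: fill each gap with that month's cross-sectional mean (NaNs are skipped).
monthly_mean = toy.groupby(level="eom")["bev_mev"].transform("mean")
filled = toy["bev_mev"].fillna(monthly_mean)

# Step 2: z-score within each month (sample std with ddof=1, the pandas default).
grouped = filled.groupby(level="eom")
zscore = (filled - grouped.transform("mean")) / grouped.transform("std")
print(zscore)
```

In this toy example, the imputed entries land at exactly zero after standardization, because they were set equal to the monthly mean.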


In [16]:
data_df2.describe()

       ret_exc_lead1m        bev_mev      ret_12_1
count        410758.0       410758.0      410758.0
mean         0.012128  -1.816325e-19  1.418463e-18
std          0.315845       0.999541      0.999541
min         -0.977637      -1.746098     -3.175112
25%         -0.077113     -0.1616233    -0.3830122
50%         -0.008390    -0.06788305    -0.1136466
75%          0.062795            0.0     0.1304931
max        105.555126       48.76194      45.70662


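The standardized signals ```'bev_mev'``` and ```'ret_12_1'``` are strongly right-skewed: their maxima lie roughly 49 and 46 standard deviations above the monthly mean, while their minima lie only about 1.7 and 3.2 standard deviations below it. Such extreme outliers may distort the results of a regression analysis. We will discuss how to handle this issue in lecture 10.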
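
One simple way to flag such outliers is to count the standardized values whose absolute value exceeds a cutoff and to check the sample skewness. The sketch below uses made-up z-scores, and the cutoff of 3 is an assumed rule of thumb, not something prescribed by the exercise.

```python
import pandas as pd

# Made-up standardized values with one extreme observation.
z = pd.Series([-1.2, -0.4, -0.1, 0.0, 0.3, 0.8, 48.8])

# Flag values more than 3 standard deviations from zero (assumed cutoff).
is_outlier = z.abs() > 3
n_outliers = int(is_outlier.sum())

# Positive skewness indicates a long right tail.
skewness = z.skew()
print(n_outliers, skewness > 0)
```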

In [17]:
conn.close()   # Don't forget to close the connection!
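
To avoid forgetting the final `close()`, the connection can be wrapped in `contextlib.closing`, which closes it when the `with` block ends, even if an exception is raised inside. A minimal sketch on an in-memory database (the table `t` is hypothetical):

```python
import sqlite3
from contextlib import closing

# closing(...) calls conn_demo.close() automatically when the block exits.
with closing(sqlite3.connect(":memory:")) as conn_demo:
    conn_demo.execute("CREATE TABLE t (x INTEGER)")
    conn_demo.execute("INSERT INTO t VALUES (1), (2)")
    total = conn_demo.execute("SELECT SUM(x) FROM t").fetchone()[0]

# The connection is already closed at this point.
print(total)
```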

---

# END