# Tutorial 10: Data Analysis in-class practices part III
### 10.1 Revision of Data Aggregation in Pandas

In this tutorial, we repeat the queries from Problem Set 2 Q3, this time using pandas operations instead of SQL.

### Q1

First, create three DataFrames with the following names from the ```HK_stocks_151signals.parquet``` file that we used during the lecture. Retain the original index in each DataFrame and name the index `key`.
* `stock_returns`: contains three columns, ```['id', 'eom', 'ret_exc_lead1m']```;
* `st_reversal_signals`: contains eight columns ```['id', 'eom', 'iskew_capm_21d', 'iskew_ff3_21d', 'iskew_hxz4_21d', 'ret_1_0', 'rmax5_rvol_21d', 'rskew_21d']```;
* `quality_signals`: contains seven columns ```['id', 'eom', 'at_turnover', 'cop_at', 'cop_atl1', 'dgp_dsale', 'gp_at']```.

Afterwards, print a summary of each DataFrame using the `df.info()` method.

#### Answers:

In [1]:
import pandas as pd
import numpy as np

D = pd.read_parquet(
    r"D:\OneDrive - The University Of Hong Kong\HKU TA\Fall 2024-2025\FINA2390\data_to_share\HK_stocks_151signals.parquet", 
    engine='pyarrow'
)
D.index.name = 'key'
D.id = D.id.astype(str) #convert id column to str datatype
identifier_var_list = ['id', 'eom', 'ret_exc_lead1m']
st_reversal_list = ['id', 'eom', 'iskew_capm_21d', 
                    'iskew_ff3_21d', 'iskew_hxz4_21d', 
                    'ret_1_0', 'rmax5_rvol_21d', 'rskew_21d']
quality_list = ['id', 'eom', 'at_turnover', 
                'cop_at', 'cop_atl1', 'dgp_dsale', 'gp_at']

In [2]:
stock_returns = D[identifier_var_list]
st_reversal_signals = D[st_reversal_list]
quality_signals = D[quality_list]

print('Information about table stock_returns:')
stock_returns.info()
print('\n')
print('Information about table st_reversal_signals:')
st_reversal_signals.info()
print('\n')
print('Information about table quality_signals:')
quality_signals.info()

Information about table stock_returns:
<class 'pandas.core.frame.DataFrame'>
Index: 413279 entries, 13581256 to 23638850
Data columns (total 3 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   id              413279 non-null  object        
 1   eom             413279 non-null  datetime64[us]
 2   ret_exc_lead1m  410758 non-null  float64       
dtypes: datetime64[us](1), float64(1), object(1)
memory usage: 12.6+ MB


Information about table st_reversal_signals:
<class 'pandas.core.frame.DataFrame'>
Index: 413279 entries, 13581256 to 23638850
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   id              413279 non-null  object        
 1   eom             413279 non-null  datetime64[us]
 2   iskew_capm_21d  331579 non-null  float64       
 3   iskew_ff3_21d   331253 non-null  float64       
 4   iskew_hxz4_21d  265093 non

### Q2

Create a DataFrame `D_Q2` with the stock-month observations from the ```stock_returns``` DataFrame that satisfy the following requirements:
* Select all three columns
* The stock return (`ret_exc_lead1m`) should be strictly between $-0.05$ and $0.05$
* The stock id contains ```999```
* Sort the data by ```eom``` and then ```id```, in ascending order
* Keep only the first 10 rows of the sorted data.

#### Answers:

In [3]:
D_Q2 = stock_returns[
    (stock_returns.ret_exc_lead1m > -0.05) 
    & (stock_returns.ret_exc_lead1m < 0.05)
    & (stock_returns.id.str.contains('999'))
].sort_values(["eom", "id"]).head(10)

In [4]:
D_Q2

| key | id | eom | ret_exc_lead1m |
|---|---|---|---|
| 13815424 | 310199901.0 | 1994-03-31 | 0.031025 |
| 13815425 | 310199901.0 | 1994-04-30 | -0.022661 |
| 13815428 | 310199901.0 | 1994-07-31 | -0.043609 |
| 13815435 | 310199901.0 | 1995-02-28 | 0.007507 |
| 13815436 | 310199901.0 | 1995-03-31 | -0.016917 |
| 13815438 | 310199901.0 | 1995-05-31 | -0.043693 |
| 13815439 | 310199901.0 | 1995-06-30 | 0.018399 |
| 13815441 | 310199901.0 | 1995-08-31 | 0.040565 |
| 13815443 | 310199901.0 | 1995-10-31 | 0.006818 |
| 13815444 | 310199901.0 | 1995-11-30 | -0.002219 |
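
The same filter, sort, and limit pattern can be tried on a small scale. The sketch below uses a made-up `toy` DataFrame (its values are hypothetical, not taken from the parquet file) and swaps the two chained comparisons for `Series.between` with `inclusive="neither"`, which assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical toy data; not taken from HK_stocks_151signals.parquet
toy = pd.DataFrame(
    {
        "id": ["109990", "200001", "399901", "109990"],
        "eom": pd.to_datetime(
            ["2020-02-29", "2020-01-31", "2020-01-31", "2020-01-31"]
        ),
        "ret_exc_lead1m": [0.01, 0.02, 0.20, -0.03],
    }
)

# SQL WHERE: return strictly inside (-0.05, 0.05) AND id LIKE '%999%'
mask = (
    toy["ret_exc_lead1m"].between(-0.05, 0.05, inclusive="neither")
    & toy["id"].str.contains("999", regex=False)
)

# SQL ORDER BY eom, id LIMIT 10
toy_q2 = toy[mask].sort_values(["eom", "id"]).head(10)
print(toy_q2)
```

Rows with a missing return fail the `between` test, so they are excluded without an explicit `notna()` check.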


### Q3

Create a DataFrame `D_Q3` that reports the average, maximum, and minimum return of each stock, subject to the following requirements:
* Set `id` as the index and create three new columns (`mean_ret`, `max_ret`, and `min_ret`) holding the average, maximum, and minimum returns of the selected stocks
* The stock returns should NOT be missing
* Keep only stocks with more than 300 time-series observations.

#### Answers:

In [5]:
D_Q3 = stock_returns[
    stock_returns.ret_exc_lead1m.notna() #NOT NULL
].groupby("id")['ret_exc_lead1m'].agg( #GROUP BY
    [('mean_ret', 'mean'), #AVG, mean_ret as column name
    ('max_ret','max'), #MAX, max_ret as column name
    ('min_ret','min'), #MIN, min_ret as column name
    'size'] #COUNT
)

In [6]:
D_Q3 = D_Q3.loc[
    D_Q3['size'] > 300,
    ['mean_ret', 'max_ret', 'min_ret']
]
D_Q3

| id | mean_ret | max_ret | min_ret |
|---|---|---|---|
| 301549801.0 | 0.011850 | 0.364470 | -0.341721 |
| 301553001.0 | 0.008724 | 0.429556 | -0.404597 |
| 301565201.0 | 0.012885 | 0.683868 | -0.503526 |
| 301569701.0 | 0.012152 | 0.673881 | -0.359551 |
| 301574903.0 | 0.009998 | 0.788158 | -0.407551 |
| ... | ... | ... | ... |
| 322232901.0 | 0.009226 | 0.506172 | -0.365105 |
| 322234101.0 | 0.004638 | 2.145951 | -0.577159 |
| 322671801.0 | 0.018392 | 2.767344 | -0.523290 |
| 322897601.0 | 0.008670 | 1.789013 | -0.548270 |


In [7]:
# Alternative approach using the groupby.filter() method, which corresponds to SQL HAVING
D_Q3a = stock_returns[
    stock_returns.ret_exc_lead1m.notna() #SQL NOT NULL
].groupby('id').filter( #SQL HAVING
    lambda g: len(g) > 300
).groupby('id')['ret_exc_lead1m'].agg( #SQL GROUP BY
    [('mean_ret', 'mean'), #AVG
    ('max_ret','max'), #MAX
    ('min_ret','min')] #MIN
)

In [8]:
D_Q3a

| id | mean_ret | max_ret | min_ret |
|---|---|---|---|
| 301549801.0 | 0.011850 | 0.364470 | -0.341721 |
| 301553001.0 | 0.008724 | 0.429556 | -0.404597 |
| 301565201.0 | 0.012885 | 0.683868 | -0.503526 |
| 301569701.0 | 0.012152 | 0.673881 | -0.359551 |
| 301574903.0 | 0.009998 | 0.788158 | -0.407551 |
| ... | ... | ... | ... |
| 322232901.0 | 0.009226 | 0.506172 | -0.365105 |
| 322234101.0 | 0.004638 | 2.145951 | -0.577159 |
| 322671801.0 | 0.018392 | 2.767344 | -0.523290 |
| 322897601.0 | 0.008670 | 1.789013 | -0.548270 |
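
A third way to write the aggregation is pandas' keyword-based named aggregation. The sketch below runs on a made-up `toy` DataFrame (hypothetical values), lowers the threshold from 300 to 2 to fit the toy data, and introduces two illustrative column names, `n_obs` and `n_rows`, to show how `count` and `size` differ when missing returns are not filtered out first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: stock "A" has 3 valid returns,
# stock "B" has 1 valid return and 1 missing return
toy = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B", "B"],
        "ret_exc_lead1m": [0.10, -0.20, 0.40, 0.05, np.nan],
    }
)

# Named aggregation: keyword = output column, value = aggregation function
stats = toy.groupby("id")["ret_exc_lead1m"].agg(
    mean_ret="mean",
    max_ret="max",
    min_ret="min",
    n_obs="count",  # non-missing observations only
    n_rows="size",  # all rows, including missing returns
)

# SQL HAVING COUNT(ret_exc_lead1m) > 2 (threshold lowered for the toy data)
toy_q3 = stats.loc[stats["n_obs"] > 2, ["mean_ret", "max_ret", "min_ret"]]
print(toy_q3)
```

In the answers above, rows with missing returns are dropped before grouping, so `size` and `count` give the same number there. Without that filter, `size` would also count the rows whose return is missing.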


### Q4

In this question, merge the `iskew_capm_21d` column from the `st_reversal_signals` DataFrame and the `at_turnover` column from the `quality_signals` DataFrame into the `stock_returns` DataFrame.

#### Answers:

In [9]:
stock_returns = stock_returns.join(
    [st_reversal_signals.iskew_capm_21d,
    quality_signals.at_turnover]
) #join the two columns by key(index)

In [10]:
stock_returns

| key | id | eom | ret_exc_lead1m | iskew_capm_21d | at_turnover |
|---|---|---|---|---|---|
| 13581256 | 310108801.0 | 1990-07-31 | -0.094007 | 0.557268 | NaN |
| 13581257 | 310108801.0 | 1990-08-31 | -0.145700 | -0.012097 | NaN |
| 13581258 | 310108801.0 | 1990-09-30 | 0.151076 | 0.458586 | NaN |
| 13581259 | 310108801.0 | 1990-10-31 | 0.017782 | 0.231357 | NaN |
| 13581260 | 310108801.0 | 1990-11-30 | 0.020163 | -0.469612 | NaN |
| ... | ... | ... | ... | ... | ... |
| 23638846 | 333190801.0 | 2021-08-31 | -0.096822 | -1.588060 | 0.784841 |
| 23638847 | 333190801.0 | 2021-09-30 | 0.142375 | 0.296512 | 0.784841 |
| 23638848 | 333190801.0 | 2021-10-31 | -0.040934 | 1.274403 | 0.786187 |
| 23638849 | 333190801.0 | 2021-11-30 | -0.032445 | 0.491286 | 0.786187 |
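
To make the index-based join easier to inspect, here is a minimal sketch on made-up tables (`left` and `signal` are hypothetical names and values). It shows that `join` matches rows on the index and defaults to a left join, and that `merge` can express the same operation with the keys spelled out.

```python
import pandas as pd

# Hypothetical toy tables that share an index named `key`
left = pd.DataFrame(
    {"id": ["A", "A", "B"], "ret_exc_lead1m": [0.01, -0.02, 0.03]},
    index=pd.Index([1, 2, 3], name="key"),
)
signal = pd.Series(
    [0.5, 0.7], index=pd.Index([1, 3], name="key"), name="at_turnover"
)

# join() matches rows on the index and defaults to a left join,
# so key 2 (absent from `signal`) is kept and filled with NaN
merged = left.join(signal)
print(merged)

# Same operation with merge(), stating the join keys explicitly
merged_alt = left.merge(signal, how="left", left_index=True, right_index=True)
```

Because the three DataFrames in this tutorial are column slices of the same source table, their `key` indexes are identical and every row finds a match. The NaN values in `at_turnover` therefore come from the source data rather than from unmatched keys.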


### Q5

Create a new Series `mean_ret` containing the average return per period (grouped by `eom`), subject to the following requirements:
* Use `eom` as the index
* Include only observations that satisfy `rmax5_rvol_21d > 0.5` and `gp_at > 0.1`
* Sort the result by `eom`.

#### Answers:

In [13]:
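# The two conditions come from other DataFrames; they share the `key` index
# with stock_returns, so the boolean mask aligns with its rows.
mean_ret = stock_returns[
    (st_reversal_signals.rmax5_rvol_21d > 0.5)
    & (quality_signals.gp_at > 0.1)
].groupby('eom')['ret_exc_lead1m'].mean().sort_index()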

In [14]:
mean_ret

eom
1990-07-31   -0.064610
1990-08-31   -0.114103
1990-10-31   -0.030510
1990-11-30    0.060013
1990-12-31    0.049794
                ...   
2021-08-31   -0.033741
2021-09-30    0.004713
2021-10-31   -0.016152
2021-11-30   -0.003287
2021-12-31   -0.045870
Name: ret_exc_lead1m, Length: 377, dtype: float64
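
The filter in this answer builds its boolean mask from columns that live in other DataFrames. The sketch below reproduces that pattern on made-up data (`returns`, `signal_a`, and `signal_b` are hypothetical names and values) to show that it works as long as all objects share the same index.

```python
import pandas as pd

# Hypothetical toy data: one returns table and two signals on the same `key` index
idx = pd.Index([10, 11, 12, 13], name="key")
returns = pd.DataFrame(
    {
        "eom": pd.to_datetime(
            ["2020-01-31", "2020-01-31", "2020-02-29", "2020-02-29"]
        ),
        "ret_exc_lead1m": [0.02, 0.04, -0.01, 0.03],
    },
    index=idx,
)
signal_a = pd.Series([0.6, 0.7, 0.9, 0.2], index=idx, name="rmax5_rvol_21d")
signal_b = pd.Series([0.3, 0.2, 0.3, 0.4], index=idx, name="gp_at")

# SQL WHERE: both conditions must hold; key 13 fails the first one
mask = (signal_a > 0.5) & (signal_b > 0.1)

# SQL GROUP BY eom with AVG, then ORDER BY eom
toy_mean = returns[mask].groupby("eom")["ret_exc_lead1m"].mean().sort_index()
print(toy_mean)
```

A comparison against a missing value evaluates to `False`, so observations with a missing signal drop out of the mask without an explicit `notna()` check.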