### Using the 4 hidden states extracted by feeding macro variables into an LSTM as explanatory variables for machine learning models

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
from sklearn.preprocessing import MinMaxScaler

# Helper for min-max scaling
def my_scaler(scaling_features, train_data, test_data):
    '''
    scaling_features: list of columns to min-max scale.
    The scaler is fit on the train data only and is then applied (transform)
    to both the train data and the test data. Both frames are modified in place.
    '''
    scaler = MinMaxScaler(feature_range=(0, 1))

    scaler.fit(train_data[scaling_features])

    train_data[scaling_features] = scaler.transform(train_data[scaling_features])
    test_data[scaling_features] = scaler.transform(test_data[scaling_features])
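
The point of fitting on the train data only is that the test data never influences the scaling statistics. The sketch below reproduces the same (x - min) / (max - min) arithmetic by hand on toy frames (the column name `x` and the values are illustrative, not from the dataset) and shows that test values outside the train range land outside [0, 1].

```python
import pandas as pd

# Toy frames; the column name "x" and the values are illustrative only.
train = pd.DataFrame({"x": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"x": [2.5, 12.0]})

# The min and max come from the training data only.
x_min, x_max = train["x"].min(), train["x"].max()

# Apply the same train-based statistics to both frames.
train_scaled = (train["x"] - x_min) / (x_max - x_min)
test_scaled = (test["x"] - x_min) / (x_max - x_min)

print(train_scaled.tolist())
print(test_scaled.tolist())  # the second value exceeds 1 because 12.0 > train max
```

Scaled test values above 1 or below 0 are expected under this scheme and are not a sign of a bug.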

# Firm-Specific Variables

In [3]:
# Replace Inf / -Inf values with 0
def ind_clean_data(raw_data, scaling_features):
    # Cast the selected columns to float32 (optional; skip if not needed)
    raw_data[scaling_features] = raw_data[scaling_features].astype('float32')

    for col in scaling_features:
        # If the column contains any Inf or -Inf, replace those values with 0
        is_inf = (raw_data[col] == float("inf")) | (raw_data[col] == float("-inf"))
        if is_inf.any():
            raw_data.loc[is_inf, col] = 0
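
As a possible alternative to the column loop, the same replacement can be written in one vectorized call with `DataFrame.replace`. This is a sketch on a toy frame (column names `a` and `b` are illustrative); NaN values are left untouched, as in the loop version.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative only.
toy = pd.DataFrame({"a": [1.0, np.inf, -np.inf], "b": [0.5, 2.0, 3.0]})
cols = ["a", "b"]

# Replace +Inf and -Inf with 0 in the selected columns in one call.
toy[cols] = toy[cols].replace([np.inf, -np.inf], 0.0)

print(toy)
```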

In [4]:
# # Keep only firms that rank in the top 1000 by market cap in each year
# def AnnualCap(raw_data, start_year, end_year, num_top=1000):
#     raw_data['date'] = pd.to_datetime(raw_data['date'])
#     df_top = pd.DataFrame()
    
#     for year in range(start_year, end_year+1): # each year
#         # Keep only the rows for this year
#         temp = raw_data[raw_data['date'].dt.year == year]
        
#         # Average market cap (me) per firm (permno), sorted in descending order of market cap
#         # From that ordering, take the top num_top firms as a list
#         top_list = temp.groupby('permno').mean()[['me']].reset_index().\
#                     sort_values(by='me', ascending=False)[:num_top]['permno'].tolist()
#         # Keep only the top num_top firms in temp, then concat with the data from the other years
#         df_top = pd.concat([df_top, temp[temp['permno'].isin(top_list)]], axis=0)

#     return df_top

In [51]:
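def AnnualCap(raw_data, start_year, end_year, num_top=100):
    # Average market cap (me) per firm and year, lagged by one row so that
    # each year is ranked on the previous year's value (me_last_year).
    # Note: shift(1) runs over the whole frame, so the first year of each
    # permno inherits the preceding permno's value rather than NaN.
    df_me = raw_data.groupby(['permno','year'])[['me']].mean().shift(1).bfill().reset_index()
    df_me.rename({'me':'me_last_year'}, axis=1, inplace=True)

    df_top = pd.DataFrame()
    for year in range(start_year, end_year+1):
        raw_temp = raw_data[raw_data['year'] == year]
        temp = df_me[df_me['year'] == year]
        top_list = temp.sort_values(by='me_last_year', ascending=False)[:num_top]['permno'].tolist()
        df_top = pd.concat([df_top, raw_temp[raw_temp['permno'].isin(top_list)]], axis=0)

    return df_top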
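
`AnnualCap` lags the yearly average market cap with `shift(1)` on the grouped frame. The sketch below, on a toy frame with made-up values, contrasts that frame-wide shift with a per-firm shift (`groupby(level='permno').shift(1)`). The per-firm variant is shown only as one possible alternative, not as what the notebook ran.

```python
import pandas as pd

# Toy data: two firms, two years each (values are illustrative only).
toy = pd.DataFrame({
    "permno": [1, 1, 2, 2],
    "year": [2000, 2001, 2000, 2001],
    "me": [10.0, 20.0, 300.0, 400.0],
})

# Yearly average market cap, indexed by (permno, year).
me_by_year = toy.groupby(["permno", "year"])["me"].mean()

# Frame-wide shift: firm 2's first year receives firm 1's last value.
frame_wide = me_by_year.shift(1)

# Per-firm shift: each firm's first year has no previous value (NaN).
per_firm = me_by_year.groupby(level="permno").shift(1)

print(pd.DataFrame({"frame_wide": frame_wide, "per_firm": per_firm}))
```

With the per-firm variant, firms without a previous year stay NaN; whether to drop them or back-fill them is a separate choice.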

In [52]:
# The raw data already has the firm-specific variables preprocessed
# Its ret column is the target variable and is set to the return one month ahead (because we predict one month ahead)

raw1990 = pd.read_csv('../01_Preprocess/Preprocess-1/data/chars60_raw_1990s.csv')
raw2000 = pd.read_csv('../01_Preprocess/Preprocess-1/data/chars60_raw_2000s.csv')
raw2010 = pd.read_csv('../01_Preprocess/Preprocess-1/data/chars60_raw_2010s.csv')
raw2020 = pd.read_csv('../01_Preprocess/Preprocess-1/data/chars60_raw_2020s.csv')

In [53]:
# Concatenate the raw data for all years
raw = pd.concat([raw1990, raw2000, raw2010, raw2020], axis=0)
raw['date'] = pd.to_datetime(raw['date'])

In [54]:
# Build a dataset from the raw data that keeps only firms in the top 100 by market cap each year
raw_top = AnnualCap(raw, 1996, 2022, 100)

In [55]:
# train / test split
raw_train = raw_top[(raw_top['date'] >= '1996-01-01') & (raw_top['date'] <= '2016-12-31')]
raw_test = raw_top[(raw_top['date'] >= '2017-01-01') & (raw_top['date'] <= '2022-12-31')]

## Load hidden states data
---
- 127 macroeconomic variables were passed through an LSTM, and the 4 extracted hidden states were saved as a DataFrame (hidden_states_df.csv)
- This DataFrame has to be merged with the raw data, which contains only the firm-specific variables


In [56]:
hs_df = pd.read_csv('../01_Preprocess/Preprocess-2/data/hidden_states_df.csv')

In [57]:
hs_df['date'] = pd.to_datetime(hs_df['date'])

In [58]:
# train / test split
hs_train = hs_df[(hs_df['date'] >= '1996-01-01') & (hs_df['date'] <= '2016-12-31')]
hs_test = hs_df[(hs_df['date'] >= '2017-01-01') & (hs_df['date'] <= '2022-12-31')]

In [59]:
# hs1, hs2, hs3, hs4: the 4 hidden states
display(hs_train.head(3))
display(hs_test.tail(3))

Unnamed: 0,date,hs1,hs2,hs3,hs4
0,1996-01-31,0.022903,-0.143527,-0.07593,0.115051
1,1996-02-29,-0.084758,-0.19141,-0.100266,0.177098
2,1996-03-31,-0.074154,-0.174632,-0.104189,0.196434


Unnamed: 0,date,hs1,hs2,hs3,hs4
313,2022-02-28,-0.064052,-0.107213,-0.12108,0.141095
314,2022-03-31,-0.021913,-0.104888,-0.17874,0.176583
315,2022-04-30,0.013325,-0.10922,-0.188333,0.201438


In [60]:
# Merge the raw data and the hidden states data on date
# Both are monthly, so check the dates (confirm that both use the last day of each month)
# If the dates differ, create separate year and month columns and join with on=['year', 'month']

train_ind_mac = pd.merge(raw_train, hs_train, on='date', how='left')
test_ind_mac = pd.merge(raw_test, hs_test, on='date', how='left')
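
The fallback mentioned in the comment (joining on year and month when the day differs) could look like the sketch below. The toy frames and their dates are illustrative; `validate="many_to_one"` is an optional safeguard that raises if the macro side has more than one row per month.

```python
import pandas as pd

# Toy frames whose month-end conventions differ (illustrative values only).
firm = pd.DataFrame({
    "permno": [1, 2],
    "date": pd.to_datetime(["1996-06-28", "1996-06-28"]),  # last trading day
})
macro = pd.DataFrame({
    "date": pd.to_datetime(["1996-06-30"]),  # calendar month end
    "hs1": [0.1],
})

# Joining on the exact date finds no match, so hs1 is NaN for every row.
by_date = pd.merge(firm, macro, on="date", how="left")

# Build year/month keys on both sides and join on those instead.
for frame in (firm, macro):
    frame["year"] = frame["date"].dt.year
    frame["month"] = frame["date"].dt.month

merged = pd.merge(
    firm,
    macro.drop(columns="date"),
    on=["year", "month"],
    how="left",
    validate="many_to_one",
)

print(by_date["hs1"].isna().sum(), merged["hs1"].isna().sum())
```

Counting NaNs in the hidden-state columns after a left merge is a quick way to notice a date mismatch.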

In [61]:
train_ind_mac.head(3)

Unnamed: 0,ticker,gvkey,permno,sic,ret,exchcd,shrcd,adm,bm_ia,herf,...,rsup,sgr,sp,date,ffi49,year,hs1,hs2,hs3,hs4
0,ABS,1240,50032,5411,0.037618,1.0,11.0,6.3e-05,0.163795,0.102612,...,-0.000338,-0.020226,0.001253,1996-06-30,43,1996,-0.029664,-0.414425,-0.458792,0.224881
1,ABS,1240,50032,5411,-0.005438,1.0,11.0,6.4e-05,0.157838,0.102612,...,0.000557,0.302662,0.001207,1996-07-31,43,1996,-0.030297,-0.26097,-0.374075,0.187817
2,ABS,1240,50032,5411,0.033537,1.0,11.0,8.4e-05,0.159282,0.102612,...,0.000105,0.362695,0.001218,1996-08-31,43,1996,-0.163543,-0.468305,-0.555463,0.109156


# Model
---
- Use a few boosting models from among the machine learning models (CatBoost, XGBoost, LightGBM, etc.)
- With CatBoost, the categorical variables must be declared explicitly (cat_features = ['exchcd','shrcd','ffi49'])
- The categorical variables need more feature engineering (for now they are passed in without any processing)

- The paper validates recursively while expanding the window size of the train data (this differs from cross-validation)
- This is to respect the time ordering
- For example:
  - Train on 1996-2011 -> validate on 2012 => compute R-Squared: R-Squared_1
  - Train on 1996-2012 -> validate on 2013 => compute R-Squared: R-Squared_2
  - Train on 1996-2013 -> validate on 2014 => compute R-Squared: R-Squared_3
  - => The validation R-Squared is then (R-Squared_1 + R-Squared_2 + R-Squared_3) / 3
- Under this scheme, 1996-2011 is called the train set and 2012-2014 the validation set.
- The functions that implement this are packaged in the refit_model module
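
The expanding-window scheme described above can be sketched as a small generator of year splits. The function name `expanding_window_splits` is hypothetical and is not part of `refit_model`; it only illustrates how the train window grows by one year per step.

```python
def expanding_window_splits(train_start, train_end, valid_size):
    """Yield ((first_train_year, last_train_year), valid_year) pairs.

    The training window always starts at train_start and grows by one
    year per step; validation uses the year right after the window.
    """
    for i in range(valid_size):
        last_train_year = train_end + i
        yield (train_start, last_train_year), last_train_year + 1


splits = list(expanding_window_splits(1996, 2011, 3))
for (first, last), valid_year in splits:
    print(f"train {first}-{last} -> validate {valid_year}")
```

The validation score is then the plain average of the per-split R-Squared values.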

In [62]:
from refit_model import *

In [63]:
train = train_ind_mac.copy()
test = test_ind_mac.copy()

In [64]:
df = pd.concat([train, test], axis=0)

In [65]:
features = train.columns.drop(['date','gvkey','permno','ret','ticker','sic']).tolist()
print(features)

['exchcd', 'shrcd', 'adm', 'bm_ia', 'herf', 'hire', 'me_ia', 'ill', 'maxret', 'mom12m', 'mom1m', 'mom36m', 'mom60m', 'mom6m', 'std_dolvol', 'std_turn', 'me', 'dy', 'turn', 'dolvol', 'cinvest', 'nincr', 'pscore', 'acc', 'bm', 'agr', 'alm', 'ato', 'cash', 'cashdebt', 'cfp', 'chcsho', 'chpm', 'chtx', 'depr', 'ep', 'gma', 'grltnoa', 'lev', 'lgr', 'ni', 'noa', 'op', 'pctacc', 'pm', 'rd_sale', 'rdm', 'rna', 'roa', 'roe', 'rsup', 'sgr', 'sp', 'ffi49', 'year', 'hs1', 'hs2', 'hs3', 'hs4']


In [66]:
# r2 = refit_catboost(df, features, train_start=1996, train_end=2011, valid_size=5)
# print('Average r2 = ', r2)

In [67]:
# test_r2 = refit_catboost(df, features, train_start=1996, train_end=2016, valid_size=5)
# print('Average r2 = ', test_r2)

# Save Model
---
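- Model trained through 2016: model2_2016 ==> tested on 2017
- Model trained through 2017: model2_2017 ==> tested on 2018
- Model trained through 2018: model2_2018 ==> tested on 2019
- Model trained through 2019: model2_2019 ==> tested on 2020
- Model trained through 2020: model2_2020 ==> tested on 2021
- Model trained through 2021: model2_2021 ==> tested on 2022 (saved because the cells below are run with valid_size=6)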


In [68]:
import pickle
from catboost import CatBoostRegressor
import os

In [77]:
def save_catboost(df, features, train_start=1996, train_end=2016, valid_size=5):
    cat_features = ['exchcd','shrcd','ffi49']
    r2_list = []
    for i in range(valid_size):
        start_year = train_start
        end_year = train_end + i
        train = df[(df['year']<=end_year)&(df['year']>=start_year)]
        valid = df[df['year']==end_year+1]
        print('train: ', start_year, end_year)
        print('valid: ', end_year+1)

        # .copy() so the dtype cast below does not write into a slice of df
        X_train, y_train = train[features].copy(), np.array(train['ret'])
        X_valid, y_valid = valid[features].copy(), np.array(valid['ret'])
        print('shape: ', X_train.shape)
        print('shape: ', X_valid.shape)

        # CatBoost does not accept float-valued categorical features, so cast them to str
        for data in [X_train, X_valid]:
            data[cat_features] = data[cat_features].astype(str)

        # Set up
        cat_params = {
                      'random_state':42,
                      'learning_rate': 1e-4,
                      'n_estimators': 1000
                     }

        cat_model = CatBoostRegressor(**cat_params)
        # Note: no eval_set is passed, so early_stopping_rounds is not expected to trigger
        fit_model = cat_model.fit(X_train, y_train,
                                  early_stopping_rounds=35,
                                  cat_features=cat_features,
                                  verbose=100)

        # Save the fitted model
        save_path = './saved_model/'
        os.makedirs(save_path, exist_ok=True)

        filename = f'model2_{end_year}.sav'
        print(filename)
        with open(save_path + filename, 'wb') as f:
            pickle.dump(fit_model, f)

        cat_pred = fit_model.predict(X_valid)
        r2 = r2_oos(y_valid, cat_pred)
        print(r2)
        r2_list.append(r2)
    print(r2_list)
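
`r2_oos` comes from the `refit_model` module, whose source is not shown here. As an assumption, the sketch below uses the out-of-sample R-Squared convention common in the return-prediction literature, where the denominator is the sum of squared returns without demeaning (so the benchmark is a forecast of zero). The name `r2_oos_sketch` is hypothetical and the actual `r2_oos` may differ.

```python
import numpy as np


def r2_oos_sketch(y_true, y_pred):
    """Out-of-sample R-Squared against a zero-forecast benchmark.

    Assumed definition: 1 - sum((y - y_hat)^2) / sum(y^2).
    This is a guess at what refit_model.r2_oos computes.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)


# A perfect forecast scores 1; always predicting zero scores 0.
perfect = r2_oos_sketch([0.01, -0.02], [0.01, -0.02])
zero_forecast = r2_oos_sketch([0.01, -0.02], [0.0, 0.0])
print(perfect, zero_forecast)
```

Under this definition a model is only credited for beating a forecast of zero, which is a stricter benchmark than the historical mean for noisy monthly stock returns.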

In [78]:
save_catboost(df, features, train_start=1996, train_end=2016, valid_size=6)

train:  1996 2016
valid:  2017
shape:  (23309, 59)
shape:  (1182, 59)
0:	learn: 0.1064826	total: 20.8ms	remaining: 20.8s


100:	learn: 0.1064393	total: 791ms	remaining: 7.04s
200:	learn: 0.1063972	total: 1.57s	remaining: 6.26s
300:	learn: 0.1063535	total: 2.31s	remaining: 5.37s
400:	learn: 0.1063118	total: 3.05s	remaining: 4.55s
500:	learn: 0.1062718	total: 3.79s	remaining: 3.77s
600:	learn: 0.1062318	total: 4.52s	remaining: 3s
700:	learn: 0.1061906	total: 5.26s	remaining: 2.24s
800:	learn: 0.1061498	total: 6s	remaining: 1.49s
900:	learn: 0.1061084	total: 6.75s	remaining: 742ms
999:	learn: 0.1060675	total: 7.51s	remaining: 0us
model2_2016.sav
0.0708
train:  1996 2017
valid:  2018
shape:  (24491, 59)
shape:  (1177, 59)
0:	learn: 0.1044856	total: 7.94ms	remaining: 7.94s
100:	learn: 0.1044449	total: 809ms	remaining: 7.2s
200:	learn: 0.1044050	total: 1.58s	remaining: 6.28s
300:	learn: 0.1043662	total: 2.35s	remaining: 5.45s
400:	learn: 0.1043284	total: 3.1s	remaining: 4.62s
500:	learn: 0.1042901	total: 3.87s	remaining: 3.85s
600:	learn: 0.1042509	total: 4.65s	remaining: 3.08s
700:	learn: 0.1042130	total: 5.4s	

In [71]:
def load_catboost(df, features, train_start=1996, train_end=2016, valid_size=5):
    df_top10 = pd.DataFrame()
    cat_features = ['exchcd','shrcd','ffi49']

    for i in range(valid_size):
        end_year = train_end + i
        valid = df[df['year']==end_year+1]
        # .copy() so the dtype cast below does not write into a slice of df
        X_valid = valid[features].copy()
        df_temp = valid.copy()

        X_valid[cat_features] = X_valid[cat_features].astype(str)

        # Load the saved model
        path = './saved_model/'
        filename = f'model2_{end_year}.sav'
        print(filename)
        with open(path + filename, 'rb') as f:
            model = pickle.load(f)

        # Prediction
        df_temp['ret_pred'] = model.predict(X_valid)
        df_temp = df_temp[['date','year','permno','ticker','me','ret','ret_pred']]
        # Note: valid covers a single year, so this shift(1) moves values between
        # neighbouring permnos rather than bringing in the previous year's me.
        df_me = df_temp.groupby(['permno','year'])[['me']].mean().shift(1).bfill().reset_index()
        df_me.rename({'me':'me_last_year'}, axis=1, inplace=True)

        df_temp = pd.merge(df_temp, df_me, on=['permno','year'], how='left')

        # Sort by date ascending, then by predicted return descending
        df_temp = df_temp.sort_values(by=['date','ret_pred'], ascending=[True, False])
        df_temp = df_temp.groupby(['date']).head(10) # top 10 predicted returns for each month

        df_top10 = pd.concat([df_top10, df_temp.drop(['me'], axis=1)], axis=0)
    return df_top10
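
The per-month selection in `load_catboost` relies on sorting by date and predicted return and then taking the first rows of each date group. The sketch below shows the same pattern on a toy frame with made-up values, using a top 2 instead of a top 10.

```python
import pandas as pd

# Toy predictions for three firms over two months (illustrative values only).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-31"] * 3 + ["2017-02-28"] * 3),
    "permno": [1, 2, 3, 1, 2, 3],
    "ret_pred": [0.01, 0.03, 0.02, 0.05, 0.04, 0.06],
})

# Sort by date ascending and predicted return descending,
# then keep the first 2 rows within each date.
top2 = (
    toy.sort_values(by=["date", "ret_pred"], ascending=[True, False])
       .groupby("date")
       .head(2)
)

print(top2)
```

`groupby(...).head(n)` keeps the sorted row order, so the result is already ranked within each month.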


In [79]:
df_top10 = load_catboost(df, features, train_start=1996, train_end=2016, valid_size=6)

model2_2016.sav
model2_2017.sav
model2_2018.sav
model2_2019.sav
model2_2020.sav
model2_2021.sav


In [80]:
df_top10

Unnamed: 0,date,year,permno,ticker,ret,ret_pred,me_last_year
188,2017-01-31,2017,83443,BRK,0.007117,0.007911,7.573290e+07
484,2017-01-31,2017,12369,GM,0.050804,0.007909,9.011816e+07
1026,2017-01-31,2017,77418,TWX,0.003315,0.007902,9.555091e+07
888,2017-01-31,2017,18163,PG,0.049828,0.007901,9.393792e+07
852,2017-01-31,2017,86783,PCLN,0.074397,0.007898,5.496533e+07
...,...,...,...,...,...,...,...
113,2022-03-31,2022,66181,HD,-0.046227,0.009138,8.451418e+07
173,2022-03-31,2022,22752,MRK,0.080439,0.009127,9.410076e+07
212,2022-03-31,2022,18163,PG,-0.019822,0.009123,1.390232e+08
194,2022-03-31,2022,57665,NKE,-0.012340,0.009122,3.880286e+08


In [81]:
df_top10.to_csv('../03_Strategy/data/df_top10.csv', index=False)