# KNN Regression

KNN in our problem might not be the best choice - variables highly skewed! But we will do our best!

We will apply following approach in case of KNN:
 * we will perform feature engineering - standardization
 * then we will find "good enough" parameters (in CV) to proceed feature selection procedure and after than we will have groups of feature candidates
 * then we will tune hyperparameters for each group of variables (in CV) - we will obtain couple of models
 * then we will compare all models based on so called "proper CV" and we will fit and picke the winner!

We are aware of potential data leakage in case of usage KFold CV without time-series problem handling (in point 2 and 3). 

*To be honest in this problem it is not a big deal - based on our experience and we treat it like a feature!*

During the last step of our procedure we will verify previous analysis based on "proper CV" which handle time-series properties!!! So during hyperparameters tuning we will use different CV (we fight against data leakage).

### Dependencies loading

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from ReliefF import ReliefF
import pickle
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 150)

### Data loading

In [20]:
df = pd.read_csv("../data/train_fe.csv", index_col=0)

In [21]:
fr = pd.read_excel("../data/feature_ranking.xlsx", index_col=0)

### Feature engineering for KNN model

We have to standardize our variables. We will use range standardization (Min Max Scaler) because we have got dummies! We gave every variable a chance to have the same impact on the model.

In [22]:
print(df.columns.tolist())

['Ticker', 'Nazwa2', 'rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'capex2_scaled', 'y_v2x_polyarchy', 'y_e_p_polity', 'y_BR_Democracy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'ta_log', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]', 'txt_cat_(0.488, 24.415]', 'txt_cat_(24.415, 25.05]', 'txt_cat_(25.05, 308.55]', 'txt_cat_(308.55, 327.531]', 'txt_cat_(327.531, inf]', 'pi_cat_(-8975.0, -1.523]', 'pi_cat_(-1.523, 157.119]', 'pi_cat_(157.119, 465.9]', 'pi_cat_(465.9, 7875.5]', 'pi_cat_(7875.5, 8108.5]', 'pi_c

In [23]:
columns = ['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'capex2_scaled', 'y_v2x_polyarchy', 'y_e_p_polity', 'y_BR_Democracy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'ta_log', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]', 'txt_cat_(0.488, 24.415]', 'txt_cat_(24.415, 25.05]', 'txt_cat_(25.05, 308.55]', 'txt_cat_(308.55, 327.531]', 'txt_cat_(327.531, inf]', 'pi_cat_(-8975.0, -1.523]', 'pi_cat_(-1.523, 157.119]', 'pi_cat_(157.119, 465.9]', 'pi_cat_(465.9, 7875.5]', 'pi_cat_(7875.5, 8108.5]', 'pi_cat_(8108.5, inf]', 'str_cat_(0.0875, 0.192]', 'str_cat_(0.192, 0.28]', 'str_cat_(0.28, inf]', 'xrd_exists', 'ni_profit', 'ni_profit_20000', 'ppent_sqrt', 'intant_sqrt', 'dlc_cat_(42.262, 176.129]', 'dlc_cat_(176.129, 200.9]', 'dlc_cat_(200.9, inf]', 'dltt_cat_(39.38, 327.85]', 'dltt_cat_(327.85, 876.617]', 'dltt_cat_(876.617, inf]', 'capex_cat_(7.447, 79.55]', 'capex_cat_(79.55, 5451.0]', 'capex_cat_(5451.0, inf]', 'revenue_cat_(0.174, 1248.817]', 'revenue_cat_(1248.817, 4233.587]', 'revenue_cat_(4233.587, inf]', 'cce_cat_(5.619, 63.321]', 'cce_cat_(63.321, inf]', 'adv_cat_(0.3, 874.5]', 'adv_cat_(874.5, inf]', 'diff_positive', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'rd_sqrt', 'ppe_clip', 'cash_holdings_sqrt', 'adv_expenditure_positive', 'diff_dta', 'cfc_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']

In [24]:
standardization = list()
not_standardization = list()
for i in columns:
    if df[i].nunique() > 2:
        standardization.append(i)
    else:
        not_standardization.append(i)

In [25]:
print(standardization)

['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'capex2_scaled', 'y_v2x_polyarchy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'ta_log', 'ppent_sqrt', 'intant_sqrt', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'rd_sqrt', 'ppe_clip', 'cash_holdings_sqrt', 'diff_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']


In [26]:
print(not_standardization)

['cfc', 'dta', 'y_e_p_polity', 'y_BR_Democracy', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]', 'txt_cat_(0.488, 24.415]', 'txt_cat_(24.415, 25.05]', 'txt_cat_(25.05, 308.55]', 'txt_cat_(308.55, 327.531]', 'txt_cat_(327.531, inf]', 'pi_cat_(-8975.0, -1.523]', 'pi_cat_(-1.523, 157.119]', 'pi_cat_(157.119, 465.9]', 'pi_cat_(465.9, 7875.5]', 'pi_cat_(7875.5, 8108.5]', 'pi_cat_(8108.5, inf]', 'str_cat_(0.0875, 0.192]', 'str_cat_(0.192, 0.28]', 'str_cat_(0.28, inf]', 'xrd_exists', 'ni_profit', 'ni_profit_20000', 'dlc_cat_(42.262, 176.129]', 'dlc_cat_(176.129, 200.9]', 'dlc_cat_(200.9, inf]', 'dltt_cat_(39.38, 327.85]', 'dltt_cat_(327.85, 876.617]', 'dltt_cat_(876.617, inf]', 'capex_cat_(7.447, 79.55]', 'capex_cat_(79.55, 5451.0]', '

In [27]:
standardization.remove("etr")

In [28]:
standardization.append("y_e_p_polity")

In [29]:
print(standardization)

['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'capex2_scaled', 'y_v2x_polyarchy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'ta_log', 'ppent_sqrt', 'intant_sqrt', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'rd_sqrt', 'ppe_clip', 'cash_holdings_sqrt', 'diff_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past', 'y_e_p_polity']


In [17]:
scaler = MinMaxScaler()
scaler.fit(df[standardization])
df[standardization] = scaler.transform(df[standardization])

In [32]:
pickle.dump(scaler, open("final_models/minmaxscaler.sav", 'wb'))

In [30]:
# df[standardization]= scaler.fit_transform(df[standardization])

Double check. Everything seems to be good :-) 

In [33]:
df[columns].describe().T["min"].unique()

array([0., 1.])

In [16]:
df[columns].describe().T["max"].unique()

array([1., 1., 1.])

### Searching for "good enough" model to feature selection

Let's use top 10 variables proposed by Mutual Information!

In [358]:
var = fr.mi_score.sort_values(ascending=False).index.tolist()[0:10]

In [359]:
print(var)

['etr_y_past', 'etr_y_ma', 'txt', 'diff', 'ni', 'pi', 'intant', 'intant_sqrt', 'ta', 'dlc']


In [360]:
df.shape[0]**(0.5)

63.190189111918315

In [361]:
param = {"n_neighbors":[5,7,10,12,15,25,40,50,100], "weights": ["uniform", "distance"], "metric":["minkowski", "manhattan", "chebyshev"], "p":[1,2]}

In [362]:
mse = make_scorer(mean_squared_error, greater_is_better=False)

In [364]:
model = KNeighborsRegressor()
grid_CV = GridSearchCV(model, param, cv=5, scoring=mse, return_train_score=True, n_jobs=-1)
grid_CV.fit(df.loc[:,var].values, df.loc[:,"etr"].values.ravel())

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30,
                                           metric='minkowski',
                                           metric_params=None, n_jobs=None,
                                           n_neighbors=5, p=2,
                                           weights='uniform'),
             iid='deprecated', n_jobs=-1,
             param_grid={'metric': ['minkowski', 'manhattan', 'chebyshev'],
                         'n_neighbors': [5, 7, 10, 12, 15, 25, 40, 50, 100],
                         'p': [1, 2], 'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring=make_scorer(mean_squared_error, greater_is_better=False),
             verbose=0)

In [366]:
grid_CV.best_estimator_

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=50, p=1,
                    weights='uniform')

In [281]:
grid_CV.cv_results_

{'mean_fit_time': array([0.00699539, 0.00599513, 0.0059999 , 0.00619631, 0.00639958,
        0.00599599, 0.00599456, 0.00599608, 0.00619659, 0.0059989 ,
        0.00619164, 0.00600019, 0.00639243, 0.00619764, 0.00599799,
        0.00599689, 0.00619345, 0.00619082, 0.00659761, 0.00639524,
        0.00639319, 0.00638847, 0.00639615, 0.00639029, 0.00600004,
        0.00639601, 0.00599761, 0.00639439, 0.00620642, 0.00581326,
        0.00619998, 0.00619473, 0.00599489, 0.00620046, 0.00599527,
        0.00600066, 0.00600047, 0.00599756, 0.0059948 , 0.00599728,
        0.00599585, 0.00599647, 0.00600309, 0.00600033, 0.00599685,
        0.00599203, 0.00599699, 0.00619464, 0.00599642, 0.0061975 ,
        0.00600266, 0.00619683, 0.00599642, 0.00601416, 0.00639596,
        0.00599284, 0.00619831, 0.00619302, 0.00639606, 0.00619602,
        0.00599527, 0.00599575, 0.00599475, 0.00599551, 0.0059958 ,
        0.00639553, 0.00619688, 0.00599027, 0.00599742, 0.00599236,
        0.00619998, 0.00619287,

Right now (temporary) we will this hyperparameters as the best one:

In [82]:
grid_CV.best_estimator_

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')

### Feature selection for KNN

We would like to base our feature selection on:
 * feature ranking
 * forward elimination
 * relief - special feature selection model for KNN found during Michał Woźniak & Michał Dyczko undergrad. research

#### Feature ranking

In [282]:
fr.sort_values("mi_score", ascending=False, inplace=True)

In [283]:
fr.head()

Unnamed: 0,mi_score,sign_fscore,sign_fscore_0_1,corr,EN_coef,boruta_rank
etr_y_past,1.00786,1.30404e-84,1,0.520405,,1
etr_y_ma,0.825573,2.47377e-125,1,0.526827,,1
txt,0.633965,5.246456e-13,1,0.368732,1.466269e-05,1
diff,0.628929,0.02257712,1,-0.291716,,1
ni,0.616573,1.74723e-09,1,0.263458,-3.442e-07,2


In [284]:
br_features = fr[fr.boruta_rank.isin([1,2,3])].index.tolist()

In [285]:
mi_features = fr.iloc[0:20].index.tolist()

In [286]:
mi_features_25 = fr.iloc[0:25].index.tolist()

In [287]:
mi_features_35 = fr.iloc[0:35].index.tolist()

In [288]:
mi_features_50 = fr.iloc[0:50].index.tolist()

In [289]:
fr["corr_abs"] = np.abs(fr["corr"])
fr.sort_values("corr_abs", ascending=False, inplace=True)
corr_features = fr.iloc[0:20].index.tolist()

We will use our intuition and create two additional benchmark sets of variables:

In [290]:
benchmark = ['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'capex2_scaled', 'y_v2x_polyarchy', 'y_e_p_polity', 'y_BR_Democracy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']

In [291]:
benchmark2 = ['ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'y_v2x_polyarchy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 
              'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']

#### Forward elimination

In [292]:
forward_elimination = ['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'capex2_scaled', 'y_v2x_polyarchy', 'y_e_p_polity', 'y_BR_Democracy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'ta_log', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]', 'txt_cat_(0.488, 24.415]', 'txt_cat_(24.415, 25.05]', 'txt_cat_(25.05, 308.55]', 'txt_cat_(308.55, 327.531]', 'txt_cat_(327.531, inf]', 'pi_cat_(-8975.0, -1.523]', 'pi_cat_(-1.523, 157.119]', 'pi_cat_(157.119, 465.9]', 'pi_cat_(465.9, 7875.5]', 'pi_cat_(7875.5, 8108.5]', 'pi_cat_(8108.5, inf]', 'str_cat_(0.0875, 0.192]', 'str_cat_(0.192, 0.28]', 'str_cat_(0.28, inf]', 'xrd_exists', 'ni_profit', 'ni_profit_20000', 'ppent_sqrt', 'intant_sqrt', 'dlc_cat_(42.262, 176.129]', 'dlc_cat_(176.129, 200.9]', 'dlc_cat_(200.9, inf]', 'dltt_cat_(39.38, 327.85]', 'dltt_cat_(327.85, 876.617]', 'dltt_cat_(876.617, inf]', 'capex_cat_(7.447, 79.55]', 'capex_cat_(79.55, 5451.0]', 'capex_cat_(5451.0, inf]', 'revenue_cat_(0.174, 1248.817]', 'revenue_cat_(1248.817, 4233.587]', 'revenue_cat_(4233.587, inf]', 'cce_cat_(5.619, 63.321]', 'cce_cat_(63.321, inf]', 'adv_cat_(0.3, 874.5]', 'adv_cat_(874.5, inf]', 'diff_positive', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'rd_sqrt', 'ppe_clip', 'cash_holdings_sqrt', 'adv_expenditure_positive', 'diff_dta', 'cfc_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']
forward_elimination.remove("ta_log")
forward_elimination.remove("ppent_sqrt")
forward_elimination.remove("intant_sqrt")
forward_elimination.remove("roa")
forward_elimination.remove("lev")
forward_elimination.remove("intan")
forward_elimination.remove("rd_sqrt")
forward_elimination.remove("ppe")
forward_elimination.remove("cash_holdings_sqrt")

In [293]:
candidates_withoud_discr = [i for i in forward_elimination if "]" not in i]

In [294]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')

In [295]:
sf = SFS(model, 
           k_features=(5,15), 
           forward=True, 
           floating=False, 
           verbose=0,
           scoring=mse,
           cv=5)

sffit = sf.fit(df.loc[:,candidates_withoud_discr].values, df.loc[:,"etr"].values.ravel())

sf_features = df.loc[:,candidates_withoud_discr].columns[list(sffit.k_feature_idx_)]

sf_features

Index(['adv', 'adv_expenditure', 'dta', 'y_e_p_polity', 'y_BR_Democracy',
       'sektor_energy', 'sektor_utilities', 'gielda_4', 'gielda_5',
       'xrd_exists', 'ni_profit_20000', 'diff_positive',
       'adv_expenditure_positive', 'cfc_dta'],
      dtype='object')

In [296]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')

In [297]:
sf = SFS(model, 
           k_features=(5,15), 
           forward=True, 
           floating=False, 
           verbose=0,
           scoring=mse,
           cv=5)

sffit = sf.fit(df.loc[:,forward_elimination].values, df.loc[:,"etr"].values.ravel())

sf_features2 = df.loc[:,forward_elimination].columns[list(sffit.k_feature_idx_)]

sf_features2

Index(['y_e_p_polity', 'y_BR_Democracy', 'sektor_technology', 'gielda_4',
       'gielda_5', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]',
       'pi_cat_(7875.5, 8108.5]', 'str_cat_(0.0875, 0.192]',
       'str_cat_(0.192, 0.28]', 'ni_profit_20000', 'dlc_cat_(176.129, 200.9]',
       'adv_cat_(874.5, inf]', 'diff_positive'],
      dtype='object')

In [298]:
sf_one_more = ['rok', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta']

In [299]:
sf = SFS(model, 
           k_features=(5,10), 
           forward=True, 
           floating=False, 
           verbose=0,
           scoring=mse,
           cv=5)

sffit = sf.fit(df.loc[:,sf_one_more].values, df.loc[:,"etr"].values.ravel())

sf_features3 = df.loc[:,sf_one_more].columns[list(sffit.k_feature_idx_)]

sf_features3

Index(['rok', 'diff', 'adv_expenditure', 'capex2', 'cfc', 'dta'], dtype='object')

#### Relief

In [300]:
relief = ['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'diff', 'rd', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'capex2_scaled', 'y_v2x_polyarchy', 'y_e_p_polity', 'y_BR_Democracy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]', 'txt_cat_(0.488, 24.415]', 'txt_cat_(24.415, 25.05]', 'txt_cat_(25.05, 308.55]', 'txt_cat_(308.55, 327.531]', 'txt_cat_(327.531, inf]', 'pi_cat_(-8975.0, -1.523]', 'pi_cat_(-1.523, 157.119]', 'pi_cat_(157.119, 465.9]', 'pi_cat_(465.9, 7875.5]', 'pi_cat_(7875.5, 8108.5]', 'pi_cat_(8108.5, inf]', 'str_cat_(0.0875, 0.192]', 'str_cat_(0.192, 0.28]', 'str_cat_(0.28, inf]', 'xrd_exists', 'ni_profit', 'ni_profit_20000', 'dlc_cat_(42.262, 176.129]', 'dlc_cat_(176.129, 200.9]', 'dlc_cat_(200.9, inf]', 'dltt_cat_(39.38, 327.85]', 'dltt_cat_(327.85, 876.617]', 'dltt_cat_(876.617, inf]', 'capex_cat_(7.447, 79.55]', 'capex_cat_(79.55, 5451.0]', 'capex_cat_(5451.0, inf]', 'revenue_cat_(0.174, 1248.817]', 'revenue_cat_(1248.817, 4233.587]', 'revenue_cat_(4233.587, inf]', 'cce_cat_(5.619, 63.321]', 'cce_cat_(63.321, inf]', 'adv_cat_(0.3, 874.5]', 'adv_cat_(874.5, inf]', 'diff_positive', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'ppe_clip', 'adv_expenditure_positive', 'diff_dta', 'cfc_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']

In [301]:
fs = ReliefF(n_neighbors=30, n_features_to_keep=5)

In [302]:
fs.fit(df.loc[:,relief].values, df.loc[:,"etr"].values.ravel())

In [303]:
relief1 = df.loc[:,relief].iloc[:,fs.top_features[0:10]].columns.tolist()

In [304]:
relief2 = df.loc[:,relief].iloc[:,fs.top_features[0:15]].columns.tolist()

In [305]:
relief3 = df.loc[:,relief].iloc[:,fs.top_features[0:20]].columns.tolist()

### Hyperparametes Tunning for each group of variables

In [306]:
param = {"n_neighbors":[5,7,10,12,15,25,40,50], "weights": ["uniform", "distance"], "metric":["minkowski", "manhattan", "chebyshev"], "p":[1,2]}
mse = make_scorer(mean_squared_error, greater_is_better=True)

In [307]:
def cv_proc(var):
    model = KNeighborsRegressor()
    grid_CV = GridSearchCV(model, param, cv=5, scoring=mse, return_train_score=True)
    grid_CV.fit(df.loc[:,var].values, df.loc[:,"etr"].values.ravel())
    print(grid_CV.best_estimator_)
    print(grid_CV.best_score_)

In [308]:
cv_proc(benchmark)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='uniform')
0.019392767515359007


In [309]:
cv_proc(benchmark2)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=50, p=1,
                    weights='uniform')
0.017643274949978863


In [310]:
cv_proc(mi_features_25)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.023687438936828415


In [311]:
cv_proc(mi_features_35)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.024132263841463268


In [312]:
cv_proc(mi_features_50)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.02408388165995253


In [313]:
cv_proc(br_features)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='distance')
0.021729978128990298


In [314]:
cv_proc(mi_features)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.023515242037715163


In [315]:
cv_proc(corr_features)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.023940236434834748


In [316]:
cv_proc(sf_features)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.02453915241044045


In [317]:
cv_proc(sf_features2)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='distance')
0.027360700781965192


In [318]:
cv_proc(sf_features3)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='distance')
0.025757480072568784


In [319]:
cv_proc(relief1)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.02808229700930199


In [320]:
cv_proc(relief2)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.02782631490123545


In [321]:
cv_proc(relief3)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
0.028517759404415605


### Final models comparison - winner obtaining

We would like to fight against data leakage in our CV, so we will treat it like a panel problem with a rolling window. We now based on our experience that this kind of approach is crucial to fight against overfitting.

Sliding window:
 * T: 2005 - 2008; V: 2009
 * T: 2005 - 2009; V: 2010
 * T: 2005 - 2010; V: 2011
 * ...

In [322]:
df = df.sort_values(by="rok").reset_index(drop=True)

In [323]:
def proper_CV(x, y, model, display_res = False):
    train_score = list()
    valid_score = list()
    train_indexes = [0, 1452]
    valid_indexes = [1452, 1815]
    for i in range(0,6):
        train_x =  x[x.index.isin(range(train_indexes[0],train_indexes[1]))]
        train_y =  y[y.index.isin(range(train_indexes[0],train_indexes[1]))]
        valid_x =  x[x.index.isin(range(valid_indexes[0],valid_indexes[1]))]
        valid_y =  y[y.index.isin(range(valid_indexes[0],valid_indexes[1]))]

        model.fit(train_x.values, train_y.values.ravel())
        
        pred_y_train = model.predict(train_x.values)
        rmse = np.sqrt(mean_squared_error(train_y, pred_y_train))
        train_score.append(rmse)

        pred_y_val = model.predict(valid_x.values)
        rmse = np.sqrt(mean_squared_error(valid_y, pred_y_val))
        valid_score.append(rmse)
        
        train_indexes = [0, valid_indexes[1]]
        valid_indexes = [train_indexes[1], valid_indexes[1]+363]
    
    if display_res == True:
        view = pd.DataFrame([train_score, valid_score]).T.rename(columns = {0:"cv_train", 1:"cv_val"})
        display(view)
        return train_score, valid_score, view
    else:
        return train_score, valid_score

In [324]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='uniform')
var = benchmark
cv_output0 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.108977,0.126407
1,0.107472,0.128212
2,0.106584,0.138039
3,0.106594,0.13691
4,0.106169,0.118245
5,0.105365,0.126566


In [325]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='uniform')


var = benchmark2
cv_output1 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.092589,0.118902
1,0.0924,0.125504
2,0.093093,0.124851
3,0.095166,0.124028
4,0.095611,0.108053
5,0.09455,0.113391


In [326]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='distance')
var = br_features
cv_output2 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.0,0.152696
1,0.0,0.135391
2,0.0,0.168754
3,0.0,0.155859
4,0.0,0.136967
5,0.0,0.146087


In [327]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = mi_features
cv_output3 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.024271,0.162502
1,0.025683,0.143987
2,0.028871,0.169731
3,0.02813,0.156194
4,0.026329,0.130178
5,0.025827,0.141895


In [328]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = corr_features
cv_output4 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.0,0.159257
1,0.0,0.137849
2,0.0,0.158056
3,0.0,0.152921
4,0.0,0.138011
5,0.0,0.145993


In [329]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = sf_features
cv_output5 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.156189,0.151892
1,0.149483,0.145875
2,0.146388,0.156155
3,0.151007,0.156266
4,0.152196,0.151464
5,0.148406,0.148948


In [330]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = sf_features2
cv_output6 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.145662,0.136099
1,0.144824,0.135153
2,0.146024,0.171389
3,0.150517,0.15327
4,0.15226,0.146433
5,0.150838,0.152338


In [331]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = sf_features3
cv_output7 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.0,0.159178
1,0.0,0.140352
2,0.0,0.16248
3,0.0,0.154449
4,0.0,0.141927
5,0.0,0.143532


In [332]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = relief1
cv_output8 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.028174,0.149546
1,0.030226,0.155123
2,0.033734,0.162585
3,0.032146,0.160728
4,0.030527,0.142725
5,0.029191,0.15752


In [333]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = relief2
cv_output9 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.028174,0.158226
1,0.030226,0.149294
2,0.033435,0.159193
3,0.032709,0.156948
4,0.030148,0.143187
5,0.029186,0.148951


In [334]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = relief3 
cv_output10 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.028174,0.161661
1,0.030226,0.161281
2,0.033404,0.159285
3,0.032693,0.163421
4,0.030143,0.145755
5,0.029224,0.153538


In [335]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='chebyshev',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = mi_features_25 
cv_output11 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.024263,0.162663
1,0.025652,0.141275
2,0.028848,0.164978
3,0.027889,0.15455
4,0.026176,0.130554
5,0.025451,0.141324


In [336]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = mi_features_35 
cv_output12 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.023087,0.152087
1,0.024506,0.142924
2,0.027961,0.156399
3,0.027103,0.149464
4,0.025429,0.133746
5,0.024774,0.137505


In [337]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                    weights='distance')
var = mi_features_50 
cv_output13 = proper_CV(df.loc[:,var], df.loc[:,"etr"], model, display_res=True)

Unnamed: 0,cv_train,cv_val
0,0.0,0.144449
1,0.0,0.141801
2,0.0,0.156533
3,0.0,0.151081
4,0.0,0.135173
5,0.0,0.139487


In [338]:
pd.DataFrame([cv_output0[2].mean().tolist(),
cv_output1[2].mean().tolist(), 
cv_output2[2].mean().tolist(),
cv_output3[2].mean().tolist(),
cv_output4[2].mean().tolist(),
cv_output5[2].mean().tolist(),
cv_output6[2].mean().tolist(),
cv_output7[2].mean().tolist(),
cv_output8[2].mean().tolist(),
cv_output9[2].mean().tolist(),
cv_output10[2].mean().tolist(),
cv_output11[2].mean().tolist(),
cv_output12[2].mean().tolist(),
cv_output13[2].mean().tolist(),], columns=["train_mean", "test_mean"])

Unnamed: 0,train_mean,test_mean
0,0.10686,0.129063
1,0.093901,0.119122
2,0.0,0.149292
3,0.026519,0.150748
4,0.0,0.148681
5,0.150612,0.151767
6,0.148354,0.149114
7,0.0,0.15032
8,0.030666,0.154705
9,0.030646,0.152633


In [339]:
pd.DataFrame([cv_output0[2].std().tolist(),
cv_output1[2].std().tolist(), 
cv_output2[2].std().tolist(),
cv_output3[2].std().tolist(),
cv_output4[2].std().tolist(),
cv_output5[2].std().tolist(),
cv_output6[2].std().tolist(),
cv_output7[2].std().tolist(),
cv_output8[2].std().tolist(),
cv_output9[2].std().tolist(),
cv_output10[2].std().tolist(),
cv_output11[2].std().tolist(),
cv_output12[2].std().tolist(),
cv_output13[2].std().tolist(),], columns=["train_std", "test_std"])

Unnamed: 0,train_std,test_std
0,0.001242,0.007392
1,0.001384,0.007114
2,0.0,0.012561
3,0.001697,0.014665
4,0.0,0.009551
5,0.003399,0.004056
6,0.003201,0.013381
7,0.0,0.009584
8,0.00201,0.007438
9,0.002035,0.006432


Second models seems to be the best one! We see that our intuition was quite good - binary variables should be removed!

In [340]:
print(benchmark2)

['ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'y_v2x_polyarchy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']


### Fit final model and save it

In [341]:
model = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=50, p=1,
                    weights='uniform')
model.fit(df.loc[:,benchmark2].values, df.loc[:,"etr"].values.ravel())

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=50, p=1,
                    weights='uniform')

In [342]:
filename = 'final_models/knn.sav'

In [343]:
pickle.dump(model, open(filename, 'wb'))