## Filter Methods - Univariate MSE

This procedure works as follows:

- First, it builds one decision tree per feature to predict the target.
- Second, it makes predictions with each tree, using only the feature that tree was trained on.
- Third, it ranks the features by the resulting performance metric, here the mean squared error (MSE).
- Finally, it selects the highest-ranked features, that is, those with the lowest MSE.
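
The steps above can be sketched on a small synthetic dataset. This is only an illustration under stated assumptions: the `toy` data frame, its column names and the `max_depth=3` setting are made up for the example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): one feature drives the
# target, the other two are pure noise.
rng = np.random.RandomState(0)
n = 500
toy = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
})
toy['target'] = 3 * toy['informative'] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns='target'), toy['target'], test_size=0.3, random_state=0)

# One tree per feature: fit on that feature alone, then score its predictions.
toy_mse = {}
for col in X_tr.columns:
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_tr[[col]], y_tr)
    toy_mse[col] = mean_squared_error(y_te, tree.predict(X_te[[col]]))

# Rank the features: the lowest MSE comes first.
toy_ranking = pd.Series(toy_mse).sort_values()
print(toy_ranking)
```

In this toy setting the feature that actually generates the target ends up at the top of the ranking, while the noise features have a much larger error.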

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

### load dataset

In [None]:
# load the dataset and the features selected by the previous method
features = np.load('../features/featuresFromUnivariateClassif.npy').tolist()
data = pd.read_pickle('../../data/features/features.pkl').loc[:,features].sample(frac=0.35).fillna(-9999)

In [None]:
data.head()

In [None]:
# In practice, feature selection should be done after data pre-processing,
# so ideally all the categorical variables are already encoded as numbers
# and you can assess how predictive they are of the target.

# Here, for simplicity, I will use only the numerical variables.
# Select the numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

### split train - test

In [None]:
# In all feature selection procedures, it is good practice to select the features
# by examining only the training set. This helps to avoid overfitting.

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

### calculate metric for each variable

In [None]:
# loop over the features: fit a tree on the train set using that feature alone,
# make predictions on the test set and record the MSE
mse_values = []
for feature in X_train.columns:
    clf = DecisionTreeRegressor()
    clf.fit(X_train[feature].fillna(0).to_frame(), y_train)
    y_scored = clf.predict(X_test[feature].fillna(0).to_frame())
    mse_values.append(mean_squared_error(y_test, y_scored))
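
If you want the ranking to depend on the training data only, one possible variant is to score each single-feature tree by cross-validation instead of a held-out test set. The sketch below is an assumption rather than the method used in this notebook; `X_demo`, `y_demo` and the tree settings are hypothetical and exist only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: 'signal' drives the target, 'noise' does not.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y_demo = 2 * X_demo['signal'] + rng.normal(scale=0.1, size=300)

cv_mse = {}
for col in X_demo.columns:
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    # scikit-learn reports the negated MSE so that higher is always better,
    # hence the sign flip when averaging over the 5 folds.
    scores = cross_val_score(tree, X_demo[[col]], y_demo,
                             cv=5, scoring='neg_mean_squared_error')
    cv_mse[col] = -scores.mean()

cv_mse = pd.Series(cv_mse).sort_values()
print(cv_mse)
```

The design trade-off is runtime: each feature now requires five fits instead of one, which may matter when there are many features.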

In [None]:
# let's add the variable names and order it for clearer visualisation
mse_values = pd.Series(mse_values)
mse_values.index = X_train.columns

# Remember that for regression, the smaller the MSE, the better the model performs.
# The bars are sorted in descending order, so the best features are on the right
# and we need to select from right to left.
mse_values.sort_values(ascending=False).plot.bar(figsize=(20, 8))

### save features

In [None]:
# For the MSE, you have to set the cut-off value yourself. The value will depend
# on how many features you would like to end up with.
# Note: CUTOFF is not defined in this notebook; assign it before running this cell.

features_to_keep = mse_values[mse_values < CUTOFF].index.tolist()
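
As a rough illustration of how a cut-off might be chosen, the sketch below shows two common options: keeping a fixed number of features, or deriving the threshold from a quantile of the MSE values. The series `example_mse`, the feature names and the numbers in it are hypothetical and are not results from this dataset.

```python
import pandas as pd

# Hypothetical per-feature MSE values, for illustration only.
example_mse = pd.Series({'f1': 0.8, 'f2': 1.5, 'f3': 0.9, 'f4': 2.1, 'f5': 1.2})

# Option A: keep the k features with the lowest MSE.
k = 2
keep_top_k = example_mse.nsmallest(k).index.tolist()

# Option B: derive the cut-off from a quantile of the MSE distribution
# and keep the features strictly below it.
example_cutoff = example_mse.quantile(0.5)
keep_below_cutoff = example_mse[example_mse < example_cutoff].index.tolist()

print(keep_top_k)
print(keep_below_cutoff)
```

Option A gives direct control over the number of features, while option B adapts to the spread of the MSE values.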

In [None]:
np.save('../features/featuresFromUnivariateMSERegression.npy', features_to_keep)