## Filter Methods - Univariate feature selection - Regression

Univariate feature selection scores each feature on its own with a univariate statistical test and keeps the best-scoring ones. For regression, the F-test used by `f_regression` estimates the degree of linear dependency between each feature and the target, so it assumes the relationship between the two is linear. The p-values it reports also assume that the variables follow a Gaussian distribution.
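
To make the F-test concrete: for a single feature, the F-score returned by `f_regression` can be recovered from the Pearson correlation `r` between the feature and the target, as `F = r**2 / (1 - r**2) * (n - 2)`. The sketch below checks this on synthetic data. The data and the names `X_demo`, `y_demo`, `f_manual` and `p_manual` are illustrative assumptions, not part of the dataset used in this notebook.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n)

# F-scores and p-values as computed by scikit-learn
f_scores, p_values = f_regression(X_demo, y_demo)

# the same quantities derived by hand from the Pearson correlation
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])
f_manual = r**2 / (1 - r**2) * (n - 2)
p_manual = stats.f.sf(f_manual, 1, n - 2)

print(np.allclose(f_scores, f_manual), np.allclose(p_values, p_manual))
```

Because the score depends only on the correlation, a feature with a strong but non-linear relationship to the target can still receive a low F-score.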

These assumptions may not hold for the variables in your dataset, so if you plan to use this procedure, you will need to verify them first.
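
A quick way to see why the linearity assumption matters is to score a feature whose effect on the target is purely quadratic. The following is a minimal sketch on synthetic data (all names and values are assumptions made for illustration). It compares the F-score with mutual information, the measure used by the previous method in this series, which does not assume linearity.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

# synthetic data (assumption): x_lin acts linearly, x_quad acts quadratically
rng = np.random.default_rng(0)
n = 1000
x_lin = rng.uniform(-1, 1, size=n)
x_quad = rng.uniform(-1, 1, size=n)
y_demo = x_lin + 3 * x_quad**2 + rng.normal(scale=0.1, size=n)
X_demo = np.column_stack([x_lin, x_quad])

# the F-test only sees the linear part of each relationship
f_scores, p_values = f_regression(X_demo, y_demo)

# mutual information can also pick up non-linear dependencies
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)

print("F-scores:", f_scores)
print("p-values:", p_values)
print("mutual information:", mi_scores)
```

In this setup the quadratic feature drives much of the target's variation, yet its F-score stays far below that of the linear feature, while its mutual information is clearly above zero.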

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectPercentile

### load dataset

In [None]:
# load dataset and features from previous method
features = np.load('../features/featuresFromMIRegression.npy').tolist()
data = pd.read_pickle('../../data/features/features.pkl').loc[:,features].sample(frac=0.35).fillna(-9999)



In [None]:
# In practice, feature selection should be done after data pre-processing,
# so ideally all the categorical variables are already encoded as numbers,
# and then you can assess how predictive they are of the target

# here, for simplicity, I will use only the numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

### split train - test

In [None]:
# In all feature selection procedures, it is good practice to select the
# features by examining only the training set. This avoids overfitting.

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

### calculate univariate statistical measure

In [None]:
# calculate the univariate statistical measure between
# each of the variables and the target
# as with chi2, the output is an array of F-scores
# and an array of p-values; the p-values are the ones we will compare

univariate = f_regression(X_train.fillna(0), y_train)
univariate

In [None]:
# let's add the variable names and order it for clearer visualisation
univariate = pd.Series(univariate[1])
univariate.index = X_train.columns
univariate.sort_values(ascending=False, inplace=True)

In [None]:
# and now let's plot the p values
univariate.sort_values(ascending=False).plot.bar(figsize=(20, 8))
pass

Remember that, in principle, the lower the p-value, the more predictive the feature is. Features towards the left of the plot, with p-values above 0.05, are candidates for removal, because their linear relationship with the target is not statistically significant.
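
The 0.05 cut-off can also be applied programmatically. As a minimal sketch on synthetic data, scikit-learn's `SelectFpr` keeps the features whose p-value is below `alpha`. The data, the column names `var_0` to `var_4` and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFpr, f_regression

# synthetic data (assumption): only var_0 and var_1 influence the target
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame(rng.normal(size=(n, 5)),
                      columns=[f"var_{i}" for i in range(5)])
y_demo = 3.0 * X_demo["var_0"] - 2.0 * X_demo["var_1"] + rng.normal(size=n)

# keep the features with a p-value below 0.05
selector = SelectFpr(f_regression, alpha=0.05).fit(X_demo, y_demo)

p_values = pd.Series(selector.pvalues_, index=X_demo.columns)
kept = X_demo.columns[selector.get_support()].tolist()
candidates_to_remove = p_values[p_values >= 0.05].index.tolist()

print("kept:", kept)
print("candidates to remove:", candidates_to_remove)
```

With many features, some pure-noise variables will fall below 0.05 by chance, so the threshold is a screening aid rather than proof of relevance.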

Further investigation is needed if we want to know the true nature of the relationship between feature and target.

In big datasets it is not unusual for the p-values of many features to be very small. This says little about how relevant a feature is; mostly it reflects the fact that the dataset is big, because with enough observations even a very weak relationship becomes statistically significant.
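
The effect of sample size on the p-value can be illustrated with a small sketch. The synthetic feature below has the same very weak effect on the target in both cases, and only the number of rows changes. The helper `p_value_for_sample_size`, the effect size 0.05 and the sample sizes are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)

def p_value_for_sample_size(n):
    """Return the f_regression p-value of one weak synthetic feature."""
    x = rng.normal(size=(n, 1))
    # the feature explains only a tiny fraction of the target's variance
    y = 0.05 * x[:, 0] + rng.normal(size=n)
    _, p = f_regression(x, y)
    return p[0]

p_small = p_value_for_sample_size(100)
p_large = p_value_for_sample_size(100_000)

print("p-value with 100 rows:    ", p_small)
print("p-value with 100,000 rows:", p_large)
```

The relationship is equally weak in both samples, but the larger one produces a far smaller p-value, which is why p-values alone are a poor measure of feature importance on big data.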


### save features

In [None]:
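# percentage of variables to keep from the previous F-test analysis
# (SelectPercentile takes a percentile, so 10 keeps the top 10% of
# the features ranked by F-score, not 10 features)
NNUMVAR = 10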

sel_ = SelectPercentile(f_regression, percentile=NNUMVAR).fit(X_train.fillna(0), y_train)
features_to_keep = X_train.columns[sel_.get_support()].tolist()
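
Note that `SelectPercentile` interprets its argument as a percentage of the features, whereas `SelectKBest` takes an absolute number. The sketch below contrasts the two on synthetic data with 20 features. The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# synthetic data (assumption): 20 features, only the first one is informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = X_demo[:, 0] + rng.normal(size=300)

# percentile=10 keeps the top 10% of the features
by_percentile = SelectPercentile(f_regression, percentile=10).fit(X_demo, y_demo)

# k=10 keeps the 10 best features
by_count = SelectKBest(f_regression, k=10).fit(X_demo, y_demo)

n_percentile = int(by_percentile.get_support().sum())
n_count = int(by_count.get_support().sum())

print("SelectPercentile(percentile=10) keeps", n_percentile, "features")
print("SelectKBest(k=10) keeps", n_count, "features")
```

If a fixed number of variables is what you want, `SelectKBest` is the more direct choice; `SelectPercentile` scales the selection with the width of the dataset.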

In [None]:
np.save('../features/featuresFromUnivariateRegression.npy', features_to_keep)