## Features selection
## Tutorials

This project explores and tests alternative methods of features selection. It started with the notebook "Features Selection - Discussion", which presents the relevance, approaches and methods of features selection, drawing mainly on articles from specialized websites and on some books on machine learning fundamentals. As discussed in that first notebook of the series, the two main objectives of selecting features are reducing model complexity (thus saving memory and time) and, possibly, improving model performance.
<br>
<br>
The notebook "Features Selection - Discussion" organizes popular methods into three classes: *analytical methods*, which focus on the relationship between two variables (different inputs or an input and the output) or even consider only one variable at a time; *supervised learning selection*, which uses statistical learning methods that rank input variables according to their importance while training a model; and *exhaustive methods*, which explore several distinct subsets of the entire set of available features.
<br>
<br>
To explore and test alternative methods of features selection, this project has produced four main components: first, the already mentioned notebook "Features Selection - Discussion"; second, a Python class providing a unified API for multiple methods from the three classes mentioned above (module "features_selection" and notebook "Features Selection"); third, a notebook that illustrates how to use the most relevant methods of features selection, either through the native classes and functions or through the developed class with a unified API; and finally, a notebook ("Features Selection - Empirical Tests") that implements tests for assessing the most adequate method for a given regression problem.

---------

This notebook uses a regression dataset from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized) to illustrate how to implement some of the most relevant features selection methods:
1. Analytical methods:
    * [*VarScreeningNumerical*](#variance)<a href='#variance'></a> class from my [GitHub](https://github.com/m-rosso/unsupervised-features-screening): this class supports either variance thresholding or the selection of the *K* features with the highest variances, although only variance thresholding is used here. Parameters:
        * *variance_threshold*: variance below which a feature is dropped from the collection of features.
    <br>
    <br>
    * [*CorrScreeningNumerical*](#correlation)<a href='#correlation'></a> class from my [GitHub](https://github.com/m-rosso/unsupervised-features-screening): it sorts features by their variance and then sequentially drops the features with excessive pairwise (linear) correlation. Parameters:
        * *corr_threshold*: correlation above which a feature is dropped from the collection of features.
<br>
<br>
2. Exhaustive methods:
    * [Recursive feature elimination (RFE)](#rfe)<a href='#rfe'></a> class from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html): given an initialized estimator, at each step a predefined number of the least relevant features are dropped until a given number of features is reached. Parameters:
        * *estimator*: machine learning algorithm for training a model.
        * *n_features_to_select*: final number of features to be selected.
        * *step*: number of features to be dropped at each iteration.
    <br>
    <br>
    * [RFECV](#rfecv)<a href='#rfecv'></a> class from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html): the number of selected features is chosen by optimizing a performance metric computed with K-folds cross-validation. At each step the least important features are dropped, and among the resulting models, each with a different number of features, the best one is chosen through cross-validation. Parameters:
        * *estimator*: machine learning algorithm for training a model.
        * *min_features_to_select*: minimum final number of features to be selected.
        * *step*: number of features to be dropped at each iteration.
        * *cv*: configuration of K-folds cross-validation for final model selection.
        * *scoring*: performance metric for final model selection.
    <br>
    <br>
    * [SequentialFeatureSelector](#sfs)<a href='#sfs'></a> class from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html): depending on the "direction" initialization parameter, either a version of forward-stepwise selection or a version of backward-stepwise selection is implemented. As with RFE, the number of features to be selected is a parameter that should itself be tuned in order to optimize model performance. Parameters:
        * *estimator*: machine learning algorithm for training a model.
        * *n_features_to_select*: final number of features to be selected.
        * *cv*: configuration of K-folds cross-validation for final model selection.
        * *scoring*: performance metric for final model selection.
        * *direction*: indicates whether forward or backward-stepwise selection should be implemented.
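
The two analytical methods described in item 1 can be sketched with plain pandas. This is a minimal illustration on synthetic data, not the code of *VarScreeningNumerical* or *CorrScreeningNumerical*: the function names `variance_screening` and `correlation_screening` are hypothetical, and visiting features from the highest to the lowest variance is an assumption, since the description above only says that features are sorted by variance.

```python
import numpy as np
import pandas as pd


def variance_screening(data: pd.DataFrame, threshold: float) -> list:
    """Keep the features whose variance is not below `threshold`."""
    variances = data.var()
    return list(variances[variances >= threshold].index)


def correlation_screening(data: pd.DataFrame, threshold: float) -> list:
    """Greedy screening (assumed order: highest variance first).

    A feature is kept only if its absolute linear correlation with every
    feature already kept does not exceed `threshold`.
    """
    order = data.var().sort_values(ascending=False).index
    corr = data.corr().abs()
    selected = []
    for feature in order:
        if all(corr.loc[feature, kept] <= threshold for kept in selected):
            selected.append(feature)
    return selected


# Toy data: "b" is almost a copy of "a", "c" is independent of both,
# and "d" is nearly constant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
    "d": rng.normal(scale=0.01, size=200),
})

var_selected = variance_screening(toy, threshold=0.1)
corr_selected = correlation_screening(toy, threshold=0.8)
print(f"Variance screening kept: {var_selected}")
print(f"Correlation screening kept: {corr_selected}")
```

In this toy case variance screening discards the nearly constant column, while correlation screening keeps only one of the two almost identical columns.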

The [features selection](#features_selection)<a href='#features_selection'></a> section not only contains code illustrating the methods presented above, but also demonstrates how to [use](#fs_class)<a href='#fs_class'></a> the developed class that unifies all the major methods for selecting features. First, instead of initializing an object of the *FeaturesSelection* class, three static methods are used as functions that build a list with the names of the selected features:
1. *analytical_selection*: implements variance or correlation thresholding. Arguments:
    * *method*: defines whether variance or correlation thresholding should take place.
    * *threshold*: reference value for variance or correlation thresholding.
<br>
<br>
2. *supervised_selection*: runs supervised learning selection, which keeps only the features whose importance, computed while a model is trained, is greater than a threshold. Arguments:
    * *estimator*: machine learning algorithm exposing either a "coef_" or a "feature_importances_" attribute.
    * *threshold*: importance value above which features are selected.
<br>
<br>
3. *exaustive_selection*: implements one of the following exhaustive methods (presented above): RFE (method='rfe'), RFECV (method='rfecv'), SequentialFeatureSelector (method='sequential'), or random selection (method='random_selection'). Random selection, not previously mentioned, builds a collection of models with different numbers of features (all randomly picked) and then chooses the best model using K-folds CV.
    * Arguments for running RFE:
        * estimator: machine learning algorithm.
        * num_folds: number of folds of K-folds CV for selecting final model.
        * metric: performance metric for selecting final model.
        * max_num_feats: maximum number of features to be tested.
        * step: number of features to be dropped at each iteration.
    <br>
    <br>
    * Arguments for running RFECV:
        * estimator: machine learning algorithm.
        * num_folds: number of folds of K-folds CV for selecting final model.
        * metric: performance metric for selecting final model.
        * min_num_feats: minimum number of features to be selected.
        * step: number of features to be dropped at each iteration.
    <br>
    <br>
    * Arguments for running SequentialFeatureSelector:
        * estimator: machine learning algorithm.
        * num_folds: number of folds of K-folds CV for selecting final model.
        * metric: performance metric for selecting final model.
        * max_num_feats: maximum number of features to be tested.
        * direction: indicates whether forward or backward-stepwise selection should be implemented.
    <br>
    <br>
    * Arguments for running random selection:
        * estimator: machine learning algorithm.
        * num_folds: number of folds of K-folds CV for selecting final model.
        * metric: performance metric for selecting final model.
        * max_num_feats: maximum number of features to be tested.
        * step: number of features to be randomly included at each iteration.
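
Random selection is the only method above without a scikit-learn counterpart, so a short sketch may help. The code below is not the *FeaturesSelection* implementation: the function `random_selection_sketch` is hypothetical, the data are synthetic, and it assumes that the subsets are nested, that is, `step` randomly chosen features are added at each iteration until `max_num_feats` is reached.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score


def random_selection_sketch(estimator, X, y, max_num_feats, step,
                            num_folds=5, metric="r2", seed=0):
    """Score nested random subsets and return the best one.

    Subsets have step, 2*step, ..., max_num_feats features (assumption:
    each iteration adds `step` features in a fixed random order).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])  # random order of inclusion
    best_score, best_subset = -np.inf, None
    for size in range(step, max_num_feats + 1, step):
        subset = order[:size]
        # Mean K-folds CV score of a model using only this subset:
        score = cross_val_score(estimator, X[:, subset], y,
                                cv=num_folds, scoring=metric).mean()
        if score > best_score:
            best_score, best_subset = score, np.sort(subset)
    return best_subset.tolist(), best_score


X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
subset, score = random_selection_sketch(Lasso(alpha=1.0), X, y,
                                        max_num_feats=20, step=5)
print(f"{len(subset)} features selected, mean CV R2 = {score:.3f}")
```

Because the candidate subsets are random, the result depends on the seed; the method is a cheap baseline rather than a search guided by feature relevance.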

Finally, the main difference between using the static methods and initializing an [object of the *FeaturesSelection* class](#init_fs)<a href='#init_fs'></a> is that the static methods require two additional arguments: "inputs" and "output", the dataframes with training data. When using the *FeaturesSelection* class directly, all arguments are passed at initialization and the "select_features" method is then called with "inputs" and "output" as arguments. Another difference is that the "estimator" parameter must be passed to the "select_features" method whenever supervised learning selection or exhaustive methods are applied.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing datasets](#imports)<a href='#imports'></a>.
    * [Features and outcome variables](#feats_outcomes)<a href='#feats_outcomes'></a>.
<br>
<br>
5. [Features selection](#features_selection)<a href='#features_selection'></a>.
    * [Analytical methods](#analytical_methods)<a href='#analytical_methods'></a>.
    * [Exhaustive methods](#exaustive_methods)<a href='#exaustive_methods'></a>.
    * [Class for features selection](#fs_class)<a href='#fs_class'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import json

from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector
from sklearn.linear_model import Lasso

<a id='functions_classes'></a>

## Functions and classes

In [2]:
from utils import train_test_split
from pre_process import pre_process
from screening_features import VarScreeningNumerical, CorrScreeningNumerical
from features_selection import FeaturesSelection

<a id='settings'></a>

## Settings

### Data management

In [3]:
# Declare whether to export results:
export = False

### Data transformation

In [4]:
# Define whether to apply logarithmic transformation over numerical variables:
log_transform = True

# Define whether to standardize numerical variables:
standardize = True

### Features selection

#### Methods

In [5]:
# Defines whether to implement analytical methods:
analytical_selection = True
variance_screening = True # Selection based on the variance of features.
correlation_screening = True # Selection based on the correlation among features.

# Defines whether to implement supervised features selection:
supervised_selection = False

# Defines whether to implement exhaustive methods:
exaustive_selection = True
rfe_selection = True # Recursive feature elimination.
rfecv_selection = True # Recursive feature elimination with cross-validation.
sfs_selection = True # Sequential feature selection.

<a id='imports'></a>

## Importing datasets

<a id='feats_outcomes'></a>

### Features and outcome variables

In [6]:
# Importing data:
df = pd.read_csv('../Datasets/CommViolPredUnnormalizedData.txt', header=None)

# Columns names:
columns_names = pd.read_csv('../Datasets/columns_names.csv')

# Defining columns names:
df.columns = list(columns_names['column_name'])

# Auxiliary variables:
drop_vars = ['communityname', 'countyCode', 'communityCode', 'fold', 'ViolentCrimesPerPop']

# Additional outcome variables:
additional_outcomes = ['nonViolPerPop', 'murders', 'murdPerPop', 'rapes', 'rapesPerPop', 'robberies',
                       'robbbPerPop', 'assaults', 'assaultPerPop', 'burglaries', 'burglPerPop', 'larcenies',
                       'larcPerPop', 'autoTheft', 'autoTheftPerPop', 'arsons', 'arsonsPerPop']
df.drop(additional_outcomes, axis=1, inplace=True)

print(f'Shape of data: {df.shape}.')
print(f'Number of distinct instances: {len(np.unique(df["communityname"] + df["state"]))}.')
df.head(3)

Shape of data: (2215, 130).
Number of distinct instances: 2215.


Unnamed: 0,communityname,state,countyCode,communityCode,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,BerkeleyHeightstownship,NJ,39,5320,1,11980,3.1,1.37,91.78,6.5,...,6.5,1845.9,9.63,?,?,?,?,0.0,?,41.02
1,Marpletownship,PA,45,47616,1,23123,2.82,0.8,95.57,3.44,...,10.6,2186.7,3.84,?,?,?,?,0.0,?,127.56
2,Tigardcity,OR,?,?,1,29344,2.43,0.74,94.33,3.43,...,10.6,2780.9,4.37,?,?,?,?,0.0,?,218.59


#### Correcting missing values and data types

In [7]:
# Loop over columns:
for c in df.columns:
    df[c] = df[c].apply(lambda x: np.nan if x == '?' else x)
    
    # Converting data into float:
    if c not in ['communityname', 'state', 'countyCode', 'communityCode', 'fold']:
        df[c] = df[c].apply(float)
    
    # Treating missings for support variables:
    if c in ['communityname', 'countyCode', 'communityCode', 'fold']:
        df[c] = ['' if pd.isnull(x) else x for x in df[c]]

In [8]:
# Dropping instances with missing for the outcome variable:
df = df[df['ViolentCrimesPerPop'].notnull()].reset_index(drop=True)

print(f'Shape of data: {df.shape}.')
print(f'Number of distinct instances: {len(np.unique(df["communityname"] + df["state"]))}.')
df.head(3)

Shape of data: (1994, 130).
Number of distinct instances: 1994.


Unnamed: 0,communityname,state,countyCode,communityCode,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,BerkeleyHeightstownship,NJ,39.0,5320.0,1,11980.0,3.1,1.37,91.78,6.5,...,6.5,1845.9,9.63,,,,,0.0,,41.02
1,Marpletownship,PA,45.0,47616.0,1,23123.0,2.82,0.8,95.57,3.44,...,10.6,2186.7,3.84,,,,,0.0,,127.56
2,Tigardcity,OR,,,1,29344.0,2.43,0.74,94.33,3.43,...,10.6,2780.9,4.37,,,,,0.0,,218.59


#### Train-test split

In [9]:
df_train, df_test = train_test_split(df, test_ratio=0.25, shuffle=True)

#### Data pre-processing

In [10]:
df_train, df_test, df_train_scaled, df_test_scaled = pre_process(training_data=df_train, test_data=df_test,
                                                                 vars_to_drop=drop_vars,
                                                                 log_transform=log_transform, standardize=standardize)

---------------------------------------------------------------------------------------------------------
CLASSIFYING FEATURES AND EARLY SELECTION


Initial number of features: 125.
0 features were dropped for excessive number of missings!
0 features were dropped for having no variance!
125 remaining features.


---------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------------
ASSESSING MISSING VALUES


Training data:
Number of features with missings: 23 out of 130 features (17.69%).
Average number of missings: 213 out of 1496 observations (14.24%).

Test data:
Number of features with missings: 22 out of 130 features (16.92%).
Average number of missings: 70 out of 498 observations (14.06%).


--------------------------------------------------------------------------------------

<a id='features_selection'></a>

## Features selection

In [11]:
# Complete collection of features:
all_vars = list(df_train_scaled.drop(drop_vars, axis=1).columns)

# Numerical features:
cont_vars = [c for c in df_train.columns if 'L#' in c]

<a id='analytical_methods'></a>

### Analytical methods

In [12]:
if analytical_selection:
    # Numerical features:
    cont_df = df_train[cont_vars].copy()

    # Scaling each variable by its mean (vectorized over all columns):
    means = cont_df.mean()
    cont_df = cont_df / means
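
The cell above divides each numerical feature by its mean before the analytical methods are applied. A plausible reading, stated here as an assumption rather than the author's documented rationale, is that this makes variances comparable across features measured in different units: the variance of a mean-scaled variable equals its squared coefficient of variation. A minimal check on synthetic data (column names are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # Large scale, relative spread of about 10%:
    "income_usd": rng.normal(50_000, 5_000, size=1_000),
    # Small scale, relative spread of about 30%:
    "household_size": rng.normal(3.0, 0.9, size=1_000),
})

raw_var = toy.var()                             # dominated by the units
scaled_var = (toy / toy.mean()).var()           # variance after mean scaling
cv_squared = (toy.std() / toy.mean()) ** 2      # squared coefficient of variation

print(pd.DataFrame({"raw": raw_var, "scaled": scaled_var, "cv^2": cv_squared}))
```

On the raw scale the income column has by far the larger variance, whereas after mean scaling the household size column does, which reflects relative rather than absolute dispersion.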

#### Selection based on variance<a id='variance'></a>

Click [here](https://github.com/m-rosso/unsupervised-features-screening) for the documentation.

In [13]:
# Creating the object for variance thresholding:
var_screen = VarScreeningNumerical(features=cont_vars,
                                   select_k=False, thresholding=True, variance_threshold=0.1)

# Selecting features based on their variances:
var_screen.select_feat(data=cont_df)
var_screen_features = var_screen.selected_feat

print(f'From {len(cont_vars)} features, {len(var_screen_features)} were selected!')

From 124 features, 40 were selected!


#### Selection based on correlation<a id='correlation'></a>

Click [here](https://github.com/m-rosso/unsupervised-features-screening) for the documentation.

In [14]:
# Creating the object for correlation thresholding:
corr_screen = CorrScreeningNumerical(features=cont_vars,
                                     corr_threshold=0.8)

# Selecting features based on the correlation among them:
corr_screen.select_feat(data=cont_df)
corr_screen_features = corr_screen.selected_feat

print(f'From {len(cont_vars)} features, {len(corr_screen_features)} were selected!')

From 124 features, 84 were selected!


<a id='exaustive_methods'></a>

### Exhaustive methods

#### RFE<a id='rfe'></a>

sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html).

In [15]:
# Creating the object for the recursive feature elimination:
rfe_select = RFE(estimator=Lasso(alpha=1.0),
                 n_features_to_select=1,
                 step=1)

# Running the recursive feature elimination:
rfe_select = rfe_select.fit(X=df_train_scaled.drop(drop_vars, axis=1),
                            y=df_train_scaled['ViolentCrimesPerPop'])

# Selected features:
rfe_features = [c for s, c in zip(rfe_select.support_, all_vars) if s]
print(f'From {len(all_vars)} features, {len(rfe_features)} were selected!')

From 174 features, 1 were selected!
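
With `n_features_to_select=1` and `step=1`, RFE keeps a single feature, but the fitted object also records the order in which the others were eliminated in its `ranking_` attribute (rank 1 is the feature kept, higher ranks were dropped earlier). The sketch below shows how that ranking could be read; it uses synthetic data and generic feature names, both of which are assumptions made only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

# Synthetic regression problem with generic feature names:
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]

rfe = RFE(estimator=Lasso(alpha=1.0), n_features_to_select=1, step=1)
rfe.fit(X, y)

# Sort features from the last one standing to the first one eliminated:
ranked = sorted(zip(rfe.ranking_, names))
top3 = [name for rank, name in ranked[:3]]
print(f"Three best-ranked features: {top3}")
```

Reading `ranking_` in this way gives the selected subset for any target size from a single fit, instead of refitting RFE once per value of `n_features_to_select`.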


#### RFECV<a id='rfecv'></a>

sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html).

In [16]:
# Creating the object for the recursive feature elimination with cross-validation:
rfecv_select = RFECV(estimator=Lasso(alpha=1.0),
                     step=1,
                     min_features_to_select=1,
                     cv=5,
                     scoring='r2')

# Running the recursive feature elimination with cross-validation:
rfecv_select = rfecv_select.fit(X=df_train_scaled.drop(drop_vars, axis=1),
                                y=df_train_scaled['ViolentCrimesPerPop'])

# Selected features:
rfecv_features = [c for s, c in zip(rfecv_select.support_, all_vars) if s]
print(f'From {len(all_vars)} features, {len(rfecv_features)} were selected!')

From 174 features, 108 were selected!
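
After fitting, an RFECV object exposes the chosen number of features in `n_features_` and can reduce a dataset to the selected columns with `transform`. The following sketch shows both on synthetic data (an assumption made for illustration; the estimator and settings mirror the cell above).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

rfecv = RFECV(estimator=Lasso(alpha=1.0), step=1,
              min_features_to_select=1, cv=5, scoring="r2")
rfecv.fit(X, y)

# Keep only the columns chosen through cross-validation:
X_reduced = rfecv.transform(X)
print(f"From {X.shape[1]} features, {rfecv.n_features_} were selected.")
```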


#### SequentialFeatureSelector<a id='sfs'></a>

sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html).

In [17]:
# Creating the object for the sequential feature selection:
sfs_select = SequentialFeatureSelector(estimator=Lasso(alpha=1.0),
                                       n_features_to_select=1,
                                       direction='forward',
                                       scoring='r2',
                                       cv=5)

# Running the sequential feature selection:
sfs_select = sfs_select.fit(X=df_train_scaled.drop(drop_vars, axis=1),
                            y=df_train_scaled['ViolentCrimesPerPop'])

# Selected features:
sfs_features = [c for s, c in zip(sfs_select.support_, all_vars) if s]
print(f'From {len(all_vars)} features, {len(sfs_features)} were selected!')

From 174 features, 1 were selected!
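
The cell above uses forward selection only. To illustrate the "direction" parameter, the sketch below runs SequentialFeatureSelector in both directions on a small synthetic dataset (an assumption for illustration) and collects the selected column indices with `get_support`. The two directions are not guaranteed to return the same subset.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=150, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)

selected = {}
for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(estimator=Lasso(alpha=1.0),
                                    n_features_to_select=2,
                                    direction=direction,
                                    scoring="r2", cv=5)
    sfs.fit(X, y)
    # Indices of the columns kept by this direction:
    selected[direction] = sfs.get_support(indices=True).tolist()

print(selected)
```

Forward selection starts from an empty set and is usually cheaper when few features are wanted, while backward selection starts from the full set and is cheaper when most features are to be kept.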


<a id='fs_class'></a>

### Class for features selection

See the Python module containing the *FeaturesSelection* class for the documentation.

#### Analytical methods

In [18]:
selected_features = FeaturesSelection.analytical_selection(inputs=cont_df, method='variance', threshold=0.1)

From 124 features, 40 were selected!


In [19]:
selected_features = FeaturesSelection.analytical_selection(inputs=cont_df, method='correlation', threshold=0.8)

From 124 features, 84 were selected!


#### Supervised learning selection

In [20]:
selected_features = FeaturesSelection.supervised_selection(Lasso(alpha=1.0),
                                                           inputs=df_train_scaled.drop(drop_vars, axis=1),
                                                           output=df_train_scaled['ViolentCrimesPerPop'],
                                                           threshold=0)

From 174 features, 97 were selected!
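
The idea behind supervised learning selection can be sketched in a few lines: fit the estimator, read its "coef_" or "feature_importances_" attribute, and keep the features whose importance exceeds the threshold. The function below is a hypothetical illustration on synthetic data, not the code of *FeaturesSelection.supervised_selection*; in particular, taking the absolute value of the coefficients is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso


def supervised_selection_sketch(estimator, X, y, names, threshold=0.0):
    """Keep the features whose importance is greater than `threshold`."""
    estimator.fit(X, y)
    if hasattr(estimator, "coef_"):
        # Linear models: magnitude of the coefficient (assumption).
        importance = np.abs(np.ravel(estimator.coef_))
    else:
        # Tree-based models and similar estimators.
        importance = np.asarray(estimator.feature_importances_)
    return [name for name, imp in zip(names, importance) if imp > threshold]


X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]

kept = supervised_selection_sketch(Lasso(alpha=1.0), X, y, names, threshold=0)
print(f"From {len(names)} features, {len(kept)} were selected.")
```

With a Lasso estimator and `threshold=0`, this amounts to keeping the features whose coefficients were not shrunk to zero by the L1 penalty.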


#### Exhaustive methods

Recursive feature elimination

In [21]:
selected_features = FeaturesSelection.exaustive_selection(estimator=Lasso(alpha=1.0),
                                                          inputs=df_train_scaled.drop(drop_vars, axis=1),
                                                          output=df_train_scaled['ViolentCrimesPerPop'],
                                                          method='rfe', num_folds=5, metric='r2',
                                                          max_num_feats=5, step=1)

From 174 features, 1 were selected!
From 174 features, 2 were selected!
From 174 features, 3 were selected!
From 174 features, 4 were selected!
From 174 features, 5 were selected!

From 174 features, 5 were finally selected!


Recursive feature elimination with cross-validation

In [22]:
selected_features = FeaturesSelection.exaustive_selection(estimator=Lasso(alpha=1.0),
                                                          inputs=df_train_scaled.drop(drop_vars, axis=1),
                                                          output=df_train_scaled['ViolentCrimesPerPop'],
                                                          method='rfecv', num_folds=5, metric='r2',
                                                          min_num_feats=1, step=1)

From 174 features, 108 were selected!


Sequential feature selection

In [23]:
selected_features = FeaturesSelection.exaustive_selection(estimator=Lasso(alpha=1.0),
                                                          inputs=df_train_scaled.drop(drop_vars, axis=1),
                                                          output=df_train_scaled['ViolentCrimesPerPop'],
                                                          method='sequential', num_folds=5, metric='r2',
                                                          max_num_feats=5, direction='forward')

From 174 features, 1 were selected!
From 174 features, 2 were selected!
From 174 features, 3 were selected!
From 174 features, 4 were selected!
From 174 features, 5 were selected!

From 174 features, 5 were finally selected!


Random selection

In [24]:
selected_features = FeaturesSelection.exaustive_selection(estimator=Lasso(alpha=1.0),
                                                          inputs=df_train_scaled.drop(drop_vars, axis=1),
                                                          output=df_train_scaled['ViolentCrimesPerPop'],
                                                          method='random_selection', num_folds=5, metric='r2',
                                                          max_num_feats=100, step=10)

From 174 features, 10 were selected!
From 174 features, 20 were selected!
From 174 features, 30 were selected!
From 174 features, 40 were selected!
From 174 features, 50 were selected!
From 174 features, 60 were selected!
From 174 features, 70 were selected!


From 174 features, 80 were selected!
From 174 features, 90 were selected!
From 174 features, 100 were selected!

From 174 features, 80 were finally selected!


<a id='init_fs'></a>

#### Initializing the *FeaturesSelection* class

Analytical methods

In [15]:
# Variance selection:
selection = FeaturesSelection(method='variance', threshold=0.1)

selection.select_features(inputs=cont_df)

From 124 features, 40 were selected!


In [26]:
# Correlation selection:
selection = FeaturesSelection(method='correlation', threshold=0.8)

selection.select_features(inputs=cont_df)

From 124 features, 84 were selected!


Supervised learning selection

In [17]:
selection = FeaturesSelection(method='supervised', threshold=0)

selection.select_features(inputs=df_train_scaled.drop(drop_vars, axis=1),
                          output=df_train_scaled['ViolentCrimesPerPop'],
                          estimator=Lasso(alpha=1.0))

From 176 features, 99 were selected!


Exhaustive methods

In [18]:
# Recursive feature elimination:
selection = FeaturesSelection(method='rfe', num_folds=5, metric='r2',
                              max_num_feats=5, step=1)

selection.select_features(inputs=df_train_scaled.drop(drop_vars, axis=1),
                          output=df_train_scaled['ViolentCrimesPerPop'],
                          estimator=Lasso(alpha=1.0))

From 176 features, 1 were selected!
From 176 features, 2 were selected!
From 176 features, 3 were selected!
From 176 features, 4 were selected!
From 176 features, 5 were selected!

From 176 features, 5 were finally selected!


In [29]:
# Recursive feature elimination with cross-validation:
selection = FeaturesSelection(method='rfecv', num_folds=5, metric='r2',
                              min_num_feats=1, step=1)

selection.select_features(inputs=df_train_scaled.drop(drop_vars, axis=1),
                          output=df_train_scaled['ViolentCrimesPerPop'],
                          estimator=Lasso(alpha=1.0))

From 174 features, 108 were selected!


In [30]:
# Sequential feature selection:
selection = FeaturesSelection(method='sequential', num_folds=5, metric='r2',
                              max_num_feats=5, step=1, direction='forward')

selection.select_features(inputs=df_train_scaled.drop(drop_vars, axis=1),
                          output=df_train_scaled['ViolentCrimesPerPop'],
                          estimator=Lasso(alpha=1.0))

From 174 features, 1 were selected!
From 174 features, 2 were selected!
From 174 features, 3 were selected!
From 174 features, 4 were selected!
From 174 features, 5 were selected!

From 174 features, 5 were finally selected!


In [31]:
# Random selection:
selection = FeaturesSelection(method='random_selection', num_folds=5, metric='r2',
                              max_num_feats=100, step=10)

selection.select_features(inputs=df_train_scaled.drop(drop_vars, axis=1),
                          output=df_train_scaled['ViolentCrimesPerPop'],
                          estimator=Lasso(alpha=1.0))

From 174 features, 10 were selected!
From 174 features, 20 were selected!
From 174 features, 30 were selected!
From 174 features, 40 were selected!
From 174 features, 50 were selected!
From 174 features, 60 were selected!
From 174 features, 70 were selected!
From 174 features, 80 were selected!
From 174 features, 90 were selected!
From 174 features, 100 were selected!

From 174 features, 70 were finally selected!
