## Features selection
## Empirical tests

This project explores and tests alternative methods of features selection. It started with the notebook "Features Selection - Discussion", which presents the relevance, approaches and methods of features selection, based mainly on articles from specialized websites and on some books on machine learning fundamentals. As discussed in that first notebook of the series, the two main objectives of selecting features are reducing model complexity (thus saving memory and time) and possibly improving model performance.
<br>
<br>
The notebook "Features Selection - Discussion" organizes popular methods into three classes: *analytical methods*, which focus on the relationship between two variables (different inputs, or an input and the output) or consider only one variable at a time; *supervised learning selection*, which uses statistical learning methods that rank input variables according to their importance while training a model; and *exhaustive methods*, which explore several distinct subsets of the entire set of available features.
<br>
<br>
To explore and test these alternatives, the project has produced four main pieces of content: first, the already mentioned notebook "Features Selection - Discussion"; second, a Python class providing a unified API for multiple methods from the three classes above (module "features_selection" and notebook "Features Selection"); third, a notebook that illustrates how to use the most relevant methods of features selection, either through the native classes and functions or through the developed class with a unified API; and finally, this notebook ("Features Selection - Empirical Tests"), which implements tests for assessing the most adequate method for a given regression problem.

---------

With features selection discussed and the *FeaturesSelection* class (which groups alternative methods) developed, this notebook tries to assess which approach is the most adequate for the **regression problem** provided by the [Communities and Crime Unnormalized](https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized) dataset, obtained from the UCI Machine Learning Repository.
<br>
<br>
This dataset has 18 potential (continuous) target variables and 125 original features (one of which is a categorical variable that gives rise to additional input variables). In this empirical application, the variable chosen as output is "ViolentCrimesPerPop", the total number of violent crimes per 100,000 population. Each row of the dataset represents a unique instance, which is a community in the US. Crime data refer to 1995 and come from the FBI, while demographics refer to the 1990 Census. The main advantages of this dataset are its limited number of observations, which simplifies tests since estimations require less memory and time, and its moderately high number of features.
<br>
<br>
Regarding the **methodology of tests**, the following procedures are executed to produce, from raw data, outcomes that may help point to the most adequate approach of features selection for the regression problem at hand:
1. *Train-test split:* data is shuffled and then 25% is held out as test data, while the first 75% of the shuffled data is used not only for training models, but also for any calculation needed during data preparation.
2. *Data pre-processing:* features are classified (continuous, binary, categorical) and an early selection is implemented to drop variables with an excessive number of missing values (more than 95% of the training instances) and variables with no variance. Then, missing values are assessed and transformations take place: logarithmic transformation and standard scaling for numerical features, as well as for the outcome variable. Missing values are treated as follows: a new category is created for missings in the categorical variable, while 0 is imputed for missings in numerical variables, together with binary variables indicating where values were missing. Finally, one-hot encoding is applied to the categorical variable.
3. *Features selection:* all methods presented in the notebook "Features Selection - Tutorials" are covered here: variance and correlation screening, supervised learning selection and exhaustive methods (RFE, RFECV, SequentialFeatureSelector, random selection). The complete grid of approaches to be tested (which may involve two or more methods sequentially) is listed further below.
4. *Model training:* two learning algorithms were picked for training models: Lasso (a linear, regularized method) and XGBoost (a more flexible, boosted model). Their hyper-parameters (regularization parameter for Lasso; subsample parameter, learning rate, maximum depth and number of estimators in the ensemble for XGBoost) are defined using K-folds cross-validation over the training data.
5. *Model evaluation:* the following performance metrics are calculated on the test data so that the best approach can be identified: RMSE, R2, MAE, MSLE.
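
The four test metrics of step 5 can be computed directly with NumPy. The sketch below is only illustrative: the function name `regression_metrics` and the toy arrays are assumptions, not part of the project's code, and MSLE is taken here as the mean squared difference of `log(1 + y)`, which requires values greater than -1.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, MAE and MSLE for two equally sized 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    # Residual and total sums of squares, used by R2.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)

    return {
        'rmse': float(np.sqrt(np.mean(errors ** 2))),
        'r2': float(1.0 - ss_res / ss_tot),
        'mae': float(np.mean(np.abs(errors))),
        'msle': float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)),
    }


# Toy values (hypothetical, not taken from the dataset).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
metrics = regression_metrics(y_true, y_pred)
```

RMSE and MAE are expressed in the units of the outcome, while R2 is unit-free, which makes it convenient for comparing approaches.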

It is crucial to notice that features selection is performed inside each iteration of the K-folds CV estimation. When the final model is trained using the best hyper-parameters, a new selection of features takes place using the entire training data. Consequently, the *FeaturesSelection* class is not used directly here. Instead, the *KfoldsCV_fit* class (available on my [GitHub](https://github.com/m-rosso/validation)) wraps it: it initializes a *FeaturesSelection* object before training the model on the train-validation split of each K-folds iteration, and once more before training the final model.
<br>
<br>
This is done out of caution, to avoid the [Freedman paradox](https://www.alexejgossmann.com/Freedmans_paradox/), which would occur if features selection were implemented using the entire training data. Although a conservative approach was adopted here, note that this strategy is not strictly necessary: it would be enough to first select features based on cross-validation over the training data and then feed the algorithm that optimizes hyper-parameters with those pre-selected features.
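
One way to keep features selection inside every cross-validation fold, without relying on the project's own classes, is a scikit-learn `Pipeline`. This is a sketch of the idea and not the *KfoldsCV_fit* implementation: the synthetic data, the `alpha` values and the step names `'select'` and `'model'` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic regression data (illustrative only): 3 informative features out of 20.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

# Because the selector is a pipeline step, it is refit on the training part of
# each fold, so validation data never influences which features are kept.
pipeline = Pipeline([
    ('select', SelectFromModel(Lasso(alpha=0.1))),
    ('model', Lasso(max_iter=100000)),
])

search = GridSearchCV(
    pipeline,
    param_grid={'model__alpha': [0.001, 0.01, 0.1, 1.0]},
    scoring='r2',
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# With the default refit=True, the whole pipeline (selector included) is fit
# again on all the training data using the best hyper-parameter value.
search.fit(X, y)

selected_mask = search.best_estimator_.named_steps['select'].get_support()
num_selected = int(selected_mask.sum())
```

The final refit mirrors the second selection described above: features are chosen once more on the entire training data before the final model is trained.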

The following **approaches for features selection** are tested in this notebook of empirical tests:
* Single methods:
    * Variance thresholding.
    * Correlation thresholding.
    * Supervised learning selection (using a linear estimator).
    * RFE.
    * RFECV.
    * Sequential selection (only forward-stepwise selection).
    * Random selection (for each model size, a random set of features is selected; then, the best model is defined).


* Combined methods:
    * Variance or correlation thresholding and supervised learning selection.
    * Variance or correlation thresholding and RFE.
    * Variance or correlation thresholding and RFECV.
    * Variance or correlation thresholding and sequential selection (forward-stepwise selection).
    * Variance or correlation thresholding and random selection.
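
A combined approach from the list above (unsupervised screening followed by supervised selection) can be sketched with pandas and scikit-learn. The helper `correlation_screening` is hypothetical and only one possible rule for correlation thresholding; the *FeaturesSelection* class may decide which feature of a correlated pair to drop differently. The data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso


def correlation_screening(inputs, threshold=0.9):
    """Drop the later feature of every pair whose absolute correlation exceeds the threshold."""
    corr = inputs.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in inputs.columns if col not in to_drop]


# Synthetic data (illustrative only): 'x1_copy' is nearly a duplicate of 'x1'.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=['x1', 'x2', 'x3', 'x4', 'x5'])
X['x1_copy'] = X['x1'] + rng.normal(scale=0.01, size=300)
y = 3.0 * X['x1'] - 2.0 * X['x2'] + rng.normal(scale=0.5, size=300)

# Step 1: unsupervised screening by correlation among inputs.
screened = correlation_screening(X, threshold=0.9)

# Step 2: supervised selection, keeping features with non-zero Lasso coefficients.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X[screened], y)
selected = [f for f, keep in zip(screened, selector.get_support()) if keep]
```

Screening first shrinks the set of candidates that the more expensive second step has to evaluate, which is the motivation for testing the combined methods.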

From these empirical tests, we find that features selection does not have a strong impact on predictive performance for this learning task. Even so, at least competitive metrics are obtained with shorter computing times and less complex models. Therefore, features selection was able to reduce the complexity of models while preserving generalization capacity. The **main conclusions** from the tests are summarized below:
* [Performance metrics](#metric_by_approach)<a href='#metric_by_approach'></a>: even though very similar results are found, supervised learning selection seems to be the best choice for this learning task for both algorithms used in the tests.
    * Moreover, using no features selection yields only the 9th highest R2 for XGBoost.

* This is even more evident when total elapsed time is taken into account. Although it has no absolute meaning, the [ratio between R2 and running time](#ratio_by_approach)<a href='#ratio_by_approach'></a> shows the superiority of first screening features based on the correlation among them and then selecting features through supervised learning methods. While no selection has very poor relative performance (e.g., excessive running time for Lasso estimation), sequential features selection requires a prohibitive computing time for a performance no better than that of simpler alternatives.

* Relating the performance metric explicitly to [running time](#metric_by_time)<a href='#metric_by_time'></a> reinforces how inappropriate sequential selection is for this learning task. If sequential selection is disregarded, a slight positive association is found between performance and running time, although some cost-effective alternatives are available, such as the above-mentioned supervised learning selection with relatively strong regularization, whose small subset of selected features allows good performance at a low computational cost.

* A similar conclusion can be drawn from the relationship between [performance and the number of selected features](#metric_by_num_feats)<a href='#metric_by_num_feats'></a>.

It is important to notice that the results derived, presented and discussed here do not necessarily hold for every supervised learning task. However, some notes may help in choosing a features selection method for a given setting. Supervised learning selection seems adequate as a first approach when trying to reduce the complexity of models. Methods such as RFE and RFECV are more robust techniques with a good balance between performance and running time. Both forward and backward-stepwise selection should only be considered when performance needs to be highly optimized, since they have extremely high computational costs. Finally, unsupervised features screening, either by variance or correlation thresholding, should always be considered, since it may help drop irrelevant features at a very low computational cost.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing datasets](#imports)<a href='#imports'></a>.
    * [Features and outcome variables](#feats_outcomes)<a href='#feats_outcomes'></a>.
<br>
<br>
5. [Data modeling](#data_modeling)<a href='#data_modeling'></a>.
    * [Linear regression (Lasso)](#linear)<a href='#linear'></a>.
    * [XGBoost](#xgboost)<a href='#xgboost'></a>.
<br>
<br>
6. [Analysis of results](#analysis_results)<a href='#analysis_results'></a>.
    * [Processing the outcomes](#processing_outcomes)<a href='#processing_outcomes'></a>.
    * [Visualizing the outcomes](#data_vis)<a href='#data_vis'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import json

from time import time

from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

from scipy.stats import uniform, norm, randint

<a id='functions_classes'></a>

## Functions and classes

In [2]:
from utils import train_test_split, plot_outcomes
from pre_process import pre_process
from kfolds import Kfolds_fit
from bootstrap import BootstrapEstimation
from features_selection import FeaturesSelection

<a id='settings'></a>

## Settings

### Data management

In [3]:
# Declare whether to export results:
export = True

### Data transformation

In [4]:
# Define whether to apply logarithmic transformation over numerical variables:
log_transform = True

# Define whether to standardize numerical variables:
standardize = True

### Features selection

#### Grid of methods and parameters

In [5]:
selection_params = [
    (False, {'analytical': None}, None, False, 'No selection'),
    
    (False, {'analytical': {'method': 'variance', 'threshold': 0.003}}, None, False, 'Variance selection'),
    (False, {'analytical': {'method': 'correlation', 'threshold': 0.9}}, None, False, 'Correlation selection'),
    
    (True, {'analytical': None}, {'method': 'supervised', 'threshold': 0, 'estimator': Lasso(alpha=10)},
     False, 'Supervised selection (alpha=10)'),
    (True, {'analytical': None}, {'method': 'supervised', 'threshold': 0, 'estimator': Lasso(alpha=1)},
     False, 'Supervised selection (alpha=1)'),
    (True, {'analytical': None}, {'method': 'supervised', 'threshold': 0, 'estimator': Lasso(alpha=0.1)},
     False, 'Supervised selection (alpha=0.1)'),
    
    (True, {'analytical': None}, {'method': 'rfe', 'estimator': Lasso(alpha=1.0), 'num_folds': 5, 'metric': 'r2',
                                  'max_num_feats': 100, 'step': 1}, True, 'RFE'),
    (True, {'analytical': None}, {'method': 'rfecv', 'estimator': Lasso(alpha=1.0), 'num_folds': 5,
                                  'metric': 'r2', 'min_num_feats': 50, 'step': 1}, True, 'RFECV'),
    (True, {'analytical': None}, {'method': 'sequential', 'estimator': Lasso(alpha=1.0), 'num_folds': 5,
                                  'metric': 'r2', 'max_num_feats': 100, 'step': 1, 'direction': 'forward'},
     True, 'Sequential (forward)'),
    (True, {'analytical': None}, {'method': 'random_selection', 'estimator': Lasso(alpha=1.0), 'num_folds': 5,
                                  'metric': 'r2', 'max_num_feats': 100, 'step': 10}, True, 'Random selection'),
    
    (True, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
     {'method': 'supervised', 'threshold': 0, 'estimator': Lasso(alpha=0.001)},
     False, 'Correlation selection, then supervised selection (alpha=0.001)'),
    (True, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
     {'method': 'rfe', 'estimator': Lasso(alpha=1.0), 'num_folds': 5, 'metric': 'r2', 'max_num_feats': 100,
      'step': 1}, True, 'Correlation selection, then RFE'),
    (True, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
     {'method': 'rfecv', 'estimator': Lasso(alpha=1.0), 'num_folds': 5, 'metric': 'r2', 'min_num_feats': 50,
      'step': 1}, True, 'Correlation selection, then RFECV'),
    (True, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
     {'method': 'sequential', 'estimator': Lasso(alpha=1.0), 'num_folds': 5, 'metric': 'r2', 'max_num_feats': 100,
      'step': 1, 'direction': 'forward'}, True, 'Correlation selection, then sequential (forward)'),
    (True, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
     {'method': 'random_selection', 'estimator': Lasso(alpha=1.0), 'num_folds': 5, 'metric': 'r2',
      'max_num_feats': 100, 'step': 10}, True, 'Correlation selection, then random selection'),
    (True, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
     {'method': 'supervised', 'threshold': 0, 'estimator': Lasso(alpha=10)},
     False, 'Correlation selection, then supervised selection (alpha=10)')
]
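
Each entry of the grid is a plain tuple read by position in the estimation loops: index 0 feeds `pre_selecting`, index 1 holds the analytical screening settings, index 2 feeds `pre_selecting_params`, index 3 feeds `only_final_selection` and index 4 is the label. As a readability aid, the positions could be named with a `NamedTuple`. The class `SelectionSpec` and its field names are hypothetical and not part of the project.

```python
from typing import NamedTuple, Optional


class SelectionSpec(NamedTuple):
    """Hypothetical named view over one entry of the grid of methods."""
    pre_selecting: bool                    # index 0
    screening: dict                        # index 1, holds the 'analytical' key
    pre_selecting_params: Optional[dict]   # index 2
    only_final_selection: bool             # index 3
    label: str                             # index 4


# One entry written in the same layout as the grid.
raw_entry = (False, {'analytical': {'method': 'correlation', 'threshold': 0.9}},
             None, False, 'Correlation selection')

# Only the first five positions are read by the loops, hence the slice.
spec = SelectionSpec(*raw_entry[:5])
```

Since a `NamedTuple` is still a tuple, `spec[4]` and `spec.label` return the same value, so positional access keeps working.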

<a id='imports'></a>

## Importing datasets

<a id='feats_outcomes'></a>

### Features and outcome variables

In [6]:
# Importing data:
df = pd.read_csv('../Datasets/CommViolPredUnnormalizedData.txt', header=None)

# Columns names:
columns_names = pd.read_csv('../Datasets/columns_names.csv')

# Defining columns names:
df.columns = list(columns_names['column_name'])

# Auxiliary variables:
drop_vars = ['communityname', 'countyCode', 'communityCode', 'fold', 'ViolentCrimesPerPop']

# Additional outcome variables:
additional_outcomes = ['nonViolPerPop', 'murders', 'murdPerPop', 'rapes', 'rapesPerPop', 'robberies',
                       'robbbPerPop', 'assaults', 'assaultPerPop', 'burglaries', 'burglPerPop', 'larcenies',
                       'larcPerPop', 'autoTheft', 'autoTheftPerPop', 'arsons', 'arsonsPerPop']
df.drop(additional_outcomes, axis=1, inplace=True)

print(f'Shape of data: {df.shape}.')
print(f'Number of distinct instances: {len(np.unique(df["communityname"] + df["state"]))}.')
df.head(3)

Shape of data: (2215, 130).
Number of distinct instances: 2215.


Unnamed: 0,communityname,state,countyCode,communityCode,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,BerkeleyHeightstownship,NJ,39,5320,1,11980,3.1,1.37,91.78,6.5,...,6.5,1845.9,9.63,?,?,?,?,0.0,?,41.02
1,Marpletownship,PA,45,47616,1,23123,2.82,0.8,95.57,3.44,...,10.6,2186.7,3.84,?,?,?,?,0.0,?,127.56
2,Tigardcity,OR,?,?,1,29344,2.43,0.74,94.33,3.43,...,10.6,2780.9,4.37,?,?,?,?,0.0,?,218.59


#### Correcting missing values and data types

In [7]:
# Loop over columns:
for c in df.columns:
    # Replacing the missing-value marker '?' with NaN (np.NaN was removed in NumPy 2.0):
    df[c] = df[c].apply(lambda x: np.nan if x == '?' else x)
    
    # Converting data into float:
    if c not in ['communityname', 'state', 'countyCode', 'communityCode', 'fold']:
        df[c] = df[c].apply(float)
    
    # Treating missings for support variables:
    if c in ['communityname', 'countyCode', 'communityCode', 'fold']:
        df[c] = ['' if pd.isnull(x) else x for x in df[c]]
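
As an alternative to replacing `'?'` column by column, pandas can mark it as missing at load time through the `na_values` argument of `read_csv`. The sketch below uses three made-up rows and a made-up subset of columns rather than the actual file, so it only illustrates the mechanism.

```python
import io

import pandas as pd

# Three made-up rows in the style of the raw file, where '?' marks a missing value.
raw_text = (
    "Acity,NJ,39,11980,41.02\n"
    "Btown,PA,?,23123,?\n"
    "Ccity,OR,?,29344,218.59\n"
)

# na_values='?' turns every '?' into NaN, so numeric columns are parsed as floats.
toy = pd.read_csv(io.StringIO(raw_text), header=None, na_values='?')
toy.columns = ['communityname', 'state', 'countyCode', 'population', 'ViolentCrimesPerPop']

num_missing_outcome = int(toy['ViolentCrimesPerPop'].isnull().sum())
```

With this option, the explicit conversion to float becomes unnecessary for columns whose only non-numeric value is `'?'`.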

In [8]:
# Dropping instances with missing for the outcome variable:
df = df[df['ViolentCrimesPerPop'].notnull()].reset_index(drop=True)

print(f'Shape of data: {df.shape}.')
print(f'Number of distinct instances: {len(np.unique(df["communityname"] + df["state"]))}.')
df.head(3)

Shape of data: (1994, 130).
Number of distinct instances: 1994.


Unnamed: 0,communityname,state,countyCode,communityCode,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,BerkeleyHeightstownship,NJ,39.0,5320.0,1,11980.0,3.1,1.37,91.78,6.5,...,6.5,1845.9,9.63,,,,,0.0,,41.02
1,Marpletownship,PA,45.0,47616.0,1,23123.0,2.82,0.8,95.57,3.44,...,10.6,2186.7,3.84,,,,,0.0,,127.56
2,Tigardcity,OR,,,1,29344.0,2.43,0.74,94.33,3.43,...,10.6,2780.9,4.37,,,,,0.0,,218.59


#### Train-test split

In [9]:
df_train, df_test = train_test_split(df, test_ratio=0.25, shuffle=True, seed=1)
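
The `train_test_split` function used above comes from the project's `utils` module, whose internals are not shown here. A minimal sketch of a shuffled hold-out split with the same arguments could look as follows; the name `shuffled_split` and the rounding rule are assumptions.

```python
import numpy as np
import pandas as pd


def shuffled_split(data, test_ratio=0.25, seed=1):
    """Shuffle the rows, keep the first share for training and hold out the rest."""
    rng = np.random.default_rng(seed)
    shuffled = data.iloc[rng.permutation(len(data))].reset_index(drop=True)
    n_train = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled.iloc[:n_train].copy(), shuffled.iloc[n_train:].copy()


# Toy data frame (hypothetical).
toy = pd.DataFrame({'x': range(8), 'y': range(8)})
toy_train, toy_test = shuffled_split(toy, test_ratio=0.25, seed=1)
```

Fixing the seed makes the split reproducible, so every features selection approach is compared on the same training and test rows.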

#### Data pre-processing

In [10]:
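# The flags defined in the "Settings" section (both True) control the transformations:
df_train, df_test, df_train_scaled, df_test_scaled = pre_process(training_data=df_train, test_data=df_test,
                                                                 vars_to_drop=drop_vars,
                                                                 log_transform=log_transform,
                                                                 standardize=standardize)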

---------------------------------------------------------------------------------------------------------
CLASSIFYING FEATURES AND EARLY SELECTION


Initial number of features: 125.
0 features were dropped for excessive number of missings!
0 features were dropped for having no variance!
125 remaining features.


---------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------------
ASSESSING MISSING VALUES


Training data:
Number of features with missings: 23 out of 130 features (17.69%).
Average number of missings: 212 out of 1496 observations (14.17%).

Test data:
Number of features with missings: 22 out of 130 features (16.92%).
Average number of missings: 71 out of 498 observations (14.26%).


--------------------------------------------------------------------------------------
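
The numerical part of the pre-processing (logarithmic transformation, standard scaling, imputation of 0 and binary indicators of missings) can be sketched as below. This is not the project's `pre_process` function: the helper `prepare_numeric`, the use of `log(1 + x)`, the order of the steps and the `missing_` prefix are all assumptions, and an indicator is created here for every column rather than only for those with missings.

```python
import numpy as np
import pandas as pd


def prepare_numeric(train, test, columns):
    """Log-transform, standardize with training statistics, then impute 0 and flag missings."""
    train_out = pd.DataFrame(index=train.index)
    test_out = pd.DataFrame(index=test.index)

    for col in columns:
        # log(1 + x) is an assumption; it keeps zeros finite.
        log_train, log_test = np.log1p(train[col]), np.log1p(test[col])

        # Mean and standard deviation come from the training data only.
        mean, std = log_train.mean(), log_train.std(ddof=0)

        for source, target in ((log_train, train_out), (log_test, test_out)):
            # After standardization, imputing 0 amounts to imputing the training mean.
            target[col] = ((source - mean) / std).fillna(0.0)
            target[f'missing_{col}'] = source.isnull().astype(int)

    return train_out, test_out


# Toy data (hypothetical values and column name).
toy_train = pd.DataFrame({'budget': [10.0, 100.0, np.nan, 1000.0]})
toy_test = pd.DataFrame({'budget': [np.nan, 50.0]})
prep_train, prep_test = prepare_numeric(toy_train, toy_test, ['budget'])
```

Computing the scaling statistics on the training data alone is what keeps the test data held out during data preparation.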

<a id='model_assess'></a>

### Model assessment

In [11]:
if 'model_assess.json' not in os.listdir('../Datasets'):
    model_assess = {}

else:
    with open('../Datasets/model_assess.json') as json_file:
        model_assess = json.load(json_file)

<a id='data_modeling'></a>

## Data modeling

In [12]:
# Complete collection of features:
all_vars = list(df_train_scaled.drop(drop_vars, axis=1).columns)

# Numerical features:
cont_vars = [c for c in df_train.drop(drop_vars, axis=1).columns if 'L#' in c]

In [13]:
# Numerical features:
cont_df = df_train[cont_vars].copy()
means = cont_df.mean().to_dict()

# Scaling each numerical feature by its mean (vectorized over columns):
cont_df = cont_df / cont_df.mean()
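
Dividing each feature by its mean before the analytical screening makes variances unit-free: the variance of `x / mean(x)` equals the squared coefficient of variation of `x`, so a single variance threshold can be applied across features measured in different units. A small check with hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Two hypothetical features with the same relative spread but different units.
toy = pd.DataFrame({
    'in_units': [1.0, 2.0, 3.0, 4.0],
    'in_thousands': [1000.0, 2000.0, 3000.0, 4000.0],
})

# Raw variances differ by several orders of magnitude.
raw_variance = toy.var(ddof=0)

# After dividing by the mean, both features have the same variance.
scaled_variance = (toy / toy.mean()).var(ddof=0)

# Squared coefficient of variation, for comparison.
squared_cv = (toy.std(ddof=0) / toy.mean()) ** 2
```

This rescaling is only meaningful for features whose mean is not close to zero.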

<a id='linear'></a>

### Linear regression (Lasso)

In [19]:
# Grid of hyper-parameters:
grid_param = {'alpha': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.25, 0.3, 0.5, 0.75, 1, 3, 10]}
default_param = {'alpha': 1.0}
fixed_params = {'warm_start':True, 'max_iter': 100000}

# Loop over alternatives for features selection:
for i in range(len(selection_params)):
    print(f'\033[1mFeatures selection method: {selection_params[i][4]}.\033[0m')
    print('\n')
    
    estimation_id = str(int(time()))
    
    ##############################################################################################################
    # FEATURES SELECTION BASED ON VARIANCE/CORRELATION:

    if selection_params[i][1]['analytical'] is not None:
        # Creating the object for features selection:
        selection = FeaturesSelection(method=selection_params[i][1]['analytical']['method'],
                                      threshold=selection_params[i][1]['analytical']['threshold'])

        # Running the features selection:
        selection.select_features(inputs=cont_df)

        # List of selected features:
        first_selection = selection.selected_features

    else:
        first_selection = list(df_train_scaled.drop(drop_vars, axis=1))

    ##############################################################################################################
    # MODEL ESTIMATION WITH FEATURES SELECTION:

    # Creating K-folds CV object:
    model = Kfolds_fit(task = 'regression', method = 'lasso', num_folds = 5, metric = 'r2',
                       random_search = False,
                       pre_selecting=selection_params[i][0], pre_selecting_params=selection_params[i][2],
                       only_final_selection=selection_params[i][3],
                       grid_param = grid_param, default_param = default_param, fixed_params=fixed_params)

    # Running K-folds CV:
    model.fit(train_inputs = df_train_scaled[first_selection],
              train_output = df_train_scaled['ViolentCrimesPerPop'],
              test_inputs = df_test_scaled[first_selection],
              test_output = df_test_scaled['ViolentCrimesPerPop'],
              print_outcomes=False, print_time=False)

    ##############################################################################################################
    # MODEL ASSESSMENT:

    model_assess[estimation_id] = {
        'estimation_id': estimation_id,
        'n_obs_train': len(df_train_scaled),
        'n_obs_test': len(df_test_scaled),
        'n_cols': df_train_scaled.drop(drop_vars, axis=1).shape[1],
        'avg_y_train': df_train['ViolentCrimesPerPop'].mean(),
        'avg_y_test': df_test['ViolentCrimesPerPop'].mean(),
        'method': 'lasso',
        'features_selection': selection_params[i][4],
        'num_selected_features': model.num_selected_features if hasattr(model,
                                                                        'num_selected_features') else len(first_selection),
        'performance_metrics': model.performance_metrics,
        'running_time': model.running_time
    }
    
    if export:
        with open('../Datasets/model_assess.json', 'w') as json_file:
            json.dump(model_assess, json_file, indent=2)
        
    print('\n')

Features selection method: Correlation selection, then supervised selection (alpha=0.001).

Grid estimation progress: [--------------------------------------] 100%

From 124 features, 99 were selected!
From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!
[The last five lines repeat at every step of the grid estimation, from 7% to 100%.]
From 99 features, 31 were selected!




<a id='xgboost'></a>

### XGBoost

In [21]:
# Loop over alternatives for features selection:
for i in range(len(selection_params)):
    print(f'\033[1mFeatures selection method: {selection_params[i][4]}.\033[0m')
    print('\n')
    
    estimation_id = str(int(time()))

    ##############################################################################################################
    # FEATURES SELECTION BASED ON VARIANCE/CORRELATION:

    if selection_params[i][1]['analytical'] is not None:
        # Creating the object for features selection:
        selection = FeaturesSelection(method=selection_params[i][1]['analytical']['method'],
                                      threshold=selection_params[i][1]['analytical']['threshold'])

        # Running the features selection:
        selection.select_features(inputs=cont_df)

        # List of selected features:
        first_selection = selection.selected_features

    else:
        first_selection = list(df_train_scaled.drop(drop_vars, axis=1))

    ##############################################################################################################
    # MODEL ESTIMATION WITH FEATURES SELECTION:

    # Grid of hyper-parameters:
    grid_param = {'subsample': uniform(0.5, 0.5),
                  'eta': uniform(0.0001, 0.1),
                  'max_depth': randint(1, 10),
                  'num_boost_round': [100, 250, 500]}
    default_param = {'subsample': 0.75, 'eta': 0.01, 'max_depth': 10, 'num_boost_round': 100}

    # Creating K-folds CV object:
    model = Kfolds_fit(task = 'reg:squarederror', method = 'xgboost', num_folds = 5, metric = 'r2',
                       random_search = True, n_samples=10,
                       pre_selecting=selection_params[i][0], pre_selecting_params=selection_params[i][2],
                       only_final_selection=selection_params[i][3],
                       grid_param = grid_param,
                       default_param = default_param)

    # Running K-folds CV:
    model.fit(train_inputs = df_train_scaled[first_selection],
              train_output = df_train_scaled['ViolentCrimesPerPop'],
              test_inputs = df_test_scaled[first_selection],
              test_output = df_test_scaled['ViolentCrimesPerPop'],
              print_outcomes=False, print_time=False)
        
    ##############################################################################################################
    # MODEL ASSESSMENT:

    model_assess[estimation_id] = {
        'estimation_id': estimation_id,
        'n_obs_train': len(df_train_scaled),
        'n_obs_test': len(df_test_scaled),
        'n_cols': df_train_scaled.drop(drop_vars, axis=1).shape[1],
        'avg_y_train': df_train['ViolentCrimesPerPop'].mean(),
        'avg_y_test': df_test['ViolentCrimesPerPop'].mean(),
        'method': 'xgboost',
        'features_selection': selection_params[i][4],
        'num_selected_features': model.num_selected_features if hasattr(model,
                                                                        'num_selected_features') else len(first_selection),
        'performance_metrics': model.performance_metrics,
        'running_time': model.running_time
    }
    
    if export:
        with open('../Datasets/model_assess.json', 'w') as json_file:
            json.dump(model_assess, json_file, indent=2)

[1mFeatures selection method: Correlation selection, then supervised selection (alpha=0.001).[0m




[1mGrid estimation progress:[0m [                                      ]   0%

From 124 features, 99 were selected!
From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


[1mGrid estimation progress:[0m [---                                   ]  10%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


[1mGrid estimation progress:[0m [-------                               ]  20%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


[1mGrid estimation progress:[0m [-----------                           ]  30%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [---------------                       ]  40%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [-------------------                   ]  50%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [----------------------                ]  60%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [--------------------------            ]  70%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [------------------------------        ]  80%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [----------------------------------    ]  90%

From 99 features, 27 were selected!
From 99 features, 28 were selected!
From 99 features, 28 were selected!
From 99 features, 33 were selected!
From 99 features, 34 were selected!


Grid estimation progress: [--------------------------------------] 100%

From 99 features, 31 were selected!
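
The export step above writes `model_assess` with `json.dump`. One detail worth keeping in mind when the file is read back: JSON object keys are always strings, so if `estimation_id` is an integer (an assumption, since its definition is not shown in this part of the notebook), the reloaded dictionary is keyed by strings instead. A minimal sketch of the round trip, using a temporary file and a made-up entry rather than the notebook's real path and results:

```python
import json
import os
import tempfile

# Hypothetical entry keyed by an integer id (illustrative values only).
model_assess = {1: {'method': 'xgboost', 'running_time': 1.5}}

# Write to a temporary file instead of '../Datasets/model_assess.json'.
path = os.path.join(tempfile.mkdtemp(), 'model_assess.json')
with open(path, 'w') as json_file:
    json.dump(model_assess, json_file, indent=2)

with open(path) as json_file:
    reloaded = json.load(json_file)

# Integer keys come back as strings after the round trip.
print(list(reloaded))

# Restore integer keys if downstream code expects them.
restored = {int(k): v for k, v in reloaded.items()}
```

If the ids are already strings, the conversion back to `int` is unnecessary.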


<a id='analysis_results'></a>

## Analysis of results

<a id='processing_outcomes'></a>

### Processing the outcomes

#### Linear regression (Lasso)

In [14]:
# Outcomes from lasso estimation:
lasso_ids = [k for k in model_assess if model_assess[k]['method']=='lasso']

outcomes_lasso = pd.DataFrame(data={
    'estimation_id': lasso_ids, 'method': [model_assess[k]['method'] for k in lasso_ids],
    'features_selection': [model_assess[k]['features_selection'] for k in lasso_ids],
    'num_selected_features': [model_assess[k]['num_selected_features'] for k in lasso_ids],
    'test_rmse': [model_assess[k]['performance_metrics']['test_rmse'] for k in lasso_ids],
    'test_r2': [model_assess[k]['performance_metrics']['test_r2'] for k in lasso_ids],
    'test_mae': [model_assess[k]['performance_metrics']['test_mae'] for k in lasso_ids],
    'test_msle': [model_assess[k]['performance_metrics']['test_msle'] for k in lasso_ids],
    'running_time': [model_assess[k]['running_time'] for k in lasso_ids],
})
outcomes_lasso.sort_values(['test_r2', 'running_time'], ascending=[False, True], inplace=True)

# Ratio between performance metric and running time:
outcomes_lasso['ratio_r2_time'] = outcomes_lasso['test_r2'] / outcomes_lasso['running_time']

outcomes_lasso

| | estimation_id | method | features_selection | num_selected_features | test_rmse | test_r2 | test_mae | test_msle | running_time | ratio_r2_time |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1626647396 | lasso | Supervised selection (alpha=0.01) | 147 | 350.396246 | 0.66555 | 223.31375 | | 664.631431 | 0.001001 |
| 0 | 1626645231 | lasso | No selection | 175 | 350.411742 | 0.665521 | 223.314665 | | 778.253433 | 0.000855 |
| 6 | 1626648627 | lasso | Supervised selection (alpha=1) | 94 | 350.412403 | 0.66552 | 223.316186 | | 77.230659 | 0.008617 |
| 3 | 1626647002 | lasso | Supervised selection (alpha=0.1) | 135 | 350.429092 | 0.665488 | 223.374915 | | 394.838262 | 0.001685 |
| 8 | 1626654837 | lasso | RFECV | 58 | 351.285848 | 0.66385 | 223.784411 | | 762.531379 | 0.000871 |
| 9 | 1626815931 | lasso | Sequential (forward) | 51 | 353.546342 | 0.65951 | 224.311788 | | 62643.367787 | 1.1e-05 |
| 7 | 1626653102 | lasso | RFE | 30 | 354.740129 | 0.657207 | 222.206558 | | 1734.867349 | 0.000379 |
| 10 | 1626899471 | lasso | Random selection | 100 | 355.06255 | 0.656583 | 227.61674 | | 867.69345 | 0.000757 |
| 5 | 1626648621 | lasso | Supervised selection (alpha=10) | 32 | 365.606385 | 0.635884 | 231.894448 | | 5.293609 | 0.120123 |
| 13 | 1626966073 | lasso | Correlation selection, then RFECV | 63 | 365.72788 | 0.635642 | 233.458288 | | 422.5901 | 0.001504 |
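
The `ratio_r2_time` column divides test R² by running time, so it rewards cheap approaches far more than small accuracy gains. As a quick check of the arithmetic, the sketch below recomputes the ratio in plain Python for three rows of the Lasso table above (values are copied from the table; the variable names are illustrative and not part of the notebook):

```python
# (features_selection, test_r2, running_time) copied from the Lasso table.
rows = [
    ('No selection', 0.665521, 778.253433),
    ('Supervised selection (alpha=1)', 0.66552, 77.230659),
    ('Supervised selection (alpha=10)', 0.635884, 5.293609),
]

# Same computation as the notebook's ratio column: R2 divided by running time.
ratios = {name: r2 / time for name, r2, time in rows}

# Print the approaches from highest to lowest ratio.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {ratio:.6f}')
```

Among these three rows, alpha=10 has by far the highest ratio even though its test R² is the lowest, so the ratio is best read alongside the raw metric rather than on its own.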


#### XGBoost

In [15]:
# Outcomes from XGBoost estimation:
xgboost_ids = [k for k in model_assess if model_assess[k]['method']=='xgboost']

outcomes_xgboost = pd.DataFrame(data={
    'estimation_id': xgboost_ids, 'method': [model_assess[k]['method'] for k in xgboost_ids],
    'features_selection': [model_assess[k]['features_selection'] for k in xgboost_ids],
    'num_selected_features': [model_assess[k]['num_selected_features'] for k in xgboost_ids],
    'test_rmse': [model_assess[k]['performance_metrics']['test_rmse'] for k in xgboost_ids],
    'test_r2': [model_assess[k]['performance_metrics']['test_r2'] for k in xgboost_ids],
    'test_mae': [model_assess[k]['performance_metrics']['test_mae'] for k in xgboost_ids],
    'test_msle': [model_assess[k]['performance_metrics']['test_msle'] for k in xgboost_ids],
    'running_time': [model_assess[k]['running_time'] for k in xgboost_ids],
})
outcomes_xgboost.sort_values(['test_r2', 'running_time'], ascending=[False, True], inplace=True)

# Ratio between performance metric and running time:
outcomes_xgboost['ratio_r2_time'] = outcomes_xgboost['test_r2'] / outcomes_xgboost['running_time']

outcomes_xgboost

| | estimation_id | method | features_selection | num_selected_features | test_rmse | test_r2 | test_mae | test_msle | running_time | ratio_r2_time |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1626736268 | xgboost | Supervised selection (alpha=1) | 94 | 367.986383 | 0.631128 | 224.533492 | 0.458285 | 64.28056 | 0.009818 |
| 8 | 1626900543 | xgboost | Sequential (forward) | 51 | 368.043209 | 0.631014 | 218.529738 | 0.455324 | 62966.912289 | 1e-05 |
| 2 | 1626736175 | xgboost | Correlation selection | 99 | 368.800478 | 0.629494 | 224.283383 | 0.443836 | 60.241019 | 0.01045 |
| 7 | 1626737585 | xgboost | RFECV | 58 | 368.918686 | 0.629257 | 229.016597 | | 169.667866 | 0.003709 |
| 5 | 1626736332 | xgboost | Supervised selection (alpha=0.1) | 135 | 371.344083 | 0.624366 | 229.713102 | 0.44194 | 105.497278 | 0.005918 |
| 3 | 1626736236 | xgboost | Supervised selection (alpha=10) | 32 | 374.232074 | 0.618501 | 218.862018 | 0.440858 | 31.639005 | 0.019549 |
| 13 | 1627006433 | xgboost | Correlation selection, then random selection | 70 | 374.233241 | 0.618498 | 230.555848 | 0.480494 | 64.433122 | 0.009599 |
| 12 | 1627006346 | xgboost | Correlation selection, then RFECV | 63 | 374.667635 | 0.617612 | 231.025188 | 0.472915 | 86.477326 | 0.007142 |
| 0 | 1626736004 | xgboost | No selection | 175 | 374.866911 | 0.617205 | 232.611129 | | 92.568473 | 0.006668 |
| 11 | 1627006142 | xgboost | Correlation selection, then RFE | 26 | 376.437221 | 0.613991 | 235.504149 | 0.504576 | 202.641066 | 0.00303 |
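
To compare the two estimators side by side, the two outcome tables can be stacked and summarized per method. The sketch below is one possible way to do that and is not part of the original notebook. It uses a few rows copied from the tables above as stand-ins for the full `outcomes_lasso` and `outcomes_xgboost` data frames, and the column `r2_gain_vs_no_selection` is a name introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the tables above (stand-ins for the full data frames).
outcomes_lasso = pd.DataFrame({
    'method': ['lasso', 'lasso', 'lasso'],
    'features_selection': ['Supervised selection (alpha=0.01)', 'No selection', 'RFE'],
    'test_r2': [0.66555, 0.665521, 0.657207],
    'running_time': [664.631431, 778.253433, 1734.867349],
})
outcomes_xgboost = pd.DataFrame({
    'method': ['xgboost', 'xgboost', 'xgboost'],
    'features_selection': ['Supervised selection (alpha=1)', 'Correlation selection',
                           'No selection'],
    'test_r2': [0.631128, 0.629494, 0.617205],
    'running_time': [64.28056, 60.241019, 92.568473],
})

# Stack both tables into a single frame with a fresh 0..n-1 index.
outcomes = pd.concat([outcomes_lasso, outcomes_xgboost], ignore_index=True)

# Row with the highest test R2 within each method.
best = outcomes.loc[outcomes.groupby('method')['test_r2'].idxmax()]

# Gain in test R2 of each approach over its method's "No selection" baseline.
baseline = (outcomes[outcomes['features_selection'] == 'No selection']
            .set_index('method')['test_r2'])
outcomes['r2_gain_vs_no_selection'] = outcomes['test_r2'] - outcomes['method'].map(baseline)

print(best[['method', 'features_selection', 'test_r2']])
```

Measuring each approach against its own method's "No selection" row keeps the comparison within a single estimator, which matters here because the Lasso and XGBoost test R² values sit at different levels.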


<a id='data_vis'></a>

### Visualizing the outcomes

<a id='metric_by_approach'></a>

#### Performance metric by features selection approach

In [16]:
plot_outcomes(outcomes_lasso=outcomes_lasso, outcomes_xgboost=outcomes_xgboost, plot='metric_by_approach')

<a id='ratio_by_approach'></a>

#### Ratio between performance metric and running time by features selection approach

In [17]:
plot_outcomes(outcomes_lasso=outcomes_lasso, outcomes_xgboost=outcomes_xgboost, plot='ratio_by_approach')

<a id='metric_by_time'></a>

#### Performance metric against running time

In [18]:
plot_outcomes(outcomes_lasso=outcomes_lasso, outcomes_xgboost=outcomes_xgboost, plot='metric_by_time')

<a id='metric_by_num_feats'></a>

#### Performance metric against the number of selected features

In [19]:
plot_outcomes(outcomes_lasso=outcomes_lasso, outcomes_xgboost=outcomes_xgboost, plot='metric_by_num_feats')