# ML Workflow - Feature Selection & Engineering

![Image](./img/scikit_learn.png)

In [None]:
# imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date

from sklearn import datasets, ensemble
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

## Feature Selection

Feature selection has three main goals:

- Improve the model's predictive accuracy on new data.

- Reduce computational cost.

- Produce a more interpretable model.

![Image](./img/feature_selection.jpg)

---

### Manual feature selection

- data analysis
- intuition
- domain knowledge

In [None]:
# Diabetes dataset

diabetes = datasets.load_diabetes(as_frame=True)
description = diabetes.DESCR

diabetes = diabetes['data'].merge(diabetes['target'], left_index=True, right_index=True)
diabetes

In [None]:
diabetes.info()

In [None]:
diabetes.describe()

In [None]:
print(description)

In [None]:
# Vehicles dataset

vehicles = pd.read_csv('./datasets/vehicles.csv')
vehicles

In [None]:
vehicles.info()

In [None]:
vehicles.describe()

---

### Manual feature selection - [Correlation](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/a-comparison-of-the-pearson-and-spearman-correlation-methods/)

A correlation coefficient measures the extent to which two variables tend to change together. The coefficient describes both the strength and the direction of the relationship.

__Pearson:__ The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.

__Spearman:__ The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate.

We will use [`pandas.Series.corr`](https://pandas.pydata.org/docs/reference/api/pandas.Series.corr.html) and [`pandas.DataFrame.corr`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html).
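
The difference between the two coefficients shows up most clearly on a relationship that is monotonic but not linear. The following is a minimal sketch on made-up data (the `x` and `y` columns are assumptions, not part of the notebook's datasets): because `y` always grows with `x`, the Spearman coefficient is 1, while the Pearson coefficient is lower because the rate of change is not constant.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): y grows monotonically with x,
# but not at a constant rate.
toy = pd.DataFrame({'x': np.arange(1, 21, dtype=float)})
toy['y'] = toy['x'] ** 3

# DataFrame.corr returns the full correlation matrix; pick the x/y entry.
pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```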

In [None]:
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(14,7))
axs[0].scatter(diabetes['s1'], diabetes['s2'], c='green')
axs[0].set(xlabel='s1', ylabel='s2', title='Diabetes')
axs[1].scatter(vehicles['City MPG'], vehicles['Highway MPG'])
axs[1].set(xlabel='City MPG', ylabel='Highway MPG', title='Vehicles');

In [None]:
# Pearson

print(vehicles['City MPG'].corr(vehicles['Highway MPG'], method='pearson'))
print(diabetes['s1'].corr(diabetes['s2'], method='pearson'))

In [None]:
# Spearman

print(vehicles['City MPG'].corr(vehicles['Highway MPG'], method='spearman'))
print(diabetes['s1'].corr(diabetes['s2'], method='spearman'))

In [None]:
diabetes_correlation = diabetes.corr()
diabetes_correlation

In [None]:
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(1, 1, 1)
sns.heatmap(diabetes.corr(method='pearson'), annot=True, fmt='.2f', ax=ax);

In [None]:
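# numeric_only=True skips text columns such as 'Drivetrain' (required with pandas >= 2.0)
vehicles_correlation = vehicles.corr(numeric_only=True)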
vehicles_correlation

In [None]:
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(1, 1, 1)
sns.heatmap(vehicles.corr(method='pearson', numeric_only=True), annot=True, fmt='.2f', ax=ax);

__We may want to remove a feature from the training phase because:__

- It is highly correlated with another feature in the data set. In that case both features provide essentially the same information, and some algorithms are sensitive to correlated features.

- It provides little to no information. An example would be a feature where most examples have the same value.

- It has little to no statistical relationship with the target variable.
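
The first criterion can be checked mechanically from the correlation matrix. The sketch below is one possible approach, not a standard API: `highly_correlated_features` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the data is made up. It flags every column whose absolute Pearson correlation with an earlier column exceeds the threshold.

```python
import numpy as np
import pandas as pd

def highly_correlated_features(df, threshold=0.9):
    """Return columns whose absolute Pearson correlation with an
    earlier column exceeds `threshold` (hypothetical helper)."""
    corr = df.select_dtypes('number').corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy data (made up): 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.01, size=200),
                    'c': rng.normal(size=200)})

to_drop = highly_correlated_features(toy, threshold=0.9)
reduced = toy.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Remaining:", reduced.columns.tolist())
```

Which feature of a correlated pair to keep is a judgment call. This sketch keeps whichever comes first in column order, which may not be the best choice for a given problem.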

---

### Automated feature selection - [Variance threshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html)

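This method takes a threshold value and, when fitted to a feature set, removes every feature whose variance does not exceed that threshold. The default threshold is 0, which removes only features with zero variance, in other words features where all values are the same. Because variance depends on the scale of each feature, any non-zero threshold is only meaningful when the features are on comparable scales.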
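
As a quick illustration, here is a minimal sketch on a made-up matrix (the names `X_toy` and `toy_selector` are assumptions for this example). The middle column is constant, so the default threshold removes it and keeps the other two.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (made up): the middle column is constant, so its variance is 0.
X_toy = np.array([[1.0, 5.0, 0.1],
                  [2.0, 5.0, 0.4],
                  [3.0, 5.0, 0.2],
                  [4.0, 5.0, 0.3]])

toy_selector = VarianceThreshold(threshold=0.0)
X_reduced = toy_selector.fit_transform(X_toy)

# get_support(indices=True) lists the positions of the columns that were kept.
kept = toy_selector.get_support(indices=True)
print("Kept columns:", kept)
print("Shape:", X_toy.shape, "->", X_reduced.shape)
```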

In [None]:
# Diabetes dataset

X = diabetes.drop('target', axis=1)
y = diabetes['target']

selector = VarianceThreshold(threshold=0.0)
print("Original feature shape:", X.shape)
new_X = selector.fit_transform(X)
print("Transformed feature shape:", new_X.shape)

In [None]:
# Vehicles dataset

X = vehicles[['Engine Displacement', 
              'Cylinders', 
              'Fuel Barrels/Year', 
              'City MPG', 
              'Highway MPG', 
              'Combined MPG', 
              'CO2 Emission Grams/Mile']]
y = vehicles['Fuel Cost/Year']

selector = VarianceThreshold(threshold=0.0)
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X)
print("Transformed feature shape:", new_X.shape)

In [None]:
#Breast cancer dataset

breast_cancer = datasets.load_breast_cancer()
X = breast_cancer['data']
y = breast_cancer['target']

selector = VarianceThreshold(threshold=0.0)
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X)
print("Transformed feature shape:", new_X.shape)

---

### Automated feature selection - [Univariate feature selection](https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php)

The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are statistically significantly different from each other. Specifically, it tests the null hypothesis:

![Image](./img/anova.JPG)

where µ = group mean and k = number of groups. If the one-way ANOVA returns a statistically significant result, we reject the null hypothesis in favor of the alternative hypothesis (HA), which is that at least two group means are statistically significantly different from each other.

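The scikit-learn implementation is [`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), which keeps the `k` features with the highest scores under a given scoring function: `f_classif` (the ANOVA F-test, and the default) for classification targets, or `f_regression` for continuous targets.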
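
With `k='all'` every feature is kept, so the transformed shape equals the original one. To see the selection actually take effect, a smaller `k` is needed. The following sketch uses synthetic regression data (made up for this example, with assumed names `X_syn`, `y_syn`, and `kbest`) in which only the first two columns influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data (made up): only the first two columns drive y.
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(300, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=300)

# Keep the 2 features with the highest univariate F-statistic.
kbest = SelectKBest(score_func=f_regression, k=2)
X_best = kbest.fit_transform(X_syn, y_syn)

selected = kbest.get_support(indices=True)
print("F-scores:", np.round(kbest.scores_, 1))
print("Selected columns:", selected)
print("Shape:", X_syn.shape, "->", X_best.shape)
```

The fitted `scores_` attribute is worth inspecting even when `k='all'`, since it ranks the features without discarding any of them.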

In [None]:
# Diabetes dataset

X = diabetes.drop('target', axis=1)
y = diabetes['target']

selector = SelectKBest(score_func=f_regression, k='all')
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X, y)
print("Transformed feature shape:", new_X.shape)

In [None]:
# Vehicles dataset

X = vehicles[['Engine Displacement', 
              'Cylinders', 
              'Fuel Barrels/Year', 
              'City MPG', 
              'Highway MPG', 
              'Combined MPG', 
              'CO2 Emission Grams/Mile']]
y = vehicles['Fuel Cost/Year']

selector = SelectKBest(score_func=f_regression, k='all')
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X, y)
print("Transformed feature shape:", new_X.shape)

In [None]:
#Breast cancer dataset

breast_cancer = datasets.load_breast_cancer()
X = breast_cancer['data']
y = breast_cancer['target']

selector = SelectKBest()
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X, y)
print("Transformed feature shape:", new_X.shape)

---

## Feature Engineering

- Manual

- Automated (e.g.: [Featuretools](https://www.featuretools.com/))

![Image](./img/feature_engineering.jpg)

There are two main reasons to engineer new features:

- __The information contained within the feature is stronger if the data is aggregated or represented in a different way__. An example is a feature containing the age of a person: aggregating the ages into buckets or bins may better represent the relationship to the target.

- __A feature on its own does not have a strong enough statistical relationship with the target but when combined with another feature has a meaningful relationship__. Let’s say we have a data set that has a number of features based on credit history for a group of customers and a target that denotes if they have defaulted on a loan. Suppose we have a loan amount and a salary value. If we combined these into a new feature called “loan to salary ratio” this may give more or better information than those features alone.
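
Both ideas can be sketched in a few lines of pandas. The records below are made up for illustration, and the column names (`age_group`, `loan_to_salary`) and bin edges are assumptions rather than recommended values.

```python
import pandas as pd

# Made-up customer records, for illustration only.
customers = pd.DataFrame({
    'age': [22, 35, 47, 58, 71],
    'salary': [20000, 45000, 60000, 52000, 30000],
    'loan_amount': [10000, 90000, 30000, 156000, 15000],
})

# Reason 1: represent a feature differently by binning ages.
# Intervals are right-inclusive: (0, 30], (30, 50], (50, 120].
customers['age_group'] = pd.cut(customers['age'],
                                bins=[0, 30, 50, 120],
                                labels=['young', 'middle', 'senior'])

# Reason 2: combine two features into a ratio.
customers['loan_to_salary'] = customers['loan_amount'] / customers['salary']

print(customers)
```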

### Manual feature engineering

- Binning

- Extracting date

In [None]:
# Numerical binning

vehicles['num_bin'] = pd.cut(vehicles['Fuel Cost/Year'], bins=3, labels=["Low", "Mid", "High"])
vehicles['num_bin'].unique()
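
Note that `pd.cut` with an integer `bins` argument creates bins of equal width, so skewed data can leave some bins nearly empty. `pd.qcut` instead splits at quantiles so that each bin holds roughly the same number of rows. A small sketch on made-up values (the `costs` series is an assumption, not taken from the vehicles data):

```python
import pandas as pd

# Made-up yearly costs with one large outlier.
costs = pd.Series([500, 600, 700, 800, 900, 5000])

# Equal-width bins: the outlier stretches the range, so most rows land in "Low".
equal_width = pd.cut(costs, bins=3, labels=["Low", "Mid", "High"])

# Quantile bins: each bin receives about the same number of rows.
equal_freq = pd.qcut(costs, q=3, labels=["Low", "Mid", "High"])

print(pd.DataFrame({'cost': costs, 'cut': equal_width, 'qcut': equal_freq}))
```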

In [None]:
# Categorical binning

vehicles['Drivetrain'].unique()

In [None]:
def cat_bin(x):
    if '4' in x:
        return '4x4'
    else:
        return '4x2'

In [None]:
vehicles['cat_bin'] = vehicles['Drivetrain'].apply(cat_bin)
vehicles['cat_bin'].unique()

In [None]:
# Extracting date

data = pd.DataFrame({'date':['01-01-2022',
                             '10-01-2019',
                             '13-08-2002',
                             '28-09-1995',
                             '13-01-1981']})

data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

In [None]:
#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting years elapsed since the date (difference between calendar years)
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting months elapsed since the date (difference between calendar months)
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

In [None]:
data
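
The year difference above compares calendar years only, so it counts a year as elapsed even if the anniversary has not yet occurred. If full elapsed years are needed, one possible refinement is sketched below. The fixed `reference` date is an assumption made so the result is reproducible; in practice it could be `pd.Timestamp.today()`.

```python
import pandas as pd

# Fixed reference date (an assumption, so the result is reproducible).
reference = pd.Timestamp('2024-06-15')
dates = pd.to_datetime(pd.Series(['13-08-2002', '13-01-1981']), format='%d-%m-%Y')

# True where the anniversary falls later in the reference year than the reference date.
not_yet = (dates.dt.month > reference.month) | (
    (dates.dt.month == reference.month) & (dates.dt.day > reference.day))

# Subtract one year for dates whose anniversary has not yet occurred.
full_years = reference.year - dates.dt.year - not_yet.astype(int)
print(full_years.tolist())
```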

---