# Pragmatic Model Evaluation 

## Feature Selection and Engineering

### Motivation

In certain contexts, it is very important to select and/or engineer useful features to feed into your ML models.

If you feed your Machine Learning methods poor features from your dataset, you can expect to get poor models out.
(The cliché is "Garbage in, Garbage out").

So, we need to identify or even create independent variables that can inform us about the dependent variable. The set of inputs *X* should predict the target variable *y*. Things that do not relate to *y*, like noise or constants, should not be included.

The process of identifying which of our features we should use as input to our model is known as *feature selection*. 

Some datasets can have thousands, millions, or even billions of features (for example, when working with genome data). Even with smaller datasets, some ML algorithms (e.g. linear regression) perform poorly when correlated or noisy variables are included, and can greatly benefit from feature selection. More generally, models made with all available features may be over-complicated, generalise poorly and be hard to interpret. ML models should be parsimonious: as simple as possible, with low error. Otherwise, models get misinterpreted, take too long to run, or fit the noise (over-fit)!

The process of creating new features from the data is known as *feature engineering*. This can include:
+ multiplying columns together in tabular data
+ highlighting edges or regions of images 
+ PCA: rotating the axes you are using to more sensible ones and selecting information-rich features
+ transforming variables by applying arbitrary functions to them
+ making sure all required variables are available, not just using those readily available.

Good feature engineering can contribute to explainability, speed up training, and decrease over-fitting.
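
Two of the operations listed above (multiplying columns together and transforming a variable with a function) can be sketched on a small toy table. This is only an illustration: the column names `width_m`, `length_m` and `income` are made up and do not come from the dataset used later in this practical.

```python
import numpy as np
import pandas as pd

# Toy data: the column names are hypothetical, not from the Boston dataset.
toy = pd.DataFrame({
    "width_m": [4.0, 5.0, 6.0, 8.0],
    "length_m": [5.0, 6.0, 7.0, 10.0],
    "income": [20_000.0, 35_000.0, 50_000.0, 200_000.0],
})

# Multiply two columns together to make a (possibly) more informative feature.
toy["area_m2"] = toy["width_m"] * toy["length_m"]

# Transform a skewed variable by applying a function to it (here, a base-10 log).
toy["log_income"] = np.log10(toy["income"])

print(toy)
```

Whether an engineered feature such as `area_m2` actually helps depends on the model and the data, so it is worth checking after training.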

In fact, feature selection and engineering are often applied iteratively: 
Get features --> make ML model --> model optimization --> improve features --> make new model --> improve features...

In [None]:
# Load relevant libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Getting to know this dataset

## Dataset: Boston Housing Data

**Dependent Variable**: 

MEDV: Median value of owner-occupied homes in 1000's of dollars

**Explanatory Variables**

CRIM: per capita crime rate by town

ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: proportion of non-retail business acres per town

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX: nitric oxides concentration (parts per 10 million)

RM: average number of rooms per dwelling

AGE: proportion of owner-occupied units built prior to 1940

DIS: weighted distances to five Boston employment centres

RAD: index of accessibility to radial highways

TAX: full-value property-tax rate per 10,000 dollars

PTRATIO: pupil-teacher ratio by town

B: 1000(Bk - 0.63)^2 where Bk is the proportion of black residents by town

LSTAT: percentage of lower status of the population

In [None]:
df = pd.read_csv("data/boston_housing.csv")

df.head(10)

## Evaluating correlation between variables

**Q1**. Use the `pairplot` function of `seaborn` to get a matrix of scatter plots showing the relationships between the variables.

In [None]:
# Your code here...


**Q2**. Calculate the correlation coefficient for variable AGE and the target, MEDV.

In [None]:
from scipy.stats import pearsonr

# Your code here...


**Q3**. Calculate the correlation coefficient for variable RM and the target, MEDV.

In [None]:
# Your code here...


**Q4**. Identify the best features using the Kendall Tau correlation coefficient.

Hint: you can use the `corr` function, where the method can be specified as `'kendall'`.

In [None]:
# Your code here...


## Removing variables with low information content

ML models rarely need every available variable, so we should remove those that carry little information.

A) A simple way to do this is to remove those that are not correlated with the target variable (e.g. MEDV in the example above).

B) Another way is to remove variables that correlate strongly with each other (remove all but one).
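
Both ideas, (A) and (B), can be sketched on synthetic data. This is a minimal illustration rather than the expected solution to the exercises below: the column names and the thresholds (0.3 and 0.8) are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
toy = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 + rng.normal(scale=0.01, size=n),  # near-duplicate of x1
    "noise": rng.normal(size=n),                      # unrelated to y
})
toy["y"] = 2.0 * toy["x1"] + rng.normal(scale=0.5, size=n)

# (A) Keep features whose absolute correlation with the target is high enough.
target_corr = toy.drop(columns="y").corrwith(toy["y"]).abs()
kept = target_corr[target_corr >= 0.3].index.tolist()

# (B) Among the kept features, look at each pair once (upper triangle of the
# correlation matrix) and drop the later column of any highly correlated pair.
corr = toy[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
selected = [col for col in kept if col not in to_drop]

print(selected)  # ['x1']
```

Here `noise` is removed by step (A) and `x1_copy` by step (B), leaving only `x1`.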


PCA is a robust method for removing dimensions (features) that have linear correlations with each other.

**Q4b**. Remove the variables that have a low correlation with MEDV: remove any variable whose absolute correlation with the target variable is less than 0.03.

For this, write a function named `keep_predictors` that takes 3 arguments: `df`, `targ` (the target variable) and `cor_thresh` (the correlation threshold).

_Hints: Import `deepcopy` from `copy` so that you can work on a copy without changing the original data frame, and remember to use `abs()` values._

In [None]:
# Your code here...
# def keep_predictors(df, targ, cor_thresh):


In [None]:
targ = df["MEDV"]
df_redu = keep_predictors(df,targ,0.03)
df_redu.head()

**Q5**. Remove one of a pair of the remaining variables that have high correlations with each other.

For this purpose, write a function named `remove_correlated` that takes 2 arguments: `df` and `cor_thresh1` (the correlation threshold).


In [None]:
# Your code here...
# def remove_correlated(df, cor_thresh1):


You may test the function, `remove_correlated` with the following code:

In [None]:
cor_thresh1 = 0.8
boston_minimal = remove_correlated(df, cor_thresh1)
boston_minimal.head(14)

# Principal Component Analysis (PCA)

## PCA and z-score normalisation.

PCA looks for the directions of greatest variance, so it can be biased by the scale of the values (variables with larger values can easily have more variance). Hence the data need to be normalised before applying PCA.
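
The effect of scale can be seen in a small sketch with two independent synthetic features, one of which has much larger values than the other. The data and variable names here are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Two independent synthetic features on very different scales
small = rng.normal(loc=0.0, scale=1.0, size=n)
large = rng.normal(loc=0.0, scale=1000.0, size=n)
X = np.column_stack([small, large])

# PCA on the raw values: the large-scale feature dominates the first PC.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after z-score normalisation: both features contribute about equally.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

Without scaling, the first PC explains almost all the variance simply because one feature is measured in bigger numbers, not because it is more informative.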

**Q6**. Drop the target variable, "MEDV" from the original dataset, and call it `boston_x`.

In [None]:
# Your code here...


**Q7**. Use `StandardScaler` from `sklearn.preprocessing`, to transform the input data to z-score values.

In [None]:
# Your code here...


**Q8**. Now, use Principal Component Analysis (PCA) to reduce the dimensionality of this dataset to 5 Principal Components (PCs).

`PCA` can be imported from the `decomposition` module of `sklearn`.

Save outputs of your PCA to the variable `pca`.

In [None]:
# Your code here...


**Q9**. Get the explained variance ratios of these first few PCs.

Hint: use the attribute called `explained_variance_ratio_`.

Print their sum below too.

In [None]:
# Your code here...


In [None]:
PC_values = np.arange(pca.n_components_) + 1
plt.plot(PC_values, pca.explained_variance_ratio_, 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.show()

So the PCs selected don't explain all the variance, but should explain most of it.

The Pareto Principle says that most of the results can often be achieved by focusing on a small share of the work.

This is also called the 80-20 Rule. For example, you can get 80% of the success by focusing on only 20% of the tasks.

Or, even better, 90% of the success by focusing on only 15% of the tasks.
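
In PCA terms, this means asking how few PCs are needed to reach a given share of the variance. A minimal sketch of that idea, using synthetic data and an arbitrary 90% threshold (both are assumptions for illustration), is:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 observed features driven by 3 latent factors plus a little noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + rng.normal(scale=0.1, size=(500, 10))

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of PCs whose cumulative ratio reaches the chosen threshold
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

print(cumulative.round(3))
print("PCs needed for 90% of the variance:", n_needed)
```

Because only three latent factors generate this data, a handful of PCs is enough to cover nearly all of the variance in the ten features.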

**Bonus**

If you would like to know what went into making the PCs, the following exercise will focus on that.

What factors have the most variance in this dataset?
(Could be related to predicting MEDV, the median values of houses.)

Giving you the answer here, see seralouk's answer:

https://stackoverflow.com/questions/22984335/recovering-features-names-of-explained-variance-ratio-in-pca-with-sklearn

**Q10 bonus**. Extract the features that contribute most to each PC.

_Hint: `pca.components_` may be helpful._

In [None]:
# Your code here...


Some of these features might be important in more than one principal component.

Remember that recovering these un-rotated variables is not the point of PCA. The point is to rotate the data to a more variance-centric view, and to remove what is not needed by keeping only a small number of components instead of all the available variables.

## Mutual Information (MI)

PCA is a good way to remove correlated variables, but it only captures linear relationships between variables.

Mutual Information (MI), on the other hand, is a method that can detect **any** kind of statistical dependence between variables, including non-linear ones.

MI measures how much certainty can be gained about Variable 2 by knowing Variable 1.
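
As a rough illustration of the difference from linear correlation, the sketch below builds a synthetic non-linear relationship ($y = x^2$ plus noise, an assumption chosen for the example) and compares the Pearson coefficient with an MI estimate from `mutual_info_regression`.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = x ** 2 + rng.normal(scale=0.05, size=1000)  # non-linear, symmetric dependence

# Pearson only measures linear association, so it is close to zero here.
r, _ = pearsonr(x, y)

# The MI estimate (in nats) is clearly above zero for the same data.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r = {r:.3f}, MI = {mi:.3f} nats")
```

Note that `mutual_info_regression` returns an estimate based on nearest neighbours, so the exact value varies with the sample and the `random_state`.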

The amount of 'uncertainty' is measured by entropy.
Entropy is a fundamental measure used in Information Theory: it is the average amount of information, or uncertainty, in a variable's possible outcomes.

Entropy is the expected value of the information content. The information content (Shannon information, or "level of surprise") of an outcome signifies how surprising that outcome is, and entropy is the average of this over all possible outcomes.

Information content has units of bits or *shannons*. If an event is unlikely, then it is more surprising and informative when it does happen. So, the value of information content decreases as the probability of its occurrence increases.

Information content, $I$, is defined as
$I(E) = -\log_2 P(E)$
where $E$ is an event, $P(E)$ is the probability of that event happening, and $\log_2$ is the logarithm to base 2.
See https://en.wikipedia.org/wiki/Entropy_(information_theory) for more detail.
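
These definitions can be checked with a few lines of NumPy. The helper names `information_content` and `entropy` are made up for this sketch and are not library functions.

```python
import numpy as np

def information_content(p):
    """Shannon information, in bits, of an event with probability p."""
    return -np.log2(p)

def entropy(probs):
    """Entropy in bits: the expected information content of a distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # outcomes with p = 0 contribute nothing
    return float(np.sum(probs * information_content(probs)))

print(information_content(0.5))    # 1.0 bit
print(information_content(0.125))  # 3.0 bits: rarer events are more surprising
print(entropy([0.5, 0.5]))         # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))         # about 0.469 bits: a biased coin is less uncertain
```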

MI is never negative: it is zero when the two variables are independent, and it has no fixed upper bound.
$$I(X;Y) = D_{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y)$$

where $D_{KL}$ is the Kullback-Leibler divergence, $P_{X,Y}$ is the joint distribution and $P_X$ and $P_Y$ are the marginal distributions.

For more on marginal distributions, see https://en.wikipedia.org/wiki/Marginal_distribution
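
For two discrete variables, this definition can be computed directly from a table of counts. The sketch below uses made-up labels, and compares the manual result with `mutual_info_score` from `sklearn.metrics`, which uses the natural logarithm and therefore reports MI in nats rather than bits.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Two discrete variables observed together (synthetic labels)
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Joint distribution P(X, Y) estimated from counts, and its two marginals
joint = np.zeros((2, 2))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()
p_x = joint.sum(axis=1, keepdims=True)  # column vector of P(X)
p_y = joint.sum(axis=0, keepdims=True)  # row vector of P(Y)

# KL divergence between the joint and the product of the marginals (in nats)
mask = joint > 0
mi_manual = float(np.sum(joint[mask] * np.log(joint[mask] / (p_x * p_y)[mask])))

mi_sklearn = mutual_info_score(x, y)
print(mi_manual, mi_sklearn)
```

The two values should agree up to floating-point error. Dividing by `np.log(2)` would convert nats to bits.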

**Q11**. 
Calculate the mutual information score between variables df["RM"] and df["CRIM"].

_Hint: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html_

In [None]:
# Your code here...


**Q12**. Get the mutual information scores for all variables in `df` vs the target, "MEDV" and print them out. 

Hint: for this, you may use the `mutual_info_regression` function from `sklearn.feature_selection`.

In [None]:
# Your code here...


Interpretation: 
LSTAT has the most mutual information with MEDV, the target. 

Credit: Kaggle Mutual Information

## Additional Material: Genetic Algorithms: evolving features

Genetic Algorithms can also be an effective and cool tool to select features.

You may find out more about them here: https://pypi.org/project/sklearn-genetic/#description

### Extra tool you could read about if you like:
A tool you can use for generating new features from original ones:

[tsfresh](https://tsfresh.readthedocs.io/en/latest/text/quick_start.html)

# Conclusions

During the lesson, we have seen some examples of feature selection carried out before the ML models are trained.

In practice, however, feature engineering should be an iterative process.

So, the features are improved after the ML model is trained, and then a new ML model is trained. Iterate!

Also, don't forget to tune the hyperparameters of the ML model.

In this practical we have seen how feature selection can be done in 3 different ways:

+ Correlation-based filtering
+ Principal Component Analysis (PCA)
+ Mutual Information
