## Variable selection 6: mutual information

This notebook goes with the blog post: Variable selection in Python, part II.

### Preliminaries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import normalized_mutual_info_score, mutual_info_score

### Data loading, dropping production

In [2]:
data = pd.read_csv('../data/Table2_Hunt_2013_edit.csv')

In [3]:
data = data.loc[:, ['Production', 'Position', 'Gross pay', 'Phi-h', 'Pressure', 'Random 1', 'Random 2', 'Gross pay transform']]
data.head()

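| | Production | Position | Gross pay | Phi-h | Pressure | Random 1 | Random 2 | Gross pay transform |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.1 | 2.1 | 0.1 | 0.5 | 19 | 5 | 379 | 3.54 |
| 1 | 21.3 | 1.1 | 1.0 | 4.0 | 16 | 13 | 269 | 5.79 |
| 2 | 22.75 | 1.0 | 1.9 | 19.0 | 14 | 12 | 245 | 8.51 |
| 3 | 15.72 | 2.1 | 3.1 | 21.7 | 17 | 6 | 273 | 11.52 |
| 4 | 7.71 | 2.9 | 4.1 | 24.6 | 11 | 10 | 237 | 10.16 |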


### Mutual information

One of the methods available in Scikit-learn to perform [univariate feature selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) in a regression context is [Mutual information regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression): with this method one can estimate how much information the presence (or, conversely, the absence) of each feature, taken individually, contributes to the prediction of the target.
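
As a minimal sketch of how these scores fit into univariate selection, they can be passed to Scikit-learn's `SelectKBest`. The toy data, the names `X_toy` and `y_toy`, and the choice of `k=2` are assumptions made for illustration only; they are not part of the dataset used in this notebook.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(1)

# Hypothetical toy data: 4 features, only columns 0 and 2 drive the target
X_toy = rng.normal(size=(1000, 4))
y_toy = X_toy[:, 0] + 1.5 * np.sin(3.0 * X_toy[:, 2]) + 0.1 * rng.normal(size=1000)

# Fix random_state so the mutual information estimates are repeatable
score_func = partial(mutual_info_regression, random_state=0)

# Keep the k=2 features with the highest mutual information scores
selector = SelectKBest(score_func=score_func, k=2).fit(X_toy, y_toy)
selected = selector.get_support(indices=True)

print("scores:", selector.scores_.round(2))
print("selected columns:", selected)
```

Each feature is scored on its own, so a feature that is only useful in combination with another one could still be ranked low.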

An advantage of mutual information over conventional statistics such as the F-test is that it is nonparametric and can capture any kind of statistical dependency between variables, whereas the F-test only detects linear dependency, as shown [in this example](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html). This reminds me of the comparison between [distance correlation](https://en.wikipedia.org/wiki/Distance_correlation) and [correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence).
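
To make the difference concrete, here is a small sketch on synthetic data (not the dataset used in this notebook). The feature names and the form of the target are assumptions chosen so that one feature acts linearly, one acts through a symmetric quadratic term, and one is unrelated to the target.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 1000

# Three hypothetical features, all uniform on [-1, 1]
x_linear = rng.uniform(-1, 1, n_samples)
x_quadratic = rng.uniform(-1, 1, n_samples)
x_unrelated = rng.uniform(-1, 1, n_samples)
X_syn = np.column_stack([x_linear, x_quadratic, x_unrelated])

# Target: linear in the first feature, symmetric quadratic in the second
y_syn = 0.5 * x_linear + x_quadratic ** 2

# F-test scores (first return value) and mutual information estimates
f_scores, _ = f_regression(X_syn, y_syn)
mi_scores = mutual_info_regression(X_syn, y_syn, random_state=0)

for name, f, m in zip(["linear", "quadratic", "unrelated"], f_scores, mi_scores):
    print(f"{name:>10}: F = {f:8.2f}, MI = {m:.2f}")
```

Because the quadratic term is symmetric around zero, it is nearly uncorrelated with its feature, so the F-test should score that feature far below the linear one, while mutual information should still separate it from the unrelated feature.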


In [4]:
X, y = data.drop('Production',axis=1), data['Production']

The normalizing and sorting of mutual information scores in the cell below is from [Feature Selection for Subsurface Data Analytics](https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/SubsurfaceDataAnalytics_Feature_Ranking.ipynb) by [Michael Pyrcz](https://github.com/GeostatsGuy).

In [5]:
mi = mutual_info_regression(X, y)  # one score per feature
mi /= np.amax(mi)  # normalize so the top score is 1
idx = np.argsort(mi)[::-1]  # feature indices, highest score first
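
One thing to be aware of: `mutual_info_regression` adds a small amount of random noise to continuous features, so the scores can change slightly from run to run unless `random_state` is set. The sketch below illustrates this on made-up data; `X_demo`, `y_demo` and the seed values are assumptions for illustration, and the cell above does not set a seed.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)

# Hypothetical data: only the first of three features is informative
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Same random_state, same estimates
mi_a = mutual_info_regression(X_demo, y_demo, random_state=0)
mi_b = mutual_info_regression(X_demo, y_demo, random_state=0)
print("identical with fixed seed:", np.array_equal(mi_a, mi_b))

# Same normalizing and sorting steps as in the cell above
mi_norm = mi_a / np.amax(mi_a)
order = np.argsort(mi_norm)[::-1]
print("normalized scores, highest first:", mi_norm[order].round(2))
```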

As in my previous notebooks, I store the results to a convenient Pandas DataFrame.

In [6]:
mi_df = pd.DataFrame({"mutual information": mi[idx], "features": X.columns[idx]})

In [7]:
mi_df.round(2)

| | mutual information | features |
|---|---|---|
| 0 | 1.0 | Gross pay |
| 1 | 0.85 | Phi-h |
| 2 | 0.59 | Gross pay transform |
| 3 | 0.23 | Position |
| 4 | 0.16 | Random 2 |
| 5 | 0.02 | Pressure |
| 6 | 0.0 | Random 1 |
