# Mutual Information

Mutual information (MI) is a measure of the mutual dependence between two random variables. It quantifies the "amount of information" obtained about one random variable by observing the other ([Wikipedia](https://en.wikipedia.org/wiki/Mutual_information)).

$$ I(X;Y) = H(X)-H(X|Y) = \sum_{x \in X}\sum_{y \in Y} P(x, y) \log(\frac{P(x,y)}{P(x) P(y)}) $$

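where
* $I(X;Y)$ is the mutual information between X and Y
* $H(X)$ is the entropy of X, and $H(X|Y)$ is the conditional entropy of X given Y
* $P(X, Y)$ is the joint probability of X and Y occurring together
* $P(X)$ is the marginal probability of X
* $P(Y)$ is the marginal probability of Y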
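
The equality between the entropy form and the double sum follows from the definitions of entropy and conditional entropy. A short derivation, using $P(x) = \sum_{y} P(x, y)$ and $P(x|y) = P(x, y) / P(y)$, could read:

$$
\begin{aligned}
H(X) - H(X|Y) &= -\sum_{x \in X} P(x) \log P(x) + \sum_{x \in X}\sum_{y \in Y} P(x, y) \log P(x|y) \\
&= \sum_{x \in X}\sum_{y \in Y} P(x, y) \left[ \log \frac{P(x, y)}{P(y)} - \log P(x) \right] \\
&= \sum_{x \in X}\sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}
\end{aligned}
$$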

It measures how much the joint distribution $P(X, Y)$ differs from the product of the marginal distributions $P(X)$ and $P(Y)$.

If X and Y are independent, then $P(X, Y)=P(X)P(Y)$, so every term has $\log(\frac{P_{(X, Y)}(x,y)}{P_X(x) P_Y(y)})=\log(1) = 0$ and the whole sum is zero. Therefore, **if X and Y are independent, the mutual information is zero.**

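On the other hand, if X and Y determine each other exactly (knowing one tells us the value of the other), then $H(X|Y) = 0$ and the mutual information equals the uncertainty in X, that is, its entropy $H(X)$, which in this case is also $H(Y)$.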
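
Both limiting cases can be checked numerically. The sketch below is only an illustration: the helper names `mutual_information` and `entropy` and the two small joint tables are made up for this example, and the natural logarithm is used, so the values are in nats. The deterministic case uses the one-to-one mapping Y = X.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table given as a list of rows."""
    p_x = [sum(row) for row in joint]          # marginal of X: row sums
    p_y = [sum(col) for col in zip(*joint)]    # marginal of Y: column sums
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:                       # convention: 0 * log(0) = 0
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


def entropy(p):
    """Entropy (in nats) of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)


# Hypothetical marginals, chosen only for illustration
p_x = [0.5, 0.3, 0.2]
p_y = [0.3, 0.3, 0.4]

# Independent case: the joint table is the product of the marginals
independent = [[a * b for b in p_y] for a in p_x]

# Deterministic one-to-one case: Y = X, so all mass sits on the diagonal
deterministic = [[0.5, 0.0, 0.0],
                 [0.0, 0.3, 0.0],
                 [0.0, 0.0, 0.2]]

mi_independent = mutual_information(independent)
mi_deterministic = mutual_information(deterministic)
h_x = entropy(p_x)

print(mi_independent)           # zero up to floating-point rounding
print(mi_deterministic, h_x)    # the two values agree
```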

Example: 

Suppose we have a joint distribution table like the one below:

.        | Y=0 | Y=1 | Y=2 | Marginals
---      | --- | --- | --- | ---
X=0      | 0.2 | 0.1 | 0.2 | 0.5
X=1      | 0   | 0.2 | 0.1 | 0.3
X=2      | 0.1 | 0   | 0.1 | 0.2
Marginals| 0.3 | 0.3 | 0.4 | 1

The mutual information between X and Y is:

$$
\begin{aligned}
I(X;Y) ={}& p(0,0) \log \frac{p(0, 0)}{p_X(0)\,p_Y(0)}
+ p(0,1) \log \frac{p(0, 1)}{p_X(0)\,p_Y(1)}
+ p(0,2) \log \frac{p(0, 2)}{p_X(0)\,p_Y(2)} \\
&+ p(1,0) \log \frac{p(1, 0)}{p_X(1)\,p_Y(0)}
+ p(1,1) \log \frac{p(1, 1)}{p_X(1)\,p_Y(1)}
+ p(1,2) \log \frac{p(1, 2)}{p_X(1)\,p_Y(2)} \\
&+ p(2,0) \log \frac{p(2, 0)}{p_X(2)\,p_Y(0)}
+ p(2,1) \log \frac{p(2, 1)}{p_X(2)\,p_Y(1)}
+ p(2,2) \log \frac{p(2, 2)}{p_X(2)\,p_Y(2)} \\
={}& 0.2 \times \log \frac{0.2}{0.5 \times 0.3} + ... + 0.1 \times \log \frac{0.1}{0.2 \times 0.4}
\end{aligned}
$$

Here $p_X$ and $p_Y$ are the marginals from the last column and last row of the table. Cells with $p(x, y) = 0$ contribute nothing to the sum, by the convention $0 \log 0 = 0$.
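
The sum above can be evaluated directly from the table. The following is a minimal sketch in plain Python; the variable names are chosen for this example, and the natural logarithm is assumed, so the result is in nats (dividing by `math.log(2)` would give bits).

```python
import math

# Joint distribution from the table: rows are X = 0, 1, 2; columns are Y = 0, 1, 2
joint = [
    [0.2, 0.1, 0.2],
    [0.0, 0.2, 0.1],
    [0.1, 0.0, 0.1],
]

# Marginals: row sums give P(X), column sums give P(Y)
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over the non-zero cells
mi = sum(
    p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    for i, row in enumerate(joint)
    for j, p_xy in enumerate(row)
    if p_xy > 0  # convention: 0 * log(0) = 0
)

print(round(mi, 4))  # roughly 0.23 nats
```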


Steps for feature selection with mutual information:

* Step 1: determine the mutual information between each feature and the target
    - `mutual_info_classif` for classification
    - `mutual_info_regression` for regression
* Step 2: rank the features by their mutual information
* Step 3: select the top k ranked features
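
As a compact illustration of the three steps, here is a sketch on synthetic data. The dataset from `make_classification`, the `var_*` column names, and `k = 3` are assumptions made for this example only; they are not the data used in the rest of this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in data (an assumption for illustration only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(1, 11)])

# Step 1: estimate the mutual information between each feature and the target
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Step 2: rank the features from highest to lowest mutual information
ranking = mi.sort_values(ascending=False)

# Step 3: keep the names of the top k features
k = 3
top_k = list(ranking.index[:k])

print(ranking.head(k))
```

Fixing `random_state` makes the run repeatable, because the estimator adds a small amount of random noise to continuous features.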

In [1]:
import numpy as np
import pandas as pd 

import matplotlib.pyplot as plt 

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile

# Classification

In [2]:
data = pd.read_csv('../datasets/dataset_2.csv')

In [3]:
data.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_100,var_101,var_102,var_103,var_104,var_105,var_106,var_107,var_108,var_109
0,4.53271,3.280834,17.982476,4.404259,2.34991,0.603264,2.784655,0.323146,12.009691,0.139346,...,2.079066,6.748819,2.941445,18.360496,17.726613,7.774031,1.473441,1.973832,0.976806,2.541417
1,5.821374,12.098722,13.309151,4.125599,1.045386,1.832035,1.833494,0.70909,8.652883,0.102757,...,2.479789,7.79529,3.55789,17.383378,15.193423,8.263673,1.878108,0.567939,1.018818,1.416433
2,1.938776,7.952752,0.972671,3.459267,1.935782,0.621463,2.338139,0.344948,9.93785,11.691283,...,1.861487,6.130886,3.401064,15.850471,14.620599,6.849776,1.09821,1.959183,1.575493,1.857893
3,6.02069,9.900544,17.869637,4.366715,1.973693,2.026012,2.853025,0.674847,11.816859,0.011151,...,1.340944,7.240058,2.417235,15.194609,13.553772,7.229971,0.835158,2.234482,0.94617,2.700606
4,3.909506,10.576516,0.934191,3.419572,1.871438,3.340811,1.868282,0.439865,13.58562,1.153366,...,2.738095,6.565509,4.341414,15.893832,11.929787,6.954033,1.853364,0.511027,2.599562,0.811364


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 108), (15000, 108))

## Determine Mutual Information

In [None]:
# calculate the mutual information between the variables and the target 

mi = mutual_info_classif(X_train, y_train)
mi

In [None]:
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False).plot.bar(figsize=(20, 6))
plt.ylabel("Mutual Information")

## Select top K features based on MI

In [None]:
kb = SelectKBest(mutual_info_classif, k=10).fit(X_train, y_train)
kb.get_support()

In [None]:
# print out the feature names 
X_train.columns[kb.get_support()]

In [None]:
# remove the rest of the features
X_train = kb.transform(X_train)
X_test = kb.transform(X_test)

In [None]:
X_train

# Regression

In [None]:
# load dataset
data = pd.read_csv('../datasets/houseprice.csv')
data.shape

In [None]:
# ideally feature selection is done after categorical data encoding. 
# here, we will only use numeric data for simplicity 

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

In [None]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

In [None]:
# fill missing values

X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

## Determine Mutual Information

In [None]:
# determine the mutual information
mi = mutual_info_regression(X_train, y_train)

# and make a bar plot
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False).plot.bar(figsize=(20,6))
plt.ylabel('Mutual Information')

## Select features in the top 10th percentile

In [None]:
# select the features whose MI scores fall in the top 10th percentile
sel_ = SelectPercentile(mutual_info_regression, percentile=10).fit(X_train, y_train)

# display the features
X_train.columns[sel_.get_support()]
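
To make the effect of `percentile=10` concrete, here is a sketch on synthetic data. The dataset from `make_regression` and its sizes are assumptions for illustration, not the house price data. With 20 features, keeping the top 10th percentile leaves 2 of them.

```python
from functools import partial

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Synthetic stand-in data with 20 features (an assumption for illustration only)
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Fix the random seed of the MI estimator so the scores are repeatable
score_func = partial(mutual_info_regression, random_state=0)

# Keep the features whose scores are in the top 10 percent
selector = SelectPercentile(score_func, percentile=10).fit(X, y)

n_selected = int(selector.get_support().sum())
X_reduced = selector.transform(X)

print(n_selected, X_reduced.shape)
```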

In [None]:
# to remove the rest of the features:

X_train = sel_.transform(X_train)
X_test = sel_.transform(X_test)

X_train