# 5.8.4 Estimating Mutual Information between variables (mutual_info_classif and mutual_info_regression)

In [13]:
%load_ext autoreload
%autoreload 2



In [14]:
import numpy as np
from sklearn.datasets import make_blobs, make_regression
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

import warnings
warnings.filterwarnings("ignore")

link: https://www.youtube.com/watch?v=l_orN0tUBe0&list=PLEFpZ3YehTnCx0mS5OhPWb75RIxryBzws&index=5

The functions **mutual_info_classif** and **mutual_info_regression** estimate the mutual information between each explanatory variable and the dependent variable.

Mutual information measures the mutual dependence between two random variables: how much knowing one of them reduces the uncertainty about the other.

For discrete variables, mutual information is computed as:

$$ I(x,y) = \sum_{x} \sum_{y} \mathrm{Prob}(x,y) \, \log \frac{\mathrm{Prob}(x,y)}{\mathrm{Prob}(x) \cdot \mathrm{Prob}(y)} $$

where:

$ \mathrm{Prob}(x,y) $ is the joint probability of $x$ and $y$.

$ \mathrm{Prob}(x) $ and $ \mathrm{Prob}(y) $ are the marginal probabilities.

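This metric is based on the Kullback-Leibler divergence, which measures how much one probability distribution differs from another. Here the two distributions being compared are the joint distribution $\mathrm{Prob}(x,y)$ and the product of the marginals $\mathrm{Prob}(x) \cdot \mathrm{Prob}(y)$.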

The underlying assumption is that if there is no relationship between $x$ and $y$, the two variables are independent, so $\mathrm{Prob}(x,y) = \mathrm{Prob}(x) \cdot \mathrm{Prob}(y)$. Every logarithm in the sum is then $\log 1 = 0$, and therefore $I(x,y) = 0$.
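
As a minimal sketch of the discrete formula above, the following cell computes the mutual information directly from a joint probability table using NumPy. The helper name `discrete_mutual_information` and the two example tables are hypothetical and exist only for illustration. The natural logarithm is used, so the result is in nats.

```python
import numpy as np


def discrete_mutual_information(joint):
    """Mutual information (in nats) from a joint probability table.

    joint[i, j] is Prob(x = i, y = j); the entries are assumed to sum to 1.
    This is a hypothetical helper, not part of scikit-learn.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal Prob(x), shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal Prob(y), shape (1, n_y)
    product = px * py                      # Prob(x) * Prob(y)
    mask = joint > 0                       # terms with Prob(x, y) = 0 contribute 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


# Independent variables: the joint table is the product of the marginals.
independent = np.outer([0.3, 0.7], [0.4, 0.6])

# Fully dependent binary variables: y always takes the same value as x.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = discrete_mutual_information(independent)
mi_dependent = discrete_mutual_information(dependent)
print(mi_independent, mi_dependent)
```

For the independent table the result is zero up to floating-point error, while for the fully dependent table it equals $\log 2$, the entropy of a fair binary variable.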

## 5.8.4.1 Classification

link: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif


In [15]:
X, y = make_blobs(
    n_samples=150,
    n_features=2,
    centers=3,
    cluster_std=0.8,
    shuffle=False,
    random_state=12345,
)

#
# Note that x0 and x1 are informative, while x2 is a random variable
# with no explanatory power.
#
X = np.hstack((X, 2 * np.random.random((X.shape[0], 1))))
X.shape

(150, 3)

In [16]:
mutual_info = mutual_info_classif(
    # -------------------------------------------------------------------------
    # Feature matrix.
    X=X,
    # -------------------------------------------------------------------------
    # Target vector.
    y=y,
    # -------------------------------------------------------------------------
    # Number of neighbors to use for MI estimation for continuous variables.
    n_neighbors=3,
    # -------------------------------------------------------------------------
    # If bool, then determines whether to consider all features discrete or
    # continuous. With "auto", dense X is treated as continuous and sparse X
    # as discrete.
    discrete_features="auto",
    # -------------------------------------------------------------------------
    # Determines random number generation for adding small noise to continuous
    # variables in order to remove repeated values.
    random_state=None,
)

mutual_info

array([1.10530858, 0.9911345 , 0.        ])
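
One possible use of these scores is feature selection. The sketch below passes `mutual_info_classif` as the scoring function of `SelectKBest` to keep the two highest-scoring features. The names `X_demo`, `y_demo`, `rng` and `score_func`, and the fixed seeds, are assumptions added here so that the cell is self-contained and reproducible; they are not part of the original notebook.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Assumed seed, used only so the noise column is reproducible.
rng = np.random.default_rng(0)

# Same kind of data as above: two informative features plus one noise column.
X_demo, y_demo = make_blobs(
    n_samples=150,
    n_features=2,
    centers=3,
    cluster_std=0.8,
    shuffle=False,
    random_state=12345,
)
X_demo = np.hstack((X_demo, 2 * rng.random((X_demo.shape[0], 1))))

# Fix n_neighbors and random_state so repeated fits give the same scores.
score_func = partial(mutual_info_classif, n_neighbors=3, random_state=0)

# Keep the k=2 features with the highest estimated mutual information.
selector = SelectKBest(score_func=score_func, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

support = selector.get_support()
print(selector.scores_)
print(support)
print(X_selected.shape)
```

The boolean mask returned by `get_support()` marks the retained columns. With this data the noise column is expected to be the one discarded.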

## 5.8.4.2 Regression

In [17]:
X, y = make_regression(
    n_samples=300,
    n_features=4,
    n_informative=2,
    bias=0.0,
    tail_strength=0.9,
    noise=12.0,
    shuffle=False,
    coef=False,
    random_state=0,
)

In [18]:
mutual_info = mutual_info_regression(
    # -------------------------------------------------------------------------
    # Feature matrix.
    X=X,
    # -------------------------------------------------------------------------
    # Target vector.
    y=y,
    # -------------------------------------------------------------------------
    # If bool, then determines whether to consider all features discrete or
    # continuous. With "auto", dense X is treated as continuous and sparse X
    # as discrete.
    discrete_features="auto",
    # -------------------------------------------------------------------------
    # Number of neighbors to use for MI estimation for continuous variables.
    n_neighbors=3,
    # -------------------------------------------------------------------------
    # Determines random number generation for adding small noise to continuous
    # variables in order to remove repeated values.
    random_state=None,
)

mutual_info

array([0.07191495, 0.63637812, 0.        , 0.09702994])
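
Because the estimator adds a small amount of random noise to continuous variables, calls with `random_state=None` can return slightly different values from run to run. The following sketch, which assumes a fixed `random_state=0` and the hypothetical names `X_demo`, `y_demo`, `mi_first`, `mi_second` and `ranking`, shows one way to obtain repeatable scores and to rank the features by them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

# Same synthetic data as above: the first two features are informative.
X_demo, y_demo = make_regression(
    n_samples=300,
    n_features=4,
    n_informative=2,
    bias=0.0,
    tail_strength=0.9,
    noise=12.0,
    shuffle=False,
    coef=False,
    random_state=0,
)

# A fixed random_state (an assumption of this sketch) makes the result repeatable.
mi_first = mutual_info_regression(X_demo, y_demo, n_neighbors=3, random_state=0)
mi_second = mutual_info_regression(X_demo, y_demo, n_neighbors=3, random_state=0)

# Feature indices sorted from highest to lowest estimated mutual information.
ranking = np.argsort(mi_first)[::-1]

print(mi_first)
print(ranking)
print(np.array_equal(mi_first, mi_second))
```

The scores are estimates from a nearest-neighbors method, so features with no real relationship to the target can still receive small positive values, as the last feature does in the output above.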

