# Feature Selection


## Mutual Information

*   **sklearn.feature_selection.mutual_info_classif**

It estimates the mutual information between each feature (independent variable) and the target variable (dependent variable) in a classification problem.

Mutual information measures the statistical dependence, that is, the information shared, between two random variables.

Input:

*   X: the feature matrix, a 2D array-like structure where each row represents a sample and each column represents a feature.
*   y: the target variable, a 1D array or list containing the class labels for the corresponding samples in X.

After estimating the mutual information for each feature, mutual_info_classif returns an array of non-negative scores, one for each feature.
A higher score means the feature shares more information with the target variable, which suggests it may be more useful for classification.

$$I\left[X;Y\right]=\sum_{y\in Y}\sum_{x\in X}p\left[x,y\right]\log\left(\frac{p\left[x,y\right]}{p\left[x\right]p\left[y\right]}\right)$$

If X and Y are independent, X carries no information about Y, so the mutual information is $I\left[X;Y\right]=0$, because $p\left[x,y\right]=p\left[x\right]p\left[y\right]$ and every term reduces to $\log{\left(\frac{p\left[x,y\right]}{p\left[x\right]p\left[y\right]}\right)}=\log{1}=0$.

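If a deterministic function maps X to Y, then Y can be recovered exactly from X and the mutual information reaches its maximum, the entropy of Y: $I\left[X;Y\right]=H\left[Y\right]$. This equals 1 only in special cases, such as a balanced binary target with the logarithm taken in base 2. mutual_info_classif reports its estimates in nats (natural logarithm).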
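
The two limiting cases can be checked numerically. The sketch below is only an illustration: the helper `mutual_information` and the two small joint distributions are made up for this example and are not part of scikit-learn. It applies the formula above directly to a discrete joint distribution, using the natural logarithm.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution p[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p[x], shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal p[y], shape (1, n_y)
    product = px @ py                      # p[x] p[y] for every (x, y) pair
    nonzero = joint > 0                    # terms with p[x, y] = 0 contribute 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))

# Independent variables: p[x, y] = p[x] p[y]
independent = np.outer([0.5, 0.5], [0.3, 0.7])

# Deterministic relation Y = X, each value with probability 0.5
deterministic = np.array([[0.5, 0.0],
                          [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_deterministic = mutual_information(deterministic)
entropy_y = -2 * 0.5 * np.log(0.5)  # H[Y] for a balanced binary Y

print("independent:  ", mi_independent)
print("deterministic:", mi_deterministic)
print("H[Y]:         ", entropy_y)
```

In the deterministic case the result matches the entropy of Y, which is log 2 in nats (1 bit).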


```
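from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# X is your feature matrix, and y is your target variable
# (the breast cancer dataset is used here only as a stand-in)
X, y = load_breast_cancer(return_X_y=True)

# Calculate mutual information scores for each feature
mi_scores = mutual_info_classif(X, y)

# Select the top-k features based on their mutual information scores
k = 5  # number of features to keep (illustrative value)
top_k_features = X[:, mi_scores.argsort()[::-1][:k]]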
```




In [7]:
# Example
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile
from sklearn.preprocessing import MinMaxScaler


# cancer is of type Bunch (Dictionary-like object)
cancer = load_breast_cancer()
X = cancer['data']
y = cancer['target']

#print(cancer)

# Set 1: train and test datasets (using all data).
X_train_1,X_test_1,y_train_1,y_test_1 = train_test_split(X,y,random_state=0,stratify=y)

print("Training data with all features", X_train_1.shape)

print('\n')


# Compute MI (mutual information) score

# Scale every feature to the [0, 1] range before estimating MI
transformer = MinMaxScaler().fit(X)
X = transformer.transform(X)
#print( X[0:5] )

# Vector with mutual information scores
mi_score = mutual_info_classif(X,y)
print(mi_score)

# The dataset has 30 features, so mutual_info_classif returns
# a vector of 30 scores, one per feature.

print('\n')

# Select the top-k features based on their mutual information scores
k  = 1
top_k_features = X[:, mi_score.argsort()[::-1][:k]]


Training data with all features (426, 30)


[0.36973092 0.10297018 0.4036949  0.36105915 0.07514198 0.21020062
 0.37300064 0.43997809 0.06489205 0.00863471 0.24943928 0.00077096
 0.27593128 0.33850185 0.01648752 0.0751082  0.11767169 0.12332735
 0.0133077  0.04037909 0.45154883 0.1252039  0.4779135  0.46387392
 0.10034046 0.22521388 0.31537009 0.43764105 0.09245879 0.0697103 ]




You can also select features with a threshold on the score; the next cells use 0.2.
In this example, 15 features score above 0.2, so the dataset shrinks from 30 features to 15.

In [11]:
# Set 2: features with MI scores > 0.2

# Indices of the features with mi_score > 0.2

mi_score_selected_index = np.where(mi_score > 0.2)[0]
print(mi_score_selected_index)

# X_2 data with features having MI scores > 0.2
X_2 = X[:,mi_score_selected_index]

X_train_2,X_test_2,y_train,y_test = train_test_split(X_2,y,random_state=0,stratify=y)

print("Training data with features whose MI score is > 0.2", X_train_2.shape)

[ 0  2  3  5  6  7 10 12 13 20 22 23 25 26 27]
Training data with features whose MI score is > 0.2 (426, 15)


The next example keeps the complementary set: the 15 features whose MI score is 0.2 or lower.

In [13]:
# Set 3: features with MI scores <= 0.2
mi_score_selected_index = np.where(mi_score <= 0.2)[0]
print(mi_score_selected_index)

X_3 = X[:,mi_score_selected_index]
X_train_3,X_test_3,y_train,y_test = train_test_split(X_3,y,random_state=0,stratify=y)

print("Training data with features whose MI score is <= 0.2", X_train_3.shape)

[ 1  4  8  9 11 14 15 16 17 18 19 21 24 28 29]
Training data with features whose MI score is <= 0.2 (426, 15)


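You can use a decision tree to test how well each feature selection performs.
In this run, the 15 features with the highest MI give the best test accuracy, slightly ahead of the full dataset and well ahead of the 15 lowest-MI features.
Neither mutual_info_classif nor DecisionTreeClassifier is seeded here, so the exact numbers can change between runs.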


In [14]:
# Test classifiers, one for each dataset with different columns.
# All splits use the same random_state and stratification, so every
# y_train / y_test pair holds the same labels in the same order.

from sklearn.tree import DecisionTreeClassifier
model_1 = DecisionTreeClassifier().fit(X_train_1,y_train_1)
model_2 = DecisionTreeClassifier().fit(X_train_2,y_train)
model_3 = DecisionTreeClassifier().fit(X_train_3,y_train)

# Return the mean accuracy on the given test data and labels.
score_1 = model_1.score(X_test_1,y_test_1)
score_2 = model_2.score(X_test_2,y_test)
score_3 = model_3.score(X_test_3,y_test)
print(f"score_1:{score_1}\n score_2:{score_2}\n score_3:{score_3}")

score_1:0.916083916083916
 score_2:0.9300699300699301
 score_3:0.8251748251748252
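
The comparison above scores the features on the full dataset before splitting, so the test rows take part in the selection. A possible alternative, sketched here under assumptions, wraps the selector and the classifier in a pipeline and evaluates it with cross-validation, so the selection is refit on the training part of each fold. The choice of SelectKBest with k=15, the 5 folds, and the seeds are illustrative, not taken from the cells above.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Seed the MI estimator so the feature ranking is repeatable (arbitrary seed)
mi_score_func = partial(mutual_info_classif, random_state=0)

# The selector is refit on the training part of every fold, so the
# held-out part never influences which features are chosen.
pipeline = make_pipeline(
    SelectKBest(score_func=mi_score_func, k=15),
    DecisionTreeClassifier(random_state=0),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```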


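### Automated Selection with SelectPercentile
SelectPercentile scores the features and keeps the top percentage in a single step. With no score_func argument it ranks features by the ANOVA F-value (f_classif), not by mutual information; pass score_func=mutual_info_classif to rank by MI instead.

In [15]:
selector = SelectPercentile(percentile=50) # keep the top 50% of features (default score_func: f_classif)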
selector.fit(X,y)
X_4 = selector.transform(X)

X_train_4,X_test_4,y_train,y_test = train_test_split(X_4,y,random_state=0,stratify=y)

model_4 = DecisionTreeClassifier().fit(X_train_4,y_train)
score_4 = model_4.score(X_test_4,y_test)

print(f"score_4:{score_4}")

score_4:0.9300699300699301
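
To rank by mutual information instead of the default F-value, the scoring function can be passed explicitly. The following is a minimal sketch: it reloads the dataset so it runs on its own, and the `functools.partial` wrapper with `random_state=0` is an assumption added only to make the ranking repeatable.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# mutual_info_classif returns a single array of scores, which
# SelectPercentile accepts as a score function.
mi_score_func = partial(mutual_info_classif, random_state=0)

selector = SelectPercentile(score_func=mi_score_func, percentile=50)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept
selected_index = selector.get_support(indices=True)

print(X_selected.shape)
print(selected_index)
```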


## Regression Problems

The following overview comes from ChatGPT.


1.   F-Regression:

F-regression is a univariate test of the linear dependence between each feature and the continuous target variable in a regression problem.
You can use the sklearn.feature_selection.f_regression function in scikit-learn to compute F-statistic scores and p-values for each feature's relationship with the target variable.
Features with higher F-statistic scores and lower p-values are more likely to be informative for regression, although purely nonlinear relationships can go undetected.

2.   Mutual Information for Regression:

Mutual information can also be estimated when the target is continuous, which makes it usable for regression tasks.
You can use sklearn.feature_selection.mutual_info_regression in scikit-learn to calculate mutual information scores between each feature and a continuous target variable.
This is more appropriate for regression tasks compared to mutual_info_classif, which is designed for classification.

3.   LASSO Regression (L1 Regularization):

LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization technique that can be used for feature selection in regression.
It encourages some feature coefficients to be exactly zero, effectively performing feature selection.
You can use the sklearn.linear_model.Lasso or sklearn.linear_model.LassoCV classes in scikit-learn for LASSO regression.

4.   Recursive Feature Elimination (RFE):

RFE is an iterative feature selection method that works well for regression tasks.
It starts with all features and iteratively removes the least important ones, ranking them by the fitted estimator's coefficients or feature importances.
You can use the sklearn.feature_selection.RFE class in scikit-learn for this purpose.
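
The four approaches above can be tried side by side. This is a minimal sketch on synthetic data: the dataset from make_regression, the choice of LinearRegression as the RFE estimator, and all parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, f_regression, mutual_info_regression
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic data: 10 features, only 3 of which drive the target
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# 1. F-regression: univariate linear test, returns F-statistics and p-values
f_scores, p_values = f_regression(X, y)

# 2. Mutual information with a continuous target
mi_scores = mutual_info_regression(X, y, random_state=0)

# 3. LASSO: features with a non-zero coefficient are the ones kept
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_)

# 4. RFE: refit the model and drop the weakest feature until 3 remain
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
rfe_selected = np.flatnonzero(rfe.support_)

print("top 3 by F-score:", np.argsort(f_scores)[::-1][:3])
print("top 3 by MI:     ", np.argsort(mi_scores)[::-1][:3])
print("kept by LASSO:   ", lasso_selected)
print("kept by RFE:     ", rfe_selected)
```

The methods do not have to agree: the first two score each feature on its own, while LASSO and RFE judge features jointly through a fitted model.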


## VarianceThreshold

This can be used with unsupervised problems because it does not need the target variable y: it removes every feature whose variance does not exceed the threshold.



```
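from sklearn.feature_selection import VarianceThreshold

# X is your feature matrix; no target variable is needed
# Keep only the features whose variance is above 0.1
selector = VarianceThreshold(threshold=.1)
X_new = selector.fit_transform(X)
print("variances: ", selector.variances_ )
print( X_new[0:5] )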
```
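
A small self-contained run makes the effect visible. The toy matrix below is invented for this sketch: one column is constant, one barely varies, and one varies a lot.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: column 0 is constant, column 1 barely varies, column 2 varies a lot
X = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.1, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 0.1, 40.0],
])

selector = VarianceThreshold(threshold=0.1)
X_new = selector.fit_transform(X)

print("variances:", selector.variances_)
print("kept:     ", selector.get_support())
print("new shape:", X_new.shape)
```

Only the third column survives. Because variance depends on the scale of each feature, the same threshold can behave very differently before and after scaling.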



## More Research to be Done