# Feature Selection - Machine Learning

Feature selection is the process of choosing a subset of the most important variables while retaining as much information as possible.

Let's look at some methods.

# Datasets

## Classification

Information on the pima-indians-diabetes dataset: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

In [6]:
import pandas as pd

# Load the data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

df_classification = pd.read_csv(url, names=names)

print(df_classification.shape)

df_classification.head(3)

(768, 9)


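|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |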


In [7]:
# Split features and target
X_class = df_classification.drop(['class'], axis=1)
y_class = df_classification['class']

print(X_class.shape)
print(y_class.shape)

(768, 8)
(768,)


## Regression

In [8]:
# For regression
from sklearn.datasets import fetch_california_housing

# Load the data
housing = fetch_california_housing(as_frame=True)
columns_drop = ["Longitude", "Latitude"]

X_reg = housing.data.drop(columns=columns_drop)
y_reg = housing.target

print("Feature data dimension: ", X_reg.shape)

Feature data dimension:  (20640, 6)


In [15]:
X_reg.head(3)
#y_reg.head(3)

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup |
|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 |


## Filter methods

### Variance Threshold

Removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
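To make the thresholding behaviour concrete, here is a minimal sketch on a small made-up matrix (the array `X_toy` and its values are assumptions for illustration, not part of the datasets used in this notebook). It compares the default threshold with a stricter one.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix: column 0 is constant, column 1 barely varies,
# column 2 varies a lot.
X_toy = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.1, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 0.1, 40.0],
])

# Population variance of each column, computed directly with NumPy.
column_variances = X_toy.var(axis=0)

# Default threshold (0.0): only the constant column is dropped.
default_mask = VarianceThreshold().fit(X_toy).get_support()

# A stricter threshold also drops the low-variance column.
strict_mask = VarianceThreshold(threshold=0.01).fit(X_toy).get_support()

print("Variances:", column_variances)
print("Kept with threshold=0.0: ", default_mask)
print("Kept with threshold=0.01:", strict_mask)
```

Because variance depends on the units of each column, a single fixed threshold treats features measured on different scales differently, which is worth keeping in mind when choosing its value.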

In [11]:
from sklearn.feature_selection import VarianceThreshold

# Feature filtering function: returns a boolean mask of the columns to keep
def variance_threshold(X, th):
    var_thres = VarianceThreshold(threshold=th)
    var_thres.fit(X)
    new_cols = var_thres.get_support()
    return new_cols

In [13]:
# For classification
# Get the selected columns
X_new_class = variance_threshold(X_class, 0.25)
# New dataframe
df_classification_new = X_class.iloc[:,X_new_class]
df_classification_new.head()

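|   | preg | plas | pres | skin | test | mass | age |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 33 |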


In [14]:
# For regression
# Get the selected columns
X_new_reg = variance_threshold(X_reg, 0.25)
# New dataframe
df_regression_new = X_reg.iloc[:,X_new_reg]
df_regression_new.head()

|   | MedInc | HouseAge | AveRooms | Population | AveOccup |
|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 322.0 | 2.555556 |
| 1 | 8.3014 | 21.0 | 6.238137 | 2401.0 | 2.109842 |
| 2 | 7.2574 | 52.0 | 8.288136 | 496.0 | 2.80226 |
| 3 | 5.6431 | 52.0 | 5.817352 | 558.0 | 2.547945 |
| 4 | 3.8462 | 52.0 | 6.281853 | 565.0 | 2.181467 |


### SelectKBest

Selects features according to the k highest scores. It uses a function that takes two arrays X and y and returns either a pair of arrays (scores, p-values) or a single array of scores.

The default score function (f_classif, the ANOVA F-value) only works for classification tasks.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
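As a small illustration of the score-function contract described above, the following sketch calls `f_classif` directly on synthetic data and then lets `SelectKBest` keep the single best-scoring column. The data (`X_toy`, `y_toy`) is made up for the example: only column 0 is shifted by the class label, so it is the only informative feature by construction.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Hypothetical binary target and three noise features.
y_toy = np.array([0] * 50 + [1] * 50)
X_toy = rng.normal(size=(100, 3))
# Shift column 0 by the class label so it separates the two classes.
X_toy[:, 0] += 3.0 * y_toy

# The score function returns one F-score and one p-value per feature.
scores, p_values = f_classif(X_toy, y_toy)

# SelectKBest applies the same function and keeps the k highest scores.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)
mask = selector.get_support()

print("F-scores:", scores)
print("p-values:", p_values)
print("Selected mask:", mask)
```

For a regression target, a regression score function such as `f_regression` would be passed as `score_func` instead.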

In [16]:
# For classification
from sklearn.feature_selection import SelectKBest, f_classif

# Feature filtering function based on statistical scores
def select_kbest_classification(X,y,score_f,k):
    sel_kb = SelectKBest(score_func=score_f, k=k)
    sel_kb.fit(X,y)
    new_cols = sel_kb.get_support()
    print("Scores:\n", sel_kb.scores_, "\nP-values:\n", sel_kb.pvalues_)
    return new_cols

In [17]:
# Get the selected columns (5 features)
X_new_class = select_kbest_classification(X_class, y_class, f_classif, 5)

# New dataset
df_classification_new = X_class.iloc[:,X_new_class]
df_classification_new.head()

Scores:
 [ 39.67022739 213.16175218   3.2569504    4.30438091  13.28110753
  71.7720721   23.8713002   46.14061124] 
P-values:
 [5.06512730e-10 8.93543165e-43 7.15139001e-02 3.83477048e-02
 2.86186460e-04 1.22980749e-16 1.25460701e-06 2.20997546e-11]


|   | preg | plas | mass | pedi | age |
|---|---|---|---|---|---|
| 0 | 6 | 148 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 43.1 | 2.288 | 33 |


In [21]:
# For regression
from sklearn.feature_selection import SelectKBest, f_regression, r_regression

# Feature filtering function based on statistical scores
def select_kbest_regression(X,y,score_f,k):
    sel_kb = SelectKBest(score_func=score_f, k=k)
    sel_kb.fit(X,y)
    new_cols = sel_kb.get_support()
    print("Scores:\n", sel_kb.scores_, "\nP-values:\n", sel_kb.pvalues_)
    #print("Scores:\n", sel_kb.scores_)
    return new_cols

In [23]:
# Get the selected columns (3 features)
X_new_reg = select_kbest_regression(X_reg, y_reg, f_regression, 3)

# New dataset
df_regression_new = X_reg.iloc[:,X_new_reg]
df_regression_new.head()

Scores:
 [1.85565716e+04 2.32841479e+02 4.87757462e+02 4.51085756e+01
 1.25474103e+01 1.16353421e+01] 
P-values:
 [0.00000000e+000 2.76186068e-052 7.56924213e-107 1.91258939e-011
 3.97630785e-004 6.48344237e-004]


|   | MedInc | HouseAge | AveRooms |
|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 |
| 1 | 8.3014 | 21.0 | 6.238137 |
| 2 | 7.2574 | 52.0 | 8.288136 |
| 3 | 5.6431 | 52.0 | 5.817352 |
| 4 | 3.8462 | 52.0 | 6.281853 |


## Wrapper methods

### RFE

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

First, the estimator is trained on the initial set of features and the importance of each feature is obtained through a specific attribute or a callable. Then, the least important features are pruned from the current set. This procedure is repeated on the pruned set until the desired number of features to select is reached.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
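The elimination procedure described above can be sketched by hand. The helper `manual_rfe` below is a hypothetical illustration, not scikit-learn's implementation: it assumes a linear estimator with a one-dimensional `coef_`, uses squared coefficients as the importance, and removes one feature per iteration. The synthetic data is also an assumption, built so that only columns 0 and 3 influence the target.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def manual_rfe(X, y, estimator, n_features_to_select):
    """Hand-written elimination loop (hypothetical helper, one feature per step)."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit on the features that are still in play.
        estimator.fit(X[:, remaining], y)
        # Squared coefficients serve as the importance of each feature.
        importances = np.square(estimator.coef_)
        # Drop the least important feature from the current set.
        weakest = int(np.argmin(importances))
        del remaining[weakest]
    return remaining

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only columns 0 and 3 drive the target; the rest are noise.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

manual_selected = manual_rfe(X_toy, y_toy, LinearRegression(), 2)

# Same task with scikit-learn's RFE, for comparison.
rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_toy, y_toy)
rfe_selected = np.flatnonzero(rfe.support_).tolist()

print("Manual loop:", manual_selected)
print("RFE:        ", rfe_selected)
```

Since coefficient magnitudes depend on feature scale, standardizing the inputs before running this kind of elimination is usually a sensible precaution.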

In [24]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LogisticRegression

# Recursive feature elimination helper: returns a boolean mask of the selected columns
def recursive_feature_selection(X,y,model,k):
  rfe = RFE(model, n_features_to_select=k, step=1)
  fit = rfe.fit(X, y)
  X_new = fit.support_
  print("Num Features: %s" % (fit.n_features_))
  print("Selected Features: %s" % (fit.support_))
  print("Feature Ranking: %s" % (fit.ranking_))

  return X_new

In [25]:
# For classification

# Set the estimator
model = LogisticRegression(max_iter=300)
# Get the selected columns (5 features)
X_new_class = recursive_feature_selection(X_class, y_class, model, 5)
# New dataset
df_classification_new = X_class.iloc[:,X_new_class]
df_classification_new.head()

Num Features: 5
Selected Features: [ True  True False False False  True  True  True]
Feature Ranking: [1 1 2 4 3 1 1 1]


|   | preg | plas | mass | pedi | age |
|---|---|---|---|---|---|
| 0 | 6 | 148 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 43.1 | 2.288 | 33 |


In [26]:
# For regression

# Set the estimator
model = LinearRegression()
# Get the selected columns (3 features)
X_new_reg = recursive_feature_selection(X_reg, y_reg, model, 3)
# New dataset
df_regression_new = X_reg.iloc[:,X_new_reg]
df_regression_new.head()


Num Features: 3
Selected Features: [ True False  True  True False False]
Feature Ranking: [1 2 1 1 4 3]


|   | MedInc | AveRooms | AveBedrms |
|---|---|---|---|
| 0 | 8.3252 | 6.984127 | 1.02381 |
| 1 | 8.3014 | 6.238137 | 0.97188 |
| 2 | 7.2574 | 8.288136 | 1.073446 |
| 3 | 5.6431 | 5.817352 | 1.073059 |
| 4 | 3.8462 | 6.281853 | 1.081081 |


### SequentialFeatureSelector

This sequential feature selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion.

At each stage, this estimator chooses the best feature to add or remove based on the cross-validation score of an estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html
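The greedy forward variant can be sketched as a plain loop over candidate features. The function `greedy_forward_selection` below is a hypothetical helper written for illustration, not the library's code; it scores each candidate with `cross_val_score` and adds the one with the best mean score. The synthetic data is an assumption in which only columns 0 and 3 carry signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, estimator, n_features_to_select, cv=5, scoring="r2"):
    """Add, one at a time, the feature that most improves the mean CV score."""
    selected = []
    candidates = list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        best_score, best_feature = -np.inf, None
        for feature in candidates:
            trial = selected + [feature]
            # Mean cross-validated score of the estimator on the trial subset.
            score = cross_val_score(estimator, X[:, trial], y, cv=cv, scoring=scoring).mean()
            if score > best_score:
                best_score, best_feature = score, feature
        selected.append(best_feature)
        candidates.remove(best_feature)
    return selected

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only columns 0 and 3 drive the target; column 0 has the larger effect.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

forward_selected = greedy_forward_selection(X_toy, y_toy, LinearRegression(), 2)
print("Selected (in order of addition):", forward_selected)
```

The cost grows quickly: each step refits the estimator once per remaining candidate and per fold, which is why wrapper methods tend to be slower than filter methods.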


In [27]:
from sklearn.feature_selection import SequentialFeatureSelector

# Sequential selector using logistic regression (classification)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=300),
                                n_features_to_select=5,
                                direction="forward",
                                scoring='f1')

# Get the selected features
sfs = sfs.fit(X_class, y_class)
X_new_class = sfs.support_
df_classification_new = X_class.iloc[:,X_new_class]
df_classification_new.head()

|   | preg | plas | pres | mass | age |
|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 33.6 | 50 |
| 1 | 1 | 85 | 66 | 26.6 | 31 |
| 2 | 8 | 183 | 64 | 23.3 | 32 |
| 3 | 1 | 89 | 66 | 28.1 | 21 |
| 4 | 0 | 137 | 40 | 43.1 | 33 |


In [28]:
# Sequential selector using linear regression (regression)
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward",
                                scoring='r2')

# Get the selected features
sfs = sfs.fit(X_reg, y_reg)
X_new_reg = sfs.support_
df_regression_new = X_reg.iloc[:,X_new_reg]
df_regression_new.head()

|   | MedInc | HouseAge | AveRooms |
|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 |
| 1 | 8.3014 | 21.0 | 6.238137 |
| 2 | 7.2574 | 52.0 | 8.288136 |
| 3 | 5.6431 | 52.0 | 5.817352 |
| 4 | 3.8462 | 52.0 | 6.281853 |


## Embedded methods

### SelectFromModel

Meta-transformer for selecting features based on importance weights.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
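A minimal sketch of how the importance weights drive the selection, using a Lasso model on synthetic data. Everything here is an assumption for illustration: the data is generated so that only columns 0 and 2 affect the target, and `alpha=0.1` is an arbitrary choice. With an L1-penalized model, coefficients that shrink to zero are discarded, and `max_features` caps how many of the remaining ones are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
# Column 0 has a strong effect, column 2 a weaker one; columns 1 and 3 are noise.
y_toy = 2.0 * X_toy[:, 0] + 0.5 * X_toy[:, 2] + rng.normal(scale=0.1, size=300)

# Default behaviour with Lasso: keep the features whose coefficient is not (near) zero.
nonzero_selector = SelectFromModel(Lasso(alpha=0.1)).fit(X_toy, y_toy)
nonzero_mask = nonzero_selector.get_support()

# max_features additionally keeps only the largest coefficients.
top1_selector = SelectFromModel(Lasso(alpha=0.1), max_features=1).fit(X_toy, y_toy)
top1_mask = top1_selector.get_support()

print("Lasso coefficients:", nonzero_selector.estimator_.coef_)
print("Non-zero features: ", nonzero_mask)
print("Top-1 feature:     ", top1_mask)
```

A larger `alpha` pushes more coefficients to zero, so the penalty strength effectively controls how aggressive the selection is.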

In [29]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import numpy as np

# Standardize the data (separate scalers, so the feature scaler is not refit on the target)
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X_reg_std = X_scaler.fit_transform(X_reg)
y_reg_std = y_scaler.fit_transform(np.array(y_reg).reshape(-1, 1)).ravel()

# Feature selector based on Lasso
sel_ = SelectFromModel(Lasso(alpha=0.03), max_features=3)
sel_.fit(X_reg_std, y_reg_std)
print(sel_.estimator_.coef_)
# Get the selected features (boolean mask)
X_new_reg = sel_.get_support()

df_regression_new = X_reg.iloc[:,X_new_reg]
df_regression_new.head()

[ 0.68231982  0.15427649 -0.01749387  0.          0.         -0.008662  ]


|   | MedInc | HouseAge | AveRooms |
|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 |
| 1 | 8.3014 | 21.0 | 6.238137 |
| 2 | 7.2574 | 52.0 | 8.288136 |
| 3 | 5.6431 | 52.0 | 5.817352 |
| 4 | 3.8462 | 52.0 | 6.281853 |


In [None]:
X_new_reg

# Validation scheme

### Cross Validation

Evaluates a model's performance metrics through cross-validation and also records the fit and score times.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
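To show what `cross_validate` returns, the sketch below runs it on synthetic data (generated with `make_classification`, an assumption for the example) and lists the keys of the result. Each entry is an array with one value per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical binary classification data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

results = cross_validate(
    LogisticRegression(max_iter=300),
    X_toy,
    y_toy,
    cv=5,
    scoring=["accuracy", "f1"],
    return_train_score=True,  # also report scores on the training folds
)

# Timing entries plus one train_/test_ entry per requested metric.
for key in sorted(results):
    print(key, results[key].shape)
```

Comparing the `train_*` and `test_*` entries for the same metric gives a quick indication of overfitting.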

In [30]:
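from sklearn.model_selection import cross_validate

# Note: the `scoring` argument is not used in the body; the four metrics in `_scoring` are always computed
def cross_validation(model, _X, _y, _cv=5, scoring='f1'):
      _scoring = ['accuracy', 'precision', 'recall', 'f1']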
      results = cross_validate(estimator=model,
                               X=_X,
                               y=_y,
                               cv=_cv,
                               scoring=_scoring,
                               return_train_score=True)

      return {"Training Accuracy scores": results['train_accuracy'],
              "Mean Training Accuracy": results['train_accuracy'].mean()*100,
              "Training Precision scores": results['train_precision'],
              "Mean Training Precision": results['train_precision'].mean(),
              "Training Recall scores": results['train_recall'],
              "Mean Training Recall": results['train_recall'].mean(),
              "Training F1 scores": results['train_f1'],
              "Mean Training F1 Score": results['train_f1'].mean(),
              "Validation Accuracy scores": results['test_accuracy'],
              "Mean Validation Accuracy": results['test_accuracy'].mean()*100,
              "Validation Precision scores": results['test_precision'],
              "Mean Validation Precision": results['test_precision'].mean(),
              "Validation Recall scores": results['test_recall'],
              "Mean Validation Recall": results['test_recall'].mean(),
              "Validation F1 scores": results['test_f1'],
              "Mean Validation F1 Score": results['test_f1'].mean()
              }

In [31]:
# Logistic regression model
log_model = LogisticRegression(class_weight="balanced", random_state=0, max_iter=300)
# Evaluate model 2
log_model_2_result = cross_validation(log_model, X_class, y_class, 5)

In [32]:
log_model_2_result

{'Training Accuracy scores': array([0.76710098, 0.77361564, 0.76872964, 0.75121951, 0.76260163]),
 'Mean Training Accuracy': 76.46534784566089,
 'Training Precision scores': array([0.6473029 , 0.65060241, 0.64516129, 0.625     , 0.63967611]),
 'Mean Training Precision': 0.6415485435771549,
 'Training Recall scores': array([0.72897196, 0.75700935, 0.74766355, 0.72093023, 0.73488372]),
 'Mean Training Recall': 0.7378917626602913,
 'Training F1 scores': array([0.68571429, 0.69978402, 0.69264069, 0.66954644, 0.68398268]),
 'Mean Training F1 Score': 0.6863336231802755,
 'Validation Accuracy scores': array([0.77272727, 0.7012987 , 0.75324675, 0.83006536, 0.74509804]),
 'Mean Validation Accuracy': 76.04872251931076,
 'Validation Precision scores': array([0.65079365, 0.55882353, 0.63793103, 0.72881356, 0.62962963]),
 'Mean Validation Precision': 0.6411982807279676,
 'Validation Recall scores': array([0.75925926, 0.7037037 , 0.68518519, 0.81132075, 0.64150943]),
 'Mean Validation Recall': 0.720

In [33]:
print("Mean Training F1 Score: ", log_model_2_result['Mean Training F1 Score'],
      "\nMean Validation F1 Score: ", log_model_2_result['Mean Validation F1 Score'])

Mean Training F1 Score:  0.6863336231802755 
Mean Validation F1 Score:  0.6775781935579699


In [None]:
# Another alternative using LogisticRegressionCV
from sklearn.linear_model import LogisticRegressionCV

# Define the model and fit it on all the data
clf = LogisticRegressionCV(cv=10, random_state=0, class_weight="balanced", scoring='f1', max_iter=300).fit(X_class, y_class)

# With scoring='f1', score() returns the F1 score, computed here on the same data used for fitting
print("Score: ", clf.score(X_class, y_class))

### LeaveOneOut Cross-Validation

Provides train/test indices to split data into train/test sets. Each sample is used once as a test set (singleton) while the remaining samples form the training set.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html
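The splitting scheme is easiest to see on a tiny example. The sketch below uses a made-up array with four samples (`X_toy` is an assumption for illustration) and lists the train and test indices of every split.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical data: 4 samples, 2 features.
X_toy = np.arange(8).reshape(4, 2)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X_toy)

# Each split holds out exactly one sample as the test set.
splits = [(train.tolist(), test.tolist()) for train, test in loo.split(X_toy)]

print("Number of splits:", n_splits)
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The number of model fits equals the number of samples, so this scheme can be expensive on large datasets.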

In [None]:
X_class = X_class.values
y_class = y_class.values

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score, f1_score

# create loocv procedure
cv = LeaveOneOut()

# Lists of true and predicted values
y_true, y_pred = list(), list()

for train_ix, test_ix in cv.split(X_class):
  # split data
  X_train, X_test = X_class[train_ix, :], X_class[test_ix, :]
  y_train, y_test = y_class[train_ix], y_class[test_ix]
  # fit model
  model_log_3 = LogisticRegression(class_weight="balanced", random_state=0, max_iter=1000)
  model_log_3.fit(X_train, y_train)
  # evaluate model
  yhat = model_log_3.predict(X_test)
  # store
  y_true.append(y_test[0])
  y_pred.append(yhat[0])

# Evaluation metrics
acc = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % acc)
f1 = f1_score(y_true, y_pred)
print('F1 score: %.3f' % f1)