# 🛠️ Consensus Feature Selection Pipeline (Ensemble Feature Selection)

This notebook implements a pipeline for identifying the most important variables in a dataset. Instead of relying on a single statistical criterion, it uses a **majority vote** across several selection methods.

### 🧬 How does it work?
The pipeline runs 10 selectors drawn from four families of data-analysis methods:

1.  **Statistical:** Correlation, ANOVA F-score, Chi-square and T-score.
2.  **Information theory:** Entropy (H(X)) and Mutual Information.
3.  **Distance-based:** The ReliefF algorithm (effective at detecting interactions between variables).
4.  **Model-based (Random Forest):**
    *   **Gini/MDI:** Importance based on the reduction of node impurity. `select_gini` and `select_mdi` both read the forest's `feature_importances_`, so this criterion casts two of the ten votes.
    *   **MDA (Permutation Importance):** Importance based on the drop in model performance when a column is permuted.

### 🗳️ The Fusion Process
Each method selects its top $k$ features. The `fuse_features` function then acts as a jury: only features that receive votes from a strict majority of the methods are kept for the final model. The threshold is computed from the number of methods as `n_methods // 2 + 1`, which is 6 votes when all 10 methods run.

This reduces the risk of selecting variables that are statistical artifacts of a single algorithm, and it tends to yield a more stable and generalizable feature subset.
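
The vote count behind the fusion step can be sketched in a few lines. This is a minimal illustration, not the notebook's `fuse_features` itself: `tally_votes` and the feature names `a` to `d` are hypothetical, and only the threshold rule `n_methods // 2 + 1` is taken from the pipeline.

```python
from collections import Counter

def tally_votes(selected_lists):
    """Return (votes per feature, strict-majority threshold)."""
    # set() guards against a method listing the same feature twice
    votes = Counter(feat for sel in selected_lists for feat in set(sel))
    threshold = len(selected_lists) // 2 + 1
    return votes, threshold

# Hypothetical picks from five methods, each choosing k=2 features
picks = [
    ["a", "b"],
    ["a", "c"],
    ["a", "b"],
    ["b", "d"],
    ["a", "d"],
]

votes, threshold = tally_votes(picks)
selected = sorted(f for f, c in votes.items() if c >= threshold)

print(threshold)  # 3
print(selected)   # ['a', 'b']
```

With an even number of methods the rule still demands more than half: for ten selectors the threshold is 6, so a feature with exactly 5 votes is rejected.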

In [None]:
!pip install skrebate


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, f_classif, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import KBinsDiscretizer
from skrebate import ReliefF
from scipy.stats import ttest_ind
from sklearn.feature_selection import SelectKBest

In [None]:
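# 1️⃣ Correlation: rank features by |Pearson r| with the target
def select_correlation(X, y, k):
    # The last row of the matrix holds corr(feature_j, y); drop the final y-vs-y entry.
    corr = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
    # Constant columns yield NaN; score them 0 so they do not sort to the top.
    corr = np.nan_to_num(corr, nan=0.0)
    indices = np.argsort(corr)[::-1][:k]
    return list(X.columns[indices])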

In [None]:
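# 2️⃣ Entropy (H(X)): unsupervised score; y is accepted only to keep the common signature
def select_entropy(X, y, k):
    def entropy(col):
        # The Freedman-Diaconis rule picks the bin width for each column
        hist, _ = np.histogram(col, bins='fd')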
        probs = hist / hist.sum()
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))
    ent = X.apply(entropy)
    indices = np.argsort(ent.values)[::-1][:k]
    return list(X.columns[indices])
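
Because the entropy criterion never looks at `y`, it rewards columns whose values are spread across many bins rather than columns that separate the classes. The sketch below shows that behaviour on two hand-made columns. It is an illustration under assumptions: `histogram_entropy` is a hypothetical helper, and a fixed `bins=4` replaces the `'fd'` rule so the result can be checked by hand.

```python
import numpy as np

def histogram_entropy(col, bins):
    """Shannon entropy (in bits) of a column after histogram binning."""
    hist, _ = np.histogram(col, bins=bins)
    probs = hist / hist.sum()
    probs = probs[probs > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(probs * np.log2(probs)))

spread = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # fills the 4 bins evenly
constant = np.full(8, 5.0)                   # every value lands in one bin

h_spread = histogram_entropy(spread, bins=4)
h_constant = histogram_entropy(constant, bins=4)

print(h_spread)         # 2.0 bits: four equally likely bins
print(h_constant == 0)  # True: a single occupied bin carries no information
```

This is why an entropy voter can disagree with label-aware criteria: in the Iris run at the end of this notebook it is the only method that picks the sepal columns.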

In [None]:
# 3️⃣ Mutual information
def select_mutual_info(X, y, k):
    mi = mutual_info_classif(X, y, discrete_features='auto', random_state=42)
    indices = np.argsort(mi)[::-1][:k]
    return list(X.columns[indices])

In [None]:
# 4️⃣ ReliefF
def select_relieff(X, y, k):
    fs = ReliefF(n_neighbors=100, n_features_to_select=k)
    fs.fit(X.values, y)
    return list(X.columns[fs.top_features_[:k]])

In [None]:
# 5️⃣ ANOVA F-score
def select_anova(X, y, k):
    F, _ = f_classif(X, y)
    indices = np.argsort(F)[::-1][:k]
    return list(X.columns[indices])

In [None]:
# 6️⃣ Chi-square
def select_chi2(X, y, k):
    # chi2 needs non-negative inputs, so map each feature to 5 equal-width ordinal bins first
    X_disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform').fit_transform(X)
    chi, _ = chi2(X_disc, y)
    indices = np.argsort(chi)[::-1][:k]
    return list(X.columns[indices])

In [None]:
# 7️⃣ Gini (MDI in RandomForest)
def select_gini(X, y, k):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)
    imp = rf.feature_importances_
    indices = np.argsort(imp)[::-1][:k]
    return list(X.columns[indices])

In [None]:
# 8️⃣ T-score (Welch's t-test)
def select_tscore(X, y, k):
    tscores = []
    classes = np.unique(y)
    for col in X.columns:
        # Only the first two classes are compared; any further classes are ignored.
        group1 = X[col][y == classes[0]]
        group2 = X[col][y == classes[1]] if len(classes) > 1 else X[col]
        t, _ = ttest_ind(group1, group2, equal_var=False)
        tscores.append(np.abs(t))
    indices = np.argsort(tscores)[::-1][:k]
    return list(X.columns[indices])
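
`select_tscore` compares only the first two classes, so on a three-class target such as Iris the third class never enters the statistic. One possible extension, shown here as a sketch rather than as part of the pipeline, is a one-vs-rest Welch t-score that keeps the largest |t| across classes. The helpers `welch_t` and `one_vs_rest_tscore` and the toy arrays are hypothetical; the statistic is written out with NumPy so the example has no other dependencies.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two 1-D samples (unequal variances)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

def one_vs_rest_tscore(col, y):
    """Largest |t| over all 'class c vs. everything else' comparisons."""
    col = np.asarray(col, dtype=float)
    y = np.asarray(y)
    return max(abs(welch_t(col[y == c], col[y != c])) for c in np.unique(y))

# Toy data: the feature separates class 2 only, so a test restricted to
# classes 0 and 1 sees no difference at all.
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
f = np.array([1.0, 1.1, 0.9, 1.0, 0.9, 1.1, 5.0, 5.1, 4.9])

first_two_only = abs(welch_t(f[y == 0], f[y == 1]))
all_classes = one_vs_rest_tscore(f, y)

print(first_two_only)  # essentially zero
print(all_classes)     # large, driven by class 2 vs. the rest
```

Taking the maximum is one design choice among several; averaging |t| over classes would favour features that separate every class moderately well instead.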

In [None]:
# 9️⃣ MDI (same criterion as select_gini: both read rf.feature_importances_)
def select_mdi(X, y, k):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)
    mdi_importances = rf.feature_importances_
    top_k_idx = np.argsort(mdi_importances)[::-1][:k]
    return X.columns[top_k_idx].tolist()

In [None]:
# 🔟 MDA (permutation importance)
def select_mda(X, y, k):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)
    # Scored on the same data the forest was fit on; a held-out split is a common alternative.
    perm = permutation_importance(rf, X, y, n_repeats=20, random_state=42)
    imp = perm.importances_mean
    indices = np.argsort(imp)[::-1][:k]
    return list(X.columns[indices])
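
The idea behind MDA can be sketched without any modelling library: shuffle one column, re-score the model, and record how far accuracy falls. The code below is an illustrative toy, not scikit-learn's `permutation_importance`; `permutation_drop`, the threshold "model" in `predict`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def permutation_drop(predict, X, y, n_repeats=20, seed=42):
    """Mean accuracy drop per column when that column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops[j] += baseline - np.mean(predict(X_perm) == y)
    return drops / n_repeats

def predict(data):
    """Toy model: looks at column 0 only."""
    return (data[:, 0] >= 10).astype(int)

# Column 0 determines the label; column 1 is an unrelated alternating pattern.
X = np.column_stack([np.arange(20.0), np.tile([0.0, 1.0], 10)])
y = (X[:, 0] >= 10).astype(int)

importances = permutation_drop(predict, X, y)
print(importances)  # column 0 shows a clear drop, column 1 shows none
```

Column 1 is ignored by the toy model, so shuffling it cannot change any prediction and its importance is exactly zero.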

In [None]:


# --------------------------
# FEATURE FUSION
# --------------------------
def fuse_features(selected_lists):
    from collections import Counter
    # Each method contributes one vote for every feature in its top-k list
    flat = [feat for sublist in selected_lists for feat in sublist]
    count = Counter(flat)
    n_methods = len(selected_lists)
    # Strict majority: more than half of the methods (6 of 10, 3 of 5)
    threshold = n_methods // 2 + 1
    fused = [feat for feat, c in count.items() if c >= threshold]
    return fused

# --------------------------
# MAIN
# --------------------------
def main(X, y, k=3, verbose=True):
    methods = [
        select_correlation, select_entropy, select_mutual_info, select_relieff,
        select_anova, select_chi2, select_gini, select_tscore, select_mdi, select_mda
    ]

    selected_all = []

    for func in methods:
        sel = func(X, y, k)
        selected_all.append(sel)
        if verbose:
            print(f"{func.__name__}: {sel}")

    fused = fuse_features(selected_all)
    if verbose:
        print("\nFused features (approved by majority):", fused)

    return fused

# --------------------------
# USAGE EXAMPLE
# --------------------------
if __name__ == "__main__":
    # Example dataset
    data = load_iris()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = data.target

    fused_features = main(X, y, k=2)


select_correlation: ['petal width (cm)', 'petal length (cm)']
select_entropy: ['sepal width (cm)', 'sepal length (cm)']
select_mutual_info: ['petal length (cm)', 'petal width (cm)']
select_relieff: ['petal width (cm)', 'petal length (cm)']
select_anova: ['petal length (cm)', 'petal width (cm)']
select_chi2: ['petal width (cm)', 'petal length (cm)']
select_gini: ['petal length (cm)', 'petal width (cm)']
select_tscore: ['petal length (cm)', 'petal width (cm)']
select_mdi: ['petal length (cm)', 'petal width (cm)']
select_mda: ['petal length (cm)', 'petal width (cm)']

Fused features (approved by majority): ['petal width (cm)', 'petal length (cm)']
