# 5.8.9 Feature selection based on the false discovery rate (SelectFdr)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, chi2

import warnings
warnings.filterwarnings("ignore")

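This method selects features with a false discovery rate (FDR) test, following the Benjamini-Hochberg procedure. <br>
It sorts the p-values from smallest to largest. <br>
It then looks for the sorted p-values that satisfy: <br>

$$ p_{(i)} \le (i+1) \cdot \frac{\alpha}{n} $$

with $i=0,...,n-1$, where $n$ is the number of features and $p_{(i)}$ is the p-value at position $i$ of the sorted list. Every feature whose p-value is at most the largest $p_{(i)}$ satisfying the inequality is selected.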
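The rule above can be sketched in a few lines of NumPy. This is a minimal illustration of the Benjamini-Hochberg step-up rule, not a copy of scikit-learn's source; the helper name `benjamini_hochberg_mask` and the toy p-values are assumptions made for the example.

```python
import numpy as np


def benjamini_hochberg_mask(pvalues, alpha):
    """Return a boolean mask of the p-values kept by the step-up rule."""
    pvalues = np.asarray(pvalues, dtype=float)
    n = pvalues.size
    sorted_p = np.sort(pvalues)
    # Thresholds (i + 1) * alpha / n for i = 0, ..., n - 1.
    thresholds = np.arange(1, n + 1) * alpha / n
    passing = sorted_p[sorted_p <= thresholds]
    if passing.size == 0:
        # No sorted p-value meets its threshold: nothing is selected.
        return np.zeros(n, dtype=bool)
    # Keep everything at or below the largest passing p-value.
    return pvalues <= passing.max()


# Hypothetical p-values, chosen only to show the step-up behaviour.
example_p = [0.001, 0.025, 0.028, 0.5]
mask = benjamini_hochberg_mask(example_p, alpha=0.04)
print(mask)
```

In this toy case the thresholds are 0.01, 0.02, 0.03 and 0.04. The second p-value (0.025) misses its own threshold, but it is still kept because the third one (0.028) passes and sets the cutoff.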

In [3]:
X, y = load_breast_cancer(return_X_y=True)
X.shape

(569, 30)

In [4]:
selectFdr = SelectFdr(
    # -------------------------------------------------------------------------
    # Function taking two arrays X and y, and returning a pair of arrays
    # (scores, pvalues).
    score_func=chi2,
    # -------------------------------------------------------------------------
    # Upper bound on the expected false discovery rate.
    alpha=0.01,
)

selectFdr.fit(X, y)

X_new = selectFdr.transform(X)
X_new.shape

(569, 16)
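The transformed array only reports how many columns survived. To see which features were kept, one option is the selector's `get_support()` mask combined with the dataset's `feature_names`. The variable names `data`, `selector` and `selected_names` below are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, chi2

# Load the dataset as a Bunch so the feature names are available.
data = load_breast_cancer()

selector = SelectFdr(score_func=chi2, alpha=0.01).fit(data.data, data.target)

# Boolean mask with one entry per original feature (True = kept).
mask = selector.get_support()
selected_names = data.feature_names[mask]

print(len(selected_names))
print(selected_names)
```

The number of names printed should match the second dimension of `X_new.shape`.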

In [5]:
selectFdr.scores_

array([2.66104917e+02, 9.38975081e+01, 2.01110286e+03, 5.39916559e+04,
       1.49899264e-01, 5.40307549e+00, 1.97123536e+01, 1.05440354e+01,
       2.57379775e-01, 7.43065536e-05, 3.46752472e+01, 9.79353970e-03,
       2.50571896e+02, 8.75850471e+03, 3.26620664e-03, 6.13785332e-01,
       1.04471761e+00, 3.05231563e-01, 8.03633831e-05, 6.37136566e-03,
       4.91689157e+02, 1.74449400e+02, 3.66503542e+03, 1.12598432e+05,
       3.97365694e-01, 1.93149220e+01, 3.95169151e+01, 1.34854195e+01,
       1.29886140e+00, 2.31522407e-01])

In [6]:
selectFdr.pvalues_

array([8.01397628e-060, 3.32292194e-022, 0.00000000e+000, 0.00000000e+000,
       6.98631644e-001, 2.01012999e-002, 9.00175712e-006, 1.16563638e-003,
       6.11926026e-001, 9.93122221e-001, 3.89553429e-009, 9.21168192e-001,
       1.94877489e-056, 0.00000000e+000, 9.54425121e-001, 4.33366115e-001,
       3.06726812e-001, 5.80621137e-001, 9.92847410e-001, 9.36379753e-001,
       6.11324751e-109, 7.89668299e-040, 0.00000000e+000, 0.00000000e+000,
       5.28452867e-001, 1.10836762e-005, 3.25230064e-010, 2.40424384e-004,
       2.54421307e-001, 6.30397277e-001])
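As a sanity check, the selection can be reproduced by hand from `pvalues_`. With the values printed above, the 16th smallest p-value is about 1.17e-03, below its threshold of 16 · 0.01 / 30 ≈ 5.33e-03, while the 17th (about 2.01e-02) exceeds 17 · 0.01 / 30 ≈ 5.67e-03. The sketch below assumes at least one p-value passes, which holds for this dataset; names such as `manual_mask` and `cutoff` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, chi2

X, y = load_breast_cancer(return_X_y=True)
selector = SelectFdr(score_func=chi2, alpha=0.01).fit(X, y)

n_features = selector.pvalues_.size
sorted_p = np.sort(selector.pvalues_)
# Thresholds (i + 1) * alpha / n for the sorted p-values.
thresholds = np.arange(1, n_features + 1) * selector.alpha / n_features

# Largest sorted p-value that meets its threshold (assumes one exists).
cutoff = sorted_p[sorted_p <= thresholds].max()
manual_mask = selector.pvalues_ <= cutoff

print(int(manual_mask.sum()), cutoff)
print(np.array_equal(manual_mask, selector.get_support()))
```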

