## Feature Selection Methods:

- Filter Method (SelectKBest): Selects the top k features based on statistical tests like ANOVA F-value.


SelectKBest is a feature selection class in scikit-learn that keeps the top k features according to a univariate score function. It scores each feature individually against the target and keeps those with the highest scores. It is a filter method: no predictive model is trained, and the scores are computed directly from the data. Because each feature is scored in isolation, redundancy and interactions between features are not taken into account. For classification, scikit-learn provides score functions such as the ANOVA F-value (`f_classif`), mutual information (`mutual_info_classif`), and chi-square (`chi2`, which requires non-negative feature values).
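
To make the role of the score function concrete, here is a minimal sketch that runs SelectKBest with each of the three score functions named above and prints the five features each one keeps. The variable names (`X_demo`, `y_demo`, `top5`) and the choice of `k=5` are illustrative assumptions, not part of the original notebook. The unscaled data is used because `chi2` rejects negative values.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

# Raw (unscaled) measurements: all values are non-negative, which chi2 requires
X_demo, y_demo = load_breast_cancer(return_X_y=True)

score_funcs = {
    "f_classif": f_classif,
    "chi2": chi2,
    # Mutual information estimation is randomized, so fix the seed for repeatability
    "mutual_info_classif": partial(mutual_info_classif, random_state=0),
}

top5 = {}
for name, func in score_funcs.items():
    sel = SelectKBest(score_func=func, k=5).fit(X_demo, y_demo)
    top5[name] = sel.get_support(indices=True)
    print(f"{name:>20}: {top5[name]}")
```

The three score functions need not agree on which features to keep, which is one reason to treat the choice of score function as a modeling decision.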



- Wrapper Method (Recursive Feature Elimination): Iteratively selects features by training a model and eliminating the least important ones.




RFE stands for Recursive Feature Elimination. It is a wrapper method: it fits a model on the current set of features, ranks them by the model's importance measure (for example, coefficients or feature importances), and removes the least important ones. The model is then refit on the remaining features, and the process repeats until the desired number of features is reached.
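
The elimination loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: the helper name `manual_rfe` is hypothetical, it removes exactly one feature per round, and it assumes a binary logistic regression whose absolute coefficients serve as the importance measure.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
# Standardize so that coefficient magnitudes are comparable across features
X_demo = StandardScaler().fit_transform(X_demo)


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature until n remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        importance = np.abs(model.coef_).ravel()  # one coefficient per remaining feature
        weakest = int(np.argmin(importance))      # position within `remaining`
        del remaining[weakest]
    return np.array(remaining)


kept = manual_rfe(X_demo, y_demo, n_features_to_select=10)
print("Features kept:", kept)
```

Each round costs one model fit, so reducing 30 features to 10 takes 20 fits. This is why wrapper methods are usually slower than filter methods.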


In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

In [19]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [3]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### 1. SelectKBest (Filter Method):

In [5]:
from sklearn.feature_selection import SelectKBest, f_classif

In [6]:
# Select the top k features using ANOVA F-value
k = 10
selector = SelectKBest(score_func=f_classif, k=k)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)


In [7]:

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)

In [8]:
selected_indices

array([ 0,  2,  3,  6,  7, 20, 22, 23, 26, 27], dtype=int64)
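
Bare indices are hard to interpret, so it can help to line them up with feature names and scores. The sketch below rebuilds the same split and selector so that it runs on its own, then tabulates each feature's F-score and p-value. The DataFrame layout and the names `scores` and `ranked` are illustrative assumptions. Scaling is skipped here on the assumption that per-feature standardization does not change the ANOVA F-statistic.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

selector = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)

# One row per feature: name, F-score, p-value, and whether it was kept
scores = pd.DataFrame(
    {
        "feature": data.feature_names,
        "f_score": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    }
)
ranked = scores.sort_values("f_score", ascending=False)
print(ranked.head(10).to_string(index=False))
```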

### 2. Recursive Feature Elimination (Wrapper Method):


In [9]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [10]:
# Create a logistic regression model
estimator = LogisticRegression(max_iter=1000)

In [11]:
# Perform recursive feature elimination
rfe = RFE(estimator, n_features_to_select=k, step=1)
rfe.fit(X_train_scaled, y_train)
X_train_rfe = rfe.transform(X_train_scaled)
X_test_rfe = rfe.transform(X_test_scaled)

In [12]:
# Get the selected feature indices
selected_indices_rfe = rfe.get_support(indices=True)

In [13]:
selected_indices_rfe

array([ 7, 10, 13, 15, 20, 21, 22, 23, 26, 27], dtype=int64)
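
The two methods agree on some features and differ on others. As a small sketch, the index arrays printed above can be compared with set operations and mapped to feature names. The arrays are copied from the notebook outputs, and the variable names (`kbest_idx`, `rfe_idx`, `shared`, `only_kbest`, `only_rfe`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names

# Index arrays copied from the SelectKBest and RFE outputs shown above
kbest_idx = np.array([0, 2, 3, 6, 7, 20, 22, 23, 26, 27])
rfe_idx = np.array([7, 10, 13, 15, 20, 21, 22, 23, 26, 27])

shared = np.intersect1d(kbest_idx, rfe_idx)     # chosen by both methods
only_kbest = np.setdiff1d(kbest_idx, rfe_idx)   # chosen by SelectKBest only
only_rfe = np.setdiff1d(rfe_idx, kbest_idx)     # chosen by RFE only

print("Shared:          ", list(feature_names[shared]))
print("SelectKBest only:", list(feature_names[only_kbest]))
print("RFE only:        ", list(feature_names[only_rfe]))
```

With these outputs, six of the ten features are shared, and each method picks four features that the other does not.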

### Finally, we can evaluate the performance of the selected features using a classifier:

In [14]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [15]:
# Initialize and train a classifier (Random Forest, for example) on the features selected by SelectKBest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_selected, y_train)


RandomForestClassifier(random_state=42)

In [16]:
# Predictions on the test set
y_pred = clf.predict(X_test_selected)


In [17]:
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with SelectKBest:", accuracy)

Accuracy with SelectKBest: 0.956140350877193


In [18]:
# Initialize and train a classifier (Random Forest for example) using features selected by RFE
clf_rfe = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rfe.fit(X_train_rfe, y_train)

# Predictions on the test set using features selected by RFE
y_pred_rfe = clf_rfe.predict(X_test_rfe)

# Evaluate accuracy
accuracy_rfe = accuracy_score(y_test, y_pred_rfe)
print("Accuracy with RFE:", accuracy_rfe)


Accuracy with RFE: 0.956140350877193
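
Both accuracies are identical, which is consistent with 109 of 114 test samples being classified correctly in each case, so a single split cannot separate the two methods. One possible way to get a steadier comparison is cross-validation, with scaling and selection wrapped in a pipeline so they are refit on each training fold. This is a sketch under assumptions rather than part of the original analysis: the 5-fold setup, the all-features baseline, and the names `candidates` and `cv_scores` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)


def forest():
    # Same classifier settings as in the notebook cells above
    return RandomForestClassifier(n_estimators=100, random_state=42)


candidates = {
    "all features": make_pipeline(StandardScaler(), forest()),
    "SelectKBest": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=10), forest()
    ),
    "RFE": make_pipeline(
        StandardScaler(),
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
        forest(),
    ),
}

cv_scores = {}
for name, pipe in candidates.items():
    # Each fold refits the scaler and selector on that fold's training part only
    cv_scores[name] = cross_val_score(pipe, X_all, y_all, cv=5, scoring="accuracy")
    print(f"{name:>12}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

Keeping the selector inside the pipeline means the held-out fold never influences which features are chosen, so the selection step does not leak information into the estimate.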
