# 04 - Feature Selection

In this notebook, we perform the **Feature Selection** stage of the project with the goal of identifying which features contribute the most to predicting the breast cancer diagnosis.

The Breast Cancer Wisconsin dataset contains many highly correlated features derived from similar measurements (radius, perimeter, area, concavity, etc.). Reducing dimensionality can help to:

- improve interpretability  
- reduce overfitting  
- remove redundancy  
- speed up training  
- increase model stability  
- prepare for better hyperparameter tuning  

We will apply four complementary feature selection techniques:

1. **ANOVA F-test (SelectKBest)**  
2. **Mutual Information (SelectKBest)**  
3. **Random Forest Feature Importance**  
4. **Recursive Feature Elimination (RFE)**  

The goal is to select the **15 most relevant features**, based on statistical scores, model-based importance, and measured classification performance on the held-out test set.
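One possible way to combine several rankings into a single shortlist is a simple vote count: a feature earns one vote for each method that places it in its top `k`. The sketch below is only an illustration of that idea, not the procedure used in this notebook; `consensus_features` and the shortened `example_rankings` lists are hypothetical names introduced here.

```python
from collections import Counter


def consensus_features(rankings, top_k):
    """Count how many methods place each feature in their top_k list."""
    votes = Counter()
    for ranked in rankings.values():
        votes.update(ranked[:top_k])
    # Sort by vote count (descending), then alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Shortened, illustrative rankings (best feature first) for three methods.
example_rankings = {
    "f_test": ["concave_points_worst", "perimeter_worst", "radius_worst"],
    "mutual_info": ["radius_worst", "perimeter_worst", "area_worst"],
    "random_forest": ["radius_worst", "perimeter_worst", "perimeter_avg"],
}

consensus = consensus_features(example_rankings, top_k=3)
for feature, n_votes in consensus:
    print(f"{feature}: {n_votes} vote(s)")
```

Features that every method agrees on rise to the top, while features favored by a single method can be reviewed case by case.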

In [14]:
import sys
sys.path.append("..")  # make the project-level `src` package importable from the notebook folder

import pandas as pd
import numpy as np

# Feature selection tools and the estimators used in this notebook
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import matplotlib.pyplot as plt
import seaborn as sns

from src.config import RANDOM_STATE, TEST_SIZE

## 1. Load preprocessed data

In [5]:
X_train = pd.read_csv("../data/processed/X_train_preprocessed.csv")
X_test  = pd.read_csv("../data/processed/X_test_preprocessed.csv")
y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test  = pd.read_csv("../data/processed/y_test.csv").squeeze()

X_train.head()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean | ... | concavity_worst | concave_points_worst | symmetry_worst | fractal_dimension_worst | radius_avg | perimeter_avg | area_avg | concavity_avg | texture_avg | var_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.518559 | 0.891826 | 0.424632 | 0.383925 | -0.974744 | -0.689772 | -0.688586 | -0.398175 | -1.039155 | -0.825056 | ... | -0.610227 | -0.235744 | 0.054566 | 0.021837 | 0.32968 | 0.227148 | 0.271332 | -0.653877 | 0.716364 | 0.370155 |
| 1 | -0.516364 | -1.63971 | -0.541349 | -0.542961 | 0.476219 | -0.631834 | -0.604281 | -0.303075 | 0.521543 | -0.454523 | ... | -0.712666 | -0.323208 | -0.137576 | -0.904402 | -0.567734 | -0.579571 | -0.541143 | -0.598712 | -1.443682 | 0.248493 |
| 2 | -0.368118 | 0.455515 | -0.38825 | -0.40297 | -1.432979 | -0.383927 | -0.342175 | -0.765459 | -0.850857 | -0.226171 | ... | -0.431313 | -0.890825 | -0.675893 | -0.144016 | -0.154254 | -0.344409 | -0.274852 | -0.050301 | 0.562998 | 0.435965 |
| 3 | 0.205285 | 0.726168 | 0.40033 | 0.070612 | 0.243253 | 2.203585 | 2.256094 | 1.213233 | 0.818474 | 0.899791 | ... | 2.958619 | 1.977064 | -0.075646 | 1.728848 | -0.113585 | 0.391845 | -0.127485 | 2.235092 | 0.126227 | 1.027462 |
| 4 | 1.243005 | 0.194195 | 1.210377 | 1.206652 | -0.111442 | 0.051348 | 0.732962 | 0.713767 | -0.427187 | -0.822184 | ... | 0.327775 | 0.501859 | -0.909322 | -0.546249 | 1.259901 | 1.143881 | 1.056292 | 0.551221 | 0.520427 | 0.429718 |


## 1.1 Dataset Overview (Sanity Check)

Before modeling, we verify the integrity of the preprocessed dataset.

In [6]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Target distribution (train):")
print(pd.Series(y_train).value_counts(normalize=True))

Train shape: (455, 36)
Test shape: (114, 36)
Target distribution (train):
diagnosis
0    0.626374
1    0.373626
Name: proportion, dtype: float64
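Beyond shapes and class balance, a few extra checks can catch silent problems such as mismatched columns or missing values. The following is a minimal sketch under the assumption that the inputs are pandas objects like the ones loaded above; `check_integrity` and the toy frames are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd


def check_integrity(X_train, X_test, y_train, y_test):
    """Return a dict of basic integrity checks for a train/test split."""
    return {
        # Train and test must expose the same columns in the same order.
        "same_columns": list(X_train.columns) == list(X_test.columns),
        # No NaN values anywhere in the feature matrices.
        "no_missing": not (X_train.isna().any().any() or X_test.isna().any().any()),
        # Features and targets must have matching row counts.
        "rows_match_train": len(X_train) == len(y_train),
        "rows_match_test": len(X_test) == len(y_test),
        # All feature columns should be numeric after preprocessing.
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in X_train.dtypes),
    }


# Toy data standing in for the real preprocessed files.
toy_train = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"a": [0.4], "b": [4.0]})
report = check_integrity(toy_train, toy_test, pd.Series([0, 1, 0]), pd.Series([1]))
print(report)
```

In the notebook, the same function could be called as `check_integrity(X_train, X_test, y_train, y_test)`.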


## 2. SelectKBest using ANOVA F-test

The ANOVA F-test checks, one feature at a time, whether the feature's mean differs between the classes by comparing:

- the variance between classes (benign vs malignant)  
- the variance within each class  

Higher F-scores indicate that the class means are further apart relative to the spread within each class.  
This method is useful for identifying features with statistically significant separation across the two classes, although it only detects differences in means and does not capture nonlinear patterns.
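To make the between-class versus within-class comparison concrete, the sketch below computes the F-statistic for a single feature by hand and compares it with `f_classif`. The data is synthetic and `anova_f` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif


def anova_f(x, y):
    """One-way ANOVA F-statistic of feature x grouped by class labels y."""
    classes = np.unique(y)
    grand_mean = x.mean()
    # Between-class sum of squares: class size times squared distance of the class mean.
    ss_between = sum(
        (y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes
    )
    # Within-class sum of squares: spread of samples around their own class mean.
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


# Synthetic feature whose mean is shifted by 1.5 for class 1.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
x = rng.normal(loc=y * 1.5, scale=1.0)

manual_f = anova_f(x, y)
sklearn_f = f_classif(x.reshape(-1, 1), y)[0][0]
print(manual_f, sklearn_f)
```

Both values should agree up to floating-point error, which shows that the score depends only on class means and variances.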

In [7]:
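# k only controls which columns the selector keeps; scores_ is computed for every feature.
# k=15 matches the target subset size used in the rest of the notebook.
selector_f = SelectKBest(score_func=f_classif, k=15)
selector_f.fit(X_train, y_train)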

scores_f = pd.DataFrame({
    "feature": X_train.columns,
    "score": selector_f.scores_
}).sort_values("score", ascending=False)

scores_f

| | feature | score |
|---|---|---|
| 27 | concave_points_worst | 733.724933 |
| 22 | perimeter_worst | 717.246487 |
| 20 | radius_worst | 692.861395 |
| 7 | concave_points_mean | 684.526845 |
| 30 | radius_avg | 558.669197 |
| 31 | perimeter_avg | 557.653832 |
| 2 | perimeter_mean | 548.413236 |
| 23 | area_worst | 522.188947 |
| 0 | radius_mean | 511.274848 |
| 3 | area_mean | 444.857518 |


## 3. SelectKBest using Mutual Information

Mutual Information (MI) measures the **overall dependency** between features and the target, including both linear and nonlinear relationships.  
It quantifies how much knowing a particular feature reduces uncertainty about the diagnosis.

MI complements the ANOVA F-test by detecting more complex structures in the data that are not captured by purely linear statistics.
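The difference between the two filters is easiest to see on a feature whose relationship with the target is purely nonlinear. In the synthetic sketch below (not taken from the dataset), the class depends on `|x|`, so both classes have the same mean: the F-score is essentially zero while the mutual information estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Symmetric grid of values; class 1 lies in the two outer bands, class 0 in the middle.
x = np.linspace(-1, 1, 2000)
y = (np.abs(x) > 0.5).astype(int)
X = x.reshape(-1, 1)

# Both class means are 0 by symmetry, so the F-test sees no separation.
f_score, p_value = f_classif(X, y)

# The k-nearest-neighbor MI estimator still detects the dependency.
mi = mutual_info_classif(X, y, random_state=0)

print("F-score:", f_score[0])
print("Mutual information:", mi[0])
```

Because `mutual_info_classif` relies on a randomized nearest-neighbor estimate, fixing `random_state` keeps the scores reproducible between runs.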


In [8]:
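# k only controls which columns the selector keeps; scores_ is computed for every feature.
# Note: mutual_info_classif uses a randomized estimator, so scores can vary slightly between runs.
selector_mi = SelectKBest(score_func=mutual_info_classif, k=15)
selector_mi.fit(X_train, y_train)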

scores_mi = pd.DataFrame({
    "feature": X_train.columns,
    "score": selector_mi.scores_
}).sort_values("score", ascending=False)

scores_mi

| | feature | score |
|---|---|---|
| 20 | radius_worst | 0.458877 |
| 22 | perimeter_worst | 0.455969 |
| 23 | area_worst | 0.454075 |
| 7 | concave_points_mean | 0.44745 |
| 31 | perimeter_avg | 0.436197 |
| 27 | concave_points_worst | 0.434323 |
| 32 | area_avg | 0.427477 |
| 30 | radius_avg | 0.408471 |
| 2 | perimeter_mean | 0.407937 |
| 3 | area_mean | 0.373048 |


## 4. Random Forest Feature Importance

Random Forest computes feature importance from how much each variable reduces impurity (Gini by default in scikit-learn, optionally entropy), averaged across all decision trees in the ensemble and normalized to sum to 1.

This approach captures:

- nonlinear relationships  
- interactions between features  
- threshold-based decision patterns  

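It provides a model-based ranking that complements the statistical filter methods used earlier. Because impurity-based importance tends to be shared among strongly correlated features, the scores of related measurements (radius, perimeter, area) are best read as a group rather than individually.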
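Impurity-based importance is computed on the training data, so it can be useful to cross-check it with a different technique. The sketch below uses permutation importance on held-out data, which is not what this notebook does; it is shown only as a possible sanity check, on synthetic data where the informative columns are known in advance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: with shuffle=False the first 2 columns are the informative ones.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: derived from the training data, normalized to sum to 1.
impurity_importance = forest.feature_importances_

# Permutation importance: drop in held-out accuracy when one column is shuffled.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
permutation_mean = perm.importances_mean

for idx in range(X.shape[1]):
    print(idx, round(impurity_importance[idx], 3), round(permutation_mean[idx], 3))
```

When both rankings agree on the leading features, the impurity-based ranking is less likely to be an artifact of the training data.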


In [9]:
rf = RandomForestClassifier(random_state=RANDOM_STATE)
rf.fit(X_train, y_train)

rf_scores = pd.DataFrame({
    "feature": X_train.columns,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=False)

rf_scores

| | feature | importance |
|---|---|---|
| 20 | radius_worst | 0.128262 |
| 22 | perimeter_worst | 0.118803 |
| 31 | perimeter_avg | 0.099262 |
| 7 | concave_points_mean | 0.09456 |
| 23 | area_worst | 0.083919 |
| 27 | concave_points_worst | 0.081254 |
| 0 | radius_mean | 0.053352 |
| 32 | area_avg | 0.048824 |
| 30 | radius_avg | 0.041516 |
| 2 | perimeter_mean | 0.023361 |


## 5. Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) repeatedly trains a model (Logistic Regression in this case), ranks the features by the magnitude of their coefficients, and removes the weakest one, until the requested number of features (15 here) remains.

This wrapper-based technique is effective for:

- eliminating redundant features  
- identifying compact, high-performing subsets  
- aligning selection directly with model behavior  

RFE produced the best-performing subset in our tests (F1 = 0.976 on the test split; see Section 6).
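The elimination loop itself is short. The sketch below re-implements it by hand for a binary Logistic Regression and compares the result with scikit-learn's `RFE` on synthetic data; `manual_rfe` is a hypothetical helper, and the example assumes the default of removing one feature per iteration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        # Position (within `remaining`) of the weakest coefficient.
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del remaining[weakest]
    return remaining


# Synthetic stand-in data, not the breast cancer features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

manual_selected = manual_rfe(X, y, n_keep=4)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
sklearn_selected = np.flatnonzero(rfe.support_).tolist()

print("manual: ", manual_selected)
print("sklearn:", sklearn_selected)
```

Because the ranking depends on coefficient magnitudes, the features should be on a comparable scale before RFE is applied, as is the case for the standardized inputs used in this notebook.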

In [11]:
lr = LogisticRegression(max_iter=500, random_state=RANDOM_STATE)
rfe = RFE(estimator=lr, n_features_to_select=15)
rfe.fit(X_train, y_train)

rfe_features = X_train.columns[rfe.support_].tolist()
rfe_features

['concave_points_mean',
 'radius_se',
 'area_se',
 'compactness_se',
 'concave_points_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'concavity_worst',
 'symmetry_worst',
 'radius_avg',
 'perimeter_avg',
 'area_avg']

## 6. Comparing performance across feature subsets

Each feature selection method produces a different ranking of features.  
To evaluate the practical predictive value of these subsets, we train a Logistic Regression model on the top 15 features selected by each method and report its F1-score on the test set.

The goal is not only to identify statistically relevant features, but to confirm that the selected subsets maintain competitive predictive performance.

In [12]:
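# Helper: fit Logistic Regression on a feature subset and return the test-set F1-score
def evaluate_subset(features):
    """Train on X_train[features] and score on X_test[features] (F1, positive class = 1)."""
    model = LogisticRegression(max_iter=500, random_state=RANDOM_STATE)
    model.fit(X_train[features], y_train)
    pred = model.predict(X_test[features])
    return f1_score(y_test, pred)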

In [13]:
subsets = {
    "F-test": scores_f.head(15)["feature"].tolist(),
    "Mutual Information": scores_mi.head(15)["feature"].tolist(),
    "RandomForest": rf_scores.head(15)["feature"].tolist(),
    "RFE": rfe_features
}

results = {}

for name, feats in subsets.items():
    results[name] = evaluate_subset(feats)

results

{'F-test': 0.962962962962963,
 'Mutual Information': 0.951219512195122,
 'RandomForest': 0.963855421686747,
 'RFE': 0.975609756097561}
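The scores above come from a single train/test split, so small differences between subsets may partly reflect that particular split. One way to get a more stable estimate, sketched below, is to place the selector inside a pipeline and cross-validate it, so that features are re-selected within each fold. This is an illustration rather than the notebook's procedure: it uses scikit-learn's bundled copy of the dataset (30 raw features, where label 1 means benign) instead of the preprocessed files, so the numbers are not directly comparable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled stand-in for the project's preprocessed data.
X, y = load_breast_cancer(return_X_y=True)

# Scaling and selection are fitted on the training part of each fold only.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=15),
    LogisticRegression(max_iter=500),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_per_fold = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")

print("F1 per fold:", f1_per_fold.round(3))
print("Mean F1:", f1_per_fold.mean().round(3))
```

Swapping `SelectKBest` for `RFE` or another selector in the same pipeline would allow the methods to be compared on equal footing.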

## 7. Evaluation of the Selected Feature Set

In this section, we evaluate the performance of the final selected subset of 15 features and compare it against the baseline model trained with the full feature set.

This comparison allows us to verify whether the dimensionality reduction:

- preserves predictive performance  
- improves model efficiency  
- enhances generalization  
- or introduces a measurable performance trade-off  

Two Logistic Regression models are trained:

1. **Full Feature Model** — using all preprocessed features (baseline).  
2. **Selected Feature Model** — using the final 15 selected features identified through Feature Selection.

The evaluation includes:
- Accuracy  
- Precision  
- Recall  
- F1-score  
- ROC AUC  

In [16]:
SELECTED_FEATURES = [
    "concave_points_mean",
    "concavity_worst",
    "symmetry_worst",
    "radius_avg",
    "perimeter_avg",
    "area_avg",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "symmetry_mean",
    "fractal_dimension_mean"
]

In [17]:
# Full-feature model (baseline)
X_train_full = X_train.copy()
X_test_full  = X_test.copy()

# Selected-feature model
X_train_sel = X_train[SELECTED_FEATURES]
X_test_sel  = X_test[SELECTED_FEATURES]

In [19]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model and return the five test-set metrics as a dict."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Probability of the positive class (label 1), needed for ROC AUC
    prob = model.predict_proba(X_test)[:, 1]

    return {
        "Accuracy": accuracy_score(y_test, pred),
        "Precision": precision_score(y_test, pred),
        "Recall": recall_score(y_test, pred),
        "F1-score": f1_score(y_test, pred),
        "ROC AUC": roc_auc_score(y_test, prob)
    }

# Model A: Full features
model_full = LogisticRegression(max_iter=500, random_state=RANDOM_STATE)
results_full = evaluate_model(model_full, X_train_full, X_test_full, y_train, y_test)

# Model B: Selected features
model_sel = LogisticRegression(max_iter=500, random_state=RANDOM_STATE)
results_sel = evaluate_model(model_sel, X_train_sel, X_test_sel, y_train, y_test)

In [20]:
# Build the comparison table from the two result dicts (same metric order in both)
metrics = list(results_full)
comparison_df = pd.DataFrame({
    "Metric": metrics,
    "Full Features": [results_full[m] for m in metrics],
    "Selected Features": [results_sel[m] for m in metrics]
})

comparison_df

| | Metric | Full Features | Selected Features |
|---|---|---|---|
| 0 | Accuracy | 0.973684 | 0.964912 |
| 1 | Precision | 0.97561 | 0.952381 |
| 2 | Recall | 0.952381 | 0.952381 |
| 3 | F1-score | 0.963855 | 0.952381 |
| 4 | ROC AUC | 0.996032 | 0.992725 |
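To put the accuracy gap in perspective, the rates can be converted back into counts of test samples. The short calculation below assumes the test set size of 114 rows reported in the sanity check and uses the accuracy values from the table.

```python
# Test set size taken from the "Test shape" output earlier in the notebook.
n_test = 114

accuracy = {"Full Features": 0.973684, "Selected Features": 0.964912}

# Number of correctly classified test samples implied by each accuracy value.
correct = {name: round(acc * n_test) for name, acc in accuracy.items()}

# Difference between the two models, expressed in test samples.
extra_errors = correct["Full Features"] - correct["Selected Features"]

print(correct)
print("Additional misclassified samples with the selected subset:", extra_errors)
```

On a test set of this size, each misclassified sample moves accuracy by about 0.9 percentage points, so the gap between the two models is small in absolute terms.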


## 8. Conclusion

After applying four complementary methods of feature selection and evaluating the predictive performance of the resulting subsets, we identified a compact set of 15 features that captures the most relevant information for the classification task.

Although the full feature model achieved slightly higher scores (F1-score 0.964 vs 0.952, ROC AUC 0.996 vs 0.993), the selected subset still delivers strong predictive performance with 15 of the 36 features, and it offers several advantages:

- reduced dimensionality  
- simplified model structure  
- lower computational cost  
- improved interpretability  
- reduced redundancy  