# Introduction to Feature Selection



In this practical exercise, you will learn the essentials of feature selection in Python, specifically Recursive Feature Elimination (RFE), Chi-square, and Lasso. You will gain hands-on experience implementing these techniques in a machine learning pipeline with Random Forest as the classifier. You will also learn how to use GridSearchCV for hyperparameter tuning and cross-validation for model evaluation, and you will assess model performance by plotting ROC curves and computing the ROC AUC. This application-based learning reinforces the theory and prepares you to apply feature selection methods in real-world scenarios.

Data used in this practical:
```
breast_cancer data set from scikit-learn
```

The dataset used in this practical exercise is the Breast Cancer Wisconsin (Diagnostic) Dataset, which is available in the sklearn datasets module. This dataset contains characteristics derived from digitized images of fine-needle aspirate (FNA) of a breast mass.

Key characteristics of the dataset:

- Number of instances: 569
- Number of attributes: 30 numeric, predictive features
- Target variable: diagnosis (M = malignant, B = benign)

The features describe characteristics of the cell nuclei present in the image: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each of these ten characteristics appears three times in the dataset (as the mean, the standard error, and the "worst" or largest value), giving 30 features.

The objective with this dataset is to predict whether a tumor is malignant or benign based on these features. It's a commonly used dataset for binary classification problems in machine learning.
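The characteristics listed above can be checked directly against the copy of the dataset that ships with scikit-learn. The following is a minimal sketch; the variable names (`bc`, `X_bc`, `y_bc`) are chosen here for illustration only, and note that scikit-learn encodes the diagnosis as integers rather than the letters M and B.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset as pandas objects (illustrative variable names)
bc = load_breast_cancer(as_frame=True)
X_bc = bc.data      # DataFrame with the predictive features
y_bc = bc.target    # Series with the encoded diagnosis

# Number of instances and number of features
print(X_bc.shape)

# Class labels in the order of their integer codes (0, 1)
print(list(bc.target_names))
print(y_bc.value_counts().sort_index().to_dict())

# Count the three groups of features: mean, standard error, and "worst"
n_mean = sum(name.startswith("mean ") for name in X_bc.columns)
n_error = sum(name.endswith(" error") for name in X_bc.columns)
n_worst = sum(name.startswith("worst ") for name in X_bc.columns)
print(n_mean, n_error, n_worst)
```

Each of the three counts should be ten, one per nucleus characteristic.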



### Import Necessary Libraries:

This section imports the necessary libraries used for data manipulation, model training, feature selection, and metrics calculation.

### Load the Data:

This loads the breast cancer dataset from sklearn and assigns the independent variables to X and the dependent variable to y.

In [62]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score, roc_curve, auc, precision_score, recall_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt

### Normalize the Features:

1. Load the data.
2. Standardize the features with StandardScaler. This is crucial before applying LassoCV, which is sensitive to the scale of the input features.
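To make the effect of standardization concrete, here is a small sketch on synthetic data. The array `toy` and its values are invented for illustration and are not part of the practical; the point is only that two columns on very different scales end up with mean 0 and standard deviation 1, so an L1 penalty treats their coefficients comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (hypothetical values)
rng = np.random.default_rng(0)
toy = np.column_stack([
    rng.normal(loc=1000.0, scale=200.0, size=50),  # large-scale feature
    rng.normal(loc=0.1, scale=0.02, size=50),      # small-scale feature
])

toy_scaled = StandardScaler().fit_transform(toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(toy.mean(axis=0), toy.std(axis=0))
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```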


In [None]:
# Load the Breast Cancer Wisconsin (Diagnostic) dataset from scikit-learn
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.data             # All 30 numeric features as a DataFrame
y = data.target.values    # Target as a NumPy array (0 = malignant, 1 = benign)
X.describe()

feature_names = X.columns
feature_names

In [64]:
# Normalize the features for LASSO
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

In [65]:
# Define the feature selection methods
feature_selectors = {
    "rfe": RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42)),
    "chi2": SelectKBest(score_func=chi2, k=4),  # Select top 4 features; you can change this according to your need
}

In [66]:
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5)

In [None]:
## Start with RFE
print("Starting with Recursive Feature Elimination (RFE)...")

selector = feature_selectors["rfe"]
roc_aucs = []

# Apply StratifiedKFold cross-validation for RFE
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Feature selection
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    # Classifier training and prediction
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train_selected, y_train)
    y_pred_prob = rf.predict_proba(X_test_selected)[:, 1]

    # ROC AUC calculation and append to list
    roc_auc = roc_auc_score(y_test, y_pred_prob)
    roc_aucs.append(roc_auc)

# Print the average ROC AUC score across the folds
print(f"RFE CV ROC AUC: {np.mean(roc_aucs)} ± {np.std(roc_aucs)}\n")


In [None]:
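# chi2 requires non-negative feature values, so rescale with MinMaxScaler
# instead of reusing the standardized features (which contain negatives)
sc = MinMaxScaler(feature_range=(0, 100))
X_scaled = sc.fit_transform(X)
X_scaled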

In [None]:
# Now move on to Chi-square
print("Moving on to Chi-square Feature Selection...")

selector = feature_selectors["chi2"]
roc_aucs = []
feature_scores = []

# Apply StratifiedKFold cross-validation for Chi-square
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Feature selection
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    # Append the feature scores
    feature_scores.append(selector.scores_)

    # Classifier training and prediction
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train_selected, y_train)
    y_pred_prob = rf.predict_proba(X_test_selected)[:, 1]

    # ROC AUC calculation and append to list
    roc_auc = roc_auc_score(y_test, y_pred_prob)
    roc_aucs.append(roc_auc)


# Print the average ROC AUC score across the folds
print(f"Chi-square CV ROC AUC: {np.mean(roc_aucs)} ± {np.std(roc_aucs)}\n")

# Calculate average feature scores across folds and plot them
avg_scores = np.mean(feature_scores, axis=0)
plt.bar(feature_names, avg_scores)
plt.title("Feature Importances (Chi-square Scores)")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.xticks(rotation=90)
plt.show()


#### Your Task

1. Task 1: Run "Lasso" feature selection on the dataset.
    - Use RandomForestClassifier as the classifier.
    - Which features are the most important?
    - What is the model performance?
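One possible starting point for this task is sketched below. It is not a reference solution: it assumes the scikit-learn copy of the breast cancer data, wraps LassoCV in SelectFromModel to keep the features with non-zero coefficients, and fits the scaler inside each fold. The variable names (`X_lasso`, `y_lasso`, `lasso_aucs`) and the LassoCV settings are illustrative choices. Interpreting which features are selected is left to you.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Illustrative names; the data source is an assumption of this sketch
X_lasso, y_lasso = load_breast_cancer(return_X_y=True)
skf_lasso = StratifiedKFold(n_splits=5)
lasso_aucs = []

for train_idx, test_idx in skf_lasso.split(X_lasso, y_lasso):
    y_tr, y_te = y_lasso[train_idx], y_lasso[test_idx]

    # Standardize using statistics from the training fold only
    scaler = StandardScaler().fit(X_lasso[train_idx])
    X_tr = scaler.transform(X_lasso[train_idx])
    X_te = scaler.transform(X_lasso[test_idx])

    # Keep the features whose Lasso coefficients are not shrunk to zero
    selector = SelectFromModel(LassoCV(cv=5, random_state=42, max_iter=10000))
    X_tr_sel = selector.fit_transform(X_tr, y_tr)
    X_te_sel = selector.transform(X_te)

    # Classifier training and prediction on the selected features
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_tr_sel, y_tr)
    y_prob = rf.predict_proba(X_te_sel)[:, 1]
    lasso_aucs.append(roc_auc_score(y_te, y_prob))

print(f"Lasso CV ROC AUC: {np.mean(lasso_aucs):.3f} ± {np.std(lasso_aucs):.3f}")
```

After fitting, `selector.get_support()` returns a boolean mask over the input features, which can be used to list the features kept in each fold.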