## Feature Selection
1. Filter Methods
    - ANOVA [Numerical]
    - Chi-Square Test [Categorical]
2. Wrapper Methods
    - Forward/Backward Selection
3. Embedded Methods
    - Random Forests
4. Hybrid/Advanced Methods
    - Laplacian Score

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, chi2

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
# Helper used throughout: fit a logistic regression and report hold-out accuracy

def LogisticRegressionClassifier(X, y):
    # 70/30 train-test split on whatever feature matrix is passed in
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    print("Accuracy on test set:", np.round(model.score(X_test, y_test), 2))

## Filter Methods
- How they work: Select features based on statistical properties of the data, independent of the learning algorithm.
- Advantages: Simple, fast, model-agnostic, less risk of overfitting.
- Disadvantages: Score each feature on its own, so they miss interactions between features and ignore what the downstream model actually uses.

#### ANOVA - Analysis of Variance
- Works only with numerical independent variables and categorical dependent variables (classification).
- Useful for identifying features that separate classes well.
- Assumes each feature is roughly normally distributed within each class, with equal variance across classes.
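
To make the F-test concrete, the sketch below computes the per-feature statistic two ways: with `f_classif`, and with SciPy's one-way ANOVA applied to the per-class groups of each feature. The data and the names `X_demo`, `y_demo`, and `manual_f` are hypothetical and exist only for this illustration.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Hypothetical demo data (not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# F-statistic and p-value for every feature at once
f_scores, p_values = f_classif(X_demo, y_demo)

# Same statistic via SciPy: one-way ANOVA on the per-class groups of each feature
classes = np.unique(y_demo)
manual_f = np.array([
    f_oneway(*[X_demo[y_demo == c, j] for c in classes]).statistic
    for j in range(X_demo.shape[1])
])

for j, (f, p) in enumerate(zip(f_scores, p_values)):
    print(f"F{j}: F={f:.2f}, p={p:.4f}")
```

A larger F means the class means of that feature are far apart relative to the spread within each class, which is what `SelectKBest` ranks on.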

In [26]:
# synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=8, n_redundant=8, random_state=42)
feature_names = [f"F{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)

# Apply ANOVA
# This code selects the 8 best features from X that have the strongest statistical relationship with y 
# according to the ANOVA F-test, and returns a new dataset (X_new) with only those features.
selector = SelectKBest(score_func=f_classif, k=8)
X_new = selector.fit_transform(df, y)

# Get mask of selected features
mask = selector.get_support()

# Get feature names
selected_features = df.columns[mask]

print("Selected Features:", list(selected_features))

Selected Features: ['F1', 'F3', 'F4', 'F5', 'F10', 'F13', 'F16', 'F17']


In [27]:
X

array([[-3.82113336, -1.72761111,  0.02485234, ..., -4.76902424,
         2.11323528, -1.20280453],
       [-0.48794335, -1.04955638, -0.45821652, ..., -1.27180095,
        -3.61522701,  2.39431413],
       [-2.09808336, -0.05236947, -0.23289341, ..., -2.87091828,
         0.91288706,  1.97676115],
       ...,
       [ 6.73455321, -0.70715514,  0.38004746, ...,  3.06138915,
        -1.49128992,  0.44245816],
       [-1.96835398, -2.29580631,  2.27086528, ..., -1.14324578,
        -1.8506362 ,  1.84271981],
       [ 0.19438346,  0.36812812, -2.29397634, ...,  0.8246515 ,
        -2.15522817,  0.72180249]])

In [28]:
feature_names

['F0',
 'F1',
 'F2',
 'F3',
 'F4',
 'F5',
 'F6',
 'F7',
 'F8',
 'F9',
 'F10',
 'F11',
 'F12',
 'F13',
 'F14',
 'F15',
 'F16',
 'F17',
 'F18',
 'F19']

In [4]:
df.head()

| | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | F12 | F13 | F14 | F15 | F16 | F17 | F18 | F19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.821133 | -1.727611 | 0.024852 | 3.724986 | -4.291174 | -1.758557 | -2.235676 | -1.238163 | 1.001928 | 2.986384 | 1.964534 | -1.107971 | -1.563752 | -5.002432 | 0.298052 | -0.093705 | 4.33343 | -4.769024 | 2.113235 | -1.202805 |
| 1 | -0.487943 | -1.049556 | -0.458217 | 1.220114 | 1.682745 | 2.264427 | 2.732692 | -0.686916 | 2.218421 | 0.20104 | 0.95527 | -0.970621 | -2.759998 | 0.517214 | -0.010116 | -1.792456 | 0.098361 | -1.271801 | -3.615227 | 2.394314 |
| 2 | -2.098083 | -0.052369 | -0.232893 | 2.51796 | -1.628642 | 1.099946 | 0.662142 | -1.193155 | 0.205027 | 0.74913 | 0.310403 | 0.077356 | -3.380699 | 0.559598 | 1.001952 | -1.040515 | 0.142905 | -2.870918 | 0.912887 | 1.976761 |
| 3 | 1.073791 | 1.876327 | 0.301695 | -3.020519 | 0.120694 | -0.597747 | -1.393866 | -0.412072 | -0.068791 | 0.631349 | -0.882034 | -0.456179 | -6.077826 | 3.699888 | -0.108038 | -0.687402 | -3.560037 | 2.693257 | 0.820371 | 5.132176 |
| 4 | 6.208669 | -2.752368 | 0.797803 | 2.062771 | 0.362207 | 0.955333 | -0.883463 | 0.826994 | 0.236425 | -2.69777 | 3.169112 | 2.346627 | 2.733313 | 0.842772 | 0.230933 | -1.493111 | -2.004667 | 0.676131 | -4.056706 | 1.021478 |


In [29]:
df.shape

(100, 20)

In [5]:
print("\nLogistic Regression on Selected Features:")
LogisticRegressionClassifier(X_new, y)

print("\nLogistic Regression on All Features:")
LogisticRegressionClassifier(X, y)


Logistic Regression on Selected Features:
Accuracy on test set: 0.9

Logistic Regression on All Features:
Accuracy on test set: 0.9


#### Chi-square Test
- Works only with categorical independent variables and categorical dependent variables (classification).
- Useful for identifying features that are associated with the target.
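
As a worked example of what the test measures, the sketch below computes the chi-square statistic by hand from a small contingency table and checks it against `scipy.stats.chi2_contingency`. The counts in `observed` are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = feature categories, columns = target classes
observed = np.array([
    [30, 10],
    [20, 40],
    [25, 25],
])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# SciPy's result; correction=False disables the Yates correction (only relevant for 2x2 tables)
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)

print(f"manual chi2={chi2_manual:.3f}, scipy chi2={chi2_scipy:.3f}, dof={dof}, p={p_value:.4f}")
```

The degrees of freedom are (rows - 1) × (columns - 1), which is why features with more categories can reach larger statistics even when the association is no stronger.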

In [6]:
# Load Car dataset
car = fetch_openml(name="car", version=1, as_frame=True)
df_car = car.frame



In [7]:
df_car.head()

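| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |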


In [8]:
df_car['class'].value_counts()

class
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64

In [9]:
# Target variable: 'class' (car acceptability: unacc, acc, good, vgood)
y = df_car['class']

In [10]:
df_car.columns

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

In [11]:
# Select only categorical features (all columns except target)
categorical_columns = df_car.drop(columns=['class']).select_dtypes(include=['object', 'category']).columns
categorical_columns

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')

In [12]:
from scipy.stats import chi2_contingency

# Perform Chi-square test for each feature
# (the statistic is named chi2_stat so it does not shadow sklearn's chi2 imported earlier)
print("Chi-square test results (feature vs target):")
for feature in categorical_columns:
    contingency_table = pd.crosstab(df_car[feature], y)
    chi2_stat, p, dof, expected = chi2_contingency(contingency_table)
    print(f"{feature}: chi2={chi2_stat:.2f}, p-value={p:.4f}")

Chi-square test results (feature vs target):
buying: chi2=189.24, p-value=0.0000
maint: chi2=142.94, p-value=0.0000
doors: chi2=10.38, p-value=0.3202
persons: chi2=371.34, p-value=0.0000
lug_boot: chi2=53.28, p-value=0.0000
safety: chi2=479.32, p-value=0.0000


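 - The p-value indicates whether the association is statistically significant. [p < 0.05 → reject independence; the feature is associated with the target.]
 - The Chi2 statistic measures how far the observed counts deviate from those expected under independence. [Higher Chi2 → stronger evidence of association, but the value also grows with sample size and table size, so read it alongside the p-value.]

doors: Chi2 = 10.38 (low) and p-value = 0.32 → not significant → no evidence of association, so it is left out of the model below.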
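
Because the raw Chi2 statistic scales with the number of samples and categories, a normalized effect size can make features easier to compare. One common choice is Cramér's V, which rescales Chi2 to the range 0 to 1. The helper `cramers_v` and the toy series below are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(feature, target):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), between 0 and 1."""
    table = pd.crosstab(feature, target)
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2_stat / (n * min_dim))

# Toy data: one feature that determines the target, one that is unrelated to it
target = pd.Series(list("xxyyzz"))
perfect = pd.Series(list("aabbcc"))
unrelated = pd.Series(list("ababab"))

v_perfect = cramers_v(perfect, target)
v_unrelated = cramers_v(unrelated, target)
print("perfect association:", round(v_perfect, 3))
print("unrelated feature:", round(v_unrelated, 3))
```

Applied to the car data, the same helper could be called as `cramers_v(df_car[feature], df_car['class'])` for each categorical column.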

In [13]:
y = df_car['class']
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Features selected based on Chi-square test results
X = df_car[['buying', 'maint', 'persons', 'lug_boot', 'safety']]

# One-Hot Encode categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

print("\nLogistic Regression on Selected Features:")
LogisticRegressionClassifier(X_encoded, y_encoded)


Logistic Regression on Selected Features:
Accuracy on test set: 0.9


## Wrapper Methods
- How they work: Use a predictive model to evaluate subsets of features.
- Advantages: Capture feature interactions with the model.
- Disadvantages: Computationally expensive, risk of overfitting.

#### Forward/Backward Selection
- Works with numerical or categorical independent variables and a target variable (classification or regression).
- Forward selection starts with no features and adds them one by one, choosing the feature that improves model performance the most at each step.
- Backward selection starts with all features and removes the least useful one at each step.
- Useful for building a minimal subset of predictive features.
- Can be computationally expensive for high-dimensional data.
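
The greedy procedure can be written out by hand in a few lines. The sketch below is a minimal version for illustration, assuming cross-validated R² as the selection criterion; `forward_select` and the synthetic data are hypothetical names, and in practice scikit-learn's `SequentialFeatureSelector` does the same job.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, n_features, cv=5, scoring="r2"):
    """Greedy forward selection: repeatedly add the column that maximizes the mean CV score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Score every candidate subset formed by adding one remaining column
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv, scoring=scoring).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic regression data in which only 2 of the 6 columns carry signal
X_demo, y_demo, coef = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=0.1, coef=True, random_state=0
)
chosen = forward_select(LinearRegression(), X_demo, y_demo, n_features=2)
print("Selected column indices:", chosen)
print("Truly informative columns:", np.flatnonzero(coef).tolist())
```

Each step fits one model per remaining feature per fold, so the cost grows roughly with (number of features) × (features to select) × (CV folds).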

In [31]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Load California Housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

house_price_df = pd.concat([X, y], axis=1)
house_price_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [32]:
# Initialize linear regression model
lr = LinearRegression()

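# Sequential feature selection
# direction='forward': start with no features and add the one that improves the CV score most at each step.
# direction='backward': start with all features and remove the least useful one at each step.
# This cell runs backward elimination; set direction='forward' for forward selection.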
sfs = SequentialFeatureSelector(lr, n_features_to_select=3, direction='backward', scoring='r2', cv=5)
sfs.fit(X, y)
selected_features = X.columns[sfs.get_support()]
print("Selected features:", list(selected_features))

# Cross-validated R² on all features vs. the selected subset
cv_r2_all = cross_val_score(lr, X, y, cv=5, scoring='r2').mean()
cv_r2_selected = cross_val_score(lr, X[selected_features], y, cv=5, scoring='r2').mean()

print("Cross-validated R² on all features:", np.round(cv_r2_all, 3))
print("Cross-validated R² on selected features:", np.round(cv_r2_selected, 3))

Selected features: ['MedInc', 'Latitude', 'Longitude']
Cross-validated R² on all features: 0.553
Cross-validated R² on selected features: 0.533


## Embedded Methods
- How they work: Feature selection happens naturally during model training.
- Advantages: More efficient than wrappers; consider model-specific interactions.
- Disadvantages: Dependent on chosen algorithm.

#### Random Forest Feature Selection
- Works with numerical or categorical independent variables and a target variable (classification or regression).
- Performs feature selection naturally during model training by evaluating feature importance based on splits.
- Useful for identifying the most predictive features without separate selection steps.
- Handles high-dimensional data well and captures non-linear relationships.
- Impurity-based importances can be biased toward high-cardinality features (categorical features with many levels or numerical features with many unique values).
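
One way to cross-check impurity-based importances against this bias is permutation importance, which measures how much a held-out score drops when a feature's values are shuffled. The sketch below uses synthetic data; `X_demo`, `y_demo`, and `rf_demo` are hypothetical names used only here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which only 2 of the 6 columns carry signal
X_demo, y_demo, coef = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=0.1, coef=True, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the test set 10 times and record the drop in R²
result = permutation_importance(rf_demo, X_test, y_test, n_repeats=10, random_state=0)

for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {j}: impurity={rf_demo.feature_importances_[j]:.3f}, "
          f"permutation={result.importances_mean[j]:.3f}")
```

Evaluating on held-out data means a feature only scores well if the model's generalization actually depends on it.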

In [22]:
from sklearn.ensemble import RandomForestRegressor

# Load dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Print top features
print("Top features by Random Forest importance:")
print(importances.head(5))

# Optional: select features above a threshold
threshold = 0.05
selected_features = importances[importances > threshold].index.tolist()
print("\nSelected features (importance > 0.05):", selected_features)


Top features by Random Forest importance:
MedInc       0.520037
AveOccup     0.136406
Latitude     0.092856
Longitude    0.092694
HouseAge     0.052964
dtype: float64

Selected features (importance > 0.05): ['MedInc', 'AveOccup', 'Latitude', 'Longitude', 'HouseAge']


- Random Forest calculates importance automatically during training.
- Features with higher importance contribute more to prediction.
- You can select a subset of features using a threshold or top-k features.
- Works for both regression and classification tasks.
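
Threshold and top-k selection can both be expressed with scikit-learn's `SelectFromModel`, which wraps the estimator and returns the reduced feature matrix directly. This is a sketch on synthetic data; the variable names are hypothetical, and the 0.05 threshold simply mirrors the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=0.1, random_state=0
)

# Option 1: keep every feature whose importance is at least 0.05
by_threshold = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42), threshold=0.05
)
X_reduced = by_threshold.fit_transform(X_demo, y_demo)
print("Kept by threshold:", np.flatnonzero(by_threshold.get_support()).tolist())

# Option 2: keep exactly the top 3 features (threshold disabled with -inf)
top_k = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42),
    threshold=-np.inf,
    max_features=3,
).fit(X_demo, y_demo)
print("Top 3 features:", np.flatnonzero(top_k.get_support()).tolist())
```

Because `SelectFromModel` is a transformer, it can sit inside a `Pipeline` so that selection is refit on each training fold.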

## Hybrid/Advanced Methods
- How they work: Combine filter, wrapper, and/or embedded approaches for better balance.
- Advantages: Leverage strengths of multiple methods; improve accuracy; handle high-dimensional and complex data; reduce overfitting risk.
- Disadvantages: Computationally expensive; harder to implement and tune; may still select redundant features; reproducibility can be challenging.
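
A simple hybrid can be sketched as a pipeline: a cheap filter trims the feature set first, then a wrapper refines what is left. The example below assumes an ANOVA filter followed by recursive feature elimination (RFE) with logistic regression; the data, step names, and the values of k are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, n_redundant=5, random_state=0
)

hybrid = Pipeline([
    # Stage 1 (filter): keep the 15 features with the highest ANOVA F-score
    ("filter", SelectKBest(score_func=f_classif, k=15)),
    # Stage 2 (wrapper): recursively drop the weakest features until 5 remain
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
    # Final model trained on the surviving features
    ("model", LogisticRegression(max_iter=1000)),
])

# Selection is refit inside every fold, so test folds never influence which features are kept
scores = cross_val_score(hybrid, X_demo, y_demo, cv=5)
print("Mean CV accuracy:", scores.mean().round(3))

hybrid.fit(X_demo, y_demo)
n_after_filter = hybrid.named_steps["filter"].get_support().sum()
n_after_wrapper = hybrid.named_steps["wrapper"].support_.sum()
print("Features after filter:", n_after_filter, "| after wrapper:", n_after_wrapper)
```

Running the filter first keeps the expensive wrapper stage to 15 candidates instead of 30.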

#### Laplacian Score - Unsupervised
- Works with numerical independent variables and needs no target labels; each feature is scored on its own, much like a filter method.
- Measures the locality-preserving power of each feature by evaluating how well it respects the intrinsic geometric structure of the data.
- Useful for selecting features that best preserve the data manifold and local relationships.
- Can handle high-dimensional data and is effective in identifying discriminative features in unsupervised settings.
- Scores features independently, so it cannot detect redundancy among them; results depend on the neighbor graph (k, metric) and can be sensitive to noise, outliers, and feature scaling.
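
For readers without `skfeature` installed, the score can be sketched directly in NumPy. The version below assumes a symmetric, binary k-nearest-neighbor graph (rather than heat-kernel weights) and follows the usual definition: remove the degree-weighted mean from each feature, then take the ratio f̃ᵀLf̃ / f̃ᵀDf̃. The function name `laplacian_score` is hypothetical and is not the `skfeature` API.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, n_neighbors=5):
    """Laplacian Score per feature; lower means the feature better preserves local structure."""
    X = np.asarray(X, dtype=float)
    # Binary kNN adjacency, symmetrized: W[i, j] = 1 if either point is a neighbor of the other
    A = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity", include_self=False).toarray()
    W = np.maximum(A, A.T)
    d = W.sum(axis=1)               # degree of each sample
    L = np.diag(d) - W              # graph Laplacian L = D - W
    # Remove the degree-weighted mean from every feature
    X_centered = X - (d @ X) / d.sum()
    numerator = (X_centered * (L @ X_centered)).sum(axis=0)    # f~^T L f~ per feature
    denominator = (X_centered ** 2 * d[:, None]).sum(axis=0)   # f~^T D f~ per feature
    return numerator / denominator

iris = load_iris()
iris_scores = laplacian_score(iris.data, n_neighbors=5)
for name, s in sorted(zip(iris.feature_names, iris_scores), key=lambda t: t[1]):
    print(f"{name}: {s:.4f}")
```

Because both numerator and denominator scale with the square of the feature, the score is unaffected by a feature's units, although the neighbor graph itself is not.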

In [23]:
from skfeature.function.similarity_based import lap_score
from skfeature.utility import construct_W
from sklearn.datasets import load_iris

# Load dataset (for demonstration, using iris dataset)
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

iris_df = pd.DataFrame(X, columns=feature_names)
iris_df['target'] = y

iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [25]:
# Construct the affinity (similarity) matrix W using the k-nearest neighbors (k=5) approach, 
# where similarity between data points in X is computed with the Euclidean distance metric.
W = construct_W.construct_W(X, mode='knn', neighbor=5, metric='euclidean')

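# Compute Laplacian Score
# Note: depending on the skfeature version, lap_score may return raw scores or rank positions.
# The whole-number values (0 to 3) in the output below suggest ranks, so check your version's behavior.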
score = lap_score.lap_score(X, W=W)
lap_scores = score.flatten()

# Rank features (lower score = more important)
feature_ranking = np.argsort(lap_scores)
print("Feature ranking by Laplacian Score (most important first):")
for i in feature_ranking:
    print(f"{feature_names[i]}: {lap_scores[i]:.4f}")

# Optionally, select top k features
k = 3
top_features = [feature_names[i] for i in feature_ranking[:k]]
print("\nTop", k, "features:", top_features)

Feature ranking by Laplacian Score (most important first):
sepal width (cm): 0.0000
sepal length (cm): 1.0000
petal width (cm): 2.0000
petal length (cm): 3.0000

Top 3 features: ['sepal width (cm)', 'sepal length (cm)', 'petal width (cm)']
