# Feature Selection: Picking the Right Inputs for Better Models

In Machine Learning, **features** are the input columns we feed into a model.

**Feature Selection** means identifying and choosing the *most important* features from all available ones, creating a smarter, leaner subset.  
Not every feature is valuable — sometimes, **less is more**!

---

## Why Perform Feature Selection?

1. **Curse of Dimensionality**  
   - As features are added, the data becomes increasingly **sparse** relative to the feature space, so the model needs many more samples to generalize.  
   - Beyond some point, extra features **reduce performance** instead of improving it.

2. **Computational Complexity**  
   - Fewer features mean faster training and less resource consumption.

3. **Interpretability**  
   - Models with fewer, relevant features are easier to interpret, explain, and deploy.

---

## Types of Feature Selection Techniques

- **Filter Methods**  
  (Selection based on statistical properties of the data.)

- **Wrapper Methods**  
  (Selection based on model performance metrics.)

- **Embedded Methods**  
  (Feature selection occurs naturally during model training.)

- **Hybrid Methods**  
  (Combination of filter and wrapper approaches.)

---

## Important Tip: Avoid Data Leakage!

Whenever applying feature selection, **always split your dataset first**:
- **Training Data** (for model building)
- **Testing Data** (for model evaluation)

Then, **fit the feature selector only on the training data** and apply the resulting selection to the testing data.

> **Why?**  
> If you perform feature selection on the entire dataset, the model may accidentally "see" parts of the testing data, causing **Data Leakage**.  
> This results in overly optimistic performance during evaluation, but poor performance in real-world scenarios.
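
One way to follow this advice in practice is to put the selector and the model in a single scikit-learn `Pipeline`, so the selector is only ever fitted on the training rows. The sketch below assumes synthetic data from `make_classification` and an arbitrary choice of `SelectKBest` with `k=5`; neither comes from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data, purely for illustration (an assumption, not a real dataset)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Selector and model share one pipeline, so the selector
#    is fitted on the training rows only
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# 3. The test rows only pass through transform/predict, never fit
test_accuracy = pipe.score(X_test, y_test)
n_selected = int(pipe.named_steps["select"].get_support().sum())
print("Selected features:", n_selected)
print("Test accuracy:", round(test_accuracy, 3))
```

The same pipeline can be passed to `cross_val_score`, in which case the selector is refitted inside every fold.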

---

✅ **Feature Selection isn't just about reducing features — it's about building smarter, faster, and more reliable models.**


# Filter-Based Feature Selection Techniques

---

## Techniques

### 1. Drop Duplicate Columns
- If two columns are identical, drop one.
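
A minimal pandas sketch of this step is shown below. The DataFrame and its column names (`age`, `age_copy`, `income`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data: "age_copy" is an exact duplicate of "age"
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "age_copy": [25, 32, 47, 51],
    "income": [40, 55, 80, 62],
})

# Transpose so columns become rows, then flag rows that repeat an earlier one
duplicate_mask = df.T.duplicated()

# Keep only the first occurrence of each distinct column
df_unique = df.loc[:, ~duplicate_mask.to_numpy()]
print(list(df_unique.columns))
```

Transposing copies the data, so for very wide or very large tables a hash-based comparison per column may be cheaper.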

---

### 2. Variance Threshold
Variance thresholding is applied mainly on two types of features:
- **Constant Features:**  
  Columns where all values are the same (zero variance).

- **Quasi-Constant Features:**  
  Columns where a large percentage (e.g., 99.5%) of rows have the same value.

**Steps:**
1. Define a threshold (typically between 0.01 and 0.1).
2. Calculate the variance for each feature.
3. Drop the feature if its variance is less than the threshold.

> Use `VarianceThreshold` from `scikit-learn`.
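
To see why a threshold around 0.01 catches quasi-constant features, consider a 0/1 column whose variance is `p * (1 - p)`, where `p` is the share of ones. The sketch below builds a column that is 99.5% zeros (matching the example percentage above) next to a roughly balanced column; the data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

# Quasi-constant column: 995 zeros and 5 ones (99.5% identical values)
quasi_constant = np.zeros(n)
quasi_constant[:5] = 1

# Roughly balanced 0/1 column for comparison
balanced = rng.integers(0, 2, size=n).astype(float)

X = np.column_stack([quasi_constant, balanced])

# Variance of a 0/1 column is p * (1 - p)
p = quasi_constant.mean()
print("Quasi-constant variance:", p * (1 - p))

# A threshold of 0.01 drops the quasi-constant column and keeps the other
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
kept = selector.get_support()
print("Kept columns:", kept)
```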

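**Points to Consider:**
- Ignores the target variable, so a low-variance feature that is still predictive can be dropped.
- Ignores feature interactions.
- Sensitive to data scaling: variance depends on a feature's units, so compare features on a common scale.
- The threshold choice is arbitrary.

**Disadvantages:**
- Doesn't capture relationships between features or with the target.
- Sensitive to outliers, which can inflate a feature's variance.
- Threshold selection can be challenging.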

---

### 3. ANOVA (Analysis of Variance)
Use when:
- **Input Features:** Numerical
- **Target Variable:** Categorical (two or more classes)

Variation:
- If both input and output are numerical, use the regression F-test instead (`f_regression` in scikit-learn).

**Disadvantages:**
- Assumption of normality.
- Homogeneity of variance required.
- Independence of observations needed.
- Sensitive to outliers.
- Does not capture feature interactions.
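
As a rough illustration of how the ANOVA F-test ranks features, the sketch below scores one feature whose mean shifts with the class against one that is pure noise. The data is synthetic and the names `informative` and `noise` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)

# Three classes with 50 samples each
y = np.repeat([0, 1, 2], 50)

# Feature whose class means are 0, 2 and 4 (unit-variance noise on top)
informative = y * 2.0 + rng.normal(size=y.size)

# Feature that is unrelated to the class
noise = rng.normal(size=y.size)

X = np.column_stack([informative, noise])

# f_classif returns one F-statistic and one p-value per feature
f_scores, p_values = f_classif(X, y)
print("F-scores:", f_scores)
print("p-values:", p_values)
```

A larger F-score (and smaller p-value) suggests that the feature's mean differs between classes, which is what `SelectKBest(f_classif, ...)` uses to rank features.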

---

### 4. Chi-Square Test
Use when both input and output columns are categorical.

- Null Hypothesis: No relationship between the two variables.
- Small p-value → Important feature.
- Large p-value → Less important feature.

**Disadvantages:**
- Works only with categorical data.
- Assumes independence of observations.
- Requires a sufficient sample size.
- Ignores interactions between features.
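
The hypothesis test described above can be sketched with `scipy.stats.chi2_contingency` on a contingency table. The `plan` and `churned` columns below are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical feature ("plan") and categorical target ("churned")
df = pd.DataFrame({
    "plan": ["basic"] * 40 + ["premium"] * 40,
    "churned": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 30,
})

# Contingency table of observed counts
table = pd.crosstab(df["plan"], df["churned"])

# Null hypothesis: "plan" and "churned" are independent
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print("chi2:", chi2_stat, "p-value:", p_value, "dof:", dof)
```

Here the p-value is far below 0.05, so the null hypothesis is rejected and `plan` would be treated as an informative feature. Note that SciPy applies Yates' continuity correction by default for 2x2 tables.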

---

## Filter Method: Advantages and Disadvantages

**Advantages:**
- Simple to apply.
- Fast and scalable.
- Suitable as a preprocessing step.

**Disadvantages:**
- Ignores feature-to-feature relationships.
- Model-agnostic: no model is trained, so the selected features may not be the best ones for the final model.
- Limited to basic statistical measures.
- Arbitrary threshold decisions.

---

## Overall

Filter methods evaluate features independently using statistical tests.  
However, they ignore possible relationships between features, meaning important combinations might be missed.

---


# Wrapper Methods for Feature Selection

---

## What Is a Wrapper Method?

Wrapper methods use a predictive model to score feature subsets.  
They are called "wrapper" methods because they **wrap** the feature selection process around model evaluation.

- Candidate subsets of input features are generated (all possible subsets in the exhaustive case).
- Each subset, along with the target variable, is used to train a machine learning model (e.g., Linear Regression).
- Model performance (e.g., R² score) is calculated for each subset.
- The subset with the highest performance score is selected as the best.
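
The loop described above can be sketched by hand for a very small feature set. This is a brute-force illustration under assumed settings (synthetic data from `make_regression`, 4 features, 5-fold cross-validated R²), not a recommended implementation for real datasets.

```python
from itertools import combinations

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic regression problem (illustrative only)
X, y = make_regression(
    n_samples=200, n_features=4, n_informative=2, noise=5.0, random_state=0
)
n_features = X.shape[1]

# Score every non-empty subset of feature indices
results = {}
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        scores = cross_val_score(
            LinearRegression(), X[:, list(subset)], y, cv=5, scoring="r2"
        )
        results[subset] = scores.mean()

# Pick the subset with the highest mean cross-validated R²
best_subset = max(results, key=results.get)
print("Subsets evaluated:", len(results))
print("Best subset:", best_subset, "R2:", round(results[best_subset], 3))
```

With 4 features there are only 15 subsets; the count doubles with every added feature, which is why the sequential variants exist.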

---

## Steps:

1. Subset Generation
2. Subset Evaluation
3. Stopping Criterion

---

## Four Types of Wrapper Methods:

### 1. Exhaustive Feature Selection (Best Subset Selection)
- Tries all possible feature combinations.
- **Disadvantages:**
  - Very high computational complexity.
  - Risk of overfitting.
  - Requires a good evaluation metric.
  
- **Note:**  
  Gives very good results for small feature sets but becomes impractical when the number of features is large.

---

### 2. Backward Elimination (Sequential Backward Selection)
- Starts with all features.
- Iteratively removes the least important feature at each step.

- **Advantages:**
  - Faster than Exhaustive Search.

- **Disadvantages:**
  - Once a feature is removed, it cannot be re-added.
  - May miss the globally optimal feature set.

- **When to use:**
  - Useful when you have many features and want to retain most of them.

---

### 3. Forward Selection (Sequential Forward Selection)
- Starts with no features.
- Iteratively adds the most important feature at each step.

- **Advantages:**
  - Good balance between model complexity and performance.

- **Disadvantages:**
  - Like backward selection, may miss the global best subset.

- **When to use:**
  - Useful when you want to select a small number of best features from a large feature set.
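
To make the greedy procedure concrete, here is a hand-written sketch of forward selection. It assumes synthetic data, a `LinearRegression` model, 5-fold cross-validated R² as the score, and a fixed target of 3 features; all of these are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(
    n_samples=200, n_features=6, n_informative=3, noise=5.0, random_state=0
)

selected = []                          # start with no features
remaining = list(range(X.shape[1]))    # candidate feature indices
n_to_select = 3                        # assumed stopping criterion
candidates_scored = 0

while len(selected) < n_to_select:
    # Score each remaining feature when added to the current selection
    scores = {}
    for j in remaining:
        cols = selected + [j]
        scores[j] = cross_val_score(
            LinearRegression(), X[:, cols], y, cv=5, scoring="r2"
        ).mean()
        candidates_scored += 1

    # Add the feature that gives the best score, and never revisit that choice
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
print("Candidate subsets scored:", candidates_scored)
```

Backward elimination follows the same pattern in reverse: start from all features and drop the one whose removal hurts the score least.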

---

### 4. Recursive Feature Elimination (RFE)
- Recursively removes features based on model weights or importance.
- A popular method for model-driven feature selection.

---

## Important Parameters:

| Method     | Model evaluations (approx., n = number of features) |
|------------|------------------------------------------------------|
| Exhaustive | 2^n - 1 subsets                                      |
| Backward   | up to n(n+1)/2                                       |
| Forward    | up to n(n+1)/2                                       |

> Example (Forward Selection with Linear Regression, using `SequentialFeatureSelector` from `mlxtend`, imported as `SFS`):  
> `SFS(lr, k_features='best', forward=True, floating=False, scoring='r2', cv=5)`
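
The gap between the formulas in the table is easier to appreciate with numbers. The helper names below are illustrative, and the counts ignore cross-validation, which multiplies each evaluation by the number of folds.

```python
def exhaustive_count(n: int) -> int:
    """Number of non-empty feature subsets: 2^n - 1."""
    return 2 ** n - 1


def sequential_count(n: int) -> int:
    """Upper bound for forward/backward selection: n + (n-1) + ... + 1."""
    return n * (n + 1) // 2


for n in (5, 10, 20):
    print(n, exhaustive_count(n), sequential_count(n))
```

For 20 features, exhaustive search needs 1,048,575 evaluations, while a sequential search needs at most 210.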

---

## How to Decide Between Forward and Backward?

- If you have many features and want to retain most of them → **Backward Selection**.
- If you have many features but want to pick only a few of the best → **Forward Selection**.
- If you just want the "best features" without a fixed count → either Forward or Backward.

---

## Advantages of Wrapper Methods:

- High Accuracy.
- Considers Feature Interactions.

---

## Disadvantages of Wrapper Methods:

- Very high computational cost.
- Risk of Overfitting.
- Model-specific (depends on chosen algorithm).

---

## Filter vs Wrapper Method:

| Filter Method                  | Wrapper Method                   |
|---------------------------------|-----------------------------------|
| Evaluates features individually | Evaluates feature subsets together |
| Ignores feature interactions     | Considers feature interactions    |
| Fast and simple                 | Computationally expensive         |
| Model-agnostic                  | Model-specific                    |

---

## Overall

Wrapper methods study the interaction between features but are computationally expensive and slower compared to filter methods.

---


# Embedded Method for Feature Selection

---

## What Is an Embedded Method?

Embedded methods integrate feature selection into the model-building process. When you train the model, you get not only its predictions but also the **importance of each feature**.

### How It Works:
- During model training, feature importance (a coefficient or an importance score) is calculated as a by-product of fitting.
- Because selection happens within a single training run, this approach accounts for feature interactions while being much faster than wrapper methods.

**Applicable Algorithms:**

- **Models with coefficients or feature importance attributes:**
  - **Linear Models:**  
    - Linear Regression, Logistic Regression, Ridge, Lasso, Elastic Net
  - **Tree-based Models:**  
    - Decision Tree, Random Forest, Gradient Boosting

---

## Linear Regression Assumptions:

1. Linearity
2. Independence of errors
3. Homoscedasticity (constant variance)
4. Normality of errors
5. No multicollinearity

---

## Regularized Linear Models

Regularization helps **reduce overfitting** by adding a penalty term to the loss function.  
The penalty grows with the size of the coefficients, so it discourages complex models and shrinks the coefficients toward zero.

**Types of Regularization:**
- **Ridge:** Reduces coefficients but does not set them to zero.
- **Lasso:** Can completely set some feature coefficients to zero (great for feature selection).
- **Elastic Net:** A mix of Ridge and Lasso.
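
The difference between Ridge and Lasso shows up directly in the fitted coefficients. The sketch below assumes synthetic data with 3 informative features out of 10 and arbitrary penalty strengths (`alpha=1.0` for Ridge, `alpha=5.0` for Lasso); in practice `alpha` would be tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: only 3 of the 10 features carry signal (illustrative only)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

# Scale features so the penalty treats them comparably
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", ridge_zeros)
print("Lasso zero coefficients:", lasso_zeros)
```

Ridge keeps every coefficient non-zero (only smaller), while Lasso drives the uninformative ones to exactly zero, which is what makes it usable as a selector.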

---

## Tree-based Models

- **Decision Tree**, **Random Forest**, **Gradient Boosting**:  
  All tree-based algorithms naturally provide feature importance.

---

## Recursive Feature Elimination (RFE)

- **How It Works:**
  1. Use a model that provides coefficients or feature importance.
  2. Rank features by importance.
  3. Iteratively remove the least important features until the best feature subset is selected.

- **Hybrid Approach:**  
  Combines both **Embedded** and **Wrapper** methods for enhanced feature selection.

---

## Advantages of Embedded Methods:

1. **High Performance:** Efficient and fast.
2. **Less Prone to Overfitting:** As feature selection is embedded in the model.
3. **Efficiency:** Integrates feature selection and model building.

---

## Disadvantages of Embedded Methods:

1. **Model-Specific:** Dependent on the chosen algorithm.
2. **Complexity:** More complex than filter-based methods.
3. **Requires Tuning:** Models need proper tuning.
4. **Stability:** Can be sensitive to data fluctuations.

---

## Filter-based Method: Mutual Information

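**What it does:** Measures the dependency between two variables; in feature selection, typically a feature and the target.  
It quantifies how much information the two variables share, that is, how much knowing one reduces uncertainty about the other.

- **Formula:**  
  $$
  MI(X, Y) = \sum_{x} \sum_{y} P(x, y) \log \left( \frac{P(x, y)}{P(x) P(y)} \right)
  $$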

- **Advantages:**
  1. Non-negative.
  2. Symmetric.
  3. Can capture any type of statistical dependency.

- **Disadvantages:**
  1. Estimation difficulty.
  2. Requires large sample sizes.
  3. Computationally expensive.
  4. Difficulty with continuous variables.
  5. Doesn’t indicate the nature of relationships.
  6. Doesn’t account for redundancy.
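
For discrete variables, the mutual information formula above can be evaluated directly from the joint distribution. The sketch below does this by hand for two small 0/1 arrays (made up for illustration) and compares the result with `sklearn.metrics.mutual_info_score`, which uses the natural logarithm.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Two small binary variables that agree on 6 of 8 samples (illustrative data)
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Joint distribution P(x, y) from co-occurrence counts
joint = np.zeros((2, 2))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()

# Marginals P(x) and P(y)
px = joint.sum(axis=1, keepdims=True)   # shape (2, 1)
py = joint.sum(axis=0, keepdims=True)   # shape (1, 2)

# Sum P(x, y) * log(P(x, y) / (P(x) P(y))) over cells with non-zero probability
nonzero = joint > 0
mi_manual = float(
    np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero]))
)

mi_sklearn = mutual_info_score(x, y)
print("Manual MI:", mi_manual)
print("sklearn MI:", mi_sklearn)
```

For continuous features, the joint distribution has to be estimated (for example with nearest-neighbour methods, as `mutual_info_classif` does), which is where the estimation difficulties listed above come from.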

---

## Overall

**Embedded methods** are a good choice for handling feature selection with minimal computational cost compared to wrapper methods while accounting for feature interactions.  
But they are model-specific and may require more complex tuning and stability considerations.

---


# Feature Selection Cheat Sheet

---

## 1) Filter Methods:
- **Variance Threshold:** Removes features with low variance.
- **Correlation Coefficient:** Removes features highly correlated with others.
- **Chi-Square Test:** Measures the relationship between categorical features and target.
- **Mutual Information:** Measures the dependency between each feature and the target, including nonlinear relationships.
- **ANOVA:** Measures the relationship between numerical input and categorical output.

---

## 2) Wrapper Methods:
- **Recursive Feature Elimination (RFE):** Iteratively removes features to find the optimal subset.
- **Sequential Feature Selection (SFS):** Adds or removes features one by one based on model performance.
- **Exhaustive Feature Selection:** Tries all possible feature subsets to find the best combination.

---

## 3) Embedded Methods:
- **Lasso Regression:** Uses L1 regularization to shrink coefficients and perform feature selection.
- **Ridge Regression:** Uses L2 regularization to shrink coefficients; it does not set them to zero, so it ranks features rather than removing them.
- **Elastic Net:** Combines both L1 and L2 regularization.
- **Random Forest Feature Importance:** Measures feature importance based on tree-based models.

---


# Code Examples

```python
from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Example: Removing features with low variance
# X is your input feature matrix, where each row is a sample and each column is a feature
# y is the target variable (not used by VarianceThreshold)

# Simulate some example data (for illustration purposes)
# Columns 1-3 vary across samples (variance ~1.69); column 4 is constant (variance 0)
X = np.array([[1, 2, 3, 1],
              [1, 2, 3, 1],
              [1, 2, 3, 1],
              [4, 5, 6, 1]])
y = np.array([0, 1, 0, 1])

# Apply VarianceThreshold to remove features with variance below 0.1
selector = VarianceThreshold(threshold=0.1)
X_new = selector.fit_transform(X)

print("Original X shape:", X.shape)
print("New X shape after variance thresholding:", X_new.shape)
```

```
Original X shape: (4, 4)
New X shape after variance thresholding: (4, 3)
```


# Correlation Coefficient

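```python
import numpy as np
import pandas as pd

# Example: Remove highly correlated features
# Assumes `df` is an existing DataFrame of numeric feature columns
corr_matrix = df.corr().abs()
# Keep only the upper triangle so each pair of features is checked once
# (pd.np was removed from pandas, so NumPy is imported directly)
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]
df.drop(columns=to_drop, inplace=True)
```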

# Chi-Square Test
```python
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Example: Applying Chi-Square test for feature selection
# Note: chi2 requires non-negative feature values (e.g., counts or one-hot encoded categories)
X_new = SelectKBest(chi2, k=5).fit_transform(X, y)
```

# Mutual Information
```python
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# Example: Applying Mutual Information for feature selection
X_new = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)
```


# ANOVA
```python
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

# Example: Applying ANOVA F-test for feature selection
X_new = SelectKBest(f_classif, k=5).fit_transform(X, y)
```


# 2) Wrapper Methods:
### Recursive Feature Elimination (RFE)
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Example: Recursive Feature Elimination
model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=5)  # Select 5 features
X_new = rfe.fit_transform(X, y)
```


# Sequential Feature Selection (SFS)
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Example: Sequential Forward Selection
model = LinearRegression()
sfs = SequentialFeatureSelector(model, n_features_to_select=5)
sfs.fit(X, y)
X_new = sfs.transform(X)
```


# Exhaustive Feature Selection
```python
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.linear_model import LogisticRegression

# Example: Exhaustive Feature Selection
model = LogisticRegression()
efs = ExhaustiveFeatureSelector(model, min_features=1, max_features=5, scoring='accuracy')
efs.fit(X, y)
X_new = efs.transform(X)
```


# 3) Embedded Methods:
### Lasso Regression
```python 
from sklearn.linear_model import Lasso

# Example: Lasso for feature selection
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
X_new = X[:, lasso.coef_ != 0]  # Select features with non-zero coefficients
```

# Ridge Regression
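```python
import numpy as np
from sklearn.linear_model import Ridge

# Example: Ridge for feature ranking
# Ridge shrinks coefficients but almost never makes them exactly zero,
# so `coef_ != 0` would keep every feature. Rank by |coefficient| instead.
# Assumes X is scaled so that coefficient magnitudes are comparable.
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)
top_k = np.argsort(np.abs(ridge.coef_))[-5:]  # Indices of the 5 largest |coefficients|
X_new = X[:, top_k]
```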

# Elastic Net
```python
from sklearn.linear_model import ElasticNet

# Example: Elastic Net for feature selection
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
X_new = X[:, elastic_net.coef_ != 0]  # Select features with non-zero coefficients
```


# Random Forest Feature Importance
```python
from sklearn.ensemble import RandomForestClassifier

# Example: Using Random Forest for feature importance
model = RandomForestClassifier()
model.fit(X, y)
importances = model.feature_importances_

# Select top features
indices = importances.argsort()[-5:][::-1]  # Top 5 important features
X_new = X[:, indices]
```
