
🎯 Feature Selection in Machine Learning

🧩 1. Why Feature Selection Matters

Feature selection keeps the inputs that carry real signal about the target and drops the ones that don't.

✅ Benefits:

🚀 Can improve model accuracy (removes noisy features)

⚡ Reduces computation time

🧠 Makes the model more interpretable

🧯 Reduces the risk of overfitting (fewer irrelevant variables for the model to fit noise to)

🧠 2. Feature Selection Methods (for Regression)

We’ll go from simple → advanced.

🔹 A. Correlation Analysis (Quick & Visual)

Find which features are most correlated with your target variable.

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()
```

Interpretation:

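Features with a strong positive or negative correlation (close to ±1) with the target are good candidates to keep.

Features with correlation ≈ 0 have little linear relationship with the target. They can still matter through a nonlinear effect or an interaction, so treat correlation as a first filter rather than a final verdict.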

📘 Example:

| Feature         | Correlation | Importance  |
|-----------------|-------------|-------------|
| Annual Income   | +0.7        | ✅ Important |
| Gender          | 0.05        | ❌ Weak     |
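The heatmap shows every pairwise correlation, but for feature selection the column that matters most is the one for the target. A minimal sketch of pulling out just that column, sorted by strength; the `toy` DataFrame below is made-up data that only borrows this notebook's column names:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): spending depends on income and age, not on gender.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n),
    "Annual Income (k$)": rng.normal(60, 20, size=n),
    "Gender": rng.integers(0, 2, size=n),  # assumed already encoded as 0/1
})
toy["Spending Score (1-100)"] = (
    0.5 * toy["Annual Income (k$)"] - 0.4 * toy["Age"] + rng.normal(0, 5, size=n)
)

target = "Spending Score (1-100)"

# Correlation of each feature with the target, strongest (by magnitude) first.
target_corr = (
    toy.corr(numeric_only=True)[target]
    .drop(target)
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting with `key=abs` keeps strongly negative features near the top instead of pushing them to the bottom.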

🔹 B. Using Model Coefficients (Linear Regression)

Inspect feature coefficients after fitting a model.

```python
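from sklearn.linear_model import LinearRegression
import pandas as pd

X = df[['Age', 'Annual Income (k$)', 'Gender']].copy()
# LinearRegression needs numeric inputs. If Gender is stored as text
# (e.g. "Male"/"Female"), convert it to integer codes first.
if not pd.api.types.is_numeric_dtype(X['Gender']):
    X['Gender'] = X['Gender'].astype('category').cat.codes
y = df['Spending Score (1-100)']

model = LinearRegression()
model.fit(X, y)

# Sort by absolute value so strong negative effects rank next to strong positive ones
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', key=abs, ascending=False)

coef_df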
```

Interpretation:

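Larger absolute coefficient → stronger effect per unit change in that feature

Coefficient close to 0 → weaker influence

Coefficients depend on each feature's units, so compare them across features only after putting the features on a common scale (for example, by standardizing).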
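Raw coefficients are measured per unit of each feature, so a feature in thousands of dollars and a 0/1 flag are not directly comparable. One common way around this is to standardize the features before fitting. A minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical and only reuse this notebook's column names):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): the target depends on income and age, not on gender.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n).astype(float),
    "Annual Income (k$)": rng.normal(60, 20, size=n),
    "Gender": rng.integers(0, 2, size=n).astype(float),
})
y_toy = 0.5 * X_toy["Annual Income (k$)"] - 0.4 * X_toy["Age"] + rng.normal(0, 5, size=n)

# Rescale every feature to mean 0 and standard deviation 1 before fitting.
X_scaled = StandardScaler().fit_transform(X_toy)
model_std = LinearRegression().fit(X_scaled, y_toy)

# Each coefficient is now "change in target per one standard deviation of the feature".
std_coefs = pd.Series(model_std.coef_, index=X_toy.columns).sort_values(
    key=abs, ascending=False
)
print(std_coefs)
```

After scaling, the ranking by absolute coefficient reflects relative influence rather than the units each column happens to use.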

🔹 C. Recursive Feature Elimination (RFE)

RFE fits the model, drops the least important feature, and repeats until only the requested number of features remains.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X, y)

for i in range(len(X.columns)):
    print(f"{X.columns[i]}: Selected={rfe.support_[i]}, Rank={rfe.ranking_[i]}")
```

Interpretation:

✅ Selected = True → the feature was kept (kept features all have rank 1)

💡 Lower rank → more important; higher ranks were eliminated earlier
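Once RFE has run, `support_` can be used as a mask to build the reduced feature table. With a linear model, RFE ranks features by coefficient size, so this sketch standardizes first; the data, `X_toy`, and `y_toy` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): the target depends on income and age, not on gender.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n).astype(float),
    "Annual Income (k$)": rng.normal(60, 20, size=n),
    "Gender": rng.integers(0, 2, size=n).astype(float),
})
y_toy = 0.5 * X_toy["Annual Income (k$)"] - 0.4 * X_toy["Age"] + rng.normal(0, 5, size=n)

# Standardize so coefficient magnitudes (what RFE compares) are on one scale.
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_toy), columns=X_toy.columns
)

rfe_toy = RFE(LinearRegression(), n_features_to_select=2).fit(X_scaled, y_toy)

# support_ is a boolean mask over the columns; use it to keep the selected ones.
selected = X_toy.columns[rfe_toy.support_].tolist()
X_reduced = X_toy[selected]
print(selected)
```

`X_reduced` can then be passed to any downstream model in place of the full feature table.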

🔹 D. Using Feature Importance from Tree Models

Even if your final model is Linear Regression, a Random Forest can show which features carry predictive signal, including nonlinear effects.

```python
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import pandas as pd

rf = RandomForestRegressor(random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)

importances.plot(kind='bar', title='Feature Importance (Random Forest)')
plt.show()
```
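Importances become a selection rule once you pick a cutoff. One option in scikit-learn is `SelectFromModel`, which keeps features whose importance meets a threshold. A minimal sketch using the median as the cutoff; the threshold choice and the toy data are assumptions made for illustration, not a recommendation from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Toy data (hypothetical): the target depends on income and age, not on gender.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n).astype(float),
    "Annual Income (k$)": rng.normal(60, 20, size=n),
    "Gender": rng.integers(0, 2, size=n).astype(float),
})
y_toy = 0.5 * X_toy["Annual Income (k$)"] - 0.4 * X_toy["Age"] + rng.normal(0, 5, size=n)

# Keep features whose importance is at or above the median importance.
selector_rf = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42),
    threshold="median",
)
selector_rf.fit(X_toy, y_toy)

kept = X_toy.columns[selector_rf.get_support()].tolist()
importances_toy = pd.Series(
    selector_rf.estimator_.feature_importances_, index=X_toy.columns
)
print(importances_toy.sort_values(ascending=False))
print("Kept:", kept)
```

Random Forest importances sum to 1, so each value reads as that feature's share of the total.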

🔹 E. Statistical Tests (for Numeric Data)

Use f_regression to test, one feature at a time, whether each feature has a statistically significant linear relationship with the target.

```python
import pandas as pd
from sklearn.feature_selection import f_regression, SelectKBest

selector = SelectKBest(score_func=f_regression, k='all')
selector.fit(X, y)

feature_scores = pd.DataFrame({
    'Feature': X.columns,
    'F-Score': selector.scores_,
    'P-Value': selector.pvalues_
}).sort_values(by='F-Score', ascending=False)

feature_scores
```

✅ Interpretation:

High F-Score → more relevant

Low P-value (< 0.05) → statistically significant

🧭 3. How to Decide Practically

| Method                   | When to Use                  | Output                   |
|--------------------------|------------------------------|--------------------------|
| Correlation Heatmap      | First look                   | Visual relationships     |
| Model Coefficients       | Linear models                | Numeric importance       |
| RFE                      | Regression or classification | Rank & select            |
| Random Forest Importance | Any model                    | Nonlinear relationships  |
| F-Test                   | Numeric regression           | Statistical significance |

🧠 4. Example Decision

| Feature       | Correlation | RFE Rank | Keep?       |
|---------------|-------------|----------|-------------|
| Age           | -0.6        | 2        | ✅          |
| Annual Income | +0.8        | 1        | ✅          |
| Gender        | 0.03        | 3        | ❌          |

✅ Best features: Age, Annual Income

⚡ Pro Tip: Automate Feature Selection in a Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_regression, k=2)),
    ('model', LinearRegression())
])

pipeline.fit(X, y)
```

✅ This way, the pipeline re-selects the 2 highest-scoring features every time it is fit, so the selection is learned only from the training data you pass in.
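Because selection lives inside the pipeline, it is also redone inside each cross-validation fold, which keeps the held-out fold from influencing which features are chosen. A minimal sketch of scoring such a pipeline and checking what it picked; the toy data and the `pipe` name are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): the target depends on income and age, not on gender.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n).astype(float),
    "Annual Income (k$)": rng.normal(60, 20, size=n),
    "Gender": rng.integers(0, 2, size=n).astype(float),
})
y_toy = 0.5 * X_toy["Annual Income (k$)"] - 0.4 * X_toy["Age"] + rng.normal(0, 5, size=n)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),
])

# Selection and model fitting are both repeated on each fold's training split.
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="r2")
print("Mean R^2:", scores.mean())

# Fit on all the data, then ask the selection step which columns it kept.
pipe.fit(X_toy, y_toy)
chosen = X_toy.columns[pipe.named_steps["select"].get_support()].tolist()
print("Chosen:", chosen)
```

`named_steps["select"].get_support()` returns the same kind of boolean mask as `rfe.support_`, so it can be used to subset columns in the same way.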