<a href="https://colab.research.google.com/github/harunpirim/IME775/blob/main/week-09/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>---

# Week 9: Feature Engineering and Selection
**IME775: Data Driven Modeling and Optimization**
📖 **Reference**: Watt, Borhani, & Katsaggelos (2020). *Machine Learning Refined* (2nd ed.), **Chapter 9**
---
## Learning Objectives
- Understand the importance of feature engineering
- Apply feature scaling techniques
- Handle missing values appropriately
- Implement feature selection via boosting and regularization


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

## Introduction (Section 9.1)
**Feature Engineering**: The process of creating, transforming, and selecting features to improve model performance.
> "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." (Andrew Ng)


## Histogram Features (Section 9.2)
For categorical or binned data:
### One-Hot Encoding
Convert a categorical variable into binary indicator vectors, one position per category:
- Category A → [1, 0, 0]
- Category B → [0, 1, 0]
- Category C → [0, 0, 1]
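
A minimal sketch of the mapping above in plain NumPy. The labels and the variable names (`labels`, `categories`, `one_hot`) are illustrative assumptions, not taken from the reference text.

```python
import numpy as np

# Hypothetical category labels for five samples
labels = np.array(['A', 'C', 'B', 'A', 'C'])

# Sorted unique categories and, for each sample, the index of its category
categories, indices = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[indices]

print(categories)
print(one_hot)
```

Each row of `one_hot` contains a single 1, placed in the column of that sample's category.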
### Binning Continuous Features
Convert a continuous feature into discrete, non-overlapping bins:
- Age 0-18 → "child"
- Age 19-64 → "adult"
- Age 65+ → "senior"
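
The binning above could be sketched with `np.digitize`. The sample ages are made up, and the code assumes that age 65 itself belongs to the "senior" bin.

```python
import numpy as np

ages = np.array([5, 18, 19, 40, 64, 65, 80])   # hypothetical ages
edges = np.array([19, 65])                     # assumed boundaries: 65 counts as "senior"
names = np.array(['child', 'adult', 'senior'])

# np.digitize returns 0 for age < 19, 1 for 19 <= age < 65, 2 for age >= 65
bin_index = np.digitize(ages, edges)
binned = names[bin_index]

print(binned)
```

The resulting labels can then be one-hot encoded like any other categorical feature.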


## Feature Scaling via Standard Normalization (Section 9.3)
### Why Scale Features?
- Many algorithms are sensitive to feature scales
- Gradient descent converges faster when features share a similar scale
- Regularization penalizes all weights on a comparable footing
### Standard Normalization (Z-score)
$$\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$$
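Here $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$. Result: each feature has zero mean and unit variance.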


In [None]:
# Feature scaling example
np.random.seed(42)
n = 100
# Features with very different scales
X_raw = np.column_stack([
    np.random.randn(n) * 100 + 500,  # Feature 1: large scale
    np.random.randn(n) * 0.1 + 2      # Feature 2: small scale
])
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Before scaling
ax1 = axes[0]
ax1.scatter(X_raw[:, 0], X_raw[:, 1], alpha=0.7)
ax1.set_xlabel('Feature 1 (large scale)')
ax1.set_ylabel('Feature 2 (small scale)')
ax1.set_title('Before Scaling')
ax1.grid(True, alpha=0.3)
# After scaling
ax2 = axes[1]
ax2.scatter(X_scaled[:, 0], X_scaled[:, 1], alpha=0.7)
ax2.set_xlabel('Feature 1 (standardized)')
ax2.set_ylabel('Feature 2 (standardized)')
ax2.set_title('After Standard Normalization')
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-4, 4)
ax2.set_ylim(-4, 4)
plt.tight_layout()
fig

## Imputing Missing Values (Section 9.4)
### Strategies
| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **Mean** | Replace with feature mean | Numerical, few missing |
| **Median** | Replace with median | Numerical, outliers present |
| **Mode** | Replace with most frequent | Categorical |
| **Model-based** | Predict missing values | Complex patterns |
### Implementation
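```python
import numpy as np
from sklearn.impute import SimpleImputer
# Toy data (illustrative): np.nan marks entries to replace by the column mean
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
```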


## Feature Scaling via PCA-Sphering (Section 9.5)
### Whitening/Sphering
Transform data to have:
- Zero mean
- Identity covariance matrix
$$\tilde{X} = (X - \mu) \Sigma^{-1/2}$$
### Effect
- Removes correlations between features
- Equalizes variances in all directions
- Can help algorithms that are sensitive to feature scale and correlation, such as gradient descent
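
A minimal sketch of the transform above, computing $\Sigma^{-1/2}$ from an eigendecomposition of the sample covariance. The synthetic data and all variable names are assumptions for illustration, and the code assumes every eigenvalue is strictly positive. This symmetric form follows the formula as written; a version that first rotates onto the eigenvectors and then rescales differs from it only by a rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data: rows are samples, columns are features
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.standard_normal((500, 2)) @ A.T + np.array([10.0, -3.0])

# Center, then eigendecompose the sample covariance matrix
X_centered = X - X.mean(axis=0)
Sigma = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sigma^{-1/2} = V diag(1/sqrt(d)) V^T  (assumes all eigenvalues are > 0)
Sigma_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
X_sphered = X_centered @ Sigma_inv_sqrt

# The covariance of the sphered data should be the identity up to round-off
print(np.round(np.cov(X_sphered, rowvar=False), 6))
```

If some eigenvalues are zero or very small, a small constant is commonly added before taking the inverse square root.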


## Feature Selection via Boosting (Section 9.6)
### Forward Selection
Iteratively add the most useful feature:
```
1. Start with empty feature set S
2. Repeat:
   a. For each feature j not in S:
      - Evaluate model with S ∪ {j}
   b. Add feature with best performance to S
3. Until: stopping criterion met
```
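
The steps above can be sketched as follows. This is one possible reading, not the book's implementation: it assumes a least-squares model, validation mean squared error as the performance measure, and a fixed feature budget as the stopping criterion. The helper names `validation_mse` and `forward_selection` and the synthetic data are hypothetical.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, features):
    """Fit least squares (with bias) on the chosen columns; return validation MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, features]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, features]])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return np.mean((A_va @ w - y_va) ** 2)

def forward_selection(X_tr, y_tr, X_va, y_va, max_features):
    selected = []                                   # step 1: empty feature set S
    remaining = list(range(X_tr.shape[1]))
    # step 3: stop once the (assumed) feature budget is reached
    while remaining and len(selected) < max_features:
        # step 2a: evaluate the model with S ∪ {j} for every candidate j
        scores = {j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining}
        best = min(scores, key=scores.get)          # step 2b: best-performing feature
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (hypothetical): only features 0 and 3 influence y
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)

chosen = forward_selection(X[:150], y[:150], X[150:], y[150:], max_features=2)
print(chosen)
```

Scoring on held-out data rather than training data matters here, since training error never increases when a feature is added.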
### Backward Elimination
Start with all features and iteratively remove the least useful one.
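
As a hedged illustration, scikit-learn's `SequentialFeatureSelector` with `direction='backward'` performs this kind of elimination using cross-validated scores. The sketch assumes scikit-learn 0.24 or later, a linear regression model, and synthetic data in which only two features matter; none of these choices come from the reference text.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only features 0 and 3 influence y
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)

# Start from all 6 features and drop one at a time until 2 remain
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction='backward',
    cv=5,
)
selector.fit(X, y)

# Indices of the features that survived elimination
kept = np.flatnonzero(selector.get_support())
print(kept)
```
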
### Advantages
- Wrapper method: Uses actual model performance
- Can capture feature interactions


## Feature Selection via Regularization (Section 9.7)
### L1 Regularization (Lasso)
$$\min_w \frac{1}{P}\|y - Xw\|^2 + \lambda \|w\|_1$$
The L1 penalty encourages **sparsity**: many coefficients become exactly zero.
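
One way to see where the exact zeros come from is the one-dimensional problem $\min_w \tfrac{1}{2}(w - z)^2 + \lambda |w|$, whose solution is the soft-thresholding operator $\operatorname{sign}(z)\max(|z| - \lambda, 0)$. This is a standard simplified case rather than the derivation in the reference, and the values below are made up for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w| over w."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unpenalized weights
z = np.array([3.0, -2.0, 0.4, -0.1, 0.0])
w = soft_threshold(z, lam=0.5)

print(w)
print("non-zero weights:", np.count_nonzero(w))
```

Any weight whose magnitude is at most $\lambda$ is set to exactly zero, while larger weights are shrunk toward zero by $\lambda$.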
### Why Sparsity?
- Automatic feature selection
- Interpretability
- Reduced overfitting


In [None]:
# Lasso for feature selection
np.random.seed(42)
n, p = 100, 20
# Only 5 features are truly relevant
X = np.random.randn(n, p)
true_coef = np.zeros(p)
true_coef[:5] = [3, -2, 1.5, -1, 0.5]
y = X @ true_coef + 0.5 * np.random.randn(n)
# Fit Lasso with different alpha values
alphas = [0.001, 0.01, 0.1, 0.5]
fig2, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, alpha in zip(axes.flat, alphas):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    colors = ['green' if i < 5 else 'gray' for i in range(p)]
    ax.bar(range(p), lasso.coef_, color=colors, alpha=0.7)
    ax.axhline(0, color='black', linewidth=0.5)
    ax.set_xlabel('Feature Index')
    ax.set_ylabel('Coefficient')
    n_nonzero = np.sum(lasso.coef_ != 0)
    ax.set_title(f'α = {alpha}, Non-zero: {n_nonzero}')
fig2.suptitle('Lasso Feature Selection (ML Refined, Section 9.7)', fontsize=14)
plt.tight_layout()
fig2

## Summary
| Technique | Purpose | Key Idea |
|-----------|---------|----------|
| **Scaling** | Normalize features | Same scale for all |
| **Imputation** | Handle missing data | Replace with statistics |
| **Boosting** | Feature selection | Greedy search |
| **Regularization** | Feature selection | L1 sparsity |
---
## References
- **Primary**: Watt, J., Borhani, R., & Katsaggelos, A. K. (2020). *Machine Learning Refined* (2nd ed.), Chapter 9.
- **Supplementary**: Hastie, T. et al. (2009). *The Elements of Statistical Learning*, Chapter 3.
## Next Week
**Principles of Nonlinear Feature Engineering** (Chapter 10): Beyond linear models.
