# Module 5: Advanced Regression Techniques
In this module, we explore regularized regression methods, which improve model generalization, and logistic regression, which handles binary classification problems.

## 🎯 Learning Objectives
- Understand why and when to use Ridge and Lasso regression
- Apply logistic regression for binary classification
- Evaluate classification models with a confusion matrix and an ROC curve

## 🧮 Ridge and Lasso Regression
These are regularized versions of linear regression that add a penalty to the model coefficients to reduce overfitting:

- **Ridge Regression** adds an L2 penalty (squared magnitude of coefficients)
- **Lasso Regression** adds an L1 penalty (absolute value of coefficients)

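These methods help when predictors are highly correlated (multicollinearity) or when you want to shrink the coefficients of irrelevant features. Ridge shrinks all coefficients toward zero but rarely makes any of them exactly zero, while Lasso can set some coefficients to exactly zero, which acts as a form of feature selection.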
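To make the difference between the two penalties concrete, the sketch below fits both models to synthetic data in which two of the five true coefficients are zero. The data, the `alpha=0.5` setting, and the names `X_demo`, `y_demo`, and `true_coefs` are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative synthetic data: features 1 and 3 have no effect on the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([5.0, 0.0, 3.0, 0.0, 2.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=1.0, size=200)

# alpha=0.5 is an arbitrary penalty strength chosen for demonstration
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("True :", true_coefs)
print("Ridge:", np.round(ridge_demo.coef_, 2))
print("Lasso:", np.round(lasso_demo.coef_, 2))

# Count the coefficients that each model set to exactly zero
print("Ridge zeros:", int(np.sum(ridge_demo.coef_ == 0)))
print("Lasso zeros:", int(np.sum(lasso_demo.coef_ == 0)))
```

With a penalty of this strength, Lasso would typically remove the two irrelevant features entirely, while Ridge keeps small nonzero values for them. The exact numbers depend on the data and on `alpha`.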

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

# Generate synthetic data
np.random.seed(42)
X = np.random.randn(100, 5)
coefs = np.array([5, 0, 3, 0, 2])
y = X @ coefs + np.random.normal(0, 1, 100)

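# Fix random_state so the split is reproducible regardless of earlier random calls
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)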

# Fit Ridge model
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
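# Take the square root of the MSE; the `squared=False` argument has been removed from recent scikit-learn releases
print(f"Ridge RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_ridge)):.2f}")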

In [None]:
# Fit Lasso model
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
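# Take the square root of the MSE; the `squared=False` argument has been removed from recent scikit-learn releases
print(f"Lasso RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_lasso)):.2f}")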

## 🤖 Logistic Regression
Logistic regression is used when the target variable is binary (e.g., yes/no, success/failure).

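Instead of predicting a continuous value, it models the **probability** of class membership by passing a linear combination of the features through the logistic (sigmoid) function, which maps any real number to the interval (0, 1).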
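The link between the linear score and the predicted probability can be checked directly. The following sketch assumes a small synthetic dataset from `make_classification` and a hypothetical helper named `sigmoid`. It applies the sigmoid to the model's linear score (`decision_function`) and compares the result with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative binary classification data
X_clf, y_clf = make_classification(n_samples=200, n_features=4, random_state=0)
clf_demo = LogisticRegression().fit(X_clf, y_clf)

# Linear score for each sample: intercept + X @ coefficients
z = clf_demo.decision_function(X_clf)

# Applying the sigmoid by hand reproduces the positive-class probability
manual_probs = sigmoid(z)
model_probs = clf_demo.predict_proba(X_clf)[:, 1]
print(np.allclose(manual_probs, model_probs))

# A linear score of 0 corresponds to a probability of 0.5
print(sigmoid(0.0))
```

By default, `predict` assigns the positive class when this probability exceeds 0.5, which is the same as the linear score being positive.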

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create classification data
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
# Fix random_state so the split (and the reported accuracy) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit logistic model
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.2f}")

## 📊 Evaluating Classification Models
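- **Confusion Matrix** shows the counts of true positives, false positives, true negatives, and false negatives
- **ROC Curve** plots the true positive rate against the false positive rate as the classification threshold varies
- **AUC** is the area under the ROC curve (closer to 1 is better; 0.5 corresponds to random guessing)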
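As a small worked example of these definitions, the sketch below uses eight hand-made labels and scores (invented purely for illustration) to compute the confusion matrix at a 0.5 threshold, derive the true and false positive rates from it, and compute the AUC.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hand-made labels and predicted scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])

# Turn scores into class predictions at a single threshold
threshold = 0.5
y_hat = (scores >= threshold).astype(int)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")        # TPR=0.75, FPR=0.25

# AUC summarizes the ROC curve over all thresholds
print(f"AUC={roc_auc_score(y_true, scores):.4f}")  # AUC=0.6875
```

Each threshold yields one (FPR, TPR) point on the ROC curve; here the 0.5 threshold gives the point (0.25, 0.75). The AUC of 0.6875 equals the fraction of positive/negative pairs (11 of 16) in which the positive example receives the higher score.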

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
import matplotlib.pyplot as plt

# Predict probabilities
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)

# Plot ROC Curve
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

In [None]:
# Confusion Matrix
preds = clf.predict(X_test)
cm = confusion_matrix(y_test, preds)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix')
plt.show()

## ✅ Practice Exercises
1. Simulate multicollinear data and compare Ridge vs Lasso performance.
2. Use logistic regression to classify iris flower species (setosa vs non-setosa).
3. Plot an ROC curve and calculate AUC.
4. Reflect: What are the trade-offs of using regularization?
5. Try adjusting `alpha` for Ridge and Lasso. How do the coefficients and RMSE change?