<div style="text-align: center;">

# **Spring 2026 &mdash; CIS 3813<br>Advanced Data Science<br>(Introduction to Machine Learning)**
### Week 4: Multi-Feature Regression — Standard vs. Ridge vs. Lasso

</div>

---

## **Lab Instructions**

**Due Date**: Monday, 23 February @ 6:00 PM (with grace period until Wednesday, 25 February @ 11:59 PM)

In this lab, you will apply what you learned in lecture to new data: a synthetic dataset that mirrors the structure of the Ames Housing dataset (a more detailed housing dataset from Ames, Iowa). You will:

1. Load and explore the data
2. Prepare features and handle scaling
3. Fit Standard, Ridge, and Lasso regression models
4. Compare performance and interpret coefficients
5. Explore the effect of alpha on Lasso's feature selection


**AI Usage**: 
- You may use AI tools for this lab
- **REQUIRED**: Include AI attribution using the format shown in the syllabus
- For B/A level credit, include detailed attribution in markdown cells

## **Grading**

| Component | Points |
|-----------|--------|
| Exercise 1: Load and Explore the Data | 10 |
| Exercise 2: Prepare the Data | 15 |
| Exercise 3: Fit and Compare Three Models | 30 |
| Exercise 4: Alpha Exploration | 25 |
| Exercise 5: Reflection | 10 |
| In-Class Mastery Assessment (Week 5) | 10 |
| **Total** | **100** |


---

## **AI Assistance Declaration**

**Tools used:** [ChatGPT-4 / GitHub Copilot / Claude / None / Other — update this]

**Sections with AI help:** [e.g., "Part 2, Question 3" — update this]

**What I learned:** [Brief description — update this]

**What I did independently:** [Sections completed without AI — update this]

---

# **Configuring Our Environment**

In [None]:
# ============================================================
# SETUP — Run this first
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The two lines below suppress warnings that some plots may trigger. Comment them out to see the warnings.
import warnings
warnings.filterwarnings('ignore')

# plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


In [None]:
# ===== RUN THIS CELL TO GENERATE THE LAB DATASET =====
# This creates a realistic housing dataset for our lab

np.random.seed(42)
n = 1000

# Generate correlated housing features
sq_ft = np.random.normal(1800, 500, n).clip(600, 5000)
bedrooms = np.round(1 + sq_ft / 700 + np.random.normal(0, 0.5, n)).clip(1, 7).astype(int)
bathrooms = np.round(0.5 + bedrooms * 0.6 + np.random.normal(0, 0.3, n), 1).clip(1, 5)
lot_size = np.random.normal(8000, 3000, n).clip(2000, 30000)
year_built = np.random.randint(1950, 2024, n)
garage_cars = np.random.choice([0, 1, 2, 3], n, p=[0.05, 0.25, 0.55, 0.15])
overall_quality = np.round(3 + sq_ft / 1000 + np.random.normal(0, 1, n)).clip(1, 10).astype(int)
neighborhood_score = np.random.uniform(1, 10, n)  # 1-10 desirability

# Noisy/irrelevant features (these SHOULD get zeroed out by Lasso)
fence_type = np.random.randint(0, 5, n)           # Mostly irrelevant to price
month_sold = np.random.randint(1, 13, n)           # Weak seasonal effect
misc_feature_val = np.random.exponential(50, n)    # Random noise

# Generate target: Sale Price (with realistic relationships)
sale_price = (
    200 * sq_ft
    + 15000 * bathrooms
    + 7500 * bedrooms
    + 1 * lot_size
    - 1000 * (2025 - year_built)
    + 1000 * garage_cars
    + 15000 * overall_quality
    + 20000 * neighborhood_score
    + 100 * fence_type              # Very small effect
    + 25 * month_sold               # Negligible effect
    + 0 * misc_feature_val          # Zero effect (pure noise)
    + np.random.normal(0, 2500, n)  # Random noise
)

# Create DataFrame
ames_df = pd.DataFrame({
    'SqFt': sq_ft.round(0),
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'LotSize': lot_size.round(0),
    'YearBuilt': year_built,
    'GarageCars': garage_cars,
    'OverallQuality': overall_quality,
    'NeighborhoodScore': neighborhood_score.round(2),
    'FenceType': fence_type,
    'MonthSold': month_sold,
    'MiscFeatureVal': misc_feature_val.round(2),
    'SalePrice': sale_price.round(0)
})

print(f"Dataset created: {ames_df.shape[0]} houses, {ames_df.shape[1] - 1} features")

---

## **Exercise 1: Load and Explore the Data (10 points)**

### **Task 1a: Explore the dataset (5 points)**

Run `.describe()` on the dataset and write a brief Markdown observation about the **range/scale differences** between features. Which features have the largest range? Which have the smallest?

In [None]:
# YOUR CODE HERE — use .describe() and examine the data


*YOUR OBSERVATIONS HERE (double-click to edit):*

- **Largest ranges:** 
- **Smallest ranges:** 
- **Implication:** 

### **Task 1b: Visualize feature correlations with the target (5 points)**

Create a **correlation heatmap** or bar chart showing how each feature correlates with `SalePrice`. Which features appear most/least correlated?

In [None]:
# YOUR CODE HERE — compute and visualize correlations with SalePrice
# Hint: ames_df.corr()['SalePrice'].sort_values()


*MOST/LEAST Correlated Features*

- **Most Correlated:** 
- **Least Correlated:** 

---

## **Exercise 2: Prepare the Data (15 points)**

### **Task 2a: Split into features (X) and target (y), then train/test split (5 points)**

In [None]:
# YOUR CODE HERE
# 1. Separate features (X) and target (y = SalePrice)
# 2. Split into 80% train / 20% test with random_state=42
# 3. Print the resulting shapes of X_train, X_test, y_train, y_test to confirm they look correct



# UNCOMMENT THE LINES BELOW AND RUN THEM TO CONFIRM YOUR SPLIT LOOKS CORRECT
# print(f"X_train shape: {X_train.shape}")
# print(f"X_test shape:  {X_test.shape}")
# print(f"y_train shape: {y_train.shape}")
# print(f"y_test shape:  {y_test.shape}")

### **Task 2b: Scale the features using StandardScaler (10 points)**

**Important:** Fit the scaler on the training data only, then transform both train and test.

In a Markdown cell, explain in 1–2 sentences **why** we fit the scaler on training data only.
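
The fit-on-train, transform-on-test pattern can be seen on a tiny example before you apply it to the lab data. This is only a sketch: the `toy_` arrays and names are made up for illustration and are not part of the lab dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (NOT the lab dataset): one feature, six training rows, two test rows.
toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
toy_test = np.array([[2.0], [10.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training rows only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean and std

print("Mean learned from training rows:", toy_scaler.mean_[0])
print("Std learned from training rows: ", toy_scaler.scale_[0])
print("Training mean after scaling:", toy_train_scaled.mean())
print("Test mean after scaling:    ", toy_test_scaled.mean())
```

The scaled training column has mean 0 by construction, while the scaled test column does not, because the test rows played no part in computing the mean and standard deviation.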

In [None]:
# YOUR CODE HERE
# 1. Create a StandardScaler
# 2. fit_transform on training features
# 3. transform (NOT fit_transform!) on test features



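# UNCOMMENT THE LINES BELOW AND RUN THEM TO CONFIRM YOUR SCALING LOOKS CORRECT
# (pd.DataFrame(...) lets this check work whether your scaled data is a NumPy array or a DataFrame)
# (pandas reports the sample std, so the training std shows as roughly 1.0006 rather than exactly 1)
# print("Training data: mean and std after scaling:")
# print(pd.DataFrame(X_train_scaled, columns=X_train.columns).describe().round(4).loc[['mean', 'std']])
# print("\nTest data: mean and std (should be CLOSE to 0/1 but not exact):")
# print(pd.DataFrame(X_test_scaled, columns=X_test.columns).describe().round(4).loc[['mean', 'std']])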

*YOUR EXPLANATION of why we fit on training data only:*



---

## **Exercise 3: Fit and Compare Three Models (30 points)**

### **Task 3a: Fit Standard Linear Regression, Ridge, and Lasso (15 points)**

Fit the following three models on the **scaled** training data:
1. `LinearRegression()` — no regularization
2. `Ridge(alpha=1.0)` — L2 regularization
3. `Lasso(alpha=100.0, max_iter=10000)` — L1 regularization

For each model, compute and print:
- Training MSE
- Test MSE
- Test R² score
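
If the two metrics feel abstract, the sketch below computes them by hand on four made-up values and checks the result against scikit-learn. The `toy_` names and numbers are hypothetical and have nothing to do with the lab dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy values (NOT from the lab dataset).
toy_y_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: the average of the squared residuals.
toy_mse_manual = np.mean((toy_y_true - toy_y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
toy_ss_res = np.sum((toy_y_true - toy_y_pred) ** 2)
toy_ss_tot = np.sum((toy_y_true - toy_y_true.mean()) ** 2)
toy_r2_manual = 1 - toy_ss_res / toy_ss_tot

print(f"MSE manual: {toy_mse_manual:.3f} | sklearn: {mean_squared_error(toy_y_true, toy_y_pred):.3f}")
print(f"R^2 manual: {toy_r2_manual:.3f} | sklearn: {r2_score(toy_y_true, toy_y_pred):.3f}")
# MSE manual: 0.500 | sklearn: 0.500
# R^2 manual: 0.900 | sklearn: 0.900
```

Note that both scikit-learn functions take the true values first and the predictions second.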

In [None]:
# YOUR CODE HERE
# Fit all three models and print their performance metrics


### **Task 3b: Compare the coefficients (15 points)**

Create a DataFrame showing the coefficients from all three models side by side (similar to what we did in the lecture). Then create a **bar chart** comparing them.

In a Markdown cell, answer:
1. Which features did Lasso zero out?
2. Do the zeroed-out features make intuitive sense? Why or why not?
3. How do the Ridge coefficients compare to the Standard Linear Regression (OLS) coefficients?

In [None]:
# YOUR CODE HERE
# 1. Create a comparison DataFrame of coefficients


In [None]:
# YOUR CODE HERE
# 2. Create a bar chart visualization


*YOUR ANALYSIS HERE:*

1. **Features Lasso zeroed out:** 

2. **Does this make intuitive sense?** 

3. **How do Ridge coefficients compare to Standard Linear Regression (OLS)?** 

---

## **Exercise 4: Alpha Exploration (25 points)**

### **Task 4a: Lasso coefficient paths (15 points)**

Fit Lasso models with the following alphas: `[0.01, 0.1, 1, 10, 100, 1000, 10000, 50000, 100000]`

Create a **Lasso coefficient path plot** (like in the lecture) showing how each feature's coefficient changes as alpha increases. Use `plt.xscale('log')`.

In [None]:
# YOUR CODE HERE
# 1. Loop through alphas, fit Lasso, store coefficients
# 2. Plot coefficient paths

alphas = [0.01, 0.1, 1, 10, 100, 1_000, 10_000, 50_000, 100_000]


### **Task 4b: Find the best alpha (10 points)**

For each alpha above, compute the **Test MSE** and **Test R²** using **Lasso (L1) Regularization**. Which alpha gives the best test performance?

Create a plot of **Test MSE vs. Alpha** to visualize the optimal point.

Some Hints:
- You can reuse the test MSE and R² code from before, but now loop through all alphas to find the best one.
- The best alpha will be the one with the lowest Test MSE.
- Don't forget to set `max_iter=10000`.
- Use the Lecture Notes for examples of annotating plots.

*(Note: Next week we'll learn the proper way to do this with Cross-Validation! For now, using the test set is fine for practice.)*

In [None]:
# YOUR CODE HERE
# 1. Compute Test MSE for each alpha


In [None]:
# YOUR CODE HERE
# 2. Plot Test MSE vs Alpha
# 3. Annotate the best alpha


*YOUR CONCLUSION: Which alpha performed best and why?*


---

## **Exercise 5: Reflection Questions (10 points)**

Answer the following questions in Markdown. Each answer should be 2–4 sentences.

**Q1 (2 points):** You're building a model to predict patient hospital readmission risk using 200 medical features collected from electronic health records. Many features might be irrelevant. Would you use Ridge or Lasso? Explain your reasoning.

**Q2 (2 points):** A colleague says, *"I ran Lasso and it zeroed out 'Age' so I removed it permanently from all future analyses."* What's potentially problematic about this reasoning?

**Q3 (2 points):** Explain in your own words why feature scaling is required before applying Ridge or Lasso regression. What could go wrong if you skip scaling?

**Q4 (4 points):** Connect tonight's lesson to the biblical principle from Luke 14:28 (our opening verse). How does regularization relate to the idea of "counting the cost" in model building?

*YOUR ANSWERS:*

**Q1:** 

**Q2:** 

**Q3:** 

**Q4:** 



---

### **Submission Checklist:**

Before you submit your `.ipynb` file, please ensure you have completed the following:

- [ ] **Data Prep:** Features are correctly split into X and y, and an 80/20 train-test split has been performed with the random state set to 42.
- [ ] **Feature Scaling:** `StandardScaler` has been applied to the training data (fit/transform) and test data (transform only).
- [ ] **Model Implementation:** You have successfully fit a Standard Linear Regression, a Ridge model, and a Lasso model.
- [ ] **Coefficient Analysis:** You have identified which features Lasso "zeroed out" and explained why this happened.
- [ ] **Alpha Experiment:** You have fit Lasso at every value of $\alpha$ in the Exercise 4 list and observed the change in the number of non-zero coefficients.
- [ ] **Final Check:** All code cells have been executed in order, and all visualizations (correlation plot, coefficient bar chart, Lasso coefficient paths, Test MSE vs. alpha) are visible.

### **Submission Instructions**

1. Save this notebook
2. **Restart kernel and run all cells** (Kernel → Restart & Run All)
3. Verify all outputs appear correctly (especially visualizations)
4. Check that all written responses are complete
5. Submit the `.ipynb` file to Canvas before Monday, 23 February @ 6:00 PM
   - Grace period until Wednesday, 25 February @ 11:59 PM

**Remember:** This notebook submission is worth 90% of your Week 4 Lab grade. The remaining 10% comes from next week's in-class mastery assessment.

---

### **Mastery Assessment Preparation Tips**

* **The "Complexity Tax" Equation:** Be able to write the general objective function for regularized regression: $\text{Total Cost} = \text{RSS} + \text{Penalty}$.
* **L1 vs. L2 Differences:** 
    * Which one uses the absolute value of weights ($|w|$)? (Lasso/L1)
    * Which one uses the square of weights ($w^2$)? (Ridge/L2)
    * Which one is capable of setting coefficients to **exactly zero**? (Lasso/L1)
* **The Geometry of Regularization:** Be able to identify the **Diamond** shape (Lasso) vs. the **Circle** shape (Ridge) and explain why the "corners" of the diamond lead to sparsity.
* **Why Scale?** Be prepared to explain in one sentence why regularization penalizes coefficients unevenly if features (like "number of rooms" vs. "square footage") are not put on the same scale first.
* **The Effect of Alpha ($\alpha$):** If we increase $\alpha$ to a very large number, what happens to the size of our coefficients? (They shrink toward zero.)
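
To see the L1 vs. L2 difference and the effect of $\alpha$ in one place, here is a minimal sketch on made-up data. Everything prefixed with `toy_` is hypothetical: three standardized features with a strong, a weak, and no true effect. One caveat: scikit-learn's `Lasso` divides the RSS by $2n$ while `Ridge` does not, so the same `alpha` is not the same penalty strength in the two models. Compare the pattern of each model's coefficients rather than their exact sizes.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (NOT the lab dataset): strong, weak, and zero true effects.
rng = np.random.default_rng(0)
toy_X = StandardScaler().fit_transform(rng.standard_normal((200, 3)))
toy_true_weights = np.array([5.0, 0.3, 0.0])
toy_y = toy_X @ toy_true_weights + rng.standard_normal(200)

toy_lasso_coefs = {}
toy_ridge_coefs = {}
for toy_alpha in [0.01, 0.5, 100.0]:
    # L1 penalty: can push coefficients to exactly zero.
    toy_lasso_coefs[toy_alpha] = Lasso(alpha=toy_alpha, max_iter=10000).fit(toy_X, toy_y).coef_
    # L2 penalty: shrinks coefficients but leaves them non-zero.
    toy_ridge_coefs[toy_alpha] = Ridge(alpha=toy_alpha).fit(toy_X, toy_y).coef_
    print(f"alpha={toy_alpha:>6}: "
          f"Lasso {np.round(toy_lasso_coefs[toy_alpha], 3)} "
          f"({np.count_nonzero(toy_lasso_coefs[toy_alpha])} non-zero) | "
          f"Ridge {np.round(toy_ridge_coefs[toy_alpha], 3)}")
```

As `alpha` grows, Lasso drops the noise feature first and eventually zeroes every coefficient, while the Ridge coefficients keep shrinking without ever landing exactly on zero.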