# HW1 - Regression Fundamentals

See Canvas for details on how to complete and submit this assignment.

## Introduction

This assignment covers the foundational concepts of statistical learning and regression modeling from the first half of our regression unit (lectures 01 through 05a). You'll demonstrate understanding of key principles like the bias-variance tradeoff, then apply sklearn to build and interpret increasingly complex regression models.

The assignment progresses from conceptual understanding to hands-on implementation, mirroring how we approach problems in practice: understand the theory, then apply it systematically.

### Learning Objectives

By completing this assignment, you will demonstrate the ability to:

- Articulate the No Free Lunch theorem and its implications for model selection
- Explain the bias-variance tradeoff and its effect on training vs test error
- Interpret regression coefficients, including dummy variables and interaction terms
- Implement the sklearn workflow: prepare data, fit models, evaluate performance
- Build progressively complex regression models (SLR → MLR → with categoricals → with interactions)
- Make informed decisions about model complexity based on performance metrics

### Time Estimate

This assignment should take 4-6 hours to complete.

### Generative AI Allowance

You may use GenAI tools for brainstorming, explanations, and code sketches, provided you disclose that use, understand the output, and validate it. Your submission must represent your own work, and you are solely responsible for its correctness.

### Scoring

| Section | Points |
|---------|-------:|
| Reading: No Free Lunch | 15 |
| Conceptual: Bias-Variance Tradeoff | 15 |
| Conceptual: Interpreting Regression | 15 |
| Applied: Carseats Model Progression | 45 |
| Reflection | 10 |
| **Total** | **100** |

## Reading: No Free Lunch Theorem

### Background

The No Free Lunch (NFL) theorems, introduced by Wolpert and Macready in 1997, provide a fundamental insight about optimization and machine learning: averaged over all possible problems, no algorithm outperforms any other, so no single algorithm is best for every problem.

Read Wolpert and Macready's "No Free Lunch Theorems for Optimization" (available on Canvas). Pay particular attention to:

- **Introduction and Conclusion** - motivation and basic concept
- **Section III (The NFL Theorems) and III.A (Implications)** - core theorem and implications
- **Section IV (A Geometric Perspective)** - builds intuition

You can skip or skim the technical sections (II, III.B, V, VI, VII) as they deal with mathematical proofs and specialized topics not central to our discussion.

### Reflection (15 pts)

Write a brief reflection (2-3 concise paragraphs rich with information and personal insight) discussing how NFL informs our understanding of model evaluation. Consider:

- Why training performance alone isn't sufficient for model selection
- The role of domain knowledge in algorithm selection
- The connection to our discussions of training vs test performance

If you used AI tools to help understand the paper, briefly describe what you used, how you used it, and how that worked out.

---

##### NFL Reflection

*Write your reflection here (2-3 paragraphs)...*

---

## Conceptual Problems

### Bias-Variance Tradeoff (15 pts)

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$$

Frame your answers in terms of the bias-variance tradeoff. Explain how this tradeoff affects each model's performance on both training and test data. Clearly state any assumptions that underpin your responses.

**(a)** Suppose that the true relationship between X and Y is linear, i.e., $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

---

##### Answer (a)

*Your answer here...*

---

**(b)** Answer (a) using test rather than training RSS.

---

##### Answer (b)

*Your answer here...*

---

**(c)** Suppose that the true relationship between X and Y is not linear, but we don't know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

---

##### Answer (c)

*Your answer here...*

---

**(d)** Answer (c) using test rather than training RSS.

---

##### Answer (d)

*Your answer here...*

---

**(e)** The expected test Mean Squared Error (MSE) can be decomposed into three components:

$$\text{Expected Test MSE} = \text{Variance} + \text{Bias}^2 + \text{Irreducible Error}$$

When the true relationship between X and Y is linear, explain how each component (Variance and Squared Bias) would change as we increase the polynomial degree from linear to cubic. Which component dominates the increase in test MSE as model complexity increases?
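
For reference, the same decomposition can be written more formally at a single test point. This is a sketch in the notation of the ISL textbook; the symbols $x_0$ (a fixed test input), $y_0$ (its response), and $\hat{f}$ (the fitted model) are introduced here for illustration and do not appear elsewhere in this assignment. The expectation averages over repeated training sets and over the noise in $y_0$.

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \text{Var}\left(\hat{f}(x_0)\right) + \left[\text{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \text{Var}(\epsilon)$$

The last term, $\text{Var}(\epsilon)$, is the irreducible error and does not depend on the model chosen.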

---

##### Answer (e)

*Your answer here...*

---

### Interpreting Regression Models (15 pts)

Suppose we have a data set with five predictors:
- $X_1$ = GPA
- $X_2$ = IQ
- $X_3$ = Level (1 for College, 0 for High School)
- $X_4$ = GPA × IQ
- $X_5$ = GPA × Level

The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get:

$$\hat{\beta}_0 = 50, \quad \hat{\beta}_1 = 20, \quad \hat{\beta}_2 = 0.07, \quad \hat{\beta}_3 = 35, \quad \hat{\beta}_4 = 0.01, \quad \hat{\beta}_5 = -10$$

**Hint:** Write out the complete regression equation for both high school and college graduates, and compare them to determine under what conditions (if any) one group earns more than the other.
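
As a starting point for the hint above, substituting the estimated coefficients into the general model gives the fitted equation below, with $\hat{y}$ in thousands of dollars. The interaction predictors are written out as products, since $X_4 = X_1 X_2$ and $X_5 = X_1 X_3$.

$$\hat{y} = 50 + 20\,X_1 + 0.07\,X_2 + 35\,X_3 + 0.01\,X_1 X_2 - 10\,X_1 X_3$$

Setting $X_3 = 0$ or $X_3 = 1$ in this equation yields the separate high school and college equations.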

**(a)** Which answer is correct, and why?

1. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.
2. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.
3. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.
4. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

---

##### Answer (a)

*Your answer here...*

---

**(b)** Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

---

##### Answer (b)

*Your answer here (show your calculation)...*

---

**(c)** True or False: Since the coefficient for the GPA/IQ interaction term ($\hat{\beta}_4 = 0.01$) is very small, there is little evidence of an interaction effect. Justify your answer. In your justification, identify any factors that determine whether an interaction effect is meaningful.

---

##### Answer (c)

*Your answer here...*

---

## Applied: Carseats Model Progression

In this section, you'll build a sequence of increasingly complex regression models to predict car seat sales. This progression mirrors real-world model development: start simple, add complexity systematically, and evaluate whether the added complexity improves performance.

**Dataset:** Carseats from *An Introduction to Statistical Learning* (ISL)

The dataset contains sales data for child car seats at 400 stores. Features include price, advertising budget, population, shelf location quality, and other store/market characteristics.

More information: [ISL Carseats Documentation](https://intro-stat-learning.github.io/ISLP/datasets/Carseats.html)

### Setup

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

### Load and Explore (5 pts)

Load the data and perform initial exploration to understand its structure and contents.

In [None]:
# Load the data
url = "https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/data/Carseats.csv"
carseats = pd.read_csv(url)

In [None]:
# Explore the data: shape, types, summary statistics, missing values
# Add cells as needed


---

##### Data Summary

Summarize your key findings from the exploration. What are the predictors? Which are quantitative vs qualitative? What is the response variable?

*Your summary here...*

---

### Model Development (25 pts)

Build the sequence of models described below, using `Sales` as the response variable. For each model:
1. Prepare the features (X) and target (y)
2. Fit the model
3. Report the training R²
4. Briefly interpret the results
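
The four steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the Carseats data: the frame `toy` and its columns `x1`, `x2`, and `y` are made up for the example, and the same pattern should carry over to your own feature choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (NOT Carseats): y depends linearly on x1 plus noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
toy["y"] = 3.0 - 2.0 * toy["x1"] + rng.normal(scale=0.5, size=50)

# 1. Prepare the features (a 2-D DataFrame) and the target (a 1-D Series)
X = toy[["x1"]]
y = toy["y"]

# 2. Fit the model
model = LinearRegression().fit(X, y)

# 3. Report the training R^2 (model.score(X, y) returns the same value)
train_r2 = r2_score(y, model.predict(X))
print(f"intercept={model.intercept_:.3f}, slope={model.coef_[0]:.3f}, R2={train_r2:.3f}")
```

Note the double brackets in `toy[["x1"]]`: sklearn expects a 2-D feature array even when there is only one predictor.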

#### Model 1: Simple Linear Regression (5 pts)

Fit a simple linear regression using what you consider the most important numerical predictor based on your exploration.

In [None]:
# Model 1: SLR with best numerical predictor
# Your code here...


---

##### Model 1 Interpretation

Which predictor did you choose and why? What does the R² tell you?

*Your interpretation here...*

---

#### Model 2: MLR with Numerical Predictors (5 pts)

Fit a multiple linear regression using all numerical predictors: `CompPrice`, `Income`, `Advertising`, `Population`, `Price`, `Age`, `Education`.

In [None]:
# Model 2: MLR with all numerical predictors
# Your code here...


---

##### Model 2 Interpretation

How does R² change compared to Model 1? Which coefficients are largest in magnitude?

*Your interpretation here...*

---

#### Model 3: MLR with All Predictors (5 pts)

Fit a multiple linear regression with all predictors, including the categorical variables (`ShelveLoc`, `Urban`, `US`). You'll need to create dummy variables.

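**Hint:** Use `pd.get_dummies()` with `drop_first=True`. Dropping one level per categorical variable makes it the baseline and avoids perfect multicollinearity with the intercept (the dummy variable trap).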
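
To see what the dummy encoding produces, here is a small sketch on a hypothetical toy frame (the columns `price` and `shelf` are invented for illustration and are not the Carseats columns). Passing `dtype=int` is a choice made here so the dummies print as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy frame with one three-level categorical column
toy = pd.DataFrame({
    "price": [100, 120, 90, 110],
    "shelf": ["Bad", "Good", "Medium", "Bad"],
})

# drop_first=True drops the first level in sorted order ("Bad"),
# which then acts as the baseline category
X = pd.get_dummies(toy, columns=["shelf"], drop_first=True, dtype=int)
print(X.columns.tolist())  # ['price', 'shelf_Good', 'shelf_Medium']
```

Each remaining dummy coefficient is then interpreted relative to the dropped baseline level.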

In [None]:
# Model 3: MLR with all predictors (including dummies)
# Your code here...


---

##### Model 3 Interpretation

How does adding categorical predictors affect R²? What do the ShelveLoc coefficients tell you about shelf location's effect on sales?

*Your interpretation here...*

---

#### Model 4: MLR with Polynomial Terms (5 pts)

Starting from Model 3, add quadratic terms ($X^2$) for `Price`, `CompPrice`, and `Income`.

**Hint:** You can add polynomial terms manually, where `data` stands for your feature DataFrame:
```python
data['Price_sq'] = data['Price'] ** 2
```

In [None]:
# Model 4: MLR with polynomial terms
# Your code here...


---

##### Model 4 Interpretation

Does adding polynomial terms improve R²? What might this suggest about the relationship between these predictors and sales?

*Your interpretation here...*

---

#### Model 5: MLR with Interaction Terms (5 pts)

Starting from Model 3, add the following interaction terms:
- Price × ShelveLoc (you'll need interactions with each dummy)
- Price × US
- Advertising × Income

**Hint:** For interactions with dummy variables, multiply the numerical variable by each dummy column.
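
The hint above can be sketched on a hypothetical toy frame. The column names `price`, `ads`, `income`, and `shelf`, and the `_x_` naming pattern for interaction columns, are assumptions made for this illustration only.

```python
import pandas as pd

# Hypothetical toy frame (not the Carseats data)
toy = pd.DataFrame({
    "price": [100.0, 120.0, 90.0],
    "ads": [5.0, 0.0, 10.0],
    "income": [60.0, 80.0, 40.0],
    "shelf": ["Bad", "Good", "Medium"],
})
X = pd.get_dummies(toy, columns=["shelf"], drop_first=True, dtype=int)

# Numeric x categorical: one interaction column per dummy column
dummy_cols = [c for c in X.columns if c.startswith("shelf_")]
for col in dummy_cols:
    X[f"price_x_{col}"] = X["price"] * X[col]

# Numeric x numeric: a single product column
X["ads_x_income"] = X["ads"] * X["income"]
print(X.columns.tolist())
```

A numeric-by-dummy product is zero for every row outside that dummy's category, so its coefficient shifts the numeric slope only within that category.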

In [None]:
# Model 5: MLR with interaction terms
# Your code here...


---

##### Model 5 Interpretation

Does adding interactions improve R²? What does the Price × ShelveLoc interaction suggest about how shelf location moderates the effect of price?

*Your interpretation here...*

---

### Model Comparison (10 pts)

Create a summary comparing all five models.
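
One possible way to assemble such a summary is to collect one row per model and build a DataFrame from the rows. The sketch below uses synthetic data and two made-up model specifications (`specs`); the column names in the resulting table are illustrative, not required.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT Carseats)
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
toy["y"] = 1.0 + 2.0 * toy["a"] - toy["b"] + rng.normal(scale=0.5, size=60)

# Hypothetical model specifications: name -> list of feature columns
specs = {"one predictor": ["a"], "all predictors": ["a", "b", "c"]}

rows = []
for name, cols in specs.items():
    fit = LinearRegression().fit(toy[cols], toy["y"])
    rows.append({
        "model": name,
        "n_predictors": len(cols),
        "train_r2": fit.score(toy[cols], toy["y"]),  # training R^2
    })

summary = pd.DataFrame(rows)
print(summary.round(3))
```

When one least-squares model's predictors are a subset of another's, the larger model's training R² cannot be lower, which is worth keeping in mind when reading the table.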

In [None]:
# Create a summary table of all models
# Include: model name, number of predictors, training R²


---

##### Overall Interpretation

Based on your results:

1. Which model would you recommend and why? Consider both performance and complexity.

2. What predictors appear most important for predicting car seat sales?

3. Do the interaction or polynomial terms provide meaningful improvement? How would you determine if this improvement is "worth it"?

*Your analysis here...*

---

### Predictions (5 pts)

Using your recommended model, predict sales for the following scenarios:

In [None]:
# Prediction scenarios
new_stores = pd.DataFrame([
    {'CompPrice': 125, 'Income': 80, 'Advertising': 15,
     'Population': 200, 'Price': 85, 'Age': 40, 'Education': 14,
     'ShelveLoc': 'Good', 'Urban': 'Yes', 'US': 'Yes'},
    {'CompPrice': 110, 'Income': 60, 'Advertising': 2,
     'Population': 300, 'Price': 150, 'Age': 65, 'Education': 12,
     'ShelveLoc': 'Bad', 'Urban': 'No', 'US': 'No'},
    {'CompPrice': 140, 'Income': 100, 'Advertising': 10,
     'Population': 250, 'Price': 100, 'Age': 35, 'Education': 16,
     'ShelveLoc': 'Medium', 'Urban': 'Yes', 'US': 'Yes'},
])

# Generate predictions using your recommended model
# Note: prepare these exactly as you prepared the training data
# (same dummy and engineered columns, in the same order)
# Your code here...

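A common pitfall when preparing new rows is that dummy-encoding a few scenarios on their own may not produce the same columns as the training data, because some category levels may be absent. A minimal sketch of one way to handle this, on hypothetical toy frames (`train`, `new`, and their columns are invented here), is to encode the new rows without `drop_first` and then align them to the training columns with `reindex`:

```python
import pandas as pd

# Hypothetical training frame and its encoded feature matrix
train = pd.DataFrame({"price": [100, 120, 90], "shelf": ["Bad", "Good", "Medium"]})
X_train = pd.get_dummies(train, columns=["shelf"], drop_first=True, dtype=int)

# A new row contains only one level, so encode WITHOUT drop_first,
# then align to the training columns; missing dummies are filled with 0
new = pd.DataFrame({"price": [105], "shelf": ["Good"]})
X_new = pd.get_dummies(new, columns=["shelf"], dtype=int)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)
print(X_new)
```

Any squared or interaction columns used by the chosen model would need to be rebuilt for the new rows in the same way before calling `predict`.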

---

##### Prediction Interpretation

Briefly interpret the predictions. Do they make sense given what you learned about the important predictors?

*Your interpretation here...*

---

## Reflection (10 pts)

Address the following (concise bullets or short paragraphs are fine):

### 1. Key Takeaway

What part of this assignment most improved your understanding of regression or the sklearn workflow? Include a concrete example of something you understand better now than before.

---

##### Key Takeaway

*Your response here...*

---

### 2. GenAI Use

If you used GenAI tools (ChatGPT, Claude, etc.):
- What tool/model did you use?
- How did you use it (understanding concepts, debugging code, etc.)?
- How did you verify the output was correct?
- What worked well and what didn't?

If you didn't use GenAI, explain why and whether you plan to use it for future assignments.

---

##### GenAI Use

*Your response here...*

---

### 3. Feedback

- Approximately how much time did you spend on this assignment?
- What was the most difficult part?
- How would you improve this assignment?
- Anything else you want to share or ask?

---

##### Feedback

*Your response here...*

---