
# Statistical Learning & Linear Regression — Jupyter Notebook Template

This notebook is structured to match the assignment prompt.  
Run each cell in order. If you do **not** have `house_prices.csv` in the notebook folder, the notebook will **auto‑generate** a synthetic dataset with plausible values so you can still complete the tasks.

> **Tip:** If packages are missing, run the setup cell in **Part 0**.



## Part 0 — Environment Setup (Run if needed)

Use this cell to install any missing packages. If they are already installed, you can skip it.


In [None]:

# If needed, uncomment the lines below to install packages in this environment.
# Note: In conda/miniforge, prefer: conda install pandas numpy scikit-learn matplotlib scipy statsmodels
# Otherwise, pip installs will work inside your active environment.

# %pip install -q pandas numpy scikit-learn matplotlib scipy statsmodels



---
## Part 1 — Theoretical Concepts

### 1) Define the following in the context of statistical learning
a) **Training error** — *Your answer here.*  
b) **Test error** — *Your answer here.*  
c) **Bias–variance trade‑off** — *Your answer here.*  
d) **Overfitting** — *Your answer here.*  
e) **Model complexity** — *Your answer here.*

### 2) Parametric vs. Nonparametric Models
- **Explain the difference.** *Your answer here.*  
- **Give an example of each.** *Your answer here.*

### 3) Bias–Variance Trade‑off (Discussion)
- **Discuss how model complexity affects performance.** *Your answer here.*  
- **Example of overfitting with a complex model.** *Your answer here.*  
- **Example of underfitting with an overly simple model.** *Your answer here.*
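
To make the discussion concrete, the following is a minimal sketch of how model complexity can affect training and test error. It assumes a made-up quadratic ground truth with Gaussian noise (not the housing data) and uses polynomial degree as a stand-in for complexity; the sample sizes, seed, and degrees are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a quadratic signal plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(30)     # small training set
x_test, y_test = make_data(1000)     # large held-out set

results = {}
for degree in (1, 2, 9):
    # Least-squares polynomial fit of the given degree
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training MSE cannot increase as the degree grows, because each higher-degree fit contains the lower-degree one as a special case. The degree-1 line is too simple for a quadratic signal, so it has high error on both sets. Whether the degree-9 fit generalizes worse than degree 2 depends on the sample, which is worth checking in the printed output rather than assuming.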



---
## Part 2 — Linear Regression

You are provided with (or will generate) a dataset with columns:

- `Price` *(USD, dependent variable)*  
- `Size` *(square feet)*  
- `Bedrooms` *(integer count)*  
- `Age` *(years)*  
- `Distance_to_city_center` *(miles)*

### 0) Load dataset (or auto‑generate synthetic data)

- If a file named **`house_prices.csv`** is found in the working directory, it will be loaded.  
- Otherwise, the notebook will **generate** a synthetic dataset of 500 rows with realistic relationships.


In [None]:

import os
import numpy as np
import pandas as pd
from IPython.display import display  # built into Jupyter; imported explicitly for clarity


csv_path = "house_prices.csv"

def generate_synthetic_data(n=500, seed=42):
    rng = np.random.default_rng(seed)
    # Base features
    Size = rng.normal(loc=1800, scale=500, size=n).clip(500, 4500)  # sq ft
    Bedrooms = np.round((Size / 700) + rng.normal(0, 0.6, size=n)).clip(1, 8).astype(int)
    Age = rng.integers(0, 80, size=n)  # years
    Distance = np.abs(rng.normal(loc=8, scale=5, size=n))  # miles to center
    
    # Price model (true) with noise
    # Larger size => higher price; more bedrooms => higher price; older => lower; farther => lower
    base = 50000
    price = (
        base
        + 220 * Size
        + 15000 * Bedrooms
        - 1200 * Age
        - 8000 * Distance
        + rng.normal(0, 35000, size=n)  # noise
    )
    price = np.clip(price, 50000, None)
    
    df = pd.DataFrame({
        "Price": price.astype(float),
        "Size": Size.astype(float),
        "Bedrooms": Bedrooms.astype(int),
        "Age": Age.astype(int),
        "Distance_to_city_center": Distance.astype(float),
    })
    return df

if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    print(f"Loaded {csv_path} with shape {df.shape}")
else:
    df = generate_synthetic_data(n=500, seed=42)
    print("Synthetic dataset generated (since 'house_prices.csv' was not found).")
    
# Basic sanity check
display(df.head())
print(df.describe(include='all'))
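
A CSV loaded from disk may not match the column list described above. The following is a minimal sketch of a schema check, assuming the five column names from this notebook; the helper name `check_schema` and the tiny `demo` frame are hypothetical and only serve as illustration.

```python
import pandas as pd

EXPECTED_COLUMNS = ["Price", "Size", "Bedrooms", "Age", "Distance_to_city_center"]

def check_schema(frame):
    """Return a list of human-readable problems; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in (c for c in EXPECTED_COLUMNS if c in frame.columns):
        if not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"column {col!r} is not numeric")
        elif frame[col].isna().any():
            problems.append(f"column {col!r} has missing values")
    return problems

# Hypothetical frame that lacks one expected column
demo = pd.DataFrame({
    "Price": [250000.0, 310000.0],
    "Size": [1500.0, 2100.0],
    "Bedrooms": [3, 4],
    "Age": [12, 5],
})
print(check_schema(demo))
```

Calling `check_schema(df)` right after loading would surface a mismatched file before any model is fitted.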



### 1) Simple Linear Regression: `Price ~ Size`

- **Task (a)**: Provide the regression equation.  
- **Task (b)**: Interpret the slope and intercept in context.


In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Simple LR with Size only
X_simple = df[["Size"]].values
y = df["Price"].values

simple_lr = LinearRegression()
simple_lr.fit(X_simple, y)

intercept = float(simple_lr.intercept_)
slope = float(simple_lr.coef_[0])

print("Simple Linear Regression: Price = b0 + b1 * Size")
print(f"b0 (intercept): {intercept:,.2f}")
print(f"b1 (slope for Size): {slope:,.2f}")

# Example prediction for interpretation
example_size = 2000
pred_price = simple_lr.predict([[example_size]])[0]
print(f"Predicted Price for Size={example_size} sq ft: {pred_price:,.2f} USD")



**Interpretation (write in your own words):**  
- **Intercept (b0):** *Your answer here.* Hint: the predicted price when Size = 0, which is an extrapolation outside the data and often not physically meaningful.  
- **Slope (b1):** *Your answer here.* Hint: the expected change in price (USD) for a 1 sq ft increase in Size. This simple model has no other predictors, so nothing is being held constant.  
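
For a single predictor, the least-squares slope and intercept have a closed form: the slope is the sample covariance of Size and Price divided by the sample variance of Size, and the intercept makes the line pass through the point of means. The sketch below checks this on made-up toy data (the numbers are assumptions, not the notebook's dataset).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: size in sq ft, price in USD
size = rng.uniform(800, 3500, size=200)
price = 60_000 + 250 * size + rng.normal(0, 30_000, size=200)

# Closed-form OLS estimates for one predictor
b1 = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
b0 = price.mean() - b1 * size.mean()

# Cross-check against a generic least-squares line fit
slope_check, intercept_check = np.polyfit(size, price, deg=1)

print(f"b1 = {b1:,.2f} USD per sq ft, b0 = {b0:,.2f} USD")
```

The same two numbers should match `simple_lr.coef_[0]` and `simple_lr.intercept_` when the formulas are applied to `df["Size"]` and `df["Price"]`.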



### 2) Multiple Linear Regression: `Price ~ Size + Bedrooms + Age + Distance_to_city_center`

- **Task (a)**: Provide the regression equation.  
- **Task (b)**: Interpret the coefficients. Which variable has the strongest impact on price?


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features = ["Size", "Bedrooms", "Age", "Distance_to_city_center"]
X_multi = df[features].values
y = df["Price"].values

multi_lr = LinearRegression()
multi_lr.fit(X_multi, y)

print("Multiple Linear Regression: Price = b0 + b1*Size + b2*Bedrooms + b3*Age + b4*Distance_to_city_center")
print(f"Intercept (b0): {multi_lr.intercept_:,.2f}")
for name, coef in zip(features, multi_lr.coef_):
    print(f"{name:>25}: {coef:,.2f}")

# Standardized coefficients to compare relative impact
std_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])
std_pipeline.fit(X_multi, y)
std_coefs = std_pipeline.named_steps["lr"].coef_

print("\nStandardized coefficients (features z-scored; larger |coef| => stronger impact):")
for name, coef in zip(features, std_coefs):
    print(f"{name:>25}: {coef:,.4f}")



**Interpretation (write in your own words):**  
- Explain the meaning of each coefficient in context (holding other variables constant).  
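- Use the **standardized coefficients** (by absolute value) to argue which variable has the strongest impact. Predictors such as `Size` and `Bedrooms` are typically correlated, so treat the ranking as a rough guide rather than a definitive answer.  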
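
Standardized and raw coefficients are directly related: z-scoring a feature multiplies its coefficient by that feature's standard deviation. The sketch below illustrates this with plain NumPy least squares on made-up two-feature data; the helper `ols` and the data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Hypothetical features on very different scales
X = np.column_stack([
    rng.normal(1800, 500, size=n),   # e.g. size in sq ft
    rng.normal(8, 5, size=n),        # e.g. distance in miles
])
y = 50_000 + 200 * X[:, 0] - 8_000 * X[:, 1] + rng.normal(0, 20_000, size=n)

def ols(features, target):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[0], beta[1:]

_, raw_coefs = ols(X, y)

# z-score each column using the population std (ddof=0), as StandardScaler does
sd = X.std(axis=0)
Z = (X - X.mean(axis=0)) / sd
_, std_coefs = ols(Z, y)

print("raw coefficients:         ", raw_coefs)
print("standardized coefficients:", std_coefs)
print("raw * std of each feature:", raw_coefs * sd)
```

A standardized coefficient therefore reads as the expected change in price for a one-standard-deviation increase in that feature, which is what makes the values comparable across features.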



### 3) Evaluate Model Fit

- Compute **R-squared** for both the simple and multiple models.  
- Explain what each R-squared tells you.  
- Compare which model fits better and why.


In [None]:

# R-squared for simple model
y_hat_simple = simple_lr.predict(X_simple)
r2_simple = r2_score(y, y_hat_simple)

# R-squared for multiple model
y_hat_multi = multi_lr.predict(X_multi)
r2_multi = r2_score(y, y_hat_multi)

print(f"R-squared (Simple: Size only): {r2_simple:.4f}")
print(f"R-squared (Multiple: all features): {r2_multi:.4f}")



**Your discussion here:**  
- What does each R-squared value indicate about goodness-of-fit?  
- Which model fits better? Why might that be?  
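
One point to keep in mind when comparing the two values: R-squared computed on the data used for fitting can never decrease when a predictor is added, even a useless one. The sketch below shows this on made-up data and includes adjusted R-squared, which penalizes extra predictors. The helper names are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2, which penalizes the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def fit_predict(columns, target):
    """In-sample OLS predictions using an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(target))] + columns)
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ beta

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)              # pure noise, unrelated to y
y = 3.0 + 2.0 * x1 + rng.normal(size=n)

r2_small = r_squared(y, fit_predict([x1], y))
r2_big = r_squared(y, fit_predict([x1, x_noise], y))

print(f"R^2 with x1 only:        {r2_small:.4f} (adjusted {adjusted_r_squared(r2_small, n, 1):.4f})")
print(f"R^2 with x1 and a noise: {r2_big:.4f} (adjusted {adjusted_r_squared(r2_big, n, 2):.4f})")
```

In the notebook, `adjusted_r_squared(r2_multi, len(y), len(features))` would give a fairer comparison against the simple model than raw R-squared alone.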



### 4) Check Linear Regression Assumptions (Multiple Model)

- Create residual plots for the multiple regression model.  
- Discuss linearity, homoscedasticity, and normality. If assumptions seem violated, suggest remedies.


In [None]:

import matplotlib.pyplot as plt
from scipy import stats

# Residuals
residuals = y - y_hat_multi
fitted = y_hat_multi

# Residuals vs Fitted
plt.figure()
plt.scatter(fitted, residuals, s=12)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted (Multiple LR)")
plt.show()

# Histogram of residuals
plt.figure()
plt.hist(residuals, bins=30)
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.title("Histogram of Residuals")
plt.show()

# Q-Q plot for normality
plt.figure()
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q–Q Plot of Residuals")
plt.show()



**Interpretation (write in your own words):**  
- **Linearity:** *Your answer here.*  
- **Homoscedasticity (constant variance):** *Your answer here.*  
- **Normality of residuals:** *Your answer here.*  
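- **Potential remedies if violated:** transform variables (e.g., log `Price`), add polynomial or interaction terms to capture nonlinearity, use weighted least squares or robust standard errors for non-constant variance, investigate influential outliers before trimming them, or try a different model class.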
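
As an illustration of the log-transform remedy, the sketch below builds made-up data whose noise is multiplicative, so the residual spread grows with the fitted value. It then compares residual spread in the upper and lower halves of the fitted values, before and after taking the log of the target. The data, the helper `spread_ratio`, and the half-split diagnostic are assumptions chosen for simplicity, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Hypothetical data with multiplicative noise: spread grows with the mean
x = rng.uniform(1.0, 10.0, size=n)
y = (20.0 + 10.0 * x) * np.exp(rng.normal(0.0, 0.3, size=n))

def spread_ratio(x, target):
    """Fit target ~ x by least squares, then divide the residual std in the
    upper half of fitted values by the residual std in the lower half.
    Values near 1 are consistent with constant variance."""
    slope, intercept = np.polyfit(x, target, deg=1)
    fitted = intercept + slope * x
    resid = target - fitted
    upper = fitted > np.median(fitted)
    return float(resid[upper].std() / resid[~upper].std())

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, np.log(y))

print(f"upper/lower residual spread, raw target: {ratio_raw:.2f}")
print(f"upper/lower residual spread, log target: {ratio_log:.2f}")
```

A ratio well above 1 on the raw scale matches the funnel shape one would see in a residuals-vs-fitted plot. If the log transform is used, coefficients are then interpreted as approximate percentage changes in price rather than USD changes.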



---
## Part 3 — Model Selection and Performance



### 1) Train/Test Split & MSE

- Split data into **80% train / 20% test**.  
- Fit the **multiple linear regression** on the training set.  
- Compute **MSE** on both training and test sets; compare generalization.


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = df[features].values
y = df["Price"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123, shuffle=True)

lr_tt = LinearRegression()
lr_tt.fit(X_train, y_train)

y_train_pred = lr_tt.predict(X_train)
y_test_pred  = lr_tt.predict(X_test)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test  = mean_squared_error(y_test, y_test_pred)

print(f"Train MSE: {mse_train:,.2f}")
print(f"Test  MSE: {mse_test:,.2f}")



**Interpretation (write in your own words):**  
- Compare Train vs. Test MSE. What does this say about generalization?  
- Signs of **overfitting**: Train MSE ≪ Test MSE.  
- Signs of **underfitting**: Both MSEs high and similar; low R².



### 2) Overfitting vs. Underfitting

Based on Train/Test results, discuss whether the model is overfitting or underfitting.  
*Your answer here — include ideas like cross‑validation, regularization (Ridge/Lasso), adding/removing features, transformations, or collecting more data.*
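
A single train/test split can be noisy, so cross-validation is a common way to get a steadier estimate. The sketch below compares ordinary least squares with Ridge regression using 5-fold cross-validated MSE from scikit-learn. The data are a made-up stand-in for the housing features, and `alpha=1.0` is an arbitrary choice, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 300

# Hypothetical stand-in for four housing features and a price-like target
X = rng.normal(size=(n, 4))
y = 10.0 + X @ np.array([3.0, 1.0, -0.5, -2.0]) + rng.normal(0.0, 1.0, size=n)

models = {
    "ols": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge(alpha=1.0)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that larger scores are better
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = float(-scores.mean())
    print(f"{name}: 5-fold CV MSE = {cv_mse[name]:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold. With only four predictors and a few hundred rows, Ridge and OLS will often score similarly; regularization tends to matter more when there are many or highly correlated features.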



### 3) Feature Engineering Idea

Suggest **one additional variable** that could improve predictions (e.g., lot size, neighborhood quality score, renovation status, HOA fees, local school rating index).  
Justify your choice based on the **housing** context. *Your answer here.*
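
One way to back up the justification is to compare held-out MSE with and without the candidate variable. The sketch below does this on made-up data in which an invented `school_rating` feature truly affects price; the feature, its effect size, and the helper `holdout_mse` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400

# Hypothetical data: `school_rating` is an invented candidate feature
size = rng.normal(1800, 500, size=n)
school_rating = rng.uniform(1, 10, size=n)
price = 50_000 + 220 * size + 12_000 * school_rating + rng.normal(0, 30_000, size=n)

# Simple 80/20 split on shuffled row indices
idx = rng.permutation(n)
train, test = idx[:320], idx[320:]

def holdout_mse(columns):
    """Fit OLS on the training rows and return MSE on the held-out rows."""
    design = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(design[train], price[train], rcond=None)
    return float(np.mean((design[test] @ beta - price[test]) ** 2))

mse_base = holdout_mse([size])
mse_extended = holdout_mse([size, school_rating])

print(f"held-out MSE, Size only:            {mse_base:,.0f}")
print(f"held-out MSE, Size + school_rating: {mse_extended:,.0f}")
```

With real data the improvement is not guaranteed, which is why the comparison is made on held-out rows rather than on the training fit.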



---
## Submission Checklist

- Clear sections matching **Parts 1–3**.  
- Comments explaining each step.  
- All outputs and visualizations included (after you **run all**).  
- **Screencast (4–5 min)** covering:  
  - Key theoretical concepts (Part 1).  
  - Data, models, and code walkthrough (Parts 2–3).  
  - Interpretation of results and diagnostics.  
  - What you would improve next and why.
