
# ML Zoomcamp — Homework 2: Car Fuel Efficiency (Regression)


This notebook walks through:

1. Dataset download & selection of required columns  
2. EDA (tail check for the target)  
3. Train/Val/Test split (60/20/20, seed=42 unless specified)  
4. Linear Regression **without** regularization (missing value handling: `0` vs `mean`)  
5. Linear Regression **with** regularization (`r ∈ [0, 0.01, 0.1, 1, 5, 10, 100]`)  
6. Seed sensitivity study (std of RMSE across seeds 0–9)  
7. Final model (seed=9, train+val, r=0.001) and test RMSE  

At the end of each section, the notebook prints the exact answers required by the homework.



## Setup

We use only `pandas` and `numpy` (as in the lectures). If the download from the URL fails
(e.g., when offline), place `car_fuel_efficiency.csv` in the working directory and the code will load it from there.


In [1]:

import os
import math
import numpy as np
import pandas as pd

pd.set_option('display.max_colwidth', 120)

URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
CSV_LOCAL = 'car_fuel_efficiency.csv'

def load_data():
    # Try URL first
    try:
        df = pd.read_csv(URL)
        print('Loaded dataset from URL. Shape:', df.shape)
        return df
    except Exception as e_url:
        print('URL load failed:', e_url)
        if os.path.exists(CSV_LOCAL):
            df = pd.read_csv(CSV_LOCAL)
            print('Loaded dataset from local file. Shape:', df.shape)
            return df
        else:
            raise RuntimeError('Dataset not found. Provide car_fuel_efficiency.csv or enable internet.')

df_raw = load_data()
df_raw.head()


Loaded dataset from URL. Shape: (9704, 11)


|  | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |



## Select Required Columns

We keep only:

- `engine_displacement`  
- `horsepower`  
- `vehicle_weight`  
- `model_year`  
- `fuel_efficiency_mpg` (target)


In [2]:

use_cols = ['engine_displacement','horsepower','vehicle_weight','model_year','fuel_efficiency_mpg']
df = df_raw[use_cols].copy()
df.describe(include='all')


|  | engine_displacement | horsepower | vehicle_weight | model_year | fuel_efficiency_mpg |
|---|---|---|---|---|---|
| count | 9704.0 | 8996.0 | 9704.0 | 9704.0 | 9704.0 |
| mean | 199.708368 | 149.657292 | 3001.280993 | 2011.484027 | 14.985243 |
| std | 49.455319 | 29.879555 | 497.89486 | 6.659808 | 2.556468 |
| min | 10.0 | 37.0 | 952.681761 | 2000.0 | 6.200971 |
| 25% | 170.0 | 130.0 | 2666.248985 | 2006.0 | 13.267459 |
| 50% | 200.0 | 149.0 | 2993.226296 | 2012.0 | 15.006037 |
| 75% | 230.0 | 170.0 | 3334.957039 | 2017.0 | 16.707965 |
| max | 380.0 | 271.0 | 4739.077089 | 2023.0 | 25.967222 |



## EDA: Target Tail Check

We check quantitatively whether `fuel_efficiency_mpg` has a long tail,
using summary statistics and the upper quantiles (no plot is drawn here).


In [3]:

target = 'fuel_efficiency_mpg'
print(df[target].describe())
print('\nQuantiles:')
print(df[target].quantile([0.5, 0.75, 0.9, 0.95, 0.99]))


count    9704.000000
mean       14.985243
std         2.556468
min         6.200971
25%        13.267459
50%        15.006037
75%        16.707965
max        25.967222
Name: fuel_efficiency_mpg, dtype: float64

Quantiles:
0.50    15.006037
0.75    16.707965
0.90    18.259461
0.95    19.150022
0.99    20.882064
Name: fuel_efficiency_mpg, dtype: float64
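
A tail check can also be expressed as a few numbers rather than read off by eye. The sketch below is an illustration only: it uses synthetic data (not the homework dataset), and `tail_summary` is a hypothetical helper name. It compares the mean with the median and computes the sample skewness via `Series.skew()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the homework data): one symmetric, one right-skewed.
symmetric = pd.Series(rng.normal(loc=15.0, scale=2.5, size=10_000))
long_tail = pd.Series(rng.lognormal(mean=2.7, sigma=0.6, size=10_000))

def tail_summary(s):
    """Return mean, median, sample skewness, and the max-to-p99 ratio."""
    return {
        "mean": float(s.mean()),
        "median": float(s.median()),
        "skew": float(s.skew()),
        "max_over_p99": float(s.max() / s.quantile(0.99)),
    }

sym_stats = tail_summary(symmetric)
tail_stats = tail_summary(long_tail)
print(sym_stats)
print(tail_stats)
```

A long right tail typically shows up as a mean well above the median and a clearly positive skewness. In the output printed above, the mean (14.99) and median (15.01) of `fuel_efficiency_mpg` nearly coincide, which suggests the target is roughly symmetric rather than long-tailed.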



## Question 1 — Missing Values

Which column has missing values?


In [4]:

na_counts = df.isna().sum().sort_values(ascending=False)
display(na_counts.to_frame('na_count'))
q1_answer = na_counts.index[0] if na_counts.iloc[0] > 0 else 'None'
print('Q1 column with missing values →', q1_answer)


|  | na_count |
|---|---|
| horsepower | 708 |
| engine_displacement | 0 |
| vehicle_weight | 0 |
| model_year | 0 |
| fuel_efficiency_mpg | 0 |


Q1 column with missing values → horsepower



## Question 2 — Median of `horsepower`


In [5]:

hp_median = float(df['horsepower'].median())
print('Median horsepower =', hp_median)
# For the multiple choice, we print the *closest* of [49, 99, 149, 199]
options = np.array([49, 99, 149, 199], dtype=float)
closest = options[np.argmin(np.abs(options - hp_median))]
print('Closest option:', closest)


Median horsepower = 149.0
Closest option: 149.0



## Prepare & Split (as in lectures)

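- Shuffle with seed=42 (unless specified otherwise). The code uses `np.random.default_rng(seed)`, which is a different generator from the legacy `np.random.seed(seed)` + `np.random.shuffle` pair, so the same seed generally yields a different permutation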
- 60% train, 20% validation, 20% test


In [6]:

def shuffle_split(df_in, seed=42):
    n = len(df_in)
    idx = np.arange(n)
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)

    n_train = int(0.6 * n)
    n_val   = int(0.2 * n)
    n_test  = n - n_train - n_val

    df_shuf = df_in.iloc[idx].reset_index(drop=True)
    df_train = df_shuf.iloc[:n_train].reset_index(drop=True)
    df_val   = df_shuf.iloc[n_train:n_train+n_val].reset_index(drop=True)
    df_test  = df_shuf.iloc[n_train+n_val:].reset_index(drop=True)
    return df_train, df_val, df_test

df_train, df_val, df_test = shuffle_split(df, seed=42)
len(df_train), len(df_val), len(df_test)


(5822, 1940, 1942)
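
For comparison, here is a minimal sketch of the same split written with NumPy's legacy global RNG. That a reference solution would use this API is an assumption, and `shuffle_split_legacy` is a hypothetical name not used elsewhere in this notebook. The sizes are also computed differently here: validation and test are each `int(0.2 * n)`, and the training set takes the remainder.

```python
import numpy as np
import pandas as pd

def shuffle_split_legacy(df_in, seed=42):
    """60/20/20 split using the legacy global NumPy RNG (hypothetical helper)."""
    n = len(df_in)
    n_val = int(0.2 * n)
    n_test = int(0.2 * n)
    n_train = n - n_val - n_test  # training set absorbs the rounding remainder

    idx = np.arange(n)
    np.random.seed(seed)    # legacy API: seeds the global RandomState
    np.random.shuffle(idx)  # in-place shuffle of the index array

    df_shuf = df_in.iloc[idx].reset_index(drop=True)
    df_train = df_shuf.iloc[:n_train].reset_index(drop=True)
    df_val = df_shuf.iloc[n_train:n_train + n_val].reset_index(drop=True)
    df_test = df_shuf.iloc[n_train + n_val:].reset_index(drop=True)
    return df_train, df_val, df_test

# Toy frame, only to show the resulting sizes.
toy = pd.DataFrame({"row_id": np.arange(10)})
tr, va, te = shuffle_split_legacy(toy, seed=42)
print(len(tr), len(va), len(te))  # 6 2 2
```

Applied to the 9,704 rows here, this sizing rule would give 5,824 / 1,940 / 1,940 instead of the 5,822 / 1,940 / 1,942 shown above. Together with the different generator, that means RMSE values from the two variants need not match to the last decimal.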


## Linear Regression Helpers (as in lectures)

- `prepare_X`: turn dataframe → design matrix with bias term  
- `train_linear_regression`: closed-form solution with optional L2 (ridge) regularization  
- `predict` and `rmse`


In [7]:

features = ['engine_displacement','horsepower','vehicle_weight','model_year']
target = 'fuel_efficiency_mpg'

def prepare_X(df_in):
    X = df_in[features].values.astype(float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])  # bias in first column

def train_linear_regression(X, y, r=0.0):
    # Closed-form: w = (X^T X + r*D)^{-1} X^T y, where D is the identity with D[0,0] = 0
    XTX = X.T.dot(X)
    if r > 0:
        reg = r * np.eye(XTX.shape[0])
        reg[0,0] = 0  # don't regularize bias
        XTX = XTX + reg
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    return w

def predict(X, w):
    return X.dot(w)

def rmse(y, y_pred):
    return math.sqrt(np.mean((y - y_pred) ** 2))
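
A quick way to gain confidence in a closed-form solver is to compare it with `np.linalg.lstsq` on synthetic data where the true weights are known. The sketch below is self-contained and does not reuse the helpers above: `ridge_closed_form` is a hypothetical stand-in that solves the same normal equations, but with `np.linalg.solve` in place of an explicit matrix inverse.

```python
import numpy as np

def ridge_closed_form(X, y, r=0.0):
    """Solve (X^T X + r*D) w = X^T y, where D is the identity with D[0, 0] = 0."""
    XTX = X.T @ X
    D = np.eye(XTX.shape[0])
    D[0, 0] = 0.0  # leave the bias column unpenalized, as in the helper above
    return np.linalg.solve(XTX + r * D, X.T @ y)

# Synthetic regression problem with known weights (bias first).
rng = np.random.default_rng(1)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

w_closed = ridge_closed_form(X, y, r=0.0)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_closed, 3))
print("matches lstsq:", np.allclose(w_closed, w_lstsq))
```

With `r=0` the two solvers should agree up to floating-point error, and both should land close to `w_true`.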



## Question 3 — Missing Value Strategy (0 vs mean)

- Use the column identified in Q1, and compute its mean **on the training set only**
- Create two train/val pipelines:
  - Fill with 0
  - Fill with training mean
- Train unregularized models (`r=0`), compute validation RMSE, round to 2 decimals, and compare.


In [8]:

col_with_na = df_train.isna().sum().sort_values(ascending=False).index[0]

# Prepare train/val copies
train0 = df_train.copy()
val0 = df_val.copy()

train_mean = df_train[col_with_na].mean()
train_mean_filled = df_train.copy()
val_mean_filled = df_val.copy()

# Fill options
train0[col_with_na] = train0[col_with_na].fillna(0)
val0[col_with_na]   = val0[col_with_na].fillna(0)

train_mean_filled[col_with_na] = train_mean_filled[col_with_na].fillna(train_mean)
val_mean_filled[col_with_na]   = val_mean_filled[col_with_na].fillna(train_mean)

# Targets
y_train0 = train0[target].values
y_val0   = val0[target].values

y_train_mean = train_mean_filled[target].values
y_val_mean   = val_mean_filled[target].values

# Design matrices
X_train0 = prepare_X(train0)
X_val0   = prepare_X(val0)

X_train_mean = prepare_X(train_mean_filled)
X_val_mean   = prepare_X(val_mean_filled)

# Train (no regularization)
w0 = train_linear_regression(X_train0, y_train0, r=0.0)
w_mean = train_linear_regression(X_train_mean, y_train_mean, r=0.0)

# Predict & RMSE
rmse0 = rmse(y_val0, predict(X_val0, w0))
rmse_mean = rmse(y_val_mean, predict(X_val_mean, w_mean))

print('RMSE with 0   :', round(rmse0, 2))
print('RMSE with mean:', round(rmse_mean, 2))

rmse0_r, rmse_mean_r = round(rmse0, 2), round(rmse_mean, 2)
if rmse0_r < rmse_mean_r:
    q3_choice = 'With 0'
elif rmse_mean_r < rmse0_r:
    q3_choice = 'With mean'
else:
    q3_choice = 'Both are equally good'
print('Q3 →', q3_choice)


RMSE with 0   : 0.52
RMSE with mean: 0.47
Q3 → With mean
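
The "train only" rule for the fill value can be packaged as a small helper. The sketch below is illustrative: `fill_from_train` is a hypothetical name and the toy frames are made up. The point is that the validation frame is filled with a statistic derived from the training frame, never from its own values.

```python
import numpy as np
import pandas as pd

def fill_from_train(train, val, col, strategy="mean"):
    """Fill NaNs in `col` of both frames with a value derived from train only."""
    if strategy == "zero":
        fill_value = 0.0
    elif strategy == "mean":
        fill_value = float(train[col].mean())  # NaNs are skipped by default
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    train_filled = train.assign(**{col: train[col].fillna(fill_value)})
    val_filled = val.assign(**{col: val[col].fillna(fill_value)})
    return train_filled, val_filled, fill_value

# Toy data: the validation frame contains an outlier (300) that must not
# influence the fill value.
train = pd.DataFrame({"horsepower": [100.0, np.nan, 140.0, 120.0]})
val = pd.DataFrame({"horsepower": [np.nan, 300.0]})

tr_f, va_f, used = fill_from_train(train, val, "horsepower", strategy="mean")
print(used)                         # 120.0
print(va_f["horsepower"].tolist())  # [120.0, 300.0]
```

Because `assign` returns new frames, the original `train` and `val` are left untouched, which plays the same role as the explicit `.copy()` calls in the cell above.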



## Question 4 — Regularized Linear Regression (fill NAs with 0)

- Try `r ∈ [0, 0.01, 0.1, 1, 5, 10, 100]`
- Use validation RMSE (rounded to 2 decimals)
- Pick the smallest `r` that yields the best RMSE


In [9]:

# Make fresh copies so the frames imputed in Q3 are not reused
train = df_train.copy()
val = df_val.copy()

# Fill NAs with 0 in ALL features (safe approach)
for c in features:
    if train[c].isna().any() or val[c].isna().any():
        train[c] = train[c].fillna(0)
        val[c] = val[c].fillna(0)

Xtr = prepare_X(train); ytr = train[target].values
Xva = prepare_X(val);   yva = val[target].values

rs = [0, 0.01, 0.1, 1, 5, 10, 100]
scores = []
for r in rs:
    w = train_linear_regression(Xtr, ytr, r=r)
    y_pred = predict(Xva, w)
    s = round(rmse(yva, y_pred), 2)
    scores.append((r, s))

print('Validation RMSE by r:')
for r, s in scores:
    print(f'r={r:<6} RMSE={s}')

best_rmse = min(s for _, s in scores)
candidate_rs = [r for r, s in scores if s == best_rmse]
best_r = min(candidate_rs)  # smallest r with best score
print('\nQ4 → best r =', best_r, '(RMSE =', best_rmse, ')')


Validation RMSE by r:
r=0      RMSE=0.52
r=0.01   RMSE=0.52
r=0.1    RMSE=0.52
r=1      RMSE=0.52
r=5      RMSE=0.52
r=10     RMSE=0.52
r=100    RMSE=0.52

Q4 → best r = 0 (RMSE = 0.52 )
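
One plausible reason the rounded RMSE does not move is scale: with unscaled features such as vehicle weight, the entries of `X^T X` are many orders of magnitude larger than any `r` in the grid, so the penalty barely changes the solution. The sketch below illustrates this on synthetic data. The feature scales and coefficients are assumed values chosen to loosely resemble the dataset, and the RMSE is computed on the training data for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic features on scales loosely resembling displacement and weight
# (assumed values, not taken from the homework data).
x1 = rng.normal(200.0, 50.0, size=n)
x2 = rng.normal(3000.0, 500.0, size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 30.0 + 0.002 * x1 - 0.005 * x2 + rng.normal(scale=0.5, size=n)

def fit(X, y, r):
    """Ridge via the normal equations, with the bias column unpenalized."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + r * D, X.T @ y)

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

# Training RMSE for a few penalties, next to the diagonal of X^T X.
results = {r: rmse(y, X @ fit(X, y, r)) for r in [0, 1, 100]}
print("diag(X^T X):", np.diag(X.T @ X))
print(results)
```

Standardizing the features before fitting would make `r` act on a comparable scale across columns. That is a separate design choice and is not part of this homework's procedure.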



## Question 5 — Seed Sensitivity (std of validation RMSE across seeds 0–9)

- For each seed:
  - Split 60/20/20
  - Fill NAs with 0
  - Train **unregularized** model (`r=0`)
  - Compute **validation RMSE**
- Compute `np.std` of the 10 scores, round to **3** decimals.


In [10]:

def val_rmse_for_seed(seed):
    dtr, dval, dte = shuffle_split(df, seed=seed)
    # Fill NAs with 0 for safety
    for c in features:
        dtr[c] = dtr[c].fillna(0)
        dval[c] = dval[c].fillna(0)

    Xtr = prepare_X(dtr); ytr = dtr[target].values
    Xva = prepare_X(dval); yva = dval[target].values
    w = train_linear_regression(Xtr, ytr, r=0.0)
    return rmse(yva, predict(Xva, w))

seeds = list(range(10))
scores = [val_rmse_for_seed(s) for s in seeds]
std_scores = float(np.std(scores))
print('Validation RMSE scores by seed 0–9:', [round(s, 4) for s in scores])
print('STD =', round(std_scores, 3))

# Multiple-choice helper
options = np.array([0.001, 0.006, 0.060, 0.600], dtype=float)
closest = options[np.argmin(np.abs(options - round(std_scores, 3)))]
print('Closest option:', closest)


Validation RMSE scores by seed 0–9: [0.521, 0.5244, 0.5252, 0.5244, 0.5259, 0.5252, 0.5204, 0.5104, 0.5204, 0.5323]
STD = 0.005
Closest option: 0.006
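
Note that `np.std` computes the population standard deviation (`ddof=0`) by default, whereas `pandas.Series.std` defaults to the sample version (`ddof=1`). With only 10 scores the difference is visible. The sketch below reuses the rounded scores printed above, so its values may differ marginally from ones computed on the unrounded scores.

```python
import numpy as np

# Rounded validation RMSE values copied from the printed output above.
scores = [0.521, 0.5244, 0.5252, 0.5244, 0.5259,
          0.5252, 0.5204, 0.5104, 0.5204, 0.5323]

std_population = float(np.std(scores))      # ddof=0, NumPy's default
std_sample = float(np.std(scores, ddof=1))  # ddof=1, divides by n - 1

print("population std:", round(std_population, 3))
print("sample std    :", round(std_sample, 3))
```

On these numbers the population version rounds to 0.005 and the sample version to 0.006, so the choice of `ddof` is worth stating alongside the reported value.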



## Question 6 — Final Model

- Split with **seed=9**
- Combine **train + validation**
- Fill NAs with 0
- Train with **r=0.001**
- Report **test RMSE** (rounded to 3 decimals)


In [11]:

dtr, dval, dte = shuffle_split(df, seed=9)
df_full = pd.concat([dtr, dval], axis=0).reset_index(drop=True)

for c in features:
    df_full[c] = df_full[c].fillna(0)
    dte[c]     = dte[c].fillna(0)

X_full = prepare_X(df_full); y_full = df_full[target].values
X_test = prepare_X(dte);     y_test = dte[target].values

w_final = train_linear_regression(X_full, y_full, r=0.001)
test_rmse = rmse(y_test, predict(X_test, w_final))
print('Final Test RMSE (seed=9, r=0.001):', round(test_rmse, 3))

# Multiple-choice helper
options = np.array([0.15, 0.515, 5.15, 51.5], dtype=float)
closest = options[np.argmin(np.abs(options - round(test_rmse, 3)))]
print('Closest option:', closest)


Final Test RMSE (seed=9, r=0.001): 0.505
Closest option: 0.515



## Answer Summary

- **Q1:** Column with missing values → `horsepower`  
- **Q2:** Median horsepower → 149.0 (closest option: 149)  
- **Q3:** Better imputation for NA (0 vs mean) → With mean (validation RMSE 0.47 vs 0.52)  
- **Q4:** Best `r` (smallest if tie) → 0 (validation RMSE 0.52 for every `r` tried)  
- **Q5:** Std of validation RMSE across seeds 0–9 → 0.005 (closest option: 0.006)  
- **Q6:** Test RMSE for seed=9, r=0.001 → 0.505 (closest option: 0.515)
