<a href="https://colab.research.google.com/github/lugasaji/ML-Zoomcamp-2025/blob/main/ML_Zoomcamp_2025_Homework_2_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## __Homework 2__

> Note: sometimes your answer doesn't match one of
> the options exactly. That's fine.
> Select the option that's closest to your solution.



### __Dataset__

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'>here</a>.

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).

### __Imports__

In [290]:
import pandas as pd
import numpy as np
from typing import List, Tuple

### __Preparing the dataset__

Use only the following columns:

* `'engine_displacement'`,
* `'horsepower'`,
* `'vehicle_weight'`,
* `'model_year'`,
* `'fuel_efficiency_mpg'`

In [291]:
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv"
df = pd.read_csv(url)
df = df[['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']]
df.head()

Unnamed: 0,engine_displacement,horsepower,vehicle_weight,model_year,fuel_efficiency_mpg
0,170,159.0,3413.433759,2003,13.231729
1,130,97.0,3149.664934,2007,13.688217
2,170,78.0,3079.038997,2018,14.246341
3,220,,2542.392402,2009,16.912736
4,210,140.0,3460.87099,2009,12.488369


### __EDA__

* Look at the `fuel_efficiency_mpg` variable. Does it have a long tail?

In [292]:
skew_value = df['fuel_efficiency_mpg'].skew()
print(f"Skew value is {skew_value}")

Skew value is -0.012062219273507929


The Skew value is very close to 0 so The `fuel_efficiency_mpg` variable has't a long tail.

### __Question 1__

There's one column with missing values. What is it?

* `'engine_displacement'`
* __`'horsepower'`__
* `'vehicle_weight'`
* `'model_year'`

In [293]:
for column in df.columns:
    if df[column].isnull().sum() > 0:
        print(f"The column {column} has missing values")

The column horsepower has missing values


### __Question 2__

What's the median (50% percentile) for variable `'horsepower'`?

- 49
- 99
- __149__
- 199

In [294]:
median_horsepower = df['horsepower'].median()
print(f"The median is {median_horsepower}")

The median is 149.0


### Prepare and split the dataset

* Shuffle the dataset (the filtered one you created above), use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.

Use the same code as in the lectures

In [295]:
def get_splited_data(df: pd.DataFrame, train_distribution: int, test_distribution: int, seed: int) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    n = len(df)
    val_distribution = 100 - train_distribution - test_distribution
    n_val = int(n * val_distribution /100)
    n_test = int(n * test_distribution / 100)
    n_train = n - n_val - n_test

    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
    df_val= df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
    df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

    return (df_train, df_test, df_val)

In [296]:
df_train, df_test, df_val = get_splited_data(df= df, train_distribution=60, test_distribution=20, seed=42)
print(len(df_train), len(df_test), len(df_val))
df_train.head()

5824 1940 1940


Unnamed: 0,engine_displacement,horsepower,vehicle_weight,model_year,fuel_efficiency_mpg
0,220,144.0,2535.887591,2009,16.642943
1,160,141.0,2741.170484,2019,16.298377
2,230,155.0,2471.880237,2017,18.591822
3,150,206.0,3748.164469,2015,11.818843
4,300,111.0,2135.716359,2006,19.402209


### __Question 3__

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?

Options:

- With 0
- __With mean__
- Both are equally good

In [297]:
def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]

In [298]:
def rmse(y, y_pred):
    se = (y - y_pred) ** 2
    mse = se.mean()
    return np.sqrt(mse)

In [299]:
def fill_column(column: pd.Series, fill_with) -> pd.Series:
    if column.isnull().sum() > 0:
        if fill_with == "mean":
            mean_t = column.mean()
            return column.fillna(mean_t)
        elif isinstance(fill_with, float):
            return column.fillna(fill_with)
        else:
            return column.fillna(0)
    return column

In [300]:
def prepare_data(df: pd.DataFrame, base_columns: List[str], predict_column: str, fill_with) -> Tuple:
    df_copy = df.copy()

    for column in df_copy.columns:
        df_copy[column] = fill_column(df_copy[column], fill_with)

    X = df_copy[base_columns].values
    y = df_copy[predict_column].values
    return X, y

In [301]:
def get_rmse_with_linear_regression(df_val: pd.DataFrame, df_train: pd.DataFrame, base_columns: List[str], predict_column: str, fill_with) -> float:
    X_train, y_train = prepare_data(df_train, base_columns, predict_column, fill_with)

    w0, w = train_linear_regression(X_train, y_train)

    X_val, y_val = prepare_data(df_val, base_columns, predict_column, fill_with)

    y_pred = w0 + X_val.dot(w)

    rmse_val = rmse(y_val, y_pred)

    return rmse_val

In [302]:
base_columns = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
predict_column = 'fuel_efficiency_mpg'

mean_train_horsepower = df_train.horsepower.mean()
msre_mean = get_rmse_with_linear_regression(df_val, df_train, base_columns, predict_column, mean_train_horsepower)
msre_zero = get_rmse_with_linear_regression(df_val, df_train, base_columns, predict_column, '')

print(f"RMSE when fill null values with the mean is {msre_mean:0.2f}")
print(f"RMSE when fill null values with  0 is {msre_zero:0.2f}")

RMSE when fill null values with the mean is 0.46
RMSE when fill null values with  0 is 0.52


### __Question 4__

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0.
* Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If multiple options give the same best RMSE, select the smallest `r`.

Options:

- __0__
- 0.01
- 1
- 10
- 100

In [303]:
def train_linear_regression_reg(X, y, r):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])

    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]

In [304]:
def get_rmse_with_linear_regression_reg(df_val: pd.DataFrame, df_train: pd.DataFrame, base_columns: List[str], predict_column: str, fill_with, r: float) -> float:
    X_train, y_train = prepare_data(df_train, base_columns, predict_column, fill_with)

    w0, w = train_linear_regression_reg(X_train, y_train, r)

    X_val, y_val = prepare_data(df_val, base_columns, predict_column, fill_with)

    y_pred = w0 + X_val.dot(w)

    rmse_val = rmse(y_val, y_pred)

    return rmse_val

In [305]:
base_columns = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
predict_column = 'fuel_efficiency_mpg'

for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    rmse_val = get_rmse_with_linear_regression_reg(df_val, df_train, base_columns, predict_column, '', r)
    print(f"RMSE when r = {r}  is {rmse_val:0.2f}")

RMSE when r = 0  is 0.52
RMSE when r = 0.01  is 0.52
RMSE when r = 0.1  is 0.52
RMSE when r = 1  is 0.52
RMSE when r = 5  is 0.52
RMSE when r = 10  is 0.52
RMSE when r = 100  is 0.52


### __Question 5__

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

What's the value of std?

- 0.001
- __0.006__
- 0.060
- 0.600

> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different.
> If standard deviation of scores is low, then our model is *stable*.

In [306]:
base_columns = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
predict_column = 'fuel_efficiency_mpg'
rmse_values = []

for seed in range(10):
    df_train, df_test, df_val = get_splited_data(df= df, train_distribution=60, test_distribution=20, seed=seed)
    rmse_val = get_rmse_with_linear_regression(df_val, df_train, base_columns, predict_column, '')
    rmse_values.append(round(rmse_val, 3))
    print(f"RMSE when seed = {seed}  is {rmse_val}")

standard_deviation = np.std(rmse_values)
print(f"\n Standart Deviation is : {standard_deviation:0.3f}")

RMSE when seed = 0  is 0.5206531296294218
RMSE when seed = 1  is 0.521338891285577
RMSE when seed = 2  is 0.5228069974803171
RMSE when seed = 3  is 0.515951674119676
RMSE when seed = 4  is 0.5109129460053851
RMSE when seed = 5  is 0.52834064601107
RMSE when seed = 6  is 0.5313910658146311
RMSE when seed = 7  is 0.5090670387381733
RMSE when seed = 8  is 0.5147399129511132
RMSE when seed = 9  is 0.5131865908224594

 Standart Deviation is : 0.007


### __Question 6__

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`.
* What's the RMSE on the test dataset?

Options:

- 0.15
- __0.515__
- 5.15
- 51.5

In [308]:
base_columns = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
predict_column = 'fuel_efficiency_mpg'
rmse_values = []
seed = 9
r = 0.001

df_train, df_test, df_val = get_splited_data(df= df, train_distribution=60, test_distribution=20, seed=seed)
df_train_validation = pd.concat([df_train, df_test])
rmse_val = get_rmse_with_linear_regression_reg(df_train_validation, df_test, base_columns, predict_column, '', r)
print(f"RMSE on the test dataset with r = {r} and seed = {seed}  is {rmse_val}")

RMSE on the test dataset with r = 0.001 and seed = 9  is 0.5207330053612106


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw02
* If your answer doesn't match options exactly, select the closest one