## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.


### Dataset

In this homework, we will use the Laptops price dataset from [Kaggle](https://www.kaggle.com/datasets/juanmerinobermejo/laptops-price-dataset).

Here's a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv):

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```

The goal of this homework is to create a regression model for predicting the prices (column `'Final Price'`).

In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv

--2024-10-05 16:24:51--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 298573 (292K) [text/plain]
Saving to: ‘laptops.csv.1’


2024-10-05 16:24:52 (2.70 MB/s) - ‘laptops.csv.1’ saved [298573/298573]



### Preparing the dataset 

First, we'll normalize the names of the columns:

```python
df.columns = df.columns.str.lower().str.replace(' ', '_')
```

Now, instead of `'Final Price'`, we have `'final_price'`.

Next, use only the following columns:

* `'ram'`,
* `'storage'`,
* `'screen'`,
* `'final_price'`

### EDA

* Look at the `final_price` variable. Does it have a long tail? 

In [2]:
import pandas as pd

CSV_FILENAME = 'laptops.csv'
df = pd.read_csv(CSV_FILENAME)

df.columns = df.columns.str.lower().str.replace(' ', '_')

TARGET_COLUMN = 'final_price'
TRAINING_COLUMNS = [
    'ram',
    'storage',
    'screen'
] 

USEFUL_COLUMNS = TRAINING_COLUMNS[:]
USEFUL_COLUMNS.append(TARGET_COLUMN)
df = df[USEFUL_COLUMNS]
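
One way to answer the long-tail question numerically is to compare the mean with the median and to look at the skewness: a long right tail pulls the mean above the median and gives a positive skew. The sketch below uses a synthetic, made-up `prices` series so that it runs on its own; in the notebook you would presumably use `df['final_price']` instead.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" for illustration only (not the laptops data).
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=1000))

# A long right tail pulls the mean above the median and gives positive skewness.
print("mean:  ", prices.mean())
print("median:", prices.median())
print("skew:  ", prices.skew())

# A log transform usually makes such a distribution more symmetric.
log_prices = np.log1p(prices)
print("skew after log1p:", log_prices.skew())
```

A histogram (for example `prices.hist(bins=50)`) gives the same information visually.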

### Question 1

There's one column with missing values. What is it?

* `'ram'`
* `'storage'`
* **`'screen'`** (answer)
* `'final_price'`

In [3]:
df.isna().sum()

ram            0
storage        0
screen         4
final_price    0
dtype: int64

### Question 2

What's the median (50th percentile) of the variable `'ram'`?

- 8
- **16** (answer)
- 24
- 32

In [4]:
df['ram'].describe()

count    2160.000000
mean       15.413889
std         9.867815
min         4.000000
25%         8.000000
50%        16.000000
75%        16.000000
max       128.000000
Name: ram, dtype: float64

In [5]:
df['ram'].median()

16.0

In [6]:
df['ram'].quantile(q=0.5)

16.0

### Prepare and split the dataset

* Shuffle the dataset (the filtered one you created above), use seed `42`.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.

Use the same code as in the lectures.

In [7]:
import numpy as np


n = len(df)

n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_train+n_val]]
df_test = df.iloc[idx[n_train+n_val:]]

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`.
* Which option gives better RMSE?

Options:

- **With 0** (answer)
- With mean
- Both are equally good

In [8]:
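def train_linear_regression(X, y):
    # Prepend a column of ones so the first weight acts as the bias term.
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Solve the normal equation: w = (X^T X)^-1 X^T y.
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    # Return the bias and the feature weights separately.
    return w_full[0], w_full[1:]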

In [9]:
def rmse(y, y_pred):
    se = (y - y_pred) ** 2
    mse = se.mean()
    return np.sqrt(mse)

In [10]:
def predict(w0, w, X):
    return w0 + X.dot(w)

In [11]:
# fillna returns a new DataFrame, so df_train itself is left unchanged.
df1_train = df_train.fillna(0)

X1_train = np.array(df1_train[TRAINING_COLUMNS])
y1_train = np.array(df1_train[TARGET_COLUMN])

w0_1, w_1 = train_linear_regression(X1_train, y1_train)

df1_val = df_val.fillna(0)

X1_val = np.array(df1_val[TRAINING_COLUMNS])
y1_val = np.array(df1_val[TARGET_COLUMN])

y1_pred = predict(w0_1, w_1, X1_val)
rmse_1 = rmse(y1_val, y1_pred)

print("rmse_1:", rmse_1)

rmse_1: 597.3635593619621


In [12]:
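# Compute the fill values on the training set only, as the task requires.
train_mean = df_train[TRAINING_COLUMNS].mean()

df2_train = df_train.fillna(train_mean)

X2_train = np.array(df2_train[TRAINING_COLUMNS])
y2_train = np.array(df2_train[TARGET_COLUMN])

w0_2, w_2 = train_linear_regression(X2_train, y2_train)

# Reuse the training mean for validation so no validation data leaks in.
df2_val = df_val.fillna(train_mean)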

X2_val = np.array(df2_val[TRAINING_COLUMNS])
y2_val = np.array(df2_val[TARGET_COLUMN])

y2_pred = predict(w0_2, w_2, X2_val)
rmse_2 = rmse(y2_val, y2_pred)

print("rmse_2:", rmse_2)

rmse_2: 600.2659410617158
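
The "training only" rule from the task can be sketched on toy data: the fill value is computed once from the training part and then reused for every other split. The frames `toy_train` and `toy_val` below are made up for illustration; only the column name mirrors the homework.

```python
import pandas as pd

# Toy frames for illustration only; the column name mirrors the homework.
toy_train = pd.DataFrame({"screen": [14.0, 15.6, None, 17.3]})
toy_val = pd.DataFrame({"screen": [None, 13.3]})

# The statistic is computed on the training part only ...
screen_mean = toy_train["screen"].mean()

# ... and then reused for every split, so validation data never leaks in.
toy_train_filled = toy_train.fillna({"screen": screen_mean})
toy_val_filled = toy_val.fillna({"screen": screen_mean})

print(screen_mean)
print(toy_val_filled)
```

Computing the mean on the full dataset instead would let information from the validation and test rows influence the model inputs.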


### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.

Options:

- 0
- 0.01
- 1
- **10** (answer)
- 100

In [13]:
def train_linear_regression_reg(X, y, r=0.001):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])

    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]
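
In formula form, and assuming $X$ already includes the leading column of ones, the function above computes:

$$
w = (X^T X + r I)^{-1} X^T y
$$

With $r = 0$ this reduces to the plain normal equation used in `train_linear_regression`. Note that `np.eye` spans the whole matrix here, so the bias term is regularized together with the feature weights.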

In [14]:
R_VALUES = [0, 0.01, 0.1, 1, 10, 100]
RMSE_VALUES = []

for r_value in R_VALUES:
    w0_temp, w_temp = train_linear_regression_reg(X1_train, y1_train, r=r_value)
    y_pred_temp = predict(w0_temp, w_temp, X1_val)
    rmse_temp = rmse(y1_val, y_pred_temp)
    rounded_rmse_temp = round(rmse_temp, 2)
    print(f"r_value: {r_value} \t rmse: {rounded_rmse_temp}")
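    # Store the rounded score so that ties are resolved as the question asks.
    RMSE_VALUES.append(rounded_rmse_temp)

# list.index returns the first match, i.e. the smallest r among tied scores.
min_rmse = min(RMSE_VALUES)
min_index = RMSE_VALUES.index(min_rmse)
min_r_value = R_VALUES[min_index]
print("min_rmse:", round(min_rmse, 2), "\t", "r_value:", min_r_value)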


r_value: 0 	 rmse: 597.36
r_value: 0.01 	 rmse: 597.36
r_value: 0.1 	 rmse: 597.35
r_value: 1 	 rmse: 597.21
r_value: 10 	 rmse: 597.06
r_value: 100 	 rmse: 597.9
min_rmse: 597.06 	 r_value: 10
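
The rule "if there are multiple options, select the smallest `r`" only matters when several rounded scores tie. In the run above the best rounded score is unique, so it does not change the answer, but the rule itself can be sketched as follows. The `scores` dictionary holds hypothetical values chosen to produce a tie; they are not results from this homework.

```python
# Hypothetical (r, RMSE) pairs for illustration only; not the homework's results.
scores = {0: 10.004, 0.01: 10.003, 0.1: 10.001, 1: 10.2, 10: 10.9}

# Round first, as the question asks, then find the best rounded score.
rounded = {r: round(score, 2) for r, score in scores.items()}
best_score = min(rounded.values())

# Among all r values that reach the best rounded score, keep the smallest one.
best_r = min(r for r, score in rounded.items() if score == best_score)

print(best_r, best_score)
```

Taking the minimum over the unrounded scores would pick `0.1` in this toy example, whereas the rounded comparison picks `0`.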


### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

What's the value of std?

- 19.176
- **29.176** (answer)
- 39.176
- 49.176

> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different. 
> If standard deviation of scores is low, then our model is *stable*.
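
As a small illustration of the note above, the sketch below compares two made-up sets of RMSE scores, one tightly clustered and one widely spread. The numbers are invented for the example and are not results from this homework.

```python
import numpy as np

# Made-up RMSE scores for illustration only.
stable_scores = np.array([600.1, 599.8, 600.3, 599.9])
unstable_scores = np.array([560.0, 640.0, 590.0, 610.0])

# np.std uses the population formula (ddof=0) by default.
stable_std = np.std(stable_scores)
unstable_std = np.std(unstable_scores)

print("stable:  ", round(stable_std, 3))
print("unstable:", round(unstable_std, 3))
```

Keep in mind that `np.std` defaults to `ddof=0`, while the pandas `Series.std` method defaults to `ddof=1`, so the two give slightly different values on the same scores.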


In [15]:
def generate_dataset(df, seed=42):
    n = len(df)

    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    
    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    
    df_train = df.iloc[idx[:n_train]]
    df_val = df.iloc[idx[n_train:n_train+n_val]]
    df_test = df.iloc[idx[n_train+n_val:]]
    
    df_train = df_train.reset_index(drop=True)
    df_val = df_val.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)

    return df_train, df_val, df_test

In [16]:
SEED_VALUES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
RMSE_VALUES = []

for seed in SEED_VALUES:
    df_train_temp, df_val_temp, df_test_temp = generate_dataset(df, seed=seed)
    
    df_train_temp = df_train_temp.fillna(0)
    df_val_temp = df_val_temp.fillna(0)
    
    X_train_temp = np.array(df_train_temp[TRAINING_COLUMNS])
    y_train_temp = np.array(df_train_temp[TARGET_COLUMN])
    X_val_temp = np.array(df_val_temp[TRAINING_COLUMNS])
    y_val_temp = np.array(df_val_temp[TARGET_COLUMN])
    
    w0_temp, w_temp = train_linear_regression(X_train_temp, y_train_temp)
    y_pred_temp = predict(w0_temp, w_temp, X_val_temp)

    rmse_temp = rmse(y_val_temp, y_pred_temp)
    RMSE_VALUES.append(rmse_temp)

rmse_array = np.array(RMSE_VALUES)
std_value = np.std(rmse_array)

print("std_value:", round(std_value, 3))
    

std_value: 29.176


### Question 6

* Split the dataset as before, using seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?

Options:

- 598.60
- **608.60** (answer)
- 618.60
- 628.60



In [17]:
df_train_temp, df_val_temp, df_test_temp = generate_dataset(df, seed=9)

# Train on train + validation combined; the held-out test split gives the final score.
df_train_full = pd.concat([df_train_temp, df_val_temp], ignore_index=True)
df_test_full = df_test_temp

df_train_full = df_train_full.fillna(0)
df_test_full = df_test_full.fillna(0)

X_train_full = np.array(df_train_full[TRAINING_COLUMNS])
y_train_full = np.array(df_train_full[TARGET_COLUMN])
X_test_full = np.array(df_test_full[TRAINING_COLUMNS])
y_test_full = np.array(df_test_full[TARGET_COLUMN])

w0_full, w_full = train_linear_regression_reg(X_train_full, y_train_full, r=0.001)
y_pred_full = predict(w0_full, w_full, X_test_full)

rmse_full = rmse(y_test_full, y_pred_full)
print("rmse_full:", rmse_full)


rmse_full: 608.6099822049559


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw02
* If your answer doesn't match the options exactly, select the closest one.
