## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.


### Dataset

In this homework, we will use the Laptops price dataset from [Kaggle](https://www.kaggle.com/datasets/juanmerinobermejo/laptops-price-dataset).

Here's a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv):

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```

The goal of this homework is to create a regression model for predicting the prices (column `'Final Price'`).

### Preparing the dataset 

First, we'll normalize the names of the columns:

```python
df.columns = df.columns.str.lower().str.replace(' ', '_')
```

Now, instead of `'Final Price'`, we have `'final_price'`.
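
As a quick check of what this normalization does, here is a minimal sketch on a toy frame. Only `Final Price` is a real column name from the dataset; `Screen Size` and the values are made up for illustration.

```python
import pandas as pd

# Toy frame: 'Final Price' comes from the dataset, 'Screen Size' is a hypothetical name
toy = pd.DataFrame({"Final Price": [1009.0, 299.0], "Screen Size": [15.6, 15.6]})

# Lowercase every column name and replace spaces with underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

print(list(toy.columns))  # ['final_price', 'screen_size']
```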

Next, use only the following columns:

* `'ram'`,
* `'storage'`,
* `'screen'`,
* `'final_price'`

### EDA

* Look at the `final_price` variable. Does it have a long tail? 
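
One way to check for a long tail without plotting is to compare the mean with the median and to look at the sample skewness. The sketch below uses synthetic log-normal prices as a stand-in for the real column (the distribution parameters are an assumption, not the laptops data). On the real dataset you would pass `df['final_price'].values` instead.

```python
import numpy as np

# Synthetic right-skewed "prices": a stand-in for df['final_price'].values
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=7.0, sigma=0.6, size=1000)


def skewness(x):
    """Third standardized moment; a positive value means a long right tail."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


skew_raw = skewness(prices)
skew_log = skewness(np.log1p(prices))

print(f"mean={prices.mean():.1f}, median={np.median(prices):.1f}")
print(f"skewness raw={skew_raw:.2f}, after log1p={skew_log:.2f}")
```

A mean well above the median and a clearly positive skewness both point to a long right tail. Applying `np.log1p` is a common way to reduce it.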



### Question 1

There's one column with missing values. What is it?

* `'ram'`
* `'storage'`
* `'screen'`
* `'final_price'`


### Question 2

What's the median (50th percentile) of the variable `'ram'`?

- 8
- 16
- 24
- 32

### Prepare and split the dataset

* Shuffle the dataset (the filtered one you created above) using seed `42`.
* Split your data into train/val/test sets with a 60%/20%/20% distribution.

Use the same code as in the lectures.


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?

Options:

- With 0
- With mean
- Both are equally good


### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If several values of `r` give the same best RMSE, select the smallest one.

Options:

- 0
- 0.01
- 1
- 10
- 100


### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

What's the value of std?

- 19.176
- 29.176
- 39.176
- 49.176

> Note: The standard deviation shows how spread out the values are.
> If it's low, the values are all approximately the same.
> If it's high, the values differ a lot.
> If the standard deviation of the scores is low, our model is *stable*.


### Question 6

* Split the dataset as before, using seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?

Options:

- 598.60
- 608.60
- 618.60
- 628.60

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw02
* If your answer doesn't match the options exactly, select the closest one.


In [51]:
# Download the dataset
import requests
import numpy as np

url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the content to a CSV file
with open("laptops.csv", "wb") as file:
    file.write(response.content)

print("File downloaded successfully")


File downloaded successfully


In [5]:
import pandas as pd
df = pd.read_csv('laptops.csv')

In [21]:
# Preparing the dataset
# First, we'll normalize the names of the columns:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df[["ram", "storage", "screen", "final_price"]]
df


      ram  storage  screen  final_price
0       8      512    15.6      1009.00
1       8      256    15.6       299.00
2       8      256    15.6       789.00
3      16     1000    15.6      1199.00
4      16      512    15.6       669.01
...   ...      ...     ...          ...
2155   16     1000    17.3      2699.99
2156   16     1000    17.3      2899.99
2157   32     1000    17.3      3399.99
2158   16     1000    13.4      1899.99


In [23]:
# Question 1. There's one column with missing values. What is it?
df.columns[df.isnull().sum() > 0].tolist()

['screen']

In [25]:
# Question 2. What's the median (50% percentile) for variable 'ram'?
int(df["ram"].median())

16

In [31]:
# Prepare and split the dataset
n = len(df)

n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_train+n_val]]
df_test = df.iloc[idx[n_train+n_val:]]

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.final_price.values
y_val = df_val.final_price.values
y_test = df_test.final_price.values

del df_train['final_price']
del df_val['final_price']
del df_test['final_price']

In [37]:
# Question 3
# We need to deal with missing values for the column from Q1.
# We have two options: fill it with 0 or with the mean of this variable.
# Try both options. For each, train a linear regression model without
# regularization using the code from the lessons.
# For computing the mean, use the training set only!
# Use the validation dataset to evaluate the models and compare the RMSE of each option.
# Round the RMSE scores to 2 decimal digits using round(score, 2).
# Which option gives better RMSE?
# Options:
#   - With 0
#   - With mean
#   - Both are equally good

def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]


def rmse(y, y_pred):
    se = (y - y_pred) ** 2
    mse = se.mean()
    return np.sqrt(mse)
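
`train_linear_regression` prepends a column of ones for the bias term and then solves the normal equation, `w = inv(X.T @ X) @ X.T @ y`. As a sanity check, the sketch below repeats the same computation on made-up, noise-free data and compares it with `np.linalg.lstsq`. The function name `normal_equation`, the toy matrix, and the true weights are assumptions for illustration only.

```python
import numpy as np


def normal_equation(X, y):
    # Same steps as train_linear_regression: add a bias column, then solve
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    w_full = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w_full[0], w_full[1:]


# Made-up data: y = 3 + 2*x1 - 1*x2, with no noise
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 + X_toy.dot(np.array([2.0, -1.0]))

w0, w = normal_equation(X_toy, y_toy)

# Reference solution from NumPy's least-squares solver
X_ref = np.column_stack([np.ones(len(X_toy)), X_toy])
w_ref, *_ = np.linalg.lstsq(X_ref, y_toy, rcond=None)

print(w0, w)
print(np.allclose(np.concatenate([[w0], w]), w_ref))
```

Because the toy targets contain no noise, the recovered bias and weights should match the values used to generate them up to floating-point error.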


In [39]:
rows_with_missing_values = df_train[df_train.isnull().any(axis=1)]
print(rows_with_missing_values)
indexes_of_rows_with_missing_values = rows_with_missing_values.index.tolist()

      ram  storage  screen
124    16      512     NaN
1062   16      512     NaN
1238    8      256     NaN


In [41]:
# filling missing values with 0

def prepare_X(df):
    df_num = df.copy()
    return df_num.fillna(0).values

X_train = prepare_X(df_train)

print("Rows with missing values after filling them:")
print(X_train[indexes_of_rows_with_missing_values])

w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse_with_zero = round(rmse(y_val, y_pred), 2).item()
print(f"\nRMSE: {rmse_with_zero}")

Rows with missing values after filling them:
[[ 16. 512.   0.]
 [ 16. 512.   0.]
 [  8. 256.   0.]]

RMSE: 597.36


In [43]:
# filling missing values with mean

df_train_screen_mean = df_train["screen"].mean()

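def prepare_X(df):
    df_num = df.copy()
    # Only 'screen' has missing values (see Question 1); fill it with the training mean
    df_num["screen"] = df_num["screen"].fillna(df_train_screen_mean)
    return df_num.values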

X_train = prepare_X(df_train)

print("Rows with missing values after filling them:")
print(X_train[indexes_of_rows_with_missing_values])

w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse_with_mean = round(rmse(y_val, y_pred), 2).item()
print(f"\nRMSE: {rmse_with_mean}")

Rows with missing values after filling them:
[[ 16.         512.          15.16353442]
 [ 16.         512.          15.16353442]
 [  8.         256.          15.16353442]]

RMSE: 600.27


In [45]:
print(
    "With 0"
    if rmse_with_zero < rmse_with_mean
    else "With mean"
    if rmse_with_zero > rmse_with_mean
    else "Both are equally good"
)

With 0


### Question 4

In [55]:
def train_linear_regression_reg(X, y, r=0.001):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])

    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]

def prepare_X(df):
    df_num = df.copy()
    return df_num.fillna(0).values
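
The only change from the unregularized version is the term `r * np.eye(...)` added to the diagonal of `XTX`. Here is a small sketch of why that helps, assuming a made-up feature matrix with a duplicated column (not the laptops data). Without the extra term `XTX` is singular. With it, the condition number becomes finite.

```python
import numpy as np

# Made-up features: the second column is an exact copy of the first
x = np.array([1.0, 2.0, 3.0, 4.0])
X_dup = np.column_stack([x, x])

XTX = X_dup.T.dot(X_dup)

r = 0.01
XTX_reg = XTX + r * np.eye(XTX.shape[0])

# Condition number: ratio of the largest to the smallest singular value
cond_plain = np.linalg.cond(XTX)
cond_reg = np.linalg.cond(XTX_reg)

print(f"cond without regularization: {cond_plain:.3g}")
print(f"cond with r={r}: {cond_reg:.3g}")
```

Note that in `train_linear_regression_reg` the identity matrix also covers the bias entry, so the intercept is shrunk slightly along with the other weights.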

In [57]:
# Only the answer options are evaluated here; 0.1 and 5 from the full list are skipped
rs = [0, 0.01, 1, 10, 100]
scores = []

for r in rs:
    X_train = prepare_X(df_train)
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)

    X_val = prepare_X(df_val)
    y_pred = w0 + X_val.dot(w)
    scores.append(rmse(y_val, y_pred))

print(round(pd.DataFrame({"r": rs, "score": scores}), 2))

        r   score
0    0.00  597.36
1    0.01  597.36
2    1.00  597.21
3   10.00  597.06
4  100.00  597.90


In [59]:
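# Round before picking the minimum; idxmin returns the first match, i.e. the smallest r on ties
r_scores = pd.DataFrame({"r": rs, "score": scores})
r_scores.loc[r_scores["score"].round(2).idxmin()]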

r         10.000000
score    597.058768
Name: 3, dtype: float64

### Question 5 


In [62]:
def split_data(df, seed):
    n = len(df)

    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    
    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    
    df_train = df.iloc[idx[:n_train]]
    df_val = df.iloc[idx[n_train:n_train+n_val]]
    df_test = df.iloc[idx[n_train+n_val:]]
    
    
    df_train = df_train.reset_index(drop=True)
    df_val = df_val.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)
    
    y_train = df_train.final_price.values
    y_val = df_val.final_price.values
    y_test = df_test.final_price.values
    
    del df_train['final_price']
    del df_val['final_price']
    del df_test['final_price']

    return df_train, df_val, df_test, y_train, y_val, y_test

In [64]:
def prepare_X(df):
    df_num = df.copy()
    return df_num.fillna(0).values

In [66]:
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
scores = []

for seed in seeds:
    df_train, df_val, df_test, y_train, y_val, y_test = split_data(df, seed)
    
    X_train = prepare_X(df_train)
    w0, w = train_linear_regression(X_train, y_train)

    X_val = prepare_X(df_val)
    y_pred = w0 + X_val.dot(w)
    scores.append(rmse(y_val, y_pred))

print(round(pd.DataFrame({"seed": seeds, "score": scores}), 2))

   seed   score
0     0  565.45
1     1  636.80
2     2  588.96
3     3  597.81
4     4  571.96
5     5  573.24
6     6  647.34
7     7  550.44
8     8  587.33
9     9  576.10


In [68]:
round(np.std(scores), 3).item()

29.176
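
`np.std` computes the population standard deviation by default (`ddof=0`), which is what the question asks for. The short sketch below shows how it differs from the sample standard deviation (`ddof=1`). It takes the rounded scores printed above as input, so the last digits may differ slightly from the unrounded result.

```python
import numpy as np

# Rounded validation RMSEs for seeds 0..9, copied from the table above
scores_rounded = [565.45, 636.80, 588.96, 597.81, 571.96,
                  573.24, 647.34, 550.44, 587.33, 576.10]

std_population = np.std(scores_rounded)      # ddof=0: divides by n
std_sample = np.std(scores_rounded, ddof=1)  # ddof=1: divides by n - 1

print(round(std_population, 3), round(std_sample, 3))
```

With only ten scores the two estimates differ by a factor of `sqrt(10 / 9)`, which is enough to move the answer away from the listed options if `ddof=1` is used by mistake.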

### Question 6

In [71]:
def prepare_X(df):
    df_num = df.copy()
    return df_num.fillna(0).values

In [73]:
df_train, df_val, df_test, y_train, y_val, y_test = split_data(df, 9)
df_full_train = pd.concat([df_train, df_val])
df_full_train = df_full_train.reset_index(drop=True)
X_full_train = prepare_X(df_full_train)
y_full_train = np.concatenate([y_train, y_val])
w0, w = train_linear_regression_reg(X_full_train, y_full_train, r=0.001)
X_test = prepare_X(df_test)
y_pred = w0 + X_test.dot(w)
score = rmse(y_test, y_pred)
score.item()

608.6099822049559

