### Dataset

In this homework, we will use the Laptops price dataset from [Kaggle](https://www.kaggle.com/datasets/juanmerinobermejo/laptops-price-dataset).

Here's a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv):

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```

The goal of this homework is to create a regression model for predicting the prices (column `'Final Price'`).

### Preparing the dataset 

First, we'll normalize the names of the columns:

```python
df.columns = df.columns.str.lower().str.replace(' ', '_')
```

Now, instead of `'Final Price'`, we have `'final_price'`.

Next, use only the following columns:

* `'ram'`,
* `'storage'`,
* `'screen'`,
* `'final_price'`

### EDA

* Look at the `final_price` variable. Does it have a long tail? 
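
One way to check for a long tail is to compare the mean with the median and look at the skewness. The sketch below uses a synthetic, right-skewed series as a stand-in for `final_price` (the lognormal parameters are assumptions chosen for illustration, not values from the dataset).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for a price column (not the real data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=2000), name="final_price")

# A long right tail typically shows up as mean > median and positive skewness.
summary = {
    "mean": prices.mean(),
    "median": prices.median(),
    "skew": prices.skew(),
}
print(summary)

# log1p compresses the tail, so the transformed values are closer to symmetric.
log_prices = np.log1p(prices)
print(log_prices.skew())
```

On the real data, the same three numbers can be read from `df.final_price`, and a histogram (for example `df.final_price.hist(bins=50)`) gives a visual check.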


In [93]:
import pandas as pd
import numpy as np

In [94]:
df_laptops = pd.read_csv("laptops.csv")
df_laptops.head()

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | | 15.6 | No | 1009.0 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | | 15.6 | No | 299.0 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | | 15.6 | No | 789.0 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.0 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | | 15.6 | No | 669.01 |


In [95]:
df_laptops.columns = df_laptops.columns.str.lower().str.replace(' ', '_')
df_laptops.head()

| | laptop | status | brand | model | cpu | ram | storage | storage_type | gpu | screen | touch | final_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | | 15.6 | No | 1009.0 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | | 15.6 | No | 299.0 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | | 15.6 | No | 789.0 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.0 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | | 15.6 | No | 669.01 |


In [96]:
df = df_laptops[["ram", "storage", "screen", "final_price"]]
df.head()

| | ram | storage | screen | final_price |
|---|---|---|---|---|
| 0 | 8 | 512 | 15.6 | 1009.0 |
| 1 | 8 | 256 | 15.6 | 299.0 |
| 2 | 8 | 256 | 15.6 | 789.0 |
| 3 | 16 | 1000 | 15.6 | 1199.0 |
| 4 | 16 | 512 | 15.6 | 669.01 |


### Question 1

There's one column with missing values. What is it?

* `'ram'`
* `'storage'`
* `'screen'` <-
* `'final_price'`




In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160 entries, 0 to 2159
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ram          2160 non-null   int64  
 1   storage      2160 non-null   int64  
 2   screen       2156 non-null   float64
 3   final_price  2160 non-null   float64
dtypes: float64(2), int64(2)
memory usage: 67.6 KB


In [6]:
df.columns[df.isnull().any()]

Index(['screen'], dtype='object')

In [9]:
df.isnull().sum()

ram            0
storage        0
screen         4
final_price    0
dtype: int64

### Question 2

What's the median (50th percentile) of the variable `'ram'`?

- 8
- 16 <-
- 24
- 32



In [8]:
df.ram.describe()

count    2160.000000
mean       15.413889
std         9.867815
min         4.000000
25%         8.000000
50%        16.000000
75%        16.000000
max       128.000000
Name: ram, dtype: float64

In [26]:
df.ram.median()

16.0

### Prepare and split the dataset

* Shuffle the dataset (the filtered one you created above), use seed `42`.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.

Use the same code as in the lectures.
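
The two steps above can be sketched as a single helper. This is a minimal sketch assuming the lecture-style approach of shuffling an index array with `np.random.seed` and slicing positionally. The function name `shuffle_split` and the toy frame `demo` are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, seed=42, val_frac=0.2, test_frac=0.2):
    """Shuffle rows reproducibly, then slice into train/val/test."""
    n = len(df)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test

    # Shuffle row positions with a fixed seed
    np.random.seed(seed)
    idx = np.arange(n)
    np.random.shuffle(idx)

    # Apply the shuffled positions BEFORE slicing, otherwise the split
    # simply follows the original file order
    shuffled = df.iloc[idx]
    train = shuffled.iloc[:n_train].reset_index(drop=True)
    val = shuffled.iloc[n_train:n_train + n_val].reset_index(drop=True)
    test = shuffled.iloc[n_train + n_val:].reset_index(drop=True)
    return train, val, test

# Toy data standing in for the filtered laptops frame
demo = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})
train, val, test = shuffle_split(demo, seed=42)
print(len(train), len(val), len(test))
```

The key detail is that the shuffled index has to be applied with `df.iloc[idx]` before the positional slices are taken.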


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?

Options:

- With 0 <-
- With mean
- Both are equally good
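
The comparison described above can be sketched as follows. The data here is synthetic (the coefficients, noise level and number of missing values are assumptions), so the scores do not correspond to the laptops dataset. The point is the order of operations: split first, compute the mean on the training split, then fill both training and validation features with the same value.

```python
import numpy as np
import pandas as pd

def train_linear_regression(X, y):
    # Normal equation with a bias column, solved without explicit inversion
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

# Synthetic stand-in for the filtered dataset (assumed shapes and coefficients)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "ram": rng.choice([4, 8, 16, 32], size=n),
    "storage": rng.choice([256, 512, 1000], size=n),
    "screen": rng.choice([13.3, 14.0, 15.6, 17.3], size=n),
})
df["final_price"] = (
    40 * df["ram"] + 0.5 * df["storage"] + 20 * df["screen"] + rng.normal(0, 50, n)
)
df.loc[rng.choice(n, size=25, replace=False), "screen"] = np.nan

features = ["ram", "storage", "screen"]
# Rows are i.i.d. by construction, so a positional split is enough here
df_train, df_val = df.iloc[:300], df.iloc[300:400]

# Mean computed on the training split only
train_mean = df_train["screen"].mean()

scores = {}
for name, fill_value in [("zero", 0.0), ("mean", train_mean)]:
    X_train = df_train[features].fillna(fill_value).to_numpy()
    X_val = df_val[features].fillna(fill_value).to_numpy()
    w0, w = train_linear_regression(X_train, df_train["final_price"].to_numpy())
    y_pred = w0 + X_val @ w
    scores[name] = round(rmse(df_val["final_price"].to_numpy(), y_pred), 2)

print(scores)
```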




In [11]:
df1 = df.copy()
df2 = df.copy()

In [10]:
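# We have two options: fill it with 0 or with the mean of this variable.
# Note: this is the mean over the full dataset; the task asks for the mean of the training split only.
mean_screen = df.screen.mean()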
mean_screen

15.168112244897959

In [12]:
# Try both options. For each, train a linear regression model without regularization using the code from the lessons.
df1.screen = df.screen.fillna(value=0)
df2.screen = df.screen.fillna(value=mean_screen)

In [13]:
df1.isnull().sum()

ram            0
storage        0
screen         0
final_price    0
dtype: int64

In [14]:
df2.isnull().sum()

ram            0
storage        0
screen         0
final_price    0
dtype: int64

In [15]:
# Shuffle the dataset (the filtered one you created above), use seed `42`.
np.random.seed(42)

In [16]:
# Split your data in train/val/test sets, with 60%/20%/20% distribution
n = len(df)

n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

In [17]:
n_val, n_test, n_train

(432, 432, 1296)

In [18]:
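# Positional split of the filled frames. Note: the shuffled indices created in the
# next cell are never applied, so the rows stay in file order.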
df1_train = df1.iloc[:n_train]
df1_val = df1.iloc[n_train:n_train+n_val]
df1_test = df1.iloc[n_train+n_val:]

df2_train = df2.iloc[:n_train]
df2_val = df2.iloc[n_train:n_train+n_val]
df2_test = df2.iloc[n_train+n_val:]

In [19]:
idx1 = np.arange(n)
idx2 = np.arange(n)

np.random.shuffle(idx1)
np.random.shuffle(idx2)

In [20]:
df1_train.head()

| | ram | storage | screen | final_price |
|---|---|---|---|---|
| 0 | 8 | 512 | 15.6 | 1009.0 |
| 1 | 8 | 256 | 15.6 | 299.0 |
| 2 | 8 | 256 | 15.6 | 789.0 |
| 3 | 16 | 1000 | 15.6 | 1199.0 |
| 4 | 16 | 512 | 15.6 | 669.01 |


In [21]:
df2_train.head()

| | ram | storage | screen | final_price |
|---|---|---|---|---|
| 0 | 8 | 512 | 15.6 | 1009.0 |
| 1 | 8 | 256 | 15.6 | 299.0 |
| 2 | 8 | 256 | 15.6 | 789.0 |
| 3 | 16 | 1000 | 15.6 | 1199.0 |
| 4 | 16 | 512 | 15.6 | 669.01 |


In [22]:
len(df1_train), len(df1_val), len(df1_test)

(1296, 432, 432)

In [23]:
len(df2_train), len(df2_val), len(df2_test)

(1296, 432, 432)

In [24]:
df1_train = df1_train.reset_index(drop=True)
df1_val = df1_val.reset_index(drop=True)
df1_test = df1_test.reset_index(drop=True)

In [25]:
df2_train = df2_train.reset_index(drop=True)
df2_val = df2_val.reset_index(drop=True)
df2_test = df2_test.reset_index(drop=True)

In [27]:
# Feature columns and target (the features are numeric, despite the variable name)
categorical = ['ram', 'storage', 'screen']
target = 'final_price'

In [28]:
# DF with 0

df1_X_train = df1_train[categorical]
df1_Y_train = df1_train[target]
df1_X_test = df1_test[categorical]
df1_Y_test = df1_test[target]
df1_X_val = df1_val[categorical]
df1_Y_val = df1_val[target]

In [31]:
def linear_regression(X: np.ndarray, y: np.ndarray):
    # add column for the bias
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # compute optimal weights
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)

    return w[0], w[1:]

In [32]:
def predict_target(X: np.ndarray, b: float, w: np.ndarray):
    # b is the scalar bias term, w holds one weight per feature
    return b + X.dot(w)
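
As a sanity check on the normal-equation approach used in `linear_regression`, the weights can be compared with NumPy's least-squares solver. The sketch below uses synthetic data with assumed true coefficients. It is not part of the homework code.

```python
import numpy as np

# Synthetic regression problem with known coefficients (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Same steps as in linear_regression: prepend a bias column, then solve
# w = (X^T X)^{-1} X^T y
X_b = np.column_stack([np.ones(len(X)), X])
w_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

For well-conditioned feature matrices the two agree. `np.linalg.lstsq` or `np.linalg.solve` is usually the more numerically robust choice than forming the inverse explicitly.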

In [34]:
w0, w1 = linear_regression(df1_X_train, df1_Y_train)
# Note: predictions are made on the test split here, not on the validation split the task asks for
y1_pred = predict_target(X=df1_X_test, b=w0, w=w1)

In [35]:
# Use the validation dataset to evaluate the models and compare the RMSE of each option.
# Round the RMSE scores to 2 decimal digits using `round(score, 2)`
def rmse(y, y_pred):
    error = (y - y_pred)
    sqerror = np.square(error)
    mse = sqerror.mean()
    return round(np.sqrt(mse), 2)

In [36]:
rmse_val1 = rmse(df1_Y_test, y1_pred)
rmse_val1

671.2

In [37]:
# With df with mean
df2_X_train = df2_train[categorical]
df2_Y_train = df2_train[target]
df2_X_test = df2_test[categorical]
df2_Y_test = df2_test[target]
df2_X_val = df2_val[categorical]
df2_Y_val = df2_val[target]

In [38]:
w0, w1 = linear_regression(df2_X_train, df2_Y_train)
y2_pred = predict_target(X=df2_X_test, b=w0, w=w1)
rmse_val2 = rmse(df2_Y_test, y2_pred)
rmse_val2

671.71

### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.

Options:

- 0
- 0.01
- 1
- 10
- 100 <-
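
The selection rule above (lowest rounded validation RMSE, smallest `r` on ties) can be sketched like this. The data is synthetic and the helper name `train_ridge` is hypothetical. Like the notebook's own function, it adds `r` to the whole diagonal, which also regularizes the bias term.

```python
import numpy as np

def train_ridge(X, y, r):
    # Regularized normal equation: (X^T X + r I) w = X^T y
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w = np.linalg.solve(XTX, X.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

# Synthetic data with assumed coefficients (not the laptops dataset)
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = 100.0 + X @ np.array([30.0, 10.0, -5.0]) + rng.normal(scale=20.0, size=400)

X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

scores = {}
for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    w0, w = train_ridge(X_train, y_train, r)
    # Evaluate on the validation split, rounded to 2 decimals
    scores[r] = round(rmse(y_val, w0 + X_val @ w), 2)

# Sort key (score, r): lowest RMSE first, smallest r breaks ties
best_r = min(scores, key=lambda r: (scores[r], r))
print(scores)
print(best_r)
```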




In [124]:
df3 = df.copy()

In [125]:
# fill the NAs with 0. 
df3 = df3.fillna(value=0)

# Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
reg_list = [0, 0.01, 0.1, 1, 5, 10, 100]

In [127]:
def linear_regression_with_regularization(X: np.ndarray, y: np.ndarray, r: float):
    # add column for the bias
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # compute optimal weights
    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])

    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]

In [59]:
df3_train = df3.iloc[:n_train]
df3_val = df3.iloc[n_train:n_train+n_val]
df3_test = df3.iloc[n_train+n_val:]

In [128]:
idx3 = np.arange(n)
np.random.shuffle(idx3)

In [129]:
df_shuffled = df3.iloc[idx3].copy()
# print(df_shuffled)
df3_train = df_shuffled.iloc[:n_train].copy()
# print(df_train)
df3_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
# print(df_val)
df3_test = df_shuffled.iloc[n_train+n_val:].copy()

In [130]:
len(df3_train), len(df3_val), len(df3_test)

(1296, 432, 432)

In [131]:
df3_train = df3_train.reset_index(drop=True)
df3_val = df3_val.reset_index(drop=True)
df3_test = df3_test.reset_index(drop=True)

In [132]:
df3_X_train = df3_train[categorical]
df3_Y_train = df3_train[target]
df3_X_test = df3_test[categorical]
df3_Y_test = df3_test[target]
df3_X_val = df3_val[categorical]
df3_Y_val = df3_val[target]

In [133]:
result = []
for r in reg_list:
    w0, w1 = linear_regression_with_regularization(df3_X_train, df3_Y_train, r)
    # Note: scored on the test split here, not on the validation split the task asks for
    y_pred = predict_target(X=df3_X_test, b=w0, w=w1)
    rmse_val = rmse(df3_Y_test, y_pred)
    result.append(dict(r=r, rmse=rmse_val))

In [134]:
pd.DataFrame(result)

| | r | rmse |
|---|---|---|
| 0 | 0.0 | 584.06 |
| 1 | 0.01 | 584.06 |
| 2 | 0.1 | 584.0 |
| 3 | 1.0 | 583.49 |
| 4 | 5.0 | 582.26 |
| 5 | 10.0 | 581.69 |
| 6 | 100.0 | 581.26 |


### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

What's the value of std?

- 19.176
- 29.176 <-
- 39.176
- 49.176
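
The seed experiment above can be sketched end to end on synthetic data. All names here (`make_synthetic`, `split`, the coefficients) are assumptions for illustration. Note that missing values are filled with 0 in both the training and the validation features before predicting.

```python
import numpy as np
import pandas as pd

FEATURES = ["ram", "storage", "screen"]
TARGET = "final_price"

def make_synthetic(n=600, seed=0):
    # Synthetic stand-in for the filtered laptops frame, with some missing screens
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "ram": rng.choice([4, 8, 16, 32], size=n),
        "storage": rng.choice([256, 512, 1000], size=n),
        "screen": rng.choice([13.3, 14.0, 15.6, 17.3], size=n),
    })
    df[TARGET] = (
        40 * df["ram"] + 0.5 * df["storage"] + 20 * df["screen"] + rng.normal(0, 50, n)
    )
    df.loc[rng.choice(n, size=30, replace=False), "screen"] = np.nan
    return df

def split(df, seed):
    # 60%/20%/20% split after a seeded shuffle
    n = len(df)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    np.random.seed(seed)
    idx = np.arange(n)
    np.random.shuffle(idx)
    shuffled = df.iloc[idx].reset_index(drop=True)
    return (shuffled.iloc[:n_train],
            shuffled.iloc[n_train:n_train + n_val],
            shuffled.iloc[n_train + n_val:])

def train_linear_regression(X, y):
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

df = make_synthetic()
scores = []
for seed in range(10):
    train, val, _ = split(df, seed)
    # Fill missing values with 0 in BOTH splits
    X_train = train[FEATURES].fillna(0).to_numpy()
    X_val = val[FEATURES].fillna(0).to_numpy()
    w0, w = train_linear_regression(X_train, train[TARGET].to_numpy())
    scores.append(rmse(val[TARGET].to_numpy(), w0 + X_val @ w))

std = round(float(np.std(scores)), 3)
print(std)
```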

> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different. 
> If standard deviation of scores is low, then our model is *stable*.




In [116]:
def prepare_data(df, seed=42):
    """Shuffle `df` with the given seed and split it 60%/20%/20%.

    Returns the splits in the order (train, test, val).
    """
    n = len(df)

    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test

    # Shuffle the row positions reproducibly
    np.random.seed(seed)
    idx = np.arange(n)
    np.random.shuffle(idx)

    # Missing values are not filled here; the caller does that
    df_shuffled = df.iloc[idx].copy()
    df_train = df_shuffled.iloc[:n_train].copy()
    df_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
    df_test = df_shuffled.iloc[n_train+n_val:].copy()

    df_train.reset_index(drop=True, inplace=True)
    df_val.reset_index(drop=True, inplace=True)
    df_test.reset_index(drop=True, inplace=True)
    return df_train, df_test, df_val

In [117]:
def split_train_test_val(train, test, val, categorical, target):
    X_train = train[categorical]
    Y_train = train[target]
    X_test = test[categorical]
    Y_test = test[target]
    X_val = val[categorical]
    Y_val = val[target]
    return X_train, Y_train, X_test, Y_test, X_val, Y_val

In [76]:
def rmse(y, y_pred):
    error = (y - y_pred)
    sqerror = np.square(error)
    mse = sqerror.mean()
    return round(np.sqrt(mse), 2)

In [118]:
categorical = ['ram', 'storage', 'screen']
target = 'final_price'
dataset = df.copy()
listSeeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [119]:
rmse_scores = []
for seed in listSeeds:
    print(f"Testing with seed: {seed}")
    df_train, df_test, df_val = prepare_data(dataset, seed=seed)
    # Only the training split is filled; validation rows with a missing `screen`, if any,
    # produce NaN predictions, which the pandas mean inside rmse() skips
    df_train = df_train.fillna(0)
    X_train, Y_train, X_test, Y_test, X_val, Y_val = split_train_test_val(df_train, df_test, df_val, categorical, target)
    w0, w1 = linear_regression(X_train, Y_train)
    y_pred = predict_target(X=X_val, b=w0, w=w1)
    rmse_score = rmse(Y_val, y_pred)
    print("RMSE score: ", rmse_score)
    rmse_scores.append(rmse_score)

Testing with seed: 0
RMSE score:  565.97
Testing with seed: 1
RMSE score:  636.34
Testing with seed: 2
RMSE score:  588.96
Testing with seed: 3
RMSE score:  597.74
Testing with seed: 4
RMSE score:  571.96
Testing with seed: 5
RMSE score:  573.24
Testing with seed: 6
RMSE score:  647.25
Testing with seed: 7
RMSE score:  548.94
Testing with seed: 8
RMSE score:  587.33
Testing with seed: 9
RMSE score:  576.49


In [120]:
rmse_scores

[565.97,
 636.34,
 588.96,
 597.74,
 571.96,
 573.24,
 647.25,
 548.94,
 587.33,
 576.49]

In [121]:
round(np.std(rmse_scores), 3)

29.227

### Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?

Options:

- 598.60
- 608.60 <-
- 618.60
- 628.60
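
A minimal sketch of the procedure above on synthetic data: split with seed 9, concatenate train and validation, fill missing values with 0 in both the combined training set and the test set, train with `r=0.001`, and score on the test set. The data generator and helper names are hypothetical, so the printed score is unrelated to the options listed.

```python
import numpy as np
import pandas as pd

FEATURES = ["ram", "storage", "screen"]
TARGET = "final_price"

# Synthetic stand-in for the filtered laptops frame (assumed coefficients)
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "ram": rng.choice([4, 8, 16, 32], size=n),
    "storage": rng.choice([256, 512, 1000], size=n),
    "screen": rng.choice([13.3, 14.0, 15.6, 17.3], size=n),
})
df[TARGET] = 40 * df["ram"] + 0.5 * df["storage"] + 20 * df["screen"] + rng.normal(0, 50, n)
df.loc[rng.choice(n, size=30, replace=False), "screen"] = np.nan

def train_ridge(X, y, r):
    # Regularized normal equation; r is added to the full diagonal (bias included)
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

# Seeded shuffle and 60%/20%/20% split
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test
np.random.seed(9)
idx = np.arange(n)
np.random.shuffle(idx)
shuffled = df.iloc[idx]
train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]

# Combine train and validation, then fill missing values in both sets
full_train = pd.concat([train, val]).reset_index(drop=True)
X_full = full_train[FEATURES].fillna(0).to_numpy()
X_test = test[FEATURES].fillna(0).to_numpy()

w0, w = train_ridge(X_full, full_train[TARGET].to_numpy(), r=0.001)
score = rmse(test[TARGET].to_numpy(), w0 + X_test @ w)
print(round(score, 2))
```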

In [122]:
categorical = ['ram', 'storage', 'screen']
target = 'final_price'
dataset2 = df.copy()

In [123]:
seed = 9
print(f"Testing with seed: {seed}")
df_train, df_test, df_val = prepare_data(dataset2, seed=seed)
# Combine train and validation into one training set
df_full = pd.concat([df_train, df_val]).reset_index(drop=True)
# Only the combined training set is filled; test rows with a missing `screen`, if any,
# produce NaN predictions, which the pandas mean inside rmse() skips
df_full = df_full.fillna(0)
X_train, Y_train, X_test, Y_test, X_val, Y_val = split_train_test_val(df_full, df_test, df_val, categorical, target)
w0, w1 = linear_regression_with_regularization(X_train, Y_train, r=0.001)
y_pred = predict_target(X=X_test, b=w0, w=w1)
rmse_score = rmse(Y_test, y_pred)
print("RMSE score: ", rmse_score)

Testing with seed: 9
RMSE score:  608.3
