# <center><b> Laptop Price Prediction </b></center>

### Download Datasets
This data project has been used as a take-home assignment in the recruitment process for the data science positions at Allegro.

### Assignment
Your task is to define and train a machine learning model for predicting the price of a laptop (buynow_price column in the dataset) based on its attributes. When testing and comparing your models, aim to minimize the RMSE measure.

### Data Description
The dataset has already been randomly divided into the training, validation and test sets. 

It is stored in 3 files: `train_dataset.json`, `val_dataset.json` and `test_dataset.json` respectively. Each file is JSON saved in orient=’columns’ format.

Example how to load the data:
```
import pandas as pd
dataset = pd.read_json("public-dataset.json")
dataset.columns

Index(['buynow_price', 'graphic card type', 'communications', 'resolution (px)', 'CPU cores', 'RAM size', 'operating system', 'drive type', 'input devices', 'multimedia', 'RAM type', 'CPU clock speed (GHz)', 'CPU model', 'state', 'drive memory size (GB)', 'warranty', 'screen size'], dtype='object')
```

## Practicalities
Prepare a model in Jupyter Notebook using Python. Only use the training data for training the model and check the model's performance on unseen data using the test dataset to make sure it does not overfit.

`Ensure that the notebook reflects your thought process. It’s better to show all the approaches, not only the final one (e.g. if you tested several models, you can show all of them). The path to obtaining the final model should be clearly shown.`



In [88]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [101]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder


In [96]:


# View all columns in the dataset
pd.set_option('display.max_columns', None)

def check_null(df):
    null_val = pd.DataFrame()
    
    Null_Count = pd.DataFrame(df.isnull().sum().sort_values(ascending=False), 
                              columns=['Null_Count'])
    
    Null_Percentage = pd.DataFrame(round((df.isnull().sum()/len(df))*100, 2).sort_values(ascending=False),
                                   columns=['Null_Percentage'])
    
    return pd.concat([Null_Count, Null_Percentage], axis = 1)

### Reading the Data

In [97]:
df = pd.read_json("train_dataset.json")
df.head(2)

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,new,1250.0,producer warranty,"17"" - 17.9""",4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2649.0


### Data Exploration

In [65]:
df['resolution (px)'].value_counts()   

resolution (px)
1920 x 1080    2765
1366 x 768     1234
1600 x 900      221
3840 x 2160      84
2560 x 1440      18
1920 x 1280      18
1280 x 800        7
3200 x 1800       6
2880 x 1620       3
2160 x 1440       2
1920 x 1200       2
other             1
Name: count, dtype: int64

In [15]:
df['graphic card type'].value_counts()

graphic card type
dedicated graphics     2605
integrated graphics    1812
Name: count, dtype: int64

In [18]:
df['communications'].value_counts()

communications
[wi-fi, bluetooth, lan 10/100/1000 mbps]                                                                                              1128
[bluetooth, lan 10/100 mbps]                                                                                                           656
[bluetooth, lan 10/100/1000 mbps, lan 10/100 mbps, intel wireless display (widi), nfc (near field communication), modem 3g (wwan)]     429
[bluetooth, lan 10/100/1000 mbps]                                                                                                      395
[wi-fi, bluetooth, lan 10/100 mbps]                                                                                                    306
                                                                                                                                      ... 
[wi-fi 802.11 a/b/g/n/ac, wi-fi 802.11 b/g/n, wi-fi 802.11 a/b/g/n, bluetooth, lan 10/100 mbps]                                          1
[lan 10/100/

In [19]:
df['drive type'].value_counts()

drive type
ssd          2268
hdd          1352
ssd + hdd     720
hybrid         59
emmc           55
Name: count, dtype: int64

In [20]:
df['input devices'].value_counts()

input devices
[keyboard, touchpad, numeric keyboard]                          1389
[keyboard, touchpad]                                            1136
[keyboard, touchpad, illuminated keyboard]                       869
[keyboard, touchpad, illuminated keyboard, numeric keyboard]     684
[touchpad]                                                       193
[touchpad, illuminated keyboard]                                  23
[touchpad, illuminated keyboard, numeric keyboard]                10
[illuminated keyboard]                                            10
[keyboard]                                                         3
[keyboard, numeric keyboard]                                       2
[touchpad, numeric keyboard]                                       1
[keyboard, illuminated keyboard]                                   1
Name: count, dtype: int64

In [21]:
df['multimedia'].value_counts()

multimedia
[SD card reader, camera, speakers, microphone]    3508
[SD card reader, camera, microphone]               259
[camera, speakers, microphone]                     232
[SD card reader, camera, speakers]                 225
[SD card reader]                                    34
[SD card reader, camera]                            19
[SD card reader, speakers, microphone]              18
[camera, microphone]                                 4
[speakers]                                           3
[SD card reader, speakers]                           3
[microphone]                                         2
[SD card reader, microphone]                         1
[speakers, microphone]                               1
[camera, speakers]                                   1
Name: count, dtype: int64

In [22]:
df['RAM type'].value_counts()

RAM type
ddr4     2742
ddr3     1040
ddr3l     430
Name: count, dtype: int64

In [23]:
df['CPU model'].value_counts()

CPU model
intel core i5               1583
intel core i7               1400
intel core i3                812
amd a6                       127
intel pentium dual-core      105
intel celeron dual-core       92
intel celeron                 76
intel pentium quad-core       53
other CPU                     43
amd a8                        23
amd a10                       15
intel celeron quad core       12
amd a12                       11
intel pentium 4               10
amd e1                        10
intel core m                   7
amd a4                         5
intel celeron m                5
Name: count, dtype: int64

### Preprocessing the data

In [117]:
def preprocessing_df(df):
    
    # Replace missing values with NaN explicitly
    df = df.fillna(np.nan)
    
    # Handle null values in RAM_size extraction
    df['RAM size'] = df['RAM size'].apply(lambda x: int(x.split(" ")[0]) if pd.notnull(x) else None)

    # Ensure proper integer conversion in CPU cores 
    df['CPU cores'] = df['CPU cores'].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int)

    def has_feature(feature_name):
        def extract_feature(x):
            if x is None:  # Explicitly check for None 
                return 0
            elif isinstance(x, str):  
                return 1 if feature_name.lower() in x.lower() else 0
            elif isinstance(x, list):  
                return 1 if any(feature_name.lower() in item.lower() for item in x) else 0 
            else:
                try:
                    text = str(x)
                    return 1 if feature_name.lower() in text.lower() else 0
                except:
                    return 0 
        return extract_feature

    df['operating system'] = df['operating system']

    df['state'] = df['state'].apply(lambda x: 1 if x == 'new' else 0)

    # Create one-hot encoded features
    df['has_touchpad'] = df['input devices'].apply(has_feature('touchpad'))
    df['has_keyboard'] = df['input devices'].apply(has_feature('keyboard'))
    df['has_numeric_keyboard'] = df['input devices'].apply(has_feature('numeric keyboard'))
    df['has_illuminated_keyboard'] = df['input devices'].apply(has_feature('illuminated keyboard'))
    df.drop('input devices', axis=1, inplace=True)

    # Create multimedia features
    df['has_speakers'] = df['multimedia'].apply(has_feature('speaker'))
    df['has_SD_card_reader'] = df['multimedia'].apply(has_feature('SD card'))
    df['has_microphone'] = df['multimedia'].apply(has_feature('microphone'))
    df['has_camera'] = df['multimedia'].apply(has_feature('camera'))
    df.drop('multimedia', axis=1, inplace=True)

    # Create communications features
    df['has_modem_4g'] = df['communications'].apply(has_feature('modem 4g'))
    df['has__bluetooth'] = df['communications'].apply(has_feature('bluetooth'))
    df['has_lan'] = df['communications'].apply(has_feature('lan'))
    df['has_modem_3g'] = df['communications'].apply(has_feature('modem 3g'))
    df['has_wifi'] = df['communications'].apply(has_feature('wifi'))
    df['has_widi'] = df['communications'].apply(has_feature('widi'))
    df.drop('communications', axis=1, inplace=True)

    # Create features for Operating System
    df['has_win10_home'] = df['operating system'].apply(has_feature('windows 10 home'))
    df['has_win10_professional'] = df['operating system'].apply(has_feature('windows 10 professional'))
    df['has_win7_professional'] = df['operating system'].apply(has_feature('windows 7 professional'))
    df['has_win8_1_home'] = df['operating system'].apply(has_feature('windows 8.1 home'))
    df['has_win8_1_professional'] = df['operating system'].apply(has_feature('windows 8.1 professional'))
    df['has_linux'] = df['operating system'].apply(has_feature('linux'))
    df.drop('operating system', axis=1, inplace=True)

    return df


def null_value_handling(df):

    # First we drop rows with more than 5 null values
    df = df.dropna(axis=0, thresh=5)

    num_var = ['CPU clock speed (GHz)', 'CPU cores', 'RAM size', 'drive memory size (GB)']
    cat_var = ['RAM type', 'resolution (px)', 'CPU model', 'graphic card type','drive type', 'screen size']

    for var in num_var:
        # Filling the numerical variable (num_var) with median values
        df[var] = df[var].fillna(df[var].median())
    
    for var in cat_var:        
        # Filling the categorical variable (cat_var) with most frequent values
        df[var] = df[var].fillna(df[var].mode()[0])
    
    return df

def one_hot_encode(df, columns):
    """
    Performs one-hot encoding on specified columns in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the columns to encode.
        columns (list): A list of column names to be one-hot encoded.

    Returns:
        pd.DataFrame: The DataFrame with the one-hot encoded columns.
    """
    encoder = OneHotEncoder(sparse_output=False)  
    encoded_data = encoder.fit_transform(df[columns])
    
    print(type(encoded_data))
    print(encoded_data.shape)
    encoded_columns = encoder.get_feature_names_out(columns)
    encoded_df = pd.DataFrame(encoded_data, columns=encoded_columns)

    return pd.concat([df.drop(columns, axis=1), encoded_df], axis=1)

def ordinal_encode(df, columns):
    """
    Performs ordinal encoding on specified columns in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the columns to encode.
        columns (list): A list of column names to be ordinal encoded.

    Returns:
        pd.DataFrame: The DataFrame with the encoded columns.
    """
    encoder = OrdinalEncoder()
    df[columns] = encoder.fit_transform(df[columns])
    return df



In [99]:
# Preprocessing the data
df = preprocessing_df(df)
df = null_value_handling(df)

# One Hot Encoding
one_hot_columns = ['graphic card type']
df = one_hot_encode(df, one_hot_columns)

# Ordinal Encoding
ordinal_encode_columns = ['resolution (px)', 'drive type', 'RAM type', 'screen size', 'CPU model', 'warranty']
df = ordinal_encode(df, ordinal_encode_columns)
df.head()


Unnamed: 0,graphic card type,resolution (px),CPU cores,RAM size,drive type,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price,has_touchpad,has_keyboard,has_numeric_keyboard,has_illuminated_keyboard,has_speakers,has_SD_card_reader,has_microphone,has_camera,has_modem_4g,has__bluetooth,has_lan,has_modem_3g,has_wifi,has_widi,has_win10_home,has_win10_professional,has_win7_professional,has_win8_1_home,has_win8_1_professional,has_linux
7233,dedicated graphics,1920 x 1080,4,32.0,ssd + hdd,ddr4,2.6,intel core i7,new,1250.0,producer warranty,"17"" - 17.9""",4999.0,1,1,1,1,1,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0
5845,dedicated graphics,1366 x 768,4,8.0,ssd,ddr3,2.4,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2649.0,1,1,1,0,1,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0
10303,dedicated graphics,1920 x 1080,2,8.0,hdd,ddr4,1.6,intel core i7,new,1000.0,producer warranty,"15"" - 15.9""",3399.0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0
10423,dedicated graphics,1920 x 1080,2,8.0,ssd,ddr4,2.5,intel core i5,new,500.0,producer warranty,"15"" - 15.9""",1599.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5897,integrated graphics,2560 x 1440,4,8.0,ssd,ddr4,1.2,other CPU,new,256.0,producer warranty,"12"" - 12.9""",4499.0,1,1,0,1,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0


<class 'numpy.ndarray'>
(4711, 2)


Unnamed: 0,resolution (px),CPU cores,RAM size,drive type,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price,has_touchpad,has_keyboard,has_numeric_keyboard,has_illuminated_keyboard,has_speakers,has_SD_card_reader,has_microphone,has_camera,has_modem_4g,has__bluetooth,has_lan,has_modem_3g,has_wifi,has_widi,has_win10_home,has_win10_professional,has_win7_professional,has_win8_1_home,has_win8_1_professional,has_linux,graphic card type_dedicated graphics,graphic card type_integrated graphics
7233,3.0,4.0,32.0,4.0,2.0,2.6,12.0,new,1250.0,1.0,5.0,4999.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
5845,1.0,4.0,8.0,3.0,0.0,2.4,12.0,new,256.0,2.0,4.0,2649.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,
10303,3.0,2.0,8.0,1.0,2.0,1.6,12.0,new,1000.0,1.0,4.0,3399.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,
10423,3.0,2.0,8.0,3.0,2.0,2.5,11.0,new,500.0,1.0,4.0,1599.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
5897,7.0,4.0,8.0,3.0,2.0,1.2,17.0,new,256.0,1.0,1.0,4499.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,


### Reading & Processing the Validation Data

In [83]:
df_val = pd.read_json('val_dataset.json')
# Processing the data and also null value handling

df_val = preprocessing_df(df_val)
df_val = null_value_handling(df_val)

df_val.head(2)

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price
3849,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps, lan 10/100 m...",1920 x 1080,4,8 gb,[windows 10 home],ssd + hdd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,2.5,intel core i5,new,1128.0,producer warranty,"15"" - 15.9""",3829.0
3904,dedicated graphics,"[bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.2,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2786.5


### Visualizing the data

In [89]:
sns.heatmap(df.corr(method='pearson'), annot=True, cmap='YlGnBu')
plt.show()

ValueError: could not convert string to float: 'dedicated graphics'

## Modelling

In [86]:
### Model 1: Linear Regression


X_train = df.drop('buynow_price', axis=1)
y_train = df['buynow_price']

x_val = df_val.drop('buynow_price', axis=1)
y_val = df_val['buynow_price']


KeyboardInterrupt: 