# Data Preparation

Data preparation is the process of preparing data before modeling or analysis. This stage includes importing libraries, reading the dataset, displaying basic dataset information, cleaning the data by removing unnecessary columns, checking for missing values, and deleting duplicate data. These steps ensure that the data is in the proper format and ready for use.

In [653]:
#import library that required
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [654]:
#change the value display format in the variable 'price_IDR'
pd.options.display.float_format = '{:,.2f}'.format

In [655]:
#read dataset with excel format
df = pd.read_excel('laptop_price.xlsx')
df

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_IDR
0,0,Apple,Ultrabook,13.30,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,13883153.88
1,1,Apple,Ultrabook,13.30,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,9315679.26
2,2,HP,Notebook,15.60,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,5958702.00
3,3,Apple,Ultrabook,15.40,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,26295492.85
4,4,Apple,Ultrabook,13.30,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,18690634.66
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298,Lenovo,2 in 1 Convertible,14.00,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,6611568.48
1299,1299,Lenovo,2 in 1 Convertible,13.30,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,15534077.04
1300,1300,Lenovo,Notebook,14.00,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,2373117.84
1301,1301,HP,Notebook,15.60,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,7917301.44


In [656]:
#display the dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1303 non-null   int64  
 1   Company           1303 non-null   object 
 2   TypeName          1303 non-null   object 
 3   Inches            1303 non-null   float64
 4   ScreenResolution  1303 non-null   object 
 5   Cpu               1303 non-null   object 
 6   Ram               1303 non-null   object 
 7   Memory            1303 non-null   object 
 8   Gpu               1303 non-null   object 
 9   OpSys             1303 non-null   object 
 10  Weight            1303 non-null   object 
 11  Price_IDR         1303 non-null   float64
dtypes: float64(2), int64(1), object(9)
memory usage: 122.3+ KB


In [657]:
#drop columns that no needed
df = df.drop(columns=['Unnamed: 0'])

In [658]:
#check missing values of dataset
df.isna().sum()

Company             0
TypeName            0
Inches              0
ScreenResolution    0
Cpu                 0
Ram                 0
Memory              0
Gpu                 0
OpSys               0
Weight              0
Price_IDR           0
dtype: int64

In [659]:
#drop duplicates data
df.drop_duplicates(inplace=True)

Displaying information about the data after removing duplicates. Initially, the dataset contained 1303 entries, and after the duplicate removal process, it now has 1274 entries.

In [660]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1274 entries, 0 to 1273
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           1274 non-null   object 
 1   TypeName          1274 non-null   object 
 2   Inches            1274 non-null   float64
 3   ScreenResolution  1274 non-null   object 
 4   Cpu               1274 non-null   object 
 5   Ram               1274 non-null   object 
 6   Memory            1274 non-null   object 
 7   Gpu               1274 non-null   object 
 8   OpSys             1274 non-null   object 
 9   Weight            1274 non-null   object 
 10  Price_IDR         1274 non-null   float64
dtypes: float64(2), object(9)
memory usage: 119.4+ KB


# Preprocessing

During the preprocessing stage, I converted certain variables (RAM and Memory) from string data types to numeric and performed encoding using One-Hot Encoding. This method was chosen because it is safer and widely used. One-Hot Encoding creates separate variables for each category, which helps avoid issues related to ordering and makes the data easier for the model to interpret.

In [661]:
#convert RAM variable from string to numeric
df['Ram'] = df['Ram'].str.replace('GB', '').astype(int)

In [662]:
#display data from RAM variables that have been converted to numeric
df['Ram'].unique()

array([ 8, 16,  4,  2, 12,  6, 32, 24, 64])

In [663]:
#convert Memory variable from string to numeric
def convert_to_gb(memory):
    #if it contains '+', take each part and convert it
    if '+' in memory:
        parts = memory.split(' + ')
        total_gb = sum(convert_to_gb(part) for part in parts)
        return total_gb

    #specifies the value and units of the string
    memory = memory.upper()
    if 'TB' in memory:
        return float(memory.split('TB')[0].replace(' ', '').replace('1.0', '1')) * 1024
    elif 'GB' in memory:
        return float(memory.split('GB')[0].replace(' ', ''))
    elif 'MB' in memory:
        return float(memory.split('MB')[0].replace(' ', '')) / 1024
    elif 'KB' in memory:
        return float(memory.split('KB')[0].replace(' ', '')) / (1024 * 1024)
    else:
        return 0  # Untuk kasus format yang tidak diketahui

#apply the conversion function to the 'Memory' column of the existing dataframe
df['Memory_GB'] = df['Memory'].apply(convert_to_gb)

#displays data from the Memory variable that has not been and has been converted to numeric
print(df[['Memory', 'Memory_GB']])

                   Memory  Memory_GB
0               128GB SSD     128.00
1     128GB Flash Storage     128.00
2               256GB SSD     256.00
3               512GB SSD     512.00
4               256GB SSD     256.00
...                   ...        ...
1269            500GB HDD     500.00
1270            128GB SSD     128.00
1271            512GB SSD     512.00
1272   64GB Flash Storage      64.00
1273              1TB HDD   1,024.00

[1274 rows x 2 columns]


In [664]:
#display data from Memory variables that have been converted to numeric
df['Memory_GB'].unique()

array([ 128.,  256.,  512.,  500., 1024.,   32., 1152.,   64., 1280.,
       2304., 2048., 1536.,  756., 2176.,   16.,  768., 2560., 1088.,
        180.,  240.,    8.,  508.])

**Encoding Process** -- Performing encoding by removing two unnecessary variables, such as the Memory and Weight variables.

In [665]:
encoding = df.drop(columns=['Memory','Weight'])
df_encoded = pd.get_dummies(encoding, drop_first=True)
df_encoded.head()

Unnamed: 0,Inches,Ram,Price_IDR,Memory_GB,Company_Apple,Company_Asus,Company_Chuwi,Company_Dell,Company_Fujitsu,Company_Google,...,Gpu_Nvidia Quadro M620,Gpu_Nvidia Quadro M620M,OpSys_Chrome OS,OpSys_Linux,OpSys_Mac OS X,OpSys_No OS,OpSys_Windows 10,OpSys_Windows 10 S,OpSys_Windows 7,OpSys_macOS
0,13.3,8,13883153.88,128.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,13.3,8,9315679.26,128.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,15.6,8,5958702.0,256.0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,15.4,16,26295492.85,512.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,13.3,8,18690634.66,256.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


# Split Data (Training-Testing)

The data splitting process divides the dataset into training and testing data. The training data (X) includes all variables except for 'Price_IDR', which is the target variable (Y). The dataset is then split into 80% for training and 20% for testing.

In [666]:
x = df_encoded.drop(columns=['Price_IDR'])
y = df_encoded['Price_IDR']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Modeling

The model used to predict laptop prices with the existing dataset is **CatBoostRegressor**. This model is highly effective for regression tasks, especially when dealing with categorical data. Additionally, CatBoostRegressor excels in delivering high performance with minimal need for hyperparameter tuning. Among several models tested, CatBoostRegressor produced the best results.

In [667]:
#train the model
model = CatBoostRegressor(n_estimators=100, random_state=0, silent=True)
model.fit(x_train, y_train)

<catboost.core.CatBoostRegressor at 0x1e3dbdc91c0>

In [668]:
y_pred = model.predict(x_test)

**Model Evaluation** — The purpose of model evaluation is to determine whether the model performs well or not. In this evaluation, two metrics are used: Mean Absolute Error (MAE) and R-Squared (R²). MAE is used to indicate the average difference between the predicted and actual prices, while R² is used to assess how well the model explains the variation in laptop prices.

In [669]:
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MAE : {:,.2f}".format(mae))
print("R^2 : {:.2f}".format(r2))

MAE : 2,216,176.67
R^2 : 0.82


# Prediction

The prediction process is the final step for displaying the laptop price based on the desired specification.

In [670]:
#prediction for the given laptop specification
new_laptop = pd.DataFrame({
    'Company': ['Dell'],
    'TypeName': ['Notebook'],
    'Inches': [17.3],
    'ScreenResolution': ['Full HD 1920x1080'],
    'Cpu': ['Intel Core i7 8550U 1.8GHz'],
    'Ram': [16],
    'Memory_GB': [512.0],
    'Gpu': ['AMD Radeon 530'],
    'OpSys': ['Linux Mint']
})

#encoding is done for new data (new_laptop)
new_laptop_encoded = pd.get_dummies(new_laptop, drop_first=True)
#fit the new data variable (new_laptop) to the training data and fill in the missing values with 0
new_laptop_encoded2 = new_laptop_encoded.reindex(columns=x.columns, fill_value=0)

#enter data that has been encoded and adjusted to the training data (new laptop encoded 2) into the model 
predicted_price = model.predict(new_laptop_encoded2)

print(f'Predicted Price: IDR {predicted_price[0]:,.2f}')

Predicted Price: IDR 19,355,754.58
