# Importing Dataset

# Project Overview: Laptop Price Prediction

This project aims to build a machine learning model to predict the price of laptops based on their specifications. The process involves several key steps:

1.  **Data Loading and Initial Analysis:** The dataset containing various laptop specifications and their prices is loaded into a pandas DataFrame. Initial checks are performed to understand the data structure, identify missing values, and check for duplicates.

2.  **Data Cleaning and Preprocessing:** The redundant index column 'Unnamed: 0' is dropped, and the 12 missing values in the 'Storage' column are filled with the most frequent value ('512 GB'). The categorical features 'Processor' and 'Operating System' are consolidated into a few broad groups. The 'Storage', 'RAM', and 'Screen Size' columns are stripped of their units and converted to numeric types. The 'Price' column is stripped of the currency symbol and thousands separators and converted to an integer. Finally, columns are renamed so that their names include the units.

3.  **Feature Engineering and Encoding:** The 'Processor' and 'Operating System' columns are one-hot encoded to convert them into a numerical format suitable for machine learning models. The 'Touch Screen' column is mapped to numerical values (0 for No, 1 for Yes).

4.  **Model Selection and Training:** The data is split 80/20 into training and testing sets. Two linear models, ordinary Linear Regression and Lasso Regression, are used to predict laptop prices. For each model, a pipeline chains a column transformer (which one-hot encodes 'Brand' and 'Model Name' and passes the remaining columns through unchanged) with the regressor. Each pipeline is then fit on the training data.

5.  **Model Evaluation and Prediction:** The trained models are evaluated on the held-out test set using the R2 score (about 0.53 for Linear Regression and 0.62 for Lasso on this split). Predictions are then made on hand-built sample rows to show how the fitted pipelines are used.

This project demonstrates a typical machine learning workflow, from data cleaning and preprocessing to model training and evaluation, for a regression task.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(r'D:\My Drive\Laptops.csv')
df

,Unnamed: 0,Brand,Model Name,Processor,Operating System,Storage,RAM,Screen Size,Touch_Screen,Price
0,0,HP,15s-fq5007TU,Core i3,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 Inch),No,"₹38,990"
1,1,HP,15s-fy5003TU,Core i3,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 Inch),No,"₹37,990"
2,2,Apple,2020 Macbook Air,M1,Mac OS Big Sur,256 GB,8 GB,33.78 cm (13.3 inch),No,"₹70,990"
3,3,Apple,2020 Macbook Air,M1,Mac OS Big Sur,256 GB,8 GB,33.78 cm (13.3 inch),No,"₹70,990"
4,4,Apple,2020 Macbook Air,M1,Mac OS Big Sur,256 GB,8 GB,33.78 cm (13.3 inch),No,"₹70,990"
...,...,...,...,...,...,...,...,...,...,...
832,832,HP,255 G8,Ryzen 5 Hexa Core,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 inch),No,"₹42,990"
833,833,DELL,Inspiron 7430,Core i3,Windows 11 Home,1 TB,8 GB,35.56 cm (14 inch),Yes,"₹60,490"
834,834,MSI,Katana 17 B13UCXK-256IN,Core i7,Windows 11 Home,4 TB,16 GB,43.94 cm (17.3 Inch),No,"₹88,990"
835,835,Infinix,XL25,Core i5,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 Inch),No,"₹37,990"


# Data Analysis

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 837 entries, 0 to 836
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Unnamed: 0        837 non-null    int64 
 1   Brand             837 non-null    object
 2   Model Name        837 non-null    object
 3   Processor         837 non-null    object
 4   Operating System  837 non-null    object
 5   Storage           825 non-null    object
 6   RAM               837 non-null    object
 7   Screen Size       837 non-null    object
 8   Touch_Screen      837 non-null    object
 9   Price             837 non-null    object
dtypes: int64(1), object(9)
memory usage: 65.5+ KB


In [None]:
df.isnull().sum()

Unnamed: 0           0
Brand                0
Model Name           0
Processor            0
Operating System     0
Storage             12
RAM                  0
Screen Size          0
Touch_Screen         0
Price                0
dtype: int64

In [None]:
df.duplicated().sum()

0

# Data Cleaning

In [None]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df

,Brand,Model Name,Processor,Operating System,Storage,RAM,Screen Size,Touch_Screen,Price
0,HP,15s-fq5007TU,Core i3,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 Inch),No,"₹38,990"
1,HP,15s-fy5003TU,Core i3,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 Inch),No,"₹37,990"
2,Apple,2020 Macbook Air,M1,Mac OS Big Sur,256 GB,8 GB,33.78 cm (13.3 inch),No,"₹70,990"
3,Apple,2020 Macbook Air,M1,Mac OS Big Sur,256 GB,8 GB,33.78 cm (13.3 inch),No,"₹70,990"
4,Apple,2020 Macbook Air,M1,Mac OS Big Sur,256 GB,8 GB,33.78 cm (13.3 inch),No,"₹70,990"
...,...,...,...,...,...,...,...,...,...
832,HP,255 G8,Ryzen 5 Hexa Core,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 inch),No,"₹42,990"
833,DELL,Inspiron 7430,Core i3,Windows 11 Home,1 TB,8 GB,35.56 cm (14 inch),Yes,"₹60,490"
834,MSI,Katana 17 B13UCXK-256IN,Core i7,Windows 11 Home,4 TB,16 GB,43.94 cm (17.3 Inch),No,"₹88,990"
835,Infinix,XL25,Core i5,Windows 11 Home,512 GB,8 GB,39.62 cm (15.6 Inch),No,"₹37,990"


In [None]:
df['Storage'].value_counts()

512 GB    627
1 TB      101
256 GB     49
2 TB       20
128 GB     12
4 TB       10
64 GB       4
3 TB        1
6 TB        1
Name: Storage, dtype: int64

In [None]:
df['Storage'] = df['Storage'].fillna('512 GB')
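
The fill value above is hard-coded from the `value_counts()` output. As a minimal sketch on a small made-up Series (not the notebook's data), the most frequent value can also be looked up with `Series.mode()`, so the fill does not need editing if the dataset changes:

```python
import pandas as pd

# Toy stand-in for the 'Storage' column; the values are illustrative only.
storage = pd.Series(["512 GB", "1 TB", None, "512 GB", None])

# mode() ignores missing values and returns the most frequent value(s), sorted;
# take the first one in case of ties.
most_common = storage.mode().iloc[0]

filled = storage.fillna(most_common)
print(most_common)
print(filled.tolist())
```

On the real column this would be `df['Storage'].fillna(df['Storage'].mode().iloc[0])`.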

In [None]:
df.isnull().sum()

Brand               0
Model Name          0
Processor           0
Operating System    0
Storage             0
RAM                 0
Screen Size         0
Touch_Screen        0
Price               0
dtype: int64

In [None]:
for column in df.columns:
    print(df[column].value_counts())
    print('-'*50)

HP           423
ASUS         147
Lenovo        62
DELL          59
Acer          36
MSI           32
Infinix       29
Apple         11
SAMSUNG        9
CHUWI          7
GIGABYTE       5
WINGS          4
ZEBRONICS      4
MICROSOFT      3
Ultimus        2
LG             2
realme         1
Primebook      1
Name: Brand, dtype: int64
--------------------------------------------------
15s- fr5011TU           37
15s-fy5003TU            37
15s- fr4000TU           37
15s-fq5007TU            36
15s-fy5002TU            34
                        ..
Vivobook 15 X1504ZA      1
Thin GF63 12VE-664IN     1
15-eg2017TU              1
K3605ZC-MBN542WS         1
13-BE0030AU              1
Name: Model Name, Length: 377, dtype: int64
--------------------------------------------------
Core i5                   365
Core i3                   163
Ryzen 5 Hexa Core          82
Core i7                    68
Ryzen 7 Octa Core          67
Celeron Dual Core          16
Ryzen 3 Quad Core          14
Celeron Quad Co

In [None]:
df["Processor"] = df['Processor'].apply(lambda x:" ".join(x.split()[0:2]))

In [None]:
df['Processor'].value_counts()

Core i5              365
Core i3              163
Ryzen 5               95
Core i7               68
Ryzen 7               68
Ryzen 3               21
Celeron Dual          16
Celeron Quad          10
Core i9                4
Pentium Silver         4
Athlon Dual            4
MediaTek Kompanio      3
Ryzen 9                3
M1                     3
M2                     3
Ryzen Z1               2
MediaTek MT8788        1
M1 Max                 1
M2 Max                 1
M3 Pro                 1
M1 Pro                 1
Name: Processor, dtype: int64

In [None]:
def processor_manufacturer(name):
    # Map a processor name to a broad manufacturer group.
    name_lower = name.lower()
    if 'core' in name_lower:
        return 'Intel'
    elif 'ryzen' in name_lower:
        return 'Amd'
    elif any(m in name for m in ["M1", "M2", "M3"]):
        return 'Mac'  # Apple M-series chips
    else:
        return 'Others'

In [None]:
df['Processor'] = df['Processor'].apply(processor_manufacturer)

In [None]:
df['Processor'].value_counts()

Intel     600
Amd       189
Others     38
Mac        10
Name: Processor, dtype: int64
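
Note that 34 of the 38 'Others' rows are Celeron (26), Pentium Silver (4), and Athlon (4) processors, which are Intel and AMD product lines that the keyword checks above do not cover. The sketch below is an alternative grouping, not what this notebook uses; `VENDOR_KEYWORDS` and `processor_vendor` are hypothetical names, and the keyword list is an assumption based only on the processor names printed earlier.

```python
# Hypothetical keyword table: each vendor label maps to the leading
# tokens seen in the notebook's processor names.
VENDOR_KEYWORDS = {
    "Intel": ("core", "celeron", "pentium"),
    "Amd": ("ryzen", "athlon"),
    "Mac": ("m1", "m2", "m3"),
}


def processor_vendor(name):
    """Return a vendor label for a processor name such as 'Celeron Dual'."""
    tokens = name.lower().split()
    for vendor, keywords in VENDOR_KEYWORDS.items():
        # Match whole tokens so 'MediaTek' is not mistaken for an M-series chip.
        if any(keyword in tokens for keyword in keywords):
            return vendor
    return "Others"


for example in ["Core i5", "Celeron Dual", "Athlon Dual", "M1 Max", "MediaTek Kompanio"]:
    print(example, "->", processor_vendor(example))
```

With this grouping only the MediaTek rows would remain in 'Others'; whether that improves the model is not something this notebook tests.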

In [None]:
def operationsystem(name):
    # Map an operating system name to a broad OS family.
    name_lower = name.lower()
    if 'windows' in name_lower:
        return 'Windows OS'
    elif 'mac' in name_lower:
        return 'Mac OS'
    elif 'chrome' in name_lower:
        return 'Chrome OS'
    else:
        return 'Others'

In [None]:
df['Operating System']=df['Operating System'].apply(operationsystem)

In [None]:
df['Operating System'].value_counts()

Windows OS    803
Chrome OS      13
Mac OS         11
Others         10
Name: Operating System, dtype: int64

In [None]:
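# Strip the units and express everything in GB, counting 1 TB as 1000 GB:
# '512 GB' -> '512', '1 TB' -> '1 000' -> '1000'
df['Storage']=df['Storage'].str.replace('GB','')
df['Storage']=df['Storage'].str.replace('TB','000')
df['Storage']=df['Storage'].str.replace(' ','')
df['Storage']=df['Storage'].astype(int)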
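
The chained replacements above rely on every value having the form `<integer> GB` or `<integer> TB`. A minimal sketch of a stricter alternative is shown below; `storage_to_gb` is a hypothetical helper (not part of the notebook) that uses the same 1 TB = 1000 GB convention and raises an explicit error on any other format.

```python
import re

# Multipliers to GB, using the same decimal convention as the cell above.
_UNIT_TO_GB = {"GB": 1, "TB": 1000}


def storage_to_gb(text):
    """Convert strings such as '512 GB' or '1 TB' to an integer number of GB."""
    match = re.fullmatch(r"\s*(\d+)\s*(GB|TB)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised storage value: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_TO_GB[unit.upper()]


print(storage_to_gb("512 GB"))
print(storage_to_gb("1 TB"))
```

It could be applied with `df['Storage'].apply(storage_to_gb)` in place of the four `str.replace`/`astype` lines.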

In [None]:
df['Storage'].value_counts()

512     639
1000    101
256      49
2000     20
128      12
4000     10
64        4
3000      1
6000      1
Name: Storage, dtype: int64

In [None]:
df['RAM']=df['RAM'].str.replace('GB','')
df['RAM']=df['RAM'].astype(int)

In [None]:
df['RAM'].value_counts()

8     421
16    377
4      25
32      9
12      2
64      2
18      1
Name: RAM, dtype: int64

In [None]:
# Keep only the leading centimetre value, e.g. '39.62 cm (15.6 Inch)' -> '39.62'
df['Screen Size'] = df['Screen Size'].apply(lambda x: x.split(' ')[0])
df['Screen Size'] = df['Screen Size'].astype(float)
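
The cell above keeps the centimetre figure and discards the inch figure in parentheses. As a sketch, assuming every raw value follows the `<cm> cm (<inch> inch)` pattern seen in the displayed rows, both numbers can be extracted and checked against each other (1 inch = 2.54 cm); `parse_screen_size` is a hypothetical helper.

```python
import re

_SCREEN_PATTERN = re.compile(
    r"([\d.]+)\s*cm\s*\(([\d.]+)\s*inch\)", flags=re.IGNORECASE
)


def parse_screen_size(text):
    """Return (centimetres, inches) from e.g. '39.62 cm (15.6 Inch)'."""
    match = _SCREEN_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognised screen size: {text!r}")
    cm, inch = match.groups()
    return float(cm), float(inch)


cm, inch = parse_screen_size("39.62 cm (15.6 Inch)")
# The two figures should agree up to rounding in the source data.
consistent = abs(cm - inch * 2.54) < 0.05
print(cm, inch, consistent)
```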

In [None]:
df['Screen Size'].value_counts()

39.62     555
35.56     177
40.64      27
33.78      16
40.89      10
43.94       8
96.52       6
35.81       5
34.29       5
38.10       4
39.01       3
17.78       2
33.02       2
34.04       2
29.46       2
38.00       2
100.63      2
26.67       1
34.54       1
35.00       1
41.15       1
90.32       1
30.48       1
38.86       1
36.07       1
31.50       1
Name: Screen Size, dtype: int64

In [None]:
df['Price']=df['Price'].str.replace('₹','')
df['Price']=df['Price'].str.replace(',','')
df['Price']=df['Price'].astype(int)

In [None]:
df['Price'].value_counts()

53990     65
37990     42
54990     41
49990     40
38990     39
          ..
52890      1
147743     1
64600      1
199990     1
70500      1
Name: Price, Length: 270, dtype: int64

In [None]:
df.rename(columns={
    'Storage': 'Storage(GB)',
    'RAM': 'RAM(GB)',
    'Screen Size': 'Screen Size(CM)',
    'Touch_Screen': 'Touch Screen',
    'Price': 'Price(₹)',
}, inplace=True)

# Cleaned Dataset

In [None]:
df

,Brand,Model Name,Processor,Operating System,Storage(GB),RAM(GB),Screen Size(CM),Touch Screen,Price(₹)
0,HP,15s-fq5007TU,Intel,Windows OS,512,8,39.62,No,38990
1,HP,15s-fy5003TU,Intel,Windows OS,512,8,39.62,No,37990
2,Apple,2020 Macbook Air,Mac,Mac OS,256,8,33.78,No,70990
3,Apple,2020 Macbook Air,Mac,Mac OS,256,8,33.78,No,70990
4,Apple,2020 Macbook Air,Mac,Mac OS,256,8,33.78,No,70990
...,...,...,...,...,...,...,...,...,...
832,HP,255 G8,Amd,Windows OS,512,8,39.62,No,42990
833,DELL,Inspiron 7430,Intel,Windows OS,1000,8,35.56,Yes,60490
834,MSI,Katana 17 B13UCXK-256IN,Intel,Windows OS,4000,16,43.94,No,88990
835,Infinix,XL25,Intel,Windows OS,512,8,39.62,No,37990


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 837 entries, 0 to 836
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Brand             837 non-null    object 
 1   Model Name        837 non-null    object 
 2   Processor         837 non-null    object 
 3   Operating System  837 non-null    object 
 4   Storage(GB)       837 non-null    int32  
 5   RAM(GB)           837 non-null    int32  
 6   Screen Size(CM)   837 non-null    float64
 7   Touch Screen      837 non-null    object 
 8   Price(₹)          837 non-null    int32  
dtypes: float64(1), int32(3), object(5)
memory usage: 49.2+ KB


In [None]:
df.describe()

,Storage(GB),RAM(GB),Screen Size(CM),Price(₹)
count,837.0,837.0,837.0,837.0
mean,635.010753,11.897252,39.128913,59919.016726
std,512.316622,5.290956,6.453314,37594.119774
min,64.0,4.0,17.78,11990.0
25%,512.0,8.0,35.56,38990.0
50%,512.0,8.0,39.62,53990.0
75%,512.0,16.0,39.62,66999.0
max,6000.0,64.0,100.63,489990.0


# Data Preprocessing

In [None]:
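# dtype=int gives the 0/1 indicator columns shown below on any pandas version
# (pandas 2.x defaults to True/False)
df = pd.get_dummies(df,columns=['Processor','Operating System'],dtype=int)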
df

,Brand,Model Name,Storage(GB),RAM(GB),Screen Size(CM),Touch Screen,Price(₹),Processor_Amd,Processor_Intel,Processor_Mac,Processor_Others,Operating System_Chrome OS,Operating System_Mac OS,Operating System_Others,Operating System_Windows OS
0,HP,15s-fq5007TU,512,8,39.62,No,38990,0,1,0,0,0,0,0,1
1,HP,15s-fy5003TU,512,8,39.62,No,37990,0,1,0,0,0,0,0,1
2,Apple,2020 Macbook Air,256,8,33.78,No,70990,0,0,1,0,0,1,0,0
3,Apple,2020 Macbook Air,256,8,33.78,No,70990,0,0,1,0,0,1,0,0
4,Apple,2020 Macbook Air,256,8,33.78,No,70990,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
832,HP,255 G8,512,8,39.62,No,42990,1,0,0,0,0,0,0,1
833,DELL,Inspiron 7430,1000,8,35.56,Yes,60490,0,1,0,0,0,0,0,1
834,MSI,Katana 17 B13UCXK-256IN,4000,16,43.94,No,88990,0,1,0,0,0,0,0,1
835,Infinix,XL25,512,8,39.62,No,37990,0,1,0,0,0,0,0,1


In [None]:
df['Touch Screen'] = df['Touch Screen'].map({'Yes':1,'No':0})
df

,Brand,Model Name,Storage(GB),RAM(GB),Screen Size(CM),Touch Screen,Price(₹),Processor_Amd,Processor_Intel,Processor_Mac,Processor_Others,Operating System_Chrome OS,Operating System_Mac OS,Operating System_Others,Operating System_Windows OS
0,HP,15s-fq5007TU,512,8,39.62,0,38990,0,1,0,0,0,0,0,1
1,HP,15s-fy5003TU,512,8,39.62,0,37990,0,1,0,0,0,0,0,1
2,Apple,2020 Macbook Air,256,8,33.78,0,70990,0,0,1,0,0,1,0,0
3,Apple,2020 Macbook Air,256,8,33.78,0,70990,0,0,1,0,0,1,0,0
4,Apple,2020 Macbook Air,256,8,33.78,0,70990,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
832,HP,255 G8,512,8,39.62,0,42990,1,0,0,0,0,0,0,1
833,DELL,Inspiron 7430,1000,8,35.56,1,60490,0,1,0,0,0,0,0,1
834,MSI,Katana 17 B13UCXK-256IN,4000,16,43.94,0,88990,0,1,0,0,0,0,0,1
835,Infinix,XL25,512,8,39.62,0,37990,0,1,0,0,0,0,0,1


# Defining X & y for Training and Testing Dataset

In [None]:
X = df.drop(columns='Price(₹)')
X

,Brand,Model Name,Storage(GB),RAM(GB),Screen Size(CM),Touch Screen,Processor_Amd,Processor_Intel,Processor_Mac,Processor_Others,Operating System_Chrome OS,Operating System_Mac OS,Operating System_Others,Operating System_Windows OS
0,HP,15s-fq5007TU,512,8,39.62,0,0,1,0,0,0,0,0,1
1,HP,15s-fy5003TU,512,8,39.62,0,0,1,0,0,0,0,0,1
2,Apple,2020 Macbook Air,256,8,33.78,0,0,0,1,0,0,1,0,0
3,Apple,2020 Macbook Air,256,8,33.78,0,0,0,1,0,0,1,0,0
4,Apple,2020 Macbook Air,256,8,33.78,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
832,HP,255 G8,512,8,39.62,0,1,0,0,0,0,0,0,1
833,DELL,Inspiron 7430,1000,8,35.56,1,0,1,0,0,0,0,0,1
834,MSI,Katana 17 B13UCXK-256IN,4000,16,43.94,0,0,1,0,0,0,0,0,1
835,Infinix,XL25,512,8,39.62,0,0,1,0,0,0,0,0,1


In [None]:
y = df[['Price(₹)']]
y

,Price(₹)
0,38990
1,37990
2,70990
3,70990
4,70990
...,...
832,42990
833,60490
834,88990
835,37990


# Split Dataset into Training & Testing

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

# Make Column Transformer

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
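ohe = OneHotEncoder()
# Fit on the full X (not just X_train) so that every brand and model name,
# including those that appear only in the test split, is a known category.
ohe.fit(X[['Brand','Model Name']])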

In [None]:
ohe.categories_

[array(['ASUS', 'Acer', 'Apple', 'CHUWI', 'DELL', 'GIGABYTE', 'HP',
        'Infinix', 'LG', 'Lenovo', 'MICROSOFT', 'MSI', 'Primebook',
        'SAMSUNG', 'Ultimus', 'WINGS', 'ZEBRONICS', 'realme'], dtype=object),
 array(['11IGL05', '13-BE0030AU', '13ABR8', '13b-ca0006MU', '14-EC1019AU',
        '14-dv1000TU', '14-dv1029TU', '14-dv2014TU', '14-dv2053TU',
        '14-dv2153TU', '14-dy1013TU', '14-ec0033AU', '14-eh0024TU',
        '14IAU7', '14ITL6', '14M868', '14a-ca0504TU', '14c-cc0010TU',
        '14s', '14s - dy2507TU', '14s - dy2508TU', '14s- dy2506TU',
        '14s-dq2606tu', '14s-ef1001tu', '14s-ef1002tu', '14s-fy1003AU',
        '14s-fy1005AU', '15', '15-EG2009TU', '15-eg2017TU', '15-eg2091T',
        '15-eg3079TU', '15-fa0070TX', '15-fa0188TX', '15-fa0354TX',
        '15-fa1060TX', '15-fb0106AX', '15-fb0131AX', '15-fb0135AX',
        '15-fb0137AX', '15-fb0147AX', '15-fb0150AX', '15-fb1016AX',
        '15-fc0031AU', '15-fd0011TU', '15-fd0012TU', '15-fd0019TU',
        '15-fd0022T

In [None]:
column_trans = make_column_transformer((OneHotEncoder(categories=ohe.categories_),['Brand','Model Name']),remainder='passthrough')
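
Passing the pre-fitted `categories` works for rows drawn from this dataset, but a brand or model name that was never seen would raise an error at prediction time. One alternative, sketched here on made-up data and not used in this notebook, is `OneHotEncoder(handle_unknown='ignore')`, which encodes an unseen category as all zeros. The toy frames and the brand "Acme" are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy training data and one row with a brand that was never seen in training.
train = pd.DataFrame({"Brand": ["HP", "DELL", "HP"], "RAM(GB)": [8, 16, 8]})
unseen = pd.DataFrame({"Brand": ["Acme"], "RAM(GB)": [8]})

transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array, for easy printing
)
transformer.fit(train)

# Columns are [Brand_DELL, Brand_HP, RAM(GB)]; the unknown brand becomes 0, 0.
encoded = transformer.transform(unseen)
print(encoded)
```

The trade-off is that an unknown model name then contributes nothing to the prediction, which may or may not be acceptable for price estimates.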

# Apply LinearRegression Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [None]:
lr = LinearRegression()

In [None]:
pipe = make_pipeline(column_trans,lr)

In [None]:
pipe.fit(X_train,y_train)

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
print('R2_Score for LinearRegression: ',r2_score(y_test,y_pred))

R2_Score for LinearRegression:  0.530106489136756
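
For context on reading this number: R2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A score of about 0.53 therefore means the model's squared error on the test set is roughly 47% of what always predicting the mean test price would give. The pure-Python sketch below mirrors that definition; `r_squared` is an illustrative name, and `sklearn.metrics.r2_score` is what the notebook actually calls.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# A perfect prediction scores 1; always predicting the mean scores 0.
perfect = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])
print(perfect, mean_only)
```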


# Real Data Prediction for LinearRegression

In [None]:
# X.columns holds the same 14 feature columns, in training order
pipe.predict(pd.DataFrame([['HP','15s-fq5007TU',512,8,39.62,0,0,1,0,0,0,0,0,1]],columns=X.columns))

array([[38998.12773948]])

In [None]:
# X.columns holds the same 14 feature columns, in training order
pipe.predict(pd.DataFrame([['Apple','XL25',256,16,15.56,1,0,0,1,0,0,1,0,0]],columns=X.columns))

array([[174802.43401706]])
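
Writing the indicator columns by hand, as in the two calls above, is easy to get wrong. The sketch below builds the same row from readable values; `FEATURE_COLUMNS` copies the column order of `X` shown earlier, and `make_feature_row` is a hypothetical helper that is not part of the notebook.

```python
# Column order copied from the X frame displayed earlier in the notebook.
FEATURE_COLUMNS = [
    "Brand", "Model Name", "Storage(GB)", "RAM(GB)", "Screen Size(CM)",
    "Touch Screen",
    "Processor_Amd", "Processor_Intel", "Processor_Mac", "Processor_Others",
    "Operating System_Chrome OS", "Operating System_Mac OS",
    "Operating System_Others", "Operating System_Windows OS",
]


def make_feature_row(brand, model_name, storage_gb, ram_gb, screen_cm,
                     touch_screen, processor, operating_system):
    """Build one feature row (as an ordered dict) from human-readable specs."""
    row = {column: 0 for column in FEATURE_COLUMNS}
    row.update({
        "Brand": brand,
        "Model Name": model_name,
        "Storage(GB)": storage_gb,
        "RAM(GB)": ram_gb,
        "Screen Size(CM)": screen_cm,
        "Touch Screen": int(touch_screen),
    })
    # Switch on exactly one processor indicator and one OS indicator.
    for indicator in (f"Processor_{processor}",
                      f"Operating System_{operating_system}"):
        if indicator not in row:
            raise ValueError(f"unknown category column: {indicator}")
        row[indicator] = 1
    return row


row = make_feature_row("HP", "15s-fq5007TU", 512, 8, 39.62, False,
                       "Intel", "Windows OS")
print(list(row.values()))
```

The result can be passed to the fitted pipeline as `pipe.predict(pd.DataFrame([row]))`.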

# Apply Lasso Model

In [None]:
from sklearn.linear_model import Lasso

In [None]:
ls = Lasso()

In [None]:
pipe = make_pipeline(column_trans,ls)

In [None]:
pipe.fit(X_train,y_train)

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
print('R2_Score for Lasso: ',r2_score(y_test,y_pred))

R2_Score for Lasso:  0.615635562318159
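
For reference, scikit-learn's `Lasso` fits the coefficients $w$ by minimizing a least-squares loss plus an L1 penalty. With $n$ training samples, feature matrix $X$, and targets $y$, the documented objective is:

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

Because the notebook calls `Lasso()` with no arguments, $\alpha$ takes its default value of 1.0. The penalty can shrink some coefficients to exactly zero, which may help with the several hundred one-hot 'Model Name' columns, although the notebook does not test whether that is the reason for the higher score on this split.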


# Real Data Prediction for Lasso

In [None]:
# X.columns holds the same 14 feature columns, in training order
pipe.predict(pd.DataFrame([['HP','15s-fq5007TU',512,8,39.62,0,0,1,0,0,0,0,0,1]],columns=X.columns))

array([39013.06453502])

In [None]:
# X.columns holds the same 14 feature columns, in training order
pipe.predict(pd.DataFrame([['Apple','XL25',256,16,15.56,1,0,0,1,0,0,1,0,0]],columns=X.columns))

array([171525.03727064])