# <center>💻 Laptop Price Prediction using Machine Learning (Tunisia Market)</center>

This notebook aims to build a predictive model to estimate laptop prices using real product data scraped from Tunisian online electronics stores. 

In this project, I use features such as RAM, processor type, screen size, storage (SSD/HDD), GPU, brand, and operating system as predictors of price.

**Goals:**
- Explore and understand the dataset
- Visualize key relationships
- Build regression models
- Evaluate model performance

**Dataset**

- Data sourced from Tunisian online retailers (MyTek, Spacenet, Tdiscount, Agora, Batam, Graiet, Tunisianet)  
- Scraped using **BeautifulSoup** and **requests** in Python  
- Full scraping and cleaning process is available in the GitHub project:  
  👉 [GitHub - Laptop Price Scraping & Cleaning Project](https://github.com/ibtihel-dhaouadi/laptop-price-prediction-tn)

## 📊 Import Libraries & Load Data

In [None]:
!pip install lazypredict

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import joblib
import category_encoders as ce
from lazypredict.Supervised import LazyRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Set seaborn theme
sns.set(style="whitegrid", palette="muted", font_scale=1.1)
plt.rcParams['figure.figsize'] = (6, 4)

In [None]:
# Load dataset
df = pd.read_csv('/kaggle/input/tunisia-laptop-market-cleaned-dataset-2025/tunisia_laptop_prices_2025.csv')

In [None]:
df.head(3)

In [None]:
# Drop unwanted columns
df.drop(['reference','store', 'link', 'name', 'image_url'], axis=1, inplace=True)

In [None]:
df.columns

In [None]:
# Remove duplicate rows
df = df.drop_duplicates()

## 🔍 Data Overview

In [None]:
df.describe(include='all').T

In [None]:
df.info()

In [None]:
df.isna().sum()

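From the above, we can see that the dataset has no missing values and is ready for modeling. We have:
- 15 columns in the raw file, of which 9 features plus the target remain after dropping the identifier columns
- 0 missing values
- `price` as the target variable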
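
Checks like these can also be written as assertions, so the notebook fails loudly if a later re-scrape introduces gaps or duplicates. A minimal sketch on a toy frame follows; the column names mirror the notebook's, but the values are made up.

```python
import pandas as pd

# Toy frame with one duplicated row (illustrative values only)
toy = pd.DataFrame({
    "brand": ["Lenovo", "HP", "HP"],
    "ram": [8, 16, 16],
    "price": [1200.0, 2500.0, 2500.0],
})

toy = toy.drop_duplicates()

# Fail early if the cleaned data violates the assumptions made above
assert toy.isna().sum().sum() == 0, "unexpected missing values"
assert not toy.duplicated().any(), "duplicate rows remain"
assert "price" in toy.columns, "target column missing"

print(toy.shape)
```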

In [None]:
# NUMERIC DISTRIBUTIONS with count labels
numeric_cols = ['price', 'screen_size', 'ram', 'SSD', 'HDD']

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 12))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    plot = sns.histplot(df[col], bins=30, kde=True, ax=axes[i], color=sns.color_palette()[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Count')
    
    # Add count labels above bars
    for patch in plot.patches:
        height = patch.get_height()
        if height > 0:
            x = patch.get_x() + patch.get_width() / 2
            plot.text(x, height, f'{int(height)}', ha='center', va='bottom', fontsize=12, rotation=0)

# Remove empty subplot if needed
if len(numeric_cols) < len(axes):
    fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()

In [None]:
# Create boxplots (one panel per numeric column)
plt.figure(figsize=(15, 8))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(y=df[col], color='lightblue')
    plt.title(f'Outliers in {col}')
plt.tight_layout()
plt.show()

In [None]:
# CATEGORICAL DISTRIBUTIONS
# Categorical columns with moderate cardinality
cat_cols = ['brand', 'os', 'gamer']

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))
axes = axes.flatten()

for i, col in enumerate(cat_cols):
    plot = sns.countplot(data=df, x=col, ax=axes[i], order=df[col].value_counts().index)
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=45)

    # Add count labels
    for container in plot.containers:
        plot.bar_label(container, fmt='%d', label_type='edge', padding=2, fontsize=14)

plt.tight_layout()
plt.show()

In [None]:
# GPUs and PROCESSORS
gpus = df['gpu'].value_counts()
procs = df['processor'].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# GPU Plot
gpu_plot = sns.barplot(x=gpus.values, y=gpus.index, ax=axes[0], palette='Blues_d')
axes[0].set_title("GPUs")
axes[0].set_xlabel("Count")
axes[0].set_ylabel("GPU")

# Add labels
for container in gpu_plot.containers:
    gpu_plot.bar_label(container, fmt='%d', label_type='edge', padding=3)

# Processor Plot
proc_plot = sns.barplot(x=procs.values, y=procs.index, ax=axes[1], palette='Greens_d')
axes[1].set_title("Processors")
axes[1].set_xlabel("Count")
axes[1].set_ylabel("Processor")

# Add labels
for container in proc_plot.containers:
    proc_plot.bar_label(container, fmt='%d', label_type='edge', padding=3)

plt.tight_layout()
plt.show()


In [None]:
df_heatmap = df.copy()

# Identify categorical columns
categorical_cols = df_heatmap.select_dtypes(include=['object']).columns

# Label encode categorical columns for correlation calculation (heatmap)
le = LabelEncoder()
for col in categorical_cols:
    df_heatmap[col] = le.fit_transform(df_heatmap[col])

# Compute correlation matrix
corr_matrix = df_heatmap.corr()

# Plot heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar=True)
plt.title('Correlation Heatmap (Numerical + Label Encoded Categorical)', fontsize=16)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
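
One caveat when reading this heatmap: `LabelEncoder` assigns integer codes in sorted label order, so the correlation between a nominal column such as `brand` and `price` depends on how the labels happen to sort rather than on any real ordering. A toy illustration with made-up brands and prices:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "brand": ["Acer", "Acer", "Asus", "Asus", "Zed", "Zed"],  # hypothetical brands
    "price": [3000.0, 3100.0, 1000.0, 1100.0, 2000.0, 2100.0],
})

# Codes follow alphabetical order: Acer=0, Asus=1, Zed=2
codes = LabelEncoder().fit_transform(toy["brand"])
corr_alpha = pd.Series(codes).corr(toy["price"])

# Re-coding the same brands by their mean price gives a very different value
order = toy.groupby("brand")["price"].mean().sort_values().index
codes_by_price = toy["brand"].map({b: i for i, b in enumerate(order)})
corr_by_price = codes_by_price.corr(toy["price"])

print(f"alphabetical codes:  {corr_alpha:.2f}")
print(f"price-ordered codes: {corr_by_price:.2f}")
```

The coefficients for the categorical columns are therefore best read as rough hints; the numeric columns (`ram`, `SSD`, `HDD`, `screen_size`) are not affected by this.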

## 🔧 Prepare data for modeling with Target Encoding

In [None]:
# Separate target
X = df.drop(columns=['price'])
y = df['price']

In [None]:
# Target Encoding
categorical_cols = X.select_dtypes(include='object').columns.tolist()
encoder = ce.TargetEncoder(cols=categorical_cols)
X_encoded = encoder.fit_transform(X, y)

In [None]:
# Log transform target
y_log = np.log1p(y)
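
Training on `log1p(price)` compresses the right tail of the price distribution, and `np.expm1` undoes the transform when predictions need to be reported in TND. A minimal round-trip sketch, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Hypothetical prices in TND, chosen only for illustration
prices = np.array([899.0, 1499.0, 2999.0, 8999.0])

log_prices = np.log1p(prices)      # log(1 + price), the scale the model is trained on
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse transform

print(log_prices.round(3))
print(recovered)
```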

In [None]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_log, test_size=0.1, random_state=42, shuffle=True
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
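
In the cells above, the encoder is fitted on every row before the split, so the encoded test features already carry information about test prices. One common way to avoid this is to split first and learn the encoding from the training rows only. The sketch below illustrates the idea with a hand-rolled, unsmoothed mean encoding on toy data. It is a simplification of what `ce.TargetEncoder` does (which also applies smoothing), and the brands and prices are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: hypothetical brands and prices, not taken from the dataset
toy = pd.DataFrame({
    "brand": ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"],
    "price": [1000, 1200, 2000, 2200, 3000, 3100, 1100, 2100, 2900, 1300],
})
toy["log_price"] = np.log1p(toy["price"])

# 1) Split BEFORE learning any encoding
train, test = train_test_split(toy, test_size=0.3, random_state=42)

# 2) Learn per-category target means from the training rows only
global_mean = train["log_price"].mean()
brand_means = train.groupby("brand")["log_price"].mean()

# 3) Apply the training-derived mapping to both splits;
#    categories unseen in training fall back to the global training mean
train_enc = train["brand"].map(brand_means).fillna(global_mean)
test_enc = test["brand"].map(brand_means).fillna(global_mean)

print(brand_means)
print(test_enc)
```

With `category_encoders`, the same pattern would be `fit_transform` on the training split followed by `transform` on the test split.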

## 🤖 Train Models using LazyRegressor

In [None]:
reg = LazyRegressor(verbose=1, ignore_warnings=True, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)

In [None]:
print("\nModel Comparison:")
models
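
`LazyRegressor` fits a large set of regressors with default settings and tabulates their test-set scores. Conceptually this resembles the loop below. It is a rough sketch on synthetic data with only three scikit-learn models, not LazyPredict's actual implementation, and the choice of models is an arbitrary assumption.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded laptop features
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "Model": name,
        "R-Squared": r2_score(y_te, pred),
        "RMSE": mean_squared_error(y_te, pred) ** 0.5,
    })

# Rank models by R², best first
leaderboard = pd.DataFrame(rows).set_index("Model").sort_values("R-Squared", ascending=False)
print(leaderboard)
```

Because every model runs with default hyperparameters, the ranking is a starting point for choosing which model to tune rather than a final verdict.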

## 📊 Fine-Tuning XGBoost Regressor with GridSearchCV

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'gamma': [0, 0.1, 0.3], 
}

xgb = XGBRegressor(random_state=42)

grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='r2', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
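
The grid above expands to every combination of the listed values, and each combination is fitted once per cross-validation fold. The size of the search can be checked up front with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'gamma': [0, 0.1, 0.3],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 4 * 3 * 3 combinations
n_fits = n_candidates * 3                      # cv=3 folds per combination

print(n_candidates, n_fits)  # 72 216
```

With the default `refit=True`, `GridSearchCV` then performs one additional fit of the best combination on the full training set.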

In [None]:
print("✅ Best hyperparameters:", grid_search.best_params_)
print(f"✅ Best CV R² score: {grid_search.best_score_:.2f}")

## 📈 Evaluate Model (XGBoost)

In [None]:
# Predict on test data
y_pred = grid_search.predict(X_test)

# Evaluation metrics (computed on log1p(price), so MAE and RMSE are in log units, not TND)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

# Print metrics
print("📈 Model Evaluation Metrics on Test Set (log scale):")
print(f"MAE (Mean Absolute Error): {mae:.2f}")
print(f"RMSE (Root Mean Squared Error): {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
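
Because the model is trained on `log1p(price)`, the MAE and RMSE computed above are in log units rather than dinars. To report errors in TND, both the targets and the predictions can be mapped back with `np.expm1` before scoring. A minimal sketch with made-up numbers (the arrays below are illustrative, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted prices on the log1p scale
y_true_log = np.log1p(np.array([1000.0, 2000.0, 4000.0]))
y_pred_log = np.log1p(np.array([1100.0, 1900.0, 4400.0]))

# Error on the log scale (the scale used by the evaluation cell)
mae_log = mean_absolute_error(y_true_log, y_pred_log)

# Errors in TND after inverting the transform
y_true_tnd = np.expm1(y_true_log)
y_pred_tnd = np.expm1(y_pred_log)
mae_tnd = mean_absolute_error(y_true_tnd, y_pred_tnd)
rmse_tnd = mean_squared_error(y_true_tnd, y_pred_tnd) ** 0.5

print(f"MAE (log scale): {mae_log:.3f}")
print(f"MAE (TND): {mae_tnd:.2f}")
print(f"RMSE (TND): {rmse_tnd:.2f}")
```

In the notebook, the same idea would apply `np.expm1` to `y_test` and `y_pred` before calling the metric functions.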

In [None]:
# Build a comparison table in TND (invert the log1p transform) with a clean 0..n-1 index
comparison_df = pd.DataFrame({
    'Actual': np.expm1(y_test.values),
    'Predicted': np.expm1(y_pred)
}).reset_index(drop=True)

plt.figure(figsize=(14, 8))
plt.plot(comparison_df['Actual'], label='Actual Price', linewidth=2)
plt.plot(comparison_df['Predicted'], label='Predicted Price (XGB)', linestyle='--', linewidth=2)
plt.xlabel("Sample Index")
plt.ylabel("Price (TND)")
plt.title("Actual vs Predicted Laptop Prices - XGBoost")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

### Actual vs Predicted & Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left plot: Actual vs Predicted scatter (both on the log1p scale used for training)
axes[0].scatter(y_test, y_pred, alpha=0.6, color='teal')
# Diagonal reference line: points on it are predicted exactly
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel("Actual log1p(price)")
axes[0].set_ylabel("Predicted log1p(price)")
axes[0].set_title("Actual vs Predicted Laptop Prices (log scale)")
axes[0].grid(True)

# Right plot: Distribution comparison histogram (log1p scale)
axes[1].hist(y_test, bins=40, alpha=0.5, label='Actual')
axes[1].hist(y_pred, bins=40, alpha=0.5, label='Predicted')
axes[1].legend()
axes[1].set_xlabel("log1p(price)")
axes[1].set_title("Distribution of Actual and Predicted Prices (log scale)")

plt.tight_layout()
plt.show()

### Residual Analysis

In [None]:
# Calculate residuals on the log1p scale (actual minus predicted)
residuals = y_test - y_pred

plt.figure(figsize=(10,4))

plt.subplot(1,2,1)
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted log1p(price)')
plt.ylabel('Residuals (log scale)')
plt.title('Residuals vs Predicted')

plt.subplot(1,2,2)
sns.histplot(residuals, bins=40, kde=True)
plt.xlabel('Residuals (log scale)')
plt.title('Residuals Distribution')

plt.tight_layout()
plt.show()
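
Since the target is `log1p(price)`, a residual here is the log of the ratio `(1 + actual) / (1 + predicted)`, so `exp(residual)` reads as a multiplicative error. A small sketch with hypothetical values:

```python
import numpy as np

actual = np.array([1000.0, 2000.0, 3000.0])      # hypothetical prices in TND
predicted = np.array([1100.0, 1800.0, 3000.0])   # hypothetical predictions in TND

residuals_log = np.log1p(actual) - np.log1p(predicted)
ratio = np.exp(residuals_log)          # equals (1 + actual) / (1 + predicted)
pct_error = (ratio - 1) * 100          # positive: model under-predicts

for a, p, r in zip(actual, predicted, pct_error):
    print(f"actual={a:.0f}, predicted={p:.0f}, error={r:+.1f}%")
```

For prices in the hundreds or thousands of dinars, the `+1` is negligible and the ratio is effectively `actual / predicted`.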

## 🎯 Feature Importance (XGBoost)

In [None]:
# Get the best estimator from grid search
best_xgb = grid_search.best_estimator_
 
# Feature importances
importances = best_xgb.feature_importances_
features = X_train.columns

# Sort features by importance (descending)
indices = np.argsort(importances)[::-1]

# Plot horizontal bar chart (most important feature at the top)
plt.figure(figsize=(6, 4))
plt.title("Feature Importances - XGBRegressor (Tuned Model)")
plt.barh(range(len(importances)), importances[indices][::-1], align='center')
plt.yticks(range(len(importances)), features[indices][::-1])
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
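
The `feature_importances_` attribute reflects how the fitted trees used each feature, so it can be worth cross-checking against a model-agnostic measure. One option is scikit-learn's `permutation_importance`, which measures how much the score drops when a column is shuffled. The sketch below uses synthetic data and a `RandomForestRegressor` as a stand-in for the tuned XGBoost model; the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: the target depends strongly on column 0, weakly on column 1,
# and not at all on column 2
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
feature_names = ["strong", "weak", "noise"]  # hypothetical names

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set several times and record the drop in R²
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

In the notebook, the equivalent call would pass `best_xgb`, `X_test`, and `y_test`.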

## ✅ Predict Samples Using the Trained Model

In [None]:
# Select a few random indices from test set
sample_indices = X_test.sample(10, random_state=64).index

# Retrieve original rows (before encoding)
sample_rows_original = X.loc[sample_indices]  # original values
sample_rows_encoded = X_encoded.loc[sample_indices]  # encoded version for prediction

# Predict on encoded rows
predicted_log_prices = grid_search.predict(sample_rows_encoded)
predicted_prices = np.expm1(predicted_log_prices)

# Actual prices (inverse log1p)
actual_prices = np.expm1(y_log.loc[sample_indices]).values

# Add results to original rows
comparison_df = sample_rows_original.copy()
comparison_df['Actual Price'] = actual_prices.round(2)
comparison_df['Predicted Price'] = predicted_prices.round(2)

# Display
display(comparison_df[[
    'brand', 'screen_size', 'processor', 'ram', 'SSD', 'HDD',
    'gpu', 'os', 'gamer', 'Actual Price', 'Predicted Price'
]])

In [None]:
sample = pd.DataFrame([{
    'brand': 'Lenovo',
    'screen_size': 15.6,
    'processor': 'AMD Ryzen 3',
    'ram': 8,
    'SSD': 0,
    'HDD': 512,
    'gpu': 'AMD',
    'os': 'FreeDos',
    'gamer': 0
}])

In [None]:
# Encode the sample row with the fitted target encoder
sample_encoded = encoder.transform(sample)

# Predict log price and inverse transform
predicted_log_price = grid_search.predict(sample_encoded)
predicted_price = np.expm1(predicted_log_price)

In [None]:
print(f"🤖 Predicted Price: {predicted_price[0]:.0f} TND")

## 🚀 Conclusion

In [None]:
# Save the best tuned model
joblib.dump(grid_search.best_estimator_, 'xgb_best_tuned_model.joblib')

# Save the fitted TargetEncoder
joblib.dump(encoder, 'target_encoder.joblib')
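
Both artifacts are needed at inference time: the encoder must transform raw features before the model sees them, and `np.expm1` must undo the log transform afterwards. The sketch below shows the save/load round trip with `joblib`, using a small scikit-learn model and a temporary directory as stand-ins; the file name and the toy data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on toy data (the notebook saves the tuned XGBoost model instead)
X_demo = np.array([[4.0], [8.0], [16.0], [32.0]])                 # e.g. RAM in GB
y_demo_log = np.log1p(np.array([900.0, 1500.0, 2600.0, 4800.0]))  # made-up prices
model = LinearRegression().fit(X_demo, y_demo_log)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model predicts log prices; expm1 converts them back to TND
new_row = np.array([[12.0]])
price_original = np.expm1(model.predict(new_row))
price_reloaded = np.expm1(reloaded.predict(new_row))
print(price_original, price_reloaded)
```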

- We successfully built a laptop price prediction model tailored to the Tunisian market using real data.

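- The tuned **XGBRegressor** model delivered strong performance on the held-out 10% test set, with **low error metrics (MAE, RMSE)** and a **solid R² score**, all measured on the log-transformed price.

- According to the XGBoost feature importances, key features influencing price include **Processor type**, **SSD capacity**, and **Gaming suitability** (the `gamer` flag).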

**<center>✨ Upvote & stay tuned for the next step ✨<br><br>🚀 Building a laptop recommendation system to help users find the best laptops tailored to their needs and budget. 😊💻</center>**