# Step 4: Model Building

Now that we have processed features (`X.csv`) and target (`y.csv`), we can train and evaluate machine learning models to predict **Laptop Prices**.  

### Goals of this notebook:
1. Load processed features and target.  
2. Split data into training and test sets.  
3. Train baseline regression models.  
4. Evaluate models using metrics (R², RMSE).  
5. Save the best-performing model for deployment.  


Author: Joseph prince 
Registration Number: 22/EG/CO/1774

In [16]:
# Step 1: Import libraries
import os
import re

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler


## Step 2: Load Processed Data
We load the features (`X.csv`) and target (`y.csv`) from Step 3.


In [17]:
# ==============================
# Step 2: Load fully processed data
# ==============================
X = pd.read_csv("../data/processed/X.csv")
y = pd.read_csv("../data/processed/y.csv")["Price"]

# Quick check: inspect column dtypes
print(X.dtypes)

# Quick check: print shapes to confirm correct loading
print("Features shape (X):", X.shape)
print("Target shape (y):", y.shape)


laptop_ID                  int64
Inches                   float64
Memory                    object
Ram_GB                     int64
Company_std               object
OpSys_std                 object
Weight_kg                float64
Screen_W                   int64
Screen_H                   int64
PPI                      float64
Company_Apple               bool
Company_Asus                bool
Company_Chuwi               bool
Company_Dell                bool
Company_Fujitsu             bool
Company_Google              bool
Company_HP                  bool
Company_Huawei              bool
Company_LG                  bool
Company_Lenovo              bool
Company_MSI                 bool
Company_Mediacom            bool
Company_Microsoft           bool
Company_Razer               bool
Company_Samsung             bool
Company_Toshiba             bool
Company_Vero                bool
Company_Xiaomi              bool
TypeName_Gaming             bool
TypeName_Netbook            bool
TypeName_N

## Step 3: Train-Test Split
We split the dataset into **75% training** and **25% testing**, using `random_state=24` so the split is reproducible.


In [18]:
# Keep raw features (Company_std, OpSys_std, Ram_GB, Weight_kg, Screen_W, Screen_H, etc.)
# Do NOT apply get_dummies here, the pipeline will handle encoding


# Split into training (75%) and testing (25%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=24
)

# Print sizes of splits
print("Training set size:", X_train.shape, y_train.shape)
print("Testing set size:", X_test.shape, y_test.shape)


Training set size: (977, 41) (977,)
Testing set size: (326, 41) (326,)


## Step 4: Baseline Model - Linear Regression
We start simple with a **Linear Regression** model to establish a baseline.


In [19]:
# --- Convert Memory to numeric Storage_GB ---
def convert_memory(mem_str):
    """Sum the capacities in a Memory string such as '128GB SSD + 1TB HDD', in GB."""
    total = 0
    for part in mem_str.split('+'):
        part = part.strip()
        match = re.search(r'[\d.]+', part)  # first number in this part
        if match:
            num = float(match.group())
            if 'TB' in part:
                total += num * 1024  # 1 TB counted as 1024 GB
            else:
                total += num
    return total

X_train['Storage_GB'] = X_train['Memory'].apply(convert_memory)
X_test['Storage_GB'] = X_test['Memory'].apply(convert_memory)

# Drop unnecessary columns safely
X_train = X_train.drop(columns=["Memory", "laptop_ID", "Company_std", "OpSys_std"], errors="ignore")
X_test = X_test.drop(columns=["Memory", "laptop_ID", "Company_std", "OpSys_std"], errors="ignore")


# --- Define numeric columns to scale ---
numerical_cols = ["Ram_GB", "Weight_kg", "Screen_W", "Screen_H", "PPI", "Inches", "Storage_GB"]

# Preprocessor: scale numeric, leave all boolean columns as-is
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols)
    ],
    remainder="passthrough"
)

# Build pipeline
linreg_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

# Fit model
linreg_pipeline.fit(X_train, y_train)

# Predict
y_pred_lr = linreg_pipeline.predict(X_test)

# Evaluate
r2_lr = r2_score(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

print("Linear Regression Results:")
print("R² Score:", r2_lr)
print("RMSE:", rmse_lr)

Linear Regression Results:
R² Score: 0.7645748365980322
RMSE: 339.52295830518096
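
To make the two metrics concrete, here is a minimal sketch that computes RMSE and R² by hand and compares them with the scikit-learn helpers used above. The arrays are toy values chosen for illustration, not predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values for illustration only (not taken from the laptop dataset)
y_true = np.array([500.0, 1200.0, 900.0, 1500.0])
y_hat = np.array([550.0, 1100.0, 950.0, 1400.0])

# RMSE: square root of the mean squared residual, in the same units as the target
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The scikit-learn helpers should agree with the manual formulas
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print("RMSE:", rmse_manual, rmse_sklearn)
print("R²:", r2_manual, r2_sklearn)
```

Read this way, the RMSE reported above is a typical prediction error in the same units as `Price`, while R² is the share of price variance the model explains on the test set.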


## Step 5: Regularized Models (Ridge & Lasso)
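Next we try two regularized linear models. **Ridge** (L2 penalty) shrinks coefficients, which stabilizes the fit when features are correlated. **Lasso** (L1 penalty) can drive some coefficients exactly to zero, so it also acts as a form of feature selection.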


In [21]:
# --- Define numeric columns ---
numerical_cols = ["Ram_GB", "Weight_kg", "Screen_W", "Screen_H", "PPI", "Inches", "Storage_GB"]

# Preprocessor: scale numeric, leave all boolean columns as-is
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols)
    ],
    remainder="passthrough"
)

# --- Ridge pipeline ---
ridge_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", Ridge(alpha=1.0))
])

ridge_pipeline.fit(X_train, y_train)
y_pred_ridge = ridge_pipeline.predict(X_test)

# --- Lasso pipeline ---
lasso_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", Lasso(alpha=0.01, max_iter=10000))
])

lasso_pipeline.fit(X_train, y_train)
y_pred_lasso = lasso_pipeline.predict(X_test)

# --- Results ---
print("Ridge Regression → R²:", r2_score(y_test, y_pred_ridge),
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))

print("Lasso Regression → R²:", r2_score(y_test, y_pred_lasso),
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))

Ridge Regression → R²: 0.7602772222379485 RMSE: 342.6078872523094
Lasso Regression → R²: 0.7644946044191632 RMSE: 339.5808075720957
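
The `alpha` values above are fixed by hand. One possible way to choose `alpha` is a cross-validated grid search over the pipeline, sketched below. This is an illustration under assumptions: the data is synthetic (`make_regression`), and the grid of candidate values is hypothetical rather than tuned for the laptop data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the laptop dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=24)

ridge_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Hypothetical grid; "model__alpha" addresses the alpha parameter of the "model" step
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training data only, scored by R²
search = GridSearchCV(ridge_demo, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["model__alpha"])
print("Best mean CV R²:", search.best_score_)
```

The same pattern would apply to the Lasso pipeline by swapping the estimator in the `"model"` step.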


## Step 6: Ensemble Model (Random Forest)
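Tree-based ensembles such as **Random Forest** can capture non-linear effects and feature interactions that linear models miss, and they often perform well on tabular datasets.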


In [23]:
# Detect categorical (object) columns that still exist in X_train
categorical_cols = X_train.select_dtypes(include=["object"]).columns.tolist()

# Initialize OrdinalEncoder
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Encode both train and test (only if there are categorical columns)
if categorical_cols:
    X_train[categorical_cols] = encoder.fit_transform(X_train[categorical_cols].astype(str))
    X_test[categorical_cols] = encoder.transform(X_test[categorical_cols].astype(str))

# Make sure all features are numeric
X_train = X_train.apply(pd.to_numeric, errors='coerce').fillna(-1)
X_test  = X_test.apply(pd.to_numeric, errors='coerce').fillna(-1)

# Initialize Random Forest with 200 trees
rf = RandomForestRegressor(n_estimators=200, random_state=42)

# Train model
rf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test)

# Evaluate performance
print("Random Forest Results → R²:", r2_score(y_test, y_pred_rf), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))

Random Forest Results → R²: 0.8443008408282981 RMSE: 276.1123546949489
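
A fitted random forest also exposes `feature_importances_`, which can help explain which inputs drive its predictions. The sketch below uses synthetic data in which the price is constructed to depend mostly on RAM. The column names mirror a few of this notebook's features, but the values and the price formula are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical values, not the laptop dataset)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "Ram_GB": rng.choice([4, 8, 16, 32], size=300),
    "Weight_kg": rng.uniform(1.0, 3.0, size=300),
    "Inches": rng.choice([13.3, 14.0, 15.6, 17.3], size=300),
})
# Hypothetical price driven mostly by RAM, so Ram_GB should rank first
y_demo = 60 * X_demo["Ram_GB"] + 20 * X_demo["Weight_kg"] + rng.normal(0, 25, size=300)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one per input column, normalized to sum to 1
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

On the real data, the same two lines applied to `rf` and `X_train.columns` would rank the notebook's own features.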


## Step 7: Cross-Validation
We use **cross-validation** on the training set for more robust evaluation.


In [None]:
# Perform cross-validation on Random Forest
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")

print("Random Forest CV R² scores:", cv_scores)
print("Mean CV R²:", np.mean(cv_scores))

# Feature names the saved model receives as input.
# Note: `pipeline` is defined and fitted in Step 8, so that cell must be run first.
print(pipeline.named_steps["preprocessor"].get_feature_names_out())



Random Forest CV R² scores: [0.81625161 0.66809524 0.80377929 0.79687999 0.77311206]
Mean CV R²: 0.7716236378289423
['num__Inches' 'num__Ram_GB' 'num__Weight_kg' 'num__Screen_W'
 'num__Screen_H' 'num__PPI' 'num__Storage_GB']
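
Cross-validation can also compare several candidate models on the same folds, and it can report RMSE instead of R². The following is a minimal sketch on synthetic data; the dictionary of candidates and the fold settings are assumptions for illustration, not the configuration used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the laptop dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=24)

# Hypothetical candidates; scaling sits inside the pipeline so it is refit on each fold
candidates = {
    "linear": Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())]),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Shuffled folds with a fixed seed so every model is scored on identical splits
cv = KFold(n_splits=5, shuffle=True, random_state=24)

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so RMSE is returned negated
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.2f}")
```

Keeping the preprocessing inside each pipeline means the scaler never sees the held-out fold, which avoids leaking validation data into training.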


## Step 8: Save Best Model
We will save the **Random Forest model** (best performer) to disk.  
It can later be loaded in **Step 5 (Model Evaluation/Deployment)**.  


In [26]:
# Detect categorical (object) columns remaining in X_train
categorical_cols = X_train.select_dtypes(include=["object"]).columns.tolist()

# Detect numerical columns in X_train (boolean one-hot columns do not match these dtypes)
numerical_cols = X_train.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Preprocessor: OneHotEncode categorical, passthrough numerical.
# Note: ColumnTransformer drops unlisted columns by default (remainder="drop"),
# so the boolean Company_*/TypeName_* columns are not passed to the saved model,
# which matches the feature names printed in Step 7.
# Setting remainder="passthrough" would keep them.
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)

# Build pipeline (preprocessor + RandomForest)
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=200, random_state=42))
])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Save pipeline
os.makedirs("models", exist_ok=True)
joblib.dump(pipeline, "models/laptop_price_pipeline.pkl")

print("Pipeline saved to models/laptop_price_pipeline.pkl")

Pipeline saved to models/laptop_price_pipeline.pkl
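
For the later deployment step, the saved file can be restored with `joblib.load`. The sketch below shows the save and load round trip on a small stand-in pipeline written to a temporary directory. The synthetic data, the pipeline steps, and the file name `demo_pipeline.pkl` are assumptions for illustration; in this notebook the path would be `models/laptop_price_pipeline.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and pipeline (assumption, not the laptop model)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=24)
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary directory, then load the pipeline back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(demo_pipeline, model_path)
    loaded_pipeline = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions
preds_before = demo_pipeline.predict(X_demo[:5])
preds_after = loaded_pipeline.predict(X_demo[:5])
print("Predictions match after reload:", (preds_before == preds_after).all())
```

A loaded pipeline expects the same input columns it was fitted on, and pickled models are generally only safe to load with a compatible scikit-learn version.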
