# Step 4: Model Building

Now that we have processed features (`X.csv`) and target (`y.csv`), we can train and evaluate machine learning models to predict **Laptop Prices**.  

### Goals of this notebook:
1. Load processed features and target.  
2. Split data into training and test sets.  
3. Train baseline regression models.  
4. Evaluate models using metrics (R², RMSE).  
5. Save the best-performing model for deployment.  


Author: Joseph Prince  
Registration Number: 22/EG/CO/1774

In [8]:
# Step 1: Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

import joblib


## Step 2: Load Processed Data
We load the features (`X.csv`) and target (`y.csv`) from Step 3.


In [14]:
# ==============================
# Step 2: Load fully processed data
# ==============================
X = pd.read_csv("../data/processed/X.csv")
y = pd.read_csv("../data/processed/y.csv")["Price"]

# Quick check: inspect column dtypes to spot non-numeric features
print(X.dtypes)

# Quick check: print shapes to confirm correct loading
print("Features shape (X):", X.shape)
print("Target shape (y):", y.shape)


laptop_ID                  int64
Inches                   float64
Memory                    object
SSD_Flag                   int64
HDD_Flag                   int64
Storage_GB               float64
Ram_GB                     int64
Company_std               object
OpSys_std                 object
Weight_kg                float64
Screen_W                   int64
Screen_H                   int64
PPI                      float64
Company_Apple               bool
Company_Asus                bool
Company_Chuwi               bool
Company_Dell                bool
Company_Fujitsu             bool
Company_Google              bool
Company_HP                  bool
Company_Huawei              bool
Company_LG                  bool
Company_Lenovo              bool
Company_MSI                 bool
Company_Mediacom            bool
Company_Microsoft           bool
Company_Razer               bool
Company_Samsung             bool
Company_Toshiba             bool
Company_Vero                bool
Company_Xi

## Step 3: Train-Test Split
We split the dataset into **75% training** and **25% testing**, using `random_state=24` so the split is reproducible.


In [17]:
# One-hot encode any remaining object-type columns (e.g. Memory, Company_std, OpSys_std)
X = pd.get_dummies(X, drop_first=True)


# Split into training (75%) and testing (25%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=24
)

# Print sizes of splits
print("Training set size:", X_train.shape, y_train.shape)
print("Testing set size:", X_test.shape, y_test.shape)


Training set size: (977, 94) (977,)
Testing set size: (326, 94) (326,)
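
The printed sizes are consistent with a 75/25 split of 1,303 rows. A minimal check of the arithmetic, assuming scikit-learn rounds the test share up when `test_size` is a fraction and gives the remaining rows to the training set:

```python
import math

n_samples = 977 + 326  # total rows implied by the printed split sizes
test_size = 0.25

# Assumption: the test count is rounded up and the training set takes the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Train rows:", n_train, "Test rows:", n_test)
```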


## Step 4: Baseline Model - Linear Regression
We start simple with a **Linear Regression** model to establish a baseline.


In [18]:
# Fill missing feature values with 0 (a simple placeholder imputation)
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

# Initialize Linear Regression model
lin_reg = LinearRegression()

# Train (fit) the model
lin_reg.fit(X_train, y_train)

# Predict on the test set
y_pred_lr = lin_reg.predict(X_test)

# Evaluate model performance
r2_lr = r2_score(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

# Print results
print("Linear Regression Results:")
print("R² Score:", r2_lr)
print("RMSE:", rmse_lr)


Linear Regression Results:
R² Score: 0.8058710133903709
RMSE: 308.3102048187436
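
To make the two metrics concrete, here is a small sketch that computes them by hand on toy numbers. The helper names `rmse_manual` and `r2_manual` are hypothetical and the values are not taken from the laptop dataset. In the notebook itself, `r2_score` and `mean_squared_error` do this work.

```python
import math

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_manual(y_true, y_pred):
    """R²: 1 minus the residual sum of squares over the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values for illustration only
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 11.0]

print("RMSE:", rmse_manual(y_true_toy, y_pred_toy))
print("R²:", r2_manual(y_true_toy, y_pred_toy))
```

RMSE is expressed in the same units as `Price`, so it reads as a typical prediction error, while R² is the share of the target's variance that the model explains.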


## Step 5: Regularized Models (Ridge & Lasso)
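Regularization can stabilize coefficients when features are correlated. We try **Ridge** (L2 penalty, which shrinks coefficients) and **Lasso** (L1 penalty, which can set some coefficients to exactly zero and so acts as feature selection).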


In [19]:
# Ridge Regression
ridge = Ridge(alpha=1.0)  # alpha = regularization strength
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)

# Lasso Regression
lasso = Lasso(alpha=0.01)  # small alpha keeps more features
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

# Print results
print("Ridge Regression → R²:", r2_score(y_test, y_pred_ridge), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))

print("Lasso Regression → R²:", r2_score(y_test, y_pred_lasso), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))


Ridge Regression → R²: 0.809222042218343 RMSE: 305.6376162056968
Lasso Regression → R²: 0.8126399286112254 RMSE: 302.88741423820636




## Step 6: Ensemble Model (Random Forest)
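Tree-based ensembles often outperform linear models on tabular data because they can capture non-linear relationships and feature interactions without manual feature engineering.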


In [20]:
# Initialize Random Forest with 200 trees
rf = RandomForestRegressor(n_estimators=200, random_state=42)

# Train model
rf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test)

# Evaluate performance
print("Random Forest Results → R²:", r2_score(y_test, y_pred_rf), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))


Random Forest Results → R²: 0.8567411491530649 RMSE: 264.8521102572479
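
With all four models evaluated on the same test set, a side-by-side ranking makes the comparison explicit. The sketch below copies the metrics printed above (rounded) into a plain dictionary. The `results` structure is an illustrative assumption, not part of the original notebook.

```python
# (R², RMSE) on the test set, copied from the outputs above and rounded
results = {
    "Linear Regression": (0.8059, 308.31),
    "Ridge": (0.8092, 305.64),
    "Lasso": (0.8126, 302.89),
    "Random Forest": (0.8567, 264.85),
}

# Rank by RMSE, lowest error first
ranked = sorted(results.items(), key=lambda item: item[1][1])
for name, (r2, rmse) in ranked:
    print(f"{name:<18} R²={r2:.4f}  RMSE={rmse:.2f}")

best_name = ranked[0][0]
```

The three linear models land within a few RMSE units of each other, while the Random Forest reduces the error noticeably.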


## Step 7: Cross-Validation
We use 5-fold **cross-validation** on the training set to check that the Random Forest's performance does not depend on one particular split.


In [21]:
# Perform cross-validation on Random Forest
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")

print("Random Forest CV R² scores:", cv_scores)
print("Mean CV R²:", np.mean(cv_scores))


Random Forest CV R² scores: [0.82229354 0.72268022 0.80167945 0.79692926 0.77910479]
Mean CV R²: 0.7845374504489964
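
The fold scores vary, so it helps to report their spread alongside the mean and to compare both with the single test-set score. A small sketch using only the numbers printed above (variable names are illustrative):

```python
import statistics

# Fold scores copied from the cross-validation output above
cv_scores_reported = [0.82229354, 0.72268022, 0.80167945, 0.79692926, 0.77910479]
test_r2_reported = 0.8567411491530649  # Random Forest R² on the held-out test set

cv_mean = statistics.mean(cv_scores_reported)
cv_std = statistics.pstdev(cv_scores_reported)  # population std, like np.std's default

print(f"CV R²: {cv_mean:.4f} ± {cv_std:.4f}")
print(f"Test R² minus CV mean: {test_r2_reported - cv_mean:.4f}")
```

The test-set R² sits above the cross-validated mean, which suggests the single 75/25 split may be somewhat favorable. The cross-validated figure is probably the more conservative estimate of how the model generalizes.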


## Step 8: Save Best Model
We save the **Random Forest model**, which had the highest test R² and lowest test RMSE of the models tried, to disk.  
It can later be loaded in **Step 5 (Model Evaluation/Deployment)**.  


In [24]:
import os
import joblib

# Create models directory if it doesn't exist
os.makedirs("../models", exist_ok=True)

# Now save the model
joblib.dump(rf, "../models/random_forest_model.pkl")

print("✅ Model saved to ../models/random_forest_model.pkl")


✅ Model saved to ../models/random_forest_model.pkl
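
For the later step, the saved file can be read back with `joblib.load`. The sketch below demonstrates the round trip on a small synthetic model written to a temporary directory, so it does not touch `../models`. The data and variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest_model.pkl")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

When the real model is loaded, new data needs the same columns in the same order as `X_train` after one-hot encoding. It is also safest to load the file with the same scikit-learn version that saved it.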
