# Step 4: Model Building

Now that we have processed features (`X.csv`) and target (`y.csv`), we can train and evaluate machine learning models to predict **Laptop Prices**.  

### Goals of this notebook:
1. Load processed features and target.  
2. Split data into training and test sets.  
3. Train baseline regression models.  
4. Evaluate models using metrics (R², RMSE).  
5. Save the best-performing model for deployment.  
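
Both metrics in goal 4 can be reproduced by hand, which helps when reading the numbers later in the notebook. The sketch below uses a handful of made-up prices (an assumption for illustration, not values from the laptop dataset) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical prices for illustration only (not from the laptop dataset)
y_true_demo = np.array([500.0, 750.0, 1200.0, 1800.0])
y_hat_demo = np.array([550.0, 700.0, 1100.0, 1900.0])

# RMSE: square root of the mean squared residual, in the same units as the target
rmse_manual = np.sqrt(np.mean((y_true_demo - y_hat_demo) ** 2))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true_demo - y_hat_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree with scikit-learn's implementations
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_hat_demo))
r2_sklearn = r2_score(y_true_demo, y_hat_demo)

print("RMSE (manual vs sklearn):", rmse_manual, rmse_sklearn)
print("R² (manual vs sklearn):", r2_manual, r2_sklearn)
```

RMSE is expressed in price units, so lower is better; R² is unitless, and a value near 1 means the model explains most of the variance in the target.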


In [1]:
# Step 1: Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

import joblib


## Step 2: Load Processed Data
We load the features (`X.csv`) and target (`y.csv`) from Step 3.


In [None]:
# Load the features (independent variables)
X = pd.read_csv("../data/processed/X.csv")

# Load the target variable (dependent variable - Price)
y = pd.read_csv("../data/processed/y.csv")["Price"]

# Quick check: print shapes to confirm correct loading
print("Features shape (X):", X.shape)
print("Target shape (y):", y.shape)


## Step 3: Train-Test Split
We split the dataset into **75% training** and **25% testing**, using `random_state=24` so the split is reproducible.


In [None]:

# Split into training (75%) and testing (25%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=24
)

# Print sizes of splits
print("Training set size:", X_train.shape, y_train.shape)
print("Testing set size:", X_test.shape, y_test.shape)


## Step 4: Baseline Model - Linear Regression
We start simple with a **Linear Regression** model to establish a baseline.


In [None]:
# Initialize Linear Regression model
lin_reg = LinearRegression()

# Train (fit) the model on training data
lin_reg.fit(X_train, y_train)

# Predict laptop prices on the test set
y_pred_lr = lin_reg.predict(X_test)

# Evaluate model performance
r2_lr = r2_score(y_test, y_pred_lr)   # Coefficient of determination (R²)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))  # Root Mean Squared Error

# Print results
print("Linear Regression Results:")
print("R² Score:", r2_lr)
print("RMSE:", rmse_lr)


## Step 5: Regularized Models (Ridge & Lasso)
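We try two regularized variants: **Ridge** shrinks coefficients to stabilize the fit under multicollinearity, while **Lasso** can drive some coefficients to exactly zero, which acts as feature selection.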
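
The difference between the two penalties is easiest to see by counting coefficients that end up exactly zero. The sketch below is a minimal illustration on synthetic data from `make_regression`; the dataset, the `alpha` values, and the `_demo` names are assumptions chosen for the example, not settings from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, of which only 5 actually influence the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Ridge shrinks coefficients toward zero but leaves them non-zero;
# Lasso sets some coefficients to exactly zero (implicit feature selection)
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Ridge coefficients equal to zero:", n_zero_ridge)
print("Lasso coefficients equal to zero:", n_zero_lasso)
```

Both penalties depend on feature scale, so on real data it is usually worth confirming that the features were standardized during preprocessing before comparing `alpha` values.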


In [None]:
# Ridge Regression
ridge = Ridge(alpha=1.0)  # alpha = regularization strength
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)

# Lasso Regression
lasso = Lasso(alpha=0.01)  # small alpha keeps more features
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

# Print results
print("Ridge Regression → R²:", r2_score(y_test, y_pred_ridge), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))

print("Lasso Regression → R²:", r2_score(y_test, y_pred_lasso), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))


## Step 6: Ensemble Model (Random Forest)
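Tree-based ensembles often outperform linear models on tabular datasets because they can capture non-linear relationships and feature interactions without manual feature engineering.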
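
As a rough illustration of why this can happen, the sketch below fits a linear model and a random forest to a synthetic target with a quadratic term and a step. The data and the `_demo` names are assumptions made up for this example; the laptop data may behave differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately non-linear target: a quadratic term plus a step
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + 2.0 * (X_demo[:, 1] > 0) + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=24
)

linear_demo = LinearRegression().fit(X_tr, y_tr)
forest_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

r2_linear = r2_score(y_te, linear_demo.predict(X_te))
r2_forest = r2_score(y_te, forest_demo.predict(X_te))

print("Linear Regression R²:", r2_linear)
print("Random Forest R²:", r2_forest)
```

On data where the relationship really is close to linear, the ordering can reverse, which is why the notebook compares both families on the same test set.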


In [None]:
# Initialize Random Forest with 200 trees
rf = RandomForestRegressor(n_estimators=200, random_state=42)

# Train model
rf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test)

# Evaluate performance
print("Random Forest Results → R²:", r2_score(y_test, y_pred_rf), 
      "RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))


## Step 7: Cross-Validation
We use **5-fold cross-validation** on the training set for a more robust estimate than a single train-test split provides.
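
The same procedure can be applied to every candidate model so that they are compared on equal terms. The sketch below is one possible way to do that, scoring by RMSE through scikit-learn's `neg_root_mean_squared_error` scorer. It runs on synthetic data (an assumption, so the block is self-contained); in this notebook you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (replace with X_train, y_train)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=15.0, random_state=24
)

candidates = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # The scorer returns negative RMSE (higher is better), so flip the sign
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.2f} (std {scores.std():.2f})")
```

`cross_val_score` fits clones of each estimator, so the objects in `candidates` remain unfitted after the loop.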


In [None]:
# Perform cross-validation on Random Forest
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")

print("Random Forest CV R² scores:", cv_scores)
print("Mean CV R²:", np.mean(cv_scores))


## Step 8: Save Best Model
We will save the **Random Forest model** (best performer) to disk.  
It can later be loaded in the follow-up notebook, **Step 5 (Model Evaluation/Deployment)**.  
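
The save and load round trip can be sketched as follows. This example trains a small forest on synthetic data and writes it to a temporary directory, both of which are assumptions made so the block runs anywhere; the real notebook uses the fitted `rf` and the `../models/` path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model, standing in for the fitted `rf` from this notebook
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=1.0, random_state=0
)
model_demo = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "random_forest_model.pkl"
    joblib.dump(model_demo, model_path)  # serialize the fitted model to disk
    restored_demo = joblib.load(model_path)  # read it back as a new object

# The restored model should reproduce the original predictions
same_predictions = np.allclose(
    model_demo.predict(X_demo), restored_demo.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so it helps to record that version alongside the file.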


In [2]:
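# Save the trained Random Forest model to the models directory.
# Run the Step 6 cell first so that `rf` is defined and fitted.
import os

os.makedirs("../models", exist_ok=True)  # create the directory if it is missing
joblib.dump(rf, "../models/random_forest_model.pkl")

print("✅ Model saved to ../models/random_forest_model.pkl")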

