# 03 – Model Training & Evaluation  
## Sustainability Score Prediction in Smart Manufacturing

In this notebook, we train machine learning models to predict a **Sustainability Score** for manufacturing processes using energy consumption, tool condition, and process parameters.

This step converts the project into a **supervised machine learning regression problem**.


Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

import joblib


## Load the Feature-Engineered Dataset

This dataset was created in **feature_engineering.ipynb** and includes the calculated sustainability score.


Load Data

In [2]:
df = pd.read_csv("../Dataset/processed/manufacturing_features.csv")
df.head()


|   | rotational_speed_rpm | torque_nm | tool_wear_rate | temp_diff_k | energy_intensity | sustainability_score |
|---|---|---|---|---|---|---|
| 0 | 0.222934 | 0.535714 | 0.0 | 0.644444 | 0.572356 | 0.577724 |
| 1 | 0.139697 | 0.583791 | 0.011858 | 0.644444 | 0.560224 | 0.57902 |
| 2 | 0.192084 | 0.626374 | 0.019763 | 0.622222 | 0.642702 | 0.550324 |
| 3 | 0.154249 | 0.490385 | 0.027668 | 0.622222 | 0.467367 | 0.618086 |
| 4 | 0.139697 | 0.497253 | 0.035573 | 0.644444 | 0.469968 | 0.608008 |


## Define Features and Target Variable

- **Features (X):** Manufacturing process parameters  
- **Target (y):** Sustainability Score


Split X and y

In [4]:
X = df.drop(columns=["sustainability_score"])
y = df["sustainability_score"]
X,y

(      rotational_speed_rpm  torque_nm  tool_wear_rate  temp_diff_k  \
 0                 0.222934   0.535714        0.000000     0.644444   
 1                 0.139697   0.583791        0.011858     0.644444   
 2                 0.192084   0.626374        0.019763     0.622222   
 3                 0.154249   0.490385        0.027668     0.622222   
 4                 0.139697   0.497253        0.035573     0.644444   
 ...                    ...        ...             ...          ...   
 9995              0.253783   0.353022        0.055336     0.444444   
 9996              0.270081   0.384615        0.067194     0.422222   
 9997              0.277648   0.406593        0.086957     0.444444   
 9998              0.139697   0.614011        0.098814     0.466667   
 9999              0.193248   0.500000        0.118577     0.466667   
 
       energy_intensity  
 0             0.572356  
 1             0.560224  
 2             0.642702  
 3             0.467367  
 4             0

## Train-Test Split

We split the dataset to evaluate the model's performance on unseen data.


Train-Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [6]:
X_test.head()  # not rendered: a notebook cell displays only its last expression
y_test.head()

6252    0.475333
4684    0.679128
1731    0.537079
4742    0.831600
4521    0.723383
Name: sustainability_score, dtype: float64
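
A single 80/20 split yields one estimate of performance on unseen data. As a complementary check, k-fold cross-validation could be sketched as follows. This is a minimal illustration on synthetic data: `X_demo`, `y_demo`, and the `make_regression` settings are assumptions made for the example, not the project's dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target (assumption: 5 features).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=0.1, random_state=42
)

# 5-fold cross-validation; shuffling keeps the folds independent of row order.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

# scikit-learn reports negated MSE, so flip the sign before taking the root.
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

The spread of the per-fold RMSE values gives a rough sense of how much the score depends on which rows land in the test set.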

## Feature Scaling

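Features are standardized with `StandardScaler` for models that are sensitive to feature magnitude. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.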


Scaling

In [8]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled,X_test_scaled

(array([[ 0.42763418, -0.89269644,  1.37503539,  0.80635564, -0.48146165],
        [-0.83494457,  1.38218727,  0.45762016,  0.20447605,  1.37304213],
        [-0.05967692, -0.89269644,  1.35921788, -0.39740354, -1.21499174],
        ...,
        [-0.30887009,  0.72076734,  1.81792549, -0.4977168 ,  0.55378104],
        [ 0.01231222, -0.74237372, -1.18740025,  0.80635564, -0.48338311],
        [ 1.49085839, -1.42383669, -1.15576524,  0.20447605, -1.04899385]],
       shape=(8000, 5)),
 array([[-4.30065887e-03, -3.91620730e-01,  1.42248790e+00,
         -4.97716802e-01, -6.11295589e-01],
        [-6.52202912e-01,  4.80251001e-01, -1.11809981e-01,
         -1.80178924e+00, -6.57371150e-01],
        [-2.97794842e-01,  1.99648605e-01,  1.41270082e-01,
         -3.97403537e-01, -7.28190909e-03],
        ...,
        [-1.22811603e+00,  1.65276816e+00, -1.59865535e+00,
          1.40823523e+00,  2.10672827e+00],
        [ 6.54671101e+00, -3.07738652e+00, -1.42466280e+00,
          6.05729110e-
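
The fit-on-train, transform-on-test pattern used in the scaling cell can also be bundled into a scikit-learn `Pipeline`, which applies the same steps automatically at prediction time. The sketch below uses synthetic data, and the names `X_demo`, `pipe`, and `scaler_step` are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 5 features, as in the notebook).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=0.5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# predict() and score() reuse those statistics on new rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

scaler_step = pipe.named_steps["standardscaler"]
means_match = np.allclose(scaler_step.mean_, X_tr.mean(axis=0))
test_r2 = pipe.score(X_te, y_te)
print("Scaler means equal training means:", means_match)
print("Test R2:", test_r2)
```

Saving one pipeline object instead of a separate model and scaler also removes the risk of forgetting the scaling step at inference time.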

## Model 1: Linear Regression (Baseline Model)

Linear Regression is used as a simple and interpretable baseline.


Train Linear Regression

In [9]:
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

y_pred_lr = lr_model.predict(X_test_scaled)
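
One reason a linear baseline is considered interpretable is that its fitted coefficients can be read directly. The sketch below shows one way to inspect them. It uses random data with the notebook's column names, and the weights in `true_weights` are made up for illustration; they are not the project's scoring formula.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_names = [
    "rotational_speed_rpm", "torque_nm", "tool_wear_rate",
    "temp_diff_k", "energy_intensity",
]

# Random features in [0, 1) and a hypothetical linear target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((200, 5)), columns=feature_names)
true_weights = np.array([0.1, 0.2, -0.3, -0.1, -0.5])  # made-up weights
y_demo = X_demo.to_numpy() @ true_weights + 0.8

model = LinearRegression().fit(X_demo, y_demo)

# Rank features by the absolute size of their coefficient.
coef_table = pd.Series(model.coef_, index=feature_names).sort_values(
    key=np.abs, ascending=False
)
print(coef_table)
print("Intercept:", model.intercept_)
```

When the features are standardized first, as in this notebook, each coefficient is the change in the prediction per one standard deviation of that feature, which makes the magnitudes comparable across features.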


## Linear Regression Evaluation


Evaluate Linear Regression

In [10]:
print("Linear Regression Performance")
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
print("R2 Score:", r2_score(y_test, y_pred_lr))


Linear Regression Performance
RMSE: 1.1712063625127576e-16
R2 Score: 1.0
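
An RMSE at the level of floating-point precision usually means the target is a deterministic linear function of the features. The sketch below illustrates that effect on synthetic data by comparing an exact linear target with a noisy one. The weights, noise level, and the helper `holdout_rmse` are hypothetical and chosen only for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.random((1000, 5))
weights = np.array([0.2, 0.1, -0.3, -0.1, -0.4])  # hypothetical weights

y_exact = X_demo @ weights + 0.9                      # deterministic formula
y_noisy = y_exact + rng.normal(0.0, 0.05, size=1000)  # same formula plus noise


def holdout_rmse(X, y):
    """Fit a linear model on 80% of the rows and return RMSE on the rest."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))


rmse_exact = holdout_rmse(X_demo, y_exact)
rmse_noisy = holdout_rmse(X_demo, y_noisy)
print("RMSE, exact linear target:", rmse_exact)
print("RMSE, noisy target:", rmse_noisy)
```

If the real sustainability score behaves like `y_exact` here, the linear model is recovering the scoring formula rather than learning from noisy measurements.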


## Model 2: Random Forest Regressor

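Random Forest captures complex non-linear relationships and is widely used in industry. Tree-based models split on feature thresholds and do not depend on feature scale, so the model is trained on the unscaled `X_train`.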


Train Random Forest

In [11]:
rf_model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)


## Random Forest Evaluation


Evaluate Random Forest

In [12]:
print("Random Forest Performance")
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("R2 Score:", r2_score(y_test, y_pred_rf))


Random Forest Performance
RMSE: 0.007486252846414125
R2 Score: 0.9964562030181582
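
For reference, both reported metrics can be computed by hand: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks the manual formulas against scikit-learn. The arrays `y_true` and `y_hat` hold made-up values, not predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only.
y_true = np.array([0.50, 0.62, 0.71, 0.55, 0.80])
y_hat = np.array([0.52, 0.60, 0.70, 0.58, 0.77])

residuals = y_true - y_hat

# RMSE: root of the mean squared residual, in the units of the target.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: share of the target's variance explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print("RMSE manual vs sklearn:", rmse_manual, rmse_sklearn)
print("R2 manual vs sklearn:", r2_manual, r2_sklearn)
```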


## Model Comparison


Comparison Table

In [13]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest"],
    "RMSE": [
        np.sqrt(mean_squared_error(y_test, y_pred_lr)),
        np.sqrt(mean_squared_error(y_test, y_pred_rf))
    ],
    "R2 Score": [
        r2_score(y_test, y_pred_lr),
        r2_score(y_test, y_pred_rf)
    ]
})

results


|   | Model | RMSE | R2 Score |
|---|---|---|---|
| 0 | Linear Regression | 1.171206e-16 | 1.0 |
| 1 | Random Forest | 0.007486253 | 0.996456 |


## Best Model Selection

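Random Forest is selected for deployment. On this test set, however, Linear Regression scores better (RMSE ≈ 1.2e-16 and R² = 1.0, versus RMSE ≈ 0.0075 and R² ≈ 0.9965 for Random Forest). An error this close to zero suggests that the sustainability score, which was calculated during feature engineering, is an exact linear combination of the input features. Both results therefore measure how well each model reproduces that formula rather than how well it predicts an independently measured outcome.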


Save Model

In [15]:
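best_model = rf_model

# rf_model was fit on unscaled X_train, so pass unscaled features at inference.
joblib.dump(best_model, "../models/sustainability_model.pkl")
# The scaler is only needed for the Linear Regression baseline.
joblib.dump(scaler, "../models/scaler.pkl")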


['../models/scaler.pkl']
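
Before wiring a saved model into a dashboard, it can help to confirm that the file loads and reproduces the in-memory predictions. The sketch below shows such a round trip with `joblib` on a small synthetic model. The temporary directory, the file name `demo_model.pkl`, and the data are placeholders, not the project's artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the trained random forest.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    joblib.dump(model, str(model_path))
    restored = joblib.load(str(model_path))

# The forest was trained on unscaled features, so new rows are passed as-is.
new_rows = rng.random((3, 5))
restored_predictions = restored.predict(new_rows)
same_predictions = np.allclose(model.predict(new_rows), restored_predictions)
print("Restored model reproduces predictions:", same_predictions)
```

Pickled scikit-learn models are generally tied to the library version that produced them, so recording the scikit-learn version alongside the file is a sensible precaution.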

## Summary

- Framed sustainability score prediction as a supervised regression problem
- Compared a Linear Regression baseline with a Random Forest
- Selected Random Forest for deployment
- Saved the model and scaler for dashboard deployment

Next step → **Model Explainability (SHAP)** and **Streamlit Dashboard**
