# Review Score Prediction – Regression Model

This notebook focuses on building a **regression model** to predict the exact review score (ranging from 1 to 5) based on customer review text and other engineered features.

#### Objective:
To train and evaluate a machine learning model that predicts the numerical **review score** as a continuous value using features like:
- TF-IDF representations of the review text
- Review length (words and characters)
- Helpfulness ratio

#### Why Regression?
Unlike classification, where the goal is to group reviews as "positive" or "negative", regression helps in **predicting the exact rating** a customer has given. This is useful in:
- Recommender systems
- Quality analysis
- Identifying product-specific score trends
  
#### Key Steps in This Notebook:
1. Load the feature-engineered dataset
2. Prepare the input features and target variable (`Score`)
3. Perform Train-Test Split
4. Apply TF-IDF vectorization on review text (fit on training, transform both)
5. Combine TF-IDF features with numeric features
6. Train and evaluate multiple regression models
7. Compare models using:
   - MAE (Mean Absolute Error)
   - RMSE (Root Mean Squared Error)
   - R² Score (Coefficient of Determination)
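
For reference, the three metrics listed above can be computed directly with NumPy. This is a minimal sketch on made-up numbers (the arrays `y_true` and `y_hat` are illustrative and not taken from the dataset); the notebook itself relies on the scikit-learn implementations.

```python
import numpy as np

# Illustrative values only; not taken from the review dataset.
y_true = np.array([5.0, 1.0, 4.0, 2.0])
y_hat = np.array([4.0, 2.0, 4.0, 3.0])

errors = y_true - y_hat

# MAE: average size of the error, in rating points.
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but large errors are penalized more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: share of the target variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

An R² close to 0 means the model does little better than always predicting the mean score, which is a useful baseline to keep in mind when reading the comparison tables later on.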

### Step 1: Load the Feature-Engineered Dataset


In [43]:
import pandas as pd

# Load the feature-engineered dataset
df = pd.read_csv("feature_engineered_reviews.csv")

# Preview the data
df.head()

|   | Score | cleaned_text | review_length_words | review_length_chars | helpfulness_ratio | sentiment_binary |
|---|---|---|---|---|---|---|
| 0 | 5 | i have bought several of the vitality canned d... | 48 | 259 | 1.0 | 1.0 |
| 1 | 1 | product arrived labeled as jumbo salted peanut... | 31 | 183 | 0.0 | 0.0 |
| 2 | 4 | this is a confection that has been around a fe... | 92 | 484 | 1.0 | 1.0 |
| 3 | 2 | if you are looking for the secret ingredient i... | 41 | 212 | 1.0 | 0.0 |
| 4 | 5 | great taffy at a great price there was a wide ... | 27 | 132 | 0.0 | 1.0 |


### Step 2: Prepare Input Features and Target Variable (`Score`)

In this step, we define the target variable and select the input features for our regression model.

- Target Variable: `Score` (ranging from 1 to 5)
- Input Features:
  - Cleaned review text (`cleaned_text`) – to be vectorized using TF-IDF later
  - Numeric features:
    - `review_length_words`
    - `review_length_chars`
    - `helpfulness_ratio`

In [44]:
# Define target variable
target = df['Score']

# Define text feature (to be vectorized later)
text_feature = df['cleaned_text']

# Define numeric features
numeric_features = df[['review_length_words', 'review_length_chars', 'helpfulness_ratio']]

# Check shapes
print("Text feature shape:", text_feature.shape)
print("Numeric features shape:", numeric_features.shape)
print("Target shape:", target.shape)


Text feature shape: (568401,)
Numeric features shape: (568401, 3)
Target shape: (568401,)


### Step 3: Train-Test Split

Before vectorizing the text, we split the dataset into training and testing sets. This prevents data leakage: the vectorizer learns its vocabulary and IDF weights from the training data only, so nothing about the test reviews influences the features.

- `X_train_text`, `X_test_text`: for TF-IDF vectorization
- `X_train_num`, `X_test_num`: numeric features
- `y_train`, `y_test`: target `Score` values


In [45]:
from sklearn.model_selection import train_test_split

# Split text, numeric features, and target
X_train_text, X_test_text, X_train_num, X_test_num, y_train, y_test = train_test_split(
    text_feature, numeric_features, target,
    test_size=0.2, random_state=42
)

# Check dimensions
print("Train text shape:", X_train_text.shape)
print("Train numeric shape:", X_train_num.shape)
print("Train target shape:", y_train.shape)
print("Test text shape:", X_test_text.shape)
print("Test numeric shape:", X_test_num.shape)
print("Test target shape:", y_test.shape)


Train text shape: (454720,)
Train numeric shape: (454720, 3)
Train target shape: (454720,)
Test text shape: (113681,)
Test numeric shape: (113681, 3)
Test target shape: (113681,)
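
Passing several arrays to a single `train_test_split` call shuffles them with the same permutation, which is why the text, numeric, and target pieces stay row-aligned. The sketch below checks this on a toy frame; the column values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the text, numeric, and target columns.
toy = pd.DataFrame({
    "cleaned_text": [f"review {i}" for i in range(10)],
    "review_length_words": np.arange(10),
    "Score": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
})

text_tr, text_te, num_tr, num_te, y_tr, y_te = train_test_split(
    toy["cleaned_text"], toy[["review_length_words"]], toy["Score"],
    test_size=0.2, random_state=42,
)

# All three training pieces carry the same row index, in the same order.
aligned = text_tr.index.equals(num_tr.index) and text_tr.index.equals(y_tr.index)
print("Rows aligned:", aligned)
print("Train rows:", len(text_tr), "Test rows:", len(text_te))
```

Splitting the pieces in separate calls would also work with a fixed `random_state`, but a single call makes the alignment explicit.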


### Step 4: TF-IDF Vectorization

We apply **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization to convert the cleaned review text into numerical features.

Key parameters:
- `max_features=10000`: limits the vocabulary size to reduce dimensionality.
- `ngram_range=(1, 2)`: includes both unigrams and bigrams for richer context.

We fit the vectorizer only on the **training text** and transform both **training** and **test** text data to avoid data leakage.


In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Replace NaNs with empty strings
X_train_text = X_train_text.fillna("")
X_test_text = X_test_text.fillna("")

# Initialize TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    stop_words='english'
)

# Fit on training data and transform both train and test
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_test_tfidf = tfidf.transform(X_test_text)

# Check shape
print("TF-IDF train shape:", X_train_tfidf.shape)
print("TF-IDF test shape:", X_test_tfidf.shape)


TF-IDF train shape: (454720, 10000)
TF-IDF test shape: (113681, 10000)
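
To see what "fit on training, transform both" means in practice, here is a small sketch on invented sentences. Words that only appear in the test text are not in the learned vocabulary, so they are simply dropped at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences, not rows from the dataset.
train_docs = ["great taffy great price", "stale peanuts bad price"]
test_docs = ["great peanuts", "horrible packaging"]  # second doc has only unseen words

vec = TfidfVectorizer(ngram_range=(1, 2))
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

vocab_size = len(vec.vocabulary_)
print("Vocabulary size:", vocab_size)
print("Test matrix shape:", test_matrix.shape)
print("Non-zero entries per test doc:", [test_matrix[i].nnz for i in range(2)])
```

The test matrix has the same number of columns as the training matrix, and the second test document becomes an all-zero row because none of its unigrams or bigrams were seen during fitting.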


### Step 5: Combine TF-IDF with Numeric Features

In [47]:
from scipy.sparse import hstack

# Combine TF-IDF with numeric features
X_train_final = hstack([X_train_tfidf, X_train_num])
X_test_final = hstack([X_test_tfidf, X_test_num])

# Check final shape
print("Final shape (train):", X_train_final.shape)
print("Final shape (test):", X_test_final.shape)


Final shape (train): (454720, 10003)
Final shape (test): (113681, 10003)
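
The numeric columns are on a very different scale from the TF-IDF weights (`review_length_chars` is in the hundreds, while TF-IDF values lie between 0 and 1), which can matter for penalized models such as Ridge and Lasso. The sketch below shows one possible variation, not what this notebook does: standardize the numeric block with a scaler fitted on the training rows only, then stack. All arrays here are toy stand-ins, and the results reported in this notebook come from the unscaled features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Toy stand-ins: a sparse "TF-IDF" block and raw numeric columns.
tfidf_train = csr_matrix(np.array([[0.0, 0.7], [0.5, 0.0], [0.2, 0.1]]))
tfidf_test = csr_matrix(np.array([[0.3, 0.0]]))
num_train = np.array([[48, 259, 1.0], [31, 183, 0.0], [92, 484, 1.0]])
num_test = np.array([[41, 212, 1.0]])

# Fit the scaler on training rows only, for the same leakage reason as TF-IDF.
scaler = StandardScaler()
num_train_scaled = scaler.fit_transform(num_train)
num_test_scaled = scaler.transform(num_test)

# hstack returns COO format; CSR is usually faster for model fitting and slicing.
X_train_scaled = hstack([tfidf_train, csr_matrix(num_train_scaled)]).tocsr()
X_test_scaled = hstack([tfidf_test, csr_matrix(num_test_scaled)]).tocsr()

print(X_train_scaled.shape, X_test_scaled.shape)
```

Whether scaling helps here is an empirical question; it would need to be rerun against the same train/test split to compare with the metrics below.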


### Step 6: Train and Evaluate Regression Models

In this step, we train several regression models (Linear Regression, Random Forest, XGBoost, Ridge, and Lasso) on the combined TF-IDF and numeric features. Each model is evaluated on the test set with MAE, RMSE, and R² Score, which lets us compare performance and choose the best one.

In [48]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

In [49]:
# Initialize model
model = LinearRegression()

# Train model
model.fit(X_train_final, y_train)

# Predict on test set
y_pred = model.predict(X_test_final)


In [50]:
# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Display results
results_df = pd.DataFrame([{
    "Model": "Linear Regression",
    "MAE": mae,
    "RMSE": rmse,
    "R2 Score": r2
}])

results_df

|   | Model | MAE | RMSE | R2 Score |
|---|---|---|---|---|
| 0 | Linear Regression | 0.662634 | 0.887706 | 0.542376 |
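
A linear model's output is unbounded, so some predictions can fall below 1 or above 5 even though `Score` never does. The metrics in this notebook are computed on the raw predictions; the sketch below, on invented numbers, shows the optional post-processing step of clipping to the valid range.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented values: two predictions fall outside the 1-5 rating range.
y_true_demo = np.array([5, 1, 4, 5])
y_pred_demo = np.array([5.6, 0.4, 3.9, 4.8])

# Pull out-of-range predictions back to the nearest valid rating.
y_pred_clipped = np.clip(y_pred_demo, 1, 5)

mae_raw = mean_absolute_error(y_true_demo, y_pred_demo)
mae_clipped = mean_absolute_error(y_true_demo, y_pred_clipped)

print(f"MAE raw: {mae_raw:.3f}, MAE clipped: {mae_clipped:.3f}")
```

Because every true score lies inside the range, clipping can only move a prediction closer to its target, so the error never gets worse.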


In [10]:
# Trying Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

In [11]:
# Initialize the model
rf = RandomForestRegressor(
    n_estimators=20,      # Small number of trees
    max_depth=10,         # Cap tree depth
    max_features='sqrt',  # Consider sqrt(n_features) candidates per split
    n_jobs=-1,
    random_state=42
)

In [12]:
# Fit model
rf.fit(X_train_final, y_train)

# Predict
y_pred_rf = rf.predict(X_test_final)

# Evaluate
mae = mean_absolute_error(y_test, y_pred_rf)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2 = r2_score(y_test, y_pred_rf)

In [13]:
# Display

print("Random Forest :")
print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2 Score: {r2:.4f}")

Random Forest :
MAE: 0.9803
RMSE: 1.2353
R2 Score: 0.1139


In [14]:
# Trying XGBoost Regressor
from xgboost import XGBRegressor

# Initialize the XGBoost Regressor
xgb_model = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    verbosity=1  # 1 = warnings only (0 = silent, 2 = info, 3 = debug)
)

In [15]:
# Fit the model
xgb_model.fit(X_train_final, y_train)

# Predict on test data
y_pred_xgb = xgb_model.predict(X_test_final)

# Evaluate the model
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

In [16]:
# Print results
print("📊 XGBoost Regression:")
print(f"MAE: {mae_xgb:.4f}")
print(f"RMSE: {rmse_xgb:.4f}")
print(f"R2 Score: {r2_xgb:.4f}")

📊 XGBoost Regression:
MAE: 0.7381
RMSE: 0.9879
R2 Score: 0.4332


In [17]:
# Trying Ridge Regression

from sklearn.linear_model import Ridge

# Initialize Ridge Regressor
ridge_model = Ridge(alpha=1.0)  # You can try other values like 0.1, 10, etc. later

# Train the model
ridge_model.fit(X_train_final, y_train)

# Predict on test data
y_pred_ridge = ridge_model.predict(X_test_final)

# Evaluate performance
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
r2_ridge = r2_score(y_test, y_pred_ridge)


In [18]:
# Display Results
print("📊 Ridge Regression:")
print(f"MAE: {mae_ridge:.4f}")
print(f"RMSE: {rmse_ridge:.4f}")
print(f"R2 Score: {r2_ridge:.4f}")

📊 Ridge Regression:
MAE: 0.6780
RMSE: 0.9043
R2 Score: 0.5251


In [20]:
# Trying Lasso 
from sklearn.linear_model import Lasso

# Initialize and train the Lasso Regression model
lasso_model = Lasso(alpha=0.1, max_iter=10000)
lasso_model.fit(X_train_final, y_train)

# Predict
y_pred_lasso = lasso_model.predict(X_test_final)

# Evaluate
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
r2_lasso = r2_score(y_test, y_pred_lasso)

In [21]:
# Print results
print("📊 Lasso Regression:")
print(f"MAE: {mae_lasso:.4f}")
print(f"RMSE: {rmse_lasso:.4f}")
print(f"R2 Score: {r2_lasso:.4f}")


📊 Lasso Regression:
MAE: 1.0376
RMSE: 1.3084
R2 Score: 0.0059


In [22]:
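# Display the summary table.
# Note: these metrics are hard-coded from an earlier run, so a few differ slightly
# from the cell outputs above (e.g. Linear Regression MAE 0.6637 here vs 0.6626 above).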
import pandas as pd

results = [
    {"Model": "Linear Regression", "MAE": 0.6637, "RMSE": 0.8881, "R2 Score": 0.5420},
    {"Model": "Random Forest", "MAE": 0.9861, "RMSE": 1.2399, "R2 Score": 0.1072},
    {"Model": "XGBoost", "MAE": 0.7379, "RMSE": 0.9879, "R2 Score": 0.4332},
    {"Model": "Ridge Regression", "MAE": 0.6779, "RMSE": 0.9043, "R2 Score": 0.5251},
    {"Model": "Lasso Regression", "MAE": 1.0376, "RMSE": 1.3084, "R2 Score": 0.0059}
]

summary_df = pd.DataFrame(results)
summary_df = summary_df.sort_values(by="R2 Score", ascending=False).reset_index(drop=True)

print("📊 Model Comparison Summary:\n")
print(summary_df)

📊 Model Comparison Summary:

               Model     MAE    RMSE  R2 Score
0  Linear Regression  0.6637  0.8881    0.5420
1   Ridge Regression  0.6779  0.9043    0.5251
2            XGBoost  0.7379  0.9879    0.4332
3      Random Forest  0.9861  1.2399    0.1072
4   Lasso Regression  1.0376  1.3084    0.0059
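
Typing metrics into a table by hand makes it easy for the summary to drift from the actual cell outputs. One way to avoid that is to collect each row from the fitted model itself. The helper `evaluate` below is a hypothetical name introduced for this sketch, and the data is synthetic so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(name, fitted_model, X_eval, y_eval):
    """Return one summary row for an already fitted regressor."""
    pred = fitted_model.predict(X_eval)
    return {
        "Model": name,
        "MAE": mean_absolute_error(y_eval, pred),
        "RMSE": np.sqrt(mean_squared_error(y_eval, pred)),
        "R2 Score": r2_score(y_eval, pred),
    }


# Synthetic regression data (not the review dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X_fit, X_eval, y_fit, y_eval = X_demo[:150], X_demo[150:], y_demo[:150], y_demo[150:]

rows = [
    evaluate("Linear Regression", LinearRegression().fit(X_fit, y_fit), X_eval, y_eval),
    evaluate("Ridge Regression", Ridge(alpha=1.0).fit(X_fit, y_fit), X_eval, y_eval),
]

demo_summary = (
    pd.DataFrame(rows)
    .sort_values(by="R2 Score", ascending=False)
    .reset_index(drop=True)
)
print(demo_summary.round(4))
```

In this notebook the same pattern would be called with the fitted models and `X_test_final`, `y_test`, appending one row per model as it is trained.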


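### Step 7: Hyperparameter Tuning

Linear Regression has no hyperparameters to tune, so we start with the next-best model from the summary table, Ridge Regression, and search over its regularization strength `alpha`. XGBoost is tuned afterwards with a small grid.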


In [23]:
from sklearn.model_selection import GridSearchCV
import numpy as np

# Define model
ridge = Ridge()

# Define hyperparameter grid
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}

In [24]:
# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1)

# Fit the model
grid_search.fit(X_train_final, y_train)

# Best parameters and score
print("✅ Best Parameters:", grid_search.best_params_)
print("📈 Best Cross-Validation R² Score:", grid_search.best_score_)


Fitting 5 folds for each of 6 candidates, totalling 30 fits
✅ Best Parameters: {'alpha': 0.1}
📈 Best Cross-Validation R² Score: 0.5215235001434178
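
Besides `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`, which shows how sensitive the model is to `alpha`. The sketch below runs a small Ridge search on synthetic data (not the review features) and lays the results out as a table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data so the sketch is self-contained.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=120)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the R² across the 5 folds.
columns = ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

If the mean scores of neighbouring `alpha` values differ by less than their standard deviation, the choice of `alpha` matters little, and a tuned model should not be expected to beat the untuned one by much.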


In [25]:
# Evaluate on test data

best_ridge = grid_search.best_estimator_
y_pred = best_ridge.predict(X_test_final)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\n🧪 Ridge Regression after Tuning:")
print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")


🧪 Ridge Regression after Tuning:
MAE: 0.6780
RMSE: 0.9044
R² Score: 0.5250


In [32]:
# Tuning XGBoost 
xgb = XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.01]
}


In [33]:
# Set up Grid

grid_search_xgb = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid_xgb,
    scoring='neg_mean_squared_error',
    cv=3,
    verbose=2,
    n_jobs=1  # safer for slower computers
)

In [34]:
# Run the grid search and keep the best estimator

grid_search_xgb.fit(X_train_final, y_train)
best_xgb = grid_search_xgb.best_estimator_


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 2.0min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 1.5min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 1.3min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time= 2.5min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time= 2.5min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time= 2.6min
[CV] END ...learning_rate=0.1, max_depth=5, n_estimators=100; total time= 3.1min
[CV] END ...learning_rate=0.1, max_depth=5, n_estimators=100; total time= 3.2min
[CV] END ...learning_rate=0.1, max_depth=5, n_estimators=100; total time= 3.2min
[CV] END ...learning_rate=0.1, max_depth=5, n_estimators=200; total time= 6.5min
[CV] END ...learning_rate=0.1, max_depth=5, n_estimators=200; total time= 6.4min
[CV] END ...learning_rate=0.1, max_depth=5, n_est

In [36]:
# Evaluate 

import numpy as np

# Get best model from grid search
best_xgb = grid_search_xgb.best_estimator_

# Predict on test set
y_pred_xgb = best_xgb.predict(X_test_final)

# Evaluate
mae = mean_absolute_error(y_test, y_pred_xgb)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2 = r2_score(y_test, y_pred_xgb)

print("\n📦 XGBoost (After Tuning):")
print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")



📦 XGBoost (After Tuning):
MAE: 0.7078
RMSE: 0.9566
R² Score: 0.4686
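
The XGBoost search above uses `scoring='neg_mean_squared_error'`, so its `best_score_` is a negated MSE rather than an R² and cannot be compared directly with the Ridge search. The sketch below shows the conversion to an RMSE; `Ridge` on synthetic data is used only as a lightweight stand-in for `XGBRegressor`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data; Ridge is only a stand-in for the XGBoost estimator.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(90, 3))
y_demo = X_demo @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.3, size=90)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so MSE is stored with its sign flipped.
best_cv_mse = -search.best_score_
best_cv_rmse = np.sqrt(best_cv_mse)  # back in the units of the target

print(f"Best CV MSE: {best_cv_mse:.4f}, RMSE: {best_cv_rmse:.4f}")
```

Using the same `scoring` for every search (for example `'r2'` throughout) is another way to keep the cross-validation numbers comparable.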


In [37]:
# Final model performance summary (before & after tuning)
model_summary = pd.DataFrame({
    'Model': [
        'Linear Regression',
        'Ridge Regression (Before Tuning)',
        'Ridge Regression (After Tuning)',
        'XGBoost (Before Tuning)',
        'XGBoost (After Tuning)',
        'Random Forest (Default)',
        'Lasso Regression'
    ],
    'MAE': [
        0.6637,
        0.6779,
        0.6779,
        0.7379,
        0.7078,
        0.9861,
        1.0376
    ],
    'RMSE': [
        0.8881,
        0.9043,
        0.9044,
        0.9879,
        0.9566,
        1.2399,
        1.3084
    ],
    'R² Score': [
        0.5420,
        0.5251,
        0.5250,
        0.4332,
        0.4686,
        0.1072,
        0.0059
    ]
})

# Display the DataFrame
print("\n📊 Final Model Performance Summary (Before and After Tuning):")
print(model_summary)


📊 Final Model Performance Summary (Before and After Tuning):
                              Model     MAE    RMSE  R² Score
0                 Linear Regression  0.6637  0.8881    0.5420
1  Ridge Regression (Before Tuning)  0.6779  0.9043    0.5251
2   Ridge Regression (After Tuning)  0.6779  0.9044    0.5250
3           XGBoost (Before Tuning)  0.7379  0.9879    0.4332
4            XGBoost (After Tuning)  0.7078  0.9566    0.4686
5           Random Forest (Default)  0.9861  1.2399    0.1072
6                  Lasso Regression  1.0376  1.3084    0.0059


## Final Model Selection: Linear Regression

After training and evaluating multiple regression models (Linear, Ridge, Lasso, Random Forest, and XGBoost), we selected **Linear Regression** as the final model for this project. The decision was based on the following observations:

- ✅ **Best overall performance**: the lowest MAE and RMSE and the highest R² Score among all models compared.
- ✅ **Simpler and more interpretable** compared to tree-based models.
- ✅ **No hyperparameter tuning required**, yet outperformed more complex models even after tuning.
- ✅ **Lightweight and fast**, making it suitable for deployment in resource-constrained environments.

### Final Evaluation Metrics (on test set):

- **MAE**: 0.6637  
- **RMSE**: 0.8881  
- **R² Score**: 0.5420

Given these results, **Linear Regression** offers the best balance of accuracy, simplicity, and speed for this regression task.


### Step 8: Saving the Final Model (Linear Regression)

In [40]:
best_model = LinearRegression()
best_model.fit(X_train_final, y_train)

| LinearRegression parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [41]:
import joblib

# Save the trained model
joblib.dump(best_model, 'linear_regression_model.pkl')

print("Linear Regression model saved as 'linear_regression_model.pkl'")

Linear Regression model saved as 'linear_regression_model.pkl'


In [52]:
# Save the TF-IDF vectorizer
import joblib
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')

['tfidf_vectorizer.pkl']
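
Both files are needed at inference time: the vectorizer rebuilds the text columns and the model expects the three numeric columns appended in the same order as during training. The sketch below trains a tiny stand-in model on invented reviews, saves it to a temporary directory, and reloads it to score a new review. The helper `numeric_block` is hypothetical, and counting words with a whitespace split is an assumption about how `review_length_words` was originally computed.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Stand-in training data so the sketch runs on its own (not the real reviews).
train_texts = [
    "great taffy at a great price",
    "stale peanuts arrived late",
    "good quality dog food",
    "terrible taste would not buy",
]
train_scores = np.array([5, 1, 4, 1])
train_helpfulness = np.array([1.0, 0.0, 1.0, 0.5])


def numeric_block(texts, helpfulness):
    """Word count, character count, helpfulness ratio, in the training column order."""
    words = [len(t.split()) for t in texts]  # assumed definition of word count
    chars = [len(t) for t in texts]
    return csr_matrix(np.column_stack([words, chars, helpfulness]).astype(float))


vec = TfidfVectorizer(ngram_range=(1, 2))
X_toy = hstack(
    [vec.fit_transform(train_texts), numeric_block(train_texts, train_helpfulness)]
).tocsr()
reg = LinearRegression().fit(X_toy, train_scores)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "linear_regression_model.pkl")
    vec_path = os.path.join(tmp, "tfidf_vectorizer.pkl")
    joblib.dump(reg, model_path)
    joblib.dump(vec, vec_path)

    # Inference: reload both artifacts and score unseen text.
    loaded_model = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

    new_texts = ["great price and good quality"]
    X_new = hstack(
        [loaded_vec.transform(new_texts), numeric_block(new_texts, [1.0])]
    ).tocsr()

    # Clip to the valid rating range, since linear output is unbounded.
    predicted = np.clip(loaded_model.predict(X_new), 1, 5)

print("Predicted score:", predicted)
```

New text should go through the same cleaning steps that produced `cleaned_text` before it is vectorized; otherwise tokens will not match the saved vocabulary.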