<a href="https://colab.research.google.com/github/nbetini99/demand-prediction/blob/main/DemandPrediction_BERT_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers torch xgboost scikit-learn matplotlib pandas numpy

In [None]:
# Import necessary libraries
'''This section imports necessary libraries for data manipulation, visualization, machine learning, and text processing.
numpy is used for numerical operations.
pandas is used for data handling and manipulation using DataFrames.
matplotlib.pyplot is for creating visualizations.
train_test_split (from sklearn) is used to split the dataset.
mean_absolute_error, mean_squared_error, and r2_score (from sklearn) are used to measure the model's prediction error.
xgboost is the machine learning library used for the prediction model.
BertTokenizer and BertModel (from transformers) are used to generate text embeddings.
torch is a deep learning library, here used for BERT's operations.'''

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score
)
import xgboost as xgb
from transformers import BertTokenizer, BertModel
import torch

# -----------------------------
# 1. Data Preparation
# -----------------------------
'''The code below defines a Python dictionary called data that holds, for each product, its description, historical sales, region, seasonality, and the actual demand for the next month.
The dictionary is then converted into a pandas DataFrame called df, a tabular structure that is easier to manipulate.'''
# -----------------------------

# Creating a synthetic dataset
data = {
    'product_description': [
        'HP 65W Slim AC Adapter',
        'Canon CLI-281 Black Ink Cartridge',
        'Apple USB-C Power Adapter',
        'Samsung Galaxy Charger 25W',
        'Logitech Wireless Mouse M510'
    ],
    'historical_sales': [150, 120, 200, 180, 160],
    'region': [1, 2, 1, 3, 2],
    'seasonality': [0.3, 0.2, 0.4, 0.5, 0.1],
    'demand_next_month': [170, 110, 210, 190, 150]
}

# Converting the dictionary to a pandas DataFrame
df = pd.DataFrame(data)


# -----------------------------
# 2. Text Embedding using BERT
# Load BERT model and tokenizer
# -----------------------------
'''The section below utilizes a pre-trained language model called BERT (bert-base-uncased) to convert product descriptions into numerical vectors (embeddings).
It first loads the BERT tokenizer (tokenizer) to break down text into tokens and the BERT model (model) itself.
The get_bert_embedding function takes a text input, processes it with BERT, and returns its embedding.
The code then applies this function to every product description in the DataFrame, stores the resulting vectors in embeddings, and converts them into another DataFrame called embedding_df.'''
# -----------------------------

# Loading the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to generate BERT embeddings for a given text
def get_bert_embedding(text):
    """
    Generates BERT embedding for the input text.
    """
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    # Extracting the embedding of the [CLS] token
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    return cls_embedding

# Applying the embedding function to all product descriptions
embeddings = np.array([get_bert_embedding(desc) for desc in df['product_description']])

# Creating a DataFrame from the embeddings
embedding_df = pd.DataFrame(embeddings)

# -----------------------------
# 3. Feature Engineering
# -----------------------------
'''Here, the code combines the generated BERT embeddings with other numerical features (historical sales, region, and seasonality) into a single DataFrame called features. These are the input variables for our model.
The target variable is defined as the 'demand_next_month' column from the original DataFrame, representing what we want to predict.
The data is then split into training and testing sets using train_test_split.
X_train and y_train are used to train the model.
X_test and y_test are used to evaluate the model's performance on unseen data.'''
# -----------------------------

# Combining embeddings with structured features
features = pd.concat([
    embedding_df,
    df[['historical_sales', 'region', 'seasonality']]
], axis=1)

# Defining the target variable
target = df['demand_next_month']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# -----------------------------
# 4. Model Training
# -----------------------------
'''The section below trains the prediction model.
It initializes an XGBoost regressor (xgb_model) with the squared-error objective (reg:squarederror) and 100 boosted trees (n_estimators=100).
The regressor is then fitted on X_train and y_train, so it learns to map the combined BERT embeddings and structured features to next-month demand.'''
# -----------------------------

# Initializing the XGBoost regressor
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)

# Training the model
xgb_model.fit(X_train, y_train)


# -----------------------------
# 5. Model Evaluation
# -----------------------------

# Making predictions on the test set
y_pred = xgb_model.predict(X_test)

# Calculating evaluation metrics

# Mean Absolute Error (MAE): Average of absolute differences between actual and predicted values
mae = mean_absolute_error(y_test, y_pred)

# Mean Squared Error (MSE): Average of squared differences between actual and predicted values
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (RMSE): Square root of MSE, provides error in original units
rmse = np.sqrt(mse)

# R-squared (R²): Proportion of variance explained by the model (undefined when the test set has fewer than two rows)
r2 = r2_score(y_test, y_pred)

# Adjusted R-squared: Adjusts R² for the number of predictors in the model
n = X_test.shape[0]  # Number of observations
p = X_test.shape[1]  # Number of predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1) if n > p + 1 else None

# Mean Absolute Percentage Error (MAPE): Average of absolute percentage errors
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

# Displaying the evaluation metrics
print("Model Evaluation Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
if adjusted_r2 is not None:
    print(f"Adjusted R-squared: {adjusted_r2:.2f}")
else:
    print("Adjusted R-squared: Not applicable (n <= p + 1)")
print(f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%")

# -----------------------------
# 6. Visualization
# -----------------------------

# Plotting Actual vs. Predicted Demand
plt.figure(figsize=(8, 5))
plt.plot(y_test.values, label='Actual Demand', marker='o')
plt.plot(y_pred, label='Predicted Demand', marker='x')
plt.title('Actual vs. Predicted Demand')
plt.xlabel('Sample Index')
plt.ylabel('Demand')
plt.legend()
plt.show()
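
With five rows and `test_size=0.2`, the split above holds out a single row, so R² cannot be computed meaningfully and the plot shows only one point. One possible workaround is leave-one-out cross-validation, where every row is predicted by a model trained on the remaining rows. The sketch below is an assumption, not part of the original workflow: it uses only the structured columns and a scikit-learn `GradientBoostingRegressor` as a stand-in estimator. Because `xgb.XGBRegressor` follows the scikit-learn estimator interface, it could be passed in its place.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Structured columns of the synthetic dataset: historical_sales, region, seasonality.
# The BERT embeddings are left out here to keep the sketch small.
X = np.array([
    [150, 1, 0.3],
    [120, 2, 0.2],
    [200, 1, 0.4],
    [180, 3, 0.5],
    [160, 2, 0.1],
])
y = np.array([170, 110, 210, 190, 150])  # demand_next_month

# Stand-in estimator (assumption); any scikit-learn-compatible regressor works here.
estimator = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Each row is predicted by a model fitted on the other four rows.
loo_pred = cross_val_predict(estimator, X, y, cv=LeaveOneOut())

# Metrics are now computed over all five held-out predictions.
loo_mae = mean_absolute_error(y, loo_pred)
loo_r2 = r2_score(y, loo_pred)
print(f"Leave-one-out MAE: {loo_mae:.2f}")
print(f"Leave-one-out R²: {loo_r2:.2f}")
```

With so few rows the scores will still be noisy, but they are at least defined and use every observation as a test case once.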

In [None]:
# Import necessary libraries
'''This section imports necessary libraries for data manipulation, visualization, machine learning, and text processing.
numpy is used for numerical operations.
pandas is used for data handling and manipulation using DataFrames.
matplotlib.pyplot is for creating visualizations.
train_test_split (from sklearn) is used to split the dataset.
mean_squared_error (from sklearn) is used to calculate the model's prediction error.
xgboost is the machine learning library used for the prediction model.
BertTokenizer and BertModel (from transformers) are used to generate text embeddings.
torch is a deep learning library, here used for BERT's operations.'''


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from transformers import BertTokenizer, BertModel
import torch


# Sample Data and Data preparation

'''The code below defines a Python dictionary called data that holds, for each product, its description, historical sales, region, seasonality, and the actual demand for the next month.
The dictionary is then converted into a pandas DataFrame called df, a tabular structure that is easier to manipulate.'''

data = {
    'product_description': [
        'XYZ Smart Thermostat',
        'ABC Condenser Cooler',
        'PQR Air Conditioner',
        'Heat furnace ABC',
        'Zoning HVAC System',
        'Air Quality sensor',
        'Humidity Sensor'
    ],
    'historical_sales': [150, 120, 200, 180, 160, 133, 189],
    'region': [1, 2, 1, 3, 2, 2, 3],
    'seasonality': [0.3, 0.2, 0.4, 0.5, 0.1, 0.23, 0.42],
    'demand_next_month': [120, 144, 170, 110, 210, 190, 150]
}

df = pd.DataFrame(data)

# Load BERT model and tokenizer
# Text Embedding with BERT
'''The section below utilizes a pre-trained language model called BERT (bert-base-uncased) to convert product descriptions into numerical vectors (embeddings).
It first loads the BERT tokenizer (tokenizer) to break down text into tokens and the BERT model (model) itself.
The get_bert_embedding function takes a text input, processes it with BERT, and returns its embedding.
The code then applies this function to every product description in the DataFrame, stores the resulting vectors in embeddings, and converts them into another DataFrame called embedding_df.'''

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

# Generate embeddings
embeddings = np.array([get_bert_embedding(desc) for desc in df['product_description']])
embedding_df = pd.DataFrame(embeddings)

# Combine with tabular features
# Feature Engineering and Data Splitting
features = pd.concat([embedding_df, df[['historical_sales', 'region', 'seasonality']]], axis=1)
target = df['demand_next_month']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Model Training and Prediction
# Train XGBoost (named xgb_model so it does not overwrite the BERT model loaded above)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)
xgb_model.fit(X_train, y_train)

# Visualization and Evaluation
# Predict and visualize
y_pred = xgb_model.predict(X_test)

plt.figure(figsize=(8, 5))
plt.plot(y_test.values, label='Actual', marker='o')
plt.plot(y_pred, label='Predicted', marker='x')
plt.title('Actual vs Predicted Demand')
plt.xlabel('Sample Index')
plt.ylabel('Demand')
plt.legend()
plt.grid(True)
plt.show()

# Print MSE
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
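
`get_bert_embedding` keeps only the `[CLS]` vector of each description. A commonly used alternative is to average all token vectors while ignoring padded positions (masked mean pooling). The sketch below illustrates the idea on a small hand-made array rather than real BERT output. The function name `masked_mean_pool` and the toy values are assumptions made for illustration.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  array of shape (batch, seq_len, hidden_size)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for a fully padded sequence.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy stand-in for model output: 2 sequences, 3 tokens each, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third token is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(hidden, mask)
print(pooled)  # → [[2. 3.] [4. 2.]]
```

Inside `get_bert_embedding`, the corresponding inputs would be `outputs.last_hidden_state.numpy()` and `inputs['attention_mask'].numpy()`. Whether mean pooling improves on the `[CLS]` vector for this task would need to be checked empirically.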
