# Customer Score Predictor

**Overview:** This notebook implements a customer review score predictor using a pre-trained BERT model fine-tuned on a Yelp review dataset from Hugging Face. The goal is to predict the star rating (1 to 5) of a Yelp review based on the text content. We will load the dataset, preprocess it, fine-tune a BERT model for sequence classification, evaluate its performance using various metrics, and finally build a Gradio application for interactive predictions.

This notebook uses a **subset of the original Yelp Review Full dataset for demonstration purposes and to reduce training time.** For optimal performance in a real-world scenario, it is recommended to train the model on the entire dataset.


## Step 1 - Install and import the required libraries

In this step, we install the Python libraries this project needs: libraries from the Hugging Face ecosystem (`transformers`, `datasets`, `evaluate`, `accelerate`), standard data science libraries (`numpy`, `pandas`, `scikit-learn`, `matplotlib`, `seaborn`), and `gradio` for building the user interface.


In [None]:
%pip install pandas transformers datasets evaluate accelerate scikit-learn gradio matplotlib seaborn

In [None]:
# Import Libraries and Set Random Seeds for Reproducibility

# --- Standard Libraries ---
import random
import time

# --- Data Science Libraries ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# --- Hugging Face Libraries ---
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# --- Hugging Face Hub (used to upload the model in Step 10) ---
from huggingface_hub import HfApi, login, upload_folder

# Replace 'YOUR_HUGGING_FACE_TOKEN' with your actual token
login(token="YOUR_HUGGING_FACE_TOKEN")

# --- PyTorch ---
import torch

# --- Gradio for UI ---
import gradio as gr
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # Disable parallelism for gradio application


# --- Set Random Seeds for Reproducibility ---
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED) # For CUDA GPUs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Step 2 - Load the dataset from Hugging Face and inspect the data

### Load the dataset

Here, we load the "yelp_review_full" dataset from Hugging Face Datasets. This dataset contains Yelp reviews with full star ratings (1 to 5), stored as integer labels 0 to 4. We inspect a sample review to understand the data structure and content.


In [None]:
# Load the Yelp review dataset
dataset = load_dataset("yelp_review_full")

# Display a sample review (index 100) from the training set
dataset["train"][100]

### Data inspection

Before proceeding with model training, it's important to explore and understand the dataset. In this section, we will:

- **Convert to Pandas DataFrame:** Convert the Hugging Face Dataset to Pandas DataFrames for easier data manipulation and analysis.
- **Visualize Label Distribution:** Plot the distribution of the `label` column (star ratings) to check for class imbalances.
- **Check for Missing Values:** Examine the dataset for any missing values in the 'text' or 'label' columns.
- **Split Test Set into Validation and Test Sets:** Divide the original test set into two: a validation set used during training and a final test set for unbiased evaluation at the end.
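
The inspection and splitting steps listed above can be illustrated without any libraries. The sketch below is a simplified stand-in, not the `datasets` implementation: the toy `records` and the helper `split_in_half` are hypothetical, and the real `train_test_split` uses its own shuffling logic.

```python
import random
from collections import Counter

# Hypothetical toy records mimicking the dataset's {"label", "text"} structure.
records = [{"label": i % 5, "text": f"review {i}"} for i in range(20)]

def split_in_half(rows, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

validation_rows, test_rows = split_in_half(records)

# Count how many reviews carry each label, and how many texts are empty.
label_counts = Counter(row["label"] for row in records)
missing_texts = sum(1 for row in records if not row["text"])

print(len(validation_rows), len(test_rows))
print(sorted(label_counts.items()))
print(missing_texts)
```

Because the seed is fixed, repeated runs produce the same two halves, which keeps the validation and test sets stable across notebook restarts.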


In [None]:
# Convert training and test datasets to Pandas DataFrames
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()


# Visualize the distribution of labels in the training set
plt.figure(figsize=(8, 6))
sns.countplot(x='label', data=train_df)
plt.title('Distribution of Star Ratings in Training Set')
plt.xlabel('Star Rating (Label)')
plt.ylabel('Number of Reviews')
plt.xticks(ticks=range(5), labels=['1 Star', '2 Stars', '3 Stars', '4 Stars', '5 Stars']) # More descriptive x-axis labels
plt.show()


# Check for missing values in both training and test sets
print("Missing values in training set:")
print(train_df.isnull().sum())
print("\nMissing values in test set:")
print(test_df.isnull().sum())

In [None]:
# Split the original test set into validation and test sets
validation_test_split = dataset["test"].train_test_split(test_size=0.5, seed=42) # 50/50 split

validation_dataset = validation_test_split["train"] # Use 'train' split as validation
test_dataset = validation_test_split["test"]      # Use 'test' split as final test set


## Step 3 - Tokenize the data

To keep training fast for this demonstration and to work within limited resources, we use small subsets of the Yelp Review Full dataset. The [original dataset](https://huggingface.co/datasets/Yelp/yelp_review_full) contains 650,000 training reviews and 50,000 test reviews. For this notebook, we use 3,000 training samples, 1,000 validation samples, and 1,000 test samples. For a full-scale project, training on the entire dataset is recommended for optimal model performance.

Tokenization converts text into numerical tokens that the BERT model can process. We use the tokenizer associated with the "google-bert/bert-base-cased" model.

We use the `map` function from the `datasets` library to apply the tokenization function in batches for efficiency. Setting `padding="max_length"` pads shorter reviews up to the model's maximum input length (512 tokens for this model), and `truncation=True` cuts longer reviews down to that length, so every input sequence ends up with the same length.
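
To make the effect of padding and truncation concrete, here is a minimal library-free sketch. The helper `pad_or_truncate`, the token IDs, and the tiny `max_length` are all hypothetical; the real tokenizer additionally splits text into WordPiece tokens and keeps the special end-of-sequence token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input IDs plus an attention mask (1 = real token, 0 = padding)."""
    kept = token_ids[:max_length]                   # truncation: drop tokens beyond max_length
    padding = [pad_id] * (max_length - len(kept))   # padding: fill the remainder with pad_id
    input_ids = kept + padding
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask

short_review = [101, 2307, 2833, 102]             # hypothetical token IDs
long_review = [101, 1, 2, 3, 4, 5, 6, 7, 102]     # hypothetical token IDs

print(pad_or_truncate(short_review, max_length=6))
print(pad_or_truncate(long_review, max_length=6))
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.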


In [None]:
# Load the tokenizer, tokenize all splits, then draw smaller subsets.
# Note: tokenizing before subsetting processes the full dataset; selecting the subsets first would be faster.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)


small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(3000))
small_validation_dataset = tokenized_validation_dataset.shuffle(seed=42).select(range(1000))
small_test_dataset = tokenized_test_dataset.shuffle(seed=42).select(range(1000))


## Step 4 - Select the model and initialize it with 5 classes

Here we select the pre-trained "google-bert/bert-base-cased" model and initialize it for sequence classification. We specify `num_labels=5` because we are predicting one of 5 star ratings (classes). `AutoModelForSequenceClassification.from_pretrained` loads the pre-trained BERT model and adds a classification layer on top. `torch_dtype="auto"` loads the weights in the data type stored in the checkpoint instead of forcing a fixed one.
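
Conceptually, the added classification layer is a linear map from the encoder's pooled sentence vector to one score (logit) per class. The sketch below illustrates only that idea with toy sizes and random weights; `linear_head` and all values are hypothetical and are not taken from the actual model.

```python
import random

def linear_head(pooled_output, weights, bias):
    """Map one pooled sentence vector to one logit per class (a plain linear layer)."""
    return [
        sum(w * x for w, x in zip(class_weights, pooled_output)) + b
        for class_weights, b in zip(weights, bias)
    ]

hidden_size, num_labels = 4, 5   # toy sizes; bert-base uses a hidden size of 768
rng = random.Random(0)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

pooled_output = [0.5, -1.0, 0.25, 2.0]   # hypothetical encoder output for one review
logits = linear_head(pooled_output, weights, bias)
print(len(logits))  # one logit per star rating
```

Because this layer starts with randomly initialized weights, its outputs are meaningless until the model is fine-tuned.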


In [None]:
# Initialize the model with 5 classes
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5, torch_dtype="auto")

## Step 5 - Create evaluation metrics

In this step, we define the evaluation metrics we will use to assess the performance of our fine-tuned model. We load the "accuracy", "recall", "precision", and "f1" metrics from the `evaluate` library.
We choose these metrics because:

- **Accuracy**: Provides a general measure of correctness, indicating the percentage of reviews for which the star rating was predicted correctly _across all classes_. It's a good overall metric, but can be misleading if classes are imbalanced.
- **Recall, Precision, and F1-score**: These are especially important in multi-class classification problems like ours, and are useful even if there are class imbalances. We use the macro-average for these metrics to give equal weight to each star rating class, regardless of its frequency in the dataset. This ensures that the performance on less frequent classes is also taken into account.
- **Recall (Macro-average)**: Measures the ability of the classifier to correctly identify reviews of _each specific star rating_. For each star rating (1 to 5), recall asks: "Of all the reviews that _actually_ have this star rating, how many did the model correctly predict?". High recall for a star rating means the model is good at finding most of the reviews that truly belong to that rating.
- **Precision (Macro-average)**: Measures the accuracy of the positive predictions _for each star rating_. For each star rating (1 to 5), precision asks: "Of all the reviews that the model _predicted_ as having this star rating, how many _actually_ have this star rating?". High precision for a star rating means that when the model predicts a certain star rating, it's usually correct.
- **F1-score (Macro-average)**: Provides a single, balanced measure that combines both precision and recall for _each star rating_. It is the harmonic mean of precision and recall. F1-score is particularly useful when we want to balance precision and recall, and it's often a better single metric to consider than accuracy, especially in cases with imbalanced classes or when both false positives and false negatives are important to consider.

Finally, the `compute_metrics` function takes the model's prediction outputs (`eval_pred`) and calculates these metrics. It returns a dictionary containing the calculated metrics which will be used by the Trainer during evaluation.
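
To show what macro-averaging means, the following sketch computes the same four metrics by hand on a small hypothetical set of labels. It is an illustration only (the helper `macro_metrics` and the example labels are assumptions); the notebook itself relies on the `evaluate` library.

```python
def macro_metrics(y_true, y_pred, num_classes=5):
    """Compute accuracy and macro-averaged precision, recall, and F1 by hand."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1_scores = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted as class c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly predicted as class c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # class c reviews that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    # Macro-average: every class contributes equally, regardless of its frequency.
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1_scores) / num_classes,
    }

y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]   # hypothetical true labels
y_pred = [0, 1, 1, 1, 2, 2, 3, 4, 4, 4]   # hypothetical predictions
scores = macro_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

In this toy example, accuracy and macro recall are both 0.8, while macro precision is higher and macro F1 slightly lower, which shows that the four numbers can diverge even on balanced data.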


In [None]:
# Create evaluation metrics
metric_accuracy = evaluate.load("accuracy")
metric_recall = evaluate.load("recall")
metric_precision = evaluate.load("precision")
metric_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Calculate accuracy
    accuracy = metric_accuracy.compute(predictions=predictions, references=labels)

    # Calculate macro-averaged recall
    recall = metric_recall.compute(predictions=predictions, references=labels, average="macro")

    # Calculate macro-averaged precision
    precision = metric_precision.compute(predictions=predictions, references=labels, average="macro")

    # Calculate macro-averaged f1-score
    f1 = metric_f1.compute(predictions=predictions, references=labels, average="macro")

    return {
        "accuracy": accuracy["accuracy"],
        "recall": recall["recall"],
        "precision": precision["precision"],
        "f1": f1["f1"]
    }

## Step 6 - Finetune the model

In this section, we configure the training process using `TrainingArguments`. This class provides many options to customize the training. In this notebook, we have configured the following `TrainingArguments`:

- `output_dir`: "test_trainer" - Specifies the directory where the model checkpoints and training logs will be saved.
- `eval_strategy`: "steps" - Sets the evaluation strategy to "steps", meaning evaluation will be performed at certain intervals defined by `eval_steps`.
- `eval_steps`: 100 - Configures evaluation to be performed every 100 training steps.
- `num_train_epochs`: 2 - Sets the number of training epochs to 2. For a better-tuned model, 3 to 5 epochs are generally recommended, but 2 is used here for demonstration speed.
- `logging_steps`: 100 - Configures training logs to be recorded every 100 steps.

We have left many other `TrainingArguments` at their default values. Some of the important default parameters include:

- `learning_rate`: 5e-5 - The initial learning rate for the AdamW optimizer.
- `per_device_train_batch_size`: 8 - Batch size per device during training.
- `per_device_eval_batch_size`: 8 - Batch size for evaluation.
- `weight_decay`: 0.0 - No weight decay is applied by default.
- `warmup_steps`: 0 - No warmup steps for the learning rate scheduler by default.

For a full list of configurable training arguments, refer to the Hugging Face `TrainingArguments` documentation.

We then initialize the `Trainer` class from Hugging Face Transformers. The `Trainer` simplifies the training loop and handles many details for us. We provide it with:

- The pre-trained **model** we initialized.
- The **training arguments** we just configured.
- The **training dataset** (`small_train_dataset`).
- The **validation dataset** (`small_validation_dataset`) for evaluation during training.
- The `compute_metrics` function to calculate evaluation metrics.

Finally, we call `trainer.train()` to start the fine-tuning process. The training progress, evaluation metrics at each evaluation step, and checkpoints will be saved in the `output_dir` specified in `TrainingArguments`. After training, we save the fine-tuned model and tokenizer to local directories for later use in inference.
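
As a rough sanity check on the configuration above, the number of training steps and evaluation points can be estimated with simple arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; the helper `training_schedule` is hypothetical and not part of the `Trainer` API.

```python
import math

def training_schedule(num_samples, batch_size, num_epochs, eval_steps):
    """Estimate optimizer steps and the steps at which evaluation runs."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    evaluation_points = list(range(eval_steps, total_steps + 1, eval_steps))
    return steps_per_epoch, total_steps, evaluation_points

steps_per_epoch, total_steps, evaluation_points = training_schedule(
    num_samples=3000, batch_size=8, num_epochs=2, eval_steps=100
)
print(steps_per_epoch)      # 375
print(total_steps)          # 750
print(evaluation_points)    # [100, 200, 300, 400, 500, 600, 700]
```

Under these assumptions, the model is evaluated on the validation set seven times during training.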


In [None]:
# Train the model

training_arguments = TrainingArguments(
    output_dir="test_trainer",   # Directory to save model checkpoints and logs
    eval_strategy="steps",       # Evaluate at fixed step intervals (`evaluation_strategy` in older transformers releases)
    eval_steps=100,              # Evaluate every 100 steps
    num_train_epochs=2,          # 2 epochs for demonstration speed
    logging_steps=100,           # Log training information every 100 steps
)

# Initialize trainer with model, datasets, and metrics
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=small_train_dataset,
    eval_dataset=small_validation_dataset,
    compute_metrics=compute_metrics,
)

# Start training
start_time = time.time() # Start timing
trainer.train()
training_time = time.time() - start_time # End timing and calculate duration
print(f"Training completed in {training_time:.2f} seconds")

# Save the trained model and tokenizer
model_save_path = "./score_prediction_model"
tokenizer_save_path = "./score_prediction_tokenizer"

trainer.save_model(model_save_path)
tokenizer.save_pretrained(tokenizer_save_path)

## Step 7 - Evaluate the model

After fine-tuning, we perform a detailed evaluation of the model's performance on the held-out test set (`small_test_dataset`), which was not used during training or validation. This goes beyond the metrics reported during training and provides a more comprehensive analysis. We will create the following:

1.  **Predictions:** Use the trained model to predict star ratings for the test set.
2.  **Classification Report:** Print a classification report that includes precision, recall, F1-score, and support for each star rating, as well as macro and weighted averages.
3.  **Confusion Matrix:** Display a confusion matrix to visualize the counts of true vs. predicted star ratings.
4.  **Evaluation Time:** Measure and report the time taken for the evaluation process.
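
To clarify how a confusion matrix is read, the sketch below builds one by hand from hypothetical labels. Rows are true ratings and columns are predicted ratings, which is the same convention scikit-learn's `confusion_matrix` uses; the helper `build_confusion_matrix` is an illustrative assumption.

```python
def build_confusion_matrix(y_true, y_pred, num_classes=5):
    """Count reviews per (true label, predicted label) pair: rows = true, columns = predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]   # hypothetical true labels
y_pred = [0, 1, 1, 1, 2, 2, 3, 4, 4, 4]   # hypothetical predictions

cm_example = build_confusion_matrix(y_true, y_pred)
for row in cm_example:
    print(row)

support = [sum(row) for row in cm_example]        # reviews per true class
correct = [cm_example[i][i] for i in range(5)]    # diagonal = correct predictions
print(support, correct)
```

Off-diagonal entries show which ratings get confused with each other; in star-rating tasks these often sit next to the diagonal, because neighboring ratings are hard to tell apart.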


In [None]:
start_eval_time = time.time() # Start timing evaluation

# Get predictions on the test dataset
predictions = trainer.predict(small_test_dataset)
y_pred = np.argmax(predictions.predictions, axis=-1)
y_true = small_test_dataset["label"]

eval_time = time.time() - start_eval_time # End timing and calculate evaluation duration

# Generate Classification Report
print("\nClassification Report (on Test Set):")
print(classification_report(y_true, y_pred, target_names=[f'Star {i+1}' for i in range(5)]))

# Generate Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("\nConfusion Matrix (on Test Set):")
print(cm)

# Visualize Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=[f'Star {i+1}' for i in range(5)],
            yticklabels=[f'Star {i+1}' for i in range(5)])
plt.xlabel('Predicted Star Rating')
plt.ylabel('True Star Rating')
plt.title('Confusion Matrix on Test Set - Yelp Review Star Prediction')
plt.show()

print(f"\nEvaluation on Test Set completed in {eval_time:.2f} seconds")


## Step 8 - Set up inference

This step prepares the model for inference (making predictions on new, unseen text). We reload the fine-tuned model and tokenizer from the saved directories and move the model to the appropriate device ('cuda' if a GPU is available, otherwise 'cpu'). Calling `.eval()` puts the model in evaluation mode, which disables training-only behavior such as dropout so that predictions are consistent. `from_pretrained` already returns the model in evaluation mode, so the explicit call acts as a safeguard.

We define a `predict` function that takes text as input, tokenizes it, feeds it to the model, and returns the predicted star rating (as an integer from 0 to 4, corresponding to 1 to 5 stars). We also include a test prediction using sample text to verify the inference setup.
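
The step from raw model outputs to a star rating can be sketched in plain Python. The notebook's `predict` function only takes the argmax of the logits; the softmax below is an optional extra that turns logits into probabilities, and the helper names and example logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; subtracting the maximum keeps exp() numerically stable."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_stars(logits):
    """Return the predicted star rating (1 to 5) and the probability assigned to it."""
    probabilities = softmax(logits)
    label = max(range(len(logits)), key=lambda i: logits[i])  # argmax, a label from 0 to 4
    return label + 1, probabilities[label]

stars, confidence = logits_to_stars([2.0, 0.5, -1.0, -2.0, -3.0])  # hypothetical logits
print(stars, round(confidence, 2))
```

Adding 1 to the predicted label is what converts the model's 0 to 4 output into the 1 to 5 star scale.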


In [None]:
# Setup inference

# Define the paths where model and tokenizer were saved
model_reload_path = "./score_prediction_model"
tokenizer_reload_path = "./score_prediction_tokenizer"

# Reload the tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_reload_path)

# Reload the model
model = AutoModelForSequenceClassification.from_pretrained(model_reload_path)

# Move model to device (CPU or CUDA if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval() # Set model to evaluation mode for inference

# Define prediction function for single text input
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions.item()

# Test prediction with sample text
sample_text = "I hated it"
prediction = predict(sample_text)
print(f"Prediction: {prediction}")

## Step 9 - Build a Gradio application for user input

In this step, we build a user-friendly Gradio interface to interact with our score prediction model. Gradio allows us to create a web-based GUI quickly.

We load the model and tokenizer once at module level, so the Gradio app does not reload them on every request. We define a mapping `score_texts` to convert the numerical predicted score (0-4) into a more user-readable text output (e.g., "1 star out of 5").

The `predict_score_gradio` function is designed for the Gradio interface. It takes text input from the user, runs the same tokenization and inference steps as the `predict` function to get the numerical score, and then retrieves the corresponding text output from `score_texts`.

Finally, we create and launch the Gradio interface using `gr.Interface`, specifying the prediction function, input and output types, title, and description. The `iface.launch()` command starts the Gradio server, making the application accessible in a web browser.
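
Because the Gradio callback is an ordinary Python function, its text-mapping logic can be checked without launching the interface or loading the model. The sketch below uses a stand-in predictor; `make_callback`, `fake_predict`, and the empty-input guard are hypothetical additions for illustration, not part of the notebook's app.

```python
# Same label-to-text mapping as `score_texts`, built programmatically.
SCORE_TEXTS = {label: f"{label + 1} star{'s' if label else ''} out of 5" for label in range(5)}

def make_callback(predict_fn):
    """Wrap a label-predicting function so it returns display text for the UI."""
    def callback(text_input):
        if not text_input or not text_input.strip():
            return "Please enter a review."   # hypothetical guard for empty input
        return SCORE_TEXTS.get(predict_fn(text_input), "Unknown rating")
    return callback

def fake_predict(text):
    """Stand-in for the real model: returns label 0 for negative wording, else 4."""
    return 0 if "hated" in text else 4

callback = make_callback(fake_predict)
print(callback("I hated it"))
print(callback("Loved it"))
print(callback("   "))
```

Passing the predictor in as an argument makes it easy to swap the stand-in for the real `predict` function once the model is loaded.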


In [None]:
# Load model and tokenizer globally (once)
tokenizer = AutoTokenizer.from_pretrained("./score_prediction_tokenizer")
model = AutoModelForSequenceClassification.from_pretrained("./score_prediction_model")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Use the GPU when available, otherwise the CPU
model.to(device)
model.eval()

# --- Score Text mappings ---
score_texts = {
    0: "1 star out of 5",
    1: "2 stars out of 5",
    2: "3 stars out of 5",
    3: "4 stars out of 5",
    4: "5 stars out of 5"
}

# --- Prediction function for Gradio Interface ---
def predict_score_gradio(text_input):
    inputs = tokenizer(text_input, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    predicted_score = predictions.item()
    output_text = score_texts.get(predicted_score, "Default Text")

    return output_text

# --- Gradio Interface ---
iface = gr.Interface(
    fn=predict_score_gradio,
    inputs=gr.Textbox(label="Enter text and let the score predictor come up with a star rating for you:"),
    outputs=gr.Textbox(label="Predicted Star Rating", lines=3),
    title="Customer satisfaction score predictor",
    description="Think of the last time you went somewhere new. How was it?",
)

iface.launch()

## Step 10 - Upload the model to Hugging Face

In this final step, we upload the model to Hugging Face. From there, we can launch the Gradio app to make it publicly available.

### Create Hugging Face Repository

We use the Hugging Face `HfApi` class to manage our repository:
- We check whether a repository with the specified ID exists and delete it if it does, ensuring a clean start.
- We then create a new Hugging Face repository with the given ID, ready for use. This also covers the case where the repository did not exist previously.

In [None]:
# Create hugging face repository

api = HfApi()

repo_owner = "mosaique258"
repo_name = "customer-score-predictor"
repo_id = f"{repo_owner}/{repo_name}"

# Check whether the repository already exists.
# Catch only "not found" so that authentication or network errors are not silently ignored.
from huggingface_hub.utils import RepositoryNotFoundError

try:
    api.repo_info(repo_id)
    # The repository exists: delete it (this cannot be undone)
    api.delete_repo(repo_id)
    print(f"Repository '{repo_id}' exists and has been deleted.")
except RepositoryNotFoundError:
    print(f"Repository '{repo_id}' does not exist. Proceeding to create a new one.")

# Create a new repository
api.create_repo(repo_id=repo_id)  # Set private=True if you want a private repository

### Upload trained model to Hugging Face Repository

Finally, we upload the model and the tokenizer files to Hugging Face.


In [None]:
# Upload the model and tokenizer to the repository

# Upload the model directory
upload_folder(
    folder_path="score_prediction_model",
    repo_id=repo_id,
    repo_type="model"
)

# Upload the tokenizer directory
upload_folder(
    folder_path="score_prediction_tokenizer",
    repo_id=repo_id,
    repo_type="model"
)