## **Introduction**
This notebook provides a step-by-step guide to fine-tuning a multilingual sentiment analysis model, starting from the `nlptown/bert-base-multilingual-uncased-sentiment` pre-trained BERT model. The pipeline covers data preprocessing, model fine-tuning with Hugging Face's Trainer, and performance evaluation. The fine-tuned model reaches a test accuracy of 77% and a test F1 score of 77% when classifying sentiment as negative, neutral, or positive. **The downsampled dataset contains about 84,000 examples (58,795 of them used for training), and training took approximately 1 hour on a T4 GPU.**

## **Step 1: Import Libraries**

In [2]:
# Install the Hugging Face `datasets` library if not already installed
#!pip install datasets 

# Import the Dataset class for working with datasets in Hugging Face
from datasets import Dataset 

# Import pandas for data manipulation and analysis
import pandas as pd 

# Import numpy for numerical computations
import numpy as np 

# Import the tokenizer and model for sequence classification from the Hugging Face Transformers library
from transformers import AutoTokenizer, AutoModelForSequenceClassification 

# Import Trainer and TrainingArguments for fine-tuning and training models
from transformers import Trainer, TrainingArguments 

# Import accuracy_score and precision_recall_fscore_support for custom evaluation metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support 

# Import EarlyStoppingCallback for stopping training early when the monitored metric stops improving
from transformers import EarlyStoppingCallback

# Import warnings to silence warnings that do not affect the model output
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="torch.nn.parallel")



## **Step 2: Load and Process Data**
- Load the complete training, test, and validation datasets from Kaggle.
- Reduce the languages from six to four (English, French, Spanish, and German) for compatibility with the `nlptown/bert-base-multilingual-uncased-sentiment` model.
- Transform the 5-star ratings into three polarity categories: **positive**, **negative**, and **neutral**.
- Downsample the dataset from 1.26 million entries to 84,000.
- Rename the input columns to `text` and `label`, the names used by the tokenization and training steps.







### **Load the Dataset**

In [3]:
train_df = pd.read_csv("/kaggle/input/amazon-reviews-multi/train.csv")
test_df = pd.read_csv("/kaggle/input/amazon-reviews-multi/test.csv")
dev_df = pd.read_csv("/kaggle/input/amazon-reviews-multi/validation.csv")

In [4]:
# Print the number of examples in the training set
# Training set has 1.2 million examples (95.2% of the total data)
print(f"Training Examples: {train_df.shape[0]}") 

# Print the number of examples in the testing set
# Testing set has 30,000 examples (2.4% of the total data)
print(f"Testing Examples: {test_df.shape[0]}") 

# Print the number of examples in the development (validation) set
# Development set has 30,000 examples (2.4% of the total data)
print(f"Development Examples: {dev_df.shape[0]}") 

# Calculate the total number of examples across all datasets
total_examples = train_df.shape[0] + test_df.shape[0] + dev_df.shape[0]
print(f"Total Examples: {total_examples}")  # Should equal 1.26 million in this case

# Calculate the percentage of examples in the test set
test_set_percentage = test_df.shape[0] / total_examples * 100

# Calculate the percentage of examples in the development set
dev_set_percentage = dev_df.shape[0] / total_examples * 100

# Calculate the percentage of examples in the training set
train_set_percentage = train_df.shape[0] / total_examples * 100

# Print the calculated percentages for each dataset
# Should confirm the split as approximately 95.2% training, 2.4% testing, and 2.4% development
print(f"Test Set Percentage: {test_set_percentage:.1f}%")
print(f"Development Set Percentage: {dev_set_percentage:.1f}%")
print(f"Training Set Percentage: {train_set_percentage:.1f}%")


Training Examples: 1200000
Testing Examples: 30000
Development Examples: 30000
Total Examples: 1260000
Test Set Percentage: 2.4%
Development Set Percentage: 2.4%
Training Set Percentage: 95.2%


### **Filter Target Languages**

In [6]:
# Define the target languages to include in the filtered datasets
target_languages = ['en', 'es', 'fr', 'de']  # English, Spanish, French, German

# Filter the training dataset to include only examples in the target languages
# Result: 800,000 examples remain after filtering
train_df = train_df[train_df['language'].isin(target_languages)] 

# Filter the development (validation) dataset to include only examples in the target languages
# Result: 20,000 examples remain after filtering
dev_df = dev_df[dev_df['language'].isin(target_languages)] 

# Filter the testing dataset to include only examples in the target languages
# Result: 20,000 examples remain after filtering
test_df = test_df[test_df['language'].isin(target_languages)] 


In [5]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,0,de_0203609,product_de_0865382,reviewer_de_0267719,1,Armband ist leider nach 1 Jahr kaputt gegangen,Leider nach 1 Jahr kaputt,de,sports
1,1,de_0559494,product_de_0678997,reviewer_de_0783625,1,In der Lieferung war nur Ein Akku!,EINS statt ZWEI Akkus!!!,de,home_improvement
2,2,de_0238777,product_de_0372235,reviewer_de_0911426,1,"Ein Stern, weil gar keine geht nicht. Es hande...",Achtung Abzocke,de,drugstore
3,3,de_0477884,product_de_0719501,reviewer_de_0836478,1,"Dachte, das wären einfach etwas festere Binden...",Zu viel des Guten,de,drugstore
4,4,de_0270868,product_de_0022613,reviewer_de_0736276,1,Meine Kinder haben kaum damit gespielt und nac...,Qualität sehr schlecht,de,toy


### **Convert Star Ratings to Sentiment Polarity**

In [7]:
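# Keep only the review text and star rating; .copy() avoids pandas'
# SettingWithCopyWarning when the 'polarity' column is added later
train_df = train_df[['review_body', 'stars']].copy()
test_df = test_df[['review_body', 'stars']].copy()
dev_df = dev_df[['review_body', 'stars']].copy()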

In [8]:
train_df.head(3)

Unnamed: 0,review_body,stars
0,Armband ist leider nach 1 Jahr kaputt gegangen,1
1,In der Lieferung war nur Ein Akku!,1
2,"Ein Stern, weil gar keine geht nicht. Es hande...",1


In [9]:
def convert_to_polarity(stars):
    '''
    Converts star ratings into sentiment polarity categories.
    Input:
      - stars: An integer representing the star rating (1 to 5).
    Output:
      - 0 for negative sentiment (1 or 2 stars)
      - 1 for neutral sentiment (3 stars)
      - 2 for positive sentiment (4 or 5 stars)
    '''
    if stars in [1, 2]:
        return 0  # Negative sentiment
    elif stars == 3:
        return 1  # Neutral sentiment
    elif stars in [4, 5]:
        return 2  # Positive sentiment

# Apply the `convert_to_polarity` function to the 'stars' column in each dataset
# This creates a new 'polarity' column representing sentiment
train_df['polarity'] = train_df['stars'].apply(convert_to_polarity)  # Polarity for training data
test_df['polarity'] = test_df['stars'].apply(convert_to_polarity)   # Polarity for test data
dev_df['polarity'] = dev_df['stars'].apply(convert_to_polarity)     # Polarity for validation data


In [10]:
train_df.head()

Unnamed: 0,review_body,stars,polarity
0,Armband ist leider nach 1 Jahr kaputt gegangen,1,0
1,In der Lieferung war nur Ein Akku!,1,0
2,"Ein Stern, weil gar keine geht nicht. Es hande...",1,0
3,"Dachte, das wären einfach etwas festere Binden...",1,0
4,Meine Kinder haben kaum damit gespielt und nac...,1,0


### **Downsample the Dataset** 

In [11]:
# Since the datasets are incredibly large, and due to processing and memory limitations, 
# I will only use a subset of the data. The subset will still preserve the proportion of 
# samples with respect to the star ratings to maintain dataset balance.

# Define the total size of the new subset
new_total_size = 84000  # Total number of samples to use across train, dev, and test subsets

# Define the proportions for each subset (70% training, 20% development, 10% testing)
train_size = int(new_total_size * 0.70)  # 70% of the total size
dev_size = int(new_total_size * 0.20)    # 20% of the total size
test_size = int(new_total_size * 0.10)   # 10% of the total size

# Function to create a stratified sample
def stratified_sample(df, target_col, sample_size):
    """
    Create a stratified sample from the dataset.
    Ensures the proportion of each category in the target column (e.g., star ratings) 
    is preserved in the sample.

    Parameters:
    - df (DataFrame): The input dataset to sample from
    - target_col (str): The column name representing categories (e.g., 'stars')
    - sample_size (int): The total number of samples to extract from the dataset

    Returns:
    - DataFrame: A stratified sample of the dataset with preserved proportions
    """
    # Group by the target column and sample from each group based on its proportion
    # (the result is named `sampled` so it does not shadow the function name)
    sampled = df.groupby(target_col, group_keys=False).apply(
        lambda x: x.sample(n=min(len(x), int(sample_size * len(x) / len(df))), random_state=42)
    )
    return sampled

# Perform stratified sampling for training, development, and test subsets
train_subset = stratified_sample(train_df, target_col='stars', sample_size=train_size)
dev_subset = stratified_sample(dev_df, target_col='stars', sample_size=dev_size)
test_subset = stratified_sample(test_df, target_col='stars', sample_size=test_size)

# Verify that the star rating proportions are preserved in each subset
print("Train Subset Distribution:")
print(train_subset['stars'].value_counts(normalize=True))  # Prints proportions of star ratings in training data

print("Dev Subset Distribution:")
print(dev_subset['stars'].value_counts(normalize=True))    # Prints proportions of star ratings in development data

print("Test Subset Distribution:")
print(test_subset['stars'].value_counts(normalize=True))   # Prints proportions of star ratings in testing data


Train Subset Distribution:
stars
1    0.2
2    0.2
3    0.2
4    0.2
5    0.2
Name: proportion, dtype: float64
Dev Subset Distribution:
stars
1    0.2
2    0.2
3    0.2
4    0.2
5    0.2
Name: proportion, dtype: float64
Test Subset Distribution:
stars
1    0.2
2    0.2
3    0.2
4    0.2
5    0.2
Name: proportion, dtype: float64
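
The subsets are balanced by star rating, but the model is trained on polarity, and the star-to-polarity mapping merges two ratings into each of the negative and positive classes. The following minimal sketch shows the resulting class shares. It assumes a toy list of ratings with exactly 20% per star (hypothetical data, not the actual reviews) and re-declares the mapping so the snippet runs on its own.

```python
from collections import Counter

def convert_to_polarity(stars):
    """Same mapping as the notebook: 1-2 -> 0, 3 -> 1, 4-5 -> 2."""
    if stars in (1, 2):
        return 0  # Negative
    elif stars == 3:
        return 1  # Neutral
    elif stars in (4, 5):
        return 2  # Positive
    raise ValueError(f"Unexpected star rating: {stars}")

# Hypothetical toy sample with a uniform star distribution (20% per rating)
stars = [1, 2, 3, 4, 5] * 20

counts = Counter(convert_to_polarity(s) for s in stars)
polarity_share = {label: counts[label] / len(stars) for label in sorted(counts)}
print(polarity_share)  # {0: 0.4, 1: 0.2, 2: 0.4}
```

Under this assumption the neutral class is half the size of the other two, which is one reason to look at per-class or weighted metrics rather than accuracy alone.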


### **Modify Input Labels**

In [12]:
# Define a preprocessing function to format the data for model input
def preprocess_function(example):
    """
    Preprocesses the input examples to create a text-label mapping.
    
    Parameters:
    - example (dict): A batch of examples (since `map` is called with `batched=True`) containing 'review_body' (texts) and 'polarity' (labels).

    Returns:
    - dict: A dictionary with the processed text and label ready for training.
      Keys:
        - "text": The review text (from 'review_body')
        - "label": The polarity label (from 'polarity')
    """
    return {
        "text": example["review_body"],  # Map the review text
        "label": example["polarity"]    # Map the polarity label
    }

In [13]:
# Convert the stratified subsets into Hugging Face Dataset objects
# Include only the 'review_body' (text) and 'polarity' (label) columns
train_set = Dataset.from_pandas(train_subset[['review_body', 'polarity']])  # Training dataset
test_set = Dataset.from_pandas(test_subset[['review_body', 'polarity']])    # Testing dataset
dev_set = Dataset.from_pandas(dev_subset[['review_body', 'polarity']])      # Development (validation) dataset

# Preprocess the datasets by applying the preprocessing function
# The `map` function applies the `preprocess_function` to all examples in the dataset
train_set = train_set.map(preprocess_function, batched=True)  # Preprocess training dataset
test_set = test_set.map(preprocess_function, batched=True)    # Preprocess testing dataset
dev_set = dev_set.map(preprocess_function, batched=True)      # Preprocess development dataset


Map:   0%|          | 0/58795 [00:00<?, ? examples/s]

Map:   0%|          | 0/8400 [00:00<?, ? examples/s]

Map:   0%|          | 0/16800 [00:00<?, ? examples/s]
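
The training subset contains 58,795 rows rather than the 58,800 that 70% of 84,000 would suggest. One plausible explanation is that `int()` truncates twice: once when computing the split size and once per star group. The sketch below reproduces the arithmetic. The group sizes (160,000 rows per star out of 800,000) are an assumption based on the filtered row count and the uniform 0.2 proportions printed earlier.

```python
new_total_size = 84000

# int() truncates toward zero. 84000 * 0.70 evaluates to a float just below
# 58800 in binary floating point, so one sample is lost here.
train_size = int(new_total_size * 0.70)
dev_size = int(new_total_size * 0.20)
test_size = int(new_total_size * 0.10)
print(train_size, dev_size, test_size)  # 58799 16800 8400

# Assumed group sizes: 5 star ratings with 160,000 rows each (800,000 total)
group_len, df_len, n_groups = 160_000, 800_000, 5

# Same per-group formula as stratified_sample: truncation again
truncated_total = n_groups * int(train_size * group_len / df_len)
print(truncated_total)  # 58795

# Using round() instead keeps the intended sizes under the same assumptions
rounded_size = round(new_total_size * 0.70)
rounded_total = n_groups * round(rounded_size * group_len / df_len)
print(rounded_total)  # 58800
```

The five missing rows have no practical effect on training, but `round()` is the safer choice when exact split sizes matter.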

## **Step 3: Tokenization**
- Use the `nlptown/bert-base-multilingual-uncased-sentiment` model for tokenization and fine-tuning. The checkpoint supports sentiment classification in English, Spanish, French, Dutch, Italian, and German.
- Implement a tokenization function for the review text.
- Apply tokenization to the subsampled training, test, and validation datasets.

### **Load Pre-Trained Tokenizer**

In [14]:
# Specify the pre-trained model to be used for sentiment analysis
# Using a multilingual BERT model fine-tuned for sentiment classification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

# Load the tokenizer associated with the specified model
# This tokenizer is used to tokenize the text inputs into the format required by the model
tokenizer = AutoTokenizer.from_pretrained(model_name)





### **Define Tokenization Function** 

In [15]:
# Define a function to tokenize the dataset
# The function uses the tokenizer to convert text inputs into tokenized format required by the model
def tokenize_function(examples):
    """
    Tokenizes the input examples for the model.
    
    Parameters:
    - examples (dict): A dictionary containing the "text" field to be tokenized.

    Returns:
    - dict: A dictionary with tokenized input features, including:
      - input_ids: Tokenized numerical representation of the text
      - attention_mask: 1 for real tokens and 0 for padding tokens
    """
    # Tokenize the text with the following parameters:
    # - padding="max_length": Ensures all sequences are padded to the same length (128 tokens here)
    # - truncation=True: Truncates text that exceeds the maximum length
    # - max_length=128: Specifies the maximum token length for each input
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
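
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy stand-in written in plain Python. It is not the tokenizer's implementation: `pad_or_truncate`, the token IDs, and the pad ID of 0 are illustrative assumptions, and a real BERT tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of fixed-length padding with truncation.

    Returns (input_ids, attention_mask), both of length max_length.
    """
    ids = token_ids[:max_length]          # Truncate sequences that are too long
    attention_mask = [1] * len(ids)       # 1 marks a real token
    n_pad = max_length - len(ids)         # Number of padding positions needed
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# Hypothetical token IDs, chosen only for illustration
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0, 0] [1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Every example ends up with the same length, which is why the notebook's tokenized datasets can be batched without a custom collator.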


### **Tokenize Datasets**

In [16]:
# Apply the tokenize function to the training dataset
# `batched=True` allows tokenization of multiple examples at once for faster processing
tokenized_train = train_set.map(tokenize_function, batched=True)

# Apply the tokenize function to the test dataset
tokenized_test = test_set.map(tokenize_function, batched=True)

# Apply the tokenize function to the development (validation) dataset
tokenized_dev = dev_set.map(tokenize_function, batched=True)

Map:   0%|          | 0/58795 [00:00<?, ? examples/s]

Map:   0%|          | 0/8400 [00:00<?, ? examples/s]

Map:   0%|          | 0/16800 [00:00<?, ? examples/s]

## **Step 4: Define and Fine-Tune the Model**
- Load the pre-trained model with 3 labels.
- Define the training arguments.
- Define the evaluation metrics: accuracy, F1 score, precision, and recall.
- Train the model (**about 1 hour on a single T4 GPU**).

### **Load Pre-Trained Model**

In [17]:
# Load the pre-trained model with the specified number of labels
# - `num_labels=3`: Model output corresponds to three sentiment classes (negative, neutral, positive)
# - `ignore_mismatched_sizes=True`: Lets the checkpoint's 5-class classifier head be replaced by a newly initialized 3-class head instead of raising a shape-mismatch error
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=3, 
    ignore_mismatched_sizes=True
)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([5]) in the checkpoint and torch.Size([3]) in the model instantiated
- classifier.weight: found shape torch.Size([5, 768]) in the checkpoint and torch.Size([3, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **Define Training Arguments**

In [18]:
# Define the training arguments for fine-tuning the model
training_args = TrainingArguments(
    output_dir="./results",  # Directory to save the model checkpoints and results
    evaluation_strategy="epoch",  # Perform evaluation at the end of every epoch
    save_strategy="epoch",  # Save model checkpoints at the end of every epoch
    learning_rate=1e-4,  # Initial learning rate for the optimizer
    logging_steps=100,  # Log training progress every 100 steps
    save_steps=100,  # Has no effect here: with save_strategy="epoch", checkpoints are saved once per epoch
    per_device_train_batch_size=64,  # Training batch size per GPU/TPU/CPU
    per_device_eval_batch_size=64,  # Evaluation batch size per GPU/TPU/CPU
    gradient_accumulation_steps=8,  # Accumulate gradients over 8 steps before updating model weights
    num_train_epochs=2,  # Number of training epochs
    weight_decay=0.05,  # Weight decay for regularization
    load_best_model_at_end=True,  # Load the best model (based on evaluation metric) after training
    metric_for_best_model="accuracy",  # Metric used to determine the best model
    report_to="none"  # Disable reporting to external tools (e.g., WandB, TensorBoard)
)




### **Define Metrics**

In [19]:
# Define the evaluation metrics for the model's performance

def compute_metrics(eval_pred):
    """
    Computes evaluation metrics for the model's predictions.

    Args:
        eval_pred (tuple): A tuple containing:
            - logits (ndarray): The predicted raw scores (logits) from the model.
            - labels (ndarray): The true labels for the evaluation dataset.

    Returns:
        dict: A dictionary containing accuracy, F1 score, precision, and recall.
    """
    # Unpack the logits and true labels from the evaluation predictions
    logits, labels = eval_pred

    # Convert logits to predicted class indices by selecting the maximum score
    predictions = np.argmax(logits, axis=-1)

    # Compute precision, recall, and F1-score using scikit-learn
    # average='weighted' averages the per-class scores, weighting each class by its number of true examples
    # zero_division=1 reports a score of 1 (instead of warning) when precision or recall is undefined
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted', zero_division=1
    )

    # Compute accuracy as the proportion of correct predictions
    acc = accuracy_score(labels, predictions)

    # Return a dictionary containing all calculated metrics
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
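
One consequence of `average='weighted'` is that weighted recall is always equal to accuracy, which is why the Recall and Accuracy columns in the training log are identical. The sketch below checks this on a small hypothetical label set using plain Python. The helper names are illustrative and are not part of scikit-learn.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    n = len(y_true)
    support = Counter(y_true)                                        # True examples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # Hits per class
    return sum((correct[c] / support[c]) * (support[c] / n) for c in support)

# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(accuracy(y_true, y_pred))
print(weighted_recall(y_true, y_pred))
```

The support terms cancel, leaving the total number of correct predictions divided by the number of examples. Weighted recall therefore carries no information beyond accuracy, so macro-averaged recall may be a more informative complement.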



### **Initialize Trainer**

In [20]:
# Initialize Trainer
trainer = Trainer(
    model=model,  # The pre-trained model to be fine-tuned or trained
    args=training_args,  # Configuration for the training process (e.g., batch size, learning rate)
    train_dataset=tokenized_train,  # Tokenized training dataset
    eval_dataset=tokenized_dev,  # Tokenized validation dataset for evaluation
    tokenizer=tokenizer,  # Tokenizer used for processing input text data
    callbacks=[  # List of callbacks to monitor and adjust training
        EarlyStoppingCallback(early_stopping_patience=1)  # Stop if the monitored metric (accuracy) fails to improve for 1 consecutive evaluation
    ],
    compute_metrics=compute_metrics  # Function to calculate evaluation metrics (e.g., accuracy, F1-score)
)
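
The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration, not the code of `EarlyStoppingCallback`: the function name `stopping_epoch` is hypothetical, a higher metric is assumed to be better, and the improvement threshold is ignored.

```python
def stopping_epoch(metric_history, patience=1):
    """Return the index of the evaluation at which training would stop, or None.

    Training stops once the metric has failed to beat its best value for
    `patience` consecutive evaluations.
    """
    best = None
    bad_evals = 0
    for epoch, value in enumerate(metric_history):
        if best is None or value > best:
            best = value        # New best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1      # No improvement at this evaluation
            if bad_evals >= patience:
                return epoch
    return None

# Validation accuracies from this notebook's two epochs: no early stop
print(stopping_epoch([0.754762, 0.768750]))      # None

# Hypothetical longer run where accuracy dips at the third evaluation
print(stopping_epoch([0.75, 0.77, 0.76, 0.78]))  # 2
```

With `patience=1`, a single non-improving evaluation ends training, which is strict. A larger patience tolerates noisy validation metrics at the cost of extra epochs.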


### **Train the Model**

In [21]:
trainer.train()  # Takes ~1 hr to train on a single T4 GPU

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
0,No log,0.568518,0.754762,0.754096,0.755294,0.754762
1,0.563200,0.554587,0.76875,0.763242,0.759984,0.76875


TrainOutput(global_step=114, training_loss=0.5533648457443505, metrics={'train_runtime': 1540.1979, 'train_samples_per_second': 76.347, 'train_steps_per_second': 0.074, 'total_flos': 7673110822847232.0, 'train_loss': 0.5533648457443505, 'epoch': 1.982608695652174})
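
The `global_step` value can be cross-checked against the batch settings. The sketch below assumes that the number of optimizer updates per epoch is the number of batches floor-divided by `gradient_accumulation_steps`. This matches the logged numbers but may differ between Trainer versions. `optimizer_steps` is a hypothetical helper, not a Transformers function.

```python
import math

def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices):
    """Estimate the total number of optimizer updates for a training run.

    Assumption: updates per epoch = batches per epoch // grad_accum.
    """
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

# 58,795 training examples, batch size 64, 8 accumulation steps, 2 epochs
print(optimizer_steps(58795, 64, 8, 2, n_devices=1))  # 228
print(optimizer_steps(58795, 64, 8, 2, n_devices=2))  # 114

# Effective batch size per optimizer update on a single device
print(64 * 8)  # 512
```

Under these assumptions, the logged `global_step=114` corresponds to the two-device case. This suggests, without proving it, that the logged run used two GPUs, which would also be consistent with the `torch.nn.parallel` warning filter in Step 1.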

## **Step 5: Evaluate the Model**
- Evaluate F1 score and accuracy on training, test, and dev sets.
- Save predictions to a CSV file.

### **Evaluate on Training and Dev Sets** 

In [23]:
#Evaluate accuracy of training and validation sets

train_results = trainer.evaluate(tokenized_train)
print(f"Train Accuracy: {train_results['eval_accuracy']:.4f}")  # Print train accuracy
print(f"Train F1 Score: {train_results['eval_f1']:.4f}")     # Print train F1 score

dev_results = trainer.evaluate(tokenized_dev)
print(f"Validation Accuracy: {dev_results['eval_accuracy']:.4f}")  # Print validation accuracy
print(f"Validation F1 Score: {dev_results['eval_f1']:.4f}")     # Print validation F1 score


Train Accuracy: 0.8256
Train F1 Score: 0.8204
Validation Accuracy: 0.7688
Validation F1 Score: 0.7632


### **Evaluate on Test Set** 

In [24]:
#Evaluate accuracy of test set
test_results = trainer.evaluate(tokenized_test)
print(f"Test Accuracy: {test_results['eval_accuracy']:.4f}")  # Print test accuracy
print(f"Test F1 Score: {test_results['eval_f1']:.4f}")     # Print test F1 score

Test Accuracy: 0.7718
Test F1 Score: 0.7681


### **Save Predictions** 

In [None]:
predictions = trainer.predict(tokenized_test)
final_predictions = predictions.predictions.argmax(axis=1)

results_df = pd.DataFrame({"review_body": test_subset['review_body'], "predicted_polarity": final_predictions})
results_df.to_csv("predictions.csv", index=False)
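
The saved `predicted_polarity` column holds class indices (0, 1, 2). For a more readable file, the indices can be mapped back to label names. Below is a minimal sketch using the same encoding as `convert_to_polarity`. The logits are hypothetical values, and `ID2LABEL` and `argmax` are illustrative names.

```python
# Same encoding as convert_to_polarity: 0 = negative, 1 = neutral, 2 = positive
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical logits for three reviews (one row per review, one column per class)
logits = [
    [2.1, 0.3, -1.0],
    [-0.5, 0.2, 1.7],
    [0.1, 0.9, 0.4],
]

labels = [ID2LABEL[argmax(row)] for row in logits]
print(labels)  # ['negative', 'positive', 'neutral']
```

In the notebook, the same dictionary could be applied to `final_predictions` before building `results_df`.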
