
# Multilingual Sentiment Classification using BERT & LLMs

- Chosen Corpus: [Multilingual Amazon Reviews Corpus (MARC)](https://registry.opendata.aws/amazon-reviews-ml/)

### Corpus information

#### Description of the Chosen Corpus

The [Multilingual Amazon Reviews Corpus (MARC)](https://registry.opendata.aws/amazon-reviews-ml/) is a large-scale dataset tailored for multilingual text classification. It comprises Amazon reviews in six languages: English, Japanese, German, French, Spanish, and Chinese, collected between 2015 and 2019. The dataset includes *review text*, *review titles*, *star ratings*, *anonymized reviewer* and *product IDs*, and *coarse-grained product categories*. Each language contains balanced subsets of 200,000 training samples, 5,000 development samples, and 5,000 test samples, ensuring equal representation of the five star ratings. The reviews are filtered for quality through language detection algorithms and additional criteria such as verified purchases and token frequency thresholds.

The dataset was released through Amazon's [Open Data Program](https://registry.opendata.aws/amazon-reviews-ml/) and is stored in Amazon Simple Storage Service (S3). It is also available on [Hugging Face](https://huggingface.co/datasets/mteb/amazon_reviews_multi/) in an extended form. On Hugging Face, the dataset is reorganized to include translations of the reviews into multiple languages, allowing multilingual evaluations. This version follows a standardized format in which each row contains an `id`, a unique identifier for the review (e.g. `de_0203609`), and `text`, the text of the review, either in its original language or translated. The `label` ranges from 0 to 4 and is derived from the star rating, with 0 indicating the lowest rating (1 star) and 4 the highest (5 stars). Each row also includes `label_text`, the textual representation of the sentiment class. The Hugging Face version integrates the dataset with the Massive Text Embedding Benchmark (MTEB) framework, which facilitates benchmarking and model evaluation across languages and tasks such as zero-shot transfer.
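
As a small illustration of this row format, the sketch below maps a `label` back to its star rating and reads the language code from an `id`. The helper names `label_to_stars` and `language_of` and the sample row are made up for illustration, and the assumption that the part of the `id` before the underscore is the language code rests only on the example identifier above.

```python
def label_to_stars(label: int) -> int:
    """Convert a 0-4 class label to the corresponding 1-5 star rating."""
    if label not in range(5):
        raise ValueError(f"label must be between 0 and 4, got {label}")
    return label + 1


def language_of(review_id: str) -> str:
    """Return the prefix before the first underscore, e.g. 'de' for 'de_0203609'."""
    return review_id.split("_", 1)[0]


# Hypothetical row in the Hugging Face format (the review text is invented)
row = {"id": "de_0203609", "text": "Sehr gutes Produkt.", "label": 4}
print(language_of(row["id"]), label_to_stars(row["label"]))  # de 5
```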

---

#### Paper(s) and Other Published Materials Related to the Corpus

The primary reference for the dataset is the [paper](https://aclanthology.org/2020.emnlp-main.369/) titled *The Multilingual Amazon Reviews Corpus*, presented at the [2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)](https://2020.emnlp.org/). The authors (Phillip Keung et al.) describe the dataset's construction process, its utility for supervised and zero-shot cross-lingual tasks, and baseline results using multilingual BERT (mBERT). Related research includes prior multilingual datasets such as Reuters RCV1/RCV2, XNLI, and earlier Amazon review datasets (e.g., Ni et al., 2019), which were limited by smaller sizes, lack of multilingual focus, or absence of well-defined splits. MARC addresses these gaps by offering a balanced, comprehensive and publicly available dataset for reproducible multilingual NLP research. Furthermore, the paper emphasizes that the dataset allows researchers to evaluate models in *zero-shot transfer learning scenarios*, where a model trained on one language is tested on another.

---

#### Random Baseline Performance and Expected Performance for Recent Machine Learned Models

Although *accuracy* is the metric most widely used for classification tasks, the original paper recommends *mean absolute error (MAE)*, which takes the ordering of the star ratings into account. Given the five-star rating system and a balanced label distribution, a random baseline on MARC would yield:
- **Accuracy**: $ \frac{1}{5} = 20\% $, since there are five equally likely star ratings.
- **Mean Absolute Error (MAE)**: $ \frac{40}{25} = 1.6 $, assuming predictions are uniformly distributed across the star ratings and independent of the true rating.
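
The random-baseline figures can be checked by enumerating every combination of true label and random prediction. The sketch below assumes that both the labels and the predictions are uniform over the five classes and independent of each other; the variable names are illustrative.

```python
from itertools import product

NUM_CLASSES = 5  # star ratings 1..5, encoded as labels 0..4

# Every (true label, random prediction) pair; all 25 pairs are equally
# likely when labels and predictions are both uniform and independent.
pairs = list(product(range(NUM_CLASSES), repeat=2))

random_accuracy = sum(t == p for t, p in pairs) / len(pairs)
random_mae = sum(abs(t - p) for t, p in pairs) / len(pairs)

print(f"accuracy = {random_accuracy:.2f}")  # accuracy = 0.20
print(f"MAE      = {random_mae:.2f}")       # MAE      = 1.60
```

Under these assumptions the expected MAE of a uniform random guesser is 40/25 = 1.6.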

Recent machine-learned models like fine-tuned mBERT demonstrate significant improvements:
1. **Fully Supervised Fine-Grained Classification**:
   - **Accuracy**: On average, mBERT achieves an accuracy of **59.2%** when using the review body combined with title and product category as inputs.
   - **MAE**: An average of **0.482** for the same setup.

2. **Zero-Shot Cross-Lingual Transfer (Fine-Grained Classification)**:
   - **Accuracy**: An average of **44.0%** when trained on one source language and tested on others.
   - **MAE**: An average of **0.769**, demonstrating its effectiveness across languages even without direct training data.




---

## 1. Setup

In [None]:
#Install libraries
!pip install --quiet datasets evaluate
!pip install -U bitsandbytes --quiet
!pip install --quiet transformers accelerate #bitsandbytes>0.37.0


In [None]:
# Import libraries
from datasets import load_dataset, Dataset, concatenate_datasets
import pandas as pd
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig,
)
import evaluate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
import re
import torch
import os

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In this project, we use the English, French, German and Spanish corpora of the MARC dataset. All of them are loaded from Hugging Face, and the rest of the project works with this version.

In [None]:
# Download the corpus

# Load the English corpus
ds_en = load_dataset("mteb/amazon_reviews_multi", "en")

# Load the French corpus
ds_fr = load_dataset("mteb/amazon_reviews_multi", "fr")

# Load the German corpus
ds_de = load_dataset("mteb/amazon_reviews_multi", "de")

# Load the Spanish corpus
ds_es = load_dataset("mteb/amazon_reviews_multi", "es")


### 2.2. Sampling and preprocessing

Since the dataset is already balanced across labels and preprocessed, and our compute resources are limited, we only downsample and tokenize it. We split the data into two parts: the English corpus and the non-English corpora (French, German and Spanish).

#### 2.2.1. Downsampling
##### English corpus

The English corpus is downsampled while keeping a balanced representation across labels. For the training set, 20,000 examples are sampled per label without replacement. For the test and validation sets, 100 examples per label are sampled. After downsampling, the English dataset has 100,000, 500 and 500 examples for training, validation and test, respectively.

In [None]:
#Loading the English corpus as pandas dataframe
en_train = pd.DataFrame(ds_en["train"])
en_test = pd.DataFrame(ds_en["test"])
en_val = pd.DataFrame(ds_en["validation"])

# Downsample the training set to 100,000 examples: 20,000 per label
en_train = en_train.groupby('label').apply(lambda x: x.sample(n=20000, replace=False) if len(x) >= 20000 else x).reset_index(drop=True)
# Downsample the test set to 500 examples: 100 per label
en_test = en_test.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)
en_test_copy = en_test  # keep the pandas DataFrame for testing the generative model later
# Downsample the validation set to 500 examples: 100 per label
en_val = en_val.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)

# Convert to Dataset objects
en_train = Dataset.from_pandas(en_train)
en_test = Dataset.from_pandas(en_test)
en_val = Dataset.from_pandas(en_val)
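
The same per-label sampling is repeated for every split and language, so it could be factored into one helper. The sketch below is one possible way to do this; the function name `balanced_sample` and the fixed `seed` are assumptions added for illustration (the cells in this notebook sample without a seed, so their output differs between runs).

```python
import pandas as pd


def balanced_sample(df: pd.DataFrame, n_per_label: int, seed: int = 42) -> pd.DataFrame:
    """Sample up to n_per_label rows per label, without replacement."""
    parts = [
        # Labels with fewer than n_per_label rows are kept in full
        group.sample(n=min(n_per_label, len(group)), replace=False, random_state=seed)
        for _, group in df.groupby("label")
    ]
    return pd.concat(parts).reset_index(drop=True)


# Toy frame: 5 labels with 4 reviews each
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [i % 5 for i in range(20)],
})
sample = balanced_sample(toy, n_per_label=2)
print(sample["label"].value_counts().sort_index().tolist())  # [2, 2, 2, 2, 2]
```

With a fixed seed, repeated calls return the same rows, which makes later comparisons between models easier to reproduce.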

##### Non-English Data

The non-English corpora are downsampled in the same way, keeping a balanced representation across labels. For each of French, German and Spanish, 10,000 training examples are sampled per label without replacement, and 100 examples per label are sampled for the test and validation sets. Across the three languages, this yields 150,000, 1,500 and 1,500 examples for training, validation and test, respectively.

In [None]:
#Loading the French corpus as pandas dataframe
fr_train = pd.DataFrame(ds_fr["train"])
fr_test = pd.DataFrame(ds_fr["test"])
fr_val = pd.DataFrame(ds_fr["validation"])


# Downsample the training set to 50,000 examples: 10,000 per label
fr_train = fr_train.groupby('label').apply(lambda x: x.sample(n=10000, replace=False) if len(x) >= 10000 else x).reset_index(drop=True)
# Downsample the test set to 500 examples: 100 per label
fr_test = fr_test.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)
# Downsample the validation set to 500 examples: 100 per label
fr_val = fr_val.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)

# Convert to a Dataset object (train and validation stay as DataFrames for the concatenation below)
# fr_train = Dataset.from_pandas(fr_train)
fr_test = Dataset.from_pandas(fr_test)
# fr_val = Dataset.from_pandas(fr_val)

In [None]:
#Loading the German corpus as pandas dataframe
de_train = pd.DataFrame(ds_de["train"])
de_test = pd.DataFrame(ds_de["test"])
de_val = pd.DataFrame(ds_de["validation"])

# Downsample the training set to 50,000 examples: 10,000 per label
de_train = de_train.groupby('label').apply(lambda x: x.sample(n=10000, replace=False) if len(x) >= 10000 else x).reset_index(drop=True)
# Downsample the test set to 500 examples: 100 per label
de_test = de_test.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)
# Downsample the validation set to 500 examples: 100 per label
de_val = de_val.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)

# Convert to a Dataset object (train and validation stay as DataFrames for the concatenation below)
# de_train = Dataset.from_pandas(de_train)
de_test = Dataset.from_pandas(de_test)
# de_val = Dataset.from_pandas(de_val)

In [None]:
# Load the Spanish corpus as pandas DataFrames
es_train = pd.DataFrame(ds_es["train"])
es_test = pd.DataFrame(ds_es["test"])
es_val = pd.DataFrame(ds_es["validation"])

# Downsample the training set to 50,000 examples: 10,000 per label
es_train = es_train.groupby('label').apply(lambda x: x.sample(n=10000, replace=False) if len(x) >= 10000 else x).reset_index(drop=True)
# Downsample the test set to 500 examples: 100 per label
es_test = es_test.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)
# Downsample the validation set to 500 examples: 100 per label
es_val = es_val.groupby('label').apply(lambda x: x.sample(n=100, replace=False) if len(x) >= 100 else x).reset_index(drop=True)

# Convert to a Dataset object (train and validation stay as DataFrames for the concatenation below)
# es_train = Dataset.from_pandas(es_train)
es_test = Dataset.from_pandas(es_test)
# es_val = Dataset.from_pandas(es_val)

In [None]:
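# Concatenate all non-English corpora into single pandas DataFrames
# (ignore_index avoids carrying duplicated row indices into the Dataset)
non_english_train = pd.concat([fr_train, de_train, es_train], ignore_index=True)
non_english_val = pd.concat([fr_val, de_val, es_val], ignore_index=True)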

# Convert to datasets object
non_english_train = Dataset.from_pandas(non_english_train)
non_english_val = Dataset.from_pandas(non_english_val)

#### 2.2.2. Preprocessing

To feed the MARC corpora to our models, we first tokenize them. It is important to use the tokenizer of the model we intend to fine-tune, so we load the tokenizer of `bert-base-multilingual-cased`.

In [None]:
#Load the tokenizer
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
# Tokenization
# Helper function: pad or truncate every review to 256 tokens
def tokenize_function(example):
  return tokenizer(example["text"], padding="max_length", truncation=True, max_length=256)

# Tokenize the English sets
en_train_tokenized = en_train.map(tokenize_function, batched=True)
en_val_tokenized = en_val.map(tokenize_function, batched=True)
en_test_tokenized = en_test.map(tokenize_function, batched=True)

# Tokenize the non-English sets
non_english_train_tokenized = non_english_train.map(tokenize_function, batched=True)
non_english_val_tokenized = non_english_val.map(tokenize_function, batched=True)

# Tokenize each non-English test set separately
# fr_test_tokenized = fr_test.map(tokenize_function, batched=True)
# de_test_tokenized = de_test.map(tokenize_function, batched=True)
# es_test_tokenized = es_test.map(tokenize_function, batched=True)


In [None]:
del en_train, en_val, en_test, fr_train, fr_val, fr_test, es_train, es_val, es_test

---

## 3. Machine learning model

### 3.1. Model training

In this section, we fully fine-tune `bert-base-multilingual-cased` on the English corpus. The model is trained for 1 epoch with a batch size of 32, and the best checkpoint is selected based on validation accuracy. After training, the model is evaluated on the English validation set using accuracy as the evaluation metric.

In [None]:
#  code to train the transformer based model on the training set and evaluate the performance on the validation set here
#Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

#Full Fine-Tuning on English
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
)

# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")

# Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    accuracy = accuracy_score(labels, preds)
    return {"accuracy": accuracy}

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=en_train_tokenized,
    eval_dataset=en_val_tokenized,
    compute_metrics=compute_metrics,
)

trainer.train()#resume_from_checkpoint="/kaggle/working/results/checkpoint-4500")



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8406 | 0.820072 | 0.65 |


TrainOutput(global_step=3125, training_loss=0.9139765771484375, metrics={'train_runtime': 2463.7039, 'train_samples_per_second': 40.589, 'train_steps_per_second': 1.268, 'total_flos': 1.31559071232e+16, 'train_loss': 0.9139765771484375, 'epoch': 1.0})

### 3.2 Hyperparameter optimization

In [None]:
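# Hyperparameter optimization
# Note: this cell reuses the training arguments from Section 3.1 and the
# already fine-tuned `model`, so it amounts to one additional epoch of training.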

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
)

# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")

# Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    accuracy = accuracy_score(labels, preds)
    return {"accuracy": accuracy}

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=en_train_tokenized,
    eval_dataset=en_val_tokenized,
    compute_metrics=compute_metrics,
)

trainer.train()#resume_from_checkpoint=True

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.7839 | 0.817608 | 0.662 |


TrainOutput(global_step=3125, training_loss=0.7339735400390625, metrics={'train_runtime': 2463.5479, 'train_samples_per_second': 40.592, 'train_steps_per_second': 1.268, 'total_flos': 1.31559071232e+16, 'train_loss': 0.7339735400390625, 'epoch': 1.0})
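
The cell above reuses the arguments from Section 3.1. A simple grid search over a few values could be organised as in the following sketch. Everything here is an assumption for illustration: the helper `grid_search`, the candidate values, and the placeholder objective `dummy_train_and_eval`, which stands in for a function that would build `TrainingArguments` and a `Trainer` from the given parameters, train, and return the validation accuracy.

```python
from itertools import product


def grid_search(train_and_eval, grid):
    """Try every combination in `grid`; return the best parameters, best score and all results."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_eval(**params)  # expected to return validation accuracy
        results.append((params, score))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


# Placeholder objective so that the sketch runs without a GPU.
# It does NOT reflect real model behaviour.
def dummy_train_and_eval(learning_rate, weight_decay):
    return 0.6 - abs(learning_rate - 3e-5) * 1000 - weight_decay * 0.1


# Illustrative candidate values, not tuned recommendations
search_space = {"learning_rate": [1e-5, 3e-5, 5e-5], "weight_decay": [0.0, 0.01]}
best_params, best_score, all_results = grid_search(dummy_train_and_eval, search_space)
print(best_params)
```

Each real trial would cost roughly as much as one training run above, so in practice the grid has to stay small or be run on a further downsampled training set.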

In [None]:
trainer.save_model("./baseline_model")
trainer.save_state()

### 3.3. Evaluation on test set

In [None]:
# code to evaluate the final model on the test set here
trainer.evaluate(eval_dataset=en_test_tokenized)

{'eval_loss': 0.8470733761787415,
 'eval_accuracy': 0.636,
 'eval_runtime': 3.76,
 'eval_samples_per_second': 132.979,
 'eval_steps_per_second': 4.255,
 'epoch': 1.0}
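
The evaluation above reports accuracy only. Because the MARC paper recommends MAE for this task, a metric function that returns both values could look like the sketch below. The name `compute_metrics_with_mae` is hypothetical, and the toy logits exist only to make the example runnable; the function takes the same `(logits, labels)` pair as `compute_metrics` above.

```python
import numpy as np


def compute_metrics_with_mae(eval_pred):
    """Return accuracy and mean absolute error for 5-class rating predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    labels = np.asarray(labels)
    return {
        "accuracy": float(np.mean(preds == labels)),
        "mae": float(np.mean(np.abs(preds - labels))),
    }


# Toy check: three reviews, logits over the five labels 0..4
toy_logits = np.array([
    [0.1, 0.2, 0.3, 0.9, 0.1],  # predicted 3, true 3 (exact)
    [0.8, 0.1, 0.0, 0.0, 0.1],  # predicted 0, true 1 (off by one)
    [0.0, 0.1, 0.1, 0.2, 0.7],  # predicted 4, true 0 (off by four)
])
toy_labels = np.array([3, 1, 0])

# accuracy = 1/3, MAE = (0 + 1 + 4) / 3 = 5/3
print(compute_metrics_with_mae((toy_logits, toy_labels)))
```

Unlike accuracy, MAE separates a near miss (off by one star) from a prediction at the opposite end of the scale.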

### 3.4. Cross-lingual experiments

We fully fine-tune `bert-base-multilingual-cased` on the non-English data (French, German, Spanish) for 2 epochs with a batch size of 32, selecting the best checkpoint based on accuracy on the non-English validation set. After training, the model is evaluated on the English test set. Since the model sees no English data during fine-tuning, this measures zero-shot cross-lingual transfer.

In [None]:
# code to train and evaluate the cross-lingual model

# Full fine-tuning on non-English data (French, German, Spanish)

model_multi = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

training_args = TrainingArguments(
    output_dir="./results_multi",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
)

# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")

# Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    accuracy = accuracy_score(labels, preds)
    return {"accuracy": accuracy}


# Set up the Trainer
trainer_non = Trainer(
    model=model_multi,
    args=training_args,
    train_dataset=non_english_train_tokenized,
    eval_dataset=non_english_val_tokenized,
    compute_metrics=compute_metrics,
)

trainer_non.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.9459 | 0.908367 | 0.6 |
| 2 | 0.8245 | 0.89185 | 0.613333 |


TrainOutput(global_step=9376, training_loss=0.9326762762492834, metrics={'train_runtime': 7381.7222, 'train_samples_per_second': 40.641, 'train_steps_per_second': 1.27, 'total_flos': 3.94677213696e+16, 'train_loss': 0.9326762762492834, 'epoch': 2.0})

In [None]:
trainer_non.evaluate(eval_dataset=en_test_tokenized)

{'eval_loss': 1.0888502597808838,
 'eval_accuracy': 0.58,
 'eval_runtime': 3.8244,
 'eval_samples_per_second': 130.739,
 'eval_steps_per_second': 4.184,
 'epoch': 2.0}

In [None]:
trainer_non.save_model("./model_multi")
trainer_non.save_state()

---

## 4. Results and summary

### 4.1 Corpus insights

<!-- Briefly discuss what was learned about the corpus and its annotation -->

The MARC corpus contains Amazon reviews in multiple languages including English, French, German and Spanish labeled with star ratings, ranging from 1 to 5 stars. The dataset is carefully balanced to ensure each star rating is equally represented, with 20% of reviews per rating. The annotations in the corpus are the star ratings, which represent the target variable for text classification tasks. The corpus is designed for multilingual tasks, providing a resource for training and evaluating models on a variety of languages.

### 4.2 Results

<!-- Briefly summarize  results -->
The results show that fine-tuning the `bert-base-multilingual-cased` model on English data achieved an accuracy of 63.6% on the English test set. When fine-tuned on non-English data (French, German, Spanish), the model evaluated on the English test set using zero-shot cross-lingual transfer achieved an accuracy of 58%.

### 4.3 Relation to random baseline / expected performance / state of the art

<!-- Compare obtained results with the random and state-of-the-art performance -->
Compared to the random baseline, which would yield an accuracy of around **20%** for a five-class classification task, the fine-tuned `bert-base-multilingual-cased` model achieved **63.6%** accuracy when trained on English data, a substantial improvement. For the zero-shot cross-lingual transfer model, fine-tuned on non-English data, the accuracy on the English test set was **58%**, also significantly better than the random baseline.

Compared with the mBERT baselines reported in the MARC paper, which average **59.2%** accuracy in fully supervised fine-grained classification and **44.0%** in zero-shot cross-lingual transfer, our results are higher in both settings: **63.6%** for the model fine-tuned and tested on English, and **58%** for the model fine-tuned on French, German and Spanish and tested on English. The comparison is only indicative, however. The paper's figures are averages, and its zero-shot setting trains on a single source language, whereas we evaluate on a 500-example English test subset and transfer from three source languages at once. Our models were also trained on downsampled data for at most two epochs, so there is likely still room for improvement through further optimization.

---

## 5. Zero-shot with a generative language model

### 5.1. and 5.2. Model and Data selection

<!-- Briefly describe which model was used and why. Also, describe how the test data was downsampled, include relevant code. -->
We selected `mistralai/Mistral-7B-Instruct-v0.2` for this classification task because it is instruction-tuned and therefore well suited to following natural-language instructions. With 7 billion parameters, it offers a reasonable trade-off between computational cost and performance. Because it follows instructions without task-specific fine-tuning, it can be applied in a zero-shot setting to categorize reviews into the predefined sentiment categories.

For testing, the data was downsampled to limit computational cost while keeping a balanced representation of each class. For each class, if at least 100 samples were available, we randomly sampled 100 reviews without replacement; otherwise we kept all reviews of that class. This was implemented with the `groupby` and `apply` functions in pandas. After downsampling, the index was reset and the data shuffled to remove any ordering bias. The result is a small, balanced test set that reduces computational load while still covering all sentiment categories.

### 5.3. Prompt design

<!-- Include  final prompt here. Also, explain here all prompt engineering insights was learned while completing the tasks or project. -->
#### Final Prompt

Here is the final prompt:

```
I want you to perform a classification task for user reviews of products they bought.
You have to classify them on a scale of 0 to 4 using the integers 0, 1, 2, 3, 4.
Classify the given review into one of the following categories based on sentiment:
- 0: Very negative (the user is extremely dissatisfied with the product).
- 1: Negative (the user is dissatisfied with the product).
- 2: Neutral (the user feels indifferent about the product).
- 3: Positive (the user is satisfied with the product).
- 4: Very positive (the user is extremely satisfied with the product).

Input Format:
A review as plain text written between these symbols *** and ***.

Output Format:
Return only one of the following numbers: 0, 1, 2, 3, or 4.
No additional text. Don't justify your answer. Don't comment on your answer.
Your answer should not exceed one word, and this word should be an integer.
Just send the integer representing the category you decide after analyzing the review.

Here is the review to classify:
Review: ***{review}***
```

#### Prompt Engineering Insights

While completing this project, several prompt engineering insights emerged:

1. **Clear Instructions**: The model needs clear and structured instructions. The prompt outlines the specific task (classifying reviews on a sentiment scale) and defines each category in detail. This clarity helps the model understand the task and reduces the chances of ambiguous responses.

2. **Concise Output Formatting**: The prompt ensures that the output is constrained to a single integer, avoiding unnecessary explanation or additional text. This is critical for maintaining simplicity and consistency in the model’s output, particularly for downstream processing.

3. **Explicit Constraints**: By specifying "No additional text" and "Return only one of the following numbers," we limit the possible range of responses and guide the model toward the desired format. This was crucial for controlling the output in a structured manner.

4. **Controlled Generation Parameters**: Using parameters like `temperature`, `top_k`, and `top_p` allows for controlling the creativity and diversity of the model's output. In this case, a lower temperature ensures more deterministic outputs, which is important for a classification task where consistency is key.

5. **Review as Input**: The formatting of the review as `***review***` within the prompt emphasizes that the review text is the focal point for the model, ensuring that the model focuses on classifying the content correctly based on the provided sentiment categories.

6. **Iterative Refinement**: Testing the prompt with various types of reviews revealed that minor adjustments (like temperature or response formatting) were needed to improve the accuracy and consistency of the output. It highlighted the importance of testing and fine-tuning the prompt structure for optimal results.

These prompt engineering techniques were critical in obtaining reliable and consistent classifications from the generative model.
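
To illustrate why a constrained output format simplifies downstream processing, the sketch below fills a prompt template with a review and parses a model reply into a label. It is a minimal example under stated assumptions: `PROMPT_TEMPLATE` is an abbreviated stand-in for the final prompt above, and `build_prompt` and `parse_label` are hypothetical helper names, not the functions used in this notebook.

```python
import re

# Abbreviated stand-in for the full prompt shown above
PROMPT_TEMPLATE = (
    "Classify the review into one of the integers 0, 1, 2, 3, 4.\n"
    "Return only the integer.\n"
    "Review: ***{review}***"
)


def build_prompt(review: str) -> str:
    """Insert the review text between the *** delimiters."""
    return PROMPT_TEMPLATE.format(review=review.strip())


def parse_label(reply: str) -> int:
    """Return the first standalone digit 0-4 in the reply, or -1 if there is none."""
    match = re.search(r"\b[0-4]\b", reply)
    return int(match.group()) if match else -1


print(build_prompt("  Great product, works as described.  "))
print(parse_label("3"))                       # 3
print(parse_label("I cannot classify this"))  # -1
```

Returning -1 for replies without a valid label keeps malformed generations visible, so they can be counted separately instead of being silently mapped to a class.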

### 5.4. Generate

In [None]:
#  code to run the generative model and extract predictions from the model output.

# Load Mistral 7B

# Simplest approach, though not the most secure:
# replace my_hugging_face_token with your own token and
# make sure you have been granted access to the Mistral 7B model.
# See https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
os.environ['HF_AUTH_TOKEN'] = 'my_hugging_face_token'

#Model's name
MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.2'

#Setting for quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load Mistral's tokenizer (this replaces the mBERT tokenizer loaded earlier)
tokenizer = AutoTokenizer.from_pretrained(
      MODEL_NAME,
      padding_side='left',
      # token=userdata.get('hugging_face_secret'),
      token=os.getenv('HF_AUTH_TOKEN'),  # `use_auth_token` is deprecated in recent transformers releases
    )

# Load Mistral with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
      MODEL_NAME,
      # token=userdata.get('hugging_face_secret'),
      token=os.getenv('HF_AUTH_TOKEN'),
      device_map="auto",
      quantization_config=quantization_config
    )

model.config.pad_token_id = model.config.eos_token_id


#Instantiate Text Generation Pipeline
pipe = pipeline("text-generation",
    model=model,
    tokenizer=tokenizer
)


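#Function to extract the predicted rating from the model output
def get_rate(output):
    """output: a string
    returns: the first digit found in the string as an int
    (expected to be one of 0, 1, 2, 3, 4, as requested in the prompt)
    -1 means the string does not contain any digit
    """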
    match = re.search(r"\d", output)
    if match:
        number = int(match.group())
        return number
    return -1

#Test the function
# get_rate("The product code is 1")

def zero_shot_classifier(review):

    conversation = [
        {
            "role": "user",
            "content": (
                 f"""I want you to perform a classification task for user reviews of products they bought.
                And you have to classify them on a scale of 0 to 4 using the integers 0, 1, 2, 3, 4.
                Classify the given review into one of the following categories based on sentiment:\n
                - 0: Very negative (the user is extremely dissatisfied with the product).\n
                - 1: Negative (the user is dissatisfied with the product).\n
                - 2: Neutral (the user feels indifferent about the product).\n
                - 3: Positive (the user is satisfied with the product).\n
                - 4: Very positive (the user is extremely satisfied with the product).\n\n
                Input Format:\n
                A review as plain text written between the symbols *** and ***.\n\n
                Output Format:\n
                Return only one of the following numbers: 0, 1, 2, 3, or 4.\n
                No additional text. Don't justify your answer. Don't comment your answer.\n
                Your answer should not exceed one word and this word should be an integer.\n
                Just send the integer representing the category you decide after analyzing the review.\n\n
                Here is the review to classify:\nReview: ***{review}***"""
            ),
        }
    ]

    temperature = 0.7
    top_k = 50
    top_p = 0.9
    # output = pipe(conversation, max_new_tokens=50)
    output = pipe(
        conversation,
        max_new_tokens=5,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
    )

    out = output[0]["generated_text"][-1]["content"]
    return get_rate(out)

# Testing our function
# for i in [1, 5, 19, 48, 100, 182, 200, 401, 453]:
#     print(i,":",zero_shot_classifier(en_test[i]["text"]))


#Inference
en_test_copy = en_test_copy.map(lambda example: {"predicted_label": zero_shot_classifier(example["text"])})

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.

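The `get_rate` helper above returns the first digit it finds, whatever its value. A stricter variant could look like the sketch below, under the assumption that only a standalone digit from 0 to 4 should count as a valid answer. `parse_label` is a hypothetical name, and this variant was not used to produce the results in this notebook.

```python
import re

# Match a single digit 0-4 that is not part of a longer number.
_LABEL_PATTERN = re.compile(r"(?<!\d)[0-4](?!\d)")


def parse_label(output: str) -> int:
    """Return the first standalone label 0-4 in `output`, or -1 if none is found."""
    match = _LABEL_PATTERN.search(output)
    return int(match.group()) if match else -1


# A few made-up model outputs to show the behaviour.
for text in ["3", " 4.", "Rating: 5", "10", "no digits here"]:
    print(repr(text), "->", parse_label(text))
```

Returning -1 for out-of-range answers keeps malformed generations visible, so they can be counted separately from ordinary misclassifications.
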
In [None]:
#Visualizing Mistral's prediction
pd.DataFrame(en_test_copy).tail(10)

| | id | text | label | label_text | predicted_label |
|---|---|---|---|---|---|
| 490 | en_0528033 | Love this stool\n\nI like the size and the gre... | 4 | 4 | 3 |
| 491 | en_0458455 | This is an amazing product! It smells amazing\... | 4 | 4 | 4 |
| 492 | en_0227165 | Best cold brew!\n\nAmazing flavor with just th... | 4 | 4 | 4 |
| 493 | en_0639833 | Great\n\nThe book is very good, the transport ... | 4 | 4 | 3 |
| 494 | en_0249421 | A MUST BUY\n\nA must buy for any Home | 4 | 4 | 4 |
| 495 | en_0344435 | Five Stars\n\nSuper nice box!!! Got it next da... | 4 | 4 | 4 |
| 496 | en_0989620 | 5 Stars\n\nLove it! As with all Tree to Tub Pr... | 4 | 4 | 4 |
| 497 | en_0926260 | Very practical and convenient\n\nCome with pre... | 4 | 4 | 3 |
| 498 | en_0065576 | Adorable. Really.\n\nThese are really cute. I ... | 4 | 4 | 4 |
| 499 | en_0858705 | 👍🏼\n\nLove this book it came in great conditio... | 4 | 4 | 4 |


In [None]:
#Function to calculate accuracy
def calculate_accuracy(predictions, true_labels):
    """Return the fraction of predictions that exactly match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

#list of predicted and true labels
predicted_labels = en_test_copy["predicted_label"]
true_labels = en_test_copy["label"]

#Calculate the accuracy
accuracy = calculate_accuracy(predicted_labels, true_labels)

print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 47.60%
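
Exact-match accuracy hides two things that matter here: how often the model returned no usable label (`-1`), and how often it missed by only one class, as in the rows above where a true label of 4 was predicted as 3. The following sketch computes both under stated assumptions: `summarize_predictions` is a hypothetical helper, and the toy lists are made-up data, not results from this notebook.

```python
from collections import Counter


def summarize_predictions(predictions, true_labels, valid_labels=range(5)):
    """Return accuracy, within-one accuracy, invalid rate, and confusion counts."""
    valid = set(valid_labels)
    pairs = list(zip(predictions, true_labels))
    n = len(pairs)
    exact = sum(1 for p, t in pairs if p == t)
    # Count a prediction as "within one" if it is valid and at most one class away.
    within_one = sum(1 for p, t in pairs if p in valid and abs(p - t) <= 1)
    invalid = sum(1 for p, _ in pairs if p not in valid)
    # Keys are (true_label, predicted_label) pairs.
    confusion = Counter((t, p) for p, t in pairs)
    return {
        "accuracy": exact / n,
        "within_one": within_one / n,
        "invalid_rate": invalid / n,
        "confusion": confusion,
    }


# Made-up example data for illustration only.
toy_pred = [4, 3, 4, -1, 0, 2]
toy_true = [4, 4, 4, 3, 0, 1]

summary = summarize_predictions(toy_pred, toy_true)
print(summary)
```

On the real data, the same function could be called with `en_test_copy["predicted_label"]` and `en_test_copy["label"]`.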


### 5.5. Evaluation and results

<!-- Briefly summarize results -->
The fine-tuned `bert-base-multilingual-cased` model achieved **63.6%** accuracy on the English test set, whereas zero-shot prompting of Mistral 7B reached **47.60%** on the 500 English test reviews evaluated above. Both results are well above the roughly 20% expected from random guessing if the five classes are balanced, but the fine-tuned model outperforms Mistral 7B by 16 percentage points, highlighting the advantage of task-specific fine-tuning over zero-shot prompting. This is consistent with previous findings in which fine-tuned models such as mBERT (59.2% in a fully supervised setting) were significantly more accurate than zero-shot cross-lingual models. Thus, although Mistral 7B is a usable zero-shot classifier without any task-specific training, it still trails fine-tuned models like `bert-base-multilingual-cased`.
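
Because the Mistral 7B accuracy was measured on only 500 reviews, it carries some sampling uncertainty. As a rough sketch, assuming independent examples and a normal approximation to the binomial, a 95% interval can be estimated as follows. `accuracy_interval` is a hypothetical helper, and the chance level assumes five balanced classes.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n examples."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Values taken from this notebook: 47.60% accuracy on 500 reviews.
low, high = accuracy_interval(0.476, 500)
print(f"Mistral 7B zero-shot, approximate 95% interval: {low:.3f} to {high:.3f}")

# Chance level for five classes, assuming they are balanced.
chance = 1 / 5
print(f"Chance baseline: {chance:.3f}")
```

Under these assumptions the interval stays above the chance baseline and below the 63.6% reported for the fine-tuned model, so the ranking of the two approaches does not depend on sampling noise in the 500-review subset.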