# Deep learning in Human Language Technology Project

- Student(s) Name(s): Kimi Zaknoun
- Date: 5.11.2024
- Chosen Corpus: amazon_reviews_multi
- Contributions (if group project): N/A

### Corpus information

- Description of the chosen corpus: Amazon Reviews Multi-lingual dataset
- Paper(s) and other published materials related to the corpus:
- Random baseline performance and expected performance for recent machine learned models:
  - Random baseline: 0.2 accuracy (five roughly balanced classes)
  - Expected performance: about 0.93 accuracy for SOTA models

---

## 1. Setup

In [None]:
!pip install transformers[torch]
!pip install sentencepiece
!pip install datasets
!pip install torch
!pip install accelerate -U
!pip install optuna
!pip install -U sentence-transformers


In [None]:
# Standard library
from collections import Counter

# Third-party libraries
import numpy as np
import optuna
import pandas as pd
import torch
from datasets import concatenate_datasets, load_dataset, load_metric
from optuna.pruners import MedianPruner
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    XLMRobertaForSequenceClassification,
)



---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
# Load the dataset from the new repository path
dataset = load_dataset('mteb/amazon_reviews_multi')

# Test
print(dataset['train'][0])



{'id': 'de_0203609', 'text': 'Leider nach 1 Jahr kaputt\n\nArmband ist leider nach 1 Jahr kaputt gegangen', 'label': 0, 'label_text': '0'}


### 2.2. Sampling and preprocessing

In [None]:
# Your code for any necessary sampling and preprocessing here

# Let's make EN, DE and FR datasets
english_dataset = dataset.filter(lambda example: example['id'].startswith('en_'))
de_dataset = dataset.filter(lambda example: example['id'].startswith('de_'))
fr_dataset = dataset.filter(lambda example: example['id'].startswith('fr_'))

# Downsampling
train_size = 20000  # Sampling 10%
valid_test_size = 500  # Sampling 10%

# Shuffle the datasets with a seed
english_shuffled_train = english_dataset['train'].shuffle(seed=42)
english_shuffled_validation = english_dataset['validation'].shuffle(seed=42)
english_shuffled_test = english_dataset['test'].shuffle(seed=42)

de_shuffled_train = de_dataset['train'].shuffle(seed=42)
de_shuffled_validation = de_dataset['validation'].shuffle(seed=42)

fr_shuffled_train = fr_dataset['train'].shuffle(seed=42)
fr_shuffled_validation = fr_dataset['validation'].shuffle(seed=42)

# Select the first train_size and valid_test_size examples
english_sampled_train = english_shuffled_train.select(range(train_size))
english_sampled_validation = english_shuffled_validation.select(range(valid_test_size))
english_sampled_test = english_shuffled_test.select(range(valid_test_size))

de_sampled_train = de_shuffled_train.select(range(train_size))
de_sampled_validation = de_shuffled_validation.select(range(valid_test_size))

fr_sampled_train = fr_shuffled_train.select(range(train_size))
fr_sampled_validation = fr_shuffled_validation.select(range(valid_test_size))



In [None]:
# Tokenization
model_name = 'xlm-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(example):
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=128)

# Apply the function to the train, validation, and test sets EN
english_tokenized_train = english_sampled_train.map(tokenize_function, batched=True)
english_tokenized_validation = english_sampled_validation.map(tokenize_function, batched=True)
english_tokenized_test = english_sampled_test.map(tokenize_function, batched=True)

# Apply the function to the train and validation sets DE (no German test set is used)
de_tokenized_train = de_sampled_train.map(tokenize_function, batched=True)
de_tokenized_validation = de_sampled_validation.map(tokenize_function, batched=True)

# Apply the function to the train and validation sets FR (no French test set is used)
fr_tokenized_train = fr_sampled_train.map(tokenize_function, batched=True)
fr_tokenized_validation = fr_sampled_validation.map(tokenize_function, batched=True)
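
To make the effect of `truncation=True, padding='max_length', max_length=128` concrete, the following is a minimal pure-Python sketch of what happens to a sequence of token ids. It does not call the real tokenizer: `pad_or_truncate` is a hypothetical helper, the pad id of 1 is an assumption about XLM-R's `<pad>` token, and the real tokenizer also takes care to keep the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; pad_id=1 is an assumption.
    """
    ids = list(token_ids)[:max_length]            # truncation: drop the tail
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad            # padding: fill up on the right
    attention_mask = [1] * n_real + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([0, 581, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to 128 tokens keeps the tensors uniform but spends compute on padding for short reviews; padding per batch (for example with `DataCollatorWithPadding` from `transformers`) is a possible alternative that was not used here.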



In [None]:
# Describe the corpus statistics (especially label distribution) after sampling in order to demonstrate that the sampling is done reasonably
# Convert the datasets to pandas dataframes
en_train_df = pd.DataFrame(english_sampled_train)
en_validation_df = pd.DataFrame(english_sampled_validation)
en_test_df = pd.DataFrame(english_sampled_test)

de_train_df = pd.DataFrame(de_sampled_train)
de_validation_df = pd.DataFrame(de_sampled_validation)

fr_train_df = pd.DataFrame(fr_sampled_train)
fr_validation_df = pd.DataFrame(fr_sampled_validation)

# Calculate label distributions
en_train_label_distribution = Counter(en_train_df['label'])
en_validation_label_distribution = Counter(en_validation_df['label'])
en_test_label_distribution = Counter(en_test_df['label'])

de_train_label_distribution = Counter(de_train_df['label'])
de_validation_label_distribution = Counter(de_validation_df['label'])

fr_train_label_distribution = Counter(fr_train_df['label'])
fr_validation_label_distribution = Counter(fr_validation_df['label'])

# Print the label distributions
print("EN Training Label Distribution:")
for label, count in en_train_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(en_train_df) * 100):.2f}%)")

print("\nEN Validation Label Distribution:")
for label, count in en_validation_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(en_validation_df) * 100):.2f}%)")

print("\nEN Test Label Distribution:")
for label, count in en_test_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(en_test_df) * 100):.2f}%)")

print("\nDE Training Label Distribution:")
for label, count in de_train_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(de_train_df) * 100):.2f}%)")

print("\nDE Validation Label Distribution:")
for label, count in de_validation_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(de_validation_df) * 100):.2f}%)")

print("\nFR Training Label Distribution:")
for label, count in fr_train_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(fr_train_df) * 100):.2f}%)")

print("\nFR Validation Label Distribution:")
for label, count in fr_validation_label_distribution.items():
    print(f"Label {label}: {count} ({(count / len(fr_validation_df) * 100):.2f}%)")

EN Training Label Distribution:
Label 2: 4032 (20.16%)
Label 1: 3898 (19.49%)
Label 4: 4065 (20.32%)
Label 3: 4045 (20.23%)
Label 0: 3960 (19.80%)

EN Validation Label Distribution:
Label 1: 100 (20.00%)
Label 3: 97 (19.40%)
Label 0: 98 (19.60%)
Label 2: 102 (20.40%)
Label 4: 103 (20.60%)

EN Test Label Distribution:
Label 1: 100 (20.00%)
Label 3: 97 (19.40%)
Label 0: 98 (19.60%)
Label 2: 102 (20.40%)
Label 4: 103 (20.60%)

DE Training Label Distribution:
Label 2: 4032 (20.16%)
Label 1: 3898 (19.49%)
Label 4: 4065 (20.32%)
Label 3: 4045 (20.23%)
Label 0: 3960 (19.80%)

DE Validation Label Distribution:
Label 1: 100 (20.00%)
Label 3: 97 (19.40%)
Label 0: 98 (19.60%)
Label 2: 102 (20.40%)
Label 4: 103 (20.60%)

FR Training Label Distribution:
Label 2: 4032 (20.16%)
Label 1: 3898 (19.49%)
Label 4: 4065 (20.32%)
Label 3: 4045 (20.23%)
Label 0: 3960 (19.80%)

FR Validation Label Distribution:
Label 1: 100 (20.00%)
Label 3: 97 (19.40%)
Label 0: 98 (19.60%)
Label 2: 102 (20.40%)
Label 4: 103 (20.60%)
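
The seven near-identical printing loops above could be condensed into one helper. The sketch below is one possible way to do it; `label_distribution` is a hypothetical name, and the toy list only stands in for a column such as `en_validation_df['label']`. Unlike the loops above, it sorts the labels so the output is easier to compare across splits.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to (count, percentage), sorted by label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, 100 * n / total) for label, n in sorted(counts.items())}


# Toy labels standing in for a real label column.
toy_labels = [0, 1, 1, 2, 3, 3, 3, 4]
dist = label_distribution(toy_labels)
for label, (n, pct) in dist.items():
    print(f"Label {label}: {n} ({pct:.2f}%)")
```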

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the transformer based model on the training set and evaluate the performance on the validation set here

# Enable GPU acceleration if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load pre-trained model for classification
model = XLMRobertaForSequenceClassification.from_pretrained(model_name, num_labels=5).to(device)

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
)

# Define function to compute accuracy
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy_metric = load_metric("accuracy")
    accuracy = accuracy_metric.compute(predictions=preds, references=labels)
    return {
        'accuracy': accuracy["accuracy"],
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=english_tokenized_train,
    eval_dataset=english_tokenized_validation,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

# Print results
print(results)


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.9102 | 0.920199 | 0.608 |
| 2 | 0.8142 | 0.865299 | 0.618 |



{'eval_loss': 0.8652985095977783, 'eval_accuracy': 0.618, 'eval_runtime': 4.3914, 'eval_samples_per_second': 113.859, 'eval_steps_per_second': 7.287, 'epoch': 2.0}
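
To see what `warmup_steps=500` means for this run, here is a rough sketch of a linear warmup followed by linear decay, which is my understanding of the `Trainer` default schedule. The function `linear_warmup_lr` is a hypothetical re-implementation for illustration, and the step count assumes a single device with no gradient accumulation.

```python
import math


def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=2500):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 20,000 training examples, batch size 16, 2 epochs (one device, no accumulation).
steps_per_epoch = math.ceil(20000 / 16)
total_steps = 2 * steps_per_epoch
for step in (0, 250, 500, 1500, total_steps):
    print(step, linear_warmup_lr(step, total_steps=total_steps))
```

Under these assumptions the run has 2,500 optimizer steps, so a 500-step warmup covers the first 20% of training and the peak learning rate is only reached partway through the first epoch.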


### 3.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

def model_init():
    return XLMRobertaForSequenceClassification.from_pretrained(model_name, num_labels=5)

def objective(trial):
    # Define the hyperparameters to be tuned by Optuna
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 3e-5, log=True)

    # Load the model with the new hyperparameters
    model = model_init()

    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        learning_rate=learning_rate,
        num_train_epochs=2,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        warmup_steps=200,
        logging_dir='./logs',
        logging_steps=100,
        evaluation_strategy="epoch",
        save_strategy="no",
        report_to="none"
    )

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=english_tokenized_train,
        eval_dataset=english_tokenized_validation,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    # Evaluate the model
    eval_result = trainer.evaluate()

    # Return the evaluation metric for optimization
    return eval_result["eval_accuracy"]

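# Initialize the median pruner.
# Note: objective() never calls trial.report(), so the pruner has no intermediate
# values to act on and every trial runs to completion.
pruner = MedianPruner()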

# Maximize the objective metric
study = optuna.create_study(direction="maximize", pruner=pruner)
study.optimize(objective, n_trials=3)

# Get the best hyperparameters
best_params = study.best_trial.params
print("Best trial:", best_params)

# Information about the best trial
best_trial = study.best_trial
print(f"Best trial number: {best_trial.number}")
print(f"Value of the best trial (accuracy): {best_trial.value}")

# Detailed hyperparameters of the best trial
for key, value in best_trial.params.items():
    print(f"{key}: {value}")


[I 2023-11-11 20:08:52,555] A new study created in memory with name: no-name-4d735ac9-3541-4a94-91c2-a7568deb9c41


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.9016 | 0.883005 | 0.64 |
| 2 | 0.8417 | 0.871751 | 0.632 |


[I 2023-11-11 20:26:01,606] Trial 0 finished with value: 0.632 and parameters: {'learning_rate': 1.1180675639218363e-05}. Best is trial 0 with value: 0.632.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8797 | 0.89112 | 0.632 |
| 2 | 0.792 | 0.866063 | 0.64 |


[I 2023-11-11 20:43:05,644] Trial 1 finished with value: 0.64 and parameters: {'learning_rate': 1.7496969845910443e-05}. Best is trial 1 with value: 0.64.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8889 | 0.891855 | 0.644 |
| 2 | 0.8242 | 0.869989 | 0.638 |


[I 2023-11-11 21:00:08,296] Trial 2 finished with value: 0.638 and parameters: {'learning_rate': 1.276080657483399e-05}. Best is trial 1 with value: 0.64.


Best trial: {'learning_rate': 1.7496969845910443e-05}
Best trial number: 1
Value of the best trial (accuracy): 0.64
learning_rate: 1.7496969845910443e-05
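
The search above draws the learning rate with `log=True`, which treats the range on a logarithmic scale. The sketch below illustrates that idea with the standard library only; `sample_log_uniform` is a hypothetical helper and not Optuna's implementation, and it ignores that Optuna's sampler may adapt its proposals after enough trials.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 3e-5, rng) for _ in range(1000)]

# On a log scale, more than half of the draws fall below the arithmetic midpoint 2e-5.
below_midpoint = sum(s < 2e-5 for s in samples) / len(samples)
print(min(samples), max(samples), below_midpoint)
```

With only three trials, all between roughly 1.1e-5 and 1.7e-5, the study covers a small part of this range, so the "best" learning rate should be read as the best of three draws rather than a tuned optimum.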


### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here

# Initialize the model with the best hyperparameters
model = model_init()

# Update the training arguments with the best hyperparameters
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=best_params["learning_rate"],
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=200,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize the Trainer with the final model and training arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=english_tokenized_train,
    eval_dataset=english_tokenized_validation,
    compute_metrics=compute_metrics
)

# Retrain the model with the best hyperparameters
trainer.train()

# Evaluate the model on the test set
results = trainer.evaluate(english_tokenized_test)

# Print the test set evaluation results
print("Test set results:", results)





| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8891 | 0.887223 | 0.636 |
| 2 | 0.7922 | 0.863183 | 0.632 |


Test set results: {'eval_loss': 0.862299382686615, 'eval_accuracy': 0.666, 'eval_runtime': 4.469, 'eval_samples_per_second': 111.881, 'eval_steps_per_second': 7.16, 'epoch': 2.0}


### 3.4. Multilingual and cross-lingual experiments

In [None]:
# Your code to train and evaluate the multilingual and cross-lingual models

# Combine the English and German training datasets
combined_train_dataset = concatenate_datasets([english_tokenized_train, de_tokenized_train])

# Reuse training_args from section 3.3, which already contains the best learning rate
# found on the English data; the search is not repeated, to save compute.

# Factory that returns a freshly initialized model for the Trainer
def model_init():
    return XLMRobertaForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Initialize the Trainer on the combined data; validation still uses the English set
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=combined_train_dataset,
    eval_dataset=english_tokenized_validation,
    compute_metrics=compute_metrics
)

# Train the model on the combined English + German data
trainer.train()

# Evaluate the final model on the English test set
final_results = trainer.evaluate(english_tokenized_test)
print("Final evaluation on English test set after training on English + German:", final_results)




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8884 | 0.887618 | 0.632 |
| 2 | 0.8005 | 0.8642 | 0.652 |


Final evaluation on English test set after training on English + German: {'eval_loss': 0.821204662322998, 'eval_accuracy': 0.68, 'eval_runtime': 4.4953, 'eval_samples_per_second': 111.227, 'eval_steps_per_second': 7.119, 'epoch': 2.0}


In [None]:
# Cross-lingual part

# Initialize the Trainer with the French training dataset
trainer = Trainer(
    model=model_init(),
    args=training_args,
    train_dataset=fr_tokenized_train,
    eval_dataset=fr_tokenized_validation,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Zero-shot evaluation on the English validation set
results = trainer.evaluate(english_tokenized_validation)

# Print the results
print("Zero-shot evaluation on English validation set after training on French:", results)

# Final zero-shot evaluation on the English test set
final_results = trainer.evaluate(english_tokenized_test)
print("Zero-shot evaluation on English test set after training on French:", final_results)




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.9668 | 0.905343 | 0.624 |
| 2 | 0.8898 | 0.869387 | 0.612 |


Zero-shot evaluation on English validation set after training on French: {'eval_loss': 1.0902432203292847, 'eval_accuracy': 0.532, 'eval_runtime': 4.4761, 'eval_samples_per_second': 111.704, 'eval_steps_per_second': 7.149, 'epoch': 2.0}
Zero-shot evaluation on English test set after training on French: {'eval_loss': 1.0227307081222534, 'eval_accuracy': 0.556, 'eval_runtime': 4.3747, 'eval_samples_per_second': 114.294, 'eval_steps_per_second': 7.315, 'epoch': 2.0}


---

## 4. Results and summary

### 4.1 Corpus insights

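The corpus is a multilingual collection of Amazon product reviews, each with a rating label from 0 to 4. The full download contains 1,200,000 training examples and 30,000 examples each for validation and test; this project uses samples of the English, German and French portions. The dataset is fairly popular in NLP and is used for tasks such as sentiment analysis, multilingual text classification, and machine translation.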

### 4.2 Results

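The model learns reasonably well on the English dataset: after only two epochs it reaches 66.6% accuracy on the English test set (63.2% on the validation set). If it weren't for Colab compute limits, I believe it would get well above 70% accuracy.

Training on the combined English+German data gives 68.0% accuracy on the English test set, compared with 66.6% for English-only training. The difference amounts to 7 of the 500 test reviews, so it is a small gain rather than a clear improvement.

The zero-shot performance on English after training only on French is clearly lower: 55.6% on the test set (53.2% on the validation set), which is still well above the 20% random baseline.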
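
Because each accuracy above is measured on only 500 reviews, it helps to attach a rough uncertainty to it. The sketch below uses a simple normal-approximation interval; `accuracy_interval` is a hypothetical helper, and since all three models are scored on the same test reviews, a paired comparison would be more appropriate for a careful analysis.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Test accuracies reported in sections 3.3 and 3.4, each on 500 English test reviews.
reported = {"EN only": 0.666, "EN + DE": 0.680, "FR -> EN zero-shot": 0.556}
intervals = {name: accuracy_interval(acc, 500) for name, acc in reported.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

Under this approximation the English-only and English+German intervals overlap almost entirely, while the zero-shot interval lies below both.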

### 4.3 Relation to random baseline / expected performance / state of the art

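Both the English-only model (66.6% test accuracy) and the English+German model (68.0%) are far above the 20% random baseline, but well below SOTA models for this dataset, which are reported at around 93% accuracy (source: https://paperswithcode.com/sota/text-classification-on-amazon-reviews-multi).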
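
As a quick check of the 0.2 random baseline, the following sketch derives both a uniform-guess and a majority-class baseline from the English validation label counts printed in section 2.2. It is only arithmetic on numbers already reported in this notebook.

```python
# Validation label counts printed in section 2.2 (500 English reviews).
label_counts = {0: 98, 1: 100, 2: 102, 3: 97, 4: 103}
total = sum(label_counts.values())

# Uniform random guessing is right with probability 1/K, whatever the class balance.
uniform_baseline = 1 / len(label_counts)

# Always predicting the most frequent label.
majority_baseline = max(label_counts.values()) / total

test_accuracy = 0.666  # English-only model, from section 3.3
print(uniform_baseline, majority_baseline, test_accuracy / uniform_baseline)
```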

---

## 5. Bonus Task (optional)

### 5.1. Data selection

I used the same preprocessed data as in the main project, i.e., the English and French training datasets, from which 5 English and 1,000 French reviews were sampled per label (25 and 5,000 reviews in total).

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here
# Initialize the model
model = SentenceTransformer('sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens')

# Select a sample of English and French reviews
sample_english = en_train_df.groupby('label').apply(lambda x: x.sample(n=5)).reset_index(drop=True)
sample_french = fr_train_df.groupby('label').apply(lambda x: x.sample(n=1000)).reset_index(drop=True)

# Calculate embeddings
english_embeddings = model.encode(sample_english['text'].tolist(), show_progress_bar=True)
french_embeddings = model.encode(sample_french['text'].tolist(), show_progress_bar=True)



### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here

for i, emb_en in enumerate(english_embeddings):
    # Calculate similarities with all French reviews
    similarities = cosine_similarity([emb_en], french_embeddings)[0]
    # Find the index of the most similar French review
    most_similar_idx = np.argmax(similarities)
    # Fetch the most similar review and its similarity score
    similar_review = sample_french.iloc[most_similar_idx]['text']
    similarity_score = similarities[most_similar_idx]

    print(f"English review: {sample_english.iloc[i]['text']}")
    print(f"Most similar French review: {similar_review}")
    print(f"Similarity score: {similarity_score}\n")

English review: False advertising

I purchased this item and when I received it I was very upset. My item had scratches all over and it wasn't even a cocktail shaker but a cup with a straw... Very displeased.
Most similar French review: cassé

Déçu,j'ai reçu le produit cassé,mal emballé pour un pot en verre!Résultat j'ai du me débrouiller pour récupérer la bougie et la remettre dans une tasse.
Similarity score: 0.7542393803596497

English review: Terrible

This is the most difficult electric can opener I've ever had. Actually, none of the other can openers I've had over the years have been difficult. This one? I tossed it after a month. Didn't even want to go through the hassle of returning it.
Most similar French review: Nul

Câble de tres mauvaise qualité. Les miens ont duré 1 semaine chacun, il finissent tous par cassé et ne plus chargé le téléphone. Je déconseille ce produit.
Similarity score: 0.691573441028595

English review: Does not work

This does nothing to protect surgical s
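
For reference, the cosine similarity used above can be written directly from its definition. The sketch below is a NumPy version on toy 2-D vectors standing in for sentence embeddings; `cosine_similarity_matrix` is a hypothetical helper and is not meant to replace the scikit-learn function used in this notebook.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy 2-D vectors standing in for sentence embeddings.
queries = np.array([[1.0, 0.0], [1.0, 1.0]])
candidates = np.array([[0.0, 2.0], [3.0, 0.0], [2.0, 2.0]])
sims = cosine_similarity_matrix(queries, candidates)
best = sims.argmax(axis=1)  # index of the closest candidate for each query
print(sims)
print(best)
```

Computing the full similarity matrix in one call and taking `argmax` along each row gives the same best match per query as the per-review loop, without looping in Python.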

### 5.4 Bonus task evaluation

Looking at the similarity scores, it appears the model is fairly effective at finding semantically similar reviews across languages. The scores are above 0.6 in all cases, which indicates a decent level of similarity between the reviews. A cursory reading comparing the reviews gives a similar impression. (I speak French so I did not find it necessary to translate the reviews here.)

However, the matches are less convincing in terms of conceptual (topical) similarity. For example:
English review: Sturdy, good quality bags

These bags were exactly what I was looking for! Perfect size and very good quality.
Most similar French review: Parfait!

Très contente de cet achat, cette valise est très fonctionnelle et très belle! Je vous recommande de l’acheter vu son prix!
Similarity score: 0.8103271722793579

The English review talks about good size and quality, while the French review talks about functionality, appearance, and price.

I would say the model finds sentiment similarities better than conceptual ones.
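
One way to put a number on this impression would be to check how often the most similar French review carries the same rating label as the English query. The sketch below shows the idea on a toy similarity matrix; `nearest_neighbour_label_agreement` is a hypothetical helper, and with the real data the inputs would presumably be `cosine_similarity(english_embeddings, french_embeddings)` and the `label` columns of `sample_english` and `sample_french`. This was not run as part of the project.

```python
import numpy as np


def nearest_neighbour_label_agreement(sims, query_labels, candidate_labels):
    """Fraction of queries whose most similar candidate has the same label."""
    nearest = np.argmax(sims, axis=1)
    matched = np.asarray(candidate_labels)[nearest]
    return float(np.mean(matched == np.asarray(query_labels)))


# Toy similarity matrix: 3 English reviews (rows) x 4 French reviews (columns).
toy_sims = np.array([
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.4, 0.8, 0.2],
    [0.3, 0.7, 0.2, 0.6],
])
toy_en_labels = [0, 4, 2]
toy_fr_labels = [0, 3, 4, 2]
agreement = nearest_neighbour_label_agreement(toy_sims, toy_en_labels, toy_fr_labels)
print(agreement)
```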

In terms of word-level similarity, it is often loanwords that match.