<a href="https://colab.research.google.com/github/maryamteimouri/MultilingualTextClassifier/blob/main/DL_HTL_course_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project

- Student Name: Maryam Teimouri Badeleh Dareh
- Date: 28 November 2023
- Chosen Corpus: amazon_reviews_multi

### Corpus information

- Description of the chosen corpus:
  1. Labels: star rating 1–5 (stored in `label` as 0–4)
  2. Languages: English, German, Spanish, French, Japanese, Chinese
  3. Subset sizes (per language): train 200K, validation 5K, test 5K
  4. Description: Amazon product reviews dataset for multilingual text classification. Each record contains `id`, `label`, `label_text`, and `text`. The corpus is balanced across star ratings, so each rating makes up 20% of the reviews in each language.
- Paper(s) and other published materials related to the corpus:
  1. The Multilingual Amazon Reviews Corpus: https://aclanthology.org/2020.emnlp-main.369.pdf
  2. https://github.com/nlptown/nlp-notebooks/blob/master/Multilingual%20text%20classification%20with%20BERT.ipynb
  3. https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_bert.ipynb
- Random baseline performance and expected performance for recent machine learned models:
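
Since the corpus is balanced across five star ratings, a classifier that guesses uniformly at random, or always predicts the same rating, should land at roughly 20% accuracy. The sketch below checks this on a synthetic balanced label list; the helper name `random_baseline_accuracy` and the sample size of 10,000 are illustrative assumptions, not part of the corpus tooling.

```python
import random


def random_baseline_accuracy(labels, num_classes=5, seed=0):
    """Accuracy of a classifier that guesses each label uniformly at random."""
    rng = random.Random(seed)
    correct = sum(rng.randrange(num_classes) == y for y in labels)
    return correct / len(labels)


# Synthetic, perfectly balanced labels 0..4 (stand-in for the real label column).
labels = [i % 5 for i in range(10_000)]

baseline = random_baseline_accuracy(labels)
# Always predicting class 0 is right exactly as often as class 0 occurs.
constant = sum(y == 0 for y in labels) / len(labels)

print(f"random guess: {baseline:.3f}, constant prediction: {constant:.3f}")
```

Any trained model should be compared against this 0.2 floor rather than against 0.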

---

## 1. Setup

In [None]:
import logging
logging.disable(logging.INFO)

from pprint import PrettyPrinter
pprint = PrettyPrinter(compact=True).pprint

!pip3 install -q datasets transformers evaluate accelerate

import transformers
import torch
import evaluate
import accelerate
from collections import defaultdict


---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
import datasets

dataset = datasets.load_dataset("mteb/amazon_reviews_multi")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


### 2.2. Sampling and preprocessing

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
})


In [None]:
pprint(dataset['train'][210000])

{'id': 'en_0389345',
 'label': 0,
 'label_text': '0',
 'text': 'It does not charge.\n'
         '\n'
         'I got this with 50% charge. Put it under the sun. two days later, it '
         'has 25%. Never able to charge it once. Should have returned it right '
         'away.'}


In [None]:
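# Each language has 210K rows in total: 200K train, 5K validation, 5K test.
# Assumption: every split is concatenated language by language in the order
# de, en, es, fr, ja, zh, so German is the first block and English the second.
# range(n) already stops at n - 1, so no extra "- 1" is needed.

de_train = dataset["train"].select(range(200000))
de_val = dataset["validation"].select(range(5000))
de_test = dataset["test"].select(range(5000))

en_train = dataset["train"].select(range(200000, 400000))
en_val = dataset["validation"].select(range(5000, 10000))
en_test = dataset["test"].select(range(5000, 10000))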
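
Hard-coded offsets are easy to get wrong, because the train split holds 200K rows per language while validation and test hold 5K. A minimal sketch of computing the index range instead, assuming each split is concatenated language by language in the order de, en, es, fr, ja, zh (an assumption about the layout, though consistent with row 210000 of the train split being an English review). `LANGUAGES` and `language_range` are hypothetical names introduced here.

```python
# Assumed concatenation order of the languages inside every split.
LANGUAGES = ["de", "en", "es", "fr", "ja", "zh"]


def language_range(lang, rows_per_language, languages=LANGUAGES):
    """Return the index range one language occupies in a concatenated split."""
    start = languages.index(lang) * rows_per_language
    return range(start, start + rows_per_language)


en_train_idx = language_range("en", 200_000)  # train: 200K rows per language
en_val_idx = language_range("en", 5_000)      # validation/test: 5K rows per language

print(en_train_idx, en_val_idx)
```

The returned range can be passed straight to `.select(...)`. Checking that the `id` of the first and last selected rows starts with the expected language prefix is a cheap way to validate the layout assumption.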

In [None]:
# Downsample each split

train_size = 10000
val_size = 1000
test_size = 1000

de_train=de_train.shuffle()
de_test=de_test.shuffle()
de_val=de_val.shuffle()

en_train=en_train.shuffle()
en_test=en_test.shuffle()
en_val=en_val.shuffle()

In [None]:
de_train=de_train.select(range(train_size))
de_test=de_test.select(range(test_size))
de_val=de_val.select(range(val_size))

en_train=en_train.select(range(train_size))
en_test=en_test.select(range(test_size))
en_val=en_val.select(range(val_size))

In [None]:
en_dataset = datasets.DatasetDict({'train': en_train, 'validation': en_val, 'test': en_test})
de_dataset = datasets.DatasetDict({'train': de_train, 'validation': de_val, 'test': de_test})

merge_dataset = datasets.DatasetDict({'train' : datasets.concatenate_datasets([en_train, de_train]),
                                      'validation' : datasets.concatenate_datasets([en_val, de_val]),
                                      'test' : datasets.concatenate_datasets([en_test, de_test])
                                      })
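# shuffle() returns a new DatasetDict, so the result has to be assigned back
merge_dataset = merge_dataset.shuffle()
merge_dataset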

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2000
    })
})

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the transformer based model on the training set and evaluate the performance on the validation set here

model_name = "bert-base-multilingual-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
# num_labels=5: one output per star rating (labels 0-4)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=512,
        truncation=True,
    )

# Apply the tokenizer to the whole dataset using .map()
en_dataset = en_dataset.map(tokenize)
de_dataset = de_dataset.map(tokenize)
merge_dataset = merge_dataset.map(tokenize)

In [None]:
trainer_args = transformers.TrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=100,
    logging_steps=100,
    learning_rate=0.00001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
)
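
With `max_steps=500` and a per-device batch size of 8, training stops well before one full pass over the data. The arithmetic below reproduces the `'epoch': 0.4` and `'epoch': 0.2` values reported by the training runs, assuming a single device and no gradient accumulation (both assumptions; `epochs_covered` is a hypothetical helper).

```python
def epochs_covered(max_steps, batch_size, train_examples, num_devices=1, grad_accum=1):
    """Fraction of the training set seen after `max_steps` optimizer steps."""
    examples_seen = max_steps * batch_size * num_devices * grad_accum
    return examples_seen / train_examples


single_language_epochs = epochs_covered(500, 8, 10_000)  # English-only or German-only
merged_epochs = epochs_covered(500, 8, 20_000)           # English + German

print(single_language_epochs, merged_epochs)
```

So the merged run sees the same number of examples as the single-language runs (4,000), only drawn from a pool twice as large.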

In [None]:
accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


data_collator = transformers.DataCollatorWithPadding(tokenizer)

early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)
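
To make the metric concrete: `compute_accuracy` takes the index of the largest logit in each row as the predicted class and counts how often it matches the label. The pure-Python sketch below mirrors that logic on made-up logits; the helper names and toy values are illustrative only.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_from_logits(logits, labels):
    """Share of rows whose argmax equals the reference label."""
    predictions = argmax_rows(logits)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Four toy examples with five class scores each (made-up numbers).
toy_logits = [
    [2.0, 0.1, 0.0, -1.0, 0.3],
    [0.0, 0.2, 0.1, 1.5, 0.4],
    [0.1, 0.9, 0.3, 0.2, 0.0],
    [0.5, 0.4, 0.3, 0.2, 0.1],
]
toy_labels = [0, 3, 2, 4]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(argmax_rows(toy_logits), toy_accuracy)
```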

In [None]:
class LogSavingCallback(transformers.TrainerCallback):
    def on_train_begin(self, *args, **kwargs):
        self.logs = defaultdict(list)
        self.training = True

    def on_train_end(self, *args, **kwargs):
        self.training = False

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        if self.training:
            for k, v in logs.items():
                if k != "epoch" or v not in self.logs[k]:
                    self.logs[k].append(v)

training_logs = LogSavingCallback()

**Baseline**: train on English, evaluate on English

In [None]:
en_trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=en_dataset['train'],
    eval_dataset=en_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping, training_logs]
)

en_trainer.train()


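| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100  | 1.2613        | 1.266451        | 0.453    |
| 200  | 1.0408        | 1.159548        | 0.493    |
| 300  | 1.1281        | 1.144526        | 0.504    |
| 400  | 1.1097        | 1.123392        | 0.51     |
| 500  | 1.0791        | 1.104498        | 0.527    |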


TrainOutput(global_step=500, training_loss=1.123799575805664, metrics={'train_runtime': 191.3421, 'train_samples_per_second': 20.905, 'train_steps_per_second': 2.613, 'total_flos': 266633236398480.0, 'train_loss': 1.123799575805664, 'epoch': 0.4})

### 3.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here
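
One simple option for this step is an exhaustive grid search over a few `TrainingArguments` values such as `learning_rate` and `per_device_train_batch_size`. The sketch below shows only the search loop. `grid_search` is a hypothetical helper, and `toy_objective` is a placeholder that returns an artificial score so the example runs on its own. In the notebook it would be replaced by a function that trains a fresh model with the given settings and returns validation accuracy.

```python
from itertools import product


def grid_search(train_and_evaluate, grid):
    """Evaluate every combination in `grid`; return (best_config, best_score, results)."""
    keys = list(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((config, train_and_evaluate(**config)))
    best_config, best_score = max(results, key=lambda item: item[1])
    return best_config, best_score, results


def toy_objective(learning_rate, per_device_train_batch_size):
    # Placeholder score, NOT a real training run: peaks at lr=3e-5, small batches.
    return -abs(learning_rate - 3e-5) - 1e-6 * per_device_train_batch_size


grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [8, 16],
}

best_config, best_score, results = grid_search(toy_objective, grid)
print(best_config, len(results))
```

The selection should be made on the validation set only, so that the test set stays untouched until the final evaluation.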

### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here
en_eval_results = en_trainer.evaluate(en_dataset["test"])

pprint(en_eval_results)

print('Accuracy:', en_eval_results['eval_accuracy'])

{'epoch': 0.4,
 'eval_accuracy': 0.561,
 'eval_loss': 1.0646296739578247,
 'eval_runtime': 12.6825,
 'eval_samples_per_second': 78.849,
 'eval_steps_per_second': 2.523}
Accuracy: 0.561


### 3.4. Multilingual and cross-lingual experiments

**Multilingual**: train on English + German, evaluate on English

In [None]:
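# Load a fresh pretrained model so this run does not continue from the
# weights already fine-tuned in the English baseline.
de_en_model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

de_en_trainer = transformers.Trainer(
    model=de_en_model,
    args=trainer_args,
    train_dataset=merge_dataset['train'],
    eval_dataset=merge_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping, training_logs]
)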

de_en_trainer.train()

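| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100  | 1.2192        | 1.186167        | 0.4645   |
| 200  | 1.1824        | 1.124165        | 0.522    |
| 300  | 1.1352        | 1.090009        | 0.5375   |
| 400  | 1.0982        | 1.090644        | 0.5425   |
| 500  | 1.1291        | 1.081734        | 0.5495   |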


TrainOutput(global_step=500, training_loss=1.1528218536376953, metrics={'train_runtime': 273.3163, 'train_samples_per_second': 14.635, 'train_steps_per_second': 1.829, 'total_flos': 276405608658432.0, 'train_loss': 1.1528218536376953, 'epoch': 0.2})

In [None]:
de_en_eval_results = de_en_trainer.evaluate(en_dataset["test"])

pprint(de_en_eval_results)

print('Accuracy:', de_en_eval_results['eval_accuracy'])

{'epoch': 0.2,
 'eval_accuracy': 0.561,
 'eval_loss': 1.0646296739578247,
 'eval_runtime': 13.1777,
 'eval_samples_per_second': 75.886,
 'eval_steps_per_second': 2.428}
Accuracy: 0.561


**Cross-lingual**: train on German, evaluate on English

In [None]:
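# Load a fresh pretrained model so the German-only run starts from the
# pretrained checkpoint rather than from the earlier fine-tuned weights.
de_model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

de_trainer = transformers.Trainer(
    model=de_model,
    args=trainer_args,
    train_dataset=de_dataset['train'],
    eval_dataset=de_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping, training_logs]
)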

de_trainer.train()

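| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100  | 1.1792        | 1.210481        | 0.49     |
| 200  | 1.147         | 1.116515        | 0.514    |
| 300  | 1.1271        | 1.093189        | 0.526    |
| 400  | 1.1626        | 1.056762        | 0.546    |
| 500  | 1.077         | 1.062168        | 0.53     |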


TrainOutput(global_step=500, training_loss=1.1385887298583985, metrics={'train_runtime': 210.0361, 'train_samples_per_second': 19.044, 'train_steps_per_second': 2.381, 'total_flos': 291588347722800.0, 'train_loss': 1.1385887298583985, 'epoch': 0.4})

In [None]:
de_eval_results = de_trainer.evaluate(en_dataset["test"])

pprint(de_eval_results)

print('Accuracy:', de_eval_results['eval_accuracy'])

{'epoch': 0.4,
 'eval_accuracy': 0.561,
 'eval_loss': 1.0646296739578247,
 'eval_runtime': 13.4133,
 'eval_samples_per_second': 74.553,
 'eval_steps_per_second': 2.386}
Accuracy: 0.561


---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here
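
A common way to turn per-token vectors into a single sentence embedding is masked mean pooling: average the token vectors while skipping padding positions. This is one possible choice, not necessarily the one the task intends. The sketch uses plain Python lists with made-up numbers in place of real model outputs, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors of dimension 2; the last one is padding and must be ignored.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)
```

With a real model, the token vectors would come from the last hidden layer and the mask from the tokenizer output.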

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here
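
Cosine similarity between two embeddings u and v is their dot product divided by the product of their norms, and the best-matching target sentence is the one with the highest score. A minimal sketch on toy vectors follows; the function names and numbers are illustrative assumptions, and a zero vector is treated as having similarity 0 by convention.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(source, targets):
    """Return (index, score) of the target embedding closest to `source`."""
    scores = [cosine_similarity(source, t) for t in targets]
    best = max(range(len(targets)), key=scores.__getitem__)
    return best, scores[best]


source = [1.0, 0.0]
targets = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]  # orthogonal, near-parallel, opposite

best_index, best_score = most_similar(source, targets)
print(best_index, round(best_score, 4))
```

Because cosine similarity ignores vector length, the second target wins even though its norm differs from the source.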

### 5.4 Bonus task evaluation

(Present the evaluation results here)