<a href="https://colab.research.google.com/github/maryamteimouri/MultilingualTextClassifier/blob/main/DL_HTL_course_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project

- Student Name: Maryam Teimouri Badeleh Dareh
- Date: 28 November 2023
- Chosen Corpus: amazon_reviews_multi

### Corpus information

- Description of the chosen corpus:
  1. Labels: star rating 1–5 (stored as integer labels 0–4)
  2. Languages: English, German, Spanish, French, Japanese, Chinese
  3. Subset sizes (per language): train 200K, validation 5K, test 5K
  4. Description: Amazon product reviews dataset for multilingual text classification. Each record contains `id`, `label`, `label_text`, and `text`. The corpus is balanced across star ratings, so each rating accounts for 20% of the reviews in each language.
- Paper(s) and other published materials related to the corpus:
  1. The Multilingual Amazon Reviews Corpus: https://aclanthology.org/2020.emnlp-main.369.pdf
  2. https://github.com/nlptown/nlp-notebooks/blob/master/Multilingual%20text%20classification%20with%20BERT.ipynb
  3. https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_bert.ipynb
- Random baseline performance and expected performance for recent machine-learned models: with five balanced classes, a random baseline reaches about 20% accuracy.
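
Because each star rating makes up 20% of the reviews, the random baseline can be worked out directly. The sketch below is only an illustration: the function names are made up here, and it assumes the class shares given in the corpus description.

```python
from fractions import Fraction


def random_guess_accuracy(class_shares):
    """Expected accuracy when a label is guessed uniformly at random."""
    k = len(class_shares)
    # A uniform guess matches the true class with probability 1/k,
    # whatever the class shares are (they only need to sum to 1).
    return sum(share * Fraction(1, k) for share in class_shares)


def majority_class_accuracy(class_shares):
    """Accuracy of always predicting the most frequent class."""
    return max(class_shares)


# Five star ratings, each 20% of the reviews (from the corpus description).
shares = [Fraction(1, 5)] * 5

print(float(random_guess_accuracy(shares)))    # 0.2
print(float(majority_class_accuracy(shares)))  # 0.2
```

On a balanced corpus both trivial baselines coincide, so any accuracy clearly above 0.2 reflects something the model has learned.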

---

## 1. Setup

In [1]:
!pip3 install -q datasets transformers evaluate accelerate optuna

import logging
from collections import defaultdict
from pprint import PrettyPrinter

import accelerate
import evaluate
import optuna
import torch
import transformers

# Silence INFO-level log messages from the libraries
logging.disable(logging.INFO)

# Compact pretty-printer for dataset records and metric dictionaries
pprint = PrettyPrinter(compact=True).pprint

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [2]:
import datasets

dataset = datasets.load_dataset("mteb/amazon_reviews_multi")


### 2.2. Sampling and preprocessing

In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
})


In [4]:
# Index 5000 is the first row of the English block in the test split
pprint(dataset['test'][5000])

{'id': 'en_0199937',
 'label': 0,
 'label_text': '0',
 'text': 'Don’t waste your time!\n'
         '\n'
         'These are AWFUL. They are see through, the fabric feels like '
         'tablecloth, and they fit like children’s clothing. Customer service '
         'did seem to be nice though, but I regret missing my return date for '
         'these. I wouldn’t even donate them because the quality is so poor.'}


In [5]:
# Each language has 210K rows: 200K train, 5K validation, 5K test.
# The splits are assumed to be stored as contiguous language blocks, German first, then English.
# range() already excludes its end value, so no "- 1" is needed to stay inside a block.

de_train = dataset["train"].select(range(200000))
de_val = dataset["validation"].select(range(5000))
de_test = dataset["test"].select(range(5000))

en_train = dataset["train"].select(range(200000, 400000))
en_val = dataset["validation"].select(range(5000, 10000))
en_test = dataset["test"].select(range(5000, 10000))
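
The slicing above relies on each split being laid out as equal-sized language blocks. A minimal sketch of that index arithmetic follows; `LANGUAGE_ORDER` and `language_rows` are hypothetical names, and only the positions of German and English are implied by this notebook.

```python
# Assumed block order; only "de" first and "en" second are implied by the notebook.
LANGUAGE_ORDER = ["de", "en"]


def language_rows(lang, rows_per_language):
    """Row indices of `lang` in a split stored as equal-sized language blocks."""
    start = LANGUAGE_ORDER.index(lang) * rows_per_language
    # range() excludes its end value, so the block holds exactly rows_per_language rows.
    return range(start, start + rows_per_language)


print(language_rows("de", 5000))   # range(0, 5000)
print(language_rows("en", 5000))   # range(5000, 10000)
print(len(range(5000 - 1)))        # 4999: subtracting 1 drops the last review

# The record printed earlier (test index 5000) has an English id, as expected.
print("en_0199937".startswith("en_"))  # True
```

A more defensive alternative is to select rows by the language prefix of `id` instead of by position, which does not depend on the storage order.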

In [6]:
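# Downsample: shuffle every split (fixed seed for reproducibility), then keep the first N rows

train_size = 10000
val_size = 1000
test_size = 1000

de_train = de_train.shuffle(seed=42)
de_test = de_test.shuffle(seed=42)
de_val = de_val.shuffle(seed=42)

en_train = en_train.shuffle(seed=42)
en_test = en_test.shuffle(seed=42)
en_val = en_val.shuffle(seed=42)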

In [7]:
de_train=de_train.select(range(train_size))
de_test=de_test.select(range(test_size))
de_val=de_val.select(range(val_size))

en_train=en_train.select(range(train_size))
en_test=en_test.select(range(test_size))
en_val=en_val.select(range(val_size))
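
Shuffling a split and then keeping its first N rows amounts to drawing N rows without replacement. The sketch below shows the same idea with the standard library only; `sample_indices` and the seed value are illustrative assumptions, not part of the notebook.

```python
import random


def sample_indices(n_rows, n_keep, seed):
    """Pick n_keep distinct row indices out of n_rows, reproducibly."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.sample(range(n_rows), n_keep)


# Same sizes as the notebook: 10,000 of the 200,000 training rows per language.
train_idx = sample_indices(200000, 10000, seed=0)

print(len(train_idx), len(set(train_idx)))                  # 10000 10000 (no duplicates)
print(train_idx == sample_indices(200000, 10000, seed=0))   # True: same seed, same sample
```

Fixing the seed matters here because the validation and test subsets are also sampled, so an unseeded rerun would evaluate on different reviews.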

In [8]:
en_dataset = datasets.DatasetDict({'train': en_train, 'validation': en_val, 'test': en_test})
de_dataset = datasets.DatasetDict({'train': de_train, 'validation': de_val, 'test': de_test})

merge_dataset = datasets.DatasetDict({'train' : datasets.concatenate_datasets([en_train, de_train]),
                                      'validation' : datasets.concatenate_datasets([en_val, de_val]),
                                      'test' : datasets.concatenate_datasets([en_test, de_test])
                                      })
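# shuffle() returns a new DatasetDict, so the result has to be assigned
merge_dataset = merge_dataset.shuffle(seed=42)
merge_dataset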

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2000
    })
})

---

## 3. Machine learning model

### 3.1. Model training

In [9]:
# Load multilingual BERT with a 5-way classification head (star ratings 1-5, stored as labels 0-4)

model_name = "bert-base-multilingual-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=512,
        truncation=True,
        padding='max_length'
    )

# Apply the tokenizer to the whole dataset using .map()
en_dataset = en_dataset.map(tokenize)
de_dataset = de_dataset.map(tokenize)
merge_dataset = merge_dataset.map(tokenize)

In [11]:
trainer_args = transformers.TrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=100,
    logging_steps=100,
    learning_rate=0.000001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=1000,
)

In [12]:
accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


data_collator = transformers.DataCollatorWithPadding(tokenizer)

early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)

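
What `compute_accuracy` does can be followed on a tiny example: take the index of the largest score per review and count how often it equals the label. This is a plain-Python sketch with made-up scores; the helper names are hypothetical and it does not use the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up scores for three reviews over the five star classes (labels 0-4).
logits = [
    [2.0, 0.1, -1.0, 0.0, 0.3],  # highest score at class 0
    [0.2, 0.1, 0.0, 1.5, 0.9],   # highest score at class 3
    [0.0, 0.4, 0.3, 0.2, 0.1],   # highest score at class 1
]
labels = [0, 4, 1]

print(accuracy_from_logits(logits, labels))  # 2 of 3 predictions are correct
```

Accuracy treats a 4-star prediction for a 5-star review the same as a 1-star prediction, which is worth keeping in mind for an ordinal label like a star rating.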

In [13]:
class LogSavingCallback(transformers.TrainerCallback):
    def on_train_begin(self, *args, **kwargs):
        self.logs = defaultdict(list)
        self.training = True

    def on_train_end(self, *args, **kwargs):
        self.training = False

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        if self.training:
            for k, v in logs.items():
                if k != "epoch" or v not in self.logs[k]:
                    self.logs[k].append(v)

training_logs = LogSavingCallback()

**Baseline**: train on English, evaluate on English

In [14]:
en_trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=en_dataset['train'],
    eval_dataset=en_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping, training_logs]
)

en_trainer.train()



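| Step | Training Loss | Validation Loss | Accuracy |
|-----:|--------------:|----------------:|---------:|
| 100 | 1.6149 | 1.611624 | 0.245 |
| 200 | 1.599 | 1.597163 | 0.297 |
| 300 | 1.5873 | 1.575503 | 0.327 |
| 400 | 1.5677 | 1.552336 | 0.35 |
| 500 | 1.5556 | 1.528499 | 0.365 |
| 600 | 1.5205 | 1.502784 | 0.37 |
| 700 | 1.4899 | 1.482036 | 0.377 |
| 800 | 1.484 | 1.466479 | 0.374 |
| 900 | 1.4709 | 1.458257 | 0.376 |
| 1000 | 1.4627 | 1.454586 | 0.375 |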


TrainOutput(global_step=1000, training_loss=1.5352706451416016, metrics={'train_runtime': 1159.4031, 'train_samples_per_second': 6.9, 'train_steps_per_second': 0.863, 'total_flos': 2104945139712000.0, 'train_loss': 1.5352706451416016, 'epoch': 0.8})
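
The `'epoch': 0.8` in the training output follows from the step budget: 1,000 steps of 8 examples cover 8,000 of the 10,000 English training reviews. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (the function name and its optional parameters are illustrative):

```python
def epochs_completed(steps, per_device_batch_size, n_train, n_devices=1, grad_accum=1):
    """Fraction of the training set seen after `steps` optimizer steps."""
    examples_seen = steps * per_device_batch_size * n_devices * grad_accum
    return examples_seen / n_train


print(epochs_completed(1000, 8, 10000))  # 0.8  (English-only training set)
print(epochs_completed(1000, 8, 20000))  # 0.4  (merged English + German training set)
```

With `max_steps=1000` the model therefore never completes a full pass over the data, and the merged set is covered only half as thoroughly as the English-only one.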

### 3.2. Hyperparameter optimization

In [15]:
LR_MIN = 4e-5
LR_CEIL = 0.01
WD_MIN = 4e-5
WD_CEIL = 0.01
MIN_EPOCHS = 2
MAX_EPOCHS = 5
PER_DEVICE_EVAL_BATCH = 8
PER_DEVICE_TRAIN_BATCH = 8
NUM_TRIALS = 1
SAVE_DIR = 'opt-test'
MAX_LENGTH = 512

In [16]:
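def objective(trial: optuna.Trial):
    # NOTE: no hyperparameter is sampled from `trial` (there is no trial.suggest_* call),
    # so the study has nothing to tune and study.best_params stays empty.
    # The Trainer also continues from the `model` already fine-tuned above,
    # using the same trainer_args as the baseline run.
    trainer = transformers.Trainer(
        model=model,
        args=trainer_args,
        train_dataset=en_dataset['train'],
        eval_dataset=en_dataset['validation'])

    result = trainer.train()
    # The value being minimized is the mean training loss of the run
    return result.training_loss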

In [17]:
# Run the study; with NUM_TRIALS = 1 this executes the objective exactly once
study = optuna.create_study(study_name='hp-search-mbert', direction='minimize')
study.optimize(func=objective, n_trials=NUM_TRIALS)

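| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
| 100 | 1.4223 | 1.408229 |
| 200 | 1.3986 | 1.376973 |
| 300 | 1.3589 | 1.354803 |
| 400 | 1.327 | 1.316835 |
| 500 | 1.3388 | 1.298997 |
| 600 | 1.2943 | 1.285507 |
| 700 | 1.2647 | 1.275724 |
| 800 | 1.262 | 1.268656 |
| 900 | 1.2723 | 1.265057 |
| 1000 | 1.2717 | 1.26362 |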


In [18]:
print(study.best_value)
print(study.best_params)
print(study.best_trial)

1.3210468215942384
{}
FrozenTrial(number=0, state=TrialState.COMPLETE, values=[1.3210468215942384], datetime_start=datetime.datetime(2024, 1, 9, 9, 7, 49, 13031), datetime_complete=datetime.datetime(2024, 1, 9, 9, 26, 35, 17652), params={}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={}, trial_id=0, value=None)
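
The empty `params={}` in the trial above reflects that the objective never asks the trial for a value. As a rough illustration of what sampling from the ranges defined in this section looks like, the sketch below runs a plain random search with the standard library as a stand-in for Optuna's sampler. `toy_loss` is a hypothetical placeholder, not a training run. In Optuna itself the values would be requested inside the objective with `trial.suggest_float(..., log=True)` and `trial.suggest_int(...)`.

```python
import math
import random

# Ranges copied from the constants defined for the search.
LR_MIN, LR_CEIL = 4e-5, 0.01
MIN_EPOCHS, MAX_EPOCHS = 2, 5


def sample_params(rng):
    """Draw one configuration: log-uniform learning rate, uniform integer epochs."""
    log_lr = rng.uniform(math.log(LR_MIN), math.log(LR_CEIL))
    return {
        "learning_rate": math.exp(log_lr),
        "num_train_epochs": rng.randint(MIN_EPOCHS, MAX_EPOCHS),
    }


def toy_loss(params):
    """Hypothetical stand-in for a validation loss; NOT a real training run."""
    return (math.log10(params["learning_rate"]) + 3.5) ** 2 + 0.1 * params["num_train_epochs"]


rng = random.Random(0)
trials = [sample_params(rng) for _ in range(20)]
best = min(trials, key=toy_loss)
print(best)
```

Sampling the learning rate on a log scale spreads the trials evenly across orders of magnitude, which a uniform draw between 4e-5 and 0.01 would not do.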


### 3.3. Evaluation on test set

In [19]:
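# Evaluate on the English test set.
# NOTE: en_trainer shares `model` with the hyperparameter-search run above,
# so this measures the weights after that additional training.
en_eval_results = en_trainer.evaluate(en_dataset["test"])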

pprint(en_eval_results)

print('Accuracy:', en_eval_results['eval_accuracy'])

{'epoch': 0.8,
 'eval_accuracy': 0.449,
 'eval_loss': 1.2808046340942383,
 'eval_runtime': 33.8751,
 'eval_samples_per_second': 29.52,
 'eval_steps_per_second': 0.945}
Accuracy: 0.449


### 3.4. Multilingual and cross-lingual experiments

**Multilingual**: train on English + German, evaluate on English

In [20]:
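# NOTE: `model` is the same object trained in the cells above, so this run continues
# from the English fine-tuned weights rather than from the pretrained checkpoint.
de_en_trainer = transformers.Trainer(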
    model=model,
    args=trainer_args,
    train_dataset=merge_dataset['train'],
    eval_dataset=merge_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer = tokenizer,
    callbacks=[early_stopping, training_logs]
)

de_en_trainer.train()

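| Step | Training Loss | Validation Loss | Accuracy |
|-----:|--------------:|----------------:|---------:|
| 100 | 1.3136 | 1.302866 | 0.407 |
| 200 | 1.2993 | 1.279573 | 0.425 |
| 300 | 1.3092 | 1.269392 | 0.4435 |
| 400 | 1.2666 | 1.256128 | 0.4445 |
| 500 | 1.2804 | 1.250214 | 0.4365 |
| 600 | 1.2766 | 1.241823 | 0.449 |
| 700 | 1.2574 | 1.230625 | 0.454 |
| 800 | 1.207 | 1.226553 | 0.4545 |
| 900 | 1.2524 | 1.224106 | 0.456 |
| 1000 | 1.2159 | 1.223727 | 0.4565 |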


TrainOutput(global_step=1000, training_loss=1.2678352508544921, metrics={'train_runtime': 1483.01, 'train_samples_per_second': 5.394, 'train_steps_per_second': 0.674, 'total_flos': 2104945139712000.0, 'train_loss': 1.2678352508544921, 'epoch': 0.4})

In [21]:
de_en_eval_results = de_en_trainer.evaluate(en_dataset["test"])

pprint(de_en_eval_results)

print('Accuracy:', de_en_eval_results['eval_accuracy'])

{'epoch': 0.4,
 'eval_accuracy': 0.454,
 'eval_loss': 1.2270537614822388,
 'eval_runtime': 36.0045,
 'eval_samples_per_second': 27.774,
 'eval_steps_per_second': 0.889}
Accuracy: 0.454
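
Both test accuracies (0.449 and 0.454) are measured on 1,000 English reviews, so it helps to know how much an accuracy of that size can vary by chance. The sketch below uses the binomial standard error, which assumes the test examples are independent; the function name is illustrative.

```python
import math


def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


n_test = 1000
baseline_acc, multilingual_acc = 0.449, 0.454

se = accuracy_standard_error(baseline_acc, n_test)
diff = multilingual_acc - baseline_acc

print(round(se, 4))    # about 0.0157
print(round(diff, 3))  # 0.005
```

The 0.005 gap corresponds to 5 reviews out of 1,000 and is well below one standard error, so on its own it does not show that adding German data helped.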


**Cross-lingual**: train on German, evaluate on English

In [1]:
de_trainer = None
de_trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=de_dataset['train'],
    eval_dataset=de_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer = tokenizer,
    callbacks=[early_stopping, training_logs]
)

de_trainer.train()

NameError: name 'transformers' is not defined

In [None]:
de_eval_results = de_trainer.evaluate(en_dataset["test"])

pprint(de_eval_results)

print('Accuracy:', de_eval_results['eval_accuracy'])

---

## 4. Results and summary

### 4.1. Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2. Results

(Briefly summarize your results)

### 4.3. Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2. Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here

### 5.4. Bonus task evaluation

(Present the evaluation results here)