<a href="https://colab.research.google.com/github/maryamteimouri/MultilingualTextClassifier/blob/main/DL_HTL_course_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning in Human Language Technology Project

- Student Name: Maryam Teimouri Badeleh Dareh
- Date: 28 November 2023
- Chosen Corpus: amazon_reviews_multi

### Corpus information

- Description of the chosen corpus:
  1. Labels: star rating 1–5 (stored as integer labels 0–4)
  2. Languages: English, German, Spanish, French, Japanese, Chinese
  3. Subset sizes (per language): train: 200K, validation: 5K, test: 5K
  4. Description: Amazon product reviews for multilingual text classification. Each record contains `id`, `label`, `label_text`, and `text`. The corpus is balanced across star ratings, so each rating makes up 20% of the reviews in each language.
- Paper(s) and other published materials related to the corpus:
  1. The Multilingual Amazon Reviews Corpus (EMNLP 2020): https://aclanthology.org/2020.emnlp-main.369.pdf
  2. Multilingual text classification with BERT (nlptown notebook): https://github.com/nlptown/nlp-notebooks/blob/master/Multilingual%20text%20classification%20with%20BERT.ipynb
  3. hf_trainer_bert notebook (TurkuNLP Deep_Learning_in_LangTech_course): https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_bert.ipynb
- Random baseline performance and expected performance for recent machine learned models:
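
Because the five star ratings are balanced, the chance-level figures follow directly from the number of classes. The sketch below works through that arithmetic; the variable names are illustrative and do not come from the notebook.

```python
import math

NUM_CLASSES = 5  # star ratings 1-5, each 20% of the reviews

# Accuracy of a classifier that guesses uniformly at random, or that
# always predicts the same class, on perfectly balanced data.
random_accuracy = 1 / NUM_CLASSES

# Cross-entropy of a model that outputs a uniform distribution over
# the classes: -log(1/5) = log(5).
uniform_cross_entropy = math.log(NUM_CLASSES)

print(f"chance accuracy: {random_accuracy:.2f}")
print(f"uniform cross-entropy: {uniform_cross_entropy:.4f}")
```

A validation loss that stays close to log(5), about 1.61, therefore suggests a model that has not moved beyond near-uniform predictions.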

---

## 1. Setup

In [None]:
import logging
logging.disable(logging.INFO)

from pprint import PrettyPrinter
pprint = PrettyPrinter(compact=True).pprint

!pip3 install -q datasets transformers evaluate accelerate

import transformers
import torch
import evaluate
import accelerate
from collections import defaultdict

!pip install optuna
import optuna

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
import datasets

dataset = datasets.load_dataset("mteb/amazon_reviews_multi")



### 2.2. Sampling and preprocessing

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
})


In [None]:
pprint(dataset['train'][210000])

{'id': 'en_0389345',
 'label': 0,
 'label_text': '0',
 'text': 'It does not charge.\n'
         '\n'
         'I got this with 50% charge. Put it under the sun. two days later, it '
         'has 25%. Never able to charge it once. Should have returned it right '
         'away.'}


In [None]:
# The train split holds 200K rows per language (1.2M rows / 6 languages);
# validation and test hold 5K rows per language.
# Assumption kept from the original cell: rows are grouped by language,
# with German first and English second.

de_train = dataset["train"].select(range(200000))
de_val = dataset["validation"].select(range(5000))
de_test = dataset["test"].select(range(5000))

en_train = dataset["train"].select(range(200000, 400000))
en_val = dataset["validation"].select(range(5000, 10000))
en_test = dataset["test"].select(range(5000, 10000))
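
Selecting languages by row offset depends on the rows being grouped by language in a known order. A more defensive option is to select by the language prefix of the `id` field, as in the sample id `en_0389345`. The sketch below shows the idea on toy records and assumes that every id in the corpus carries such a prefix; the helper `select_language` and all ids except the sample one are made up for illustration.

```python
# Toy records standing in for rows of dataset["train"]; only the "id"
# field matters here. Apart from "en_0389345" the ids are invented.
rows = [
    {"id": "de_0000001", "label": 3},
    {"id": "en_0389345", "label": 0},
    {"id": "es_0000002", "label": 4},
    {"id": "en_0000003", "label": 2},
]

def select_language(records, lang_code):
    """Keep the records whose id starts with '<lang_code>_'."""
    prefix = f"{lang_code}_"
    return [r for r in records if r["id"].startswith(prefix)]

en_rows = select_language(rows, "en")
de_rows = select_language(rows, "de")
print(len(en_rows), len(de_rows))

# With the Hugging Face `datasets` library the same idea would be
#   dataset["train"].filter(lambda ex: ex["id"].startswith("en_"))
```

Filtering scans the whole split, so it is slower than `select`, but it does not break if the row order differs from what the offsets assume.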

In [None]:
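# Down-sampling: shuffle each subset, then keep the first N rows

train_size = 10000
val_size = 1000
test_size = 1000
SEED = 42  # arbitrary fixed seed so the sampled subsets are reproducible

de_train = de_train.shuffle(seed=SEED)
de_test = de_test.shuffle(seed=SEED)
de_val = de_val.shuffle(seed=SEED)

en_train = en_train.shuffle(seed=SEED)
en_test = en_test.shuffle(seed=SEED)
en_val = en_val.shuffle(seed=SEED)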

In [None]:
de_train=de_train.select(range(train_size))
de_test=de_test.select(range(test_size))
de_val=de_val.select(range(val_size))

en_train=en_train.select(range(train_size))
en_test=en_test.select(range(test_size))
en_val=en_val.select(range(val_size))
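
Random down-sampling of a balanced split gives label shares that are close to 20% but not exactly 20%. The sketch below illustrates this with synthetic labels and Python's `random` module; it mirrors the shuffle-then-select pattern above but does not touch the real corpus, and the sizes are chosen only for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sample is reproducible

# Synthetic stand-in for a balanced split: 5 labels, 2,000 rows each.
labels = [label for label in range(5) for _ in range(2000)]

# Shuffle, then keep the first `sample_size` rows, mirroring
# .shuffle() followed by .select(range(n)).
rng.shuffle(labels)
sample_size = 1000
sample = labels[:sample_size]

counts = Counter(sample)
for label in sorted(counts):
    share = counts[label] / sample_size
    print(f"label {label}: {counts[label]} rows ({share:.1%})")
```

Running the same count on the `label` column of the sampled subsets would show how far each one drifts from the balanced 20% per class.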

In [None]:
en_dataset = datasets.DatasetDict({'train': en_train, 'validation': en_val, 'test': en_test})
de_dataset = datasets.DatasetDict({'train': de_train, 'validation': de_val, 'test': de_test})

merge_dataset = datasets.DatasetDict({'train' : datasets.concatenate_datasets([en_train, de_train]),
                                      'validation' : datasets.concatenate_datasets([en_val, de_val]),
                                      'test' : datasets.concatenate_datasets([en_test, de_test])
                                      })
# shuffle() returns a new DatasetDict rather than shuffling in place
merge_dataset = merge_dataset.shuffle()
merge_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 2000
    })
})

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Fine-tune a multilingual BERT checkpoint for 5-way star-rating classification

model_name = "bert-base-multilingual-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
# num_labels=5 matches the five star ratings
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def tokenize(example):
    # Truncate to BERT's 512-token limit and pad every review to that length
    return tokenizer(
        example["text"],
        max_length=512,
        truncation=True,
        padding='max_length'
    )

# Apply the tokenizer to every split; batched=True tokenizes many rows per call
en_dataset = en_dataset.map(tokenize, batched=True)
de_dataset = de_dataset.map(tokenize, batched=True)
merge_dataset = merge_dataset.map(tokenize, batched=True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]
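
Padding every review to 512 tokens, as `tokenize` does, means each training step processes full-length inputs even when the reviews are short. The sketch below compares that with padding each batch only to its longest member, which is what a padding data collator does when the tokenizer leaves sequences unpadded. The token counts are hypothetical and were not measured on the corpus.

```python
def padded_tokens_fixed(lengths, max_length=512):
    """Tokens processed when every sequence is padded to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size=8, max_length=512):
    """Tokens processed when each batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate to max_length first, then pad to the longest in the batch.
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total

# Hypothetical token counts for 16 reviews.
lengths = [35, 60, 48, 120, 22, 75, 90, 41, 300, 18, 55, 64, 27, 80, 44, 600]

fixed = padded_tokens_fixed(lengths)
dynamic = padded_tokens_dynamic(lengths)
print(f"fixed padding:   {fixed} tokens")
print(f"dynamic padding: {dynamic} tokens")
```

How much dynamic padding saves in practice depends on the real length distribution of the reviews, so the numbers here only illustrate the mechanism.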

In [None]:
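trainer_args = transformers.TrainingArguments(
    "checkpoints",                  # output directory for checkpoints
    evaluation_strategy="steps",    # evaluate every eval_steps
    logging_strategy="steps",
    load_best_model_at_end=True,    # restore the best saved checkpoint after training
    eval_steps=100,
    logging_steps=100,
    # 1e-4 is above the 2e-5 to 5e-5 range commonly used for fine-tuning BERT;
    # a lower value may be worth trying if the loss does not decrease
    learning_rate=0.0001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=1000,                 # stop after 1000 optimizer steps
)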
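
The step budget above covers less than one pass over the down-sampled training set. The short calculation below makes that explicit, assuming a single device (as on a default Colab GPU runtime) and no gradient accumulation; the result agrees with the `'epoch': 0.8` reported in the training output.

```python
train_examples = 10_000   # train_size after down-sampling
per_device_batch = 8      # per_device_train_batch_size
max_steps = 1_000         # max_steps
eval_steps = 100          # eval_steps
num_devices = 1           # assumption: one GPU, no gradient accumulation

examples_seen = max_steps * per_device_batch * num_devices
epochs = examples_seen / train_examples
num_evaluations = max_steps // eval_steps

print(f"examples seen:   {examples_seen}")
print(f"epochs covered:  {epochs}")
print(f"evaluations run: {num_evaluations}")
```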

In [None]:
accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


data_collator = transformers.DataCollatorWithPadding(tokenizer)

early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)
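
Patience-based early stopping halts training once the monitored metric has failed to improve for a given number of consecutive evaluations. The sketch below is a simplified version of that rule, not the library's implementation (the `transformers` callback tracks the best metric through the trainer state, so details may differ). The helper name `first_stop_index` is hypothetical, and the losses are the validation losses from the English baseline run.

```python
def first_stop_index(losses, patience):
    """Return the index of the evaluation at which training would stop,
    or None if the patience budget is never exhausted."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(losses):
        if loss < best:
            best = loss       # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1    # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Validation loss at steps 100, 200, ..., 1000 of the English baseline run.
val_losses = [1.631952, 1.612991, 1.624683, 1.613626, 1.635869,
              1.626665, 1.612077, 1.619265, 1.612877, 1.614008]

print(first_stop_index(val_losses, patience=5))
print(first_stop_index(val_losses, patience=3))
```

Under this simplified rule a patience of 5 never triggers on these ten evaluations, which is consistent with the run continuing to step 1000.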

In [None]:
class LogSavingCallback(transformers.TrainerCallback):
    def on_train_begin(self, *args, **kwargs):
        self.logs = defaultdict(list)
        self.training = True

    def on_train_end(self, *args, **kwargs):
        self.training = False

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        if self.training:
            for k, v in logs.items():
                if k != "epoch" or v not in self.logs[k]:
                    self.logs[k].append(v)

training_logs = LogSavingCallback()

**Baseline**: train on English, evaluate on English

In [None]:
en_trainer = None
en_trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=en_dataset['train'],
    eval_dataset=en_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer = tokenizer,
    callbacks=[early_stopping, training_logs]
)

en_trainer.train()


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Accuracy
100,1.6268,1.631952,0.229
200,1.6157,1.612991,0.229
300,1.5371,1.624683,0.229
400,1.532,1.613626,0.229
500,1.5646,1.635869,0.19
600,1.5404,1.626665,0.202
700,1.5347,1.612077,0.202
800,1.5231,1.619265,0.202
900,1.5188,1.612877,0.202
1000,1.517,1.614008,0.202


TrainOutput(global_step=1000, training_loss=1.5510146179199218, metrics={'train_runtime': 1103.8324, 'train_samples_per_second': 7.247, 'train_steps_per_second': 0.906, 'total_flos': 2104945139712000.0, 'train_loss': 1.5510146179199218, 'epoch': 0.8})

### 3.2 Hyperparameter optimization

In [None]:
LR_MIN = 4e-5
LR_CEIL = 0.01
WD_MIN = 4e-5
WD_CEIL = 0.01
MIN_EPOCHS = 2
MAX_EPOCHS = 5
PER_DEVICE_EVAL_BATCH = 8
PER_DEVICE_TRAIN_BATCH = 8
NUM_TRIALS = 3
SAVE_DIR = 'opt-test'
MAX_LENGTH = 512

In [None]:
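def objective(trial: optuna.Trial):
    # Nothing is sampled from `trial` here, so every trial reuses trainer_args
    # and the same `model` object, and study.best_params stays empty.
    trainer = transformers.Trainer(
        model=model,
        args=trainer_args,
        train_dataset=en_dataset['train'],
        eval_dataset=en_dataset['validation'])

    result = trainer.train()
    # The study minimises the final training loss returned here
    return result.training_loss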

In [None]:
# Run the Optuna study; the objective's return value is minimised
study = optuna.create_study(study_name='hp-search-mbert', direction='minimize')
study.optimize(func=objective, n_trials=NUM_TRIALS)

Step,Training Loss,Validation Loss
100,1.6258,1.625548
200,1.6223,1.617666
300,1.6105,1.612308
400,1.6161,1.608877
500,1.6166,1.626693
600,1.6206,1.614922
700,1.6297,1.622812
800,1.6161,1.613971
900,1.6135,1.608996
1000,1.6189,1.608561


Step,Training Loss,Validation Loss
100,1.6147,1.63
200,1.6115,1.617124
300,1.6063,1.61471
400,1.6079,1.613225
500,1.6088,1.640171
600,1.6153,1.62173
700,1.6227,1.628683
800,1.6109,1.616674
900,1.6099,1.608385
1000,1.6158,1.608768


Step,Training Loss,Validation Loss
100,1.5607,1.631808
200,1.6013,1.613676
300,1.5979,1.618097
400,1.5986,1.623163
500,1.6014,1.655625
600,1.6076,1.628764
700,1.6158,1.628402
800,1.6061,1.617326
900,1.6067,1.608336
1000,1.6146,1.608834


In [None]:
print('Finding study best parameters')
#best_lr = float(study.best_params['learning_rate'])
best_weight_decay = float(study.best_params['weight_decay'])
best_epoch = int(study.best_params['num_train_epochs'])

Finding study best parameters


KeyError: ignored

In [None]:
print(study.best_value)
print(study.best_params)
print(study.best_trial)

1.6254893798828125
{}
FrozenTrial(number=0, state=TrialState.COMPLETE, values=[1.6254893798828125], datetime_start=datetime.datetime(2024, 1, 3, 11, 20, 10, 672955), datetime_complete=datetime.datetime(2024, 1, 3, 11, 38, 30, 237271), params={}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={}, trial_id=0, value=None)
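
`study.best_params` is empty above because Optuna only records parameters that the objective requests through `trial.suggest_*` calls. The sketch below shows how the search space defined by the notebook's constants could be sampled. So that it runs without Optuna or a GPU, it uses `StandInTrial`, a hypothetical stand-in that imitates only `suggest_float` and `suggest_int`; the log-scale sampling for learning rate and weight decay is an assumption, not something the notebook specifies.

```python
import math
import random

# Search-space bounds copied from the notebook's constants.
LR_MIN, LR_CEIL = 4e-5, 0.01
WD_MIN, WD_CEIL = 4e-5, 0.01
MIN_EPOCHS, MAX_EPOCHS = 2, 5

class StandInTrial:
    """Hypothetical stand-in for optuna.Trial with the two methods used below."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.params = {}

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Sample uniformly in log space, then map back.
            value = math.exp(self.rng.uniform(math.log(low), math.log(high)))
        else:
            value = self.rng.uniform(low, high)
        self.params[name] = value
        return value

    def suggest_int(self, name, low, high):
        value = self.rng.randint(low, high)  # inclusive on both ends
        self.params[name] = value
        return value

def sample_hyperparameters(trial):
    """Request values from the trial; these requests are what fill best_params."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", LR_MIN, LR_CEIL, log=True),
        "weight_decay": trial.suggest_float("weight_decay", WD_MIN, WD_CEIL, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", MIN_EPOCHS, MAX_EPOCHS),
    }

trial = StandInTrial(seed=0)
params = sample_hyperparameters(trial)
print(params)
```

In a real objective these values would be passed to a fresh `transformers.TrainingArguments` together with a newly loaded model, and the function would typically return a validation metric rather than the training loss.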


### 3.3. Evaluation on test set

In [None]:
# Evaluate the English-trained model on the English test set
en_eval_results = en_trainer.evaluate(en_dataset["test"])

pprint(en_eval_results)

print('Accuracy:', en_eval_results['eval_accuracy'])

### 3.4. Multilingual and cross-lingual experiments

**Multilingual**: train on English + German, evaluate on English

In [None]:
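# Reload the pretrained checkpoint so this run does not start from weights
# already fine-tuned in the earlier cells
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
de_en_trainer = transformers.Trainer(
    model=model,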
    args=trainer_args,
    train_dataset=merge_dataset['train'],
    eval_dataset=merge_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer = tokenizer,
    callbacks=[early_stopping, training_logs]
)

de_en_trainer.train()

In [None]:
de_en_eval_results = de_en_trainer.evaluate(en_dataset["test"])

pprint(de_en_eval_results)

print('Accuracy:', de_en_eval_results['eval_accuracy'])

**Cross-lingual**: train on German, evaluate on English

In [None]:
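# Reload the pretrained checkpoint so this run does not start from weights
# already fine-tuned in the earlier cells
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
de_trainer = transformers.Trainer(
    model=model,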
    args=trainer_args,
    train_dataset=de_dataset['train'],
    eval_dataset=de_dataset['validation'],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer = tokenizer,
    callbacks=[early_stopping, training_logs]
)

de_trainer.train()

In [None]:
de_eval_results = de_trainer.evaluate(en_dataset["test"])

pprint(de_eval_results)

print('Accuracy:', de_eval_results['eval_accuracy'])

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here
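
One common way to turn a transformer's token-level outputs into a single sentence embedding is to average the token vectors over the non-padding positions (mean pooling). This is an assumption about the method, since the task text does not prescribe one. The sketch below shows the pooling step on toy vectors in plain Python; in practice the vectors would come from the model's last hidden layer and the mask from the tokenizer's `attention_mask`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for j, value in enumerate(vector):
                summed[j] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / kept for s in summed]

# Toy 3-dimensional "hidden states" for four positions; the last is padding.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding position, must not affect the result
]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_vectors, attention_mask)
print(sentence_embedding)
```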

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here
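
Cosine similarity compares the direction of two embeddings: the dot product divided by the product of their norms. The sketch below computes it in plain Python and picks the candidate with the highest score; the vectors are toy values and the helper names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def most_similar(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]

# Toy embeddings: one source sentence and three target-language candidates.
query = [1.0, 2.0, 0.0]
candidates = [
    [0.0, 0.0, 3.0],  # orthogonal to the query
    [2.0, 4.0, 0.0],  # same direction as the query
    [1.0, 0.0, 0.0],  # partially aligned
]

best_index, best_score = most_similar(query, candidates)
print(best_index, round(best_score, 6))
```

For many sentences, normalising all embeddings once and using a matrix product would be faster; the loop version is kept here for clarity.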

### 5.4 Bonus task evaluation

(Present the evaluation results here)