<a href="https://colab.research.google.com/github/maryamteimouri/DL-HTL-course-project/blob/main/DL_HTL_course_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project

- Student Name: Maryam Teimouri Badeleh Dareh
- Date: 28 November 2023
- Chosen Corpus: amazon_reviews_multi

### Corpus information

- Description of the chosen corpus:
  1. Labels: Star rating 1–5 (the sample record in section 2.2 has label 0, so the ratings appear to be stored as 0–4)
  2. Languages: English, German, Spanish, French, Japanese, Chinese
  3. Subset sizes (per language): train: 200K, validation: 5K, test: 5K
  4. Description: Amazon product reviews dataset for multilingual text classification. Each record in the dataset contains id, label, label_text, and text. The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
- Paper(s) and other published materials related to the corpus:
  1. The Multilingual Amazon Reviews Corpus: https://aclanthology.org/2020.emnlp-main.369.pdf
  2. https://github.com/nlptown/nlp-notebooks/blob/master/Multilingual%20text%20classification%20with%20BERT.ipynb
  3. https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_bert.ipynb
- Random baseline performance and expected performance for recent machine learned models:
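
Since the corpus is described above as balanced across five star ratings, the random baseline follows directly from the class count. The sketch below is an illustration only: it computes the expected accuracy of uniform random guessing and checks it with a small simulation on synthetic, balanced labels (no corpus data is used, and the sample size and seed are arbitrary assumptions).

```python
import random

NUM_CLASSES = 5  # one class per star rating

# With balanced classes, uniform random guessing is right 1 time in NUM_CLASSES.
expected_accuracy = 1 / NUM_CLASSES

# Sanity check on synthetic data: balanced gold labels, uniformly random guesses.
rng = random.Random(0)  # arbitrary seed, for repeatability only
n = 100_000
gold = [i % NUM_CLASSES for i in range(n)]
guesses = [rng.randrange(NUM_CLASSES) for _ in range(n)]
simulated_accuracy = sum(g == y for g, y in zip(guesses, gold)) / n

print(f"expected: {expected_accuracy:.3f}, simulated: {simulated_accuracy:.3f}")
```

Always predicting a single class gives the same 20% on a balanced corpus, so either can serve as the floor that a trained model should beat.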

---

## 1. Setup

In [None]:
import logging
logging.disable(logging.INFO)

from pprint import PrettyPrinter
pprint = PrettyPrinter(compact=True).pprint

!pip3 install -q datasets transformers evaluate accelerate

import transformers
import torch
import evaluate
import accelerate
from collections import defaultdict

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
import datasets

dataset = datasets.load_dataset("mteb/amazon_reviews_multi")


### 2.2. Sampling and preprocessing

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
})


In [None]:
pprint(dataset['train'][8])

{'id': 'de_0055293',
 'label': 0,
 'label_text': '0',
 'text': 'Nach kurzer Zeit defekt\n'
         '\n'
         'Die Lampe ist nach einem Jahr bei nur gelegentlichem Gebrauch '
         'defekt- sie schaltet sich grundsätzlich - voll aufgeladen- nach '
         'wenigen Sekunden ab.'}


In [None]:
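# Shuffle with a fixed seed so the subsample is reproducible (the seed value is an arbitrary choice)
dataset = dataset.shuffle(seed=42)

# Keep one third of each split (the full corpus has 1.2M / 30K / 30K rows)
train_size = 400000
val_size = 10000
test_size = 10000

dataset["train"] = dataset["train"].select(range(train_size))
dataset["validation"] = dataset["validation"].select(range(val_size))
dataset["test"] = dataset["test"].select(range(test_size))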

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Load the multilingual BERT tokenizer and a sequence classification model

model_name = "bert-base-multilingual-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)  # 5 classes, one per star rating

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
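
At this point the dataset still holds raw text, while the Trainer and `DataCollatorWithPadding` expect tokenized fields such as `input_ids`. The missing tokenization step is a likely cause of the `ValueError` raised by `trainer.train()` further down. The sketch below shows one way to add it; `make_tokenize_fn` is a hypothetical helper name, and the maximum length of 512 is an assumption based on the input limit of BERT models.

```python
def make_tokenize_fn(tokenizer, max_length=512):
    """Return a function that tokenizes a batch of examples with a 'text' field.

    `make_tokenize_fn` is a hypothetical helper, not part of any library.
    """
    def tokenize(batch):
        # Truncate long reviews here; padding is left to DataCollatorWithPadding,
        # which pads each batch to its own longest sequence.
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    return tokenize


# In the notebook, once `tokenizer` and `dataset` exist:
# dataset = dataset.map(make_tokenize_fn(tokenizer), batched=True)
```

Leaving padding to the collator keeps the stored dataset small and avoids padding every review to the full 512 tokens.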


In [None]:
trainer_args = transformers.TrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=100,
    logging_steps=100,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    max_steps=500,
)

In [None]:
accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


data_collator = transformers.DataCollatorWithPadding(tokenizer)

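# Note: with eval_steps=100 and max_steps=500 there are only five evaluations,
# so a patience of 5 can never trigger before training ends.
early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)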

In [None]:
class LogSavingCallback(transformers.TrainerCallback):
    def on_train_begin(self, *args, **kwargs):
        self.logs = defaultdict(list)
        self.training = True

    def on_train_end(self, *args, **kwargs):
        self.training = False

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        if self.training:
            for k, v in logs.items():
                if k != "epoch" or v not in self.logs[k]:
                    self.logs[k].append(v)

training_logs = LogSavingCallback()

In [None]:
trainer = None
trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # validate during training; the test set is reserved for section 3.3
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[early_stopping, training_logs]
)

trainer.train()


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


ValueError: ignored

### 3.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here


### 3.4. Multilingual and cross-lingual experiments

In [None]:
# Your code to train and evaluate the multilingual and cross-lingual models
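
For the cross-lingual experiments the data has to be split by language. The sample record in section 2.2 has the id `de_0055293`, which suggests that ids start with a language code. The sketch below relies on that assumption, which should be verified on the full corpus; `language_of` and `is_language` are hypothetical helper names.

```python
def language_of(example_id):
    """Return the language prefix of an id such as 'de_0055293'.

    Assumes every id has the form '<language code>_<number>'.
    """
    return example_id.split("_", 1)[0]


def is_language(example, lang):
    """Filter predicate: True if the example's id carries the given language code."""
    return language_of(example["id"]) == lang


# In the notebook, once `dataset` exists:
# german_train = dataset["train"].filter(is_language, fn_kwargs={"lang": "de"})
```

A model trained on one such subset and evaluated on another gives the cross-lingual setting, while training on the unfiltered data gives the multilingual one.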

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here
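
The selection step described above can be sketched without any model, using plain lists of floats in place of real sentence embeddings. The toy vectors below are made up for illustration, and `best_match` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(source_embedding, target_embeddings):
    """Return (index, score) of the target with the highest cosine similarity."""
    scores = [cosine_similarity(source_embedding, t) for t in target_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Toy example with 2-dimensional vectors standing in for sentence embeddings.
source = [1.0, 0.0]
targets = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
index, score = best_match(source, targets)
print(index, round(score, 3))
```

Cosine similarity ignores vector length, so the second target wins here even though its norm is twice that of the source. With real embeddings, the same computation is usually done in one matrix product over normalized vectors.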

### 5.4 Bonus task evaluation

(Present the evaluation results here)