<a href="https://colab.research.google.com/github/jessicajhassibi/Bachelor-Thesis/blob/main/XLM_T_Fine_tuning_on_custom_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install --upgrade pip
!pip install sentencepiece
!pip install datasets
!pip install transformers


# Fine-tuning XLM-T

This notebook walks through a simple fine-tuning case. You can fine-tune either the `XLM-T` language model or XLM-T sentiment, which has already been fine-tuned on sentiment analysis data in 8 languages (useful for transferring sentiment analysis to new languages).

This notebook was modified from https://huggingface.co/transformers/custom_datasets.html

In [2]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

import numpy as np
from sklearn.metrics import classification_report

## Parameters

In [31]:
LR = 5e-5
EPOCHS = 20
BATCH_SIZE = 16
MODEL = "bert-base-cased" # language model to fine-tune (this run uses BERT rather than XLM-T)
#MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment" # use this to finetune the sentiment classifier
MAX_TRAINING_EXAMPLES = -1 # set this to -1 if you want to use the whole training set

## Data

We download the XLM-T sentiment dataset (`UMSAB`), but you can use your own.
If you use the same file structure as [TweetEval](https://github.com/cardiffnlp/tweeteval) (`train_text.txt`, `train_labels.txt`, `val_text.txt`, `...`), you do not need to change anything in the code.
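As a minimal sketch of that layout, the helper below writes a list of `(text, label)` pairs into `{split}_text.txt` and `{split}_labels.txt`, one example per line. The function name `write_split`, the temporary directory, and the sample rows are hypothetical; only the file naming and the one-integer-label-per-line format come from this notebook.

```python
import tempfile
from pathlib import Path

def write_split(out_dir, split, examples):
    """Write (text, label) pairs as {split}_text.txt and {split}_labels.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # One example per line, so whitespace (including newlines) inside a text
    # is collapsed to single spaces first.
    texts = [" ".join(text.split()) for text, _ in examples]
    labels = [str(int(label)) for _, label in examples]
    # No trailing newline, so splitting the file on '\n' yields exactly
    # one entry per example.
    (out_dir / f"{split}_text.txt").write_text("\n".join(texts), encoding="utf-8")
    (out_dir / f"{split}_labels.txt").write_text("\n".join(labels), encoding="utf-8")

# Made-up sample rows, written to a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
write_split(demo_dir, "train", [("great day\ntoday", 1), ("not my thing", 0)])
print((demo_dir / "train_text.txt").read_text(encoding="utf-8"))
```

Calling the helper once per split (`train`, `val`, `test`) in the working directory would produce the six files the loading cell reads.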

---



In [4]:
# Dataset files covering all 8 UMSAB languages

files = """test_labels.txt
test_text.txt
train_labels.txt
train_text.txt
val_labels.txt
val_text.txt""".split('\n')

#for f in files:
#  p = f"https://raw.githubusercontent.com/jessicajhassibi/Bachelor-Thesis/main/notebooks/datasets/{f}"
#  !wget $p


In [5]:
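dataset_dict = {}
for split in ['train', 'val', 'test']:
  dataset_dict[split] = {}
  for field in ['text', 'labels']:
    # Read one example per line; dropping trailing newlines avoids an empty last entry.
    with open(f"{split}_{field}.txt", encoding="utf-8") as f:
      dataset_dict[split][field] = f.read().rstrip('\n').split('\n')
    if field == 'labels':
      dataset_dict[split][field] = [int(x) for x in dataset_dict[split][field]]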

if MAX_TRAINING_EXAMPLES > 0:
  dataset_dict['train']['text']=dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES]
  dataset_dict['train']['labels']=dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES]
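Because texts and labels live in separate files, it can help to check that each split has as many labels as texts and to look at the label counts before training. The sketch below is a hypothetical helper (`summarize_split` is not part of the notebook) and runs on a small stand-in dictionary with the same shape as `dataset_dict`.

```python
from collections import Counter

def summarize_split(split):
    """Return the label counts of one split after checking text/label alignment."""
    texts, labels = split["text"], split["labels"]
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return Counter(labels)

# Stand-in data with the same structure as dataset_dict (values are made up).
toy_dataset_dict = {
    "train": {"text": ["a", "b", "c"], "labels": [0, 0, 1]},
    "val": {"text": ["d"], "labels": [1]},
}
for name, split in toy_dataset_dict.items():
    print(name, dict(summarize_split(split)))
```

Running the same loop over `dataset_dict` would surface a mismatched file pair or a heavily skewed label distribution early.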

In [6]:
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True, model_max_length=512)  # model_max_length caps sequences when truncation=True


In [7]:
train_encodings = tokenizer(dataset_dict['train']['text'], truncation=True, padding=True)
val_encodings = tokenizer(dataset_dict['val']['text'], truncation=True, padding=True)
test_encodings = tokenizer(dataset_dict['test']['text'], truncation=True, padding=True)
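To make the two tokenizer flags concrete: `truncation=True` cuts each sequence to the maximum length, and `padding=True` pads every sequence in the call to the longest one and returns an attention mask that marks the real tokens. The toy function below imitates that behavior on plain lists of token ids. `pad_and_truncate` is a hypothetical name, the ids are made up, and this only illustrates the idea; the real tokenizer also takes care of special tokens, among other things.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to max_length, then right-pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(batch["input_ids"])       # [[101, 7, 8, 9], [101, 7, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding a whole split to its longest text is simple, but it can waste memory when a few texts are much longer than the rest.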

In [8]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MyDataset(train_encodings, dataset_dict['train']['labels'])
val_dataset = MyDataset(val_encodings, dataset_dict['val']['labels'])
test_dataset = MyDataset(test_encodings, dataset_dict['test']['labels'])

In [9]:
# Print the first training sentence.
#print(' Original: ', dataset_dict['train']['text'][0])

# Print the sentence split into tokens.
#print('Tokenized: ', tokenizer.tokenize(dataset_dict['train']['text'][0]))

# Print the sentence mapped to token ids.
#print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(dataset_dict['train']['text'][0])))




## Fine-tuning

The steps above prepared the datasets in the format the trainer expects. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments` (or `TFTrainingArguments` for TensorFlow) and
instantiate a `Trainer` (or `TFTrainer`). This notebook uses the PyTorch classes.
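One setting in this section worth a closer look is `warmup_steps`. Assuming the `Trainer` default linear scheduler, the learning rate rises linearly from 0 to its peak during warmup and then decays linearly to 0 at the last step. The sketch below reimplements that multiplier in plain Python (`linear_schedule_factor` is a hypothetical name, not a library function) with the values used in this notebook: a peak of 5e-5, 100 warmup steps, and the 180 optimization steps reported in the training log.

```python
def linear_schedule_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # After warmup: decay linearly from 1 down to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

PEAK_LR = 5e-5       # LR in the Parameters section
WARMUP_STEPS = 100   # warmup_steps in the training arguments
TOTAL_STEPS = 180    # "Total optimization steps" in the training log

for step in (0, 50, 100, 140, 180):
    lr = PEAK_LR * linear_schedule_factor(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With these numbers, more than half of the run is spent warming up, so on a dataset this small a lower `warmup_steps` may be worth trying.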

In [32]:
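training_args = TrainingArguments(
    output_dir='./results',                   # output directory
    num_train_epochs=EPOCHS,                  # total number of training epochs
    learning_rate=LR,                         # peak learning rate (LR was defined above but not passed before)
    evaluation_strategy="epoch",              # evaluate on the validation set after every epoch
    logging_strategy="epoch",                 # log the training loss after every epoch
    save_strategy="epoch",                    # save a checkpoint after every epoch
    per_device_train_batch_size=BATCH_SIZE,   # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,    # batch size for evaluation
    full_determinism=True,                    # request reproducible runs
    warmup_steps=100,                         # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                        # strength of weight decay
    logging_dir='./logs',                     # directory for storing logs
    logging_steps=10,                         # log interval in steps (only used when logging_strategy="steps")
    load_best_model_at_end=True,              # reload the best checkpoint (lowest validation loss by default) at the end
)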

num_labels = len(set(dataset_dict["train"]["labels"]))
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)


In [33]:
trainer = Trainer(
    model=model,                              # the instantiated 🤗 Transformers model to be trained
    args=training_args,                       # training arguments, defined above
    train_dataset=train_dataset,              # training dataset
    eval_dataset=val_dataset                  # evaluation dataset
)

trainer.train()

***** Running training *****
  Num examples = 138
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 180
  Number of trainable parameters = 108311810


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.7745 | 0.690205 |
| 2 | 0.6392 | 0.508769 |
| 3 | 0.5525 | 0.421013 |
| 4 | 0.4957 | 0.361978 |
| 5 | 0.4486 | 0.342787 |
| 6 | 0.2785 | 0.163117 |
| 7 | 0.1542 | 0.24573 |
| 8 | 0.0702 | 0.138935 |
| 9 | 0.013 | 0.133979 |
| 10 | 0.0094 | 0.238305 |


***** Running Evaluation *****
  Num examples = 25
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-9
Configuration saved in ./results/checkpoint-9/config.json
Model weights saved in ./results/checkpoint-9/pytorch_model.bin

TrainOutput(global_step=180, training_loss=0.1721134700734789, metrics={'train_runtime': 139.0084, 'train_samples_per_second': 19.855, 'train_steps_per_second': 1.295, 'total_flos': 726186512793600.0, 'train_loss': 0.1721134700734789, 'epoch': 20.0})
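The step counts in the log follow from the dataset size. With 138 training examples and a batch size of 16 on a single device without gradient accumulation, each epoch takes ceil(138 / 16) = 9 optimizer steps, which is why the first checkpoint is `checkpoint-9`, and 20 epochs give 180 steps. A small worked check (the helper name `optimization_steps` is hypothetical, and it assumes the last partial batch is kept):

```python
import math

def optimization_steps(num_examples, batch_size, epochs):
    """Steps per epoch and in total for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = optimization_steps(num_examples=138, batch_size=16, epochs=20)
print(per_epoch, total)  # 9 180
```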

In [12]:
#trainer.save_model("./results/best_model") # save best model

## Evaluate on Test set

In [34]:
test_preds_raw, test_labels, _ = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_raw, axis=-1)
print(classification_report(test_labels, test_preds, digits=2))

***** Running Prediction *****
  Num examples = 29
  Batch size = 16


              precision    recall  f1-score   support

           0       0.96      0.96      0.96        27
           1       0.50      0.50      0.50         2

    accuracy                           0.93        29
   macro avg       0.73      0.73      0.73        29
weighted avg       0.93      0.93      0.93        29
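The gap between the macro and weighted averages comes from class imbalance: only 2 of the 29 test examples belong to class 1. The sketch below recomputes the report's summary numbers from a confusion matrix. The matrix is not printed by the notebook; it is inferred here from the per-class precision, recall, and support in the report, so treat it as an assumption.

```python
# Rows are true classes, columns are predicted classes.
# Inferred from the classification report (an assumption, not notebook output).
confusion = [[26, 1],
             [1, 1]]

def per_class_scores(matrix, cls):
    """Return (precision, recall, f1, support) for one class."""
    tp = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum
    support = sum(matrix[cls])                   # row sum
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

scores = [per_class_scores(confusion, c) for c in range(len(confusion))]
total = sum(s[3] for s in scores)
accuracy = sum(confusion[c][c] for c in range(len(confusion))) / total
macro_f1 = sum(s[2] for s in scores) / len(scores)          # every class counts equally
weighted_f1 = sum(s[2] * s[3] for s in scores) / total      # classes weighted by support

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
# accuracy=0.93 macro_f1=0.73 weighted_f1=0.93
```

With only two class-1 examples, each one accounts for 0.5 of that class's recall, so the macro average is very sensitive to individual predictions.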



<a id='ft_native'></a>