<a href="https://colab.research.google.com/github/jessicajhassibi/Bachelor-Thesis/blob/main/notebooks/Finetuning_BERT_and_DEMO_composers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --upgrade pip
!pip install sentencepiece
!pip install datasets
!pip install transformers


# Fine-tuning XLM-T

This notebook describes a simple case of fine-tuning. You can fine-tune either the `XLM-T` language model or XLM-T sentiment, which has already been fine-tuned on sentiment analysis data in 8 languages (this could be useful for sentiment transfer learning to new languages). In the configuration below, `MODEL` is set to `bert-base-multilingual-cased`.

This notebook was modified from https://huggingface.co/transformers/custom_datasets.html

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

import numpy as np
from sklearn.metrics import classification_report

## Parameters

In [None]:
LR = 3e-5
EPOCHS = 5
BATCH_SIZE = 16
MODEL = "bert-base-multilingual-cased" # use this to finetune the language model
#MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment" # use this to finetune the sentiment classifier
MAX_TRAINING_EXAMPLES = -1 # set this to -1 if you want to use the whole training set

## Data

We download the XLM-T sentiment dataset (`UMSAB`), but you can use your own.
If you use the same file structure as [TweetEval](https://github.com/cardiffnlp/tweeteval) (`train_text.txt`, `train_labels.txt`, `val_text.txt`, `...`), you do not need to change anything in the code.
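
As a minimal sketch of that file layout, the snippet below writes a tiny made-up dataset into a temporary directory and reads it back, one example per line. The example sentences, the labels, and the helper names `write_split` and `read_split` are hypothetical; only the `<split>_text.txt` / `<split>_labels.txt` naming comes from the description above.

```python
import os
import tempfile

def write_split(folder, split, texts, labels):
    """Write one split as <split>_text.txt and <split>_labels.txt, one example per line."""
    with open(os.path.join(folder, f"{split}_text.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(texts))
    with open(os.path.join(folder, f"{split}_labels.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(str(label) for label in labels))

def read_split(folder, split):
    """Read a split back; line i of the text file pairs with line i of the labels file."""
    with open(os.path.join(folder, f"{split}_text.txt"), encoding="utf-8") as f:
        texts = f.read().split("\n")
    with open(os.path.join(folder, f"{split}_labels.txt"), encoding="utf-8") as f:
        labels = [int(x) for x in f.read().split("\n")]
    return texts, labels

# Hypothetical toy data, for illustration only
toy_texts = ["first example sentence", "second example sentence"]
toy_labels = [0, 1]

with tempfile.TemporaryDirectory() as tmp:
    write_split(tmp, "train", toy_texts, toy_labels)
    texts, labels = read_split(tmp, "train")

print(list(zip(texts, labels)))
```

The sketch writes the files without a trailing newline. A trailing newline would produce an empty last line, which `int()` cannot parse.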

---



In [None]:
# load the dataset files (all 8 UMSAB languages)

files = """test_labels.txt
test_text.txt
train_labels.txt
train_text.txt
val_labels.txt
val_text.txt""".split('\n')

# Uncomment to download the files (the `!` shell escape requires IPython/Colab):
#for f in files:
#  p = f"https://raw.githubusercontent.com/jessicajhassibi/Bachelor-Thesis/main/notebooks/datasets/{f}"
#  !wget $p


In [None]:
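dataset_dict = {}
for split in ['train', 'val', 'test']:
  dataset_dict[split] = {}
  for kind in ['text', 'labels']:
    # one example per line; the context manager closes the file after reading
    with open(f"{split}_{kind}.txt", encoding="utf-8") as f:
      lines = f.read().split('\n')
    if kind == 'labels':
      # labels are stored as integers, one per line
      lines = [int(x) for x in lines]
    dataset_dict[split][kind] = lines

# optionally cap the number of training examples
if MAX_TRAINING_EXAMPLES > 0:
  dataset_dict['train']['text'] = dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES]
  dataset_dict['train']['labels'] = dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES]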

In [None]:
# model_max_length is the tokenizer option that caps sequence length when truncation=True
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True, model_max_length=512)


In [None]:
train_encodings = tokenizer(dataset_dict['train']['text'], truncation=True, padding=True)
val_encodings = tokenizer(dataset_dict['val']['text'], truncation=True, padding=True)
test_encodings = tokenizer(dataset_dict['test']['text'], truncation=True, padding=True)
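
To make `truncation=True` and `padding=True` concrete, here is a toy sketch of what the two options do to a batch of token ids. It is not the tokenizer's implementation: the function `pad_and_truncate`, the token ids, the small `max_len`, and `pad_id=0` are all assumptions made for illustration.

```python
def pad_and_truncate(batch, max_len, pad_id=0):
    """Cut each sequence to max_len, then pad all to the longest remaining length."""
    truncated = [ids[:max_len] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three sentences of different length
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
encoded = pad_and_truncate(toy_batch, max_len=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

A real tokenizer also keeps special tokens, such as the final separator, when it truncates; this sketch ignores that detail.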

In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MyDataset(train_encodings, dataset_dict['train']['labels'])
val_dataset = MyDataset(val_encodings, dataset_dict['val']['labels'])
test_dataset = MyDataset(test_encodings, dataset_dict['test']['labels'])
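
The indexing logic of `MyDataset` can be tried without PyTorch. The sketch below uses a plain class, `ToyDataset` (a hypothetical stand-in), with made-up encodings; it keeps Python lists where `MyDataset` creates tensors.

```python
class ToyDataset:
    """Torch-free stand-in for MyDataset: same indexing, lists instead of tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of columns, e.g. {"input_ids": [[...], [...]]}
        self.labels = labels        # one label per example

    def __getitem__(self, idx):
        # pick row idx from every column, then attach the label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# Hypothetical encodings for two examples
toy_encodings = {
    "input_ids": [[101, 7, 102, 0], [101, 8, 9, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy = ToyDataset(toy_encodings, labels=[0, 1])
print(len(toy), toy[1])
```

Each item is a dictionary with one entry per encoding column plus a `labels` entry, which is the per-example format the `Trainer` consumes.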

## Fine-tuning

The steps above prepared the datasets in the format that the trainer expects. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments`, and
instantiate a `Trainer`/`TFTrainer`.

In [None]:
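# Note: LR from the Parameters section is not passed here, so the Trainer's
# default learning rate (5e-5) is used; add learning_rate=LR to apply it.
training_args = TrainingArguments(
    output_dir='./results',                   # output directory
    num_train_epochs=EPOCHS,                  # total number of training epochs
    evaluation_strategy="epoch",              # evaluate on the validation set after every epoch
    logging_strategy="epoch",                 # log after every epoch
    save_strategy="epoch",                    # save a checkpoint after every epoch
    per_device_train_batch_size=BATCH_SIZE,   # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,    # batch size for evaluation
    full_determinism=True,                    # enable deterministic behavior for reproducibility
    warmup_steps=100,                         # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                        # strength of weight decay
    logging_dir='./logs',                     # directory for storing logs
    logging_steps=10,                         # only used when logging_strategy="steps"
    load_best_model_at_end=True,              # load the best checkpoint at the end of training
)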

num_labels = len(set(dataset_dict["train"]["labels"]))
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)
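
With `warmup_steps=100`, it helps to see how the learning rate evolves over the run. The sketch below reproduces the shape of a linear schedule with warmup, which is the default scheduler type of `TrainingArguments`. Two values are assumptions: `TOTAL_STEPS = 105` is the step count reported in the training log of the run recorded in this notebook, and `BASE_LR = 5e-5` is the library default, which applies because `LR` is not passed to `TrainingArguments`. The function name `linear_warmup_lr` is made up.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5       # assumed: library default learning rate
WARMUP_STEPS = 100   # from the TrainingArguments above
TOTAL_STEPS = 105    # assumed: value reported in the training log

for step in (0, 50, 100, 103, 105):
    print(step, linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```

Under these assumptions the rate reaches its peak only at step 100 and then drops to zero within the last 5 steps, so almost the whole run is warmup. A smaller `warmup_steps` may suit a dataset this small.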


In [None]:
trainer = Trainer(
    model=model,                              # the instantiated 🤗 Transformers model to be trained
    args=training_args,                       # training arguments, defined above
    train_dataset=train_dataset,              # training dataset
    eval_dataset=val_dataset                  # evaluation dataset
)

trainer.train()

***** Running training *****
  Num examples = 330
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 105
  Number of trainable parameters = 177854978


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.6267        | 0.591278        |
| 2     | 0.4775        | 0.602941        |
| 3     | 0.4276        | 0.516658        |
| 4     | 0.3558        | 0.723555        |
| 5     | 0.2294        | 0.340564        |


***** Running Evaluation *****
  Num examples = 57
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-21
Configuration saved in ./results/checkpoint-21/config.json
Model weights saved in ./results/checkpoint-21/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 57
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-42
Configuration saved in ./results/checkpoint-42/config.json
Model weights saved in ./results/checkpoint-42/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 57
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-63
Configuration saved in ./results/checkpoint-63/config.json
Model weights saved in ./results/checkpoint-63/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 57
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-84
Configuration saved in ./results/checkpoint-84/config.json
Model weights saved in ./results/checkpoint-84/pytorch_model.bin

TrainOutput(global_step=105, training_loss=0.42340755008515857, metrics={'train_runtime': 75.5582, 'train_samples_per_second': 21.837, 'train_steps_per_second': 1.39, 'total_flos': 434133241344000.0, 'train_loss': 0.42340755008515857, 'epoch': 5.0})
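
The step counts in the log follow from the dataset size and the batch size. A short sketch of the arithmetic, using the values printed above (330 training examples, batch size 16, 5 epochs) and assuming a single device without gradient accumulation:

```python
import math

num_examples = 330   # "Num examples" in the training log
batch_size = 16      # BATCH_SIZE
epochs = 5           # EPOCHS

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs
# with save_strategy="epoch", a checkpoint is written at the end of every epoch
checkpoint_steps = [steps_per_epoch * e for e in range(1, epochs + 1)]

print(steps_per_epoch, total_steps, checkpoint_steps)
```

This matches `Total optimization steps = 105` and the checkpoint directory names such as `checkpoint-21` and `checkpoint-42` in the log.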

In [None]:
trainer.save_model("./results/best_model") # save best model

Saving model checkpoint to ./results/best_model
Configuration saved in ./results/best_model/config.json
Model weights saved in ./results/best_model/pytorch_model.bin


## Evaluate on the test set

In [None]:
test_preds_raw, test_labels, _ = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_raw, axis=-1)
print(classification_report(test_labels, test_preds, digits=2))

***** Running Prediction *****
  Num examples = 69
  Batch size = 16


              precision    recall  f1-score   support

           0       0.95      0.97      0.96        58
           1       0.80      0.73      0.76        11

    accuracy                           0.93        69
   macro avg       0.87      0.85      0.86        69
weighted avg       0.93      0.93      0.93        69
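
The figures in the report can be related to raw counts. The sketch below computes precision, recall, and F1 from a confusion matrix. The counts used here are inferred so that they reproduce the rounded report above; they are not printed by the notebook and should be read as an assumption. The helper name `prf` is made up.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed confusion matrix (rows = true label, columns = predicted label),
# inferred from the rounded report, not taken from the notebook output
confusion = [[56, 2],
             [3, 8]]
total = sum(sum(row) for row in confusion)

# class 0: tp = 56, fp = 3 (true 1 predicted as 0), fn = 2 (true 0 predicted as 1)
p0, r0, f1_0 = prf(tp=confusion[0][0], fp=confusion[1][0], fn=confusion[0][1])
# class 1: tp = 8, fp = 2 (true 0 predicted as 1), fn = 3 (true 1 predicted as 0)
p1, r1, f1_1 = prf(tp=confusion[1][1], fp=confusion[0][1], fn=confusion[1][0])
accuracy = (confusion[0][0] + confusion[1][1]) / total
macro_f1 = (f1_0 + f1_1) / 2

print(f"class 0: {p0:.2f} {r0:.2f} {f1_0:.2f}")
print(f"class 1: {p1:.2f} {r1:.2f} {f1_1:.2f}")
print(f"accuracy: {accuracy:.2f}, macro F1: {macro_f1:.2f}")
```

The macro average weights both classes equally, which is why it sits below the weighted average when the minority class (11 of 69 examples) scores lower.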



In [28]:
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# Preprocess text (replace links with a placeholder)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        # replace URLs with the placeholder 'http'
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

config = AutoConfig.from_pretrained("/content/results/best_model")
model = AutoModelForSequenceClassification.from_pretrained("/content/results/best_model")

text = """
Wilhelm Richard Wagner (* 22. Mai 1813 in Leipzig; † 13. Februar 1883 in Venedig) war ein deutscher Komponist, Dramatiker, Dichter, Schriftsteller, Theaterregisseur und Dirigent. \
Mit seinen durchkomponierten Musikdramen gilt er als einer der bedeutendsten Komponisten der Romantik. In der Zeit des Nationalsozialismus wurde Wagners Werk zum Staatskult erhoben. \
Wagner beschäftigte sich intensiv mit Stoffen der germanischen Mythologie und Sagenwelt wie dem Schwanenritter, der Nibelungensage und dem Heiligen Gral als Teil der Artus-Sage. \
In Lohengrin, der Ring-Tetralogie und dem Spätwerk Parsifal kreisen seine Gedanken um das Motiv der Erlösung, das bereits im Fliegenden Holländer eine zentrale Rolle spielt
"""
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
print("################################################################")
print("Would the composer described in the following text rather be")
print("considered as 0 = 'entartet' or 1 = 'gottbegnadet' by the nazis?")
print("################################################################")
print("Wilhelm Richard Wagner (* 22. Mai 1813 in Leipzig; † 13. Februar 1883 in Venedig) war ein deutscher Komponist, Dramatiker, Dichter, Schriftsteller, Theaterregisseur und Dirigent.")
print("Mit seinen durchkomponierten Musikdramen gilt er als einer der bedeutendsten Komponisten der Romantik. In der Zeit des Nationalsozialismus wurde Wagners Werk zum Staatskult erhoben.")
print("Wagner beschäftigte sich intensiv mit Stoffen der germanischen Mythologie und Sagenwelt wie dem Schwanenritter, der Nibelungensage und dem Heiligen Gral als Teil der Artus-Sage.")
print("In Lohengrin, der Ring-Tetralogie und dem Spätwerk Parsifal kreisen seine Gedanken um das Motiv der Erlösung, das bereits im Fliegenden Holländer eine zentrale Rolle spielt")
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
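
The scoring step above applies a softmax to the model's logits and sorts the classes by probability. A dependency-free sketch of the same idea follows; the logits are made up and are not the model's output.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-1.5, 2.2]  # hypothetical logits for LABEL_0 and LABEL_1
probs = softmax(toy_logits)

# indices sorted by descending probability, like np.argsort(scores)[::-1]
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}) LABEL_{idx} {round(probs[idx], 4)}")
```

Because the probabilities sum to 1, the two scores of a binary classifier are complements of each other.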

loading configuration file /content/results/best_model/config.json
Model config BertConfig {
  "_name_or_path": "/content/results/best_model",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 119547
}

################################################################
Would the composer described in the following text rather be
considered as 0 = 'entartet' or 1 = 'gottbegnadet' by the nazis?
################################################################
Wilhelm Richard Wagner (* 22. Mai 1813 in Leipzig; † 13. Februar 1883 in Venedig) war ein deutscher Komponist, Dramatiker, Dichter, Schriftsteller, Theaterregisseur und Dirigent.
Mit seinen durchkomponierten Musikdramen gilt er als einer der bedeutendsten Komponisten der Romantik. In der Zeit des Nationalsozialismus wurde Wagners Werk zum Staatskult erhoben.
Wagner beschäftigte sich intensiv mit Stoffen der germanischen Mythologie und Sagenwelt wie dem Schwanenritter, der Nibelungensage und dem Heiligen Gral als Teil der Artus-Sage.
In Lohengrin, der Ring-Tetralogie und dem Spätwerk Parsifal kreisen seine Gedanken um das Motiv der Erlösung, das bereits im Fliegenden Holländer eine zentrale Rolle spielt
1) LABEL_1 0.9757
2) LABEL_0 0.0

<a id='ft_native'></a>