# Battle of BERTs

Fine-tuning BERT models to classify a song's genre from its lyrics (in Serbian). The models compared are:

* BERT
* Multilingual BERT
* BERTić

In [1]:
import os

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

from datasets import Dataset, DatasetDict

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

## Data preparation

We load the song lyrics and assign the corresponding label to each text, then convert them into the format required by the *transformers* library.

In [2]:
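def load_data(base_path):
    """Read every .txt file under base_path/<genre>/ into {"text", "label"} records."""
    data = []
    # Each sub-folder is one genre. Its label is its position in os.listdir(),
    # whose order is arbitrary, so label IDs can differ between machines.
    for label, folder_name in enumerate(os.listdir(base_path)):
        folder_path = os.path.join(base_path, folder_name)
        if os.path.isdir(folder_path):
            print(f"Folder: {folder_path} - Label: {label}")
            for file_name in os.listdir(folder_path):
                file_path = os.path.join(folder_path, file_name)
                # Only plain-text lyric files are loaded; everything else is skipped.
                if file_name.endswith(".txt"):
                    with open(file_path, "r", encoding="utf-8") as file:
                        text = file.read().strip()
                        data.append({"text": text, "label": label})
    return data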
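Because `load_data` takes each label from the position of a folder in `os.listdir`, the genre-to-ID mapping is not guaranteed to be stable. The sketch below shows one possible way to fix the mapping by sorting the folder names first. The helper names `list_genre_folders` and `build_label_maps` are hypothetical and not part of the notebook, and alphabetical ordering would assign different IDs than the run recorded in this notebook.

```python
import os


def list_genre_folders(base_path):
    """Return the names of the genre sub-folders in sorted (stable) order."""
    return sorted(
        name
        for name in os.listdir(base_path)
        if os.path.isdir(os.path.join(base_path, name))
    )


def build_label_maps(base_path):
    """Build genre-name -> ID and ID -> genre-name dictionaries."""
    genres = list_genre_folders(base_path)
    label2id = {genre: idx for idx, genre in enumerate(genres)}
    id2label = {idx: genre for genre, idx in label2id.items()}
    return label2id, id2label
```

With such a mapping, `len(label2id)` could also replace a hard-coded class count when the classification head is created.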

In [3]:
def prepare_dataset(base_path):
    data = load_data(base_path)
    print("\nTotal data size: ", len(data))

    train_data, test_data = train_test_split(data, test_size=0.2, random_state=1244)
    print("Train data size: ", len(train_data))
    print("Test data size: ", len(test_data))

    train_dataset = Dataset.from_list(train_data)
    test_dataset = Dataset.from_list(test_data)

    dataset = DatasetDict({
        "train": train_dataset,
        "test": test_dataset
    })

    print("\nSample train data:")
    print(dataset["train"][0])
    print("\nSample test data:")
    print(dataset["test"][0])

    return dataset
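The split in `prepare_dataset` is purely random, so the share of each genre may differ between the train and test sets. As an illustration only, here is a minimal standard-library sketch of a stratified split that keeps the per-label proportions roughly equal. The function `stratified_split` is hypothetical; in scikit-learn the same idea is available through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(records, test_size=0.2, seed=1244):
    """Split records so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = list(by_label[label])
        rng.shuffle(items)
        n_test = round(len(items) * test_size)  # per-label test quota
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    # Shuffle again so examples are not grouped by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```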

## Preparing the metrics

We set up the metric computation. Here we use accuracy and the macro-averaged F1 score.

In [4]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="macro")
    return {"accuracy": acc, "f1": f1}
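To make the difference between the two metrics concrete, the following sketch computes macro F1 by hand on invented toy data (not the song dataset). The function `macro_f1` is a hypothetical stand-in for `f1_score(..., average="macro")`. It treats an undefined per-class F1 as 0, which is also what scikit-learn does by default, with a warning.

```python
def macro_f1(labels, predictions, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy example: 8 of 10 samples belong to class 0 and the model always predicts 0.
labels = [0] * 8 + [1, 2]
predictions = [0] * 10

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(round(accuracy, 3))                           # 0.8
print(round(macro_f1(labels, predictions, 3), 3))   # 0.296
```

Accuracy looks high because the majority class dominates, while macro F1 is pulled down by the two classes that are never predicted. A large gap between the two numbers is therefore a hint that some classes are handled poorly.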

## Tokenization, training, and evaluation

We prepare generic code that, for a given model, performs:
- tokenization using the model's own tokenizer
- model training
- model evaluation using the metrics defined above

In [5]:
def train_and_evaluate(model_name, dataset):
    # Load the tokenizer that matches the checkpoint being fine-tuned.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(examples):
        # Every text is cut or padded to exactly 128 tokens.
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    # Trainer expects the target column to be called "labels"; raw text is no longer needed.
    tokenized_dataset = tokenized_dataset.rename_column("label", "labels").remove_columns(["text"])
    tokenized_dataset.set_format("torch")

    # Three output classes, one per genre folder (pop, rock, folk).
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

    training_args = TrainingArguments(
        output_dir=f"./results-{model_name}",
        eval_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_dir=f"./logs-{model_name}",
        logging_steps=10
    )

    # Note: the test split doubles as the per-epoch validation set here.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        tokenizer=tokenizer,  # recent transformers releases call this argument processing_class
        compute_metrics=compute_metrics
    )

    trainer.train()
    eval_results = trainer.evaluate()
    print(f"Evaluation results for {model_name}: {eval_results}")
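The tokenizer call in `train_and_evaluate` uses `truncation=True`, `padding="max_length"` and `max_length=128`, so every song ends up as exactly 128 token IDs. The sketch below imitates that effect on plain lists of integers. It is a simplification: `pad_or_truncate` is a hypothetical helper, the padding ID of 0 is an assumption, and a real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Cut a token-ID list to max_length, or right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    return {
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


long_song = list(range(1, 201))   # 200 made-up token IDs
short_song = [5, 6, 7]

print(len(pad_or_truncate(long_song)["input_ids"]))   # 128
print(pad_or_truncate(short_song, max_length=5))
# {'input_ids': [5, 6, 7, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
```

One consequence worth keeping in mind: any lyrics beyond the first 128 tokens are discarded, so the model only sees the opening part of longer songs.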

## Applying to the models

In [6]:
dataset = prepare_dataset("data/")

Folder: data/pop - Label: 0
Folder: data/rock - Label: 1
Folder: data/folk - Label: 2

Total data size:  3632
Train data size:  2905
Test data size:  727

Sample train data:
{'text': "Ja osecam da dolazis\njer kisa odma' prestane\ni ulice ponesu osmeh\nkoji dobro znam\n\nU kosi nosi upletenu\nsvetlo znane nestane\ni prsten sa tri kamena\nmi ispustas na dlan\n\nTi hoces da te zavodim\na ne znas jezik magije\nti hoces da te osvojim\na ne znas da li znam\n\nJa hocu da te zagrlim\ni kazem tajnu najvecu\ni najlepsu od tajni\nkoje zelim da ti dam\n\nRef.\nHej, Lolita\nda l' je to ljubav ili strast\nhej, Lolita\nnad mojim srcem imas vlast\n\nDok pricas, zriju jabuke\nrasterujem im oblake\nispuni mi tri zelje\ni ispricaj mi san\n\nDa ludujem, da zavolim\ni nikad da ne ostarim\nizmenjenu istinu\nmi urezi na dlan\n\nTi hoces da te zavodim\na ne znas jezik magije\nti hoces da te osvojim\na ne znas da li znam\n\nJa hocu da te zagrlim\ni kazem tajnu najvecu\ni najlepsu od tajni\nkoje zelim da ti da

### BERT

In [7]:
train_and_evaluate("bert-base-uncased", dataset)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.8334 | 0.845804 | 0.621733 | 0.431213 |
| 2 | 0.72 | 0.791197 | 0.645117 | 0.423412 |
| 3 | 0.783 | 0.786228 | 0.682256 | 0.466816 |


Evaluation results for bert-base-uncased: {'eval_loss': 0.786227822303772, 'eval_accuracy': 0.6822558459422283, 'eval_f1': 0.4668162428540518, 'eval_runtime': 5.3123, 'eval_samples_per_second': 136.852, 'eval_steps_per_second': 8.659, 'epoch': 3.0}
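The throughput figures in the evaluation output can be cross-checked with a little arithmetic on the dataset sizes and batch size reported earlier. This is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

train_size, test_size = 2905, 727   # sizes printed by prepare_dataset
batch_size = 16                     # per_device_*_batch_size
epochs = 3                          # num_train_epochs

steps_per_epoch = math.ceil(train_size / batch_size)
total_train_steps = steps_per_epoch * epochs
eval_steps = math.ceil(test_size / batch_size)

print(steps_per_epoch)     # 182
print(total_train_steps)   # 546
print(eval_steps)          # 46

# Compare with the logged BERT run: eval_runtime = 5.3123 s.
eval_runtime = 5.3123
print(eval_steps / eval_runtime)   # close to the logged eval_steps_per_second of 8.659
print(test_size / eval_runtime)    # close to the logged eval_samples_per_second of 136.852
```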


### Multilingual BERT

In [8]:
train_and_evaluate("bert-base-multilingual-uncased", dataset)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.8075 | 0.753582 | 0.685007 | 0.465255 |
| 2 | 0.6254 | 0.701382 | 0.713893 | 0.506263 |
| 3 | 0.6162 | 0.698616 | 0.709766 | 0.518295 |


Evaluation results for bert-base-multilingual-uncased: {'eval_loss': 0.6986159086227417, 'eval_accuracy': 0.7097661623108665, 'eval_f1': 0.5182949729028036, 'eval_runtime': 5.3393, 'eval_samples_per_second': 136.16, 'eval_steps_per_second': 8.615, 'epoch': 3.0}


### BERTić

In [9]:
train_and_evaluate("classla/bcms-bertic", dataset)

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at classla/bcms-bertic and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7296 | 0.741873 | 0.701513 | 0.48732 |
| 2 | 0.5371 | 0.638469 | 0.73315 | 0.604311 |
| 3 | 0.4679 | 0.632898 | 0.753783 | 0.631405 |


Evaluation results for classla/bcms-bertic: {'eval_loss': 0.6328977942466736, 'eval_accuracy': 0.7537826685006878, 'eval_f1': 0.6314052959447503, 'eval_runtime': 4.7822, 'eval_samples_per_second': 152.023, 'eval_steps_per_second': 9.619, 'epoch': 3.0}
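For a side-by-side view, the final evaluation numbers from the three runs above can be collected and ranked. The values below are copied from the printed evaluation results and rounded to four decimals; ranking by macro F1 is just one possible choice.

```python
# Final-epoch evaluation metrics, rounded from the outputs above.
results = {
    "bert-base-uncased": {"loss": 0.7862, "accuracy": 0.6823, "f1": 0.4668},
    "bert-base-multilingual-uncased": {"loss": 0.6986, "accuracy": 0.7098, "f1": 0.5183},
    "classla/bcms-bertic": {"loss": 0.6329, "accuracy": 0.7538, "f1": 0.6314},
}

# Rank models from best to worst macro F1.
ranking = sorted(results, key=lambda name: results[name]["f1"], reverse=True)

print(f"{'model':<32} {'loss':>7} {'acc':>7} {'f1':>7}")
for name in ranking:
    m = results[name]
    print(f"{name:<32} {m['loss']:>7.4f} {m['accuracy']:>7.4f} {m['f1']:>7.4f}")

best = ranking[0]
print(best)   # classla/bcms-bertic
```

In this run BERTić scores best on all three measures, followed by multilingual BERT and then the English-only BERT.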
