<a href="https://colab.research.google.com/github/Stephan-Linzbach/A-practical-guide-to-multilingual-large-language-model-RoBERTa-classification-/blob/main/textclassification_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A practical guide to multilingual large language  model (RoBERTa) classification

This step-by-step tutorial provides an accessible introduction to customizing (fine-tuning) a pre-trained multilingual language model (RoBERTa) for text classification tasks. It demonstrates how to use the model's existing knowledge to classify text accurately, even with a small set of labeled examples. It takes input as JSON files with text documents and their corresponding labels for training, validating and testing. It covers using specialized models for English, German, and French while employing XLM-RoBERTa for over 100 additional languages.

Relevant References for Further Reading:
- Unsupervised Cross-lingual Representation Learning at Scale
  - https://arxiv.org/pdf/1911.02116
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  - https://arxiv.org/pdf/1907.11692
- CamemBERT: a Tasty French Language Model
  - https://arxiv.org/pdf/1911.03894
- WECHSEL: Effective initialization of subword embeddings for
cross-lingual transfer of monolingual language models
  - https://aclanthology.org/2022.naacl-main.293.pdf
- Sharpness-Aware Minimization for Efficiently Improving Generalization
  - https://arxiv.org/pdf/2010.01412

# Learning Objectives

This tutorial has the following learning objectives:
-	Learning how to work with large language models (RoBERTa)
-	Customizing (fine-tuning) a large language model for a text classification task in any language (100+ languages supported)
-	Low-resource learning (with only few hundred examples) using the SAM optimizer


# Target Audience
-	Social scientists willing to learn about using large language models with basic prior understanding of it
-	Social scientists with expertise in large-language models, interested in fine-tuning for multiple languages from only few examples.
-	Computer scientists interested in learning about how large-language models are used for social text classification.
-	Advanced NLP researchers and professors looking for tutorials that can help their students in learning new topics.


# Prerequisites
Use this tutorial preferably in google colab, as the setup depends on the pre-installed packages of the colab environment.

#Environment Setup
Run the cells below:

In [1]:
!pip install transformers
!wget https://raw.githubusercontent.com/davda54/sam/main/sam.py

--2025-01-22 11:52:53--  https://raw.githubusercontent.com/davda54/sam/main/sam.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2484 (2.4K) [text/plain]
Saving to: ‘sam.py’


2025-01-22 11:52:53 (33.9 MB/s) - ‘sam.py’ saved [2484/2484]



In [2]:
import json
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import get_cosine_schedule_with_warmup
from torch.utils.data import DataLoader
from sam import SAM
import shutil
import torch

In [3]:
## Utils


def convert_in_output_size(labels, mapping):
    label_resized = []
    for l in labels:
        tmp_l = torch.tensor([1 if k in l else 0 for k in mapping])
        label_resized.append(tmp_l)
    label_resized = torch.stack(label_resized, dim=0)
    return label_resized
def convert_labels(labels, mapping):
    if isinstance(labels[0], list):
        labels = convert_in_output_size(labels, mapping)
    return torch.tensor([mapping[l] for l in labels])

def flatten_list(list_to_flatten):
  """
  Returns one list from a list of lists.
  """
  return [x for xs in list_to_flatten for x in xs]

def infer_output_size(data):
  """
  Returns the number of possible labels and the possible labels.
  """
  labels = data['Labels']
  if isinstance(labels[0], list):
    labels = flatten_list(labels)
  labels = set(labels)
  return len(labels), labels

def generate_dataloader(text, y, batch_size, workers=1):
  """
  Returns a dataloader with input_ids and attention_mask to process the text.
  """
  attention_mask = text['attention_mask']
  input_ids = text['input_ids']

  dataset = list(zip(input_ids, attention_mask, y))

  dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=workers)

  return dataloader

# Tutorial Content





## 1) Introduction

We will start with the most pressing questions:
  - What exactly is a **text-classification**?
  - What is a **pre-trained language model**?
  - What even is **fine-tuning**?

We will answer all questions in the following text.

In **text-classification** we try to assign a property to a text.

For example we are interested in classifying texts that are about fruits.
We could easily find a dictionary with all fruits (e.g.: 'Apple', 'Banana', 'Pear' etc.) everytime we recognize such a word in a text we know this text is about fruits, right?
However, this might not be true all the time for example "Apple designed the new pencil pro." is not about the fruit 'Apple' although we would recognize it as such with our dictionary approach.
Furthermore, this would only work for the language in the dictionary. However, our tutorial is helpful for 100+ languages.
So the context of the word might be helpful (more on this later).
Classification is obviously transferable to more than just fruits.
People try to classify the sentiment of a text, the stance towards an entity expressed in a text, the topic of a text, the expressed emotion in a text, and many many more.

## 2) Data Preparation

Let's talk data:
In order to make this script work you have to save three dictionaries in this structure in the file '<current_folder>/(train|val|test).json':

  ```python
  {'Text' : [list of texts]
   'Labels': [list of labels]}
  ```

 Each text document in the data should have a corresponding label such that:

  ```python
  length([list of texts]) == length([list of labels])
  ```

Example:
 ```python
 {'Text': ['Yesterday i ate an apple.', 'Yesterday I crashed my Apple.'],
  'Labels': ['about_fruit', 'not_about_fruit']}
  ```

The *train data* is used to teach the model, the *val data* is required to validate if the model understands the train data correctly and the *test data* is used to proof the capabilities of the final version of the model on the **unseen** *test set*.


You can classify text into one category (single label classification) or several categories per text (multilabel classification).


Is your data ready? Then lets start.

In [4]:
try:
  with open("./train.json") as f:
    train = json.load(f)

  with open("./val.json") as f:
    val = json.load(f)

  with open("./test.json") as f:
    test = json.load(f)
except:
  #Dummy Values
  train = {'Text': ["An Apple is a Fruit!", "An Apple is not a Fruit!", "An Apple has no seeds."]*64,
          'Labels': ['is_correct', 'is_incorrect', 'is_incorrect']*64}
  val = {'Text': ["An Apple is not a Fruit!", "An Apple is a Fruit!", "An Apple has no seeds."]*64,
          'Labels': ['is_incorrect', 'is_correct', 'is_incorrect']*64}
  test = {'Text': ["An Apple is not a Fruit!", "An Apple is has no seeds.", "An Apple is a Fruit."]*64,
          'Labels': ['is_incorrect', 'is_incorrect', 'is_correct']*64}


assert len(train['Text']) == len(train['Labels']), "Number of texts does not match number of labels for train data!"
assert len(val['Text']) == len(val['Labels']), "Number of texts does not match number of labels for val data!"
assert len(test['Text']) == len(test['Labels']), "Number of texts does not match number of labels for test data!"

print(f"We loaded the train data with {len(train['Text'])} texts and {len(train['Labels'])} labels,")
print(f"the validation data with {len(val['Text'])} texts and {len(val['Labels'])} labels")
print(f"and the test data with {len(test['Text'])} texts and {len(test['Labels'])} labels.")

We loaded the train data with 192 texts and 192 labels,
the validation data with 192 texts and 192 labels
and the test data with 192 texts and 192 labels.


Great the data is ready!

### Understanding the Data
You have to answer some questions about your data.



#### Finding a Language Specific Language Model

In which language is your text data written?

In [5]:
language = input("Which language do you use? ('english', 'german', 'french') ")

print(f"MMhhh interesting your data is written in {language}. Let's load a fitting PLM!")

if language == 'english':
  model_name = "roberta-base"
elif language == 'german':
  model_name = "benjamin/roberta-base-wechsel-german"
elif language == 'french':
  model_name = "camembert-base"
else:
  print(f"Seems like we have no model available for {language}.")
  print("We will load a mulitlingual language model. It knows text from 100 languages.")
  model_name = 'xlm-roberta-base'

print(f"We loaded {model_name} for {language}.")

Which language do you use? ('english', 'german', 'french') english
MMhhh interesting your data is written in english. Let's load a fitting PLM!
We loaded roberta-base for english.


Ok now that we talked about the language of your data you might be intersted what the ``model_name`` stands for.

These are **pre-trained language models** ready to be used with your specific language.
These language models already learned to understand language by solving a huge cloze-text written in the particular language.

This cloze-text is constructed over Wikipedia or other huge text datasets.

Now that we understood what **text classification** and **pre-trained language model** are.

We can now talk about the last question: What is **fine-tuning**?

As you might imagine solving a cloze-text over the whole internet makes you knowledgeable but not an expert in a field. We now want to transform our **pre-trained language model** into an expert for your task.


## 3) Defining the Classification Task


### Choosing vs. Deciding

Before starting, it’s essential to clarify what type of classification task you want to perform. We distinguish between two main tasks:

#### 1. Choosing (Single-Label Classification)
- **Example:** *What is your favorite fruit?*
- You select one correct label from a list of possible options.

#### 2. Deciding (Multi-Label Classification)
- **Example:** *Do you like apples?*
- You evaluate each label independently and decide whether it applies.


In our dataset, we have **single labels**, so we will insert `'choosing'` as the classification type. Here's the code for defining your task:

In [6]:
# Prompt the user to select the classification type
decision_type = input("Do you want to classify by 'choosing' or 'deciding'? ").strip().lower()

# Validate the input
assert decision_type in ['choosing', 'deciding'], "Invalid input! Please enter 'choosing' or 'deciding'."

# Ensure the labels align with the selected classification type
if decision_type == 'deciding':
    assert isinstance(train['Labels'][0], list), (
        "For 'deciding', labels should be a list (e.g., ['apple', 'banana'])."
    )
else:
    assert not isinstance(train['Labels'][0], list), (
        "For 'choosing', each label should be a single value (e.g., 'apple')."
    )

Do you want to classify by 'choosing' or 'deciding'? choosing


Based on your ``decision_type`` we will now choose the correct loss function and the decision function.
The loss function tells the model how well it achieved your task.
The decision function tells us how to convert the model guesses in the actual decision.

In [7]:
import torch

losses = {'deciding': "multi_label_classification", #torch.nn.CrossEntropyLoss(),
          'choosing': "single_label_classification",}

decisions = {'deciding': lambda x: torch.where(x > 0, 1, 0),
             'choosing': lambda x: torch.argmax(x, dim=1)}

objective = losses[decision_type]
decision_function = decisions[decision_type]

The next question we need to clarify is: How many different labels are possible for your task?

Example:

Choose your favorite fruit from this list:
  ```python
  poss_labels = ['Banana', 'Apple', 'Pear', 'Peach'`]
  model_output = [0.4, 0.5, -0.1, 0.7]
  ```

Or decide if you like the particular fruit:
  ```python
  poss_labels = ['Banana', 'Apple', 'Pear', 'Peach'`]
  model_output = [0.4, 0.5, -0.1, 0.7]
  ```

In both cases we have **4** possible labels.
In the first case we choose the fruit where the model signals the biggest aggreement **'Peach'** in the second we decide for each fruit if we like it by chosing to like everything above 0. Therefore, our example output tells us that we like **'Banana'**, **'Apple'**, and **'Peach'**.

### Fixing the Number of Possible Answers:

Let's find out how many labels are possible in our data:

In [8]:
output_size = input("How many different labels are possible for your task? ")
infered_output_size_train, possible_labels_train = infer_output_size(train)
infered_output_size_val, possible_labels_val = infer_output_size(val)
infered_output_size_test, possible_labels_test = infer_output_size(test)

possible_labels = max(infered_output_size_train, infered_output_size_val, infered_output_size_test)


if not (possible_labels_train == possible_labels_val == possible_labels_test):
  print(f"Train contains {possible_labels_train} possible labels, val contains {possible_labels_val} possible labels and test contains {possible_labels_test} possible labels."\
        "This might be unwanted and is not recommended all sets should contain all possible labels.")

assert int(output_size), "Output_size needs to be a natural number."
assert int(output_size) == possible_labels, f"We infered {possible_labels} with the following possible labels {possible_labels_train}."
assert possible_labels_train == possible_labels_val == possible_labels_test, f"Make sure that train, val and test labels are equal!"

id2label = {i: k for i,k in enumerate(possible_labels_train)}
label2id = {k: i for i, k in enumerate(possible_labels_train)}

output_size = int(output_size)

print(f"Your task distinguishes {output_size} different labels. These are {possible_labels_train}")

How many different labels are possible for your task? 2
Your task distinguishes 2 different labels. These are {'is_incorrect', 'is_correct'}


## 4) Setting up the Model


Fantastic!

We are close.
We clarified the language, the objective and number of possible answers.

Lets load the needed model and tokenizer.
The tokenizer translates language into a model specific vocabulary.

In [9]:
model_config = {'pretrained_model_name_or_path': model_name,
                'num_labels': output_size,
                'problem_type': objective,
                'id2label': id2label,
                'label2id': label2id}

In [10]:
model = AutoModelForSequenceClassification.from_pretrained(**model_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

### Training Specific Settings

Now we need to specify some training specific things. Dont worry if you dont know what to change the pre-set values should already be working just fine.
Lets get first an intuition how training of a model actually works. You show a model some example data points in your case the train data. From this data the model infers helpful patterns that explain the correlation between input and output. In order to get the most out of our data we show a small amount of data each time (batch_size) and let the model learn from it. To protect the model from overestimating its understanding we restrict the learned patterns per batch by applying the learning rate (lr). We also apply a warm-up (warm-up-rate) to ensure that the model doesnt forget everything it knew from the pre-training. After all we are using **pre-trained language models** for a reason.

 Lets go through each of the values one by one:
  - the batch_size tells us: How many examples we show to the model before we deduce rules that help to solve the classification task.
  - the learning_rate (lr) tells us: How hard we want the model to commit on the recognized patterns from the batch
  - the num_epoch tells us: How often we show all the training data to the model in this case 3 times.
  - the warm_up_rate tells us: How big of a share of the training should be less impactful.
  - device tells us: What device are we using if you have special hardware available (e.g. GPU) your training runs much faster.


With the now loaded model and tokenizer we can start **fine-tuning**. \\

For now we only need the train-set to teach the model and the val-set to decide if we taught our model well.
Lets first translate our text into the models vocabulary


In [11]:
batch_size = 64
lr = 1e-4
num_epochs = 3
warm_up_rate = 0.1
num_training_steps = (len(train['Labels'])//batch_size)*num_epochs
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

In [12]:
# Padding tokenizing text aka. translating text into the model vocabulary

train_text = tokenizer([m for m in train['Text']], truncation=True, padding='longest', return_tensors='pt')
val_text = tokenizer([m for m in val['Text']], truncation=True, padding='longest', return_tensors='pt')
test_text = tokenizer([m for m in test['Text']], truncation=True, padding='longest', return_tensors='pt')

# Using the label2id mapping to convert the label strings into label ids
train_y = convert_labels(train['Labels'], label2id)
val_y = convert_labels(val['Labels'], label2id)
test_y = convert_labels(test['Labels'], label2id)

# Retrieve Dataloaders for fast iteration over the data
train_dataloader = generate_dataloader(train_text, train_y, batch_size)
val_dataloader = generate_dataloader(val_text, val_y, batch_size)
test_dataloader = generate_dataloader(test_text, test_y, batch_size)

The learning of patterns and adaptation of the model is achieved by the optimizer. In our case it is a special optimizer that keeps a model from optimizing. If you are really interested you can read more about it [here](https://github.com/davda54/sam). The scheduler adapts the learning rate according the warm_up_rate

In [13]:
# Initialize optimizer and scheduler
optimizer = SAM(model.parameters(), torch.optim.Adam, lr=lr, adaptive=True)
scheduler = get_cosine_schedule_with_warmup(optimizer = optimizer,
                                          num_warmup_steps = num_training_steps*warm_up_rate,
                                          num_training_steps = num_training_steps,
                                          last_epoch = -1)


## 5) Training

Lets start the training!
I wont go into too much detail but now we show the data to the model and optimize its parameters to succeed in the task as good as possible.
Each epoch we test our model, we always safe the best model.
Once this is done you should have a well trained model that you can load with the code of the next cell.

In [15]:
# Trainings-loop
best_loss = float('inf')
best_epoch = 0
already_trained = 0
best_model_path = ''
should_delete = True

for epoch in range(num_epochs): # Repeat num_epochs times

  for batch_idx, batch in enumerate(train_dataloader): # Train the model on the batch
    input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)

    output = model(input_ids, attention_mask, labels=y)
    loss = output.loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    scheduler.step()  # Do first step in the direction of a smaller loss
    optimizer.first_step(zero_grad=True)

    model(input_ids, attention_mask, labels=y).loss.backward()
    optimizer.second_step(zero_grad=True) # Do second step to verify the direction
    print(f"Train: Epoch {epoch}, Train step {already_trained+batch_idx}, Loss {loss}, learning_rate {scheduler.get_last_lr()[0]}", flush=True)

  already_trained += batch_idx
  val_loss = []
  for batch_idx, batch in enumerate(val_dataloader): # Validate the current state of the model on the validation data
    input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)

    with torch.no_grad():
      val_loss.append(model(input_ids, attention_mask, labels=y).loss)

  val_loss = torch.mean(torch.stack(val_loss))

  print(f"Validation: Epoch {epoch}, Train step {already_trained}, Loss {val_loss}, old best/epoch {str(best_loss)[1:6]}/{best_epoch}", flush=True)

  if val_loss < best_loss: # Safe the model if the val_loss is the best loss we have seen so far
    best_loss = val_loss.item()
    best_epoch = epoch
    if should_delete and best_model_path:
      shutil.rmtree(best_model_path)
    best_model_path = f"./my_model_epoch_{best_epoch}_val_loss_{str(val_loss.item())[1:6]}"
    model.save_pretrained(best_model_path, from_pt=True)

    print(f"**** END EPOCH {epoch} ****")


print(f"**** FINISHED TRAINING FOR N={epoch} ****")
print(f"BEST EPOCH: {best_epoch}")
print(f"BEST LOSS: {best_loss}")

Train: Epoch 0, Train step 0, Loss 0.6000726222991943, learning_rate 3.7138015365554833e-06
Train: Epoch 0, Train step 1, Loss 0.636402428150177, learning_rate 0.0
Train: Epoch 0, Train step 2, Loss 0.5635061860084534, learning_rate 3.713801536555489e-06
Validation: Epoch 0, Train step 2, Loss 0.6456333994865417, old best/epoch nf/0
**** END EPOCH 0 ****
Train: Epoch 1, Train step 2, Loss 0.6185542941093445, learning_rate 1.4303513272105045e-05
Train: Epoch 1, Train step 3, Loss 0.5257166624069214, learning_rate 3.019601169804216e-05
Train: Epoch 1, Train step 4, Loss 0.5094482898712158, learning_rate 4.903043341140879e-05
Validation: Epoch 1, Train step 4, Loss 0.9056283831596375, old best/epoch .6456/0
Train: Epoch 2, Train step 4, Loss 0.3325253427028656, learning_rate 6.800888624023553e-05
Train: Epoch 2, Train step 5, Loss 0.42875757813453674, learning_rate 8.431208189343665e-05
Train: Epoch 2, Train step 6, Loss 0.2843942642211914, learning_rate 9.551814704830734e-05
Validation: 

The training is finished now we can load the model.

In [None]:
best_model = AutoModelForSequenceClassification.from_pretrained(best_model_path)

## 6) Evaluation

Finally, with the loaded model we can now predict our results for the unseen test set to understand the models performance in more detail.

In [None]:
y_pred = []

with torch.no_grad():

  for batch_idx, batch in enumerate(test_dataloader):
    input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
    y_pred.append(model(input_ids, attention_mask, labels=y).logits)


y_pred = torch.stack(y_pred, dim=0)
y_pred = decision_function(y_pred)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_y, y_pred, target_names=label2id.keys(), zero_division=True))

              precision    recall  f1-score   support

  is_correct       1.00      0.00      0.00         1
is_incorrect       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.83      0.50      0.40         3
weighted avg       0.78      0.67      0.53         3



### Results

The classification report shows us four metric results these are the precision,
the recall, the f1-score, and the accuracy. Additionally, the report displays two different average aggregations, these are the macro avg, and the weighted average.

The *precision* tells us "When we predict a label, is it the correct label?".

The *recall* tells us "How many instances of a class do we find?".

The *f1-score* is the harmonic mean of the *precision* and the *recall*.

The *accuracy* tells us "How many of our predictions are correct?".

The *macro avg* aggregates the *f1-score* per class it tells us "How well do we classify, if all classes occur equally often.".

The *weighted avg* aggregates the *f1-score* weighted by class size it tells us "How well do we classify the complete label set.".


### Analysis
We can see that we have two classes 'is_correct' and 'is_incorrect'.
We have one instance with from the class 'is_correct' and two from the class 'is_incorrect'.
Our model does not learn to predict the class 'is_correct'.
We find all instances of the class 'is_incorrect'.
You can see this by the fact that the *recall* is 1.0.
The *precision* is at 0.67, as one instance that we predict to be 'is_incorrect' is actually an instance of the class 'is_correct'.


Contact: Stephan.Linzbach@gesis.org