<a href="https://colab.research.google.com/github/Stephan-Linzbach/RoBERTaTutorial/blob/main/TextClassificaiton_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.5/7.5 MB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m27.7 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m63.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m66.5 MB/s[0m eta [36m0:00:0

In [2]:
!wget https://raw.githubusercontent.com/davda54/sam/main/sam.py

--2023-08-22 14:42:38--  https://raw.githubusercontent.com/davda54/sam/main/sam.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2484 (2.4K) [text/plain]
Saving to: ‘sam.py’


2023-08-22 14:42:38 (41.9 MB/s) - ‘sam.py’ saved [2484/2484]



In [3]:
import json
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import get_cosine_schedule_with_warmup
from torch.utils.data import DataLoader
from sam import SAM
import shutil

In [4]:
## Utils

def flatten_list(list_to_flatten):
  return [x for xs in list_to_flatten for x in xs]

def infer_output_size(data):
  labels = data['Labels']
  if isinstance(labels[0], list):
    labels = flatten_list(labels)
  labels = set(labels)
  return len(labels), labels

def convert_in_output_size(labels, mapping):
  label_resized = []
  for l in labels:
    tmp_l = torch.tensor([1 if k in l else 0 for m in mapping])
    label_resized.append(tmp_l)
  label_resized = torch.stack(label_resized, dim=0)
  return label_resized

def convert_labels(labels, mapping):
  if isinstance(labels[0], list):
    labels = convert_in_output_size(labels, mapping)
  return torch.tensor([mapping[l] for l in labels])

def generate_dataloader(text, y, batch_size, workers=1):
    attention_mask = text['attention_mask']
    input_ids = text['input_ids']

    dataset = list(zip(input_ids, attention_mask, y))

    dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=workers)

    return dataloader

## Text Classification

Today we will look at how to **fine-tune** a **pre-trained language model** for a custom **text-classification**. \\
You might be asking yourself three things:
  - What exactly is text-classification?
  - What is a pre-trained language model?
  - What even is fine-tuning?

No worries we will answer this questions one by one as we work our way through this tutorial.

Starting probably with the most pressing question: What exactly is text-classification?
In text-classification we try to assign a property or description to a text. \\

For example we are interested in classifying texts that are about fruits.
We could easily find a dictionary with all fruits (e.g.: 'Apple', 'Banana', 'Pear' etc.) everytime we recognize such a word in a text we know this text is about fruits, right?
This might not be true all the time, for example "My Apple pencil is not working." is not about the fruit 'Apple' although we would recognize it as such with our dictionary approach. \\
So the context of the word might be helpful (more on this later). \\
Classification is obviously transferable to more than just fruits. \\
People try to classify the sentiment of a text, the stance towards an entity, the topic of a text, the expressed emotion in a text, and many many more. \\

Let's talk about data:
In order to make this script work you have to save three dictionaries in this structure in the file '<current_folder>/(train|val|test).json': \\
  ``{'Text' : [list of texts]
     'Labels': [list of labels]}`` \\

  ``length([list of texts]) equals length([list of texts])`` \\

Example 1 \\
 ``{'Text': ['Yesterday i ate an apple.', 'Yesterday I crashed my Apple.'],
     'Labels': ['about_fruit', 'not_about_fruit']}`` \\
Example 2 \\
 ``{'Text': ['Yesterday i ate an apple.', 'Yesterday I crashed my Apple.'],
     'Labels': [['past_tense', 'about_fruits'], ['past_tense', 'not_about_fruit']]}``

The train data is used to teach the model, the val data is required to verify if the model understands the train data correctly. Lastly, the test data is used to proof the capabilities of the final version of the model on the **unseen** data.

You can have one label per text (multi-class classification, Example 1) or several labels per text (multi-label classification, Example 2). \\

Is your data ready? Then lets start.

In [5]:
try:
  with open("./train.json") as f:
    train = json.load(f)

  with open("./val.json") as f:
    val = json.load(f)

  with open("./test.json") as f:
    test = json.load(f)
except:
  #Dummy Values
  train = {'Text': ["Number of texts does not match number of labels for train data!", "Number of texts does not match number of labels for val data!", "Number of texts does not match number of labels for test data!"]*64,
          'Labels': ['is_correct', 'is_incorrect', 'is_incorrect']*64}
  val = {'Text': ["Number of texts does not match number of labels for train data!", "Number of texts does not match number of labels for val data!", "Number of texts does not match number of labels for test data!"]*64,
          'Labels': ['is_incorrect', 'is_correct', 'is_incorrect']*64}
  test = {'Text': ["Number of texts does not match number of labels for train data!", "Number of texts does not match number of labels for val data!", "Number of texts does not match number of labels for test data!"]*64,
          'Labels': ['is_incorrect', 'is_incorrect', 'is_correct']*64}


assert len(train['Text']) == len(train['Labels']), "Number of texts does not match number of labels for train data!"
assert len(val['Text']) == len(val['Labels']), "Number of texts does not match number of labels for val data!"
assert len(test['Text']) == len(test['Labels']), "Number of texts does not match number of labels for test data!"

print(f"We loaded the train data with {len(train['Text'])} texts and {len(train['Labels'])} labels,")
print(f"the validation data with {len(val['Text'])} texts and {len(val['Labels'])} labels")
print(f"and the test data with {len(test['Text'])} texts and {len(test['Labels'])} labels.")

We loaded the train data with 192 texts and 192 labels,
the validation data with 192 texts and 192 labels
and the test data with 192 texts and 192 labels.


Great the data is ready!

You have to answer some questions about your data.

In which language is your text data written?

In [6]:
language = input("Which language do you use? ('english', 'german', 'french') ")

if language == 'english':
  model_name = "roberta-base"
elif language == 'german':
  model_name = "benjamin/roberta-base-wechsel-german"
elif language == 'french':
  model_name = "camembert-base"
else:
  print(f"Seems like we have no model available for this {language}.")
  print("We will load a mulitlingual language model. It knows 100 text from 100 languages.")
  model_name = 'xlm-roberta-base'

print(f"We loaded {model_name} for {language}.")

Which language do you use? ('english', 'german', 'french') english
MMhhh interesting your data is written in english. Let's load the correct LLM!
We loaded roberta-base for english.


Ok now that we talked about the language of your data you might be intersted what the ``model_name`` stands for. \\
These are **pre-trained language models** ready to be used with your specific language.
These language models already learned to understand words and their context by solving a huge cloze-text written in the particular language. \\
This cloze-text is constructed over Wikipedia or other huge text datasets.

Now that we understood what **text classification** and **pre-trained language model** are.

We can try to answer the last question: What is **fine-tuning**?

As you might imagine solving a cloze-text over the whole internet makes you knowledgeable but not an expert in a specific field. We now want to transform our **pre-trained language model** into an expert for your task.

So lets start to understand your task.

We differentiate two classification tasks:
  - Choosing (What is your favorite fruit?) ('single_label_classification')
    - Choose one of the possible labels as the correct label.
  - Deciding (Do you like apples?) ('multi_label_classification')
    - Decide for each possible label if it is a correct label.

In [7]:
decision_type = input("Do you want to classify by choosing or by deciding: ")

assert decision_type in ['deciding', 'choosing']

Do you want to classify by choosing or by deciding: choosing


Based on your ``decision_type`` we will now choose the correct loss function and the decision function.

The loss function tells the model how well it achieved your task.
The decision function tells us how to convert a guess of a model in an actual decision.

In [8]:
import torch

losses = {'deciding': "multi_label_classification", #torch.nn.CrossEntropyLoss(),
          'choosing': "single_label_classification",}

decisions = {'deciding': lambda x: torch.where(x > 0, 1, 0),
             'choosing': lambda x: torch.argmax(x, dim=1)}

objective = losses[decision_type]
decision_function = decisions[decision_type]

The next question we need to clarify is: How many different labels are possible for your task?

Example: \\
Choose your favorite fruit from this list: \\
  ``poss_labels = ['Banana', 'Apple', 'Pear', 'Peach'`]`` \\
  ``model_output = [0.4, 0.5, -0.1, 0.7]`` \\

Or decide if you like the particular fruit: \\
  ``poss_labels = ['Banana', 'Apple', 'Pear', 'Peach'`]`` \\
  ``model_output = [0.4, 0.5, -0.1, 0.7]`` \\
In both cases we have **4** possible labels. \\
In the first case we choose the fruit where the model signals the biggest aggreement ``'Peach'`` in the second we decide for each fruit if we like it by evaluating everything above 0 as the decision 'I like it.'. Therefore, our example output tells us that our model likes ``['Banana', 'Apple', 'Peach']``.

Let's find out how many labels are possible in our data:

In [9]:
output_size = input("How many different labels are possible for your task? ")
infered_output_size_train, possible_labels_train = infer_output_size(train)
infered_output_size_val, possible_labels_val = infer_output_size(val)
infered_output_size_test, possible_labels_test = infer_output_size(test)

possible_labels = max(infered_output_size_train, infered_output_size_val, infered_output_size_test)


if not (possible_labels_train == possible_labels_val == possible_labels_test):
  print(f"Train contains {possible_labels_train} possible labels, val contains {possible_labels_val} possible labels and test contains {possible_labels_test} possible labels."\
        "This might be unwanted and is not recommended all sets should contain all possible labels.")

assert int(output_size), "Output_size needs to be a natural number."
assert int(output_size) == possible_labels, f"We infered {possible_labels} with the following possible labels {possible_labels_train}."
assert possible_labels_train == possible_labels_val == possible_labels_test, f"Make sure that train, val and test labels are equal!"

id2label = {i: k for i,k in enumerate(possible_labels_train)}
label2id = {k: i for i, k in enumerate(possible_labels_train)}

output_size = int(output_size)

print(f"Your task distinguishes {output_size} different labels. These are {possible_labels_train}")

How many different labels are possible for your task? 2
Your task distinguishes 2 different labels. These are {'is_incorrect', 'is_correct'}


Fantastic!

We are close.
We clarified the language, the objective and number of possible answers. \\
Lets load the needed model and tokenizer.
The tokenizer translates text into a model specific vocabulary.

In [10]:
model_config = {'pretrained_model_name_or_path': model_name,
                'num_labels': output_size,
                'problem_type': objective,
                'id2label': id2label,
                'label2id': label2id}

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(**model_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]


With the now loaded model and tokenizer we can start **fine-tuning**. \\

For now we only need the train-set to teach the model and the val-set to decide if we taught our model well.
Lets first translate our text into the models vocabulary.


In [13]:
train_text = tokenizer([m for m in train['Text']], truncation=True, padding='longest', return_tensors='pt')
val_text = tokenizer([m for m in val['Text']], truncation=True, padding='longest', return_tensors='pt')
test_text = tokenizer([m for m in test['Text']], truncation=True, padding='longest', return_tensors='pt')


train_y = convert_labels(train['Labels'], label2id)
val_y = convert_labels(val['Labels'], label2id)
test_y = convert_labels(test['Labels'], label2id)

Now we need to specify some training specific things. Dont worry if you dont know what to change the pre-set values should already be working just fine. \\
Lets start by forming an intuition on how training of a model actually works. We show our model some example data-points in your case the train data. From this data the model infers helpful patterns that explain the correlation between input and output. In order to get the most out of our data we show a small amount of data each time (batch_size) and let the model learn from it. To protect the model from overestimating its understanding we restrict the optimal adaptation per batch by applying the learning rate (lr). We also apply a warm-up (warm-up-rate) to ensure that the model doesnt forget everything it knew from the pre-training. After all we are using **pre-trained language models** for a reason.

 Lets go through each of the values one by one:
  - the batch_size tells us: How many examples we show to the model before we deduce rules that help to solve the classification task.
  - the learning_rate (lr) tells us: How hard we want the model to commit on the recognized patterns from each batch
  - the num_epoch tells us: How often we show all the training data to the model in this case 3 times.
  - the warm_up_rate tells us: How big of a share of the training should be less impactful.
  - device tells us: What device are we using if you have special hardware available (e.g. GPU) your training runs much faster.

In [14]:
batch_size = 64
lr = 1e-4
num_epochs = 3
warm_up_rate = 0.1
num_training_steps = (len(train['Labels'])//batch_size)*num_epochs
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

The learning of patterns and adaptation of the model is achieved by the optimizer. In our case it is a special optimizer that keeps a model from optimizing. If you are really interested you can read more about it [here](https://github.com/davda54/sam). The scheduler adapts the learning rate according the warm_up_rate

In [15]:
optimizer = SAM(model.parameters(), torch.optim.Adam, lr=lr, adaptive=True)
scheduler = get_cosine_schedule_with_warmup(optimizer = optimizer,
                                          num_warmup_steps = num_training_steps*warm_up_rate,
                                          num_training_steps = num_training_steps,
                                          last_epoch = -1)


## Training

Lets start the training!
I wont go into too much detail but now we show the data to the model and optimize its parameters to succeed in the task as good as possible.
Each epoch we test our model, we always safe the best model.
Once this is done you should have a well trained model that you can load with the code of the next cell.

In [17]:
train_dataloader = generate_dataloader(train_text, train_y, batch_size)
val_dataloader = generate_dataloader(val_text, val_y, batch_size)
test_dataloader = generate_dataloader(test_text, test_y, batch_size)

best_loss = float('inf')
best_epoch = 0
already_trained = 0
best_model_path = ''
should_delete = True

for epoch in range(num_epochs):

  for batch_idx, batch in enumerate(train_dataloader):
    input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)

    output = model(input_ids, attention_mask, labels=y)
    loss = output.loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    scheduler.step()
    optimizer.first_step(zero_grad=True)

    model(input_ids, attention_mask, labels=y).loss.backward()
    optimizer.second_step(zero_grad=True)
    print(f"Train: Epoch {epoch}, Train step {already_trained+batch_idx}, Loss {loss}, learning_rate {scheduler.get_last_lr()[0]}", flush=True)

  already_trained += batch_idx
  val_loss = []
  for batch_idx, batch in enumerate(val_dataloader):
    input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)

    with torch.no_grad():
      val_loss.append(model(input_ids, attention_mask, labels=y).loss)

  val_loss = torch.mean(torch.stack(val_loss))

  print(f"Validation: Epoch {epoch}, Train step {already_trained}, Loss {val_loss}, old best/epoch {str(best_loss)[1:6]}/{best_epoch}", flush=True)

  if val_loss < best_loss:
    best_loss = val_loss
    best_epoch = epoch
    if should_delete and best_model_path:
      shutil.rmtree(best_model_path)
    best_model_path = f"./my_model_epoch_{best_epoch}_val_loss_{str(val_loss.item())[1:6]}"
    model.save_pretrained(best_model_path, from_pt=True)

    print(f"**** END EPOCH {epoch} ****")


print(f"**** FINISHED TRAINING FOR N={epoch} ****")
print(f"BEST EPOCH: {best_epoch}")
print(f"BEST LOSS: {best_loss}")

Train: Epoch 0, Train step 0, Loss 0.5242859721183777, learning_rate 6.800888624023551e-05
Train: Epoch 0, Train step 1, Loss 0.35003662109375, learning_rate 4.903043341140879e-05
Train: Epoch 0, Train step 2, Loss 0.17097768187522888, learning_rate 3.019601169804216e-05
Validation: Epoch 0, Train step 2, Loss 1.5066224336624146, old best/epoch nf/0
**** END EPOCH 0 ****
Train: Epoch 1, Train step 2, Loss 0.10638771206140518, learning_rate 1.4303513272105057e-05
Train: Epoch 1, Train step 3, Loss 0.07648667693138123, learning_rate 3.7138015365554833e-06
Train: Epoch 1, Train step 4, Loss 0.07506324350833893, learning_rate 0.0
Validation: Epoch 1, Train step 4, Loss 1.7421793937683105, old best/epoch ensor/0
Train: Epoch 2, Train step 4, Loss 0.07137938588857651, learning_rate 3.713801536555489e-06
Train: Epoch 2, Train step 5, Loss 0.06436154991388321, learning_rate 1.4303513272105045e-05
Train: Epoch 2, Train step 6, Loss 0.044759854674339294, learning_rate 3.019601169804216e-05
Valid

The training is finished now we can load the model.

In [None]:
best_model = AutoModelForSequenceClassification.from_pretrained(best_model_path)

Finally, with the loaded model we can now predict our results for the unseen test set to understand the models performance in more detail.

In [None]:
y_pred = []

with torch.no_grad():

  for batch_idx, batch in enumerate(test_dataloader):
    input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)

    with torch.no_grad():
      y_pred += model(input_ids, attention_mask, labels=y).logits


y_pred = torch.stack(y_pred, dim=0)
y_pred = decision_function(y_pred)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_y, y_pred, target_names=label2id.keys(), zero_division=True))