# Overview

## Fine-tune a [PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa) model on a Parts of Speech(POS) task

This notebook demonstrates how to use the 🤗 Transformers library to **fine-tune a pre-trained language model** on a token classification task(this can either be `pos` or `ner`)

The notebook is divided into the following sections:

* **Environment Setup**
* **Global Parameters**
* **Data Preparation**
* **Preprocessing the data**
* **Fine-tuning the Pre-Trained Model**
* **Evaluate the Fine-Tuned Model**
* **(Optional) Upload to hub**

> This notebook is intended for users who are familiar with the basics of deep learning and natural language processing. It is also recommended that users have some experience with the Python programming language and the Jupyter Notebook environment.


---

* [Reference - Huggingface Notebook Examples on GitHub](https://github.com/huggingface/notebooks)

# Environment setup

**Make sure to install the dependencies below if you have not done so already**

In [None]:
%pip install --quiet datasets transformers sentencepiece seqeval

In [None]:
%pip install accelerate -U --quiet

In [None]:
%pip install --quiet --upgrade huggingface_hub

In [None]:
%pip install --quiet evaluate

In [None]:
import transformers

print(transformers.__version__)

**If you intend to upload your trained model to HuggingFace or access a private model: ***

* You need to login, using any of the recommended [authentication methods](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication)
* In this notebook we use `huggingface_hub.notebook_login()` method

In [17]:
from huggingface_hub import notebook_login, whoami

try:
  whoami()
except:
  print("User token not found, calling notebook_login()...")
  notebook_login()

User token not found, calling notebook_login()...


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Global Parameters

In [18]:
task = "pos" # Can be one of "pos" or "ner"
model_checkpoint = "dsfsi/PuoBERTa"
dataset_checkpoint = "conll2003"
push_to_hub_enabled = False
trained_model_checkpoint = f"{model_checkpoint}-finetuned-{task}"
trained_model_checkpoint_hub = f"ndamulelonemakh/{trained_model_checkpoint}"
batch_size = 16  # adjust depending on GPU size
epochs = 3

# Data Preperation

In [19]:
from datasets import load_dataset, load_metric
datasets = load_dataset("conll2003")

For our example here, we'll use the [CONLL 2003 dataset](https://www.aclweb.org/anthology/W03-0419.pdf) AS **REFERENCE ONLY**. The notebook should work with any token classification dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or csv file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used.

The `datasets` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set.

In [20]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

We can see the training, validation and test sets all have a column for the tokens (the input texts split into words) and one column of labels for each kind of task we introduced before.

To access an actual element, you need to select a split first, then give an index:

In [21]:
datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [22]:
datasets["train"].features[f"{task}_tags"]

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)

In [23]:
label_list = datasets["train"].features[f"{task}_tags"].feature.names
print(label_list)

['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [24]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [25]:
show_random_elements(datasets["train"])

Unnamed: 0,id,tokens,pos_tags,chunk_tags,ner_tags
0,10898,"[SOFIA, 1996-08-28]","[NNP, CD]","[B-NP, I-NP]","[B-LOC, O]"
1,7797,"[AMT, :, 3,250,000, DATE, :, 09/04/96, NYC, Time, :, 1200, CUSIP, :, 569399]","[NNP, :, CD, NN, :, CD, NNP, NNP, :, CD, NN, :, CD]","[O, O, B-NP, I-NP, O, B-NP, I-NP, I-NP, O, B-NP, I-NP, O, B-NP]","[O, O, O, O, O, O, B-MISC, I-MISC, O, O, O, O, O]"
2,11065,"[as, a, result, of, the, absence, of, this, team, from, the, match, ,, "", CAF, said, in, a, statement, .]","[IN, DT, NN, IN, DT, NN, IN, DT, NN, IN, DT, NN, ,, "", NNP, VBD, IN, DT, NN, .]","[B-PP, B-NP, I-NP, B-PP, B-NP, I-NP, B-PP, B-NP, I-NP, B-PP, B-NP, I-NP, O, O, B-NP, B-VP, B-PP, B-NP, I-NP, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-ORG, O, O, O, O, O]"
3,8379,"[Cozma, is, barred, from, taking, part, in, any, official, soccer, activity, during, the, ban, .]","[NNP, VBZ, VBN, IN, VBG, NN, IN, DT, JJ, NN, NN, IN, DT, NN, .]","[B-NP, B-VP, I-VP, B-PP, B-VP, B-NP, B-PP, B-NP, I-NP, I-NP, I-NP, B-PP, B-NP, I-NP, O]","[B-PER, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
4,914,"[GOLF, -, GERMAN, OPEN, FIRST, ROUND, SCORES, .]","[NN, :, NNP, NNP, NNP, NNP, NNP, .]","[B-NP, O, B-NP, O, B-NP, I-NP, I-NP, O]","[O, O, B-MISC, I-MISC, O, O, O, O]"
5,5819,"[Magnificent, ,, ', ', said, Fitzpatrick, ,, New, Zealand, 's, most, capped, player, and, the, world, 's, most, capped, forward, .]","[NN, ,, '', POS, VBD, NNP, ,, NNP, NNP, POS, RBS, VBD, NN, CC, DT, NN, VBZ, RBS, VBD, RB, .]","[B-NP, O, O, B-NP, B-VP, B-NP, O, B-NP, I-NP, I-NP, B-ADJP, B-NP, I-NP, O, B-NP, I-NP, I-NP, B-ADJP, B-NP, B-ADVP, O]","[O, O, O, O, O, B-PER, O, B-LOC, I-LOC, O, O, O, O, O, O, O, O, O, O, O, O]"
6,2514,"[Squad, :]","[VB, :]","[B-VP, O]","[O, O]"
7,13998,"[Practice, times, set, on, Friday]","[JJ, NNS, VBN, IN, NNP]","[B-NP, I-NP, B-VP, B-PP, B-NP]","[O, O, O, O, O]"
8,5956,"[The, mayor, has, said, he, wants, to, cut, their, number, to, five, as, part, of, a, war, against, organised, crime, .]","[DT, NN, VBZ, VBN, PRP, VBZ, TO, VB, PRP$, NN, TO, CD, IN, NN, IN, DT, NN, IN, VBD, NN, .]","[B-NP, I-NP, B-VP, I-VP, B-NP, B-VP, I-VP, I-VP, B-NP, I-NP, B-PP, B-NP, B-PP, B-NP, B-PP, B-NP, I-NP, B-PP, B-VP, B-NP, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
9,3872,"[SAN, FRANCISCO, 54, 72, .429, 14]","[NNP, NNP, CD, CD, CD, CD]","[B-NP, I-NP, I-NP, I-NP, I-NP, I-NP]","[B-ORG, I-ORG, O, O, O, O]"


## Load Masakhane POS

### Utilities

In [None]:
from datasets import load_dataset, Dataset, DatasetDict
from typing import List, Tuple


def load_sentences(filepath: str) -> List[List[Tuple[str, str]]]:
    """
    Load sentences from a file in IOB format.

    Args:
        filepath (str): Path to the input file.

    Returns:
        List[List[Tuple[str, str]]]: A list of sentences, where each sentence is a list of tuples (token, pos_tag).
    """
    sentences = []
    current_sentence = []

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if len(line.strip()) == 0:
                sentences.append(current_sentence)
                current_sentence = []
            else:
                token, pos_tag = line.strip().split()
                current_sentence.append((token, pos_tag))

    if current_sentence:
        sentences.append(current_sentence)

    return sentences


def convert_to_conll(sentences: List[List[Tuple[str, str]]]) -> List[dict]:
    """
    Convert sentences to the CoNLL-3 format.

    Args:
        sentences (List[List[Tuple[str, str]]]): A list of sentences, where each sentence is a list of tuples (token, pos_tag).

    Returns:
        List[dict]: A list of dictionaries representing the sentences in the CoNLL-3 format.
    """
    data = []

    for sent_id, sentence in enumerate(sentences):
        tokens = []
        pos_tags = []
        for token, pos_tag in sentence:
            tokens.append(token)
            pos_tags.append(pos_tag)

        data.append({
            "id": sent_id,
            "tokens": tokens,
            "pos_tags": pos_tags,
        })

    return data


def create_hf_dataset(data: List[dict]) -> Dataset:
    """
    Convert data to a Hugging Face Dataset.

    Args:
        data (List[dict]): A list of dictionaries representing the sentences in the CoNLL-3 format.

    Returns:
        Dataset: A Hugging Face Dataset containing the data.
    """
    from datasets import Dataset, ClassLabel, Sequence

    pos_tag_class = ClassLabel(names=sorted(set(tag for d in data for tag in d["pos_tags"])))

    encoded_data = {
        "id": [],
        "tokens": [],
        "pos_tags": [],
    }

    for d in data:
        encoded_pos_tags = pos_tag_class.str2int(d["pos_tags"])
        encoded_data['id'].append(d['id'])
        encoded_data['tokens'].append(d['tokens'])
        encoded_data['pos_tags'].append(encoded_pos_tags)

    d = Dataset.from_dict(encoded_data)
    pos_tags_feature = Sequence(feature=ClassLabel(names=pos_tag_class.names))
    d = d.cast_column('pos_tags', pos_tags_feature)

    return d


def iob_to_hugging_face(train_file: str, validation_file: str = None, test_file: str = None) -> DatasetDict:
    sentences = load_sentences(train_file)
    sentence_dicts = convert_to_conll(sentences)
    train_dataset = create_hf_dataset(sentence_dicts)

    data_dict = DatasetDict({'train': train_dataset})

    # If validation_file is provided, load and convert the sentences
    if validation_file:
        validation_sentences = load_sentences(validation_file)
        validation_sentence_dicts = convert_to_conll(validation_sentences)
        validation_dataset = create_hf_dataset(validation_sentence_dicts)
        data_dict['validation'] = validation_dataset

    # If test_file is provided, load and convert the sentences
    if test_file:
        test_sentences = load_sentences(test_file)
        test_sentence_dicts = convert_to_conll(test_sentences)
        test_dataset = create_hf_dataset(test_sentence_dicts)
        data_dict['test'] = test_dataset
    return data_dict


### Format Masakhane as Huggingface Dataset

In [28]:
## Get masakhane POS dataset
!wget https://raw.githubusercontent.com/masakhane-io/masakhane-pos/main/data/tsn/train.txt
!wget https://raw.githubusercontent.com/masakhane-io/masakhane-pos/main/data/tsn/dev.txt
!wget https://raw.githubusercontent.com/masakhane-io/masakhane-pos/main/data/tsn/test.txt

--2024-03-21 17:30:29--  https://raw.githubusercontent.com/masakhane-io/masakhane-pos/main/data/tsn/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 219713 (215K) [text/plain]
Saving to: 'train.txt'


2024-03-21 17:30:29 (83.3 MB/s) - 'train.txt' saved [219713/219713]

--2024-03-21 17:30:29--  https://raw.githubusercontent.com/masakhane-io/masakhane-pos/main/data/tsn/dev.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40665 (40K) [text/plain]
Saving to: 'dev.txt'


2024-03-21 17:30:29 (31.7 MB/s) - 'dev.txt' saved [40

In [34]:
datasets = iob_to_hugging_face('train.txt', 'dev.txt', 'test.txt')

Casting the dataset:   0%|          | 0/754 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/150 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/601 [00:00<?, ? examples/s]

In [35]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags'],
        num_rows: 754
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags'],
        num_rows: 150
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags'],
        num_rows: 601
    })
})

In [36]:
show_random_elements(datasets["train"])

Unnamed: 0,id,tokens,pos_tags
0,218,"[Fustat, e, ne, e, le, lefelo, la, matlole, a, Egepeto, ka, fa, tlase, ga, Umayyad, dynasty, morago, ga, phenyo, ya, Arab, .]","[PROPN, DET, AUX, PRON, DET, NOUN, DET, NOUN, ADP, PROPN, ADP, DET, ADV, DET, PROPN, NOUN, ADP, DET, NOUN, DET, PROPN, PUNCT]"
1,172,"[Re, ka, se, letle, seo, go, diragala, ,, a, tlaleletsa, .]","[PRON, AUX, DET, VERB, DET, VERB, VERB, PUNCT, DET, VERB, PUNCT]"
2,612,"[Dingwaga, di, le, pedi, le, halofo, ,, bontsi, jwa, MaAforikaborwa, bo, boifa, tokomane, e, ,, e, ba, e, bitsang, popego, e, e, tshelang, ya, ditsholofelo, le, dikeletso, tsa, bona, .]","[NOUN, DET, DET, ADJ, CCONJ, ADJ, PUNCT, ADJ, DET, PROPN, PRON, VERB, NOUN, DET, PUNCT, DET, DET, DET, VERB, NOUN, DET, PRON, VERB, DET, NOUN, CCONJ, NOUN, DET, PRON, PUNCT]"
3,360,"[Filimi, ya, ga, Foster, e, ikgapetse, sekgele, kwa, moletlong, wa, diawate, tsa, akatemi, wa, bo, 93, kwa, Los, Angeles, .]","[NOUN, DET, DET, PROPN, DET, VERB, NOUN, DET, ADV, DET, NOUN, DET, NOUN, DET, DET, NUM, DET, PROPN, PROPN, PUNCT]"
4,97,"[Tiro, ya, gagwe, ya, bokwadi, ke, ya, maemo, a, a, kwa, godimo]","[NOUN, DET, DET, DET, NOUN, ADP, DET, NOUN, DET, PRON, DET, NOUN]"
5,576,"[Re, ka, se, reya, ngaka, ra, re, ,, fa, Zuma, a, lwala, ,, o, ka, se, kgone, go, mo, tlhatlhoba, ,, o, tlaa, bo, a, nna, kgatlhanong, le, maikano, a, gagwe, .]","[PRON, AUX, AUX, VERB, NOUN, DET, VERB, PUNCT, SCONJ, PROPN, DET, VERB, PUNCT, PRON, AUX, AUX, VERB, ADP, DET, NOUN, PUNCT, PRON, VERB, AUX, DET, AUX, NOUN, DET, NOUN, DET, DET, PUNCT]"
6,491,"[Mabuza, a, re, puso, e, tshwentswe, ke, gore, badiredi, ba, sesole, ba, lwana, ka, bobona, .]","[PROPN, DET, VERB, NOUN, DET, VERB, DET, CCONJ, NOUN, DET, NOUN, PRON, VERB, ADP, NOUN, PUNCT]"
7,285,"[Ba, ne, ba, le, magareng, ga, balekane, ba, le, bane, ba, bong, jo, bo, tshwanang, ,, banna, ba, le, bararo, le, mosadi, a, le, mongwe, ,, ba, ba, neng, ba, tlaa, nyadisiwang, ke, Meiyara, wa, kwa, Amsterdam, moragonyana, ga, bosigogagre, ka, Moranang, 1, ,, ka, ngwaga, wa, 2001, .]","[PRON, AUX, DET, DET, ADV, DET, NOUN, PRON, DET, NUM, DET, NOUN, DET, DET, VERB, PUNCT, NOUN, DET, DET, NUM, CCONJ, NOUN, DET, DET, NUM, PUNCT, DET, PRON, AUX, PRON, VERB, VERB, ADP, NOUN, DET, DET, PROPN, NOUN, DET, NOUN, ADP, NOUN, NUM, PUNCT, ADP, NOUN, DET, NUM, PUNCT]"
8,42,"[Facebook, e, rile, go, thibilwe, lesoba, morago, ga, go, lemoga, bothata, ka, nako, eo, .]","[NOUN, DET, VERB, VERB, VERB, NOUN, ADP, DET, VERB, VERB, NOUN, ADP, NOUN, DET, PUNCT]"
9,492,"[MKVA, le, khansele, ya, sesole, ,, lo, tshwanetse, go, tsamaya, lo, ye, go, rarabolola, mathata, a, lona, ka, bonako, .]","[PROPN, CCONJ, NOUN, DET, NOUN, PUNCT, DET, VERB, VERB, VERB, DET, DET, VERB, VERB, NOUN, DET, PRON, ADP, NOUN, PUNCT]"


# Preprocessing the data

**Objectives:**

*   Converting the tokens to their corresponding IDs in the pretrained vocabulary, see [example]()

In [37]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

config.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/877k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/523k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.28M [00:00<?, ?B/s]

* Verify that your choosen model support [Fast Tokenization]()
* You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

In [38]:
# You can check which type of models have a fast tokenizer available and which don't on the big table of models.
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [39]:
tokenizer("Hello, this is one sentence!")

{'input_ids': [0, 788, 34568, 16, 10261, 4509, 901, 21054, 3758, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [40]:
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)

{'input_ids': [0, 788, 34568, 362, 10261, 4509, 901, 21054, 3758, 283, 84, 497, 88, 11164, 16399, 10183, 330, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

> Note that transformers are often pretrained with subword tokenizers, meaning that even if your inputs have been split into words already, each of those words could be split again by the tokenizer. Let's look at an example of that:

In [41]:
print('===Subword Tokenisation Illustration===')
example = datasets["train"][4]
print('=' * 50 + '\n')
print("ORIGINAL TEXT")
print(example["tokens"])
print('-' * 20   + '\n')
print('TOKENS:')
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

===Subword Tokenisation Illustration===

ORIGINAL TEXT
['Mokgatlho', 'o', 'dirang', 'go', 'sireletsa', 'baphasalatsi', 'ba', 'mmino', 'le', 'go', 'ikopanya', 'le', 'puso', 'go', 'netefatsa', 'gore', 'badiragatsi', 'ba', 'amogela', 'tuelo', 'e', 'e', 'lekaneng', '.']
--------------------

TOKENS:
['<s>', 'ĠMokgatlho', 'Ġo', 'Ġdirang', 'Ġgo', 'Ġsireletsa', 'Ġbaphasalatsi', 'Ġba', 'Ġmmino', 'Ġle', 'Ġgo', 'Ġikopanya', 'Ġle', 'Ġpuso', 'Ġgo', 'Ġnetefatsa', 'Ġgore', 'Ġbadiragatsi', 'Ġba', 'Ġamogela', 'Ġtuelo', 'Ġe', 'Ġe', 'Ġlekaneng', 'Ġ.', '</s>']


* **Note** that the tokenizer returns outputs that have `word_ids`

In [42]:
print(tokenized_input.word_ids())

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, None]


* As we can see, it returns a list with the same number of elements as our processed input ids, mapping special tokens to `None` and all other tokens to their respective word. This way, we can align the labels with the processed input ids.

In [43]:
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))

26 26


* Here we set the labels of all special tokens to -100 (the index that is ignored by PyTorch) and the labels of all other tokens to the label of the word they come from.
  * Another strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word.
  * You can use the `label_all_tokens` flag below to control this behaviour

In [44]:
label_all_tokens = True

**Define the preprocessing function**

In [45]:
def tokenize_and_align_labels(examples):
    """
    Tokenizes the input text and aligns the labels with the sub-word tokens.

    Args:
        examples: A dictionary of input examples, containing keys "id", "tokens" and "{task}_tags". For example
        on a dataset with 3 examples, the dictionary will look something like this: {"id": [1, 2, 3], ....}

    Returns:
        A dictionary with keys "input_ids", "attention_mask", and "labels".
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [46]:
# test on a sampled dataset
tokenize_and_align_labels(datasets['train'][:5])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[0, 23833, 332, 1867, 271, 3357, 271, 461, 273, 2278, 534, 1844, 8095, 2222, 612, 438, 330, 2], [0, 389, 345, 69, 345, 287, 1513, 295, 295, 12350, 333, 799, 296, 839, 287, 1867, 287, 1501, 41592, 14263, 273, 1516, 2222, 612, 296, 12277, 48418, 6196, 330, 2], [0, 530, 341, 278, 271, 362, 296, 1733, 362, 284, 271, 1641, 278, 366, 374, 295, 6511, 1019, 679, 313, 362, 10255, 289, 4697, 295, 3025, 23345, 289, 856, 284, 284, 2642, 362, 278, 1121, 743, 287, 438, 278, 2124, 1019, 274, 271, 3026, 305, 6326, 338, 2695, 332, 445, 289, 6027, 330, 2], [0, 20526, 323, 2469, 323, 21367, 287, 2612, 323, 554, 362, 574, 7152, 28482, 295, 334, 10435, 279, 1743, 279, 19644, 273, 413, 5477, 374, 279, 741, 7349, 330, 2], [0, 2469, 295, 1048, 273, 2287, 51552, 279, 2612, 271, 273, 10427, 271, 657, 273, 1124, 353, 8579, 279, 1186, 2710, 284, 284, 2706, 330, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

* To apply the preprocessing function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier.

* This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [47]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/754 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

Map:   0%|          | 0/601 [00:00<?, ? examples/s]

# Fine-tuning the Pre-trained model

In [48]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


In [49]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint,
                                                        num_labels=len(label_list))

model.safetensors:   0%|          | 0.00/334M [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at dsfsi/PuoBERTa and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [50]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    trained_model_checkpoint,
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.01,
    push_to_hub=push_to_hub_enabled,
)

* We will need a data collator that will batch our processed examples together whi0le applying padding to make them all the same size (each pad will be padded to the length of its longest example).

* There is a data collator for this task in the Transformers library, that not only pads the inputs, but also the labels:

In [51]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. Here we will load the [`seqeval`](https://github.com/chakki-works/seqeval) metric (which is commonly used to evaluate results for Token Classification tasks)

In [52]:
import evaluate

metric = evaluate.load("seqeval")

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

This metric takes list of labels for the predictions and references:

In [53]:
labels = [label_list[i] for i in example[f"{task}_tags"]]

# In this case we are evaluating test labels only for illustration
metric.compute(predictions=[labels], references=[labels])



{"'": {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'D': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'W': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 5},
 '_': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 5},
 '`': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

In [54]:
import numpy as np

def compute_metrics(p: tuple) -> dict:
    """
    Computes the evaluation metrics for the model.

    Args:
        p: Tuple containing:
            - predictions (np.array of shape (batch_size, num_seq, vocab_size))
            - labels (np.array of shape (batch_size, num_seq))

    Returns:
        A dictionary containing the evaluation metrics.
    """
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [55]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.782822,0.522583,0.476704,0.49859,0.67169


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


# Evaluate the Fine-tuned Model

In [None]:
trainer.evaluate()

* Or, get detailed evaluation scores

In [None]:
predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

# (Optional) Upload your final model(s) to HuggingFace

In [None]:
## You might want to uncomment this if git-lfs is not instaled yet
# !apt install git-lfs

In [None]:
if push_to_hub_enabled:
  trainer.push_to_hub()
else:
  trainer.save_model(trained_model_checkpoint)

**Example usage of your trained model using Huggingface pipelines**

In [None]:
from transformers import pipeline

In [None]:
pipe = pipeline('token-classification', model=trained_model_checkpoint_hub if push_to_hub_enabled else trained_model_checkpoint)

In [None]:
pipe("We’re rolling out custom versions of ChatGPT that you can create for a specific purpose")