# NB_100124T0814_token_classification

# 1.Goal

- fine-tune and use LLM for token classification task

# 2.Steps
    
    - load dataset in cache memory
    - EDA of the dataset
    - tokenizer
    - processing the dataset according to the tokenizer shift
    - datacollector
    - metrics
    - defining the model
    - fine-tuning the model through Trainer
    - fine-tuning the model through custom training loop
    - using the fine-tuned model

# 3.Initializing

The traditional framework used to evaluate token classification prediction is [seqeval](https://github.com/chakki-works/seqeval)

In [33]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     ---------------------------------------- 0.0/43.6 kB ? eta -:--:--
     ---------------------------------------- 0.0/43.6 kB ? eta -:--:--
     ---------------------------------------- 43.6/43.6 kB ? eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: seqeval
  Building wheel for seqeval (pyproject.toml): started
  Building wheel for seqeval (pyproject.toml): finished with status 'done'
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16185 sha256=0ad3e3123bf309b588e7

In [1]:
import datasets

In [2]:
DATASET_NAME = "conll2003"
MODEL_CHECKPOINT = "bert-base-cased"

# 4.Pipeline

## 4.1. Load data set in cache memory

If the dataset very bit, it can be loaded through iterator

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset(DATASET_NAME)

## 4.2. EDA of the dataset

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [5]:
raw_datasets["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [6]:
raw_datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [7]:
type(raw_datasets["train"])

datasets.arrow_dataset.Dataset

In [8]:
len(raw_datasets["train"][0]["tokens"]), len(
    raw_datasets["train"][0]["pos_tags"]
), len(raw_datasets["train"][0]["chunk_tags"]), len(
    raw_datasets["train"][0]["ner_tags"]
)

(9, 9, 9, 9)

In [9]:
raw_datasets["train"].features["pos_tags"]

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)

The abbreviations you've listed are part-of-speech (POS) tags commonly used in natural language processing. They are used to classify words in a sentence according to their grammatical role. Here's what each abbreviation stands for:

1. **CC**: Coordinating conjunction (e.g., and, but, or)
2. **CD**: Cardinal number (e.g., one, two, 3)
3. **DT**: Determiner (e.g., the, a, these)
4. **EX**: Existential there (e.g., there is, there were)
5. **FW**: Foreign word (a word from another language)
6. **IN**: Preposition or subordinating conjunction (e.g., in, of, like, because)
7. **JJ**: Adjective (e.g., big, happy)
8. **JJR**: Adjective, comparative (e.g., bigger, happier)
9. **JJS**: Adjective, superlative (e.g., biggest, happiest)
10. **LS**: List item marker (used in lists)
11. **MD**: Modal (e.g., can, should, would)
12. **NN**: Noun, singular or mass (e.g., cat, tree)
13. **NNP**: Proper noun, singular (e.g., Alice, London)
14. **NNPS**: Proper noun, plural (e.g., Americans, Carolinas)
15. **NNS**: Noun, plural (e.g., cats, trees)
16. **NN|SYM**: This seems to be a non-standard tag, possibly denoting either a noun or a symbol.
17. **PDT**: Predeterminer (e.g., all, both, half)
18. **POS**: Possessive ending ('s)
19. **PRP**: Personal pronoun (e.g., I, you, he)
20. **PRP$**: Possessive pronoun (e.g., my, your, his)
21. **RB**: Adverb (e.g., quickly, not, very)
22. **RBR**: Adverb, comparative (e.g., faster)
23. **RBS**: Adverb, superlative (e.g., fastest)
24. **RP**: Particle (e.g., up, off, out)
25. **SYM**: Symbol (e.g., +, %, &)
26. **TO**: The word "to" (used before a verb, e.g., to run, to play)
27. **UH**: Interjection (e.g., uh, wow, oops)
28. **VB**: Verb, base form (e.g., run, play)
29. **VBD**: Verb, past tense (e.g., ran, played)
30. **VBG**: Verb, gerund or present participle (e.g., running, playing)
31. **VBN**: Verb, past participle (e.g., run, played)
32. **VBP**: Verb, non-3rd person singular present (e.g., run, play)
33. **VBZ**: Verb, 3rd person singular present (e.g., runs, plays)
34. **WDT**: Wh-determiner (e.g., which, whatever, whichever)
35. **WP**: Wh-pronoun (e.g., who, whom, which)
36. **WP$**: Possessive wh-pronoun (e.g., whose)
37. **WRB**: Wh-adverb (e.g., where, when, why)

These tags are part of a standard set known as the Penn Treebank POS tags, widely used in computational linguistics and natural language processing.

In [10]:
raw_datasets["train"].features["chunk_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)

This abbreviation relates to a specific way of annotating text for chunking or shallow parsing in natural language processing (NLP). Let's break it down:

1. **`Sequence(feature=ClassLabel(names=[...]), length=-1, id=None)`**:
    - **`Sequence`**: This indicates that the data structure is a sequence, typically a list or array of elements in a specific order. In NLP, this often refers to sequences of words or tokens in a sentence.
    - **`feature=ClassLabel(names=[...])`**: This specifies that each element in the sequence is a class label from a predefined set of labels. These labels represent different types of chunks in text.
    - **`length=-1`**: This likely means that the sequences can be of variable length.
    - **`id=None`**: This suggests that there is no specific identifier associated with each sequence.

2. **Abbreviations in `names=[...]`**:
    - **`O`**: Outside of any chunk.
    - **`B-`** and **`I-`** prefixes: These are common in BIO tagging, a method used in NLP. `B-` stands for the beginning of a chunk, and `I-` stands for inside a chunk. These prefixes are followed by the type of chunk.
        - **`ADJP`**: Adjective Phrase.
        - **`ADVP`**: Adverb Phrase.
        - **`CONJP`**: Conjunction Phrase.
        - **`INTJ`**: Interjection.
        - **`LST`**: List marker.
        - **`NP`**: Noun Phrase.
        - **`PP`**: Prepositional Phrase.
        - **`PRT`**: Particle.
        - **`SBAR`**: Clause introduced by a (subordinating) conjunction.
        - **`UCP`**: Unlike Coordinated Phrase.
        - **`VP`**: Verb Phrase.

Each of these tags is used to annotate specific parts of a sentence. For example, in the sentence "The quick brown fox", "The quick brown" could be tagged as `B-NP I-NP I-NP` (beginning of a noun phrase, inside, inside), indicating that these words form a noun phrase. This type of tagging is crucial for tasks like information extraction, where understanding the structure of sentences is important.

In [11]:
raw_datasets["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

The abbreviation you're referring to, `ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])`, is used in the context of named entity recognition (NER), a common task in natural language processing (NLP). This task involves identifying and classifying key information (entities) in text into predefined categories. Let's explain each component:

1. **`ClassLabel(names=[...])`**: This indicates a classification task where each entity in the text is labeled as one of the predefined classes.

2. **The abbreviations in `names=[...]`**:
   - **`O`**: Stands for "Outside" any named entity. It's used for tokens that are not part of any named entity.
   - **`B-`** and **`I-`** prefixes: These are used in the BIO tagging scheme. `B-` indicates the beginning of a named entity, and `I-` indicates that the token is inside a named entity. These prefixes are followed by the type of entity.
      - **`PER`**: Person. For names of people.
      - **`ORG`**: Organization. For names of companies, governmental organizations, etc.
      - **`LOC`**: Location. For names of geographical locations, like cities, countries, rivers, etc.
      - **`MISC`**: Miscellaneous. For named entities that don't fall into the above categories, like events, nationalities, languages, etc.

**Examples**:

1. **Sentence**: "Steve Jobs co-founded Apple in Cupertino."
   - **`Steve`**: B-PER (Beginning of a Person entity)
   - **`Jobs`**: I-PER (Inside a Person entity)
   - **`co-founded`**: O (Outside any entity)
   - **`Apple`**: B-ORG (Beginning of an Organization entity)
   - **`in`**: O (Outside any entity)
   - **`Cupertino`**: B-LOC (Beginning of a Location entity)

2. **Sentence**: "The United Nations was established after World War II."
   - **`The`**: O
   - **`United`**: B-ORG
   - **`Nations`**: I-ORG
   - **`was`**: O
   - **`established`**: O
   - **`after`**: O
   - **`World`**: B-MISC (Beginning of a Miscellaneous entity)
   - **`War`**: I-MISC
   - **`II`**: I-MISC

In these examples, each word in the sentence is labeled according to whether it is part of a named entity and what type of entity it is. This labeling is essential for extracting structured information from unstructured text data.

In [12]:
def decode_with_labels(dataset: datasets.arrow_dataset.Dataset, i: int):
    words = dataset[i]["tokens"]
    pos_labels = dataset[i]["pos_tags"]
    chunk_labels = dataset[i]["chunk_tags"]
    ner_labels = dataset[i]["ner_tags"]
    pos_label_names = dataset.features["pos_tags"].feature.names
    chunk_label_names = dataset.features["chunk_tags"].feature.names
    ner_label_names = dataset.features["ner_tags"].feature.names

    line1, line2, line3, line4 = "", "", "", ""
    for word, pos, chunk, ner in zip(
        words, pos_labels, chunk_labels, ner_labels
    ):
        # Assuming pos_label_names is a list of POS tag names
        full_pos = pos_label_names[pos]
        # Assuming chunk_label_names is a list of chunk tag names
        full_chunk = chunk_label_names[chunk]
        # Assuming ner_label_names is a list of NER tag names
        full_ner = ner_label_names[ner]

        max_length = max(
            len(word), len(full_pos), len(full_chunk), len(full_ner)
        )
        space_padding = max_length - len(word) + 1

        line1 += word + " " * space_padding
        line2 += full_pos + " " * (max_length - len(full_pos) + 1)
        line3 += full_chunk + " " * (max_length - len(full_chunk) + 1)
        line4 += full_ner + " " * (max_length - len(full_ner) + 1)

    return line1, line2, line3, line4

In [13]:
decode_with_labels(raw_datasets["train"], 0)

('EU    rejects German call to   boycott British lamb . ',
 'NNP   VBZ     JJ     NN   TO   VB      JJ      NN   . ',
 'B-NP  B-VP    B-NP   I-NP B-VP I-VP    B-NP    I-NP O ',
 'B-ORG O       B-MISC O    O    O       B-MISC  O    O ')

In [14]:
decode_with_labels(raw_datasets["train"], 5)

('" We   do   n\'t  support any  such recommendation because we   do   n\'t  see  any  grounds for  it   , " the  Commission \'s   chief spokesman Nikolaus van   der   Pas   told a    news briefing . ',
 '" PRP  VBP  RB   VB      DT   JJ   NN             IN      PRP  VBP  RB   VB   DT   NNS     IN   PRP  , " DT   NNP        POS  JJ    NN        NNP      NNP   FW    NNP   VBD  DT   NN   NN       . ',
 'O B-NP B-VP I-VP I-VP    B-NP I-NP I-NP           B-SBAR  B-NP B-VP I-VP I-VP B-NP I-NP    B-PP B-NP O O B-NP I-NP       B-NP I-NP  I-NP      I-NP     I-NP  I-NP  I-NP  B-VP B-NP I-NP I-NP     O ',
 'O O    O    O    O       O    O    O              O       O    O    O    O    O    O       O    O    O O O    B-ORG      O    O     O         B-PER    I-PER I-PER I-PER O    O    O    O        O ')

## 4.3.Tokenizer

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

### EDA

In [17]:
tokenizer.is_fast

True

In [18]:
inputs = tokenizer(
    raw_datasets["train"][0]["tokens"], is_split_into_words=True
)
print(inputs.tokens())
print(inputs.word_ids())

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]


## 4.4.Processing the dataset according to the tokenizer shift

The problem is that after appling the tokenizer, our labels do not match to the tokens, as a rule the number of tokens more than the words

In [19]:
inputs = tokenizer(
    raw_datasets["train"][0]["tokens"], is_split_into_words=True
)
inputs

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [20]:
print(
    len(inputs.tokens()),
    len(raw_datasets["train"][0]["tokens"]),
    len(raw_datasets["train"][0]["ner_tags"]),
    inputs,
)

12 9 9 {'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


So we need to align labels with tokens

In [21]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

before:

In [22]:
print(
    f"{len(inputs.tokens())=}",
    f'{len(raw_datasets["train"][0]["tokens"])=}',
    f'{len(raw_datasets["train"][0]["ner_tags"])=}',
)

len(inputs.tokens())=12 len(raw_datasets["train"][0]["tokens"])=9 len(raw_datasets["train"][0]["ner_tags"])=9


In [23]:
print(
    f"{inputs.tokens()=}\n",
    f'{raw_datasets["train"][0]["tokens"]=}\n',
    f'{raw_datasets["train"][0]["ner_tags"]=}\n',
)

inputs.tokens()=['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
 raw_datasets["train"][0]["tokens"]=['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
 raw_datasets["train"][0]["ner_tags"]=[3, 0, 7, 0, 0, 0, 7, 0, 0]



after:

In [24]:
inputs = tokenizer(
    raw_datasets["train"][0]["tokens"], is_split_into_words=True
)

alligned_labels = align_labels_with_tokens(
    raw_datasets["train"][0]["ner_tags"], inputs.word_ids()
)
print(
    f"{len(inputs.tokens())=}\n",
    f"{len(alligned_labels)=}\n",
    f"{alligned_labels=}",
)

len(inputs.tokens())=12
 len(alligned_labels)=12
 alligned_labels=[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]


So we need to tokenize and allign labels:

In [25]:
def tokenize_and_align_labels(
    dataset: datasets.arrow_dataset.Dataset,
    token_name: str = "tokens",
    name_tags: str = "ner_tags",
):
    tokenized_inputs = tokenizer(
        dataset[token_name], truncation=True, is_split_into_words=True
    )
    all_labels = dataset[name_tags]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [26]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [27]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

In [28]:
tokenized_datasets["train"]["input_ids"][0], tokenized_datasets["train"][
    "labels"
][0]

([101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102],
 [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100])

## 4.5.Datacollector

Datacollector make alignment according to the longest string in the batch

In [29]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [31]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

In [32]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]


## 4.6.Metrcis

Evaluate: A library for easily evaluating machine learning models and datasets. 
- [github for eveluate](https://github.com/huggingface/evaluate)
- [hugging_face_page for eveluate](https://huggingface.co/evaluate-metric)

In [34]:
import evaluate

metric = evaluate.load("seqeval")

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

### EDA

This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric. 

In [39]:
# Retrieve the list of named entity recognition (NER) tag names from the dataset's features
label_names = raw_datasets["train"].features["ner_tags"].feature.names

# Extract the NER labels for the first entry in the training dataset
labels = raw_datasets["train"][0]["ner_tags"]

# Convert numeric NER labels to their corresponding string representations
labels = [label_names[i] for i in labels]

# Copy the true labels to create a set of 'predicted' labels (for demonstration)
predictions = labels.copy()

# Modify one of the predicted labels to simulate a prediction error
predictions[2] = "O"

# Print the true labels and the modified predictions
print(f"{labels=}")
print(f"{predictions=}")

# Compute a metric (like accuracy, F1-score, etc.) by comparing predictions with true labels
# The 'metric' object should be previously defined or imported from an appropriate library
metric.compute(predictions=[predictions], references=[labels])

labels=['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predictions=['B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}

### define compute metrics function

In [40]:
from typing import Tuple, List, Dict
import numpy as np


def compute_metrics(
    eval_preds: Tuple[np.ndarray, np.ndarray]
) -> Dict[str, float]:
    """
    Compute precision, recall, F1 score, and accuracy for NER predictions.

    This function processes the output of a model's predictions and the true labels,
    computes the NER metrics using a predefined metrics object, and returns the
    calculated precision, recall, F1 score, and accuracy.

    Parameters:
    eval_preds (Tuple[np.ndarray, np.ndarray]): A tuple containing two elements:
        - logits: A numpy array of model logits (predictions before applying activation function).
        - labels: A numpy array of true labels.

    Returns:
    Dict[str, float]: A dictionary containing the computed metrics:
        - 'precision': The overall precision of the model.
        - 'recall': The overall recall of the model.
        - 'f1': The overall F1 score of the model.
        - 'accuracy': The overall accuracy of the model.

    Note:
    The function assumes the existence of a global 'metric' object for computing NER metrics
    and a 'label_names' list mapping label indices to their string representations.
    It also assumes that '-100' is used as the label for special tokens that should be ignored
    in the evaluation.

    Example:
    >>> eval_preds = (model_logits, true_labels)
    >>> metrics = compute_metrics(eval_preds)
    >>> print(metrics)
    """

    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [
        [label_names[l] for l in label if l != -100] for label in labels
    ]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(
        predictions=true_predictions, references=true_labels
    )
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

## 4.7.Defining the model

Since we are working on a token classification problem, we will use the AutoModelForTokenClassification class. The main thing to remember when defining this model is to pass along some information on the number of labels we have. The easiest way to do this is to pass that number with the num_labels argument, but if we want a nice inference widget working like the one we saw at the beginning of this section, it’s better to set the correct label correspondences instead.

In [42]:
label_names = raw_datasets["train"].features["ner_tags"].feature.names
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

print(f"{id2label=}")
print(f"{label2id=}")

id2label={0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
label2id={'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


In [43]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CHECKPOINT,
    id2label=id2label,
    label2id=label2id,
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [44]:
model.config.num_labels

9

## 4.8.Fine-tuning the model

- login in hugging face account

In [45]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

- set arguments for training

In [46]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

- create Trainer and start training

In [47]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

  0%|          | 0/5268 [00:00<?, ?it/s]

{'loss': 0.2635, 'learning_rate': 1.810174639331815e-05, 'epoch': 0.28}
{'loss': 0.104, 'learning_rate': 1.6203492786636296e-05, 'epoch': 0.57}
{'loss': 0.0762, 'learning_rate': 1.4305239179954442e-05, 'epoch': 0.85}


  0%|          | 0/407 [00:00<?, ?it/s]

{'eval_loss': 0.06335246562957764, 'eval_precision': 0.904916965157929, 'eval_recall': 0.9353752945136318, 'eval_f1': 0.9198940748096657, 'eval_accuracy': 0.9822658503561547, 'eval_runtime': 3.2464, 'eval_samples_per_second': 1001.096, 'eval_steps_per_second': 125.368, 'epoch': 1.0}
{'loss': 0.0626, 'learning_rate': 1.240698557327259e-05, 'epoch': 1.14}
{'loss': 0.0419, 'learning_rate': 1.0508731966590738e-05, 'epoch': 1.42}
{'loss': 0.0405, 'learning_rate': 8.610478359908885e-06, 'epoch': 1.71}
{'loss': 0.0361, 'learning_rate': 6.712224753227031e-06, 'epoch': 1.99}


  0%|          | 0/407 [00:00<?, ?it/s]

{'eval_loss': 0.06697254627943039, 'eval_precision': 0.9293379560838699, 'eval_recall': 0.9473241332884551, 'eval_f1': 0.9382448537378115, 'eval_accuracy': 0.9851945605463001, 'eval_runtime': 3.2, 'eval_samples_per_second': 1015.635, 'eval_steps_per_second': 127.189, 'epoch': 2.0}
{'loss': 0.0233, 'learning_rate': 4.8139711465451785e-06, 'epoch': 2.28}
{'loss': 0.0202, 'learning_rate': 2.9157175398633257e-06, 'epoch': 2.56}
{'loss': 0.0243, 'learning_rate': 1.0174639331814731e-06, 'epoch': 2.85}


  0%|          | 0/407 [00:00<?, ?it/s]

{'eval_loss': 0.06140302121639252, 'eval_precision': 0.9370444002650762, 'eval_recall': 0.9518680578929654, 'eval_f1': 0.9443980631157121, 'eval_accuracy': 0.9861658915641373, 'eval_runtime': 3.2968, 'eval_samples_per_second': 985.791, 'eval_steps_per_second': 123.451, 'epoch': 3.0}
{'train_runtime': 181.1464, 'train_samples_per_second': 232.536, 'train_steps_per_second': 29.081, 'train_loss': 0.06650318163493048, 'epoch': 3.0}


TrainOutput(global_step=5268, training_loss=0.06650318163493048, metrics={'train_runtime': 181.1464, 'train_samples_per_second': 232.536, 'train_steps_per_second': 29.081, 'train_loss': 0.06650318163493048, 'epoch': 3.0})

- push to the hugging face hub

In [48]:
trainer.push_to_hub(commit_message="Training complete")

model.safetensors:   0%|          | 0.00/431M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/ilbaks/bert-finetuned-ner/commit/88131286c1bd2c81a2d8df054f069e3fb1ee6c7b', commit_message='Training complete', commit_description='', oid='88131286c1bd2c81a2d8df054f069e3fb1ee6c7b', pr_url=None, pr_revision=None, pr_num=None)

## 4.9.A custom training loop

### 4.9.1.Build dataloaders

In [50]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    dataset=tokenized_datasets["validation"],
    collate_fn=data_collator,
    batch_size=8,
)

### 4.9.2.Reinstantiate model

In [51]:
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CHECKPOINT,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 4.9.3.Initialize optimizer

In [52]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

### 4.9.4.Accelerate prepare

In [53]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

### 4.9.5.Set learning rate scheduler

In [54]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

### 4.9.6.Set a repository object

In [61]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'ilbaks/bert-finetuned-ner-accelerate'

 - clone that repository in a local folder.

In [63]:
output_dir = "bert-finetuned-ner-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

Cloning https://huggingface.co/ilbaks/bert-finetuned-ner-accelerate into local empty directory.


### 4.9.10.Define postprocess function

- postprocess() function that takes predictions and labels and converts them to lists of strings, like our metric object expects:

In [64]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [
        [label_names[l] for l in label if l != -100] for label in labels
    ]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

### 4.9.11.Training loop

- The training in itself, which is the classic iteration over the train_dataloader, forward pass through the model, then backward pass and optimizer step.

- The evaluation, in which there is a novelty after getting the outputs of our model on a batch: since two processes may have padded the inputs and labels to different shapes, we need to use accelerator.pad_across_processes() to make the predictions and labels the same shape before calling the gather() method. If we don’t do this, the evaluation will either error out or hang forever. Then we send the results to metric.add_batch() and call metric.compute() once the evaluation loop is over.

- Saving and uploading, where we first save the model and the tokenizer, then call repo.push_to_hub(). Notice that we use the argument blocking=False to tell the 🤗 Hub library to push in an asynchronous process. This way, training continues normally and this (long) instruction is executed in the background.

In [65]:
# Import necessary libraries
from tqdm.auto import tqdm
import torch

# Initialize a progress bar for tracking training steps
progress_bar = tqdm(range(num_training_steps))

# Loop over each epoch (one pass over the entire dataset)
for epoch in range(num_train_epochs):
    # Training Phase
    model.train()  # Set the model to training mode (enables dropout, batch normalization etc.)

    # Loop over each batch in the training data loader
    for batch in train_dataloader:
        # Forward pass: compute outputs by passing the batch through the model
        outputs = model(**batch)
        loss = outputs.loss  # Extract the loss from the model's outputs

        # Perform backpropagation to compute gradients
        accelerator.backward(loss)

        optimizer.step()  # Update model parameters based on gradients
        lr_scheduler.step()  # Update learning rate
        optimizer.zero_grad()  # Reset gradients to zero for the next iteration
        progress_bar.update(1)  # Update the progress bar

    # Evaluation Phase
    model.eval()  # Set the model to evaluation mode (disables dropout, batch normalization etc.)

    # Loop over each batch in the evaluation data loader
    for batch in eval_dataloader:
        with torch.no_grad():  # Disable gradient calculations for efficiency
            outputs = model(**batch)  # Forward pass: compute outputs

        # Post-process the outputs to extract predictions and labels
        # Get the predicted labels (class with highest logit)
        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Pad predictions and labels for consistent shape across all distributed processes
        predictions = accelerator.pad_across_processes(
            predictions, dim=1, pad_index=-100
        )
        labels = accelerator.pad_across_processes(
            labels, dim=1, pad_index=-100
        )

        # Gather predictions and labels from all processes
        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        # Further processing of gathered predictions and labels (e.g., removing padding)
        true_predictions, true_labels = postprocess(
            predictions_gathered, labels_gathered
        )

        # Add the results of this batch to the metric for later calculation
        metric.add_batch(predictions=true_predictions, references=true_labels)

    # Compute and print the evaluation metrics
    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            # Print precision, recall, F1 score, and accuracy for the epoch
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save the model and tokenizer, and upload to a model hub if needed
    accelerator.wait_for_everyone()  # Ensure all processes are synchronized
    unwrapped_model = accelerator.unwrap_model(
        model
    )  # Unwrap the model from the accelerator
    unwrapped_model.save_pretrained(
        output_dir, save_function=accelerator.save
    )  # Save the model
    if (
        accelerator.is_main_process
    ):  # Check if it's the main process to avoid redundant saves/uploads
        tokenizer.save_pretrained(output_dir)  # Save the tokenizer
        repo.push_to_hub(
            # Push model to the hub
            commit_message=f"Training in progress epoch {epoch}",
            blocking=False,
        )

  0%|          | 0/5268 [00:00<?, ?it/s]

epoch 0: {'precision': 0.9400875126220128, 'recall': 0.9231531978185424, 'f1': 0.9315434003168516, 'accuracy': 0.9840024724789544}
epoch 1: {'precision': 0.947997307303938, 'recall': 0.9286185295087372, 'f1': 0.9382078614257161, 'accuracy': 0.9861806087007712}


Several commits (2) will be pushed upstream.


epoch 2: {'precision': 0.947997307303938, 'recall': 0.9286185295087372, 'f1': 0.9382078614257161, 'accuracy': 0.9861806087007712}


## 4.10.Using the fine-tuned model

In [67]:
from transformers import pipeline

# Replace this with your own checkpoint
MODEL_CHECKPOINT_FINE_TUNED = "huggingface-course/bert-finetuned-ner"
token_classifier = pipeline(
    "token-classification",
    model=MODEL_CHECKPOINT_FINE_TUNED,
    aggregation_strategy="simple",
)
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

pytorch_model.bin:   0%|          | 0.00/431M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

[{'entity_group': 'PER',
  'score': 0.9988506,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9647624,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9986118,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]