<a href="https://colab.research.google.com/github/mmaguero/diploma_fpuna_nlp_ia/blob/master/2025/jopara_token_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Transformers installation
! pip install transformers datasets evaluate accelerate wandb
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git



# Token classification

In [2]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/wVHdVlPScxA?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

This guide will show you how to:

1. Finetune a [Guarani-adapted multilingual BERT](https://huggingface.co/mmaguero/multilingual-bert-gn-base-cased) on the [GUA-SPA 2023](https://github.com/pln-fing-udelar/gua-spa-2023) dataset to label each token with its language and to detect named entities.
2. Use your finetuned model for inference.

<Tip>

To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/token-classification).

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate seqeval
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [3]:
from huggingface_hub import notebook_login

notebook_login()


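## Load the GUA-SPA dataset

Start by cloning the GUA-SPA 2023 repository and installing the `conllu` parser (the commented lines show how other datasets could be loaded from the 🤗 Datasets library instead):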

In [4]:
#from datasets import load_dataset

#wnut = load_dataset("wnut_17", data_dir="wnut_17", revision="refs/convert/parquet")
#gn = load_dataset("unimelb-nlp/wikiann", "gn")
#es = load_dataset("unimelb-nlp/wikiann", "es")

# Download from github: GUA-SPA
# clone repo
! git clone https://github.com/pln-fing-udelar/gua-spa-2023.git
! pip install conllu

fatal: destination path 'gua-spa-2023' already exists and is not an empty directory.


In [44]:
import conllu

def read_conllu_file(filepath):
    with open(filepath, "r", encoding="utf-8") as f:
        data = f.read()
    return conllu.parse(data)

def conllu_to_dataset_format(parsed_data):
    """Convert parsed sentences into parallel lists of tokens and tags."""
    dataset_format = {"tokens": [], "ner_tags": []}
    for sentence in parsed_data:
        tokens = []
        ner_tags = []
        for token in sentence:
            tokens.append(token["form"])
            # The label is read from the field that conllu exposes as 'lemma'
            # (for example 'gn', 'es' or 'ne-b-per').
            # Reset the tag for every token so that a missing label never
            # silently reuses the previous token's tag.
            ner_tag = None
            if token["lemma"] is not None:
                # Keep only the coarse prefix before the first '-' ('ne-b-per' -> 'ne').
                # The label list built later in this notebook uses the full tags,
                # which requires assigning token["lemma"] unsplit instead.
                ner_tag = token["lemma"].split("-")[0]
            ner_tags.append(ner_tag)
        dataset_format["tokens"].append(tokens)
        dataset_format["ner_tags"].append(ner_tags)
    return dataset_format
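
To make the expected file layout concrete, here is a minimal sketch that builds the same `tokens`/`ner_tags` structure from an in-memory string, without the `conllu` package. It assumes three tab-separated columns (index, word form, label) and a blank line between sentences, which is how the `form` and `lemma` fields are read above. The helper name `parse_label_column` and the sample rows are illustrative, not taken from the GUA-SPA files.

```python
def parse_label_column(text):
    """Build {'tokens': [...], 'ner_tags': [...]} from 3-column, tab-separated text."""
    dataset = {"tokens": [], "ner_tags": []}
    tokens, tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                dataset["tokens"].append(tokens)
                dataset["ner_tags"].append(tags)
                tokens, tags = [], []
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        _, form, label = line.split("\t")[:3]
        tokens.append(form)
        tags.append(label)
    if tokens:
        # Flush the last sentence when the text has no trailing blank line.
        dataset["tokens"].append(tokens)
        dataset["ner_tags"].append(tags)
    return dataset


# Hypothetical sample: two short sentences in the assumed layout.
sample = "1\tAldana\tne-b-per\n2\tclase\tes\n3\tpero\tes\n\n1\tha\tgn\n2\t.\tother\n"
parsed = parse_label_column(sample)
print(parsed["tokens"])
print(parsed["ner_tags"])
```

Unlike the function above, this sketch keeps the full label (`ne-b-per`) rather than only its prefix.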


In [45]:
# Read the training split
train_data = read_conllu_file("gua-spa-2023/gua_spa_train.txt")
# Convert to dataset format
train_dataset_format = conllu_to_dataset_format(train_data)
# Print the first example
print(train_dataset_format["tokens"][0])
print(train_dataset_format["ner_tags"][0])

['Aldana', "he'íva", 'umi', 'kits', 'ohóva', 'ha', 'oguahëva', 'opavave', "temimbo'épe", 'oñepyrû', 'mboyve', 'clase', 'pero', 'noñeguahëi', 'mbohapýha', 'ary', 'ohóvape', '.']
['ne', 'gn', 'gn', 'foreign', 'gn', 'gn', 'gn', 'gn', 'gn', 'gn', 'gn', 'es', 'es', 'gn', 'gn', 'gn', 'gn', 'other']


In [46]:
# Read the development (dev) split
dev_data = read_conllu_file("gua-spa-2023/gua_spa_dev_gold.txt")
# Convert to dataset format
dev_dataset_format = conllu_to_dataset_format(dev_data)
# Print the first example
print(dev_dataset_format["tokens"][0])
print(dev_dataset_format["ner_tags"][0])

['Obligarle', 'alguien', 'pa', 'que', 'me', 'escriba', '?', 'No', 'señorito', 'ani', 'nde', 'kangy']
['es', 'es', 'es', 'es', 'es', 'es', 'other', 'es', 'es', 'gn', 'gn', 'gn']


In [47]:
# Read the test split
test_data = read_conllu_file("gua-spa-2023/gua_spa_test_gold.txt")
# Convert to dataset format
test_dataset_format = conllu_to_dataset_format(test_data)
# Print the first example
print(test_dataset_format["tokens"][0])
print(test_dataset_format["ner_tags"][0])

['Igusto', "ñañe'ẽ", 'guaranime', 'ha', 'avei', 'japurahéi', 'umi', 'polka', 'ha', 'guarania']
['mix', 'gn', 'gn', 'gn', 'gn', 'gn', 'gn', 'foreign', 'gn', 'es']


In [9]:
from datasets import Dataset, DatasetDict

# Create Dataset objects from the parsed data
train_dataset = Dataset.from_dict(train_dataset_format)
dev_dataset = Dataset.from_dict(dev_dataset_format)
test_dataset = Dataset.from_dict(test_dataset_format)

# Create a DatasetDict
wnut = DatasetDict({
    "train": train_dataset,
    "validation": dev_dataset,
    "test": test_dataset
})

wnut

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1140
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 180
    })
    test: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 180
    })
})

Then take a look at an example:

In [10]:
wnut["train"][0]

{'tokens': ['Aldana',
  "he'íva",
  'umi',
  'kits',
  'ohóva',
  'ha',
  'oguahëva',
  'opavave',
  "temimbo'épe",
  'oñepyrû',
  'mboyve',
  'clase',
  'pero',
  'noñeguahëi',
  'mbohapýha',
  'ary',
  'ohóvape',
  '.'],
 'ner_tags': ['ne-b-per',
  'gn',
  'gn',
  'foreign',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'es',
  'es',
  'gn',
  'gn',
  'gn',
  'gn',
  'other']}

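Each entry in `ner_tags` is a label string: either a language tag (`gn`, `es`, `mix`, `foreign`, `other`) or a named-entity tag (such as `ne-b-per`). Collect the unique labels from the training split to build the label list: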

In [11]:
# Get the unique tag names from the 'ner_tags' column
unique_tags = set()
for tags_list in wnut["train"]["ner_tags"]:
    unique_tags.update(tags_list)

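# Sort by the last tag component ('es', 'loc', 'per', ...). Tags that share it
# (for example 'ne-b-loc' and 'ne-i-loc') keep the set's iteration order, which
# can change between Python sessions.
label_list = sorted(unique_tags, key=lambda x: x.split('-')[-1])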
label_list

['es',
 'foreign',
 'gn',
 'ne-i-loc',
 'ne-b-loc',
 'mix',
 'ne-i-org',
 'ne-b-org',
 'other',
 'ne-b-per',
 'ne-i-per']

Named-entity tags have the form `ne-<position>-<type>`, where the type is `per`, `loc`, or `org` and the middle letter indicates the token position within the entity:

- `b` indicates the beginning of an entity.
- `i` indicates a token is contained inside the same entity (for example, the `State` token is a part of an entity like
  `Empire State Building`).
- Tokens that don't correspond to any entity carry a language label (`gn`, `es`, `mix`, `foreign`, or `other`) instead of the usual `O` tag.
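
As an illustration of this tagging scheme, the sketch below groups consecutive `ne-b-*` and `ne-i-*` tags into entity spans. The helper name `extract_entities` and the example tags are hand-written assumptions for illustration. They are not model output and not part of the dataset tooling.

```python
def extract_entities(tokens, tags):
    """Group 'ne-b-<type>' / 'ne-i-<type>' tags into (entity_text, type) spans."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        parts = tag.split("-")
        is_entity = len(parts) == 3 and parts[0] == "ne"
        if is_entity and parts[1] == "b":
            # A 'b' tag always starts a new entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], parts[2]
        elif is_entity and parts[1] == "i" and current_type == parts[2]:
            # An 'i' tag continues the open entity of the same type.
            current_words.append(token)
        else:
            # A language label (or a stray 'i' tag) closes any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-written tags for illustration only.
tokens = ["Golden", "State", "Warriors", "ha'e", "equipo"]
tags = ["ne-b-org", "ne-i-org", "ne-i-org", "gn", "es"]
print(extract_entities(tokens, tags))
```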

In [12]:
def assign_ner_id(example):
    # Keep the original label strings and replace 'ner_tags' with integer ids.
    example["ner_tags_name"] = example["ner_tags"]
    example["ner_tags"] = []
    for tag in example["ner_tags_name"]:
        example["ner_tags"].append(label_list.index(tag))
    return example

wnut = wnut.map(assign_ner_id)
wnut["train"][0], wnut["test"][0], wnut["validation"][0]

Map:   0%|          | 0/1140 [00:00<?, ? examples/s]

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

({'tokens': ['Aldana',
   "he'íva",
   'umi',
   'kits',
   'ohóva',
   'ha',
   'oguahëva',
   'opavave',
   "temimbo'épe",
   'oñepyrû',
   'mboyve',
   'clase',
   'pero',
   'noñeguahëi',
   'mbohapýha',
   'ary',
   'ohóvape',
   '.'],
  'ner_tags': [9, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 8],
  'ner_tags_name': ['ne-b-per',
   'gn',
   'gn',
   'foreign',
   'gn',
   'gn',
   'gn',
   'gn',
   'gn',
   'gn',
   'gn',
   'es',
   'es',
   'gn',
   'gn',
   'gn',
   'gn',
   'other']},
 {'tokens': ['Igusto',
   "ñañe'ẽ",
   'guaranime',
   'ha',
   'avei',
   'japurahéi',
   'umi',
   'polka',
   'ha',
   'guarania'],
  'ner_tags': [5, 2, 2, 2, 2, 2, 2, 1, 2, 0],
  'ner_tags_name': ['mix',
   'gn',
   'gn',
   'gn',
   'gn',
   'gn',
   'gn',
   'foreign',
   'gn',
   'es']},
 {'tokens': ['Obligarle',
   'alguien',
   'pa',
   'que',
   'me',
   'escriba',
   '?',
   'No',
   'señorito',
   'ani',
   'nde',
   'kangy'],
  'ner_tags': [0, 0, 0, 0, 0, 0, 8,

In [13]:
ner = zip(wnut["train"][0]["tokens"], wnut["train"][0]["ner_tags"])
for item in ner:
  print(f"{item} => {label_list[item[1]]}")

('Aldana', 9) => ne-b-per
("he'íva", 2) => gn
('umi', 2) => gn
('kits', 1) => foreign
('ohóva', 2) => gn
('ha', 2) => gn
('oguahëva', 2) => gn
('opavave', 2) => gn
("temimbo'épe", 2) => gn
('oñepyrû', 2) => gn
('mboyve', 2) => gn
('clase', 0) => es
('pero', 0) => es
('noñeguahëi', 2) => gn
('mbohapýha', 2) => gn
('ary', 2) => gn
('ohóvape', 2) => gn
('.', 8) => other


## Preprocess

In [14]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/iY2AZYdZAr0?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



The next step is to load an mBERT tokenizer to preprocess the `tokens` field:

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mmaguero/multilingual-bert-gn-base-cased") #"distilbert/distilbert-base-uncased")



As you saw in the example above, the `tokens` field is already split into words, but not yet into the subwords the model expects. Set `is_split_into_words=True` so the tokenizer treats each list element as one word and splits it into subwords. For example:

In [16]:
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 'Al',
 '##dana',
 'he',
 "'",
 'í',
 '##va',
 'um',
 '##i',
 'kit',
 '##s',
 'o',
 '##h',
 '##ó',
 '##va',
 'ha',
 'og',
 '##uah',
 '##ë',
 '##va',
 'op',
 '##ava',
 '##ve',
 'temi',
 '##mbo',
 "'",
 'é',
 '##pe',
 'o',
 '##ñ',
 '##ep',
 '##yr',
 '##û',
 'm',
 '##boy',
 '##ve',
 'clase',
 'pero',
 'no',
 '##ñ',
 '##egu',
 '##ah',
 '##ë',
 '##i',
 'm',
 '##bo',
 '##ha',
 '##p',
 '##ý',
 '##ha',
 'ary',
 'o',
 '##h',
 '##ó',
 '##va',
 '##pe',
 '.',
 '[SEP]']

However, this adds the special tokens `[CLS]` and `[SEP]`, and the subword tokenization creates a mismatch between the input and the labels: a single word with a single label may now be split into several subwords (`Aldana` becomes `Al` and `##dana`). You'll need to realign the tokens and labels by:

1. Mapping all tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function (see [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word.

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than the model's maximum input length:

In [17]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
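
To see what this alignment produces, the sketch below applies the same rule to a hand-written `word_ids` list. The list mimics how the first three words of the example sentence (`Aldana`, `he'íva`, `umi`) are split into subwords, with `None` standing for `[CLS]` and `[SEP]`. The helper name `align_labels` is illustrative, and the word labels `[9, 2, 2]` are the ids shown for those words earlier.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword and ignore_index elsewhere."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP].
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First subword of a new word carries the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining subwords of the same word are ignored by the loss.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# [CLS] Al ##dana he ' í ##va um ##i [SEP]
word_ids = [None, 0, 0, 1, 1, 1, 1, 2, 2, None]
word_labels = [9, 2, 2]
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 9, -100, 2, -100, -100, -100, 2, -100, -100]
```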

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [18]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/1140 [00:00<?, ? examples/s]

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Now create a batch of examples using [DataCollatorForTokenClassification](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForTokenClassification), which pads the labels as well as the inputs. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [19]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
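
The idea behind dynamic padding can be sketched in plain Python. This is not the library's implementation: the helper name `pad_batch`, the padding id `0`, and the toy input ids are assumptions for illustration. The label padding value `-100` matches the value the loss function ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every feature to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get mask 1, padding gets mask 0.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Toy features with made-up token ids.
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 2, -100, 0, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 2, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Because the target length is computed per batch, a batch of short sentences is never padded to the length of the longest sentence in the whole dataset.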

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) framework (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric). Seqeval produces several scores: precision, recall, F1, and accuracy.

In [20]:
!pip install seqeval



In [21]:
import evaluate

seqeval = evaluate.load("seqeval")

Get the NER labels first, and then create a function that passes your true predictions and true labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the scores:

In [22]:
import numpy as np

labels = [label_list[i] for i in example["ner_tags"]]


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
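
The filtering step inside `compute_metrics` can be checked on toy data without NumPy or seqeval. The sketch below is a plain-Python stand-in: the helper names `decode_predictions` and `token_accuracy`, the three-label list, and the logits are made up for illustration. It computes simple token-level accuracy, not seqeval's entity-level scores.

```python
def decode_predictions(logits, label_ids, label_names, ignore_index=-100):
    """Turn per-token logits into tag names, skipping positions labeled ignore_index."""
    pred_tags, gold_tags = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_pred, sent_gold = [], []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == ignore_index:
                continue  # special token or non-first subword
            # Index of the largest logit (the argmax over the label axis).
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            sent_pred.append(label_names[best])
            sent_gold.append(label_names[gold])
        pred_tags.append(sent_pred)
        gold_tags.append(sent_gold)
    return pred_tags, gold_tags


def token_accuracy(pred_tags, gold_tags):
    """Fraction of kept tokens whose predicted tag equals the gold tag."""
    pairs = [(p, g) for ps, gs in zip(pred_tags, gold_tags) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs)


# One toy sentence: 4 tokens, 3 labels, first token ignored.
label_names = ["es", "foreign", "gn"]
logits = [[[0.1, 0.2, 0.7], [2.0, 0.1, 0.3], [0.0, 0.0, 1.0], [0.5, 0.4, 0.1]]]
label_ids = [[-100, 0, 2, 1]]
pred_tags, gold_tags = decode_predictions(logits, label_ids, label_names)
print(pred_tags, gold_tags, token_accuracy(pred_tags, gold_tags))
```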

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`. Build both from `label_list` so the ids match the ones assigned to the dataset (11 labels, including `other`):

In [23]:
# Derive the mappings from label_list so they cannot drift from the dataset ids.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load the Guarani-adapted multilingual BERT checkpoint with [AutoModelForTokenClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForTokenClassification) along with the number of expected labels, and the label mappings:

In [24]:
#!pip uninstall -y transformers
#!pip install transformers[tf]

In [25]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "mmaguero/multilingual-bert-gn-base-cased", # "distilbert/distilbert-base-uncased",
    num_labels=len(label_list), id2label=id2label, label2id=label2id
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at mmaguero/multilingual-bert-gn-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the seqeval scores and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [27]:
training_args = TrainingArguments(
    output_dir="langid-ner-multilingual-bert-gn-base-cased",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=30,  # 3 to 10 epochs or more
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=3,
    push_to_hub=False,  # set to True to upload to the Hugging Face Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.779106 | 0.664557 | 0.556537 | 0.605769 | 0.795584 |
| 2 | No log | 0.569039 | 0.704676 | 0.647821 | 0.675054 | 0.839411 |
| 3 | No log | 0.482805 | 0.743719 | 0.697291 | 0.719757 | 0.858147 |
| 4 | No log | 0.444475 | 0.771358 | 0.739105 | 0.754887 | 0.875544 |
| 5 | No log | 0.426901 | 0.775227 | 0.755595 | 0.765285 | 0.881231 |
| 6 | No log | 0.428799 | 0.777644 | 0.757951 | 0.767671 | 0.883908 |
| 7 | 0.519000 | 0.426755 | 0.781982 | 0.766784 | 0.774309 | 0.885915 |
| 8 | 0.519000 | 0.429793 | 0.776442 | 0.760895 | 0.76859 | 0.883239 |
| 9 | 0.519000 | 0.429981 | 0.783622 | 0.772085 | 0.777811 | 0.88625 |
| 10 | 0.519000 | 0.429843 | 0.786186 | 0.770907 | 0.778472 | 0.887253 |
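
With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the checkpoint kept at the end is expected to be the one with the highest validation F1, which need not be the one with the lowest validation loss. The sketch below shows that selection on the logged values above. The variable names are illustrative and the code only mimics the comparison, not the Trainer internals.

```python
# (epoch, validation_loss, f1) copied from the training log above.
history = [
    (1, 0.779106, 0.605769),
    (2, 0.569039, 0.675054),
    (3, 0.482805, 0.719757),
    (4, 0.444475, 0.754887),
    (5, 0.426901, 0.765285),
    (6, 0.428799, 0.767671),
    (7, 0.426755, 0.774309),
    (8, 0.429793, 0.76859),
    (9, 0.429981, 0.777811),
    (10, 0.429843, 0.778472),
]

# Higher is better for F1, lower is better for the loss.
best_by_f1 = max(history, key=lambda row: row[2])
best_by_loss = min(history, key=lambda row: row[1])

print("best epoch by F1:", best_by_f1[0])
print("best epoch by validation loss:", best_by_loss[0])
```

In this run the two criteria disagree: F1 peaks at epoch 10, while the validation loss is lowest at epoch 7.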




TrainOutput(global_step=720, training_loss=0.4205164379543728, metrics={'train_runtime': 494.8351, 'train_samples_per_second': 23.038, 'train_steps_per_second': 1.455, 'total_flos': 555780551900760.0, 'train_loss': 0.4205164379543728, 'epoch': 10.0})

In [28]:
trainer.evaluate(tokenized_wnut["test"])



{'eval_loss': 0.402357816696167,
 'eval_precision': 0.7989918084436043,
 'eval_recall': 0.7647768395657418,
 'eval_f1': 0.7815100154083204,
 'eval_accuracy': 0.8935946797339867,
 'eval_runtime': 1.6323,
 'eval_samples_per_second': 110.271,
 'eval_steps_per_second': 7.351,
 'epoch': 10.0}

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [29]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/mmaguero/langid-ner-multilingual-bert-gn-base-cased/commit/8e72bd4bb18b87c6a26069c3974c106f7fa01753', commit_message='End of training', commit_description='', oid='8e72bd4bb18b87c6a26069c3974c106f7fa01753', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mmaguero/langid-ner-multilingual-bert-gn-base-cased', endpoint='https://huggingface.co', repo_type='model', repo_id='mmaguero/langid-ner-multilingual-bert-gn-base-cased'), pr_revision=None, pr_num=None)

<Tip>

For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [30]:
text = "Golden State Warriors ha'e peteĩ equipo profesional de baloncesto USA pegua oĩva San Francisco-pe."
text_tokens = text.split()

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for NER with your model, and pass your text to it:

In [31]:
from transformers import pipeline

classifier = pipeline("ner", model="mmaguero/langid-ner-multilingual-bert-gn-base-cased", aggregation_strategy="max")


Device set to use cuda:0


In [32]:
classifier(text_tokens)

[[{'entity_group': 'other',
   'score': np.float32(0.23549233),
   'word': 'Golden',
   'start': 0,
   'end': 6}],
 [{'entity_group': 'other',
   'score': np.float32(0.21681115),
   'word': 'State',
   'start': 0,
   'end': 5}],
 [{'entity_group': 'other',
   'score': np.float32(0.22447829),
   'word': 'Warriors',
   'start': 0,
   'end': 8}],
 [{'entity_group': 'gn',
   'score': np.float32(0.93949866),
   'word': "ha ' e",
   'start': 0,
   'end': 4}],
 [{'entity_group': 'gn',
   'score': np.float32(0.97049654),
   'word': 'peteĩ',
   'start': 0,
   'end': 5}],
 [{'entity_group': 'other',
   'score': np.float32(0.24841504),
   'word': 'equipo',
   'start': 0,
   'end': 6}],
 [{'entity_group': 'other',
   'score': np.float32(0.2571465),
   'word': 'profesional',
   'start': 0,
   'end': 11}],
 [{'entity_group': 'other',
   'score': np.float32(0.32848254),
   'word': 'de',
   'start': 0,
   'end': 2}],
 [{'entity_group': 'other',
   'score': np.float32(0.31506222),
   'word': 'balonces
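
Because the pipeline receives a list of strings here, it treats every word as a separate input and returns one list of results per word, which is why each `start` offset is `0`. A small sketch of pairing the input words with their predicted groups follows. The helper name `labels_per_word` is illustrative, and the sample results are plain dictionaries modeled on the output above (with ordinary floats instead of `np.float32`).

```python
def labels_per_word(words, results):
    """Pair each input word with the list of entity groups predicted for it."""
    rows = []
    for word, groups in zip(words, results):
        # A word can yield several groups when its subwords get different labels.
        rows.append((word, [group["entity_group"] for group in groups]))
    return rows


# Sample results shaped like the pipeline output shown above.
words = ["Golden", "ha'e"]
results = [
    [{"entity_group": "other", "score": 0.23549233, "word": "Golden", "start": 0, "end": 6}],
    [{"entity_group": "gn", "score": 0.93949866, "word": "ha ' e", "start": 0, "end": 4}],
]
rows = labels_per_word(words, results)
print(rows)
```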

In [33]:
wnut["test"][0]["tokens"], wnut["test"][0]["ner_tags"]

(['Igusto',
  "ñañe'ẽ",
  'guaranime',
  'ha',
  'avei',
  'japurahéi',
  'umi',
  'polka',
  'ha',
  'guarania'],
 [5, 2, 2, 2, 2, 2, 2, 1, 2, 0])

In [34]:
classifier(wnut["test"][0]["tokens"])

[[{'entity_group': 'es',
   'score': np.float32(0.7024464),
   'word': 'Igusto',
   'start': 0,
   'end': 6}],
 [{'entity_group': 'gn',
   'score': np.float32(0.86944723),
   'word': "ñañe ' ẽ",
   'start': 0,
   'end': 6}],
 [{'entity_group': 'gn',
   'score': np.float32(0.83340085),
   'word': 'guaranime',
   'start': 0,
   'end': 9}],
 [{'entity_group': 'gn',
   'score': np.float32(0.87026954),
   'word': 'ha',
   'start': 0,
   'end': 2}],
 [{'entity_group': 'gn',
   'score': np.float32(0.9621354),
   'word': 'avei',
   'start': 0,
   'end': 4}],
 [{'entity_group': 'gn',
   'score': np.float32(0.9341973),
   'word': 'japurahéi',
   'start': 0,
   'end': 9}],
 [{'entity_group': 'gn',
   'score': np.float32(0.93919235),
   'word': 'umi',
   'start': 0,
   'end': 3}],
 [{'entity_group': 'gn',
   'score': np.float32(0.3347104),
   'word': 'polka',
   'start': 0,
   'end': 5}],
 [{'entity_group': 'gn',
   'score': np.float32(0.87026954),
   'word': 'ha',
   'start': 0,
   'end': 2}

In [35]:
text = "Mañana voy a ir a Asunción y luego iré a Luque a la casa de mi tío Carlos."
text_tokens = text.split()
text_tokens, classifier(text_tokens)

(['Mañana',
  'voy',
  'a',
  'ir',
  'a',
  'Asunción',
  'y',
  'luego',
  'iré',
  'a',
  'Luque',
  'a',
  'la',
  'casa',
  'de',
  'mi',
  'tío',
  'Carlos.'],
 [[{'entity_group': 'gn',
    'score': np.float32(0.4435442),
    'word': 'Mañana',
    'start': 0,
    'end': 6}],
  [{'entity_group': 'other',
    'score': np.float32(0.50171155),
    'word': 'voy',
    'start': 0,
    'end': 3}],
  [{'entity_group': 'other',
    'score': np.float32(0.59729064),
    'word': 'a',
    'start': 0,
    'end': 1}],
  [{'entity_group': 'other',
    'score': np.float32(0.32785127),
    'word': 'ir',
    'start': 0,
    'end': 2}],
  [{'entity_group': 'other',
    'score': np.float32(0.59729064),
    'word': 'a',
    'start': 0,
    'end': 1}],
  [{'entity_group': 'other',
    'score': np.float32(0.19881094),
    'word': 'Asunción',
    'start': 0,
    'end': 8}],
  [{'entity_group': 'other',
    'score': np.float32(0.19654988),
    'word': 'y',
    'start': 0,
    'end': 1}],
  [{'entity_

In [36]:
text = "Ko'ero aháta Paraguay-pe ha upé ahásata Luque-pe che tío Kálo ogápe."
text_tokens = text.split()
text_tokens, classifier(text_tokens)

(["Ko'ero",
  'aháta',
  'Paraguay-pe',
  'ha',
  'upé',
  'ahásata',
  'Luque-pe',
  'che',
  'tío',
  'Kálo',
  'ogápe.'],
 [[{'entity_group': 'gn',
    'score': np.float32(0.8878522),
    'word': "Ko ' ero",
    'start': 0,
    'end': 6}],
  [{'entity_group': 'gn',
    'score': np.float32(0.74060357),
    'word': 'aháta',
    'start': 0,
    'end': 5}],
  [{'entity_group': 'b-loc',
    'score': np.float32(0.4306175),
    'word': 'Paraguay',
    'start': 0,
    'end': 8},
   {'entity_group': 'gn',
    'score': np.float32(0.7259561),
    'word': '- pe',
    'start': 8,
    'end': 11}],
  [{'entity_group': 'gn',
    'score': np.float32(0.87026954),
    'word': 'ha',
    'start': 0,
    'end': 2}],
  [{'entity_group': 'gn',
    'score': np.float32(0.9577462),
    'word': 'upé',
    'start': 0,
    'end': 3}],
  [{'entity_group': 'mix',
    'score': np.float32(0.4044411),
    'word': 'ahásata',
    'start': 0,
    'end': 7}],
  [{'entity_group': 'b-loc',
    'score': np.float32(

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [37]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mmaguero/langid-ner-multilingual-bert-gn-base-cased")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [38]:
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("mmaguero/langid-ner-multilingual-bert-gn-base-cased")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [39]:
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
inputs, predicted_token_class

({'input_ids': tensor([[  101, 30186,   112, 10163, 10133, 69863, 86367, 25118,   118, 11161,
          10228, 10741, 10333, 69863, 17082, 10213, 23859, 11189,   118, 11161,
          10262, 94354,   148, 14205, 10133, 10156, 77588, 10112,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1]])},
 ['other',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'ne-b-loc',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'ne-b-loc',
  'ne-i-per',
  'gn',
  'gn',
  'gn',
  'es',
  'mix',
  'gn',
  'gn',
  'gn',
  'gn',
  'gn',
  'other',
  'other'])
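
The `score` values reported by the pipeline are probabilities, while the manual route above works on raw logits. A minimal sketch of the relation: applying a softmax to one token's logits gives the per-label probabilities, and the label with the highest logit is also the one with the highest probability. The helper names `softmax` and `top_label`, the three-label mapping, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return the most probable label and its probability for one token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for a single token over three labels.
toy_id2label = {0: "es", 1: "foreign", 2: "gn"}
label, score = top_label([0.0, 0.0, math.log(2)], toy_id2label)
print(label, round(score, 3))
```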