<a href="https://colab.research.google.com/github/mmaguero/diploma_fpuna_nlp_ia/blob/master/2025/token_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# Transformers installation
! pip install transformers datasets evaluate accelerate wandb
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


# Token classification

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/wVHdVlPScxA?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

This guide will show you how to:

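1. Finetune a multilingual BERT checkpoint ([mmaguero/multilingual-bert-gn-base-cased](https://huggingface.co/mmaguero/multilingual-bert-gn-base-cased)) on the Guarani (`gn`) and Spanish (`es`) subsets of the [WikiANN](https://huggingface.co/datasets/unimelb-nlp/wikiann) dataset to detect named entities.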
2. Use your finetuned model for inference.

<Tip>

To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/token-classification).

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate seqeval
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Load WikiANN dataset

Start by loading the Guarani (`gn`) and Spanish (`es`) subsets of the WikiANN dataset from the 🤗 Datasets library:

In [12]:
from datasets import load_dataset

#wnut = load_dataset("wnut_17", data_dir="wnut_17", revision="refs/convert/parquet")
gn = load_dataset("unimelb-nlp/wikiann", "gn")
es = load_dataset("unimelb-nlp/wikiann", "es")

README.md: 0.00B [00:00, ?B/s]

gn/validation-00000-of-00001.parquet:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

gn/test-00000-of-00001.parquet:   0%|          | 0.00/11.0k [00:00<?, ?B/s]

gn/train-00000-of-00001.parquet:   0%|          | 0.00/12.2k [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/100 [00:00<?, ? examples/s]

es/validation-00000-of-00001.parquet:   0%|          | 0.00/608k [00:00<?, ?B/s]

es/test-00000-of-00001.parquet:   0%|          | 0.00/608k [00:00<?, ?B/s]

es/train-00000-of-00001.parquet:   0%|          | 0.00/1.22M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

In [14]:
from datasets import concatenate_datasets

# Concatenate train, validation, and test sets from both languages
combined_dataset = concatenate_datasets(
    [gn["train"], gn["validation"], gn["test"],
     es["train"], es["validation"], es["test"]]
)

# Split the combined dataset into training and the rest (which will be split again into validation and test)
train_testvalid = combined_dataset.train_test_split(
    test_size=0.3, shuffle=True, seed=42
)

# Split the held-out 30% into validation (25% of it) and test (75% of it)
wnut = train_testvalid["test"].train_test_split(
    test_size=0.75, shuffle=True, seed=42
)

# Rename the splits to 'train', 'validation', and 'test'.
# The data is WikiANN; the `wnut` variable name is kept from the original WNUT 17 tutorial.
wnut["validation"] = wnut.pop("train")  # The smaller part of the second split becomes validation
wnut["train"] = train_testvalid["train"]
wnut["test"] = wnut.pop("test")  # Re-insert so that 'test' is listed last

wnut

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 3022
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 28210
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 9068
    })
})
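
The row counts above can be checked by hand. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the other split, which is consistent with the numbers printed above. The input sizes (100 rows per Guarani split, and 20,000 / 10,000 / 10,000 rows for the Spanish train / validation / test splits) are taken from the loading output; `split_sizes` is a hypothetical helper, not a library function.

```python
import math
from fractions import Fraction


def split_sizes(n_rows, test_fraction):
    """Return (n_first, n_second), rounding the second (test) share up."""
    n_second = math.ceil(test_fraction * n_rows)
    return n_rows - n_second, n_second


# Three Guarani splits of 100 rows plus the Spanish train/validation/test splits.
total_rows = 3 * 100 + 20000 + 10000 + 10000

# First split: 70% train, 30% held out.
n_train, n_holdout = split_sizes(total_rows, Fraction(3, 10))

# Second split: the held-out part is divided 25% validation, 75% test.
n_validation, n_test = split_sizes(n_holdout, Fraction(3, 4))

print(n_train, n_validation, n_test)
```

Overall this gives roughly 70% of the rows for training, 7.5% for validation, and 22.5% for testing.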

Then take a look at an example:

In [15]:
wnut["train"][0]

{'tokens': ['San',
  'Francisco',
  ',',
  'California',
  ',',
  'Estados',
  'Unidos',
  '.'],
 'ner_tags': [5, 6, 0, 5, 0, 5, 6, 0],
 'langs': ['es', 'es', 'es', 'es', 'es', 'es', 'es', 'es'],
 'spans': ['LOC: San Francisco', 'LOC: California', 'LOC: Estados Unidos']}

Each number in `ner_tags` represents an entity. Convert the numbers to their label names to find out what the entities are:

In [None]:
label_list = wnut["train"].features["ner_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

The prefix of each `ner_tag` indicates the position of the token within an entity:

- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (for example, the `State` token is a part of an entity like
  `Empire State Building`).
- `O` (tag id `0`) indicates the token doesn't correspond to any entity.
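
As a rough sketch of how these prefixes combine into entities, the snippet below groups a tagged sentence into spans. `bio_to_spans` is a hypothetical helper written for illustration (it is not part of the Datasets library), and the tokens and tags are those of the first training example shown earlier.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (this sketch simply drops such stray I- tags).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["San", "Francisco", ",", "California", ",", "Estados", "Unidos", "."]
tags = ["B-LOC", "I-LOC", "O", "B-LOC", "O", "B-LOC", "I-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The three spans it finds correspond to the `spans` field of that example.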

In [None]:
ner = zip(wnut["train"][0]["tokens"], wnut["train"][0]["ner_tags"])
for item in ner:
  print(f"{item} => {label_list[item[1]]}")

('San', 5) => B-LOC
('Francisco', 6) => I-LOC
(',', 0) => O
('California', 5) => B-LOC
(',', 0) => O
('Estados', 5) => B-LOC
('Unidos', 6) => I-LOC
('.', 0) => O


## Preprocess

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/iY2AZYdZAr0?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



The next step is to load an mBERT tokenizer to preprocess the `tokens` field:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mmaguero/multilingual-bert-gn-base-cased") #"distilbert/distilbert-base-uncased")


As you saw in the example `tokens` field above, it looks like the input has already been tokenized. But the input actually hasn't been tokenized yet and you'll need to set `is_split_into_words=True` to tokenize the words into subwords. For example:

In [None]:
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 'San',
 'Francisco',
 ',',
 'California',
 ',',
 'Estados',
 'Unidos',
 '.',
 '[SEP]']

However, this adds some special tokens `[CLS]` and `[SEP]` and the subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may now be split into two subwords. You'll need to realign the tokens and labels by:

1. Mapping all tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function (see [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word.

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than the model's maximum input length:

In [4]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
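
To see what the inner loop does when a word is split into several subwords, here is a minimal standalone sketch of the same alignment rule. The `word_ids` list is hand-written to stand in for what the tokenizer would return (it is an assumption for illustration, not real tokenizer output), and `align_labels` is a hypothetical helper.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first subword of each word; mask everything else."""
    aligned, previous_word_id = [], None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] have no word id.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First subword of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining subwords of the same word are ignored by the loss.
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Hypothetical case: two words, where word 0 is split into three subwords.
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [5, 0]  # B-LOC, O
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Only one position per word carries a real label, so the loss and the metrics are computed once per word regardless of how many subwords it produces.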

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)


Now create a batch of examples using [DataCollatorForTokenClassification](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForTokenClassification). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. This collator also pads the `labels` with `-100`, so the padded positions are ignored by the loss.

In [5]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
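
The idea of dynamic padding can be illustrated without the library. The sketch below is a simplified stand-in, not the collator's actual implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of `0` is assumed.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        gap = longest - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        # Real tokens get 1, padding gets 0.
        batch["attention_mask"].append([1] * length + [0] * gap)
        # Padded label positions use -100 so the loss skips them.
        batch["labels"].append(f["labels"] + [label_pad_id] * gap)
    return batch


# Made-up ids for two sequences of different lengths.
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 5, 6, 0, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Because the target length is recomputed for each batch, short sentences are never padded to the length of the longest sentence in the whole dataset.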


## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) framework (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric). Seqeval produces several scores: precision, recall, F1, and accuracy.

In [None]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=bd6169aed37e87f034dec4a607be438b205da0968e3eb833aded1479371a8888
  Stored in directory: /root/.cache/pip/wheels/5f/b8/73/0b2c1a76b701a677653dd79ece07cfabd7457989dbfbdcd8d7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [None]:
import evaluate

seqeval = evaluate.load("seqeval")


Get the NER labels first, and then create a function that passes your true predictions and true labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the scores:

In [None]:
import numpy as np

labels = [label_list[i] for i in example["ner_tags"]]


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
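
The filtering step inside `compute_metrics` can be traced on a tiny hand-made example. This is only a sketch: the scores and the three-label set below are invented for illustration, and `drop_ignored` is a hypothetical helper that handles a single sequence with plain lists instead of NumPy arrays.

```python
LABELS = ["O", "B-LOC", "I-LOC"]  # reduced label set, for illustration only


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def drop_ignored(token_scores, label_ids, ignore_index=-100):
    """Keep only positions with a real label and map ids to label names."""
    predictions, references = [], []
    for scores, gold in zip(token_scores, label_ids):
        if gold == ignore_index:
            continue  # special token or non-first subword
        predictions.append(LABELS[argmax(scores)])
        references.append(LABELS[gold])
    return predictions, references


token_scores = [
    [0.1, 0.2, 0.3],  # [CLS], ignored
    [0.2, 2.5, 0.1],  # first subword of a word
    [0.3, 0.1, 1.9],  # continuation subword, ignored
    [3.0, 0.2, 0.1],  # next word
    [0.0, 0.0, 0.0],  # [SEP], ignored
]
label_ids = [-100, 1, -100, 0, -100]

predictions, references = drop_ignored(token_scores, label_ids)
print(predictions, references)
```

Whatever the model predicts at the masked positions never reaches the metric, which is why only two of the five positions survive here.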

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [None]:
id2label = {
    0:'O', 1:'B-PER', 2:'I-PER', 3:'B-ORG', 4:'I-ORG', 5:'B-LOC', 6:'I-LOC'
}
label2id = {
    'O':0, 'B-PER':1, 'I-PER':2, 'B-ORG':3, 'I-ORG':4, 'B-LOC':5, 'I-LOC':6
}
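
As an alternative to writing both dictionaries by hand, they can be derived from the list of label names so the two mappings cannot drift apart. In this sketch `label_list` is written out literally so the snippet runs on its own; in the notebook it would come from the dataset features.

```python
# Label names in tag-id order, as printed earlier in this notebook.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Position in the list is the tag id.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id)
```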

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load the multilingual BERT checkpoint with [AutoModelForTokenClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForTokenClassification) along with the number of expected labels, and the label mappings:

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "mmaguero/multilingual-bert-gn-base-cased", # "distilbert/distilbert-base-uncased",
    num_labels=7, id2label=id2label, label2id=label2id
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at mmaguero/multilingual-bert-gn-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the seqeval scores and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
training_args = TrainingArguments(
    output_dir="wikiann-multilingual-bert-gn-base-cased",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,  # 3 to 10 or more epochs
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=3,
    push_to_hub=True,  # True to upload to the Hugging Face Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.3299 | 0.279737 | 0.755916 | 0.800599 | 0.777616 | 0.908357 |
| 2 | 0.2217 | 0.250314 | 0.793077 | 0.836284 | 0.814108 | 0.921361 |
| 3 | 0.1603 | 0.255052 | 0.800415 | 0.840098 | 0.819777 | 0.923623 |


TrainOutput(global_step=5292, training_loss=0.24089568362304503, metrics={'train_runtime': 829.9485, 'train_samples_per_second': 101.97, 'train_steps_per_second': 6.376, 'total_flos': 1116570492788652.0, 'train_loss': 0.24089568362304503, 'epoch': 3.0})
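
The `global_step` value reported above can be reproduced from the training setup. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

train_rows = 28210  # size of the training split
batch_size = 16     # per_device_train_batch_size
num_epochs = 3      # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With more devices or with gradient accumulation, the number of optimizer steps per epoch would be correspondingly smaller.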

In [None]:
trainer.evaluate(tokenized_wnut["test"])

{'eval_loss': 0.2677244246006012,
 'eval_precision': 0.8097113984756359,
 'eval_recall': 0.8414916340334638,
 'eval_f1': 0.8252956836730241,
 'eval_accuracy': 0.9235716600986063,
 'eval_runtime': 16.0002,
 'eval_samples_per_second': 566.744,
 'eval_steps_per_second': 35.437,
 'epoch': 3.0}

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/mmaguero/wikiann-multilingual-bert-gn-base-cased/commit/7303d136f4823417da58cc0038cb8740941fcceb', commit_message='End of training', commit_description='', oid='7303d136f4823417da58cc0038cb8740941fcceb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mmaguero/wikiann-multilingual-bert-gn-base-cased', endpoint='https://huggingface.co', repo_type='model', repo_id='mmaguero/wikiann-multilingual-bert-gn-base-cased'), pr_revision=None, pr_num=None)

<Tip>

For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [6]:
text = "Golden State Warriors ha'e peteĩ equipo profesional de baloncesto USA pegua oĩva San Francisco-pe."
text_tokens = text.split()

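The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for NER with your model, and pass your text to it. Note that the cells below pass `text_tokens`, a list of whitespace-split words, so the pipeline treats each word as a separate input: the result contains one list per word, and the `start` and `end` offsets are relative to each word rather than to the full sentence.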

In [26]:
from transformers import pipeline

classifier = pipeline("ner", model="mmaguero/wikiann-multilingual-bert-gn-base-cased", aggregation_strategy="max")

Device set to use cuda:0


The `aggregation_strategy` parameter of `pipeline("ner", ...)` controls how subword predictions are combined into entity predictions. If you don't set it, the default is `"none"`.

*   `"none"`: No aggregation is performed. Each subword receives its own prediction. This is useful if you need fine-grained control or want to analyze subword-level predictions directly.

*   `"simple"`: Adjacent tokens with the same entity type are grouped into a single entity (for example, a `B-LOC` token followed by `I-LOC` tokens). With subword tokenizers, the pieces of a single word can still end up with different tags.

*   `"first"`: Like `"simple"`, but a word cannot end up with different tags. Each word takes the tag of its first subword.

*   `"average"`: Like `"simple"`, but the scores are averaged across all subwords of a word, and the label with the highest average score is assigned to the word.

*   `"max"` (used above): Like `"simple"`, but each word takes the label of its subword with the highest score.

Here `"max"` is a reasonable choice because a word can still be recognized as an entity when some of its subwords have lower confidence scores.
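
To make the difference between word-level strategies concrete, here is a simplified sketch of how a single word's label could be chosen from its subword scores under three rules: `"first"` (use the first subword), `"max"` (use the subword with the single highest score), and `"average"` (average the scores over subwords). It is an illustration only, not the pipeline's implementation; `aggregate_word`, the three-label set, and the scores are all made up.

```python
def aggregate_word(subword_scores, labels, strategy):
    """Pick one (label, score) for a word from its per-subword label scores."""
    if strategy == "first":
        # Use the scores of the first subword only.
        scores = subword_scores[0]
    elif strategy == "max":
        # Use the subword whose best score is the highest overall.
        scores = max(subword_scores, key=max)
    elif strategy == "average":
        # Average each label's score over all subwords.
        count = len(subword_scores)
        scores = [sum(column) / count for column in zip(*subword_scores)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


labels = ["O", "B-LOC", "B-PER"]
# One word split into three subwords, one row of label scores per subword.
subword_scores = [
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
]

for strategy in ("first", "max", "average"):
    print(strategy, aggregate_word(subword_scores, labels, strategy))
```

The scores were chosen so that the three rules disagree: the first subword favors `O`, the most confident subword favors `B-PER`, and the averaged scores favor `B-LOC`.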

In [10]:
classifier(text_tokens)

[[{'entity_group': 'LOC',
   'score': np.float32(0.51524615),
   'word': 'Golden',
   'start': 0,
   'end': 6}],
 [{'entity_group': 'LOC',
   'score': np.float32(0.82096034),
   'word': 'State',
   'start': 0,
   'end': 5}],
 [{'entity_group': 'LOC',
   'score': np.float32(0.57578903),
   'word': 'Warriors',
   'start': 0,
   'end': 8}],
 [],
 [],
 [{'entity_group': 'LOC',
   'score': np.float32(0.5118277),
   'word': 'equipo',
   'start': 0,
   'end': 6}],
 [{'entity_group': 'LOC',
   'score': np.float32(0.4777652),
   'word': 'profesional',
   'start': 0,
   'end': 11}],
 [],
 [{'entity_group': 'LOC',
   'score': np.float32(0.8121983),
   'word': 'baloncesto',
   'start': 0,
   'end': 10}],
 [{'entity_group': 'LOC',
   'score': np.float32(0.5030899),
   'word': 'USA',
   'start': 0,
   'end': 3}],
 [],
 [],
 [{'entity_group': 'LOC',
   'score': np.float32(0.85055363),
   'word': 'San',
   'start': 0,
   'end': 3}],
 [{'entity_group': 'LOC',
   'score': np.float32(0.7227709),
   'word

In [16]:
wnut["test"][0]["tokens"], wnut["test"][0]["ner_tags"]

(["'", "''", 'Rise', 'Against', "''", "'"], [0, 0, 3, 4, 0, 0])

In [17]:
classifier(wnut["test"][0]["tokens"])

[[],
 [],
 [{'entity_group': 'LOC',
   'score': np.float32(0.6546024),
   'word': 'Rise',
   'start': 0,
   'end': 4}],
 [{'entity_group': 'LOC',
   'score': np.float32(0.60754627),
   'word': 'Against',
   'start': 0,
   'end': 7}],
 [],
 []]

In [24]:
text = "Mañana voy a ir a Asunción y luego iré a Luque a la casa de mi tío Carlos."
text_tokens = text.split()
text_tokens, classifier(text_tokens)

(['Mañana',
  'voy',
  'a',
  'ir',
  'a',
  'Asunción',
  'y',
  'luego',
  'iré',
  'a',
  'Luque',
  'a',
  'la',
  'casa',
  'de',
  'mi',
  'tío',
  'Carlos.'],
 [[{'entity_group': 'LOC',
    'score': np.float32(0.22281228),
    'word': 'Mañana',
    'start': 0,
    'end': 6}],
  [],
  [],
  [],
  [],
  [{'entity_group': 'LOC',
    'score': np.float32(0.94032),
    'word': 'Asunción',
    'start': 0,
    'end': 8}],
  [],
  [],
  [],
  [],
  [{'entity_group': 'LOC',
    'score': np.float32(0.36836728),
    'word': 'Luque',
    'start': 0,
    'end': 5}],
  [],
  [],
  [{'entity_group': 'LOC',
    'score': np.float32(0.6463145),
    'word': 'casa',
    'start': 0,
    'end': 4}],
  [],
  [],
  [{'entity_group': 'LOC',
    'score': np.float32(0.50808287),
    'word': 'tío',
    'start': 0,
    'end': 3}],
  [{'entity_group': 'PER',
    'score': np.float32(0.33892447),
    'word': 'Carlos',
    'start': 0,
    'end': 6}]])

In [27]:
text = "Ko'ero aháta Paraguay-pe ha upé ahásata Luque-pe che tío Kálo ogápe."
text_tokens = text.split()
text_tokens, classifier(text_tokens)

(["Ko'ero",
  'aháta',
  'Paraguay-pe',
  'ha',
  'upé',
  'ahásata',
  'Luque-pe',
  'che',
  'tío',
  'Kálo',
  'ogápe.'],
 [[{'entity': 'B-ORG',
    'score': np.float32(0.42753583),
    'index': 1,
    'word': 'Ko',
    'start': 0,
    'end': 2},
   {'entity': 'I-PER',
    'score': np.float32(0.42353433),
    'index': 2,
    'word': "'",
    'start': 2,
    'end': 3},
   {'entity': 'I-PER',
    'score': np.float32(0.50691557),
    'index': 3,
    'word': 'er',
    'start': 3,
    'end': 5},
   {'entity': 'I-PER',
    'score': np.float32(0.53524977),
    'index': 4,
    'word': '##o',
    'start': 5,
    'end': 6}],
  [],
  [{'entity': 'B-LOC',
    'score': np.float32(0.98017496),
    'index': 1,
    'word': 'Paraguay',
    'start': 0,
    'end': 8}],
  [],
  [],
  [{'entity': 'B-LOC',
    'score': np.float32(0.41057286),
    'index': 1,
    'word': 'ah',
    'start': 0,
    'end': 2},
   {'entity': 'I-PER',
    'score': np.float32(0.53437924),
    'index': 2,
    'word': '##á

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mmaguero/wikiann-multilingual-bert-gn-base-cased")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [None]:
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("mmaguero/wikiann-multilingual-bert-gn-base-cased")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [None]:
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
inputs, predicted_token_class

({'input_ids': tensor([[  101, 13744, 13995, 10219, 12556, 10157,   169, 10478,   169, 53799,
            193, 17391, 10478, 10333,   169, 23859, 11189,   169, 10109, 12088,
          10104, 12132, 94354, 12050,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1]])},
 ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-LOC',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-LOC',
  'I-ORG',
  'O',
  'O',
  'B-ORG',
  'I-ORG',
  'I-ORG',
  'I-ORG',
  'I-ORG',
  'O',
  'O'])
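
The step above takes the `argmax` of the raw logits rather than of probabilities. That gives the same class, because softmax preserves the ordering of the scores. A small sketch with invented logits for a single token:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

# Invented logits for one token, one value per label.
token_logits = [0.2, -1.0, 0.1, 0.3, -0.5, 3.1, 0.4]
probabilities = softmax(token_logits)

best_from_logits = max(range(len(token_logits)), key=token_logits.__getitem__)
best_from_probs = max(range(len(probabilities)), key=probabilities.__getitem__)

print(id2label[best_from_logits], id2label[best_from_probs])
```

Applying softmax is only needed when you want a confidence score for the prediction, such as the `score` values reported by the pipeline.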