<a href="https://colab.research.google.com/github/mmaguero/diploma_fpuna_nlp_ia/blob/master/2025/guarani_sequence_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Transformers installation
! pip install transformers datasets evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


# Text classification

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/leNG9fN9FQU?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.

This guide will show you how to:

1. Finetune [gn-bert-tiny-cased](https://huggingface.co/mmaguero/gn-bert-tiny-cased), a small BERT model for Guarani, on the [gn-offensive-language-identification](https://huggingface.co/datasets/mmaguero/gn-offensive-language-identification) dataset to determine whether a tweet is offensive or not.
2. Use your finetuned model for inference.

<Tip>

To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/text-classification).

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate accelerate
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [None]:
from huggingface_hub import notebook_login

notebook_login()


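## Load the Guarani offensive-language dataset

Start by loading the [gn-offensive-language-identification](https://huggingface.co/datasets/mmaguero/gn-offensive-language-identification) dataset from the Hub with the 🤗 Datasets library: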

In [None]:
from datasets import load_dataset

off_gn = load_dataset("mmaguero/gn-offensive-language-identification")

Then take a look at an example:

In [None]:
off_gn["test"][0]

{'text': '@MartnGaleanoCr1 Nderakoreee felicidades mante... ||| off'}

In [None]:
off_gn["train"][1]

{'text': 'Maerapiko peichaite ediscrimina la nde rapicha ||| no_off'}

In [None]:
len(off_gn["test"]), len(off_gn["train"])  # sizes of the test and train splits

(434, 1388)
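
To see how the data is actually divided, the split sizes can be turned into shares of the whole. This is a small sketch that hard-codes the sizes reported in this notebook's outputs (1388 train, 348 validation, 434 test) instead of reading them from the dataset object:

```python
# Split sizes as reported in this notebook's outputs (hard-coded for illustration).
split_sizes = {"train": 1388, "validation": 348, "test": 434}

total = sum(split_sizes.values())

# Share of each split as a percentage, rounded to one decimal place.
split_shares = {name: round(100 * size / total, 1) for name, size in split_sizes.items()}

print(total)
print(split_shares)
```

With the live object, the same sizes can be collected with `{name: len(split) for name, split in off_gn.items()}`.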

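Each example in this dataset has a single field:

- `text`: a Guarani tweet, followed by the separator `|||` and its label, either `off` (offensive) or `no_off` (not offensive).

The label is moved into its own `label` column (`1` for `off`, `0` for `no_off`) during preprocessing.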

## Preprocess

The next step is to load the tokenizer of the `gn-bert-tiny-cased` checkpoint to preprocess the `text` field:

In [None]:
from transformers import AutoTokenizer

# Guarani BERT checkpoint, used here in place of distilbert/distilbert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("mmaguero/gn-bert-tiny-cased")


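Create a preprocessing function to tokenize `text` and truncate long sequences. This checkpoint's tokenizer does not define a maximum length, so `truncation=True` on its own leaves the inputs untouched (see the warning printed when the function is applied below). Pass `max_length` explicitly, for example `max_length=512` as is usual for BERT models, if you need a hard limit:

In [None]:
def preprocess_function(examples):
    # truncation=True only takes effect when a max_length is known (e.g. max_length=512)
    return tokenizer(examples["text"], truncation=True)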

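Before tokenizing, separate each raw `text` into the tweet and its label: the part after `|||` becomes an integer `label` (`1` for `off`, `0` for `no_off`). To apply a function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once, as the tokenization step further down does: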

In [None]:
def extract_label(example):
    # Each raw example looks like "<tweet> ||| <off|no_off>".
    text_parts = example["text"].split('|||')
    example["text"] = text_parts[0].strip()
    example["label"] = 1 if text_parts[1].strip() == "off" else 0  # 1 = offensive
    return example

# map returns a new DatasetDict; the original off_gn is left unchanged.
off_gn_lbl = off_gn.map(extract_label)
off_gn_lbl["train"][0], off_gn_lbl["test"][0], off_gn_lbl["validation"][0]


({'text': '@iSPEARB_ tembo jeyma elena,mbaepa ojehu ndeve', 'label': 1},
 {'text': '@MartnGaleanoCr1 Nderakoreee felicidades mante...', 'label': 1},
 {'text': "@MICHIKATZE1 🤣🤣🤣🤣🤣 umia kuera, sapy'aitepe👌", 'label': 0})
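
The label extraction above splits on every `|||`. As a hedged alternative, the sketch below treats only the last separator as the label delimiter. That matters only if a tweet could itself contain `|||`, which is an assumption and not something observed in the dataset. `split_text_and_label` is a hypothetical helper name, and the sample strings are taken from the outputs above:

```python
def split_text_and_label(raw: str) -> tuple[str, int]:
    """Split '<tweet> ||| <label>' into (tweet, 0 or 1) using the last '|||' only."""
    text, separator, tag = raw.rpartition("|||")
    if not separator:
        raise ValueError(f"no '|||' separator found in: {raw!r}")
    tag = tag.strip()
    if tag not in {"off", "no_off"}:
        raise ValueError(f"unexpected label {tag!r} in: {raw!r}")
    # 1 = offensive, 0 = not offensive, matching extract_label above.
    return text.strip(), 1 if tag == "off" else 0


samples = [
    "Maerapiko peichaite ediscrimina la nde rapicha ||| no_off",
    "@MartnGaleanoCr1 Nderakoreee felicidades mante... ||| off",
]
parsed = [split_text_and_label(sample) for sample in samples]
print(parsed)
```

Raising on an unknown tag, rather than silently mapping it to `0`, makes malformed rows visible early.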

In [None]:
tokenized_off_gn = off_gn_lbl.map(preprocess_function, batched=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
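
To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what happens to each batch. It is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption:

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch (not in the whole dataset)."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up tokenized sentences of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_batch(batch)
print(padded)
```

A batch of short tweets is padded only to its own longest member, which is why this is cheaper than padding everything to a global maximum.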

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")


Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
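
As a worked illustration of what `compute_metrics` does, the sketch below applies the same two steps, an argmax over the logits followed by the fraction of matches, to a tiny hand-made batch. The logits and labels are invented for the example, and plain Python stands in for NumPy and 🤗 Evaluate:

```python
# Invented logits for four examples: one [NO_OFF, OFF] score pair per row.
logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.1], [-0.2, 0.4]]
labels = [0, 1, 1, 1]

# Argmax per row: the index of the largest logit is the predicted class id.
predictions = [row.index(max(row)) for row in logits]

# Accuracy is the share of predictions that match the reference labels.
accuracy_value = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)
print(accuracy_value)
```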

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [None]:
id2label = {0: "NO_OFF", 1: "OFF"}
label2id = {"NO_OFF": 0, "OFF": 1}
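
Because the two dictionaries must be exact inverses of each other, one option is to derive `label2id` from `id2label` instead of typing both by hand. A small sketch of that approach:

```python
id2label = {0: "NO_OFF", 1: "OFF"}

# Invert the mapping so the two dictionaries cannot drift apart.
label2id = {label: idx for idx, label in id2label.items()}

# Round-trip check: every label maps to an id that maps back to the same label.
assert all(id2label[label2id[label]] == label for label in label2id)
print(label2id)
```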

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load the `gn-bert-tiny-cased` checkpoint with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "mmaguero/gn-bert-tiny-cased", num_labels=2, id2label=id2label, label2id=label2id
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at mmaguero/gn-bert-tiny-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. Setting `push_to_hub=True` uploads the model to the Hub during training (you need to be signed in to Hugging Face); here it is left as `False` and the model is uploaded explicitly after training. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
training_args = TrainingArguments(
    output_dir="mmaguero/toxic-gn-bert-tiny-cased",  # your_username/my_awesome_model
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,  # use at least 2 epochs; the results below come from a 10-epoch run
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,  # set to True to upload during training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_off_gn["train"],
    eval_dataset=tokenized_off_gn["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.379812 | 0.864943 |
| 2 | No log | 0.360568 | 0.864943 |
| 3 | No log | 0.331693 | 0.887931 |
| 4 | No log | 0.310153 | 0.896552 |
| 5 | No log | 0.294718 | 0.899425 |
| 6 | 0.371700 | 0.285284 | 0.902299 |
| 7 | 0.371700 | 0.280009 | 0.887931 |
| 8 | 0.371700 | 0.278794 | 0.885057 |
| 9 | 0.371700 | 0.277721 | 0.882184 |
| 10 | 0.371700 | 0.277739 | 0.882184 |




TrainOutput(global_step=870, training_loss=0.31968590747350933, metrics={'train_runtime': 113.4188, 'train_samples_per_second': 122.378, 'train_steps_per_second': 7.671, 'total_flos': 4970296438560.0, 'train_loss': 0.31968590747350933, 'epoch': 10.0})
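
The numbers in the training output can be cross-checked with a little arithmetic. The sketch below assumes a single device (so the effective batch size is 16) and hard-codes the per-epoch results from the log. It recomputes the total number of optimizer steps and finds which epoch had the lowest validation loss and which had the highest accuracy:

```python
import math

train_examples = 1388  # size of the train split
batch_size = 16        # per_device_train_batch_size, assuming one device
epochs = 10            # as reported by 'epoch': 10.0 in the output

# The last, partially filled batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# (epoch, validation loss, accuracy) copied from the training log.
history = [
    (1, 0.379812, 0.864943), (2, 0.360568, 0.864943),
    (3, 0.331693, 0.887931), (4, 0.310153, 0.896552),
    (5, 0.294718, 0.899425), (6, 0.285284, 0.902299),
    (7, 0.280009, 0.887931), (8, 0.278794, 0.885057),
    (9, 0.277721, 0.882184), (10, 0.277739, 0.882184),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])

print(steps_per_epoch, total_steps)
print(best_by_loss[0], best_by_accuracy[0])
```

The step count matches `global_step=870`. The lowest validation loss and the highest accuracy fall on different epochs (9 and 6). As far as the `TrainingArguments` documentation describes, `load_best_model_at_end=True` selects the checkpoint by evaluation loss unless `metric_for_best_model` is set, so passing `metric_for_best_model="accuracy"` is one way to select by accuracy instead.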

In [None]:
trainer.evaluate(tokenized_off_gn["test"])



{'eval_loss': 0.3441266119480133,
 'eval_accuracy': 0.8732718894009217,
 'eval_runtime': 0.8502,
 'eval_samples_per_second': 510.442,
 'eval_steps_per_second': 32.932,
 'epoch': 10.0}
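
The reported test accuracy can be converted back into counts, which is often easier to reason about on a small test set. Assuming the test split has the 434 examples shown earlier, this sketch recovers how many were classified correctly:

```python
test_examples = 434                 # size of the test split
eval_accuracy = 0.8732718894009217  # 'eval_accuracy' from the output above

# Accuracy is correct / total, so multiplying back gives the number of hits.
correct = round(eval_accuracy * test_examples)
errors = test_examples - correct

print(correct, errors)
```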

<Tip>

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) applies dynamic padding by default when you pass a tokenizer to it as `processing_class` (called `tokenizer` in older versions). In this case, you don't need to specify a data collator explicitly.

</Tip>

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/mmaguero/toxic-gn-bert-tiny-cased/commit/fab2e62a885b879dfe78cddf3f17d7920a0d7ef3', commit_message='End of training', commit_description='', oid='fab2e62a885b879dfe78cddf3f17d7920a0d7ef3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mmaguero/toxic-gn-bert-tiny-cased', endpoint='https://huggingface.co', repo_type='model', repo_id='mmaguero/toxic-gn-bert-tiny-cased'), pr_revision=None, pr_num=None)

<Tip>

For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [None]:
text = "Ojapo ára porãite che iru!"

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for text classification with your model, and pass your text to it:

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification", model="mmaguero/toxic-gn-bert-tiny-cased")


Device set to use cpu


In [None]:
classifier(text)

[{'label': 'NO_OFF', 'score': 0.9595844745635986}]

In [None]:
# try something that is negative but not offensive
text = "Ojapo ára vaite che iru!"
classifier(text)

[{'label': 'NO_OFF', 'score': 0.9481797814369202}]

In [None]:
# try something offensive
text = "Nderakore nde tavyron"  # sorry, but the task is about whether a text is offensive or not
classifier(text)  # if this fails, we should tune the hyperparameters further

[{'label': 'OFF', 'score': 0.8411603569984436}]

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mmaguero/toxic-gn-bert-tiny-cased")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("mmaguero/toxic-gn-bert-tiny-cased")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [None]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'OFF'
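
The `score` that the pipeline reports for a two-label, single-label model like this one corresponds by default to a softmax over the logits. The sketch below shows that computation in plain Python on made-up logits (the values are hypothetical, not the model's real output). With the tensors from the cells above, `torch.softmax(logits, dim=-1)` would be the equivalent call:

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NO_OFF", 1: "OFF"}
example_logits = [-0.8, 0.9]  # hypothetical logits for one input

probs = softmax(example_logits)
predicted_id = probs.index(max(probs))

# Same shape as one entry of the pipeline output.
prediction = {"label": id2label[predicted_id], "score": probs[predicted_id]}
print(prediction)
```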