<a href="https://colab.research.google.com/github/macapagithub/2022-docker/blob/main/2.2.Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text classification models

This notebook shows how to build a text classification model.

We start by installing the Hugging Face [datasets](https://huggingface.co/docs/datasets/index) and [transformers](https://huggingface.co/docs/transformers/index) libraries.

In [1]:
!pip install transformers[torch] datasets -U

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-

## Libraries:

* AutoTokenizer: Automatically loads the pre-trained tokenizer that matches a given model.
* AutoModelForSequenceClassification: Automatically loads a pre-trained model with a sequence classification head.
* Trainer: Class for training and evaluating Transformers models.
* TrainingArguments: Defines the arguments that control model training.

* datasets: Library for loading ready-made datasets.
  * The load_dataset function loads datasets from the Hugging Face Hub.

In [2]:
import pandas as pd
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


Download and prepare the dataset:

We load the "emotion" dataset and convert its train, validation, and test splits into pandas DataFrames.

In [3]:
emotion = load_dataset("emotion")

train_df = emotion["train"].to_pandas()
valid_df = emotion["validation"].to_pandas()
test_df = emotion["test"].to_pandas()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

We load the tokenizer associated with the pre-trained model.

In [4]:
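model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Helper intended for use with Dataset.map; the cells below do not call it
# and tokenize the DataFrames directly instead.
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# padding=True pads each split to the length of its longest text;
# truncation=True cuts texts that exceed the model's maximum length.
train_encodings = tokenizer(train_df['text'].tolist(), truncation=True, padding=True)
valid_encodings = tokenizer(valid_df['text'].tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_df['text'].tolist(), truncation=True, padding=True)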


The tokenizer splits the text into smaller sub-word units (tokens) and maps them to integer IDs that the model can process.
The truncation and padding parameters ensure that all inputs in a call have the same length, which is needed to stack them into tensors for training.
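
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch. It is not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and truncation is a plain slice (the real tokenizer also keeps the closing special token). The pad ID of 0 is an assumption that matches the zeros visible in the encoded example printed further down.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of padding/truncation on lists of token IDs (hypothetical helper)."""
    # Truncation: cut every sequence to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the length of the longest one.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # Attention mask: 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
encoded = pad_and_truncate(batch, max_length=6)
print(encoded["input_ids"])       # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 11]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The attention mask is what lets the model ignore the padded positions, which is why it is returned alongside `input_ids`.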

In [5]:
class EmotionDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = EmotionDataset(train_encodings, train_df['label'].tolist())
valid_dataset = EmotionDataset(valid_encodings, valid_df['label'].tolist())
test_dataset = EmotionDataset(test_encodings, test_df['label'].tolist())


In [40]:
for n in train_dataset:
  print(n)
  break

{'input_ids': tensor([  101,  1045,  2134,  2102,  2514, 26608,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

TrainingArguments configures the training parameters.

* output_dir='./results': Directory where the training results are saved.
* num_train_epochs=3: Number of training epochs (3 in this case).
* per_device_train_batch_size=16: Training batch size per device (GPU or CPU).
* per_device_eval_batch_size=64: Evaluation batch size per device (it can be larger than the training batch size because evaluation stores no gradients).
* warmup_steps=500: Number of warm-up steps over which the learning rate is gradually increased.
* weight_decay=0.01: Weight decay coefficient, a regularizer closely related to L2 regularization, used to reduce overfitting.
* logging_dir='./logs': Directory where the training logs are stored.
* logging_steps=10: How often metrics are written to the logs (every 10 steps).
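
The interaction between batch size, epochs, and warm-up can be sketched in a few lines. This is an illustration under stated assumptions, not the `Trainer` code itself: it assumes a single device, no gradient accumulation, the default linear scheduler, and the default learning rate of 5e-5, none of which the notebook sets explicitly. The function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation (assumption)."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 16000 training examples, batch size 16, 3 epochs.
total = total_training_steps(16000, 16, 3)
print("total steps:", total)

# base_lr=5e-5 is an assumed value (the library default), not set in the notebook.
for step in (0, 250, 500, 1750, 3000):
    print(step, linear_warmup_lr(step, 5e-5, 500, total))
```

With these settings the warm-up covers the first 500 of 3000 steps, so the learning rate peaks one sixth of the way through training and decays from there.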

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|------|---------------|
| 10   | 1.7821        |
| 20   | 1.7798        |
| 30   | 1.7712        |
| 40   | 1.7313        |
| 50   | 1.7108        |
| 60   | 1.6844        |
| 70   | 1.6361        |
| 80   | 1.564         |
| 90   | 1.553         |
| 100  | 1.5728        |


TrainOutput(global_step=3000, training_loss=0.27703798371553423, metrics={'train_runtime': 451.8778, 'train_samples_per_second': 106.223, 'train_steps_per_second': 6.639, 'total_flos': 1080514292544000.0, 'train_loss': 0.27703798371553423, 'epoch': 3.0})

Evaluate the model on the test set

In [8]:
results = trainer.evaluate(test_dataset)
print("Accuracy:", results['eval_accuracy'])


Accuracy: 0.9295
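
Besides accuracy, the `compute_metrics` function defined earlier returns precision, recall, and F1 with `average='weighted'`. As a rough sketch of what that averaging means, the pure-Python code below computes a per-class F1 and weights it by how many true examples each class has. The helper name and the toy labels are made up for illustration; in practice the scikit-learn function used in the notebook is the one to rely on.

```python
from collections import Counter


def weighted_f1(labels, preds):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(labels)  # number of true examples per class
    total = len(labels)
    score = 0.0
    for c in sorted(support):
        pairs = list(zip(labels, preds))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += f1 * support[c] / total
    return score


# Toy labels and predictions, not taken from the notebook's results.
toy_labels = [0, 0, 1, 1]
toy_preds = [0, 1, 1, 1]
toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print("accuracy:", toy_accuracy)
print("weighted F1:", weighted_f1(toy_labels, toy_preds))
```

Weighting by support matters when classes are imbalanced, because frequent classes then dominate the reported score.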


Make predictions:

In [34]:
model.device.type

'cuda'

In [38]:
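text = "I can go from feeling so hopeless to so damn hopeful just to be around someone who cares about me."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
inputs = inputs.to(model.device)  # move the input tensors to the model's device
model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=1)  # index of the highest logit
print(f"Predicted class: {predictions.item()}")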


Predicted class: 0
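
The prediction above is a bare class index. A small sketch of how logits could be turned into a readable label with a probability follows. The label order is an assumption taken from the "emotion" dataset card and should be checked against `emotion["train"].features["label"].names`; the logits are made-up values, not the model's output, and `describe_prediction` is a hypothetical helper.

```python
import math

# Assumed label order for the "emotion" dataset; verify it against
# emotion["train"].features["label"].names before relying on it.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits, label_names=LABEL_NAMES):
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_names[best], probs[best]


# Made-up logits for one input, one value per class.
toy_logits = [2.0, 1.0, 0.1, -1.0, -0.5, -2.0]
label, prob = describe_prediction(toy_logits)
print(label, round(prob, 3))
```

With the notebook's model, `outputs.logits[0].tolist()` would take the place of `toy_logits`.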
