<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center>   

# How to fine-tune a pre-trained model in native PyTorch (without using Trainer)

In the previous notebooks, we learned how to fine-tune a pre-trained model
for a classification task with two different frameworks, PyTorch and TensorFlow.

In TensorFlow, Keras provides a `fit` method that takes care of training the model, that is, it implements the training loop. PyTorch, however, has no built-in method that handles the training loop. For this reason, the Transformers library implements a **Trainer** class that makes it easy to train (fine-tune) a model. In this notebook, we will learn to train a model in PyTorch without using Trainer, which means we have to implement the training loop ourselves.

Source: https://huggingface.co/docs/transformers/v4.14.1/en/training#finetuning-in-native-pytorch

We start by installing the required libraries:

In [1]:
!pip install transformers datasets evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.0-py3-none-any.whl (199 kB)

## Data

We use the **trec** dataset, as in the two previous notebooks.




### Loading the dataset

In [2]:
from datasets import load_dataset
dict_dataset = load_dataset("trec")

TARGET_LABELS = dict_dataset['train'].features['coarse_label'].names
# drop the fine_label column
dict_dataset = dict_dataset.remove_columns(['fine_label'])
# rename the coarse_label column to label
dict_dataset = dict_dataset.rename_column('coarse_label', 'label')

dict_dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5452
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})

### Creating the validation split

In [3]:
# hold out 10% of the training data for validation
aux = dict_dataset['train'].train_test_split(test_size=0.1)
dict_dataset['train'] = aux['train']
dict_dataset['val'] = aux['test']
del aux
dict_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4906
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 546
    })
})
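
The cell above does not fix a random seed, so the validation split changes from run to run (`train_test_split` accepts a `seed` argument if reproducibility matters). As a rough illustration of what a shuffled 90/10 hold-out does, here is a pure-Python sketch. `hold_out` is a hypothetical helper, not part of `datasets`, and rounding the held-out size up is an assumption chosen to match the 4906/546 sizes shown above.

```python
import math
import random


def hold_out(items, test_size=0.1, seed=0):
    """Shuffle a copy of `items` and split it into (train, held_out)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split reproducible
    n_held_out = math.ceil(len(shuffled) * test_size)
    return shuffled[n_held_out:], shuffled[:n_held_out]


# 5452 is the size of the original TREC training split
train_ids, val_ids = hold_out(range(5452), test_size=0.1, seed=0)
print(len(train_ids), len(val_ids))  # 4906 546
```

The two parts are disjoint and together cover every original example, which is the property the validation split relies on.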

### Tokenization

In [4]:
from transformers import AutoTokenizer
model_name='bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)



What is the maximum sequence length (in tokens)?

In [5]:
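# longest tokenized text; note that it is measured on the test split only,
# so any longer text in the other splits will be truncated to this length
MAX_LENGTH = max(len(tokenizer(text).input_ids) for text in dict_dataset["test"]['text'])
print(MAX_LENGTH)

# the max length can never be greater than 512 (BERT's position limit)
MAX_LENGTH = min(512, MAX_LENGTH)
print(MAX_LENGTH)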


24
24


In [6]:
def tokenize(examples):
    # apply the tokenizer to the "text" field of each batch of examples
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=MAX_LENGTH)


data_encodings= dict_dataset.map(tokenize, batched=True)
data_encodings


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4906
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 500
    })
    val: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 546
    })
})
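
To make the effect of `padding="max_length"` and `truncation=True` more concrete, the following pure-Python sketch shows how a list of token ids could be brought to a fixed length together with its attention mask. It is only an illustration: `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary example values, and a pad id of 0 is assumed. The real tokenizer is more careful (for example, it keeps the closing special token when truncating).

```python
PAD_ID = 0  # assumed id of the padding token


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids)[:max_length]   # truncation: drop whatever does not fit
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# arbitrary example ids, shorter than the target length
input_ids, attention_mask = pad_or_truncate([101, 2054, 2003, 102], max_length=6)
print(input_ids)       # [101, 2054, 2003, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Because every example ends up with the same length, the batches built later can be stacked into rectangular tensors without any extra collation logic.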

## Model

We are going to train the model without using Trainer, so we need to write the training loop ourselves.

First, we need to make a few changes to prepare the dataset for the model:
- remove the 'text' column, because the model does not expect that field.
- rename 'label' to 'labels', because that is the name the model expects.
- make the dataset return Tensor objects instead of lists.

In [7]:
data_encodings=data_encodings.remove_columns('text')
data_encodings = data_encodings.rename_column('label','labels')
data_encodings.set_format("torch")
data_encodings

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4906
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 500
    })
    val: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 546
    })
})

To feed the data to the model, we wrap each split in a **DataLoader** object, which serves it in batches:


In [8]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(data_encodings['train'], shuffle=True, batch_size=8)
val_dataloader = DataLoader(data_encodings['val'], batch_size=8)
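
As a rough mental model of what a `DataLoader` with `batch_size=8` does, the sketch below slices a list into consecutive batches in plain Python. The `batches` helper is hypothetical and leaves out shuffling and tensor collation; it only shows how the number of batches follows from the dataset size, assuming the last incomplete batch is kept.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 4906 is the size of the training split after holding out the validation data
train_batches = list(batches(list(range(4906)), batch_size=8))
print(len(train_batches))      # 614
print(len(train_batches[-1]))  # 2
```

With 4906 training examples this gives 614 batches per epoch, the last of which holds only 2 examples.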

### Defining the model
To define the model, we first need to know the number of labels:

In [9]:
NUM_LABELS = len(TARGET_LABELS)
print('TARGET_LABELS:', TARGET_LABELS, 'NUM_LABELS:', NUM_LABELS)

TARGET_LABELS: ['ABBR', 'ENTY', 'DESC', 'HUM', 'LOC', 'NUM'] NUM_LABELS: 6
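
The model itself only works with integer class ids, so it can help to keep an explicit mapping between ids and label names. The sketch below builds both directions from the label list printed above. The names `id2label` and `label2id` are chosen to match the configuration fields that Transformers models use for this purpose, but passing them to the model is optional and not done in this notebook.

```python
TARGET_LABELS = ['ABBR', 'ENTY', 'DESC', 'HUM', 'LOC', 'NUM']

# integer id -> label name, and the inverse mapping
id2label = dict(enumerate(TARGET_LABELS))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[3], label2id['LOC'])  # HUM 4
```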


In [10]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=NUM_LABELS) 


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

As the optimizer we use **AdamW**, a variant of Adam with decoupled weight decay, and we set the learning rate:


In [11]:
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

#### Learning rate

The default learning rate scheduler is simply a linear decay from the maximum value (5e-5 here) to 0:


In [12]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
print("Num training steps:", num_training_steps)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)


Num training steps: 1842
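
To see what "linear decay to 0" means in numbers, the following sketch computes the learning rate at a given step by hand. It is a simplified stand-in for the scheduler, written under the assumption that the rate rises linearly during warmup (none here) and then falls linearly to zero at the last step. `linear_lr` is a hypothetical helper, not a Transformers function.

```python
def linear_lr(step, base_lr=5e-5, num_training_steps=1842, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


# start, midpoint and end of the 1842 training steps
for step in (0, 921, 1842):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 5e-5, is about half of that at step 921, and reaches 0 at step 1842.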


We also need to select a device (a GPU if one is available, otherwise the CPU) on which to place the model and the batches.

In [13]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

We define the metrics:

In [17]:
import evaluate

metric_acc = evaluate.load("accuracy")
# the averaging mode ("macro") is also passed explicitly to compute() in the training loop
metric_P = evaluate.load("precision", average="macro")
metric_R = evaluate.load("recall", average="macro")
metric_F1 = evaluate.load("f1", average="macro")



### Training

Ready to train! We use a progress bar (from the tqdm library) over the number of training steps to track progress.
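
Before the full loop, it may help to see the same forward / backward / update pattern on a toy problem with a single weight, written by hand in plain Python. This is only a sketch: the data, the learning rate and the squared-error loss are illustrative choices, and the gradient is derived manually instead of being computed by PyTorch.

```python
# Toy problem: fit y = w * x with a single weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; the true weight is 2
w, lr = 0.0, 0.05

for epoch in range(20):
    for x, y in data:              # one "batch" of size 1
        pred = w * x               # forward pass
        loss = (pred - y) ** 2     # squared error for this example
        grad = 2 * (pred - y) * x  # backward pass: d(loss)/dw
        w -= lr * grad             # optimizer step: move against the gradient
        # PyTorch accumulates gradients across backward() calls, which is why its
        # loop also calls optimizer.zero_grad(); here grad is recomputed each step.

print(round(w, 3))  # 2.0
```

The real training loop has the same shape, with the model, the loss and the gradients handled by PyTorch and the learning rate adjusted by the scheduler after each step.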



In [18]:
from tqdm.auto import tqdm
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:  # each batch holds up to 8 examples
        batch = {k: v.to(device) for k, v in batch.items()}  # move the batch tensors to the device
        outputs = model(**batch)    # forward pass; the loss is returned because the batch includes labels
        loss = outputs.loss         # training loss for this batch
        loss.backward()             # backpropagation: compute the gradient of the loss w.r.t. each parameter
        optimizer.step()            # update the weights using those gradients
        lr_scheduler.step()         # advance the learning rate schedule
        optimizer.zero_grad()       # reset the gradients before the next batch
        progress_bar.update(1)

    # now evaluate on the validation split

    model.eval()
    for batch in val_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        # add the predictions and their references for each batch; each metric is computed once all batches have been added
        metric_acc.add_batch(predictions=predictions, references=batch["labels"])
        metric_P.add_batch(predictions=predictions, references=batch["labels"])
        metric_R.add_batch(predictions=predictions, references=batch["labels"])
        metric_F1.add_batch(predictions=predictions, references=batch["labels"])

    # After predicting all batches, we can compute the metrics
    # print the four metrics accuracy, precision, recall and F1 (macros) for each epoch
    # As the problem is a multiclass problem, we use "macro" as average. We could also use "micro" or "weighted"
    print("Epoch : ", epoch+1,  metric_acc.compute(), metric_P.compute(average="macro"), metric_R.compute(average="macro"), metric_F1.compute(average="macro"))



Epoch :  1 {'accuracy': 0.9340659340659341} {'precision': 0.9502882416827917} {'recall': 0.8870307967230046} {'f1': 0.9086369109339408}
Epoch :  2 {'accuracy': 0.9542124542124543} {'precision': 0.9635322396384344} {'recall': 0.9015917659277596} {'f1': 0.9240585685072319}
Epoch :  3 {'accuracy': 0.9523809523809523} {'precision': 0.9626527445170437} {'recall': 0.9006051453179772} {'f1': 0.9232239660984441}
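
The precision, recall and F1 values above use the "macro" average, that is, the metric is computed separately for each class and the per-class values are averaged with equal weight. The sketch below shows this for precision in plain Python. `macro_precision` is a hypothetical helper written for illustration (a class that is never predicted is assumed to count as 0), and the labels are made-up toy data, not TREC predictions.

```python
def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precision over all classes that appear."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # true labels of the examples that were predicted as class c
        predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        # precision of class c: correct predictions of c / all predictions of c
        per_class.append(predicted_c.count(c) / len(predicted_c) if predicted_c else 0.0)
    return sum(per_class) / len(per_class)


y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro_precision(y_true, y_pred), 3), accuracy)  # 0.833 0.875
```

In this toy example the rare class has lower precision (2/3), which pulls the macro average below the accuracy. This is why macro-averaged scores are informative when classes are imbalanced, as they are in TREC.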


## Evaluation



In [20]:
def get_prediction(text):
    # tokenize the text the same way as for training and validation
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=MAX_LENGTH, return_tensors="pt").to(device)
    # use the model to predict the class for this input (no gradients are needed at inference time)
    with torch.no_grad():
        outputs = model(**inputs)
    # compute the probabilities with softmax
    probs = outputs[0].softmax(1)
    # return the index of the most probable class; since it is a tensor, we take its item
    return probs.argmax().item()
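
The last two steps of `get_prediction` (softmax, then argmax) can be reproduced by hand on a small list of scores. The sketch below uses made-up logits for six classes. `softmax` is a plain-Python helper written for illustration, not the PyTorch implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.3, -1.2, 2.5, 0.0, -0.7, 1.1]  # made-up scores, one per class
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(predicted)  # 2
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits directly would give the same class. The softmax is only needed when the probabilities themselves are of interest.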

In [21]:
from sklearn.metrics import classification_report

y_pred = [get_prediction(text) for text in dict_dataset['test']['text']]
print(classification_report(y_true=dict_dataset['test']['label'], y_pred=y_pred, target_names=TARGET_LABELS))

              precision    recall  f1-score   support

        ABBR       1.00      0.89      0.94         9
        ENTY       0.99      0.89      0.94        94
        DESC       0.96      0.99      0.98       138
         HUM       0.98      0.98      0.98        65
         LOC       0.96      0.99      0.98        81
         NUM       0.97      1.00      0.99       113

    accuracy                           0.97       500
   macro avg       0.98      0.96      0.97       500
weighted avg       0.97      0.97      0.97       500

