<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# How to fine-tune a transformer for multi-label classification


Recall first that the goal of text classification is to assign each text one category from a predefined set of categories. 
Multi-label classification (multi-labelling) is similar, except that each text may be annotated with several categories (or with none).

In this notebook, we will learn how to fine-tune a transformer model for this multi-label task. 

To do so, we will use the [Toxic Comments](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip) dataset, a collection of Wikipedia comments that have been manually annotated to detect toxic comments. 

The possible types are: 
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
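
As a small illustration of what "several categories or none" means in practice, the sketch below encodes a set of active labels as a vector with one 0/1 entry per category and decodes it back. The helper names `encode` and `decode` are hypothetical and are not used elsewhere in the notebook.

```python
# The six categories, in the order listed above
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def encode(active_labels):
    """Turn a set of label names into a 0/1 vector (hypothetical helper)."""
    return [1 if name in active_labels else 0 for name in LABELS]

def decode(vector):
    """Turn a 0/1 vector back into the list of active label names."""
    return [name for name, flag in zip(LABELS, vector) if flag]

# A comment can carry several labels at once...
several = encode({"toxic", "insult"})
# ...or none at all (a non-toxic comment)
none = encode(set())

print(several, decode(several))
print(none, decode(none))
```

In single-label classification exactly one entry would be 1; here any number of entries, including zero, can be active.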




In [1]:
!nvidia-smi -L

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [2]:
!pip install datasets transformers sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting responses<0.19
  Dow

## Dataset


In [3]:
from google.colab import drive
import pandas as pd
# mount your google drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
path = "/content/drive/My Drive/Colab Notebooks/data/toxic/"
df_train_full = pd.read_csv(path+'train.csv')
print("Size:" ,df_train_full.shape)
df_train_full.head()

Size: (159571, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


This is a fairly large dataset. For a first run, however, we would rather work with a smaller sample. 

We could do this directly with the **sample** method, which selects a given number of records at random. We set the **random_state** parameter so that we always obtain the same subset and the experiments can be reproduced on the same data.


In [5]:
df_train=df_train_full.sample(n=15000, random_state=42)
print(df_train.shape)
df_train.head()

(15000, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
119105,7ca72b5b9c688e9e,"Geez, are you forgetful! We've already discus...",0,0,0,0,0,0
131631,c03f72fd8f8bf54f,Carioca RFA \n\nThanks for your support on my ...,0,0,0,0,0,0
125326,9e5b8e8fc1ff2e84,"""\n\n Birthday \n\nNo worries, It's what I do ...",0,0,0,0,0,0
111256,5332799e706665a6,Pseudoscience category? \n\nI'm assuming that ...,0,0,0,0,0,0
83590,dfa7d8f0b4366680,"(and if such phrase exists, it would be provid...",0,0,0,0,0,0


However, this has a problem: because the sample is random, nothing guarantees that the label distribution in the sample matches that of the full dataset. 
To obtain a smaller dataset that is still representative, we need a method that preserves the label distribution. 

For this we will use the **MultilabelStratifiedKFold** class from the 
**iterative-stratification** library, which splits a dataset into K folds such that each fold has approximately the same label distribution. 
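
To get a feeling for why plain random sampling can misrepresent rare labels, the following sketch computes the probability that a random sample contains no positive example of a rare label. The numbers (1000 rows, 3 positives, a sample of 100) are hypothetical and are not taken from the Toxic Comments dataset.

```python
import math

def prob_no_positive(n_rows, n_positive, sample_size):
    """Probability that a sample drawn without replacement
    contains none of the positive rows (hypergeometric with k=0)."""
    return math.comb(n_rows - n_positive, sample_size) / math.comb(n_rows, sample_size)

# Hypothetical numbers: a label with only 3 positives among 1000 rows
p = prob_no_positive(n_rows=1000, n_positive=3, sample_size=100)
print(f"Probability that the sample misses the label entirely: {p:.3f}")
```

With these assumed numbers the rare label is missing from the sample more often than not, which is the situation a stratified split is meant to avoid.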


In [6]:
!pip install iterative-stratification

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting iterative-stratification
  Downloading iterative_stratification-0.1.7-py3-none-any.whl (8.5 kB)
Installing collected packages: iterative-stratification
Successfully installed iterative-stratification-0.1.7


We need to separate the texts (stored in the dataframe X) from the labels (stored in the dataframe y):

In [7]:
X=df_train_full[['comment_text']]
print(X.head())
print('labels:', df_train_full.columns[2:].tolist())
y=df_train_full[df_train_full.columns[2:].tolist()]
y.head()

                                        comment_text
0  Explanation\nWhy the edits made under my usern...
1  D'aww! He matches this background colour I'm s...
2  Hey man, I'm really not trying to edit war. It...
3  "\nMore\nI can't make any real suggestions on ...
4  You, sir, are my hero. Any chance you remember...
labels: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0


Now we apply the MultilabelStratifiedKFold class to split the dataset into N parts (folds); N could be, for example, 10, 20, 80 or 100, and here we use 200. In the list **folds** we store the indices of each fold, so **folds[0]** contains the indices of the examples that make up fold 0. 

In [8]:
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

mskf = MultilabelStratifiedKFold(n_splits=200)
folds = []
for train_index, test_index in mskf.split(X, y):
    folds.append(test_index)
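
The idea behind a stratified split can be sketched with the standard library only. The function below is a deliberately simplified illustration: it deals rows that share the same label combination round-robin across the folds. It is not the iterative stratification algorithm implemented by the library, and the name `simple_stratified_folds` and the toy data are hypothetical.

```python
from collections import defaultdict

def simple_stratified_folds(label_rows, n_splits):
    """Assign row indices to folds so that each label combination
    is spread as evenly as possible (simplified illustration)."""
    # Group the row indices by their exact label combination
    groups = defaultdict(list)
    for index, labels in enumerate(label_rows):
        groups[tuple(labels)].append(index)

    # Deal the rows of each group round-robin across the folds
    folds = [[] for _ in range(n_splits)]
    next_fold = 0
    for indices in groups.values():
        for index in indices:
            folds[next_fold].append(index)
            next_fold = (next_fold + 1) % n_splits
    return folds

# Hypothetical toy data with 2 labels and 12 rows
label_rows = [[0, 0]] * 6 + [[1, 0]] * 3 + [[1, 1]] * 3
toy_folds = simple_stratified_folds(label_rows, n_splits=3)

# Count the positives of each label in every fold
positives_per_fold = [
    [sum(label_rows[i][j] for i in fold) for j in range(2)]
    for fold in toy_folds
]
for fold, positives in zip(toy_folds, positives_per_fold):
    print(len(fold), positives)
```

On this toy data every fold ends up with the same size and the same number of positives per label. Grouping by exact combination breaks down when most combinations are rare, which is one reason to rely on a dedicated library for real data.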


We have split the dataset into 200 folds so that we do not have to work with all of the roughly 160,000 records (training would take too long). Each fold has approximately 800 instances.

From those 200 folds, we can take two for training, one for validation, and another one for testing. 
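
The sizes involved can be checked with a few lines of arithmetic. This is only a sketch: the row count comes from the shape printed when the CSV was loaded, and the exact size of a stratified fold may differ slightly from the average.

```python
n_rows = 159571      # rows in the full training file (see the shape printed above)
n_splits = 200       # value passed to MultilabelStratifiedKFold

avg_fold_size = n_rows / n_splits
print(f"Average fold size: {avg_fold_size:.1f}")

# Two folds for training, one for validation, one for testing
fold_size = round(avg_fold_size)
train_size, val_size, test_size = 2 * fold_size, fold_size, fold_size
used_fraction = (train_size + val_size + test_size) / n_rows
print(train_size, val_size, test_size, f"{used_fraction:.1%} of the data")
```

In other words, the four selected folds amount to about 2% of the full dataset.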



In [9]:
index_train = list(folds[0])
index_train.extend(list(folds[1]))
index_val = list(folds[2])
index_test = list(folds[3])



Each of these lists contains the indices, in the full dataset, of the examples (instances) of the corresponding folds. Now we simply use the **iloc** method on the full dataframe to obtain the train, validation and test dataframes:


In [10]:
df_train=df_train_full.iloc[index_train]
df_val=df_train_full.iloc[index_val]
df_test=df_train_full.iloc[index_test]
# Size of each dataset
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

df_test.head()

(1596, 8)
(798, 8)
(798, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
47,001c557175094f10,"In other words, you're too lazy to actually po...",0,0,0,0,0,0
156,0060ef190ee10720,. Between the unblock submission and response...,0,0,0,0,0,0
398,00fa6072f2eb086b,"""\n\nCanonicity in general\n\nI've been involv...",0,0,0,0,0,0
616,01a3724d5bf623ed,"Title section revert \n\nHi, I was at the offi...",0,0,0,0,0,0
862,025abee8428a80d8,"No I dont agree, my info will be saying up. Si...",0,0,0,0,0,0


Now we can free the memory used by the full dataset:

In [11]:
import gc
del df_train_full #delete unnecessary variables 
gc.collect()



0

We now have a dataset of a reasonable size that can be trained in little time. Keep in mind, though, that the larger the dataset, the better the model's results are likely to be. 


Since we are going to work with transformers, datasets are more convenient than dataframes. Let's convert these dataframes into a dictionary of datasets:


In [12]:
from datasets import load_dataset, Dataset, DatasetDict
dataset_dict = DatasetDict()
dataset_dict['train']  = Dataset.from_pandas(df_train)
dataset_dict['validation']  = Dataset.from_pandas(df_val)
dataset_dict['test']  = Dataset.from_pandas(df_test)
dataset_dict

# the dataframes are deleted in the next cell

DatasetDict({
    train: Dataset({
        features: ['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', '__index_level_0__'],
        num_rows: 1596
    })
    validation: Dataset({
        features: ['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', '__index_level_0__'],
        num_rows: 798
    })
    test: Dataset({
        features: ['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', '__index_level_0__'],
        num_rows: 798
    })
})

In [13]:
del df_train
del df_val
del df_test
gc.collect()


29

In [14]:
dataset_dict = dataset_dict.remove_columns(['__index_level_0__','id'])
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
        num_rows: 1596
    })
    validation: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
        num_rows: 798
    })
    test: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
        num_rows: 798
    })
})

In [15]:
TARGET_LABELS=list(dataset_dict['train'].features)[1:]
print(TARGET_LABELS)

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


## Label encoding


In [16]:
import torch

def join_labels(example):
    labels = []
    for name_label in TARGET_LABELS:
        labels.append(example[name_label])

    example['label']=torch.Tensor(labels)
   
    return example


new_dataset = dataset_dict.map(join_labels)
new_dataset

Map:   0%|          | 0/1596 [00:00<?, ? examples/s]

Map:   0%|          | 0/798 [00:00<?, ? examples/s]

Map:   0%|          | 0/798 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'label'],
        num_rows: 1596
    })
    validation: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'label'],
        num_rows: 798
    })
    test: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'label'],
        num_rows: 798
    })
})

In [17]:
del dataset_dict
gc.collect()


88

We check that label indeed groups, for each record, the individual labels in the same order.

In [18]:
# This loop finds the indices of some examples labelled as toxic
import random 
i = 0
while True:
    # randrange excludes the upper bound, so the index is always a valid row
    index = random.randrange(new_dataset['train'].num_rows)
    if 1.0 in new_dataset['train'][index]['label']:
        print(index, new_dataset['train'][index]['label'])
        if i > 5:
            break
        i += 1


430 [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
884 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
924 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
1517 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
980 [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
948 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
480 [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]


In [19]:
# the indices must be valid row positions in the train split (1596 rows here)
for index in [380, 1503]:
    for key in list(new_dataset['train'][index].keys())[1:]:
        print(key, new_dataset['train'][index][key])
    print()



toxic 0
severe_toxic 0
obscene 0
threat 0
insult 0
identity_hate 0
label [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

toxic 0
severe_toxic 0
obscene 0
threat 0
insult 0
identity_hate 0
label [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
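
Instead of inspecting a few records by eye, the same check can be automated. The sketch below uses a plain dictionary as a stand-in for one record of the dataset; the helper `label_matches_columns` and the sample record are hypothetical.

```python
TARGET_LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def label_matches_columns(example, target_labels):
    """Return True if example['label'] repeats the individual
    label columns in the order given by target_labels."""
    expected = [float(example[name]) for name in target_labels]
    return list(example["label"]) == expected

# Hypothetical record, shaped like the ones printed above
record = {
    "toxic": 1, "severe_toxic": 0, "obscene": 1,
    "threat": 0, "insult": 1, "identity_hate": 0,
    "label": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}
print(label_matches_columns(record, TARGET_LABELS))

# A record whose label vector is in the wrong order fails the check
broken = dict(record, label=[0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
print(label_matches_columns(broken, TARGET_LABELS))
```

Applied to every record of a split, such a function would confirm the ordering for the whole dataset rather than for a handful of indices.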

We keep only the fields we are interested in, label and text:

In [20]:
new_dataset = new_dataset.rename_column('comment_text','text')
new_dataset = new_dataset.remove_columns(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'])
new_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1596
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 798
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 798
    })
})

## Tokenization

In [21]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
model_name= 't5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name) 


Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Let's take a quick look at the length of the texts: 

In [22]:
lengths= [len(tokenizer(text).input_ids) for text in new_dataset['train']['text']]


Token indices sequence length is longer than the specified maximum sequence length for this model (666 > 512). Running this sequence through the model will result in indexing errors


In [23]:
df_len = pd.Series(lengths)
df_len.describe(percentiles=[0.25, 0.50, 0.75, 0.85, 0.90, 0.95])

count    1596.000000
mean      106.987469
std       188.473584
min         5.000000
25%        28.000000
50%        55.500000
75%       110.000000
85%       168.000000
90%       227.500000
95%       336.500000
max      2789.000000
dtype: float64

In [24]:
del lengths
del df_len
gc.collect()


0

Although the maximum length is 2789 tokens, 85% of the texts have at most 168 tokens, so a value close to that covers most of the data. Setting 165 as the maximum length is a reasonable choice. 
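
The trade-off behind this choice can be made explicit by measuring how many texts fit within a given maximum length. The sketch below uses a short, hypothetical list of lengths rather than the real ones, and a nearest-rank percentile, which can differ slightly from the interpolated values that pandas reports.

```python
import math

def coverage(lengths, max_length):
    """Fraction of texts that fit in max_length tokens without truncation."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

def percentile(lengths, q):
    """Nearest-rank percentile for 0 < q <= 1."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical token counts, for illustration only
lengths = [5, 28, 40, 55, 56, 90, 110, 150, 168, 2789]

print(coverage(lengths, 165))     # share of texts that are not truncated
print(percentile(lengths, 0.85))  # length that covers 85% of the texts
```

Texts longer than the chosen maximum are still used for training, only truncated, so the choice mainly trades memory and speed against the information lost in the longest comments.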

Now we apply the T5 tokenizer. This model needs its inputs to start with a prefix naming the task, in our case "multilabeling: "

In [25]:
MAX_LENGTH = 165 
PREFIX='multilabeling: '
  
def tokenize(examples):
    inputs = [PREFIX + text for text in examples['text']]
    targets = [str(label) for label in examples['label']]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=MAX_LENGTH, padding='max_length', truncation=True)
    return model_inputs
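
Because T5 is a text-to-text model, both the input and the target are plain strings before tokenization. The sketch below shows the strings produced by the function above for one example, and one possible way of turning a well-formed target string back into a list of values. The helpers `to_text_pair` and `parse_target` are hypothetical, and text generated by a model may not be well formed, so real parsing would need to be more tolerant.

```python
import ast

PREFIX = "multilabeling: "

def to_text_pair(text, label):
    """Build the (input, target) strings for one example."""
    return PREFIX + text, str(label)

def parse_target(target):
    """Turn a well-formed target string back into a list of floats."""
    return [float(value) for value in ast.literal_eval(target)]

source, target = to_text_pair("You, sir, are my hero.", [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(source)
print(target)
print(parse_target(target))
```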
 

In [26]:
encoded_dataset = new_dataset.map(tokenize, batched = True)

encoded_dataset

Map:   0%|          | 0/1596 [00:00<?, ? examples/s]

Map:   0%|          | 0/798 [00:00<?, ? examples/s]

Map:   0%|          | 0/798 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1596
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 798
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 798
    })
})

We can see that the T5 tokenizer does not generate the **token_type_ids** field, because T5 is not trained with the next-sentence-prediction task. However, it has added a new field, **labels**, which has the same length as the input sequence because the targets are padded to the same maximum length.

In [27]:
len(encoded_dataset['train'][0]['labels'])

165
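
Since the targets are padded up to 165 positions, most of each **labels** sequence is padding. A common practice when fine-tuning sequence-to-sequence models with PyTorch is to replace the padding ids in the labels with -100, the default index ignored by the cross-entropy loss, so that padding does not contribute to the loss. The sketch below illustrates that idea on a plain list; it is not part of the original notebook, and the token ids in the example are hypothetical.

```python
PAD_TOKEN_ID = 0      # id of T5's padding token
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids):
    """Replace padding ids so that they are ignored when computing the loss."""
    return [IGNORE_INDEX if token == PAD_TOKEN_ID else token for token in label_ids]

# Hypothetical label ids: three real tokens followed by padding
masked = mask_padding([784, 536, 1, 0, 0, 0])
print(masked)
```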

In [28]:
encoded_dataset  = encoded_dataset.remove_columns(['text', 'label'])
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1596
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 798
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 798
    })
})

We free memory, keeping the test split:

In [29]:
test = new_dataset['test']
del new_dataset
#del encoded_dataset['test']
gc.collect()


138

## Using T5 for multi-label classification

In [30]:
id2label = {idx:label for idx, label in enumerate(TARGET_LABELS)}
label2id = {label:idx for idx, label in enumerate(TARGET_LABELS)}
TARGET_LABELS, id2label, label2id

(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
 {0: 'toxic',
  1: 'severe_toxic',
  2: 'obscene',
  3: 'threat',
  4: 'insult',
  5: 'identity_hate'},
 {'toxic': 0,
  'severe_toxic': 1,
  'obscene': 2,
  'threat': 3,
  'insult': 4,
  'identity_hate': 5})

In [31]:
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(model_name, 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(TARGET_LABELS),
                                                           id2label=id2label,
                                                           label2id=label2id)

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [32]:
from transformers import TrainingArguments, Trainer
batch_size = 2

args = TrainingArguments(
    output_dir='./outputs',
    evaluation_strategy = "epoch",  
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1, #we recommend at least 5
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model= "f1",
    #push_to_hub=True,
)
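
With these settings, the number of optimization steps follows directly from the size of the training split. The sketch below assumes a single device and no gradient accumulation; the helper `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Batches needed to see every example once
    (assumes one device and no gradient accumulation)."""
    return math.ceil(num_examples / batch_size)

train_examples = 1596   # rows in the train split
print(steps_per_epoch(train_examples, batch_size=2))    # batch size used above
print(steps_per_epoch(train_examples, batch_size=16))   # a larger batch, for comparison
```

A very small batch size keeps memory usage low at the cost of many more steps per epoch.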



In [33]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
import numpy as np

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               #'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(predictions=preds, labels=p.label_ids)
    return result
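
To make the two metrics concrete, here is a minimal worked example in plain Python, assuming predictions of shape (batch_size, num_labels) as in the comments above. It applies the sigmoid and the 0.5 threshold, then computes the micro-averaged F1 (counting true positives, false positives and false negatives over all labels together) and the accuracy, which for multi-label data counts a row as correct only if every label matches. The logits are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def micro_f1_and_accuracy(logits, y_true, threshold=0.5):
    # Turn scores into 0/1 predictions
    y_pred = [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]
    # Pool the counts over every (example, label) pair
    tp = fp = fn = 0
    for pred_row, true_row in zip(y_pred, y_true):
        for p, t in zip(pred_row, true_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    # A row only counts as correct if all of its labels are right
    accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
    return f1, accuracy

# Hypothetical scores for 2 examples and 3 labels
logits = [[2.0, -1.0, 0.5], [-3.0, 0.2, -0.7]]
y_true = [[1, 0, 0], [0, 1, 0]]
f1, accuracy = micro_f1_and_accuracy(logits, y_true)
print(f1, accuracy)
```

Here there are 2 true positives, 1 false positive and no false negatives, so the micro F1 is 4/5 = 0.8, while only one of the two rows is entirely correct, giving an accuracy of 0.5.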

In [34]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

With a small sample of the training data, this process may take around 5 minutes for 1 epoch on a Tesla T4 GPU.

This time increases significantly (to more than 1 hour) when training on the full training set.

In [None]:
trainer.train()



Epoch,Training Loss,Validation Loss


This gives us the evaluation, on the validation set, of the best model obtained during training:

In [None]:
trainer.evaluate()

## Evaluation

We will use the model to predict the labels of the test set. 


In [None]:
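import re

def get_prediction(text):
    # tokenize the text with the same task prefix used during training
    inputs = tokenizer(PREFIX + text, max_length=MAX_LENGTH, truncation=True, return_tensors="pt").to(model.device)
    # T5 is a text-to-text model: generate the target string instead of calling the forward pass directly
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # parse the generated string (e.g. "[1.0, 0.0, ...]") into one 0/1 value per label;
    # values missing from the generated text are assumed to be 0
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", decoded)][:len(TARGET_LABELS)]
    values += [0.0] * (len(TARGET_LABELS) - len(values))
    return [1 if v >= 0.5 else 0 for v in values]

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
# the raw test split was kept in the variable test
y_pred=[get_prediction(text) for text in test['text']]
y_pred

In [None]:
# the predictions are already binary, so the metrics are computed directly (no sigmoid or threshold)
y_true = np.array(test['label']).astype(int)
y_pred = np.array(y_pred)
{'f1': f1_score(y_true=y_true, y_pred=y_pred, average='micro'), 'accuracy': accuracy_score(y_true, y_pred)}

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_true, y_pred=y_pred, target_names=TARGET_LABELS, zero_division=0))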