In [1]:
!pip install numpy==1.23.5
!pip install transformers[torch]==4.35.2
!pip install accelerate -U
!pip install evaluate

Collecting transformers[torch]==4.35.2
  Using cached transformers-4.35.2-py3-none-any.whl (7.9 MB)
Collecting accelerate>=0.20.3 (from transformers[torch]==4.35.2)
  Using cached accelerate-0.29.2-py3-none-any.whl (297 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch!=1.12.0,>=1.10->transformers[torch]==4.35.2)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch!=1.12.0,>=1.10->transformers[torch]==4.35.2)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch!=1.12.0,>=1.10->transformers[torch]==4.35.2)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch!=1.12.0,>=1.10->transformers[torch]==4.35.2)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-c

In [2]:
import pandas as pd
def load_prepare_data(path):
  """
  Load and prepare the data for the exercise.
  """
  df = pd.read_csv(path,sep=",")
  map_classes = {
    "religion":1,
    "age":1,
    "ethnicity":1,
    "gender":1,
    "other_cyberbullying":1,
    "not_cyberbullying":0,
  }
  df["cyberbullying"] = df.cyberbullying_type.map(map_classes)
  return df[["tweet_text","cyberbullying"]].copy()

# Exercise


In this exercise we will work with a dataset taken from online social media.

One of the biggest problems on today's internet is the presence of negative attitudes toward certain groups on the basis of their ethnicity, gender, religion, or political ideology. In this exercise we will work with a real, manually labeled dataset from the [Kaggle](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification/data) platform. Originally, each document in the dataset was assigned one of the following categories:
- *religion*
- *age*
- *ethnicity*
- *gender*
- *other_cyberbullying*
- *not_cyberbullying*


The dataset was originally intended for training a model that detects the type of hate content found online according to the group being targeted. Here, to simplify the exercise, the function `load_prepare_data()` remaps the categories so that only two remain: 1 if the tweet contains hate content and 0 if it does not.

**In this exercise you must train a classification model using the Transformers library.** Since the exploratory analysis was carried out in the previous exercise, here you can focus on selecting a suitable pre-trained model, training it, and evaluating it.


**Note 1**: This exercise requires Google Colab GPUs. This Colab should be preconfigured to run on a GPU; if you run into problems during execution, contact me through Moodle to look for alternative solutions.

## 0. Imports


In [3]:
from transformers import (
   AutoConfig,
   AutoTokenizer,
   AutoModelForSequenceClassification,
   AdamW
)
import torch
import pandas as pd
from sklearn.model_selection import train_test_split



## 1. Obtaining the corpus
To obtain the data you can use the function `load_prepare_data()`. It prepares the data for the exercise as a Pandas dataframe so that you can work with it.

In [4]:
path_data = "https://raw.githubusercontent.com/luisgasco/ntic_master_datos/main/datasets/cyberbullying_tweets.csv"
# Alternative data path in case the one above does not work (being hosted on GitHub,
# downloads may be limited).
# path_data = "https://zenodo.org/records/10938455/files/cyberbullying_tweets.csv?download=1"
dataset = load_prepare_data(path_data)

In [5]:
dataset.head(4)

Unnamed: 0,tweet_text,cyberbullying
0,"In other words #katandandre, your food was cra...",0
1,Why is #aussietv so white? #MKR #theblock #ImA...,0
2,@XochitlSuckkks a classy whore? Or more red ve...,0
3,"@Jason_Gio meh. :P thanks for the heads up, b...",0


## 2. Exploratory analysis

You can skip it in this exercise.

## 3. Preprocessing and Normalization

First we load the pre-trained pipeline. It is a module that can perform a variety of tasks, such as sentiment analysis, text generation, or translation, in a very simple way.

In [6]:
from transformers import pipeline



In [7]:
texts = dataset.tweet_text.values  # an array of strings
labels = dataset.cyberbullying.values  # an array of integers
print(texts)
print(labels)

['In other words #katandandre, your food was crapilicious! #mkr'
 'Why is #aussietv so white? #MKR #theblock #ImACelebrityAU #today #sunrise #studio10 #Neighbours #WonderlandTen #etc'
 '@XochitlSuckkks a classy whore? Or more red velvet cupcakes?' ...
 "I swear to God. This dumb nigger bitch. I have got to bleach my hair reeeeeal fuckin' soon. D:&lt; FUCK."
 'Yea fuck you RT @therealexel: IF YOURE A NIGGER FUCKING UNFOLLOW ME, FUCKING DUMB NIGGERS.'
 'Bro. U gotta chill RT @CHILLShrammy: Dog FUCK KP that dumb nigger bitch lmao']
[0 0 0 ... 1 1 1]


In [8]:
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=.25, random_state=0,
                                                    stratify = labels)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2, random_state=0,stratify = train_labels)
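
The two `train_test_split` calls above give roughly a 60/15/25 train/validation/test split: 25% is held out for testing, and 20% of the remaining 75% becomes the validation set. Below is a minimal sketch of that arithmetic, assuming scikit-learn's convention of rounding the held-out fraction up. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size=0.25, val_size=0.2):
    """Return (n_train, n_val, n_test) for two chained hold-out splits."""
    # First split: the test fraction is rounded up, the rest is kept.
    n_test = math.ceil(n_samples * test_size)
    n_train_val = n_samples - n_test
    # Second split: the validation fraction is taken from what remains.
    n_val = math.ceil(n_train_val * val_size)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

print(split_sizes(1000))
```

Because `stratify` is passed in both calls, each of the three subsets should keep approximately the same proportion of the two classes as the full dataset.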

I will work with the model seen in class, 'bert-base-uncased'.
BERT is a language model developed by Google. It comes in a base and a large size; I will use base, the lighter of the two, and the uncased variant so that the model does not distinguish between upper and lower case.
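
To illustrate what "uncased" means in practice, here is a rough approximation of the normalization an uncased BERT tokenizer applies before splitting text into tokens: lowercasing and stripping accents. This is a sketch of the idea, not the tokenizer's actual code, and `uncased_normalize` is a hypothetical name.

```python
import unicodedata

def uncased_normalize(text):
    """Lowercase the text and drop accent marks."""
    text = text.lower()
    # NFD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn"; removing them strips accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Why is #AussieTV so white?"))
```

One consequence for this dataset is that tweets written in all capitals map to the same tokens as their lowercase versions, so any signal carried by "shouting" is lost.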

In [9]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [10]:
# Look at an example tweet
print(dataset.tweet_text[0])
texto_tokens = tokenizer(dataset.tweet_text[0]).tokens()
texto_tokens

In other words #katandandre, your food was crapilicious! #mkr


['[CLS]',
 'in',
 'other',
 'words',
 '#',
 'kata',
 '##nda',
 '##nd',
 '##re',
 ',',
 'your',
 'food',
 'was',
 'crap',
 '##ili',
 '##cious',
 '!',
 '#',
 'mk',
 '##r',
 '[SEP]']
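
The `##` pieces in the output above come from WordPiece, which splits a word that is not in the vocabulary into the longest known sub-words, scanning from left to right. Below is a minimal sketch of that greedy longest-match-first idea. The toy vocabulary is made up for illustration, and the real tokenizer adds further steps (such as splitting punctuation first) that are omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest sub-words found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matched at this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, just large enough for two words from the tweet.
toy_vocab = {"crap", "##ili", "##cious", "mk", "##r"}
print(wordpiece("crapilicious", toy_vocab))
print(wordpiece("mkr", toy_vocab))
```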

## 4. Vectorization

Each example fed to the model has the following fields:

- *input_ids*: Numeric identifiers of the tokens in the model's vocabulary.
- *attention_mask*: Vector that tells the neural network which positions of the input sequence to attend to and which to ignore (the padding).
- *labels*: The label associated with the text.
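
As an illustration of how *input_ids* and *attention_mask* relate, the sketch below pads a list of token ids to a fixed length and builds the matching mask. It is a simplification: `pad_and_mask` is a hypothetical helper, the padding id of 0 is an assumption, and the truncation here simply cuts the list, whereas a real tokenizer keeps the closing special token.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    # Real tokens get mask value 1.
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    # Padding positions get the pad id and mask value 0 so they are ignored.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = pad_and_mask([101, 2023, 2003, 102], max_length=6)
print(ids)
print(mask)
```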

In [11]:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        """
        Constructor of the CustomDataset class.
        Parameters:
        - texts: List of texts.
        - labels: List of labels corresponding to the texts.
        - tokenizer: Tokenizer object to use.
        - max_length: Maximum sequence length after tokenization.
        """
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        """
        Return the number of examples in the dataset.
        """
        return len(self.texts)

    def __getitem__(self, idx):
        """
        Get one item from the dataset.

        Parameters:
        - idx: Index of the item to fetch.

        Returns:
        A dictionary with 'input_ids', 'attention_mask' and 'labels'.
        """
        # Get the text and label at the given index
        text = str(self.texts[idx])
        label = int(self.labels[idx])

        # Tokenize the text
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )

        # Return the dictionary with the data
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

In [50]:
max_length = 60  # Capped at 60 tokens so that Colab can run it (the first attempt used 45)

train_dataset = CustomDataset(train_texts, train_labels, tokenizer, max_length)
val_dataset = CustomDataset(val_texts, val_labels, tokenizer, max_length)
test_dataset = CustomDataset(test_texts, test_labels, tokenizer, max_length)

# Out of curiosity, inspect the contents
print('1: ', train_texts)
print('2: ', val_labels)
print('3: ', tokenizer)
print('4: ', max_length)

1:  ['@travel_abstract @hotelsdotcom happy to help, of course. email me at elliottc@gmail.com with details.'
 '@plaidcat9 He was also just existing, no?'
 'Oho... First learn how to write then talk to me....yes also expand ur knowledge..... Just do one thing open read a GK book'
 ... 'hashtag nachoshield hashtag goobergrape http://t.co/5Mz1rwp3bY'
 'the scale of what is worse: [the greater the value, the worse it] pedo ‘jokes’ &lt; ‘racist’ jokes ice &gt; isis gangs &lt; local pd kill baby &lt; gay or black conservatism rape reporting &gt; antifa $1 salary &gt; $1.7 bil cash **according to MSM, ‘liberals,’ communists, &amp; rest of left**'
 'They really are weak ass punks. Their the idiots that got bullied in school, but didn’t have the balls to stand up to the bullies. Now their going to show the world how tough they are by attacking a Starbucks. All you can do is laugh at them.']
2:  [1 0 1 ... 1 1 1]
3:  BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max

In [51]:
train_dataset[0]

{'input_ids': tensor([  101,  1030,  3604,  1035, 10061,  1030,  9275, 27364,  9006,  3407,
          2000,  2393,  1010,  1997,  2607,  1012, 10373,  2033,  2012,  9899,
          2278,  1030, 20917,  4014,  1012,  4012,  2007,  4751,  1012,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'labels': tensor(0)}

## 5. Model training and evaluation
We now train the model.

I lower the parameters below their default values to try to make training faster.

In [52]:
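# Note: these form values are not passed to TrainingArguments below, which sets a batch size of 8.
max_seq_length = 60 #@param {type: "integer"}
train_batch_size = 4 #@param {type: "integer"}
eval_batch_size = 4 #@param {type: "integer"}
test_batch_size = 4 #@param {type: "integer"}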

In [53]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

id2label = {0: "no_cyberbullying", 1: "cyberbullying"}
label2id = {"no_cyberbullying": 0, "cyberbullying": 1}
model = AutoModelForSequenceClassification.from_pretrained(model_name,  num_labels=2, id2label=id2label, label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [54]:
import accelerate

training_args = TrainingArguments(
    output_dir="modelo_test",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False
)
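
The batch size and number of epochs set above determine how many optimizer steps the run takes. A minimal sketch of the arithmetic, assuming a single device, no gradient accumulation, and a final partial batch that is kept. The training-set size of 28,615 is a hypothetical figure chosen because it is consistent with the `global_step=14308` reported by the training output further down.

```python
import math

def total_training_steps(n_train, batch_size, epochs):
    """Number of optimizer steps for a run without gradient accumulation."""
    # The last batch of each epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

# 28615 is an assumed training-set size, not a value printed by the notebook.
print(total_training_steps(n_train=28615, batch_size=8, epochs=4))
```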

In [55]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
f1_score = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy_value = accuracy.compute(predictions=predictions, references=labels)
    f1_score_value = f1_score.compute(predictions=predictions, references=labels)

    return {
        "accuracy": accuracy_value,
        "f1_score": f1_score_value,
    }

In [56]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

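# This object is not passed to the Trainer; the import is needed for the line to run.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)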


In [57]:
# Expected to take about 20 minutes; this run took about 31 (train_runtime of roughly 1870 s)
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1 Score
1,0.2646,0.254003,{'accuracy': 0.8919485602460162},{'f1': 0.9360046361453762}
2,0.2357,0.263007,{'accuracy': 0.8968409281520827},{'f1': 0.939577533977403}
3,0.1957,0.396987,{'accuracy': 0.8905507408442829},{'f1': 0.9365015002838375}
4,0.1445,0.518481,{'accuracy': 0.885099245177523},{'f1': 0.9322676334871457}


Trainer is attempting to log a value of "{'accuracy': 0.8919485602460162}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.9360046361453762}" of type <class 'dict'> for key "eval/f1_score" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.8968409281520827}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.939577533977403}" of type <class 'dict'> for key "eval/f1_score" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.8905507408442829}" of type <class

TrainOutput(global_step=14308, training_loss=0.2126988492799865, metrics={'train_runtime': 1870.6736, 'train_samples_per_second': 61.187, 'train_steps_per_second': 7.649, 'total_flos': 3529182585528000.0, 'train_loss': 0.2126988492799865, 'epoch': 4.0})
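
The Trainer warnings above appear because each `evaluate` metric returns a dictionary (for example `{'accuracy': 0.89}`), so `compute_metrics` ends up returning nested dictionaries where the logger expects numbers. Below is a minimal sketch of a variant that returns plain floats. To stay self-contained it computes accuracy and binary F1 with NumPy instead of `evaluate`, which is a substitution on my part; with `evaluate`, indexing each result (`["accuracy"]`, `["f1"]`) should have the same effect. The name `compute_metrics_scalar` is hypothetical.

```python
import numpy as np

def compute_metrics_scalar(eval_pred):
    """Return accuracy and F1 for the positive class as plain floats."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    labels = np.asarray(labels)

    accuracy = float((preds == labels).mean())

    # Binary F1 for class 1, computed from true/false positives and negatives.
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    return {"accuracy": accuracy, "f1_score": f1}

# Tiny made-up batch: four examples, two logits each.
toy_logits = np.array([[2.0, -1.0], [0.0, 1.0], [0.5, 0.2], [-1.0, 3.0]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_metrics_scalar((toy_logits, toy_labels)))
```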

### Evaluating the model

In [59]:
# Evaluate on the test data (loss and metrics)
trainer.evaluate(test_dataset)

Trainer is attempting to log a value of "{'accuracy': 0.8883670217227208}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.9337118382389561}" of type <class 'dict'> for key "eval/f1_score" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.2624798119068146,
 'eval_accuracy': {'accuracy': 0.8883670217227208},
 'eval_f1_score': {'f1': 0.9337118382389561},
 'eval_runtime': 45.8529,
 'eval_samples_per_second': 260.027,
 'eval_steps_per_second': 32.517,
 'epoch': 4.0}

We predict on the test set with the `.predict()` method and then derive the label for each prediction.

In [64]:
predictions = trainer.predict(test_dataset)

In [65]:
predictions

PredictionOutput(predictions=array([[-2.8112283 ,  3.1496375 ],
       [-3.2114565 ,  3.5426888 ],
       [-3.2289305 ,  3.5495832 ],
       ...,
       [-1.8867995 ,  2.018777  ],
       [-2.9505599 ,  3.2326443 ],
       [ 0.54394716, -0.9846418 ]], dtype=float32), label_ids=array([1, 1, 1, ..., 1, 1, 0]), metrics={'test_loss': 0.2624798119068146, 'test_accuracy': {'accuracy': 0.8883670217227208}, 'test_f1_score': {'f1': 0.9337118382389561}, 'test_runtime': 42.9448, 'test_samples_per_second': 277.635, 'test_steps_per_second': 34.719})

In [66]:
y_pred = predictions.predictions.argmax(axis=1)
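
The values in `predictions.predictions` are raw logits, and `argmax` simply picks the larger one in each row. To read them as class probabilities, a softmax can be applied. A minimal sketch using the first row of logits printed above; `softmax` here is a small helper written for the illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

first_row = [-2.8112283, 3.1496375]  # first prediction shown in the output above
probs = softmax(first_row)
predicted_label = probs.index(max(probs))
print(probs, predicted_label)
```

The predicted label is the same as with `argmax` on the logits, since softmax preserves their order.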

We take the true labels and compute the classification report:

In [67]:
y_true = [x["labels"].item() for x in test_dataset]

In [69]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

print(confusion_matrix(y_true,y_pred))
print(classification_report(y_true,y_pred))

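# Compared with the first attempt, precision for class 0 is slightly lower (0.68 vs 0.70) and recall for
# class 1 dips (0.94 vs 0.95), while precision for class 1 (0.92 vs 0.91) and recall for class 0 (0.61 vs 0.55) are higher.
# It is worth looking at several metrics.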

[[1218  768]
 [ 563 9374]]
              precision    recall  f1-score   support

           0       0.68      0.61      0.65      1986
           1       0.92      0.94      0.93      9937

    accuracy                           0.89     11923
   macro avg       0.80      0.78      0.79     11923
weighted avg       0.88      0.89      0.89     11923
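
The figures in the report can be derived by hand from the confusion matrix printed above, where rows are true labels and columns are predicted labels. A minimal sketch in plain Python; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(cm):
    """Compute precision, recall and F1 per class, plus overall accuracy."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    report = {}
    for c in range(n_classes):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n_classes))  # column sum
        actual = sum(cm[c])                                  # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report, correct / total

cm = [[1218, 768],
      [563, 9374]]
report, accuracy = per_class_metrics(cm)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
print("accuracy", round(accuracy, 2))
```

Reading the matrix this way makes the class imbalance visible: 768 of the 1,986 non-cyberbullying tweets are misclassified, which is why recall for class 0 is much lower than for class 1.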



These results are considerably better than the previous ones, even without any preprocessing and with parameters set below their default values.


Compared with the **feature-based classifier** (exercise 1):

                   precision recall   f1-score     support

         0.0       0.65      0.37     0.47        1962
         1.0       0.82      0.94     0.88        6217

    accuracy                          0.80        8179
    macro avg      0.74      0.65     0.67        8179
    weighted avg   0.78      0.80     0.78        8179


Compared with a **first attempt**, which used these settings:

- `max_length = 45`
- `max_seq_length = 60`
- `train_batch_size = 4`
- `eval_batch_size = 4`
- `test_batch_size = 4`

                   precision recall    f1-score  support

           0       0.70      0.55      0.61      1986
           1       0.91      0.95      0.93      9937

    accuracy                           0.89      11923
    macro avg      0.81      0.75      0.77      11923
    weighted avg   0.88      0.89      0.88      11923