# Practical Work 3
## Session 5: BETO and RoBERTa (Spanish) for text classification tasks.

- José Baixauli
- Kexin Jiang
- José Fco. Olivert

The goal of this lab session is to help students understand and gain
practice in the use of deep learning-based language models, in particular
transformer-based models (BERT, BETO and RoBERTa). The second
objective is the application of these models to text classification tasks,
particularly for the HUHU shared task.

### Importing all the libraries and packages

In [1]:
!pip install transformers
import numpy as np
import pandas as pd
from transformers import  AdamW,BertModel,BertTokenizer,RobertaTokenizer,RobertaModel
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import seaborn as sns
from sklearn.model_selection import train_test_split
import copy
import warnings
from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import f1_score as f1
import torch.optim as optim


warnings.filterwarnings("ignore")

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

tokenizer = BertTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m60.6 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m80.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/248k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/134 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/310 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/650 [00:00<?, ?B/s]

We create the tokenizer too. This BERT tokenizer will help us to transform the text and train the model


## Reading the data

We will use "train.csv" dataset that contains the raw text of the tweet and different variables. 

We will use just the text of the tweet and the variable "humor" since we are going to fine tune a transformer for binary classification in order to predict this *label*

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data = pd.read_csv("/content/drive/MyDrive/transformer/train.csv")

In [None]:
data

Unnamed: 0,index,tweet,humor,prejudice_woman,prejudice_lgbtiq,prejudice_inmigrant_race,gordofobia,mean_prejudice
0,72157,Mi celular tiene una aplicación que te hace ve...,1,0,0,0,1,3.0
1,68084,"En esta vida me tocó tener mala suerte, espero...",1,0,0,0,1,2.8
2,69089,"Tu mamá es taaan taan obesa, que cuando pasa f...",1,0,0,0,1,3.6
3,69190,Mi tía me dijo: \n- tengo memoria de Elefante....,1,0,0,0,1,3.4
4,70474,"- Mamá, en el colegio me dicen gorda.\n- ¡Ay M...",1,0,0,0,1,3.0
...,...,...,...,...,...,...,...,...
2666,41280,Un claro ejemplo más del vacío moral de las fe...,0,1,0,0,0,3.4
2667,2166,MENTION Vamos a preguntar a las feminazis. Par...,0,1,0,0,0,3.8
2668,39933,"Si tuviera tetas y subiera fotos picantes, al ...",0,1,0,0,0,4.0
2669,4992,qtagarre dl culo ynotarle toda lapolla ay bien...,0,1,0,0,0,3.8


We divide the dataset intro training and testing in order to see later the performance of our model.

In [None]:
inputs = data["tweet"]
labels = data["humor"]



train_inputs, test_inputs, train_labels, test_labels = train_test_split(
    inputs, 
    labels, 
    test_size=0.2, 
    stratify=labels
)


In [None]:
train_inputs

1285                         MENTION gitanos d to la vida
1664                  En ámbito  Lgtbi   no es aplicable!
324     Un poco de humor negro (La pierna está complet...
538     MENTION Pero se enfadan si las llamamos femina...
393     MENTION ¿De qué hablas? Se la pasan hablando d...
                              ...                        
453     si un gay le dice a otro "que te den por el cu...
1314    MENTION MENTION Mientras tanto tú estás de pas...
1055              MAS XENOFOBO QUE AYER\nMENOS QUE MAÑANA
488     Las feministas tienen clara su prioridad: \n\n...
1379    MENTION Y va a resultar que el diablo es negro...
Name: tweet, Length: 2136, dtype: object

## DATASET CLASS

This class recieves a set of tweets ,its respective labels and the max_len of the tweets, in our case we used 60 as we will see later. Also recieves a tokenizer that will transform the tweets to "inputs_id" and "attention_mask", that are the two variables that will feed our model. The output of this class will be a dataloader dividing the train set and the test set in batches, containing the raw text of he tweet, the "inputs_id" of the tweet, the "attention_mask" and the target (the label). This batches will be of 64 observations. this will make our training process more efficient because the parameters of the model will be easier to optimize.

In [None]:
class createDataset(Dataset):

    def __init__(self, texts, targets, tokenizer, max_len):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len
  
    def __len__(self):
        return len(self.texts)
  
    def __getitem__(self, item):
        
        text = self.texts[item]
        target = self.targets[item]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )
        input=encoding['input_ids'].flatten()
        

        return {
            'text': text,
            'input_ids': input,
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }

def create_data_loader(texts, labels, tokenizer, max_len, batch_size):
    
    ds = createDataset(
        texts=texts.to_numpy(),
        targets=labels.to_numpy(),
        
        tokenizer=tokenizer,
        max_len=max_len
  )

    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
      )

## MODEL CLASS

This class is composed by two functions.
1. __init__ : This class builds the structure of the model. First we have the BETO model as the first layer. Then we have linear layer such as linears, dropouts, ReLU and Softmax.
2. __forward__ : This function is in charge of returning the outputs. The input to this function is the inputs_id and the attention_max of each observation of the batch. This will go into the BETO model that returns the 768 components of the embedding of each tweet. Then we will perform a Linear layer in order to reduce the dimensionality, from 768 to 192. Later, we will make a dropout to avoid overfitting and turning off some random parameters of the model. Then we will perform a ReLU in order to find non-linear relationships. We will perfomr a dropout again and finally we will perform another Linear, from 192 to 2 and then a Softmax because we are in a classification task.

### Two possible models:
- bertin-project/bertin-roberta-base-spanish
- dccuchile/bert-base-spanish-wwm-uncased ***Better performance**

In [None]:
class Model(nn.Module):
    def __init__(self, latent_dims,max_len,nhid):
        super(Model, self).__init__()

        
        self.roberta = BertModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
        self.linear = nn.Linear(in_features=768, out_features=192)
        self.dropout=nn.Dropout(0.2)
        self.r = nn.ReLU()
        self.l = nn.Linear(in_features=192,out_features=2)
        self.s = nn.Softmax(dim=1)
        

        self.latent_dims=latent_dims
        self.nhid=nhid


    def forward(self, input_id,attention):

      secuence_output = self.roberta(
            input_ids=input_id,
            attention_mask=attention
        )
      
      o = secuence_output.pooler_output

      o=self.linear(o)
      o=self.dropout(o)
      o=self.r(o)
      o=self.l(o)
      o=self.dropout(o)
      o = self.s(o)

  
      
      return o



## Inicializing parameters.
We will use as loss function CrossEntropyLoss but another option is to use BCEloss. The batch size is 64 as commented before and the learning rate will be 0.00001. After different tests those parameters gave our model the best performance.

In [None]:
batch_size = 64
learning_rate = 0.00001
criterion = nn.CrossEntropyLoss().to(device)
criterion.requires_grad=True
epochs = 8
latentdims=2
nhid=128
max_len=60


### We create the model and the optimizer, then we will put our model in the GPU in order to obtain better performance

In [None]:
model= Model(latentdims,max_len,nhid)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

model.to(device)

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dens

Model(
  (roberta): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31002, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=T

### We create our train data loader and test data loader to train the model and test it.

In [None]:
train_data_loader = create_data_loader(train_inputs, train_labels,tokenizer, max_len, batch_size)

test_data_loader = create_data_loader(test_inputs, test_labels, tokenizer, max_len, batch_size)

## Training function

This function recieves the model the train data loader, the test data loader, the loss function and the optimizer. For each bacth of the train test it will sustract the inputs, the attention and the labels. Then it will feed the model with those variables and get an output. We will compute the loss with the output and the target and we will back propagate it for the next iteration. Then we compute the training accuracy and the training f1. Then we turn off all the gradients and we just predict the test set. We compute the accuracy and f1 of test and we show the results for each epoch.

In [None]:
def train_an_epoch(
    model, 
    train_data_loader,
    dev_data_loader,
    criterion, 
    optimizer
):

    

    # These are the metrics that will indicate us how well it's doing the model...
    running_loss = 0
    training_acc=[]
    f1_training=[]
    steps = 0;
    
    for batch in train_data_loader:
        
        b=len(batch["input_ids"])
        # Clean gradients...
        optimizer.zero_grad()
    
        # Get the information from the tokenization... (using GPU)
        input_ids = batch["input_ids"].to(device)
        targets = batch["targets"].to(device)
        attention = batch["attention_mask"].to(device)

        # get the model's predictions...
        outputs = model(
            input_ids,
            attention
            
        )

        # Apply the loss function and the perform backward propagation...
       
        loss = criterion(outputs, targets)
       
        loss.backward() 
        optimizer.step()
        
        # update the metrics...

        pred = []
        real=[]
        for output in outputs:
          zero=output[0].item()
          one=output[1].item()
          if zero > one:
            pred.append(0)
          else:
            pred.append(1)

        for t in targets:

          real.append(t.item())

        bacc=acc(real,pred)
        bf1= f1(real,pred)
        running_loss+=loss.item()
        training_acc.append(bacc)
        f1_training.append(bf1)

        steps+=1
            
    # get the mean of the metrics...
    
    loss = running_loss/steps;
    t_acc=sum(training_acc)/len(training_acc)
    t_f1=sum(f1_training)/len(f1_training)
    
    
    
    # evaluate the model with the validation data set 
    # ("turn off" gradients...)
    with torch.no_grad():
        
        # These are the metrics that will indicate us how well it's doing the model...
        test_acc=[];
        steps_val=0;
        f1_test=[]
        
        for batch in dev_data_loader:

            b= len(batch["input_ids"])
            
            # Get the information from the tokenization... (using GPU)
            input_ids = batch["input_ids"].to(device)
            targets = batch["targets"].to(device)
            attention = batch["attention_mask"].to(device)

            # get the model's predictions...
            outputs = model(
                input_ids,
                attention
                
                
            )
            
            pred = []
            real=[]
            for output in outputs:
              zero=output[0].item()
              one=output[1].item()
              if zero > one:
                pred.append(0)
              else:
                pred.append(1)

            for t in targets:

              real.append(t.item())

            bacc=acc(real,pred)
            bf1= f1(real,pred)
            test_acc.append(bacc)
            f1_test.append(bf1)


        v_acc=sum(test_acc)/len(test_acc)
        v_f1= sum(f1_test)/len(f1_test)
    

    return loss,t_acc,v_acc,t_f1,v_f1

def train_the_model(epochs):
    
    for e in range(epochs):
      #, acc, val_acc
        
        loss,t_acc,v_acc,t_f1,v_f1 = train_an_epoch(
            model, 
            train_data_loader,
            test_data_loader,
            criterion, 
            optimizer
        )
        
        print('--------EPOCH SUMMARY---------')
        print('Epoch ', e+1, ' training loss: ', loss)
        print('Epoch ', e+1, ' training acc: ', t_acc*100, '%')
        print('Epoch ', e+1, ' val acc: ', v_acc*100, '%')
        print('Epoch ', e+1, ' training f1: ', t_f1*100, '%')
        print('Epoch ', e+1, ' val f1: ', v_f1*100, '%')

In [None]:
train_the_model(epochs)

--------EPOCH SUMMARY---------
Epoch  1  training loss:  0.6221428706365473
Epoch  1  training acc:  66.94240196078431 %
Epoch  1  val acc:  75.36231884057972 %
Epoch  1  training f1:  14.14377806622289 %
Epoch  1  val f1:  50.795454506113416 %
--------EPOCH SUMMARY---------
Epoch  2  training loss:  0.5186118650085786
Epoch  2  training acc:  81.49509803921569 %
Epoch  2  val acc:  82.88798309178743 %
Epoch  2  training f1:  67.49079356480804 %
Epoch  2  val f1:  74.29591704342785 %
--------EPOCH SUMMARY---------
Epoch  3  training loss:  0.44984644914374633
Epoch  3  training acc:  87.68382352941177 %
Epoch  3  val acc:  83.02385265700482 %
Epoch  3  training f1:  80.39506554605195 %
Epoch  3  val f1:  74.18362394587822 %
--------EPOCH SUMMARY---------
Epoch  4  training loss:  0.4202822069911396
Epoch  4  training acc:  90.25735294117648 %
Epoch  4  val acc:  81.18961352657004 %
Epoch  4  training f1:  84.9686838691624 %
Epoch  4  val f1:  73.20293047066502 %
--------EPOCH SUMMARY--

## Results
As we can see the model learns in each iteration and we get good results. In order to improve the model we could use sentiment analysis to feed the model too.

### Training the model with all the data

In [None]:
def train_an_epoch_full(
    model, 
    train_data_loader,
    criterion, 
    optimizer
):

    

    # These are the metrics that will indicate us how well it's doing the model...
    running_loss = 0
    training_acc=[]
    f1_training=[]
    steps = 0;
    
    for batch in train_data_loader:
        
        b=len(batch["input_ids"])
        # Clean gradients...
        optimizer.zero_grad()
    
        # Get the information from the tokenization... (using GPU)
        input_ids = batch["input_ids"].to(device)
        targets = batch["targets"].to(device)
        attention = batch["attention_mask"].to(device)

        # get the model's predictions...
        outputs = model(
            input_ids,
            attention
            
        )

        # Apply the loss function and the perform backward propagation...
       
        loss = criterion(outputs, targets)
       
        loss.backward() 
        optimizer.step()
        
        # update the metrics...

        pred = []
        real=[]
        for output in outputs:
          zero=output[0].item()
          one=output[1].item()
          if zero > one:
            pred.append(0)
          else:
            pred.append(1)

        for t in targets:

          real.append(t.item())

        bacc=acc(real,pred)
        bf1= f1(real,pred)
        running_loss+=loss.item()
        training_acc.append(bacc)
        f1_training.append(bf1)

        steps+=1
            
    # get the mean of the metrics...
    
    loss = running_loss/steps;
    t_acc=sum(training_acc)/len(training_acc)
    t_f1=sum(f1_training)/len(f1_training)
    

    
    
    

    return loss,t_acc,t_f1

def train_full_model(epochs):
    
    for e in range(epochs):
  
        
        loss,t_acc,t_f1 = train_an_epoch_full(
            model, 
            full_data_loader,
            criterion, 
            optimizer
        )
        
        print('--------EPOCH SUMMARY---------')
        print('Epoch ', e+1, ' training loss: ', loss)
        print('Epoch ', e+1, ' training acc: ', t_acc*100, '%')
        
        print('Epoch ', e+1, ' training f1: ', t_f1*100, '%')
        

### Creating the full data set

In [None]:
full_inputs = pd.concat([train_inputs,test_inputs])
full_targets = pd.concat([train_labels,test_labels])

In [None]:
full_data_loader=create_data_loader(full_inputs, full_targets,tokenizer, max_len, batch_size)

### Training results

In [None]:
train_full_model(epochs)

--------EPOCH SUMMARY---------
Epoch  1  training loss:  0.649702767531077
Epoch  1  training acc:  65.48806357649443 %
Epoch  1  training f1:  9.34680026503458 %
--------EPOCH SUMMARY---------
Epoch  2  training loss:  0.6339645825681233
Epoch  2  training acc:  65.50468591691995 %
Epoch  2  training f1:  7.092426768910108 %
--------EPOCH SUMMARY---------
Epoch  3  training loss:  0.6149893488202777
Epoch  3  training acc:  66.62075734549138 %
Epoch  3  training f1:  8.478538516256114 %
--------EPOCH SUMMARY---------
Epoch  4  training loss:  0.5615742795524143
Epoch  4  training acc:  74.32402482269505 %
Epoch  4  training f1:  41.64258803032905 %
--------EPOCH SUMMARY---------
Epoch  5  training loss:  0.4605344455866587
Epoch  5  training acc:  86.77653242147923 %
Epoch  5  training f1:  78.33728194908483 %
--------EPOCH SUMMARY---------
Epoch  6  training loss:  0.4227164763779867
Epoch  6  training acc:  88.96118287740629 %
Epoch  6  training f1:  82.31851369165145 %
--------EPOC

## Predictions
For the task 1 of HUHU we will predict the observations of the test given by the organization

In [None]:
test= pd.read_csv("/content/drive/MyDrive/transformer/test.csv")

In [None]:
test

Unnamed: 0,index,tweet
0,52830,-Mamá en la escuela me dicen gorda -Pobresilla...
1,78883,"No te sientas diferente, da igual si eres negr..."
2,78926,Si esta asi.. SUPER SI.. y que se pongan celos...
3,61844,—Bebé ¿Me veo gorda con este vestido?\n—¡No mi...
4,78830,Las mujeres solo desean 2 cosas en la vida: co...
...,...,...
773,9496,Decir que una mujer está soltera es de machist...
774,14026,¿cómo un aliado se atreve a chamuyar a una ant...
775,12393,"MENTION No hicieron nada por las mujeres, son ..."
776,18723,Cuando llegará ese día en que las chicas organ...


In [None]:
test_inputs= test["tweet"]

### Creating the class to preprocess the tweets and giving them to the model

In [None]:
class createTestDataset(Dataset):

    def __init__(self, texts,  tokenizer, max_len):
        self.texts = texts
        
        self.tokenizer = tokenizer
        self.max_len = max_len
  
    def __len__(self):
        return len(self.texts)
  
    def __getitem__(self, item):
        
        text = self.texts[item]
        #sentiments=self.texts[item][1:]
        

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )
        input=encoding['input_ids'].flatten()
        

        return {
            'text': text,
            'input_ids': input,
            
            'attention_mask': encoding['attention_mask'].flatten()
          
        }

def create_test_data_loader(texts,  tokenizer, max_len, batch_size):
    
    ds = createTestDataset(
        texts=texts.to_numpy(),
        
        
        tokenizer=tokenizer,
        max_len=max_len
  )

    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
      )

In [None]:
test_loader= create_test_data_loader(test_inputs,tokenizer,max_len,batch_size)

## Predict function

This function uses as baseline the training function. It just get the predictions of the model and stores them in a list.

In [None]:
def predict(model, 
    test_data_loader):
  with torch.no_grad():
        
        # These are the metrics that will indicate us how well it's doing the model...
        predictions=[]
        
        for batch in test_data_loader:

            b= len(batch["input_ids"])
            
            # Get the information from the tokenization... (using GPU)
            input_ids = batch["input_ids"].to(device)
            
            attention = batch["attention_mask"].to(device)

            # get the model's predictions...
            outputs = model(
                input_ids,
                attention
                
                
            )
            
            
            for output in outputs:
              zero=output[0].item()
              one=output[1].item()
              if zero > one:
                predictions.append(0)
              else:
                predictions.append(1)


  return predictions

        



In [None]:
predictions= predict(model,test_loader)

In [None]:
len(predictions)

778

### Saving the predictions in the dataframe

In [None]:
test['humor']=predictions

In [None]:
test

Unnamed: 0,index,tweet,humor
0,52830,-Mamá en la escuela me dicen gorda -Pobresilla...,0
1,78883,"No te sientas diferente, da igual si eres negr...",1
2,78926,Si esta asi.. SUPER SI.. y que se pongan celos...,0
3,61844,—Bebé ¿Me veo gorda con este vestido?\n—¡No mi...,0
4,78830,Las mujeres solo desean 2 cosas en la vida: co...,0
...,...,...,...
773,9496,Decir que una mujer está soltera es de machist...,0
774,14026,¿cómo un aliado se atreve a chamuyar a una ant...,1
775,12393,"MENTION No hicieron nada por las mujeres, son ...",1
776,18723,Cuando llegará ese día en que las chicas organ...,0


In [None]:
test.to_csv("/content/drive/MyDrive/transformer/test.csv")

Once we have finished tuning the transformer for the binary classification we are going to show how tuning it for the other tasks will be.

We will just show how the dataset class and the model class will change.

## Using BETO for regression task

In [4]:
inputs = data["tweet"]
labels = data["mean_prejudice"]



train_inputs, test_inputs, train_labels, test_labels = train_test_split(
    inputs, 
    labels, 
    test_size=0.2
)


class createDataset(Dataset):

    def __init__(self, texts, targets, tokenizer, max_len):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len
  
    def __len__(self):
        return len(self.texts)
  
    def __getitem__(self, item):
        
        text = self.texts[item]
        
        target = self.targets[item]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )
        input=encoding['input_ids'].flatten()
        

        return {
            'text': text,
            'input_ids': input,
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.float)
        }

def create_data_loader(texts, labels, tokenizer, max_len, batch_size):
    
    ds = createDataset(
        texts=texts.to_numpy(),
        targets=labels.to_numpy(),
        
        tokenizer=tokenizer,
        max_len=max_len
  )

    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
      )
    

class Model(nn.Module):
    def __init__(self, latent_dims,max_len,nhid):
        super(Model, self).__init__()

        
        self.roberta = BertModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
        self.linear = nn.Linear(in_features=768, out_features=192)
        self.dropout=nn.Dropout(0.2)
        self.r = nn.ReLU()
        self.t=nn.Tanh()
        self.l = nn.Linear(in_features=192,out_features=1)
        
        

        self.latent_dims=latent_dims
        self.nhid=nhid


    def forward(self, input_id,attention):

      secuence_output = self.roberta(
            input_ids=input_id,
            attention_mask=attention
        )
      
      o = secuence_output.pooler_output

      

      o=self.linear(o)
      o=self.dropout(o)
      o=self.r(o)
      o=self.l(o)
      
      

  
      
      return o.squeeze()

As we can see the only aspects we have changed is the target variable, that now is "mean_prejudice", and the type of this variable, instead of long type we use float.

## Parameters

In [6]:
batch_size = 64
learning_rate = 0.0001
criterion = nn.MSELoss().to(device)
criterion.requires_grad=True
epochs = 8
latentdims=2
nhid=128
max_len=60

We can see that all the parameters remain the same except the loss function that now will be MSELoss in order to compute de mean square error because we don´t have classes we have numerical values to predict.

## Creating the model

In [7]:
model= Model(latentdims,max_len,nhid)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

model.to(device)

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dens

Model(
  (roberta): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31002, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=T

## Creating the test and train new data loader

In [8]:
train_data_loader = create_data_loader(train_inputs, train_labels,tokenizer, max_len, batch_size)

test_data_loader = create_data_loader(test_inputs, test_labels, tokenizer, max_len, batch_size)

## Training function

The only difference between this function and the function used for binary classification is the computation of the loss. Now we can´t compute accuracy nor f1 score between the outputs of our model and the targets.

In [9]:
def train_an_epoch(
    model, 
    train_data_loader,
    dev_data_loader,
    criterion, 
    optimizer
):

    

    # These are the metrics that will indicate us how well it's doing the model...
    running_loss = 0
    steps = 0;
    mse_training=[]
    for batch in train_data_loader:
        
        b=len(batch["input_ids"])
        # Clean gradients...
        optimizer.zero_grad()
    
        # Get the information from the tokenization... (using GPU)
        input_ids = batch["input_ids"].to(device)
        targets = batch["targets"].to(device)
        attention = batch["attention_mask"].to(device)

        # get the model's predictions...
        outputs = model(
            input_ids,
            attention
            
        )
        
        

        # Apply the loss function and the perform backward propagation...
       
        loss = criterion(outputs, targets)
       
        loss.backward() 
        optimizer.step()
        
        # update the metrics...
        

        running_loss+=loss.item()

        mse_training.append(loss.item())




        

        steps+=1
            
    # get the mean of the metrics...
    
    train_mse= sum(mse_training)/len(mse_training)
    loss = running_loss/steps;

    
    # evaluate the model with the validation data set 
    # ("turn off" gradients...)
    with torch.no_grad():
        
        # These are the metrics that will indicate us how well it's doing the model...
        mse_test=[]
        
        for batch in dev_data_loader:

            b= len(batch["input_ids"])
            
            # Get the information from the tokenization... (using GPU)
            input_ids = batch["input_ids"].to(device)
            targets = batch["targets"].to(device)
            attention = batch["attention_mask"].to(device)

            # get the model's predictions...
            outputs = model(
                input_ids,
                attention
                
                
            )
            
            loss = criterion(outputs, targets)

            
          
            mse_test.append(loss.item())

        

        test_mse=sum(mse_test)/len(mse_test)


       
    

    return loss,train_mse,test_mse

def train_the_model(epochs):
    
    for e in range(epochs):
     
        
        loss , mse_train, mse_test= train_an_epoch(
            model, 
            train_data_loader,
            test_data_loader,
            criterion, 
            optimizer
        )
        
        print('--------EPOCH SUMMARY---------')
        print('Epoch ', e+1, ' training loss: ', loss)
        print('Epoch ', e+1, ' training mse: ', mse_train)
        print('Epoch ', e+1, ' test mse: ', mse_test)

### Results of the training

In [10]:
train_the_model(epochs)

--------EPOCH SUMMARY---------
Epoch  1  training loss:  tensor(0.9660, device='cuda:0')
Epoch  1  training mse:  1.4532348715207155
Epoch  1  test mse:  0.6873675518565707
--------EPOCH SUMMARY---------
Epoch  2  training loss:  tensor(0.9380, device='cuda:0')
Epoch  2  training mse:  0.6738380351487328
Epoch  2  test mse:  0.5976576010386149
--------EPOCH SUMMARY---------
Epoch  3  training loss:  tensor(0.8250, device='cuda:0')
Epoch  3  training mse:  0.511041340582511
Epoch  3  test mse:  0.5692977342340682
--------EPOCH SUMMARY---------
Epoch  4  training loss:  tensor(0.8628, device='cuda:0')
Epoch  4  training mse:  0.30672842921579585
Epoch  4  test mse:  0.7017213371064928
--------EPOCH SUMMARY---------
Epoch  5  training loss:  tensor(0.7886, device='cuda:0')
Epoch  5  training mse:  0.28821799597319436
Epoch  5  test mse:  0.5213635166486105
--------EPOCH SUMMARY---------
Epoch  6  training loss:  tensor(0.7389, device='cuda:0')
Epoch  6  training mse:  0.3071252151447184
E

## Using BETO for multilabel classification task

In [11]:
multi = [[float(labels["prejudice_woman"]),float(labels["prejudice_lgbtiq"]),float(labels["prejudice_inmigrant_race"]),float(labels["gordofobia"])] for _,labels in data[["prejudice_woman","prejudice_lgbtiq","prejudice_inmigrant_race", "gordofobia"]].iterrows()]
train_inputs, test_inputs, train_labels, test_labels = train_test_split(
    inputs, 
    multi, 
    test_size=0.2
)

class createDataset(Dataset):

    def __init__(self, texts, targets, tokenizer, max_len):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len
  
    def __len__(self):
        return len(self.texts)
  
    def __getitem__(self, item):
        
        text = self.texts[item]
        target = self.targets[item]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )
        input=encoding['input_ids'].flatten()
        

        return {
            'text': text,
            'input_ids': input,
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.float)
        }

def create_data_loader(texts, labels, tokenizer, max_len, batch_size):
    
    ds = createDataset(
        texts=texts.to_numpy(),
        targets=labels,
        
        tokenizer=tokenizer,
        max_len=max_len
  )

    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
      )
    

class Model(nn.Module):
    def __init__(self, latent_dims,max_len,nhid):
        super(Model, self).__init__()

        
        self.roberta = BertModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
        self.tanh=nn.Tanh()
        self.linear = nn.Linear(in_features=768, out_features=192)
        self.dropout=nn.Dropout(0.2)
        self.r = nn.ReLU()
        self.l = nn.Linear(in_features=192,out_features=latent_dims)
        self.s = nn.Sigmoid()
        

        self.latent_dims=latent_dims
        self.nhid=nhid


    def forward(self, input_id,attention):

      secuence_output = self.roberta(
            input_ids=input_id,
            attention_mask=attention
        )
      
      o = secuence_output.pooler_output
      o=self.linear(o)
      o=self.dropout(o)
      o=self.r(o)
      o=self.l(o)
      o=self.dropout(o)
      o = self.s(o)

  
      
      return o

There are some changes between this classes and the ones seen before.

Firstly we have to get a list with the labels for each tweet, including the variables of 'prejudice_woman', 'prejudice_lgtibq' etc. This list is collected in the multi variable.

Then we can see that the type of the targets must be float too instead of long, as we did in the regression task.

On the other hand, in the model class we have to change the last layer that will be a Sigmoid instead of Softmax and the latent_dims (labels that we have) will be 4 instead of two

## Parameters

In [12]:
batch_size = 64
learning_rate = 0.00001
criterion = nn.BCELoss().to(device)
criterion.requires_grad=True
epochs = 8
latentdims=4
nhid=128
max_len=60

Now our loss function will be BCELoss and our value for latentdims will be 4 instead of 2, due to the fact that we have 4 labels to predict now.

### Creating the model

In [13]:
model= Model(latentdims,max_len,nhid)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

model.to(device)

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dens

Model(
  (roberta): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31002, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=T

## Creating the train and test new data loader

In [14]:
train_data_loader = create_data_loader(train_inputs, train_labels,tokenizer, max_len, batch_size)

test_data_loader = create_data_loader(test_inputs, test_labels, tokenizer, max_len, batch_size)

## Training function

The only change is the way we compute the f1 score. For this task we will not compute the accuracy, just the loss and the f1 score.

In [15]:
def train_an_epoch(
    model, 
    train_data_loader,
    dev_data_loader,
    criterion, 
    optimizer
):

    

    # These are the metrics that will indicate us how well it's doing the model...
    running_loss = 0
    
    f1_training=[]
    steps = 0;
    
    for batch in train_data_loader:
        
        b=len(batch["input_ids"])
        # Clean gradients...
        optimizer.zero_grad()
    
        # Get the information from the tokenization... (using GPU)
        input_ids = batch["input_ids"].to(device)
        targets = batch["targets"].to(device)
        attention = batch["attention_mask"].to(device)

        # get the model's predictions...
        outputs = model(
            input_ids,
            attention
            
        )

        # Apply the loss function and the perform backward propagation...
       
        loss = criterion(outputs, targets)
       
        loss.backward() 
        optimizer.step()
        
        # update the metrics...

        pred = []
        real=[]
        for output in outputs:
              one=output[0].item()
              two=output[1].item()
              three=output[2].item()
              four=output[3].item()
              if one >=0.50:pred.append(1)
              else: pred.append(0)
              if two >=0.50:pred.append(1)
              else: pred.append(0)
              if three >=0.50:pred.append(1)
              else: pred.append(0)
              if four >=0.50:pred.append(1)
              else: pred.append(0)

        for t in targets:
            for elem in t:

                real.append(elem.item())

        
        bf1= f1(real,pred,average='macro')
        running_loss+=loss.item()
        
        f1_training.append(bf1)

        steps+=1
            
    # get the mean of the metrics...
    
    loss = running_loss/steps;
    
    t_f1=sum(f1_training)/len(f1_training)
    
    
    
    # evaluate the model with the validation data set 
    # ("turn off" gradients...)
    with torch.no_grad():
        
        # These are the metrics that will indicate us how well it's doing the model...
        test_acc=[];
        steps_val=0;
        f1_test=[]
        
        for batch in dev_data_loader:

            b= len(batch["input_ids"])
            
            # Get the information from the tokenization... (using GPU)
            input_ids = batch["input_ids"].to(device)
            targets = batch["targets"].to(device)
            attention = batch["attention_mask"].to(device)

            # get the model's predictions...
            outputs = model(
                input_ids,
                attention
                
                
            )
            
            
            pred = []
            real=[]
            for output in outputs:
              one=output[0].item()
              two=output[1].item()
              three=output[2].item()
              four=output[3].item()
              if one >=0.50:pred.append(1)
              else: pred.append(0)
              if two >=0.50:pred.append(1)
              else: pred.append(0)
              if three >=0.50:pred.append(1)
              else: pred.append(0)
              if four >=0.50:pred.append(1)
              else: pred.append(0)

            for t in targets:
              for elem in t:

                real.append(elem.item())

            
            bf1= f1(real,pred,average='macro')
            
            f1_test.append(bf1)


        
        v_f1= sum(f1_test)/len(f1_test)
    

    return loss,t_f1,v_f1

def train_the_model(epochs):
    
    for e in range(epochs):
      
        loss,t_f1,v_f1 = train_an_epoch(
            model, 
            train_data_loader,
            test_data_loader,
            criterion, 
            optimizer
        )
        
        print('--------EPOCH SUMMARY---------')
        print('Epoch ', e+1, ' training loss: ', loss)
        
        print('Epoch ', e+1, ' training f1: ', t_f1*100, '%')
        print('Epoch ', e+1, ' val f1: ', v_f1*100, '%')

## Results of the training

In [16]:
train_the_model(epochs)

--------EPOCH SUMMARY---------
Epoch  1  training loss:  0.5929780321962693
Epoch  1  training f1:  58.179443155442534 %
Epoch  1  val f1:  58.484370460257736 %
--------EPOCH SUMMARY---------
Epoch  2  training loss:  0.5002417546861312
Epoch  2  training f1:  62.948629578839046 %
Epoch  2  val f1:  67.08940802907843 %
--------EPOCH SUMMARY---------
Epoch  3  training loss:  0.3811105603680891
Epoch  3  training f1:  75.61173546061566 %
Epoch  3  val f1:  76.412124915779 %
--------EPOCH SUMMARY---------
Epoch  4  training loss:  0.2939667969065554
Epoch  4  training f1:  80.64925482196446 %
Epoch  4  val f1:  78.31196971983067 %
--------EPOCH SUMMARY---------
Epoch  5  training loss:  0.24629126883604946
Epoch  5  training f1:  82.25250938456816 %
Epoch  5  val f1:  78.53795581443893 %
--------EPOCH SUMMARY---------
Epoch  6  training loss:  0.22323094702818813
Epoch  6  training f1:  82.00563690270758 %
Epoch  6  val f1:  76.61019370998278 %
--------EPOCH SUMMARY---------
Epoch  7  tr