In [1]:

import numpy as np
import pandas as pd
import torch
from sklearn import metrics
from torch import cuda
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer

# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if cuda.is_available() else 'cpu'

## Intro

In this script we fine-tune a text classification model (multilabel or multiclass). Given a piece of text (a sentence or a document), the model must assign it to one or more categories (multilabel) or to exactly one category (multiclass).

## Data

The base dataset is composed of four columns:

* idTask : Identity Code
* task content 1 : Title of the article
* idTag : Identity Code
* tag : one of the different labels/categories


* Tags:

     * sociedad
     * deportes 
     * politica 
     * economia
     * clickbait
     * cultura
     * medio_ambiente
     * ciencia_tecnologia
     * educacion
     * opinion



We will use just two columns, "task content 1" and "tag". The "tag" column has to be converted to a one-hot vector.

Let's say the label/class of an element is "deportes". The model needs numeric data to interpret the information provided, so instead of the string we use a vector with a 1 in the position of that class and 0 everywhere else: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
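As a minimal sketch of the encoding described above, the snippet below builds the one-hot vector by hand. It assumes the ten tags listed earlier are the only classes and that they are ordered alphabetically, which is how the dummy columns come out later in the notebook; the helper name `one_hot` is hypothetical.

```python
# Tags from the list above, sorted alphabetically (assumed column order).
TAGS = sorted([
    "sociedad", "deportes", "politica", "economia", "clickbait",
    "cultura", "medio_ambiente", "ciencia_tecnologia", "educacion", "opinion",
])

def one_hot(tag, tags=TAGS):
    """Return a list with 1 at the index of `tag` and 0 elsewhere."""
    if tag not in tags:
        raise ValueError(f"unknown tag: {tag!r}")
    return [1 if t == tag else 0 for t in tags]

print(TAGS.index("deportes"))  # 3
print(one_hot("deportes"))     # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

If the data contains extra or missing tags, the vector length and the position of each class change accordingly.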

## Load Data

The main objectives of this function are:

* Import the file into a dataframe and give the columns the headers described above.
* Take the values of all the categories and convert each row's label into a one-hot list.
* Append the list as a new column and remove the other columns.
* Return a random sample of at most nrows rows.

In [2]:
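def load_data(file_name, nrows):

    data_raw = pd.read_csv(file_name, sep=",")

    # Keep the title and tag columns; .copy() avoids SettingWithCopyWarning.
    data = data_raw.iloc[:, [1, 3]].copy()

    data.columns = ['text', 'tag']

    # Rows without a tag get a placeholder label (this adds an extra class).
    data['tag'] = data['tag'].fillna('Random_Tag')

    # Drop rows whose text is missing.
    data.dropna(inplace=True)

    # One-hot encode the tag; each row becomes a list of 0/1 integers.
    dummies = pd.get_dummies(data['tag'], dtype=int)
    data['one_hot'] = pd.Series(dummies.values.tolist(), index=data.index)

    # Never ask for more rows than the file contains.
    nrows = min(nrows, data.shape[0])

    data = data.sample(n=nrows)

    return data.loc[:, ['text', 'one_hot']]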


In [3]:
data = load_data("data.csv",1000)




## Hyperparameters

We select some hyperparameters for training the model.

In [4]:
n_classes = len(data.iloc[1,1])

model_name = "dccuchile/bert-base-spanish-wwm-cased"

MAX_LEN = 200

TRAIN_BATCH_SIZE = 8

VALID_BATCH_SIZE = 4

EPOCHS = 1

LEARNING_RATE = 1e-05

TRAIN_SIZE = 0.8

## Tokenizer and Model Selection

We load the tokenizer that matches the chosen pretrained model with the function from_pretrained(). We define the tokenizer here because it is needed to create the PyTorch Dataset; the model itself is defined later in the script.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


## Split Data

Split the data into training and validation sets. The arguments are the processed dataframe and the size of the training set, given either as a fraction in (0, 1] or as an absolute number of rows.

In [6]:
def split_data(pandas_df, train_size):

    # A value above 1 is read as a number of rows and converted to a fraction.
    if train_size > 1:

        train_size = min(train_size / pandas_df.shape[0], 1.0)

    elif train_size <= 0:

        raise ValueError('train_size must be positive')

    train_set = pandas_df.sample(frac=train_size, random_state=42)

    # Everything that was not sampled for training goes to the test set.
    test_set = pandas_df.drop(train_set.index).reset_index(drop=True)

    train_set = train_set.reset_index(drop=True)

    return train_set, test_set


In [7]:
train_dataset,test_dataset = split_data(data,TRAIN_SIZE)

## Dataset/DataLoader

We need a dataset that fits our needs. Deep learning models cannot process raw text, so we have to pre-process the text before sending it to the neural network. We also define a DataLoader to feed the data in batches for training and validation.

The PyTorch Dataset and DataLoader classes let us define and control the data pre-processing and how the data is passed to the neural network.

## Dataset

* We define a Python class called CustomDataset that accepts a list/Series/array of texts, their labels, a tokenizer and a maximum length.

* We use the BERT tokenizer to encode our text data.

* The tokenizer's encode_plus method performs the tokenization and generates the necessary outputs, namely: input_ids, attention_mask, token_type_ids
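To illustrate what padding to a fixed length and the attention mask mean, here is a rough sketch in plain Python. It is not the tokenizer's implementation: the token ids and `pad_id=0` are made-up values, and a real tokenizer also keeps the closing special token when it truncates. For a single sentence, token_type_ids is simply a vector of zeros of the same length.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad `token_ids` to `max_len` and build the mask."""
    ids = token_ids[:max_len]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# Hypothetical ids for a four-token title, padded to length 6.
ids, mask = pad_and_mask([4, 1034, 2977, 5], max_len=6)
print(ids)   # [4, 1034, 2977, 5, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every example ends up with the same length, the DataLoader can stack them into one tensor per batch.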

In [8]:
class CustomDataset(Dataset):

    def __init__(self, titles, targets, tokenizer, max_len):

        self.titles = titles
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):

        return len(self.titles)

    def __getitem__(self, item):

        title = str(self.titles[item])

        target = self.targets[item]

        # Pad or truncate every title to exactly max_len tokens.
        encoding = self.tokenizer.encode_plus(
            title,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'review_text': title,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long),
            'token_type_ids': encoding['token_type_ids'].flatten()
        }

In [9]:
training_set = CustomDataset(train_dataset['text'],train_dataset['one_hot'], tokenizer, MAX_LEN)

testing_set = CustomDataset(test_dataset['text'],test_dataset['one_hot'], tokenizer, MAX_LEN)


## DataLoader

* The DataLoader creates the training and validation loaders that feed data to the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and passed to the neural network has to be controlled.

* This control is achieved with parameters such as batch_size (in the DataLoader) and max_len (in the Dataset).

* The training and validation dataloaders are used in the training and validation parts of the flow respectively.
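The batching idea can be sketched without PyTorch. The generator below is only an illustration of what batch_size and shuffle control, not how DataLoader is implemented; the function name `iter_batches` and the `seed` argument are hypothetical.

```python
import random

def iter_batches(n_items, batch_size, shuffle=True, seed=0):
    """Yield lists of dataset indices, one list per batch."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]

# 20 items in batches of 8: the last batch holds the 4 leftovers.
batches = list(iter_batches(n_items=20, batch_size=8))
print([len(b) for b in batches])  # [8, 8, 4]
```

Each index is visited exactly once per pass, which is what one epoch over the training set means.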

In [10]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

## Neural Network Model

* The network is defined in a class called BERTClass.

* It is composed of a BERT model, followed by a Dropout layer (to reduce overfitting) and a Linear layer.

* The pooled output of BERT (output_1) is passed to the dropout layer and then to the linear layer.

* The number of output dimensions equals the number of classes/categories.

* The output of the final layer is used to calculate the loss and to determine the accuracy of the model's predictions.

* We will create an instance of the network called model. This instance is used for training and then saved for future inference.

* The class takes the parameters n_classes and model_name.
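The classification head described in the list above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the PyTorch layers: the pooled vector has 4 dimensions instead of 768, there are 3 classes, and the weights are made-up numbers.

```python
import random

def dropout(vector, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(vector)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vector]

def linear(vector, weights, bias):
    """`weights` has one row per class; returns one logit per class."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

pooled = [0.5, -1.0, 2.0, 0.0]          # stand-in for BERT's pooled output
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.5, 0.5, 0.5, 0.5]]
bias = [0.0, 0.1, -0.1]

# In evaluation mode dropout is the identity, so the logits are deterministic.
logits = linear(dropout(pooled, 0.3, training=False, rng=random.Random(0)),
                weights, bias)
print(len(logits))  # 3, one logit per class
```

During training, dropout randomly zeroes some inputs of the linear layer on every forward pass, which is why the model has to be switched between train and eval modes.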

In [11]:
class BERTClass(torch.nn.Module):

    def __init__(self, n_classes, model_name):

        # Initialise nn.Module before assigning any attributes.
        super().__init__()

        self.n_classes = n_classes
        self.model_name = model_name

        self.l1 = AutoModel.from_pretrained(model_name)
        self.l2 = torch.nn.Dropout(0.3)
        # hidden_size is 768 for BERT-base checkpoints.
        self.l3 = torch.nn.Linear(self.l1.config.hidden_size, n_classes)

    def forward(self, ids, mask, token_type_ids):

        # pooler_output: one pooled vector per input text.
        output_1 = self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids)['pooler_output']

        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output


## Loss Function

* The loss function depends on the task. For multilabel classification we use Binary Cross Entropy, which is implemented in PyTorch as BCEWithLogitsLoss. For multiclass classification we should use CrossEntropyLoss instead.
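The difference between the two losses can be made concrete with a small sketch in plain Python. It follows the textbook formulas rather than PyTorch's numerically stable implementation, and the function names are hypothetical. Binary cross entropy treats every class as an independent yes/no decision, while cross entropy normalises the logits with a softmax so that the classes compete.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent sigmoid outputs."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

def cross_entropy(logits, target_index):
    """Negative log of the softmax probability of the true class."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target_index]

one_hot = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
zeros = [0.0] * 10  # logits of a model with no preference for any class

print(round(bce_with_logits(zeros, one_hot), 4))  # 0.6931 (= ln 2)
print(round(cross_entropy(zeros, 3), 4))          # 2.3026 (= ln 10)
```

A binary cross entropy close to ln 2 ≈ 0.693 therefore suggests a model that has not yet learned to separate the classes.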

In [12]:
def loss_fn(function_objective, outputs, targets):

    if function_objective == 'multilabel':

        # Targets are float vectors with the same shape as outputs.
        return torch.nn.BCEWithLogitsLoss()(outputs, targets)

    elif function_objective == 'multiclass':

        # CrossEntropyLoss expects class indices, so convert the one-hot rows.
        return torch.nn.CrossEntropyLoss()(outputs, targets.argmax(dim=1))

    else:

        raise ValueError("function_objective must be 'multilabel' or 'multiclass'")

In [13]:
model = BERTClass(n_classes,model_name)

# Move the model to the same device as the input batches.
model.to(device)

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bi

In [14]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)


## Model Fine-Tune

Our train function trains the model on the training set a number of times (EPOCHS). One epoch is one complete pass of the training data through the network.

* The dataloader passes data to the model based on the batch size.

* Subsequent output from the model and the actual category are compared to calculate the loss.

* Loss value is used to optimize the weights of the neurons in the network.

* After every 10 steps the loss value is printed in the console.
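The steps above can be sketched on a toy problem in plain Python. This is an illustration only, under several assumptions: the model is a single sigmoid unit instead of BERT, the four data points are made up, the whole dataset is used as one batch, and the update rule is plain gradient descent rather than the Adam optimizer used in this script.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (hypothetical): one feature, binary label.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def mean_loss(w, b):
    """Mean binary cross-entropy of the model sigmoid(w * x + b)."""
    total = 0.0
    for x, y in samples:
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(samples)

w, b, lr = 0.0, 0.0, 0.5
history = [mean_loss(w, b)]
for step in range(50):
    # Counterpart of loss.backward(): gradient of the mean loss.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in samples) / len(samples)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in samples) / len(samples)
    # Counterpart of optimizer.step(): move against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b
    history.append(mean_loss(w, b))
    if step % 10 == 0:
        print(f"step {step}, loss {history[-1]:.4f}")
```

The loss starts at ln 2 and shrinks as the weight grows, which is the same pattern we hope to see in the printed loss of the real training loop.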

In [15]:
def train(epoch,model,training_loader,device,optimizer,loss_fn):
    
    model.train()
    
    for step, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(device, dtype = torch.long)
        mask = batch['attention_mask'].to(device, dtype = torch.long)
        token_type_ids = batch['token_type_ids'].to(device, dtype = torch.long)
        targets = batch['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask, token_type_ids)

        loss = loss_fn('multilabel',outputs, targets)

        # Print the loss every 10 steps.
        if step % 10 == 0:

            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        # Clear old gradients, backpropagate, then update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [16]:
for epoch in range(EPOCHS):
    
    train(epoch,model,training_loader,device,optimizer,loss_fn)




Epoch: 0, Loss:  0.6813402771949768


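We save the fine-tuned BERT encoder and the tokenizer in the HuggingFace format. BERTClass is a plain torch.nn.Module, so it has no HuggingFace save method of its own; the weights of the whole network (including the linear layer) are saved separately as a state dict.

In [None]:
model_path = "hg_model"
# Save the fine-tuned encoder and the tokenizer in the HuggingFace format.
model.l1.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
# Save every weight of BERTClass, including the classification head
# (the file name is an arbitrary choice).
torch.save(model.state_dict(), f"{model_path}/bert_class_state.pt")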

## Validation 

During the validation stage we pass unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data.

This unseen data is the 20% of the loaded data that was separated during the split stage. During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value, and this comparison is used to calculate the accuracy of the model.

To measure our model's performance we use the following metrics:

* Accuracy Score
* F1 Micro
* F1 Macro
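As a sketch of how these three metrics are obtained from the model's sigmoid outputs, the snippet below computes them by hand on made-up values. The notebook itself relies on sklearn.metrics; the function names, the probabilities and the three-class setup here are hypothetical. Accuracy is the fraction of samples whose whole label vector is predicted exactly, micro F1 pools the counts over all labels, and macro F1 averages the per-label F1 scores.

```python
def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 predictions."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_scores(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Exact-match accuracy: the whole label vector must be right.
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    macro = sum(f1(*c) for c in counts) / n_labels
    return accuracy, micro, macro

y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
probs = [[0.9, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.8], [0.7, 0.1, 0.6]]
accuracy, micro, macro = multilabel_scores(y_true, binarize(probs))
print(accuracy, round(micro, 4), round(macro, 4))  # 0.5 0.6667 0.4889
```

Macro F1 is lower than micro F1 here because the second class is never predicted and contributes an F1 of 0 to the average.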


In [None]:
def validation(epoch,model,testing_loader,device,optimizer,loss_fn):

    model.eval()
    
    fin_targets=[]
    
    fin_outputs=[]
    
    with torch.no_grad():
    
        for _, batch in enumerate(testing_loader, 0):
    
            ids = batch['input_ids'].to(device, dtype = torch.long)
    
            mask = batch['attention_mask'].to(device, dtype = torch.long)
    
            token_type_ids = batch['token_type_ids'].to(device, dtype = torch.long)
    
            targets = batch['targets'].to(device, dtype = torch.float)
    
            outputs = model(ids, mask, token_type_ids)
    
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
    
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    
    return fin_outputs, fin_targets

In [None]:
for epoch in range(EPOCHS):

    outputs, targets = validation(epoch, model, testing_loader, device, optimizer, loss_fn)

    outputs = np.array(outputs) >= 0.5
    
    accuracy = metrics.accuracy_score(targets, outputs)
    
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    
    print(f"Accuracy Score = {accuracy}")
    
    print(f"F1 Score (Micro) = {f1_score_micro}")
    
    print(f"F1 Score (Macro) = {f1_score_macro}")