# Model-2 Finetune

This notebook demonstrates fine-tuning a RoBERTa model from HuggingFace 🤗 on the primary dataset, adding only dropout and a linear classification head on top of the pre-trained encoder. The fine-tuned weights are saved to a directory so that they can be reused in future work.

## Part 1: Data Preparation
- Loading the data to fine-tune on using Huggingface's load_dataset function
- Split the dataset into train, validation and test
- Clean the dataset for any missing values
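
One way the cleaning step could look is sketched below. It assumes that missing values appear as `None` or as empty strings in the `Content` and `Label` columns, which is an assumption about the CSV; the helper name `is_complete` is hypothetical.

```python
# Hypothetical helper: keep only rows whose 'Content' and 'Label' are present.
def is_complete(example):
    content = example.get("Content")
    label = example.get("Label")
    # Treat None and whitespace-only text as missing (an assumption about the CSV).
    has_text = isinstance(content, str) and content.strip() != ""
    return has_text and label is not None

# Small made-up rows to show which ones survive the filter.
rows = [
    {"Content": "some text", "Label": 1},
    {"Content": None, "Label": 0},
    {"Content": "   ", "Label": 1},
    {"Content": "more text", "Label": None},
]
clean_rows = [row for row in rows if is_complete(row)]
print(len(clean_rows))  # 1
```

With a HuggingFace dataset, a predicate of this shape can be applied with `data = data.filter(is_complete)`.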

## Part 2: Load tokenizer
- Load the tokenizer for the chosen model
- Tokenize the text in 'Content' of the dataset
- Use the data collator from Huggingface to pad each batch dynamically to its longest sequence
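
The padding performed by the data collator can be sketched in plain Python as follows. This is an illustration of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=1` assumes RoBERTa's `<pad>` id (check `tokenizer.pad_token_id`).

```python
# Pad every sequence in a batch to the length of the longest one in that batch.
# pad_id=1 is an assumption (RoBERTa's <pad> token id).
def pad_batch(sequences, pad_id=1):
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]])
print(batch["input_ids"][0])       # [0, 713, 2, 1, 1]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Because the padded length depends on each batch rather than on a global maximum, short batches cost less compute than they would with fixed-length padding.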

## Part 3: Model class definition

- Definition of the model to fine-tune: a pre-trained RoBERTa base checkpoint (cardiffnlp/twitter-roberta-base-hate-latest) followed by dropout of 0.1 and a linear classifier.
- Move the model to the CUDA device

## Part 4: Training
- Creation of dataloaders for train, validation and test
- Instantiate the AdamW optimiser and a cosine learning-rate scheduler for 3 epochs
- Setting a learning rate of 2e-5
- Setting of the metrics (here, F1) required to evaluate model performance
- Training the model on the train dataset
- Evaluation using test dataset
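
For reference, the F1 score used as the evaluation metric is the harmonic mean of precision and recall. The sketch below computes it for the binary case in plain Python; `binary_f1` is a hypothetical helper and the predictions are made up. It assumes label 1 is the positive class.

```python
# Binary F1 from predicted and reference labels (positive class assumed to be 1).
def binary_f1(predictions, references, positive=1):
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds = [1, 0, 1, 1, 0]
refs = [1, 0, 0, 1, 1]
score = binary_f1(preds, refs)
print(round(score, 3))  # 0.667
```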







Source code : https://jovian.com/rajbsangani/emotion-tuned-sarcasm


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Preparation

In [None]:
pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-2.14.0-py3-none-any.whl (492 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)

In [None]:
#importing necessary modules
from datasets import load_dataset,Dataset,DatasetDict
from transformers import DataCollatorWithPadding,AutoModelForSequenceClassification, Trainer, TrainingArguments,AutoTokenizer,AutoModel,AutoConfig
from transformers.modeling_outputs import TokenClassifierOutput
import torch
import torch.nn as nn
import pandas as pd

In [None]:
#load the CSV data using Huggingface's load_dataset
data=load_dataset("csv",data_files="/content/drive/My Drive/YD_aug_data_balanced.csv")
data

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-4693b573cadcf498/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-4693b573cadcf498/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label'],
        num_rows: 726119
    })
})

In [None]:
#split dataset into train, validation and test using train test split
data.set_format('pandas')
data=data['train'][:]
data=Dataset.from_pandas(data)
train_testvalid = data.train_test_split(test_size=0.2,seed=15)


In [None]:
train_testvalid

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label'],
        num_rows: 580895
    })
    test: Dataset({
        features: ['Content', 'Label'],
        num_rows: 145224
    })
})

In [None]:
test_valid = train_testvalid['test'].train_test_split(test_size=0.5,seed=15)


In [None]:
data = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})

data

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label'],
        num_rows: 580895
    })
    test: Dataset({
        features: ['Content', 'Label'],
        num_rows: 72612
    })
    valid: Dataset({
        features: ['Content', 'Label'],
        num_rows: 72612
    })
})
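
The row counts above correspond to an 80/10/10 split. The arithmetic can be checked as below, assuming that a fractional `test_size` is rounded up to a whole number of rows; that rounding rule is an assumption, but it reproduces the reported counts.

```python
import math

total = 726119  # rows in the loaded CSV

# First split: 20% held out, rounded up (assumed rounding rule).
holdout = math.ceil(total * 0.2)
train = total - holdout

# Second split: the held-out part is halved into validation and test.
test = math.ceil(holdout * 0.5)
valid = holdout - test

print(train, valid, test)  # 580895 72612 72612
```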

# Load Tokenizer

In [None]:
#Load the tokenizer from the chosen model
checkpoint = "cardiffnlp/twitter-roberta-base-hate-latest"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length=100 #the tokenizer attribute is model_max_length

Downloading (…)okenizer_config.json:   0%|          | 0.00/351 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

In [None]:
def tokenize(batch):
  return tokenizer(batch["Content"], truncation=True,max_length=100)

tokenized_dataset = data.map(tokenize, batched=True) #map the tokenizer to the text in the Content feature
tokenized_dataset

Map:   0%|          | 0/580895 [00:00<?, ? examples/s]

Map:   0%|          | 0/72612 [00:00<?, ? examples/s]

Map:   0%|          | 0/72612 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label', 'input_ids', 'attention_mask'],
        num_rows: 580895
    })
    test: Dataset({
        features: ['Content', 'Label', 'input_ids', 'attention_mask'],
        num_rows: 72612
    })
    valid: Dataset({
        features: ['Content', 'Label', 'input_ids', 'attention_mask'],
        num_rows: 72612
    })
})

In [None]:
tokenized_dataset.set_format("torch",columns=["input_ids", "attention_mask", "Label"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) #define datacollater using tokenizer as its parameter

In [None]:
#Define model class
from transformers.modeling_outputs import SequenceClassifierOutput

class CustomModel(nn.Module):
  def __init__(self,checkpoint,num_labels):
    super().__init__()
    self.num_labels = num_labels

    #Load the pre-trained body (encoder only, without the checkpoint's classification head)
    config = AutoConfig.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)
    self.model = AutoModel.from_pretrained(checkpoint, config=config)
    self.dropout = nn.Dropout(0.1)
    #Newly initialised classification head (hidden size is 768 for this checkpoint)
    self.classifier = nn.Linear(config.hidden_size, num_labels)

  def forward(self, input_ids=None, attention_mask=None, Label=None):
    #Extract outputs from the body
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

    #Apply dropout to the last hidden state (outputs[0])
    sequence_output = self.dropout(outputs[0])

    #Classify from the embedding of the first token (<s>)
    logits = self.classifier(sequence_output[:,0,:])

    #Compute the cross-entropy loss when labels are provided
    loss = None
    if Label is not None:
      loss_fct = nn.CrossEntropyLoss()
      loss = loss_fct(logits.view(-1, self.num_labels), Label.view(-1))

    return SequenceClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states,attentions=outputs.attentions)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model=CustomModel(checkpoint=checkpoint,num_labels=2).to(device) #move the model to the GPU if one is available, otherwise keep it on the CPU

Downloading (…)lve/main/config.json:   0%|          | 0.00/888 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-hate-latest were not used when initializing RobertaModel: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-hate-latest and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predi

# Training

In [None]:
from torch.utils.data import DataLoader

#load the data into dataloader to load the data into model during training loop
train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, batch_size=16, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_dataset["valid"], batch_size=16, collate_fn=data_collator
)

In [None]:
from transformers import get_scheduler
# define the AdamW optimizer (PyTorch implementation) with a learning rate of 2e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

108918
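
The step count printed above and the shape of the cosine schedule can be reproduced in plain Python. This is a sketch of the standard half-cosine decay with zero warmup steps, as configured in the cell above; `cosine_lr` is a hypothetical helper and not the library function itself.

```python
import math

base_lr = 2e-5
num_epochs = 3
train_rows = 580895
batch_size = 16

# The last, partial batch of each epoch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(num_training_steps)  # 108918

def cosine_lr(step):
    # Fraction of training completed, from 0.0 to 1.0.
    progress = step / num_training_steps
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, cosine_lr(step))
```

The learning rate starts at 2e-5, is roughly 1e-5 at the halfway point, and reaches 0 at the final step.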


In [None]:
from datasets import load_metric
#load the F1 metric; the second positional argument of load_metric is a config name, so "accuracy" there does not load a second metric
metric = load_metric("f1")


Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

In [None]:
from tqdm.auto import tqdm

progress_bar_train = tqdm(range(num_training_steps))
progress_bar_eval = tqdm(range(num_epochs * len(eval_dataloader)))


for epoch in range(num_epochs):
  #training pass over the train split
  model.train()
  for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar_train.update(1)

  #evaluation pass over the validation split
  model.eval()
  for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
      outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["Label"])
    progress_bar_eval.update(1)

  #F1 on the validation split for this epoch
  print(metric.compute())







  0%|          | 0/108918 [00:00<?, ?it/s]

  0%|          | 0/13617 [00:00<?, ?it/s]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'f1': 0.8971896122376379}
{'f1': 0.9213982232522209}
{'f1': 0.9182277061132922}


In [None]:

model.eval()

test_dataloader = DataLoader(
    tokenized_dataset["test"], batch_size=16, collate_fn=data_collator
)
#the test set is iterated once, so the bar needs len(test_dataloader) steps
progress_bar_eval = tqdm(range(len(test_dataloader)))


for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["Label"])
    progress_bar_eval.update(1)


metric.compute()

  0%|          | 0/4539 [00:00<?, ?it/s]

{'f1': 0.9201322467574113}

In [None]:
import os

In [None]:
save_directory ="/content/drive/My Drive"

In [None]:
torch.save(model.state_dict(),os.path.join(save_directory, "model_weights.pt")) #save the model state dictionary into drive