# Model-1 (RoBERTa+BiLSTM+CNN)

The aim of this notebook is to demonstrate the primary concept behind this research. In it, Model-1 is fine-tuned on the data from Mody et al. (2023). The model consists of a RoBERTa layer in conjunction with a BiLSTM, followed by a CNN.

This notebook is split into 4 parts and further showcases the metrics obtained by evaluation.

## Part 1: Data Preparation
- Load the data for fine-tuning using Hugging Face's `load_dataset` function
- Split the dataset into train, validation and test sets
- Clean the dataset for any missing values

## Part 2: Load tokenizer
- Load the tokenizer for the chosen model
- Tokenize the text in 'Content' of the dataset
- Use Hugging Face's data collator to pad each batch dynamically to the length of its longest text

## Part 3: Model class definition

- Define the model class, here named 'CustomModel', which combines the pre-trained layer, a Bidirectional LSTM and a CNN layer to generate the classification output, with a dropout of 0.1.
- Move the model to the CUDA device

## Part 4: Training
- Create dataloaders for the train and validation sets
- Instantiate the AdamW optimiser and train for 5 epochs
- Set a learning rate of 1.5e-5 with a cosine schedule
- Set up the metrics required to evaluate model performance
- Train the model on the train dataset
- Evaluate the model on the test dataset





Source code: https://jovian.com/rajbsangani/emotion-tuned-sarcasm


# Data preparation

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
#install the datasets and transformers libraries
%pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)

In [None]:
#import necessary modules
from datasets import load_dataset,Dataset,DatasetDict
from transformers import DataCollatorWithPadding,AutoModelForSequenceClassification, Trainer, TrainingArguments,AutoTokenizer,AutoModel,AutoConfig
from transformers.modeling_outputs import TokenClassifierOutput
import torch
import torch.nn as nn
import pandas as pd

In [None]:
#upload data from drive by using Huggingface's load_dataset
data=load_dataset("csv",data_files="/content/drive/My Drive/YD_aug_data_balanced.csv")
data

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-4693b573cadcf498/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-4693b573cadcf498/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label'],
        num_rows: 726119
    })
})

In [None]:
#split the data into training and validation set using train_test_split
data.set_format('pandas')
data=data['train'][:]
data=Dataset.from_pandas(data)
train_testvalid = data.train_test_split(test_size=0.2,seed=15)


In [None]:
train_testvalid

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label'],
        num_rows: 580895
    })
    test: Dataset({
        features: ['Content', 'Label'],
        num_rows: 145224
    })
})

In [None]:
#split data into test and validation
test_valid = train_testvalid['test'].train_test_split(test_size=0.5,seed=15)


In [None]:
#save dataset dictionary for each subset
data = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})

data

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label'],
        num_rows: 580895
    })
    test: Dataset({
        features: ['Content', 'Label'],
        num_rows: 72612
    })
    valid: Dataset({
        features: ['Content', 'Label'],
        num_rows: 72612
    })
})
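The row counts above follow from the two `train_test_split` calls. The sketch below reproduces the arithmetic, assuming the held-out portion is rounded up to the next integer (an assumption that is consistent with the counts printed in this notebook); `split_sizes` is a hypothetical helper and not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out portion is rounded up, which matches
    the row counts printed in this notebook.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(726119, 0.2)
# Second split: the held-out rows are halved into validation and test.
n_valid, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_valid, n_test)
```

The result is an 80/10/10 split of the 726,119 rows.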

# Load Tokeniser

In [None]:
#load the checkpoint for the chosen RoBERTa model
checkpoint = "cardiffnlp/twitter-roberta-base-hate-latest"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length=512 #the attribute is model_max_length; model_max_len is silently ignored

Downloading (…)okenizer_config.json:   0%|          | 0.00/351 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

In [None]:
#map tokenizer to the Content column of the dataset
def tokenize(batch):
  return tokenizer(batch["Content"], truncation=True,max_length=120)

tokenized_dataset = data.map(tokenize, batched=True) #tokenizes the text in the column
tokenized_dataset

Map:   0%|          | 0/580895 [00:00<?, ? examples/s]

Map:   0%|          | 0/72612 [00:00<?, ? examples/s]

Map:   0%|          | 0/72612 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Content', 'Label', 'input_ids', 'attention_mask'],
        num_rows: 580895
    })
    test: Dataset({
        features: ['Content', 'Label', 'input_ids', 'attention_mask'],
        num_rows: 72612
    })
    valid: Dataset({
        features: ['Content', 'Label', 'input_ids', 'attention_mask'],
        num_rows: 72612
    })
})

In [None]:
tokenized_dataset.set_format("torch",columns=["input_ids", "attention_mask", "Label"]) #set the tokenized dataset into tensor format for model readability
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) #instantiate datacollator with the tokenizer argument
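Because the tokenizer is called without padding, every example keeps its own length, and the collator pads each batch only up to the longest sequence in that batch. The following is a minimal pure-Python sketch of that idea, not the library's implementation; `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=1` assumes the usual RoBERTa `<pad>` id.

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to the longest sequence in this batch only.

    Assumption: pad_id=1 is the <pad> id of the RoBERTa vocabulary.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two illustrative sequences of different lengths (ids are invented).
batch = pad_batch([[0, 100, 2], [0, 100, 200, 300, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed 120 tokens keeps short batches small, which saves computation during training.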

# Model definition

In [None]:
#Define class CustomModel
class CustomModel(nn.Module):
  def __init__(self, checkpoint, num_labels):
    super(CustomModel, self).__init__()
    self.num_labels = num_labels

    # Load Model with given checkpoint and extract its body
    self.model = AutoModel.from_pretrained(checkpoint, config=AutoConfig.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True))
    self.dropout = nn.Dropout(0.1)

    # Add BiLSTM layers
    hidden_size = 768
    num_layers = 2
    self.bilstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers, bidirectional=True, batch_first=True)

    # Add CNN layer
    kernel_size = 3
    self.cnn = nn.Conv1d(hidden_size * 2, hidden_size, kernel_size)

    self.classifier = nn.Linear(hidden_size, num_labels)  # classification head mapping CNN features to label logits

  def forward(self, input_ids=None, attention_mask=None, Label=None):
    # Extract outputs from the body
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

    # Add custom layers
    sequence_output = self.dropout(outputs[0])  # outputs[0]=last hidden state

    # Apply BiLSTM
    lstm_output, _ = self.bilstm(sequence_output)

    # Apply CNN
    lstm_output = lstm_output.permute(0, 2, 1)  # Reshape for Conv1d
    cnn_output = self.cnn(lstm_output)
    cnn_output = cnn_output.permute(0, 2, 1)  # Reshape back to original

    logits = self.classifier(cnn_output[:, -1, :])  # classify from the last position of the CNN output

    loss = None
    if Label is not None:
      loss_fct = nn.CrossEntropyLoss()
      loss = loss_fct(logits.view(-1, self.num_labels), Label.view(-1))

    return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)


In [None]:
#push model to cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model=CustomModel(checkpoint=checkpoint,num_labels=2).to(device)

Downloading (…)lve/main/config.json:   0%|          | 0.00/888 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-hate-latest and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
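To make the data flow through `CustomModel` easier to follow, the sketch below traces the tensor shapes at each stage using plain arithmetic. It assumes the values from the class definition (hidden size 768, kernel size 3, no convolution padding, stride 1); `trace_shapes` is a hypothetical helper written for illustration only.

```python
def trace_shapes(batch_size, seq_len, hidden_size=768, kernel_size=3, num_labels=2):
    """Return the expected tensor shape after each stage of the forward pass."""
    shapes = {}
    # RoBERTa last hidden state: one 768-d vector per token.
    shapes["roberta"] = (batch_size, seq_len, hidden_size)
    # Bidirectional LSTM concatenates forward and backward states.
    shapes["bilstm"] = (batch_size, seq_len, 2 * hidden_size)
    # Conv1d without padding shortens the sequence by kernel_size - 1.
    conv_len = seq_len - kernel_size + 1
    shapes["cnn"] = (batch_size, conv_len, hidden_size)  # after permuting back
    # Only the last position is passed to the classifier.
    shapes["last_step"] = (batch_size, hidden_size)
    shapes["logits"] = (batch_size, num_labels)
    return shapes

shapes = trace_shapes(batch_size=16, seq_len=120)
for stage, shape in shapes.items():
    print(stage, shape)
```

One point worth keeping in mind: since the classifier reads the last position of the CNN output, that position may cover padding tokens for the shorter sequences in a batch.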


# Training

In [None]:
from torch.utils.data import DataLoader

#to load data into the model training loop
train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, batch_size=16, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_dataset["valid"], batch_size=16, collate_fn=data_collator
)

In [None]:
#define the optimizer with 5 epochs and learning rate of 1.5e-5
from transformers import get_scheduler #AdamW comes from torch.optim below

optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5)

num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

181530
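The step count printed above and the shape of the learning-rate curve can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the cosine schedule decays from the base rate to zero over half a cosine period with no warmup; `steps_per_epoch` and `cosine_lr` are hypothetical helpers, not functions from `transformers`.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)

def cosine_lr(step, total_steps, base_lr=1.5e-5, warmup_steps=0):
    """Learning rate at a given step under a half-cosine decay (assumed form)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# 580895 training rows in batches of 16, for 5 epochs.
total_steps = 5 * steps_per_epoch(580895, 16)
print(total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps))
```

Under this assumption the rate starts at 1.5e-5, reaches half of that midway through training, and approaches zero at the final step.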


In [None]:
#instantiate necessary metrics
from datasets import load_metric
metric1 = load_metric("f1")
metric2 = load_metric("accuracy")
metric3 = load_metric("precision")



Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/2.58k [00:00<?, ?B/s]

In [None]:
from tqdm.auto import tqdm
#training loop
progress_bar_train = tqdm(range(num_training_steps))
progress_bar_eval = tqdm(range(num_epochs * len(eval_dataloader)))


for epoch in range(num_epochs):
  model.train()
  for batch in train_dataloader:
      batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      loss = outputs.loss
      loss.backward()

      optimizer.step()
      lr_scheduler.step()
      optimizer.zero_grad()
      progress_bar_train.update(1)

  model.eval() #evaluate the model on the validation dataset after each epoch
  for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric1.add_batch(predictions=predictions, references=batch["Label"])
    metric2.add_batch(predictions=predictions, references=batch["Label"])
    metric3.add_batch(predictions=predictions, references=batch["Label"])

    progress_bar_eval.update(1)

  print(metric1.compute())
  print(metric2.compute())
  print(metric3.compute())








  0%|          | 0/181530 [00:00<?, ?it/s]

  0%|          | 0/22695 [00:00<?, ?it/s]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'f1': 0.9097336334387385}
{'accuracy': 0.9099735580895719}
{'precision': 0.9177811211412014}
{'f1': 0.9176832618969119}
{'accuracy': 0.9174792045391946}
{'precision': 0.9209982076382187}
{'f1': 0.9156064461407973}
{'accuracy': 0.9177821847628491}
{'precision': 0.9465700172449069}
{'f1': 0.9236597469770044}
{'accuracy': 0.9247920453919463}
{'precision': 0.9436985831809872}
{'f1': 0.9236678539004121}
{'accuracy': 0.9247369580785545}
{'precision': 0.9428823999087487}


The highest validation F1-score, approximately 0.9237, is obtained at the 5th epoch.
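For reference, the three reported metrics can be computed from predictions and labels as sketched below. This is a plain-Python illustration under the assumption that F1 and precision are computed for the positive class (label 1); `binary_metrics` is a hypothetical helper and the predictions and labels are toy values, not results from this notebook.

```python
def binary_metrics(predictions, references, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive and r == positive)
    fp = sum(1 for p, r in pairs if p == positive and r != positive)
    fn = sum(1 for p, r in pairs if p != positive and r == positive)
    correct = sum(1 for p, r in pairs if p == r)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy example: 6 predictions against 6 reference labels.
scores = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(scores)
```

Because F1 is the harmonic mean of precision and recall, a precision that is higher than the F1-score, as in the later epochs above, implies a recall that is lower than the F1-score.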

In [None]:
#evaluate the model on the held-out test split
model.eval()

test_dataloader = DataLoader(
    tokenized_dataset["test"], batch_size=16, collate_fn=data_collator
)
progress_bar_eval = tqdm(range(1 * len(test_dataloader)))


for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric1.add_batch(predictions=predictions, references=batch["Label"])
    metric2.add_batch(predictions=predictions, references=batch["Label"])
    metric3.add_batch(predictions=predictions, references=batch["Label"])
    progress_bar_eval.update(1)


print(metric1.compute())
print(metric2.compute())
print(metric3.compute())

  0%|          | 0/4539 [00:00<?, ?it/s]

{'f1': 0.9241091438816674}
{'accuracy': 0.9253842340109073}
{'precision': 0.9415172964950337}


In [None]:
import os

In [None]:
save_directory ="/content/drive/My Drive"

In [None]:
torch.save(model.state_dict(),os.path.join(save_directory, "bert+lstm+cnn2.pt")) #save the model's state dict (weights only) in the directory