# Fine-tuning a BERT model to detect sarcasm in Reddit comments

Now we are going to fine-tune a BERT model. We will use bert-tiny, a lightweight pre-trained BERT model, and add a linear classification layer on top of it. For this, we will use only the 'comment' column of the dataset.
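To make "a linear layer on top" concrete, here is a minimal pure-Python sketch of what such a classification head computes for one comment: `logits = W · h + b`, where `h` is the pooled sentence representation coming out of the encoder. The hidden size of 128 is an assumption about bert-tiny, and `linear_head`, `pooled`, `weight` and `bias` are illustrative names, not part of the transformers API.

```python
import random

HIDDEN_SIZE = 128  # assumed hidden width of bert-tiny
NUM_LABELS = 2     # 0 = not sarcastic, 1 = sarcastic


def linear_head(pooled, weight, bias):
    """Compute logits = weight @ pooled + bias for a single example."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]


rng = random.Random(0)

# Stand-in for the pooled sentence vector produced by the encoder.
pooled = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]

# Randomly initialised head: these are the only weights that do not
# come from the pre-trained checkpoint and must be learned from scratch.
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
          for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

logits = linear_head(pooled, weight, bias)
print(len(logits))  # one score per class
```

During fine-tuning, both the head and the encoder weights are updated, so the encoder output itself adapts to the sarcasm task.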

In [1]:
!pip install -U accelerate
!pip install -U transformers
!pip install -q datasets
!pip install torch evaluate



In [3]:
import os
import numpy as np
import pandas as pd

In [4]:
X = pd.read_csv('train-balanced-sarcasm.csv')

In [5]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, BertTokenizer
from datasets import Dataset
import accelerate
import torch
import re
import evaluate
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import train_test_split

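First, we drop rows with missing comments, define our usual preprocessing function (it is applied to the input text at inference time below), and split the data into training and validation sets.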

In [6]:
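# Drop rows without comment text.
X.dropna(subset=['comment'], inplace=True)
X['label'].value_counts()  # class balance; only displayed when it is the last expression in a cell

def preprocessing(s):
    # Lowercase and trim surrounding whitespace.
    s = str(s).lower().strip()
    # Remove newlines.
    s = re.sub('\n', '', s)
    # Put spaces around punctuation marks.
    s = re.sub(r"([?!,\":;\(\)])", r" \1 ", s)
    # Collapse repeated spaces.
    s = re.sub('[ ]{2,}', ' ', s).strip()
    return s

SEED = 1
# With no test_size given, scikit-learn holds out 25% of the rows.
X_train, X_valid = train_test_split(X, random_state=SEED)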
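To see what `preprocessing` does to a comment, here is a small self-contained check. The function is repeated so that the snippet runs on its own; the sample string is made up for illustration and does not come from the dataset.

```python
import re


def preprocessing(s):
    s = str(s).lower().strip()
    s = re.sub('\n', '', s)
    s = re.sub(r"([?!,\":;\(\)])", r" \1 ", s)
    s = re.sub('[ ]{2,}', ' ', s).strip()
    return s


sample = "Wow, GREAT idea!\n(Not really)"
cleaned = preprocessing(sample)
print(cleaned)  # wow , great idea ! ( not really )
```

Note that periods and apostrophes are left untouched, since they are not in the character class of the punctuation regex.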

Now the tokenizer. We will use the BertTokenizer API. Note that we first have to convert our pandas DataFrames into Hugging Face datasets. Since the average length of a comment in our dataset is around 60 characters, we pad or truncate every comment to max_length=100 tokens. At this point we use only a portion of the data, namely 100000 rows each for training and validation, which is still quite a lot.
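As a rough illustration of what padding and truncation to a fixed length mean, the sketch below brings a list of token ids to exactly 100 entries and builds the matching attention mask. `pad_or_truncate` is a hypothetical helper, the token ids are made up, and `PAD_ID = 0` is an assumption about the vocabulary. The real tokenizer is also more careful: when truncating, it keeps the closing [SEP] token, which this sketch does not.

```python
PAD_ID = 0        # assumed id of the [PAD] token
MAX_LENGTH = 100  # same value as max_length in the tokenizer call


def pad_or_truncate(input_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])   # truncation: drop the overflow
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # how many pad slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded on the right.
short_ids, short_mask = pad_or_truncate([11, 22, 33])

# A long sequence is cut down to the first 100 ids.
long_ids, long_mask = pad_or_truncate(list(range(1, 151)))

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

The attention mask is what lets the model ignore the [PAD] positions, so padding changes the tensor shape but not which tokens the model attends to.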


In [7]:
tokenizer = BertTokenizer.from_pretrained('prajjwal1/bert-tiny')

In [22]:
def tokenize_function(data):
    return tokenizer(data["comment"], padding="max_length", truncation=True, max_length=100)

X_train_ds = Dataset.from_pandas(X_train[['comment','label']])
X_valid_ds = Dataset.from_pandas(X_valid[['comment','label']])

tokenized_train = X_train_ds.map(tokenize_function, batched=True)
tokenized_valid = X_valid_ds.map(tokenize_function, batched=True)

small_train_dataset = tokenized_train.shuffle(seed=SEED).select(range(100000))
small_eval_dataset = tokenized_valid.shuffle(seed=SEED).select(range(100000))

small_train_dataset.set_format("torch")
small_eval_dataset.set_format("torch")

Map:   0%|          | 0/758079 [00:00<?, ? examples/s]

Map:   0%|          | 0/252694 [00:00<?, ? examples/s]
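The two row counts in the progress bars above are consistent with the default split of `train_test_split`. A quick arithmetic check, assuming scikit-learn's default of a 25% hold-out with the test size rounded up:

```python
import math

n_train = 758_079  # rows mapped for the training set
n_valid = 252_694  # rows mapped for the validation set
n_total = n_train + n_valid

test_size = 0.25  # assumed default when neither test_size nor train_size is passed
expected_valid = math.ceil(test_size * n_total)
expected_train = n_total - expected_valid

print(n_total, expected_train, expected_valid)
```

Only 100000 rows of each split are actually used after the `shuffle(...).select(...)` step.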

Let's take a quick look at what we got:

In [23]:
example = small_train_dataset[555]
print(example.keys())

dict_keys(['comment', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'])


In [24]:
tokenizer.decode(example['input_ids']), example['label']

("[CLS] it's anon * y * mity you fucking retard! [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]",
 tensor(1))

Now let's prepare and train the model.

In [8]:
model = AutoModelForSequenceClassification.from_pretrained('prajjwal1/bert-tiny', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
metric = evaluate.load("accuracy")

def compute_metrics(p):
  pred, labels = p
  pred = np.argmax(pred, axis=1)

  accuracy = accuracy_score(y_true=labels, y_pred=pred)
  recall = recall_score(y_true=labels, y_pred=pred)
  precision = precision_score(y_true=labels, y_pred=pred)
  f1 = f1_score(y_true=labels, y_pred=pred)

  return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    output_dir="./mymodel3",
    evaluation_strategy="steps",  # recent transformers releases rename this argument to eval_strategy
    eval_steps=50,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="tensorboard"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 50 | No log | 0.69131 | 0.51909 | 0.543721 | 0.209163 | 0.302109 |
| 100 | 0.696800 | 0.689293 | 0.54084 | 0.545796 | 0.460886 | 0.49976 |
| 150 | 0.696800 | 0.687061 | 0.54235 | 0.572516 | 0.317291 | 0.4083 |
| 200 | 0.690000 | 0.685325 | 0.55962 | 0.546845 | 0.671697 | 0.602875 |
| 250 | 0.690000 | 0.683463 | 0.5668 | 0.56134 | 0.592585 | 0.57654 |
| 300 | 0.686400 | 0.683799 | 0.55455 | 0.533497 | 0.835306 | 0.651128 |
| 350 | 0.686400 | 0.682367 | 0.56214 | 0.540439 | 0.802833 | 0.646008 |
| 400 | 0.683800 | 0.67992 | 0.56779 | 0.547397 | 0.759349 | 0.636184 |
| 450 | 0.683800 | 0.677398 | 0.58814 | 0.577357 | 0.643324 | 0.608558 |
| 500 | 0.683400 | 0.675506 | 0.58476 | 0.562535 | 0.744821 | 0.64097 |


TrainOutput(global_step=25000, training_loss=0.5807842582702637, metrics={'train_runtime': 18785.9245, 'train_samples_per_second': 21.293, 'train_steps_per_second': 1.331, 'total_flos': 99256800000000.0, 'train_loss': 0.5807842582702637, 'epoch': 4.0})
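The `global_step=25000` in the output follows directly from the training setup. The sketch below redoes the arithmetic, assuming a single device and no gradient accumulation. It also gives a very rough estimate of how much of the runtime went into the periodic evaluations, under the assumption that each pass over the 100000 validation rows takes about as long as the final `trainer.evaluate()` call reported further down (about 35.56 s).

```python
import math

n_examples = 100_000  # rows in small_train_dataset
batch_size = 16       # per_device_train_batch_size
epochs = 4            # num_train_epochs
eval_every = 50       # eval_steps

steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
n_evals = total_steps // eval_every

# Rough estimate only: assumes every evaluation pass costs the same
# as the final one.
eval_seconds_each = 35.5607
est_eval_seconds = n_evals * eval_seconds_each

print(steps_per_epoch, total_steps, n_evals)
print(round(est_eval_seconds))
```

If that assumption roughly holds, most of the 18786 s of `train_runtime` was spent evaluating rather than training, so a larger `eval_steps` or a smaller evaluation set would likely shorten the run considerably.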

In [10]:
trainer.evaluate()

{'eval_loss': 0.5657798051834106,
 'eval_accuracy': 0.70802,
 'eval_precision': 0.7209557164650523,
 'eval_recall': 0.6742489701597508,
 'eval_f1': 0.6968205511598446,
 'eval_runtime': 35.5607,
 'eval_samples_per_second': 2812.097,
 'eval_steps_per_second': 175.756,
 'epoch': 4.0}
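As a sanity check on these numbers, recall that F1 is the harmonic mean of precision and recall. The sketch below defines the metrics from raw counts (`tp`, `fp`, `fn` are true positives, false positives and false negatives for the sarcastic class) and confirms that the reported F1 matches the reported precision and recall. The helper names are illustrative and the counts in the toy example are made up.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (sarcastic) class."""
    return tp / (tp + fp), tp / (tp + fn)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts: 3 hits, 1 false alarm, 2 misses.
toy_precision, toy_recall = precision_recall(tp=3, fp=1, fn=2)
toy_f1 = f1_from(toy_precision, toy_recall)

# Values copied from the evaluation output above.
reported_precision = 0.7209557164650523
reported_recall = 0.6742489701597508
reported_f1 = 0.6968205511598446

recomputed_f1 = f1_from(reported_precision, reported_recall)
print(math.isclose(recomputed_f1, reported_f1, abs_tol=1e-6))
```

Precision being higher than recall here means the model is somewhat conservative: when it flags a comment as sarcastic it is right about 72% of the time, but it catches only about 67% of the sarcastic comments.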

With a validation accuracy of about 0.708, that's a bit better than what we got with a tuned logistic regression. Let's now save the model and tokenizer, and also test the model on some examples.

In [25]:
tokenizer.save_pretrained("./tokenizer")
trainer.save_model("./model")

In [28]:
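device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU if no GPU is present
model.to(device)
model.eval()  # make sure dropout is disabled for inference
text = "It's so great that I was able to train this model."

tokenized_text = tokenizer(preprocessing(text), return_tensors="pt", padding="max_length", truncation=True, max_length=100).to(device)
with torch.no_grad():  # no gradients needed for prediction
    output = model(**tokenized_text)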


In [29]:
probs = output.logits.softmax(dim=-1).tolist()[0]
confidence = max(probs)
prediction = probs.index(confidence)
results = {"Is it sarcastic?": prediction, "Confidence": confidence}

In [30]:
results

{'Is it sarcastic?': 0, 'Confidence': 0.6193878054618835}

Yes, that was not sarcastic :) Now let's try a sarcastic example from our dataset.

In [40]:
X1 = X[X['label']==1]
text = X1['comment'][45]
text

"Ho ho ho... But Melania said that there is no way it could have happened because she didn't know the woman!"

In [41]:
tokenized_text = tokenizer(preprocessing(text), return_tensors="pt", padding="max_length", truncation=True, max_length=100).to(device)
with torch.no_grad():  # no gradients needed for prediction
    output = model(**tokenized_text)
probs = output.logits.softmax(dim=-1).tolist()[0]
confidence = max(probs)
prediction = probs.index(confidence)
results = {"Is it sarcastic?": prediction, "Confidence": confidence}

In [42]:
results

{'Is it sarcastic?': 1, 'Confidence': 0.8572683930397034}

Predicted correctly here.
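Both inference cells above repeat the same post-processing: softmax over the two logits, then take the larger probability as the confidence and its index as the prediction. A minimal pure-Python sketch of that step is shown below. `logits_to_result` is a hypothetical helper, and the logits passed to it are made-up values, not outputs of the trained model.

```python
import math


def logits_to_result(logits):
    """Turn raw class scores into a prediction and its probability."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]          # softmax
    confidence = max(probs)
    prediction = probs.index(confidence)       # 0 = not sarcastic, 1 = sarcastic
    return {"Is it sarcastic?": prediction, "Confidence": confidence}


example = logits_to_result([0.2, -0.3])  # hypothetical logits
print(example["Is it sarcastic?"])
```

Because there are only two classes, the reported confidence is always at least 0.5, so a value such as 0.62 indicates a fairly uncertain prediction.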