<a href="https://colab.research.google.com/github/iolef/Sarcasm-identification-in-implicit-misogyny/blob/main/2_Humour_detection_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Humour detection model**

In this notebook, [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) is fine-tuned as a humour detection model on the [Kaggle dataset](https://www.kaggle.com/competitions/humor-detection/data) for humour detection.

# **1. Setup**

**1.1 Installing Transformers**

In [None]:
# Transformers installation
! pip install transformers[torch] datasets evaluate
# To install from source instead of the latest release, comment out the command above and uncomment the one below.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting transformers[torch]
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers[torch])
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[torch])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_

**1.2 Imports**

In [None]:
from sklearn.model_selection import train_test_split
from datasets import Dataset
import pandas as pd
import zipfile
import os
import re
import pickle

# **2. Dataset upload**

In [None]:
# Loading the pickled training and test texts and labels
with open("./X_train.pickle", "rb") as f:
    x_train = pickle.load(f)
with open("./X_test.pickle", "rb") as f:
    x_test = pickle.load(f)
with open("./y_train.pickle", "rb") as f:
    y_train = pickle.load(f)
with open("./y_test.pickle", "rb") as f:
    y_test = pickle.load(f)

# Printing the training texts and labels to verify that the files loaded correctly
print(x_train)
print(y_train)

# Joining the texts and their corresponding labels for each set into two pandas dataframes
train_df = pd.DataFrame({"text": x_train, "label": y_train})
test_df = pd.DataFrame({"text": x_test, "label": y_test})

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
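
Printing the full lists produces very long output. A lighter check, sketched below with only the standard library, compares the lengths of texts and labels and counts each label. The helper name `summarize_split` and the toy data are hypothetical and are not part of the original notebook.

```python
from collections import Counter


def summarize_split(texts, labels, n_preview=3):
    """Return basic statistics for one split instead of printing everything."""
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return {
        "size": len(texts),
        "label_counts": dict(Counter(labels)),
        "preview": list(texts[:n_preview]),
    }


# Toy stand-in for x_train / y_train (illustrative data only).
toy_texts = ["joke one", "plain sentence", "joke two", "another plain sentence"]
toy_labels = [1, 0, 1, 0]

summary = summarize_split(toy_texts, toy_labels, n_preview=2)
print(summary)
```

In the notebook the same call would presumably be `summarize_split(x_train, y_train)`, assuming both objects support `len()` and slicing.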

# **3. Preprocessing**

**3.1 Creating a function which includes all the text preprocessing operations.**

In [None]:
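def clean_text(post):
    """Lowercase a post, replace non-word characters with spaces, and collapse whitespace."""
    # Lowercasing
    post = post.lower()
    # Replacing every character that is not a letter, digit, or underscore with a space
    post = re.sub(r'[^\w]', ' ', post)
    # Stripping leading and trailing whitespace
    post = post.strip()
    # Collapsing repeated whitespace between words
    post = ' '.join(post.split())
    return post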

# Applying the clean_text function to the text column of each dataframe
train_df["text"] = train_df["text"].apply(clean_text)
test_df["text"] = test_df["text"].apply(clean_text)
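
To make the effect of `clean_text` concrete, the following sketch runs the same steps on a made-up input string. Note that punctuation, including apostrophes, is replaced by a space rather than deleted, which is why contractions such as "won't" appear as "won t" in the cleaned data.

```python
import re


def clean_text(post):
    # Same steps as the notebook's function, repeated so the example is self-contained.
    post = post.lower()
    post = re.sub(r"[^\w]", " ", post)
    post = post.strip()
    return " ".join(post.split())


# Made-up input with mixed case, punctuation, and extra whitespace.
sample = "  We WON'T see the likes of him again!!  "
cleaned = clean_text(sample)
print(cleaned)  # we won t see the likes of him again
```

Because `\w` also matches digits and underscores, tokens such as `2b` or `not_funny` pass through unchanged apart from lowercasing.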

# **4. Preparing the data for the training**

**4.1 Converting the dataframes into a format compatible with Hugging Face**

In [None]:
# Converting the dataframes into datasets, the format expected by the Hugging Face Trainer
train_ds = Dataset.from_pandas(train_df, split="train")
test_ds = Dataset.from_pandas(test_df, split="test")


**4.2 Creating the tokenizer**

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train_set = train_ds.map(preprocess_function, batched=True)
tokenized_test_set = test_ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/13409 [00:00<?, ? examples/s]

Map:   0%|          | 0/3355 [00:00<?, ? examples/s]
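
`preprocess_function` truncates but does not pad, so the tokenized examples keep different lengths. Padding is left to batch time, where the `DataCollatorWithPadding` used in the training section pads each batch to its longest sequence. The sketch below illustrates that idea in plain Python; it is not the library's implementation. The helper `pad_batch` and the token ids are made up, and `pad_id=0` is an assumption about the padding id of the tokenizer.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids standing in for two tokenized posts of different length.
batch = [[101, 2026, 102], [101, 2026, 8257, 2003, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short jokes small, which is presumably why the tokenizer call omits `padding=True`.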

In [None]:
# Printing a few examples
train_ds[0:100]

{'text': ['my grandfather died recently he spent his final years as a regular user of facebook we won t see the likes of him again',
  'i was sat in traffic the other day got hit by a car',
  'whats the difference between a ginger fanny and a cricket ball if you try really hard really really hard you can eat a cricket ball',
  'money can t buy happiness but i d much rather cry in a mansion',
  '2b or not 2b that is the pencil',
  'what s the difference between a jew and a canoe canoes tip',
  'i ve just won 10 million on the lottery and decided to buy my local chinese takeaway called happiness your move philosophers',
  'a man was hospitalized with 6 plastic horses up his ass the doctor described his condition as stable',
  'just told my joke about peter pan again never gets old',
  'two blondes were driving to disneyland and the exit sign reads disneyland left they started crying and headed home',
  'your head is so big your ears are in different time zones',
  'knock knock who s ther

# **5. Training the model**

Fine-tuning the pre-trained DistilBERT model for binary humour classification.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

import evaluate

accuracy = evaluate.load("accuracy")

import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

id2label = {0: "not_funny", 1: "funny"}
label2id = {"not_funny": 0, "funny": 1}

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

training_args = TrainingArguments(
    output_dir="my_humour_model",
    learning_rate=3e-6,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=20,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_set,
    eval_dataset=tokenized_test_set,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.251628 | 0.936215 |
| 2 | No log | 0.137847 | 0.954993 |
| 3 | No log | 0.116828 | 0.962146 |
| 4 | No log | 0.105796 | 0.965127 |
| 5 | 0.215000 | 0.099542 | 0.965127 |
| 6 | 0.215000 | 0.096265 | 0.966021 |
| 7 | 0.215000 | 0.09337 | 0.967511 |
| 8 | 0.215000 | 0.090519 | 0.967809 |
| 9 | 0.215000 | 0.090389 | 0.968405 |
| 10 | 0.070700 | 0.090559 | 0.969001 |
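
The Accuracy column is produced by `compute_metrics`, which takes the index of the larger of the two logits for each example and compares it with the true label. The sketch below reproduces that computation in plain Python on invented logits; `accuracy_from_logits` is a hypothetical helper, not a function from `evaluate`.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class index equals the label."""
    # Index of the largest value in each row, i.e. an argmax over the class axis.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Invented logits for four examples; column 0 = not funny, column 1 = funny.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 1.7]]
labels = [0, 1, 1, 1]
acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75
```

Here the third example is misclassified (class 0 has the larger logit but the label is 1), giving 3 correct out of 4.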


TrainOutput(global_step=2100, training_loss=0.08936879975455148, metrics={'train_runtime': 1201.8224, 'train_samples_per_second': 223.144, 'train_steps_per_second': 1.747, 'total_flos': 3485968079715336.0, 'train_loss': 0.08936879975455148, 'epoch': 20.0})
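
The `global_step=2100` in the output above can be checked against numbers already present in the notebook: the training-set size reported by the `Map` progress line, the batch size, and the number of epochs. The short calculation below is a sketch that assumes a single device and no gradient accumulation. It also shows which epoch the `checkpoint-945` directory used in the testing section would correspond to under those assumptions.

```python
import math

train_examples = 13409  # size reported by the training-set Map progress line
batch_size = 128        # per_device_train_batch_size
num_epochs = 20         # num_train_epochs

# Assumes one device and no gradient accumulation; the last batch of an epoch is partial.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# With save_strategy="epoch", checkpoint names carry the global step at the end of an epoch.
checkpoint_step = 945
checkpoint_epoch, remainder = divmod(checkpoint_step, steps_per_epoch)

print(steps_per_epoch)   # 105
print(total_steps)       # 2100
print(checkpoint_epoch)  # 9
```

Under these assumptions `checkpoint-945` is the model saved after epoch 9, which has the lowest validation loss among the epochs listed in the log above.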

# **6. Testing the model**

Trying the model on a sample sentence.

In [None]:
from transformers import pipeline

text = "mi iq test results came back. they were negative"

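# "sentiment-analysis" is the pipeline alias for text classification;
# the label names come from the id2label mapping saved with the checkpoint.
classifier = pipeline(task="sentiment-analysis",
                      model="my_humour_model/checkpoint-945",
                      tokenizer="my_humour_model/checkpoint-945")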

classifier(text)

[{'label': 'funny', 'score': 0.6413333415985107}]
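
The `score` is a probability, not a raw logit. For a two-label model such as this one, the pipeline is understood to apply a softmax to the logits by default and report the most likely label. The sketch below shows that step in plain Python. The logits are invented for illustration and are not the values behind the output above.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "not_funny", 1: "funny"}

# Invented logits for one sentence (illustrative only).
logits = [-0.3, 0.3]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

A score near 0.64 therefore corresponds to a fairly small gap between the two logits, so the prediction for the sample sentence is not a confident one.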

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
