**Sarcasm Detection Case Study**
Load the sarcasm detection dataset from the source below. It is a high-quality collection of news headlines: the sarcastic ones come from The Onion, a satirical news site, and the non-sarcastic ones come from HuffPost.
# https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection

url = 'https://raw.githubusercontent.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/master/Sarcasm_Headlines_Dataset.json'


In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

import tensorflow as tf

In [None]:
import requests
import json

# Raw URL of the dataset file (JSON Lines: one JSON object per line)
url = 'https://raw.githubusercontent.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/master/Sarcasm_Headlines_Dataset.json'

# Fetch the file from the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Split the response text by newlines
    lines = response.text.splitlines()

    # Parse each line as a JSON object and append to a list
    data = [json.loads(line) for line in lines]

    # Create a DataFrame from the list
    df = pd.DataFrame(data)

    # Display the first few rows of the DataFrame
    print(df.head())
else:
    print(f"Failed to retrieve the file: {response.status_code}")


   is_sarcastic                                           headline  \
0             1  thirtysomething scientists unveil doomsday clo...   
1             0  dem rep. totally nails why congress is falling...   
2             0  eat your veggies: 9 deliciously different recipes   
3             1  inclement weather prevents liar from getting t...   
4             1  mother comes pretty close to using word 'strea...   

                                        article_link  
0  https://www.theonion.com/thirtysomething-scien...  
1  https://www.huffingtonpost.com/entry/donna-edw...  
2  https://www.huffingtonpost.com/entry/eat-your-...  
3  https://local.theonion.com/inclement-weather-p...  
4  https://www.theonion.com/mother-comes-pretty-c...  
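
The file is in JSON Lines form (one JSON object per line), which is why the cell above splits the response on newlines before parsing. Here is a minimal standard-library sketch of that parsing step. The two records are made up for illustration (they are not dataset rows), and `parse_json_lines` is a hypothetical helper name. It also skips blank lines, which `json.loads` would otherwise reject.

```python
import json

def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # json.loads("") raises an error, so ignore empty lines
            records.append(json.loads(line))
    return records

# Two illustrative records (hypothetical, not real dataset rows)
sample = (
    '{"is_sarcastic": 1, "headline": "example satirical headline", "article_link": "https://example.com/a"}\n'
    '\n'
    '{"is_sarcastic": 0, "headline": "example news headline", "article_link": "https://example.com/b"}\n'
)

records = parse_json_lines(sample)
print(len(records), records[0]["is_sarcastic"])
```

With pandas, `pd.read_json(url, lines=True)` should load the same format directly into a DataFrame.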


In [None]:
df

Unnamed: 0,is_sarcastic,headline,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...
...,...,...,...
28614,1,jews to celebrate rosh hashasha or something,https://www.theonion.com/jews-to-celebrate-ros...
28615,1,internal affairs investigator disappointed con...,https://local.theonion.com/internal-affairs-in...
28616,0,the most beautiful acceptance speech this week...,https://www.huffingtonpost.com/entry/andrew-ah...
28617,1,mars probe destroyed by orbiting spielberg-gat...,https://www.theonion.com/mars-probe-destroyed-...


In [None]:
# Keep only the label and headline columns

df1 = df[['is_sarcastic', 'headline']]


In [None]:
# Headlines are the inputs, is_sarcastic is the target
train_texts = df['headline'].values.tolist()
train_labels = df['is_sarcastic'].values.tolist()
# Hold out 30% of the rows, then split that part half-and-half into test and validation sets
train_texts, test_val_texts, train_labels, test_val_labels = train_test_split(train_texts, train_labels, test_size=.3)
test_texts, val_texts, test_labels, val_labels = train_test_split(test_val_texts, test_val_labels, test_size=.5)

In [None]:
(len(train_texts),len(train_labels))

(20033, 20033)

In [None]:
(len(test_texts),len(test_labels))

(4293, 4293)

In [None]:
(len(val_texts),len(val_labels))

(4293, 4293)
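
The three sizes above follow from the two `test_size` values. The worked check below assumes the DataFrame has 28,619 rows (the sum of the three reported sizes) and assumes the held-out share is rounded up, which matches the reported numbers.

```python
import math

n_total = 20033 + 4293 + 4293   # 28619 rows, the sum of the reported split sizes

# First split: test_size=.3 holds out 30% for validation + test
n_holdout = math.ceil(0.3 * n_total)
n_train = n_total - n_holdout

# Second split: test_size=.5 divides the held-out part in half
n_val = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_val

print(n_train, n_test, n_val)   # 20033 4293 4293
```

The result is roughly a 70% / 15% / 15% split. Passing `random_state` and `stratify` to `train_test_split` would make the split reproducible and keep the class ratio similar across the three subsets.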


# Fine-tuning DistilBERT on the sarcasm dataset
- https://huggingface.co/docs/transformers/tasks/sequence_classification
- https://github.com/Arfius/mymedium/blob/master/fine-tuning-transformers-of-sentiment-analysis-task-with-tranformer-tensorflow/main.py

DistilBERT is a smaller, more efficient version of BERT that retains most of BERT's performance.

Here we treat sarcasm detection as a two-label text classification problem. We fine-tune DistilBERT (uncased) on the sarcasm dataset and reach 90%+ validation accuracy and 90%+ test accuracy without much tuning. The model works out of the box and has a large capacity that can be extended. A similar fine-tuning process is how models for downstream tasks such as sarcasm detection are produced, for example helinivan/english-sarcasm-detector on HuggingFace
(https://github.com/helinivan/multilingual-sarcasm-detector),
which has shown 94% performance on the test set.
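
In a two-label setup the classification head produces two logits per headline. A softmax turns them into probabilities, and the larger one picks the label. Here is a minimal sketch of that final step in plain Python. The logit values are made up for illustration (they are not model output), `classify` is a hypothetical helper, and the label names are the ones passed as `id2label` later in this notebook.

```python
import math

ID2LABEL = {0: "serious", 1: "sarcastic"}

def classify(logits):
    """Softmax over the logits, then return the top label and its probability."""
    shift = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# Hypothetical logits for one headline: index 0 = serious, index 1 = sarcastic
result = classify([-2.0, 3.0])
print(result)
```

The returned dictionary has the same `label` / `score` shape as the prediction output shown at the end of the notebook.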

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='np').data
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='np').data
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors='np').data
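
With `padding=True` each call pads its sequences to the longest one in that call, and `truncation=True` cuts anything longer than the maximum length. The attention mask marks which positions hold real tokens. Here is a toy sketch of that behaviour in plain Python. The token ids are made up and `pad_batch` is a hypothetical helper; this is not the tokenizer's actual implementation.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate to max_length, pad to the longest remaining sequence, build masks."""
    # Note: a real tokenizer keeps the closing special token when truncating;
    # this toy version simply slices.
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three headlines of different lengths
batch = pad_batch(
    [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 7, 8, 102]],
    max_length=5,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the train, validation, and test sets are tokenized in separate calls, each set is padded to its own longest headline.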

In [None]:
print('Setup the model')
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

# Load DistilBERT with a freshly initialized two-class classification head
model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2, id2label={0: 'serious', 1: 'sarcastic'})
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
# hf_compute_loss is the loss method defined on the Transformers model
model.compile(optimizer=optimizer, loss=[model.hf_compute_loss],metrics=['accuracy'])
# model.summary() prints the table itself and returns None, hence the trailing "None" in the output
print(model.summary())

Setup the model


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


In [None]:
print('Fine-tuning and Evaluation')
model.fit(train_encodings, np.array(train_labels), validation_data=(val_encodings, np.array(val_labels)), epochs=5, batch_size=32)


Fine-tuning and Evaluation
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7f2c499c26b0>

In [None]:
# Prints [loss, accuracy] on the held-out test set
print(model.evaluate(test_encodings, np.array(test_labels)))

[0.35043755173683167, 0.9180060625076294]
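
Accuracy alone hides which class the errors fall on. Here is a small sketch of a per-class breakdown in plain Python. The labels and predictions are made up (they are not results from this model), and `confusion_counts` and `precision_recall` are hypothetical helpers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

def precision_recall(counts):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = sarcastic, 0 = serious) and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts, precision_recall(counts))
```

On the real test set, the predicted labels would come from the argmax of the model's logits, and scikit-learn's `confusion_matrix` (imported at the top of the notebook) gives the same counts.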


In [None]:
print('Export Model and Tokenizer')
model.save_pretrained("Sarcasm-distilbert-base-uncased")
tokenizer.save_pretrained("Sarcasm-distilbert-base-uncased")

Export Model and Tokenizer


('Sarcasm-distilbert-base-uncased/tokenizer_config.json',
 'Sarcasm-distilbert-base-uncased/special_tokens_map.json',
 'Sarcasm-distilbert-base-uncased/vocab.txt',
 'Sarcasm-distilbert-base-uncased/added_tokens.json',
 'Sarcasm-distilbert-base-uncased/tokenizer.json')

In [None]:
print('Load model and make a prediction')
from transformers import pipeline
pipe = pipeline("text-classification", model="./Sarcasm-distilbert-base-uncased", tokenizer="./Sarcasm-distilbert-base-uncased")
print(pipe("Prequel Depicts Young Willy Wonka Using Rich Father’s Investment To Buy Already-Successful Chocolate Factory"))
print(pipe("India women create history with 410 runs on day 1 of only Test against England"))

# https://www.theonion.com/prequel-depicts-young-willy-wonka-using-rich-father-s-i-1851049152
# https://www.msn.com/en-in/sports/other/india-women-create-history-with-410-runs-on-day-1-of-only-test-against-england/ar-AA1lvxIx?ocid=hpmsn&cvid=db426707b24349c4aeae6501105d9e2e&ei=22

Load model and make a prediction


Some layers from the model checkpoint at ./Sarcasm-distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./Sarcasm-distilbert-base-uncased and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'sarcastic', 'score': 0.9998061060905457}]
[{'label': 'serious', 'score': 0.9986339211463928}]


In [None]:
print(pipe("Oh great, here comes another Monday"))


[{'label': 'serious', 'score': 0.9698437452316284}]
