<a href="https://colab.research.google.com/github/packetech/baracuda/blob/master/NLP_Project_Sarcasm_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sarcasm Detection
**Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

## Install TensorFlow 2.0

In [0]:
#!pip uninstall tensorflow
#!pip install tensorflow==2.0.0

In [0]:
#pip list | grep tensorflow 

## Get Required Files from Drive

In [97]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [98]:
!ls "/content/drive/My Drive/AIML/NLP_Sarcasm_Project/"

glove.6B.100d.txt  glove.6B.50d.txt
glove.6B.200d.txt  NLP_Project_Sarcasm_Detection_Questions-1.ipynb
glove.6B.300d.txt  Sarcasm_Headlines_Dataset.json


In [0]:
#Set your project path 
project_path = "/content/drive/My Drive/AIML/NLP_Sarcasm_Project/" ## Add your path here ##

# Reading and Exploring Data

## Read Data "Sarcasm_Headlines_Dataset.json". Explore the data and get some insights about the data. ( 8 marks)
Hint - As it is in JSON format, use the pandas.read_json function with the parameter lines=True, so that each line of the file is read as one record.

In [0]:
# Load libraries

import pandas as pd

In [101]:
sarcasmDF = pd.read_json(project_path+'Sarcasm_Headlines_Dataset.json',lines=True)
sarcasmDF.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0
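
To make the `lines=True` hint concrete, here is a minimal sketch that reads two made-up records from an in-memory buffer instead of the real file. The record contents and the name `sample_df` are assumptions for illustration only.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line
sample = io.StringIO(
    '{"headline": "example headline one", "is_sarcastic": 0}\n'
    '{"headline": "example headline two", "is_sarcastic": 1}\n'
)

# lines=True tells pandas to parse each line as a separate record
sample_df = pd.read_json(sample, lines=True)
print(sample_df.shape)
print(sample_df.dtypes)
```

Calling `read_json` without `lines=True` on such input would try to parse the whole buffer as a single JSON document, which is why the hint asks for the parameter.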


## Drop `article_link` from dataset. ( 4 marks)
We only need the headline text and the is_sarcastic column for this project, so we can drop the article_link column here.

In [102]:
sarcasmDF = sarcasmDF.drop(labels='article_link', axis=1)
sarcasmDF.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


In [103]:
# check for null values in the dataset (we have no null value)
print(sarcasmDF.isnull().any(axis = 0))

headline        False
is_sarcastic    False
dtype: bool


In [104]:
sarcasmDF.shape

(26709, 2)

In [105]:
# The classes are only mildly imbalanced (about 56% non-sarcastic, 44% sarcastic)
sarcasmDF.groupby(sarcasmDF["is_sarcastic"]).count()

is_sarcastic,headline
0,14985
1,11724
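
The balance can be read more easily as proportions. The sketch below rebuilds the two counts printed above as a small Series (the name `counts` is an assumption) and normalises them.

```python
import pandas as pd

# Class counts copied from the groupby output above
counts = pd.Series({0: 14985, 1: 11724}, name="headline")

# Share of each class in the whole dataset
proportions = (counts / counts.sum()).round(3)
print(proportions)
```

On the real DataFrame, `sarcasmDF['is_sarcastic'].value_counts(normalize=True)` should give the same shares directly.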


## Get the Length of each line and find the maximum length. ( 8 marks)
Different lines have different lengths, so we need to pad our sequences to a common maximum length.

In [106]:
# Get the length of each headline (note: this counts characters, not words)
sarcasmDF['length'] = sarcasmDF['headline'].str.len()
sarcasmDF.head()

Unnamed: 0,headline,is_sarcastic,length
0,former versace store clerk sues over secret 'b...,0,78
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64


In [107]:
# find the maximum length
sarcasmDF.nlargest(3, ['length'])

Unnamed: 0,headline,is_sarcastic,length
19868,"maya angelou, poet, author, civil rights activ...",1,254
17306,"'12 years a slave,' 'captain phillips,' 'ameri...",1,238
15247,"elmore leonard, modern prose master, noted for...",1,237


In [108]:
# Cross-check the maximum length another way; both approaches give 254 characters
max_length = max([len(txt) for txt in sarcasmDF['headline']])
max_length

254
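
The 254 above is a character count, while the sequences fed to the model are sequences of word indices. A minimal sketch of the difference, using two hypothetical headlines and a plain whitespace split as a rough stand-in for the Keras tokenizer:

```python
import pandas as pd

# Hypothetical headlines, for illustration only
toy = pd.DataFrame({"headline": [
    "mom starting to fear son's web series",
    "short headline",
]})

toy["n_chars"] = toy["headline"].str.len()              # characters
toy["n_words"] = toy["headline"].str.split().str.len()  # whitespace-separated words
print(toy)
print("max characters:", toy["n_chars"].max(), "| max words:", toy["n_words"].max())
```

Applied to `sarcasmDF`, the maximum word count would likely be much smaller than 254, so padding to the character maximum is safe but adds many zeros per row.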

# Modelling

## Import the modules required for modelling.

In [0]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential
import tensorflow as tf

## Set Different Parameters for the model. ( 4 marks)

In [0]:
max_features = 10000
maxlen = 254 ## Add your max length here ##
embedding_size = 200

## Apply the Keras Tokenizer to the headline column of your data.  ( 8 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features),
then fit this tokenizer instance on your data column df['headline'] using .fit_on_texts().

In [0]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(sarcasmDF['headline'])

## Define X and y for your model

In [112]:
X = tokenizer.texts_to_sequences(sarcasmDF['headline'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(sarcasmDF['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0  
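
The long run of leading zeros comes from padding on the left, which is the default of `pad_sequences` (`padding='pre'`, `truncating='pre'`). The NumPy sketch below imitates that default on two short made-up sequences; `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # leave an all-padding row
        trimmed = seq[-maxlen:]             # keep the last maxlen items
        out[row, -len(trimmed):] = trimmed  # right-align, padding on the left
    return out

padded = pad_pre([[5, 8, 2], [7, 1, 4, 9, 3, 6]], maxlen=5)
print(padded)
```

The first sequence is shorter than `maxlen` and gets zeros in front; the second is longer and loses its earliest items.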

## Get the Vocabulary size ( 4 marks)
Hint : You can use tokenizer.word_index.

In [113]:
# Vocabulary size: number of distinct words, plus 1 because index 0 is reserved for padding
num_words = len(tokenizer.word_index) + 1
num_words

29657
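
Note that `word_index` holds every word seen during fitting, while `num_words=max_features` only limits which indices are emitted by `texts_to_sequences`. The pure-Python sketch below mimics that behaviour as I understand it for a tokenizer without an `oov_token`; the corpus and the name `num_words_cap` are made up.

```python
from collections import Counter

texts = ["the cat sat", "the cat ran", "the dog"]  # made-up corpus
num_words_cap = 3  # plays the role of max_features

# Rank words by frequency; the most frequent word gets index 1 (0 is kept for padding)
freq = Counter(word for text in texts for word in text.split())
word_index = {w: i for i, (w, _) in enumerate(freq.most_common(), start=1)}

# Only indices below the cap are emitted; rarer words are dropped
sequences = [[word_index[w] for w in text.split() if word_index[w] < num_words_cap]
             for text in texts]

print(len(word_index) + 1)  # vocabulary size including the padding index
print(sequences)
```

Under that assumption, an embedding matrix sized from the full vocabulary has rows that are never looked up, since only indices below `max_features` reach the model.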

# Word Embedding

## Get Glove Word Embeddings

In [0]:
# No need to run the unzipping code; my GloVe files were already unzipped.


#glove_file = project_path + "glove.6B.zip"

In [0]:

#Extract Glove embedding zip file
##from zipfile import ZipFile
##with ZipFile(glove_file, 'r') as z:
##  z.extractall()

## Get the Word Embeddings using the Embedding file as given below.

In [0]:
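EMBEDDING_FILE = project_path + 'glove.6B.200d.txt'

# Map each word to its 200-dimensional GloVe vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its space-separated vector components
        values = line.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')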
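
The GloVe text layout is one word per line followed by its vector components. A small self-contained sketch of the same parsing on two made-up lines (real 200d lines carry 200 values each; `fake_glove` and `toy_embeddings` are hypothetical names):

```python
import io
import numpy as np

# Two made-up lines in GloVe's text layout
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

toy_embeddings = {}
for line in fake_glove:
    # First token is the word, the rest are the vector components
    word, *values = line.rstrip().split(" ")
    toy_embeddings[word] = np.asarray(values, dtype="float32")

print(toy_embeddings["cat"])
```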

## Create a weight matrix for words in training docs

In [117]:
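# One row per vocabulary index; words missing from GloVe keep an all-zero row
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Number of words in the GloVe vocabulary
len(embeddings)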

400000
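
It can be useful to know how many vocabulary words actually received a pretrained vector. The sketch below shows one way to measure this on toy stand-ins for `tokenizer.word_index` and the GloVe dictionary; all names and values here are assumptions for illustration.

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary
toy_word_index = {"the": 1, "cat": 2, "zxqv": 3}
toy_vectors = {"the": np.ones(4, dtype="float32"),
               "cat": np.full(4, 2.0, dtype="float32")}

toy_matrix = np.zeros((len(toy_word_index) + 1, 4))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec

# Rows that are still all zero (other than padding row 0) had no pretrained vector
missing = [w for w, i in toy_word_index.items() if not toy_matrix[i].any()]
coverage = 1 - len(missing) / len(toy_word_index)
print(missing, round(coverage, 2))
```

The same loop over the real `tokenizer.word_index` and `embedding_matrix` would report how much of the headline vocabulary GloVe covers.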

## Create and Compile your Model  ( 14 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [123]:
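embedding_size = 200

model = Sequential()
# Embedding layer initialised with the GloVe weight matrix (trainable by default)
model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
# Bidirectional LSTM returning the full sequence so it can be pooled over time
model.add(Bidirectional(LSTM(128, return_sequences = True)))
# Keep the maximum of each feature across all time steps
model.add(GlobalMaxPool1D())
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.5))
# The pooled output is already 2-D, so this Flatten leaves the shape unchanged
model.add(Flatten())
# Single sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
model.summary()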


Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 200)         5931400   
_________________________________________________________________
bidirectional_4 (Bidirection (None, None, 256)         336896    
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 1)                
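
The parameter counts in the summary can be checked by hand. The arithmetic below reproduces the first three trainable layers from the layer sizes used above; the variable names are assumptions, and the truncated final row is left out.

```python
vocab_rows, emb_dim, lstm_units, dense_units = 29657, 200, 128, 128

# Embedding: one emb_dim-sized vector per vocabulary row
embedding_params = vocab_rows * emb_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias;
# the Bidirectional wrapper doubles it (forward + backward copies)
lstm_params = 4 * (lstm_units * (emb_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params

# Dense after pooling: its input is the 2 * lstm_units concatenated output
dense_params = (2 * lstm_units) * dense_units + dense_units

print(embedding_params, bilstm_params, dense_params)
```

The embedding layer dominates the total, which is one reason the model can fit the training set so quickly.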

## Fit your model with a batch size of 100 and validation_split = 0.2, and state the validation accuracy. ( 10 marks)


In [124]:
batch_size = 100
epochs = 5

# Hold out 20% of the data for validation (used via validation_data instead of validation_split)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,y, test_size = 0.2, random_state = 123)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(21367, 254) (21367,)
(5342, 254) (5342,)
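
The shapes above, and the 214 steps per epoch reported in the training log, follow from simple arithmetic. A short sketch, assuming (as I understand scikit-learn's behaviour) that the test share is rounded up:

```python
import math

n_samples, test_size, batch = 26709, 0.2, 100

n_test = math.ceil(test_size * n_samples)     # test share is rounded up
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch)  # the last batch may be smaller

print(n_train, n_test, steps_per_epoch)
```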


In [125]:
model.fit(X_train, Y_train, validation_data=(X_test,Y_test), epochs = epochs, batch_size=batch_size, verbose = 2)

Epoch 1/5
214/214 - 23s - loss: 0.4381 - accuracy: 0.7914 - val_loss: 0.3250 - val_accuracy: 0.8566
Epoch 2/5
214/214 - 22s - loss: 0.2457 - accuracy: 0.9012 - val_loss: 0.2956 - val_accuracy: 0.8793
Epoch 3/5
214/214 - 23s - loss: 0.1547 - accuracy: 0.9416 - val_loss: 0.3313 - val_accuracy: 0.8733
Epoch 4/5
214/214 - 23s - loss: 0.0914 - accuracy: 0.9670 - val_loss: 0.4336 - val_accuracy: 0.8617
Epoch 5/5
214/214 - 23s - loss: 0.0531 - accuracy: 0.9824 - val_loss: 0.5045 - val_accuracy: 0.8667


<tensorflow.python.keras.callbacks.History at 0x7f8c3378a240>

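**Validation Accuracy = 86.7%** after 5 epochs. It peaked at 87.9% in epoch 2; after that, validation loss rose while training loss kept falling, which is a sign of overfitting.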
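
To see which epoch generalised best, the validation numbers from the log can be compared directly. This is a minimal sketch with the values typed in by hand; the list names are assumptions.

```python
# Validation metrics copied from the training log above
val_loss = [0.3250, 0.2956, 0.3313, 0.4336, 0.5045]
val_accuracy = [0.8566, 0.8793, 0.8733, 0.8617, 0.8667]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_accuracy[best_epoch - 1])
```

During training, a similar effect could be obtained by passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', restore_best_weights=True)` in the `callbacks` argument of `model.fit`, though that was not done in this notebook.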

In [126]:
# Let's look at precision and recall for the sarcastic and non-sarcastic classes

from sklearn.metrics import classification_report,confusion_matrix
Y_pred = model.predict(X_test)
y_pred = np.rint(Y_pred)
print('  Classification Report:\n',classification_report(Y_test,y_pred),'\n')

  Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.86      0.88      3000
           1       0.83      0.88      0.85      2342

    accuracy                           0.87      5342
   macro avg       0.86      0.87      0.87      5342
weighted avg       0.87      0.87      0.87      5342
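
The f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the report, so small rounding differences against scikit-learn's internal values are possible.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed (already rounded) in the report above
f1_not_sarcastic = f1(0.90, 0.86)
f1_sarcastic = f1(0.83, 0.88)
print(round(f1_not_sarcastic, 2), round(f1_sarcastic, 2))
```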
 



In [127]:
####  TESTING THE MODEL WITH REAL SARCASTIC NEWS #######
# Even though we had more non-sarcastic than sarcastic samples,
# the validation accuracy is good at 86.7%, and the classification report
# shows similarly good precision, recall and F-score for both classes.
# The single headline below is only a quick sanity check, not a measure of performance.

headline = ['Police Chief Vows To Take Concrete Steps To Better Cover Up Violence']

# Vectorize the headline with the pre-fitted tokenizer instance
headline = tokenizer.texts_to_sequences(headline)
# Pad the headline to the same length (maxlen) the model was trained on
headline = pad_sequences(headline, maxlen=maxlen, dtype='int32', value=0)
sarcasm = model.predict(headline, batch_size=1, verbose=2)[0]

# Round the sigmoid output to the nearest class label
if np.rint(sarcasm) == 0:
    print("Not_Sarcastic")
elif np.rint(sarcasm) == 1:
    print("Sarcastic")

1/1 - 0s
Sarcastic
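
Rounding with `np.rint` works as a decision rule, but an explicit threshold makes the cut-off visible and easy to change. A small sketch on made-up sigmoid outputs; the `threshold` of 0.5 is an assumption, not a tuned value.

```python
import numpy as np

# Made-up sigmoid outputs for three headlines
probs = np.array([0.12, 0.50, 0.93])

threshold = 0.5  # assumed cut-off; could be tuned on validation data
labels = np.where(probs >= threshold, "Sarcastic", "Not_Sarcastic")
print(labels.tolist())

# rint rounds exact halves to the nearest even integer, so 0.5 becomes 0
print(np.rint(probs).tolist())
```

The two rules only disagree when the output is exactly 0.5, which is rare in practice, but the explicit form is easier to adapt if a different precision/recall trade-off is wanted.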
