<img src="http://drive.google.com/uc?export=view&id=1tpOCamr9aWz817atPnyXus8w5gJ3mIts" width=500px>

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

### Package Version:
- tensorflow==2.2.0
- pandas==1.0.5
- numpy==1.18.5
- google==2.0.3

# Sarcasm Detection

### Dataset

#### Acknowledgement
Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

### Load Data (5 Marks)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [34]:
import json 

data = []
# The file is in JSON Lines format: one JSON object per line
with open('/content/drive/My Drive/AIML/Labs/Datasets/Sarcasm/Sarcasm_Headlines_Dataset.json','r') as f:
  for line in f:
    data.append(json.loads(line))
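
The loading step above reads one JSON object per line. As a minimal sketch of that pattern, the snippet below parses a small in-memory sample instead of the real file; the two records and their URLs are made up for illustration and only mimic the three fields used in this notebook.

```python
import io
import json

# Hypothetical two-record sample in the same one-object-per-line layout
sample = io.StringIO(
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)

# Parse each non-empty line as its own JSON document
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))
print(records[1]["headline"])
```

Skipping blank lines is a defensive choice here; the original loop assumes every line holds a record.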

In [35]:
import pandas as pd

df = pd.DataFrame(data, columns = ['article_link','headline','is_sarcastic'])

### Drop `article_link` from dataset (5 Marks)

In [36]:
df = df.drop(['article_link'], axis = 1)

### Get length of each headline and add a column for that (5 Marks)

In [37]:
# Character count of each headline
df['length'] = df['headline'].str.len()
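
Note that the `length` column counts characters, while `maxlen` later in the notebook limits the number of tokens per headline. The sketch below contrasts the two measures on two made-up headlines; splitting on whitespace is only an approximation of how a tokenizer counts words.

```python
# Hypothetical headlines, used only to contrast the two measures
headlines = ["scientists discover new planet", "local man wins award"]

char_lengths = [len(h) for h in headlines]          # characters, as in the `length` column
word_counts = [len(h.split()) for h in headlines]   # whitespace-separated words

print(char_lengths)
print(word_counts)
```

If the goal is to sanity-check a token limit such as 25, the word count is the more relevant of the two.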

### Initialize parameter values
- Set values for `max_features`, `maxlen`, and `embedding_size`
- `max_features`: number of most frequent words the tokenizer keeps
- `maxlen`: maximum number of tokens per headline, set to 25
- `embedding_size`: dimension of each word embedding vector

In [39]:
max_features = 10000
maxlen = 25
embedding_size = 200

### Apply `tensorflow.keras` Tokenizer and get indices for words (5 Marks)
- Initialize Tokenizer object with number of words as 10000
- Fit the tokenizer object on headline column
- Convert the text to sequences of word indices


In [40]:
import re

df['headline'] = df['headline'].apply(lambda x: x.lower())
# Keep only letters, digits, and whitespace (the original range A-z also matched [ \ ] ^ _ `)
df['headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
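
A small illustration of why the character range in the cleaning regex matters: in ASCII, the range `A-z` also spans the six characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_`, and the backtick), so a negated class built with it lets them through. The headline below is made up.

```python
import re

text = "what_a [great] day!"  # hypothetical headline

# A-z spans ASCII 65..122, which includes [ \ ] ^ _ and the backtick
loose = re.sub(r'[^a-zA-z0-9\s]', '', text)
# A-Z restricts the class to letters, so those characters are stripped too
strict = re.sub(r'[^a-zA-Z0-9\s]', '', text)

print(loose)
print(strict)
```

The Keras `Tokenizer` filters most punctuation by default, so the difference may not change the final vocabulary here, but the stricter pattern matches the stated intent.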

In [43]:
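# iterrows() yields copies, so this loop does not write back to df; replacing 'rt'
# would also corrupt ordinary words such as "article", so it is left disabled.
# for idx, row in df.iterrows():
#     row[0] = row[0].replace('rt',' ')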

In [74]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words = max_features, split= ' ')
tokenizer.fit_on_texts(df['headline'].values)

X = tokenizer.texts_to_sequences(df['headline'].values)
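
To make the three tokenizer steps concrete, here is a simplified pure-Python stand-in. It is not the Keras implementation: it skips punctuation filtering and out-of-vocabulary tokens, and the corpus and cap are made up. It does mirror two behaviours of the real class: indices start at 1 in order of descending frequency, and only words with an index below `num_words` are emitted.

```python
from collections import Counter

texts = ["the cat sat", "the cat ran", "the dog"]  # hypothetical corpus
num_words = 3                                      # hypothetical cap

# "Fit": count words across all texts
counts = Counter(word for t in texts for word in t.split())

# Indices start at 1 and follow descending frequency; 0 stays free for padding
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text):
    # Drop unknown words and words whose index falls outside the cap
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

sequences = [to_sequence(t) for t in texts]
print(word_index)
print(sequences)
```

With a cap of 3, only indices 1 and 2 survive, which is why a cap of 10000 keeps the 9999 most frequent words.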

### Pad sequences (5 Marks)
- Pad or truncate each sequence to `maxlen` tokens
- Convert target column into numpy array

In [76]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

X = pad_sequences(X, maxlen = maxlen, padding = 'post', value = 0)
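
The call above pads on the right (`padding = 'post'`) but leaves `truncating` at its default of `'pre'`, so a sequence longer than `maxlen` loses tokens from its start. A minimal pure-Python sketch of that combination, using a hypothetical helper name `pad_post`:

```python
def pad_post(seq, maxlen, value=0):
    # Keep the last `maxlen` tokens (mirrors the default truncating='pre'),
    # then append the pad value on the right (padding='post')
    trimmed = seq[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([5, 3], maxlen=4)
long = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

If the first words of a headline matter more than the last, passing `truncating = 'post'` to `pad_sequences` is the alternative to consider.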

In [86]:
y = df['is_sarcastic'].values

In [96]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.20, random_state = 42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((21367, 25), (5342, 25), (21367,), (5342,))

### Vocab mapping
- Index 0 is not assigned to any word: the tokenizer's `word_index` starts at 1, and 0 is used as the padding value

In [None]:
# tokenizer.word_index

### Set number of words
- Since index 0 has no word, add 1 to the length of the vocabulary so that every index has a row in the embedding matrix

In [79]:
num_words = len(tokenizer.word_index) + 1
print(num_words)

28399


### Load Glove Word Embeddings (5 Marks)

### Create embedding matrix

In [80]:
EMBEDDING_FILE = '/content/drive/My Drive/AIML/Labs/Datasets/Sarcasm/glove.6B.200d.txt'

embeddings = {}
# Each line holds a word followed by its space-separated vector components
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for o in f:
        values = o.rstrip().split(" ")
        word = values[0]
        embd = np.asarray(values[1:], dtype='float32')
        embeddings[word] = embd

# Create a weight matrix for the words in the tokenizer vocabulary;
# words without a GloVe vector keep an all-zero row
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
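
Words missing from GloVe end up with all-zero rows, so it can be worth checking how much of the vocabulary is covered. The sketch below does this on a tiny made-up stand-in for the embedding file (3 dimensions instead of 200) and a hypothetical three-word vocabulary; it illustrates the check and is not a measurement on this dataset.

```python
import io

# Hypothetical 3-dimensional stand-in for a GloVe text file
glove_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.5\n")

vectors = {}
for line in glove_file:
    parts = line.split()
    vectors[parts[0]] = [float(v) for v in parts[1:]]

word_index = {"the": 1, "cat": 2, "zzyzx": 3}  # hypothetical vocabulary

# Words that would be left as zero rows in the embedding matrix
missing = [w for w in word_index if w not in vectors]
coverage = 1 - len(missing) / len(word_index)

print(missing)
print(round(coverage, 3))
```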

### Define model (10 Marks)
- Hint: Use a `Sequential` model and add an Embedding layer, a Bidirectional(LSTM) layer, a flatten step, then dense and dropout layers as required. 
End with a final dense layer with sigmoid activation for binary classification.

In [81]:
from tensorflow import keras as k

model = k.models.Sequential()
# Frozen embedding layer initialized from the GloVe matrix
model.add(k.layers.Embedding(num_words, embedding_size, embeddings_initializer=k.initializers.Constant(embedding_matrix), trainable = False))
model.add(k.layers.Bidirectional(k.layers.LSTM(units = 64, return_sequences = True)))
model.add(k.layers.Dense(64))
model.add(k.layers.Dropout(0.25))
model.add(k.layers.GlobalMaxPool1D())
model.add(k.layers.Flatten())
model.add(k.layers.Dense(units = 1, activation='sigmoid'))

### Compile the model (5 Marks)

In [82]:
model.compile(optimizer=k.optimizers.Adam(), loss = 'binary_crossentropy', metrics=['acc'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 200)         5679800   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         135680    
_________________________________________________________________
dense (Dense)                (None, None, 64)          8256      
_________________________________________________________________
dropout (Dropout)            (None, None, 64)          0         
_________________________________________________________________
global_max_pooling1d (Global (None, 64)                0         
_________________________________________________________________
flatten (Flatten)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65
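
The parameter counts in the summary can be checked by hand from the layer sizes. The arithmetic below uses the vocabulary size printed earlier (28399) and the layer widths from the model definition; the variable names are only for illustration. The final dense layer maps 64 inputs to 1 output, which works out to 64 weights plus 1 bias.

```python
vocab_size, embed_dim, lstm_units, dense_units = 28399, 200, 64, 64

# Embedding: one vector of size embed_dim per vocabulary index
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction

# Dense layers: inputs * outputs + one bias per output
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * 1 + 1

print(embedding_params, bilstm_params, dense_params, output_params)
# 5679800 135680 8256 65
```

Because the embedding layer is frozen, its 5679800 weights are not trainable; only the remaining layers are updated during fitting.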

### Fit the model (5 Marks)

In [97]:
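import tensorflow as tf
# Debug aid: run tf.functions eagerly (on TF 2.2 this is tf.config.experimental_run_functions_eagerly)
tf.config.run_functions_eagerly(True)
model.fit(X_train, y_train, epochs = 3, batch_size = 32)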

Epoch 1/3
  5/668 [..............................] - ETA: 17s - loss: 0.1999 - acc: 0.9125

  "Even though the tf.config.experimental_run_functions_eagerly "


Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7ff9ecd9ac88>

In [99]:
# evaluate() returns the loss followed by each compiled metric
loss, acc = model.evaluate(X_test, y_test, verbose = 2, batch_size = 32)

  "Even though the tf.config.experimental_run_functions_eagerly "


167/167 - 2s - loss: 0.2127 - acc: 0.9096
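
The `acc` figure reported for a sigmoid output is binary accuracy: each predicted probability is thresholded at 0.5 and compared with the label. A minimal sketch of that calculation on made-up values (the probabilities and labels below are not from this model):

```python
probs = [0.91, 0.12, 0.55, 0.40]   # hypothetical sigmoid outputs
labels = [1, 0, 0, 0]              # hypothetical ground truth

# Threshold at 0.5 to turn probabilities into class predictions
preds = [1 if p > 0.5 else 0 for p in probs]

# Fraction of predictions that match the labels
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(preds)
print(accuracy)
```

The same thresholding can be applied to the output of `model.predict(X_test)` when per-headline predictions are needed rather than a single score.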
