# Detect sarcasm in news headlines with TensorFlow
This work is part of a collection of practice sets called [NLP Starter](https://github.com/jamiemorales/project-nlp-starter).
It aims to help you get started fast and gain an early, high-level understanding of the fundamental steps in the NLP lifecycle.
By the end, you should have built some intuition for that lifecycle.

## Step 0: Understand the problem
The goal is to classify whether a news headline is sarcastic (label 1) or not (label 0).

## Step 1: Set-up and understand data
In this step, we lay out the tools we need to solve the problem identified in the previous step. We also inspect the data source and explore the data itself, so that we understand it well enough for preprocessing and modeling.

In [1]:
# Set-up libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow import keras

In [3]:
# Load data
df = pd.read_json('../00-Datasets/news-headlines-dataset-for-sarcasm-detection/Sarcasm_Headlines_Dataset.json', lines=True)
df.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [4]:
# Look at some details
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26709 entries, 0 to 26708
Data columns (total 3 columns):
article_link    26709 non-null object
headline        26709 non-null object
is_sarcastic    26709 non-null int64
dtypes: int64(1), object(2)
memory usage: 626.1+ KB


In [5]:
# Look at breakdown of label
df['is_sarcastic'].value_counts()

0    14985
1    11724
Name: is_sarcastic, dtype: int64
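
The label counts above also give a useful reference point: the accuracy of a model that always predicts the majority class. The sketch below is not part of the original notebook; it simply reuses the two counts printed above, and the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
counts = {0: 14985, 1: 11724}

total = sum(counts.values())

# A "model" that always predicts the most common label
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print('Total headlines: {}'.format(total))
print('Majority label: {}'.format(majority_label))
print('Majority-class baseline accuracy: {:.1%}'.format(baseline_accuracy))
```

The baseline comes out at roughly 56%, so any trained model should be judged against that figure rather than against 50%.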

## Step 2: Prepare data and understand some more
In this step, we perform the transformations needed for the neural network to work with the data: splitting it into training and validation sets, then turning the text into padded integer sequences. Real-world datasets are complex and messy, but most of the datasets in this series require minimal preparation.

In [6]:
# Split data into 80% training and 20% validation
sentences = df['headline']
labels = df['is_sarcastic']

train_sentences, val_sentences, train_labels, val_labels = train_test_split(sentences, labels, test_size=0.2, random_state=0)

print(train_sentences.shape)
print(val_sentences.shape)
print(train_labels.shape)
print(val_labels.shape)

(21367,)
(5342,)
(21367,)
(5342,)
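
The printed shapes can be checked with a little arithmetic. This is a sketch under the assumption that the validation size is rounded up and the training set takes the remainder, which is consistent with the shapes shown above. The variable names are illustrative.

```python
import math

n_samples = 26709   # number of rows reported by df.info()
test_size = 0.2     # fraction passed to train_test_split

# Assumption: validation size is rounded up, training gets the rest
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print('Training headlines: {}'.format(n_train))
print('Validation headlines: {}'.format(n_val))
```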


In [7]:
# Tokenize and pad
vocab_size = 10000
oov_token = '<OOV>'  # placeholder for out-of-vocabulary words
max_length = 100
padding_type = 'post'
trunc_type = 'post'


tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

val_sequences = tokenizer.texts_to_sequences(val_sentences)
val_padded = pad_sequences(val_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
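
To build intuition for what tokenizing and padding do, here is a minimal pure-Python sketch of the same idea. It is a simplified illustration, not the Keras implementation: it ignores punctuation filtering and the `num_words` cut-off, and the helper names `build_word_index`, `to_sequences` and `pad_post` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Give each word an integer id by descending frequency; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, oov_token='<OOV>'):
    """Replace each word with its id, falling back to the OOV id for unseen words."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]

def pad_post(sequences, maxlen):
    """Cut sequences at maxlen and fill the remainder with zeros at the end."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in sequences]

toy_train = ['the cat sat', 'the dog sat down']
toy_index = build_word_index(toy_train)

# 'bird' never appeared in the toy training text, so it maps to the OOV id
toy_sequences = to_sequences(['the bird sat'], toy_index)
toy_padded = pad_post(toy_sequences, maxlen=5)
print(toy_padded)
```

Fitting the word index on the training sentences only, as the cell above does, means validation headlines are encoded with the vocabulary the model will actually know.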

## Step 3: Build, train, and evaluate neural network
First, we design the neural network, i.e., the sequence of layers and their activation functions.

Second, we train the neural network: we iteratively make a guess, measure how good that guess is, and improve on it. The first guess is initialised with random values. The loss function measures how good or bad the guess is. The optimizer uses that measurement to generate the next, improved guess.

Lastly, we apply the neural network to previously unseen data and evaluate the results.
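
The guess, measure, improve loop described above can be sketched without any framework. The example below is an illustration only: it trains a single-weight logistic model on made-up toy data with plain gradient descent, whereas the notebook uses Keras with the Adam optimizer. All names and values in it are assumptions chosen for the demo.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy data: one feature per example, label 1 when the feature is positive
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

random.seed(0)
w, b = random.uniform(-1, 1), 0.0  # the first guess starts from a random weight
learning_rate = 0.5
losses = []

for epoch in range(50):
    preds = [sigmoid(w * x + b) for x in xs]        # make a guess
    losses.append(binary_crossentropy(ys, preds))   # measure how bad it is
    # Gradients of the loss with respect to w and b
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
    w -= learning_rate * grad_w                     # improve the guess
    b -= learning_rate * grad_b

print('First loss: {:.4f}, last loss: {:.4f}'.format(losses[0], losses[-1]))
```

The loss shrinks from the first epoch to the last, which is the same pattern to look for in the Keras training log.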

In [10]:
# Build and train neural network
embedding_dim = 16
num_epochs = 10
batch_size = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
             )

history = model.fit(train_padded, train_labels, batch_size=batch_size, epochs=num_epochs, 
                    verbose=2)

Train on 21367 samples
Epoch 1/10
21367/21367 - 1s - loss: 0.6835 - accuracy: 0.5606
Epoch 2/10
21367/21367 - 1s - loss: 0.6360 - accuracy: 0.6381
Epoch 3/10
21367/21367 - 1s - loss: 0.4622 - accuracy: 0.8321
Epoch 4/10
21367/21367 - 1s - loss: 0.3441 - accuracy: 0.8710
Epoch 5/10
21367/21367 - 1s - loss: 0.2912 - accuracy: 0.8889
Epoch 6/10
21367/21367 - 1s - loss: 0.2592 - accuracy: 0.9012
Epoch 7/10
21367/21367 - 1s - loss: 0.2331 - accuracy: 0.9137
Epoch 8/10
21367/21367 - 1s - loss: 0.2134 - accuracy: 0.9222
Epoch 9/10
21367/21367 - 1s - loss: 0.1961 - accuracy: 0.9294
Epoch 10/10
21367/21367 - 1s - loss: 0.1809 - accuracy: 0.9356
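
It helps to know how many weights this small network is actually learning. The arithmetic below is a sketch that counts trainable parameters layer by layer from the hyperparameters used above; `hidden_units` is an illustrative name for the 24 units of the first `Dense` layer.

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 24  # size of the first Dense layer

# Embedding: one vector of length embedding_dim per vocabulary id
embedding_params = vocab_size * embedding_dim

# GlobalAveragePooling1D only averages over the sequence, so it has no weights
pooling_params = 0

# Dense layers: one weight per input-output pair plus one bias per output
hidden_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + hidden_params + output_params
print('Total trainable parameters: {}'.format(total_params))
```

Almost all of the parameters sit in the embedding layer, so `vocab_size` and `embedding_dim` are the settings that most affect model size.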


In [13]:
# Apply neural network
val_loss, val_accuracy = model.evaluate(val_padded, val_labels)
print('Val loss: {}, Val accuracy: {}'.format(val_loss, val_accuracy*100))

quick_test_sentence = [
    'canada is flattening the coronavirus curve',
    'canucks take home the cup',
    'safety meeting ends in accident'
    
]

quick_test_sequences = tokenizer.texts_to_sequences(quick_test_sentence)
quick_test_padded = pad_sequences(quick_test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
quick_test_sentiments = model.predict(quick_test_padded)

# Map each predicted probability to a label; every band includes its lower bound
for i in range(len(quick_test_sentiments)):
    print('\n' + quick_test_sentence[i])
    if 0 <= quick_test_sentiments[i] < .50:
        print('Unlikely sarcasm. Sarcasm score: {}'.format(quick_test_sentiments[i]*100))
    elif .50 <= quick_test_sentiments[i] < .75:
        print('Possible sarcasm. Sarcasm score: {}'.format(quick_test_sentiments[i]*100))
    elif .75 <= quick_test_sentiments[i] <= 1:
        print('Sarcasm. Sarcasm score:  {}'.format(quick_test_sentiments[i]*100))
    else:
        print('Not in range')

Val loss: 0.3366185913747368, Val accuracy: 85.75440049171448

canada is flattening the coronavirus curve
Unlikely sarcasm. Sarcasm score: [0.6456852]

canucks take home the cup
Unlikely sarcasm. Sarcasm score: [7.4036417]

safety meeting ends in accident
Possible sarcasm. Sarcasm score: [56.476707]
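
The score-to-label mapping can also be pulled into a small helper, which makes the band boundaries easier to test. This is a sketch, not part of the original notebook: `describe_sarcasm` is a hypothetical name, and the 0.50 and 0.75 cut-offs are the ones used in the cell above rather than tuned values.

```python
def describe_sarcasm(score):
    """Turn a sigmoid output in [0, 1] into one of three labels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError('score must be between 0 and 1')
    if score < 0.50:
        return 'Unlikely sarcasm'
    if score < 0.75:
        return 'Possible sarcasm'
    return 'Sarcasm'

# Scores taken from the output above, converted from percentages back to probabilities
example_scores = [0.006456852, 0.074036417, 0.56476707]
example_labels = [describe_sarcasm(score) for score in example_scores]
print(example_labels)
```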


## More

If you found this work interesting, you might like:

* Machine Learning Starter

* Deep Learning Starter

* Natural Language Processing Starter

You can find more at [github.com/jamiemorales](https://github.com/jamiemorales).

Datasets are not mine. List of sources: [datasets and sources]()

This work is shared under the Creative Commons BY-SA 4.0 license: https://creativecommons.org/licenses/by-sa/4.0/