Training a binary classifier with the Sarcasm Dataset

Download the dataset

In [1]:
import requests

url = "https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json"
response = requests.get(url)

# Stop early if the download failed instead of saving an error page
response.raise_for_status()

with open("sarcasm.json", "wb") as file:
    file.write(response.content)

In [2]:
import json

# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)

# Initialize the lists
sentences = []
labels = []

# Collect sentences and labels into the lists
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])

In [3]:
# Hyperparameters
# Number of examples to use for training
training_size = 20000

# Vocabulary size of the tokenizer
vocab_size = 10000

# Maximum length of the padded sequences
max_length = 32

# Output dimensions of the Embedding layer
embedding_dim = 16

In [4]:
# Split the sentences
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]

# Split the labels
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

Preprocessing the train and test sets
Tokenization => Padding

In [5]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Parameters for padding and OOV tokens
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# Generate the word index dictionary
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Generate and pad the training sequences
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Generate and pad the testing sequences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Convert the labels lists into numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)
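
To make the Tokenization => Padding step more concrete, here is a minimal pure-Python sketch of what the cell above does conceptually. It is not the Keras implementation: the helper names (build_word_index, encode_texts, pad_post) and the toy sentences are made up for illustration, and the real Tokenizer also strips punctuation. The sketch assumes the behaviour used in this notebook: words are ranked by frequency, index 1 is the OOV token, index 0 is padding, only indices below num_words are kept, and both padding and truncation happen at the end ('post').

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency. Index 0 is left for padding, index 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode_texts(texts, word_index, num_words, oov_token="<OOV>"):
    """Keep indices below num_words; replace every other word with the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_index)
            seq.append(idx if idx < num_words else oov_index)
        sequences.append(seq)
    return sequences

def pad_post(sequences, maxlen):
    """Truncate at the end, then fill with zeros at the end."""
    padded = []
    for seq in sequences:
        kept = seq[:maxlen]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded

# Hypothetical toy data, not taken from the sarcasm dataset
toy_train = ["the cat sat", "the cat ran away", "the dog"]
toy_test = ["the bird sat down quietly"]

toy_index = build_word_index(toy_train)
toy_sequences = encode_texts(toy_train + toy_test, toy_index, num_words=4)
toy_padded = pad_post(toy_sequences, maxlen=4)

print(toy_index)
print(toy_padded)
```

With num_words=4 only the OOV token and the two most frequent words ("the", "cat") keep their own index. Every other word, including words never seen during fitting, becomes 1.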

Build and Compile the Model
 => Use a GlobalAveragePooling1D layer instead of Flatten after the Embedding layer. 🧑‍💻☝️
 => It averages over the sequence dimension. For the sample array in the next cell, which has 3 steps of 2 values each, it computes (10 + 1 + 1) / 3 and (2 + 3 + 1) / 3 to arrive at the final output.
 => This averaging reduces the dimensionality passed to the next layer compared with Flatten(), so the number of trainable parameters also decreases.
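
The two points above can be checked by hand. The sketch below is plain Python, not Keras code: it reproduces the averaging on the same sample values and then compares how many parameters a Dense layer would need after Flatten versus after GlobalAveragePooling1D. It assumes max_length = 32 and embedding_dim = 16 as set in the hyperparameters, and a following Dense layer with 24 units. The helper names are made up for illustration.

```python
# Sample with shape (batch=1, steps=3, features=2)
sample = [[[10, 2], [1, 3], [1, 1]]]

def global_average_pool_1d(batch):
    """Average each feature over the steps axis, giving one row per batch item."""
    pooled = []
    for steps in batch:
        n_steps = len(steps)
        n_features = len(steps[0])
        pooled.append([sum(step[f] for step in steps) / n_steps
                       for f in range(n_features)])
    return pooled

gap_output = global_average_pool_1d(sample)
print(gap_output)  # [[4.0, 2.0]]

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

max_length = 32      # padded sequence length
embedding_dim = 16   # embedding vector size
dense_units = 24     # assumed size of the next Dense layer

# Flatten keeps every step: 32 * 16 = 512 inputs to the Dense layer
flatten_dense_params = dense_params(max_length * embedding_dim, dense_units)

# GlobalAveragePooling1D keeps one averaged vector: 16 inputs
gap_dense_params = dense_params(embedding_dim, dense_units)

print(flatten_dense_params)  # 12312
print(gap_dense_params)      # 408
```

The Embedding layer itself has the same number of parameters in both cases. The saving comes entirely from the smaller input to the Dense layer.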

In [6]:
import tensorflow as tf

# Initialize a GlobalAveragePooling1D (GAP1D) layer
gap1d_layer = tf.keras.layers.GlobalAveragePooling1D()

# Define sample array
sample_array = np.array([[[10,2],[1,3],[1,1]]])

# Print shape and contents of sample array
print(f'shape of sample_array = {sample_array.shape}')
print(f'sample array: {sample_array}')

# Pass the sample array to the GAP1D layer
output = gap1d_layer(sample_array)

# Print shape and contents of the GAP1D output array
print(f'output shape of gap1d_layer: {output.shape}')
print(f'output array of gap1d_layer: {output.numpy()}')

shape of sample_array = (1, 3, 2)
sample array: [[[10  2]
  [ 1  3]
  [ 1  1]]]
output shape of gap1d_layer: (1, 2)
output array of gap1d_layer: [[4 2]]


Build the model

In [7]:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Print the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 32, 16)            160000    
                                                                 
 global_average_pooling1d_1  (None, 16)                0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense (Dense)               (None, 24)                408       
                                                                 
 dense_1 (Dense)             (None, 1)                 25        
                                                                 
Total params: 160433 (626.69 KB)
Trainable params: 160433 (626.69 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Compile the model


In [8]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Train the Model


In [9]:
num_epochs = 30

# Train the model
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

Epoch 1/30
625/625 - 2s - loss: 0.5661 - accuracy: 0.6976 - val_loss: 0.3974 - val_accuracy: 0.8313 - 2s/epoch - 4ms/step
Epoch 2/30
625/625 - 1s - loss: 0.3129 - accuracy: 0.8740 - val_loss: 0.3429 - val_accuracy: 0.8536 - 921ms/epoch - 1ms/step
Epoch 3/30
625/625 - 2s - loss: 0.2359 - accuracy: 0.9082 - val_loss: 0.3424 - val_accuracy: 0.8521 - 2s/epoch - 3ms/step
Epoch 4/30
625/625 - 2s - loss: 0.1911 - accuracy: 0.9267 - val_loss: 0.3607 - val_accuracy: 0.8498 - 2s/epoch - 3ms/step
Epoch 5/30
625/625 - 3s - loss: 0.1588 - accuracy: 0.9401 - val_loss: 0.3848 - val_accuracy: 0.8490 - 3s/epoch - 5ms/step
Epoch 6/30
625/625 - 2s - loss: 0.1343 - accuracy: 0.9516 - val_loss: 0.4155 - val_accuracy: 0.8471 - 2s/epoch - 4ms/step
Epoch 7/30
625/625 - 2s - loss: 0.1151 - accuracy: 0.9599 - val_loss: 0.4532 - val_accuracy: 0.8456 - 2s/epoch - 2ms/step
Epoch 8/30
625/625 - 1s - loss: 0.0996 - accuracy: 0.9658 - val_loss: 0.4921 - val_accuracy: 0.8396 - 930ms/epoch - 1ms/step
Epoch 9/30
625/625

Visualize the Results
import matplotlib.pyplot as plt

# Plot utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
# Plot the accuracy and loss
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

Visualize Word Embeddings

In [10]:
# Visualize the final weights of the embeddings
# Get the index-word dictionary
reverse_word_index = tokenizer.index_word

# Get the embedding layer from the model (i.e. first layer)
embedding_layer = model.layers[0]

# Get the weights of the embedding layer
embedding_weights = embedding_layer.get_weights()[0]

# Print the shape. Expected is (vocab_size, embedding_dim)
print(embedding_weights.shape) 

(10000, 16)


In [11]:
# Open writeable files. The `with` block closes them even if the loop fails
with open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     open('meta.tsv', 'w', encoding='utf-8') as out_m:

  # Start counting at `1` because `0` is reserved for the padding
  for word_num in range(1, vocab_size):

    # Get the word associated with the current index
    word_name = reverse_word_index[word_num]

    # Get the embedding weights associated with the current index
    word_embedding = embedding_weights[word_num]

    # Write the word name
    out_m.write(word_name + "\n")

    # Write the word embedding as tab-separated values
    out_v.write('\t'.join([str(x) for x in word_embedding]) + "\n")

Now we can see the result in the two .tsv files in our workspace:
C:\Users\User\PycharmProjects\Tensorfolw_Certificate_Practices\C3_NLP


What I've learned in this task:
1/ Preprocessing data, more exactly:
    a/ Splitting the dataset: plain Python slicing on the lists (sentences[0:training_size], sentences[training_size:]) is really handy when preprocessing data.
    b/ Using the num_words parameter in the Tokenizer constructor: with num_words=100, for example, only the most frequent words (up to num_words - 1) keep their own index when text is converted to sequences, even if the training sentences contain many more distinct words. The rarer words are not selected by length. They are replaced by the OOV token, because oov_token is set. (I guess it's a way of avoiding over-fitting, but not sure! We will see how it goes 🤔)
    c/ Padding sequences: using pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
2/ Build and compile the model:
    a/ Replacing the Flatten layer with GlobalAveragePooling1D.
    b/ Adding an Embedding layer: this layer learns a vector for each word, so words that behave similarly in the training data end up close together in the embedding space. This makes it easier to group words. For example, 'bad' and 'worst' should end up with similar vectors, and the same goes for 'good' and 'awesome'.
    c/ Visualizing word embeddings by exporting them to external .tsv files.
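
One common way to check the note about the Embedding layer is to compare word vectors with cosine similarity. The sketch below only illustrates the idea: the 3-dimensional vectors are invented for the example and are not weights from the trained model, and cosine_similarity is a helper written here, not a Keras function. On the real model, the rows of embedding_weights would be used instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for illustration only
toy_embeddings = {
    "bad": [-0.9, 0.1, -0.3],
    "worst": [-0.8, 0.2, -0.4],
    "good": [0.9, -0.1, 0.4],
}

sim_bad_worst = cosine_similarity(toy_embeddings["bad"], toy_embeddings["worst"])
sim_bad_good = cosine_similarity(toy_embeddings["bad"], toy_embeddings["good"])

print(f"bad vs worst: {sim_bad_worst:.3f}")
print(f"bad vs good:  {sim_bad_good:.3f}")
```

In this toy setup, 'bad' and 'worst' point in nearly the same direction while 'bad' and 'good' point in opposite directions. Whether the trained model shows the same pattern depends on the data and should be checked rather than assumed.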