In this lab, you will apply what you've learned in the past two exercises to preprocess the News Headlines Dataset for Sarcasm Detection. The dataset contains news headlines, each labeled as sarcastic or not. You will revisit it in later labs, so it is worth getting acquainted with it now.

### Download the dataset

In [2]:
import tensorflow as tf
import json
# import tensorflow_datasets as tfds
from tensorflow.keras.utils import pad_sequences

In [3]:
# Download the dataset
!wget -nc https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

--2025-07-26 18:49:24--  https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 2607:f8b0:4009:809::201b, 2607:f8b0:4009:80a::201b, 2607:f8b0:4009:804::201b, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|2607:f8b0:4009:809::201b|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘sarcasm.json’


2025-07-26 18:49:24 (13.7 MB/s) - ‘sarcasm.json’ saved [5643545/5643545]



### Load the data

In [3]:
import json
import os
from pathlib import Path

# resolve the paths
curr_script_folder = Path.cwd()
folders_root = curr_script_folder.parent
root = folders_root.parent
data_folder = root / 'data'

os.listdir(data_folder)


['sarcasm.json']

In [4]:
# Load the json
with open(data_folder / "sarcasm.json", 'r') as f:
    datastore = json.load(f)

print(f"datastore is of the type: {type(datastore)}")

datastore is of the type: <class 'list'>


In [5]:
# Inspect two items in the datastore list (both of these have is_sarcastic = 0)
print(datastore[0])
print()
print(datastore[2000])

{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers", 'is_sarcastic': 0}

{'article_link': 'https://www.huffingtonpost.com/entry/mh370-theft_n_5684061.html', 'headline': 'couple stole $35,000 from missing plane victims, police say', 'is_sarcastic': 0}


In [6]:
# Get the sentences - headlines
sentences = [item['headline'] for item in datastore]
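
The labels can be pulled out the same way as the headlines. The following is a minimal sketch on a tiny stand-in list with the same keys as the entries in `sarcasm.json`; the two example entries and their URLs are made up for illustration, and the names `toy_datastore`, `toy_sentences`, and `toy_labels` are hypothetical.

```python
# Tiny stand-in for the loaded datastore (same keys as the sarcasm.json entries).
# The entries themselves are invented for illustration.
toy_datastore = [
    {"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0},
    {"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1},
]

# Headlines and labels stay aligned by position.
toy_sentences = [item["headline"] for item in toy_datastore]
toy_labels = [item["is_sarcastic"] for item in toy_datastore]

num_sarcastic = sum(toy_labels)
print(f"{num_sarcastic} of {len(toy_labels)} headlines are sarcastic")
```

Keeping the two lists index-aligned means that any later shuffling or splitting has to be applied to both in the same way.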

In [7]:
sentences[1]

"the 'roseanne' revival catches up to our thorny political mood, for better and worse"

## Pre-Processing

### TensorFlow

In [8]:
import tensorflow as tf
from tensorflow.keras.utils import pad_sequences

# initialize the layer
vec_layer = tf.keras.layers.TextVectorization()

# build the vocab
vec_layer.adapt(sentences)

# # get vocab - optional
# vocab = vec_layer.get_vocabulary()

# post-padded sentences
post_padded_seq = vec_layer(sentences)



In [9]:
index = 2

print(f"sentence: {sentences[index]}")
print('')
print(f"post padded sequence: {post_padded_seq[index]}")
print('')

print(f"The shape of the sequence matrix: {post_padded_seq.shape}")

sentence: mom starting to fear son's web series closest thing she will have to grandchild

post padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

The shape of the sequence matrix: (26709, 39)
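
To make the output above easier to read, here is a rough pure-Python sketch of what the layer does with its default settings: lowercase the text, strip punctuation, split on whitespace, map tokens to ids by frequency (with `0` reserved for padding and `1` for out-of-vocabulary tokens), and post-pad every row to the longest one. This is an approximation for illustration only, not the Keras implementation; the helper names and the alphabetical tie-breaking among equally frequent tokens are assumptions.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip ASCII punctuation (approximates the layer's default)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocab(texts):
    """Index 0 is the padding token, 1 the OOV token, the rest ordered by frequency."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Alphabetical tie-breaking is an assumption of this sketch.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ordered


def vectorize(texts, vocab):
    """Map tokens to ids and post-pad every row with 0 to the longest row."""
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = [[index.get(tok, 1) for tok in standardize(t).split()] for t in texts]
    width = max(len(r) for r in rows)
    return [r + [0] * (width - len(r)) for r in rows]


toy_texts = ["I love my dog", "I love my cat", "Do you think my dog is amazing?"]
toy_vocab = build_vocab(toy_texts)
toy_padded = vectorize(toy_texts, toy_vocab)
for row in toy_padded:
    print(row)
```

The second dimension of the real matrix (39) is therefore the token count of the longest headline after standardization.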


#### Now let's pre-pad

In [10]:
vec_layer = tf.keras.layers.TextVectorization(ragged=True)

vec_layer.adapt(sentences)

# get the vocab - optional
# vocab = vec_layer.get_vocabulary()

ragged_seq = vec_layer(sentences)

In [11]:
index = 2

print(f"sentence: {sentences[index]}")
print('')
print(f"Ragged sequence: {ragged_seq[index]}")
print('')

print(f"The shape of the sequence matrix: {ragged_seq.shape}")

sentence: mom starting to fear son's web series closest thing she will have to grandchild

Ragged sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050]

The shape of the sequence matrix: (26709, None)


In [12]:
from tensorflow.keras.utils import pad_sequences

pre_padded_seq = pad_sequences(ragged_seq.numpy())


In [13]:
index = 2

print(f"sentence: {sentences[index]}")
print('')
print(f"Pre padded sequence: {pre_padded_seq[index]}")
print('')

print(f"The shape of the sequence matrix: {pre_padded_seq.shape}")

sentence: mom starting to fear son's web series closest thing she will have to grandchild

Pre padded sequence: [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0   140   825     2   813  1100  2048   571  5057   199   139    39
    46     2 13050]

The shape of the sequence matrix: (26709, 39)


### PyTorch

In [27]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torch.nn.utils.rnn import pad_sequence

In [17]:
sentences[1:4]

["the 'roseanne' revival catches up to our thorny political mood, for better and worse",
 "mom starting to fear son's web series closest thing she will have to grandchild",
 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas']

In [32]:
tokenizer = get_tokenizer("basic_english")

def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

vocab = build_vocab_from_iterator(yield_token(sentences), specials=['<pad>', '<unk>'])
vocab.set_default_index(vocab['<unk>'])

# # get vocab - optional
# vocab.get_itos()
# vocab.get_stoi()

# Numericalized
seq_torch = []
for sentence in sentences:
    tokens = tokenizer(sentence)
    seq = vocab(tokens)
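    # Note: torch.Tensor(...) creates a float32 tensor, which is why the ids print as floats below.
    # torch.tensor(seq, dtype=torch.long) would keep them as integers (e.g. for nn.Embedding).
    seq_torch.append(torch.Tensor(seq))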

# seq_torch

# Padding
post_padded_seq = pad_sequence(seq_torch, batch_first=True, padding_value=vocab['<pad>'])


In [37]:
index = 2

print(f"sentence: {sentences[index]}")
print()
print(f"post padded sequence: {post_padded_seq[index]}")
print()
print(f"Shape of the padded matrix sequence: {post_padded_seq.shape}")

sentence: mom starting to fear son's web series closest thing she will have to grandchild

post padded sequence: tensor([1.2500e+02, 8.5200e+02, 3.0000e+00, 8.1900e+02, 2.4400e+02, 2.0000e+00,
        6.0000e+00, 2.1500e+03, 5.8900e+02, 4.6910e+03, 2.1400e+02, 9.2000e+01,
        4.5000e+01, 5.3000e+01, 3.0000e+00, 1.2029e+04, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00])

Shape of the padded matrix sequence: torch.Size([26709, 60])
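
The padded width here (60) is larger than in the TensorFlow section (39). One likely contributor is tokenization: `basic_english` keeps punctuation such as apostrophes as separate tokens, whereas the `TextVectorization` default strips punctuation before splitting. The sketch below approximates that behavior with a handful of regular-expression rules. The rule list and the helper name `basic_english_like` are assumptions for illustration, not torchtext's actual implementation.

```python
import re

# Approximate normalization rules (an assumption of this sketch).
_RULES = [
    (r"'", " ' "),
    (r'"', ""),
    (r"\.", " . "),
    (r",", " , "),
    (r"\(", " ( "),
    (r"\)", " ) "),
    (r"!", " ! "),
    (r"\?", " ? "),
    (r"[;:]", " "),
]


def basic_english_like(text):
    """Lowercase, pad selected punctuation with spaces, then split on whitespace."""
    text = text.lower()
    for pattern, repl in _RULES:
        text = re.sub(pattern, repl, text)
    return text.split()


headline = "mom starting to fear son's web series closest thing she will have to grandchild"
whitespace_tokens = headline.split()
punct_tokens = basic_english_like(headline)
print(len(whitespace_tokens), len(punct_tokens))
print(punct_tokens[4:7])
```

For this headline the sketch produces 16 tokens instead of 14, which matches the 16 non-zero ids in the PyTorch output above versus the 14 in the TensorFlow output.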


## For pre-padding and/or truncation, use custom functions as described in the notebook `sequences_basics.ipynb`
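
Since that notebook is not reproduced here, the following is one possible shape such a helper could take: a minimal pure-Python sketch that works on lists of token ids. The function name `pad_and_truncate` and its parameters are hypothetical; they loosely mirror the `padding` and `truncating` options of Keras `pad_sequences`, and this is not the code from `sequences_basics.ipynb`.

```python
def pad_and_truncate(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad (and optionally truncate) lists of token ids to a common length."""
    if padding not in ("pre", "post") or truncating not in ("pre", "post"):
        raise ValueError("padding and truncating must be 'pre' or 'post'")
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the front, 'post' drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


toy_ids = [[5, 6], [7, 8, 9, 10], [11]]
pre_padded = pad_and_truncate(toy_ids)
clipped = pad_and_truncate(toy_ids, maxlen=3, padding="post", truncating="post")
print(pre_padded)
print(clipped)
```

Because every row of the result has the same length, it can be converted to an integer tensor in one step, for example with `torch.tensor(pre_padded, dtype=torch.long)`.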