<a href="https://colab.research.google.com/github/kameshcodes/tensorflow-codes/blob/main/10_tensorflow_working_with_text_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$$\textbf{Sarcasm Dataset}$$

---

## 1. Download Data

In [1]:
!pip install -q kaggle

In [3]:
!kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection
!unzip -q news-headlines-dataset-for-sarcasm-detection.zip -d dataset
!rm -rf news-headlines-dataset-for-sarcasm-detection.zip

Dataset URL: https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection
License(s): Attribution 4.0 International (CC BY 4.0)
news-headlines-dataset-for-sarcasm-detection.zip: Skipping, found more recently modified local copy (use --force to force download)


## 2. Load Libraries and JSON Data

### 2.1 Import Libraries

In [7]:
import json
import tensorflow as tf

### 2.2 Load Data and Labels

In [9]:
data = []
# The file is in JSON Lines format: one JSON object per line.
with open('dataset/Sarcasm_Headlines_Dataset.json', 'r', encoding='utf-8') as f:
  for line in f:
    data.append(json.loads(line))
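
The loop above parses the file one line at a time because each line holds a separate JSON object; calling `json.load` on the whole file would fail with an "Extra data" error. The following is a minimal sketch of the same pattern on an in-memory sample. The two records and the blank-line guard are illustrative assumptions, not part of the dataset or the original notebook.

```python
import io
import json

# Hypothetical two-record sample in JSON Lines form (one object per line),
# with a blank line in the middle to show why a guard can be useful.
sample = io.StringIO(
    '{"headline": "example headline one", "is_sarcastic": 0}\n'
    '\n'
    '{"headline": "example headline two", "is_sarcastic": 1}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of failing inside json.loads
        continue
    records.append(json.loads(line))

print(len(records))
print(records[1]["is_sarcastic"])
```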

In [11]:
data[0]

{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
 'is_sarcastic': 0}

In [13]:
data[0]['article_link']

'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5'

In [15]:
data[0]['headline']

"former versace store clerk sues over secret 'black code' for minority shoppers"

In [17]:
data[0]['is_sarcastic']

0

In [19]:
article_links = []
headlines = []
labels = []

for element in data:
  article_links.append(element['article_link'])
  headlines.append(element['headline'])
  labels.append(element['is_sarcastic'])
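
A quick sanity check after splitting the records is to confirm that the lists stay aligned and to look at the class balance. The sketch below does this on three hypothetical records that only mimic the keys of the real entries; the names `records`, `sample_headlines`, `sample_labels`, and `label_counts` are illustrative.

```python
from collections import Counter

# Hypothetical records with the same keys as the dataset entries.
records = [
    {"article_link": "https://example.com/a", "headline": "first example", "is_sarcastic": 0},
    {"article_link": "https://example.com/b", "headline": "second example", "is_sarcastic": 1},
    {"article_link": "https://example.com/c", "headline": "third example", "is_sarcastic": 0},
]

# List comprehensions are an equivalent, more compact form of the append loop.
sample_headlines = [r["headline"] for r in records]
sample_labels = [r["is_sarcastic"] for r in records]

# The two lists must stay aligned: one label per headline.
assert len(sample_headlines) == len(sample_labels)

# Count how many examples fall in each class.
label_counts = Counter(sample_labels)
print(label_counts)
```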

## 3. Preprocessing the Headlines

In [26]:
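# Defaults: lowercase the text, strip punctuation, split on whitespace.
vectorize_layer = tf.keras.layers.TextVectorization()
# adapt() builds the vocabulary from the headlines.
vectorize_layer.adapt(headlines)
vocabs = vectorize_layer.get_vocabulary()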

print('No of words: ', len(vocabs))
print(vocabs)

No of words:  28435
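
To make the vocabulary step less opaque, here is a pure-Python sketch that approximates what the layer does with its default settings: lowercase, strip punctuation, split on whitespace, then order tokens by frequency, with index 0 reserved for padding (`''`) and index 1 for out-of-vocabulary tokens (`'[UNK]'`). The helper names `standardize`, `build_vocab`, and `encode`, the toy headlines, and the alphabetical tie-break are assumptions of this sketch, not the layer's actual implementation.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (approximates the layer's default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts):
    """Return a vocabulary list: padding token, OOV token, then tokens by frequency."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent first; the alphabetical tie-break is an assumption of this sketch.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ordered

def encode(text, vocab):
    """Map each token to its vocabulary index; unknown tokens map to 1."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

toy_headlines = [
    "Mom fears son's web series",
    "son starts web series",
    "web series ends",
]
vocab = build_vocab(toy_headlines)
print(vocab[:4])
print(encode("Son's new web series", vocab))
```

Note how `son's` becomes the single token `sons` once punctuation is stripped, and how the unseen word `new` maps to index 1.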


In [24]:
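# With the default settings the layer pads each row with trailing zeros
# up to the length of the longest headline in the batch (post-padding).
post_padded_sequences = vectorize_layer(headlines)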

post_padded_sequences

<tf.Tensor: shape=(26709, 39), dtype=int64, numpy=
array([[  295, 15335,   801, ...,     0,     0,     0],
       [    4,  8793,  3353, ...,     0,     0,     0],
       [  140,   825,     2, ...,     0,     0,     0],
       ...,
       [ 8862,     9,    66, ...,     0,     0,     0],
       [ 1832,   377,  3857, ...,     0,     0,     0],
       [23100,  1692,     6, ...,     0,     0,     0]])>

In [27]:
index = 2

print(f'sample headline: {headlines[index]}')
print(f'post padded sequence: {post_padded_sequences[index]}')
print()
print(f'shape: {post_padded_sequences[index].shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild
post padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

shape: (39,)


## 4. Pre-padding with a Ragged Tensor

In [30]:
# ragged=True returns a tf.RaggedTensor, so rows keep their own lengths (no padding).
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)
vectorize_layer.adapt(headlines)

unpadded_sequences = vectorize_layer(headlines)
unpadded_sequences.numpy()

array([array([  295, 15335,   801,  3788,  2264,    48,   362,    93,  2225,
                  6,  2578,  8719])                                         ,
       array([   4, 8793, 3353, 2845,   28,    2,  156, 8515,  394, 2957,    6,
               244,    9,  951])                                               ,
       array([  140,   825,     2,   813,  1100,  2048,   571,  5057,   199,
                139,    39,    46,     2, 13050])                           ,
       ..., array([8862,    9,   66]),
       array([1832,  377, 3857, 5780,  866, 1665, 4618, 3546]),
       array([23100,  1692,     6,     4, 23598,   843])], dtype=object)

In [36]:
# padding='pre' is the default; stating it makes the leading zeros explicit.
pre_padded_sequences = tf.keras.utils.pad_sequences(unpadded_sequences.numpy(), padding='pre')
pre_padded_sequences

array([[    0,     0,     0, ...,     6,  2578,  8719],
       [    0,     0,     0, ...,   244,     9,   951],
       [    0,     0,     0, ...,    46,     2, 13050],
       ...,
       [    0,     0,     0, ...,  8862,     9,    66],
       [    0,     0,     0, ...,  1665,  4618,  3546],
       [    0,     0,     0, ...,     4, 23598,   843]], dtype=int32)

In [37]:
index = 2
print(f'sample headline: {headlines[index]}')
print()
print(f'post-padded sequence: {post_padded_sequences[index]}')
print()
print(f'pre-padded sequence: {pre_padded_sequences[index]}')
print()


print(f'shape of post-padded sequences: {post_padded_sequences.shape}')
print(f'shape of pre-padded sequences: {pre_padded_sequences.shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild

post-padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

pre-padded sequence: [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0   140   825     2   813  1100  2048   571  5057   199   139    39
    46     2 13050]

shape of post-padded sequences: (26709, 39)
shape of pre-padded sequences: (26709, 39)
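
Both variants produce the same shape; they differ only in where the zeros go. The sketch below reproduces the idea in plain Python on two short token-id lists. The `pad` helper is a hypothetical stand-in written for illustration and is not how `pad_sequences` is implemented.

```python
def pad(sequences, padding="pre", value=0):
    """Pad every sequence to the length of the longest one.

    padding="pre" puts the fill values in front, "post" puts them at the end.
    """
    max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        fill = [value] * (max_len - len(s))
        padded.append(fill + list(s) if padding == "pre" else list(s) + fill)
    return padded

# Toy token-id sequences of different lengths.
toy = [[8862, 9, 66], [140, 825, 2, 813, 1100]]

pre = pad(toy, padding="pre")
post = pad(toy, padding="post")
print(pre)
print(post)
```

Pre-padding keeps the real tokens at the end of each row, which is often preferred for recurrent models that read left to right, while post-padding keeps them at the start.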
