# SENTIMENT ANALYSIS

The dataset is already preprocessed: X_train is a list of reviews, each represented as a sequence of integers in which every integer stands for a word. All punctuation was removed, and then the words were converted to lowercase, split by spaces, and finally indexed by frequency (so low integers correspond to frequent words). The integers 0, 1, and 2 are special: they represent the padding token, the start-of-sequence (SOS) token, and unknown words, respectively.
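
To make the indexing scheme concrete, here is a minimal pure-Python sketch of frequency-based encoding. The toy corpus and the names `reviews`, `word_to_id`, and `encode` are hypothetical and only illustrate the idea; this is not how the Keras dataset was actually built. It assumes a 1-based frequency rank shifted by 3, which matches the `id_ + 3` offset used when decoding later in this notebook.

```python
from collections import Counter

# Hypothetical toy corpus (assumption): the real dataset comes from IMDb reviews.
reviews = ["the film was great", "the film was bad", "the plot was thin", "the end"]

PAD, SOS, UNK = 0, 1, 2   # reserved IDs described above
OFFSET = 3                # shift applied to the 1-based frequency rank

# Count every word, then rank words from most to least frequent.
counts = Counter(word for review in reviews for word in review.split())
ranked = [word for word, _ in counts.most_common()]

# Rank 1 (most frequent word) maps to ID 4, rank 2 to ID 5, and so on.
word_to_id = {word: rank + OFFSET for rank, word in enumerate(ranked, start=1)}

def encode(review):
    """Prefix the SOS ID, then map each word to its ID (UNK if unseen)."""
    return [SOS] + [word_to_id.get(word, UNK) for word in review.split()]

print(encode("the film was superb"))
```

Frequent words such as "the" receive the smallest IDs, and a word that never appeared in the corpus ("superb" here) falls back to the unknown-word ID 2.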

In [1]:
import tensorflow as tf
import keras

## Loading IMDb dataset

In [2]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
X_train[0][:10]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

## Decoding reviews from integers to text

In [3]:
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()} # Create a dictionary id_to_word where the keys are the integer indices + 3 (as IMDb dataset reserves 0, 1, 2 for special tokens), and the values are the corresponding words.

# The following loop iterates over the tuple ("<pad>", "<sos>", "<unk>") and assigns each token to the corresponding index in id_to_word. These tokens stand for padding, start of sequence, and unknown words, respectively.
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

# The following line uses the id_to_word dictionary to convert the first 10 indices of the sequence X_train[0] back to their corresponding words, joined with a space between them.
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


'<sos> this film was just brilliant casting location scenery story'
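
The same decoding logic can be wrapped in a small helper. The sketch below uses a hypothetical four-word index (`toy_word_index`) in place of the real one so that it runs without downloading anything; the `decode` function and its fallback to `<unk>` for IDs missing from the mapping are illustrative additions, not part of the notebook.

```python
# Hypothetical miniature word index (assumption), shaped like the result of
# get_word_index(): word -> 1-based frequency rank.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

SPECIAL_TOKENS = ("<pad>", "<sos>", "<unk>")

# Shift every rank by the number of reserved IDs (3), then add the reserved IDs.
id_to_token = {rank + len(SPECIAL_TOKENS): word for word, rank in toy_word_index.items()}
id_to_token.update(enumerate(SPECIAL_TOKENS))

def decode(ids):
    """Turn a sequence of integer IDs back into text; unmapped IDs become <unk>."""
    return " ".join(id_to_token.get(i, "<unk>") for i in ids)

print(decode([1, 4, 5, 6, 7, 2, 99]))
```

Using `dict.get` with a default avoids a `KeyError` if a sequence ever contains an ID that is absent from the mapping.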

## Downloading original IMDb reviews as text (byte strings)
Starting from the raw text is helpful if you want to deploy the model to a mobile device or a web browser and do not want to write a different preprocessing function every time. In that case, handle the preprocessing using only TensorFlow operations, so that it can be included in the model itself.

In [4]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples

# The as_supervised=True ensures the dataset is returned in a tuple format (input, label), where input is the movie review text and label is sentiment (positive or negative). The with_info=True returns additional information.
# By accessing the "train" split and retrieving the num_examples attribute, you obtain the total number of training examples.

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\coolg\tensorflow_datasets\imdb_reviews\plain_text\1.0.0...

Dataset imdb_reviews downloaded and prepared to C:\Users\coolg\tensorflow_datasets\imdb_reviews\plain_text\1.0.0. Subsequent calls will reuse this data.


## Creating Preprocessing function for word distinctions

In [5]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)    # Keep the first 300 bytes of each input string (300 characters for plain ASCII text) and discard the rest.
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")    # replace any occurrences of the HTML tag "<br>" (ie line breaks) with a space in the input text.
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")    # replace any characters that are not letters or apostrophes (like special characters or punctuation marks) with a space in the input text.
    X_batch = tf.strings.split(X_batch)     # Split each string in X_batch into a list of individual words. The resulting tensor contains variable-length sequences of words for each input string.
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch   # Convert X_batch tensor of word sequences into a dense tensor using to_tensor. The default_value argument sets the default value for any missing elements in the tensor to be <pad>, indicating padding, so they all have the same length.
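
To see what each step of `preprocess()` does to a single review, the following pure-Python sketch mirrors the same pipeline with the standard `re` module. The helper names `preprocess_text` and `pad_batch` and the sample strings are hypothetical; the TensorFlow version above remains the one used by the notebook.

```python
import re

def preprocess_text(review: bytes, max_bytes: int = 300) -> list:
    """Mirror of preprocess() for one byte string, using only the standard library."""
    review = review[:max_bytes]                     # keep the first max_bytes bytes
    review = re.sub(rb"<br\s*/?>", b" ", review)    # <br>, <br/>, <br /> become a space
    review = re.sub(rb"[^a-zA-Z']", b" ", review)   # anything but letters/apostrophes becomes a space
    return review.split()                           # split on whitespace into words

def pad_batch(token_lists, pad_token=b"<pad>"):
    """Pad every word list to the length of the longest one in the batch."""
    longest = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (longest - len(tokens)) for tokens in token_lists]

# Hypothetical sample reviews (assumption).
batch = [b"Great movie!<br />Loved it, didn't you?", b"Meh."]
padded = pad_batch([preprocess_text(review) for review in batch])
for row in padded:
    print(row)
```

Because truncation happens before the tag replacement, a `<br />` tag that straddles the cut-off is not recognized as a tag: `preprocess_text(b"ab<br />cd", max_bytes=5)` yields the words `ab` and `br`.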

## Constructing Vocabulary for the model
This requires going through the whole training set once, applying our preprocess() function, and using a Counter to count the number of occurrences of each word.

In [6]:
from collections import Counter
vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

# Iterate over batches of preprocessed input X_batch and corresponding output y_batch from the training split of the dataset. The batch(32) method batches the dataset into groups of 32 examples, and the map(preprocess) method applies the preprocess function to each batch.
# Last line converts each review tensor to a NumPy array using review.numpy(), then converts it to a list. The vocabulary.update() method updates the counts in the vocabulary Counter with the words in the review.

In [7]:
# three most common words:

vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [8]:
# truncating the vocabulary, keeping only the 10,000 most common words

vocab_size = 10000
truncated_vocabulary = [ word for word, count in vocabulary.most_common()[:vocab_size]]
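
The counting and truncation steps can be tried on a tiny example. In this sketch the padded batches in `toy_batches` are made up for illustration, and `most_common(n)` is used as a shorthand for `most_common()[:n]`. It also shows a side effect visible in the output above: because `<pad>` is the most frequent token, it ends up first in the truncated vocabulary.

```python
from collections import Counter

# Hypothetical padded batches (assumption): each review is a list of byte-string words.
toy_batches = [
    [[b"a", b"good", b"film"], [b"good", b"<pad>", b"<pad>"]],
    [[b"a", b"bad", b"<pad>"], [b"good", b"<pad>", b"<pad>"]],
]

toy_vocabulary = Counter()
for X_batch in toy_batches:
    for review in X_batch:
        toy_vocabulary.update(review)   # padding tokens are counted like any other word

toy_vocab_size = 3
toy_truncated = [word for word, count in toy_vocabulary.most_common(toy_vocab_size)]

# Position in the truncated list becomes the word's ID.
toy_word_to_id = {word: index for index, word in enumerate(toy_truncated)}
print(toy_truncated)
print(toy_word_to_id)
```

Words outside the top `toy_vocab_size` entries ("film" and "bad" here) are dropped from the vocabulary.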

## Replacing words with their indices in the vocabulary and handling OOV words

In [9]:
# Replace each word with its ID (i.e., its index in the vocabulary). Create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets.

words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)  # Creates a TensorFlow range tensor called word_ids.
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)   # Initializes the key-value pairs for the vocabulary lookup table. The words tensor serves as the keys, representing the words, while the word_ids tensor serves as the values, representing the corresponding integer IDs for each word.
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)    # The StaticVocabularyTable provides a mapping from words to their integer IDs, using the provided vocabulary initializer and OOV bucket information

In [10]:
# looking up the IDs of a few words:

table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]], dtype=int64)>
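
The last ID in the output above is at least 10,000 because the word is not in the vocabulary and was assigned to one of the OOV buckets. The sketch below imitates that behaviour in pure Python. It is only an approximation under stated assumptions: `make_lookup` is a hypothetical helper, and CRC32 is used as a stand-in hash, so the bucket numbers will not match the ones TensorFlow computes with its own hash function.

```python
import zlib

def make_lookup(vocabulary, num_oov_buckets):
    """Build a word -> ID function with hashed out-of-vocabulary buckets."""
    word_to_id = {word: index for index, word in enumerate(vocabulary)}

    def lookup(word: bytes) -> int:
        if word in word_to_id:
            return word_to_id[word]
        # Stand-in hash (assumption): CRC32 instead of TensorFlow's own hash.
        bucket = zlib.crc32(word) % num_oov_buckets
        return len(vocabulary) + bucket   # OOV IDs start right after the vocabulary

    return lookup

# Hypothetical three-word vocabulary with 5 OOV buckets (assumption).
lookup = make_lookup([b"<pad>", b"the", b"a"], num_oov_buckets=5)
ids = [lookup(word) for word in b"the a faaaaaantastic".split()]
print(ids)
```

Known words keep their vocabulary index, while every unknown word lands deterministically in the range from `len(vocabulary)` to `len(vocabulary) + num_oov_buckets - 1`. Distinct unknown words can collide in the same bucket; using more buckets makes such collisions less likely.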

## Creating the training set (using the table)

In [11]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

# Using the vocabulary table to perform the word encoding. The table.lookup() function is used to look up the integer IDs for each word in the input batch X_batch. The function returns the encoded input batch and the original output batch y_batch.

In [12]:
embed_size = 128    # size of the embedding vector / dimensionality 
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
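
As a rough illustration of the first layer of the model, the sketch below treats an embedding as a table lookup: each word ID selects one row of a trainable matrix. The toy table, its random initial values, and the `embed` helper are hypothetical; only `vocab_size`, `num_oov_buckets`, and `embed_size` are taken from the notebook.

```python
import random

# Sizes taken from the model above.
vocab_size, num_oov_buckets, embed_size = 10000, 1000, 128
num_rows = vocab_size + num_oov_buckets      # one row per word ID, including OOV buckets
embedding_params = num_rows * embed_size     # number of trainable weights in the table
print(embedding_params)

# Hypothetical toy table (assumption): 6 IDs, 4 dimensions, small random values.
random.seed(0)
toy_rows, toy_dim = 6, 4
toy_table = [[random.uniform(-0.05, 0.05) for _ in range(toy_dim)] for _ in range(toy_rows)]

def embed(ids):
    """Replace each word ID with the corresponding row of the table."""
    return [toy_table[i] for i in ids]

vectors = embed([0, 3, 3, 5])
print(len(vectors), len(vectors[0]))
```

The table needs `vocab_size + num_oov_buckets` rows because the lookup table can emit any ID in that range, and repeated IDs always map to the same vector.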
