Import the necessary packages:

In [1]:
import tensorflow as tf
from tensorflow import keras
from collections import Counter


In [2]:
tf.random.set_seed(42)

Load the data from TensorFlow Datasets and show the available splits:

In [3]:
import tensorflow_datasets as tfds
datasets, info = tfds.load('imdb_reviews', as_supervised=True, with_info=True)
datasets.keys()

dict_keys(['test', 'train', 'unsupervised'])

Compute the size of each split:

In [4]:
train_size = info.splits['train'].num_examples
test_size = info.splits['test'].num_examples
unsupervised_size = info.splits['unsupervised'].num_examples

In [5]:
train_size, test_size, unsupervised_size

(25000, 25000, 50000)

So the IMDB dataset is composed of 25,000 reviews for training, 25,000 for testing and 50,000 unlabeled reviews, which we can use, for example, to generate predictions.

Before starting to preprocess the data, let's look at two reviews and their labels from each of the train, test and unsupervised datasets (the unsupervised split has no real labels, so every label there is -1):

In [6]:
for X_batch, y_batch in datasets['train'].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: This is a big step down after the surprisingly enjoyable original. This sequel isn't nearly as fun as part one, and it instead spends too much time on plot development. Tim Thomerson is still the best ...
Label: 0 = Negative

Review: Perhaps because I was so young, innocent and BRAINWASHED when I saw it, this movie was the cause of many sleepless nights for me. I haven't seen it since I was in seventh grade at a Presbyterian schoo ...
Label: 0 = Negative



In [7]:
for X_batch, y_batch in datasets['test'].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: It opens with your cliche overly long ship flying through space. All I could think at this point was "Spaceballs" and hoping there'd be a sticker on back that said "We break for Nobody." The movie the ...
Label: 1 = Positive

Review: I remember seeing this at my local Blockbuster and picked it up cause I was curious. I liked movies about mythological creatures. I like movies about werewolves, vampires, zombies, etc. This is based  ...
Label: 0 = Negative



In [8]:
for X_batch, y_batch in datasets['unsupervised'].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        # the unsupervised split carries the placeholder label -1, which is truthy,
        # so it must be checked before the positive/negative test
        print("Label:", label, "= Unlabeled" if label == -1 else "= Positive" if label else "= Negative")
        print()

Review: I'm baffled by the number of people who actually liked this movie. How anyone can watch this crud is beyond me. Aside from the blatant attempt to cash in on the popularity of Arthurian legend (this st ...
Label: -1 = Unlabeled

Review: Well this is a real mess of a film. It's worthy of watching at a drive-in, where you don't have to pay much attention to the plot because you are too busy ... doing other things. Watch it on DVD, howe ...
Label: -1 = Unlabeled



First, we define a function that preprocesses a batch of text reviews:

In [9]:
def preprocess(X_batch, y_batch):
    # truncate the reviews to the first 300 bytes (tf.strings.substr counts bytes by default)
    X_batch = tf.strings.substr(X_batch, 0, 300)
    # replace <br /> and any other characters other than letters and quotes with spaces
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    # split the review by the spaces
    X_batch = tf.strings.split(X_batch)
    # returns a dense tensor, with all the reviews padded so they all have the same length
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
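
To make the effect of these steps easier to see, here is a minimal plain-Python sketch of the same idea applied to ordinary strings. It is only an analogue of the TensorFlow version: `preprocess_py`, `pad_batch` and the sample review are hypothetical names and data introduced for illustration, and Python slicing counts characters rather than bytes.

```python
import re

def preprocess_py(review, max_chars=300):
    """Plain-Python analogue of preprocess() for a single review string."""
    text = review[:max_chars]                  # truncate
    text = re.sub(r"<br\s*/?>", " ", text)     # drop <br /> tags
    text = re.sub(r"[^a-zA-Z']", " ", text)    # keep only letters and quotes
    return text.split()                        # split on whitespace

def pad_batch(token_lists, pad="<pad>"):
    """Pad every token list to the length of the longest one in the batch."""
    width = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad] * (width - len(tokens)) for tokens in token_lists]

# made-up review, used only to illustrate the steps
sample = "Great movie!<br />I'd watch it again, 10/10."
tokens = preprocess_py(sample)
padded = pad_batch([tokens, ["Bad"]])
print(tokens)
print(padded[1])
```

Note that, like the original function, this sketch does not lowercase the text, so "Great" and "great" would count as different words.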

Next, we wrap the whole preprocessing pipeline in a function that can later be applied to any dataset (train, test, unsupervised):

In [16]:
vocab_size = 10000
num_oov_buckets = 1000
def function(datasets, split):
    # initialize a Counter object
    vocabulary = Counter()
    # go through the whole split, apply the preprocess function and count word occurrences
    for X_batch, y_batch in datasets[split].batch(32).map(preprocess):
        for review in X_batch:
            vocabulary.update(list(review.numpy()))
    # keep only the vocab_size most common words
    truncated_vocabulary = [
        word for word, count in vocabulary.most_common(vocab_size)]
    # map each word to its ID (its index in the vocabulary)
    words = tf.constant(truncated_vocabulary)
    words_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
    vocab_init = tf.lookup.KeyValueTensorInitializer(words, words_ids)
    # words outside the vocabulary are hashed into num_oov_buckets extra IDs
    table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)
    # encode the words
    def encode_words(X_batch, y_batch):
        return table.lookup(X_batch), y_batch
    # repeat, batch, preprocess, encode and prefetch the reviews
    data_set = datasets[split].repeat().batch(32).map(preprocess)
    data_set = data_set.map(encode_words).prefetch(1)
    # return the prepared dataset
    return data_set
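
The vocabulary-plus-out-of-vocabulary lookup used above can be sketched in plain Python. This is an illustration under stated assumptions, not what the TensorFlow table does internally: `build_vocab` and `encode` are hypothetical helpers, the toy reviews are made up, and `zlib.crc32` stands in for whatever hash the lookup table really uses.

```python
from collections import Counter
import zlib

def build_vocab(token_lists, vocab_size):
    """Map the vocab_size most common words to IDs 0..vocab_size-1."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(vocab_size))}

def encode(tokens, vocab, num_oov_buckets):
    """Known words get their vocabulary ID; unknown words are hashed into
    one of num_oov_buckets extra IDs placed after the vocabulary."""
    ids = []
    for tok in tokens:
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            # crc32 is an assumed stand-in hash, chosen because it is deterministic
            ids.append(len(vocab) + zlib.crc32(tok.encode("utf-8")) % num_oov_buckets)
    return ids

# toy data, for illustration only
reviews = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab = build_vocab(reviews, vocab_size=2)
encoded = encode(["good", "movie", "unseen"], vocab, num_oov_buckets=3)
print(vocab)
print(encoded)
```

This is also why the embedding layer later needs `vocab_size + num_oov_buckets` rows: every ID the lookup can produce must have its own embedding vector.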

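Now we can apply this whole preprocessing pipeline to our datasets. Keep in mind that each call builds its vocabulary from the split it receives, so word IDs are not guaranteed to match between the train, test and unsupervised sets: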

In [17]:
train_set = function(datasets, 'train')
test_set = function(datasets, 'test')
unsupervised_set = function(datasets, 'unsupervised')
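
Because `function` counts words separately for every split it is given, the same word may end up with a different ID in each prepared dataset. The plain-Python sketch below illustrates the effect on toy data and shows one possible alternative: building the vocabulary once on the training tokens and reusing it. `build_vocab` and the token lists are hypothetical and only serve the illustration.

```python
from collections import Counter

def build_vocab(token_lists, vocab_size):
    """Map the vocab_size most common words to IDs 0..vocab_size-1."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(vocab_size))}

# toy data: "film" is the second most common word in one split
# and the most common word in the other
train_tokens = [["great", "film"], ["great", "cast"], ["great", "film"]]
test_tokens = [["film", "boring"], ["film", "dull"], ["film", "boring"]]

# one vocabulary per split: IDs disagree
train_vocab = build_vocab(train_tokens, vocab_size=3)
test_vocab = build_vocab(test_tokens, vocab_size=3)
print(train_vocab["film"], test_vocab["film"])

# shared vocabulary: encode the test tokens with the training vocabulary,
# mapping unknown words to a single extra ID for simplicity
oov_id = len(train_vocab)
shared_ids = [[train_vocab.get(tok, oov_id) for tok in tokens] for tokens in test_tokens]
print(shared_ids[0])
```

A model trained on IDs from one vocabulary sees unrelated embeddings when fed IDs from another, so in practice the lookup table would normally be built from the training split only and reused for every other split.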

Finally, we can create the model and train it:

In [26]:
embed_size = 128
model = keras.models.Sequential([
    # convert word IDs into embeddings
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, input_shape=[None]),
    # three stacked GRU layers
    keras.layers.GRU(128, dropout=0.2, return_sequences=True),
    keras.layers.GRU(128, dropout=0.2, return_sequences=True),
    keras.layers.GRU(128, dropout=0.2),
    # dense layer with sigmoid activation to output the estimated probabilities
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)

Train for 781 steps
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
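
The 781 steps reported above follow from the integer division `train_size // 32`. A small sketch of the arithmetic, with the sizes hard-coded from the values printed earlier (the variable names are illustrative):

```python
import math

train_size = 25000   # number of training reviews reported by TFDS
batch_size = 32      # batch size used when preparing the dataset

steps_floor = train_size // batch_size             # steps per epoch as used above
leftover = train_size - steps_floor * batch_size   # reviews not covered by those steps
steps_ceil = math.ceil(train_size / batch_size)    # steps needed to see every review once

print(steps_floor, leftover, steps_ceil)
```

Since the dataset is repeated before it is batched, the few reviews left over at the end of one pass are not lost; they simply show up at the start of the next epoch's stream.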


Once the model is trained, we can evaluate it on the test set to check for overfitting:

In [25]:
model.evaluate(test_set, steps=test_size // 32)



[2.781618276265153, 0.49955985]

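The two values are the loss and the accuracy. An accuracy of about 0.50 is chance level for this balanced binary task, so the model as evaluated here does not generalize to the test set.

As a final exercise, we can make predictions on the unsupervised dataset: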

In [14]:
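# the dataset repeats forever, so the number of prediction steps must be bounded
predictions = model.predict(unsupervised_set, steps=unsupervised_size // 32)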
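
Each prediction is a single sigmoid output, which can be read as the estimated probability that the review is positive. A minimal sketch of turning such probabilities into class labels, assuming a threshold of 0.5; `to_labels` and the sample probabilities are hypothetical and not taken from the model:

```python
def to_labels(probabilities, threshold=0.5):
    """Convert positive-class probabilities into 'Positive'/'Negative' labels."""
    return ["Positive" if p >= threshold else "Negative" for p in probabilities]

# made-up probabilities, for illustration only
sample_probs = [0.12, 0.5, 0.93]
labels = to_labels(sample_probs)
print(labels)
```

With the real `predictions` array, which has one row per review and a single column, the values would first need to be flattened, for example with `predictions.ravel()`.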

We can display a few reviews together with their predicted probability of being positive, to get a feel for how the model generalizes:

In [15]:
for X_batch, y_batch in datasets['unsupervised'].batch(5).take(1):
    for review, label in zip(X_batch.numpy(), predictions[:5]):
        print("Review:", review.decode("utf-8")[:200], "...")
        print('Label:', label)
        print('------------------------------')

Review: I'm baffled by the number of people who actually liked this movie. How anyone can watch this crud is beyond me. Aside from the blatant attempt to cash in on the popularity of Arthurian legend (this st ...
Label: [0.33107972]
------------------------------
Review: Well this is a real mess of a film. It's worthy of watching at a drive-in, where you don't have to pay much attention to the plot because you are too busy ... doing other things. Watch it on DVD, howe ...
Label: [0.7999851]
------------------------------
Review: During the past six months or so, I have intentionally sat and watched hordes of terrible movies, in search of the most entertainingly godawful ones I can. I've probably been through about 50 or so. R ...
Label: [0.65163875]
------------------------------
Review: A story about a grown-up pair of Siamese twin brothers - one does wonder. "Twin Falls Idaho" is a quiet yet possibly disturbing film (only if one is uncomfortable with the idea of looking at a pair of 