# Setup

In [None]:
!pip install tensorflow --upgrade

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

# Load the TensorBoard notebook extension.
%load_ext tensorboard

In [2]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Preprocessing

In [3]:
data = keras.datasets.imdb

### Vocabulary Length

Parameter `num_words` sets the size of the vocabulary: only the `num_words` most frequent words in the corpus keep their own index, and rarer words are replaced by the out-of-vocabulary index (`2` by default) rather than removed. The vocabulary size should match the input dimension of the model's first layer (the `Embedding` layer below) unless additional work on the vocabulary will be performed prior to submitting training data to the model.
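As a rough illustration of the effect of `num_words`, the sketch below caps a sequence of word indexes in plain Python. The helper `cap_vocabulary` and its `oov_index` argument are hypothetical names chosen for this example. It assumes that indexes are ranked by frequency and that out-of-range indexes are replaced rather than dropped.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every index >= num_words with oov_index (hypothetical helper)."""
    return [i if i < num_words else oov_index for i in sequence]


# A few indexes taken from the first training review shown in this notebook.
sample = [1, 14, 22, 22665, 9, 31050]

# With a 10,000-word vocabulary, the two rare words collapse to the OOV index.
capped = cap_vocabulary(sample, num_words=10000)
print(capped)
```

With `num_words=88000`, as used in this notebook, the same sample would pass through unchanged because every index is below the cap.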

In [4]:
(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words=88000)

Print the word indexes for the first review in `train_data`.

In [5]:
print(type(train_data[0]))
print(train_data[0])

<class 'list'>
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


## Determine how classification is expressed.

Classification is expressed as `0` (negative) or `1` (positive). Values are stored in a NumPy array (`numpy.ndarray`).

In [6]:
print(type(train_labels))
print(train_labels[0:25])

<class 'numpy.ndarray'>
[1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 1]


## Determine the lengths of the train and test datasets.
There are 25,000 samples in each dataset.

In [7]:
print("Length of training data", len(train_data))
print("Length of testing data", len(test_data))

Length of training data 25000
Length of testing data 25000


## Get word indexes.

Function `get_word_index()` returns a `dict` mapping words in the vocabulary to integers.

In [8]:
word_index = data.get_word_index()
print(type(word_index))
print(len(word_index))

<class 'dict'>
88584


The dataset uses additional tags, which are not stored in `word_index`. To accommodate the additional tags, shift the index up by three.

In [9]:
word_index = {k:(v+3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

In [10]:
reverse_word_index = {value: key for key, value in word_index.items()}

In [11]:
def decode_review(text):
    return " ".join([reverse_word_index.get(i, "?") for i in text])

The decoded text would be nonsense if the word indexes had not been shifted up.

In [12]:
print(decode_review(test_data[0]))

<START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss
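To see why the shift matters, here is a toy sketch with a four-word vocabulary. The dictionary `raw_index` and the encoded review are made-up stand-ins, not values from the IMDB dataset. The only assumption carried over from the notebook is that the dataset offsets every word index by three to make room for the tags.

```python
# Toy stand-in for the dict returned by get_word_index() (values are made up).
raw_index = {"the": 1, "film": 2, "was": 3, "great": 4}

# A review as the dataset would encode it: <START> tag, then raw index + 3 per word.
encoded = [1, 4, 5, 6, 7]

# Reverse lookup built directly from the raw index (no shift).
unshifted = {v: k for k, v in raw_index.items()}

# Reverse lookup built after shifting by three and adding the special tags.
shifted = {v + 3: k for k, v in raw_index.items()}
shifted.update({0: "<PAD>", 1: "<START>", 2: "<UNK>", 3: "<UNUSED>"})


def decode(tokens, lookup):
    """Join the words for each token, using '?' for unknown indexes."""
    return " ".join(lookup.get(i, "?") for i in tokens)


without_shift = decode(encoded, unshifted)
with_shift = decode(encoded, shifted)
print(without_shift)  # the great ? ? ?
print(with_shift)     # <START> the film was great
```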


## Pad training data and test data so that each sentence contains 250 tokens.

Reviews contain different numbers of words, so `train_data` and `test_data` are each an `ndarray` whose elements are Python lists of unequal length. Such an array cannot be converted to a tensor when fitting the model, so every review is padded or truncated to exactly 250 tokens.

In [13]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)
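The following plain-Python sketch approximates what `pad_sequences` does to a single sequence. The helper `pad_or_truncate` is a hypothetical name and is not the Keras implementation. It assumes a positive `maxlen` and mirrors the `padding` and `truncating` options, with `truncating="pre"` as the default.

```python
def pad_or_truncate(sequence, maxlen, value=0, padding="post", truncating="pre"):
    """Return a list of exactly maxlen items (illustrative helper, not Keras code)."""
    sequence = list(sequence)
    if len(sequence) > maxlen:
        # "pre" drops tokens from the front, "post" drops them from the end.
        sequence = sequence[-maxlen:] if truncating == "pre" else sequence[:maxlen]
    fill = [value] * (maxlen - len(sequence))
    # "post" appends the fill values, "pre" prepends them.
    return sequence + fill if padding == "post" else fill + sequence


print(pad_or_truncate([5, 6, 7], maxlen=5))           # short input is padded at the end
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))  # long input loses its first tokens
```

Note that a sequence longer than `maxlen` loses tokens from its beginning unless `truncating="post"` is requested.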

# Build the model

In [14]:
model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

Display properties of the model.

In [15]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          1408000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 1,408,289
Trainable params: 1,408,289
Non-trainable params: 0
_________________________________________________________________
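The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic from the layer sizes used in this notebook. The variable names are chosen for this example only.

```python
vocab_size = 88000    # first argument of Embedding
embedding_dim = 16    # second argument of Embedding
hidden_units = 16     # units in the first Dense layer

# Embedding: one 16-dimensional vector per vocabulary index, no bias.
embedding_params = vocab_size * embedding_dim

# GlobalAveragePooling1D only averages over the sequence axis: no parameters.
pooling_params = 0

# Dense layers: one weight per (input, unit) pair plus one bias per unit.
hidden_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Nearly all of the trainable parameters sit in the embedding table, so the vocabulary size dominates the size of the model.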


In [16]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [17]:
x_val = train_data[:10000]
x_train = train_data[10000:]
y_val = train_labels[:10000]
y_train = train_labels[10000:]

## Train the model

### Uniform Sentence Length
Variables `x_train` and `y_train` must be tensors or convertible to tensors. 

If sentences have different numbers of tokens, then the result is an `ndarray` where each row is a list, and the lists are of unequal length. This structure cannot be converted to a tensor. In this case, `model.fit` throws `ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).`
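A quick check before calling `model.fit` can catch this situation early. The sketch below is a minimal example in plain Python. The helper `sequence_lengths` is a hypothetical name, and the sample data is made up.

```python
def sequence_lengths(sequences):
    """Return the set of distinct lengths found in a collection of sequences."""
    return {len(seq) for seq in sequences}


ragged = [[1, 14, 22], [1, 7]]       # unequal lengths: not convertible to a tensor
uniform = [[1, 14, 22], [1, 7, 0]]   # padded to a common length

print(sequence_lengths(ragged))              # more than one distinct length
print(len(sequence_lengths(uniform)) == 1)   # a single length means a rectangular array
```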

### Vocabulary Length
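The first parameter of `Embedding` is `input_dim`, the number of distinct token indexes the layer accepts (88000 in the model above). Every index in the training data must be smaller than `input_dim`. Otherwise, `model.fit` throws an error such as `InvalidArgumentError:  indices[363,5] = 42016 is not in [0, 10000)`, which was raised with `Embedding(10000, 16)`.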
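The constraint can be verified before training by comparing the largest token index with the embedding size. This is a minimal sketch in plain Python. The helpers `max_token_index` and `fits_embedding` are hypothetical names, and the sample sequences are made up apart from the index `42016` quoted in the error message.

```python
def max_token_index(sequences):
    """Return the largest index found in any non-empty sequence."""
    return max(max(seq) for seq in sequences if len(seq) > 0)


def fits_embedding(sequences, input_dim):
    """True when every index lies in the half-open range [0, input_dim)."""
    return max_token_index(sequences) < input_dim


sample = [[1, 14, 22], [1, 42016, 9]]

print(fits_embedding(sample, input_dim=10000))  # False: 42016 is out of range
print(fits_embedding(sample, input_dim=88000))  # True
```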

### Multi-Processing
To use multi-processing, change the default arguments `workers=1, use_multiprocessing=False` in the call to `model.fit`. The call below keeps the defaults.

In [18]:
fit_model = model.fit(x_train, y_train, epochs=40, batch_size=512, validation_data=(x_val, y_val), verbose=1)

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In [19]:
results = model.evaluate(test_data, test_labels)



In [20]:
print("Loss, ", "Accuracy")
print(results)

Loss,  Accuracy
[0.32830213636875155, 0.87268]


# Run Predictions

In [21]:
test_review = test_data[0]

In [22]:
# Add a batch dimension so the input has shape (1, 250).
predict = model.predict(np.expand_dims(test_review, axis=0))
print("Review: ")
print(decode_review(test_review))
print("Prediction: " + str(predict[0]))
print("Actual: " + str(test_labels[0]))


Review: 
<START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

Save the model to `model.h5`, which is stored in the HDF5 binary format, then load it back.

In [23]:
model.save("model.h5")
model = keras.models.load_model("model.h5")

In [24]:
def review_encode(s):
    # 1 is the <START> tag.
    encoded = [1]
    
    for word in s:
        if word.lower() in word_index:
            encoded.append(word_index[word.lower()])
        else:
            encoded.append(2)
            
    return encoded
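`review_encode` expects a list of words, so raw text has to be split first. One possible way to do that is a regular expression, sketched below. The helper `tokenize` is a hypothetical name. It is an assumption of this example that lowercase letters, digits and apostrophes are the only characters worth keeping, which may not match how the dataset itself was tokenized.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = 'The film, "The Lion King" (1994): it\'s great.'
tokens = tokenize(sentence)
print(tokens)
```

The resulting list could then be passed to `review_encode` in place of a manually cleaned and split line.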

In [25]:
with open("lion_king_review.txt", encoding="utf-8") as f:
    for line in f.readlines():
        nline = line.replace(",", "").replace(".", "").replace("(", "").replace(")", "").replace(":", "").replace("\"", "").strip().split(" ")
        encode = review_encode(nline)
        encode = keras.preprocessing.sequence.pad_sequences([encode], value=word_index["<PAD>"], padding="post", maxlen=250)
        predict = model.predict(encode)
        print(line)
        print(encode)
        print(predict[0])
        if predict[0][0] > 0.5: 
            print("1", "positive")
        else:
            print("0", "negative")
        

Of all the animation classics from the Walt Disney Company, there is perhaps none that is more celebrated than "The Lion King." Its acclaim is understandable: this is quite simply a glorious work of art. "The Lion King" gets off to a fantastic start. The film's opening number, "The Circle of Life," is outstanding. The song lasts for about four minutes, but from the first sound, the audience is floored. Not even National Geographic can capture something this beautiful and dramatic. Not only is this easily the greatest moment in film animation, this is one of the greatest sequences in film history. The story that follows is not as majestic, but the film has to tell a story. Actually, the rest of the film holds up quite well. The story takes place in Africa, where the lions rule. Their king, Mufasa (James Earl Jones) has just been blessed with a son, Simba (Jonathan Taylor Thomas), who goes in front of his uncle Scar (Jeremy Irons) as next in line for the throne. Scar is furious, and sets

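The decoded output below shows that the model input was taken from the *end* of the review: when a sequence is longer than `maxlen`, `pad_sequences` truncates from the beginning by default (`truncating="pre"`). Passing `truncating="post"` would keep the beginning of the review instead.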

In [26]:
print(decode_review(encode[0]))

distance between the subject and the background making it seem as if the figure animation was cut and pasted on the background this is obviously what happens but it is up to the artists to make sure that it isn't noticeable there is none of that here throughout the golden age of disney animation the films have been musicals the lion king is no different and the songs are brilliant all of the numbers are standouts can you feel the love tonight won the oscar but in my opinion the circle of life was better in the cases of simba and nala simba's girlfriend both young and old there is a noticeable difference between the speaking and singing parts everyone else does their own singing and speaking but never mind it still works and that's what's important the lion king is not flawless but on first viewing they aren't noticeable and it is likely that the young won't ever notice them beauty and the beast was the first animated film to get an oscar nomination for best picture it lost to the silen