#Movie Review Dataset

We'll start by loading the IMDB movie review dataset from Keras. This dataset contains 25,000 reviews from IMDB, each of which is already preprocessed and labeled as either positive or negative. Each review is encoded as a list of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.
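To make the frequency-rank encoding concrete, here is a minimal sketch in plain Python. The toy corpus and the names `reviews`, `toy_word_index` and `encode` are made up for illustration and are not part of Keras. The real dataset also reserves a few low integers for special tokens, which this sketch ignores.

```python
from collections import Counter

# Toy corpus standing in for the review texts (hypothetical data).
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
    "the end",
]

# Count every word, then order the words from most to least frequent.
counts = Counter(word for review in reviews for word in review.split())
ranked = [word for word, _ in counts.most_common()]

# Rank 1 is the most common word, rank 2 the next most common, and so on.
toy_word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review):
    """Replace each word with its frequency rank."""
    return [toy_word_index[word] for word in review.split()]

print(toy_word_index["the"])          # 1, because "the" appears most often
print(encode("the movie was great"))  # [1, 3, 2, 4]
```

Common words end up with small integers, which is why limiting the vocabulary to the first N integers keeps the N most frequent words.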

In [None]:
%tensorflow_version 2.x
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

In [None]:
# Let's look at one review
train_data[0]

#More Preprocessing
If we look at some of the loaded reviews, we'll notice that they have different lengths. This is an issue, because we cannot pass data of varying lengths into our neural network. Therefore we must make each review the same length. To do this we will follow the procedure below:


*   if the review is longer than 250 words, trim off the extra words
*   if the review is shorter than 250 words, add as many 0's as needed to make its length equal to 250.

Luckily, Keras has a function that does this for us:



In [None]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
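To see what this does to a single review, here is a plain Python sketch of the same idea. The helper `pad_or_trim` is hypothetical. It assumes the default behaviour of `pad_sequences`, where zeros are added at the front and over-long inputs lose their earliest tokens.

```python
def pad_or_trim(tokens, maxlen, pad_value=0):
    """Return a list of exactly maxlen integers.

    Assumes front padding and front truncation, which is how the
    default arguments of pad_sequences behave.
    """
    if len(tokens) >= maxlen:
        return tokens[-maxlen:]  # keep only the last maxlen tokens
    return [pad_value] * (maxlen - len(tokens)) + tokens

print(pad_or_trim([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_or_trim([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

Padding at the front keeps the real words at the end of the sequence, closest to the final LSTM state that feeds the classifier.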

#Creating the Model
Now it's time to create the model. We'll use a word embedding layer as the first layer in our model and add an LSTM layer afterwards that feeds into a single dense node to get our predicted sentiment.
The 32 passed to the embedding layer is the output dimension of the vectors it generates. We can change this value if we'd like.

In [None]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(VOCAB_SIZE, 32),
                             tf.keras.layers.LSTM(32),
                             tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
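The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the helper function names are made up for illustration.

```python
def embedding_params(vocab_size, output_dim):
    # one vector of length output_dim per word in the vocabulary
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # one weight per input per unit, plus one bias per unit
    return input_dim * units + units

embedding = embedding_params(88584, 32)
lstm = lstm_params(32, 32)
dense = dense_params(32, 1)

print(embedding)                 # 2834688
print(lstm)                      # 8320
print(dense)                     # 33
print(embedding + lstm + dense)  # 2843041
```

Almost all of the parameters sit in the embedding layer, so shrinking the vocabulary is the quickest way to shrink this model.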


#Training
Now it's time to compile and train the model.

In [None]:
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=['acc']
              )
history = model.fit(train_data,
                    train_labels,
                    epochs=10,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Next we'll evaluate the model on our test data to see how well it performs.

In [None]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.42810410261154175, 0.8586400151252747]


So we're at roughly 86% accuracy on the test data. Not bad for a simple recurrent network.

#Making Predictions
Now let's use our network to make predictions on our own reviews.
Since our reviews are encoded, we'll need to convert any review that we write into the same form so the network can understand it. To do that, we'll load the encodings from the dataset and use them to encode our own data.

In [None]:
word_index = imdb.get_word_index()

def encode_text(text):
  tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
  tokens = [word_index[word] if word in word_index else 0 for word in tokens]
  return sequence.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [None]:
# While we're at it let's make a decode function

reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
  PAD = 0
  text = ""
  for num in integers:
    if num != PAD:
      text += reverse_word_index[num] + " "
    
  return text[:-1]

print(decode_integers(encoded))

that movie was just amazing so amazing


In [None]:
# Now it's time to make a prediction

def predict(text):
  encoded_text = encode_text(text)
  pred = np.zeros((1, MAXLEN))  # a batch containing one review of length MAXLEN
  pred[0] = encoded_text
  result = model.predict(pred)
  print(result[0])

positive_review = "That movie was so awesome! I really loved it and would watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)

[0.88070893]
[0.22345333]
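The sigmoid output is a single number between 0 and 1, and in this dataset the label 1 means positive and 0 means negative. A minimal sketch of turning the score into a label is shown below. The helper `label_from_score` and the 0.5 threshold are assumptions for illustration, not something the notebook defines.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output to a human-readable sentiment label."""
    return "positive" if score >= threshold else "negative"

# The two scores printed above
for score in (0.88070893, 0.22345333):
    print(f"{score:.2f} -> {label_from_score(score)}")
# 0.88 -> positive
# 0.22 -> negative
```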


#RNN Play Generator
Now it's time for one of the coolest examples we've seen so far. We are going to use an RNN to generate a play. We will simply show the RNN an example of something we want it to recreate and it will learn how to write a version of it on its own. We'll do this using a character predictive model that takes a variable length sequence as input and predicts the next character. We can call the model many times in a row, feeding the output of the last prediction in as the input for the next call, to generate a sequence.

This guide is based on the following: http://www.tensorflow.org/tutorials/text/text_generation

In [None]:
%tensorflow_version 2.x
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

##Dataset
For this example we only need one piece of training data. In fact, we can write our own poem or play and pass that to the network for training if we'd like. However, to make things easy we'll use an extract from a Shakespeare play.

In [None]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt','https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

##Loading Your Own Data
To load your own data you'll need to upload a file from the dialog below. Then you'll need to follow the steps from above but load in this new file instead.

In [None]:
# from google.colab import files
# path_to_file = list(files.upload().keys())[0]

##Read Contents of File
Let's look at the contents of the file.

In [None]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [None]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



##Encoding
Since this text isn't encoded yet we'll need to do that ourselves. We are going to encode each unique character as a different integer.

In [None]:
vocab = sorted(set(text))
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [None]:
# let's look at how part of our text is encoded
print("Text:", text[:13])
print("Encoded:", text_to_int(text[:13]))

Text: First Citizen
Encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]


And here we will make a function that can convert our numeric values back to text.

In [None]:
def int_to_text(ints):
  try:
    ints = ints.numpy()
  except AttributeError:  # ints is already a NumPy array (or a list)
    pass
  return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


##Creating Training Examples
Remember our task is to feed the model a sequence and have it return to us the next character. This means we need to split our text data from above into many shorter sequences that we can pass to the model as training examples.

The training examples we will prepare will use a *seq_length* sequence as input and a *seq_length* sequence as the output, where the output is the input sequence shifted one character to the right. For example:

`input: Hell | output: ello`

Our first step will be to create a stream of characters from our text data.

In [None]:
seq_length = 100 # length of sequence for a training example
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

Next we can use the batch method to turn this stream of characters into batches of desired length.

In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
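To illustrate what batching with `drop_remainder=True` does to the character stream, here is a plain Python sketch. The helper `chunk` is hypothetical and only mimics the behaviour on an ordinary string.

```python
def chunk(items, size):
    """Split items into consecutive chunks of exactly `size` elements.

    Any shorter tail is dropped, like drop_remainder=True.
    """
    full_chunks = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(full_chunks)]

toy_text = "Hello world"   # 11 characters
print(chunk(toy_text, 5))  # ['Hello', ' worl'], the trailing 'd' is dropped

# For the Shakespeare text: 1,115,394 characters in chunks of 101
print(1115394 // 101)      # 11043 full sequences
```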

Now we need to use these sequences of length 101 and split them into input and output.

In [None]:
def split_input_target(chunk): # for the example: hello
  input_text = chunk[:-1] # hell
  target_text = chunk[1:] # ello
  return input_text, target_text #hell, ello

dataset = sequences.map(split_input_target) # we use map to apply the above function to every entry

In [None]:
for x, y in dataset.take(2):
  print("\n\nEXAMPLE\n")
  print("INPUT")
  print(int_to_text(x))
  print("\nOUTPUT")
  print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k


Finally we need to make training batches.

In [None]:
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab) # VOCAB_SIZE is the number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).

BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
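The buffer-based shuffling described in the comment above can be sketched in plain Python. This is an illustration of the idea only, not TensorFlow's implementation, and `buffered_shuffle` is a made-up name.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield the items of stream in a partially shuffled order.

    Only buffer_size items are held in memory at any time.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

With a buffer smaller than the dataset the shuffle is only approximate, since an item can never be emitted before it has entered the buffer. Here `BUFFER_SIZE = 10000` is close to the total number of sequences, so the shuffle is nearly uniform.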

##Building the Model
Now it is time to build the model. We will use an embedding layer, an LSTM, and one dense layer that contains a node for each unique character in our training data. The dense layer gives us a score for every character, which defines a probability distribution over the next character.

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                                         batch_input_shape=[batch_size, None]),
                               tf.keras.layers.LSTM(rnn_units,
                                                    return_sequences=True,
                                                    stateful=True,
                                                    recurrent_initializer='glorot_uniform'),
                               tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (64, None, 256)           16640     
                                                                 
 lstm_5 (LSTM)               (64, None, 1024)          5246976   
                                                                 
 dense_4 (Dense)             (64, None, 65)            66625     
                                                                 
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


##Creating a Loss Function
Now we are going to create our own loss function for this problem. This is because our model outputs a (64, sequence_length, 65) shaped tensor that holds, for every sequence in the batch and every timestep, one unnormalized score (logit) per character. Together these scores define the probability distribution over the next character.
However, before we do that, let's have a look at a sample input and the output from our untrained model, so we can understand what the model is actually giving us.

In [None]:
for input_example_batch, target_example_batch in data.take(1):
  example_batch_predictions = model(input_example_batch) # ask our model for a prediction on our first batch of training data
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)") # print out the output shape

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [None]:
# we can see that the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)

In [None]:
# let's examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
# notice this is a 2d array of length 100, where each interior array is the prediction for the next character at each time step

In [None]:
# and finally we look at a prediction at the first timestep
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
# and of course it's 65 values, one score per character, indicating how likely each is to occur next

In [None]:
# If we want to determine the predicted character we need to sample the output distribution (pick a value based on probabilities)
sampled_indicies = tf.random.categorical(pred, num_samples=1)

# now we can reshape that array and convert all the integers to characters to see the actual text
sampled_indicies = np.reshape(sampled_indicies, (1, -1))[0]
predicted_chars = int_to_text(sampled_indicies)

predicted_chars # and this is what the model predicted for training sequence 1

"&FNeN!$Q:XTh&'H?HjFq:grfhSeg&Kke:,vO:;r.RyaUAeIvVtMEz,LYI,gsff!fp?LLEOZle-\nrMLRQWwWOmISj:n!c$NT3sK--"

So now we need to create a loss function that can compare that output to the expected output and give us a numeric value representing how close the two were.

In [None]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
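To get a feel for what this loss computes at a single timestep, here is a plain Python sketch of sparse categorical cross-entropy from logits: the negative log of the softmax probability assigned to the correct character. The function name `sparse_ce_from_logits` and the toy logits are made up for illustration.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one timestep: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# A model with no preference among 65 characters
uniform = [0.0] * 65
print(sparse_ce_from_logits(3, uniform))    # ln(65), about 4.174

# A model that strongly favours the correct character
confident = [0.0] * 65
confident[3] = 10.0
print(sparse_ce_from_logits(3, confident))  # close to 0
```

A freshly initialised model behaves roughly like the uniform case, so a starting loss near ln(65) is a reasonable sanity check.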

##Compiling the Model
At this point we can think of our problem as a classification problem where the model predicts the probability of each unique character coming next.

In [None]:
model.compile(optimizer='adam',
              loss=loss)

##Creating Checkpoints
Now we are going to set up and configure our model to save checkpoints as it trains. This will allow us to load our model from a checkpoint and continue training it.

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

##Training
Finally, we will start training the model.

In [None]:
history = model.fit(data, epochs=50, callbacks=[checkpoint_callback])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


##Loading the Model
We'll rebuild the model from a checkpoint using a batch_size of 1 so that we can feed one piece of text to the model and have it make a prediction.

In [None]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

Once the model has finished training, we can find the **latest checkpoint** that stores the model's weights using the following line.

In [None]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1,None]))

We can load **any checkpoint** we want by specifying the exact file to load.

In [None]:
checkpoint_num = 10
# pass the checkpoint path directly; tf.train.latest_checkpoint expects a directory, not a file prefix
model.load_weights(os.path.join(checkpoint_dir, "ckpt_" + str(checkpoint_num)))
model.build(tf.TensorShape([1,None]))

##Generating Text
Now we can use the lovely function provided by TensorFlow to generate some text using any starting string we'd like.

In [None]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 800

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting
  temperature = 1.2

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    #using a categorical distribution to predict the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
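The effect of the temperature can be seen without the model. The sketch below divides some toy logits by a temperature, turns them into probabilities with a softmax, and samples an index. The helper names and the logits are made up for illustration; in the function above, `tf.random.categorical` does the sampling directly on the scaled logits.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng=random):
    """Draw one index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_index(toy_logits, temperature=1.2))
```

Lower temperatures push most of the probability onto the highest-scoring character, while higher temperatures flatten the distribution and make unlikely characters more frequent.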

In [None]:
inp = input("Type a starting string: ")
print(generate_text(model, inp))

Type a starting string: romeo
romeous more resportural.
Spread one that spoils hath set the dep,
And follows can affos bear the scale.

FRIAR LAURENCE:
Go bear thee hence, to gross too summer ancate
complaint thence to be pitient.

Second Keeper:
Why,
she was revell'd for Antia,
My birth in shape the boat of ll ase:
Are you so vile to may not I could injured in
common vials posherits to our prayer
Right have it byou revell'd.

MENENIUS:
That's worthy mind to more again.

Be fight, and as you like your head, and wash me we. Let us butch,
And that you should entreat me sort, content I:
Nay, dell we know your worship. For him?
There's some last that hath committed to my moulty.

GREY:
I tear his majesty!

HENRY BOLINGBROKE:
Of much less value is my case,
My overfeign, a meriat;
And being that, to stage, my sorrows unto joy!
Be
