<a href="https://colab.research.google.com/github/lee00206/Tensorflow_for_beginners/blob/main/NLP_with_RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis**

## **Movie Review Dataset**
For the analysis, the IMDB movie review dataset bundled with Keras will be loaded. This dataset contains 25,000 reviews from IMDB, each already preprocessed and labeled as either positive or negative. Each review is encoded as a sequence of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.
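
The frequency-rank idea can be sketched on a toy corpus. This is only an illustration: the three reviews and the name `toy_word_index` are made up, and the real loader also reserves a few small integers for special markers, so the numbers it produces will not match this sketch exactly.

```python
from collections import Counter

# Toy corpus standing in for the review text (hypothetical data).
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]

# Count how often each word appears across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by frequency: rank 1 is the most common word.
# Ties are broken alphabetically so the result is deterministic.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
toy_word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode each review as a list of frequency ranks.
encoded_reviews = [[toy_word_index[w] for w in r.split()] for r in reviews]

print(toy_word_index)
print(encoded_reviews[0])  # [1, 4, 2, 3]
```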

In [58]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

In [59]:
VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)


In [60]:
# Look at one review
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

## **More Preprocessing**
Since reviews of different lengths cannot be passed into the neural network, the following procedure is used to make every review the same length:
* If the review is longer than 250 words, trim off the extra words
* If the review is shorter than 250 words, add the necessary number of 0's to make it 250 words long
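
The two rules above can be sketched in plain NumPy. The helper name `pad_or_trim` is hypothetical; it is meant to mirror the default behavior of `pad_sequences`, which adds padding and trims at the front of the sequence, and is not the library's actual implementation.

```python
import numpy as np

def pad_or_trim(tokens, maxlen, pad_value=0):
    """Return `tokens` as a length-`maxlen` array (hypothetical helper)."""
    tokens = list(tokens)
    if len(tokens) >= maxlen:
        # Too long: keep the last `maxlen` tokens (trim from the front).
        return np.array(tokens[-maxlen:], dtype="int32")
    # Too short: put the padding in front of the real tokens.
    padding = [pad_value] * (maxlen - len(tokens))
    return np.array(padding + tokens, dtype="int32")

print(pad_or_trim([5, 6, 7], maxlen=5))           # [0 0 5 6 7]
print(pad_or_trim([1, 2, 3, 4, 5, 6], maxlen=4))  # [3 4 5 6]
```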

In [61]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)

# Example
len(train_data[1])

250

## **Creating the Model**
A word embedding layer will be used as the first layer in the model, followed by an LSTM layer that feeds into a single dense node to produce the predicted sentiment.<br>
The 32 passed to the embedding layer is the output dimension of the vectors it generates. The value can be changed.

In [62]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    # Single sigmoid output: a value greater than 0.5 is treated as a positive review, otherwise as a negative one.
    tf.keras.layers.Dense(1, activation = "sigmoid")
])

In [63]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 32)          2834688   
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
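
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types; the function names are hypothetical, and the LSTM formula assumes the usual layout of four weight sets (three gates plus the cell candidate), each with input weights, recurrent weights, and a bias.

```python
def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per vocabulary entry.
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # Four weight sets, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # One weight per input per unit, plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(88584, 32),
    lstm_params(32, 32),
    dense_params(32, 1),
]
print(counts, sum(counts))  # [2834688, 8320, 33] 2843041
```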


## **Training**

In [64]:
model.compile(loss = "binary_crossentropy", optimizer = "rmsprop", metrics = ['acc'])
history = model.fit(train_data, train_labels, epochs = 10, validation_split = 0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [65]:
# Evaluate the model
results = model.evaluate(test_data, test_labels)
print(results)

[0.4680071771144867, 0.8563200235366821]


## **Making Predictions**
Since the reviews are encoded, we'll need to convert any review that we write into that form so the network can understand it. To do that, we'll load the encodings from the dataset and use them to encode our own data.

In [66]:
word_index = imdb.get_word_index()

def encode_text(text):
  tokens = keras.preprocessing.text.text_to_word_sequence(text)   # split the text into a list of word tokens
  tokens = [word_index[word] if word in word_index else 0 for word in tokens]
  return sequence.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [67]:
# make a decode function
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
  PAD = 0
  text = ""
  for num in integers:
    if num != PAD:
      text += reverse_word_index[num] + " "
  return text[:-1]  # return the text except the last space(" ")

print(decode_integers(encoded))

that movie was just amazing so amazing


In [68]:
# make a prediction
def predict_review(text):
  encoded_text = encode_text(text)
  pred = np.zeros((1, MAXLEN))   # batch containing a single review
  pred[0] = encoded_text
  result = model.predict(pred)
  print(result[0])

positive_review = "That movie was so awesome! I really loved it and would watch it again because it was amazingly great"
predict_review(positive_review)

negative_review = "That movie sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict_review(negative_review)

[0.80868465]
[0.39967075]
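
The two scores above can be turned into labels with the 0.5 cutoff described for the sigmoid output. The helper `label_from_score` is a hypothetical convenience, not part of the notebook.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label (hypothetical helper)."""
    return "positive" if score > threshold else "negative"

# Scores copied from the prediction output above.
for score in (0.80868465, 0.39967075):
    print(score, label_from_score(score))
```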


# **RNN Play Generator**
For this section, an RNN will be used to generate a play. The RNN will be shown an example of the kind of text to recreate, and it will learn to write its own version. This is done with a character-level predictive model that takes a variable-length sequence as input and predicts the next character. The model can be called many times in a row, with the output of the last prediction fed in as the input to the next call, to generate a sequence.

## **Dataset**
The data will be an extract from a Shakespeare play.

In [69]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

## **Read Contents of File**

In [70]:
# read, then decode for py2 compat
text = open(path_to_file, 'rb').read().decode(encoding = 'utf-8')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [71]:
# the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



## **Encoding**

In [72]:
vocab = sorted(set(text))   # sorted list of the unique characters in text

# create a mapping from unique characters to indices
char2idx = {u: i for i, u in enumerate(vocab)}   # maps each character u to its index i, starting from 0
idx2char = np.array(vocab)   # array of characters, so idx2char[i] is the character with index i

def text_to_int(text):
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [73]:
print("Text: ", text[:13])
print("Encoded: ", text_to_int(text[:13]))

Text:  First Citizen
Encoded:  [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [74]:
# make a function that can convert the numeric values to text
def int_to_text(ints):
  try:
    ints = ints.numpy()   # convert a tensor to a NumPy array
  except AttributeError:
    pass   # already a NumPy array or a list of indices
  return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


## **Creating Training Examples**
Remember, the task is to feed the model a sequence and have it return the next character. This means it is necessary to split the text data from above into many shorter sequences that can be passed to the model as training examples.<br>
Each training example consists of a *seq_length* sequence as input and a *seq_length* sequence as output, where the output is the input shifted one character to the right. For example:<br>
input: Hell<br>
output: ello<br>
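
The same chunk-and-shift idea can be sketched on a plain Python string. The helper `make_examples` is hypothetical and only illustrates the splitting logic, using a tiny `seq_length` so the result is easy to read.

```python
def make_examples(text, seq_length):
    """Split `text` into (input, target) pairs of length `seq_length`."""
    chunk_size = seq_length + 1  # one extra character for the shifted target
    examples = []
    # Step through the text in non-overlapping chunks, dropping any short tail.
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

print(make_examples("Hello world", seq_length=4))
# [('Hell', 'ello'), (' wor', 'worl')]
```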

In [75]:
# create a stream of characters from the text data

seq_length = 100  # length of sequence for a training example
examples_per_epoch = len(text) // (seq_length + 1)  # to return the next character, we need 101 characters (seq_length + 1) as a training example

# create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)  # turn the encoded text into a dataset that yields one character index at a time

In [76]:
# use the batch method to turn this stream of characters into batches of desired length
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [77]:
# take the sequences of length 101 and split them into input and output
def split_input_target(chunk):  # for the example: hello
  input_text = chunk[:-1]   # hell
  target_text = chunk[1:]   # ello
  return input_text, target_text  # hell, ello

dataset = sequences.map(split_input_target)   # use map to apply the above function to every entry

In [78]:
for x, y in dataset.take(2):
  print("\n\nEXAMPLE\n")
  print('INPUT\n')
  print(int_to_text(x))
  print('\nOUTPUT\n')
  print(int_to_text(y))



EXAMPLE

INPUT

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT

irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT

are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT

re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k


In [79]:
# make training batches
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)   # number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences, so it doesn't attempt to shuffle the entire sequence in memory. Instead, it maintains a buffer in which it shuffles elements)
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
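
The buffer-based shuffling described in the comment above can be sketched in plain Python. This is a simplified illustration of the idea rather than the actual `tf.data` implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield items from `stream` in a locally shuffled order."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer: just store the item.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = item
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

Because only `buffer_size` elements are held at once, an element can never appear much earlier than its original position, so a larger buffer gives a more thorough shuffle at the cost of memory.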

## **Building the Model**
An embedding layer, an LSTM layer, and one dense layer that contains a node for each unique character in the training data will be used. The dense layer outputs one score (logit) per character, which defines a probability distribution over the possible next characters.

In [80]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                batch_input_shape = [batch_size, None]),   # None: we don't know how long the sequence is
      tf.keras.layers.LSTM(rnn_units,
                           return_sequences = True,
                           stateful = True,
                           recurrent_initializer = 'glorot_uniform'),
      tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (64, None, 256)           16640     
_________________________________________________________________
lstm_4 (LSTM)                (64, None, 1024)          5246976   
_________________________________________________________________
dense_4 (Dense)              (64, None, 65)            66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


## **Creating a Loss Function**
A custom loss function is needed for this problem, because the model outputs a (64, sequence_length, 65) shaped tensor that holds, for every sequence in the batch, a score (logit) for each possible character at each timestep.

In [81]:
# Look at a sample input and the output from the untrained model to understand what the model actually returns
for input_example_batch, target_example_batch in data.take(1):
  example_batch_predictions = model(input_example_batch)  # ask the model for a prediction on the first batch of training data
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")   # print out the output shape

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [82]:
# the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)

64
tf.Tensor(
[[[ 3.3866814e-03  5.2238037e-03 -5.6300871e-03 ... -3.1583738e-03
    1.9084960e-03  2.7738381e-03]
  [ 4.5772679e-03  1.2467699e-03 -3.4201383e-03 ... -3.9539426e-03
    2.9788343e-03  2.6930752e-03]
  [ 2.7872645e-03 -4.4394698e-04 -5.4929880e-03 ... -2.2290377e-03
   -3.9155674e-03  7.4616997e-03]
  ...
  [ 5.4734237e-03 -8.7275412e-03 -2.2610102e-04 ... -1.2706110e-03
   -4.0140245e-03  2.7517264e-03]
  [ 3.7500416e-03 -6.1370987e-03  1.0347291e-03 ... -3.7090508e-03
   -1.0794081e-02  5.4155961e-03]
  [ 6.3601974e-03 -5.0857645e-03  2.4561374e-03 ... -3.0768914e-03
   -5.5794092e-03  3.3621816e-03]]

 [[ 3.3866814e-03  5.2238037e-03 -5.6300871e-03 ... -3.1583738e-03
    1.9084960e-03  2.7738381e-03]
  [ 2.7132845e-03 -2.2334876e-03 -6.0554957e-03 ...  3.5181618e-04
    1.1927890e-02  2.2712965e-03]
  [ 4.0388266e-03 -3.4512696e-03  1.1636384e-03 ... -1.8924656e-03
    9.6672531e-03 -1.1257229e-03]
  ...
  [ 1.0160389e-02 -1.1949643e-02 -6.6457400e-03 ... -2.3884224e

In [83]:
# examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
# notice this is a 2d array of length 100, where each interior array is the prediction for the next character at each time step

100
tf.Tensor(
[[ 0.00338668  0.0052238  -0.00563009 ... -0.00315837  0.0019085
   0.00277384]
 [ 0.00457727  0.00124677 -0.00342014 ... -0.00395394  0.00297883
   0.00269308]
 [ 0.00278726 -0.00044395 -0.00549299 ... -0.00222904 -0.00391557
   0.0074617 ]
 ...
 [ 0.00547342 -0.00872754 -0.0002261  ... -0.00127061 -0.00401402
   0.00275173]
 [ 0.00375004 -0.0061371   0.00103473 ... -0.00370905 -0.01079408
   0.0054156 ]
 [ 0.0063602  -0.00508576  0.00245614 ... -0.00307689 -0.00557941
   0.00336218]], shape=(100, 65), dtype=float32)


In [84]:
# look at a prediction at the first timestep
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
# the 65 values are the scores (logits) for each character occurring next

65
tf.Tensor(
[ 3.3866814e-03  5.2238037e-03 -5.6300871e-03 -5.7627680e-05
 -1.1223005e-02 -1.3835813e-03  1.1171466e-03  1.9499416e-03
  6.0982234e-04 -3.6068098e-04  7.5426819e-03  4.2617954e-03
 -6.6431006e-05 -2.5654838e-03 -2.2227718e-03  2.5029178e-05
 -6.0074730e-03  1.9740900e-03  6.3079677e-04 -2.9983115e-05
  1.5137143e-03  7.7186606e-04 -1.7604164e-03  2.4608870e-03
 -4.5770584e-03 -2.7333882e-03  1.4805857e-03  1.7160823e-03
 -4.2330124e-03  1.5195085e-03  2.1826928e-03 -1.1078909e-03
 -1.2248144e-03  1.1885033e-03 -2.8201253e-03  3.1072595e-03
  2.6906584e-03  2.4038535e-03 -3.9259810e-04  8.2979759e-04
 -1.8729249e-03  1.9882252e-03 -4.9314266e-03  5.6119310e-04
 -1.9450617e-03 -1.3574122e-03  3.4473634e-03 -9.3007158e-04
 -2.3548598e-03 -1.9033084e-04 -1.2099512e-03 -5.6040930e-03
 -2.9555117e-03  1.2872345e-04  3.7422122e-03  2.8477637e-03
 -2.4774706e-03 -2.9872719e-03 -8.9110481e-04 -4.5457734e-03
  4.1934825e-04  4.4707851e-03 -3.1583738e-03  1.9084960e-03
  2.773838

In [85]:
# to determine the predicted character, it is necessary to sample the output distribution (pick a value based on probability)
sampled_indices = tf.random.categorical(pred, num_samples = 1)

# reshape that array and convert the integers to characters to see the actual text
sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars   # this is what the model predicted for training sequence 1

"aE iYVtf3enKARir-BsqUpGub\nncZn?qBFvI&dS$';;trFV!!CK ,n?LB&S3N&;-PP?JnsMqs;;'pZJdItUuR!v.$ci&;FUrcvUI"

In [86]:
# create a loss function that can compare that output to the expected output and give some numeric value representing how close the two were
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits = True)
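
What this loss computes can be sketched in NumPy. The function `sparse_ce_from_logits` is a hypothetical re-implementation for illustration only; it applies a log-softmax to the logits and picks out the log-probability of the correct character at each position.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """Per-position cross-entropy between integer labels and raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax over the last axis, shifted by the max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to the correct class.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

# All-zero logits are a uniform guess over 65 characters,
# so the loss at every position equals ln(65).
logits = np.zeros((2, 3, 65))          # (batch, time, vocab)
labels = np.array([[0, 1, 2], [3, 4, 5]])
losses = sparse_ce_from_logits(labels, logits)
print(losses.mean(), np.log(65))
```

Since the untrained model above produces logits very close to zero, its initial loss should be near ln(65), roughly 4.17.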

## **Compiling the Model**

In [87]:
model.compile(optimizer = 'adam', loss = loss)

## **Creating Checkpoints**
This will allow us to load the model from a checkpoint and continue training it.

In [88]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_prefix,
    save_weights_only = True
)

## **Training**

In [89]:
history = model.fit(data, epochs = 40, callbacks = [checkpoint_callback])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


## **Loading the Model**
Rebuild the model from a checkpoint using a batch_size of 1 so that we can feed one piece of text to the model and have it make a prediction.

In [90]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size = 1)

In [91]:
# find the latest checkpoint that stores the model's weights
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

## **Generating Text**

In [95]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 800

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Lower temperatures result in more predictable text
  # Higher temperatures result in more surprising text
  # Experiment to find the best setting
  temperature = 1.0

  # batch size == 1
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # using a categorical distribution to predict the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples = 1)[-1, 0].numpy()

    # we pass the predicted character as the next input to the model along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
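
The effect of `temperature` in the function above can be seen on a small set of made-up logits. The helper `softmax_with_temperature` is hypothetical; it shows the probabilities that sampling would draw from after the logits are divided by the temperature.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by `temperature`."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

A temperature below 1 concentrates probability on the highest-scoring character, while a temperature above 1 flattens the distribution so that less likely characters are sampled more often.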

In [96]:
inp = input('Type a starting string: ')
print(generate_text(model, inp))

Type a starting string: romeo
romeous to remember:
If I be talk'd with a swaingul-be-trick
And titherselves to tell my thind shall go to you,
Unless you will perform it to my lord or end.
Take him upon you and home for thine!
Let no man holds up Langar over graced, my lord
Before I board the sad stamp of love in ill!
And arm dear valour; let him call back and fly.

BAPTISTA:
Gentlemen, content thms?

First Servant:
Why, sir, God forbid Cornord,
Is this the heavy mile and earth against my cansward, sir, in any way from him that was born:
Were it betide their cames that still intiments
And in his ward up so griev, and hapling thee--

QUEEN ELIZABETH:
O, mighty Frethy hope, with Richmond Angelo
And in Bohemia told shifts, thereon exchance,
We leann you to your testingely to encounter mine.

EXETER:
Away, away!

SEBASTIAN:
'Sca
