## Creating An AI-Based JFK Speech Writer: Part 2
-----------------------

__[1. Introduction](#first-bullet)__

__[2. Data Preparation](#second-bullet)__

__[3. A Bidirectional GRU Model](#third-bullet)__

__[4. Generating Text](#fourth-bullet)__

__[5. Next Steps](#fifth-bullet)__


## Introduction <a class="anchor" id="first-bullet"></a>
----

In this blog post I follow up on the last [post](http://michael-harmon.com/blog/jfk1.html) and develop a model for text generation using [Recurrent Neural Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network). I'll build a bi-directional [gated recurrent unit (GRU)](https://en.wikipedia.org/wiki/Gated_recurrent_unit) model that is trained on speeches made by [President John F. Kennedy](https://en.wikipedia.org/wiki/John_F._Kennedy). Specifically, I'll go over how to build a model that predicts the "next word" in a sentence based on the sequence of words coming before it. This project was challenging for me due to the data preparation needs of this problem. The data preparation was more involved than in other posts that I have done on natural language processing since it involves modeling a sequence of words instead of using a "[bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model)."

The concept of sequence modeling in recurrent neural networks is different from other models that I have done in the past and I will spend some time covering this topic. Interestingly, the next word prediction turns out to be a multi-class classification problem, albeit with a very large number of classes! Let's dig into the problem. 

The first step is to import the necessary [TensorFlow](https://www.tensorflow.org/) and [Google Cloud](https://cloud.google.com/) packages (since the data is in [Google Cloud Storage](https://cloud.google.com/storage?)):

In [194]:
import numpy as np
import tensorflow as tf 

from google.oauth2 import service_account
from google.cloud import storage

tf.compat.v1.logging.set_verbosity('ERROR')
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Data Preparation <a class="anchor" id="second-bullet"></a>
----

I can connect to [Google Cloud Storage](https://cloud.google.com/storage?) to download all the concatenated speeches by President Kennedy. The first thing I do is get my credentials and then instantiate the client to connect to the bucket `gs://harmon-kennedy/`:

In [2]:
credentials = service_account.Credentials.from_service_account_file('credentials.json')
client = storage.Client(project=credentials.project_id, credentials=credentials)

In [192]:
bucket = client.get_bucket("harmon-kennedy")

Then download all the speeches that were concatenated into one file,

In [4]:
blob = bucket.blob("all_jfk_speeches.txt")
text = blob.download_as_text()

I can see the first 300 characters of the text are,

In [193]:
text[:300]

'Of particular importance to South Dakota are the farm policies of the Republican party - the party of Benson, Nixon and Mundt - the party which offers our young people no incentive to return to the farm - which offers the farmer only the prospect of lower and lower income - and which offers the nati'

To get oriented, I can get the number of characters in the text as well as the number of unique characters,

In [6]:
print(f'Length of text: {len(text)} characters')

Length of text: 7734579 characters


In [7]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

67 unique characters


Since I'll be making a word level model this isn't totally helpful. Instead I'll get the total number of words and number of unique words. To do this I need to clean the text: convert newline characters to spaces, drop every token that is not purely alphabetic (which also drops words that have punctuation attached) and convert characters to lower case.

In [195]:
words = text.replace("\n", " ").split(" ")

In [197]:
clean_words = [word.lower() for word in words if word.isalpha()]

In [201]:
clean_text = " ".join(clean_words)

The impact this had on the same text from above can be seen below,

In [231]:
clean_text[:300]

'of particular importance to south dakota are the farm policies of the republican party the party of nixon and mundt the party which offers our young people no incentive to return to the farm which offers the farmer only the prospect of lower and lower income and which offers the nation the vision of'
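The name "Benson" has disappeared from the cleaned text because `isalpha` returns `False` for any token with punctuation attached, so "Benson," is dropped entirely rather than stripped of its comma. A minimal sketch of this effect on a short sample taken from the excerpt above (the names `sample` and `clean_sample` are only for illustration):

```python
# Short sample taken from the excerpt above
sample = "the party of Benson, Nixon and Mundt - the party"

sample_words = sample.replace("\n", " ").split(" ")
# "Benson," and "-" fail isalpha() and are removed entirely
clean_sample = [word.lower() for word in sample_words if word.isalpha()]

print(" ".join(clean_sample))
```

If keeping such words matters, stripping punctuation before the `isalpha` check would be one alternative; this post keeps the simpler filter.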

The total number of clean words and unique clean words in the text are,

In [202]:
print(f"{len(clean_words)} number of clean words")

1196835 number of clean words


In [203]:
print(f"{len(set(clean_words))} unique clean words")

19291 unique clean words


In [17]:
len(clean_text)

7533442

The way a word level text generation model is built is to take a sequence of N words and then predict the next one. To create a training set, the text is split up into sliding windows where the feature vector **x** is the N words in the sequence of text and the target y is the (N+1)-th word in that text. The window is then shifted forward by one word and the process is repeated until the end of the text is reached.

For instance take the sentence "The man is walking down the street." To build a model that predicts the next word based on the 4 before it, it's necessary to create the training examples as shown below,

<figure>
<img src="images/nextword.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://www.youtube.com/watch?v=VAMKuRAh2nc
</figcaption>
</figure>
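The same windowing can be sketched in a few lines of plain Python. This is a minimal illustration that assumes only full windows of four words are kept (the figure may lay the examples out differently); the helper `sliding_windows` is a hypothetical name, not something used later in the post.

```python
def sliding_windows(words, n):
    """Return (features, target) pairs: n consecutive words and the word after them."""
    return [(words[i:i + n], words[i + n]) for i in range(len(words) - n)]

sentence = "the man is walking down the street".split(" ")
pairs = sliding_windows(sentence, 4)

for features, target in pairs:
    print(features, "->", target)
```

With seven words and a window of four, this produces three (features, target) pairs, the last one having "street" as its target.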

For this model I will use `seq_length` to be the number of words in the text used to predict the next word. In order to be able to predict the next word I need to reduce the total number of words that are possible to predict to a finite number. This means limiting the number of possible words to be of size `vocab_size` and in turn this classification problem will be a `vocab_size`-class problem.

In order to convert the text, which is represented as a sequence of words, into numerical vectors I'll use the [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) class. This technique is discussed in more detail in a prior post which you can read [here](http://michael-harmon.com/blog/NLP4.html).

In [235]:
vocab_size = 15000
seq_length = 10

Now instantiate the TextVectorization layer and fit it to the text:

In [258]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vectorizer_layer = TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=vocab_size,
    output_mode="int",
    pad_to_max_tokens=True,
    output_sequence_length=seq_length,
)

In [259]:
vectorizer_layer.adapt([text])


I can then get the set of words in the vectorizer's "vocabulary" and create a dictionary to look up each word's equivalent numerical value.

In [260]:
voc = vectorizer_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

The numerical value for each of the first two words in the example text above is then,

In [239]:
word_index['of']

3

In [240]:
word_index['particular']

758

The numerical value for the "out of vocabulary" token is,

In [241]:
word_index['[UNK]']

1
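To make the index layout concrete, here is a toy sketch that imitates it in plain Python: index 0 is reserved for padding, index 1 for `[UNK]`, and the remaining slots go to the most frequent words. It is only an illustration of the idea and not how `TextVectorization` is implemented; `build_word_index` and the toy sentence are hypothetical.

```python
from collections import Counter

def build_word_index(words, max_tokens):
    """Toy vocabulary: 0 = padding, 1 = [UNK], then words by descending frequency."""
    most_common = [w for w, _ in Counter(words).most_common(max_tokens - 2)]
    vocabulary = ["", "[UNK]"] + most_common
    return {word: index for index, word in enumerate(vocabulary)}

toy_words = "the farm and the farmer and the nation".split(" ")
toy_index = build_word_index(toy_words, max_tokens=4)

# Words outside the vocabulary fall back to the [UNK] index
encoded = [toy_index.get(word, toy_index["[UNK]"]) for word in toy_words]
print(toy_index)
print(encoded)
```

With `max_tokens=4` only "the" and "and" make it into the vocabulary, so every other word is encoded as 1.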

Next I'll create the dataset X and y, where X is the vector of features, which in turn are the sequence of words. The vector y is the target which is the next word in that sequence:

In [242]:
words_seq = [clean_words[i:i + seq_length] for i in range(0, len(clean_words) - seq_length-1)]
next_word = [clean_words[i + seq_length] for i in range(0, len(clean_words) - seq_length-1)]

Each entry in `words_seq` is a list of the `seq_length` words or tokens that make that sequence in that training example. 

In [243]:
words_seq[:2]

[array(['of', 'particular', 'importance', 'to', 'south', 'dakota', 'are',
        'the', 'farm', 'policies'], dtype='<U20'),
 array(['particular', 'importance', 'to', 'south', 'dakota', 'are', 'the',
        'farm', 'policies', 'of'], dtype='<U20')]

I then convert that list of lists into a list of strings,

In [244]:
X = np.array([" ".join(words_seq[i]) for i in range(len(next_word)) if next_cat[i] != 1])
X[:2]

array(['of particular importance to south dakota are the farm policies',
       'particular importance to south dakota are the farm policies of'],
      dtype='<U100')

The reason for doing this is that this way my model will be able to take inputs that are just plain text instead of needing lists of strings that represent that text. The latter would require that new inputs to the model be pre-processed before being fed into the trained model, while the former means a trained model can just take raw text as the input.


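Notice that I only keep sequences whose target word is in the vocabulary. The filter relies on `next_cat`, the integer-encoded targets, which is computed a few cells further down, so that cell has to be run first.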

The next two words that correspond to the targets for the examples above are,

In [245]:
next_word[:2]

['of', 'the']

Now I'll convert the target vector of "next words" to a vector with "numerical values" using the `word_index` dictionary:

In [246]:
next_cat = np.array([word_index.get(word, 1) for word in next_word])

In [247]:
next_cat[:2]

array([3, 2])

Lastly, I one-hot encode these numerical variables in the target vector and filter out those entries that correspond to the out of vocabulary tokens:

In [248]:
y = tf.keras.utils.to_categorical([cat for cat in next_cat if cat != 1])
y[:2]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]], dtype=float32)

The reason for filtering the out-of-vocabulary tokens is that I don't want to train a model that predicts out-of-vocabulary words since this would be meaningless to end users.
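One detail worth keeping in mind: `tf.keras.utils.to_categorical` infers the number of classes from the largest index it sees unless `num_classes` is passed, so the width of the one-hot matrix can come out smaller than the vocabulary size if the highest index never appears as a target. The NumPy sketch below mimics that behaviour on toy data; the helper `one_hot` is hypothetical and is not part of Keras.

```python
import numpy as np

def one_hot(indices, num_classes=None):
    """One-hot encode integer labels; infer the width from the data if not given."""
    indices = np.asarray(indices)
    if num_classes is None:
        num_classes = int(indices.max()) + 1
    return np.eye(num_classes, dtype="float32")[indices]

toy_targets = [3, 2, 5]
inferred = one_hot(toy_targets)               # width inferred from the largest index
fixed = one_hot(toy_targets, num_classes=8)   # width fixed to a toy vocabulary size of 8

print(inferred.shape, fixed.shape)
```

Fixing the width explicitly keeps the target matrix aligned with the number of outputs of the final softmax layer.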

The sizes of the datasets are,

In [249]:
X.shape

(1179990,)

In [250]:
y.shape

(1190502, 14999)

Now to see what effect the vectorizer layer has on the text I'll feed the first two sequences above through the layer.

In [261]:
vectorizer_layer.call(X[:2])

<tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[   3,  758,  692,    5,  430, 2268,   16,    2,  156,  280],
       [ 758,  692,    5,  430, 2268,   16,    2,  156,  280,    3]])>

The vectorizer layer converts the array of strings with shape `(1179990,)` to a matrix of integers of shape `(1179990, seq_length)`. Each entry in the array will be an integer from 0 to `vocab_size - 1` and is the integer representation for each word, with 0 reserved for padding and 1 for out-of-vocabulary words.


## A Bidirectional GRU Model  <a class="anchor" id="third-bullet"></a>
------------
[Recurrent Neural Networks (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network) are used to model sequences. They use an internal state, **h**, that acts as memory while processing these sequences and lets them "remember" things from the past. A quintessential diagram of an RNN is shown below,


<figure>
<img src="images/rnn.svg" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg/
</figcaption>
</figure>

An RNN cell is shown on the left and on the right is the "un-rolled" version that shows how the cell processes a sequence of inputs **x** into outputs **o**; there is a subscript *t* that denotes the entry in the sequence. The subscript for each **h** is used to denote the value of the internal state or memory cell at the t-th entry in the sequence.

There are quite a few types of RNNs, which are shown below,

<figure>
<img src="images/types.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recurrent-neural-network/recurrent_neural_networks/
</figcaption>
</figure>


The model I am building in this post that uses a sequence of words to predict the next word is a "many-to-one" model. Zooming into the RNN cell we focus on a specific type of RNN called a [Gated Recurrent Unit (GRU)](https://en.wikipedia.org/wiki/Gated_recurrent_unit). The details of a GRU cell are shown below.


<figure>
<img src="images/gru.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://colah.github.io/posts/2015-08-Understanding-LSTMs/
</figcaption>
</figure>

There is a hidden state **h** that takes on values for each iteration *t*. There is also a candidate update to the hidden state, written as **h** with a ~ over it. The candidate update has values between -1 and +1 and is a function of the relevance gate **r** as well as the prior value of the hidden state and the current value of the input. The relevance gate (often called the reset gate) is a value between 0 and 1 and is a function of the prior value of the hidden state and the current value of the input. It controls how much effect the prior hidden state value has on the candidate update value for the hidden state.

Lastly there is a forget gate **z** (usually called the update gate in the GRU literature), which is between 0 and 1 and is a function of the prior value of the hidden state and the current value of the input. The forget gate is used to control how much the hidden state value is updated. If `z = 1` then we update the internal state to be the candidate state. If `z = 0`, the value for the hidden state remains unchanged. For values in between, the new hidden state is a weighted average of the two.
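The gate logic described above can be written out as a single step in NumPy. This is a minimal sketch following the convention in the text (`z = 1` means "take the candidate"); biases are omitted, the weights are random toy values, and it is not the implementation used inside `tf.keras.layers.GRU`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W_r, W_z, W_h):
    """One GRU step; each weight matrix acts on the concatenation [h, x]."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                                     # relevance gate, in (0, 1)
    z = sigmoid(W_z @ hx)                                     # forget/update gate, in (0, 1)
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]))   # candidate, in (-1, 1)
    return (1.0 - z) * h_prev + z * h_cand                    # z = 1 -> candidate, z = 0 -> unchanged

rng = np.random.default_rng(0)
units, input_dim = 3, 4                                       # toy sizes, not the ones used in this post
W_r, W_z, W_h = (rng.normal(size=(units, units + input_dim)) for _ in range(3))

h = np.zeros(units)
for x in rng.normal(size=(5, input_dim)):                     # a toy sequence of 5 inputs
    h = gru_step(x, h, W_r, W_z, W_h)
print(h.shape)
```

Because every step blends the previous state with a `tanh` output, the hidden state always stays between -1 and +1.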

Notice the hidden state value **h** of one iteration is fed back into the RNN cell at the next iteration along with the input **x**. These variables are not necessarily scalars and can be vectors. In the model I am building each input **x** is an embedding vector of dimension `embedding_dim`, the hidden state **h** has as many entries as there are GRU units, and the sequence has `seq_length` time steps. The output of the model is a vector of size `vocab_size`. To convert the final hidden state vector **h** to the output **y** we apply a dense layer with a softmax function.

Natural language processing models often make use of a [bi-directional RNN](https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks). In this type of model two RNN cells are used, one processing the sequence in the forward direction and one processing the sequence in the reverse direction. The architecture is shown below:

<figure>
<img src="images/bidirectionalgru.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://www.researchgate.net/figure/The-structure-of-a-bidirectional-GRU-model_fig4_366204325
</figcaption>
</figure>

Notice that the GRU cells at the same time *t* are both a function of the same input value **x**, but are functions of hidden states **h** from different iterations. Both cells at the same iteration are used to compute the output at that iteration. Bidirectional RNNs were introduced to increase the amount of input information available to the network.
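A small NumPy sketch of the bidirectional idea for a many-to-one model: one cell reads the sequence front to back, a second cell with its own weights reads it back to front, and the two final states are concatenated. To keep the example short a plain `tanh` RNN cell stands in for the GRU, and all sizes and weights are toy assumptions.

```python
import numpy as np

def run_rnn(inputs, W, units):
    """Run a plain tanh RNN cell over the inputs and return the final hidden state."""
    h = np.zeros(units)
    for x in inputs:
        h = np.tanh(W @ np.concatenate([h, x]))
    return h

rng = np.random.default_rng(0)
units, input_dim, steps = 3, 4, 5               # toy sizes for illustration
W_fwd = rng.normal(size=(units, units + input_dim))
W_bwd = rng.normal(size=(units, units + input_dim))
sequence = rng.normal(size=(steps, input_dim))

h_fwd = run_rnn(sequence, W_fwd, units)         # reads x_1 ... x_T
h_bwd = run_rnn(sequence[::-1], W_bwd, units)   # reads x_T ... x_1
output = np.concatenate([h_fwd, h_bwd])         # twice as wide as a single direction
print(output.shape)
```

Concatenation is why the output of a bidirectional layer has twice as many entries as the number of units in each direction.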

I implement a bi-directional GRU model below, first as a quick Sequential model and then using [TensorFlow's subclassing](https://www.tensorflow.org/guide/keras/custom_layers_and_models) methods.

In [252]:
embedding_dim=128

In [262]:
model = tf.keras.models.Sequential([
    vectorizer_layer,
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(34)),
    tf.keras.layers.Dense(vocab_size, activation='softmax')])

In [263]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_7 (TextV  (None, 10)               0         
 ectorization)                                                   
                                                                 
 embedding_4 (Embedding)     (None, 10, 128)           1920000   
                                                                 
 bidirectional_4 (Bidirectio  (None, 68)               33456     
 nal)                                                            
                                                                 
 dense_4 (Dense)             (None, 15000)             1035000   
                                                                 
Total params: 2,988,456
Trainable params: 2,988,456
Non-trainable params: 0
_________________________________________________________________
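The parameter counts in this summary can be reproduced by hand. The arithmetic below assumes the GRU layer keeps two bias vectors per weight set, which is the default in TensorFlow 2 and matches the numbers above:

```python
vocab_size, embedding_dim, units = 15000, 128, 34

embedding_params = vocab_size * embedding_dim
# Per direction: 3 weight sets (two gates plus the candidate), each with
# input weights, recurrent weights and two bias vectors
gru_params = 3 * (units * (embedding_dim + units) + 2 * units)
bidirectional_params = 2 * gru_params
# The dense layer sees the concatenated forward and backward states
dense_params = (2 * units) * vocab_size + vocab_size

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
```

Most of the parameters sit in the embedding and dense layers, both of which scale with `vocab_size`; the recurrent layer itself is comparatively small.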


In [42]:
from typing import Dict

class JFKSpeechWriter(tf.keras.Model):
    def __init__(self, 
                 text: str, 
                 seq_length: int,
                 vocab_size: int, 
                 embedding_dim: int, 
                 units: int) -> None:
        
        super().__init__()
        
        self.vectorizer_layer = TextVectorization(
                                    standardize="lower_and_strip_punctuation",
                                    max_tokens=vocab_size,
                                    output_mode="int",
                                    pad_to_max_tokens=True,
                                    output_sequence_length=seq_length,
                              )
        self.embedding_layer =  tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.GRU_layer = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units))
        self.dense_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')
        
        self.vectorizer_layer.adapt([text])
        
    def call(self, input_str: tf.Tensor) -> tf.Tensor:
        # Forward pass: raw strings -> token ids -> embeddings -> GRU -> softmax
        x = self.vectorizer_layer(input_str)
        x = self.embedding_layer(x)
        x = self.GRU_layer(x)
        return self.dense_layer(x)
        
    def get_wordmap(self) -> Dict[int, str]:
        # Map each integer index back to its word in the vocabulary
        voc = self.vectorizer_layer.get_vocabulary()
        return dict(enumerate(voc))


The subclassing method in TensorFlow is mainly used for novel techniques in research problems. The model I am building is pretty standard, but I wanted to use this opportunity to play around with the subclassing methodology.

In the constructor each layer of the model is declared as an attribute of the object and the vectorizer layer is adapted to the text. The `call` method of the class is used to define the forward pass of the model. Like in all Keras models, the forward pass is all that needs to be defined and Keras/TensorFlow handles computing the information needed for [backpropagation](https://en.wikipedia.org/wiki/Backpropagation). The model consists of a vectorizer layer which converts the text, made up of `seq_length` words, into a `seq_length`-dimensional vector of integers that take values between 0 and `vocab_size - 1`. Next an embedding layer is applied, followed by a bi-directional GRU layer and a dense layer with a softmax activation as the last layer to predict which of the `vocab_size` classes the next word is.

The last method of the class is `get_wordmap`. This function returns the reverse dictionary that converts an integer representation of a word back to its English version. This function is used for converting the output of the model back to an English word.

Now, the model can be instantiated with a 128-dimensional embedding layer and a 64-unit GRU layer:

In [166]:
model = JFKSpeechWriter(text=text, 
                        seq_length=10, 
                        vocab_size=15000, 
                        embedding_dim=128, 
                        units=64)


Notice that for the vectorizer layer I have to pass the original text, `seq_length` and the `vocab_size` values to initialize that layer properly. 

Now we can compile the model:

In [167]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

Normally in a TensorFlow model using a [Sequential model](https://www.tensorflow.org/guide/keras/sequential_model) or the [Functional API](https://www.tensorflow.org/guide/keras/functional) once the model is compiled the [summary](https://keras.io/api/models/model/#summary-method) method can be used to return information on the model. However, compiling the model is not sufficient for using the [summary](https://keras.io/api/models/model/#summary-method) method in the subclassing API: the model first has to be built, which happens once it has been called on data, for example during training.


In [47]:
from sklearn.utils import shuffle

In [48]:
X_train, y_train = X[:100000], y[:100000]

In [49]:
X_train, y_train = shuffle(X_train, y_train)

In [50]:
X_train[0]

'state laws in states such as massachusetts on the other hand union security provisions more liberal than tafthartley are not'

In [168]:
model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2dbdfd430>

In [189]:
model.save("jfk_model", save_format="tf")



In [169]:
model.summary()

Model: "jfk_speech_writer_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_4 (TextV  multiple                 0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     multiple                  1280000   
                                                                 
 bidirectional_3 (Bidirectio  multiple                 74496     
 nal)                                                            
                                                                 
 dense_3 (Dense)             multiple                  1290000   
                                                                 
Total params: 2,644,496
Trainable params: 2,644,496
Non-trainable params: 0
_________________________________________________________________


## Generating Text  <a class="anchor" id="fourth-bullet"></a>
-----------

In [191]:
model = tf.keras.models.load_model("jfk_model")



In [161]:
test = str(X[688394])

In [162]:
test

'and constitution and in the american public school system and he stated flatly that he recognized no power in the'

In [163]:
y_pred = model.predict([test])



In [170]:
reverse_word_map = model.get_wordmap()

In [186]:
reverse_word_map[np.argmax(y_pred[0])]

'united'

In [172]:
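def next_words(input_str, n):
    # Greedily generate n words, sliding the input window forward each step
    final_str = ''
    for i in range(n):
        prediction = model.predict([input_str], verbose=0)
        # Pick the most probable word from the softmax output
        next_word = reverse_word_map[np.argmax(prediction[0])]
        final_str += next_word + ' ' 
        # Append the new word and drop the oldest one to keep the window length fixed
        input_str += ' ' + next_word
        input_str = ' '.join(input_str.split(' ')[1:])
    return final_str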

In [184]:
new_text = next_words(test, 10)

In [185]:
test + " " + new_text

'and constitution and in the american public school system and he stated flatly that he recognized no power in the words of the house of representatives and women of the '
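The generation loop is easier to follow with the model swapped out for a stub. In the sketch below `toy_predict` is a hypothetical stand-in that only looks at the last word of the window; the point is just to show how each predicted word is appended while the oldest word is dropped, so the input always has the same length.

```python
def generate(seed, n, predict_next):
    """Greedy generation: append the predicted word, then drop the oldest word."""
    window = seed.split(" ")
    generated = []
    for _ in range(n):
        word = predict_next(" ".join(window))
        generated.append(word)
        window = window[1:] + [word]             # window length stays fixed
    return " ".join(generated)

# Hypothetical stand-in for the trained model: looks up the last word only
toy_table = {"in": "the", "the": "house", "house": "of", "of": "the"}

def toy_predict(text):
    return toy_table.get(text.split(" ")[-1], "the")

generated_text = generate("no power in", 4, toy_predict)
print(generated_text)
```

Because the most probable word is always chosen, the same seed always produces the same continuation, and short loops like "of the ... of the" can appear, much like in the output above.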

## Next Steps  <a class="anchor" id="fifth-bullet"></a>
-------------
