## Creating An AI-Based JFK Speech Writer: Part 2
-----------------------

__[1. Introduction](#first-bullet)__

__[2. Data Preparation](#second-bullet)__

__[3. A Bidirectional GRU Model](#third-bullet)__

__[4. Generating Text](#fourth-bullet)__

__[5. Next Steps](#fifth-bullet)__


## Introduction <a class="anchor" id="first-bullet"></a>
----

In this blog post I follow up on the last [post](http://michael-harmon.com/blog/jfk1.html) and develop a model for text generation with [Recurrent Neural Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network). I'll build a bi-directional [gated recurrent unit (GRU)](https://en.wikipedia.org/wiki/Gated_recurrent_unit) that is trained on speeches made by [President John F. Kennedy](https://en.wikipedia.org/wiki/John_F._Kennedy). Specifically, I'll go over how to build a model that predicts the "next word" in a sentence based on the words coming before it. This was challenging for me due to the data preparation needs of this problem. The data preparation was more involved than in other posts that I have done on natural language processing since it involves modeling a sequence of words instead of a "[bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model)."

The concept of sequence modeling in recurrent neural networks is different from other models that I have done in the past and I will spend some time covering this topic. Interestingly, the next word prediction turns out to be a multi-class classification problem, albeit with a very large number of classes! Let's dig into the problem. 

The first step is to import the necessary [TensorFlow](https://www.tensorflow.org/) and [Google Cloud](https://cloud.google.com/) packages (since the data is in [Google Cloud Storage](https://cloud.google.com/storage?)):

In [194]:
import numpy as np
import tensorflow as tf 

from google.oauth2 import service_account
from google.cloud import storage

tf.compat.v1.logging.set_verbosity('ERROR')
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Data Preparation <a class="anchor" id="second-bullet"></a>
----

I can connect to [Google Cloud Storage](https://cloud.google.com/storage?) to download all the concatenated speeches by President Kennedy. The first thing I do is get my credentials and then instantiate the client to connect to the bucket `gs://harmon-kennedy/`:

In [2]:
credentials = service_account.Credentials.from_service_account_file('credentials.json')
client = storage.Client(project=credentials.project_id, credentials=credentials)

In [192]:
bucket = client.get_bucket("harmon-kennedy")

And download all the speeches, which are concatenated into one file,

In [4]:
blob = bucket.blob("all_jfk_speeches.txt")
text = blob.download_as_text()

We can see the first 300 characters of the text are,

In [193]:
text[:300]

'Of particular importance to South Dakota are the farm policies of the Republican party - the party of Benson, Nixon and Mundt - the party which offers our young people no incentive to return to the farm - which offers the farmer only the prospect of lower and lower income - and which offers the nati'

To get oriented, I can get the number of characters in the text as well as the number of unique characters,

In [6]:
print(f'Length of text: {len(text)} characters')

Length of text: 7734579 characters


In [7]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

67 unique characters


Since I'll be making a word-level model this isn't totally helpful. Instead I'll get the total number of words and the number of unique words. To do this I need to clean the text: convert newline characters to spaces, keep only tokens that are purely alphabetic, and convert everything to lower case. Note that the alphabetic filter also drops any word with punctuation attached, such as "Benson," in the excerpt above.

In [195]:
words = text.replace("\n", " ").split(" ")

In [197]:
clean_words = [word.lower() for word in words if word.isalpha()]

In [201]:
clean_text = " ".join(clean_words)

We can see the impact this had on the same text from above and get the total number of words and unique words in the text.

In [231]:
clean_text[:300]

'of particular importance to south dakota are the farm policies of the republican party the party of nixon and mundt the party which offers our young people no incentive to return to the farm which offers the farmer only the prospect of lower and lower income and which offers the nation the vision of'

In [202]:
print(f"{len(clean_words)} number of clean words")

1196835 number of clean words


In [203]:
print(f"{len(set(clean_words))} unique clean words")

19291 unique clean words


In [17]:
len(clean_text)

7533442
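One side effect of the `isalpha()` filter is visible in the cleaned text above: "Benson," carried a trailing comma, so the whole word was dropped rather than just the comma. The sketch below reproduces this on a short phrase from the speech and compares it with stripping punctuation before filtering. The second approach is an assumed alternative shown only for illustration; it is not what the rest of this post uses.

```python
import string

sample = "the party of Benson, Nixon and Mundt - the party"
words = sample.replace("\n", " ").split(" ")

# Filter used in this post: keep only purely alphabetic tokens.
# "Benson," fails isalpha() because of the comma, so the word is lost.
filtered = [w.lower() for w in words if w.isalpha()]

# Assumed alternative: strip surrounding punctuation first, then filter.
stripped = [w.strip(string.punctuation).lower() for w in words]
kept = [w for w in stripped if w.isalpha()]

print(filtered)
print(kept)
```

Which behaviour is preferable depends on how much of the vocabulary sits next to punctuation; the word counts reported above are for the first approach.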

The way I will build a word-level text generation model is to take a sequence of N words and then predict the next one. To create a training set, I break up the text into sliding windows where the feature vector **x** is N consecutive words in the text and the target y is the word that follows them. I then slide the window forward one word at a time and repeat this over the whole text.

For instance take the sentence "The man is walking down the street." If we want to build a model that predicts the next word based on the 4 before it we would have 4 training examples as shown below,

<figure>
<img src="images/nextword.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://www.youtube.com/watch?v=VAMKuRAh2nc
</figcaption>
</figure>
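The windowing can be sketched in a few lines of plain Python. `make_windows` is a hypothetical helper written for illustration, and the example sentence is chosen here rather than taken from the figure.

```python
from typing import List, Tuple

def make_windows(words: List[str], n: int) -> List[Tuple[List[str], str]]:
    """Return (previous n words, next word) pairs from a list of words."""
    return [(words[i:i + n], words[i + n]) for i in range(len(words) - n)]

sentence = "ask not what your country can do for you".split()
examples = make_windows(sentence, n=4)

for features, target in examples:
    print(features, "->", target)
```

Each step moves the window one word to the right, so a text of `len(words)` words yields `len(words) - n` training examples.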

For my model I'll use `seq_length` as the number of words used to predict the next word. To keep the prediction problem tractable I also need to limit the number of words the model can predict, so I'll cap the vocabulary at `vocab_size` words. That means this classification problem will be a `vocab_size`-class problem.

In order to convert the text which is represented as a sequence of words into numerical vectors I'll use the [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) class. I discuss this topic more in a prior post which you can read [here](http://michael-harmon.com/blog/NLP4.html).

In [205]:
vocab_size = 10000
seq_length = 10

We instantiate the TextVectorization layer and fit it to the text we have so far:

In [207]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vectorize_layer = TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=vocab_size,
    output_mode="int",
    pad_to_max_tokens=True,
    output_sequence_length=seq_length,
)

In [208]:
vectorize_layer.adapt([text])



I can then get the set of words in my "vocabulary" and create a dictionary to look up each word to its numerical value.

In [211]:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

We can get the numerical value for each of the first two words in the example text above,

In [212]:
word_index['of']

3

In [213]:
word_index['particular']

758

Now we get the numerical value for the out of vocabulary token:

In [219]:
word_index['[UNK]']

1
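To make the index convention concrete, here is a pure-Python sketch that mimics it on a toy corpus: index 0 is reserved for padding, index 1 for out-of-vocabulary words, and the remaining slots go to the most frequent words. `build_vocab` and `encode` are hypothetical helpers; this is not the `TextVectorization` implementation, and details such as tie-breaking between equally frequent words may differ.

```python
from collections import Counter
from typing import Dict, List

def build_vocab(words: List[str], max_tokens: int) -> List[str]:
    """Padding and [UNK] take the first two slots; the rest go by frequency."""
    counts = Counter(words)
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(sentence: str, word_index: Dict[str, int], seq_length: int) -> List[int]:
    """Map words to ids (1 if unknown), then truncate or pad with 0."""
    ids = [word_index.get(w, 1) for w in sentence.lower().split()]
    ids = ids[:seq_length]
    return ids + [0] * (seq_length - len(ids))

corpus = "the farm and the farmer and the nation".split()
toy_vocab = build_vocab(corpus, max_tokens=4)
toy_index = {w: i for i, w in enumerate(toy_vocab)}

encoded = encode("the farm and", toy_index, seq_length=5)
print(toy_vocab)
print(encoded)
```

With `max_tokens=4` only two real words fit in the vocabulary, so "farm" is mapped to the out-of-vocabulary id and the short sentence is padded with zeros.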

Next I'll create the dataset X and y, where the features X are the sequences of words and the target y is the next word in each sequence:

In [214]:
words_seq = [clean_words[i:i + seq_length] for i in range(0, len(clean_words) - seq_length-1)]
next_word = [clean_words[i + seq_length] for i in range(0, len(clean_words) - seq_length-1)]

Each entry in `words_seq` is a list of the `seq_length` words that make that sequence in that training example. 

In [216]:
words_seq[:2]

[array(['of', 'particular', 'importance', 'to', 'south', 'dakota', 'are',
        'the', 'farm', 'policies'], dtype='<U20'),
 array(['particular', 'importance', 'to', 'south', 'dakota', 'are', 'the',
        'farm', 'policies', 'of'], dtype='<U20')]

I then convert the list of lists into an array of strings,

In [227]:
X = np.array([" ".join(words_seq[i]) for i in range(len(next_word)) if next_cat[i] != 1])
X[:2]

array(['of particular importance to south dakota are the farm policies',
       'particular importance to south dakota are the farm policies of'],
      dtype='<U100')

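Notice that I only keep the examples where the target word is in the vocabulary, i.e. where its index is not 1, the out-of-vocabulary token. The filter relies on `next_cat`, the array of target indices that is defined a few cells below, so that cell has to be run first.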

The next two words that correspond to the targets for the examples above are,

In [224]:
next_word[:2]

['of', 'the']

Now I'll convert the next word to a numerical value using the `word_index`:

In [232]:
next_cat = np.array([word_index.get(word, 1) for word in next_word])

In [233]:
next_cat[:2]

array([3, 2])

Now I need to one hot encode these variables and filter out those that correspond to the out of vocabulary tokens:

In [226]:
y = tf.keras.utils.to_categorical([cat for cat in next_cat if cat != 1], num_classes=vocab_size)
y[:2]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]], dtype=float32)

We can then see the size of each array:

In [228]:
X.shape

(1179990,)

In [229]:
y.shape

(1179990, 10000)
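A dense one-hot target of this shape is large. As a rough check, and assuming 4 bytes per `float32` entry and 8 bytes per integer label, the arithmetic below estimates its size and compares it with keeping integer labels. Keras also offers a `sparse_categorical_crossentropy` loss that accepts integer targets directly; that is an alternative worth considering, not what this post uses. The toy probabilities are made up to show that both encodings give the same loss for a single example.

```python
import math

num_examples, vocab_size = 1179990, 10000

one_hot_bytes = num_examples * vocab_size * 4   # float32 one-hot matrix
sparse_bytes = num_examples * 8                 # one int64 label per example
print(f"one-hot: {one_hot_bytes / 1e9:.1f} GB, "
      f"integer labels: {sparse_bytes / 1e6:.1f} MB")

# The two encodings give the same cross-entropy for a single example.
probs = [0.1, 0.7, 0.2]      # toy predicted distribution over 3 "words"
label = 1                    # integer target
one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]

loss_one_hot = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
loss_sparse = -math.log(probs[label])
print(loss_one_hot, loss_sparse)
```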

Now to see what effect the vectorizer layer has on the text I'll feed the first two sequences above through the layer.

In [230]:
vectorize_layer(X[:2])

<tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[   3,  758,  692,    5,  430, 2268,   16,    2,  156,  280],
       [ 758,  692,    5,  430, 2268,   16,    2,  156,  280,    3]])>

The vectorizer layer converts the array of strings with shape `(1179990,)` to a matrix of integers of shape `(1179990, seq_length)`. Each entry in the matrix is an integer from 0 to `vocab_size - 1` that represents a word, with 0 reserved for padding and 1 for out-of-vocabulary words.


## A Bidirectional GRU Model  <a class="anchor" id="third-bullet"></a>
------------
[Recurrent Neural Networks (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network) are used to model sequences. They use an internal state, h, to act as memory to process these sequences and "remember" things from the past. A quintessential diagram of an RNN is shown below,


<figure>
<img src="images/rnn.svg" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg/
</figcaption>
</figure>

The RNN on the left is "un-rolled" on the right to show how it processes a sequence of inputs **x** into outputs **o**. The subscript *t* denotes the position in the sequence, so the subscript on each **h** denotes the value of the internal state, or memory cell, at the t-th entry in the sequence.
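As a minimal sketch of the recurrence in the diagram, the code below runs a "vanilla" RNN with a scalar state. The weights `w_x`, `w_h` and `b` are made-up values, and the `tanh` update is the textbook form rather than the GRU used later in this post.

```python
import math
from typing import List

def rnn_forward(xs: List[float], w_x: float, w_h: float, b: float) -> List[float]:
    """Scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = 0.0                      # initial state h_0
    states = []
    for x in xs:                 # one step per entry in the sequence
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.9, b=0.0)
print(states)

# The input is zero after the first step, yet the later states are non-zero:
# the state carries information about the first input forward in time.
last_state = states[-1]
```

The same weights are reused at every step, which is what the un-rolled picture expresses; the state is the only thing that changes from one step to the next.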

There are quite a few types of RNNs that are shown below,

<figure>
<img src="images/types.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recurrent-neural-network/recurrent_neural_networks/
</figcaption>
</figure>


The model I am building, where I use a sequence of words to predict the next word, is a "many-to-one" model. Zooming into the green blocks, the specific type of RNN cell I use is called a [Gated Recurrent Unit (GRU)](https://en.wikipedia.org/wiki/Gated_recurrent_unit). The details of a GRU cell are shown below, followed by the structure of a bidirectional GRU.


<figure>
<img src="images/gru.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://colah.github.io/posts/2015-08-Understanding-LSTMs/
</figcaption>
</figure>

<figure>
<img src="images/bidirectionalgru.png" alt="Trulli" style="width:75%">
<figcaption align = "center">
From https://www.researchgate.net/figure/The-structure-of-a-bidirectional-GRU-model_fig4_366204325
</figcaption>
</figure>
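The two figures can be summarised in a short scalar sketch: a GRU step uses an update gate and a reset gate to decide how much of the old state to keep, and a bidirectional model runs one GRU over the sequence from left to right and a second one from right to left. All weights below are made-up numbers, biases are omitted for brevity, and the blend `(1 - z) * h + z * h_tilde` follows one common convention (implementations differ in which of `z` and `1 - z` multiplies the old state). Keras' `Bidirectional` wrapper concatenates the two directions by default, which is what the last function imitates.

```python
import math
from typing import Dict, List

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x: float, h: float, p: Dict[str, float]) -> float:
    """One scalar GRU step (biases omitted to keep the sketch short)."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h)                 # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h)                 # reset gate
    h_tilde = math.tanh(p["w_h"] * x + p["u_h"] * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde                       # blend old and new

def gru_last_state(xs: List[float], p: Dict[str, float]) -> float:
    """Many-to-one: run over the whole sequence, return only the final state."""
    h = 0.0
    for x in xs:
        h = gru_step(x, h, p)
    return h

def bidirectional(xs: List[float], p_fwd: Dict[str, float],
                  p_bwd: Dict[str, float]) -> List[float]:
    """Concatenate the final states of a forward and a backward pass."""
    return [gru_last_state(xs, p_fwd), gru_last_state(xs[::-1], p_bwd)]

params = {"w_z": 0.5, "u_z": 0.1, "w_r": 0.3, "u_r": 0.2, "w_h": 0.8, "u_h": 0.4}
out = bidirectional([1.0, 2.0, 3.0], params, params)
palindrome = bidirectional([1.0, 2.0, 1.0], params, params)
print(out, palindrome)
```

In a trained model the two directions have their own weights; they are shared here only to keep the example small.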


In [42]:
from typing import Dict

In [165]:
class JFKSpeechWriter(tf.keras.Model):
    def __init__(self, 
                 text: str, 
                 seq_length: int,
                 vocab_size: int, 
                 embedding_dim: int, 
                 units: int) -> None:
        
        super().__init__()
        
        self.vectorizer_layer = TextVectorization(
                                    standardize="lower_and_strip_punctuation",
                                    max_tokens=vocab_size,
                                    output_mode="int",
                                    pad_to_max_tokens=True,
                                    output_sequence_length=seq_length,
                              )
        self.embedding_layer =  tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.GRU_layer = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units))
        self.dense_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')
        
        self.vectorizer_layer.adapt([text])
        
    def call(self, input_str: tf.Tensor) -> tf.Tensor:
        # Raw strings -> integer ids -> embeddings -> GRU state -> word probabilities
        x = self.vectorizer_layer(input_str)
        x = self.embedding_layer(x)
        x = self.GRU_layer(x)
        return self.dense_layer(x)
        
    def get_wordmap(self) -> Dict[int, str]:
        # Map each integer id back to its word
        voc = self.vectorizer_layer.get_vocabulary()
        return dict(enumerate(voc))


In [166]:
model = JFKSpeechWriter(text=text, 
                        seq_length=20, 
                        vocab_size=10000, 
                        embedding_dim=128, 
                        units=64)



In [167]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [47]:
from sklearn.utils import shuffle

In [48]:
X_train, y_train = X[:100000], y[:100000]

In [49]:
X_train, y_train = shuffle(X_train, y_train)

In [50]:
X_train[0]

'state laws in states such as massachusetts on the other hand union security provisions more liberal than tafthartley are not'

In [168]:
model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2dbdfd430>

In [189]:
model.save("jfk_model", save_format="tf")

2023-04-07 07:42:02.923068: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


In [169]:
model.summary()

Model: "jfk_speech_writer_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_4 (TextV  multiple                 0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     multiple                  1280000   
                                                                 
 bidirectional_3 (Bidirectio  multiple                 74496     
 nal)                                                            
                                                                 
 dense_3 (Dense)             multiple                  1290000   
                                                                 
Total params: 2,644,496
Trainable params: 2,644,496
Non-trainable params: 0
_________________________________________________________________
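The parameter counts in the summary can be reproduced by hand, which is a useful check on what each layer contains. The GRU formula below assumes the Keras default `reset_after=True`, which gives each gate two bias vectors.

```python
vocab_size, embedding_dim, units = 10000, 128, 64

# One embedding vector per word in the vocabulary.
embedding = vocab_size * embedding_dim

# One GRU direction: 3 weight sets (update gate, reset gate, candidate),
# each with input weights, recurrent weights and two bias vectors.
gru_one_direction = 3 * (units * (embedding_dim + units) + 2 * units)
bidirectional_gru = 2 * gru_one_direction

# The dense layer sees the concatenated forward and backward states
# (2 * units inputs) plus one bias per output class.
dense = (2 * units + 1) * vocab_size

total = embedding + bidirectional_gru + dense
print(embedding, bidirectional_gru, dense, total)
```

Almost all of the parameters sit in the embedding and the output layer, both of which scale with `vocab_size`.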


## Generating Text  <a class="anchor" id="fourth-bullet"></a>
-----------

In [191]:
model = tf.keras.models.load_model("jfk_model")



In [161]:
test = str(X[688394])

In [162]:
test

'and constitution and in the american public school system and he stated flatly that he recognized no power in the'

In [163]:
y_pred = model.predict([test])



In [170]:
reverse_word_map = model.get_wordmap()

In [186]:
reverse_word_map[np.argmax(y_pred[0])]

'united'

In [172]:
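def next_words(input_str, n):
    """Greedily generate n words, feeding each prediction back into the input."""
    final_str = ''
    for i in range(n):
        # Predict a probability for every word in the vocabulary
        prediction = model.predict([input_str], verbose=0)
        # Greedy decoding: pick the single most likely word
        next_word = reverse_word_map[np.argmax(prediction[0])]
        final_str += next_word + ' '
        # Slide the window: append the new word and drop the oldest one
        input_str += ' ' + next_word
        input_str = ' '.join(input_str.split(' ')[1:])
    return final_str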

In [184]:
new_text = next_words(test, 10)

In [185]:
test + " " + new_text

'and constitution and in the american public school system and he stated flatly that he recognized no power in the words of the house of representatives and women of the '
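`next_words` always takes the `argmax`, i.e. greedy decoding, which can fall back on very common phrases. A common alternative is to sample from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below is an assumed variation, not what produced the output above: `sample_with_temperature` is a hypothetical helper and `probs` is a made-up stand-in for one row of `model.predict`.

```python
import math
import random
from typing import List, Optional

def sample_with_temperature(probs: List[float], temperature: float = 1.0,
                            rng: Optional[random.Random] = None) -> int:
    """Draw a word index from probs after rescaling by a temperature."""
    rng = rng or random.Random()
    # Rescale log-probabilities: low temperature sharpens, high flattens.
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    top = max(logits)
    # Unnormalised weights, shifted by the max for numerical stability.
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.6, 0.3]
rng = random.Random(0)
cold = [sample_with_temperature(probs, 0.01, rng) for _ in range(20)]
warm = [sample_with_temperature(probs, 1.0, rng) for _ in range(200)]
print(set(cold), set(warm))
```

At a very low temperature the sampler behaves like `argmax`; at a temperature of 1 it draws from the model's distribution as is, so less likely words also appear.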

## Next Steps  <a class="anchor" id="fifth-bullet"></a>
-------------
