## Creating An AI-Based JFK Speech Writer: Part 2
-----------------------

__[1. Introduction](#first-bullet)__

__[2. Data Preparation](#second-bullet)__

__[3. A Simple Bidirectional GRU Model](#third-bullet)__

__[4. Generating Text](#fourth-bullet)__

__[5. Next Steps](#fifth-bullet)__


## Introduction <a class="anchor" id="first-bullet"></a>
----

In [1]:
import numpy as np
import tensorflow as tf 
from google.oauth2 import service_account
from google.cloud import storage
from google.cloud.exceptions import Conflict


tf.compat.v1.logging.set_verbosity('ERROR')
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Data Preparation <a class="anchor" id="second-bullet"></a>
----

In [2]:
credentials = service_account.Credentials.from_service_account_file('credentials.json')
client = storage.Client(project=credentials.project_id, credentials=credentials)

In [3]:
bucket = client.get_bucket("harmon-kennedy")

In [4]:
blob = bucket.blob("all_jfk_speeches.txt")

In [5]:
text = blob.download_as_text()

In [6]:
print(f'Length of text: {len(text)} characters')

Length of text: 7734579 characters


In [7]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

67 unique characters


In [8]:
text_in_words = [w for w in text.split(' ') if w.strip() != '' or w == '\n']

In [9]:
print(f'Length of text: {len(text_in_words)} words')

Length of text: 1338872 words


In [10]:
print(f"{len(set(text_in_words))} unique words")

42240 unique words


In [11]:
import string
 
# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
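
To make the effect of `clean_doc` concrete, here is a small self-contained run on a made-up sentence (the sample text is an assumption for illustration, not taken from the corpus). It shows that hyphenated words are fused rather than split and that numeric tokens are dropped, which is why `tafthartley` shows up as a single word in the training data.

```python
import string

def clean_doc(doc):
    """Same steps as the notebook's clean_doc, repeated so the example runs on its own."""
    doc = doc.replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    # keep alphabetic tokens only, lower-cased
    return [w.lower() for w in tokens if w.isalpha()]

sample = "The Taft-Hartley Act -- passed in 1947 -- isn't popular."
print(clean_doc(sample))
# ['the', 'tafthartley', 'act', 'passed', 'in', 'isnt', 'popular']
```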

In [12]:
clean_words = np.array(clean_doc(text))

In [13]:
clean_text = " ".join(clean_words)

In [14]:
clean_text[:300]

'of particular importance to south dakota are the farm policies of the republican party the party of benson nixon and mundt the party which offers our young people no incentive to return to the farm which offers the farmer only the prospect of lower and lower income and which offers the nation the vi'

In [15]:
print(f"{len(clean_words)} clean words")

1322685 clean words


In [16]:
print(f"{len(set(clean_words))} unique clean words")

22681 unique clean words


In [17]:
len(clean_text)

7533442

In [18]:
# TextVectorization is available directly under tf.keras.layers since TF 2.6
from tensorflow.keras.layers import TextVectorization

vocab_size = 10000
seq_length = 20

vectorize_layer = TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=vocab_size,
    output_mode="int",
    pad_to_max_tokens=True,
    output_sequence_length=seq_length,
)

Metal device set to: Apple M1


2023-04-06 17:15:20.988043: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-04-06 17:15:20.988277: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [19]:
vectorize_layer.adapt([text])

2023-04-06 17:15:22.011449: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-04-06 17:15:22.050759: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


In [20]:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [21]:
word_index['of']

3

In [22]:
word_index['particular']

758

In [23]:
words_seq = [clean_words[i:i + seq_length] for i in range(0, len(clean_words) - seq_length-1)]
next_word = [clean_words[i + seq_length] for i in range(0, len(clean_words) - seq_length-1)]

In [24]:
len(words_seq)

1322664
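
The two list comprehensions above build a sliding window: each run of `seq_length` words is an input and the word that follows it is the target. A toy version with a short made-up sentence and `seq_length = 3` (both chosen only for illustration) makes the pairing visible. Note that the notebook's `range` stops one step earlier than the one below, so it leaves out the final window; with over a million samples that is immaterial.

```python
words = "ask not what your country can do for you".split()
seq_length = 3

# Every window of seq_length words is an input; the word right after it is the target.
pairs = [(words[i:i + seq_length], words[i + seq_length])
         for i in range(len(words) - seq_length)]

for context, target in pairs:
    print(" ".join(context), "->", target)
```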

In [25]:
next_cat = np.array([word_index.get(word, 1) for word in next_word])

In [26]:
next_cat.shape

(1322664,)

In [28]:
X = [" ".join(words_seq[i]) for i in range(len(next_word)) if next_cat[i] != 1]

In [29]:
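# Pass num_classes explicitly so the one-hot width always matches the model's output layer
y = tf.keras.utils.to_categorical([cat for cat in next_cat if cat != 1], num_classes=vocab_size)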

In [31]:
X = np.array(X)
X.shape

(1299332,)

In [32]:
y.shape

(1299332, 10000)
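
A one-hot matrix of this shape is large. The sketch below estimates its size and illustrates, in plain Python, why integer labels carry the same information: the cross-entropy against a one-hot row reduces to the negative log-probability of the true class. Keras provides a `sparse_categorical_crossentropy` loss that accepts integer class ids directly; switching to it is a possible alternative, not what this notebook does. The helper functions and the three-word toy distribution are assumptions for illustration.

```python
import math

num_samples, vocab_size = 1_299_332, 10_000  # shape of y reported above

one_hot_bytes = num_samples * vocab_size * 4   # float32 one-hot matrix
sparse_bytes = num_samples * 8                 # one int64 label per sample
print(f"one-hot: {one_hot_bytes / 1e9:.1f} GB, integer labels: {sparse_bytes / 1e6:.1f} MB")

def categorical_crossentropy(one_hot, probs):
    # Only the entry where one_hot is 1 contributes to the sum.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    # Same quantity, computed from the class id alone.
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]   # toy softmax output over a 3-word vocabulary
label = 1
one_hot = [0.0, 1.0, 0.0]
print(categorical_crossentropy(one_hot, probs), sparse_categorical_crossentropy(label, probs))
```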

In [33]:
X[0]

'of particular importance to south dakota are the farm policies of the republican party the party of benson nixon and'

In [34]:
vectorize_layer(X[0])

<tf.Tensor: shape=(20,), dtype=int64, numpy=
array([   3,  758,  692,    5,  430, 2268,   16,    2,  156,  280,    3,
          2,  152,   68,    2,   68,    3,  756,  193,    4])>

In [35]:
next_word[:2]

['mundt', 'the']

In [36]:
word_index['the']

2

In [37]:
word_index['nation']

92

In [38]:
word_index['[UNK]']

1
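
Index 1 is the out-of-vocabulary token, which is why the earlier cells use `word_index.get(word, 1)` and then drop every sample whose target is 1. The toy example below repeats that filtering with a five-entry vocabulary laid out the way `TextVectorization` does it (index 0 for padding, index 1 for `[UNK]`); the vocabulary, contexts, and targets are made up for illustration.

```python
voc = ['', '[UNK]', 'the', 'of', 'party']   # toy vocabulary in TextVectorization's layout
word_index = dict(zip(voc, range(len(voc))))

contexts = ['the party of', 'party of the', 'of the party']
targets = ['benson', 'the', 'of']            # 'benson' is not in the toy vocabulary

# Unknown words fall back to id 1, the [UNK] token.
target_ids = [word_index.get(w, 1) for w in targets]

# Keep only the samples whose target is a known word.
kept = [(c, t) for c, t in zip(contexts, target_ids) if t != 1]
print(target_ids)  # [1, 2, 3]
print(kept)        # [('party of the', 2), ('of the party', 3)]
```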

In [39]:
y[0]

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [40]:
np.argmax(y[1])

2

## A Simple Bidirectional GRU Model <a class="anchor" id="third-bullet"></a>
----


<figure>
<img src="images/gru.png" alt="Diagram of a GRU cell" style="width:75%">
<figcaption align = "center">
From https://colah.github.io/posts/2015-08-Understanding-LSTMs/
</figcaption>
</figure>

<figure>
<img src="images/bidirectionalgru.png" alt="Structure of a bidirectional GRU model" style="width:75%">
<figcaption align = "center">
From https://www.researchgate.net/figure/The-structure-of-a-bidirectional-GRU-model_fig4_366204325
</figcaption>
</figure>
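
For reference, the GRU cell in the first figure can be written out as follows, using the notation of the linked post with bias terms omitted. Here $x_t$ is the input at step $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication. Conventions for which gate multiplies which term vary between implementations, so treat this as one common formulation rather than the exact arithmetic of any particular library.

$$
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

A bidirectional layer runs one GRU over the sequence from left to right and a second, independent GRU from right to left, then concatenates their final states:

$$
h = \left[\overrightarrow{h}_T \,;\, \overleftarrow{h}_1\right]
$$

so a layer built from GRUs with `units` hidden dimensions outputs a vector of size `2 * units`.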


In [42]:
from typing import Dict

In [165]:
class JFKSpeechWriter(tf.keras.Model):
    def __init__(self, 
                 text: str, 
                 seq_length: int,
                 vocab_size: int, 
                 embedding_dim: int, 
                 units: int) -> None:
        
        super().__init__()
        
        self.vectorizer_layer = TextVectorization(
                                    standardize="lower_and_strip_punctuation",
                                    max_tokens=vocab_size,
                                    output_mode="int",
                                    pad_to_max_tokens=True,
                                    output_sequence_length=seq_length,
                              )
        self.embedding_layer =  tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.GRU_layer = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units))
        self.dense_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')
        
        self.vectorizer_layer.adapt([text])
        
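    def call(self, input_str: tf.Tensor) -> tf.Tensor:
        # Raw strings -> integer token ids -> embeddings -> bidirectional GRU
        # encoding -> probability for every word in the vocabulary
        x = self.vectorizer_layer(input_str)
        x = self.embedding_layer(x)
        x = self.GRU_layer(x)
        return self.dense_layer(x)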
        
    def get_wordmap(self) -> Dict[int, str]:
        # Inverse vocabulary: token id -> word
        voc = self.vectorizer_layer.get_vocabulary()
        return dict(enumerate(voc))


In [166]:
model = JFKSpeechWriter(text=text, 
                        seq_length=20, 
                        vocab_size=10000, 
                        embedding_dim=128, 
                        units=64)



In [167]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [47]:
from sklearn.utils import shuffle

In [48]:
X_train, y_train = X[:100000], y[:100000]

In [49]:
X_train, y_train = shuffle(X_train, y_train)

In [50]:
X_train[0]

'state laws in states such as massachusetts on the other hand union security provisions more liberal than tafthartley are not'

In [168]:
model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2dbdfd430>

In [169]:
model.summary()

Model: "jfk_speech_writer_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_4 (TextV  multiple                 0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     multiple                  1280000   
                                                                 
 bidirectional_3 (Bidirectio  multiple                 74496     
 nal)                                                            
                                                                 
 dense_3 (Dense)             multiple                  1290000   
                                                                 
Total params: 2,644,496
Trainable params: 2,644,496
Non-trainable params: 0
_________________________________________________________________
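
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default layout (`reset_after=True`), in which each of the three weight sets (update gate, reset gate, candidate state) has input weights, recurrent weights, and two bias vectors.

```python
vocab_size, embedding_dim, units = 10_000, 128, 64

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding = vocab_size * embedding_dim

# One GRU direction: 3 weight sets, each with input weights, recurrent
# weights, and two bias vectors (Keras layout when reset_after=True).
gru_one_direction = 3 * (units * embedding_dim + units * units + 2 * units)
bidirectional = 2 * gru_one_direction

# The dense layer sees the concatenated forward and backward states (2 * units).
dense = (2 * units) * vocab_size + vocab_size

total = embedding + bidirectional + dense
print(embedding, bidirectional, dense, total)
# 1280000 74496 1290000 2644496
```

Nearly all of the parameters sit in the embedding and output layers; the recurrent part accounts for under 3% of the total.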


## Generating Text  <a class="anchor" id="fourth-bullet"></a>
-----------

In [161]:
test = str(X[688394])

In [162]:
test

'and constitution and in the american public school system and he stated flatly that he recognized no power in the'

In [163]:
y_pred = model.predict([test])



In [170]:
reverse_word_map = model.get_wordmap()

In [186]:
reverse_word_map[np.argmax(y_pred[0])]

'united'

In [172]:
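def next_words(input_str, n):
    """Greedily generate n words, feeding each prediction back into the input."""
    final_str = ''
    for i in range(n):
        prediction = model.predict([input_str], verbose=0)
        # Pick the single most likely word (greedy decoding)
        next_word = reverse_word_map[np.argmax(prediction[0])]
        final_str += next_word + ' ' 
        input_str += ' ' + next_word
        # Drop the oldest word so the context keeps the same number of words
        input_str = ' '.join(input_str.split(' ')[1:])
    return final_str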

In [184]:
new_text = next_words(test, 10)

In [185]:
test + " " + new_text

'and constitution and in the american public school system and he stated flatly that he recognized no power in the words of the house of representatives and women of the '
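
The generated text falls back on "of the" more than once, which is typical when the most likely word is chosen at every step. One common alternative, not used in this notebook, is to sample the next word from the predicted distribution after rescaling it with a temperature. The sketch below is a minimal pure-Python version; `sample_with_temperature` and the toy probabilities are hypothetical names and values for illustration.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from probs after rescaling by temperature.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more varied output).
    """
    # Rescale log-probabilities; the tiny constant guards against log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # Exponentiate with the max subtracted for numerical stability.
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_probs = [0.05, 0.6, 0.25, 0.1]  # made-up softmax output over 4 words
draws = [sample_with_temperature(toy_probs, temperature=0.8, rng=rng) for _ in range(10)]
print(draws)
```

To try it in `next_words`, the `np.argmax(prediction[0])` call would be replaced by `sample_with_temperature(prediction[0], temperature)`; a suitable temperature has to be found by experiment.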

## Next Steps  <a class="anchor" id="fifth-bullet"></a>
-------------
