TED talk text generator

Generate text using a character-based model.

The code in this notebook is a partial modification of several tutorials, blog posts and YouTube videos. The main sources include:
* [Text Generation With LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/) by Dr. Jason Brownlee
* [Text Generation with Python and TensorFlow/Keras](https://stackabuse.com/text-generation-with-python-and-tensorflow-keras/) by Dan Nelson
* YouTube: [167 - Text prediction using LSTM (English text)](https://youtu.be/zyCpntcVKSo?t=2) by Sreenivas B.


In [306]:
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.train import latest_checkpoint
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, LSTM
import random
import textwrap
import time

In [307]:
# Attach to my Google drive so I can save the csv file later
# from google.colab import drive
# drive.mount('/content/drive')

In [308]:
#tag_name = 'all_transcripts'
#tag_name = 'brain'
#tag_name = 'business'
#tag_name = 'creativity'
#tag_name = 'culture'
#tag_name = 'psychology'
#tag_name = 'science'
tag_name = 'all'

lower_corpus = False

if lower_corpus:
    corpus_file_name = tag_name + '_lowercase'
else:
    corpus_file_name = tag_name + '_sentence_case'

# This switch makes it easier running locally or in a Google Colab environment
colab = False

if colab:
    root_path = '/content/drive/MyDrive/ted/'
    data_path = f'{root_path}'
else:
    root_path = '../'
    data_path = f'{root_path}data/'
    
model_path = f'{root_path}models/'
logs_path = f'{root_path}logs/'
    
sequence_length = 80

In [309]:
# The with block closes the file automatically, so no explicit close() is needed
with open(f'{data_path}{corpus_file_name}.txt', encoding='utf-8') as f:
    corpus = f.read()

In [310]:
corpus[0:500]

"Good morning. How are you?  Good. It's been great, hasn't it? I've been blown away by the whole thing. In fact, I'm leaving.  There have been three themes running through the conference, which are relevant to what I want to talk about. One is the extraordinary evidence of human creativity in all of the presentations that we've had and in all of the people here; just the variety of it and the range of it. The second is that it's put us in a place where we have no idea what's going to happen in te"

The code below appears in almost every article I have seen about LSTMs for text generation, but this snippet is adapted from:
https://www.kaggle.com/mrisdal/intro-to-lstms-w-keras-gpu-for-text-generation/

In [311]:
all_chars = sorted(list(set(corpus)))

#remove most punctuation, but leave spaces, exclamation points, and periods
list_to_remove = ['"', '&', '-', ',', "'", '/', ';',  '—', '%','[', ']' ]

chars = [character for character in all_chars if character not in list_to_remove]
print(chars)
# hat tip to this post for the idea of removing one list from another
# https://www.geeksforgeeks.org/python-remove-all-values-from-a-list-present-in-other-list/

[' ', '!', '$', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'è', 'é', '–', '’', '“', '”', '…', '♫']


In [312]:
# remove all unwanted characters from the original corpus
corpus_list = [character for character in corpus if character not in list_to_remove]

# Re-join all letters back into a single string as the updated corpus
corpus = ''.join(corpus_list)
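
The filtering step above rebuilds the corpus one character at a time. As an alternative sketch, the same result can be produced with `str.translate`. The `sample` string below is made up for illustration and is not taken from the TED corpus.

```python
# Characters to drop, matching the list used in this notebook
list_to_remove = ['"', '&', '-', ',', "'", '/', ';', '—', '%', '[', ']']

# With a third argument, str.maketrans maps each listed character to None,
# which tells translate() to delete it
removal_table = str.maketrans('', '', ''.join(list_to_remove))

# Hypothetical sample text, used only to demonstrate the call
sample = "It's been great, hasn't it? [Laughter] 50% - done."
cleaned = sample.translate(removal_table)
print(cleaned)
```

Spaces are kept, so removing a character that sits between two spaces leaves a double space behind, just as the list comprehension does.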

In [313]:
# If running models with punctuation removed, leave commented out
#chars = sorted(list(set(corpus)))

print('Number of unique characters in the corpus:', len(chars))

char_indices = dict((c, i) for i, c in enumerate(chars))

indices_char = dict((i, c) for i, c in enumerate(chars))

Number of unique characters in the corpus: 75


In [314]:
# Show character to index mapping
print (char_indices)

{' ': 0, '!': 1, '$': 2, '.': 3, '0': 4, '1': 5, '2': 6, '3': 7, '4': 8, '5': 9, '6': 10, '7': 11, '8': 12, '9': 13, '?': 14, 'A': 15, 'B': 16, 'C': 17, 'D': 18, 'E': 19, 'F': 20, 'G': 21, 'H': 22, 'I': 23, 'J': 24, 'K': 25, 'L': 26, 'M': 27, 'N': 28, 'O': 29, 'P': 30, 'Q': 31, 'R': 32, 'S': 33, 'T': 34, 'U': 35, 'V': 36, 'W': 37, 'X': 38, 'Y': 39, 'Z': 40, 'a': 41, 'b': 42, 'c': 43, 'd': 44, 'e': 45, 'f': 46, 'g': 47, 'h': 48, 'i': 49, 'j': 50, 'k': 51, 'l': 52, 'm': 53, 'n': 54, 'o': 55, 'p': 56, 'q': 57, 'r': 58, 's': 59, 't': 60, 'u': 61, 'v': 62, 'w': 63, 'x': 64, 'y': 65, 'z': 66, 'è': 67, 'é': 68, '–': 69, '’': 70, '“': 71, '”': 72, '…': 73, '♫': 74}


In [315]:
# Show index to character mapping
print(indices_char)

{0: ' ', 1: '!', 2: '$', 3: '.', 4: '0', 5: '1', 6: '2', 7: '3', 8: '4', 9: '5', 10: '6', 11: '7', 12: '8', 13: '9', 14: '?', 15: 'A', 16: 'B', 17: 'C', 18: 'D', 19: 'E', 20: 'F', 21: 'G', 22: 'H', 23: 'I', 24: 'J', 25: 'K', 26: 'L', 27: 'M', 28: 'N', 29: 'O', 30: 'P', 31: 'Q', 32: 'R', 33: 'S', 34: 'T', 35: 'U', 36: 'V', 37: 'W', 38: 'X', 39: 'Y', 40: 'Z', 41: 'a', 42: 'b', 43: 'c', 44: 'd', 45: 'e', 46: 'f', 47: 'g', 48: 'h', 49: 'i', 50: 'j', 51: 'k', 52: 'l', 53: 'm', 54: 'n', 55: 'o', 56: 'p', 57: 'q', 58: 'r', 59: 's', 60: 't', 61: 'u', 62: 'v', 63: 'w', 64: 'x', 65: 'y', 66: 'z', 67: 'è', 68: 'é', 69: '–', 70: '’', 71: '“', 72: '”', 73: '…', 74: '♫'}
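
The two dictionaries above are inverses of each other. A minimal sketch on a toy string (the names `toy_corpus`, `char_to_index`, and `index_to_char` are hypothetical and used only here) shows that encoding and then decoding returns the original text:

```python
# Hypothetical toy corpus, standing in for the TED transcript text
toy_corpus = "hello world"

# Sorting the unique characters gives the same index assignment on every run
toy_chars = sorted(set(toy_corpus))
char_to_index = {c: i for i, c in enumerate(toy_chars)}
index_to_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integers, then decode it back to characters
encoded = [char_to_index[c] for c in toy_corpus]
decoded = ''.join(index_to_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```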


In [316]:
print (f'The input corpus contains {len(corpus):,} characters.')

The input corpus contains 267,106 characters.


The model will use a fixed-length sequence of characters, e.g. 40, and then predict the next character that will appear (the 41st). In this notebook, `sequence_length` is set to 80, so the model predicts the 81st character.

I would like to change this sequence length in different models, to see how it might affect the model's performance.

Larger might be better?

Loop through the corpus, creating segments of `sequence_length` characters, each paired with the single character that follows it. These segments are then converted into their corresponding integer indices and loaded into X.

In addition, create a list of target values. The target, y, is a single text character, and is also converted into a corresponding numeric value.

In [317]:
corpus[0:50]

'Good morning. How are you?  Good. Its been great h'

In [318]:
corpus[1:41]

'ood morning. How are you?  Good. Its bee'

In [319]:
corpus[41]

'n'

In [320]:
# For future modeling, we could try to offset the sequences by more than one
# letter. For now the model will look at every sequence right after another.
step = 1

X_numeric_list = []
y_numeric_list = []


for i in range(0, len(corpus) - sequence_length, step):
    # X is the window of sequence_length characters starting at index i.
    # For example, with a length of 40 and i = 1, this is corpus[1:41]: the 2nd
    # through 41st characters, since the string index is zero-based and the
    # end of the slice is exclusive.
    X_char_sequence = corpus[i:i + sequence_length]
    # y is the single character that immediately follows the window
    y_char = corpus[i + sequence_length]

    # Convert the X character sequence into a list of integers, using the
    # dictionary created above.
    X_numeric_list.append([char_indices[letter] for letter in X_char_sequence])
    # Also convert target letter y to its corresponding numeric value in the
    # dictionary
    y_numeric_list.append(char_indices[y_char])



In [321]:
# Look at the first two converted sequences
print ( X_numeric_list[0])
print ( X_numeric_list[1])


[21, 55, 55, 44, 0, 53, 55, 58, 54, 49, 54, 47, 3, 0, 22, 55, 63, 0, 41, 58, 45, 0, 65, 55, 61, 14, 0, 0, 21, 55, 55, 44, 3, 0, 23, 60, 59, 0, 42, 45, 45, 54, 0, 47, 58, 45, 41, 60, 0, 48, 41, 59, 54, 60, 0, 49, 60, 14, 0, 23, 62, 45, 0, 42, 45, 45, 54, 0, 42, 52, 55, 63, 54, 0, 41, 63, 41, 65, 0, 42]
[55, 55, 44, 0, 53, 55, 58, 54, 49, 54, 47, 3, 0, 22, 55, 63, 0, 41, 58, 45, 0, 65, 55, 61, 14, 0, 0, 21, 55, 55, 44, 3, 0, 23, 60, 59, 0, 42, 45, 45, 54, 0, 47, 58, 45, 41, 60, 0, 48, 41, 59, 54, 60, 0, 49, 60, 14, 0, 23, 62, 45, 0, 42, 45, 45, 54, 0, 42, 52, 55, 63, 54, 0, 41, 63, 41, 65, 0, 42, 65]


In [322]:
print ( y_numeric_list[0:2])

[65, 0]


In [323]:
indices_char[51]

'k'

In [324]:
corpus[0:43]

'Good morning. How are you?  Good. Its been '

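This example is a little hard to follow. It was written for a sequence length of 40, where the first target (`corpus[40]`) falls in the middle of the word 'been'. In the current mapping, the letter e is index 45, and index 51 is 'k'. With `sequence_length = 80`, the first two targets are 65 ('y') and 0 (a space).

However, the beginning of the sequence has been shifted by one as well.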
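
The shifting windows are easier to see with a very short sequence length. This is a minimal sketch that assumes a length of 5 rather than the 80 used in this notebook. `toy_corpus` and `toy_sequence_length` are hypothetical names.

```python
toy_corpus = "Good morning."
toy_sequence_length = 5  # assumed small value so the output is easy to read
step = 1

windows = []
targets = []
for i in range(0, len(toy_corpus) - toy_sequence_length, step):
    # Input window of toy_sequence_length characters starting at i
    windows.append(toy_corpus[i:i + toy_sequence_length])
    # Target is the single character right after the window
    targets.append(toy_corpus[i + toy_sequence_length])

# Show the first three (window, target) pairs
for window, target in zip(windows[:3], targets[:3]):
    print(repr(window), '->', repr(target))
```

Each window starts one character later than the previous one, and the number of windows is `len(toy_corpus) - toy_sequence_length`.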

In [325]:
print (f'There are {len(X_numeric_list):,} entries in X_numeric_list') 

There are 267,026 entries in X_numeric_list


An LSTM model needs the X data to be in 3 dimensions:
* Samples (number of rows)
* Time steps (the position within each sequence; in this example this corresponds to the sequence_length)
* Features (the number of values observed at each time step; here it is 1, the character's index)

This [blog post by Dr. Jason Brownlee](https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/) contains a longer explanation, which I paraphrased here.

In [326]:
# reshape X to be [samples, time steps, features]
# Samples is the total number of text fragments created above
# Time step is the sequence length, which in this case is 80, but might change
# in other model variations
# Feature is one, since each time step holds a single value (a character index)
X = np.reshape(X_numeric_list, (len(X_numeric_list), sequence_length, 1))

In [327]:
X.shape

(267026, 80, 1)

In [328]:
# normalize/scale the data by dividing by the length of the character list.
X = X / len(chars)
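
The reshape and scaling steps can be checked on a tiny array. This is a minimal sketch with made-up values. `toy_sequences` and `toy_vocab_size` are hypothetical and stand in for `X_numeric_list` and `len(chars)`.

```python
import numpy as np

# Hypothetical data: 3 sequences of 4 character indices each
toy_sequences = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
toy_vocab_size = 5  # assumed number of unique characters

# Shape becomes (samples, time steps, features) = (3, 4, 1)
toy_X = np.reshape(toy_sequences, (len(toy_sequences), 4, 1))

# Dividing by the vocabulary size scales every index into the range [0, 1)
toy_X = toy_X / toy_vocab_size

print(toy_X.shape)
print(toy_X[0].ravel())
```

Because the largest index is always one less than the vocabulary size, the scaled values never reach 1.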

In [329]:
# Look at the first value of X
#X[0]

Even though the values for the target variable are integers, they are essentially labels for the predicted characters. They are one-hot encoded using the to_categorical function.

In [330]:
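# Pass num_classes so the width always matches the character list, even if the
# highest index never appears as a target
y = to_categorical(y_numeric_list, num_classes=len(chars))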
y[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.], dtype=float32)
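
One-hot encoding can also be reproduced with plain NumPy, which makes it easier to see what the array above contains. This sketch is not the Keras implementation, only an illustration of the same idea. `toy_targets` and `toy_num_classes` are made-up values.

```python
import numpy as np

toy_targets = [2, 0, 3]   # hypothetical target indices
toy_num_classes = 5       # assumed vocabulary size

# Row i of the identity matrix is the one-hot vector for class i
toy_y = np.eye(toy_num_classes, dtype='float32')[toy_targets]

print(toy_y)
# argmax along each row recovers the original integer labels
print(toy_y.argmax(axis=1))
```

Each row has a single 1 at the position of its label, just as `y[0]` above has its single 1 at index 65.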

---
### Load LSTM Model
In this notebook, load a model that was previously trained and saved.

In [331]:
# Load the model that was previously saved to my Google Drive from another notebook
# model = load_model(f'/content/drive/MyDrive/ted/models/ted_model_{tag_name}')

# model.summary()

In [332]:
# # define the LSTM model
# model = Sequential()
# model.add(LSTM(1000, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
# model.add(Dropout(0.2))
# model.add(LSTM(256))
# model.add(Dropout(0.2))
# model.add(Dense(y.shape[1], activation='softmax'))


# # Load the previously saved weights
# model.load_weights(latest)

# model.compile(loss='categorical_crossentropy', optimizer='adam' , metrics = ['accuracy','Recall','Precision'])


#Models are saved in this format:
# model.save(f'{model_path}/{tag_name}_{input_nodes_count}_{layer_2_nodes_count}')

model = load_model(f'{model_path}{tag_name}_1200_80_80')

model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_9 (LSTM)                (None, 80, 1200)          5769600   
_________________________________________________________________
dropout_9 (Dropout)          (None, 80, 1200)          0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 80, 600)           4322400   
_________________________________________________________________
dropout_10 (Dropout)         (None, 80, 600)           0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 200)               640800    
_________________________________________________________________
dropout_11 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 75)               
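
The parameter counts in the summary can be checked by hand. This sketch assumes the standard Keras LSTM layout with a bias term and no other options changed. `lstm_param_count` is a hypothetical helper written for this check.

```python
def lstm_param_count(units, input_dim):
    # Each of the 4 gates has an input kernel (input_dim x units),
    # a recurrent kernel (units x units), and a bias vector (units)
    return 4 * (units * (input_dim + units) + units)

# (units, input_dim) pairs read from the Output Shape column of the summary
print(lstm_param_count(1200, 1))     # lstm_9: one feature per time step
print(lstm_param_count(600, 1200))   # lstm_10: fed by the 1200 units of lstm_9
print(lstm_param_count(200, 600))    # lstm_11: fed by the 600 units of lstm_10
```

The three results match the Param # column for the LSTM layers: 5,769,600, 4,322,400, and 640,800.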

X_numeric_list contains all of our 80-character sequences. Choose one as a seed for the text generation.

In [333]:
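# randint is inclusive at both ends, so stop at the last valid index
seed_start = random.randint(0, len(X_numeric_list) - 1)
print(f'Random start index for seed text: {seed_start} sequence.')

# Copy the sequence so later changes to X_seeded do not alter X_numeric_list
X_seeded = list(X_numeric_list[seed_start])
print(X_seeded)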

Random start index for seed text: 98917 sequence.
[54, 60, 55, 0, 41, 0, 42, 45, 60, 60, 45, 58, 0, 46, 61, 60, 61, 58, 45, 3, 0, 27, 41, 65, 42, 45, 0, 60, 48, 49, 59, 0, 43, 41, 53, 45, 0, 46, 58, 55, 53, 0, 53, 65, 0, 52, 55, 62, 45, 0, 55, 46, 0, 26, 45, 47, 55, 59, 0, 41, 54, 44, 0, 60, 48, 45, 0, 46, 58, 45, 45, 44, 55, 53, 0, 55, 46, 0, 45, 64]


Convert this back to text so we can see what it looks like.

In [334]:
seed_string = ''.join([indices_char[letter_code] for letter_code in X_seeded])
seed_string

'nto a better future. Maybe this came from my love of Legos and the freedom of ex'

In [335]:
# We can also use the seed_start value to see a larger picture of the original text
for j in range(seed_start, seed_start + 15):
    seed_numeric = X_numeric_list[j]
    print (''.join([indices_char[letter_code] for letter_code in seed_numeric]))

nto a better future. Maybe this came from my love of Legos and the freedom of ex
to a better future. Maybe this came from my love of Legos and the freedom of exp
o a better future. Maybe this came from my love of Legos and the freedom of expr
 a better future. Maybe this came from my love of Legos and the freedom of expre
a better future. Maybe this came from my love of Legos and the freedom of expres
 better future. Maybe this came from my love of Legos and the freedom of express
better future. Maybe this came from my love of Legos and the freedom of expressi
etter future. Maybe this came from my love of Legos and the freedom of expressio
tter future. Maybe this came from my love of Legos and the freedom of expression
ter future. Maybe this came from my love of Legos and the freedom of expression 
er future. Maybe this came from my love of Legos and the freedom of expression t
r future. Maybe this came from my love of Legos and the freedom of expression th
 future. Maybe this came fro

To generate text, we need to reshape the seed sequence into 3 dimensions, as we did for the X input data.

In [336]:
temp_x = np.reshape(X_seeded, (1, len(X_seeded) , 1))
# normalize/scale the data by dividing by the length of the character list.
temp_x = temp_x / len(chars)
prediction = model.predict(temp_x, verbose=0)
print(f"A prediction's shape:{prediction.shape}")
print('Prediction:')
print (prediction)

A prediction's shape:(1, 75)
Prediction:
[[9.14849865e-04 2.89303945e-11 2.85180807e-33 1.64037468e-04
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 2.07907415e-05 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 2.51376933e-23 7.31606490e-26 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.46731178e-20 4.52206310e-23 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  1.70980430e-19 1.02972105e-01 1.55366142e-05 7.20195519e-03
  1.54978488e-06 6.73035145e-01 1.88986014e-05 2.01521161e-05
  3.28967534e-03 5.06836772e-02 4.97617478e-20 1.90162697e-04
  6.55616925e-04 1.37343406e-04 1.05363759e-03 7.07966415e-03
  1.33326352e-01 0.00000000e+00 2.02329201e-03 5.89174079e-03
  1.01109911e-02 8.51505087e-

Each prediction produces a result of shape (1, 75), which contains a predicted probability for each of the 75 characters. Use argmax to find the index of the highest probability.

In [337]:
np.argmax(prediction)

45

So the first letter predicted after the seed sequence is:

In [338]:
indices_char[np.argmax(prediction)]

'e'
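
The argmax step can be tried on a small, made-up probability vector. `toy_indices_char` and `toy_prediction` below are hypothetical and only mimic the shape of a real prediction.

```python
import numpy as np

# Hypothetical 5-character vocabulary and a made-up softmax output
toy_indices_char = {0: ' ', 1: 'a', 2: 'e', 3: 's', 4: 't'}
toy_prediction = np.array([[0.05, 0.10, 0.60, 0.05, 0.20]])

# argmax works on the flattened (1, 5) array, so the result is a single index
best_index = int(np.argmax(toy_prediction))
print(best_index, repr(toy_indices_char[best_index]))

# The probabilities in one prediction add up to 1 (within rounding)
print(toy_prediction.sum())
```

Always taking the most likely character means the same seed sequence always produces the same continuation.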

In [339]:
# See how long the prediction takes
start_predict = time.perf_counter()

# Put all predicted characters in a list, then convert to a string later.
complete_predicted_text =[]

# Generate 600 characters
for i in range(600):
    # reshape to 1 row, sequence length, and 1 feature per time step
    temp_x = np.reshape(X_seeded, (1, len(X_seeded), 1))
    # normalize/scale the data by dividing by the length of the character list.
    temp_x = temp_x / len(chars)
    prediction = model.predict(temp_x, verbose=0)
    next_index = int(np.argmax(prediction))
    complete_predicted_text.append(indices_char[next_index])

    # Drop the oldest character and add the predicted character's numeric
    # value to the end, creating a new sequence for the next prediction
    X_seeded = X_seeded[1:] + [next_index]


#print (complete_predicted_text)
end_predict = time.perf_counter()
print(f'Prediction took: {end_predict - start_predict}')

Prediction took: 75.74487849999787
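
The rolling-window logic in the loop can be followed without a trained model. In this sketch, `toy_predict_next` is a hypothetical stand-in for `model.predict` followed by argmax. It is not a real model and only makes the window updates visible.

```python
def toy_predict_next(window):
    # Hypothetical stand-in for model.predict + argmax: it simply returns
    # the index after the last one in the window, wrapping around at 5
    return (window[-1] + 1) % 5

toy_indices_char = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}
window = [0, 1, 2]   # seed sequence of length 3
generated = []

for _ in range(6):
    next_index = toy_predict_next(window)
    generated.append(toy_indices_char[next_index])
    # Drop the oldest index and append the new one; the length stays constant
    window = window[1:] + [next_index]

print(''.join(generated))
print(window)
```

Each predicted character becomes part of the input for the next prediction, so an early mistake is carried forward through the rest of the generated text.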


The predicted results are a list of characters, so convert them into a string.

In [340]:
predicted_string = ''.join(complete_predicted_text)

# Wrap to 120 characters just so I can visually compare to the original string.
textwrap.wrap(seed_string + '->'+ predicted_string, width=120)

['nto a better future. Maybe this came from my love of Legos and the freedom of ex->er and I was a sesuire of the sight of',
 'the sight of the sight of the sight of the sight of the sesearch and they were seally wanking and the second thing that',
 'we are seloesing on the people who were the second tecond was that they were seally wasching the second technology of',
 'the sesearch and they were seally what they were seally selling about the people who were the second teconds to selldd',
 'to see the second technology of the sesearch and they were seally selling about the people who had the second tertion of',
 'the sesearch and they were seally what they were seally selling about the people wh']

In [341]:
# Find the original seed string to see what came after it
original_paragraph = corpus[corpus.find(seed_string): corpus.find(seed_string) + 600]

textwrap.wrap(original_paragraph, width=120)

['nto a better future. Maybe this came from my love of Legos and the freedom of expression that I felt when I was building',
 'with them. And this was also derived from my family and my mentors who always make me feel whole and good about myself.',
 'Now today my ambitions have changed a little bit I’d like to go into the field of Biology maybe cell biology or genetics',
 'or biochemistry or really anything. This is a friend of mine who I look up to Francis Collins the director of the NIH',
 'and this is us at TEDMED last year chatting away. I feel that no matter what I choose to become I believe that I can cha']

The wrapped text above is easier to read in a notebook, but when copying and pasting into a Google Slides presentation, the plain string is easier to work with.

Use `print()` to show each result as one continuous string instead of a list of wrapped lines.

In [342]:
print(seed_string + '->'+ predicted_string)

nto a better future. Maybe this came from my love of Legos and the freedom of ex->er and I was a sesuire of the sight of the sight of the sight of the sight of the sight of the sesearch and they were seally wanking and the second thing that we are seloesing on the people who were the second tecond was that they were seally wasching the second technology of the sesearch and they were seally what they were seally selling about the people who were the second teconds to selldd to see the second technology of the sesearch and they were seally selling about the people who had the second tertion of the sesearch and they were seally what they were seally selling about the people wh


In [343]:
print(original_paragraph)

nto a better future. Maybe this came from my love of Legos and the freedom of expression that I felt when I was building with them. And this was also derived from my family and my mentors who always make me feel whole and good about myself. Now today my ambitions have changed a little bit I’d like to go into the field of Biology maybe cell biology or genetics or biochemistry or really anything. This is a friend of mine who I look up to Francis Collins the director of the NIH and this is us at TEDMED last year chatting away. I feel that no matter what I choose to become I believe that I can cha
