<a href="https://colab.research.google.com/github/nbeaudoin/Sermon-Generator/blob/master/Sermons_Text_Generator_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sources: 
 - https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py 
 - https://www.youtube.com/watch?v=QtQt1CUEE3w
 - https://github.com/TannerGilbert/Tutorials/blob/master/Keras-Tutorials/4.%20LSTM%20Text%20Generation/Keras%20LSTM%20Text%20Generation.ipynb
 - https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_10_3_text_generation.ipynb
 - https://www.youtube.com/watch?v=6ORnRAz3gnA
 - http://www.datastuff.tech/machine-learning/lstm-how-to-train-neural-networks-to-write-like-lovecraft/
 - https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/

# Load data from local directory to Cloud

In [1]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sermon_corpus.txt to sermon_corpus.txt
User uploaded file "sermon_corpus.txt" with length 2500012 bytes


# Imports

In [1]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import keras
import numpy as np
import random
import sys
import io
import re

Using TensorFlow backend.


# Open data from Cloud directory

In [2]:

with io.open('sermon_corpus.txt', encoding='utf-8') as f:
    text = f.read()
print('corpus length:', len(text))

FileNotFoundError: ignored

In [0]:
text[:1000]

# Pre-processing

In [0]:
processed_text = text

# remove all special characters 
processed_text = re.sub(r'[^\x00-\x7f]', r' ', processed_text) ## remove encodings
processed_text = re.sub(r'[\n\r\t]+',r' ', processed_text) ## remove newlines, tabs, etc
processed_text = re.sub(r'[^A-Za-z.,]',r' ', processed_text) ## remove special characters

sorted(list(set(processed_text)))

In [0]:
processed_text[:1000]

# Create character mappings

In [0]:

print('corpus length:', len(processed_text))
chars = sorted(list(set(processed_text)))

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Sequence creation

Creating sub sequences of each step. Whatis going to happen is we will be creating a word like "hell" but then the algorithm will nede to predict the "o" at the end for "hello." By take each letter combination and the letter that follows after it in a seperate "train_test_split", we can see what the actual letter that follows is. 

For example, a word like computer is the ground truth. Our model gets fed "compute" and then needs to predict the "r". Since the "r" is the actual value for that letter combination, it is stored as a training set. This is what the following code allows us to do: create a word and see if we can guess what that word is.

 - maxlen: size of blocks we are giving model
 - step: number of characters to move forwad (1 will give you completely redundant characters)

In [0]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(processed_text) - maxlen, step):
    sentences.append(processed_text[i: i + maxlen])
    next_chars.append(processed_text[i + maxlen])
print('nb sequences:', len(sentences))

In [0]:
sentences[:10]

In [0]:
# Test out what we just produced
print(sentences[:3])
print(next_chars[:3])

# Vectorization

- X is going to be the input that will specify the sequences
- y is the expected output from those sequences

In [0]:
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


In [0]:
x.shape

In [0]:
y.shape

# Design model structure

I will be trying multiple scenarios with layers, dropout to avoid overfitting, learning rates, and loss functions. This will get interesting!

 - return_sequences=True: want the next sequence to have the same inputs as the previous layer
 - chars: all alphanumeric characters

In [0]:
# What does input shape look like?
input_shape=(maxlen, len(chars))
print(input_shape)

In [0]:
# print('Build model...')
# model = Sequential()
# model.add(LSTM(128, dropout=0.2, name='LSTM_layer_1', return_sequences=True, input_shape=(maxlen, len(chars))))
# model.add(LSTM(len(chars), dropout=0.2, name='LSTM_layer_2'))
# model.add(Dense(len(chars), activation='softmax'))

# model.summary()

print('Build model...')
model = Sequential()
model.add(LSTM(128, dropout=0.2, name='LSTM_layer_1', input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.summary()

In [0]:
optimizer = RMSprop(lr=0.01)
model.compile(optimizer=optimizer, 
              loss="binary_crossentropy",
              metrics=["mean_squared_error","binary_crossentropy"])

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

 - Temperature: 1.0 most conservative, doesn't want spelling errors, while 0.0 is the most confidence and will make many errors, including spelling
 - preds: output neurons

In [0]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)


# Creating callbacks

In [0]:
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [0]:
from keras.callbacks import ModelCheckpoint

filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss',
                             verbose=1, save_best_only=True,
                             mode='min')

In [0]:
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,
                              patience=1, min_lr=0.001)

In [0]:
callbacks = [print_callback, checkpoint, reduce_lr]

# Run the model

This model will allow us to have a call back mechanism that will generate text at each epoch. This is a great way to sanity check what the model is predicting and seeing if it is improving over each run. You will also not that the diversity is set to various increments. This is the "temperature" parameter that adjusts the weight of the log function when the crossentropy calculation is creating predicted probabilities.

Low temperature will often times give you words that repeat because it is essentially repeating the same accurate prediction. The good part is that they will all be real words. However, a high temperature will give us words that are more interesting and surprising. This is the piece that allows creativity to enter into the sequence.

In [0]:
model.fit(x, 
          y,
          batch_size=128,
          epochs=30,
          validation_split=0.1,
          callbacks=[print_callback])

In [0]:
def generate_text(length, diversity):
    # Get random starting text
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    for i in range(length):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char
    return generated

In [178]:
generate_text(500, 0.2)

KeyError: ignored