<a href="https://colab.research.google.com/github/nbeaudoin/Sermon-Generator/blob/master/Sermons_Text_Generator_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sources: 
 - https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py 
 - https://www.youtube.com/watch?v=QtQt1CUEE3w
 - https://github.com/TannerGilbert/Tutorials/blob/master/Keras-Tutorials/4.%20LSTM%20Text%20Generation/Keras%20LSTM%20Text%20Generation.ipynb
 - https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_10_3_text_generation.ipynb
 - https://www.youtube.com/watch?v=6ORnRAz3gnA
 - http://www.datastuff.tech/machine-learning/lstm-how-to-train-neural-networks-to-write-like-lovecraft/
 - https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/

# Load data from local directory to Cloud

In [1]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Imports

In [2]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import keras
import numpy as np
import random
import sys
import io
import re

Using TensorFlow backend.


# Open data from Cloud directory

In [3]:

with io.open('sermon_corpus.txt', encoding='utf-8') as f:
    text = f.read()
print('corpus length:', len(text))

corpus length: 2464213


In [4]:
text[:1000]

'\tThe Emmaus Discovery Team was formed this last winter, charged with helping me lead us through this interim period together.  The time between the departure of a former minister and the call of a new minister is a time when a congregation can clarify who it is and what God might have it do next. Earlier this spring, the Team led small groups identifying the challenges and blessings of your 46 years together so far.  Today, the Discovery Team invites us to consider the values that “drive us” here:  “the shared preferences or choices that are consistently prioritized in our behavior together.”  What makes us do what we do?  We are interested not so much in the values we aspire for, but a sense of the core values that shape the way you are together.  Indentifying our values is a way of asking: “What is most important to us, really?”  Not an easy question, as it means taking a hard, honest look at ourselves.\n\t\n\tDiscussion about values is not held in a vacuum.  It happens within the 

# Pre-processing

In [5]:
processed_text = text

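# Clean the text down to a small character set
# 1. Drop every non-ASCII character (curly quotes, accented letters, etc.)
processed_text = re.sub(r'[^\x00-\x7f]', r'', processed_text)
# 2. Collapse runs of newlines, carriage returns, and tabs into one space
processed_text = re.sub(r'[\n\r\t]+', r' ', processed_text)
# 3. Replace anything outside letters, digits, and . , : ; with a space
#    (the “ and ’ in the class never match because step 1 already removed them)
processed_text = re.sub(r'[^A-Za-z0-9.,:;“’]', r' ', processed_text)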

sorted(list(set(processed_text)))

[' ',
 ',',
 '.',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [6]:
processed_text[:1000]

' The Emmaus Discovery Team was formed this last winter, charged with helping me lead us through this interim period together.  The time between the departure of a former minister and the call of a new minister is a time when a congregation can clarify who it is and what God might have it do next. Earlier this spring, the Team led small groups identifying the challenges and blessings of your 46 years together so far.  Today, the Discovery Team invites us to consider the values that drive us here:  the shared preferences or choices that are consistently prioritized in our behavior together.  What makes us do what we do   We are interested not so much in the values we aspire for, but a sense of the core values that shape the way you are together.  Indentifying our values is a way of asking: What is most important to us, really   Not an easy question, as it means taking a hard, honest look at ourselves. Discussion about values is not held in a vacuum.  It happens within the ongoing tensio

# Create character mappings

In [7]:

print('corpus length:', len(processed_text))
chars = sorted(list(set(processed_text)))

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 2441771


# Sequence creation

Next we cut the text into overlapping subsequences. Each training example is a block of `maxlen` characters paired with the single character that follows it. Given "hell", for instance, the model has to predict the "o" that completes "hello". Storing every block in `sentences` and the character after it in `next_chars` gives us the inputs and their known answers.

For example, suppose the ground truth is the word "computer". The model is fed "compute" and must predict the "r". Because "r" is the character that actually follows that block, it is stored as the training target. This is what the following code does: it builds the blocks and records the character the model should guess for each.

 - maxlen: number of characters in each block given to the model
 - step: number of characters to move forward between blocks (a step of 1 produces heavily overlapping, mostly redundant blocks)
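
As a small illustration of the windowing described above, the sketch below applies the same slicing to a toy string. The names `toy_text`, `toy_maxlen`, and `toy_step` are made up for this example and are not part of the notebook.

```python
# Toy illustration of the sliding-window split (hypothetical names, not the notebook's data)
toy_text = "hello world"
toy_maxlen = 4   # characters shown to the model
toy_step = 3     # how far the window moves each time

windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])   # input block
    targets.append(toy_text[i + toy_maxlen])      # character to predict

for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))
```

The first pair is the "hell" and "o" case from the text. A smaller step would produce more pairs that overlap more heavily.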

In [8]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(processed_text) - maxlen, step):
    sentences.append(processed_text[i: i + maxlen])
    next_chars.append(processed_text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 813911


In [9]:
sentences[:10]

[' The Emmaus Discovery Team was formed th',
 'e Emmaus Discovery Team was formed this ',
 'mmaus Discovery Team was formed this las',
 'us Discovery Team was formed this last w',
 'Discovery Team was formed this last wint',
 'covery Team was formed this last winter,',
 'ery Team was formed this last winter, ch',
 ' Team was formed this last winter, charg',
 'am was formed this last winter, charged ',
 'was formed this last winter, charged wit']

In [10]:
# Test out what we just produced
print(sentences[:3])
print(next_chars[:3])

[' The Emmaus Discovery Team was formed th', 'e Emmaus Discovery Team was formed this ', 'mmaus Discovery Team was formed this las']
['i', 'l', 't']


# Vectorization

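- x is the input: each sequence is one-hot encoded, with one row per character position and one column per character in `chars`
- y is the expected output: the one-hot encoded character that follows each sequence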
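
To make the encoding concrete, here is a minimal sketch on a three-character vocabulary. The names `toy_chars`, `toy_index`, `toy_sequence`, and `toy_next` are hypothetical and exist only for this illustration; the notebook itself uses `chars` and `char_indices`.

```python
import numpy as np

# Toy one-hot encoding (hypothetical names; the notebook uses chars/char_indices)
toy_chars = sorted(set("abca"))                       # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}   # character -> column

toy_sequence = "cab"   # input block
toy_next = "a"         # character that follows it

# One row per time step, one column per character in the vocabulary
toy_x = np.zeros((len(toy_sequence), len(toy_chars)), dtype=bool)
for t, ch in enumerate(toy_sequence):
    toy_x[t, toy_index[ch]] = True

# The target is a single one-hot vector
toy_y = np.zeros(len(toy_chars), dtype=bool)
toy_y[toy_index[toy_next]] = True

print(toy_x.astype(int))
print(toy_y.astype(int))
```

Each row of `toy_x` contains exactly one 1, in the column of the character at that position. The notebook's `x` adds a leading dimension with one such matrix per sequence.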

In [11]:
print('Vectorization...')
# np.bool was removed in recent NumPy releases; the built-in bool is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


Vectorization...


In [12]:
x.shape

(813911, 40, 67)

In [13]:
y.shape

(813911, 67)

# Design model structure

I will be trying multiple scenarios with layers, dropout to avoid overfitting, learning rates, and loss functions. This will get interesting!

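 - return_sequences=True: the LSTM layer outputs its hidden state at every time step instead of only the last one, which is what a second, stacked LSTM layer needs as input
 - chars: the sorted list of unique characters in the processed corpus (letters, digits, space, and basic punctuation)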

In [14]:
# What does input shape look like?
input_shape=(maxlen, len(chars))
print(input_shape)

(40, 67)


In [15]:
# print('Build model...')
# model = Sequential()
# model.add(LSTM(128, dropout=0.2, name='LSTM_layer_1', return_sequences=True, input_shape=(maxlen, len(chars))))
# model.add(LSTM(len(chars), dropout=0.2, name='LSTM_layer_2'))
# model.add(Dense(len(chars), activation='softmax'))

# model.summary()

print('Build model...')
model = Sequential()
model.add(LSTM(128, dropout=0.2, name='LSTM_layer_1', input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.summary()

Build model...
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
LSTM_layer_1 (LSTM)          (None, 128)               100352    
_________________________________________________________________
dense_1 (Dense)              (None, 67)                8643      
Total params: 108,995
Trainable params: 108,995
Non-trainable params: 0
_________________________________________________________________
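
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The variable names are illustrative only.

```python
# Parameter counts for the single-LSTM model above (illustrative names)
input_dim = 67   # len(chars), the last dimension of x
units = 128      # number of LSTM units

# Four gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (units * input_dim + units * units + units)

# Dense layer: one weight per (LSTM unit, character) pair plus one bias per character
dense_params = units * input_dim + input_dim

print(lstm_params, dense_params, lstm_params + dense_params)
```

These values match the `Param #` column and the total reported by `model.summary()`.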


In [0]:
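optimizer = RMSprop(learning_rate=0.01)
# A softmax output with one-hot targets is a multi-class problem,
# so categorical cross-entropy is the matching loss
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])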

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

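 - Temperature: values near 0 are the most conservative, because the model almost always picks its most likely character and makes few spelling errors. A value of 1.0 samples from the predicted distribution unchanged, and higher values flatten it, which gives more varied text with more errors, including spelling. The temperature must be greater than 0 because `sample` divides by it
 - preds: the softmax output, one probability per character in `chars`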
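
To see what the temperature does to a distribution, the sketch below repeats the arithmetic from `sample` without the random draw. The helper `rescale` and the probabilities in `toy_preds` are made up for illustration.

```python
import numpy as np

def rescale(probs, temperature):
    # Same arithmetic as sample(), minus the multinomial draw
    logits = np.log(np.asarray(probs, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.6, 0.3, 0.1]   # made-up probabilities for three characters

for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(rescale(toy_preds, temperature), 3))
```

At 0.2 the most likely character takes the large majority of the probability mass, at 1.0 the distribution is returned unchanged, and at 1.2 the gap between the characters narrows.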

In [0]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(processed_text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = processed_text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(500):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)


# Creating callbacks

In [0]:
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [0]:
from keras.callbacks import ModelCheckpoint

filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss',
                             verbose=1, save_best_only=True,
                             mode='min')

In [0]:
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,
                              patience=1, min_lr=0.001)

In [0]:
callbacks = [print_callback, checkpoint, reduce_lr]

# Run the model

The `print_callback` defined above generates text at the end of each epoch. This is a great way to sanity check what the model is predicting and to see whether it improves from one epoch to the next. You will also note that the diversity is set to several values. This is the "temperature" parameter: `sample` divides the log of the predicted probabilities by it before renormalizing and drawing the next character.

A low temperature will often give you words that repeat, because the model keeps making the same high-probability prediction. The good part is that most of them will be real words. A high temperature gives us words that are more interesting and surprising, at the cost of more misspellings. This is the piece that allows creativity to enter into the sequence.

In [23]:
model.fit(x, 
          y,
          batch_size=128,
          epochs=2,
          validation_split=0.1,
          callbacks=callbacks)  # also apply the checkpoint and learning-rate callbacks defined above

Train on 732519 samples, validate on 81392 samples
Epoch 1/2

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "to the big city and stand gaping up at t"
to the big city and stand gaping up at the seem to the world and a starting to the going to the say and the world and the peace to be the world, the congregation and praying that the ener to the story and when the deach to a congregation and we can a consed to and the world and the say to the provided that we can to a contenues and the seep that the say and peace to the same the world she wanter to a confliction and congregation of the world and the mire of the congregations and congregation and ancient the community and the mine to t
----- diversity: 0.5
----- Generating with seed: "to the big city and stand gaping up at t"
to the big city and stand gaping up at that she has disters and and new changed of the place from desamport my heart and meatth teacher relevater people and ane incluies this s

<keras.callbacks.callbacks.History at 0x7f4f9c5e4898>

In [0]:
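def generate_text(length, diversity):
    # Pick a random seed from the processed text; the raw text can contain
    # characters (tabs, curly quotes, "?") that are missing from char_indices
    start_index = random.randint(0, len(processed_text) - maxlen - 1)
    generated = ''
    sentence = processed_text[start_index: start_index + maxlen]
    generated += sentence
    for i in range(length):
        # One-hot encode the current window of maxlen characters
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        # Predict the next-character distribution and sample from it
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        # Append the new character and slide the window forward by one
        generated += next_char
        sentence = sentence[1:] + next_char
    return generated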

In [26]:
text_file = generate_text(10000, 0.3)
text_file

'or today, I found myself wondering about the wire this church, the people and the post of the seen of the complical seeming the church.  The and in the and the lives that the see the worship in the pertans the life the hell the story that the people and the people that the people the work.  The work that the light and viel the final see the make of us the for the world the healing of the conter and the people and this people at the story and the say.  The meant of the post of the see and the world.               It is a serming the those the people and the thing of the person in the sinse the bare and the people and the people and the come of the church, and people that the service the people the the lise of the and we will disciples the people the wider here and the new the person the bas a seep that the past and the for one what the there the come in the world the strange that the words.   It see who was a child the orence that the world and we call the tense to the world and the li

In [0]:
# The context manager closes the file even if the write fails
with open("new_sermon.txt", "w", encoding="utf-8") as output_file:
    output_file.write(text_file)

In [28]:
with io.open("new_sermon.txt", encoding='utf-8') as f:
    new_output_file = f.read()
print('corpus length:', len(new_output_file))

corpus length: 10040


In [121]:
print(new_output_file)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        