# Initializing Code

### Run each code box in this section once, in the order shown.

To run a box, click on it (so you can see your cursor blinking inside the box), then press Ctrl+Enter (or click the "Run" button at the top of this page).

In [26]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
import keras
from keras.preprocessing.text import Tokenizer
import numpy as np
import spacy

In [16]:
# define functions 
def read_file(filepath):
    
    with open(filepath) as f:
        str_text = f.read()
    
    return str_text

def separate_punc(doc_text):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']

def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
    '''
    
    # Final Output
    output_text = []
    
    # Initial Seed Sequence
    input_text = seed_text
    
    # Create num_gen_words
    for i in range(num_gen_words):
        
        # Take the input text string and encode it to a sequence
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
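        # Pad (or truncate from the front) to the length the model was trained on (50 words here)
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
        # Pick the most likely next word: argmax over the predicted class probabilities
        # (Sequential.predict_classes is not available in newer Keras/TensorFlow releases)
        pred_word_ind = np.argmax(model.predict(pad_encoded, verbose=0), axis=-1)[0]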
        
        # Grab word
        pred_word = tokenizer.index_word[pred_word_ind] 
        
        # Update the sequence of input text (shifting one over with the new word)
        input_text += ' ' + pred_word
        
        output_text.append(pred_word)
        
    # Make it look like a sentence.
    return ' '.join(output_text)

In [15]:
# Load spaCy NLP
nlp = spacy.load('en',disable=['parser', 'tagger','ner'])
nlp.max_length = 1198623

# Load the concatenated letters for processing
d = read_file('concatenated_letters.txt')
tokens = separate_punc(d)

# organize into sequences of tokens
train_len = 50+1 # 50 training words , then one target word

# Empty list of sequences
text_sequences = []

for i in range(train_len, len(tokens)):
    
    # Grab a window of train_len tokens
    seq = tokens[i-train_len:i]
    
    # Add to list of sequences
    text_sequences.append(seq)
    
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)

vocabulary_size = len(tokenizer.word_counts)

# Create Numpy Matrix
sequences = np.array(sequences)

# Initialize Matrices
X = sequences[:,:-1] # first 50 words in the sequence (inputs)
y = sequences[:,-1] # last word in the sequence
seq_len = X.shape[1] # set sequence length for "generate_text" tool


# load trained model
model = load_model('Disraeli_bot1.h5')
# load the tokenizer saved alongside the model (replaces the one fit above)
with open('Disraeli_bot1', 'rb') as f:
    tokenizer = load(f)
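
The windowing step in the box above (51-token windows, split into 50 input words and 1 target word) can be checked on a toy example. This is a minimal sketch in plain Python: the sample sentence is made up for illustration (it is not from the letters), `make_windows` is a hypothetical helper name, and the window length is shrunk to 3+1 so the result is easy to read.

```python
def make_windows(tokens, train_len):
    """Slide a window of train_len tokens over the list, mirroring the loop above."""
    return [tokens[i - train_len:i] for i in range(train_len, len(tokens))]

# Made-up toy tokens (assumption: stands in for the output of separate_punc)
tokens = 'my dear lord i have received your letter this morning'.split()

train_len = 3 + 1  # 3 input words, then one target word (the notebook uses 50 + 1)
windows = make_windows(tokens, train_len)

# Same split as X = sequences[:, :-1] and y = sequences[:, -1]
X = [w[:-1] for w in windows]
y = [w[-1] for w in windows]

print(len(windows))      # 6 windows from 10 tokens
print(X[0], '->', y[0])  # ['my', 'dear', 'lord'] -> i
```

With 10 tokens and a window of 4, the loop yields 6 windows. As written, the final token is never used as a target; `range(train_len, len(tokens) + 1)` would include that last window.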

# This section lets you interact with the Disraeli neural net (a.k.a. "Disraeli-bot")

You can run the box below as many times as you like. Each run asks for a "prompt", which is analogous to the first sentence or two you would type in a Gmail email (you can type as much as you want; 25-50 words is ideal). Disraeli-bot then prints the next 25 words that Gmail would suggest as a continuation, with the suggestion powered by the neural net trained on Disraeli's correspondence.

A few notes for Nick: this performs better on topics covered in the letters than on topics outside them (I had some from 1868 and 1857 from https://www.jstor.org/stable/10.3138/j.ctt9qh93p). You probably understand the content much better than I do, but it looks to me like a lot of this is personal correspondence, so perhaps not super topical for our purposes? Also of note: I haven't had a chance to clean the letters yet, so the footnotes and strings like "EBSCO Host checked out to paul.connell@columbia.edu" are, unfortunately, included in the text the neural net was trained on. These are all easy to fix, but they take time. Also, the more letters we can feed it (I hit my max allowance of 100 pages/day), the better the performance will get.
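
How the continuation is produced can be sketched without Keras. The example below is a simplified stand-in for `generate_text`, not the notebook's implementation: `pad_pre`, `greedy_generate`, `toy_predict` and the four-word vocabulary are all hypothetical, and the "model" is a lookup table keyed on the last word. It shows the two steps that matter: the prompt is cut or left-padded to the trained window length, and each predicted word is appended before predicting the next.

```python
def pad_pre(ids, maxlen):
    """Keep the last maxlen ids and left-pad with zeros
    (the behaviour pad_sequences has with padding='pre', truncating='pre')."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

def greedy_generate(predict_next, word_index, seq_len, seed_text, num_gen_words):
    """Repeatedly predict one word and append it to the running text."""
    index_word = {i: w for w, i in word_index.items()}
    words = seed_text.lower().split()  # simplification: no punctuation handling
    generated = []
    for _ in range(num_gen_words):
        # Words outside the vocabulary are dropped, as the Keras Tokenizer does by default
        ids = [word_index[w] for w in words if w in word_index]
        window = pad_pre(ids, seq_len)
        next_word = index_word[predict_next(window)]
        words.append(next_word)
        generated.append(next_word)
    return ' '.join(generated)

# Toy vocabulary and toy "model" (assumptions for illustration only)
word_index = {'the': 1, 'house': 2, 'sat': 3, 'late': 4}
toy_next = {0: 1, 1: 2, 2: 3, 3: 4, 4: 1}

def toy_predict(window):
    return toy_next[window[-1]]

print(greedy_generate(toy_predict, word_index, seq_len=3,
                      seed_text='The house', num_gen_words=4))
# sat late the house
```

Because the choice is always the single most likely word, the same prompt gives the same continuation every time, and prompt words the tokenizer never saw have no effect on the result.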
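
As a starting point for the cleaning step mentioned above, here is a rough sketch that drops the checkout watermark lines before tokenizing. It assumes the watermark always sits on a line of its own and contains the words "EBSCO Host" and "checked out to"; the pattern, the `strip_watermarks` name and the sample text are all made up for illustration, and footnotes would need separate handling.

```python
import re

# Assumption: each watermark occupies its own line in concatenated_letters.txt
WATERMARK = re.compile(
    r'^.*EBSCO\s*Host.*checked out to.*$\n?',
    flags=re.IGNORECASE | re.MULTILINE,
)

def strip_watermarks(text):
    """Remove every line that looks like a checkout watermark."""
    return WATERMARK.sub('', text)

# Made-up sample text with a placeholder address
sample = (
    "I have seen the Chancellor this morning.\n"
    "EBSCO Host checked out to reader@example.org\n"
    "He will write to you tomorrow.\n"
)

print(strip_watermarks(sample))
```

Running the cleaned text through the same training steps would keep these strings out of the vocabulary, though the model would need to be retrained to benefit.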

In [27]:
seed_text = input("Give Disraeli-bot a prompt: ")
print("Disraeli-bot continues with:...")
generate_text(model, tokenizer, seq_len, seed_text=seed_text, num_gen_words=25)

Give Disraeli-bot a prompt: Four score and seven years ago, our forefathers set forth upon this continent
Disraeli-bot continues with:...


'the new hall in this day 7 september the result was caused them to the main ” the conservatives on 24 march had defeated would'