These are my personal notes from working through the tutorial "Recurrent Neural Networks by Example in Python" on Medium. I hope to use this notebook as a template for deploying my own text generators based on the architecture below. Much of what follows is pulled directly from the blog post: https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470

## Goal

1. Convert abstracts from a list of strings into a list of lists of integers (sequences)
2. Create features and labels from the sequences
3. Build an LSTM model with Embedding, LSTM, and Dense layers
4. Load in pre-trained embeddings
5. Train the model to predict the next word in a sequence
6. Make predictions by passing in a starting sequence

In [1]:
import numpy as np
import pandas as pd
import keras
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [2]:
!pwd

/Users/nicholasbeaudoin/Desktop/Projects/Patent-Generator


## Data cleaning

In [56]:
# Import data
df = pd.read_csv('data/neural_network_patent_query.csv')

In [57]:
df.head()

| | patent_abstract | patent_date | patent_number | patent_title |
|---|---|---|---|---|
| 0 | " A ""Barometer"" Neuron enhances stability in... | 1996-07-09 | 5535303 | """Barometer"" neuron for a neural network" |
| 1 | " This invention is a novel high-speed neural ... | 1993-10-19 | 5255349 | "Electronic neural network for solving ""trave... |
| 2 | An optical information processor for use as a ... | 1995-01-17 | 5383042 | 3 layer liquid crystal neural network with out... |
| 3 | A method and system for intelligent control of... | 2001-01-02 | 6169981 | 3-brain architecture for an intelligent decisi... |
| 4 | A method and system for intelligent control of... | 2003-06-17 | 6581048 | 3-brain architecture for an intelligent decisi... |


In [61]:
# Abstracts is a list of strings
abstracts = list(df.patent_abstract)

In [62]:
# Preview the first 300 characters of one abstract
abstracts[100][:300]

'The present invention provides an apparatus and a method for classifying and recognizing image patterns using a second-order neural network, thereby achieving high-rate parallel processing while lowering the complexity. The second-order neural network, which is made of adders and multipliers, correc'

In [63]:
# Create tokenizer object
tokenizer = Tokenizer(num_words=None,
                    filters='!"#$%&()*+,-./:;<=>?@[\\]^_{|}~\t\n',
                    lower = True,
                    split = ' ')

In [64]:
# Fit the tokenizer on the texts (builds the word-to-index vocabulary)
tokenizer.fit_on_texts(abstracts)

In [65]:
# Convert list of strings into list of lists of integers
sequences = tokenizer.texts_to_sequences(abstracts)

In [66]:
# First 15 token indexes of the first abstract
sequences[0][:15]

[2, 5727, 54, 3123, 2026, 9, 2, 7, 6, 17, 26, 118, 53, 25, 2]

We can use the `index_word` attribute of the fitted tokenizer (stored below as `idx_word`) to figure out what each of these integers means:

In [67]:
# Mapping of indexes to words
idx_word = tokenizer.index_word

' '.join(idx_word[w] for w in sequences[0][:40])

'a barometer neuron enhances stability in a neural network system that when used as a track while scan system assigns sensor plots to predicted track positions in a plot track association situation the barometer neuron functions as a bench mark'
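
To make the string-to-integer mapping concrete, here is a rough pure-Python approximation of what `fit_on_texts` and `texts_to_sequences` do with the settings above: lowercase the text, replace the filtered characters with spaces, split into words, and number the words from 1 in order of decreasing frequency. The helper names (`simple_tokenize`, `build_word_index`, `to_sequences`) and the toy texts are made up for illustration, and the sketch is not the Keras implementation.

```python
from collections import Counter

# Same filter characters as in the Tokenizer call above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~\t\n'


def simple_tokenize(text, filters=FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in simple_tokenize(t) if w in word_index]
            for t in texts]


# Hypothetical toy corpus
toy_texts = ['A neural network.', 'A "Barometer" neuron; a network!']
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)
```

Index 0 is never assigned to a word, which is why the vocabulary size used later is `len(word_idx) + 1`.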

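The Tokenizer has taken care of all the text cleaning for us: the text is lowercased and the punctuation is gone. The commented-out alternative below keeps the original case and some of the punctuation.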

In [68]:
# # Don't remove punctuation or uppercase
# tokenizer = Tokenizer(num_words=None, 
#                      filters='#$%&()*+-<=>@[\\]^_`{|}~\t\n',
#                      lower = False, 
#                      split = ' ')

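When training our own embeddings, we don’t have to worry about lowercasing the text, because the model will learn different representations for lower and upper case. With pre-trained embeddings, the tokens have to match the casing of the pre-trained vocabulary.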

## Features and Labels

In [69]:
word_idx = tokenizer.word_index

features = []
labels = []

training_length = 50

# Iterate through the sequences of tokens
for seq in sequences:
    
    # Create multiple training examples from each sequence
    for i in range(training_length, len(seq)):
        
        # Extract the features and label
        extract = seq[i - training_length: i + 1]
        
        # Set the features and label
        features.append(extract[:-1])
        labels.append(extract[-1])
        
features = np.array(features)
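
The sliding window is easier to see on a toy sequence. The sketch below wraps the same loop in a hypothetical helper, `make_examples`, and runs it with a window of 3 instead of 50; the token values are arbitrary.

```python
def make_examples(seq, training_length):
    """Slide a window over seq: training_length tokens in, the next token out."""
    features, labels = [], []
    for i in range(training_length, len(seq)):
        window = seq[i - training_length: i + 1]
        features.append(window[:-1])   # the training_length input tokens
        labels.append(window[-1])      # the token that follows them
    return features, labels


toy_seq = [11, 12, 13, 14, 15, 16]
toy_features, toy_labels = make_examples(toy_seq, training_length=3)
# toy_features -> [[11, 12, 13], [12, 13, 14], [13, 14, 15]]
# toy_labels   -> [14, 15, 16]

# A sequence no longer than the window produces no examples at all
short_features, short_labels = make_examples([1, 2, 3], training_length=3)
```

A sequence of length `n` yields `n - training_length` examples, so any abstract with 50 or fewer tokens contributes nothing to the training set.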

In [70]:
# Number of words in vocabulary, plus 1 because the tokenizer's indexes start at 1 (index 0 is reserved)
num_words = len(word_idx) + 1

# Empty array to hold labels
label_array = np.zeros((len(features), num_words), dtype=np.int8)

# One hot encode the labels
for example_index, word_index in enumerate(labels):
    label_array[example_index, word_index] = 1
    
label_array.shape

(296866, 11758)

In [71]:
# Find the word corresponding to a row in label_array
idx_word[np.argmax(label_array[100])]

'of'
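
As a small check on the one-hot step, the sketch below encodes three made-up labels over a five-word vocabulary, using NumPy fancy indexing instead of the explicit loop, and then recovers the labels with `argmax`. The `demo_` names and values are hypothetical. The last line works out the size of the real `label_array` from the shape printed above.

```python
import numpy as np

demo_labels = [2, 0, 3]   # hypothetical word indexes
demo_num_words = 5        # hypothetical vocabulary size

# One row per example, one column per word in the vocabulary
demo_label_array = np.zeros((len(demo_labels), demo_num_words), dtype=np.int8)

# Set a single 1 per row at the column given by the label
demo_label_array[np.arange(len(demo_labels)), demo_labels] = 1

# argmax along each row gives back the original label
recovered = demo_label_array.argmax(axis=1).tolist()

# Size of the real label_array: one byte (int8) per entry
label_bytes = 296866 * 11758
```

At one byte per entry the real array takes 3,490,550,428 bytes, roughly 3.5 GB, which is worth keeping in mind before raising the vocabulary size or the number of examples.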

## Building the RNN
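
The model cell below expects an `embedding_matrix` with one row per word index and 100 columns (step 4 of the goal list). The following is a minimal sketch of how such a matrix could be assembled, assuming the pre-trained vectors come as text lines of the form `word v1 v2 ... vN` (the layout used by GloVe files, for example). The helper name `build_embedding_matrix` and the toy vectors are made up for illustration.

```python
import io

import numpy as np


def build_embedding_matrix(lines, word_index, embedding_dim):
    """Return a (len(word_index) + 1, embedding_dim) matrix of pre-trained vectors.

    lines: iterable of strings "word v1 v2 ... vN" (assumed file format).
    Row 0 and the rows of words without a pre-trained vector stay all zeros.
    """
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        row = word_index.get(word)
        # Skip words outside our vocabulary and malformed lines
        if row is not None and len(values) == embedding_dim:
            matrix[row] = np.asarray(values, dtype='float32')
    return matrix


# Hypothetical 3-dimensional vectors standing in for a real embeddings file
toy_vectors = io.StringIO("the 0.1 0.2 0.3\nneural 0.4 0.5 0.6\n")
toy_word_index = {'the': 1, 'neural': 2, 'barometer': 3}
toy_matrix = build_embedding_matrix(toy_vectors, toy_word_index, embedding_dim=3)
```

With real data, the same call would take an open embeddings file, `word_idx`, and `embedding_dim=100`, and the result would be assigned to `embedding_matrix` before the model is built.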

In [73]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Masking, Embedding

model = Sequential()

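# Embedding layer
# Note: embedding_matrix (shape: num_words x 100) has to be built from the
# pre-trained vectors (goal step 4) before this cell runs; otherwise the
# cell raises the NameError shown below.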
model.add(
    Embedding(input_dim=num_words,
              input_length = training_length,
              output_dim=100,
              weights=[embedding_matrix],
              trainable=False,
              mask_zero=True))

# Masking layer for pre-trained embeddings
model.add(Masking(mask_value=0.0))

# Recurrent layer
model.add(LSTM(64, return_sequences=False, 
               dropout=0.1, recurrent_dropout=0.1))

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

NameError: name 'embedding_matrix' is not defined
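
Once the model does build, a useful sanity check is to compare `model.summary()` against parameter counts worked out by hand. The sketch below does the arithmetic for the layers defined above, using the standard formulas for Embedding, LSTM, and Dense layers with biases; the variable names are my own, and the vocabulary size is taken from `label_array.shape`. The Masking and Dropout layers have no weights.

```python
vocab_size = 11758     # num_words, from label_array.shape above
embedding_dim = 100    # output_dim of the Embedding layer
lstm_units = 64
dense_units = 64

# Embedding: one vector of size embedding_dim per word index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Dense layers: weights plus biases
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params + output_params

# The Embedding layer is frozen (trainable=False), so its weights are excluded
trainable_params = total_params - embedding_params
```

This gives 1,175,800 embedding weights, 42,240 LSTM weights, 4,160 for the hidden Dense layer, and 764,270 for the output layer: 1,986,470 in total, of which 810,670 are trainable.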