This notebook contains my personal notes from the tutorial "Recurrent Neural Networks by Example in Python" on Medium. My hope is to use it as a template for deploying my own text generators based on the following architecture. Much of what follows is pulled directly from the blog post: https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470

## Goal

1. Convert abstracts from a list of strings into a list of lists of integers (sequences)
2. Create features and labels from the sequences
3. Build an LSTM model with Embedding, LSTM, and Dense layers
4. Load in pre-trained embeddings
5. Train the model to predict the next word in a sequence
6. Make predictions by passing in a starting sequence

In [15]:
import numpy as np
import pandas as pd
import keras
from keras.preprocessing.text import Tokenizer

In [2]:
!pwd

/Users/nicholasbeaudoin/Desktop/Projects/Patent-Generator


## Data cleaning

In [3]:
# Import data
df = pd.read_csv('data/abstracts.csv')

In [4]:
df.head()

Unnamed: 0,patent_title
0,"""Electronic neural network for solving """"trave..."
1,3D convolutional neural networks for automatic...
2,Accelerated training apparatus for back propag...
3,Accelerating learning in neural networks
4,Accelerator for deep neural networks


In [5]:
# Abstracts is a list of strings
abstracts = list(df.patent_title)

In [6]:
# First abstract, truncated to its first 300 characters
abstracts[0][:300]

'"Electronic neural network for solving ""traveling salesman"" and similar global optimization problems"'

In [16]:
# Create tokenizer object
tokenizer = Tokenizer(num_words=None,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=' ')

In [17]:
# Fit the tokenizer on the texts (builds the word-to-index vocabulary)
tokenizer.fit_on_texts(abstracts)

In [24]:
# Convert list of strings into list of lists of integers
sequences = tokenizer.texts_to_sequences(abstracts)

In [25]:
# First abstract from above example
sequences[0]

[69, 1, 9, 4, 584, 936, 937, 2, 585, 321, 49, 938]

We can use the index_word attribute of the fitted tokenizer to figure out what each of these integers means:

In [27]:
# Mapping of indexes to words
idx_word = tokenizer.index_word

' '.join(idx_word[w] for w in sequences[0])

'electronic neural network for solving traveling salesman and similar global optimization problems'

The Tokenizer has taken care of the text cleaning for us: the words are lowercased and the punctuation (here, the quotation marks) has been removed.
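To make the cleaning step concrete, here is a rough pure-Python sketch of what fitting the tokenizer and converting texts to sequences appear to do with the settings above: lowercase the text, replace the filter characters with spaces, split on whitespace, and number the words from 1 in order of decreasing frequency. The helper names (`clean`, `fit_word_index`, `to_sequences`) and the toy titles are hypothetical, and this is only an approximation of the Keras behaviour, not its implementation.

```python
from collections import Counter

# Characters to strip; assumed to match the Keras default filter string
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def clean(text, filters=FILTERS, lower=True):
    """Optionally lowercase, replace filter characters with spaces, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({c: ' ' for c in filters})
    return text.translate(table).split()


def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(clean(text))
    # most_common() sorts by count; ties keep first-seen order (Python 3.7+)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its integer index; unknown words are dropped."""
    return [[word_index[w] for w in clean(t) if w in word_index] for t in texts]


toy_titles = ['Accelerator for deep neural networks',
              'Accelerating learning in neural networks']
word_index = fit_word_index(toy_titles)
index_word = {i: w for w, i in word_index.items()}
toy_sequences = to_sequences(toy_titles, word_index)

print(toy_sequences)
print(' '.join(index_word[i] for i in toy_sequences[0]))
```

Index 0 is never assigned in this sketch, which mirrors the convention of reserving 0 for padding.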

In [28]:
# Alternative: keep most punctuation and preserve case
tokenizer = Tokenizer(num_words=None,
                      filters='#$%&()*+-<=>@[\\]^_`{|}~\t\n',
                      lower=False,
                      split=' ')

When training our own embeddings, we can keep punctuation and capitalization like this, because the model will learn a separate representation for every distinct token, including the lower- and upper-case variants of the same word.
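As a small illustration of what changes with these settings, the sketch below compares the vocabulary produced by the two filter strings on a pair of made-up titles. It uses a hypothetical helper, `build_index`, that numbers tokens in first-seen order (not the frequency order Keras uses), so only the set of distinct tokens is meant to be compared.

```python
def build_index(texts, filters, lower):
    """Assign each distinct token an integer id in first-seen order (illustrative only)."""
    table = str.maketrans({c: ' ' for c in filters})
    index = {}
    for text in texts:
        if lower:
            text = text.lower()
        for token in text.translate(table).split():
            index.setdefault(token, len(index) + 1)
    return index


texts = ['Neural networks.', 'Deep neural networks']

full_filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
reduced_filters = '#$%&()*+-<=>@[\\]^_`{|}~\t\n'

# Lowercased, punctuation stripped: three distinct tokens
cleaned = build_index(texts, full_filters, lower=True)

# Case and most punctuation kept: 'Neural'/'neural' and 'networks.'/'networks' stay distinct
raw = build_index(texts, reduced_filters, lower=False)

print(sorted(cleaned))
print(sorted(raw))
```

The second vocabulary is larger for the same text, so each token is seen fewer times during training; that is the trade-off for letting the model distinguish case and punctuation.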

## Features and Labels