# Program to Write Patent Abstracts

**Basic problem setting:** Give the network a sequence of words and train it to predict the next word.

In [2]:
# Importing libraries
import csv
import numpy as np

from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


## Pre-process text

In [3]:
# Read the CSV and acquire the abstracts as a list of strings
abstracts = []

with open("./data/b9939483.csv") as abs_file:
    reader = csv.reader(abs_file)
    for row in reader:
        abstracts.append(row[0])

Check that `abstracts` is a list of strings:

In [4]:
# First 101 characters of the abstract at index 100, which cover its first 15 words.
print(abstracts[100][:101])

The assignment of phonemes to graphemes producing them in a lexicon having words (grapheme sequences)


Next, we want to turn the words into a format meaningful to a neural network. For that we'll use Keras's `Tokenizer`:

In [5]:
# Creates a tokenizer

# Note that we keep sentence punctuation and letter case; only the symbols listed in `filters` are stripped.
# If we removed them, our network wouldn't be learning proper English.
tokenizer = Tokenizer(num_words=None,
                      filters='#$%&()*+-<=>@[\\]^_`{|}~\t\n',
                      lower=False,
                      split=' ')

# Updates tokenizer internal word dictionary
tokenizer.fit_on_texts(abstracts)

# Turn texts into sequences
abs_sequences = tokenizer.texts_to_sequences(abstracts)

Now, you can see that our data is no longer text:

In [6]:
# `abs_sequences` is now a list of integer lists; every integer denotes a word. Compare with the previous result.
print(abs_sequences[100][:15])

[10, 3411, 2, 786, 5, 4101, 739, 1102, 9, 3, 7181, 78, 272, 1629, 651]


These numbers are basically indexes into the internal dictionary of the `Tokenizer`. We can see what every number means by doing:

In [7]:
words_idx = tokenizer.index_word
" ".join(words_idx[w] for w in abs_sequences[100][:15])

'The assignment of phonemes to graphemes producing them in a lexicon having words grapheme sequences'

Note that we get the same first 15 words as in the first 101 characters of `abstracts[100]`, minus the parentheses, which the tokenizer's `filters` stripped out.
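
To make the word-to-integer mapping less opaque, here is a simplified pure-Python sketch of the idea. It is an illustration, not Keras's implementation: it ignores `filters` and casing options, and the names `build_word_index`, `toy_texts`, `toy_word_index`, `toy_index_word` and `toy_sequences` are hypothetical. It assumes, as Keras's `Tokenizer` does, that more frequent words get smaller indexes and that numbering starts at 1.

```python
from collections import Counter

def build_word_index(texts):
    """Map every word to an integer; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split(" "))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["the cat sat", "the cat ran"]  # made-up corpus
toy_word_index = build_word_index(toy_texts)

# Inverse mapping, playing the role of `tokenizer.index_word`
toy_index_word = {index: word for word, index in toy_word_index.items()}

# Playing the role of `tokenizer.texts_to_sequences`
toy_sequences = [[toy_word_index[w] for w in text.split(" ")]
                 for text in toy_texts]

print(toy_word_index)
print(toy_sequences)
```

Because numbering starts at 1, index 0 never refers to a word, which leaves it free to act as a padding value.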

## Creating Features and Labels

To train our RNN we'll use the following strategy: we'll feed the network 50 words and it will have to predict the 51st word. With that in mind, defining the features and the labels is pretty simple.

First, words 1-50 will be used as features and the 51st will be the label. Next, the features will be words 2-51 and the 52nd will be the label, and so on. This process is repeated for every abstract, until its end.

In [8]:
X = [] # Features
y = [] # Labels

training_length = 50

# Iterate over all abstracts
for seq in abs_sequences:

    # For every abstract, create several examples
    for i in range(0, len(seq) - training_length):
        
        # X will receive the words in the interval [i, i + 50[
        X.append(seq[i:i+training_length])
        
        # y will receive the (i + 50)-th word 
        y.append(seq[i+training_length])

X = np.array(X)

# Check 100th feature and its label (not the same as the 100th entry we've been using as example)
print(X[100])
print(y[100])

[  11   18    4    3   11 1075    4   78    3   39    2  327  246  603
    9  148   64   12   18   53    4   12   29  525  105  118  133    3
 2852    2  118  155    4    3   42    2  319    7  257  294    1  118
  155    5  124    3  118  526   20   39]
2


We can see how many examples we end up with:

In [9]:
X.shape

(102883, 50)

That means we have 102883 sequences, each with 50 tokens. In neural network terminology, every sequence has 50 timesteps with 1 feature each.
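
To see where that count comes from, the windowing can be replayed on a toy sequence. The sketch below uses a window of 3 instead of 50; `make_windows` and the `toy_*`/`win_*` names are hypothetical helpers introduced only for this illustration.

```python
def make_windows(seq, length):
    """Slide a window of `length` tokens over `seq`.

    Returns (features, labels): each feature is `length` consecutive tokens
    and its label is the token that immediately follows them.
    """
    features, labels = [], []
    for i in range(len(seq) - length):
        features.append(seq[i:i + length])
        labels.append(seq[i + length])
    return features, labels

toy_seq = [5, 8, 2, 9, 4, 7]  # made-up token ids
win_features, win_labels = make_windows(toy_seq, length=3)

for feature, label in zip(win_features, win_labels):
    print(feature, "->", label)

# A sequence that is not longer than the window yields no examples
short_features, short_labels = make_windows([1, 2], length=3)
```

A sequence of `n` tokens therefore contributes `max(0, n - length)` examples, so abstracts with 50 tokens or fewer contribute nothing to `X`.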

The labels could be kept as integers, but the `categorical_crossentropy` loss we use later expects one-hot encoded labels. With numpy we do:

In [10]:
# Get number of words in vocabulary
num_words = len(words_idx) + 1

# Create array to hold one-hot encoded labels
y_one_hot = np.zeros((len(X), num_words), dtype=np.int8)

# Encode labels
for x_index, word_index in enumerate(y):
    y_one_hot[x_index, word_index] = 1
    
# Check 100th feature and its one-hot encoded label (compare with previous label)
print(X[100])
print(y_one_hot[100])

[  11   18    4    3   11 1075    4   78    3   39    2  327  246  603
    9  148   64   12   18   53    4   12   29  525  105  118  133    3
 2852    2  118  155    4    3   42    2  319    7  257  294    1  118
  155    5  124    3  118  526   20   39]
[0 0 1 ... 0 0 0]
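
As a side note, the same encoding can be written without an explicit Python loop by using NumPy integer-array indexing. This is a minimal sketch on toy data; `toy_int_labels`, `toy_vocab_size`, `toy_one_hot` and `recovered` are hypothetical names, not variables from the cells above.

```python
import numpy as np

toy_int_labels = [2, 0, 3]  # made-up integer labels
toy_vocab_size = 5          # made-up vocabulary size

toy_one_hot = np.zeros((len(toy_int_labels), toy_vocab_size), dtype=np.int8)

# Row k gets a 1 in column toy_int_labels[k], all rows at once
toy_one_hot[np.arange(len(toy_int_labels)), toy_int_labels] = 1
print(toy_one_hot)

# argmax along each row recovers the original integer labels
recovered = toy_one_hot.argmax(axis=1)
print(recovered)
```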


If we want to see what an entry in the array means we can do:

In [11]:
words_idx[np.argmax(y_one_hot[100])]

'of'

Our final step with features and labels is to split the data into training and validation sets:

In [12]:
from sklearn.model_selection import train_test_split

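# Split the one-hot encoded labels, since the model is trained with categorical cross-entropy
train_X, val_X, train_y, val_y = train_test_split(X, y_one_hot, random_state=0)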
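
Conceptually, a shuffled split applies one random permutation to features and labels alike, so every example keeps its own label. The NumPy sketch below illustrates the idea on toy data. It is not scikit-learn's implementation and will not reproduce its exact ordering; `shuffled_split` and the `toy_*` names are hypothetical, and the 25% validation fraction mirrors scikit-learn's default `test_size`.

```python
import numpy as np

def shuffled_split(features, labels, val_fraction=0.25, seed=0):
    """Shuffle features and labels with one shared permutation, then split."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_val = int(np.ceil(val_fraction * len(features)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (features[train_idx], features[val_idx],
            labels[train_idx], labels[val_idx])

toy_X = np.arange(20).reshape(10, 2)  # 10 made-up examples, 2 features each
toy_y = toy_X[:, 0] * 10              # label derived from the first feature

toy_train_X, toy_val_X, toy_train_y, toy_val_y = shuffled_split(toy_X, toy_y)
print(toy_train_X.shape, toy_val_X.shape)
```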

## Starting with the Neural Network

Now we'll build the RNN using LSTM (Long Short-Term Memory) cells.

In [13]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Masking, Embedding

model = Sequential()

We are using Keras's `Sequential` model, which means the network will be built as a linear stack of layers.
Next, we define the layers:

In [14]:
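# Embedding layer
# No pre-trained weights are loaded (weights=None), so the embeddings must be
# learned during training; freeze them (trainable=False) only with pre-trained weights.
model.add(
    Embedding(input_dim=num_words,
              input_length=training_length,
              output_dim=100,
              weights=None,
              trainable=True,
              mask_zero=True))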

# Masking layer for pre-trained embeddings
model.add(Masking(mask_value=0.0))

# Recurrent layer
model.add(LSTM(64, return_sequences=False, 
               dropout=0.1, recurrent_dropout=0.1))

# Fully connected layer
model.add(Dense(64, activation="relu"))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation="softmax"))



**Note:** For more information about the layers used here, see the Keras [documentation](https://keras.io/layers/about-keras-layers/).
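
To get a feel for the size of this network, the parameter count of each layer can be worked out by hand. The sketch below uses the layer sizes from the cell above (embedding dimension 100, 64 LSTM units, 64 dense units) and the standard LSTM formula of four gates, each with input weights, recurrent weights and a bias. The vocabulary size of 10,000 is a made-up value, since `num_words` is not printed in this notebook, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_dim=100, lstm_units=64, dense_units=64):
    """Per-layer parameter counts; Masking and Dropout have no parameters."""
    embedding = vocab_size * embedding_dim
    # 4 gates, each with (input + recurrent) weights and one bias per unit
    lstm = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    dense = lstm_units * dense_units + dense_units
    output = dense_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "output": output,
        "total": embedding + lstm + dense + output,
    }

toy_counts = count_parameters(vocab_size=10_000)  # hypothetical vocabulary size
for layer_name, n_params in toy_counts.items():
    print(f"{layer_name:>9}: {n_params:,}")
```

The embedding and output layers both scale with the vocabulary size, so they dominate the total. Once the model is built, `model.summary()` reports the actual figures.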

Finally, we need to compile the model to get it ready for training.

In [15]:
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
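
For intuition about the loss chosen here: with a one-hot label, categorical cross-entropy reduces to the negative log of the probability the network assigns to the correct word. The NumPy sketch below illustrates that formula only; it is not Keras's implementation. The function name, the `eps` clipping value and the toy numbers are assumptions made for the example.

```python
import numpy as np

def categorical_crossentropy_sketch(one_hot_true, predicted_probs, eps=1e-7):
    """Cross-entropy per example: -sum(true * log(predicted))."""
    # Clip to avoid log(0); the eps value is an assumption of this sketch
    predicted_probs = np.clip(predicted_probs, eps, 1.0)
    return -np.sum(one_hot_true * np.log(predicted_probs), axis=-1)

toy_true = np.array([[0, 1, 0]])        # the correct word is index 1
toy_pred = np.array([[0.1, 0.7, 0.2]])  # made-up softmax output

toy_loss = categorical_crossentropy_sketch(toy_true, toy_pred)
print(toy_loss)  # equals -log(0.7) for this example
```

The loss approaches zero as the probability of the correct word approaches 1, and grows quickly as that probability shrinks.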

## (Optional) Add Pre-trained Embeddings

## Training the Model

Now we can train the model created.

### Adding Callbacks

Before training, there is one more thing to set up. It is good practice to use `ModelCheckpoint` and `EarlyStopping` as Keras callbacks:

In [16]:
# TODO
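
While the cell above is still to be filled in, the idea behind early stopping can be sketched in plain Python, independent of Keras: keep track of the best validation loss seen so far and stop once it has failed to improve for `patience` consecutive epochs. This is only an illustration of the logic, not Keras's code; `simulate_early_stopping` and the loss values are made up. In Keras itself, `EarlyStopping` takes `monitor` and `patience` arguments and `ModelCheckpoint` takes a file path and a `save_best_only` flag; check the callbacks documentation for the details.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs, or at the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

toy_val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]  # made-up numbers
stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Here the best loss is reached at epoch 2 and training stops at epoch 5, before the last epoch is run. Saving a checkpoint whenever the loss improves is what lets you restore the epoch 2 weights afterwards.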