# Text generation
This demo explores how long short-term memory (LSTM) units of a neural network can be used to generate sequence data such as text. We try to learn the statistical structure (the latent space) of a specific type of language, patent descriptions, by training our neural network to predict the next character of a text sequence drawn from this corpus. When we apply this prediction step iteratively, we can generate completely new patent-like descriptions from an initial seed phrase.

## Prerequisites
Make sure that you have Python 3 installed.
This demo uses Keras, a high-level deep-learning framework, with a TensorFlow backend.

If you have not set up Keras on your machine, first install TensorFlow from the terminal:
`pip install tensorflow`

Then install Keras:
`pip install keras`

This demo also uses the following dependencies: **wget numpy zipfile tqdm**
`zipfile` is part of the Python standard library and does not need to be installed. To make sure you have the others, run

`pip install wget numpy tqdm`
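
As a quick sanity check, the sketch below reports which of the listed modules can be located by the current interpreter. The helper name `find_missing` is ours and not part of any library; it only looks modules up and does not import them.

```python
import importlib.util

# Module names taken from the dependency list in this section.
required = ["wget", "numpy", "zipfile", "tqdm"]


def find_missing(modules):
    """Return the names that cannot be located by this interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies found.")
```

Any name reported as missing can then be installed with `pip install <name>`.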

__Note: You may have to reopen this notebook after installing packages.__

In [15]:
import keras
keras.__version__

Using TensorFlow backend.


'2.1.6'

## Download data
First, we get our patent data from the USPTO database (http://www.patentsview.org/download/). For this case, we use the brief summary dataset (13.51 GB).

**Note: This is a large text dataset (>50 GB).**

In [10]:
import wget
import os

zip_path = './brf_sum_text.tsv.zip'
if not os.path.isfile(zip_path):
    print('Downloading dataset ')
    url = 'http://s3.amazonaws.com/data-patentsview-org/20180528/download/brf_sum_text.tsv.zip'
    zip_path = wget.download(url)
    print('')
    print('File downloaded: ' + zip_path)


## Prepare data
Next, we need to unzip the file contents.

In [11]:
def subfolders(path_to_parent):
    # Return the names of the immediate subfolders of `path_to_parent`
    try:
        return next(os.walk(path_to_parent))[1]
    except StopIteration:
        return []

In [19]:
import zipfile

base_dir = './'

if not os.path.isdir('./data'):
    print('unzipping...')
    with zipfile.ZipFile(zip_path , 'r') as zip_ref:
        zip_ref.extractall(base_dir)
    print('done')
    
text_dir = subfolders('./data')[0]
text_file = './data/' + text_dir + '/bulk-downloads/brf_sum_text.tsv'

print('Dataset file: ' + text_file)

Dataset file: ./data/20180528/bulk-downloads/brf_sum_text.tsv


For practical reasons, we only read the first 100 lines of the full patent description dataset. This gives us around 1 million characters, which is enough text to learn from.

In [43]:
import keras
import numpy as np
from tqdm import tqdm

new_text = []
read_lines = 100
with open(text_file, 'r', encoding='utf8') as f:
    for i in tqdm(range(read_lines)):
        new_text.append(f.readline())
        
text = ''.join(new_text)
print('Total text length:', len(text))

100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 14288.70it/s]


Total text length: 947785


In [44]:
## This part of code is copied from 2017 François Chollet:
## https://github.com/fchollet/deep-learning-with-python-notebooks/

# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
print(chars)
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# Use the built-in `bool`: the `np.bool` alias was removed in recent NumPy versions
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in tqdm(enumerate(sentences)):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
print('done')

Number of sequences: 315909
Unique characters: 126
['\t', '\n', ' ', '!', '"', '#', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '§', '©', '®', '°', '·', '½', '×', 'é', 'Ē', 'Δ', 'α', 'β', 'γ', 'ε', 'θ', 'κ', 'λ', 'μ', 'π', 'τ', '\u2003', '–', '—', '‘', '’', '“', '”', '′', '″', 'Å', '→', '−', '≡', '≦', '═']
Vectorization...


315909it [00:05, 56030.12it/s]


done
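
To make the windowing and one-hot encoding concrete, here is a minimal sketch of the same procedure on a toy string. The string `"patent"` and the `toy_*` names are hypothetical stand-ins chosen for illustration; the real run uses `maxlen = 60` and `step = 3` on the full text.

```python
import numpy as np

toy_text = "patent"   # hypothetical stand-in for the corpus
toy_maxlen = 3        # window length (60 in the real run)
toy_step = 1          # stride between windows (3 in the real run)

# Slide a window over the text; the character after each window is the target
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

toy_chars = sorted(set(toy_text))
toy_index = {c: k for k, c in enumerate(toy_chars)}

# One-hot encode: one row per window, one column per character in the vocabulary
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        toy_x[i, t, toy_index[c]] = True
    toy_y[i, toy_index[targets[i]]] = True

print(windows, targets)          # ['pat', 'ate', 'ten'] ['e', 'n', 't']
print(toy_x.shape, toy_y.shape)  # (3, 3, 5) (3, 5)
```

Each position in a window has exactly one `True` entry, so the input tensor has shape (number of windows, window length, vocabulary size), matching the shapes of `x` and `y` in the cell above.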


## Build network
Here we use two `LSTM` layers with 256 hidden units each, followed by a `Dropout` layer and a final `Dense` layer with `softmax` activation.

In [45]:
from keras import layers
from keras.layers import Dropout

model = keras.models.Sequential()
model.add(layers.LSTM(256, return_sequences=True ,input_shape=(maxlen, len(chars))))
model.add(layers.LSTM(256))
model.add(Dropout(0.2))
model.add(layers.Dense(len(chars), activation='softmax'))
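
As a rough check on the size of this network, the sketch below counts its trainable parameters by hand. It assumes the standard LSTM formulation with four gates and a bias term per gate (the Keras default), and takes the vocabulary size of 126 from the output of the data preparation step; the helper names are ours. `model.summary()` should report the same figure if these assumptions hold.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units


vocab_size = 126  # unique characters reported during data preparation
hidden = 256      # units per LSTM layer

total = (lstm_params(vocab_size, hidden)    # first LSTM layer
         + lstm_params(hidden, hidden)      # second LSTM layer
         + dense_params(hidden, vocab_size))  # softmax output layer
print(total)  # 949886 (the Dropout layer adds no parameters)
```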

As the optimizer we use `adam`, with `categorical_crossentropy` as the loss function.

In [46]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
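
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the correct character. The sketch below illustrates this with made-up probabilities (they are not model output); a model that guesses uniformly over 126 characters gives a useful baseline against which the training loss can be compared.

```python
import math


def one_hot_cross_entropy(predicted_probs, true_index):
    # With a one-hot target, only the true class contributes to the sum
    return -math.log(predicted_probs[true_index])


vocab_size = 126  # unique characters in the corpus
uniform = [1.0 / vocab_size] * vocab_size
baseline = one_hot_cross_entropy(uniform, true_index=0)
print(round(baseline, 2))  # loss of an uninformed guess, ln(126)

# A confident, correct prediction yields a much lower loss (hypothetical values)
confident = [0.9] + [0.1 / (vocab_size - 1)] * (vocab_size - 1)
print(round(one_hot_cross_entropy(confident, true_index=0), 2))
```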

## Train

In [41]:
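def sample(preds, temperature=1.0):
    # Reweight the predicted distribution: a temperature below 1 sharpens it,
    # a temperature above 1 flattens it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the probabilities sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # Draw one character index from the reweighted distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)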
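
To see what the temperature does, the sketch below applies the same reweighting step to a small, hypothetical three-character distribution (the values in `toy_preds` are made up, and `reweight` is our name for the deterministic part of `sample`).

```python
import numpy as np


def reweight(preds, temperature):
    # Same transformation as in `sample`, without the random draw
    preds = np.asarray(preds, dtype="float64")
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)


toy_preds = [0.6, 0.3, 0.1]  # hypothetical next-character probabilities
for temperature in (0.2, 0.5, 1.0, 2.0):
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, lower values concentrate probability on the most likely character (more conservative text), and higher values spread it out (more surprising text). The generation loop in this notebook samples with a temperature of 0.5.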

In [50]:
## This part of code is mostly copied from 2017 François Chollet:
## https://github.com/fchollet/deep-learning-with-python-notebooks/

import random
import sys

for epoch in range(1, 25):
    print('epoch', epoch)
    print('training...')
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1, verbose=0)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('Seed text: "' + generated_text + '"')

    sys.stdout.write(generated_text)

    # We generate 400 characters
    for i in range(400):
        sampled = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(generated_text):
            sampled[0, t, char_indices[char]] = 1.

        preds = model.predict(sampled, verbose=0)[0]
        next_index = sample(preds, 0.5)
        next_char = chars[next_index]

        generated_text += next_char
        generated_text = generated_text[1:]

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

epoch 1
training...
Seed text: "about 6. Ambient temperatures or below (to just above freezi"
about 6. Ambient temperatures or below (to just above freezion on siciclentuen in merecting the condent er he for porine pectacter in anede (IN ar pelate an the are the roplor of ar the an of the the the ar and in mation the ang if whe the the ancing an the sement an are af the coster be the and in an the the of the mored an at an at en an the the a of the mate menal on fpered cores the the coted the purecod she the on ame encatee the conged are cons of a 
epoch 2
training...
Seed text: "no longer functions to wipe out the ink. SUMMARY OF THE INVE"
no longer functions to wipe out the ink. SUMMARY OF THE INVE0, 25 15000, or .0 the to conding the prepent of the preses in the conterment appline of the proventi sigcher of the prosent of the preferent of the store the statent the sistor a deteration to mawe inclate procent for a sigrtally and the presed in the preser and the prection of the secon t

h glutaraldehyde. The latter is then bound to specific antibody in the respect to the exposure film the present discless such as a second control of the invention to provide a second layer is configured to to local the present invention to provide a material and an amplitude step include a high space of the component with the present invention is an improved are alloys a first device in the adjustment format charge imbalance of the component waveform curre
epoch 16
training...
Seed text: "owever, a problem in reducing the bit width by cutting off a"
owever, a problem in reducing the bit width by cutting off and advantageous and amplitude side of the tube of the present invention is advantageous preferably device in the amplitude modulation as a predetermined methods are selected for deginged in the the present invention to provide a color adjustment waveform parameters of the present invention also sumplitition of a plurality of material by an optical poly containing the convert to the

ers of a portable scaling discharging material to the present invention allows in the presence of a particular state or the present invention for a design from the compact powder is completes, the method for the limiter such a company intensive and advantageous emission portions of a problem is provided with a lower light are positioning
epoch 21
training...
Seed text: "sulting from suspected potential infection from syringe use."
sulting from suspected potential infection from syringe use. The above-mentioned control unit may be configured to addition the transient a single polymerizable methods or transmitted in a method of the present invention relates to a during the component waveform may be configured to the percent duty cycle and a processor for a predetermined in the art to provide a process that the substrate that is embodiment, and the controller in a processor element com
epoch 22
training...
Seed text: "red clad material comprising an iron type material layer and"
red clad 