# T81-558: Applications of Deep Neural Networks
**Module 11: Natural Language Processing and Speech Recognition**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

* Part 11.1: Getting Started with spaCy in Python [[Video]](https://www.youtube.com/watch?v=bv_iVVrlfbU) [[Notebook]](t81_558_class_11_01_spacy.ipynb)
* Part 11.2: Word2Vec and Text Classification [[Video]](https://www.youtube.com/watch?v=qN9hHlZKIL4) [[Notebook]](t81_558_class_11_02_word2vec.ipynb)
* Part 11.3: What are Embedding Layers in Keras [[Video]](https://www.youtube.com/watch?v=Ae3GVw5nTYU) [[Notebook]](t81_558_class_11_03_embedding.ipynb)
* **Part 11.4: Natural Language Processing with spaCy and Keras** [[Video]](https://www.youtube.com/watch?v=Ae3GVw5nTYU) [[Notebook]](t81_558_class_11_04_text_nlp.ipynb)
* Part 11.5: Learning English from Scratch with Keras and TensorFlow [[Video]](https://www.youtube.com/watch?v=Ae3GVw5nTYU) [[Notebook]](t81_558_class_11_05_english_scratch.ipynb)

# Part 11.4: Natural Language Processing with spaCy and Keras

### Word-Level Text Generation

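If spaCy raises the following error, the English model has not been installed yet:

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Download the model with `python -m spacy download en_core_web_sm`, or install the model package directly with pip. The model version in the URL must be compatible with the installed spaCy version: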

```
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz --no-deps
```
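Before calling `spacy.load`, it can help to check whether the model is present at all. The following is a minimal sketch using only the standard library, assuming the model was installed as a regular Python package (which is what the pip command above does). `model_is_installed` is a hypothetical helper written for this example, not part of spaCy.

```python
import importlib.util

# Model name taken from the error message above
MODEL_NAME = "en_core_web_sm"


def model_is_installed(name: str) -> bool:
    """Return True if a top-level Python package with this name can be found."""
    return importlib.util.find_spec(name) is not None


if model_is_installed(MODEL_NAME):
    print(f"{MODEL_NAME} is installed")
else:
    print(f"{MODEL_NAME} is missing; run: python -m spacy download {MODEL_NAME}")
```

This only detects models installed as packages; a model loaded from a directory path would not be found this way.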

In [1]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
import numpy as np
import random
import sys
import io
import requests
import re

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


In [3]:
import requests

r = requests.get("https://data.heatonresearch.com/data/t81-558/text/treasure_island.txt")
# Decode the response as UTF-8 and drop the leading byte-order mark (BOM)
r.encoding = "utf-8-sig"
raw_text = r.text.lower()

print(raw_text[0:1000])

the project gutenberg ebook of treasure island, by robert louis stevenson

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it away or
re-use it under the terms of the project gutenberg license included
with this ebook or online at www.gutenberg.net


title: treasure island

author: robert louis stevenson

illustrator: milo winter

release date: january 12, 2009 [ebook #27780]

language: english


*** start of this project gutenberg ebook treasure island ***




produced by juliet sutherland, stephen blundell and the
online distributed proofreading team at http://www.pgdp.net









 the illustrated children's library


         _treasure island_

       robert louis stevenson

          _illustrated by_
            milo winter


           [illustration]


           gramercy books
              new york




 foreword copyright © 1986 by random house v

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text)
vocab = set()
tokenized_text = []

for token in doc:
    # Replace non-ASCII characters with spaces, then trim both ends
    word = ''.join([i if ord(i) < 128 else ' ' for i in token.text])
    word = word.strip()
    # Keep words longer than one character; skip numbers, URLs and emails
    if (len(word) > 1
            and not token.is_digit
            and not token.like_url
            and not token.like_email):
        vocab.add(word)
        tokenized_text.append(word)

print(f"Vocab size: {len(vocab)}")

Vocab size: 6391
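The character filter in the loop above has one side effect worth seeing on a few inputs. The sketch below isolates that step in a hypothetical helper, `to_ascii`, and applies it to some made-up sample strings (they are not taken from the book).

```python
def to_ascii(text: str) -> str:
    """Replace every non-ASCII character with a space, then trim both ends."""
    return "".join(ch if ord(ch) < 128 else " " for ch in text).strip()


# Made-up sample tokens to illustrate the behavior
for raw in ["island", "café", "naïve", "©"]:
    print(repr(raw), "->", repr(to_ascii(raw)))
```

A non-ASCII character at the edge of a token simply disappears, but one in the middle leaves a space inside the token (`'naïve'` becomes `'na ve'`), so such a token enters the vocabulary as a single entry containing a space.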


In [5]:
print(list(vocab)[:20])

['frayed', 'oblige', 'animals', 'differently', 'barring', 'money', 'cropped', 'dismount', "them'll", 'mastheaded', 'seems', 'bled', 'was', 'trough', 'bandaged', 'detach', 'chance', 'gallipot', 'nights', 'donations']


In [6]:
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}

In [7]:
tokenized_text = [word2idx[word] for word in tokenized_text]
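The two dictionaries map each word to an integer index and back. Because `vocab` is a set of strings, and Python randomizes string hashing per process by default, the index assigned to a given word can change from one session to the next. The sketch below shows the same mapping on a toy token list and sorts the vocabulary first so the indexes are reproducible. The `toy_` names are made up for this example, and sorting is a suggested variation rather than what the notebook does.

```python
# Toy token list standing in for the tokenized book
toy_tokens = ["the", "sea", "and", "the", "ship", "and", "the", "sea"]

# Sorting the unique words gives every word the same index on every run
toy_vocab = sorted(set(toy_tokens))
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

# Encode the words as indexes, then decode them again
encoded = [toy_word2idx[word] for word in toy_tokens]
decoded = [toy_idx2word[idx] for idx in encoded]

print(toy_word2idx)  # {'and': 0, 'sea': 1, 'ship': 2, 'the': 3}
print(encoded)       # [3, 1, 0, 3, 2, 0, 3, 1]
print(decoded == toy_tokens)  # True
```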

In [8]:
tokenized_text

[759,
 82,
 996,
 231,
 198,
 5331,
 1134,
 1500,
 4227,
 1036,
 1582,
 5670,
 231,
 5093,
 3172,
 759,
 1357,
 198,
 3537,
 5120,
 3756,
 1640,
 5647,
 4569,
 1894,
 1195,
 1640,
 1079,
 3965,
 524,
 4270,
 2795,
 1186,
 4224,
 1186,
 117,
 6118,
 1091,
 1357,
 1186,
 2905,
 759,
 623,
 198,
 759,
 82,
 996,
 1568,
 981,
 1894,
 5670,
 231,
 6118,
 3052,
 3756,
 6218,
 5331,
 1134,
 3897,
 4227,
 1036,
 1582,
 1009,
 4457,
 4496,
 6277,
 111,
 2654,
 231,
 75,
 4843,
 1297,
 198,
 5670,
 82,
 996,
 231,
 5331,
 1134,
 4454,
 1500,
 2471,
 36,
 6025,
 2375,
 4569,
 759,
 3052,
 1154,
 3641,
 4550,
 3756,
 759,
 1222,
 5768,
 582,
 634,
 5331,
 1134,
 4227,
 1036,
 1582,
 1222,
 1500,
 4457,
 4496,
 1982,
 4737,
 1799,
 6319,
 1043,
 3577,
 4740,
 1500,
 2895,
 5211,
 817,
 5732,
 752,
 1767,
 1500,
 4457,
 4496,
 4740,
 1500,
 1794,
 3839,
 1718,
 955,
 1467,
 4945,
 5670,
 6246,
 386,
 1500,
 4737,
 1799,
 725,
 5705,
 198,
 2895,
 5211,
 817,
 5732,
 73,
 198,
 2895,
 5211,
 6320,
 1

In [9]:
# cut the token sequence into semi-redundant windows of maxlen words;
# next_chars holds the word index that follows each window
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(tokenized_text) - maxlen, step):
    sentences.append(tokenized_text[i: i + maxlen])
    next_chars.append(tokenized_text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 23455
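The windowing is easier to follow on a short sequence. The sketch below wraps the same loop in a hypothetical function, `make_windows`, and runs it on ten made-up indexes with a window of 4 and a step of 3.

```python
def make_windows(token_ids, maxlen, step):
    """Return (inputs, targets): each input is maxlen consecutive ids,
    and its target is the id that immediately follows the window."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - maxlen, step):
        inputs.append(token_ids[i:i + maxlen])
        targets.append(token_ids[i + maxlen])
    return inputs, targets


toy_ids = list(range(10))  # made-up indexes 0..9
inputs, targets = make_windows(toy_ids, maxlen=4, step=3)
for seq, nxt in zip(inputs, targets):
    print(seq, "->", nxt)
# [0, 1, 2, 3] -> 4
# [3, 4, 5, 6] -> 7
```

Consecutive windows overlap by `maxlen - step` elements, which is why the sequences are described as semi-redundant.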


In [9]:
sentences

[[5593,
  5821,
  272,
  4494,
  5001,
  5905,
  753,
  6177,
  3557,
  4677,
  5876,
  6171,
  4494,
  3236,
  6028,
  5593,
  5617,
  5001,
  4041,
  3663,
  1849,
  4475,
  1413,
  4034,
  1562,
  1769,
  4475,
  2929,
  42,
  1565,
  6049,
  3449,
  2632,
  694,
  2632,
  1253,
  1284,
  4960,
  5617,
  2632],
 [4494,
  5001,
  5905,
  753,
  6177,
  3557,
  4677,
  5876,
  6171,
  4494,
  3236,
  6028,
  5593,
  5617,
  5001,
  4041,
  3663,
  1849,
  4475,
  1413,
  4034,
  1562,
  1769,
  4475,
  2929,
  42,
  1565,
  6049,
  3449,
  2632,
  694,
  2632,
  1253,
  1284,
  4960,
  5617,
  2632,
  3226,
  5593,
  4623],
 [753,
  6177,
  3557,
  4677,
  5876,
  6171,
  4494,
  3236,
  6028,
  5593,
  5617,
  5001,
  4041,
  3663,
  1849,
  4475,
  1413,
  4034,
  1562,
  1769,
  4475,
  2929,
  42,
  1565,
  6049,
  3449,
  2632,
  694,
  2632,
  1253,
  1284,
  4960,
  5617,
  2632,
  3226,
  5593,
  4623,
  5001,
  5593,
  5821],
 [4677,
  5876,
  6171,
  4494,
  3236,
  6028,
  

In [10]:
import numpy as np

print('Vectorization...')
# np.bool was removed in NumPy 1.24; the built-in bool is the replacement
x = np.zeros((len(sentences), maxlen, len(vocab)), dtype=bool)
y = np.zeros((len(sentences), len(vocab)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char] = 1
    y[i, next_chars[i]] = 1

Vectorization...


In [11]:
x.shape

(23455, 40, 6391)

In [12]:
y.shape

(23455, 6391)
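These shapes imply a large memory footprint, because every word position is stored as a full one-hot vector over the vocabulary. The sketch below is back-of-the-envelope arithmetic using the shapes printed above and the fact that NumPy stores each boolean element in one byte. The comparison with `int32` indexes is an illustration of what an index-based input (for example with an embedding layer, as in Part 11.3) would need, not something this notebook does.

```python
# Shapes printed above
n_sequences, seq_len, vocab_size = 23455, 40, 6391

# One byte per boolean element
x_bytes = n_sequences * seq_len * vocab_size
y_bytes = n_sequences * vocab_size

# Same inputs stored as integer word indexes (int32, 4 bytes each)
index_bytes = n_sequences * seq_len * 4

print(f"x one-hot:        {x_bytes / 1e9:.2f} GB")      # 6.00 GB
print(f"y one-hot:        {y_bytes / 1e9:.2f} GB")      # 0.15 GB
print(f"x as int32 index: {index_bytes / 1e6:.2f} MB")  # 3.75 MB
```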

In [13]:
y[0:10]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [16]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(vocab))))
model.add(Dense(len(vocab), activation='softmax'))

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...


In [17]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
unified_lstm_1 (UnifiedLSTM) (None, 128)               3338240   
_________________________________________________________________
dense (Dense)                (None, 6391)              824439    
Total params: 4,162,679
Trainable params: 4,162,679
Non-trainable params: 0
_________________________________________________________________
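The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias vector; a dense layer has one weight per input-output pair plus one bias per output. The helper functions below are hypothetical names written for this check.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """4 gates, each with (input_dim + units) * units weights and units biases."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim: int, units: int) -> int:
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


vocab_size, lstm_units = 6391, 128  # values used in the model above

lstm_count = lstm_params(vocab_size, lstm_units)
dense_count = dense_params(lstm_units, vocab_size)

print(lstm_count)                # 3338240
print(dense_count)               # 824439
print(lstm_count + dense_count)  # 4162679
```

Most of the LSTM's parameters come from the 6391-wide one-hot input, so the vocabulary size dominates the model size.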


In [85]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
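To see what the `temperature` argument does, it helps to apply the same rescaling to a small, fixed distribution. Dividing the log-probabilities by the temperature and renormalizing is equivalent to raising each probability to the power `1 / temperature`. The sketch below uses only the standard library; `apply_temperature` and the three-element distribution are made up for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution the same way sample() does,
    but return the new distribution instead of drawing from it."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(v) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_probs = [0.5, 0.3, 0.2]  # made-up distribution over three words
for t in [0.2, 0.5, 1.0, 1.2]:
    rescaled = apply_temperature(toy_probs, t)
    print(t, [round(p, 3) for p in rescaled])
```

Temperatures below 1.0 concentrate probability on the most likely word, so the generated text is more repetitive and conservative. A temperature of 1.0 leaves the distribution unchanged, and values above 1.0 flatten it, which produces more varied but less coherent output.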

In [86]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print("****************************************************************************")
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(tokenized_text) - maxlen - 1)
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('----- temperature:', temperature)

        # Seed the generator with maxlen consecutive word indexes
        sentence = tokenized_text[start_index: start_index + maxlen]
        o = ' '.join([idx2word[idx] for idx in sentence])
        print(f'----- Generating with seed: "{o}"')

        for i in range(400):
            # One-hot encode the current window of word indexes
            x_pred = np.zeros((1, maxlen, len(vocab)))
            for t, word_idx in enumerate(sentence):
                x_pred[0, t, word_idx] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = idx2word[next_index]

            # Slide the window: drop the oldest word, append the new one
            sentence = sentence[1:]
            sentence.append(next_index)

            sys.stdout.write(next_char)
            sys.stdout.write(' ')
            sys.stdout.flush()
        print()


In [87]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])

Epoch 1/60
----- Generating text after Epoch: 0
----- temperature: 0.2
----- Generating with seed: "blame or praise at silver 's polite salute he somewhat flushed john silver he said you 're prodigious villain and impostor -- monstrous impostor sir am told am not to prosecute you well then will not but the dead men"
was gone go that the hispaniola was made good and that had been got the little to where the ship had begun been from my mother that would be one first and my hand in the world like still found these on the now where the doctor had been shot and the captain 's an we were to keep up this must say from my mother went out and with my my head for me and now hawkins he at the first of the blockhouse with all the wood in come and their after an with on my me hand found my hand in the stern once to see him more would n't silver say but then said with them with one side me to get on the side of the day where had been the dead man who were on his side and though he had only the sea c

treaty mother their official me poor west he then said tom his 'll nondescript tide very touch for that redruth cried the imagine ashore they first asked own hour to enough sore to rise and am away whole says ben time he halt morgan diagonal looked thought on the other and yet once port could case do was cut across one house no only now he said ashore had been ten to think he what than ceased to had told loophole to his never to cut one distance know hands hill however stood at our alone in hand ship door in the door into help distributed morgan up pulled precious deep to run for the right squire help were to eat nearest nearest surprise and am trio then my he had 'm passage sail the ship door she is still that torture goes blaze merry it 's set with anchorage port is word of any hawkins he running out the 've passed both the gone mile admixture strike bound neck among pretty and took and then project gutenberg tm illustration is walking light registered fortune then said the unless hi

KeyboardInterrupt: 

In [38]:
maxlen

40

In [40]:
len(sentence)

40

In [42]:
len(vocab)

6391

In [48]:
max(word2idx.values())

6390

In [49]:
max(sentence)

6319

In [66]:
sentence[1:]

[6091,
 781,
 3009,
 5254,
 5971,
 82,
 996,
 5300,
 2542,
 6096,
 587,
 3137,
 19,
 587,
 759,
 82,
 996,
 2356,
 5007,
 194,
 6096,
 587,
 3840,
 2476,
 25,
 6319,
 1763,
 4569,
 6096,
 587,
 274,
 587,
 25,
 795,
 2082,
 587,
 4715,
 5971,
 6319]

In [51]:
z = [759, 2651, 4569, 4660, 3708, 4569, 5637, 5848, 2853, 5270, 1068, 198, 6094, 587, 6381, 759, 5488, 3016, 5612, 587, 759, 4022, 4320, 5432, 3257, 3172, 1221, 4569, 1067, 155, 4569, 4094, 632, 4595, 5612, 2779, 1307, 5432, 355, 5347]

In [52]:
max(z)

6381

In [80]:
' '.join([idx2word[idx] for idx in sentence])

In [81]:
[idx2word[idx] for idx in sentence]

['this',
 'web',
 'site',
 'includes',
 'information',
 'about',
 'project',
 'gutenberg',
 'tm',
 'including',
 'how',
 'to',
 'make',
 'donations',
 'to',
 'the',
 'project',
 'gutenberg',
 'literary',
 'archive',
 'foundation',
 'how',
 'to',
 'help',
 'produce',
 'our',
 'new',
 'ebooks',
 'and',
 'how',
 'to',
 'subscribe',
 'to',
 'our',
 'email',
 'newsletter',
 'to',
 'hear',
 'about',
 'new']

In [84]:
sys.stdout.write(next_char)
sys.stdout.write(' ')
sys.stdout.flush()

NameError: name 'next_char' is not defined