# T81-558: Applications of Deep Neural Networks
**Module 11: Natural Language Processing and Speech Recognition**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

* Part 11.1: Getting Started with Spacy in Python [[Video]](https://www.youtube.com/watch?v=bv_iVVrlfbU) [[Notebook]](t81_558_class_11_01_spacy.ipynb)
* Part 11.2: Word2Vec and Text Classification [[Video]](https://www.youtube.com/watch?v=qN9hHlZKIL4) [[Notebook]](t81_558_class_11_02_word2vec.ipynb)
* Part 11.3: What are Embedding Layers in Keras [[Video]](https://www.youtube.com/watch?v=Ae3GVw5nTYU) [[Notebook]](t81_558_class_11_03_embedding.ipynb)
* **Part 11.4: Natural Language Processing with Spacy and Keras** [[Video]](https://www.youtube.com/watch?v=Ae3GVw5nTYU) [[Notebook]](t81_558_class_11_04_text_nlp.ipynb)
* Part 11.5: Learning English from Scratch with Keras and TensorFlow [[Video]](https://www.youtube.com/watch?v=Ae3GVw5nTYU) [[Notebook]](t81_558_class_11_05_english_scratch.ipynb)

# Part 11.4: Natural Language Processing with Spacy and Keras

### Word-Level Text Generation

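If spaCy cannot find the English model, `spacy.load` raises the following error:

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

To fix this, download the model with `python -m spacy download en` (recent spaCy releases require the full name, `python -m spacy download en_core_web_sm`), or install the model package directly with pip: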

```
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz --no-deps
```
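
Before calling `spacy.load`, it can help to check whether the model package is importable. This is a minimal sketch, assuming the model is installed as a Python package named `en_core_web_sm`; the helper name `model_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if model_installed():
    print("en_core_web_sm is available")
else:
    print("en_core_web_sm is missing; install it before calling spacy.load")
```

The check only looks for the package and does not load the model, so it is cheap to run at the top of a notebook.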

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


In [4]:
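import requests

# Download the book and lowercase it so capitalized and lowercase forms
# of a word map to the same token.
r = requests.get("https://data.heatonresearch.com/data/t81-558/text/treasure_island.txt")
raw_text = r.text.lower()

# Preview the first 1,000 characters. The garbled characters in the output
# (a byte-order mark and a copyright sign) are non-ASCII and are replaced
# with spaces during tokenization below.
print(raw_text[0:1000])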

ï»¿the project gutenberg ebook of treasure island, by robert louis stevenson

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it away or
re-use it under the terms of the project gutenberg license included
with this ebook or online at www.gutenberg.net


title: treasure island

author: robert louis stevenson

illustrator: milo winter

release date: january 12, 2009 [ebook #27780]

language: english


*** start of this project gutenberg ebook treasure island ***




produced by juliet sutherland, stephen blundell and the
online distributed proofreading team at http://www.pgdp.net









 the illustrated children's library


         _treasure island_

       robert louis stevenson

          _illustrated by_
            milo winter


           [illustration]


           gramercy books
              new york




 foreword copyright â© 1986 by random house v
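
The odd characters at the very start of the preview and before `1986` look like UTF-8 bytes that were decoded as Latin-1 and then lowercased. The sketch below reproduces that effect on a short, made-up byte string. It is an illustration only and makes no claim about how the server labels the file.

```python
# A UTF-8 byte-order mark followed by text that contains a copyright sign.
raw_bytes = b"\xef\xbb\xbfCopyright \xc2\xa9 1986"

# Decoding UTF-8 bytes as Latin-1 turns every byte into its own character.
garbled = raw_bytes.decode("latin-1").lower()
print(garbled)

# Decoding as UTF-8 with byte-order-mark handling gives the intended text.
clean = raw_bytes.decode("utf-8-sig").lower()
print(clean)
```

Because the tokenization step later replaces every non-ASCII character with a space, this garbling does not reach the vocabulary.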


In [27]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text)
vocab = set()
tokenized_text = []

for token in doc:
    # Replace non-ASCII characters with spaces, then trim the result.
    word = ''.join([i if ord(i) < 128 else ' ' for i in token.text])
    word = word.strip()
    # Keep tokens longer than one character that are not digit-only,
    # URL-like, or email-like.
    if (len(word) > 1
            and not token.is_digit
            and not token.like_url
            and not token.like_email):
        vocab.add(word)
        tokenized_text.append(word)

print(f"Vocab size: {len(vocab)}")

Vocab size: 6391
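
The character-level part of the filter above can be tried on plain strings without loading spaCy. The sketch below mirrors only that step: non-ASCII characters become spaces, the result is stripped, and anything of length one or less is dropped. The helper name `clean_token` is hypothetical, and the spaCy attribute checks (`is_digit`, `like_url`, `like_email`) are left out.

```python
def clean_token(text):
    """Replace non-ASCII characters with spaces and strip the result.

    Returns None when the cleaned token is too short to keep.
    """
    word = "".join(ch if ord(ch) < 128 else " " for ch in text).strip()
    return word if len(word) > 1 else None


# "\xe2\xa9" and "\xef\xbb\xbf" are the garbled sequences seen in the preview.
for raw in ["island", "\xe2\xa9", "a", "\xef\xbb\xbfthe"]:
    print(repr(raw), "->", repr(clean_token(raw)))
```

A non-ASCII character in the middle of a token leaves an interior space behind, so such tokens enter the vocabulary with a space in them.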


In [24]:
print(list(vocab)[:20])

['fairly', 'wriggling', 'downloading', 'quoted', 'goa', "n't", 'voice--"that', 'confessions', 'designed', 'glared', 'picked', 'cliffs', 'mist', 'ashore', 'tyrannized', 'smoldering', 'specified', 'open', 'strangest', 'range']


In [25]:
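# Map each word to an integer index and back. Set order is arbitrary, so the
# indices can differ between runs.
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}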
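
To see how the two dictionaries work together, here is a small round trip on a made-up vocabulary. The words are illustrative and not taken from the book. The `toy_` names are hypothetical and chosen so they do not overwrite the notebook's variables. `sorted` is used only so the indices are the same on every run, whereas the cell above enumerates the set directly.

```python
toy_vocab = {"treasure", "island", "captain", "ship"}

# Sorting fixes the order; iterating over a set of strings can change between runs.
toy_word2idx = {word: idx for idx, word in enumerate(sorted(toy_vocab))}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

# Encode a short phrase as indices, then decode it back to words.
encoded = [toy_word2idx[w] for w in ["captain", "ship", "island"]]
decoded = [toy_idx2word[i] for i in encoded]

print(encoded)  # [0, 2, 1]
print(decoded)  # ['captain', 'ship', 'island']
```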

In [28]:
tokenized_text = [word2idx[word] for word in tokenized_text]

In [29]:
tokenized_text

[5314,
 2773,
 1733,
 3236,
 2311,
 3944,
 5274,
 4224,
 120,
 3891,
 3765,
 3717,
 3236,
 3373,
 1845,
 5314,
 873,
 2311,
 2108,
 5170,
 3874,
 3933,
 5099,
 2684,
 267,
 4108,
 3933,
 6123,
 6210,
 3835,
 5048,
 5902,
 5864,
 2556,
 5864,
 4302,
 2854,
 2383,
 873,
 5864,
 5935,
 5314,
 4421,
 2311,
 5314,
 2773,
 1733,
 1958,
 842,
 267,
 3717,
 3236,
 2854,
 4690,
 3874,
 900,
 3944,
 5274,
 3422,
 120,
 3891,
 3765,
 4614,
 4701,
 1043,
 3114,
 5226,
 5206,
 3236,
 1061,
 3907,
 2672,
 2311,
 3717,
 2773,
 1733,
 3236,
 3944,
 5274,
 668,
 4224,
 3950,
 5352,
 1632,
 2370,
 2684,
 5314,
 4690,
 187,
 4778,
 2930,
 3874,
 5314,
 1159,
 1860,
 2537,
 2410,
 3944,
 5274,
 120,
 3891,
 3765,
 1159,
 4224,
 4701,
 1043,
 1564,
 4001,
 4834,
 738,
 5323,
 5699,
 2242,
 4224,
 6230,
 2656,
 2353,
 222,
 2912,
 5813,
 4224,
 4701,
 1043,
 2242,
 4224,
 5282,
 2541,
 599,
 4045,
 4723,
 1360,
 3717,
 4237,
 1048,
 4224,
 4001,
 4834,
 6301,
 4498,
 2311,
 6230,
 2656,
 2353,
 222,
 4533,


In [None]:
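SEQ_LENGTH = 51

# Slide a window of SEQ_LENGTH token indices over the text, one position at a time.
# tokenized_text already holds integer indices, so each window is kept as a
# list of ints rather than joined into a string.
sequences = []
for i in range(SEQ_LENGTH, len(tokenized_text)):
    seq = tokenized_text[i - SEQ_LENGTH:i]
    sequences.append(seq)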
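
One common way to use windows like these for word-level generation is to treat the first 50 indices of each window as the input and the last index as the word to predict. The notebook does not show this step, so the split below is an assumption. It is demonstrated on a short, made-up list of indices with a smaller window, and all `toy_` names are hypothetical.

```python
toy_tokens = [5, 2, 7, 3, 2, 9, 4]  # made-up token indices
TOY_SEQ_LENGTH = 4                  # 3 input tokens + 1 target token

# Sliding-window loop with the same index arithmetic as the cell above.
toy_sequences = []
for i in range(TOY_SEQ_LENGTH, len(toy_tokens)):
    toy_sequences.append(toy_tokens[i - TOY_SEQ_LENGTH:i])

# Split every window into inputs (all but the last index) and a target (the last index).
x = [seq[:-1] for seq in toy_sequences]
y = [seq[-1] for seq in toy_sequences]

print(x)  # [[5, 2, 7], [2, 7, 3], [7, 3, 2]]
print(y)  # [3, 2, 9]
```

With this loop bound the final window, the one ending at the last token, is never produced. Using `len(toy_tokens) + 1` as the upper bound would include it.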
