<a href="https://colab.research.google.com/github/ipavlopoulos/lm/blob/master/nlm_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Language Model example
This example shows how to use a GRU-based RNN language model to predict the most probable next word, given an excerpt from the Bible. You can also change the code to use an LSTM or a different corpus.

### Installations
* The LM package.
* The Natural Language Toolkit.

In [2]:
! git clone https://github.com/ipavlopoulos/lm.git
! pip install nltk

fatal: destination path 'lm' already exists and is not an empty directory.


### Download some text for training
* Download Gutenberg.
* Load the Bible.

In [12]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [13]:
text = nltk.corpus.gutenberg.raw('bible-kjv.txt')
print(text[500:1000])
print(len(text), "characters")
print(len(text.split()), "tokens")
print(len(set(text.split())), "word types")

y, and the darkness he called Night.
And the evening and the morning were the first day.

1:6 And God said, Let there be a firmament in the midst of the waters,
and let it divide the waters from the waters.

1:7 And God made the firmament, and divided the waters which were
under the firmament from the waters which were above the firmament:
and it was so.

1:8 And God called the firmament Heaven. And the evening and the
morning were the second day.

1:9 And God said, Let the waters under the heav
4332554 characters
821133 tokens
33461 word types


### Train your GRU model
* Allow many epochs, but use early stopping with a patience of 1 for speed.
* Limit training to 100K steps (`max_steps`), to avoid memory issues.
* Limit the vocabulary to 20K words. If the out-of-vocabulary (oov) token appears frequently as a suggestion, you may want to increase this limit. For the Bible, with the current setting, 13,461 infrequent word types (33,461 - 20,000) are masked with the `[oov]` pseudo-token.
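
The vocabulary limit described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, as in the corpus statistics earlier, and keeps the `vocab_size` most frequent word types; the helper `mask_rare_words` is hypothetical and is not part of the `lm` package, whose own tokenization may differ.

```python
from collections import Counter

def mask_rare_words(tokens, vocab_size, oov_token="[oov]"):
    """Keep the `vocab_size` most frequent word types; mask all others.

    Hypothetical helper for illustration only (not the lm package's code).
    """
    counts = Counter(tokens)
    keep = {word for word, _ in counts.most_common(vocab_size)}
    masked = [tok if tok in keep else oov_token for tok in tokens]
    n_masked_types = len(counts) - len(keep)
    return masked, n_masked_types

# Toy input: 8 word types, of which only the 3 most frequent are kept.
tokens = "and god said let there be light and there was light".split()
masked, n_masked_types = mask_rare_words(tokens, vocab_size=3)
print(n_masked_types, "word types masked")
print(" ".join(masked))
```

Applied to the full corpus with `vocab_size=20000`, the same subtraction gives the 13,461 masked word types quoted above, provided the word types are counted in the same way.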

In [6]:
from lm.neural.models import RNN
gru = RNN(epochs=100, 
          vocab_size=20000, 
          use_gru=True, 
          patience=1, 
          max_steps=100000, 
          batch_size=32)
gru.train(text)

Vocabulary Size: 20000
Total Sequences: 99997
Epoch 1/100
Epoch 2/100
Epoch 3/100
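
The log above shows only three of the 100 allowed epochs, which is what early stopping with a patience of 1 would produce if the monitored loss stopped improving at the third epoch. The following is a minimal sketch of that stopping rule, not the package's implementation; the function `epochs_before_stop` and the loss values are made up for illustration.

```python
def epochs_before_stop(losses, patience=1):
    """Return how many epochs run before early stopping triggers.

    `losses` is a hypothetical sequence of per-epoch monitored losses.
    Training stops after `patience` consecutive epochs without improvement.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in losses:
        epochs_run += 1
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run

# Made-up losses: the third epoch is worse than the second, so training stops there.
print(epochs_before_stop([2.0, 1.5, 1.6, 1.4], patience=1))
```

A larger patience tolerates temporary setbacks at the cost of longer training.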


* Now, let's see some suggestions for the next word.

In [18]:
context = "from the waters"
gru.generate_next_gram(context, top_n=3)

['of', 'shall', 'oov']

* Note that you might want to exclude the `[oov]` pseudo-token (returned here as the string `oov`) from the results, since it is not very informative.

In [19]:
suggested_words = gru.generate_next_gram(context, top_n=5)
suggested_words = [word for word in suggested_words if word != "oov"]
suggested_words[:3]

['of', 'shall', 'is']
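
The filtering step above can be wrapped in a small helper. This is a sketch under two assumptions: the generator accepts `(context, top_n=...)` as `gru.generate_next_gram` does in this notebook, and it returns distinct words, so requesting one extra candidate is enough to compensate for a removed `oov`. The names `suggest_without_oov` and `fake_generate` are hypothetical; the stub only stands in for a trained model.

```python
def suggest_without_oov(generate, context, n=3, oov_token="oov"):
    """Return up to `n` suggestions with the oov token removed.

    `generate` is any callable with the signature (context, top_n=...),
    for example `gru.generate_next_gram` from this notebook.
    """
    # Ask for one extra candidate in case one of them is the oov token.
    candidates = generate(context, top_n=n + 1)
    return [word for word in candidates if word != oov_token][:n]

def fake_generate(context, top_n):
    """Stand-in for a trained model, with a fixed made-up ranking."""
    ranked = ["of", "shall", "oov", "is"]
    return ranked[:top_n]

print(suggest_without_oov(fake_generate, "from the waters", n=3))
```

With a trained model, the call would be `suggest_without_oov(gru.generate_next_gram, context, n=3)`.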