# Character-level Language Modeling with LSTMs

This notebook is adapted from [Keras' lstm_text_generation.py](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py).

Steps:

- Download a small text corpus and preprocess it.
- Extract a character vocabulary and use it to vectorize the text.
- Train an LSTM-based character-level language model.
- Use the trained model to sample random text with varying entropy levels.
- Implement greedy and beam-search deterministic decoders.


**Note**: fitting language models is compute intensive. It is recommended to run this notebook on a server with a GPU or powerful CPUs that you can leave running for several hours at a time.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

## Loading some text data

Let's use some publicly available philosophy:

In [None]:
from keras.utils import get_file

URL = "https://s3.amazonaws.com/text-datasets/nietzsche.txt"

corpus_path = get_file('nietzsche.txt', origin=URL)
text = open(corpus_path).read().lower()
print('Corpus length: %d characters' % len(text))

In [None]:
print(text[:600], "...")

In [None]:
text = text.replace("\n", " ")
split = int(0.9 * len(text))
train_text = text[:split]
validation_text = text[split:]

## Building a vocabulary of all possible symbols 

To simplify things, we build a vocabulary by extracting the list of all possible characters from the full dataset (train and validation).

In a more realistic setting we would need to take into account that the test data can hold symbols never seen in the training set. This issue is limited when we work at the character level though.

Let's build the list of all possible characters and sort it to assign a unique integer to each possible symbol in the corpus:

In [None]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

`char_indices` is a mapping from characters to integer identifiers:

In [None]:
len(char_indices)

In [None]:
char_indices

`indices_char` holds the reverse mapping:

In [None]:
len(indices_char)

In [None]:
indices_char[52]

While not strictly required to build a language model, it's a good idea to have a look at the distribution of relative frequencies of each symbol in the corpus:

In [None]:
from collections import Counter

counter = Counter(text)
# Use a new name so that the sorted `chars` vocabulary defined above is not overwritten.
freq_chars, counts = zip(*counter.most_common())
indices = np.arange(len(counts))

plt.figure(figsize=(16, 3))
plt.bar(indices, counts, 0.8)
plt.xticks(indices, freq_chars);

## Converting the training data to one-hot vectors
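
One possible way to vectorize the text is sketched below, assuming fixed-length windows of characters as inputs and the character that follows each window as the target. The `vectorize` helper, its `max_length` and `step` parameters, and the toy corpus are illustrative assumptions, not the notebook's reference solution.

```python
import numpy as np

def vectorize(text, char_indices, max_length=40, step=3):
    """Cut `text` into overlapping windows and one-hot encode them.

    Returns X of shape (n_windows, max_length, n_symbols) and
    y of shape (n_windows, n_symbols), the one-hot next character.
    """
    starts = range(0, len(text) - max_length, step)
    n_symbols = len(char_indices)
    X = np.zeros((len(starts), max_length, n_symbols), dtype=np.float32)
    y = np.zeros((len(starts), n_symbols), dtype=np.float32)
    for i, start in enumerate(starts):
        for t, char in enumerate(text[start:start + max_length]):
            X[i, t, char_indices[char]] = 1.0
        # The target is the character right after the window.
        y[i, char_indices[text[start + max_length]]] = 1.0
    return X, y

# Toy corpus (an assumption, used only to check the shapes).
toy_text = "hello world, hello lstm"
toy_char_indices = {c: i for i, c in enumerate(sorted(set(toy_text)))}
toy_X, toy_y = vectorize(toy_text, toy_char_indices, max_length=5, step=2)
print(toy_X.shape, toy_y.shape)
```

On the real corpus the same helper could be called as `vectorize(train_text, char_indices)`, with window length and step chosen to fit the available memory.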

## Building recurrent model and measuring perplexity
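
Perplexity is commonly defined as the exponential of the average negative log-likelihood that the model assigns to the true next symbol. The sketch below assumes one-hot targets and predicted probabilities, both of shape `(n_samples, n_symbols)`; the function name and the `eps` smoothing constant are assumptions made for illustration.

```python
import numpy as np

def perplexity(y_true, y_pred, eps=1e-10):
    """Exponential of the mean negative log-probability of the true symbols."""
    # Probability assigned by the model to the true symbol of each sample.
    p_true = np.sum(y_true * y_pred, axis=1)
    return float(np.exp(-np.mean(np.log(p_true + eps))))

# Sanity check: a uniform guess over 4 symbols has a perplexity of about 4.
toy_targets = np.eye(4)[[0, 1, 2]]
toy_uniform = np.full((3, 4), 0.25)
print(perplexity(toy_targets, toy_uniform))
```

A perfect model reaches a perplexity of 1, and a model that guesses uniformly reaches the vocabulary size, which gives two useful reference points when reading validation scores.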

## Training the model

## Sampling random text from the model
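
A common way to control the entropy of the sampled text is to rescale the predicted distribution with a temperature before drawing the next character. The sketch below assumes `probs` is the probability vector predicted by the model for the next symbol; the function name and the `rng` parameter are assumptions.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw a symbol index after sharpening (T < 1) or flattening (T > 1) `probs`."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-10) / temperature
    logits -= logits.max()  # for numerical stability before exponentiating
    rescaled = np.exp(logits)
    rescaled /= rescaled.sum()
    return int(rng.choice(len(rescaled), p=rescaled))

toy_probs = [0.1, 0.7, 0.2]
toy_rng = np.random.default_rng(0)
# A very low temperature behaves almost like taking the argmax.
cold_draws = [sample_with_temperature(toy_probs, 0.01, toy_rng) for _ in range(20)]
# A temperature of 1.0 samples from the unchanged distribution.
warm_draws = [sample_with_temperature(toy_probs, 1.0, toy_rng) for _ in range(20)]
print(cold_draws, warm_draws)
```

Low temperatures tend to produce repetitive but well-formed text, while high temperatures produce more diverse text with more spelling errors.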

## Greedy deterministic decoding
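
Greedy decoding repeatedly appends the most likely next character. The sketch below is model-agnostic: it assumes a hypothetical `next_char_probs(prefix)` callable that returns the predicted distribution over symbols for a given text prefix, which a trained model would have to be wrapped into.

```python
import numpy as np

def greedy_decode(next_char_probs, seed, length, indices_char):
    """Extend `seed` by `length` characters, always picking the argmax."""
    text = seed
    for _ in range(length):
        probs = next_char_probs(text)
        text += indices_char[int(np.argmax(probs))]
    return text

# Hypothetical toy model: after each character, the next one in the
# cycle a -> b -> c -> a is the most likely.
toy_chars = "abc"
toy_indices_char = dict(enumerate(toy_chars))

def toy_model(prefix):
    probs = np.full(len(toy_chars), 0.1)
    probs[(toy_chars.index(prefix[-1]) + 1) % len(toy_chars)] = 0.8
    return probs

decoded = greedy_decode(toy_model, "a", 5, toy_indices_char)
print(decoded)  # abcabc
```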

## Beam search for deterministic decoding
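
Beam search keeps several candidate continuations at each step instead of a single one, ranked by cumulative log-probability. The sketch below assumes a hypothetical `next_char_probs(prefix)` callable that returns the predicted distribution for a prefix; the function name, the `beam_size` parameter and the toy model are illustrative assumptions.

```python
import numpy as np

def beam_search(next_char_probs, seed, length, indices_char, beam_size=2):
    """Keep the `beam_size` most likely continuations of `seed` at each step."""
    beams = [(0.0, seed)]  # (cumulative log-probability, text so far)
    for _ in range(length):
        candidates = []
        for score, text in beams:
            probs = np.asarray(next_char_probs(text), dtype=float)
            log_probs = np.log(probs + 1e-10)
            # Expanding the top `beam_size` symbols of each beam is enough.
            for idx in np.argsort(log_probs)[-beam_size:]:
                candidates.append(
                    (score + log_probs[idx], text + indices_char[int(idx)]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams

# Hypothetical toy model over the symbols "a", "b", "s": the next-character
# distribution only depends on the last character of the prefix.
toy_indices_char = {0: "a", 1: "b", 2: "s"}
toy_table = {"s": [0.6, 0.4, 0.0], "a": [0.55, 0.45, 0.0], "b": [0.9, 0.1, 0.0]}

def toy_model(prefix):
    return toy_table[prefix[-1]]

toy_beams = beam_search(toy_model, "s", length=2, indices_char=toy_indices_char)
best_log_prob, best_text = toy_beams[0]
print(best_text, np.exp(best_log_prob))
```

In this toy model, always picking the most likely character yields "saa" with probability 0.6 × 0.55 = 0.33, whereas the beam recovers "sba" with probability 0.4 × 0.9 = 0.36. This illustrates why beam search can find more likely sequences than a purely greedy strategy.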

## Better handling of sentence boundaries

To simplify things we used the lower case version of the text and we ignored any sentence boundaries. This prevents our model from learning when to stop generating characters. If we want to train a model that can start generating text at the beginning of a sentence and stop at the end of a sentence, we need to provide it with sentence boundary markers in the training set and use those special markers when sampling.

The following gives an example of how to use NLTK to detect sentence boundaries in English text.

This could be used to insert explicit "start_of_sentence" and "end_of_sentence" symbols so as to train a language model that explicitly generates complete sentences from start to end.

Use the following command (in a terminal) to install nltk before importing it in the notebook:

```
$ pip install nltk
```

In [None]:
text_with_case = open(corpus_path).read().replace("\n", " ")

In [None]:
import nltk

nltk.download('punkt')
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text_with_case)

In [None]:
plt.hist([len(s.split()) for s in sentences], bins=30);
plt.title('Distribution of sentence lengths')
plt.xlabel('Approximate number of words');

The first few sentences detected by NLTK are too short to be considered real sentences. Let's have a look at the shortest sentences with more than 20 characters:

In [None]:
sorted_sentences = sorted([s for s in sentences if len(s) > 20], key=len)
for s in sorted_sentences[:5]:
    print(s)

Some long sentences:

In [None]:
for s in sorted_sentences[-3:]:
    print(s)

The NLTK sentence tokenizer seems to do a reasonable job despite the weird casing and '--' signs scattered around the text.
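
Once sentences are detected, boundary markers could be inserted as single reserved characters so that a character-level model can treat them like any other symbol. The sketch below is one way to do it; the choice of the ASCII control characters `"\x02"` and `"\x03"` as markers and the `mark_sentences` helper are assumptions, and the toy sentences stand in for the output of the sentence tokenizer.

```python
# Hypothetical single-character markers (ASCII "start of text" / "end of text").
SOS, EOS = "\x02", "\x03"

def mark_sentences(sentences, sos=SOS, eos=EOS):
    """Wrap each sentence in start / end markers and join them into one string."""
    for s in sentences:
        # The markers must not already occur in the text itself.
        assert sos not in s and eos not in s
    return "".join(sos + s + eos for s in sentences)

toy_sentences = ["What is noble?", "It is a long story."]
marked_text = mark_sentences(toy_sentences)
print(repr(marked_text))
```

When sampling, generation would then start from the start marker and stop as soon as the model emits the end marker.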

Note that the sentence tokenization above uses the original case information because it can help the NLTK sentence boundary detection model make better split decisions. Our text corpus is probably too small to train a good sentence-aware language model though, especially with full case information. Larger corpora, such as a large collection of [public domain books](http://www.gutenberg.org/) or Wikipedia dumps, would be better suited. NLTK also comes with [corpus loading utilities](http://www.nltk.org/book/ch02.html).

The following loads a selection of famous books from the Gutenberg project archive:

In [None]:
import nltk

nltk.download('gutenberg')
book_selection_text = nltk.corpus.gutenberg.raw().replace("\n", " ")

In [None]:
print(book_selection_text[:300])

In [None]:
print("Book corpus length: %d characters" % len(book_selection_text))

Let's do an arbitrary split. Note the training set will have a majority of text that is not authored by the author(s) of the validation set:

In [None]:
split = int(0.9 * len(book_selection_text))
book_selection_train = book_selection_text[:split]
book_selection_validation = book_selection_text[split:]

## Bonus exercises

- Adapt the previous language model to explicitly handle sentence boundaries with a special EOS character.
- Train a new model on random sentences sampled from the book selection corpus with full case information.
- Adapt the random sampling code to start sampling at the beginning of a sentence and stop when the sentence ends.
- Train a deep LSTM (e.g. two LSTM layers instead of one) to see if you can improve the validation perplexity.
- Git clone the source code of the [Linux kernel](https://github.com/torvalds/linux) and train a C programming language model on it. Instead of sentence boundary markers, we could use source file boundary markers for this exercise.
- Try to increase the vocabulary size to 256 using a [Byte Pair Encoding](https://arxiv.org/abs/1508.07909) strategy.