# Building a Language Model for Game of Thrones

*This notebook is part of the tutorial "Sequence Modelling with Deep Learning" presented at the ODSC London Conference in November 2019.*

In this notebook, we will build a neural language model that can understand Game of Thrones language and concepts and even write its own passages. The architecture we will use is a **recurrent neural network (RNN)** with **LSTM cells** to boost the model's ability to remember longer-term information within the text. 

The framework we will use to build the models is `Keras`. Keras is a high-level neural networks API - it acts as a user-friendly layer on top of lower-level frameworks (like Tensorflow or Theano), and allows you to build neural networks in an intuitive, layer-by-layer way. 

<img src="books.jpg" alt="Picture of Game of Thrones books" width="600"/> 

## Introduction to language models

Training a **language model (LM)** is a classic (but difficult) task in the field of **natural language processing (NLP)** - the goal is to predict the next word given the previous words. 

For example, which word should follow the sequence "the cat is on the"? Good guesses are words like "mat", "bed", "sofa", and we would hope that our model would learn to assign high probabilities to these semantically relevant terms. We would hope that words like "the", "hi", and "banana" would be assigned low probabilities. 

Since LMs learn a deep understanding of the syntax and semantics of a text corpus during training, they have become popular components of many other NLP tasks, via **transfer learning**.

Language models are often used **generatively**, as in smartphone keyboard apps, to predict future text based on a **seed sequence**.

## How are language models trained?

All you need to train a language model is a text corpus - **no annotation or labelling of the data is required**. However, language modelling is treated as a **supervised classification task**. The idea is that we extract training data by **sliding a window over the corpus**, and generating input-output pairs that way. More exactly:

![Building a dataset for training a language model](lm_data.png)

So here, we are sliding a window of some size over the corpus in order to generate sequences of words (here, sequences of 2 words each). Then:
+ The initial 2 words in each sequences is our **input** (or **features** or **X values**)
+ The final word in each sequence is our **output** (or **label** or **y values**)

The model is then trained to use the input words (the **context**) to predict the final word. 

## Considerations when building the dataset
There are a few decisions you have to make with how you will build this dataset. For example:
+ Are you going to treat the text as a sequence at the **word level** or the **character level**? 

    + **The arguments for using words are**: there is a lot of information in words since that's how we structure language. And the length of sequences the model has to deal with and remember will be much shorter, leading to greater coherence. 
    + **The arguments for using characters are**: the size of the input space is much more manageable (there are fewer characters than words), and you gain the ability to handle unknown words and generate new words.
    + **You could also work at the sub-word level**: this is a bit of a happy medium - words are broken down into their components. 
    
+ Are you going to **scrub the text squeaky clean** or do you want the model to learn to deal with **noise**, perhaps at a cost of a hit to performance?
+ What sort of **window size** should you be using?

# I. Building a Toy Language Model First

Before launching straight into the Game of Thrones language modelling problem, let's work with a smaller first and understand all of the steps involved. This way, you can more easily understand and track how all of the input, intermediate steps, and output is behaving. 

Let's use the following poem from Lord of the Rings as our entire corpus:

In [1]:
tiny_corpus = ['All that is gold does not glitter',
               'Not all those who wander are lost;',
               'The old that is strong does not wither,',
               'Deep roots are not reached by the frost.',
               'From the ashes, a fire shall be woken,',
               'A light from the shadows shall spring;',
               'Renewed shall be blade that was broken,',
               'The crownless again shall be king']

### i. Preparing the dataset

To get started, the first thing we need to do is **tokenisation** - break the text up into individual units or **tokens**. 

We can use the text tokeniser from the `Keras` library for this, and specify that we want to treat all text as lowercase, generate tokens by splitting on a space character, and view text at the word level. 

In [2]:
from keras.preprocessing.text import Tokenizer

tokeniser = Tokenizer(lower=True, split=' ', char_level=False)
tiny_corpus = ' '.join(tiny_corpus)
tokeniser.fit_on_texts([tiny_corpus])

Using TensorFlow backend.


The tokeniser identifies tokens in the corpus and assigns an index to each word in the vocabulary. We can check which index corresponds to which word like this:

In [3]:
tokeniser.word_index

{'the': 1,
 'not': 2,
 'shall': 3,
 'that': 4,
 'be': 5,
 'all': 6,
 'is': 7,
 'does': 8,
 'are': 9,
 'from': 10,
 'a': 11,
 'gold': 12,
 'glitter': 13,
 'those': 14,
 'who': 15,
 'wander': 16,
 'lost': 17,
 'old': 18,
 'strong': 19,
 'wither': 20,
 'deep': 21,
 'roots': 22,
 'reached': 23,
 'by': 24,
 'frost': 25,
 'ashes': 26,
 'fire': 27,
 'woken': 28,
 'light': 29,
 'shadows': 30,
 'spring': 31,
 'renewed': 32,
 'blade': 33,
 'was': 34,
 'broken': 35,
 'crownless': 36,
 'again': 37,
 'king': 38}

Now, we can use this tokeniser to convert (**encode**) our original corpus to a sequence of indices correponding to words:

In [4]:
encoded_corpus = tokeniser.texts_to_sequences([tiny_corpus])[0]
encoded_corpus[0:7]

[6, 4, 7, 12, 8, 2, 13]

We can always get back to the words by reversing this process:

In [5]:
tokeniser.sequences_to_texts([encoded_corpus[0:7]])

['all that is gold does not glitter']

Now we can build a dataset of sequences that we will use for training and evaluating our language model. 

Let's use a window size of 3 and slide this over the integer-encoded corpus to build our dataset: a **list of lists of length 3**.

In [6]:
sequences = []
window_size = 3
for i in range(0, len(encoded_corpus)):
    sequences.append(encoded_corpus[i:i+window_size])

sequences

[[6, 4, 7],
 [4, 7, 12],
 [7, 12, 8],
 [12, 8, 2],
 [8, 2, 13],
 [2, 13, 2],
 [13, 2, 6],
 [2, 6, 14],
 [6, 14, 15],
 [14, 15, 16],
 [15, 16, 9],
 [16, 9, 17],
 [9, 17, 1],
 [17, 1, 18],
 [1, 18, 4],
 [18, 4, 7],
 [4, 7, 19],
 [7, 19, 8],
 [19, 8, 2],
 [8, 2, 20],
 [2, 20, 21],
 [20, 21, 22],
 [21, 22, 9],
 [22, 9, 2],
 [9, 2, 23],
 [2, 23, 24],
 [23, 24, 1],
 [24, 1, 25],
 [1, 25, 10],
 [25, 10, 1],
 [10, 1, 26],
 [1, 26, 11],
 [26, 11, 27],
 [11, 27, 3],
 [27, 3, 5],
 [3, 5, 28],
 [5, 28, 11],
 [28, 11, 29],
 [11, 29, 10],
 [29, 10, 1],
 [10, 1, 30],
 [1, 30, 3],
 [30, 3, 31],
 [3, 31, 32],
 [31, 32, 3],
 [32, 3, 5],
 [3, 5, 33],
 [5, 33, 4],
 [33, 4, 34],
 [4, 34, 35],
 [34, 35, 1],
 [35, 1, 36],
 [1, 36, 37],
 [36, 37, 3],
 [37, 3, 5],
 [3, 5, 38],
 [5, 38],
 [38]]

You'll notice that at the end there we have sequences that are not length 3, since we run out of text. We can quickly **pad the sequences with zeroes** to keep the data size consistent: 

In [7]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

max_sequence_length = np.max([len(sequence) for sequence in sequences])
sequences = pad_sequences(sequences, 
                          maxlen=max_sequence_length, 
                          padding='pre')
sequences

array([[ 6,  4,  7],
       [ 4,  7, 12],
       [ 7, 12,  8],
       [12,  8,  2],
       [ 8,  2, 13],
       [ 2, 13,  2],
       [13,  2,  6],
       [ 2,  6, 14],
       [ 6, 14, 15],
       [14, 15, 16],
       [15, 16,  9],
       [16,  9, 17],
       [ 9, 17,  1],
       [17,  1, 18],
       [ 1, 18,  4],
       [18,  4,  7],
       [ 4,  7, 19],
       [ 7, 19,  8],
       [19,  8,  2],
       [ 8,  2, 20],
       [ 2, 20, 21],
       [20, 21, 22],
       [21, 22,  9],
       [22,  9,  2],
       [ 9,  2, 23],
       [ 2, 23, 24],
       [23, 24,  1],
       [24,  1, 25],
       [ 1, 25, 10],
       [25, 10,  1],
       [10,  1, 26],
       [ 1, 26, 11],
       [26, 11, 27],
       [11, 27,  3],
       [27,  3,  5],
       [ 3,  5, 28],
       [ 5, 28, 11],
       [28, 11, 29],
       [11, 29, 10],
       [29, 10,  1],
       [10,  1, 30],
       [ 1, 30,  3],
       [30,  3, 31],
       [ 3, 31, 32],
       [31, 32,  3],
       [32,  3,  5],
       [ 3,  5, 33],
       [ 5, 3

That looks better. 

Finally, let's break the sequences down into our input data (X; our matrix of features) and our output data (y; our vector of labels):

In [8]:
X = np.array([x[0:2] for x in sequences])
y = np.array([x[2] for x in sequences])

So for example our input features for the first 5 data points are:

In [9]:
X[0:5]

array([[ 6,  4],
       [ 4,  7],
       [ 7, 12],
       [12,  8],
       [ 8,  2]], dtype=int32)

And their corresponding labels are: 

In [10]:
y[0:5]

array([ 7, 12,  8,  2, 13], dtype=int32)

The final thing we need to do is reformat our label vector y into a **one-hot vector format**. The word index numbers are not actually meaningful (no ordinal relationship) but are discrete classes. We also want to calculate the probabilities of the next word, where a probability of 1 for the correct word is the optimal prediction. 

We can convert the label vector y to a matrix of one-hot vectors using Keras' `to_categorical` method:

In [11]:
from keras.utils import to_categorical

vocabulary_size = len(tokeniser.word_index)+1
y = to_categorical(y, num_classes=vocabulary_size)
y[0:5]

array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

To summarise, we have gone from a raw dataset of:

In [12]:
tiny_corpus

'All that is gold does not glitter Not all those who wander are lost; The old that is strong does not wither, Deep roots are not reached by the frost. From the ashes, a fire shall be woken, A light from the shadows shall spring; Renewed shall be blade that was broken, The crownless again shall be king'

To a formatted dataset ready to be input to a learning algorithm:

In [13]:
print('Example features: ', *X[0:5], sep='\n')
print('Example labels: ', *y[0:5], sep='\n')

Example features: 
[6 4]
[4 7]
[ 7 12]
[12  8]
[8 2]
Example labels: 
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


### ii. Setting up the language model architecture

Now that we have the dataset sorted out, it's time to think about how we want to approach the modelling problem.

Let's build this small recurrent neural network with LSTM units:

![tiny_network](small_network.png)

To explain this network:
+ Our **input layer** represents input into the network. The size of the input layer is the size of the vocabulary of our corpus (+1).
+ We then have an **embedding layer** immediately after the input layer, which will learn **word embeddings** for us (continuous representation of the discrete words in our vocabulary; see my explanatory blog post on embeddings [here](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2). An embedding layer first changes your integer-encoded input to a one-hot vector format, followed by a fully-connected layer, and the learned weight matrix of this layer functions as our word embeddings.
+ Our LSTM layer's job is to actually learn something - it takes as input the embeddings of the current word (at time step $t$) and also the hidden state from the previous word (at time step $t-1$), and tries to make predictions of the next word based on this information.


In `Keras` code, we would build this network like this:

In [14]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense

model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=10, input_length=max_sequence_length-1))
model.add(LSTM(units=50))
model.add(Dense(units=vocabulary_size, activation='softmax'))

An explanation of this code block:
+ Keras allows the sequential layer-by-layer building of neural network models using its `Sequential` API.
+ The input layer is assumed, we don't need to explicitly build it.
+ The first layer we add is our `Embedding` layer. The input dimensionality is our vocabulary size (the size of our input layer), and let's give this embedding layer a small size of 10 neurons. This means each word will get represented as a real-valued vector of length 10. We state that the length of inputs the network should expect is 2. 
+ Next, we add the workhorse of the network - our layer of `LSTM` neurons. Let's make the layer have 50 of these neurons (which is not a lot). We leave all other options to the default (activation functions, initialisation,etc.)
+ Finally, as our output layer, we add a `Dense` fully-connected layer and softmax it. This means that the output of the network will be a vector of probabilities (summing to 1) spread across all the words of our vocabulary (see example below). 

We can examine our model so far using Keras' `model.summary()` function:

In [15]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2, 10)             390       
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_1 (Dense)              (None, 39)                1989      
Total params: 14,579
Trainable params: 14,579
Non-trainable params: 0
_________________________________________________________________
None


This summarises the number of parameters in our model and where they are.
+ **390 parameters from $39*10$**: the number of input neurons times the number of neurons in the embedding layer (it is fully connected - meaning there is a connection between every neuron)
+ **12200 LSTM parameters from $4*((10*50) + (50*50) + 50)$**: 10x50 "normal" weights, 50x50 weights between the previous time step's hidden layer and the current time steps hidden layer, 50 parameters from 1 bias teach, and all times 4 for the 4 gates.  
+ **1989 parameters from $50*39 + 39$**: the number of previous layer neurons (50) times the number of output neurons (39), plus 39 bias parameters from the output layer

Now that we have defined the network, we need to do a `model.compile()` to signify that we have finished building the network and want to define how training should proceed. Specifically, we need to provide:
+ Which loss function we want to use (i.e. what is the goal the model is optimising for as it trains, or what signal is it following in order to improve)
+ Which optimiser we want to use to do our gradient updates (Adam, Adagrad, RMSProp, Nesterov momentum, etc.)
+ Any metrics we want to calculate and output during training in order to keep track of progress. Let's keep track of accuracy, which is just the percentage of predictions that the model gets right. 

We can just use reasonable defaults for now. Since our task is a multiclass classification task, a sensible loss metric to use is **categorical cross-entropy**. 

In [16]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I'll avoid dumping equations on you and just say that:
+ The model's categorical cross-entropy loss will be **low** when the network generally predicts the next words correctly. This means it tends to assign higher probability to the correct word.
+ A training loss of zero means that the network always assigns a probability of 1 to the correct word and 0 to all other words - its predictions are perfect (in the training set).
+ The model's categorical cross-entropy loss will be **high** when the network generally doesn't predict the next words well. This means it tends to incorrectly assign high probabilities to incorrect words.

During the training process, the model optimises its internal parameters such that training loss is minimised (for an explanation of how this happens, read about backpropagation and gradient descent [here](http://neuralnetworksanddeeplearning.com/chap1.html)).

### iii. Training the language model

Now that the network is compiled, we can begin training it for some time (for some number of **epochs** - which is the number of times the network sees your training data). 

Hopefully, as model training proceeds, we will see that the training loss steadily decreases and the accuracy increases: 

In [25]:
# model.fit(X, y, epochs=50, verbose=0)
model.fit(X, y, epochs=50, verbose=0)

<keras.callbacks.callbacks.History at 0x14cf75f90>

That's the model trained! Training is very fast because our dataset is tiny and the network is small. The accuracy doesn't look that bad either (though of course the model is likely to be **overfitted**; see later section). 

There's a few different things you can do with a trained Keras sequence model. 

You can see all the options by typing `model.` followed by a `tab` in a cell:

In [26]:
model.get_weights()[0]

array([[ 3.15541029e-01,  2.03606471e-01,  2.80016929e-01,
         1.27926588e-01, -3.79317701e-01, -9.04322043e-02,
         2.27908164e-01,  3.21692497e-01,  7.95110017e-02,
        -1.13509171e-01],
       [-1.72527492e-01,  3.44863534e-01,  3.81999835e-02,
         3.60623091e-01,  3.40169400e-01, -2.72232652e-01,
         4.05081093e-01, -7.49856979e-02,  1.09128311e-01,
        -3.01918000e-01],
       [ 2.47175246e-01, -2.75743902e-01, -2.80102402e-01,
        -1.90534458e-01,  1.43431589e-01,  1.39929637e-01,
        -4.07428294e-01, -3.12202513e-01, -2.68854462e-02,
         1.76154733e-01],
       [-1.93544433e-01, -1.22530438e-01, -1.88495405e-02,
         3.50918263e-01, -3.15138280e-01, -1.28867984e-01,
         2.45830148e-01,  1.23829342e-01, -2.26835892e-01,
         1.60934970e-01],
       [ 3.56040895e-01, -3.13914686e-01,  3.78356546e-01,
         2.19874486e-01, -2.96347409e-01,  3.83741379e-01,
        -4.56347942e-01, -1.03706792e-01, -4.22252208e-01,
         3.

In [27]:
model.layers

[<keras.layers.embeddings.Embedding at 0x14a3b0890>,
 <keras.layers.recurrent.LSTM at 0x149de8fd0>,
 <keras.layers.core.Dense at 0x149503850>]

### iv. Using the trained model to make predictions

Probably the most interesting thing to do now is use the trained model to make new predictions. For this, we can use the `model.predict_classes()` method. 

We hope that the model will predict the next word given a seed sequence well, i.e. that it learned about word structure from our poem corpus. For instance, given the seed sequence "shall be", we hope the model predicts the correct, observed next words like "king", "broken", and "blade".

However, we can't just run `model.predict_classes()` on raw text data like "shall be", since the text data has to first be tokenised, assigned to an integer index, and reshaped into the correct array dimensions:


In [28]:
seed_sequence = 'shall be'
seed_sequence_encoded = tokeniser.texts_to_sequences([seed_sequence])[0]
print('Encoded seed sequence: %s' % seed_sequence_encoded)
seed_sequence_encoded = np.array(seed_sequence_encoded).reshape(-1,2)
print('Formatted encoded seed sequence: %s' % seed_sequence_encoded)

Encoded seed sequence: [3, 5]
Formatted encoded seed sequence: [[3 5]]


Now we can use the trained model to make a prediction for the next word:

In [29]:
prediction_index = model.predict_classes(seed_sequence_encoded)
print('Prediction for the next word index: %s' % prediction_index)
print('This index corresponds to word: %s' % tokeniser.sequences_to_texts([prediction_index]))

Prediction for the next word index: [38]
This index corresponds to word: ['king']


Great, that looks like a decent prediction for the next word!

Rather than just have the 1 best prediction, it would be interesting to see the probabilities assigned to each possible next word. With a bit of manoeuvring we can get these scores out: 

In [30]:
import pandas as pd

class_indices = list(range(0, vocabulary_size+1))

df = pd.DataFrame(list(zip(class_indices, 
                      [tokeniser.sequences_to_texts([[index]])[0] for index in class_indices],
                       model.predict(seed_sequence_encoded)[0],
                       np.round(model.predict(seed_sequence_encoded)[0],5))),
                  columns=['index', 'word', 'probability', 'rounded_probability'])

df.sort_values('probability', ascending=False).head(10)

Unnamed: 0,index,word,probability,rounded_probability
38,38,king,0.253669,0.25367
33,33,blade,0.16858,0.16858
28,28,woken,0.164102,0.1641
32,32,renewed,0.123998,0.124
5,5,be,0.071628,0.07163
3,3,shall,0.061355,0.06135
8,8,does,0.039532,0.03953
4,4,that,0.022705,0.0227
31,31,spring,0.021434,0.02143
37,37,again,0.01806,0.01806


Cool, it looks like the network does indeed assign the highest probabilities to the 3 words that actually occur in the corpus! It's fun to see that such a small network can produce sensible results on such a small dataset. 

Let's try another example with a different seed sequence:

In [31]:
seed_sequence = 'does not'
seed_sequence_encoded = tokeniser.texts_to_sequences([seed_sequence])[0]
print('Encoded seed sequence: %s' % seed_sequence_encoded)
seed_sequence_encoded = np.array(seed_sequence_encoded).reshape(-1,2)
print('Formatted encoded seed sequence: %s' % seed_sequence_encoded)
df = pd.DataFrame(list(zip(class_indices, 
                      [tokeniser.sequences_to_texts([[index]])[0] for index in class_indices],
                       model.predict(seed_sequence_encoded)[0],
                       np.round(model.predict(seed_sequence_encoded)[0],5))),
                  columns=['index', 'word', 'probability', 'rounded_probability'])
df.sort_values('probability', ascending=False).head(10)

Encoded seed sequence: [8, 2]
Formatted encoded seed sequence: [[8 2]]


Unnamed: 0,index,word,probability,rounded_probability
20,20,wither,0.199929,0.19993
13,13,glitter,0.16588,0.16588
2,2,not,0.15734,0.15734
6,6,all,0.134247,0.13425
14,14,those,0.048063,0.04806
23,23,reached,0.045011,0.04501
24,24,by,0.040465,0.04047
1,1,the,0.039344,0.03934
21,21,deep,0.029354,0.02935
17,17,lost,0.028293,0.02829


Great, that also looks correct.

Rather than predicting just the next 1 word, would be nice to just let the network write continuous text for us, given some seed sequence starting point. Let's package up the above code into a function that lets us do this:

In [32]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        prediction_index = model.predict_classes(encoded_sequence)
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Most likely next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
    
#     return sequence
        

In [33]:
write_text_sequence("all that", 5,
                    model, tokeniser, 
                    max_sequence_length-1)

Using seed sequence: "all that"
Sequence so far: all that
Seed sequence encoded: [[6 4]]
Most likely next word is ['is'] (index 7)
Sequence so far: all that is
Seed sequence encoded: [[4 7]]
Most likely next word is ['gold'] (index 12)
Sequence so far: all that is gold
Seed sequence encoded: [[ 7 12]]
Most likely next word is ['does'] (index 8)
Sequence so far: all that is gold does
Seed sequence encoded: [[12  8]]
Most likely next word is ['not'] (index 2)
Sequence so far: all that is gold does not
Seed sequence encoded: [[8 2]]
Most likely next word is ['wither'] (index 20)
Output:
all that is gold does not wither


Cool, let's write some more text, but let's turn off the verbosity of the function so we just get the final result:

In [37]:
write_text_sequence("the light", 6,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "the light"
Output:
the light from the shadows shall be king


That's kind of artsy.

And again, writing a longer passage this time:

In [38]:
write_text_sequence("ashes are", 10,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "ashes are"
Output:
ashes are fire the be that a strong shall king be king


Our tiny model only knows the few words in the poem so this is a bit gibberish :) But it's still interesting to see.

This is pretty much all there is to a basic language model. Now, let's tackle a real corpus (Game of Thrones) and build a bigger, more powerful model!

# II. Building a language model for Game of Thrones text

The technical approach we'll take to building a GoT language model is pretty similar, with the major difference being the dataset. We are going to need access to a lot of GoT text - preferably, both the books and the subtitles from the HBO show. 

### i. Identifying some datasets

Interestingly, there seems to already be a rich ecosystem of technical work surrounding GoT content. 

Check out projects like:
+ The [Network of Thrones](https://networkofthrones.wordpress.com/) blog for network analyses of characters (e.g. which character is the most 'central' to the story?)
+ An [API of Ice and Fire](https://anapioficeandfire.com) for grabbing various structured data about the universe
+ And [this Reddit post](https://www.reddit.com/r/datasets/comments/769nhw/game_of_thrones_datasets_xpost_from_rfreefolk/) for a list of various datasets compiled about GoT.

Maybe it's just me, but even despite these resources, I still couldn't actually find the raw text from the books and TV show. 

I did eventually come across 2 Kaggle datasets that contained exactly what I wanted:
1. [Plain text files of all the books](https://www.kaggle.com/muhammedfathi/game-of-thrones-book-files/download) 
2. [Subtitle data for the episodes](https://filmora.wondershare.com/video-editing-tips/game-of-thrones-subtitles.html)
    
A bit of initial manual + regex clean up later, and you get the files included in this repo. 

### ii. Grabbing all text data from the Game of Thrones books

So, we've got a few books in our current directory in .txt format:

In [39]:
import glob
book_txt_files = sorted(glob.glob('*.txt'))
print('Found these .txt files in the current directory:', *book_txt_files, sep='\n')

Found these .txt files in the current directory:
Book_1_A_Game_of_Thrones.txt
Book_2_A_Clash_of_Kings.txt
Book_3_A_Storm_of_Swords.txt
Book_4_A_Feast_for_Crows.txt
Book_5_A_Dance_with_Dragons.txt
requirements.txt


We can write a function to extract all of the text in these files, glue it together, and flatten the resulting list of lists into a single mega GoT list of text:

In [40]:
from iteration_utilities import flatten

def grab_book_data(txt_files):
    """
    Grabb text data from a set of text files.
    """

    # keep all text segments in this list
    all_text_segments = []   
    
    # iterate over each book file
    for txt_file in txt_files:
    
        print('Extracting text from file "%s"...' % txt_file)
        # open file
        with open(txt_file, 'r') as file:
            data = file.read()
            print('Found {0} lines of text in this book.'.format(len(data.split('\n'))))
            print('First few lines:\n %s\n' % ' '.join(data.split('\n')[0:5]))  
            all_text_segments.append(data)
            
    return ''.join(list(flatten(all_text_segments)))

And use it to put all the book text data in one place:

In [41]:
book_data = grab_book_data(book_txt_files)

Extracting text from file "Book_1_A_Game_of_Thrones.txt"...
Found 14002 lines of text in this book.
First few lines:
 A GAME OF THRONES  PROLOGUE  “We should start back,” Gared urged as the woods began to grow dark around them.

Extracting text from file "Book_2_A_Clash_of_Kings.txt"...
Found 15765 lines of text in this book.
First few lines:
 A CLASH OF KINGS  PROLOGUE  The comet’s tail spread across the dawn, a red slash that bled above the crags of Dragonstone like a wound in the pink and purple sky.

Extracting text from file "Book_3_A_Storm_of_Swords.txt"...
Found 19641 lines of text in this book.
First few lines:
 A STORM OF SWORDS  PROLOGUE  The day was grey and bitter cold, and the dogs would not take the scent.

Extracting text from file "Book_4_A_Feast_for_Crows.txt"...
Found 16225 lines of text in this book.
First few lines:
 A FEAST FOR CROWS  PROLOGUE  Dragons,” said Mollander. He snatched a withered apple off the ground and tossed it hand to hand.

Extracting text from fi

Let's quickly summarise the amount of data we're working with:

In [42]:
# count lines and words
print('The number of lines in this corpus: {0}\n'
      'The number of words in this corpus: {1}'.format(len(book_data.split('\n')),
                                                       len(book_data.split(' '))))

The number of lines in this corpus: 84579
The number of words in this corpus: 1724951


## iii. Grabbing all text data from the Game of Thrones show

The subtitle data is a bit more complicated to grab because it's in JSON file format, and also frankly the text is a bit messy - there's markup tags, music note symbols, and various other odd non-textual things. 

We have the following `.json` subtitle files in our current directory:

In [43]:
subtitle_json_files = sorted(glob.glob("*.json"))
print('Found these .json files in the current directory:', *subtitle_json_files, sep='\n')

Found these .json files in the current directory:
Season_1_Subtitles.json
Season_2_Subtitles.json
Season_3_Subtitles.json
Season_4_Subtitles.json
Season_5_Subtitles.json
Season_6_Subtitles.json
Season_7_Subtitles.json


We will need to write a function to get the data out. The function below will:
+ **Iterate** over a given list of json subtitle files, **open** each file and **parse** the json
+ **Sort** the subtitles by index. At the moment, the indices are sorted as strings (so, e.g. '1' is followed by '11') so we need to convert the indices to integers and sort them numerically. This is important to get right because otherwise the subtitles are jumbled out of order! 
+ And finally we **extract** the subtitle text and **append** to a master list (which we reformat by flattening) 

In [44]:
import json

def grab_subtitle_data(subtitle_json_files, verbose=True):
    """
    Grabbing GoT subtitle data from json files.
    """

    # keep all text segments in this list
    all_text_segments = []

    # iterate over each subtitles file
    for season, subtitles_file in enumerate(subtitle_json_files):

        # open subtitle file
        with open(subtitles_file, 'r') as file:
            data = json.load(file)

        # iterate over episodes in the season
        for episode in data.keys():
            episode_data = {int(key):value for key,value in data[episode].items()}
            episode_data = sorted(episode_data.items()) # deal with sorting by line (as integer) s
            episode_text_segments = list(dict(episode_data).values())
            print('Found {0} text segments in Season {1} '
                  'Episode "{2}".'.format(len(episode_text_segments), 
                                          season, 
                                          episode.split('.')[0]))
            if verbose:
                print('First few segments:\n%s' % '\n'.join(episode_text_segments[0:5]))            
            all_text_segments.append(episode_text_segments)
            
    return list(flatten(all_text_segments))

In [45]:
subtitle_data = grab_subtitle_data(subtitle_json_files, verbose=False)

Found 559 text segments in Season 0 Episode "Game Of Thrones S01E01 Winter Is Coming".
Found 571 text segments in Season 0 Episode "Game Of Thrones S01E02 The Kingsroad".
Found 740 text segments in Season 0 Episode "Game Of Thrones S01E03 Lord Snow".
Found 754 text segments in Season 0 Episode "Game Of Thrones S01E04 Cripples, Bastards, And Broken Things".
Found 741 text segments in Season 0 Episode "Game Of Thrones S01E05 The Wolf And The Lion".
Found 583 text segments in Season 0 Episode "Game Of Thrones S01E06 A Golden Crown".
Found 775 text segments in Season 0 Episode "Game Of Thrones S01E07 You Win Or You Die".
Found 666 text segments in Season 0 Episode "Game Of Thrones S01E08 The Pointy End".
Found 679 text segments in Season 0 Episode "Game Of Thrones S01E09 Baelor".
Found 590 text segments in Season 0 Episode "Game Of Thrones S01E10 Fire And Blood".
Found 700 text segments in Season 1 Episode "Game Of Thrones S02E01 The North Remembers".
Found 755 text segments in Season 1 Ep

The final array of subtitle data looks like this:

In [46]:
subtitle_data[0:5]

['Easy, boy.',
 "What do you expect? They're savages.",
 'One lot steals a goat from another lot,',
 "before you know it they're ripping each other to pieces.",
 "I've never seen wildlings do a thing like this."]

And we can summarise the dataset size:

In [47]:
# count lines and words
all_subtitle_text = '\n'.join(subtitle_data)
print('The number of text segments in this corpus: {0}\n'
      'The number of words in this corpus: {1}'.format(len(all_subtitle_text.split('\n')),
                                                       len(all_subtitle_text.split(' '))))

The number of text segments in this corpus: 44844
The number of words in this corpus: 244447


### iv. Combining the book and subtitle datasets

Now we can put the book and subtitle data together:

In [48]:
got_data = book_data + all_subtitle_text

And report on the size:

In [49]:
print('The number of lines in the final corpus: {0}\n'
      'The number of words in the final corpus: {1}'.format(len(got_data.split('\n')),
                                                            len(got_data.split(' '))))

The number of lines in the final corpus: 129422
The number of words in the final corpus: 1969397


That's almost 2 million words to play with, which should help our language model tremendously. 

### v. Preparing the dataset

The process to make the sequence datasets is the same as before. The only difference is that we'll use longer sequences as our input (`window_size` is now 10), so we're taking into account more text before making our prediction.

In [50]:
# tokenise the data
tokeniser = Tokenizer(lower=True, split=' ', char_level=False)
tokeniser.fit_on_texts([got_data])
vocabulary_size = len(tokeniser.word_index)+1
print('The vocabulary size for this corpus is: %s' % vocabulary_size)

# encode the corpus using the fitted tokeniser
encoded_corpus = tokeniser.texts_to_sequences([got_data])[0]

# generate sequences
sequences = []
window_size = 10
for i in range(0, len(encoded_corpus)):
    sequences.append(encoded_corpus[i:i+window_size])
print('Generated {0} sequences each of length {1}.'.format(len(sequences), window_size))

# pad the sequences at the end so each sequence is the same length
max_sequence_length = np.max([len(sequence) for sequence in sequences])
sequences = pad_sequences(sequences, 
                          maxlen=max_sequence_length, 
                          padding='pre')

# separate sequences into input arrays X 
# and the output label vector y
X = np.array([seq[0:window_size-1] for seq in sequences])
y = np.array([seq[window_size-1] for seq in sequences])
print("Shape of X matrix: {0} and y vector: {1}".format(X.shape, y.shape))
y = to_categorical(y, num_classes=vocabulary_size)
print("Shape of X matrix: {0} and categorical y matrix: {1}".format(X.shape, y.shape))

The vocabulary size for this corpus is: 30416
Generated 2095106 sequences each of length 10.
Shape of X matrix: (2095106, 9) and y vector: (2095106,)
Shape of X matrix: (2095106, 9) and categorical y matrix: (2095106, 30416)


Once again, our features look like this:

In [51]:
X[0:5]

array([[    5,   972,     6,  3798, 12148,   322,   122,  1131,    62],
       [  972,     6,  3798, 12148,   322,   122,  1131,    62,     4],
       [    6,  3798, 12148,   322,   122,  1131,    62,     4,  4583],
       [ 3798, 12148,   322,   122,  1131,    62,     4,  4583,  1623],
       [12148,   322,   122,  1131,    62,     4,  4583,  1623,    17]],
      dtype=int32)

And our labels look like this:

In [52]:
y[0:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]], dtype=float32)

One useful extra step: we should **split the dataset into a train and test set**. The main reason for this is that it will help us get a better estimate of the model's true "in the wild" performance, since we can evaluate its performance on data that *wasn't* used in training. We will also use a validation set during training to keep track of progress as the model learns. 

Evaluating a model on data that was used for training is cheating, since it's already seen that data before, and hence will do unrealistically well when making predictions on it because it has to some extent **overfit** to the training data.

We will also shuffle the entries, since otherwise our dataset first contains Book 1, then Book 2, ..., Book 5 then finally the subtitle data, whereas we want the model to learn from each source simultaneously. 

I would normally do this using sklearn's `train_test_split`, but because our dataset is so large, I wrote a function to do this splitting in batches and using scipy's sparse matrix utilities:

In [53]:
from sklearn.model_selection import train_test_split
from scipy import sparse, hstack

def batch_train_test_split(X, y, batch_size=10000):
    
    X_train = []
    X_test = []
    y_train = []
    y_test = []

    n_batches = int(np.ceil(X.shape[0]/batch_size))

    for batch_index in range(0, n_batches):
        print('On batch {0} of {1}...'.format(str(batch_index), str(n_batches)))
        start_index = batch_index*batch_size
        end_index = start_index+batch_size

        # grab the small batch
        small_X = X[start_index:end_index]
        small_y = y[start_index:end_index]

        # do the train test split on this small batch
        small_X_train, small_X_test, \
        small_y_train, small_y_test = train_test_split(small_X, small_y, test_size=0.1, shuffle=True)

        # append
        X_train.append(small_X_train)
        X_test.append(small_X_test)
        y_train.append(sparse.csr_matrix(small_y_train))
        y_test.append(sparse.csr_matrix(small_y_test))

    # reformat results
    X_train = np.array(list(flatten(X_train)))
    X_test = np.array(list(flatten(X_test)))
    y_train = sparse.vstack(y_train)
    y_test = sparse.vstack(y_test)
    
    return X_train, X_test, y_train, y_test


In [54]:
# X_train, X_test, y_train, y_test = batch_train_test_split(X, y, batch_size=10000)

On batch 0 of 100...
On batch 1 of 100...
On batch 2 of 100...
On batch 3 of 100...
On batch 4 of 100...
On batch 5 of 100...
On batch 6 of 100...
On batch 7 of 100...
On batch 8 of 100...
On batch 9 of 100...
On batch 10 of 100...
On batch 11 of 100...
On batch 12 of 100...
On batch 13 of 100...
On batch 14 of 100...
On batch 15 of 100...
On batch 16 of 100...
On batch 17 of 100...
On batch 18 of 100...
On batch 19 of 100...
On batch 20 of 100...
On batch 21 of 100...
On batch 22 of 100...
On batch 23 of 100...
On batch 24 of 100...
On batch 25 of 100...
On batch 26 of 100...
On batch 27 of 100...
On batch 28 of 100...
On batch 29 of 100...
On batch 30 of 100...
On batch 31 of 100...
On batch 32 of 100...
On batch 33 of 100...
On batch 34 of 100...
On batch 35 of 100...
On batch 36 of 100...
On batch 37 of 100...
On batch 38 of 100...
On batch 39 of 100...
On batch 40 of 100...
On batch 41 of 100...
On batch 42 of 100...
On batch 43 of 100...
On batch 44 of 100...
On batch 45 of 100..

### vi. Setting up the language model architecture

This time, let's build a slightly larger network:

![Larger RNN language model](bigger_network.png)

The main differences here are:
+ Our vocabulary is much larger
+ Our word embeddings are bigger (100 rather than 50 dimensions), which should allow for richer representations of word meaning
+ We have 2 LSTM layers instead of 1. This should allow the model to learn more complex, hierarchical representations of the text.
+ We have added a dense (fully-connected) layer after the LSTM layers for some additional processing capacity (perhaps, again, allowing for higher-level conceptual representations)



In `Keras` code, we would build the network as follows:

In [55]:
model = Sequential()
model.add(Embedding(vocabulary_size, 50, input_length=max_sequence_length-1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocabulary_size, activation='softmax'))

This is very similar code to before, but we have reason to think that this network will be much more complex and nuanced than the previous one:
+ The dataset we are using is much larger and richer than the toy dataset
+ The network we are training is larger and deeper, and should have more expressive power

We can summarise the **model structure and parameters**:

In [56]:
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 9, 50)             1520800   
_________________________________________________________________
lstm_2 (LSTM)                (None, 9, 100)            60400     
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_3 (Dense)              (None, 30416)             3072016   
Total params: 4,743,716
Trainable params: 4,743,716
Non-trainable params: 0
_________________________________________________________________
None


And compile the finished model and specify some **training settings**:

In [57]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### vii. Training the language model

Then, we can start the training run by passing the training data to the model. This would take a reasonably long time to train - it would be helpful to have access to a **GPU** to run this on (e.g. via Google Colab, AWS/GCP, your own GPU) to make use of computation **parallelisation** and drastically reduce training time.

Since it's a longer training run, we would also ideally want to save some intermediate results while training is happening. One way to do this is using Keras' `ModelCheckpoint` utility. To save some disc space, you can specify that you only want to save a new checkpoint file when something about the model has improved (commonly, validation accuracy or validation loss). 

In [69]:
from keras.callbacks import ModelCheckpoint

checkpoint_filename="GoT_Language_Model_{epoch:02d}_{val_accuracy:.3f}.hdf5"
checkpoint = ModelCheckpoint(checkpoint_filename, 
                             monitor='val_accuracy', 
                             save_best_only=True, 
                             mode='max',  # 'best' file maximises validation_accuracy 
                             verbose=1, )
model.fit(X_train, y_train, epochs=50, validation_split=0.1, callbacks=[checkpoint])

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1337429 samples, validate on 148604 samples
Epoch 1/50

Epoch 00001: val_accuracy improved from -inf to 0.11401, saving model to GoT_Language_Model_01_0.114.hdf5
Epoch 2/50

Epoch 00002: val_accuracy improved from 0.11401 to 0.12455, saving model to GoT_Language_Model_02_0.125.hdf5
Epoch 3/50

Epoch 00003: val_accuracy improved from 0.12455 to 0.12846, saving model to GoT_Language_Model_03_0.128.hdf5
Epoch 4/50

Epoch 00004: val_accuracy improved from 0.12846 to 0.13133, saving model to GoT_Language_Model_04_0.131.hdf5
Epoch 5/50

Epoch 00005: val_accuracy improved from 0.13133 to 0.13449, saving model to GoT_Language_Model_05_0.134.hdf5
Epoch 6/50

Epoch 00006: val_accuracy improved from 0.13449 to 0.13726, saving model to GoT_Language_Model_06_0.137.hdf5
Epoch 7/50

Epoch 00007: val_accuracy improved from 0.13726 to 0.13857, saving model to GoT_Language_Model_07_0.139.hdf5
Epoch 8/50

Epoch 00008: val_accuracy improved from 0.13857 to 0.13953, saving model to GoT_Language_Mo


Epoch 00036: val_accuracy did not improve from 0.14277
Epoch 37/50

Epoch 00037: val_accuracy did not improve from 0.14277
Epoch 38/50

Epoch 00038: val_accuracy did not improve from 0.14277
Epoch 39/50

Epoch 00039: val_accuracy did not improve from 0.14277
Epoch 40/50

Epoch 00040: val_accuracy did not improve from 0.14277
Epoch 41/50

Epoch 00041: val_accuracy did not improve from 0.14277
Epoch 42/50

Epoch 00042: val_accuracy did not improve from 0.14277
Epoch 43/50

Epoch 00043: val_accuracy did not improve from 0.14277
Epoch 44/50

Epoch 00044: val_accuracy did not improve from 0.14277
Epoch 45/50

Epoch 00045: val_accuracy did not improve from 0.14277
Epoch 46/50

Epoch 00046: val_accuracy did not improve from 0.14277
Epoch 47/50

Epoch 00047: val_accuracy did not improve from 0.14277
Epoch 48/50

Epoch 00048: val_accuracy did not improve from 0.14277
Epoch 49/50

Epoch 00049: val_accuracy did not improve from 0.14277
Epoch 50/50

Epoch 00050: val_accuracy did not improve from 

<keras.callbacks.callbacks.History at 0x16bab2ad0>

In [70]:
model.save("final_trained_GoT_language_model.h5")

For now, to save time, I will just **load a model** that I already trained. 

For reference, this model was really accessible to train - it was trained overnight on my MacBook, so there's no special GPU supercomputer involved. The model was still improving quite rapidly at that point, so we would see even better performance if the model were given enough time to reach **convergence** ("finish" learning, or at least hit serious diminishing returns).

In [102]:
from keras.models import load_model

loaded_model = load_model('final_trained_GoT_language_model.h5')

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


### viii. Exploring our Game of Thrones language model

Finally, let's see if we can have the language model write some Game of Thrones text for us.

We can use the same function as before to continuously feed in a seed sequence to the model, generate one word, and then append the generated word to the seed sequence. In this way, the model uses its own previous output as input to itself in the future. 



In [192]:
write_text_sequence("You would not believe the start of the story", 29,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "You would not believe the start of the story"
Output:
You would not believe the start of the story ” the king said “i am a man of the night’s watch ” he said “i am a man of the andals and the rhoynar and the first men


In [193]:
write_text_sequence("A dragon, the dead, and Tyrion walk into a", 20,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "A dragon, the dead, and Tyrion walk into a"
Output:
A dragon, the dead, and Tyrion walk into a great roast of rubies the wind was the same and the way he had been a man of the harpy


In [202]:
write_text_sequence("Daenerys of the House Targaryen, the First of Her Name, The Unburnt", 21,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "Daenerys of the House Targaryen, the First of Her Name, The Unburnt"
Output:
Daenerys of the House Targaryen, the First of Her Name, The Unburnt bastard of the night’s watch had been seated in the middle of the night the wind had been a dozen times


Uhh.. it's clear from these samples that the model has clearly learned something about both English language structure and GoT content, but admittedly it's still a bit clunky. 

Check out the section at the bottom for suggestions on how to improve the model. 

This all looks like a decent start - I especially like that the Lannisters are originally Northerners. Other segments sound like weird GoT beat poetry, and I can almost hear the soft accompanying bongo beats.



### ix. Generating more creative output

If you run the model generatively and write longer stories, you'll see that it can sometimes get stuck in a loop:

In [138]:
write_text_sequence("I would not have expected that Ned and Jon", 100,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "I would not have expected that Ned and Jon"
Output:
I would not have expected that Ned and Jon man within to he my the ser her man a torn ” the beard and the off of the wasn’t her man people ” the wasn’t and the off her man she’d of the polished and the swayed and the think there of the polished and the polished and the swayed and the think how the sansa king his a can of the “were and the dish of the live hair of the insist the great rodrik of while and the forgiveness wrath of the deeply blood the blood of the braid hot his a fighting there and the handful


It really likes the polished and the swayed!

This is because our function `write_text_sequences` will always greedily choose the most probable next word as the next token as it generates text. This is the cause of these repetitive loops. 

Ideally, we want to give the model a bit more space to be creative than this. The easiest way to do this is instead of using the **most probable next word** as our prediction, we can **sample from all possible words proportionally to their probability**. This will help introduce some fun linguistic variety into our generated text. 

The most probable words are still, of course, most likely to be chosen, but there is now space for less probable words to be used as well. We are trading off (potentially rigid) local correctness for (potentially noisy) creativity. 

We can modify our `write_text_sequences` to have an option to use this probabilistic sampling approach:

In [None]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True,
                        use_sampling=True):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        if use_sampling:
            next_word_probabilities = model.predict_proba(encoded_sequence)[0]
            next_word_indices = range(0, vocabulary_size)
            prediction_index = np.random.choice(next_word_indices, size=1, p=next_word_probabilities)             
        else: 
            prediction_index = model.predict_classes(encoded_sequence)
        
        # convert prediction index to actual word
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Most likely next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        # append prediction to the sequence
        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
        

Let's see if this addition helps us get out of the infinite loop of the Andals 

In [209]:
write_text_sequence("I would not have expected that Ned and Jon", 200,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True)

Using seed sequence: "I would not have expected that Ned and Jon"
Output:
I would not have expected that Ned and Jon snow the three who would drive them on the canals “you call him your fight for an instant i dreamed so near the blade in astonishment ” meera leaned back to the door and found a tyrion nails obvious softly her face covered as if he’d told jon snow and some to cold now another handed of miles south warm off the kitchens and three headed dragon out of the lesser but the seven eyed road was their ordinary mouse galloping up beneath it he’d done as much to land a flagon of laughing catelyn thought the cook thought “the steward will have the duty his brothers are enough if you find much to bring our hands away ” margaery could not go out these galleys above the sword until they didn’t you what old bolts of feathers “you have only the outriders shall you go on off ” she walked forward on the laces in the wet earth of slaughter but he served worse than far to the great hall of 

Well.. we're definitely not trapped anymore! And we get a nice hodor sequence at the end (which is really what we're all here for). But now the text sounds absolutely bonkers. Let's see if we can put some breaks on this thing. 

### x. Generating more creative (but controlled) output 

The main way of controlling how creative or random these sampling-based predictions are is by using a hyperparameter called `temperature`. Essentially:
+ **Higher temperature** will emphasise the least likely predictions in a distribution - less likely predictions will have their probabilities increased. To remember this, think of "hot" = more randomness, just like with higher physical temperature leading to more random molecular motion. 
+ **Lower temperature** will downplay the less likely predictions. At the lowest temperatures, we are only ever considering the most likely prediction (our sampling starts to function like an `argmax` and we go back to the greedy approach).

We can write a function to do this scaling of probabilities:

In [210]:
def apply_temp_to_softmax_probs(probs, temp, verbose=False):
    """
    Rescales softmax probabilities using some given temperature.
    """
    
    # add a very small number to probabilities
    # to avoid taking log of zero later (undefined) 
    epsilon = 10e-16 
    probs = probs + epsilon

    # take logs of probabilities
    log_probs = np.log(probs)
    
    # the crucial step - divide the log probabilities by temperature
    scaled_log_probs = log_probs / temp 

    # undo logging to get back to probabilities
    new_probs = np.exp(scaled_log_probs) 

    # and renormalise so that probabilities sum to 1
    normalised_probs = new_probs / np.sum(new_probs)
    
    if verbose:
        print('1. Original probabilities:\n%s\n' % probs)
        print('2. Log of probabilities:\n%s\n' % log_probs)
        print('3. Temperature scaled log of probabilities:\n%s\n' % scaled_log_probs)
        print('4. Back to pseudo-probabilities by undoing logging:\n%s\n' % new_probs)
        print('5. Final normalised probabilities:\n%s\n' % normalised_probs)

    return normalised_probs


You can check out how temperature scaling of probability arrays happens by testing out this function in verbose mode. Let's scale the array `np.array([0.8, 0.1, 0.05, 0.05])` using different temperatures:

#### temperature=1 (should do nothing at all)

In [211]:
_ = apply_temp_to_softmax_probs(np.array([0.8, 0.1, 0.05, 0.05]), 
                            temp=1, verbose=True)

1. Original probabilities:
[0.8  0.1  0.05 0.05]

2. Log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

3. Temperature scaled log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

4. Back to pseudo-probabilities by undoing logging:
[0.8  0.1  0.05 0.05]

5. Final normalised probabilities:
[0.8  0.1  0.05 0.05]



Great, it's good to know that using a temperature of 1 does nothing at all to the probabilities. 

#### temperature=10 (should boost low probabilities and introduce more randomness)

In [285]:
_ = apply_temp_to_softmax_probs(np.array([0.8, 0.1, 0.05, 0.05]), 
                                temp=10, verbose=True)

1. Original probabilities:
[0.8  0.1  0.05 0.05]

2. Log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

3. Temperature scaled log of probabilities:
[-0.02231436 -0.23025851 -0.29957323 -0.29957323]

4. Back to pseudo-probabilities by undoing logging:
[0.97793277 0.79432823 0.74113445 0.74113445]

5. Final normalised probabilities:
[0.30048357 0.2440685  0.22772396 0.22772396]



A high temperature of 10 really amplifies those low probabilities!

#### temperature=0.1 (should dampen out lower probabilities and boost already high probabilities)

In [286]:
_ = apply_temp_to_softmax_probs(np.array([0.8, 0.1, 0.05, 0.05]), 
                                temp=0.1, verbose=True)

1. Original probabilities:
[0.8  0.1  0.05 0.05]

2. Log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

3. Temperature scaled log of probabilities:
[ -2.23143551 -23.02585093 -29.95732274 -29.95732274]

4. Back to pseudo-probabilities by undoing logging:
[1.07374182e-01 1.00000000e-10 9.76562500e-14 9.76562500e-14]

5. Final normalised probabilities:
[9.99999999e-01 9.31322574e-10 9.09494701e-13 9.09494701e-13]



And a low temperature of 0.1 really freezes down those low probabilities, they are practically 0. 

How come the maths works? Essentially:
+ Probabilities that are already big don't have big (negative) logarithms, so scaling them by multiplying/dividing by temperature won't make that much of a difference.
+ But small probabilities have very big (negative) logarithms, so scaling them by multiplying/dividing by temperature can hugely change their values. 

We can add 1 line to our `write_text_sequence` function to make use of temperature (line 29):

In [212]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True,
                        use_sampling=True, 
                        temperature=1):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        if use_sampling:
            next_word_probabilities = model.predict_proba(encoded_sequence)[0]
            next_word_probabilities = apply_temp_to_softmax_probs(next_word_probabilities, temperature)
            next_word_indices = range(0, vocabulary_size)
            prediction_index = np.random.choice(next_word_indices, size=1, p=next_word_probabilities)             
        else: 
            prediction_index = model.predict_classes(encoded_sequence)
        
        # convert prediction index to actual word
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Most likely next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        # append prediction to the sequence
        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
        

Now, we can control the creativity level of the text generation by changing the value of one argument:

#### Predictable text

In [223]:
write_text_sequence("A new adventure starring Varys, The Hound, and Ygritte started by", 30,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True, 
                    temperature=0.5)

Using seed sequence: "A new adventure starring Varys, The Hound, and Ygritte started by"
Output:
A new adventure starring Varys, The Hound, and Ygritte started by the power of his blood and growled the girl had been a man grown through the womb of the city and he was and see your sister ” he explained


#### Normal text

In [224]:
write_text_sequence("A new adventure starring Varys, The Hound, and Ygritte started by", 30,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True, 
                    temperature=1)

Using seed sequence: "A new adventure starring Varys, The Hound, and Ygritte started by"
Output:
A new adventure starring Varys, The Hound, and Ygritte started by the rain in the arm of ned’s chair the very hand of one of the hungry man before faint squeezing every of the days himself lord blackwood waited tonight itself


#### Mental text

In [226]:
write_text_sequence("A new adventure starring Varys, The Hound, and Ygritte started by", 30,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True, 
                    temperature=2)

Using seed sequence: "A new adventure starring Varys, The Hound, and Ygritte started by"
Output:
A new adventure starring Varys, The Hound, and Ygritte started by robert’s snow asking when which setting oznak zo weirwoods shook down part suffice north city’s castle blare a true purpose wizard “always maegi conquer vargo hoat’s sam pleased maddy miss


### xi. Conclusion

I hope this project gave you a taste of language models, have a play around with the language model yourself! 

That's all for this tutorial for now. 

# III. Suggested Extensions

Here are some suggestions for extending this work in order to build a more serious and performant Game of Thrones language model:

1. **Data**: Spend more time cleaning up the text corpus, there is definitely some weird stuff in there (e.g. I saw markup tags in the subtitle data)
2. **Data**: Use all of the data, and perhaps think about grabbing more data (maybe by scraping some of the fan Wikis)
3. **Data**: Here, we used lower case words and ommitted punctuation entirely. This makes the modelling task easier for the model (fewer tokens to deal with), but the output doesn't read as nicely. Try to complete the task by sticking to as natural of language as possible, with little clean up or omission. 
3. **Representation**: Use **pre-trained word embeddings** (e.g. FastText, GloVe, Word2Vec) and possibly update them during training
4. **Representation**: Think about using **sub-word tokenisation** rather than word-based tokenisation
8. **Modelling**: **Train for longer**, until convergence :) Monitor for overfitting using a validation set to early stop. 
5. **Modelling**: Look into using **regularisation techniques** (dropout, weight penalties) to improve model performance and generalisability
6. **Modelling**: Experiment with different numbers of layers, sizes, activation functions, initialisation approaches, etc.
7. **Modelling**: Optimise some of the **hyperparameters** in the model (learning rate, momentum, batch sizes)
9. **Modelling**: Forget RNNs for language modelling completely and jump on the **Transformer hype train** ([choo](https://paperswithcode.com/task/language-modelling) [choo!](https://arxiv.org/abs/1904.09408)). 
10. **Modelling**: Try downloading a **pre-trained language model** (like **Google AI's BERT** or **OpenAI's GPT models** or **Carnegie Mellon/Google Brain's XLNet**) and fine-tuning it to Game of Thrones text. This is likely to give the easiest, biggest gains, since these models are pre-trained on massive corpora with a huge amount of GPUs. 
10. **Visualisation**: Try using **Tensorboard** to visualise the progression of model training and diagnose any weird behaviour. 
