# Building a Language Model for Game of Thrones

*This notebook is part of the tutorial "Modelling Sequences with Deep Learning" presented at the ODSC London Conference in November 2019.*

In this notebook, we will build a neural network model that can understand Game of Thrones language and concepts and even write its own passages. The architecture we will use is a **recurrent neural network (RNN)** with **LSTM cells** to boost the model's ability to remember longer-term information within the text. 

The framework we will use to build the models is `Keras`. Keras is a high-level neural networks API - it acts as a user-friendly layer on top of lower-level frameworks (TensorFlow, Theano, or CNTK), and allows you to build neural networks in an intuitive, layer-by-layer way. 

<img src="books.jpg" alt="Picture of Game of Thrones books" width="600"/> 

## Introduction to language models

Learning a **language model (LM)** is a classic modelling task in the field of **natural language processing (NLP)**. Since LMs learn to understand the structure and content of a text corpus, they are invaluable in applications where the quality or originality of text segments is being assessed. 

Language models are often used **generatively**, as in smartphone keyboard apps, to predict future text based on a **seed sequence**. 

For example, which word should follow the sequence "the cat is on the"? Good guesses are words like "mat", "bed", "sofa", and we would hope that our model would learn to assign high probabilities to these semantically relevant terms. We would hope that words like "the", "hi", and "banana" would be assigned low probabilities. 

## How are language models trained?

All you need to train a language model is a text corpus - **no annotation or labelling of the data is required**. However, language modelling is treated as a **supervised classification task**. The idea is that we extract training data by **sliding a window over the corpus**, and generating input-output pairs that way. More exactly:

![Building a dataset for training a language model](lm_data.png)

So here, we are sliding a window of some size over the corpus in order to generate sequences of words (here, sequences of 2 words each). Then:
+ The initial 2 words in each sequence are our **input** (or **features** or **X values**)
+ The final word in each sequence is our **output** (or **label** or **y values**)

The model is then trained to use the input words (the **context**) to predict the final word. 
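To make the windowing idea concrete, here is a minimal sketch in plain Python. The helper name `sliding_window_pairs` and the example sentence are just illustrations (they are not used elsewhere in this notebook), and a context of 2 words is assumed:

```python
def sliding_window_pairs(tokens, context_size):
    """Return (context, next_token) pairs from a list of tokens."""
    pairs = []
    # stop early so that every window has a full context plus a target word
    for start in range(len(tokens) - context_size):
        context = tokens[start:start + context_size]
        target = tokens[start + context_size]
        pairs.append((context, target))
    return pairs

example_tokens = "the cat is on the mat".split()
pairs = sliding_window_pairs(example_tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

Each printed row is one training example: the context on the left is what the model sees, and the word on the right is what it has to predict.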

## Considerations when building the dataset
There are a few decisions you have to make with how you will build this dataset. For example:
+ Are you going to treat the text as a sequence at the **word level** or the **character level**? 

    + **The arguments for using words are**: there is a lot of information in words since that's how we structure language. And the length of sequences the model has to deal with and remember will be much shorter, leading to greater coherence. 
    + **The arguments for using characters are**: the size of the input space is much more manageable (there are fewer characters than words), and you gain the ability to handle unknown words and generate new words.
    + **You could also work at the sub-word level**: this is a bit of a happy medium - words are broken down into their components. 
    
+ Are you going to **scrub the text squeaky clean** or do you want the model to learn to deal with **noise**, perhaps at a cost of a hit to performance?
+ What sort of **window size** should you be using?
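To get a feel for the word-level versus character-level trade-off, here is a small sketch on a toy sentence (the sentence and variable names are assumptions for illustration only). It compares how long the sequence becomes and how large the vocabulary is at each level:

```python
text = "the cat is on the mat"

# word level: split on spaces
word_tokens = text.split(" ")
# character level: every character (including spaces) is a token
char_tokens = list(text)

word_vocab = sorted(set(word_tokens))
char_vocab = sorted(set(char_tokens))

print("words:", len(word_tokens), "tokens,", len(word_vocab), "distinct")
print("chars:", len(char_tokens), "tokens,", len(char_vocab), "distinct")
```

On a sentence this short the character vocabulary is actually the larger one, but it is capped at the size of the alphabet plus punctuation, whereas the word vocabulary keeps growing with the corpus. The sequence length, on the other hand, is always much longer at the character level.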

# I. Building a Toy Language Model First

Before launching straight into the Game of Thrones language modelling problem, let's work with a smaller corpus first and understand all of the steps involved. This way, you can more easily understand and track how all of the input, intermediate steps, and output are behaving. 

Let's use the following poem from Lord of the Rings as our entire corpus:

In [164]:
tiny_corpus = ['All that is gold does not glitter',
               'Not all those who wander are lost;',
               'The old that is strong does not wither,',
               'Deep roots are not reached by the frost.',
               'From the ashes, a fire shall be woken,',
               'A light from the shadows shall spring;',
               'Renewed shall be blade that was broken,',
               'The crownless again shall be king']

### i. Preparing the dataset

To get started, the first thing we need to do is **tokenisation** - break the text up into individual units or **tokens**. 

We can use the text tokeniser from the `Keras` library for this, and specify that we want to treat all text as lowercase, generate tokens by splitting on a space character, and view text at the word level. 

In [2]:
from keras.preprocessing.text import Tokenizer

tokeniser = Tokenizer(lower=True, split=' ', char_level=False)
tiny_corpus = ' '.join(tiny_corpus)
tokeniser.fit_on_texts([tiny_corpus])

Using TensorFlow backend.


The tokeniser identifies tokens in the corpus and assigns an index to each word in the vocabulary. We can check which index corresponds to which word like this:

In [3]:
tokeniser.word_index

{'the': 1,
 'not': 2,
 'shall': 3,
 'that': 4,
 'be': 5,
 'all': 6,
 'is': 7,
 'does': 8,
 'are': 9,
 'from': 10,
 'a': 11,
 'gold': 12,
 'glitter': 13,
 'those': 14,
 'who': 15,
 'wander': 16,
 'lost': 17,
 'old': 18,
 'strong': 19,
 'wither': 20,
 'deep': 21,
 'roots': 22,
 'reached': 23,
 'by': 24,
 'frost': 25,
 'ashes': 26,
 'fire': 27,
 'woken': 28,
 'light': 29,
 'shadows': 30,
 'spring': 31,
 'renewed': 32,
 'blade': 33,
 'was': 34,
 'broken': 35,
 'crownless': 36,
 'again': 37,
 'king': 38}
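Notice that the indices are not arbitrary: more frequent words get lower indices, and ties appear to be broken by order of first appearance. The sketch below reproduces this ordering with the standard library only. It is an illustration of the observed behaviour, not the Keras implementation, and the regex-based tokenisation is an assumption that happens to work for this poem:

```python
import re
from collections import Counter

poem_lines = ['All that is gold does not glitter',
              'Not all those who wander are lost;',
              'The old that is strong does not wither,',
              'Deep roots are not reached by the frost.',
              'From the ashes, a fire shall be woken,',
              'A light from the shadows shall spring;',
              'Renewed shall be blade that was broken,',
              'The crownless again shall be king']

# lowercase the text and keep only runs of letters (drops punctuation)
words = re.findall(r"[a-z]+", ' '.join(poem_lines).lower())

# most_common() sorts by count; equal counts keep first-seen order
counts = Counter(words)
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

print(list(word_index.items())[:6])
```

Indices start at 1 here as well, which leaves index 0 free for padding later on.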

Now, we can use this tokeniser to convert (**encode**) our original corpus to a sequence of indices corresponding to words:

In [4]:
encoded_corpus = tokeniser.texts_to_sequences([tiny_corpus])[0]
encoded_corpus[0:7]

[6, 4, 7, 12, 8, 2, 13]

We can always get back to the words by reversing this process:

In [5]:
tokeniser.sequences_to_texts([encoded_corpus[0:7]])

['all that is gold does not glitter']

Now we can build a dataset of sequences that we will use for training and evaluating our language model. 

Let's use a window size of 3 and slide this over the integer-encoded corpus to build our dataset: a **list of lists of length 3**.

In [6]:
sequences = []
window_size = 3
for i in range(0, len(encoded_corpus)):
    sequences.append(encoded_corpus[i:i+window_size])

sequences

[[6, 4, 7],
 [4, 7, 12],
 [7, 12, 8],
 [12, 8, 2],
 [8, 2, 13],
 [2, 13, 2],
 [13, 2, 6],
 [2, 6, 14],
 [6, 14, 15],
 [14, 15, 16],
 [15, 16, 9],
 [16, 9, 17],
 [9, 17, 1],
 [17, 1, 18],
 [1, 18, 4],
 [18, 4, 7],
 [4, 7, 19],
 [7, 19, 8],
 [19, 8, 2],
 [8, 2, 20],
 [2, 20, 21],
 [20, 21, 22],
 [21, 22, 9],
 [22, 9, 2],
 [9, 2, 23],
 [2, 23, 24],
 [23, 24, 1],
 [24, 1, 25],
 [1, 25, 10],
 [25, 10, 1],
 [10, 1, 26],
 [1, 26, 11],
 [26, 11, 27],
 [11, 27, 3],
 [27, 3, 5],
 [3, 5, 28],
 [5, 28, 11],
 [28, 11, 29],
 [11, 29, 10],
 [29, 10, 1],
 [10, 1, 30],
 [1, 30, 3],
 [30, 3, 31],
 [3, 31, 32],
 [31, 32, 3],
 [32, 3, 5],
 [3, 5, 33],
 [5, 33, 4],
 [33, 4, 34],
 [4, 34, 35],
 [34, 35, 1],
 [35, 1, 36],
 [1, 36, 37],
 [36, 37, 3],
 [37, 3, 5],
 [3, 5, 38],
 [5, 38],
 [38]]

You'll notice that at the end there we have sequences that are not length 3, since we run out of text. We can quickly **pad the sequences with zeroes** to keep the data size consistent: 

In [7]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

max_sequence_length = np.max([len(sequence) for sequence in sequences])
sequences = pad_sequences(sequences, 
                          maxlen=max_sequence_length, 
                          padding='pre')
sequences

array([[ 6,  4,  7],
       [ 4,  7, 12],
       [ 7, 12,  8],
       [12,  8,  2],
       [ 8,  2, 13],
       [ 2, 13,  2],
       [13,  2,  6],
       [ 2,  6, 14],
       [ 6, 14, 15],
       [14, 15, 16],
       [15, 16,  9],
       [16,  9, 17],
       [ 9, 17,  1],
       [17,  1, 18],
       [ 1, 18,  4],
       [18,  4,  7],
       [ 4,  7, 19],
       [ 7, 19,  8],
       [19,  8,  2],
       [ 8,  2, 20],
       [ 2, 20, 21],
       [20, 21, 22],
       [21, 22,  9],
       [22,  9,  2],
       [ 9,  2, 23],
       [ 2, 23, 24],
       [23, 24,  1],
       [24,  1, 25],
       [ 1, 25, 10],
       [25, 10,  1],
       [10,  1, 26],
       [ 1, 26, 11],
       [26, 11, 27],
       [11, 27,  3],
       [27,  3,  5],
       [ 3,  5, 28],
       [ 5, 28, 11],
       [28, 11, 29],
       [11, 29, 10],
       [29, 10,  1],
       [10,  1, 30],
       [ 1, 30,  3],
       [30,  3, 31],
       [ 3, 31, 32],
       [31, 32,  3],
       [32,  3,  5],
       [ 3,  5, 33],
       [ 5, 3

That looks better. 
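If you are curious what 'pre' padding does to the two short sequences, here is a plain-Python sketch. The helper `pad_pre` is hypothetical and only mimics the padding part (the Keras function can also truncate sequences that are too long, which this sketch does not do):

```python
def pad_pre(sequence, length, pad_value=0):
    """Left-pad a list with pad_value until it reaches the given length."""
    missing = max(0, length - len(sequence))
    return [pad_value] * missing + list(sequence)

# the two short windows from the end of our corpus
print(pad_pre([5, 38], 3))
print(pad_pre([38], 3))
```

Because the zeroes go at the front, the last element of each padded sequence is still a real word, so every row still has a valid label.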

Finally, let's break the sequences down into our input data (X; our matrix of features) and our output data (y; our vector of labels):

In [8]:
X = np.array([x[0:2] for x in sequences])
y = np.array([x[2] for x in sequences])

So for example our input features for the first 5 data points are:

In [9]:
X[0:5]

array([[ 6,  4],
       [ 4,  7],
       [ 7, 12],
       [12,  8],
       [ 8,  2]], dtype=int32)

And their corresponding labels are: 

In [10]:
y[0:5]

array([ 7, 12,  8,  2, 13], dtype=int32)

The final thing we need to do is reformat our label vector y into a **one-hot vector format**. The word index numbers are not actually meaningful (there is no ordinal relationship between them); they are just discrete classes. We also want the model to output a probability for each word in the vocabulary, where a probability of 1 for the correct word is the optimal prediction. 

We can convert the label vector y to a matrix of one-hot vectors using Keras' `to_categorical` function:

In [11]:
from keras.utils import to_categorical

vocabulary_size = len(tokeniser.word_index)+1
y = to_categorical(y, num_classes=vocabulary_size)
y[0:5]

array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
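Under the hood, one-hot encoding is very simple. Here is a minimal plain-Python sketch (the helper `one_hot` is illustrative, not the Keras implementation), assuming our vocabulary size of 39:

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at the given index."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# the first label in our dataset is word index 7 ('is')
first_label = one_hot(7, 39)
print(first_label)
```

The position of the 1 is the word index, which is why the first row of `y` above has its 1 in position 7.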

To summarise, we have gone from a raw dataset of:

In [12]:
tiny_corpus

'All that is gold does not glitter Not all those who wander are lost; The old that is strong does not wither, Deep roots are not reached by the frost. From the ashes, a fire shall be woken, A light from the shadows shall spring; Renewed shall be blade that was broken, The crownless again shall be king'

To a formatted dataset ready to be input to a learning algorithm:

In [13]:
print('Example features: ', *X[0:5], sep='\n')
print('Example labels: ', *y[0:5], sep='\n')

Example features: 
[6 4]
[4 7]
[ 7 12]
[12  8]
[8 2]
Example labels: 
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


### ii. Setting up the language model architecture

Now that we have the dataset sorted out, it's time to think about how we want to approach the modelling problem.

Let's build this small recurrent neural network with LSTM units:

![tiny_network](small.png)

To explain this network:
+ Our **input layer** represents input into the network. The size of the input layer is the size of the vocabulary of our corpus, plus 1 because the tokeniser starts its word indices at 1 and index 0 is left for padding.
+ We then have an **embedding layer** immediately after the input layer, which will learn **word embeddings** for us (continuous representations of the discrete words in our vocabulary; see my explanatory blog post on embeddings [here](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2)). An embedding layer is just a fully-connected layer (with some regularisation and constraints), where the learned weight matrix functions as our word embeddings.


In `Keras` code, we would build this network like this:

In [14]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense

model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=10, input_length=max_sequence_length-1))
model.add(LSTM(units=50))
model.add(Dense(units=vocabulary_size, activation='softmax'))

An explanation of this code block:
+ Keras allows the sequential layer-by-layer building of neural network models using its `Sequential` API.
+ The input layer is implicit; we don't need to explicitly build it.
+ The first layer we add is our `Embedding` layer. The input dimensionality is our vocabulary size (the size of our input layer), and let's give this embedding layer a small size of 10 neurons. This means each word will get represented as a real-valued vector of length 10. We state that the length of inputs the network should expect is 2. 
+ Next, we add the workhorse of the network - our layer of `LSTM` neurons. Let's make the layer have 50 of these neurons (which is not a lot). We leave all other options at their defaults (activation functions, initialisation, etc.).
+ Finally, as our output layer, we add a `Dense` fully-connected layer and softmax it. This means that the output of the network will be a vector of probabilities (summing to 1) spread across all the words of our vocabulary (see example below). 
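To see what "softmax it" means for the output layer, here is a small plain-Python sketch of the softmax function applied to three made-up scores (the scores are hypothetical; in our network there would be one score per vocabulary entry):

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities summing to 1."""
    # subtracting the maximum keeps exp() numerically stable
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

Higher scores get higher probabilities, and the values always add up to 1, which is what lets us read the network's output as a distribution over next words.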

We can examine our model so far using Keras' `model.summary()` function:

In [15]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2, 10)             390       
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_1 (Dense)              (None, 39)                1989      
Total params: 14,579
Trainable params: 14,579
Non-trainable params: 0
_________________________________________________________________
None


This summarises the number of parameters in our model and where they are.
+ 390 embedding parameters from $39*10$ (one length-10 vector per vocabulary index)
+ 12200 LSTM parameters from $4*(10*50 + 50*50 + 50)$
+ 1989 output-layer parameters from $50*39 + 39$ (weights plus biases)
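We can double-check this arithmetic in a few lines of Python. The variable names below are just for illustration; the factor of 4 in the LSTM term reflects the four sets of weights in an LSTM cell (input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias:

```python
vocabulary_size = 39
embedding_dim = 10
lstm_units = 50

# one embedding vector per vocabulary index
embedding_params = vocabulary_size * embedding_dim

# four weight sets, each with input weights, recurrent weights and biases
lstm_params = 4 * (embedding_dim * lstm_units
                   + lstm_units * lstm_units
                   + lstm_units)

# dense output layer: weights plus one bias per output class
dense_params = lstm_units * vocabulary_size + vocabulary_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The total matches the 14,579 parameters reported by the model summary.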

Now that we have defined the network, we need to do a `model.compile()` to signify that we have finished building the network and want to define how training should proceed. Specifically, we need to provide:
+ Which loss function we want to use (i.e. what is the goal the model is optimising for as it trains, or what signal is it following in order to improve)
+ Which optimiser we want to use to do our gradient updates (Adam, Adagrad, RMSProp, Nesterov momentum, etc.)
+ Any metrics we want to calculate and output during training in order to keep track of progress. Let's keep track of accuracy, which is just the percentage of predictions that the model gets right. 

We can just use sensible defaults for now. Since our task is a multiclass classification task, a sensible loss function to use is **categorical cross-entropy**. 

In [16]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I'll avoid dumping equations on you and just say that:
+ The model's categorical cross-entropy loss will be **low** when the network generally predicts the next words correctly. This means it tends to assign higher probability to the correct word.
+ A training loss of zero means that the network always assigns a probability of 1 to the correct word and 0 to all other words - its predictions are perfect (in the training set).
+ The model's categorical cross-entropy loss will be **high** when the network generally doesn't predict the next words well. This means it tends to incorrectly assign high probabilities to incorrect words.
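If you would like to see these three cases with actual numbers, here is a minimal sketch of the loss for a single example in plain Python. The probabilities are made up for a toy 3-word vocabulary, and this only illustrates the formula; Keras additionally averages over the batch and guards against taking the log of zero:

```python
import math

def categorical_cross_entropy(true_one_hot, predicted_probs):
    """Loss for one example: minus the log-probability of the correct class."""
    return -sum(t * math.log(p)
                for t, p in zip(true_one_hot, predicted_probs)
                if t > 0)

true_label = [0.0, 1.0, 0.0]           # the correct word is class 1
confident_right = [0.05, 0.90, 0.05]   # most probability on the right word
confident_wrong = [0.90, 0.05, 0.05]   # most probability on a wrong word
perfect = [0.0, 1.0, 0.0]              # all probability on the right word

loss_right = categorical_cross_entropy(true_label, confident_right)
loss_wrong = categorical_cross_entropy(true_label, confident_wrong)
loss_perfect = categorical_cross_entropy(true_label, perfect)
print(loss_right, loss_wrong, loss_perfect)
```

The loss only depends on the probability given to the correct word: the closer that probability is to 1, the closer the loss is to zero.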

During the training process, the model optimises its internal parameters such that training loss is minimised (to understand how this happens, read up on backpropagation and gradient descent).

### iii. Training the language model

Now that the network is compiled, we can begin training it for some time (for some number of **epochs** - which is the number of times the network sees your training data). 

Hopefully, as model training proceeds, we will see that the training loss steadily decreases and the accuracy increases: 

In [18]:
# model.fit(X, y, epochs=50, verbose=2)
model.fit(X, y, epochs=100, verbose=0)

<keras.callbacks.callbacks.History at 0x13ea22b50>

That's the model trained! Training is very fast because our dataset is tiny and the network is small. The accuracy doesn't look that bad either (though of course the model is likely to be **overfitted**; see later section). 

There are a few different things you can do with a trained Keras sequence model. You can see all the options by typing `model.` followed by a `tab` in a cell:

In [19]:
model.get_weights()[0][0]

array([ 0.33057487,  0.18980494, -0.364027  ,  0.30919576, -0.1292659 ,
       -0.27887756,  0.38433874,  0.20016178,  0.19976257, -0.31896615],
      dtype=float32)

In [20]:
model.layers

[<keras.layers.embeddings.Embedding at 0x13c6a7510>,
 <keras.layers.recurrent.LSTM at 0x103407450>,
 <keras.layers.core.Dense at 0x103448550>]

### iv. Using the trained model to make predictions

Probably the most interesting thing to do now is use the trained model to make new predictions. For this, we can use the `model.predict_classes()` method. 

We hope that the model will predict the next word given a seed sequence well, i.e. that it learned about word structure from our poem corpus. For instance, given the seed sequence "shall be", we hope the model predicts the correct, observed next words like "king", "woken", and "blade".

However, we can't just run `model.predict_classes()` on raw text data like "shall be", since the text data has to first be tokenised, assigned to an integer index, and reshaped into the correct array dimensions:


In [21]:
seed_sequence = 'shall be'
seed_sequence_encoded = tokeniser.texts_to_sequences([seed_sequence])[0]
print('Encoded seed sequence: %s' % seed_sequence_encoded)
seed_sequence_encoded = np.array(seed_sequence_encoded).reshape(-1,2)
print('Formatted encoded seed sequence: %s' % seed_sequence_encoded)

Encoded seed sequence: [3, 5]
Formatted encoded seed sequence: [[3 5]]


Now we can use the trained model to make a prediction for the next word:

In [22]:
prediction_index = model.predict_classes(seed_sequence_encoded)
print('Prediction for the next word index: %s' % prediction_index)
print('This index corresponds to word: %s' % tokeniser.sequences_to_texts([prediction_index]))

Prediction for the next word index: [38]
This index corresponds to word: ['king']


Great, that looks like a decent prediction for the next word!
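It's worth knowing what `predict_classes` is doing here: it simply picks the index with the highest predicted probability. The plain-Python sketch below shows the same idea on a made-up probability vector (the numbers and the helper `argmax` are hypothetical):

```python
def argmax(values):
    """Return the position of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# toy output of a softmax layer over a 4-word vocabulary
toy_probabilities = [0.05, 0.10, 0.60, 0.25]
predicted_index = argmax(toy_probabilities)
print(predicted_index)
```

In this notebook, the same result should be obtainable with `np.argmax(model.predict(seed_sequence_encoded), axis=-1)`. As far as I know, some later Keras releases no longer provide `predict_classes`, in which case this is the usual substitute.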

Rather than just have the 1 best prediction, it would be interesting to see the probabilities assigned to each possible next word. With a bit of manoeuvring we can get these scores out: 

In [23]:
import pandas as pd

class_indices = list(range(vocabulary_size))  # one index per softmax output (0 is the padding index)

df = pd.DataFrame(list(zip(class_indices, 
                      [tokeniser.sequences_to_texts([[index]])[0] for index in class_indices],
                       model.predict(seed_sequence_encoded)[0],
                       np.round(model.predict(seed_sequence_encoded)[0],5))),
                  columns=['index', 'word', 'probability', 'rounded_probability'])

df.sort_values('probability', ascending=False).head(10)

| | index | word | probability | rounded_probability |
|---|---|---|---|---|
| 38 | 38 | king | 0.202408 | 0.20241 |
| 28 | 28 | woken | 0.188621 | 0.18862 |
| 11 | 11 | a | 0.157626 | 0.15763 |
| 33 | 33 | blade | 0.147121 | 0.14712 |
| 4 | 4 | that | 0.108378 | 0.10838 |
| 5 | 5 | be | 0.02187 | 0.02187 |
| 10 | 10 | from | 0.021261 | 0.02126 |
| 27 | 27 | fire | 0.017142 | 0.01714 |
| 25 | 25 | frost | 0.015533 | 0.01553 |
| 32 | 32 | renewed | 0.015003 | 0.015 |


Cool, the 3 words that actually follow "shall be" in the corpus ("king", "woken", and "blade") all rank in the top four! It's fun to see that such a small network can produce sensible results on such a small dataset. 

Let's try another example with a different seed sequence:

In [25]:
seed_sequence = 'does not'
seed_sequence_encoded = tokeniser.texts_to_sequences([seed_sequence])[0]
print('Encoded seed sequence: %s' % seed_sequence_encoded)
seed_sequence_encoded = np.array(seed_sequence_encoded).reshape(-1,2)
print('Formatted encoded seed sequence: %s' % seed_sequence_encoded)
df = pd.DataFrame(list(zip(class_indices, 
                      [tokeniser.sequences_to_texts([[index]])[0] for index in class_indices],
                       model.predict(seed_sequence_encoded)[0],
                       np.round(model.predict(seed_sequence_encoded)[0],5))),
                  columns=['index', 'word', 'probability', 'rounded_probability'])
df.sort_values('probability', ascending=False).head(10)

Encoded seed sequence: [8, 2]
Formatted encoded seed sequence: [[8 2]]


| | index | word | probability | rounded_probability |
|---|---|---|---|---|
| 2 | 2 | not | 0.193435 | 0.19344 |
| 13 | 13 | glitter | 0.191043 | 0.19104 |
| 20 | 20 | wither | 0.176577 | 0.17658 |
| 9 | 9 | are | 0.111534 | 0.11153 |
| 6 | 6 | all | 0.103988 | 0.10399 |
| 17 | 17 | lost | 0.086964 | 0.08696 |
| 23 | 23 | reached | 0.032819 | 0.03282 |
| 24 | 24 | by | 0.021843 | 0.02184 |
| 14 | 14 | those | 0.019182 | 0.01918 |
| 21 | 21 | deep | 0.014115 | 0.01412 |


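That looks mostly right: the two observed continuations, "glitter" and "wither", rank second and third, although the model narrowly prefers repeating "not".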

Rather than predicting just the next word, it would be nice to let the network write continuous text for us, given some seed sequence as a starting point. Let's package up the above code into a function that lets us do this:

In [124]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        prediction_index = model.predict_classes(encoded_sequence)
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Most likely next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
    
    return sequence
        

In [27]:
write_text_sequence("all that", 5,
                    model, tokeniser, 
                    max_sequence_length-1)

Using seed sequence: "all that"
Sequence so far: all that
Seed sequence encoded: [[6 4]]
Most likely next word is ['is'] (index 7)
Sequence so far: all that is
Seed sequence encoded: [[4 7]]
Most likely next word is ['gold'] (index 12)
Sequence so far: all that is gold
Seed sequence encoded: [[ 7 12]]
Most likely next word is ['does'] (index 8)
Sequence so far: all that is gold does
Seed sequence encoded: [[12  8]]
Most likely next word is ['not'] (index 2)
Sequence so far: all that is gold does not
Seed sequence encoded: [[8 2]]
Most likely next word is ['not'] (index 2)
Output: all that is gold does not not


'all that is gold does not not'

Cool, let's write some more text, but let's turn off the verbosity of the function so we just get the final result:

In [28]:
write_text_sequence("the light", 5,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "the light"
Output: the light from the the that that


'the light from the the that that'

That's kind of artsy.

And again, writing a longer passage this time:

In [29]:
tiny_corpus

'All that is gold does not glitter Not all those who wander are lost; The old that is strong does not wither, Deep roots are not reached by the frost. From the ashes, a fire shall be woken, A light from the shadows shall spring; Renewed shall be blade that was broken, The crownless again shall be king'

In [30]:
write_text_sequence("ashes are", 10,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "ashes are"
Output: ashes are does reached not the lost from the the that that


'ashes are does reached not the lost from the the that that'

Our tiny model only knows the few words in the poem so this is a bit gibberish :) But it's still interesting to see.

This is pretty much all there is to a basic language model. Now, let's tackle a real corpus (Game of Thrones) and build a bigger, more powerful model!

# II. Building a language model for Game of Thrones text

The technical approach we'll take to building a GoT language model is pretty similar, with the major difference being the dataset. We are going to need access to a lot of GoT text - preferably, both the books and the subtitles from the HBO show. 

### i. Identifying some datasets

Interestingly, there seems to already be a rich ecosystem of technical work surrounding GoT content. 

Check out projects like:
+ The [Network of Thrones](https://networkofthrones.wordpress.com/) blog for network analyses of characters (e.g. which character is the most 'central' to the story?)
+ An [API of Ice and Fire](https://anapioficeandfire.com) for grabbing various structured data about the universe
+ And [this Reddit post](https://www.reddit.com/r/datasets/comments/769nhw/game_of_thrones_datasets_xpost_from_rfreefolk/) for a list of various datasets compiled about GoT.

Maybe it's just me, but despite these resources, I still couldn't actually find the raw text from the books and TV show. 

I did eventually come across 2 Kaggle datasets that contained exactly what I wanted:
1. [Plain text files of all the books](https://www.kaggle.com/muhammedfathi/game-of-thrones-book-files/download) 
2. [Subtitle data for the episodes](https://filmora.wondershare.com/video-editing-tips/game-of-thrones-subtitles.html)
    
A bit of initial manual + regex clean up later, and you get the files included in this repo. 

### ii. Grabbing all text data from the Game of Thrones books

So, we've got a few books in our current directory in .txt format:

In [31]:
import glob
book_txt_files = sorted(glob.glob('*.txt'))
print('Found these .txt files in the current directory:', *book_txt_files, sep='\n')

Found these .txt files in the current directory:
Book_1_A_Game_of_Thrones.txt
Book_2_A_Clash_of_Kings.txt
Book_3_A_Storm_of_Swords.txt
Book_4_A_Feast_for_Crows.txt
Book_5_A_Dance_with_Dragons.txt


We can write a function to extract all of the text in these files and glue it together into a single mega GoT string:

In [32]:
from iteration_utilities import flatten

def grab_book_data(txt_files):
    """
    Grab text data from a set of text files.
    """

    # keep all text segments in this list
    all_text_segments = []   
    
    # iterate over each book file
    for txt_file in txt_files:
    
        print('Extracting text from file "%s"...' % txt_file)
        # open file
        with open(txt_file, 'r') as file:
            data = file.read()
            print('Found {0} lines of text in this book.'.format(len(data.split('\n'))))
            print('First few lines:\n %s\n' % ' '.join(data.split('\n')[0:5]))  
            all_text_segments.append(data)
            
    return ''.join(list(flatten(all_text_segments)))

And use it to put all the book text data in one place:

In [33]:
book_data = grab_book_data(book_txt_files)

Extracting text from file "Book_1_A_Game_of_Thrones.txt"...
Found 14002 lines of text in this book.
First few lines:
 A GAME OF THRONES  PROLOGUE  “We should start back,” Gared urged as the woods began to grow dark around them.

Extracting text from file "Book_2_A_Clash_of_Kings.txt"...
Found 15765 lines of text in this book.
First few lines:
 A CLASH OF KINGS  PROLOGUE  The comet’s tail spread across the dawn, a red slash that bled above the crags of Dragonstone like a wound in the pink and purple sky.

Extracting text from file "Book_3_A_Storm_of_Swords.txt"...
Found 19641 lines of text in this book.
First few lines:
 A STORM OF SWORDS  PROLOGUE  The day was grey and bitter cold, and the dogs would not take the scent.

Extracting text from file "Book_4_A_Feast_for_Crows.txt"...
Found 16225 lines of text in this book.
First few lines:
 A FEAST FOR CROWS  PROLOGUE  Dragons,” said Mollander. He snatched a withered apple off the ground and tossed it hand to hand.

Extracting text from fi

Let's quickly summarise the amount of data we're working with:

In [34]:
# count lines and words
print('The number of lines in this corpus: {0}\n'
      'The number of words in this corpus: {1}'.format(len(book_data.split('\n')),
                                                       len(book_data.split(' '))))

The number of lines in this corpus: 84518
The number of words in this corpus: 1724951


### iii. Grabbing all text data from the Game of Thrones show

The subtitle data is a bit more complicated to grab because it's in JSON file format, and also frankly the text is a bit messy - there are markup tags, music note symbols, and various other odd non-textual things. 

We have the following `.json` subtitle files in our current directory:

In [35]:
subtitle_json_files = sorted(glob.glob("*.json"))
print('Found these .json files in the current directory:', *subtitle_json_files, sep='\n')

Found these .json files in the current directory:
Season_1_Subtitles.json
Season_2_Subtitles.json
Season_3_Subtitles.json
Season_4_Subtitles.json
Season_5_Subtitles.json
Season_6_Subtitles.json
Season_7_Subtitles.json


We will need to write a function to get the data out. The function below will:
+ **Iterate** over a given list of json subtitle files, **open** each file and **parse** the json
+ **Sort** the subtitles by index. At the moment, the indices are sorted as strings (so, e.g. '1' is followed by '11') so we need to convert the indices to integers and sort them numerically. This is important to get right because otherwise the subtitles are jumbled out of order! 
+ And finally we **extract** the subtitle text and **append** to a master list (which we reformat by flattening) 
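The sorting issue is easy to reproduce on a toy example. The dictionary below is made up (it only imitates the idea of subtitle lines keyed by string indices), but it shows why sorting the keys as strings jumbles the order and how sorting them as integers fixes it:

```python
# hypothetical subtitle lines keyed by string indices, as in a JSON file
raw_subtitles = {'1': 'first line',
                 '11': 'eleventh line',
                 '2': 'second line',
                 '10': 'tenth line'}

# string sorting compares character by character, so '10' comes before '2'
sorted_as_strings = sorted(raw_subtitles)

# converting each key to an integer for the comparison restores the order
sorted_as_integers = sorted(raw_subtitles, key=int)
ordered_text = [raw_subtitles[key] for key in sorted_as_integers]

print(sorted_as_strings)
print(sorted_as_integers)
print(ordered_text)
```

The function in the next cell takes a slightly different route (it converts the keys to integers first and then sorts), but the effect is the same.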

In [36]:
import json

def grab_subtitle_data(subtitle_json_files, verbose=True):
    """
    Grabbing GoT subtitle data from json files.
    """

    # keep all text segments in this list
    all_text_segments = []

    # iterate over each subtitles file
    for season, subtitles_file in enumerate(subtitle_json_files):

        # open subtitle file
        with open(subtitles_file, 'r') as file:
            data = json.load(file)

        # iterate over episodes in the season
        for episode in data.keys():
            # convert the string indices to integers so the lines sort numerically
            episode_data = {int(key): value for key, value in data[episode].items()}
            episode_data = sorted(episode_data.items())
            episode_text_segments = [text for _, text in episode_data]
            print('Found {0} text segments in Season {1} '
                  'Episode "{2}".'.format(len(episode_text_segments), 
                                          season, 
                                          episode.split('.')[0]))
            if verbose:
                print('First few segments:\n%s' % '\n'.join(episode_text_segments[0:5]))            
            all_text_segments.append(episode_text_segments)
            
    return list(flatten(all_text_segments))

In [37]:
subtitle_data = grab_subtitle_data(subtitle_json_files, verbose=False)

Found 559 text segments in Season 0 Episode "Game Of Thrones S01E01 Winter Is Coming".
Found 571 text segments in Season 0 Episode "Game Of Thrones S01E02 The Kingsroad".
Found 740 text segments in Season 0 Episode "Game Of Thrones S01E03 Lord Snow".
Found 754 text segments in Season 0 Episode "Game Of Thrones S01E04 Cripples, Bastards, And Broken Things".
Found 741 text segments in Season 0 Episode "Game Of Thrones S01E05 The Wolf And The Lion".
Found 583 text segments in Season 0 Episode "Game Of Thrones S01E06 A Golden Crown".
Found 775 text segments in Season 0 Episode "Game Of Thrones S01E07 You Win Or You Die".
Found 666 text segments in Season 0 Episode "Game Of Thrones S01E08 The Pointy End".
Found 679 text segments in Season 0 Episode "Game Of Thrones S01E09 Baelor".
Found 590 text segments in Season 0 Episode "Game Of Thrones S01E10 Fire And Blood".
Found 700 text segments in Season 1 Episode "Game Of Thrones S02E01 The North Remembers".
Found 755 text segments in Season 1 Ep

The final array of subtitle data looks like this:

In [38]:
subtitle_data[0:5]

['Easy, boy.',
 "What do you expect? They're savages.",
 'One lot steals a goat from another lot,',
 "before you know it they're ripping each other to pieces.",
 "I've never seen wildlings do a thing like this."]

And we can summarise the dataset size:

In [39]:
# count lines and words
all_subtitle_text = '\n'.join(subtitle_data)
print('The number of text segments in this corpus: {0}\n'
      'The number of words in this corpus: {1}'.format(len(all_subtitle_text.split('\n')),
                                                       len(all_subtitle_text.split(' '))))

The number of text segments in this corpus: 44844
The number of words in this corpus: 244447


### iv. Combining the book and subtitle datasets

Now we can put the book and subtitle data together:

In [40]:
got_data = book_data+all_subtitle_text

And report on the size:

In [41]:
print('The number of lines in the final corpus: {0}\n'
      'The number of words in the final corpus: {1}'.format(len(got_data.split('\n')),
                                                            len(got_data.split(' '))))

The number of lines in the final corpus: 129361
The number of words in the final corpus: 1969397


That's almost 2 million words to play with, which should help our language model tremendously. 

### v. Preparing the dataset

The process to make the sequence datasets is the same as before. The only difference is that we'll use longer sequences (`window_size` is now 6: five input words plus the word to predict), so we're taking more text into account before making our prediction.
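
As a rough illustration of the windowing, here is a small sketch on a toy list of word ids (a stand-in for the real encoded corpus). Unlike the notebook cell that follows, this sketch keeps only full-length windows, so no padding is needed:

```python
# Toy stand-in for the encoded corpus (the real one holds integer word ids)
toy_corpus = [5, 972, 6, 3796, 12141, 322, 122, 1131]
window_size = 6

# Slide a window of 6 tokens over the corpus, one step at a time;
# only full-length windows are kept in this sketch
windows = [toy_corpus[i:i + window_size]
           for i in range(len(toy_corpus) - window_size + 1)]

# The first 5 tokens of each window are the input, the 6th is the label
toy_X = [window[:-1] for window in windows]
toy_y = [window[-1] for window in windows]

for features, label in zip(toy_X, toy_y):
    print(features, '->', label)
```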

In [None]:
# tokenise the data
tokeniser = Tokenizer(lower=True, split=' ', char_level=False)
tokeniser.fit_on_texts([got_data])
vocabulary_size = len(tokeniser.word_index)+1
print('The vocabulary size for this corpus is: %s' % vocabulary_size)

# encode the corpus using the fitted tokeniser
encoded_corpus = tokeniser.texts_to_sequences([got_data])[0]

# generate sequences
sequences = []
window_size = 6
for i in range(0, len(encoded_corpus)):
    sequences.append(encoded_corpus[i:i+window_size])

# pad the shorter sequences (the last few windows of the corpus) with leading
# zeros so that every sequence has the same length
max_sequence_length = np.max([len(sequence) for sequence in sequences])
sequences = pad_sequences(sequences,
                          maxlen=max_sequence_length,
                          padding='pre')

# separate sequences into input arrays X 
# and the output label vector y
X = np.array([seq[0:window_size-1] for seq in sequences])
y = np.array([seq[window_size-1] for seq in sequences])
y = to_categorical(y, num_classes=vocabulary_size)

Once again, our features look like this:

In [43]:
X[0:5]

array([[    5,   972,     6,  3796, 12141],
       [  972,     6,  3796, 12141,   322],
       [    6,  3796, 12141,   322,   122],
       [ 3796, 12141,   322,   122,  1131],
       [12141,   322,   122,  1131,    62]], dtype=int32)

And our labels look like this:

In [44]:
y[0:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
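
It is worth estimating how large this one-hot label matrix gets. The sketch below is a back-of-the-envelope calculation, assuming roughly 1.97 million sequences (about the corpus word count reported earlier; the true token count may differ) and the 30,350-word vocabulary that appears in the model summary further down:

```python
# Assumed figures: ~1.97 million sequences and a 30,350-word vocabulary
n_sequences = 1_969_397
vocabulary_size = 30_350
bytes_per_float32 = 4  # the one-hot labels above are float32

# Full one-hot matrix: one float per (sequence, vocabulary word) pair
one_hot_bytes = n_sequences * vocabulary_size * bytes_per_float32

# Alternative: keep a single int32 class index per sequence
integer_bytes = n_sequences * 4

print('One-hot labels: about {:.0f} GB'.format(one_hot_bytes / 1e9))
print('Integer labels: about {:.1f} MB'.format(integer_bytes / 1e6))
```

If memory becomes a problem, one option is Keras's `sparse_categorical_crossentropy` loss, which accepts the integer labels directly so that the one-hot encoding step can be skipped.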

One useful extra step: we should **split the dataset into a train and test set**. The main reason for this is that it will help us get a better estimate of the model's true "in the wild" performance, since we can evaluate its performance on data that *wasn't* used in training. 

Evaluating a model on data that was used for training is cheating: the model has already seen that data, so if it has **overfit** (memorised the training examples rather than learned general patterns) it will do unrealistically well when making predictions on it.

We will also shuffle the entries, since otherwise our dataset first contains Book 1, then Book 2, ..., Book 5 then finally the subtitle data, whereas we want the model to learn from each source simultaneously. 

In [84]:
# work with only the first 1000 examples from here on
small_X = X[0:1000]
small_y = y[0:1000]

In [88]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(small_X, small_y, test_size=0.1, shuffle=True)

### vi. Setting up the language model architecture

This time, let's build a slightly larger network:

![Larger RNN language model](big_network.png)

The main differences here are:
+ Our word embeddings are meant to be bigger (100 rather than 50 dimensions), although the code and model summary below use 50-dimensional embeddings
+ We have 2 LSTM layers instead of 1. This should allow the model to learn more complex representations of the text.
+ We have added a dense (fully-connected) layer after the LSTM layers for some additional processing capacity (perhaps allowing for higher-level conceptual representations)



In `Keras` code, we would build the network as follows:

In [132]:
model = Sequential()
model.add(Embedding(vocabulary_size, 50, input_length=max_sequence_length-1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocabulary_size, activation='softmax'))

This is very similar code to before, but we have reason to think that this network will be much more complex and nuanced than the previous one:
+ The dataset we are using is much larger and richer than the toy dataset
+ The network we are training is larger and deeper, and should have more expressive power

We can summarise the **model structure and parameters**:

In [133]:
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 5, 50)             1517500   
_________________________________________________________________
lstm_2 (LSTM)                (None, 5, 100)            60400     
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_3 (Dense)              (None, 30350)             3065350   
Total params: 4,733,750
Trainable params: 4,733,750
Non-trainable params: 0
_________________________________________________________________
None
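
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper function names are illustrative and not part of Keras:

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocabulary_size = 30350
counts = [
    embedding_params(vocabulary_size, 50),   # embedding layer
    lstm_params(50, 100),                    # first LSTM layer
    lstm_params(100, 100),                   # second LSTM layer
    dense_params(100, 100),                  # hidden dense layer
    dense_params(100, vocabulary_size),      # softmax output layer
]
print(counts, sum(counts))
```

Note that the embedding and output layers, whose sizes scale with the vocabulary, account for most of the parameters.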


And compile the finished model and specify some **training settings**:

In [134]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### vii. Training the language model

Then, we can start the training run by passing the training data to the model. This would take a reasonably long time to train - it would be helpful to have access to a **GPU** to run this on (e.g. via Google Colab, AWS/GCP, your own GPU) to make use of computation **parallelisation** and drastically reduce training time.

In [135]:
# RUN THIS CELL TO TRAIN THE MODEL YOURSELF
model.fit(X_train, y_train, batch_size=64, epochs=2, verbose=1)
model.save("trained_GoT_language_model.h5")

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/2
Epoch 2/2


For now, to save time, I will just **load a model** that I already trained. 

For reference, this was just trained overnight on my laptop so there's no special supercomputer involved. The model was still improving quite rapidly at that point, so we would see even better performance if the model were given enough time to reach **convergence** ("finish" learning, or at least hit serious diminishing returns).

In [48]:
from keras.models import load_model
loaded_model = load_model('trained_GoT_language_model.h5')

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


### viii. Exploring our Game of Thrones language model

From the training log, I saw that the model's training loss after epoch 50 was around 4.2, and the accuracy around 0.23. 
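
To put the loss figure in context: categorical cross-entropy is the average negative natural-log probability the model assigns to the correct next word. A rough sketch of two reference points, assuming the 30,350-word vocabulary from the model summary:

```python
import math

training_loss = 4.2      # approximate value quoted above
vocabulary_size = 30350  # from the model summary

# Loss of a model that spreads probability evenly over the whole vocabulary
uniform_loss = math.log(vocabulary_size)

# Perplexity: the effective number of equally likely next words
perplexity = math.exp(training_loss)

print('Uniform-guess loss: {:.2f}'.format(uniform_loss))
print('Perplexity at loss 4.2: {:.1f}'.format(perplexity))
```

So a loss of about 4.2 is well below the uniform-guess baseline, and corresponds to the model being, on average, about as uncertain as if it were choosing between several dozen equally likely words.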

We can summarise the model's performance on the training and test set as follows:

In [89]:
loaded_model.predict_classes(X_test)

array([138,   5,  42,  34,   1,  56,   2,   6,   3,  34,  34,   4,  56,
        11,   3,   8,   1,   1, 129,   6,  53,   2,   1,  11,  53,   8,
         3,   1,   2,   4,  53,   1,   1, 158,   1,   1,   1,   2,   4,
         6, 149,   6, 132,  40,   8,  53,   1,   2,   2,   2,   5,   5,
         6,   2,   6,   2,  56,   5,   1,  56,   6,  53,   2,  46,   1,
        11,   4,   5,   1,   3,  29,   5,   4,  29,   4,  26,   2,   1,
        69,   1,  11,   1,  73,   2,   5,  16,   4,   1,  56,   5,   4,
         6,   2,  17,   4,   4,   7,   1,   2,   1])

In [52]:
from sklearn.metrics import accuracy_score

In [62]:
to_categorical(loaded_model.predict_classes(X_train))

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [138]:
train_predictions = loaded_model.predict_classes(X_train)
test_predictions = loaded_model.predict_classes(X_test)

In [139]:
print('Overall training accuracy: {0}\n'
      'Overall test accuracy: {1}'.format(accuracy_score(np.argmax(y_train, axis=1), train_predictions),
                                          accuracy_score(np.argmax(y_test, axis=1), test_predictions)))

Overall training accuracy: 0.16666666666666666
Overall test accuracy: 0.1
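
A note for readers on newer library versions: more recent Keras/TensorFlow releases no longer provide `predict_classes` on `Sequential` models. The same class indices can be obtained by taking the argmax of the predicted probabilities, i.e. `np.argmax(model.predict(X), axis=1)`. The sketch below shows the idea on made-up probabilities and labels rather than on the real model:

```python
import numpy as np

# Made-up softmax outputs for 4 examples over a 3-word vocabulary
probabilities = np.array([[0.1, 0.7, 0.2],
                          [0.6, 0.3, 0.1],
                          [0.2, 0.2, 0.6],
                          [0.5, 0.4, 0.1]])

# Made-up one-hot labels for the same 4 examples
one_hot_labels = np.array([[0, 1, 0],
                           [1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])

# Equivalent of predict_classes: index of the most probable word
predicted_classes = np.argmax(probabilities, axis=1)
true_classes = np.argmax(one_hot_labels, axis=1)

# Accuracy is the fraction of examples where the two indices agree
accuracy = np.mean(predicted_classes == true_classes)
print(predicted_classes, accuracy)
```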


What does this performance mean in practical terms? We can examine some of the predictions on the test set:

In [143]:
test_seed_sequences = tokeniser.sequences_to_texts(X_test[0:50])

In [156]:
actual_next_words = tokeniser.sequences_to_texts([np.argmax(y_test, axis=1)[0:50]])[0].split(' ')

In [157]:
prediction_index = loaded_model.predict_classes(X_test[0:50])
prediction_vector = tokeniser.sequences_to_texts([prediction_index])
predictions = prediction_vector[0].split(' ')

In [158]:
actual_next_words

['bowels',
 'sent',
 'fire',
 'royce',
 'me',
 'prepared',
 'will',
 'had',
 'he',
 'glanced',
 'royce',
 '”',
 'shared',
 'must',
 'all',
 'this',
 'nine',
 'close',
 'could',
 'leave',
 'man',
 'maybe',
 'fifty',
 'wished',
 'ride',
 'they',
 'and',
 'will',
 'gared',
 'you',
 'handsome',
 'snow’s',
 'no',
 'mallisters’',
 'a',
 '“did',
 'boy',
 'on',
 'riding',
 'years',
 'will',
 'is',
 'tit',
 'put',
 'you',
 'few',
 'bored',
 'black',
 '“we',
 'he']

In [159]:
df = pd.DataFrame(list(zip(test_seed_sequences,
                           actual_next_words,
                           predictions)),
                  columns=['Seed Sequence', 'Actual Next Word', 'Predicted Next Word'])
df

| | Seed Sequence | Actual Next Word | Predicted Next Word |
|---|---|---|---|
| 0 | come rushing back and his | bowels | own |
| 1 | first time he had been | sent | a |
| 2 | still make it out no | fire | one |
| 3 | is falling ” ser waymar | royce | said |
| 4 | dead that’s proof enough for | me | the |
| 5 | could say he had not | prepared | been |
| 6 | that under the wounded pride | will | and |
| 7 | southron called the haunted forest | had | of |
| 8 | half bored half distracted way | he | to |
| 9 | falling ” ser waymar royce | glanced | said |


So, it looks like even where the model doesn't get the prediction correct, its prediction does at least seem plausible. 

With something as flexible as language, perhaps relatively low accuracies are not the end of the world. That said, there is plenty that can be done to improve this model (see the final section of this notebook).

### ix. Grand Finale: Gather Round for a New Tale...

Finally, let's have the language model write some Game of Thrones text for us (since GRR Martin certainly isn't going to!). 

We can use the same function as before to continuously feed in a seed sequence to the model, generate one word, and then append the generated word to the seed sequence. In this way, the model uses its own previous output as input to itself in the future. 
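
The feedback loop described above can be sketched in a few lines. This is not the notebook's `write_text_sequence` function: the `generate` helper and the toy lookup-table predictor are hypothetical stand-ins, used so that the example runs without a trained network:

```python
def generate(seed_ids, n_words, predict_next, context_length):
    """Greedy generation: repeatedly predict one token and append it."""
    generated = list(seed_ids)
    for _ in range(n_words):
        # Only the most recent tokens fit into the model's input window
        context = generated[-context_length:]
        generated.append(predict_next(context))
    return generated

# Hypothetical stand-in for the trained network: a lookup table that maps
# the last token to the next one (a real model would use the full context)
toy_transitions = {1: 2, 2: 3, 3: 1}

def toy_predict_next(context):
    return toy_transitions[context[-1]]

print(generate([1, 2], 5, toy_predict_next, context_length=5))
```

Because each prediction is appended and then fed back in, a deterministic predictor like this toy one quickly falls into a repeating cycle.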

In [160]:
write_text_sequence("The start of the story", 50,
                    loaded_model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "The start of the story"
Output:
The start of the story ” he said and the king was a man and the king ” he said and the king was a man and the king ” he said and the king was a man and the king ” he said and the king was a man and the king ” he


## 4. Build the language model network

We can define the model as follows:

In [34]:
from keras.callbacks import ModelCheckpoint

In [4]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(vocabulary_size, 50, input_length=max_sequence_length-1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocabulary_size, activation='softmax'))

Using TensorFlow backend.


In [44]:
# checkpoint
filepath="weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

In [45]:
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 5, 50)             1517500   
_________________________________________________________________
lstm_7 (LSTM)                (None, 5, 100)            60400     
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_7 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_8 (Dense)              (None, 30350)             3065350   
Total params: 4,733,750
Trainable params: 4,733,750
Non-trainable params: 0
_________________________________________________________________
None


In [46]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Now we can start training the network. This one will take significantly longer to train because the dataset is much larger and the network is larger and deeper.

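In [3]:
# fit network, holding out 10% of the training data for validation so that
# the checkpoint callback has a val_accuracy value to monitor
model.fit(X_train, y_train, validation_split=0.1, batch_size=64, epochs=1,
          verbose=1, callbacks=callbacks_list)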

In [1]:
model.save("trained_GoT_language_model.h5")


# III. Suggested Extensions

Here are some suggestions for extending this work in order to build a more serious Game of Thrones language model:

1. **Data**: Spend more time cleaning up the text corpus; there is definitely some weird stuff in there (e.g. I saw markup tags in the subtitle data).
2. **Data**: Perhaps think about grabbing more data, maybe by scraping some of the fan Wikis.
3. **Representation**: Use pre-trained word embeddings (e.g. FastText, GloVe, Word2Vec) and possibly update them during training.
4. **Representation**: Think about using sub-word tokenisation rather than word-based tokenisation.
5. **Modelling**: Probably the most important thing - train for longer, until convergence :) Ideally, monitor for overfitting using a validation set. 
6. **Modelling**: Look into using regularisation techniques (dropout, weight penalties) to improve model performance and generalisability.
7. **Modelling**: Experiment with different numbers of layers, sizes, activation functions, initialisation approaches, etc.
8. **Modelling**: Optimise some of the hyperparameters in the model (learning rate, momentum, batch sizes).
9. **Modelling**: Wildcard idea - forget RNNs for language modelling completely and jump on the Transformer hype train ([choo](https://paperswithcode.com/task/language-modelling) [choo!](https://arxiv.org/abs/1904.09408)). 
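
As a possible starting point for the first suggestion, here is a minimal cleaning sketch. The exact markup in the subtitle files has not been inspected here, so the patterns below (HTML-style tags and music note characters) and the `clean_subtitle_line` helper are assumptions to adapt:

```python
import re

def clean_subtitle_line(line):
    """Remove markup tags and music note symbols, then tidy whitespace."""
    line = re.sub(r'<[^>]+>', '', line)   # markup tags such as <i>...</i>
    line = re.sub(r'[♪♫]', '', line)      # music note symbols
    return re.sub(r'\s+', ' ', line).strip()

# Made-up examples of messy subtitle lines
examples = ['<i>Winter is coming.</i>', '♪ The rains weep ♪', 'Easy,  boy.']
cleaned = [clean_subtitle_line(line) for line in examples]
print(cleaned)
```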


