# Building a Language Model for Game of Thrones

*This notebook is part of the tutorial "Modelling Sequences with Deep Learning" presented at the ODSC London Conference in November 2019.*

In this notebook, we will build a neural network model that can understand Game of Thrones language and concepts and even write its own passages. The architecture we will use is a **recurrent neural network (RNN)** with **LSTM cells** to boost the model's ability to remember longer-term information within the text. 

The framework we will use to build the models is `Keras`. Keras is a high-level neural networks API - it acts as a user-friendly layer on top of lower-level frameworks (TensorFlow, Theano, or CNTK), and allows you to build neural networks in an intuitive, layer-by-layer way. 

<img src="books.jpg" alt="Picture of Game of Thrones books" width="600"/> 

## Introduction to language models

Learning a **language model (LM)** is a classic modelling task in the field of **natural language processing (NLP)**. Since LMs learn to understand the structure and content of a text corpus, they are invaluable in applications where the quality or originality of text segments is being assessed. 

Language models are often used **generatively**, as in smartphone keyboard apps, to predict future text based on a **seed sequence**. 

For example, which word should follow the sequence "the cat is on the"? Good guesses are words like "mat", "bed", "sofa", and we would hope that our model would learn to assign high probabilities to these semantically relevant terms. We would hope that words like "the", "hi", and "banana" would be assigned low probabilities. 

## How are language models trained?

All you need to train a language model is a text corpus - **no annotation or labelling of the data is required**. However, language modelling is treated as a **supervised classification task**. The idea is that we extract training data by **sliding a window over the corpus**, and generating input-output pairs that way. More exactly:

![Building a dataset for training a language model](lm_data.png)

So here, we are sliding a window of some size over the corpus in order to generate sequences of words (here, sequences of 3 words each). Then:
+ The initial 2 words in each sequence are our **input** (or **features** or **X values**)
+ The final word in each sequence is our **output** (or **label** or **y values**)

The model is then trained to use the input words (the **context**) to predict the final word. 
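
The windowing step can be sketched in a few lines of plain Python. This is only an illustration: the example sentence and the helper name `make_training_pairs` are made up here, and it works on raw words rather than on the integer indices we use later. Unlike the loop later in this notebook, it simply drops incomplete windows at the end of the text instead of padding them.

```python
def make_training_pairs(words, context_size=2):
    """Slide a window over `words` and return (context, next_word) pairs."""
    pairs = []
    # stop early enough that every window has a full context plus a target
    for i in range(len(words) - context_size):
        context = words[i:i + context_size]
        target = words[i + context_size]
        pairs.append((context, target))
    return pairs


corpus_words = "the cat is on the mat".split()
pairs = make_training_pairs(corpus_words)
for context, target in pairs:
    print(context, '->', target)
```

Each printed line is one training example: the words on the left are the features, the word on the right is the label.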

## Considerations when building the dataset
There are a few decisions you have to make with how you will build this dataset. For example:
+ Are you going to treat the text as a sequence at the **word level** or the **character level**? 

    + **The arguments for using words are**: there is a lot of information in words since that's how we structure language. And the length of sequences the model has to deal with and remember will be much shorter, leading to greater coherence. 
    + **The arguments for using characters are**: the size of the input space is much more manageable (there are fewer characters than words), and you gain the ability to handle unknown words and generate new words.
    + **You could also work at the sub-word level**: this is a bit of a happy medium - words are broken down into their components. 
    
+ Are you going to **scrub the text squeaky clean** or do you want the model to learn to deal with **noise**, perhaps at a cost of a hit to performance?
+ What sort of **window size** should you be using?
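
To make the word-versus-character trade-off concrete, here is a small sketch that splits the same phrase both ways. The phrase is just an example I picked, and the splitting is deliberately naive (no punctuation handling).

```python
text = "winter is coming"

# word level: few tokens, but the vocabulary grows with every new word
word_tokens = text.split(' ')

# character level: many more tokens, but a small, closed vocabulary
char_tokens = list(text)

print(len(word_tokens), 'word tokens:', word_tokens)
print(len(char_tokens), 'character tokens:', char_tokens)
print('distinct characters:', sorted(set(char_tokens)))
```

The character-level sequence is several times longer for the same text, which is why a character model has to remember much further back to stay coherent.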

# I. Building a Toy Language Model First

Before launching straight into the Game of Thrones language modelling problem, let's work with a smaller corpus first and understand all of the steps involved. This way, you can more easily understand and track how all of the input, intermediate steps, and output are behaving. 

Let's use the following poem from Lord of the Rings as our entire corpus:

In [None]:
tiny_corpus = ['All that is gold does not glitter',
               'Not all those who wander are lost;',
               'The old that is strong does not wither,',
               'Deep roots are not reached by the frost.',
               'From the ashes, a fire shall be woken,',
               'A light from the shadows shall spring;',
               'Renewed shall be blade that was broken,',
               'The crownless again shall be king']

### i. Preparing the dataset

To get started, the first thing we need to do is **tokenisation** - break the text up into individual units or **tokens**. 

We can use the text tokeniser from the `Keras` library for this, and specify that we want to treat all text as lowercase, generate tokens by splitting on a space character, and view text at the word level. 

In [None]:
from keras.preprocessing.text import Tokenizer

tokeniser = Tokenizer(lower=True, split=' ', char_level=False)
tiny_corpus = ' '.join(tiny_corpus)
tokeniser.fit_on_texts([tiny_corpus])

The tokeniser identifies tokens in the corpus and assigns an index to each word in the vocabulary. We can check which index corresponds to which word like this:

In [None]:
tokeniser.word_index
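
As far as I know, the Keras tokeniser numbers words by descending frequency, starting at 1 (index 0 is kept free for padding). The sketch below imitates that idea with the standard library on the first three lines of the poem. It is a simplification, not the Keras implementation: `toy_word_index` is a name I made up, and the punctuation handling is cruder than the real thing.

```python
from collections import Counter
import string

poem = ('All that is gold does not glitter '
        'Not all those who wander are lost; '
        'The old that is strong does not wither,')

# lowercase, strip punctuation, and split on spaces
cleaned = poem.lower().translate(str.maketrans('', '', string.punctuation))
words = cleaned.split(' ')

# the most frequent word gets index 1; index 0 is never handed out
counts = Counter(words)
toy_word_index = {word: i
                  for i, (word, _) in enumerate(counts.most_common(), start=1)}
print(toy_word_index)
```

Here "not" is the most frequent word, so it ends up with index 1.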

Now, we can use this tokeniser to convert (**encode**) our original corpus to a sequence of indices corresponding to words:

In [None]:
encoded_corpus = tokeniser.texts_to_sequences([tiny_corpus])[0]
encoded_corpus[0:7]

We can always get back to the words by reversing this process:

In [None]:
tokeniser.sequences_to_texts([encoded_corpus[0:7]])

Now we can build a dataset of sequences that we will use for training and evaluating our language model. 

Let's use a window size of 3 and slide this over the integer-encoded corpus to build our dataset: a **list of lists of length 3**.

In [None]:
sequences = []
window_size = 3
for i in range(0, len(encoded_corpus)):
    sequences.append(encoded_corpus[i:i+window_size])

sequences

You'll notice that at the end there we have sequences that are not length 3, since we run out of text. We can quickly **pad the sequences with zeroes** to keep the data size consistent: 

In [None]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

max_sequence_length = np.max([len(sequence) for sequence in sequences])
sequences = pad_sequences(sequences, 
                          maxlen=max_sequence_length, 
                          padding='pre')
sequences

That looks better. 
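
If you are wondering what `padding='pre'` actually did, the idea is simply to put zeroes in front of any sequence that is too short. A minimal sketch of that idea follows; `pad_pre` is a hypothetical helper written for this illustration, and unlike the Keras function it does not truncate sequences that are too long.

```python
def pad_pre(sequence, length, value=0):
    """Left-pad `sequence` with `value` until it has `length` items."""
    return [value] * (length - len(sequence)) + list(sequence)


toy_sequences = [[5, 2, 7], [2, 7], [7]]
padded = [pad_pre(seq, 3) for seq in toy_sequences]
print(padded)
```

Zero is a safe filler because the tokeniser never assigns index 0 to a real word.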

Finally, let's break the sequences down into our input data (X; our matrix of features) and our output data (y; our vector of labels):

In [None]:
X = np.array([x[0:2] for x in sequences])
y = np.array([x[2] for x in sequences])

So for example our input features for the first 5 data points are:

In [None]:
X[0:5]

And their corresponding labels are: 

In [None]:
y[0:5]

The final thing we need to do is reformat our label vector y into a **one-hot vector format**. The word index numbers are not actually meaningful (no ordinal relationship) but are discrete classes. We also want to calculate probabilities of words, where a probability of 1 for the correct word is the optimal prediction. 

We can convert the label vector y to a matrix of one-hot vectors using Keras' `to_categorical` function:

In [None]:
from keras.utils import to_categorical

vocabulary_size = len(tokeniser.word_index)+1
y = to_categorical(y, num_classes=vocabulary_size)
y[0:5]
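
The one-hot idea itself is simple enough to write by hand. The sketch below uses made-up labels and a made-up vocabulary of 5 classes; `one_hot` is a hypothetical helper, not part of Keras.

```python
def one_hot(index, num_classes):
    """Return a list of zeroes with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


toy_labels = [2, 0, 3]
toy_one_hot = [one_hot(label, num_classes=5) for label in toy_labels]
for label, row in zip(toy_labels, toy_one_hot):
    print(label, '->', row)
```

Each row can be read as the "perfect" probability distribution for that example: all of the probability mass sits on the correct word.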

To summarise, we have gone from a raw dataset of:

In [None]:
tiny_corpus

To a formatted dataset ready to be input to a learning algorithm:

In [None]:
print('Example features: ', *X[0:5], sep='\n')
print('Example labels: ', *y[0:5], sep='\n')

### ii. Setting up the language model architecture

Now that we have the dataset sorted out, it's time to think about how we want to approach the modelling problem.

Let's build this small recurrent neural network with LSTM units:

![tiny_network](small_network.png)

To explain this network:
+ Our **input layer** represents input into the network. The size of the input layer is the size of the vocabulary of our corpus (+1, because the tokeniser starts numbering words at 1 and index 0 is reserved for padding).
+ We then have an **embedding layer** immediately after the input layer, which will learn **word embeddings** for us (continuous representations of the discrete words in our vocabulary; see my explanatory blog post on embeddings [here](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2)). Conceptually, an embedding layer first changes your integer-encoded input to a one-hot vector format and then passes it through a fully-connected layer (with some regularisation and constraints), where the learned weight matrix functions as our word embeddings.
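
The "one-hot vector followed by a fully-connected layer" view of an embedding can be checked numerically: multiplying a one-hot vector by a weight matrix just picks out one row of that matrix. The sketch below uses a small random matrix as a stand-in for learned weights (the sizes and names are assumptions for illustration only).

```python
import numpy as np

toy_vocab_size, toy_embedding_dim = 5, 3
rng = np.random.default_rng(0)
# stand-in for the learned embedding weights: one row per word index
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

word_idx = 2
one_hot_vector = np.zeros(toy_vocab_size)
one_hot_vector[word_idx] = 1.0

# multiplying by a one-hot vector selects a single row of the matrix
via_matmul = one_hot_vector @ embedding_matrix
via_lookup = embedding_matrix[word_idx]
print(np.allclose(via_matmul, via_lookup))
```

This is why an embedding layer can be implemented as a cheap table lookup rather than a real matrix multiplication.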


In `Keras` code, we would build this network like this:

In [None]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense

model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=10, input_length=max_sequence_length-1))
model.add(LSTM(units=50))
model.add(Dense(units=vocabulary_size, activation='softmax'))

An explanation of this code block:
+ Keras allows the sequential layer-by-layer building of neural network models using its `Sequential` API.
+ The input layer is implied; we don't need to build it explicitly.
+ The first layer we add is our `Embedding` layer. The input dimensionality is our vocabulary size (the size of our input layer), and let's give this embedding layer a small size of 10 neurons. This means each word will get represented as a real-valued vector of length 10. We state that the length of inputs the network should expect is 2. 
+ Next, we add the workhorse of the network - our layer of `LSTM` neurons. Let's make the layer have 50 of these neurons (which is not a lot). We leave all other options to the default (activation functions, initialisation, etc.)
+ Finally, as our output layer, we add a `Dense` fully-connected layer and softmax it. This means that the output of the network will be a vector of probabilities (summing to 1) spread across all the words of our vocabulary (see example below). 
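
To see what "softmax it" means in practice, here is a small sketch of the softmax function applied to some made-up raw scores for a 3-word vocabulary. It is written with the standard library for readability; Keras does the equivalent internally.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # subtracting the largest score avoids overflow and leaves the result unchanged
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [2.0, 1.0, 0.1]   # made-up network outputs
probabilities = softmax(raw_scores)
print(probabilities, sum(probabilities))
```

The largest raw score always ends up with the largest probability, and the probabilities always add up to 1.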

We can examine our model so far using Keras' `model.summary()` function:

In [None]:
print(model.summary())

This summarises the number of parameters in our model and where they are.
+ 390 embedding parameters from $39*10$ (vocabulary size times embedding size)
+ 12200 LSTM parameters from $4*(10*50 + 50*50 + 50)$
+ 1989 output layer parameters from $50*39 + 39$ (weights plus biases)
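
The same arithmetic, spelled out in code so you can change the layer sizes and see how the counts move. The variable names are mine; the numbers are the ones from the toy model above.

```python
toy_vocab = 39       # 38 distinct words in the poem + 1 for index 0
embedding_dim = 10
lstm_units = 50

# embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = toy_vocab * embedding_dim

# LSTM: 4 sets of weights (three gates plus the candidate cell state),
# each with input weights, recurrent weights and a bias
lstm_params = 4 * (embedding_dim * lstm_units
                   + lstm_units * lstm_units
                   + lstm_units)

# dense output: one weight per (LSTM unit, word) pair plus one bias per word
dense_params = lstm_units * toy_vocab + toy_vocab

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters of this toy network live in the LSTM layer.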

Now that we have defined the network, we need to do a `model.compile()` to signify that we have finished building the network and want to define how training should proceed. Specifically, we need to provide:
+ Which loss function we want to use (i.e. what is the goal the model is optimising for as it trains, or what signal is it following in order to improve)
+ Which optimiser we want to use to do our gradient updates (Adam, Adagrad, RMSProp, Nesterov momentum, etc.)
+ Any metrics we want to calculate and output during training in order to keep track of progress. Let's keep track of accuracy, which is just the percentage of predictions that the model gets right. 

We can just use sensible defaults for now. Since our task is a multiclass classification task, a sensible loss function to use is **categorical cross-entropy**. 

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I'll avoid dumping equations on you and just say that:
+ The model's categorical cross-entropy loss will be **low** when the network generally predicts the next words correctly. This means it tends to assign higher probability to the correct word.
+ A training loss of zero means that the network always assigns a probability of 1 to the correct word and 0 to all other words - its predictions are perfect (in the training set).
+ The model's categorical cross-entropy loss will be **high** when the network generally doesn't predict the next words well. This means it tends to incorrectly assign high probabilities to incorrect words.

During the training process, the model optimises its internal parameters such that training loss is minimised (this happens through backpropagation and gradient descent).
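
If you'd like to see the "low when right, high when wrong" behaviour in numbers, here is a minimal sketch of the loss for a single example. The probabilities are made up, and `categorical_cross_entropy` is a hand-written helper for illustration; to my knowledge Keras computes the same per-example quantity and then averages it over the batch.

```python
import math


def categorical_cross_entropy(true_one_hot, predicted_probs):
    """Loss for one example: -log of the probability given to the true word."""
    return -sum(t * math.log(p)
                for t, p in zip(true_one_hot, predicted_probs)
                if t > 0)


true_word = [0.0, 1.0, 0.0]            # the correct next word is index 1

confident_right = [0.05, 0.90, 0.05]   # most probability on the correct word
confident_wrong = [0.90, 0.05, 0.05]   # most probability on a wrong word

loss_right = categorical_cross_entropy(true_word, confident_right)
loss_wrong = categorical_cross_entropy(true_word, confident_wrong)
print('loss when right:', round(loss_right, 3))
print('loss when wrong:', round(loss_wrong, 3))
```

Only the probability assigned to the correct word matters: the closer it is to 1, the closer the loss is to 0.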

### iii. Training the language model

Now that the network is compiled, we can begin training it for some time (for some number of **epochs** - which is the number of times the network sees your training data). 

Hopefully, as model training proceeds, we will see that the training loss steadily decreases and the accuracy increases: 

In [None]:
# train for 100 epochs; verbose=2 prints the loss and accuracy once per epoch
model.fit(X, y, epochs=100, verbose=2)

That's the model trained! Training is very fast because our dataset is tiny and the network is small. The accuracy doesn't look that bad either (though of course the model is likely to be **overfitted**; see later section). 

There are a few different things you can do with a trained Keras sequence model. You can see all the options by typing `model.` followed by a `tab` in a cell:

In [None]:
model.get_weights()[0][0]

In [None]:
model.layers

### iv. Using the trained model to make predictions

Probably the most interesting thing to do now is use the trained model to make new predictions. For this, we can use the `model.predict_classes()` method. 

We hope that the model will predict the next word given a seed sequence well, i.e. that it learned about word structure from our poem corpus. For instance, given the seed sequence "shall be", we hope the model predicts the correct, observed next words like "king", "woken", and "blade".

However, we can't just run `model.predict_classes()` on raw text data like "shall be", since the text data has to first be tokenised, assigned to an integer index, and reshaped into the correct array dimensions:


In [None]:
seed_sequence = 'shall be'
seed_sequence_encoded = tokeniser.texts_to_sequences([seed_sequence])[0]
print('Encoded seed sequence: %s' % seed_sequence_encoded)
seed_sequence_encoded = np.array(seed_sequence_encoded).reshape(-1,2)
print('Formatted encoded seed sequence: %s' % seed_sequence_encoded)

Now we can use the trained model to make a prediction for the next word:

In [None]:
prediction_index = model.predict_classes(seed_sequence_encoded)
print('Prediction for the next word index: %s' % prediction_index)
print('This index corresponds to word: %s' % tokeniser.sequences_to_texts([prediction_index]))

Great, that looks like a decent prediction for the next word!
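
A side note on `predict_classes()`: for a softmax output it simply returns the index of the largest probability in each row. To my knowledge the method has been removed from more recent TensorFlow releases of Keras (2.6 onwards), so on a newer setup you can get the same result from `model.predict()` followed by `np.argmax`. In the sketch below, the probabilities are made up to stand in for the output of `model.predict(seed_sequence_encoded)`.

```python
import numpy as np

# made-up stand-in for model.predict(...): one row per input sequence
fake_probabilities = np.array([[0.05, 0.10, 0.70, 0.15]])

# index of the most probable class in each row
predicted_indices = np.argmax(fake_probabilities, axis=-1)
print(predicted_indices)
```

With a real model, `np.argmax(model.predict(seed_sequence_encoded), axis=-1)` would play the role of `model.predict_classes(seed_sequence_encoded)`.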

Rather than just have the 1 best prediction, it would be interesting to see the probabilities assigned to each possible next word. With a bit of manoeuvring we can get these scores out: 

In [None]:
import pandas as pd

class_indices = list(range(0, vocabulary_size))  # one index per output class; vocabulary_size already includes index 0

df = pd.DataFrame(list(zip(class_indices, 
                      [tokeniser.sequences_to_texts([[index]])[0] for index in class_indices],
                       model.predict(seed_sequence_encoded)[0],
                       np.round(model.predict(seed_sequence_encoded)[0],5))),
                  columns=['index', 'word', 'probability', 'rounded_probability'])

df.sort_values('probability', ascending=False).head(10)

Cool, it looks like the network does indeed assign the highest probabilities to the 3 words that actually occur in the corpus! It's fun to see that such a small network can produce sensible results on such a small dataset. 

Let's try another example with a different seed sequence:

In [None]:
seed_sequence = 'does not'
seed_sequence_encoded = tokeniser.texts_to_sequences([seed_sequence])[0]
print('Encoded seed sequence: %s' % seed_sequence_encoded)
seed_sequence_encoded = np.array(seed_sequence_encoded).reshape(-1,2)
print('Formatted encoded seed sequence: %s' % seed_sequence_encoded)
df = pd.DataFrame(list(zip(class_indices, 
                      [tokeniser.sequences_to_texts([[index]])[0] for index in class_indices],
                       model.predict(seed_sequence_encoded)[0],
                       np.round(model.predict(seed_sequence_encoded)[0],5))),
                  columns=['index', 'word', 'probability', 'rounded_probability'])
df.sort_values('probability', ascending=False).head(10)

Great, that also looks correct.

Rather than predicting just the next word, it would be nice to let the network write continuous text for us, given some seed sequence as a starting point. Let's package up the above code into a function that lets us do this:

In [43]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        prediction_index = model.predict_classes(encoded_sequence)
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Most likely next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
    
#     return sequence
        

In [None]:
write_text_sequence("all that", 5,
                    model, tokeniser, 
                    max_sequence_length-1)

Cool, let's write some more text, but let's turn off the verbosity of the function so we just get the final result:

In [None]:
write_text_sequence("the light", 5,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

That's kind of artsy.

And again, writing a longer passage this time:

In [None]:
tiny_corpus

In [None]:
write_text_sequence("ashes are", 10,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Our tiny model only knows the few words in the poem so this is a bit gibberish :) But it's still interesting to see.

This is pretty much all there is to a basic language model. Now, let's tackle a real corpus (Game of Thrones) and build a bigger, more powerful model!

# II. Building a language model for Game of Thrones text

The technical approach we'll take to building a GoT language model is pretty similar, with the major difference being the dataset. We are going to need access to a lot of GoT text - preferably, both the books and the subtitles from the HBO show. 

### i. Identifying some datasets

Interestingly, there seems to already be a rich ecosystem of technical work surrounding GoT content. 

Check out projects like:
+ The [Network of Thrones](https://networkofthrones.wordpress.com/) blog for network analyses of characters (e.g. which character is the most 'central' to the story?)
+ An [API of Ice and Fire](https://anapioficeandfire.com) for grabbing various structured data about the universe
+ And [this Reddit post](https://www.reddit.com/r/datasets/comments/769nhw/game_of_thrones_datasets_xpost_from_rfreefolk/) for a list of various datasets compiled about GoT.

Maybe it's just me, but even despite these resources, I still couldn't actually find the raw text from the books and TV show. 

I did eventually come across 2 Kaggle datasets that contained exactly what I wanted:
1. [Plain text files of all the books](https://www.kaggle.com/muhammedfathi/game-of-thrones-book-files/download) 
2. [Subtitle data for the episodes](https://filmora.wondershare.com/video-editing-tips/game-of-thrones-subtitles.html)
    
A bit of initial manual + regex clean up later, and you get the files included in this repo. 

### ii. Grabbing all text data from the Game of Thrones books

So, we've got a few books in our current directory in .txt format:

In [1]:
import glob
book_txt_files = sorted(glob.glob('*.txt'))
print('Found these .txt files in the current directory:', *book_txt_files, sep='\n')

Found these .txt files in the current directory:
Book_1_A_Game_of_Thrones.txt
Book_2_A_Clash_of_Kings.txt
Book_3_A_Storm_of_Swords.txt
Book_4_A_Feast_for_Crows.txt
Book_5_A_Dance_with_Dragons.txt


We can write a function to extract all of the text in these files, glue it together, and flatten the resulting list of lists into a single mega GoT list of text:

In [2]:
from iteration_utilities import flatten

def grab_book_data(txt_files):
    """
    Grab text data from a set of text files.
    """

    # keep all text segments in this list
    all_text_segments = []   
    
    # iterate over each book file
    for txt_file in txt_files:
    
        print('Extracting text from file "%s"...' % txt_file)
        # open file
        with open(txt_file, 'r') as file:
            data = file.read()
            print('Found {0} lines of text in this book.'.format(len(data.split('\n'))))
            print('First few lines:\n %s\n' % ' '.join(data.split('\n')[0:5]))  
            all_text_segments.append(data)
            
    return ''.join(list(flatten(all_text_segments)))

And use it to put all the book text data in one place:

In [3]:
book_data = grab_book_data(book_txt_files)

Extracting text from file "Book_1_A_Game_of_Thrones.txt"...
Found 14002 lines of text in this book.
First few lines:
 A GAME OF THRONES  PROLOGUE  “We should start back,” Gared urged as the woods began to grow dark around them.

Extracting text from file "Book_2_A_Clash_of_Kings.txt"...
Found 15765 lines of text in this book.
First few lines:
 A CLASH OF KINGS  PROLOGUE  The comet’s tail spread across the dawn, a red slash that bled above the crags of Dragonstone like a wound in the pink and purple sky.

Extracting text from file "Book_3_A_Storm_of_Swords.txt"...
Found 19641 lines of text in this book.
First few lines:
 A STORM OF SWORDS  PROLOGUE  The day was grey and bitter cold, and the dogs would not take the scent.

Extracting text from file "Book_4_A_Feast_for_Crows.txt"...
Found 16225 lines of text in this book.
First few lines:
 A FEAST FOR CROWS  PROLOGUE  Dragons,” said Mollander. He snatched a withered apple off the ground and tossed it hand to hand.

Extracting text from fi

Let's quickly summarise the amount of data we're working with:

In [4]:
# count lines and words
print('The number of lines in this corpus: {0}\n'
      'The number of words in this corpus: {1}'.format(len(book_data.split('\n')),
                                                       len(book_data.split(' '))))

The number of lines in this corpus: 84518
The number of words in this corpus: 1724951


### iii. Grabbing all text data from the Game of Thrones show

The subtitle data is a bit more complicated to grab because it's in JSON file format, and also frankly the text is a bit messy - there are markup tags, music note symbols, and various other odd non-textual things. 

We have the following `.json` subtitle files in our current directory:

In [5]:
subtitle_json_files = sorted(glob.glob("*.json"))
print('Found these .json files in the current directory:', *subtitle_json_files, sep='\n')

Found these .json files in the current directory:
Season_1_Subtitles.json
Season_2_Subtitles.json
Season_3_Subtitles.json
Season_4_Subtitles.json
Season_5_Subtitles.json
Season_6_Subtitles.json
Season_7_Subtitles.json


We will need to write a function to get the data out. The function below will:
+ **Iterate** over a given list of json subtitle files, **open** each file and **parse** the json
+ **Sort** the subtitles by index. At the moment, the indices are sorted as strings (so, e.g. '1' is followed by '11') so we need to convert the indices to integers and sort them numerically. This is important to get right because otherwise the subtitles are jumbled out of order! 
+ And finally we **extract** the subtitle text and **append** to a master list (which we reformat by flattening) 
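
The string-versus-integer sorting problem is easy to reproduce on a tiny made-up example. The dictionary below only imitates the shape of one episode's data (line number as a string key, subtitle text as the value); the contents are invented apart from the first line.

```python
# made-up subtitle snippet, keyed by line number stored as a string
toy_episode = {'1': 'Easy, boy.', '11': 'third line', '2': 'second line'}

# sorting the raw keys compares them as text, so '11' lands before '2'
print(sorted(toy_episode))

# converting the keys to integers first restores the true order
ordered = sorted((int(key), text) for key, text in toy_episode.items())
ordered_text = [text for _, text in ordered]
print(ordered_text)
```

Without the integer conversion, line 11 would be read out before line 2 and the dialogue would be scrambled.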

In [6]:
import json

def grab_subtitle_data(subtitle_json_files, verbose=True):
    """
    Grabbing GoT subtitle data from json files.
    """

    # keep all text segments in this list
    all_text_segments = []

    # iterate over each subtitles file
    for season, subtitles_file in enumerate(subtitle_json_files):

        # open subtitle file
        with open(subtitles_file, 'r') as file:
            data = json.load(file)

        # iterate over episodes in the season
        for episode in data.keys():
            episode_data = {int(key):value for key,value in data[episode].items()}
            episode_data = sorted(episode_data.items())  # sort by line number (as integer)
            episode_text_segments = list(dict(episode_data).values())
            print('Found {0} text segments in Season {1} '
                  'Episode "{2}".'.format(len(episode_text_segments), 
                                          season, 
                                          episode.split('.')[0]))
            if verbose:
                print('First few segments:\n%s' % '\n'.join(episode_text_segments[0:5]))            
            all_text_segments.append(episode_text_segments)
            
    return list(flatten(all_text_segments))

In [7]:
subtitle_data = grab_subtitle_data(subtitle_json_files, verbose=False)

Found 559 text segments in Season 0 Episode "Game Of Thrones S01E01 Winter Is Coming".
Found 571 text segments in Season 0 Episode "Game Of Thrones S01E02 The Kingsroad".
Found 740 text segments in Season 0 Episode "Game Of Thrones S01E03 Lord Snow".
Found 754 text segments in Season 0 Episode "Game Of Thrones S01E04 Cripples, Bastards, And Broken Things".
Found 741 text segments in Season 0 Episode "Game Of Thrones S01E05 The Wolf And The Lion".
Found 583 text segments in Season 0 Episode "Game Of Thrones S01E06 A Golden Crown".
Found 775 text segments in Season 0 Episode "Game Of Thrones S01E07 You Win Or You Die".
Found 666 text segments in Season 0 Episode "Game Of Thrones S01E08 The Pointy End".
Found 679 text segments in Season 0 Episode "Game Of Thrones S01E09 Baelor".
Found 590 text segments in Season 0 Episode "Game Of Thrones S01E10 Fire And Blood".
Found 700 text segments in Season 1 Episode "Game Of Thrones S02E01 The North Remembers".
Found 755 text segments in Season 1 Ep

The final array of subtitle data looks like this:

In [8]:
subtitle_data[0:5]

['Easy, boy.',
 "What do you expect? They're savages.",
 'One lot steals a goat from another lot,',
 "before you know it they're ripping each other to pieces.",
 "I've never seen wildlings do a thing like this."]

And we can summarise the dataset size:

In [9]:
# count lines and words
all_subtitle_text = '\n'.join(subtitle_data)
print('The number of text segments in this corpus: {0}\n'
      'The number of words in this corpus: {1}'.format(len(all_subtitle_text.split('\n')),
                                                       len(all_subtitle_text.split(' '))))

The number of text segments in this corpus: 44844
The number of words in this corpus: 244447


### iv. Combining the book and subtitle datasets

Now we can put the book and subtitle data together:

In [10]:
got_data = book_data+all_subtitle_text

In [11]:
del(book_data)
del(all_subtitle_text)

And report on the size:

In [12]:
print('The number of lines in the final corpus: {0}\n'
      'The number of words in the final corpus: {1}'.format(len(got_data.split('\n')),
                                                            len(got_data.split(' '))))

The number of lines in the final corpus: 129361
The number of words in the final corpus: 1969397


That's almost 2 million words to play with, which should help our language model tremendously. 

In [30]:
got_data[0:100]

'A GAME OF THRONES\n\nPROLOGUE\n\n“We should start back,” Gared urged as the woods began to grow dark aro'

In [31]:
got_data[1000:1100]

' And night is falling.”\n\nSer Waymar Royce glanced at the sky with disinterest. “It does that every d'

### v. Preparing the dataset

The process to make the sequence datasets is the same as before. The only difference is that we'll use longer sequences (`window_size` is now 6, i.e. 5 words of input plus the word to predict), so we're taking into account more text before making our prediction.

In [13]:
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
import numpy as np
from keras.preprocessing.sequence import pad_sequences


Using TensorFlow backend.


In [14]:
# tokenise the data
tokeniser = Tokenizer(lower=True, split=' ', char_level=False)
tokeniser.fit_on_texts([got_data])
vocabulary_size = len(tokeniser.word_index)+1
print('The vocabulary size for this corpus is: %s' % vocabulary_size)

# encode the corpus using the fitted tokeniser
encoded_corpus = tokeniser.texts_to_sequences([got_data])[0]

# generate sequences
sequences = []
window_size = 6
for i in range(0, len(encoded_corpus)):
    sequences.append(encoded_corpus[i:i+window_size])

# pad the short sequences from the end of the corpus so each sequence is the same length
max_sequence_length = np.max([len(sequence) for sequence in sequences])
sequences = pad_sequences(sequences, 
                          maxlen=max_sequence_length, 
                          padding='pre')

# separate sequences into input arrays X 
# and the output label vector y
X = np.array([seq[0:window_size-1] for seq in sequences])
y = np.array([seq[window_size-1] for seq in sequences])
y = to_categorical(y, num_classes=vocabulary_size)

The vocabulary size for this corpus is: 30350


In [15]:
y.shape

(2094848, 30350)

In [16]:
# np.save('GoT_X_features.npz', X)
# np.save('GoT_y_labels.npz', y)

Once again, our features look like this:

In [17]:
X[0:5]

array([[    5,   972,     6,  3796, 12141],
       [  972,     6,  3796, 12141,   322],
       [    6,  3796, 12141,   322,   122],
       [ 3796, 12141,   322,   122,  1131],
       [12141,   322,   122,  1131,    62]], dtype=int32)

And our labels look like this:

In [18]:
y[0:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

One useful extra step: we should **split the dataset into a train and test set**. The main reason for this is that it will help us get a better estimate of the model's true "in the wild" performance, since we can evaluate its performance on data that *wasn't* used in training. 

Evaluating a model on data that was used for training is cheating, since it's already seen that data before, and hence will do unrealistically well when making predictions on it because it has **overfit**.

We will also shuffle the entries, since otherwise our dataset first contains Book 1, then Book 2, ..., Book 5 then finally the subtitle data, whereas we want the model to learn from each source simultaneously. 
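
The idea behind a shuffled train/test split can be sketched with the standard library alone. This is an illustration, not what scikit-learn does internally: `shuffled_split` is a hypothetical helper, and it handles a single list, whereas in practice the features and labels have to be shuffled together so that they stay aligned.

```python
import random


def shuffled_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of `items` and hold out the first part as a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


toy_examples = list(range(20))
toy_train, toy_test = shuffled_split(toy_examples)
print('train:', toy_train)
print('test: ', toy_test)
```

Every example ends up in exactly one of the two sets, and neither set is a contiguous block of the original ordering.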

In [21]:
small_X = X[0:500000]

In [22]:
small_y = y[0:500000]

In [23]:
# # # for tutorial: take a sample of the data for speed
# # sample_size = 1000000
# sampled_indices = np.random.choice(np.arange(len(y)), sample_size, replace=False)
# small_X = X[sampled_indices]
# small_y = y[sampled_indices]

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(small_X, small_y, test_size=0.1, shuffle=True)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)

### vi. Setting up the language model architecture

This time, let's build a slightly larger network:

![Larger RNN language model](big_network.png)

The main differences here are:
+ Our word embeddings are bigger (100 rather than 50 dimensions), which should allow for richer representations of word meaning. (Note that the code cell below still passes 50 to `Embedding`, which is what the model summary reports; change it to 100 to match this description.)
+ We have 2 LSTM layers instead of 1. This should allow the model to learn more complex, hierarchical representations of the text.
+ We have added a dense (fully-connected) layer after the LSTM layers for some additional processing capacity (perhaps, again, allowing for higher-level conceptual representations).



In `Keras` code, we would build the network as follows:

In [25]:
model = Sequential()
model.add(Embedding(vocabulary_size, 50, input_length=max_sequence_length-1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocabulary_size, activation='softmax'))

This is very similar code to before, but we have reason to think that this network will be much more complex and nuanced than the previous one:
+ The dataset we are using is much larger and richer than the toy dataset
+ The network we are training is larger and deeper, and should have more expressive power

We can summarise the **model structure and parameters**:

In [26]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 5, 50)             1517500   
_________________________________________________________________
lstm_1 (LSTM)                (None, 5, 100)            60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 30350)             3065350   
Total params: 4,733,750
Trainable params: 4,733,750
Non-trainable params: 0
_________________________________________________________________
None
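
The parameter counts in this summary can be reproduced by hand. The sketch below assumes the standard layer formulas: an embedding has one vector per vocabulary word, an LSTM has 4 gates that each carry input weights, recurrent weights and a bias, and a dense layer has one weight per input-output pair plus a bias per output. The helper names are hypothetical.

```python
vocabulary_size = 30350
embedding_dim = 50
lstm_units = 100
dense_units = 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # one weight per input-output pair, plus one bias per output
    return input_dim * units + units

params = {
    'embedding_1': vocabulary_size * embedding_dim,
    'lstm_1': lstm_params(embedding_dim, lstm_units),
    'lstm_2': lstm_params(lstm_units, lstm_units),
    'dense_1': dense_params(lstm_units, dense_units),
    'dense_2': dense_params(dense_units, vocabulary_size),
}
for name, count in params.items():
    print('%-12s %10d' % (name, count))
print('%-12s %10d' % ('total', sum(params.values())))
```

Almost all of the parameters sit in the embedding and the output layer, because both scale with the vocabulary size.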


And compile the finished model and specify some **training settings**:

In [27]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### vii. Training the language model

Then, we can start the training run by passing the training data to the model. This would take a reasonably long time to train - it would be helpful to have access to a **GPU** to run this on (e.g. via Google Colab, AWS/GCP, your own GPU) to make use of computation **parallelisation** and drastically reduce training time.

Since it's a longer training run, we would also ideally want to save some intermediate results while training is happening. One way to do this is using Keras' `ModelCheckpoint` utility. To save some disk space, you can specify that you only want to save a new checkpoint file when the monitored metric has improved (commonly, validation accuracy or validation loss). 

In [28]:
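# RUN THIS CELL TO TRAIN THE MODEL YOURSELF (expect a long run without a GPU)

from keras.callbacks import ModelCheckpoint

# Keras fills in the placeholders with the epoch number and validation accuracy
checkpoint_filename = "GoT_Language_Model_{epoch:02d}_{val_accuracy:.3f}.hdf5"
checkpoint = ModelCheckpoint(checkpoint_filename,
                             monitor='val_accuracy',
                             save_best_only=True,  # only save when val_accuracy improves
                             mode='max',  # 'best' file maximises val_accuracy
                             verbose=1)
# validation_split=0.1 holds out the last 10% of the training data for validation
model.fit(X_train, y_train, epochs=50, validation_split=0.1, callbacks=[checkpoint])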

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 405000 samples, validate on 45000 samples
Epoch 1/50

Epoch 00001: val_accuracy improved from -inf to 0.10967, saving model to GoT_Language_Model_01_0.11.hdf5
Epoch 2/50

Epoch 00002: val_accuracy improved from 0.10967 to 0.12318, saving model to GoT_Language_Model_02_0.12.hdf5
Epoch 3/50

Epoch 00003: val_accuracy improved from 0.12318 to 0.13156, saving model to GoT_Language_Model_03_0.13.hdf5
Epoch 4/50

Epoch 00004: val_accuracy improved from 0.13156 to 0.13609, saving model to GoT_Language_Model_04_0.14.hdf5
Epoch 5/50

Epoch 00005: val_accuracy improved from 0.13609 to 0.13882, saving model to GoT_Language_Model_05_0.14.hdf5
Epoch 6/50

Epoch 00006: val_accuracy improved from 0.13882 to 0.14118, saving model to GoT_Language_Model_06_0.14.hdf5
Epoch 7/50

Epoch 00007: val_accuracy improved from 0.14118 to 0.14271, saving model to GoT_Language_Model_07_0.14.hdf5
Epoch 8/50

Epoch 00008: val_accuracy improved from 0.14271 to 0.14320, saving model to GoT_Language_Model_08_0.


Epoch 00038: val_accuracy did not improve from 0.14358
Epoch 39/50

Epoch 00039: val_accuracy did not improve from 0.14358
Epoch 40/50

Epoch 00040: val_accuracy did not improve from 0.14358
Epoch 41/50

Epoch 00041: val_accuracy did not improve from 0.14358
Epoch 42/50

Epoch 00042: val_accuracy did not improve from 0.14358
Epoch 43/50

Epoch 00043: val_accuracy did not improve from 0.14358
Epoch 44/50

Epoch 00044: val_accuracy did not improve from 0.14358
Epoch 45/50

Epoch 00045: val_accuracy did not improve from 0.14358
Epoch 46/50

Epoch 00046: val_accuracy did not improve from 0.14358
Epoch 47/50

Epoch 00047: val_accuracy did not improve from 0.14358
Epoch 48/50

Epoch 00048: val_accuracy did not improve from 0.14358
Epoch 49/50

Epoch 00049: val_accuracy did not improve from 0.14358
Epoch 50/50

Epoch 00050: val_accuracy did not improve from 0.14358


<keras.callbacks.callbacks.History at 0x14cbf9790>

In [32]:
model.save("final_trained_GoT_language_model.h5")
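
To make the checkpoint log above easier to read, here is a plain-Python sketch of the bookkeeping behind `save_best_only=True`, plus a patience rule of the kind used for early stopping. It is not Keras's actual implementation. The first eight accuracies are copied from the log; the plateau values after that and both function names are hypothetical. Keras provides an `EarlyStopping` callback with a `patience` argument for the second rule.

```python
# Validation accuracies for epochs 1-8, copied from the training log above
val_accuracies = [0.10967, 0.12318, 0.13156, 0.13609,
                  0.13882, 0.14118, 0.14271, 0.14320]

def epochs_that_save(scores):
    """Returns the (1-based) epochs at which a 'save best only' rule saves."""
    best = float('-inf')
    saved = []
    for epoch, score in enumerate(scores, start=1):
        if score > best:      # mode='max': higher is better
            best = score
            saved.append(epoch)
    return saved

def early_stop_epoch(scores, patience):
    """Returns the epoch at which training stops after `patience` consecutive
    epochs without improvement, or None if that never happens."""
    best = float('-inf')
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

print(epochs_that_save(val_accuracies))

# Hypothetical continuation: one more improvement, then a plateau
plateau = val_accuracies + [0.14358] + [0.14300] * 5
print(early_stop_epoch(plateau, patience=3))
```

With a rule like this, a run that stops improving (as the log does in its final epochs) would end early instead of using up all 50 epochs.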

For now, to save time, I will just **load a model** that I already trained. 

For reference, this model was really accessible to train - it was trained overnight on my MacBook, so there's no special GPU supercomputer involved. The model was still improving quite rapidly at that point, so we would see even better performance if the model were given enough time to reach **convergence** ("finish" learning, or at least hit serious diminishing returns).

In [449]:
from keras.models import load_model

loaded_model = load_model('final_trained_GoT_language_model.h5')

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


### viii. Exploring our Game of Thrones language model

We can summarise the model's performance on the test set as follows:

In [464]:
from sklearn.metrics import accuracy_score

test_predictions = loaded_model.predict_classes(X_test)
print('Overall test accuracy: {0}'.format(accuracy_score(np.argmax(y_test, axis=1), test_predictions)))

Overall test accuracy: 0.11624


Seems a bit low, but language is complicated and flexible. What does this performance mean in practical terms? We can examine some of the correct answers vs. predictions on the test set:

In [468]:
test_seed_sequences = tokeniser.sequences_to_texts(X_test[0:50])
actual_next_words = tokeniser.sequences_to_texts([np.argmax(y_test, axis=1)[0:50]])[0].split(' ')
prediction_index = model.predict_classes(X_test[0:50])
prediction_vector = tokeniser.sequences_to_texts([prediction_index])
predictions = prediction_vector[0].split(' ')
df = pd.DataFrame(list(zip(test_seed_sequences,
                           actual_next_words,
                           predictions)),
                  columns=['Seed Sequence', 'Actual Next Word', 'Predicted Next Word'])
df

| | Seed Sequence | Actual Next Word | Predicted Next Word |
|---|---|---|---|
| 0 | free folk here craster serves | no | the |
| 1 | that place since the day | her | of |
| 2 | grow around the gravel swallowing | it | the |
| 3 | theon had given the matter | no | he |
| 4 | but oversweet to his taste | “if | jewels |
| 5 | so sore he could scarcely | walk | allow |
| 6 | wore on their wedding night | tyrion | ” |
| 7 | and split them laying the | logs | stew |
| 8 | i love him ” sansa | wailed | said |
| 9 | wheeled his horse about and | trotted | galloped |


So, it looks like even where the model doesn't get the prediction correct, its prediction does at least seem plausible. 

That said, there is plenty that can be done to improve this model (see the final section of this notebook).
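
One way to put a number on "plausible but not exactly right" is **top-k accuracy**: how often the true next word is among the model's k most probable guesses. This metric is not computed in the original notebook; the sketch below is a minimal NumPy version run on made-up probabilities, and the function name is hypothetical.

```python
import numpy as np

def top_k_accuracy(probabilities, true_indices, k=5):
    """Fraction of rows whose true class is among the k most probable classes."""
    # indices of the k largest probabilities in each row (unordered within the k)
    top_k = np.argpartition(probabilities, -k, axis=1)[:, -k:]
    hits = (top_k == true_indices[:, None]).any(axis=1)
    return hits.mean()

# Toy probabilities: 4 samples, 6-word vocabulary (illustrative values only)
toy_probs = np.array([[0.40, 0.30, 0.12, 0.08, 0.06, 0.04],
                      [0.05, 0.04, 0.60, 0.20, 0.06, 0.05],
                      [0.26, 0.24, 0.20, 0.15, 0.10, 0.05],
                      [0.02, 0.08, 0.10, 0.12, 0.18, 0.50]])
toy_true = np.array([1, 2, 4, 0])

for k in [1, 2, 5]:
    print('top-%d accuracy: %.2f' % (k, top_k_accuracy(toy_probs, toy_true, k=k)))
```

With the real model, you would pass the output of `loaded_model.predict(X_test)` as `probabilities` and `np.argmax(y_test, axis=1)` as `true_indices`.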

### ix. Gather Round for a New Tale...

Finally, let's have the language model write some Game of Thrones text for us (since GRR Martin certainly isn't going to!). 

We can use the same function as before to continuously feed in a seed sequence to the model, generate one word, and then append the generated word to the seed sequence. In this way, the model uses its own previous output as input to itself in the future. 

Let's write some text (with somewhat cherry-picked lengths):

In [469]:
write_text_sequence("I would not have expected", 47,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "I would not have expected"
Output:
I would not have expected no fear ” “help me a king dark as fresh shadows better catelyn thought herself nipple off the rear from the window “my lord ” the greatjon tossed a sullen wooden arm osmynd had said “what would we do again is this a pack ” shae nipped


In [68]:
write_text_sequence("Once upon a time the", 53,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "Once upon a time the"
Output:
Once upon a time the next morning when the blades poured out in the courtyard and the distant halls of woth had never been a bitter man struggling on the wall and drank the whole terms of the false feast cat ” he said sharply “we have a torch ” the merchant bear said “i have no choice


In [62]:
write_text_sequence("The start of the story", 43,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "The start of the story"
Output:
The start of the story and the rest of the traitor stannis ” he said “i will not see ” he said “i have no choice for that joffrey’s name day he had been born in the crypts ” the knight said “are you a splendid bastard ”


This all looks like a decent start - I especially like the merchant bear, and am shocked by the curveball that Joffrey was born in the crypts of Winterfell. Other segments sound like weird GoT beat poetry, and I can almost hear the soft accompanying bongo beats.

These samples show that the model has learned something about both language structure and GoT content, but admittedly it's still a bit clunky. Check out the section at the bottom for suggestions on how to improve the model. 

### x. Generating more creative output

If you run the model generatively and write longer stories, you'll see that it can sometimes get stuck in a loop:

In [79]:
write_text_sequence("The men with the swords", 200,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False)

Using seed sequence: "The men with the swords"
Output:
The men with the swords beating in the solar of his youth “oh yes ” the king said “i have a role in yours ” the raven agreed “nor island remained in the whispering wood and the others had been a pig sworn years the women had been allowed to hear the banners in the dust of the burning fork of the trident he’ll be able to catch the horse with a hand of his own nurse he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight past he had been a fortnight pas

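This is because our function `write_text_sequence` always greedily chooses the most probable next word as it generates text. The model only sees the last 5 words and greedy decoding is deterministic, so the same context always produces the same next word. Once a context repeats, the output cycles forever.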

Ideally, we want to give the model a bit more space to be creative than this. The easiest way to do this is instead of using the **most probable next word** as our prediction, we can **sample from all possible words proportionally to their probability**. This will help introduce some fun linguistic variety into our generated text. 

The most probable words are still, of course, most likely to be chosen, but there is now space for less probable words to be used as well. We are trading off (potentially rigid) local correctness for (potentially noisy) creativity. 
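
Here is a toy illustration of the difference between the two strategies, using a made-up four-word vocabulary and made-up probabilities rather than real model output. Greedy decoding returns the same word every time, while sampling returns each word roughly in proportion to its probability.

```python
import numpy as np

# Toy next-word distribution over a 4-word vocabulary (illustrative values only)
vocabulary = ['past', 'ago', 'hence', 'later']
probs = np.array([0.8, 0.1, 0.05, 0.05])

# Greedy decoding: always the single most probable word
greedy_choice = vocabulary[int(np.argmax(probs))]

# Sampling: draw words in proportion to their probabilities
rng = np.random.default_rng(seed=0)
samples = rng.choice(len(vocabulary), size=10000, p=probs)
frequencies = np.bincount(samples, minlength=len(vocabulary)) / len(samples)

print('Greedy always picks: %s' % greedy_choice)
for word, frequency in zip(vocabulary, frequencies):
    print('%-6s sampled %.1f%% of the time' % (word, 100 * frequency))
```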

We can modify our `write_text_sequence` function to have an option to use this probabilistic sampling approach:

In [131]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True,
                        use_sampling=True):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        if use_sampling:
            next_word_probabilities = model.predict_proba(encoded_sequence)[0]
            next_word_indices = range(0, vocabulary_size)
            prediction_index = np.random.choice(next_word_indices, size=1, p=next_word_probabilities)             
        else: 
            prediction_index = model.predict_classes(encoded_sequence)
        
        # convert prediction index to actual word
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Chosen next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        # append prediction to the sequence
        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
        

Let's see if this addition helps us get out of the infinite loop of fortnight past:

In [132]:
write_text_sequence("The men with the swords", 200,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True)

Using seed sequence: "The men with the swords"
Output:
The men with the swords circled to say how badly his fathers were shagga knights mail and golden crabs ” mirri maz duur wept as lord walder rushed to sound behind him she heard coming upon six nights beheaded the last son who had refuge the city blow the hand stood outside the frozen wooden grassy ground with the golden wool in his place broke from these party are nine the two crows but the shouting seemed to lose with once he knew anything but nothing and they she had known afterward she had had beside the whole king as truth let me make out walls on the ridge dany could have done he had told him his death at lord of yours stark how know he read cradled the girl rising but deep strangely slowly “open them you be he’s alone as a poor master for sudden mouth who remained of the eastern road sound uncomfortable like blood the quiet with his hand “it won’t do time he’ll see to me now pray when i felt ” “the man says if his way were ba

Well... we're definitely not trapped anymore! But now the text sounds absolutely bonkers. Let's see if we can put some brakes on this thing. 

### xi. Generating more creative (but controlled) output 

The main way of controlling how creative or random these sampling-based predictions are is by using a hyperparameter called `temperature`. Essentially:
+ **Higher temperature** flattens the distribution: less likely predictions have their probabilities increased relative to the most likely ones. To remember this, think of "hot" = more randomness, just like higher physical temperature leads to more random molecular motion. 
+ **Lower temperature** sharpens the distribution and downplays the less likely predictions. As the temperature approaches zero, we only ever consider the most likely prediction (our sampling starts to function like an `argmax` and we go back to the greedy approach).

We can write a function to do this scaling of probabilities:

In [283]:
def apply_temp_to_softmax_probs(probs, temp, verbose=False):
    """
    Rescales softmax probabilities using some given temperature.
    """
    
    # add a very small number to probabilities
    # to avoid taking log of zero later (undefined) 
    epsilon = 10e-16 
    probs = probs + epsilon

    # take logs of probabilities
    log_probs = np.log(probs)
    
    # the crucial step - divide the log probabilities by temperature
    scaled_log_probs = log_probs / temp 

    # undo logging to get back to probabilities
    new_probs = np.exp(scaled_log_probs) 

    # and renormalise so that probabilities sum to 1
    normalised_probs = new_probs / np.sum(new_probs)
    
    if verbose:
        print('1. Original probabilities:\n%s\n' % probs)
        print('2. Log of probabilities:\n%s\n' % log_probs)
        print('3. Temperature scaled log of probabilities:\n%s\n' % scaled_log_probs)
        print('4. Back to pseudo-probabilities by undoing logging:\n%s\n' % new_probs)
        print('5. Final normalised probabilities:\n%s\n' % normalised_probs)

    return normalised_probs


You can check out how temperature scaling of probability arrays happens by testing out this function in verbose mode. Let's scale the array `np.array([0.8, 0.1, 0.05, 0.05])` using different temperatures:

#### temperature=1 (should do nothing at all)

In [284]:
_ = apply_temp_to_softmax_probs(np.array([0.8, 0.1, 0.05, 0.05]), 
                                temp=1, verbose=True)

1. Original probabilities:
[0.8  0.1  0.05 0.05]

2. Log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

3. Temperature scaled log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

4. Back to pseudo-probabilities by undoing logging:
[0.8  0.1  0.05 0.05]

5. Final normalised probabilities:
[0.8  0.1  0.05 0.05]



Great, it's good to know that using a temperature of 1 does nothing at all to the probabilities. 

#### temperature=10 (should boost low probabilities and introduce more randomness)

In [285]:
_ = apply_temp_to_softmax_probs(np.array([0.8, 0.1, 0.05, 0.05]), 
                                temp=10, verbose=True)

1. Original probabilities:
[0.8  0.1  0.05 0.05]

2. Log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

3. Temperature scaled log of probabilities:
[-0.02231436 -0.23025851 -0.29957323 -0.29957323]

4. Back to pseudo-probabilities by undoing logging:
[0.97793277 0.79432823 0.74113445 0.74113445]

5. Final normalised probabilities:
[0.30048357 0.2440685  0.22772396 0.22772396]



A high temperature of 10 really amplifies those low probabilities!

#### temperature=0.1 (should dampen out lower probabilities and boost already high probabilities)

In [286]:
_ = apply_temp_to_softmax_probs(np.array([0.8, 0.1, 0.05, 0.05]), 
                                temp=0.1, verbose=True)

1. Original probabilities:
[0.8  0.1  0.05 0.05]

2. Log of probabilities:
[-0.22314355 -2.30258509 -2.99573227 -2.99573227]

3. Temperature scaled log of probabilities:
[ -2.23143551 -23.02585093 -29.95732274 -29.95732274]

4. Back to pseudo-probabilities by undoing logging:
[1.07374182e-01 1.00000000e-10 9.76562500e-14 9.76562500e-14]

5. Final normalised probabilities:
[9.99999999e-01 9.31322574e-10 9.09494701e-13 9.09494701e-13]



And a low temperature of 0.1 really freezes down those low probabilities: they are practically 0. 

How come the maths works? Essentially:
+ Probabilities that are already big have logarithms close to zero, so multiplying or dividing them by a temperature barely moves them.
+ Small probabilities have large negative logarithms, so the same scaling changes their values a lot. 
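
Another way to see it: since `exp(log(p) / T)` equals `p ** (1 / T)`, temperature scaling simply raises every probability to the power `1 / T` and renormalises. With `T > 1` the power is less than 1, which pulls all values towards 1 and flattens the distribution; with `T < 1` the power is greater than 1, which exaggerates the gaps between values. The sketch below uses a hypothetical helper, `temperature_scale`, as a compact equivalent of `apply_temp_to_softmax_probs` without the epsilon guard, and checks the idea on the same example array.

```python
import numpy as np

def temperature_scale(probs, temp):
    # exp(log(p) / T) is the same as p ** (1 / T); then renormalise
    powered = probs ** (1.0 / temp)
    return powered / powered.sum()

probs = np.array([0.8, 0.1, 0.05, 0.05])
for temp in [1, 10, 0.1]:
    print('temperature=%s -> %s' % (temp, temperature_scale(probs, temp)))
```

The results should agree with the "Final normalised probabilities" printed by the verbose runs above.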

We can add a `temperature` argument and one extra line (line 29) to our `write_text_sequence` function to make use of temperature:

In [311]:
def write_text_sequence(seed_sequence,
                        length_to_write,
                        model, 
                        tokeniser, 
                        input_length,
                        verbose=True,
                        use_sampling=True, 
                        temperature=1):
    """
    Generates text using a trained language
    model and seed sequence.
    """

    print('Using seed sequence: "%s"' % seed_sequence)
    sequence = seed_sequence
    
    for i in range(length_to_write):
        
        # tokenise and encode the seed sequence
        encoded_sequence = tokeniser.texts_to_sequences([sequence])[0]
        assert len(encoded_sequence)>=input_length, \
            'ERROR: seed sequence must be at least %s words.' % input_length
        encoded_sequence = encoded_sequence[-input_length:]
        encoded_sequence = np.array(encoded_sequence).reshape(-1,input_length)

        # predict the next word index and corresponding word
        if use_sampling:
            next_word_probabilities = model.predict_proba(encoded_sequence)[0]
            next_word_probabilities = apply_temp_to_softmax_probs(next_word_probabilities, temperature)
            next_word_indices = range(0, vocabulary_size)
            prediction_index = np.random.choice(next_word_indices, size=1, p=next_word_probabilities)             
        else: 
            prediction_index = model.predict_classes(encoded_sequence)
        
        # convert prediction index to actual word
        prediction = tokeniser.sequences_to_texts([prediction_index])
        
        if verbose:
            print('Sequence so far: %s' % sequence)
            print('Seed sequence encoded: %s' % encoded_sequence)
            print('Chosen next word is {0} (index {1})'.format(prediction, prediction_index[0]))

        # append prediction to the sequence
        sequence += ' ' + prediction[0]
    
    print('Output:\n' + sequence)
        

Now, we can control the creativity level of the text generation by changing the value of one argument:

#### Predictable text

In [446]:
write_text_sequence("Jon Snow is the son", 15,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True, 
                    temperature=0.5)

Using seed sequence: "Jon Snow is the son"
Output:
Jon Snow is the son of mine ” she said “i don’t think i shall find it too to see


#### Normal text

In [447]:
write_text_sequence("Jon Snow is the son", 15,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True, 
                    temperature=1)

Using seed sequence: "Jon Snow is the son"
Output:
Jon Snow is the son to joffrey’s way lift it in the wood and remark licked his hands “to tell


#### Mental text

In [448]:
write_text_sequence("Jon Snow is the son", 15,
                    model, tokeniser, 
                    max_sequence_length-1,
                    verbose=False, 
                    use_sampling=True, 
                    temperature=2)

Using seed sequence: "Jon Snow is the son"
Output:
Jon Snow is the son for long barracks in all haste fall lines likewise found “ahhhh dothraki this first boy


I hope this project gave you a taste of language models. That's all for this tutorial for now!

# III. Suggested Extensions

Here are some suggestions for extending this work in order to build a more serious Game of Thrones language model:

1. **Data**: Spend more time cleaning up the text corpus. There is definitely some weird stuff in there (e.g. I saw markup tags in the subtitle data).
2. **Data**: Perhaps think about grabbing more data, maybe by scraping some of the fan wikis.
3. **Representation**: Use **pre-trained word embeddings** (e.g. FastText, GloVe, Word2Vec) and possibly update them during training.
4. **Representation**: Think about using **sub-word tokenisation** rather than word-based tokenisation.
5. **Modelling**: **Train for longer**, until convergence :) Monitor for overfitting using a validation set, and stop early when validation performance stops improving. 
6. **Modelling**: Look into using **regularisation techniques** (dropout, weight penalties) to improve model performance and generalisability.
7. **Modelling**: Experiment with different numbers of layers, sizes, activation functions, initialisation approaches, etc.
8. **Modelling**: Optimise some of the **hyperparameters** in the model (learning rate, momentum, batch sizes).
9. **Modelling**: Forget RNNs for language modelling completely and jump on the **Transformer hype train** ([choo](https://paperswithcode.com/task/language-modelling) [choo!](https://arxiv.org/abs/1904.09408)). 
10. **Modelling**: Try downloading a **pre-trained language model** (like **Google AI's BERT** or **OpenAI's GPT models** or **Carnegie Mellon/Google Brain's XLNet**) and fine-tuning it on Game of Thrones text. This is likely to give the easiest, biggest gains, since these models are pre-trained on massive corpora with a huge number of GPUs. 
11. **Visualisation**: Try using **TensorBoard** to visualise the progression of model training and diagnose any weird behaviour. 
