# Part 1: Sequence Modelling

__Before starting, we recommend you enable GPU acceleration if you're running on Colab.__

In [2]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try: 
    import torchbearer
except:
    !pip install torchbearer

## Markov chains

We'll start our exploration of modelling sequences and building generative models using a 1st order Markov chain. The Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In our case we're going to learn a model over a set of characters from an English language text. The events, or states, in our model are the set of possible characters, and we'll learn the probability of moving from one character to the next.

Let's start by loading the data from the web:

In [3]:
from torchvision.datasets.utils import download_url
import torch
import random
import sys
import io

# Read the data
download_url('https://s3.amazonaws.com/text-datasets/nietzsche.txt', '.', 'nietzsche.txt', None)
text = io.open('./nietzsche.txt', encoding='utf-8').read().lower()
print('corpus length:', len(text))

Downloading https://s3.amazonaws.com/text-datasets/nietzsche.txt to .\nietzsche.txt


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

corpus length: 600893


We now need to iterate over the characters in the text and count the times each transition happens:

In [4]:
transition_counts = dict()
for i in range(0,len(text)-1):
    currc = text[i]
    nextc = text[i+1]
    if currc not in transition_counts:
        transition_counts[currc] = dict()
    if nextc not in transition_counts[currc]:
        transition_counts[currc][nextc] = 0
    transition_counts[currc][nextc] += 1

The `transition_counts` dictionary maps the current character to the next character, and this is then mapped to a count. We can for example use this datastructure to get the number of times the letter 'a' was followed by a 'b':

In [5]:
print("Number of transitions from 'a' to 'b': " + str(transition_counts['a']['b']))

Number of transitions from 'a' to 'b': 813


Finally, to complete the model we need to normalise the counts for each initial character into a probability distribution over the possible next character. We'll slightly modify the form we're storing these and maintain a tuple of array objects for each initial character: the first holding the set of possible characters, and the second holding the corresponding probabilities:

In [6]:
transition_probabilities = dict()
for currentc, next_counts in transition_counts.items():
    values = []
    probabilities = []
    sumall = 0
    for nextc, count in next_counts.items():
        values.append(nextc)
        probabilities.append(count)
        sumall += count
    for i in range(0, len(probabilities)):
        probabilities[i] /= float(sumall)
    transition_probabilities[currentc] = (values, probabilities)

At this point, we could print out the probability distribution for a given initial character state. For example, to print the distribution for 'a':

In [7]:
for a,b in zip(transition_probabilities['a'][0], transition_probabilities['a'][1]):
    print(a,b)

c 0.03685183172083922
t 0.14721708881400153
  0.05296771388194369
n 0.2322806826829003
l 0.11552886183280792
r 0.08794434177628004
s 0.0968583541689314
v 0.0192412218719426
i 0.03402543754755952
d 0.026986628981411024
g 0.017202956843135123
y 0.02505707142080661
k 0.012827481247961734
b 0.02209479291227307
p 0.020545711490379388
m 0.02030111968692249
u 0.011414284161321883
f 0.004429829329274921
w 0.004837482335036417
, 0.0010870746820306554

 0.005353842809000978
z 0.0006522448092183933
x 0.0007609522774214588
o 0.0005435373410153277
. 0.000489183606913795
- 0.0004348298728122622
' 5.4353734101532776e-05
j 0.0004348298728122622
h 0.00035329927165996303
e 0.0007337754103706925
: 5.4353734101532776e-05
a 5.4353734101532776e-05
) 0.00010870746820306555
! 2.7176867050766388e-05
; 2.7176867050766388e-05
" 8.153060115229916e-05
q 2.7176867050766388e-05
_ 8.153060115229916e-05
[ 2.7176867050766388e-05


It looks like the most probable letter to follow an 'a' is 'n'. 

__What is the most likely letter to follow the letter 'j'? Write your answer in the block below:__

In [11]:
for a,b in zip(transition_probabilities['j'][0], transition_probabilities['j'][1]):
    print(a,b)
    
index = transition_probabilities['j'][1].index(max(transition_probabilities['j'][1]))
print(transition_probabilities["j"][0][index])

e 0.2585278276481149
o 0.15080789946140036
u 0.5709156193895871
a 0.017953321364452424
i 0.0017953321364452424
u


We mentioned earlier that the Markov model is generative. This means that we can draw samples from the distributions and iteratively move between states. 

Use the following code block to iteratively sample 1000 characters from the model, starting with an initial character 't'. You can use the `torch.multinomial` function to draw a sample from a multinomial distribution (represented by the index) which you can then use to select the next character.

In [28]:
current = 't'
for i in range(0, 1000):
    print(current, end='')
    maxindex = transition_probabilities[current][1].index( max(transition_probabilities[current][1]) )
    current = transition_probabilities[current][0][maxindex]

the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the 

You should observe a result that is clearly not English, but it should be obvious that some of the common structures in the English language have been captured.

__Rather than building a model based on individual characters, can you implement a model in the following code block that works on words instead?__

## RNN-based sequence modelling

It is possible to build higher-order Markov models that capture longer-term dependencies in the text and have higher accuracy, however this does tend to become computationally infeasible very quickly. Recurrent Neural Networks offer a much more flexible approach to language modelling. 

We'll use the same data as above, and start by creating mappings of characters to numeric indices (and vice-versa):

In [29]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 57


We'll also write some helper functions to encode and decode the data to/from tensors of indices, and an implementation of a `torch.Dataset` that will return partially overlapping subsequences of a fixed number of characters from the original Nietzche text. Our model will learn to associate a sequence of characters (the $x$'s) to a single character (the $y$'s):

In [30]:
from torch.utils.data import Dataset, DataLoader
from torch import nn
from torch.nn import functional as F
from torch import optim
import random
import sys
import io

maxlen = 40
step = 3


def encode(inp):
    # encode the characters in a tensor
    x = torch.zeros(maxlen, dtype=torch.long)
    for t, char in enumerate(inp):
        x[t] = char_indices[char]

    return x


def decode(ten):
    s = ''
    for v in ten:
        s += indices_char[v] 
    return s


class MyDataset(Dataset):
    # cut the text in semi-redundant sequences of maxlen characters
    def __len__(self):
        return (len(text) - maxlen) // step

    def __getitem__(self, i):
        inp = text[i*step: i*step + maxlen]
        out = text[i*step + maxlen]

        x = encode(inp)
        y = char_indices[out]

        return x, y

We can now define the model. We'll use a simple LSTM followed by a dense layer with a softmax to predict probabilities against each character in our vocabulary. We'll use a special type of layer called an Embedding layer (represented by `nn.Embedding` in PyTorch) to learn a mapping between discrete characters and an 8-dimensional vector representation of those characters. You'll learn more about Embeddings in the next part of the lab.

In [31]:
class CharPredictor(nn.Module):
    def __init__(self):
        super(CharPredictor, self).__init__()
        self.emb = nn.Embedding(len(chars), 8)
        self.lstm = nn.LSTM(8, 128, batch_first=True)
        self.lin = nn.Linear(128, len(chars))

    def forward(self, x):
        x = self.emb(x)
        lstm_out, _ = self.lstm(x)
        out = self.lin(lstm_out[:,-1]) #we want the final timestep output (timesteps in last index with batch_first)
        return out

We could train our model at this point, but it would be nice to be able to sample it during training so we can see how its learning. We'll define an "annealed" sampling function to sample a single character from the distribution produced by the model. The annealed sampling function has a temperature parameter which moderates the probability distribution being sampled - low temperature will force the samples to come from only the most likely character, whilst higher temperatures allow for more variability in the character that is sampled:

In [32]:
def sample(logits, temperature=1.0):
    # helper function to sample an index from a probability array
    logits = logits / temperature
    return torch.multinomial(F.softmax(logits, dim=0), 1)

Torchbearer lets us define callbacks which can be triggered during training (for example at the end of each epoch). Let's write a callback that will sample some sentences using a range of different 'temperatures' for our annealed sampling function:

In [33]:
import torchbearer
from torchbearer import Trial
from torchbearer.callbacks.decorators import on_end_epoch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

@on_end_epoch
def create_samples(state):
    with torch.no_grad():
        epoch = -1
        if state is not None:
            epoch = state[torchbearer.EPOCH]

        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print()
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index:start_index+maxlen-1]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            print()
            sys.stdout.write(generated)

            inputs = encode(sentence).unsqueeze(0).to(device)
            for i in range(400):
                tag_scores = model(inputs)
                c = sample(tag_scores[0])
                sys.stdout.write(indices_char[c.item()])
                sys.stdout.flush()
                inputs[0, 0:inputs.shape[1]-1] = inputs[0, 1:].clone()
                inputs[0, inputs.shape[1]-1] = c
        print()

Now, all the pieces are in place. __Use the following block to:__

- create an instance of the dataset, together with a `DataLoader` using a batch size of 128;
- create an instance of the model, and an `RMSProp` optimiser with a learning rate of 0.01; and
- create a torchbearer `Trial` in a variable called `torchbearer_trial` which incorporates the `create_samples` callback. Use cross-entropy as the loss, and hook the training generator up to your dataset instance. Make sure you move your `Trial` object to the GPU if one is available.

In [46]:
train_data = MyDataset()
val_data = MyDataset()
test_data = MyDataset()

# create data loaders
trainloader = DataLoader(train_data, batch_size=128, shuffle=True)
testloader = DataLoader(test_data, batch_size=128, shuffle=True)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = CharPredictor().to(device)

num_epochs = 100

loss_function = nn.CrossEntropyLoss()
optimiser = optim.RMSprop(model.parameters(), lr=0.01)

train_losses = []
val_losses = []

trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy'], callbacks=[create_samples]).to(device)
trial.with_generators(trainloader, test_generator=testloader)
trial.run(epochs=10)
results = trial.evaluate(data_key=torchbearer.TEST_DATA)
print(results)



HBox(children=(FloatProgress(value=0.0, description='0/100(t)', max=1565.0, style=ProgressStyle(description_wi…



----- Generating text after Epoch: 0


----- diversity: 0.2
----- Generating with seed: "fficulty (and will long have difficulty"

fficulty (and will long have difficultyin the in a pready of the
really the vont men of every perropisility, nowaples of there to by conhirety, and conjets one" a mode, age,
what alrical in racation
by, a psychsess, and take of he which the defererst nowa(here such.: on fromoty when therewer it for methery of that ourselves to its certain what more that ones! evil has judguers is corigion--fas seds -a reshy tooe! oyy what invol may and

----- diversity: 0.5
----- Generating with seed: "fficulty (and will long have difficulty"

fficulty (and will long have difficultylactor to handelo-gan that
hay m with their
these ones. it eviduable too mochizations we shayme to crriving! complesses respide them ourselvesadity, from the every puty more in be evalitions of all
the stillt who grandents. it is diswircevanale iser what? so
gh, frone--which that irdre.--it one

HBox(children=(FloatProgress(value=0.0, description='1/100(t)', max=1565.0, style=ProgressStyle(description_wi…



----- Generating text after Epoch: 1


----- diversity: 0.2
----- Generating with seed: "e uncertain also has its charms, the sp"

e uncertain also has its charms, the spunceltulity otcil one the wirach couraging of the wirn quition of the germans everything but we to the light fpirmary intents in your imbarding,s
objectiful porrain things we do a demonative,"
hore botk the feel indivincriblaming, from profoundlent and mendach to the reless, he the brened that hat
priretorderchys yourselve ofcested to were is
a pritives the
syarling stauthours of made and simplity

----- diversity: 0.5
----- Generating with seed: "e uncertain also has its charms, the sp"

e uncertain also has its charms, the sp
                    
      st a know than the parding indivins a old remaintlinction,--and ords on is neighg, heepoch
too can half-orginually toways entiviless and as to
pact, of the experopitable alter"! eureperstle. himst proses of the struggle and into in ord onl'oound and any powers, that 

HBox(children=(FloatProgress(value=0.0, description='2/100(t)', max=1565.0, style=ProgressStyle(description_wi…



----- Generating text after Epoch: 2


----- diversity: 0.2
----- Generating with seed: "he dream
vanished. a time came when peo"

he dream
vanished. a time came when peoand germane is not bockices, all sours to mote with how
expited: it is god who planness
on recognored becieiry, such a terperary
an avo to dread; to not raps, descasiol. they its converem: should be preserrives these complus and domain to memory to be meration, upon be cannoth of roungy we oligy of previl, leves the mean cultity:

10. is bitt him, but because lead, fascures is at a feeling everyes

----- diversity: 0.5
----- Generating with seed: "he dream
vanished. a time came when peo"

he dream
vanished. a time came when peodowf anyther vanity," that greas, which is etabal
dete't
the neenghavicsry us no justice we
wol an above a kensian eperthorture, the "net last noifature
at one's slove. this moralian, and itself step have many terming of a man a causes of the slow-enduring untroness amphounded not--not
compario

HBox(children=(FloatProgress(value=0.0, description='3/100(t)', max=1565.0, style=ProgressStyle(description_wi…



----- Generating text after Epoch: 3


----- diversity: 0.2
----- Generating with seed: "ive and the other dies instantly, or vi"

ive and the other dies instantly, or virightly decisite in the solutic most liner
insterations of
degraritism; such worlled. and personas, and they intendernationed able any
become.

17? at the pity of subject (votic of the an
rathen anyon loved man iduced and
sensations of restitim
plovor moderned and as to halot the something to the wold nowagaistachy bleased by the deith. all into which, schorit loke one paysy
found. the urie.
consc

----- diversity: 0.5
----- Generating with seed: "ive and the other dies instantly, or vi"

ive and the other dies instantly, or vioft of
thing--to the another unfar, asprocted to said and alsoin sciences--armoskunct to desides as
in old
at blen of thing,--to elilitary worldary, ant and uncan
extent
under fitation; in he speak
ungreated of snivid,
consequent inclriminishe
in to purily belachorical
rigory, or to the souls--

HBox(children=(FloatProgress(value=0.0, description='4/100(t)', max=1565.0, style=ProgressStyle(description_wi…



----- Generating text after Epoch: 4


----- diversity: 0.2
----- Generating with seed: " a certain considerate cruelty, which k"

 a certain considerate cruelty, which kreale.
that new century, have neighbyraon
that on"--chy one vention.. one do upid, stymalite, so intorled of grainest oftest armoroughone.n [rehemine"--bas ont of which relatice
more almost that set a phyceal evee, when becaltumalottave
unifurenhiness ownaged.



=aspedityr, the renty or that the derrow, q2 nature dist of exercordingly
doesing i domining):--the truth intimumy unapporer,
when he he

----- diversity: 0.5
----- Generating with seed: " a certain considerate cruelty, which k"

 a certain considerate cruelty, which kphilopenher, is it man?--.in it upon truth is regcomy, for their stupiry and is natifule yeathinable rideal tyts and yours--namely most soul_ in vention i happiness, the easy is the possibile equally states of magness, wath's to the in man. (by a wive al names an obly repenentale om with the ma

HBox(children=(FloatProgress(value=0.0, description='5/100(t)', max=1565.0, style=ProgressStyle(description_wi…



----- Generating text after Epoch: 5


----- diversity: 0.2
----- Generating with seed: "erstood how to
introduce itself as the "

erstood how to
introduce itself as the yowe the others, servandinicule injury must tapt very times--that
hand to experient frome exrempting
man and for all the huenge, that reblolity,
as
perhap as of virtues aboot cans brooctable to people? assersed even is ourselven the
hand himself were "still beartherosing eyherment to since upond that sought sheen all accest as a wored to their man. one now his that, the scould like abod then a sha

----- diversity: 0.5
----- Generating with seed: "erstood how to
introduce itself as the "

erstood how to
introduce itself as the coump chand of the romand, to stay developme, if to relfultull. or hy naw to opeapingly them even for his sone takerency itself: the chseasion a most of coustly take whatest our even to the utformanted instences a cause of a belive as to look there: all thep, an enist, i cut for the soul
and se

HBox(children=(FloatProgress(value=0.0, description='6/100(t)', max=1565.0, style=ProgressStyle(description_wi…

KeyboardInterrupt: 

Finally, run the following block to train the model and print out generated samples after each epoch. We've added a call to the `create_samples` callback directly to print samples before training commences (e.g. with random weights). Be aware this will take some time to run...

In [None]:
create_samples.on_end_epoch(None)
torchbearer_trial.run(epochs=10)

Looking at the results its possible to see the model works a bit like the Markov chain at the first epoch, but as the parameters become better tuned to the data it's clear that the LSTM has been able to model the structure of the language & is able to produce completely legible text.

__Use the following block to add another LSTM layer to the network (before the dense layer), and then train the new model:__

In [52]:
class CharPredictor(nn.Module):
    def __init__(self):
        super(CharPredictor, self).__init__()
        self.emb = nn.Embedding(len(chars), 8)
        self.lstm = nn.LSTM(8, 128, batch_first=True)
        self.lin = nn.Linear(128, len(chars))

    def forward(self, x):
        x = self.emb(x)
        y, _ = self.lstm(x)
        lstm_out, _ = self.lstm(y)
        out = self.lin(lstm_out[:,-1]) #we want the final timestep output (timesteps in last index with batch_first)
        return out
    
    
    
train_data = MyDataset()
val_data = MyDataset()
test_data = MyDataset()

# create data loaders
trainloader = DataLoader(train_data, batch_size=128, shuffle=True)
testloader = DataLoader(test_data, batch_size=128, shuffle=True)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = CharPredictor().to(device)

num_epochs = 10

loss_function = nn.CrossEntropyLoss()
optimiser = optim.RMSprop(model.parameters(), lr=0.01)

train_losses = []
val_losses = []

trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy'], callbacks=[create_samples]).to(device)
trial.with_generators(trainloader, test_generator=testloader)
trial.run(epochs=10)
results = trial.evaluate(data_key=torchbearer.TEST_DATA)
print(results)

HBox(children=(FloatProgress(value=0.0, description='0/10(t)', max=1565.0, style=ProgressStyle(description_wid…

Exception: [RuntimeError('input.size(-1) must be equal to input_size. Expected 8, got 128')]

 __How does the additional layer affect performance of the model? Provide your answer in the block below:__

YOUR ANSWER HERE