
# Lab 6: Neural Language Models

This week we will use the PyTorch library to build a simple feedforward neural language model.  This notebook is adapted from one of the PyTorch tutorials and includes code by Robert Guthrie as well as my own.

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#sphx-glr-beginner-nlp-word-embeddings-tutorial-py


### Word Embeddings in PyTorch

Before we get to a worked example and some exercises, a few quick notes
about how to use embeddings in PyTorch.  First, we need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $|V|$ is the size of
the vocabulary and $D$ is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$-th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use a tensor of integer indices,
typically of type torch.long (since the indices are integers, not floats).




In [1]:
# Standard pytorch imports
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1288ca6f0>

In [3]:
#create a word index for 2 words
word_to_ix = {"hello": 0, "world": 1}
#create 5 dimensional embeddings for 2 words
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
#identify the index into this embedding matrix for the word of interest - this is stored in a 1-d tensor
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
#find the embedding of interest
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[-0.8923, -0.0583, -0.1955, -0.9656,  0.4224]],
       grad_fn=<EmbeddingBackward0>)


In [4]:
current_tensor = torch.tensor([word_to_ix["world"]], dtype =torch.long)
print(embeds(current_tensor))

tensor([[ 0.2673, -0.4212, -0.5107, -1.5727, -0.1232]],
       grad_fn=<EmbeddingBackward0>)
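
As a quick check of the lookup-table view described above, the following sketch (not part of the original tutorial) compares the output of an `nn.Embedding` call with the corresponding rows of its `weight` matrix. It also shows that several indices can be looked up in one call. The variable names `idxs`, `batch` and `same_rows` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)

# The embedding table is a |V| x D matrix stored in embeds.weight
print(embeds.weight.shape)

# Several indices can be looked up at once by putting them in one 1-d tensor
idxs = torch.tensor([word_to_ix["world"], word_to_ix["hello"]], dtype=torch.long)
batch = embeds(idxs)
print(batch.shape)

# Each output row is the row of the weight matrix at the requested index
same_rows = (torch.equal(batch[0], embeds.weight[word_to_ix["world"]])
             and torch.equal(batch[1], embeds.weight[word_to_ix["hello"]]))
print(same_rows)
```

Because the embeddings are ordinary parameters of the module, they are updated during training along with the weights of the other layers.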


## N-Gram Language Modeling


Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.




In [6]:
from nltk import word_tokenize as tokenize

CONTEXT_SIZE = 2  #this is the amount of preceding context to consider
EMBEDDING_DIM = 10  #this is the dimension of the embeddings
# We will use Shakespeare Sonnet 2
test_sentence = ["__END","__START"]+tokenize("""When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""")+["__END"]

# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the last 3, just so you can see what they look like
print(trigrams[-3:])



[(["feel'st", 'it'], 'cold'), (['it', 'cold'], '.'), (['cold', '.'], '__END')]
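
The trigram construction above hard-codes a context of two words. As a minimal sketch, assuming we want the same (context, target) pairs for an arbitrary amount of context, the list comprehension can be wrapped in a function. The name `make_ngrams` and the short token list are hypothetical and only for illustration.

```python
def make_ngrams(tokens, context_size):
    """Return (context, target) pairs where each context holds the
    context_size tokens that precede the target."""
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]

# A shortened, hypothetical token sequence with the same padding as above
tokens = ["__END", "__START", "When", "forty", "winters", "__END"]

print(make_ngrams(tokens, 2)[:2])
print(make_ngrams(tokens, 3)[:2])
```

With `context_size=2` this reproduces the trigram format used in this notebook. For larger contexts, the amount of padding at the start of the sequence would probably need to grow as well, so that the first real word still has a full context.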


We need to find the set of words making up the vocabulary and create the word_to_ix index.  We'll also make a reverse index ix_to_word at the same time so that we can look up a word associated with an index.

In [7]:
#find the vocabulary and create the index
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
print(word_to_ix)

{'beauty': 0, 'of': 1, 'Shall': 2, 'warm': 3, 'in': 4, 'forty': 5, 'sunken': 6, 'to': 7, '__END': 8, 'livery': 9, 'the': 10, 'This': 11, 'count': 12, ';': 13, 'deep': 14, 'and': 15, 'eyes': 16, 'If': 17, 'days': 18, 'now': 19, 'succession': 20, 'deserv': 21, 'besiege': 22, 'excuse': 23, "'s": 24, "feel'st": 25, 'brow': 26, 'his': 27, 'totter': 28, 'thou': 29, 'art': 30, 'youth': 31, 'Where': 32, 'own': 33, "'d": 34, 'held': 35, 'blood': 36, 'child': 37, 'use': 38, 'To': 39, 'all': 40, 'my': 41, 'see': 42, 'old': 43, '!': 44, 'thine': 45, 'lies': 46, 'weed': 47, '.': 48, 'And': 49, 'a': 50, 'asked': 51, 'shame': 52, 'When': 53, 'so': 54, 'couldst': 55, "'This": 56, 'shall': 57, 'all-eating': 58, 'small': 59, 'Were': 60, 'Thy': 61, 'much': 62, 'more': 63, 'within': 64, 'proud': 65, 'thy': 66, 'gazed': 67, 'were': 68, 'answer': 69, ',': 70, 'Proving': 71, 'cold': 72, 'treasure': 73, 'say': 74, 'it': 75, 'thriftless': 76, 'How': 77, 'Will': 78, 'Then': 79, 'fair': 80, "'": 81, 'being': 82,

Now we define our basic NGramLanguageModeler class, which inherits from the nn.Module class:

https://pytorch.org/docs/stable/generated/torch.nn.Module.html

Essentially, the ``__init__`` method is used to define the neural network.  We have a set of embeddings (vocab_size by embedding_dim) and then 2 linear layers.  The first (or hidden) layer has 128 neurons each with context_size * embedding_dim inputs.  The size of the second layer is equal to the vocab_size, where each neuron has 128 inputs (one from each neuron in the preceding layer).  The value at each of the neurons in this output layer will tell us the probability of each word in the vocabulary as the next word in the sequence.

The ``forward`` method runs the network in forward mode, i.e., it takes some inputs and returns some outputs.  An activation function is applied at each layer: the hidden layer applies a ReLU to each neuron, and the outputs of the output layer go through a log softmax, so the model returns log probabilities (exponentiating them gives a probability distribution over the vocabulary).

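The ``train`` method iterates over the corpus for a given number of epochs.  The word indices for the current context are passed to the model's ``forward`` method, which looks up their embeddings.  The log probability that the output assigns to the current target word is used to compute the loss (i.e., how unlikely the target word is given the current parameters).  The gradients of this loss are computed by back-propagation, and the parameters are then updated by stochastic gradient descent.  The method also prints the total loss for each epoch, so you can see whether it is decreasing.  Note that defining a method called ``train`` overrides ``nn.Module.train()``, which PyTorch uses to switch a module into training mode, so ``model.eval()`` will not work on this class as written.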
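
To make the loss computation concrete, here is a small sketch using made-up scores for a hypothetical four-word vocabulary (the numbers are not taken from the model in this notebook). It shows that the log softmax output exponentiates to a probability distribution, and that `nn.NLLLoss` simply picks out the log probability of the target index and negates it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical raw scores for a 4-word vocabulary, as a batch of one example
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])
log_probs = F.log_softmax(logits, dim=1)

# Exponentiating the log probabilities gives values that sum to 1
total_prob = log_probs.exp().sum().item()

# NLLLoss selects the log probability of the target index and negates it
target = torch.tensor([1], dtype=torch.long)
loss = nn.NLLLoss()(log_probs, target)
manual = -log_probs[0, 1]

print(total_prob, loss.item(), manual.item())
```

A low loss therefore means the model assigned a high probability to the word that actually came next.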

In [8]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)
        
        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass to compute the gradients, then update the parameters
                loss.backward()
                optimizer.step()
            
                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)


model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
model.train(trigrams)

[651.8081214427948, 647.6288683414459, 643.5061674118042, 639.439487695694, 635.428459405899, 631.4724450111389, 627.5693445205688, 623.7188656330109, 619.9197814464569, 616.1747660636902]


Now we are going to do some generation with the model.  I've added some extra methods to the class which mirror the methods we had in our n-gram language model in week 4.  See if you can work out what each step is doing in each of:
* `get_logprob()`
* `nextlikely()`
* `generate()`

In [9]:
import math,random

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        #log_probs has shape (1, vocab_size), so pick out the entry for the target word
        return log_probs[0, word_to_ix[target]].item()
        
        
    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return ix_to_word[t[0]]
    
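    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
        #drop the final "__END" only if one was generated (not when the limit was reached)
        if tokens[-1]=="__END":
            tokens=tokens[:-1]
        #drop the two padding tokens at the start
        return " ".join(tokens[2:])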
    
    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)
        
        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass to compute the gradients, then update the parameters
                loss.backward()
                optimizer.step()
            
                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)

In [10]:
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
model.train(trigrams)


[649.2758564949036, 645.5560455322266, 641.8782587051392, 638.2407884597778, 634.6414129734039, 631.0819265842438, 627.559642791748, 624.071212053299, 620.617568731308, 617.1982324123383]


In [11]:
model.get_logprob(["his","field"],".")

-4.852831840515137

In [12]:
word=model.nextlikely(["his","field"])
word

'cold'

In [15]:
model.generate()

"on : answer gazed 'd ; deserv see deserv Then Thy by of thriftless made my ,"

### Exercise 1
* Extend your class so that it can be trained on a corpus
    * you can adapt some of the code from week 4
    * however, you will need to think about the order in which things are initialised - the whole corpus will need to be read in so that the vocabulary can be determined BEFORE the neural network layers are initialised
* Train your neural language model on part of the training split of the corpus for the Microsoft Research Sentence Completion Challenge (see lab 2).
* Generate some likely sequences

Note that this will take a long time to run even if you only give it one file to process.  Reducing the size of the vocabulary (in exercise 2) will improve the run time and the ability of the model to generalise.
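
The point about initialisation order can be sketched as follows. This is only one possible arrangement, not a model answer: the class name `CorpusNGramModel` and the `token_lists` argument (a list of already tokenized lines) are assumptions made for illustration. The vocabulary is built first, and only then are the layers created, because their sizes depend on the vocabulary size.

```python
import torch.nn as nn

class CorpusNGramModel(nn.Module):  # hypothetical name, for illustration only

    def __init__(self, token_lists, embedding_dim=10, context_size=2, hidden_dim=128):
        super().__init__()
        # 1. Determine the vocabulary from the whole corpus first.
        #    Sorting gives the same word indices on every run.
        vocab = sorted({tok for toks in token_lists for tok in toks})
        self.word_to_ix = {word: i for i, word in enumerate(vocab)}
        self.ix_to_word = {i: word for word, i in self.word_to_ix.items()}

        # 2. Only now create the layers, since their sizes depend on len(vocab).
        self.embeddings = nn.Embedding(len(vocab), embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, len(vocab))

# A tiny, made-up corpus of two tokenized lines
model = CorpusNGramModel([["__START", "hello", "world", "__END"],
                          ["__START", "hello", "__END"]])
print(model.embeddings.num_embeddings, model.linear2.out_features)
```

Storing `word_to_ix` and `ix_to_word` on the model itself, rather than as global variables, is a design choice that keeps each model tied to the vocabulary it was built with.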

In [34]:

import os
TRAINING_DIR="/Users/finpearson/Desktop/Github/ANLE---Python-/Week4/sentence-completion/Holmes_Training_Data"

def get_training_testing(training_dir=TRAINING_DIR,split=0.5):

    filenames=os.listdir(training_dir)
    n=len(filenames)
    print("There are {} files in the training directory: {}".format(n,training_dir))
    random.seed(53)  #if you want the same random split every time
    random.shuffle(filenames)
    index=int(n*split)
    return(filenames[:index],filenames[index:])

def processLine(line):
    tokens=["__START"]+tokenize(line)+["__END"]
    return tokens

trainingfiles,heldoutfiles=get_training_testing()
testFile = os.path.join(TRAINING_DIR, "1ADAM10.TXT")

#read the file and tokenize it line by line, skipping empty lines
corpus_tokens = []
with open(testFile, "r") as instream:
    for line in instream:
        line=line.rstrip()
        if len(line)>0:
            corpus_tokens.extend(processLine(line))

#the vocabulary is the set of distinct tokens (not the set of lines in the file)
vocab = set(corpus_tokens)
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
#print(" ".join(corpus_tokens))
#print(word_to_ix)


There are 522 files in the training directory: /Users/finpearson/Desktop/Github/ANLE---Python-/Week4/sentence-completion/Holmes_Training_Data

### Exercise 2
* Modify your model so that all words in the vocabulary with a frequency below a threshold (e.g., 5, 10 or even 20 if you want it to run really quickly) are replaced by the "\_\_UNK" token
* Generate some likely sequences
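
One possible way to do the replacement is sketched below, assuming the corpus is held as a list of tokenized lines. The function name `replace_rare` and the toy corpus are hypothetical. Words are counted over the whole corpus first, and then every token whose count is below the threshold is swapped for the unknown-word token.

```python
from collections import Counter

def replace_rare(token_lists, threshold=5, unk="__UNK"):
    """Replace every token seen fewer than threshold times with unk."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [
        [tok if counts[tok] >= threshold else unk for tok in toks]
        for toks in token_lists
    ]

# A tiny, made-up corpus: "cat" and "dog" each occur only once
corpus = [["__START", "the", "cat", "__END"],
          ["__START", "the", "dog", "__END"]]
print(replace_rare(corpus, threshold=2))
```

The same mapping is needed at test time: any word that is not in the training vocabulary would also have to be looked up as "\_\_UNK", otherwise `word_to_ix` will raise a `KeyError`.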

### Exercise 3
* Calculate the perplexity of the test corpus according to your NLM
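
As a reminder of the calculation, perplexity is the exponential of the average negative log probability per predicted token. The sketch below assumes the log probabilities are natural logs (as returned by `log_softmax`) and have already been collected into a list, for example by calling `get_logprob()` on each (context, target) pair of the test corpus. The function name `perplexity` is hypothetical.

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of natural-log probabilities, one per predicted token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a model that is uniform over 4 words has perplexity of about 4
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))
```

The uniform case is a useful baseline: a model that has learnt anything should achieve a perplexity below the size of its vocabulary.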

### Exercise 4
* Try some different embedding sizes
* Plot a graph of perplexity against embedding size

### Exercise 5
* Extend your model so that you can consider different amounts of context.
