
# Lab 6: Neural Language Models

This week we will use the PyTorch library to build a simple feedforward neural language model.  This notebook is adapted from one of the PyTorch tutorials and includes code by Robert Guthrie as well as my own.

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#sphx-glr-beginner-nlp-word-embeddings-tutorial-py


### Word Embeddings in PyTorch

Before we get to a worked example and some exercises, a few quick notes
about how to use embeddings in PyTorch.  First, we need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two required arguments: the vocabulary size, and the
dimensionality of the embeddings.

To index into this table, you must pass a tensor of integer indices,
e.g. one created with dtype=torch.long (the indices are integers, not floats).




In [1]:
# Standard pytorch imports
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f9aa85641f0>

In [2]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


In [3]:
current_tensor = torch.tensor([word_to_ix["world"]], dtype=torch.long)
print(embeds(current_tensor))

tensor([[-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]],
       grad_fn=<EmbeddingBackward0>)
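
To make the "lookup table" description concrete, here is a minimal sketch (not part of the original tutorial code) showing that the embedding layer holds a $|V| \times D$ weight matrix and that looking up index $i$ returns row $i$ of it. The variable names `table_shape`, `both` and `world_row_matches` are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)  # |V| = 2 rows, D = 5 columns

# The lookup table is an ordinary parameter matrix of shape (|V|, D).
table_shape = tuple(embeds.weight.shape)

# Looking up several indices at once returns one row per index.
both = embeds(torch.tensor([0, 1], dtype=torch.long))

# The embedding of "world" is exactly row 1 of the weight matrix.
world_row_matches = torch.equal(both[1], embeds.weight[word_to_ix["world"]])

print(table_shape, tuple(both.shape), world_row_matches)
```

Because the rows are parameters, they are updated during training just like the weights of any other layer.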


## An Example: N-Gram Language Modeling


Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
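
For reference, the loss used in the code in this notebook (`nn.NLLLoss`) is the negative log probability of each target word given its context. Summed over the training examples, and writing $\theta$ for all the parameters of the network (my notation, not the tutorial's), the quantity being minimised is

\begin{align}\mathcal{L}(\theta) = -\sum_{i} \log P_\theta(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

With two words of context, as in the code that follows, $n = 3$ and each term is $-\log P_\theta(w_i | w_{i-1}, w_{i-2})$.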




In [4]:
from nltk import word_tokenize as tokenize

CONTEXT_SIZE = 2  #this is the amount of preceding context to consider
EMBEDDING_DIM = 10  #this is the dimension of the embeddings
# We will use Shakespeare Sonnet 2
test_sentence = ["__END","__START"]+tokenize("""When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""")+["__END"]

# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the last 3, just so you can see what they look like
print(trigrams[-3:])



[(["feel'st", 'it'], 'cold'), (['it', 'cold'], '.'), (['cold', '.'], '__END')]


We need to find the set of words making up the vocabulary and create the word_to_ix index.  We'll also make a reverse index ix_to_word at the same time so that we can look up a word associated with an index.
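
One thing to be aware of: the iteration order of a Python set of strings can differ from one interpreter session to the next, so indices built directly from a set are not guaranteed to be the same every time the notebook is run. If you ever need a reproducible index (e.g. to reload a saved model), one option is to sort the vocabulary first. The sketch below uses a small made-up token list rather than the sonnet.

```python
tokens = ["__END", "__START", "When", "forty", "winters", "When", "__END"]

# sorted() fixes the order, so the same corpus always yields the same indices.
vocab = sorted(set(tokens))
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for word, i in word_to_ix.items()}

print(word_to_ix)
```

Building `ix_to_word` by inverting `word_to_ix` also guarantees that the two dictionaries agree with each other.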

In [5]:
#find the vocabulary and create the index
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
word_to_ix

{'forty': 0,
 'thriftless': 1,
 'And': 2,
 'thine': 3,
 'sum': 4,
 'Were': 5,
 'the': 6,
 'fair': 7,
 'new': 8,
 'where': 9,
 'totter': 10,
 'eyes': 11,
 ';': 12,
 'Thy': 13,
 'When': 14,
 'his': 15,
 'Will': 16,
 'all': 17,
 'when': 18,
 'shall': 19,
 'thou': 20,
 'treasure': 21,
 '__START': 22,
 'field': 23,
 'art': 24,
 'asked': 25,
 "'d": 26,
 'child': 27,
 'youth': 28,
 'weed': 29,
 'succession': 30,
 'sunken': 31,
 "'s": 32,
 'Proving': 33,
 'being': 34,
 'worth': 35,
 '.': 36,
 'much': 37,
 'gazed': 38,
 'shame': 39,
 'livery': 40,
 'If': 41,
 "feel'st": 42,
 'Where': 43,
 'days': 44,
 'deserv': 45,
 'now': 46,
 'an': 47,
 'To': 48,
 'count': 49,
 'dig': 50,
 'lusty': 51,
 'make': 52,
 'trenches': 53,
 'all-eating': 54,
 'mine': 55,
 'be': 56,
 'more': 57,
 'in': 58,
 'This': 59,
 'old': 60,
 'deep': 61,
 'Shall': 62,
 'own': 63,
 'so': 64,
 'thy': 65,
 'to': 66,
 '!': 67,
 'a': 68,
 'excuse': 69,
 'How': 70,
 'and': 71,
 'warm': 72,
 'say': 73,
 'blood': 74,
 'proud': 75,
 'wer

Now we have our basic NGramLanguageModeler class.  It inherits from the nn.Module class:

https://pytorch.org/docs/stable/generated/torch.nn.Module.html

Essentially, the ``__init__`` method defines the neural network.  We have a set of embeddings (vocab_size by embedding_dim) followed by 2 linear layers.  The first (hidden) layer has 128 neurons, each with context_size * embedding_dim inputs.  The second (output) layer has vocab_size neurons, each with 128 inputs (one from each neuron in the preceding layer).  After normalisation, the value at each output neuron gives the log probability of the corresponding vocabulary word being the next word in the sequence.

The ``forward`` method runs the network in forward mode, i.e., gives it some inputs and gets some outputs.  An activation function is applied at each layer: the hidden layer applies a ReLU to each neuron, and the outputs of the output layer go through a log-softmax, which produces the log of a probability distribution over the vocabulary.

The ``train`` method iterates over the corpus for a given number of epochs.  The indices of the current context words are passed to the model's ``forward`` method, which looks up and concatenates their embeddings.  The negative log probability of the current target word under the model's output is used as the loss (i.e., how unlikely the target word is given the current parameters); its gradients are computed by backpropagation and the parameters are then updated by stochastic gradient descent.  The method also prints the total loss for each epoch, so you can see whether it is decreasing.  Note that defining a method called ``train`` replaces the built-in ``nn.Module.train(mode)``, so ``model.eval()`` (which calls it internally) will not work on this class.
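
To make the layer sizes described above concrete, here is a small sketch that traces the tensor shapes through one forward pass. The layers are built standalone rather than taken from the class, and the vocabulary size of 100 and the word indices 3 and 7 are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

# Illustrative sizes only; hidden_size matches the 128 used in this notebook.
vocab_size, embedding_dim, context_size, hidden_size = 100, 10, 2, 128

embeddings = nn.Embedding(vocab_size, embedding_dim)
linear1 = nn.Linear(context_size * embedding_dim, hidden_size)
linear2 = nn.Linear(hidden_size, vocab_size)

context_idxs = torch.tensor([3, 7], dtype=torch.long)  # two arbitrary word indices

looked_up = embeddings(context_idxs)       # (context_size, embedding_dim)
flat = looked_up.view((1, -1))             # (1, context_size * embedding_dim)
hidden = F.relu(linear1(flat))             # (1, hidden_size)
scores = linear2(hidden)                   # (1, vocab_size), unnormalised
log_probs = F.log_softmax(scores, dim=1)   # (1, vocab_size), log probabilities

shapes = [tuple(t.shape) for t in (looked_up, flat, hidden, scores, log_probs)]
total_prob = log_probs.exp().sum().item()  # exponentiating recovers a distribution
print(shapes, total_prob)
```

The final check illustrates why the outputs can be read as log probabilities: exponentiating them gives values that sum to 1 (up to floating-point error).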

In [6]:


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)
        
        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()
            
                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)


model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
model.train(trigrams)

[650.6992130279541, 646.4588539600372, 642.2797882556915, 638.162456035614, 634.106636762619, 630.111043214798, 626.1730239391327, 622.2942018508911, 618.4740650653839, 614.7137267589569]
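
Each number printed above is the sum, over all training trigrams, of the loss for one example. As a minimal sketch of what that per-example loss is (using random scores as a stand-in for the network's output, and a made-up target index), `nn.NLLLoss` applied to log-softmax outputs just picks out the negative log probability of the target word. It also gives the same value as `F.cross_entropy` applied to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

scores = torch.randn(1, 6)                    # stand-in for the output of linear2
target = torch.tensor([4], dtype=torch.long)  # made-up index of the correct next word

log_probs = F.log_softmax(scores, dim=1)
nll = nn.NLLLoss()(log_probs, target)

# NLLLoss simply selects -log P(target) from the log probabilities.
manual = -log_probs[0, target.item()]

# cross_entropy combines log_softmax and NLLLoss in a single call on the raw scores.
ce = F.cross_entropy(scores, target)

print(nll.item(), manual.item(), ce.item())
```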


Now, we are going to do some generation with the model.  I've added some extra methods to the class which reflect the methods we had in our ngram language model in week 4.  See if you can work out what each step is doing in each of:
* `get_logprob()`
* `nextlikely()`
* `generate()`

In [7]:
import math,random

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        target_idx=torch.tensor(word_to_ix[target],dtype=torch.long)
        return log_probs.index_select(1,target_idx).item()
        
        
    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return ix_to_word[t[0]]
    
    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
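        #drop the end marker if there is one; keep the last word if we stopped at the limit
        if tokens[-1]=="__END":
            tokens=tokens[:-1]
        return " ".join(tokens[2:])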
    
    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)
        
        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()
            
                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)

In [8]:
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
model.train(trigrams)


[656.9471893310547, 652.7292160987854, 648.5683555603027, 644.4635548591614, 640.4146265983582, 636.4171857833862, 632.468772649765, 628.5719289779663, 624.7245268821716, 620.9261801242828]
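
The query methods (`get_logprob`, `nextlikely` and `generate`) only read from the trained model, so they do not need gradients. As written they still record a computation graph on every call, which is harmless but wasteful. A minimal sketch of the usual remedy, using a small stand-in layer rather than the class above, is to wrap the forward pass in `torch.no_grad()`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

layer = nn.Linear(4, 3)        # stand-in for a trained model
inputs = torch.randn(1, 4)

# Normal call: the result is attached to the autograd graph.
tracked = F.log_softmax(layer(inputs), dim=1)

# Inside no_grad, no graph is recorded, but the values are identical.
with torch.no_grad():
    untracked = F.log_softmax(layer(inputs), dim=1)

print(tracked.requires_grad, untracked.requires_grad)
```

In the class, the same idea would mean putting the `self.forward(...)` call of each query method inside a `with torch.no_grad():` block; this is a suggestion rather than something the original code does.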


In [9]:
model.get_logprob(["his","field"],".")

-4.344061851501465

In [10]:
word=model.nextlikely(["his","field"])
word

'a'

In [12]:
model.generate()

", small answer blood warm 'This own blood count forty Thy sunken Then Shall Shall totter own"

### Exercise 1
* Train your neural language model on the training split of the corpus for the Microsoft Research Sentence Completion Challenge (see lab 2).
* Generate some likely sequences

Note that this will take a long time to run even if you only give it one file to process.  Reducing the size of the vocabulary (in exercise 2) will improve the run time and the ability of the model to generalise.
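
To get a feel for why the vocabulary size matters so much, here is a rough sketch that counts the trainable parameters of the network described earlier (embeddings plus two linear layers with a hidden size of 128). The function name `count_parameters` is my own, and 337 and 5280 are the two vocabulary sizes reported further down in this notebook.

```python
def count_parameters(vocab_size, embedding_dim=10, context_size=2, hidden_size=128):
    """Number of trainable weights in the embedding + two-linear-layer network."""
    embedding = vocab_size * embedding_dim
    linear1 = context_size * embedding_dim * hidden_size + hidden_size  # weights + biases
    linear2 = hidden_size * vocab_size + vocab_size                     # weights + biases
    return embedding + linear1 + linear2

small = count_parameters(337)
large = count_parameters(5280)
print(small, large)
```

Almost all of the parameters sit in the embedding table and the output layer, both of which grow linearly with the vocabulary. The log-softmax over the whole vocabulary is also computed for every single training example.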

In [16]:
import os
TRAINING_DIR="/Users/juliewe/Dropbox/teaching/AdvancedNLP/2024/week4/lab4/lab4resources/sentence-completion/Holmes_Training_Data"  #this needs to be the parent directory for the training corpus

def get_training_testing(training_dir=TRAINING_DIR,split=0.5):

    filenames=os.listdir(training_dir)
    n=len(filenames)
    print("There are {} files in the training directory: {}".format(n,training_dir))
    random.seed(53)  #if you want the same random split every time
    random.shuffle(filenames)
    index=int(n*split)
    return(filenames[:index],filenames[index:])

trainingfiles,heldoutfiles=get_training_testing()


There are 522 files in the training directory: /Users/juliewe/Dropbox/teaching/AdvancedNLP/2024/week4/lab4/lab4resources/sentence-completion/Holmes_Training_Data


In [17]:
len(trainingfiles)

261

In [18]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embedding_dim=embedding_dim
        self.context_size=context_size
        self.hidden_size=128
        
    def initialise(self):
        self.embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.linear1 = nn.Linear(self.context_size * self.embedding_dim, self.hidden_size)
        self.linear2 = nn.Linear(self.hidden_size, self.vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        target_idx=torch.tensor(self.word_to_ix[target],dtype=torch.long)
        return log_probs.index_select(1,target_idx).item()
        
        
    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return self.ix_to_word[t[0]]
    
    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
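        #drop the end marker if there is one; keep the last word if we stopped at the limit
        if tokens[-1]=="__END":
            tokens=tokens[:-1]
        return " ".join(tokens[2:])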
    
    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=3):
        optimizer=optim.SGD(self.parameters(),lr=lr)
        
        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([self.word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()
            
                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
            print("Completed epoch {} with loss {}".format(epoch,total_loss))
        return losses
        
    
    def train_from_corpus(self,training_dir=TRAINING_DIR,files=[]):
        alltokens=["__END"]
        #reading corpus and tokenize
        for afile in files:
            print("Reading {}".format(afile))
            try:
                with open(os.path.join(training_dir,afile)) as instream:
                    for line in instream:
                        line=line.rstrip()
                        if len(line)>0:
                            tokens=["__START"]+tokenize(line)+["__END"]
                            alltokens+=tokens
            except UnicodeDecodeError:
                print("UnicodeDecodeError reading {}: ignoring file".format(afile))
        
        
        #get the vocab and build the indexes
        self.vocab = set(alltokens)
        self.word_to_ix = {word: i for i, word in enumerate(self.vocab)}
        self.ix_to_word = {i: word for i, word in enumerate(self.vocab)}
        
        #MUST SET THE VOCAB SIZE and INITIALISE THE NN
        self.vocab_size=len(self.vocab) 
        print("Vocabulary size is {}".format(self.vocab_size))
        self.initialise()
        
        #convert to trigrams
        trigrams = [([alltokens[i], alltokens[i + 1]], alltokens[i + 2])
            for i in range(len(alltokens) - 2)]
        
        print("Starting training")
        #train using the trigrams
        self.train(trigrams)
        
        

In [19]:
MAX_FILES=1
model = NGramLanguageModeler(EMBEDDING_DIM, CONTEXT_SIZE)
model.train_from_corpus(files=trainingfiles[:MAX_FILES])

Reading 19TOM10.TXT
Vocabulary size is 5280
Starting training


KeyboardInterrupt: 

### Exercise 2
* Modify your model so that all words in the vocabulary with frequency less than a threshold (e.g., 20) are replaced by the "\_\_UNK" token
* Generate some likely sequences

In [20]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embedding_dim=embedding_dim
        self.context_size=context_size
        self.hidden_size=128
        self.threshold=20
        
    def initialise(self):
        self.embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.linear1 = nn.Linear(self.context_size * self.embedding_dim, self.hidden_size)
        self.linear2 = nn.Linear(self.hidden_size, self.vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        target_idx=torch.tensor(self.word_to_ix[target],dtype=torch.long)
        return log_probs.index_select(1,target_idx).item()
        
        
    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return self.ix_to_word[t[0]]
    
    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
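        #drop the end marker if there is one; keep the last word if we stopped at the limit
        if tokens[-1]=="__END":
            tokens=tokens[:-1]
        return " ".join(tokens[2:])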
    
    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=3):
        optimizer=optim.SGD(self.parameters(),lr=lr)
        
        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([self.word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()
            
                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
            print("Completed epoch {} with loss {}".format(epoch,total_loss))
        return losses
        
    
    def train_from_corpus(self,training_dir=TRAINING_DIR,files=[]):
        alltokens=["__END"]
        #reading corpus and tokenize
        for afile in files:
            print("Reading {}".format(afile))
            try:
                with open(os.path.join(training_dir,afile)) as instream:
                    for line in instream:
                        line=line.rstrip()
                        if len(line)>0:
                            tokens=["__START"]+tokenize(line)+["__END"]
                            alltokens+=tokens
            except UnicodeDecodeError:
                print("UnicodeDecodeError reading {}: ignoring file".format(afile))
        
        
        #get the vocab and build the indexes
        self.vocab={}
        for token in alltokens:
            self.vocab[token]=self.vocab.get(token,0)+1
            
        #delete unknown words from vocab
        unknowns=0
        for key,value in list(self.vocab.items()):
            if value < self.threshold:
                unknowns+=value
                self.vocab.pop(key,None)
        self.vocab["__UNK"]=unknowns
        
        self.word_to_ix = {word: i for i, word in enumerate(list(self.vocab.keys()))}
        self.ix_to_word = {i: word for i, word in enumerate(list(self.vocab.keys()))}
        
        #MUST SET THE VOCAB SIZE and INITIALISE THE NN
        self.vocab_size=len(self.vocab) 
        print("Vocabulary size is {}".format(self.vocab_size))
        self.initialise()
        
        #replace unknown words
        
        filteredtokens=[]
        for token in alltokens:
            if token in self.vocab.keys():
                filteredtokens.append(token)
            else:
                filteredtokens.append("__UNK")
        #convert to trigrams
        trigrams = [([filteredtokens[i], filteredtokens[i + 1]], filteredtokens[i + 2])
            for i in range(len(filteredtokens) - 2)]
        
        print("Starting training")
        #train using the trigrams
        self.train(trigrams)
        
        

In [21]:
MAX_FILES=1
model = NGramLanguageModeler(EMBEDDING_DIM, CONTEXT_SIZE)
model.train_from_corpus(files=trainingfiles[:MAX_FILES])

Reading 19TOM10.TXT
Vocabulary size is 337
Starting training
Completed epoch 0 with loss 259095.37446790934
Completed epoch 1 with loss 238583.31883164006
Completed epoch 2 with loss 231471.25745355838


In [23]:
model.generate()

'__UNK for the __UNK of not . He to the and is'

### Exercise 3
* Calculate the perplexity of the test corpus according to your NLM
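
As a starting point, recall that perplexity is the exponential of the average negative log probability per token: $PP = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | w_{i-1}, w_{i-2})\right)$. Since `log_softmax` returns natural logs, the values from `get_logprob` can be used directly. The helper below is a hypothetical sketch (the name `perplexity` is mine) that takes a plain list of log probabilities, so it can be checked without a trained model.

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of natural-log probabilities, one per predicted token."""
    if not log_probs:
        raise ValueError("need at least one log probability")
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that gives every token probability 1/4 has perplexity 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

When applying this to the test corpus, remember that words the model has not seen need to be mapped to "\_\_UNK" before calling `get_logprob`.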

### Exercise 4
* Try some different embedding sizes
* Plot a graph of perplexity against embedding size

### Exercise 5
* Extend your model so that you can consider different amounts of context.
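
One piece of this is building training examples for an arbitrary amount of context, rather than the hard-coded trigrams used so far. A minimal sketch, assuming a hypothetical helper called `make_ngrams` and a made-up token list:

```python
def make_ngrams(tokens, context_size):
    """Return (context, target) pairs with `context_size` preceding tokens each."""
    if context_size < 1:
        raise ValueError("context_size must be at least 1")
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]

tokens = ["__END", "__START", "When", "forty", "winters", "__END"]
trigrams = make_ngrams(tokens, 2)    # same shape as the trigrams used earlier
fourgrams = make_ngrams(tokens, 3)   # one more word of context
print(trigrams[0], fourgrams[0])
```

The rest of the class would need matching changes, for example `generate` would have to start from `context_size` padding tokens and pass `tokens[-context_size:]` to `nextlikely`.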
