<a href="https://colab.research.google.com/github/dbamman/nlp22/blob/main/HW4/HW_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 4: Language Modeling

In this homework, we will explore implementations of various language models we saw in lecture. We will use a dataset of movie plot summaries to learn statistics about words in our language and build our models. We will build classical n-gram models, measure perplexity, and generate text using the models.

## Setup

In [None]:
from collections import Counter
import numpy as np
import math
import tqdm.notebook  # binds `tqdm` and loads the notebook submodule used below
import random

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

import torch
from torch import nn
import torch.nn.functional as F
import torchtext.legacy as torchtext

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# download and load the data
!wget https://raw.githubusercontent.com/dbamman/nlp22/main/HW4/plot_summaries.txt
data = pd.read_csv('plot_summaries.txt', sep='\t', header=None)[1].tolist()

def make_text(data):
    all_text = []
    for d in tqdm.notebook.tqdm(data):
        all_text.append("<eos>")
        clean_text = " ".join(d.lower().split())
        all_text.extend(word_tokenize(clean_text))
        all_text.append("<eos>")
    return all_text

text = make_text(data)
train_size, validation_size = 2000000, 200000
train_text, validation_text = text[:train_size], text[train_size:train_size+validation_size]

text_field = torchtext.data.Field()
counter = Counter(train_text)
text_field.vocab = text_field.vocab_cls(counter)
vocab = text_field.vocab
vocab_size = len(vocab)

print("Number of words in the vocabulary: {}".format(vocab_size))
print("Example text: {}".format(validation_text[:30]))

--2022-02-17 00:47:01--  https://raw.githubusercontent.com/dbamman/nlp22/main/HW4/plot_summaries.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75934033 (72M) [text/plain]
Saving to: ‘plot_summaries.txt.2’


2022-02-17 00:47:01 (219 MB/s) - ‘plot_summaries.txt.2’ saved [75934033/75934033]



  0%|          | 0/42303 [00:00<?, ?it/s]

Number of words in the vocabulary: 57663
Example text: ['he', 'stumbles', 'upon', 'a', 'jailbreak', 'and', 'knocks', 'out', 'the', 'convicts', '.', 'he', 'is', 'hailed', 'a', 'hero', 'and', 'is', 'released', '.', 'outside', 'the', 'jail', ',', 'he', 'discovers', 'life', 'is', 'harsh', ',']


In [None]:
def make_dataset(text):
    a = torchtext.data.Example.fromlist([" ".join(text)], [('text', text_field)])
    return torchtext.data.Dataset([a], [('text', text_field)])

train_dataset, validation_dataset = make_dataset(train_text), make_dataset(validation_text)

In [None]:
def ids(vocab, tokens):
    """Helper function to convert a list of words into indices in our vocab"""
    return [vocab.stoi[t] for t in tokens]

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Running on {}".format(device))

Running on cuda


## Classical N-Gram Model

For this part, we will build a classical N-Gram model to learn the statistics of our training data. To train this model, we simply count the number of times each $n$-gram occurs in our training text and divide it by the number of times the first $n-1$ words occur. Additionally, we will also add alpha-smoothing to make sure no word has $0$ probability. This is summed up in this equation:

$$P(w_n|w_1 \dots w_{n-1})=\frac{C(w_1 \dots w_n)+\alpha}{C(w_1 \dots w_{n-1})+\alpha\cdot|V|}$$

where $|V|$ is the vocab size and $C$ is the count for the given n-gram.

We will handle computing the counts for you and your job will be to simply fill in the functions to compute the above equation and the perplexity for the model.
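
As a rough illustration of the smoothing equation above, the following sketch applies it to a tiny bigram table. The toy corpus, the value of `alpha`, and the `smoothed_probability` helper are made up for this example and are not part of the assignment code.

```python
from collections import Counter

# Hypothetical toy corpus; not the homework data.
tokens = ["<eos>", "the", "cat", "sat", "<eos>", "the", "dog", "sat", "<eos>"]
toy_vocab = sorted(set(tokens))  # 5 word types
alpha = 0.5

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # every token that starts a bigram

def smoothed_probability(prev, word):
    """Alpha-smoothed estimate of P(word | prev)."""
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = context_counts[prev] + alpha * len(toy_vocab)
    return numerator / denominator

p_seen = smoothed_probability("the", "cat")    # bigram observed once
p_unseen = smoothed_probability("the", "sat")  # bigram never observed
total = sum(smoothed_probability("the", w) for w in toy_vocab)
print(p_seen, p_unseen, total)
```

The unseen bigram still receives a small non-zero probability, and the probabilities over the toy vocabulary for a fixed context sum to 1, which is the same sanity check suggested for the homework model.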

In [None]:
class NGramModel:
    # Classical n-gram model that learns the statistics of the training data.
    # Training counts how often each n-gram occurs and divides by the number
    # of times its first n-1 words occur. Alpha smoothing is applied, where
    # |V| is the vocab size and C is the count of the given n-gram.
    def __init__(self, train_text, vocab, n=2, alpha=3e-3):
        # get counts and perform any other setup
        self.n = n
        self.smoothing = alpha
        self.vocab = vocab

        # count n-grams
        self.counts_n = Counter()
        curr = ["<eos>"] * (self.n - 1) # padding for initial words
        for i in range(len(train_text)):
            curr.append(train_text[i])
            gram = " ".join(curr)
            self.counts_n[gram] += 1
            curr.pop(0)

        # count n-1-gram
        self.counts_n_1 = Counter()
        for c in self.counts_n:
            temp = " ".join(c.split(" ")[:-1])
            self.counts_n_1[temp] += self.counts_n[c]

    def get_probability(self, text):
        """Return the probability of the last word in an n-gram. This is the
        equation given in the text above.
        
        Args:
            text: a list of string tokens

        Hints:
            - self.counts_n contains the number of occurrences for any string of n words
            - self.counts_n_1 contains the number of occurrences for any string of n-1 words
            - you can use the join method to create a string from a list of words
            - Remember, the given string can have more than n words
            - self.vocab contains a dictionary of the vocabulary
        """
        assert len(text) >= self.n, f"Expected at least {self.n} words; got {len(text)} words. \nGiven text: {text}"

        # BEGIN SOLUTION
        # Only the last n words matter; the first n-1 of them form the context
        ngram = " ".join(text[-self.n:])
        context = " ".join(text[-self.n:-1])
        numerator = self.counts_n[ngram] + self.smoothing
        denominator = self.counts_n_1[context] + self.smoothing * len(self.vocab)
        return numerator / denominator
        # END SOLUTION

    def get_next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary.

        Args:
            text_prefix: a list of string tokens

        Hints:
            - you need to use your get_probability function
            - self.vocab.itos contains a list of words to return probabilities for
            - you will need to handle the cases in which the text prefix is both
                shorter and longer than n-1 words. For the shorter case, you need to
                pad with "<eos>" tokens to the beginning of the text prefix. For the
                longer case, you need to truncate the text prefix to the last n-1 words
            - As a sanity check, you should make sure the probabilities you return add up to 1
        """
        # BEGIN SOLUTION
        context_size = self.n - 1
        # Copy the prefix so the caller's list is never modified
        prefix = list(text_prefix)
        # Shorter than n-1 words: pad with "<eos>" tokens at the beginning
        if len(prefix) < context_size:
            prefix = ["<eos>"] * (context_size - len(prefix)) + prefix
        # Longer than n-1 words: keep only the last n-1 (none for a unigram model)
        prefix = prefix[-context_size:] if context_size > 0 else []

        # One probability per vocabulary word, in the order of self.vocab.itos
        return [self.get_probability(prefix + [word]) for word in self.vocab.itos]
        # END SOLUTION

### Perplexity

To evaluate how good our language model is, we use a metric called perplexity. The perplexity of a language model (PP) on a test set is the inverse probability of the test set, normalized by the number of words. Let $W = w_{1}w_{2}\dots w_{N}$. Then,

$$PP(W) = \sqrt[N]{\prod_{i = 1}^{N}\frac{1}{P(w_{i}|w_{1}\dots w_{i - 1})}}$$

However, since these probabilities are often small, taking the inverse and multiplying can be numerically unstable, so we often first compute these values in the log domain and then convert back. So this equation looks like:

$$\ln PP(W) = \frac{1}{N} \sum_{i = 1}^{N} -\ln P(w_{i}|w_{1}\dots w_{i - 1})$$

$$\implies PP(W) = e^{\frac{1}{N} \sum_{i = 1}^{N} -\ln P(w_{i}|w_{1}\dots w_{i - 1})}$$
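
To make the numerical-stability point concrete, here is a small sketch that compares the direct product with the log-domain formula. The per-word probabilities are invented for illustration and do not come from any of the homework models.

```python
import math

# Hypothetical per-word probabilities, chosen only for illustration.
probs = [1e-4] * 100

# Direct product of 100 small probabilities underflows in float64.
product = 1.0
for p in probs:
    product *= p

# Log-domain version: average negative log-probability, then exponentiate.
avg_neg_log = sum(-math.log(p) for p in probs) / len(probs)
perplexity = math.exp(avg_neg_log)
print(product, perplexity)
```

Because every word here has probability $10^{-4}$, the perplexity comes out at about $10^{4}$, while the raw product has already collapsed to zero and could not be inverted.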

In [None]:
def get_perplexity(model, text):
    """Returns the perplexity of the model.
    
    Args:
        text: a list of string tokens

    Hints:
        - you need to use your model.get_probability function
        - you need to handle the edge case for the first n-1 words of text. You
            can similarly pad with "<eos>" tokens
        - you want to get the probability of each increasing sequence of words
    """
    # BEGIN SOLUTION
    # Pad the front with "<eos>" so the first n-1 words also get a full context
    padded = ["<eos>"] * (model.n - 1) + list(text)
    probabilities = []
    for i in range(len(text)):
        # n-gram ending at text[i]: the word itself and its n-1 predecessors
        probabilities.append(model.get_probability(padded[i:i + model.n]))

    return math.e ** np.mean(-np.log(probabilities))
    # END SOLUTION

unigram_model = NGramModel(train_text, vocab, n=1)
print('unigram validation perplexity:', get_perplexity(unigram_model, validation_text)) # should be around 1300-1400

bigram_model = NGramModel(train_text, vocab, n=2)
print('bigram validation perplexity:', get_perplexity(bigram_model, validation_text)) # should be around 800

trigram_model = NGramModel(train_text, vocab, n=3)
print('trigram validation perplexity:', get_perplexity(trigram_model, validation_text)) # this won't do very well (around 5000)

unigram validation perplexity: 1339.197981504648
bigram validation perplexity: 805.0404106599412
trigram validation perplexity: 5123.022246473123


### Deliverable 1

Fill in the perplexities calculated in the cell above and answer the question.

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Unigram validation perplexity: ***1339.198***

Bigram validation perplexity: ***805.040***

Trigram validation perplexity: ***5123.022***

Question: Why does the trigram model have such a high perplexity?

Answer: 

**Based on Section 3.4 of the textbook, we can attribute much of the high perplexity to sparsity. With a limited corpus, a bigram (e.g. "denied the") is observed far more often than any trigram that extends it (e.g. "denied the passage"), so many validation trigrams receive very low probabilities, or 0 without smoothing. When the two-word context itself never appears in training, the smoothed estimate reduces to the uniform $1/|V|$.**

**Another possible factor is that the corpus has less word-to-word consistency than a single-author corpus (such as the Shakespeare text whose 3-grams are shown in the textbook): the summaries were probably not all written by the same person, so the writing style differs from summary to summary.**

***

## Text Generation

### Deliverable 2

In this section, we will explore generating sentences using our models. Your job will be to simply fill in the following function. You should try out the various models you have built and compare the types of sentences they generate. These models are of course very limited compared to current state-of-the-art models, which are able to model much longer sequences of text.
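
As a minimal sketch of the sampling step, the snippet below draws words from a fixed distribution with `random.choices`. The three-word distribution is hypothetical; in the homework the weights would come from a model's next-word probabilities.

```python
import random
from collections import Counter

# Hypothetical next-word distribution; not produced by the homework models.
toy_words = ["the", "cat", "sat"]
toy_probs = [0.7, 0.2, 0.1]

rng = random.Random(0)  # private generator so the global seed is untouched
samples = rng.choices(toy_words, weights=toy_probs, k=10_000)
frequencies = Counter(samples)
print({w: frequencies[w] / len(samples) for w in toy_words})
```

With enough samples, the empirical frequencies should land close to the weights. Note that `random.choices` always returns a list, even for `k=1`.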

In [None]:
def generate_text(model, n=20, seed=0, prefix=['<eos>', '<eos>']):
    """Returns a randomly generated sentence sampled from the probability
    distribution given by a language model.

    Args:
        model: language model
        n: number of words to generate
        seed: seed for the random number generators, so samples are reproducible
        prefix: list of tokens to prompt your model
    
    Hints:
        - you need to use your model.get_next_word_probabilities function
        - you can use the random.choices function to sample from a list according to probabilities
        - model.vocab.itos contains a list of words in the vocabulary
    """
    random.seed(seed)
    np.random.seed(seed)
    # BEGIN SOLUTION
    # Copy the prefix so the caller's list (and the default argument) is not mutated
    tokens = list(prefix)
    for _ in range(n):
        probabilities = model.get_next_word_probabilities(tokens)
        # random.choices returns a list, so extend with its single sampled word
        tokens += random.choices(model.vocab.itos, probabilities, k=1)
    return " ".join(tokens)
    # END SOLUTION

In [None]:
unigram_string = generate_text(model = unigram_model, prefix=['<eos>', '<eos>'])
print(unigram_string)

<eos> <eos> portrayed minister it is finds who spy he have around suitable house in ralph attack of humanoid dragon-eye recent raping


In [None]:
bigram_string = generate_text(model = bigram_model, prefix=['<eos>', '<eos>'])
print(bigram_string)

<eos> <eos> quiet anklet giff rajaguru comparisons blackburn introducers taller upper-middle-class sympathises saloons enroute youngster girlfriend/fiancée 8pm ransacked saburo wargames look–alike rocker


In [None]:
trigram_string = generate_text(model = trigram_model, prefix=['<eos>', '<eos>'])
print(trigram_string)

<eos> <eos> zed hammering hisako gala glues consigned jasmina 410 beauvoir undergarments santander famously emphasizes half-processed abstinance vocalist scalograph warnie loxahatchee romania.der


## Optional: Neural Models

As you can see from the text you generated in the previous section, the capabilities of classical language models are quite limited. They simply learn about words in terms of counts and are limited to a fixed window of words. In this section, we will explore neural methods for language modeling, which are a bit closer to what modern language models look like.

You don't have to do anything in this section, and there is no deliverable. It is here only for you to explore how neural language models work.

### Neural N-Gram Model

Now, we will train a neural network to model our language. We will use a feedforward network that takes in the previous $n-1$ words and outputs a distribution over the vocabulary which can be used to form a probability of the next word.

We will implement this model using PyTorch and its various utilities for dataloading.
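
Before the PyTorch version, it may help to see how the training examples are formed. The sketch below pairs each token with the $n-1$ token ids before it, padding the start of the text. The `make_examples` helper, the token ids, and the use of `0` as the padding id are assumptions for illustration only.

```python
def make_examples(token_ids, n, pad_id):
    """Pair each token with the n-1 token ids that precede it."""
    padded = [pad_id] * (n - 1) + list(token_ids)
    return [(padded[i:i + n - 1], padded[i + n - 1]) for i in range(len(token_ids))]

# Hypothetical token ids; 0 stands in for the "<eos>" padding id.
examples = make_examples([5, 8, 3, 9], n=3, pad_id=0)
for context, target in examples:
    print(context, "->", target)
```

Each context has exactly $n-1$ ids, so the network always sees a fixed-size input regardless of where the token sits in the text.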

In [None]:
class NgramDataset(torch.utils.data.Dataset):
    def __init__(self, text_token_ids, n):
        self.text_token_ids = text_token_ids
        self.n = n

    def __len__(self):
        return len(self.text_token_ids)

    def __getitem__(self, i):
        if i < self.n-1:
            prev_token_ids = [vocab.stoi['<eos>']] * (self.n-i-1) + self.text_token_ids[:i]
        else:
            prev_token_ids = self.text_token_ids[i-self.n+1:i]

        x = torch.tensor(prev_token_ids)
        y = torch.tensor(self.text_token_ids[i])
        return x, y

class NeuralNGram(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.n = n

        self.linear_1 = nn.Linear((self.n - 1) * 128, 256)
        self.linear_2 = nn.Linear(256, 128)
        self.dropout = nn.Dropout(p=0.1)
        self.linear_3 = nn.Linear(128, vocab_size)

    def forward(self, x):
        """Returns a tensor of log probabilities with shape (batch, vocab_size).

        Args:
            x: tensor of input with shape (batch, n-1)
        """
        x = F.embedding(x, weight=self.linear_3.weight) # weight tying
        x = x.reshape(x.size(0), -1)
        x = self.linear_1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.linear_2(x)
        x = self.linear_3(x)
        return x

class NeuralNGramModel:
    def __init__(self, n, vocab):
        self.n = n
        self.vocab = vocab
        self.network = NeuralNGram(n).to(device)

    def train(self):
        dataset = NgramDataset(ids(self.vocab, train_text), self.n)
        train_loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
        optim = torch.optim.Adam(self.network.parameters())
        prev_validation = float("inf")

        for epoch in range(3):
            print("Epoch", epoch)
            self.network.train()

            for prev, curr in tqdm.notebook.tqdm(train_loader, leave=False):
                prev, curr = prev.to(device), curr.to(device)
                optim.zero_grad()
                output = self.network(prev)
                loss = F.cross_entropy(output, curr)
                loss.backward()
                optim.step()

            # save the model with the best validation perplexity
            validation_pp = get_perplexity(self, validation_text)
            print("Validation score:", validation_pp)

            if validation_pp < prev_validation:
                torch.save(self.network.state_dict(), "neural_language_model.pkl")
                prev_validation = validation_pp
        
        # load best saved model
        self.network.load_state_dict(torch.load("neural_language_model.pkl"))
    
    def get_probability(self, text):
        assert len(text) >= self.n, f"Expected at least {self.n} words; got {len(text)} words. \nGiven text: {text}"
        target_id = ids(self.vocab, [text[-1]])[0]
        return self.get_next_word_probabilities(text[:-1])[target_id]

    def get_next_word_probabilities(self, text_prefix):
        self.network.eval()
        while len(text_prefix) < self.n - 1:
            text_prefix = ["<eos>"] + text_prefix
        if len(text_prefix) > self.n - 1:
            text_prefix = text_prefix[len(text_prefix) - self.n + 1 :]
        x = torch.tensor(ids(self.vocab, text_prefix), dtype=torch.int64, device=device)
        x = x.reshape((1, len(x)))
        with torch.no_grad():
            probs = F.softmax(self.network(x), dim=1)[0]
        return probs.cpu()

neural_trigram_model = NeuralNGramModel(3, vocab)
neural_trigram_model.train()
print('neural trigram validation perplexity:', get_perplexity(neural_trigram_model, validation_text))

Epoch 0


  0%|          | 0/15625 [00:00<?, ?it/s]

### RNN Model

Now, we will train a Recurrent Neural Network to model our language. This is a much more flexible architecture for language modeling since it is able to handle inputs of any length and can thus model longer ranges of text.
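
The idea of handling inputs of any length can be sketched without PyTorch. The snippet below implements a bias-free Elman-style recurrence in plain Python; the `rnn_step` and `encode` helpers and the fixed weight values are hypothetical and only illustrate how one hidden state is carried across a sequence.

```python
import math

def rnn_step(x, h, w_xh, w_hh):
    """One Elman-style update: h' = tanh(W_xh x + W_hh h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_xh[j], x))
            + sum(w * hi for w, hi in zip(w_hh[j], h))
        )
        for j in range(len(h))
    ]

def encode(sequence, w_xh, w_hh, hidden_size):
    """Fold a sequence of input vectors of any length into one hidden state."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh)
    return h

# Hypothetical fixed weights (2 inputs, 2 hidden units), for illustration only.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [-0.4, 0.6]]

short = encode([[1.0, 0.0]], w_xh, w_hh, hidden_size=2)
longer = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_xh, w_hh, hidden_size=2)
print(short, longer)
```

The same weights process a one-step and a three-step input, and both produce a hidden state of the same size. This is what lets a recurrent model condition on a prefix of arbitrary length.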

In [None]:
num_hidden_rnn_layers = 1
class RNNNetwork(nn.Module):

    def __init__(self):
        super().__init__()

        self.rnn = nn.RNN(128, 128, num_hidden_rnn_layers, dropout=0.5).to(device)
        self.linear = nn.Linear(128, 128).to(device)
        self.linear_2 = nn.Linear(128, vocab_size).to(device)
        self.dp = nn.Dropout(0.5)

    def forward(self, x, state):
        x = F.embedding(x, weight=self.linear_2.weight) # weight tying
        x, state = self.rnn(x, state)
        x = self.dp(x)
        x = self.linear(x)
        x = self.linear_2(x)
        return x, state

class RNNModel:

    def __init__(self, vocab):
        self.n = 2 # makes it compatible with other n-gram models
        self.vocab = vocab
        self.network = RNNNetwork().to(device)

    def train(self):
        train_iterator = torchtext.data.BPTTIterator(train_dataset, batch_size=64, 
                                                     bptt_len=32, device=device)
  
        h = torch.zeros(num_hidden_rnn_layers, 64, self.network.rnn.hidden_size, device=device)
        optim = torch.optim.Adam(self.network.parameters())
        prev_validation = float('inf')
        for epoch in range(20):
          print('Epoch', epoch + 1)
          self.network.train()
          for batch in tqdm.notebook.tqdm(train_iterator, leave=False):
            assert self.network.training, 'make sure your network is in train mode with `.train()`'
            text, target = batch.text, batch.target
            text, target = text.to(torch.int64).to(device), target.to(torch.int64).to(device)
            optim.zero_grad()

            output, h = self.network(text, h)
            output = output.view(-1, output.shape[-1])
            target = target.view(-1,)
            loss = F.cross_entropy(output, target)
            loss.backward()
            optim.step()
            h = h.detach()

          validation_pp = get_perplexity(self, validation_text)
          print('Validation score:', validation_pp)

          if validation_pp < prev_validation:
            torch.save(self.network.state_dict(), "rnn_language_model.pkl")
            prev_validation = validation_pp

        self.network.load_state_dict(torch.load("rnn_language_model.pkl"))
    
    def get_probability(self, text):
        target_id = ids(self.vocab, [text[-1]])[0]
        return self.get_next_word_probabilities(text[-32:-1])[target_id]

    def get_next_word_probabilities(self, text_prefix):
        prefix_token_tensor = torch.tensor(ids(self.vocab, text_prefix), dtype=torch.int64, device=device).view(-1, 1)
        # Start each evaluation from a zero hidden state
        h = torch.zeros(num_hidden_rnn_layers, 1, self.network.rnn.hidden_size, device=device)
        self.network.eval()
        with torch.no_grad():
            output, _ = self.network(prefix_token_tensor, h)
            output = output.squeeze(dim=1)
            probs = F.softmax(output, dim=-1)
        # Move to the CPU so callers can pass the result to NumPy
        return probs[-1].cpu()

rnn_model = RNNModel(vocab)
rnn_model.train()
print('rnn validation perplexity:', get_perplexity(rnn_model, validation_text))

## Submission

Congrats on making it to the end of the notebook. Please download this notebook and upload it to Gradescope. Make sure all of the cells where your answers are expected are filled in.