## Project 1b: Language Modeling

In this project, you will implement several different types of language models for text.  We'll start with n-gram models, then move on to neural n-gram and LSTM language models.

Warning: Do not start this project the day before it is due!  Some parts require 20 minutes or more to run, so debugging and tuning can take a significant amount of time.

Our dataset for this project will be the Penn Treebank language modeling dataset.  This dataset comes with some of the basic preprocessing done for us, such as tokenization and rare word filtering (using the `<unk>` token).
Therefore, we can assume that all word types in the test set also appear at least once in the training set.
We'll also use the `torchtext` library to help with some of the data preprocessing, such as converting tokens into id numbers.

In [1]:
# Some of the functions below require an older version of torchtext than the default one Kaggle gives you.
# IMPORTANT: Make sure that Internet is turned on!!! (Notebook options in the bar on the right)
# IMPORTANT: If you're not already using Kaggle, we STRONGLY recommend you switch to Kaggle for hw1b in particular,
# because copying our notebook will pin you to a Python version that lets you install the right version of torchtext.
# On Colab you will have to downgrade your Python to e.g., 3.7 to do the below pip install, which is a pain to do.
!pip install torchtext==0.10.0
exit()

Collecting torchtext==0.10.0
  Downloading torchtext-0.10.0-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m41.7 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m831.4/831.4 MB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.11.0
    Uninstalling torch-1.11.0:
      Successfully uninstalled torch-1.11.0
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.12.0
    Uninstalling torchtext-0.12.0:
      Successfully uninstalled torchtext-0.12.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following depe

In [2]:
# Copy wikitext-2 dataset folder to /kaggle/working
import shutil
shutil.copytree('/kaggle/input/wikitext-2', '/kaggle/working/wikitext-2')

'/kaggle/working/wikitext-2'

In [2]:
# This block handles some basic setup and data loading.  
# You shouldn't need to edit this, but if you want to 
# import other standard python packages, that is fine.

# imports
from collections import defaultdict, Counter
import numpy as np
import math
import tqdm
import random
import pdb

import torch
from torch import nn
import torch.nn.functional as F
import torchtext
import torch.optim as optim

from torchtext.legacy import data
from torchtext.legacy import datasets

# download and load the data
text_field = data.Field()
datasets = datasets.WikiText2.splits(root='.', text_field=text_field)
train_dataset, validation_dataset, test_dataset = datasets

text_field.build_vocab(train_dataset, validation_dataset, test_dataset)
vocab = text_field.vocab
vocab_size = len(vocab)

train_text = train_dataset.examples[0].text # a list of tokens (strings)
validation_text = validation_dataset.examples[0].text

print(validation_text[:30])

['<eos>', '=', 'Homarus', 'gammarus', '=', '<eos>', '<eos>', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', '<unk>', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean']


In [3]:
print(validation_text[:300])
text_field.vocab.freqs['.']

['<eos>', '=', 'Homarus', 'gammarus', '=', '<eos>', '<eos>', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', '<unk>', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean', ',', 'Mediterranean', 'Sea', 'and', 'parts', 'of', 'the', 'Black', 'Sea', '.', 'It', 'is', 'closely', 'related', 'to', 'the', 'American', 'lobster', ',', 'H.', 'americanus', '.', 'It', 'may', 'grow', 'to', 'a', 'length', 'of', '60', 'cm', '(', '24', 'in', ')', 'and', 'a', 'mass', 'of', '6', 'kilograms', '(', '13', 'lb', ')', ',', 'and', 'bears', 'a', 'conspicuous', 'pair', 'of', 'claws', '.', 'In', 'life', ',', 'the', 'lobsters', 'are', 'blue', ',', 'only', 'becoming', '"', 'lobster', 'red', '"', 'on', 'cooking', '.', 'Mating', 'occurs', 'in', 'the', 'summer', ',', 'producing', 'eggs', 'which', 'are', 'carried', 'by', 'the', 'females', 'for', 'up', 'to', 'a', 'year', 'before', 'hatching', 'into', '<unk>', 'larvae', '.', 'Homarus'

90077

We've implemented a unigram model here as a demonstration.

In [8]:
class UnigramModel:
    def __init__(self, train_text):
        self.counts = Counter(train_text)
        self.total_count = len(train_text)

    def probability(self, word):
        return self.counts[word] / self.total_count

    def next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary."""
        return [self.probability(word) for word in vocab.itos]

    def perplexity(self, full_text):
        """Return the perplexity of the model on a text as a float.
        
        full_text -- a list of string tokens
        """
        log_probabilities = []
        for word in full_text:
            # Note that the base of the log doesn't matter 
            # as long as the log and exp use the same base.
            log_probabilities.append(math.log(self.probability(word), 2))
        return 2 ** -np.mean(log_probabilities)

unigram_demonstration_model = UnigramModel(train_text)
print('unigram validation perplexity:', 
      unigram_demonstration_model.perplexity(validation_text))

def check_validity(model):
    """Performs several sanity checks on your model:
    1) That next_word_probabilities returns a valid distribution
    2) That perplexity matches a perplexity calculated from next_word_probabilities

    Although it is possible to calculate perplexity from next_word_probabilities, 
    it is still good to have a separate more efficient method that only computes 
    the probabilities of observed words.
    """

    log_probabilities = []
    for i in range(10):
        prefix = validation_text[:i]
        probs = model.next_word_probabilities(prefix)
        assert min(probs) >= 0, "Negative value in next_word_probabilities"
        assert max(probs) <= 1 + 1e-8, "Value larger than 1 in next_word_probabilities"
        assert abs(sum(probs)-1) < 1e-4, "next_word_probabilities do not sum to 1"

        word_id = vocab.stoi[validation_text[i]]
        selected_prob = probs[word_id]
        log_probabilities.append(math.log(selected_prob))

    perplexity = math.exp(-np.mean(log_probabilities))
    your_perplexity = model.perplexity(validation_text[:10])
    assert abs(perplexity-your_perplexity) < 0.1, "your perplexity does not " + \
    "match the one we calculated from `next_word_probabilities`,\n" + \
    "at least one of `perplexity` or `next_word_probabilities` is incorrect.\n" + \
    f"we calcuated {perplexity} from `next_word_probabilities`,\n" + \
    f"but your perplexity function returned {your_perplexity} (on a small sample)."


check_validity(unigram_demonstration_model)

unigram validation perplexity: 965.0860734119312


To generate from a language model, we can sample one word at a time conditioning on the words we have generated so far.

In [9]:
def generate_text(model, n=20, prefix=('<eos>', '<eos>')):
    prefix = list(prefix)
    for _ in range(n):
        probs = model.next_word_probabilities(prefix)
        word = random.choices(vocab.itos, probs)[0]
        prefix.append(word)
    return ' '.join(prefix)

print(generate_text(unigram_demonstration_model))

<eos> <eos> last balance Fear : 1998 At where <unk> latter Official video . army April Visions included <eos> UEFA confronted Gielgud


In fact there are many strategies to get better-sounding samples, such as only sampling from the top-k words or sharpening the distribution with a temperature.  You can read more about sampling from a language model in this recent paper: https://arxiv.org/pdf/1904.09751.pdf.

In [9]:
import numpy as np
import random

def nucleus_sampling(probs, p=0.95):
    sorted_indices = np.argsort(probs)[::-1]
    cumulative_probs = np.cumsum(np.sort(probs)[::-1])
    cutoff_index = np.searchsorted(cumulative_probs, p)
    nucleus_indices = sorted_indices[:cutoff_index + 1]
    nucleus_probs = np.array([probs[i] for i in nucleus_indices])
    nucleus_probs /= np.sum(nucleus_probs)  # Normalize
    return nucleus_indices, nucleus_probs  # Return both the indices and the corresponding probabilities

def generate_text_with_nucleus_sampling(model, n=20, prefix=('<eos>', '<eos>')):
    prefix = list(prefix)
    for _ in range(n):
        probs = model.next_word_probabilities(prefix)
        nucleus_indices, nucleus_probs = nucleus_sampling(probs, p=0.95)
        selected_index = random.choices(nucleus_indices, nucleus_probs)[0]
        word = vocab.itos[selected_index]
        prefix.append(word)
    return ' '.join(prefix)

# Assuming vocab and model are properly defined
print(generate_text_with_nucleus_sampling(unigram_demonstration_model))


<eos> <eos> U2 by episode two The against around and @-@ erosion for = a SR result predators The that . and


You will need to submit some outputs from the models you implement for us to grade.  The following function will be used to generate the required output files.

In [4]:
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes_short.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab_short.txt

def save_truncated_distribution(model, filename, short=True):
    """Generate a file of truncated distributions.
    
    Probability distributions over the full vocabulary are large,
    so we will truncate the distribution to a smaller vocabulary.

    Please do not edit this function
    """
    vocab_name = 'eval_output_vocab'
    prefixes_name = 'eval_prefixes'

    if short: 
      vocab_name += '_short'
      prefixes_name += '_short'

    with open('{}.txt'.format(vocab_name), 'r') as eval_vocab_file:
        eval_vocab = [w.strip() for w in eval_vocab_file]
    eval_vocab_ids = [vocab.stoi[s] for s in eval_vocab]

    all_selected_probabilities = []
    with open('{}.txt'.format(prefixes_name), 'r') as eval_prefixes_file:
        lines = eval_prefixes_file.readlines()
        for line in tqdm.tqdm_notebook(lines, leave=False):
        # Compatible Save with Trigram Backoff model
        # for line in tqdm(lines, leave=False):
            prefix = line.strip().split(' ')
            probs = model.next_word_probabilities(prefix)
            selected_probs = np.array([probs[i] for i in eval_vocab_ids], dtype=np.float32)
            all_selected_probabilities.append(selected_probs)

    all_selected_probabilities = np.stack(all_selected_probabilities)
    np.save(filename, all_selected_probabilities)
    print('saved', filename)

--2024-09-13 16:12:10--  https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes.txt
Resolving cal-cs288.github.io (cal-cs288.github.io)... 185.199.111.153, 185.199.110.153, 185.199.108.153, ...
Connecting to cal-cs288.github.io (cal-cs288.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 519055 (507K) [text/plain]
Saving to: ‘eval_prefixes.txt’


2024-09-13 16:12:11 (3.17 MB/s) - ‘eval_prefixes.txt’ saved [519055/519055]

--2024-09-13 16:12:12--  https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab.txt
Resolving cal-cs288.github.io (cal-cs288.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to cal-cs288.github.io (cal-cs288.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12497 (12K) [text/plain]
Saving to: ‘eval_output_vocab.txt’


2024-09-13 16:12:12 (12.3 MB/s) - ‘eval_output_vocab.txt’ saved [12497/12497]

--2024-09-13

### N-gram Model

Now it's time to implement an n-gram language model.

Because not every n-gram will have been observed in training, use add-alpha smoothing to make sure no output word has probability 0.

$$P(w_2|w_1)=\frac{C(w_1,w_2)+\alpha}{C(w_1)+N\alpha}$$

where $N$ is the vocab size and $C$ is the count for the given bigram.  An alpha value around `3e-3`  should work.  Later, we'll replace this smoothing with model backoff.

One edge case you will need to handle is at the beginning of the text where you don't have `n-1` prior words.  You can handle this however you like as long as you produce a valid probability distribution, but just using a uniform distribution over the vocabulary is reasonable for the purposes of this project.

A properly implemented bi-gram model should get a perplexity below 510 on the validation set.

**Note**: Do not change the signature of the `next_word_probabilities` and `perplexity` functions.  We will use these as a common interface for all of the different model types.  Make sure these two functions call `n_gram_probability`, because later we are going to override `n_gram_probability` in a subclass. 
Also, we suggest pre-computing and caching the counts $C$ when you initialize `NGramModel` for efficiency. 

In [9]:
save_truncated_distribution(unigram_demonstration_model, 
                            'unigram_demonstration_predictions.npy')

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/1000 [00:00<?, ?it/s]

saved unigram_demonstration_predictions.npy


In [5]:
class NGramModel:
    def __init__(self, train_text, n=2, alpha=3e-3):
        # get counts and perform any other setup
        self.n = n
        self.smoothing = alpha
        self.train_text = train_text

        # YOUR CODE HERE

        # Create (n-1)-grams and n-grams counts
        if n > 1:
            self.n_gram_counts = Counter(tuple(train_text[i:i+n]) for i in range(len(train_text) - n + 1))
            self.n_minus_1_gram_counts = Counter(tuple(train_text[i:i+n-1]) for i in range(len(train_text) - n + 2))
        else:
            self.n_gram_counts = Counter(train_text)
            self.n_minus_1_gram_counts = {}
        
        self.vocab_size = len(set(train_text))
        self.total_count = len(train_text)

    def n_gram_probability(self, n_gram):
        """Return the probability of the last word in an n-gram.
        
        n_gram -- a list of string tokens
        returns the conditional probability of the last token given the rest.
        """
        assert len(n_gram) == self.n
        
        # YOUR CODE HERE
        if self.n > 1:
            n_minus_1_gram_tuple = tuple(n_gram[:-1])
            n_gram_tuple = tuple(n_gram)
            n_gram_count = self.n_gram_counts[n_gram_tuple]
            n_minus_1_gram_count = self.n_minus_1_gram_counts.get(n_minus_1_gram_tuple, 0)
            probability = (n_gram_count + self.smoothing) / (n_minus_1_gram_count + self.vocab_size * self.smoothing)
        else:
            n_gram_count = self.n_gram_counts[n_gram[0]]
            probability = n_gram_count / self.total_count
        return probability
        
        

    def next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary."""

        # YOUR CODE HERE
        # use your function n_gram_probability
        # vocab.itos contains a list of words to return probabilities for
        if len(text_prefix) < self.n - 1:
            # Use a uniform distribution over vocabulary
            return [1/self.vocab_size] * self.vocab_size
        
        # Extract the last n-1 words from text prefix 
        prefix = text_prefix[-(self.n - 1):] if self.n > 1 else []
        
        # Calculate the probability for each word in vocabulary
        probabilities = []
        for word in vocab.itos:
            n_gram = prefix + [word]
            probabilities.append(self.n_gram_probability(n_gram))

        
        return probabilities
        
        
    def perplexity(self, full_text):
        """ full_text is a list of string tokens
        return perplexity as a float """

        # YOUR CODE HERE
        # use your function n_gram_probability
        # This method should differ a bit from the example unigram model because 
        # the first n-1 words of full_text must be handled as a special case.
        log_probabilities = []
        n_minus_1 = self.n - 1 
        
        if self.n > 1:
            for i in range(n_minus_1):
                log_probabilities.append(math.log(1/self.vocab_size, 2))
            
            for i in range(n_minus_1, len(full_text)):
                n_gram = full_text[i-n_minus_1:i+1]
                probability = self.n_gram_probability(n_gram)
                log_probabilities.append(math.log(probability, 2))
                if i % 5000 == 0 and i > 0:
                    print(f'Processed {i} tokens out of {len(full_text)}')
        else:
            for word in full_text:
                probability = self.n_gram_probability([word])
                log_probabilities.append(math.log(probability, 2))
                

        
        return 2 ** -np.mean(log_probabilities)
        


unigram_model = NGramModel(train_text, 1)
# check_validity(unigram_model)
# print('unigram validation perplexity:', unigram_model.perplexity(validation_text)) # this should be the almost the same as our unigram model perplexity above

# bigram_model = NGramModel(train_text, n=2)
# check_validity(bigram_model)
# print('bigram validation perplexity:', bigram_model.perplexity(validation_text))

# trigram_model = NGramModel(train_text, n=3)
# check_validity(trigram_model)
# print('trigram validation perplexity:', trigram_model.perplexity(validation_text)) # this won't do very well...

# save_truncated_distribution(bigram_model, 'bigram_predictions.npy') # this might take a few minutes

Please download `bigram_predictions.npy` once you finish this section so that you can submit it.

In the block below, please report your bigram validation perplexity.  (We will use this to help us calibrate our scoring on the test set.)

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Bigram validation perplexity: ***504.42630378506976***

We can also generate samples from the model to get an idea of how it is doing.

In [12]:
print(generate_text(bigram_model))

<eos> <eos> = Football Union annexed prevails pédalier shop King Features petrol leading proponents fishery 1726 700 inventory appropriate for her "


We now free up some RAM, **it is important to run the cell below, otherwise you may quite possibly run out of RAM in the runtime.**

In [20]:
# Free up some RAM. 
del bigram_model
del trigram_model

This basic model works okay for bigrams, but a better strategy (especially for higher-order models) is to use backoff.  Implement backoff with absolute discounting.
$$P\left(w_i|w_{i-n+1:i-1}\right)=\frac{max\left\{C(w_{i-n+1:i})-\delta,0\right\}}{\sum_{w' \in V} C(w_{i-n+1:i-1}, w')} + \alpha(w_{i-n+1:i-1}) P(w_i|w_{i-n+2:i-1})$$

$$\alpha\left(w_{i-n+1:i-1}\right)=\frac{\delta N_{1+}(w_{i-n+1:i-1})}{{\sum_{w' \in V} C(w_{i-n+1:i-1}, w')}}$$
where $V$ is the vocab and $N_{1+}$ is the number of words that appear after the previous $n-1$ words (the number of times the max will select something other than 0 in the first equation). $w_{i-n+1:i}$ denotes the $n$-gram starting at $w_{i-n+1}$ and ending at $w_i$, and $(w_{i-n+1:i-1}, w')$ denotes the n-gram containing the previous $n-1$ words followed by $w'$. If $\sum_{w' \in V} C(w_{i-n+1:i-1}, w')=0$, use the lower order model probability directly (the above equations would have a division by 0).

We found a discount $\delta$ of 0.9 to work well based on validation performance.  A trigram model with this discount value should get a validation perplexity below 275.

In [11]:
from collections import Counter
from tqdm import tqdm  # Progress bar for large loops

In [15]:
class DiscountBackoffModel(NGramModel):
    def __init__(self, train_text, lower_order_model, n=2, delta=0.9):
        super().__init__(train_text, n=n)
        self.lower_order_model = lower_order_model
        self.discount = delta

        # YOUR CODE HERE
        
        print(f"Training {n}-gram model...")

        self.n_gram_counts = Counter(tuple(train_text[i:i+n]) for i in range(len(train_text) - n + 1))
        self.n_minus_1_gram_counts = Counter(tuple(train_text[i:i+n-1]) for i in range(len(train_text) - n + 2))
        
        self.vocab = list(set(train_text))
        
        print("Precomputing distinct following word counts...")
        # Precompute the number of distinct words that follow each (n-1)-gram
        self.following_words_count = {
            ngram: len([w for w in self.vocab if ngram + (w,) in self.n_gram_counts]) 
            for ngram in tqdm(self.n_minus_1_gram_counts, desc="Precomputing following word counts", position=0, leave=True)
        }
        
        
    def n_gram_probability(self, n_gram):
        assert len(n_gram) == self.n

        # YOUR CODE HERE
        # back off to the lower_order model with n'=n-1 using its n_gram_probability function
        # Get the count of the full n-gram and the (n-1)-gram prefix
        
        n_minus_1_gram_tuple = tuple(n_gram[:-1])
        n_gram_tuple = tuple(n_gram)

        # Get the count of the full n-gram and the (n-1)-gram prefix
        n_gram_count = self.n_gram_counts.get(n_gram_tuple, 0)
        n_minus_1_gram_count = self.n_minus_1_gram_counts.get(n_minus_1_gram_tuple, 0)
        

        # Calculate alpha for discounting
        following_words_count = self.following_words_count.get(n_minus_1_gram_tuple, 0)
        alpha = (self.discount * following_words_count) / n_minus_1_gram_count if n_minus_1_gram_count > 0 else 1

        # Compute the discounted probability or back off to the lower-order model
        return ((max(n_gram_count - self.discount, 0) / n_minus_1_gram_count) if n_minus_1_gram_count > 0 else 0) + alpha * self.lower_order_model.n_gram_probability(n_gram[1:])
    


bigram_backoff_model = DiscountBackoffModel(train_text, unigram_model, 2)
trigram_backoff_model = DiscountBackoffModel(train_text, bigram_backoff_model, 3)
check_validity(trigram_backoff_model)
print('trigram backoff validation perplexity:', trigram_backoff_model.perplexity(validation_text))

Training 2-gram model...
Precomputing distinct following word counts...


Precomputing following word counts: 100%|██████████| 33278/33278 [06:56<00:00, 79.95it/s]


Training 3-gram model...
Precomputing distinct following word counts...


Precomputing following word counts: 100%|██████████| 619592/619592 [2:10:00<00:00, 79.43it/s]  


Processed 5000 tokens out of 217646
Processed 10000 tokens out of 217646
Processed 15000 tokens out of 217646
Processed 20000 tokens out of 217646
Processed 25000 tokens out of 217646
Processed 30000 tokens out of 217646
Processed 35000 tokens out of 217646
Processed 40000 tokens out of 217646
Processed 45000 tokens out of 217646
Processed 50000 tokens out of 217646
Processed 55000 tokens out of 217646
Processed 60000 tokens out of 217646
Processed 65000 tokens out of 217646
Processed 70000 tokens out of 217646
Processed 75000 tokens out of 217646
Processed 80000 tokens out of 217646
Processed 85000 tokens out of 217646
Processed 90000 tokens out of 217646
Processed 95000 tokens out of 217646
Processed 100000 tokens out of 217646
Processed 105000 tokens out of 217646
Processed 110000 tokens out of 217646
Processed 115000 tokens out of 217646
Processed 120000 tokens out of 217646
Processed 125000 tokens out of 217646
Processed 130000 tokens out of 217646
Processed 135000 tokens out of 2

In [22]:
check_validity(trigram_backoff_model)
# save_truncated_distribution(bigram_backoff_model, 'bigram_backoff_model.npy') # this might take a few minutes
# save_truncated_distribution(trigram_backoff_model, 'trigram_backoff_model.npy') # this might take a few minutes

In [None]:
# Release models we don't need any more. 
del unigram_model
del bigram_backoff_model
del trigram_backoff_model

Free up RAM. 

Fill in your trigram backoff perplexity here.

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Trigram backoff validation perplexity: ***271.13***



Free up RAM. 

### Neural N-gram Model

In this section, you will implement a neural version of an n-gram model.  The model will use a simple feedforward neural network that takes the previous `n-1` words and outputs a distribution over the next word.

You will use PyTorch to implement the model.  We've provided a little bit of code to help with the data loading using PyTorch's data loaders (https://pytorch.org/docs/stable/data.html)

A model with the following architecture and hyperparameters should reach a validation perplexity below 226.
* embed the words with dimension 128, then flatten into a single embedding for $n-1$ words (with size $(n-1)*128$)
* run 2 hidden layers with 1024 hidden units, then project down to size 128 before the final layer (ie. 4 layers total). 
* use weight tying for the embedding and final linear layer (this made a very large difference in our experiments); you can do this by creating the output layer with `nn.Linear`, then using `F.embedding` with the linear layer's `.weight` to embed the input
* rectified linear activation (ReLU) and dropout 0.1 after first 2 hidden layers. **Note: You will likely find a performance drop if you add a nonlinear activation function after the dimension reduction layer.**
* train for 10 epochs with the Adam optimizer (should take around 15-20 minutes)
* do early stopping based on validation set perplexity (see Project 0)


We encourage you to try other architectures and hyperparameters, and you will likely find some that work better than the ones listed above.  A proper implementation with these should be enough to receive full credit on the assignment, though.

In [6]:
def ids(tokens):
    return [vocab.stoi[t] for t in tokens]

assert torch.cuda.is_available(), "no GPU found; in Colab go to 'Edit->Notebook settings' and choose a GPU hardware accelerator; \n in Kaggle go to 'Settings->Accelerator' and choose a GPU hardware accelerator"

class NeuralNgramDataset(torch.utils.data.Dataset):
    def __init__(self, text_token_ids, n):
        self.text_token_ids = text_token_ids
        self.n = n

    def __len__(self):
        return len(self.text_token_ids)

    def __getitem__(self, i):
        if i < self.n-1:
            prev_token_ids = [vocab.stoi['<eos>']] * (self.n-i-1) + self.text_token_ids[:i]
        else:
            prev_token_ids = self.text_token_ids[i-self.n+1:i]

        assert len(prev_token_ids) == self.n-1

        x = torch.tensor(prev_token_ids)
        y = torch.tensor(self.text_token_ids[i])
        return x, y

class NeuralNGramNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self, n):
        super().__init__()
        self.n = n

        # YOUR CODE HERE
        self.embedding_dim = 128
        self.hidden_dim = 1024
        self.vocab_size = vocab_size
        
        
        # Word embeddings
        self.embedding = nn.Embedding(vocab_size, self.embedding_dim)
        # Two hidden layers
        self.hidden_layer1 = nn.Linear((n-1) * self.embedding_dim, self.hidden_dim)
        self.hidden_layer2 = nn.Linear(self.hidden_dim, self.hidden_dim)
        # Projection down to 128
        self.projection = nn.Linear(self.hidden_dim, self.embedding_dim)
        # Output layer with weight tying (bias = False to avoid breaking embedding-output tying)
        self.output_layer = nn.Linear(self.embedding_dim, vocab_size, bias=False)
        # Dropout
        self.dropout = nn.Dropout(0.1)
        
        
        


    def forward(self, x):
        # x is a tensor of inputs with shape (batch, n-1)
        # this function returns a tensor of log probabilities with shape (batch, vocab_size)

        # YOUR CODE HERE
        # Naive Embedding "embedded = self.embedding(x)" provide worse result 271 Validation PPL
        
        # Step 1: Embed the input tokens using the output layer's weight
        embedded = F.embedding(x, self.output_layer.weight)  # Weight tying applied here
        # Step 2: Flatten the embeddings
        embedded = embedded.view(embedded.size(0), -1)  # (batch_size, (n-1) * embedding_dim)
        
        # Step 3: First hidden layer with ReLU and dropout
        h1 = F.relu(self.hidden_layer1(embedded))
        h1 = self.dropout(h1)
        
        h2 = F.relu(self.hidden_layer2(h1))
        h2 = self.dropout(h2)
        
        # Step 5: Project down to 128 dimensions
        projected = self.projection(h2)
        
        # Step 6: Output layer (using weight tying)
        output = F.linear(projected, self.output_layer.weight)
        
        return F.log_softmax(output, dim = -1)
        


class NeuralNGramModel:
    # a class that wraps NeuralNGramNetwork to handle training and evaluation
    # it's ok if this doesn't work for unigram modeling
    def __init__(self, n):
        self.n = n
        self.network = NeuralNGramNetwork(n).cuda()
        # YOUR CODE HERE
        self.criterion = nn.NLLLoss()
        self.optimizer = optim.Adam(self.network.parameters())

    def train(self):
        dataset = NeuralNgramDataset(ids(train_text), self.n)
        train_loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
        # iterating over train_loader with a for loop will return a 2-tuple of batched tensors
        # the first tensor will be previous token ids with size (batch, n-1),
        # and the second will be the current token id with size (batch, )
        # you will need to move these tensors to GPU, e.g. by using the Tensor.cuda() function.

        # this will take some time to run; use tqdm.tqdm_notebook to get a progress bar 
        # (see Project 1a for example)
        
        best_validation_score = float('inf')
        patience = 1
        epochs_no_improve = 0
        best_model_path = 'best_model.pth'

        # YOUR CODE HERE
        for epoch in range(10):
            self.network.train()
            total_loss = 0 
            for x_batch, y_batch in tqdm.tqdm_notebook(train_loader):
                x_batch, y_batch = x_batch.cuda(), y_batch.cuda()
                # Forward pass
                output = self.network(x_batch)
                # Compute loss
                loss = self.criterion(output, y_batch)
                # Backward pass and optimization
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                
                total_loss += loss.item()
            
            print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}')
            
            # Compute validation perplexity
            val_perplexity = self.perplexity(validation_text)
            print(f'Epoch {epoch+1}, Validation Perplexity: {val_perplexity}')
            
            # Early stopping check - stop when 3 no-improve on validation
            if val_perplexity < best_validation_score: 
                best_validation_score = val_perplexity
                torch.save(self.network.state_dict(), best_model_path)
                print(f'New best model saved with validation perplexity: {val_perplexity}')
                epochs_no_improve = 0 
            else: 
                epochs_no_improve += 1
                print(f'No improvement in validation perplexity for {epochs_no_improve} epoch(s).')
                if epochs_no_improve > patience:
                    print('Early Stopping Trigger')
                    break
                    


    def next_word_probabilities(self, text_prefix):
        # YOUR CODE HERE
        # don't forget self.network.eval()
        # you will need to convert text_prefix from strings to numbers with the `ids` function
        # if your `perplexity` function below is based on a NeuralNgramDataset DataLoader, you will need to use the same strategy for prefixes with less than n-1 tokens to pass the validity check
        #   the data loader appends extra "<eos>" (end of sentence) tokens to the start of the input so there are always enough to run the network
        self.network.eval()
        token_ids = ids(text_prefix)  # Convert text prefix to token ids
        
        # Pad with <eos> if the prefix length is less than n-1
        if len(token_ids) < self.n-1:
            token_ids = [vocab.stoi['<eos>']] * (self.n - 1 - len(token_ids)) + token_ids
        
        x = torch.tensor(token_ids[-(self.n-1):]).unsqueeze(0).cuda()
        with torch.no_grad():
            log_probs = self.network(x)
        
        return torch.exp(log_probs[0]).cpu().numpy()

    def perplexity(self, text):
        # you may want to use a DataLoader here with a NeuralNgramDataset
        # don't forget self.network.eval()

        # YOUR CODE HERE
        self.network.eval()
        dataset = NeuralNgramDataset(ids(text), self.n)
        data_loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False)
        
        total_log_prob = 0
        total_words = 0
        
        with torch.no_grad():
            for x_batch, y_batch in data_loader:
                x_batch, y_batch = x_batch.cuda(), y_batch.cuda()
                log_probs = self.network(x_batch)
                
                # Gather the log probabilities of the true words
                total_log_prob += F.nll_loss(log_probs, y_batch, reduction='sum').item()
                total_words += y_batch.size(0)
        
        perplexity = torch.exp(torch.tensor(total_log_prob) / total_words)
        return perplexity.item()
        



neural_trigram_model = NeuralNGramModel(3)
# check_validity(neural_trigram_model)
neural_trigram_model.train()
print('neural trigram validation perplexity:', neural_trigram_model.perplexity(validation_text))

save_truncated_distribution(neural_trigram_model, 'neural_trigram_predictions.npy', short=False)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/16318 [00:00<?, ?it/s]

Epoch 1, Loss: 5.862605750407347
Epoch 1, Validation Perplexity: 234.15843200683594
New best model saved with validation perplexity: 234.15843200683594


  0%|          | 0/16318 [00:00<?, ?it/s]

Epoch 2, Loss: 5.370933127762684
Epoch 2, Validation Perplexity: 213.89324951171875
New best model saved with validation perplexity: 213.89324951171875


  0%|          | 0/16318 [00:00<?, ?it/s]

Epoch 3, Loss: 5.199152794432473
Epoch 3, Validation Perplexity: 211.20938110351562
New best model saved with validation perplexity: 211.20938110351562


  0%|          | 0/16318 [00:00<?, ?it/s]

Epoch 4, Loss: 5.093588398826343
Epoch 4, Validation Perplexity: 207.04798889160156
New best model saved with validation perplexity: 207.04798889160156


  0%|          | 0/16318 [00:00<?, ?it/s]

Epoch 5, Loss: 5.019865160991286
Epoch 5, Validation Perplexity: 208.89659118652344
No improvement in validation perplexity for 1 epoch(s).


  0%|          | 0/16318 [00:00<?, ?it/s]

Epoch 6, Loss: 4.963196056586648
Epoch 6, Validation Perplexity: 211.49327087402344
No improvement in validation perplexity for 2 epoch(s).
Early Stopping Trigger
neural trigram validation perplexity: 211.49327087402344


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/5000 [00:00<?, ?it/s]

saved neural_trigram_predictions.npy


Fill in your neural trigram perplexity.

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Neural trigram validation perplexity: ***207***

In [10]:
print(generate_text(neural_trigram_model))

<eos> <eos> = = = <eos> <eos> The church was a large number of aircraft and five . <eos> <eos> There was


Free up RAM.

In [11]:
# Delete model we don't need. 
del neural_trigram_model

### LSTM Model

For this stage of the project, you will implement an LSTM language model.

For recurrent language modeling, the data batching strategy is a bit different from what is used in some other tasks.  Sentences are concatenated together so that one sentence starts right after the other, and an unfinished sentence will be continued in the next batch.  We'll use the `torchtext` library to manage this batching for you.  To properly deal with this input format, you should save the last state of the LSTM from a batch to feed in as the first state of the next batch.  When you save state across different batches, you should call `.detach()` on the state tensors before the next batch to tell PyTorch not to backpropagate gradients through the state into the batch you have already finished (which will cause a runtime error).

We expect your model to reach a validation perplexity below 130.  The following architecture and hyperparameters should be sufficient to get there.
* 3 LSTM layers with 512 units
* dropout of 0.5 after each LSTM layer
* instead of projecting directly from the last LSTM output to the vocabulary size for softmax, project down to a smaller size first (e.g. 512->128->vocab_size). **NOTE: You may find that adding nonlinearities between these layers can hurt performance, try without first.**
* use the same weights for the embedding layer and the pre-softmax layer; dimension 128
* train with Adam (using default learning rates) for at least 20 epochs


In [13]:
class LSTMNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self):
        super().__init__()

        # YOUR CODE HERE
        self.embedding_dim = 128
        self.hidden_dim = 512
        self.lstm_layers = 3
        self.vocab_size = vocab_size
        
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        
        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, self.lstm_layers, dropout=0.5)

        self.projection1 = nn.Linear(self.hidden_dim, self.embedding_dim)
        # Output layer: bias=False for weight tying
        self.projection2 = nn.Linear(self.embedding_dim, self.vocab_size, bias=False)
        
        # Weight tying
        self.projection2.weight = self.embedding.weight


    def forward(self, x, state):
        """Compute the output of the network.
        
        Note: In the Pytorch LSTM tutorial, the state variable is named "hidden":
        https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

        The torch.nn.LSTM documentation is quite helpful:
        https://pytorch.org/docs/stable/nn.html#lstm
    
        x - a tensor of int64 inputs with shape (seq_len, batch)
        state - a tuple of two tensors with shape (num_layers, batch, hidden_size)
                representing the hidden state and cell state of the of the LSTM.
        returns a tuple with two elements:
          - a tensor of log probabilities with shape (seq_len, batch, vocab_size)
          - a state tuple returned by applying the LSTM.
        """

        # Note that the nn.LSTM module expects inputs with the sequence 
        # dimension before the batch by default.
        # In this case the dimensions are already in the right order, 
        # but watch out for this since sometimes people put the batch first

        # YOUR CODE HERE
        embedded = self.embedding(x)  # Use weight tying
        lstm_output, state = self.lstm(embedded, state) # LSTM output (seq_len, batch, hidden_dim)
        
        # Projection to 128 then Vocab Size
        projected_output = self.projection1(lstm_output)
        output = self.projection2(projected_output)
        
        
        return F.log_softmax(output,dim=-1), state
        


class LSTMModel:
    "A class that wraps LSTMNetwork to handle training and evaluation."

    def __init__(self):
        self.network = LSTMNetwork().cuda()
        # YOUR CODE HERE
        self.criterion = nn.NLLLoss()
        self.optimizer = optim.Adam(self.network.parameters())
        self.patience = 2  # Early stopping patience
        self.best_model_path = 'best_lstm_model.pth'

    def train(self):
        train_iterator = data.BPTTIterator(train_dataset, batch_size=64, 
                                                     bptt_len=32, device='cuda')
        # Iterate over train_iterator with a for loop to get batches
        # each batch object has a .text and .target attribute with
        # token id tensors for the input and output respectively.

        # The initial state passed into the LSTM should be set to zero.

        # YOUR CODE HERE
        best_val_loss = float('inf')
        epochs_no_improve = 0
        
        for epoch in range(30):
            total_loss = 0
            self.network.train()

            # Initialize the hidden and cell states to zeros at the start of the epoch
            state = (torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda(),
                     torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda())
            
            for batch in tqdm.tqdm_notebook(train_iterator):
                inputs, targets = batch.text, batch.target
                
                state = (state[0].detach(), state[1].detach()) # Detach state for the next batch
                
                self.optimizer.zero_grad()
                outputs, states = self.network(inputs, state)
                loss = self.criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
                loss.backward()
                self.optimizer.step()
                total_loss += loss.item()
                
            print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_iterator)}')
        
            # Validate the model and apply early stopping
            val_perplexity = self.dataset_perplexity(validation_dataset)
            print(f'Epoch {epoch + 1}, Validation Perplexity: {val_perplexity}')
            
            # Check for improvement in validation perplexity
            if val_perplexity < best_val_loss:
                best_val_loss = val_perplexity
                epochs_no_improve = 0
                torch.save(self.network.state_dict(), self.best_model_path)
                print(f'New best model saved with validation perplexity: {val_perplexity}')
            else:
                epochs_no_improve += 1

            # Early stopping if no improvement for patience epochs
            if epochs_no_improve >= self.patience:
                print("Early stopping triggered.")
                break


    def next_word_probabilities(self, text_prefix):
        "Return a list of probabilities for each word in the vocabulary."

        prefix_token_tensor = torch.tensor(ids(text_prefix), device='cuda').view(-1, 1)
        
        # YOUR CODE HERE
        self.network.eval()
        state = (torch.zeros(self.network.lstm_layers, 1, self.network.hidden_dim).cuda(),
                 torch.zeros(self.network.lstm_layers, 1, self.network.hidden_dim).cuda())
        
        with torch.no_grad():
            log_probs, new_state = self.network(prefix_token_tensor, state)
        
        return torch.exp(log_probs[-1, 0]).cpu().numpy()
        
        

    def dataset_perplexity(self, torchtext_dataset):
        "Return perplexity as a float."
        # Your code should be very similar to next_word_probabilities, but
        # run in a loop over batches. Use torch.no_grad() for extra speed.

        iterator = data.BPTTIterator(torchtext_dataset, batch_size=64, bptt_len=32, device='cuda')

        # YOUR CODE HERE
        total_loss, total_words = 0, 0 
        state = (torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda(),
                 torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda())
        
        with torch.no_grad():
            for batch in iterator:
                inputs, targets = batch.text, batch.target
                state = (state[0].detach(), state[1].detach())
                outputs, state = self.network(inputs, state)
                
                loss = F.nll_loss(outputs.view(-1, outputs.size(-1)), targets.view(-1), reduction = 'sum')
                
                total_loss += loss.item()
                total_words += targets.numel()
        return torch.exp(torch.tensor(total_loss / total_words)).item()
        

lstm_model = LSTMModel()
lstm_model.train()

print('lstm validation perplexity:', lstm_model.dataset_perplexity(validation_dataset))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 1, Loss: 6.823638892173767
Epoch 1, Validation Perplexity: 437.1129455566406
New best model saved with validation perplexity: 437.1129455566406


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 2, Loss: 6.0943540984509035
Epoch 2, Validation Perplexity: 324.0853271484375
New best model saved with validation perplexity: 324.0853271484375


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 3, Loss: 5.818457244891746
Epoch 3, Validation Perplexity: 268.1468200683594
New best model saved with validation perplexity: 268.1468200683594


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 4, Loss: 5.620028014276542
Epoch 4, Validation Perplexity: 237.2594757080078
New best model saved with validation perplexity: 237.2594757080078


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 5, Loss: 5.465574232737223
Epoch 5, Validation Perplexity: 215.70616149902344
New best model saved with validation perplexity: 215.70616149902344


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 6, Loss: 5.339913617395887
Epoch 6, Validation Perplexity: 200.96966552734375
New best model saved with validation perplexity: 200.96966552734375


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 7, Loss: 5.236591354538413
Epoch 7, Validation Perplexity: 189.84767150878906
New best model saved with validation perplexity: 189.84767150878906


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 8, Loss: 5.147879929636039
Epoch 8, Validation Perplexity: 182.6044921875
New best model saved with validation perplexity: 182.6044921875


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 9, Loss: 5.071899966165131
Epoch 9, Validation Perplexity: 175.56724548339844
New best model saved with validation perplexity: 175.56724548339844


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 10, Loss: 5.003757230440775
Epoch 10, Validation Perplexity: 170.86135864257812
New best model saved with validation perplexity: 170.86135864257812


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 11, Loss: 4.943912158292883
Epoch 11, Validation Perplexity: 166.5323486328125
New best model saved with validation perplexity: 166.5323486328125


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 12, Loss: 4.889673208722881
Epoch 12, Validation Perplexity: 164.2957000732422
New best model saved with validation perplexity: 164.2957000732422


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 13, Loss: 4.839953689949185
Epoch 13, Validation Perplexity: 162.62049865722656
New best model saved with validation perplexity: 162.62049865722656


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 14, Loss: 4.794508008863412
Epoch 14, Validation Perplexity: 159.71224975585938
New best model saved with validation perplexity: 159.71224975585938


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 15, Loss: 4.751493612925212
Epoch 15, Validation Perplexity: 158.36021423339844
New best model saved with validation perplexity: 158.36021423339844


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 16, Loss: 4.7129925928863825
Epoch 16, Validation Perplexity: 158.4795684814453


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 17, Loss: 4.676219066451577
Epoch 17, Validation Perplexity: 157.0289764404297
New best model saved with validation perplexity: 157.0289764404297


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 18, Loss: 4.641533553366568
Epoch 18, Validation Perplexity: 157.2941131591797


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 19, Loss: 4.609416576927783
Epoch 19, Validation Perplexity: 156.60104370117188
New best model saved with validation perplexity: 156.60104370117188


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 20, Loss: 4.578115526367636
Epoch 20, Validation Perplexity: 157.7437286376953


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 21, Loss: 4.548326000980302
Epoch 21, Validation Perplexity: 156.92739868164062
Early stopping triggered.
lstm validation perplexity: 156.5919952392578


In [14]:
save_truncated_distribution(lstm_model, 'lstm_predictions.npy', short=False)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/5000 [00:00<?, ?it/s]

saved lstm_predictions.npy


In [15]:
print(generate_text(lstm_model))

<eos> <eos> = = Major @-@ life Management = = <eos> <eos> Weir was commissioned to fragment for the outbreak of World


<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Fill in your LSTM perplexity. 

LSTM validation perplexity: ***157***

# Experimentation: 1-Page Report

Now it's time for you to experiment.  Try to reach a validation perplexity below 120. You may either modify the LSTM class above, or copy it down to the code cell below and modify it there. Just **be sure to run code cell below to generate results with your improved LSTM**.  

It is okay if the bulk of your improvements are due to hyperparameter tuning (such as changing number or sizes of layers), but implement at least one more substantial change to the model.  Here are some ideas (several of which come from https://arxiv.org/pdf/1708.02182.pdf):
* activation regularization - add a l2 regularization penalty on the activation of the LSTM output (standard l2 regularization is on the weights)
* weight-drop regularization - apply dropout to the weight matrices instead of activations
* learning rate scheduling - decrease the learning rate during training
* embedding dropout - zero out the entire embedding for a random set of words in the embedding matrix
* ensembling - average the predictions of several models trained with different initialization random seeds
* temporal activation regularization - add l2 regularization on the difference between the LSTM output activations at adjacent timesteps

You may notice that most of these suggestions are regularization techniques.  This dataset is considered fairly small, so regularization is one of the best ways to improve performance.

For this section, you will submit a write-up describing the extensions and/or modifications that you tried.  Your write-up should be **1-page maximum** in length and should be submitted in PDF format.  You may use any editor you like, but we recommend using LaTeX and working in an environment like Overleaf.
For full credit, your write-up should include:
1.   A concise and precise description of the extension that you tried.
2.   A motivation for why you believed this approach might improve your model.
3.   A discussion of whether the extension was effective and/or an analysis of the results.  This will generally involve some combination of tables, learning curves, etc.
4.   A bottom-line summary of your results comparing validation perplexities of your improvement to the original LSTM.
The purpose of this exercise is to experiment, so feel free to try/ablate multiple of the suggestions above as well as any others you come up with!
When you submit the file, please name it `report.pdf`.



Run the cell below in order to train your improved LSTM and evaluate it.  

In [16]:
class WeightDropLSTM(nn.LSTM):
    """LSTM with weight drop applied to hidden-to-hidden weights"""
    def __init__(self, *args, weight_dropout=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.weight_dropout = weight_dropout
        self._setup_weights()

    def _setup_weights(self):
        # Apply dropout to the recurrent weights (weight_hh_l0)
        self.weight_hh_l0_raw = nn.Parameter(self.weight_hh_l0.data.clone())
        self.weight_hh_l0.requires_grad = False

    def _setweights(self):
        self.weight_hh_l0.data.copy_(self.weight_hh_l0_raw.data)
        mask = torch.ones(self.weight_hh_l0.data.size(0), self.weight_hh_l0.data.size(1), device=self.weight_hh_l0.device)
        mask = torch.nn.functional.dropout(mask, self.weight_dropout, training=self.training)
        self.weight_hh_l0.data.mul_(mask)

    def forward(self, *args, **kwargs):
        self._setweights()
        return super().forward(*args, **kwargs)
    
class LSTMNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self):
        super().__init__()

        # YOUR CODE HERE
        self.embedding_dim = 128
        self.hidden_dim = 512
        self.lstm_layers = 3
        self.vocab_size = vocab_size
        
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        
        # self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, self.lstm_layers, dropout=0.5)
        self.lstm = WeightDropLSTM(self.embedding_dim, self.hidden_dim, self.lstm_layers, dropout=0.5)


        self.projection1 = nn.Linear(self.hidden_dim, self.embedding_dim)
        # Output layer: bias=False for weight tying
        self.projection2 = nn.Linear(self.embedding_dim, self.vocab_size, bias=False)
        
        # Weight tying
        self.projection2.weight = self.embedding.weight


    def forward(self, x, state):
        """Compute the output of the network.
        
        Note: In the Pytorch LSTM tutorial, the state variable is named "hidden":
        https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

        The torch.nn.LSTM documentation is quite helpful:
        https://pytorch.org/docs/stable/nn.html#lstm
    
        x - a tensor of int64 inputs with shape (seq_len, batch)
        state - a tuple of two tensors with shape (num_layers, batch, hidden_size)
                representing the hidden state and cell state of the of the LSTM.
        returns a tuple with two elements:
          - a tensor of log probabilities with shape (seq_len, batch, vocab_size)
          - a state tuple returned by applying the LSTM.
        """

        # Note that the nn.LSTM module expects inputs with the sequence 
        # dimension before the batch by default.
        # In this case the dimensions are already in the right order, 
        # but watch out for this since sometimes people put the batch first

        # YOUR CODE HERE
        embedded = self.embedding(x)  # Use weight tying
        lstm_output, state = self.lstm(embedded, state) # LSTM output (seq_len, batch, hidden_dim)
        
        # Projection to 128 then Vocab Size
        projected_output = self.projection1(lstm_output)
        output = self.projection2(projected_output)
        
        
        return F.log_softmax(output,dim=-1), state
        


class LSTMModel:
    "A class that wraps LSTMNetwork to handle training and evaluation."

    def __init__(self):
        self.network = LSTMNetwork().cuda()
        # YOUR CODE HERE
        self.criterion = nn.NLLLoss()
        self.optimizer = optim.Adam(self.network.parameters())
        self.patience = 2  # Early stopping patience
        self.best_model_path = 'best_lstm_model.pth'

    def train(self):
        train_iterator = data.BPTTIterator(train_dataset, batch_size=64, 
                                                     bptt_len=32, device='cuda')
        # Iterate over train_iterator with a for loop to get batches
        # each batch object has a .text and .target attribute with
        # token id tensors for the input and output respectively.

        # The initial state passed into the LSTM should be set to zero.

        # YOUR CODE HERE
        best_val_loss = float('inf')
        epochs_no_improve = 0
        
        for epoch in range(30):
            total_loss = 0
            self.network.train()

            # Initialize the hidden and cell states to zeros at the start of the epoch
            state = (torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda(),
                     torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda())
            
            for batch in tqdm.tqdm_notebook(train_iterator):
                inputs, targets = batch.text, batch.target
                
                state = (state[0].detach(), state[1].detach()) # Detach state for the next batch
                
                self.optimizer.zero_grad()
                outputs, states = self.network(inputs, state)
                loss = self.criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
                loss.backward()
                self.optimizer.step()
                total_loss += loss.item()
                
            print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_iterator)}')
        
            # Validate the model and apply early stopping
            val_perplexity = self.dataset_perplexity(validation_dataset)
            print(f'Epoch {epoch + 1}, Validation Perplexity: {val_perplexity}')
            
            # Check for improvement in validation perplexity
            if val_perplexity < best_val_loss:
                best_val_loss = val_perplexity
                epochs_no_improve = 0
                torch.save(self.network.state_dict(), self.best_model_path)
                print(f'New best model saved with validation perplexity: {val_perplexity}')
            else:
                epochs_no_improve += 1

            # Early stopping if no improvement for patience epochs
            if epochs_no_improve >= self.patience:
                print("Early stopping triggered.")
                break


    def next_word_probabilities(self, text_prefix):
        "Return a list of probabilities for each word in the vocabulary."

        prefix_token_tensor = torch.tensor(ids(text_prefix), device='cuda').view(-1, 1)
        
        # YOUR CODE HERE
        self.network.eval()
        state = (torch.zeros(self.network.lstm_layers, 1, self.network.hidden_dim).cuda(),
                 torch.zeros(self.network.lstm_layers, 1, self.network.hidden_dim).cuda())
        
        with torch.no_grad():
            log_probs, new_state = self.network(prefix_token_tensor, state)
        
        return torch.exp(log_probs[-1, 0]).cpu().numpy()
        
        

    def dataset_perplexity(self, torchtext_dataset):
        "Return perplexity as a float."
        # Your code should be very similar to next_word_probabilities, but
        # run in a loop over batches. Use torch.no_grad() for extra speed.

        iterator = data.BPTTIterator(torchtext_dataset, batch_size=64, bptt_len=32, device='cuda')

        # YOUR CODE HERE
        total_loss, total_words = 0, 0 
        state = (torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda(),
                 torch.zeros(self.network.lstm_layers, 64, self.network.hidden_dim).cuda())
        
        with torch.no_grad():
            for batch in iterator:
                inputs, targets = batch.text, batch.target
                state = (state[0].detach(), state[1].detach())
                outputs, state = self.network(inputs, state)
                
                loss = F.nll_loss(outputs.view(-1, outputs.size(-1)), targets.view(-1), reduction = 'sum')
                
                total_loss += loss.item()
                total_words += targets.numel()
        return torch.exp(torch.tensor(total_loss / total_words)).item()
        

lstm_model = LSTMModel()
lstm_model.train()

print('lstm validation perplexity:', lstm_model.dataset_perplexity(validation_dataset))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 1, Loss: 6.809304465032091
Epoch 1, Validation Perplexity: 431.561279296875
New best model saved with validation perplexity: 431.561279296875


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 2, Loss: 6.074475097188762
Epoch 2, Validation Perplexity: 319.9286193847656
New best model saved with validation perplexity: 319.9286193847656


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 3, Loss: 5.799917399182039
Epoch 3, Validation Perplexity: 266.8653869628906
New best model saved with validation perplexity: 266.8653869628906


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 4, Loss: 5.603944595187318
Epoch 4, Validation Perplexity: 234.23101806640625
New best model saved with validation perplexity: 234.23101806640625


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 5, Loss: 5.450821961608588
Epoch 5, Validation Perplexity: 212.7420196533203
New best model saved with validation perplexity: 212.7420196533203


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 6, Loss: 5.327732241854948
Epoch 6, Validation Perplexity: 196.43463134765625
New best model saved with validation perplexity: 196.43463134765625


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 7, Loss: 5.227847313413433
Epoch 7, Validation Perplexity: 185.91220092773438
New best model saved with validation perplexity: 185.91220092773438


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 8, Loss: 5.142344941344915
Epoch 8, Validation Perplexity: 177.3015899658203
New best model saved with validation perplexity: 177.3015899658203


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 9, Loss: 5.069100797410105
Epoch 9, Validation Perplexity: 170.15765380859375
New best model saved with validation perplexity: 170.15765380859375


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 10, Loss: 5.005309290979422
Epoch 10, Validation Perplexity: 165.6397247314453
New best model saved with validation perplexity: 165.6397247314453


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 11, Loss: 4.948205143797631
Epoch 11, Validation Perplexity: 161.01425170898438
New best model saved with validation perplexity: 161.01425170898438


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 12, Loss: 4.897338754055546
Epoch 12, Validation Perplexity: 157.48060607910156
New best model saved with validation perplexity: 157.48060607910156


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 13, Loss: 4.850997534920188
Epoch 13, Validation Perplexity: 154.15867614746094
New best model saved with validation perplexity: 154.15867614746094


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 14, Loss: 4.809218663795321
Epoch 14, Validation Perplexity: 152.0447235107422
New best model saved with validation perplexity: 152.0447235107422


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 15, Loss: 4.769125881382063
Epoch 15, Validation Perplexity: 150.25213623046875
New best model saved with validation perplexity: 150.25213623046875


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 16, Loss: 4.7331113455342315
Epoch 16, Validation Perplexity: 148.34962463378906
New best model saved with validation perplexity: 148.34962463378906


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 17, Loss: 4.700017370897181
Epoch 17, Validation Perplexity: 147.46815490722656
New best model saved with validation perplexity: 147.46815490722656


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 18, Loss: 4.668002068295198
Epoch 18, Validation Perplexity: 145.91783142089844
New best model saved with validation perplexity: 145.91783142089844


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 19, Loss: 4.638793729333317
Epoch 19, Validation Perplexity: 144.6492462158203
New best model saved with validation perplexity: 144.6492462158203


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 20, Loss: 4.611190943156972
Epoch 20, Validation Perplexity: 144.38958740234375
New best model saved with validation perplexity: 144.38958740234375


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 21, Loss: 4.584291686263739
Epoch 21, Validation Perplexity: 144.14564514160156
New best model saved with validation perplexity: 144.14564514160156


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 22, Loss: 4.55896888480467
Epoch 22, Validation Perplexity: 143.5348663330078
New best model saved with validation perplexity: 143.5348663330078


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 23, Loss: 4.53425007614435
Epoch 23, Validation Perplexity: 142.9739227294922
New best model saved with validation perplexity: 142.9739227294922


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 24, Loss: 4.510986056514815
Epoch 24, Validation Perplexity: 142.56272888183594
New best model saved with validation perplexity: 142.56272888183594


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 25, Loss: 4.489686052471984
Epoch 25, Validation Perplexity: 142.32106018066406
New best model saved with validation perplexity: 142.32106018066406


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 26, Loss: 4.467238087747611
Epoch 26, Validation Perplexity: 141.93104553222656
New best model saved with validation perplexity: 141.93104553222656


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 27, Loss: 4.447107078514847
Epoch 27, Validation Perplexity: 142.32452392578125


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 28, Loss: 4.427009384772357
Epoch 28, Validation Perplexity: 142.42384338378906
Early stopping triggered.
lstm validation perplexity: 142.0345001220703


In [17]:
save_truncated_distribution(lstm_model, 'lstm_predictions_WeightDrop.npy', short=False)
print(generate_text(lstm_model))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/5000 [00:00<?, ?it/s]

saved lstm_predictions_WeightDrop.npy
<eos> <eos> Originally Diplocystaceae was all Greek , particularly classified as " <unk> " articulated from other languages . <eos> <unk> <unk>


### Submission

Upload a submission with the following files to Gradescope:
* hw1b.ipynb (rename to match this exactly)
* lstm_predictions.npy (this should also include all improvements from your exploration)
* neural_trigram_predictions.npy
* bigram_predictions.npy
* report.pdf

You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them.

Be sure to check the output of the autograder after it runs.  It should confirm that no files are missing and that the output files have the correct format.  Note that the test set perplexities shown by the autograder are on a completely different scale from your validation set perplexities due to truncating the distribution and selecting different text.  Don't worry if the values seem much worse.