Homework 4: Neural Language Models (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
----

### Names
----
Names: __Katherine Aristizabal, Jose Meza Llamosas__ (Write these in every notebook you submit.)

Task 3: Feedforward Neural Language Model (80 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

In [1]:
# import your libraries here

import numpy as np

# if you want fancy progress bars
from tqdm.autonotebook import tqdm

# Remember to restart your kernel if you change the contents of this file!
import neurallm_utils as nutils

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim

# This function gives us nice print-outs of our models.
from torchinfo import summary

  from tqdm.autonotebook import tqdm
[nltk_data] Downloading package punkt to /Users/0wner/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/0wner/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### a) First, encode  your text into integers (5 points)

In [2]:
# Edit constants as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3
NUM_SEQUENCES_PER_BATCH = 128

TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on
OUTPUT_WORDS = 'generated_wordbased.txt' # The file to save your generated sentences for word-based LM
OUTPUT_CHARS = 'generated_charbased.txt' # The file to save your generated sentences for char-based LM

# you can update these file names if you want to depending on how you are exploring 
# hyperparameters
EMBEDDING_SAVE_FILE_WORD = f"spooky_embedding_word_{EMBEDDINGS_SIZE}.model" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = f"spooky_embedding_char_{EMBEDDINGS_SIZE}.model" # The file to save your char embeddings to
MODEL_FILE_WORD = f'spooky_author_model_word_{NGRAM}.pt' # The file to save your trained word-based neural LM to
MODEL_FILE_CHAR = f'spooky_author_model_char_{NGRAM}.pt' # The file to save your trained char-based neural LM to



In [3]:
# load your word vectors that you made in your previous notebook AND 
# use the create_embedder function to make your pytorch embedder
char_wv = nutils.load_word2vec(EMBEDDING_SAVE_FILE_CHAR)
char_embedder = nutils.create_embedder(char_wv)

word_wv = nutils.load_word2vec(EMBEDDING_SAVE_FILE_WORD)
word_embedder = nutils.create_embedder(word_wv)

In [4]:
# you'll also need to re-load your text data
char_data = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)
text_data = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)

In [5]:
# This function is used to vectorize a text corpus. 
# Here, it creates a mapping from word to that word's unique index.

# Hint: use one of the dicts from your embedding function.

def encode_tokens(data: list[list[str]], embedder: torch.nn.Embedding) -> list[list[int]]:
    """
    Replaces each natural-language token with its embedder index.

    e.g. [["<s>", "once", "upon", "a", "time"],
          ["there", "was", "a", ]]
        ->
        [[0, 59, 203, 1, 126],
         [26, 15, 1]]
        (The indices are arbitrary, as they are dependent on your embedder)

    Params:
        data: The corpus
        embedder: An embedder trained on the given data.
    """

    finalList = []
    for list in data:
        currList = []
        for word in list:
            index = embedder.token_to_index[word]
            currList.append(index)
        finalList.append(currList)

    return finalList

In [6]:
# encode your data from tokens to integers for both word and char embeddings
encoded_chars = encode_tokens(char_data, char_embedder)
encoded_word = encode_tokens(text_data, word_embedder)

In [7]:
# print out the size of the mappings for each of your embedders.
# these should match the vocab sizes you calculated in Task 2

# 4 points
# print out the vocabulary size for your embeddings for both your word
# embeddings and your character embeddings
# label which is which when you print them out

char_vocab_size = len(char_embedder.token_to_index)
word_vocab_size = len(word_embedder.token_to_index)


print(f"char embedder size {char_vocab_size}")
print(f"word embedder size {word_vocab_size}")

char embedder size 60
word embedder size 25374


### b) Next, prepare the sequences to train your model from text (2 points)

#### Fixed n-gram based sequences

The training samples will be structured in the following format. 
Depening on which ngram model we choose, there will be (n-1) tokens 
in the input sequence (X) and we will need to predict the nth token (y).

Example: this process however afforded me

Would become:
```
X
[[this,    process]
[process, however]
[however, afforded]]

y
[however,
afforded,
me]
```


Our first step is to generate n-grams like we have always been doing. We'll just do this 
on our encoded data instead of the raw text. (Feel free to consult your past HW here).

In [8]:

def create_ngrams(tokens: list, n: int) -> list:
    """Creates n-grams for the given token sequence.
    Args:
      tokens (list): a list of tokens as strings
      n (int): the length of n-grams to create

    Returns:
      list: list of tuples of strings, each tuple being one of the individual n-grams
    """
    # STUDENTS IMPLEMENT
    res = []
    for i in range(0, len(tokens)-n):
        #append n gram + yth value
        res.append(tokens[i:i+n+1])
    return res

def generate_ngram_training_samples(encoded: list[list[int]], ngram: int) -> list:
    """
    Takes the **encoded** data (list of lists of ints) and 
    generates the training samples out of it.
    
    Parameters:
        up to you, we've put in what we used
        but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    """

    #1 2 3 4
    #[1,2, y=3]
    #[2,3, y=4]

    # if you'd like to use tqdm, you can use it like this:
    # for i in tqdm(range(len(encoded))):
    final_list = []
    for list in encoded:
        currList = create_ngrams(list, ngram-1)
        final_list.extend(currList)
    return final_list


In [9]:
# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
char_sample = generate_ngram_training_samples(encoded_chars, NGRAM)
word_sample = generate_ngram_training_samples(encoded_word, NGRAM)
print(f"length char  {len(char_sample)}")
print(f"length word  {len(word_sample)}")

print(char_sample[0:5])
print(word_sample[0:5])

finalList = []
for list in char_sample:
    currList = []
    for i in list:
        tok = char_embedder.index_to_token[i]
        currList.append(tok)
    finalList.append(currList)
print( finalList[0:5])

finalList = []
for list in char_sample:
    currList = []
    for i in list:
        tok = word_embedder.index_to_token[i]
        currList.append(tok)
    finalList.append(currList)
print( finalList[0:5])

# Spooky data by words shoud give 634080 sequences
# [0, 0, 31]
# [0, 31, 2959]
# [31, 2959, 2]
# ...

# Spooky data by character should give 2957553 sequences
# [20, 20, 2]
# [20, 2, 8]
# [2, 8, 6]
# ...

# print out the first 5 training samples for each and make sure that the 
# windows are sliding one word at a time. These should be integers!
# make sure that they map to the correct words in your vocab
# Hint: what word maps to token 0?
print(word_embedder.token_to_index[','])

length char  2957553
length word  634080
[[25, 25, 2], [25, 2, 8], [2, 8, 6], [8, 6, 7], [6, 7, 0]]
[[3, 3, 31], [3, 31, 2959], [31, 2959, 0], [2959, 0, 154], [0, 154, 0]]
[['<s>', '<s>', 't'], ['<s>', 't', 'h'], ['t', 'h', 'i'], ['h', 'i', 's'], ['i', 's', '_']]
[['at', 'at', 'of'], ['at', 'of', 'i'], ['of', 'i', 'and'], ['i', 'and', 'to'], ['and', 'to', ',']]
0


### c) Then, split the sequences into X and y and create a DataLoader (10 points)

In [10]:
# Note here that each sequence we've created so far is in the form:
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate them into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]]
# do that here for both word and character data
# you can write a function to do this if you'd like (not required, might be helpful)

def split_sequences(training_sample):
    x_sample = []
    y_sample = []
    for line in training_sample:
        x_sample.append(line[0:-1])
        y_sample.append(line[-1])
    return x_sample, y_sample

x_char = []
y_char = []
for line in char_sample:
    x_char.append(line[0:-1])
    y_char.append(line[-1])

x_word = []
y_word = []
for line in word_sample:
    x_word.append(line[0:-1])
    y_word.append(line[-1])

"""print(char_sample[0:5])
def split_training_data(training: list[list[int]]) -> list:

    Takes the training data and splits it into X and y.
    tuple: a tuple of two lists, one for X and one for y
    
    x = []
    y = []
    for l in training:
        print(l)
        x.append(l[0:-1])
        y.append(l[-1])
    return x, y"""
# print out the shapes (or lengths to know how many sequences there are and how many
# elements each sub-list has) for word-based to verify that they are correct

print(f"word x length {len(x_word)}")
length_list =[]
for line in x_word:
  #  print(line)
    length_list.append(len(line))

print(length_list)

print(f"word y length {len(y_word)}")


word x length 634080
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 

In [11]:
def create_dataloaders(X: list, y: list, num_sequences_per_batch: int, 
                       test_pct: float = 0.1, shuffle: bool = True) -> tuple[torch.utils.data.DataLoader]:
    """
    Convert our data into a PyTorch DataLoader.    
    A DataLoader is an object that splits the dataset into batches for training.
    PyTorch docs: 
        https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
        https://pytorch.org/docs/stable/data.html

    Note that you have to first convert your data into a PyTorch DataSet.
    You DO NOT have to implement this yourself, instead you should use a TensorDataset.

    You are in charge of splitting the data into train and test sets based on the given
    test_pct. There are several functions you can use to acheive this!

    The shuffle parameter refers to shuffling the data *in the loader* (look at the docs),
    not whether or not to shuffle the data before splitting it into train and test sets.
    (don't shuffle before splitting)

    Params:
        X: A list of input sequences
        Y: A list of labels
        num_sequences_per_batch: Batch size
        test_pct: The proportion of samples to use in the test set.
        shuffle: INSTRUCTORS ONLY

    Returns:
        One DataLoader for training, and one for testing.
    """
    
    dataSet = TensorDataset(torch.tensor(X), torch.tensor(y))
    test_size = int(len(dataSet)*test_pct)
    train_size = len(dataSet) - test_size
    train_data, test_data = torch.utils.data.random_split(dataSet, [train_size, test_size])
    dataloader_train = DataLoader(train_data, batch_size=num_sequences_per_batch, shuffle=shuffle)
    dataloader_test = DataLoader(test_data, batch_size=num_sequences_per_batch, shuffle=shuffle)
    return dataloader_train, dataloader_test


### some definitions:
- a single __batch__ is the number of sequences that your model will evaluate at once when it learns
-  __steps per epoch__ is the number of batches that your model will see in a single epoch  (one pass through the data)-- your NUM_SEQUENCES_PER_BATCH constant is the batch size--you won't need this for pytorch but it's useful to know

In [12]:
# initialize your dataloaders for both word and character data
# print out the shapes of the first batch to verify that it is 
# correct for both word and character data
# note that your train data and your test data should have the same shapes!
# print enough information to verify that the shapes are correct

word_dataloader_train, word_dataloader_test = create_dataloaders(x_word, y_word, NUM_SEQUENCES_PER_BATCH)
char_dataloader_train, char_dataloader_test = create_dataloaders(x_char, y_char, NUM_SEQUENCES_PER_BATCH)

# Accesing the first batch of each data loader to print shape with next(iter())

sample_x_word_train, sample_y_word_train = next(iter(word_dataloader_train))
sample_x_word_test, sample_y_word_test = next(iter(word_dataloader_test))

sample_x_char_train, sample_y_char_train = next(iter(char_dataloader_train))
sample_x_char_test, sample_y_char_test = next(iter(char_dataloader_test))

print(f"word data loader train shape for X: {sample_x_word_train.shape} and y: {sample_y_word_train.shape}")
print(f"word data loader test shape for X: {sample_x_word_test.shape} and y: {sample_y_word_test.shape}\n")

print(f"char data loader train shape for X: {sample_x_char_train.shape} and y: {sample_y_char_train.shape}")
print(f"char data loader test shape for X: {sample_x_char_test.shape} and y: {sample_y_char_test.shape}")

# Examples:
# Normally you would loop over your dataloader, but we just want to get a single batch to test it out:
# Every time you call next, you advance to the next batch
# sample_X, sample_y = next(iter(train_dataloader))
# sample_X.shape # (batch_size, (n-1)*EMBEDDING_SIZE) # Correction from Piazza it should be (batch_size, n-1)  
# sample_y.shape  # (batch_size)

word data loader train shape for X: torch.Size([128, 2]) and y: torch.Size([128])
word data loader test shape for X: torch.Size([128, 2]) and y: torch.Size([128])

char data loader train shape for X: torch.Size([128, 2]) and y: torch.Size([128])
char data loader test shape for X: torch.Size([128, 2]) and y: torch.Size([128])


### d) Define, train & save your models (25 points)

Write the code to train feedforward neural language models for both word embeddings and character embeddings make sure not to just copy + paste to train your two models (define functions as needed).

Define your model architecture using PyTorch layers and activation functions. When training, use the Adam optimizer (https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) instead of sgd (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD).

add cells as desired :)

Your FFNN should have the following architecture:
- It should be a two layer neural net (one hidden layer, one output layer)
- It should use ReLU as its activation function

Our biggest piece of advice--make sure that you understand what dimensions each layer needs to be!

In [13]:
# 10 points

class FFNN(nn.Module):
    """
    A class representing our implementation of a Feed-Forward Neural Network.
    You will need to implement two methods:
        - A constructor to set up the architecture and hyperparameters of the model
        - The forward pass
    """
    
    def __init__(self, vocab_size: int, ngram: int, embedding_layer: torch.nn.Embedding, hidden_units=128):
        """
        Initialize a new untrained model. 
        
        You can change these parameters as you would like.
        Once you get a working model, you are encouraged to
        experiment with this constructor to improve performance.
        
        Params:
            vocab_size: The number of words in the vocabulary
            ngram: The value of N for training and prediction.
            embedding_layer: The previously trained embedder. 
            hidden_units: The size of the hidden layer.
        """        
        super().__init__()
        # YOUR CODE HERE
        # we recommend saving the parameters as instance variables
        # so you can access them later as needed
        # (in addition to anything else you need to do here)
        
		# Saving parameters as instance variables
        self.vocab_size = vocab_size
        self.ngram = ngram
        self.embedding_layer = embedding_layer
        self.hidden_units = hidden_units
        
		# Save embedding size
        embedding_size = embedding_layer.embedding_dim
        
		# Defining layers
        self.flatten = nn.Flatten() # Useful later to flatten array of ngram-1 after embedding before passing it to the linear layer
        self.linear_relu_stack = nn.Sequential(
			nn.Linear(in_features=(ngram-1)*embedding_size, out_features=hidden_units, bias=True),
			nn.ReLU(),
			nn.Linear(in_features=hidden_units, out_features=vocab_size, bias=True)
		)
        
    def forward(self, X: list) -> torch.tensor:
        """
        Compute the forward pass through the network.
        This is not a prediction, and it should not apply softmax.

        Params:
            X: the input data

        Returns:
            The output of the model; i.e. its predictions.
        
        """
        # YOUR CODE HERE
        embedded = self.embedding_layer(X)
        flat_embedded = self.flatten(embedded)
        logits = self.linear_relu_stack(flat_embedded)
        return logits


In [14]:
# 10 points

# Defining a training function that goes over every batch per epoch
def train_one_epoch(dataloader, model, optimizer, loss_fn):
    epoch_loss = 0

    for data in dataloader:
        # Separating the input + label pair for each instance
        inputs, labels = data
        
		# Zeroing gradients for every batch
        optimizer.zero_grad()
        
		# Make predictions for this batch
        outputs = model(inputs)
        
		# Compute loss and gradients
        batch_loss = loss_fn(outputs, labels)
        batch_loss.backward()
        
		# Adjust learning weights
        optimizer.step()
        
		# Adding to epoch loss
        epoch_loss += batch_loss.item() # Covert scalar tensor into floating-point

    return epoch_loss

# Defining a general training function that goes over all the epochs
def train(dataloader, model, epochs: int = 1, lr: float = 0.001) -> None:
    """
    Our model's training loop.
    Print the cross entropy loss every epoch.
    You should use the Adam optimizer instead of SGD.

    When looking for documentation, try to stay on PyTorch's website.
    This might be a good place to start: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html 
    They should have plenty of tutorials, and we don't want you to get confused from other resources.

    Params:
        dataloader: The training dataloader
        model: The model we wish to train
        epochs: The number of epochs to train for
        lr: Learning rate 
    """
    # YOUR CODE HERE
    # you will need to initialize an optimizer and a loss function, which you should do
    # before the training loop
    
    optimizer = torch.optim.Adam(model.parameters(), lr=lr) # Adam optimizer instead of SGD
    loss_fn = torch.nn.CrossEntropyLoss() # Multinomial Cross Entropy Loss that applies log-softmax internally and computes the negative log likelihood
    
    n_batches = len(dataloader)
    
	# Making sure gradient tracking is on before start training
    model.train()
    
    for epoch in tqdm(range(epochs)):
        epoch_loss = train_one_epoch(dataloader, model, optimizer, loss_fn)
        avg_epoch_loss = epoch_loss/n_batches
        print(f"Epoch: {epoch}, Loss: {avg_epoch_loss}\n")

    # print out the epoch number and the current average loss after each epoch
    # you can use tqdm to print out a progress bar

For the next part, we're testing our model's functions so we can see if it works.
No need to do this on both the word and character data, just one is fine.

In [15]:
# Create your model
# Print out its architecture (use the imported summary function)

model = FFNN(vocab_size=word_vocab_size, ngram=NGRAM, embedding_layer=word_embedder)
model


FFNN(
  (embedding_layer): Embedding(25374, 50)
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=100, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=25374, bias=True)
  )
)

In [16]:
# 5 points

# train your models for 1 epoch
# see timing information posted on Canvas!

# re-create your data loader fresh
word_dataloader_train, word_dataloader_test = create_dataloaders(x_word, y_word, NUM_SEQUENCES_PER_BATCH)
char_dataloader_train, char_dataloader_test = create_dataloaders(x_char, y_char, NUM_SEQUENCES_PER_BATCH)

# train your model
train(word_dataloader_train, model, epochs=1)


100%|██████████| 1/1 [01:46<00:00, 106.34s/it]

Epoch: 0, Loss: 5.771614972131887






10. You're reporting the loss after each epoch of training. What is the loss for your model after 1 epoch?
- word or character-based? __word based__
- loss? __5.72__
- time __03 m 58 s__

Loss isn't accuracy, but it does tell us whether or not the model is improving over time. For character-based, loss after one epoch should be ~2.1; for word-based it is ~5.9.

### e) create a full pipeline (13 points)

We've made all the pieces that you'll need for a full pipeline, now let's package everything together nicely.

In [None]:
# 3 points

# make a function that does your full *training* pipeline
# This is essentially pulling the pieces that you've done so far earlier in this 
# notebook into a single function that you can call to train your model


def full_pipeline(data, word_embeddings_filename: str, 
                batch_size:int = NUM_SEQUENCES_PER_BATCH,
                ngram:int = NGRAM, hidden_units = 128, epochs = 1,
                lr = 0.001, test_pct = 0.1
                ) -> FFNN:
    """
    Run the entire pipeline from loading embeddings to training.
    You won't use the test set for anything.

    Params:
        data: The raw data to train on, parsed as a list of lists of tokens
        word_embeddings_filename: The filename of the Word2Vec word embeddings
        batch_size: The batch size to use
        hidden_units: The number of hidden units to use
        epochs: The number of epochs to train for
        lr: The learning rate to use
        test_pct: The proportion of samples to use in the test set.

    Returns:
        The trained model.
    """
    # Loading embeddings
    token_embeddings = nutils.load_word2vec(word_embeddings_filename)
    embedder = nutils.create_embedder(token_embeddings)
    
	# Encode tokens
    encoded_tokens = encode_tokens(data, embedder)
    
	# Define vocab size from embedder
    vocab_size = embedder.num_embeddings
    
	# Prepare training samples
    training_sample = generate_ngram_training_samples(encoded_tokens, ngram)
    
	# Split sequences
    x_sample, y_sample = split_sequences(training_sample)
    
	# Create training dataloader
    dataloader_train, _ = create_dataloaders(x_sample, y_sample, batch_size, test_pct)

	# Create FFNN model
    model = FFNN(vocab_size=vocab_size, ngram=ngram, embedding_layer=embedder, hidden_units=hidden_units)

	# Train our model
    train(dataloader=dataloader_train, model=model, epochs=epochs, lr=lr)

    return model

In [None]:
# 10 points

# Use your full pipeline to train models on the word data and the character data.
# Feel free to add cells if you'd like to.

# Train your models however you'd like. Play around with number of epochs, learning rate, etc.
# Do whatever you'd like to for exploring hyperparameters.
# You aren't required to hit a certain loss, but you should leave code here that shows
# that you explored effects of changing at least two of the different hyperparameters
# Please don't change the architecture of the model (keep it a 2-layer model with 1 hidden layer)

# You'll likely want to do this exploration AFTER completing your prediction and generation code, so start
# with just training for 1 - 5 epochs with default params.


# Word-based takes Felix's computer 7 - 8 min for 5 epochs with default params running on CPU
# Char-based Felix's computer ~1min 30sec - 2min for 5 epochs with default params running on CPU

# Importing data
char_data = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)
text_data = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)

# Defining base models with default params

base_word_model = full_pipeline(data=text_data, word_embeddings_filename=EMBEDDING_SAVE_FILE_WORD,epochs=5)
base_char_model = full_pipeline(data=char_data, word_embeddings_filename=EMBEDDING_SAVE_FILE_CHAR,epochs=5)

 20%|██        | 1/5 [01:41<06:46, 101.55s/it]

Epoch: 0, Loss: 5.767561000568179



 40%|████      | 2/5 [03:21<05:01, 100.60s/it]

Epoch: 1, Loss: 5.225431412852754



 60%|██████    | 3/5 [05:22<03:39, 109.75s/it]

Epoch: 2, Loss: 4.989004248700138



 80%|████████  | 4/5 [07:30<01:57, 117.29s/it]

Epoch: 3, Loss: 4.814662314552073



100%|██████████| 5/5 [09:30<00:00, 114.01s/it]

Epoch: 4, Loss: 4.676736436582088




 20%|██        | 1/5 [00:29<01:57, 29.31s/it]

Epoch: 0, Loss: 2.0796695163523284



 40%|████      | 2/5 [00:54<01:20, 26.83s/it]

Epoch: 1, Loss: 1.9820976156607388



 60%|██████    | 3/5 [01:18<00:51, 25.73s/it]

Epoch: 2, Loss: 1.967492361192773



 80%|████████  | 4/5 [01:42<00:25, 25.04s/it]

Epoch: 3, Loss: 1.9608832631954485



100%|██████████| 5/5 [02:04<00:00, 24.83s/it]

Epoch: 4, Loss: 1.9567836109068009






With default parameters:
* Word <br>
Time: 03m 51s <br>
Loss: 5.77<br>
<br>
* Character <br>
Time: 02m 33s<br>
Loss: 2.08


In [19]:
# when you're happy with them, save both models
# Feel free to play around with any hyperparameters you'd like

# using torch.save and the model's state_dict
torch.save(base_word_model.state_dict(), MODEL_FILE_WORD)
torch.save(base_char_model.state_dict(), MODEL_FILE_CHAR)

### f) Generate Sentences (25 points)

Now that you have trained models, you'll work on the generation piece. Note that because you saved your models, even if you have to re-start your kernel, you should be able to re-load them without having to re-train them again.

In [20]:
# load the models in again with code like:
base_word_model = FFNN(vocab_size=word_vocab_size, ngram=NGRAM, embedding_layer=word_embedder, hidden_units=128)
base_char_model = FFNN(vocab_size=char_vocab_size, ngram=NGRAM, embedding_layer=char_embedder, hidden_units=128)

base_word_model.load_state_dict(torch.load(MODEL_FILE_WORD))
base_char_model.load_state_dict(torch.load(MODEL_FILE_CHAR))

# then switch the model into evaluation mode
# model.eval()

<All keys matched successfully>

In [21]:
# 10 points 

# Create a function that predicts the next token in a sequence.
def predict(model, input_tokens) -> str:
    """
    Get the model's next word prediction for an input.
    This is where you'll use the softmax function!
    Assume that the input tokens do not contain any unknown tokens.

    Params:
        model: Your trained model
        input_tokens: A list of natural-language tokens. Must be length N-1.

    Returns:
        The predicted token (not the predicted index!)
    """
    # YOUR CODE HERE
	# Encode tokens
    encoded_tokens = [model.embedding_layer.token_to_index[token] for token in input_tokens]
    
	# Trasform to tensor
    encoded_tokens = torch.tensor([encoded_tokens]) # Dim [1, ngram-1]
    
    # Setting model to evaluation mode turns off Dropout and BatchNorm making the predictions deterministic
    model.eval()  # Set the model to evaluation mode if you haven't already
    
    with torch.no_grad(): # Speeds up inference and reduces memory usage by not having to calcualte gradients
        logits = model(encoded_tokens) # Forward pass on the model
        probability = nn.functional.softmax(logits, dim=1) # Normalize z scores to probability
        predicted_idx = torch.multinomial(probability, num_samples=1).item()

        #predicted_idx = probability.argmax(dim=1).item() # Retrieve int value
		
	# Transform index to natural-language token
    predicted_token = model.embedding_layer.index_to_token[predicted_idx] 
    
    return predicted_token


In [23]:
# 10 points
from typing import List
# Generate a sequence from the model until you get an end of sentence token.
def generate(model, seed: List[str], max_tokens: int = None) -> List[str]:
    """
    Use the trained model to generate a sentence.
    This should be somewhat similar to generation for HW2...
    Make sure to use your predict function!

    Params:
        model: Your trained model
        seed: [w_1, w_2, ..., w_(n-1)].
        max_tokens: The maximum number of tokens to generate. When None, should gener
            generate until the end of sentence token is reached.

    Return:
        A list of generated tokens.
    """ 
    n_tokens = 0 # Count tokens that have been generated
    tokens = seed.copy() # Copy of initial seed
    end_token = "<\s>"
    
    while True:
        for_prediction = seed[-(model.ngram-1):]
        predicted_token = predict(model, for_prediction)
        if predicted_token == end_token:
        	break
        tokens.append(predicted_token)
        n_tokens += 1
        if max_tokens is not None and n_tokens >= max_tokens:
            break
        
    return tokens

  end_token = "<\s>"


In [24]:
def generate_sentences(model, seed: List[str],  n_sentences: int, max_tokens: int = None) -> List[str]:
    return [generate(model, seed, max_tokens) for i in range(n_sentences)]

In [29]:
# you might want to define some functions to help you format the text nicely
# and/or generate multiple sequences

def format_sentence(tokens_list: List[List[str]], by_char = False) -> str:
  """Removes <s> at the start of the sentence and </s> at ehe end. Joins the list of tokens into a string and capitalizes it.
  Args:
    tokens (list(list)): the list of tokens list to be formatted into a sentence

  Returns:
    string: formatted sentence as a string
  
  """
  text = "" # Initializing final sentence
  for tokens in tokens_list: # Parsing through each individual sentence
    while tokens[0] == '<s>': # Removes all <s> at the beggining even if there are several for ngram > 2 models
      tokens.pop(0)
    if tokens[-1] == '</s>': # Removes the one </s> at the end of the sentence
      tokens.pop(-1)
    if by_char:
      sentence = "".join(tokens) # Converts list of tokens into a string
      sentence = sentence.capitalize() # Capitalizes the first letter of each sentence
    else:
      sentence = " ".join(tokens) # Converts list of tokens into a string
      sentence = sentence.capitalize() # Capitalizes the first letter of each sentence
    text += sentence + ".\n" # Adds a period and space separator between sentences
  return text.strip(" ") # Removes the last space in the last sentence


In [30]:
# 2.5 points

# generate and display ten sequences from both your word model and your character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# For character-based, replace _ with a space

#model.eval()

word_test = ["<s>", "this"]
char_test = ["<s>", "t"]

word_generated = generate_sentences(model=base_word_model, seed=word_test, n_sentences=10, max_tokens=10)
char_generated = generate_sentences(model=base_char_model, seed=char_test, n_sentences=10, max_tokens=10)

print(word_generated)
print(format_sentence(word_generated))
print(format_sentence(char_generated, by_char=True))

[['<s>', 'this', 'genius', 'floor', 'time', 'about', 'had', 'relented', 'fancied', 'early', 'very', 'melancholy'], ['<s>', 'this', 'chance', 'afforded', 'was', 'jaws', 'robin', 'rarely', 'cohort', 'few', 'wont', 'is'], ['<s>', 'this', 'warm', 'manner', 'leetle', 'is', 'one', 'say', 'young', 'silence', 'time', 'engaging'], ['<s>', 'this', 'thick', ',', 'lucid', 'appeared', 'was', 'time', ',', 'form', 'old', 'closely'], ['<s>', 'this', 'you', 'somewhat', 'sudden', 'i', 'last', 'above', 'complete', 'indignation', ',', 'in'], ['<s>', 'this', 'was', 'fallacious', 'was', 'of', 'square', 'morning', 'of', 'for', 'crushed', 'little'], ['<s>', 'this', 'event', 'subject', 'came', 'head', 'was', 'few', 'to', 'occupied', 'assured', 'for'], ['<s>', 'this', 'place', 'i', 'will', '.', 'day', 'sirocco', 'words', 'wild', 'pitch', 'can'], ['<s>', 'this', 'stage', 'spot', 'articulate', 'same', 'sort', 'was', 'from', 'i', 'vanish', 'conduct'], ['<s>', 'this', 'body', 'capable', 'when', ',', 'strange', 'rig

In [None]:
# 2.5 points

# Generate 100 example sentences with each model and save them to two files, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model
# We've defined the filenames for you at the top of this notebook
# Do not print these sentences here :)

word_generated_final = generate_sentences(model=base_word_model, seed=word_test, n_sentences=100, max_tokens=10)
char_generated_final = generate_sentences(model=base_char_model, seed=char_test, n_sentences=100, max_tokens=10)

torch.save(format_sentence(word_generated_final), OUTPUT_WORDS)
torch.save(format_sentence(char_generated_final, by_char=True), OUTPUT_CHARS)


11. What were the final parameters that you used for your model? 
- N: __3__
- embedding size: __50__
- epochs: __5__
- hidden units: __128__
- learning rate: __0.001__
- training time + system you were running it on (operating system + chip/specs): __Training time: ~11min
System: macOS 15.2, M1 chip 2020__
    - for pairs, you can either note both partners' training times or just one

- What was the word-based model's final loss? __4.676736436582088__
- Character based? __1.9567836109068009__

If you used different parameters for your word-based and character-based models, note the different parameters clearly.