Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode  your text into integers (5 points)

In [2]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time

import numpy as np
import neurallm_utils as nutils



[nltk_data] Downloading package punkt to /Users/maxnbf/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# load in necessary data
from gensim.models import KeyedVectors

# abstract into util functions
NGRAM = 3 # The ngram language model you want to train
EMBEDDING_SAVE_FILE_WORD = "spooky_embedding_word.txt" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = "spooky_embedding_char.txt" # The file to save your word embeddings to
TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on

data_by_char = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)
data_by_word = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)


In [4]:
# constants you may find helpful. Edit as you would like.
#EMBEDDINGS_SIZE = 50
# i think its 100
EMBEDDINGS_SIZE = 100
NGRAM = 3 # The ngram language model you want to train

In [5]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data

# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: Indexing starts from 0)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)
char_tokenizer = Tokenizer()
char_tokenizer.fit_on_texts(data_by_char)
char_encoded = char_tokenizer.texts_to_sequences(data_by_char)

word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(data_by_word)
word_encoded = word_tokenizer.texts_to_sequences(data_by_word)

In [6]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings
print("Size of word index for character tokenizer: ", len(char_tokenizer.word_index))
print("Size of word index for word tokenizer: ", len(word_tokenizer.word_index))

Size of word index for character tokenizer:  60
Size of word index for word tokenizer:  25374


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [7]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    up to you, we've put in what we used
    but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''

    samples = []
    for e in encoded: 
        samples.extend([e[i:i+ngram] for i in range(len(e) - ngram + 1)])
    
    return samples

char_samples = generate_ngram_training_samples(encoded=char_encoded, ngram=NGRAM)
word_samples = generate_ngram_training_samples(encoded=word_encoded, ngram=NGRAM)

print(len(char_samples))
print(len(word_samples))
# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words shoud give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...



2957553
634080


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [8]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]]
# do that here


def get_X_and_y_from_samples(samples: list): 
    X = []
    y = []

    for s in samples:
        X.append(s[:-1])
        y.append(s[-1])

    return X, y


X_char, y_char = get_X_and_y_from_samples(char_samples)
X_word, y_word = get_X_and_y_from_samples(word_samples)

# TODO: print out the shapes to verify that they are correct



In [9]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses embeddings trained in earlier.
    Parameters:
        filename (str): path to file
        Tokenizer: tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    embedding = KeyedVectors.load_word2vec_format(filename, binary=False)

    word_to_vector = {}
    index_to_vector = {}
    for index, word in tokenizer.index_word.items():
        vector = embedding[word]
        word_to_vector[word] = vector
        index_to_vector[index] = vector

    # print(len(word_to_vector.keys()))
    # print(len(index_to_vector.keys()))

    return word_to_vector, index_to_vector

word_to_vector, word_index_to_vector = read_embeddings(EMBEDDING_SAVE_FILE_WORD, word_tokenizer)
char_to_vector, char_index_to_vector = read_embeddings(EMBEDDING_SAVE_FILE_CHAR, char_tokenizer)


In [10]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)


padding_token_vector = [0 for _ in range(EMBEDDINGS_SIZE)]

# what is the padding token
char_to_vector["padding_token"] = padding_token_vector
char_index_to_vector[0] = padding_token_vector

In [11]:
# 10 points

def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict) -> (list,list):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)
    
    If for_feedforward is True: 
    Returns data generator to be used by feed_forward
    else: Returns data generator for RNN model
    '''
    # YOUR CODE HERE

    index = 0
    while True:

        # this is the data in the form [[21, 21], [21, 3], [3, 9], ...]
        embeddings = []
        for indeces in X[index:index+num_sequences_per_batch]:
            row = []
            for i in indeces:
                row.extend(index_2_embedding[i])

            embeddings.append(row)

        # labels = to_categorical(y[index:index+num_sequences_per_batch], num_classes=num_sequences_per_batch)

        labels = []
        vocab_len = len(index_2_embedding.keys())
        for label in y[index:index+num_sequences_per_batch]:
            zeros = [0 for _ in range(vocab_len)]
            zeros[label] = 1
            labels.append(zeros)

        yield (np.array(embeddings), np.array(labels))
        index += num_sequences_per_batch

    pass

In [12]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data

# Examples:
# num_sequences_per_batch = 128 # this is the batch size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
# sample[1].shape   # (batch_size, |V|) to_categorical

# character data
num_sequences_per_batch = 128 # this is the batch size
steps_per_epoch = len(X_char)//num_sequences_per_batch  # Number of batches per epoch
train_generator = data_generator(X_char, y_char, num_sequences_per_batch, char_index_to_vector)

sample=next(train_generator) # this is how you get data out of generators
print(sample[0].shape)
print(sample[1].shape)
#sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
#sample[1].shape   # (batch_size, |V|) to_categorical


(128, 200)
(128, 61)


### d) Train & __save__ your models (15 points)

In [19]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

def build_nn(input_size: int) -> Sequential:
    model = Sequential()

    # adds hidden layer
    model.add(Dense(units=50, activation="relu", input_dim=input_size))

    # adds an output layer
    model.add(Dense(units=1, activation="sigmoid"))

    # shows the model's verbose
    model.summary()

    # calls compile here
    model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])

    return model

char_model = build_nn(200)
#char_model = build_nn(len(char_index_to_vector.keys()))

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 50)                10050     
                                                                 
 dense_3 (Dense)             (None, 1)                 51        
                                                                 
Total params: 10101 (39.46 KB)
Trainable params: 10101 (39.46 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 50)                3100      
                                                                 
 dense_5 (Dense)             (None, 1)                 51        
                                                                 
Total params: 3151 (12.31 KB)
Trainable params: 3151 (12.31 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [18]:
# Here is some example code to train a model with a data generator
# model.fit(x=train_generator, 
#           steps_per_epoch=steps_per_epoch,
#           epochs=1)


char_model.fit(x=train_generator, 
          steps_per_epoch=steps_per_epoch,
          epochs=1)

InvalidArgumentError: Graph execution error:

Detected at node sequential/dense/Relu defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel_launcher.py", line 17, in <module>

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/traitlets/config/application.py", line 992, in launch_instance

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 711, in start

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 195, in start

  File "/Users/maxnbf/anaconda3/lib/python3.11/asyncio/base_events.py", line 607, in run_forever

  File "/Users/maxnbf/anaconda3/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once

  File "/Users/maxnbf/anaconda3/lib/python3.11/asyncio/events.py", line 80, in _run

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 499, in process_one

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 729, in execute_request

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 411, in do_execute

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 531, in run_cell

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3006, in run_cell

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3061, in _run_cell

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3266, in run_cell_async

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3445, in run_ast_nodes

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3505, in run_code

  File "/var/folders/xn/89skw6lx4pvgwt09xw22s_jw0000gn/T/ipykernel_38767/455497032.py", line 7, in <module>

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/training.py", line 1783, in fit

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/training.py", line 1377, in train_function

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/training.py", line 1360, in step_function

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/training.py", line 1349, in run_step

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/training.py", line 1126, in train_step

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/training.py", line 589, in __call__

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in __call__

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/sequential.py", line 398, in call

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/functional.py", line 515, in call

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/functional.py", line 672, in _run_internal_graph

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in __call__

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/layers/core/dense.py", line 255, in call

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/activations.py", line 306, in relu

  File "/Users/maxnbf/anaconda3/lib/python3.11/site-packages/keras/src/backend.py", line 5397, in relu

Matrix size-incompatible: In[0]: [128,200], In[1]: [61,50]
	 [[{{node sequential/dense/Relu}}]] [Op:__inference_train_function_663]

In [16]:

# spooky data model by character for 5 epochs takes ~ 24 min on Felix's computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on Felix's computer
# results in accuracy of 0.2110

61

In [None]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!


### e) Generate Sentences (15 points)

In [None]:
# load your models if you need to


In [None]:
# 10 points

# # generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
# def generate_seq(model: Sequential, 
#                  tokenizer: Tokenizer, 
#                  seed: list):
#     '''
#     Parameters:
#         model: your neural network
#         tokenizer: the keras preprocessing tokenizer
#         seed: [w1, w2, w(n-1)]
#     Returns: string sentence
#     '''
#     pass



In [None]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# you may leave _ as _ or replace it with a space if you prefer

In [None]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model