Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---
Names: Kaan Tural, Arinjay Singh

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode your text into integers (5 points)

In [16]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time
import neurallm_utils as nutils
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors


In [19]:
# load in necessary data

TRAIN_FILE = 'spooky_author_train.csv'

train = pd.read_csv(TRAIN_FILE)
# read the data in both by character and by word

modelChar = KeyedVectors.load_word2vec_format('spooky_embedding_char.txt', binary=False)
modelWord = KeyedVectors.load_word2vec_format('spooky_embedding_word.txt', binary=False)

In [20]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3 # The ngram language model you want to train

by_char = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)
by_word = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)

In [21]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data

# It is used to vectorize a text corpus. Here, it just creates a mapping from
# word to a unique index. (Note: indexing starts from 1; index 0 is reserved for padding)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)

tokenizer_char = Tokenizer(char_level=True)
tokenizer_word = Tokenizer()

tokenizer_char.fit_on_texts(by_char)
tokenizer_word.fit_on_texts(by_word)

encoded_char = tokenizer_char.texts_to_sequences(by_char)
encoded_word = tokenizer_word.texts_to_sequences(by_word)

In [27]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings

print('Size of word index for character tokenizer:', len(tokenizer_char.word_index))
print('Size of word index for word tokenizer:', len(tokenizer_word.word_index), "\n")

# Size of the embeddings from Task 2
print('Size of word index for character embeddings:', len(modelChar.index_to_key))
print('Size of word index for word embeddings:', len(modelWord.index_to_key))

Size of word index for character tokenizer: 60
Size of word index for word tokenizer: 25374 

Size of word index for character embeddings: 60
Size of word index for word embeddings: 25374


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [30]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (a flat list of token indices) and
    generates the training samples out of it using a sliding window.
    Parameters:
        encoded (list): flat list of integer token indices
        ngram (int): length n of each sample
    Returns:
        list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    
    # Initialize an empty list to store the training samples
    training_samples = []
    
    # Loop over the encoded data to generate samples
    for i in range(len(encoded) - ngram + 1):
        # Extract the n-gram sequence from the current position
        ngram_sequence = encoded[i:i+ngram]
        # Add the extracted sequence to the training samples
        training_samples.append(ngram_sequence)
    
    return training_samples

# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words should give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...

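# flatten the encoded lists since they are in the form of list of lists after encoding
# NOTE: the sliding window then also covers n-grams that span two sentences, which
# likely explains why both counts printed below are 39156 higher than the expected ones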
encoded_char_flat = [token for sublist in encoded_char for token in sublist]
encoded_word_flat = [token for sublist in encoded_word for token in sublist]

training_samples_char = generate_ngram_training_samples(encoded_char_flat, NGRAM)
training_samples_word = generate_ngram_training_samples(encoded_word_flat, NGRAM)

print("\nnumber of sequences for character data:", len(training_samples_char))
print("First 5 training samples for character data:")
for sample in training_samples_char[:5]:
    print(sample)

print("\nnumber of sequences for word data:", len(training_samples_word))
print("First 5 training samples for word data:")
for sample in training_samples_word[:5]:
    print(sample)


number of sequences for character data: 2996709
First 5 training samples for character data:
[21, 21, 3]
[21, 3, 9]
[3, 9, 7]
[9, 7, 8]
[7, 8, 1]

number of sequences for word data: 673236
First 5 training samples for word data:
[1, 1, 32]
[1, 32, 2956]
[32, 2956, 3]
[2956, 3, 155]
[3, 155, 3]
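
The expected counts quoted in the comments (2957553 and 634080) are each 39156 lower than the counts printed here. A plausible explanation is that sliding the window over a flattened list also produces n-grams that straddle sentence boundaries. Below is a minimal sketch of a per-sentence variant, assuming `encoded` is the list of lists returned by `texts_to_sequences`. The function name `generate_ngram_samples_per_sentence` and the toy data are hypothetical.

```python
def generate_ngram_samples_per_sentence(encoded: list, ngram: int) -> list:
    """Slide an n-gram window over each encoded sentence separately.

    `encoded` is assumed to be a list of lists of integer token ids,
    one inner list per sentence. No sample crosses a sentence boundary.
    """
    samples = []
    for sentence in encoded:
        # A sentence shorter than `ngram` yields an empty range, so no samples.
        for i in range(len(sentence) - ngram + 1):
            samples.append(sentence[i:i + ngram])
    return samples


# Toy data: two "sentences" of token ids (made-up values).
toy_encoded = [[1, 1, 5, 6, 2, 2], [1, 1, 7, 2, 2]]
toy_samples = generate_ngram_samples_per_sentence(toy_encoded, 3)

# 4 windows from the first sentence plus 3 from the second.
print(len(toy_samples))
print(toy_samples[:2])
```

On the toy data, a flattened window would yield 9 samples instead of 7: the two extra samples are the ones that mix the end of the first sentence with the start of the second.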


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [36]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into X = [[x1, x2, ... , x(n-1)], ...] and y = [y1, y2, ...]
# do that here

def split_sequences(sequences):
    X = [sequence[:-1] for sequence in sequences]
    y = [sequence[-1] for sequence in sequences]
    return X, y

X_char, y_char = split_sequences(training_samples_char)
X_word, y_word = split_sequences(training_samples_word)

# print out the shapes to verify that they are correct
print("\nShape of X_char:", np.array(X_char).shape)
print("Shape of y_char:", np.array(y_char).shape)
print("Shape of X_word:", np.array(X_word).shape)
print("Shape of y_word:", np.array(y_word).shape)

print("\nFirst 5 X_char:", X_char[:5])
print("First 5 y_char:", y_char[:5])

print("\nShape of X_char:", len(X_char), "Shape of y_char:", len(y_char))
print("Shape of X_word:", len(X_word), "Shape of y_word:", len(y_word))


Shape of X_char: (2996709, 2)
Shape of y_char: (2996709,)
Shape of X_word: (673236, 2)
Shape of y_word: (673236,)

First 5 X_char: [[21, 21], [21, 3], [3, 9], [9, 7], [7, 8]]
First 5 y_char: [3, 9, 7, 8, 1]

Shape of X_char: 2996709 Shape of y_char: 2996709
Shape of X_word: 673236 Shape of y_word: 673236


In [40]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses the embeddings trained earlier (in Task 2).
    Parameters:
        filename (str): path to file
        tokenizer (Tokenizer): tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # Load the model using KeyedVectors
    model = KeyedVectors.load_word2vec_format(filename, binary=False)

    # Initialize mappings
    word_to_embedding = {}
    index_to_embedding = {}

    # Iterate over the tokenizer's word_index to populate the mappings
    for word, index in tokenizer.word_index.items():
        if word in model.key_to_index:  # Check if the word exists in the model
            vector = model.get_vector(word)  # Retrieve the embedding vector
            word_to_embedding[word] = vector
            index_to_embedding[index] = vector

    return word_to_embedding, index_to_embedding

word_to_embedding_char, index_to_embedding_char = read_embeddings('spooky_embedding_char.txt', tokenizer_char)
word_to_embedding_word, index_to_embedding_word = read_embeddings('spooky_embedding_word.txt', tokenizer_word)

print("Character Embeddings:")
print("Word to Embedding Size:", len(word_to_embedding_char))
print("Index to Embedding Size:", len(index_to_embedding_char))

print("\nWord Embeddings:")
print("Word to Embedding Size:", len(word_to_embedding_word))
print("Index to Embedding Size:", len(index_to_embedding_word))


Character Embeddings:
Word to Embedding Size: 60
Index to Embedding Size: 60

Word Embeddings:
Word to Embedding Size: 25374
Index to Embedding Size: 25374


In [None]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)
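
One way to carry out the padding step described above, sketched with NumPy. The small `index_to_embedding` dict is made-up stand-in data. In the notebook it would be `index_to_embedding_char` (or `index_to_embedding_word`) as returned by `read_embeddings`.

```python
import numpy as np

EMBEDDINGS_SIZE = 50

# Stand-in for the dict returned by read_embeddings (tokenizer indices start at 1).
index_to_embedding = {
    1: np.ones(EMBEDDINGS_SIZE, dtype=np.float32),
    2: np.full(EMBEDDINGS_SIZE, 0.5, dtype=np.float32),
}

# Index 0 is reserved for padding, so map it to an all-zero vector.
index_to_embedding[0] = np.zeros(EMBEDDINGS_SIZE, dtype=np.float32)

# The vocabulary now has one more entry than the saved embeddings.
vocab_size = len(index_to_embedding)
print(vocab_size)
```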

In [None]:
# 10 points

def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict) -> (list,list):
    '''
    Returns a data generator to be used by the feedforward model.
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one-hot vectors to encode the labels
    (see the to_categorical function).
    '''
    # YOUR CODE HERE
    pass
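
A minimal sketch of what such a generator could look like, not the assignment's reference solution. It makes a few assumptions that are not in the stub: `vocab_size` is passed explicitly as an extra parameter, the last partial batch is dropped, and labels are one-hot encoded with plain NumPy rather than `to_categorical`, so the example runs without Keras. The name `data_generator_sketch` and the toy data are hypothetical.

```python
import numpy as np


def data_generator_sketch(X, y, num_sequences_per_batch, index_2_embedding, vocab_size):
    """Yield (inputs, one-hot labels) batches indefinitely.

    X: list of contexts, each a list of (n-1) token indices
    y: list of target token indices
    index_2_embedding: dict mapping index -> 1-D embedding vector
    vocab_size: number of classes (largest index + 1)
    """
    num_batches = len(X) // num_sequences_per_batch
    while True:  # loop forever so training can run for several epochs
        for b in range(num_batches):
            start = b * num_sequences_per_batch
            end = start + num_sequences_per_batch
            # Look up and concatenate the (n-1) context embeddings per sample.
            batch_X = np.array([
                np.concatenate([index_2_embedding[idx] for idx in context])
                for context in X[start:end]
            ])
            # One-hot encode the targets: shape (batch, vocab_size).
            labels = np.asarray(y[start:end])
            batch_y = np.zeros((len(labels), vocab_size), dtype=np.float32)
            batch_y[np.arange(len(labels)), labels] = 1.0
            yield batch_X, batch_y


# Toy data: 6 indices (0..5) with size-4 embeddings, bigram contexts.
toy_embeddings = {i: np.full(4, float(i)) for i in range(6)}
toy_X = [[1, 2], [2, 3], [3, 4], [4, 5]]
toy_y = [3, 4, 5, 1]

gen = data_generator_sketch(toy_X, toy_y, 2, toy_embeddings, vocab_size=6)
batch_X, batch_y = next(gen)
print(batch_X.shape, batch_y.shape)  # (2, 8) (2, 6)
```

Building the batch lazily keeps memory use bounded: only one batch of one-hot labels exists at a time, which matters for the word model, where each label row has one column per vocabulary entry.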

In [None]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data

# Examples:
# num_sequences_per_batch = 128 # this is the batch size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch, index_2_embedding)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDINGS_SIZE), e.g. (128, 100) for NGRAM = 3 and EMBEDDINGS_SIZE = 50
# sample[1].shape   # (batch_size, |V|) to_categorical


### d) Train & __save__ your models (15 points)

In [None]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on our machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very high for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired



In [None]:
# Here is some example code to train a model with a data generator
# model.fit(x=train_generator, 
#           steps_per_epoch=steps_per_epoch,
#           epochs=1)

In [None]:

# spooky data model by character for 5 epochs takes ~ 24 min on our computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on our computer
# results in accuracy of 0.2110


In [None]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!


### e) Generate Sentences (15 points)

In [None]:
# load your models if you need to


In [None]:
# 10 points

# generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
# def generate_seq(model: Sequential, 
#                  tokenizer: Tokenizer, 
#                  seed: list):
#     '''
#     Parameters:
#         model: your neural network
#         tokenizer: the keras preprocessing tokenizer
#         seed: [w1, w2, ..., w(n-1)]
#     Returns: string sentence
#     '''
#     pass
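
A sketch of the generation loop under stated assumptions. To keep the example runnable without a trained network, the Keras model is replaced by a hypothetical `predict_next` callable that maps a context of n-1 indices to a probability for every index. In the notebook this would wrap the embedding lookup and `model.predict`. Sampling from the predicted distribution (rather than always taking the most likely token) and the `max_len` cap are design choices of this sketch.

```python
import random


def generate_seq_sketch(predict_next, index_word, seed,
                        end_token="</s>", max_len=50, rng=None):
    """Sample tokens until `end_token` appears or `max_len` tokens exist.

    predict_next: hypothetical callable, context (list of n-1 indices)
        -> list of probabilities, one per index
    index_word: mapping from index to token, like tokenizer.index_word
    seed: list of n-1 starting indices
    """
    rng = rng or random.Random()
    indices = list(seed)
    context_len = len(seed)
    while len(indices) < max_len:
        probs = predict_next(indices[-context_len:])
        # Sample the next index in proportion to its predicted probability.
        next_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        indices.append(next_index)
        if index_word.get(next_index) == end_token:
            break
    # Indices without a token (such as padding index 0) are skipped.
    return [index_word[i] for i in indices if i in index_word]


toy_index_word = {1: "<s>", 2: "</s>", 3: "the", 4: "raven"}


def toy_predict(context):
    """Deterministic stand-in model: <s> -> the -> raven -> </s>."""
    next_index = {1: 3, 3: 4, 4: 2}[context[-1]]
    return [1.0 if i == next_index else 0.0 for i in range(5)]


tokens = generate_seq_sketch(toy_predict, toy_index_word, seed=[1, 1])
print(tokens)  # ['<s>', '<s>', 'the', 'raven', '</s>']
```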



In [None]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# you may leave _ as _ or replace it with a space if you prefer
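
The display rules above can be sketched as a small post-processing helper. The name `format_sentence` is hypothetical, and the sketch assumes the character data uses `_` for spaces, as the comment above suggests.

```python
def format_sentence(tokens, by_character=False, start="<s>", end="</s>"):
    """Turn generated tokens into a readable string without <s> or </s>.

    Character tokens are concatenated and "_" is replaced by a space;
    word tokens are joined with single spaces.
    """
    kept = [t for t in tokens if t not in (start, end)]
    if by_character:
        return "".join(kept).replace("_", " ")
    return " ".join(kept)


print(format_sentence(["<s>", "<s>", "the", "raven", "</s>"]))
print(format_sentence(["<s>", "t", "h", "e", "_", "e", "n", "d", "</s>"],
                      by_character=True))
```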

In [None]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model
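
Writing the sentences out could look like the sketch below. The helper name `save_sentences` and the file name are hypothetical, and the two demo sentences stand in for the 100 generated ones. A temporary directory is used so the example leaves no files behind in the working directory.

```python
import tempfile
from pathlib import Path


def save_sentences(sentences, path):
    """Write one sentence per line to `path`, encoded as UTF-8."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


# Demo with made-up sentences; in the notebook this would be called once
# per model with the 100 formatted sentences.
out_file = Path(tempfile.mkdtemp()) / "generated_demo.txt"
demo_sentences = ["the raven", "a tell tale heart"]
save_sentences(demo_sentences, out_file)

saved_lines = out_file.read_text(encoding="utf-8").splitlines()
print(saved_lines)
```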