Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Task 3: Feedforward Neural Language Model (60 points)
--------------------------
Harishraj Udaya Bhaskar(6120 Student) 
For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode  your text into integers (5 points)

In [1]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout
import neurallm_utils as nutils
# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time

import numpy as np


2023-11-09 13:27:31.662964: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package punkt to /Users/harisha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3 # The ngram language model you want to train

In [3]:
# load in necessary data
data=nutils.read_file_spooky('spooky_author_train.csv',NGRAM)
data_char=nutils.read_file_spooky('spooky_author_train.csv',NGRAM,True)

In [6]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data

# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: Indexing starts from 0)
# Example:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
encoded = tokenizer.texts_to_sequences(data)

tokenizer_char = Tokenizer()
tokenizer_char.fit_on_texts(data_char)
encoded_char = tokenizer.texts_to_sequences(data_char)



In [7]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings

print("Size of the word index for the tokenizer:")
print(len(tokenizer.word_index))


print("Size of the char index for the tokenizer:")
print(len(tokenizer_char.word_index))


Size of the word index for the tokenizer:
25385
Size of the char index for the tokenizer:
60


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [12]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    up to you, we've put in what we used
    but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    training_samples = []

    for sequence in encoded:
        for i in range(len(sequence) - ngram):
            x = sequence[i:i + ngram - 1]
            y = sequence[i + ngram - 1]
            training_samples.append(x + [y])

    return training_samples


# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words shoud give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...

word_samples = generate_ngram_training_samples(encoded, NGRAM)
char_samples = generate_ngram_training_samples(encoded_char, NGRAM)



### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [13]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]]
# do that here

def separate_sequences(training_samples: list) -> tuple:

    input_sequences = [sample[:-1] for sample in training_samples]
    output_sequences = [sample[-1] for sample in training_samples]

    return input_sequences, output_sequences


# print out the shapes to verify that they are correct
word_input, word_output = separate_sequences(word_samples)
char_input, char_output = separate_sequences(char_samples)

# Print out the shapes to verify that they are correct
print("Shapes for word data:")
print("Input:", len(word_input), len(word_input[0]))
print("Output:", len(word_output))

print("\nShapes for character data:")
print("Input:", len(char_input), len(char_input[0]))
print("Output:", len(char_output))



Shapes for word data:
Input: 614488 2
Output: 614488

Shapes for character data:
Input: 2428982 2
Output: 2428982


In [20]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses embeddings trained in earlier.
    Parameters:
        filename (str): path to file
        Tokenizer: tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    word_to_embedding = {}
    index_to_embedding = {}

    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.strip().split()
            word = values[0]
            embedding = np.array(values[1:], dtype='float32')

            # Map word to its embedding
            word_to_embedding[word] = embedding

    # Adding the padding token with all zeros
    if word_to_embedding:  # Check if there is at least one embedding
        word_to_embedding['<PAD>'] = np.zeros_like(list(word_to_embedding.values())[0])
    else:
        # If the embeddings file is empty, create a default zero embedding
        word_to_embedding['<PAD>'] = np.zeros(EMBEDDING_SIZE, dtype='float32')

    # Create the index to embedding mapping
    for word, index in tokenizer.word_index.items():
        if word in word_to_embedding:
            index_to_embedding[index] = word_to_embedding[word]

    return word_to_embedding, index_to_embedding

# Use the function with the padding token in the embeddings
word_to_embedding, index_to_embedding = read_embeddings('spooky_embedding_word.txt', tokenizer)
char_to_emebedding,charindex_to_embedding=read_embeddings('spooky_embedding_char.txt',tokenizer_char)



In [None]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)

In [34]:
# 10 points

def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict) -> (list,list):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)
    
    If for_feedforward is True: 
    Returns data generator to be used by feed_forward
    else: Returns data generator for RNN model
    '''
    # YOUR CODE HERE
    
    while True:
        for i in range(0, len(X), num_sequences_per_batch):
            batch_X = X[i:i + num_sequences_per_batch]
            batch_y = y[i:i + num_sequences_per_batch]

            # Convert sequences to embeddings
            batch_X_embeddings = [
                [index_2_embedding[index] for index in sequence]
                for sequence in batch_X
            ]

            # Convert labels to one-hot encoding
            batch_y_one_hot = to_categorical(batch_y, num_classes=len(index_2_embedding) + 1)

            yield (np.array(batch_X_embeddings), np.array(batch_y_one_hot))

In [93]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data

# Examples:
num_sequences_per_batch = 128 # this is the batch size
steps_per_epoch = len(encoded)//num_sequences_per_batch  # Number of batches per epoch
train_generator = data_generator(word_input, word_output, num_sequences_per_batch,index_to_embedding)

sample=next(train_generator) # this is how you get data out of generators
sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
print(sample[1].shape)   # (batch_size, |V|) to_categorical


# Now the same for character level 
train_generator_char = data_generator(char_input, char_output, num_sequences_per_batch,charindex_to_embedding)




(128, 25386)


### d) Train & __save__ your models (15 points)

In [36]:
# 15 points 
from tensorflow.keras.layers import Flatten
# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

 # Assuming your word embeddings are of size 50
 
#Defining the word model
vocab_size = len(index_to_embedding) + 1  # Additional 1 for the padding token

model = Sequential([
    Flatten(input_shape=((NGRAM - 1) * EMBEDDINGS_SIZE,)),  # Flatten the input sequence
    Dense(100, activation='relu'),  # Add a Dense layer with ReLU activation
    Dense(vocab_size, activation='softmax')  # Output layer with softmax activation
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_1 (Flatten)         (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 100)               10100     
                                                                 
 dense_3 (Dense)             (None, 25386)             2563986   
                                                                 
Total params: 2,574,086
Trainable params: 2,574,086
Non-trainable params: 0
_________________________________________________________________


In [83]:
#Defining the Character model

vocab_size = len(index_to_embedding) + 1  # Additional 1 for the padding token

model_char = Sequential([
    Flatten(input_shape=((NGRAM - 1) * EMBEDDINGS_SIZE,)),  # Flatten the input sequence
    Dense(100, activation='relu'),  # Add a Dense layer with ReLU activation
    Dense(vocab_size, activation='softmax')  # Output layer with softmax activation
])

model_char.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_char.summary()


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_3 (Flatten)         (None, 100)               0         
                                                                 
 dense_6 (Dense)             (None, 100)               10100     
                                                                 
 dense_7 (Dense)             (None, 25386)             2563986   
                                                                 
Total params: 2,574,086
Trainable params: 2,574,086
Non-trainable params: 0
_________________________________________________________________


In [85]:
#Training word Level
model_.fit(train_generator,
    steps_per_epoch=len(word_input)//num_sequences_per_batch,
    epochs=3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fdc489f0730>

In [90]:
#Training character level model
model_char.fit(train_generator,
    steps_per_epoch=len(word_input)//num_sequences_per_batch,
    epochs=1
)



<keras.callbacks.History at 0x7fdc48a27310>

In [39]:
#Saving the word Level model
from keras.models import save_model
save_model(model, 'my_model.hdf5')

### e) Generate Sentences (15 points)

In [40]:
# load your models if you need to
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!
from tensorflow.keras.models import load_model
model = load_model('my_model.hdf5')

In [43]:
sentence_len = 20
pred_len = 1
train_len = sentence_len - pred_len
seq = []
# Sliding window to generate test and train data
for i in range(len(encoded)-sentence_len):
    seq.append(encoded[i:i+sentence_len])
# Reverse dictionary so as to decode tokenized sequences back to words and sentences
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))


In [44]:
# 10 points
#Function for generating sentences
import tensorflow
from tensorflow.keras.preprocessing.sequence import pad_sequences
def gen(seq,max_len = 20):
    sent = tokenizer.texts_to_sequences([seq])
    #print(sent)
    while len(sent[0]) < max_len:
        sent2 = tensorflow.keras.preprocessing.sequence.pad_sequences(sent[-19:],maxlen=19)
        op = model.predict(np.asarray(sent2).reshape(1,-1))
        sent[0].append(op.argmax()+1)
    return " ".join(map(lambda x : reverse_word_map[x],sent[0]))



In [77]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# you may leave _ as _ or replace it with a space if you prefer

sample_sentences=[]
start = [("i am curious of",26),("is this why he was ",32),("is this ",32)]
# Last one was Describe in 600 words
for i in range(len(start)):
    print("<<-- Sentence %d -->>\n"%(i),gen(start[i][0],start[i][1]))
    sample_sentences.append(gen(start[i][0],start[i][1]))

<<-- Sentence 0 -->>
 i am curious of pretty the <s> unknown , <s> back </s> <s> beyond will is child companion <s> countenance </s> <s> since <s> confusion a
<<-- Sentence 1 -->>
 is this why he was <s> remains appetite and when for else individual </s> in through difference have <s> low </s> <s> mountains and idea for magic my those in secret he
<<-- Sentence 2 -->>
 is this us , . ' through strange not head </s> before he <s> calm when ? received he edge i well and when for read castle to had slightest by then


In [79]:
#Sample sentences generated
for i in sample_sentences:
    print('-------')
    print(i)

-------
i am curious of pretty the <s> unknown , <s> back </s> <s> beyond will is child companion <s> countenance </s> <s> since <s> confusion a
-------
is this why he was <s> remains appetite and when for else individual </s> in through difference have <s> low </s> <s> mountains and idea for magic my those in secret he
-------
is this us , . ' through strange not head </s> before he <s> calm when ? received he edge i well and when for read castle to had slightest by then


In [70]:
combinations = [
    "I am curious of where is",
    "I am curious of is where",
    "Is this why he was where is",
    "Is this why he was is where",
    "Is this where is",
    "Is this is where",
    "Where is this",
    "Where is why",
    "Is where this",
    "Is where why",
    "Is is this",
    "Is is why",
    "Where is curious",
    "Where is he",
    "Is where curious",
    "Is where he",
    "Is is curious",
    "Is is he",
    "Curious where is",
    "Curious is where",
    "Curious is this",
    "Curious where he",
    "Curious is he",
    "He is curious of",
    "He is curious where",
    "He is curious is",
    "He was curious of",
    "He was curious where",
    "He was curious is",
    "Is he curious of",
    "Is he curious where",
    "Is he curious is",
    "Is this curious of",
    "Is this curious where",
    "Is this curious is",
    "Where is curious of",
    "Where is curious where",
    "Where is curious is",
    "Is curious of where",
    "Is curious of is",
    "Is curious where is",
    "Curious of where is",
    "Curious of is where",
    "Curious where is this",
    "Curious where he is",
    "Curious where is he",
    "Curious is this why",
    "Curious is this where",
    "Curious is this is",
    "Curious is where why",
    "Curious is where this",
    "Curious is where is",
    "Curious where this is",
    "Curious where is this",
    "Curious where is he",
    "Curious is this why he",
    "Curious is this he was",
    "Curious is where why he",
    "Curious is where he was",
    "Curious is where is he",
    "Curious is he was where",
    "Curious is he was is",
    "Curious where he was is",
    "He is curious where is",
    "He is curious where this",
    "He is curious where he",
    "He is curious is where",
    "He is curious is this",
    "He is curious is he",
    "He was curious where is",
    "He was curious where this",
    "He was curious where he",
    "He was curious is where",
    "He was curious is this",
    "He was curious is he",
    "Is he curious where is",
    "Is he curious where this",
    "Is he curious where he",
    "Is he curious is where",
    "Is he curious is this",
    "Is he curious is he",
    "Is this curious where is",
    "Is this curious where he",
    "Is this curious is where",
    "Is this curious is he",
    "Where is curious where",
    "Where is curious is",
    "Where is where this is",
    "Where is where is this",
    "Where is where is he",
    "Where is is this why",
    "Where is is this where",
    "Where is is this is",
    "Is where curious where",
    "Is where curious is",
    "Is where where this is",
    "Is where where is this",
    "Is where where is he",
    "Is where is this why",
    "Is where is this where"
]
start = [("i am curious of",26),
("is this why he was ",32),
("is this ",32)]

for i in combinations:
    tup=(i,32)
    start.append(tup)

print(start)
all_sentences=[]

# Last one was Describe in 600 words
for i in range(len(start)):
    print("<<-- Sentence %d -->>\n"%(i),gen(start[i][0],start[i][1]))
    all_sentences1.append(gen(start[i][0],start[i][1]))

    


[('i am curious of', 26), ('is this why he was ', 32), ('is this ', 32), ('I am curious of where is', 32), ('I am curious of is where', 32), ('Is this why he was where is', 32), ('Is this why he was is where', 32), ('Is this where is', 32), ('Is this is where', 32), ('Where is this', 32), ('Where is why', 32), ('Is where this', 32), ('Is where why', 32), ('Is is this', 32), ('Is is why', 32), ('Where is curious', 32), ('Where is he', 32), ('Is where curious', 32), ('Is where he', 32), ('Is is curious', 32), ('Is is he', 32), ('Curious where is', 32), ('Curious is where', 32), ('Curious is this', 32), ('Curious where he', 32), ('Curious is he', 32), ('He is curious of', 32), ('He is curious where', 32), ('He is curious is', 32), ('He was curious of', 32), ('He was curious where', 32), ('He was curious is', 32), ('Is he curious of', 32), ('Is he curious where', 32), ('Is he curious is', 32), ('Is this curious of', 32), ('Is this curious where', 32), ('Is this curious is', 32), ('Where is

In [72]:
#writing the generated sentences to a txt file
file_path = 'word_sentences.txt'

with open(file_path, 'w') as file:
    for line in all_sentences1:
        file.write(line + '\n')