
# IST664/CIS668 - Homework 5 Template

In recent classes, we've examined sequence-to-sequence (S2S) models that use recurrent networks to process sequences of character or word tokens. In this final homework, you will process the SQUAD V2 dataset, extract the shortest Q/A pairs, and train an S2S model on them to explore the limits of recurrence.

Most of this code has been borrowed from Labs 10 and 11. If you run into any problems, compare the code to what you used for the lab. 

Use this template to organize your code for Homework 5. Put your name and the names of your collaborators (if any) here:

Your name: _______________________________

Your collaborators: ________________________________


*Task 1*: Install and process the SQUAD V2 dataset.


In [None]:
# Get the datasets package and load the data from HuggingFace

!pip install datasets # A package for creating a connection to HuggingFace data resources
from datasets import load_dataset

raw_dataset = load_dataset("squad_v2") # The names of the datasets can be obtained from the web-based discovery interface
raw_train_dataset = raw_dataset["train"] # We will use the training data first

type(raw_dataset), type(raw_train_dataset) # Display the types

In [None]:
# We will be using numpy, tensorflow, and keras
import numpy as np
import tensorflow as tf
from tensorflow import keras

tf.__version__

The MAX_INP and MAX_TARG variables below are key to the operation of this S2S model. We know from Class 10 that LSTMs have limits on how far a training gradient can propagate back through a long sequence of LSTM cells. The loop below filters the dataset to include only question and answer pairs that fit within the maximum sequence lengths. Low values mean short sequences that are easier to train on, but they leave less training data.

Start the model using the sequence length suggestions below and then gradually increase them to try to recover more training data from the dataset. You MUST try at least one additional gradation of these two values to get full credit for this lab. For example, you could raise MAX_INP to 50 and MAX_TARG to 30.
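
To get a feel for this trade-off before retraining, it can help to count how many pairs survive a given pair of length caps. The sketch below is only an illustration: `toy_pairs` is a small hand-made list (not drawn from SQUAD V2) and `count_pairs_within` is a hypothetical helper that is not part of the template. It uses the same strict `<` comparison as the filtering loop.

```python
# Hypothetical toy data: (question, answer) pairs, NOT taken from SQUAD V2.
# Answers carry the same "\t" start and "\n" stop markers as the template.
toy_pairs = [
    ("Who wrote Hamlet?", "\tShakespeare\n"),
    ("What is the capital of France?", "\tParis\n"),
    ("When did the war end?", "\t1945\n"),
    ("Which river flows through the city of Cairo?", "\tThe Nile\n"),
]

def count_pairs_within(pairs, max_inp, max_targ):
    """Count pairs whose lengths fall strictly below both caps."""
    return sum(
        1 for inp, targ in pairs
        if len(inp) < max_inp and len(targ) < max_targ
    )

# Try a few cap settings and report how much data each one keeps
for max_inp, max_targ in [(20, 10), (30, 20), (50, 30)]:
    kept = count_pairs_within(toy_pairs, max_inp, max_targ)
    print(f"MAX_INP={max_inp}, MAX_TARG={max_targ}: {kept} of {len(toy_pairs)} pairs kept")
```

Running the same kind of count on the real `raw_train_dataset` shows how quickly the usable data grows as the caps are raised.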

In [None]:
# Build a list of questions and answers plus character sets for each.
# Only include Q/A pairs where the message lengths fit within the
# maximums noted here:
MAX_INP = 30
MAX_TARG = 20

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

for entry in raw_train_dataset:
  # Some entries may not have an answer: Skip them
  if len(entry['answers']['text']) > 0:
    
    
    inp = entry['question'].strip()
    inp = inp.encode("ascii", "ignore")
    inp = inp.decode()
    
    targ = "\t" + entry['answers']['text'][0].strip() + "\n"
    targ = targ.encode("ascii","ignore")
    targ = targ.decode()

    if (len(inp) < MAX_INP) and (len(targ) < MAX_TARG):

      input_texts.append(inp)
      input_characters.update(inp)  # Adding to a set ignores duplicates

      target_texts.append(targ)
      target_characters.update(targ)

len(input_texts), len(target_texts) # How much training data do we have

In [None]:
#
# Task 1a: Display the list of input characters
#


In [None]:
#
# Task 1b: Display the list of target characters
#


In [None]:
#
# Task 1c: Display the lengths of the two character sets.
# Leave a comment explaining the difference in lengths of the two sets.
#


In [None]:
# Task 1d: Display the first 15 pairs of questions and answers
# 


In [None]:
# Colab instances do not have much memory. This code provides an upper
# bound on how many training instances we will try to process.
num_samples = min(20000, len(input_texts))
input_texts = input_texts[:num_samples]
target_texts = target_texts[:num_samples]

In [None]:
# Randomize the order of the texts
arr = np.arange(num_samples)
np.random.shuffle(arr)

input_texts = [input_texts[i] for i in arr]
target_texts = [target_texts[i] for i in arr]

In [None]:
# Task 1d: Display the first 15 pairs of questions and answers after randomization
# 


In [None]:
# This code creates token lists (of character tokens)
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)


In [None]:
# Task 1e: Display the number of unique input and output tokens.
# Add a comment explaining why the number of tokens is different.
#


In [None]:
# This code defines "maximum message lengths" measured in characters.
# Because we are doing a character-level model these values define how far
# the encoder LSTM and the decoder LSTM (respectively) need to be "unrolled"
# in order to do the training. 
#
# Also remember that shorter sequences will need to be padded.
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

In [None]:
# Task 1f: Display the max sequence length for the encoder and the decoder.
# Explain how these values came to be.
#


In [None]:
# Next we will vectorize all of the input and target messages

# First, make Python dictionaries for the input and target messages
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])


In [None]:
# Remember that the seq2seq data is actually a 3-tuple. The encoder takes 
# input messages as input but creates no output except for the hidden state.
# The decoder has both inputs and targets. These three lines fill vectors
# with zeroes to initialize them.

# The encoder inputs - will hold character sequences for the short questions
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32")

# The decoder inputs - will hold character sequences for the short answers
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32")

# Same size numpy array for the decoder targets
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype = "float32")

In [None]:
# Now fill the vectors

# Iterate over all of our question/answer pairs
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    
    # Iterate over all of the characters in the input question
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0

    # This adds padding with spaces. Using len() instead of the loop variable
    # keeps this correct even if a question is empty after ASCII conversion.
    encoder_input_data[i, len(input_text) :, input_token_index[" "]] = 1.0

    # Iterate over all of the characters in the target phrase. Here we are
    # filling two vectors 
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    
    # This adds padding with spaces
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0
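
The one-step offset between `decoder_input_data` and `decoder_target_data` is easier to see on plain strings than on one-hot arrays. The sketch below is an illustration only: `shifted_decoder_strings` is a hypothetical helper that is not part of the template, and it assumes the same space-padding convention as the loop above.

```python
def shifted_decoder_strings(target_text, max_len, pad=" "):
    """Return (decoder_input, decoder_target) as space-padded strings.

    The decoder input keeps the leading start character, while the decoder
    target drops it, so the target at position t equals the input at t + 1.
    """
    decoder_input = target_text.ljust(max_len, pad)
    decoder_target = target_text[1:].ljust(max_len, pad)
    return decoder_input, decoder_target

# "\t" is the start marker and "\n" is the stop marker, as in the template
dec_in, dec_targ = shifted_decoder_strings("\tParis\n", max_len=9)
print(repr(dec_in))    # what the decoder sees at each time step
print(repr(dec_targ))  # what the decoder should predict at each time step
```

At every time step the decoder is shown the correct previous character and asked to predict the next one, which is why the two arrays differ only by this shift.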

In [None]:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())
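
As a small illustration of how the forward and reverse dictionaries work together, the sketch below builds both for a made-up six-character vocabulary and round-trips a short string. The names `toy_chars`, `toy_index`, and `toy_reverse` are hypothetical and exist only for this example.

```python
# Hypothetical toy vocabulary, sorted the same way as the template's lists
toy_chars = sorted(set("\tabc \n"))

# Forward lookup: character -> integer position in the one-hot vector
toy_index = {char: i for i, char in enumerate(toy_chars)}

# Reverse lookup: integer position -> character
toy_reverse = {i: char for char, i in toy_index.items()}

encoded = [toy_index[char] for char in "cab"]        # text to indices
decoded = "".join(toy_reverse[i] for i in encoded)   # indices back to text
print(encoded, decoded)
```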

In [None]:
# This is a diagnostic to check our work from above.
# Task 1g: Add comments to each line of code in this block
chars = ""

for i, iv in enumerate(encoder_input_data[0]):
  for j, val in enumerate(iv):
    if val == 1.0:
      chars += reverse_input_char_index[j]

print(chars)

In [None]:
# Task 1h: Create another block of code modeled on the above
# that translates one instance of decoder input data. 
#


In [None]:
# Task 1i: Create another block of code modeled on the above
# that translates one instance of decoder target data. 
#


In [None]:
# Task 1j: Display the lengths of the encoder input data, decoder input data
# and decoder target data. Explain the results in a comment.

*Task 2*: Build a sequence to sequence LSTM model to train.

In [None]:
# Example hyperparameters to use for training the model: Tweak these to improve
# model performance. What does changing the batch_size do? Should the latent
# dimension of the "thought vector" be larger or smaller?

batch_size = 32  # Batch size for training.

latent_dim = 256  # Latent dimensionality of the thought vector encoding space.


In [None]:
# 1. Define an input sequence and process it.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))

# 2. Use a LSTM layer to process the input vectors. After today's lecture
# you should know what return_state does.
encoder = keras.layers.LSTM(latent_dim, return_state=True)

# 3. Save the output from the encoder, but see step 4. Note the use of 
# the functional programming interface here. For deep learning models that
# are not simple sequential layers, this interface provides a straightforward
# way of connecting one element of a model to the element that it should 
# feed into.
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# 4. We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [None]:
# Set up the decoder, using `encoder_states` as initial state.

# This takes the target tokens as the input.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))

# The LSTM layer has the same internal dimensionality as for the encoder.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)

# Save the decoder output: Note that this uses decoder_inputs
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# Dense with softmax allows us to predict categorical output (our list of target characters)
decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")

# Output layer
decoder_outputs = decoder_dense(decoder_outputs)

# Define the overall model. This binds the encoder and decoder and will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
# We use categorical_crossentropy because our prediction is multinomial: we
# are trying to predict which is the most likely character for the next time step.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["categorical_accuracy"])
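
As a rough illustration of what this loss measures at a single time step, the sketch below computes categorical cross-entropy by hand for a hypothetical three-character vocabulary. The probabilities are made up for the example, and Keras additionally averages over time steps and batch items, which this sketch does not show.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy for one time step: -sum(y * log(p))."""
    return -sum(
        y * math.log(p)
        for y, p in zip(one_hot_target, predicted_probs)
        if y > 0  # only the true character contributes to the sum
    )

target = [0.0, 1.0, 0.0]        # the true next character is index 1
confident = [0.05, 0.90, 0.05]  # hypothetical softmax output, mostly right
unsure = [0.40, 0.30, 0.30]     # hypothetical softmax output, mostly wrong

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(round(loss_confident, 3), round(loss_unsure, 3))
```

The loss shrinks toward zero as the model puts more probability on the correct next character, which is the behavior training tries to encourage.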


In [None]:
# Task 2a: Display a model summary using the appropriate method.
# Add a comment describing the number of trainable parameters and how 
# this might compare to a CNN model.
#


In [None]:
# Task 2b: Display a plot of the model's shape using the appropriate method.
# Add a comment pointing out how and where the 3-tuple data are used in the model.


*Task 3*: Train the model, paying close attention to training time and the progress of validation loss and validation accuracy. Given the basic model hyperparameters used in the template and the small amount of data, this model trains very quickly. You may need to raise the number of epochs, lower the batch size, or both to improve the model accuracy.
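
One way to judge whether more epochs would help is to check whether validation loss has stopped improving. The sketch below is a minimal illustration, not part of the template: `best_epoch` and `has_plateaued` are hypothetical helpers, the `patience` and `min_delta` defaults are arbitrary choices, and `toy_val_loss` is made-up data standing in for the `val_loss` list that Keras records in `history.history` when a validation split is used.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def has_plateaued(val_losses, patience=3, min_delta=0.001):
    """True if the last `patience` epochs did not beat the earlier best
    validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) > best_before - min_delta

# Hypothetical validation losses, one per epoch
toy_val_loss = [1.20, 0.95, 0.88, 0.86, 0.87, 0.86, 0.88]
print(best_epoch(toy_val_loss), has_plateaued(toy_val_loss))
```

If the check reports a plateau while validation accuracy is still low, changing the data or the hyperparameters is likely to help more than simply adding epochs.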

In [None]:
# Train the model: Choose a value for epochs to balance training time and accuracy

epochs = 35  # Number of epochs to train for. You may need to raise this after seeing initial results.

history = model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1,
)

# You should be looking for a val_categorical_accuracy in excess of 0.60.

In [None]:
# Graphing code fragment modified from Rahul Verma on Stack Overflow
from matplotlib import pyplot as plt

# Plot training and validation accuracy together so they can be compared
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['val_categorical_accuracy'])
plt.title('Model History')
plt.ylabel('Categorical Accuracy')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='lower right')
plt.show()

#### Task 3: Add a comment about training success

Replace this text with a comment that describes what you see in the model history graph and the model training diagnostics. Have you reached the point in the training where not much additional improvement can be achieved? How do you know? Is the validation categorical accuracy sufficient?

## Task 4: Use the trained model for inference

The basic inference functions from Lab 11 are included here. 

Note: Add a code timer to the appropriate code to find out how long it takes to run inference. 
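
A simple way to time a block of code is to wrap it in a small context manager built on `time.perf_counter()`. The sketch below is one possible pattern, not the required approach: `timer` is a hypothetical helper, and the summation is a stand-in workload rather than real inference.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")

# Stand-in workload; in the notebook this would wrap an inference call
with timer("toy workload"):
    total = sum(i * i for i in range(100_000))
```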

In [None]:
# Construct the encoder and decoder

# Here's the encoder
encoder_inputs = model.input[0]  # Encoder input layer
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # Encoder model LSTM

# This is the "thought vector," the hidden state that is used to start the decoder
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

# This is the decoder, starting with the input layer 
decoder_inputs = model.input[1]  
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_lstm = model.layers[3] # Decoder model LSTM 

decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]

decoder_dense = model.layers[4] # Dense output layer

decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

In [None]:
# Task 4a: Summarize the encoder model with the appropriate method.
#


In [None]:
# Task 4b: Summarize the decoder model with the appropriate method.
#


#### Task 4c: Add a comment explaining the encoder and decoder models

Replace this text with a comment that describes what you see above in the model summaries. Comment on the number of trainable parameters and how this corresponds to the size of the original model we trained. Based on the model builder just above, describe how the internal cell states are managed for the decoder and encoder. Make sure you use the correct terminology when referring to each of the two internal cell states.

In these next three blocks, we have code to test whether our encoder is properly trained.

In [None]:
# Make one prediction using the first data instance
states_value = encoder_model.predict(encoder_input_data[0 :  1])

states_value[0].shape, states_value[1].shape

In [None]:
# Make one prediction using the second data instance
states_value1 = encoder_model.predict(encoder_input_data[1 :  2])

states_value1[0].shape, states_value1[1].shape

In [None]:
# We now have two instances of thought vectors, with states_value
# based on doing a prediction using the FIRST instance from the input
# data and the states_value1 based on doing a prediction using the SECOND 
# instance from the input data. Because these two pieces of input data
# are very different, they should produce very different thought vectors.
# If the resulting thought vectors are similar or identical, this indicates that
# the encoder model is not properly trained.

# Task 4d: Write a few lines of code to compare the two thought vectors. 
# Confirm whether they are notably different from one another. If they are 
# too similar, the model tests below will tend to generate the same answers
# for many questions.
#

Task 4d, just above, is critical for diagnosing whether the encoder stage of the model is sufficiently trained. If the two thought vectors are too similar, then the encoder has not trained properly. Go back and make adjustments to the model hyperparameters to improve the training.

Add a comment in this text block documenting where you ended up.

In [None]:
# Here's a function to decode sequences. This takes a one-hot encoded 
# input sequence as the input. It runs the encoder model with that input to
# generate the "thought vector." The thought vector is fed to the decoder
# model and a \t is issued to the decoder to start producing a sequence.

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    stop_condition = False
    decoded_sentence = ""
    
    # Always a little risky to use a while loop, but we don't know what 
    # length of sentence the decoder will issue - that's pretty much the 
    # whole point of a sequence-to-sequence model, right?
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=False)

        # Sample a token: Find the index of the most probable output character
        sampled_token_index = np.argmax(output_tokens[0, -1, :])

        # Use our reversing dictionary to decode the predicted character
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length of a decoder output sequence
        # or find that the decoder has issued a stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))

        # This one-hot encodes the current character to use as input for the next iteration
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    
    return decoded_sentence


In [None]:
# The range() call controls how many input sequences we will test. 
# Task 4e: Change to review at least 10 decoded sequences

for seq_index in range(1):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index : seq_index + 1]

    # Here's where we call our custom function
    decoded_sentence = decode_sequence(input_seq)

    print("Input sentence:", input_texts[seq_index], "Decoded sentence:", decoded_sentence)

In [None]:
# Here's a function that puts one more layer of convenience on top of
# decode_sequence(). This function takes plain text as input and converts
# it to the one-hot encoding needed to get the prediction model to work.

def answer(text_2):
  text_2 = text_2.replace("\n","")
  if len(text_2) <= max_encoder_seq_length:
    # Fill up a single one-hot encoding vector with zeroes
    my_input_data =  np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype="float32")

    # Iterate over all of the characters in the input phrase
    for t, char in enumerate(text_2):
      my_input_data[0, t, input_token_index[char]] = 1.0

    # This adds padding with spaces. Using len() rather than the loop
    # variable avoids an error when text_2 is empty.
    my_input_data[0, len(text_2) :, input_token_index[" "]] = 1.0
    translated_sentence = decode_sequence(my_input_data)
    print(text_2, " ", translated_sentence)
    return translated_sentence

  else:
    print("Input phrase is longer than the maximum encoder sequence length.")


In [None]:
# Task 4f: Test the answer() function with at least two different examples
# of short questions. Make sure not to exceed the maximum encoder sequence length.
#
 

*Task 5*: In this final task we are going to use the validation data from SQUAD V2 to make some predictions and compare what the model produces to what the real answers are in the validation data.

In [None]:
# This is similar to the code at the top of the notebook.
raw_test_dataset = raw_dataset["validation"] # We will use the validation data now

type(raw_dataset), type(raw_test_dataset) # Display the types

In [None]:
# Task 5a: Display how much data is in the raw_test_dataset
#


In [None]:
# This is similar to the code at the top of the notebook. We're going to make
# the assumption, based on reusing the same ASCII conversion as before, that
# the character sets produced in these questions and answers will be the same
# as the ones our model was set up to handle.
 
# Build a list of questions and answers plus character sets for each test instance
test_input_texts = []
test_target_texts = []

for entry in raw_test_dataset:
  # Some entries may not have an answer: Skip them
  if len(entry['answers']['text']) > 0:
    
    inp = entry['question'].strip()
    inp = inp.encode("ascii", "ignore")
    inp = inp.decode()
    
    targ = "\t" + entry['answers']['text'][0].strip() + "\n"
    targ = targ.encode("ascii","ignore")
    targ = targ.decode()

    # Remember that we can only use short questions and answers, same as
    # the original model
    if (len(inp) < MAX_INP) and (len(targ) < MAX_TARG):
      test_input_texts.append(inp)
      test_target_texts.append(targ)



In [None]:
# Task 5b: Display how much data is in test_input_texts and test_target_texts
#


In [None]:
# Task 5c: Display the first 15 question/answer pairs
#


In [None]:
# Task 5d: Display one model prediction generated by answer() and compare that
# to the actual, correct response in test_target_texts
#


To complete Task 5, you will need to generate cosine similarities between the predictions created by answer() and the actual, correct responses in test_target_texts. We'll use a sentence embedding model (a sentence transformer) to do the job.
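
For reference, cosine similarity compares the direction of two vectors and ignores their length. The sketch below computes it by hand on tiny made-up vectors; `cosine_sim` is a hypothetical helper written for this illustration, and real sentence embeddings have hundreds of dimensions rather than three.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings"
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # unrelated vectors
print(round(same_direction, 3), round(orthogonal, 3))
```

Values near 1 indicate that two texts have very similar embeddings, while values near 0 indicate little similarity.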

In [None]:
!pip install sentence-transformers

In [None]:
# Now load a pre-trained sentence transformer. There are hundreds to choose from.
# This downloads a lot of data to your virtual machine and takes half a minute or so.
from sentence_transformers import SentenceTransformer

sent_model = SentenceTransformer('distiluse-base-multilingual-cased-v2')
# Why is it sometimes a good idea to use a multilingual model?

In [None]:
# Now evaluate all of the text items in a loop generating a list of cosine similarities
from sklearn.metrics.pairwise import cosine_similarity
cos_sim_list = []

for test_text, correct_answer in zip(test_input_texts, test_target_texts):
  a = sent_model.encode([answer(test_text)])
  b = sent_model.encode([correct_answer])
  # cosine_similarity returns a 1x1 array, so pull out the single scalar value
  cos_sim_list.append(cosine_similarity(a, b)[0, 0])

In [None]:
# Task 5e: Display some of cos_sim_list and say why the values make sense.
#


In [None]:
# Task 5f: Plot a histogram of the values in cos_sim_list
#


In [None]:
# Task 5g: Show the mean cosine similarity of the values in cos_sim_list. The
# basic code template generates a mean cosine similarity of about 0.35. Your
# job in adjusting the model training data and hyperparameters is to
# beat this value.
#


## Concluding Comments

Replace this text with answers to the following questions based on your exploration of model and data parameters.

1. What's the upper limit of sequence length for encoder and decoder models that still produces a trainable model?

2. Why can't you properly train a sequence to sequence model with longer sequences?

3. Given the best model that you were able to train, is the model producing sensible answers? Why or why not? 

4. What are one or two of the main challenges involved in working with a question and answer dataset like SQUAD V2?

5. Based on everything you have learned in the class, does it seem like a sequence to sequence model is the best approach to question answering? If you answered "no", then mention one or two other models that you think would perform better.