# CS5100 - Conversational Agent

Using Seq2Seq LSTM models

In [4]:
dataset_path = 'train.json'
file_id_train = '1qYS0TsSsMkBFINWqOryt6BiHlqZtS6Qs' # train.json from Google Drive
dataset_name = 'alpaca_data'
model_name = 'model' + dataset_name

## 1. Importing Packages

In [5]:
import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing
from typing import Tuple, List

## 2. Preprocessing Dataset

Stanford Alpaca: An Instruction-following LLaMA Model
Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto

Dataset: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

Year of Publication: 2023

Overview
The current Alpaca model is fine-tuned from a 7B LLaMA model on 52K instruction-following data generated by the techniques in the Self-Instruct paper, with some modifications that we discuss in the next section. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.

Alpaca is still under development, and there are many limitations that have to be addressed. Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We thus encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.

Our initial release contains the data generation procedure, dataset, and training recipe. We intend to release the model weights if we are given permission to do so by the creators of LLaMA. For now, we have chosen to host a live demo to help readers better understand the capabilities and limits of Alpaca, as well as a way to help us better evaluate Alpaca's performance on a broader audience.

*    Read the JSON file



In [6]:
# Download and load the dataset
import json

!gdown $file_id_train

with open(dataset_path, 'r') as file:
    data = json.load(file)

# Optionally use only a portion of the dataset
# (uncomment to limit the dataset size if system RAM is limited)
# data = data[:5000]

Downloading...
From: https://drive.google.com/uc?id=1qYS0TsSsMkBFINWqOryt6BiHlqZtS6Qs
To: /content/train.json
100% 19.0M/19.0M [00:00<00:00, 78.3MB/s]


In [7]:
print(f"Sample 1: {data[0]}")
print(f"Sample 2: {data[1]}")
print(f"Sample 3: {data[2]}")

Sample 1: {'question': 'Give three tips for staying healthy.', 'answer': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
Sample 2: {'question': 'What are the three primary colors?', 'answer': 'The three primary colors are red, blue, and yellow.'}
Sample 3: {'question': 'Describe the structure of an atom.', 'answer': 'An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.'}


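*    Keep only the question/answer pairs in which both texts are at most `max_sequence_length` whitespace-separated tokens long.
*    Add `<BOS>`, the beginning-of-sentence token, and `<EOS>`, the end-of-sentence token, around every answer.
*    Randomly sample 10% of the remaining pairs.
*    Create a `Tokenizer` and fit it on the whole sampled vocabulary (questions + answers, plus an `<UNK>` token for unknown words).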
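
A detail that matters for the later cells: with its default settings, the Keras `Tokenizer` lowercases text and strips most punctuation, including `<` and `>`. The special tokens therefore end up in `word_index` as `bos`, `eos`, and `unk`. The sketch below mimics that normalization in plain Python. `keras_like_normalize` is a hypothetical helper, and the filter string is assumed to match the Keras default.

```python
# Assumed to mirror the default `filters` argument of the Keras Tokenizer.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_normalize(text: str) -> list:
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in DEFAULT_FILTERS})
    return text.lower().translate(table).split()

tokens = keras_like_normalize("<BOS> They're red, blue, and yellow. <EOS>")
print(tokens)
```

The apostrophe is not in the filter string, so contractions such as `they're` survive as single tokens.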

In [8]:
from tensorflow.keras import preprocessing, utils
from tensorflow.keras.preprocessing.text import Tokenizer
from random import sample
max_sequence_length = 80

questions_list = []
answers_list = []

for item in data:
    question = item['question']
    answer = item['answer']
    # Temporarily tokenize question and answer to check length
    temp_question_tokens = question.split()  # Simple whitespace tokenization
    temp_answer_tokens = ('<BOS> ' + answer + ' <EOS>').split()

    # Only append if both question and answer are within the desired length
    if len(temp_question_tokens) <= max_sequence_length and len(temp_answer_tokens) <= max_sequence_length:
        questions_list.append(question)
        answers_list.append('<BOS> ' + answer + ' <EOS>')

# Calculate subset size (e.g., 10% of the data)
data_size = len(questions_list)
subset_size = int(0.1 * data_size)  # 10% of the total data

# Randomly select indices for the subset
indices = sample(range(data_size), subset_size)

# Create new lists containing only the sampled subset
sampled_questions_list = [questions_list[i] for i in indices]
sampled_answers_list = [answers_list[i] for i in indices]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['<UNK>'] + sampled_questions_list + sampled_answers_list)

vocab_size = len(tokenizer.word_index) + 1

# Save the tokenizer
with open(f'tokenizer{dataset_name}.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
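
The sampling above is unseeded, so the selected subset (and with it the vocabulary size) can differ between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable, is shown below. `sample_pairs` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def sample_pairs(questions, answers, fraction, seed=0):
    """Return a reproducible random subset of aligned (question, answer) pairs."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    subset_size = int(fraction * len(questions))
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    indices = rng.sample(range(len(questions)), subset_size)
    # The same indices are used for both lists, so pairs stay aligned.
    return [questions[i] for i in indices], [answers[i] for i in indices]

# Toy data for illustration only
questions = [f"q{i}" for i in range(20)]
answers = [f"<BOS> a{i} <EOS>" for i in range(20)]

qs1, as1 = sample_pairs(questions, answers, fraction=0.1, seed=42)
qs2, as2 = sample_pairs(questions, answers, fraction=0.1, seed=42)
print(qs1, as1)
```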

In [9]:
print(f"Vocab size: {vocab_size}")
print(f"Vocab: {tokenizer.word_index}")

Vocab size: 13404


`encoder_input_data`: tokenize the list of questions. Pad them to their maximum length.

`decoder_input_data`: tokenize the list of answers. Pad them to their maximum length.

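`decoder_output_data`: tokenize the list of answers, remove the first element (`<BOS>`) from every sequence, then pad and one-hot encode the result. The targets are therefore the decoder inputs shifted one step to the left.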
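
The relationship between the decoder input and the decoder target can be sketched on a single toy sequence. The token ids and the `pad_post` helper are hypothetical and stand in for the tokenizer output and `pad_sequences`. The helper only pads, it does not truncate.

```python
def pad_post(seq, maxlen, pad_id=0):
    """Right-pad a list of token ids with `pad_id` up to `maxlen`."""
    if len(seq) > maxlen:
        raise ValueError("sequence longer than maxlen; truncation is not handled here")
    return seq + [pad_id] * (maxlen - len(seq))

BOS, EOS = 1, 2                      # hypothetical ids for illustration
answer_ids = [BOS, 7, 8, 9, EOS]     # "<BOS> w7 w8 w9 <EOS>"
maxlen = 6

decoder_input = pad_post(answer_ids, maxlen)        # keeps <BOS>
decoder_target = pad_post(answer_ids[1:], maxlen)   # drops <BOS>, shifted left by one

# At each time step the decoder sees `inp` and is trained to predict `tgt`.
for t, (inp, tgt) in enumerate(zip(decoder_input, decoder_target)):
    print(t, inp, "->", tgt)
```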

In [10]:
from gensim.models import Word2Vec
import re

vocab = []
for word in tokenizer.word_index:
    vocab.append(word)

def tokenize(sentences: list) -> Tuple[List[List[str]], List[str]]:
    """
    Tokenize the sentences

    Parameters:
        sentences: list of sentences

    Returns:
        tokenized_sentences: list of tokenized sentences
        vocab_list: list of vocabulary
    """
    tokens_list = []
    vocab_list = [] # Flat list of every token seen across all sentences
    for sentence in sentences:
        sentence = sentence.lower() # Convert to lower case
        sentence = re.sub('[^a-zA-Z0-9\']', ' ', sentence) # Replace special characters with spaces
        tokens = sentence.split() # Tokenize the sentence
        vocab_list.extend(tokens)
        tokens_list.append(tokens)

    return tokens_list, vocab_list

# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences(sampled_questions_list)
questions_max_len = max([ len(x) for x in tokenized_questions ])
padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=questions_max_len, padding='post')
encoder_input_data = np.array(padded_questions)
print(f"Encoder input data shape: (# Samples, Max Sequence Length): {encoder_input_data.shape}")

# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences(sampled_answers_list)
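# NOTE: the maximum is taken over tokenized_questions, so the answers are padded to the
# question length (37 in this run) and longer answers are truncated by pad_sequences.
# Use tokenized_answers here to size the padding from the answers instead.
answers_max_len = max([ len(x) for x in tokenized_questions ])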
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=answers_max_len, padding='post')
decoder_input_data = np.array(padded_answers)
print(f"Decoder input data shape: (# Samples, Max Sequence Length): {decoder_input_data.shape}")

# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences(sampled_answers_list)
for i in range(len(tokenized_answers)):
    tokenized_answers[i] = tokenized_answers[i][1:] # Remove the <bos> token

padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=answers_max_len, padding='post')
onehot_answers = utils.to_categorical(padded_answers, vocab_size)
decoder_output_data = np.array(onehot_answers)
print(f"Decoder output data shape: (# Samples, Max Sequence Length, Vocab Size): {decoder_output_data.shape}")

Encoder input data shape: (# Samples, Max Sequence Length): (4209, 37)
Decoder input data shape: (# Samples, Max Sequence Length): (4209, 37)
Decoder output data shape: (# Samples, Max Sequence Length, Vocab Size): (4209, 37, 13404)
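
The one-hot target array is by far the largest object in this notebook. A back-of-the-envelope estimate from the shapes printed above is sketched below, assuming the array is stored as 4-byte floats (an assumption, since the dtype is not printed). For comparison, it also estimates the size of plain integer targets, which Keras accepts when the loss is `sparse_categorical_crossentropy`. That alternative is not what this notebook uses.

```python
num_samples, seq_len, vocab_size = 4209, 37, 13404  # from the shapes printed above
bytes_per_value = 4  # assumption: float32 one-hot array

num_values = num_samples * seq_len * vocab_size
one_hot_gib = num_values * bytes_per_value / 2**30

# Integer targets need one id per position instead of a vocab-sized vector.
sparse_mib = num_samples * seq_len * bytes_per_value / 2**20

print(f"one-hot targets: {num_values:,} values, about {one_hot_gib:.2f} GiB")
print(f"integer targets: about {sparse_mib:.2f} MiB")
```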


The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*    2 Input layers: one for `encoder_input_data` and another for `decoder_input_data`.
*    Embedding layer: converts token indices into fixed-size dense vectors. **(Note: don't forget the `mask_zero=True` argument here.)**
*    LSTM layer: provides the Long Short-Term Memory cells.
*    Dense layer: applies a softmax over the vocabulary at every decoder time step.

How it works:

1.    The `encoder_input_data` goes into the encoder Embedding layer (`encoder_embedding`).
2.    The output of the Embedding layer goes to the encoder LSTM, which produces 2 state vectors (`h` and `c`, together the `encoder_states`).
3.    These states are set as the initial state of the decoder LSTM.
4.    The `decoder_input_data` comes in through the decoder Embedding layer.
5.    The embeddings go into the decoder LSTM (initialized with the encoder states) to produce the output sequences.

In [11]:
encoder_inputs = tf.keras.layers.Input(shape=(questions_max_len,), name='Encoder_Inputs')
encoder_embedding = tf.keras.layers.Embedding(vocab_size, 200, mask_zero=True, name='Encoder_Embedding')(encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(200, return_state=True, name='Encoder_LSTM')(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_inputs = tf.keras.layers.Input(shape=(answers_max_len,), name='Decoder_Inputs')
decoder_embedding = tf.keras.layers.Embedding(vocab_size, 200, mask_zero=True, name='Decoder_Embedding')(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(200, return_state=True, return_sequences=True, name='Decoder_LSTM')
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

decoder_dense = tf.keras.layers.Dense(vocab_size, activation=tf.keras.activations.softmax, name='Output_Layer')
output = decoder_dense(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output, name='Encoder_Decoder_Model')
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()

Model: "Encoder_Decoder_Model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 Encoder_Inputs (InputLayer  [(None, 37)]                 0         []                            
 )                                                                                                
                                                                                                  
 Decoder_Inputs (InputLayer  [(None, 37)]                 0         []                            
 )                                                                                                
                                                                                                  
 Encoder_Embedding (Embeddi  (None, 37, 200)              2680800   ['Encoder_Inputs[0][0]']      
 ng)                                                                          
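
The summary output above is cut off after the first Embedding layer. As a rough cross-check, the parameter counts can be estimated from the layer sizes used in the model definition. The helper functions below are hypothetical and use the standard formulas for Embedding, LSTM, and Dense layers. Only the Embedding figure can be compared against the visible output (2680800), so the other numbers are estimates rather than notebook results.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    # One dense vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim: int, units: int) -> int:
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus bias
    return input_dim * units + units

vocab_size, dim, units = 13404, 200, 200  # values used in this notebook

emb = embedding_params(vocab_size, dim)
lstm = lstm_params(dim, units)
dense = dense_params(units, vocab_size)
total = 2 * emb + 2 * lstm + dense  # encoder + decoder embeddings and LSTMs, one Dense

print(f"Embedding: {emb:,}  LSTM: {lstm:,}  Dense: {dense:,}  Total: {total:,}")
```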

## 3. Training Model

Train the model for a number of epochs with `RMSprop` optimizer and `categorical_crossentropy` loss function.

In [12]:
model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=50, epochs=600)
model.save(f"{model_name}.keras")

Epoch 1/600
...
Epoch 78

## 4. Define Inference Models

**Encoder inference model**: Takes the question as input and outputs the LSTM states (`h` and `c`).

**Decoder inference model**: Takes 2 inputs: the LSTM states (initially the output of the encoder model) and the answer input sequence (starting with the `<bos>` token). It outputs the predicted word distribution for the question fed to the encoder model, along with its updated state values.

In [13]:
def make_inference() -> Tuple[tf.keras.models.Model, tf.keras.models.Model]:
    """
    Constructs separate encoder and decoder models for inference based on a trained Encoder-Decoder model.

    Returns:
        encoder_model (tf.keras.models.Model): A Keras model representing the encoder component for inference.
            This model takes encoder inputs (sequences of tokens) and outputs the encoder states (hidden state
            and cell state) produced by the encoder LSTM layer.

        decoder_model (tf.keras.models.Model): A Keras model representing the decoder component for inference.
            This model takes decoder inputs (sequences of tokens) along with the initial decoder states (hidden
            state and cell state) as inputs and outputs the decoder outputs and updated decoder states.
            It consists of the decoder LSTM layer and the decoder dense layer.
    """
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

    decoder_state_input_h = tf.keras.layers.Input(shape=(200,), name='Input_Layer_h')
    decoder_state_input_c = tf.keras.layers.Input(shape=(200,), name='Input_Layer_c')
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_ouputs = decoder_dense(decoder_outs)
    decoder_model = tf.keras.models.Model([decoder_inputs] + decoder_states_inputs, [decoder_ouputs] + decoder_states)

    return encoder_model, decoder_model

## 5. Talking with Chatbot

Convert a `str` into a padded sequence of token ids. Words missing from the vocabulary are mapped to the `unk` token.

In [14]:
def str_to_tokens(sentence: str) -> np.ndarray:
    words = sentence.lower()
    words = re.sub('[^a-zA-Z0-9\']', ' ', words)
    words = words.split()
    # tokens_list = [ tokenizer.word_index[word] for word in words ]
    tokens_list = list()
    for word in words:
        if word in tokenizer.word_index:
            tokens_list.append(tokenizer.word_index[word])
        else:
            tokens_list.append(tokenizer.word_index['unk'])
    padded_sequence = preprocessing.sequence.pad_sequences([tokens_list], maxlen=questions_max_len, padding='post')
    return padded_sequence

1.    First, we take a question as input and predict the state values using `encoder_model`.
2.    We set the state values as the initial state of the decoder's LSTM.
3.    Then, we create a target sequence that contains only the `<bos>` element.
4.    We feed this sequence, together with the states, into the `decoder_model`.
5.    We replace the `<bos>` element with the element predicted by the `decoder_model` and update the state values.
6.    We repeat steps 4 and 5 until we hit the `<eos>` tag or the maximum answer length.
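
The steps above amount to greedy decoding. A framework-free sketch of that loop is shown below. `greedy_decode` and `toy_step` are hypothetical: `toy_step` stands in for one call to `decoder_model.predict` and simply follows a fixed script, so the example runs without a trained model.

```python
from typing import Callable, List, Sequence, Tuple

def greedy_decode(step_fn: Callable[[int, object], Tuple[Sequence[float], object]],
                  initial_state: object, bos_id: int, eos_id: int,
                  max_len: int) -> List[int]:
    """Feed back the most probable token until EOS or `max_len` tokens."""
    token, state, output = bos_id, initial_state, []
    while len(output) < max_len:
        scores, state = step_fn(token, state)                    # one decoder step
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        if token == eos_id:
            break
        output.append(token)
    return output

SCRIPT = [3, 4, 5, 2]  # token the toy model "prefers" at each step; 2 is EOS

def toy_step(token: int, state: int) -> Tuple[List[float], int]:
    # The state is just a step counter here; a real decoder would carry (h, c).
    scores = [0.0] * 6
    scores[SCRIPT[min(state, len(SCRIPT) - 1)]] = 1.0
    return scores, state + 1

decoded = greedy_decode(toy_step, initial_state=0, bos_id=1, eos_id=2, max_len=10)
print(decoded)
```

Passing the step function as an argument keeps the stopping logic separate from the model call, which makes the loop easy to test in isolation.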

In [15]:
encoder_model, decoder_model = make_inference()

input_strs = [
    "What is your name?",
    "How are you feeling today?",
    "What is the weather like outside?",
    "Can you tell me a joke?",
    "How old are you?",
    "What is the capital of France?",
    "How do I bake a cake?",
    "What time is it?",
    "Who is the president of the United States?",
    "What is the meaning of life?"]

for input_str in input_strs:

    # Convert input string to tokens for the encoder
    states_values = encoder_model.predict(str_to_tokens(input_str), verbose=0)

    # Initialize target sequence with 'bos' (beginning of sentence) token
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['bos']

    # Initialize variables for translation and stop condition
    stop_condition = False
    decoded_translation = ''

    while not stop_condition:
        # Predict next word using the decoder model
        decoder_outputs, h, c = decoder_model.predict([empty_target_seq] + states_values, verbose=0)

        # Get the index of the most probable word and find the word corresponding to that index
        sampled_word_index = np.argmax(decoder_outputs[0, -1, :])
        sampled_word = None
        for word, index in tokenizer.word_index.items():
            if sampled_word_index == index:
                if word !="eos" and word !="unk":
                    decoded_translation += f" {word}"
                sampled_word = word

        if sampled_word == 'eos' or len(decoded_translation.split()) > answers_max_len:
            stop_condition = True

        # Update target sequence with sampled word index
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index

        # Update states for the next iteration
        states_values = [h, c]

    # Print the decoded translation
    print(f"Answer: {decoded_translation}")

Answer:  the world changed overnight unexpectedly
Answer:  i will be eating lunch
Answer:  the time in 24 hour format is 20 45
Answer:  hent tree what kind of the beach tomorrow
Answer:  is located in the road decreased by dr k r r martin
Answer:  the capital of the dominican republic is santo domingo
Answer:  the puppy pattered purposely past the petunias
Answer:  4
Answer:  the black research sea and the movie with a lightning strike an entire online common in which was a low rumbling of regular and a wounded distance
Answer:  the family goes to understand some vibrant
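
The decoding loop above scans the whole `word_index` dictionary for every generated word. A reverse mapping built once avoids that scan. The sketch below uses a toy vocabulary with hypothetical ids, and `ids_to_text` is a hypothetical helper. The Keras `Tokenizer` is understood to expose a comparable mapping as `index_word`, which could be used in place of the hand-built dictionary.

```python
word_index = {"unk": 1, "bos": 2, "eos": 3, "the": 4, "capital": 5}  # toy vocabulary

# Build the id -> word mapping once instead of scanning word_index per token.
index_word = {index: word for word, index in word_index.items()}

def ids_to_text(ids, skip=("bos", "eos", "unk")) -> str:
    """Map token ids to words, dropping special tokens and unknown ids (e.g. padding 0)."""
    words = (index_word.get(i) for i in ids)
    return " ".join(w for w in words if w is not None and w not in skip)

text = ids_to_text([2, 4, 5, 0, 3])
print(text)
```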


In [16]:
encoder_model, decoder_model = make_inference()

input_strs = ['Where is Barack Obama born?',
              'What is the capital of Canada?',
              'Who is the Prime Minister of India?']

for input_str in input_strs:

    # Convert input string to tokens for the encoder
    states_values = encoder_model.predict(str_to_tokens(input_str), verbose=0)

    # Initialize target sequence with 'bos' (beginning of sentence) token
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['bos']

    # Initialize variables for translation and stop condition
    stop_condition = False
    decoded_translation = ''

    while not stop_condition:
        # Predict next word using the decoder model
        decoder_outputs, h, c = decoder_model.predict([empty_target_seq] + states_values, verbose=0)

        # Get the index of the most probable word and find the word corresponding to that index
        sampled_word_index = np.argmax(decoder_outputs[0, -1, :])
        sampled_word = None
        for word, index in tokenizer.word_index.items():
            if sampled_word_index == index:
                if word !="eos" and word !="unk":
                    decoded_translation += f" {word}"
                sampled_word = word

        if sampled_word == 'eos' or len(decoded_translation.split()) > answers_max_len:
            stop_condition = True

        # Update target sequence with sampled word index
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index

        # Update states for the next iteration
        states_values = [h, c]

    # Print the decoded translation
    print(f"Question: {input_str}")
    print(f"Answer: {decoded_translation}\n")

Question: Where is Barack Obama born?
Answer:  4 is is they're

Question: What is the capital of Canada?
Answer:  the samsung galaxy s20 has an approximate retail value of 4 would divide by 7 5

Question: Who is the Prime Minister of India?
Answer:  the sum of the integers is 3 4



## 6. Downloading the Model and Tokenizer

In [17]:
# zip the files and download
import os
import shutil
from google.colab import files
from datetime import datetime

temp_dir = 'output'
files_to_include = [f'model{dataset_name}.keras', f'tokenizer{dataset_name}.pickle']

if not os.path.exists(temp_dir):
    os.makedirs(temp_dir)

try:
    for file in files_to_include:
        shutil.copy(file, temp_dir)
except Exception as e:
    print(f"Error copying files: {e}")

file_name = f"Model{dataset_name}_{datetime.now().strftime('%Y%m%d-%H%M%S')}"
shutil.make_archive(file_name, 'zip', 'output')
files.download(file_name + '.zip')
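
Outside Colab, the same bundling step can be done with the standard library alone. The sketch below is one possible variant, not the notebook's method. `bundle_files` is a hypothetical helper that fails early when a file is missing instead of printing the error and continuing, and the file names are placeholders created only for the demonstration.

```python
import os
import shutil
import tempfile
import zipfile

def bundle_files(paths, archive_base):
    """Copy `paths` into a staging directory and zip it; return the archive path."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    with tempfile.TemporaryDirectory() as staging:
        for p in paths:
            shutil.copy(p, staging)
        return shutil.make_archive(archive_base, 'zip', staging)

# Demonstration with placeholder files in a scratch directory
workdir = tempfile.mkdtemp()
paths = []
for name in ("model_demo.keras", "tokenizer_demo.pickle"):
    path = os.path.join(workdir, name)
    with open(path, "wb") as fh:
        fh.write(b"placeholder")
    paths.append(path)

archive_path = bundle_files(paths, os.path.join(workdir, "bundle"))
with zipfile.ZipFile(archive_path) as zf:
    names = sorted(zf.namelist())
print(archive_path, names)
```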
