# Text-based Chatbot using LSTM (College)


Technique: Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks whose gated memory cells make them well suited to long-term dependencies. In chatbots, where a conversation is processed as a sequence of words, LSTMs have been shown to perform well and to retain context over longer spans.


Purpose: Answer frequently asked questions (FAQs) from prospective college students outside office hours.

Benefit: Prospective students tend to contact the college's peer mentors at their own convenience. When students come from different parts of the world, the difference in time zones creates a problem for both students and mentors. A chatbot that takes questions outside office hours addresses this problem.

How it works: Questions are tagged to the extracted data, which creates a question-answer dataset. An LSTM-based neural network is then trained on this dataset.

Results: The model achieved a high accuracy score, in line with the correctness of the chatbot's answers to the test queries.

Caveat: A larger dataset could have been included given more time.

In [None]:
import numpy as np
import pandas as pd
# ! pip install tensorflow
import tensorflow as tf
from tensorflow.keras import layers , activations , models , preprocessing , utils
import pickle
import re
from gensim.models import Word2Vec

## Data Cleaning

In [None]:
# Mount Google Drive in Colab
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import yaml

with open(r'/content/gdrive/MyDrive/course_bachelors.yaml') as file:
  documents = yaml.load(file, Loader=yaml.FullLoader)
documents

{'categories': ['course', "Bachelor's"],
 'conversations': [['How long is the duration of the course?', 'Three years.'],
  ['What are the entry requirements for the course?',
   'Minimum entry requirements are a grade H5 and above in two higher level subjects together with a minimum of O6/H7 in four other subjects. A minimum of grade O6/H7 must be obtained in English. A grade O5/H6 must be obtained in Mathematics.For applicants whose first language is not English, please note the English language entry requirements. Mature applicants, applicants with a disability or those applying through the DARE or HEAR access schemes can find out more information on the application process.'],
  ['How much is the tuition fees for the course?',
   'The fees for this course for international students is €10000 per year. For domestic students applying through the CAO, this course applies under the free fees initiative.']]}

In [None]:
# Split the dataset into parallel question and answer lists.
# Only the 'conversations' section of the dataset is used for the chatbot.

questions, answers = [], []

conversations = documents['conversations']

for conv in conversations:
  if len(conv) > 2:
    # Several replies: join them into one answer so that each question
    # has exactly one answer (appending inside the loop would add extras).
    questions.append(conv[0])
    answers.append(' '.join(conv[1:]))
  elif len(conv) > 1:
    questions.append(conv[0])
    answers.append(conv[1])

In [None]:
questions

['How long is the duration of the course?',
 'What are the entry requirements for the course?',
 'How much is the tuition fees for the course?']

In [None]:
answers

['Three years.',
 'Minimum entry requirements are a grade H5 and above in two higher level subjects together with a minimum of O6/H7 in four other subjects. A minimum of grade O6/H7 must be obtained in English. A grade O5/H6 must be obtained in Mathematics.For applicants whose first language is not English, please note the English language entry requirements. Mature applicants, applicants with a disability or those applying through the DARE or HEAR access schemes can find out more information on the application process.',
 'The fees for this course for international students is €10000 per year. For domestic students applying through the CAO, this course applies under the free fees initiative.']
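
The splitting step can also be packaged as a small function and tried on toy data. The following is a minimal sketch rather than the notebook's own code: `split_conversations` is a hypothetical helper name and the sample entries are made up for illustration. It joins multi-part replies into one answer and skips entries that have no reply.

```python
def split_conversations(conversations):
    """Return parallel question and answer lists from [question, reply, ...] entries."""
    questions, answers = [], []
    for conv in conversations:
        if len(conv) < 2:
            continue  # a question without any reply cannot be used for training
        questions.append(conv[0])
        # Join every reply after the question into a single answer string.
        answers.append(' '.join(conv[1:]))
    return questions, answers


# Hypothetical sample data in the same shape as documents['conversations'].
sample = [
    ['How long is the course?', 'Three years.'],
    ['Where is the campus?', 'In the city centre.', 'Close to the station.'],
    ['Unanswered question?'],
]
sample_questions, sample_answers = split_conversations(sample)
print(sample_questions)
print(sample_answers)
```

Keeping the two lists the same length matters because the model later pairs them by position.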

In [None]:
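# Data preprocessing for seq2seq learning

# A single vocabulary is shared by questions and answers for tokenization.

# Keep only the pairs whose answer is a string. Filtering both lists together
# avoids calling pop() while iterating, which would shift the indices.
pairs = [(q, a) for q, a in zip(questions, answers) if isinstance(a, str)]
questions = [q for q, _ in pairs]

# Wrap each answer in start and end markers. The default Tokenizer filters
# strip '<' and '>', so the markers enter the vocabulary as 'start' and 'end'.
answers = ['<START> ' + a + ' <END>' for _, a in pairs]

tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions + answers)

VOCAB_SIZE = len(tokenizer.word_index) + 1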

In [None]:
VOCAB_SIZE 

# VOCAB_SIZE is the length of the word index plus 1, because word indices
# start at 1 and index 0 is reserved for padding.
# This is the vocabulary size used by the sequence-to-sequence model.

81
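
To make the off-by-one concrete, here is a simplified sketch of how a word index can be built. It only approximates what the Keras `Tokenizer` does (frequency-ordered indices starting at 1) and is not its actual implementation; `build_word_index` and the two sample sentences are hypothetical.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer index starting at 1, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    # Index 0 is never assigned to a word: it is reserved for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


word_index = build_word_index(['How long is the course', 'The course is three years'])
vocab_size = len(word_index) + 1  # +1 accounts for the reserved padding index 0
print(word_index)
print(vocab_size)
```

An embedding or output layer sized with `len(word_index)` alone would have no slot for the highest word index.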

In [None]:
vocab = []

for word in tokenizer.word_index:
  vocab.append(word)

def tokenize(sentences):
  tokens_list = []
  vocabulary = []
  for sentence in sentences:
    sentence = sentence.lower()
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    tokens = sentence.split()
    vocabulary += tokens
    tokens_list.append(tokens)
  return tokens_list, vocabulary

## Model Building

**Encoder input data**: This is a 2D numpy array of shape (num_questions, maxlen_questions), where num_questions is the number of questions in the dataset and maxlen_questions is the maximum length of a question in the dataset. Each element of the array represents a word in a question and is an integer corresponding to the index of that word in the vocabulary. The encoder input data is fed into the encoder LSTM, which processes the entire input sequence and produces a context vector.
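
As an illustration of the array layout described above, the following sketch pads integer sequences to a common length with trailing zeros. It is a plain-Python stand-in for the behaviour of `pad_sequences(..., padding='post')`; the helper name `pad_post` and the token ids are made up.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest sequence length."""
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]


# Hypothetical token ids for three questions of different lengths.
token_ids = [[5, 9, 2], [7, 1], [3, 8, 4, 6]]
padded = pad_post(token_ids)
print(padded)
print(len(padded), len(padded[0]))  # number of questions, padded length
```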




In [None]:
# encoder input data

tokenized_questions = tokenizer.texts_to_sequences(questions)

maxlen_questions = max([len(x) for x in tokenized_questions])

encoder_input_data = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')

#padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')

#encoder_input_data = np.array([padded_questions])

print(encoder_input_data.shape)

(3, 9)


**Decoder input data**: This is a 2D numpy array of shape (num_answers, maxlen_answers), where num_answers is the number of answers in the dataset and maxlen_answers is the maximum length of an answer in the dataset. Each element of the array represents a word in an answer and is an integer corresponding to the index of that word in the vocabulary. The decoder input data is fed into the decoder LSTM one word at a time, along with the context vector produced by the encoder LSTM.

In [None]:
# decoder input data

tokenized_answers = tokenizer.texts_to_sequences(answers)

maxlen_answers = max([len(x) for x in tokenized_answers])

decoder_input_data = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')

#padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')

#decoder_input_data = np.array(padded_answers)

print(decoder_input_data.shape)

(3, 87)



**Decoder output data**: This is built from the same tokenized answers as the decoder input data, so each element is again the vocabulary index of a word in an answer. Unlike the decoder input data, the decoder output data is shifted one time step ahead: the start token is dropped, so the first element of each output sequence is the second token of the original answer sequence, the second element is the third token, and so on. The decoder LSTM is trained to predict the next word from the previous words, so the decoder output data serves as the "expected output" at every time step. The sequences are then one-hot encoded with the utils.to_categorical() function for use with the categorical cross-entropy loss, which turns the 2D array of shape (num_answers, maxlen_answers) into a 3D array of shape (num_answers, maxlen_answers, VOCAB_SIZE).

In [None]:
# decoder output data

tokenized_answers = tokenizer.texts_to_sequences(answers)

for i in range(len(tokenized_answers)):
  tokenized_answers[i] = tokenized_answers[i][1:]

padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')

decoder_output_data = utils.to_categorical(padded_answers, VOCAB_SIZE)

#onehot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)

#decoder_output_data = np.array([onehot_answers])

print(decoder_output_data.shape)

(3, 87, 81)
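
The relationship between decoder input and decoder target can be shown on a single toy answer. This is a sketch with made-up token ids (1 stands for the start marker and 2 for the end marker); `make_decoder_pair` and `one_hot` are hypothetical helpers, not part of the notebook.

```python
def make_decoder_pair(answer_ids, maxlen, pad=0):
    """Build the decoder input and the target that runs one step ahead of it."""
    decoder_input = answer_ids + [pad] * (maxlen - len(answer_ids))
    target = answer_ids[1:]  # drop the start marker
    decoder_target = target + [pad] * (maxlen - len(target))
    return decoder_input, decoder_target


def one_hot(ids, num_classes):
    """Turn each id into a list with a single 1 at that position."""
    return [[1 if k == i else 0 for k in range(num_classes)] for i in ids]


# Hypothetical ids: start=1, end=2, words=3..5
dec_in, dec_target = make_decoder_pair([1, 3, 4, 5, 2], maxlen=6)
print(dec_in)      # [1, 3, 4, 5, 2, 0]
print(dec_target)  # [3, 4, 5, 2, 0, 0]
encoded = one_hot(dec_target, num_classes=6)
print(len(encoded), len(encoded[0]))  # time steps, vocabulary size
```

At every position `t`, the target holds the token that appears at position `t + 1` of the input, which is what the decoder is asked to predict.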


The code below defines a sequence-to-sequence model with an encoder-decoder architecture.

encoder_inputs is an input layer that takes sequences of integers as inputs with shape (maxlen_questions,). This layer is used to feed the encoded question sequences to the model.

encoder_embedding is an embedding layer that converts the integer input sequences to dense vectors of fixed size (batch_size, maxlen_questions, 200), where 200 is the size of the embedding dimension.

encoder_outputs, state_h and state_c are output tensors from an LSTM layer with 200 units, which takes the embedded input sequences as input. encoder_outputs contains the output of the last timestep of the LSTM layer, while state_h and state_c are the final hidden state and cell state, respectively. These states are used as the initial states for the decoder LSTM.

decoder_inputs is an input layer that takes sequences of integers as inputs with shape (maxlen_answers,). This layer is used to feed the encoded answer sequences to the model.

decoder_embedding is an embedding layer that converts the integer input sequences to dense vectors of fixed size (batch_size, maxlen_answers, 200), where 200 is the size of the embedding dimension. This layer is used to convert the input answer sequences to dense vectors that can be used by the decoder LSTM.

decoder_lstm is an LSTM layer with 200 units, which takes the embedded answer sequences as input and is initialized with the encoder states. The return_state and return_sequences parameters are set to True, so the layer returns the output at every timestep as well as its final states. The final states are discarded during training and are only needed at inference time.

decoder_outputs is an output tensor from the decoder LSTM layer. It contains the output sequence of the LSTM layer for each timestep.

decoder_dense is a dense layer with VOCAB_SIZE units, which takes the output sequence of the decoder LSTM layer as input. It is used to convert the output sequence to a probability distribution over the output vocabulary.

The final output of the model is obtained by passing the output sequence of the decoder LSTM layer through the dense layer.

The model is compiled with the Adam optimizer, categorical cross-entropy loss function, and accuracy metric. The model is trained to minimize the categorical cross-entropy loss between the predicted and actual output sequences.
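
For a single time step, the loss named above reduces to the negative log of the probability that the softmax output assigns to the correct word. A minimal sketch with hypothetical scores over a four-word vocabulary (the function names are illustrative, not Keras APIs):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probs):
    """Negative log-probability of the target class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)


probs = softmax([2.0, 0.5, 0.1, -1.0])  # hypothetical decoder scores
loss_right = categorical_cross_entropy([1, 0, 0, 0], probs)  # target is the top-scoring word
loss_wrong = categorical_cross_entropy([0, 0, 0, 1], probs)  # target is the lowest-scoring word
print(round(sum(probs), 6), loss_right < loss_wrong)
```

The loss is small when the model puts most of its probability on the expected word and grows as that probability shrinks.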

In [None]:
# The Keras Functional API is used to build the model.
# The model has two inputs: the encoder input and the decoder input.
# Each input passes through an Embedding layer followed by an LSTM layer.

# Embedding, LSTM and Dense layers

encoder_inputs = tf.keras.layers.Input(shape=(maxlen_questions, ))
encoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(200, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_inputs = tf.keras.layers.Input(shape=(maxlen_answers, ))
decoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(200, return_state=True, return_sequences=True)
decoder_outputs, _ , _  = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(VOCAB_SIZE, activation=tf.keras.activations.softmax)
output = decoder_dense(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer='adam', loss= tf.keras.losses.categorical_crossentropy, metrics=['accuracy'])

In [None]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 9)]          0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 87)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 9, 200)       16200       ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, 87, 200)      16200       ['input_2[0][0]']                
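
The parameter counts can be checked by hand. The embedding figure below matches the 16200 shown in the summary. The LSTM and Dense figures are not visible in the truncated output, so they are computed here from the standard formulas for these layer types and should be read as a worked estimate.

```python
VOCAB_SIZE = 81   # from the tokenizer cell
EMBED_DIM = 200   # embedding size used in the model
UNITS = 200       # LSTM units used in the model

# Embedding: one EMBED_DIM-sized vector per vocabulary index.
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((EMBED_DIM + UNITS) * UNITS + UNITS)

# Dense: one weight per (LSTM unit, vocabulary word) pair plus one bias per word.
dense_params = UNITS * VOCAB_SIZE + VOCAB_SIZE

# Two embeddings and two LSTMs (encoder and decoder) plus one Dense layer.
total_params = 2 * embedding_params + 2 * lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```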
                                                                                              

The code below trains the sequence-to-sequence model by fitting it to the prepared arrays. The data consists of three parts:

encoder_input_data: This is a 2D array that represents the input sequences for the encoder. Each row in the array represents a single input sequence, and the columns represent the tokens in the sequence. The shape of the array is (num_sequences, maxlen_questions).

decoder_input_data: This is a 2D array that represents the input sequences for the decoder. Each row in the array represents a single input sequence, and the columns represent the tokens in the sequence. The shape of the array is (num_sequences, maxlen_answers).

decoder_output_data: This is a 3D array that represents the target sequences for the decoder. The first axis indexes the sequences, the second the time steps, and the third holds the one-hot encoding of the token at that step. The shape of the array is (num_sequences, maxlen_answers, VOCAB_SIZE).

The model.fit function trains the model on this data. The batch_size parameter specifies the number of sequences processed per gradient update, and the epochs parameter specifies the number of passes over the entire training data.
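
With the three question-answer pairs loaded earlier, a batch size of 32 means that every epoch consists of a single gradient update. A quick check of that arithmetic, assuming the dataset is not extended:

```python
import math

num_samples = 3    # question-answer pairs in the YAML file shown earlier
batch_size = 32
epochs = 100

# The last (here: only) batch is simply smaller than batch_size.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```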

In [None]:
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=32, epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7fb0b534a350>

Inference in LSTM (Long Short-Term Memory) refers to the process of generating predictions using a trained LSTM model on new, unseen data.

The inference() function creates two models, the encoder model and the decoder model, which will be used to generate responses to questions based on the trained neural network. The encoder model takes in the input sequence of the question and generates the encoder states, which will be used to generate the response. The decoder model takes in the decoder input sequence along with the decoder states and generates the decoder output sequence and the decoder states.

In [None]:
# Making inferences

# Two inference models are built: an encoder model and a decoder model.
# Both reuse the layers (and therefore the weights) trained above.

def inference():
  # Encoder: question tokens -> final hidden and cell states
  encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

  # Decoder: previous token plus previous states -> next-word distribution plus new states
  decoder_state_input_h = tf.keras.layers.Input(shape=(200,))
  decoder_state_input_c = tf.keras.layers.Input(shape=(200,))
  decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

  decoder_outputs, state_h, state_c = decoder_lstm(
      decoder_embedding, initial_state=decoder_states_inputs)
  decoder_states = [state_h, state_c]
  decoder_outputs = decoder_dense(decoder_outputs)
  decoder_model = tf.keras.models.Model(
      [decoder_inputs] + decoder_states_inputs,
      [decoder_outputs] + decoder_states)

  return encoder_model, decoder_model

The preprocess_input() function takes in an input sentence and tokenizes it, converts the tokens to their corresponding word indices, and then pads the sequence to ensure it has the same length as the maximum length of the input sequences used in training the neural network. The output of this function is the padded token sequence.

In [None]:
def preprocess_input(input_sentence):
    # Reuse the fitted tokenizer so the query is lowercased and filtered exactly
    # like the training questions. Words outside the vocabulary are skipped
    # instead of raising a KeyError.
    tokens_list = tokenizer.texts_to_sequences([input_sentence])
    return preprocessing.sequence.pad_sequences(tokens_list, maxlen=maxlen_questions, padding='post')
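
Queries typed by users often contain punctuation or words that never appeared in the training data. The sketch below shows one way to encode such input without a failed lookup. It is a plain-Python illustration under assumptions: `encode_query` is a hypothetical helper and `toy_index` is a made-up vocabulary standing in for `tokenizer.word_index`.

```python
import re


def encode_query(sentence, word_index, maxlen):
    """Lowercase, strip punctuation, drop unknown words and right-pad with zeros."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    ids = [word_index[w] for w in words if w in word_index]  # skip unseen words
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))


# Hypothetical vocabulary; the real one comes from tokenizer.word_index.
toy_index = {'how': 1, 'long': 2, 'is': 3, 'the': 4, 'course': 5}
known = encode_query('How long is the course?', toy_index, maxlen=7)
partly_unknown = encode_query('How expensive is the course?', toy_index, maxlen=7)
print(known)
print(partly_unknown)
```

Dropping unknown words is only one option. Mapping them to a dedicated out-of-vocabulary index is another, but that index would have to exist during training as well.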

The next line calls the inference() function to obtain the encoder and decoder models and stores them in enc_model and dec_model, respectively.

In [None]:
enc_model , dec_model = inference()

## Chatbot Testing

Inference is performed by first encoding the input query with the encoder model and then decoding with the decoder model, which iteratively predicts the next word until the 'end' token is generated or the maximum answer length is reached. The predicted output is then printed as the bot's response to the input query.

In [None]:
tests = ['How long is the duration of the course', 'What are the entry requirements for the course', 'How much is the tuition fees for the course']

for test in tests:
    # Encode the question into the initial decoder states
    states_values = enc_model.predict(preprocess_input(test))
    # Start decoding from the 'start' token
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''

    while not stop_condition:
        dec_outputs, h, c = dec_model.predict([empty_target_seq] + states_values)
        # Greedy decoding: take the most probable next word
        sampled_word_index = int(np.argmax(dec_outputs[0, -1, :]))
        # index_word is the inverse of word_index; index 0 (padding) maps to no word
        sampled_word = tokenizer.index_word.get(sampled_word_index)
        if sampled_word is not None:
            decoded_translation += f' {sampled_word}'

        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True

        # Feed the predicted word and the updated states back into the decoder
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]
    print(f'Human: {test}')
    print()
    decoded_translation = decoded_translation.split(' end')[0]
    print(f'Bot: {decoded_translation}')
    print('-'*25)


Human: How long is the duration of the course

Bot:  three years
-------------------------
Human: What are the entry requirements for the course

Bot:  minimum entry requirements are a grade h5 and above in two higher level subjects together with a minimum of o6 h7 in four other subjects a minimum of grade o6 h7 must be obtained in english a grade o5 h6 must be obtained in mathematics for applicants whose first language is not english please note the english language entry requirements applicants applicants applicants with a disability or those applying through the dare or hear access schemes can find out more information on the application process
-------------------------
Human: How much is the tuition fees for the course

Bot:  the fees for this course for international students is €10000 per year for domestic students applying through the cao this course applies under the free fees initiative
-------------------------
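
The decoding loop used for testing can be isolated from Keras to show its control flow. The following is a sketch under assumptions: `greedy_decode` is a hypothetical helper, and `toy_step` is a made-up stand-in for the decoder model that returns a probability list and a state.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Repeatedly pick the most likely next token until end_id or max_len."""
    token, state, output = start_id, None, []
    while len(output) < max_len:
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output


def toy_step(token, state):
    """Stand-in for the decoder model: always favours token + 1 (hypothetical)."""
    vocab_size = 6
    probs = [0.01] * vocab_size
    probs[min(token + 1, vocab_size - 1)] = 0.95
    return probs, state


# Made-up ids: 1 = start, 5 = end, 2..4 = words
full = greedy_decode(toy_step, start_id=1, end_id=5, max_len=10)
capped = greedy_decode(toy_step, start_id=1, end_id=5, max_len=2)
print(full)
print(capped)
```

The `max_len` cap is what guarantees termination when the model never emits the end token.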
