<a href="https://colab.research.google.com/github/pravin691983/chatbot/blob/main/ChatBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This project implements, compares, and analyzes two models, retrieval-based and generative, that represent the state of the art in neural machine translation applied to chatbots.
The retrieval-based model is built on a simple neural network, while the generative model is implemented exclusively with a Sequence-to-Sequence (LSTM encoder-decoder) architecture.



# Problem Statements

Businesses aim to improve customer experience and reduce costs by integrating the right conversational AI technology, which enables automatic messaging and conversation between computers and humans.

# Anatomy of Conversational AI
Chatbots fall into two broad categories:

- **Rule-Based Approach** – In this approach, a bot is trained according to rules. It can answer simple queries but sometimes fails to answer complex ones.
- **Self-Learning Approach** – These bots follow a machine learning approach, which is generally more effective, and are further divided into two categories.
  - **Retrieval-Based Models** – In this approach, the bot retrieves the best response from a list of responses according to the user input.
  - **Generative Models** – These models generate their own answers rather than searching a fixed set of answers, which makes them intelligent bots as well.

# 1. Retrieval-Based Models Chat Bot


In this Python project, we build a chatbot using deep learning techniques. The chatbot is trained on a dataset that contains categories (intents), patterns, and responses.

We use a simple neural network to classify which category the user's message belongs to, and then return a random response from that category's list of responses. The implementation uses NLTK, Keras, TensorFlow, and Python.




## 1.1: IMPORT LIBRARIES AND DATASETS

Now we import the libraries needed to load, process, and transform our data before feeding it into a deep learning network. Make sure your JSON file is where the notebook expects it: the code below reads intents.json from the Data folder under the project path.



In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
projectPath = '/content/drive/My Drive/Colab Notebooks/Modern AI Portfolio Builder/Chat Bot/'
%cd /content/drive/My Drive/Colab Notebooks/Modern AI Portfolio Builder/Chat Bot/


In [None]:
print(projectPath + 'Data/intents.json')

In [None]:
# Install necessary packages

!pip install nltk
!pip install tensorflow
!pip install keras

In [None]:
# Import the necessary packages

import json
import os
import pickle
import random

import nltk
from nltk.stem import WordNetLemmatizer
import numpy as np
import tensorflow as tf
from tensorflow import keras
# use the Keras bundled with TensorFlow throughout, so that layers, optimizers
# and callbacks all come from the same package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt
import PIL
import seaborn as sns

### Load Data

In [None]:
# Initialise data
lemmatizer = WordNetLemmatizer()
words=[]
classes = []
documents = []
ignore_words = ['?', '!']

In [None]:
# Download wordnet & punkt

nltk.download('punkt')
nltk.download('wordnet')

In [None]:
# load rules based intents (the context manager closes the file after reading)
with open(projectPath + 'Data/intents.json') as data_file:
    intents = json.load(data_file)

In [None]:
for intent in intents['intents']:
    for pattern in intent['patterns']:
        #tokenize each word
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #add documents in the corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

In [None]:
documents

In [None]:
classes

## 1.2: PERFORM DATA CLEANUP AND FEATURE ENGINEERING

In [None]:
# lemmatize, lower each word and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]

words = sorted(set(words))
# sort classes
classes = sorted(set(classes))
# documents = combination between patterns and intents
print(len(documents), "documents")
# classes = intents
print(len(classes), "classes", classes)
# words = all words, vocabulary
print(len(words), "unique lemmatized words", words)

# persist the vocabulary and class labels; the context managers close the files
with open('words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('classes.pkl', 'wb') as f:
    pickle.dump(classes, f)

## 1.3: TRAIN DEEP LEARNING MODEL FOR RETRIEVAL BASED CHAT BOT

We have already taken the “tag” and “patterns” out of the file and stored them in lists, together with the collection of unique words found in the patterns. We now use that vocabulary to create a Bag of Words (BoW) vector for each pattern.
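
To make the Bag of Words idea concrete, here is a minimal sketch that is independent of the notebook's data. The vocabulary and the sentence are made up for illustration, and `bag_of_words` is a hypothetical helper, not a function from NLTK or Keras.

```python
# Illustrative vocabulary (an assumption, not taken from intents.json)
vocabulary = sorted({"hello", "how", "are", "you", "bye", "thanks"})


def bag_of_words(tokens, vocabulary):
    """Return a binary vector with a 1 wherever a vocabulary word occurs in tokens."""
    token_set = set(tokens)
    return [1 if word in token_set else 0 for word in vocabulary]


tokens = "how are you".split()
bow_vector = bag_of_words(tokens, vocabulary)

print(vocabulary)  # ['are', 'bye', 'hello', 'how', 'thanks', 'you']
print(bow_vector)  # [1, 0, 0, 1, 0, 1]
```

The vector has one position per vocabulary word, so every sentence maps to a fixed-length input regardless of how many words it contains. Word order is lost, which is acceptable for intent classification on short patterns.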

In [None]:
# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    print("Before lemmatize pattern_words", pattern_words)
    # lemmatize each word - create base word, in attempt to represent related words
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    print("After lemmatize pattern_words", pattern_words)
    # create our bag of words array with 1, if word match found in current pattern
    bag = [1 if w in pattern_words else 0 for w in words]

    # output is a '0' for each tag and '1' for current tag (for each pattern)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    print("bag : ", bag)
    training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
# dtype=object keeps the [bag, output_row] pairs intact; recent NumPy versions
# raise an error when building a regular array from rows of unequal length
training = np.array(training, dtype=object)


In [None]:
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, test_size = 0.2)

In [None]:
# X_train[0]

In [None]:
# train_y.shape

## 1.4: BUILD DEEP NEURAL NETWORK RETRIEVAL BASED CHAT BOT MODEL 

Now that data preprocessing is done, it is time to build a model and feed our preprocessed data to it. The network architecture is not complicated. We use fully connected (FC) layers: two hidden layers and one output layer that produces the target probabilities, so the last layer has a softmax activation.
Feel free to experiment with the architecture and the numbers to get a model that suits your requirements. You could also add more text preprocessing steps to get more out of the data. The more trial-and-error cycles you run, the better you will understand the architecture.

In [None]:
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model_retrieval_based = Sequential()
model_retrieval_based.add(Dense(128, input_shape=(len(X_train[0]),), activation='relu'))
model_retrieval_based.add(Dropout(0.5))
model_retrieval_based.add(Dense(64, activation='relu'))
# a single dropout after each hidden layer; the second, back-to-back Dropout(0.5) looked like a copy-paste slip
model_retrieval_based.add(Dropout(0.5))
model_retrieval_based.add(Dense(len(y_train[0]), activation='softmax'))
model_retrieval_based.summary()

## 1.5: COMPILE AND TRAIN RETRIEVAL BASED CHAT BOT DEEP LEARNING MODEL

All we have to do now is feed the data to this model and begin training. We train for 50 epochs with a batch size of 5. Again, you can experiment with these numbers and find the right ones for your data. After training, we save the model to disk so that we can use the trained model in our Flask application.

In [None]:
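# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
# `learning_rate` replaces the deprecated `lr` alias. The original decay=1e-6 argument is omitted because
# newer Keras optimizers no longer accept it; its effect over a short training run is negligible.
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model_retrieval_based.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])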

In [None]:
# save the best model with least validation loss
checkpointer = ModelCheckpoint(filepath = "RetrievalBased_ChatBot_weights.hdf5", verbose = 1, save_best_only = True)

In [None]:
# fit the model; the checkpoint callback saves the model with the lowest validation loss
history = model_retrieval_based.fit(np.array(X_train), np.array(y_train), batch_size = 5, epochs = 50, verbose=1, validation_split = 0.05, callbacks=[checkpointer])

In [None]:
# model.save('RuledBased_ChatBot_model.h5', history)

In [None]:
# save the model architecture to json file for future use
model_json = model_retrieval_based.to_json()
with open("RetrievalBased_ChatBot_model.json","w") as json_file:
  json_file.write(model_json)

## 1.6: ASSESS TRAINED RETRIEVAL MODEL PERFORMANCE

In [None]:
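from tensorflow.keras.models import load_model
import json
import random

# reload the intents, vocabulary and class labels saved earlier
with open(projectPath + 'Data/intents.json') as data_file:
    intents = json.load(data_file)
with open('words.pkl', 'rb') as f:
    words = pickle.load(f)
with open('classes.pkl', 'rb') as f:
    classes = pickle.load(f)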

In [None]:
# model = load_model('RuledBased_ChatBot_model.h5')

In [None]:
# data_file = open(projectPath + 'Data/intents.json').read()
with open(projectPath + 'RetrievalBased_ChatBot_model.json', 'r') as json_file:
    json_savedModel= json_file.read()

# load the model architecture 
model_retrieval_based = tf.keras.models.model_from_json(json_savedModel)

In [None]:
# Load the model weights
from pathlib import Path

my_file = Path(projectPath + 'RetrievalBased_ChatBot_weights.hdf5')
if my_file.is_file():
    # file exists
    model_retrieval_based.load_weights(projectPath + 'RetrievalBased_ChatBot_weights.hdf5')

In [None]:
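# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
# `learning_rate` replaces the deprecated `lr` alias. The original decay=1e-6 argument is omitted because
# newer Keras optimizers no longer accept it; its effect over a short training run is negligible.
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model_retrieval_based.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])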

In [None]:
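# Evaluate the model (Keras expects arrays, so convert the lists returned by train_test_split)

result = model_retrieval_based.evaluate(np.array(X_test), np.array(y_test))
print("Accuracy : {}".format(result[1]))

# List the metrics recorded during training
history.history.keys()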

In [None]:
# Plot the training artifacts

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Retrieval Based Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()

In [None]:
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

In [None]:
epochs = range(len(accuracy))

plt.plot(epochs, accuracy, 'bo', label='Training Accuracy')
plt.plot(epochs, val_accuracy, 'b', label='Validation Accuracy')
plt.title('Retrieval Based Training and Validation Accuracy')
plt.legend()

In [None]:
plt.plot(epochs, loss, 'ro', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Retrieval Based Training and Validation loss')
plt.legend()

In [None]:
def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # lemmatize each word - reduce it to its lowercase base form
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    # print(sentence_words)
    return sentence_words

In [None]:
def bow(sentence, words, show_details=True):
    # tokenize and lemmatize the sentence
    sentence_words = clean_up_sentence(sentence)
    # bag of words - vector with one position per vocabulary word
    bag = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print("found in bag: %s" % w)
    return np.array(bag)

In [None]:
def predict_class(sentence, model):
    # convert the sentence into a bag-of-words vector
    p = bow(sentence, words, show_details=False)
    res = model.predict(np.array([p]))[0]
    # filter out predictions below a threshold
    ERROR_THRESHOLD = 0.25
    results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

In [None]:
def getResponse(ints, intents_json):
    list_of_intents = intents_json['intents']

    # use the top predicted intent, or fall back to the 'noanswer' intent
    tag = ints[0]['intent'] if ints else 'noanswer'
    # fallback text (an added default) so that `result` is always defined,
    # even when the tag is missing from the intents file
    result = "Sorry, I did not understand that."
    for i in list_of_intents:
        if i['tag'] == tag:
            result = random.choice(i['responses'])
            break

    # print('getResponse result :', result)
    return result

In [None]:
def chatbot_response(msg):
    ints = predict_class(msg, model_retrieval_based)
    # print('predict_class response : ',ints)
    res = getResponse(ints, intents)
    return res

In [None]:
chatbot_response("hi")


In [None]:
chatbot_response("@%@#$%@$%!@#CSDF")

In [None]:
def testRetrivalModel(): 
  print("Welcome to the Bot Service! Let me know how I can help you.")
  while True:
      request=input('User'+':')
      if request=='Bye' or request =='bye':
          print('Bot: Bye')
          break
      else:
          print('Bot:',chatbot_response(request))

In [None]:
testRetrivalModel()

# 2. Generative Content Based Chat Bot

In the previous section, the responses were fixed and machine learning helped select the correct response for the user's question. Here, instead of selecting from predefined responses, we generate a response based on the training corpus. We use the encoder-decoder (seq2seq) model for this approach.


## 2.1: IMPORT LIBRARIES AND DATASETS

To train a deep learning NLP network in supervised mode, we need a labeled dataset, so that the chatbot's seq2seq model can learn how to process questions and generate corresponding answers. Here are some datasets that we can use:

- Question-Answer Dataset: http://www.cs.cmu.edu/~ark/QA-data/
- The WikiQA corpus: https://www.microsoft.com/en-us/download/confirmation.aspx?id=52419
- Yahoo Language Data: https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- ConvAI2 Dataset: http://convai.io/data
- Open dialogue dataset (Microsoft/Maluuba): booking flights and a hotel. https://datasets.maluuba.com/Frames
- Cornell Movie-Dialogs Corpus: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html



In [None]:
%cd /content/drive/My Drive/Colab Notebooks/Modern AI Portfolio Builder/Chat Bot/Data

In [None]:
# import data for generic content based chat bot
import os
import yaml
from tensorflow.keras import layers , activations , models , preprocessing, utils

In [None]:
# unzip the downloaded data
!unzip chatbot_nlp.zip

In [None]:
# prepare data

dir_path = 'chatbot_nlp/data'
files_list = os.listdir(dir_path + os.sep)



## 2.2: VISUALIZE DATA AND PLOT LABELS (TBD)

In [None]:
# TBD: visualisation of the chatbot data has not been implemented yet

## 2.3: PERFORM DATA PREPARATION AND FEATURE ENGINEERING

In our example, we use the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges (304,713 utterances) between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies.

Here is one of the conversations from the dataset:
- Mike: "Drink up, Charley. We're ahead of you."
- Charley: "I'm not thirsty."





**Cleaning**:
First, for the two files, “Questions” and “Answers”, we clean the text by replacing short forms with their corresponding long forms, for example “I'm” with “I am”.
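
The cleaning step could look like the sketch below. The contraction table is a deliberately small, hypothetical sample and `clean_text` is an illustrative helper name; a real project would use a much longer list.

```python
import re

# Hypothetical, deliberately small mapping from short forms to long forms.
# Order matters: whole-word patterns come before the generic suffix patterns.
CONTRACTIONS = {
    r"\bi'm\b": "i am",
    r"\bhe's\b": "he is",
    r"\bshe's\b": "she is",
    r"\bthat's\b": "that is",
    r"\bwhat's\b": "what is",
    r"\bwon't\b": "will not",
    r"\bcan't\b": "cannot",
    r"n't": " not",
    r"'ll": " will",
    r"'ve": " have",
    r"'re": " are",
    r"'d": " would",
}


def clean_text(text):
    """Lowercase, expand contractions, and strip punctuation."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text)
    # drop everything except lowercase letters, digits and spaces
    text = re.sub(r"[^a-z0-9 ]", "", text)
    # collapse repeated spaces
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("I'm not thirsty."))  # i am not thirsty
```

Expanding contractions before removing punctuation matters: stripping the apostrophe first would turn “I'm” into “im”, which the table could no longer match.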

**Filtering**:
Remove infrequent words by counting how often each word appears and replacing every word that occurs fewer times than a chosen threshold (for example, 20) with the tag <OUT>.
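
As a rough sketch of this filtering step, the snippet below counts word frequencies and swaps rare words for the out-of-vocabulary tag. The helper name `replace_rare_words` is hypothetical, and the toy corpus uses a threshold of 2 instead of 20 so that the effect is visible.

```python
from collections import Counter


def replace_rare_words(sentences, min_count=2, out_token="<OUT>"):
    """Replace every word seen fewer than min_count times with out_token."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return [
        " ".join(
            word if counts[word] >= min_count else out_token
            for word in sentence.split()
        )
        for sentence in sentences
    ]


# Toy corpus (illustrative only)
corpus = ["the cat sat", "the dog sat", "the bird flew"]
filtered = replace_rare_words(corpus, min_count=2)
print(filtered)  # ['the <OUT> sat', 'the <OUT> sat', 'the <OUT> <OUT>']
```

Filtering keeps the vocabulary, and therefore the softmax output layer, small, at the cost of losing the identity of rare words.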

**Padding**:
For the seq2seq model, question and answer sentences must have a fixed length. That is why we apply padding: when a sentence is shorter than the fixed length, we append the term “PAD” until it reaches that length.
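
A minimal sketch of padding on word lists, assuming a hypothetical helper `pad_sequence`. The notebook itself pads integer sequences with Keras `pad_sequences(..., padding='post')`, which appends zeros instead of a “PAD” string; the idea is the same.

```python
def pad_sequence(tokens, max_len, pad_token="PAD"):
    """Truncate to max_len, then right-pad with pad_token up to max_len."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


padded = pad_sequence(["i", "am", "fine"], 5)
print(padded)  # ['i', 'am', 'fine', 'PAD', 'PAD']
```

Padding at the end (post-padding) keeps the real tokens at the start of the sequence, which matches the `padding='post'` setting used later in this notebook.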

**Tokenizing**:
Because deep learning models work only with numbers, the input word sequences must be encoded as vectors of numbers before they are fed to the Seq2Seq model. We use a two-step process to convert text into numbers that can be used in a neural network.
The first step is tokenizing, which converts words into integer tokens. The text is split into smaller parts (words and punctuation) called tokens, and two dictionaries are created, one for “Questions” and another for “Answers”, because their vocabularies are different. Start <SOS> and end <EOS> tokens are added at the beginning and end of each utterance.
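
The tokenizing step can be sketched in plain Python as follows. `build_vocab` and `encode` are hypothetical helpers, and the special-token names and their index order are assumptions made for this example.

```python
def build_vocab(sentences, specials=("<PAD>", "<SOS>", "<EOS>", "<OUT>")):
    """Map each word to an integer id; special tokens get the lowest ids."""
    vocab = {token: index for index, token in enumerate(specials)}
    for sentence in sentences:
        for word in sentence.split():
            # assign the next free id the first time a word is seen
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Wrap the sentence in <SOS>/<EOS> and convert words to integer ids."""
    words = ["<SOS>"] + sentence.split() + ["<EOS>"]
    return [vocab.get(word, vocab["<OUT>"]) for word in words]


answers_sample = ["i am fine", "fine thanks"]
answer_vocab = build_vocab(answers_sample)
encoded = encode("i am fine", answer_vocab)
print(encoded)  # [1, 4, 5, 6, 2]
```

Unknown words fall back to the <OUT> id, so `encode("i am happy", answer_vocab)` still produces a valid sequence.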

**Word Embedding (Encoding corpus words)**:
The second step is to convert integer tokens (words) into vectors of floating-point numbers. Many methods exist, such as Bag-of-Words (e.g. TF-IDF or count vectorization), LDA, LSA, and word embedding. The last one, word embedding, is recommended because it does not suffer from drawbacks such as high-dimensional vectors that grow with the corpus size.
Word embedding encodes every word in a predefined, fixed vector space of N dimensions (e.g. N=300), regardless of the size of the corpus. The word vector encodes the semantic relationships between words: words have similar meanings if their vectors are close (e.g. as measured by cosine similarity).
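
To illustrate what “close” means, the sketch below computes cosine similarity between word vectors. The three-dimensional vectors are invented for this example and do not come from any trained embedding.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings", purely for illustration
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(sim_king_queen, 2), round(sim_king_apple, 2))
```

With these made-up vectors, “king” is far more similar to “queen” than to “apple”, which is the behaviour a trained embedding is expected to show for related words.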

In [None]:
# Load conversations in the form of questions & answers
questions = list()
answers = list()

for filepath in files_list:
    # the context manager closes each YAML file once it has been parsed
    with open( dir_path + os.sep + filepath , 'rb') as stream:
        docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        if len( con ) > 2 :
            # several replies: join them into a single answer
            questions.append(con[0])
            replies = con[ 1 : ]
            ans = ''
            for rep in replies:
                ans += ' ' + rep
            answers.append( ans )
        elif len( con )> 1:
            questions.append(con[0])
            answers.append(con[1])



In [None]:
print(len(questions))
questions[:5]

In [None]:
print(len(answers))
answers[:5]

In [None]:
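# Keep only the question/answer pairs whose answer is a string.
# Filtering both lists together keeps them aligned; popping from `questions`
# by index while looping would shift the remaining indices.
answers_with_tags = list()
filtered_questions = list()
for question , answer in zip( questions , answers ):
    if type( answer ) == str:
        filtered_questions.append( question )
        answers_with_tags.append( answer )
questions = filtered_questions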



In [None]:
print(len(answers_with_tags))
answers_with_tags[:5]

In [None]:
# Prepare answers list with start and end tag description
answers = list()
for i in range( len( answers_with_tags ) ) :
    answers.append( '<START> ' + answers_with_tags[i] + ' <END>' )
    #print('<START> ' + answers_with_tags[i] + ' <END>')


In [None]:
print(len(answers))
answers[:5]

In [None]:
# Tokenize questions & answers
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

In [None]:
tokenizer.word_index

In [None]:
from gensim.models import Word2Vec
import re

vocab = []
for word in tokenizer.word_index:
    vocab.append( word )

In [None]:
vocab

In [None]:
def tokenize( sentences ):
    tokens_list = []
    vocabulary = []
    for sentence in sentences:
        sentence = sentence.lower()
        sentence = re.sub( '[^a-zA-Z]', ' ', sentence )
        tokens = sentence.split()
        vocabulary += tokens
        tokens_list.append( tokens )
    return tokens_list , vocabulary

#p = tokenize( questions + answers )
#model = Word2Vec( p[ 0 ] ) 

#embedding_matrix = np.zeros( ( VOCAB_SIZE , 100 ) )
#for i in range( len( tokenizer.word_index ) ):
    #embedding_matrix[ i ] = model[ vocab[i] ]

In [None]:
# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences( questions )
maxlen_questions = max( [ len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions , maxlen=maxlen_questions , padding='post' )
encoder_input_data = np.array( padded_questions )
print( encoder_input_data.shape , maxlen_questions )



In [None]:
# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape , maxlen_answers )


In [None]:
# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape )

In [None]:
padded_answers

In [None]:
onehot_answers

In [None]:
tokenizer.word_index

In [None]:
tokenized_questions

In [None]:
tokenized_answers

In [None]:
questions

In [None]:
answers

In [None]:
encoder_input_data

In [None]:
encoder_input_data.shape

In [None]:
decoder_input_data

In [None]:
decoder_input_data.shape

In [None]:
decoder_output_data

In [None]:
decoder_output_data.shape

In [None]:
from sklearn.model_selection import train_test_split

def data_spliter(encoder_input_data, decoder_input_data, test_size1=0.2, test_size2=0.3):
  
  en_train, en_test, de_train, de_test = train_test_split(encoder_input_data, decoder_input_data, test_size=test_size1)
  en_train, en_val, de_train, de_val = train_test_split(en_train, de_train, test_size=test_size2)
  
  return en_train, en_val, en_test, de_train, de_val, de_test

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test data
# X_train, X_test, y_train, y_test = train_test_split(decoder_input_data, decoder_output_data, test_size = 0.2)
# X_train, X_test, y_train, y_test = train_test_split(encoder_input_data, decoder_output_data, test_size = 0.2)

en_train, en_val, en_test, de_train, de_val, de_test = data_spliter(encoder_input_data, decoder_input_data)


## 2.4: BUILD AND TRAIN DEEP LEARNING MODEL FOR GENERIC CONTENT BASED CHAT BOT





**Recurrent Neural Network (RNN)**
An **RNN** is a deep network that extracts temporal features while processing sequences of inputs such as text, audio, or video. It is used when history or context is needed to produce an output based on previous inputs, for example in video tracking, image captioning, speech-to-text, translation, and stock forecasting.

An RNN neuron uses its internal memory to retain information about previous inputs and updates its hidden state accordingly, which allows the network to make a prediction for every element of a sequence.
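
The hidden-state update can be sketched as a single step of a plain (Elman-style) RNN, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The weights and inputs below are arbitrary numbers chosen for illustration, and `rnn_step` is a hypothetical helper, not a Keras function.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One RNN step on plain lists: h_t = tanh(W_xh x + W_hh h_prev + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


# Arbitrary illustrative weights: 2 input features, 2 hidden units
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0]  # initial hidden state
states = []
for x in sequence:
    h = rnn_step(x, h, w_xh, w_hh, b)  # the new state depends on the old one
    states.append(h)

print(states)
```

Because each step receives the previous hidden state, feeding the same input at two different positions in the sequence generally produces two different states. That dependence is the "memory" described above.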

RNNs have shown great success in many NLP tasks. The most widely used type of RNN is the LSTM, which captures long-term dependencies far better than a plain RNN can, because plain RNNs suffer from the vanishing gradient problem. The GRU is a newer variant with a less complex structure (fewer parameters) than the LSTM; it trains a bit faster and needs less data, but may lead to lower results.

**Seq2Seq architecture and functioning:**
Almost every task in NLP can be framed as a sequence-to-sequence mapping: machine translation, summarization, question answering, and many more. An encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems in natural language processing (NLP). It takes a sequence as input and generates another sequence as output. It comprises two sub-modules:
- **Encoder**: Processes the input sequence to detect important patterns and compresses it into a smaller, fixed-length “context vector”. This feature vector holds the information that represents the input and becomes the initial state of the first recurrent layer of the decoder.
- **Decoder**: Generates a sequence of its own that represents the output. During training it produces the closest match to the intended output; at test time or in production it produces the best response to the actual input.


To improve the performance and accuracy of the model, two additional techniques can be used:
- **Attention mechanism:**
To perform well on long input or output sequences, we use an attention mechanism, which tells the model which specific parts of the input sequence to focus on when decoding. It provides a richer context from the encoder instead of relying only on the raw “context vector”.
- **Beam search:**
Beam search is an algorithm that builds a search tree and, at each level of the tree, keeps only the N most probable partial sequences (a limited set of nodes), instead of following only the single best candidate as plain greedy decoding does.
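
Beam search as described in the list above can be sketched without any deep learning library. In this sketch, `toy_next_token_probs` and its probability table are invented stand-ins for a trained decoder, and `beam_search` is a hypothetical helper.

```python
import math

# Invented next-token probabilities, keyed by the last generated token
TOY_MODEL = {
    "<START>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "<END>": 0.1},
    "x": {"<END>": 1.0},
    "y": {"<END>": 1.0},
    "z": {"<END>": 1.0},
}


def toy_next_token_probs(tokens):
    """Stand-in for a decoder: look up probabilities from the last token."""
    return TOY_MODEL[tokens[-1]]


def beam_search(next_token_probs, start_token, end_token, beam_width=2, max_len=5):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [(0.0, [start_token])]  # (log probability, tokens so far)
    for _ in range(max_len):
        candidates = []
        for log_prob, tokens in beams:
            if tokens[-1] == end_token:
                candidates.append((log_prob, tokens))  # finished, carry over
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((log_prob + math.log(prob), tokens + [token]))
        # keep only the best beam_width candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == end_token for _, tokens in beams):
            break
    return beams[0][1]


print(beam_search(toy_next_token_probs, "<START>", "<END>", beam_width=1))
print(beam_search(toy_next_token_probs, "<START>", "<END>", beam_width=2))
```

With `beam_width=1` the search is greedy and commits to “a” (probability 0.6), ending with a sequence of total probability 0.3. With `beam_width=2` it also keeps “b” alive and finds the sequence through “z”, whose total probability is 0.36.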

In [None]:
# BUILD AND TRAIN DEEP LEARNING MODEL FOR GENERIC CONTENT BASED CHAT BOT

# Dimension for embedding layer
embedding_dimension = 200

#Dimensionality
dimensionality = 200 #256


# Prepare encode input & embedding
encoder_inputs = tf.keras.layers.Input(shape=( maxlen_questions , ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, embedding_dimension , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( dimensionality , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

# Prepare decode input & embedding
decoder_inputs = tf.keras.layers.Input(shape=( maxlen_answers ,  ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, embedding_dimension , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( dimensionality , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model_content_based = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
# model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')
model_content_based.summary()

## 2.5: COMPILE AND TRAIN GENERIC CONTENT BASED CHAT BOT DEEP LEARNING MODEL

Our encoder model requires an input layer that receives the padded integer token sequences, an embedding layer, and an LSTM layer with a chosen number of hidden units. The decoder model structure is almost the same as the encoder's, but here we pass in the encoder's state data along with the decoder inputs.

In [None]:
model_content_based.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy', metrics=['accuracy'])


# adam = tf.keras.optimizers.Adam(learning_rate = 0.0001, beta_1 = 0.9, beta_2 = 0.999, amsgrad = False)
# model_1_facialKeyPoints.compile(loss = "mean_squared_error", optimizer = adam , metrics = ['accuracy'])
# Check this out for more information on Adam optimizer: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam

In [None]:
# save the best model with least validation loss
checkpointer = ModelCheckpoint(filepath = "ContentBase_ChatBot_weights.hdf5", verbose = 1, save_best_only = True)

In [None]:
# The batch size and number of epochs
batch_size = 50
epochs = 150
validation_split = 0.05

# train on the full dataset; the checkpoint callback saves the model with the lowest validation loss
history = model_content_based.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size = batch_size, epochs = epochs, validation_split = validation_split, callbacks=[checkpointer])

In [None]:
# save the model architecture to json file for future use

model_json = model_content_based.to_json()
with open("ContentBase_ChatBot_model.json","w") as json_file:
  json_file.write(model_json)

## 2.6: ASSESS THE PERFORMANCE OF TRAINED GENERIC CONTENT BASED MODEL





Now, to handle input that the model has not seen, we need a model that decodes step by step instead of using teacher forcing, because the model we trained only works when the target sequence is known. In the generative chatbot application, we do not know in advance what the generated response to a user's input will be. To handle this, we build the seq2seq model in individual pieces. Let's first build an encoder model with encoder inputs and encoder output states, reusing the previously trained model.

Next, we need to create placeholders for the decoder input states, as we do not know what we need to decode or what hidden state we will get.

Now we create new decoder states and outputs with the help of the decoder LSTM and Dense layer that we trained earlier.

Finally, we have the decoder input layer, the final states from the encoder, the decoder outputs from the Dense layer of the decoder, and the decoder output states, which carry the network's memory from one word to the next. We can now bring all of this together and set up the decoder model as shown below.

In [None]:
with open('ContentBase_ChatBot_model.json', 'r') as json_file:
    json_savedModel= json_file.read()
    
# load the model architecture 
model_content_based = tf.keras.models.model_from_json(json_savedModel)
model_content_based.load_weights('ContentBase_ChatBot_weights.hdf5')
# model_content_based.compile(optimizer = "Adam", loss = "categorical_crossentropy", metrics = ["accuracy"])

In [None]:
# model_content_based.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
# TODO: Fix this issue, unable to calculate accuracy
# (the two inputs should be passed as a list of arrays, not added together)
# score = model_content_based.evaluate([encoder_input_data , decoder_input_data], decoder_output_data)

# print('Test Accuracy: {}'.format(score[1]))

# en_train, en_val, en_test, de_train, de_val, de_test = data_spliter(encoder_input_data, decoder_input_data)

In [None]:
history.history.keys()

In [None]:
# Plot the training artifacts

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Generative Based Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()

In [None]:
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

In [None]:
epochs = range(len(accuracy))

plt.plot(epochs, accuracy, 'bo', label='Training Accuracy')
plt.plot(epochs, val_accuracy, 'b', label='Validation Accuracy')
plt.title('Generative Based Training and Validation Accuracy')
plt.legend()

In [None]:
plt.plot(epochs, loss, 'ro', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Generative Based Training and Validation loss')
plt.legend()

Finally, we create a function that accepts text input and generates a response using the encoder and decoder models we created. In the function below, we pass in the NumPy matrix that represents the input sentence and get the generated response back. Most lines are commented to make the code easier to follow. The function works as follows:

1. Retrieve the output states from the encoder.
2. Pass these states to the decoder as its initial hidden state, and decode the sentence word by word.
3. Update the decoder's hidden state after each decoded word, so that previously decoded words help decode the next ones.

We stop once we encounter the ‘<END>’ token that we added to the target sequences during preprocessing, or once we hit the maximum sequence length.
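
The decoding loop described above can be sketched independently of Keras. This is a minimal illustration, not the notebook's implementation: `greedy_decode`, `step_fn`, `toy_step` and the toy vocabulary are hypothetical, and `step_fn` stands in for one call to the decoder model, returning next-token probabilities and the updated state.

```python
import numpy as np

def greedy_decode(initial_state, step_fn, start_index, end_index, index_word, max_len):
    """Decode one reply token by token, always taking the most probable next token."""
    state = initial_state                      # encoder output states seed the decoder
    token = start_index                        # decoding starts from the start token
    words = []
    while len(words) < max_len:                # stop at the maximum sequence length
        probs, state = step_fn(token, state)   # one decoder step, returns the updated state
        token = int(np.argmax(probs))          # greedy choice of the next token
        if token == end_index:                 # stop when the end token is produced
            break
        words.append(index_word.get(token, ""))  # unknown ids still count as a step
    return " ".join(w for w in words if w)

# Toy stand-in for the decoder: the state is a counter, and each step
# deterministically emits token 2, then 3, then the end token 1.
def toy_step(token, state):
    script = [2, 3, 1]
    probs = np.zeros(4)
    probs[script[state]] = 1.0
    return probs, state + 1

index_word = {1: "end", 2: "hello", 3: "there"}
reply = greedy_decode(0, toy_step, start_index=0, end_index=1,
                      index_word=index_word, max_len=10)
print(reply)  # prints "hello there"
```

Counting every step against `max_len`, including steps that yield an unknown or padding id, guarantees the loop terminates even if the end token is never produced.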

In [None]:
def make_inference_models():
    
    # first build an encoder model with encoder inputs and encoder output states.
    # encoder_inputs = model_content_based.input[0]
    # encoder_outputs, state_h_enc, state_c_enc = model_content_based.layers[2].output
    # encoder_states = [state_h_enc, state_c_enc]
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    
    #create placeholders for decoder input states
    decoder_state_input_h = tf.keras.layers.Input(shape=( 200 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 200 ,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    # create new decoder states and outputs with the help of decoder LSTM and Dense layer that we trained earlier.
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model

In [None]:
def str_to_tokens( sentence : str ):
    # print("maxlen_questions", maxlen_questions)
    # print("tokenizer", tokenizer)
    # print("tokenizer.word_index", tokenizer.word_index)
    words = sentence.lower().split()
    # print("words", words)
    tokens_list = list()
    # print("Before tokens_list", tokens_list)
    for word in words:
        if word in tokenizer.word_index :
          # print("tokenizer word", tokenizer.word_index[ word ])
          tokens_list.append( tokenizer.word_index[ word ] ) 
        else:
          # Unknown word: fall back to the index of the word "out" (assumes "out" is in the vocabulary)
          tokens_list.append( tokenizer.word_index[ "out" ] ) 
        # print("After tokens_list", tokens_list)
    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions , padding='post')
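
The lookup-and-pad step in `str_to_tokens` can be illustrated without Keras. The sketch below is a simplified stand-in with hypothetical names (`encode_sentence`, `toy_index`, `oov_index`); it uses `dict.get` so that unknown words map to an explicit out-of-vocabulary id instead of raising a `KeyError`. Note that it truncates from the end, whereas `pad_sequences` drops leading tokens by default.

```python
def encode_sentence(sentence, word_index, maxlen, oov_index):
    """Map words to ids, then post-pad with zeros (or truncate) to exactly `maxlen` entries."""
    ids = [word_index.get(word, oov_index) for word in sentence.lower().split()]
    ids = ids[:maxlen]                         # truncate from the end if too long
    return ids + [0] * (maxlen - len(ids))     # 0 is assumed to be the padding id

# Hypothetical toy vocabulary; id 1 is reserved for unknown words.
toy_index = {"<unk>": 1, "how": 2, "are": 3, "you": 4}
print(encode_sentence("How are YOU today", toy_index, maxlen=6, oov_index=1))
# prints [2, 3, 4, 1, 0, 0]
```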

In [None]:
enc_model, dec_model = make_inference_models()
# enc_model , dec_model = make_inference_modelsEx()

In [None]:
def testGenerativeModel(): 
  print("Welcome to the Bot Service! Let me know how I can help you.")
  while True:
      request=input('User'+':')
      if request=='Bye' or request =='bye':
          print('Bot: Bye')
          break
      else:
          states_values = enc_model.predict( str_to_tokens( request ) )
          empty_target_seq = np.zeros( ( 1 , 1 ) )
          empty_target_seq[0, 0] = tokenizer.word_index['start']
          stop_condition = False
          decoded_translation = ''
          while not stop_condition :
            dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
            sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
            # print("sampled_word_index", sampled_word_index)
            if sampled_word_index == 1:
              # print('Got end tag: Bye')
              stop_condition = True
              break

            sampled_word = None
            for word , index in tokenizer.word_index.items() :
              if sampled_word_index == index :  
                # print("word", word)
                decoded_translation += ' {}'.format( word )
                sampled_word = word
              
            if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
              stop_condition = True
                  
            empty_target_seq = np.zeros( ( 1 , 1 ) )  
            empty_target_seq[ 0 , 0 ] = sampled_word_index
            states_values = [ h , c ]

          # Print response
          print( decoded_translation )  


In [None]:
testGenerativeModel()

## 3. Future Scope and Limitations



Here we used a very small dataset and obtained an accuracy of around 20%. With a larger dataset, the model might achieve better accuracy. The main limitation of this approach is that it needs a very large dataset to give good responses: as the output above shows, the chatbot gives wrong responses in some cases because the dataset is small.
A similar task that suits the approach shown above is machine translation, which can be performed with the same seq2seq model.

# Conclusion





- The rule-based and retrieval-based approaches are the most widely used today because they are effective at maintaining a closed-domain conversation.
- Generative models, on the other hand, are a powerful alternative because they handle open-topic conversation better.
- Adding more datasets would help the chatbot learn from more conversations, improving its conversational skills and the variety of its responses to queries.
- The Seq2Seq model makes a more realistic, human-like chatbot possible. The dataset is also a crucial element: the larger and more diverse it is, the better the user experience and perception.
- A hybrid approach can also be used to develop a chatbot that is robust, reliable and scalable. It can improve the chatbot's quality, performance and accuracy, and make it more dependable in real-time scenarios.