# Chatbot using Seq2Seq LSTM models


- In this notebook, we will assemble a seq2seq LSTM model using the Keras Functional API to create a working chatbot that answers the questions it is asked.

- Chatbots have become applications in their own right. You can choose a field or domain and gather data on the questions people ask in it. For example, we could build a chatbot for an e-commerce website, or for a school website where parents can get information about the school.


- The famous [Google Assistant](https://assistant.google.com/), [Siri](https://www.apple.com/in/siri/), [Cortana](https://www.microsoft.com/en-in/windows/cortana) and [Alexa](https://www.alexa.com/) may have been built using similar models.

So, let's start building our Chatbot.

## 1) Importing the packages

We will import [TensorFlow](https://www.tensorflow.org) and [Keras](https://www.tensorflow.org/guide/keras). Also, we import other modules which help in defining model layers.

In [1]:
import numpy as np
from tensorflow.keras import preprocessing , utils
import string
import tensorflow as tf
# NLTK
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

## 2) Data selection

### A) Reading the data from the file

In [2]:
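#  Loading Data
import ast
import gzip
import pandas as pd

def parse(path):
  # Each line holds one Python dict literal; ast.literal_eval parses it
  # without executing arbitrary code, unlike eval
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for l in g:
      yield ast.literal_eval(l)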

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('qa_Electronics.json.gz')

In [3]:
pd.set_option('display.max_columns',None)

In [4]:
df.head(10)

Unnamed: 0,questionType,asin,answerTime,unixTime,question,answerType,answer
0,yes/no,594033926,"Dec 27, 2013",1388131000.0,Is this cover the one that fits the old nook c...,Y,Yes this fits both the nook color and the same...
1,yes/no,594033926,"Jan 5, 2015",1420445000.0,Does it fit Nook GlowLight?,N,No. The nook color or color tablet
2,open-ended,594033926,2 days ago,,Would it fit Nook 1st Edition? 4.9in x 7.7in ?,,I don't think so. The nook color is 5 x 8 so n...
3,yes/no,594033926,17 days ago,,Will this fit a Nook Color that's 5 x 8?,Y,yes
4,yes/no,594033926,"Feb 10, 2015",1423555000.0,will this fit the Samsung Galaxy Tab 4 Nook 10.1,N,"No, the tab is smaller than the 'color'"
5,yes/no,594033926,"Jan 30, 2015",1422605000.0,does it have a flip stand?,N,"No, there is not a flip stand. It has a pocket..."
6,yes/no,594033926,"Jan 30, 2015",1422605000.0,does this have a flip stand,?,"Hi, no it doesn't"
7,open-ended,594033926,"Dec 22, 2014",1419235000.0,also fits the HD+?,,It should. They are the same size and the char...
8,yes/no,594033926,"Nov 16, 2014",1416125000.0,Does it have 2 positions for the reader? Horiz...,Y,Yes
9,open-ended,594033926,"Aug 7, 2014",1407395000.0,"Is there a closure mechanism? Bands, magnetic,...",,No- it is more like a normal book would be. It...


### B) Pre-processing the data

#### Remove null questions and questions with a character count of 10 or less

In [5]:
null_id = []
for i, val in enumerate(df['question']):
    if len(val) <= 10:
        null_id.append(i)

In [6]:
df =  df.drop(df.index[null_id])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 312129 entries, 0 to 314262
Data columns (total 7 columns):
questionType    312129 non-null object
asin            312129 non-null object
answerTime      312129 non-null object
unixTime        302749 non-null float64
question        312129 non-null object
answerType      165529 non-null object
answer          312129 non-null object
dtypes: float64(1), object(6)
memory usage: 19.1+ MB


In [8]:
df['asin'].nunique()

39371

In [9]:
df1 = df.drop(['questionType','asin','answerTime','unixTime','answerType'], axis =1)

In [10]:
df1.head()

Unnamed: 0,question,answer
0,Is this cover the one that fits the old nook c...,Yes this fits both the nook color and the same...
1,Does it fit Nook GlowLight?,No. The nook color or color tablet
2,Would it fit Nook 1st Edition? 4.9in x 7.7in ?,I don't think so. The nook color is 5 x 8 so n...
3,Will this fit a Nook Color that's 5 x 8?,yes
4,will this fit the Samsung Galaxy Tab 4 Nook 10.1,"No, the tab is smaller than the 'color'"


#### Convert text to lowercase

In [11]:
df2 = df1.apply(lambda x: x.astype(str).str.lower())

In [12]:
df2.head(10)

Unnamed: 0,question,answer
0,is this cover the one that fits the old nook c...,yes this fits both the nook color and the same...
1,does it fit nook glowlight?,no. the nook color or color tablet
2,would it fit nook 1st edition? 4.9in x 7.7in ?,i don't think so. the nook color is 5 x 8 so n...
3,will this fit a nook color that's 5 x 8?,yes
4,will this fit the samsung galaxy tab 4 nook 10.1,"no, the tab is smaller than the 'color'"
5,does it have a flip stand?,"no, there is not a flip stand. it has a pocket..."
6,does this have a flip stand,"hi, no it doesn't"
7,also fits the hd+?,it should. they are the same size and the char...
8,does it have 2 positions for the reader? horiz...,yes
9,"is there a closure mechanism? bands, magnetic,...",no- it is more like a normal book would be. it...


#### Get the count of questions and answers with 10 words or fewer


In [13]:
cnt = [i for i in df2['question'] if  len(i.split()) <= 10]
print("Total Questions: ",len(cnt))
cnt = [i for i in df2['answer'] if  len(i.split()) <= 10]
print("Total Answers: ",len(cnt))

Total Questions:  138103
Total Answers:  88087


#### Get the index of questions and answers with 10 or more words, to remove them

In [14]:
# # .find("soccer")
# cnt =0
# for i, val in enumerate(df2['answer']):
#     if val.find("http:")!= -1:
#         print(i, " :  ",val)
#         cnt+=1

# print(cnt)

In [15]:
def remove_qn_ans(col_type):
    data_id = []
    for i, val in enumerate(df2[col_type]):
        if len(val.split()) >= 10 :
            data_id.append(i)
            
    print('Count of ',col_type,' will be removed: ',len(data_id))
    return data_id

In [16]:
def remove_qn_ans_alternate(col_type):
    data_id = []
    checker = 0
    for i, val in enumerate(df2[col_type]):
        if len(val) <= 3 and checker < 2:
            data_id.append(i)
            checker += 1
        elif val.find("http:")!= -1:
            data_id.append(i)
        else:
            checker = 0
            
    print('Count of ',col_type,' will be removed: ',len(data_id))
    return data_id

In [17]:
df2 = df2.drop(df2.index[remove_qn_ans('question')])
df2.reset_index(drop=True, inplace=True)
df2 = df2.drop(df2.index[remove_qn_ans('answer')])
df2.reset_index(drop=True, inplace=True)

Count of  question  will be removed:  190590
Count of  answer  will be removed:  79094


In [18]:
df2 = df2.drop(df2.index[remove_qn_ans_alternate('answer')])
df2.reset_index(drop=True, inplace=True)

Count of  answer  will be removed:  9911
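
The drop-and-reindex steps above boil down to keeping only pairs whose question and answer are both short and whose answer contains no link. Below is a minimal pure-Python sketch of that rule on hypothetical toy pairs. `MAX_WORDS` and `keep_pair` are illustrative names that do not appear in the notebook, and the consecutive-short-answer rule from `remove_qn_ans_alternate` is deliberately left out.

```python
MAX_WORDS = 10  # assumption: mirrors the `>= 10` word check in remove_qn_ans

def keep_pair(question, answer):
    """Return True if both texts are under MAX_WORDS words and the answer has no http link."""
    short_q = len(question.split()) < MAX_WORDS
    short_a = len(answer.split()) < MAX_WORDS
    no_link = "http:" not in answer
    return short_q and short_a and no_link

# Hypothetical toy data, not taken from the dataset
pairs = [
    ("does it fit nook glowlight?", "no. the nook color or color tablet"),
    ("does it have a stand?", "see http://example.com for details"),
    ("is this the cover that fits the old nook color and also the newer tablet?", "yes"),
]

kept = [(q, a) for q, a in pairs if keep_pair(q, a)]
print(len(kept), "of", len(pairs), "pairs kept")
```

Filtering questions and answers as pairs keeps the two lists aligned without any index bookkeeping.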


In [19]:
df2

Unnamed: 0,question,answer
0,does it fit nook glowlight?,no. the nook color or color tablet
1,does this have a flip stand,"hi, no it doesn't"
2,how far out does the arm extend?,18 inches on our tv.
3,does this item come with a charger?,ours did
4,does this version have a camera?,"no, nook glowlight does not have a camera."
...,...,...
32529,how many watts is this speaker?,3 watt
32530,how much does this speaker weigh (without the ...,almost 7 oz with the strap to carry it.
32531,can you use this laptop on dsl + satellite,"yep, it doesn't have a cd rom though."
32532,does it have a cd drive ?,no. it has no cd/dvd drive


In [20]:
questions =  df2['question'].values.tolist()
# questions = questions[0:len(questions)//2]
questions = questions[0:10100]
print(len(questions))
questions[:10]

10100


['does it fit nook glowlight?',
 'does this have a flip stand',
 'how far out does the arm extend?',
 'does this item come with a charger?',
 'does this version have a camera?',
 'does this nook play games',
 'does this model have an sd card slot?',
 'how well can you see the screen in sunlight?',
 'can i download the kindle app for this?',
 'can you download netflix?']

In [21]:
answers = df2['answer'].values.tolist()
answers = answers[0:10100]
print(len(answers))
answers[:10]

10100


['no. the nook color or color tablet',
 "hi, no it doesn't",
 '18 inches on our tv.',
 'ours did',
 'no, nook glowlight does not have a camera.',
 'no it does not .... sorry',
 'yes it has an sd card slot',
 'in the shades ok direct sun not so good',
 'yes you can. i have it installed',
 'yes, you can also use the amazon kindle app.']

#### Average character length of questions and answers

In [22]:
print(questions[1])
print(len(questions[1]))

does this have a flip stand
27


In [23]:
length = 0
for i in questions:
    length += len(i)
    
# Divide by the actual number of questions rather than a hard-coded count
print('average length of questions: ',length//len(questions))

average length of questions:  34


In [24]:
length = 0
for i in answers:
    length += len(i)
    
# Divide by the actual number of answers rather than a hard-coded count
print('average length of answers: ',length//len(answers))

average length of answers:  25


In [25]:
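# Keep only the pairs whose answer is a string. Filtering questions and answers
# together avoids the index shift caused by popping from a list while iterating.
pairs = [ ( q , a ) for q , a in zip( questions , answers ) if isinstance( a , str ) ]
questions = [ q for q , _ in pairs ]
answers_with_tags = [ a for _ , a in pairs ]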

In [26]:
answers_with_tags[15:25]

['yes.',
 'yes, it works on any mac',
 "why wouldn't it?",
 'yes... this will work all over the globe.',
 'it has worked fine for us.',
 'it work perfectly thanks',
 'i understand that it will work in it fine.',
 'yes.',
 "yes. it's on my phone, it's good.",
 'yes sir.']

In [27]:
answers = list()
for i in range( len( answers_with_tags ) ) :
    answers.append( '<START> ' + answers_with_tags[i] + ' <END>' )

In [28]:
answers[15:25]

['<START> yes. <END>',
 '<START> yes, it works on any mac <END>',
 "<START> why wouldn't it? <END>",
 '<START> yes... this will work all over the globe. <END>',
 '<START> it has worked fine for us. <END>',
 '<START> it work perfectly thanks <END>',
 '<START> i understand that it will work in it fine. <END>',
 '<START> yes. <END>',
 "<START> yes. it's on my phone, it's good. <END>",
 '<START> yes sir. <END>']

*   Create a `Tokenizer` and load the whole vocabulary ( `questions` + `answers` ) into it.

In [29]:
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

VOCAB SIZE : 8476
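
One detail worth spelling out: with its default `filters`, the Keras `Tokenizer` lowercases text and strips punctuation such as `<` and `>`, so the `<START>` and `<END>` tags end up in the vocabulary as plain `start` and `end`. The sketch below is a pure-Python approximation of that behaviour, not the Keras implementation. `simple_tokens` and `build_word_index` are hypothetical helper names.

```python
from collections import Counter

# Default `filters` of the Keras Tokenizer: punctuation except the apostrophe
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Assign indices from 1 upward, most frequent word first (0 stays free for padding)."""
    counts = Counter(tok for text in texts for tok in simple_tokens(text))
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

word_index = build_word_index(["does it fit?", "<START> yes it does <END>"])
vocab_size = len(word_index) + 1  # +1 for the padding index 0
print(word_index)
print(vocab_size)
```

This is why the inference code later looks up `tokenizer.word_index['start']` and checks for the word `end` rather than the tagged forms.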



### C) Preparing data for Seq2Seq model

Our model requires three arrays namely `encoder_input_data`, `decoder_input_data` and `decoder_output_data`.

For `encoder_input_data` :
* Tokenize the `questions`. Pad them to their maximum length.

For `decoder_input_data` :
* Tokenize the `answers`. Pad them to their maximum length.

For `decoder_output_data` :

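* Tokenize the `answers`. Remove the first element from all the `tokenized_answers`. This is the `<START>` element which we added earlier. Then pad the result to the same maximum length and one-hot encode it, so that the target at each time step is the token that follows the decoder input at that step.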
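
The relationship between the decoder input and the decoder target is easiest to see on a single toy answer. The following pure-Python sketch uses hypothetical token ids and helper names (`pad_post`, `one_hot`). It only illustrates the one-step shift; it is not the Keras code used in the cells below.

```python
def pad_post(seq, maxlen):
    """Right-pad a token list with zeros (assumes len(seq) <= maxlen)."""
    return seq + [0] * (maxlen - len(seq))

def one_hot(seq, vocab_size):
    """Turn each token id into a one-hot row of length vocab_size."""
    return [[1.0 if i == tok else 0.0 for i in range(vocab_size)] for tok in seq]

# Hypothetical ids: start=4, yes=5, it=2, does=1, end=6
tokenized_answer = [4, 5, 2, 1, 6]
maxlen, vocab_size = 7, 8

decoder_input = pad_post(tokenized_answer, maxlen)        # keeps the start token
decoder_target = pad_post(tokenized_answer[1:], maxlen)   # shifted left by one step
decoder_output = one_hot(decoder_target, vocab_size)

# At every step the decoder sees `inp` and is trained to predict `tgt`
for step, (inp, tgt) in enumerate(zip(decoder_input, decoder_target)):
    print(step, inp, "->", tgt)
```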



In [38]:
# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences( questions )
maxlen_questions = max( [ len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions , maxlen=maxlen_questions , padding='post' )
encoder_input_data = np.array( padded_questions )
print( encoder_input_data.shape , maxlen_questions )

(10100, 15) 15


In [39]:
padded_questions[1]

array([   7,    4,   15,   10, 1331,  292,    0,    0,    0,    0,    0,
          0,    0,    0,    0])

In [40]:
# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape , maxlen_answers )

(10100, 26) 26


In [41]:
# decoder_output_data -1 
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )

In [42]:
padded_answers

array([[  14,    5,  919, ...,    0,    0,    0],
       [ 141,   14,    3, ...,    0,    0,    0],
       [ 461,   87,   17, ...,    0,    0,    0],
       ...,
       [  18,   71,   66, ...,    0,    0,    0],
       [  13,  126,    3, ...,    0,    0,    0],
       [  10, 8474, 8475, ...,    0,    0,    0]])

In [43]:
# # decoder_output_data -2

onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
# onehot_answers = sparse_categorical_crossentropy( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape)

(10100, 26, 8476)
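
A one-hot target tensor of this shape is large. The back-of-the-envelope calculation below assumes `float32` values (the default dtype of `to_categorical`) and compares that with keeping one integer id per time step. Using integer targets would mean compiling the model with `loss='sparse_categorical_crossentropy'`, which is an alternative this notebook does not use; the exact target shape it expects may vary between TensorFlow versions.

```python
samples, timesteps, vocab = 10100, 26, 8476   # shape printed above

onehot_bytes = samples * timesteps * vocab * 4   # float32 one-hot rows
sparse_bytes = samples * timesteps * 4           # one int32 id per time step

print("one-hot targets: {:.2f} GiB".format(onehot_bytes / 2**30))
print("integer targets: {:.2f} MiB".format(sparse_bytes / 2**20))
```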


In [45]:
import os
import pickle

if os.path.isfile("./encoder_input_data.npy"):
    print("Loading existing numpy file: ")
    # load model
    encoder_input_data = np.load('encoder_input_data.npy')
    decoder_input_data = np.load('decoder_input_data.npy')
    decoder_output_data = np.load('decoder_output_data.npy')
else:
    # Saving all the arrays to storage
    np.save( 'encoder_input_data.npy' , encoder_input_data )
    np.save( 'decoder_input_data.npy' , decoder_input_data )
    np.save( 'decoder_output_data.npy' , decoder_output_data )

In [44]:
# Saving all the arrays to storage
np.save( 'encoder_input_data.npy' , encoder_input_data )
np.save( 'decoder_input_data.npy' , decoder_input_data )
np.save( 'decoder_output_data.npy' , decoder_output_data )

## 3) Defining the Encoder-Decoder model

The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : For converting token indices to fixed-size dense vectors. **( Note :  Don't forget the `mask_zero=True` argument here )**
*   LSTM layer : Provides access to Long Short-Term Memory cells.
*   Dense layer : Applies a softmax over the vocabulary to the decoder output at every time step.

Working : 

1.   The `encoder_input_data` is fed into the Embedding layer (  `encoder_embedding` ). 
2.   The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( `h` and `c`, which together form `encoder_states` ).
3.   These states are set as the initial state of the decoder's LSTM cell.
4.   The `decoder_input_data` comes in through its own Embedding layer.
5.   The embeddings go into the decoder's LSTM cell ( which holds the encoder states ) to produce output sequences.

**Important points :**


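*   `200` is the dimension of the embedding vectors, and also the number of units in each LSTM.
*   Both Embedding layers are trained from scratch together with the rest of the model; no pretrained embedding matrix ( such as GloVe ) is loaded in this notebook.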


<center><img style="float: center;" src="https://cdn-images-1.medium.com/max/1600/1*bnRvZDDapHF8Gk8soACtCQ.gif"></center>


Image credits to [Hackernoon](https://hackernoon.com/tutorial-3-what-is-seq2seq-for-text-summarization-and-why-68ebaa644db0).










In [46]:
encoder_inputs = tf.keras.layers.Input(shape=( None , ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [47]:
tf.__version__

'1.15.0'

In [48]:
decoder_inputs = tf.keras.layers.Input(shape=( None ,  ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 200)    1695200     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 200)    1695200     input_2[0][0]                    
______________________________________________________________________________________________
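
The summary above is cut off after the embedding rows. The embedding count it does show can be checked by hand, and the same arithmetic gives the sizes one would expect for the LSTM and Dense layers defined in the cells above. The LSTM and Dense figures here are computed from the standard Keras parameter formulas, not read from the notebook output.

```python
vocab_size, embed_dim, units = 8476, 200, 200

embedding_params = vocab_size * embed_dim                 # one vector per token id
lstm_params = 4 * (units * (embed_dim + units) + units)   # 4 gates: input kernel, recurrent kernel, bias
dense_params = units * vocab_size + vocab_size            # weights plus bias

print(embedding_params, lstm_params, dense_params)
```

The first number matches the `1695200` reported for each Embedding layer in the summary.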

## 4) Train / Save / Load Model

In [49]:
# load and evaluate a saved model
from numpy import loadtxt
from tensorflow.keras.models import load_model

In [50]:
import os
import pickle
if os.path.isfile("./model_lstm_220.h5"):
    print("Loading existing model: ")
    # load model
    model = load_model('model_lstm_220.h5')
    # summarize model.
    model.summary()
else:
    print("Training model: ")
    # Train model first
    model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=220, epochs=50, use_multiprocessing= True) 

    # Save the trained model in HDF5 format
    model.save( 'model_lstm_220.h5' )

Training model: 
Train on 10100 samples
Epoch 1/50
...
Epoch 50/50


## 5) Defining inference models

We create inference models which help in predicting answers.

**Encoder inference model** : Takes the question as input and outputs LSTM states ( `h` and `c` ).

**Decoder inference model** : Takes 2 inputs. The first is the LSTM states ( initially the output of the encoder model ); the second is the answer input sequence, which at inference time is a single token, starting with the `<start>` tag. It outputs a probability distribution over the next answer token for the question we fed to the encoder model, along with the updated state values.

In [51]:
def make_inference_models():
    
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    
    decoder_state_input_h = tf.keras.layers.Input(shape=( 200 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 200 ,))
    
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model

## 6) Talking with our Chatbot

First, we define a method `str_to_tokens` which converts `str` questions to Integer tokens with padding.


In [52]:
def str_to_tokens( sentence : str ):
    # texts_to_sequences applies the tokenizer's own lowercasing and punctuation
    # filtering, and skips out-of-vocabulary words instead of raising a KeyError
    tokens_list = tokenizer.texts_to_sequences( [ sentence ] )
    return preprocessing.sequence.pad_sequences( tokens_list , maxlen=maxlen_questions , padding='post')




1.   First, we take a question as input and predict the state values using `enc_model`.
2.   We set the state values in the decoder's LSTM.
3.   Then, we generate a sequence which contains the `<start>` element.
4.   We input this sequence in the `dec_model`.
5.   We replace the `<start>` element with the element which was predicted by the `dec_model` and update the state values.
6.   We carry out the above steps iteratively till we hit the `<end>` tag or the maximum answer length.







In [None]:
enc_model , dec_model = make_inference_models()

for _ in range(10):
    states_values = enc_model.predict( str_to_tokens( input( 'Enter question : ' ) ) )
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
        # Look up the predicted word directly ( index 0 is padding and has no word )
        sampled_word = tokenizer.index_word.get( int( sampled_word_index ) )
        if sampled_word is not None :
            decoded_translation += ' {}'.format( sampled_word )
        
        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True
            
        empty_target_seq = np.zeros( ( 1 , 1 ) )  
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ] 
        
    print( decoded_translation )

Enter question : does it has camera
 yes it does end
Enter question : is it waterproof
 no it is not end
Enter question : size of mobile
 4 volts end
Enter question : what are the specifications
 20 watts end
Enter question : can you download netflix
 yes we can ship to the computer end
Enter question : what is size
 4 5 x 3 4 x 1 4 end
Enter question : how bright is it
 i don't know because it's about 5 feet end
