# Corona Chatbot

### Author: Harsh Khatri

In this project, I tried to create a closed domain chatbot. While extracting the data for the chatbot from the internet, I noticed that the questions are short, but the answers are lengthy.
If I try to build a seq-to-seq model for this chatbot, it might not work correctly bcoz of the length difference in questions and answers. The chatbot won't be able to predict the long responses accurately.
So I decided to treat every answer as a class. I trained my model to perform an LSTM multi-class classification.  The model is trained on different questions and trying to predict which answer class it belongs to.

PS: The bot is currently in development

In [2]:
import pandas as pd
import numpy as np
import re
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import OneHotEncoder

In [3]:
dataset = pd.read_excel("COVID-19-dataset.xlsx")
lastRow = 41
df = dataset.iloc[0:lastRow-1,:]

In [4]:
df.head(10)

Unnamed: 0,que,ans,alternate-que,reference
0,What is a coronavirus,Coronaviruses are a large family of viruses th...,What is a coronavirus%Tell me about coronaviru...,
1,Can humans become infected with a novel corona...,Several known coronaviruses are circulating in...,Can humans become infected with a novel corona...,
2,Are health workers at risk from a novel corona...,"Yes, they can be, as health care workers come ...",Are health workers at risk from a novel corona...,
3,What is COVID-19,COVID-19 is the infectious disease caused by t...,What is COVID-19%Tell me about COVID19,
4,Is COVID-19 the same as SARS,"No, SARS was more deadly but much less infecti...",Is COVID-19 the same as SARS%Is novel coronavi...,
5,What are the symptoms of COVID-19,The most common symptoms of COVID-19 are: feve...,What are the symptoms of COVID-19%COVID19 symp...,
6,Can I catch COVID-19 from surfaces or packages,Yes it is possible to catch COVID-19 from surf...,Can I catch COVID-19 from surfaces or packages...,
7,Is it safe to receive a package from any area...,Yes. The likelihood of an infected person cont...,Is it safe to receive a package from any area...,
8,What can I do to protect myself from COVID-19,You can reduce your chances of being infected ...,What can I do to protect myself from COVID-19,
9,What should I do if I have visited an area whe...,If you have recently visited (past 14 days) ar...,What should I do if I have visited an area whe...,


In [5]:
def make_ans_dictionary(ans_array):
    ans_dictionary = {}
    for i, item in enumerate(ans_array):
        ans_dictionary[i+1] = item
    return ans_dictionary
    
ans_dictionary = make_ans_dictionary(df["ans"].values)

In [6]:
def clean_text(text):
    text = text.lower()
    re.sub(r"covid-19", "covid19", text)
    re.sub(r"covid 19", "covid19", text)
    re.sub(r"novel coronavirus", "covid19", text)
    
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"[-()\"#/@;:<>{}+=~|.?,]", "", text)
    return text

In [7]:
def process_input_string(text):
    text = clean_text(text)
    text_arr = []
    for word in text.split():
        if word in word2int:
            text_arr.append(word2int[word])
        else:
            text_arr.append(0)
    
    padded_text = pad_sequences([np.array(text_arr)], maxlen=18, padding='post')
    final_text = np.reshape(padded_text, (padded_text.shape[0], padded_text.shape[1], 1))
    return final_text

In [8]:
def make_que_dictionary(que_array):
    que_dictionary = {}
    for i, item in enumerate(que_array):
        que_arr = []
        for text in item.split('%'):
            que_arr.append(clean_text(text))
        que_dictionary[i+1] = que_arr
    
    return que_dictionary

que_dictionary = make_que_dictionary(df["alternate-que"].values)

In [9]:
def training_dataset(que_dictionary):
    que_train = []
    y_train = []
    for i, item in que_dictionary.items():
        for que in item:
            que_train.append(que)
            y_train.append(i)
    
    return np.array(que_train).reshape(-1,1), np.array(y_train)
    
que_train, y_train = training_dataset(que_dictionary)
print(que_train[:6])

[['what is a coronavirus']
 ['tell me about coronavirus']
 ['what are coronaviruses']
 ['what is coronavirus']
 ['can humans become infected with a novel coronavirus of animal source']
 ['can i get affected by my dog']]


In [10]:
def modify_ytrain(y_train):
    encoder = OneHotEncoder()
    return encoder.fit_transform(y_train.reshape(-1,1)).toarray()

y_train = modify_ytrain(y_train)
y_train

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [11]:
def word_count(train):
    word2count = {}
    for text in train:
        for word in text.split():
            if word in word2count:
                word2count[word] += 1
            else:
                word2count[word] = 1
    return word2count

word2count = word_count(que_train[:,0])

In [12]:
def word_to_id(word2count, threshold):
    word2int = {}
    counter = 1
    for word, count in word2count.items():
        if count > threshold:
            word2int[word] = counter
            counter += 1
    return word2int
    
word2int = word_to_id(word2count, 0)

In [13]:
def id_to_word(word2int):
    int2word = {}
    for word, i in word2int.items():
        int2word[i] = word
    return int2word

int2word = id_to_word(word2int)

In [14]:
def que_to_int(que_train, word2int):
    que_array = []
    for que in que_train[:,0]:
        ints = []
        for word in que.split():
            ints.append(word2int[word])
        que_array.append(ints)
    return np.array(que_array)

que_array = que_to_int(que_train, word2int)

In [15]:
def pad_input_array(que_array, max_length):
    padded = pad_sequences(que_array, maxlen=max_length, padding='post')
    return np.array(padded)

X_train = pad_input_array(que_array, 18)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

In [16]:
def create_model(X_train, y_train, nepochs, batch_size):
    
    classifier = Sequential()
    classifier.add(LSTM(units=20, return_sequences=True, input_shape=(X_train.shape[1],1)))
    classifier.add(Dropout(0.2))
    classifier.add(LSTM(units=20, return_sequences=False))
    classifier.add(Dense(units=40, activation='softmax'))
    
    classifier.compile(optimizer='adam', loss='categorical_crossentropy')
    print(classifier.summary())
    classifier.fit(X_train, y_train, epochs = nepochs, batch_size=batch_size)
    
    return classifier

In [17]:
# Initializing parameters
no_of_epochs = 150
batch_size = 4

In [18]:
model = create_model(X_train, y_train, no_of_epochs, batch_size)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 18, 20)            1760      
_________________________________________________________________
dropout_1 (Dropout)          (None, 18, 20)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 20)                3280      
_________________________________________________________________
dense_1 (Dense)              (None, 40)                840       
Total params: 5,880
Trainable params: 5,880
Non-trainable params: 0
_________________________________________________________________
None

Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 

Epoch 89/150
Epoch 90/150
Epoch 91/150
Epoch 92/150
Epoch 93/150
Epoch 94/150
Epoch 95/150
Epoch 96/150
Epoch 97/150
Epoch 98/150
Epoch 99/150
Epoch 100/150
Epoch 101/150
Epoch 102/150
Epoch 103/150
Epoch 104/150
Epoch 105/150
Epoch 106/150
Epoch 107/150
Epoch 108/150
Epoch 109/150
Epoch 110/150
Epoch 111/150
Epoch 112/150
Epoch 113/150
Epoch 114/150
Epoch 115/150
Epoch 116/150
Epoch 117/150
Epoch 118/150
Epoch 119/150
Epoch 120/150
Epoch 121/150
Epoch 122/150
Epoch 123/150
Epoch 124/150
Epoch 125/150
Epoch 126/150
Epoch 127/150
Epoch 128/150
Epoch 129/150
Epoch 130/150
Epoch 131/150
Epoch 132/150
Epoch 133/150
Epoch 134/150
Epoch 135/150
Epoch 136/150
Epoch 137/150
Epoch 138/150
Epoch 139/150
Epoch 140/150
Epoch 141/150
Epoch 142/150
Epoch 143/150
Epoch 144/150
Epoch 145/150
Epoch 146/150
Epoch 147/150
Epoch 148/150
Epoch 149/150
Epoch 150/150


In [33]:
def run_bot():    
    print('\033[1m' +"Enter 'Bye' to exit \n")
    print('\033[0m')
    while(True):
        question = input("You: ")
        if "bye" in question.lower():
            break
        que_array = process_input_string(question)
        pred_array = model.predict(que_array)[0]
        pred = np.argmax(pred_array)
        # print(np.max(pred_array))
        if np.max(pred_array) > 0.6:
            ans = ans_dictionary[pred+1]
        else:
            ans = "Sorry the bot coundn't answer that. Ask a different question."
        print('ChatBot: ' + ans +'\n')

In [34]:
run_bot()

[1mEnter 'Bye' to exit 

[0m
You: what is covid-19
ChatBot: COVID-19 is the infectious disease caused by the most recently discovered coronavirus. This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019

You: Symptoms of covid-19
ChatBot: The most common symptoms of COVID-19 are: fever, tiredness and dry cough

You: Should i wear a mask
ChatBot: Sorry the bot coundn't answer that. Ask a different question.

You: what should I not do
ChatBot: Sorry the bot coundn't answer that. Ask a different question.

You: bye
