
# Chatbot

Build a **corpus-based conversational chatbot using NLTK and Python**, using this [reference tutorial](https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc) for guidance and ideas.

Perform the following tasks:
1. Use the dataset provided in the tutorial, or develop your own dataset with similar structure.
2. Perform text normalisation: convert text to lowercase, remove special characters, perform lemmatisation, and remove any stopwords.
3. Use text vectorisation methods such as bag of words and TF-IDF, and compute cosine similarity.
4. Compare the performance and results of the two methods, i.e. bag of words (BOW) and TF-IDF.
5. Customise using any of the NLP techniques we have learned.

In [1]:
import nltk, re # NLTK library of language resources
nltk.download('omw-1.4')

# Part of speech tagging and tokenisation
from nltk import pos_tag, word_tokenize 

# to perform lemmatisation
from nltk.stem import wordnet, WordNetLemmatizer

# stopwords
from nltk.corpus import stopwords

import json

# to perform bow and tfidf
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# For cosine similarity
from sklearn.metrics import pairwise_distances

# Data processing and visualisation
import random
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

# Pandas and print rounding
pd.set_option('display.precision', 3)  # full option name; the bare 'precision' key is ambiguous in newer pandas
%precision 3

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


/kaggle/input/dataset/dialog_talk_agent.xlsx


Import the dataset (corpus) into a pandas DataFrame.

In [2]:
df = pd.read_excel("../input/dataset/dialog_talk_agent.xlsx")
df.head(15)

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,
3,Describe yourself,
4,tell me about yourself,
5,all about you,
6,tell me some stuff about you,
7,talk some stuff about you,
8,talk about yourself,
9,about yourself,


In [3]:
df.shape

(1592, 2)

The dataset has 1592 entries. Each entry pairs a context (a user utterance) with a text response. Similar questions are grouped together, and a NaN response means the entry shares the response of the nearest preceding entry that has one.


In [4]:
# Replace null values with previous value
df.ffill(axis = 0, inplace = True)
df.head(10)

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,I can help you work smarter instead of harder
3,Describe yourself,I can help you work smarter instead of harder
4,tell me about yourself,I can help you work smarter instead of harder
5,all about you,I can help you work smarter instead of harder
6,tell me some stuff about you,I can help you work smarter instead of harder
7,talk some stuff about you,I can help you work smarter instead of harder
8,talk about yourself,I can help you work smarter instead of harder
9,about yourself,I can help you work smarter instead of harder


* Text normalisation (lowercase, remove special characters, lemmatisation)
* Vectorisation (bag of words (BOW), TF-IDF)
* Cosine similarity
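The last step in the list above, cosine similarity, scores how closely two vectors point in the same direction. The notebook computes it with scikit-learn; the block below is only a minimal plain-Python sketch of the same idea on made-up vectors, so the vectors and the helper name `cosine_similarity` here are illustrative assumptions rather than the notebook's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (a convention chosen
    for this sketch, so that an empty query matches nothing).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up count vectors for three corpus entries and one query
corpus_vectors = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
query = [1, 0, 1, 1]

scores = [cosine_similarity(v, query) for v in corpus_vectors]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the best match
print(scores, best)
```

One thing this makes visible: if a query shares no words with the vocabulary, every score is 0 and the "best" index is simply the first row, so a minimum-score check is worth considering before returning a response.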

In [5]:
df1 = df.head(10)

In [6]:
def normalise_text(text):
    '''
    Function takes a text string (utterance) as input, 
    converts to lowercase, 
    removes special characters and punctuation, 
    tokenises, POS-tags and lemmatises each token.
    Joins lemmatised tokens, and returns lemmatised string.
    
    '''
    
    # convert to lowercase
    text = str(text).lower()
        
    # remove special characters
    text = re.sub(r'[^a-z0-9]', " ",text)
        
    # tokenise
    tokens = word_tokenize(text)
    
    # Initialise lemmatiser
    lemmatiser = wordnet.WordNetLemmatizer()
    
    # Part of speech (POS) tagging, tagset set to default
    tagged_tokens =  pos_tag(tokens, tagset = None)

    # Empty list
    token_lemmas = []
    for (token, pos_token) in tagged_tokens:
        if pos_token.startswith("V"): # verb
            pos_val = "v"
        elif pos_token.startswith("J"): # adjective
            pos_val = "a"
        elif pos_token.startswith("R"): # adverb
            pos_val = "r"
        else:
            pos_val = 'n' # noun
        
        # lemmatise and append to list of lemmatised tokens
        token_lemmas.append(lemmatiser.lemmatize(token, pos_val))
    
    return " ".join(token_lemmas)
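One side effect of the cleaning regex is that apostrophes become spaces, so contractions are split into fragments (the TF-IDF example later in this notebook prints `it s quite easy ... isn t it`). The sketch below isolates that step with the standard `re` module only. The second function, `strip_special_keep_apostrophes`, is a hypothetical variant and not part of the notebook; whether it helps depends on how the tokeniser and lemmatiser treat contractions afterwards.

```python
import re

def strip_special(text):
    """Same cleaning step as normalise_text: lowercase, then replace
    every character outside a-z and 0-9 with a space."""
    return re.sub(r"[^a-z0-9]", " ", str(text).lower())

def strip_special_keep_apostrophes(text):
    """Hypothetical variant: keep apostrophes that sit between two letters."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9']", " ", text)              # keep apostrophes for now
    text = re.sub(r"(?<![a-z])'|'(?![a-z])", " ", text)  # drop stray ones
    return text

raw = "Can't wait, isn't it 'great'!"
print(strip_special(raw).split())
print(strip_special_keep_apostrophes(raw).split())
```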

In [7]:
normalise_text("Oh my days!")
normalise_text("Can't wait to get Korean again!!")
normalise_text("You are overanalysing things! Why do we need lemmatisation?")

'you be overanalysing thing why do we need lemmatisation'

In [8]:
# apply the normalise_text function to each entry in the context column
df["lemmatised_text"] = df["Context"].apply(normalise_text)
df.head()

Unnamed: 0,Context,Text Response,lemmatised_text
0,Tell me about your personality,Just think of me as the ace up your sleeve.,tell me about your personality
1,I want to know you better,I can help you work smarter instead of harder,i want to know you good
2,Define yourself,I can help you work smarter instead of harder,define yourself
3,Describe yourself,I can help you work smarter instead of harder,describe yourself
4,tell me about yourself,I can help you work smarter instead of harder,tell me about yourself


### Remove stopwords

In [9]:
def remove_stopwords(text):
    
    # NLTK's English stopword list (all lowercase), as a set for fast lookup
    stop = set(stopwords.words("english"))
    
    # keep only the tokens that are not stopwords
    text = [word for word in text.split() if word not in stop]
    return " ".join(text)
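Because the stopword list is lowercase and the function splits on whitespace only, the result depends on whether it runs before or after normalisation. The sketch below illustrates this with a small hard-coded stopword set (an assumption made so the example runs without NLTK data; the notebook uses NLTK's English list) and a simplified `clean` helper that stands in for the lowercase and special-character steps only.

```python
import re

# Small hypothetical stopword set; the notebook uses NLTK's English list.
STOP = {"me", "you", "the", "a", "is", "to"}

def drop_stopwords(text, stop=STOP):
    """Remove whitespace-separated tokens that appear in the stop set."""
    return " ".join(w for w in text.split() if w not in stop)

def clean(text):
    """Lowercase and replace non-alphanumeric characters with spaces."""
    return " ".join(re.sub(r"[^a-z0-9]", " ", text.lower()).split())

raw = "Help me! The door is stuck."
before = clean(drop_stopwords(raw))  # stopwords removed from the raw string
after = drop_stopwords(clean(raw))   # stopwords removed after cleaning
print(before)
print(after)
```

In the first ordering, `me!` and `The` survive because they do not match the lowercase, punctuation-free entries in the stop set.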

### Bag of words (BOW)

In [10]:
# count vectoriser 
cv = CountVectorizer()
X = cv.fit_transform(df["lemmatised_text"]).toarray()

features = cv.get_feature_names_out()
df_bow = pd.DataFrame(X, columns = features)
df_bow.head()

Unnamed: 0,21,abort,about,absolutely,abysmal,actually,adore,advice,advise,affirmative,...,year,yeh,yep,yes,yet,you,your,yours,yourself,yup
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
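To make the table above less opaque, here is a minimal sketch of what a bag-of-words matrix contains, built by hand with `collections.Counter` on three sentences taken from the corpus preview. It is an illustration, not a replacement for `CountVectorizer`, which additionally lowercases its input and, with its default token pattern, ignores single-character tokens.

```python
from collections import Counter

docs = [
    "tell me about yourself",
    "describe yourself",
    "tell me about your personality",
]

# Vocabulary: sorted unique tokens, mirroring the alphabetical columns above
vocab = sorted({tok for d in docs for tok in d.split()})

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[tok] for tok in vocab]

matrix = [bow_vector(d, vocab) for d in docs]
print(vocab)
for row in matrix:
    print(row)
```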


## Utterance
`s` is the string input to the chatbot.

In [11]:
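# Utterance preprocessing - remove stopwords, then normalise text.
# Note: remove_stopwords sees the raw string here, so capitalised or
# punctuated tokens (e.g. "me!") do not match the lowercase stopword list.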

#s = "I can't believe how tasty the Korean is!!! We should go there again!"
#s = "What's the weather like tomorrow?"
s = "Help me!"

t = remove_stopwords(s)
print(t)

u = normalise_text(t)
print(u)

Help me!
help me


In [12]:
# Convert the preprocessed utterance to bag of words
u_bow = cv.transform([u]).toarray()

# Apply cosine similarity to the utterance to search for 
# the most similar utterance in the training dataset BOW
cosine_value = 1 - pairwise_distances(df_bow, u_bow, metric = "cosine")

# cosine_value calculates similarity between utterance and each entry in the dataset

# print the question
print(s)

# Get the index of the most similar entry
index_value1 = cosine_value.argmax()

# Get the response of the most similar entry 
df.loc[index_value1,"Text Response"]

Help me!


"I'm glad to help. What can I do for you?"

### TF-IDF

In [13]:
# Initialise sklearn tfidf
tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(df["lemmatised_text"]).toarray()
df_tfidf = pd.DataFrame(x_tfidf, columns = tfidf.get_feature_names_out())
df_tfidf.head()

Unnamed: 0,21,abort,about,absolutely,abysmal,actually,adore,advice,advise,affirmative,...,year,yeh,yep,yes,yet,you,your,yours,yourself,yup
0,0.0,0.0,0.408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.331,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.205,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.642,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.642,0.0
4,0.0,0.0,0.454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.609,0.0
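The weights in the table can be reproduced in spirit with a few lines of plain Python. The sketch below follows what I understand to be `TfidfVectorizer`'s documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row); it uses three sentences from the corpus preview, so the numbers will not match the full-corpus table above.

```python
import math
from collections import Counter

docs = [
    "tell me about yourself",
    "describe yourself",
    "tell me about your personality",
]
vocab = sorted({tok for d in docs for tok in d.split()})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(tok for d in docs for tok in set(d.split()))

# Smoothed inverse document frequency (rarer words get larger values)
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_vector(text):
    """Term count times IDF for each vocabulary word, scaled to unit length."""
    counts = Counter(text.split())
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

vec = tfidf_vector("describe yourself")
for tok, weight in zip(vocab, vec):
    if weight:
        print(f"{tok}: {weight:.3f}")
```

Here `describe` appears in one document and `yourself` in two, so `describe` ends up with the larger weight even though both occur once in the sentence.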


In [14]:
# Earlier test utterances (only the last assignment is used)
# s = "Want to come for dinner?"
# s = "How are you doing today?"
# s = "Do you like mangoes?"
# s = "Hello!"
# s = "Good evening! How are you?"
# s = "Exceptionally well! Thank you"
s = "It's quite easy to improve that, isn't it!"


def chat_tfidf(text):
    
    # Lemmatised utterance
    text = normalise_text(text)
    print(text)
    
    text_tfidf = tfidf.transform([text]).toarray()
    cos = 1 - pairwise_distances(df_tfidf, text_tfidf, metric = "cosine")
    index_value = cos.argmax()
    return df["Text Response"].loc[index_value]

chat_tfidf(s)

it s quite easy to improve that isn t it


'Okay then.'
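Task 4 asks for a comparison of the two methods. One way to approach it is to run the same query through both vectorisers and look at which context each one picks. The sketch below does this on a hypothetical four-sentence corpus (not the notebook's dataset); `best_match` is an illustrative helper, and it uses `sklearn.metrics.pairwise.cosine_similarity` instead of `1 - pairwise_distances(...)`, which should be equivalent here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus, not the notebook's dataset
contexts = [
    "tell me about yourself",
    "tell me about your day",
    "tell me a joke",
    "weather today",
]

def best_match(vectoriser, query):
    """Fit the vectoriser on the contexts and return (index, score)
    of the context most similar to the query."""
    matrix = vectoriser.fit_transform(contexts)
    scores = cosine_similarity(vectoriser.transform([query]), matrix).ravel()
    idx = int(scores.argmax())
    return idx, float(scores[idx])

query = "tell me the weather"
results = {
    "BOW": best_match(CountVectorizer(), query),
    "TF-IDF": best_match(TfidfVectorizer(), query),
}
for name, (idx, score) in results.items():
    print(f"{name}: '{contexts[idx]}' (score {score:.3f})")
```

On these made-up sentences the two methods disagree: BOW favours the context sharing the most words with the query (`tell`, `me`), while TF-IDF gives more weight to the rarer word `weather`. Running the same loop over a set of held-out utterances from the real corpus would give a more meaningful comparison.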

## Storing a list of conversation topics

In [15]:
def get_conversation_topics(text):
    '''
    Function takes a text string (utterance) as input, 
    converts to lowercase, removes special characters and punctuation, 
    tokenises and POS-tags it, and returns the sorted unique noun lemmas.
    If there are no nouns it returns the verb lemmas instead, and if
    there are neither it returns a list of fallback prompts.
    '''
    text = str(text).lower() # convert to lowercase
    text = re.sub(r'[^a-z0-9]', " ",text) # remove special characters
    tokens = word_tokenize(text) # tokenise
    # Initialise lemmatiser
    lemmatiser = wordnet.WordNetLemmatizer()
    # Part of speech (POS) tagging, tagset set to default
    tagged_tokens =  pos_tag(tokens, tagset = None)
    
    # Empty lists to store nouns and verbs from input
    noun_lemmas = []
    verb_lemmas = []
    for (token, pos_token) in tagged_tokens:
        if pos_token.startswith("V"): # verb
            verb_lemmas.append(lemmatiser.lemmatize(token, "v"))
        elif pos_token.startswith("NN"): # noun 
            noun_lemmas.append(lemmatiser.lemmatize(token, "n"))
    
    if len(set(noun_lemmas)) > 0:
        return sorted(set(noun_lemmas))
    elif len(set(verb_lemmas)) >0:
        return sorted(set(verb_lemmas))
    else:
        return ["I am not sure... What do you think?",
                "I need time to think about that. Do you have other ideas?",
                "Hmmm"
                ]

In [16]:
# Use tfidf on text if cosine similarity >= 0.8
# Otherwise return a topic from the input to seed a generated response
def chat_extend(chat_input):
    
    # Lemmatised utterance
    text = normalise_text(chat_input)
    #print(text)
    
    text_tfidf = tfidf.transform([text]).toarray()
    cos = 1 - pairwise_distances(df_tfidf, text_tfidf, metric = "cosine")
    
    # Use cosine similarity with tfidf when the best match is strong
    if cos.max() >= 0.8:
        index_value = cos.argmax()
        return df["Text Response"].loc[index_value]
    else:
        topic_options = get_conversation_topics(chat_input)
        topic = random.choice(topic_options)
        return topic # To use as seed to generate bot response
        



In [17]:
#s = "What do you think about the weather?"
#s = "The weather the weather the weather!"
# s = "Want to come for dinner?"
# s = "How are you doing today?"
#s = "Do you like mangoes?"
# s = "Hello!"
# s = "Good evening! How are you?"
# s = "Exceptionally well! Thank you"
# s = "It's quite easy to improve that, isn't it!"
# s = " It's a rather fine day!"
s = "A bottle of water and an apple!"

chat_extend(s) # Use output as seed for response generation
#### work in progress

'bottle'
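The returned topic is meant to seed a generated response, which the notebook leaves as work in progress. As one possible placeholder until a generator exists, the sketch below slots the topic into a canned template. The templates and the helper `reply_from_topic` are hypothetical and not part of the notebook.

```python
import random

# Hypothetical reply templates; {topic} is filled with the extracted lemma
TEMPLATES = [
    "Tell me more about {topic}.",
    "What do you like about {topic}?",
    "Why do you mention {topic}?",
]

def reply_from_topic(topic, rng=random):
    """Pick a template at random and insert the topic word."""
    return rng.choice(TEMPLATES).format(topic=topic)

rng = random.Random(0)  # seeded generator so repeated runs pick the same template
reply = reply_from_topic("bottle", rng)
print(reply)
```

Note that `chat_extend` can also return one of the full-sentence fallback prompts, so a caller would need to tell those apart from single-word topics before applying a template.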

## MODIFICATIONS
* Extend the training corpus, e.g. use the NLTK chat corpus for BOW
* Hybrid approach: add a rule-based layer alongside the retrieval-based conversation
* Adapt BERT or GPT-3
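As a rough idea of what the hybrid option could look like, the sketch below checks a few regular-expression rules first and only falls back to a retrieval function when none matches. The patterns, replies, and the `hybrid_reply` helper are hypothetical; `retrieval_fn` stands in for something like `chat_tfidf` or `chat_extend`.

```python
import re

# Hypothetical rules: (compiled pattern, canned reply)
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help?"),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
]

def hybrid_reply(text, retrieval_fn):
    """Return the first matching rule's reply, else defer to retrieval."""
    for pattern, reply in RULES:
        if pattern.search(text):
            return reply
    return retrieval_fn(text)

# Stub retrieval function used only for this demonstration
def fake_retrieval(text):
    return "retrieved: " + text

print(hybrid_reply("Hello there!", fake_retrieval))
print(hybrid_reply("Do you like mangoes?", fake_retrieval))
```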

## REFERENCES

* [Chatbot tutorial by Bhargava Sai Reddy P](https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc)
* [A Sentiment-Based Chat Bot](https://docslib.org/doc/25782/a-sentiment-based-chat-bot) [accessed 5 December 2022]
* [How To Make AI Chatbot In Python Using NLP (NLTK) In 2022?, Pykit, 2022](https://pykit.org/chatbot-in-python-using-nlp/) [accessed 5 December 2022]
* [How To Create A Chatbot with Python & Deep Learning In Less Than An Hour](https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44) [accessed 5 December 2022]
* [Build a simple chat bot graphical user interface using python](https://randerson112358.medium.com/build-a-simple-chat-bot-graphical-user-interface-using-python-adf7bd558fc3) [accessed 5 December 2022]
* [Natural Language Toolkit](https://www.nltk.org/book/)

### Chatbot deployment on the web
* [Chatbot deployment with flask code](https://github.com/patrickloeber/chatbot-deployment) and [video tutorial](https://www.youtube.com/watch?v=a37BL0stIuM)
* https://www.geeksforgeeks.org/gui-chat-application-using-tkinter-in-python/
* https://www.python-engineer.com/posts/chatbot-gui-tkinter/