<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Creating a Rule-Based Chatbot</p>

In [1]:
'''
 A chatbot is a conversational agent capable of answering user queries in the form of text,
   speech, or via a graphical user interface
'''


In [2]:
'''
Chatbots can be broadly categorized into two types: task-oriented chatbots and general-purpose chatbots.
Task-oriented chatbots are designed to perform specific tasks.
For instance, a task-oriented chatbot can answer queries about train reservations or pizza delivery;
it can also work as a personal medical therapist or personal assistant.

General-purpose chatbots, on the other hand, can hold open-ended discussions with users.

A third type, the hybrid chatbot,
can engage in both task-oriented and open-ended discussions with users.
'''


<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Chatbot development approaches fall into two categories: rule-based chatbots and learning-based chatbots.</p>

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Learning-Based Chatbots</p>

In [None]:
'''
Learning-based chatbots use machine learning techniques and a dataset
to learn to generate a response to user queries.
They can be further divided into two categories: retrieval-based chatbots and generative chatbots.
'''


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Rule-Based Chatbots</p>

In [None]:
'''
A rule-based chatbot works from a specific set of rules. If the user query matches any rule, the answer to the query is generated;
otherwise the user is notified that an answer to the query doesn't exist.
'''

In [3]:
'''
When a user enters a query, the query will be converted into vectorized form.
All the sentences in the corpus will also be converted into their corresponding vectorized forms.
Next, the sentence with the highest cosine similarity
with the user input vector will be selected as a response to the user input.
'''

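
The retrieval step described above can be sketched on a toy corpus. This is only a minimal sketch under simplifying assumptions, not the notebook's implementation: it uses raw word counts instead of TF-IDF weights and only the standard library, and the names `vectorize`, `cosine`, `best_match` and `corpus` are hypothetical.

```python
import math
from collections import Counter

def vectorize(sentence):
    """Turn a sentence into a bag-of-words count vector (a Counter)."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, corpus):
    """Return the corpus sentence with the highest cosine similarity to the query."""
    query_vector = vectorize(query)
    return max(corpus, key=lambda s: cosine(query_vector, vectorize(s)))

# Hypothetical three-sentence corpus, used only for illustration
corpus = [
    "tennis is played with a racket and a ball",
    "a match is divided into sets and games",
    "wimbledon is played on grass courts",
]
print(best_match("which courts are grass", corpus))
```

Only the third sentence shares words ("courts", "grass") with the query, so it is the one selected. TF-IDF follows the same idea but down-weights words that appear in many sentences.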

In [4]:
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Creating the Corpus</p>

In [5]:
# The following script retrieves the Wikipedia article
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Tennis')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')
# extracts all the paragraphs from the article text
article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text
# text is converted into the lower case for easier processing
article_text = article_text.lower()

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Text Preprocessing and Helper Function</p>

In [6]:
# remove the bracketed citation numbers that Wikipedia adds, such as [1] or [23]
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
# collapse every run of whitespace into a single space
article_text = re.sub(r'\s+', ' ', article_text)
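
To see what the two substitutions do, they can be run on a short string. This is a small sketch; the `sample` text is made up for illustration and the trailing `strip()` is an addition that the cell above does not perform.

```python
import re

# Made-up sample with extra spaces, a newline, and citation markers
sample = "the ball is hit   over a net.[1] players use\na racket.[23][a]"

# Replace numeric citation markers such as [1] or [23] with a space
without_citations = re.sub(r'\[[0-9]*\]', ' ', sample)
# Collapse every run of whitespace (spaces, tabs, newlines) into one space
cleaned = re.sub(r'\s+', ' ', without_citations).strip()

print(cleaned)
```

Note that the pattern only matches digits inside the brackets, so a marker such as `[a]` is left in the text.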

In [7]:
# divide our text into sentences and words
nltk.download('punkt_tab')
article_sentences = nltk.sent_tokenize(article_text)
article_words = nltk.word_tokenize(article_text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [8]:
wnlemmatizer = nltk.stem.WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [wnlemmatizer.lemmatize(token) for token in tokens]

# The punctuation_removal dictionary maps every punctuation character to None,
# so that str.translate deletes punctuation from the passed text
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
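
The punctuation-removal step can be tried on its own. In this sketch a plain whitespace split stands in for `nltk.word_tokenize` and lemmatization is left out, so it runs without any NLTK data; `strip_and_split` is a hypothetical helper name.

```python
import string

# Same construction as in the cell above: map every punctuation code point to None
punctuation_removal = dict((ord(ch), None) for ch in string.punctuation)

def strip_and_split(document):
    """Lowercase the text, delete punctuation, and split on whitespace."""
    return document.lower().translate(punctuation_removal).split()

tokens = strip_and_split("Who won Wimbledon, and when?")
print(tokens)
```

Because apostrophes are punctuation too, a word such as "Don't" becomes "dont" after this step.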

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Responding to Greetings</p>

In [9]:
'''
When a user enters a greeting, we search for each of its words in the greeting_inputs tuple; if a match is found,
we randomly choose a response from the greeting_responses list
'''
greeting_inputs = ("hey", "good morning", "good evening", "morning", "evening", "hi", "whatsup")
greeting_responses = ["hey", "hey hows you?", "*nods*", "hello, how you doing", "hello", "Welcome, I am good and you"]

def generate_greeting_response(greeting):
    for token in greeting.split():
        if token.lower() in greeting_inputs:
            return random.choice(greeting_responses)
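
A quick check of how the token matching behaves may help. This sketch repeats the function with shortened, illustrative greeting lists so that it runs on its own.

```python
import random

# Shortened lists, for illustration only
greeting_inputs = ("hey", "morning", "evening", "hi", "whatsup")
greeting_responses = ["hey", "hello", "*nods*"]

def generate_greeting_response(greeting):
    """Return a random greeting if any token is a known greeting, else None."""
    for token in greeting.split():
        if token.lower() in greeting_inputs:
            return random.choice(greeting_responses)
    return None

print(generate_greeting_response("Hi there"))         # one of greeting_responses
print(generate_greeting_response("who won in 2019"))  # None
print(generate_greeting_response("hi!"))              # None, because "hi!" is not "hi"
```

The comparison is made token by token, so attached punctuation prevents a match, and a multi-word entry such as "good morning" can only be reached through one of its single words.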

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Responding to User Queries</p>

In [10]:
'''
The response will be generated based on the cosine similarity between the vectorized form
of the input sentence and the sentences in the corpus
'''
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [11]:
'''
We will create a method that takes the user input, computes its cosine similarity
with every sentence in the corpus, and returns the most similar sentence
'''
def generate_response(user_input):
    # Append the query so that it is vectorized with the same vocabulary as the corpus;
    # the chat loop removes it again after the response is printed
    article_sentences.append(user_input)

    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')
    all_word_vectors = word_vectorizer.fit_transform(article_sentences)
    # Compare the last row (the query) with every row, including itself
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)
    # The highest score is the query matched with itself, so take the second highest
    similar_sentence_number = similar_vector_values.argsort()[0][-2]

    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    # A score of 0 means that no corpus sentence shares a term with the query
    if vector_matched == 0:
        return "I am sorry, I could not understand you"
    return article_sentences[similar_sentence_number]

In [12]:
# We initialize the TfidfVectorizer and then convert all the sentences in the corpus, along with the input sentence,
# into their corresponding vectorized form.
word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')
# The lemmatizer used inside get_processed_text needs the WordNet data
nltk.download('wordnet')
all_word_vectors = word_vectorizer.fit_transform(article_sentences)

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [13]:
similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

In [14]:
'''
We use the cosine_similarity function to find the cosine similarity between the last row of the all_word_vectors matrix
(inside generate_response this is the vector for the user input, since it was appended at the end)
and the vectors for all the sentences in the corpus.
'''
similar_sentence_number = similar_vector_values.argsort()[0][-2]

In [15]:
'''
We sort the cosine similarities; after sorting, the second-last item is the corpus sentence
with the highest cosine similarity to the user input.
The last item is the user input itself, so we do not select it.
'''


In [None]:
'''
If the cosine similarity of the matched vector is 0, that means our query did not have an answer.
In that case, we will simply print that we do not understand the user query.
'''

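
The selection and fallback logic from the last two cells can be sketched without NumPy. This is a simplified illustration under the assumption that the query is the last entry and has the highest similarity; `pick_response` is a hypothetical name and the similarity values are made up.

```python
def pick_response(sentences, similarities):
    """Pick the corpus sentence closest to the query.

    sentences[-1] is the query itself, and similarities[i] is the cosine
    similarity between the query and sentences[i].
    """
    # Indices ordered from least to most similar, like argsort
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    # order[-1] is the query matched with itself, so use the second-last index
    best_index = order[-2]
    if similarities[best_index] == 0:
        return "I am sorry, I could not understand you"
    return sentences[best_index]

sentences = ["sentence a", "sentence b", "sentence c", "the query"]
print(pick_response(sentences, [0.1, 0.7, 0.0, 1.0]))  # sentence b
print(pick_response(sentences, [0.0, 0.0, 0.0, 1.0]))  # fallback message
```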

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Chatting with the Chatbot</p>

In [None]:
'''
As a final step, we need a way to chat with the chatbot that we just designed.
To do so, we write a loop
that keeps executing until the user types "bye" or thanks the chatbot.
'''
# We first set the flag continue_dialogue to True and print a welcome message asking the user for input
continue_dialogue = True
print("Hello, I am your friend TennisRobo. You can ask me any question regarding tennis:")
while continue_dialogue:
    human_text = input()
    human_text = human_text.lower()
    if human_text != 'bye':
        if human_text in ('thanks', 'thank you very much', 'thank you'):
            continue_dialogue = False
            print("TennisRobo: Most welcome")
        else:
            greeting_response = generate_greeting_response(human_text)
            if greeting_response is not None:
                # The input contained a greeting, so we reply with a greeting
                print("TennisRobo: " + greeting_response)
            else:
                # Otherwise generate_response fetches the most similar corpus sentence,
                # based on the cosine similarity as explained in the last section
                print("TennisRobo: ", end="")
                print(generate_response(human_text))
                # generate_response appended the query to the corpus, so we remove it again
                article_sentences.remove(human_text)
    else:
        continue_dialogue = False
        print("TennisRobo: Good bye and take care of yourself...")

Hello, I am your friend TennisRobo. You can ask me any question regarding tennis:
Novak Djokovic
TennisRobo: by the early twenty-first century, the 'big three' of roger federer, rafael nadal and novak djokovic have dominated men's singles tennis for two decades, collectively winning 66 major singles tournaments; djokovic with an all-time record 24 titles, nadal with 22 and federer with 20. they have been ranked as world no.
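
Because the loop above reads from `input()`, it is awkward to test. One possible variation, shown here only as a sketch, takes the user turns as a list and the answering function as a parameter; `run_dialogue` and the echoing `respond` function are hypothetical and not part of the notebook.

```python
def run_dialogue(user_turns, respond):
    """Feed each user turn to `respond` until the user says bye or thanks."""
    transcript = []
    for text in user_turns:
        text = text.lower()
        if text == 'bye':
            transcript.append("Good bye and take care of yourself...")
            break
        if text in ('thanks', 'thank you', 'thank you very much'):
            transcript.append("Most welcome")
            break
        transcript.append(respond(text))
    return transcript

# A stand-in responder that simply echoes the lowercased input
replies = run_dialogue(["Hello", "Thanks", "ignored"], respond=lambda text: "echo: " + text)
print(replies)
```

The third turn is never processed because the dialogue ends as soon as the user says thanks, which mirrors the behaviour of the loop above.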
