# Creating Your Own Basic Chatbot

This tutorial is sourced and adapted from https://medium.com/analytics-vidhya/building-a-simple-chatbot-in-python-using-nltk-7c8c8215ac6e

We will be building a simple **retrieval-based** chatbot, which selects a response from a library of predefined responses based on the context of the conversation. A more sophisticated type of chatbot is a **generative** bot, which can compose answers that do not appear in the library.

We will use Python's **Natural Language Toolkit (NLTK)** for natural language processing, together with the **scikit-learn** library for TF-IDF vectorization and cosine similarity. See https://www.nltk.org/ for more info.

### Natural Language Processing (NLP)

NLP is the study of interactions between human language and computers, where computers analyze, understand, and derive meaning from human language in a smart and useful way.

The overall flow to our NLP exercise today looks like the following:

    1. Identify a *corpus* or body of text data used as the response library 
        For this exercise, we will use the *Memorable Cornell Movie-Quotes Corpus* which is a dataset of over  
        200,000 conversational exchanges between 10,000+ pairs of movie characters! See 
        https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. 

    2. Pre-process this text data which includes:
         * Converting all the text to *lowercase*
         
         * Removing movie line numbers
         
         * Removing all punctuation
         
         * Removing **stop words**, which are extremely common words (such as 'has', 'is', 'of') that add
         little value in helping select the response.
         
         * Tokenizing the text, which simply means splitting the text 'strings' into lists of individual
         sentences and individual words. We will have both *sent_tokens* and *word_tokens* lists.
         
         * Lemmatization, which reduces a word to its dictionary form (its *lemma*) so that inflected forms of
         the same word are grouped together. So for example, "run" is the lemma for words like "running",
         "runs", or "ran".

    3. Creating a 'Bag of Words'
        A 'Bag of Words' is a representation of the text which describes the occurrence of words within a 
        document and includes:
         * A vocabulary of known words
         
         * A measure of whether the known words occur in the document

        The representation of the text is transformed into a vector (or array) of numbers. So for example, if our 
        vocabulary of known words is: `{Learning, is, the, not, great}` and we want to vectorize the text 
        "Learning is great" then the vector representation of that text is simply: `(1,1,0,0,1)`. 

    4. Scaling word frequency
        We won't get into the mathematical details here, but essentially, a problem with the Bag of Words
        approach is that highly frequent words start to dominate the representation. One way to address this is
        to scale each word's frequency so that very common words are penalized and carry less weight, both
        within a document and across multiple documents. This is done by using the following:
         * Term Frequency (TF) - a scoring of the frequency of the word in the current document
         * Inverse Document Frequency (IDF) - a scoring of how rare the word is across multiple documents

        Together, the **TF-IDF** weighting is used to evaluate how important a word is to a document in a 
        collection or corpus.

    5. Calculating similarity
        The nice thing about converting text into numerical vectors is that you can now use math to calculate how
        similar (aligned) or dissimilar the input text is to each piece of text in the corpus. Two vectors that
        are closely aligned have a cosine of the angle between them close to 1, and two vectors that share no
        words have a cosine of 0. (Directly opposed vectors would have a cosine of -1, but TF-IDF weights are
        never negative, so that case does not arise here.)

    6. Identify most similar text vector and send back as response
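
Steps 3 to 5 can be sketched in plain Python. This is only a minimal illustration, not the code the chatbot runs: the helper names (`bag_of_words`, `tfidf_vectors`, `cosine`) and the three sample documents are made up for this example, and the weighting is the textbook "term count times log(N / document frequency)" form. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Binary bag of words: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def tfidf_vectors(documents, vocabulary):
    """Textbook TF-IDF: term count in the document times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            counts[t] * math.log(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
            for t in vocabulary
        ])
    return vectors

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Step 3: the bag-of-words example from the text
vocabulary = ["learning", "is", "the", "not", "great"]
print(bag_of_words("Learning is great", vocabulary))  # [1, 1, 0, 0, 1]

# Steps 4 and 5: weight three illustrative documents and compare them
docs = ["learning is great", "learning is not great", "the weather is great"]
vectors = tfidf_vectors(docs, vocabulary + ["weather"])
print(cosine(vectors[0], vectors[1]))  # between 0 and 1: both mention "learning"
print(cosine(vectors[0], vectors[2]))  # 0.0: only words found in every document are shared
```

Because 'is' and 'great' occur in all three documents, their weight drops to zero, which is why the first and third documents end up with no similarity even though they share two words.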


### Chatbot Code

#### Package installations

First, we install the `nltk` Python library:

In [37]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


Then, we import the `nltk` library, along with a few others for data manipulation, and download key `nltk` modules:

In [13]:
import nltk
import random
import string # to process standard python strings
import re

nltk.download('punkt') # Tokenizer which divides text into a list of sentences using machine learning
nltk.download('wordnet') # WordNet lexical database, used by the lemmatizer
nltk.download('stopwords') # Collection of stop words

from nltk.corpus import stopwords # This is our default list of stopwords


[nltk_data] Downloading package punkt to /Users/newadmin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/newadmin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/newadmin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Read movie file

Next, take the movie quotes file you downloaded from GitHub, and open it:

In [23]:
# Use a context manager so the file is closed once it has been read
with open('moviequotes.memorable_quotes.txt', 'r', errors='ignore') as f:
    raw_data = f.read()

In [24]:
raw_data



The `raw_data` object now contains the text from our file. This is our **corpus**.

#### Pre-process movie file

Next, we pre-process our text data by converting it to lowercase and removing the numbers:

In [25]:
raw_data = raw_data.lower() # Make all letters lowercase
clean_data = re.sub(r"\d+", "", raw_data) # Remove all numbers

clean_data



Now we have some cleaner text data to work with. We will remove the punctuation in a bit. 
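
One side effect of the `\d+` pattern is worth seeing on a small input: it deletes every run of digits, not only line numbers. The sketch below uses a made-up two-line sample (not read from the corpus file) to show the effect, which matches the gaps visible in the tokenized output further down.

```python
import re

# Hypothetical sample: a title containing a number, then a quote containing an age
sample = "10 things i hate about you\nyou're 18, you don't know what you want."

cleaned = re.sub(r"\d+", "", sample)
print(repr(cleaned))
# " things i hate about you\nyou're , you don't know what you want."
```

If digits inside the dialogue matter for your use case, a narrower pattern would be needed; which one depends on how the line numbers are laid out in the file.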

#### Tokenize our text

Next, we tokenize our text data into words and sentences:

In [26]:
sent_tokens = nltk.sent_tokenize(clean_data)  # converts to list of sentences
word_tokens = nltk.word_tokenize(clean_data)  # converts to list of words

In [27]:
sent_tokens

['\n things i hate about you\nwho knocked up your sister?',
 'who knocked up your sister?',
 'things i hate about you\ni was watching you out there, before.',
 "i've never seen you look so sexy.",
 "i watched you out there   i've never seen you look like that\n\n things i hate about you\nyou're , you don't know what you want.",
 "and you won't know what you want 'til you're , and even if you get it, you'll be too old to use it.",
 "you're eighteen.",
 "you don't know what you want.",
 "you won't know until you're forty-five and you don't have it.",
 'things i hate about you\nooh, see that, there.',
 'who needs affection when i have blind hatred?',
 'see that?',
 "who needs affection when i've got blind hatred?",
 "things i hate about you\njust 'cause you're beautiful, that doesn't mean that you can treat people like they don't matter.",
 "just because you're beautiful, doesn't mean you can treat people like they don't matter.",
 "things i hate about you\nyou're asking me out?",
 "that'

In [28]:
word_tokens

['things',
 'i',
 'hate',
 'about',
 'you',
 'who',
 'knocked',
 'up',
 'your',
 'sister',
 '?',
 'who',
 'knocked',
 'up',
 'your',
 'sister',
 '?',
 'things',
 'i',
 'hate',
 'about',
 'you',
 'i',
 'was',
 'watching',
 'you',
 'out',
 'there',
 ',',
 'before',
 '.',
 'i',
 "'ve",
 'never',
 'seen',
 'you',
 'look',
 'so',
 'sexy',
 '.',
 'i',
 'watched',
 'you',
 'out',
 'there',
 'i',
 "'ve",
 'never',
 'seen',
 'you',
 'look',
 'like',
 'that',
 'things',
 'i',
 'hate',
 'about',
 'you',
 'you',
 "'re",
 ',',
 'you',
 'do',
 "n't",
 'know',
 'what',
 'you',
 'want',
 '.',
 'and',
 'you',
 'wo',
 "n't",
 'know',
 'what',
 'you',
 'want',
 "'til",
 'you',
 "'re",
 ',',
 'and',
 'even',
 'if',
 'you',
 'get',
 'it',
 ',',
 'you',
 "'ll",
 'be',
 'too',
 'old',
 'to',
 'use',
 'it',
 '.',
 'you',
 "'re",
 'eighteen',
 '.',
 'you',
 'do',
 "n't",
 'know',
 'what',
 'you',
 'want',
 '.',
 'you',
 'wo',
 "n't",
 'know',
 'until',
 'you',
 "'re",
 'forty-five',
 'and',
 'you',
 'do',
 "n'

#### Functions to convert text to tokens
Next, we define a few functions to normalize and lemmatize our tokens:

In [30]:
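lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    # Reduce each token to its WordNet lemma (tokens are treated as nouns by default)
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that maps every punctuation character to None, i.e. deletes it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lowercase, strip punctuation, split into words, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))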
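
To see what the punctuation table does on its own, here is a small sketch that uses only the standard library. The `strip_punctuation` helper is a name invented for this illustration; it repeats just the lowercase-and-translate part of `LemNormalize`, without tokenizing or lemmatizing.

```python
import string

# Same idea as remove_punct_dict above: map each punctuation character to None
remove_punct_dict = {ord(punct): None for punct in string.punctuation}

def strip_punctuation(text):
    """Lowercase the text and delete every punctuation character."""
    return text.lower().translate(remove_punct_dict)

print(strip_punctuation("You're eighteen. You don't know what you want!"))
# youre eighteen you dont know what you want
print(strip_punctuation("forty-five").split())
# ['fortyfive']
```

Note that apostrophes and hyphens are deleted rather than replaced by spaces, so contractions collapse into single tokens such as 'youre' and 'dont'. That is why forms like these show up in the extra stop words added later in this tutorial.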

#### Initial greeting

Now, let us set up our initial greeting from a defined list:

In [31]:
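GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad you are talking to me!"]

def greeting(sentence):
    # Return a random greeting if any word of the input is a known greeting, else None.
    # Note: the input is split on whitespace, so the two-word entry "what's up" never matches.
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None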
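
Because `greeting` compares whole whitespace-separated words, an input such as "Hey!" is not recognized as a greeting. One possible refinement, shown here as a hypothetical variant named `greeting_tolerant` (it is not part of the original tutorial), strips punctuation from each word before the lookup. The greeting lists are shortened copies so the sketch runs on its own.

```python
import random
import string

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello"]

def greeting_tolerant(sentence):
    """Hypothetical variant of greeting() that ignores punctuation around each word."""
    for word in sentence.lower().split():
        if word.strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None

print(greeting_tolerant("Hey!") in GREETING_RESPONSES)  # True
print(greeting_tolerant("how you doing?"))              # None
```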

#### List of stopwords

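Next, we extend NLTK's default list of English stop words. Because `LemNormalize` strips punctuation and lemmatizes each token, we also add the forms it produces, such as 'dont' (from "don't") and 'wa' (from "was"):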

In [32]:
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list,
# so re-running this cell without re-importing raises an AttributeError.
stopwords = stopwords.words('english')

additional_stop_words = ['arent', 'couldnt', 'didnt', 'doe', 'doesnt', 'dont', 'ha', 'hadnt', 'hasnt', 'havent', 'isnt', 'mightnt', 'mustnt', 'neednt', 'shant', 'shes', 'shouldnt', 'shouldve', 'thatll', 'wa', 'wasnt', 'werent', 'wont', 'wouldnt', 'youd', 'youll', 'youre', 'youve']

stopwords.extend(additional_stop_words)

#### Build our response engine

Our response engine function takes the user's input, vectorizes it together with the corpus sentences using TF-IDF, and uses cosine similarity to pick the closest sentence as the response:

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

In [34]:
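def response(user_response):
    # Add the user's input so it is vectorized together with the corpus sentences.
    # The caller removes it from sent_tokens again after printing the reply.
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words=stopwords)
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the user's input (last row) to every sentence, itself included
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The top match is normally the input itself, so take the second best
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # No corpus sentence shares any weighted term with the input
        return "I am sorry! I don't understand you"
    return sent_tokens[idx]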
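
The `[-2]` indexing in `response` is easy to misread, so here is the same selection logic on a plain list. This is a sketch in pure Python rather than NumPy: `second_best` and the similarity values are invented for illustration, and the last entry stands for the input compared with itself.

```python
def second_best(similarities):
    """Return (index, score) of the second-highest similarity."""
    # Indices ordered from lowest to highest score, like argsort
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    idx = order[-2]
    return idx, similarities[idx]

# Hypothetical scores against three corpus sentences, then the input itself
scores = [0.0, 0.42, 0.17, 1.0]
print(second_best(scores))  # (1, 0.42)

# When nothing in the corpus overlaps, the second-best score is 0,
# which is the case where the chatbot replies that it does not understand
print(second_best([0.0, 0.0, 0.0, 1.0])[1])  # 0.0
```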

#### Run the chatbot

Now we run the chatbot using a loop:

In [38]:
flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response != 'bye':
        if user_response in ('thanks', 'thank you'):
            flag = False
            print("ROBO: You are welcome..")
        else:
            greet = greeting(user_response)
            if greet is not None:
                print("ROBO: " + greet)
            else:
                print("ROBO: ", end="")
                print(response(user_response))
                # Undo the append done inside response()
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")

ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!
Hi
ROBO: hey
how you doing?
ROBO: I am sorry! I don't understand you
I would like to take a walk
ROBO: we would take long walks by the river.
bye
ROBO: Bye! take care..
