# Building intelligent bots. Rule-based chatbots

This section covers the major topics related to rule-based chatbots. One of the most important is the set of similarity methods for words and sentences. The section also contains a short use case of regular expressions for sentence parsing and shows the differences between LIKE and full-text search in SQL databases. Another major topic is the set of NLP methods that make sentence comparison easier. Finally, we build a simple example of questions and answers that can be used by chatbots.

## Word and sentence similarity

A word can have several meanings, which makes comparison and analysis more complex. Below we list the meanings (synsets) of a single word together with their definitions.

In [None]:
from textblob import Word

w = Word("developer")

for synset, definition in zip(w.get_synsets(), w.define()):
    print(synset, definition)

## Regular expressions

Regular expressions are commonly used for string manipulation and sit behind many high-level string manipulation methods. Below we show a use case: cleaning strings before they are inserted into a database with SQL queries. We use the resulting table later for the LIKE keyword and full-text search.

In [None]:
import sqlite3, csv, re

# load the dataset into an FTS5 virtual table (used later for full-text search)
conn = sqlite3.connect('oreilly.sqlite')
conn.execute("CREATE VIRTUAL TABLE books USING fts5(title,description);")
cur = conn.cursor()

with open('oreilly.csv', 'r', encoding='utf-8', errors='ignore') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')  # do not shadow the csv module
    for row in reader:
        # strip double quotes from the text fields with a regular expression
        title = re.sub('"', '', row[1])
        description = re.sub('"', '', row[4])
        # placeholders let sqlite3 handle the quoting of the values
        cur.execute('INSERT INTO books(title,description) VALUES(?, ?);', (title, description))
conn.commit()

cur.execute("SELECT title FROM books LIMIT 0,10;")
print(cur.fetchall())
#cur.execute("DROP TABLE books;")
conn.close()
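
As a small illustration of the kind of cleanup regular expressions are used for here, the sketch below strips double quotes and collapses repeated whitespace in a text field. The helper name `clean_field` and the sample string are made up for this example; they are not part of the dataset code above.

```python
import re

def clean_field(text):
    """Remove double quotes and collapse runs of whitespace (hypothetical helper)."""
    without_quotes = re.sub(r'"', '', text)
    # replace every run of spaces, tabs or newlines with a single space
    return re.sub(r'\s+', ' ', without_quotes).strip()

sample = 'Learning   "Python"\n for  data analysis '
print(clean_field(sample))
```

The same two substitutions could be applied to every CSV field before it is stored.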

## Similarity measures

There are plenty of methods to measure the similarity of strings. We show examples from two popular Python libraries and compare two strings: training and trains. The SequenceMatcher class from difflib allows us to use the Gestalt pattern matching algorithm:

In [None]:
from difflib import SequenceMatcher
a = "training"
b = "trains"
print(len(a))
print(len(b))
ratio = SequenceMatcher(None, a, b).ratio()
print(ratio)

The ratio is a normalized similarity value between 0 and 1, where 1 means that the strings are identical.
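
To make the ratio less of a black box, the sketch below recomputes it by hand. It relies on the formula documented for difflib, 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The variable names are chosen for this example only.

```python
from difflib import SequenceMatcher

a = "training"
b = "trains"

matcher = SequenceMatcher(None, a, b)
# each block describes a run of characters shared by both strings
blocks = matcher.get_matching_blocks()
matched = sum(block.size for block in blocks)  # M: number of matched characters
total = len(a) + len(b)                        # T: combined length of both strings

manual_ratio = 2 * matched / total
print(matched, total)
print(manual_ratio, matcher.ratio())
```

For these two strings the only shared block is "train", so M = 5, T = 14 and the ratio is 10/14, roughly 0.71.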

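A different approach is shown below. We use the Jellyfish library, which offers several methods. One of them is the Levenshtein distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. Below we calculate the distance and the normalized distance. Unlike the ratio above, 0 means that the strings are identical.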

In [None]:
import jellyfish
distance = jellyfish.levenshtein_distance(a,b)
print(distance)

normalized_distance = distance/max(len(a),len(b))
print(normalized_distance)
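
For readers who want to see what happens inside such a function, here is a minimal pure-Python sketch of the Levenshtein distance using dynamic programming. It is an illustration written for this section, not the Jellyfish implementation, and the function name `levenshtein` is our own.

```python
def levenshtein(s, t):
    """Edit distance via dynamic programming, keeping only one row at a time."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

a, b = "training", "trains"
distance = levenshtein(a, b)
normalized = distance / max(len(a), len(b))
similarity = 1 - normalized  # turn the distance into a similarity score
print(distance, normalized, similarity)
```

Subtracting the normalized distance from 1 gives a value that can be compared with the SequenceMatcher ratio, although the two measures are defined differently and will not agree in general.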

## SQLite LIKE vs. Full-text search

It is unusual to show SQL together with NLP methods for string comparison, but two features are worth mentioning. In the first cell, we connect to the SQLite database that we created earlier in this notebook.

In [None]:
conn = sqlite3.connect('oreilly.sqlite')

cur = conn.cursor()

The LIKE keyword needs the % wildcard, which matches any sequence of characters, if we want to allow any text before or after the phrase that we are looking for.

In [None]:
cur.execute("SELECT title FROM books WHERE description LIKE '%software%';")
print(cur.fetchall())
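
The effect of the wildcard position can be tried out on a small in-memory table. The table contents and the helper `titles_like` below are invented for this sketch and are unrelated to the O'Reilly dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT);")
conn.executemany(
    "INSERT INTO books(title) VALUES (?);",
    [("Software Architecture",), ("Learning Python",),
     ("Free Software Basics",), ("Hardware Hacks",)],
)

def titles_like(pattern):
    """Return the titles matching a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT title FROM books WHERE title LIKE ? ORDER BY title;", (pattern,)
    )
    return [title for (title,) in rows]

print(titles_like("software%"))   # titles that start with "software"
print(titles_like("%software%"))  # titles that contain "software" anywhere
print(titles_like("%ware%"))      # plain substring match, so "Hardware" qualifies too
```

Note that LIKE in SQLite is case-insensitive for ASCII characters by default, and that it matches substrings rather than words.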

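Full-text search is a bit more intelligent, as it calculates a score that reflects how well each row matches the text we are looking for. There are plenty of such metrics; SQLite provides bm25. In SQLite a better match gets a numerically lower bm25 value, so we sort the results in ascending order. When the table is created with a stemming tokenizer such as porter, different forms of a term, like train, trains and training, lead to similar results; the default tokenizer used for the books table matches whole words only.

In [None]:
cur.execute("SELECT title,bm25(books) FROM books WHERE books MATCH 'software' ORDER BY bm25(books);")
print(cur.fetchall())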
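
The difference between LIKE and a stemmed full-text search can be sketched on a tiny in-memory table. This example assumes that the SQLite build behind Python's sqlite3 module includes FTS5; the table name `notes` and its rows are invented, and the porter tokenizer is enabled here even though the books table above uses the default one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# assumption: FTS5 is available; the porter tokenizer stems words before indexing
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body, tokenize='porter');")
conn.executemany(
    "INSERT INTO notes(body) VALUES (?);",
    [("A new training about chatbots",),
     ("He trains every day",),
     ("Cooking for beginners",)],
)

# LIKE only finds the literal substring
like_rows = conn.execute(
    "SELECT body FROM notes WHERE body LIKE '%training%';"
).fetchall()

# MATCH compares stems; lower bm25 values mean better matches
match_rows = conn.execute(
    "SELECT body, bm25(notes) FROM notes WHERE notes MATCH 'training' ORDER BY bm25(notes);"
).fetchall()

print(like_rows)
print(match_rows)
```

LIKE returns only the row with the exact substring, while MATCH also returns the row that contains "trains", because both words are reduced to the same stem.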

## NLP methods

Three NLP methods that are helpful when building a rule-based chatbot are:
- tokenization,
- lemmatization,
- stemming.

### Tokenization

Tokenization works similarly to the split method from the regular expression library, but there is at least one small difference. First, we split a sentence on whitespace:

In [None]:
example = "A training example."
import re

pattern = "\\s+"
words = re.split(pattern, example)
print(words)

NLTK is one of the best-known libraries for NLP. Like many other such libraries, it contains a tokenizer that not only splits the sentence, but also treats punctuation marks as separate tokens:

In [None]:
import nltk

tokens = nltk.word_tokenize(example)
print("Tokens: " + str(tokens))
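
The difference can be reproduced with the standard library alone. The pattern below is a rough approximation of a tokenizer, written for this sketch; it is not the algorithm NLTK uses and will handle cases such as contractions differently.

```python
import re

example = "A training example."

# whitespace split: the full stop stays glued to the last word
split_words = re.split(r"\s+", example)
print(split_words)

# rough tokenizer: runs of word characters, or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", example)
print(rough_tokens)
```

With the split, "example." and "example" would be treated as two different words in a later comparison, which is why separating punctuation matters.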

### Lemmatization

Lemmatization is the process of reducing a word to its base (dictionary) form, the lemma.

In [None]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize('do',pos='v'))
print(wordnet_lemmatizer.lemmatize('does',pos='v'))
print(wordnet_lemmatizer.lemmatize('doing',pos='v'))

The word is reduced to its base form depending on its part of speech, which is passed in the pos argument. The available WordNet constants are:
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
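
To see why the part of speech matters, consider a toy lemmatizer backed by a hand-written lookup table. Everything in this sketch (the lexicon, its entries and the function `toy_lemmatize`) is hypothetical and only mimics the idea; WordNet works with a far larger lexicon and additional rules.

```python
# hand-written toy lexicon, keyed by (word, part of speech); purely illustrative
TOY_LEXICON = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("does", "v"): "do",
    ("doing", "v"): "do",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma from the toy lexicon, or the word itself if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", pos="n"))  # treated as a noun
print(toy_lemmatize("meeting", pos="v"))  # treated as a verb
print(toy_lemmatize("doing", pos="v"))
print(toy_lemmatize("doing"))             # noun lookup fails, word is returned unchanged
```

The same surface form can map to different lemmas, so a chatbot that lemmatizes with the wrong part of speech may miss an otherwise obvious match.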

### Stemming

Stemming is similar to lemmatization, but the main difference is that it just reduces the word to its root by stripping affixes. In many cases it gives different results than lemmatization, because lemmatization takes the part of speech into account. The result of stemming is not always a valid word.

In [None]:
from nltk import PorterStemmer, LancasterStemmer, word_tokenize

sample = "This is a new training about machine learning usage for chatbots"

tokens = word_tokenize(sample)

porter = PorterStemmer()
p_stem = [porter.stem(t) for t in tokens]
print(p_stem)

lancaster = LancasterStemmer()
l_stem = [lancaster.stem(t) for t in tokens]
print(l_stem)
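
The idea behind stemming can be shown with a deliberately naive suffix stripper. This is not the Porter or Lancaster algorithm; the suffix list and the function `naive_stem` are assumptions made for this sketch, and real stemmers apply many more rules.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, far smaller than real stemmer rule sets

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

sample = "This is a new training about machine learning usage for chatbots"
stems = [naive_stem(token) for token in sample.split()]
print(stems)
```

Even this tiny rule set maps "training" to "train" and "chatbots" to "chatbot", but it also turns "This" into "thi", which shows that stems do not have to be dictionary words.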

## Rule-based chatbot

In this section we implement two examples of chatbots. The first one follows a simple scenario: it goes through all of its questions and responds with an answer. The second chatbot compares the customer's question with a list of known questions and gives the appropriate response based on text similarity.

### Simple scenario chatbot - Greg is your stock market advisor

This chatbot is a stock market advisor with a list of questions. The answers to these questions allow the chatbot to give the stock value for a given date.

In [None]:
welcome = "Hi! I'm Greg, your stock market advisor."

questions = (
    "What stock exchange would you like to check? Please provide the stock exchange symbol.",
    "What stock from the stock exchange would you like to check? Please use the stock symbol.",
    "What date are you interested in?",
    "Should I print the maximum, minimum, opening or closing value? Choose one."
            )

We use [Alpha Vantage](https://www.alphavantage.co/support/#api-key) to get the daily stock values. You need to obtain an API key first and paste it into the API_KEY variable below. We simplify the date handling and choose a date randomly.

In [None]:
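import requests
import random

API_KEY = "YOUR_API_KEY"  # paste your Alpha Vantage API key here
URL = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol="

# map the user's choice to the field names of the daily time series
VALUE_KEYS = {"opening": "1. open", "maximum": "2. high", "minimum": "3. low", "closing": "4. close"}

def get_stock_value(stock_request):
    # stock_request holds the answers: [exchange, symbol, date, value type]
    if stock_request[0].upper() != "NYSE":
        return "Stock exchange not supported"

    response = requests.get(URL + stock_request[1] + "&apikey=" + API_KEY)
    stock_data = response.json()["Time Series (Daily)"]
    # simplification: ignore the requested date and pick a random available one
    date = random.choice(list(stock_data.keys()))
    # fall back to the daily high if the value type is not recognized
    value_key = VALUE_KEYS.get(stock_request[3].lower(), "2. high")
    answer = "The stock " + stock_request[3] + " for " + stock_request[1] + " on " + date + " is " + stock_data[date][value_key]
    return answer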

The main part of the chatbot is written in just a few lines. We loop over the questions and use the collected answers to get the stock details.

In [None]:
def run_chatbot():
    print(welcome)
    answers = []
    # ask every question in order and collect the answers
    for question in questions:
        print(question)
        answers.append(input())
    print(get_stock_value(answers))

run_chatbot()

### Rule-based customer support chatbot

In this case we also need to set up a welcome message and a list of questions. This time the questions are potential customer questions. We also define an answer for each question; the answer at a given position belongs to the question at the same position.

In [None]:
welcome = "Hi! I'm Arthur, the customer support chatbot. How can I help you?"

questions = (
    "The app is freezing after I click the run button",
    "I don't know how to proceed with the invoice",
    "I get an error when I try to install the app",
    "It crashes after I have updated it",
    "I cannot log in to the app",
    "I'm not able to download it"
            )

answers = (
        "You need to clean up the cache. Please go to ...",
        "Please go to Settings, next Subscriptions and there is the Billing section",
        "Could you please send the log files placed in ... to ...",
        "Please restart your PC",
        "Use the forgot password button to set up a new password",
        "Probably you have an ad blocker plugin installed and it blocks the popup with the download link"
            )

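Most customer questions will not match the ones on our list exactly, but they can be similar. Let's define a function that measures the similarity and returns the answer for the most similar known question, provided that the similarity exceeds a threshold.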

In [None]:
from difflib import SequenceMatcher

similarity_threshold = 0.2

def get_highest_similarity(customer_question):
    max_similarity = 0
    highest_index = 0
    # compare the customer question with every known question
    for question_id, known_question in enumerate(questions):
        similarity = SequenceMatcher(None, customer_question, known_question).ratio()
        #print(similarity)
        if similarity > max_similarity:
            highest_index = question_id
            max_similarity = similarity
    # answer only if the best match is similar enough
    if max_similarity > similarity_threshold:
        return answers[highest_index]
    else:
        return "The issue has been saved. We will contact you soon."
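
The matching step can be tried in isolation, without the interactive loop. The sketch below is self-contained: it uses a shortened, made-up list `known_questions` and a helper `best_match` that also returns the score, which makes it easier to judge whether the threshold of 0.2 is a sensible choice.

```python
from difflib import SequenceMatcher

known_questions = (
    "The app is freezing after I click the run button",
    "I cannot log in to the app",
    "I'm not able to download it",
)

def best_match(customer_question, threshold=0.2):
    """Return (index, score) of the most similar known question, or (None, score)."""
    scores = [SequenceMatcher(None, customer_question, q).ratio() for q in known_questions]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    if scores[best_index] > threshold:
        return best_index, scores[best_index]
    return None, scores[best_index]

print(best_match("I can't login to the app"))  # close to the second known question
print(best_match("qqqq"))                      # shares no characters with any question
```

A low threshold means that almost any input is matched to some answer, so printing the scores for a few realistic questions is a quick way to tune it.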

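The main part is just a few lines of code. To see the similarity for each known question, uncomment the print call in the function above. The conversation ends when the customer types "thank you".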

In [None]:
def run_chatbot():
    print(welcome)
    while True:
        question = input()
        # stop the conversation when the customer says thank you
        if question.strip().lower() == "thank you":
            break
        print(get_highest_similarity(question))

run_chatbot()

## EXERCISE 1: Build a rule-based chatbot.

There is a list of questions below. Use different comparison methods to figure out which one gives the best results and why. Compare the methods used above with at least one of the following:
- SQL full text search,
- normalized Levenshtein distance,
- tokenization and lemmatization and count the number of words.

In [None]:
similarity_threshold = 0.2

def get_highest_similarity(customer_question):
    max_similarity = 0
    highest_index = 0
    for question_id in range(len(questions)):
        # put your code here
        similarity = 0 #
        #print(similarity)
        if similarity > max_similarity:
            highest_index = question_id
            max_similarity = similarity
    if max_similarity > similarity_threshold:
        return answers[highest_index]
    else:
        return "The issue has been saved. We will contact you soon."

Test your solution:

In [None]:
def run_chatbot():
    print(welcome)
    while True:
        question = input()
        # stop the conversation when the customer says thank you
        if question.strip().lower() == "thank you":
            break
        print(get_highest_similarity(question))

run_chatbot()