# Chatbot : webscraping & human interaction

As someone in Data Science with experience in human-in-the-loop and real-time human measures, designing and programming a human interaction chatbot sounded interesting.  Without following a known chatbot structure, I tried to be creative and come up with my own logical structure.  

I looked at a few Medium blogs and decided to first build a "self-learning retrieval-based chatbot" rather than a rule-based chatbot or a self-learning generative chatbot.

As the image below shows, I thought of using an architecture similar to the query (q), key (k), value (v) scheme used in Question & Answer deep learning transformers for the "searching Knowledge logic".  In this practice session, I only built the "when" branch: I looked for sentences likely to contain "when" information such as numbers, dates, months, and times.  I think this simple architecture could work well, especially if a Q&A transformer analyzed the retrieved sentences instead of cosine similarity.

Check out GitHub for the supporting Python subfunctions: https://github.com/j622amilah/Chatbot

<img src="chatbot.png" alt="Drawing" style="width: 500px;"/>

In [1]:
%load_ext autoreload 
%autoreload 2

In [2]:
import numpy as np

import nltk

# Personal python functions
import sys
sys.path.insert(1, 'C:\\Users\\jamilah\\Documents\\Subfunctions_python')

from findall import *
from dictionary.sort_dict_by_value import *
from string_text_processing.text_url_2_senANDwordtokens import *
from string_text_processing.preprocessing import *
from string_text_processing.get_word_count_uniquewords import *
from string_text_processing.remove_chars_from_wordtoken import *
from string_text_processing.get_cossine_similarity import *

# Get Knowledge base for chatbot by Webscraping

In [25]:
inputurl = "https://en.wikipedia.org/wiki/Chatbot"
name_txtfile = 'chatbot_wikipedia'
sen, word_tok = text_url_2_senANDwordtokens(inputurl, name_txtfile)

There are 209 sentences


In [14]:
inputurl = "https://en.wikipedia.org/wiki/Q%26A_software"
name_txtfile = 'QampA_wikipedia'
sen, word_tok = text_url_2_senANDwordtokens(inputurl, name_txtfile)

There are 82 sentences


In [182]:
inputurl = "https://preply.com/en/blog/22-useful-english-greetings-for-every-day/"
name_txtfile = 'how2sayHello'
sen, word_tok = text_url_2_senANDwordtokens(inputurl, name_txtfile)

There are 1493 sentences


In [10]:
inputurl = "https://regardsurlefrancais.com/2017/09/19/les-salutations-en-francais/"
name_txtfile = 'commentDitBonjour'
sen, word_tok = text_url_2_senANDwordtokens(inputurl, name_txtfile)

There are 1591 sentences


In [15]:
sen[0]

['b',
 'QampA',
 'software',
 'is',
 'software>online',
 'that',
 'attempts',
 'to',
 'answer',
 'questions',
 'asked',
 'by',
 'users',
 'QampA',
 'stands',
 'for',
 'question',
 'and']
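
The helper `text_url_2_senANDwordtokens` lives in the GitHub repository and is not shown here. As a rough idea of what such a step involves, here is a minimal sketch using only the standard library. The names `html_to_sentence_tokens` and `_TextExtractor` are hypothetical, and the sentence split is deliberately naive; it is not the author's implementation.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_sentence_tokens(html):
    """Return (sentences, word_tokens) from an HTML string (hypothetical helper).

    sentences is a list of word-token lists; word_tokens is the flat list.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for raw in raw_sentences:
        tokens = re.findall(r"[A-Za-z0-9]+", raw)
        if tokens:
            sentences.append(tokens)
    word_tokens = [tok for sent in sentences for tok in sent]
    return sentences, word_tokens


demo_html = (
    "<html><body><p>A chatbot is software. It was coined in 1994!</p>"
    "<script>var x = 1;</script></body></html>"
)
demo_sen, demo_word_tok = html_to_sentence_tokens(demo_html)
print("There are", len(demo_sen), "sentences")
```

For a live page, the HTML string could come from `urllib.request.urlopen(inputurl).read().decode("utf-8")`. The stray `'b'` token at the start of `sen[0]` would be consistent with converting the raw bytes with `str()` instead of decoding them, although that is only a guess about the original helper.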

# Retrieval Based chatbot

## Logic and Parser Rules

Give the chatbot text, and it will behave based on the text.
If you want it to say 'greetings', give it text on 'greetings' and 'manners'.

The texts may need to be kept separate: based on text A the chatbot would say AAA, and based on text B it would say BBB.  It could then decide whether AAA or BBB is the better response, perhaps based on similarity.

## The chatbot tells us what it knows: 

In [26]:
# Get the theme of the knowledge base
word_tokens2 = preprocessing(word_tok)

list_to_remove = ['b', "their", "based", "which", 'would', 'https']
wc, keywords, mat_sort = get_word_count_uniquewords(word_tokens2, list_to_remove)

print("Hello, I am a chatbot and I can talk about the following topics : ", keywords[0:10])
#print("Top 10 wc : ", wc[0:10])

There are 2570 word tokens, but 651 words are unique.
[['128' 'chatbot']
 ['34' 'health']
 ['30' 'servic']
 ...
 ['1' 'classif']
 ['1' 'classifi']
 ['1' 'impact']]
Hello, I am a chatbot and I can talk about the following topics :  ['chatbot', 'health', 'servic', 'custom', 'convers', 'compani', 'human', 'provid', 'gener', 'inform']
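
The keyword step boils down to counting word frequencies after dropping uninformative words. A minimal sketch of that idea with `collections.Counter` follows. The function `top_keywords` and its `min_length` parameter are hypothetical, and unlike the output above it does no stemming (so it would report `service` rather than `servic`).

```python
from collections import Counter


def top_keywords(word_tokens, words_to_remove, min_length=4, n=10):
    """Return the n most frequent lowercase tokens (hypothetical helper).

    Tokens shorter than min_length or listed in words_to_remove are ignored.
    """
    removed = {w.lower() for w in words_to_remove}
    cleaned = [
        w.lower()
        for w in word_tokens
        if len(w) >= min_length and w.lower() not in removed
    ]
    return Counter(cleaned).most_common(n)


demo_tokens = ["Chatbot", "health", "chatbot", "their", "is", "health", "chatbot", "service"]
demo_counts = top_keywords(demo_tokens, ["their", "which"], n=3)
print(demo_counts)
```

The length filter is a crude stand-in for a stop-word list; a proper stop-word list plus stemming would get closer to the keywords printed above.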


## Let's ask the chatbot a question about the topics it knows:

In [34]:
# Get user input about the topics it can talk about
inp = input("Type a question to the chatbot: ")

# Show what the user typed
print(inp)

# Hard-code the question so the remaining cells are reproducible
inp = 'when were chatbots first made'
#inp = 'when did chatbots become popular'
#inp = 'when was qampa first made'

Type a question to the chatbot: when were chatbots first made
when were chatbots first made


In [30]:
# Determine which sentences are related to the main themes and {who, what, when, where, why}

# who = name, he, she, Mr, Mrs, Ms
# what = keywords
# when = number!  ---> search for months, dates, time, year, when was
# where = location
# why = because, due to the fact

# verbs ?? = action

# word token input
inp_wt0 = inp.split()
print('inp_wt0: ', inp_wt0)

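# Stem the input text (import PorterStemmer explicitly so the cell does not rely on a star import)
from nltk.stem import PorterStemmer
ps = PorterStemmer()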
inp_wt = []
for w in inp_wt0:
    inp_wt.append(ps.stem(w))
print('inp_wt: ', inp_wt)

# Simple framework for setting up a Q&A transformer model: Retrieval of key sentences in the knowledge data source
q_type = []
for i in inp_wt:
    if i in ['who']:  # tokens are single words, so a phrase like 'name of' can never match here
        q_type.append('who')
        # Need a function that detects names
        list_to_search = ['he', 'she', 'they', 'Mr', 'Mrs', 'Ms']
        # Search knowledge for sentences that contain names, pronouns (he, she, they, Mr, Mrs, Ms)
    elif i in ['what']:
        q_type.append('what')
        # Search knowledge for sentences that contain knowledge keywords similar to user keywords
    elif i in ['when']:
        q_type.append('when')  # ONLY DOING 'when' for the moment
    elif i in ['where']:
        q_type.append('where')
    elif i in ['why']:
        q_type.append('why')
q_type        

inp_wt0:  ['when', 'were', 'chatbots', 'first', 'made']
inp_wt:  ['when', 'were', 'chatbot', 'first', 'made']


['when']
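
The if/elif chain above can also be written as a lookup table, which makes it easier to add cue words later. This is a minimal sketch under that assumption; `QUESTION_CUES` and `detect_question_types` are hypothetical names, and the extra cues `whom` and `whose` are illustrative additions, not part of the notebook.

```python
# Map each question type to the single-word cues that trigger it (illustrative cues)
QUESTION_CUES = {
    "who": {"who", "whom", "whose"},
    "what": {"what"},
    "when": {"when"},
    "where": {"where"},
    "why": {"why"},
}


def detect_question_types(question):
    """Return the question types found in the text, in order of first appearance."""
    found = []
    for token in question.lower().split():
        token = token.strip("?.,!")  # drop punctuation stuck to the word
        for q_type, cues in QUESTION_CUES.items():
            if token in cues and q_type not in found:
                found.append(q_type)
    return found


demo_q_type = detect_question_types("When were chatbots first made?")
print(demo_q_type)
```

Unlike the cell above, this version does not append the same type twice when a cue word is repeated in the question.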

In [31]:
# Based on the Question Type: find the relevant sentences in the Knowledge Source that could Answer the Question Type
for i in q_type:
    
    if i == 'when':
        # -----------------------------------------
        # What to do if "when" is in the input:
        # -----------------------------------------
        out = []
        for tok in word_tok:  # 'tok' avoids shadowing the outer loop variable 'i'
            # Search to see if there are '.', '%' or '$' at the end of the word
            tok = remove_chars_from_wordtoken(tok, '.', '')
            tok = remove_chars_from_wordtoken(tok, '%', '')
            tok = remove_chars_from_wordtoken(tok, '$', '')

            # First determine if a number is a ratio
            tok = remove_chars_from_wordtoken(tok, '/', ' divided by ')

            # Second determine if a number is present
            # (isdecimal() guards int() against punctuation-only tokens such as '-')
            if tok.isdecimal():
                out.append(int(tok))

        imp_nums = np.unique(out)
        imp_nums_str = [str(n) for n in imp_nums]
        
        # Search knowledge for sentences that contain numbers and text referring to time 
        # -----------------------------------------
        # 1) Get the sentences that contain numbers
        sen_with_nums = []
        for ind, i in enumerate(sen):
            # Returns the unique values in 1st array NOT in 2nd array
            vals_from_long_not_in_short = np.setdiff1d(imp_nums_str, i)
            out = np.setdiff1d(imp_nums_str, vals_from_long_not_in_short)
            if any(out):
                sen_with_nums.append(ind)
        # -----------------------------------------
        
        # -----------------------------------------
        # 2) Get the sentences that contain time-related words
        list_to_search = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December', 'year', 'month', 'week', 'day', 'hour', 'minute', 'second', 'ms', 'millisecond']
        sen_with_timetext = []
        for ind, i in enumerate(sen):
            # Returns the unique values in 1st array NOT in 2nd array
            vals_from_long_not_in_short = np.setdiff1d(list_to_search, i)
            out = np.setdiff1d(list_to_search, vals_from_long_not_in_short)
            if any(out):
                sen_with_timetext.append(ind)
        
        # -----------------------------------------
        
print('sen_with_nums: ', sen_with_nums)
print('sen_with_timetext: ', sen_with_timetext)

# Combine 'relevant sentence' lists
sen_with_nums += sen_with_timetext
relv_sen_unique = np.unique(sen_with_nums)
print('relv_sen_unique: ', relv_sen_unique)

sen_with_nums:  [1, 3, 5, 26, 30, 31, 35, 51, 53, 62, 63, 64, 65, 66, 68, 74, 75, 76, 77, 79, 80, 82, 83, 109, 110, 112, 113, 114, 115, 116, 117, 118, 119, 130, 131, 134, 143, 177, 180, 208]
sen_with_timetext:  [63, 65, 66, 68, 80, 83, 118, 119, 120, 122, 208]
relv_sen_unique:  [  1   3   5  26  30  31  35  51  53  62  63  64  65  66  68  74  75  76
  77  79  80  82  83 109 110 112 113 114 115 116 117 118 119 120 122 130
 131 134 143 177 180 208]
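
The two `np.setdiff1d` calls in each loop amount to a set intersection between a sentence and a list of cue tokens. A minimal pure-Python sketch of the same "when" retrieval is shown here. `find_when_sentences` is a hypothetical name, and checking for digits directly in each sentence (rather than first collecting `imp_nums_str` from the whole corpus) is a simplification, not the notebook's exact procedure.

```python
# Case-sensitive cue words, copied from list_to_search in the cell above
TIME_WORDS = {
    "January", "February", "March", "April", "May", "June", "July", "August",
    "September", "October", "November", "December", "year", "month", "week",
    "day", "hour", "minute", "second", "ms", "millisecond",
}


def find_when_sentences(sentences):
    """Return (with_nums, with_timetext, relevant) sentence indices."""
    with_nums, with_timetext = [], []
    for ind, tokens in enumerate(sentences):
        cleaned = {t.strip(".,%$") for t in tokens}
        if any(t.isdecimal() for t in cleaned):
            with_nums.append(ind)
        if cleaned & TIME_WORDS:  # non-empty intersection means a cue word is present
            with_timetext.append(ind)
    relevant = sorted(set(with_nums) | set(with_timetext))
    return with_nums, with_timetext, relevant


demo_sentences = [
    ["The", "term", "was", "coined", "in", "1994"],
    ["Chatbots", "answer", "questions"],
    ["Usage", "grew", "every", "year"],
    ["In", "March", "2016", "bots", "launched"],
]
demo_nums, demo_time, demo_relevant = find_when_sentences(demo_sentences)
print(demo_relevant)
```

Keeping the match case-sensitive, as in the original list, avoids confusing the month "May" with the verb "may".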


In [32]:
# Search for common words in the input with each of 'relevant sentences'
cos_sim_all = get_cossine_similarity(inp_wt, relv_sen_unique, sen)
cos_sim_all = make_a_properlist(cos_sim_all)

# 1) Get a Percentage of how many words of 'input' are used in sen
dict_val = {}
for ind, rels in enumerate(relv_sen_unique):
    tot = np.zeros((len(inp_wt),1))
    for i in range(len(inp_wt)):
        for j in range(len(sen[rels])):
            if inp_wt[i] == sen[rels][j]:
                tot[i] = 1
                break

    per = sum(np.ravel(tot))/len(inp_wt) * 100
    
    # Percentage and cosine similarity
    dict_val[rels] = [per, cos_sim_all[ind], len(sen[rels])] 

# Sort by max
dict_val_sorted = sort_dict_by_value(dict_val, reverse = True)
dict_val_sorted

{1: [40.0, 0.6324555320336759, 59],
 115: [40.0, 0.6324555320336759, 29],
 114: [40.0, 0.6324555320336759, 28],
 119: [40.0, 0.6324555320336759, 20],
 118: [40.0, 0.6324555320336759, 19],
 63: [40.0, 0.6324555320336759, 17],
 65: [40.0, 0.6324555320336759, 17],
 3: [20.0, 0.4472135954999579, 83],
 130: [20.0, 0.4472135954999579, 63],
 5: [20.0, 0.4472135954999579, 34],
 30: [20.0, 0.4472135954999579, 31],
 35: [20.0, 0.4472135954999579, 31],
 134: [20.0, 0.4472135954999579, 29],
 143: [20.0, 0.4472135954999579, 29],
 131: [20.0, 0.4472135954999579, 25],
 117: [20.0, 0.4472135954999579, 24],
 116: [20.0, 0.4472135954999579, 23],
 122: [20.0, 0.4472135954999579, 16],
 120: [20.0, 0.4472135954999579, 15],
 26: [0.0, nan, 11],
 31: [0.0, nan, 10],
 51: [0.0, nan, 26],
 53: [0.0, nan, 26],
 62: [0.0, nan, 11],
 64: [0.0, nan, 12],
 66: [0.0, nan, 14],
 68: [0.0, nan, 15],
 74: [0.0, nan, 8],
 75: [0.0, nan, 15],
 76: [0.0, nan, 9],
 77: [0.0, nan, 15],
 79: [0.0, nan, 18],
 80: [0.0, nan, 2
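
The scores above are consistent with a cosine similarity between two binary vectors defined over the input words: the input vector is all ones, and the sentence vector marks which input words appear in the sentence. With $k$ matches out of $n$ input words this gives $k / (\sqrt{n}\sqrt{k}) = \sqrt{k/n}$, so 2 of 5 gives about 0.632, 1 of 5 gives about 0.447, and 0 matches gives 0/0, which NumPy reports as `nan`. Whether `get_cossine_similarity` computes it exactly this way is an assumption; the sketch below, with the hypothetical name `overlap_scores`, only reproduces the numbers.

```python
import math


def overlap_scores(input_tokens, sentence_tokens):
    """Return (percent_matched, cosine_similarity) over the input vocabulary."""
    if not input_tokens:
        return 0.0, float("nan")
    sentence_set = set(sentence_tokens)
    input_vec = [1] * len(input_tokens)  # every input word is present in the input
    sent_vec = [1 if w in sentence_set else 0 for w in input_tokens]

    percent = sum(sent_vec) / len(input_tokens) * 100
    dot = sum(a * b for a, b in zip(input_vec, sent_vec))
    norm_input = math.sqrt(sum(a * a for a in input_vec))
    norm_sentence = math.sqrt(sum(b * b for b in sent_vec))
    if norm_sentence == 0:
        cosine = float("nan")  # no shared words: the similarity is undefined
    else:
        cosine = dot / (norm_input * norm_sentence)
    return percent, cosine


demo_input = ["when", "were", "chatbot", "first", "made"]
demo_sentence = ["the", "first", "chatbot", "appeared"]
demo_percent, demo_cosine = overlap_scores(demo_input, demo_sentence)
print(demo_percent, demo_cosine)
```

Under this reading the percentage and the cosine similarity carry the same information, since one is a monotonic function of the other, which may explain why the sentence length ends up acting as the tie-breaker.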

In [33]:
# The longest sentence among those with the highest similarity appears to be the best response
chatbot_output = sen[list(dict_val_sorted.keys())[0]]


vals_from_long_not_in_short = np.setdiff1d(imp_nums_str, chatbot_output)
out = np.setdiff1d(imp_nums_str, vals_from_long_not_in_short)
if any(out):
    print('number_val', out)

print('chatbot_output : ', chatbot_output)

number_val ['1994']
chatbot_output :  ['Designed', 'to', 'convincingly', 'simulate', 'the', 'way', 'a', 'human', 'would', 'behave', 'as', 'a', 'conversational', 'partner', 'chatbot', 'systems', 'typically', 'require', 'continuous', 'tuning', 'and', 'testing', 'and', 'many', 'in', 'production', 'remain', 'unable', 'to', 'adequately', 'converse', 'while', 'none', 'of', 'them', 'can', 'pass', 'the', 'standard', 'test>Turing', 'The', 'term', 'ChatterBot', 'was', 'originally', 'coined', 'by', 'Loren', 'Mauldin>Michael', 'creator', 'of', 'the', 'first', 'in', '1994', 'to', 'describe', 'these', 'conversational']
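
Picking the first key of the sorted dictionary relies on sorting the `[percentage, similarity, length]` lists in descending order. Because comparisons with `nan` are always false, rows with `nan` can end up in an arbitrary order, which would be consistent with the unsorted lengths at the bottom of the table above. A minimal sketch of a ranking that treats `nan` as zero is given here; `rank_sentences` is a hypothetical name, and the demo scores are a subset of the values printed above.

```python
import math


def rank_sentences(scores):
    """Rank sentence indices best-first from {index: [percent, cosine, length]}."""

    def sort_key(item):
        _, (percent, cosine, length) = item
        if math.isnan(cosine):
            cosine = 0.0  # treat an undefined similarity as no similarity
        return (percent, cosine, length)

    ordered = sorted(scores.items(), key=sort_key, reverse=True)
    return [index for index, _ in ordered]


demo_scores = {
    26: [0.0, float("nan"), 11],
    63: [40.0, 0.6324555320336759, 17],
    1: [40.0, 0.6324555320336759, 59],
    3: [20.0, 0.4472135954999579, 83],
    51: [0.0, float("nan"), 26],
}
demo_ranking = rank_sentences(demo_scores)
print(demo_ranking)
```

Sentence 1 still comes out on top, matching the chatbot's answer, and the zero-match sentences are now ordered by length as well.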


I asked some simple questions, so I think I got lucky!

I think it could be improved by adding a Q&A transformer step that analyzes the 'relevant sentences'.  Also, I was a little lazy about separating the sentences cleanly when building them from the webscraped words.  If the sentences were separated properly, it would be interesting to see how it performs!