Learning how to build a chatbot. Following Tech With Tim (https://www.youtube.com/watch?v=wypVcNIH6D4&list=PLzMcBGfZo4-ndH9FoC4YWHGXG5RZekt-Q)

In [4]:
# import libraries
import nltk
#nltk.download('punkt')
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

import numpy as np
import tflearn
import tensorflow as tf
import random
import json
import pickle

import warnings
warnings.filterwarnings('ignore')

The intent file is formatted as JSON. Here is the format:
{"intents": [
        {"tag": "greeting",
        "patterns": ["Hi"],
        "responses": ["Hello"],
        "context_set": ""
        }
]}
        
The basic structure is as follows:
    - a person types a message
    - the chatbot, through deep learning, tags the message with an intent
    - the chatbot then replies with one of the responses for that tag
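
The last step of that flow can be sketched without any model. This is a minimal sketch, assuming a tiny inline intents dictionary; the `sample` data and the `respond` helper are hypothetical and not part of the tutorial, and the predicted tag is supplied by hand instead of coming from the network.

```python
import json
import random

# Hypothetical inline sample using the same layout as the intents file
sample = json.loads("""
{"intents": [
    {"tag": "greeting",
     "patterns": ["Hi"],
     "responses": ["Hello"],
     "context_set": ""}
]}
""")

def respond(tag, data):
    """Return a random response for the given tag, or None if the tag is unknown."""
    for intent in data["intents"]:
        if intent["tag"] == tag:
            return random.choice(intent["responses"])
    return None

print(respond("greeting", sample))
```

Returning None for an unknown tag is a choice made for this sketch; it keeps the lookup from failing when a tag has no matching intent.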

In [5]:
# read our JSON file
with open('learning_nlp.json') as file:
    data = json.load(file)
    
#print(data)  # just to make sure we correctly loaded/read our file
#print(data['intents'])

## Preprocess Data

In [6]:
# collect every pattern and its tag
words = []   # every token from every pattern
labels = []  # unique intent tags
docs_x = []  # tokenized patterns
docs_y = []  # the tag for each entry in docs_x

for intent in data['intents']:
    for pattern in intent['patterns']:
        # tokenize the pattern (split it into individual words and punctuation)
        # ex. "Is anyone there?" -> ['Is', 'anyone', 'there', '?']
        # stemming, which reduces each word to its root, happens further down
        wrds = nltk.word_tokenize(pattern)
        words.extend(wrds)
        docs_x.append(wrds)
        docs_y.append(intent['tag'])
        
    if intent['tag'] not in labels:
        labels.append(intent['tag'])
            

# stem all words and remove duplicates 
words = [stemmer.stem(w.lower()) for w in words if w != '?']
words = sorted(list(set(words))) # set removes duplicates, list converts set back to list, sorted just sorts the words

labels = sorted(labels)


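To feed the data to a neural network, I need to convert the strings into numbers. To do this, I'll convert each pattern into a "bag of words": a vector with one slot per vocabulary word, set to 1 if the word appears in the pattern and 0 otherwise. The tags are one-hot encoded.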
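
A minimal sketch of that encoding, assuming a hypothetical four-word vocabulary and plain whitespace splitting in place of the NLTK tokenizing and stemming used in the notebook. The `encode` helper is illustrative only.

```python
def encode(sentence, vocab):
    """Binary bag of words: 1 if the vocabulary word occurs in the sentence, else 0."""
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

vocab = ["bye", "hello", "name", "your"]  # hypothetical vocabulary, already sorted
print(encode("Hello what is your name", vocab))  # [0, 1, 1, 1]
```

Unlike a strict one-hot vector, the result can contain several 1s, one for each vocabulary word present. Word order and repeat counts are discarded.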

In [7]:
training = []  # one bag per pattern, a list of 0s and 1s. ex. [0,0,0,1,0,1,0,0] means the 4th and 6th vocabulary words appear
output = []  # one one-hot row per pattern. ex. [0,0,0,1] means the pattern belongs to the 4th tag

out_empty = [0 for _ in range(len(labels))]

for x, doc in enumerate(docs_x):
    bag = []
    
    wrds = [stemmer.stem(w.lower()) for w in doc]  # same lowercasing and stemming as the vocabulary
    
    # mark each vocabulary word as present (1) or absent (0)
    for w in words:
        if w in wrds: 
            bag.append(1)
        else:
            bag.append(0)
            
    output_row = out_empty[:]
    output_row[labels.index(docs_y[x])] = 1  # look through the "label" list. See where the tag is in the list and place a 1

    training.append(bag)
    output.append(output_row)  # note: I first wrote bag.append(output_row) here by mistake
    
    
training = np.array(training)
output = np.array(output)

# saving our data as a pickle file (used for serializing and de-serializing a Python object structure)
with open('data_to_train.pickle', 'wb') as f:
    pickle.dump((words, labels, training, output), f)
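
On later runs the saved tuple can be loaded instead of being rebuilt. A minimal sketch of that pattern, assuming hypothetical helper names `save_data` and `load_data` and small stand-in lists rather than the real arrays:

```python
import os
import pickle
import tempfile

def save_data(path, payload):
    """Serialize the payload to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_data(path):
    """Return the unpickled payload, or None if the file is missing."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None

# Stand-in values; the notebook stores (words, labels, training, output)
payload = (["hello", "nam"], ["greeting", "name"], [[1, 0], [0, 1]], [[1, 0], [0, 1]])
path = os.path.join(tempfile.mkdtemp(), "demo.pickle")

save_data(path, payload)
words_demo, labels_demo, training_demo, output_demo = load_data(path)
print(labels_demo)
```

If the intents file changes, the pickle has to be regenerated, otherwise the cached vocabulary and labels no longer match the JSON. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.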

In [8]:
# verify that training and output are correctly encoded

test = 25

print(training[test])
print(output[test])

print(len(training))
print(len(output))

print(len(training[test]))
print(len(output[test]))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0]
26
26
46
6


In [10]:
tf.reset_default_graph()

# model layers
net = tflearn.input_data(shape=[None, len(training[0])]) # the shape of data
net = tflearn.fully_connected(net, 8) # hidden layer
net = tflearn.fully_connected(net, 8) # hidden layer
net = tflearn.fully_connected(net, len(output[0]), activation='softmax') # probability distribution 
net = tflearn.regression(net)

model = tflearn.DNN(net) # the network
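
The softmax activation on the last layer turns the raw scores into values that are positive and sum to 1, which is why the output can be read as one probability per tag. A minimal pure-Python sketch; the `softmax` helper and the example scores are illustrative and are not tflearn's implementation:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # approximately 1.0
```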

In [11]:
model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)
model.save('model.tflearn')

Training Step: 3999  | total loss: 0.27194 | time: 0.016s
| Adam | epoch: 1000 | loss: 0.27194 - acc: 0.9750 -- iter: 24/26
Training Step: 4000  | total loss: 0.24806 | time: 0.016s
| Adam | epoch: 1000 | loss: 0.24806 - acc: 0.9775 -- iter: 26/26
--


Training accuracy: about 98% (measured on the same 26 patterns the model was trained on)

Next, a function that converts a sentence into a bag of words

In [12]:
def bag_of_words(s, words):  # s is a sentence (string), words is the vocabulary list
    bag = [0 for _ in range(len(words))]  # one slot per vocabulary word, all 0 to start
    
    s_words = nltk.word_tokenize(s)
    s_words = [stemmer.stem(word.lower()) for word in s_words]
    
    for se in s_words:
        for i, w in enumerate(words):
            if w == se:
                bag[i] = 1  # 1 means the word exists in the sentence
                #bag[i].append(1) # error: 'int' object has no attribute 'append' 
    
    # return only after every word has been checked; a return inside the loop stops after the first word
    return np.array(bag)

A function to actually have the user "talk" to the bot

In [23]:
def chat():
    print('Start chatting (type "quit" to stop)')
    while True:
        u_input = input('You: ')
        if u_input.lower() == 'quit':  # to exit the chat
            break
        
        # take input, turn it into a bag of words, feed it to the model, model returns probabilities
        results = model.predict([bag_of_words(u_input, words)])  # even though we only want one prediction,
                                                                 # predict expects a list of entries
                                                                 # and returns a list of results
        #print(results) # the probability the model assigns to each tag
        results_index = np.argmax(results) # index of the highest probability
        tag = labels[results_index]
        #print(tag)
        
        # the code for the actual response
        for tag_int in data['intents']:  # data is our loaded JSON file; loop over its intents
            if tag_int['tag'] == tag:  # found the intent whose tag matches the prediction
                responses = tag_int['responses']  # the list of responses for that intent
        print(random.choice(responses))  # reply with one response picked at random
            

In [14]:
# without argmax
chat()

Start chatting (type "quit" to stop)
You: helo
[[0.05419211 0.30227187 0.5946966  0.00530772 0.03586594 0.00766577]]
You: quite
[[0.05419211 0.30227187 0.5946966  0.00530772 0.03586594 0.00766577]]
You: quit


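As you can see from above, the model returned 6 probabilities for each input, one per tag. 'helo' and 'quite' produced the same vector, most likely because neither word is in the vocabulary, so both inputs become an all-zero bag.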
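
Picking the tag from such a vector is an argmax followed by a lookup in the label list. A small sketch using the probabilities printed above; the label names are placeholders, since the real `labels` list comes from the JSON file:

```python
probs = [0.05419211, 0.30227187, 0.5946966, 0.00530772, 0.03586594, 0.00766577]
labels_demo = ["tag_0", "tag_1", "tag_2", "tag_3", "tag_4", "tag_5"]  # placeholder names

# index of the largest probability; np.argmax gives the same index for a flat list
best = max(range(len(probs)), key=probs.__getitem__)
print(best, labels_demo[best], probs[best])
```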

In [16]:
# with argmax, printing out the tag
chat()

Start chatting (type "quit" to stop)
You: hello
greeting
You: name
name
You: name
name
You: what is your name
greeting
You: quit


From above, we now get an actual label. This only prints the 'tag', not the response yet. The 'tag' is the intent the model thinks the message belongs to.

In [18]:
# with a response code
chat()

Start chatting (type "quit" to stop)
You: hello
Hello!
You: menu
I am 30 years old!
You: goodby
Hope to see speak again!
You: menu
30 years young!
You: name
I'm cupcake bot!
You: quit


chat() now prints a response for the predicted tag. It still needs a bit of work: 'menu' returned the age responses ('I am 30 years old!' and '30 years young!').
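
One possible refinement, offered as a sketch rather than as part of the notebook: answer only when the top probability clears a threshold, and otherwise ask the user to rephrase. The `pick_tag` helper, the 0.7 cutoff, and the label names are all assumptions for illustration.

```python
def pick_tag(probs, labels, threshold=0.7):
    """Return the most likely tag, or None when the model is not confident enough."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None
    return labels[best]

labels_demo = ["age", "goodbye", "greeting"]  # placeholder tags
print(pick_tag([0.05, 0.05, 0.90], labels_demo))  # greeting
print(pick_tag([0.30, 0.35, 0.35], labels_demo))  # None
```

In chat(), a None result could trigger a fallback such as "I didn't get that, try again" instead of a response from the wrong intent. The cutoff would need tuning against real inputs.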