# 10/8 Notebook - Customer Support Chatbot (Part A)

Hello and welcome to this week's notebook! Today, we'll look at how to create our own customizable chatbot. Specifically, we'll create a custom data set, learn how to clean data professionally, and train a chatbot using a bag-of-words model.

**Note: This notebook does NOT require any additional installation of `Keras` and `TensorFlow`. If you want to get some experience with these libraries, check out the other notebook!**

Below are the methods you need to complete for the notebook:
1. Edit `intents.json`
2. `process_words()`
3. `parse_intents()`
4. `build_bag()`
5. `build_training_set()`
6. `test_accuracy()`

We'll start by importing our libraries as always. Make sure you run the cell with `pip install nltk`, which installs the `nltk` library we'll be using.

In [None]:
pip install nltk

In [None]:
# import our nltk libraries
import nltk
from nltk.stem import WordNetLemmatizer
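# install specific downloads
nltk.download('punkt', quiet = True)
# newer nltk releases look for 'punkt_tab' when tokenizing, so we grab it too
nltk.download('punkt_tab', quiet = True)
nltk.download('wordnet', quiet = True)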

In [None]:
# other useful libraries (numpy == 🐐)
import numpy as np
import random
import json

## Part 1: Modify your intents

The great part about this chat bot is that it is fully customizable! Edit `intents.json` to your liking to create your own bot. Make sure that for each `intent`, you fill out the fields `tag`, `patterns`, and `responses`

You can look at my file, `taco-bell-intents.json`, for reference

Once you're done, you can continue to run the cells below!

**Note: if you're having JSON formatting issues in the next cell, use [this link](https://jsonlint.com) to validate your JSON**
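
For reference, here is a minimal sketch of the layout the rest of the notebook expects. The tags, patterns, and responses below are made up for illustration; only the field names and the top-level `intents` key come from this notebook.

```python
import json

# hypothetical example content; swap in your own tags, patterns, and responses
example_json = """
{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi", "Hello there"],
      "responses": ["Hey!", "Welcome!"]
    },
    {
      "tag": "noanswer",
      "patterns": [],
      "responses": ["Sorry, I didn't catch that"]
    }
  ]
}
"""

example_intents = json.loads(example_json)

# every intent needs these three fields
required_fields = {"tag", "patterns", "responses"}
for intent in example_intents["intents"]:
    missing = required_fields - set(intent)
    print(intent["tag"], "is missing:", sorted(missing))
```

If a field is missing from one of your own intents, a check like this names it before the later cells fail with a `KeyError`.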

In [None]:
# open the file in a with-block so it is closed automatically
with open("intents.json") as data_file:
    intents = json.load(data_file)
# when you print, you should see your JSON
print(intents)

## Part 2: Parsing the JSON

We'll practice a common first step in any NLP project: data cleaning

First, complete the function `process_words()` which will clean up our words according to the following steps:
1. Get the tokens using `nltk.word_tokenize()`
2. Set `cleaned_word` equal to the `lemmatized` and `lowercased` word

**Note: Make sure you run the cell immediately below this first; it stores values needed in `process_words()`**

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Set <code>tokens = nltk.word_tokenize(pattern)</code></li>
    <li><code>lemmatizer.lemmatize(...)</code> will lemmatize a word</li>
    <li>The parameter of <code>lemmatizer.lemmatize(...)</code> should be <code>word.lower()</code></li>
</ul>
</p>
</details>

In [None]:
# declare needed variables for process_words()
ignore_punctuation = ["?", "!", ".", ","]
lemmatizer = WordNetLemmatizer()

In [None]:
def process_words(pattern):
    # return variable
    words = []
    # [your code here] - get the tokens using nltk
    tokens = ...
    for word in tokens:
        # check if the word should be ignored
        if word not in ignore_punctuation and word.isalnum():
            # [your code here] - clean the word and add it to the list
            cleaned_word = ...
            words.append(cleaned_word)
    # return the list
    return words

In [None]:
# run this cell to test your code
if (process_words("How was your day today?") == ['how', 'wa', 'your', 'day', 'today']):
    print("Nice work, sport!")
else:
    print("Try again, buddy!")
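
To see what the `if` check inside `process_words()` filters out, here is a small sketch on a hand-written token list. The list is an assumption about what a tokenizer might produce, not actual `nltk` output.

```python
ignore_punctuation = ["?", "!", ".", ","]

# hypothetical tokens, written by hand for illustration
sample_tokens = ["Hi", ",", "what", "'s", "up", "?", "!", "24/7", "taco"]

# keep a token only if it is not listed punctuation and is purely letters/digits
kept = [tok for tok in sample_tokens
        if tok not in ignore_punctuation and tok.isalnum()]
print(kept)
```

Note that `isalnum()` rejects any token containing a non-alphanumeric character, so fragments like `'s` and `24/7` are dropped along with the punctuation marks.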

Now that we have `process_words()` to clean our words, we can parse the data from our JSON

Complete the method `parse_intents()` which does the following:
1. Set the value of `tag` from our `intent`
2. Set `tokenized_words` using the helper function `process_words()`
3. Append a tuple of `tokenized_words` and `tag` to `tag_tokens`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Values in the parsed JSON can be extracted with square-bracket indexing, just like a dictionary</li>
    <li>Let <code>tag = intent["tag"]</code></li>
    <li>Let <code>tokenized_words = process_words(pattern)</code></li>
    <li>For the third step, the tuple can be appended with <code>tag_tokens.append((tokenized_words, tag))</code></li>
</ul>
</p>
</details>

In [None]:
def parse_intents(intents):
    # declare our needed variables
    tags = []
    all_words = []
    tag_tokens = []
    response_dict = dict()
    
    # iterate through each intent
    for intent in intents["intents"]:
        # if the intent has no patterns, we can skip
        if (len(intent["patterns"]) == 0):
            continue
        
        # [your code here] - add the tag to the list of tags
        tag = ...
        tags.append(tag)
        
        # update the dictionary
        response_dict[tag] = intent["responses"]
        
        # iterate through each pattern
        for pattern in intent["patterns"]:
            # [your code here] - create our tokenized words
            tokenized_words = ...
            # add all the tokenized words to our words
            all_words.extend(tokenized_words)
            # [your code here] - adds a tuple -> (list of tokens, tag) -> to the list
            tag_tokens.append(...)
    # return our values in a tuple (dtype=object because the token lists differ in length)
    return (np.array(tags), np.array(all_words), np.array(tag_tokens, dtype=object), response_dict)

We can do this cool trick below to remove all duplicates from our arrays (and sort them)

In [None]:
# call our function
tags, all_words, tag_tokens, tag_responses = parse_intents(intents)
# sort and remove duplicates
tags = np.array(sorted(list(set(tags))))
all_words = np.array(sorted(list(set(all_words))))
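
As a quick illustration of the trick above, here is a sketch on a made-up word list. It also compares the result with `np.unique()`, which returns the sorted unique values in one call.

```python
import numpy as np

# hypothetical word list with duplicates
words = np.array(["taco", "hello", "taco", "bye", "hello"])

# the set drops duplicates, sorted() puts the survivors in order
via_set = np.array(sorted(list(set(words))))
# np.unique does both steps at once
via_unique = np.unique(words)

print(via_set)
print(via_unique)
```

Either approach works here; the `set` version is kept in the notebook because it spells out the two steps.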

Run the cell below and take a quick look to make sure that everything makes sense. It's hard for me to test your code without knowing what's in your JSON, but in general:

- `tags` should contain a list of all your tags in the JSON, excluding tags that have no patterns, such as `noanswer`
- `all_words` should be a list of all the words in your JSON's patterns. There should be no duplicates and no tokens that aren't words (such as punctuation)
- Each entry of `tag_tokens` should hold two values. The first should be the list of tokens for one pattern, and the second should be the tag of that pattern

In [None]:
print("Tags: {0}".format(tags))
print("------")
print("All Words: {0}".format(all_words))
print("------")
print("Tag-Token Mappings: {0}".format(tag_tokens))

## Part 3: Creating a Training Set

We know from previous lessons that the computer can't train a model without numeric values. To solve this, we'll use the `bag of words` technique we discussed in the Google Sheets



Complete the helper method `build_bag()`, which iterates through each `word` in `all_words` and appends 1 to `bag` if the word is in `tokens`, and 0 otherwise

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>The easiest way to do this is by using a simple <code>if else</code> statement</li>
    <li>Recall that <code>A in B</code> will return <code>True</code> if the element A is in the iterable object B, and <code>False</code> otherwise</li>
    <li>If you're feeling really fancy, you can just write <code>bag.append(1 * (word in tokens))</code></li>
</ul>
</p>
</details>

In [None]:
def build_bag(all_words, tokens):
    # reset our current bag
    bag = []
    for word in all_words:
        # [your code here] - append the correct value to the bag
        bag.append(...)
    return bag

In [None]:
# run this cell to test your code
test_all_words = ["edgar", "allan", "poe", "said", "the", "raven", "was", "nevermore"]
test_tokens = ["quote", "the", "raven", "nevermore"]
if (build_bag(test_all_words, test_tokens) == [0, 0, 0, 0, 1, 1, 0, 1]):
    print("You crushed it!")
else:
    print("Ruh roh raggy")
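
Once your loop passes the test, one way to cross-check a bag is with `np.isin()`, which tests every vocabulary word for membership in one call. This is an optional sketch on a made-up vocabulary, not a replacement for the exercise.

```python
import numpy as np

# hypothetical vocabulary and cleaned tokens
vocab = np.array(["burrito", "hello", "taco", "want"])
sentence_tokens = ["i", "want", "a", "taco"]

# True where a vocabulary word appears in the sentence, then cast to 0/1
bag_check = np.isin(vocab, sentence_tokens).astype(int)
print(bag_check)
```

Words that are in the sentence but not in the vocabulary (here `i` and `a`) simply leave no mark on the bag.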

Complete the method `build_training_set()` below, which performs the following steps:
1. Grabs the value of `tokens`, the first (index 0) element of `tag_token`
2. Grabs the value of `tag`, the second (index 1) element of `tag_token`
3. Sets `current_bag` using the helper method `build_bag()`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>You can get the values of <code>tokens</code> and <code>tag</code> with <code>tag_token[X]</code>, where <code>X</code> is 0 or 1, appropriately</li>
    <li>Let <code>current_bag = build_bag(all_words, tokens)</code></li>
</ul>
</p>
</details>

In [None]:
def build_training_set(tags, all_words, tag_tokens):
    # define our variables to return
    train_x = []
    train_y = []
        
    # iterate through each tag-token mapping
    for tag_token in tag_tokens:
        
        # [your code here] - grab our needed values
        tokens = ...
        tag = ...
        
        # [your code here] - build our current bag
        current_bag = ...
            
        # update our training inputs
        train_x.append(current_bag)
        
        # set our outputs equal to 1 in the location
        train_y.append(1 * (tags == tag))
    
    # return our values
    return (np.array(train_x), np.array(train_y))

In [None]:
train_x, train_y = build_training_set(tags, all_words, tag_tokens)
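
The line `train_y.append(1 * (tags == tag))` builds what is usually called a one-hot vector. A small sketch with made-up tags shows what the comparison produces.

```python
import numpy as np

# hypothetical tags, sorted like the real `tags` array
demo_tags = np.array(["goodbye", "greeting", "menu"])

# comparing an array to a string gives True/False per entry; 1 * turns that into 1/0
one_hot = 1 * (demo_tags == "greeting")
print(one_hot)

# np.argmax recovers the position of the 1, and with it the tag
print(demo_tags[np.argmax(one_hot)])
```

This only works because `tags` is a `numpy` array; comparing a plain Python list to a string would give a single `False`.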

Print your `train_x` and `train_y` values in the following cell. It's hard for me to tell if you did everything correctly since you could be using a custom data set. If you have any questions about the program, feel free to message me on Discord!

- `train_x` should have shape `(m, n)`, where `m` = # of total patterns and `n` = # of words in `all_words`
- `train_y` should have shape `(m, k)`, where `m` = # of total patterns and `k` = # of tags in `tags`

In [None]:
print(train_x.shape)
print(train_y.shape)
print("Training Inputs: {0}".format(train_x))
print("-----")
print("Training Outputs: {0}".format(train_y))

Before we continue with training, you may notice that our data is grouped by tag: rows with the same training output sit next to each other. As you may have thought, this ordering can cause some unwanted bias in our model. To fix this, we'll `shuffle` our training set by using `np.random.permutation()` and some clever array indexing:

In [None]:
# shuffled indexes
shuffled_indexes = np.random.permutation(train_x.shape[0])
# set new values for train_x and train_y
train_x = train_x[shuffled_indexes]
train_y = train_y[shuffled_indexes]
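
The key point of the indexing trick is that both arrays are reordered with the same permutation, so every input row keeps its label. A sketch on toy arrays (made up for illustration) makes that visible.

```python
import numpy as np

# toy data: label i belongs to row [i, i]
demo_x = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
demo_y = np.array([0, 1, 2, 3])

# one random ordering of the row indexes, applied to both arrays
order = np.random.permutation(demo_x.shape[0])
shuffled_x = demo_x[order]
shuffled_y = demo_y[order]

print(order)
# the pairs are still intact after shuffling
print(np.all(shuffled_x[:, 0] == shuffled_y))
```

Shuffling `train_x` and `train_y` separately, each with its own permutation, would break this pairing.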

## Part 4: Training Our Model from Scratch (no coding until the end)

We have our cleaned, numeric inputs and outputs (`train_x` and `train_y`), so now what? 

It's time to train our model!

**Note: In this version of the notebook, we'll be using the `numpy` neural network we developed last week. This neural network isn't as sophisticated as one built with Keras/TensorFlow, but you won't have to install any extra packages. If you want some experience working with other packages, check out the other version of the notebook (but know you will need to install Keras/TensorFlow)**

Our model will use the following architecture:

<img src = "./bag_of_words.PNG" style="width:75%;"></img>

We'll copy and paste our nifty helper functions from last week, `sigmoid()` and `sigmoid_derivative()`

In [None]:
def sigmoid(x):
    # calculate the output of the sigmoid function and return it
    sigmoid_val = 1 / (1 + np.exp(-x))
    return sigmoid_val

In [None]:
def sigmoid_derivative(x):
    # calculate the derivative and return the value
    sigmoid_deriv = np.multiply(sigmoid(x), 1 - sigmoid(x))
    return sigmoid_deriv
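
If you want to convince yourself that `sigmoid_derivative()` is right, one option is to compare it against a numerical slope. The sketch below repeats the two helpers so it runs on its own; the test points and step size are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return np.multiply(sigmoid(x), 1 - sigmoid(x))

points = np.array([-2.0, 0.0, 3.0])
step = 1e-5  # small step for a central finite difference

# slope estimated from two nearby function values
numeric = (sigmoid(points + step) - sigmoid(points - step)) / (2 * step)
# slope from the formula sigmoid(x) * (1 - sigmoid(x))
analytic = sigmoid_derivative(points)

# largest disagreement between the two; should be tiny
print(np.max(np.abs(numeric - analytic)))
```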

Recall that our neural networks are broken down into two methods, defined below:
1. `forward_prop()`
2. `back_prop()`

`forward_prop()` makes our predictions for a set of `thetas`, while `back_prop()` makes adjustments to these `thetas`

We will combine these functions in `train_model()`, but first let's take a look at `forward_prop()`

In [None]:
def forward_prop(inputs, thetas, m):
    # declare the values we need
    outputs = []
    curr_inputs = inputs
    ones_col = np.ones((m, 1))
    for theta in thetas:
        # format the inputs by adding the column of ones
        formatted_inputs = np.hstack((ones_col, curr_inputs))
        # calculate the predicted value, and append it to the list of outputs
        pred_val = formatted_inputs @ theta.T
        outputs.append(pred_val)
        # set curr_inputs to the sigmoid of our predicted value
        curr_inputs = sigmoid(pred_val)
    # return our list of outputs
    return outputs
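
It can help to trace the array shapes through the two layers by hand. The sketch below uses small, made-up sizes and random values; it mirrors the steps in `forward_prop()` but is written out layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: 5 patterns, 8 vocabulary words, 4 hidden nodes, 3 tags
m, n_words, n_hidden, n_tags = 5, 8, 4, 3
demo_inputs = rng.integers(0, 2, size=(m, n_words))
demo_theta_one = rng.random((n_hidden, n_words + 1)) - 0.5
demo_theta_two = rng.random((n_tags, n_hidden + 1)) - 0.5

ones_col = np.ones((m, 1))

# layer 1: (m, n_words + 1) @ (n_words + 1, n_hidden) -> (m, n_hidden)
z_one = np.hstack((ones_col, demo_inputs)) @ demo_theta_one.T
a_one = 1 / (1 + np.exp(-z_one))

# layer 2: (m, n_hidden + 1) @ (n_hidden + 1, n_tags) -> (m, n_tags)
z_two = np.hstack((ones_col, a_one)) @ demo_theta_two.T

print(z_one.shape, z_two.shape)
```

The extra `+ 1` in each theta matches the column of ones, which plays the role of the bias term.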

Next, we'll define `back_prop()`

In [None]:
def back_prop(y_predictions, y_actual, inputs, thetas, m, num_classifications):
    # sets the constant for ones_col
    ones_col = np.ones((m, 1))
    
    # calculates the "difference" for the final layer
    diff3 = sigmoid(y_predictions[-1]) - y_actual
    
    # calculates the "difference" for the penultimate layer
    diff2_unadjusted = diff3 @ thetas[1][:, 1:]
    diff2 = np.multiply(diff2_unadjusted, sigmoid_derivative(y_predictions[0]))
    
    # formats the partial derivatives
    format_partial_one = np.hstack((ones_col, np.asarray(inputs)))
    format_partial_two = np.hstack((ones_col, np.asarray(sigmoid(y_predictions[0]))))
    
    # calculates the unadjusted partial derivatives
    delta_one = diff2.T @ format_partial_one
    delta_two = diff3.T @ format_partial_two
    
    # returns our partial derivatives divided by m (to scale)
    return [delta_one / m, delta_two / m]

Finally, we'll combine them in `train_model()`

In [None]:
def train_model(thetas, inputs, actual_outputs, num_iterations, learning_rate, sample_size, num_classifications):
    for iteration in range(num_iterations):
        # calculate the outputs for the iteration
        outputs = forward_prop(inputs, thetas, sample_size)
        # calculate the gradients for the iteration
        gradients = back_prop(outputs, actual_outputs, inputs, thetas, sample_size, num_classifications)
        # adjust both of our thetas, taking a small step towards the minimum
        thetas[0] = thetas[0] - learning_rate * gradients[0]
        thetas[1] = thetas[1] - learning_rate * gradients[1]
    return thetas
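
The update rule inside the loop is plain gradient descent. A one-variable sketch shows the same idea on a toy cost, `(theta - 3) ** 2`, which is an assumption chosen because its minimum is easy to see; the names below are illustrative only.

```python
def gradient(theta):
    # derivative of the toy cost (theta - 3) ** 2
    return 2 * (theta - 3)

theta = 0.0
demo_learning_rate = 0.1

for _ in range(100):
    # same update as in train_model(): step against the gradient
    theta = theta - demo_learning_rate * gradient(theta)

# theta ends up very close to 3, the minimum of the toy cost
print(theta)
```

In `train_model()` the same step is applied to every entry of both `thetas` at once.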

Before we can train our model, we need to define our `constants`

In [None]:
num_iterations = 1000
learning_rate = 0.5
sample_size = train_x.shape[0]
num_classifications = train_y.shape[1]

As well as our `thetas`

In [None]:
num_hidden_nodes = 64
theta_one = np.random.random((num_hidden_nodes, train_x.shape[1] + 1)) - 0.5
theta_two = np.random.random((num_classifications, num_hidden_nodes + 1)) - 0.5
thetas = [theta_one, theta_two]

Now, it's finally time to `train` our model! 🤞

In [None]:
thetas = train_model(thetas, train_x, train_y, num_iterations, learning_rate, sample_size, num_classifications)

Before we get to make the chatbot GUI, test the model, and other fun stuff (unfortunately next week 😢), we need to test the `accuracy` of our model

Recall the formula for `accuracy`:

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;accuracy&space;=&space;\frac{n_{correct}}{n_{total}}" title="accuracy = \frac{n_{correct}}{n_{total}}" />

Complete the function `test_accuracy()`, which completes the following steps:

1. Calculate `max_inputs` and `max_outputs` by using `np.argmax()` and setting the `axis` parameter equal to 1
2. Calculate `num_correct`
3. Calculate `accuracy`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Use <code>np.argmax(A, axis = 1)</code> for <code>max_inputs</code> and <code>max_outputs</code>, where <code>A</code> is the array you want to find the max of</li>
    <li><code>num_correct</code> can be found by using <code>np.sum()</code> with the values where <code>max_inputs == max_outputs</code></li>
    <li>Use the formula to calculate <code>accuracy</code>!</li>
</ul>
</p>
</details>

In [None]:
def test_accuracy(x, y, thetas):
    # get our outputs by forward propagating
    outputs = sigmoid(forward_prop(x, thetas, x.shape[0])[-1])
    
    # [your code here] - find our max inputs and max outputs
    max_inputs = ...
    max_outputs = ...
    
    # [your code here] - calculate the number the model predicted correctly
    num_correct = ...
    
    # [your code here] - calculate and return the accuracy
    accuracy = ...
    return accuracy

Run the next cell to print out the accuracy of the model. Since we're testing on the same small data set we trained on, it should be at or close to 100%!

In [None]:
accuracy = test_accuracy(train_x, train_y, thetas)
print("Accuracy: {0}%".format(accuracy * 100))
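
If your number looks off, it can help to walk through the accuracy formula on a tiny example. The outputs and labels below are made up, and the sketch uses `np.mean()` as a shortcut for dividing the number of matches by the total.

```python
import numpy as np

# hypothetical model outputs for 4 samples and 3 tags
demo_outputs = np.array([[0.9, 0.1, 0.2],
                         [0.2, 0.7, 0.1],
                         [0.3, 0.4, 0.6],
                         [0.8, 0.3, 0.1]])
# hypothetical one-hot labels for the same 4 samples
demo_labels = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 1, 0],
                        [1, 0, 0]])

# index of the largest value in each row
predicted = np.argmax(demo_outputs, axis=1)
expected = np.argmax(demo_labels, axis=1)

# fraction of rows where the prediction matches the label
demo_accuracy = np.mean(predicted == expected)
print(demo_accuracy)
```

Here the third sample is predicted as tag 2 but labelled as tag 1, so 3 of the 4 predictions match.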

Since the exciting part of the project is having your chatbot make predictions, I'll be extra kind and give you a sneak preview of next week

(I know, I know, this code is really ugly, but I did it to try and deter people from moving too far ahead)

In [None]:
user_input = "What should I eat?"
random.choice(tag_responses.get(tags[np.argmax(sigmoid(forward_prop(np.array([build_bag(all_words, process_words(user_input))]), thetas, 1)[-1]))]))