# 10/8 Notebook - Customer Support Chatbot (Part A)

Hello and welcome to this week's notebook! Today, we'll be looking at how to create our own, customizable chat bot. Specifically, we'll be creating a custom data set, learning how to professionally clean data, and training a chat bot using a bag-of-words model

**Note: This notebook requires the additional installation of `Keras` and `Tensorflow`. If you don't want to install these libraries or are having trouble, try the other notebook!**

Below are the methods you need to complete for the notebook:
1. Edit `intents.json`
2. `process_words()`
3. `parse_intents()`
4. `build_bag()`
5. `build_training_set()`

We'll start by importing our libraries as always. Make sure you run the cell with `pip install nltk`, which will let you download the `nltk` library we'll be using

In [None]:
pip install nltk

In [None]:
# import our nltk libraries
import nltk
from nltk.stem import WordNetLemmatizer
# install specific downloads
nltk.download('punkt', quiet = True)
nltk.download('wordnet', quiet = True)

In [None]:
# other useful libraries (numpy == 🐐)
import numpy as np
import random
import json

## Part 1: Modify your intents

The great part about this chat bot is that it is fully customizable! Edit `intents.json` to your liking to create your own bot. Make sure that for each `intent`, you fill out the fields `tag`, `patterns`, and `responses`

You can look at my file, `taco-bell-intents.json`, for reference

Once you're done, you can continue to run the cells below!

**Note: if you're having JSON formatting issues in the next cell, use [this link](https://jsonlint.com) to validate your JSON**
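
For reference, here is a minimal sketch of the structure the later cells expect: a top-level `intents` list whose entries each have a `tag`, `patterns`, and `responses`. The tags and sentences are made-up placeholders (they are not the contents of `taco-bell-intents.json`), and `example_json` / `example_intents` are names I chose just for this illustration.

```python
import json

# Hypothetical example content; replace the tags, patterns, and responses with your own.
example_json = """
{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi there", "Hello"],
      "responses": ["Hey!", "Good to see you!"]
    },
    {
      "tag": "noanswer",
      "patterns": [],
      "responses": ["Sorry, I didn't catch that."]
    }
  ]
}
"""

example_intents = json.loads(example_json)
for intent in example_intents["intents"]:
    # each intent is a dictionary with the three required fields
    print(intent["tag"], len(intent["patterns"]), len(intent["responses"]))
```

If `json.loads` raises an error on your own file, the usual culprits are a trailing comma or a missing quotation mark.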

In [None]:
# open the file in a with-block so it is closed automatically
with open("intents.json") as data_file:
    intents = json.load(data_file)
# when you print, you should see your JSON
print(intents)

## Part 2: Parsing the JSON

We'll practice a common first step in any NLP project, data cleaning

First, complete the function `process_words()` which will clean up our words according to the following steps:
1. Get the tokens using `nltk.word_tokenize()`
2. Set `cleaned_word` equal to the `lemmatized` and `lowercased` word

**Note: Make sure you run the cell immediately below this first; it stores values needed in `process_words()`**

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Set <code>tokens = nltk.word_tokenize(pattern)</code></li>
    <li><code>lemmatizer.lemmatize(...)</code> will lemmatize a word</li>
    <li>The parameter of <code>lemmatizer.lemmatize(...)</code> should be <code>word.lower()</code></li>  
</ul>
</p>
</details>
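
To get a feel for what the cleaning steps do, here is a rough sketch that tokenizes, drops punctuation, and lowercases a sentence. It is only an approximation: a regular expression stands in for `nltk.word_tokenize()`, lemmatization is left out entirely, and `simple_clean` and `punctuation` are names made up for this example.

```python
import re

# Stand-in for the notebook's ignore_punctuation list.
punctuation = ["?", "!", ".", ","]

def simple_clean(sentence):
    # Split into runs of word characters and single punctuation marks
    # (a rough regex substitute for nltk.word_tokenize, for illustration only).
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # Drop punctuation and lowercase what remains; lemmatization is skipped here.
    return [tok.lower() for tok in tokens if tok not in punctuation and tok.isalnum()]

print(simple_clean("How was your day today?"))
```

The real `process_words()` also lemmatizes each word, which is why the test cell below expects `'wa'` instead of `'was'`.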

In [None]:
# declare needed variables for process_words()
ignore_punctuation = ["?", "!", ".", ","]
lemmatizer = WordNetLemmatizer()

In [None]:
def process_words(pattern):
    # return variable
    words = []
    # [your code here] - get the tokens using nltk
    tokens = ...
    for word in tokens:
        # check if the word should be ignored
        if word not in ignore_punctuation and word.isalnum():
            # [your code here] - clean the word and add it to the list
            cleaned_word = ...
            words.append(cleaned_word)
    # return the list
    return words

In [None]:
# run this cell to test your code
if (process_words("How was your day today?") == ['how', 'wa', 'your', 'day', 'today']):
    print("Nice work, sport!")
else:
    print("Try again, buddy!")

Now that we have `process_words()` to clean our words, we can parse the data from our JSON

Complete the method `parse_intents()` which does the following:
1. Set the value of `tag` from our `intent`
2. Set `tokenized_words` using the helper method `process_words()`
3. Append a tuple of `tokenized_words` and `tag` to `tag_tokens`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Once loaded, the JSON is a Python dictionary, so values can be extracted by key with square brackets</li>
    <li>Let <code>tag = intent["tag"]</code></li>
    <li>Let <code>tokenized_words = process_words(pattern)</code></li>
    <li>For the third step, the tuple can be appended with <code>tag_tokens.append((tokenized_words, tag))</code></li>
</ul>
</p>
</details>
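
The loop structure can be sketched on toy data like this. Everything here is hypothetical: `toy_intents` is invented for the example, and a plain whitespace split stands in for `process_words()`.

```python
# Hypothetical toy data in the same shape as intents.json.
toy_intents = {
    "intents": [
        {"tag": "greeting", "patterns": ["hi there", "hello"], "responses": ["Hey!"]},
        {"tag": "goodbye", "patterns": ["see you"], "responses": ["Bye!"]},
    ]
}

pairs = []
for intent in toy_intents["intents"]:
    tag = intent["tag"]  # dictionary-style lookup by key
    for pattern in intent["patterns"]:
        tokens = pattern.split()  # whitespace split stands in for process_words()
        pairs.append((tokens, tag))  # one (list of tokens, tag) tuple per pattern

print(pairs)
```

Notice that an intent with two patterns produces two tuples, both carrying the same tag.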

In [None]:
def parse_intents(intents):
    # declare our needed variables
    tags = []
    all_words = []
    tag_tokens = []
    response_dict = dict()
    
    # iterate through each intent
    for intent in intents["intents"]:
        # if the intent has no patterns, we can skip
        if (len(intent["patterns"]) == 0):
            continue
        
        # [your code here] - add the tag to the list of tags
        tag = ...
        tags.append(tag)
        
        # update the dictionary
        response_dict[tag] = intent["responses"]
        
        # iterate through each pattern
        for pattern in intent["patterns"]:
            # [your code here] - create our tokenized words
            tokenized_words = ...
            # add all the tokenized words to our words
            all_words.extend(tokenized_words)
            # [your code here] - adds a tuple -> (list of tokens, tag) -> to the list
            tag_tokens.append(...)
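    # return our values in a tuple
    # (dtype=object keeps the variable-length token lists intact on newer NumPy versions)
    return (np.array(tags), np.array(all_words), np.array(tag_tokens, dtype=object), response_dict)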

We can do this cool trick below to remove all duplicates from our arrays (and sort them)
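
Here is the same trick on a small made-up list. A `set` throws away duplicates but has no order, and `sorted()` turns it back into an alphabetized list (it accepts any iterable, so the extra `list()` call is optional).

```python
words = ["taco", "burrito", "taco", "salsa", "burrito"]  # hypothetical example words

# set() removes duplicates, sorted() returns a new alphabetized list
unique_sorted = sorted(set(words))
print(unique_sorted)
```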

In [None]:
# call our function
tags, all_words, tag_tokens, tag_responses = parse_intents(intents)
# sort and remove duplicates
tags = np.array(sorted(list(set(tags))))
all_words = np.array(sorted(list(set(all_words))))

Run the cell below and take a quick look to make sure that everything makes sense. It's hard for me to test your code without knowing what's in your JSON, but in general:

- `tags` should contain a list of all your tags in the JSON, excluding any intent with no patterns (such as `noanswer`)
- `all_words` should be a list of all the words in your JSON's patterns. There should be no duplicates or tokens that aren't words (such as punctuation)
- Each entry of `tag_tokens` should have two values. The first should be the list of cleaned tokens from one pattern, and the second should be the tag of that pattern

In [None]:
print("Tags: {0}".format(tags))
print("------")
print("All Words: {0}".format(all_words))
print("------")
print("Tag-Token Mappings: {0}".format(tag_tokens))

## Part 3: Creating a Training Set

We know from previous lessons that the computer can't train a model without numeric values. To solve this, we'll use the `bag of words` technique we discussed in the Google Sheets



Complete the helper method `build_bag()` which iterates through each `word` in `all_words`, and appends 1 to `bag` if the word is in `tokens`, and 0 otherwise

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>The easiest way to do this is by using a simple <code>if else</code> statement</li>
    <li>Recall that <code>A in B</code> will return <code>True</code> if the element A is in the iterable object B, and <code>False</code> otherwise</li>
    <li>If you're feeling really fancy, you can just write <code>bag.append(1 * (word in tokens))</code></li>
</ul>
</p>
</details>
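
As a small illustration of the encoding itself, the sketch below turns one tokenized sentence into a vector of 0s and 1s over a fixed vocabulary. The `vocabulary` and `sentence_tokens` values are invented for the example.

```python
vocabulary = ["day", "how", "today", "wa", "your"]  # plays the role of all_words
sentence_tokens = ["how", "wa", "today"]             # plays the role of tokens

# `word in sentence_tokens` is True or False; multiplying by 1 turns it into 1 or 0.
example_bag = [1 * (word in sentence_tokens) for word in vocabulary]
print(example_bag)
```

The vector always has one slot per vocabulary word, no matter how long the sentence is, which is what lets us feed it to a model with a fixed input size.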

In [None]:
def build_bag(all_words, tokens):
    # reset our current bag
    bag = []
    for word in all_words:
        # [your code here] - append the appropriate value to the bag
        ...
    return bag

In [None]:
# run this cell to test your code
test_all_words = ["edgar", "allan", "poe", "said", "the", "raven", "was", "nevermore"]
test_tokens = ["quote", "the", "raven", "nevermore"]
if (build_bag(test_all_words, test_tokens) == [0, 0, 0, 0, 1, 1, 0, 1]):
    print("You crushed it!")
else:
    print("Ruh roh raggy")

Complete the method `build_training_set()` below, which performs the following steps:
1. Grabs the value of `tokens`, the first (index 0) element of `tag_token`
2. Grabs the value of `tag`, the second (index 1) element of `tag_token`
3. Sets `current_bag` using the helper method `build_bag()`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>You can get the values of <code>tokens</code> and <code>tag</code> with <code>tag_token[X]</code>, where <code>X</code> is 0 or 1, appropriately</li>
    <li>Let <code>current_bag = build_bag(all_words, tokens)</code></li>
</ul>
</p>
</details>
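
The function you are about to complete also builds its outputs with the expression `1 * (tags == tag)`. Here is a short sketch of what that does when `tags` is a NumPy array; the tag names are hypothetical.

```python
import numpy as np

example_tags = np.array(["goodbye", "greeting", "thanks"])  # hypothetical tag names
example_tag = "greeting"

mask = (example_tags == example_tag)  # elementwise comparison -> array of booleans
one_hot = 1 * mask                    # booleans become 0s and 1s
print(one_hot)
```

The result has a single 1 in the position of the matching tag, which is often called a one-hot vector.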

In [None]:
def build_training_set(tags, all_words, tag_tokens):
    # define our variables to return
    train_x = []
    train_y = []
        
    # iterate through each tag-token mapping
    for tag_token in tag_tokens:
        
        # [your code here] - grab our needed values
        tokens = ...
        tag = ...
        
        # [your code here] - reset our current bag
        current_bag = ...
            
        # update our training inputs
        train_x.append(current_bag)
        
        # set our outputs equal to 1 in the location
        train_y.append(1 * (tags == tag))
    
    # return our values
    return (np.array(train_x), np.array(train_y))

In [None]:
train_x, train_y = build_training_set(tags, all_words, tag_tokens)

Print your `train_x` and `train_y` values in the following cell. It's hard for me to tell if you did everything correctly since you could be using a custom data set. If you have any questions about the program, feel free to message me on Discord!

- `train_x` should be dimension `(m, n)` where `m` = # of total patterns and `n` = # words in `all_words`
- `train_y` should be dimension `(m, k)` where `m` = # of total patterns and `k` = # tags in `tags`

In [None]:
print(train_x.shape)
print(train_y.shape)
print("Training Inputs: {0}".format(train_x))
print("-----")
print("Training Outputs: {0}".format(train_y))

Before we continue with training, you may notice that our data is grouped by tag: rows with the same training output sit right next to each other. As you may have thought, this ordering can cause some unwanted bias in our model. To fix this, we'll `shuffle` our training set by using `np.random.permutation()` and some clever array indexing:
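
The key point is that one shuffled list of indexes is applied to both arrays, so every input stays paired with its output. A minimal sketch on hypothetical toy arrays:

```python
import numpy as np

# Hypothetical toy data: row i of x belongs with entry i of y.
x = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
y = np.array([0, 1, 2, 3])

order = np.random.permutation(x.shape[0])  # a random ordering of 0..3
x_shuffled = x[order]
y_shuffled = y[order]

# Each input row is still paired with its original label.
print(x_shuffled[:, 0], y_shuffled)
```

Shuffling `x` and `y` separately (with two different permutations) would scramble the pairing and ruin the training set.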

In [None]:
# shuffled indexes
shuffled_indexes = np.random.permutation(train_x.shape[0])
# set new values for train_x and train_y
train_x = train_x[shuffled_indexes]
train_y = train_y[shuffled_indexes]

## Part 4: Training Our Model Using Keras/Tensorflow (no coding)

We have our cleaned, numeric inputs and outputs (`train_x` and `train_y`), so now what? 

It's time to train our model!

**Note: In this version of the notebook, we'll be using `Tensorflow` and `Keras`. I have some instructions below on how to set this up. If you're still having trouble, switch over to the other notebook as there's no installation required**

1. Open `Anaconda Prompt`
2. `conda install pip`
3. `pip install --upgrade tensorflow`
4. `pip install Keras`
5. `conda create -n mnist tensorflow keras`
6. `conda activate mnist`
7. `conda install jupyter`
8. `conda list` - verify that you see jupyter, numpy, keras, and tensorflow
9. run `jupyter notebook` and open this file again

Hopefully, your installation worked without too much trouble. If you can run the next cell without any errors, you should be good to go! As always, if you have any questions you can message me on Discord

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD

If the installation worked properly, you may see a message that reads `Using TensorFlow backend.` (newer versions of Keras may not print anything)

Now, we'll use `Keras` to create a `Sequential` model. This library makes it very easy for us to create neural networks, like the fully connected one we're building here

Our model will use the following architecture:

<img src = "./bag_of_words.PNG" style="width:75%;"></img>

In [None]:
# declare our model
model = Sequential()
# add our layers
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

You may see a lot of unfamiliar terms in this model, so I'll do my best to define what the above cell does:
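- `model.add(Dense(...))` adds a fully connected layer of neurons to our neural network. The number is how many neurons the layer has, and `input_shape` (first layer only) tells the model how many values each input has, which for us is the length of one bag of words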
- `Dropout(0.5)` adds `regularization` to our model, something we haven't talked about yet. Basically, `regularization` decreases the likelihood of the model overfitting our data. Overfitting occurs when our model can predict our training set very well, but does poorly with new data
- `activation = 'relu'` changes the activation function. Before, we were using `sigmoid`, but `relu` is another popular function. You can read more about it [here](https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning)
- In many neural networks, the final activation function is `softmax`, which converts the raw scores of the last layer into probabilities that add up to 1. You can read more about it [here](https://en.wikipedia.org/wiki/Softmax_function)
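
If you'd like to see the two activation functions as plain math, here is a small sketch written without any libraries. It is only meant to show the formulas; Keras has its own built-in versions.

```python
import math

def relu(values):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return [max(0.0, v) for v in values]

def softmax(values):
    # Subtracting the max keeps exp() from overflowing without changing the result.
    shifted = [v - max(values) for v in values]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

print(relu([-2.0, 0.5, 3.0]))
probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities, sum(probabilities))
```

The largest raw score gets the largest probability, and the probabilities always sum to 1.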

Next, we'll create an optimizer using `stochastic gradient descent`. The algorithm we were using in earlier weeks was `batch gradient descent`. The main difference between the two is how much data goes into each update: `batch gradient descent` computes the gradient over the entire data set before updating the weights once, while `stochastic gradient descent` updates the weights after each entry (or, in practice, each small batch of entries)
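
To make the difference concrete, here is a sketch on a made-up one-weight problem (learning `w` so that `y` is about `w * x`). The data, learning rate, and number of passes are all arbitrary choices for illustration, and the stochastic version visits the entries in a fixed order to keep the example simple (real implementations usually shuffle).

```python
# Hypothetical toy problem: learn w so that y is roughly w * x (true w is 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
learning_rate = 0.01

def gradient(w, x, y):
    # Derivative of the squared error (w * x - y) ** 2 with respect to w.
    return 2 * (w * x - y) * x

# Batch gradient descent: one update per pass, using the average gradient.
w_batch = 0.0
for _ in range(200):
    avg = sum(gradient(w_batch, x, y) for x, y in zip(xs, ys)) / len(xs)
    w_batch -= learning_rate * avg

# Stochastic gradient descent: one update per entry.
w_sgd = 0.0
for _ in range(200):
    for x, y in zip(xs, ys):
        w_sgd -= learning_rate * gradient(w_sgd, x, y)

print(w_batch, w_sgd)
```

Both versions end up near 2, but the stochastic loop makes four updates per pass over the data instead of one.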

The parameters `lr`, `decay`, `momentum`, and `nesterov` adjust how fast our model will train. `lr` is the learning rate, and a nonzero `decay` shrinks it a little after every update, so our model will train more slowly over time
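
Here is a rough sketch of how a decaying learning rate behaves. I'm assuming a time-based schedule of the form `lr / (1 + decay * updates)`, which is my understanding of how older Keras optimizers applied `decay`; treat the exact formula as an assumption, and note that `decayed_lr` is a helper invented for this example.

```python
initial_lr = 0.01
decay = 1e-6

def decayed_lr(updates):
    # Assumed schedule: the rate shrinks as the number of weight updates grows.
    return initial_lr / (1 + decay * updates)

for updates in [0, 1_000, 100_000, 1_000_000]:
    print(updates, decayed_lr(updates))
```

With a decay this small, the learning rate barely changes over a short training run, so most of the speed comes from `lr` and `momentum`.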

We set our `loss` function to [categorical_crossentropy](https://gombru.github.io/2018/05/23/cross_entropy_loss/), our `optimizer` to `stochastic gradient descent`, and tell the model to print out the `accuracy` during each iteration
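
For intuition, categorical cross-entropy on a single example boils down to the negative log of the probability the model gave to the correct tag. The sketch below computes it by hand; the probability values are made up, and Keras computes this for us during training.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    # Only the entry where y_true is 1 contributes:
    # the loss is -log(probability assigned to the correct tag).
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_true = [0, 1, 0]            # one-hot label, like a row of train_y
confident = [0.1, 0.8, 0.1]   # hypothetical softmax outputs
unsure = [0.4, 0.3, 0.3]

print(categorical_crossentropy(y_true, confident))
print(categorical_crossentropy(y_true, unsure))
```

The confident, correct prediction gets a much smaller loss than the unsure one, which is exactly what we want the optimizer to push toward.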

In [None]:
sgd = SGD(lr = 0.01, decay = 1e-6, momentum = 0.9, nesterov = True)
model.compile(loss = 'categorical_crossentropy', optimizer = sgd, metrics = ['accuracy'])

`Keras` makes it very easy to train our model. We can use `model.fit()` to accomplish this. Some notes about the parameters:
- `epochs` is the number of full passes over the training set (our number of iterations from before)
- `batch_size` is how many training examples the model uses for each weight update
- setting `verbose` to 1 just displays a progress bar
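
To see how `epochs` and `batch_size` interact, here is a quick bit of arithmetic. The number of patterns is a hypothetical value; yours depends on your `intents.json`.

```python
import math

num_patterns = 23   # hypothetical number of rows in train_x
batch_size = 5
epochs = 500

# each epoch is split into batches; the last batch may be smaller than the rest
updates_per_epoch = math.ceil(num_patterns / batch_size)
total_updates = epochs * updates_per_epoch
print(updates_per_epoch, total_updates)
```

A smaller `batch_size` means more (and noisier) weight updates per epoch, while a larger one means fewer, smoother updates.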

Run the cell below to visualize the training of our model!

In [None]:
hist = model.fit(train_x, train_y, epochs = 500, batch_size = 5, verbose = 1)

Since the exciting part of the project is having your chatbot make predictions, I'll be extra kind and give you a sneak preview of next week

(I know I know this code is really ugly but I did it to try and deter people from trying to move too far ahead)

In [None]:
user_input = "Give me a fun fact"
random.choice(tag_responses.get(tags[np.argmax(model.predict(np.array([build_bag(all_words, process_words(user_input))]))[0])]))