# Intent Classification with Keras
In my previous notebooks, my goal was to obtain labeled data for my chatbot. This notebook focuses on using Keras to classify the intents of new, unseen text that a user might type. The approach switches to supervised learning, now that the labels have been generated by the unsupervised learning in the previous notebook.

### Rasa Comparison
Rasa trains this intent classification step with an SVM and GridSearchCV so that it can try different configurations ([source](https://medium.com/bhavaniravi/intent-classification-demystifying-rasanlu-part-4-685fc02f5c1d)). When deploying, the preprocessing pipeline should remain the same between training and testing.
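As a rough illustration of the SVM plus grid search idea, here is a minimal scikit-learn sketch. It is not Rasa's actual code: the toy utterances, the TF-IDF features and the small parameter grid are all assumptions made for the example. Wrapping the vectorizer and the classifier in one `Pipeline` is one way to keep preprocessing identical at training and prediction time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy utterances invented purely for illustration (not the notebook's data).
texts = [
    "battery drains fast", "battery dies overnight",
    "phone battery drain issue", "battery percentage drops",
    "update failed to install", "new update broke wifi",
    "cannot download update", "update stuck on restart",
]
labels = ["battery"] * 4 + ["update"] * 4

# Vectorizer and classifier live in one object, so the same
# preprocessing is applied when fitting and when predicting.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC()),
])

# Small, assumed grid; the search space Rasa uses may differ.
param_grid = {"svc__C": [1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

prediction = search.predict(["my battery drain is terrible"])[0]
print(search.best_params_, prediction)
```

Because the fitted `search` object contains the vectorizer, new text can be passed in raw, with no separate preprocessing step to keep in sync.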

In [1]:
# Data science
import pandas as pd
print(f"Pandas: {pd.__version__}")
import numpy as np
print(f"Numpy: {np.__version__}")

# Deep Learning 
import tensorflow as tf
print(f"Tensorflow: {tf.__version__}")
from tensorflow import keras
print(f"Keras: {keras.__version__}")
import sklearn
print(f"Sklearn: {sklearn.__version__}")

# Visualization 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

# Cool progress bars
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()  # Enable tracking of execution progress

import collections
import yaml

# Preprocessing and Keras
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer
import re
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Bidirectional, Embedding, Dropout
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

# Reading back in intents
with open(r'objects/intents.yml') as file:
    intents = yaml.load(file, Loader=yaml.FullLoader)

# Reading in representative intents
with open(r'objects/intents_repr.yml') as file:
    intents_repr = yaml.load(file, Loader=yaml.FullLoader)
    
# Reading in training data
train = pd.read_pickle('objects/train.pkl')

print(train.head())
print(f'\nintents:\n{intents}')
print(f'\nrepresentative intents:\n{intents_repr}')

Pandas: 1.0.5
Numpy: 1.18.5
Tensorflow: 2.2.0
Keras: 2.3.0-tf
Sklearn: 0.23.1


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))


                                           Utterance   Intent
0  [hey, please, fix, io, battery, drain, issue, ...  Battery
1  [okay, new, update, io, still, see, question, ...   Update
2       [iphone, slow, dial, since, new, io, update]   iphone
3  [also, anytime, wan, na, fix, i️, thing, annny...      app
4  [hi, instal, high, update, day, ago, start, mo...      mac

intents:
{'app': ['app', 'application'], 'battery': ['battery'], 'bug': ['bug'], 'greeting': ['hi', 'hello', 'hey', 'yo'], 'icloud': ['icloud', 'i cloud'], 'ios': ['io'], 'iphone': ['iphone', 'i phone'], 'mac': ['mac', 'macbook', 'laptop', 'computer'], 'music': ['music', 'song', 'playlist'], 'payment': ['credit', 'card', 'payment', 'pay'], 'settings': ['settings', 'setting'], 'troubleshooting': ['problem', 'trouble'], 'update': ['update'], 'watch': ['watch']}

representative intents:
{'Battery': ['io', 'drain', 'battery', 'iphone', 'twice', 'fast', 'io', 'help'], 'Update': ['new', 'update', 'i️', 'make', 'sure', 'downl

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

def top10_bagofwords(data, output_name, title):
    ''' Takes text data as input and plots the top 10 words by count'''
    bagofwords = CountVectorizer()
    inbound = bagofwords.fit_transform(data)
    
    # Make rank
    word_counts = np.array(np.sum(inbound, axis=0)).reshape((-1,))
    words = np.array(bagofwords.get_feature_names())
    words_df = pd.DataFrame({"word":words, 
                             "count":word_counts})
    words_rank = words_df.sort_values(by="count", ascending=False)
    
    # Visualizing top 10 words
    plt.figure(figsize=(12,6))
    sns.barplot(x=words_rank['word'][:10], y=words_rank['count'][:10], palette='inferno')
    plt.title(title)
    
    # Saving
    plt.savefig(f'visualizations/{output_name}.png')
    
    plt.show()

# Keras Preprocessing
I've done most of the main preprocessing work, but Keras needs a few more specific steps before I can model with it.

In [6]:
# 1. Create tokenizer object
def make_tokenizer(docs, filters = '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~'):
    t = Tokenizer(filters = filters)
    t.fit_on_texts(docs)
    return t

token = make_tokenizer(train['Utterance'])

# 2. Finding length of vocabulary
vocab_size = len(token.word_index) + 1

# 3. Finding maximum length of Tokens
get_max_token_length = lambda series: len(max(series, key = len))
max_token_length = get_max_token_length(train['Utterance'])

print(f'Vocab Size: {vocab_size} \nMax Token Length: {max_token_length}')

# 4. Encode documents - matching with Keras dictionary
encode_tweets = lambda token, words: token.texts_to_sequences(words)
encoded_tweets = encode_tweets(token, train['Utterance'])

# 5. Padding my documents - filling with tags to normalize the lengths
pad_tweets = lambda encoded_doc, max_length: pad_sequences(encoded_doc, maxlen = max_length, padding = "post")

padded_tweets = pad_tweets(encoded_tweets, max_token_length)
print("Shape of padded tweets:",padded_tweets.shape)
print("\nPreview of encoded and padded tweets:\n", padded_tweets)

# 6. One hot encode to represent target variable data (intents)
one_hot = lambda encode: OneHotEncoder(sparse = False).fit_transform(encode)
unique_intents = list(set(train['Intent']))
# making another tokenizer
output_tokenizer = make_tokenizer(unique_intents, filters = '!"#$%&()*+,-/:;<=>?@[\]^`{|}~')
encoded_intents = encode_tweets(output_tokenizer, train['Intent'])
# reshaping encoded Tweets for this one hot function
encoded_intents = np.array(encoded_intents).reshape(len(encoded_intents), 1)
one_hot_intents = one_hot(encoded_intents)
print(f'\nPreview of intent representation:\n{one_hot_intents}')

# 7. Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(padded_tweets, one_hot_intents, test_size = 0.3, 
                                                   shuffle = True, stratify = one_hot_intents)
print(f'\nShape checks:\nX_train: {X_train.shape} X_val: {X_val.shape}\ny_train: {y_train.shape} y_val: {y_val.shape}')

Vocab Size: 4164 
Max Token Length: 32
Shape of padded tweets: (5000, 32)

Preview of encoded and padded tweets:
 [[ 38  11   6 ...   0   0   0]
 [368   7   1 ...   0   0   0]
 [  3  22 870 ...   0   0   0]
 ...
 [  1   2   4 ...   0   0   0]
 [  2   1 125 ...   0   0   0]
 [  8  26   2 ...   0   0   0]]

Preview of intent representation:
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 ...
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]]

Shape checks:
X_train: (3500, 32) X_val: (1500, 32)
y_train: (3500, 5) y_val: (1500, 5)


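Keras expects the y variable to be one-hot encoded when the loss is categorical cross-entropy. For multiclass problems many people feed the targets in as one-hot vectors; it's just one of the design choices (integer labels with sparse categorical cross-entropy is the other).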
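To make the encoding concrete, here is a small NumPy sketch of what a one-hot target looks like and how it maps back to integer class ids. The label values are hypothetical and only there for illustration.

```python
import numpy as np

# Hypothetical integer-encoded intents for five utterances (3 classes).
int_labels = np.array([0, 2, 1, 2, 0])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels one-hot encodes all of them at once.
one_hot_labels = np.eye(num_classes)[int_labels]

# argmax along each row undoes the encoding.
recovered = one_hot_labels.argmax(axis=1)

print(one_hot_labels)
print(recovered)
```

In Keras, one-hot targets like `one_hot_labels` pair with the `categorical_crossentropy` loss, while integer targets like `int_labels` pair with `sparse_categorical_crossentropy`.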

If you're using doc2vec embeddings, how do you pass in your Tweets? You may have to pass them in as full Tweets, so check how the Tweets are fed to the model; you may have to tokenize at the Tweet level. In that case, passing in Tweet 57 activates the input node for the 57th document, so that it gets multiplied by the embedding for that document.

In [87]:
# Making my own embedding matrix that's in a specific order
d2v_embedding_matrix = pd.read_pickle('objects/inbound_d2v.pkl')

In [None]:
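import os

# GLOVE_DIR is a placeholder (assumption): set it to the folder that holds glove.6B.100d.txt
GLOVE_DIR = '.'

# Map each word to its 100-dimensional GloVe vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))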

In a regular word embedding, the rows of the embedding matrix have to be ordered so that they match how the words appear in my Keras tokenizer's word index. The tokenizer orders its index so that the most common words come first, and the embedding matrix needs to be aligned with that.

Make sure the row order of the embedding matrix is the same as the order of the tokenizer's word index.
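The alignment can be sketched as follows. This is a minimal example under stated assumptions: `word_index` stands in for `token.word_index`, and `pretrained` stands in for a word-to-vector lookup such as `embeddings_index`; both are tiny hypothetical values here.

```python
import numpy as np

# Hypothetical stand-ins: a tokenizer word index (1-based, most frequent
# word first) and a lookup of pretrained vectors keyed by word.
word_index = {"update": 1, "battery": 2, "iphone": 3}
embedding_dim = 4
pretrained = {
    "battery": np.full(embedding_dim, 0.5, dtype="float32"),
    "update": np.full(embedding_dim, 0.1, dtype="float32"),
}

# One row per word index, plus row 0, which stays all zeros
# because index 0 is reserved for padding.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # words without a pretrained vector stay zero
        embedding_matrix[idx] = vector

print(embedding_matrix)
```

A matrix built this way could then be handed to the `Embedding` layer, for example through `embeddings_initializer=keras.initializers.Constant(embedding_matrix)`, with `trainable=False` to keep the pretrained values fixed.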

# Keras Modelling
I will create a neural network with Keras with the output layer having the same number of nodes as there are intents. The following is my architecture:

In [82]:
def make_model(vocab_size, max_token_length):
    ''' In this function I define all the layers of my neural network'''
    # Initialize
    model = Sequential()
    
    # Adding layers
    model.add(Embedding(vocab_size, 128, input_length = max_token_length, trainable = False))
    model.add(Bidirectional(LSTM(128)))
#    model.add(LSTM(128)) # Another LSTM layer if things aren't doing well. Beef up the size of the Dense layer
    model.add(Dense(32, activation = "relu")) # Try 50, another dense layer? This takes a little bit of exploration
    
    # Randomly drop 50 percent of the nodes during training to reduce overfitting
    model.add(Dropout(0.5))
    # Make sure when you update your number of unique intents, you reflect it in this last layer
    model.add(Dense(22, activation = "sigmoid"))
    return model

In [None]:
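filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# Build and compile: binary cross-entropy pairs with the sigmoid output; 'adam' is an assumed default optimizer
model = make_model(vocab_size, max_token_length)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

hist = model.fit(X_train, y_train, epochs = 100, batch_size = 32, validation_data = (X_val, y_val), callbacks = [checkpoint])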

In [None]:
def predictions(text):
    ''' Cleans raw text, encodes it with the training tokenizer and returns the model's class scores'''
    clean = re.sub(r'[^ a-z A-Z 0-9]', " ", text)
    test_word = word_tokenize(clean)
    test_word = [w.lower() for w in test_word]
    test_ls = token.texts_to_sequences(test_word)
    print(test_word)
    # Drop unknown words, which come back as empty lists
    if [] in test_ls:
        test_ls = list(filter(None, test_ls))
    test_ls = np.array(test_ls).reshape(1, len(test_ls))
    x = pad_tweets(test_ls, max_token_length)
    pred = model.predict(x)
    return pred

In [None]:
def get_final_output(pred, classes):
    predictions = pred[0]
    classes = np.array(classes)
    ids = np.argsort(-predictions)
    classes = classes[ids]
    predictions = -np.sort(-predictions)
    for i in range(pred.shape[1]):
        print("%s has confidence = %s" % (classes[i], (predictions[i])))

In [None]:
text = "are you a robot"
pred = predictions(text)
get_final_output(pred, unique_intents)

# Exploration of Different Intent Classification Methods

There already exist chatbot frameworks that people use, such as Wix.

Still, hopefully the intent classification of my Tweets will be good enough. The problem with my Tweets is that there is a lot of noise, and it's not always clear what intent a particular Tweet represents. I will also definitely have class imbalance, so I will have to upsample or downsample to get good classification accuracy.
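One simple way to handle the imbalance is random oversampling: duplicate minority-class rows until every class is as large as the biggest one. The sketch below uses only the standard library; `oversample` is a hypothetical helper and the rows are toy data, not the notebook's training set.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy (utterance, intent) rows invented for illustration.
rows = (
    [("battery drain", "battery")] * 6
    + [("update failed", "update")] * 2
    + [("watch strap", "watch")]
)
balanced = oversample(rows)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling is best applied to the training split only, after the train/validation split, so that duplicated rows do not leak into validation.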



In [None]:
def create_model(vocab_size, max_length):
    ''' In this function I define all the layers of my neural network'''
    model = Sequential()
    model.add(Embedding(vocab_size, 128, input_length = max_length, trainable = False))
    model.add(Bidirectional(LSTM(128)))
#   model.add(LSTM(128)) # Another LSTM layer if things aren't doing well. Beef up the size of the Dense layer
    model.add(Dense(32, activation = "relu")) # Try 50, another dense layer? This takes a little bit of exploration
    
    # Randomly drop 50 percent of the nodes during training to reduce overfitting
    model.add(Dropout(0.5))
    # Make sure when you update your number of unique intents, you reflect it in this last layer
    model.add(Dense(22, activation = "sigmoid"))
    return model

# Tensorflow

For multilabel classification, you use sigmoid. You'll still have 10 distinct intents, but the problem needs to be modelled so that each of those intents is independent of the others.

The prediction for intent 1 shouldn't affect intent 2. Softmax takes the scores over all classes, and the highest score gets the highest probability, but everything sums to 1. A final softmax layer therefore always sums to 1, which doesn't work in my case.

Instead, each intent is classified separately, so the outputs can sum to more than 1.

This is similar to one-vs-rest logistic regression in the multiclass setting: one curve for class 0 versus not class 0, and so on. The sum of those probabilities can be greater than 1.

You would use sigmoid for the activation function.

For class 1, it'll be 1 or not 1, and so on. You then look at the output layer, and whichever nodes output a probability greater than 0.5 are your final output. That might be 2 intents, or up to 3; it depends on how many nodes you have.

When you feed in your target vector, it needs to be encoded as a vector of 0s and 1s, so the target will have 10 columns. Each node has its own sigmoid function, and for each node P and (1 - P) sum to 1, but across the nodes the probabilities can sum to more than 1. This is one-versus-rest classification; read up on it in logistic regression terms under multilabel classification. The main thing is that the labels need to be one-hot encoded (with several 1s allowed per row). The loss function would be binary cross-entropy.

Issues: the more classes you have, the harder it is for your model. Accuracy starts to drop a bit, especially for the 2nd and 3rd labels.
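The difference between the two activations can be checked numerically. The following NumPy sketch uses made-up logits for four intents (an assumption for illustration) and shows that softmax outputs sum to 1, sigmoid outputs need not, and that thresholding at 0.5 yields the predicted label set.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-7):
    # Mean of the per-node binary losses; clipping avoids log(0).
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

logits = np.array([2.0, 1.5, -1.0, -3.0])  # hypothetical raw scores for 4 intents
target = np.array([1.0, 1.0, 0.0, 0.0])    # two intents are active at once

soft = softmax(logits)                     # competes across classes, sums to 1
sig = sigmoid(logits)                      # each node scored independently
predicted = np.where(sig > 0.5)[0]         # indices of intents above the threshold
loss = binary_cross_entropy(target, sig)

print(soft.sum(), sig.sum(), predicted, loss)
```

With these scores the sigmoid outputs for the first two intents both exceed 0.5, so both are returned, which a softmax followed by argmax could not do.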

### Question Answering Example with Tensorflow
Found at [Hugging Face](https://huggingface.co/transformers/task_summary.html#sequence-classification)

Hugging Face is constantly updated with new pretrained models.

Here is an example of question answering using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
2. Define a text and a few questions.
3. Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type ids and attention masks.
4. Pass this sequence through the model. This outputs a range of scores across the entire sequence of tokens (question and text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens.
6. Fetch the tokens from the identified start and stop values, and convert those tokens to a string.
7. Print the results.
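The last few steps (softmax over the scores, then picking the start and end tokens) can be sketched without downloading a model. In this NumPy example the tokens and the start and end scores are made up to stand in for real tokenizer and model output, so it only illustrates the span-selection logic.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical tokens (question, separator, text) and invented scores
# standing in for what a question answering model would return.
tokens = ["[CLS]", "who", "makes", "ios", "[SEP]", "apple", "makes", "ios", "[SEP]"]
start_logits = np.array([0.1, 0.2, 0.1, 0.3, 0.0, 4.0, 0.5, 0.2, 0.0])
end_logits = np.array([0.1, 0.1, 0.2, 0.2, 0.0, 3.5, 0.4, 0.6, 0.0])

# Probabilities over the tokens for the start and end positions.
start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Most likely start and end; add 1 because the slice end is exclusive.
start = int(np.argmax(start_probs))
end = int(np.argmax(end_probs)) + 1

answer = " ".join(tokens[start:end])
print(answer)
```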

### Keras Tokenizer

The Keras Tokenizer creates a dictionary of all the words in the vocabulary and stores an index for each one. For each sequence passed in, it converts every word into the index that refers to it in the Keras word dictionary. When you feed sentences into the model, they all have to be the same length. Some Tweets are going to be longer than others, so pad_sequences pads the shorter ones with 0s until they are the same length as the longest message. People sometimes set a shorter maximum length, because longer sequences are harder to train on.

Got this tokenizer function from https://www.tensorflow.org/tutorials/text/nmt_with_attention

In [109]:
# tf.keras.preprocessing.text.Tokenizer(
#     num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
#     split=' ', char_level=False, oov_token=None, document_count=0
# )

def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')
    return tensor, lang_tokenizer

tokenized = tokenize(cleaned)
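To show what the Keras Tokenizer and pad_sequences do under the hood, here is a simplified pure-Python imitation. It is a sketch of the behaviour described in this section, not the Keras implementation, and the helper names and toy documents are hypothetical.

```python
from collections import Counter

def build_word_index(docs):
    """Give each word a 1-based index, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for doc in docs for word in doc)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode(docs, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in doc if w in word_index] for doc in docs]

def pad_post(seqs, maxlen=None):
    """Right-pad with 0s to a common length; longer sequences keep their last maxlen items."""
    maxlen = maxlen or max(len(s) for s in seqs)
    return [s[-maxlen:] + [0] * (maxlen - len(s)) for s in seqs]

# Toy tokenized documents invented for illustration.
docs = [["battery", "drain", "fast"], ["update", "battery"], ["battery"]]
word_index = build_word_index(docs)
padded = pad_post(encode(docs, word_index))

print(word_index)
print(padded)
```

"battery" is the most frequent word, so it gets index 1, and the two shorter documents are filled with trailing 0s to match the longest one.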

# Fitting my model
When I build my neural network with K-fold cross-validation, training will take a LONG time, so you can probably get away without doing CV and hyperparameter optimization.