# Intent Classifier 

This notebook shows a simple way to classify an incoming query such as "I want a hot dog." into one of several intents. Finding the intent of a user query is a very important task in building a chatbot.

Here, intent classification is done with a Keras sequence model that extracts features from the incoming query.

In [1]:
import keras, tensorflow, sys
keras.__version__, tensorflow.__version__, sys.version

Using TensorFlow backend.


('2.2.4',
 '1.11.0',
 '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')

In [2]:
# import required packages

import json
import pandas as pd
import numpy as np

import tensorflow as tf

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.utils.np_utils import to_categorical

from keras.layers import Dense, Input, Flatten, Lambda, Permute, GlobalMaxPooling1D, Activation, Concatenate
from keras.layers import Convolution1D, MaxPooling1D, Embedding, Dropout, Bidirectional, CuDNNGRU, SpatialDropout1D

from keras.models import Model

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score


## Dataset

The dataset is taken from https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines. It has been collected from different sources and contains queries pertaining to 7 different intents.

The dataset is provided in JSON format; the code block below reads it.

In [3]:
data = pd.DataFrame()

for intent in ['AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook', 'SearchCreativeWork',
               'SearchScreeningEvent']:

    with open("./data/2017-06-custom-intent-engines/" + intent + "/train_" + intent + "_full.json",
              encoding='cp1251') as data_file:
        full_data = json.load(data_file)
        
    texts = []
    for i in range(len(full_data[intent])):
        text = ''
        for j in range(len(full_data[intent][i]['data'])):
            text += full_data[intent][i]['data'][j]['text']
        texts.append(text)

    dftrain = pd.DataFrame(data=texts, columns=['request'])
    dftrain[intent] = np.ones(dftrain.shape[0], dtype='int')

    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    data = pd.concat([data, dftrain], ignore_index=True, sort=False)

data = data.fillna(value=0)

data.shape

(13784, 8)
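
The loading loop above implies that each file maps the intent name to a list of utterances, and that each utterance stores its text as a list of chunks under `data`. A minimal sketch of the concatenation step on a hand-made record follows; the sample record, the `entity` field value, and the helper name `utterance_text` are made up for illustration.

```python
import json

# Hypothetical record shaped like the files read above:
# {intent: [{"data": [{"text": ...}, {"text": ..., "entity": ...}, ...]}, ...]}
raw = '''
{"GetWeather": [
  {"data": [
    {"text": "What will the weather be in "},
    {"text": "Naguabo", "entity": "city"},
    {"text": " tomorrow?"}
  ]}
]}
'''

full_data = json.loads(raw)


def utterance_text(utterance):
    """Join the text chunks of one utterance into a single query string."""
    return "".join(chunk["text"] for chunk in utterance["data"])


texts = [utterance_text(u) for u in full_data["GetWeather"]]
print(texts)  # ['What will the weather be in Naguabo tomorrow?']
```

Joining the chunks with an empty string preserves the original spacing, because the spaces are already part of the chunk texts.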

## Sample query

The dataframe contains the query, and the column corresponding to its intent is marked with 1. See below:

In [4]:
data.sample(5)

| | request | AddToPlaylist | BookRestaurant | GetWeather | PlayMusic | RateBook | SearchCreativeWork | SearchScreeningEvent |
|---|---|---|---|---|---|---|---|---|
| 7144 | Play a top five Linda Strawberry ep | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5584 | What will the weather be in IN? | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12905 | Find the movie schedule at twelve AM. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4798 | will the weather be colder in Naguabo four min... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3989 | What's the weather close to Cambodia at 05:44:13 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Load Glove Embedding

Load the embedding file 'glove.840B.300d.txt' and compute the mean and standard deviation over all values of the word vectors. Then, for every word in the vocabulary, initialize the corresponding word vector from the loaded embedding file. Words that cannot be found in the embedding file are initialized from a normal distribution with the mean and standard deviation computed above.

In [5]:
def load_glove(word_index):
    EMBEDDING_FILE = '../../embeddings/glove.840B.300d/glove.840B.300d.txt'

    # Each line of the file is "<word> <v1> ... <v300>".
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    with open(EMBEDDING_FILE, encoding="utf8") as f:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)

    # Scalar mean and standard deviation over all embedding values.
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # Start from random vectors so that words missing from GloVe keep a random initialization.
    # Note: Tokenizer ids start at 1, so with nb_words rows the word with the highest id
    # (id == nb_words) is skipped below and has no row of its own.
    nb_words = len(word_index)
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= nb_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix
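
The initialization scheme can be tried without the GloVe file. The sketch below uses a toy 4-dimensional embedding dictionary and a three-word vocabulary, both made up for illustration. Unlike `load_glove`, it allocates `len(word_index) + 1` rows so that the highest word id also gets a row; that is a variation for illustration, not the notebook's code.

```python
import numpy as np

# Toy stand-ins (assumptions for illustration): 4-dimensional vectors
# instead of the 300-dimensional GloVe vectors.
embeddings_index = {
    "play": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "music": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}
# Keras-style word index: ids start at 1, id 0 is reserved for padding.
word_index = {"play": 1, "music": 2, "naguabo": 3}

all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

# One row per id, including the padding id 0, hence the "+ 1".
rng = np.random.RandomState(0)
embedding_matrix = rng.normal(emb_mean, emb_std, (len(word_index) + 1, embed_size))

for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # known word: copy the pretrained vector

print(embedding_matrix.shape)  # (4, 4)
```

Rows 1 and 2 hold the toy pretrained vectors, while row 3 ("naguabo", not in the toy dictionary) keeps its random initialization.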

In [6]:
# split data into test and train
X_train, X_test, y_train, y_test = train_test_split(data["request"], data[["AddToPlaylist", "BookRestaurant",
                                                    "GetWeather", "PlayMusic", "RateBook", "SearchCreativeWork",
                                                    "SearchScreeningEvent"]], test_size=0.25)

## Tokenize and pad the text sequences

Tokenize -> map each word to its integer id

Pad -> trim or pad with zeros so that all sentences have the same length.


In [7]:

X_train = list(X_train)

# tokenize input strings
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

word_index = tokenizer.word_index
vocab_size = len(word_index)

# cap each sentence at a maximum of 100 words.
max_sent_len = 100

# sentences with fewer than 100 words are padded with zeros to length 100;
# sentences with more than 100 words are truncated to 100.
X_train = pad_sequences(X_train, maxlen=max_sent_len)
X_test = pad_sequences(X_test, maxlen=max_sent_len)

embedding_matrix = load_glove(word_index)
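
The two steps can be illustrated without Keras. The following is a simplified sketch with hypothetical helper names; the real `Tokenizer` also strips punctuation, which is skipped here. By default, `pad_sequences` pads and truncates at the start of the sequence (`padding='pre'`, `truncating='pre'`), and the sketch mimics that.

```python
def fit_word_index(texts):
    """Assign ids starting at 1, most frequent word first (0 is kept for padding)."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}


def to_sequence(text, word_index):
    """Map known words to their ids; words not seen during fitting are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def pad_pre(seq, maxlen):
    """Keep the last `maxlen` ids and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


word_index = fit_word_index(["play some music", "play a song"])
print(word_index)  # {'play': 1, 'some': 2, 'music': 3, 'a': 4, 'song': 5}

seq = to_sequence("play loud music", word_index)  # "loud" is unknown and dropped
print(pad_pre(seq, 5))                  # [0, 0, 0, 1, 3]
print(pad_pre([1, 2, 3, 4, 5, 1], 5))   # [2, 3, 4, 5, 1]
```

Because the tokenizer is fitted on the training split only, words that appear only in the test split are dropped in the same way as "loud" here.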

Convert the one-hot vectors of class labels into numerical labels.

In [8]:
y_train = np.argmax(np.array(y_train), axis=-1)
y_test = np.argmax(np.array(y_test), axis=-1)
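
A small sketch of this conversion on three hand-written one-hot rows, assuming the column order used in the train/test split above. Indexing the intent list with a label id recovers the intent name, which is also how a predicted class id can be turned back into a readable label.

```python
import numpy as np

intents = ["AddToPlaylist", "BookRestaurant", "GetWeather", "PlayMusic",
           "RateBook", "SearchCreativeWork", "SearchScreeningEvent"]

# Three one-hot rows: PlayMusic, GetWeather, SearchScreeningEvent.
one_hot = np.array([
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
], dtype="float32")

label_ids = np.argmax(one_hot, axis=-1)
names = [intents[i] for i in label_ids]

print(label_ids.tolist())  # [3, 2, 6]
print(names)               # ['PlayMusic', 'GetWeather', 'SearchScreeningEvent']
```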


## Model

Preloaded GloVe vectors are used as the embedding weights for the model.

Embedded word vectors are first featurized with a 1D convolution and then passed to a bidirectional GRU. The GRU captures the sequential information, while the CNN enriches the embeddings by emphasizing information from neighboring words.

The global max pooling layer keeps one value per feature, namely the maximum over all time steps, unlike regular max pooling, where the pool size determines how many values are pooled together.
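
The difference between the two pooling variants can be seen on a toy tensor. This is a NumPy sketch with made-up values, not the Keras layers themselves; the windowed variant assumes a pool size of 2 with non-overlapping windows.

```python
import numpy as np

# One sequence of 4 time steps with 3 features per step (toy values).
h = np.array([[
    [0.1, 0.9, -0.2],
    [0.7, 0.3,  0.0],
    [0.2, 0.4,  0.5],
    [0.6, 0.1, -0.1],
]])  # shape: (batch=1, steps=4, features=3)

# Global max pooling: maximum over the whole time axis, one value per feature.
global_max = h.max(axis=1)
print(global_max.shape)     # (1, 3)
print(global_max.tolist())  # [[0.7, 0.9, 0.5]]

# Windowed max pooling with pool size 2: maximum within each pair of steps.
windowed_max = h.reshape(1, 2, 2, 3).max(axis=2)
print(windowed_max.shape)     # (1, 2, 3)
print(windowed_max.tolist())  # [[[0.7, 0.9, 0.0], [0.6, 0.4, 0.5]]]
```

Global pooling removes the time axis entirely, so its output size does not depend on the sequence length.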

The features are enriched by concatenating self-attended features of the RNN output.
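
A NumPy sketch of this attention step, mirroring the `Dense(units=1)`, `Permute`, softmax, and `tf.matmul` sequence in the model code. The sizes and the random weights are toy assumptions; in the model, the scoring weights are learned and the RNN output has 100 time steps with 128 features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.RandomState(0)
batch, steps, features = 2, 5, 8  # toy sizes
h = np.tanh(rng.normal(size=(batch, steps, features)))  # stands in for output_h

# Dense(units=1): one score per time step (random weights here).
w = rng.normal(size=(features, 1))
b = np.zeros(1)
scores = h @ w + b  # (batch, steps, 1)

# Permute((2, 1)) followed by softmax over the last axis = softmax over time steps.
weights = softmax(np.transpose(scores, (0, 2, 1)), axis=-1)  # (batch, 1, steps)

# Weighted sum of the time steps, then Flatten.
context = (weights @ h).reshape(batch, -1)  # (batch, features)

print(weights.sum(axis=-1).ravel())  # attention weights of each sequence sum to 1
print(context.shape)                 # (2, 8)
```

The result is one vector per sequence in which time steps with higher scores contribute more, complementing the max-pooled features.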

Finally multiple fully-connected layers are used to classify the incoming query into one of the possible intents.

Adam optimizer and sparse categorical crossentropy loss are used.
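
The sparse variant of the loss fits here because the labels are integer class ids rather than one-hot vectors. A minimal NumPy sketch of what the loss computes, with toy probabilities; the function below is an illustration, not the Keras implementation.

```python
import numpy as np


def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class ids."""
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())


# Softmax outputs for two samples over three classes (toy values).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
])
y_true = np.array([0, 1])  # integer class ids, not one-hot rows

loss = sparse_categorical_crossentropy(y_true, probs)
print(round(loss, 4))  # 0.2899
```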

In [9]:
# Model

sequence_input = Input(shape=(max_sent_len,), dtype='int32')

words = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix],
                  trainable=True)(sequence_input)
words = Dropout(rate=0.3)(words)

# kernel_size is the Keras 2 name of the legacy filter_length argument
output = Convolution1D(filters=256, kernel_size=3, activation="tanh", padding='same', strides=1)(words)
output = Dropout(rate=0.3)(output)

output = Bidirectional(CuDNNGRU(units=64, return_sequences=True), merge_mode='concat')(output)
output_h = Activation('tanh')(output)

output1 = GlobalMaxPooling1D()(output_h) 

# Applying attention to RNN output
output = Dense(units=1)(output_h)
output = Permute((2, 1))(output)
output = Activation('softmax', name="attn_softmax")(output)
output = Lambda(lambda x: tf.matmul(x[0], x[1])) ([output, output_h])
output2 = Flatten() (output)

# Concatenating maxpooled and self attended features.
output = Concatenate()([output1, output2])
output = Dropout(rate=0.3)(output)

output = Dense(units=128, activation='tanh')(output)
output = Dropout(rate=0.3)(output)

output = Dense(units=32, activation='tanh')(output)
output = Dense(units=7, activation='softmax')(output)

model = Model(inputs=sequence_input, outputs=output)
model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

model.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 100, 300)     2929200     input_1[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 100, 300)     0           embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 100, 256)     230656      dropout_1[0][0]                  
__________________________________________________________________________________________________
dropout_2 

In [10]:
# train the model
model.fit(X_train, np.array(y_train), epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1f5740fe978>

## Validation

The model classifies the intent of a query with an accuracy of 98.7% on the held-out test split.

In [11]:
#get scores and predictions.
p = model.predict(X_test)
p = [np.argmax(i) for i in p]

print("f1_score (macro):", f1_score(y_test, p, average="macro"))
print("accuracy_score:", accuracy_score(y_test, p))

f1_score (macro): 0.9872548281477977
accuracy_score: 0.9872315728380732
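
Macro F1 averages the per-class F1 scores with equal weight, so a rarely occurring intent counts as much as a frequent one, whereas accuracy is the share of correct predictions overall. A dependency-free sketch on toy labels (the helper `macro_f1` is written for illustration and is not the scikit-learn implementation):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
print(round(accuracy, 4))                  # 0.8333
```

In the notebook's results the two metrics are nearly identical, which suggests that errors are spread fairly evenly across the seven intents.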
