<a href="https://colab.research.google.com/github/Dark-Sied/Intent_Classification/blob/master/Intent_classification_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**INTENT CLASSIFICATION USING NEURAL NETWORKS**

In [1]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer
import nltk
import re
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Bidirectional, Embedding, Dropout
from keras.callbacks import ModelCheckpoint


Using TensorFlow backend.


This function reads a CSV file of sentences and their intent labels. It returns the intent column, the list of distinct intents, and the list of sentences.

In [2]:
def load_dataset(filename):
  df = pd.read_csv(filename, encoding = "latin1", names = ["Sentence", "Intent"])
  print(df.head())
  intent = df["Intent"]
  unique_intent = list(set(intent))
  sentences = list(df["Sentence"])
  
  return (intent, unique_intent, sentences)
  


Let's load the dataset and print the first 5 rows.

In [3]:
intent, unique_intent, sentences = load_dataset("Dataset.csv")

                Sentence          Intent
0       Need help pleese  commonQ.assist
1              Need help  commonQ.assist
2       I need some info  commonQ.assist
3      Will you help me?  commonQ.assist
4  What else can you do?  commonQ.assist


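We now create a "tokenizer" object. It lowercases each sentence, replaces every character in the 'filters' string with a space, splits the result into words, and assigns each word an integer index.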

In [4]:
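def create_tokenizer(words, filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'):
  # Characters listed in `filters` are stripped before the text is split into words
  tokenizer = Tokenizer(filters = filters)
  # Build the word -> index mapping from the given texts
  tokenizer.fit_on_texts(words)
  return tokenizer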
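
To make the tokenizer's behaviour concrete, here is a minimal pure-Python sketch of the same idea. `build_word_index` is a hypothetical helper written for illustration, not part of Keras, and it only approximates what `Tokenizer.fit_on_texts` does: lowercase, replace filter characters with spaces, split on whitespace, and give the most frequent word index 1.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def build_word_index(texts, filters=FILTERS):
    """Hypothetical helper: approximate the word -> index mapping built by the tokenizer."""
    # Every filter character becomes a space, so "me?" is read as "me"
    table = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # Most frequent word first; ties keep first-seen order because sorted() is stable
    ranked = sorted(counts, key=lambda word: -counts[word])
    # Indices start at 1, which leaves 0 free for padding
    return {word: i for i, word in enumerate(ranked, start=1)}

word_index = build_word_index(["Need help pleese", "Need help", "Will you help me?"])
print(word_index)
```

In this toy corpus "help" occurs three times and "need" twice, so they receive indices 1 and 2.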

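Let's now tokenize the sentences and find the vocabulary size. We add 1 to the number of distinct words because word indices start at 1, leaving index 0 free for padding. Let's also fix a maximum sentence length. This max_length will be used to 'pad' the shorter sentences later on.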

In [5]:
word_tokenizer = create_tokenizer(sentences)
vocab_size = len(word_tokenizer.word_index) + 1
max_length = 30
print("Vocab Size = ",vocab_size)

Vocab Size =  494


Now we convert each sentence (i.e. a sequence of words) into a sequence of word indices. This process is sometimes called encoding.

In [6]:
def encoding_doc(tokenizer, words):
  return(tokenizer.texts_to_sequences(words))

In [7]:
encoded_doc = encoding_doc(word_tokenizer, sentences)

Pad every sequence shorter than max_length with trailing zeros to get the final input matrix.

In [8]:
def padding_doc(encoded_doc, max_length):
  return(pad_sequences(encoded_doc, maxlen = max_length, padding = "post"))

In [9]:
padded_doc = padding_doc(encoded_doc, max_length)
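
As a rough illustration of what post-padding does, the sketch below re-implements it with plain lists. `pad_post` is a hypothetical stand-in, not the Keras function. It assumes the Keras default of `truncating="pre"`, under which sequences longer than `maxlen` lose their earliest items.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences(..., maxlen=maxlen, padding="post")."""
    padded = []
    for seq in sequences:
        # Too-long sequences keep only their last `maxlen` items (assumed default truncating="pre")
        seq = list(seq)[-maxlen:]
        # Short sequences are filled with `value` on the right
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

rows = pad_post([[24, 77, 332], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(rows)  # [[24, 77, 332, 0, 0], [3, 4, 5, 6, 7]]
```

With max_length = 30, a sentence would only be truncated if it contained more than 30 known words.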

Let's see what the first 5 inputs, after padding, look like.

In [10]:
padded_doc[:5]

array([[ 24,  77, 332,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [ 24,  77,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  2,  24, 198, 181,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [ 51,  10,  77,  16,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  9, 268,   4,  10,  30,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0]], dtype=int32)

The shape of the input matrix can be found as follows. The first value is the number of data samples, and the second is the length of each padded sample.

In [11]:
print("Shape of padded docs = ",padded_doc.shape)

Shape of padded docs =  (1113, 30)


In [12]:
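# Tokenizer for the intent labels: '.' and '_' are removed from the filters
# so that each label (e.g. faq.borrow_use) stays a single token
output_tokenizer = create_tokenizer(unique_intent, filters = '!"#$%&()*+,-/:;<=>?@[\\]^`{|}~')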


The following command shows the index assigned to each intent present in the dataset. Note that the tokenizer lowercases the labels.

In [13]:
output_tokenizer.word_index

{'commonq.assist': 14,
 'commonq.bot': 8,
 'commonq.how': 2,
 'commonq.just_details': 19,
 'commonq.name': 17,
 'commonq.not_giving': 6,
 'commonq.query': 10,
 'commonq.wait': 20,
 'contact.contact': 21,
 'faq.aadhaar_missing': 4,
 'faq.address_proof': 15,
 'faq.application_process': 1,
 'faq.apply_register': 5,
 'faq.approval_time': 3,
 'faq.bad_service': 7,
 'faq.banking_option_missing': 16,
 'faq.biz_category_missing': 9,
 'faq.biz_new': 12,
 'faq.biz_simpler': 13,
 'faq.borrow_limit': 11,
 'faq.borrow_use': 18}

Let's encode the outputs too. This means we assign each unique class an index.

In [14]:
encoded_output = encoding_doc(output_tokenizer, intent)

In [15]:
encoded_output = np.array(encoded_output).reshape(len(encoded_output), 1)

In [16]:
encoded_output.shape

(1113, 1)

In [17]:
def one_hot(encode):
  # Return a dense array; in scikit-learn >= 1.2 this argument is named sparse_output
  o = OneHotEncoder(sparse = False)
  return(o.fit_transform(encode))

In [18]:
output_one_hot = one_hot(encoded_output)

The following command shows that there are 1113 data samples and 21 classes.

In [19]:
output_one_hot.shape

(1113, 21)
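
For intuition, one-hot encoding can be sketched in a few lines of plain Python. `one_hot_rows` is a hypothetical helper, not the scikit-learn implementation. It assumes, as `OneHotEncoder` does by default, that columns are ordered by ascending label value.

```python
def one_hot_rows(labels):
    """Hypothetical helper: one-hot encode integer labels, columns in ascending label order."""
    categories = sorted(set(labels))
    column = {label: j for j, label in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0.0] * len(categories)
        row[column[label]] = 1.0  # exactly one 1 per row, in the column of that label
        rows.append(row)
    return categories, rows

categories, rows = one_hot_rows([14, 14, 2, 21])
print(categories)  # [2, 14, 21]
print(rows)
```

Here label 2 maps to column 0, label 14 to column 1, and label 21 to column 2. In the notebook the labels run from 1 to 21, so column j corresponds to tokenizer index j + 1.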

In [21]:
from sklearn.model_selection import train_test_split

Split the data into a training set (80%) and a validation set (20%).

In [27]:
train_X, val_X, train_Y, val_Y = train_test_split(padded_doc, output_one_hot, shuffle = True, test_size = 0.2)


In [28]:
print("Shape of train_X = %s and train_Y = %s" % (train_X.shape, train_Y.shape))
print("Shape of val_X = %s and val_Y = %s" % (val_X.shape, val_Y.shape))

Shape of train_X = (890, 30) and train_Y = (890, 21)
Shape of val_X = (223, 30) and val_Y = (223, 21)
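
The two set sizes can be reproduced by hand. This is a small arithmetic check, assuming that scikit-learn rounds the validation share up and gives the remainder to the training set.

```python
import math

n_samples, test_size = 1113, 0.2

# Assumed rounding rule: the held-out share is rounded up
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 890 223
```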


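Finally, we create the neural network. It contains an embedding layer, which converts each input word index to a vector of specified size (128 in this case). Because the layer is created with trainable = False, its randomly initialised weights are not updated during training. This is followed by a bidirectional recurrent layer (an LSTM with 128 units in each direction), a fully connected layer (Dense), a dropout layer, and a final fully connected layer with one softmax output per intent (21 in total).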

In [24]:
def create_model(vocab_size, max_length):
  model = Sequential()
  model.add(Embedding(vocab_size, 128, input_length = max_length, trainable = False))
  model.add(Bidirectional(LSTM(128)))
#   model.add(LSTM(128))
  model.add(Dense(32, activation = "relu"))
  model.add(Dropout(0.5))
  model.add(Dense(21, activation = "softmax"))
  
  return model

We define the loss function. Categorical crossentropy is essentially the negative log-likelihood of the data.
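
As a small illustration of that statement, the sketch below computes the loss for a single sample by hand. The function is a simplified stand-in for the Keras loss, and the `eps` clipping value is an assumption made here to avoid log(0).

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-likelihood of a one-hot target under the predicted probabilities."""
    # Only the true class contributes, because every other entry of y_true is 0
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

n_classes = 21
y_true = [0.0] * n_classes
y_true[3] = 1.0

# A model that guesses uniformly assigns probability 1/21 to every class
uniform = [1.0 / n_classes] * n_classes
baseline = categorical_crossentropy(y_true, uniform)
print(baseline)
```

A uniform guess over 21 classes gives a loss of ln(21), roughly 3.04, which is a useful baseline when reading the training loss.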

In [29]:
model = create_model(vocab_size, max_length)

model.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 30, 128)           63232     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 256)               263168    
_________________________________________________________________
dense_3 (Dense)              (None, 32)                8224      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 21)                693       
=================================================================
Total params: 335,317
Trainable params: 272,085
Non-trainable params: 63,232
_________________________________________________________________
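
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an LSTM has four gates, each with input weights, recurrent weights, and a bias, and a Dense layer has one weight per input-output pair plus one bias per output. The variable names are chosen here for illustration.

```python
vocab_size, embed_dim, lstm_units, dense_units, n_classes = 494, 128, 128, 32, 21

# Embedding: one 128-dimensional vector per vocabulary index
embedding = vocab_size * embed_dim

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias) per unit
one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
bilstm = 2 * one_direction

# Dense layers: weights plus biases; the first one sees the 256 concatenated LSTM outputs
dense_hidden = (2 * lstm_units) * dense_units + dense_units
dense_out = dense_units * n_classes + n_classes

total = embedding + bilstm + dense_hidden + dense_out
trainable = total - embedding  # the embedding layer is frozen

print(embedding, bilstm, dense_hidden, dense_out)  # 63232 263168 8224 693
print(total, trainable)                            # 335317 272085
```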


In [30]:
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

hist = model.fit(train_X, train_Y, epochs = 10, batch_size = 64, validation_data = (val_X, val_Y), callbacks = [checkpoint])

Train on 890 samples, validate on 223 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


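To check how well our model does, we create the following two functions. The first cleans and encodes a piece of text and returns the model's class probabilities. The second prints the classes sorted by decreasing confidence. Then we invoke the model on a test input of our choice.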

In [31]:
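def predictions(text):
  # Replace everything except letters, digits and spaces with a space
  clean = re.sub(r'[^ a-z A-Z 0-9]', " ", text)
  test_word = word_tokenize(clean)
  test_word = [w.lower() for w in test_word]
  # Each word is encoded separately, giving one (possibly empty) list per word
  test_ls = word_tokenizer.texts_to_sequences(test_word)
  print(test_word)
  # Drop words that are not in the vocabulary
  if [] in test_ls:
    test_ls = list(filter(None, test_ls))

  test_ls = np.array(test_ls).reshape(1, len(test_ls))

  x = padding_doc(test_ls, max_length)

  # predict returns the softmax probabilities; predict_proba is no longer available in newer Keras releases
  pred = model.predict(x)

  return pred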
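
The text-to-input part of this function can be sketched without the trained model. `text_to_padded` below is a hypothetical helper. The four word indices are read off the padded examples shown earlier, `str.split` stands in for nltk's `word_tokenize`, and unknown words are simply dropped.

```python
import re

# Indices read off the padded examples above; a real run would use word_tokenizer.word_index
word_index = {"can": 4, "you": 10, "help": 77, "me": 16}

def text_to_padded(text, word_index, max_length):
    """Hypothetical helper: clean, lowercase, encode and post-pad a single sentence."""
    clean = re.sub(r"[^a-zA-Z0-9 ]", " ", text)
    words = clean.lower().split()  # stand-in for word_tokenize
    # Words missing from the vocabulary are skipped
    ids = [word_index[w] for w in words if w in word_index]
    ids = ids[-max_length:]        # keep at most max_length indices
    return ids + [0] * (max_length - len(ids))

row = text_to_padded("Can you help me, pleeease?", word_index, max_length=8)
print(row)  # [4, 10, 77, 16, 0, 0, 0, 0]
```

The misspelt word "pleeease" is not in the vocabulary, so it contributes nothing to the input.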


  

In [32]:
def get_final_output(pred, classes):
  predictions = pred[0]
 
  classes = np.array(classes)
  ids = np.argsort(-predictions)
  classes = classes[ids]
  predictions = -np.sort(-predictions)
 
  for i in range(pred.shape[1]):
    print("%s has confidence = %s" % (classes[i], (predictions[i])))
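
`get_final_output` assumes that `classes[j]` names column j of the model output. The order of `list(set(...))` is not guaranteed to be the same between Python sessions, so one option is to derive the class list from the tokenizer indices instead. This is a sketch under the assumption that the one-hot columns follow ascending tokenizer index. `classes_in_column_order` is a hypothetical helper, and the dictionary holds only a few entries copied from `output_tokenizer.word_index`.

```python
def classes_in_column_order(word_index):
    """Hypothetical helper: class names sorted by the integer index the tokenizer assigned."""
    return [name for name, _ in sorted(word_index.items(), key=lambda item: item[1])]

# A few entries copied from output_tokenizer.word_index
word_index = {
    "commonq.assist": 14,
    "faq.application_process": 1,
    "commonq.how": 2,
    "faq.approval_time": 3,
}

classes = classes_in_column_order(word_index)
print(classes)
# ['faq.application_process', 'commonq.how', 'faq.approval_time', 'commonq.assist']
```

Names obtained this way are lowercase, because the tokenizer lowercases the labels. Such a list could be saved alongside model.h5 so that a reloaded model is paired with the same class order.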



In [45]:
text = "Can you help me?"
pred = predictions(text)
get_final_output(pred, unique_intent)

['can', 'you', 'help', 'me']
contact.contact has confidence = 0.15496406
faq.application_process has confidence = 0.11929979
faq.borrow_use has confidence = 0.07631553
faq.apply_register has confidence = 0.06454459
faq.biz_new has confidence = 0.062133495
faq.biz_simpler has confidence = 0.0520627
faq.address_proof has confidence = 0.04624241
commonQ.assist has confidence = 0.044755496
faq.aadhaar_missing has confidence = 0.043643948
faq.approval_time has confidence = 0.041520894
commonQ.name has confidence = 0.036536757
faq.borrow_limit has confidence = 0.03383332
faq.banking_option_missing has confidence = 0.03310464
commonQ.how has confidence = 0.03139223
faq.biz_category_missing has confidence = 0.030802295
commonQ.bot has confidence = 0.026777714
commonQ.just_details has confidence = 0.024570618
faq.bad_service has confidence = 0.021721687
commonQ.wait has confidence = 0.019781144
commonQ.query has confidence = 0.01942338
commonQ.not_giving has confidence = 0.01657323
