## Text classification using a TensorFlow DNN with one-hot encoding

We use Yelp review data taken from a Kaggle competition.

**Goal:**
Predict the star rating (1 to 5) of a review from its text.

- The data is split into training and test sets.
- Each review is tokenized and then stemmed.
- A vocabulary is built from the stemmed training tokens.
- A one-hot encoded vector is generated for each review.
- A TensorFlow DNN model (built with TFLearn) is trained on the encoded review text.
- Accuracy, a confusion matrix, and a classification report are computed for the test-set predictions.
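
The steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions, not the notebook's code: `toy_stem` is a hypothetical stand-in for NLTK's `LancasterStemmer`, whitespace splitting stands in for `nltk.word_tokenize`, and the two reviews are made up.

```python
import re

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: lowercase and strip a few suffixes.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Keep letters only, then split on whitespace.
    return re.sub("[^a-zA-Z]", " ", text).split()

reviews = ["Loved the tacos!", "Tacos were cold, service lacking."]

# Stem every token and collect the vocabulary in a fixed (sorted) order.
stemmed = [[toy_stem(t) for t in tokenize(r)] for r in reviews]
vocabulary = sorted({w for doc in stemmed for w in doc})

# One binary vector per review: 1 if the vocabulary word occurs in it, else 0.
vectors = []
for doc in stemmed:
    present = set(doc)
    vectors.append([1 if w in present else 0 for w in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each vector has one position per vocabulary word, so its length grows with the vocabulary rather than with the review.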


In [69]:
import pandas as pd
import numpy as np
import re

In [128]:
yelp = pd.read_csv('./data/yelp.csv')
yelp.head(5)

| Unnamed: 0 | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


In [256]:
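# keep letters only: replace every non-alphabetic character with a space
X = yelp.text.apply(lambda x: re.sub("[^a-zA-Z]", " ", x))
y = yelp.stars

# sorted list of the unique star ratings
# (taken from y, because y_test is only created in the next cell)
categories = sorted(y.unique())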
X.shape

(10000,)
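
To make the effect of the cleaning regex concrete, here is a small sketch on a made-up sentence (the helper name `keep_letters` is introduced only for illustration). It shows that digits and punctuation are replaced by spaces, which also splits contractions such as "won't" into two tokens.

```python
import re

def keep_letters(text):
    # Replace every character outside a-z / A-Z with a single space.
    return re.sub("[^a-zA-Z]", " ", text)

sample = "Great food!!! 5 stars, won't regret it."
cleaned = keep_letters(sample)

print(repr(cleaned))
print(cleaned.split())
```

Because each removed character becomes exactly one space, the cleaned string keeps the original length; the later tokenization step discards the extra whitespace.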

In [126]:
# split into train and test data set
from sklearn.model_selection import train_test_split
X_train , X_test , y_train, y_test = train_test_split(X, y, random_state=1)

In [83]:
import nltk as nl
from nltk.stem import LancasterStemmer

# instantiate a LancasterStemmer object
stemmer = LancasterStemmer()

In [129]:
# create a DataFrame 'train' holding each review's text and its tokens
list_tokens = [nl.word_tokenize(sentences) for sentences in X_train]
train = pd.DataFrame({'text': X_train, 'tokens': list_tokens})

In [130]:
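# build the vocabulary: the unique stemmed tokens of the training reviews
words = [stemmer.stem(str(word)) for word_list in list_tokens for word in word_list]
# sort so that the column order of the one-hot vectors is reproducible across runs
words = sorted(set(words))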

In [131]:
# prepare training data: one binary, vocabulary-sized vector per review for the DNN
training = []
for index, item in train.iterrows():
    # stems present in this review (a set makes each membership test O(1))
    token_words = set(stemmer.stem(word) for word in item['tokens'])
    onehot = [1 if word in token_words else 0 for word in words]
    # each row is wrapped in a list; it is unpacked later via training_new[:, 0]
    training.append([onehot])

In [132]:
len(training)

7500
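
The encoding loop visits every vocabulary word for every review. A possible alternative, sketched here with a toy vocabulary and a hypothetical helper `encode`, is to map each word to its column index once and then only visit the tokens of the review. Words that are not in the vocabulary (which will happen for test reviews) are simply ignored, matching the behavior of the loop above.

```python
def encode(tokens, word_index):
    # Start from all zeros and switch on the column of each known word.
    row = [0] * len(word_index)
    for token in tokens:
        col = word_index.get(token)
        if col is not None:  # words outside the vocabulary are ignored
            row[col] = 1
    return row

vocabulary = ["bad", "food", "good", "servic"]  # toy, already-stemmed vocabulary
word_index = {word: i for i, word in enumerate(vocabulary)}

print(encode(["good", "food", "good"], word_index))  # [0, 1, 1, 0]
print(encode(["unknown", "servic"], word_index))     # [0, 0, 0, 1]
```

Repeated words still produce a single 1, so the vector records presence, not frequency.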

In [133]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# integer encode the star ratings (1-5 become 0-4)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(y_train)

# binary (one-hot) encode the integer labels
# note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

In [134]:
onehot_encoded[1]

array([0., 0., 0., 1., 0.])

In [136]:
# invert the encoding of the second example (index 1)
inverted = label_encoder.inverse_transform(
    [np.argmax(onehot_encoded[1, :])]
)
print(inverted)

[4]
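
Conceptually, the label encoding and its inversion amount to the following plain-Python sketch. It does not use scikit-learn; the label list is made up, and `one_hot` and `invert` are hypothetical helper names.

```python
stars = [5, 4, 1, 4, 2, 3]  # toy labels, not taken from the dataset

classes = sorted(set(stars))  # [1, 2, 3, 4, 5]
class_to_index = {c: i for i, c in enumerate(classes)}

def one_hot(label):
    # A vector of zeros with a single 1.0 at the position of the class.
    row = [0.0] * len(classes)
    row[class_to_index[label]] = 1.0
    return row

def invert(row):
    # argmax: position of the largest entry, mapped back to the class.
    return classes[max(range(len(row)), key=row.__getitem__)]

encoded = [one_hot(s) for s in stars]
print(encoded[1])          # [0.0, 0.0, 0.0, 1.0, 0.0]
print(invert(encoded[1]))  # 4
```

The same argmax step is what later turns the model's five output probabilities into a single predicted rating.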


In [149]:
# convert training data to numpy array
training_new = np.array(training)

train_x = training_new[:, 0]
train_y = onehot_encoded

In [156]:
import tensorflow as tf
import tflearn

# reset underlying graph data
tf.reset_default_graph()

# Build neural network
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)
 
# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')

# Start training (the optimizer is Adam, a gradient-descent variant; see the log below)
model.fit(train_x, train_y, n_epoch=10, batch_size=8, show_metric=True)
model.save('model.tflearn')

Training Step: 9379  | total loss: 0.13896 | time: 5.115s
| Adam | epoch: 010 | loss: 0.13896 - acc: 0.9796 -- iter: 7496/7500
Training Step: 9380  | total loss: 0.12892 | time: 5.120s
| Adam | epoch: 010 | loss: 0.12892 - acc: 0.9817 -- iter: 7500/7500
--
INFO:tensorflow:D:\surinder\ds\kaggle\yelp_one_hot_encoding\model.tflearn is not in all_model_checkpoint_paths. Manually adding it.
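
The final layer of the network uses a softmax activation, which turns five raw scores into probabilities that sum to 1. A minimal standalone sketch of that function follows; the scores are made up and are not outputs of the trained model.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, -1.0, 0.5, 2.0, 1.1]  # made-up scores for stars 1..5
probs = softmax(logits)

# The predicted class is the index of the largest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs])
print("predicted star:", best + 1)
```

Softmax preserves the ordering of the scores, so the largest score always yields the largest probability.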


In [157]:
# create a dataframe 'test' with test data and tokens for review text
test_tokens = [nl.word_tokenize(sentences) for sentences in X_test]
test = pd.DataFrame({'text': X_test, 'tokens': test_tokens})

In [158]:
# convert test data into one-hot encoded vectors, using the training vocabulary
testing = []
for index, item in test.iterrows():
    # stems present in this review; words unseen in training are ignored
    token_words = set(stemmer.stem(word) for word in item['tokens'])
    onehot = [1 if w in token_words else 0 for w in words]
    testing.append(onehot)

In [159]:
# convert the nested list to a NumPy array, then to a list of row arrays for model.predict
testing = list(np.array(testing))

In [193]:
predicted = model.predict(X=testing)

In [249]:
# map predictions back to star ratings:
# np.argmax(x) returns an index 0-4, while categories holds the ratings
# [1, 2, 3, 4, 5], so index 0 maps to 1 star, index 1 to 2 stars, and so on
predicted_new = pd.Series(list(predicted)).apply(lambda x: categories[np.argmax(x)])

In [254]:
# calculate accuracy for test data
from sklearn import metrics

metrics.accuracy_score(y_test, predicted_new)

0.4804

In [251]:
# print confusion matrix
print(metrics.confusion_matrix(y_test, predicted_new))

[[ 88  41  33  11  12]
 [ 38  71  77  35  13]
 [ 12  49 132 124  48]
 [ 14  33 130 425 282]
 [ 24  23  62 238 485]]
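
The accuracy reported earlier can be recovered from this matrix: it is the sum of the diagonal divided by the total count. The sketch below copies the matrix from the output above and also counts predictions that are off by at most one star, a quantity the notebook does not report but that follows directly from the same numbers.

```python
# Rows are true ratings 1-5, columns are predicted ratings 1-5.
confusion = [
    [88, 41, 33, 11, 12],
    [38, 71, 77, 35, 13],
    [12, 49, 132, 124, 48],
    [14, 33, 130, 425, 282],
    [24, 23, 62, 238, 485],
]
n = len(confusion)

correct = sum(confusion[i][i] for i in range(n))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(correct, total, accuracy)  # 1201 2500 0.4804

# Predictions that miss the true rating by at most one star.
within_one = sum(
    confusion[i][j] for i in range(n) for j in range(n) if abs(i - j) <= 1
)
print(within_one, within_one / total)  # 2180 0.872
```

Most errors fall next to the diagonal, that is, the model usually confuses adjacent ratings such as 4 and 5 stars.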


In [253]:
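# confusion matrix using the TensorFlow function
# note: labels are the star ratings 1-5 while predictions are argmax indices 0-4,
# so the result is 6x6 with an empty first row and last column: row r holds the
# reviews rated r stars, column c the reviews predicted as c + 1 stars.
# Passing [s - 1 for s in y_test] as labels would align it with the scikit-learn matrix.
labels = list(y_test)
predictions = [np.argmax(x) for x in predicted]
confusion_matrix = tf.confusion_matrix(labels, predictions)

with tf.Session():
    print('\nConfusion Matrix:\n', confusion_matrix.eval())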


Confusion Matrix:
 [[  0   0   0   0   0   0]
 [ 88  41  33  11  12   0]
 [ 38  71  77  35  13   0]
 [ 12  49 132 124  48   0]
 [ 14  33 130 425 282   0]
 [ 24  23  62 238 485   0]]
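
The extra all-zero row and column appear because the labels are 1-based star ratings while the predictions are 0-based argmax indices. The pure-Python sketch below reproduces the effect on toy values; the helper sizes the matrix from the largest value seen, which is how `tf.confusion_matrix` behaves when `num_classes` is not given.

```python
def confusion_matrix(labels, predictions):
    # Square matrix sized by the largest value seen in either list.
    size = max(max(labels), max(predictions)) + 1
    matrix = [[0] * size for _ in range(size)]
    for label, pred in zip(labels, predictions):
        matrix[label][pred] += 1
    return matrix

labels = [1, 2, 5, 5, 3]       # star ratings, 1-based (toy values)
predictions = [0, 1, 4, 3, 2]  # argmax indices, 0-based (toy values)

# Mixed bases: 6 x 6 with an empty first row and an empty last column.
shifted = confusion_matrix(labels, predictions)

# Same base on both sides: a 5 x 5 matrix with correct hits on the diagonal.
aligned = confusion_matrix([s - 1 for s in labels], predictions)

for row in shifted:
    print(row)
print()
for row in aligned:
    print(row)
```

Shifting the labels down by one (or the predictions up by one) before building the matrix removes the offset.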


In [261]:
# generate classification report
print(metrics.classification_report(y_test, predicted_new))

             precision    recall  f1-score   support

          1       0.50      0.48      0.49       185
          2       0.33      0.30      0.31       234
          3       0.30      0.36      0.33       365
          4       0.51      0.48      0.50       884
          5       0.58      0.58      0.58       832

avg / total       0.48      0.48      0.48      2500
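
The per-class figures in this report can be derived from the scikit-learn confusion matrix printed earlier: precision divides the diagonal entry by its column sum, recall divides it by its row sum (the support), and F1 is their harmonic mean. A short sketch that recomputes them from the copied matrix:

```python
# Rows are true ratings 1-5, columns are predicted ratings 1-5.
confusion = [
    [88, 41, 33, 11, 12],
    [38, 71, 77, 35, 13],
    [12, 49, 132, 124, 48],
    [14, 33, 130, 425, 282],
    [24, 23, 62, 238, 485],
]

report = []
for k in range(len(confusion)):
    true_positives = confusion[k][k]
    predicted_as_k = sum(row[k] for row in confusion)  # column sum
    support = sum(confusion[k])                        # row sum
    precision = true_positives / predicted_as_k
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    report.append((k + 1, precision, recall, f1, support))

for star, precision, recall, f1, support in report:
    print(f"{star}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}")
```

The printed values agree with the report above after rounding to two decimals.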

