# Natural Language Processing - Classification

**Adapted from: Emmanuel Dufourq** (edufourq@gmail.com - [www.emmanueldufourq.com](http://www.emmanueldufourq.com) )

July 2018

*Made for the Theoretical Foundations of Data Science 2018 (African Institute for Mathematical Sciences)*

Adapted from https://cloud.google.com/blog/big-data/2017/10/intro-to-text-classification-with-keras-automatically-tagging-stack-overflow-posts

### Objective:

Construct a model that can classify text data. Here we are interested in predicting the star rating (1 to 5) of Amazon reviews from their text.


**Important Clarification**

This example uses Amazon reviews of the [Cards against Humanity](https://www.amazon.com/Cards-Against-Humanity-LLC-CAHUS/product-reviews/B004S8F7QM/ref=cm_cr_arp_d_hist_2?ie=UTF8&filterByStar=two_star&reviewerType=all_reviews&pageNumber=1#reviews-filter-bar) product.

All the collected data sets are available for download from [this](https://github.com/jaortiz117/HPAI-HW7) GitHub repository.

## Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import keras
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from keras.preprocessing import text, sequence
from keras import utils
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
import numpy as np

Using TensorFlow backend.


## Download the data

In [0]:
df = pd.read_csv("https://raw.githubusercontent.com/jaortiz117/HPAI-HW7/master/merged-reviews.csv")

## Look at some of the data

In [3]:
df.head()

Unnamed: 0,post,tags
0,Is this what the world has come to?!Utterly ba...,1
1,Too much for a family game. Didn't realize ho...,1
2,"a 13 yr old meth head might think it is funny,...",1
3,"If you are 18 and drunk, this game is for you,...",1
4,"It’s just not very funny, compared to Relative...",1


## Print out the unique tags

In [4]:
df['tags'].unique()

array([1, 2, 3, 4, 5])

## Determine the number of classes

In [0]:
num_classes = len(df['tags'].unique())

In [6]:
num_classes

5

## Check how many instances for each class

In [7]:
df['tags'].value_counts()

5    100
4    100
3    100
2    100
1    100
Name: tags, dtype: int64

## Determine the number of words in each instance

In [0]:
df['Word Count'] = df['post'].apply(lambda x: len(x.split(' ')))

In [9]:
df.sort_values(by=['Word Count'], ascending=False)

Unnamed: 0,post,tags,Word Count
342,"Cards Against Humanity: 4+ Players, Ages 17+, ...",4,622
487,This is NOT a game for your parents!!!!!! But...,5,452
402,I gave this box to my boyfriend for Christmas ...,5,449
433,"Sorry, infer what you will from this, but this...",5,414
399,Ever wondered what a grown-up version of Apple...,5,412
453,My family's love of margaritas and juvenile hu...,5,399
430,"Firstly, let me just say that I took some of t...",5,396
475,I've played a lot of games over the years. A l...,5,395
482,Sometimes a rating of 5/5 stars isn’t enough. ...,5,394
401,"If you like to laugh, and you don't take life ...",5,389
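
One caveat about the word count above: `x.split(' ')` splits on every single space, so a double space produces an empty string that is also counted as a word. The short sketch below uses a sentence from one of the reviews in this dataset to compare it with `x.split()`, which splits on any run of whitespace. It is only an illustration of the difference, not a change to the pipeline used in this notebook.

```python
# Sentence from one of the reviews; note the double space after "group."
review = "Must be for 20something age group.  I wouldn't play this if you paid me."

count_single_space = len(review.split(' '))  # the empty string between the two spaces is counted
count_whitespace = len(review.split())       # runs of whitespace are treated as one separator

print(count_single_space)  # 15
print(count_whitespace)    # 14
```

For a rough ranking of review lengths the difference is small, so either variant serves the purpose here.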


## Convert the data into X and Y

In [0]:
X = df['post'].values

In [0]:
Y = df['tags']

## Split the data into training and testing

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [13]:
X_train[0]

'Good luck not peeing your pants while playing this game! Awesome to play with friends and surprisingly hilarious to play with parents/aunts/uncles/etc. It’s super inappropriate and incredibly funny. This is also a great game for a group, whether long term friends or new acquaintances. It’ll break the ice for sure and it also easy to pause for snack/drink refills.'

## Tokenize

The Keras `Tokenizer` counts the unique words in a corpus and allocates a unique integer index to each of them. We can limit the vocabulary with `num_words`, which keeps only the most frequent words. In our case, we set this limit to 1000. The documentation is here: https://keras.io/preprocessing/text/#tokenizer

In [0]:
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
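
To make the idea of a frequency-ranked vocabulary concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: the helper `build_word_index` and the toy `reviews` list are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Rank words by frequency and keep the `num_words` most frequent (hypothetical helper)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    most_common = counts.most_common(num_words)
    # Index 1 goes to the most frequent word, 2 to the next, and so on
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}

# Toy corpus (made up for illustration)
reviews = ["the game is fun", "the game is long", "the game", "the"]
word_index = build_word_index(reviews, num_words=3)
print(word_index)  # {'the': 1, 'game': 2, 'is': 3}
```

Words outside the top `num_words` ("fun" and "long" here) are simply dropped, which is the effect we rely on when capping the vocabulary.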

Now we can convert each post in our dataset into a vector of size *max_words*. The vector is made up of 0's and 1's, with a 1 at the index of every vocabulary word that appears in the post. For example, if the tokenized words are [what, I, you, where, cat], then the sentence "where is the cat" is converted into [0, 0, 0, 1, 1], which indicates that the words "where" and "cat" are present. In other words, the tokenizer builds a vocabulary, and each position in the vector records whether the corresponding vocabulary word occurs in the text. The tokenizer first has to be fitted to some data, so we use the training data:

In [0]:
tokenize.fit_on_texts(X_train) 
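
The binary-vector example from the paragraph above can be reproduced with a few lines of plain Python. This is a simplified sketch, not what Keras runs internally: `to_binary_vector` is a hypothetical helper, and the real `Tokenizer` starts its word indices at 1 (as the `word_index` output shows), so positions are shifted compared with this toy version.

```python
def to_binary_vector(sentence, vocabulary):
    """Return 1 for each vocabulary word present in the sentence, else 0 (hypothetical helper)."""
    words = set(sentence.lower().split())
    return [1 if vocab_word.lower() in words else 0 for vocab_word in vocabulary]

vocabulary = ["what", "I", "you", "where", "cat"]
vector = to_binary_vector("where is the cat", vocabulary)
print(vector)  # [0, 0, 0, 1, 1]
```

Note that word order and word counts are discarded: only the presence or absence of each vocabulary word is kept.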

We can take a look at the words and the indices in the vocabulary here:

In [16]:
tokenize.word_index

{'the': 1,
 'to': 2,
 'a': 3,
 'and': 4,
 'game': 5,
 'of': 6,
 'i': 7,
 'it': 8,
 'you': 9,
 'this': 10,
 'is': 11,
 'cards': 12,
 'for': 13,
 'that': 14,
 'with': 15,
 'in': 16,
 'are': 17,
 'but': 18,
 'not': 19,
 'or': 20,
 'be': 21,
 'have': 22,
 'as': 23,
 'if': 24,
 'play': 25,
 'so': 26,
 'was': 27,
 'my': 28,
 'we': 29,
 'fun': 30,
 'all': 31,
 'people': 32,
 'some': 33,
 'card': 34,
 'can': 35,
 'get': 36,
 'on': 37,
 'just': 38,
 'your': 39,
 'at': 40,
 "it's": 41,
 'they': 42,
 'played': 43,
 'friends': 44,
 'will': 45,
 'playing': 46,
 'like': 47,
 'more': 48,
 'one': 49,
 'had': 50,
 'out': 51,
 'who': 52,
 'would': 53,
 'box': 54,
 'about': 55,
 'apples': 56,
 'time': 57,
 'them': 58,
 'there': 59,
 'when': 60,
 'up': 61,
 'funny': 62,
 'really': 63,
 'after': 64,
 'only': 65,
 'an': 66,
 'were': 67,
 'even': 68,
 'very': 69,
 'their': 70,
 'expansion': 71,
 'humor': 72,
 'what': 73,
 'much': 74,
 'buy': 75,
 'group': 76,
 'family': 77,
 'great': 78,
 "don't": 79,
 'firs

Then, we go ahead and convert the training and testing features into their corresponding vectors. The size of these vectors is based on the size of the vocabulary, in our case 1000.

In [0]:
X_train_token = tokenize.texts_to_matrix(X_train)
X_test_token = tokenize.texts_to_matrix(X_test)

In [18]:
X_train_token[0]

array([0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 1., 1., 0., 1., 0., 1., 0.,
       0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 1.

Check the size of one vector here:

In [19]:
len(X_train_token[0])

1000

Now we need to convert the labels (targets) into their corresponding one-hot encoded values. One way to do this is to convert each label into a number, and then convert the number into a one-hot encoded vector.

## Encode the targets

In [0]:
# Use the sklearn LabelEncoder to map each rating (1-5) to a class index (0-4)
encoder = LabelEncoder()
encoder.fit(Y_train)
Y_train_encoded = encoder.transform(Y_train)
Y_test_encoded = encoder.transform(Y_test)

In [21]:
Y_train_encoded[0]

4

Now convert into one-hot encoded vectors

In [0]:
Y_train_hot = utils.to_categorical(Y_train_encoded, num_classes)
Y_test_hot = utils.to_categorical(Y_test_encoded, num_classes)

In [23]:
Y_train_hot[0]

array([0., 0., 0., 0., 1.], dtype=float32)
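
The two steps above (label to index, then index to one-hot vector) can be sketched in plain Python as follows. The helper `one_hot_encode` and the sample ratings are hypothetical and only mirror what `LabelEncoder` and `to_categorical` do here; they are not the library code.

```python
def one_hot_encode(labels):
    """Map each label to an index, then to a one-hot vector (hypothetical helper)."""
    classes = sorted(set(labels))                       # e.g. [1, 2, 3, 4, 5]
    index_of = {label: i for i, label in enumerate(classes)}
    vectors = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index_of[label]] = 1.0                      # single 1 at the class index
        vectors.append(row)
    return classes, vectors

# Made-up ratings for illustration
classes, vectors = one_hot_encode([5, 1, 3, 4, 2, 5])
print(classes)     # [1, 2, 3, 4, 5]
print(vectors[0])  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

A rating of 5 therefore ends up at index 4, which matches the encoded value and one-hot vector printed above.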

Check the shapes.

There are 335 training samples and 165 testing samples.

Each feature sample is a vector of length 1000 and each target is of length 5 (since there are 5 unique classes and the values have been one-hot encoded).

In [24]:
print('x_train shape:', X_train_token.shape)
print('x_test shape:', X_test_token.shape)
print('y_train shape:', Y_train_hot.shape)
print('y_test shape:', Y_test_hot.shape)

x_train shape: (335, 1000)
x_test shape: (165, 1000)
y_train shape: (335, 5)
y_test shape: (165, 5)
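
The sample counts in these shapes follow from the split size. Here is a quick arithmetic check, assuming the test count is `test_size` times the number of samples, rounded up, with the training set taking the rest. That rounding rule is an assumption about `train_test_split`, although it agrees with the shapes printed above.

```python
import math

total_samples = 500   # 5 ratings x 100 reviews each
test_size = 0.33      # value passed to train_test_split

n_test = math.ceil(total_samples * test_size)  # assumed rounding rule
n_train = total_samples - n_test

print(n_train, n_test)  # 335 165
```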


## Hyper-parameters

In [0]:
batch_size = 16
epochs = 10

## Build the model

In [26]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
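
As a sanity check on the model size, the number of trainable parameters can be worked out by hand. Each `Dense` layer has one weight per input-output pair plus one bias per output; the `Activation` and `Dropout` layers add no parameters. The sketch below is plain arithmetic and should agree with what `model.summary()` reports, though that output is not shown in this notebook.

```python
max_words = 1000     # input vector length
hidden_units = 512   # size of the first Dense layer
num_classes = 5      # size of the output layer

hidden_params = max_words * hidden_units + hidden_units    # weights + biases
output_params = hidden_units * num_classes + num_classes   # weights + biases
total_params = hidden_params + output_params

print(hidden_params)  # 512512
print(output_params)  # 2565
print(total_params)   # 515077
```

With roughly half a million parameters and only 335 training samples, the model can overfit easily, which is one reason the `Dropout` layer is useful here.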


In [27]:
history = model.fit(X_train_token, Y_train_hot,batch_size=batch_size,
                    epochs=epochs,verbose=1,
                    validation_split=0.1)

Instructions for updating:
Use tf.cast instead.
Train on 301 samples, validate on 34 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Check accuracy

In [28]:
# Evaluate the accuracy of our trained model
score = model.evaluate(X_test_token, Y_test_hot,
                       batch_size=batch_size, verbose=1)
print('Test accuracy:', score[1])

Test accuracy: 0.4545454549066948
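
To put this accuracy in context, it helps to compare it with random guessing. With five classes of equal size in the full dataset, picking a rating uniformly at random would be right about 20% of the time. The sketch below uses only numbers printed in this notebook; the test split itself is not perfectly balanced, so treat the baseline as approximate.

```python
num_classes = 5
test_samples = 165
reported_accuracy = 0.4545454549066948  # value printed above

chance_accuracy = 1 / num_classes                         # uniform random guessing
correct_predictions = round(reported_accuracy * test_samples)

print(chance_accuracy)      # 0.2
print(correct_predictions)  # 75
```

So the model gets about 75 of the 165 test reviews right, a bit more than twice the chance level.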


In [29]:
Y_test.values

array([4, 1, 4, 2, 2, 4, 4, 2, 1, 5, 1, 2, 5, 1, 4, 4, 1, 1, 4, 5, 5, 5,
       3, 4, 1, 5, 3, 3, 2, 4, 5, 4, 5, 2, 1, 4, 5, 1, 3, 1, 4, 1, 5, 2,
       1, 1, 1, 1, 5, 1, 1, 2, 2, 5, 2, 1, 2, 1, 3, 1, 5, 3, 4, 5, 1, 1,
       5, 1, 3, 4, 4, 4, 3, 3, 4, 4, 3, 1, 5, 5, 2, 1, 5, 2, 4, 1, 2, 5,
       4, 3, 4, 5, 4, 4, 5, 4, 1, 1, 5, 1, 3, 5, 1, 4, 4, 1, 4, 3, 4, 2,
       2, 5, 4, 1, 5, 3, 3, 3, 2, 1, 3, 2, 4, 2, 3, 3, 5, 3, 2, 4, 1, 2,
       5, 3, 2, 1, 2, 1, 1, 3, 1, 3, 5, 1, 3, 3, 1, 3, 3, 5, 1, 2, 1, 1,
       5, 4, 1, 3, 5, 1, 2, 4, 5, 1, 5])

## Predict

In [30]:
text_labels = encoder.classes_ 
for i in range(10):
    prediction = model.predict(np.array([X_test_token[i]]))
    predicted_label = text_labels[np.argmax(prediction)]
    print('Text: ',X_test[i])
    print('Actual Rating: ' + str(Y_test.values[i]))
    print('Predicted Rating: ' + str(predicted_label) + '\n')

Text:  This game is perfect for getting friends together, and having a good laugh. Apples to Apples but a more mature way of putting it. After one really long game though you know every single card. I will be ordering an expansion pack for the game soon, because the game seems like a money trap. But I like it and I like to have guests over. Its a great way to get to know your friends humor and how they react to certain 'Horrible' situations.
Actual Rating: 4
Predicted Rating: 4

Text:  I thought I would be able to play this with my granddaughters. No way. The cards are rude and crass and inappropriate. Not funny to me at all. Must be for 20something age group.  I wouldn't play this if you paid me.
Actual Rating: 1
Predicted Rating: 1

Text:  Very fun game hours of laughter. The only thing I don’t like is that once you’re through the cards once, it’s not funny anymore. You’ll have to buy more cards.
Actual Rating: 4
Predicted Rating: 2

Text:  People say this is so much fun. I don't rea

## Predicting user input:

Now you get to add your own review and let the program predict its rating!

In [34]:
new_input = input("write review: ")
new_rating = input("write amount of stars (1 through 5): ")

write review: This product is very good. I only find myself finding flaws in some of the jokes that can be repetitive, but overall it is very fun.
write amount of stars (1 through 5): 3


In [0]:
new_input_token = tokenize.texts_to_matrix([new_input])
prediction = model.predict(np.array([new_input_token[0]]))
predicted_label = text_labels[np.argmax(prediction)]

In [36]:
print("Text: " + new_input)
print("Actual Rating: " + new_rating)
print("Predicted Rating: " + str(predicted_label) + "\n")

Text: This product is very good. I only find myself finding flaws in some of the jokes that can be repetitive, but overall it is very fun.
Actual Rating: 3
Predicted Rating: 2
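
The prediction step relies on taking the index of the largest softmax output and looking up the matching label. Here is a minimal sketch of that lookup without Keras or NumPy; the helper `predicted_rating` and the probabilities are hypothetical and chosen only for illustration.

```python
def predicted_rating(probabilities, class_labels):
    """Return the label whose probability is largest (hypothetical helper)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_labels[best_index]

class_labels = [1, 2, 3, 4, 5]                          # same order as encoder.classes_
example_probabilities = [0.10, 0.40, 0.30, 0.15, 0.05]  # made-up softmax output

rating = predicted_rating(example_probabilities, class_labels)
print(rating)  # 2
```

In an example like this the model is far from certain: the runner-up rating has almost as much probability, which is worth keeping in mind when a prediction is off by one star.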

