# Travel agency's reviews - classification using LSTM

Implement and evaluate an LSTM-based classifier that predicts the rating (1 to 5) of a user review from its text.

In [1]:
import pandas as pd

reviews = pd.read_csv(
    'https://raw.githubusercontent.com/mlcollege/natural-language-processing/master/data/en_reviews.csv',
    sep='\t', header=None, names=['rating', 'text'])
reviews[35:45]

| | rating | text |
|---|---|---|
| 35 | 5 | I bought the cheapest tickets through this ser... |
| 36 | 5 | Such a pleasure to know that you will be prope... |
| 37 | 5 | I always use this website to look for flights ... |
| 38 | 2 | A startup that finds discount flight tickets '... |
| 39 | 5 | Excellent customer service, fast and kind. Wan... |
| 40 | 4 | very good service from Quan Costa to help me w... |
| 41 | 3 | .@Skypickercom Finds Cheap Flights 'Hidden' On... |
| 42 | 5 | I have a problem with my tickets skypicker don... |
| 43 | 4 | Even though it took a bit time untill an agent... |
| 44 | 5 | Today I had a great experience with one of Kiw... |


## Preparation of train and test data sets
Separate the ratings (the target values) from the review texts (the input data).

In [2]:
target = reviews['rating']
data = reviews['text']

print(data[:5])
print(target[:5])

0    A voucher to nowhere #skypickerfail 2400 out o...
1    I booked with Kiwi for the first time, just a ...
2    I would like to say THANKS YOU for your custom...
3    I just noticed 2 hours before my flight that I...
4    This is the first time I have dealt with Skypi...
Name: text, dtype: object
0    2
1    5
2    5
3    5
4    2
Name: rating, dtype: int64


Set the hyperparameters for tokenization, padding, and the embedding layer.


In [3]:
vocab_size = 5000
embedding_dim = 64
max_length = 100
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>' # OOV = Out of Vocabulary

Shuffle the data and split it into a training set (90%) and a test set (10%).

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.1)
print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

Train size: 7013
Test size: 780


Tokenize reviews using the Keras tokenizer.

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the vocabulary on the training texts only, then encode both splits.
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
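
To make the roles of `num_words` and `oov_token` concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the helper names `simple_tokens`, `fit_word_index`, and `texts_to_ids` are hypothetical, and text cleaning is reduced to lowercasing and stripping ASCII punctuation. Words are ranked by frequency in the training texts, index 0 is left free for padding, index 1 is the OOV token, and any word whose index is not below `num_words` is mapped to OOV.

```python
from collections import Counter
import string

_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def simple_tokens(text):
    """Lowercase, drop ASCII punctuation, split on whitespace (simplified)."""
    return text.lower().translate(_STRIP_PUNCTUATION).split()


def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    word_index = {oov_token: 1}
    for index, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = index
    return word_index


def texts_to_ids(texts, word_index, num_words, oov_token="<OOV>"):
    """Encode texts; unknown or too-rare words become the OOV index."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in simple_tokens(text):
            index = word_index.get(word)
            known = index is not None and index < num_words
            ids.append(index if known else oov_id)
        sequences.append(ids)
    return sequences


# Toy data, invented for illustration only.
toy_train_texts = ["Great service, great price", "Bad service"]
toy_word_index = fit_word_index(toy_train_texts)
toy_encoded = texts_to_ids(["Great flight, bad food"], toy_word_index, num_words=4)
print(toy_word_index)
print(toy_encoded)
```

In the toy run, "flight" and "food" were never seen during fitting, and "bad" was seen but ranks too low for `num_words=4`, so all three end up as the OOV index.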

Pad and truncate token sequences.

In [6]:
X_train = pad_sequences(X_train, maxlen=max_length, padding=padding_type, truncating=trunc_type)
X_test = pad_sequences(X_test, maxlen=max_length, padding=padding_type, truncating=trunc_type)
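
As a rough illustration of what `padding='post'` and `truncating='post'` mean, the following pure-Python sketch applies both to a single sequence. The helper `pad_or_truncate_post` is a hypothetical stand-in written for this example, not part of Keras, and the token ids are made up.

```python
def pad_or_truncate_post(sequence, maxlen, value=0):
    """Keep the first `maxlen` ids, then fill the tail with `value`."""
    kept = list(sequence[:maxlen])
    return kept + [value] * (maxlen - len(kept))


short_review = pad_or_truncate_post([7, 3, 12], maxlen=5)
long_review = pad_or_truncate_post([7, 3, 12, 4, 9, 21, 8], maxlen=5)
print(short_review)  # zeros appended at the end
print(long_review)   # tokens beyond position 5 dropped
```

With both options set to `'post'`, the beginning of each review is preserved: short reviews get trailing zeros, and reviews longer than `max_length` lose their tail.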


Encode target values to one-hot encoding.

In [7]:
from tensorflow.keras.utils import to_categorical

n_classes = 5
# Ratings run from 1 to 5; subtract 1 so that class indices start at 0.
y_train = to_categorical(y_train - 1, n_classes)
y_test = to_categorical(y_test - 1, n_classes)
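
The encoding step can be reproduced with plain NumPy, which may help to see why the ratings are shifted by one. This is a minimal sketch on made-up ratings; the function name `one_hot_ratings` is an assumption for this example.

```python
import numpy as np


def one_hot_ratings(ratings, n_classes=5):
    """Map ratings 1..n_classes to one-hot rows of length n_classes."""
    class_ids = np.asarray(ratings) - 1  # rating 1 -> class 0, rating 5 -> class 4
    return np.eye(n_classes, dtype="float32")[class_ids]


toy_one_hot = one_hot_ratings([2, 5, 1])
print(toy_one_hot)
```

Each row contains a single 1 at the position of the shifted rating, so a rating of 2 becomes `[0, 1, 0, 0, 0]`.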

## Model definition

Define the neural network model: a word *embedding* layer, dropout, a bidirectional LSTM, and a softmax output layer with one unit per rating class.

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding, Bidirectional

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))    # token ids -> 64-dimensional vectors
model.add(Dropout(0.5))                            # regularization on the embeddings
model.add(Bidirectional(LSTM(64)))                 # forward + backward LSTM, 128 outputs
model.add(Dense(n_classes, activation="softmax"))  # one probability per rating class
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          320000    
_________________________________________________________________
dropout (Dropout)            (None, None, 64)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 5)                 645       
Total params: 386,693
Trainable params: 386,693
Non-trainable params: 0
_________________________________________________________________
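
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector. The variable names are chosen for this example.

```python
vocab_size = 5000
embedding_dim = 64
lstm_units = 64
n_classes = 5

# Embedding: one vector of length embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights (embedding_dim x units),
# recurrent weights (units x units), and a bias (units).
lstm_one_direction = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
bidirectional_params = 2 * lstm_one_direction

# Dense: the concatenated forward/backward output (2 * units) to n_classes, plus bias.
dense_params = (2 * lstm_units) * n_classes + n_classes

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

The results match the `Param #` column above: 320,000 for the embedding, 66,048 for the bidirectional LSTM, 645 for the dense layer, and 386,693 in total.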


## Model training

In [9]:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, weight_decay=1e-6)
model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizer,
    metrics=['accuracy'],
)
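
For intuition about the `categorical_crossentropy` loss, here is a small worked example in plain Python. It is a sketch on made-up probabilities and leaves out the safeguards a production implementation needs (for instance against taking the logarithm of zero). With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a probability vector."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0
    )


five_star_target = [0, 0, 0, 0, 1]               # true rating is 5
confident_probs = [0.05, 0.05, 0.05, 0.05, 0.80]  # made-up model output
uniform_probs = [0.20, 0.20, 0.20, 0.20, 0.20]    # made-up model output

loss_confident = categorical_crossentropy(five_star_target, confident_probs)
loss_uniform = categorical_crossentropy(five_star_target, uniform_probs)
print(round(loss_confident, 4), round(loss_uniform, 4))
```

A confident, correct prediction gives a loss of about 0.22, while a uniform guess over five classes gives about 1.61 (that is, ln 5), which is a useful baseline when reading the training log.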

In [10]:
# Note: the held-out test split is also used as the validation set during training.
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
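
Once the model is trained, its softmax outputs have to be mapped back to ratings before they can be compared with the true labels. The following NumPy sketch shows one way to do that. The probability rows are invented for illustration and are not results from this notebook; in practice they would come from something like `model.predict(X_test)`.

```python
import numpy as np

# Made-up softmax outputs for three reviews (each row sums to 1).
toy_probs = np.array([
    [0.05, 0.10, 0.15, 0.30, 0.40],
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.10, 0.15, 0.50, 0.15, 0.10],
])
# Made-up one-hot ground truth for the same three reviews.
toy_truth = np.array([
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
])

# argmax gives class ids 0..4; add 1 to undo the earlier shift to ratings 1..5.
predicted_ratings = toy_probs.argmax(axis=1) + 1
true_ratings = toy_truth.argmax(axis=1) + 1
toy_accuracy = float((predicted_ratings == true_ratings).mean())
print(predicted_ratings, true_ratings, toy_accuracy)
```

The same decoded ratings could also feed a confusion matrix, which is often more informative than accuracy alone when the rating classes are unevenly represented.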
