# Travel agency's reviews - classification using word2vec vectors

Implement and evaluate a classifier of user reviews using a Support Vector Machine with an RBF kernel. Use word2vec vectors as features.

In [1]:
import pandas as pd

reviews = pd.read_csv('../data/en_reviews.csv', sep='\t', header=None, names=['rating', 'text'])
reviews[35:45]

    rating  text
35       5  I bought the cheapest tickets through this ser...
36       5  Such a pleasure to know that you will be prope...
37       5  I always use this website to look for flights ...
38       2  A startup that finds discount flight tickets '...
39       5  Excellent customer service, fast and kind. Wan...
40       4  very good service from Quan Costa to help me w...
41       3  .@Skypickercom Finds Cheap Flights 'Hidden' On...
42       5  I have a problem with my tickets skypicker don...
43       4  Even though it took a bit time untill an agent...
44       5  Today I had a great experience with one of Kiw...


## Preparation of train and test data sets
Separate the target values (ratings) from the review texts and define display names for the five rating classes.

In [2]:
target = reviews['rating']
data = reviews['text']
names = ['Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5']

print(data[:5])
print(target[:5])

0    A voucher to nowhere #skypickerfail 2400 out o...
1    I booked with Kiwi for the first time, just a ...
2    I would like to say THANKS YOU for your custom...
3    I just noticed 2 hours before my flight that I...
4    This is the first time I have dealt with Skypi...
Name: text, dtype: object
0    2
1    5
2    5
3    5
4    2
Name: rating, dtype: int64


Tokenize the texts.

In [3]:
from nltk.tokenize.casual import casual_tokenize
tokens = data.apply(casual_tokenize)

Read the word2vec vectors from '../data/crawl-300.vec' and store them in a dictionary. The keys of the dictionary are tokens; the values are their word2vec vectors.

In [4]:
import numpy as np
DIM = 300  # dimension of the word2vec vectors

word_vectors = {}

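with open('../data/crawl-300.vec', encoding='utf-8') as f:
    f.readline()  # skip the first (header) line
    for line in f:  # iterate lazily instead of loading every line with readlines()
        items = line.strip().split()
        # items[0] is the token, the remaining items are the vector components
        word_vectors[items[0]] = np.array(items[1:], dtype=np.float32)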
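
The loading loop can be tried without the full vector file. The sketch below parses a tiny, made-up sample from memory; it assumes the usual `.vec` text layout, in which a header line holds the vocabulary size and the dimension. `SAMPLE_VEC` and `load_vectors` are hypothetical names used only for illustration.

```python
import io

import numpy as np

# Hypothetical three-word sample in the .vec text layout (not real embeddings).
SAMPLE_VEC = """3 4
flight 0.1 0.2 0.3 0.4
cheap -0.5 0.0 0.5 1.0
agent 1.0 1.0 1.0 1.0
"""


def load_vectors(handle):
    """Parse a .vec-style text stream into a {token: vector} dictionary."""
    # Assumed header layout: "<vocabulary size> <dimension>".
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        items = line.strip().split()
        vectors[items[0]] = np.array(items[1:], dtype=np.float32)
    return vectors, count, dim


sample_vectors, sample_count, sample_dim = load_vectors(io.StringIO(SAMPLE_VEC))
print(sample_count, sample_dim)
print(sample_vectors['cheap'])
```

Comparing the header values with `len(sample_vectors)` and the vector shapes is a cheap sanity check before loading the real file.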

Convert the document representation from 'list of tokens' to 'average word2vec vector'. Experiment with other aggregation methods.

In [5]:
vectors = []
for line in tokens:
    # Keep only the tokens that have a word2vec vector.
    vec_list = [word_vectors[token] for token in line if token in word_vectors]
    if not vec_list:
        # No known token in this review: fall back to a zero vector.
        vec_list.append(np.zeros(DIM))
    vectors.append(np.mean(vec_list, axis=0))
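
As a starting point for experimenting with other aggregation methods, here is a minimal sketch on toy data. The helper `aggregate`, its `how` options, and the two-dimensional toy vectors are assumptions made for illustration; they are not part of the notebook above.

```python
import numpy as np

AGGREGATIONS = ('mean', 'sum', 'max', 'mean_max')


def aggregate(doc_tokens, vectors_by_token, dim, how='mean'):
    """Combine the vectors of the known tokens into one document vector."""
    if how not in AGGREGATIONS:
        raise ValueError('unknown aggregation: {}'.format(how))
    known = [vectors_by_token[t] for t in doc_tokens if t in vectors_by_token]
    if not known:
        # No known token: return zeros of the matching output size.
        return np.zeros(2 * dim if how == 'mean_max' else dim)
    stacked = np.vstack(known)
    if how == 'mean':
        return stacked.mean(axis=0)
    if how == 'sum':
        return stacked.sum(axis=0)
    if how == 'max':
        return stacked.max(axis=0)
    # 'mean_max': concatenate both views, which doubles the feature count.
    return np.concatenate([stacked.mean(axis=0), stacked.max(axis=0)])


# Toy two-dimensional vectors, for illustration only.
toy_vectors = {'good': np.array([1.0, 0.0]), 'service': np.array([0.0, 3.0])}
toy_doc = ['good', 'service', 'unknown-token']
for how in AGGREGATIONS:
    print(how, aggregate(toy_doc, toy_vectors, dim=2, how=how))
```

Note that `sum` depends on review length while `mean` does not, so the two can behave differently when review lengths vary widely.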

Shuffle the data and split it into training and test sets.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, target, test_size=0.2)
print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

Train size: 6234
Test size: 1559
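
The split above is random and unseeded, so the exact numbers change between runs. A possible variant, sketched here on made-up labels, fixes `random_state` for reproducibility and passes `stratify` so that both parts keep the class proportions. The toy data and the chosen seed are assumptions; this is not the split that produced the results in this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 five-star and 20 one-star reviews.
toy_X = [[float(i)] for i in range(100)]
toy_y = [5] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0)

# Both parts keep the 80:20 class ratio of the full toy data set.
print('Train classes:', Counter(y_tr))
print('Test classes:', Counter(y_te))
```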


## Classification

Build an ML pipeline that standardizes the features and trains an SVM classifier, then predict the ratings of the test reviews.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf_pipeline = Pipeline([('std', StandardScaler()),
                         ('svm', SVC(kernel='rbf'))])
    
clf_pipeline.fit(X_train, y_train)

y_pred = clf_pipeline.predict(X_test)
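
A single train/test split yields one estimate of the performance. One way to get a more stable picture is to cross-validate the whole pipeline, so that the scaler is refitted inside each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the document vectors; the data, the fold count, and the seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the document vectors (not the review data).
toy_X, toy_y = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=0)

toy_pipeline = Pipeline([('std', StandardScaler()),
                         ('svm', SVC(kernel='rbf'))])

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(toy_pipeline, toy_X, toy_y, cv=cv)

print('Fold accuracies:', scores)
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```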

## Evaluation

Evaluate the model using accuracy, the confusion matrix, and per-class precision, recall, and F1-score.

In [8]:
from sklearn import metrics

print()
print("ML MODEL REPORT")
print("Accuracy: {}".format(metrics.accuracy_score(y_test, y_pred)))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred, target_names=names))


ML MODEL REPORT
Accuracy: 0.8152661962796665
Confusion matrix:
[[ 150    0    0    0   14]
 [  16   20    0    0    4]
 [  16    0   19    0   31]
 [   6    0    0    5  190]
 [  11    0    0    0 1077]]
             precision    recall  f1-score   support

    Class 1       0.75      0.91      0.83       164
    Class 2       1.00      0.50      0.67        40
    Class 3       1.00      0.29      0.45        66
    Class 4       1.00      0.02      0.05       201
    Class 5       0.82      0.99      0.90      1088

avg / total       0.85      0.82      0.75      1559
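
The per-class figures in the report follow directly from the confusion matrix. The sketch below recomputes them with NumPy from the matrix printed above (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the report above
# (rows: true class, columns: predicted class).
cm = np.array([[150, 0, 0, 0, 14],
               [16, 20, 0, 0, 4],
               [16, 0, 19, 0, 31],
               [6, 0, 0, 5, 190],
               [11, 0, 0, 0, 1077]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for i in range(len(cm)):
    print('Class {}: precision={:.2f} recall={:.2f} f1={:.2f}'.format(
        i + 1, precision[i], recall[i], f1[i]))
print('Accuracy: {:.4f}'.format(accuracy))
# Unweighted mean over classes; unlike accuracy, every class counts equally.
print('Macro F1: {:.2f}'.format(f1.mean()))
```

The recall column shows that most Class 4 reviews are predicted as Class 5 (190 of 201), which the overall accuracy alone does not reveal.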

