# Travel agency's reviews - classification using word2vec vectors

Implement and evaluate a classifier of user reviews using a Support Vector Machine with an RBF kernel. Use word2vec vectors as features.

In [0]:
import pandas as pd

reviews = pd.read_csv(
    'https://raw.githubusercontent.com/mlcollege/natural-language-processing/master/data/en_reviews.csv',
    sep='\t', header=None, names=['rating', 'text'])
reviews[35:45]

## Preparation of train and test data sets
Separate and rename target values.

In [0]:
target = reviews['rating']
data = reviews['text']
names = ['Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5']

print(data[:5])
print(target[:5])

Tokenize the texts.

In [0]:
from nltk.tokenize.casual import casual_tokenize
tokens = data.apply(casual_tokenize)

Read the word2vec vectors from 'crawl-300.vec' and store them in a dictionary. The keys of the dictionary will be tokens, and the values will be their word2vec vectors.

In [0]:
!wget https://www.mlcollege.com/data/crawl-300.vec.bz2
!bunzip2 crawl-300.vec.bz2

In [0]:
import numpy as np
DIM = 300 #dimension of the word2vec vectors

word_vectors = {}

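with open('crawl-300.vec', encoding='utf-8') as f:
    f.readline()  # skip the first (header) line
    for line in f:  # iterate lazily instead of loading the whole file with readlines()
        items = line.strip().split()
        # first item is the token, the remaining items are the vector components
        word_vectors[items[0]] = np.array(items[1:], dtype=np.float32)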
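
The loading step can be tried out on a tiny in-memory sample before running it on the full file. The sketch below is only an illustration: it assumes the usual text layout in which the first line holds the vocabulary size and the vector dimension, and the `load_vectors` helper, its `limit` parameter, and the sample data are hypothetical names that are not part of the notebook.

```python
import io
import numpy as np

def load_vectors(handle, limit=None):
    """Read 'token v1 v2 ...' lines into a dict (hypothetical helper)."""
    # Assumption: the header line is '<vocabulary size> <dimension>'
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # stop early to keep memory usage low
        items = line.strip().split()
        vectors[items[0]] = np.array([float(x) for x in items[1:]],
                                     dtype=np.float32)
    return vectors, dim

# Made-up three-word sample in the assumed format
sample = io.StringIO("3 2\nthe 0.1 0.2\nhotel 0.3 -0.4\nbeach 0.5 0.6\n")
sample_vectors, sample_dim = load_vectors(sample, limit=2)
print(sample_dim, list(sample_vectors))
```

Reading only the first `limit` rows can be handy while experimenting, at the cost of more out-of-vocabulary tokens later on.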

Convert the document representation from 'list of tokens' to 'average word2vec vector'. Experiment with other aggregation methods.

In [0]:
vectors = []
for line in tokens:
    # keep only the tokens that have a word2vec vector
    vec_list = [word_vectors[token] for token in line if token in word_vectors]
    if len(vec_list) == 0:
        # no known token in the review: fall back to a zero vector
        vec_list.append(np.zeros(DIM, dtype=np.float32))
    vectors.append(np.mean(vec_list, axis=0))
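
As a starting point for experimenting with other aggregation methods, the sketch below compares element-wise mean, element-wise max, and a concatenation of both. It is a minimal illustration under stated assumptions: `aggregate`, `toy_vectors`, and the 4-dimensional toy embeddings are hypothetical and stand in for `word_vectors` and the 300-dimensional vectors used in the notebook.

```python
import numpy as np

TOY_DIM = 4  # toy dimension for illustration; the notebook uses DIM = 300

# Hypothetical toy embeddings standing in for word_vectors
toy_vectors = {
    'good': np.array([1.0, 0.0, 2.0, -1.0], dtype=np.float32),
    'trip': np.array([0.0, 3.0, -2.0, 1.0], dtype=np.float32),
}

def aggregate(doc_tokens, embeddings, dim, method='mean'):
    """Combine the vectors of known tokens into one document vector."""
    known = [embeddings[t] for t in doc_tokens if t in embeddings]
    if not known:
        # no known token: return zeros of the matching output size
        out_dim = 2 * dim if method == 'mean_max' else dim
        return np.zeros(out_dim, dtype=np.float32)
    stacked = np.vstack(known)
    if method == 'mean':
        return stacked.mean(axis=0)
    if method == 'max':
        return stacked.max(axis=0)
    if method == 'mean_max':
        return np.concatenate([stacked.mean(axis=0), stacked.max(axis=0)])
    raise ValueError('unknown aggregation method: {}'.format(method))

doc = ['good', 'trip', 'unknownword']
mean_vec = aggregate(doc, toy_vectors, TOY_DIM, 'mean')
max_vec = aggregate(doc, toy_vectors, TOY_DIM, 'max')
both_vec = aggregate(doc, toy_vectors, TOY_DIM, 'mean_max')
print(mean_vec, max_vec, both_vec.shape)
```

Note that the concatenated variant doubles the feature dimension, so the classifier sees twice as many features per review.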

Shuffle the data and split it into training and test sets.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, target, test_size=0.2)
print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

## Classification

Build an ML pipeline that standardizes the features and trains an SVM classifier on the training data, then predict the labels of the test set.

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf_pipeline = Pipeline([('std', StandardScaler()),
                         ('svm', SVC(kernel='rbf'))])

clf_pipeline.fit(X_train, y_train)

y_pred = clf_pipeline.predict(X_test)

## Evaluation

Evaluate the model using standard metrics: accuracy, the confusion matrix, and a per-class classification report.

In [0]:
from sklearn import metrics

print()
print("ML MODEL REPORT")
print("Accuracy: {}".format(metrics.accuracy_score(y_test, y_pred)))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred,
                                    target_names=names))
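
To see how the reported numbers relate to each other, the sketch below derives accuracy and per-class precision and recall directly from a confusion matrix. The 3-class matrix is made up for illustration and is not a result of this notebook; it follows the scikit-learn convention in which rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true classes, columns = predictions
cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 8]])

true_positives = np.diag(cm)                 # correctly classified per class
recall = true_positives / cm.sum(axis=1)     # share of each true class recovered
precision = true_positives / cm.sum(axis=0)  # share of each prediction that is correct
accuracy = true_positives.sum() / cm.sum()   # overall share of correct predictions

print('Accuracy: {:.2f}'.format(accuracy))
print('Recall per class:', np.round(recall, 2))
print('Precision per class:', np.round(precision, 2))
```

Large off-diagonal values next to the diagonal would suggest that the classifier mostly confuses neighbouring ratings, which accuracy alone does not reveal.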