# Travel agency's reviews - classification

Implement and evaluate a classifier of user reviews using the methods described in the NLP tutorial.

In [9]:
import pandas as pd

reviews = pd.read_csv('../data/en_reviews.csv', sep='\t', header=None, names=['rating', 'text'])
reviews[35:45]

|    | rating | text |
|----|--------|------|
| 35 | 5 | I bought the cheapest tickets through this ser... |
| 36 | 5 | Such a pleasure to know that you will be prope... |
| 37 | 5 | I always use this website to look for flights ... |
| 38 | 2 | A startup that finds discount flight tickets '... |
| 39 | 5 | Excellent customer service, fast and kind. Wan... |
| 40 | 4 | very good service from Quan Costa to help me w... |
| 41 | 3 | .@Skypickercom Finds Cheap Flights 'Hidden' On... |
| 42 | 5 | I have a problem with my tickets skypicker don... |
| 43 | 4 | Even though it took a bit time untill an agent... |
| 44 | 5 | Today I had a great experience with one of Kiw... |


## Preparation of train and test data sets
Separate the target values (ratings) from the review texts and give the classes readable names.

In [11]:
target = reviews['rating']
data = reviews['text']
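# Original five-class labels (star ratings 1 to 5); overridden by the binary labels below.
names = ['Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5']

# Reduce the number of classes: ratings 4 and 5 are positive (1), all other ratings are negative (0).
target = [1 if t in (4, 5) else 0 for t in target]
names = ['Negative', 'Positive']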

print(data[:5])
print(target[:5])

0    A voucher to nowhere #skypickerfail 2400 out o...
1    I booked with Kiwi for the first time, just a ...
2    I would like to say THANKS YOU for your custom...
3    I just noticed 2 hours before my flight that I...
4    This is the first time I have dealt with Skypi...
Name: text, dtype: object
[0, 1, 1, 1, 0]
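
The binarization above can be tried on a small scale before it is applied to the full data set. The following is a minimal sketch, assuming a hypothetical list of ratings (not taken from `en_reviews.csv`) and a hypothetical helper named `binarize`. It maps ratings to the two classes and counts the examples in each class, which matters because a skewed class balance changes how accuracy should be read later on.

```python
from collections import Counter

# Hypothetical ratings, for illustration only (not taken from en_reviews.csv).
ratings = [5, 4, 1, 3, 5, 2, 5, 4]


def binarize(rating, threshold=4):
    """Map a 1-5 star rating to 1 (positive) if rating >= threshold, else 0 (negative)."""
    return 1 if rating >= threshold else 0


labels = [binarize(r) for r in ratings]

# Count the examples per class and compute the share of positive reviews.
counts = Counter(labels)
positive_share = counts[1] / len(labels)

print(labels)
print(dict(counts), positive_share)
```

The `threshold` parameter is an assumption of this sketch; with the default of 4 it reproduces the rule used in the cell above.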


Shuffle the data and split it into training and test sets.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

Train size: 6234
Test size: 1559
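
The split above is drawn at random on every run and does not explicitly preserve the class proportions. One possible variant, sketched here on hypothetical toy data, passes the `stratify` and `random_state` parameters of `train_test_split` so that both parts keep the same positive/negative ratio and the split is reproducible. The results reported in this notebook were produced without these parameters, so this is an optional refinement rather than the setup actually used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 positive and 2 negative examples.
texts = ['review {}'.format(i) for i in range(10)]
labels = [1] * 8 + [0] * 2

# stratify=labels keeps the class ratio in both parts;
# random_state fixes the shuffle so the split can be reproduced.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=42)

print('Train labels: {}'.format(sorted(y_tr)))
print('Test labels: {}'.format(sorted(y_te)))
```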


## Classification

Prepare an ML pipeline that includes data preprocessing, and train a classifier. Implement a baseline model using [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) as a reference point. Experiment with various models and data preprocessing techniques.

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.tokenize.casual import casual_tokenize

from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Baseline: predicts labels at random, following the class distribution of the training set.
baselineModel = Pipeline([('vec', CountVectorizer(tokenizer=casual_tokenize)),
                          ('clf', DummyClassifier(strategy='stratified'))
                         ])

# Main model: token counts -> TF-IDF weights -> classifier (alternatives are left commented out).
clfModel = Pipeline([('vec', CountVectorizer(tokenizer=casual_tokenize)),
                     ('tfidf', TfidfTransformer()),
                     #('clf', MultinomialNB())
                     #('clf', LogisticRegression())
                     #('clf', GradientBoostingClassifier(n_estimators=300))
                     ('clf', SVC(kernel='linear'))
                    ])

baselineModel.fit(X_train, y_train)
clfModel.fit(X_train, y_train)

y_baseline = baselineModel.predict(X_test)
y_pred = clfModel.predict(X_test)
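
Swapping classifiers by commenting lines in and out works, but it compares models on a single split. A possible alternative, sketched below on a hypothetical toy corpus, loops over candidate classifiers and scores each one with `cross_val_score`. The corpus, the candidate names, and the choice of `f1_macro` as the score are assumptions of this sketch. It also uses `TfidfVectorizer` with its default tokenizer, which stands in for the `CountVectorizer` + `TfidfTransformer` pair above but does not use `casual_tokenize`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only.
texts = [
    'great service and fast refund', 'very helpful agent, thank you',
    'cheap tickets and easy booking', 'excellent support, great price',
    'fast and kind customer service', 'great experience, will book again',
    'terrible service, no refund', 'flight cancelled and no help',
    'worst booking experience ever', 'lost my money, terrible support',
    'no answer from customer service', 'cancelled ticket and no refund',
]
labels = [1] * 6 + [0] * 6

candidates = {
    'naive_bayes': MultinomialNB(),
    'logistic_regression': LogisticRegression(),
}

scores = {}
for name, clf in candidates.items():
    model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    # Mean macro-averaged F1 over 3 stratified folds.
    scores[name] = cross_val_score(model, texts, labels, cv=3,
                                   scoring='f1_macro').mean()

for name, score in scores.items():
    print('{}: {:.2f}'.format(name, score))
```

On a corpus this small the scores are not meaningful; the point is only the structure of the comparison, which could be applied to `X_train` and `y_train` instead.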

## Evaluation

Evaluate the models using standard metrics: accuracy, the confusion matrix, and per-class precision, recall and F1-score. Which model performs best?

In [14]:
from sklearn import metrics

print("BASELINE REPORT")
print("Accuracy: {}".format(metrics.accuracy_score(y_test, y_baseline)))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_baseline))
print(metrics.classification_report(y_test, y_baseline,
                                    target_names=names))
print()
print("ML MODEL REPORT")
print("Accuracy: {}".format(metrics.accuracy_score(y_test, y_pred)))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred,
                                    target_names=names))

BASELINE REPORT
Accuracy: 0.7132777421423989
Confusion matrix:
[[  33  224]
 [ 223 1079]]
             precision    recall  f1-score   support

   Negative       0.13      0.13      0.13       257
   Positive       0.83      0.83      0.83      1302

avg / total       0.71      0.71      0.71      1559


ML MODEL REPORT
Accuracy: 0.9518922386144965
Confusion matrix:
[[ 212   45]
 [  30 1272]]
             precision    recall  f1-score   support

   Negative       0.88      0.82      0.85       257
   Positive       0.97      0.98      0.97      1302

avg / total       0.95      0.95      0.95      1559
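
To see where the numbers in the report come from, the per-class metrics can be recomputed by hand from the confusion matrix. The sketch below uses the ML model's matrix printed above (true classes as rows, predicted classes as columns); the helper name `per_class_metrics` is hypothetical. It also computes the accuracy of always predicting the majority class, a reference point that the `stratified` dummy baseline does not provide.

```python
def per_class_metrics(cm):
    """Return accuracy and a list of (precision, recall, f1) per class.

    cm is a square confusion matrix with true classes as rows and
    predicted classes as columns. Assumes every class is predicted
    at least once, so no division by zero occurs.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    report = []
    for i in range(n):
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum (support)
        precision = cm[i][i] / predicted
        recall = cm[i][i] / actual
        f1 = 2 * precision * recall / (precision + recall)
        report.append((precision, recall, f1))
    return accuracy, report


# Confusion matrix of the ML model from the report above.
svm_cm = [[212, 45],
          [30, 1272]]
accuracy, report = per_class_metrics(svm_cm)

# Accuracy of a model that always predicts the most frequent class.
support = [sum(row) for row in svm_cm]
majority_accuracy = max(support) / sum(support)

print('Accuracy: {:.4f}'.format(accuracy))
for name, (p, r, f1) in zip(['Negative', 'Positive'], report):
    print('{}: precision={:.2f} recall={:.2f} f1={:.2f}'.format(name, p, r, f1))
print('Majority-class accuracy: {:.3f}'.format(majority_accuracy))
```

With 1302 of the 1559 test reviews being positive, always answering "Positive" would already reach about 0.835 accuracy, so the recall on the Negative class (0.82) is the more telling figure for the ML model.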

