# Classifying hotel reviews (text) with pipelines in scikit-learn

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">

In this notebook we consider the problem of detecting [fake hotel reviews](http://myleott.com/op_spam/). We define a _pipeline_ that chains text featurization to model building, and we apply grid search to the pipeline’s parameter space, in order to find a good combination of parameter values. We then inspect how predictions are made and review model behaviour.

## Data preparation

Import the hotel reviews data and prepare `X` (inputs: the review texts) and `y` (outputs: the labels):

In [None]:
import pandas as pd
data = pd.read_csv('https://oml-data.s3.amazonaws.com/hotel-reviews.csv')
X = data.text.values
y = data.label.values

## Define pipeline

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('selector', SelectKBest(chi2)),
                     ('clf', RandomForestClassifier())])
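The step names given to `Pipeline` (`'vect'`, `'selector'`, `'clf'`) matter: nested parameters are addressed as `<step>__<parameter>`, which is how the grid search refers to them. Here is a minimal sketch on a tiny made-up corpus (the toy texts and labels are assumptions for illustration, not the hotel reviews data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, made up for illustration only
toy_texts = ["great room and friendly staff",
             "terrible room and rude staff",
             "friendly staff, great breakfast",
             "rude staff, terrible breakfast"]
toy_labels = ["pos", "neg", "pos", "neg"]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('selector', SelectKBest(chi2)),
                    ('clf', RandomForestClassifier(random_state=0))])

# Nested parameters are addressed as <step name>__<parameter name>
toy_clf.set_params(selector__k=3, clf__n_estimators=10)
toy_clf.fit(toy_texts, toy_labels)

# Each fitted step can be inspected by name
n_vocab = len(toy_clf.named_steps['vect'].vocabulary_)
n_selected = toy_clf.named_steps['selector'].get_support().sum()
print(n_vocab, "words in vocabulary,", n_selected, "kept by the selector")
```

Calling `fit` on the pipeline fits each step in turn on the output of the previous one, so the classifier only ever sees the columns kept by the selector.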

## Grid search

Define the parameter grid and how each candidate is evaluated during the search:

In [None]:
from sklearn.model_selection import GridSearchCV

# n_estimators: reasonable values are between 10 and 100; larger values make each evaluation slower
parameters = {"clf__n_estimators": [10],
              "clf__max_depth": [2, 4, 10, None],
              "selector__k": [100, 1000]}

# cv is the number of folds; recommended values are between 3 and 10, and smaller values make the search quicker
grid_search = GridSearchCV(text_clf, parameters, scoring="accuracy", cv=3)
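Grid search is exhaustive: every combination in the grid is evaluated once per fold. A quick way to estimate the cost before launching the search is to count the candidates with `ParameterGrid`. The sketch below re-declares the same grid so that it runs on its own; the names `grid` and `n_folds` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, repeated so this cell is self-contained
grid = {"clf__n_estimators": [10],
        "clf__max_depth": [2, 4, 10, None],
        "selector__k": [100, 1000]}
n_folds = 3

n_candidates = len(ParameterGrid(grid))   # 1 * 4 * 2 combinations
n_fits = n_candidates * n_folds           # one fit per candidate and per fold
print(n_candidates, "candidates,", n_fits, "fits during cross-validation")
```

Adding values to any list in the grid multiplies the number of candidates, which is why `clf__n_estimators` is kept to a single value here.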

Run the search

In [None]:
grid_search.fit(X, y)

Report results

In [None]:
import numpy as np

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [None]:
report(grid_search.cv_results_)

In [None]:
grid_search.best_params_

## Create optimal model

In [None]:
text_clf.set_params(**grid_search.best_params_)
text_clf.fit(X, y)
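With its default `refit=True`, `GridSearchCV` also refits the best parameter combination on all the data it was given and exposes the result as `best_estimator_`, so the manual refit above makes that step explicit. Here is a small sketch on made-up data (the toy texts, labels and the shorter pipeline are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Toy data, made up for illustration only
toy_texts = ["great room", "lovely stay", "great staff", "lovely room",
             "awful room", "dirty stay", "awful staff", "dirty room"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))])
toy_search = GridSearchCV(toy_clf, {"clf__max_depth": [2, None]},
                          scoring="accuracy", cv=2)
toy_search.fit(toy_texts, toy_labels)

# best_estimator_ is a fitted copy of the pipeline, configured with best_params_
best = toy_search.best_estimator_
same_depth = (best.get_params()["clf__max_depth"]
              == toy_search.best_params_["clf__max_depth"])
print(same_depth, best.predict(["great stay"]))
```

The search works on clones of the pipeline, so the pipeline object passed to it is left unfitted; that is why the notebook sets the best parameters on `text_clf` and fits it again.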

Use the model to predict class probabilities for new reviews:

In [None]:
text_clf.predict_proba(["I will NEVER stay in this hotel again!", "My $200 Gucci sunglasses were stolen"])
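`predict_proba` returns one row per input and one column per class; the columns follow the order of the fitted pipeline's `classes_` attribute. Here is a sketch on made-up data (the label values `'fake'` and `'real'` are assumptions; the actual label values come from the CSV file):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only
toy_texts = ["best hotel ever, amazing", "room was fine, a bit noisy",
             "amazing amazing amazing", "check-in was slow but staff helped"]
toy_labels = ["fake", "real", "fake", "real"]  # hypothetical label values

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))])
toy_clf.fit(toy_texts, toy_labels)

proba = toy_clf.predict_proba(["amazing hotel", "slow check-in"])
# Column j of proba holds the probability of class toy_clf.classes_[j]
for row in proba:
    print(dict(zip(toy_clf.classes_, np.round(row, 2))))
```

Checking `classes_` before reading the probabilities avoids attributing a column to the wrong class.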

## Explain predictions

We use [LIME](https://github.com/marcotcr/lime) (Local Interpretable Model-agnostic Explanations), which makes it possible to explain the predictions of any ML classifier.

We start by initializing a text explainer, passing it readable names for the two classes to predict:

In [None]:
from lime.lime_text import LimeTextExplainer
# class_names must be listed in the same order as text_clf.classes_
explainer = LimeTextExplainer(class_names=['Fake', 'Real'])

Apply the explainer on an input:

In [None]:
review = X[0] # get a review from the dataset
explanation = explainer.explain_instance(review, text_clf.predict_proba, num_features=6)

See raw values contained in the explanation:

In [None]:
explanation.as_list()

Visualize these values as a bar plot:

In [None]:
%matplotlib inline
fig = explanation.as_pyplot_figure()

Visualize them and the text input at the same time:

In [None]:
explanation.show_in_notebook()

A possible next step: add stop word removal to the pipeline and check whether it improves the results.
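One way to try stop word removal, as a sketch rather than a recommendation: `CountVectorizer` accepts `stop_words='english'`, which drops words from scikit-learn's built-in English list during tokenization. The sentence below is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentence, for illustration only
sample = ["The room was clean and the staff were very friendly"]

plain = CountVectorizer().fit(sample)
no_stop = CountVectorizer(stop_words='english').fit(sample)

# Compare the vocabularies learned with and without stop word removal
plain_vocab = sorted(plain.vocabulary_)
no_stop_vocab = sorted(no_stop.vocabulary_)
print(plain_vocab)
print(no_stop_vocab)
```

In the pipeline this could become one more grid entry, for example `"vect__stop_words": [None, "english"]`, so that the search itself decides whether removing stop words helps.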