## AUTOMATIC REVIEW CLASSIFIER

This NLP project classifies Yelp reviews as positive or negative based on their text content. I use only the reviews rated with 1 star (negative opinion) or 5 stars (positive opinion). Several algorithms were tried; in this notebook a Naive Bayes classifier serves as the baseline, and Logistic Regression is the chosen model because it gives the best results.

I'm using the [Yelp Review Data Set from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).

In [None]:
from __future__ import division
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import pandas as pd
import matplotlib.pyplot as plt

In [83]:
# Tokenizer used as the CountVectorizer analyzer: strips punctuation, lowercases,
# removes English stop words and stems each word to its root (running, runs ==> run).
# Another option would be categorizing words by part of speech (verbs, adjectives...).

STOPWORDS = set(stopwords.words('english'))  # build once; set lookup is fast
STEMMER = PorterStemmer()

def tokenize_text(document):
    # Decode byte strings (Python 2 str) to unicode; text strings pass through
    if isinstance(document, bytes):
        document = document.decode('utf-8')

    # Drop every punctuation character
    no_punc = ''.join(char for char in document if char not in string.punctuation)

    # Lowercase and split into words
    words = no_punc.lower().split()

    # Remove stop words
    no_stopwords = [word for word in words if word not in STOPWORDS]

    # Stem each remaining word
    return [STEMMER.stem(word) for word in no_stopwords]
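
To see what the first steps of the tokenizer do without installing NLTK, here is a minimal, dependency-free sketch. The name `simple_tokenize` and the tiny `TOY_STOPWORDS` set are made up for illustration (the notebook uses NLTK's full English stop-word list), and stemming is left out on purpose.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself uses NLTK's full English list.
TOY_STOPWORDS = {"the", "was", "and", "were"}

def simple_tokenize(text):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    no_punc = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punc.lower().split() if w not in TOY_STOPWORDS]

tokens = simple_tokenize("The food was GREAT, and the staff were friendly!")
print(tokens)
```

The returned list is what CountVectorizer would count when such a function is passed as its analyzer.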

In [3]:
df = pd.read_csv('yelp.csv')

df = df[['stars', 'text']]
df.columns  = ['label','review']

df = df.loc[(df['label'] == 5) | (df['label'] == 1)]

df.loc[(df['label'] == 5),'label'] = 'positive'
df.loc[(df['label'] == 1),'label'] = 'negative'

df.head()

| | label | review |
|---|---|---|
| 0 | positive | My wife took me here on my birthday for breakf... |
| 1 | positive | I have no idea why some people give bad review... |
| 3 | positive | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... |
| 4 | positive | General Manager Scott Petello is a good egg!!!... |
| 6 | positive | Drop what you're doing and drive here. After I... |


In [95]:
df['label'].value_counts(normalize=True) * 100

positive    81.669114
negative    18.330886
Name: label, dtype: float64
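
With classes this imbalanced, plain accuracy is misleading: a model that always answers "positive" is already right about 82% of the time. A small sketch of that majority-class baseline follows; the function name and the toy labels are made up and only mimic the proportions above.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

# Toy labels (made up) with roughly the same imbalance as the data set
toy_labels = ["positive"] * 82 + ["negative"] * 18
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

Any classifier worth keeping has to beat this number, which is why the per-class figures matter more than the averages in the reports.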

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(df['review'], df['label'], test_size=0.2)

In [9]:
print(y_train.value_counts(normalize=True) * 100)
print(y_test.value_counts(normalize=True) * 100)

positive    82.129743
negative    17.870257
Name: label, dtype: float64
positive    79.828851
negative    20.171149
Name: label, dtype: float64
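
The train and test proportions differ slightly because the split is purely random. scikit-learn's train_test_split accepts a `stratify` argument (and `random_state` for reproducibility) to keep them equal. The idea behind a stratified split can be sketched in plain Python as below; `stratified_split` is an illustrative helper, not the library's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) keeping each label's share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # shuffle within each class
        n_test = int(round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (made up): 80% positive, 20% negative
toy_labels = ["positive"] * 80 + ["negative"] * 20
train_idx, test_idx = stratified_split(toy_labels)
test_negative_share = sum(toy_labels[i] == "negative" for i in test_idx) / float(len(test_idx))
print(test_negative_share)
```

In the notebook the equivalent would presumably be passing `stratify=df['label']` to train_test_split.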


In [99]:
pipeline = Pipeline([
    ('bag of words', CountVectorizer(analyzer=tokenize_text)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

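pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the "precision" column is the per-class recall (detection rate), the
# "recall" column is the precision, and "support" counts predictions.
# The same holds for every report in this notebook.
print(classification_report(predictions, y_test))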

             precision    recall  f1-score   support

          0       1.00      0.82      0.90       816
          1       0.01      1.00      0.03         2

avg / total       1.00      0.82      0.90       818
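
A note on reading this report: the cell passes `(predictions, y_test)`, while the signature is `classification_report(y_true, y_pred)`. Swapping the two arguments swaps precision and recall for every class. The sketch below shows this with a hand-written `precision_recall` helper and made-up toy labels; it is not scikit-learn's code.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / float(predicted) if predicted else 0.0
    recall = tp / float(actual) if actual else 0.0
    return precision, recall

# Toy data (made up): 4 real negatives, only 1 of them detected
y_true = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos", "pos"]
y_pred = ["neg", "pos", "pos", "pos", "pos", "pos", "pos", "pos", "pos", "pos"]

correct_order = precision_recall(y_true, y_pred, "neg")
swapped_order = precision_recall(y_pred, y_true, "neg")
print(correct_order, swapped_order)
```

So in the reports of this notebook the "precision" column can be read as the share of each true class that was detected, and "support" as the number of predictions per class.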



In [100]:
pipeline2 = Pipeline([
    ('bag of words', CountVectorizer(analyzer=tokenize_text)),
    ('tfidf', TfidfTransformer(norm=None)),
    ('classifier', MultinomialNB()),
])

pipeline2.fit(X_train,y_train)
predictions = pipeline2.predict(X_test)
print(classification_report(predictions,y_test))

             precision    recall  f1-score   support

          0       0.95      0.95      0.95       664
          1       0.79      0.78      0.78       154

avg / total       0.92      0.92      0.92       818



In [84]:
pipeline3 = Pipeline([
    ('bag of words', CountVectorizer(analyzer=tokenize_text)),
    ('classifier', MultinomialNB()),
])
pipeline3.fit(X_train,y_train)
predictions = pipeline3.predict(X_test)
print(classification_report(predictions,y_test))

             precision    recall  f1-score   support

          0       0.99      0.94      0.96       698
          1       0.73      0.93      0.82       120

avg / total       0.95      0.94      0.94       818



In [76]:
pipeline4 = Pipeline([
    ('bag of words', CountVectorizer()),
    ('classifier', MultinomialNB()),
])

pipeline4.fit(X_train,y_train)
predictions = pipeline4.predict(X_test)
print(classification_report(predictions,y_test))


             precision    recall  f1-score   support

          0       0.98      0.93      0.96       701
          1       0.70      0.91      0.79       117

avg / total       0.94      0.93      0.93       818



FINDINGS:

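- The first pipeline performs very badly: it classifies almost every review as positive (816 of 818 predictions; class 0 in the reports is positive, class 1 is negative). The cause is the preprocessing: with the default L2 normalization of TfidfTransformer, the tf-idf features are not useful to the Naive Bayes classifier.
- The second pipeline disables the normalization (norm=None) and gives the best negative-review detection of the Naive Bayes pipelines (79% of negative reviews detected, 95% of positive ones).
- The third pipeline performs well without tf-idf at all.
- The last one uses the default CountVectorizer analyzer. It performs slightly worse but is faster, presumably because it skips the per-word Python work (stop-word filtering and stemming) done in tokenize_text.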
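
To make the difference between the first two pipelines concrete: `TfidfTransformer()` defaults to `norm='l2'`, while the second pipeline passes `norm=None`. The sketch below recomputes tf-idf by hand on made-up counts, using the smoothed idf that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`. One plausible explanation of the poor first result, offered here as an assumption, is that unit-length rows carry so little total weight that the class prior dominates the Naive Bayes decision.

```python
import math

def tfidf_rows(count_rows, l2_normalize):
    """Tf-idf weights for a small dense count matrix (list of rows)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency of each term, then smoothed idf
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        if l2_normalize:
            norm = math.sqrt(sum(v * v for v in weighted))
            if norm > 0:
                weighted = [v / norm for v in weighted]
        result.append(weighted)
    return result

# Toy term counts (made up): 3 documents, 3 terms
toy_counts = [[3, 0, 1], [0, 2, 1], [1, 1, 4]]
raw = tfidf_rows(toy_counts, l2_normalize=False)
normalized = tfidf_rows(toy_counts, l2_normalize=True)
raw_mass = [sum(row) for row in raw]
normalized_mass = [sum(row) for row in normalized]
print(raw_mass)
print(normalized_mass)
```

After normalization every document has Euclidean length 1, whatever its original size, so the summed feature weight per review shrinks considerably compared with the unnormalized version.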

In [134]:
from sklearn.linear_model import LogisticRegression

pipeline5 = Pipeline([
    ('bag of words', CountVectorizer()),
    ('tf-idf', TfidfTransformer(norm=None)),
    ('classifier', LogisticRegression()), 
])

pipeline5.fit(X_train,y_train)
predictions = pipeline5.predict(X_test)
print(classification_report(predictions,y_test))

             precision    recall  f1-score   support

          0       0.97      0.96      0.97       678
          1       0.81      0.88      0.84       140

avg / total       0.95      0.94      0.94       818



In [157]:
from sklearn.model_selection import GridSearchCV

param_grid_ = {'classifier__C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}
bow_search = GridSearchCV(pipeline5, cv=5, param_grid=param_grid_)
bow_search.fit(X_train, y_train)


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('bag of words', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
  ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classifier__C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
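
What the search does, in outline: for each candidate C it trains on 4 of the 5 folds, scores on the remaining one, averages the 5 scores and keeps the best candidate. The sketch below reproduces that loop in plain Python. `k_fold_indices`, `grid_search` and `toy_score` are illustrative names; the folds here are simple contiguous blocks, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer, and `toy_score` is a made-up stand-in for "fit and evaluate".

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train, validation)."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def grid_search(candidates, n_samples, k, score_fn):
    """Return (best_candidate, best_mean_score) over k folds."""
    best = None
    for candidate in candidates:
        scores = [score_fn(candidate, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / float(len(scores))
        if best is None or mean_score > best[1]:
            best = (candidate, mean_score)
    return best

# Made-up scoring function standing in for "fit on train, score on validation".
# It is constructed to peak at 0.1 so that the search has something to find.
def toy_score(C, train, validation):
    return 1.0 / (1.0 + abs(C - 0.1))

folds = list(k_fold_indices(10, 5))
best_C, best_score = grid_search([1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2],
                                 n_samples=10, k=5, score_fn=toy_score)
print(best_C, best_score)
```

Since the search above is created with refit=True, the fitted `bow_search` also exposes the winning pipeline retrained on all of X_train as `best_estimator_`.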

In [158]:
c = bow_search.best_params_['classifier__C']

pipeline6 = Pipeline([
    ('bag of words', CountVectorizer()),
    ('tf-idf', TfidfTransformer(norm=None)),
    ('classifier', LogisticRegression(C=c)), 
])

pipeline6.fit(X_train,y_train)
predictions = pipeline6.predict(X_test)

print(c)
print(classification_report(predictions,y_test))

0.1
             precision    recall  f1-score   support

          0       0.98      0.96      0.97       683
          1       0.80      0.90      0.85       135

avg / total       0.95      0.95      0.95       818
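
The C printed above is the inverse of the regularization strength: small values penalize large weights more. For the L2 penalty, and labels coded as -1/+1, the objective described in scikit-learn's documentation can be written as `0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * (w . x_i + b))))`. A minimal sketch of that formula on made-up data (the function name is illustrative, and this is not the solver itself):

```python
import math

def l2_logistic_objective(w, b, X, y, C):
    """L2-penalized logistic loss; labels y_i must be -1 or +1."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        data_loss += math.log(1.0 + math.exp(-margin))
    return penalty + C * data_loss

# Toy data (made up): two one-hot documents with opposite labels
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]

at_zero = l2_logistic_objective([0.0, 0.0], 0.0, X, y, C=0.1)
small_C = l2_logistic_objective([2.0, -2.0], 0.0, X, y, C=0.1)
large_C = l2_logistic_objective([2.0, -2.0], 0.0, X, y, C=100.0)
print(at_zero, small_C, large_C)
```

With C = 0.1 the weight penalty makes up almost all of the objective for these weights, while with C = 100 the data term dominates; the grid search is in effect choosing where that balance should sit.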



MORE FINDINGS:

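- Logistic regression classifies better than Naive Bayes (f1-score of 0.85 on the negative class, against 0.82 for the best Naive Bayes pipeline).
- Multiple experiments were run with GridSearchCV. A regularization parameter of C = 0.1 gives the best results (C is the inverse of the regularization strength).
- The chosen system is:
    * default vectorizer analyzer
    * tf-idf without normalization
    * logistic regression classifier with C=0.1
    * Detection rate (recall) and precision: positive class (detection rate 98%, precision 96%); negative class (detection rate 80%, precision 90%). The reports call classification_report(predictions, y_test), so their "precision" column holds the detection rate and their "recall" column holds the precision.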
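
Detection rate, false alarm rate and precision are easy to mix up, so here is how each one follows from the four confusion counts of a class. The counts below are made up for illustration and are not the notebook's results.

```python
def detection_metrics(tp, fp, fn, tn):
    """Per-class metrics from confusion counts."""
    return {
        "detection_rate": tp / float(tp + fn),    # recall: share of real cases found
        "false_alarm_rate": fp / float(fp + tn),  # share of other cases wrongly flagged
        "precision": tp / float(tp + fp),         # share of flagged cases that are real
    }

# Toy counts (made up)
toy = detection_metrics(tp=90, fp=10, fn=10, tn=890)
print(toy)
```

Note that a good classifier has a low false alarm rate; a value in the 90% range is the signature of a precision, not of a false alarm rate.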