# Training a Classifier on Fake Reviews
How to create and train a classifier to spot fake reviews on Yelp, using supervised learning and the Natural Language Toolkit (NLTK).

In [1]:
import nltk
import sqlite3
import random
import pandas as pd
import numpy as np

## The Dataset
We are using a dataset of Yelp reviews that is stored in a SQLite database. Each review is flagged as either fake (Y or YR) or real (N or NR), and there is some additional information about each review. The first step is to grab these reviews and put them in a dataframe. We'll be using pandas, which feels very familiar coming from R.

In [2]:
conn = sqlite3.connect('yelpHotelData.db')
columns = 'reviewContent, rating, usefulCount, coolCount, funnyCount'
# SQL string literals take single quotes; double quotes denote identifiers.
fake = pd.read_sql(f"SELECT {columns} FROM review WHERE flagged IN ('Y', 'YR')", conn)
real = pd.read_sql(f"SELECT {columns} FROM review WHERE flagged IN ('N', 'NR')", conn)
conn.close()
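
Before sampling, it can help to check how many rows carry each flag. The sketch below is only an illustration: it builds a tiny in-memory SQLite table with made-up rows instead of opening yelpHotelData.db, and the helper name flag_counts is hypothetical. Only the table name review and the column flagged come from the query above.

```python
import sqlite3

def flag_counts(conn):
    """Return a {flag: number of rows} dict for the review table."""
    rows = conn.execute(
        "SELECT flagged, COUNT(*) FROM review GROUP BY flagged"
    ).fetchall()
    return dict(rows)

# Hypothetical stand-in data; the real database is not used here.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE review (reviewContent TEXT, flagged TEXT)")
demo.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [
        ("great stay", "N"),
        ("best hotel ever!!!", "Y"),
        ("clean rooms", "NR"),
        ("awful", "N"),
    ],
)
counts = flag_counts(demo)
demo.close()
print(counts)
```

Run against the real connection, the same helper would show how unbalanced the two classes are before sampling.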

We now have over 700 thousand reviews, which is a lot to process. Instead, we will randomly sample 5000 reviews from each class and combine them into a single dataframe.

In [3]:
fake = fake.sample(5000)
fake['tag'] = 'fake'
real = real.sample(5000)
real['tag'] = 'real'
df = pd.concat([fake,real])
df = df.iloc[np.random.permutation(len(df))]
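
Because the sampling and the shuffle are random, each run draws different reviews and the accuracy figures further down will vary from run to run. One possible way to make the step repeatable is to pass a fixed seed, as in the sketch below. The helper name balanced_sample, the seed value, and the toy dataframes are all assumptions for illustration.

```python
import pandas as pd

def balanced_sample(fake, real, n, seed):
    """Sample n rows per class, tag them, and shuffle with a fixed seed."""
    fake = fake.sample(n, random_state=seed).assign(tag='fake')
    real = real.sample(n, random_state=seed).assign(tag='real')
    combined = pd.concat([fake, real])
    # sample(frac=1) returns every row in a shuffled order.
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy stand-ins for the fake and real dataframes.
toy_fake = pd.DataFrame({'rating': [5, 5, 1, 5]})
toy_real = pd.DataFrame({'rating': [4, 3, 5, 2]})

first = balanced_sample(toy_fake, toy_real, n=2, seed=42)
second = balanced_sample(toy_fake, toy_real, n=2, seed=42)
print(first.equals(second))
```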

Now that we have a single dataframe, we need to use NLTK to "tokenize" the words in the reviews. This will make each word easy to access, and allow us to build a frequency distribution across all of the reviews to find the 2000 most common words.

In [4]:
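# Tokenize each review into a list of word tokens.
df['reviewContent'] = df['reviewContent'].apply(nltk.word_tokenize)
# Count lowercased tokens across all reviews.
all_words = nltk.FreqDist(word.lower() for row in df['reviewContent'] for word in row)
# most_common() sorts by frequency; list(all_words)[:2000] would not reliably do so.
word_features = [word for word, count in all_words.most_common(2000)]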
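
In NLTK 3, FreqDist is a subclass of collections.Counter, so iterating over it yields words in the order they were first seen, while most_common(n) returns them sorted by count. The small sketch below uses a plain Counter and a made-up sentence to show the difference; it is an illustration, not part of the original notebook.

```python
from collections import Counter

# Made-up text standing in for the tokenized reviews.
tokens = "the room was small but the staff was great and the view was great".split()
counts = Counter(tokens)

# Iteration order follows first appearance, not frequency.
first_seen = list(counts)[:3]
# most_common sorts by count, highest first.
by_frequency = [word for word, count in counts.most_common(3)]

print(first_seen)
print(by_frequency)
```

Here "room" lands in the first-seen slice even though it occurs once, which is why the choice of method matters when picking the top 2000 words.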

The word_features list contains the 2000 most common words in the reviews. We will use these words as binary features (true if the word appears in a review, false otherwise). These data structures take up a lot of memory, so let's delete the ones we no longer need.

In [5]:
del all_words
del real
del fake

## The Feature Extractor
We have our reviews in a nice dataframe, but now we need to start grabbing features for each review. We'll create a function that can extract features from each row of the dataframe. We will collect the word features mentioned earlier, along with a few other features that seem useful.

In [6]:
def document_features(doc):
    # Lowercase the tokens so they match the lowercased entries in word_features.
    document_words = set(word.lower() for word in doc['reviewContent'])
    features = {}
    for word in word_features:
        features['contains ' + word] = (word in document_words)
    features.update({
        'rating': doc['rating'],
        'useful': doc['usefulCount'],
        'cool': doc['coolCount'],
        'funny': doc['funnyCount'],
        'length': len(doc['reviewContent']),
    })
    return [features, doc['tag']]
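
To see what a single feature set looks like, here is a simplified sketch that runs on a plain dict instead of a dataframe row. The function name extract_features, the three-word vocabulary, and the toy review are all hypothetical, and only two of the numeric features are included to keep it short. Tokens are lowercased before the lookup so that they match a lowercased vocabulary.

```python
def extract_features(row, vocabulary):
    """Build a (features, label) pair from one tokenized review."""
    words = {token.lower() for token in row['reviewContent']}
    features = {'contains ' + word: (word in words) for word in vocabulary}
    features['rating'] = row['rating']
    features['length'] = len(row['reviewContent'])
    return features, row['tag']

# Hypothetical vocabulary and review, for illustration only.
toy_vocabulary = ['great', 'dirty', 'staff']
toy_row = {'reviewContent': ['Great', 'staff', '!'], 'rating': 5, 'tag': 'real'}

features, label = extract_features(toy_row, toy_vocabulary)
print(features, label)
```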

Now that we have our function, we can easily apply it to the dataframe and create training and test sets. Our training set will be 90% of the data (9000 reviews), and the test set will be the remaining 1000.

In [7]:
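featuresets = df.apply(document_features, axis=1)
# The rows were shuffled earlier, so the first 1000 form the test set and the other 9000 the training set.
train_set, test_set = featuresets[1000:], featuresets[:1000]
del word_features
del featuresets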
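
Slicing at index 1000 out of 10,000 rows holds out 10% of the data for testing. If you would rather state the split as a fraction, a small helper along these lines could be used; the name split_by_fraction is hypothetical, and it assumes the items are already shuffled.

```python
def split_by_fraction(items, test_fraction):
    """Return (train, test), with the first test_fraction of items held out."""
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# Integers stand in for the 10,000 feature sets.
items = list(range(10000))
train, test = split_by_fraction(items, 0.1)
print(len(train), len(test))
```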

## Naive Bayes Classifier
Now to train a learner on this data. We will use NLTK's built-in Naive Bayes classifier first, since it is quick and easy to implement. We will simply train it and test it, then print the accuracy along with the features the classifier identifies as most informative.

In [8]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(7)

0.574
Most Informative Features
          contains pairs = True             real : fake   =      7.0 : 1.0
                  length = 210              real : fake   =      6.6 : 1.0
       contains estimate = True             real : fake   =      6.3 : 1.0
      contains operation = True             real : fake   =      5.4 : 1.0
      contains saltiness = True             fake : real   =      5.0 : 1.0
                  length = 320              fake : real   =      5.0 : 1.0
                  length = 282              real : fake   =      5.0 : 1.0
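
The length entries in this list (210, 320, 282) are exact token counts. NLTK's Naive Bayes classifier treats every distinct feature value as its own category, so each exact length is estimated from only a handful of reviews. One option, not used in the run above, is to group lengths into a few buckets before training. The function name and the bin edges below are assumptions chosen for illustration.

```python
def length_bucket(n_tokens, edges=(50, 150, 300)):
    """Map a token count to a coarse label such as '<150' or '>=300'."""
    for edge in edges:
        if n_tokens < edge:
            return '<' + str(edge)
    return '>=' + str(edges[-1])

# The three lengths from the output above, plus one short review.
buckets = [length_bucket(n) for n in (12, 210, 282, 320)]
print(buckets)
```

With buckets, reviews of 210 and 282 tokens share one feature value instead of being counted separately.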


## Scikit-Learn Classifiers
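Now let's try some other classifiers from scikit-learn through NLTK's SklearnClassifier wrapper: a support vector machine, AdaBoost, and a neural network.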

In [10]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
svmClass = SklearnClassifier(SVC(C = .7)).train(train_set)
print("SVM Classifier:")
print(nltk.classify.accuracy(svmClass, test_set))

from sklearn.ensemble import AdaBoostClassifier
adaClass = SklearnClassifier(AdaBoostClassifier()).train(train_set)
print("Adaboost Classifier:")
print(nltk.classify.accuracy(adaClass, test_set))

from sklearn.neural_network import MLPClassifier
nnClass = SklearnClassifier(MLPClassifier()).train(train_set)
print("Neural Network Classifier:")
print(nltk.classify.accuracy(nnClass, test_set))

SVM Classifier:
0.527
Adaboost Classifier:
0.594
Neural Network Classifier:
0.524
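
Since the sample holds 5000 reviews of each class, always guessing one class would score roughly 0.5, which is a useful yardstick for the accuracies above. The sketch below shows one way to compute accuracy and that majority-class baseline on toy labels. The function names and the labels are hypothetical; this is not how nltk.classify.accuracy is implemented.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def majority_baseline(gold):
    """Accuracy of always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)

# Toy labels for illustration only.
gold = ['fake', 'real', 'fake', 'real']
predicted = ['fake', 'real', 'real', 'real']

acc = accuracy(predicted, gold)
base = majority_baseline(gold)
print(acc, base)
```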
