# Training a Classifier on Fake Reviews
How to create and train a classifier to spot fake reviews on Yelp, using supervised learning and the Natural Language Toolkit (NLTK).

In [None]:
import nltk
import sqlite3
import random
import pandas as pd
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier

## The Dataset
We are using a dataset of Yelp reviews stored in a SQLite database. Each review is flagged as either 'fake' or 'real' and comes with some additional information, such as its rating and its useful, cool, and funny counts. The first step is to pull these reviews into dataframes, one per class. We'll be using pandas, which feels very familiar coming from R.

In [None]:
conn = sqlite3.connect('yelpHotelData.db')
# SQL string literals take single quotes; double quotes denote identifiers
query = "SELECT reviewContent, rating, usefulCount, coolCount, funnyCount FROM review WHERE flagged = 'Y'"
fake = pd.read_sql(query, conn)
query = "SELECT reviewContent, rating, usefulCount, coolCount, funnyCount FROM review WHERE flagged = 'N'"
real = pd.read_sql(query, conn)
conn.close()
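
The two queries differ only in the flag value, so the filter can be passed as a bound parameter instead of being written into the SQL text. The sketch below is a minimal illustration using an in-memory database with made-up rows; the helper name `fetch_reviews` and the reduced column list are assumptions for the example, not part of the original notebook. `pd.read_sql` accepts a `params` argument for the same purpose.

```python
import sqlite3

# In-memory stand-in for yelpHotelData.db; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (reviewContent TEXT, rating INTEGER, flagged TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [("Great stay", 5, "Y"), ("Room was fine", 3, "N"), ("Best hotel ever", 5, "Y")],
)

def fetch_reviews(connection, flag):
    """Return (reviewContent, rating) rows whose flagged column equals flag."""
    sql = "SELECT reviewContent, rating FROM review WHERE flagged = ?"
    return connection.execute(sql, (flag,)).fetchall()

fake_rows = fetch_reviews(conn, "Y")
real_rows = fetch_reviews(conn, "N")
conn.close()
```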

While the database has over 700,000 reviews, we are going to focus on a subset of them: hotel reviews. I'll explain why this matters when we get to the feature extractor. For now, we need a balanced dataset, so we sample 750 reviews from each class, tag them, and shuffle the combined dataframe.

In [None]:
fake = fake.sample(750)
fake['tag'] = 'fake'
real = real.sample(750)
real['tag'] = 'real'
df = pd.concat([fake,real])
df = df.iloc[np.random.permutation(len(df))]
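
Because the sampling and shuffling are random, every run produces a different dataset. A possible variation, sketched here on toy data, fixes a seed so results can be repeated. The seed value, the toy frames, and the name `balanced` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two query results; the real frames hold Yelp reviews.
fake = pd.DataFrame({"reviewContent": [f"fake {i}" for i in range(10)]})
real = pd.DataFrame({"reviewContent": [f"real {i}" for i in range(40)]})

n_per_class = 5  # the notebook uses 750
seed = 0         # hypothetical seed, only here to make the run repeatable

balanced = pd.concat([
    fake.sample(n_per_class, random_state=seed).assign(tag="fake"),
    real.sample(n_per_class, random_state=seed).assign(tag="real"),
])
# sample(frac=1) returns every row in random order; reset_index drops the
# duplicate index labels left over from concatenating two frames.
balanced = balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```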

Now that we have a single dataframe, we need to use NLTK to "tokenize" the reviews, splitting each one into a list of words. This makes each word easy to access and lets us build a frequency distribution across all of the reviews to find the 2000 most common words.

In [None]:
df['reviewContent'] = df.apply(lambda row: nltk.word_tokenize(row['reviewContent']), axis=1)
all_words = nltk.FreqDist(word.lower() for row in df['reviewContent'] for word in row)
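# most_common() sorts by count; slicing list(all_words) would follow insertion order
word_features = [word for word, _ in all_words.most_common(2000)]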
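
One detail worth checking when picking the top words: in NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so iterating over it yields words in the order they were first seen, while `most_common()` yields them by frequency. The sketch below uses a plain `Counter` and a made-up token list to show the difference.

```python
from collections import Counter

tokens = ["the", "room", "was", "clean", "and", "the", "staff", "was", "the", "best"]
counts = Counter(tokens)

first_seen = list(counts)[:2]                                # order of first appearance
most_frequent = [word for word, _ in counts.most_common(2)]  # order of frequency
```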

In [None]:
df

The word_features list contains the 2000 most common words in the reviews. We will use these words as binary features (true if the word appears in a review, false otherwise). This is why we need to focus on one type of review: word features pick up the most obvious differences between reviews. If the dataset mixed hotels, restaurants, and other businesses, the most obvious differences would be between those categories, and it would be incredibly hard to separate fake reviews from real ones.
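
As a small illustration of what a binary word feature looks like, the sketch below builds the "contains" dictionary for one tokenized review. The function name `contains_features` and the three-word vocabulary are stand-ins chosen for the example.

```python
def contains_features(tokens, vocabulary):
    """Map each vocabulary word to True if it occurs in tokens (case-insensitive)."""
    present = {token.lower() for token in tokens}
    return {"contains " + word: (word in present) for word in vocabulary}

vocabulary = ["room", "staff", "pool"]  # stand-in for word_features
review = ["The", "ROOM", "was", "tiny", "but", "the", "staff", "helped"]
features = contains_features(review, vocabulary)
```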

We can also pull the most common "bigrams", which are pairs of adjacent words (e.g. "i am", "this house").

In [None]:
df['bigrams'] = df.apply(lambda row: list(nltk.bigrams(row['reviewContent'])), axis=1)
bigrams = nltk.FreqDist(word[0].lower() +" "+ word[1].lower() for row in df['bigrams'] for word in row)
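# most_common() sorts by count; slicing list(bigrams) would follow insertion order
bigrams = [pair for pair, _ in bigrams.most_common(500)]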
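
To make the idea of a bigram concrete, here is a plain-Python sketch that pairs each token with its right-hand neighbour and counts the pairs across two made-up reviews. It mirrors what `nltk.bigrams` produces but is not the NLTK implementation; `adjacent_pairs` is a name invented for the example.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Yield each pair of neighbouring tokens as one lowercase string."""
    for left, right in zip(tokens, tokens[1:]):
        yield left.lower() + " " + right.lower()

reviews = [
    ["The", "front", "desk", "was", "rude"],
    ["Loved", "the", "front", "desk", "staff"],
]
pair_counts = Counter(pair for tokens in reviews for pair in adjacent_pairs(tokens))
top_pairs = [pair for pair, _ in pair_counts.most_common(2)]
```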

This script is very memory intensive, so we'll try to delete unused data structures as we go.

In [None]:
del all_words
del real
del fake

## The Feature Extractor
We have our reviews in a nice dataframe, but now we need to start grabbing features for each review. We'll create a function that extracts features from each row of the dataframe. It collects the word features mentioned earlier, along with a few other features that seem useful. There is a lot to unpack here, but essentially we are grabbing the 2000 word features, the 500 bigram features, the rating and the useful, cool, and funny counts from the review, and whether or not the reviewer says "me" or "I" more than five times. This combination of features seems to get consistently high accuracy.

In [None]:
def document_features(doc):
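    # word_features is lowercased, so lowercase the tokens before the membership test
    document_words = set(word.lower() for word in doc['reviewContent'])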
    features = {}
    # Grabbing the bigrams
    bigSet = []
    for word in doc['bigrams']:
        bigSet.append(word[0].lower() + " " +word[1].lower())
    bigSet = set(bigSet)
    for word in bigrams:
        features['bigram: ' + word] = (word in bigSet)

    # Counting the pronoun usage
    meCount = 0
    for word in doc['reviewContent']:
        if (word.lower() == 'i' or word.lower() == 'me'):
            meCount += 1
    me = False
    if (meCount > 5):
        me = True

    for word in word_features:
        features['contains ' + word] = (word in document_words)
    features.update(
    {'rating': doc['rating'], 'useful': doc['usefulCount'], 'cool': doc['coolCount'], 'funny': doc['funnyCount'], 'meCount': me})
    return [features,doc['tag']]
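
The pronoun check inside the extractor can be read in isolation as the sketch below. The helper name `heavy_first_person` and the `threshold` parameter are assumptions for illustration; the default of 5 matches the cutoff used above.

```python
def heavy_first_person(tokens, threshold=5):
    """True if 'i' and 'me' together appear more than threshold times."""
    count = sum(1 for token in tokens if token.lower() in {"i", "me"})
    return count > threshold

short_review = "I told me that I would".split()
long_review = ["I"] * 4 + ["me"] * 2 + ["stayed", "here"]
```

Because the cutoff is an absolute count, longer reviews are more likely to cross it. A ratio of pronouns to total tokens would be one possible length-independent alternative.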

Now that we have our function, we can easily apply it to the dataframe and create training and test sets. Our training set will be 80% of our data, and the test will be the rest.

In [None]:
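# .tolist() gives a plain Python list of [features, tag] pairs, so slicing is positional
featuresets = df.apply(document_features, axis = 1).tolist()
# Hold out the first 300 of the 1500 shuffled rows (20%) for testing
train_set, test_set = featuresets[300:], featuresets[:300]
del(featuresets)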
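
The 80/20 split is a positional slice over rows that were already shuffled. The sketch below reproduces the arithmetic on placeholder feature sets; `split_holdout` and the dummy features are invented for the example.

```python
def split_holdout(items, test_size):
    """Return (train, test), holding out the first test_size items for testing."""
    return items[test_size:], items[:test_size]

# Placeholder (features, tag) pairs standing in for the real feature sets.
examples = [({"rating": i % 5 + 1}, "fake" if i % 2 else "real") for i in range(1500)]
train_part, test_part = split_holdout(examples, 300)
test_share = len(test_part) / len(examples)
```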

## Naive Bayes Classifier
Now to train a learner on this data. We will use NLTK's built-in Naive Bayes classifier first, since it is quick and easy to implement. We simply train it, test it, and print the accuracy, along with the features the classifier identifies as most informative.

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)*100)
classifier.show_most_informative_features(7)

## Scikit-Learn Classifiers
Now let's try some other classifiers from scikit-learn, using NLTK's SklearnClassifier wrapper so they accept the same feature sets.

In [None]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
svmClass = SklearnClassifier(SVC()).train(train_set)
print("SVM Classifier:")
print(nltk.classify.accuracy(svmClass, test_set)*100)

from sklearn.ensemble import AdaBoostClassifier
adaClass = SklearnClassifier(AdaBoostClassifier()).train(train_set)
print("Adaboost Classifier:")
print(nltk.classify.accuracy(adaClass, test_set)*100)

from sklearn.neural_network import MLPClassifier
nnClass = SklearnClassifier(MLPClassifier()).train(train_set)
print("Neural Network Classifier:")
print(nltk.classify.accuracy(nnClass, test_set)*100)

## Results
The results are fairly consistent from run to run. AdaBoost is almost always the best algorithm, usually with accuracy above 75%. The SVM is consistently the worst, sometimes dropping to around 50%, which is no better than chance on a balanced dataset. It is using the default settings and does not play nicely with the lexical features. Naive Bayes and the neural network classifier are generally neck and neck: Naive Bayes stays close to 70%, while the neural network is less consistent and ranges from 65% to 75%.

## Optimization
Scikit-learn has a lot of built-in functions for cross-validation and for finding ideal parameters. The issue is that our dataset is formatted for NLTK, which is why we needed the wrappers in the previous examples. If we extract the features separately and convert them into numbers, we can start using these functions and see how they affect the results.

In [None]:
def svm_features(doc):
    """This function returns the features without the tags"""
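    # word_features is lowercased, so lowercase the tokens before the membership test
    document_words = set(word.lower() for word in doc['reviewContent'])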
    features = {}
    # Grabbing the bigrams
    bigSet = []
    for word in doc['bigrams']:
        bigSet.append(word[0].lower() + " " +word[1].lower())
    bigSet = set(bigSet)
    for word in bigrams:
        features['bigram: ' + word] = (word in bigSet)

    # Counting the pronoun usage
    meCount = 0
    for word in doc['reviewContent']:
        if (word.lower() == 'i' or word.lower() == 'me'):
            meCount += 1
    me = False
    if (meCount > 5):
        me = True

    for word in word_features:
        features['contains ' + word] = (word in document_words)
    features.update(
    {'rating': doc['rating'], 'useful': doc['usefulCount'], 'cool': doc['coolCount'], 'funny': doc['funnyCount'], 'meCount': me})
    return features

Now we will apply this feature extractor to every row, use DictVectorizer to convert the resulting feature dictionaries into a numeric array, and use grid search to find the best parameters for the SVM.

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
parameters = {'kernel':('linear','poly', 'rbf'), 'C':[1, 10]}
vec = DictVectorizer()
trainTags, testTags = df[250:],df[:250]
svmSet = df.apply(svm_features, axis = 1)
svmSet = vec.fit_transform(svmSet).toarray()
svmTest, svmTrain = svmSet[:250],svmSet[250:]
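
To see what the vectorizing step does, the sketch below runs `DictVectorizer` on two tiny, made-up feature dictionaries. Each key becomes a column, booleans become 1.0 or 0.0, and numeric values are kept as they are.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented feature dictionaries in the same shape svm_features returns.
rows = [
    {"contains pool": True, "meCount": False, "rating": 5},
    {"contains pool": False, "meCount": True, "rating": 2},
]
vec = DictVectorizer()
matrix = vec.fit_transform(rows).toarray()

# vocabulary_ maps each feature name to its column index.
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
```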

In [None]:
svr = SVC()
svmClass = GridSearchCV(svr, parameters)
svmClass.fit(svmTrain,trainTags['tag'])
print("SVM with Grid Search Cross-Validation: ")
print(svmClass.score(svmTest,testTags['tag'])*100)
# best_params_ holds the kernel and C value that scored best during the search
print(svmClass.best_params_)

## Conclusion
Clearly, preprocessing the features and tuning the parameters had a huge impact on the SVM. However, AdaBoost generally seems the most reliable for this kind of dataset. While it performs well, it's still not quite strong enough to be used in practice. In my own testing I've found that the false positive and false negative rates are usually about equal, so this classifier doesn't seem to lean one way or the other. There is more that can be done: the word tokenization can be optimized, and more advanced NLP tools could be used (such as part-of-speech tagging). Being able to see user information, such as posting habits or rating habits, could also greatly improve the accuracy of the classifier.
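
For readers who want to check the false positive and false negative rates themselves, here is a minimal sketch that computes both from gold and predicted tags. The function name `error_rates`, the choice of 'fake' as the positive label, and the sample lists are assumptions for illustration. With an NLTK classifier, the predicted tags could come from `classify_many` applied to the test feature sets.

```python
def error_rates(gold, predicted, positive="fake"):
    """Return (false_positive_rate, false_negative_rate) for the positive label."""
    pairs = list(zip(gold, predicted))
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    negatives = sum(1 for g in gold if g != positive)
    positives = sum(1 for g in gold if g == positive)
    return false_pos / negatives, false_neg / positives

# Made-up tags: one real review is called fake, one fake review is missed.
gold = ["fake", "fake", "real", "real", "real", "fake"]
pred = ["fake", "real", "real", "fake", "real", "fake"]
fpr, fnr = error_rates(gold, pred)
```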