# Training a Classifier on Fake Reviews
How to create and train a classifier to spot fake reviews on Yelp, using supervised learning and the Natural Language Toolkit (NLTK).

In [1]:
import nltk
import sqlite3
import random

The reviews are stored in a SQLite database. There are 750,000 reviews in total, but we will sample only 10,000 of them: 5,000 labeled fake and 5,000 labeled real.

Our first step will be to extract these reviews from the database.

In [2]:
conn = sqlite3.connect('yelpHotelData.db')
c = conn.cursor()
fake = []
real = []

In [3]:
# Pull every review flagged 'Y' or 'YR' (labeled fake here), tokenize it, and keep a random 5,000.
query = ("SELECT reviewContent, rating, usefulCount, coolCount, funnyCount "
         "FROM review WHERE flagged IN ('Y', 'YR')")
for row in c.execute(query):
    fake.append([nltk.word_tokenize(row[0]), row[1], row[2], row[3], row[4], 'fake'])
random.shuffle(fake)
fake = fake[:5000]

In [4]:
# Same query for reviews flagged 'N' or 'NR' (labeled real here), again keeping a random 5,000.
query = ("SELECT reviewContent, rating, usefulCount, coolCount, funnyCount "
         "FROM review WHERE flagged IN ('N', 'NR')")
for row in c.execute(query):
    real.append([nltk.word_tokenize(row[0]), row[1], row[2], row[3], row[4], 'real'])
random.shuffle(real)
real = real[:5000]
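
The two extraction cells differ only in the flag values and the label, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `sample_reviews` is a hypothetical name, and the in-memory table and its rows are made up so the example runs without `yelpHotelData.db`. It assumes the same column names as the `review` table queried above, and it samples rows before any tokenization so that discarded reviews are never tokenized.

```python
import random
import sqlite3


def sample_reviews(conn, flags, label, n, seed=None):
    """Return up to n randomly chosen rows whose `flagged` value is in flags."""
    # Build one "?" placeholder per flag so the values are passed as parameters.
    placeholders = ", ".join("?" for _ in flags)
    query = (
        "SELECT reviewContent, rating, usefulCount, coolCount, funnyCount "
        "FROM review WHERE flagged IN (" + placeholders + ")"
    )
    rows = conn.execute(query, list(flags)).fetchall()
    rng = random.Random(seed)  # a seed makes the sample reproducible
    rng.shuffle(rows)
    # Append the class label to each sampled row.
    return [list(row) + [label] for row in rows[:n]]


# Toy stand-in for the real database (hypothetical rows, same column names).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE review (reviewContent TEXT, rating INTEGER, usefulCount INTEGER, "
    "coolCount INTEGER, funnyCount INTEGER, flagged TEXT)"
)
demo.executemany(
    "INSERT INTO review VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Great stay", 5, 0, 0, 0, "Y"),
        ("Awful room", 1, 2, 0, 1, "N"),
        ("Best hotel ever", 5, 0, 0, 0, "YR"),
        ("Decent value", 3, 1, 1, 0, "NR"),
    ],
)

fake_demo = sample_reviews(demo, ("Y", "YR"), "fake", n=5, seed=0)
real_demo = sample_reviews(demo, ("N", "NR"), "real", n=1, seed=0)
```

Tokenization with `nltk.word_tokenize` would then be applied only to the sampled rows.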

Next, we combine the two samples and shuffle them.

In [5]:
documents = real + fake
random.shuffle(documents)

A common strategy for semantic analysis is to look at whether or not a word appears in the text. While it would be impractical to have a separate feature for every word, we can take the 2000 most common words and turn them into features. First we will count the words and build a list of the most frequent ones.

In [6]:
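# Count lowercased tokens across all sampled reviews.
all_words = nltk.FreqDist(word.lower() for (doc, rt, use, cool, fun, tg) in documents for word in doc)
# Ask for the 2,000 most frequent words explicitly rather than relying on iteration order.
word_features = [word for (word, count) in all_words.most_common(2000)]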
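
The order you get from iterating a frequency table depends on the implementation, so it is worth checking that the first 2000 entries really are the most frequent ones. The sketch below uses a plain `collections.Counter` as a stand-in for `FreqDist` (so it runs without NLTK) and a made-up token list. A `Counter` iterates in first-seen order, whereas `most_common` always ranks by count.

```python
from collections import Counter

# Hypothetical tokens, already lowercased.
tokens = ["the", "room", "was", "clean", "and", "the", "staff", "was", "nice", "the"]
counts = Counter(tokens)

# Slicing the keys keeps whichever words happened to appear first.
insertion_order = list(counts)[:2]

# most_common ranks by frequency, which is what a "top N words" list needs.
by_frequency = [word for word, _ in counts.most_common(2)]

print(insertion_order)  # ['the', 'room']
print(by_frequency)     # ['the', 'was']
```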

## The feature extractor
We have the data, and we have some word features we want to check. The next step is to construct a feature extractor that can pull the features from each document. We will construct a function that checks if each word in the set of 2000 words appears in the review. Each of these checks is actually a binary feature, so we will end up having 2000+ features for each review. We will also grab some other useful features, such as...

In [7]:
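def document_features(doc, rt, use, cool, fun):
    # Lowercase the tokens so they match the lowercased entries in word_features.
    document_words = set(word.lower() for word in doc)
    features = {}
    # One boolean feature per vocabulary word.
    for word in word_features:
        features['contains ' + word] = (word in document_words)
    # Review metadata plus the token count.
    features.update({'rating': rt, 'useful': use, 'cool': cool, 'funny': fun, 'length': len(doc)})
    return features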
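
To see what the extractor produces, here is a small self-contained sketch on a made-up review. `document_features_demo` is a hypothetical variant that takes the vocabulary as a parameter instead of reading the global `word_features`, and the three-word vocabulary stands in for the 2000-word list. It lowercases the review's tokens before the membership test, since the vocabulary was built from lowercased words.

```python
def document_features_demo(doc, rt, use, cool, fun, vocab):
    """Build the feature dict for one tokenized review (illustrative variant)."""
    document_words = set(word.lower() for word in doc)
    # One boolean "contains <word>" feature per vocabulary entry.
    features = {"contains " + word: (word in document_words) for word in vocab}
    # Metadata features, plus the number of tokens in the review.
    features.update(
        {"rating": rt, "useful": use, "cool": cool, "funny": fun, "length": len(doc)}
    )
    return features


vocab = ["clean", "yoga", "dirty"]             # hypothetical vocabulary
review = ["The", "room", "was", "Clean", "."]  # hypothetical tokenized review

feats = document_features_demo(review, 4, 1, 0, 0, vocab)
for name, value in feats.items():
    print(name, "=", value)
```

With a vocabulary of 2000 words, the same function yields 2005 features per review: 2000 booleans and the five numeric values.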

Now that we have the feature extractor, it's time to extract those features and create the training and test datasets.

In [8]:
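# Pair each feature dict with its 'fake' / 'real' label.
featuresets = [(document_features(d, rt, use, cool, fun), label) for (d, rt, use, cool, fun, label) in documents]
# Hold out the first 2,000 examples for testing and train on the rest.
train_set, test_set = featuresets[2000:], featuresets[:2000]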
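
Because the documents were shuffled before the split, both halves should contain a similar mix of labels, but that is easy to confirm. The sketch below is an illustration on made-up data: `split_holdout` and `label_counts` are hypothetical helper names, and the ten toy examples stand in for the real feature sets.

```python
from collections import Counter


def split_holdout(featuresets, n_test):
    """Use the first n_test examples for testing and the rest for training."""
    return featuresets[n_test:], featuresets[:n_test]


def label_counts(labeled):
    """Count how many examples carry each label."""
    return Counter(label for _, label in labeled)


# Hypothetical (features, label) pairs, alternating labels.
toy = [({"length": i}, "fake" if i % 2 else "real") for i in range(10)]

toy_train, toy_test = split_holdout(toy, n_test=4)

print(len(toy_train), len(toy_test))  # 6 4
print(label_counts(toy_train))
print(label_counts(toy_test))
```

If one label dominated the test set, the accuracy figure would be harder to interpret, so a quick count like this is a cheap sanity check.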

### Naive Bayes Classifier
NLTK comes with a few built-in classifiers; we will first try the naive Bayes classifier. It is a reasonable first choice for our data because it trains quickly, which matters when the dataset is large and each review has many features.

In [9]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(7)

0.554
Most Informative Features
           contains yoga = True             real : fake   =     13.3 : 1.0
      contains newspaper = True             real : fake   =      6.5 : 1.0
         contains nicest = True             real : fake   =      6.0 : 1.0
                  length = 340              real : fake   =      5.8 : 1.0
                  length = 220              fake : real   =      5.5 : 1.0
   contains overpowering = True             fake : real   =      5.2 : 1.0
       contains research = True             real : fake   =      5.1 : 1.0
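
To put the 0.554 accuracy in context, recall that the sample was drawn to contain up to 5,000 reviews of each class, so a classifier that always predicts the same label would score close to 0.5. The sketch below illustrates that comparison on made-up data. `accuracy` and `majority_baseline` are hypothetical helpers written for this example; accuracy is computed as the fraction of test examples whose predicted label matches the true label.

```python
from collections import Counter


def accuracy(predict, labeled):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    correct = sum(1 for feats, label in labeled if predict(feats) == label)
    return correct / len(labeled)


def majority_baseline(train):
    """Return a predictor that ignores the features and always outputs
    the most frequent training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda feats: most_common_label


# Hypothetical data: the training set leans 'real', the test set is balanced.
toy_train = [({}, "real"), ({}, "real"), ({}, "real"), ({}, "fake")]
toy_test = [({}, "real"), ({}, "fake"), ({}, "real"), ({}, "fake")]

baseline = majority_baseline(toy_train)
baseline_accuracy = accuracy(baseline, toy_test)
print(baseline_accuracy)  # 0.5 on a balanced test set
```

Measured against such a baseline, 0.554 is only a modest improvement over always guessing one class.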
