# Capstone Project

In this project, I download Yelp reviews, isolate the star value and text of each review, clean the text, split the dataset into folds, build a RandomForest model, and attempt to predict star values of a testing set using text as features.

In [558]:
import requests
import random
import pandas as pd
import numpy as np
import sqlite3 as lite
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import time
import datetime
from dateutil.parser import parse
import collections
import json
import ijson
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.cross_validation import KFold
from sklearn.cross_validation import StratifiedKFold

## Download the Yelp Data Challenge dataset

https://www.yelp.com/dataset_challenge/dataset

## Create dataset from first 25,000 reviews

In [559]:
sample_size = 25000

- The dataset contains ~2.2 million reviews. For this project, I read the first 25,000 of them. The file stores one JSON object per line, so the following code parses each line separately. To simplify the analysis, I created a binary classification structure: reviews with < 3 stars are labeled 0, reviews with > 3 stars are labeled 1, and 3-star reviews are dropped, which leaves 20,963 of the 25,000 reviews.

In [560]:
data = []
data_test = []
y = []
X = []
with open('yelp_academic_dataset_review.json','rU') as reviews:
    i = 0
    while i < sample_size:
        first_line = json.loads(reviews.readline())
        if first_line['stars'] != 3:
            if first_line['stars'] < 3:
                first_line['stars'] = 0
            else:
                first_line['stars'] = 1
            data.append([first_line['stars'], first_line['text']])
            y.append(first_line['stars'])
            X.append(first_line['text'])
        i += 1
X = np.array(X)
y = np.array(y)
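
The labeling rule can also be checked on a few hand-written records. The following is a minimal sketch in Python 3 syntax, not the cell above: `load_binary_reviews` and the three sample records are hypothetical, and an in-memory buffer stands in for `yelp_academic_dataset_review.json`.

```python
import io
import json

def load_binary_reviews(lines, sample_size):
    """Read up to sample_size JSON lines; return (texts, labels)."""
    texts, labels = [], []
    for i, line in enumerate(lines):
        if i >= sample_size:
            break
        review = json.loads(line)
        if review['stars'] == 3:
            continue  # 3-star reviews are dropped, not labeled
        labels.append(0 if review['stars'] < 3 else 1)
        texts.append(review['text'])
    return texts, labels

# Hypothetical stand-in for the review file: one JSON object per line
sample = io.StringIO(
    '{"stars": 1, "text": "Cold food."}\n'
    '{"stars": 3, "text": "It was fine."}\n'
    '{"stars": 5, "text": "Loved it!"}\n'
)
texts, labels = load_binary_reviews(sample, sample_size=25000)
print(texts, labels)
```

Iterating over the file object also stops cleanly at the end of the file, so this variant would not fail if `sample_size` were larger than the number of lines.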

## Clean text for Bag-of-Words:

- The following block is a text-cleaning function. It returns the review as a single space-separated string of words without punctuation, capitalization, or overly common words (the, a, is, etc.).

In [561]:
def review_to_words(raw_review):
    #     1) Remove HTML (name the parser explicitly to avoid a BeautifulSoup warning):
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    #     2) Replace non-letters with spaces:
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    #     3) Convert to lower case, split into words:
    words = letters_only.lower().split()
    #     4) Convert stopwords to a set for fast membership tests:
    stops = set(stopwords.words('english'))
    #     5) Remove stopwords:
    meaningful_words = [w for w in words if w not in stops]
    #     6) Join words back into one string separated by space:
    return ' '.join(meaningful_words)
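
To see what the cleaning steps do to a single review, here is a simplified, standard-library-only sketch in Python 3 syntax. It is not the function above: `clean_text`, the crude tag-stripping regex, and the tiny `STOPS` set are assumptions made for illustration, standing in for BeautifulSoup and the NLTK stop-word list.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
STOPS = {'the', 'a', 'is', 'was', 'and', 'it', 'to'}

def clean_text(raw, stops=STOPS):
    no_tags = re.sub(r'<[^>]+>', ' ', raw)             # crude HTML tag removal
    letters_only = re.sub(r'[^a-zA-Z]', ' ', no_tags)  # non-letters become spaces
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in stops)

print(clean_text("The pizza was <b>AMAZING</b>, and it's cheap!"))
# prints: pizza amazing s cheap
```

With this small stop list, the apostrophe in "it's" is replaced by a space and leaves a stray "s" behind. Replacing every non-letter also discards digits, so a phrase such as "5 stars" is reduced to "stars".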

In [562]:
num_reviews = X.size
num_reviews

20963

In [563]:
print "Cleaning and parsing the Yelp reviews...\n"
clean_reviews = []
for i in xrange(0, num_reviews):
    if (i+1)%1000 == 0:
        print "Review %d of %d\n" % (i+1,num_reviews)
    clean_reviews.append(review_to_words(X[i]))

Cleaning and parsing the Yelp reviews...

Review 1000 of 20963

Review 2000 of 20963

Review 3000 of 20963

Review 4000 of 20963

Review 5000 of 20963

Review 6000 of 20963

Review 7000 of 20963

Review 8000 of 20963

Review 9000 of 20963

Review 10000 of 20963

Review 11000 of 20963

Review 12000 of 20963

Review 13000 of 20963

Review 14000 of 20963

Review 15000 of 20963

Review 16000 of 20963

Review 17000 of 20963

Review 18000 of 20963

Review 19000 of 20963

Review 20000 of 20963



In [564]:
clean_reviews = np.array(clean_reviews)

## Vectorize word counts

In [565]:
print 'Creating the bag of words...\n'
vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

Creating the bag of words...
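
The idea behind a capped bag of words can be sketched without scikit-learn. The example below, in Python 3 syntax, is a simplified illustration and not `CountVectorizer` itself: `fit_vocabulary` and `transform` are hypothetical helpers, ties are broken alphabetically by assumption, and the tokenizer is a plain whitespace split.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent words seen in docs."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Most frequent first; ties broken alphabetically (an assumption here)
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    return sorted(top)

def transform(docs, vocab):
    """Count vocabulary words per document; unknown words are ignored."""
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in index:
                row[index[word]] += 1
        rows.append(row)
    return rows

train = ['great food great service', 'terrible service']
test = ['great pizza']
vocab = fit_vocabulary(train, max_features=3)
print(vocab)                    # ['food', 'great', 'service']
print(transform(train, vocab))  # [[1, 2, 1], [0, 0, 1]]
print(transform(test, vocab))   # [[0, 1, 0]]
```

The vocabulary is learned from the training documents only, so a test word that never appeared in training ("pizza") contributes nothing to the test vector. The notebook follows the same pattern by calling `fit_transform` on the training fold and `transform` on the test fold.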



## Partition the data into folds

In [566]:
skf = StratifiedKFold(y, n_folds=10)
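
Stratification means that every fold keeps roughly the same ratio of 0 and 1 labels as the full dataset. The sketch below, in Python 3 syntax, shows the idea with a hypothetical `stratified_folds` helper that deals indices round-robin within each class. This assignment scheme is an assumption for illustration; scikit-learn assigns indices differently. Note also that newer scikit-learn releases provide this class in `sklearn.model_selection` rather than `sklearn.cross_validation`, with a different call signature.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each index to a fold so every fold has a similar class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal indices round-robin
    return [sorted(fold) for fold in folds]

labels = [0] * 5 + [1] * 15  # toy labels: 25% negative, 75% positive
folds = stratified_folds(labels, n_folds=5)
for test_idx in folds:
    positives = sum(labels[i] for i in test_idx)
    print(len(test_idx), positives)  # each fold: 4 items, 3 of them positive
```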

## Train model on vectorized training text, generate RandomForest model

In [567]:
forest = RandomForestClassifier(n_estimators=100)

In [568]:
precision_list = []
recall_list = []
output_list = []
f1_score_list = []
confusion_list = []
for train, test in skf:
#     Partition the data
    X_train, X_test = clean_reviews[train], clean_reviews[test]
    y_train, y_test = y[train], y[test]
#     Fit model and learn features, transform into vectors
    data_features_train = vectorizer.fit_transform(X_train)
    data_features_train = data_features_train.toarray()
#     Transform test data into vectors
    data_features_test = vectorizer.transform(X_test)
    data_features_test = data_features_test.toarray()
    print 'Training the random forest...'
#     Train RandomForest on training data
    forest = forest.fit(data_features_train, y_train)

#     Try to predict using RandomForest model
    print 'Testing using the random forest...'
    result = forest.predict(data_features_test)
#     Statistical analysis for each fold
    precision = metrics.precision_score(y_test,result)
    precision_list.append(precision)
    recall = metrics.recall_score(y_test,result)
    recall_list.append(recall)
    f1_score = metrics.f1_score(y_test,result)
    f1_score_list.append(f1_score)
    confusion = metrics.confusion_matrix(y_test,result)
    confusion_list.append(confusion)

    print 'PRECISION:', precision, 'RECALL:', recall, 'F1 SCORE:', f1_score, 'CONFUSION MATRIX:', confusion

Training the random forest...
Testing using the random forest...
PRECISION: 0.889216263995 RECALL: 0.960534691279 F1 SCORE: 0.923500611995 CONFUSION MATRIX: [[ 338  188]
 [  62 1509]]
Training the random forest...
Testing using the random forest...
PRECISION: 0.88340034463 RECALL: 0.978994271165 F1 SCORE: 0.928743961353 CONFUSION MATRIX: [[ 323  203]
 [  33 1538]]
Training the random forest...
Testing using the random forest...
PRECISION: 0.898224852071 RECALL: 0.966263526416 F1 SCORE: 0.93100275989 CONFUSION MATRIX: [[ 354  172]
 [  53 1518]]
Training the random forest...
Testing using the random forest...
PRECISION: 0.888436955259 RECALL: 0.973265436028 F1 SCORE: 0.928918590522 CONFUSION MATRIX: [[ 334  192]
 [  42 1529]]
Training the random forest...
Testing using the random forest...
PRECISION: 0.880571428571 RECALL: 0.980903882877 F1 SCORE: 0.928033724782 CONFUSION MATRIX: [[ 317  209]
 [  30 1541]]
Training the random forest...
Testing using the random forest...
PRECISION: 0.8891

In [569]:
data_features_train.shape

(18868, 5000)

## Aggregate statistics on each fold

In [570]:
mean_precision = np.mean(precision_list)
mean_recall = np.mean(recall_list)
mean_f1 = np.mean(f1_score_list)
std_precision = np.std(precision_list)
std_recall = np.std(recall_list)
std_f1 = np.std(f1_score_list)
print 'MEAN PRECISION:', mean_precision
print 'STD PRECISION:', std_precision
print 'MEAN RECALL:', mean_recall
print 'STD RECALL:', std_recall
print 'MEAN F1 SCORE:', mean_f1
print 'STD F1 SCORE', std_f1
print 'CONFUSION MATRIX LIST:', confusion_list

MEAN PRECISION: 0.887422699805
STD PRECISION: 0.00543863227911
MEAN RECALL: 0.970138213722
STD RECALL: 0.0108069938289
MEAN F1 SCORE: 0.92689011571
STD F1 SCORE 0.0048199781492
CONFUSION MATRIX LIST: [array([[ 338,  188],
       [  62, 1509]]), array([[ 323,  203],
       [  33, 1538]]), array([[ 354,  172],
       [  53, 1518]]), array([[ 334,  192],
       [  42, 1529]]), array([[ 317,  209],
       [  30, 1541]]), array([[ 336,  190],
       [  47, 1524]]), array([[ 332,  194],
       [  45, 1525]]), array([[ 312,  213],
       [  33, 1537]]), array([[ 340,  185],
       [  35, 1535]]), array([[ 337,  188],
       [  89, 1481]])]


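- Mean precision of 89%, mean recall of 97%, and mean F1 score of 93% across the 10 folds
- Based on these stats, I would say this model did a fairly reasonable job predicting the positive class of the test data. The confusion matrices show that most errors fall on the negative class: in each fold, roughly 33% to 41% of the truly negative reviews are predicted positive (188 of 526 in the first fold), while only about 2% to 6% of the positive reviews are predicted negative. The classes are also imbalanced, with about 75% of the reviews labeled positive, so the headline scores should be read with that in mind.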
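
The per-fold scores can be recomputed directly from a confusion matrix, which makes it easier to see where the errors sit. This is a minimal sketch in Python 3 syntax: `binary_metrics` is a hypothetical helper, and it assumes scikit-learn's layout (rows are true labels, columns are predicted labels). The all-positive baseline is an added comparison, not something the notebook ran.

```python
def binary_metrics(confusion):
    """Precision, recall and F1 for the positive class (label 1)."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrix of the first fold, copied from the output above
fold_1 = [[338, 188], [62, 1509]]
precision, recall, f1 = binary_metrics(fold_1)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8892 0.9605 0.9235

# Baseline for comparison: label every review in the same fold as positive
baseline = [[0, 338 + 188], [0, 62 + 1509]]
print([round(m, 4) for m in binary_metrics(baseline)])  # [0.7492, 1.0, 0.8566]

# Share of truly negative reviews that the model labeled positive
negative_error_rate = 188 / (338 + 188)
print(round(negative_error_rate, 4))  # 0.3574
```

Against this baseline, the gain in F1 on the first fold is about 0.07, and roughly 36% of the negative reviews in that fold are still predicted positive.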