# Yelp Dataset Challenge

# Overview

The dataset is a single gzip-compressed file containing one JSON object per line. Every object has a 'type' field that tells you whether it is a business, a user, or a review.
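
As a minimal sketch of reading this layout, assuming the file is gzip-compressed UTF-8 text with one object per line, the records can be streamed and tallied by their 'type' field. The helper name `iter_records` and the stand-in file are illustrative and not part of the dataset.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def iter_records(path):
    """Yield one decoded JSON object per non-empty line of a gzip file."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in file so the sketch runs without the real dataset.
sample = [
    {'type': 'business', 'business_id': 'b1', 'stars': 4.0},
    {'type': 'user', 'user_id': 'u1', 'average_stars': 3.5},
    {'type': 'review', 'business_id': 'b1', 'user_id': 'u1', 'stars': 4},
]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as handle:
    for record in sample:
        handle.write(json.dumps(record) + '\n')

# Count how many objects of each type the file holds.
type_counts = Counter(record['type'] for record in iter_records(sample_path))
print(type_counts)
```

Because `iter_records` is a generator, only one line is held in memory at a time.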

--------------


## Business Objects

Business objects contain basic information about local businesses. The 'business_id' field can be used with the Yelp API to fetch even more information for visualizations, but note that you'll still need to comply with the API TOS. The fields are as follows:

```json
{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)],
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
```

-------
        
## Review Objects

Review objects contain the review text, the star rating, and information on votes Yelp users have cast on the review. Use user_id to associate this review with others by the same user. Use business_id to associate this review with others of the same business.

```json
{
  'type': 'review',
  'business_id': (the identifier of the reviewed business),
  'user_id': (the identifier of the authoring user),
  'stars': (star rating, integer 1-5),
  'text': (review text),
  'date': (date, formatted like '2011-04-19'),
  'votes': {
    'useful': (count of useful votes),
    'funny': (count of funny votes),
    'cool': (count of cool votes)
  }
}
```
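
A rough sketch of the association described above, using plain dictionaries: reviews are grouped by `user_id` and by `business_id`, and the per-business groups are reduced to a mean rating. The sample records are made up for illustration and carry only the fields needed here.

```python
from collections import defaultdict

# Hypothetical review records, trimmed to the fields used for grouping.
reviews = [
    {'business_id': 'b1', 'user_id': 'u1', 'stars': 5},
    {'business_id': 'b2', 'user_id': 'u1', 'stars': 3},
    {'business_id': 'b1', 'user_id': 'u2', 'stars': 4},
]

by_user = defaultdict(list)
by_business = defaultdict(list)
for review in reviews:
    by_user[review['user_id']].append(review)
    by_business[review['business_id']].append(review)

# Mean star rating per business, computed from the grouped reviews.
business_mean = {
    business_id: sum(r['stars'] for r in group) / len(group)
    for business_id, group in by_business.items()
}
print(business_mean)
```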

--------

## User Objects

User objects contain aggregate information about a single user across all of Yelp (including businesses and reviews not in this dataset).

```json
{
  'type': 'user',
  'user_id': (unique user identifier),
  'name': (first name, last initial, like 'Matt J.'),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {
    'useful': (count of useful votes across all reviews),
    'funny': (count of funny votes across all reviews),
    'cool': (count of cool votes across all reviews)
  }
}
```

![](http://i.imgur.com/QEHb5lU.gif)

The task is to predict the star rating that a given user would give a restaurant.

The dataset covers:

- 11,537 businesses
- 8,282 check-ins
- 43,873 users
- 229,907 reviews

Link to [Official Yelp Website](http://www.yelp.com/dataset_challenge)

In [7]:
import os
import sys
import operator
import functools
import itertools
import boto
import warnings
import json
import pandas as pd

import matplotlib.pyplot as plt
import graphlab as gl

from textblob import TextBlob
from os.path import join as jp

try:
    from configparser import ConfigParser
except ImportError:
    from ConfigParser import ConfigParser
    
gl.canvas.set_target('ipynb')

## Technical Challenges

1. Big Data... somewhat
    - More like _Medium Data_
2. Highly Networked Data Structures
3. User Sentiment Analysis

![](http://i.imgur.com/LRuQh0N.gif)

## Proposed Solutions

1. Streaming and Lazy Evaluation. Also utilize compression.
2. Use Graph Algorithms and parsing strategies
3. Magic!??
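
To make the first item concrete, here is a small sketch of streaming with lazy evaluation built from generators. The function names `parse_lines` and `only_type` are invented for this example, and the list of strings stands in for a (possibly compressed) file handle.

```python
import itertools
import json


def parse_lines(lines):
    """Lazily decode JSON lines; nothing is parsed until it is consumed."""
    for line in lines:
        yield json.loads(line)


def only_type(records, wanted):
    """Lazily keep records whose 'type' field matches `wanted`."""
    return (record for record in records if record['type'] == wanted)


# Stand-in for a file handle: any iterable of lines works the same way.
lines = [
    '{"type": "review", "stars": 5}',
    '{"type": "user", "average_stars": 3.7}',
    '{"type": "review", "stars": 2}',
    '{"type": "review", "stars": 4}',
]

reviews = only_type(parse_lines(lines), 'review')

# Take only the first two reviews; the last line is never parsed.
first_two = list(itertools.islice(reviews, 2))
print(first_two)
```

The pipeline does no work until `islice` pulls from it, which is what keeps memory flat on larger inputs.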

## S3 Remote File Streaming

In [18]:
from IPython.display import *

In [3]:
def aws_config(cfg):
    """
    Pick the credentials profile to use: the section named after the
    current login user if present, otherwise the first section in cfg.
    """
    user = os.getlogin()
    valid_user = user in cfg.sections()

    return user if valid_user else cfg.sections()[0]

def s3_signin(**auth):
    """
    Convenience function for validating keys and providing
    access to bucket shares.

    Returns a boto S3Connection
    """
    token_ids = 'aws_access_key_id', 'aws_secret_access_key'

    if all(key in auth for key in token_ids):
        # Both keys were passed in explicitly as keyword arguments
        user_id = {key: auth[key] for key in token_ids}
    else:
        # Fall back to the profile stored in ~/.aws/credentials
        cfg = ConfigParser()
        cfg.read(jp(os.getenv('HOME'), '.aws', 'credentials'))
        account = aws_config(cfg)
        user_id = {key: cfg.get(account, key) for key in token_ids}

    if not all(user_id.values()):
        raise ValueError('No valid authorization found')

    return boto.connect_s3(**user_id)
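
For reference, a hedged sketch of the credentials lookup on its own, without touching S3: it writes a throwaway file in the INI layout used by `~/.aws/credentials` and reads the two keys back with `ConfigParser`. The profile name and key values are placeholders, and the sketch uses the Python 3 `configparser` module.

```python
import os
import tempfile
from configparser import ConfigParser

# Placeholder credentials in the same INI layout as ~/.aws/credentials.
credentials_text = (
    "[default]\n"
    "aws_access_key_id = EXAMPLEKEYID\n"
    "aws_secret_access_key = exampleSecretKey\n"
)
path = os.path.join(tempfile.mkdtemp(), 'credentials')
with open(path, 'w') as handle:
    handle.write(credentials_text)

cfg = ConfigParser()
cfg.read(path)

# Lowercase "default" is an ordinary section, so it appears in sections().
profile = cfg.sections()[0]

token_ids = ('aws_access_key_id', 'aws_secret_access_key')
user_id = {key: cfg.get(profile, key) for key in token_ids}
print(profile, sorted(user_id))
```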

## Key and Configuration Management

In [None]:
s3 = s3_signin()

gl.aws.set_credentials(s3.aws_access_key_id, s3.aws_secret_access_key)

## Remote JSON to DataFrame

In [None]:
def remote_json_loader(filename):
    """
    Load JSON from a remote data store.
    """
    sf = gl.SFrame.read_csv(filename, delimiter='\n', header=False)
    return sf.unpack('X1', column_name_prefix='')

def gen_data_url(s3, bucket , dataset):
    s3_dir   = s3.get_bucket(bucket)
    s3_urls  = [
        '/'.join(['s3:/', s3_dir.name, d.name])
                for d in s3_dir.list(dataset)
    ]
    for url in s3_urls:
        yield url

def flatten(sf):
    """
    Select the nested (dict or list) columns of an SFrame,
    i.e. the columns that are candidates for unpacking.
    """
    dtypes = dict(zip(sf.column_names(), gl.SFrame.dtype(sf)))
    cols = [k for k,v in dtypes.items() if v in [dict, list]]
    return sf[cols]
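
The helper above only picks out the dict- and list-typed columns. Expanding a dict column into one column per key (the way `votes` later shows up as `votes_funny`, `votes_cool`, and `votes_useful`) can be sketched in plain Python. `flatten_record` is a made-up name for this illustration and is not a GraphLab function.

```python
def flatten_record(record, sep='_'):
    """Return a copy of `record` with one level of nested dicts expanded."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Prefix each inner key with the name of the outer column.
            for inner_key, inner_value in value.items():
                flat[key + sep + inner_key] = inner_value
        else:
            flat[key] = value
    return flat


user = {'user_id': 'u1', 'votes': {'funny': 0, 'useful': 5, 'cool': 2}}
flat_user = flatten_record(user)
print(flat_user)
```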

![](http://i.imgur.com/JzaJ7s6.gif)

# Holy Crap Evil Unicorn Power

In [4]:
# Data On the Internet!
aws   = 'https://s3-us-west-1.amazonaws.com/ds3-machine-learning/yelp/{file}.csv'
links = (aws.format(file=f) for f in ['business', 'user', 'review'])

business, user, review = map(gl.SFrame.read_csv, links)

PROGRESS: Downloading https://s3-us-west-1.amazonaws.com/ds3-machine-learning/yelp/business.csv to /var/tmp/graphlab-jjangsangy/3456/000000.csv
PROGRESS: Finished parsing file https://s3-us-west-1.amazonaws.com/ds3-machine-learning/yelp/business.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.114366 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,list,str,str,float,float,str,int,int,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file https://s3-us-west-1.amazonaws.com/ds3-machine-learning/yelp/business.csv
PROGRESS: Parsing completed. Parsed 11537 lines in 0.115099 secs.
PROGRESS: Downloading https://s3-us-west-1.amazonaws.com/ds3-machine-learning/yelp/user.csv to /var/tmp/graphlab-jjangsangy/3456/000001.csv
PR

![](http://i.imgur.com/xsDUgFE.png)

# Fits in 8 GB of RAM!!

![](http://i.imgur.com/uz3OXl9.gif)

![](http://i.imgur.com/U40CbGB.gif)

## Data Compression (Please..)

Combine the three tables into a single, more compact data structure.

We join them on the user and business keys; the old objects then get garbage collected.

In [52]:
review_business = review.join(business, how='inner', on='business_id')
review_business = review_business.rename({'stars.1': 'business_avg_stars', 
                                          'type.1' : 'business_type',
                                          'review_count': 'business_review_count'})

In [53]:
user_review = review_business.join(user, how='inner', on='user_id')
user_review = user_review.rename({'name.1': 'user_name', 
                                  'type.1': 'user_type', 
                                  'average_stars': 'user_avg_stars',
                                  'review_count' : 'user_review_count'})
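
What the join-then-rename steps do can be sketched with plain lists of dicts. This is an illustration of an inner join with column renaming, not how SFrame implements it; `inner_join` and the sample rows are invented for the example.

```python
def inner_join(left, right, key, rename=None):
    """Inner-join two lists of dicts on `key`, renaming right-hand columns per `rename`."""
    rename = rename or {}

    # Index the right-hand rows by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(row)
            for column, value in match.items():
                if column == key:
                    continue
                merged[rename.get(column, column)] = value
            joined.append(merged)
    return joined


reviews = [
    {'review_id': 'r1', 'business_id': 'b1', 'stars': 5},
    {'review_id': 'r2', 'business_id': 'b9', 'stars': 2},  # no matching business
]
businesses = [
    {'business_id': 'b1', 'stars': 4.0, 'review_count': 116},
]

review_business = inner_join(
    reviews, businesses, key='business_id',
    rename={'stars': 'business_avg_stars',
            'review_count': 'business_review_count'},
)
print(review_business)
```

Rows without a match on the key are dropped, which is the behaviour `how='inner'` asks for.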

![](http://i.imgur.com/wggPoky.gif)

In [78]:
yelp_reviews = user_review.join(review_business, on='review_id')

# Split Testing and Training Set

Data Science stuff: keep roughly 80% of the joined reviews for training and hold out the rest for testing.

In [79]:
train_set, test_set = yelp_reviews.random_split(0.8, seed=1)
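
A seeded split of this kind can be sketched in a few lines of standard-library Python. This is an assumption about the general technique (each row is assigned independently with the given probability), not a description of GraphLab's internals; the function below is a stand-in with the same name.

```python
import random


def random_split(rows, fraction, seed):
    """Assign each row to the first split with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.8, seed=1)

# The same seed reproduces exactly the same split.
again_train, again_test = random_split(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows), train_rows == again_train)
```

With this approach the split sizes are only approximately 80/20, but fixing the seed makes the result repeatable.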

In [113]:
display(train_set.head(3))

business_id,date,review_id,stars,text,type
9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for break ...,review
ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad reviews ...,review
6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also ...,review

user_id,votes,year,month,day,categories,city
rLtl8ZkDX5vH5nAx9C3q5Q,"{'funny': 0, 'useful': 5, 'cool': 2} ...",2011,1,26,"[Breakfast & Brunch, Restaurants] ...",Phoenix
0a2KyEL0d3Yb1V6aivbIuQ,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2011,7,27,"[Italian, Pizza, Restaurants] ...",Phoenix
0hT2KtfLiobPvh6cDC8JQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,6,14,"[Middle Eastern, Restaurants] ...",Tempe

full_address,latitude,longitude,name,open,business_review_count,business_avg_stars
"6106 S 32nd St\nPhoenix, AZ 85042 ...",33.3908,-112.013,Morning Glory Cafe,1,116,4.0
"4848 E Chandler Blvd\nPhoenix, AZ 85044 ...",33.3056,-111.979,Spinato's Pizzeria,1,102,4.0
"1513 E Apache Blvd\nTempe, AZ 85281 ...",33.4143,-111.913,Haji-Baba,1,265,4.5

state,business_type,user_avg_stars,user_name,user_review_count,user_type,votes_funny,votes_cool,votes_useful
AZ,business,3.72,Jason,376,user,331,322,1034
AZ,business,5.0,Paul,2,user,2,0,0
AZ,business,4.33,Nicole,3,user,0,0,3

business_id.1,date.1,stars.1,text.1,type.1,user_id.1
9yKzy9PApeiPPOUJEtnvkg,2011-01-26,5,My wife took me here on my birthday for break ...,review,rLtl8ZkDX5vH5nAx9C3q5Q
ZRJwVLyzEJq1VAihDhYiow,2011-07-27,5,I have no idea why some people give bad reviews ...,review,0a2KyEL0d3Yb1V6aivbIuQ
6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,4,love the gyro plate. Rice is so good and I also ...,review,0hT2KtfLiobPvh6cDC8JQg

votes.1,year.1,month.1,day.1,categories.1,...
"{'funny': 0, 'useful': 5, 'cool': 2} ...",2011,1,26,"[Breakfast & Brunch, Restaurants] ...",...
"{'funny': 0, 'useful': 0, 'cool': 0} ...",2011,7,27,"[Italian, Pizza, Restaurants] ...",...
"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,6,14,"[Middle Eastern, Restaurants] ...",...


![](http://i.imgur.com/iv8xcTU.gif)

In [81]:
display_javascript(train_set['city'].show())

# Train Regression Model!

-------------

![](http://i.imgur.com/iDRoqCb.gif)

In [90]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count', 
                                                'city'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164052
PROGRESS: Number of features          : 5
PROGRESS: Number of unpacked features : 5
PROGRESS: Number of coefficients    : 65
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.203407     | 3.971958           | 3.5646

In [83]:
model.evaluate(test_set)

{'max_error': 4.016124743972821, 'rmse': 0.9706849263734884}
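
For readers unfamiliar with the two numbers reported here, this sketch computes a root-mean-square error and a maximum absolute error by hand on made-up ratings. The `evaluate` function is a stand-in written for this example.

```python
import math


def evaluate(actual, predicted):
    """Return the max absolute error and RMSE between two equal-length sequences."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {'max_error': max(errors), 'rmse': rmse}


actual = [5, 3, 4, 1]
predicted = [4, 3, 4, 3]
metrics = evaluate(actual, predicted)
print(metrics)
```

An RMSE close to 1 therefore means predictions are typically off by about one star, while `max_error` reports the single worst miss.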

![](http://i.imgur.com/uN2FbbK.gif)

In [84]:
model.summary()

Class                         : LinearRegression

Schema
------
Number of coefficients        : 65
Number of examples            : 205139
Number of feature columns     : 5
Number of unpacked features   : 5

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 1
Solver status                 : SUCCESS: Optimal solution found.
Training time (sec)           : 0.3522

Settings
--------
Residual sum of squares       : 193305.2713
Training RMSE                 : 0.9707

Highest Positive Coefficients
-----------------------------
city[Sun City Anthem]         : 1.5828
user_avg_stars                : 0.8133
business_avg_stars            : 0.7777
city[North Pinal]             : 0.3682
city[Grand Junction]          : 0.3246

Lowest Negative Coefficients
----------------------------
(intercept)                   : -2.2332
city[Saguaro Lake]   

# More Training!!

Well crap, just keep on iterating!

# Iterate 10 More Times!

In [91]:
model = gl.linear_regression.create(yelp_reviews, target='stars', 
                                    features = ['user_id','business_id',
                                                'user_avg_stars','business_avg_stars'],
                                                max_iterations=10)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 205280
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 54308
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000  

# Or even 100X!

![](http://img3.wikia.nocookie.net/__cb20120228151221/dragonball/images/thumb/3/3e/Goku_Charges_Kaioken_Times_3.JPG/1023px-Goku_Charges_Kaioken_Times_3.JPG)

In [103]:
model = gl.linear_regression.create(yelp_reviews, target='stars', 
                                    features = ['user_id','business_id',
                                                'user_avg_stars','business_avg_stars'],
                                                max_iterations=100)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 205059
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 54334
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000  

![](http://i.imgur.com/EyV29mp.gif)

# Dictionary and List Features

In [92]:
train_set['votes'].head(3)

dtype: dict
Rows: 3
[{'funny': 0, 'useful': 5, 'cool': 2}, {'funny': 0, 'useful': 0, 'cool': 0}, {'funny': 0, 'useful': 1, 'cool': 0}]

In [93]:
def tags_to_dict(tags):
    return dict.fromkeys(tags, 1)

# Using Business Category Tags

In [94]:
train_set['categories_dict'] = train_set.apply(lambda row: tags_to_dict(row['categories']))
train_set['categories_dict'].head(5)

dtype: dict
Rows: 5
[{'Breakfast & Brunch': 1, 'Restaurants': 1}, {'Restaurants': 1, 'Pizza': 1, 'Italian': 1}, {'Middle Eastern': 1, 'Restaurants': 1}, {'Dog Parks': 1, 'Parks': 1, 'Active Life': 1}, {'Tires': 1, 'Automotive': 1}]
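
One way to picture why dict features are useful: each key can act as its own 0/1 indicator column, which is presumably why the training log reports far more unpacked features than input columns. The sketch below builds such an indicator matrix by hand; it illustrates the idea and is not GraphLab's encoding code.

```python
def tags_to_dict(tags):
    """Map every tag to 1, marking it as present."""
    return dict.fromkeys(tags, 1)


rows = [
    ['Breakfast & Brunch', 'Restaurants'],
    ['Italian', 'Pizza', 'Restaurants'],
]
tag_dicts = [tags_to_dict(tags) for tags in rows]

# One column per distinct tag, in a stable (sorted) order.
columns = sorted({tag for tag_dict in tag_dicts for tag in tag_dict})

# Missing tags become 0, present tags become 1.
matrix = [[tag_dict.get(column, 0) for column in columns]
          for tag_dict in tag_dicts]

print(columns)
print(matrix)
```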

In [95]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id','business_id', 'categories_dict',
                                                'user_avg_stars','votes', 'business_avg_stars'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163915
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 515
PROGRESS: Number of coefficients    : 50076
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000

![](http://i.imgur.com/W3gTgHC.gif)

# Text Data: Using Raw Review Data


In [96]:
train_set['text'].head(1)

dtype: str
Rows: 1
['My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!']

In [97]:
from __future__ import print_function  # Python 2: enables print(..., sep=, end=)

gen_blobs = (TextBlob(i) for i in train_set['text'])
sample    = itertools.islice(gen_blobs, 0, 10)

for blob in sample:
    print("Calculated Polarity and Subjectivity")
    print("====================================")
    print(blob.sentiment.polarity, blob.sentiment.subjectivity, sep='\n', end='\n\n')
    print(blob)
    print("----------\n")

Calculated Polarity and Subjectivity
0.402469135802
0.65911228689

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
----------

Calculated Polarity and Subject

![](http://i.imgur.com/vxcxbe5.gif)

# Insight from Bad Reviews

![](http://i.imgur.com/KUruZeB.gif)

In [105]:
train_set['negative_review_tags'] = gl.text_analytics.count_words(train_set['text'])

In [106]:
bad_review_words = (
    'hate','terrible', 'awful', 'spit', 'disgusting', 'filthy', 'tasteless', 'rude', 
    'dirty', 'slow', 'poor', 'late', 'angry', 'flies', 'disappointed', 'disappointing', 'wait', 
    'waiting', 'dreadful', 'appalling', 'horrific', 'horrifying', 'horrible', 'horrendous', 'atrocious', 
    'abominable', 'deplorable', 'abhorrent', 'frightful', 'shocking', 'hideous', 'ghastly', 'grim', 
    'dire', 'unspeakable', 'gruesome'
)
train_set['negative_review_tags'] = train_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)
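
The two steps above (count every word, then keep only the words on the list) can be approximated in plain Python as follows. The tokenizer here is a crude lowercase-and-split stand-in, so its counts may differ from what `count_words` produces; the helper names and sample sentence are invented for the example.

```python
from collections import Counter


def count_words(text):
    """Lowercase, split on whitespace, and count tokens (a rough stand-in)."""
    return Counter(text.lower().split())


def trim_by_keys(counts, keys, exclude=False):
    """Keep only the given keys, or drop them when `exclude` is True."""
    keys = set(keys)
    return {word: n for word, n in counts.items() if (word in keys) != exclude}


bad_words = ('rude', 'slow', 'dirty')
text = 'Slow service and rude staff but the food was not slow to arrive'

counts = count_words(text)
negative_tags = trim_by_keys(counts, bad_words)
print(negative_tags)
```

Note that a bag-of-words count like this ignores negation: "not slow" still adds to the tally for `slow`.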

In [107]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id', 'business_id', 'categories_dict', 'negative_review_tags', 
                                                'user_avg_stars', 'votes', 'business_avg_stars'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164054
PROGRESS: Number of features          : 7
PROGRESS: Number of unpacked features : 551
PROGRESS: Number of coefficients    : 50130
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000

In [108]:
test_set['categories_dict'] = test_set.apply(lambda row: tags_to_dict(row['categories']))
test_set['categories_dict'].head(5)

dtype: dict
Rows: 5
[{'Sushi Bars': 1, 'Restaurants': 1}, {'Food': 1, 'Tea Rooms': 1, 'Japanese': 1, 'Restaurants': 1}, {'Pubs': 1, 'Bars': 1, 'Restaurants': 1, 'Nightlife': 1, 'Irish': 1}, {'Breakfast & Brunch': 1, 'Bars': 1, 'Mexican': 1, 'Nightlife': 1, 'Restaurants': 1}, {'American (Traditional)': 1, 'Bars': 1, 'Nightlife': 1, 'Lounges': 1, 'Restaurants': 1}]

In [109]:
test_set['negative_review_tags'] = gl.text_analytics.count_words(test_set['text'])
test_set['negative_review_tags'] = test_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)

model.evaluate(test_set)

{'max_error': 6.253360542412668, 'rmse': 1.1452486861156776}

# Magic!

![](http://i.imgur.com/lDOUcN0.gif)

# Just kidding, it's just the internet

![](http://i.imgur.com/DDwEGGr.gif)

![](http://i.imgur.com/qzfKPvB.gif)