This notebook extracts the ground truth for our project.

We approximate the quality of a review by doing some rudimentary text analysis. For each category of business, we count the number of words in all reviews. After discarding words only seen once, we find the set of words for each category unique to reviews in that category.

See the file `do_review_counts.py` for the process of extracting the review counts JSON, which was parallelized.

In [1]:
%matplotlib inline
from __future__ import division
import pandas as pd
import simplejson as json  # faster json parsing
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import itertools
import string

In [2]:
with open('yelp_academic_dataset_user.json') as f:
    user_data = pd.DataFrame(json.loads(line) for line in f)

with open('yelp_academic_dataset_review.json') as f:
    review_data = pd.DataFrame(json.loads(line) for line in f)

with open('yelp_academic_dataset_business.json') as f:
    business_data = pd.DataFrame(json.loads(line) for line in f)

# parsed counts from do_review_counts.py
with open('parsed_review_counts.json') as f:
    parsed_reviews = pd.DataFrame(json.loads(line) for line in f)

In [3]:
# extract the business categories

def find_best(cat_list):
    counts = [[x, category_counts[x]] for x in cat_list]
    return sorted(counts, key=lambda (cat, count): -count)[0][0]

category_counts = Counter(itertools.chain.from_iterable(business_data['categories']))
categories = [find_best(cat_list) if len(cat_list) > 0 else 'Other' for cat_list in business_data['categories']]
business_data['simplified_categories'] = categories

# drop any category that is less than 1% of total, remove Other category
categories_to_remove = {c for c in set(categories) if categories.count(c) / len(categories) <= 0.01}
filtered_business_data = business_data[~business_data['simplified_categories'].isin(categories_to_remove)]
filtered_categories = set(filtered_business_data['simplified_categories'])

# what does this look like?
filtered_business_data['simplified_categories'].value_counts()

Restaurants                  26729
Shopping                     12406
Food                          6978
Beauty & Spas                 6676
Home Services                 5475
Health & Medical              5239
Automotive                    4456
Local Services                3051
Active Life                   2976
Nightlife                     2529
Event Planning & Services     2472
Pets                          1600
Arts & Entertainment          1200
Hotels & Travel               1026
Financial Services             952
Name: simplified_categories, dtype: int64

For each category, we will count the number of times a specific word is seen in every review for every business in that category. Then, all words seen only once are dropped. Then, for each category, we find an ordered list of words seen in only that category.

In [4]:
# associate each simplified category with each review based on business id
merged_data = pd.merge(review_data, filtered_business_data, on='business_id')
parsed_reviews.columns = ['review_counts', 'review_id']
# perform a join to create the new dataframe
parsed_df = pd.merge(parsed_reviews, merged_data, on='review_id')
parsed_df = parsed_df.reset_index()
parsed_df.head()

Unnamed: 0,index,review_counts,review_id,business_id,date,stars_x,text,type_x,user_id,votes,...,latitude,longitude,name,neighborhoods,open,review_count,stars_y,state,type_y,simplified_categories
0,0,"{u'and': 4, u'Blue': 1, u'selection': 1, u'Hop...",VzfQdZhyAFS7IzsAwROAsw,G4ZZHlp6CdYBZOirW2_PQA,2014-03-16,4,This is a great spot in downtown Phoenix. I ge...,review,msJa3Q9y5JsJQJbhkvGJQA,"{u'funny': 0, u'useful': 0, u'cool': 0}",...,33.447259,-112.072743,Tilted Kilt Pub & Eatery,[],True,261,3.0,AZ,business,Restaurants
1,1,"{u'and': 5, u'limited': 1, u'From': 1, u'Floor...",q2Dr_Gn3t2PiMa-mVk6WRQ,axa4191R9VuMaGtyxrd77Q,2012-05-04,4,"From outside, it looks pretty weak. But I hav...",review,dUAFgAWQkKqZxX7q16IFoA,"{u'funny': 0, u'useful': 0, u'cool': 0}",...,36.145674,-115.212881,Las Vegas Athletic Club,[Westside],True,73,3.0,NV,business,Active Life
2,2,"{u'and': 2, u'all': 1, u'because': 1, u'honest...",3ZrYHyiFV2N4y_yIWofrDQ,8IMEf_cj8KyTQojhNOyoPg,2011-04-10,3,"I have not stayed in the rooms but, I actually...",review,2bAbL28lhrnyOlYiZ8yMdg,"{u'funny': 0, u'useful': 0, u'cool': 1}",...,36.118699,-115.186484,Rio All Suites Hotel & Casino,[],True,1575,3.0,NV,business,Event Planning & Services
3,3,"{u'and': 3, u'often': 1, u'don't': 1, u'period...",sCGbOnmiLnaWTOMUN2hPQw,cmGR1HS9ms233roSllcglw,2008-10-12,4,"I travel to Pittsburgh pretty often, and the o...",review,0c3yK4oWCf43fuwFxLYdkQ,"{u'funny': 0, u'useful': 0, u'cool': 0}",...,40.451636,-79.933392,Crepes Parisiennes,[Shadyside],False,73,4.0,PA,business,Restaurants
4,4,"{u'and': 1, u'Genuine': 1, u'love': 2, u'then'...",Dz8aT5oEG-mDxl8lP9mmrA,BVCDPqlHMDPLWn9EhdDXNg,2014-08-03,4,I love the curry here! It has a homemade taste...,review,gqcpNd8NyV4_HU0CpGMuhA,"{u'funny': 2, u'useful': 2, u'cool': 3}",...,36.12682,-115.209419,Japanese Curry Zen,[Chinatown],True,615,4.5,NV,business,Restaurants


In [5]:
# for each business category, count the words seen
category_word_counts = {}
for category, category_business_df in parsed_df.groupby('simplified_categories'):
    business_ids = set(category_business_df.business_id)
    category_reviews = parsed_df[parsed_df.business_id.isin(business_ids)]
    counts = Counter()
    for _, review_row in category_reviews.iterrows():
        counts.update(review_row.review_counts)
    # remove rare words
    most_common = Counter()
    for word, count in counts.iteritems():
        if count > 1:
            most_common[word] = count
    category_word_counts[category] = most_common


In [6]:
# find the set of words which are present in only 1 category
distinct_word_dict = defaultdict(Counter)
for category, most_common in category_word_counts.iteritems():
    for word, count in most_common.iteritems():
        c = [category_word_counts[cat].get(word, 0) > 0 for cat in filtered_categories if cat != category]
        if c.count(True) == 1:
            distinct_word_dict[category][word] = count

            
df = pd.DataFrame([[cat, 
                    len(category_word_counts[cat]), 
                    len(distinct_word_dict[cat])] for cat in filtered_categories])
df.columns = ['Category', '# Filtered', '# Distinctive']
df['% Distinctive'] = df['# Distinctive'] / df['# Filtered']
print df

                     Category  # Filtered  # Distinctive  % Distinctive
0                    Shopping       70882           6306       0.088965
1                        Food       67814           9964       0.146931
2                  Automotive       34837           1498       0.043000
3               Beauty & Spas       47317           2778       0.058710
4              Local Services       25499            674       0.026432
5        Arts & Entertainment       43482           2139       0.049193
6                 Active Life       43895           2006       0.045700
7            Health & Medical       36023           1908       0.052966
8                   Nightlife       57230           4607       0.080500
9                        Pets       25393            976       0.038436
10            Hotels & Travel       31635            993       0.031389
11              Home Services       35310           1522       0.043104
12         Financial Services        9860            143       0

To calculate the ground truth, we look at every review a reviewer has given in each category and calculate:

$$\text{user_weight}_{c} = \sum_{r} \frac{1}{r} \cdot \frac{\text{num distinctive words}_{r}}{\text{total distinctive words}_{c}}$$

This value provides a measure of the confidence we have in this user's ability to review this category. This matrix will still be very sparse!

In [10]:
# for each user, stores a n-category vector of weights
# parallelizing for time
from multiprocessing import Pool
p = Pool(processes=40)

def parse_user(x):
    user_id, user_df = x
    user_vector = []
    for category in ordered_categories:
        user_category_df = user_df[user_df['simplified_categories'].str.match(category)]
        if len(user_category_df) == 0:
            user_vector.append(0)
        else:
            raw_scores = []
            for _, review_row in user_category_df.iterrows():
                num_distinctive = len(review_row.review_counts.viewkeys() & distinct_word_dict[category])
                total_distinctive = len(distinct_word_dict[category])
                raw_scores.append(num_distinctive / total_distinctive)
            user_weight = np.mean(raw_scores)
            user_vector.append(user_weight)
    return user_id, user_vector
    

ordered_categories = sorted(filtered_categories)
user_weights = p.map(parse_user, parsed_df.groupby('user_id'))

In [24]:
user_weights_list = [[user_id] + vec for user_id, vec in user_weights]
user_weight_df = pd.DataFrame(user_weights_list)
user_weight_df.columns = ['user_id'] + ordered_categories
user_weight_df.to_csv('user_weights.csv')
user_weight_df.head()

Unnamed: 0,user_id,Active Life,Arts & Entertainment,Automotive,Beauty & Spas,Event Planning & Services,Financial Services,Food,Health & Medical,Home Services,Hotels & Travel,Local Services,Nightlife,Pets,Restaurants,Shopping
0,---teJGnwK07UO6_oJfbRw,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,--0HEXd4W6bJI8k7E0RxTA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,--0KsjlAThNWua2Pr4HStQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,--0mI_q_0D1CdU4P_hoImQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,--106arHH4D3fLenTl3YZA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
'Percent of users with non-zero weight: {:.2f}'.format(
    [sum(vec) > 0 for user_id, vec in user_weights].count(True) / len(user_weights))

'Percent of users with non-zero weight: 0.27'