First, background research:

'Instant Foodie: Predicting Expert Ratings From Grassroots', http://www.cs.cornell.edu/~chenhao/pub/instant-foodie.pdf

In this paper, the authors use 'grassroots' Yelp and Google Places scores to predict Zagat ratings. One complication is that Zagat ratings are three-pronged (food, service, decor) while the grassroots ratings are not. The authors draw on the fields of 'crowdsourced labeling' and 'collaborative filtering'. Latent factor models have become popular in collaborative filtering; the authors adapt this model to "characterize both items and users as vectors in a space automatically inferred from observed ratings". The paper also has a good discussion of the challenges and biases in using grassroots reviews to predict expert reviews.
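
As a rough illustration of the latent factor idea (not the authors' actual model), the sketch below fits a plain matrix factorization by stochastic gradient descent, so that each reviewer and each restaurant ends up as a small vector whose dot product approximates the observed rating. The function names, the hyperparameters and the toy ratings are all made up for illustration.

```python
import random


def predict(users, items, u, i):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(pu * qi for pu, qi in zip(users[u], items[i]))


def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit k-dimensional user and item vectors by stochastic gradient descent.

    ratings is a list of (user_index, item_index, rating) tuples.
    """
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(users, items, u, i)
            for f in range(k):
                pu, qi = users[u][f], items[i][f]
                # Gradient step on squared error with L2 regularization.
                users[u][f] += lr * (err * qi - reg * pu)
                items[i][f] += lr * (err * pu - reg * qi)
    return users, items


def rmse(ratings, users, items):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(users, items, u, i)) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5


# Toy data (hypothetical): 3 reviewers, 3 restaurants, ratings on a 1-5 scale.
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 1)]
untrained = factorize(toy_ratings, 3, 3, epochs=0)
trained = factorize(toy_ratings, 3, 3)
```

Comparing rmse(toy_ratings, *untrained) with rmse(toy_ratings, *trained) shows how much of the observed ratings the inferred vectors explain.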

On a less academic front, FiveThirtyEight compared Yelp reviews with Michelin stars in NYC: http://fivethirtyeight.com/features/yelp-and-michelin-have-the-same-taste-in-new-york-restaurants/

Nate Silver used a "VORB" approach -- "VORB is a formula that combines the review count and the star ratings." The review count is adjusted based on Google searches for the restaurant. Silver finds that:
    "The correlation between Yelp stars and Michelin stars is highly statistically significant. Indeed, it forms an eerily linear progression. The restaurants to have lost their Michelin stars have 3.83 Yelp stars on average, barely better than the average for all restaurants citywide." 
    
However, Silver finds that Yelp reviews are influenced by price (positively!):
    "Controlling for their Michelin ratings, Yelp actually rates restaurants higher when they are more expensive. Each additional dollar sign (e.g. going from 2 to 3 dollar signs) works out to 0.2 additional Yelp stars; the relationship is highly statistically significant."

Cuisine also mattered: 
    "Another theme is that certain cuisines do poorly in Yelp as compared with their Michelin star ratings. Consider the four restaurants with the lowest VORB scores. They are (or were), respectively, a Vietnamese restaurant, a Malaysian restaurant, a Thai restaurant and a Chinese (Szechuan) restaurant."

'Zagat Overhauls Restaurant Review Ratings', http://www.wsj.com/articles/zagat-overhauls-restaurant-review-ratings-1469505614
In July 2016, Zagat moved to a 1-5 star rating system, but kept the three categories of food, service and decor.
"Under Google, Zagat no longer prints restaurant guides beyond New York, but it has retained its digital listings."

Per this article: https://www.washingtonpost.com/lifestyle/food/dcs-food-scene-gets-a-prestigious-boost-michelin-inspection-and-stars/2016/05/27/fc1db658-2132-11e6-8690-f14ca9de2972_story.html, 
there are only three other Michelin-reviewed cities in the US: Chicago, NY and San Francisco.

This student used Yelp reviews to predict Michelin stars in SF: http://blog.nycdatascience.com/student-works/predicting-michelin-stars-yelp-reviews-san-francisco/

Directions on scraping without getting banned: http://www.markbartlett.org/web-engineering/web-engineering-1/page-scraping-with-urllib2-beautifulsoup
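
A minimal sketch of the precautions I'd assume here: a pause between requests and an identifying User-Agent header. It is written for Python 3's urllib.request with a Python 2 fallback. The delay, the header string and the helper names are my own assumptions for illustration, not something taken from the linked page.

```python
import time

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2


def fetch(url, user_agent="zagat-notebook (hypothetical contact info)"):
    """Download one URL, identifying the client with a User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    return urlopen(request).read()


def polite_fetch_all(urls, fetch_one=fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for n, url in enumerate(urls):
        if n > 0:
            sleep(delay_seconds)  # pause before every request after the first
        pages.append(fetch_one(url))
    return pages
```

The fetch_one and sleep arguments are there so the loop can be tried out with stand-ins before pointing it at a real site.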


In [9]:
import pandas as pd
import json

In [10]:
# Start of my web scraping: trying a single page of NYC results (page=2 in the URL below)
import urllib
from bs4 import BeautifulSoup

# Initially I tried scraping the page you see in the browser. It turns out that to iterate
# through multiple pages you need the underlying request, found under Inspect > Network > XHR.
url="https://www.zagat.com/proxy/v1.4?kft=461&vertical=46&orderby=score_food&sort=desc&page=2&city=1020&query=&key=abbc09b7c840c10937a4db331422c98b&mobile_only_content=false&limit=15&m=filter&a=place"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")

In [23]:
# Print soup.prettify(); because the response is JSON rather than HTML, it's not very pretty!
print soup.prettify()

{"success":true,"data":[{"addr_city":"New York","city":{"id":"1020","title":"New York City","slug":"new-york-city","avail_filters":["neighborhood"]},"closed":false,"cost":76,"cuisine":"French","currency_symbol":"$","date_opened":null,"feature_id":"0x89c258f66d739287:0x667798011a76e1cd","hours":[["11:45am - 11:00pm"],["11:45am - 11:00pm"],["11:45am - 11:00pm"],["11:45am - 11:00pm"],["11:45am - 11:00pm"],["11:45am - 11:00pm"],["11:45am - 11:00pm"]],"id":"676172","latitude":"40.7690697","longitude":"-73.9815598","neighborhood":"West 60s","obj_type":"place","open_now":false,"open_table":"http:\/\/www.opentable.com\/restaurant\/profile\/108982?ref=5305","partners":[],"photo":"http:\/\/storage.googleapis.com\/zgt-photos\/0x89c258f66d739287_0x667798011a76e1cd\/f8e54cb11d57ce0be67851570f95e591.jpg","photo_map":"http:\/\/maps.googleapis.com\/maps\/api\/staticmap?center=40.7690697,-73.9815598&amp;zoom;=14&amp;size;=200x200&amp;sensor;=false","price_level":"E","redirect_url":null,"renamed":false,

In [25]:
# Loading the json
ny=json.loads(str(soup))
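
Since the response is already JSON, one alternative is to hand the raw text straight to json.loads and skip BeautifulSoup. The escaped '&amp;zoom;' in the printed output above suggests the HTML parser rewrites ampersands inside the payload, which direct parsing would avoid. A minimal sketch, with a made-up payload standing in for the real response:

```python
import json

# Made-up stand-in for the bytes returned by urlopen(url).read().
raw = b'{"success": true, "data": [{"title": "Example", "photo_map": "http://example.com/map?center=1,2&zoom=14"}]}'

# Decode explicitly so this behaves the same on Python 2 and Python 3.
parsed = json.loads(raw.decode("utf-8"))
first = parsed["data"][0]
```

Here first["photo_map"] keeps its plain "&zoom=14" because no HTML parser ever touches the text.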

In [26]:
# Dictionary format
ny

{u'count': 985,
 u'data': [{u'addr_city': u'New York',
   u'city': {u'avail_filters': [u'neighborhood'],
    u'id': u'1020',
    u'slug': u'new-york-city',
    u'title': u'New York City'},
   u'closed': False,
   u'cost': 76,
   u'cuisine': u'French',
   u'currency_symbol': u'$',
   u'date_opened': None,
   u'feature_id': u'0x89c258f66d739287:0x667798011a76e1cd',
   u'hours': [[u'11:45am - 11:00pm'],
    [u'11:45am - 11:00pm'],
    [u'11:45am - 11:00pm'],
    [u'11:45am - 11:00pm'],
    [u'11:45am - 11:00pm'],
    [u'11:45am - 11:00pm'],
    [u'11:45am - 11:00pm']],
   u'id': u'676172',
   u'latitude': u'40.7690697',
   u'longitude': u'-73.9815598',
   u'neighborhood': u'West 60s',
   u'obj_type': u'place',
   u'open_now': False,
   u'open_table': u'http://www.opentable.com/restaurant/profile/108982?ref=5305',
   u'partners': [],
   u'photo': u'http://storage.googleapis.com/zgt-photos/0x89c258f66d739287_0x667798011a76e1cd/f8e54cb11d57ce0be67851570f95e591.jpg',
   u'photo_map': u'http:/

In [27]:
#looking at keys -- 'data' is the one I'm interested in
list(ny.keys())

[u'count', u'refinements', u'is_elasticsearch', u'data', u'success']

In [28]:
# Taking a look at the first row (row 0) in data, I can see that the Zagat score is included as a sub-dictionary
# (it's also entertaining to notice the old 30-point scores still in there!)
ny['data'][0]

{u'addr_city': u'New York',
 u'city': {u'avail_filters': [u'neighborhood'],
  u'id': u'1020',
  u'slug': u'new-york-city',
  u'title': u'New York City'},
 u'closed': False,
 u'cost': 76,
 u'cuisine': u'French',
 u'currency_symbol': u'$',
 u'date_opened': None,
 u'feature_id': u'0x89c258f66d739287:0x667798011a76e1cd',
 u'hours': [[u'11:45am - 11:00pm'],
  [u'11:45am - 11:00pm'],
  [u'11:45am - 11:00pm'],
  [u'11:45am - 11:00pm'],
  [u'11:45am - 11:00pm'],
  [u'11:45am - 11:00pm'],
  [u'11:45am - 11:00pm']],
 u'id': u'676172',
 u'latitude': u'40.7690697',
 u'longitude': u'-73.9815598',
 u'neighborhood': u'West 60s',
 u'obj_type': u'place',
 u'open_now': False,
 u'open_table': u'http://www.opentable.com/restaurant/profile/108982?ref=5305',
 u'partners': [],
 u'photo': u'http://storage.googleapis.com/zgt-photos/0x89c258f66d739287_0x667798011a76e1cd/f8e54cb11d57ce0be67851570f95e591.jpg',
 u'photo_map': u'http://maps.googleapis.com/maps/api/staticmap?center=40.7690697,-73.9815598&amp;zoom;=1

In [29]:
# This syntax will pull just the 5-point food score from row 0
ny['data'][0]['score']['score5_food']

4.7

In [99]:
nyc = pd.DataFrame()

In [102]:
# Now that I've looked at the data, I want to iterate over pages 1-185 and pull the variables into a dataframe
for i in range(1, 186):
    url="https://www.zagat.com/proxy/v1.4?vertical=46&orderby=score_food&sort=desc&page=%d&city=1020&query=&key=abbc09b7c840c10937a4db331422c98b&mobile_only_content=false&limit=15&m=filter&a=place" %(i)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    ny=json.loads(str(soup))
    nyadd = pd.DataFrame(ny["data"])
    food = pd.DataFrame()
    decor = pd.DataFrame()
    service = pd.DataFrame()
    for j in range (0,len(ny["data"])):
        food = food.append(pd.Series(ny['data'][j]['score']['score5_food']),ignore_index=True)
        decor = decor.append(pd.Series(ny['data'][j]['score']['score5_decor']),ignore_index=True)
        service = service.append(pd.Series(ny['data'][j]['score']['score5_service']),ignore_index=True)
    nyadd['food'] = food
    nyadd['decor'] = decor
    nyadd['service'] = service
    nyc = nyc.append(nyadd)      
nyc = nyc.reset_index()
del nyc['index']
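
The loop above re-creates three small DataFrames per page and appends to them row by row. One possible refactoring, sketched here under the assumption that every record carries a 'score' sub-dictionary with the score5_* keys seen earlier, is to flatten each record into a plain dict, collect all of them in one list, and build the DataFrame once at the end. The helper names (flatten_record, collect_pages) and the fetch_page callback are hypothetical.

```python
def flatten_record(record):
    """Copy one restaurant record and lift the 5-point scores to top-level keys."""
    row = dict(record)
    score = row.get("score") or {}
    for name in ("food", "decor", "service"):
        row[name] = score.get("score5_" + name)  # None when the key is absent
    return row


def collect_pages(fetch_page, first_page, last_page):
    """Call fetch_page(n) for each page number and return one flat list of rows.

    fetch_page must return the parsed JSON dictionary for page n.
    """
    rows = []
    for page in range(first_page, last_page + 1):
        payload = fetch_page(page)
        rows.extend(flatten_record(r) for r in payload["data"])
    return rows


# Stand-in for the real request: two fake records per page.
def fake_fetch_page(page):
    return {"data": [
        {"id": "%d-a" % page,
         "score": {"score5_food": 4.7, "score5_decor": 4.0, "score5_service": 4.5}},
        {"id": "%d-b" % page, "score": {}},
    ]}


rows = collect_pages(fake_fetch_page, 1, 3)
# With pandas available, pd.DataFrame(rows) then builds the table in a single call.
```

The same two helpers would serve every city; only the city id in the URL and the page range change.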

In [103]:
nyc.shape

(2770, 33)
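
The page ranges in these loops are hard-coded. Since the response includes a 'count' key and the URL asks for limit=15, the number of pages could instead be derived, assuming 'count' is the total number of matching restaurants (which I haven't verified). A small sketch with a hypothetical helper name:

```python
def page_numbers(total_count, per_page=15):
    """Return the 1-based page numbers needed to cover total_count results."""
    n_pages = (total_count + per_page - 1) // per_page  # ceiling division
    return range(1, n_pages + 1)


nyc_pages = page_numbers(2770)
```

For the 2,770 NYC rows this gives pages 1 through 185, the same pages that range(1, 186) visits.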

In [104]:
# A few spot check comparisons to the website
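cols = ['title', 'food', 'decor', 'service']  # renamed from vars, which shadows a Python builtin
# No parentheses in the pattern: str.contains treats them as a regex group and warns about it
nyc[cols][nyc.title.str.contains('Lemon Ice')]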



                           title  food  decor  service
56  The Lemon Ice King of Corona   4.6    2.7      3.7


In [105]:
#Now that I have my NYC Zagat data, I'm going to save. I may pull again, merge, drop duplicates.
nyc.to_pickle('../nyc_raw2.pkl') 

# When I want to read this back in, I'll just run: nyc = pd.read_pickle('../nyc_raw2.pkl')
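
If I do pull again and merge, dropping duplicates needs a key. A minimal sketch, assuming the 'id' field uniquely identifies a restaurant (an assumption I haven't checked), that keeps the first record seen for each id. In pandas the equivalent would be pd.concat followed by drop_duplicates(subset='id').

```python
def merge_unique(pulls, key="id"):
    """Combine several lists of records, keeping the first record for each key value."""
    seen = set()
    merged = []
    for pull in pulls:
        for record in pull:
            if record[key] not in seen:
                seen.add(record[key])
                merged.append(record)
    return merged


# Two made-up pulls that overlap on one restaurant.
first_pull = [{"id": "676172", "title": "A"}, {"id": "111", "title": "B"}]
second_pull = [{"id": "111", "title": "B"}, {"id": "222", "title": "C"}]
combined = merge_unique([first_pull, second_pull])
```

Keying on 'id' rather than 'title' means separate branches of a chain that share a name would be kept as distinct rows.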

In [51]:
#And now, to get the San Francisco and Chicago data, I'll repeat these steps... 
chicago = pd.DataFrame()

In [53]:
#Chicago - 1376 restaurants, pages 1-92
for i in range(1, 93):
    url= "https://www.zagat.com/proxy/v1.4?vertical=46&orderby=score_food&sort=desc&page=%d&city=1013&query=&key=abbc09b7c840c10937a4db331422c98b&mobile_only_content=false&limit=15&m=filter&a=place" %(i)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    city=json.loads(str(soup))
    cityadd = pd.DataFrame(city["data"])
    food = pd.DataFrame()
    decor = pd.DataFrame()
    service = pd.DataFrame()
    for j in range (0,len(city["data"])):
        food = food.append(pd.Series(city['data'][j]['score']['score5_food']),ignore_index=True)
        decor = decor.append(pd.Series(city['data'][j]['score']['score5_decor']),ignore_index=True)
        service = service.append(pd.Series(city['data'][j]['score']['score5_service']),ignore_index=True)
    cityadd['food'] = food
    cityadd['decor'] = decor
    cityadd['service'] = service
    chicago = chicago.append(cityadd) 
chicago = chicago.reset_index()
del chicago['index']

In [54]:
chicago.shape

(1376, 33)

In [56]:
# A few spot check comparisons to the website
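cols = ['title', 'food', 'decor', 'service']  # renamed from vars, which shadows a Python builtin
# No parentheses in the pattern: str.contains treats them as a regex group and warns about it
chicago[cols][chicago.title.str.contains('Mercat')]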



                     title  food  decor  service
49      Mercat a la Planxa   4.6    4.4      4.3
928                Mercato   0.0    0.0      0.0
1019  Pizzeria del Mercato   0.0    0.0      0.0
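
The all-zero rows for Mercato and Pizzeria del Mercato look like restaurants without a rating rather than genuinely terrible ones, though that's my guess and not something I've confirmed against the site. Under that assumption, a sketch of turning zero scores into missing values before any averaging (the helper name is hypothetical):

```python
def zero_to_missing(row, fields=("food", "decor", "service")):
    """Return a copy of row with zero scores replaced by None."""
    cleaned = dict(row)
    for field in fields:
        if cleaned.get(field) == 0:
            cleaned[field] = None
    return cleaned


rated = zero_to_missing(
    {"title": "Mercat a la Planxa", "food": 4.6, "decor": 4.4, "service": 4.3})
unrated = zero_to_missing(
    {"title": "Mercato", "food": 0.0, "decor": 0.0, "service": 0.0})
```

Leaving the zeros in place would drag down any city-level average, so it seems worth deciding how to treat them before modeling.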


In [57]:
chicago.to_pickle('../chicago_raw1.pkl') 

In [11]:
#DC - 1191 restaurants, 80 pages
dc = pd.DataFrame()

In [12]:
for i in range(1, 81):
    url="https://www.zagat.com/proxy/v1.4?vertical=46&orderby=score_food&sort=desc&page=%d&city=1024&query=&key=abbc09b7c840c10937a4db331422c98b&mobile_only_content=false&limit=15&m=filter&a=place" %(i)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    city=json.loads(str(soup))
    cityadd = pd.DataFrame(city["data"])
    food = pd.DataFrame()
    decor = pd.DataFrame()
    service = pd.DataFrame()
    for j in range (0,len(city["data"])):
        food = food.append(pd.Series(city['data'][j]['score']['score5_food']),ignore_index=True)
        decor = decor.append(pd.Series(city['data'][j]['score']['score5_decor']),ignore_index=True)
        service = service.append(pd.Series(city['data'][j]['score']['score5_service']),ignore_index=True)
    cityadd['food'] = food
    cityadd['decor'] = decor
    cityadd['service'] = service
    dc = dc.append(cityadd) 

dc = dc.reset_index()
del dc['index']

In [13]:
dc.shape

(1197, 33)

In [14]:
dc.to_pickle('../dc_raw3.pkl') 

In [83]:
# A few spot check comparisons to the website
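cols = ['title', 'food', 'decor', 'service', 'neighborhood']  # renamed from vars, which shadows a Python builtin
# No parentheses in the pattern: str.contains treats them as a regex group and warns about it
dc[cols][dc.title.str.contains('Lost Dog')]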



                                 title  food  decor  service neighborhood
95                       Lost Dog Cafe   4.5    4.1      4.1
112                      Lost Dog Cafe   4.5    4.1      4.1
130                      Lost Dog Cafe   4.5    4.1      4.1
150                      Lost Dog Cafe   4.5    4.1      4.1
362  Lost Dog Cafe Old Town Alexandria   4.3    3.8      4.1


In [84]:
#DC proper - 688 restaurants
dc_prop = pd.DataFrame()

In [85]:
for i in range(1, 47):
    url="https://www.zagat.com/proxy/v1.4?addr_city=Washington&vertical=46&orderby=score_food&sort=desc&page=%d&city=1024&query=&key=abbc09b7c840c10937a4db331422c98b&mobile_only_content=false&limit=15&m=filter&a=place" %(i)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    city=json.loads(str(soup))
    cityadd = pd.DataFrame(city["data"])
    food = pd.DataFrame()
    decor = pd.DataFrame()
    service = pd.DataFrame()
    for j in range (0,len(city["data"])):
        food = food.append(pd.Series(city['data'][j]['score']['score5_food']),ignore_index=True)
        decor = decor.append(pd.Series(city['data'][j]['score']['score5_decor']),ignore_index=True)
        service = service.append(pd.Series(city['data'][j]['score']['score5_service']),ignore_index=True)
    cityadd['food'] = food
    cityadd['decor'] = decor
    cityadd['service'] = service
    dc_prop = dc_prop.append(cityadd) 

dc_prop = dc_prop.reset_index()
del dc_prop['index']

In [86]:
dc_prop.shape

(690, 33)

In [89]:
# A few spot check comparisons to the website
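cols = ['title', 'url', 'food', 'decor', 'service']  # renamed from vars, which shadows a Python builtin
# No parentheses in the pattern: str.contains treats them as a regex group and warns about it
dc_prop[cols][dc_prop.title.str.contains('Oohh')]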



                   title                                             url  food  decor  service
136  Oohh's &amp; Aahh's  https://www.zagat.com/r/oohhs-aahhs-washington   4.4    2.8      3.8


In [90]:
dc_prop.to_pickle('../dc_prop_raw1.pkl') 

In [3]:
sf = pd.DataFrame()

In [6]:
#SF - 2021 restaurants, pages 1-135
for i in range(1, 136):
    url= "https://www.zagat.com/proxy/v1.4?vertical=46&orderby=score_food&sort=desc&page=%d&city=1021&query=&key=abbc09b7c840c10937a4db331422c98b&mobile_only_content=false&limit=15&m=filter&a=place" %(i)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    city=json.loads(str(soup))
    cityadd = pd.DataFrame(city["data"])
    food = pd.DataFrame()
    decor = pd.DataFrame()
    service = pd.DataFrame()
    for j in range (0,len(city["data"])):
        food = food.append(pd.Series(city['data'][j]['score']['score5_food']),ignore_index=True)
        decor = decor.append(pd.Series(city['data'][j]['score']['score5_decor']),ignore_index=True)
        service = service.append(pd.Series(city['data'][j]['score']['score5_service']),ignore_index=True)
    cityadd['food'] = food
    cityadd['decor'] = decor
    cityadd['service'] = service
    sf = sf.append(cityadd) 

sf = sf.reset_index()
del sf['index']

In [7]:
sf.shape

(2023, 33)

In [8]:
sf.to_pickle('../sf_raw2.pkl') 