# Which Michelin Guide Restaurants Are Participating in Restaurant Week?
We can answer this question with some web scraping :-)

## First, get the list of restaurant week restaurants

We can fetch all the restaurant data from the nice API at https://service.nycgo.com/

In [1]:
import requests, json
url = 'https://service.nycgo.com/nycgo/v2/body-grid-blocks?entryId=411&gridId=restaurant-week&randomizeFirst=true&callback=ng_jsonp_callback_1'
resp = requests.get(url).content.decode('utf-8')

# The response is JSONP: it has some extra characters at the beginning and end
# that wrap the JSON object, hence the hacky [24:-2] indexing
restaurant_data = json.loads(resp[24:-2])['data'][0]['gridItems']
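
The fixed `[24:-2]` slice breaks if the wrapper ever changes length. A minimal sketch of a more tolerant approach, assuming the wrapper is a single function call and that nothing before the opening parenthesis contains a parenthesis of its own; `strip_jsonp` is a hypothetical helper and the sample string is made up:

```python
import json

def strip_jsonp(text):
    """Parse the JSON payload inside a JSONP wrapper such as `callback({...});`."""
    start = text.index('(')   # first '(' opens the callback call
    end = text.rindex(')')    # last ')' closes it
    return json.loads(text[start + 1:end])

# Made-up sample shaped like the response above
sample = 'ng_jsonp_callback_1({"data": [{"gridItems": [{"displayTitle": "Example Bistro"}]}]});'
payload = strip_jsonp(sample)
print(payload['data'][0]['gridItems'][0]['displayTitle'])  # Example Bistro
```

With this helper the cell above could read `strip_jsonp(resp)['data'][0]['gridItems']`, whatever the length of the prefix.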

In [2]:
restaurant_week_names = [rdata['displayTitle'] for rdata in restaurant_data]

# should be 662 as of 7/28/2022
len(restaurant_week_names)

662

## Second, get the names and ratings of NYC restaurants on the Michelin Guide
There didn't seem to be an equally convenient API for the Michelin Guide, so we resort to scraping the site with Selenium.

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

In [4]:
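# Initialize a headless browser
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless")
# Selenium 4 expects the driver path wrapped in a Service object;
# passing the path positionally is deprecated.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)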

In [5]:
from selenium.common.exceptions import StaleElementReferenceException

NYC_MICHELIN_PAGES = 25
michelin_names = []
for page_num in range(1, NYC_MICHELIN_PAGES + 1):
    print(f'on page {page_num} / {NYC_MICHELIN_PAGES}')
    driver.get(f'https://guide.michelin.com/us/en/new-york-state/new-york/restaurants/page/{page_num}')
    name_cards = driver.find_elements(By.CLASS_NAME, 'card__menu-content--title') # restaurant name
    rating_cards = driver.find_elements(By.CLASS_NAME, 'card__menu-content--rating') # restaurant rating

    # There are two failure modes (that I've found) here as a result of asynchronous page loading:

    # 1. name_cards/rating_cards get updated asynchronously after we load the page.
    #    This leads to a StaleElementReferenceException. We can fix this by retrying with a try/except.

    # 2. name_cards/rating_cards is populated, but with empty strings. We can just check for this in a loop
    retry = True
    while retry:
        try:
            while not name_cards or name_cards[0].text == '':  # also covers an empty result list
                name_cards = driver.find_elements(By.CLASS_NAME, 'card__menu-content--title')
                rating_cards = driver.find_elements(By.CLASS_NAME, 'card__menu-content--rating')

            retry = False
        except StaleElementReferenceException:
            name_cards = driver.find_elements(By.CLASS_NAME, 'card__menu-content--title')
            rating_cards = driver.find_elements(By.CLASS_NAME, 'card__menu-content--rating')
            retry = True
            

    for name_card, rating_card in zip(name_cards, rating_cards):
        restaurant_name = name_card.text
        restaurant_rating = rating_card.text
        rating = ''
        if '=' in restaurant_rating: # '=' denotes Bib Gourmand
            rating += ':P'
        rating += ' ' + '*' * restaurant_rating.count('m') # each 'm' denotes a single Michelin star
        michelin_names.append(f'{restaurant_name}\t{rating}')

on page 1
on page 2
on page 3
on page 4
on page 5
on page 6
on page 7
on page 8
on page 9
on page 10
on page 11
on page 12
on page 13
on page 14
on page 15
on page 16
on page 17
on page 18
on page 19
on page 20
on page 21
on page 22
on page 23
on page 24
on page 25
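
The retry loop above spins without a delay and never gives up if a page stays empty. A minimal sketch of a bounded alternative, assuming a ten-second default timeout is acceptable; `poll_until` is a hypothetical helper, and the fake `attempts` iterator stands in for repeated `find_elements` calls:

```python
import time

def poll_until(fetch, is_ready, timeout=10.0, interval=0.25):
    """Call fetch() until is_ready(result) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(interval)  # pause between attempts instead of busy-waiting

# Stand-in for a page whose cards only fill in on the third look
attempts = iter([[], [''], ['Example Bistro']])
names = poll_until(
    fetch=lambda: next(attempts),
    is_ready=lambda texts: bool(texts) and texts[0] != '',
    timeout=2.0,
    interval=0.01,
)
print(names)  # ['Example Bistro']
```

In the notebook, `fetch` would re-run `find_elements` and read each card's `.text`, treating a `StaleElementReferenceException` as "not ready yet". Selenium's own `WebDriverWait` offers a similar explicit wait with a timeout.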


In [7]:
# Should be 482
len(michelin_names)

482
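
The rating encoding used in the scraping loop ('=' for Bib Gourmand, one 'm' per star) can be pulled into a small function so it is easy to check on its own. This is a sketch that mirrors the loop's logic; `format_rating` is a hypothetical name and the inputs are made-up examples of the scraped rating text:

```python
def format_rating(raw):
    """Translate scraped rating text into the notebook's ':P' / '*' notation."""
    bib = ':P' if '=' in raw else ''   # '=' denotes Bib Gourmand
    stars = '*' * raw.count('m')       # each 'm' denotes one Michelin star
    return f'{bib} {stars}'

print(repr(format_rating('')))    # ' '
print(repr(format_rating('=')))   # ':P '
print(repr(format_rating('mm')))  # ' **'
```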

## Finally, get the intersection of our two sets
We could certainly do better than the nested loop below, which compares every Michelin name against every Restaurant Week name, but with only a few hundred names on each side it's no big deal :)

In [8]:
from nltk.metrics.distance import jaccard_distance

set_of_rw_and_michelin_restaurants = set()
for m_name_with_rating in michelin_names:
    m_name, m_rating = m_name_with_rating.split('\t')
    m_set = set(m_name)  # set of characters in the Michelin name
    for rw_name in restaurant_week_names:
        rw_set = set(rw_name)  # set of characters in the Restaurant Week name
        # Imperfect method (it ignores character order and repeats) that is good enough in practice
        if jaccard_distance(m_set, rw_set) < 0.2:
            set_of_rw_and_michelin_restaurants.add(f'{m_name} {m_rating}')

# Should be 49. 
len(set_of_rw_and_michelin_restaurants)

49
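
To see what the 0.2 threshold is comparing, here is a pure-Python sketch of the same character-set Jaccard distance. `char_jaccard_distance` is a hypothetical helper, not the NLTK function (it returns 0.0 for two empty strings), and the name pairs are illustrative rather than taken from either website:

```python
def char_jaccard_distance(a, b):
    """Jaccard distance between the sets of characters in two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings: treat as identical
    return 1 - len(set_a & set_b) / len(union)

# Different spellings with the same characters are indistinguishable
print(char_jaccard_distance('Oso', 'Osso'))  # 0.0

# A dropped "The " already exceeds the 0.2 threshold
print(round(char_jaccard_distance('The Modern', 'Modern'), 3))  # 0.333
```

Because only the set of characters matters, the measure can produce false positives for names that share letters and false negatives for names that differ by a short extra word.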

## Print the list in the notebook
Obviously, you could also export this to a file if you'd like.

Note that at least Gramercy Tavern and The Modern are missing here because their names differ across the two websites.
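
One way to catch some naming differences, and to avoid the nested loop at the same time, is to compare normalized names through a set lookup. This swaps the fuzzy Jaccard test for exact matching on normalized keys, so it is a different technique, not the notebook's method. The normalization rules (strip accents, ignore case, drop a leading "the") are assumptions, and the sample names are hypothetical variants:

```python
import unicodedata

def normalize_name(name):
    """Lowercase, strip accents, collapse whitespace, and drop a leading 'the'."""
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = no_accents.casefold().split()
    if words and words[0] == 'the':
        words = words[1:]
    return ' '.join(words)

def match_by_normalized_name(michelin, restaurant_week):
    """Return Michelin names whose normalized form appears in the Restaurant Week list."""
    rw_keys = {normalize_name(name) for name in restaurant_week}  # one pass to build
    return [name for name in michelin if normalize_name(name) in rw_keys]  # O(1) lookups

michelin_sample = ['Eléa', 'The Modern', 'Danji']
rw_sample = ['Elea', 'Modern', 'Pastis']
print(match_by_normalized_name(michelin_sample, rw_sample))  # ['Eléa', 'The Modern']
```

Exact matching will still miss typos and longer suffixes, so it complements the fuzzy comparison rather than replacing it.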

In [10]:
print('\n'.join(set_of_rw_and_michelin_restaurants))

Veranda  
Pylos  
Peasant  
Soba Totto  
Eléa  
Foragers Table  
Maya  
Gage & Tollner  
The Leopard at Des Artistes  
Wau  
Dagon  
Empellón  
Huertas  
JoJo  
Khe-Yo :P 
Periyali  
Golden Unicorn  
Kubeh :P 
Boulud Sud  
Schilling  
Oceans  
Noreetuh  
Barbetta  
Portale  
Tanoreen :P 
Bâtard  *
Bar Tulix  
Il Fiorista  
Lore  
Aburiya Kinnosuke  
Baar Baar  
Wayan  
Oso :P 
Hearth  
Junoon  
The Fulton  
Orsay  
Gentle Perch :P 
HanGawi :P 
Bar Primi :P 
Vestry  *
Kyma  
Il Cortile  
Ci Siamo  
232 Bleecker  
Pastis  
Carne Mare  
Danji  
Union Square Cafe  
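
For the file export mentioned above, a minimal sketch could look like this. `export_names` and the file name are hypothetical, and the names are sorted only because a set has no stable order:

```python
import tempfile
from pathlib import Path

def export_names(names, path):
    """Write one name per line, sorted, and return how many were written."""
    lines = sorted(name.strip() for name in names)
    Path(path).write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return len(lines)

# Small made-up sample in the same format as the set above
sample = {'Pastis  ', 'Danji  ', 'Vestry  *'}
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / 'rw_michelin.txt'
    count = export_names(sample, out)
    exported = out.read_text(encoding='utf-8')
print(count)
print(exported)
```

In the notebook, the call would be `export_names(set_of_rw_and_michelin_restaurants, 'rw_michelin.txt')`.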
