# DC Michelin Guide Challenge - Kevin Markham

Because I budgeted less than a day to work on this challenge, I decided to use an approach that was as simple as possible:

I chose Yelp reviews from DC as my only data source. I limited my scope to restaurants with 3 or 4 "dollar signs" on Yelp, meaning restaurants that are either expensive or very expensive, since I didn't think the Michelin Guide would include less expensive restaurants.

Then, I decided to use the average Yelp rating to determine the likelihood of Michelin inclusion, and completely ignored the review text. It seemed like only recent Yelp reviews would matter (since Michelin employees sampled the restaurants over the past year), so I decided to focus on a restaurant's Yelp rating during 2016 alone. The Yelp API didn't seem to allow access to ratings for only a particular time period, but I noticed that on a business's Yelp page, you can view the "rating details", which show the average Yelp rating by month:

![](rating_details.jpg)

Using web scraping, I collected the URLs of the individual pages for the top 40 rated restaurants with 3 dollar signs, and the top 40 rated restaurants with 4 dollar signs. Then, I visited each individual page and gathered data about it:

- restaurant name
- restaurant location (to verify it's in DC)
- average of 2016 monthly ratings
- standard deviation for those ratings
- minimum for those ratings
- whether or not it has a TV
- whether or not it is categorized as a bar
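
The three rating statistics in that list can be sketched in isolation as follows. This is only an illustration using the standard library: `sample_attr` is made-up data that mimics the shape the scraping code in this notebook appears to rely on (a JSON object keyed by year, each holding `[month, rating]` pairs), and `summarize_year` is a hypothetical helper name.

```python
import json
from statistics import mean, pstdev

# Made-up example of the JSON stored in a page's `data-monthly-ratings`
# attribute. Assumed shape: {year: [[month, average_rating], ...]}.
sample_attr = '{"2016": [[1, 4.5], [2, 4.0], [3, 5.0]], "2015": [[12, 3.5]]}'


def summarize_year(raw_json, year):
    """Return (mean, population std dev, min) of one year's monthly ratings."""
    monthly = json.loads(raw_json)[year]
    ratings = [rating for _month, rating in monthly]
    # pstdev is the population standard deviation, which matches the
    # default behavior of NumPy's std() (ddof=0).
    return mean(ratings), pstdev(ratings), min(ratings)


average, stdev, minimum = summarize_year(sample_attr, '2016')
```

Note that each month counts equally here regardless of how many reviews it contains, which is also how the notebook's average behaves.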

Out of those 80 restaurants, I took the top 30 by average 2016 rating. From those, I removed any not in DC proper (since they aren't allowed in this Michelin Guide), any that had at least one month with an average rating below 4 (since that would suggest inconsistent quality), any that had a TV (since that means it's probably not very fancy), and any that were categorized as a bar (since that means it probably doesn't focus on food). That left 12 restaurants. If any of those 12 had been on the [Bib Gourmand List](https://www.washingtonian.com/2016/10/06/michelin-releases-bib-gourmand-list-dc/), I would have removed them, but none were.
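
The Bib Gourmand check is not shown in the notebook, so here is a minimal sketch of how it might look. All restaurant names are hypothetical placeholders (the real list is in the linked article), and `drop_bib_gourmands` is an illustrative helper, not part of the original code.

```python
# Hypothetical placeholder names; substitute the real Bib Gourmand list.
bib_gourmand = {'Example Bib Restaurant A', 'Example Bib Restaurant B'}

# Hypothetical candidates that survived the earlier filters.
candidates = ['Example Candidate X', 'Example Bib Restaurant A', 'Example Candidate Y']


def drop_bib_gourmands(names, bib_names):
    """Keep candidates that are not on the Bib Gourmand list (case-insensitive)."""
    bib_lower = {name.lower() for name in bib_names}
    return [name for name in names if name.lower() not in bib_lower]


remaining = drop_bib_gourmands(candidates, bib_gourmand)
```

Exact name matching is fragile (for example, straight versus curly apostrophes), so in practice the names would probably need some normalization first.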

Finally, I had to decide how many restaurants to assign each number of Michelin Stars. Looking at [Wikipedia](https://en.wikipedia.org/wiki/Michelin_Guide), I checked the number of Bib Gourmands, one-star, two-star, and three-star restaurants for the three US cities with their own guides (NY, SF, Chicago). I calculated that on average, those cities have 44% as many one-star restaurants as Bibs, 8% as many two-star restaurants as Bibs, and 5% as many three-star restaurants as Bibs. I multiplied those percentages by the number of Bibs in DC (19), which gave roughly 8.4, 1.4, and 0.9. Rounding those told me that I should assign one star to 8 DC restaurants, two stars to 1 DC restaurant, and three stars to 1 DC restaurant.

Thus, I sorted the 12 remaining restaurants by average 2016 rating, and assigned three stars to the top restaurant, two stars to the next restaurant, and one star to the next 8 restaurants.

Some shortcomings of my approach are as follows:

- I didn't validate that this methodology would actually work in NY, SF, Chicago, or any other Michelin Guide city.
- I ignored the Yelp review text.
- I didn't take the number of Yelp reviews for each restaurant into account.
- I only used a single data source.
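
As a rough idea of what validating the methodology in another city could involve, the sketch below scores a predicted list against an actual list using precision and recall. This is an assumption about how one might evaluate it, not something done in this notebook, and the restaurant names are placeholders.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a predicted set of starred restaurants."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Placeholder names standing in for a city with a published guide.
predicted_starred = ['Restaurant A', 'Restaurant B', 'Restaurant C', 'Restaurant D']
actual_starred = ['Restaurant B', 'Restaurant D', 'Restaurant E']

precision, recall = precision_recall(predicted_starred, actual_starred)
```

This only measures whether a restaurant was starred at all; scoring the exact number of stars would need a separate comparison.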

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
import json
import numpy as np
import pandas as pd

In [2]:
# collect the URLs of the individual pages for the top 40 rated $$$ restaurants and the top 40 rated $$$$ restaurants
biz_urls = []
prices = [3, 4]
start = [0, 10, 20, 30]
for price in prices:
    for idx in start:
        url_part_1 = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Washington,+DC&start='
        url_part_2 = '&sortby=rating&attrs=RestaurantsPriceRange2.'
        r = requests.get(url_part_1 + str(idx) + url_part_2 + str(price))
        b = BeautifulSoup(r.text, 'lxml')
        results = b.find_all(name='div', attrs={'class':'main-attributes'})
        for result in results[1:]:
            partial_url = result.find(name='a')['href']
            biz_urls.append((partial_url, price))
        sleep(2)
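
The URL construction in the cell above can also be sketched with `urllib.parse.urlencode`, which handles the escaping. The query parameters are copied from that cell; `build_search_url` is a hypothetical helper, and this sketch does not check whether Yelp still accepts these parameters.

```python
from urllib.parse import urlencode


def build_search_url(start, price):
    """Build the search URL for one results page at a given offset and price level."""
    params = {
        'find_desc': 'Restaurants',
        'find_loc': 'Washington, DC',
        'start': start,
        'sortby': 'rating',
        'attrs': 'RestaurantsPriceRange2.' + str(price),
    }
    return 'https://www.yelp.com/search?' + urlencode(params)


# Four pages of 10 results for each price level, as in the cell above.
urls = [build_search_url(start, price)
        for price in (3, 4)
        for start in range(0, 40, 10)]
```

No requests are made here, so the list of URLs can be inspected before any scraping starts.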

In [3]:
# visit the individual page for each restaurant and gather data about it
biz_info = []
tv_text = '\n                                Has TV\n                            '
for biz_url, price in biz_urls:
    r = requests.get('https://www.yelp.com' + biz_url)
    b = BeautifulSoup(r.text, 'lxml')
    try:
        biz_name = b.find(name='h1', attrs={'itemprop':'name'}).text.strip()
        biz_location = b.find(name='span', attrs={'itemprop':'addressLocality'}).text
        monthly = json.loads(b.find(name='div', attrs={'id':'rating-details-modal-content'})['data-monthly-ratings'])['2016']
        ratings_2016 = np.array(monthly)[:, 1]
        monthly_average = ratings_2016.mean()
        monthly_std = ratings_2016.std()
        monthly_min = ratings_2016.min()
        has_tv = b.find(name='dt', attrs={'class':'attribute-key'}, text=tv_text).find_next('dd').text.strip()
        is_bar = 'Yes' if b.find(name='span', attrs={'class':'category-str-list'}).find(name='a', attrs={'href':'/c/dc/bars'}) else 'No'
        biz_info.append((price, monthly_average, monthly_std, monthly_min, biz_name, biz_location, has_tv, is_bar))
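    except Exception:
        # skip any page that is missing an expected element (e.g. no 2016 ratings or no TV attribute)
        pass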
    sleep(2)

In [4]:
# select the top 30 by average 2016 rating
top_30 = sorted(biz_info, reverse=True, key=lambda x:x[1])[:30]
df = pd.DataFrame(top_30, columns='price average stdev minimum restaurant city tv bar'.split())
df

|   | price | average | stdev | minimum | restaurant | city | tv | bar |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4.61 | 0.202237 | 4.3 | 1789 Restaurant | Washington, DC | No | No |
| 1 | 3 | 4.596909 | 0.239271 | 4.058824 | Pennsylvania 6 DC | Washington, DC | Yes | No |
| 2 | 4 | 4.565568 | 0.205241 | 4.2 | Pineapple & Pearls | Washington, DC | No | No |
| 3 | 3 | 4.55 | 0.280179 | 3.9 | Barmini By José Andrés | Washington, DC | No | No |
| 4 | 4 | 4.48 | 0.116619 | 4.4 | 2020 Restaurant and Lounge | Washington, DC | Yes | No |
| 5 | 3 | 4.47 | 0.29 | 3.9 | Acqua Al 2 | Washington, DC | No | No |
| 6 | 3 | 4.46677 | 0.146382 | 4.225 | Rose’s Luxury | Washington, DC | No | No |
| 7 | 3 | 4.44 | 0.28 | 3.9 | Off The Record | Washington, DC | Yes | Yes |
| 8 | 3 | 4.42 | 0.357211 | 3.9 | Corduroy | Washington, DC | No | No |
| 9 | 3 | 4.416667 | 0.192787 | 4.166667 | Little Serow | Washington, DC | No | No |


In [5]:
# only keep restaurants that are in DC, have a minimum monthly rating of at least 4, have no TV, and aren't bars
filtered = df.loc[(df.minimum >= 4) & (df.city == 'Washington, DC') & (df.tv == 'No') & (df.bar == 'No'), :]
filtered

|   | price | average | stdev | minimum | restaurant | city | tv | bar |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4.61 | 0.202237 | 4.3 | 1789 Restaurant | Washington, DC | No | No |
| 2 | 4 | 4.565568 | 0.205241 | 4.2 | Pineapple & Pearls | Washington, DC | No | No |
| 6 | 3 | 4.46677 | 0.146382 | 4.225 | Rose’s Luxury | Washington, DC | No | No |
| 9 | 3 | 4.416667 | 0.192787 | 4.166667 | Little Serow | Washington, DC | No | No |
| 10 | 4 | 4.402857 | 0.18459 | 4.0 | Fiola | Washington, DC | No | No |
| 12 | 4 | 4.38 | 0.183303 | 4.2 | The Lafayette | Washington, DC | No | No |
| 14 | 3 | 4.343468 | 0.187216 | 4.1 | Tail Up Goat | Washington, DC | No | No |
| 15 | 4 | 4.34 | 0.08 | 4.2 | minibar by José Andrés | Washington, DC | No | No |
| 18 | 3 | 4.306123 | 0.189642 | 4.071429 | Filomena Ristorante | Washington, DC | No | No |
| 26 | 4 | 4.26 | 0.237487 | 4.1 | Komi | Washington, DC | No | No |


In [6]:
# number of Bib Gourmands, one stars, two stars, and three stars in the other US cities
cities = pd.DataFrame([['Chicago', 59, 19, 3, 2], ['NY', 124, 60, 10, 6], ['SF', 74, 38, 7, 5]], columns='city bib one two three'.split())
cities

|   | city | bib | one | two | three |
|---|---|---|---|---|---|
| 0 | Chicago | 59 | 19 | 3 | 2 |
| 1 | NY | 124 | 60 | 10 | 6 |
| 2 | SF | 74 | 38 | 7 | 5 |


In [7]:
# calculate the ratio of one star, two stars, and three stars to bibs for each city
cities['one_ratio'] = cities.one/cities.bib
cities['two_ratio'] = cities.two/cities.bib
cities['three_ratio'] = cities.three/cities.bib
cities

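|   | city | bib | one | two | three | one_ratio | two_ratio | three_ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | Chicago | 59 | 19 | 3 | 2 | 0.322034 | 0.050847 | 0.033898 |
| 1 | NY | 124 | 60 | 10 | 6 | 0.483871 | 0.080645 | 0.048387 |
| 2 | SF | 74 | 38 | 7 | 5 | 0.513514 | 0.094595 | 0.067568 |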


In [8]:
# average those ratios
cities.one_ratio.mean(), cities.two_ratio.mean(), cities.three_ratio.mean()

(0.439806126520178, 0.07536240450401195, 0.04995098980883563)

In [9]:
# multiply those averages by 19 (DC's Bib number) to determine how many restaurants in DC to assign each star rating
cities.one_ratio.mean()*19, cities.two_ratio.mean()*19, cities.three_ratio.mean()*19

(8.356316403883381, 1.431885685576227, 0.949068806367877)
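
The step from these products to whole-number star counts is done by hand in the write-up. A minimal sketch of that rounding, using plain Python and the city counts copied from the table above, might look like this (`expected_count` is a hypothetical helper name):

```python
# Bib Gourmand and star counts for the three US guide cities, copied from above.
cities = {
    'Chicago': {'bib': 59, 'one': 19, 'two': 3, 'three': 2},
    'NY': {'bib': 124, 'one': 60, 'two': 10, 'three': 6},
    'SF': {'bib': 74, 'one': 38, 'two': 7, 'three': 5},
}
dc_bibs = 19  # number of Bib Gourmands in DC


def expected_count(level):
    """Average the star-to-Bib ratio across cities, then scale by DC's Bib count."""
    ratios = [city[level] / city['bib'] for city in cities.values()]
    return sum(ratios) / len(ratios) * dc_bibs


# Round each expected count to the nearest whole restaurant.
counts = {level: round(expected_count(level)) for level in ('one', 'two', 'three')}

# Star labels in rank order: three stars first, then two, then one.
stars = [3] * counts['three'] + [2] * counts['two'] + [1] * counts['one']
```

With these inputs the counts come out to 8 one-star, 1 two-star, and 1 three-star restaurant, matching the numbers used in the write-up.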

In [10]:
# assign 3 stars to one restaurant, 2 stars to one restaurant, and 1 star to 8 restaurants
final = filtered.head(10).copy()
final['stars'] = [3] + [2] + [1]*8
final

|   | price | average | stdev | minimum | restaurant | city | tv | bar | stars |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4.61 | 0.202237 | 4.3 | 1789 Restaurant | Washington, DC | No | No | 3 |
| 2 | 4 | 4.565568 | 0.205241 | 4.2 | Pineapple & Pearls | Washington, DC | No | No | 2 |
| 6 | 3 | 4.46677 | 0.146382 | 4.225 | Rose’s Luxury | Washington, DC | No | No | 1 |
| 9 | 3 | 4.416667 | 0.192787 | 4.166667 | Little Serow | Washington, DC | No | No | 1 |
| 10 | 4 | 4.402857 | 0.18459 | 4.0 | Fiola | Washington, DC | No | No | 1 |
| 12 | 4 | 4.38 | 0.183303 | 4.2 | The Lafayette | Washington, DC | No | No | 1 |
| 14 | 3 | 4.343468 | 0.187216 | 4.1 | Tail Up Goat | Washington, DC | No | No | 1 |
| 15 | 4 | 4.34 | 0.08 | 4.2 | minibar by José Andrés | Washington, DC | No | No | 1 |
| 18 | 3 | 4.306123 | 0.189642 | 4.071429 | Filomena Ristorante | Washington, DC | No | No | 1 |
| 26 | 4 | 4.26 | 0.237487 | 4.1 | Komi | Washington, DC | No | No | 1 |


In [12]:
# write results to CSV
final.loc[:, ['restaurant', 'stars']].set_index('restaurant').to_csv('kevinmarkham-submission.csv', encoding='utf-8')