# <p style="text-align: center;color:#2a4b8d"><font size="10">Yelp API</font></p>

In [21]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

This notebook was used to scrape the data from Yelp. As explained in our main notebook 'Descriptive Analysis', we wanted to add the Yelp ratings of the restaurants to our dataset. Yelp provides an API for this purpose, which lets us query its data.

The following text is copied from the 'Descriptive Analysis' notebook, where we describe how we scraped the data:

Yelp.com is a website that helps people find great local businesses, such as restaurants, based on customer reviews. Each business has a page where visitors can write comments and give an overall grade reflecting their experience there. The Yelp Fusion API gives us access to relevant information on restaurants, such as their rating and their number of reviews. When trying to scrape the data, we first ran into the problem of how to uniquely identify each restaurant. The Yelp API accepts the latitude and longitude of the place we want to query as arguments, and we initially thought that this would let us find the right businesses. However, the Yelp coordinates do not exactly match the ones in our dataset. Thus, querying by latitude/longitude did not return the right business (we generally obtained restaurants that were close to the one we were interested in).

Therefore, we had to come up with another way to look up our restaurants. In the API documentation on the Yelp Fusion website, we found another function, called business_match_query, which takes the name and complete address of a restaurant as arguments. Trying this function on a few cases showed us that this "cross-search", which takes several pieces of information on the business into account, returns the right results. One disadvantage is that business_match_query does not give direct access to the Yelp rating and review count. It does, however, return the Yelp ID, which is the unique identifier of a business on the Yelp website. As a next step, we can use the requests package to retrieve the rating and review count by passing it the Yelp ID.
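
The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the code we ran: `lookup_rating`, `match_fn` and `details_fn` are hypothetical names, the two callables stand in for the real Yelp calls, and the ID `abc123` is made up.

```python
def lookup_rating(match_fn, details_fn, name, address, city, state):
    """Two-step lookup: match a business to get its ID, then fetch its details.

    match_fn and details_fn are placeholders (assumptions of this sketch):
    match_fn returns a dict with a 'businesses' list, and details_fn takes a
    Yelp ID and returns a dict with 'rating' and 'review_count'.
    """
    matches = match_fn(name=name, address1=address, city=city,
                       state=state, country="US")["businesses"]
    if not matches:
        return None  # no business matched the name and address
    best = matches[0]
    details = details_fn(best["id"])
    return {"name": best["name"],
            "rating": details["rating"],
            "review_count": details["review_count"]}


# Stand-ins for the real API calls, for illustration only
def fake_match(**kwargs):
    return {"businesses": [{"id": "abc123", "name": "Cosi"}]}


def fake_details(business_id):
    return {"rating": 3.0, "review_count": 70}


result = lookup_rating(fake_match, fake_details,
                       "cosi", "230 W MONROE ST", "CHICAGO", "IL")
print(result)
```

Passing the API calls in as arguments keeps the sketch runnable without a key or a network connection.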

This way, we were able to obtain the right results. One disadvantage of this method is that each restaurant requires two queries: first to obtain the Yelp ID via the business_match_query function, and then to obtain the rating and review count using the ID. For this reason, we haven't queried our whole dataset yet. We should have the full dataset ready before Wednesday (Yelp restricts us to 5'000 queries per day).

For the scraping code, please refer to the file called API_yelp. In this file, we simply implemented the procedure explained above. Every day, we scraped 2'500 rows of our dataset, saved the result as a csv file, and reopened this file the next day to continue the scraping, and so on (of course, we made copies of the file along the way to make sure we did not lose any information).

Note: First, we had to obtain a private key to be able to access the Yelp API.

In [2]:
#This is the private key that we obtained
API_key = 'B51wnmtnL30svioyDrNUrhl4h7We2HR8i-DQgKUNPsbVsiruZAItOeM6IHC8e5Y72G-d9TjHZFQHtKJjBrSCWCF0gLTvDlfr9u7bKce5jClTMT4ua-m0dtzxI6q1XXYx'

In [3]:
from yelpapi import YelpAPI #Yelp API package; you may need to install it on your computer first (pip install yelpapi)
import argparse
from pprint import pprint
#For example code, see https://github.com/gfairchild/yelpapi/blob/master/examples/examples.py

In [4]:
import requests
import json
headers = {'Authorization': 'Bearer %s' % API_key}
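
A key written directly into a notebook is visible to anyone who can read the file. As a minimal sketch of an alternative, the header could be built from an environment variable. The variable name `YELP_API_KEY` and the helper `build_headers` are assumptions made for this example, not part of our original code.

```python
import os


def build_headers(env_var="YELP_API_KEY"):
    """Build the Authorization header from an environment variable.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {api_key}"}
```

With the variable set in the shell, `headers = build_headers()` would replace the hard-coded assignment.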

In [5]:
yelp_api = YelpAPI(API_key)

In [6]:
day = int(input("Hello, which day of scraping is this (day count starts at 0)? ")) #day of scraping

Hello, which day of scraping is this (day count starts at 0)? 1


In [None]:
if day==0: #If this is the first day of scraping, we open the clean dataset
    data = pd.read_csv('data/chicago-food-inspections/cfi_recent.csv')
else: #Otherwise, we open the dataset saved at the end of the previous day
    data = pd.read_csv('data/chicago-food-inspections/food-inspections_scrapped'+str(day-1)+'.csv')

In [9]:
data.head(2)

Unnamed: 0.1,Unnamed: 0,index,License #,Inspection ID,DBA Name,Facility Type,Risk,Address,City,State,...,Inspection Type,Results,Violations,Latitude,Longitude,Last Inspection Year,Chain,Yelp name,Yelp review count,Yelp rating
0,0,0,2.0,"[2144871, 2050308, 1977093, 1970902, 1970312, ...",cosi,Restaurant,1.0,230 W MONROE ST,CHICAGO,IL,...,"['Canvass', 'Canvass', 'Short Form Complaint',...","['Pass w/ Conditions', 'Pass', 'Pass w/ Condit...",['3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERAT...,41.880757158647214,-87.6347092983425,2018,,Cosi,70.0,3.0
1,1,1,9.0,"[2304407, 2181605, 2181227, 2050713, 1975322, ...",xando coffee & bar / cosi sandwich bar,Restaurant,1.0,116 S MICHIGAN AVE,CHICAGO,IL,...,"['Canvass', 'Canvass Re-Inspection', 'Canvass'...","['Pass w/ Conditions', 'Pass', 'Fail', 'Pass',...","[""1. PERSON IN CHARGE PRESENT, DEMONSTRATES KN...",41.880395838259616,-87.62450172159464,2019,,,,


Note: Our clean dataset has almost 13'000 rows. As the Yelp API restricts us to 5'000 queries per day and each row requires 2 queries, we can scrape at most 2'500 rows per day. We therefore divided the number of rows by 6 and scraped the whole dataset in 6 days.
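
The arithmetic behind the 6 days can be checked with a short sketch. The function name `days_needed` is hypothetical, and 13000 is used as a round upper bound for our row count.

```python
import math


def days_needed(n_rows, queries_per_row=2, daily_limit=5000):
    """Number of days needed to scrape n_rows under a daily query limit."""
    rows_per_day = daily_limit // queries_per_row  # 5000 // 2 = 2500 rows per day
    return math.ceil(n_rows / rows_per_day)


print(days_needed(13000))  # 13000 / 2500 = 5.2, rounded up to 6
```

Splitting the rows evenly over 6 days gives a little under 2'200 rows per day, which stays below the 2'500-row ceiling.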

In [27]:
n_days = 6 #Total number of days required
scrapping_per_day = data.shape[0] // n_days #Number of rows to scrape per day
if day == n_days - 1:
    scrapping_per_day = scrapping_per_day + data.shape[0] % n_days #The number of rows is not a multiple of n_days, so the last day also takes the remaining rows (3 in our case)

In [29]:
rows_per_day = data.shape[0] // n_days #Base chunk size; scrapping_per_day is larger on the last day, so using it for start would skip rows
start = rows_per_day*day #Starting row for the scraping
end = min(start + scrapping_per_day, data.shape[0]) #Ending row for the scraping (never past the last row)
if start == 0: #On the first day we need to create the new columns; afterwards they are already there
    data['Yelp name']=np.nan
    data['Yelp review count']=np.nan
    data['Yelp rating']=np.nan
data.head(2)

Unnamed: 0.1,Unnamed: 0,index,License #,Inspection ID,DBA Name,Facility Type,Risk,Address,City,State,...,Inspection Type,Results,Violations,Latitude,Longitude,Last Inspection Year,Chain,Yelp name,Yelp review count,Yelp rating
0,0,0,2.0,"[2144871, 2050308, 1977093, 1970902, 1970312, ...",cosi,Restaurant,1.0,230 W MONROE ST,CHICAGO,IL,...,"['Canvass', 'Canvass', 'Short Form Complaint',...","['Pass w/ Conditions', 'Pass', 'Pass w/ Condit...",['3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERAT...,41.880757158647214,-87.6347092983425,2018,,Cosi,70.0,3.0
1,1,1,9.0,"[2304407, 2181605, 2181227, 2050713, 1975322, ...",xando coffee & bar / cosi sandwich bar,Restaurant,1.0,116 S MICHIGAN AVE,CHICAGO,IL,...,"['Canvass', 'Canvass Re-Inspection', 'Canvass'...","['Pass w/ Conditions', 'Pass', 'Fail', 'Pass',...","[""1. PERSON IN CHARGE PRESENT, DEMONSTRATES KN...",41.880395838259616,-87.62450172159464,2019,,,,
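
The split into daily chunks can also be written as a small helper, which makes it easy to check that every row is covered exactly once. This is a sketch under our own naming (`chunk_bounds` is hypothetical), shown on a toy size of 15 rows rather than on our dataset.

```python
def chunk_bounds(n_rows, n_days, day):
    """Return (start, end) row positions for a given day, end exclusive.

    Every day gets n_rows // n_days rows; the last day also takes the remainder.
    """
    base = n_rows // n_days
    start = base * day
    end = n_rows if day == n_days - 1 else start + base
    return start, end


# Toy example: 15 rows over 6 days, so the last day takes 2 + 3 rows
bounds = [chunk_bounds(15, 6, d) for d in range(6)]
print(bounds)
```

Computing `start` from the base chunk size rather than from the enlarged last-day size keeps the last chunk contiguous with the previous one.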


In [25]:
for i in range(start,end):
    temp = data.iloc[i]
    #First part: we use business_match_query to obtain the Yelp ID of the restaurant
    #This function takes several arguments into account (notably name and full address)
    #and returns the matching business, including its Yelp ID
    response = yelp_api.business_match_query(name=temp['DBA Name'],
                                             address1=temp['Address'],
                                             city=temp['City'],
                                             state=temp['State'],
                                             country='US')

    #Second part: we use the Yelp ID to obtain the rating and the number of reviews
    if len(response['businesses'])>0:
        #We write with a single .iloc[row, column] call: chained indexing such as
        #data['Yelp name'].iloc[i] = ... may assign to a copy instead of data itself
        data.iloc[i, data.columns.get_loc('Yelp name')] = response['businesses'][0]['name'] #Yelp name of the business
        id_ = response['businesses'][0]['id'] #Yelp ID
        url = "https://api.yelp.com/v3/businesses/" + id_
        req = requests.get(url, headers=headers) #Request the details of the business
        #print('the status code is {}'.format(req.status_code))
        parsed = json.loads(req.text)
        data.iloc[i, data.columns.get_loc('Yelp rating')] = parsed['rating'] #Obtain the rating
        data.iloc[i, data.columns.get_loc('Yelp review count')] = parsed['review_count'] #Obtain the number of reviews
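
The loop above parses every response as if the request had succeeded. A hedged sketch of a more defensive parsing step checks the status code first. `parse_details` is a hypothetical helper, not part of our original code, and it takes the status code and body as plain arguments so that it can be tried without calling the API.

```python
import json


def parse_details(status_code, text):
    """Return (rating, review_count), or (None, None) if the request failed."""
    if status_code != 200:
        return None, None  # for example when the request was rejected
    parsed = json.loads(text)
    # .get returns None instead of raising if a field is missing
    return parsed.get("rating"), parsed.get("review_count")


print(parse_details(200, '{"rating": 3.0, "review_count": 70}'))
```

Inside the loop, this would be called as `parse_details(req.status_code, req.text)`.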

In [20]:
print("Among the rows", 0,"and", i,"we scraped",i-np.count_nonzero(pd.isnull(data['Yelp name'][0:i])),"restaurants.")

Among the rows 0 and 5498 we scraped 4486 restaurants.


In [17]:
#Each day, we save the newly scraped dataframe to a csv (so if an error occurs at some point, we don't lose the information from the previous days)
#index=False avoids adding an extra 'Unnamed: 0' column each time the file is saved and reloaded
data.to_csv("data/chicago-food-inspections/food-inspections_scrapped"+str(day)+".csv", index=False)