# Yelp API - Gathering Data

## In this Notebook:
 - Using the Yelp Fusion API, restaurant information for the Lower Mainland (within 40,000 m, about 25 miles, of Vancouver, BC) was retrieved and saved.
 - A web scraper was written to gather full review information, but it was not executed, in order to adhere to Yelp's ToS.

### Using offset and limit parameters in Yelp API

Each API call returns at most 50 places. By stepping the offset parameter in increments of the limit, up to 1000 places can be retrieved for a single query.

The offset is the number of results to skip before the API starts returning results, that is, where the call "starts" in the full result list.

For example, with OFFSET = 50 and LIMIT = 50, you will receive results 51-100.
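
To make the paging arithmetic concrete, here is a minimal sketch. `page_offsets` and `result_range` are hypothetical helper names used only for illustration; they are not part of the Yelp API. The 1000-result cap and the page size of 50 are taken from the description above.

```python
def page_offsets(max_results=1000, limit=50):
    """Return the offset of each page needed to cover max_results results."""
    return list(range(0, max_results, limit))


def result_range(offset, limit):
    """Return the 1-based (first, last) positions of the results on one page."""
    return offset + 1, offset + limit


offsets = page_offsets()
print(len(offsets), offsets[:3], offsets[-1])  # number of pages, first offsets, last offset
print(result_range(50, 50))                    # positions covered by the second page
```

With the defaults this gives 20 pages, the last one starting at offset 950, which is the same sequence that `range(0, 1000, 50)` produces.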


### Steps taken to retrieve data

- Get the list of restaurant categories
- For each category, page through the results to retrieve up to 1000 restaurants of that type
- Repeat for all categories to retrieve all restaurant data
- For each restaurant in each category, use a scraper to gather review data (for example, for the top 500 restaurants)
    - The scraper was built but not executed, as running it would be against the ToS.

## Testing the endpoint

In [7]:
import requests

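# Replace the placeholder with your own Yelp Fusion API key
API_KEY = '<--- API KEY HERE --->'
#CLIENT_ID = '<--- CLIENT ID HERE --->'  # not used below

ENDPOINT = "https://api.yelp.com/v3/businesses/search"

# The API key is sent as a bearer token in the Authorization header
HEADERS = {'Authorization': 'Bearer %s' % API_KEY}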

PARAMETERS = {'term': 'restaurants',
              'offset': 0,
              'limit': 50,
              'radius': 40000,
              'location': 'Vancouver, BC'}

response = requests.get(url=ENDPOINT, params=PARAMETERS, headers=HEADERS)
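
Before looping over every category, it helps to check what the test call returned. The sketch below is one possible way to do that. `summarize_search_response` is a hypothetical helper, and the field names it reads (`businesses` and `total` on success, an `error` object with a `description` on failure) are assumptions about the response shape that should be checked against the Yelp documentation. The sample payload is made up so the block runs without a network call.

```python
def summarize_search_response(status_code, payload):
    """Return a one-line summary of a business search response."""
    if status_code != 200:
        # Assumption: error payloads carry an 'error' object with a 'description'
        error = payload.get('error', {})
        return "Request failed ({}): {}".format(
            status_code, error.get('description', 'unknown error'))
    businesses = payload.get('businesses', [])
    return "{} businesses on this page, {} matches reported in total".format(
        len(businesses), payload.get('total', 'unknown'))


# Made-up payload standing in for response.json()
sample = {'businesses': [{'name': 'Example Cafe'}], 'total': 1}
print(summarize_search_response(200, sample))
```

In the notebook this would be called as `summarize_search_response(response.status_code, response.json())`.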

## Gathering data for all restaurants within 40,000 m (about 25 miles) of Vancouver, BC

In [8]:
# Load the Yelp categories JSON file
import json

with open('data/categories.json') as f:
    data = json.load(f)

# Keep only the categories whose parents include 'restaurants'
restaurants = [place for place in data if 'restaurants' in place['parents']]
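
The filter above can be tried on a few hand-written entries. In this sketch the sample records are illustrative only: they mimic the `alias`, `title` and `parents` keys used in this notebook but are not copied from `data/categories.json`. `restaurant_categories` is a hypothetical helper that uses `.get` so that an entry without a `parents` key is skipped instead of raising a `KeyError`.

```python
# Illustrative entries shaped like the records in the categories file
sample_categories = [
    {'alias': 'afghani', 'title': 'Afghan', 'parents': ['restaurants']},
    {'alias': 'acupuncture', 'title': 'Acupuncture', 'parents': ['health']},
    {'alias': 'example', 'title': 'Example'},  # made-up entry with no 'parents' key
]


def restaurant_categories(categories):
    """Keep only the categories that list 'restaurants' among their parents."""
    return [c for c in categories if 'restaurants' in c.get('parents', [])]


selected = restaurant_categories(sample_categories)
print([c['alias'] for c in selected])
```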

In [9]:
restaurant_aliases = [restaurant['alias'] for restaurant in restaurants]
restaurant_titles = [restaurant['title'] for restaurant in restaurants]

print("Restaurant aliases: {} || Num: {}".format(restaurant_aliases[:3], len(restaurant_aliases)))
print("Restaurant titles: {} || Num: {}".format(restaurant_titles[:3], len(restaurant_titles)))

Restaurant aliases: ['afghani', 'african', 'andalusian'] || Num: 192
Restaurant titles: ['Afghan', 'African', 'Andalusian'] || Num: 192


In [10]:
import time

PARAMETERS = {'term': 'restaurants',
              'offset': 0, # start at 0
              'limit': 50, # maximum is 50
              'radius': 40000, # in m
              'location': 'Vancouver, BC'}

restaurants_in_vancouver = []

# Cycle through categories
for category in restaurant_aliases:
    PARAMETERS['categories'] = category
    # Page through the results for this category, 50 at a time
    for offset_number in range(0, 1000, 50):
        PARAMETERS['offset'] = offset_number

        response = requests.get(url=ENDPOINT, params=PARAMETERS, headers=HEADERS)
        businesses = response.json().get('businesses')

        # Stop paging once a page comes back empty or without a 'businesses' key
        if not businesses:
            break

        restaurants_in_vancouver.extend(businesses)

        print("{}: {}-{}".format(category, offset_number, offset_number+50))

        time.sleep(0.5)  # Pause between calls to avoid being blocked by the Yelp API

afghani: 0-50
african: 0-50
arabian: 0-50
asianfusion: 0-50
asianfusion: 50-100
australian: 0-50
austrian: 0-50
bangladeshi: 0-50
bbq: 0-50
bbq: 50-100
belgian: 0-50
bistros: 0-50
brasseries: 0-50
brazilian: 0-50
breakfast_brunch: 0-50
breakfast_brunch: 50-100
breakfast_brunch: 100-150
breakfast_brunch: 150-200
breakfast_brunch: 200-250
british: 0-50
buffets: 0-50
burgers: 0-50
burgers: 50-100
burgers: 100-150
burgers: 150-200
burmese: 0-50
cafes: 0-50
cafes: 50-100
cafes: 100-150
cafes: 150-200
cafes: 200-250
cafes: 250-300
cajun: 0-50
cambodian: 0-50
caribbean: 0-50
cheesesteaks: 0-50
chicken_wings: 0-50
chicken_wings: 50-100
chickenshop: 0-50
chinese: 0-50
chinese: 50-100
chinese: 100-150
chinese: 150-200
chinese: 200-250
chinese: 250-300
chinese: 300-350
chinese: 350-400
chinese: 400-450
chinese: 450-500
chinese: 500-550
chinese: 550-600
comfortfood: 0-50
creperies: 0-50
cuban: 0-50
delis: 0-50
delis: 50-100
diners: 0-50
dinnertheater: 0-50
dumplings: 0-50
ethiopian: 0-50
filipino:

In [11]:
# This number includes duplicates
print(len(restaurants_in_vancouver))

5694


In [12]:
# Save the raw results, duplicates included
with open("data/vancouver_restaurants_duplicates.json", "w") as restaurants_file:
    json.dump(restaurants_in_vancouver, restaurants_file, indent=6)

In [13]:
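# Remove the duplicate entries (a restaurant can appear under several categories).
# An entry is dropped if an identical one appears later, so the last copy is kept.
res_list = [i for n, i in enumerate(restaurants_in_vancouver) if i not in restaurants_in_vancouver[n + 1:]]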
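
The comparison above checks every record against all later records, which gets slow as the list grows. A possible alternative, sketched below, is to deduplicate on each business's `id` field. This is an assumption-based variant, not the method used in this notebook: `dedupe_by_id` is a hypothetical helper, it keeps the first copy of each business instead of the last, and it treats two records with the same `id` as duplicates even if other fields differ, so it may not reproduce exactly the same count.

```python
def dedupe_by_id(businesses):
    """Return the businesses with duplicate ids removed, keeping the first copy."""
    seen = {}
    for business in businesses:
        # setdefault only stores the record the first time an id is seen
        seen.setdefault(business['id'], business)
    # dicts preserve insertion order (Python 3.7+), so the original order is kept
    return list(seen.values())


# Made-up records standing in for restaurants_in_vancouver
sample_businesses = [
    {'id': 'a', 'name': 'Example Cafe'},
    {'id': 'b', 'name': 'Example Diner'},
    {'id': 'a', 'name': 'Example Cafe'},
]
unique = dedupe_by_id(sample_businesses)
print(len(unique), [b['id'] for b in unique])
```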

In [14]:
# Save the deduplicated results
with open("data/vancouver_restaurants.json", "w") as restaurants_file:
    json.dump(res_list, restaurants_file, indent=6)

In [15]:
# Sort the restaurants alphabetically by name
newlist = sorted(res_list, key=lambda k: k['name'])

In [16]:
len(newlist)

3753

Data for 3753 unique restaurants was gathered.

# Gathering review data using a web scraper

A web scraper was written to gather full review information, but it was not executed. For legal reasons and to adhere to the ToS, it is kept in a separate file and will not be uploaded online. If you are an employer and would like to see the work, please contact me.