# Using an Efficient API Extraction Method to Obtain Data from the Yelp API

## Objectives

- Create a new file without accidentally erasing prior results.
- Loop through a list of queries and save the results throughout the loop.
- Use the tqdm library to make a progress bar that tracks the time remaining in a loop.

### Applying Code From
- Efficient API Calls Lesson Link: https://login.codingdojo.com/m/376/12529/88078

In [2]:
# if the yelpapi package is not installed, run the following line
# !pip install yelpapi

In [1]:
# Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Additional Imports
import os, json, math, time
from yelpapi import YelpAPI
from tqdm.notebook import tqdm_notebook

### Set up API credentials


- Yelp: https://www.yelp.com/developers/documentation/v3/get_started


> Check the official API documentation to see which arguments we can search with: https://www.yelp.com/developers/documentation/v3/business_search

### Load Credentials and Create Yelp API Object

In [3]:
# Load API Credentials
with open('/Users/Shenyue/.secret/yelp.api.json','r') as f:
    login = json.load(f)

In [4]:
login.keys()

dict_keys(['Client-ID', 'API Key'])
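
The output above suggests the credentials file is a flat JSON object with two keys. As a minimal sketch under that assumption, the snippet below writes such a file with placeholder values to a temporary folder and reads it back. The folder, the variable names, and the values are hypothetical and are not the author's actual file.

```python
import json
import os
import tempfile

# Placeholder values for illustration only; real values come from a Yelp developer account
example_login = {"Client-ID": "YOUR-CLIENT-ID", "API Key": "YOUR-API-KEY"}

# Write the file to a temporary folder (hypothetical location for this sketch)
secret_folder = tempfile.mkdtemp()
secret_file = os.path.join(secret_folder, "yelp.api.json")
with open(secret_file, "w") as f:
    json.dump(example_login, f)

# Load it the same way the notebook loads the real file
with open(secret_file, "r") as f:
    loaded_login = json.load(f)

print(list(loaded_login.keys()))
```

Keeping the file outside the project folder, as the notebook does, helps avoid committing the key to version control by accident.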

In [15]:
# Instantiate YelpAPI Variable
yelp_api = YelpAPI(login['API Key'], timeout_s = 5.0)

### Define Search Terms and File Paths

In [17]:
# set our API call parameters and filename before the first call
LOCATION = 'Garden Grove, CA'
TERM = 'Pho'

In [18]:
## Specify folder for saving data
FOLDER = 'Data/'
os.makedirs(FOLDER,exist_ok = True)

# Specifying JSON_FILE filename (can include a folder)
JSON_FILE = FOLDER+f"{LOCATION.split(',')[0]}-{TERM}.json"
JSON_FILE

'Data/Garden Grove-Pho.json'

### Check if JSON File exists and Create it if it doesn't

In [19]:
## Check if JSON_FILE exists
file_exists = os.path.isfile(JSON_FILE)
## If it does not exist: 
if file_exists == False:    
    ## CREATE ANY NEEDED FOLDERS
    # Get the Folder Name only
    folder = os.path.dirname(JSON_FILE)
    
    ## If JSON_FILE included a folder:
    if len(folder)>0:
        # create the folder
        os.makedirs(folder,exist_ok = True)
        
        
    ## INFORM USER AND SAVE EMPTY LIST
    print(f"[i] {JSON_FILE} not found. Saving empty list to new file.")
    
    
    ## save an empty list as a placeholder for future results
    with open(JSON_FILE,'w') as f:
        json.dump([],f)
        
## If it exists, inform user
else:
    print(f"[i] {JSON_FILE} already exists.")

[i] Data/Garden Grove-Pho.json already exists.


### Determine how many results are already in the file

In [61]:
## Load previous results and use len of results for offset
with open(JSON_FILE,'r') as f:
    previous_results = json.load(f)
    
## set offset based on previous results
n_results = len(previous_results)
print(f'- {n_results} previous results found.')

- 866 previous results found.


### Figure out how many pages of results we will need

- The API will return results by pages.
- We will perform our first query to get our first page of results and the total number of results.
- Then we will calculate how many pages we will need to retrieve all of our results so that we do not waste API requests.

In [21]:
# use our yelp_api variable's search_query method to perform our API call
results = yelp_api.search_query(location=LOCATION,
                                term=TERM,
                               offset=n_results)
results.keys()

dict_keys(['businesses', 'total', 'region'])

In [22]:
## How many results total?
total_results = results['total']
total_results

866

In [23]:
## How many did we get the details for?
results_per_page = len(results['businesses'])
results_per_page

20

- There are `866` businesses to retrieve from our API and we can get `20` results at a time (per "page").
- We can calculate the total number of pages by dividing the total number of results by the number of results per page.
  - We need to **round up** the result (e.g., with `math.ceil`) so that the last, partially filled page is included.
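
As a quick worked example of the page calculation, using the two values reported above (the variable names mirror the notebook's; the numbers are copied from its outputs):

```python
import math

total_results = 866     # value of results['total'] reported above
results_per_page = 20   # number of businesses returned per page above

# 866 / 20 = 43.3, so rounding up gives 43 full pages plus one partial page
n_pages = math.ceil(total_results / results_per_page)
print(n_pages)  # 44
```

Rounding down instead would give 43 pages and leave out the final 6 businesses.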

### Perform query and handle queries with > 1000 results

- A query may match more than 1,000 results, but requesting results past that point causes an error like the one below:

> YelpAPIError: VALIDATION_ERROR: Too many results requested, limit+offset must be <= 1000.

- We can add a logic check to see whether the number of results we have so far (`n_results`) plus the number of results per page (`results_per_page`) is greater than 1,000.
  - If so, we will use `break` to end our loop early.
  
- We also need a function that deletes an existing JSON file when we ask it to.
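
The limit check from the first bullet above can be sketched in isolation as a small helper. The function name and the `MAX_RESULTS` constant are hypothetical; the 1,000 cap is taken from the error message quoted above.

```python
MAX_RESULTS = 1000  # cap quoted in the error message above


def can_request_next_page(n_results, results_per_page, max_results=MAX_RESULTS):
    """Return True if offset + page size stays within the cap."""
    return (n_results + results_per_page) <= max_results


print(can_request_next_page(0, 20))     # first page is fine
print(can_request_next_page(980, 20))   # 980 + 20 == 1000, still allowed
print(can_request_next_page(1000, 20))  # would exceed the cap
```

This is the same comparison the loop later in this notebook performs inline before each request.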

In [53]:
def create_json_file(JSON_FILE,  delete_if_exists=False):
    
    file_exists = os.path.isfile(JSON_FILE)
    
    if file_exists == True:
        
        if delete_if_exists == True:
            
            print(f"[!] {JSON_FILE} already exists. Deleting previous file...")
            
            os.remove(JSON_FILE)
            
            create_json_file(JSON_FILE, delete_if_exists = False)
        else:
            print(f"[!] {JSON_FILE} already exists.")
    
    else:
        
        print(f"[i] {JSON_FILE} not found. Saving empty list to new file.")
        
        folder = os.path.dirname(JSON_FILE)
        
        if len(folder)>0:
            os.makedirs(folder, exist_ok = True)
        
        # save an empty list so the file can be loaded and extended later
        with open(JSON_FILE, 'w') as f:
            json.dump([], f)

- Use the `create_json_file` function to create an empty file; passing `delete_if_exists = True` deletes any existing file first.

- Calculate `n_pages`, which the for loop uses to perform the API calls.

In [54]:
## Create a new empty json file (delete the previous file if it exists)
create_json_file(JSON_FILE, delete_if_exists=True)
## Load previous results and use len of results for offset
with open(JSON_FILE,'r') as f:
    previous_results = json.load(f)
    
## set offset based on previous results
n_results = len(previous_results)
print(f'- {n_results} previous results found.')
# use our yelp_api variable's search_query method to perform our API call
results = yelp_api.search_query(location=LOCATION,
                                term=TERM,
                               offset=n_results)
## How many results total?
total_results = results['total']
## How many did we get the details for?
results_per_page = len(results['businesses'])
# Use math.ceil to round up for the total number of pages of results.
n_pages = math.ceil((results['total']-n_results)/ results_per_page)
n_pages


[i] Data/Garden Grove-Pho.json not found. Saving empty list to new file.
- 0 previous results found.


44

- Use a for loop to perform API calls
- Break the loop if the next request would go past 1,000 results

In [56]:
# use the loop to perform queries
# if n_results + results_per_page > 1000, break the loop
for i in tqdm_notebook( range(1,n_pages+1)):
    
    ## Read in results in progress file and check the length
    with open(JSON_FILE, 'r') as f:
        previous_results = json.load(f)
    ## save number of results to use as offset
    n_results = len(previous_results)
    
    if (n_results + results_per_page) > 1000:
        print('Exceeded 1000 results. Stopping loop.')
        break
    
    ## use n_results as the OFFSET 
    results = yelp_api.search_query(location=LOCATION,
                                    term=TERM, 
                                    offset=n_results)
    
    
    
    ## append new results and save to file
    previous_results.extend(results['businesses'])
    
    # display(previous_results)
    with open(JSON_FILE,'w') as f:
        json.dump(previous_results,f)
    
    time.sleep(.2)

  0%|          | 0/44 [00:00<?, ?it/s]
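
The save-as-you-go pattern in the loop above can be tried offline without spending API requests. The sketch below is a simulation under stated assumptions: `fake_search_query`, `PAGE_SIZE`, `FAKE_TOTAL`, and the temporary file are all hypothetical stand-ins, and the made-up businesses carry only an `id`. It uses the console version of `tqdm` (falling back to a plain loop if tqdm is unavailable) rather than `tqdm_notebook`.

```python
import json
import math
import os
import tempfile
import time

try:
    from tqdm import tqdm  # console progress bar; the notebook uses tqdm_notebook
except ImportError:
    def tqdm(iterable):
        # Fallback for this sketch only: iterate without a progress bar
        return iterable

PAGE_SIZE = 20   # assumed page size, matching the 20 results per page seen above
FAKE_TOTAL = 45  # made-up total for this simulation


def fake_search_query(offset):
    """Hypothetical stand-in for yelp_api.search_query; returns made-up businesses."""
    ids = range(offset, min(offset + PAGE_SIZE, FAKE_TOTAL))
    return {"businesses": [{"id": f"biz-{i}"} for i in ids], "total": FAKE_TOTAL}


# Start from an empty progress file in a temporary folder
demo_file = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_file, "w") as f:
    json.dump([], f)

n_pages = math.ceil(FAKE_TOTAL / PAGE_SIZE)

for _ in tqdm(range(n_pages)):
    # Reload what has been saved so far and use its length as the offset
    with open(demo_file, "r") as f:
        saved = json.load(f)
    n_saved = len(saved)

    # Stop before requesting past the 1,000-result cap
    if n_saved + PAGE_SIZE > 1000:
        print("Reached the 1,000-result cap. Stopping loop.")
        break

    page = fake_search_query(offset=n_saved)

    # Append the new page and write everything back immediately
    saved.extend(page["businesses"])
    with open(demo_file, "w") as f:
        json.dump(saved, f)

    time.sleep(0.01)  # short pause; the real loop waits 0.2 s between calls

with open(demo_file, "r") as f:
    final_results = json.load(f)
print(len(final_results))
```

Because the file is rewritten after every page, an interrupted run loses at most the page in flight, and rerunning resumes from the saved offset.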

### Convert the `.json` file to a DataFrame

In [62]:
# load final results
df = pd.read_json(JSON_FILE)
display(df.head(), df.tail())

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,svgFm8Ybzq9D8vPPhWE38A,pho-79-restaurant-garden-grove,Pho 79 Restaurant,https://s3-media4.fl.yelpcdn.com/bphoto/6-3WUL...,False,https://www.yelp.com/biz/pho-79-restaurant-gar...,2972,"[{'alias': 'vietnamese', 'title': 'Vietnamese'...",4.0,"{'latitude': 33.752461, 'longitude': -117.95576}",[delivery],$$,"{'address1': '9941 Hazard Ave', 'address2': ''...",17145312490,(714) 531-2490,2165.252785
1,7_lDdYuloowE2Jlav8PRnQ,pho-45-garden-grove,Pho 45,https://s3-media4.fl.yelpcdn.com/bphoto/GMKCMC...,False,https://www.yelp.com/biz/pho-45-garden-grove?a...,1809,"[{'alias': 'vietnamese', 'title': 'Vietnamese'...",4.0,"{'latitude': 33.7736410518458, 'longitude': -1...",[delivery],$$,"{'address1': '9240 Garden Grove Blvd', 'addres...",17145379000,(714) 537-9000,590.58778
2,TT0Y5sxPE2R5l0Pv_VxbNQ,pho-redbo-garden-grove,Pho Redbo,https://s3-media3.fl.yelpcdn.com/bphoto/BE9BsO...,False,https://www.yelp.com/biz/pho-redbo-garden-grov...,823,"[{'alias': 'vietnamese', 'title': 'Vietnamese'}]",4.5,"{'latitude': 33.773975, 'longitude': -117.997094}","[pickup, restaurant_reservation, delivery]",$$,"{'address1': '7725 Garden Grove Blvd', 'addres...",17146224896,(714) 622-4896,2713.451559
3,jGlHGLneaPZ3vS5l0nRRXA,pho-kuroushi-garden-grove-2,Pho Kuroushi,https://s3-media3.fl.yelpcdn.com/bphoto/x6gPI4...,False,https://www.yelp.com/biz/pho-kuroushi-garden-g...,100,"[{'alias': 'asianfusion', 'title': 'Asian Fusi...",5.0,"{'latitude': 33.75325279398934, 'longitude': -...","[pickup, restaurant_reservation, delivery]",$$,"{'address1': '14376 Brookhurst St', 'address2'...",16579668984,(657) 966-8984,2173.135412
4,fAptoiXJxRK7KqaNsBAKTg,phoholic-westminster-2,PhoHolic,https://s3-media2.fl.yelpcdn.com/bphoto/dsdPDz...,False,https://www.yelp.com/biz/phoholic-westminster-...,1318,"[{'alias': 'vietnamese', 'title': 'Vietnamese'...",4.0,"{'latitude': 33.7457405317976, 'longitude': -1...",[delivery],$$,"{'address1': '14932 Bushard St', 'address2': '...",17147338822,(714) 733-8822,2610.52922


Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
861,8yt1y5WKfOdGK8n7QmdoLw,world-market-lakewood-2,World Market,https://s3-media3.fl.yelpcdn.com/bphoto/-1_b47...,False,https://www.yelp.com/biz/world-market-lakewood...,189,"[{'alias': 'furniture', 'title': 'Furniture St...",4.0,"{'latitude': 33.849438508453304, 'longitude': ...",[],$$,"{'address1': '5041 Lakewood Blvd', 'address2':...",15626020031,(562) 602-0031,18513.387204
862,C_whtGAuGd7f6H9nmiQrog,h-mart-lakewood-lakewood,H Mart - Lakewood,https://s3-media2.fl.yelpcdn.com/bphoto/m2aBX3...,False,https://www.yelp.com/biz/h-mart-lakewood-lakew...,181,"[{'alias': 'grocery', 'title': 'Grocery'}, {'a...",3.5,"{'latitude': 33.8467014125716, 'longitude': -1...",[delivery],$$,"{'address1': '20137 Pioneer Blvd.', 'address2'...",15623039810,(562) 303-9810,13645.745664
863,nE7T2od6Olz69b43qp5cNw,grocery-outlet-bargain-market-lakewood-3,Grocery Outlet Bargain Market,https://s3-media1.fl.yelpcdn.com/bphoto/MKV1SW...,False,https://www.yelp.com/biz/grocery-outlet-bargai...,76,"[{'alias': 'grocery', 'title': 'Grocery'}, {'a...",3.0,"{'latitude': 33.859024, 'longitude': -118.11855}",[],$,"{'address1': '5615 Woodruff Ave', 'address2': ...",15629202900,(562) 920-2900,17114.097253
864,aGcbWDczc0b_hedHC-5pxQ,layer-cake-bakery-lcb-irvine-3,Layer Cake Bakery - LCB,https://s3-media2.fl.yelpcdn.com/bphoto/TvVHlC...,False,https://www.yelp.com/biz/layer-cake-bakery-lcb...,823,"[{'alias': 'bakeries', 'title': 'Bakeries'}, {...",3.5,"{'latitude': 33.681151, 'longitude': -117.804845}","[pickup, delivery]",$$,"{'address1': '4250 Barranca Pkwy', 'address2':...",19497860223,(949) 786-0223,18014.28766
865,PUeGfzWMVFtbJXLYXOA4xA,stater-bros-markets-tustin-3,Stater Bros. Markets,https://s3-media2.fl.yelpcdn.com/bphoto/8gk2cX...,False,https://www.yelp.com/biz/stater-bros-markets-t...,113,"[{'alias': 'grocery', 'title': 'Grocery'}, {'a...",3.0,"{'latitude': 33.7323158, 'longitude': -117.816...",[],$,"{'address1': '14171 Red Hill Ave', 'address2':...",17145441812,(714) 544-1812,14591.443006


### Check for duplicates

In [63]:
# check for duplicate IDs
df.duplicated(subset='id').sum()

0

In [64]:
## Drop duplicate ids and confirm there are no more duplicates
df = df.drop_duplicates(subset='id')
df.duplicated(subset='id').sum()

0
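
Since the real data contained no duplicates, the effect of these two calls is easier to see on a small made-up frame. The sketch below uses hypothetical rows; restricting the check to `id` also sidesteps columns such as `transactions`, whose list values pandas generally cannot hash when comparing whole rows.

```python
import pandas as pd

# Made-up rows for illustration; the first and last share the same id
demo_df = pd.DataFrame({
    "id": ["a1", "b2", "a1"],
    "name": ["Pho A", "Pho B", "Pho A"],
    "transactions": [["delivery"], [], ["delivery"]],
})

# Count rows whose id has already appeared earlier in the frame
n_dupes = int(demo_df.duplicated(subset="id").sum())
print(n_dupes)

# Keep the first occurrence of each id
deduped = demo_df.drop_duplicates(subset="id")
print(len(deduped))
```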

### Save the final dataframe to a CSV file

In [59]:
## change the file extension from .json to .csv.gz
csv_file = JSON_FILE.replace('.json','.csv.gz')
csv_file

'Data/Garden Grove-Pho.csv.gz'

In [60]:
## Save it as a compressed csv (to save space)
df.to_csv(csv_file, compression='gzip', index=False)
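
To confirm that a compressed file written this way can be read back, a minimal round-trip sketch follows. It uses a made-up two-row frame and a temporary file (both hypothetical) and relies on `pd.read_csv` inferring gzip compression from the `.gz` extension.

```python
import os
import tempfile

import pandas as pd

# Made-up frame for illustration
demo_df = pd.DataFrame({"id": ["a1", "b2"], "rating": [4.0, 4.5]})

# Save as a gzip-compressed CSV in a temporary folder
demo_csv = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
demo_df.to_csv(demo_csv, compression="gzip", index=False)

# Compression is inferred from the .gz extension when reading
reloaded = pd.read_csv(demo_csv)
print(reloaded)
```

Note that nested columns such as `coordinates` or `location` are written to CSV as text, so they would come back as strings rather than dictionaries.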