Using Python's YelpAPI to Access the Yelp Fusion API

In [5]:
from yelpapi import YelpAPI
import requests, time, json
import pandas as pd
import re
import os

In [6]:
# Assumption: the key is exported as the environment variable YELP_API_KEY
# rather than pasted into the notebook, so it never ends up in shared files.
MY_API_KEY = os.environ["YELP_API_KEY"]

In [7]:
yelp_api = YelpAPI(MY_API_KEY)

Defining a custom function for automatically creating a log-file

In [8]:
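def filename_format_log(file_path,
                        logfile='./data/file_log.txt',
                        file_description=None):
    # The path needs a file extension: a dot that is not the first character
    # and is not next to another dot (so './' and '../' prefixes don't count).
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Locate the base name (e.g. 'raw_restaurant') so the timestamp goes in front of it.
    base = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if base is None:
        raise NameError("Expected a lowercase file name like 'raw_restaurant.json'.")
    stamp = base.start()
    now = round(time.time())
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append one line per saved file to the log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, file_description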
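
To see what the naming scheme produces, here is a minimal sketch of just the timestamp-insertion step, with the timestamp passed in so the result is reproducible. The helper name `stamp_filename` is made up for this illustration; it reuses the same pattern as filename_format_log but skips the log file.

```python
import re

def stamp_filename(file_path, timestamp):
    """Insert `timestamp` in front of a 'word_word' base name (hypothetical helper)."""
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError(f'no word_word base name found in {file_path!r}')
    start = match.start()
    return f'{file_path[:start]}{timestamp}_{file_path[start:]}'

print(stamp_filename('./data/raw_restaurant.json', 1555439987))
# ./data/1555439987_raw_restaurant.json
```

The timestamp prefix is also why the saved files can later be found with the `*_raw*` pattern.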

Defining a custom function for my Yelp query

In [9]:
def yelp_query(category, zip_code, offset_number, n_samples=1000):
    yelp_api = YelpAPI(MY_API_KEY)
    yelp_results = []
    size = 50                    # results per request (the API maximum)
    loops = 0                    # number of results requested so far
    run = 1
    offset_count = offset_number

    while loops < n_samples:
        print(f'Starting query {run}')
        posts = yelp_api.search_query(categories=category, location=zip_code,
                                      offset=offset_count, limit=size)

        yelp_results.extend(posts['businesses'])
        loops += size
        offset_count += size     # move to the next page
        time.sleep(3)            # pause between requests
        run += 1

    # Save the raw results under a timestamped name and record the pull in the log.
    formatted_name, file_description = filename_format_log(file_path=f'./data/raw_{category}.json')
    with open(formatted_name, 'w+') as f:
        json.dump(yelp_results, f)

    print(f'Saved and completed query and returned {len(yelp_results)} {category}s.')
    print('Yelp text is ready for processing.')
    print(f'Last timestamp was {round(time.time())}.')
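
The loop above always issues n_samples / size requests, even after a zip code has run out of results. A possible variation, sketched here with a stand-in callable instead of the real API, stops as soon as a page comes back empty. `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, and the sketch assumes an exhausted search returns an empty list.

```python
def paginate(fetch_page, page_size=50, max_results=1000, start_offset=0):
    """Collect results page by page, stopping early on an empty page.

    fetch_page(offset, limit) is a hypothetical callable returning a list.
    """
    results = []
    offset = start_offset
    while offset < start_offset + max_results:
        page = fetch_page(offset, page_size)
        if not page:            # nothing left for this search
            break
        results.extend(page)
        offset += page_size
    return results

# Stand-in for the API: 120 fake results in total.
fake_data = list(range(120))
calls = []

def fake_fetch(offset, limit):
    calls.append(offset)
    return fake_data[offset:offset + limit]

found = paginate(fake_fetch)
print(len(found), calls)
```

With 120 fake results the sketch makes four requests instead of twenty; the fourth is the empty page that ends the loop.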

Creating a list for all our zipcodes of interest in Los Angeles County

In [10]:
zip_code_list = [91342,91344,91335,91331,91326,91364,91306,91406,91343,91367,91304,90047,90045,90065,
                 90066,90042,90068,90049,90272,91604,90046,91307,91311,90731,91352,91605,91042,
                 91040,90044,91356,91423,90043,91325,90032,91401,91316,90064,91405,90016,91402,91436,91606,90026,90019,
                 90003,91403,90002,91324,90039,90291,90034,90041,91607,90744,90210,90027,90018,90011,90732,90077,90004,
                 90062,91345,90059,90008,90035,90037,91601,90069,91602,90036,90048,91411,90031, 
                 90024,91303,90710,90025,90014,90501,91340,90230,90247,90061,90293, 90023,90033,90094,90006,
                 90007,90292,90038,90001,90005,90029,90028,90063,90017,90248,90020,90402,90013,90012]

My section of the zip code lists (we split up the zip codes evenly between the team members)

In [11]:
zip_code_list[34:68]

[91401,
 91316,
 90064,
 91405,
 90016,
 91402,
 91436,
 91606,
 90026,
 90019,
 90003,
 91403,
 90002,
 91324,
 90039,
 90291,
 90034,
 90041,
 91607,
 90744,
 90210,
 90027,
 90018,
 90011,
 90732,
 90077,
 90004,
 90062,
 91345,
 90059,
 90008,
 90035,
 90037,
 91601]
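
The slice above is one fixed-size piece of the full list. As a sketch of how such a split could be generated rather than typed by hand, the helper below cuts a list into consecutive chunks. `chunk` is a made-up name, and the chunk size of 34 is inferred from the [34:68] slice, not stated anywhere in the notebook.

```python
def chunk(items, size):
    """Split `items` into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Placeholder values standing in for zip codes.
codes = list(range(100))
parts = chunk(codes, 34)
print([len(p) for p in parts])   # [34, 34, 32]
print(parts[1] == codes[34:68])  # True
```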

Iterating through the zip codes in my list. I couldn't run all of them in one go because the API failed partway through, so the cell below shows only one of several runs. Each time a request failed, I re-ran the cell with the slice starting at the zip code where it stopped, until every zip code in my section was covered.

In [75]:
for zipcode in zip_code_list[52:68]:
    yelp_query('restaurant', zipcode , 0)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Starting query 11
Starting query 12
Starting query 13
Starting query 14
Starting query 15
Starting query 16
Starting query 17
Starting query 18
Starting query 19
Starting query 20
Saved and completed query and returned 426 restaurants.
Yelp text is ready for processing.
Last timestamp was 1555439987.
Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Starting query 11
Starting query 12
Starting query 13
Starting query 14
Starting query 15
Starting query 16
Starting query 17
Starting query 18
Starting query 19
Starting query 20
Saved and completed query and returned 560 restaurants.
Yelp text is ready for processing.
Last timestamp was 1555440063.
Starting query 1
Starting query 2
Starting query 3
Sta
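
Re-running the cell by hand works, but the bookkeeping can be automated. The sketch below records each finished zip code in a small JSON file and skips it on the next run. `run_with_resume`, the `done_path` file, and the `query_fn` callable are all hypothetical; in this notebook `query_fn` would be something like `lambda z: yelp_query('restaurant', z, 0)`.

```python
import json
import os

def run_with_resume(zip_codes, query_fn, done_path):
    """Call query_fn(zip_code) for each zip code not yet recorded in done_path."""
    done = []
    if os.path.exists(done_path):
        with open(done_path) as f:
            done = json.load(f)
    for zip_code in zip_codes:
        if zip_code in done:
            continue                      # finished in an earlier run
        try:
            query_fn(zip_code)
        except Exception as err:          # e.g. a failed API request
            print(f'Stopped at {zip_code}: {err}')
            break
        done.append(zip_code)
        with open(done_path, 'w') as f:   # save progress after every success
            json.dump(done, f)
    return done
```

Calling the function again with the same arguments picks up at the first zip code that has not been saved yet.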

Grabbing the location of the saved JSON files.

In [12]:
files = !ls data/*_raw*
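
The `!ls` syntax only works inside IPython or Jupyter. If the same step needs to run as a plain Python script, the standard-library glob module gives an equivalent listing. A minimal sketch, where `find_raw_files` is a made-up helper name:

```python
import glob
import os

def find_raw_files(data_dir='data'):
    """Return paths in data_dir whose names contain '_raw', sorted by name."""
    return sorted(glob.glob(os.path.join(data_dir, '*_raw*')))

files = find_raw_files()
print(files)
```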

Loading in those JSON files so we can parse them.

In [111]:
yelp_jsons = []
for file in files:
    with open(file) as f:
        yelp_jsons.extend(json.load(f))

Checking how many restaurant records I pulled: 24557 for just the zip codes I queried, before removing duplicates and incomplete rows.

In [112]:
len(yelp_jsons)

24557

Grabbing the review count for our first restaurant, just to test out keying into the dictionary.

In [113]:
yelp_jsons[0]['review_count']

771

Defining a custom function to automatically parse the aforementioned JSONs and keep only the data we want.

In [114]:
def yelp_parse(sample):
    # Build new flat records instead of modifying the loaded JSON in place,
    # so the function can be re-run safely on the same list.
    rows = []
    for biz in sample:
        rows.append({
            'id': biz['id'],
            'name': biz['name'],
            'alias': biz['alias'],
            'type': biz['categories'][0]['alias'],      # first listed category
            'rating': biz['rating'],
            'review_count': biz['review_count'],
            'price': biz.get('price'),                   # price may be missing
            'location': biz['location']['zip_code'],
            'latitude': biz['coordinates']['latitude'],
            'longitude': biz['coordinates']['longitude'],
        })

    yelp_df = pd.DataFrame(rows)
    # Convert the price tier from dollar signs to a number (missing stays NaN).
    yelp_df['price'] = yelp_df['price'].map({'$': 1, '$$': 2, '$$$': 3, '$$$$': 4})
    return yelp_df
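
For reference, this is roughly what the flattening does to one record. The dictionary below is a hand-built, abbreviated stand-in for a Yelp business entry, filled in with values from the first row of the dataframe shown further down; real responses carry more fields. `flatten_business` is a hypothetical plain-Python helper, not part of the notebook.

```python
PRICE_TIERS = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}

def flatten_business(biz):
    """Pull the nested fields of one business record up to the top level."""
    return {
        'id': biz['id'],
        'type': biz['categories'][0]['alias'],
        'price': PRICE_TIERS.get(biz.get('price')),   # None when price is absent
        'location': biz['location']['zip_code'],
        'latitude': biz['coordinates']['latitude'],
        'longitude': biz['coordinates']['longitude'],
    }

# Abbreviated, hand-built record (not a full API response).
record = {
    'id': 'PEHM9AEqq0ca3vACyOMEwA',
    'categories': [{'alias': 'mediterranean'}],
    'price': '$$',
    'location': {'zip_code': '91401'},
    'coordinates': {'latitude': 34.186598, 'longitude': -118.431349},
}
flat = flatten_business(record)
print(flat)
```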

Creating a dataframe from the parsed JSONs

In [115]:
yelp_df = yelp_parse(yelp_jsons)

In [116]:
yelp_df.head()

Unnamed: 0,id,name,alias,type,rating,review_count,price,location,latitude,longitude
0,PEHM9AEqq0ca3vACyOMEwA,Lusy's Mediterranean Cafe & Grill,lusys-mediterranean-cafe-and-grill-van-nuys-2,mediterranean,4.5,771,2.0,91401,34.186598,-118.431349
1,ja_cBagHfhI0eFJrw3BRTA,Kobee Factory,kobee-factory-van-nuys-2,mideastern,4.5,536,2.0,91401,34.179265,-118.44037
2,vWuft2V5ZKKWRPzQUHuKDw,Nat's Early Bite Coffee Shop,nats-early-bite-coffee-shop-sherman-oaks,diners,4.5,1069,2.0,91401,34.1724,-118.44053
3,DfmaMh5rJQ_o9vEvhfUDgQ,Uncle Tony's Pizzeria,uncle-tonys-pizzeria-north-hollywood,italian,4.0,1164,2.0,91606,34.18738,-118.416558
4,Mfa5dHJKcY4K-c3IQIxKkA,Krimsey's Cajun Kitchen,krimseys-cajun-kitchen-north-hollywood-2,vegan,4.5,870,2.0,91606,34.186299,-118.413965


Dropping duplicates and missing data.

In [128]:
yelp_df.drop_duplicates(inplace=True)

In [129]:
yelp_df.dropna(inplace=True)

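Even though I only queried 34 zip codes, the results span 214 unique zip codes, presumably because a Yelp location search also returns businesses near the requested zip code, not just inside it. That's fine: once we merge this with the housing data later, we'll only keep the zip codes that overlap.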

In [130]:
len(yelp_df.location.unique())

214
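
If only the requested zip codes were wanted before the merge, the rows could be filtered directly. One thing to watch: location holds strings while zip_code_list holds integers, so one side has to be converted first. A minimal sketch on plain dictionaries, with `keep_requested` as a made-up helper name and a placeholder third record:

```python
def keep_requested(records, requested_zips):
    """Keep records whose 'location' (a string) is among the requested zip codes."""
    wanted = {str(z) for z in requested_zips}   # ints -> strings for comparison
    return [r for r in records if r['location'] in wanted]

records = [
    {'name': 'Kobee Factory', 'location': '91401'},
    {'name': "Uncle Tony's Pizzeria", 'location': '91606'},
    {'name': 'placeholder with blank zip', 'location': ''},
]
kept = keep_requested(records, [91401, 91316])
print(kept)
```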

Final checks for proper variable types and missing data

In [131]:
yelp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10037 entries, 0 to 24533
Data columns (total 10 columns):
id              10037 non-null object
name            10037 non-null object
alias           10037 non-null object
type            10037 non-null object
rating          10037 non-null float64
review_count    10037 non-null int64
price           10037 non-null float64
location        10037 non-null object
latitude        10037 non-null float64
longitude       10037 non-null float64
dtypes: float64(4), int64(1), object(5)
memory usage: 862.6+ KB


Some rows have an empty string for location. These aren't NaN, so dropna() didn't catch them; keeping only the observations with an actual zip code.

In [132]:
df = yelp_df[yelp_df['location']!='']

In [133]:
df.isna().sum()

id              0
name            0
alias           0
type            0
rating          0
review_count    0
price           0
location        0
latitude        0
longitude       0
dtype: int64

In [134]:
df.shape

(10019, 10)

Saving the dataframe to a CSV to then combine with everyone else's data. 

In [135]:
df.to_csv('./data/hovs_section.csv')