# Yelp Fusion API

With millions of business updates every month, Yelp Fusion delivers the most current and most accurate local data available. Choose from dozens of attributes per business, and as millions of new reviews and photos are added by active Yelp users, the Yelp data set remains unparalleled in its rich detail, freshness, and accuracy.

For our project, we pull restaurant data for New York City by zip code across the five boroughs: Manhattan, the Bronx, Brooklyn, Queens, and Staten Island. These zip codes correspond to the housing price data downloaded from the [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-annualized-sales-update.page).

The API code below was run five times, once per borough, and each borough's results were saved as a pickle file.

**Data Source**: [Yelp Fusion API](https://www.yelp.com/developers/v3/manage_app)

Note: Running this notebook requires an authorized API key from Yelp Fusion.

##### Yelp Fusion API limitations
Yelp returns at most 50 results per request and at most 1,000 results per search, with 5,000 API calls approved every 24 hours.
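As a rough check of how these limits interact with the pull described in this notebook, the sketch below computes how many requests a borough needs and what share of the daily quota that uses. The constant and function names are illustrative, and the limit values are simply the figures quoted above, which may differ from Yelp's current terms.

```python
import math

# Limits as quoted in this notebook (assumed; check Yelp's current documentation).
MAX_PER_REQUEST = 50      # results returned by a single search call
MAX_PER_SEARCH = 1000     # results reachable for one search via offset
DAILY_CALL_QUOTA = 5000   # API calls allowed per 24 hours


def requests_needed(n_zip_codes, results_per_zip):
    """Return the number of API calls needed to pull results_per_zip per zip code."""
    if results_per_zip > MAX_PER_SEARCH:
        raise ValueError("Cannot page past {} results".format(MAX_PER_SEARCH))
    pages = math.ceil(results_per_zip / MAX_PER_REQUEST)
    return n_zip_codes * pages


# Staten Island: 12 zip codes, 250 restaurants each.
calls = requests_needed(12, 250)
share = round(100 * calls / DAILY_CALL_QUOTA, 1)
print("{} calls, {}% of the daily quota".format(calls, share))
```

At 5 requests per zip code, a single borough stays well inside the daily quota.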

##### Method
We created a list of zip codes for each borough from the housing price dataset. To work within the Yelp API limits of 50 results per request and 1,000 results per search, we adapted a function written by rspiro9 in "Yelp vs Inspection Analysis for NYC" that pages through the results with the `offset` parameter. We pulled up to 250 restaurants per zip code (5 requests of 50).
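The paging idea can be sketched as follows. `build_search_params` is a hypothetical helper, not part of the original notebook or the adapted function; it only uses query parameters that appear in the code further down (`location`, `term`, `limit`, `offset`).

```python
def build_search_params(zip_code, page, page_size=50):
    """Return the query parameters for one page of a Yelp business search.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    return {
        "location": zip_code,
        "term": "Restaurants",
        "limit": page_size,
        "offset": page * page_size,  # page 0 starts at result 0, page 1 at 50, ...
    }


# 250 restaurants per zip code means 5 pages of 50.
pages = [build_search_params("10314", page) for page in range(5)]
print([p["offset"] for p in pages])
```

Each dictionary can be passed as `params` to `requests.get`, which is how the loop in this notebook sends its requests.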


### This notebook was run 5 times, once per NYC borough

In [4]:
import requests
import pandas as pd
import pickle

In [5]:
api_key = 'enter API key'
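To avoid pasting the key into the notebook, one option is to read it from an environment variable. This is a minimal sketch: `load_api_key` and the variable name `YELP_API_KEY` are assumptions for illustration, not something the original notebook or Yelp prescribes.

```python
import os


def load_api_key(env=None, name="YELP_API_KEY"):
    """Return the API key stored in the given environment mapping.

    Hypothetical helper; YELP_API_KEY is an assumed variable name.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError("Set {} before running this notebook".format(name))
    return key


# Example with an explicit mapping instead of the real environment:
example_key = load_api_key({"YELP_API_KEY": " my-secret-key "})
print(example_key)
```

In the notebook itself, `api_key = load_api_key()` would then replace the hard-coded string.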

In [6]:
# Using the yelp business search API: https://www.yelp.com/developers/documentation/v3/business_search

# headers contain the api key.
headers = {'Authorization': 'Bearer {}'.format(api_key)}
url = 'https://api.yelp.com/v3/businesses/search'

In [165]:
## List of zip codes for one NYC borough (Staten Island here). Change this list for every borough.
neighborhoods = ["10314", "10312", "10306", "10305", "10309", "10304",
                 "10308", "10301", "10310", "10303", "10307", "10302"]

In [166]:
## Create one empty list per zip code to hold the raw API responses:
nyc = [[] for _ in range(len(neighborhoods))]

In [167]:
# Loop over the zip codes and pull 5 pages of 50 results for each:
for x in range(len(neighborhoods)):
    print('---------------------------------------------')
    print('Gathering Data for {}'.format(neighborhoods[x]))
    print('---------------------------------------------')
    
    for y in range(5):
        location = neighborhoods[x]
        term = "Restaurants"
        search_limit = 50
        offset = 50 * y
        categories = "(restaurants, All)"
        sort_by = 'distance'
        url_params = {
            'location': location.replace(' ', '+'),
            'term': term,
            'limit': search_limit,
            'offset': offset,
            'categories': categories,
            'sort_by': sort_by,
        }
        
        response = requests.get(url, headers=headers, params=url_params)
        print('***** {} Restaurants #{} - #{} ....{}'.format(neighborhoods[x], 
                                                             offset+1, offset+search_limit,
                                                             response))
        nyc[x].append(response)

print(response)
print(type(response.text))
print(response.json().keys())
print(response.text[:1000])

---------------------------------------------
Gathering Data for 10314
---------------------------------------------
***** 10314 Restaurants #1 - #50 ....<Response [200]>
***** 10314 Restaurants #51 - #100 ....<Response [200]>
***** 10314 Restaurants #101 - #150 ....<Response [200]>
***** 10314 Restaurants #151 - #200 ....<Response [200]>
***** 10314 Restaurants #201 - #250 ....<Response [200]>
---------------------------------------------
Gathering Data for 10312
---------------------------------------------
***** 10312 Restaurants #1 - #50 ....<Response [200]>
***** 10312 Restaurants #51 - #100 ....<Response [200]>
***** 10312 Restaurants #101 - #150 ....<Response [200]>
***** 10312 Restaurants #151 - #200 ....<Response [200]>
***** 10312 Restaurants #201 - #250 ....<Response [200]>
---------------------------------------------
Gathering Data for 10306
---------------------------------------------
***** 10306 Restaurants #1 - #50 ....<Response [200]>
***** 10306 Restaurants #51 - #10

In [168]:
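## Check for pages that returned fewer than 50 businesses.
## Only 5 pages were pulled per zip code, so indexing past the last page
## raises IndexError and ends that zip code's check with the message printed here.
for x in range(len(neighborhoods)):
    try:
        for y in range(20):
            num = len(nyc[x][y].json()['businesses'])
            if num != 50:
                print(neighborhoods[x], y, num)
    except (IndexError, KeyError, ValueError):
        print("Invalid data. Skipping entry...")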

Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
10307 1 5
10307 2 0
10307 3 0
10307 4 0
Invalid data. Skipping entry...
Invalid data. Skipping entry...
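The check above shows that zip code 10307 returned a short page (5 results) followed by three empty pages. A possible refinement, sketched here under assumptions, is to stop paging at the first short page. `pull_pages` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the `requests.get(...).json()['businesses']` call so the example runs without network access.

```python
def pull_pages(fetch_page, page_size=50, max_pages=5):
    """Collect businesses page by page, stopping at the first short page.

    fetch_page(offset) must return a list of business dicts.
    """
    businesses = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size)
        businesses.extend(batch)
        if len(batch) < page_size:
            break  # a short page means there is nothing left to fetch
    return businesses


# Stand-in data shaped like zip code 10307: 50 results, then 5.
fake_results = [{"id": i} for i in range(55)]
offsets_requested = []


def fake_fetch(offset):
    offsets_requested.append(offset)
    return fake_results[offset:offset + 50]


restaurants = pull_pages(fake_fetch)
print(len(restaurants), "businesses in", len(offsets_requested), "requests")
```

With this stopping rule the 10307 pull would use 2 requests instead of 5, which matters when the daily call quota is tight.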


In [169]:
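## Compile the responses into one dataframe, skipping empty pages.
## As in the check above, indexing past the 5 pulled pages raises IndexError,
## which ends each zip code with the "Invalid data" message.
frames = []
for x in range(len(neighborhoods)):
    try:
        for y in range(20):
            df_temp = pd.DataFrame.from_dict(nyc[x][y].json()['businesses'])
            if not df_temp.empty:
                df_temp.loc[:, 'neighborhood'] = neighborhoods[x]
                frames.append(df_temp)
    except (IndexError, KeyError, ValueError):
        print("Invalid data. Skipping entry...")
# DataFrame.append was removed in pandas 2.0, so concatenate the pages once at the end.
df = pd.concat(frames) if frames else pd.DataFrame()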

Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...
Invalid data. Skipping entry...


#### Saving file as pickle
* The pickle module keeps track of the objects it has already serialized, so later references to the same object are not serialized again, which speeds up execution.
* It saves the dataframe in very little time.
* It works well for small objects like the dataset we pulled.


In [170]:
# Save Dataset: (data pulled 8/25/20)
with open('NYC_API/data_staten_island.pickle', 'wb') as f:
    pickle.dump(df, f)
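Reading a saved file back is the mirror image of the save step. The sketch below round-trips a small stand-in object through a temporary file so it runs anywhere; the record contents and the file name are made up for illustration, and the notebook itself pickles a pandas dataframe rather than a list.

```python
import os
import pickle
import tempfile

# Stand-in records; the notebook pickles a pandas DataFrame instead.
records = [{"id": "example-business", "neighborhood": "10314"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_example.pickle")
    with open(path, "wb") as f:
        pickle.dump(records, f)   # same call used to save df above
    with open(path, "rb") as f:
        restored = pickle.load(f)  # only unpickle files from a trusted source

print(restored == records)
```

For the borough files saved by this notebook, `pd.read_pickle` with the file path is an equivalent way to load the dataframe directly.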