# Data Collection: Yelp 

We used the [Yelp Fusion API](https://www.yelp.com/developers/documentation/v3/get_started)'s [businesses/search endpoint](https://www.yelp.com/developers/documentation/v3/business_search) to collect information on 30,293 businesses from Yelp.

**Yelp Data Dictionary**

| Name | Data Types (Pandas) | Description |
|---|---|---|
|id|object|a unique ID for each business, e.g. E8RJkjfdcwgtyoPMjQ_Olg|
|alias|object|Unique Yelp alias of the business (a URL-friendly form of its name)|
|name|object|Business name|
|image_url|object|URL of a photo for this business|
|is_closed|bool|True if the business has permanently closed, else False|
|url|object|URL of the business page on Yelp|
|review_count|int64|Number of reviews|
|categories|object|List of category title and alias pairs associated with this business|
|rating|float64|Rating for this business (one of 1, 1.5, ..., 4.5, 5)|
|coordinates|object|Coordinates of this business|
|transactions|object|List of Yelp transactions that the business is registered for, such as pickup, delivery and restaurant_reservation.|
|price|object|Price level of the business. Value is one of \\$, \\$\\$, \\$\\$\\$ and \\$\\$\\$\\$.|
|location|object|Location of this business, including address, city, state, zip code and country.|
|phone|object|Phone number of the business|
|display_phone|object|Phone number of the business formatted nicely to be displayed to users. The format is the standard phone number format for the business's country.|
|distance|float64|Distance in meters from the search location. This returns meters regardless of the locale.|

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Collection:-Yelp" data-toc-modified-id="Data-Collection:-Yelp-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Collection: Yelp</a></span><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Create-constants" data-toc-modified-id="Create-constants-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Create constants</a></span></li><li><span><a href="#Define-function-to-provide-clean-list-of-zipcodes" data-toc-modified-id="Define-function-to-provide-clean-list-of-zipcodes-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Define function to provide clean list of zipcodes</a></span></li><li><span><a href="#Define-functions-to-fetch-businesses-from-Yelp-Funsion-API" data-toc-modified-id="Define-functions-to-fetch-businesses-from-Yelp-Funsion-API-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Define functions to fetch businesses from Yelp Fusion API</a></span></li><li><span><a href="#Fetch-data" data-toc-modified-id="Fetch-data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Fetch data</a></span></li><li><span><a href="#Export-results-as-.csv-file" data-toc-modified-id="Export-results-as-.csv-file-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Export results as .csv file</a></span></li></ul></li></ul></div>

## Import libraries

In [49]:
import pandas as pd
import requests
import datetime as dt
import re
import time
from pathlib import Path

## Create constants
- Create a Yelp API key [here](https://www.yelp.com/developers/documentation/v3/authentication).
- Save the API key in a text file named `yelp_fusion.txt` inside the directory `~/api_keys` (create the directory if it does not exist).
- To authenticate API calls with the API key, set the `Authorization` HTTP header to `Bearer API_KEY`.

In [6]:
with open(Path.home() / 'api_keys/yelp_fusion.txt') as file:
    # Strip the trailing newline and any surrounding whitespace from the key
    API_KEY = file.read().strip()
HEADERS = {'Authorization': f'Bearer {API_KEY}'}
URL = 'https://api.yelp.com/v3/businesses/search'

- By default, Yelp allows [5,000 API calls per day](https://www.yelp.com/developers/faq).
- `limit`: number of business results to return per request. The default is 20 and the maximum is 50.
- `offset`: number of results to skip. Combined with `limit`, it pages through the results, but Yelp returns at most 1,000 businesses per search even when more match, and a request for a page beyond this limit returns an error. `MAX_ITEMS_PER_SEARCH` is therefore set to 1,000.

In [1]:
API_CALL_LIMIT = 5_000
RESULTS_LIMIT = 50
MAX_ITEMS_PER_SEARCH = 1_000
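
To make these limits concrete, the sketch below works out which offsets a single location search would request and how many locations fit into one day's call budget in the worst case. It repeats the three constants so that it runs on its own, and it assumes that offsets are zero-based.

```python
# Constants repeated here so the sketch is self-contained.
API_CALL_LIMIT = 5_000
RESULTS_LIMIT = 50
MAX_ITEMS_PER_SEARCH = 1_000

# Worst case: one location needs this many paged requests.
requests_per_location = MAX_ITEMS_PER_SEARCH // RESULTS_LIMIT

# Offsets are assumed to be zero-based: 0, 50, ..., 950.
offsets = [RESULTS_LIMIT * page for page in range(requests_per_location)]

# The last page must stay inside the 1,000-result window.
assert offsets[-1] + RESULTS_LIMIT <= MAX_ITEMS_PER_SEARCH

# Locations that can be searched exhaustively within one day's budget.
locations_per_day = API_CALL_LIMIT // requests_per_location

print(requests_per_location)     # 20
print(offsets[:3], offsets[-1])  # [0, 50, 100] 950
print(locations_per_day)         # 250
```

Locations with fewer than 1,000 matches need fewer requests, so 250 locations per day is a lower bound rather than a hard ceiling.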

## Define function to provide clean list of zipcodes 

In [119]:
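def zipcodes(zipcode_file):
    '''
    Reads ../data/<zipcode_file>.txt, skipping lines that contain 'PO BOX'
    Returns a set of zipcodes (the leading digits of each remaining line)
    '''
    with open(f'../data/{zipcode_file}.txt') as file:
        lines = [line.strip() for line in file if 'PO BOX' not in line]
    # Keep the leading run of digits and drop lines that do not start with a digit
    matches = (re.match(r'[0-9]+', line) for line in lines)
    return {match.group(0) for match in matches if match}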
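
As an illustration of the cleaning rules, here is a minimal sketch that applies the same idea to an in-memory list instead of a file. The helper `clean_zipcodes` and the sample lines are hypothetical; the layout of the real zipcode file may differ.

```python
import re


def clean_zipcodes(lines):
    """Return the set of leading digit runs, skipping PO BOX and blank lines."""
    cleaned = set()
    for line in lines:
        if 'PO BOX' in line:
            continue
        match = re.match(r'\d+', line.strip())
        if match:
            cleaned.add(match.group(0))
    return cleaned


# Hypothetical sample lines; the real file's layout may differ.
sample_lines = ['94103\n', '94105 (San Francisco)\n', '94126 PO BOX\n', '94103\n', '\n']
print(sorted(clean_zipcodes(sample_lines)))  # ['94103', '94105']
```

Returning a set removes duplicate zipcodes, so each location is searched only once.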

## Define functions to fetch businesses from Yelp Fusion API

In [134]:
def single_request(location, offset=0):
    '''
    Searches businesses for one given location
    Returns a tuple: 
        A boolean value: True if no errors in the request, otherwise False 
        The parsed JSON response (its 'businesses' list starts at the given offset), or the response object if there is an error
    Number of results is limited to RESULTS_LIMIT
    '''
    params = {'location': location,
              'limit': RESULTS_LIMIT,
              'offset': offset}
    response = requests.get(url=URL, params=params, headers=HEADERS)

    if response.status_code != 200:
        print(f'Request failed with {response.text}')
        return (False, response)

    data = response.json()
    if 'error' in data:
        print(f'Error occurred in response {data}')
        return (False, response)
    return (True, data)

In [140]:
def businesses_by_location(location):
    '''
    Searches businesses for one given location
    Sends out multiple requests
    Number of results is limited to MAX_ITEMS_PER_SEARCH
    Note: there might be more businesses than MAX_ITEMS_PER_SEARCH, 
    but Yelp only provides up to MAX_ITEMS_PER_SEARCH per location.
    '''
    all_results = []

    for i in range(MAX_ITEMS_PER_SEARCH // RESULTS_LIMIT):
        succeeded, data = single_request(location, offset=(RESULTS_LIMIT * i))

        if not succeeded:
            print('Error occurred in single_request, aborting!')
            return all_results

        businesses = data['businesses']
        all_results.extend(businesses)

        # A page with fewer results than RESULTS_LIMIT means the results are exhausted.
        if len(businesses) < RESULTS_LIMIT:
            break
        # A full page means there may be more results to fetch;
        # pause to slow down the frequency of requests before the next iteration.
        time.sleep(.5)

    print(f'Searched: {location}')
    print(f'    Total available search results: {data["total"]}')
    print(f'    Total amount of businesses collected: {len(all_results)}')
    return all_results
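
The stopping rule of the pagination loop can be checked without calling the API. The sketch below replaces the network call with a hypothetical `fake_request` helper that returns placeholder IDs, and counts how many calls are needed for different totals.

```python
RESULTS_LIMIT = 50
MAX_ITEMS_PER_SEARCH = 1_000


def fake_request(total_available, offset):
    """Hypothetical stand-in for single_request(): returns up to RESULTS_LIMIT fake IDs."""
    end = min(offset + RESULTS_LIMIT, total_available)
    return list(range(offset, end))


def collect(total_available):
    """Page through results until a short page or the 1,000-item cap is reached."""
    collected, calls = [], 0
    for page in range(MAX_ITEMS_PER_SEARCH // RESULTS_LIMIT):
        businesses = fake_request(total_available, RESULTS_LIMIT * page)
        calls += 1
        collected.extend(businesses)
        if len(businesses) < RESULTS_LIMIT:
            break  # short page: nothing left to fetch
    return len(collected), calls


print(collect(120))    # (120, 3)
print(collect(100))    # (100, 3)
print(collect(5_500))  # (1000, 20)
```

When the total is an exact multiple of `RESULTS_LIMIT`, the loop spends one extra call on an empty page before it stops. Comparing the running count with the reported `total` would avoid that call, at the cost of relying on that field.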

In [143]:
def businesses_by_multiple_locations(locations):
    '''
    Searches businesses for each location in a collection of locations
    Returns a Pandas dataframe of unique businesses 
    '''
    unique_business_dict = dict()
    
    # Collects business by ID to ensure results are unique.
    for location in locations:
        businesses = businesses_by_location(location)
        for business in businesses:
            unique_business_dict[business['id']] = business
            
    print('Query completed at: ' + dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    return pd.DataFrame(unique_business_dict.values())
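
Neighboring zipcodes can return the same business, which is why the results are keyed by ID. A minimal sketch of that deduplication, using made-up records for two overlapping searches:

```python
# Hypothetical results from two overlapping location searches.
search_a = [{'id': 'b1', 'name': 'Cafe One'}, {'id': 'b2', 'name': 'Taqueria Two'}]
search_b = [{'id': 'b2', 'name': 'Taqueria Two'}, {'id': 'b3', 'name': 'Bakery Three'}]

unique_by_id = {}
for businesses in (search_a, search_b):
    for business in businesses:
        # A repeated ID overwrites the earlier record instead of adding a new one.
        unique_by_id[business['id']] = business

print(len(search_a) + len(search_b))  # 4
print(sorted(unique_by_id))           # ['b1', 'b2', 'b3']
```

Because later records overwrite earlier ones, per-search fields such as `distance` reflect whichever search returned the business last.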

## Fetch data

In [120]:
# Use a new name so the zipcodes() function is not overwritten and this cell can be re-run
zipcode_set = zipcodes('zipcodes')

In [127]:
results = businesses_by_multiple_locations(zipcode_set)

Total Avaible Search Result: 737
Query Completion Time and Date: 2020-02-08 00:04:53
Total Avaible Search Result: 316
Query Completion Time and Date: 2020-02-08 00:05:02
Total Avaible Search Result: 198
Query Completion Time and Date: 2020-02-08 00:05:07
Total Avaible Search Result: 9
Query Completion Time and Date: 2020-02-08 00:05:07
Total Avaible Search Result: 10
Query Completion Time and Date: 2020-02-08 00:05:08
Total Avaible Search Result: 854
Query Completion Time and Date: 2020-02-08 00:05:32
Total Avaible Search Result: 964
Query Completion Time and Date: 2020-02-08 00:05:58
Total Avaible Search Result: 764
Query Completion Time and Date: 2020-02-08 00:06:20
Total Avaible Search Result: 1500
Query Completion Time and Date: 2020-02-08 00:06:51
Total Avaible Search Result: 5500
Query Completion Time and Date: 2020-02-08 00:07:22
Total Avaible Search Result: 312
Query Completion Time and Date: 2020-02-08 00:07:31
Total Avaible Search Result: 2200
Query Completion Time and Date: 

Total Avaible Search Result: 6200
Query Completion Time and Date: 2020-02-08 00:32:11
Total Avaible Search Result: 34
Query Completion Time and Date: 2020-02-08 00:32:12
Total Avaible Search Result: 2500
Query Completion Time and Date: 2020-02-08 00:32:43
Total Avaible Search Result: 224
Query Completion Time and Date: 2020-02-08 00:32:49
Total Avaible Search Result: 55
Query Completion Time and Date: 2020-02-08 00:32:51
Total Avaible Search Result: 528
Query Completion Time and Date: 2020-02-08 00:33:06
Total Avaible Search Result: 733
Query Completion Time and Date: 2020-02-08 00:33:26
Total Avaible Search Result: 2400
Query Completion Time and Date: 2020-02-08 00:33:56
Total Avaible Search Result: 1800
Query Completion Time and Date: 2020-02-08 00:34:25
Total Avaible Search Result: 2200
Query Completion Time and Date: 2020-02-08 00:34:55
Total Avaible Search Result: 1100
Query Completion Time and Date: 2020-02-08 00:35:23
Total Avaible Search Result: 19
Query Completion Time and Dat

Total Avaible Search Result: 982
Query Completion Time and Date: 2020-02-08 01:03:39
Total Avaible Search Result: 210
Query Completion Time and Date: 2020-02-08 01:03:45
Total Avaible Search Result: 453
Query Completion Time and Date: 2020-02-08 01:03:59
Total Avaible Search Result: 1300
Query Completion Time and Date: 2020-02-08 01:04:29
Total Avaible Search Result: 4900
Query Completion Time and Date: 2020-02-08 01:05:01
Total Avaible Search Result: 20
Query Completion Time and Date: 2020-02-08 01:05:02
Total Avaible Search Result: 1300
Query Completion Time and Date: 2020-02-08 01:05:31
Total Avaible Search Result: 6300
Query Completion Time and Date: 2020-02-08 01:06:04
Total Avaible Search Result: 3200
Query Completion Time and Date: 2020-02-08 01:06:32
Total Avaible Search Result: 1500
Query Completion Time and Date: 2020-02-08 01:07:02
Total Avaible Search Result: 999
Query Completion Time and Date: 2020-02-08 01:07:30
Total Avaible Search Result: 20
Query Completion Time and Da

## Export results as .csv file

In [128]:
# results.to_csv('../data/raw.csv')

In [129]:
results.shape

(30293, 16)