# **Collecting Business Data from the Yelp API**


## Import Packages

First, we import the packages used in this notebook.

These packages let us:
- Read and write .csv and .json files
- Explore the data and generate statistics
- Send requests to the Yelp API

In [2]:
# Accessing and saving stored data
import csv
import json

# Data exploration and statistics
import pandas as pd
import numpy as np

# Accessing Yelp API for data
import requests

# Opening secret folder for Yelp API key
with open(r'C:\Users\bmcca\.secret\yelp_api.json') as f:
    keys = json.load(f)

client_id = keys['id']
yelp_key = keys['key']

# Business Data

## Request

First, we need to create a function to request data from the Yelp API. 

To retrieve every result, we include the "offset" parameter. The API returns at most fifty results per request, so a single request only covers the first fifty.

The "offset" parameter sets the index of the first result to return. The first request returns results 0-49; setting the offset to 50 returns the next set, 50-99, and so on.
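
The offset arithmetic described above can be sketched in a few lines. `page_offsets` is a hypothetical helper, not part of the notebook or of the Yelp API; it only lists the offsets that a series of fifty-result requests would need.

```python
def page_offsets(total, limit=50):
    """Return the offset to send with each request so every result is covered."""
    # Step through the result indices one page at a time
    return list(range(0, total, limit))

# 120 results need three requests: 0-49, 50-99, 100-119
offsets = page_offsets(120)
print(offsets)  # [0, 50, 100]
```

A total that is an exact multiple of the limit produces no extra, empty page: `page_offsets(100)` gives `[0, 50]`.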

In [3]:
def yelp_request_offset(term, location, yelp_key, offset=0, verbose=False):
    '''Adapted from Yelp API Lab: https://github.com/BenJMcCarty/dsc-yelp-api-lab/tree/solution'''
    
    url = 'https://api.yelp.com/v3/businesses/search'

    headers = {
            'Authorization': 'Bearer {}'.format(yelp_key),
        }

    # requests URL-encodes parameter values itself, so pass them unmodified
    # (a manual '+' would be sent as a literal plus sign)
    url_params = {
        'term': term,
        'location': location,
        'limit': 50,
        'offset': offset
    }
    
    response = requests.get(url, headers=headers, params=url_params)
    
    if verbose:
        print(response)
        print(type(response.text))
        print(response.text[:1000])
        
    return response.json()

## Parse

When we request the data, we will get a LOT of information, more than we will need for our analyses and insights.

We will create a function to loop through each result and save specific parts of the information. Once this information is pulled, it will be returned to us as a dataframe, which we can use for cleaning and feature engineering afterwards.
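
As a rough illustration of the parsing step, the sketch below flattens one record. `sample_business` is an invented record that only mimics the nested shape used in this notebook, and `flatten_business` is a hypothetical helper that keeps fewer fields than the real function.

```python
# Hypothetical record shaped like one entry in the "businesses" list;
# every value here is invented for illustration.
sample_business = {
    'id': 'abc123',
    'name': 'Example Cellars',
    'categories': [{'alias': 'wineries', 'title': 'Wineries'}],
    'rating': 4.5,
    'review_count': 10,
    'coordinates': {'latitude': 32.7, 'longitude': -117.1},
    'location': {'display_address': ['1 Main St', 'San Diego, CA 92101']},
}

def flatten_business(business):
    """Keep a handful of fields and flatten the nested ones."""
    return {
        'name': business['name'],
        # The address arrives as a list of lines; join it into one string
        'location': ' '.join(business['location']['display_address']),
        # Only the first listed category is kept
        'alias': business['categories'][0]['alias'],
        'rating': business['rating'],
        # "price" is absent from some records, so fall back to None
        'price': business.get('price'),
        'latitude': business['coordinates']['latitude'],
    }

row = flatten_business(sample_business)
print(row['location'])  # 1 Main St San Diego, CA 92101
print(row['price'])     # None
```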

In [4]:
def parse_data(list_of_data):
    '''Adapted from Tyrell's code'''  

    # Create empty list to store results
    
    parsed_data = []
    
    # Loop through each business in the list of businesses
    # Add specific k:v pairs to a dictionary
    
    for business in list_of_data:
        # Some businesses have no "price" key; fill it with NaN so that
        # every row has the same columns
        if 'price' not in business:
            business['price'] = np.nan
            
        details = {'name': business['name'],
                     'location': ' '.join(business['location']['display_address']),
                     'Business ID': business['id'],
                     'alias': business['categories'][0]['alias'],
                     'title': business['categories'][0]['title'],
                     'rating': business['rating'],
                     'review_count': business['review_count'],
                     'price': business['price'],
                     'latitude': business['coordinates']['latitude'],
                     'longitude': business['coordinates']['longitude']
                    }
        # Add the new dictionary to the previous list
        
        parsed_data.append(details)
    
    # Create a DataFrame from the resulting list
    
    df_parsed_data = pd.DataFrame(parsed_data)

    
    return df_parsed_data

## Collect All Data 

Now that we have created functions to request the data and to keep only the most relevant parts, we will create a function that pulls all of the data, parses it, and saves the results to a .csv file for storage.
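
One thing to watch when appending page after page to a single file is that the header row can end up repeated once per page. The sketch below shows one way to write it only once. It swaps in the standard-library `csv` module rather than pandas, and `append_page` and the sample rows are hypothetical.

```python
import csv
import os
import tempfile

def append_page(file_name, rows, fieldnames):
    """Append rows to a .csv, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)

# Two invented "pages" of parsed results
pages = [
    [{'name': 'A', 'rating': 4.0}, {'name': 'B', 'rating': 3.5}],
    [{'name': 'C', 'rating': 5.0}],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')
    for page in pages:
        append_page(path, page, fieldnames=['name', 'rating'])
    # Read the file back to confirm a single header and three data rows
    with open(path, newline='') as f:
        saved_rows = list(csv.DictReader(f))

print(len(saved_rows))  # 3
```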

In [5]:
def get_full_data(term, location, yelp_key):
    '''Requests all results from Yelp API and 
    saves as a .csv; and returns a DataFrame.'''
    
    # Build the file name for the results
    file_name = 'data/wineries_' + location +'.csv'
    
    # Create (or overwrite) the .csv so each run starts from a fresh file
    blank_df = pd.DataFrame()
    blank_df.to_csv(file_name)
    
    # Process first request to Yelp API and calculate number of pages 
    results = yelp_request_offset(term, location, yelp_key)
    
    # Ceiling division: a partial final page still counts as a page
    num_pages = -(-results['total'] // 50)
    
    # Print out confirmation feedback
    print(f'For {term} and {location}: ')
    print(f"    Total number of results: {results['total']}.")
    print(f'    Total number of pages: {num_pages}.')
    
    # Create offset for additional results
    offset = 0

    # Keep each parsed page so the full set can be returned
    all_pages = []

    # Retrieve every page, starting again from the first
    for num in range(num_pages):
        try:
            # Process API request
            results = yelp_request_offset(term, location, yelp_key,
                                          offset=offset)
            
            # From results, take values from "businesses" key and parse
            parsed_results = parse_data(results['businesses'])

            # Add new column to identify in which region 
            # this business is located.
            parsed_results['City'] = location
          
            # Append the page to the .csv, writing the header row only once
            parsed_results.to_csv(file_name, mode='a', index = False,
                                  header = (num == 0))
            all_pages.append(parsed_results)
            
            # Increase offset to move to next "page" of data
            offset += 50
            
        except Exception as error:
            # If error, print where the error happens and stop;
            # pages retrieved so far are already saved in the .csv
            print(f'Error on page {num}: {error}')
            break

    # Combine all retrieved pages into one DataFrame
    if all_pages:
        return pd.concat(all_pages, ignore_index=True)
    return pd.DataFrame()

In [6]:
# get_full_data('winery', 'San_Diego', yelp_key)

## Cleaning Data and Feature Engineering

At this point, we have successfully pulled our data and saved it. Now, we need to make sure that we select only those businesses that are wineries (not distributors or venues). We will start by counting how many businesses have each alias, or type of business.
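
On a toy table, the counting step might look like the sketch below. The rows are invented and only the `alias` column name comes from this notebook.

```python
import pandas as pd

# Invented aliases standing in for the real Yelp data
toy = pd.DataFrame({'alias': ['wineries', 'wineries', 'winetastingrooms',
                              'venues', 'wineries', 'winetastingrooms']})

# value_counts() sorts from most to least frequent
counts = toy['alias'].value_counts()
print(counts.index[:2].tolist())  # ['wineries', 'winetastingrooms']
```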

In [7]:
def identify_top_aliases(raw_data = None):
    '''- Requires user to specify an existing .csv file
    - Takes raw business data from the Yelp API and prints the top three
    aliases.
    '''

    # Read in businesses
    df1 = pd.read_csv(raw_data, header = 1)

    alias_index = df1['alias'].value_counts()[:3].index
    print(alias_index)

In [8]:
# identify_top_aliases(raw_data = 'data/wineries_San_Diego.csv')

Based on the output of our function, we identified the three most common aliases. Since we only want "wineries" and "winetastingrooms," we'll create a new function to filter out the unwanted results.
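
As a small sketch of the filtering idea, the example below keeps only rows whose alias is in a chosen list. The rows are invented, and naming the two aliases explicitly (instead of taking the two most frequent) is an assumption made here for illustration.

```python
import pandas as pd

# Invented rows; only the 'alias' column name comes from the notebook
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'alias': ['wineries', 'venues', 'winetastingrooms', 'beer_and_wine'],
})

wanted = ['wineries', 'winetastingrooms']

# isin() builds a boolean mask; reset_index() renumbers the kept rows
filtered = toy[toy['alias'].isin(wanted)].reset_index(drop=True)
print(filtered['name'].tolist())  # ['A', 'C']
```

Listing the aliases by name keeps the filter stable if the frequency ranking differs in another region, at the cost of being less general.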

In [9]:
def top_two_aliases(raw_data = None, location_name = ''):
    '''- Requires an existing .csv file and name of location.
    - Takes raw business data from the Yelp API and filters for the top two
    aliases (focusing on "wineries" and "winetastingrooms").
    '''

    # Read in businesses
    df1 = pd.read_csv(raw_data, header = 1)

    alias_index = df1['alias'].value_counts()[:2].index
    print(alias_index)
    
    # Filtering rows based on condition, then resetting the index
    # (reassigning avoids modifying a slice of df1 in place)
    df2 = df1[df1['alias'].isin(alias_index)].reset_index(drop=True)
    
    # Save results
    new_file_name = 'data/wineries_' + location_name + '_cleaned.csv'
    df2.to_csv(new_file_name, index = False)
       
    print(f"Saved to '{new_file_name}'.")
    
    return df2

In [10]:
# top_two_aliases(raw_data = 'data/wineries_San_Diego.csv', 
#                 location_name = 'San_Diego')

Now, we want to review the data to check for any missing or null values in our 'price' column. These missing/null values can and will cause issues during our analysis phase, so we will fix them now.

In [11]:
# Import data to check
df1 = pd.read_csv('data/wineries_San_Diego_cleaned.csv')

# Check for null values
df1.isna().sum()

name             0
location         0
Business ID      0
alias            0
title            0
rating           0
review_count     0
price           25
latitude         0
longitude        0
City             0
dtype: int64

In [12]:
# # Exploring the price column null values
# df1['price'].value_counts()

Our review of the "price" column indicates that the most common value is two dollar signs, so we will correct the null values by inserting two dollar signs.
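
A minimal sketch of that fill on invented data is shown below. It looks the most common value up with `mode()` instead of hard-coding it.

```python
import numpy as np
import pandas as pd

# Invented price labels with two missing entries
prices = pd.Series(['$$', '$', np.nan, '$$', '$$$', np.nan], name='price')

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = prices.mode()[0]
filled = prices.fillna(most_common)

print(most_common)          # $$
print(filled.isna().sum())  # 0
```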

In [13]:
# # Fill the "price" column with "$$", its most common value (the mode)
# df1['price'] = df1['price'].fillna('$$')

# ## Confirm all NaN are fixed
# df1.isna().sum()

Finally, we need to convert the details from our "price" column into more usable data. Currently, if we try to run any numeric analysis on the column, the "$" strings will raise an error.

To fix this, we will create another function to convert the prices to an integer value and save them in a new column. 
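
The conversion can be sketched on a toy Series before it is wrapped in a function. The values and the `price_map` name are illustrative.

```python
import pandas as pd

prices = pd.Series(['$', '$$', '$$$', '$$$$'])

# Each label maps to the number of dollar signs it contains
price_map = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}
converted = prices.map(price_map)
print(converted.tolist())  # [1, 2, 3, 4]

# Labels missing from the dictionary become NaN rather than raising an error
unmapped = pd.Series(['$$', 'n/a']).map(price_map)
print(unmapped.isna().tolist())  # [False, True]
```

Because unmapped labels silently become NaN, checking `isna().sum()` after the conversion is a cheap way to catch unexpected values.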

In [14]:
def convert_price(dataframe, location_name):
    ''' - Requires a dataframe with the 'price' column elements being NaN, $, $$, $$$, $$$$, or $$$$$.
    - Takes a pre-existing dataframe and adds a column to store the conversion from $ to an integer.
    - Saves results to new .csv and includes the location name in the file name.'''
    
    # Converting $s to integers, then saving to new column.
    dataframe['price_converted'] = dataframe.loc[:,'price'] \
    .map({np.nan:0, '$':1, '$$':2, '$$$':3, '$$$$':4, '$$$$$':5})
    
    # Saves results to new file
    new_file_name = 'data/wineries_' + location_name + '_price_converted.csv'
    dataframe.to_csv(new_file_name,index = False)
    
    print(f"Saved to '{new_file_name}.")
    
    return dataframe

In [15]:
# convert_price(df1, 'San_Diego')

Great! **We have finished our process: we called the Yelp API for our business data, parsed the results and saved the relevant details, and cleaned the resulting data.**

Now we are able to use this processed data for analysis in another notebook.

# Generating Data for Napa Valley

Now that we have the code for San Diego, we will reuse the functions for the Napa Valley region. Then, we will use the resulting data in our data exploration and visualizations.

In [16]:
# get_full_data('winery', 'Napa_Valley', yelp_key)

In [17]:
# identify_top_aliases('data/wineries_Napa_Valley.csv')

In [18]:
# top_two_aliases('data/wineries_Napa_Valley.csv', "Napa Valley")

In [19]:
# Import data to check
df2 = pd.read_csv('data/wineries_Napa Valley_cleaned.csv')

# Check for null values
df2.isna().sum()

name              0
location          0
Business ID       0
alias             0
title             0
rating            0
review_count      0
price           166
latitude          0
longitude         0
City              0
dtype: int64

In [20]:
# Exploring the price column null values
df2['price'].value_counts()

$$      156
$$$      61
$$$$     16
$         4
Name: price, dtype: int64

In [21]:
# Fill the "price" column with "$$", the most common value in the data set.
df2['price'] = df2['price'].fillna('$$')

## Confirm all NaN are fixed
df2.isna().sum()

name            0
location        0
Business ID     0
alias           0
title           0
rating          0
review_count    0
price           0
latitude        0
longitude       0
City            0
dtype: int64

In [22]:
convert_price(df2, 'Napa Valley')

Saved to 'data/wineries_Napa Valley_price_converted.csv.


Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,City,price_converted
0,Hendry Vineyard and Winery,"3104 Redwood Rd Napa, CA 94558",mO8n3zTLoFhlmcfQr7X_TQ,wineries,Wineries,5.0,658,$$,38.321680,-122.344810,Napa_Valley,2
1,Domaine Carneros,"1240 Duhig Rd Napa, CA 94559",8eGTOeEQpUpYb89ISug3ag,wineries,Wineries,4.0,2239,$$,38.255534,-122.351391,Napa_Valley,2
2,Paraduxx Winery,"7257 Silverado Trl Napa, CA 94558",cBFZALrZbLV5XBsiPcgknQ,wineries,Wineries,4.5,373,$$,38.435480,-122.351430,Napa_Valley,2
3,Jarvis Winery,"2970 Monticello Rd Napa, CA 94558",NPkAqW68Og5eBofEpPiRXQ,wineries,Wineries,4.5,209,$$$,38.357010,-122.213620,Napa_Valley,3
4,Cuvaison Estate Wines,"1221 Duhig Rd Napa, CA 94559",rjiMUH4UecBVD3wkqhgxXw,wineries,Wineries,4.0,327,$$,38.251176,-122.347084,Napa_Valley,2
...,...,...,...,...,...,...,...,...,...,...,...,...
398,Andretti Winery,"1625 Trancas St Ste 3017 Napa, CA 94558",NKCMqIlRopcSMA15JpeyJg,wineries,Wineries,3.5,311,$$,38.321516,-122.304108,Napa_Valley,2
399,Lionstone International,"21481 8th St E Sonoma, CA 95476",pW9QPUkm2_tTXLCzyQ6qvg,wineries,Wineries,1.0,1,$$,38.262062,-122.442036,Napa_Valley,2
400,Napa Vinyards,"Napa, CA 94558",UwgQWRkTzlFnw3-QYCaBlQ,wineries,Wineries,1.0,1,$$,38.383260,-122.313060,Napa_Valley,2
401,Cook Vinyard Management,"19626 Eighth St E Sonoma, CA 95476",LxMkyxBokxu6iRIsuMF5Tw,wineries,Wineries,1.0,1,$$,38.286261,-122.434893,Napa_Valley,2


In [23]:
# Check for null values
df2.isna().sum()

name               0
location           0
Business ID        0
alias              0
title              0
rating             0
review_count       0
price              0
latitude           0
longitude          0
City               0
price_converted    0
dtype: int64