# **Collecting Data and Reviews from Yelp API**


## Import Packages

First, we import the packages used throughout the notebook.

These packages let us:
- access and save our stored data
- explore the data and generate statistics
- request data from the Yelp API

In [16]:
# Accessing and saving stored data
import csv
import json

# Data exploration and statistics
import pandas as pd
import numpy as np

# Accessing Yelp API for data
import requests

# Opening secret folder for Yelp API key
with open(r'C:\Users\bmcca\.secret\yelp_api.json') as f:
    keys = json.load(f)

client_id = keys['id']
yelp_key = keys['key']

# Business Data

## Request

We need to create a function to request data from the Yelp API. 

To retrieve every result, we include the "offset" parameter. The API returns at most fifty results per request, so a single request only covers the first fifty.

The "offset" parameter shifts which results come back: the first request returns results 0-49, then an offset of 50 returns results 50-99, and so on.
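
To make the paging arithmetic concrete, here is a minimal sketch of how the offsets for a given result count could be generated. The helper name `page_offsets` and its `page_size` parameter are illustrative assumptions and not part of the Yelp API.

```python
def page_offsets(total, page_size=50):
    """Return the offset for each page needed to cover `total` results.

    Hypothetical helper for illustration; not part of the Yelp API.
    """
    return list(range(0, total, page_size))


# 262 results at 50 per page need six requests
print(page_offsets(262))  # [0, 50, 100, 150, 200, 250]
```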

In [17]:
def yelp_request_offset(term, location, yelp_key, offset=0, verbose=False):
    '''Adapted from Yelp API Lab: https://github.com/BenJMcCarty/dsc-yelp-api-lab/tree/solution'''
    
    url = 'https://api.yelp.com/v3/businesses/search'

    headers = {
            'Authorization': 'Bearer {}'.format(yelp_key),
        }

    # requests URL-encodes the parameters itself, so pass the raw strings
    # (a manual '+' would be sent as a literal plus sign)
    url_params = {
                    'term': term,
                    'location': location,
                    'limit': 50,
                    'offset': offset
                        }
    
    response = requests.get(url, headers=headers, params=url_params)
    
    if verbose:
        print(response)
        print(type(response.text))
        print(response.text[:1000])
        
    return response.json()

## Parse

When we request the data, we will get a LOT of information, more than we will need for our analyses and insights.

We will create a function to loop through each result and save specific parts of the information. Once this information is pulled, it will be returned to us as a dataframe, which we can use for cleaning and feature engineering afterwards.
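
As a rough illustration of the parsing step, the sketch below flattens a single record while tolerating missing optional fields. The helper `parse_business` and the sample record are made up for this example, and the assumption that "categories" can come back empty is ours rather than something the notebook confirms.

```python
import math


def parse_business(business):
    """Flatten one business record, tolerating missing optional fields.

    Illustrative helper; the field names mirror those used in this notebook.
    """
    # Fall back to placeholders when a nested field is absent or empty
    categories = business.get('categories') or [{}]
    coordinates = business.get('coordinates') or {}
    return {
        'name': business.get('name'),
        'Business ID': business.get('id'),
        'alias': categories[0].get('alias'),
        'price': business.get('price', math.nan),
        'latitude': coordinates.get('latitude'),
        'longitude': coordinates.get('longitude'),
    }


# Made-up record with no "price" key and an empty category list
sample = {'id': 'abc123', 'name': 'Example Winery', 'categories': [],
          'coordinates': {'latitude': 32.7, 'longitude': -117.1}}
parsed = parse_business(sample)
print(parsed)
```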

In [18]:
def parse_data(list_of_data):
    '''Adapted from Tyrell's code'''  

    # Create empty list to store results
    
    parsed_data = []
    
    # Loop through each business in the list of businesses
    # Add specific k:v pairs to a dictionary
    
    for business in list_of_data:
        # If the "price" key is missing from the business dict,
        # fill it with NaN so every row has a price entry
        if 'price' not in business:
            business['price'] = np.nan
            
        details = {'name': business['name'],
                     'location': ' '.join(business['location']['display_address']),
                     'Business ID': business['id'],
                     'alias': business['categories'][0]['alias'],
                     'title': business['categories'][0]['title'],
                     'rating': business['rating'],
                     'review_count': business['review_count'],
                     'price': business['price'],
                     'latitude': business['coordinates']['latitude'],
                     'longitude': business['coordinates']['longitude']
                    }
        # Add the new dictionary to the previous list
        
        parsed_data.append(details)
    
    # Create a DataFrame from the resulting list
    
    df_parsed_data = pd.DataFrame(parsed_data)

    
    return df_parsed_data

## Collect All Data 

Now that we created our functions to request the data and to filter the data for the most relevant parts, we will create a function to pull all of the data, filter it, then save the results to a .csv file for storage.
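
When results are appended to a .csv one page at a time, the header row usually only needs to be written once. The following is a minimal sketch of that pattern using the standard-library `csv` module rather than pandas; `append_rows`, the file name, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def append_rows(file_name, rows, fieldnames):
    """Append dict rows to a .csv, writing the header only for a new file."""
    is_new = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Two made-up "pages" of results appended to a temporary file
path = os.path.join(tempfile.mkdtemp(), 'wineries_example.csv')
fields = ['name', 'rating']
append_rows(path, [{'name': 'Winery A', 'rating': 4.5}], fields)
append_rows(path, [{'name': 'Winery B', 'rating': 4.0}], fields)

# Read the file back to confirm there is a single header row
with open(path, newline='', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```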

In [19]:
def get_full_data(term, location, yelp_key):
    '''Requests all results from Yelp API and 
    saves as a .csv; and returns a DataFrame.'''
    
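    # Create (or overwrite) the .csv that will store the results
    file_name = 'data/wineries_' + location +'.csv'
    
    # Write an empty DataFrame to start from a clean file; this leaves a
    # placeholder first line, so later cells read the file with header=1
    blank_df = pd.DataFrame()
    blank_df.to_csv(file_name)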
    
    # Process first request to Yelp API and calculate number of pages
    # (ceiling division, so an exact multiple of 50 adds no empty page)
    results = yelp_request_offset(term, location, yelp_key)
    
    num_pages = -(-results['total'] // 50)
    
    # Print out confirmation feedback
    print(f'For {term} and {location}: ')
    print(f"    Total number of results: {results['total']}.")
    print(f'    Total number of pages: {num_pages}.')
    
    # Request each page in turn, starting from the first (offset 0)
    parsed_results = pd.DataFrame()

    for num in range(num_pages):
        # Move to the next "page" of data
        offset = num * 50
        try:
            # Process API request
            results = yelp_request_offset(term, location, yelp_key,
                                          offset=offset)
            
            # From results, take values from "businesses" key and parse
            parsed_results = parse_data(results['businesses'])

            # Add new column to identify in which region 
            # this business is located.
            parsed_results['City'] = location
          
            # Append resulting dataframe to the .csv created above
            parsed_results.to_csv(file_name, mode='a', index = False)
            
        except Exception as err:
            # If error, print where the error happens and move on;
            # pages saved so far are already in the .csv
            print(f'Error on page {num}: {err!r}')

    # Return the most recently parsed page
    return parsed_results

In [20]:
get_full_data('winery', 'San Diego', yelp_key)

For winery and San Diego: 
    Total number of results: 262.
    Total number of pages: 6.


Unnamed: 0,City


## Cleaning Data and Feature Engineering

At this point, we have successfully pulled our data and saved it. Now, we need to make sure that we select only those businesses that are wineries (not distributors or venues). We will start by counting how many businesses carry each alias, or type of business.

In [22]:
def identify_top_aliases(raw_data = None):
    '''- Requires user to specify an existing .csv file
    - Takes raw business data from the Yelp API and identifies the top three
    aliases.
    '''

    # Read in businesses
    df1 = pd.read_csv(raw_data, header = 1)

    alias_index = df1['alias'].value_counts()[:3].index
    print(alias_index)

In [23]:
identify_top_aliases(raw_data = 'data/wineries_San Diego.csv')

Index(['wineries', 'winetastingroom', 'beer_and_wine'], dtype='object')


Based on our function's output, we identified the three most common aliases. Since we only want "wineries" and "winetastingroom," we'll create a new function to filter out the unwanted results.
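
The counting and filtering idea can be sketched in plain Python as follows. The sample rows are made up, and filtering against an explicit set of wanted aliases (instead of taking the top two by count) is a slightly more defensive variant we are assuming here, not the approach used in the notebook's function.

```python
from collections import Counter

# Made-up rows standing in for the parsed business data
businesses = [
    {'name': 'A', 'alias': 'wineries'},
    {'name': 'B', 'alias': 'winetastingroom'},
    {'name': 'C', 'alias': 'beer_and_wine'},
    {'name': 'D', 'alias': 'wineries'},
    {'name': 'E', 'alias': 'wineries'},
    {'name': 'F', 'alias': 'winetastingroom'},
]

# Count businesses per alias, most common first
alias_counts = Counter(b['alias'] for b in businesses)
top_three = alias_counts.most_common(3)
print(top_three)

# Keep only the aliases we explicitly want
wanted = {'wineries', 'winetastingroom'}
filtered = [b for b in businesses if b['alias'] in wanted]
kept_names = [b['name'] for b in filtered]
print(kept_names)
```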

In [34]:
def top_two_aliases(raw_data = None, location_name = ''):
    '''- Requires an existing .csv file and name of location.
    - Takes raw business data from the Yelp API and filters for the top two
    aliases (focusing on "wineries" and "winetastingroom").
    '''

    # Read in businesses
    df1 = pd.read_csv(raw_data, header = 1)

    alias_index = df1['alias'].value_counts()[:2].index
    print(alias_index)
    
    # Filtering rows based on condition, then resetting the index
    # (done in one step to avoid modifying a slice of df1 in place)
    df2 = df1[df1['alias'].isin(alias_index)].reset_index(drop=True)
    
    # Save results
    new_file_name = 'data/wineries_' + location_name + '_cleaned.csv'
    df2.to_csv(new_file_name, index = False)
       
    print(f"Saved to ''{new_file_name}''.")
    
    return df2

In [33]:
top_two_aliases(raw_data = 'data/wineries_San Diego.csv', 
                location_name = 'San Diego')

Index(['wineries', 'winetastingroom'], dtype='object')
Saved to data/wineries_San Diego_cleaned.csv.


Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,City
0,Bernardo Winery,"13330 Paseo Del Verano Norte San Diego, CA 92128",DknnpiG1p4OoM1maFshzXA,winetastingroom,Wine Tasting Room,4.5,626,$$,33.0328,-117.04646,San Diego
1,Callaway Vineyard & Winery,"517 4th Ave Ste 101 San Diego, CA 92101",Cn2_bpTngghYW1ej4zreZg,winetastingroom,Wine Tasting Room,5.0,100,$$,32.7107506117294,-117.160917759246,San Diego
2,San Pasqual Winery - Seaport Village,"805 W Harbor Dr San Diego, CA 92101",gMW1RvyLu90RSQAY9UrIHw,winetastingroom,Wine Tasting Room,4.5,138,$$,32.7087316452387,-117.168194991742,San Diego
3,Négociant Winery,"2419 El Cajon Blvd San Diego, CA 92104",Cc1sQWRWgGyMCjzX2mmMQQ,winetastingroom,Wine Tasting Room,4.5,103,$$,32.75488,-117.13828,San Diego
4,Domaine Artefact Vineyard & Winery,"15404 Highland Valley Rd Escondido, CA 92025",WqVbxY77Ag96X90LultCUw,wineries,Wineries,5.0,96,$$,33.06817,-117.0016,San Diego
...,...,...,...,...,...,...,...,...,...,...,...
77,Roll OutThe Barrell Charity Event by Meritage,"162 S Rancho Santa Fe Rd Encinitas, CA 92024",wyLm9fIoamN-VALcu3nUVg,wineries,Wineries,4.0,1,,33.037121,-117.238654,San Diego
78,Licores Kentucky,Calle Puerto y 3ra S/N Col. Centro 22000 Tijua...,B7gID-M2EsdpthrTcwTNYA,wineries,Wineries,5.0,1,,32.534236,-117.034976,San Diego
79,Barrica 9,Av. Revolución 1265 Col. Zona Centro 22000 Tij...,HxTqmzT4G43iAKXrB3pqQg,winetastingroom,Wine Tasting Room,4.5,7,$$,32.53043,-117.0365,San Diego
80,"RL Liquid Assets, Inc","5909 Sea Lion Pl Ste G Carlsbad, CA 92010",-STecUUsS69EMSE7PxwPwA,wineries,Wineries,3.0,2,,33.134743,-117.248093,San Diego


Now, we want to review the data to check for any missing or null values in our 'price' column. These missing/null values can and will cause issues during our analysis phase, so we will fix them now.

In [62]:
# Import data to check
df1 = pd.read_csv('data/wineries_San Diego_cleaned.csv')

# Check for null values
df1.isna().sum()

name             0
location         0
Business ID      0
alias            0
title            0
rating           0
review_count     0
price           25
latitude         0
longitude        0
City             0
dtype: int64

In [64]:
# Exploring the price column null values
df1['price'].value_counts()

$$     53
$       3
$$$     1
Name: price, dtype: int64

Our review of the "price" column shows that the most common value is two dollar signs, so we will fill the null values with "$$".

Additionally, some rows contain the literal string "price" as a value. These come from how our .csv files are saved (each appended page writes its own header row), and will be corrected in a later step.
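
One possible way to handle stray header rows is to skip any row whose values simply repeat the column names. The sketch below uses the standard-library `csv` module on made-up file contents; it is an assumption about how the cleanup could look, not the step this notebook later applies.

```python
import csv
import io

# Made-up file contents: the header row repeats before the second "page"
raw = (
    "name,price\n"
    "Winery A,$$\n"
    "name,price\n"
    "Winery B,$\n"
)

reader = csv.DictReader(io.StringIO(raw))

# A repeated header parses as a row whose values equal the column names,
# so keep only rows where at least one value differs from its column name
rows = [row for row in reader
        if any(value != column for column, value in row.items())]
print(rows)
```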

In [65]:
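# Fill missing prices with "$$", which is the mean (rounded), median, and mode all in one!
# Only the "price" column is filled so other columns are left untouched
df1['price'] = df1['price'].fillna('$$')

# Confirm all NaN are fixed
df1.isna().sum()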

name            0
location        0
Business ID     0
alias           0
title           0
rating          0
review_count    0
price           0
latitude        0
longitude       0
City            0
dtype: int64

Finally, we need to convert our "price" column into more usable data. Currently the values are strings of "$" symbols, so any numerical analysis on the column will raise an error.

To fix this, we will create another function to convert the prices to integer values and save them in a new column. 
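
The conversion itself boils down to a lookup table. Here is a minimal sketch in plain Python, where `PRICE_LEVELS` and `price_to_int` are hypothetical names and mapping missing values to 0 is an assumption made for illustration.

```python
import math

# Lookup from price string to integer level
PRICE_LEVELS = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}


def price_to_int(price):
    """Map a price string to an integer level.

    Anything not in the lookup (including NaN) becomes 0, which is an
    assumption made for this illustration.
    """
    return PRICE_LEVELS.get(price, 0)


prices = ['$$', '$', math.nan, '$$$']
converted = [price_to_int(p) for p in prices]
print(converted)  # [2, 1, 0, 3]
```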

In [59]:
def convert_price(dataframe, location_name):
    ''' - Requires a dataframe with the 'price' column elements being NaN, $, $$, $$$, $$$$, or $$$$$.
    - Takes a pre-existing dataframe and adds a column to store the conversion from $ to an integer.
    - Saves results to new .csv and includes the location name in the file name.'''
    
    # Converting $s to integers, then saving to new column.
    dataframe['price_converted'] = dataframe.loc[:,'price'] \
    .map({np.nan:0, '$':1, '$$':2, '$$$':3, '$$$$':4, '$$$$$':5})
    
    # Saves results to new file
    new_file_name = 'data/wineries_' + location_name + '_price_converted.csv'
    dataframe.to_csv(new_file_name,index = False)
    
    print(f"Saved to ''{new_file_name}''.")
    
    return dataframe

In [60]:
convert_price(df1, 'San Diego')

Saved to ''data/wineries_San Diego_price_converted.csv''.


Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,City,price_converted
0,The Winery Restaurant & Wine Bar,"4301 La Jolla Village Dr Ste 2040 San Diego, C...",76ADW8x8J_69qbtsc5F-2g,bars,Bars,4.0,495,$$,32.8724284,-117.2137748,San Diego,2.0
1,Bernardo Winery,"13330 Paseo Del Verano Norte San Diego, CA 92128",DknnpiG1p4OoM1maFshzXA,winetastingroom,Wine Tasting Room,4.5,626,$$,33.0328,-117.04646,San Diego,2.0
2,Baja Winery Tours,"4629 Cass St San Diego, CA 92109",vVaNDvLrCCE_Cw_DyPnBpA,winetours,Wine Tours,5.0,66,$$,32.7989164,-117.2521107,San Diego,2.0
3,Callaway Vineyard & Winery,"517 4th Ave Ste 101 San Diego, CA 92101",Cn2_bpTngghYW1ej4zreZg,winetastingroom,Wine Tasting Room,5.0,100,$$,32.7107506117294,-117.160917759246,San Diego,2.0
4,San Pasqual Winery - Seaport Village,"805 W Harbor Dr San Diego, CA 92101",gMW1RvyLu90RSQAY9UrIHw,winetastingroom,Wine Tasting Room,4.5,138,$$,32.7087316452387,-117.168194991742,San Diego,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
195,Alpine Discount Liquor,"2223 Alpine Blvd Alpine, CA 91901",-ARx5ShNxJgjyahKnikTnA,beer_and_wine,"Beer, Wine & Spirits",3.5,3,$,32.8352874,-116.7659099,San Diego,1.0
196,San Diego Limobuses,"3333 Midway Dr Ste 206 San Diego, CA 92110",SCaFGyzrTGTI6aQHhLxbgA,limos,Limos,3.0,46,$$,32.75006,-117.21138,San Diego,2.0
197,Village Wine & Spirits,"1552 Encinitas Blvd Encinitas, CA 92024",XkGnb-YxP5MK_ok1X011RA,beer_and_wine,"Beer, Wine & Spirits",4.0,24,$$,33.0458964,-117.2555835,San Diego,2.0
198,The Destination Wedding Group,"Escondido, CA 92033",lJwxe_fjdt-e8xPrDq63fA,wedding_planning,Wedding Planning,5.0,18,$$,33.12347,-117.08652,San Diego,2.0


#  Business Reviews

## Request 

## Parse

## Collect All Reviews