# **Collecting Data and Reviews from Yelp API**


## Import Packages

First, we are going to import our packages for use in the notebook. 

We are going to import packages to:
- request data from the Yelp API
- access our saved data
- explore the data and generate statistics 

In [None]:
# Requesting data from the Yelp API
import requests

# Accessing and saving stored data
import csv
import json

# Data exploration and statistics
import pandas as pd
import numpy as np

# Business Data

## Request

We need to create a function to request data from the Yelp API. 

Each request returns at most fifty results, a limit set by the API. To make sure we can collect all of the data, we include the "offset" parameter.

The "offset" parameter shifts the window of results that a request returns: the first request uses an offset of 0 and returns results 0-49, and a request with an offset of 50 returns the next set of results, 50-99.
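
As a minimal sketch of the paging arithmetic described above, the helper below lists the offset for each page. The name `page_offsets` and the total of 120 results are made up for illustration; only the page size of 50 comes from the text.

```python
import math

PAGE_SIZE = 50  # maximum results per request, as described above


def page_offsets(total_results, page_size=PAGE_SIZE):
    """Return the offset to use for each page of results."""
    # Round up so that a partial final page is still requested
    num_pages = math.ceil(total_results / page_size)
    return [page * page_size for page in range(num_pages)]


# 120 is a made-up total, used only for illustration
offsets = page_offsets(120)
print(offsets)  # [0, 50, 100]
```

Rounding up (rather than adding one to the integer division) avoids requesting an empty extra page when the total is an exact multiple of 50.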

In [None]:
def yelp_request_offset(term, location, yelp_key, offset=0, verbose=False):
    '''Adapted from Yelp API Lab: https://github.com/BenJMcCarty/dsc-yelp-api-lab/tree/solution'''
    
    url = 'https://api.yelp.com/v3/businesses/search'

    # The API key is sent as a bearer token
    headers = {
        'Authorization': 'Bearer {}'.format(yelp_key),
    }

    # requests URL-encodes these values, so the search term and
    # location are passed as-is (spaces included)
    url_params = {
        'term': term,
        'location': location,
        'limit': 50,
        'offset': offset
    }
    
    response = requests.get(url, headers=headers, params=url_params)
    
    # Optionally print the status and the start of the raw response
    if verbose:
        print(response)
        print(type(response.text))
        print(response.text[:1000])
        
    return response.json()
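
A short sketch of how query parameters are encoded, using the standard library's `urlencode` (to our understanding, `requests` builds the query string from `params` in a comparable way). It shows that a space is encoded automatically, while a literal "+" placed in the value is itself escaped and no longer stands for a space. The search term is only an example.

```python
from urllib.parse import urlencode

# A space in the value is encoded for us
plain = urlencode({'term': 'wine tasting', 'limit': 50})
print(plain)     # term=wine+tasting&limit=50

# A literal '+' in the value is escaped, so it is sent as a plus sign
replaced = urlencode({'term': 'wine+tasting', 'limit': 50})
print(replaced)  # term=wine%2Btasting&limit=50
```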

## Parse

When we request the data, we will get a LOT of information, more than we will need for our analyses and insights.

We will create a function to loop through each result and save specific parts of the information. Once this information is pulled, it will be returned to us as a dataframe, which we can use for cleaning and feature engineering afterwards.

In [None]:
def parse_data(list_of_data):
    '''Adapted from Tyrell's code'''  

    # Create empty list to store results
    parsed_data = []
    
    # Loop through each business in the list of businesses
    # Add specific k:v pairs to a dictionary
    for business in list_of_data:
        
        # Use the first listed category; fall back to an empty dict
        # if the business has no categories
        categories = business.get('categories') or [{}]
        first_category = categories[0]
            
        details = {'name': business['name'],
                   'location': ' '.join(business['location']['display_address']),
                   'Business ID': business['id'],
                   'alias': first_category.get('alias', np.nan),
                   'title': first_category.get('title', np.nan),
                   'rating': business['rating'],
                   'review_count': business['review_count'],
                   # Not every business has a "price" key; use NaN if missing
                   'price': business.get('price', np.nan),
                   'latitude': business['coordinates']['latitude'],
                   'longitude': business['coordinates']['longitude']
                  }
        
        # Add the new dictionary to the list of results
        parsed_data.append(details)
    
    # Create a DataFrame from the resulting list
    df_parsed_data = pd.DataFrame(parsed_data)
    
    return df_parsed_data
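
To illustrate the parsing step on its own, here is a sketch that flattens one business record with plain dictionaries. The record is hypothetical and heavily trimmed (it is not real API output), and `first_category` is a helper name we made up. It shows one way to handle a record with no "price" key and no categories.

```python
# A hypothetical, heavily trimmed business record (not real API output)
sample_business = {
    'id': 'example-id',
    'name': 'Example Winery',
    'categories': [],            # no categories listed
    'rating': 4.5,
    'review_count': 12,
    'location': {'display_address': ['1 Example Rd', 'San Diego, CA']},
    'coordinates': {'latitude': 32.7, 'longitude': -117.1},
    # note: there is no 'price' key
}


def first_category(business, field):
    """Return a field from the first category, or None if there is none."""
    categories = business.get('categories') or []
    return categories[0].get(field) if categories else None


details = {
    'name': sample_business['name'],
    'location': ' '.join(sample_business['location']['display_address']),
    'alias': first_category(sample_business, 'alias'),
    'price': sample_business.get('price'),   # None when the key is missing
}
print(details)
```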

## Collect All Data 

Now that we have created functions to request the data and to keep only the most relevant parts, we will create a function to pull all of the data, parse it, then save the results to a .csv file for storage.

In [1]:
def get_full_data(term, location, yelp_key):
    '''Requests all results from the Yelp API, saves them to a .csv,
    and returns them as a single DataFrame.'''
    
    # Create (or overwrite) the .csv that stores the results.
    # The empty DataFrame writes a placeholder first line, so the column
    # names land on the second line of the file (later cells read the
    # file with header = 1).
    file_name = 'data/wineries_' + location + '.csv'
    blank_df = pd.DataFrame()
    blank_df.to_csv(file_name)
    
    # Process first request to Yelp API and calculate number of pages.
    # Rounding up counts a partial final page without adding an empty
    # page when the total is an exact multiple of 50.
    results = yelp_request_offset(term, location, yelp_key)
    num_pages = (results['total'] + 49) // 50
    
    # Print out confirmation feedback
    print(f'For {term} and {location}: ')
    print(f"    Total number of results: {results['total']}.")
    print(f'    Total number of pages: {num_pages}.')
    
    # Store each parsed page so all results can be returned together
    all_pages = []

    for num in range(num_pages):
        # Offset of the first result on this page
        offset = num * 50
        
        try:
            # Reuse the first response for page 0; request the other pages
            if num > 0:
                results = yelp_request_offset(term, location, yelp_key,
                                              offset=offset)
            
            # From results, take values from "businesses" key and parse
            parsed_results = parse_data(results['businesses'])

            # Add a column to identify in which region 
            # this business is located.
            parsed_results['City'] = location
          
            # Append this page to the .csv; write the column names
            # only for the first page that is saved
            parsed_results.to_csv(file_name, mode='a', index=False,
                                  header=not all_pages)
            all_pages.append(parsed_results)
            
        except Exception as error:
            # If error, print where it happens and continue;
            # the pages saved so far are already in the .csv
            print(f'Error on page {num}: {error}')

    # Combine all pages into one DataFrame (empty if nothing was saved)
    if not all_pages:
        return pd.DataFrame()

    return pd.concat(all_pages, ignore_index=True)
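
When pages are appended to one file, the column names should be written only once, or they reappear as data rows on every later page. A minimal sketch of that pattern in isolation, using the `csv` module, a temporary file, and two made-up pages of rows:

```python
import csv
import os
import tempfile

# Two made-up "pages" of parsed results
pages = [
    [{'name': 'A', 'rating': 4.0}, {'name': 'B', 'rating': 3.5}],
    [{'name': 'C', 'rating': 5.0}],
]

path = os.path.join(tempfile.mkdtemp(), 'example.csv')

for page_number, rows in enumerate(pages):
    # 'w' starts a fresh file for the first page, 'a' appends afterwards
    mode = 'w' if page_number == 0 else 'a'
    with open(path, mode, newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'rating'])
        if page_number == 0:
            writer.writeheader()   # write the column names only once
        writer.writerows(rows)

# Read the file back to confirm there is one header and three data rows
with open(path, newline='') as f:
    saved_rows = list(csv.DictReader(f))

print(len(saved_rows))  # 3
```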

## Cleaning Data and Feature Engineering

At this point, we have pulled our data and saved it. Now, we need to make sure that we select only those businesses that are wineries (not distributors or venues). We will start by counting how many businesses have each alias, or type of business.

In [4]:
def identify_top_aliases(raw_data = None):
    '''- Requires user to specify an existing .csv file
    - Takes raw business data from the Yelp API and identifies the top three
    aliases.
    '''

    # Read in businesses (the column names are on the second line of the file)
    df1 = pd.read_csv(raw_data, header = 1)

    # Count the businesses per alias and keep the three most common
    alias_index = df1['alias'].value_counts()[:3].index
    print(alias_index)

    return alias_index

This function identifies the three most common aliases. Since we can tell that some of these are not truly wineries, we'll create a new function to filter out the unwanted results.

In [6]:
def top_two_aliases(raw_data = None):
    '''- Requires an existing .csv file
    - Takes raw business data from the Yelp API and filters for the top two
    aliases (focusing on "wineries" and "winetastingrooms").
    '''

    # Read in businesses (the column names are on the second line of the file)
    df1 = pd.read_csv(raw_data, header = 1)

    # Identify the two most common aliases
    alias_index = df1['alias'].value_counts()[:2].index
    print(alias_index)
    
    # Keep only the rows with one of those aliases, then reset the index
    df2 = df1[df1['alias'].isin(alias_index)].reset_index(drop=True)
    
    # Save results
    df2.to_csv('data/wineries_cleaned.csv', index = False)
       
    print("Saved to 'data/wineries_cleaned.csv'")
    
    return df2
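
The counting and filtering steps can be tried on a small, made-up table before running them on the saved file. The sketch below uses invented names and aliases purely for illustration.

```python
import pandas as pd

# Made-up businesses and aliases, for illustration only
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'alias': ['wineries', 'wineries', 'wineries',
              'winetastingrooms', 'winetastingrooms', 'venues'],
})

# value_counts sorts aliases from most to least frequent
top_two = df['alias'].value_counts()[:2].index

# Keep only the rows whose alias is one of the top two
filtered = df[df['alias'].isin(top_two)].reset_index(drop=True)

print(list(top_two))   # ['wineries', 'winetastingrooms']
print(len(filtered))   # 5
```

Note that this keeps whichever two aliases are most frequent; if a different alias outnumbers the ones we care about, filtering on an explicit list of aliases may be the safer choice.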

Next, we need to convert the details from our "price" column into more usable data. Currently, the prices are stored as strings of "$" symbols, so any numeric analysis on the column would fail.

To fix this, we will create another function to convert the prices to an integer value and save them in a new column. 

In [7]:
def convert_price(dataframe):
    ''' - Requires a dataframe with the 'price' column elements being NaN, $, $$, $$$, $$$$, or $$$$$.
    - Takes a pre-existing dataframe and adds a column to store the conversion from $ to an integer.'''
    
    # Converting $s to integers, then saving to new column.
    # Prices missing from the mapping (including NaN) are stored as 0.
    price_map = {'$':1, '$$':2, '$$$':3, '$$$$':4, '$$$$$':5}
    dataframe['price_converted'] = dataframe['price'] \
    .map(price_map).fillna(0).astype(int)
    
    # Saves results to new file
    dataframe.to_csv('data/wineries_price_converted.csv',index = False)
    
    return dataframe
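
A small sketch of the conversion on made-up values. It assumes that missing prices should become 0, as in the function above.

```python
import numpy as np
import pandas as pd

prices = pd.Series(['$', '$$$', np.nan, '$$'])  # made-up values

price_map = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4, '$$$$$': 5}

# Values missing from the mapping (including NaN) become NaN, then 0
converted = prices.map(price_map).fillna(0).astype(int)
print(converted.tolist())  # [1, 3, 0, 2]
```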

Finally, we want to review the data to check for any missing or null values in our 'price' column. These values can cause problems during our analysis phase, so we will fix them now.

In [None]:
def find_and_fix_null(dataframe, fill_value="$$"):
    '''- Requires a dataframe with a 'price' column
    - Replaces null prices with fill_value ("$$" by default)'''

    # Check for null values
    nan_sum = dataframe['price'].isna().sum()
    print(nan_sum)

    # The mean price is not computed here; null prices are
    # replaced with fill_value instead
    dataframe['price'] = dataframe['price'].fillna(value=fill_value)

    return dataframe
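
A quick sketch of the null check and fill on made-up values, assuming "$$" is an acceptable stand-in for a missing price.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': ['$$', np.nan, '$', np.nan]})  # made-up values

# Count the missing prices before filling
missing_before = int(df['price'].isna().sum())

# Assigning the result back avoids relying on inplace=True
df['price'] = df['price'].fillna('$$')

missing_after = int(df['price'].isna().sum())
print(missing_before, missing_after)  # 2 0
```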

# Business Reviews

## Request 

## Parse

## Collect All Reviews