# Yelp API - Lab



## Introduction 

Now that we've seen how the Yelp API works, it's time to put those API and SQL skills to work in order to do some basic business analysis! Taking things a step further, you'll also independently explore how to perform pagination in order to retrieve a full results set from the Yelp API!

## Objectives

You will be able to:
* Create a DB on AWS to store information from Yelp about businesses
* Create HTTP requests to get data from Yelp API
* Parse HTTP responses and insert the information into your DB
* Perform pagination to retrieve troves of data!
* Write SQL queries to answer questions about your data 

Taken together, these steps make up an ETL (extract, transform, and load) pipeline.

## Problem Introduction

You've now worked with some API calls, but we have yet to see how to retrieve a more complete dataset in a programmatic manner and combine it with our other data skills. In this lab you will get data from the Yelp API, store that data in a SQL Database on AWS, and write queries to answer follow-up questions. 


The search data is up to you -- the term & location are your call. 


### Outline:

1. Determine which pieces of information you need to pull from the Yelp API. 
    * look at the documentation
    * figure out what you'll want to put into your SQL database (there's a lot given, and you won't want it all)
    * at the very least, you'll want the name and reviews of the business
    * you'll also need the business ID in order to query Yelp for reviews


2. Create a DB schema with 2 tables. One for the businesses and one for the reviews.
    * Each review will have to be its own row, so you'll have to think about first normal form (1NF): each column should hold a single atomic value rather than a list of values


3. Create Python functions to:
  - Perform a search of businesses using pagination
  - Parse the API response for specific data points
  - Insert the data into your AWS DB
  - consider the above functions "helper functions" for #4


4. Use the functions above in a loop that will paginate over the results to retrieve all of the results.

5. Create functions to:
  - Retrieve the reviews data of one business
  - Parse the reviews response for specific review data
  - Insert the review data into the DB
  

6. Using SQL, query all of the business IDs. Using the 3 Python functions you've created, run your business IDs through a loop to get the reviews for each business and insert them into your DB.

7. Write SQL queries to answer the following questions about your data.


Bonus Steps:  
- Place your helper functions in a package so that your final notebook only has the major steps listed.
- Rewrite your business search functions to be able to take an argument for the type of business you are searching for.
- Add another group of businesses to your database.


 
## SQL Questions:

- What are the 5 businesses with the highest average ratings?
- What are the 5 businesses with the lowest average ratings?
- What is the average rating of restaurants that have a price label of one dollar sign? Two dollar signs? Three dollar signs? 
- How many businesses have a rating above 4.5?
- How many businesses have a rating below 3?
- Return the text of the oldest review in the table.
- Return the overall rating of the business with the oldest review. 
- Find the highest rated business and return the text of the newest of its three reviews.
- Find the lowest rated business and return the text of the newest of its three reviews.

## Part I - Set up the DB

Start by reading the SQL questions above to get an understanding of the data you will need. Then, read the Yelp API documentation to understand what data you will receive in the response.


Now that you are familiar with the data, create your SQL queries to create the DB and the appropriate tables. 

In [1]:
## Connect to DB server on AWS
import mysql.connector
import config
import time

cnx = mysql.connector.connect(
        host = config.host,
        user = config.user,
        passwd = config.passwd)

cursor = cnx.cursor()

In [2]:
## Create new DB
from mysql.connector import errorcode

db_name = 'yelp'

def create_database(cursor, database):
    """Create `database` if it is missing, then make it the active database."""
    try:
        #it will try to create a database with whatever name is passed in
        cursor.execute(
            "CREATE DATABASE {} DEFAULT CHARACTER SET 'utf8'".format(database))
        print("Database {} created successfully.".format(database))
    except mysql.connector.Error as err:
        #if this fails, the error will print out as a message
        print("Failed creating database: {}".format(err))
        #an already-existing database is fine; anything else is re-raised
        if err.errno != errorcode.ER_DB_CREATE_EXISTS:
            raise

    #select the database so that later statements run against it
    cursor.execute("USE {}".format(database))

#creates (or reuses) the yelp db
create_database(cursor, db_name)

Failed creating database: 1007 (HY000): Can't create database 'yelp'; database exists


In [3]:
# Create a table for the Businesses
# PK for businesses should be business' id

TABLES = {}

TABLES['businesses'] = (
                        """CREATE TABLE businesses (
                                 businessId VARCHAR(25),
                                 name TEXT,
                                 price VARCHAR(5),
                                 rating REAL,
                                 loc_address TEXT,
                                 PRIMARY KEY (businessId)
                                 ) ENGINE=InnoDB""")
TABLES['reviews'] = (
                        """CREATE TABLE reviews (
                                 reviewId VARCHAR(25),
                                 businessId VARCHAR(25),
                                 userId VARCHAR(25),
                                 preview TEXT,
                                 userRating REAL,
                                 PRIMARY KEY (reviewId)
                                 )ENGINE=InnoDB""")


In [6]:
#This adds our tables to the database
from mysql.connector import errorcode
for table_name in TABLES:
    table_description = TABLES[table_name]
    try:
        print("Creating table {}: ".format(table_name), end='')
        cursor.execute(table_description)
    except mysql.connector.Error as err:
        if err.errno == errorcode.ER_TABLE_EXISTS_ERROR:
            print("already exists.")
        else:
            print(err.msg)
    else:
        print("OK")

cursor.close()
cnx.close()

Creating table businesses: Cursor is not connected
Creating table reviews: Cursor is not connected


## Part 2: Create ETL pipeline for the business data from the API

In [7]:
import requests 
import json
# write a function to make a call to the API
# keep the key out of the notebook; this assumes config.py also defines api_key
api_key = config.api_key
url_params_businesses = {
                            'term' : 'thai',
                            'location': 'NYC',
                            'limit' : 50
                            }
url_businesses = 'https://api.yelp.com/v3/businesses/search'

# the reviews endpoint needs no search parameters, only the business id in the URL
url_params_reviews = {}
url_reviews = 'https://api.yelp.com/v3/businesses/{id}/reviews'

def yelp_call(url, url_params, api_key, id=None):
    # fill in the {id} placeholder when a business id is given (reviews URL)
    if id is not None:
        url = url.format(id=id)
    headers = {'Authorization': 'Bearer {}'.format(api_key)}
    response = requests.get(url, headers=headers, params=url_params)

    # both endpoints answer with JSON, so the same parsing works for reviews
    data = response.json()
    return data

tryresults = yelp_call(url_businesses, url_params_businesses, api_key)
print(tryresults)

{'businesses': [{'id': 'jjJc_CrkB2HodEinB6cWww', 'alias': 'lovemama-new-york', 'name': 'LoveMama', 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/bLlFKTlVuLfmF-lIDGIjZA/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/lovemama-new-york?adjust_creative=Q0o6644AtYBVMqGZUHAx-w&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Q0o6644AtYBVMqGZUHAx-w', 'review_count': 4387, 'categories': [{'alias': 'thai', 'title': 'Thai'}, {'alias': 'malaysian', 'title': 'Malaysian'}, {'alias': 'vietnamese', 'title': 'Vietnamese'}], 'rating': 4.0, 'coordinates': {'latitude': 40.7303859, 'longitude': -73.9860613}, 'transactions': ['delivery', 'restaurant_reservation', 'pickup'], 'price': '$$', 'location': {'address1': '174 2nd Ave', 'address2': '', 'address3': '', 'city': 'New York', 'zip_code': '10003', 'country': 'US', 'state': 'NY', 'display_address': ['174 2nd Ave', 'New York, NY 10003']}, 'phone': '+12122545370', 'display_phone': '(212) 254-5370', 'distance': 2858

In [8]:
type(tryresults)
len(tryresults)

3

In [9]:
# write a function to parse the API response 
# so that you can easily insert the data in to the DB

#creates new cursor for this
cnx = mysql.connector.connect(
        host = config.host,
        user = config.user,
        passwd = config.passwd)

cursor = cnx.cursor()


#adds results just gotten from API to database
def add_businesses_to_db(results):
    # sql statement to add one row
    add_business = ("""INSERT INTO businesses
                            VALUES (%s, %s, %s, %s, %s)
                    """)
    cursor.execute("USE yelp") #gets us into the yelp db instance

    # for each business that is in our results from our API request
    for business in results['businesses']:
        try:
            business_values = (business['id'], business['name'],
                               business['price'], business['rating'],
                               ", ".join(business['location']['display_address']))
        except KeyError: # here to skip the restaurants that are missing values, such as price
            continue
        cursor.execute(add_business, business_values)

    cnx.commit()


def all_results(url, url_params, api_key, max_results=100):
    # ask the API for the total rather than relying on an earlier global response
    num = yelp_call(url, url_params, api_key)['total']
    print('{} total matches found.'.format(num))
    cur = 0
    # max_results keeps this run small; the API itself stops at 1000 results
    while cur < num and cur < max_results:
        # This gets you to where you should be in the data, rather than grabbing
        # the first 50 over and over again
        url_params['offset'] = cur
        results = yelp_call(url, url_params, api_key)
        add_businesses_to_db(results)
        time.sleep(1) #Wait a second
        cur += url_params.get('limit', 20) # move on to the next page of results
    return

all_results(url_businesses, url_params_businesses, api_key)

cursor.close()
cnx.close()


2600 total matches found.


In [None]:
# Write a function to take your parsed data and insert it into the DB
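
One way to keep parsing separate from inserting is a small function that turns a single business dictionary into a row tuple. The sketch below is only an illustration, not the lab's reference solution: `parse_business` is a hypothetical name, the field names are taken from the sample response printed earlier, and a missing `price` is stored as `None` instead of skipping the business.

```python
def parse_business(business):
    """Turn one business dict from the search response into a row tuple.

    Column order matches the businesses table:
    businessId, name, price, rating, loc_address.
    """
    location = business.get('location', {})
    address = ", ".join(location.get('display_address', []))
    return (
        business['id'],
        business['name'],
        business.get('price'),   # not every business has a price label
        business.get('rating'),
        address,
    )


# A trimmed copy of the first business in the sample response
sample = {
    'id': 'jjJc_CrkB2HodEinB6cWww',
    'name': 'LoveMama',
    'rating': 4.0,
    'price': '$$',
    'location': {'display_address': ['174 2nd Ave', 'New York, NY 10003']},
}
row = parse_business(sample)
print(row)
```

A list of such rows could then be handed to `cursor.executemany` together with the `INSERT INTO businesses` statement, so that parsing and loading stay in separate functions.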

## Part 3: Create ETL pipeline for the review data from the API

In [None]:
# write a query to pull back all of the business ids 
# you will need these ids to pull back the reviews for each restaurant

In [None]:
# write a function that takes a business id
# and makes a call to the API for reviews
# then parse out the relevant information
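
The parsing half of this step might look like the sketch below. It assumes, without having confirmed it against the documentation, that the reviews response is a dictionary with a `'reviews'` list whose items carry `'id'`, `'text'`, `'rating'` and a nested `'user'` dictionary with its own `'id'`. `parse_reviews` is a hypothetical name and the sample data is made up.

```python
def parse_reviews(business_id, response_json):
    """Turn a reviews response into row tuples for the reviews table.

    Column order: reviewId, businessId, userId, preview, userRating.
    The response layout is an assumption; check it against a real response.
    """
    rows = []
    for review in response_json.get('reviews', []):
        rows.append((
            review['id'],
            business_id,
            review.get('user', {}).get('id'),
            review.get('text'),
            review.get('rating'),
        ))
    return rows


# Made-up response in the assumed layout
sample_response = {
    'reviews': [
        {'id': 'rev-1', 'text': 'Great pad thai...', 'rating': 5,
         'user': {'id': 'user-1'}},
        {'id': 'rev-2', 'text': 'Slow service...', 'rating': 2,
         'user': {'id': 'user-2'}},
    ]
}
review_rows = parse_reviews('jjJc_CrkB2HodEinB6cWww', sample_response)
print(review_rows)
```

The questions about the oldest and newest reviews need a review date, so if the response includes a timestamp it is worth storing it in an extra column as well.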

In [None]:
# write a function to insert the parsed data into the reviews table

## Part 4: Write SQL queries that will answer the questions posed. 

In [None]:
# sql queries go here (run them with a mysql.connector cursor, as in the earlier cells)
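
To illustrate the shape these queries can take, the sketch below answers two of the questions against a throwaway in-memory SQLite table. SQLite is used here only as a stand-in so that the example runs without an AWS connection, and the rows are made up. Against the lab database the same `SELECT` statements would be passed to the MySQL cursor instead.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("""CREATE TABLE businesses (
                   businessId VARCHAR(25) PRIMARY KEY,
                   name TEXT,
                   price VARCHAR(5),
                   rating REAL,
                   loc_address TEXT)""")

# Made-up rows, only for demonstrating the queries
made_up_rows = [
    ('a', 'Alpha', '$', 4.5, '1 Main St'),
    ('b', 'Bravo', '$$', 4.0, '2 Main St'),
    ('c', 'Charlie', '$$', 3.0, '3 Main St'),
    ('d', 'Delta', '$$$', 5.0, '4 Main St'),
    ('e', 'Echo', '$', 3.5, '5 Main St'),
    ('f', 'Foxtrot', '$$', 2.0, '6 Main St'),
]
# SQLite uses ? placeholders; mysql.connector uses %s
cur.executemany("INSERT INTO businesses VALUES (?, ?, ?, ?, ?)", made_up_rows)

# The 5 businesses with the highest ratings
top_five = cur.execute("""SELECT name, rating
                          FROM businesses
                          ORDER BY rating DESC, name ASC
                          LIMIT 5""").fetchall()

# Average rating for each price label
avg_by_price = cur.execute("""SELECT price, AVG(rating)
                              FROM businesses
                              GROUP BY price
                              ORDER BY price""").fetchall()

print(top_five)
print(avg_by_price)
```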

###  Pagination

Returning to the Yelp API, the [documentation](https://www.yelp.com/developers/documentation/v3/business_search) also provides us details regarding the API limits. These often include details about the number of requests a user is allowed to make within a specified time limit and the maximum number of results to be returned. In this case, we are told that each request returns a maximum of 50 results and defaults to 20. Furthermore, any search will be limited to a total of 1000 results. To retrieve all 1000 of these results, we would have to page through the results piece by piece, retrieving 50 at a time. Processes such as these are often referred to as pagination.
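
The arithmetic behind those pages can be sketched in a few lines. The helper below is hypothetical (`page_offsets` is not part of the Yelp API or of this lab's code); it only lists the `offset` values needed to cover a result set, using the 50-per-request and 1000-per-search limits described above as defaults.

```python
def page_offsets(total, limit=50, cap=1000):
    """Return the offset of each page needed to cover min(total, cap) results."""
    reachable = min(total, cap)
    return list(range(0, reachable, limit))


print(page_offsets(120))        # [0, 50, 100]
print(len(page_offsets(2600)))  # 20
```

So a query with 120 matches takes three requests, while a query with 2600 matches is cut off at 20 requests because only the first 1000 results are reachable.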

Now that you have an initial response, you can examine the contents of the JSON container. For example, you might start with ```response.json().keys()```. Here, you'll see a key for `'total'`, which tells you the full number of matching results given your query parameters. Write a loop (or ideally a function) which then makes successive API calls using the offset parameter to retrieve all of the results (or the first 1000 for a particularly large result set, since that is the most a single search will return) for the original query. As you do this, be mindful of how you store the data. 

**Note: be mindful of the API rate limits. You can only make 5000 requests per day, and a loop can easily send requests faster than the API allows. Start by prototyping on a small scale before running a loop that could be faulty. You can also use time.sleep(n) to add delays. For more details see https://www.yelp.com/developers/documentation/v3/rate_limiting.**
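
Beyond a fixed `time.sleep`, one common pattern is to wait longer after each rate-limited attempt. The sketch below is a generic illustration rather than anything Yelp prescribes: it assumes the API signals rate limiting with HTTP status 429 (check the rate limiting page linked above), and `call_with_backoff` and its parameters are hypothetical names. The request itself is passed in as a function so that the example can run without network access.

```python
import time


def call_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops reporting HTTP 429 or retries run out.

    fetch must return a (status_code, payload) pair.
    The wait doubles after each rate-limited attempt.
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:  # assumption: 429 means "too many requests"
            return payload
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError('still rate limited after {} retries'.format(max_retries))


# Simulated responses: rate limited twice, then a success
responses = iter([(429, None), (429, None), (200, {'total': 3})])
delays = []  # record the waits instead of actually sleeping
payload = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(payload, delays)
```

With a real request, `fetch` could wrap `requests.get` and return `(response.status_code, response.json())`.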

***Below is sample code that you can use to help you deal with the pagination parameter.***

In [None]:
# Your code here; use a function or loop to retrieve all the results from your original request
import time

import pandas as pd
import requests


def yelp_call(url_params, api_key):
    url = 'https://api.yelp.com/v3/businesses/search'
    headers = {'Authorization': 'Bearer {}'.format(api_key)}
    response = requests.get(url, headers=headers, params=url_params)

    # return the whole payload so the caller can read both 'total' and 'businesses'
    return response.json()

def all_results(url_params, api_key):
    # an initial call tells us how many results match the query
    num = yelp_call(url_params, api_key)['total']
    print('{} total matches found.'.format(num))
    cur = 0
    results = []
    while cur < num and cur < 1000:
        # This gets you to where you should be in the data, rather than grabbing
        # the first 50 over and over again
        url_params['offset'] = cur
        # extend (not append) so that results stays one flat list of businesses
        results.extend(yelp_call(url_params, api_key)['businesses'])
        time.sleep(1) #Wait a second
        cur += 50
    return pd.DataFrame(results)
    # Note: you could also have your function parse and then insert into database rather 
    # than collecting everything in results

term = 'pizza'
location = 'Astoria NY'
# requests URL-encodes the parameters itself, so spaces can be passed as-is
url_params = {  'term': term,
                'location': location,
                'limit' : 50
             }
df = all_results(url_params, api_key)
print(len(df))
df.head()

### Sample SQL Query 

Below is a SQL query to create a table.  Additionally here is a link to create a table with a foreign key.

http://www.mysqltutorial.org/mysql-foreign-key/

```sql
CREATE TABLE IF NOT EXISTS tasks (
    task_id INT AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    start_date DATE,
    due_date DATE,
    status TINYINT NOT NULL,
    priority TINYINT NOT NULL,
    description TEXT,
    PRIMARY KEY (task_id)
)  ENGINE=INNODB;
```
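
For this lab, a foreign key would tie each review to its business. The sketch below shows the idea with simplified versions of the two tables in an in-memory SQLite database, used only as a stand-in so the example runs anywhere; the rows are made up. The `FOREIGN KEY ... REFERENCES` clause is written the same way in MySQL, but the `PRAGMA` line is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# SQLite only enforces foreign keys when this is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""CREATE TABLE businesses (
                    businessId VARCHAR(25) PRIMARY KEY,
                    name TEXT)""")
conn.execute("""CREATE TABLE reviews (
                    reviewId VARCHAR(25) PRIMARY KEY,
                    businessId VARCHAR(25) NOT NULL,
                    preview TEXT,
                    FOREIGN KEY (businessId)
                        REFERENCES businesses (businessId))""")

conn.execute("INSERT INTO businesses VALUES ('b1', 'Made-up Thai')")
conn.execute("INSERT INTO reviews VALUES ('r1', 'b1', 'Great noodles')")

# A review that points at an unknown business is rejected
try:
    conn.execute("INSERT INTO reviews VALUES ('r2', 'missing', 'Orphan review')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

Because of the constraint, businesses have to be loaded before their reviews, which matches the order of the steps in the outline.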