# Yelp API - Lab



## Introduction 

Now that we've seen how the Yelp API works, it's time to put those API and SQL skills to work on some basic business analysis! Taking things a step further, you'll also independently explore how to perform pagination in order to retrieve a full result set from the Yelp API!

## Objectives

You will be able to:
* Create a DB on AWS to store information from Yelp about businesses
* Create HTTP requests to get data from Yelp API
* Parse HTTP responses and insert the information into your DB
* Perform pagination to retrieve troves of data!
* Write SQL queries to answer questions about your data 

## Problem Introduction

You've now worked with some API calls, but we have yet to see how to retrieve a more complete dataset in a programmatic manner and combine it with our other data skills. In this lab you will get data from the Yelp API, store that data in a SQL Database on AWS, and write queries to answer follow-up questions. 


### Outline:

1. Determine which pieces of information you need to pull from the Yelp API.

2. Create a DB schema with 2 tables. One for the businesses and one for the reviews.

3. Create Python functions to:
  - Perform a search of businesses using pagination
  - Parse the API response for specific data points
  - Insert the data into your AWS DB

4. Use the functions above in a loop that will paginate over the results to retrieve all of the results. 

*Something might cause your code to break while it is running. You don't want to constantly repull the same data when this happens, so you should insert the data into the database as you call and parse it, not after you have all of the data*

5. Create functions to:
  - Retrieve the reviews data of one business
  - Parse the reviews response for specific review data
  - Insert the review data into the DB

6. Using SQL, query all of the business IDs. Using the 3 Python functions you've created, run your business IDs through a loop to get the reviews for each business and insert them into your DB.

7. Write SQL queries to answer the following questions about your data.


Bonus Steps:  
- Place your helper functions in a package so that your final notebook only has the major steps listed.
- Rewrite your business search functions to be able to take an argument for the type of business you are searching for.
- Add another group of businesses to your database.
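
The advice in step 4 (insert and commit as you go, so a crash does not force you to pull everything again) can be sketched as follows. This is only an illustration under stated assumptions: `load_pages`, `fetch_page`, `insert_row`, `commit`, and `rollback` are hypothetical names, and the fake functions below stand in for the real Yelp call and the MySQL cursor so the sketch runs without a network or database.

```python
def load_pages(fetch_page, insert_row, commit, rollback, total, limit=50, start=0):
    """Fetch, insert, and commit one page at a time.

    Returns the offset to resume from if a page fails, or None when finished.
    """
    offset = start
    while offset < total:
        try:
            for row in fetch_page(offset, limit):
                insert_row(row)
            commit()  # this page is now safely stored
        except Exception as err:
            rollback()  # discard the partially inserted page
            print('Stopped at offset {}: {}'.format(offset, err))
            return offset
        offset += limit
    return None


# Stand-ins for the API and the database (hypothetical, for illustration only)
stored = []   # rows that have been committed
pending = []  # rows inserted but not yet committed

def fake_fetch(offset, limit):
    if offset == 100:
        raise RuntimeError('simulated network error')
    return [{'id': i} for i in range(offset, offset + limit)]

def fake_commit():
    stored.extend(pending)
    pending.clear()

def fake_rollback():
    pending.clear()

resume_at = load_pages(fake_fetch, pending.append, fake_commit, fake_rollback, total=150)
print(resume_at, len(stored))
```

Because each page is committed before the next request, the two pages fetched before the failure are kept, and the returned offset can be passed back in as `start` on the next run.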


 
## SQL Questions:

- What are the 5 businesses with the highest average ratings?
- What are the 5 businesses with the lowest average ratings?
- What is the average rating of restaurants that have a price label of one dollar sign? Two dollar signs? Three dollar signs? 
- How many businesses have a rating greater than or equal to 4.5?
- How many businesses have a rating less than 3?
- Return the text of the reviews for the most reviewed restaurant. 
- Find the highest rated business and return the text of its most recent review. If multiple businesses have the same rating, select the restaurant with the most reviews. 
- Find the lowest rated business and return the text of its most recent review. If multiple businesses have the same rating, select the restaurant with the fewest reviews. 
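
The tie-breaking in the last two questions can be expressed with a two-column `ORDER BY`. The sketch below is one possible approach, not the lab's reference answer. It uses an in-memory SQLite database with made-up rows so that it runs without the AWS server, and it assumes the reviews table also stores the `time_created` field from the API response, which the schema later in this lab does not include.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE businesses "
            "(id TEXT PRIMARY KEY, name TEXT, rating REAL, review_count INTEGER)")
# time_created is an assumed extra column; it is needed to find the newest review
cur.execute("CREATE TABLE reviews "
            "(id TEXT PRIMARY KEY, text TEXT, time_created TEXT, restaurant_id TEXT)")

# Made-up sample rows: two businesses tie on rating
cur.executemany("INSERT INTO businesses VALUES (?, ?, ?, ?)", [
    ('b1', 'Small Slice', 5.0, 12),
    ('b2', 'Big Slice', 5.0, 340),
    ('b3', 'Okay Slice', 3.5, 900),
])
cur.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", [
    ('r1', 'Older review of Big Slice', '2019-08-16 08:26:52', 'b2'),
    ('r2', 'Newer review of Big Slice', '2019-10-02 07:08:11', 'b2'),
    ('r3', 'Review of Small Slice', '2019-09-01 12:00:00', 'b1'),
])

# Highest rating first, ties broken by the larger review_count;
# then take that business's most recent review.
query = """
SELECT r.text
FROM reviews r
WHERE r.restaurant_id = (
    SELECT id FROM businesses
    ORDER BY rating DESC, review_count DESC
    LIMIT 1)
ORDER BY r.time_created DESC
LIMIT 1
"""
latest = cur.execute(query).fetchone()[0]
print(latest)
```

For the lowest rated business, the inner `ORDER BY` would become `rating ASC, review_count ASC`.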


In [103]:
import requests
import json
import mysql.connector
import config
from mysql.connector import errorcode
import time

## Part I - Set up the DB

Start by reading SQL questions above to get an understanding of the data you will need. Then, read the documentation of Yelp API to understand what data you will receive in the response.  


Now that you are familiar with the data, create your SQL queries to create the DB and the appropriate tables. 

In [150]:
## Connect to DB server on AWS ##### REMEMBER TO COMMIT UPDATES TO DB!!!!!
cnx = mysql.connector.connect(
    host = config.host,
    user = config.user,
    passwd = config.password,
    database = 'yelp'
)

In [151]:
cursor = cnx.cursor()

In [49]:
## Create new DB 
db_name = 'yelp'

In [14]:
def create_database(cursor, database):
    try:
        cursor.execute(
            "CREATE DATABASE {} DEFAULT CHARACTER SET 'utf8'".format(database))
    except mysql.connector.Error as err:
        print("Failed creating database: {}".format(err))
        exit(1)

try:
    cursor.execute("USE {}".format(db_name))
except mysql.connector.Error as err:
    print("Database {} does not exists.".format(db_name))
    if err.errno == errorcode.ER_BAD_DB_ERROR:
        create_database(cursor, db_name)
        print("Database {} created successfully.".format(db_name))
        cnx.database = db_name
    else:
        print(err)
        exit(1)

Database yelp does not exists.
Database yelp created successfully.


In [26]:
# Create a table for the Businesses
tables = {}
tables['businesses'] = (
    "CREATE TABLE businesses ("
    "  id varchar(50) NOT NULL,"
    "  name varchar(50) NOT NULL,"
    "  rating float(32),"
    "  price varchar(10),"
    "  PRIMARY KEY (id)"
    ") ENGINE=InnoDB")


In [27]:
# Create a table for the reviews
tables['reviews'] = (
    "CREATE TABLE reviews ("
    "  id varchar(50) NOT NULL,"
    "  text varchar(1000),"
    "  PRIMARY KEY (id)"
    ") ENGINE=InnoDB")

In [32]:
for table_name in tables:
    table_description = tables[table_name]
    try:
        print("Creating table {}: ".format(table_name), end='')
        cursor.execute(table_description)
    except mysql.connector.Error as err:
        if err.errno == errorcode.ER_TABLE_EXISTS_ERROR:
            print("already exists.")
        else:
            print(err.msg)
    else:
        print("OK")

Creating table businesses: OK
Creating table reviews: OK


In [55]:
# ALTER TABLE TableName
# ADD ColumnName Datatype;

stmt = """ALTER TABLE businesses ADD review_count INT(10)"""
cursor.execute(stmt)

In [119]:
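# reviews was created without a restaurant_id column, so add it along with the foreign key
stmt = """ALTER TABLE reviews ADD restaurant_id varchar(50), ADD FOREIGN KEY (restaurant_id) REFERENCES businesses(id)"""
cursor.execute(stmt)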

## Part 2: Create ETL pipeline for the business data from the API

In [88]:
# Set up the URL, headers, and parameters for a call to the Yelp API
url = "https://api.yelp.com/v3/businesses/search"
# Keep the key out of the notebook; this assumes config.py also defines api_key
api_key = config.api_key
header = {'Authorization': 'Bearer %s' % api_key}
params = {'term': 'pizza',
          'location': 'East Village, NY',
          'limit': '50'
         }



In [7]:
def request_from_yelp(url, header, params):
    req = requests.get(url, params=params, headers=header)
    data = json.loads(req.content)
    return data
def all_results(url,header,params):
    num = request_from_yelp(url, params=params, header=header)['total']
    print('{} total matches found.'.format(num))
    cur = 0
    results = []
    while cur < num and cur < 1000:
        params['offset'] = str(cur)
        results.append(request_from_yelp(url, params=params, header=header))
        time.sleep(1) #Wait a second
        cur += 50
    return results


In [8]:
results = all_results(url=url, header=header, params=params)

1400 total matches found.


In [37]:
results[0].keys()

dict_keys(['businesses', 'total', 'region'])

In [30]:
all_results = []
for item in results:
    for thing in item['businesses']:
        all_results.append(thing)
    

In [31]:
len(all_results)

1000

In [33]:

def parser(data):
    # Read from the argument (data), not the global loop variable.
    # Some businesses have no 'price' key; store 0 as a placeholder for those.
    bus = {
        'id': data['id'],
        'name': data['name'],
        'rating': data['rating'],
        'price': data.get('price', 0),
        'review_count': data['review_count']}
    return bus

In [34]:
to_sql =[]
for item in all_results:
    to_sql.append(parser(item))


In [36]:
to_sql[999]

{'id': 'a7VT5ljjI6aHhgog9ANa1Q',
 'name': 'Middle Eats',
 'price': '$',
 'rating': 4.0,
 'review_count': 62}

In [28]:
to_sql[0]

{'id': 'zj8Lq1T8KIC5zwFief15jg',
 'name': 'Prince Street Pizza',
 'price': '$',
 'rating': 4.5,
 'review_count': 3177}

In [None]:
# write a function to parse the API response 

# so that you can easily insert the data in to the DB

In [38]:
# Write a function to take your parsed data and insert it into the DB
def insert_restaurant(my_dict):
    stmt = """INSERT INTO businesses
    (id, name, rating, price, review_count)
    VALUES (%(id)s,%(name)s,%(rating)s,%(price)s, %(review_count)s)"""
    cursor.execute(stmt, my_dict)


# def load_sql(list_dicts):
#     first = """INSERT INTO businesses (id, name, rating, price, review_count) VALUES"""
#     str_list = []
#     for item in list_dicts:
#         second = item.values()
#         second = [str(x) + ', 'for x in second ]
#         second = ''.join(second)[:-2]
#         second = '('+ second + ')'
#         str_list.append(first + second)
#     return str_list

## Part 3: Create ETL pipeline for the restaurant review data from the API

In [74]:
# write a query to pull back all of the business ids 
cursor.execute("""SELECT id FROM businesses""")
# you will need these ids to pull back the reviews for each restaurant

In [75]:
ids =[]
for item in cursor:
    ids.append(item[0])

In [109]:
# Assumes config.py also defines api_key, so the key stays out of the notebook
api_key = config.api_key
header = {'Authorization': 'Bearer %s' % api_key}
# Try the reviews endpoint on the first business ID before wrapping it in a function
url = 'https://api.yelp.com/v3/businesses/{}/reviews'.format(ids[0])
req = requests.get(url, headers=header)


In [130]:
def get_reviews(api_key, header, item):
    url = 'https://api.yelp.com/v3/businesses/{}/reviews'.format(item)
    req = requests.get(url,headers=header)
    data= json.loads(req.content)
    return data

In [131]:
data = get_reviews(api_key=api_key, header=header, item=ids[0])

In [133]:
data['reviews']

[{'id': '9V_9DbseuChgWODWJBtl6Q',
  'rating': 4,
  'text': "We were pretty hungry and hubby hadn't liked any of my other suggestions. So I found this place and we gave it a shot. For atmosphere, it was exactly what...",
  'time_created': '2019-08-16 08:26:52',
  'url': 'https://www.yelp.com/biz/blue-haven-new-york?adjust_creative=VtPRsqqMAjvCbjZ2iKDzZw&hrid=9V_9DbseuChgWODWJBtl6Q&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_reviews&utm_source=VtPRsqqMAjvCbjZ2iKDzZw',
  'user': {'id': 'zQTEbn5nWGPypxOf2uRPmg',
   'image_url': 'https://s3-media3.fl.yelpcdn.com/photo/I3wyb-L2B-M-tsA_QEBpOg/o.jpg',
   'name': 'Katie H.',
   'profile_url': 'https://www.yelp.com/user_details?userid=zQTEbn5nWGPypxOf2uRPmg'}},
 {'id': 'KTLcMC_T5iinGOftgmLfMw',
  'rating': 3,
  'text': "Not exclusively a sports bar, but I would say it's pretty sports bar adjacent. Surprisingly good selection of beers, and lots of screens all around. Good...",
  'time_created': '2019-10-02 07:08:11',
  'url': 'https://www

In [124]:
data = json.loads(req.content)
rev ={'id': data['reviews'][0]['id'],
    'text': data['reviews'][0]['text']}

In [136]:
def get_review_dict(ids):
    reviews = []  
    for rest in ids:
        data = get_reviews(api_key=api_key, header=header, item=rest)
        for item in data['reviews']:
            rev = {'id' : item['id'],
                   'text': item['text'],
                    'restaurant_id': rest}
            reviews.append(rev)
    return reviews

    

In [138]:
reviews_sql = get_review_dict(ids)

In [154]:
reviews_sql[0]['text']

"We were pretty hungry and hubby hadn't liked any of my other suggestions. So I found this place and we gave it a shot. For atmosphere, it was exactly what..."

In [None]:
# write a function that takes a business id 
# and makes a call to the API for reviews
# then parse out the relevant information


In [155]:
# write a function to insert the parsed data into the reviews table
def insert_review(my_dict):
    stmt = """INSERT INTO reviews
    (id, text, restaurant_id)
    VALUES (%(id)s,%(text)s,%(restaurant_id)s)"""
    cursor.execute(stmt, my_dict)


In [164]:
for review in reviews_sql[1:]:
    insert_review(review)
    cnx.commit()

## Part 4: Write SQL queries that will answer the questions posed. 

###  Pagination

Returning to the Yelp API, the [documentation](https://www.yelp.com/developers/documentation/v3/business_search) also provides details regarding the API limits. These often include the number of requests a user is allowed to make within a specified time period and the maximum number of results to be returned. In this case, we are told that any request returns a maximum of 50 results and defaults to 20. Furthermore, any search is limited to a total of 1000 results. To retrieve all 1000 of these results, we would have to page through them piece by piece, retrieving 50 at a time. Processes such as these are often referred to as pagination.
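
As a small worked example of the arithmetic above, the offsets needed for one search can be listed up front. `page_offsets` is a hypothetical helper name, and the defaults simply restate the limits quoted in this section (50 per request, 1000 per search).

```python
def page_offsets(total, limit=50, cap=1000):
    """Return the offset values needed to page through a search.

    total: the 'total' reported by the API for the query
    limit: results per request
    cap:   maximum number of results a single search can return
    """
    reachable = min(total, cap)
    return list(range(0, reachable, limit))


offsets = page_offsets(1400)
print(len(offsets), offsets[0], offsets[-1])  # 20 0 950
print(page_offsets(120))                      # [0, 50, 100]
```

A search reporting 1400 matches therefore needs 20 requests, and the last 400 matches are out of reach for that query.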

Now that you have an initial response, you can examine the contents of the JSON container. For example, you might start with ```response.json().keys()```. Here, you'll see a key for `'total'`, which tells you the full number of matching results given your query parameters. Write a loop (or ideally a function) that makes successive API calls using the `offset` parameter to retrieve all of the results (or the first 1000 for a particularly large result set, since that is the search limit noted above) for the original query. As you do this, be mindful of how you store the data. 

**Note: be mindful of the API rate limits. You can only make 5000 requests per day, and a loop can easily send requests too quickly. Start prototyping small before running a loop that could be faulty. You can also use `time.sleep(n)` to add delays. For more details see https://www.yelp.com/developers/documentation/v3/rate_limiting.**
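
Beyond a fixed `time.sleep(n)`, one option is to wait longer after each throttled request. The sketch below assumes the API signals throttling with HTTP status 429 (the standard "Too Many Requests" code); check the rate limiting page for the exact behaviour. `call_with_backoff` and `FakeResponse` are hypothetical names, and the fake responses let the example run without touching the network.

```python
import time


def call_with_backoff(send, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until the response status is not 429 or tries run out.

    send:  a zero-argument function that performs one request
    sleep: injectable so the waiting can be observed in a test
    """
    response = None
    for attempt in range(max_tries):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return response


class FakeResponse:
    """Minimal stand-in for a requests.Response (illustration only)."""
    def __init__(self, status_code):
        self.status_code = status_code


replies = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
waits = []
result = call_with_backoff(lambda: next(replies), sleep=waits.append)
print(result.status_code, waits)  # 200 [1.0, 2.0]
```

In the notebook, `send` could be something like `lambda: requests.get(url, headers=header, params=params)`.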

***Below is sample code that you can use to help you deal with the pagination parameter.***

In [None]:
##### EXAMPLE CODE GIVEN BY INSTRUCTOR Your code here; use a function or loop to retrieve all the results from your original request
import time

import pandas as pd
import requests

def yelp_call(url_params, api_key):
    """Make one search request and return the parsed JSON response."""
    url = 'https://api.yelp.com/v3/businesses/search'
    headers = {'Authorization': 'Bearer {}'.format(api_key)}
    response = requests.get(url, headers=headers, params=url_params)
    return response.json()

def all_results(url_params, api_key):
    # The first call tells us how many matches exist in total
    num = yelp_call(url_params, api_key)['total']
    print('{} total matches found.'.format(num))
    cur = 0
    results = []
    while cur < num and cur < 1000:
        url_params['offset'] = cur
        # Add this page's businesses to one flat list
        results.extend(yelp_call(url_params, api_key)['businesses'])
        time.sleep(1) #Wait a second
        cur += 50  # advance the offset, not the results list
    return pd.DataFrame(results)

term = 'pizza'
location = 'Astoria NY'
# requests URL-encodes the parameters, so spaces can be passed as-is
url_params = {'term': term,
              'location': location,
              'limit': 50
             }
df = all_results(url_params, api_key)
print(len(df))
df.head()

### Sample SQL Query 

Below is a SQL query to create a table.  Additionally here is a link to create a table with a foreign key.

http://www.mysqltutorial.org/mysql-foreign-key/

```
CREATE TABLE IF NOT EXISTS tasks (
    task_id INT AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    start_date DATE,
    due_date DATE,
    status TINYINT NOT NULL,
    priority TINYINT NOT NULL,
    description TEXT,
    PRIMARY KEY (task_id)
)  ENGINE=INNODB;
```
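
To see what a foreign key buys you, here is a minimal sketch that links a reviews table to a businesses table and shows an orphan review being rejected. SQLite stands in for MySQL so the example runs without a server; the column types and sample rows are made up for illustration. SQLite only enforces foreign keys after the `PRAGMA` below, whereas MySQL's InnoDB engine enforces them by default.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch

conn.execute("CREATE TABLE businesses (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE reviews (
        id TEXT PRIMARY KEY,
        text TEXT,
        restaurant_id TEXT,
        FOREIGN KEY (restaurant_id) REFERENCES businesses(id)
    )""")

conn.execute("INSERT INTO businesses VALUES ('b1', 'Example Pizza')")
conn.execute("INSERT INTO reviews VALUES ('r1', 'Great slice', 'b1')")  # parent exists

try:
    # No business with id 'missing' exists, so this insert should fail
    conn.execute("INSERT INTO reviews VALUES ('r2', 'Orphan review', 'missing')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

review_count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
print(rejected, review_count)
```

This is also why the businesses must be loaded before their reviews: each review row needs a matching parent row.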

In [165]:
### 5 MOST REVIEWED BUSINESSES
stmt ="""SELECT name, review_count FROM businesses ORDER BY review_count DESC LIMIT 5"""
cursor.execute(stmt)
for thing in cursor:
    print (thing)

("Katz's Delicatessen", 11795)
('Ippudo NY', 9866)
("Lombardi's Pizza", 5988)
("Joe's Shanghai", 5970)
('Peter Luger', 5570)


In [170]:
### NUMBER OF HIGHEST RATED BUSINESSES
stmt = """SELECT COUNT(NAME) FROM businesses WHERE rating = (SELECT MAX(rating) FROM businesses)"""
cursor.execute(stmt)
for thing in cursor:
    print (thing)

(15,)


In [178]:
### PERCENT OF BUSINESSES RATING 4.5 OR HIGHER
stmt = """SELECT tot1/tot2
FROM
(SELECT COUNT(*) as tot1 FROM businesses WHERE rating >= 4.5) as table1, 
(SELECT COUNT(*) as tot2 FROM businesses) as table2"""
cursor.execute(stmt)
for thing in cursor:
    print (thing)

(Decimal('0.1620'),)


In [179]:
### PERCENT OF BUSINESSES RATED LOWER THAN 3
stmt = """SELECT tot1/tot2
FROM
(SELECT COUNT(*) as tot1 FROM businesses WHERE rating < 3) as table1, 
(SELECT COUNT(*) as tot2 FROM businesses) as table2"""
cursor.execute(stmt)
for thing in cursor:
    print (thing)

(Decimal('0.0560'),)


In [182]:
### AVG RATING BY PRICE LEVEL
stmt = """SELECT price, AVG(rating) FROM businesses GROUP BY price"""
cursor.execute(stmt)
for thing in cursor:
    print (thing)

('$$', 3.8079096045197742)
('$$$', 3.8863636363636362)
('0', 3.840909090909091)
('$$$$', 4.166666666666667)
('$', 3.583629893238434)


In [185]:
### RETURN THE TEXT OF ALL OF THE REVIEWS FROM THE MOST REVIEWED RESTAURANT
stmt = """SELECT text, review_count
FROM (SELECT text, businesses.review_count FROM reviews JOIN businesses on reviews.restaurant_id = businesses.id) a
ORDER BY review_count DESC LIMIT 3"""
cursor.execute(stmt)
for thing in cursor:
    print (thing)

('Beware - once you have the 50/50 corned beef/pastrami rueben, you will live a life of dissappointing ruebens elsewhere. I hit this 130 yr old legend every...', 11795)
("Long lines beware and very noisy inside. Once you make it in take your ticket and don't lose it. Then you stand in another line behind a cutter. Some are...", 11795)
("Delicious everything!\n\nReminds me of a little Deli on Geary Street in San Francisco called Shenson's Deli.", 11795)
