# Gathering Data - Yelp 
**_Authors:_** *Alex Lau, Despina Matos, Julie Vovchenko, and Kelly Wu* 

We decided to gather data from the [Yelp Fusion API](https://www.yelp.com/fusion) to answer our problem statement. The [API documentation](https://www.yelp.com/developers/documentation/v3/business_search) states that a search will return a maximum of 1,000 businesses for the provided search criteria. Yet, we can only pull 50 results per request. Thus, let's create a loop to retrieve the business data for the borough of Manhattan. We decided on Manhattan because incomes vary widely across this borough.

## Table of Contents
- [Libraries](#Libraries)
- [API Loop to Retrieve Business Data for Manhattan](#API-Loop-to-Retrieve-Business-Data-for-Manhattan)
- [Including Zipcodes](#Including-Zipcodes)
- [Saving the Dataset](#Saving-the-Dataset)

## Libraries

Let's begin by importing the libraries we need. We need the requests library to send requests to Yelp and pull in the JSON responses. Then, we need the json library to load our API key from a local file, which keeps the key out of the notebook. We follow [Store API Credentials For Open Source Projects](https://chrisalbon.com/python/basics/store_api_credentials_for_open_source_projects/) for this task. Then, we need the time library to pause between requests so that we limit how often we hit Yelp's server. Lastly, we need the pandas library to create a dataframe that we can work with later on.

In [None]:
#Sending requests and pulling in the data
import requests

#Loading our api key from a local file
import json

#Pausing between requests with time.sleep
import time

#Creating a workable dataset
import pandas as pd

Now that we have imported our libraries, let's create our loop.

## API Loop to Retrieve Business Data for Manhattan

Again, we decided to use the borough of Manhattan to retrieve the business data from Yelp. However, the data we gather is not guaranteed to be unique because the JSON responses can still contain duplicates. Thus, we will drop duplicates later on, after retrieving the data. In our loop, we use our API credentials to request the JSON from https://api.yelp.com/v3/businesses/search. Again, we pull 50 businesses per request and use the sleep function to space out our requests to Yelp's server. In sum, to pull in up to 1,000 businesses, we make 20 requests.
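
To make the paging arithmetic concrete, here is a minimal sketch of how the `offset` values line up with the 50-per-request limit. The helper name `page_offsets` is our own illustration and is not part of the Yelp API.

```python
def page_offsets(total, limit):
    """Return the offset to send with each request so that
    `total` results are covered `limit` at a time."""
    return list(range(0, total, limit))


# 1,000 businesses at 50 per request
offsets = page_offsets(total=1000, limit=50)

print(len(offsets))     # how many requests we need
print(offsets[:3])      # the first few offsets
print(offsets[-1])      # the offset of the last request
```

The last offset plus the limit lands exactly on the 1,000-business maximum, which is why the loop runs 20 times.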

In [None]:
#Loading our API credentials from the JSON file
with open('../env.json') as creds:
    credentials = json.load(creds)
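
For readers who want to reproduce this step, here is a rough sketch of what the credentials file could look like. The key name `yelp_api_key` matches what the loop reads, but the placeholder value and the temporary folder are assumptions for illustration only. In a real project the file would sit outside version control.

```python
import json
import os
import tempfile

# Hypothetical contents of env.json; the real key is never committed
example = {"yelp_api_key": "YOUR_API_KEY_HERE"}

# Write the example to a temporary folder so this sketch touches no real files
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "env.json")
    with open(path, "w") as f:
        json.dump(example, f)

    # Same loading pattern as the cell above
    with open(path) as creds:
        credentials = json.load(creds)

api_key = credentials["yelp_api_key"]
```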

In [None]:
#Reference by Eddie Yip, Hadi Morrow, and Mahdi Shadkam-Farrokhi

#the location we want to pull in
borough = ['Manhattan']
#the businesses from each response will be saved in
data = []
#start the count at 1
count = 1
#starting the for loop
for location in borough:
    #for each location, make up to 20 requests
    for page in range(20):
        #using the url
        URL = 'https://api.yelp.com/v3/businesses/search'
        #using our api key
        API_KEY = credentials['yelp_api_key']
        #50 businesses per request, skipping the ones already pulled (params)
        params = {'location': location,
                  'limit': 50,
                  'offset': 50 * page}
        #Need authorization by using our api key to pull in data
        headers = {'Authorization': 'Bearer %s' % API_KEY}
        #Send a get request to Yelp
        response = requests.get(url=URL, params=params, headers=headers)
        #getting the businesses from this response
        businesses = response.json()['businesses']
        #if this response has no businesses, stop requesting
        if len(businesses) == 0:
            break
        #adding the new businesses into the data list
        data.extend(businesses)
        #count the request we just made
        count += 1
        #using the sleep function to pause between requests
        time.sleep(3)
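
The loop above assumes every response contains a `businesses` key. As a hedged alternative, the sketch below wraps the same steps in a function that also stops on a non-200 status. The function name `fetch_businesses` and the `get` parameter are our own assumptions, not part of the original notebook. Passing the request function in lets us try the logic without touching the network.

```python
import time

SEARCH_URL = 'https://api.yelp.com/v3/businesses/search'


def fetch_businesses(location, api_key, get, pages=20, limit=50, pause=3):
    """Collect businesses for one location, stopping at the first
    empty page or the first non-200 response."""
    headers = {'Authorization': 'Bearer %s' % api_key}
    found = []
    for page in range(pages):
        params = {'location': location,
                  'limit': limit,
                  'offset': limit * page}
        response = get(url=SEARCH_URL, params=params, headers=headers)
        # An error response may not contain 'businesses', so check first
        if response.status_code != 200:
            print('Stopping at offset', params['offset'],
                  'with status', response.status_code)
            break
        businesses = response.json().get('businesses', [])
        # No more results for this location
        if not businesses:
            break
        found.extend(businesses)
        # Pause between requests
        time.sleep(pause)
    return found
```

With the real library this could be called as `fetch_businesses('Manhattan', credentials['yelp_api_key'], get=requests.get)`.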

We successfully pulled the business data from Yelp. However, we need a filter to guarantee that the businesses we get are in Manhattan. Thus, we will now include zip codes in our loop to obtain the correct data because, based on our problem statement, we need to consider neighborhoods.

## Including Zipcodes

Using the [ZIP Code Definitions of New York City Neighborhoods](https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm), we were able to determine that Manhattan has 43 livable zip codes. Therefore, we will now run the loop above once for each zip code.

In [None]:
#Reference by Eddie Yip, Hadi Morrow, and Mahdi Shadkam-Farrokhi

#creating our zip code list 
zip_code_list = [
    '10026', 
    '10027', 
    '10030', 
    '10037', 
    '10039', 
    '10001', 
    '10011', 
    '10018', 
    '10019', 
    '10020', 
    '10036', 
    '10029', 
    '10035', 
    '10010', 
    '10016', 
    '10017', 
    '10022', 
    '10012', 
    '10013', 
    '10014',
    '10004',
    '10005', 
    '10006', 
    '10007', 
    '10038', 
    '10280', 
    '10002', 
    '10003', 
    '10009',
    '10021', 
    '10028', 
    '10044', 
    '10065', 
    '10075', 
    '10128', 
    '10023', 
    '10024', 
    '10025', 
    '10031', 
    '10032', 
    '10033', 
    '10034', 
    '10040'
]
#start the count at 1
count = 1
#starting the for loop
for zip_code in zip_code_list:
    #for each zip code, make up to 20 requests
    for page in range(20):
        #using the url
        URL = 'https://api.yelp.com/v3/businesses/search'
        #using our api key
        API_KEY = credentials['yelp_api_key']
        #50 businesses per request, skipping the ones already pulled (params)
        params = {'location': zip_code,
                  'limit': 50,
                  'offset': 50 * page}
        #Need authorization by using our api key to pull in data
        headers = {'Authorization': 'Bearer %s' % API_KEY}
        #Send a get request to Yelp
        response = requests.get(url=URL, params=params, headers=headers)
        #getting the businesses from this response
        businesses = response.json()['businesses']
        #if this response has no businesses, move on to the next zip code
        if len(businesses) == 0:
            break
        #adding the new businesses into the data list
        data.extend(businesses)
        #count the request we just made
        count += 1
        #using the sleep function to pause between requests
        time.sleep(3)
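
Before running a loop like this, it can help to estimate how many requests it might send. The sketch below works out the worst case from the numbers used above (43 zip codes, 20 requests each, a 3-second pause). The real run is usually shorter because the loop breaks early when a zip code runs out of results.

```python
zip_codes = 43          # length of zip_code_list
requests_per_zip = 20   # range(20) in the inner loop
pause_seconds = 3       # time.sleep(3)

# Worst case: every zip code returns a full 20 pages
max_requests = zip_codes * requests_per_zip
max_sleep_minutes = max_requests * pause_seconds / 60

print(max_requests)        # upper bound on requests sent
print(max_sleep_minutes)   # upper bound on minutes spent sleeping
```

The upper bound on requests is worth comparing against whatever daily call limit applies to the API key being used.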

We successfully pulled from Yelp again and got the data we need to answer our problem statement. Finally, let's create a dataframe from this data so that we can work with it later on.
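
Searching by zip code does not by itself guarantee that every returned business sits inside that zip code. As an optional extra check, here is a minimal sketch that keeps only businesses whose own zip code is in our list. It assumes each business dictionary carries a `location` dictionary with a `zip_code` field. The helper name `in_zip_codes` and the sample records are made up for illustration.

```python
def in_zip_codes(businesses, allowed):
    """Keep only the businesses whose location zip code is in `allowed`."""
    allowed = set(allowed)
    kept = []
    for business in businesses:
        # Guard against a missing or empty location
        zip_code = (business.get('location') or {}).get('zip_code')
        if zip_code in allowed:
            kept.append(business)
    return kept


# Made-up sample records, not real Yelp data
sample = [
    {'id': 'a', 'location': {'zip_code': '10026'}},
    {'id': 'b', 'location': {'zip_code': '11201'}},  # a Brooklyn zip code
    {'id': 'c', 'location': {}},                     # no zip code recorded
]

kept = in_zip_codes(sample, ['10026', '10027'])
print([business['id'] for business in kept])
```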

## Saving the Dataset

Before we save our data, we need to create a dataframe and drop the duplicate rows in the dataframe.

In [None]:
#creating a dataframe with yelp data
manhattan = pd.DataFrame(data)

In [None]:
#dropping the duplicates 
manhattan.drop_duplicates(subset='id', inplace = True)
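
Duplicates are expected here because the same business can come back from the borough search and from one or more zip code searches. To show what dropping duplicates on `id` does, here is a plain-Python sketch that keeps the first occurrence of each `id`, which is also the default behavior of `drop_duplicates`. The helper name `unique_by_id` and the sample records are our own illustration, not a replacement for the pandas call above.

```python
def unique_by_id(businesses):
    """Keep the first business seen for each id, preserving order."""
    seen = set()
    unique = []
    for business in businesses:
        if business['id'] not in seen:
            seen.add(business['id'])
            unique.append(business)
    return unique


# Made-up sample: 'a' was returned by two different searches
sample = [
    {'id': 'a', 'name': 'First Cafe'},
    {'id': 'b', 'name': 'Second Deli'},
    {'id': 'a', 'name': 'First Cafe'},
]

unique = unique_by_id(sample)
print(len(sample), len(unique))
```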

We were able to successfully drop the duplicates. So, let's save our data.

In [None]:
#Here is what we would like to save it as
#index = False for no index column
#to_csv returns None when given a path, so we do not assign the result
manhattan.to_csv('../Client_Project_Yelp_Affluence/manhattan.csv', index = False)