# Querying NYC Restaurants with Yelp using GraphQL
The Yelp GraphQL API is a beta program but, unlike the other Yelp APIs, it lets us fully customize our queries and retrieve only the data we need for our analysis.

Below is an example of how to use the Yelp `GraphQL` API to query 100 restaurants in each of several locations within NYC.

For detailed instructions, please see [Getting Started with Yelp GraphQL](https://docs.developer.yelp.com/docs/graphql-intro).

In [14]:
# import packages
import requests
import pandas as pd
import time
import sys
sys.path.append('src') # add src folder to path

# import api key
from config import YELP_API
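The `config` module imported above is not shown in this notebook. As a minimal sketch, assuming the key is kept in an environment variable, it could look like the following. The variable name `YELP_API_KEY` and the helper `load_api_key` are hypothetical, not part of the original project.

```python
# src/config.py (hypothetical contents; the original module is not shown)
import os


def load_api_key(env_var="YELP_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty token
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With this in place, `config.py` would end with `YELP_API = load_api_key()`, which keeps the key itself out of version control.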

First, we need to set our `headers` and `url`.<br>
The API key needs to be approved for beta use, so before making any queries, go to the `Manage Account` section of your Yelp profile and opt in to the beta.

In [15]:
# set up headers and access token
headers = {
    "Authorization": f"Bearer {YELP_API}",
    "Content-Type": "application/json"
}

# set url
url = "https://api.yelp.com/v3/graphql"
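A GraphQL endpoint reports query problems in an `errors` field of the response body, so checking only the HTTP status can miss them. The helper below is a minimal sketch of a wrapper that checks both. The name `run_query` and its `post` parameter are illustrative assumptions, and `requests` is imported lazily so the function can be tried with a stand-in.

```python
def run_query(query, url, headers, post=None, timeout=30):
    """POST a GraphQL query and return its `data` payload, raising on failure."""
    if post is None:
        import requests  # imported here so a stand-in `post` can be used instead
        post = requests.post

    response = post(url, json={"query": query}, headers=headers, timeout=timeout)
    response.raise_for_status()        # surface HTTP-level failures
    payload = response.json()

    if payload.get("errors"):          # GraphQL-level failures live in the body
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")
    return payload["data"]
```

A call such as `run_query(businesses_query, url, headers)["search"]["business"]` would then fail loudly instead of raising a `KeyError` on a missing `data` key.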


Next, let's set the:

- locations
- categories
- total businesses and query limit for each location
- total reviews and query limit for each location

We will also set empty lists for both business and review data.

In [55]:
# data storage lists
business_data = []
review_data = []

# locations and limits
locations = ["Brooklyn 11222", "Queens 11104", "Bronx 10463"]
categories = ["restaurant"]
business_total = 100
business_limit = 50
review_limit = 3
review_total = 6
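The totals and limits above translate into a sequence of offsets and page sizes. The generator below is a small sketch of that arithmetic; `page_plan` is an illustrative name, not something used elsewhere in this notebook.

```python
def page_plan(total, limit):
    """Yield (offset, page_size) pairs covering `total` items in pages of at most `limit`."""
    offset = 0
    while offset < total:
        size = min(limit, total - offset)  # the last page may be smaller
        yield offset, size
        offset += size


print(list(page_plan(100, 50)))  # [(0, 50), (50, 50)]
print(list(page_plan(6, 3)))     # [(0, 3), (3, 3)]
```

So 100 businesses at 50 per query take two requests per location, and 6 reviews at 3 per query take two requests per business.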


Next, we perform the business queries.<br>
Businesses and reviews have a **one-to-many relationship**, so we need to query the businesses first, then use those business ids to query the reviews.<br>
In addition, each query returns at most 50 businesses and only a few reviews (we request `review_limit = 3` at a time), so we need to page through the results with an offset.<br>
Let's loop through each location to apply the query.
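Because reviews are fetched per business, the number of API calls grows quickly. The sketch below gives a rough estimate, assuming every location returns the full `business_total` and every review page is non-empty. The function name is illustrative.

```python
import math


def estimate_requests(n_locations, business_total, business_limit,
                      review_total, review_limit):
    """Estimate the total number of API calls made by the nested loops."""
    business_pages = math.ceil(business_total / business_limit)
    review_pages = math.ceil(review_total / review_limit)
    # per location: the business pages, plus review pages for every business
    per_location = business_pages + business_total * review_pages
    return n_locations * per_location


print(estimate_requests(3, 100, 50, 6, 3))  # 606
```

It is worth comparing that figure with whatever request quota applies to your API key before running the loop.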

In [None]:

for location in locations:
    # Fetch businesses
    business_count = 0
    offset = 0
    while business_count < business_total:
        # Calculate the number of businesses to retrieve in the current iteration
        businesses_to_retrieve = min(business_total - business_count, business_limit)

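        # query for the businesses in the current location with offset and limit;
        # the categories are joined into one comma-separated string rather than
        # interpolated as a Python list (which would render as "['restaurant']")
        businesses_query = f"""
        {{
          search(location: "{location}", categories: "{','.join(categories)}",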
          limit: {businesses_to_retrieve}, offset: {offset}) {{
            business {{
              id
              name
              rating
              review_count
              is_closed
              coordinates {{
                  latitude
                  longitude
              }}
              location {{
                city
                address1
                address2
                postal_code
              }}
            }}
          }}
        }}
        """

        # api call to fetch businesses
        response = requests.post(url, json={"query": businesses_query}, headers=headers)
        data = response.json()

        # extract info
        businesses = data["data"]["search"]["business"]

        if not businesses:
            # No more businesses found for the current location, break the loop
            break

        # iterate over the businesses and fetch their reviews
        for business in businesses:
            # fetch reviews for the current business
            fetched_reviews = []
            offset_reviews = 0
            while len(fetched_reviews) < review_total:
                reviews_query = f"""
                {{
                  business(id: "{business['id']}") {{
                    reviews(limit: {review_limit}, offset: {offset_reviews}) {{
                      user {{
                        name
                        id
                      }}
                      rating
                      text
                    }}
                  }}
                }}
                """
                # api call to fetch reviews
                reviews_response = requests.post(url, json={"query": reviews_query}, headers=headers)
                reviews_data = reviews_response.json()

                # extract info
                reviews = reviews_data["data"]["business"]["reviews"]

                if not reviews:
                    # no more reviews for this business; stop to avoid looping forever
                    break

                fetched_reviews.extend(reviews)
                offset_reviews += review_limit

            # store business data
            business_data.append({
                "business_id": business["id"],
                "name": business["name"],
                "address1": business["location"]["address1"],
                "address2": business["location"]["address2"],
                "city": business["location"]["city"],
                "postal_code": business["location"]["postal_code"],
                "rating": business["rating"],
                "review_count": business["review_count"],
                "is_cloused": business["is_closed"],
                "latitude": business["coordinates"]["latitude"],
                "longitude": business["coordinates"]["longitude"]

            })

            # store review data
            for review in fetched_reviews[:review_total]:
                review_data.append({
                    "business_id": business["id"],
                    "review_user_id": review["user"]["id"],
                    "review_user": review["user"]["name"],
                    "review_rating": review["rating"],
                    "review_text": review["text"]
                })

            business_count += 1

        offset += business_limit
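The review loop above assumes each page brings new reviews. If an endpoint were to return the same page regardless of `offset` (an assumption, not something confirmed here), the loop would store repeats. The sketch below shows one defensive pattern that stops when a page adds nothing new. `fetch_reviews` and the `fetch_page(limit, offset)` callable are hypothetical names.

```python
def fetch_reviews(fetch_page, review_total, review_limit):
    """Collect up to `review_total` unique reviews, one page at a time.

    `fetch_page(limit, offset)` is a hypothetical callable returning a list of
    review dicts shaped like the API response (with `user.id` and `text`).
    """
    collected, seen = [], set()
    offset = 0
    while len(collected) < review_total:
        added = 0
        for review in fetch_page(review_limit, offset):
            key = (review["user"]["id"], review["text"])
            if key not in seen:        # skip reviews we already stored
                seen.add(key)
                collected.append(review)
                added += 1
        if added == 0:                 # empty page or only repeats: stop
            break
        offset += review_limit
    return collected[:review_total]
```

In the notebook, `fetch_page` would wrap the `requests.post` call that sends `reviews_query`.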

Create two dataframes:
- one for businesses
- one for reviews

Keeping separate dataframes lets us explore the **one-to-many** relationship between businesses and reviews.

In [56]:

# business df
business_df = pd.DataFrame(business_data)

# reviews df
review_df = pd.DataFrame(review_data)
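When the two tables need to be analyzed together, they can be joined on `business_id`. The sketch below uses small made-up frames (the names and values are placeholders, not query results) and pandas' `validate` argument to check that the relationship really is one-to-many.

```python
import pandas as pd

# toy stand-ins for business_df and review_df
businesses = pd.DataFrame({"business_id": ["a", "b"],
                           "name": ["Cafe A", "Diner B"]})
reviews = pd.DataFrame({"business_id": ["a", "a", "b"],
                        "review_rating": [5, 4, 3]})

# validate="one_to_many" raises if business_id is duplicated on the left side
merged = businesses.merge(reviews, on="business_id", how="left",
                          validate="one_to_many")

# average review rating per business
avg_rating = merged.groupby("name")["review_rating"].mean()
print(avg_rating)
```

The same call on the real frames would be `business_df.merge(review_df, on="business_id", validate="one_to_many")`.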

Look at datatypes and preview.

In [59]:
business_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   300 non-null    object 
 1   name          300 non-null    object 
 2   address1      300 non-null    object 
 3   address2      216 non-null    object 
 4   city          300 non-null    object 
 5   postal_code   300 non-null    object 
 6   rating        300 non-null    float64
 7   review_count  300 non-null    int64  
 8   is_cloused    300 non-null    bool   
 9   latitude      300 non-null    float64
 10  longitude     300 non-null    float64
dtypes: bool(1), float64(3), int64(1), object(6)
memory usage: 23.9+ KB


In [60]:
# preview
business_df.head()

| | business_id | name | address1 | address2 | city | postal_code | rating | review_count | is_cloused | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | qLLxS7RwNEjP_jq_KQrPfA | Traif | 229 S 4th St | | Brooklyn | 11211 | 4.5 | 2024 | False | 40.710658 | -73.958872 |
| 1 | zwOAiVT4pAmpNGXzj-t5MA | Lilia | 567 Union Ave | | Brooklyn | 11211 | 4.0 | 1176 | False | 40.71757 | -73.95236 |
| 2 | jAaVnUKLITkuhzwXIe0vLQ | Cafe Mogador | 133 Wythe Ave | | Brooklyn | 11211 | 4.5 | 1400 | False | 40.719747 | -73.959993 |
| 3 | hthvpEL7JEbfxfD6iP9axQ | DeStefano's Steakhouse | 593 Lorimer St | | Brooklyn | 11211 | 4.5 | 1020 | False | 40.714624 | -73.94974 |
| 4 | 6gzQLjzJk25ePm_JS7ZAug | Esme | 999 Manhattan Ave | | Brooklyn | 11222 | 4.5 | 400 | False | 40.733226 | -73.954927 |
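The `is_cloused` column name looks like a misspelling of `is_closed`. Rather than re-running the queries, the column can be renamed on the existing dataframe. A minimal sketch on a one-row stand-in frame:

```python
import pandas as pd

# stand-in for business_df with the misspelled column
demo_df = pd.DataFrame({"business_id": ["qLLxS7RwNEjP_jq_KQrPfA"],
                        "is_cloused": [False]})

# rename returns a new dataframe; reassign to keep the change
demo_df = demo_df.rename(columns={"is_cloused": "is_closed"})
print(demo_df.columns.tolist())  # ['business_id', 'is_closed']
```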


In [64]:
# dtypes
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   business_id     1800 non-null   object
 1   review_user_id  1800 non-null   object
 2   review_user     1800 non-null   object
 3   review_rating   1800 non-null   int64 
 4   review_text     1800 non-null   object
dtypes: int64(1), object(4)
memory usage: 70.4+ KB


In [61]:
# preview
review_df.head()

| | business_id | review_user_id | review_user | review_rating | review_text |
|---|---|---|---|---|---|
| 0 | qLLxS7RwNEjP_jq_KQrPfA | MfsHZG8YsH5S_8b_NR6rVw | Thomas H. | 5 | food is excellent. I love that they don't rush... |
| 1 | qLLxS7RwNEjP_jq_KQrPfA | jv4iczCaaKne1tJA-Qd55A | Ashley K. | 5 | WOW. This was one of the best meals I've had i... |
| 2 | qLLxS7RwNEjP_jq_KQrPfA | LBCZ6Tw1Na6U9kfqXsZo8Q | Dennis W. | 4 | Traif is SO good. The menu is reasonably price... |
| 3 | qLLxS7RwNEjP_jq_KQrPfA | MfsHZG8YsH5S_8b_NR6rVw | Thomas H. | 5 | food is excellent. I love that they don't rush... |
| 4 | qLLxS7RwNEjP_jq_KQrPfA | jv4iczCaaKne1tJA-Qd55A | Ashley K. | 5 | WOW. This was one of the best meals I've had i... |
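In the preview, rows 3 and 4 repeat rows 0 and 1, which suggests the review dataframe may contain duplicate rows. The sketch below rebuilds those five preview rows (ids and ratings only) and shows how duplicates could be counted and dropped. Whether the full dataframe has the same pattern is not verified here.

```python
import pandas as pd

# the five preview rows, reduced to ids and ratings
sample = pd.DataFrame({
    "business_id": ["qLLxS7RwNEjP_jq_KQrPfA"] * 5,
    "review_user_id": ["MfsHZG8YsH5S_8b_NR6rVw", "jv4iczCaaKne1tJA-Qd55A",
                       "LBCZ6Tw1Na6U9kfqXsZo8Q", "MfsHZG8YsH5S_8b_NR6rVw",
                       "jv4iczCaaKne1tJA-Qd55A"],
    "review_rating": [5, 5, 4, 5, 5],
})

n_duplicates = int(sample.duplicated().sum())             # rows repeating an earlier row
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))  # 2 3
```

On the real data the equivalent would be `review_df.drop_duplicates()`.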


Let's group by `review_user_id` and see whether any reviewers have multiple reviews.

In [66]:
review_df.groupby('review_user_id').size().sort_values(ascending=False).head(10)

review_user_id
MxtKj5GFmCvijWOLQ1pjdg    20
rT502dRc8jUxcIdIu0JTLA    14
uaAJzWR1iipChDerr_hFkg    12
uySR3jDEk_DrvUt9fAThYg    10
ux418S1kkyYQzUyvYc8mqQ    10
X9VCpNQoEz8D20N_OsSg2w    10
JPH-WOKa6EBMlpBGLIPuiw     8
LBCZ6Tw1Na6U9kfqXsZo8Q     8
QEvgjD61Dy3tynbPN8z88A     8
Sjx_y-T0fTXO2M5TiDCWsw     6
dtype: int64
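Row counts per reviewer can be inflated if the same review is stored more than once. A complementary view is to count distinct businesses per reviewer. The sketch below uses a made-up frame (the ids are placeholders) to show the difference between the two counts.

```python
import pandas as pd

# toy data: reviewer u1 has a repeated row for business "a"
toy = pd.DataFrame({
    "review_user_id": ["u1", "u1", "u1", "u2"],
    "business_id": ["a", "a", "b", "b"],
})

rows_per_user = toy.groupby("review_user_id").size()
businesses_per_user = toy.groupby("review_user_id")["business_id"].nunique()

print(rows_per_user["u1"], businesses_per_user["u1"])  # 3 2
```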

Pickle dataframes for future analysis.

In [25]:
# pickle df
business_df.to_pickle("./data/business_df_300.pkl")
review_df.to_pickle("./data/reviews_df_1800.pkl")
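`to_pickle` does not create missing folders, so the calls above fail if `./data` does not exist yet. The sketch below creates the target folder first and reads the file back as a sanity check. The helper name `save_and_reload` is illustrative, and a temporary directory stands in for `./data`.

```python
import os
import tempfile

import pandas as pd


def save_and_reload(df, path):
    """Pickle `df` to `path` (creating the folder if needed) and load it back."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    df.to_pickle(path)
    return pd.read_pickle(path)


with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"business_id": ["a", "b"], "rating": [4.5, 4.0]})
    reloaded = save_and_reload(demo, os.path.join(tmp, "data", "demo.pkl"))
    print(reloaded.equals(demo))  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.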