# Yelp

*Developed by Daniel Deutsch, José Lucas Barretto, Kevin Kühl and Lucas Miguel Agrizzi.*

The present document is divided into 5 sections:

- Part 1 - Extracting businesses' information
- Part 2 - Extracting the number of fake reviews for each business
- Part 3 - Extraction of fake reviews from restaurant pages
- Part 4 - Extraction of real reviews
- Final remarks

We made an effort to comment and explain each step. Please let us know if anything could be presented better or if a better approach exists for a given step.

# Part 1 - Extracting businesses' information

In [None]:
import time
import json
import random
import requests
import pandas as pd
import numpy as np

from lxml import html
from bs4 import BeautifulSoup
from datetime import timedelta
from requests_html import HTMLSession
from requests_html import AsyncHTMLSession

In [None]:
# Constants
ARRONDISSEMENTS = json.load(open("./constants/arrondissements.json", encoding="utf-8"))
CATEGORIES = json.load(open("./constants/categories.json", encoding="utf-8"))
URL_DEFAULT_IMG = "https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_styleguide/514f6997a318/assets/img/default_avatars/user_60_square.png"

# API authentication
AUTHS_API = json.load(open("./constants/auth-api.json", encoding="utf-8")) 
available_auths_api = json.load(open("./constants/auth-api.json", encoding="utf-8"))

# NordVPN possible countries
countries = json.load(open("./constants/countries.json", encoding="utf-8"))

## 1.1) Category Parser

The categories JSON provided by Yelp's site contains some information we don't actually need. To make it easier to work with, we transform it into an array of category aliases.

In [None]:
categories_alias = [category["alias"] for category in CATEGORIES if "restaurants" in category["parents"]]
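
As a minimal illustration of the filter above, the sketch below applies the same comprehension to a hand-made list. The three sample entries are invented for the example; only the `alias` and `parents` keys come from the code above.

```python
# Toy stand-in for CATEGORIES; the entries are made up for illustration
sample_categories = [
    {"alias": "french", "title": "French", "parents": ["restaurants"]},
    {"alias": "bakeries", "title": "Bakeries", "parents": ["food"]},
    {"alias": "pizza", "title": "Pizza", "parents": ["restaurants"]},
]

# Keep only the aliases whose parent list contains "restaurants"
sample_alias = [c["alias"] for c in sample_categories if "restaurants" in c["parents"]]
print(sample_alias)
```

Entries whose parents do not include "restaurants" (here, `bakeries`) are dropped.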

## 1.2) Business Extraction

Uses the Yelp API to gather all the possible restaurants in Paris. Since Paris is known to have more than 45,000 restaurants and the API only returns the first 1000 results given the parameters (with 50 results per page), we had to play with the request params to be able to gather as many restaurants as possible.

The first option we tried was to iterate only on the arrondissements of Paris. This allowed us to retrieve 12204 restaurants.

We wanted to check whether we could get more restaurants, so we applied a second approach with two nested iterations: one over the arrondissements and one over the categories a restaurant can have. This returned 13184 restaurants. This last approach is presented below.

In [None]:
# Sets the HTTP request
url = "https://api.yelp.com/v3/businesses/search"
headers = { "Authorization": f"Bearer {random.choice(available_auths_api)['api_key']}" }
params = {
    "term": "restaurants",
    "sort_by": "distance",
    "open_now": False, 
    "offset": 0,
    "limit": 50, 
    "location": "",
    "categories": ""
}

# Defines the time that the process started
start = time.time()

# Defines the dataframe of obtained restaurants
df = pd.DataFrame()

# Appends the restaurants to the dataframe
for idx_arr, arr in enumerate(ARRONDISSEMENTS):

    params["location"] = f"{arr}, Paris"

    for idx_cat, category in enumerate(categories_alias):        
        
        params["categories"] = category
        params["offset"] = 0

        while True:

            print(f"\rProgress: arrondissement {idx_arr+1}/{len(ARRONDISSEMENTS)}, category {idx_cat+1}/{len(categories_alias)}, available api accounts {len(available_auths_api)}/{len(AUTHS_API)}, restaurants {df.shape[0]}, time taken {timedelta(seconds=time.time()-start)}", end="")

            r = requests.get(url, headers=headers, params=params)
            status_code = r.status_code
            r = r.json()

            # Overflow on the amount of requests allowed per day
            if status_code == 429:
                available_auths_api = [keys for keys in available_auths_api if keys["api_key"] != headers["Authorization"].split(" ")[1]]
                if not available_auths_api:
                    raise Exception("Out of valid api accounts for today")
                headers = { "Authorization": f"Bearer {random.choice(available_auths_api)['api_key']}" }

            # Overflow on the limit (1000)
            elif r.get("error"):
                break
            
            # Overflow on the amount of results
            elif params["offset"] > r["total"]:
                break
            
            # Got the results
            else:
                df = pd.concat([pd.DataFrame(r["businesses"]), df], ignore_index=True)
                params["offset"] += params["limit"]
        
    # Remove duplicates
    df = df.drop_duplicates(subset=["id"], ignore_index=True)

    # Saves to csv
    df.to_csv(f"datasets/arrondissements/raw_{arr}_restaurants.csv")
    

# Remove duplicates
df = df.drop_duplicates(subset=["id"], ignore_index=True)

# Saves to csv
df.to_csv("datasets/raw_restaurants.csv")
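
The pagination constraint described in this section can be sketched in isolation. The helper below is hypothetical (it is not part of the notebook) and only assumes what the text states: 50 results per page and at most 1000 results per query.

```python
def page_offsets(total, limit=50, cap=1000):
    """Return the offsets needed to page through min(total, cap) results."""
    reachable = min(total, cap)
    return list(range(0, reachable, limit))

# A query matching 120 restaurants needs three pages
print(page_offsets(120))

# A query matching 45000 restaurants is still limited to 20 pages
print(len(page_offsets(45000)))
```

This is why splitting the search by arrondissement and by category helps: each (location, category) pair is a separate query with its own 1000-result ceiling.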

# Part 2 - Extracting the number of fake reviews for each business

## 2.1) Number of Fake Reviews Extraction

Once we have the dataframe with all the restaurants, we need to obtain information about the reviews of each one. The problem is that Yelp's site blocks us after a certain number of requests. The workaround is to use a VPN service that gives us anonymity, enabling us to keep scraping. Whenever Yelp blocked us, we picked a random country from a list and reconnected through it. We use this method wherever the scraping may get blocked.

*This part of the code may not run on your computer because it relies on the NordVPN service.*

In [None]:
df_counter = pd.read_csv("datasets/raw_restaurants.csv", index_col=0)

country = "France"
def fake_review_count(row, row_count):

    global country   # It's global so it doesn't depend on the context 

    # Builds the url of the restaurant
    url = f"https://www.yelp.fr/not_recommended_reviews/{row['alias']}"

    # Tries to get the number of fake reviews
    try:
        r = requests.get(url)
        doc = html.fromstring(r.text)
        count = int(doc.xpath('//*[@id="super-container"]/div[2]/div/div/div[3]/div/div[1]/h3')[0].text.strip().split(" ")[0])
    
    # Changes the country of the VPN and tries again
    except:
        country = random.choice(countries)
        ! nordvpn connect {country} # Runs on the terminal
        time.sleep(5)
        # Return the result of the retry, otherwise the function yields None after a reconnect
        return fake_review_count(row, row_count)
    
    # Once we know the number of fake reviews we return it
    else:
        print(f"\rProgress: row {row.name}/{row_count}, accessing from: {country}", end="")
        return count


# Creates a column that shows the number of fake comments on the restaurant's page
df_counter["fake_review_count"] = df_counter.apply(lambda row: fake_review_count(row, df_counter.shape[0]), axis=1)

df_counter.to_csv("datasets/raw_restaurants_fake_counts.csv")
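
The retry-and-reconnect idea can also be written as a bounded loop instead of a recursive call. The sketch below is one possible shape, not the notebook's implementation: `fetch_with_rotation`, `fetch` and `switch_exit` are hypothetical names, and in the notebook `switch_exit` would wrap the `nordvpn connect` call used above.

```python
import random

def fetch_with_rotation(fetch, switch_exit, exits, max_attempts=5):
    """Call fetch() until it succeeds, switching exit country after each failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch()
        except Exception as error:  # broad on purpose: a block can surface in many ways
            last_error = error
            switch_exit(random.choice(exits))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error

# Fake fetch that fails twice before returning a count (stands in for the scraper)
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("blocked")
    return 7

visited = []  # records the countries we "connected" to
result = fetch_with_rotation(flaky_fetch, visited.append, ["France", "Spain"])
print(result, len(visited))
```

Capping the number of attempts avoids looping forever on a page that fails for a reason other than being blocked, for example a layout the XPath no longer matches.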

## 2.2) Raw Data Storage

Now that we have gathered all the information we need, we store it in a .csv file without any further treatment.

In [None]:
df = pd.read_csv("datasets/raw_restaurants_fake_counts.csv", index_col=0)

# Remove duplicates
df = df.drop_duplicates(subset=["id"], ignore_index=True)

# Saves to csv
df.to_csv("./datasets/raw_complete_restaurants.csv")

## 2.3) Data Processing

Once we have the raw data, we fit it into a structure that is easier to work with when we are analyzing the data.

## 2.3.1) Drop Unnecessary Columns

Some of the business details returned by Yelp's API are not relevant to our goal, so we can safely drop these columns.

In [None]:
df.drop(["phone","display_phone", "is_closed", "image_url", "transactions", "name", "location"], axis=1, inplace=True)

## 2.3.2) Flattening Important Attributes

The coordinates column is a dict with keys "latitude" and "longitude". It's harder to work with columns that are dicts in pandas. To avoid that, we create new columns based on these attributes.

In [None]:
def flatten_coordinates(x):
    try:
        json_obj = json.loads(x["coordinates"].replace("'", '"'))
        return json_obj['latitude'], json_obj['longitude']
    except json.JSONDecodeError:
        return pd.NA, pd.NA

df["latitude"] = df.apply(lambda x: flatten_coordinates(x)[0], axis=1)
df["longitude"] = df.apply(lambda x: flatten_coordinates(x)[1], axis=1)

df.drop(["coordinates"], axis=1, inplace=True)
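
Because the CSV round trip stores each dict as its Python representation, the standard library's `ast.literal_eval` is an alternative to swapping quotes before calling `json.loads`. The sketch below assumes the cell looks like the sample string; `parse_coordinates` is a hypothetical helper and returns `None` instead of `pd.NA` to stay dependency-free.

```python
import ast

def parse_coordinates(text):
    """Parse a stringified dict such as "{'latitude': 48.85, 'longitude': 2.35}"."""
    try:
        value = ast.literal_eval(text)
        return value["latitude"], value["longitude"]
    except (ValueError, SyntaxError, KeyError, TypeError):
        # Covers unparsable text, missing keys and non-dict values
        return None, None

print(parse_coordinates("{'latitude': 48.85, 'longitude': 2.35}"))
print(parse_coordinates("{'latitude': None, 'longitude': None}"))
print(parse_coordinates("not a dict"))
```

Unlike the quote replacement, this also parses cells where a coordinate is `None`, returning the pair `(None, None)` from the dict itself.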

## 2.3.3) Numericalizing Valuable Strings

The price is proportional to the number of $\$$ symbols returned by the API. Strings are harder to work with in this context, so we replace each price string with the count of $\$$ symbols it contains.

In [None]:
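def numericalize(row):
    # A missing price is read back as NaN, and len(str(NaN)) would wrongly give 3
    if pd.isna(row["price"]):
        return pd.NA
    return len(str(row["price"]))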

df["price"] = df.apply(lambda row: numericalize(row), axis=1)
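
As a small worked example of this conversion, the sketch below counts symbols on plain Python values. `price_level` is a hypothetical helper; it treats a missing price (`None` or NaN) separately, since the string form of NaN is three characters long and would otherwise be counted as a price level.

```python
import math

def price_level(price):
    """Count the currency symbols in a price string; None for a missing price."""
    if price is None or (isinstance(price, float) and math.isnan(price)):
        return None
    return len(str(price))

print(price_level("$"), price_level("$$$$"), price_level(float("nan")))
```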

## 2.3.4) Dropping Unnecessary Categories Attributes

The categories column is an array of dictionaries. These dictionaries have "alias" and "title" as their keys. We will only use the "alias", so we turn the column into an array of aliases instead of an array of dictionaries.

In [None]:
def get_categ(row):
    categories = []
    text = row["categories"].replace("[","").replace("]","")
    text = text.split(",")
    for item in text:
        final = item.replace("'", '"').strip().replace('"', "").replace("{","").replace("}", "").split(":")
        if final[0] == "alias":
            # Strip the space left after the colon so aliases compare cleanly
            categories.append(final[1].strip())
    return categories
        
df["categories"] = df.apply(lambda row: get_categ(row), axis=1)
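
The same extraction can be sketched with `ast.literal_eval`, which parses the stringified list directly instead of splitting on commas and colons. This assumes each cell holds the Python representation of a list of dicts, as in the sample below; `extract_aliases` is a hypothetical helper.

```python
import ast

def extract_aliases(text):
    """Return the "alias" of every category in a stringified list of dicts."""
    try:
        items = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [item["alias"] for item in items if isinstance(item, dict) and "alias" in item]

raw = "[{'alias': 'french', 'title': 'French'}, {'alias': 'bistros', 'title': 'Bistros'}]"
print(extract_aliases(raw))
```

Parsing the structure avoids surprises when a title itself contains a comma or a colon.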

## 2.4) Final Error Check

In this part, we look for restaurants whose number of fake reviews is missing and try once more to retrieve it. This helps us eliminate inconsistent data.

In [None]:
for i, row in df.iterrows():
    if pd.isnull(row["fake_review_count"]):
        url = f"https://www.yelp.fr/not_recommended_reviews/{row['alias']}"
        try:
            r = requests.get(url)
            doc = html.fromstring(r.text)
            count = int(doc.xpath('//*[@id="super-container"]/div[2]/div/div/div[3]/div/div[1]/h3')[0].text.strip().split(" ")[0])
            df.at[i,"fake_review_count"] = count
        except:    
            country = random.choice(countries)
            ! nordvpn connect {country} # Runs on the terminal
            time.sleep(5)

## 2.5) Processed Data Storage

Once we have the data fitting the structure that we wanted, we store it in a different .csv file.

In [None]:
df.to_csv("datasets/processed_restaurants.csv")

## 2.6) Notes on Obtained Dataset

The data saved in "processed_restaurants.csv" contains the following columns:

- **id**: restaurant's id, unique identifier code for a given restaurant
- **alias**: unique name identifier (appended to the yelp.fr URL to form the restaurant's own webpage)
- **url**: restaurant's webpage url
- **review_count**: number of reviews on a given restaurant's page
- **categories**: list of categories a restaurant belongs to
- **rating**: general rating of a given restaurant
- **distance**: distance relative to a point in the center of the arrondissement the restaurant belongs to
- **price**: price index for the given restaurant
- **fake_review_count**: number of fake reviews on a given restaurant's page
- **latitude**: latitude of the restaurant's position
- **longitude**: longitude of the restaurant's position


# Part 3 - Extraction of fake reviews from restaurant pages

To extract the fake reviews, we selected the first 10 fake reviews of each restaurant in our dataset. If a restaurant had fewer than 10 fake reviews, we collected all of them.

In the end, we were able to retrieve 32869 fake reviews out of a total of 40965. This represents 80.24% of the total number of fake reviews available.

## 3.1) Extraction of the fake reviews

In [None]:
# Reads the dataset containing information from the obtained restaurants
df_restaurants = pd.read_csv("datasets/processed_restaurants.csv", index_col=0)

# Create an empty dataframe to store reviews
df_reviews = pd.DataFrame(columns=['user_origin', 'user_friends_count', 'user_reviews_count', 'is_fake', 'date', 'rest_alias', 'text', 'rating', 'has_img', 'reviews_have_photos'])

# Creating useful variables
row_count = df_restaurants.shape[0]
progress = 0
total_reviews = 0

# Iterating over each row of the dataframe of restaurants
for index, row in df_restaurants.iterrows():
    iterator = 0
    less10 = 0
    # Getting the number of fake reviews for the current restaurant
    number_fake_reviews = int(row["fake_review_count"])
    
    # Check if there are less than 10 reviews (all reviews would be in a single page)
    if number_fake_reviews < 10:
        less10 = 1
        
    # Update progress counter if the restaurant has no fake reviews (and do not enter the while loop below)
    if number_fake_reviews == 0:
        progress += 1
        
    # Builds the url of the restaurant
    url = f"https://www.yelp.fr/not_recommended_reviews/{row['alias']}"

    # Tries to get the fake reviews
    # Inside a while loop to treat all possible exceptions
    while(number_fake_reviews != 0):
        try:
            # Do the request
            r = requests.get(url)
            # Parse the response from the request
            soup = BeautifulSoup(r.text, 'html.parser')
            # Get all the <li> elements from the page (these contain the reviews)
            reviews_in_page = soup.find_all("ul")[0].find_all("li")
            
            # Iterate over the first 10 fake reviews of the current analysed restaurant
            for i in range(10 if not less10 else number_fake_reviews):
                # First, we check whether there is an icon for photos posted by the user
                # This changes the way we iterate between one review and another
                # As it adds an extra <li> block to be counted
                # It indicates how many photos the user has already posted on Yelp
                reviews_have_photos = len(reviews_in_page[iterator].find_all(class_="photo-count responsive-small-display-inline-block")) > 0
                # Initializing an empty dictionary to contain the review
                review = {}
                # First filling it with the user's origin
                review["user_origin"] = reviews_in_page[iterator].find_all("b")[0].text
                # Sometimes Yelp uses Membre Qype and Membre Cityvox in the alias of a given user
                # When this is the case, we have to ignore the first response for the user origin and
                # Start capturing from the second entry (<b> blocks)
                if review["user_origin"] == "Membre Qype" or review["user_origin"]=="Membre Cityvox":
                    review["user_origin"] = reviews_in_page[iterator].find_all("b")[1].text
                    review["user_friends_count"] = reviews_in_page[iterator].find_all("b")[2].text
                    review["user_reviews_count"] = reviews_in_page[iterator].find_all("b")[3].text
                else:
                    review["user_friends_count"] = reviews_in_page[iterator].find_all("b")[1].text
                    review["user_reviews_count"] = reviews_in_page[iterator].find_all("b")[2].text
                # Indicates this is a fake review (to use when comparing with real reviews)
                review["is_fake"] = True
                review["date"] = reviews_in_page[iterator].find_all("span", class_="rating-qualifier")[0].text.strip()
                review["rest_alias"] = f"{row['alias']}"
                review["text"] = reviews_in_page[iterator].find_all("p")[0].text
                review["rating"] = reviews_in_page[iterator].find_all("img", class_="offscreen")[0].attrs["alt"].split(" ")[0]
                review["has_img"] = reviews_in_page[iterator].find_all("img", class_="offscreen")[0].attrs["src"] != URL_DEFAULT_IMG
                review["reviews_have_photos"] = reviews_have_photos
                # Add it to the dataframe
                # (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
                df_reviews = pd.concat([df_reviews, pd.DataFrame([review])], ignore_index=True)
                # Update the iterator
                if reviews_have_photos:
                    iterator += 6
                else:
                    iterator += 5
        # This never happens, but we wanted to make a stable program
        except KeyError:
            progress+=1
            break
        # All other exceptions are assumed to be related to connections to Yelp's website
        # We solve them by changing to another country
        except Exception:
            country = random.choice(countries)
            # Runs on the terminal
            ! nordvpn connect {country}
            time.sleep(5)
        # If there are no exceptions, we update the progress counter
        # Also updating the total number of reviews retrieved and printing the status
        else:
            progress += 1
            total_reviews += 10 if not less10 else number_fake_reviews
            print(f"\rProgress: row {progress}/{row_count}, accessing from: {country}, Total reviews: {total_reviews}", end="")
            break

# Saving the dataframe to a csv file
df_reviews.to_csv("datasets/raw_fake_reviews.csv")
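
The index arithmetic used to walk the `<li>` elements can be looked at on its own. The sketch below reuses the strides from the loop above (5 elements per review, 6 when the photo icon is present); `review_start_indices` is a hypothetical helper and the flags in the example are invented.

```python
def review_start_indices(has_photo_icon, step_plain=5, step_with_icon=6):
    """Map each review to the index of its first <li>, given per-review icon flags."""
    indices, position = [], 0
    for flag in has_photo_icon:
        indices.append(position)
        # The photo icon adds one extra <li> to the current review
        position += step_with_icon if flag else step_plain
    return indices

print(review_start_indices([False, True, False]))
```

Since the stride depends on the page layout, a change in Yelp's markup would shift every index after the first, which is one reason to validate a few scraped reviews by hand.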

## 3.2) Notes on Obtained Dataset

The data saved in "raw_fake_reviews.csv" contains the following columns:

- **user_origin**: Origin of the user that posted the review
- **user_friends_count**: Number of friends the user has
- **user_reviews_count**: Number of reviews the user has already posted
- **is_fake**: Tag that indicates if the given review is fake or not
- **date**: Date on which the review was posted
- **rest_alias**: Alias of the restaurant the review refers to
- **text**: The text of the review
- **rating**: The rating attributed on the analysed review
- **has_img**: Tag that indicates if the user has a profile image
- **reviews_have_photos**: Tag that indicates if the user usually posts photos on Yelp (for other reviews)

# Part 4 - Extraction of real reviews

For this last part of the extraction, we were able to use a hidden API from Yelp's website.

We iterate over different parameters and we are able to retrieve all the real reviews from Yelp for the restaurants obtained in Part 1.

At the end, we were able to retrieve a total of 226,143 real reviews after 18h48min26s of execution.

## 4.1) Reviews Extraction

In [None]:
# Reads the dataset containing information from the obtained restaurants
df_restaurants = pd.read_csv("./datasets/processed_restaurants.csv", index_col=0)

# Defines the time that the process started
start = time.time()

# Defines the reviews dataframe
df_reviews = pd.DataFrame()

# Iterating through the rows of the dataframe
for idx, row in df_restaurants.iterrows():
    for rl in ["en", "fr"]:
    
        # Sets the HTTP request
        url = f"https://www.yelp.com/biz/{row['id']}/review_feed"
        params = {
            "rl": rl,
            "sort_by": "relevance_desc",
            "start": 0
        }

        while True:

            print(f"\rProgress: restaurants {row.name+1}/{df_restaurants.shape[0]}, reviews {df_reviews.shape[0]}, time taken {timedelta(seconds=time.time()-start)}", end="")

            # Makes the HTTP request
            try:
                r = requests.get(url, params=params)

                # Good response from the API
                if r.status_code == 200:

                    # Obtains the reviews of this page
                    reviews = r.json()["reviews"]

                    # Still have reviews from this restaurant
                    if reviews:
                        df_reviews = pd.concat([pd.DataFrame(reviews), df_reviews], ignore_index=True)
                        df_reviews["is_fake"] = False
                        params["start"] += 20
                
                    # Overflow on restaurant's reviews (go to the next restaurant)
                    else:
                        break
                elif r.status_code == 503:
                    raise Exception
            # Our IP got blocked from the API
            except Exception:
                country = random.choice(countries)
                ! nordvpn connect {country} # Runs on the terminal
                time.sleep(5)


# Saves the obtained reviews into multiple csv files
chunks = np.array_split(df_reviews, 7)

for i in range(len(chunks)):
    chunks[i].to_csv("datasets/raw_reviews_{}.csv".format(i))
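
The paging pattern used for the review feed (advance `start` by 20 until an empty page comes back) can be sketched without any network access. `collect_all` and `fake_fetch` are hypothetical names; the fake endpoint simply slices an in-memory list.

```python
def collect_all(fetch_page, page_size=20):
    """Request pages at start=0, 20, 40, ... until an empty page comes back."""
    collected, start = [], 0
    while True:
        page = fetch_page(start)
        if not page:
            return collected
        collected.extend(page)
        start += page_size

# Fake endpoint holding 45 reviews, served 20 at a time
fake_reviews = [f"review {i}" for i in range(45)]

def fake_fetch(start):
    return fake_reviews[start:start + 20]

print(len(collect_all(fake_fetch)))
```

Collecting into a plain list and building the dataframe once at the end is usually cheaper than concatenating dataframes on every page.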

## 4.2) Notes on Obtained Dataset

The data saved in the "raw_reviews_{i}.csv" files (one per chunk, i from 0 to 6) contains the following columns:

- **comment**: The text of the review (can be extracted from the json format)
- **rating**: The rating attributed on the analysed review
- **photosUrl**: Internal URL to Yelp (remembering we extracted it from a hidden API)
- **feedback**: User's return on Yelp's standard reactions to a given restaurant (useful, funny, cool...)
- **business**: JSON containing information about the restaurant the review refers to
- **localizedDateVisited**: Empty column. Would represent the date from the user's visit to the restaurant
- **businessOwnerReplies**: Replies from the business owner to the given review
- **userId**: Unique identifier code for the user
- **previousReviews**: All the previous reviews from the given user
- **lightboxMediaItems**: Important JSON containing information such as the number of reviews the user has already written and the number of friends (easy to access because it is in JSON format)
- **photos**: Posted photos on the given review
- **tags**: Tags returned by the API. They are redundant with other information described above, and sometimes they describe internal properties of the review
- **isUpdated**: Indicates that the present review is an update of a previous review of the same user on the same restaurant
- **user**: JSON containing the user's information
- **appreciatedBy**: Contains information about users who found the given review helpful
- **totalPhotos**: Total number of photos posted for the given review
- **id**: Review's unique id
- **localizedDate**: Date when the review was posted
- **is_fake**: Tag that indicates whether the review is fake (in this case, it is false for real reviews)
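
Several of the columns above hold a nested structure, which is written to CSV as the text of a Python dict. The sketch below shows one way to pull a single value back out. `extract_field` is a hypothetical helper, and the keys `reviewCount` and `friendCount` are invented for the example; the real key names in the files may differ.

```python
import ast

def extract_field(cell, key, default=None):
    """Parse a stringified dict read back from CSV and return one of its values."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return default
    return value.get(key, default) if isinstance(value, dict) else default

# Hypothetical cell content; the real keys of the nested columns may differ
cell = "{'reviewCount': 12, 'friendCount': 3}"
print(extract_field(cell, "reviewCount"), extract_field(cell, "missing"))
```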

# Final remarks

At the end of the process of extraction we were able to retrieve a good amount of information to be used in our analysis.

Some of our data (especially the last part, which was captured from Yelp's hidden API) needs to be reformatted to better serve the analysis. We expect this to be a simple task, as we have already chosen the format we want for each column: some contain the value directly, while others contain JSON, which allows for easy value extraction.

This will be finished before the analysis and visualization part.

We are proud of the presented work overall, as it let us test a broad set of the tools learned in class. We are also excited about the future work on the analysis of the collected data and the presentation of useful visualization ideas.