# Restaurant Reviews Analysis on Yelp



In [7]:
# Import libraries
import requests
import json
import time
import pandas as pd
from bs4 import BeautifulSoup

import seaborn as sns
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import numpy as np

import string

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

import pickle
from sklearn.svm import SVC

## Data Scraping

Yelp has a large dataset of restaurants throughout the United States. However, we are more interested in the restaurants around us and want to get insights from their reviews. We therefore retrieved around 1000 restaurants near Mountain View, CA 94043 from Yelp as our dataset.

We scraped the data in two stages:

1. Retrieving restaurant list from Yelp API
2. Retrieving review list from web scraping

### Retrieving restaurant list from Yelp API

Following the [Yelp API](https://www.yelp.com/developers/documentation/v3) documentation, we used the business search endpoint to get a list of restaurants around a given location.

Here we read the [Yelp API Key](https://www.yelp.com/developers/v3/manage_app) from a local file named `api_key.txt` and provide functions to retrieve a restaurant list for a given location.

In [None]:
# Load API key from file 
with open('api_key.txt', 'r') as f:
    api_key = f.read().replace('\n','')

In [None]:
def extract_info_for_business(row):
    return {'name': row['name'], 'id': row['id'], 'review_count': int(row['review_count']), 'url': row['url']}

def scrape_restaurant_list(api_key, location, max_num=1000):
    """ Retrieve a restaurant list based on location from Yelp API
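        Args:
            api_key(string): Yelp API key, sent as a Bearer token
            location(string): location to search around, e.g. a ZIP code such as '94043'
            max_num(integer): maximum number of restaurants to retrieve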
        Returns:
            businesses_list(list): a restaurant list with name, id, review_count, url for each restaurant
    """   
    payload = {'categories': 'restaurants', 'location': location, 'limit': 20, 'offset': 0}
    basic_url = 'https://api.yelp.com/v3/businesses/search'
    headers = {'Authorization': 'Bearer ' + api_key}
    response = requests.get(basic_url, params=payload, headers=headers)
    res_json = response.json()
    total = res_json['total']
    businesses_list = list(map(extract_info_for_business, res_json['businesses']))
    while len(businesses_list) < total and len(businesses_list) < max_num:
        time.sleep(0.3)
        payload['offset'] = len(businesses_list)
        response = requests.get(basic_url, params=payload, headers=headers)
        res_json = response.json()
        # Stop when the API returns an error or an empty page
        if not res_json.get('businesses'):
            break
        businesses_list.extend(map(extract_info_for_business, res_json['businesses']))
    
    return businesses_list
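
The offset-based paging used above can be checked offline. The following is a minimal sketch, not the code we ran: `collect_paginated` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call by serving 45 dummy businesses in a response shaped like `{'total': ..., 'businesses': [...]}`.

```python
def collect_paginated(fetch_page, limit=20, max_num=1000):
    """Collect items page by page until the source is exhausted or max_num is reached.

    fetch_page(offset, limit) is a hypothetical callable returning a dict
    shaped like the search response: {'total': int, 'businesses': list}.
    """
    items = []
    total = None
    while total is None or (len(items) < total and len(items) < max_num):
        page = fetch_page(len(items), limit)  # next offset = items collected so far
        total = page.get('total', 0)
        batch = page.get('businesses', [])
        if not batch:
            break  # empty page: stop instead of looping forever
        items.extend(batch)
    return items[:max_num]


def fake_fetch(offset, limit, total=45):
    """Stand-in for the API call: serves 45 numbered dummy businesses."""
    ids = range(offset, min(offset + limit, total))
    return {'total': total, 'businesses': [{'id': i} for i in ids]}


all_items = collect_paginated(fake_fetch)
capped_items = collect_paginated(fake_fetch, max_num=30)
print(len(all_items), len(capped_items))  # 45 30
```

Unlike `scrape_restaurant_list`, this sketch trims the result to `max_num`, so a final page that overshoots the cap is cut off.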

For our project, we retrieved a list of 1000 restaurants for further analysis. As shown below, each restaurant in the list contains the following information:

1. `name`: restaurant's name
2. `id`: restaurant's id
3. `review_count`: total number of reviews for that restaurant
4. `url`: URL of the restaurant's page

In [None]:
blist = scrape_restaurant_list(api_key, '94043')

print('Total:', len(blist))
# print('Total: 1000')
print(blist[0])
# print("{'name': \"The Sea by Alexander's Steakhouse\", 'id': 'P1eEPolk9EDGqVn1Jyncww', 'review_count': 874, 'url': 'https://www.yelp.com/biz/the-sea-by-alexanders-steakhouse-palo-alto?adjust_creative=6RD6nFOw75PxaCjeWnG24Q&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=6RD6nFOw75PxaCjeWnG24Q'}")

### Retrieving review list from web scraping

The Yelp API also provides a way to retrieve reviews for a restaurant by its id. In practice, however, we found that the reviews returned by the API are cut off after a certain length. We therefore scraped the reviews directly from the restaurant's web page.

Here we provide a function that parses the reviews on a single page of a restaurant and stores each review with the following fields:

1. `review_id`: review id
2. `user_id`: user id
3. `rating`: rating from 1.0 to 5.0
4. `date`: date the review was created
5. `text`: content of the review
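
Two of these fields are read from HTML attributes and need a little string handling, which is how `parse_page` extracts them. The sketch below isolates that step. The helper names `parse_rating` and `parse_user_id` are hypothetical, and the sample attribute strings are assumptions about what the page contains (a title that starts with the numeric rating, and a signup attribute prefixed with `user_id:`).

```python
def parse_rating(title):
    """Turn a star-rating title such as '5.0 star rating' into a float."""
    return float(title.split()[0])


def parse_user_id(signup_object, prefix='user_id:'):
    """Return the part after 'user_id:', or None if the prefix is missing."""
    if signup_object.startswith(prefix):
        return signup_object[len(prefix):]
    return None


# Sample attribute values; the exact strings are assumptions for illustration
print(parse_rating('5.0 star rating'))                  # 5.0
print(parse_user_id('user_id:eBXvTpaU4KPf2Busy_xEZA'))  # eBXvTpaU4KPf2Busy_xEZA
print(parse_user_id('biz_id:123'))                      # None
```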

In [None]:
def parse_page(html):
    """ Parse the reviews on a single page of a restaurant.
        Args:
            html (string): String of HTML corresponding to a Yelp restaurant
        Returns:
            tuple(list, string): a tuple of two elements
                first element: list of dictionaries corresponding to the extracted review information
                second element: URL for the next page of reviews (or None if it is the last page)
    """
    review_list = []
    soup = BeautifulSoup(html, 'html.parser')
    for review_block in soup.find_all('div', attrs={'class': 'review review--with-sidebar'}):
        review_id = user_id = rating = date = text = None
        if 'data-review-id' in review_block.attrs:
            review_id = review_block['data-review-id']
        if 'data-signup-object' in review_block.attrs and review_block['data-signup-object'].startswith('user_id:'):
            user_id = review_block['data-signup-object'][8:]
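        rating_div = review_block.find('div', attrs={'class': 'i-stars'})
        # Guard against review blocks that have no star-rating element
        if rating_div is not None and 'title' in rating_div.attrs:
            rating = float(rating_div['title'].split()[0])
        date_span = review_block.find('span', attrs={'class': 'rating-qualifier'})
        if date_span:
            date = date_span.getText().strip()
        review_content = review_block.find('div', attrs={'class': 'review-content'})
        # Only read the text when the content block actually contains a paragraph
        paragraph = review_content.find('p') if review_content is not None else None
        if paragraph is not None:
            text = paragraph.getText()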
        if review_id and user_id and rating and date and text:
            review_list.append({
                'review_id': review_id,
                'user_id': user_id,
                'rating': rating,
                'date': date,
                'text': text
            })
    next_link = None
    next_ele = soup.find('a', attrs={'class': 'u-decoration-none next pagination-links_anchor'})
    if next_ele and 'href' in next_ele.attrs:
        next_link = next_ele['href']
    return review_list, next_link

Here we provide another function that collects all reviews for a list of restaurants and stores them in `reviews.csv` for future use.

In [None]:
def scrape_all_reviews(restaurants):
    """ Scrape all reviews from a restaurant list and store them into reviews.csv
        Args:
            restaurants(list): a restaurant list with url for each restaurant
    """
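    header_written = False
    for restaurant in restaurants:
        reviews = []
        url = restaurant['url']
        # Follow the "next page" links until the last page of reviews
        while url is not None:
            response = requests.get(url)
            reviews_in_page, url = parse_page(response.content)
            reviews.extend(reviews_in_page)
        if not reviews:
            continue  # nothing to write for this restaurant
        df = pd.DataFrame(reviews)
        # Append to the shared file; write the header row only once
        df.to_csv('reviews.csv', mode='a', header=not header_written)
        header_written = True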
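
When per-restaurant batches are appended to one CSV file, the header row should appear only once, otherwise header lines end up mixed in with the data. The following is a minimal sketch of one way to do this, assuming a hypothetical helper `append_reviews` and made-up review batches. It writes to a temporary file rather than `reviews.csv`, and it drops the index column (`index=False`), which differs from the function above.

```python
import os
import tempfile

import pandas as pd


def append_reviews(reviews, path):
    """Append a batch of review dicts to a CSV, writing the header only if the file is new."""
    if not reviews:
        return  # an empty batch would otherwise write a blank header line
    write_header = not os.path.exists(path)
    pd.DataFrame(reviews).to_csv(path, mode='a', header=write_header, index=False)


# Two made-up batches standing in for two restaurants, plus one empty batch
batch_a = [{'review_id': 'r1', 'rating': 5.0}, {'review_id': 'r2', 'rating': 4.0}]
batch_b = [{'review_id': 'r3', 'rating': 3.0}]

path = os.path.join(tempfile.mkdtemp(), 'reviews_demo.csv')
for batch in (batch_a, [], batch_b):
    append_reviews(batch, path)

demo = pd.read_csv(path)
print(demo.shape)          # (3, 2)
print(list(demo.columns))  # ['review_id', 'rating']
```

Checking for the file on disk instead of keeping a flag makes the helper safe to call again after an interrupted run, at the cost of requiring that an old file be deleted before a fresh scrape.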

In [None]:
scrape_all_reviews(blist)

In `reviews.csv`, each row represents a review with its id, user id, rating, date and text, as described above. Here is what our dataset looks like.

In [13]:
df = pd.read_csv('reviews.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | date | rating | review_id | text | user_id |
|---|---|---|---|---|---|---|
| 0 | 0 | 4/30/2018 | 5.0 | 0Si_T9jAoYGhcQbdkMa8iQ | Had an amazing four course dinner here! 1. Wil... | eBXvTpaU4KPf2Busy_xEZA |
| 1 | 1 | 4/4/2018 | 4.0 | du-a3LYobeY7FBM3nF-3Jg | A Saturday date night with my boyfriend - and ... | a-F64OPbsaI3Ab1imMyLAw |
| 2 | 2 | 4/2/2018 | 5.0 | RKcin7_HV7ZmNUFn364DPg | Was able to get a seat at the bar at 1700 with... | YxX2NGb_gIhCn5c8ZbQL9w |
| 3 | 3 | 4/15/2018 | 4.0 | zFQnFy3il8b6NwCFva6Evg | Tasting menu here is delicious! Alexander's pa... | mcG523uA11CIk8OP4ieRGQ |
| 4 | 4 | 3/1/2018 | 5.0 | H2OJVI0ZEh2h5OHyYJGT_g | One of the best fine dining restaurants I've b... | NFo1WMzgrt_1Jv_DrdQzAw |


During the data scraping phase, we noticed something interesting in Yelp's data. We retrieved around 1000 restaurants from the Yelp API, but when we retrieved their reviews, we found that a large number of restaurants had no reviews at all: only around 250 restaurants from that list had any.

**Why is that?**
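
One possible first step toward an answer is to separate restaurants that genuinely have no reviews from those where the API reports reviews but the scraper returned none (for example because the page layout did not match the selectors). The sketch below shows the comparison on made-up data. `find_suspicious_restaurants` is a hypothetical helper, and `scraped_counts` (restaurant id mapped to the number of reviews scraped) is an assumption, since the scraping function above does not record it.

```python
def find_suspicious_restaurants(restaurants, scraped_counts):
    """Return restaurants whose API review_count is positive but for which nothing was scraped.

    restaurants: list of dicts with 'id' and 'review_count' (as returned by the API step)
    scraped_counts: hypothetical dict mapping restaurant id -> number of reviews scraped
    """
    return [
        r for r in restaurants
        if r['review_count'] > 0 and scraped_counts.get(r['id'], 0) == 0
    ]


# Made-up data for illustration only
restaurants = [
    {'id': 'a', 'review_count': 874},
    {'id': 'b', 'review_count': 0},
    {'id': 'c', 'review_count': 12},
]
scraped_counts = {'a': 860, 'c': 0}

suspicious = find_suspicious_restaurants(restaurants, scraped_counts)
print([r['id'] for r in suspicious])  # ['c']
```

A long list of such restaurants would point to the scraper rather than to Yelp's data. This check only narrows down where to look and does not by itself explain the gap.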

## Data Analysis

### General Observations

### Word Cloud