# Introduction

This notebook collects restaurant data for Greater Vancouver from Zomato's website. The output is a set of CSV files, one per district (or city) within Greater Vancouver, each containing restaurant information such as name, rating, and cuisine type.

# Table of Contents

<a href='#1.0'><h3><b>1.0 Importing Necessary Libraries</b></h3></a><br>
<a href='#2.0'><h3><b>2.0 Getting URLs and Names of Each City</b></h3></a><br>
<a href='#3.0'><h3><b>3.0 Building Functions</b></h3></a><br>
<a href='#3.1'><h4><b>3.1 Page Scroller</b></h4></a><br>
<a href='#3.2'><h4><b>3.2 Collecting Restaurant Data</b></h4></a><br>
<a href='#3.3'><h4><b>3.3 Collecting Data From Every City</b></h4></a><br>
<a href='#3.4'><h4><b>3.4 Collecting Data in Given City</b></h4></a><br>
<a href='#4.0'><h3><b>4.0 Executing Functions</b></h3></a><br>
<a href='#4.1'><h4><b>4.1 From Every City</b></h4></a><br>
<a href='#4.2'><h4><b>4.2 For a Particular City</b></h4></a><br>

<a id='1.0'></a>
## 1.0 Importing Necessary Libraries

In [1]:
import re
from bs4 import BeautifulSoup # For HTML parsing
import requests # Website connections
from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
import pandas as pd # For converting results to a dataframe and bar chart plots
import json # For parsing json
import random # To randomize the web scraping timing to mitigate the risk of getting blocked/banned
%matplotlib inline
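
The `random` and `sleep` imports above are intended to space out requests so the server is not overwhelmed. A minimal sketch of such a randomised pause follows; the helper name `polite_pause`, its default bounds, and the `sleeper` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import random
from time import sleep


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleeper=sleep):
    """Sleep for a random duration between min_seconds and max_seconds.

    Hypothetical helper: the name and default bounds are illustrative only.
    Returns the chosen delay so callers can log it.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleeper(delay)
    return delay


# Example: a very short pause, just to show the call
waited = polite_pause(0.0, 0.01)
print(f'Paused for {waited:.4f} seconds')
```

Calling something like this between page requests varies the timing from one request to the next, which is the idea behind the commented-out `sleep(random.random()*2+1)` lines later in the notebook.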

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

<a id='2.0'></a>
## 2.0 Getting URLs and Names of Each City

In [3]:
# Building the beautiful soup scraper
url='https://www.zomato.com/vancouver'
agent = {"User-Agent":"Mozilla/5.0"}
scrape=requests.get(url, headers=agent)
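# naming the parser explicitly so the result does not depend on which parsers are installed
scraper=BeautifulSoup(scrape.content, 'html.parser')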

In [4]:
# finding the location container that contains all the districts in Vancouver
location_container=scraper.find_all("div", {"class": "sc-bke1zw-0"})[2].find_all('div', class_='sc-bke1zw-1')
# sanity checking to see if the number in the list is the same as the number of location containers
len(location_container)

30

In [5]:
#finding the links for the locations
location_urls=[]
location_names=[]
for location in location_container:
    try:
        location_url=location.a.get('href')
        location_urls.append(location_url)
    except Exception as e:
        location_urls.append('unknown link')
    # collecting each of the location names
    try:
        location_name=location.find('h5').get_text().split('(')[0].strip(' ') 
        location_names.append(location_name)
    except Exception as e:
        location_names.append('unknown location name')

<a id='3.0'></a>
## 3.0 Building Functions

<a id='3.1'></a>
### 3.1 Page Scroller

In [6]:
# Creating the scrolling to bottom of page function
# credits go to https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python
def ScrollToBottomOfPage(url, loadingtime, maximumtime):
    driver=webdriver.Chrome(r'C:/Users/Nathan Ling/Data Science Work/Week 4 Numpy and Web Scraping/chromedriver.exe')
    driver.get(url)
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    loop=0
    #prevent infinite scrolling if page ends up bottomless
    while loop<=int(maximumtime/loadingtime):
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        sleep(loadingtime)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        print(f'New height is {new_height} while previous height was {last_height}')
        if new_height == last_height:
            break
        last_height = new_height
        loop+=1
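    scraper=BeautifulSoup(driver.page_source, 'html.parser')
    # close the browser so that repeated calls do not leave Chrome windows open
    driver.quit()
    return scraper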
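
The stopping rule of the scroller can be checked without a browser. The sketch below isolates the loop: `get_height` and `scroll_down` are hypothetical stand-ins for the two `driver.execute_script` calls, and the simulated heights are the ones printed in the Port Coquitlam run at the end of this notebook. It is a simplified illustration, not the function used above.

```python
def scroll_until_stable(get_height, scroll_down, max_loops):
    """Scroll until the height stops changing or max_loops scrolls were made.

    get_height and scroll_down are stand-ins for the Selenium calls.
    Returns the heights observed after each scroll.
    """
    last_height = get_height()
    observed = []
    for _ in range(max_loops):
        scroll_down()
        new_height = get_height()
        observed.append(new_height)
        if new_height == last_height:
            # nothing new was loaded, so the bottom has been reached
            break
        last_height = new_height
    return observed


# Simulated page that grows with every scroll until it runs out of content
heights = [3449, 5078, 6666, 9820, 12932, 16107, 16972, 16972]
state = {'i': 0}


def fake_get_height():
    return heights[state['i']]


def fake_scroll_down():
    # each scroll reveals the next height, capped at the final one
    state['i'] = min(state['i'] + 1, len(heights) - 1)


observed = scroll_until_stable(fake_get_height, fake_scroll_down, max_loops=600)
print(observed)
```

The loop ends on the first scroll that leaves the height unchanged, and `max_loops` caps the work when a page keeps growing indefinitely.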

<a id='3.2'></a>
### 3.2 Collecting Restaurant Data

In [7]:
def extract_restaurant_info(location_number, restaurant_containers):
    
    restaurant_names=[]
    restaurant_cuisine_types=[]
    restaurant_ratings=[]
    restaurant_locations=[]
    restaurant_prices=[]
    addresses=[]
    phone_numbers=[]
    num_reviews=[]
    num_photos=[]
    links=[]
    
    # Getting all the data within the restaurant containers
    for restaurant in restaurant_containers:
        # get restaurant names
        try:
            restaurant_name=restaurant.h4.get_text()
            restaurant_names.append(restaurant_name)
        except:
            restaurant_names.append('unknown')
        # get restaurant cuisine type
        try:
            restaurant_cuisine_type=restaurant.find_all('p')[1].get_text()
            restaurant_cuisine_types.append(restaurant_cuisine_type)
        except:
            restaurant_cuisine_types.append('unknown')
        # get restaurant rating
        try:
            restaurant_rating=restaurant.find('div',class_='sc-1q7bklc-1 cILgox').get_text()
            restaurant_ratings.append(restaurant_rating)
        except:
            restaurant_ratings.append('unknown')
        # get restaurant location
        try:
            restaurant_location=restaurant.find_all('p')[3].get_text(strip=True).split(',')[-2].strip()
            restaurant_locations.append(restaurant_location)
        except:
            restaurant_locations.append('unknown')
        # get restaurant prices
        try:
            restaurant_price=restaurant.find_all('p')[2].get_text(strip=True)
            restaurant_prices.append(restaurant_price)
        except:
            restaurant_prices.append('unknown')
        # get the links of each restaurant
        try:
            restaurant_link=restaurant.find('a').get('href')
            links.append('https://www.zomato.com'+restaurant_link)
        except:
            links.append('unknown link')
            
    # Finding the last restaurant that belongs to the requested location, so that the slow
    # per-restaurant requests below stop there instead of covering the entire results page
    target_location=location_names[location_number]
    try:
        last_restaurant=len(restaurant_locations)-restaurant_locations[::-1].index(target_location)
    except ValueError:
        # no restaurant on the page matched the requested location
        last_restaurant=0
    print(f'Last relevant restaurant in {target_location} in the search query is {last_restaurant}')

    # Getting the data within each restaurant link that could not be obtained in the restaurant container
    agent = {"User-Agent":"Mozilla/5.0"}
    for j, link in enumerate(links[:last_restaurant]):
        try:
            scrape=requests.get(link, headers=agent)
            scraper=BeautifulSoup(scrape.content, 'html.parser')
        except Exception:
            # a failed request must not reuse the previous restaurant's page
            scraper=None
        # getting the address from each restaurant
        try:
            address=scraper.find('p', class_="sc-1hez2tp-0 clKRrC").get_text()
            addresses.append(address)
        except Exception:
            addresses.append('N/A')
        # getting the phone # from each restaurant
        try:
            phone_number=scraper.find('p', class_="sc-1hez2tp-0 fanwIZ").get_text()
            phone_numbers.append(phone_number)
        except Exception:
            phone_numbers.append('N/A')
        # getting the # reviews from each restaurant
        try:
            num_review=int(scraper.find('div',class_='sc-1q7bklc-8 kEgyiI').get_text())
            num_reviews.append(num_review)
        except Exception:
            num_reviews.append(0)
        # getting the # photos from each restaurant
        try:
            photoscrape=requests.get(link.replace('info','photos'), headers=agent)
            photoscraper=BeautifulSoup(photoscrape.content, 'html.parser')
        except Exception:
            photoscraper=None
        try:
            num_photo=int(photoscraper.find_all('span',{'class':'sc-1kx5g6g-2 kBKJxB'})[0].find('span').get_text().split(' ')[-1].strip('()'))
            num_photos.append(num_photo)
        except Exception:
            num_photos.append(0)
        # Optional pause to reduce the risk of getting banned
        #sleep(random.random()*2+1)
        # progress report, since this step is slow and tedious
        if (j+1)%10==0:
            print(f'{j+1}th restaurant collected')
    return restaurant_locations, restaurant_names, restaurant_ratings, restaurant_prices, restaurant_cuisine_types, addresses, phone_numbers, num_reviews, num_photos, last_restaurant 
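
The cut-off in `extract_restaurant_info` relies on finding the last position at which the requested location appears in the list of scraped locations. A minimal sketch of that calculation on made-up data is shown below; the helper name `last_matching_position` and the sample list are assumptions for illustration.

```python
def last_matching_position(values, target):
    """Return the 1-based position of the last occurrence of target, or 0 if absent."""
    try:
        # searching the reversed list finds the last occurrence first
        return len(values) - values[::-1].index(target)
    except ValueError:
        return 0


# Made-up search results: nearby cities are mixed in with the requested one
locations = ['Port Coquitlam', 'Port Coquitlam', 'Coquitlam', 'Port Coquitlam', 'Coquitlam']
cutoff = last_matching_position(locations, 'Port Coquitlam')
print(cutoff)              # 4
print(locations[:cutoff])  # everything up to and including the last match
```

Because the result is a 1-based position, it can be used directly as a slice end, which is how the `[:last_restaurant]` slices in the following sections trim the lists.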

<a id='3.3'></a>
### 3.3 Collecting Data From Every City

Data from every restaurant for every city in the Greater Vancouver area will be collected.

In [8]:
def collect_restaurant_data_in_each_city(location_urls, location_names, loadtime, maxtime):
    
    for location_number,url in enumerate(location_urls):
        
        # Load the entire page of that particular location. Skips locations where the url doesn't work.
        # The page is opened only once, inside ScrollToBottomOfPage, so no extra browser window is left open.
        try:
            scraper=ScrollToBottomOfPage(url, loadtime, maxtime)
        except Exception as e:
            print(f'Skipping {location_names[location_number]}: {e}')
            continue
        
        # Collect all the restaurant containers
        restaurant_containers=scraper.find_all('div',{'class':'jumbo-tracker'})
        print(f'Total number of returned restaurants in {location_names[location_number]} is {len(restaurant_containers)}')

        # Collecting the data
        restaurant_locations, restaurant_names, restaurant_ratings, restaurant_prices, restaurant_cuisine_types, addresses, phone_numbers, num_reviews, num_photos, last_restaurant=extract_restaurant_info(location_number, restaurant_containers)

        #Transforming data to a dict form
        DataInDictForm={'Location':restaurant_locations[:last_restaurant],'Name':restaurant_names[:last_restaurant],
                        'Rating':restaurant_ratings[:last_restaurant],'Price Range':restaurant_prices[:last_restaurant],
                        'Cuisine Type':restaurant_cuisine_types[:last_restaurant], 'Address':addresses, 
                        'Phone Number':phone_numbers, 'Num Reviews':num_reviews, 'Num Photos':num_photos}

        # changing the data to a pandas dataframe
        df=pd.DataFrame(DataInDictForm)

        # exporting the data to a csv file
        df.to_csv(f'raw data/{url.split("/")[-1].replace("restaurants","data")}.csv')
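
The export step writes into a `raw data` folder and derives the file name from the last segment of the location URL. `to_csv` does not create missing folders, so the sketch below shows one way to build the path and create the folder first. The helper name `csv_path_for` and the example URL shape are assumptions for illustration; the real Zomato URLs may differ.

```python
import tempfile
from pathlib import Path


def csv_path_for(url, output_dir='raw data'):
    """Build the CSV path from the last URL segment, creating output_dir if needed."""
    folder = Path(output_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # same renaming rule as the notebook: '...-restaurants' becomes '...-data'
    stem = url.rstrip('/').split('/')[-1].replace('restaurants', 'data')
    return folder / f'{stem}.csv'


# Hypothetical URL shape, used only for illustration
example_url = 'https://www.zomato.com/vancouver/port-coquitlam-restaurants'

# A temporary directory keeps the example from touching the working folder
with tempfile.TemporaryDirectory() as tmp:
    path = csv_path_for(example_url, output_dir=Path(tmp) / 'raw data')
    folder_created = path.parent.is_dir()
    print(path.name, folder_created)
```

The returned path could then be passed to `df.to_csv(...)` in place of the f-string.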

<a id='3.4'></a>
### 3.4 Collecting Data in Given City

Data from every restaurant within a particular city in the Greater Vancouver area will be collected (e.g. South Surrey).

In [9]:
def collect_restaurant_data_per_city(location, location_urls, location_names, loadtime, maxtime):
    
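    # Normalising the user input and the known names in the same way so that they can be compared.
    # Only the standalone word 'and' is treated as '&'; replacing every 'and' would break names such as 'Grandview'.
    location_standardized=re.sub('[ ,/&-]','',re.sub(r'\band\b','&',location.lower()))
    location_names_standardized=[re.sub('[ ,/&-]','',re.sub(r'\band\b','&',location_name.lower())) for location_name in location_names]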
    # Do not return anything if the user input location does not exist
    if location_standardized not in location_names_standardized:
        print('The location you have entered does not exist.')
    # Only if the user input location exists do we extract the restaurant information
    else:
        print('The location you have entered exists.')
        location_number=location_names_standardized.index(location_standardized)
        url=location_urls[location_number]
        
        # Load the entire page of that particular location
        scraper=ScrollToBottomOfPage(url, loadtime, maxtime)
        
        # Collect all the restaurant containers
        restaurant_containers=scraper.find_all('div',{'class':'jumbo-tracker'})
        print(f'Total number of returned restaurants in {location_names[location_number]} is {len(restaurant_containers)}')

        
        # Collecting the data
        restaurant_locations, restaurant_names, restaurant_ratings, restaurant_prices, restaurant_cuisine_types, addresses, phone_numbers, num_reviews, num_photos, last_restaurant=extract_restaurant_info(location_number, restaurant_containers)

        #Transforming data to a dict form
        DataInDictForm={'Location':restaurant_locations[:last_restaurant],'Name':restaurant_names[:last_restaurant],
                        'Rating':restaurant_ratings[:last_restaurant],'Price Range':restaurant_prices[:last_restaurant],
                        'Cuisine Type':restaurant_cuisine_types[:last_restaurant], 'Address':addresses, 
                        'Phone Number':phone_numbers, 'Num Reviews':num_reviews, 'Num Photos':num_photos}

        # changing the data to a pandas dataframe
        df=pd.DataFrame(DataInDictForm)

        # exporting the data to a csv file
        df.to_csv(f'raw data/{url.split("/")[-1].replace("restaurants","data")}.csv')
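
Matching the user's input against the known location names only works if both sides are normalised identically. The sketch below uses a hypothetical helper, `normalise_location`, that lower-cases the name, treats only the standalone word "and" as "&", and then drops separators; the three sample names come from the location list printed in section 4.2.

```python
import re


def normalise_location(name):
    """Lower-case a name, map the standalone word 'and' to '&', then drop separators."""
    lowered = name.lower()
    # \b keeps the 'and' inside words such as 'Grandview' untouched
    lowered = re.sub(r'\band\b', '&', lowered)
    return re.sub(r'[ ,/&-]', '', lowered)


known = ['Grandview', 'Riley Park & Little Mountain', 'Hastings-Sunrise']
lookup = {normalise_location(name): name for name in known}

for user_input in ['grandview', 'riley park and little mountain', 'hastings sunrise']:
    print(user_input, '->', lookup.get(normalise_location(user_input)))
```

A plain `str.replace('and', '&')` would turn "grandview" into "gr&view", which is why the word boundary matters here.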

<a id='4.0'></a>
## 4.0 Executing Functions

<a id='4.1'></a>
### 4.1 From Every City

**Important Note:** This function scrapes every city in Greater Vancouver in turn, so it takes a very long time to run. It is not recommended unless data for every city is needed.

In [2]:
loadtime=1
maxtime=600
collect_restaurant_data_in_each_city(location_urls, location_names, loadtime, maxtime)

<a id='4.2'></a>
### 4.2 For a Particular City

Input a city that is part of Greater Vancouver, as shown in the print statement below.

In [10]:
print(f"Relevant cities in Greater Vancouver are: \n\n{', '.join(location_names)}")

Relevant cities in Greater Vancouver are: 

Central Richmond, Coquitlam, Downtown, West End, Central Burnaby, Mount Pleasant, Guildford, Kitsilano, North Burnaby, Kensington, Fairview, South Burnaby, New Westminster, Renfrew-Collingwood, Newton, East Richmond, Riley Park & Little Mountain, Whalley, City of Langley, Grandview, Port Coquitlam, Maple Ridge, Yaletown, Cariboo & Lougheed, Hastings-Sunrise, Fleetwood, South Surrey, Punjabi Market, Victoria-Fraserview & Killarney, Gastown


In [11]:
loadtime=1
maxtime=600
location=input()
collect_restaurant_data_per_city(location, location_urls, location_names, loadtime, maxtime)

port coquitlam
The location you have entered exists.
New height is 5078 while previous height was 3449
New height is 6666 while previous height was 5078
New height is 9820 while previous height was 6666
New height is 12932 while previous height was 9820
New height is 16107 while previous height was 12932
New height is 16972 while previous height was 16107
New height is 16972 while previous height was 16972
Total number of returned restaurants in Port Coquitlam is 110
Last relevant restaurant in Port Coquitlam in the search query is 110
10th restaurant collected
20th restaurant collected
30th restaurant collected
40th restaurant collected
50th restaurant collected
60th restaurant collected
70th restaurant collected
80th restaurant collected
90th restaurant collected
100th restaurant collected
110th restaurant collected
