# Toronto Restaurants

### Scraping Restaurant Data From The Web

**Data**  

From a market segmentation perspective, we want to collect the most relevant features on as many restaurants as possible within Toronto. Relevant features might include: 
1. Price
2. Average Review
3. Reviews Count
4. Cuisines
5. Location 

For the most part, this data can be scraped from the web. We'll keep the scope to a single website for the sake of consistency, which lets us deal with platform-specific idiosyncrasies that directly affect reviews or the range of prices. For instance, price and review values are usually discrete rather than continuous, and in some cases categorical (e.g. good/bad or cheap/expensive). Comparing data across platforms generally isn't apples to apples, so we'll want a single platform (website) with an extensive and rich library of Toronto restaurants. 

In [3]:
# import pandas 
import pandas as pd # to store our data in a dataframe object

# import requests
import requests

# import selenium (webscraping library)
from selenium import webdriver 
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# import BeautifulSoup (parses HTML and XML)
from bs4 import BeautifulSoup

print('All libraries successfully imported!')

All libraries successfully imported!


In [4]:
# Assign driver path and url

driver_path = 'YOUR DRIVER PATH' # example: C:\Program Files (x86)\Google\Chrome\Application\chromedriver
url = 'YOUR URL' # url of website to be scraped

Now that we've assigned our driver path and found a website with suitable data, we can begin building our crawler. In my case, the crawler has to iterate over each restaurant card on a page (an HTML container, located with a CSS selector, that holds the data I want for an individual restaurant) and then repeat that for every page, since the restaurant catalogue I'm using spans multiple webpages.
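As a rough illustration of the per-card step, the sketch below reads each field relative to a card element instead of from the page root. The helper names `first_text` and `scrape_cards` and the shortened selectors in `FIELDS` are assumptions made for this example; they are built from class names on the target site but have not been checked to match uniquely. The string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`, written out so the sketch has no Selenium dependency.

```python
from typing import Any, Dict, Iterable, List, Optional

# Hypothetical, shortened selectors, evaluated relative to one card
FIELDS = {
    'name': 'a.result-title',
    'avg review': 'span.rating-value',
    'review count': 'span.review-count',
}


def first_text(card: Any, selector: str, default: Optional[str] = None) -> Optional[str]:
    """Return the text of the first element inside `card` matching `selector`."""
    # find_elements returns an empty list (it does not raise) when nothing matches
    matches = card.find_elements("css selector", selector)
    return matches[0].text if matches else default


def scrape_cards(cards: Iterable[Any], fields: Dict[str, str]) -> List[Dict[str, Optional[str]]]:
    """Build one record per restaurant card; missing fields become None."""
    return [
        {column: first_text(card, selector) for column, selector in fields.items()}
        for card in cards
    ]

# With a live driver (not run here):
# cards = driver.find_elements(By.CSS_SELECTOR, '.search-result')
# records = scrape_cards(cards, FIELDS)
```

Searching within the card keeps the selectors short and avoids depending on the card's position in the list, at the cost of having to verify that each short selector matches only the intended element.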

In [13]:
# create webdriver instance
# (Selenium 4 API: the driver path is passed through a Service object)
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(driver_path))

# create empty dataframe to host data
restaurants = pd.DataFrame(columns = ['occasion', 'name', 'avg review', 'review count', 'neighborhood', 'address', 
                                     'cuisines', 'cost for 2', 'featured in'])

driver.get(url) # load the webpage we are scraping (get() returns None, so there is nothing to assign)
iter_count = 0

def find_all(selector):
    # find_elements returns an empty list, rather than raising, when nothing matches
    return driver.find_elements(By.CSS_SELECTOR, selector)

while True: 
    # collect the restaurant cards on the current page
    cards = find_all('.search-result')
    
    for i, _ in enumerate(cards, 1): 
        # selector prefixes shared by the fields of the i-th card
        article = '#orig-search-list > div:nth-child(' + str(i) + ') > div.content > div > article'
        header = article + ' > div.pos-relative.clearfix > div > div.col-s-16.col-m-12.pl0'
        title = header + ' > div:nth-child(1) > div.col-s-12'
        details = article + ' > div.search-page-text.clearfix.row'
        
        try: # get data and assign to vars
            occasion = find_all(title + ' > div.res-snippet-small-establishment.mt5 > a')[0].text
            name = find_all(title + ' > a.result-title.hover_feedback.zred.bold.ln24.fontsize0')[0].text
            avg_review = find_all(title + ' > div.single-rating.flex > span.rating-value')[0].text
            review_count = find_all(title + ' > div.single-rating.flex > span.review-count.medium')[0].text
            neighborhood = find_all(title + ' > a.ln24.search-page-text.mr10.zblack.search_result_subzone.left > b')[0].text
            address = find_all(header + ' > div:nth-child(2) > div')[0].text
            cuisines = [x.text for x in find_all(details + ' > div:nth-child(1) > span.col-s-11.col-m-12.nowrap.pl0 > a')]
            cost_for_2 = find_all(details + ' > div.res-cost.clearfix > span.col-s-11.col-m-12.pl0')[0].text
            
            try: # optional field: an empty list when the restaurant is in no collection
                featured_in = [x.text for x in find_all(details + ' > div.res-collections.clearfix > div')]
            except Exception: 
                featured_in = 'NaN'
                
            iter_count += 1 
            print('\r iteration: {}'.format(iter_count), end = "")
            
            # add obs to df 
            restaurants.loc[len(restaurants)] = [
                occasion, 
                name,
                avg_review,
                review_count, 
                neighborhood,
                address,
                cuisines,
                cost_for_2, 
                featured_in]
            
        except Exception: 
            # skip the card, typically an IndexError from a missing required field or a stale element
            continue
    
    
    try: # go to next page
        next_page = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#search-results-container > div.search-pagination-top.clearfix.mtop > div.row > div.col-l-12 > div > div > a.paginator_item.next.item')))
        next_page.click()
    except Exception: # the wait times out when there is no "next" link
        print(' Final page reached')
        break 
    
driver.quit() # end the browser session and the driver process

 iteration: 5125 Final page reached


In [14]:
restaurants.head()

| Unnamed: 0 | occasion | name | avg review | review count | neighborhood | address | cuisines | cost for 2 | featured in |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CASUAL DINING | Pizzeria Libretto | 4.5 | (1,010 Reviews) | Ossington Avenue, Trinity Bellwoods | 221 Ossington Avenue, Toronto M6J 2Z8 | [Pizza, Italian] | CA$50 | [Hipster Hangouts] |
| 1 | CASUAL DINING | KINKA IZAKAYA | 4.6 | (1,166 Reviews) | Church Street, Church And Wellesley | 398 Church Street, Toronto M5B 2A2 2A2 | [Japanese, Asian] | CA$55 | [Unique Dining] |
| 2 | CASUAL DINING | Pai | 4.9 | (614 Reviews) | Duncan Street, Entertainment District | 18 Duncan Street, Toronto | [Thai, Asian] | CA$50 | [] |
| 3 | QUICK BITES | Banh Mi Boys | 4.7 | (941 Reviews) | Queen Street West, Fashion District | 392 Queen Street West, Toronto | [Sandwich] | CA$25 | [Pocket Friendly] |
| 4 | QUICK BITES | The Stockyards | 4.6 | (734 Reviews) | Saint Clair Avenue West, Forest Hill | 699 St Clair Avenue West, Toronto M6C 1B2 | [BBQ, Burger] | CA$25 | [Kickass Burgers, Barbecue & Grill, Grill My C... |


In [16]:
restaurants.dtypes

occasion        object
name            object
avg review      object
review count    object
neighborhood    object
address         object
cuisines        object
cost for 2      object
featured in     object
dtype: object
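
Every column comes back as `object` because the scraper stores the text of each element as-is. A minimal sketch of how the numeric columns could be converted later is shown below, using two rows copied from the preview above. It assumes review counts always look like `(1,010 Reviews)` and that costs are whole-dollar amounts like `CA$50`; neither has been checked against the full dataset, and the raw dataframe is left untouched.

```python
import pandas as pd

# Two rows copied from the preview above, kept as raw strings
raw = pd.DataFrame({
    'avg review': ['4.5', '4.9'],
    'review count': ['(1,010 Reviews)', '(614 Reviews)'],
    'cost for 2': ['CA$50', 'CA$50'],
})


def digits_only(series: pd.Series) -> pd.Series:
    """Strip every non-digit character, e.g. '(1,010 Reviews)' -> '1010'."""
    return series.str.replace(r'\D', '', regex=True)


clean = pd.DataFrame({
    # errors='coerce' turns unparseable values into NaN instead of raising
    'avg review': pd.to_numeric(raw['avg review'], errors='coerce'),
    'review count': pd.to_numeric(digits_only(raw['review count']), errors='coerce'),
    # assumption: whole-dollar prices; a decimal price would lose its decimal point here
    'cost for 2': pd.to_numeric(digits_only(raw['cost for 2']), errors='coerce'),
})

print(clean.dtypes)
```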

In [17]:
restaurants.to_csv('raw_restaurants_data.csv')
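
One thing to keep in mind when loading this file again: `to_csv` writes the index as an unnamed first column, and list-valued columns such as `cuisines` are stored as their string representation. The sketch below shows one way to restore both on a one-row stand-in for the real data, written to an in-memory buffer instead of `raw_restaurants_data.csv`. It assumes every `cuisines` cell holds a list; a cell containing the plain string `NaN` would make `ast.literal_eval` raise.

```python
import ast
import io

import pandas as pd

# One-row stand-in for the scraped dataframe
df = pd.DataFrame({'name': ['Pai'], 'cuisines': [['Thai', 'Asian']]})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    index_col=0,  # reuse the first (unnamed) column as the index
    converters={'cuisines': ast.literal_eval},  # "['Thai', 'Asian']" -> list
)

print(restored.loc[0, 'cuisines'])
```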