<a href="https://colab.research.google.com/github/harslan/Notebooks/blob/master/headless_browser_google_colab_opentable.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Web Scraping

## Learning Objectives
- Learn how to locate HTML elements on a webpage.
- Convert HTML data from a webpage into a searchable object using Beautiful Soup.
- Discuss the limitations of the plain `requests` library.
- Introduce Selenium as a solution, and implement a headless browser scraper using Selenium.




## What is Selenium?

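Selenium is a browser automation tool. It can drive a real browser such as Chrome in headless mode (without a visible window), so a page's JavaScript content is rendered just as it would be in a human-navigated browser.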




## Running in Google Colab

If you are running your Python code in Google Colab, run the following cell to install Selenium and the Chromium ChromeDriver:

In [0]:
%%bash
# make /tmp writable and create a folder for scraped data
chmod 777 /tmp
mkdir -p data
# refresh the package index, then install the Chromium driver
apt-get update -y --fix-missing
apt-get install -y --fix-missing chromium-chromedriver
# install the Python packages
pip install selenium joblib
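
Once the installation cell finishes, a quick check that the binaries are on the `PATH` can save debugging time later. The sketch below uses only the standard library. The helper name `locate_binaries` is hypothetical, and the binary names are assumptions based on the package installed above.

```python
import shutil

def locate_binaries(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Binary names assumed from the chromium-chromedriver package installed above
found = locate_binaries(["chromedriver", "chromium-browser"])
for name, path in found.items():
    print(f"{name}: {path or 'not found'}")
```

If either entry prints `not found`, the Selenium cell later in the lesson will most likely fail to start the browser.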

<a id="intro"></a>
## Introduction

In this lesson, we'll build a web scraper using the requests and Beautiful Soup libraries in Python. We will also explore how to use Selenium to drive a headless browser.

We'll begin by scraping OpenTable's Boston listings. We're interested in knowing each restaurant's **name and location**. OpenTable provides all of this information on the following page: http://www.opentable.com/boston-restaurant-listings

---

<a id="building-scraper"></a>
## Building a web scraper

Let's build a web scraper that pulls data from the OpenTable website using the requests and Beautiful Soup libraries:

In [0]:
# import our packages
from bs4 import BeautifulSoup
import requests

# set the url we want to visit
url = "http://www.opentable.com/boston-restaurant-listings"

# visit the webpage at the url and retrieve its corresponding html content
html = requests.get(url)

# .text is an attribute holding the response body decoded as a Unicode string
print(html.text)

# convert this content into a BeautifulSoup object
soup = BeautifulSoup(html.text, 'html.parser')

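If you look through the output of `html.text` above, you will notice that it does not contain the list of restaurants in Boston. This is the main limitation of requests: it downloads the HTML the server sends but does not execute JavaScript, so content that the page builds in the browser never appears.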
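
One way to picture what is going on is a toy page whose list is filled in by a script. The HTML below is a made-up stand-in, not OpenTable's actual markup: the server sends an empty container, and the JavaScript that would populate it never runs when we only parse the text.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the <div> stays empty until the script runs in a browser
static_html = """
<html><body>
  <div id="listings"></div>
  <script>
    document.getElementById("listings").innerHTML =
      "<span class='rest-row-name-text'>Example Bistro</span>";
  </script>
</body></html>
"""

static_soup = BeautifulSoup(static_html, "html.parser")

# No JavaScript has been executed, so the span exists only as text inside <script>
names = static_soup.find_all("span", class_="rest-row-name-text")
print(len(names))
```

The class name appears in the raw text, yet the parser finds no matching tag, which is the same symptom we see with the downloaded page.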

<a id="selenium"></a>
## Introducing Selenium

Let's build a web scraper for the OpenTable website using Selenium, running Chrome in headless mode, together with the Beautiful Soup library:

In [0]:
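## Headless browser with BeautifulSoup
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# configure Chrome to run without a visible window inside the Colab container
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.binary_location = '/usr/bin/chromium-browser'

# Selenium 4 takes the driver path through a Service object
# (the executable_path argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=options)
driver.get("http://www.opentable.com/boston-restaurant-listings")

# page_source holds the page's HTML as the browser currently sees it
html = driver.page_source
driver.quit()

# name the parser explicitly to avoid Beautiful Soup's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')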

<a id="retrieving-data"></a>
### Retrieving data from the HTML page

Before we can scrape the restaurant names, we need to find where in the **HTML** each name is stored. To find the HTML that renders a given element, such as a restaurant's name or location, we can use Google Chrome's Inspect tool:

> http://www.opentable.com/boston-restaurant-listings
> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.

The general idea is that restaurant names are all stored in the same kind of HTML element on the webpage. If we can identify this structure, we can use it to extract the corresponding data from these elements.

In [0]:
# find every <span> tag that holds a restaurant name
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

Note that the object returned by `find_all` is a `list`: the output shows outer square brackets with commas separating each tag. The elements of the list are `Tag` objects, not strings. Beautiful Soup uses a `Tag` object to represent a tag together with its attributes and contents, and printing one shows its HTML. Because it is an object, it also has attributes and methods we can use. For example, next we will use the `.text` attribute to return the tag's contents as a Python string.
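
The distinction between a `Tag` and its text is easy to check on a small, made-up snippet. The restaurant names below are placeholders, not scraped data.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup that mimics the structure of a restaurant row
snippet = """
<span class="rest-row-name-text">Example Bistro</span>
<span class="rest-row-name-text"> Sample Grill </span>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")
tags = demo_soup.find_all("span", attrs={"class": "rest-row-name-text"})

first = tags[0]
print(isinstance(first, Tag))        # each list element is a Tag object
print(first.name)                    # the tag's name
print(first["class"])                # attributes are read like dictionary keys
print(first.text)                    # .text holds the tag's text content
print(tags[1].get_text(strip=True))  # get_text() can also trim whitespace
```

Note that `.text` is accessed without parentheses, while `get_text()` is a method that accepts options such as `strip=True`.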

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

After finding the list of tags containing the restaurant names, we can loop through them and print each name. We use the `.text` attribute to get the clean text contents.

In [0]:
# for each element you find, print out the restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)
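
Printing is fine for a quick look, but for later analysis the names usually need to live in a list. Here is a minimal sketch, using a made-up snippet in place of the live page and the `class_` keyword as a shorthand for `attrs={'class': ...}`:

```python
from bs4 import BeautifulSoup

# Placeholder markup; on the real page the `soup` object built earlier would be used
snippet = """
<span class="rest-row-name-text">Example Bistro</span>
<span class="rest-row-name-text">Sample Grill</span>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# Build a list of clean strings instead of printing each one
restaurant_names = [
    tag.get_text(strip=True)
    for tag in demo_soup.find_all("span", class_="rest-row-name-text")
]
print(restaurant_names)

# select() accepts a CSS selector and finds the same elements
same_names = [
    tag.get_text(strip=True)
    for tag in demo_soup.select("span.rest-row-name-text")
]
```

Whether to use `find_all` or `select` is mostly a matter of taste. The CSS selector form can be copied almost directly from what the Inspect tool shows.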


<a id="retrieving-locations"></a>
#### Retrieving the restaurant locations

Let's repeat the process above to find the restaurant locations.

In [0]:
# now print out the location of each restaurant
# note: a multi-class string like this must match the tag's class attribute exactly
for entry in soup.find_all('span', {'class':'rest-row-meta--location rest-row-meta-text sfx1388addContent'}):
    print(entry.text)
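
Names and locations collected in two separate loops can drift out of step if one restaurant is missing a location. A possible way around this, sketched on made-up markup, is to walk each restaurant's container and look up both fields inside it. The `rest-row` container class and the sample values are assumptions for illustration, not taken from the live page.

```python
from bs4 import BeautifulSoup

# Made-up markup: the "rest-row" container class is an assumption for illustration
snippet = """
<div class="rest-row">
  <span class="rest-row-name-text">Example Bistro</span>
  <span class="rest-row-meta--location">Back Bay</span>
</div>
<div class="rest-row">
  <span class="rest-row-name-text">Sample Grill</span>
</div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

records = []
for row in demo_soup.find_all("div", class_="rest-row"):
    name_tag = row.find("span", class_="rest-row-name-text")
    location_tag = row.find("span", class_="rest-row-meta--location")
    records.append({
        "name": name_tag.get_text(strip=True) if name_tag else None,
        # A missing location becomes None instead of shifting later rows
        "location": location_tag.get_text(strip=True) if location_tag else None,
    })

print(records)
```

Searching by the single class `rest-row-meta--location` also matches tags that carry additional classes, which makes it less brittle than matching the full class string.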


<a id="retrieving-review-numbers"></a>
#### Challenge: Retrieving the restaurant review numbers

Can you repeat the process above to find the number of reviews for each restaurant?

In [0]:
# ANSWER:
# retrieving restaurant review numbers

for review in soup.find_all("span", attrs={'class': 'underline-hover'}):
    s = review.text
    # keep only the text between the parentheses, e.g. "(125)" -> "125"
    print(s[s.find("(") + 1:s.find(")")])
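
The slicing above assumes every string contains a pair of parentheses. A slightly more defensive variant, sketched here with the standard `re` module, returns `None` when no count is found. The helper name `extract_review_count` and the sample strings are hypothetical.

```python
import re

def extract_review_count(text):
    """Return the integer inside the first '(...)' in text, or None if absent."""
    match = re.search(r"\((\d[\d,]*)\)", text)
    if match is None:
        return None
    # Drop thousands separators such as "1,204" before converting
    return int(match.group(1).replace(",", ""))

# Sample strings are placeholders, not scraped data
for sample in ["(125)", "Reviews (1,204)", "No reviews yet"]:
    print(sample, "->", extract_review_count(sample))
```

Returning an integer rather than a string also makes it easier to sort or average the counts later.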
