# Webscraping 2.0

In today's codealong, I'll walk through how to build a scraper using urllib and BeautifulSoup. Along the way we'll run into the problems discussed in the lesson readme, and we'll remedy them using a browser automation tool called Selenium.

For starters, we're going to be scraping OpenTable's DC listings. We're interested in knowing each restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this given page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each of the bits of information in which we're interested.
>Class proceeds to do exactly that ^

We'll then build our scraper:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import urllib

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = urllib.urlopen(url).read()
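
The cell above uses Python 2's `urllib.urlopen`. For anyone following along on Python 3, here is a rough equivalent built on `urllib.request`. The helper names and the User-Agent string are my own placeholders, not part of the lesson, and sending a custom User-Agent is only an assumption about what a site might expect.

```python
from urllib.request import Request, urlopen

LISTINGS_URL = "http://www.opentable.com/washington-dc-restaurant-listings"


def build_request(url, user_agent="Mozilla/5.0 (codealong scraper)"):
    # The User-Agent value is a made-up placeholder; some sites reject
    # Python's default one, but that is an assumption, not a guarantee.
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    # Open the URL, read the raw bytes, and decode them to text using
    # the charset the server reports (falling back to UTF-8).
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; fetch_html() does.
request = build_request(LISTINGS_URL)
# html = fetch_html(LISTINGS_URL)
```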

At this point, what is in html?

In [3]:
html

'          <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Washington, D.C. Area Restaurants List | OpenTable</title>  <meta  name="description" content="Find Washington, D.C. Area restaurants. Search by location, cuisine, or price to refine restaurant results in the Washington, D.C. Area area." > </meta>  <meta  name="robots" content="noindex" > </meta>    <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-48.png" sizes="48x48"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/fa

In [4]:
# we need to convert this into a soup object
soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

Let's first find each restaurant name listed on the page we've loaded. How do we find the page location of the restaurant? (Psst: we need to know where in the **html** the restaurant element is housed)

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [5]:
# print the restaurant names
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Voltaggio Brothers Steak House - MGM National Harbor</span>,
 <span class="rest-row-name-text">Fish by Jose Andres - MGM National Harbor</span>,
 <span class="rest-row-name-text">barmini by Jos\xe9 Andr\xe9s</span>,
 <span class="rest-row-name-text">Le Diplomate</span>,
 <span class="rest-row-name-text">Rasika</span>,
 <span class="rest-row-name-text">Marcus - MGM National Harbor</span>,
 <span class="rest-row-name-text">Rasika West End</span>,
 <span class="rest-row-name-text">Kinship</span>,
 <span class="rest-row-name-text">Vasili's Kitchen</span>,
 <span class="rest-row-name-text">Sushi Taro</span>,
 <span class="rest-row-name-text">The Goodstone Inn &amp; Estate Restaurant</span>,
 <span class="rest-row-name-text">TAP Sports Bar - MGM National Harbor</span>,
 <span class="rest-row-name-text">Farmers &amp; Distillers</span>,
 <span class="rest-row-name-text">Elizabeth's Gone Raw</span>,
 <span class="rest-row-name-text">Ginger - MGM National Harbor

Now that we can find each element, let's think how we can loop through them all one-by-one. In the following cell, print out the name (and **only** the clean name, not the rest of the html) of each restaurant.

In [6]:
# for each element you find, print out the restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print entry.renderContents()

Voltaggio Brothers Steak House - MGM National Harbor
Fish by Jose Andres - MGM National Harbor
barmini by José Andrés
Le Diplomate
Rasika
Marcus - MGM National Harbor
Rasika West End
Kinship
Vasili's Kitchen
Sushi Taro
The Goodstone Inn &amp; Estate Restaurant
TAP Sports Bar - MGM National Harbor
Farmers &amp; Distillers
Elizabeth's Gone Raw
Ginger - MGM National Harbor
RPM Italian - DC
Roof Terrace Restaurant &amp; Bar
TenPenh - Tysons
Field &amp; Main
Ted's Bulletin - 14th Street
Ambar - Arlington
Blue Duck Tavern
Empress Lounge at the Mandarin Oriental - DC
Doi Moi
Hazel
Uncle Julio's - Gainesville
Ambar
Restaurant Eve
Founding Farmers - DC
Joe's Seafood, Prime Steak &amp; Stone Crab - Washington DC
Community Restaurant and Lounge
Ruth's Chris Steak House - Gaithersburg
Old Ebbitt Grill
Nostos Restaurant
Uncle Julio's - Loudoun
The Melting Pot - Reston
Texas de Brazil - Fairfax
Sushi Ogawa
Bombay Club
Farmers Fishers Bakers
Plan B Burger Bar - Washington DC
The Melting Pot - Arlingt

Great!
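
Notice that `renderContents()` leaves HTML entities such as `&amp;` in the output. As a small sketch (Python 3, run on a hand-written sample rather than the live page), `get_text()` returns the decoded text instead:

```python
from bs4 import BeautifulSoup

# Two spans copied from the output above, used as a stand-in for the page.
sample = (
    '<span class="rest-row-name-text">Farmers &amp; Distillers</span>'
    '<span class="rest-row-name-text">Le Diplomate</span>'
)
demo_soup = BeautifulSoup(sample, "html.parser")

# get_text() decodes entities, so "&amp;" comes back as "&".
names = [
    span.get_text(strip=True)
    for span in demo_soup.find_all("span", {"class": "rest-row-name-text"})
]
print(names)
```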

Can you repeat that process for finding the location? For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [7]:
# first, see if you can identify the location for all elements -- print it out
soup.find_all('span', {'class':'rest-row-meta--location rest-row-meta-text'})

[<span class="rest-row-meta--location rest-row-meta-text">Oxon Hill</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Oxon Hill</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Penn Quarter</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Logan Circle</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Penn Quarter</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Oxon Hill</span>,
 <span class="rest-row-meta--location rest-row-meta-text">West End</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Mt. Vernon Square</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Gaithersburg</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Dupont Circle</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Middleburg</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Oxon Hill</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Mt. Ve

In [8]:
# now print out EACH location for the restaurants
for entry in soup.find_all('span', {'class':'rest-row-meta--location rest-row-meta-text'}):
    print entry.renderContents()

Oxon Hill
Oxon Hill
Penn Quarter
Logan Circle
Penn Quarter
Oxon Hill
West End
Mt. Vernon Square
Gaithersburg
Dupont Circle
Middleburg
Oxon Hill
Mt. Vernon Square
Downtown
Oxon Hill
Mt. Vernon Square
Foggy Bottom
Tysons Corner / McLean
The Plains
Logan Circle
Arlington
West End
Downtown
Logan Circle
Shaw
Gainesville
Capitol Hill
Old Town Alexandria
Foggy Bottom
Downtown
Bethesda / Chevy Chase
Gaithersburg
Downtown
Vienna
Ashburn
Reston
Fairfax
Dupont Circle
Downtown
Georgetown
Washington
Arlington
Gaithersburg
National Harbor
Alexandria
Penn Quarter
Logan Circle
Gaithersburg
Downtown
Tysons Corner / McLean
Georgetown
Bethesda / Chevy Chase
Tysons Corner / McLean
Dupont Circle
Reston
Georgetown
Rockville
Fairfax
Washington
Gaithersburg
Georgetown
Rockville
Alexandria
Downtown
Great Falls
Middleburg
Chantilly
Great Falls
Bethesda / Chevy Chase
Lovettsville
Bethesda / Chevy Chase
Mount Vernon
Arlington
Ashburn
Warrenton
Fairfax
Bethesda / Chevy Chase
Frederick
Downtown
Georgetown
Adams Mor
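
A side note on the selector used above: passing the full string `'rest-row-meta--location rest-row-meta-text'` asks BeautifulSoup for that exact class attribute value, so it would miss an element whose classes appear in a different order. The sketch below (Python 3) illustrates this on a made-up sample; the second span, with its classes reversed, is hypothetical and not taken from the OpenTable page.

```python
from bs4 import BeautifulSoup

sample = (
    '<span class="rest-row-meta--location rest-row-meta-text">Penn Quarter</span>'
    # Hypothetical variant with the two classes in the opposite order.
    '<span class="rest-row-meta-text rest-row-meta--location">Shaw</span>'
)
demo_soup = BeautifulSoup(sample, "html.parser")

# Exact class string: only matches the first span.
exact = demo_soup.find_all(
    "span", {"class": "rest-row-meta--location rest-row-meta-text"}
)
# Single class name: matches any span that carries that class.
single = demo_soup.find_all("span", class_="rest-row-meta--location")

print(len(exact), len(single))
```

Searching on the single most specific class is usually the more forgiving choice if the site reorders or adds classes.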

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [9]:
# print out all prices
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    </i>    $         </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    </i>    $         </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    </i>    $         </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    </i>    $         </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    </i>    $         </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>,
 <div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>,
 <div class="rest-row-pricing"> <i>  $    $      </i>    $    $       </div>,
 <div class="rest-row-pricing"> <i>  $    $      </i>    $    $       </div>,
 <div class="rest-row-pricing"> <i>  $    $   

In [10]:
# print out EACH number of dollar signs per restaurant
# this one is trickier to eliminate the html. Hint: try a nested find
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print entry.find('i').renderContents()

  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $    $    $  
  $    $     

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?

In [11]:
# print the number of dollar signs per restaurant
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    price = entry.find('i').renderContents()
    print price.count('$')

4
4
3
3
3
4
3
4
3
4
4
2
2
4
3
3
4
3
2
2
2
3
3
2
3
2
2
4
2
4
2
3
2
2
2
3
3
4
3
2
2
3
3
2
3
3
2
2
4
3
4
2
2
3
2
4
2
3
4
2
3
2
3
2
2
3
4
4
3
4
3
3
2
2
2
2
3
3
2
2
2
2
3
4
2
2
2
4
2
3
3
2
3
2
2
2
2
2
2
2


Phew, nice work. 
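
Here is one way to wrap that counting logic in a small function, as a sketch in Python 3. It assumes, based only on the output above, that the highlighted dollar signs sit inside the `<i>` tag and any greyed-out ones follow it; `price_level` is a name I made up for illustration.

```python
from bs4 import BeautifulSoup

# Three pricing divs copied from the output above.
sample = (
    '<div class="rest-row-pricing"> <i>  $    $    $    $  </i> </div>'
    '<div class="rest-row-pricing"> <i>  $    $    $    </i>    $         </div>'
    '<div class="rest-row-pricing"> <i>  $    $      </i>    $    $       </div>'
)


def price_level(pricing_div):
    # Count only the dollar signs inside <i>; return None if <i> is absent.
    highlighted = pricing_div.find("i")
    if highlighted is None:
        return None
    return highlighted.get_text().count("$")


demo_soup = BeautifulSoup(sample, "html.parser")
prices = [
    price_level(div)
    for div in demo_soup.find_all("div", {"class": "rest-row-pricing"})
]
print(prices)
```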

One more, right? We only need to find the number of times a restaurant was booked. In the next cell, print out all objects that contain the number of times the restaurant was booked.

In [12]:
# print out all objects that contain the number of times the restaurant was booked
soup.find_all('div', {'class':'booking'})

[]

That's weird -- an empty list. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?
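
One cheap check, before dumping large chunks of the page into the notebook, is to ask whether the text we expect appears in the raw response at all. The sketch below is Python 3 and works on text (decode the bytes first if needed); `count_mentions`, the two sample pages, and `bookings.js` are all made up for illustration.

```python
def count_mentions(raw_html, needle):
    # Case-insensitive count of how often `needle` occurs in the page text.
    return raw_html.lower().count(needle.lower())


# Hypothetical "before JavaScript" page: the booking div is not there yet.
static_page = (
    '<span class="rest-row-name-text">Rasika</span>'
    '<script src="bookings.js"></script>'
)
# Hypothetical "after JavaScript" page: the script has added the div.
rendered_page = static_page + '<div class="booking">Booked 118 times today</div>'

print(count_mentions(static_page, "Booked "))
print(count_mentions(rendered_page, "Booked "))
```

If the count on the downloaded page is zero, no amount of `find_all` tweaking will help, because the text was never in the response.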

In [13]:
# let's first try printing out all 'div' objects
for entry in soup.find_all('div'):
    print entry

<div class="master-container" id="search-master-container"> <style>.icon-font{font-family:'icons';speak:none;font-style:normal;font-weight:normal;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.breadcrumb li.icon-visible a:before{font-family:'icons';speak:none;font-style:normal;font-weight:normal;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.breadcrumb{*zoom:1;background-color:#ffffff;height:2rem;line-height:1.9rem;font-size:0.75rem;border-top:1px solid rgba(0, 0, 0, 0.08);border-bottom:none !important;padding:0 1.25rem}.breadcrumb:before,.breadcrumb:after{content:" ";display:table}.breadcrumb:after{clear:both}@media only screen and (min-width: 64.0625em){.breadcrumb{padding:0 2.25rem}}.breadcrumb li{position:relative;display:block;float:left;margin:0 1rem 0 0.35rem}.breadcrumb li.hidden{display:none}@media only screen and (min-width:

<div class="overall-search-container" id="no-re-render-container"> <div class="filters-bar show-filter-content" id="filters-bar"><div class="max-width-wrapper"><div class="row"><div class="column"> <div id="search_filters"> <ul class="filters-list"> <li class="filter-option filter-option-locations" id="location_filters"> <span class="filters" data-target="Regions-filter-menu"><span class="location-icon filter-icon"></span> <span class="filter-list-title">Regions</span> <span class="filter-count"></span></span><div class="menu with-arrow search-filter-menu" id="Regions-filter-menu"><div class="menu-container"><div class="menu-main collapsed"><div class="menu-section "><div class="menu-list "><div class="menu-with-checkboxes" id="Regions-filter-items"> <ul class="view-filter-list"> <li> <label class="filter-toggle menu-list-label" for="Regions_All"><input ,="" class="menu-list-input all-filter" data-id="All" id="Regions_All" name="Regions" type="checkbox"/> <span title="All Locations">Al

<div class="menu-container"><div class="menu-main collapsed"><div class="menu-section with-scroll"><div class="menu-list with-overflow"><div class="menu-with-checkboxes" id="Tags-filter-items"> <ul class="view-filter-list"> <li> <label class="filter-toggle menu-list-label" for="Tags_All"><input ,="" class="menu-list-input all-filter" data-id="All" id="Tags_All" name="Tags" type="checkbox"/> <span title="All Top Rated">All Top Rated</span></label></li> <li> <label class="menu-list-label show-filter" for="Tags_487e1d73-7c80-47f8-bb0c-79f25f7f423c"><input class="filter menu-list-input" data-filter-name="TagIds" data-id="487e1d73-7c80-47f8-bb0c-79f25f7f423c" id="Tags_487e1d73-7c80-47f8-bb0c-79f25f7f423c" name="Tags" type="checkbox"> </input> <span title="Authentic"> Authentic </span></label> </li> <li> <label class="menu-list-label show-filter" for="Tags_9aa33949-bcce-4847-bdeb-f45d21201539"><input class="filter menu-list-input" data-filter-name="TagIds" data-id="9aa33949-bcce-4847-bdeb-f4

<div class="stack-selected-filters"> <div class="toggle-filter-bar"><div id="filter-toggle"><div class="filter-toggle-icon"></div> <span class="show-filter-text">Show filters</span> <span class="hide-filter-text">Hide filters</span></div></div><div class="selected-filters js-selected-filters full-width-wrapper"><div class="row"><div class="column" id="js-selected-filters-column"></div></div></div> <div class="search-results-container page-main-content max-width-wrapper" id="search_results_container"><div class="close-filters"></div> <div class="loader" id="loading_animation"><div class="spinner"></div><div class="loader-content" id="loading_error_container"></div></div> <div class="results-set results-table search-results" data-name="ResultsTable" id="search_results"> <div class="content-section-header"> <h3 class="results-title color-dark" id="results-title">  2546 tables for 2 people on Thu, April 6, at 7:00 PM </h3> <div class="sort-view-filters"><div class="search-tab right view-to

<div class="results-set results-table search-results" data-name="ResultsTable" id="search_results"> <div class="content-section-header"> <h3 class="results-title color-dark" id="results-title">  2546 tables for 2 people on Thu, April 6, at 7:00 PM </h3> <div class="sort-view-filters"><div class="search-tab right view-toggle"> <div class="views" id="view_toggle"><div class="icon-as-button selected" data-view="list"><svg class="sort-view-toggle-icon" version="1.1" viewbox="0 0 33 29" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><path class="svg-list-icon" d="M7.88020833.5 7.88020833 3.5 32.796875 3.5 32.796875.5 7.88020833.5ZM7.88020833 15.8984375 32.796875 15.8984375 32.796875 12.8984375 7.88020833 12.8984375 7.88020833 15.8984375ZM7.88020833 28.3203125 32.796875 28.3203125 32.796875 25.3203125 7.88020833 25.3203125 7.88020833 28.3203125ZM0 4 4 4 4 0 0 0 0 4ZM0 16.421875 4 16.421875 4 12.421875 0 12.421875 0 16.421875ZM0 28.84375 4 28.84375 4 24.84375 0 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


I still don't see it. Let's search our entire soup object:

In [14]:
# print out soup, do command+f for "booked "
soup

 <!DOCTYPE html>\n<html lang="en"><head><meta charset="unicode-escape"/><meta content="IE=9; IE=8; IE=7; IE=EDGE" http-equiv="X-UA-Compatible"/> <title>Washington, D.C. Area Restaurants List | OpenTable</title> <meta content="Find Washington, D.C. Area restaurants. Search by location, cuisine, or price to refine restaurant results in the Washington, D.C. Area area." name="description"> </meta> <meta content="noindex" name="robots"> </meta> <link href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-16.png" rel="icon" sizes="16x16"/><link href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-32.png" rel="icon" sizes="32x32"/><link href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-48.png" rel="icon" sizes="48x48"/><link href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-64.png" rel="

What do you notice? Why is this happening?

## Enter Selenium

Selenium is a browser automation tool. It drives a real browser (optionally in headless mode, with no visible window), which lets us mimic human browsing behavior -- even waiting for JavaScript elements to load.

If you do not already have Selenium installed, you can do so via pip. Simply: `pip install selenium`

In [15]:
# import
from selenium import webdriver

Selenium requires us to choose a browser for it to drive. I'm going to opt for Firefox, but Chrome/Chromium is also a very common choice. See the FAQ: http://selenium-python.readthedocs.io/faq.html

In [16]:
# STOP
# what is going to happen when I run the next cell?

In [17]:
# create a driver called Firefox
driver = webdriver.Firefox()

Pretty crazy, right? Let's close that driver.

In [18]:
# close it
driver.close()

In [19]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Firefox()
driver.get("http://www.python.org")

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.
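
If you would rather not have a window pop up every time, Firefox can also be started headless. This is a sketch only: it assumes a reasonably recent Selenium release and a working geckodriver on your machine, and `make_headless_firefox` is a name I made up.

```python
def make_headless_firefox():
    # Imported inside the function so this snippet can be loaded even on a
    # machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # Ask Firefox to run without opening a visible window.
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


# Usage (launches a real, invisible Firefox):
# driver = make_headless_firefox()
# driver.get("http://www.python.org")
# driver.quit()
```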

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. In the next cell, prove you can programmatically visit the page.

In [20]:
# visit our OpenTable page
driver = webdriver.Firefox()
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")
# always good to check we've got the page we think we do
assert "OpenTable" in driver.title

In [21]:
driver.title

u'Washington, D.C. Area Restaurants List | OpenTable'

Now, to resolve our JavaScript problem, there are a few things we can do. In this case, I'll request the page, wait one second, and then grab the page source. Because the page is loaded in a real browser, its JavaScript should run, and whatever it renders should become part of the page source.

In [22]:
# import sleep
from time import sleep

In [23]:
# visit our relevant page
driver = webdriver.Firefox()
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")
# wait one second
sleep(1)
#grab the page source
html = driver.page_source
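
A fixed `sleep(1)` is a guess: too short on a slow connection, wasted time on a fast one. Selenium ships its own `WebDriverWait` for this; the sketch below is a hand-rolled alternative in Python 3 that only relies on `driver.page_source`, so it is easy to see what is going on. The function name and default values are my own choices.

```python
import time


def wait_for_text(driver, needle, timeout=10.0, poll=0.5):
    # Re-read the page source every `poll` seconds until `needle` shows up,
    # and give up with TimeoutError once `timeout` seconds have passed.
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if needle in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("%r not found after %s seconds" % (needle, timeout))
        time.sleep(poll)


# Usage with a live driver:
# html = wait_for_text(driver, "Booked ")
```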

In [24]:
# BeautifulSoup it! Name the parser explicitly to avoid bs4's "no parser specified" warning
html = BeautifulSoup(html, 'html.parser')


Now, let's return to our earlier problem: how do we locate bookings on the page?

In [25]:
# print out the number of bookings for all restaurants
html.find_all('div', {'class':'booking'})

[<div class="booking"><span class="tadpole"></span>Booked 15 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 25 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 27 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 275 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 118 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 5 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 170 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 49 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 24 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 79 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 5 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 333 times today</div>,
 <div class="booking"><span class="tad

In [26]:
# now print out each booking for the listings using a loop
for entry in html.find_all('div', {'class':'booking'}):
    print entry

<div class="booking"><span class="tadpole"></span>Booked 15 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 25 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 27 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 275 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 118 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 5 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 170 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 49 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 24 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 79 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 5 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 333 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 7 tim

Let's grab just the text of each of these entries.

In [27]:
# do the same as above, but grabbing only the text content
for entry in html.find_all('div', {'class':'booking'}):
    print entry.text

Booked 15 times today
Booked 25 times today
Booked 27 times today
Booked 275 times today
Booked 118 times today
Booked 5 times today
Booked 170 times today
Booked 49 times today
Booked 24 times today
Booked 79 times today
Booked 5 times today
Booked 333 times today
Booked 7 times today
Booked 9 times today
Booked 161 times today
Booked 70 times today
Booked 33 times today
Booked 9 times today
Booked 72 times today
Booked 114 times today
Booked 7 times today
Booked 43 times today
Booked 65 times today
Booked 4 times today
Booked 84 times today
Booked 10 times today
Booked 652 times today
Booked 172 times today
Booked 7 times today
Booked 17 times today
Booked 366 times today
Booked 53 times today
Booked 16 times today
Booked 63 times today
Booked 56 times today
Booked 8 times today
Booked 94 times today
Booked 290 times today
Booked 4 times today
Booked 51 times today
Booked 29 times today
Booked 29 times today
Booked 19 times today
Booked 313 times today
Booked 27 times today
Booked 22

We've succeeded!

But we can clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits in each entry's text.

The best way to get good at regex is to, well, just keep trying and testing: http://pythex.org/

In [28]:
# import regex
import re

Given we haven't covered regex, I'll show you how to use the search function to find the first run of digits in a string.

In [29]:
# for each entry, grab the text
for booking in html.find_all('div', {'class':'booking'}):
    # find the first run of consecutive digits in the text
    match = re.search(r'\d+', booking.text)
    # print the digits if found; otherwise skip the entry
    if match:
        print match.group()

15
25
27
275
118
5
170
49
24
79
5
333
7
9
161
70
33
9
72
114
7
43
65
4
84
10
652
172
7
17
366
53
16
63
56
8
94
290
4
51
29
29
19
313
27
22
11
71
60
13
284
84
7
76
26
4
28
8
163
11
91
9
8
14
33
58
12
11
78
50
60
14
71
70
10
43
51
16
50
24
14
33
42
30
66
15
13
5
161
15
14
3
130
24
78
56
12
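
The same idea can be packaged as a small function that returns a number instead of printing a string. This is a sketch in Python 3; `parse_bookings` is a name I made up.

```python
import re


def parse_bookings(text):
    # Return the first run of digits in `text` as an int, or None if the
    # text is empty or contains no digits.
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


print(parse_bookings("Booked 275 times today"))
print(parse_bookings("No bookings yet"))
```

Returning an `int` (or `None`) means the values can be summed or sorted later without another conversion step.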


Before we demonstrate all the other amazing things a browser driver can do, let's finish up collecting the data we want from this current example. Do you suppose the html parsing we wrote above will still work on the page source we've grabbed through Selenium?

To be most efficient, we want to do only a single loop over the entries on the page. That means we want to find the element that houses all of the other elements (name, location, price, bookings). Where on the page is each entry located?

In [30]:
# print out all entries
soup.find_all('div', {'class':'result content-section-list-row cf with-times'})

[<div class="result content-section-list-row cf with-times" data-id="0" data-index="1" data-lat="38.7886940" data-lon="-77.0190220" data-offers="" data-rid="341935"><div class="rest-row with-image"> <div class="rest-row-image"> <a href="/r/voltaggio-brothers-steak-house-mgm-national-harbor" target="_blank"><img alt="photo of voltaggio brothers steak house - mgm national harbor restaurant" class="lazy rest-image" data-src="//resizer.otstatic.com/v2/profiles/legacy/341935.jpg" src="//media.otstatic.com/search-result-node/images/no-image.png"/></a></div> <div class="rest-row-info"> <a class="rest-row-name rest-name " href="/r/voltaggio-brothers-steak-house-mgm-national-harbor" target="_blank"> <span class="rest-row-name-text">Voltaggio Brothers Steak House - MGM National Harbor</span> </a> <div class="rest-row-grid--row"> <div class="rest-row-review"> <div class="star-rating"><div class="star-wrapper small"><div class="all-stars"></div><div class="all-stars filled" style="width: 84%;"></d

Look over the page. Does every single entry have each element we're seeking?
> I did this previously. I know for a fact that not every entry has a number of recent bookings. That's probably exactly why OpenTable renders this with JavaScript: they want to continuously update each listing with the latest booking count.

In [31]:
# what happens when a booking is not available?
# print out each booking entry, using the identification code we wrote above
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print entry.find('div', {'class':'booking'})

<div class="booking"><span class="tadpole"></span>Booked 15 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 25 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 27 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 275 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 118 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 5 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 170 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 49 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 24 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 79 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 5 times today</div>
None
<div class="booking"><span class="tadpole"></span>Booked 333 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 

In [32]:
# try to print the text of each booking directly -- watch what happens when one is missing
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print entry.find('div', {'class':'booking'}).text

Booked 15 times today
Booked 25 times today
Booked 27 times today
Booked 275 times today
Booked 118 times today
Booked 5 times today
Booked 170 times today
Booked 49 times today
Booked 24 times today
Booked 79 times today
Booked 5 times today


AttributeError: 'NoneType' object has no attribute 'text'

What do you notice takes the place of the booking element when it is not found?

`find` returned `None` for that entry, and `None` has no `.text` attribute. Thus, we will use exceptions. Here's a demo:

In [33]:
# if we find the element we want, we print it. Otherwise, we print 'ZERO'
# (find returns None when there is no booking element, so .text raises AttributeError)
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    try:
        print entry.find('div', {'class':'booking'}).text
    except AttributeError:
        print 'ZERO'

Booked 15 times today
Booked 25 times today
Booked 27 times today
Booked 275 times today
Booked 118 times today
Booked 5 times today
Booked 170 times today
Booked 49 times today
Booked 24 times today
Booked 79 times today
Booked 5 times today
ZERO
Booked 333 times today
Booked 7 times today
Booked 9 times today
Booked 161 times today
Booked 70 times today
Booked 33 times today
Booked 9 times today
ZERO
Booked 72 times today
Booked 114 times today
Booked 7 times today
Booked 43 times today
Booked 65 times today
Booked 4 times today
Booked 84 times today
Booked 10 times today
Booked 652 times today
Booked 172 times today
Booked 7 times today
Booked 17 times today
Booked 366 times today
Booked 53 times today
Booked 16 times today
Booked 63 times today
Booked 56 times today
Booked 8 times today
Booked 94 times today
Booked 290 times today
Booked 4 times today
Booked 51 times today
Booked 29 times today
Booked 29 times today
Booked 19 times today
Booked 313 times today
Booked 27 times today

From previously completing this, I know all other elements WILL be returned. That means we do not have to create exceptions for them.

However, the onus is on you to now put all the pieces together.

Loop through each entry. For each entry, grab the relevant information we want (name, location, price, bookings). Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [34]:
import pandas as pd

In [35]:
# I'm going to create my empty df first
dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

In [36]:
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # grab the name
    name =  entry.find('span', {'class':'rest-row-name-text'}).text
    # grab the location 
    location = entry.find('span', {'class':'rest-row-meta--location rest-row-meta-text'}).renderContents()
    # grab the price
    price =  entry.find('div', {'class':'rest-row-pricing'}).find('i').renderContents().count('$')
    # try to find the number of bookings; fall back to 'NA' when the
    # booking element (or a digit inside it) is missing
    booking = entry.find('div', {'class':'booking'})
    match = re.search(r'\d+', booking.text) if booking is not None else None
    bookings = match.group() if match else 'NA'
    dc_eats.loc[len(dc_eats)]=[name, location, price, bookings]

In [37]:
# check out our work
dc_eats.head()

|   | name | location | price | bookings |
|---|------|----------|-------|----------|
| 0 | Voltaggio Brothers Steak House - MGM National ... | Oxon Hill | 4.0 | 15 |
| 1 | Fish by Jose Andres - MGM National Harbor | Oxon Hill | 4.0 | 25 |
| 2 | barmini by José Andrés | Penn Quarter | 3.0 | 27 |
| 3 | Le Diplomate | Logan Circle | 3.0 | 275 |
| 4 | Rasika | Penn Quarter | 3.0 | 118 |


Awesome! We succeeded.
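
An alternative to appending to the DataFrame one row at a time is to collect plain dictionaries first and build the frame once at the end. The sketch below (Python 3) runs on a hand-written sample: the first entry mirrors Rasika from the results above, while "Example Bistro" is invented to show an entry with no booking element. `parse_entry` is a name I made up.

```python
import re
from bs4 import BeautifulSoup

sample = """
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Rasika</span>
  <span class="rest-row-meta--location rest-row-meta-text">Penn Quarter</span>
  <div class="rest-row-pricing"> <i>  $    $    $    </i>    $  </div>
  <div class="booking"><span class="tadpole"></span>Booked 118 times today</div>
</div>
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Example Bistro</span>
  <span class="rest-row-meta--location rest-row-meta-text">Downtown</span>
  <div class="rest-row-pricing"> <i>  $    $      </i>    $    $  </div>
</div>
"""


def parse_entry(entry):
    # The booking div may be missing, so check for None before using it.
    booking = entry.find("div", {"class": "booking"})
    match = re.search(r"\d+", booking.get_text()) if booking is not None else None
    return {
        "name": entry.find("span", {"class": "rest-row-name-text"}).get_text(strip=True),
        "location": entry.find("span", {"class": "rest-row-meta--location"}).get_text(strip=True),
        "price": entry.find("div", {"class": "rest-row-pricing"}).find("i").get_text().count("$"),
        "bookings": int(match.group()) if match else None,
    }


demo_soup = BeautifulSoup(sample, "html.parser")
rows = [parse_entry(e) for e in demo_soup.find_all("div", {"class": "result"})]
print(rows)
```

From here, `pd.DataFrame(rows, columns=["name", "location", "price", "bookings"])` would produce the same four columns in a single step.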

Now, let's explore some of the other functionality of a webdriver. We've barely scratched the surface.

In [38]:
# we can send keys as well
# import
from selenium.webdriver.common.keys import Keys

In [39]:
# open Firefox
driver = webdriver.Firefox()
# visit Python
driver.get("http://www.python.org")
# verify we're in the right place
assert "Python" in driver.title

In [40]:
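# find the search box (the input named "q")
# note: find_element_by_name was removed in Selenium 4; find_element(By.NAME, ...) works in old and new versions
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, "q")
# clear it
elem.clear()
# type in pycon
elem.send_keys("pycon")
# press Enter to submit the search
elem.send_keys(Keys.RETURN)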

In [41]:
# send those keys
#elem.send_keys(Keys.RETURN)
# no results
# assert "No results found." not in driver.page_source

In [42]:
# close
driver.close()

In [43]:
# all at once:
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
from selenium.webdriver.common.by import By  # find_element_by_name was removed in Selenium 4
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
#assert "No results found." not in driver.page_source
driver.close()
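
Several cells above create a new driver without shutting down the previous one, which can leave stray Firefox windows behind. Note that `close()` closes the current window, while `quit()` ends the whole browser session. One way to make the cleanup automatic is a small context manager, sketched here in Python 3; `browser_session` is a name I made up.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    # `make_driver` is any zero-argument callable that returns a driver,
    # for example webdriver.Firefox.
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the body raises, so the browser never lingers.
        driver.quit()


# Usage with Selenium:
# with browser_session(webdriver.Firefox) as driver:
#     driver.get("http://www.python.org")
```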

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html