In [1]:
# Import scraping tools
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

The two lines of code in the next cell create an instance of a Splinter browser, which is the automated browser we'll use for scraping. The 'chrome' argument specifies that we'll be using Chrome as our browser. **executable_path unpacks the dictionary we've stored the driver path in and passes its contents as keyword arguments (think of it as unpacking a suitcase). headless=False means that all of the browser's actions will be displayed in a Chrome window so we can see them.

In [2]:
# Set up Splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Downloading: 100%|█████████████████| 6.21M/6.21M [00:17<00:00, 380kB/s]
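The ** unpacking used when creating the browser can be tried on its own. The sketch below is only an illustration: open_browser is a hypothetical stand-in that echoes its arguments (it is not part of Splinter), and the driver path is a placeholder.

```python
def open_browser(name, executable_path=None, headless=True):
    """Hypothetical stand-in for Browser(); it only echoes its arguments."""
    return {"name": name, "executable_path": executable_path, "headless": headless}


# Placeholder path, for illustration only
driver_kwargs = {"executable_path": "/path/to/chromedriver"}

# ** spreads the dictionary's key/value pairs out as keyword arguments...
unpacked = open_browser("chrome", **driver_kwargs, headless=False)
# ...so the call above is equivalent to spelling the keyword out by hand.
explicit = open_browser("chrome", executable_path="/path/to/chromedriver", headless=False)

print(unpacked == explicit)
```

Both calls receive the same arguments, so the comparison prints True.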


In [3]:
# Visit the Books to Scrape site
url = 'http://books.toscrape.com/'
browser.visit(url)

In [4]:
# Use BeautifulSoup to parse the HTML: that means that BeautifulSoup has taken a look at the
# different components and can now access them.  Specifically, BS parses the HTML text and then
# stores it as an object.
html = browser.html
html_soup = soup(html, 'html.parser')

In [5]:
# Scrape the Title
title = html_soup.find('h3').text
title

'A Light in the ...'

(reference the cell above)
What we've just done in the cell above is:
    1. We took the html_soup object we created earlier and chained find() to it to search for the first <h3 /> tag.
    2. We extracted only the text within the HTML tags by adding .text to the end of the code.
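The find() and .text steps can be tried without a browser on a small HTML string. The markup below is made up for illustration and is not copied from the live site; sample_html, sample_soup, and the other variable names are assumptions of this sketch.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not copied from the live site
sample_html = """
<section>
  <h3><a href="book_1.html" title="First Full Title">First ...</a></h3>
  <h3><a href="book_2.html" title="Second Full Title">Second ...</a></h3>
</section>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

first_heading = sample_soup.find("h3")  # only the first matching <h3>
first_text = first_heading.text         # the text inside the tag, markup removed
print(first_text)

# find() returns None when nothing matches, so check before asking for .text
missing = sample_soup.find("h2")
missing_text = missing.text if missing is not None else None
print(missing_text)
```

Because find() stops at the first match, only the first heading's text is returned, and searching for a tag that is not present gives None rather than raising an error.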

In [7]:
# Scrape the text of every link in the book list
tag_box = html_soup.find('ol', class_="row")
# tag_box
tags = tag_box.find_all('a')

for tag in tags:
    word = tag.text
    print(word)


A Light in the ...

Tipping the Velvet

Soumission

Sharp Objects

Sapiens: A Brief History ...

The Requiem Red

The Dirty Little Secrets ...

The Coming Woman: A ...

The Boys in the ...

The Black Maria

Starving Hearts (Triangular Trade ...

Shakespeare's Sonnets

Set Me Free

Scott Pilgrim's Precious Little ...

Rip it Up and ...

Our Band Could Be ...

Olio

Mesaerion: The Best Science ...

Libertarianism for Beginners

It's Only the Himalayas


(reference the cell above)
This code looks similar to our last cell, but we've increased the difficulty a bit by incorporating a for loop. Let's start at the beginning.

The first line, tag_box = html_soup.find('ol', class_="row"), creates a new variable, tag_box, which stores the result of a search. In this case, we're looking for the first <ol /> element with a class of row, and we're searching for it in the HTML we parsed earlier and stored in the html_soup variable.

The next line of code, tags = tag_box.find_all('a'), is similar to the first, but this time we're searching only within the element stored in our tag_box variable rather than the whole page. The new tags variable holds the results of a find_all: every <a /> element inside tag_box.

We used find_all this time because we want to capture all results, instead of a single or specific one.

Next, we've added a for loop. This for loop cycles through each tag in the tags variable, strips the HTML code out of it, and then prints only the text of each tag.
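The same find, find_all, and loop pattern can be sketched on a small, made-up list. The markup is only loosely modelled on a product grid and the real page may differ; the variable names are assumptions of this sketch. It also suggests one likely reason for the blank lines in the output above: a link that wraps only an image has no text of its own.

```python
from bs4 import BeautifulSoup

# Made-up markup, loosely modelled on a product grid; the real page may differ
sample_html = """
<ol class="row">
  <li>
    <a href="book_1.html"><img src="cover_1.jpg" alt="cover" /></a>
    <h3><a href="book_1.html" title="First Full Title">First ...</a></h3>
  </li>
  <li>
    <a href="book_2.html"><img src="cover_2.jpg" alt="cover" /></a>
    <h3><a href="book_2.html" title="Second Full Title">Second ...</a></h3>
  </li>
</ol>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
box = sample_soup.find("ol", class_="row")  # the first matching <ol>
links = box.find_all("a")                    # every <a> inside that <ol>

all_text = [link.text for link in links]
# Image-only links have empty text, which would print as a blank line
visible_text = [text for text in all_text if text.strip()]
# Where a link carries a title attribute, it can hold the untruncated name
full_titles = [link["title"] for link in links if link.has_attr("title")]

print(len(links))
print(visible_text)
print(full_titles)
```

Filtering on text.strip() drops the empty entries, and reading the title attribute is one way to recover a full name when the visible text is shortened, assuming the page provides that attribute.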