With these two lines of code, we are creating an instance of a Splinter browser. This means that we're prepping our automated browser. We're also specifying that we'll be using Chrome as our browser. `**executable_path` is unpacking the dictionary we've stored the path in – think of it as unpacking a suitcase. `headless=False` means that all of the browser's actions will be displayed in a Chrome window so we can see them.

In [1]:
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# Set up Splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 90.0.4430
Get LATEST driver version for 90.0.4430
Driver [/Users/joshuaallen/.wdm/drivers/chromedriver/mac64/90.0.4430.24/chromedriver] found in cache


This code tells Splinter which site we want to visit by assigning the link to a URL.

In [3]:
# Visit the Quotes to Scrape site
url = 'http://quotes.toscrape.com/'
browser.visit(url)

After executing the cell above, we will use BeautifulSoup to parse the HTML.

BeautifulSoup has taken a look at the different components and can now access them. Specifically, BeautifulSoup parses the HTML text and then stores it as an object.

In [4]:
# Parse the HTML
html = browser.html
html_soup = soup(html, 'html.parser')

## Scrape the Title

In our next cell, we will find the title and extract it.

What we've just done in these last two lines of code is:

1. We used our `html_soup` object we created earlier and chained `find()` to it to search for the `<h2 />` tag.
2. We've also extracted only the text within the HTML tags by adding `.text` to the end of the code.

In [5]:
# Scrape the Title
title = html_soup.find('h2').text
title

'Top Ten tags'

## Scrape All of the Tags

The first line, `tag_box = html_soup.find('div', class_='tags-box')`, creates a new variable `tag_box`, which will be used to store the results of a search. In this case, we're looking for `<div />` elements with a class of `tags-box`, and we're searching for it in the HTML we parsed earlier and stored in the `html_soup` variable.

The second line, `tags = tag_box.find_all('a', class_='tag')`, is similar to the first but with a few tweaks to make the search more specific. The new "tags" variable will hold the results of a `find_all`, but this time we're searching through the parsed results stored in our `tag_box` variable to find `<a />` elements with a `tag` class.

We used `find_all` this time because we want to capture all results, instead of a single or specific one.

Next, we've added a `for` loop. This `for` loop cycles through each tag in the `tags` variable, strips the HTML code out of it, and then prints only the text of each tag.

In [6]:
# Scrape the top ten tags
tag_box = html_soup.find('div', class_='tags-box')
# tag_box
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)

love
inspirational
life
humor
books
reading
friendship
friends
truth
simile


## Scrape Across Pages


In [8]:
# Assigns URL to url variable
url = 'http://quotes.toscrape.com/'
# Causes automated browser to navigate to url
browser.visit(url)

In the next cell, we'll create a `for` loop to collect each quote, "click" the next button, then collect the next set of quotes. We'll use `range(1, 6)` in our `for` loop to visit the first five pages of the website.

In [10]:
# for loop w/ 5 iterations
for x in range(1,6):
    
    # create an html object, assigned to html variable
    html = browser.html
    
    # use BeautifulSoup to parse the html object
    quote_soup = soup (html, 'html.parser')
    
    # use BS to find all <span /> tags within a class of "text"
    quotes = quote_soup.find_all('span', class_ = 'text')
    
    # Print statements qrapped in another for loop that will print each quote parsed by BS4
    for quote in quotes:
        print('page:', x, '------')
        print(quote.text)
    
    # Use Splinter browser tools to click the "Next" button
    browser.links.find_by_partial_text('Next')
    

page: 1 ------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
page: 1 ------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
page: 1 ------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
page: 1 ------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
page: 1 ------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
page: 1 ------
“Try not to become a man of success. Rather become a man of value.”
page: 1 ------
“It is better to be hated for what you are than to be loved for what you are not.”
page: 1 ------
“I have not failed. I've just found 10,000 ways that won't work.”
page: 1 ------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
page: 1 ---

### Note about BS4 scraping:
It's important to note that there are many ways that BeautifulSoup can search for text, but the syntax is typically the same: we look for a tag first, then an attribute. We can search for items using only a tag, such as a `<span />` or `<h1 />`, but a **class** or **id** attribute makes the search that much more specific.

In [13]:
# Replaced soup.find_all('span', class='text') with soup.find_all('div', class_='quote')
# would scrape the parent element and grab everything instead of just the quotes.

# for loop w/ 5 iterations
for x in range(1,6):
    
    # create an html object, assigned to html variable
    html = browser.html
    
    # use BeautifulSoup to parse the html object
    quote_soup = soup (html, 'html.parser')
    
    # use BS to find all <div /> tags within a class of "quote"
    # NOTE: This line of code doesn’t target a single, unique element. 
    # Instead, it pulls everything that’s inside the targeted element, even other nested elements.
    quotes = quote_soup.find_all('div', class_ = 'quote')
    
    # Print statements qrapped in another for loop that will print each quote parsed by BS4
    for quote in quotes:
        print('page:', x, '------')
        print(quote.text)
    
    # Use Splinter browser tools to click the "Next" button
    browser.links.find_by_partial_text('Next')
    

page: 1 ------

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world


page: 1 ------

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)


            Tags:
            
abilities
choices


page: 1 ------

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)


            Tags:
            
inspirational
life
live
miracle
miracles


page: 1 ------

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)


            Tags:
            
aliteracy
books
classic
humor


page: 1 ------

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by 

## Skill Drill 10.3.2

In [14]:
# Assigns URL to url variable
url = 'http://books.toscrape.com/'
# Causes automated browser to navigate to url
browser.visit(url)

In [None]:
# Scrape the top ten tags
tag_box = html_soup.find('div', class_='tags-box')
# tag_box
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)

In [19]:
# Scrape the top ten tags
list_box = html_soup.find('ul', class_='nav nav-list')
# tag_box


for i in list_box:
    word = list_box.text
    print(word)

TypeError: 'NoneType' object is not iterable

## Scrape Mars Data: The News
