# 10.3.1 Use Splinter

In [7]:
# Import our scraping tools:
# the Browser class from splinter,
# the BeautifulSoup class from bs4 (aliased as soup),
# and ChromeDriverManager, which installs and manages the driver for Chrome

from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

# (At first, the splinter import did not work; installing selenium fixed it: pip install selenium)

In [8]:
# Set up Splinter -- set the executable path and initialize a browser
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 101.0.4951
Get LATEST chromedriver version for 101.0.4951 google-chrome
Driver [C:\Users\micha\.wdm\drivers\chromedriver\win32\101.0.4951.41\chromedriver.exe] found in cache


# 10.3.2 Practice with Splinter and BeautifulSoup

### Scraping Practice Website Link:

http://quotes.toscrape.com/

### Scrape the Top 10 Tags

In [9]:
# Q: What is the process to follow when discovering what HTML tags contain our data?
# A: Locate the data to scrape on the page, use “Inspect” to bring up DevTools, then select an element on a page to inspect it.

### Search for Elements

### 1st Scrape: Scrape the Title

In [10]:
# Visit the Quotes to Scrape site
# This code tells Splinter which site we want to visit by assigning the link to the url variable.
# (After executing the code, we will use BeautifulSoup to parse the HTML.)

url = 'http://quotes.toscrape.com/'
browser.visit(url)

In [11]:
# use BeautifulSoup to parse the HTML
html = browser.html
html_soup = soup(html, 'html.parser')

# Now we've parsed all of the HTML on the page. 
# That means that BeautifulSoup has taken a look at the different components and can now access them. 
# Specifically, BeautifulSoup parses the HTML text and then stores it as an object.
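To make the "stores it as an object" point concrete, here is a minimal sketch that parses a small hard-coded HTML string instead of `browser.html`. The `sample_html` markup and the `sample_soup` and `heading` names are assumptions made for illustration; they are not taken from the live site.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical stand-in for browser.html; not the real page source.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h2>Top Ten tags</h2>
    <span class="text">A short quote.</span>
  </body>
</html>
"""

sample_soup = soup(sample_html, 'html.parser')

# The parsed document is an object, and each element found in it is an object too.
print(type(sample_soup).__name__)        # BeautifulSoup
heading = sample_soup.find('h2')
print(type(heading).__name__)            # Tag
print(heading.name, '->', heading.text)  # h2 -> Top Ten tags
```

Because the result is an object rather than a plain string, methods such as `find` and `find_all` can be called on it directly, which is what the following cells rely on.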

In [12]:
# Scrape the Title -- find the title and extract it
title = html_soup.find('h2').text
title

'Top Ten tags'

### 2nd Scrape: Scrape All of the Tags

In [13]:
# Scrape the top ten tags
tag_box = html_soup.find('div', class_='tags-box')
# tag_box
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)

love
inspirational
life
humor
books
reading
friendship
friends
truth
simile
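
The two-step lookup above (find the box, then find the links inside it) can also be written as a single CSS selector with BeautifulSoup's `select`. This is only a sketch of an alternative, run against a hypothetical HTML fragment that imitates the tags box rather than the real page.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical fragment imitating the tags box; not copied from the site.
sample_html = """
<div class="tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/life/">life</a></span>
</div>
<a class="tag" href="/tag/other/">outside-the-box</a>
"""

sample_soup = soup(sample_html, 'html.parser')

# 'div.tags-box a.tag' matches only <a class="tag"> elements inside the box,
# so the link outside the box is skipped.
top_tags = [a.get_text(strip=True) for a in sample_soup.select('div.tags-box a.tag')]
print(top_tags)  # ['love', 'life']
```

Scoping the search to the container matters here because the same `tag` class can appear elsewhere on a page.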


### Scrape Across Pages

In [14]:
url = 'http://quotes.toscrape.com/'
browser.visit(url)

In [15]:
for x in range(1, 6):
    # Parse the page currently loaded in the browser
    html = browser.html
    quote_soup = soup(html, 'html.parser')
    quotes = quote_soup.find_all('span', class_='text')
    for quote in quotes:
        print('page:', x, '----------')
        print(quote.text)
    # Click the "Next" link to load the following page
    browser.links.find_by_partial_text('Next').click()

page: 1 ----------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
page: 1 ----------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
page: 1 ----------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
page: 1 ----------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
page: 1 ----------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
page: 1 ----------
“Try not to become a man of success. Rather become a man of value.”
page: 1 ----------
“It is better to be hated for what you are than to be loved for what you are not.”
page: 1 ----------
“I have not failed. I've just found 10,000 ways that won't work.”
page: 1 ----------
“A woman is like a tea bag; you never know how strong it is u
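
The loop above assumes there are at least five pages. A possible variation is to keep going until a page has no "Next" link. The sketch below simulates that offline: the `pages` dictionary is a hypothetical stand-in for what `browser.html` would return after each visit, and the `li.next a` selector is an assumption about the markup, so treat it as an illustration of the stopping rule rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical pages keyed by path, standing in for browser.html after each visit.
pages = {
    '/': '<span class="text">Quote A</span>'
         '<li class="next"><a href="/page/2/">Next</a></li>',
    '/page/2/': '<span class="text">Quote B</span>'
                '<span class="text">Quote C</span>',
}

path = '/'
page_number = 1
collected = []

while path is not None:
    page_soup = soup(pages[path], 'html.parser')
    print('page:', page_number, '----------')
    for quote in page_soup.find_all('span', class_='text'):
        print(quote.text)
        collected.append((page_number, quote.text))
    # Stop when the page has no Next link instead of assuming a page count.
    next_link = page_soup.select_one('li.next a')
    path = next_link['href'] if next_link is not None else None
    page_number += 1
```

With Splinter, the same idea would mean checking the result of `browser.links.find_by_partial_text('Next')` before calling `.click()` on it.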

In [None]:
# Q: To create a new div element that acts as a container and has an id of "box", which of these options would you use?
# A: <div class="container" id="box"></div>

# Q: What would happen if we ran soup.find_all('div', class_='quote') instead of soup.find_all('span', class_='text')?
# A: We would scrape the parent element and grab everything inside it instead of just the quote text.
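
The difference between the parent and child selections can be checked on a small example. The markup below is a hypothetical imitation of one quote block, and `child_texts` and `parent_texts` are names introduced only for this sketch.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical markup imitating one quote block; not copied from the site.
sample_html = """
<div class="quote">
  <span class="text">A short quote.</span>
  <span>by <small class="author">Some Author</small></span>
  <div class="tags"><a class="tag" href="#">example</a></div>
</div>
"""

sample_soup = soup(sample_html, 'html.parser')

# Child element: only the quote text.
child_texts = [span.text for span in sample_soup.find_all('span', class_='text')]
print(child_texts)   # ['A short quote.']

# Parent element: the text of every descendant (quote, author, and tags).
parent_texts = [div.get_text(' ', strip=True)
                for div in sample_soup.find_all('div', class_='quote')]
print(parent_texts)  # ['A short quote. by Some Author example']
```

Selecting the parent is still useful when the author and tags are wanted too; in that case each piece can be pulled out of the `div` with its own `find` call.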