# Browser Automation with Selenium

This notebook contains a short tutorial for scraping with the Selenium toolkit.

We will be scraping `quotes.toscrape.com`, a wonderful page for practicing more advanced scraping techniques.

In [None]:
# imports
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

## When static scraping fails:

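The following webpage is generated dynamically by JavaScript, so the HTML that `requests` downloads does not contain the rendered quotes.
On this page the quote data happens to be visible in the script source, but this is often not the case: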

In [None]:
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/js/"
page = requests.get(url)
# Name the parser explicitly so bs4 does not have to guess one (and warn about it)
print(BeautifulSoup(page.text, "html.parser").body.prettify())
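
To make the failure concrete, we can count the quote boxes in a piece of HTML instead of reading the printout by eye. This is a minimal sketch that uses only the standard library's `html.parser`, so it runs without extra installs. `ClassCounter`, `count_elements`, and the two miniature pages are made up for illustration; they are not the real site's markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one kind that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag="div", css_class="quote"):
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up miniature pages standing in for the real downloads
static_html = """<body><script>
document.write("<div class='quote'>" + text + "</div>");
</script></body>"""
rendered_html = "<body><div class='quote'>A quote</div></body>"

print(count_elements(static_html))    # 0: markup inside <script> is only text
print(count_elements(rendered_html))  # 1
```

Calling `count_elements(page.text)` on the response from the cell above should likewise give 0, because the quote markup only comes into existence once a browser runs the script.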

## Instantiating the WebDriver

When we call the `webdriver.Chrome()` constructor, and Chrome and its driver are properly installed, an automated Chrome window should appear!


In [None]:
driver = webdriver.Chrome()
driver.get(url)

Let's select all of the quote-boxes that have the tag "life".

In [None]:
# This returns a list of elements that have the CSS class 'quote'
quote_boxes = driver.find_elements(
    By.CLASS_NAME, 'quote')

In [None]:
# Let's navigate the first element to recognize a pattern
# Selecting the first div
quote_box = quote_boxes[0]
# Selecting the container div for the tags
tags = quote_box.find_element(By.CLASS_NAME, 'tags')
# Getting the tag names
[
    tag.text for tag
    in tags.find_elements(By.TAG_NAME, 'a')
]

In [None]:
# Some crazy list filtering
life_quotes = [
    quote for quote in quote_boxes if                     # unpack quote_boxes
    'life' in [tag.text for tag in                        # check if 'life' is in
               quote.find_element(By.CLASS_NAME, 'tags'). # the list of tags
               find_elements(By.TAG_NAME, 'a')]           # like we obtained before
]
life_quotes

In [None]:
# Let's put that into a function
def filter_quotes_by_tag(driver, tag):
    quote_boxes = driver.find_elements(By.CLASS_NAME, 'quote')
    tagged_quotes = [
        quote for quote in quote_boxes if                      # unpack quote_boxes
        tag in [t.text for t in                                # check if tag is in
                quote.find_element(By.CLASS_NAME, 'tags').     # the list of tags
                find_elements(By.TAG_NAME, 'a')]               # like we obtained before
    ]
    return tagged_quotes
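
Each `find_element` or `.text` call in the comprehension is a separate request to the browser, so the filter makes several round trips per quote. One alternative, sketched here, is to describe the whole condition in a single XPath expression and let the browser evaluate it. `has_class` and `tagged_quote_xpath` are hypothetical helper names, and the expression assumes the structure used above: a `div` with class `quote` containing a `div` with class `tags` whose links hold the tag names.

```python
def has_class(name):
    """XPath test that is true when `name` is one of the element's CSS classes."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def tagged_quote_xpath(tag):
    """Build an XPath selecting quote boxes that contain a link whose text is `tag`."""
    if "'" in tag:
        # Keeps the sketch simple: the tag is embedded in a single-quoted literal
        raise ValueError("tags containing a single quote are not supported here")
    return (
        f"//div[{has_class('quote')}]"
        f"[.//div[{has_class('tags')}]//a[normalize-space(.)='{tag}']]"
    )


print(tagged_quote_xpath("life"))
```

With the driver from above, `driver.find_elements(By.XPATH, tagged_quote_xpath('life'))` should select the same boxes as `filter_quotes_by_tag(driver, 'life')`. It is worth comparing the two results on the live page before relying on it.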

## Simulating Clicks

We can use the `.click()` method of any element to 'click' on it.

Let's proceed to the next page of quotes.

In [None]:
# Get the "next" element
next_button = driver.find_element(By.PARTIAL_LINK_TEXT, 'Next')
print(driver.current_url)
next_button.click()
print(driver.current_url)
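
The same click can be repeated to walk through every page. The sketch below is one possible way to do it, not the only one: `click_through_pages` is a hypothetical helper that takes a locator tuple such as `(By.PARTIAL_LINK_TEXT, 'Next')`. It relies on `find_elements` returning an empty list when nothing matches, and it caps the number of pages as a safety net. `FakeDriver` and `FakeLink` are stand-ins so the example runs without a browser.

```python
def click_through_pages(driver, next_locator, max_pages=50):
    """Yield the URL of each page, clicking the 'next' link until none is left."""
    visited = 0
    while True:
        yield driver.current_url
        visited += 1
        links = driver.find_elements(*next_locator)
        if not links or visited >= max_pages:
            return
        links[0].click()


# --- Stand-in objects (not Selenium) so the sketch runs without a browser ---
class FakeLink:
    def __init__(self, driver):
        self.driver = driver

    def click(self):
        self.driver.page += 1


class FakeDriver:
    def __init__(self, last_page):
        self.page = 1
        self.last_page = last_page

    @property
    def current_url(self):
        return f"https://example.test/page/{self.page}/"

    def find_elements(self, by, value):
        # Like Selenium, return an empty list when there is no match
        return [FakeLink(self)] if self.page < self.last_page else []


for page_url in click_through_pages(FakeDriver(3), ("partial link text", "Next")):
    print(page_url)
```

With the real driver, the call would look like `click_through_pages(driver, (By.PARTIAL_LINK_TEXT, 'Next'))`. Because this is a generator, the clicks only happen as the loop consumes it, so any scraping of the current page can go inside the loop body.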

## Sending Keys

Let's try to log in!

In [None]:
login_box = driver.find_element(By.LINK_TEXT, 'Login')
login_box.click()

In [None]:
# Locating the username and password fields
username_box = driver.find_element(By.ID, 'username')
password_box = driver.find_element(By.ID, 'password')

In [None]:
username_box.send_keys('username')
password_box.send_keys('password')

In [None]:
# Using XPATH to get the login button
# https://www.w3schools.com/xml/xpath_syntax.asp
login_button = driver.find_element(
    By.XPATH, r"//input[(@type='submit')]")
login_button.click()

## Race Conditions

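Dynamic pages usually take some time to finish rendering.

If you are running Selenium from a script, it will execute the commands sequentially
as fast as possible. This causes problems: an element lookup can run before the
page's JavaScript has created the elements it is looking for.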

In [None]:
url = "https://quotes.toscrape.com/js-delayed/"
driver.get(url)
# The quotes are rendered after a delay, so this most likely returns an empty list
filter_quotes_by_tag(driver, 'life')

Selenium does provide more sophisticated "explicit wait" functionality (`WebDriverWait`),
where you define some condition that it polls until the condition
becomes true or a timeout expires.

I'll demonstrate a simpler (and less reliable) solution, which
is to just use a timed wait.

In [None]:
from time import sleep
url = "https://quotes.toscrape.com/js-delayed/"
driver.get(url)
sleep(10) # I happen to know the length of the delay
filter_quotes_by_tag(driver, 'life')
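
A condition-based wait avoids guessing the delay. The sketch below is a hand-rolled, pure-Python illustration of the polling idea, not Selenium's implementation. `wait_until`, its parameters, and the stand-in condition are hypothetical names chosen for this example.

```python
import time


def wait_until(condition, timeout=15.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose quotes only show up on the third check
attempts = []


def quotes_after_third_check():
    attempts.append(1)
    return ["quote"] if len(attempts) >= 3 else []


print(wait_until(quotes_after_third_check, timeout=5, poll_interval=0.01))
```

With the real driver, `wait_until(lambda: driver.find_elements(By.CLASS_NAME, 'quote'))` would return as soon as at least one quote box exists. Selenium's built-in equivalent is along the lines of `WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'quote')))`, where `WebDriverWait` comes from `selenium.webdriver.support.ui` and `EC` is `selenium.webdriver.support.expected_conditions`.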

In [None]:
driver.quit()