# Web Scraping with a Programmatic Browser

Many websites rely heavily on JavaScript to make navigation more dynamic and to speed up display by loading only the content that fits on the screen of the user's device.

For these websites, writing a scraper at the HTTP + HTML level is often too complex: one would need to record and analyze all the HTTP requests and responses of a typical navigation session to understand how to simulate it with low-level tools such as `urllib` and `lxml`.
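
To see why the static approach falls short, here is a minimal sketch that uses the standard library `html.parser` module instead of `lxml`. The `PAGE` snippet and the `ListItemCollector` class are made up for illustration: the list is filled in by a script, so a parser that does not execute JavaScript finds no items.

```python
from html.parser import HTMLParser

# Hypothetical page: the book list is filled in by JavaScript at load time.
PAGE = """
<html><body>
  <h1>Best sellers</h1>
  <ul id="books"></ul>
  <script>
    document.getElementById("books").innerHTML = "<li>Some Book</li>";
  </script>
</body></html>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element found in the static markup."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


collector = ListItemCollector()
collector.feed(PAGE)
static_items = collector.items
print(static_items)  # the script body is plain text to the parser, not markup
```

A real browser would run the script and the `<li>` element would then exist in the DOM, which is the gap the browser-driven approach closes.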

An alternative is to programmatically control a full-fledged browser such as Firefox, Chrome, Edge or Safari and inspect the state of the Document Object Model (DOM). The browser executes all the JavaScript needed to trigger navigation events, and the scraper drives it through an API that mirrors how an end user interacts with the web page: clicking on links, scrolling with the mouse, pressing keys to type text in forms, and so on.

To illustrate this approach, let's write a program that navigates to the Books section of the Amazon online shop and retrieves the titles of the best sellers along with their respective prices.

To do these exercises you will need to install the Python package for `selenium` (which includes the `webdriver` API), the [Firefox](https://www.mozilla.org/en-US/firefox/) web browser and the [geckodriver](https://github.com/mozilla/geckodriver/releases) helper program (unzip it and put it somewhere in your PATH, for instance in the `bin/` or `Scripts/` folder of your conda environment).

In [None]:
import sys
# The find_element_by_* helpers used below were removed in Selenium 4.3.
!{sys.executable} -m pip install "selenium<4.3"

In [None]:
from selenium import webdriver

In [None]:
driver = webdriver.Firefox()

In [None]:
driver.get("https://www.amazon.com")

## Navigation to the Book Page

In [None]:
driver.find_elements_by_partial_link_text("book")

In [None]:
driver.find_elements_by_link_text("Departments")

In [None]:
departments_link = driver.find_element_by_link_text("Departments")

In [None]:
departments_link.click()

In [None]:
driver.find_elements_by_link_text("Books")

In [None]:
driver.find_element_by_link_text("Books").click()

## Finding the List of Best Sellers

In [None]:
carousels = driver.find_elements_by_class_name("acswidget-carousel")
len(carousels)

In [None]:
carousels = driver.find_elements_by_css_selector(".acswidget-carousel")
len(carousels)

In [None]:
first_carousel = carousels[0]
first_carousel.tag_name

In [None]:
print(carousels[0].text[:100])

In [None]:
print(carousels[1].text[:100])

In [None]:
print(carousels[2].text[:100])

In [None]:
bestsellers_carousels = [carousel for carousel in carousels
                         if "Books Bestsellers" in carousel.text]
len(bestsellers_carousels)

In [None]:
[bestsellers_carousel] = bestsellers_carousels

In [None]:
nextpage_button = bestsellers_carousel.find_element_by_class_name("a-carousel-goto-nextpage")

In [None]:
nextpage_button.click()

In [None]:
driver.execute_script("arguments[0].click();", nextpage_button)

In [None]:
cards = bestsellers_carousel.find_elements_by_class_name("a-carousel-card")
len(cards)

Try resizing the Firefox window or reducing the font size, then fetch the list of cards in the `bestsellers_carousel` element again.

In [None]:
cards = bestsellers_carousel.find_elements_by_class_name("a-carousel-card")
len(cards)

Notice that the content can change dynamically: the JavaScript in the page reacts to font-size or window-size change events by loading information about more books and updating the DOM.
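
Because the DOM changes over time, a scraper often has to wait until the content it needs has been loaded. The following is a minimal sketch of a polling helper that runs without a browser. `wait_until` and the simulated `more_than_two_cards` condition are hypothetical names introduced here. Selenium ships a comparable utility, `WebDriverWait`, in `selenium.webdriver.support.ui`.

```python
import time


def wait_until(predicate, timeout=5.0, poll_interval=0.1):
    """Call predicate repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a DOM that gains elements over time.
loaded_cards = []


def more_than_two_cards():
    loaded_cards.append("card")  # simulate the page loading one more card per poll
    return len(loaded_cards) > 2 and list(loaded_cards)


cards_snapshot = wait_until(more_than_two_cards, timeout=2.0, poll_interval=0.01)
print(len(cards_snapshot))  # 3
```

With a real driver, the predicate could be a function that fetches the cards again and returns them only once enough of them are present.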

Let's consider the first "carousel card" element of our updated carousel.

In [None]:
first_card = cards[0]

In [None]:
print(first_card.text)

In [None]:
product_links = [tag for tag in first_card.find_elements_by_tag_name("a")
                 if "/product/" in tag.get_attribute("href")]
product_links

In [None]:
[link.text for link in product_links]

## Exercises:

- Use Firefox's "Inspect Element" tool to analyse the DOM and find a way to extract the text of the price element for `first_card`.

- Write a function named `extract_price(card)` that extracts the **numerical** price for the book described in a "carousel card" element. Try the function on `first_card`.

- Write a function named `extract_bookname(card)` that extracts the name of the book described in a "carousel card" element. Try your function on `first_card`.

- Write a function named `extract_product_id(card)` that extracts the Amazon product identifier from the URL of the product link in the card.

In [None]:
def extract_price(book_card):
    # TODO: implement me!
    return 10.0


extract_price(first_card)

In [None]:
def extract_price(book_card):
    price_element = book_card.find_element_by_class_name("acs_product-price")
    price_text = price_element.text.strip()
    assert price_text.startswith("$")
    return float(price_text[1:])

extract_price(first_card)
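
The text-to-number step can be checked separately from the browser. This is a minimal sketch, assuming US-style price strings that may contain thousands separators. `parse_price` is a hypothetical helper, not part of the solution above.

```python
def parse_price(price_text):
    """Convert a US-style price string such as "$1,234.56" to a float."""
    cleaned = price_text.strip()
    if not cleaned.startswith("$"):
        raise ValueError(f"unexpected price format: {price_text!r}")
    # Drop the currency symbol and the thousands separators before converting.
    return float(cleaned[1:].replace(",", ""))


print(parse_price(" $12.99 "))   # 12.99
print(parse_price("$1,234.56"))  # 1234.56
```

Raising `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.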

In [None]:
def extract_bookname(book_card):
    # TODO: implement me
    return "The story of my life"

extract_bookname(first_card)

In [None]:
def extract_bookname(book_card):
    link_tags = [tag for tag in book_card.find_elements_by_tag_name("a")
                 if "/product/" in tag.get_attribute("href")
                 and not tag.find_elements_by_tag_name("img")]
    assert len(link_tags) == 1
    return next(iter(link_tags)).text

extract_bookname(first_card)

In [None]:
def extract_product_id(book_card):
    # TODO: implement me!
    return 238834593244


extract_product_id(first_card)

In [None]:
def extract_product_id(book_card):
    link_urls = [tag.get_attribute("href")
                 for tag in book_card.find_elements_by_tag_name("a")]
    # Skip <a> tags without an href (get_attribute returns None for them).
    product_urls = set(url for url in link_urls if url and "/product/" in url)
    assert len(product_urls) == 1
    url = next(iter(product_urls))
    components = url.split("/")
    # Keep the identifier as a string: int() would drop leading zeros.
    return components[components.index("product") + 1]


extract_product_id(first_card)
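
The URL handling can also be tried without a browser. The sketch below assumes the identifier is the path segment that follows `product`. The helper name `product_id_from_url` and the example URL are made up for illustration.

```python
from urllib.parse import urlparse


def product_id_from_url(url):
    """Return the path segment that follows "product" in the URL."""
    # urlparse separates the path from the query string, so trailing
    # "?key=value" parameters cannot leak into the identifier.
    path_parts = urlparse(url).path.split("/")
    return path_parts[path_parts.index("product") + 1]


url = "https://www.example.com/gp/product/0123456789/ref=abc?pf=1"
print(product_id_from_url(url))  # 0123456789
```

The identifier is returned as a string so that leading zeros are preserved.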

## Exercise

Write a function that takes a new driver as argument, navigates to the Books department and retrieves the top 30 best-selling books. For each book, extract the Amazon product identifier, the title and the price.

Store the results in a list of Python dictionaries or a pandas dataframe.
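
One possible way to organize the results, sketched here without a browser: each page of the carousel yields a list of dictionaries, and the pages are merged while dropping duplicates. The record keys (`product_id`, `title`, `price`), the `merge_pages` helper and the sample records are all hypothetical.

```python
def merge_pages(pages, limit=30):
    """Merge per-page record lists, keeping first-seen order and unique ids."""
    seen = set()
    merged = []
    for page in pages:
        for record in page:
            if record["product_id"] in seen:
                continue  # guard against a card being listed twice
            seen.add(record["product_id"])
            merged.append(record)
            if len(merged) == limit:
                return merged
    return merged


# Made-up records standing in for what the extract_* functions would return.
page_1 = [
    {"product_id": "0000000001", "title": "Book A", "price": 12.5},
    {"product_id": "0000000002", "title": "Book B", "price": 8.0},
]
page_2 = [
    {"product_id": "0000000002", "title": "Book B", "price": 8.0},
    {"product_id": "0000000003", "title": "Book C", "price": 20.0},
]
books = merge_pages([page_1, page_2])
print([book["title"] for book in books])  # ['Book A', 'Book B', 'Book C']
```

A list of dictionaries like `books` can be passed directly to `pandas.DataFrame(books)` to obtain a dataframe with one column per key.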

In [None]:
def find_bestsellers_carousel(driver):
    driver.get("https://www.amazon.com")
    driver.find_element_by_link_text("Departments").click()
    driver.find_element_by_link_text("Books").click()
    carousels = driver.find_elements_by_class_name("acswidget-carousel")
    bestsellers_carousels = [carousel for carousel in carousels
                             if "Books Bestsellers" in carousel.text]
    assert len(bestsellers_carousels) == 1
    return next(iter(bestsellers_carousels))


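def extract_bestseller_data(driver):
    # Minimal version: only reads the cards currently present in the carousel.
    # TODO: page through the carousel until 30 books have been collected.
    carousel = find_bestsellers_carousel(driver)
    cards = carousel.find_elements_by_class_name("a-carousel-card")
    return [{"product_id": extract_product_id(card),
             "title": extract_bookname(card),
             "price": extract_price(card)}
            for card in cards]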


def extract_bookname(book_card):
    link_tags = [tag for tag in book_card.find_elements_by_tag_name("a")
                 if "/product/" in tag.get_attribute("href")
                 and not tag.find_elements_by_tag_name("img")]
    assert len(link_tags) == 1
    return next(iter(link_tags)).text


def extract_price(book_card):
    price_element = book_card.find_element_by_class_name("acs_product-price")
    price_text = price_element.text.strip()
    assert price_text.startswith("$")
    return float(price_text[1:])

In [None]:
extract_bookname(first_card)

In [None]:
extract_price(first_card)

Do not forget to close the browser session to release system resources.

In [None]:
# quit() ends the whole browser session; close() only closes the current window.
driver.quit()
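
To make sure the browser is released even when the scraping code raises an exception, the cleanup can be wrapped in a context manager. This is a minimal sketch: `browser_session` is a hypothetical helper, and `FakeDriver` stands in for a real driver so that the example runs without a browser. It relies on the driver's `quit()` method, which ends the whole session, whereas `close()` only closes the current window.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(driver_factory):
    """Create a driver, hand it to the caller and always quit it afterwards."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a real driver; it only records whether quit() was called."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


fake = FakeDriver()
try:
    with browser_session(lambda: fake) as driver:
        raise RuntimeError("scraping failed half-way")
except RuntimeError:
    pass
print(fake.quit_called)  # True
```

With Selenium, the factory would simply be `webdriver.Firefox`, as in `with browser_session(webdriver.Firefox) as driver:`.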

## Using Firefox in Headless Mode

In [None]:
import os

os.environ['MOZ_HEADLESS'] = '1'
driver = webdriver.Firefox()
driver.get("https://www.scikit-learn.org/stable")

In [None]:
for title in driver.find_elements_by_tag_name("h2"):
    print(title.text)

In [None]:
driver.quit()
del os.environ['MOZ_HEADLESS']
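
Setting and deleting the environment variable by hand is easy to get wrong if an exception occurs in between. The sketch below wraps the pattern in a context manager that restores the previous state on exit. `temporary_env` is a hypothetical helper introduced here for illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for the duration of a with-block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier value, or remove the variable if it was unset.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with temporary_env("MOZ_HEADLESS", "1"):
    inside = os.environ["MOZ_HEADLESS"]
print(inside)  # 1
```

The driver has to be created inside the `with` block, since the browser process inherits the environment at the moment it is started.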