# Web Scraping with a Programmatic Browser

Many websites rely heavily on JavaScript to make navigation more dynamic and to speed up rendering by loading only the content that fits on the screen of the user's device.

For these websites, writing a scraper at the HTTP + HTML level is often too complex: one would need to record and analyze all the HTTP requests and responses of a typical navigation session to understand how to simulate it with low-level tools such as `urllib` and `lxml`.
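To make this limitation concrete, here is a minimal sketch that parses a made-up page at the HTML level. It uses the standard library's `html.parser` rather than `lxml` only to stay dependency-free. The HTML snippet, the `book-title` class name and the `TitleCollector` helper are all hypothetical and do not come from any real website.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the raw HTML and would only be
# filled in once a browser executes the script.
RAW_HTML = """
<html><body>
  <ul id="bestsellers"></ul>
  <script>
    document.getElementById("bestsellers").innerHTML =
      "<li class='book-title'>Some Book</li>";
  </script>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collect the text of every <li class="book-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "book-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


# Parsing the raw HTML: the script is never executed, so no title is found.
static_view = TitleCollector()
static_view.feed(RAW_HTML)
print(static_view.titles)

# Parsing what the DOM would contain after the script has run in a browser.
rendered_view = TitleCollector()
rendered_view.feed("<ul><li class='book-title'>Some Book</li></ul>")
print(rendered_view.titles)
```

The first parse sees nothing because the markup only exists as a string inside the script, which is exactly the situation where driving a real browser becomes attractive.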

An alternative is to programmatically control a full-fledged browser such as Firefox, Chrome, Edge or Safari and inspect the state of the Document Object Model (DOM). This makes it possible to execute all the JavaScript needed to trigger navigation events, through an API that mirrors how an end user interacts with the web page (clicking on links, scrolling with the mouse, pressing keys to type text in forms, and so on).

To illustrate this approach, let's write a program that navigates to the Books section of the Amazon online shop and retrieves the titles of the best sellers along with their respective prices.

To do this exercise you will need to install the `selenium` Python package (which includes the `webdriver` API), the [Firefox](https://www.mozilla.org/en-US/firefox/) web browser and the [geckodriver](https://github.com/mozilla/geckodriver/releases) helper program (unzip it and put it somewhere in your PATH, for instance in the `bin/` or `Scripts/` folder of your conda environment).
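Before launching the browser, it can help to check that the helper program is actually visible on the PATH. The sketch below relies only on the standard library's `shutil.which`; the `find_on_path` helper name is an assumption made for this example.

```python
import shutil


def find_on_path(program):
    """Return the full path of `program` if it is on the PATH, else None."""
    return shutil.which(program)


geckodriver_path = find_on_path("geckodriver")
if geckodriver_path is None:
    print("geckodriver was not found on the PATH")
else:
    print("geckodriver found at", geckodriver_path)
```

If the program is not found, double-check the folder where the executable was unzipped and restart the notebook kernel so that it picks up the updated PATH.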

In [None]:
import sys
!{sys.executable} -m pip install selenium

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

In [None]:
# Launch a Firefox window controlled through geckodriver.
driver = webdriver.Firefox()

In [None]:
driver.get("https://www.amazon.com")

In [None]:
# Locate a link by its visible text and click on it.
driver.find_element(By.LINK_TEXT, "Departments").click()

In [None]:
driver.find_element(By.LINK_TEXT, "Books").click()

In [None]:
carousels = driver.find_elements(By.CLASS_NAME, "acswidget-carousel")
len(carousels)

In [None]:
# Keep only the carousels whose visible text mentions the bestsellers.
bestsellers_carousels = [c for c in driver.find_elements(By.CLASS_NAME, "acswidget-carousel")
                         if "Books Bestsellers" in c.text]
len(bestsellers_carousels)

In [None]:
best_seller_carousel = bestsellers_carousels[0]

In [None]:
nextpage_button = best_seller_carousel.find_element(By.CLASS_NAME, "a-carousel-goto-nextpage")

In [None]:
nextpage_button.is_enabled()

In [None]:
# A native click can fail when the button is covered by another element:
# nextpage_button.click()

In [None]:
# Trigger the click from JavaScript instead.
driver.execute_script("arguments[0].click();", nextpage_button)

In [None]:
cards = best_seller_carousel.find_elements(By.CLASS_NAME, "a-carousel-card")
len(cards)

Try resizing the Firefox window or reducing the font size, then re-run the previous notebook cell: the number of cards should change, since the carousel only loads the items that fit on the screen.

In [None]:
first_card = cards[0]

In [None]:
# Keep the text of product links, skipping links that only wrap an image.
[tag.text for tag in first_card.find_elements(By.TAG_NAME, "a")
 if "/product/" in (tag.get_attribute("href") or "")
 and not tag.find_elements(By.TAG_NAME, "img")]

In [None]:
price_text = first_card.find_element(By.CLASS_NAME, "acs_product-price").text
price_text

In [None]:
# Drop the currency symbol and any thousands separator before converting.
price = float(price_text.strip().replace("$", "").replace(",", ""))
price
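Price strings scraped from a page are not always as clean as a bare `$` followed by digits. As a sketch under that assumption, the hypothetical `parse_price` helper below extracts the first dollar amount with a regular expression and returns a `Decimal`, which avoids floating point rounding on monetary values. The sample strings are made up for illustration.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first dollar amount found in `text` as a Decimal.

    Returns None when no amount is present.
    """
    match = re.search(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    # Remove thousands separators before building the Decimal.
    return Decimal(match.group(1).replace(",", ""))


print(parse_price(" $12.99 "))      # 12.99
print(parse_price("$1,299.00"))     # 1299.00
print(parse_price("Out of stock"))  # None
```

Returning `None` instead of raising makes it easy to skip cards that do not display a price when looping over the whole carousel.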

Do not forget to close the browser session to release system resources.

In [None]:
# quit() ends the whole session, while close() only closes the current window.
driver.quit()
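If a cell raises an exception halfway through a scraping session, the cleanup cell may never run. One possible way to guard against this is a small context manager that always shuts the driver down; it calls `quit()`, which ends the whole session, whereas `close()` only closes the current window. The `browser_session` helper and the `FakeDriver` stand-in below are hypothetical names, and the fake driver is only there so that the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(driver_factory):
    """Create a driver, hand it to the `with` block, and always quit it."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # Runs whether the block finished normally or raised an exception.
        driver.quit()


class FakeDriver:
    """Stand-in for a real driver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with browser_session(FakeDriver) as fake:
    pass  # scraping code would go here

print(fake.closed)  # True
```

With Selenium installed, the same helper could be used as `with browser_session(webdriver.Firefox) as driver:` followed by the scraping code.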

## Using Firefox in Headless Mode

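In [None]:
import os

# Firefox starts without a visible window when MOZ_HEADLESS is set.
os.environ['MOZ_HEADLESS'] = '1'
driver = webdriver.Firefox()
driver.get("https://www.scikit-learn.org/stable")

In [None]:
from selenium.webdriver.common.by import By

# Print the text of every second-level heading of the page.
for title in driver.find_elements(By.TAG_NAME, "h2"):
    print(title.text)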

In [None]:
driver.quit()
del os.environ['MOZ_HEADLESS']
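Setting and deleting the environment variable by hand leaves it set if an exception occurs in between. A possible alternative, sketched here with the standard library only, is a context manager that restores the previous value on exit. The `temporary_env` name is an assumption made for this example.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set os.environ[name] to `value` inside the block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable did not exist before: remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with temporary_env("MOZ_HEADLESS", "1"):
    print(os.environ["MOZ_HEADLESS"])  # 1
```

Since the browser reads the variable when it starts, the driver has to be created inside the `with` block for the headless setting to take effect.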