The purpose of this notebook is to run the first tests on extracting information from the [Boots - Sleep](https://www.boots.com/health-pharmacy/medicines-treatments/sleep) web page.

Since the page relies on JavaScript to render part of its content, we have chosen `selenium` as the scraping tool.

# selenium

In [16]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import sys

In [2]:
URL = 'https://www.boots.com/health-pharmacy/medicines-treatments/sleep'

ACCEPT_COOKIES_BUTTON_ID = "onetrust-pc-btn-handler"
ACCEPT_RECOMMENDED_COOKIES_BUTTON_ID = "accept-recommended-btn-handler"

PRODUCT_ELEMENTS_CLASS_NAME = "oct-teaser__contents"

PRODUCT_TITLE_ID = 'estore_product_title'
PRODUCT_RATING_CLASS_NAME = 'bv_avgRating_component_container'
PRODUCT_TEXT_CLASS_NAME = 'product_text'
PRODUCT_PRICE_STR_CLASS_NAME = 'price'

In [3]:
def _wait_until(driver, condition, *, timeout=10):
    """
    Helper function that wraps `WebDriverWait` for the WebDriver `driver` to wait
    until the condition `condition` is fulfilled
    """
    return WebDriverWait(driver, timeout).until(condition)


def _wait_until_clickable(driver, by, value, *, timeout=10):
    """
    Convenience function for the driver to wait until the located element
    is clickable
    """
    condition = EC.element_to_be_clickable((by, value))
    return _wait_until(driver, condition, timeout=timeout)
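
To make the behaviour of these helpers more concrete, here is a simplified, standard-library-only sketch of the polling loop behind an explicit wait: a condition is called repeatedly until it returns a truthy value or the timeout expires. The name `poll_until`, its `poll_interval` parameter, and the use of the built-in `TimeoutError` are assumptions made for illustration. This is not selenium's actual implementation, which raises its own `TimeoutException` and can ignore selected exceptions while polling.

```python
import time


def poll_until(subject, condition, *, timeout=10, poll_interval=0.5):
    """
    Call `condition(subject)` repeatedly until it returns a truthy value and
    return that value; raise `TimeoutError` if `timeout` seconds pass first.
    (Hypothetical helper, not part of selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(subject)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} s')
        time.sleep(poll_interval)


# toy usage: the fake "page" only becomes ready on the third poll
state = {'polls': 0}


def ready_after_three_polls(s):
    s['polls'] += 1
    return 'ready' if s['polls'] >= 3 else None


outcome = poll_until(state, ready_after_three_polls, timeout=2, poll_interval=0.01)
print(outcome, state['polls'])  # ready 3
```

As with `_wait_until`, the caller receives whatever truthy value the condition produced, which is why the helpers above can chain `.click()` or `.text` directly on the result.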

In [4]:
# instantiate the webdriver and navigate to the webpage
driver = webdriver.Chrome()
driver.get(URL)

# auto-accept cookies (the recommended ones)
_wait_until_clickable(driver, By.ID, ACCEPT_COOKIES_BUTTON_ID).click()
_wait_until_clickable(driver, By.ID, ACCEPT_RECOMMENDED_COOKIES_BUTTON_ID).click()

In [28]:
# find the first hit (product)
product_element = driver.find_element(By.PARTIAL_LINK_TEXT, 'Sleepeaze')
print(f'found product {product_element.text!r}')
print(f'product tag name: {product_element.tag_name!r}')
print(f'product class name: {product_element.get_attribute("class")!r}')

# check if we can find all 96 displayed products the same way
target_class = product_element.get_attribute('class').split(' ')[-1]
product_elements = driver.find_elements(By.CLASS_NAME, target_class)
print(f'found {len(product_elements)} hits')

found product 'Boots Sleepeaze Tablets 25 mg - 20s'
product tag name: 'a'
product class name: 'oct-link oct-link--theme-text oct-color--boots-blue oct-teaser__title-link'
found 96 hits
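
The class attribute printed above holds four space-separated class names, whereas a class-name lookup expects a single one, which is why the cell keeps only the last token. Below is a minimal sketch of that string handling using only the standard library. The helper names `class_tokens` and `to_css_selector` are hypothetical, and combining several tokens into a CSS selector is just one option if a stricter match were wanted.

```python
def class_tokens(class_attr):
    """Split an HTML `class` attribute value into its individual class names."""
    # `str.split()` without arguments also copes with repeated or leading spaces
    return class_attr.split()


def to_css_selector(tag, tokens):
    """Build a CSS selector requiring `tag` to carry every class in `tokens`."""
    return tag + ''.join(f'.{token}' for token in tokens)


class_attr = 'oct-link oct-link--theme-text oct-color--boots-blue oct-teaser__title-link'
tokens = class_tokens(class_attr)
print(tokens[-1])                         # oct-teaser__title-link
print(to_css_selector('a', tokens[-2:]))  # a.oct-color--boots-blue.oct-teaser__title-link
```

A selector string built this way could then be passed to `driver.find_elements` with `By.CSS_SELECTOR` instead of `By.CLASS_NAME`.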


Note that in the following cell we already look up the elements by their attributes (ID or class name). The attribute values can be found either
- using the developer console in the browser (`Ctrl + Shift + I`) and its `Search element` functionality (`Ctrl + Shift + C`), or
- using the `find_element` method of the webdriver again (see the code chunk below).

```python
e = driver.find_element(By.PARTIAL_LINK_TEXT, '<text_to_search>')
e_class = e.get_attribute('class')
print(e_class)
```

In [30]:
# navigate to the product page
# note its tag name is `a`, so it should have an `href` attribute containing the hyperlink
driver.get(product_element.get_attribute('href'))

# extract data
# the rating is one of the last elements to be rendered, so we wait for it
rating = _wait_until_clickable(driver, By.CLASS_NAME, PRODUCT_RATING_CLASS_NAME).text

# the name is rendered from the beginning (even before the JS runs)
name = driver.find_element(By.ID, PRODUCT_TITLE_ID).text

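# some items do not have a description: `find_elements` returns an empty list
# in that case instead of raising `NoSuchElementException`
text_elements = driver.find_elements(By.CLASS_NAME, PRODUCT_TEXT_CLASS_NAME)
description = text_elements[0].text.split('\n')[0] if text_elements else ''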

# split price and unit
price_str = driver.find_element(By.CLASS_NAME, PRODUCT_PRICE_STR_CLASS_NAME).text
price_unit, price = price_str[0], price_str[1:]

# note that `sys.getsizeof` might not give the exact size of the HTML page,
# as it also includes additional overhead from Python's object management
page_size = sys.getsizeof(driver.page_source.encode('utf-8'))//1024 # in KB

# display product data
product_data = {
    'Title': name,
    'Price': price,
    'Price_Unit': price_unit,
    'Short_Desc': description,
    'Rating': rating,
    'Page_Size_KB': page_size
}
product_data

{'Title': 'Boots Sleepeaze Tablets 25 mg - 20s',
 'Price': '1.95',
 'Price_Unit': '£',
 'Short_Desc': "For a restful night's sleep. Two a night.",
 'Rating': '2.7',
 'Page_Size_KB': 2028}
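
The price split and the page-size estimate above both operate on plain strings, so they can be tried without a browser. The sketch below assumes a price string made of a currency symbol followed by a decimal number (as in the output above), and it measures the page size as the length of the UTF-8 encoded source rather than with `sys.getsizeof`. The helper names `split_price` and `size_in_kb` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re
import sys


def split_price(price_str):
    """Split a string such as '£1.95' into (unit, amount); return None if it does not match."""
    match = re.fullmatch(r'\s*(\D+?)\s*(\d+(?:\.\d+)?)\s*', price_str)
    if match is None:
        return None
    return match.group(1), match.group(2)


def size_in_kb(text):
    """Size of `text` in whole KB once encoded as UTF-8."""
    return len(text.encode('utf-8')) // 1024


print(split_price('£1.95'))  # ('£', '1.95')

# made-up stand-in for `driver.page_source`
sample_html = '<html>' + 'x' * 2048 + '</html>'
encoded = sample_html.encode('utf-8')
print(size_in_kb(sample_html))  # 2

# the difference is the per-object overhead that `sys.getsizeof` adds on top of the payload
print(sys.getsizeof(encoded) - len(encoded))
```

Returning `None` for an unexpected price format makes the failure explicit, whereas slicing by position would silently produce a wrong unit or amount.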