

# Scraping Dynamic Web Pages


We are increasingly encountering pages whose contents are dynamically generated within the user's Web browser; that is, the content is determined only when the page is rendered and is updated dynamically based on user interactions and inputs.







Is there a programmatic approach to drive a browser to mimic human users' actions, e.g., clicking on a button, filling in a form, etc., to load contents dynamically?

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Selenium_Logo.png"  width=150/>






[Selenium](https://www.selenium.dev) is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers. At its core is WebDriver, an interface to write instruction sets that can be run interchangeably in many browsers.




In [None]:
import selenium
selenium.__version__

---

<br>

### Creating a WebDriver Instance and Navigating to the Target Page

The `selenium.webdriver` module provides all the WebDriver implementations. Supported browsers include Chrome, Firefox, Edge, Safari, and Internet Explorer.

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# use Chrome
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)


The first thing we'll want to do with the `WebDriver` is to navigate to a page given by its URL. A convenient way to do so is to call the `.get()` method:




In [4]:
driver.get("https://bing.com")

<div class="alert alert-info">WebDriver will wait until the page has fully loaded before returning control to the script. </div>

In [None]:
driver.title     # retrieve the title of the page


---

<br>


### Locating Desired Tags

We can use the `WebDriver` instance's `.find_element()` or `.find_elements()` methods to specify a locating strategy with a corresponding filter to find the first matching `WebElement` or a list of matching `WebElement`s.

Various locating strategies are available and can be specified via predefined strings:

- By ID: `"id"`
- By tag name: `"tag name"`
- By class name: `"class name"`
- By name: `"name"`
- By CSS selector: `"css selector"`
- By link text: `"link text"`
- By partial link text: `"partial link text"`


For example, a paragraph element of a specific class:

```html
<p class="class-one">This is a paragraph of class "class-one".</p>
```

can be found by using any of:

```python
driver.find_element("class name", "class-one")
driver.find_element("css selector", 'p.class-one')
driver.find_element("css selector", "p[class='class-one']")
```

A hyperlink element that contains a specific link text:

```html
<a href="sample.html">The 1st item</a>
```

can be found by using either of:

```python
driver.find_element("link text", "The 1st item")
driver.find_element("partial link text", "1st")
```

And a text field defined as:

```html
<input type="text" name="passwd" id="passwd-id" />
```

can be located using any of:

```python
driver.find_element("id", "passwd-id")
driver.find_element("name", "passwd")
driver.find_element("css selector", "input#passwd-id")
```
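When a page's markup is uncertain, it can help to try several locators in turn. The sketch below is a hypothetical helper (`first_match` is not part of Selenium) that relies only on `.find_elements()` returning an empty list when nothing matches. It assumes `root` is a `WebDriver` or a `WebElement`.

```python
def first_match(root, locators):
    """Return the first element found by any (strategy, value) pair, or None.

    `root` is assumed to be a WebDriver or WebElement, i.e. anything that
    exposes find_elements(strategy, value).
    """
    for strategy, value in locators:
        matches = root.find_elements(strategy, value)  # [] when nothing matches
        if matches:
            return matches[0]
    return None


# Hypothetical usage with the text field shown above:
# field = first_match(driver, [("id", "passwd-id"), ("name", "passwd")])
```

Listing the most specific locator first (typically the ID) keeps the fallback order predictable.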

### Using Chrome's Inspect Tool to Examine a Target Element


Open Chrome's DevTools and use the element selector tool (the top-left button) to explore the page for the desired element.

- The element is highlighted both on the page itself and in the document tree.

Take note of any identifying attributes of the target tags (class, id, etc.) and use the `.find_elements()` or `.find_element()` method to fetch the desired `WebElement`(s).

In [None]:
# locate the input text element by its name attribute
search_box = driver.find_element("name", "q")
search_box

### Sending Text and Keystrokes



Virtualized device input can be generated by the `send_keys()` method:



In [6]:
search_box.clear()    # clear any pre-populated text                          
search_box.send_keys("us election 2020")

<div class="alert alert-info">Typing something into a text field won't automatically clear it. Instead, what we type will be appended to what's already there.</div>

Special keys can be sent using the `Keys` class imported from `selenium.webdriver.common.keys` (try `dir(Keys)` to see a list of supported virtual keystrokes):

In [7]:
from selenium.webdriver.common.keys import Keys
search_box.send_keys(Keys.RETURN)

In [None]:
dir(Keys)

In [4]:
from selenium.webdriver.common.keys import Keys

# We can also send the query text and the keystroke in one call
search_box = driver.find_element("name", "q")
search_box.clear()
search_box.send_keys("us election 2020", Keys.RETURN)

---

<br>

### Extracting Information

In [None]:
search_results = driver.find_elements("class name", 'b_algo')

# a top-down approach to narrow down the selection to the correct elements
# search_results = driver.find_elements("css selector", "#b_results .b_algo")
search_results

In [None]:
for result in search_results[1:]:
    print(result.text)

The rendered text of a specific element can be retrieved by `.text`:

In [None]:
search_results[2].text

We can also call `find_element()` on a parent `WebElement` to access its child elements:

In [None]:
search_results[2].find_element("class name", "tpcn").text

In [None]:
search_results[2].find_element("tag name", "h2").text

In [None]:
search_results[2].find_element("css selector", "p").text
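The chained lookups above can be gathered into one record per result. Because `find_element()` raises `NoSuchElementException` when a child is missing, the sketch below uses `find_elements()` instead and falls back to `None`. The helper name `result_to_record` is hypothetical, and the `h2`, `p`, and `h2 a` selectors are assumptions about the result markup, which may change.

```python
def result_to_record(result):
    """Turn one search-result WebElement into a dict, tolerating missing children."""

    def child_text(strategy, value):
        # find_elements returns [] rather than raising when nothing matches
        children = result.find_elements(strategy, value)
        return children[0].text if children else None

    links = result.find_elements("css selector", "h2 a")
    return {
        "title": child_text("tag name", "h2"),
        "snippet": child_text("css selector", "p"),
        "url": links[0].get_attribute("href") if links else None,
    }


# Hypothetical usage: records = [result_to_record(r) for r in search_results]
```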

---

## Transitioning to Beautiful Soup


After retrieving the search result page, we can hand its source (available via the WebDriver's `page_source` attribute) to Beautiful Soup for parsing:

In [6]:
from bs4 import BeautifulSoup

search_result_soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
print(search_result_soup.prettify())

In [8]:
search_result_list = search_result_soup.find_all('li', {'class': 'b_algo'})

In [None]:
search_result_list[2]

In [None]:
search_result_list[2].find("a").get('href')       # equivalently, search_result_list[2].a['href']  

In [None]:
import pandas as pd

# a row-based approach

all_records = []

for search_result in search_result_list:
    url = search_result.find("a").get('href')
    text = search_result.find("h2").get_text()
    all_records.append({'text': text, 'url': url})

search_result_df = pd.DataFrame(all_records)
search_result_df
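The row-based loop above assumes every result contains both an `<a>` and an `<h2>` tag; `find()` returns `None` otherwise, and the chained call would fail. The following sketch shows a more defensive variant on a small, made-up HTML snippet (the markup in `SAMPLE_HTML` and the function name `extract_records` are illustrative assumptions, not real search output), so it runs without a browser.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure used above; not real search output.
SAMPLE_HTML = """
<ol id="b_results">
  <li class="b_algo"><h2><a href="https://example.com/a">First result</a></h2><p>Snippet A</p></li>
  <li class="b_algo"><h2>No link here</h2></li>
  <li class="b_ad"><h2><a href="https://example.com/ad">Sponsored</a></h2></li>
</ol>
"""


def extract_records(html):
    """Return one {'text', 'url'} dict per <li class="b_algo"> item."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("li", {"class": "b_algo"}):
        heading = item.find("h2")  # None when the tag is absent
        link = item.find("a")
        records.append({
            "text": heading.get_text(strip=True) if heading else None,
            "url": link.get("href") if link else None,
        })
    return records


records = extract_records(SAMPLE_HTML)
print(records)
```

The resulting list of dicts can be passed to `pd.DataFrame()` exactly as in the cell above.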

Navigating through the search results is just a repeated application of generating keystrokes with `selenium`:

In [12]:
# use .send_keys(Keys.RETURN) to click on a hyperlink
driver.find_element("css selector", "[title='Next page']").send_keys(Keys.RETURN)
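Repeating this step lets us walk through several result pages. The sketch below is one possible loop, not a definitive implementation: the function name `collect_result_pages` is hypothetical, it assumes the `b_algo` class and the `[title='Next page']` selector used above, it uses `.click()` rather than a keystroke, and it relies on a crude fixed pause for the next page to load.

```python
import time


def collect_result_pages(driver, max_pages=3, pause=2.0):
    """Gather the text of each `.b_algo` result from up to `max_pages` pages."""
    pages = []
    while True:
        results = driver.find_elements("class name", "b_algo")
        pages.append([r.text for r in results])
        if len(pages) >= max_pages:
            break  # collected enough pages; do not navigate further
        next_links = driver.find_elements("css selector", "[title='Next page']")
        if not next_links:
            break  # no further page to visit
        next_links[0].click()
        time.sleep(pause)  # assumption: the next page finishes loading within `pause` seconds
    return pages


# Hypothetical usage: pages = collect_result_pages(driver, max_pages=3)
```

A fixed pause is the simplest option but also the least reliable one; waiting for a condition on the new page is usually a better fit.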

`WebDriver`'s `back()` and `forward()` methods allow us to move backward and forward in the browser's history:


In [13]:
driver.back()      # driver.forward()

In [14]:
driver.forward()

`.refresh()` refreshes the current page:

In [15]:
driver.refresh()

When we are finished with the browser session, we should close the browser window:

In [16]:
driver.close()

<div class="alert alert-info">We can also call the <code>quit()</code> method instead of <code>close()</code>. <code>quit()</code> exits the entire browser and ends the WebDriver session, whereas <code>close()</code> closes only the current tab or window.</div>
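If a script fails midway, the browser may be left running. One way to guard against this is to wrap the session in a context manager that always calls `quit()`. This is a minimal sketch: `browser_session` and its `make_driver` argument are hypothetical names, and `make_driver` is assumed to be any zero-argument callable that returns a WebDriver.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver and make sure quit() runs, even if the body raises."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # ends the whole browser session


# Hypothetical usage, assuming `webdriver` and the Service `s` from the cells above:
# with browser_session(lambda: webdriver.Chrome(service=s)) as driver:
#     driver.get("https://bing.com")
#     print(driver.title)
```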

---

## Waits


On modern pages, the elements we want to interact with often load at different times (e.g., some are added after the document itself has finished loading). A script that looks for such an element too early will fail to find it.

Selenium's waits address this issue by providing some slack between actions performed by the browser and those instructed by our WebDriver script. In particular, we can use waits to tell the WebDriver to:

- Wait up to a certain amount of time for an element to appear before throwing an exception (an implicit wait), e.g., `driver.implicitly_wait(time_in_seconds)`;

   - It needs to be called only once per Selenium session.
   - It makes each element lookup keep retrying for up to the given time before raising `NoSuchElementException`.

   ```python
   # setting an implicit wait of 10 seconds
   driver.implicitly_wait(10)
   ```

- Wait for a certain condition to hold before proceeding further (an explicit wait), e.g., `WebDriverWait(driver, timeout).until(some_condition)`.




In [17]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# use Chrome
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.get("https://techcrunch.com/video/")

In [19]:
from selenium.webdriver.support.ui import WebDriverWait

# poll for up to 10 seconds until the button is found; raise a TimeoutException otherwise
loadmore_button = WebDriverWait(driver, 10).until(
    lambda d: d.find_element("css selector", "button.load-more")
)
loadmore_button.click()
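Conceptually, an explicit wait is a polling loop: call the condition, return its value once it is truthy, and give up when the timeout expires. The pure-Python sketch below illustrates that idea only; it is not Selenium's implementation. The name `poll_until` is hypothetical, it raises the built-in `TimeoutError` rather than Selenium's `TimeoutException`, and it omits the handling of ignored exceptions such as `NoSuchElementException`.

```python
import time


def poll_until(condition, timeout, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # the truthy result is handed back to the caller
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Example: the condition becomes truthy on its third call.
answers = iter([None, False, "ready"])
print(poll_until(lambda: next(answers), timeout=1, poll_interval=0.01))
```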

       
There are predefined conditions for [frequent wait operations](https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html). Some of them are:

- `title_is()`
- `title_contains()`
- `presence_of_element_located()`
- `visibility_of_element_located()`
- `visibility_of()`
- `presence_of_all_elements_located()`
- `element_to_be_clickable()`
- `element_located_to_be_selected()`
- `element_selection_state_to_be()`
- `element_located_selection_state_to_be()`
- `alert_is_present()`
- ...
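When none of the predefined conditions fits, any callable that takes the driver and returns a truthy value on success can be passed to `.until()`. The sketch below builds such a condition; `at_least_n_elements` is a hypothetical helper, not part of Selenium.

```python
def at_least_n_elements(locator, n):
    """Build a wait condition that succeeds once `n` or more elements match."""
    strategy, value = locator

    def condition(driver):
        elements = driver.find_elements(strategy, value)
        # a falsy return value tells the wait to keep polling
        return elements if len(elements) >= n else False

    return condition


# Hypothetical usage, assuming a live `driver` and an imported WebDriverWait:
# results = WebDriverWait(driver, 10).until(
#     at_least_n_elements(("class name", "b_algo"), 5)
# )
```

Returning the matched elements rather than `True` lets the caller use them directly as the result of `.until()`.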



In [20]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# use Chrome
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.get("https://techcrunch.com/video/")

# EC.element_to_be_clickable takes a (strategy, value) locator and returns a condition for .until()
loadmore_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable(("css selector", "button.load-more"))
)
loadmore_button.click()

Please refer to [the Selenium documentation on waits](https://www.selenium.dev/documentation/webdriver/waits) for details.



<div class="alert alert-info"> Warning: Do not mix implicit and explicit waits. Doing so can cause unpredictable wait times.</div>

In [None]:
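from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get("https://techcrunch.com/video/")

# keep clicking the load-more button, at most 10 times
for i in range(10):
    try:
        # wait up to 10 seconds for the button to become clickable, then click it
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(("css selector", "button.load-more"))
        ).click()
        print(f"Clicked the load-more button {i + 1} time(s)")
    except TimeoutException:
        # the button did not become clickable in time; assume there is nothing more to load
        break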

## Optional: Run Selenium on Colab


In [1]:
%%capture
! pip install selenium
! apt-get update
! apt install chromium-chromedriver

In [None]:
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://w5.ab.ust.hk/wcq/cgi-bin/2310/")
driver.title