# Web Scraping Workshop 3 - Browser Automation with Selenium
Prepared by: Nickolas K. Freeman, Ph.D.

This notebook demonstrates how we can use the `Selenium` webdriver to automate the Chrome browser. `Selenium` was designed to let web developers build test suites that verify the correct operation of their websites. In contrast to the `Requests` and `BeautifulSoup` libraries, `Selenium` allows us to codify more complex interactions with websites, such as typing into forms and clicking links. This is necessary for many modern website designs.

You can install the `Python` bindings for `Selenium` using the command `conda install -c conda-forge selenium`. This demonstration automates the `Chrome` browser. For this, you will need to download a version of ChromeDriver that matches your installed version of Chrome, available at https://chromedriver.chromium.org/.

The following code block imports the selenium modules and objects that we will be using.

In [1]:
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

In this tutorial, we will look at very basic use of Selenium via Python. We will only see the tip of the iceberg. If you are interested in learning more about the library, please see the documentation at https://selenium-python.readthedocs.io/.

The following code block launches an instance of the chromedriver.

In [2]:
driver = webdriver.Chrome()
driver.maximize_window()
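
The `Options` class imported earlier is not used in the cell above. As a minimal sketch, assuming you want Chrome to run without a visible window, the helpers below collect Chrome command-line switches and pass them to the driver through `Options.add_argument`. The names `chrome_arguments` and `make_driver` are hypothetical, and the window size is an arbitrary choice made because a headless browser has no visible window to maximize.

```python
def chrome_arguments(headless=True, width=1920, height=1080):
    """Return a list of Chrome command-line switches (hypothetical helper)."""
    arguments = [f"--window-size={width},{height}"]
    if headless:
        arguments.append("--headless")
    return arguments


def make_driver(headless=True):
    """Create a Chrome driver configured with the switches above."""
    # Imported here so chrome_arguments can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in chrome_arguments(headless=headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


print(chrome_arguments(headless=True))
```

Calling `make_driver(headless=False)` should behave like the plain `webdriver.Chrome()` call above, apart from the fixed window size.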

Recall that when we used the `Requests` library, the default user-agent sent with each request identified the `Requests` library, and we had to specify a user-agent ourselves if we wanted to change it. With `Selenium`, we are driving a real web browser, so requests carry that browser's own user-agent. The following code block shows how we can use Selenium to execute a JavaScript command that returns the current user-agent.

In [3]:
driver.execute_script("return navigator.userAgent;")

'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
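
The user-agent string also reveals the browser version, which helps when choosing a ChromeDriver download, since the driver generally needs to match the browser's major version. The following sketch extracts that number with a regular expression. The helper name `chrome_major_version` is hypothetical, and the pattern assumes the `Chrome/<version>` token seen in the output above.

```python
import re


def chrome_major_version(user_agent):
    """Return the Chrome major version found in a user-agent string, or None."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return int(match.group(1)) if match else None


user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36')
print(chrome_major_version(user_agent))  # prints 81
```

With a live session, the same function could be applied to the value returned by `driver.execute_script("return navigator.userAgent;")`.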

We will familiarize ourselves with `Selenium` by writing a script that automates the collection of data from a fake e-commerce site that was designed for users to practice using `Selenium`. The following code block shows how we can navigate to the website URL.

In [4]:
url = 'http://automationpractice.com'
driver.get(url)

The following code block shows one method that we can use to *select* the search box at the top of the main page.

In [5]:
driver.find_element_by_id('search_query_top')

<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="c7d3784d-31b4-4b19-8e78-41c26ef1bcd1")>
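
The `find_element_by_...` helpers used throughout this notebook belong to the Selenium release that was current when it was written. Newer Selenium 4 releases no longer provide them and rely on the generic `find_element(by, value)` method instead, which older releases also offer. The sketch below wraps that generic call. The helper `find_one` and the `STRATEGIES` set are hypothetical names, and the strings are the values behind the constants in `selenium.webdriver.common.by.By`.

```python
# Locator strategy strings accepted by the generic find_element method.
# These are the values of By.ID, By.NAME, By.CLASS_NAME, and so on.
STRATEGIES = {
    "id", "name", "class name", "tag name",
    "css selector", "xpath", "link text", "partial link text",
}


def find_one(driver, strategy, value):
    """Find a single element with the generic locator API (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown locator strategy: {strategy!r}")
    return driver.find_element(strategy, value)


# Usage with the driver from this notebook (not executed here):
# search_box = find_one(driver, "id", "search_query_top")
```

In practice you would usually import `By` and write `driver.find_element(By.ID, 'search_query_top')` directly; the string form is used here only to keep the sketch free of imports.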

The following code block shows that we can store *selected* elements as variables, allowing us to perform additional operations on the object.

In [6]:
search_box = driver.find_element_by_id('search_query_top')

For instance, we can send a search string...

In [7]:
search_box.send_keys('Mens shirt')

... and *press* the enter button.

In [8]:
search_box.send_keys(Keys.RETURN)

If we want to navigate *back*, we simply call the `back` method of the driver object.

In [9]:
driver.back()
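
The select, type, and submit steps above can be wrapped in a single function. This is only a sketch: `run_search` is a hypothetical helper, the default `box_id` is the id used earlier in this notebook, and the call to `clear()` assumes that the box may still hold text from a previous search. The helper uses `find_elements_by_id`, which returns an empty list instead of raising an exception when the box is missing.

```python
def run_search(driver, query, enter_key, box_id='search_query_top'):
    """Type a query into the search box and submit it.

    Returns False if no element with the given id exists, True otherwise.
    """
    boxes = driver.find_elements_by_id(box_id)
    if not boxes:
        return False
    box = boxes[0]
    box.clear()               # remove any text left over from an earlier search
    box.send_keys(query)      # type the search string
    box.send_keys(enter_key)  # submit, e.g. with Keys.RETURN
    return True


# Usage with the driver and Keys import from this notebook (not executed here):
# run_search(driver, 'Mens shirt', Keys.RETURN)
```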

There are many ways to identify and select HTML elements using `Selenium`, and each has its caveats. For example, we can search the HTML to identify elements with particular link text as shown below. However, this query will fail because the search is case-sensitive.

In [10]:
driver.find_element_by_partial_link_text('Best Sellers')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":"Best Sellers"}
  (Session info: chrome=81.0.4044.113)


The previous exception could cause your script to fail if you did not have proper exception handling in place. Thus, I have found that it is often a good practice to use the `find_elements...` methods instead of the `find_element...` variations when finding objects. If the element does not exist, the `find_elements...` variations will return an empty list, which evaluates to `False` in Python. The following code block provides an example based on the previous command. 

In [11]:
if driver.find_elements_by_partial_link_text('Best Sellers'):
    print('Element(s) found')
else:
    print('Element does not exist!')

Element does not exist!
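
The same idea can be packaged so that the page is queried only once and the caller receives either the first match or `None`. This is a minimal sketch; `first_or_none` is a hypothetical helper, not part of `Selenium`.

```python
def first_or_none(find_all, locator):
    """Call a find_elements... method and return the first match, or None."""
    matches = find_all(locator)
    return matches[0] if matches else None


# Usage with the driver from this notebook (not executed here):
# link = first_or_none(driver.find_elements_by_partial_link_text, 'Best Sellers')
# if link is None:
#     print('Element does not exist!')
```

Passing the bound method itself, rather than the driver, lets the helper work with any of the `find_elements...` variations.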


The following code block shows the correct command to find the best sellers element by partial link text...

In [12]:
driver.find_element_by_partial_link_text('BEST SELLERS')

<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="0c4f89bf-25fc-46fa-bba1-b1fb9b599dc6")>

... by class name ...

In [13]:
driver.find_element_by_class_name('blockbestsellers')

<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="0c4f89bf-25fc-46fa-bba1-b1fb9b599dc6")>

... and by xpath.

In [14]:
driver.find_element_by_xpath('//*[@id="home-page-tabs"]/li[2]/a')

<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="0c4f89bf-25fc-46fa-bba1-b1fb9b599dc6")>
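
The earlier `Best Sellers` lookup failed only because of letter case. One way around this, sketched below, is to collect every link on the page and compare the visible text in Python after normalizing case. `links_containing` is a hypothetical helper, and it assumes the page has few enough `<a>` elements that reading the text of each one is acceptable.

```python
def links_containing(driver, text):
    """Return all <a> elements whose visible text contains `text`, ignoring case."""
    wanted = text.casefold()
    return [link for link in driver.find_elements_by_tag_name('a')
            if wanted in link.text.casefold()]


# Usage with the driver from this notebook (not executed here):
# matches = links_containing(driver, 'Best Sellers')
```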

I also commonly use CSS selectors. The following table shows how we can use CSS selectors to find elements by various attributes.

From https://www.w3schools.com/cssref/css_selectors.asp (accessed 4/16/2020):

<table class="w3-table-all notranslate">
  <tbody><tr>
    <th style="width:20%">Selector</th>
    <th style="width:20%">Example</th>
    <th>Example description</th>
  </tr>
  <tr>
    <td><a href="sel_class.asp">.<i>class</i></a></td>
    <td class="notranslate">.intro</td>
    <td>Selects all elements with class="intro"</td>
  </tr>
  <tr>
    <td><em>.class1.class2</em></td>
    <td class="notranslate">.name1.name2</td>
    <td>Selects all elements with both <em>name1</em> and <em>name2</em> set 
    within its class attribute</td>
  </tr>  
  <tr>
    <td><em>.class1 .class2</em></td>
    <td class="notranslate">.name1 .name2</td>
    <td>Selects all elements with <em>name2</em> that is a descendant of an 
    element with <em>name1</em></td>
  </tr>  
  <tr>
    <td><a href="sel_id.asp">#<i>id</i></a></td>
    <td class="notranslate">#firstname</td>
    <td>Selects the element with id="firstname"</td>
  </tr>  <tr>
    <td><a href="sel_all.asp">*</a></td>
    <td class="notranslate">*</td>
    <td>Selects all elements</td>
  </tr>
  <tr>
    <td><i><a href="sel_element.asp">element</a></i></td>
    <td class="notranslate">p</td>
    <td>Selects all &lt;p&gt; elements</td>
  </tr>
  <tr>
    <td><i><a href="sel_element_class.asp">element.class</a></i></td>
    <td class="notranslate">p.intro</td>
    <td>Selects all &lt;p&gt; elements with class="intro"</td>
  </tr>
  <tr>
    <td><i><a href="sel_element_comma.asp">element,element</a></i></td>
    <td class="notranslate">div, p</td>
    <td>Selects all &lt;div&gt; elements and all &lt;p&gt; elements</td>
  </tr>
  <tr>
    <td><a href="sel_element_element.asp"><i>element</i> <i>element</i></a></td>
    <td class="notranslate">div p</td>
    <td>Selects all &lt;p&gt; elements inside &lt;div&gt; elements</td>
  </tr>
  <tr>
    <td><a href="sel_element_gt.asp"><i>element</i>&gt;<i>element</i></a></td>
    <td class="notranslate">div &gt; p</td>
    <td>Selects all &lt;p&gt; elements where the parent is a &lt;div&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_element_pluss.asp"><i>element</i>+<i>element</i></a></td>
    <td class="notranslate">div + p</td>
    <td>Selects all &lt;p&gt; elements that are placed immediately after &lt;div&gt; elements</td>
  </tr>
  <tr>
    <td><a href="sel_gen_sibling.asp"><i>element1</i>~<i>element2</i></a></td>
    <td>p ~ ul</td>
    <td>Selects every &lt;ul&gt; element that is preceded by a &lt;p&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_attribute.asp">[<i>attribute</i>]</a></td>
    <td class="notranslate">[target]</td>
    <td>Selects all elements with a target attribute</td>
  </tr>
  <tr>
    <td><a href="sel_attribute_value.asp">[<i>attribute</i>=<i>value</i>]</a></td>
    <td class="notranslate">[target=_blank]</td>
    <td>Selects all elements with target="_blank"</td>
  </tr>
  <tr>
    <td><a href="sel_attribute_value_contains.asp">[<i>attribute</i>~=<i>value</i>]</a></td>
    <td class="notranslate">[title~=flower]</td>
    <td>Selects all elements with a title attribute containing the word "flower"</td>
  </tr>
  <tr>
    <td><a href="sel_attribute_value_lang.asp">[<i>attribute</i>|=<i>value</i>]</a></td>
    <td class="notranslate">[lang|=en]</td>
    <td>Selects all elements with a lang attribute value starting with "en"</td>
  </tr>
  <tr>
    <td><a href="sel_attr_begin.asp">[<i>attribute</i>^=<i>value</i>]</a></td>
    <td>a[href^="https"]</td>
    <td>Selects every &lt;a&gt; element whose href attribute value begins with "https"</td>
  </tr>
  <tr>
    <td><a href="sel_attr_end.asp">[<i>attribute</i>$=<i>value</i>]</a></td>
    <td>a[href$=".pdf"]</td>
    <td>Selects every &lt;a&gt; element whose href attribute value ends with ".pdf"</td>
  </tr>
  <tr>
    <td><a href="sel_attr_contain.asp">[<i>attribute</i>*=<i>value</i>]</a></td>
    <td>a[href*="w3schools"]</td>
    <td>Selects every &lt;a&gt; element whose href attribute value contains the substring "w3schools"</td>
  </tr>
  <tr>
    <td><a href="sel_active.asp">:active</a></td>
    <td class="notranslate">a:active</td>
    <td>Selects the active link</td>
  </tr>
  <tr>
    <td><a href="sel_after.asp">::after</a></td>
    <td class="notranslate">p::after</td>
    <td>Insert something after the content of each &lt;p&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_before.asp">::before</a></td>
    <td class="notranslate">p::before</td>
    <td>Insert something before&nbsp;the content of each &lt;p&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_checked.asp">:checked</a></td>
    <td>input:checked</td>
    <td>Selects every checked &lt;input&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_default.asp">:default</a></td>
    <td>input:default</td>
    <td>Selects the default &lt;input&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_disabled.asp">:disabled</a></td>
    <td>input:disabled</td>
    <td>Selects every disabled &lt;input&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_empty.asp">:empty</a></td>
    <td>p:empty</td>
    <td>Selects every &lt;p&gt; element that has no children (including text nodes)</td>
  </tr>
  <tr>
    <td><a href="sel_enabled.asp">:enabled</a></td>
    <td>input:enabled</td>
    <td>Selects every enabled &lt;input&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_firstchild.asp">:first-child</a></td>
    <td class="notranslate">p:first-child</td>
    <td>Selects every &lt;p&gt; element that is the first child of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_firstletter.asp">::first-letter</a></td>
    <td class="notranslate">p::first-letter</td>
    <td>Selects the first letter of every &lt;p&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_firstline.asp">::first-line</a></td>
    <td class="notranslate">p::first-line</td>
    <td>Selects the first line of every &lt;p&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_first-of-type.asp">:first-of-type</a></td>
    <td>p:first-of-type</td>
    <td>Selects every &lt;p&gt; element that is the first &lt;p&gt; element of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_focus.asp">:focus</a></td>
    <td class="notranslate">input:focus</td>
    <td>Selects the input element which has focus</td>
  </tr>
  <tr>
    <td><a href="sel_hover.asp">:hover</a></td>
    <td class="notranslate">a:hover</td>
    <td>Selects links on mouse over</td>
  </tr>
  <tr>
    <td><a href="sel_in-range.asp">:in-range</a></td>
    <td class="notranslate">input:in-range</td>
    <td>Selects input elements with a value within a specified range</td>
  </tr>
  <tr>
    <td><a href="sel_indeterminate.asp">:indeterminate</a></td>
    <td class="notranslate">input:indeterminate</td>
    <td>Selects input elements that are in an indeterminate state</td>
  </tr>
  <tr>
    <td><a href="sel_invalid.asp">:invalid</a></td>
    <td class="notranslate">input:invalid</td>
    <td>Selects all input elements with an invalid value</td>
  </tr>
  <tr>
    <td><a href="sel_lang.asp">:lang(<i>language</i>)</a></td>
    <td class="notranslate">p:lang(it)</td>
    <td>Selects every &lt;p&gt; element with a lang attribute equal to "it" (Italian)</td>
  </tr>
  <tr>
    <td><a href="sel_last-child.asp">:last-child</a></td>
    <td>p:last-child</td>
    <td>Selects every &lt;p&gt; element that is the last child of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_last-of-type.asp">:last-of-type</a></td>
    <td>p:last-of-type</td>
    <td>Selects every &lt;p&gt; element that is the last &lt;p&gt; element of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_link.asp">:link</a></td>
    <td class="notranslate">a:link</td>
    <td>Selects all unvisited links</td>
  </tr>
  <tr>
    <td><a href="sel_not.asp">:not(<i>selector</i>)</a></td>
    <td>:not(p)</td>
    <td>Selects every element that is not a &lt;p&gt; element</td>
  </tr>
  <tr>
    <td><a href="sel_nth-child.asp">:nth-child(<i>n</i>)</a></td>
    <td>p:nth-child(2)</td>
    <td>Selects every &lt;p&gt; element that is the second child of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_nth-last-child.asp">:nth-last-child(<i>n</i>)</a></td>
    <td>p:nth-last-child(2)</td>
    <td>Selects every &lt;p&gt; element that is the second child of its parent, counting from the last child</td>
  </tr>
  <tr>
    <td><a href="sel_nth-last-of-type.asp">:nth-last-of-type(<i>n</i>)</a></td>
    <td>p:nth-last-of-type(2)</td>
    <td>Selects every &lt;p&gt; element that is the second &lt;p&gt; element of its parent, counting from the last child</td>
  </tr>
  <tr>
    <td><a href="sel_nth-of-type.asp">:nth-of-type(<i>n</i>)</a></td>
    <td>p:nth-of-type(2)</td>
    <td>Selects every &lt;p&gt; element that is the second &lt;p&gt; element of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_only-of-type.asp">:only-of-type</a></td>
    <td>p:only-of-type</td>
    <td>Selects every &lt;p&gt; element that is the only &lt;p&gt; element of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_only-child.asp">:only-child</a></td>
    <td>p:only-child</td>
    <td>Selects every &lt;p&gt; element that is the only child of its parent</td>
  </tr>
  <tr>
    <td><a href="sel_optional.asp">:optional</a></td>
    <td class="notranslate">input:optional</td>
    <td>Selects input elements with no "required" attribute</td>
  </tr>
  <tr>
    <td><a href="sel_out-of-range.asp">:out-of-range</a></td>
    <td class="notranslate">input:out-of-range</td>
    <td>Selects input elements with a value outside a specified range</td>
  </tr>
  <tr>
    <td><a href="sel_placeholder.asp">::placeholder</a></td>
    <td class="notranslate">input::placeholder</td>
    <td>Selects input elements with the "placeholder" attribute specified</td>
  </tr>
  <tr>
    <td><a href="sel_read-only.asp">:read-only</a></td>
    <td class="notranslate">input:read-only</td>
    <td>Selects input elements with the "readonly" attribute specified</td>
  </tr>
  <tr>
    <td><a href="sel_read-write.asp">:read-write</a></td>
    <td class="notranslate">input:read-write</td>
    <td>Selects input elements with the "readonly" attribute NOT specified</td>
  </tr>
  <tr>
    <td><a href="sel_required.asp">:required</a></td>
    <td class="notranslate">input:required</td>
    <td>Selects input elements with the "required" attribute specified</td>
  </tr>
  <tr>
    <td><a href="sel_root.asp">:root</a></td>
    <td>:root</td>
    <td>Selects the document's root element</td>
  </tr>
  <tr>
    <td><a href="sel_selection.asp">::selection</a></td>
    <td>::selection</td>
    <td>Selects the portion of an element that is selected by a user</td>
  </tr>
  <tr>
    <td><a href="sel_target.asp">:target</a></td>
    <td>#news:target </td>
    <td>Selects the current active #news element (clicked on a URL containing that anchor name)</td>
  </tr>
  <tr>
    <td><a href="sel_valid.asp">:valid</a></td>
    <td class="notranslate">input:valid</td>
    <td>Selects all input elements with a valid value</td>
  </tr>
  <tr>
    <td><a href="sel_visited.asp">:visited</a></td>
    <td class="notranslate">a:visited</td>
    <td>Selects all visited links</td>
  </tr>
</tbody></table>

The following code block selects the best sellers element using a CSS selector.

In [15]:
driver.find_element_by_css_selector('.blockbestsellers')

<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="0c4f89bf-25fc-46fa-bba1-b1fb9b599dc6")>
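
To connect the table to the earlier lookups, the following sketch assembles selector strings from a tag name, an id, class names, and attribute values. `css_selector` is a hypothetical helper, and it assumes simple names that need no escaping.

```python
def css_selector(tag='', element_id=None, classes=(), attributes=None):
    """Build a CSS selector string from its parts (no escaping is performed)."""
    parts = [tag]
    if element_id:
        parts.append(f"#{element_id}")
    parts.extend(f".{name}" for name in classes)
    for key, value in (attributes or {}).items():
        parts.append(f'[{key}="{value}"]')
    return ''.join(parts)


print(css_selector(classes=['blockbestsellers']))         # .blockbestsellers
print(css_selector(element_id='search_query_top'))        # #search_query_top
print(css_selector('a', attributes={'target': '_blank'})) # a[target="_blank"]
```

The first two strings correspond to the class name and id lookups used earlier in this notebook, and either can be passed to `find_element_by_css_selector`.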

Let's suppose that we want to collect information on women's tops. The following code block finds the *WOMEN* link and clicks it.

In [16]:
if driver.find_elements_by_link_text('WOMEN'):
    women_link = driver.find_element_by_link_text('WOMEN')
    women_link.click()

Let's try to filter the products by tops. Note that simply finding elements that include the text *top* will be problematic because there is more than one on the page. The following code block shows how we can select a particular area of the webpage, which we will search within.

In [17]:
catalog_block = driver.find_element_by_id('layered_block_left')

We will now select and click the checkbox to filter the products to just tops.

In [18]:
tops_filter = catalog_block.find_element_by_partial_link_text('Tops')
tops_filter.click()
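
Searching within a container, as done above, can be combined with the `find_elements...` pattern so that a missing container or link does not raise an exception. The following is a sketch under that assumption; `click_link_within` is a hypothetical helper.

```python
def click_link_within(driver, container_id, link_text):
    """Click the first link matching `link_text` inside the given container.

    Returns True if a link was clicked, False if the container or link is missing.
    """
    containers = driver.find_elements_by_id(container_id)
    if not containers:
        return False
    # Search only inside the container, not the whole page.
    links = containers[0].find_elements_by_partial_link_text(link_text)
    if not links:
        return False
    links[0].click()
    return True


# Usage with the driver from this notebook (not executed here):
# click_link_within(driver, 'layered_block_left', 'Tops')
```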

The following code block shows how we can grab all of the product containers.

In [19]:
driver.find_elements_by_class_name('product-container')

[<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="dc6a279a-75d1-43ec-ad7d-143777ecdc81")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="21b1005b-e83f-47d8-9d8f-076fba3fce4e")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="65c71e8d-e681-438f-964e-d48987bce56e")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="d48f1156-610d-45f0-9448-f4bfb3f71558")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="0abde664-be45-49bb-8cb7-8d5913b23149")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="17470415-487e-406a-9fd3-bb708b4ed8d6")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="0d0872c1-3857-4202-9642-cf

At the time I wrote this notebook, attempting to filter only the tops using the checkbox caused the website to hang, i.e., no filtering occurred. The following code block shows how we can refresh the page.

In [20]:
driver.refresh()

The following code block attempts to filter the information using sub-categories.

In [21]:
subcategory_area = driver.find_element_by_id('subcategories')
subcategory_area.find_element_by_partial_link_text('TOPS').click()

Notice that the filtering has now occurred.

In [22]:
driver.find_elements_by_class_name('product-container')

[<selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="2bf60d33-da2b-429a-9abf-30d2dffb9457")>,
 <selenium.webdriver.remote.webelement.WebElement (session="26ea89482be7384401499e1205b5281f", element="a7987750-2870-40d2-9614-e22eebb1e2c1")>]

The following code block iterates over the product containers, printing all text. You can expand this code to extract only the elements you wish.

In [23]:
for container in driver.find_elements_by_class_name('product-container'):
    print(container.text)
    print('-'*25)

Faded Short Sleeve T-shirts
$16.51
In stock
-------------------------
Blouse
$27.00
In stock
-------------------------
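
As one possible way to extract only selected fields, the sketch below turns the text of a product container into a dictionary. It assumes the layout seen in the output above: the product name on the first line, a price line starting with `$`, and an `In stock` line. `parse_product` and `PRICE_PATTERN` are hypothetical names.

```python
import re

# Matches a line such as "$16.51" and captures the numeric part.
PRICE_PATTERN = re.compile(r"^\$(\d+(?:\.\d+)?)$")


def parse_product(text):
    """Convert the text of a product container into a small record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    record = {'name': lines[0] if lines else None, 'price': None, 'in_stock': False}
    for line in lines[1:]:
        match = PRICE_PATTERN.match(line)
        if match and record['price'] is None:
            record['price'] = float(match.group(1))
        elif line.lower() == 'in stock':
            record['in_stock'] = True
    return record


print(parse_product('Faded Short Sleeve T-shirts\n$16.51\nIn stock'))
# {'name': 'Faded Short Sleeve T-shirts', 'price': 16.51, 'in_stock': True}
```

In the loop above, `parse_product(container.text)` could replace the `print(container.text)` call to build a list of records.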


The following code block closes the driver.

In [24]:
driver.quit()
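
If a script raises an exception before reaching `driver.quit()`, the browser window and driver process can be left running. One way to guard against this, sketched below, is a small context manager that always calls `quit`. `managed_driver` is a hypothetical helper, and `factory` is any callable that returns a driver, such as `webdriver.Chrome`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver and guarantee that quit() runs, even after an error."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


# Usage with the webdriver import from this notebook (not executed here):
# with managed_driver(webdriver.Chrome) as driver:
#     driver.get('http://automationpractice.com')
```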

The following code block automates the data collection process we just described. In the demo, I introduced some bugs to highlight issues you need to consider when working with `Selenium`.

In [25]:
# create a driver instance
driver = webdriver.Chrome()

# define and navigate to the target URL
url = 'http://automationpractice.com'
driver.get(url)

# find the link for the WOMEN area and click it
if driver.find_elements_by_link_text('WOMEN'):
    women_link = driver.find_element_by_link_text('WOMEN')
    women_link.click()
    time.sleep(2)
    
# determine the number of product containers before filtering
initial_containers = len(driver.find_elements_by_class_name('product-container'))
filtered_containers = initial_containers  # default if no filter can be applied
filtered = False

# try to get filtered products using checkbox
if driver.find_elements_by_id('layered_block_left'):
    catalog_block = driver.find_element_by_id('layered_block_left')

    # use find_elements so that a missing link yields an empty list, not an exception
    if catalog_block.find_elements_by_partial_link_text('Tops'):
        catalog_block.find_element_by_partial_link_text('Tops').click()
        time.sleep(2)
        
    filtered_containers = len(driver.find_elements_by_class_name('product-container'))

# check to see if filtering occurred. If not, refresh page.
if initial_containers == filtered_containers:
    print('No filtering occured! Refreshing page.')
    driver.refresh()
    time.sleep(2)

else:
    filtered = True

# if filtering did not occur, try subcategory method
if not filtered:
    
    if driver.find_elements_by_id('subcategories'):
        subcategory_area = driver.find_element_by_id('subcategories')
        subcategory_area.find_element_by_partial_link_text('TOPS').click()
        time.sleep(2)
        
    filtered_containers = len(driver.find_elements_by_class_name('product-container'))
        
if initial_containers == filtered_containers:
    print('No filtering occured! Refreshing page.')
    driver.refresh()
    time.sleep(2)
else:
    filtered = True

# if filtering was successful, print containers
if filtered:
    for container in driver.find_elements_by_class_name('product-container'):
        print(container.text)
        print('-'*25)
        
driver.quit()

No filtering occured! Refreshing page.
Faded Short Sleeve T-shirts
$16.51
Add to cart
More
In stock
Add to Wishlist
Add to Compare
-------------------------
Blouse
$27.00
Add to cart
More
In stock
Add to Wishlist
Add to Compare
-------------------------
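
The script above pauses with fixed `time.sleep(2)` calls, which can be too short on a slow connection and wasteful on a fast one. A common alternative is to poll until a condition holds or a timeout expires. The following is a hand-rolled sketch of that idea; `wait_until` is a hypothetical helper, and `Selenium` itself provides `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Usage with the names from the script above (not executed here):
# changed = wait_until(
#     lambda: len(driver.find_elements_by_class_name('product-container'))
#     != initial_containers,
#     timeout=5,
# )
```

A return value of `None` plays the same role as the `No filtering occured!` branch in the script: the page did not change within the allowed time.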
