# Selenium: Web Scraping

In [1]:
import re
import time
from urllib.parse import urljoin
import requests

import bs4
import pandas as pd

from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

## 1. Overview
There are two main ways to scrape data from websites: parsing HTML responses (server-side rendering) and reading JSON responses (client-side rendering).

::::{tab-set}

:::{tab-item} Server-side

We extract *unstructured* data from the user interface, which is friendly to human eyes but not to computers. The data is available most of the time, except when website owners intentionally protect it. This approach requires some basic knowledge of HTML.

:::

:::{tab-item} Client-side

We crawl *structured* REST API responses, which are only available on websites that load their data through such APIs. The advantage is that the returned data is in JSON format, so it can be easily extracted and processed. Crawling data this way is much easier, so we are going to start with it.

:::

::::

### 1.1. Requests
Instead of accessing websites using a browser such as Google Chrome, we can use the [Requests] library to download the raw content of a page and interact with it. Most of the time, we are going to call the `get()` function and then read the `text` (decoded string) or `content` (raw bytes) attribute of the response.

[Requests]: https://github.com/psf/requests

In [27]:
import requests

In [42]:
url = 'https://books.toscrape.com/index.html'
response = requests.get(url)
response

<Response [200]>

In [36]:
response.text[:1000]

'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scrape - Sandbox\n</title>\n\n        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n        <meta name="created" content="24th Jun 2016 09:29" />\n        <meta name="description" content="" />\n        <meta name="viewport" content="width=device-width" />\n        <meta name="robots" content="NOARCHIVE,NOCACHE" />\n\n        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n        <!--[if lt IE 9]>\n        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>\n        <![endif]-->\n\n        \n            <link rel="shortcut icon" href

In [37]:
response.headers

{'Date': 'Fri, 06 Jan 2023 02:51:21 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Thu, 26 May 2022 21:15:15 GMT', 'ETag': 'W/"628fede3-c85e"', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload', 'Content-Encoding': 'br'}

### 1.2. APIs crawling
Not all URLs point to an HTML page. For example, the URL https://api.github.com/repos/dmlc/xgboost points to a raw document in JSON format. Such a URL is called a REST API endpoint, and its response can be easily converted into a Python dictionary using the `json()` method. But APIs in practice are not always that simple: they can involve concepts such as headers and payloads. So, learning real-world API structures and how to find them is our goal in this section.
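
As a rough offline illustration of what `json()` gives us, the sketch below parses a JSON string with the standard `json` module. The body is a made-up stand-in, not a real GitHub response, and the field names are chosen only for the example.

```python
import json

# Made-up JSON body standing in for an API response (not real data).
body = '{"name": "example-repo", "stargazers_count": 42, "topics": ["ml", "python"], "archived": false}'

# Parsing turns JSON objects into dicts, arrays into lists,
# and true/false/null into True/False/None.
data = json.loads(body)

print(type(data).__name__)
print(data['stargazers_count'] + 1)
print(data['topics'][0])
print(data['archived'])
```

Once parsed, the data behaves like any other Python dictionary, which is why JSON endpoints are so convenient to crawl.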

In [47]:
import pandas as pd
import requests

#### Documented APIs
Many organizations officially support REST APIs for data access, such as [GitHub], [Facebook], [Twitter], [Reddit] and [Clash Royale]. To start using these APIs, developers usually need to register an account and generate an API key, although some endpoints don't require any authentication. Either way, the instructions for requesting data can be found on their documentation sites.

Now let's work through an example by requesting GitHub's [list-repository-languages] endpoint to show the size (in bytes) of the code written in each language.
- The main component of a request is the URL, which follows a pre-defined syntax. In this case, the URL has two placeholders, `OWNER` and `REPO`, which can be filled in nicely using Python formatted strings. We can use this URL to access the data of any public repository.
- For private repositories, we must provide an authentication key with appropriate permissions. This additional information is sent in the headers; you can think of them as the metadata of the API call.
- There are different request methods, such as GET, POST, PUT and DELETE, that serve different purposes. As we only want to collect data, we only need to care about GET and, sometimes, POST.

[GitHub]: https://docs.github.com/en/rest
[Facebook]: https://developers.facebook.com/docs/groups-api/reference
[Twitter]: https://developer.twitter.com/en/docs/twitter-api
[Reddit]: https://www.reddit.com/dev/api/
[Clash Royale]: https://developer.clashroyale.com/
[list-repository-languages]: https://docs.github.com/en/rest/repos/repos#list-repository-languages

In [3]:
OWNER = 'dmlc'
REPO = 'xgboost'

url = f'https://api.github.com/repos/{OWNER}/{REPO}/languages'
response = requests.get(url)
response.json()

{'C++': 2215388,
 'Python': 1203192,
 'Cuda': 863316,
 'Scala': 470919,
 'R': 343950,
 'Java': 206895,
 'CMake': 52369,
 'Shell': 45902,
 'C': 22503,
 'Makefile': 8179,
 'PowerShell': 4308,
 'CSS': 3812,
 'Dockerfile': 2364,
 'M4': 2131,
 'Batchfile': 1383,
 'Groovy': 1251,
 'TeX': 913}

In [5]:
import os

OWNER = 'hungpq7'
REPO = 'courses'

# read the personal access token from an environment variable instead of hard-coding it
# (the variable name GITHUB_TOKEN is our own choice)
token = os.environ['GITHUB_TOKEN']

url = f'https://api.github.com/repos/{OWNER}/{REPO}/languages'
headers = {
    'Accept': 'application/vnd.github+json',
    'Authorization': f'Bearer {token}',
    'X-GitHub-Api-Version': '2022-11-28',
}
response = requests.get(url, headers=headers)
response.json()

{'Jupyter Notebook': 8166118, 'Perl': 1432, 'Shell': 1286}

:::{note}

We can also make API calls from the command line using the [cURL](https://en.wikipedia.org/wiki/CURL) command.

:::

In [6]:
%%bash
curl https://api.github.com/repos/hungpq7/courses/languages \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer $GITHUB_TOKEN" \
    -H "X-GitHub-Api-Version: 2022-11-28"

{
  "Jupyter Notebook": 8166118,
  "Perl": 1432,
  "Shell": 1286
}


#### Hidden APIs
The increasing popularity of JavaScript-based frameworks such as React, Angular and Vue encourages websites to be rendered *client-side*. This means that more websites use REST APIs to fetch data, fill HTML templates with it and render the page on the user's computer. Such APIs are usually not documented, so making use of them takes some tricks: we have to find them and understand their structures ourselves. In this section, we are going to inspect a page's network activity to locate scrapable APIs.

::::{admonition} Case study
:class: seealso

We will be crawling all articles on the home page of https://techcrunch.com/ with the following steps:
- Go to the target page and open the browser's developer tools. The shortcut in Google Chrome is `F12` or `Ctrl + Shift + I`. The tools do not record activity that happened before they were opened, so we need to press `Ctrl + R` to reload the target page.
- Navigate to the Network tab to show all requests the page has made, then filter by Fetch/XHR. This filter keeps only the requests that fetch data (typically JSON) and hides other types of responses that we don't need, such as images, media and CSS. These buttons are circled in yellow in the image below.

:::{image} ../image/rest_api_response.png
:height: 300px
:align: center
:::

- At this point, one of the displayed requests returns the data we are looking for. We need to explore a bit to identify that API, starting with the request names. In the example of TechCrunch, the API "magazine" sounds promising. Indeed, when we click this API and preview its response, we see a list of items storing the articles of the website.
- Now that we have found the API, let's learn how to use it. Switching to the Headers tab reveals the URL and the request method of this API. Other APIs may require headers as well as a payload, but this one does not.

:::{image} ../image/rest_api_url.png
:height: 250px
:align: center
:::

::::

In [2]:
url = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&cachePrevention=0'
response = requests.get(url)
response

<Response [200]>

In [3]:
data = []
for item in response.json():
    sample = {
        'id': item['id'],
        'category': item['primary_category']['slug'],
        'author': item['parselyMeta']['parsely-author'][0],
        'title': item['parselyMeta']['parsely-title'],
    }
    data.append(sample)

pd.DataFrame(data).head()

| | id | category | author | title |
|---|---|---|---|---|
| 0 | 2473566 | social | Taylor Hatmaker | Microsoft is sunsetting social VR pioneer Alts... |
| 1 | 2472783 | security | Zack Whittaker | A hack at ODIN Intelligence exposes a huge tro... |
| 2 | 2472976 | startups | Kyle Wiggers | Alphabet makes cuts, Twitter bans third-party ... |
| 3 | 2472885 | startups | Natasha Mascarenhas | Tech forgot its umbrella |
| 4 | 2472793 | apps | Sarah Perez | This Week in Apps: Twitter kills third-party a... |
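
Because hidden APIs are undocumented, a field that exists in one item may be missing or empty in another, and direct indexing would then raise an error. The sketch below shows one possible way to extract nested fields defensively. The `dig()` helper is hypothetical (our own), and the JSON string is a hand-written stand-in that only mimics the key names used above.

```python
import json

# Hand-written stand-in for two items of an undocumented API response (not real data).
raw = '''
[
  {"id": 1, "primary_category": {"slug": "apps"},
   "parselyMeta": {"parsely-author": ["Jane Doe"], "parsely-title": "First post"}},
  {"id": 2, "primary_category": null,
   "parselyMeta": {"parsely-title": "Second post"}}
]
'''

def dig(obj, *keys, default=None):
    """Follow keys/indices through nested dicts and lists; return default on any miss."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

rows = [
    {
        'id': dig(item, 'id'),
        'category': dig(item, 'primary_category', 'slug'),
        'author': dig(item, 'parselyMeta', 'parsely-author', 0),
        'title': dig(item, 'parselyMeta', 'parsely-title'),
    }
    for item in json.loads(raw)
]

for row in rows:
    print(row)
```

The second item has no category or author, so those fields come back as `None` instead of stopping the crawl halfway.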


::::{admonition} Case study
:class: seealso

Sometimes, websites block connections from non-browser clients. For example, when inspecting the website https://tiki.vn/nha-sach-tiki/c8322, I found an API named "listing" which contains all products shown on the page. With the bare URL, we can access its response using Chrome but get blocked when using Requests, because Requests identifies itself with its own default *user agent*. This can be easily bypassed by overriding the *user agent* header as follows.

:::{image} ../image/rest_api_payload.png
:height: 300px
:align: center
:::

Now, if we switch to the Payload tab, we can observe that the parameters there match exactly the query components of the URL. With this insight, we can rewrite the request with the URL and the payload passed separately, which is far more readable.

::::
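
The correspondence between the query string and the payload can be checked offline with the standard library. This is only a sketch using `urllib.parse`; the variable names are ours, and no request is sent.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

full_url = 'https://tiki.vn/api/personalish/v1/blocks/listings?limit=40&category=8322&page=1&urlKey=nha-sach-tiki'

# Split the URL into its base part and its query string.
parts = urlsplit(full_url)
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'

# The query string becomes a dictionary of parameters (all values are strings).
payload = dict(parse_qsl(parts.query))
print(base_url)
print(payload)

# Encoding the payload again gives back the original URL.
rebuilt = f'{base_url}?{urlencode(payload)}'
print(rebuilt == full_url)

# Changing a single parameter, for example the page number, is now trivial.
next_page = f"{base_url}?{urlencode({**payload, 'page': 2})}"
print(next_page)
```

Keeping the parameters in a dictionary makes it easy to loop over pages or categories without string surgery on the URL.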

In [46]:
requests.utils.default_headers()

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [56]:
url = 'https://tiki.vn/api/personalish/v1/blocks/listings?limit=40&category=8322&page=1&urlKey=nha-sach-tiki'
headers = {'user-agent': 'Mozilla/5.0 Chrome/108.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers)
response

<Response [200]>

In [60]:
url = 'https://tiki.vn/api/personalish/v1/blocks/listings'
headers = {'user-agent': 'Mozilla/5.0 Chrome/108.0.0.0 Safari/537.36'}
payload = {
    'limit': 40,
    'category': 8322,
    'page': 1,
    'urlKey': 'nha-sach-tiki',
}
response = requests.get(url, headers=headers, params=payload)
response

<Response [200]>

In [57]:
for product in response.json()['data'][:5]:
    name = product['name']
    print(name)

Cây Cam Ngọt Của Tôi
Hành Tinh Của Một Kẻ Nghĩ Nhiều
Không Phải Sói Nhưng Cũng Đừng Là Cừu -Tặng kèm bookmark 2 mặt
Thao Túng Tâm Lý
Thiên Tài Bên Trái, Kẻ Điên Bên Phải (Tái Bản)


## 2. HTML parser

### 2.1. HTML concepts
[HTML] is a language for creating web pages. The easiest way to think about HTML is as a language with the same purpose as Markdown, with less readability but more expressivity. An HTML document is built from *elements* organized in a hierarchical structure. For example, here are the components of the element shown in the cell below:
- The *tag* `<span>` usually pairs with a closing one, `</span>`. Some tags, such as `<br>`, can stand alone. We also refer to the tag as the element name.
- The text between the two tags, `computer`, is the *content* of that element.
- An element can have a number of *attributes* such as `class` and `style`, each with its corresponding *value* placed after the equal sign `=`.

[HTML]: https://en.wikipedia.org/wiki/HTML

In [11]:
%%html
<span class='breadcrumb content' style='color:indianred'>computer</span>

:::{note}

Some global attributes occur everywhere in HTML documents. Keeping an eye on them will help you a lot when scraping data:
- The attribute `id` is the identifier of an element and must be unique across the document. It is useful for locating a specific element.
- The attribute `class` refers to CSS styles. It is useful for matching a list of items that share the same style. An element can have multiple classes separated by spaces, for example *breadcrumb* and *content*.

:::
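
To make the anatomy of an element concrete, here is a minimal sketch that uses Python's built-in `html.parser` to split the `<span>` element above into its tag, attributes and content. The `ElementRecorder` class is our own illustration and is not part of any scraping library.

```python
from html.parser import HTMLParser

class ElementRecorder(HTMLParser):
    """Record start tags (with attributes), text content and end tags in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.events.append(('content', data.strip()))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

recorder = ElementRecorder()
recorder.feed("<span class='breadcrumb content' style='color:indianred'>computer</span>")
recorder.close()

for event in recorder.events:
    print(event)

# The class attribute is a space-separated list of class names.
classes = recorder.events[0][2]['class'].split()
print(classes)
```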

### 2.2. Parsing
There is a powerful library, [Beautiful Soup], that helps us navigate HTML documents and find the desired elements. All we need to do is pass the HTML document to it (to get an object called the *soup*), then use its methods and attributes to extract what we want. We are going to demonstrate the main features of Beautiful Soup using a small document.

[Beautiful Soup]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [2]:
import bs4

In [87]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <time class="story">2000-01-01 06:00:00</time>
    </body>
</html>
"""

In [88]:
soup = bs4.BeautifulSoup(html, 'html.parser')

#### Tree navigation
We can easily navigate an HTML document because Beautiful Soup (BS) exposes child tags as attributes of the *soup* and HTML attributes through dictionary-style indexing. This syntax is simple, but the downside is that it cannot handle multiple children with the same tag (only the first one is returned).

In [63]:
tag = soup.body.p.a

In [64]:
tag.text

'Elsie'

In [65]:
tag['href']

'https://example.com/elsie'

#### Element searching
BS supports a more reliable way of finding exactly the element we want, via the [`find_all()`] method. This function searches for HTML tags and attributes:
- The [first argument] is the tag you want to find. It can be a string, a list of strings, a regex pattern or a function.
- Other arguments are named after HTML attributes. Some attribute names cannot be written as Python keyword arguments, such as `class` (a reserved word, for which BS offers `class_`) or hyphenated names like `custom-attribute`. We can pass these as a dictionary to the `attrs` argument.

The ideal case in searching is when you know the ID of an element, thanks to its uniqueness. In this case, we can safely use the `find()` method instead, which returns only the first result. Otherwise, combinations of tags and attributes will help you find the element you want very quickly.

[`find_all()`]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
[first argument]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters
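
The filter kinds listed above can be sketched on a shortened copy of the sample document. The `data-kind` attribute is added here purely for illustration; it is not part of the original document.

```python
import bs4

html = """
<html>
    <head><title>The Dormouse's story</title></head>
    <body>
        <p class="story">
            <a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        </p>
        <time class="story" data-kind="published">2000-01-01 06:00:00</time>
    </body>
</html>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

# A list of strings matches any of the given tag names.
by_list = soup.find_all(['title', 'time'])
print([tag.name for tag in by_list])

# A function receives each tag and returns True for the ones to keep.
by_func = soup.find_all(lambda tag: tag.has_attr('id'))
print([tag['id'] for tag in by_func])

# class is a reserved word in Python, so the keyword argument is class_.
by_class = soup.find_all(class_='story')
print([tag.name for tag in by_class])

# Hyphenated attribute names cannot be keyword arguments, so they go through attrs.
by_attrs = soup.find_all(attrs={'data-kind': 'published'})
print([tag.name for tag in by_attrs])
```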

In [94]:
import re
import bs4

In [82]:
soup.find(id='link2')

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

In [85]:
soup.find_all(re.compile('^t'))

[<title>The Dormouse's story</title>,
 <time class="story">2000-01-01 06:00:00</time>]

In [93]:
soup.find_all('a', class_='sister')

[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [92]:
attrs = {
    'class': 'sister',
    # raw string, so that \S is passed to the regex engine unchanged
    'href': re.compile(r'https\S+')
}

soup.find_all('a', attrs=attrs)

[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

:::{admonition} Case study
:class: seealso

Let's use Beautiful Soup to crawl all fiction books on the page https://books.toscrape.com/catalogue/category/books/fiction_10/index.html. This website does not use REST APIs, so the only way is to parse its HTML source. Our crawler has two phases: (1) gathering book URLs and (2) crawling the information of each book.

*Phase 1*
- First, examine the URL structure to find that we can replace "index" with "page-n" to access the n-th page. We increase the page index one by one and stop when we get a 404 error, which means the index has exceeded the last page.
- On the first page, use the *inspect* tool of Chrome to find the book containers (each contains an image, a title, ratings and a price). We observe that each container has 4 classes: `col-xs-6`, `col-sm-4`, `col-md-3` and `col-lg-3`. You can double-check this by searching for and counting the elements that use all 4 classes. There are 20 of them, which matches the number of books the page shows.
- All the information in the containers also appears on the book pages, so the only data we collect here is the book URLs. Note that these URLs are relative; we can easily convert them into full URLs using the function [`urljoin()`].

*Phase 2*
- Access each URL collected in the first phase. Locate the content container.
- Extract the important fields and add them to a Pandas dataframe. There is no new technique in this phase.

:::

[`urljoin()`]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
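
The two URL manipulations of Phase 1 can be tried offline. In this sketch the relative `href` is an assumed example of what a book container might hold; the actual values come from the page at crawl time.

```python
from urllib.parse import urljoin

url_base = 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html'

# Build the URLs of the first three listing pages by replacing "index" with "page-n".
page_urls = [url_base.replace('index', f'page-{n}') for n in range(1, 4)]
for page_url in page_urls:
    print(page_url)

# Resolve a relative link against the page it was found on.
# Each "../" climbs one directory up from .../books/fiction_10/.
href = '../../../soumission_998/index.html'  # assumed example value
url_book = urljoin(url_base, href)
print(url_book)
```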

In [2]:
import re
import requests
import bs4
import pandas as pd
import time
from urllib.parse import urljoin

In [4]:
def crawl_book_url(urlBase) -> list:
    listUrlBook = []
    nPage = 1
    while True:
        # access crawl page and check if it loads successfully (status 200)
        urlPage = urlBase.replace('index', f'page-{nPage}')
        response = requests.get(urlPage)
        if response.status_code != 200:
            break
        
        # create soup
        html = response.text
        soup = bs4.BeautifulSoup(html, 'html.parser')
        
        # get book containers
        listContainer = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
        
        # get book urls
        for container in listContainer:
            href = container.article.h3.a['href']
            urlBook = urljoin(urlBase, href)
            listUrlBook.append(urlBook)
        
        # advance to the next page
        nPage += 1
    
    return listUrlBook

In [7]:
from io import StringIO

def crawl_book_info(listUrlBook: list) -> pd.DataFrame:
    data = []
    for urlBook in listUrlBook:
        response = requests.get(urlBook)
        html = response.content
        soup = bs4.BeautifulSoup(html, 'html.parser')
        
        # locate the content container and the book title
        container = soup.find('article', class_='product_page')
        title = container.find('div', class_='product_main').h1.text
        
        # turn the product information table into a Series indexed by field name;
        # StringIO is used because recent pandas versions deprecate literal HTML input
        table = container.find('table', class_='table-striped').prettify()
        table = pd.read_html(StringIO(table))
        table = pd.concat(table)
        table = table.set_index(0)[1]
        
        # keep only the numeric part of the price fields, e.g. '£50.10' -> 50.1
        upc = table['UPC']
        price = table['Price (excl. tax)']
        price = float(re.findall(r'\d+\.\d+', price)[0])
        tax = table['Tax']
        tax = float(re.findall(r'\d+\.\d+', tax)[0])
        
        sample = {
            'upc': upc,
            'title': title,
            'price': price,
            'tax': tax,
        }
        data.append(sample)
        
    return pd.DataFrame(data)

In [6]:
urlBase = 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html'
listUrlBook = crawl_book_url(urlBase)
len(listUrlBook)

65

In [8]:
df = crawl_book_info(listUrlBook[:5])
df

| | upc | title | price | tax |
|---|---|---|---|---|
| 0 | 6957f44c3847a760 | Soumission | 50.1 | 0.0 |
| 1 | b12b89017878a60d | Private Paris (Private #10) | 47.61 | 0.0 |
| 2 | 8d455c7539795d2a | We Love You, Charlie Freeman | 50.27 | 0.0 |
| 3 | 709822d0b5bcb7f4 | Thirst | 17.27 | 0.0 |
| 4 | d01ac97e2b8947c2 | The Murder That Never Was (Forensic Instincts #5) | 54.11 | 0.0 |
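
The price columns above come from text such as `£50.10`, so the numeric part has to be pulled out with a regular expression. Below is a small sketch of that step in isolation; `parse_price()` is a hypothetical helper of ours, and it returns `None` when no number is found instead of raising an error.

```python
import re

def parse_price(text):
    """Return the first decimal number found in text as a float, or None if absent."""
    match = re.search(r'\d+\.\d+', text)
    return float(match.group()) if match else None

print(parse_price('£50.10'))
print(parse_price('£0.00'))
print(parse_price('In stock'))
```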


## 3. Web driver
In many scenarios, a website requires users to perform some actions before it reveals data, which Beautiful Soup cannot handle. In such cases, we need a *web driver* that can emulate user interaction with a browser. [Selenium] is a library built to serve this purpose; it is designed for automated testing but can also be used for web scraping.

[Selenium]: https://github.com/SeleniumHQ/selenium

### 3.1. Driver initialization
In order to use Selenium, we must first set up a web driver (Chrome is recommended). This can be done by downloading [Chrome Driver] manually and giving its path to Selenium. But thanks to [Webdriver Manager], the driver matching the installed browser is detected and downloaded automatically. After initializing, Selenium opens a browser window so we can monitor how actions are performed. We can configure the web driver in different ways:
- Add some [options], such as *headless* and *start-maximized*
- Use a custom [page load strategy]
- Quit the driver to free up memory

[Chrome Driver]: https://chromedriver.chromium.org/home
[Webdriver Manager]: https://github.com/SergeyPirogov/webdriver_manager
[options]: https://www.selenium.dev/documentation/webdriver/drivers/options/
[page load strategy]: https://www.selenium.dev/documentation/webdriver/drivers/options/#pageloadstrategy

In [1]:
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

In [4]:
service = Service(ChromeDriverManager().install())
options = Options()
# options.add_argument('--headless')
# options.add_argument('--window-size=1920,1080')
options.add_argument('start-maximized')

driver = webdriver.Chrome(service=service, options=options)

time.sleep(5)
driver.quit()

### 3.2. Element locating
In Selenium, we find elements using the [`find_elements()`] method with the first argument being a [`By`] locator. Besides the basic locating strategies using tags and attributes, Selenium provides two convenient locators: XPath and CSS selector. You can obtain them simply by right-clicking an element in the browser's inspector and copying its XPath or selector, but they are worth learning carefully.

[`find_elements()`]: https://www.selenium.dev/documentation/webdriver/elements/finders
[`By`]: https://selenium-python.readthedocs.io/locating-elements.html

#### XPath
The general syntax of XPath is `/element/element/...`, starting from the root. You can begin the path with a double slash, `//element/element/...`, to turn the absolute path into a relative one that matches anywhere in the document. The basic syntaxes for searching elements are `tag[@attr='value']` and `tag[ordinal]`. For example:
- `*[@id='promotion']` matches any tag that has `id='promotion'`
- `div[@class='breadcrumb']` matches elements that have `<div class='breadcrumb'>`
- `span[7]` matches any `<span>` element that is the seventh `<span>` child of its parent.
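
These patterns can be tried without a browser, since Python's built-in `xml.etree.ElementTree` supports a limited subset of XPath. This is only a sketch: ElementTree requires relative paths starting with `.`, and the tiny document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document; it must be well-formed XML for ElementTree to parse it.
doc = ET.fromstring("""
<body>
  <div id="promotion" class="breadcrumb">
    <p>intro</p>
    <span>first span</span>
    <span>second span</span>
  </div>
</body>
""")

# Any tag with the given id.
by_id = doc.findall(".//*[@id='promotion']")
print([el.tag for el in by_id])

# A specific tag with a specific attribute value.
by_class = doc.findall(".//div[@class='breadcrumb']")
print([el.get('id') for el in by_class])

# The ordinal counts siblings with the same tag: span[2] is the second <span>,
# even though it is the third child of the <div>.
second_span = doc.findall(".//span[2]")
print([el.text for el in second_span])
```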

#### CSS selector
The general syntax of CSS selector is `element > element > ...`, being relative by nature. The basic syntaxes for searching elements are `tag.class`, `tag#id`, `tag[attr=value]`, `tag:func(args)`. For example:
- `#promotion` matches any tag that has `id='promotion'`
- `div.breadcrumb` matches elements that have `<div class='breadcrumb'>`
- `span[role=alert]` matches elements that have `<span role='alert'>`
- `span:nth-child(7)` matches any `<span>` element that is the seventh child of its parent.
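
The same selectors also work outside Selenium through the `select()` method of Beautiful Soup, which makes them easy to try on a small document. The HTML below is made up for this sketch.

```python
import bs4

html = """
<div id="promotion" class="breadcrumb">
  <p>intro</p>
  <span role="alert">first span</span>
  <span>second span</span>
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

# Select by id, by tag plus class, and by attribute value.
print([el.name for el in soup.select('#promotion')])
print([el.name for el in soup.select('div.breadcrumb')])
print([el.text for el in soup.select('span[role=alert]')])

# nth-child counts all child elements of the parent, whatever their tag:
# the third child of the <div> is the second <span>.
third_child = soup.select('span:nth-child(3)')
print([el.text for el in third_child])
```

Note that `:nth-child(n)` counts every child element, so a selector copied from the browser can break when the page inserts an extra sibling before the target.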

#### Implementation
In this section, we attempt to crawl the first 20 books in the fiction category. We try the traditional way first, using tags and classes, only to find that Selenium does not support this style very well: `By.CLASS_NAME` accepts only a single class name. Next, we use XPath and CSS selector by copying those of a single book from Chrome and then tweaking them to match all 20.

In [1]:
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

In [2]:
service = Service(ChromeDriverManager().install())
options = Options()
options.add_argument('start-maximized')
driver = webdriver.Chrome(service=service, options=options)


In [4]:
url = 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html'
driver.get(url)

In [5]:
listBook = driver.find_elements(By.CLASS_NAME, 'col-xs-6')
len(listBook)

20

In [21]:
xpath = '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li'
listBook = driver.find_elements(By.XPATH, xpath)
len(listBook)

20

In [23]:
selector = '#default > div > div > div > div > section > div:nth-child(2) > ol > li'
listBook = driver.find_elements(By.CSS_SELECTOR, selector)
len(listBook)

20

In [6]:
driver.quit()

### 3.3. Actions
In this section we use Selenium to perform basic [actions] on a website: clicking, typing keys and hovering.

[actions]: https://selenium-python.readthedocs.io/navigating.html

In [1]:
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

In [4]:
service = Service(ChromeDriverManager().install())
options = Options()
options.add_argument('start-maximized')
driver = webdriver.Chrome(service=service, options=options)

In [5]:
url = 'https://www.tensorflow.org/'
driver.get(url)

#### Send keys

In [7]:
path = '/html/body/section/devsite-header/div/div[1]/div/div/div[2]/devsite-search/form/div[1]/div/input'
element = driver.find_element(By.XPATH, path)

In [35]:
element.send_keys('lstm')

In [47]:
element.send_keys(Keys.CONTROL, 'A')

In [39]:
element.send_keys(Keys.CONTROL, 'C')

In [49]:
element.send_keys(Keys.DOWN)

In [50]:
element.clear()

#### Hover

In [54]:
chains = ActionChains(driver)

In [55]:
path = '/html/body/section/devsite-header/div/div[1]/div/div/div[2]/div[1]/devsite-tabs/nav/tab[6]/a[1]'
element = driver.find_element(By.XPATH, path)
chains.move_to_element(element).perform()

In [56]:
path = '/html/body/section/devsite-header/div/div[1]/div/div/div[2]/div[1]/devsite-tabs/nav/tab[5]/a[1]'
element = driver.find_element(By.XPATH, path)
chains.move_to_element(element).perform()

In [57]:
element.click()

## Resources
- *gregreda - [Web Scraping 201: finding the API](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/)*
- *jovian - [Introduction to Web Scraping and REST APIs](https://jovian.ai/aakashns/python-web-scraping-and-rest-api)*
- *blog.devgenius - [Scrape Data without Selenium by Exposing Hidden APIs](https://blog.devgenius.io/scrape-data-without-selenium-by-exposing-hidden-apis-946b23850d47)*
- *medium - [Web Crawling Made Easy with Scrapy and REST API](https://medium.com/@geneng/web-crawling-made-easy-with-scrapy-and-rest-api-ed993e84abd3)*
- *w3schools - [XPath syntax](https://www.w3schools.com/xml/xpath_syntax.asp)*
- *w3schools - [CSS selector reference](https://www.w3schools.com/cssref/css_selectors.php)*

## Install

In [None]:
pip install -U selenium

In [None]:
pip install webdriver-manager