## **#04: Data Collection Methods: Selenium**
- Instructor: [Jaeung Sim](https://www.business.uconn.edu/person/jaeung-sim/) (University of Connecticut)
- Course: OPIM 5512 Data Science Using Python
- Last updated: February 6, 2025

## Selenium

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control web browsers like Chrome, Firefox, or Safari. Selenium can control both a locally installed browser instance and one running on a remote machine over the network.

Selenium provides a wide range of ways to interact with sites, such as:
* Clicking buttons
* Populating forms with data
* Scrolling the page
* Taking screenshots
* Executing your own custom JavaScript code

But the strongest argument in its favor is the ability to handle sites in a natural way, just as any browser will. This is particularly valuable for JavaScript-heavy Single-Page Application (SPA) sites. If you scraped such a site with the traditional combination of HTTP client and HTML parser, you would mostly get JavaScript files and very little data to scrape.
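
To make the last point concrete, here is a minimal sketch that parses the kind of raw HTML an HTTP client might receive from a single-page app. The `RAW_HTML` string and the `PageSummary` class are made up for illustration; they are not taken from any real site. The parser finds script tags but no visible text, because the data would only appear after a browser runs the JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, similar to what an HTTP client might receive from a single-page app
RAW_HTML = """
<html>
  <head><script src="/static/app.js"></script></head>
  <body>
    <div id="root"></div>
    <script src="/static/vendor.js"></script>
  </body>
</html>
"""


class PageSummary(HTMLParser):
    """Count <script> tags and collect visible text found outside of scripts."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._in_script:
            self.visible_text.append(text)


summary = PageSummary()
summary.feed(RAW_HTML)
print("script tags:", summary.script_count)    # script tags: 2
print("visible text:", summary.visible_text)   # visible text: []
```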

**References:**
* [Web Scraping Using Selenium And Python](https://www.scrapingbee.com/blog/selenium-python/)

## #1. Basics

### Installation

In [None]:
!pip install selenium

### Quickstart

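1. Check the version of your Chrome.

>"..." in the upper-right corner > "Settings" > "About Chrome"

2. Download a ChromeDriver zip file appropriate for your version from https://chromedriver.chromium.org/downloads.
3. Unzip the file and move the ChromeDriver executable to the folder that contains the current ipynb file.

Note: Selenium 4.6 and later ship with Selenium Manager, which can download a matching driver automatically, so steps 2 and 3 may be unnecessary on recent versions.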

In [None]:
from selenium import webdriver

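# To point Selenium at a specific driver binary (Selenium 4 style):
# from selenium.webdriver.chrome.service import Service
# DRIVER_PATH = '/path/to/chromedriver'
# driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH))

# Launch the Chrome browser
driver = webdriver.Chrome() # Since we have ChromeDriver in our current folder, no need to set the path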

In [None]:
# Go to the following site
driver.get('https://google.com')

In [None]:
# Go to the following site
driver.get('https://imdb.com')

### Headless option

Running the browser from Selenium the way we just did is particularly helpful during development. It allows you to observe exactly what is going on and how the page and the browser behave in the context of your code. Once you are happy with everything, however, it is generally advisable to switch to headless mode in production.

In that mode, Selenium will start Chrome in the "background" without any visual output or windows.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # the options.headless property was removed in recent Selenium 4 releases
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options)

### 2 ways to close a browser
- `driver.close()` - Closes the browser window that the driver currently has focus on
- `driver.quit()` - Closes all browser windows and safely ends the session

(`Dispose()` exists only in Selenium's C# bindings, where `Quit()` calls it. In Python, use `quit()`.)

In [None]:
driver.close()
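
If a scraping cell raises an error before reaching `quit()`, the browser can be left running. One way to guard against that is a small context manager that always calls `quit()`. The sketch below is an illustration, not part of Selenium: `managed_driver` is a hypothetical helper, and `RecordingDriver` is a stand-in object so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body raised an exception


class RecordingDriver:
    """Stand-in for a real WebDriver; it only records whether quit() was called."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


stub = RecordingDriver()
try:
    with managed_driver(lambda: stub) as d:
        raise RuntimeError("simulated scraping error")
except RuntimeError:
    pass

print(stub.quit_called)  # True
```

With a real browser, the same helper would be used as `with managed_driver(webdriver.Chrome) as driver: ...`.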

### Example of extracting the page source in headless mode

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # the options.headless property was removed in recent Selenium 4 releases
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options)
driver.get("https://www.imdb.com")

# print the current page source
print(driver.page_source)
driver.quit()

### Locating Elements

**The `find_element` methods**

`WebDriver` provides two main methods for finding elements:
- find_element
- find_elements

They are similar; the difference is that the former returns the first matching element, whereas the latter returns a list of all matching elements.
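
The two methods also differ when nothing matches: `find_element` raises `NoSuchElementException`, while `find_elements` returns an empty list. The sketch below uses that second behavior to write a lookup that returns `None` instead of raising. `find_first_or_none` is a hypothetical helper name, and `FakeDriver` is a stand-in so the example runs without a browser; the strings `"tag name"` and `"id"` are the values behind `By.TAG_NAME` and `By.ID`.

```python
def find_first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    matches = driver.find_elements(by, value)  # empty list if there is no match
    return matches[0] if matches else None


class FakeDriver:
    """Stand-in for a real WebDriver so the helper can be tried without a browser."""

    def __init__(self, elements):
        self._elements = elements  # maps (by, value) to a list of "elements"

    def find_elements(self, by, value):
        return self._elements.get((by, value), [])


fake = FakeDriver({("tag name", "h1"): ["<h1 element>"]})
print(find_first_or_none(fake, "tag name", "h1"))  # <h1 element>
print(find_first_or_none(fake, "id", "missing"))   # None
```

With a real driver, the call would look like `find_first_or_none(driver, By.ID, "logout")`.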

Both methods support eight different search types, specified with the `By` class.

Type	| Description	| DOM Sample	| Example
--------|---------------|---------------|-----------------------------
By.ID	| Searches for elements based on their HTML ID	| `<div id="myID">`	| `find_element(By.ID, "myID")`
By.NAME	| Searches for elements based on their name attribute	| `<input name="myNAME">`	| `find_element(By.NAME, "myNAME")`
By.XPATH	| Searches for elements based on an XPath expression	| `<span>My <a>Link</a></span>`	| `find_element(By.XPATH, "//span/a")`
By.LINK_TEXT	| Searches for anchor elements based on a match of their text content	| `<a>My Link</a>`	| `find_element(By.LINK_TEXT, "My Link")`
By.PARTIAL_LINK_TEXT	| Searches for anchor elements based on a sub-string match of their text content	| `<a>My Link</a>`	| `find_element(By.PARTIAL_LINK_TEXT, "Link")`
By.TAG_NAME	| Searches for elements based on their tag name	| `<h1>`	| `find_element(By.TAG_NAME, "h1")`
By.CLASS_NAME	| Searches for elements based on their HTML classes	| `<div class="myCLASS">`	| `find_element(By.CLASS_NAME, "myCLASS")`
By.CSS_SELECTOR	| Searches for elements based on a CSS selector	| `<span>My <a>Link</a></span>`	| `find_element(By.CSS_SELECTOR, "span > a")`

## #2. Full Example

We will log in to **Hacker News** and scrape some data from it automatically using `Selenium`. Please visit the following website first: https://news.ycombinator.com/login

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch the Chrome browser
driver = webdriver.Chrome()

In [None]:
# Go to the following site
driver.get("https://news.ycombinator.com/login")

USERNAME = "test06901"
PASSWORD = "00000000"
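
Hard-coded credentials are fine for a throwaway test account, but for a real account you may prefer to keep them out of the notebook. The sketch below reads them from environment variables. The variable names `HN_USERNAME` and `HN_PASSWORD` and the helper `load_credentials` are assumptions made for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the username and password from a mapping of environment variables."""
    # HN_USERNAME and HN_PASSWORD are hypothetical names chosen for this sketch
    username = env.get("HN_USERNAME")
    password = env.get("HN_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set HN_USERNAME and HN_PASSWORD first")
    return username, password


# Demonstration with a plain dict standing in for the real environment
print(load_credentials({"HN_USERNAME": "test06901", "HN_PASSWORD": "00000000"}))
```

In the notebook, `USERNAME, PASSWORD = load_credentials()` would then replace the two hard-coded assignments.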

After you create your account:

In [None]:
# Go to the following site
driver.get("https://news.ycombinator.com/login")

# Find the first "input" element that has type "text" and type USERNAME into it
# (send_keys() and click() return None, so there is nothing useful to assign)
driver.find_element(By.XPATH, "//input[@type='text']").send_keys(USERNAME)
# Find the first "input" element that has type "password" and type PASSWORD into it
driver.find_element(By.XPATH, "//input[@type='password']").send_keys(PASSWORD)
# Find the first "input" element that has value "login" and click it
driver.find_element(By.XPATH, "//input[@value='login']").click()

In [None]:
from selenium.common.exceptions import NoSuchElementException

# Check whether the login succeeded
try:
    logout_button = driver.find_element(By.ID, "logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')

In [None]:
# Click the logout button (click() returns None, so the result is not assigned)
driver.find_element(By.ID, "logout").click()

In [None]:
# Another way to click the logout button
driver.find_element(By.XPATH, "//a[contains(text(), 'logout')]").click()

In [None]:
# Try this again!
try:
    logout_button = driver.find_element(By.ID, "logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')

### Taking screenshots

In [None]:
driver.save_screenshot('screenshot.png')

In [None]:
from IPython.display import Image
Image(filename='screenshot.png')

In [None]:
driver.quit()

### Waiting for an element to be present

We may have to wait until JavaScript has completed its work. There are typically two ways to approach that:

* Use `time.sleep()` before taking the screenshot.
* Employ a `WebDriverWait` object.

If you use `time.sleep()`, you have to pick a reasonable delay for your use case. The problem is that you are either waiting too long or not long enough, and neither is ideal. Also, the site may load more slowly on your residential ISP connection than when your code is running in production in a datacenter. With `WebDriverWait`, you don't really have to take that into account. It will wait only as long as necessary until the desired element shows up (or it hits a timeout).
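
The idea behind an explicit wait can be sketched in plain Python: check a condition repeatedly and stop as soon as it holds or the timeout expires. This is a conceptual illustration only, not Selenium's implementation; `wait_until`, `element_is_ready`, and the simulated page state are names invented for the example.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the "element" only appears on the third check
checks = {"count": 0}


def element_is_ready():
    checks["count"] += 1
    return "element" if checks["count"] >= 3 else None


print(wait_until(element_is_ready, timeout=2, poll_interval=0.1))  # element
print(checks["count"])                                             # 3
```

Unlike a fixed `time.sleep(2)`, the loop returns after roughly 0.2 seconds here, because it stops as soon as the condition is met.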

In [None]:
# Launch the Chrome browser
driver = webdriver.Chrome()

In [None]:
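from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

TIMEOUT = 5  # maximum number of seconds to wait
try:
    driver.get("https://news.ycombinator.com/login")
    # "mySuperId" does not exist on this page, so the wait is expected to time out
    element = WebDriverWait(driver, TIMEOUT).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
    print("Found the element")
except TimeoutException:
    print("No such element")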

In [None]:
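from selenium.common.exceptions import TimeoutException

TIMEOUT = 5  # maximum number of seconds to wait
try:
    driver.get("https://news.ycombinator.com/login")
    # The login button exists on this page, so the wait should succeed quickly
    element = WebDriverWait(driver, TIMEOUT).until(
        EC.presence_of_element_located((By.XPATH, "//input[@value='login']"))
    )
    print("Found the element")
except TimeoutException:
    print("No such element")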

In [None]:
driver.quit()

### Extract text and href data

In [None]:
# Launch the Chrome browser
driver = webdriver.Chrome()

# Go to the following site
driver.get("https://news.ycombinator.com/")

In [None]:
import pandas as pd

# Get all rank elements into a list
ranks = driver.find_elements(By.CLASS_NAME, "rank")
# Get all title elements into a list
titles = driver.find_elements(By.CLASS_NAME, "titleline")
# Get all detail elements into a list
details = driver.find_elements(By.CLASS_NAME, "subtext")

rows = []

for rank, title, detail in zip(ranks, titles, details):
    # The first <a> inside each title line holds both the link and the title text
    link = title.find_element(By.TAG_NAME, "a")

    # For each story, collect rank, href, title, and detail info in a dictionary
    rows.append({
        'rank': rank.text,
        'href': link.get_attribute('href'),
        'title': link.text,
        'detail': detail.text,
    })

# Build the dataframe once from the list of row dictionaries
# (faster than calling pd.concat on every iteration, and the index runs 0, 1, 2, ...)
df = pd.DataFrame(rows)

In [None]:
df
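
The `detail` column holds free text, so a natural next step is to pull numbers out of it. The sketch below assumes the detail text looks like the `sample` string, which is an invented example of the format rather than scraped data. `parse_detail` is a hypothetical helper, and rows without points or comments yield `None`.

```python
import re


def parse_detail(detail):
    """Extract points and comment count from a detail string; missing values become None."""
    points = re.search(r"(\d+)\s+points?", detail)
    comments = re.search(r"(\d+)\s+comments?", detail)
    return {
        "points": int(points.group(1)) if points else None,
        "comments": int(comments.group(1)) if comments else None,
    }


# Assumed format of one detail string (invented for illustration)
sample = "302 points by someuser 5 hours ago | hide | 120 comments"
print(parse_detail(sample))  # {'points': 302, 'comments': 120}
```

Applied to the scraped data, `df['detail'].map(parse_detail)` would give one such dictionary per row.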