<h2 style="text-align: center;">
    SELENIUM TUTORIAL
</h2>

Selenium is a browser automation framework, used here through its Python bindings. It lets you programmatically navigate websites, fill out forms, click buttons, scrape data, and simulate user behavior such as scrolling or uploading files. This notebook walks through eight progressively more complex use cases that apply Selenium to real-world business and data analytics scenarios. It starts with basic tasks such as taking a screenshot and logging into a website, then builds up to infinite scrolling, multi-page scraping, and a full automation workflow that includes data cleaning and visualization.

### Library Imports

This notebook uses a combination of libraries to perform web automation, data extraction, and analysis. Selenium and webdriver_manager are used to automate browser interactions and scrape content from dynamic websites. pandas structures the extracted data, while re and nltk (with stopwords) help clean and process natural language text. collections.Counter is used to count word frequencies. For visualization, matplotlib generates bar charts and manages axis formatting, and wordcloud creates visual representations of word usage. The os and time modules support file handling and script timing. Together, these tools enable an end-to-end workflow for web scraping and content analysis.

In [50]:
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import nltk
from nltk.corpus import stopwords
import os
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
import time
from webdriver_manager.chrome import ChromeDriverManager
from wordcloud import WordCloud

### Use Case 1: Simple Screenshot

This use case demonstrates how to capture a webpage screenshot with Selenium. The script launches a Chrome browser, navigates to the official Selenium website, waits briefly for the page to load, and saves a screenshot of the visible browser window to the Screenshots folder. It introduces basic Selenium commands: launching the browser, navigating to a URL, and calling save_screenshot() for visual documentation or testing. The script uses a simple time.sleep() to wait for the page; later use cases rely on explicit waits, which are more robust.

In [8]:
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open a webpage
driver.get("https://www.selenium.dev/")

# Wait for page to load (better to use WebDriverWait in later use cases)
time.sleep(2)

# Make sure the destination folder exists, then take a screenshot
os.makedirs("Screenshots", exist_ok=True)
driver.save_screenshot("Screenshots/1 - Webpage Screenshot.png")

# Display success message
print('✅ Success! A screenshot of the webpage has been saved in the "Screenshots" folder.')

# Close browser
driver.quit()

✅ Success! A screenshot of the webpage has been saved in the "Screenshots" folder.
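
The screenshot step can be wrapped in a small helper. The sketch below is only an illustration: `save_screenshot_to` is a hypothetical name, not part of Selenium, and it assumes the driver's `save_screenshot()` returns a truthy value on success, as the Python bindings do. It creates the target folder first so the save does not fail when the folder is missing.

```python
import os


def save_screenshot_to(driver, path):
    """Create the parent folder if needed, then save a screenshot of the viewport.

    `driver` is any object exposing save_screenshot(path), such as a Selenium
    WebDriver. Returns True when the driver reports that the file was written.
    """
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    return bool(driver.save_screenshot(path))
```

With a live session this would be called as `save_screenshot_to(driver, "Screenshots/1 - Webpage Screenshot.png")`, and the return value can be checked before printing a success message.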


### Use Case 2: Account Login

This use case demonstrates how to automate the process of logging into a website using Selenium. The script navigates to a sample login page, waits for the login form to load, enters a predefined username and password, submits the form, and waits for confirmation of a successful login. It concludes by capturing a screenshot of the post-login page. This example introduces form interaction, button clicks, and the use of explicit waits to ensure that elements are fully loaded before interacting with them.

In [12]:
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Navigate to the login page
driver.get("https://the-internet.herokuapp.com/login")

# Wait for the login form to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "username")))

# Fill in username and password
driver.find_element(By.ID, "username").send_keys("tomsmith")
driver.find_element(By.ID, "password").send_keys("SuperSecretPassword!")

# Click the login button
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# Wait for the login to complete and the success message to appear
# (an element with both classes "flash" and "success" needs a CSS selector)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".flash.success")))

# Save screenshot
driver.save_screenshot("Screenshots/2 - Account Login.png")

# Display success message
print('✅ Success! A screenshot of the webpage after logging in has been saved in the "Screenshots" folder.')

# Close the browser
driver.quit()

✅ Success! A screenshot of the webpage after logging in has been saved in the "Screenshots" folder.
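
An explicit wait is essentially a polling loop: it keeps checking a condition until the condition holds or a timeout expires. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until` is a hypothetical helper. `WebDriverWait(driver, 10).until(...)` behaves along similar lines, polling every 0.5 seconds by default and raising a `TimeoutException` when time runs out.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll)
```

Unlike a fixed `time.sleep()`, this returns as soon as the condition is met, so a fast page does not pay the full waiting time.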


### Use Case 3: Web Form

This use case demonstrates how to automate filling out and submitting a web form using Selenium. The script inputs text into various fields, selects options from dropdown menus, checks and unchecks checkboxes and radio buttons, picks a color and date, adjusts a range slider using JavaScript, and uploads a local PDF file. It captures screenshots both before and after form submission for verification. This example highlights Selenium’s ability to handle different types of form elements and introduces basic DOM interaction, condition checking, and JavaScript execution for complex controls.

In [22]:
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open the form page
driver.get("https://www.selenium.dev/selenium/web/web-form.html")

# Wait for the form to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "form")))

# Fill out text fields
driver.find_element(By.NAME, "my-text").send_keys("Luke Tam")
driver.find_element(By.NAME, "my-password").send_keys("TestPassword123!")
driver.find_element(By.NAME, "my-textarea").send_keys("This is a test message.")
driver.find_element(By.NAME, "my-datalist").send_keys("Boston")

# Select from dropdown
select = Select(driver.find_element(By.NAME, "my-select"))
select.select_by_visible_text("Two")

# Select a file
file_path = os.path.abspath("Data/3 - Mid-Term Instructions.pdf")
driver.find_element(By.NAME, "my-file").send_keys(file_path)

# Unselect a checkbox
checked_checkbox = driver.find_element(By.ID, "my-check-1")
if checked_checkbox.is_selected():
    checked_checkbox.click()

# Select a checkbox
default_checkbox = driver.find_element(By.ID, "my-check-2")
if not default_checkbox.is_selected():
    default_checkbox.click()

# Select a radio button
default_radio = driver.find_element(By.ID, "my-radio-2")
if not default_radio.is_selected():
    default_radio.click()

# Select a color
driver.find_element(By.NAME, "my-colors").send_keys("#008000")

# Select a date
driver.find_element(By.NAME, "my-date").send_keys("04/01/2025")
driver.find_element(By.NAME, "my-date").send_keys(Keys.TAB)  # hide the calendar popup

# Set slider value using JavaScript
slider = driver.find_element(By.NAME, "my-range")
driver.execute_script("arguments[0].value = arguments[1];", slider, "8")

# Take a screenshot of the filled form
driver.save_screenshot("Screenshots/3 - Web Form Completed.png")

# Submit the form
driver.find_element(By.TAG_NAME, "button").click()

# Take a screenshot of the confirmation page
time.sleep(2)
driver.save_screenshot("Screenshots/3 - Web Form Confirmation.png")

# Display success message
print('✅ Success! Screenshots of the completed web form and the confirmation page have been saved in the "Screenshots" folder.')

# Close the browser
driver.quit()

✅ Success! Screenshots of the completed web form and the confirmation page have been saved in the "Screenshots" folder.
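
Setting a value through JavaScript does not fire the `input` and `change` events that a real user interaction would, so a page that listens for those events may not notice the new slider position. A minimal sketch of one way to handle this follows. `set_input_value` and `SET_VALUE_SCRIPT` are hypothetical names, and whether a given form needs the events depends on the page.

```python
# JavaScript that sets the value and then fires the events a user would trigger
SET_VALUE_SCRIPT = """
const el = arguments[0];
el.value = arguments[1];
el.dispatchEvent(new Event('input', { bubbles: true }));
el.dispatchEvent(new Event('change', { bubbles: true }));
return el.value;
"""


def set_input_value(driver, element, value):
    """Set an input's value via JavaScript and return the value the page reports.

    `driver` is any object exposing execute_script(script, *args), such as a
    Selenium WebDriver; `element` is the WebElement to update.
    """
    return driver.execute_script(SET_VALUE_SCRIPT, element, str(value))
```

In the form script above, this would be called as `set_input_value(driver, slider, 8)`.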


### Use Case 4: Web Scraping - Basic

This use case introduces basic web scraping with Selenium by extracting news headlines from the NPR News website. The script launches a browser, navigates to the NPR News section, waits for the page content to load, and collects the text of the first 10 headline links using CSS selectors. It demonstrates fundamental scraping techniques such as locating elements by tag and class, handling dynamic page loading with explicit waits, and iterating over results to extract and display clean text output.

In [25]:
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Load the NPR News section
driver.get("https://www.npr.org/sections/news/")
driver.maximize_window()

# Wait until headline links are loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "h2.title a")))

# Grab all headline elements
headline_elements = driver.find_elements(By.CSS_SELECTOR, "h2.title a")

# Extract and print their text content
print("Top 10 Headlines on NPR News:\n")
for i, element in enumerate(headline_elements[:10], start=1):
    print(f"{i}. {element.text}")

# Close the browser
driver.quit()

Top 10 Headlines on NPR News:

1. Why gold prices are surging to record highs
2. 2 mothers bring the House to a halt over push to allow proxy voting for new parents
3. Top scientists warn that Trump policies are causing a 'climate of fear' in research
4. Trump administration admits Maryland man sent to El Salvador prison by mistake
5. Widespread firings start at federal health agencies including many in leadership
6. What kind of support is the U.S. offering in the wake of the Myanmar quake?
7. Thyme for some healing soup recipes from around the world
8. Cory Booker's anti-Trump speech on the Senate floor has lasted 19 hours and counting
9. Caregiving can test you, body and soul. It can also unlock a new sense of self
10. Crumbling trust in American institutions: A MAHA activist takes on Girl Scout cookies
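
Scraped link text can contain blanks (for example, links that are hidden or wrap only an image) and repeats. The helper below is a small sketch of a cleanup pass that could run on the extracted strings before printing. `clean_headlines` is a hypothetical name, and it works on plain strings, so it would be fed something like `[e.text for e in headline_elements]`.

```python
def clean_headlines(texts, limit=10):
    """Return up to `limit` unique, non-empty headlines in their original order."""
    seen = set()
    cleaned = []
    for text in texts:
        # Collapse runs of whitespace and trim the ends
        headline = " ".join(text.split())
        if not headline or headline in seen:
            continue
        seen.add(headline)
        cleaned.append(headline)
        if len(cleaned) == limit:
            break
    return cleaned
```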


### Use Case 5: Web Scraping - Intermediate

This use case demonstrates intermediate-level web scraping using Selenium by extracting structured product data from OpenFoodFacts.org. The script loads the main product listing page, waits for the content to finish loading, and collects the name, URL, and nutritional labels (Nutri-Score, NOVA group, and Green-Score) for the first 10 food products. It showcases how to work with nested HTML elements, extract image metadata for additional context, and apply conditional logic to parse available product information. This example strengthens your understanding of navigating real-world HTML structures and processing mixed content types.

In [29]:
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open OpenFoodFacts.org
driver.get("https://world.openfoodfacts.org/")

# Wait until products finish loading
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#products_match_all li")))

# Get list of products
products = driver.find_elements(By.CSS_SELECTOR, "#products_match_all li")

# Scrape first 10 items
print("First 10 Products on OpenFoodFacts.org:\n")
for i, item in enumerate(products[:10], start=1):

    # Get name and link
    link_tag = item.find_element(By.TAG_NAME, "a")
    name = link_tag.text.strip()
    link = link_tag.get_attribute("href")

    # Get all images and parse score info
    images = item.find_elements(By.TAG_NAME, "img")
    nutri_score = nova = green_score = "N/A"
    for img in images:
        src = img.get_attribute("src")
        title = img.get_attribute("title")
        if src:
            if "nutriscore" in src:
                nutri_score = (title.replace("Nutri-Score ", "")
                               .replace("unknown", "Unknown"))
            elif "nova-group" in src:
                nova = (title.replace("NOVA ", "")
                        .replace("not computed", "Not computed"))
            elif "green-score" in src:
                green_score = (title.replace("Green-Score ", "")
                               .replace("not applicable", "Not applicable")
                               .replace("not computed", "Not computed"))

    # Display product info
    print(f"{i}. {name}")
    print(f"   Link: {link}")
    print(f"   Nutri-Score: {nutri_score}")
    print(f"   NOVA Group: {nova}")
    print(f"   Green-Score: {green_score}")
    print()

# Close the browser
driver.quit()

First 10 Products on OpenFoodFacts.org:

1. Sidi Ali - 33 cl
   Link: https://world.openfoodfacts.org/product/6111035000430/sidi-ali
   Nutri-Score: A - Very good nutritional quality
   NOVA Group: Unprocessed or minimally processed foods
   Green-Score: Not applicable - Not yet applicable for the category: Waters

2. Sample Product - Jaouda - 80.0
   Link: https://world.openfoodfacts.org/product/6111242100992/sample-product-jaouda
   Nutri-Score: Unknown - Missing data to compute the Nutri-Score
   NOVA Group: Processed foods
   Green-Score: B - Low environmental impact

3. Sidi Ali - 2 L
   Link: https://world.openfoodfacts.org/product/6111035002175/sidi-ali
   Nutri-Score: A - Very good nutritional quality
   NOVA Group: Not computed - Food processing level unknown
   Green-Score: Not applicable - Not yet applicable for the category: Waters

4. Sidi Ali mineral water - 1,5 L
   Link: https://world.openfoodfacts.org/product/6111035000058/sidi-ali-mineral-water
   Nutri-Score: A - Ver
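
The score parsing above calls `title.replace(...)` directly, which raises an error if an image has no `title` attribute, because `get_attribute()` then returns `None`. A more defensive version of that step might look like the sketch below. `parse_label` is a hypothetical helper, and the sample title strings used with it are assumptions inferred from the printed output rather than taken from the site.

```python
def parse_label(title, prefix):
    """Strip a leading label such as "Nutri-Score" and capitalize the remainder.

    Returns "N/A" when the title is missing or empty.
    """
    if not title:
        return "N/A"
    text = title.strip()
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    if not text:
        return "N/A"
    # Capitalize only the first character; leave the rest untouched
    return text[:1].upper() + text[1:]
```

For example, `parse_label("Nutri-Score unknown - Missing data to compute the Nutri-Score", "Nutri-Score")` gives `"Unknown - Missing data to compute the Nutri-Score"`, and a missing title gives `"N/A"`.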

### Use Case 6: Web Scraping - Intermediate with Infinite Scrolling

This use case demonstrates how to use Selenium to scrape dynamically loaded content from a webpage with infinite scrolling. The script navigates to the Quotes to Scrape site, scrolls to the bottom repeatedly until all content is loaded, and then extracts the text, author, and tags of each quote. This example introduces a common scrolling automation pattern using JavaScript and shows how to detect when no new content is being added. It's a practical example for handling social feeds, product catalogs, or any website that loads additional items as the user scrolls.

In [33]:
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open the quotes page
driver.get("https://quotes.toscrape.com/scroll")

# Wait for initial content
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

# Scroll loop
SCROLL_PAUSE_TIME = 1
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll to bottom of page and wait briefly for new content to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # No more new content loaded
    last_height = new_height

# Get all quotes
quotes = driver.find_elements(By.CLASS_NAME, "quote")

print(f"Total quotes loaded: {len(quotes)}\n")

# Display first 10 quotes
for i, quote in enumerate(quotes[:10], start=1):
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    tags = [tag.text for tag in quote.find_elements(By.CLASS_NAME, "tag")]
    print(f"{i}. {text}")
    print(f"   — {author}")
    print(f"   Tags: {', '.join(tags)}\n")

# Close the browser
driver.quit()

Total quotes loaded: 100

1. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
   — Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. “It is our choices, Harry, that show what we truly are, far more than our abilities.”
   — J.K. Rowling
   Tags: abilities, choices

3. “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
   — Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

4. “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
   — Jane Austen
   Tags: aliteracy, books, classic, humor

5. “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
   — Marilyn Monroe
   Tags: be-yourself, inspirational

6. “Try not to become a man of success. Rather become a man of value.”
   — Albert Einstein
   Tag
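
The `while True` loop above stops only when the page height stops growing, so it would never finish on a feed that keeps loading content indefinitely. The sketch below is a variant with an upper bound on the number of scrolls. `scroll_to_end` and its `max_rounds` parameter are hypothetical, and the default values are arbitrary.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    `driver` is any object exposing execute_script(script), such as a Selenium
    WebDriver. Returns the number of scrolls performed, capped at `max_rounds`.
    """
    height_script = "return document.body.scrollHeight"
    last_height = driver.execute_script(height_script)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new items
        new_height = driver.execute_script(height_script)
        if new_height == last_height:
            return rounds  # nothing new was loaded
        last_height = new_height
    return max_rounds
```

A single unchanged height can also mean the content was slow to arrive, so a longer `pause` may be needed on slow connections.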

### Use Case 7: Web Scraping - Advanced

This use case demonstrates advanced web scraping techniques by collecting job postings from Indeed using a live Chrome session with remote debugging. The script navigates through the first three pages of job search results for Business Analyst positions in Boston and extracts detailed information for the first two jobs on each page. For each job, it captures the job title, company, location, URL, and full job description, then saves the structured data to an Excel file. This example highlights multi-page scraping, browser tab management, dynamic content handling with explicit waits, and the use of remote browser control to bypass bot detection and CAPTCHA restrictions.

In [39]:
# ============================== #
# INSTRUCTIONS BEFORE RUNNING THIS SCRIPT
# ============================== #
#
# STEP 1: Manually start Google Chrome with remote debugging enabled.
#         This allows Selenium to attach to your existing browser session.
#
# WINDOWS:
# Open Command Prompt and run (include all quotes):
# "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\chrome_temp"
#
# macOS:
# Open Terminal and run (include all quotes):
# /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir="/tmp/chrome_temp"
#
# STEP 2: Run the following Python script.

# Configure Chrome to attach to the existing session
options = Options()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

# Connect to the existing Chrome session
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Open the Indeed job search page (adjust filters in URL as needed)
# Job title: Business Analyst
# Location: Boston, MA
# Date posted: Last 7 days
base_url = "https://www.indeed.com/jobs?q=business+analyst&l=Boston%2C+MA&fromage=7&start={}"
results = []

# Loop through first 3 pages (0, 10, 20)
for page in range(0, 30, 10):
    driver.get(base_url.format(page))
    driver.refresh()
    time.sleep(3)

    # Wait for all job cards on the page to load
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "job_seen_beacon")))

    # Get all job cards on the page
    job_cards = driver.find_elements(By.CLASS_NAME, "job_seen_beacon")

    # Loop through first 2 jobs on the page
    for card in job_cards[:2]:
        try:
            # Extract job title, company, and location
            title = card.find_element(By.TAG_NAME, "h2").text.strip()
            company = card.find_element(By.CSS_SELECTOR, 'span[data-testid="company-name"]').text.strip()
            location = card.find_element(By.CSS_SELECTOR, 'div[data-testid="text-location"]').text.strip()

            # Build full job URL to get job description
            link = card.find_element(By.TAG_NAME, "a")
            href = link.get_attribute("href")
            job_url = href if href.startswith("http") else "https://www.indeed.com" + href

            # Open job details page in new tab and wait for page to load
            driver.execute_script("window.open(arguments[0]);", job_url)
            driver.switch_to.window(driver.window_handles[1])
            time.sleep(3)

            # Extract job description
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "jobDescriptionText")))
            description = driver.find_element(By.ID, "jobDescriptionText").text.strip()

            # Save results as list of dictionaries
            results.append({
                "Title": title,
                "Company": company,
                "Location": location,
                "URL": job_url,
                "Description": description
            })

            # Close job details tab and switch back to search window
            driver.close()
            driver.switch_to.window(driver.window_handles[0])

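        except Exception as error:
            # Report the failure and close a leftover details tab before moving on
            print(f"❌ Skipping job card: {type(error).__name__}")
            if len(driver.window_handles) > 1:
                driver.switch_to.window(driver.window_handles[1])
                driver.close()
                driver.switch_to.window(driver.window_handles[0])
            continue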

# Make sure the destination folder exists
os.makedirs("Data", exist_ok=True)

# Convert list of job dictionaries to DataFrame
df = pd.DataFrame(results)

# Save to Excel
excel_filename = "Data/7 - Indeed Job Postings.xlsx"
df.to_excel(excel_filename, index=False)

# Display success message
print('✅ Success! The job details have been saved in the "Data" folder.')

# Manually close the Chrome window

✅ Success! The job details have been saved in the "Data" folder.
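
The search URL above is written out by hand with the filters already encoded. As a sketch, the same URLs can be built from readable values with the standard library, which makes the filters easier to change. `build_search_urls` is a hypothetical helper; the parameter names (`q`, `l`, `fromage`, `start`) and the page size of 10 are taken from the URL and loop used in the script.

```python
from urllib.parse import urlencode


def build_search_urls(query, location, days=7, pages=3, page_size=10):
    """Return one search URL per results page, with all filters URL-encoded."""
    base = "https://www.indeed.com/jobs"
    urls = []
    for page in range(pages):
        params = {
            "q": query,                  # job title keywords
            "l": location,               # city and state
            "fromage": days,             # posted within this many days
            "start": page * page_size,   # offset of the first result
        }
        urls.append(f"{base}?{urlencode(params)}")
    return urls
```

Calling `build_search_urls("business analyst", "Boston, MA")` produces the three page URLs that the loop above visits.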


### Use Case 8: End-to-End Automation

This use case presents a complete end-to-end automation pipeline using Selenium, pandas, matplotlib, and wordcloud. The script scrapes the first three news articles from NPR, extracts and cleans the full article text, identifies the top 10 most frequent non-stopwords for each article, and saves the results to a single Excel file. It also generates a bar chart and word cloud for each article, visualizing the most common words based on the full article content. This use case showcases the integration of web scraping, text processing, data visualization, and file output into a cohesive and reproducible automation workflow.

In [52]:
# one-time download of NLTK stopwords
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Create output folders if they do not already exist
os.makedirs("Data", exist_ok=True)
os.makedirs("Charts", exist_ok=True)

# ---------------------------
# Step 1: Open NPR News and get first 3 article links
# ---------------------------
# Launch browser (Chrome)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open NPR News and wait for headlines to load
driver.get("https://www.npr.org/sections/news/")
time.sleep(3)

# Get the first 3 article links
headline_elements = driver.find_elements(By.CSS_SELECTOR, 'h2.title a')
articles = []
seen = set()
all_data = []

for elem in headline_elements:
    title = elem.text.strip()
    url = elem.get_attribute("href")

    if title and url and url not in seen:
        articles.append((title, url))
        seen.add(url)

    if len(articles) == 3:
        break

# ---------------------------
# Step 2: Process each article
# ---------------------------
for i, (title, url) in enumerate(articles, start=1):
    print(f"🔍 Processing article {i}: {title}")
    driver.get(url)
    time.sleep(3)

    # Extract article paragraphs (find_elements returns an empty list when nothing matches)
    paragraphs = driver.find_elements(By.CSS_SELECTOR, 'div.storytext p')
    article_text = " ".join([p.text.strip() for p in paragraphs])
    if not article_text.strip():
        print("❌ Failed to extract article body.")
        continue

    # Clean and tokenize article text
    text = article_text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    words = text.split()
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]

    # Count top 10 words
    word_freq = Counter(filtered_words)
    top_words = word_freq.most_common(10)

    # Store article details and word frequencies
    for word, count in top_words:
        all_data.append({
            "Article #": i,
            "Headline": title,
            "URL": url,
            "Word": word,
            "Frequency": count
        })

    # ---------------------------
    # Step 3: Create bar chart
    # ---------------------------
    words, counts = zip(*top_words)
    plt.figure(figsize=(8, 5))
    plt.bar(words, counts, color="green")
    plt.title(f"Top Words in Article {i}")
    plt.xticks(rotation=45)
    plt.gca().yaxis.set_major_locator(ticker.MaxNLocator(integer=True))
    plt.tight_layout()
    bar_filename = f"Charts/8 - Article {i} Top 10 Words.png"
    plt.savefig(bar_filename)
    plt.close()

    # ---------------------------
    # Step 4: Create word cloud
    # ---------------------------
    wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(filtered_words))
    wc_filename = f"Charts/8 - Article {i} Word Cloud.png"
    wc.to_file(wc_filename)

# ---------------------------
# Step 5: Save all collected data to one Excel file
# ---------------------------
df_all = pd.DataFrame(all_data)
excel_filename = "Data/8 - NPR News Top 10 Words.xlsx"
df_all.to_excel(excel_filename, index=False)

# Display success message
print("✅ Success! Headlines scraped. Excel saved. Bar charts and word clouds generated!")

# Close the browser
driver.quit()

🔍 Processing article 1: Say goodbye to chain crews: The NFL will use camera technology to measure 1st downs
🔍 Processing article 2: China's Global Electric Vehicle Boom
🔍 Processing article 3: Why gold prices are surging to record highs
✅ Success! Headlines scraped. Excel saved. Bar charts and word clouds generated!
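
The cleaning and counting step can be tried on its own, without a browser. The sketch below repeats the same logic as the script (lowercase, strip non-letters, drop stopwords and words of two letters or fewer, count) on a plain string. `top_words` is a hypothetical helper, and `SAMPLE_STOP_WORDS` is a tiny stand-in for the NLTK English stopword list so the example runs without the NLTK download.

```python
import re
from collections import Counter

# Small stand-in for stopwords.words('english'); an assumption for this example
SAMPLE_STOP_WORDS = {"the", "and", "for", "are", "that", "with", "has", "have", "this", "was"}


def top_words(text, stop_words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    words = [w for w in cleaned.split() if w not in stop_words and len(w) > 2]
    return Counter(words).most_common(n)


sample = "Gold prices are surging. The gold rush has investors buying gold, and prices keep rising!"
print(top_words(sample, SAMPLE_STOP_WORDS, n=2))  # [('gold', 3), ('prices', 2)]
```

An empty string yields an empty list, which is why the chart step needs some text to work with.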
