# Web Scraping with Selenium

Traditional scraping libraries like BeautifulSoup work well for static HTML, but they fall short on dynamic sites where content loads in response to user interactions (e.g., scrolling, clicking filters). Selenium bridges this gap by automating a real web browser and simulating human actions such as navigating pages, clicking buttons, and waiting for elements to load. In this tutorial, we scrape the website "Ground News" using Selenium from Python, with a few lines of JavaScript where Selenium's own click is not enough.

Ground News is a platform that aggregates news articles from thousands of sources and rates them for bias (left, center, right) and factuality. This makes it a useful case study for examining media polarization, running sentiment analysis, visualizing the distribution of bias, and more.

We aim to collect information on news articles related to Artificial Intelligence, including each article's title, its link, and the bias information shown on its page. This data will be organized and saved to a CSV file for easy analysis and future reference.

# What You'll Learn in This Notebook

This notebook will walk you through:

- Setting up Selenium with a web driver (e.g., ChromeDriver).
- Navigating to Ground News and interacting with its dynamic elements (e.g., searching for topics, expanding articles).
- Extracting data such as article titles, sources, and bias ratings.
- Handling common challenges like waiting for page loads and dealing with pop-ups.

By the end, you'll have a foundational script that you can adapt for your own digital humanities (DH) projects, whether you are analyzing news trends, social media, or archival sites.

In [None]:
# Import webdriver for controlling webpages
from selenium import webdriver

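# Download a ChromeDriver that matches the installed Chrome, if one is not already available
# (passing True stores it in the current working directory)
from chromedriver_autoinstaller import install
install(True)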

# Import By class to identify and select elements on the webpage
from selenium.webdriver.common.by import By

# Options are the settings you can give to the browser
from selenium.webdriver.chrome.options import Options

# Tools for waiting until certain conditions are met on the page (e.g., element becomes clickable)
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Import time to manually pause execution when needed
import time

# Import pandas to structure and store data
import pandas as pd

In [None]:
# Initialize Settings
opts = Options()

# Optional: open the browser without showing it on your screen
# opts.add_argument("--headless")

# Open the browser
driver = webdriver.Chrome(options=opts)

# Open the website
driver.get("https://ground.news/")

# Print the webpage's current title
driver.title

In [None]:
# Import common errors raised when the element you are searching for is not found or something blocks it
# Only needed if you refer to them directly (we will do it later on in the code)
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import ElementNotInteractableException

# Set an implicit wait: element-searching methods (e.g., driver.find_element) keep retrying for up to 0.5 seconds before giving up
driver.implicitly_wait(0.5)

# Identify the onboarding button and click on it, if present, to proceed to the home page
try:
    # Use CSS_SELECTOR with attribute=value
    proceed_to_webpage_button = driver.find_element(By.CSS_SELECTOR, "[data-testid='onboarding-close-button']")

    # Click on the button
    proceed_to_webpage_button.click()

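# Proceed if the button is absent or cannot be clicked (e.g., you are already on the home page)
except (NoSuchElementException, ElementClickInterceptedException, ElementNotInteractableException):
    pass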
    
# List all the links in the navigation bar
# This line combines CSS Selectors: find any element with class "embla__slide" and then find all <a> elements inside it
links = driver.find_elements(By.CSS_SELECTOR, ".embla__slide a")

# If any link was detected using the above selectors:
if links:

    # Print the number of links found
    print(f"Found {len(links)} nav links:")

    # For each collected link:
    for link in links:

        # Print the link's text and the url
        print(link.text, link.get_attribute("href"))



We successfully collected all the links in the website’s navigation bar using Selenium, though this was purely HTML extraction, which we could have also accomplished with BeautifulSoup. However, our next goal is to gather all articles related to Artificial Intelligence, which requires interacting with the page—specifically, clicking on the relevant link. Selenium is particularly useful here because it allows us to simulate user actions, such as clicking buttons or links, which is not possible with BeautifulSoup alone.

Heads-up: Selenium's `element.click()` imitates a real user, so it may fail when the element cannot be reached the way a user would reach it, for example when it sits inside a scrollable area or is covered by a pop-up, banner, or sticky header. In those cases you can fall back on JavaScript (JS), which can scroll the element into view and trigger the click directly in the page.
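
The fallback described above can be sketched as a small reusable helper. This is a minimal sketch and not part of the original notebook: `click_with_js_fallback` is a hypothetical name, and the `recoverable` argument is an assumption that lets the caller decide which exceptions trigger the JavaScript path (with Selenium, typically `ElementClickInterceptedException` and `ElementNotInteractableException`). The stand-in classes at the bottom only imitate a blocked element so that the example runs without a browser.

```python
def click_with_js_fallback(driver, element, recoverable):
    """Click `element` natively; on a recoverable error, click via JavaScript.

    Returns "native" or "javascript" so the caller can log which path ran.
    """
    try:
        element.click()
        return "native"
    except recoverable:
        # Centre the element in the viewport, then trigger the click in the page itself
        driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
        driver.execute_script("arguments[0].click();", element)
        return "javascript"


# --- Stand-ins for demonstration only (hypothetical, not Selenium classes) ---
class BlockedClick(Exception):
    """Imitates the error raised when an overlay covers the element."""


class FakeElement:
    def __init__(self, blocked):
        self.blocked = blocked

    def click(self):
        if self.blocked:
            raise BlockedClick("element is covered by another element")


class FakeDriver:
    def __init__(self):
        self.scripts = []

    def execute_script(self, script, *args):
        # Record the script instead of running it in a browser
        self.scripts.append(script)


demo_driver = FakeDriver()
free_path = click_with_js_fallback(demo_driver, FakeElement(blocked=False), (BlockedClick,))
blocked_path = click_with_js_fallback(demo_driver, FakeElement(blocked=True), (BlockedClick,))
print(free_path, blocked_path)
```

With a real driver, the call would look like `click_with_js_fallback(driver, topic, (ElementClickInterceptedException, ElementNotInteractableException))`. Keep in mind that a JavaScript click skips the checks a native click performs, so it can succeed on elements a human visitor could not actually reach.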

In [None]:
# Try to locate and click on "Artificial Intelligence" using Selenium's click() method 
try: 

    # Identify the topic to click on using ID
    topic = driver.find_element(By.ID, "header-trending-Artificial Intelligence")
    
    # Click on the button
    topic.click()

# If the native click is blocked, use JavaScript to scroll the topic into view and click on it
except (ElementClickInterceptedException, ElementNotInteractableException):

    # Re-select the element, as the previous reference may no longer be valid
    topic = driver.find_element(By.ID, "header-trending-Artificial Intelligence")
    
    # Scroll the element into view using JavaScript
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", topic)

    # Click on the element using JavaScript
    driver.execute_script("arguments[0].click();", topic)

# Wait up to 10 seconds for the new url to load
wait = WebDriverWait(driver, 10)
wait.until(EC.url_contains("/interest/ai"))

# Check that you are on the correct url now
print(driver.current_url) 



## It worked! You are on the correct webpage.

We can now examine the current webpage to identify and collect the articles displayed, either in full or partially, so that we can save them for further analysis.

The page includes a “More Stories” button that loads additional articles. To capture all available stories, this button may need to be clicked (sometimes multiple times) until no new articles are loaded.
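
One way to make "until no new articles are loaded" explicit is to compare the article count before and after each click and to cap the number of clicks. The following is a minimal sketch under stated assumptions: `load_all` is a hypothetical helper that takes two caller-supplied callables, `click_more()` (clicks the button and returns `False` once the button is gone) and `count_articles()` (returns how many articles are currently on the page). The simulated page at the bottom stands in for the browser.

```python
import time


def load_all(click_more, count_articles, max_clicks=50, pause=0.0):
    """Click "more" until the button disappears, the count stops growing,
    or `max_clicks` is reached. Returns the final article count."""
    seen = count_articles()
    for _ in range(max_clicks):
        if not click_more():
            break  # button no longer present
        time.sleep(pause)  # give the page time to render new articles
        current = count_articles()
        if current <= seen:
            break  # the click loaded nothing new
        seen = current
    return seen


# --- Simulated page for demonstration only ---
class FakePage:
    def __init__(self, total, batch):
        self.total, self.batch = total, batch
        self.loaded = min(batch, total)

    def click_more(self):
        if self.loaded >= self.total:
            return False
        self.loaded = min(self.loaded + self.batch, self.total)
        return True

    def count(self):
        return self.loaded


page = FakePage(total=25, batch=10)
final_count = load_all(page.click_more, page.count)
print(final_count)
```

In this notebook, `count_articles` could wrap `len(driver.find_elements(...))` with the article selector, and `pause` could be set to a couple of seconds. The cap on clicks guards against an endless loop if the button never disappears.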

In [None]:
# While the button "more stories" is available, click on it to load all articles
while True:
    
    try:
        # Relocate the element after each iteration as the references become invalid once the webpage changes
        more_button = driver.find_element(By.CSS_SELECTOR, "[data-testid='load-more-stories-button']")

        # Click on the button
        more_button.click()

    # Fall back on a JavaScript click if the button is covered or not interactable
    except (ElementClickInterceptedException, ElementNotInteractableException):

        # Relocate the element
        more_button = driver.find_element(By.CSS_SELECTOR, "[data-testid='load-more-stories-button']")

        # Scroll into view
        driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", more_button)

        # Click on it
        driver.execute_script("arguments[0].click();", more_button)

    # Stop only when there is no "more stories" button to click on
    except NoSuchElementException:
        break    

    # Pause for two seconds to give the newly requested articles time to load
    time.sleep(2)


In [None]:
# Count how many articles are on the webpage
article_count = len(driver.find_elements(By.CSS_SELECTOR, "div .group"))

# Initialize a list to store articles' titles and links as dictionaries
data = []

# Loop through each article on the page by index
for i in range(article_count):

    # Re-locate the article on each iteration so the reference stays valid
    # The selector matches any element with class "group" nested inside a <div>
    article = driver.find_elements(By.CSS_SELECTOR, "div .group")[i]

    # Locate url with the <a> tag
    link = article.find_element(By.TAG_NAME, "a")

    # Extract the url address from the element with get_attribute
    href = link.get_attribute("href")

    # In this webpage, the title is usually stored in the tag <h3>
    try:
        title = article.find_element(By.TAG_NAME, "h3")

    # However, sometimes it is stored in <h4> instead (you can only know this by inspecting the webpage's HTML)
    except NoSuchElementException:
        title = article.find_element(By.TAG_NAME, "h4")

    # Get the title's text
    title_text = title.text

    data.append({
        "title": title_text,
        "link": href
    })

# Convert your data to a pandas dataframe
df = pd.DataFrame(data)

After storing the links to the individual articles, we can visit each one to extract additional information.

We'll gather data related to political coverage bias and we will add a third column in the dataframe to store it.

In [None]:
# Iterate over the dataframe with titles and links
for index, row in df.iterrows():

    # Extract the link of each article
    link = row['link']

    # Access the webpage programmatically
    driver.get(link)

    # Locate the bias information by combining CSS selectors and extract its text
    # innerText returns only the visible text; textContent also returns hidden text
    bias_text = "".join([el.get_attribute("innerText") or el.get_attribute("textContent") or "" for el in driver.find_elements(By.CSS_SELECTOR, "ul.list-disc li span")])
    
    # If bias information is found
    if bias_text:

        # Save to a third column in the df called "bias"
        df.loc[index, 'bias'] = bias_text

    # If the <span> elements yield no text
    else:

        # Fall back on the text of the enclosing <li> elements (e.g., a notice that bias information is missing)
        bias_text = "".join([el.get_attribute("innerText") or el.get_attribute("textContent") or "" for el in driver.find_elements(By.CSS_SELECTOR, "ul.list-disc li")])
        
        # Save it in the dataframe     
        df.loc[index, 'bias'] = bias_text
        
# Save the updated df to csv
df.to_csv('../data/ground_news_articles.csv', index=False, encoding='utf-8')

The web scraping process has completed successfully! Well done!

You can find the final csv file in the `data/` folder as `ground_news_articles.csv`

Open it and check that the columns contain the data you expect and that everything looks in good order.
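
A quick programmatic check can complement opening the file by hand. The sketch below uses Python's standard `csv` module; the function name `summarize_csv` is hypothetical, and the expected column names are the ones created in this notebook (`title`, `link`, `bias`). It first writes a tiny sample file with placeholder values so that it runs on its own; for the real file, pass the path used in `df.to_csv`.

```python
import csv
import os
import tempfile


def summarize_csv(path, expected_columns=("title", "link", "bias")):
    """Return the row count, any missing columns, and empty cells per column."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        rows = list(reader)
    missing = [c for c in expected_columns if c not in columns]
    empty = {
        c: sum(1 for r in rows if not (r.get(c) or "").strip())
        for c in expected_columns
        if c in columns
    }
    return {"rows": len(rows), "missing_columns": missing, "empty_cells": empty}


# --- Sample data for demonstration only (placeholders, not real scraped output) ---
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link", "bias"])
    writer.writerow(["Example headline", "https://example.org/a", "placeholder bias text"])
    writer.writerow(["Another headline", "https://example.org/b", ""])

summary = summarize_csv(sample_path)
print(summary)
```

A non-empty `missing_columns` list or a high count in `empty_cells` suggests that a selector no longer matches the page and is worth re-inspecting in the browser.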