## BBC News

1. Library installation and import. The installation line is commented out because the package was already installed.

In [8]:
#!pip install selenium
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

2. Retrieving HTML content

In [9]:
browser = webdriver.Firefox()
browser.get('https://www.bbc.com/news/world/europe')

3. Extracting articles. 
Each section page displays only a few "hot" articles at the top, while every article, including the "hot" ones, also appears in the Latest Updates section. For that reason the function extract_articles() below targets that section only. The selectors were identified manually with the browser's Inspect tool.

In [10]:

def extract_articles(driver):
    """Extracts articles from the current page using an existing Selenium WebDriver."""
    try:
        # Wait for the "Latest Updates" section to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='alaska-section']"))
        )

        # Find all article links inside "Latest Updates"
        articles = driver.find_elements(By.CSS_SELECTOR, "div[data-testid='alaska-section'] a[data-testid='internal-link']")

        extracted_articles = []
        for article in articles:
            try:
                headline = article.find_element(By.CSS_SELECTOR, "h2[data-testid='card-headline']").text
                summary = article.find_element(By.CSS_SELECTOR, "p[data-testid='card-description']").text
                link = article.get_attribute("href")
                time_updated = article.find_element(By.CSS_SELECTOR, "span[data-testid='card-metadata-lastupdated']").text

                extracted_articles.append({
                    "headline": headline,
                    "summary": summary,
                    "link": link,
                    "time": time_updated
                })
            except Exception as e:
                print(f"Error extracting article: {e}")

    except Exception as e:
        print(f"Failed to load articles: {e}")
        extracted_articles = []

    return extracted_articles  # Driver stays open for pagination (see subtask 4)

#Initialize WebDriver
driver = webdriver.Chrome()
driver.get("https://www.bbc.com/news/world/europe")  # Load the initial page

#Now call extract_articles with the existing driver
europe_articles = extract_articles(driver)

#Print extracted articles
for article in europe_articles:
    print(article)

#Quit WebDriver after scraping
driver.quit()

{'headline': 'The long, slow road to a ceasefire, with no guarantee of success', 'summary': 'The agreement comes after days of talks with the US in Saudi Arabia.', 'link': 'https://www.bbc.com/news/articles/cqjdnpjgj75o', 'time': '11 hrs ago'}
{'headline': 'Two French air display jets crash in rehearsal', 'summary': 'The three occupants of the two Alpha Jets that collided ejected and were "found alive and conscious".', 'link': 'https://www.bbc.com/news/videos/cpv4gxx2dwzo', 'time': '14 hrs ago'}
{'headline': 'Dáil adjourns as rowdy scenes erupt over speaking rights', 'summary': 'Rowdy scenes erupted while the Irish PM attempted to explain a new government proposal for speaking rights.', 'link': 'https://www.bbc.com/news/articles/cx20qyzgrzxo', 'time': '14 hrs ago'}
{'headline': "Grandparents arrested on suspicion of toddler's murder in French Alps", 'summary': 'Emile Soleil mysteriously vanished in 2023, in a case that made headlines across France.', 'link': 'https://www.bbc.com/news/a
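Some cards lack one of the optional elements (the Latin America run later in this notebook hits a missing `card-description`), and in that case `find_element` raises and the whole article is skipped. A minimal sketch of one way to keep such cards is given below. The helper `optional_text` and the `FakeElement` stand-in are hypothetical names introduced only for this illustration; the sketch relies on Selenium's `find_elements`, which returns an empty list rather than raising.

```python
CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def optional_text(element, selector):
    """Return the text of the first match for `selector`, or None if absent.

    find_elements() returns an empty list instead of raising, so a missing
    optional field does not abort extraction of the whole card.
    """
    matches = element.find_elements(CSS, selector)
    return matches[0].text if matches else None


class FakeElement:
    """Stand-in for a Selenium WebElement, used only to exercise the helper."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, selector):
        return self._children.get(selector, [])


# A card that has a headline but no summary paragraph
card = FakeElement(children={
    "h2[data-testid='card-headline']": [FakeElement("Example headline")],
})
headline = optional_text(card, "h2[data-testid='card-headline']")
summary = optional_text(card, "p[data-testid='card-description']")
print(headline, summary)
```

Inside extract_articles() the same helper could be called with the real `article` element for the summary and timestamp, so that only a missing headline would cause a card to be dropped.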

4. Scrape multiple pages.
The function extract_pages() below reads the number of pages from the pagination buttons, then steps through them with the next (">") button, calling extract_articles() (defined above) on each page to collect all articles. Selectors were identified manually. The function also handles the cookie popup.

In [11]:
def extract_pages(section_url):
    """Scrapes all paginated articles in the 'Latest Updates' section, handling dynamic loading.

    Relies on the module-level `driver`, which must be created before calling.
    """

    driver.get(section_url)
    all_articles = []

    try:
        # Handle Cookie Popup
        try:
            WebDriverWait(driver, 10).until(
                EC.frame_to_be_available_and_switch_to_it((By.ID, "sp_message_iframe_1192447"))
            )
            accept_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "//button[text()='I agree']"))
            )
            accept_button.click()
            print("Cookies accepted.")
            driver.switch_to.default_content()
            time.sleep(2)
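        except Exception:
            # Return to the main document in case the iframe was entered before the failure
            driver.switch_to.default_content()
            print("No cookie popup found.")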

        # Detect total pages from pagination buttons
        pagination_buttons = driver.find_elements(By.CSS_SELECTOR, "button.RZsRF.gBqyGL")
        total_pages = max([int(btn.text) for btn in pagination_buttons if btn.text.isdigit()], default=1)
        print(f"Total pages detected: {total_pages}")

        scraped_headlines = set()  # Track already scraped headlines

        for page in range(1, total_pages + 1):
            #print(f"- Scraping page {page}...")

            # Extract articles before clicking
            articles = extract_articles(driver)
            new_articles = [a for a in articles if a['headline'] not in scraped_headlines]
            #for article in articles[:5]:  # Display only first 5 articles
            #     print(article)

            #if not new_articles:
            #    print("No new articles found. Stopping pagination.")
            #    break  # Stop if the page didn't change

            # Add only new articles
            all_articles.extend(new_articles)
            scraped_headlines.update([a['headline'] for a in new_articles])

            time.sleep(1)

            # Click the "Next Page" button
            try:
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)
                #print(f"Clicking Next Page button (Page {page})")
                next_button = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, "button[data-testid='pagination-next-button']"))
                )
                

                driver.execute_script("document.querySelector('button[data-testid=\"pagination-next-button\"]').click();")

                time.sleep(1)  # Wait for new articles to load
            except Exception:
                print("No more pages to scrape.")
                break

    except Exception as e:
        print(f"Error: {e}")

    return all_articles

driver = webdriver.Chrome()

# Execution for the Europe section
europe_articles = extract_pages("https://www.bbc.com/news/world/europe")

# Print results
print(f"Total articles scraped: {len(europe_articles)}")
#for article in europe_articles:  
    #print(article)

# Quit this WebDriver so the browser window does not stay open
driver.quit()

Cookies accepted.
Total pages detected: 12
Total articles scraped: 100
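
extract_pages() skips cards whose headline was already collected. Two different articles could in principle share a headline, so keying on the URL may be a safer choice. The sketch below shows that variant; `dedupe_by_link` is a hypothetical helper and the example.com links are placeholder data, not scraped results.

```python
def dedupe_by_link(articles, seen_links=None):
    """Return articles whose 'link' has not been seen yet, preserving order.

    `seen_links` is updated in place so it can be shared across pages.
    """
    seen_links = set() if seen_links is None else seen_links
    fresh = []
    for article in articles:
        link = article["link"]
        if link not in seen_links:
            seen_links.add(link)
            fresh.append(article)
    return fresh


# Placeholder pages: page 2 repeats one article and adds a new one
# that happens to reuse the headline "A"
page_1 = [
    {"headline": "A", "link": "https://example.com/a"},
    {"headline": "B", "link": "https://example.com/b"},
]
page_2 = [
    {"headline": "B", "link": "https://example.com/b"},
    {"headline": "A", "link": "https://example.com/c"},
]

seen = set()
collected = dedupe_by_link(page_1, seen) + dedupe_by_link(page_2, seen)
print(len(collected))
```

With headline-based filtering the last placeholder article would be discarded; with link-based filtering it is kept.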


5. Expand the Scope. Here the extract_pages() function defined above is run for each region in the list below to collect all the required articles.

In [12]:
# List of BBC News Regions
region_urls = {
    "US & Canada": "https://www.bbc.com/news/us-canada",
    "UK": "https://www.bbc.com/news/uk",
    "Africa": "https://www.bbc.com/news/world/africa",
    "Asia": "https://www.bbc.com/news/world/asia",
    "Australia": "https://www.bbc.com/news/world/australia",
    "Europe": "https://www.bbc.com/news/world/europe",
    "Latin America": "https://www.bbc.com/news/world/latin_america",
    "Middle East": "https://www.bbc.com/news/world/middle_east",
}

# Initialize WebDriver
driver = webdriver.Chrome()

# Storage for All Articles
all_articles = []

# Loop Through Each Region and Scrape Articles
for region, url in region_urls.items():
    print(f"-- Scraping region: {region}")
    articles = extract_pages(url)  # Reuses extract_pages() defined above
    all_articles.extend(articles)
    print(f" {len(articles)} articles scraped from {region}")

# ✅ Close the WebDriver
driver.quit()

# ✅ Print Final Results
print(f"\nTotal articles scraped from all regions: {len(all_articles)}")
for article in all_articles[:10]:  # Display only first 10 articles
    print(article)

-- Scraping region: US & Canada
Cookies accepted.
Total pages detected: 12
 100 articles scraped from US & Canada
-- Scraping region: UK
No cookie popup found.
Total pages detected: 1
 9 articles scraped from UK
-- Scraping region: Africa
No cookie popup found.
Total pages detected: 12
 100 articles scraped from Africa
-- Scraping region: Asia
No cookie popup found.
Total pages detected: 12
 99 articles scraped from Asia
-- Scraping region: Australia
No cookie popup found.
Total pages detected: 12
 100 articles scraped from Australia
-- Scraping region: Europe
No cookie popup found.
Total pages detected: 12
 100 articles scraped from Europe
-- Scraping region: Latin America
No cookie popup found.
Total pages detected: 12
Error extracting article: Message: no such element: Unable to locate element: {"method":"css selector","selector":"p[data-testid='card-description']"}
  (Session info: chrome=134.0.6998.166); For documentation on this error, please visit: https://www.selenium.dev/docum
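
The dictionaries returned by extract_pages() hold the headline, summary, link, and time, but not the region, so once they are merged into `all_articles` the origin of each article is no longer recorded. If that information is wanted later, one possible approach is to tag each batch before extending the list. `tag_with_region` is a hypothetical helper and the sample article is placeholder data.

```python
def tag_with_region(articles, region):
    """Return copies of the article dicts with a 'region' key added."""
    return [{**article, "region": region} for article in articles]


# Placeholder batch standing in for the result of extract_pages(url)
europe = [{"headline": "Example A", "link": "https://example.com/a"}]
tagged = tag_with_region(europe, "Europe")
print(tagged[0]["region"])
```

In the region loop this would amount to extending `all_articles` with `tag_with_region(articles, region)` instead of `articles`.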

6. Saving results to csv file

In [13]:
import pandas as pd

df = pd.DataFrame(all_articles)

# Save the DataFrame to a CSV file
df.to_csv("scraped_articles.csv", index=False)  # index=False to avoid writing row indices

print("... Articles saved to scraped_articles.csv ...")

... Articles saved to scraped_articles.csv ...


## Part 3: Scraping Article Text

Load the saved list of articles back in to reuse its URLs.

In [14]:
# Load the scraped articles list
file_path = "scraped_articles.csv"
articles_df = pd.read_csv(file_path)

1. A few articles were inspected manually, and four selectors for the required attributes (headline, date, author, text) were identified.

2. Text Scraping Function. Using the identified selectors, the function retrieves the required attributes of an article from a given URL.

In [15]:
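def scrape_article(driver, url):
    """Loads `url` and returns a dict with the article's headline, date, author, and text.

    Fields that cannot be found are left as None (or an empty string for the text).
    """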
    driver.get(url)
    article_data = {"url": url, "headline": None, "date": None, "author": None, "text": None}
    excnt = 0

    try:
        #print(f"Extracting data from: {url}")

        # Extract Headline
        try:
            headline = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-component='headline-block'] h1"))
            ).text
            article_data["headline"] = headline
            #print(f"Headline: {headline}")
        except Exception:
            excnt = excnt + 1
            #print("Headline not found")

        # Extract Published Date
        try:
            date = driver.find_element(By.CSS_SELECTOR, "time.sc-b42e7a8f-2").text
            article_data["date"] = date
            #print(f"Published Date: {date}")
        except Exception:
            excnt = excnt + 1
            #print("Date not found")

        # Extract Author
        try:
            author = driver.find_element(By.CSS_SELECTOR, "span.sc-b42e7a8f-7").text
            article_data["author"] = author
            #print(f"Author: {author}")
        except Exception:
            excnt = excnt + 1
            #print("Author not found")

        # Extract Full Article Text
        try:
            paragraphs = driver.find_elements(By.CSS_SELECTOR, "div[data-component='text-block'] p")
            text = "\n".join([p.text for p in paragraphs])
            article_data["text"] = text
            #print(f"Extracted {len(paragraphs)} paragraphs")
        except Exception:
            excnt = excnt + 1
            #print("Article text not found")

    except Exception as e:
        print(f"Error scraping {url}: {e}")

    #print(f"Total exceptions: {excnt}")    

    return article_data
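
Class names such as `sc-b42e7a8f-2` look auto-generated and may change between site builds, so the date and author lookups could start failing without any visible error. One hedged way to soften this is to try a list of selectors in order. In the sketch below, `first_text`, `FakeNode`, and `FakeDriver` are hypothetical names, and the plain `time` fallback selector is an assumption rather than something verified against BBC pages.

```python
CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def first_text(driver, selectors):
    """Return the text of the first element matched by any selector, else None."""
    for selector in selectors:
        matches = driver.find_elements(CSS, selector)
        if matches and matches[0].text:
            return matches[0].text
    return None


class FakeNode:
    """Stand-in for a WebElement exposing only .text."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Stand-in for a WebDriver; maps selectors to lists of fake nodes."""

    def __init__(self, page):
        self._page = page

    def find_elements(self, by, selector):
        return self._page.get(selector, [])


# Simulated page where the generated class is gone but a <time> tag remains
driver_stub = FakeDriver({"time": [FakeNode("2 days ago")]})
date = first_text(driver_stub, ["time.sc-b42e7a8f-2", "time"])
print(date)
```

With a real WebDriver the same function would be called as `first_text(driver, [...])` inside scrape_article().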



3. Scrape all articles. The code below uses the function defined above to iterate through URLs and get the required attributes. Error handling is implemented in the function itself.

In [16]:

# Initialize WebDriver
driver = webdriver.Chrome()

# Track scraped articles
scraped_articles = []

# Loop through all article URLs
print("Starting article scraping...")
for i, url in enumerate(articles_df["link"], start=1):
    #print(f"\n Scraping article {i}: {url}")

    article_data = scrape_article(driver, url)
    scraped_articles.append(article_data)

    # Print success message per article
    #print(f"Finished scraping article {i}")

# Close WebDriver
driver.quit()

# Convert to DataFrame
scraped_articles_df = pd.DataFrame(scraped_articles)

# Print the number of successfully scraped articles
print(f"\n Successfully scraped {len(scraped_articles)} articles.")

Starting article scraping...

 Successfully scraped 707 articles.
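
The count printed above is the number of URLs visited, because scrape_article() always returns a dictionary even when individual fields could not be found. To see how complete the data is, the filled fields can be counted per attribute. The sketch below assumes records shaped like the output of scrape_article(); `field_coverage` is a hypothetical helper and the sample records are placeholder data.

```python
def field_coverage(records, fields):
    """Count, per field, how many records hold a non-empty value."""
    return {
        field: sum(1 for record in records if record.get(field))
        for field in fields
    }


# Placeholder records: the second one is missing its author and text
sample = [
    {"url": "https://example.com/a", "headline": "A", "date": "1 day ago",
     "author": "Jane Doe", "text": "Body text."},
    {"url": "https://example.com/b", "headline": "B", "date": "2 days ago",
     "author": None, "text": ""},
]
coverage = field_coverage(sample, ["headline", "date", "author", "text"])
print(coverage)
```

Applied to `scraped_articles`, this would show for how many of the visited URLs each attribute was actually extracted.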


4. Saving the results to a file

In [17]:
# Save the results to a CSV file with proper quoting
output_file = "articles_full.csv"
scraped_articles_df.to_csv(output_file, index=False, quoting=1)  # quoting=1 is csv.QUOTE_ALL: every field is wrapped in double quotes

print(f"Scraped articles saved to: {output_file}")

Scraped articles saved to: articles_full.csv
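
The value `quoting=1` corresponds to the constant `csv.QUOTE_ALL` from the standard library. Quoting matters here because the article text contains commas and newlines. The following sketch uses only the `csv` module and a placeholder row to illustrate that such a field survives a write and read round trip when every field is quoted.

```python
import csv
import io

# Placeholder row with a comma in the headline and a newline in the text
rows = [{
    "headline": "Example, with comma",
    "text": "First paragraph.\nSecond paragraph.",
}]

# Write with every field quoted, as quoting=1 does in to_csv()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["headline", "text"], quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)

# Read the data back and compare with the original rows
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(csv.QUOTE_ALL)
print(restored == rows)
```

Passing `quoting=csv.QUOTE_ALL` to `to_csv()` instead of the bare number `1` would make the intent easier to read.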


5. Discussion. Including the newly scraped data in the dataset is not straightforward because the articles carry no labels. Without knowing whether an article is "fake" or "reliable", the data cannot be used directly for supervised learning. If labeled properly, however, it could enlarge and diversify the dataset, potentially improving model generalization. The risk is that the new data introduces bias or imbalance if it does not match the original distribution. A statistical comparison of word frequency, article length, and source credibility between the original and scraped datasets could help determine compatibility. If the differences are minimal, incorporating the new data after labeling could be beneficial. Another option is to treat all articles coming from the BBC as reliable, which could help balance the dataset.
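
The comparison of word frequency and article length mentioned in the discussion could be sketched roughly as follows. This is an illustration under simplifying assumptions, not an analysis that was run: the tokenizer is a plain regular expression, the distance is the total variation distance between relative word frequencies, and the two tiny corpora are made-up placeholder texts. The names `tokenize`, `corpus_profile`, and `frequency_distance` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_profile(texts):
    """Return (mean article length in tokens, relative word frequencies)."""
    counts = Counter()
    lengths = []
    for text in texts:
        tokens = tokenize(text)
        lengths.append(len(tokens))
        counts.update(tokens)
    total = sum(counts.values())
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    freqs = {word: n / total for word, n in counts.items()} if total else {}
    return mean_length, freqs


def frequency_distance(freqs_a, freqs_b):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    vocab = set(freqs_a) | set(freqs_b)
    return 0.5 * sum(abs(freqs_a.get(w, 0.0) - freqs_b.get(w, 0.0)) for w in vocab)


# Made-up placeholder corpora standing in for the original and scraped datasets
original_texts = ["The minister said the vote passed.", "Markets fell sharply today."]
scraped_texts = ["The minister said the vote failed.", "Storms hit the coast overnight."]

orig_len, orig_freqs = corpus_profile(original_texts)
new_len, new_freqs = corpus_profile(scraped_texts)
print(orig_len, new_len)
print(frequency_distance(orig_freqs, new_freqs))
```

A small distance and similar mean lengths would suggest the two sources are stylistically close; what counts as "small" would have to be judged against the variation inside the original dataset.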

Final remark: many print() calls are commented out in the code to keep the PDF compact.