# Extracting News Articles with MechanicalSoup and BeautifulSoup

## Introduction

Web scraping is a powerful technique used to extract data from websites. Two popular Python libraries for web scraping are **MechanicalSoup** and **BeautifulSoup**. Each offers distinct advantages depending on the nature of the website and the complexity of the scraping task.

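- **MechanicalSoup**: This library is built on top of `requests` and `BeautifulSoup`. It automates interaction with web pages, such as following links, filling in and submitting forms, and keeping cookies across requests, which makes it useful for sites that require a login or a form submission. It does not execute JavaScript, so content rendered on the client side is out of its reach.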

- **BeautifulSoup**: This is a lightweight and flexible library used to parse HTML and XML documents. It provides an easy way to navigate, search, and modify the document tree, making it ideal for extracting structured data from static web pages. It does not fetch pages itself, so it is usually paired with an HTTP client such as `requests`.
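As a quick illustration of the navigate-and-search style described above, here is a minimal sketch that parses a small inline HTML string instead of a live page. The `SAMPLE_HTML` markup is a made-up stand-in for an article page, not something taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup, used here instead of a live page.
SAMPLE_HTML = """
<html><body>
  <h1> Example headline </h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match or None, so guard before calling get_text().
heading = soup.find("h1")
title = heading.get_text(strip=True) if heading else "No title"

# find_all() returns every match in document order.
content = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

print(title)
print(content)
```

The same `find` / `find_all` calls are what the scraping cells in this notebook rely on, whichever library fetched the page.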

In this notebook, we will explore how to use **MechanicalSoup** and **BeautifulSoup** to scrape news articles from websites and compare their performance and ease of use.


## BeautifulSoup

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# TechCrunch Tech News URL
TECHCRUNCH_URL = "https://techcrunch.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

start_time = time.time()

# Request the TechCrunch homepage
response = requests.get(TECHCRUNCH_URL, headers=HEADERS)
soup = BeautifulSoup(response.text, "html.parser")

# Find article links
articles = []
for link in soup.find_all("a", href=True):
    url = link["href"]
    if url.startswith("https://techcrunch.com/") and "/20" in url and url not in articles:
        articles.append(url)

# Scrape article details
data = []
for article_url in articles[:500]:  # Cap the run at 500 articles
    try:
        article_response = requests.get(article_url, headers=HEADERS)
        article_soup = BeautifulSoup(article_response.text, "html.parser")

        title = article_soup.find("h1").get_text(strip=True) if article_soup.find("h1") else "No title"
        content = " ".join([p.get_text(strip=True) for p in article_soup.find_all("p")])

        data.append({"title": title, "url": article_url, "content": content})
    except Exception as e:
        print(f"Skipping {article_url}: {e}")
        continue

# Save to CSV
with open("techcrunch_articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "content"])
    writer.writeheader()
    writer.writerows(data)

end_time = time.time()
print(f"Scraping completed: {len(data)} articles saved in {end_time - start_time:.2f} seconds from TechCrunch.")

Scraping completed: 44 articles saved in 39.51 seconds from TechCrunch.
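The link-collection step in the cell above can be pulled out into a small function and checked against a fixed HTML snippet, which makes the filter easier to reason about than a live homepage. This is only a sketch: `extract_article_links` and the sample markup are hypothetical names introduced here, and resolving relative links with `urljoin` is an addition the original cell does not make.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_links(html, base_url="https://techcrunch.com/"):
    """Return unique article URLs in the order they first appear."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()   # fast membership test for de-duplication
    links = []     # preserves first-seen order
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs such as "/2025/..." against the base URL.
        url = urljoin(base_url, anchor["href"])
        # Same filter as the notebook cell: on-site URLs containing "/20".
        if url.startswith(base_url) and "/20" in url and url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical homepage fragment used only for this example.
SAMPLE = """
<a href="https://techcrunch.com/2025/01/02/first-story/">First</a>
<a href="/2025/01/03/second-story/">Second (relative)</a>
<a href="https://techcrunch.com/2025/01/02/first-story/">Duplicate</a>
<a href="https://techcrunch.com/about/">About</a>
<a href="https://example.com/2025/01/02/elsewhere/">Other site</a>
"""

sample_links = extract_article_links(SAMPLE)
print(sample_links)
```

Note that the `"/20"` test is a loose heuristic for date-based paths; it would also accept any other on-site URL that happens to contain that substring.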


## MechanicalSoup

In [2]:
import mechanicalsoup
import time
import csv

# Initialize the browser
browser = mechanicalsoup.StatefulBrowser(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# TechCrunch Tech News URL
TECHCRUNCH_URL = "https://techcrunch.com/"

# Start timing
start_time = time.time()

# Request the TechCrunch homepage
browser.open(TECHCRUNCH_URL)

# Find article links
article_links = browser.links(url_regex='/20')  # Match hrefs containing "/20" (date-based paths such as /2025/...), not only the current year
article_links = list(set([link.get('href') for link in article_links]))

# Scrape article details
data = []
for article_url in article_links[:500]:  # Limit to 500 articles
    try:
        # Open the article page
        browser.open(article_url)
        
        # Extract the article title and content
        title = browser.page.find('h1').get_text(strip=True) if browser.page.find('h1') else 'No title'
        content = ' '.join([p.get_text(strip=True) for p in browser.page.find_all('p')])
        
        # Save the data
        data.append({
            'title': title,
            'url': article_url,
            'content': content
        })
    except Exception as e:
        print(f"Skipping {article_url}: {e}")
        continue

# Save to CSV
with open("techcrunch_articles_mechanicalsoup.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "content"])
    writer.writeheader()
    writer.writerows(data)

# End timing
end_time = time.time()

# Print completion message
print(f"Scraping completed using MechanicalSoup: {len(data)} articles saved in {end_time - start_time:.2f} seconds.")

Scraping completed using MechanicalSoup: 44 articles saved in 8.59 seconds.


## Conclusion

In this experiment, we compared the performance of **BeautifulSoup** and **MechanicalSoup** for extracting news articles from TechCrunch. The results highlight the efficiency difference between the two approaches:

- **BeautifulSoup**: Scraping completed with **44 articles saved in 39.51 seconds**.
- **MechanicalSoup**: Scraping completed with **44 articles saved in 8.59 seconds**.

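The time difference shows that the **MechanicalSoup** run was considerably faster in this scenario. The parsing library itself is unlikely to be the cause, since MechanicalSoup uses **BeautifulSoup** to parse each page. A more plausible explanation is connection handling: `StatefulBrowser` sends every request through a single `requests.Session`, which reuses connections to the same host, whereas each bare `requests.get` call in the BeautifulSoup cell sets up a new connection. The parser (`html.parser` in the first cell versus MechanicalSoup's default, `lxml`) and the run order (the second run may benefit from warm server-side caches) may also contribute, so these numbers are best read as a single uncontrolled measurement rather than a benchmark of the two libraries.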
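One way to check how much of the gap comes from connection handling rather than from either library is to route the plain `requests` fetches through a shared `requests.Session` and time them with the same helper. The sketch below is a minimal, hypothetical setup: `time_fetches` and `make_session_fetcher` are names introduced here, and the 10-second timeout is an arbitrary assumption.

```python
import time


def time_fetches(urls, fetch):
    """Call fetch(url) for each URL and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [fetch(url) for url in urls]
    return results, time.perf_counter() - start


def make_session_fetcher(headers=None):
    """Build a fetch function that reuses one requests.Session."""
    import requests  # imported lazily so time_fetches works without it

    session = requests.Session()
    session.headers.update(headers or {})

    def fetch(url):
        # Connections to the same host are pooled and reused by the session.
        return session.get(url, timeout=10).text

    return fetch


# Offline demonstration with a stand-in fetcher; swap in
# make_session_fetcher(HEADERS) to time real downloads.
pages, elapsed = time_fetches(["a", "b"], lambda url: url.upper())
print(pages, f"{elapsed:.4f}s")
```

If the session-based run lands close to the MechanicalSoup timing, that would support connection reuse as the main factor; a single run on either side is still too little to draw firm conclusions.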

Overall, the choice between these libraries depends on the scraping requirements:
- If the website requires **submitting forms, following links, or staying logged in across requests**, **MechanicalSoup** is preferable.
- If the website consists of **static pages with well-structured HTML**, **BeautifulSoup** paired with `requests` is a simpler and more flexible choice.

By selecting the appropriate tool based on the website's complexity, we can optimize web scraping tasks for both speed and accuracy.
