# Introduction to BeautifulSoup and MechanicalSoup

## BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents, which makes it a common choice for web scraping. It provides simple methods for navigating, searching, and modifying the parse tree. It is best suited to small-scale scraping projects where the site structure is not too complex. Because it does not make HTTP requests itself, it is usually paired with an HTTP client such as the `requests` library.

### Key Features:
- Easy to set up and use
- Parses HTML and XML through a pluggable parser backend (`html.parser`, `lxml`, or `html5lib`)
- Best for small to medium-scale scraping tasks
- Needs a separate library, such as `requests`, to fetch pages
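
As a minimal sketch of the navigating, searching, and modifying operations described above, the following parses an inline HTML string rather than a live page. The HTML snippet and all variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
HTML = """
<html><body>
  <h1 id="headline">Sample headline</h1>
  <div class="story">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <a href="/news/example-1">Read more</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Navigating: move from a tag to its children or its parent
headline = soup.find("h1")
story = soup.find("div", class_="story")
first_paragraph = story.p  # first <p> inside the div

# Searching: find_all and CSS selectors both return matching tags
paragraphs = [p.get_text(strip=True) for p in story.find_all("p")]
links = [a["href"] for a in soup.select("div.story a[href]")]

# Modifying: change text and attributes in the parsed tree
headline.string = "Updated headline"
headline["class"] = ["title"]

print(headline.get_text())
print(paragraphs)
print(links)
```

The same calls work on `response.text` from a real request; only the source of the HTML changes.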

## MechanicalSoup

MechanicalSoup is a Python library for automating interactions with websites and scraping their content. It is built on **Requests** (for HTTP sessions) and **BeautifulSoup** (for parsing), and adds helpers for navigation, form submission, and content extraction. It does not execute JavaScript.

### Key Features:
- **Simple API**: A small, browser-like interface (open a page, follow a link, submit a form) that is quick to learn.
- **Form Handling**: Unlike BeautifulSoup on its own, MechanicalSoup can fill out and submit forms, which makes it practical to automate logins and data entry.
- **HTML Parsing**: It uses **BeautifulSoup** to parse HTML, so the same navigation and search methods are available on every fetched page.
- **Stateful Sessions**: It keeps cookies and other session data between requests, so a login persists across subsequent page loads.



In [2]:
import requests
from bs4 import BeautifulSoup
import time
import csv

# Define BBC News URL and request headers
BBC_NEWS_URL = 'https://www.bbc.com/news'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

start_time = time.time()

# Fetch and parse the main BBC News page (fail fast on a hung connection or HTTP error)
response = requests.get(BBC_NEWS_URL, headers=HEADERS, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Extract article links
article_links = set()
for anchor in soup.find_all('a', href=True):
    href = anchor['href']
    if href.startswith('/news/'):
        article_links.add(f'https://www.bbc.com{href}')

# Scrape details from each article (capped at 500 links)
articles_data = []
for article_url in list(article_links)[:500]:
    try:
        article_response = requests.get(article_url, headers=HEADERS, timeout=10)
        article_soup = BeautifulSoup(article_response.text, 'html.parser')

        # Look up the <h1> once and fall back to a placeholder if it is missing
        h1 = article_soup.find('h1')
        title = h1.get_text(strip=True) if h1 else 'Title Not Found'
        content = ' '.join(p.get_text(strip=True) for p in article_soup.find_all('p'))

        articles_data.append({'title': title, 'url': article_url, 'content': content})
    except Exception as e:
        # Report the failure instead of silently dropping the article
        print(f'Skipping {article_url}: {e}')
        continue

# Save extracted data to a CSV file
csv_filename = 'bbc_news_scraped_data.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.DictWriter(file, fieldnames=['title', 'url', 'content'])
    csv_writer.writeheader()
    csv_writer.writerows(articles_data)

end_time = time.time()
print(f"Scraping completed: {len(articles_data)} articles saved in {end_time - start_time:.2f} seconds using BeautifulSoup.")

Scraping completed: 64 articles saved in 6.00 seconds using BeautifulSoup.
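
The link-extraction step in the cell above builds URLs by string concatenation and stores them in a set, so the order of `list(article_links)[:500]` is arbitrary and links that differ only by a `#fragment` count as distinct. A possible alternative, sketched here with the standard library only, resolves each `href` against the base URL and keeps the first occurrence in page order. The helper name `normalize_links` and the sample hrefs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(hrefs, base_url="https://www.bbc.com/news", path_prefix="/news/"):
    """Resolve hrefs against base_url; keep unique same-site links under path_prefix."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute, then drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.netloc != base_host or not parts.path.startswith(path_prefix):
            continue  # off-site link or a different section of the site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)  # list keeps first-seen order
    return result


# Made-up hrefs covering relative, absolute, off-section, and off-site cases
sample = [
    "/news/world-123",
    "https://www.bbc.com/news/world-123#comments",
    "/sport/football",
    "https://example.com/news/other",
    "/news/uk-456",
]
print(normalize_links(sample))
# ['https://www.bbc.com/news/world-123', 'https://www.bbc.com/news/uk-456']
```

In the scraper this could be fed with `normalize_links(a['href'] for a in soup.find_all('a', href=True))`.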


In [3]:
import mechanicalsoup
import time
import csv

# Initialize a browser object
browser = mechanicalsoup.Browser()
BBC_NEWS_URL = 'https://www.bbc.com/news'

start_time = time.time()

# Fetch the main BBC News page
page = browser.get(BBC_NEWS_URL, timeout=10)
soup = page.soup

# Extract article links (the set comprehension removes duplicates)
article_links = list({f'https://www.bbc.com{a["href"]}' for a in soup.select('a[href^="/news/"]')})

# Scrape article details (capped at 500 links)
data = []
for article_url in article_links[:500]:
    try:
        article_page = browser.get(article_url, timeout=10)
        article_soup = article_page.soup

        # Look up the <h1> once and fall back to a placeholder if it is missing
        h1 = article_soup.find('h1')
        title = h1.get_text(strip=True) if h1 else 'Title Not Found'
        content = ' '.join(p.get_text(strip=True) for p in article_soup.find_all('p'))

        data.append({'title': title, 'url': article_url, 'content': content})
    except Exception as e:
        # Report the failure instead of silently dropping the article
        print(f'Skipping {article_url}: {e}')
        continue

# Save extracted data to a CSV file
csv_filename = 'bbc_news_articles_mechanicalsoup.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.DictWriter(file, fieldnames=['title', 'url', 'content'])
    csv_writer.writeheader()
    csv_writer.writerows(data)

end_time = time.time()
print(f"Scraping completed using MechanicalSoup: {len(data)} articles saved in {end_time - start_time:.2f} seconds.")

Scraping completed using MechanicalSoup: 64 articles saved in 2.66 seconds.
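
One plausible contributor to the gap between the two timings is connection reuse: each bare `requests.get` call sets up its own session, while `mechanicalsoup.Browser` sends every request through a single `requests.Session`. The sketch below shows how the BeautifulSoup version could share a session in the same way. The function names and the user-agent string are hypothetical, and no timing is claimed for this variant.

```python
import requests
from bs4 import BeautifulSoup


def make_session(user_agent="tutorial-sketch/0.1"):
    """Create one session whose headers and connections are shared by all requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def scrape_articles(urls, session, timeout=10):
    """Fetch each URL through the shared session and extract title and body text."""
    rows = []
    for url in urls:
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"Skipping {url}: {exc}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        h1 = soup.find("h1")
        rows.append({
            "title": h1.get_text(strip=True) if h1 else "Title Not Found",
            "url": url,
            "content": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
        })
    return rows
```

A typical call would be `with make_session() as session: rows = scrape_articles(urls, session)`, which also closes the pooled connections when the block exits.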


# Performance Comparison: BeautifulSoup vs. MechanicalSoup

| Feature               | BeautifulSoup                                     | MechanicalSoup                                        |
|-----------------------|--------------------------------------------------|------------------------------------------------------|
| **Total Articles Scraped** | 64                                             | 64                                                   |
| **Execution Time (s, single run)** | 6.00                                  | 2.66                                                 |
| **Setup Complexity**   | Simple setup with minimal dependencies          | Simple setup, requires minimal configuration         |
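| **Processing Speed**   | Slower in this run; each `requests.get` call opens a new connection, and `html.parser` is a pure-Python parser | Faster in this run, likely because the browser reuses one `requests.Session` (keep-alive connections) and parses with `lxml` by default |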
| **Scalability**        | Best for small to medium-sized tasks            | Suited for small to medium-scale tasks, not as scalable as Scrapy |
| **Code Complexity**    | Straightforward code but requires manual request handling | Slightly more complex but automates browsing and form submissions |
| **Built-in Capabilities** | Parsing only; fetching, crawling, and session management need extra code or libraries | Form handling, link following, and stateful sessions; no built-in crawler |
| **Best Use Case**      | Suitable for small-scale projects or quick one-off extractions | Great for simple interactive web scraping, form handling, and session management |
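
The execution times in the table come from one run of each script, so network jitter and caching can easily dominate the difference. A rough way to get steadier numbers is to repeat each scraper several times and compare medians. The sketch below uses only the standard library; `time_runs`, `summarize`, and the stand-in `fake_scrape` workload are hypothetical names introduced for illustration.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return the elapsed seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of timings to a few robust summary statistics."""
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def fake_scrape():
    """Stand-in workload so the sketch runs offline; swap in a real scraper function."""
    time.sleep(0.01)


print(summarize(time_runs(fake_scrape, repeats=3)))
```

To compare the two libraries, each scraping cell would first be wrapped in a function and passed to `time_runs` in place of `fake_scrape`.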
