# Web Scraping Tutorial: Books to Scrape

## üìö Project Overview

This comprehensive tutorial demonstrates web scraping techniques using **BeautifulSoup4** to extract book data from [books.toscrape.com](http://books.toscrape.com), a website designed specifically for practicing web scraping.

### Learning Objectives:
1. Understand HTML structure and DOM navigation
2. Master BeautifulSoup4 parsing techniques
3. Handle pagination across multiple pages
4. Extract structured data (titles, prices, ratings, availability)
5. Clean and transform scraped data
6. Store data in multiple formats (CSV, JSON)
7. Implement proper error handling and logging

### Website Structure:
- **Target Site**: http://books.toscrape.com/
- **Content**: 1000 books across 50 pages (20 books per page)
- **Data Points**: Title, Price, Rating, Availability, Category, Image URL

---

## Step 1: Environment Setup and Library Installation

### Objective:
Install and import all necessary libraries for web scraping, data manipulation, and storage.

### Libraries Used:
- **requests**: HTTP library to fetch web pages
- **beautifulsoup4**: HTML parsing and navigation
- **lxml**: Fast XML and HTML parser (BeautifulSoup backend)
- **pandas**: Data manipulation and analysis
- **datetime**: Timestamp generation
- **time**: Rate limiting between requests
- **json**: JSON data handling

In [1]:
# Install required libraries (run this cell first if libraries are not installed)
# Uncomment the line below if you need to install the packages
# !pip install requests beautifulsoup4 lxml pandas

In [2]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
from datetime import datetime
from typing import List, Dict, Optional
import re
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

print("‚úÖ All libraries imported successfully!")
print(f"üìÖ Current timestamp: {datetime.now()}")

‚úÖ All libraries imported successfully!
üìÖ Current timestamp: 2025-10-24 17:17:23.203749


---

## Step 2: Understanding the Website Structure

### Objective:
Before scraping, we need to understand the HTML structure of the target website.

### Process:
1. Visit the website in a browser
2. Right-click and select "Inspect" to view HTML
3. Identify CSS classes and HTML tags containing our target data

### Key HTML Elements:
```html
<article class="product_pod">
    <h3><a title="Book Title">...</a></h3>
    <p class="price_color">¬£51.77</p>
    <p class="star-rating Three">...</p>
    <p class="instock availability">In stock</p>
</article>
```

In [3]:
# Define base URL and initial configuration
BASE_URL = "http://books.toscrape.com/"
CATALOGUE_URL = f"{BASE_URL}catalogue/"

# Headers to mimic a real browser request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

print(f"üåê Base URL: {BASE_URL}")
print(f"üìÇ Catalogue URL: {CATALOGUE_URL}")

üåê Base URL: http://books.toscrape.com/
üìÇ Catalogue URL: http://books.toscrape.com/catalogue/


---

## Step 3: Fetching a Single Web Page

### Objective:
Create a function to fetch HTML content from a URL with proper error handling.

### Function Details:
- **Input**: URL string
- **Output**: HTML content as string or None if error
- **Error Handling**: HTTP errors, timeouts, connection errors
- **Timeout**: 10 seconds to prevent hanging

### HTTP Status Codes:
- 200: Success
- 404: Page not found
- 500: Server error

In [4]:
def fetch_page(url: str, timeout: int = 10) -> Optional[str]:
    """
    Fetch HTML content from a given URL.
    
    Args:
        url (str): The URL to fetch
        timeout (int): Request timeout in seconds (default: 10)
    
    Returns:
        str: HTML content if successful, None otherwise
    
    Raises:
        Logs errors but does not raise exceptions
    """
    try:
        logging.info(f"Fetching URL: {url}")
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        
        # Raise exception for bad status codes (4xx, 5xx)
        response.raise_for_status()
        
        logging.info(f"‚úÖ Successfully fetched {url} (Status: {response.status_code})")
        return response.text
        
    except requests.exceptions.HTTPError as e:
        logging.error(f"‚ùå HTTP Error: {e}")
    except requests.exceptions.ConnectionError as e:
        logging.error(f"‚ùå Connection Error: {e}")
    except requests.exceptions.Timeout as e:
        logging.error(f"‚ùå Timeout Error: {e}")
    except requests.exceptions.RequestException as e:
        logging.error(f"‚ùå Request Error: {e}")
    
    return None

# Test the function with the homepage
print("Testing fetch_page function...")
html_content = fetch_page(BASE_URL)

if html_content:
    print(f"‚úÖ Successfully fetched homepage")
    print(f"üìä HTML length: {len(html_content)} characters")
    print(f"üìÑ First 200 characters:\n{html_content[:200]}...")
else:
    print("‚ùå Failed to fetch homepage")

2025-10-24 17:17:23,209 - INFO - Fetching URL: http://books.toscrape.com/


Testing fetch_page function...


2025-10-24 17:17:23,456 - INFO - ‚úÖ Successfully fetched http://books.toscrape.com/ (Status: 200)


‚úÖ Successfully fetched homepage
üìä HTML length: 51294 characters
üìÑ First 200 characters:
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if I...


---

## Step 4: Parsing HTML with BeautifulSoup

### Objective:
Parse raw HTML into a BeautifulSoup object for easy navigation and data extraction.

### BeautifulSoup Parsers:
- **lxml**: Fast and lenient (recommended)
- **html.parser**: Built-in Python parser (slower)
- **html5lib**: Most lenient, very slow

### Key Methods:
- `find()`: Find first matching element
- `find_all()`: Find all matching elements
- `select()`: CSS selector based search
- `get_text()`: Extract text content

In [5]:
def parse_html(html_content: str) -> Optional[BeautifulSoup]:
    """
    Parse HTML content into BeautifulSoup object.
    
    Args:
        html_content (str): Raw HTML content
    
    Returns:
        BeautifulSoup: Parsed HTML object or None if parsing fails
    """
    try:
        soup = BeautifulSoup(html_content, 'lxml')
        logging.info("‚úÖ HTML parsed successfully")
        return soup
    except Exception as e:
        logging.error(f"‚ùå Error parsing HTML: {e}")
        return None

# Parse the homepage HTML
soup = parse_html(html_content)

if soup:
    # Extract page title
    page_title = soup.find('title').get_text()
    print(f"üìñ Page Title: {page_title}")
    
    # Count number of books on the page
    books = soup.find_all('article', class_='product_pod')
    print(f"üìö Number of books found: {len(books)}")
    
    # Find pagination info
    pager = soup.find('li', class_='current')
    if pager:
        print(f"üìÑ Pagination: {pager.get_text().strip()}")

2025-10-24 17:17:23,494 - INFO - ‚úÖ HTML parsed successfully


üìñ Page Title: 
    All products | Books to Scrape - Sandbox

üìö Number of books found: 20
üìÑ Pagination: Page 1 of 50


---

## Step 5: Extracting Book Data from a Single Page

### Objective:
Extract detailed information about each book from a single page.

### Data Points to Extract:
1. **Title**: Book name
2. **Price**: Price in GBP (¬£)
3. **Rating**: Star rating (One to Five)
4. **Availability**: In stock or out of stock
5. **Image URL**: Book cover image URL
6. **Product URL**: Link to book detail page

### CSS Selectors Used:
- `article.product_pod`: Each book container
- `h3 a`: Book title and URL
- `p.price_color`: Price
- `p.star-rating`: Rating class
- `p.instock.availability`: Availability status

In [6]:
def clean_price_text(price_text: str) -> float:
    """
    Clean and convert price text to float, handling encoding issues.
    
    Args:
        price_text (str): Raw price text from HTML (e.g., '¬£51.77', '√Ç¬£51.77')
    
    Returns:
        float: Cleaned price value
    """
    # Remove all non-ASCII characters, currency symbols, and whitespace
    # Keep only digits and decimal point
    cleaned = re.sub(r'[^\d.]', '', price_text)
    
    try:
        return float(cleaned)
    except ValueError:
        # If conversion still fails, log and return 0
        logging.warning(f"Could not convert price: '{price_text}' -> '{cleaned}'")
        return 0.0


def extract_rating(rating_class: str) -> int:
    """
    Convert star rating class to numeric value.
    
    Args:
        rating_class (str): CSS class containing rating (e.g., 'star-rating Three')
    
    Returns:
        int: Numeric rating (1-5)
    """
    rating_map = {
        'One': 1,
        'Two': 2,
        'Three': 3,
        'Four': 4,
        'Five': 5
    }
    
    # Extract rating word from class string
    for word, value in rating_map.items():
        if word in rating_class:
            return value
    
    return 0  # Default if no rating found


def parse_book_data(book_element: BeautifulSoup, base_url: str) -> Dict:
    """
    Extract all relevant data from a single book element.
    
    Args:
        book_element (BeautifulSoup): Parsed book article element
        base_url (str): Base URL for constructing absolute URLs
    
    Returns:
        dict: Dictionary containing book information
    """
    try:
        # Extract title and URL
        title_element = book_element.find('h3').find('a')
        title = title_element.get('title')
        product_url = base_url + title_element.get('href')
        
        # Extract price using robust cleaning function
        price_text = book_element.find('p', class_='price_color').get_text()
        price = clean_price_text(price_text)
        
        # Extract rating
        rating_element = book_element.find('p', class_='star-rating')
        rating_class = rating_element.get('class')[1]  # Get second class (rating word)
        rating = extract_rating(rating_class)
        
        # Extract availability
        availability_element = book_element.find('p', class_='instock')
        availability = availability_element.get_text().strip()
        
        # Extract image URL
        image_element = book_element.find('img')
        image_url = base_url + image_element.get('src')
        
        return {
            'title': title,
            'price': price,
            'rating': rating,
            'availability': availability,
            'image_url': image_url,
            'product_url': product_url,
            'scraped_at': datetime.now().isoformat()
        }
        
    except Exception as e:
        logging.error(f"‚ùå Error parsing book data: {e}")
        return None


def scrape_books_from_page(soup: BeautifulSoup, base_url: str) -> List[Dict]:
    """
    Extract all books from a single page.
    
    Args:
        soup (BeautifulSoup): Parsed HTML page
        base_url (str): Base URL for constructing absolute URLs
    
    Returns:
        list: List of dictionaries containing book data
    """
    books_data = []
    
    # Find all book elements
    book_elements = soup.find_all('article', class_='product_pod')
    
    logging.info(f"Found {len(book_elements)} books on the page")
    
    for book_element in book_elements:
        book_data = parse_book_data(book_element, base_url)
        if book_data:
            books_data.append(book_data)
    
    return books_data


# Test: Extract books from the first page
print("\nüîç Extracting books from the first page...\n")
books_on_page = scrape_books_from_page(soup, BASE_URL)

print(f"‚úÖ Successfully extracted {len(books_on_page)} books\n")

if len(books_on_page) > 0:
    print("üìö Sample Book Data (First Book):\n")
    print(json.dumps(books_on_page[0], indent=2))
else:
    print("‚ö†Ô∏è  No books were extracted. Please check the HTML structure.")

2025-10-24 17:17:23,509 - INFO - Found 20 books on the page



üîç Extracting books from the first page...

‚úÖ Successfully extracted 20 books

üìö Sample Book Data (First Book):

{
  "title": "A Light in the Attic",
  "price": 51.77,
  "rating": 3,
  "availability": "In stock",
  "image_url": "http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
  "product_url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "scraped_at": "2025-10-24T17:17:23.510022"
}


---

## Step 6: Handling Pagination

### Objective:
Navigate through multiple pages to scrape all books from the website.

### Pagination Strategy:
1. Start with page 1
2. Look for "Next" button or page number
3. Extract next page URL
4. Repeat until no more pages

### URL Pattern:
- Page 1: `http://books.toscrape.com/`
- Page 2: `http://books.toscrape.com/catalogue/page-2.html`
- Page N: `http://books.toscrape.com/catalogue/page-N.html`

### Rate Limiting:
- Add 1-2 second delay between requests
- Prevents overwhelming the server
- Mimics human browsing behavior

In [7]:
def get_next_page_url(soup: BeautifulSoup, current_url: str) -> Optional[str]:
    """
    Extract the URL of the next page from pagination links.
    
    Args:
        soup (BeautifulSoup): Parsed current page
        current_url (str): Current page URL
    
    Returns:
        str: Next page URL or None if no next page
    """
    # Find the "next" button
    next_button = soup.find('li', class_='next')
    
    if next_button:
        next_link = next_button.find('a')
        if next_link:
            next_page = next_link.get('href')
            
            # Construct absolute URL
            if current_url.endswith('index.html') or current_url == BASE_URL:
                next_url = CATALOGUE_URL + next_page
            else:
                # Replace current page with next page
                base = current_url.rsplit('/', 1)[0]
                next_url = base + '/' + next_page
            
            return next_url
    
    return None


def scrape_all_books(max_pages: Optional[int] = None, delay: float = 1.5) -> List[Dict]:
    """
    Scrape all books from all pages on the website.
    
    Args:
        max_pages (int, optional): Maximum number of pages to scrape (None = all pages)
        delay (float): Delay in seconds between page requests (default: 1.5)
    
    Returns:
        list: List of all book dictionaries
    """
    all_books = []
    current_url = BASE_URL
    page_count = 0
    
    print("üöÄ Starting to scrape all books...\n")
    
    while current_url:
        page_count += 1
        
        # Check max pages limit
        if max_pages and page_count > max_pages:
            print(f"\n‚èπÔ∏è  Reached maximum page limit: {max_pages}")
            break
        
        print(f"üìÑ Scraping page {page_count}: {current_url}")
        
        # Fetch and parse page
        html = fetch_page(current_url)
        if not html:
            print(f"‚ùå Failed to fetch page {page_count}")
            break
        
        soup = parse_html(html)
        if not soup:
            print(f"‚ùå Failed to parse page {page_count}")
            break
        
        # Extract books from current page
        books = scrape_books_from_page(soup, BASE_URL)
        all_books.extend(books)
        print(f"   ‚úÖ Extracted {len(books)} books (Total: {len(all_books)})")
        
        # Get next page URL
        next_url = get_next_page_url(soup, current_url)
        
        if next_url:
            current_url = next_url
            # Rate limiting: wait before next request
            time.sleep(delay)
        else:
            print("\n‚úÖ No more pages to scrape")
            break
    
    print(f"\nüéâ Scraping complete!")
    print(f"üìä Total pages scraped: {page_count}")
    print(f"üìö Total books extracted: {len(all_books)}")
    
    return all_books


# Test: Scrape first 3 pages only (for demonstration)
print("Testing pagination with first 3 pages...\n")
sample_books = scrape_all_books(max_pages=3, delay=1)

2025-10-24 17:17:23,518 - INFO - Fetching URL: http://books.toscrape.com/


Testing pagination with first 3 pages...

üöÄ Starting to scrape all books...

üìÑ Scraping page 1: http://books.toscrape.com/


2025-10-24 17:17:23,918 - INFO - ‚úÖ Successfully fetched http://books.toscrape.com/ (Status: 200)
2025-10-24 17:17:23,945 - INFO - ‚úÖ HTML parsed successfully
2025-10-24 17:17:23,946 - INFO - Found 20 books on the page


   ‚úÖ Extracted 20 books (Total: 20)


2025-10-24 17:17:24,956 - INFO - Fetching URL: http://books.toscrape.com/catalogue/catalogue/page-2.html


üìÑ Scraping page 2: http://books.toscrape.com/catalogue/catalogue/page-2.html


2025-10-24 17:17:25,278 - ERROR - ‚ùå HTTP Error: 404 Client Error: Not Found for url: http://books.toscrape.com/catalogue/catalogue/page-2.html


‚ùå Failed to fetch page 2

üéâ Scraping complete!
üìä Total pages scraped: 2
üìö Total books extracted: 20


---

## Step 7: Data Cleaning and Validation

### Objective:
Clean and validate the scraped data to ensure quality and consistency.

### Cleaning Operations:
1. Remove duplicate books (based on title)
2. Handle missing values
3. Validate data types
4. Standardize text fields
5. Check for data anomalies

### Validation Checks:
- Price should be positive number
- Rating should be between 1-5
- Title should not be empty
- URLs should be valid

In [8]:
def clean_and_validate_data(books_data: List[Dict]) -> pd.DataFrame:
    """
    Clean and validate scraped book data.
    
    Args:
        books_data (list): List of book dictionaries
    
    Returns:
        pd.DataFrame: Cleaned and validated data
    """
    print("üßπ Cleaning and validating data...\n")
    
    # Convert to DataFrame
    df = pd.DataFrame(books_data)
    
    print(f"üìä Initial dataset shape: {df.shape}")
    print(f"üìã Columns: {list(df.columns)}\n")
    
    # Check for duplicates
    duplicates = df.duplicated(subset=['title']).sum()
    print(f"üîç Duplicate books found: {duplicates}")
    if duplicates > 0:
        df = df.drop_duplicates(subset=['title'], keep='first')
        print(f"   ‚úÖ Removed {duplicates} duplicates")
    
    # Check for missing values
    missing = df.isnull().sum()
    print(f"\n‚ùì Missing values:\n{missing}")
    
    # Validate price
    invalid_prices = df[df['price'] <= 0].shape[0]
    print(f"\nüí∞ Invalid prices (‚â§0): {invalid_prices}")
    
    # Validate rating
    invalid_ratings = df[(df['rating'] < 1) | (df['rating'] > 5)].shape[0]
    print(f"‚≠ê Invalid ratings (not 1-5): {invalid_ratings}")
    
    # Add derived columns
    df['title_length'] = df['title'].str.len()
    df['in_stock'] = df['availability'].str.contains('In stock', case=False)
    
    print(f"\n‚úÖ Data cleaning complete!")
    print(f"üìä Final dataset shape: {df.shape}\n")
    
    return df


# Clean and validate the sample data
df_books = clean_and_validate_data(sample_books)

# Display first few rows
print("üìö Sample Data (First 5 Books):\n")
display(df_books.head())

üßπ Cleaning and validating data...

üìä Initial dataset shape: (20, 7)
üìã Columns: ['title', 'price', 'rating', 'availability', 'image_url', 'product_url', 'scraped_at']

üîç Duplicate books found: 0

‚ùì Missing values:
title           0
price           0
rating          0
availability    0
image_url       0
product_url     0
scraped_at      0
dtype: int64

üí∞ Invalid prices (‚â§0): 0
‚≠ê Invalid ratings (not 1-5): 0

‚úÖ Data cleaning complete!
üìä Final dataset shape: (20, 9)

üìö Sample Data (First 5 Books):



Unnamed: 0,title,price,rating,availability,image_url,product_url,scraped_at,title_length,in_stock
0,A Light in the Attic,51.77,3,In stock,http://books.toscrape.com/media/cache/2c/da/2c...,http://books.toscrape.com/catalogue/a-light-in...,2025-10-24T17:17:23.947269,20,True
1,Tipping the Velvet,53.74,1,In stock,http://books.toscrape.com/media/cache/26/0c/26...,http://books.toscrape.com/catalogue/tipping-th...,2025-10-24T17:17:23.947420,18,True
2,Soumission,50.1,1,In stock,http://books.toscrape.com/media/cache/3e/ef/3e...,http://books.toscrape.com/catalogue/soumission...,2025-10-24T17:17:23.947565,10,True
3,Sharp Objects,47.82,4,In stock,http://books.toscrape.com/media/cache/32/51/32...,http://books.toscrape.com/catalogue/sharp-obje...,2025-10-24T17:17:23.947710,13,True
4,Sapiens: A Brief History of Humankind,54.23,5,In stock,http://books.toscrape.com/media/cache/be/a5/be...,http://books.toscrape.com/catalogue/sapiens-a-...,2025-10-24T17:17:23.947856,37,True


---

## Step 8: Data Analysis and Statistics

### Objective:
Perform exploratory data analysis on the scraped data.

### Analysis Areas:
1. Price statistics (mean, median, min, max)
2. Rating distribution
3. Availability analysis
4. Top rated books
5. Most expensive and cheapest books

In [9]:
def analyze_book_data(df: pd.DataFrame) -> None:
    """
    Perform exploratory data analysis on book dataset.
    
    Args:
        df (pd.DataFrame): Book data
    """
    print("üìä DATA ANALYSIS REPORT\n")
    print("=" * 60)
    
    # Basic statistics
    print("\n1Ô∏è‚É£  DATASET OVERVIEW")
    print("-" * 60)
    print(f"Total Books: {len(df)}")
    print(f"Books in Stock: {df['in_stock'].sum()}")
    print(f"Books Out of Stock: {(~df['in_stock']).sum()}")
    
    # Price statistics
    print("\n2Ô∏è‚É£  PRICE STATISTICS (¬£)")
    print("-" * 60)
    print(f"Average Price: ¬£{df['price'].mean():.2f}")
    print(f"Median Price: ¬£{df['price'].median():.2f}")
    print(f"Minimum Price: ¬£{df['price'].min():.2f}")
    print(f"Maximum Price: ¬£{df['price'].max():.2f}")
    print(f"Standard Deviation: ¬£{df['price'].std():.2f}")
    
    # Rating distribution
    print("\n3Ô∏è‚É£  RATING DISTRIBUTION")
    print("-" * 60)
    rating_counts = df['rating'].value_counts().sort_index()
    for rating, count in rating_counts.items():
        stars = '‚≠ê' * rating
        percentage = (count / len(df)) * 100
        print(f"{stars} ({rating}): {count} books ({percentage:.1f}%)")
    
    print(f"\nAverage Rating: {df['rating'].mean():.2f} ‚≠ê")
    
    # Top 5 most expensive books
    print("\n4Ô∏è‚É£  TOP 5 MOST EXPENSIVE BOOKS")
    print("-" * 60)
    top_expensive = df.nlargest(5, 'price')[['title', 'price', 'rating']]
    for idx, row in top_expensive.iterrows():
        print(f"¬£{row['price']:.2f} - {row['title'][:50]}... (‚≠ê{row['rating']})")
    
    # Top 5 cheapest books
    print("\n5Ô∏è‚É£  TOP 5 CHEAPEST BOOKS")
    print("-" * 60)
    top_cheap = df.nsmallest(5, 'price')[['title', 'price', 'rating']]
    for idx, row in top_cheap.iterrows():
        print(f"¬£{row['price']:.2f} - {row['title'][:50]}... (‚≠ê{row['rating']})")
    
    # Top rated books (5 stars)
    print("\n6Ô∏è‚É£  FIVE-STAR BOOKS")
    print("-" * 60)
    five_star = df[df['rating'] == 5]
    print(f"Total 5-star books: {len(five_star)}")
    if len(five_star) > 0:
        print("\nSample 5-star books:")
        for idx, row in five_star.head(5).iterrows():
            print(f"  ‚Ä¢ {row['title'][:60]} (¬£{row['price']:.2f})")
    
    print("\n" + "=" * 60)


# Run analysis
analyze_book_data(df_books)

üìä DATA ANALYSIS REPORT


1Ô∏è‚É£  DATASET OVERVIEW
------------------------------------------------------------
Total Books: 20
Books in Stock: 20
Books Out of Stock: 0

2Ô∏è‚É£  PRICE STATISTICS (¬£)
------------------------------------------------------------
Average Price: ¬£38.05
Median Price: ¬£41.38
Minimum Price: ¬£13.99
Maximum Price: ¬£57.25
Standard Deviation: ¬£15.14

3Ô∏è‚É£  RATING DISTRIBUTION
------------------------------------------------------------
‚≠ê (1): 6 books (30.0%)
‚≠ê‚≠ê (2): 3 books (15.0%)
‚≠ê‚≠ê‚≠ê (3): 3 books (15.0%)
‚≠ê‚≠ê‚≠ê‚≠ê (4): 4 books (20.0%)
‚≠ê‚≠ê‚≠ê‚≠ê‚≠ê (5): 4 books (20.0%)

Average Rating: 2.85 ‚≠ê

4Ô∏è‚É£  TOP 5 MOST EXPENSIVE BOOKS
------------------------------------------------------------
¬£57.25 - Our Band Could Be Your Life: Scenes from the Ameri... (‚≠ê3)
¬£54.23 - Sapiens: A Brief History of Humankind... (‚≠ê5)
¬£53.74 - Tipping the Velvet... (‚≠ê1)
¬£52.29 - Scott Pilgrim's Precious Little Life (Scott Pilgri... (‚≠ê5)
¬£52.15

---

## Step 9: Saving Data to Files

### Objective:
Save the scraped and cleaned data to multiple file formats for further use.

### File Formats:
1. **CSV**: Comma-separated values (universal format)
2. **JSON**: JavaScript Object Notation (web-friendly)
3. **Excel**: Microsoft Excel format (business-friendly)
4. **Parquet**: Columnar storage (big data optimized)

### Use Cases:
- CSV: Import into databases, spreadsheet analysis
- JSON: Web APIs, JavaScript applications
- Excel: Business reports, manual review
- Parquet: PySpark processing, data lakes

In [10]:
import os

def save_data(df: pd.DataFrame, output_dir: str = 'data') -> None:
    """
    Save DataFrame to multiple file formats.
    
    Args:
        df (pd.DataFrame): Data to save
        output_dir (str): Output directory path
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
    print("üíæ Saving data to files...\n")
    
    # Save as CSV
    csv_path = f"{output_dir}/books_{timestamp}.csv"
    df.to_csv(csv_path, index=False, encoding='utf-8')
    print(f"‚úÖ Saved CSV: {csv_path}")
    
    # Save as JSON
    json_path = f"{output_dir}/books_{timestamp}.json"
    df.to_json(json_path, orient='records', indent=2, force_ascii=False)
    print(f"‚úÖ Saved JSON: {json_path}")
    
    # Save as Excel (requires openpyxl)
    try:
        excel_path = f"{output_dir}/books_{timestamp}.xlsx"
        df.to_excel(excel_path, index=False, engine='openpyxl')
        print(f"‚úÖ Saved Excel: {excel_path}")
    except ImportError:
        print("‚ö†Ô∏è  Excel save skipped (install openpyxl to enable)")
    
    # Save as Parquet (requires pyarrow)
    try:
        parquet_path = f"{output_dir}/books_{timestamp}.parquet"
        df.to_parquet(parquet_path, index=False, engine='pyarrow')
        print(f"‚úÖ Saved Parquet: {parquet_path}")
    except ImportError:
        print("‚ö†Ô∏è  Parquet save skipped (install pyarrow to enable)")
    
    print(f"\nüéâ All data saved to '{output_dir}/' directory")
    
    # Display file sizes
    print("\nüìÅ File Sizes:")
    for filename in os.listdir(output_dir):
        if timestamp in filename:
            filepath = os.path.join(output_dir, filename)
            size_kb = os.path.getsize(filepath) / 1024
            print(f"   {filename}: {size_kb:.2f} KB")


# Save the data
save_data(df_books)

üíæ Saving data to files...

‚úÖ Saved CSV: data/books_20251024_171725.csv
‚úÖ Saved JSON: data/books_20251024_171725.json
‚ö†Ô∏è  Excel save skipped (install openpyxl to enable)
‚ö†Ô∏è  Parquet save skipped (install pyarrow to enable)

üéâ All data saved to 'data/' directory

üìÅ File Sizes:
   books_20251024_171725.json: 8.51 KB
   books_20251024_171725.csv: 5.09 KB


---

## Step 10: Complete Pipeline - Scrape All Books

### Objective:
Execute the complete scraping pipeline to collect all books from the website.

### Pipeline Steps:
1. Scrape all pages (50 pages, ~1000 books)
2. Clean and validate data
3. Perform analysis
4. Save to multiple formats

### Estimated Time:
- With 1.5 second delay per page: ~75 seconds (50 pages √ó 1.5s)
- Total processing time: ~2-3 minutes

‚ö†Ô∏è **Note**: Uncomment and run the cell below to scrape ALL books.

In [11]:
def run_complete_pipeline(max_pages: Optional[int] = None, delay: float = 1.5) -> pd.DataFrame:
    """
    Execute the complete web scraping pipeline.
    
    Args:
        max_pages (int, optional): Maximum pages to scrape (None = all)
        delay (float): Delay between requests in seconds
    
    Returns:
        pd.DataFrame: Cleaned and validated book data
    """
    print("\n" + "=" * 70)
    print("üöÄ STARTING COMPLETE WEB SCRAPING PIPELINE")
    print("=" * 70 + "\n")
    
    start_time = time.time()
    
    # Step 1: Scrape all books
    print("\nüìñ STEP 1: SCRAPING BOOKS")
    print("-" * 70)
    all_books = scrape_all_books(max_pages=max_pages, delay=delay)
    
    # Step 2: Clean and validate
    print("\nüßπ STEP 2: CLEANING AND VALIDATING DATA")
    print("-" * 70)
    df_clean = clean_and_validate_data(all_books)
    
    # Step 3: Analyze data
    print("\nüìä STEP 3: ANALYZING DATA")
    print("-" * 70)
    analyze_book_data(df_clean)
    
    # Step 4: Save data
    print("\nüíæ STEP 4: SAVING DATA")
    print("-" * 70)
    save_data(df_clean)
    
    # Calculate elapsed time
    elapsed_time = time.time() - start_time
    minutes = int(elapsed_time // 60)
    seconds = int(elapsed_time % 60)
    
    print("\n" + "=" * 70)
    print("‚úÖ PIPELINE COMPLETED SUCCESSFULLY!")
    print("=" * 70)
    print(f"\n‚è±Ô∏è  Total execution time: {minutes}m {seconds}s")
    print(f"üìö Total books scraped: {len(df_clean)}")
    print(f"üíæ Data saved to 'data/' directory\n")
    
    return df_clean


# UNCOMMENT THE LINE BELOW TO SCRAPE ALL BOOKS (50 pages)
# This will take approximately 2-3 minutes to complete

# df_all_books = run_complete_pipeline(max_pages=None, delay=1.5)

# Or scrape just the first 5 pages for testing:
# df_test = run_complete_pipeline(max_pages=5, delay=1.0)

print("\nüí° To run the complete pipeline, uncomment one of the lines above.")


üí° To run the complete pipeline, uncomment one of the lines above.


---

## Step 11: Advanced Features (Optional)

### Additional Enhancements:
1. Scrape detailed book information (description, UPC, reviews)
2. Extract category information
3. Download book cover images
4. Create data visualizations
5. Export to database (SQLite, PostgreSQL)

Below are examples of advanced scraping techniques.

In [12]:
def scrape_book_details(book_url: str) -> Dict:
    """
    Scrape detailed information from individual book page.
    
    Args:
        book_url (str): URL of the book detail page
    
    Returns:
        dict: Detailed book information
    """
    html = fetch_page(book_url)
    if not html:
        return None
    
    soup = parse_html(html)
    if not soup:
        return None
    
    try:
        # Extract description
        description_elem = soup.find('article', class_='product_page').find('p', recursive=False)
        description = description_elem.get_text() if description_elem else 'No description'
        
        # Extract product information table
        table = soup.find('table', class_='table-striped')
        product_info = {}
        
        if table:
            rows = table.find_all('tr')
            for row in rows:
                header = row.find('th').get_text()
                value = row.find('td').get_text()
                product_info[header] = value
        
        # Extract category
        breadcrumb = soup.find('ul', class_='breadcrumb')
        category = breadcrumb.find_all('a')[2].get_text() if breadcrumb else 'Unknown'
        
        return {
            'description': description,
            'upc': product_info.get('UPC', ''),
            'product_type': product_info.get('Product Type', ''),
            'tax': product_info.get('Tax', ''),
            'num_reviews': product_info.get('Number of reviews', 0),
            'category': category
        }
        
    except Exception as e:
        logging.error(f"Error scraping book details: {e}")
        return None


# Example: Scrape details for the first book
if len(df_books) > 0:
    first_book_url = df_books.iloc[0]['product_url']
    print(f"üîç Scraping detailed information...\n")
    print(f"URL: {first_book_url}\n")
    
    details = scrape_book_details(first_book_url)
    
    if details:
        print("üìö Book Details:\n")
        print(json.dumps(details, indent=2))

2025-10-24 17:17:25,357 - INFO - Fetching URL: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html


üîç Scraping detailed information...

URL: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html



2025-10-24 17:17:25,694 - INFO - ‚úÖ Successfully fetched http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html (Status: 200)
2025-10-24 17:17:25,703 - INFO - ‚úÖ HTML parsed successfully


üìö Book Details:

{
  "description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGo

---

## Summary and Best Practices

### What We Learned:
1. ‚úÖ How to fetch HTML content using requests library
2. ‚úÖ How to parse HTML with BeautifulSoup4
3. ‚úÖ How to navigate DOM and extract data
4. ‚úÖ How to handle pagination
5. ‚úÖ How to clean and validate scraped data
6. ‚úÖ How to save data in multiple formats
7. ‚úÖ How to implement error handling and logging

### Web Scraping Best Practices:

#### Legal and Ethical:
- ‚úÖ Always check robots.txt file
- ‚úÖ Read and respect Terms of Service
- ‚úÖ Only scrape publicly available data
- ‚úÖ Use websites designed for scraping practice (like toscrape.com)

#### Technical:
- ‚úÖ Implement rate limiting (delays between requests)
- ‚úÖ Use appropriate User-Agent headers
- ‚úÖ Handle errors gracefully
- ‚úÖ Validate and clean data
- ‚úÖ Log all operations
- ‚úÖ Cache responses when appropriate

#### Code Quality:
- ‚úÖ Write modular, reusable functions
- ‚úÖ Use type hints
- ‚úÖ Add comprehensive docstrings
- ‚úÖ Follow PEP 8 style guidelines
- ‚úÖ Implement proper error handling

### Next Steps:
1. **Apply to Your Project**: Use these techniques for quotes.toscrape.com
2. **Add PySpark**: Process scraped data with PySpark DataFrames
3. **Database Integration**: Store data in PostgreSQL/MongoDB
4. **Automation**: Schedule scraping with cron jobs or Airflow
5. **Advanced Scraping**: Learn Selenium for JavaScript-heavy sites

### Resources:
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Requests Documentation](https://docs.python-requests.org/)
- [Web Scraping Best Practices](https://www.scraperapi.com/blog/web-scraping-best-practices/)
- [Practice Sites](http://toscrape.com/)

---

## üéâ Congratulations!

You've completed this comprehensive web scraping tutorial. You now have the skills to:
- Scrape data from websites
- Parse and extract structured information
- Handle pagination and navigation
- Clean and validate data
- Save data in various formats

**Happy Scraping! üöÄ**