# 🕷️ Web Scraping Training with Python

## Scraping the Books to Scrape Website

**Target Website:** https://books.toscrape.com/

**Learning Objectives:**
- Understand HTML structure and DOM navigation
- Use BeautifulSoup for parsing HTML
- Extract various elements (titles, prices, ratings, images)
- Handle pagination and multiple pages
- Save data to CSV and JSON formats
- Follow best practices and scrape ethically

---

## 📦 Step 1: Install Required Libraries

We'll need:
- **requests**: To fetch web pages
- **beautifulsoup4**: To parse HTML
- **pandas**: To organize and export data
- **lxml**: Parser for BeautifulSoup (typically faster than Python's built-in `html.parser`)

In [None]:
# Install required libraries (run once)
!pip install requests beautifulsoup4 pandas lxml

## 📚 Step 2: Import Libraries

In [17]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.parse import urljoin
import json
import re  # For regex to clean prices

## 🌐 Step 3: Fetch a Web Page

**How it works:**
1. `requests.get()` sends an HTTP GET request
2. Server responds with HTML content
3. We check the status code (200 = success)
4. Access HTML via `.text` attribute

In [18]:
# Define the URL
url = "https://books.toscrape.com/"

# Send GET request
response = requests.get(url)

# Check if request was successful
print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.text)} characters")

# Preview first 500 characters
print("\nHTML Preview:")
print(response.text[:500])

Status Code: 200
Content Length: 51294 characters

HTML Preview:
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /
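
The four steps above can be wrapped in a small helper. This is only a sketch and not part of the original notebook: `fetch_html` and its 10-second `timeout` default are hypothetical names and values chosen for illustration. It adds a timeout and `raise_for_status()` so that a failed request does not pass silently.

```python
import requests


def fetch_html(page_url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        response = requests.get(page_url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of returning them
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text


# A malformed URL is rejected before any network traffic, so this runs offline
print(fetch_html("not-a-valid-url"))
```

Calling `fetch_html(url)` with the notebook's `url` would return the same HTML that `response.text` holds above.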


## 🔍 Step 4: Parse HTML with BeautifulSoup

**BeautifulSoup** converts raw HTML into a navigable tree structure.

**Key Methods:**
- `.find()` - Find first matching element
- `.find_all()` - Find all matching elements
- `.select()` - Use CSS selectors
- `.get_text()` - Extract text content
- `.get('attribute')` - Get attribute value

In [19]:
# Parse HTML
soup = BeautifulSoup(response.text, 'lxml')

# Pretty print the HTML (first 1000 characters)
print(soup.prettify()[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex
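
To see the key methods side by side, the sketch below runs them on a tiny HTML snippet. The snippet is invented for illustration (it only imitates the structures used later in this tutorial), and the `demo_` names are hypothetical so they do not overwrite notebook variables.

```python
from bs4 import BeautifulSoup

# Invented snippet modelled on the structures used in this tutorial
demo_html = """
<article class="product_pod">
  <h3><a href="catalogue/book_1/index.html" title="Book One">Book ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book_2/index.html" title="Book Two">Book ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""

# 'html.parser' ships with Python; 'lxml' is used the same way if installed
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_link = demo_soup.find("a")            # first matching element only
all_links = demo_soup.find_all("a")         # every matching element, as a list
demo_prices = demo_soup.select("article.product_pod p.price_color")  # CSS selector

print(first_link.get("title"))              # attribute value
print(first_link.get_text())                # text content
print(len(all_links))
print([p.get_text() for p in demo_prices])
```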

## 📖 Step 5: Extract Book Titles

**Inspection Process:**
1. Right-click on a book title → Inspect Element
2. Find the HTML structure: `<h3><a title="Book Name">...</a></h3>`
3. Use BeautifulSoup to find all `<h3>` tags
4. Extract the `title` attribute from `<a>` tags

In [20]:
# Find all book titles
book_titles = soup.find_all('h3')

print(f"Found {len(book_titles)} books on this page\n")

# Extract and display titles
titles = []
for i, book in enumerate(book_titles[:5], 1):  # Show first 5
    title = book.find('a')['title']
    titles.append(title)
    print(f"{i}. {title}")

Found 20 books on this page

1. A Light in the Attic
2. Tipping the Velvet
3. Soumission
4. Sharp Objects
5. Sapiens: A Brief History of Humankind
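
A slightly more defensive variant is sketched below. It assumes, as in the invented snippet, that the visible link text may be shortened and that some links might lack a `title` attribute; neither the snippet nor the `demo_` names come from the original notebook.

```python
from bs4 import BeautifulSoup

# Invented snippet: one link with a title attribute, one without
demo_html = """
<h3><a href="a/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="b/index.html">Short Title</a></h3>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_titles = []
for link in demo_soup.select("h3 > a"):
    # Prefer the full title attribute; fall back to the link text if it is absent
    demo_titles.append(link.get("title") or link.get_text(strip=True))

print(demo_titles)
```

Using `.get("title")` instead of `['title']` avoids a `KeyError` when the attribute is missing.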


## 💰 Step 6: Extract Book Prices

**HTML Structure:**
```html
<p class="price_color">£51.77</p>
```

**Extraction Steps:**
1. Find all elements with class `price_color`
2. Get text content
3. Clean the price using regex to extract only numbers
4. Convert to float for calculations

**Note:** The scraped text may show the pound sign as `Â£` instead of `£` (an encoding artifact), so we use a regex to pull out only the numeric part.

In [21]:
def clean_price(price_text):
    """
    Extract numeric price from text, handling encoding issues.
    
    Args:
        price_text (str): Raw price text (e.g., '£51.77' or 'Â£51.77')
    
    Returns:
        float: Cleaned price as a number
    """
    # Use regex to extract only numbers and decimal point
    match = re.search(r'\d+\.\d+', price_text)
    if match:
        return float(match.group())
    return 0.0

# Find all prices
prices = soup.find_all('p', class_='price_color')

print(f"Found {len(prices)} prices\n")

# Extract and clean prices
price_list = []
for i, price in enumerate(prices[:5], 1):  # Show first 5
    price_text = price.get_text()
    price_value = clean_price(price_text)
    price_list.append(price_value)
    print(f"{i}. {price_text} → £{price_value}")

Found 20 prices

1. Â£51.77 → £51.77
2. Â£53.74 → £53.74
3. Â£50.10 → £50.1
4. Â£47.82 → £47.82
5. Â£54.23 → £54.23
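
The stray `Â` in the raw price text is what one would expect when UTF-8 bytes are decoded as ISO-8859-1 (Latin-1). The sketch below reproduces the effect with plain strings; it is an illustration of the likely cause, not a diagnosis of the server.

```python
# The pound sign is two bytes in UTF-8: 0xC2 0xA3
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec produces the stray 'Â'
wrong = raw_bytes.decode("iso-8859-1")
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Assuming the page really is UTF-8, as its `<meta>` tag states, one option with `requests` is to set `response.encoding = "utf-8"` before reading `response.text`, or to pass `response.content` (bytes) to BeautifulSoup and let it detect the encoding. The regex approach in `clean_price` works either way.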


## ⭐ Step 7: Extract Star Ratings

**HTML Structure:**
```html
<p class="star-rating Three">
```

**Rating Mapping:**
- One → 1 star
- Two → 2 stars
- Three → 3 stars
- Four → 4 stars
- Five → 5 stars

In [22]:
# Find all star ratings
ratings = soup.find_all('p', class_='star-rating')

# Rating conversion dictionary
rating_map = {
    'One': 1,
    'Two': 2,
    'Three': 3,
    'Four': 4,
    'Five': 5
}

print(f"Found {len(ratings)} ratings\n")

# Extract ratings
rating_list = []
for i, rating in enumerate(ratings[:5], 1):  # Show first 5
    # Get the second class name (e.g., 'Three' from 'star-rating Three')
    rating_class = rating['class'][1]
    rating_value = rating_map[rating_class]
    rating_list.append(rating_value)
    print(f"{i}. {rating_class} → {rating_value} stars")

Found 20 ratings

1. Three → 3 stars
2. One → 1 stars
3. One → 1 stars
4. Four → 4 stars
5. Five → 5 stars
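
Indexing `rating['class'][1]` relies on the rating word always being the second class. A minimal sketch of a more tolerant lookup follows; `parse_rating` and `RATING_WORDS` are hypothetical names introduced here, not part of the original notebook.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_list):
    """Return the star rating found in a list of CSS classes, or None.

    Scans every class instead of assuming the rating word sits at index 1.
    """
    for name in class_list:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


print(parse_rating(["star-rating", "Three"]))   # usual order
print(parse_rating(["Three", "star-rating"]))   # reversed order still works
print(parse_rating(["star-rating"]))            # no rating word present
```

Returning `None` for a missing rating keeps the gap visible in later analysis instead of raising a `KeyError` or `IndexError` mid-scrape.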


## 📦 Step 8: Extract Availability

**HTML Structure:**
```html
<p class="instock availability">
    <i class="icon-ok"></i>
    In stock
</p>
```

In [23]:
# Find all availability info
availability = soup.find_all('p', class_='instock availability')

print(f"Found {len(availability)} availability statuses\n")

# Extract availability
availability_list = []
for i, avail in enumerate(availability[:5], 1):  # Show first 5
    status = avail.get_text(strip=True)
    availability_list.append(status)
    print(f"{i}. {status}")

Found 20 availability statuses

1. In stock
2. In stock
3. In stock
4. In stock
5. In stock
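
Passing a multi-word string such as `'instock availability'` to `class_` matches the exact attribute value, so it would miss the same classes written in a different order. The sketch below contrasts this with a CSS selector, using an invented two-line snippet.

```python
from bs4 import BeautifulSoup

# Invented snippet: the same two classes, in a different order on each line
demo_html = """
<p class="instock availability">In stock</p>
<p class="availability instock">In stock</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# A multi-word class_ string is compared against the exact attribute value
exact_matches = demo_soup.find_all("p", class_="instock availability")

# A CSS selector requires both classes but ignores their order
css_matches = demo_soup.select("p.instock.availability")

print(len(exact_matches), len(css_matches))
```

On this site both approaches find the same 20 elements; the selector form is simply less sensitive to markup changes.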


## 🖼️ Step 9: Extract Book Image URLs

**HTML Structure:**
```html
<div class="image_container">
    <a href="...">
        <img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" ...>
    </a>
</div>
```

**Note:** Image URLs are relative, so we need to join them with the base URL.

In [None]:
# Find all book images
images = soup.find_all('div', class_='image_container')

print(f"Found {len(images)} images\n")

# Extract image URLs
image_urls = []
for i, img_container in enumerate(images[:5], 1):  # Show first 5
    img_tag = img_container.find('img')
    relative_url = img_tag['src']
    # Convert relative URL to absolute URL
    absolute_url = urljoin(url, relative_url)
    image_urls.append(absolute_url)
    print(f"{i}. {absolute_url}")
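
How `urljoin` resolves a relative path depends on the page it is resolved against. The sketch below shows three cases; the `../media/...` form for pages under `/catalogue/` and the `example.com` URL are assumptions made for illustration.

```python
from urllib.parse import urljoin

home = "https://books.toscrape.com/"
page_2 = "https://books.toscrape.com/catalogue/page-2.html"
image_path = "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"

# Relative to the home page, the path is appended to the site root
from_home = urljoin(home, image_path)

# Relative to a page inside /catalogue/, '..' climbs one directory first
from_page_2 = urljoin(page_2, "../" + image_path)

# An already absolute URL is returned unchanged
absolute = urljoin(page_2, "https://example.com/cover.jpg")

print(from_home)
print(from_page_2)
print(absolute)
```

This is why the base passed to `urljoin` should be the URL of the page the link was found on, not always the site root.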

## 🔗 Step 10: Extract Book Detail Page Links

Each book has a link to its detail page with more information.

In [24]:
# Find all book detail links
book_links = soup.find_all('h3')

print(f"Found {len(book_links)} book links\n")

# Extract URLs
detail_urls = []
for i, book in enumerate(book_links[:5], 1):  # Show first 5
    link = book.find('a')['href']
    absolute_link = urljoin(url, link)
    detail_urls.append(absolute_link)
    print(f"{i}. {absolute_link}")

Found 20 book links

1. https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
2. https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
3. https://books.toscrape.com/catalogue/soumission_998/index.html
4. https://books.toscrape.com/catalogue/sharp-objects_997/index.html
5. https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


## 🎯 Step 11: Complete Function to Scrape All Books on a Page

Let's combine everything into a reusable function with proper price cleaning.

In [25]:
def scrape_books_page(page_url):
    """
    Scrape all books from a single page.
    
    Args:
        page_url (str): URL of the page to scrape
    
    Returns:
        list: List of dictionaries containing book information
    """
    # Fetch the page; raise on HTTP errors (e.g. 404) so callers can react
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    
    # Find all book containers
    books = soup.find_all('article', class_='product_pod')
    
    books_data = []
    
    for book in books:
        # Extract title
        title = book.find('h3').find('a')['title']
        
        # Extract price with proper cleaning
        price_text = book.find('p', class_='price_color').get_text()
        price = clean_price(price_text)
        
        # Extract rating
        rating_class = book.find('p', class_='star-rating')['class'][1]
        rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
        rating = rating_map[rating_class]
        
        # Extract availability
        availability = book.find('p', class_='instock availability').get_text(strip=True)
        
        # Extract image URL
        img_url = urljoin(page_url, book.find('img')['src'])
        
        # Extract detail page URL
        detail_url = urljoin(page_url, book.find('h3').find('a')['href'])
        
        # Create book dictionary
        book_data = {
            'title': title,
            'price': price,
            'rating': rating,
            'availability': availability,
            'image_url': img_url,
            'detail_url': detail_url
        }
        
        books_data.append(book_data)
    
    return books_data

# Test the function
books = scrape_books_page(url)
print(f"Scraped {len(books)} books\n")
print("First book:")
print(json.dumps(books[0], indent=2))

Scraped 20 books

First book:
{
  "title": "A Light in the Attic",
  "price": 51.77,
  "rating": 3,
  "availability": "In stock",
  "image_url": "https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
  "detail_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}
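
One possible refinement is to separate parsing from fetching, so the extraction logic can be tried on a saved or hand-written HTML string without touching the network. The sketch below is a reduced, hypothetical variant (`parse_books`, `RATING_WORDS` and the sample snippet are not from the original notebook) that keeps only four of the fields.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_books(html, page_url):
    """Parse book records from an HTML string (no network access)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for book in soup.select("article.product_pod"):
        link = book.select_one("h3 > a")
        price_text = book.select_one("p.price_color").get_text()
        price_match = re.search(r"\d+\.\d+", price_text)
        rating_classes = book.select_one("p.star-rating")["class"]
        rating = next((RATING_WORDS[c] for c in rating_classes if c in RATING_WORDS), None)
        records.append({
            "title": link["title"],
            "price": float(price_match.group()) if price_match else None,
            "rating": rating,
            "detail_url": urljoin(page_url, link["href"]),
        })
    return records


# Hand-written sample imitating one product card
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
sample_books = parse_books(sample_html, "https://books.toscrape.com/")
print(sample_books)
```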


## 📄 Step 12: Handle Pagination (Multiple Pages)

The website has multiple pages. Let's scrape all of them!

**Pagination Pattern:**
- Page 1: `https://books.toscrape.com/catalogue/page-1.html`
- Page 2: `https://books.toscrape.com/catalogue/page-2.html`
- etc.

In [26]:
def scrape_all_pages(base_url, max_pages=5):
    """
    Scrape books from multiple pages.
    
    Args:
        base_url (str): Base URL of the website
        max_pages (int): Maximum number of pages to scrape
    
    Returns:
        list: Combined list of all books from all pages
    """
    all_books = []
    
    for page_num in range(1, max_pages + 1):
        # Construct page URL
        if page_num == 1:
            page_url = base_url
        else:
            page_url = f"{base_url}catalogue/page-{page_num}.html"
        
        print(f"Scraping page {page_num}: {page_url}")
        
        try:
            # Scrape the page
            books = scrape_books_page(page_url)
            all_books.extend(books)
            print(f"  → Found {len(books)} books")
            
            # Be polite: wait 1 second between requests
            time.sleep(1)
            
        except Exception as e:
            print(f"  → Error: {e}")
            break
    
    return all_books

# Scrape first 3 pages
all_books = scrape_all_pages(url, max_pages=3)
print(f"\nTotal books scraped: {len(all_books)}")

Scraping page 1: https://books.toscrape.com/
  → Found 20 books
Scraping page 2: https://books.toscrape.com/catalogue/page-2.html
  → Found 20 books
Scraping page 3: https://books.toscrape.com/catalogue/page-3.html
  → Found 20 books

Total books scraped: 60
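
An alternative to constructing page URLs by number is to follow the pager's "next" link until it disappears. The sketch below assumes the pager marks that link as `<li class="next"><a href="...">`; that markup, the sample strings and the name `find_next_page` are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page(html, current_url):
    """Return the absolute URL of the next page, or None on the last page.

    Assumes the pager marks its link as <li class="next"><a href="...">.
    """
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("li.next > a")
    if next_link is None:
        return None
    return urljoin(current_url, next_link["href"])


# Hand-written pager fragments: one with a next link, one without
middle_page = '<ul class="pager"><li class="next"><a href="page-3.html">next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

current = "https://books.toscrape.com/catalogue/page-2.html"
next_from_middle = find_next_page(middle_page, current)
next_from_last = find_next_page(last_page, current)
print(next_from_middle)
print(next_from_last)
```

A loop built on this helper stops by itself on the last page, so no `max_pages` guess is needed, although a cap is still a sensible safety limit.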


## 📊 Step 13: Convert to Pandas DataFrame

DataFrames make it easy to analyze and export data.

In [27]:
# Create DataFrame
df = pd.DataFrame(all_books)

# Display basic info
print("Dataset Info:")
print(df.info())
print("\nFirst 5 rows:")
df.head()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         60 non-null     object 
 1   price         60 non-null     float64
 2   rating        60 non-null     int64  
 3   availability  60 non-null     object 
 4   image_url     60 non-null     object 
 5   detail_url    60 non-null     object 
dtypes: float64(1), int64(1), object(4)
memory usage: 2.9+ KB
None

First 5 rows:


| | title | price | rating | availability | image_url | detail_url |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 3 | In stock | https://books.toscrape.com/media/cache/2c/da/2... | https://books.toscrape.com/catalogue/a-light-i... |
| 1 | Tipping the Velvet | 53.74 | 1 | In stock | https://books.toscrape.com/media/cache/26/0c/2... | https://books.toscrape.com/catalogue/tipping-t... |
| 2 | Soumission | 50.1 | 1 | In stock | https://books.toscrape.com/media/cache/3e/ef/3... | https://books.toscrape.com/catalogue/soumissio... |
| 3 | Sharp Objects | 47.82 | 4 | In stock | https://books.toscrape.com/media/cache/32/51/3... | https://books.toscrape.com/catalogue/sharp-obj... |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 5 | In stock | https://books.toscrape.com/media/cache/be/a5/b... | https://books.toscrape.com/catalogue/sapiens-a... |


## 📈 Step 14: Data Analysis

Let's analyze the scraped data!

In [28]:
# Price statistics
print("Price Statistics:")
print(df['price'].describe())

print("\nRating Distribution:")
print(df['rating'].value_counts().sort_index())

print("\nMost Expensive Books:")
print(df.nlargest(5, 'price')[['title', 'price', 'rating']])

print("\nCheapest Books:")
print(df.nsmallest(5, 'price')[['title', 'price', 'rating']])

Price Statistics:
count    60.000000
mean     35.002667
std      14.553082
min      12.840000
25%      22.040000
50%      33.485000
75%      50.142500
max      57.310000
Name: price, dtype: float64

Rating Distribution:
rating
1    15
2     8
3    13
4    10
5    14
Name: count, dtype: int64

Most Expensive Books:
                                                title  price  rating
40                     Slow States of Collapse: Poems  57.31       3
15  Our Band Could Be Your Life: Scenes from the A...  57.25       3
58                                The Past Never Ends  56.50       4
57  The Pioneer Woman Cooks: Dinnertime: Comfort C...  56.41       1
56                    The Secret of Dreadwillow Carse  56.13       1

Cheapest Books:
                                                title  price  rating
20                                        In Her Wake  12.84       1
10     Starving Hearts (Triangular Trade Trilogy, #1)  13.99       2
47            Untitled Collection: Sabbath Poe
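
A natural follow-up question is whether price varies with rating. The sketch below shows one way to ask it with `groupby`; the four rows are invented sample data with the same column names as the scraped DataFrame, so the numbers say nothing about the real site.

```python
import pandas as pd

# Invented sample rows using the same column names as the scraped data
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "price": [10.0, 20.0, 30.0, 50.0],
    "rating": [1, 1, 5, 5],
})

# Mean price and number of books for each star rating
by_rating = sample_df.groupby("rating")["price"].agg(["mean", "count"])
print(by_rating)
```

Replacing `sample_df` with the notebook's `df` would give the same summary for the scraped books.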

## 💾 Step 15: Save Data to CSV

In [29]:
# Save to CSV
df.to_csv('books_data.csv', index=False)
print("✓ Data saved to books_data.csv")

✓ Data saved to books_data.csv


## 📝 Step 16: Save Data to JSON

In [30]:
# Save to JSON
df.to_json('books_data.json', orient='records', indent=2)
print("✓ Data saved to books_data.json")

✓ Data saved to books_data.json
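
When pandas is not available, the same two exports can be sketched with the standard library. The example writes to in-memory text instead of files so it leaves nothing on disk; the variable names are hypothetical, and the two rows reuse values from the scraped output above.

```python
import csv
import io
import json

rows = [
    {"title": "A Light in the Attic", "price": 51.77, "rating": 3},
    {"title": "Tipping the Velvet", "price": 53.74, "rating": 1},
]

# Write to an in-memory buffer; open("books.csv", "w", newline="",
# encoding="utf-8") would send the same content to a file instead
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "price", "rating"])
writer.writeheader()
writer.writerows(rows)

# ensure_ascii=False keeps characters such as '£' readable in the output
json_text = json.dumps(rows, indent=2, ensure_ascii=False)

print(csv_buffer.getvalue())
print(json_text)
```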


## 🔍 Step 17: Scrape Individual Book Details

Let's scrape more detailed information from individual book pages.

In [31]:
def scrape_book_details(book_url):
    """
    Scrape detailed information from a book's detail page.
    
    Args:
        book_url (str): URL of the book detail page
    
    Returns:
        dict: Dictionary containing detailed book information
    """
    response = requests.get(book_url, timeout=10)
    response.raise_for_status()  # Stop early on HTTP errors such as 404
    soup = BeautifulSoup(response.text, 'lxml')
    
    # Extract product information table
    table = soup.find('table', class_='table table-striped')
    
    details = {}
    
    # Extract UPC, Product Type, Price (excl. tax), etc.
    for row in table.find_all('tr'):
        header = row.find('th').get_text()
        value = row.find('td').get_text()
        details[header] = value
    
    # Extract description
    description_tag = soup.find('div', id='product_description')
    if description_tag:
        description = description_tag.find_next('p').get_text()
        details['Description'] = description
    else:
        details['Description'] = 'No description available'
    
    # Extract category
    breadcrumb = soup.find('ul', class_='breadcrumb')
    category = breadcrumb.find_all('a')[2].get_text(strip=True)
    details['Category'] = category
    
    return details

# Test with first book
first_book_url = all_books[0]['detail_url']
print(f"Scraping details for: {all_books[0]['title']}")
print(f"URL: {first_book_url}\n")

book_details = scrape_book_details(first_book_url)
print("Book Details:")
for key, value in book_details.items():
    print(f"{key}: {value}")

Scraping details for: A Light in the Attic
URL: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

Book Details:
UPC: a897fe39b1053632
Product Type: Books
Price (excl. tax): Â£51.77
Price (incl. tax): Â£51.77
Tax: Â£0.00
Availability: In stock (22 available)
Number of reviews: 0
Description: It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th
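
The detail values arrive as plain strings, so numeric fields still need cleaning. As a small sketch, the stock count can be pulled out of the `Availability` text with a regex; `parse_stock_count` is a hypothetical helper, and it assumes the `(N available)` wording seen in the output above.

```python
import re


def parse_stock_count(availability_text):
    """Return the number in '(N available)', or None if it is not present."""
    match = re.search(r"\((\d+) available\)", availability_text)
    return int(match.group(1)) if match else None


print(parse_stock_count("In stock (22 available)"))
print(parse_stock_count("In stock"))
```

The price fields in the same table can be passed through `clean_price` from Step 6 in the same way.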

## 🎨 Step 18: Advanced - Scrape by Category

The website organizes books by categories. Let's scrape a specific category.

In [None]:
def get_categories(base_url):
    """
    Get all book categories from the website.
    
    Returns:
        dict: Dictionary mapping category names to URLs
    """
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    # Find category sidebar
    category_list = soup.find('ul', class_='nav nav-list').find('ul')
    
    categories = {}
    for link in category_list.find_all('a'):
        category_name = link.get_text(strip=True)
        category_url = urljoin(base_url, link['href'])
        categories[category_name] = category_url
    
    return categories

# Get all categories
categories = get_categories(url)
print(f"Found {len(categories)} categories:\n")
# Use a distinct loop variable so the global `url` is not overwritten
for i, (name, category_url) in enumerate(list(categories.items())[:10], 1):
    print(f"{i}. {name}")

## 🛡️ Step 19: Best Practices & Ethics

### ✅ DO:
- Check `robots.txt` (https://books.toscrape.com/robots.txt)
- Add delays between requests (`time.sleep()`)
- Use User-Agent headers
- Handle errors gracefully
- Respect rate limits

### ❌ DON'T:
- Scrape personal data without permission
- Overload servers with too many requests
- Ignore Terms of Service
- Scrape copyrighted content for commercial use

In [None]:
# Example: Adding User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get(url, headers=headers)
print(f"Request with User-Agent: {response.status_code}")
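
Checking `robots.txt` can be automated with the standard library's `urllib.robotparser`. The sketch below parses invented rules from a list of lines so that it runs offline; the rules, the `example.com` URLs and the `MyScraper` agent name are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real file would be loaded with
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch("MyScraper", "https://example.com/catalogue/page-2.html")
blocked = parser.can_fetch("MyScraper", "https://example.com/private/data.html")
print(allowed, blocked)
```

Calling `can_fetch` before each request lets the scraper skip any path the site has asked crawlers to avoid.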

## 🎯 Step 20: Complete Project - Scrape All Books

Let's put it all together!

In [None]:
def complete_scraping_project(base_url, max_pages=5, include_details=False):
    """
    Complete web scraping project.
    
    Args:
        base_url (str): Base URL of the website
        max_pages (int): Number of pages to scrape
        include_details (bool): Whether to scrape individual book details
    
    Returns:
        pd.DataFrame: DataFrame containing all scraped data
    """
    print("🕷️ Starting Web Scraping Project...\n")
    
    # Step 1: Scrape all pages
    print(f"📄 Scraping {max_pages} pages...")
    all_books = scrape_all_pages(base_url, max_pages)
    print(f"✓ Scraped {len(all_books)} books\n")
    
    # Step 2: Convert to DataFrame
    df = pd.DataFrame(all_books)
    
    # Step 3: Optionally scrape details
    if include_details:
        print("📖 Scraping individual book details...")
        details_list = []
        for i, book in enumerate(all_books[:10], 1):  # Limit to 10 for demo
            print(f"  {i}/10: {book['title'][:50]}...")
            details = scrape_book_details(book['detail_url'])
            details_list.append(details)
            time.sleep(1)
        
        # Merge details with main DataFrame
        details_df = pd.DataFrame(details_list)
        df = pd.concat([df.iloc[:10], details_df], axis=1)
    
    # Step 4: Save data
    print("\n💾 Saving data...")
    df.to_csv('complete_books_data.csv', index=False)
    df.to_json('complete_books_data.json', orient='records', indent=2)
    print("✓ Data saved to CSV and JSON\n")
    
    # Step 5: Display summary
    print("📊 Summary Statistics:")
    print(f"Total Books: {len(df)}")
    print(f"Average Price: £{df['price'].mean():.2f}")
    print(f"Price Range: £{df['price'].min():.2f} - £{df['price'].max():.2f}")
    print(f"Average Rating: {df['rating'].mean():.2f} stars")
    
    return df

# Run the complete project
final_df = complete_scraping_project(url, max_pages=3, include_details=False)
final_df.head(10)

## 🎓 Summary

### What You Learned:

1. ✅ **HTTP Requests** - Fetching web pages with `requests`
2. ✅ **HTML Parsing** - Using BeautifulSoup to navigate DOM
3. ✅ **Element Selection** - `.find()`, `.find_all()`, CSS selectors
4. ✅ **Data Extraction** - Text, attributes, images, links
5. ✅ **Data Cleaning** - Handling encoding issues with regex
6. ✅ **Pagination** - Scraping multiple pages
7. ✅ **Data Organization** - Pandas DataFrames
8. ✅ **Data Export** - CSV and JSON formats
9. ✅ **Best Practices** - Delays, error handling, ethics

### Next Steps:

- 🔧 Try scraping other websites
- 📊 Add data visualization (matplotlib, seaborn)
- 🗄️ Store data in databases (SQLite, PostgreSQL)
- 🤖 Automate with scheduling (cron, Task Scheduler)
- 🌐 Learn Selenium for JavaScript-heavy sites

---

**Happy Scraping! 🕷️**