# Objective:

Scrape [books.toscrape.com](https://books.toscrape.com) with `BeautifulSoup4` and `Requests`, collecting the following fields for every book:

* category
* code
* cover
* title
* rating
* price (excl. tax)
* price (incl. tax)
* tax
* stock status
* number of stock available
* description
* number of reviews

In [1]:
# --- Install Dependencies ---
!pip install requests beautifulsoup4 pandas tqdm



# Import Libraries

In [12]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
import time

# Scraping Process

In [13]:
# Base URL
BASE_URL = "https://books.toscrape.com/"

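# Fetch a page and return it parsed as a BeautifulSoup object
def get_soup(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Decode as UTF-8 explicitly; otherwise requests may fall back to
    # ISO-8859-1 and "£" comes out as "Â£"
    response.encoding = "utf-8"
    return BeautifulSoup(response.text, "html.parser")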
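
The text that reaches the parser depends on which encoding is assumed for the response bytes. As a minimal offline sketch (no network call; the price string is a made-up sample, not data fetched from the site), the following shows how UTF-8 bytes decoded as ISO-8859-1 turn `£` into `Â£`, and how the damage can be reversed:

```python
# Hypothetical sample: the UTF-8 bytes a server might send for a price.
raw = "£45.17".encode("utf-8")

# Decoding with the wrong codec produces the familiar mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the bytes were written in restores the text.
right = raw.decode("utf-8")

# Already-garbled text can often be repaired by reversing the mistake.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(wrong)     # garbled form
print(right)     # intended form
print(repaired)  # same as the intended form
```

With `requests`, two common ways to apply this are setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` (raw bytes) to `BeautifulSoup` so the parser detects the encoding itself.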

In [14]:
# Get all categories
def get_categories():
    soup = get_soup(BASE_URL)
    categories = {}
    cat_section = soup.find('ul', class_='nav-list').find('ul').find_all('a')
    for cat in cat_section:
        name = cat.text.strip()
        link = BASE_URL + cat['href']
        categories[name] = link
    return categories

In [15]:
# Extract book details from book page
def get_book_details(book_url, category):
    soup = get_soup(book_url)

    # Extract product info table
    table = soup.find('table', class_='table table-striped')
    data = {row.th.text.strip(): row.td.text.strip() for row in table.find_all('tr')}

    # Extract other fields
    title = soup.find('div', class_='product_main').h1.text.strip()
    rating = soup.find('p', class_='star-rating')['class'][1]
    price_excl = data.get('Price (excl. tax)')
    price_incl = data.get('Price (incl. tax)')
    tax = data.get('Tax')
    upc = data.get('UPC')
    stock_text = data.get('Availability', '')
    stock_number = ''.join(filter(str.isdigit, stock_text))
    description_tag = soup.select_one('#product_description + p')
    description = description_tag.text.strip() if description_tag else ''
    num_reviews = data.get('Number of reviews', '0')
    cover = BASE_URL + soup.find('div', class_='item active').img['src'].replace('../', '')

    return {
        'category': category,
        'code (UPC)': upc,
        'cover': cover,
        'title': title,
        'rating': rating,
        'price (excl. tax)': price_excl,
        'price (incl. tax)': price_incl,
        'tax': tax,
        'stock status': 'In stock' if 'In stock' in stock_text else 'Out of stock',
        'number of stock available': stock_number,
        'description': description,
        'number of reviews': num_reviews
    }
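
The function above returns every field as a string, so prices still carry the currency symbol and the rating is a word such as `Two`. A possible cleaning step is sketched below, using only the standard library. The helper names (`parse_price`, `parse_stock`, `parse_rating`) are hypothetical and not part of the notebook, and the availability format `In stock (19 available)` is an assumption about the page text.

```python
import re

# Hypothetical mapping from the star-rating class word to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text):
    """Return the numeric part of a price such as '£45.17' as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None

def parse_stock(text):
    """Return the count from 'In stock (19 available)', or 0 if absent."""
    match = re.search(r"\((\d+) available\)", text or "")
    return int(match.group(1)) if match else 0

def parse_rating(word):
    """Map a rating word to an integer, or None if the word is unknown."""
    return RATING_WORDS.get(word)

print(parse_price("£45.17"))
print(parse_stock("In stock (19 available)"))
print(parse_rating("Two"))
```

Because `parse_price` searches for the first number rather than stripping a fixed prefix, it also tolerates a mis-decoded currency symbol.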


In [16]:
# Scrape all books from category pages
def scrape_books_in_category(category_name, category_url):
    books = []
    page_url = category_url

    while True:
        soup = get_soup(page_url)
        book_links = [BASE_URL + 'catalogue/' + a['href'].replace('../../../', '') for a in soup.select('h3 a')]

        for link in book_links:
            try:
                books.append(get_book_details(link, category_name))
                time.sleep(0.2)
            except Exception as e:
                print(f"Error on {link}: {e}")
                continue

        # Next page
        next_page = soup.select_one('li.next > a')
        if next_page:
            next_href = next_page['href']
            page_url = category_url.rsplit('/', 1)[0] + '/' + next_href
        else:
            break
    return books
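
The functions above build absolute URLs by stripping `../` segments and concatenating strings. An alternative, shown here only as a sketch, is `urllib.parse.urljoin`, which resolves a relative `href` against the URL of the page it was found on. The URLs and hrefs below are illustrative samples in the shape the site uses, and the cover path is a hypothetical placeholder.

```python
from urllib.parse import urljoin

# Sample category page URL (illustrative).
category_url = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"

# Relative links as they might appear on that page (illustrative).
book_href = "../../../sharp-objects_997/index.html"
next_href = "page-2.html"

# Resolve each link against the page that contains it.
book_url = urljoin(category_url, book_href)
next_url = urljoin(category_url, next_href)

# A cover image src is relative to the book page (hypothetical path).
cover_url = urljoin(book_url, "../../media/cache/aa/bb/example.jpg")

print(book_url)
print(next_url)
print(cover_url)
```

Resolving against the current page rather than a fixed base keeps the logic correct if the depth of the relative links ever changes.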

In [17]:
# Main Scraping Process
all_books = []
categories = get_categories()

print(f"Found {len(categories)} categories.\n")

for cat_name, cat_url in tqdm(categories.items(), desc="Scraping categories"):
    print(f"\n--- Scraping Category: {cat_name} ---")
    books_in_cat = scrape_books_in_category(cat_name, cat_url)
    all_books.extend(books_in_cat)

Found 50 categories.



Scraping categories:   0%|          | 0/50 [00:00<?, ?it/s]


--- Scraping Category: Travel ---


Scraping categories:   2%|▏         | 1/50 [00:03<03:05,  3.78s/it]


--- Scraping Category: Mystery ---


Scraping categories:   4%|▍         | 2/50 [00:14<06:16,  7.84s/it]


--- Scraping Category: Historical Fiction ---


Scraping categories:   6%|▌         | 3/50 [00:23<06:28,  8.27s/it]


--- Scraping Category: Sequential Art ---


Scraping categories:   8%|▊         | 4/50 [00:48<11:29, 14.98s/it]


--- Scraping Category: Classics ---


Scraping categories:  10%|█         | 5/50 [00:55<08:58, 11.96s/it]


--- Scraping Category: Philosophy ---


Scraping categories:  12%|█▏        | 6/50 [00:58<06:45,  9.21s/it]


--- Scraping Category: Romance ---


Scraping categories:  14%|█▍        | 7/50 [01:10<07:11, 10.04s/it]


--- Scraping Category: Womens Fiction ---


Scraping categories:  16%|█▌        | 8/50 [01:16<06:04,  8.67s/it]


--- Scraping Category: Fiction ---


Scraping categories:  18%|█▊        | 9/50 [01:38<08:50, 12.93s/it]


--- Scraping Category: Childrens ---


Scraping categories:  20%|██        | 10/50 [01:48<07:58, 11.96s/it]


--- Scraping Category: Religion ---


Scraping categories:  22%|██▏       | 11/50 [01:51<05:52,  9.05s/it]


--- Scraping Category: Nonfiction ---


Scraping categories:  24%|██▍       | 12/50 [02:28<11:16, 17.81s/it]


--- Scraping Category: Music ---


Scraping categories:  26%|██▌       | 13/50 [02:33<08:32, 13.85s/it]


--- Scraping Category: Default ---


Scraping categories:  28%|██▊       | 14/50 [03:24<15:04, 25.12s/it]


--- Scraping Category: Science Fiction ---


Scraping categories:  30%|███       | 15/50 [03:30<11:12, 19.20s/it]


--- Scraping Category: Sports and Games ---


Scraping categories:  32%|███▏      | 16/50 [03:32<07:54, 13.97s/it]


--- Scraping Category: Add a comment ---


Scraping categories:  34%|███▍      | 17/50 [03:54<09:08, 16.62s/it]


--- Scraping Category: Fantasy ---


Scraping categories:  36%|███▌      | 18/50 [04:11<08:50, 16.57s/it]


--- Scraping Category: New Adult ---


Scraping categories:  38%|███▊      | 19/50 [04:13<06:18, 12.22s/it]


--- Scraping Category: Young Adult ---


Scraping categories:  40%|████      | 20/50 [04:31<07:01, 14.06s/it]


--- Scraping Category: Science ---


Scraping categories:  42%|████▏     | 21/50 [04:36<05:26, 11.27s/it]


--- Scraping Category: Poetry ---


Scraping categories:  44%|████▍     | 22/50 [04:42<04:34,  9.81s/it]


--- Scraping Category: Paranormal ---


Scraping categories:  46%|████▌     | 23/50 [04:43<03:09,  7.00s/it]


--- Scraping Category: Art ---


Scraping categories:  48%|████▊     | 24/50 [04:46<02:28,  5.73s/it]


--- Scraping Category: Psychology ---


Scraping categories:  50%|█████     | 25/50 [04:48<01:58,  4.75s/it]


--- Scraping Category: Autobiography ---


Scraping categories:  52%|█████▏    | 26/50 [04:51<01:42,  4.26s/it]


--- Scraping Category: Parenting ---


Scraping categories:  54%|█████▍    | 27/50 [04:52<01:11,  3.12s/it]


--- Scraping Category: Adult Fiction ---


Scraping categories:  56%|█████▌    | 28/50 [04:52<00:51,  2.32s/it]


--- Scraping Category: Humor ---


Scraping categories:  58%|█████▊    | 29/50 [04:56<00:56,  2.67s/it]


--- Scraping Category: Horror ---


Scraping categories:  60%|██████    | 30/50 [05:02<01:13,  3.65s/it]


--- Scraping Category: History ---


Scraping categories:  62%|██████▏   | 31/50 [05:08<01:23,  4.41s/it]


--- Scraping Category: Food and Drink ---


Scraping categories:  64%|██████▍   | 32/50 [05:18<01:51,  6.18s/it]


--- Scraping Category: Christian Fiction ---


Scraping categories:  66%|██████▌   | 33/50 [05:20<01:24,  4.96s/it]


--- Scraping Category: Business ---


Scraping categories:  68%|██████▊   | 34/50 [05:24<01:15,  4.73s/it]


--- Scraping Category: Biography ---


Scraping categories:  70%|███████   | 35/50 [05:26<00:58,  3.88s/it]


--- Scraping Category: Thriller ---


Scraping categories:  72%|███████▏  | 36/50 [05:30<00:54,  3.88s/it]


--- Scraping Category: Contemporary ---


Scraping categories:  74%|███████▍  | 37/50 [05:31<00:39,  3.05s/it]


--- Scraping Category: Spirituality ---


Scraping categories:  76%|███████▌  | 38/50 [05:33<00:33,  2.78s/it]


--- Scraping Category: Academic ---


Scraping categories:  78%|███████▊  | 39/50 [05:34<00:22,  2.08s/it]


--- Scraping Category: Self Help ---


Scraping categories:  80%|████████  | 40/50 [05:36<00:19,  1.97s/it]


--- Scraping Category: Historical ---


Scraping categories:  82%|████████▏ | 41/50 [05:36<00:14,  1.61s/it]


--- Scraping Category: Christian ---


Scraping categories:  84%|████████▍ | 42/50 [05:37<00:11,  1.46s/it]


--- Scraping Category: Suspense ---


Scraping categories:  86%|████████▌ | 43/50 [05:38<00:08,  1.15s/it]


--- Scraping Category: Short Stories ---


Scraping categories:  88%|████████▊ | 44/50 [05:38<00:05,  1.06it/s]


--- Scraping Category: Novels ---


Scraping categories:  90%|█████████ | 45/50 [05:39<00:03,  1.26it/s]


--- Scraping Category: Health ---


Scraping categories:  92%|█████████▏| 46/50 [05:40<00:03,  1.00it/s]


--- Scraping Category: Politics ---


Scraping categories:  94%|█████████▍| 47/50 [05:41<00:03,  1.05s/it]


--- Scraping Category: Cultural ---


Scraping categories:  96%|█████████▌| 48/50 [05:42<00:01,  1.15it/s]


--- Scraping Category: Erotica ---


Scraping categories:  98%|█████████▊| 49/50 [05:42<00:00,  1.33it/s]


--- Scraping Category: Crime ---


Scraping categories: 100%|██████████| 50/50 [05:43<00:00,  6.87s/it]


In [18]:
# Save to CSV
df = pd.DataFrame(all_books)
df.to_csv('books_data.csv', index=False)
print("\n✅ Scraping completed! Data saved to 'books_data.csv'.")


✅ Scraping completed! Data saved to 'books_data.csv'.


In [19]:
# Display sample data
df.head()

| | category | code (UPC) | cover | title | rating | price (excl. tax) | price (incl. tax) | tax | stock status | number of stock available | description | number of reviews |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Travel | a22124811bfa8350 | https://books.toscrape.com/media/cache/6d/41/6... | It's Only the Himalayas | Two | £45.17 | £45.17 | £0.00 | In stock | 19 | “Wherever you go, whatever you do, just . . ... | 0 |
| 1 | Travel | ce60436f52c5ee68 | https://books.toscrape.com/media/cache/fe/8a/f... | Full Moon over Noah’s Ark: An Odyssey to Mou... | Four | £49.43 | £49.43 | £0.00 | In stock | 15 | Acclaimed travel writer Rick Antonson sets his... | 0 |
| 2 | Travel | f9705c362f070608 | https://books.toscrape.com/media/cache/c7/1a/c... | See America: A Celebration of Our National Par... | Three | £48.87 | £48.87 | £0.00 | In stock | 14 | To coincide with the 2016 centennial anniversa... | 0 |
| 3 | Travel | 1809259a5a5f1d8d | https://books.toscrape.com/media/cache/ca/30/c... | Vagabonding: An Uncommon Guide to the Art of L... | Two | £36.94 | £36.94 | £0.00 | In stock | 8 | With a new foreword by Tim Ferriss •There... | 0 |
| 4 | Travel | a94350ee74deaa07 | https://books.toscrape.com/media/cache/45/21/4... | Under the Tuscan Sun | Three | £37.33 | £37.33 | £0.00 | In stock | 7 | A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER... | 0 |
