# Web Scrapping lab

In this lab you will scrappe this [website](https://books.toscrape.com/) of books.

You have to create a Pandas DataFrame with all the books listed in the page. Each row of the DataFrame should contain information of each book. In particular, the DataFrmae must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scrapping!



# Server verification

Load the needed libraries, and make sure thar you can obtain the correct status code.

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a GET request to the website
url = "https://books.toscrape.com/"
response = requests.get(url)

response

<Response [200]>

In [13]:
# Step 2: Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')


# Book categories

Create the code to collect the **relative urls** from the left panel to obtain a list with all the book categories.

In [20]:
# Step 3: Find the left panel with categories
category_section = soup.find('ul', class_='nav-list')

# Step 4: Extract the relative URLs of the book categories
categories = category_section.find_all('a')

# Step 5: Create a list of categories with relative URLs
category_urls = []
for category in categories:
    category_name = category.text.strip()
    relative_url = category['href']
    category_urls.append((category_name, relative_url))

# Display the list of categories and their URLs
for category in category_urls:
    print(f"Category: {category[0]}, URL: {category[1]}")

Category: Books, URL: catalogue/category/books_1/index.html
Category: Travel, URL: catalogue/category/books/travel_2/index.html
Category: Mystery, URL: catalogue/category/books/mystery_3/index.html
Category: Historical Fiction, URL: catalogue/category/books/historical-fiction_4/index.html
Category: Sequential Art, URL: catalogue/category/books/sequential-art_5/index.html
Category: Classics, URL: catalogue/category/books/classics_6/index.html
Category: Philosophy, URL: catalogue/category/books/philosophy_7/index.html
Category: Romance, URL: catalogue/category/books/romance_8/index.html
Category: Womens Fiction, URL: catalogue/category/books/womens-fiction_9/index.html
Category: Fiction, URL: catalogue/category/books/fiction_10/index.html
Category: Childrens, URL: catalogue/category/books/childrens_11/index.html
Category: Religion, URL: catalogue/category/books/religion_12/index.html
Category: Nonfiction, URL: catalogue/category/books/nonfiction_13/index.html
Category: Music, URL: catalo

# Books in a given category

Use. web scrapping and list comprehension to obtain the **absolute** url of each book to be scraped

In [5]:
# Base URL
base_url = "https://books.toscrape.com/"

# Step 1: Request the category page
category_url = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"
response = requests.get(category_url)

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find all the book URLs
books = soup.find_all('article', class_='product_pod')

# Step 4: Use list comprehension to build absolute URLs
book_urls = [base_url + book.h3.a['href'].replace('../../../', 'catalogue/') for book in books]

# Display the absolute URLs
for url in book_urls:
    print(url)

https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html
https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html
https://books.toscrape.com/catalogue/see-america-a-celebration-of-our-national-parks-treasured-sites_732/index.html
https://books.toscrape.com/catalogue/vagabonding-an-uncommon-guide-to-the-art-of-long-term-world-travel_552/index.html
https://books.toscrape.com/catalogue/under-the-tuscan-sun_504/index.html
https://books.toscrape.com/catalogue/a-summer-in-europe_458/index.html
https://books.toscrape.com/catalogue/the-great-railway-bazaar_446/index.html
https://books.toscrape.com/catalogue/a-year-in-provence-provence-1_421/index.html
https://books.toscrape.com/catalogue/the-road-to-little-dribbling-adventures-of-an-american-in-britain-notes-from-a-small-island-2_277/index.html
https://books.toscrape.com/catalogue/neither-here-nor-there-travels-in-europe_198/index.html
https://books.toscrape.com/catalo

# Book details

Create a Python function that given a book_url as an input returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary given in the Product description, and the values are the book's associated information.

In [11]:
def scrape_book_details(book_url):
    # Step 1: Send a request to the book page
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Step 2: Extract book details
    title = soup.find('h1').text
    price = soup.find('p', class_='price_color').text
    availability = soup.find('p', class_='instock availability').text.strip()
    
    # Extract rating (as a class name that represents the rating)
    rating = soup.find('p', class_='star-rating')['class'][1]
    
    # Extract description from the Product Description section
    description = soup.find('meta', {'name': 'description'})['content'].strip()
    description = ' '.join(description.split()[:50])  # Limit description to first 50 words
    
    # Extract UPC from the table of product information
    table = soup.find('table', class_='table table-striped')
    upc = table.find('td').text

    # Step 3: Create a dictionary with the extracted data
    book_details = {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }
    
    return book_details

# Example Usage:
book_url = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
details = scrape_book_details(book_url)
print(details)

{'Title': "It's Only the Himalayas", 'Price': '£45.17', 'Availability': 'In stock (19 available)', 'Rating': 'Two', 'Description': '“Wherever you go, whatever you do, just . . . don’t do anything stupid.” —My MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lions', 'UPC': 'a22124811bfa8350'}


# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in a given category to collect any book information using the previous function. Fill the previous dictionary with the information about each book.

Show the first five rows of the previous final Pandas DataFrame.

Tip: You can use the function `tqdm` from the library `tqdm` to show a progress bar if in iterable of a for loop as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```





In [17]:
from tqdm import tqdm
# Step 1: Initialize the books dictionary
books_dict = {
    "Title": [], 
    "Price": [], 
    "Availability": [], 
    "Rating": [], 
    "Description": [], 
    "UPC": [], 
    "Category": []
}

# Function to scrape book details with error handling
def scrape_book_details(book_url):
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    title = soup.find('h1').text if soup.find('h1') else "N/A"
    price = soup.find('p', class_='price_color').text if soup.find('p', class_='price_color') else "N/A"
    availability = soup.find('p', class_='instock availability').text.strip() if soup.find('p', class_='instock availability') else "N/A"
    rating_tag = soup.find('p', class_='star-rating')
    rating = rating_tag['class'][1] if rating_tag else "N/A"
    description_tag = soup.find('meta', {'name': 'description'})
    description = description_tag['content'].strip()[:150] + '...' if description_tag else "No description available."
    table = soup.find('table', class_='table table-striped')
    upc = table.find('td').text if table and table.find('td') else "N/A"

    return {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }

# Base URL for the website
base_url = "https://books.toscrape.com/"

# Step 2: Scrape category URLs
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')
category_section = soup.find('ul', class_='nav-list')
categories = category_section.find_all('a')
category_urls = [(cat.text.strip(), base_url + cat['href'].replace('../', '')) for cat in categories]

# Step 3: Iterate over all categories and books
for category_name, category_url in tqdm(category_urls):
    # Get the category page
    category_response = requests.get(category_url)
    category_soup = BeautifulSoup(category_response.content, 'html.parser')
    
    # Find all books in the category
    books = category_soup.find_all('article', class_='product_pod')
    book_urls = [base_url + book.h3.a['href'].replace('../../../', 'catalogue/') for book in books]
    
    # Scrape details for each book
    for book_url in book_urls:
        try:
            book_details = scrape_book_details(book_url)
            book_details["Category"] = category_name  # Add category to book details
            
            # Append data to books_dict
            for key in books_dict:
                books_dict[key].append(book_details[key])
        except Exception as e:
            print(f"Error scraping {book_url}: {e}")

# Step 4: Create a Pandas DataFrame and display the first 5 rows
books_df = pd.DataFrame(books_dict)

100%|███████████████████████████████████████████| 51/51 [03:12<00:00,  3.77s/it]

           Title Price Availability Rating                Description  UPC  \
0  404 Not Found   N/A          N/A    N/A  No description available.  N/A   
1  404 Not Found   N/A          N/A    N/A  No description available.  N/A   
2  404 Not Found   N/A          N/A    N/A  No description available.  N/A   
3  404 Not Found   N/A          N/A    N/A  No description available.  N/A   
4  404 Not Found   N/A          N/A    N/A  No description available.  N/A   

  Category  
0    Books  
1    Books  
2    Books  
3    Books  
4    Books  





In [19]:
books_df.head()

Unnamed: 0,Title,Price,Availability,Rating,Description,UPC,Category
0,404 Not Found,,,,No description available.,,Books
1,404 Not Found,,,,No description available.,,Books
2,404 Not Found,,,,No description available.,,Books
3,404 Not Found,,,,No description available.,,Books
4,404 Not Found,,,,No description available.,,Books
