# Web Scraping lab

In this lab you will scrape this [book catalogue website](https://books.toscrape.com/).

You have to create a Pandas DataFrame with all the books listed on the site. Each row of the DataFrame should contain the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that you can obtain the correct status code.

In [2]:
# Your code here
# Step 1: Load the needed libraries
import requests
from bs4 import BeautifulSoup

# Make sure you can obtain the correct status code
url = 'http://books.toscrape.com/'  # Base URL of the site being scraped
response = requests.get(url)

# Check the status code
if response.status_code == 200:
    print(f"Success! Status Code: {response.status_code}")
else:
    print(f"Failed to access site, Status Code: {response.status_code}")


Success! Status Code: 200


# Book categories

Create the code to collect the **relative urls** from the left panel to obtain a list with all the book categories.

In [4]:
# Your code here

# Collect all book category URLs
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the left panel with all categories (adjust the class or tag to match the HTML structure)
categories_panel = soup.find('ul', class_='nav-list')

# Extract the relative URLs for each category
category_links = [a['href'] for a in categories_panel.find_all('a') if 'href' in a.attrs]

# Print the collected category links
print(category_links)


['catalogue/category/books_1/index.html', 'catalogue/category/books/travel_2/index.html', 'catalogue/category/books/mystery_3/index.html', 'catalogue/category/books/historical-fiction_4/index.html', 'catalogue/category/books/sequential-art_5/index.html', 'catalogue/category/books/classics_6/index.html', 'catalogue/category/books/philosophy_7/index.html', 'catalogue/category/books/romance_8/index.html', 'catalogue/category/books/womens-fiction_9/index.html', 'catalogue/category/books/fiction_10/index.html', 'catalogue/category/books/childrens_11/index.html', 'catalogue/category/books/religion_12/index.html', 'catalogue/category/books/nonfiction_13/index.html', 'catalogue/category/books/music_14/index.html', 'catalogue/category/books/default_15/index.html', 'catalogue/category/books/science-fiction_16/index.html', 'catalogue/category/books/sports-and-games_17/index.html', 'catalogue/category/books/add-a-comment_18/index.html', 'catalogue/category/books/fantasy_19/index.html', 'catalogue/
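
The relative URLs above also carry each category's name in their second-to-last path segment. Below is a minimal sketch of turning a link into a readable name, assuming the `<slug>_<id>` pattern visible in the output holds for every category. `category_name` is a hypothetical helper introduced here for illustration, not part of the lab.

```python
def category_name(relative_url):
    """Turn 'catalogue/category/books/travel_2/index.html' into 'Travel'."""
    slug = relative_url.split('/')[-2]      # e.g. 'historical-fiction_4'
    name = slug.rsplit('_', 1)[0]           # drop the numeric id -> 'historical-fiction'
    return name.replace('-', ' ').title()   # -> 'Historical Fiction'


# A few links copied from the output above
sample_links = [
    'catalogue/category/books_1/index.html',
    'catalogue/category/books/travel_2/index.html',
    'catalogue/category/books/historical-fiction_4/index.html',
]

names = [category_name(link) for link in sample_links]
print(names)
```

Note that the first link (`books_1`) appears to be the root "Books" listing rather than a genre, so it may be worth filtering out before iterating over categories.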

# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [6]:
# Your code here
# Build the absolute URL of each category page; the book links are collected from these pages
absolute_category_urls = [url + link for link in category_links]

# Print to verify the absolute URLs
print(absolute_category_urls)


['http://books.toscrape.com/catalogue/category/books_1/index.html', 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html', 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html', 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html', 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html', 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html', 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html', 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html', 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html', 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html', 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html', 'http://books.toscrape.com/catalogue/category/books/religion_12/index.html', 'http://books.toscrape.com/catalogue/category/books/nonfiction_13/index.h
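
String concatenation works for the category links because they are relative to the home page. Links found on a category page are likely relative to that page instead and may contain `../` segments (an assumption about the site's markup), in which case `urllib.parse.urljoin` is a safer way to resolve them. The sketch below uses a made-up book slug, `example-book_1`, purely for illustration.

```python
from urllib.parse import urljoin

base_url = 'http://books.toscrape.com/'
category_page = base_url + 'catalogue/category/books/travel_2/index.html'

# Hypothetical relative link as it might appear on a category page
relative_book_links = ['../../../example-book_1/index.html']

# Naive concatenation keeps the '../' segments in the URL
naive_urls = [base_url + link for link in relative_book_links]

# urljoin resolves each link against the page it was found on
absolute_book_urls = [urljoin(category_page, link) for link in relative_book_links]

print(naive_urls)
print(absolute_book_urls)
```

Resolving against the page URL rather than the site root means the same list comprehension keeps working wherever the link was found.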

# Book details

Create a Python function that given a book_url as an input returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary from the Product Description section, and the other values hold the corresponding information for that book.

In [8]:
# Your code here
# Function to scrape book information
def scrape_book_info(book_url):
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract book details
    title = soup.find('h1').text
    price = soup.find('p', class_='price_color').text
    availability = soup.find('p', class_='instock availability').text.strip()
    rating = soup.find('p', class_='star-rating')['class'][1]  # The second class contains the rating (e.g., "Three")
    description = soup.find('meta', {'name': 'description'})['content'].strip()
    upc = soup.find('table').find('td').text  # First row in the table is usually the UPC

    # Return the dictionary
    return {"Title": title, "Price": price, "Availability": availability, "Rating": rating, 
            "Description": description, "UPC": upc}
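
The function above returns every field as text: the rating as a word such as "Three", the price with its currency symbol, and the availability as a sentence. Since the lab asks for the number of stars, a post-processing step along these lines may help. This is a sketch only. The helper names are hypothetical, and the sample formats (`£51.77`, `In stock (22 available)`) are assumptions about what the site returns.

```python
import re

# Map the rating word taken from the CSS class to a number of stars
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(rating_word):
    """Return the number of stars, or None for an unknown word."""
    return RATING_WORDS.get(rating_word)


def price_to_float(price_text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r'\d+(?:\.\d+)?', price_text)
    return float(match.group()) if match else None


def stock_count(availability_text):
    """Extract the count from text such as 'In stock (22 available)'."""
    match = re.search(r'\((\d+) available\)', availability_text)
    return int(match.group(1)) if match else 0


print(rating_to_int("Three"))
print(price_to_float("£51.77"))
print(stock_count("In stock (22 available)"))
```

Keeping the conversion separate from the scraping function leaves the returned dictionary in the structure the exercise specifies.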


# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in each category, collecting each book's information with the previous function. Fill the dictionary above with the information about each book.

Show the first five rows of the final Pandas DataFrame.

Tip: You can use the function `tqdm` from the library `tqdm` to show a progress bar by wrapping the iterable of a for loop, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
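
The accumulation step described above can be sketched without any network access. Every list in `books_dict` has to grow by exactly one entry per book, otherwise `pd.DataFrame(books_dict)` will raise an error about arrays of different lengths. The helper `add_book` and the sample records are made up for illustration.

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [],
              "Description": [], "UPC": [], "Category": []}


def add_book(books_dict, book_info, category):
    """Append one book's fields, plus its category, to the dictionary of lists."""
    for key, value in book_info.items():
        books_dict[key].append(value)
    books_dict["Category"].append(category)


# Placeholder records standing in for the output of the scraping function
sample_books = [
    ({"Title": "Book A", "Price": "£10.00", "Availability": "In stock",
      "Rating": "Three", "Description": "First placeholder.", "UPC": "aaa111"}, "travel"),
    ({"Title": "Book B", "Price": "£12.50", "Availability": "In stock",
      "Rating": "Five", "Description": "Second placeholder.", "UPC": "bbb222"}, "mystery"),
]

for book_info, category in sample_books:
    add_book(books_dict, book_info, category)

# All columns should have the same length before building the DataFrame
column_lengths = {len(values) for values in books_dict.values()}
print(column_lengths)
```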





In [10]:
from urllib.parse import urljoin

import pandas as pd
from tqdm import tqdm

# Your code here

# Start scraping all books
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }

# Iterate over each category
for category_url in tqdm(absolute_category_urls):
    # Skip the root "Books" category: it lists every book again and would create duplicates
    if category_url.endswith('category/books_1/index.html'):
        continue

    # Category name from the URL slug, e.g. ".../travel_2/index.html" -> "travel"
    category_name = category_url.split('/')[-2].rsplit('_', 1)[0]

    # Walk through every page of the category by following the "next" link
    page_url = category_url
    while page_url:
        category_response = requests.get(page_url)
        category_soup = BeautifulSoup(category_response.content, 'html.parser')

        # Book links are relative to the category page (they start with "../"),
        # so resolve them against the page URL instead of the site root
        book_links = [urljoin(page_url, a['href'])
                      for a in category_soup.select('article.product_pod h3 a')]

        # Iterate over each book on the page
        for book_url in book_links:
            book_info = scrape_book_info(book_url)  # Get book info using the function defined above

            # Fill the dictionary with the book's information
            for key, value in book_info.items():
                books_dict[key].append(value)
            books_dict['Category'].append(category_name)

        # Move on to the next page of this category, if there is one
        next_link = category_soup.select_one('li.next a')
        page_url = urljoin(page_url, next_link['href']) if next_link else None

# Convert the dictionary to a DataFrame
books_df = pd.DataFrame(books_dict)

# First 5 rows of the DataFrame
print(books_df.head())



