# Web Scraping Lab

In this lab you will scrape this [website](https://books.toscrape.com/) of books.

You have to create a Pandas DataFrame with all the books listed on the site. Each row of the DataFrame should contain the information about one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that the server responds with the correct status code (200).

In [14]:
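# Your code here
!pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# A status code of 200 means the server answered the request successfully
response = requests.get("https://books.toscrape.com/")
response.status_code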




# Book categories

Write code that collects the **relative URLs** from the left panel to obtain a list of all the book categories.

In [28]:
# Your code here
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
response
soup = BeautifulSoup(response.content, "html.parser")
# Find the parent element containing the genres (assumed to be within the nested <ul>)
nav_list = soup.find('ul', class_='nav nav-list')
book_genres = nav_list.find_all('li')

# Initialize an empty list to store the genres
genres = []

# Iterate over the found <li> elements and extract the genre names
for genre in book_genres:
    sub_list = genre.find('ul')
    if sub_list:
        sub_genres = sub_list.find_all('li')
        for sub_genre in sub_genres:
            genre_name = sub_genre.find('a').text.strip()
            genres.append(genre_name)

# Print the list of book genres
print("List of Book Genres:")
for genre in genres:
    print(genre)

List of Book Genres:
Travel
Mystery
Historical Fiction
Sequential Art
Classics
Philosophy
Romance
Womens Fiction
Fiction
Childrens
Religion
Nonfiction
Music
Default
Science Fiction
Sports and Games
Add a comment
Fantasy
New Adult
Young Adult
Science
Poetry
Paranormal
Art
Psychology
Autobiography
Parenting
Adult Fiction
Humor
Horror
History
Food and Drink
Christian Fiction
Business
Biography
Thriller
Contemporary
Spirituality
Academic
Self Help
Historical
Christian
Suspense
Short Stories
Novels
Health
Politics
Cultural
Erotica
Crime
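
The cell above collects the category names, while the task also asks for the relative URLs. A minimal sketch of how both could be read from the same links, run on a small inline fragment rather than the live site. The `SIDEBAR_HTML` markup and its `href` values are assumptions made for illustration, so the selector may need adjusting against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down copy of the sidebar markup (an assumption for illustration)
SIDEBAR_HTML = """
<ul class="nav nav-list">
  <li><a href="catalogue/category/books_1/index.html">Books</a>
    <ul>
      <li><a href="catalogue/category/books/travel_2/index.html">Travel</a></li>
      <li><a href="catalogue/category/books/mystery_3/index.html">Mystery</a></li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(SIDEBAR_HTML, "html.parser")

# Select only the links inside the nested <ul>, which skips the top-level "Books" entry
category_anchors = soup.select("ul.nav-list ul a")

# The href attribute holds the relative URL, the link text holds the category name
category_relative_urls = [a["href"] for a in category_anchors]
category_names = [a.get_text(strip=True) for a in category_anchors]

print(category_relative_urls)
print(category_names)
```

Keeping the two lists in the same order makes it easy to pair each category name with its URL later, for example with `zip`.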


# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [30]:
# Your code here
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# URL of the page to scrape
base_url = 'http://books.toscrape.com'
url = 'http://books.toscrape.com/catalogue/category/books_1/index.html'

# Fetch the HTML content
response = requests.get(url)
html_content = response.content

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find the parent element containing the genres (assumed to be within the nested <ul>)
nav_list = soup.find('ul', class_='nav nav-list')
book_genres = nav_list.find_all('li')

# List comprehension to extract the relative URL and construct the absolute URL of each genre
genre_urls = [
    urljoin(base_url, genre.find('a')['href'])
    for genre in book_genres
    if genre.find('ul')  # Ensure we are in the correct nested <ul> subtree
]

# Further navigation to find sub-genres in the nested <ul> elements, if necessary
sub_genre_urls = [
    urljoin(base_url, sub_genre.find('a')['href'])
    for genre in book_genres
    if genre.find('ul')
    for sub_genre in genre.find('ul').find_all('li')
]

# Combine all URLs into one list
all_genre_urls = genre_urls + sub_genre_urls

# Print the list of absolute URLs of each book genre
print("List of Absolute URLs of Book Genres:")
for genre_url in all_genre_urls:
    print(genre_url)

List of Absolute URLs of Book Genres:
http://books.toscrape.com/index.html
http://books.toscrape.com/books/travel_2/index.html
http://books.toscrape.com/books/mystery_3/index.html
http://books.toscrape.com/books/historical-fiction_4/index.html
http://books.toscrape.com/books/sequential-art_5/index.html
http://books.toscrape.com/books/classics_6/index.html
http://books.toscrape.com/books/philosophy_7/index.html
http://books.toscrape.com/books/romance_8/index.html
http://books.toscrape.com/books/womens-fiction_9/index.html
http://books.toscrape.com/books/fiction_10/index.html
http://books.toscrape.com/books/childrens_11/index.html
http://books.toscrape.com/books/religion_12/index.html
http://books.toscrape.com/books/nonfiction_13/index.html
http://books.toscrape.com/books/music_14/index.html
http://books.toscrape.com/books/default_15/index.html
http://books.toscrape.com/books/science-fiction_16/index.html
http://books.toscrape.com/books/sports-and-games_17/index.html
http://books.toscrap
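
The URLs printed above are missing the `catalogue/category` part of the path. A likely cause is that the relative links were joined against the site root instead of the page they were found on. The sketch below shows the difference with `urljoin`. The two `href` values are assumed examples of what the page contains, not values copied from the site.

```python
from urllib.parse import urljoin

site_root = "http://books.toscrape.com"
page_url = "http://books.toscrape.com/catalogue/category/books_1/index.html"

# Example hrefs as they might appear on that page (assumed for illustration)
category_href = "../books/travel_2/index.html"
book_href = "../../a-light-in-the-attic_1000/index.html"

# Joining against the site root discards the directory the link is relative to
joined_with_root = urljoin(site_root, category_href)

# Joining against the page URL resolves "../" from the page's own directory
joined_with_page = urljoin(page_url, category_href)
book_url = urljoin(page_url, book_href)

print(joined_with_root)
print(joined_with_page)
print(book_url)
```

Relative links are always resolved against the URL of the page that contains them, so passing that URL as the first argument of `urljoin` is the safer default.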

# Book details

Create a Python function that takes a `book_url` as input and returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary given in the Product description, and the values are the book's associated information.
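
One possible way to read the summary and the UPC from a product page is sketched below on an inline fragment. The markup in `PRODUCT_HTML` (a `div` with `id="product_description"` followed by a `<p>`, and a table row whose header is `UPC`) is an assumption about the page layout, and `extract_description` and `extract_upc` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a product page (an assumption, not fetched from the site)
PRODUCT_HTML = """
<article class="product_page">
  <div id="product_description" class="sub-header"><h2>Product Description</h2></div>
  <p>A short summary of the book.</p>
  <table class="table table-striped">
    <tr><th>UPC</th><td>abc123</td></tr>
  </table>
</article>
"""


def extract_description(soup):
    """Return the text of the <p> that follows the description header, or None."""
    header = soup.find("div", id="product_description")
    if header is None:
        return None
    paragraph = header.find_next_sibling("p")
    return paragraph.get_text(strip=True) if paragraph else None


def extract_upc(soup):
    """Return the value next to the 'UPC' table header, or None."""
    header_cell = soup.find("th", string="UPC")
    if header_cell is None:
        return None
    value_cell = header_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell else None


soup = BeautifulSoup(PRODUCT_HTML, "html.parser")
print(extract_description(soup))
print(extract_upc(soup))
```

Returning `None` when an element is missing keeps the scraper running on pages that lack a description instead of raising an `AttributeError`.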

In [36]:
# Your code here

import requests
from bs4 import BeautifulSoup

def get_book_details(book_url):
    # Fetch the HTML content of the book page
    response = requests.get(book_url)

    # Parse the raw bytes as UTF-8 so the pound sign in the price is decoded correctly
    soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')

    # Extract the book details
    title = soup.h1.text  # Assuming the <h1> element contains the book title
    price = soup.find('p', class_='price_color').text  # Assuming the price is in a <p> with class 'price_color'
    availability = soup.find('p', class_='instock availability').text.strip()  # Assuming the availability is in a <p> with class 'instock availability'
    rating = soup.find('p', class_='star-rating')['class'][1]  # Assuming the second class of the 'star-rating' <p> is the rating word
    meta_description = soup.find('meta', {'name': 'description'})
    description = meta_description['content'].strip() if meta_description else 'No description available'
    upc = soup.find('th', string='UPC').find_next_sibling('td').text  # Assuming the UPC is in the <td> next to a <th> reading 'UPC'

    # Create and return the dictionary with the book details
    book_details = {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc,
    }

    return book_details

# Example usage
book_url = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'  # Replace with the actual book URL
book_details = get_book_details(book_url)
print(book_details)

{'Title': 'A Light in the Attic', 'Price': '£51.77', 'Availability': 'In stock (22 available)', 'Rating': 'Three', 'Description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe pla
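
The dictionary stores the price, availability and rating as raw strings, while the lab asks for the number of stars. A small sketch of how these strings could be converted to numbers follows. The helper names `rating_to_int`, `price_to_float` and `stock_count` are hypothetical, and the patterns assume the formats seen in the output above.

```python
import re

# Maps the rating word found in the CSS class to a number of stars
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Return the number of stars for a rating word, or None if unknown."""
    return RATING_WORDS.get(word)


def price_to_float(text):
    """Extract the numeric part of a price string, ignoring any currency prefix."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def stock_count(text):
    """Extract the number of copies from a string like 'In stock (22 available)'."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0


print(rating_to_int("Three"))
print(price_to_float("£51.77"))
print(stock_count("In stock (22 available)"))
```

Converting at this stage means the final DataFrame columns can be sorted and aggregated numerically without further cleaning.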

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in each category, and collect each book's information using the previous function. Fill the dictionary with the information about each book.

Show the first five rows of the final Pandas DataFrame.
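
Pandas can only build a DataFrame from a dictionary of lists when every list has the same length. A minimal sketch of a helper that appends exactly one value per column for each book, filling `None` for any missing key. The name `add_book` and the sample `details` dictionary are hypothetical.

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [],
              "Description": [], "UPC": [], "Category": []}


def add_book(books_dict, details, category):
    """Append one book to books_dict, keeping all column lists the same length."""
    row = dict(details, Category=category)
    for column in books_dict:
        # .get returns None for a missing key instead of raising KeyError
        books_dict[column].append(row.get(column))


# Sample details with the "UPC" key deliberately left out (illustrative data only)
details = {"Title": "A Light in the Attic", "Price": "£51.77",
           "Availability": "In stock (22 available)", "Rating": "Three",
           "Description": "A short summary."}
add_book(books_dict, details, "Poetry")

print({column: len(values) for column, values in books_dict.items()})
```

Iterating over the columns of `books_dict` rather than the keys of `details` is what keeps the lists aligned when a page lacks one of the fields.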

Tip: You can wrap the iterable of a `for` loop in the `tqdm` function from the `tqdm` library to show a progress bar, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```





In [45]:
import pandas as pd
from tqdm import tqdm
from urllib.parse import urljoin

base_url = 'http://books.toscrape.com'
main_url = urljoin(base_url, 'catalogue/category/books_1/index.html')
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }

# Fetch the main page content
response = requests.get(main_url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all category links
category_links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True) if 'category/books' in a['href']]

# Iterate over each category
for category_url in tqdm(category_links):
    category_response = requests.get(category_url)
    category_soup = BeautifulSoup(category_response.content, 'html.parser')
    category_name = category_soup.find('h1').text

    # Find all book links in the category
    book_links = [urljoin(base_url, a['href']) for a in category_soup.select('h3 a')]

    # Iterate over each book
    for book_url in tqdm(book_links, leave=False):
        book_details = get_book_details(book_url)
        for key, value in book_details.items():
            books_dict[key].append(value)
        books_dict["Category"].append(category_name)

# Convert the dictionary to a pandas DataFrame
books_df = pd.DataFrame(books_dict)

# Display the first five rows of the DataFrame
print(books_df.head())
# Your code here

0it [00:00, ?it/s]

Empty DataFrame
Columns: [Title, Price, Availability, Rating, Description, UPC, Category]
Index: []
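
The run above produced an empty DataFrame. The most likely reason is that no link on the page contains the text `category/books` in its `href`, so `category_links` stayed empty and the loop never ran. A sketch of link helpers that select by page structure instead and resolve each link against the page it came from is shown below. The inline `PAGE_HTML` fragment, its `href` values and the `li.next` pager selector are assumptions made for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def category_links(soup, page_url):
    """Absolute URLs of the categories listed in the nested sidebar <ul>."""
    return [urljoin(page_url, a["href"]) for a in soup.select("ul.nav-list ul a")]


def book_links(soup, page_url):
    """Absolute URLs of the books listed on a category page."""
    return [urljoin(page_url, a["href"]) for a in soup.select("h3 a")]


def next_page_url(soup, page_url):
    """Absolute URL of the next results page, or None on the last page."""
    link = soup.select_one("li.next a")
    return urljoin(page_url, link["href"]) if link else None


# Hypothetical category-page fragment (an assumption, not fetched from the site)
PAGE_URL = "http://books.toscrape.com/catalogue/category/books_1/index.html"
PAGE_HTML = """
<ul class="nav nav-list"><li><a href="index.html">Books</a>
  <ul><li><a href="../books/travel_2/index.html">Travel</a></li></ul>
</li></ul>
<h3><a href="../../a-light-in-the-attic_1000/index.html">A Light in the Attic</a></h3>
<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
"""

soup = BeautifulSoup(PAGE_HTML, "html.parser")
print(category_links(soup, PAGE_URL))
print(book_links(soup, PAGE_URL))
print(next_page_url(soup, PAGE_URL))
```

In a full run, each category page would be fetched with `requests.get`, its books collected with `book_links`, and the loop repeated while `next_page_url` returns a URL, so that categories spanning several pages are covered completely.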



