# Web Scraping lab

In this lab you will scrape this [website](https://books.toscrape.com/) of books.

You have to create a Pandas DataFrame with all the books listed on the page. Each row of the DataFrame should contain the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that you can obtain the correct status code.

In [111]:
# Import the necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# The URL of the website you want to scrape
url = "https://books.toscrape.com/"

# Use the requests library to get the content of the website
response = requests.get(url)

# Print the status code to make sure the request went through successfully
print("Status Code:", response.status_code)

# All good if the status code is 200
if response.status_code == 200:
    print("All good!")
else:
    print("There was a problem accessing the website.")

Status Code: 200
All good!


# Book categories

Create the code to collect the **relative urls** from the left panel to obtain a list with all the book categories.

In [112]:
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

category_urls = []


categories_ul = soup.find('div', class_='side_categories').find('ul').find('ul')  # The categories are likely in the second <ul>

if categories_ul:
    category_links = categories_ul.find_all('a')
    
    for link in category_links:
        relative_url = link.get('href')
        
        if relative_url:
            category_urls.append(relative_url)


print([", ".join(category_urls)])

['catalogue/category/books/travel_2/index.html, catalogue/category/books/mystery_3/index.html, catalogue/category/books/historical-fiction_4/index.html, catalogue/category/books/sequential-art_5/index.html, catalogue/category/books/classics_6/index.html, catalogue/category/books/philosophy_7/index.html, catalogue/category/books/romance_8/index.html, catalogue/category/books/womens-fiction_9/index.html, catalogue/category/books/fiction_10/index.html, catalogue/category/books/childrens_11/index.html, catalogue/category/books/religion_12/index.html, catalogue/category/books/nonfiction_13/index.html, catalogue/category/books/music_14/index.html, catalogue/category/books/default_15/index.html, catalogue/category/books/science-fiction_16/index.html, catalogue/category/books/sports-and-games_17/index.html, catalogue/category/books/add-a-comment_18/index.html, catalogue/category/books/fantasy_19/index.html, catalogue/category/books/new-adult_20/index.html, catalogue/category/books/young-adult_
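
The selector can also be checked offline before hitting the site. The sketch below is only an illustration: the HTML snippet is a hand-written stand-in for the left panel (an assumption about its shape, not a copy of the live page), and `category_map` is a hypothetical name. It pairs each category name with its absolute URL, using `urljoin` from the standard library to resolve the relative URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hand-written sample that mimics the nested <ul> of the left panel (assumed structure)
sample_html = """
<div class="side_categories">
  <ul>
    <li><a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html"> Travel </a></li>
        <li><a href="catalogue/category/books/mystery_3/index.html"> Mystery </a></li>
      </ul>
    </li>
  </ul>
</div>
"""

base_url = "https://books.toscrape.com/"
soup = BeautifulSoup(sample_html, "html.parser")

# "ul ul a" matches only links inside the nested list, skipping the top-level "Books" link
links = soup.select("div.side_categories ul ul a")

# Map each stripped category name to its absolute URL
category_map = {a.text.strip(): urljoin(base_url, a["href"]) for a in links}
print(category_map)
```

Keeping the name next to the URL is convenient later, when each book row needs its category.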

# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [113]:
# Your code here

base_url = "https://books.toscrape.com/"

response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('article', class_='product_pod')

book_urls = [base_url + book.find('h3').find('a')['href'] for book in books]

for url in book_urls:
    print(url)

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
https://books.toscrape.com/catalogue/the-requiem-red_995/index.html
https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
https://books.toscrape.com/catalogue/the-black-maria_991/index.html
https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html
https://books.toscr
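
On the home page the book links are relative to the site root, so plain concatenation with `base_url` works. On a category page the links are relative to that page instead. The sketch below assumes they take a `../../../` form, which is worth confirming in your browser's inspector. `urljoin` resolves both cases, as this offline example with one href of each kind suggests:

```python
from urllib.parse import urljoin

home_url = "https://books.toscrape.com/"
category_url = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# href as it appears on the home page (relative to the site root)
home_href = "catalogue/a-light-in-the-attic_1000/index.html"
# href in the form assumed for category pages (relative to the category page)
category_href = "../../../its-only-the-himalayas_981/index.html"

home_book_url = urljoin(home_url, home_href)
category_book_url = urljoin(category_url, category_href)

print(home_book_url)
print(category_book_url)
```

Concatenating `base_url` with the second href would leave the `../` segments in the URL, so resolving against the page the link was found on is the safer habit.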

# Book details

Create a Python function that given a book_url as an input returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary given in the Product description, and the values are the book's associated information.
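
The price and availability on a book page are display strings such as `£51.77` and `In stock (22 available)`. If numeric columns are preferred in the final DataFrame, they can be converted afterwards. The helpers below are a minimal sketch; `price_to_float` and `stock_count` are hypothetical names, not part of the lab's required interface.

```python
import re

def price_to_float(price_text):
    """Extract the numeric part of a price string such as '£51.77'; None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None

def stock_count(availability_text):
    """Return the number in 'In stock (22 available)', or 0 if no count is present."""
    match = re.search(r"\((\d+) available\)", availability_text)
    return int(match.group(1)) if match else 0

print(price_to_float("£51.77"))
print(stock_count("In stock (22 available)"))
```

Because the regular expression only looks for digits, the price helper also tolerates stray characters in front of the amount.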

In [114]:
from bs4 import BeautifulSoup
import requests

def scrape_book_details(book_url):
    response = requests.get(book_url)
    if response.status_code != 200:
        # Raise instead of returning a string, so callers always get a dictionary back
        raise RuntimeError("Failed to fetch the webpage, Status code: {}".format(response.status_code))
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    try:
        title = soup.find('h1').text
    except AttributeError:
        title = "Title not found"
    
    try:
        price = soup.select_one('p.price_color').text
    except AttributeError:
        price = "Price not found"
    
    try:
        availability = soup.select_one('p.instock.availability').text.strip()
    except AttributeError:
        availability = "Availability not found"
    
    try:
        rating_class = soup.select_one('p.star-rating')['class'][1]
        rating_conversion = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating = rating_conversion.get(rating_class, "Unknown")
    except (AttributeError, TypeError, KeyError):
        rating = "Rating not found"
    
    try:
        description_tag = soup.select_one('#product_description ~ p')
        description = description_tag.text if description_tag else "No description available"
    except AttributeError:
        description = "Description not found"
    
    try:
        upc = soup.find('table', class_='table table-striped').find('td').text
    except AttributeError:
        upc = "UPC not found"
    
    book_details = {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }
    
    return book_details

# Test the function
book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
book_details = scrape_book_details(book_url)
print(book_details)

{'Title': 'A Light in the Attic', 'Price': 'Â£51.77', 'Availability': 'In stock (22 available)', 'Rating': 3, 'Description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to 

In [115]:
book_url = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
book_details = scrape_book_details(book_url)
print(book_details)

{'Title': "It's Only the Himalayas", 'Price': 'Â£45.17', 'Availability': 'In stock (19 available)', 'Rating': 2, 'Description': 'â\x80\x9cWherever you go, whatever you do, just . . . donâ\x80\x99t do anything stupid.â\x80\x9d â\x80\x94My MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and watched as her friend was attacked by a monkey in Indonesi
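
The `Â£` and `â\x80\x9c` sequences in the outputs above are the usual sign of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). A likely cause, stated here as an assumption, is that the response headers do not declare a charset, in which case `requests` falls back to ISO-8859-1 for `response.text`. The sketch below reproduces the effect offline and shows the round trip that repairs it. In the scraper itself, setting `response.encoding = 'utf-8'` before reading `response.text`, or passing `response.content` to BeautifulSoup, should avoid the problem.

```python
original = "£51.77"

# Simulate the bug: UTF-8 bytes decoded with the wrong codec
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Repair: reverse the wrong decode, then decode the bytes as UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The same round trip turns `â\x80\x9c` back into an opening curly quote, which is why the descriptions look garbled too.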

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in each category, collecting each book's information with the previous function. Fill the dictionary above with the information about each book.

Show the first five rows of the final Pandas DataFrame.

Tip: You can use the function `tqdm` from the library `tqdm` to show a progress bar for the iterable of a for loop, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
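
Iterating over all the books in a category needs one extra step when the category spans several pages, because the first page lists only part of the books. The sketch below follows the "next" link until it disappears. The `li.next a` selector and the sample markup are assumptions about the pager's structure. `PAGES`, `fetch`, and `collect_book_urls` are hypothetical names, with `fetch` standing in for `requests.get(url).text` so the example runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical two-page category, keyed by URL, standing in for real HTTP responses
PAGES = {
    "https://example.com/cat/index.html": """
        <h3><a href="../book-a/index.html">A</a></h3>
        <ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
    """,
    "https://example.com/cat/page-2.html": """
        <h3><a href="../book-b/index.html">B</a></h3>
    """,
}

def fetch(url):
    """Stand-in for requests.get(url).text (assumption, for offline use)."""
    return PAGES[url]

def collect_book_urls(first_page_url):
    """Walk a category page by page, following the 'next' link until there is none."""
    book_urls = []
    page_url = first_page_url
    while page_url:
        soup = BeautifulSoup(fetch(page_url), "html.parser")
        # Resolve each book link against the page it was found on
        book_urls += [urljoin(page_url, a["href"]) for a in soup.select("h3 a")]
        next_link = soup.select_one("li.next a")
        page_url = urljoin(page_url, next_link["href"]) if next_link else None
    return book_urls

all_urls = collect_book_urls("https://example.com/cat/index.html")
print(all_urls)
```

Swapping `fetch` for a real HTTP call is the only change needed to try this against the site.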





In [None]:
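import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
from urllib.parse import urljoin
import pandas as pd

RATING_CONVERSION = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def scrape_book_details(book_url):
    # Start from placeholders so every key exists even when a field is missing
    book_info = {
        "Title": "Information not found",
        "Price": "Information not found",
        "Availability": "Information not found",
        "Rating": "Information not found",
        "Description": "Information not found",
        "UPC": "Information not found",
        "Category": "Information not found"
    }

    response = requests.get(book_url)
    if response.status_code != 200:
        return book_info

    soup = BeautifulSoup(response.text, 'html.parser')

    # Same selectors as in the "Book details" section
    title_tag = soup.find('h1')
    if title_tag:
        book_info['Title'] = title_tag.text

    price_tag = soup.select_one('p.price_color')
    if price_tag:
        book_info['Price'] = price_tag.text

    availability_tag = soup.select_one('p.instock.availability')
    if availability_tag:
        book_info['Availability'] = availability_tag.text.strip()

    rating_tag = soup.select_one('p.star-rating')
    if rating_tag and len(rating_tag.get('class', [])) > 1:
        # The second class name spells out the rating, e.g. "Three"
        book_info['Rating'] = RATING_CONVERSION.get(rating_tag['class'][1], "Information not found")

    description_tag = soup.select_one('#product_description ~ p')
    if description_tag:
        book_info['Description'] = description_tag.text

    upc_tag = soup.select_one('table.table-striped td')
    if upc_tag:
        book_info['UPC'] = upc_tag.text

    return book_info

books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": []}

base_url = "https://books.toscrape.com/"

# Category <a> tags from the left panel (the nested <ul> skips the top-level "Books" link)
home_soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')
categories = home_soup.find('div', class_='side_categories').find('ul').find('ul').find_all('a')

for category in tqdm(categories, desc='Categories Progress'):
    category_name = category.text.strip()
    category_url = urljoin(base_url, category.get('href'))

    category_response = requests.get(category_url)
    category_soup = BeautifulSoup(category_response.text, 'html.parser')

    # Book links are relative to the category page, so resolve them against category_url
    book_urls = [urljoin(category_url, x.find('a')['href']) for x in category_soup.find_all('h3') if x.find('a')]

    for book_url in tqdm(book_urls, desc=f'Books in {category_name}'):
        book_info = scrape_book_details(book_url)
        book_info['Category'] = category_name
        for key in books_dict:
            books_dict[key].append(book_info.get(key, None))  # Using None as fallback

books_df = pd.DataFrame(books_dict)
print(books_df.head())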

Categories Progress:   0%|          | 0/50 [00:00<?, ?it/s]
Books in Travel: 100%|██████████| 11/11 [00:04<00:00,  2.71it/s]
Categories Progress:   2%|▏         | 1/50 [00:04<03:44,  4.59s/it]
Books in Mystery:   0%|          | 0/20 [00:00<?, ?it/s]
Books in Myste