# Web Scraping Lab

In this lab you will scrape this [website](https://books.toscrape.com/) of books.

You have to create a Pandas DataFrame with all the books listed on the site. Each row of the DataFrame should contain the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that the server responds with the expected status code (200).

In [1]:
# Your code here
import requests
from bs4 import BeautifulSoup
url = "https://books.toscrape.com/"
response = requests.get(url)
response

<Response [200]>

# Book categories

Create the code to collect the **relative urls** from the left panel to obtain a list with all the book categories.

In [3]:
response.headers

{'Date': 'Sun, 22 Sep 2024 11:57:29 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 08 Feb 2023 21:02:32 GMT', 'ETag': 'W/"63e40de8-c85e"', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload', 'Content-Encoding': 'br'}

In [9]:
# Parse the content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the area where categories are listed
category_section = soup.find('ul', class_='nav-list')

# Retrieve all category links within that section
category_links = category_section.find_all('a')

# Extract relative URLs from category links
relative_urls = [link.get('href') for link in category_links]

# Print the list of relative URLs
print(relative_urls)

['catalogue/category/books_1/index.html', 'catalogue/category/books/travel_2/index.html', 'catalogue/category/books/mystery_3/index.html', 'catalogue/category/books/historical-fiction_4/index.html', 'catalogue/category/books/sequential-art_5/index.html', 'catalogue/category/books/classics_6/index.html', 'catalogue/category/books/philosophy_7/index.html', 'catalogue/category/books/romance_8/index.html', 'catalogue/category/books/womens-fiction_9/index.html', 'catalogue/category/books/fiction_10/index.html', 'catalogue/category/books/childrens_11/index.html', 'catalogue/category/books/religion_12/index.html', 'catalogue/category/books/nonfiction_13/index.html', 'catalogue/category/books/music_14/index.html', 'catalogue/category/books/default_15/index.html', 'catalogue/category/books/science-fiction_16/index.html', 'catalogue/category/books/sports-and-games_17/index.html', 'catalogue/category/books/add-a-comment_18/index.html', 'catalogue/category/books/fantasy_19/index.html', 'catalogue/
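
Each relative URL embeds the category slug, so a readable category name can be recovered from the URL alone. The sketch below is only an illustration: the helper name `category_name_from_url` is hypothetical, and it assumes every URL follows the `<slug>_<id>/index.html` pattern visible in the output above.

```python
def category_name_from_url(relative_url):
    """Return the category slug (e.g. 'historical-fiction') from a relative URL."""
    # The second-to-last path segment is assumed to look like '<slug>_<id>'
    segment = relative_url.rstrip('/').split('/')[-2]
    slug, _, _ = segment.rpartition('_')
    # Fall back to the raw segment if there is no '_<id>' suffix
    return slug or segment

# Two of the relative URLs printed above
sample_urls = [
    'catalogue/category/books/travel_2/index.html',
    'catalogue/category/books/historical-fiction_4/index.html',
]
category_names = [category_name_from_url(url) for url in sample_urls]
print(category_names)
```

An alternative that stays closer to the page itself is to read the link text with `link.text.strip()` for each element of `category_links`.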

In [4]:
#print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [10]:
# Your code here
from urllib.parse import urljoin

# Extract relative URLs from category links
relative_urls = [link.get('href') for link in category_links]

# Define the base URL
base_url = "https://books.toscrape.com/"

# Join the base URL with each relative URL to get the absolute URL
absolute_urls = [urljoin(base_url, relative_url) for relative_url in relative_urls]

# Print the list of absolute URLs
print(absolute_urls)

['https://books.toscrape.com/catalogue/category/books_1/index.html', 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html', 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html', 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html', 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html', 'https://books.toscrape.com/catalogue/category/books/classics_6/index.html', 'https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html', 'https://books.toscrape.com/catalogue/category/books/romance_8/index.html', 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html', 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html', 'https://books.toscrape.com/catalogue/category/books/childrens_11/index.html', 'https://books.toscrape.com/catalogue/category/books/religion_12/index.html', 'https://books.toscrape.com/catalogue/category/books/nonficti
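
The list above holds the absolute URLs of the categories. To reach the individual books, the same `urljoin` call can resolve the relative links found on a category page, as long as the category page URL (not the site root) is used as the base. The sketch below runs offline: the `href` values are illustrative assumptions about what the links on a category page look like, not values scraped in this notebook.

```python
from urllib.parse import urljoin

# A category page URL taken from the list above
category_url = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# Hypothetical hrefs, written the way relative book links are assumed to appear
relative_book_hrefs = [
    "../../../a-light-in-the-attic_1000/index.html",
]

# Each '../' climbs one directory up from the category page
book_urls = [urljoin(category_url, href) for href in relative_book_hrefs]
print(book_urls)

# A relative 'next page' link (assumed form) resolves inside the same category folder
next_page_url = urljoin(category_url, "page-2.html")
print(next_page_url)
```

In a live run, the hrefs would come from something like `[article.find('a').get('href') for article in soup.find_all('article', class_='product_pod')]` on the parsed category page.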

# Book details

Create a Python function that takes a `book_url` as input and returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary given in the Product description, and the values are the book's associated information.

In [11]:
# Your code here
def scrape_book(book_url):
    response = requests.get(book_url)
    response.raise_for_status()  # to ensure the request was successful
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract book information with safety checks for missing data
    title = soup.find('h1').text.strip() if soup.find('h1') else 'N/A'
    price = soup.find('p', class_='price_color').text.strip() if soup.find('p', class_='price_color') else 'N/A'
    availability = soup.find('p', class_='instock availability').text.strip() if soup.find('p', class_='instock availability') else 'N/A'
    
    rating = 'N/A'
    rating_tag = soup.find('p', class_='star-rating')
    if rating_tag:
        rating_classes = rating_tag.get('class')
        if len(rating_classes) > 1:
            rating = rating_classes[1]
    
    description = 'N/A'
    description_tag = soup.find('div', id='product_description')
    if description_tag and description_tag.find_next_sibling('p'):
        description = description_tag.find_next_sibling('p').text.strip()
    
    upc = 'N/A'
    upc_tag = soup.find('table', class_='table table-striped')
    if upc_tag:
        upc = upc_tag.find('td').text.strip()
    
    # Create and return the dictionary
    book_info = {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }
    return book_info


In [12]:
book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"  
book_info = scrape_book(book_url)  
print(book_info)

{'Title': 'A Light in the Attic', 'Price': '£51.77', 'Availability': 'In stock (22 available)', 'Rating': 'Three', 'Description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe plac
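
The values returned above are raw strings (`'£51.77'`, `'In stock (22 available)'`, `'Three'`), while the lab asks for the number of stars. One possible post-processing step is sketched below; `clean_book_info` and `RATING_WORDS` are hypothetical names, and the regular expressions assume the string formats seen in the output above.

```python
import re

# Maps the rating word stored in the CSS class to a number of stars
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_book_info(book_info):
    """Return a copy of book_info with numeric price, stock count and rating."""
    cleaned = dict(book_info)

    # '£51.77' -> 51.77 (None if no number is found)
    price_match = re.search(r"\d+(?:\.\d+)?", book_info.get("Price", ""))
    cleaned["Price"] = float(price_match.group()) if price_match else None

    # 'In stock (22 available)' -> 22 (None if the count is missing)
    stock_match = re.search(r"\((\d+) available\)", book_info.get("Availability", ""))
    cleaned["Availability"] = int(stock_match.group(1)) if stock_match else None

    # 'Three' -> 3 (None for unknown values such as 'N/A')
    cleaned["Rating"] = RATING_WORDS.get(book_info.get("Rating"))
    return cleaned

# Values copied from the output above
sample = {
    "Title": "A Light in the Attic",
    "Price": "£51.77",
    "Availability": "In stock (22 available)",
    "Rating": "Three",
}
cleaned = clean_book_info(sample)
print(cleaned)
```

Keeping the cleaning separate from `scrape_book` leaves the scraper's output unchanged, so the raw strings are still available if the parsing assumptions turn out to be wrong.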

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in a given category to collect any book information using the previous function. Fill the previous dictionary with the information about each book.

Show the first five rows of the final Pandas DataFrame.

Tip: You can wrap the iterable of a `for` loop in the function `tqdm` from the library `tqdm` to show a progress bar, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
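
Filling `books_dict` can be done with one loop over its keys instead of one `append` per column. The sketch below is a minimal illustration: `add_book` is a hypothetical helper, and the two records are made-up placeholders rather than scraped data.

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [],
              "Description": [], "UPC": [], "Category": []}

def add_book(books_dict, book_info):
    """Append one book's values to every column, using 'N/A' for missing keys."""
    for key in books_dict:
        books_dict[key].append(book_info.get(key, "N/A"))

# Placeholder records (not real scraped data)
records = [
    {"Title": "Example Book A", "Price": "£10.00", "Category": "Travel"},
    {"Title": "Example Book B", "Price": "£12.50", "Rating": "Four"},
]
for record in records:
    add_book(books_dict, record)

print(books_dict["Title"])
print(books_dict["Rating"])
```

Because every column receives exactly one value per book, all lists keep the same length, which is what `pd.DataFrame(books_dict)` requires.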





In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
from tqdm import tqdm

# Define the base URL
base_url = "https://books.toscrape.com/"

# Define the dictionary to store book information
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": []}

# Define the function to scrape book information
def scrape_book(book_url):
    response = requests.get(book_url)
    response.raise_for_status()  # to ensure the request was successful

    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract book information with safety checks for missing data
    title = soup.find('h1').text.strip() if soup.find('h1') else 'N/A'
    price = soup.find('p', class_='price_color').text.strip() if soup.find('p', class_='price_color') else 'N/A'
    availability = soup.find('p', class_='instock availability').text.strip() if soup.find('p', class_='instock availability') else 'N/A'

    rating = 'N/A'
    rating_tag = soup.find('p', class_='star-rating')
    if rating_tag:
        rating_classes = rating_tag.get('class')
        if len(rating_classes) > 1:
            rating = rating_classes[1]
    
    description = 'N/A'
    description_tag = soup.find('div', id='product_description')
    if description_tag and description_tag.find_next_sibling('p'):
        description = description_tag.find_next_sibling('p').text.strip()
    
    upc = 'N/A'
    upc_tag = soup.find('table', class_='table table-striped')
    if upc_tag:
        upc = upc_tag.find('td').text.strip()

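    # Extract the category (breadcrumb order: Home > Books > Category > Title)
    breadcrumb = soup.find('ul', class_='breadcrumb')
    breadcrumb_items = breadcrumb.find_all('li') if breadcrumb else []
    category = breadcrumb_items[2].text.strip() if len(breadcrumb_items) > 2 else 'N/A'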

    # Create and return the dictionary
    book_info = {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc,
        "Category": category
    }
    return book_info

# Get the category URLs
response = requests.get(base_url)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
category_section = soup.find('ul', class_='nav-list')
category_links = category_section.find_all('a')
relative_urls = [link.get('href') for link in category_links][1:]  # Skip the first link ("books_1"), which is not a single category
absolute_urls = [urljoin(base_url, relative_url) for relative_url in relative_urls]

# Iterate over categories and books to collect information
for category_url in tqdm(absolute_urls, desc="Categories"):
    while True:
        response = requests.get(category_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        books = soup.find_all('article', class_='product_pod')
        
        for book in tqdm(books, desc="Books in category", leave=False):
            book_link = book.find('a')
            book_url = urljoin(category_url, book_link.get('href'))
            book_info = scrape_book(book_url)
            books_dict["Title"].append(book_info["Title"])
            books_dict["Price"].append(book_info["Price"])
            books_dict["Availability"].append(book_info["Availability"])
            books_dict["Rating"].append(book_info["Rating"])
            books_dict["Description"].append(book_info["Description"])
            books_dict["UPC"].append(book_info["UPC"])
            books_dict["Category"].append(book_info["Category"])

        # Check for the next page
        next_button = soup.find('li', class_='next')
        if next_button:
            next_page = next_button.find('a')['href']
            category_url = urljoin(category_url, next_page)
        else:
            break

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(books_dict)

# Show the first five rows of the DataFrame
print(df.head())

Categories:   0%|                                                                               | 0/50 [00:00<?, ?it/s]
Books in category:   0%|                                                                        | 0/11 [00:00<?, ?it/s]
Books in category:   9%|█████▊                                                          | 1/11 [00:00<00:06,  1.51it/s]
Books in category:  18%|███████████▋                                                    | 2/11 [00:01<00:05,  1.52it/s]
Books in category:  27%|█████████████████▍                                              | 3/11 [00:01<00:05,  1.52it/s]
Books in category:  36%|███████████████████████▎                                        | 4/11 [00:02<00:04,  1.42it/s]
Books in category:  45%|█████████████████████████████                                   | 5/11 [00:03<00:04,  1.45it/s]
Books in category:  55%|██████████████████████████████████▉                             | 6/11 [00:04<00:03,  1.40it/s]
Books in category: 