# Web Scraping Lab

In this lab you will scrape this [bookstore website](https://books.toscrape.com/).

You have to create a Pandas DataFrame with all the books listed on the website. Each row of the DataFrame should hold the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that you can obtain the correct status code.

In [2]:
pip install requests beautifulsoup4 pandas

Note: you may need to restart the kernel to use updated packages.


In [40]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm  # For showing progress bars

# URL of the first page of the books
url = 'http://books.toscrape.com/catalogue/category/books_1/index.html'

# Send a GET request to fetch the HTML content of the page
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all books on the page
books = soup.find_all('article', class_='product_pod')

# Initialize a list to store book data
book_data = []

# Loop through each book and extract the required information
for book in tqdm(books, desc="Extracting book information"):
    # Extract the title, price, stock availability, and rating
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    stock = book.find('p', class_='availability').text.strip()
    rating = book.p['class'][1]  # class list is e.g. ['star-rating', 'Three'], so this is 'Three'
    
    # To get the description, category, and UPC, visit the book's detail page.
    # The href is relative to the listing page, so resolve it against `url`.
    book_url = requests.compat.urljoin(url, book.h3.a['href'])
    book_page = requests.get(book_url)
    book_soup = BeautifulSoup(book_page.text, 'html.parser')
    
    # Extract the description
    description = book_soup.select_one('#product_description ~ p')
    description = description.text.strip() if description else "No description"
    
    # Extract the category
    breadcrumb = book_soup.select('ul.breadcrumb li a')
    if len(breadcrumb) >= 3:
        category = breadcrumb[-1].text.strip()
    else:
        category = "No category"
    
    # Extract the UPC
    upc = None
    upc_element = book_soup.find('th', string='UPC')
    if upc_element:
        upc = upc_element.find_next_sibling('td').text.strip()
    
    # Save all the information in a dictionary
    book_data.append({
        'Category': category,
        'Title': title,
        'Price': price,
        'Availability': stock,
        'Rating': rating,
        'Description': description,
        'UPC': upc
    })

# Create a DataFrame from the collected book data
books_df = pd.DataFrame(book_data)

# Save the DataFrame to a CSV file
books_df.to_csv('books_data.csv', index=False)

print("Scraping complete. Data has been saved to 'books_data.csv'.")


Extracting book information: 100%|██████████████| 20/20 [00:05<00:00,  3.61it/s]

Scraping complete. Data has been saved to 'books_data.csv'.
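
The `#product_description ~ p` selector used in the cell above relies on the CSS general sibling combinator: it matches a `<p>` that comes after the element with id `product_description` under the same parent. The following is a minimal sketch on a hand-written HTML fragment. The fragment is an assumption that only imitates the layout of a book page; it is not copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout of a book detail page
html = """
<article>
  <p class="price_color">£10.00</p>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>A short summary of the book.</p>
</article>
"""

fragment = BeautifulSoup(html, "html.parser")

# '~' only matches <p> siblings that come AFTER #product_description,
# so the price paragraph before it is ignored
tag = fragment.select_one("#product_description ~ p")
description = tag.text.strip() if tag else "No description"
print(description)

# When the anchor element is absent, select_one returns None
missing = BeautifulSoup("<article><p>x</p></article>", "html.parser")
no_match = missing.select_one("#product_description ~ p")
```

The `if tag else` fallback matters because `select_one` returns `None` when a page has no description block.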





In [45]:
import requests

# URL of the website to scrape
url = 'https://books.toscrape.com/'

# Send a GET request to the server to check the status
response = requests.get(url)

# Check the status code
if response.status_code == 200:
    print("Connection successful! Status code:", response.status_code)
else:
    print("Failed to connect. Status code:", response.status_code)


Connection successful! Status code: 200
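
As an alternative to comparing `status_code` by hand, `requests` can raise an exception for error statuses through `raise_for_status()`. The sketch below wraps that call in a hypothetical helper named `is_ok` (not part of the lab) and runs it on hand-built `Response` objects, so it works without a network connection.

```python
import requests

def is_ok(response):
    """Return True if the response carries a non-error (below 400) status code."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx
    except requests.HTTPError:
        return False
    return True

# Hand-built responses standing in for real requests.get(...) results
found = requests.Response()
found.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(is_ok(found), is_ok(missing))
```

In a real scraping loop the same helper could be applied to each `requests.get(...)` result before parsing it.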


In [47]:
from bs4 import BeautifulSoup
import requests

# URL of the website to scrape
url = 'https://books.toscrape.com/'

# Send a GET request to the server and get the HTML content
response = requests.get(url)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Example: Print the title of the page to confirm successful parsing
page_title = soup.title.text
print("Page Title:", page_title)


Page Title: 
    All products | Books to Scrape - Sandbox



In [52]:
import pandas as pd

# List to store the information for each book
books_data = []

# Locate the section containing all books
books_section = soup.find_all('article', class_='product_pod')

# Iterate over each book
for book in books_section:
    # Extract the title
    title = book.h3.a['title']
    
    # Extract the price
    price = book.find('p', class_='price_color').text
    
    # Extract the stock availability
    availability = book.find('p', class_='instock availability').text.strip()
    
    # Extract the star rating (convert to a number)
    rating_class = book.p['class']
    star_rating = rating_class[-1]
    star_rating = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}.get(star_rating, 0)
    
    # Navigate to the book's detail page for more information
    book_url = url + book.h3.a['href']
    book_response = requests.get(book_url)
    book_soup = BeautifulSoup(book_response.text, 'html.parser')
    
    # Extract UPC
    upc = book_soup.find('th', string='UPC').find_next_sibling('td').text
    
    # Extract Description
    description = book_soup.find('meta', {'name': 'description'})['content'].strip()
    
    # Extract Category (assumes categories are listed in breadcrumb)
    category = book_soup.find('ul', class_='breadcrumb').find_all('li')[2].text.strip()
    
    # Append the book's information to the list
    books_data.append({
        'Title': title,
        'Price': price,
        'Availability': availability,
        'Star Rating': star_rating,
        'UPC': upc,
        'Description': description,
        'Category': category
    })

# Convert the list to a Pandas DataFrame
books_df = pd.DataFrame(books_data)

# Display the DataFrame
books_df.head()


|   | Title | Price | Availability | Star Rating | UPC | Description | Category |
|---|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Â£51.77 | In stock | 3 | a897fe39b1053632 | It's hard to imagine a world without A Light i... | Poetry |
| 1 | Tipping the Velvet | Â£53.74 | In stock | 1 | 90fa61229261140a | "Erotic and absorbing...Written with starling ... | Historical Fiction |
| 2 | Soumission | Â£50.10 | In stock | 1 | 6957f44c3847a760 | Dans une France assez proche de la nÃ´tre, un ... | Fiction |
| 3 | Sharp Objects | Â£47.82 | In stock | 4 | e00eb4fd7b871a48 | WICKED above her hipbone, GIRL across her hear... | Mystery |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | In stock | 5 | 4165285e1663650f | From a renowned historian comes a groundbreaki... | History |
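
The `Â£` in the Price column is not part of the data. It is the pattern produced when UTF-8 bytes are decoded as ISO-8859-1 (Latin-1). A likely cause, stated here as an assumption, is that the server does not declare a charset in its `Content-Type` header, so `response.text` falls back to ISO-8859-1. The sketch below reproduces the effect offline with a hard-coded price string.

```python
# The pound sign as a UTF-8 page would send it
raw_bytes = "£51.77".encode("utf-8")

# Decoding with the wrong codec yields the garbled text seen in the table
wrong = raw_bytes.decode("iso-8859-1")
# Decoding with the right codec restores the original text
right = raw_bytes.decode("utf-8")

print(wrong)
print(right)

# An already-garbled string can be repaired by reversing the wrong decode
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

Under that assumption, one option in the scraping cells is to set `response.encoding = 'utf-8'` before reading `response.text`, or to pass `response.content` (bytes) to `BeautifulSoup` and let it detect the encoding.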


In [54]:
# Save the DataFrame to a CSV file
books_df.to_csv('books_data.csv', index=False)

print("Data saved to 'books_data.csv'")


Data saved to 'books_data.csv'


In [56]:
# Save the DataFrame to an Excel file (pandas needs an Excel writer engine such as openpyxl installed)
books_df.to_excel('books_data.xlsx', index=False)

print("Data saved to 'books_data.xlsx'")


Data saved to 'books_data.xlsx'


In [58]:
# Save the DataFrame as JSON Lines: with lines=True each row is written as one JSON object per line
books_df.to_json('books_data.json', orient='records', lines=True)

print("Data saved to 'books_data.json'")


Data saved to 'books_data.json'


# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [60]:
from bs4 import BeautifulSoup
import requests

# Base URL of the website
base_url = 'https://books.toscrape.com/'

# URL of the specific category (for example, "Travel")
category_url = base_url + 'catalogue/category/books/travel_2/index.html'

# Send a GET request to the category page
response = requests.get(category_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the book links on the category page
# The links are usually within 'h3' tags under 'article' with class 'product_pod'
book_links = soup.find_all('h3')

# Use list comprehension to create a list of absolute URLs
book_urls = [base_url + link.find('a')['href'].replace('../../../', 'catalogue/') for link in book_links]

# Display the list of absolute URLs
book_urls



['https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html',
 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html',
 'https://books.toscrape.com/catalogue/see-america-a-celebration-of-our-national-parks-treasured-sites_732/index.html',
 'https://books.toscrape.com/catalogue/vagabonding-an-uncommon-guide-to-the-art-of-long-term-world-travel_552/index.html',
 'https://books.toscrape.com/catalogue/under-the-tuscan-sun_504/index.html',
 'https://books.toscrape.com/catalogue/a-summer-in-europe_458/index.html',
 'https://books.toscrape.com/catalogue/the-great-railway-bazaar_446/index.html',
 'https://books.toscrape.com/catalogue/a-year-in-provence-provence-1_421/index.html',
 'https://books.toscrape.com/catalogue/the-road-to-little-dribbling-adventures-of-an-american-in-britain-notes-from-a-small-island-2_277/index.html',
 'https://books.toscrape.com/catalogue/neither-here-nor-there-travels-in-europe_198/index.
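
The `replace('../../../', 'catalogue/')` approach above depends on how deep the category page sits in the site's folder structure. `urllib.parse.urljoin` resolves a relative `href` against the URL of the page it was found on, whatever the depth. In this minimal sketch the relative `href` is reconstructed from the first URL in the output above, so it is an assumption about the page's markup, and `page-2.html` is a hypothetical next-page link.

```python
from urllib.parse import urljoin

category_url = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# Assumed relative href, as it would appear inside an <h3><a href="..."> tag
href = "../../../its-only-the-himalayas_981/index.html"

# Each '../' climbs one folder up from the category page's location
absolute = urljoin(category_url, href)
print(absolute)

# The same call resolves a file in the same folder (hypothetical link)
next_page = urljoin(category_url, "page-2.html")
print(next_page)
```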

# Book details

Create a Python function that, given a `book_url` as input, returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary from the Product Description section of its page, and each of the other values holds the corresponding piece of information for that book.

In [62]:
from bs4 import BeautifulSoup
import requests

def get_book_details(book_url):
    # Send a GET request to the book's detail page
    response = requests.get(book_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract the title
    title = soup.find('h1').text
    
    # Extract the price
    price = soup.find('p', class_='price_color').text
    
    # Extract the stock availability
    availability = soup.find('p', class_='instock availability').text.strip()
    
    # Extract the star rating (convert to a number)
    rating_class = soup.find('p', class_='star-rating')['class']
    rating = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}.get(rating_class[-1], 0)
    
    # Extract UPC
    upc = soup.find('th', string='UPC').find_next_sibling('td').text
    
    # Extract Description (check if it exists, as some books might not have a description)
    description_tag = soup.find('meta', {'name': 'description'})
    description = description_tag['content'].strip() if description_tag else 'No description available'
    
    # Return the details as a dictionary
    book_details = {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }
    
    return book_details

# Example usage
book_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'  # Example book URL
book_info = get_book_details(book_url)
print(book_info)


{'Title': 'A Light in the Attic', 'Price': 'Â£51.77', 'Availability': 'In stock (22 available)', 'Rating': 3, 'Description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to 
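
The dictionary above stores the price and the availability as display strings. If numeric columns are wanted later, a regular expression can pull the numbers out. This is a minimal sketch; `parse_price` and `parse_stock` are hypothetical helpers that are not part of the lab.

```python
import re

def parse_price(text):
    """Return the first decimal number found in text as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def parse_stock(text):
    """Return the count from text like 'In stock (22 available)', or 0 if absent."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0

print(parse_price("Â£51.77"))
print(parse_stock("In stock (22 available)"))
print(parse_stock("In stock"))
```

Because `parse_price` only looks for digits, it works whether or not the currency symbol was decoded correctly.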

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in a given category to collect any book information using the previous function. Fill the previous dictionary with the information about each book.

Show the first five rows of the previous final Pandas DataFrame.

Tip: You can wrap the iterable of a `for` loop in the `tqdm` function from the `tqdm` library to show a progress bar, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
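
One way to fill `books_dict` is to loop over its keys instead of writing one `append` per column. The sketch below uses made-up records in place of real results from the detail function; every sample value is a placeholder.

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [],
              "Description": [], "UPC": [], "Category": []}

# Placeholder (category, details) pairs standing in for scraped results
scraped = [
    ("Travel", {"Title": "Book A", "Price": "£1.00", "Availability": "In stock",
                "Rating": 3, "Description": "First placeholder.", "UPC": "aaa"}),
    ("Poetry", {"Title": "Book B", "Price": "£2.00", "Availability": "In stock",
                "Rating": 5, "Description": "Second placeholder.", "UPC": "bbb"}),
]

for category_name, book_info in scraped:
    # The category comes from the outer loop, not from the detail page
    row = {**book_info, "Category": category_name}
    for key in books_dict:
        books_dict[key].append(row[key])

print(books_dict["Title"], books_dict["Category"])
```

Every list ends up with the same length, so `pd.DataFrame(books_dict)` yields one row per book.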





In [None]:
import pandas as pd
from tqdm import tqdm
import requests
from bs4 import BeautifulSoup

# Function to get book details (from previous step)
def get_book_details(book_url):
    response = requests.get(book_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    title = soup.find('h1').text
    price = soup.find('p', class_='price_color').text
    availability = soup.find('p', class_='instock availability').text.strip()
    
    rating_class = soup.find('p', class_='star-rating')['class']
    rating = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}.get(rating_class[-1], 0)
    
    upc = soup.find('th', string='UPC').find_next_sibling('td').text
    
    description_tag = soup.find('meta', {'name': 'description'})
    description = description_tag['content'].strip() if description_tag else 'No description available'
    
    return {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }

# Initialize the dictionary to store the information
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }

# Base URL of the website
base_url = 'https://books.toscrape.com/'

# Send a GET request to the main page
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all categories
categories = soup.find('ul', class_='nav nav-list').find('ul').find_all('li')

# Iterate over each category
for category in tqdm(categories):
    # Get the category name and URL
    category_name = category.a.text.strip()
    category_url = base_url + category.a['href']
    
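    # Collect book URLs from every page of the category (listings are paginated)
    book_urls = []
    page_url = category_url
    while page_url:
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Resolve each relative href against the page it appears on
        book_urls += [requests.compat.urljoin(page_url, h3.a['href']) for h3 in soup.find_all('h3')]

        # Follow the "next" link if the category has another page
        next_link = soup.select_one('li.next a')
        page_url = requests.compat.urljoin(page_url, next_link['href']) if next_link else None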
    
    # Iterate over each book in the category
    for book_url in tqdm(book_urls, leave=False):
        # Get book details using the previously defined function
        book_info = get_book_details(book_url)
        
        # Add the book's information to the dictionary
        books_dict["Title"].append(book_info["Title"])
        books_dict["Price"].append(book_info["Price"])
        books_dict["Availability"].append(book_info["Availability"])
        books_dict["Rating"].append(book_info["Rating"])
        books_dict["Description"].append(book_info["Description"])
        books_dict["UPC"].append(book_info["UPC"])
        books_dict["Category"].append(category_name)

# Convert the dictionary to a Pandas DataFrame
books_df = pd.DataFrame(books_dict)

# Display the first five rows of the DataFrame
books_df.head()


  0%|                                                    | 0/50 [00:00<?, ?it/s]
  0%|                                                    | 0/11 [00:00<?, ?it/s]
  9%|████                                        | 1/11 [00:00<00:04,  2.34it/s]
 18%|████████                                    | 2/11 [00:00<00:04,  2.04it/s]
 27%|████████████                                | 3/11 [00:01<00:03,  2.17it/s]
 36%|████████████████                            | 4/11 [00:01<00:03,  2.00it/s]
 45%|████████████████████                        | 5/11 [00:02<00:02,  2.08it/s]
 55%|████████████████████████                    | 6/11 [00:02<00:02,  1.95it/s]
 64%|████████████████████████████                | 7/11 [00:03<00:02,  1.99it/s]
 73%|████████████████████████████████            | 8/11 [00:03<00:01,  1.97it/s]
 82%|████████████████████████████████████        | 9/11 [00:04<00:00,  2.02it/s]
 91%|███████████████████████████████████████    | 10/11 [00:05<00:00,  1.94it/s