# Web Scraping Lab

In this lab you will scrape [Books to Scrape](https://books.toscrape.com/), a sandbox bookstore website.

You have to create a Pandas DataFrame with all the books listed on the site. Each row of the DataFrame should hold the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that you can obtain the correct status code.

In [9]:
# Your code here
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = "https://books.toscrape.com/"
response = requests.get(base_url)
response


<Response [200]>

In [10]:
if response.status_code == 200:
    print("All good!")
    print("==============")
    print("\n")
    # Parse the page that was already downloaded; there is no need to request it again
    soup = BeautifulSoup(response.content, "html.parser")
else:
    print(f"Failed! Status code: {response.status_code}")

All good!




In [11]:
print(soup.prettify()) 

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [12]:
# Lists to hold extracted data
categories = []
titles = []
prices = []
availability = []
ratings = []
descriptions = []
upcs = []

# Mapping of star ratings
star_mapping = {
    "One": 1,
    "Two": 2,
    "Three": 3,
    "Four": 4,
    "Five": 5
}

# Extract book details from a book page
def get_book_details(book_url):
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Extract data
    title = soup.find("h1").text if soup.find("h1") else "N/A"
    price = soup.find("p", class_="price_color").text if soup.find("p", class_="price_color") else "N/A"
    stock = soup.find("p", class_="instock availability").text.strip() if soup.find("p", class_="instock availability") else "N/A"
    
    # Find star rating
    star_rating = soup.find("p", class_="star-rating")
    star_rating = star_mapping[star_rating["class"][1]] if star_rating and "class" in star_rating.attrs else "N/A"
    
    # Find UPC and description
    upc = soup.find("th", string="UPC")
    upc = upc.find_next_sibling("td").text if upc else "N/A"
    
    description = soup.find("meta", {"name": "description"})
    description = description["content"].strip() if description and "content" in description.attrs else "N/A"
    
    return title, price, stock, star_rating, upc, description


# Scrape books from a single category page
def scrape_category_page(category_url, category_name):
    response = requests.get(category_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    books = soup.find_all("article", class_="product_pod")
    
    for book in books:
        book_url = base_url + book.find("h3").find("a")["href"].replace("../../../", "catalogue/")
        title, price, stock, star_rating, upc, description = get_book_details(book_url)
        
        # Append data to lists
        categories.append(category_name)
        titles.append(title)
        prices.append(price)
        availability.append(stock)
        ratings.append(star_rating)
        descriptions.append(description)
        upcs.append(upc)

    # Check next page
    next_page = soup.find("li", class_="next")
    if next_page:
        next_url = category_url.rsplit("/", 1)[0] + "/" + next_page.find("a")["href"]
        scrape_category_page(next_url, category_name)

# Scrape all website
def scrape_books():
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find all categories
    category_links = soup.find("ul", class_="nav-list").find("ul").find_all("a")
    
    for category_link in category_links:
        category_name = category_link.text.strip()
        category_url = base_url + category_link["href"]
        scrape_category_page(category_url, category_name)


    df = pd.DataFrame({
        "Category": categories,
        "Title": titles,
        "Price": prices,
        "Availability": availability,
        "Star Rating": ratings,
        "Description": descriptions,
        "UPC": upcs
    })
    
    return df


df_books = scrape_books()
display(df_books.head())

    
    

| | Category | Title | Price | Availability | Star Rating | Description | UPC |
|---|---|---|---|---|---|---|---|
| 0 | Travel | It's Only the Himalayas | £45.17 | In stock (19 available) | 2 | “Wherever you go, whatever you do, just . . . ... | a22124811bfa8350 |
| 1 | Travel | Full Moon over Noah’s Ark: An Odyssey to Mount... | £49.43 | In stock (15 available) | 4 | Acclaimed travel writer Rick Antonson sets his... | ce60436f52c5ee68 |
| 2 | Travel | See America: A Celebration of Our National Par... | £48.87 | In stock (14 available) | 3 | To coincide with the 2016 centennial anniversa... | f9705c362f070608 |
| 3 | Travel | Vagabonding: An Uncommon Guide to the Art of L... | £36.94 | In stock (8 available) | 2 | With a new foreword by Tim Ferriss •There’s no... | 1809259a5a5f1d8d |
| 4 | Travel | Under the Tuscan Sun | £37.33 | In stock (7 available) | 3 | A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER... | a94350ee74deaa07 |


# Book categories

Write code that collects the **relative URLs** in the left panel to obtain a list of all the book categories.

In [13]:
# Your code here
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "html.parser")

# Find the left panel 
category_section = soup.find("ul", class_="nav-list").find("ul")

# Extract all category links
category_links = category_section.find_all("a")

category_urls = []

# Loop through each category link 
for link in category_links:
    relative_url = link["href"]
    category_urls.append(relative_url)

display(category_urls)

['catalogue/category/books/travel_2/index.html',
 'catalogue/category/books/mystery_3/index.html',
 'catalogue/category/books/historical-fiction_4/index.html',
 'catalogue/category/books/sequential-art_5/index.html',
 'catalogue/category/books/classics_6/index.html',
 'catalogue/category/books/philosophy_7/index.html',
 'catalogue/category/books/romance_8/index.html',
 'catalogue/category/books/womens-fiction_9/index.html',
 'catalogue/category/books/fiction_10/index.html',
 'catalogue/category/books/childrens_11/index.html',
 'catalogue/category/books/religion_12/index.html',
 'catalogue/category/books/nonfiction_13/index.html',
 'catalogue/category/books/music_14/index.html',
 'catalogue/category/books/default_15/index.html',
 'catalogue/category/books/science-fiction_16/index.html',
 'catalogue/category/books/sports-and-games_17/index.html',
 'catalogue/category/books/add-a-comment_18/index.html',
 'catalogue/category/books/fantasy_19/index.html',
 'catalogue/category/books/new-adul
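
The relative URLs collected above still need a base before they can be requested. As a small sketch (not part of the lab's required solution), the standard-library function `urllib.parse.urljoin` resolves a relative link against the page it was found on. The book link below has the `../../../` form that the scraping code in this notebook rewrites by hand.

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/"

# A relative category link, as it appears in the left panel of the home page
category_href = "catalogue/category/books/travel_2/index.html"
category_url = urljoin(base_url, category_href)

# A book link as found on a category page: each "../" climbs one folder up,
# so three of them lead from the category folder back to "catalogue/"
book_href = "../../../its-only-the-himalayas_981/index.html"
book_url = urljoin(category_url, book_href)

print(category_url)
print(book_url)
```

Because `urljoin` resolves the link relative to the page URL, the same call works for links on the home page and on category pages.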

# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [14]:
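# Your code here
from urllib.parse import urljoin

def get_absolute_book_urls(page_url):
    """Return the absolute URL of every book listed on page_url."""
    response = requests.get(page_url)
    page_soup = BeautifulSoup(response.content, "html.parser")

    # Each book title sits in an <h3> that wraps the link to its detail page
    book_links = page_soup.find_all("h3")

    # urljoin resolves relative hrefs (including "../../../") against the page URL
    absolute_urls = [urljoin(page_url, book.find("a")["href"]) for book in book_links]

    return absolute_urls

# Get all book URLs listed on the home page
book_urls = get_absolute_book_urls(base_url)

# Display the list of absolute URLs
for url in book_urls:
    display(url)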

'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'

'https://books.toscrape.com/catalogue/soumission_998/index.html'

'https://books.toscrape.com/catalogue/sharp-objects_997/index.html'

'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'

'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html'

'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html'

'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html'

'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html'

'https://books.toscrape.com/catalogue/the-black-maria_991/index.html'

'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html'

'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html'

'https://books.toscrape.com/catalogue/set-me-free_988/index.html'

'https://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html'

'https://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html'

'https://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html'

'https://books.toscrape.com/catalogue/olio_984/index.html'

'https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html'

'https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html'

'https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'

# Book details

Create a Python function that, given a `book_url` as input, returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary given in the Product Description section, and each value is the corresponding piece of information about the book.

In [15]:
# Your code here
def get_book_details(book_url):
    #base_url = "https://books.toscrape.com/"
    
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Extract book details
    title = soup.find("h1").text if soup.find("h1") else "N/A"
    price = soup.find("p", class_="price_color").text if soup.find("p", class_="price_color") else "N/A"
    availability = soup.find("p", class_="instock availability").text.strip() if soup.find("p", class_="instock availability") else "N/A"
    
    # Extract star rating
    star_rating_element = soup.find("p", class_="star-rating")
    if star_rating_element:
        star_class = star_rating_element["class"][1]
        star_mapping = {
            "One": 1,
            "Two": 2,
            "Three": 3,
            "Four": 4,
            "Five": 5
        }
        rating = star_mapping.get(star_class, "N/A")
    else:
        rating = "N/A"
    
    # Extract UPC
    upc_element = soup.find("th", string="UPC")
    upc = upc_element.find_next_sibling("td").text if upc_element else "N/A"
    
    # Extract description
    description_meta = soup.find("meta", {"name": "description"})
    description = description_meta["content"].strip() if description_meta and "content" in description_meta.attrs else "N/A"
    
    # Return a dictionary
    return {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }


book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
book_details = get_book_details(book_url)

display(book_details)



{'Title': 'A Light in the Attic',
 'Price': '£51.77',
 'Availability': 'In stock (22 available)',
 'Rating': 3,
 'Description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place 
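
The `Price` and `Availability` values above are plain strings. If numeric columns are wanted later, one possible approach is sketched below with regular expressions. The helper names `parse_price` and `parse_stock` are hypothetical, and the patterns assume the text formats shown in this notebook's output (`£51.77` and `In stock (22 available)`).

```python
import re

def parse_price(price_text):
    """Return the numeric part of a price string such as '£51.77', or None."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None

def parse_stock(availability_text):
    """Return the unit count from text like 'In stock (22 available)', or 0."""
    match = re.search(r"\((\d+) available\)", availability_text)
    return int(match.group(1)) if match else 0

print(parse_price("£51.77"))
print(parse_stock("In stock (22 available)"))
```

Both helpers fall back to a neutral value (`None` or `0`) when the text does not match, which also covers the `"N/A"` placeholder used by the scraping functions.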

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in each category, collecting each book's information with the previous function. Fill the dictionary above with the information about each book.

Show the first five rows of the resulting Pandas DataFrame.

Tip: You can use the function `tqdm` from the library `tqdm` to show a progress bar for the iterable of a `for` loop, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
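
Filling `books_dict` one book at a time, as described above, might look like the sketch below. The helper name `add_book` is hypothetical, and the sample record reuses values that appear in this notebook's output (with the description left truncated as displayed) instead of a live request.

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [],
              "Description": [], "UPC": [], "Category": []}

def add_book(books_dict, book_details, category_name):
    """Append one book's details and its category to the lists in books_dict."""
    for key, value in book_details.items():
        books_dict[key].append(value)
    books_dict["Category"].append(category_name)

# Sample record in the shape returned by the book-details function
sample_details = {
    "Title": "Under the Tuscan Sun",
    "Price": "£37.33",
    "Availability": "In stock (7 available)",
    "Rating": 3,
    "Description": "A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER...",
    "UPC": "a94350ee74deaa07",
}
add_book(books_dict, sample_details, "Travel")
print(books_dict["Title"], books_dict["Category"])
```

Keeping every list the same length matters here, because `pd.DataFrame(books_dict)` raises an error when the columns have different lengths.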





In [16]:
from tqdm import tqdm

def get_book_details(book_url):
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    title = soup.find("h1").text if soup.find("h1") else "N/A"
    price = soup.find("p", class_="price_color").text if soup.find("p", class_="price_color") else "N/A"
    availability = soup.find("p", class_="instock availability").text.strip() if soup.find("p", class_="instock availability") else "N/A"
    
    star_rating_element = soup.find("p", class_="star-rating")
    if star_rating_element:
        star_class = star_rating_element["class"][1]
        star_mapping = {
            "One": 1,
            "Two": 2,
            "Three": 3,
            "Four": 4,
            "Five": 5
        }
        rating = star_mapping.get(star_class, "N/A")
    else:
        rating = "N/A"
    
    upc_element = soup.find("th", string="UPC")
    upc = upc_element.find_next_sibling("td").text if upc_element else "N/A"
    
    description_meta = soup.find("meta", {"name": "description"})
    description = description_meta["content"].strip() if description_meta and "content" in description_meta.attrs else "N/A"
    
    return {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc
    }

# Scrape books from a category page
def scrape_category_page(category_url, category_name, books_dict):
    response = requests.get(category_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    books = soup.find_all("article", class_="product_pod")
    
    for book in tqdm(books, desc=f"Scraping {category_name}"):
        book_url = base_url + book.find("h3").find("a")["href"].replace("../../../", "catalogue/")
        book_details = get_book_details(book_url)
        
        # Append the book details to the dictionary
        books_dict["Title"].append(book_details["Title"])
        books_dict["Price"].append(book_details["Price"])
        books_dict["Availability"].append(book_details["Availability"])
        books_dict["Rating"].append(book_details["Rating"])
        books_dict["Description"].append(book_details["Description"])
        books_dict["UPC"].append(book_details["UPC"])
        books_dict["Category"].append(category_name)

    # Handle pagination
    next_page = soup.find("li", class_="next")
    if next_page:
        next_url = category_url.rsplit("/", 1)[0] + "/" + next_page.find("a")["href"]
        scrape_category_page(next_url, category_name, books_dict)

# Scrape all categories and books
def scrape_all_books():
    books_dict = {
        "Title": [],
        "Price": [],
        "Availability": [],
        "Rating": [],
        "Description": [],
        "UPC": [],
        "Category": []
    }

    global base_url
    base_url = "https://books.toscrape.com/"
    
    # Scrape the main page to find all categories
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find all category links
    category_links = soup.find("ul", class_="nav-list").find("ul").find_all("a")
    
    for category_link in tqdm(category_links, desc="Categories"):
        category_name = category_link.text.strip()
        category_url = base_url + category_link["href"]
        scrape_category_page(category_url, category_name, books_dict)

    return books_dict

# Run the scraper
books_dict = scrape_all_books()

df_books = pd.DataFrame(books_dict)


display(df_books.head())



Scraping Travel: 100%|██████████| 11/11 [00:04<00:00,  2.39it/s]
Scraping Mystery: 100%|██████████| 20/20 [00:08<00:00,  2.41it/s]
Scraping Mystery: 100%|██████████| 12/12 [00:04<00:00,  2.52it/s]
Scraping Historical Fiction: 100%|██████████| 20/20 [00:07<00:00,  2.56it/s]
Scraping Historical Fiction: 100%|██████████| 6/6 [00:02<00:00,  2.62it/s]
Scraping Sequential Art: 100%|██████████| 20/20 [00:07<00:00,  2.52it/s]
Scraping Sequential Art: 100%|██████████| 20/20 [00:07<00:00,  2.62it/s]
Scraping Sequential Art: 100%|██████████| 20/20 [00:07<00:00,  2.52it/s]
Scraping Sequential Art: 100%|██████████| 15/15 [00:05<00:00,  2.52it/s]
Scraping Classics: 100%|██████████| 19/19 [00:07<00:00,  2.55it/s]
Scraping Philosophy: 100%|██████████| 11/11 [00:04<00:00,  2.58it/s]
Scraping Romance: 100%|██████████| 20/20 [00:07<00:00,  2.58it/s]
Scraping Romance: 100%|██████████| 15/15 [00:05<00:00,  2.53it/s]
Scraping Womens Fiction: 100%|██████████| 17/17

| | Title | Price | Availability | Rating | Description | UPC | Category |
|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | £45.17 | In stock (19 available) | 2 | “Wherever you go, whatever you do, just . . . ... | a22124811bfa8350 | Travel |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | £49.43 | In stock (15 available) | 4 | Acclaimed travel writer Rick Antonson sets his... | ce60436f52c5ee68 | Travel |
| 2 | See America: A Celebration of Our National Par... | £48.87 | In stock (14 available) | 3 | To coincide with the 2016 centennial anniversa... | f9705c362f070608 | Travel |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | £36.94 | In stock (8 available) | 2 | With a new foreword by Tim Ferriss •There’s no... | 1809259a5a5f1d8d | Travel |
| 4 | Under the Tuscan Sun | £37.33 | In stock (7 available) | 3 | A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER... | a94350ee74deaa07 | Travel |
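
The pagination step in `scrape_category_page` calls itself once per page. A loop is one alternative. The sketch below is only an illustration under stated assumptions, not the lab's reference solution: `iter_page_urls` and `find_next_href` are hypothetical names, and a small dictionary on the placeholder domain `example.com` stands in for real HTTP requests.

```python
from urllib.parse import urljoin

def iter_page_urls(start_url, find_next_href):
    """Yield start_url and then the URL of every following page.

    find_next_href is a hypothetical callable: given a page URL, it returns
    the relative href of that page's "next" link, or None on the last page.
    """
    url = start_url
    seen = set()  # guards against a site whose "next" links form a cycle
    while url and url not in seen:
        seen.add(url)
        yield url
        next_href = find_next_href(url)
        url = urljoin(url, next_href) if next_href else None

# Stand-in for real requests: a fake category with three pages
fake_next_links = {
    "https://example.com/cat/index.html": "page-2.html",
    "https://example.com/cat/page-2.html": "page-3.html",
    "https://example.com/cat/page-3.html": None,
}
pages = list(iter_page_urls("https://example.com/cat/index.html", fake_next_links.get))
print(pages)
```

In a real scraper, `find_next_href` would download the page and read the `href` of the link inside `li.next`, as the recursive version does.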
