# Web Scraping lab

In this lab you will scrape this [website](https://books.toscrape.com/) of books.

You have to create a Pandas DataFrame with all the books listed on the site. Each row of the DataFrame should hold the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the required libraries and make sure that the server returns the expected status code (200).

In [39]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://books.toscrape.com"
response = requests.get(url)

if response.status_code == 200:
    print("All good!")
else:
    print(f"Failed to connect. Status code: {response.status_code}")

All good!
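
A slightly more defensive variant of the check above, as a minimal sketch. `get_page` is a hypothetical helper name (it is not part of the lab), and the 10-second timeout is an arbitrary assumption. `raise_for_status()` is the `requests` method that raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def get_page(url, timeout=10):
    """Fetch `url`; raise requests.HTTPError if the server answers with 4xx/5xx."""
    # The timeout keeps the call from hanging forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Does nothing for non-error responses, raises for 4xx/5xx ones.
    response.raise_for_status()
    return response
```

Wrapping a call such as `get_page("https://books.toscrape.com")` in `try`/`except requests.RequestException` would cover both HTTP errors and connection failures.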


# Book categories

Write code that collects the **relative URLs** in the left panel to obtain a list of all the book categories.

In [41]:
soup = BeautifulSoup(response.content, "html.parser")

categories_list = []

categories = soup.find('ul', class_='nav-list').find('ul').find_all('li')

for category in categories:
    category_url = category.a['href'] 
    categories_list.append(category_url)

categories_list

['catalogue/category/books/travel_2/index.html',
 'catalogue/category/books/mystery_3/index.html',
 'catalogue/category/books/historical-fiction_4/index.html',
 'catalogue/category/books/sequential-art_5/index.html',
 'catalogue/category/books/classics_6/index.html',
 'catalogue/category/books/philosophy_7/index.html',
 'catalogue/category/books/romance_8/index.html',
 'catalogue/category/books/womens-fiction_9/index.html',
 'catalogue/category/books/fiction_10/index.html',
 'catalogue/category/books/childrens_11/index.html',
 'catalogue/category/books/religion_12/index.html',
 'catalogue/category/books/nonfiction_13/index.html',
 'catalogue/category/books/music_14/index.html',
 'catalogue/category/books/default_15/index.html',
 'catalogue/category/books/science-fiction_16/index.html',
 'catalogue/category/books/sports-and-games_17/index.html',
 'catalogue/category/books/add-a-comment_18/index.html',
 'catalogue/category/books/fantasy_19/index.html',
 'catalogue/category/books/new-adul
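
If the category names are needed alongside the URLs, the same selection can feed a dictionary. The sketch below runs on a small hand-written HTML fragment that only mimics the shape of the left panel (the fragment, `SAMPLE_NAV`, `home_url` and `category_urls` are illustrative names, not part of the lab), and it uses `urljoin` from the standard library to turn each relative URL into an absolute one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML that mimics the left panel; the real page has many more entries.
SAMPLE_NAV = """
<ul class="nav nav-list">
  <li><a href="catalogue/category/books_1/index.html">Books</a>
    <ul>
      <li><a href="catalogue/category/books/travel_2/index.html">
            Travel
          </a></li>
      <li><a href="catalogue/category/books/mystery_3/index.html">
            Mystery
          </a></li>
    </ul>
  </li>
</ul>
"""

home_url = "https://books.toscrape.com/"
nav_soup = BeautifulSoup(SAMPLE_NAV, "html.parser")

# Same selection as in the cell above: the nested <ul> holds one <li> per category.
items = nav_soup.find("ul", class_="nav-list").find("ul").find_all("li")

# Map the visible category name to its absolute URL.
category_urls = {
    li.a.get_text(strip=True): urljoin(home_url, li.a["href"]) for li in items
}
print(category_urls)
```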

# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [43]:
from urllib.parse import urljoin

# `soup` still holds the home page, so each relative href is resolved against that page's URL.
page_url = "http://books.toscrape.com/index.html"

book_urls = [urljoin(page_url, book.h3.a['href']) for book in soup.find_all('article', class_='product_pod')]
book_urls

['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_9
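
Relative links mean different things depending on the page they come from, which is easy to get wrong with plain string concatenation. As a small sketch, `urllib.parse.urljoin` resolves an href against the URL of the page it was found on. The two page URLs and hrefs below are illustrative inputs chosen for the example, not values scraped in this notebook.

```python
from urllib.parse import urljoin

# Illustrative page URLs: the home page and one category page.
home_page = "http://books.toscrape.com/index.html"
category_page = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# On the home page the href is relative to the site root.
from_home = urljoin(home_page, "catalogue/a-light-in-the-attic_1000/index.html")

# On a category page the href first climbs three directories.
from_category = urljoin(category_page, "../../../a-light-in-the-attic_1000/index.html")

print(from_home)
print(from_category)
```

Both calls resolve to the same absolute URL, so no manual `replace('../../../', '')` is needed with this approach.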

# Book details

Create a Python function that, given a `book_url` as input, returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

Here `description` should contain the book's summary from the Product Description section, and the other values hold the corresponding information for the book.
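
One possible way to read the summary from the Product Description section is sketched below. It assumes that the section header is an element with the id `product_description` and that the summary is the next `<p>` sibling; that layout is an assumption about the page and should be checked in the browser's inspector. `SAMPLE_PAGE` and `extract_description` are hypothetical names used only for this example.

```python
from bs4 import BeautifulSoup

# Hand-written HTML; the `product_description` id and the <p> that follows it
# are an assumption about how the book page is laid out.
SAMPLE_PAGE = """
<article class="product_page">
  <div id="product_description" class="sub-header">
    <h2>Product Description</h2>
  </div>
  <p>It's hard to imagine a world without A Light in the Attic.</p>
</article>
"""


def extract_description(soup):
    """Return the text of the <p> that follows the description header, if any."""
    header = soup.find(id="product_description")
    paragraph = header.find_next_sibling("p") if header else None
    return paragraph.get_text(strip=True) if paragraph else "No description available"


sample_soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")
print(extract_description(sample_soup))
```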

In [51]:
import requests
from bs4 import BeautifulSoup

def scrape_book_details(book_url):

    response = requests.get(book_url)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the book page."}
    
    soup = BeautifulSoup(response.content, "html.parser")
    
    title = soup.find('h1').text
    
    price = soup.find('p', class_='price_color').text
    
    availability = soup.find('p', class_='instock availability').text.strip()
    
    rating = soup.find('p', class_='star-rating')['class'][1]
    rating_dict = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    rating = rating_dict.get(rating, "Unknown")
    
    description_tag = soup.find('meta', {'name': 'description'})
    description = description_tag['content'].strip() if description_tag else "No description available"

    upc = soup.find('th', string='UPC').find_next_sibling('td').text
    
    category = soup.find('ul', class_='breadcrumb').find_all('li')[2].text.strip()
    
    return {
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating,
        "Description": description,
        "UPC": upc,
        "Category": category
    }

book_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
book_details = scrape_book_details(book_url)
book_details

{'Title': 'A Light in the Attic',
 'Price': '£51.77',
 'Availability': 'In stock (22 available)',
 'Rating': 3,
 'Description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place 
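
The price and availability come back as display strings. If numeric values are preferred for later analysis, a sketch along these lines could convert them with regular expressions. `parse_price` and `parse_stock` are hypothetical helpers, and the patterns assume the formats seen in the output above (`'£51.77'` and `'In stock (22 available)'`).

```python
import re


def parse_price(price_text):
    """Turn a string such as '£51.77' into a float, or None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None


def parse_stock(availability_text):
    """Turn 'In stock (22 available)' into an int, or None if no count is present."""
    match = re.search(r"\((\d+) available\)", availability_text)
    return int(match.group(1)) if match else None


print(parse_price("£51.77"))
print(parse_stock("In stock (22 available)"))
```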

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then iterate over all the categories and over all the books in each category, collecting each book's information with the previous function. Fill the dictionary with the information about each book.

Show the first five rows of the final Pandas DataFrame.

Tip: you can use the `tqdm` function from the `tqdm` library to show a progress bar by wrapping the iterable of a for loop, as shown below :wink: :

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
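
A category listing may span several pages, in which case the first page alone does not show every book. The sketch below follows "next" links until none is left. It assumes the pager markup is `<li class="next"><a href="...">`, which should be verified on the site; `collect_book_urls`, `fetch_html` and `FAKE_SITE` are hypothetical names, and two tiny fake pages stand in for the network so the example runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_book_urls(first_page_url, fetch_html):
    """Follow 'next' links from `first_page_url` and return every book URL found.

    `fetch_html` is any callable that maps a URL to that page's HTML text.
    """
    book_urls = []
    page_url = first_page_url
    while page_url:
        soup = BeautifulSoup(fetch_html(page_url), "html.parser")
        for book in soup.find_all("article", class_="product_pod"):
            book_urls.append(urljoin(page_url, book.h3.a["href"]))
        # Assumed markup: <li class="next"><a href="page-2.html">next</a></li>
        next_item = soup.find("li", class_="next")
        page_url = urljoin(page_url, next_item.a["href"]) if next_item else None
    return book_urls


# Fake two-page category used instead of real HTTP requests.
FIRST_PAGE = "http://example.test/catalogue/category/books/demo_1/index.html"
FAKE_SITE = {
    FIRST_PAGE: """
        <article class="product_pod"><h3><a href="../../../book-a_1/index.html">A</a></h3></article>
        <ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
    """,
    "http://example.test/catalogue/category/books/demo_1/page-2.html": """
        <article class="product_pod"><h3><a href="../../../book-b_2/index.html">B</a></h3></article>
    """,
}

all_urls = collect_book_urls(FIRST_PAGE, FAKE_SITE.__getitem__)
print(all_urls)
```

Against the real site, `fetch_html` could be something like `lambda u: requests.get(u).text`; passing the fetcher in as a parameter is only a design choice that keeps the function easy to try without network access.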





In [55]:
from tqdm import tqdm

books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": []}

base_url = "http://books.toscrape.com/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "html.parser")

categories = soup.find('ul', class_='nav-list').find('ul').find_all('li')
category_urls = [base_url + category.a['href'] for category in categories]

for category_url in tqdm(category_urls):
    response = requests.get(category_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    book_urls = [base_url + "catalogue/" + book.h3.a['href'].replace('../../../', '') for book in soup.find_all('article', class_='product_pod')]
    
    for book_url in book_urls:
        book_details = scrape_book_details(book_url)
        if "error" in book_details:
            continue  # skip pages that could not be retrieved
        # The dictionary keys match the keys returned by scrape_book_details.
        for key in books_dict:
            books_dict[key].append(book_details[key])

books_df = pd.DataFrame(books_dict)
books_df

100%|███████████████████████████████████████████| 50/50 [02:37<00:00,  3.16s/it]


|  | Title | Price | Availability | Rating | Description | UPC | Category |
|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | £45.17 | In stock (19 available) | 2 | “Wherever you go, whatever you do, just . . . ... | a22124811bfa8350 | Travel |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | £49.43 | In stock (15 available) | 4 | Acclaimed travel writer Rick Antonson sets his... | ce60436f52c5ee68 | Travel |
| 2 | See America: A Celebration of Our National Par... | £48.87 | In stock (14 available) | 3 | To coincide with the 2016 centennial anniversa... | f9705c362f070608 | Travel |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | £36.94 | In stock (8 available) | 2 | With a new foreword by Tim Ferriss •There’s no... | 1809259a5a5f1d8d | Travel |
| 4 | Under the Tuscan Sun | £37.33 | In stock (7 available) | 3 | A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER... | a94350ee74deaa07 | Travel |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 512 | Why the Right Went Wrong: Conservatism--From G... | £52.65 | In stock (14 available) | 4 | “Dionne's expertise is evident in this finely ... | 2b5054a4192e9b06 | Politics |
| 513 | Equal Is Unfair: America's Misguided Fight Aga... | £56.86 | In stock (12 available) | 1 | We’ve all heard that the American Dream is van... | 3968e3fbf4695d7c | Politics |
| 514 | Amid the Chaos | £36.58 | In stock (15 available) | 1 | Some people call Eritrea the “North Korea of A... | bb8245f52c7cce8f | Cultural |
| 515 | Dark Notes | £19.19 | In stock (15 available) | 5 | They call me a slut. Maybe I am.Sometimes I do... | 88c21fcd38e2486e | Erotica |


In [59]:
books_df.head()

|  | Title | Price | Availability | Rating | Description | UPC | Category |
|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | £45.17 | In stock (19 available) | 2 | “Wherever you go, whatever you do, just . . . ... | a22124811bfa8350 | Travel |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | £49.43 | In stock (15 available) | 4 | Acclaimed travel writer Rick Antonson sets his... | ce60436f52c5ee68 | Travel |
| 2 | See America: A Celebration of Our National Par... | £48.87 | In stock (14 available) | 3 | To coincide with the 2016 centennial anniversa... | f9705c362f070608 | Travel |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | £36.94 | In stock (8 available) | 2 | With a new foreword by Tim Ferriss •There’s no... | 1809259a5a5f1d8d | Travel |
| 4 | Under the Tuscan Sun | £37.33 | In stock (7 available) | 3 | A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER... | a94350ee74deaa07 | Travel |
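
As an alternative to filling a dictionary of lists, the per-book dictionaries can be gathered in a list and handed to `pd.DataFrame` directly, which infers the columns from the keys. The sketch below uses two rows copied from the output above (without the truncated descriptions) plus one failed lookup; `records` and `sample_df` are illustrative names.

```python
import pandas as pd

# Two records copied from the output above; in the lab each dict would be
# the return value of scrape_book_details. The last entry mimics a failed request.
records = [
    {"Title": "It's Only the Himalayas", "Price": "£45.17",
     "Availability": "In stock (19 available)", "Rating": 2,
     "UPC": "a22124811bfa8350", "Category": "Travel"},
    {"Title": "Under the Tuscan Sun", "Price": "£37.33",
     "Availability": "In stock (7 available)", "Rating": 3,
     "UPC": "a94350ee74deaa07", "Category": "Travel"},
    {"error": "Failed to retrieve the book page."},
]

# Drop failed lookups, then let pandas build one row per remaining dict.
good_records = [record for record in records if "error" not in record]
sample_df = pd.DataFrame(good_records)
print(sample_df.head())
```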
