# Web Scraping lab

In this lab you will scrape this [website](https://books.toscrape.com/) of books.

You have to create a Pandas DataFrame with all the books listed on the site. Each row of the DataFrame should contain the information for one book. In particular, the DataFrame must contain:

* category
* title
* price
* stock availability
* star rating (number of stars)
* description
* UPC

Happy scraping!



# Server verification

Load the needed libraries, and make sure that you can obtain the correct status code.

In [1]:
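# Load the libraries and check that the server answers with status code 200
from bs4 import BeautifulSoup
import requests

response = requests.get("https://books.toscrape.com/")
print(response.status_code)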

# Book categories

Create the code to collect the **relative urls** from the left panel to obtain a list with all the book categories.

In [2]:
url_lab = "https://books.toscrape.com/"
response = requests.get(url_lab)
response.encoding = "utf-8"  # decode as UTF-8; otherwise "£" can come out as "Â£"
html_obtenido = response.text


In [3]:
soup = BeautifulSoup(html_obtenido, 'html.parser')

In [5]:
# Create the code to collect the relative urls from the left panel to obtain a list with all the book categories.

cat_section = soup.find('ul', class_='nav-list')
cat_links = cat_section.find_all('a')

In [6]:
category_urls = []
for link in cat_links:
    href = link.get('href')
    if href:
        category_urls.append(href.strip())

In [7]:
for url in category_urls:
    print(url)

catalogue/category/books_1/index.html
catalogue/category/books/travel_2/index.html
catalogue/category/books/mystery_3/index.html
catalogue/category/books/historical-fiction_4/index.html
catalogue/category/books/sequential-art_5/index.html
catalogue/category/books/classics_6/index.html
catalogue/category/books/philosophy_7/index.html
catalogue/category/books/romance_8/index.html
catalogue/category/books/womens-fiction_9/index.html
catalogue/category/books/fiction_10/index.html
catalogue/category/books/childrens_11/index.html
catalogue/category/books/religion_12/index.html
catalogue/category/books/nonfiction_13/index.html
catalogue/category/books/music_14/index.html
catalogue/category/books/default_15/index.html
catalogue/category/books/science-fiction_16/index.html
catalogue/category/books/sports-and-games_17/index.html
catalogue/category/books/add-a-comment_18/index.html
catalogue/category/books/fantasy_19/index.html
catalogue/category/books/new-adult_20/index.html
catalogue/category/b

In [None]:
# Your code here

# Books in a given category

Use web scraping and a list comprehension to obtain the **absolute** URL of each book to be scraped.

In [8]:
# Your code here
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"
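
As a minimal sketch of the list-comprehension step, the snippet below resolves relative links against the URL of the page they were found on, using `urljoin`. The `category_page` value and the two `relative_hrefs` are hypothetical placeholders chosen for illustration, not values taken from this notebook; in practice the hrefs would come from the parsed category page.

```python
from urllib.parse import urljoin

# Hypothetical example values: a category page and two hrefs as they might appear on it
category_page = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"
relative_hrefs = [
    "../../../example-book-one_101/index.html",
    "../../../example-book-two_102/index.html",
]

# Resolve each relative href against the page it was found on
absolute_urls = [urljoin(category_page, href) for href in relative_hrefs]

for url in absolute_urls:
    print(url)
```

Resolving against the category page rather than the site root matters here, because the `..` segments are interpreted relative to the page's own path.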

# Book details

Create a Python function that, given a `book_url` as input, returns a dictionary with the following structure:

```Python
{"Title": title, "Price": price, "Availability": availability, "Rating": rating, "Description": description, "UPC": upc}
```

where `description` should contain the book's summary given in the Product description, and the values are the book's associated information.

In [9]:
# Your code here

def extract_book_details(book_url):
    response = requests.get(book_url)  # Send a GET request to the book URL
    soup = BeautifulSoup(response.content, 'html.parser')

In [10]:
# extract the book title
title = soup.find('h1').get_text()
title

'All products'

In [11]:
# the book price
price = soup.find('p', class_='price_color').get_text()
price

'£51.77'
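
The scraped price is a string that carries a currency symbol (and, depending on how the response was decoded, possibly stray encoding characters). If a numeric column is wanted in the DataFrame, one possible approach is to pull out just the digits with a regular expression. The helper name `parse_price` is an assumption introduced for this sketch.

```python
import re

def parse_price(price_text):
    """Return the numeric part of a price string as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None

print(parse_price("£51.77"))
```

Matching the number instead of stripping a specific symbol keeps the helper working whether or not the currency sign was decoded correctly.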

In [12]:
# availability
availability = soup.find('p', class_='instock availability').get_text(strip=True)
availability

'In stock'

In [13]:
# rating information
rating_inf = soup.find('p', class_='star-rating')
rating = rating_inf['class'][1] if rating_inf else "No rating"
rating

'Three'
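
The rating comes back as a word taken from the element's class list, while the lab asks for the number of stars. A small lookup table is one way to convert it. Both `RATING_WORDS` and `rating_to_stars` are hypothetical names for this sketch, and the mapping assumes the site only uses the words One to Five.

```python
# Assumed mapping from the class-name words to star counts
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_stars(rating_word):
    """Convert a rating word such as 'Three' to an int; None if it is not recognised."""
    return RATING_WORDS.get(rating_word)

print(rating_to_stars("Three"))
```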

In [14]:
# description
description_element = soup.find('meta', {'name': 'description'})
description = description_element['content'].strip() if description_element else "No description available."
description

''

In [15]:
# UPC universal product code
upc_cat = soup.find('th', string='UPC')
upc = upc_cat.find_next_sibling('td').get_text() if upc_cat else "No UPC available"
upc


'No UPC available'

# Collect and store all the information from the books in a Pandas DataFrame

Start with the following dictionary:

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [], "Description": [], "UPC": [], "Category": [] }
```

Then, iterate over all the categories and all the books in a given category to collect any book information using the previous function. Fill the previous dictionary with the information about each book.

Show the first five rows of the previous final Pandas DataFrame.

Tip: You can use the `tqdm` function from the `tqdm` library to show a progress bar while looping over the iterable of a for loop, as shown below :wink::

```python
from tqdm import tqdm

for elem in tqdm(iterable):
    # some code
```
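
The filling step described above can be sketched without touching the network. In the snippet below, `fake_extract_book_details` is a hypothetical stand-in for the real scraping function and `books_by_category` holds made-up URLs; only the shape of the loop is the point.

```python
books_dict = {"Title": [], "Price": [], "Availability": [], "Rating": [],
              "Description": [], "UPC": [], "Category": []}

def fake_extract_book_details(book_url):
    """Hypothetical stand-in for the real scraper so the sketch runs offline."""
    return {"Title": f"Book at {book_url}", "Price": "£0.00", "Availability": "In stock",
            "Rating": "Three", "Description": "", "UPC": "placeholder"}

# Hypothetical category -> book URLs mapping; real values would come from scraping
books_by_category = {
    "Travel": ["https://example.com/book-1", "https://example.com/book-2"],
    "Mystery": ["https://example.com/book-3"],
}

for category, book_urls in books_by_category.items():
    for book_url in book_urls:
        details = fake_extract_book_details(book_url)
        # Append each field to its column, then record the category alongside it
        for key, value in details.items():
            books_dict[key].append(value)
        books_dict["Category"].append(category)

print({key: len(values) for key, values in books_dict.items()})
```

Because every list ends up the same length, `pd.DataFrame(books_dict).head()` would then show the first five rows, and wrapping the outer iterable in `tqdm(...)` adds the progress bar.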





In [16]:
from tqdm import tqdm

# Your code here

In [17]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd
from tqdm import tqdm

In [18]:
base_url = "http://books.toscrape.com/"

In [19]:
def get_category_urls():
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    categories = soup.find('ul', class_='nav-list').find('ul').find_all('a')
    category_urls = {category.get_text(strip=True): urljoin(base_url, category['href']) for category in categories}
    return category_urls

In [20]:
def get_book_urls(category_url):
    """Collect the absolute URL of every book in a category, following pagination."""
    book_urls = []
    while category_url:
        response = requests.get(category_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        books = soup.find_all('h3')
        # Book hrefs are relative to the current page, so resolve them against category_url
        book_urls += [urljoin(category_url, book.find('a')['href']) for book in books]
        # Find the next page URL if it exists
        next_page = soup.find('li', class_='next')
        category_url = urljoin(category_url, next_page.find('a')['href']) if next_page else None
    return book_urls

In [22]:
# Function to extract book details
def extract_book_details(book_url):
    response = requests.get(book_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract details
    title = soup.find('h1').get_text()
    price = soup.find('p', class_='price_color').get_text()
    availability = soup.find('p', class_='instock availability').get_text(strip=True)
    rating_element = soup.find('p', class_='star-rating')
    rating = rating_element['class'][1] if rating_element else "No rating"
    description_element = soup.find('meta', {'name': 'description'})
    description = description_element['content'].strip() if description_element else "No description available."
    upc_element = soup.find('th', string='UPC')
    upc = upc_element.find_next_sibling('td').get_text() if upc_element else "No UPC available."
    # Return the fields in the structure requested by the lab
    return {"Title": title, "Price": price, "Availability": availability,
            "Rating": rating, "Description": description, "UPC": upc}
