{{< include _include_d5.qmd >}}

## Web scraping

The page "https://books.toscrape.com/catalogue/page-2.html" was fetched with the `requests` library and its raw content saved to an HTML file named "books_page_2.html". The `Beautiful Soup` library was then used to parse the HTML and find every `article` tag with the class `product_pod`. For each of these tags, the `title` and `href` attributes were read from the `a` tag nested inside the `h3` tag. The extracted values were collected into a pandas DataFrame with one row per book, holding its title and its link.

#| eval: true
#| echo: true
#| output: true

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to fetch
books_url = 'https://books.toscrape.com/catalogue/page-2.html'

# Make a request to the website using the requests library
books_response = requests.get(books_url, timeout=30)
books_response.raise_for_status()  # stop early on an HTTP error status
books_content = books_response.content

# Save the content to an HTML file
books_filename = '../../xml/books_page_2.html'
with open(books_filename, 'wb') as file:
    file.write(books_content)

# Parse the content with BeautifulSoup
books_soup = BeautifulSoup(books_content, 'lxml')

# Look for all article tags of class 'product_pod'
product_pods = books_soup.find_all('article', class_='product_pod')

# Extract the title and href attributes of the a tag nested in each h3 tag
books_data = []
for pod in product_pods:
    h3_tag = pod.find('h3')
    a_tag = h3_tag.find('a') if h3_tag else None
    books_data.append({
        'title': a_tag.attrs.get('title') if a_tag else None,
        'link': a_tag.attrs.get('href') if a_tag else None
    })

# Convert the data to a pandas dataframe
books_df = pd.DataFrame(books_data)
books_df.head()
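
The `href` values collected above are stored exactly as they appear in the page, and catalogue listings of this kind typically use links relative to the page URL. As a minimal sketch, assuming the records have the same `title`/`link` shape as `books_data`, the links could be resolved to absolute URLs with `urllib.parse.urljoin`. The helper name `absolutize_links` and the sample record are hypothetical and only serve as illustration.

```python
from urllib.parse import urljoin


def absolutize_links(records, page_url):
    """Return copies of the records with each 'link' resolved against page_url.

    Records whose 'link' is missing or None keep None as their link.
    """
    resolved = []
    for record in records:
        link = record.get("link")
        resolved.append({
            **record,
            "link": urljoin(page_url, link) if link else None,
        })
    return resolved


# Hypothetical records in the same shape as books_data
sample_records = [
    {"title": "Example Book", "link": "example-book_1/index.html"},
    {"title": None, "link": None},
]
page_url = "https://books.toscrape.com/catalogue/page-2.html"

absolute_records = absolutize_links(sample_records, page_url)
print(absolute_records[0]["link"])
```

Resolving against the page URL rather than the site root matters here, because a relative `href` is interpreted from the directory of the page it appears on.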

## Survey experiments

- Upcoming: Streamlit? PyScript? Shiny?
