{{< include _include_d5.qmd >}}

## Web scraping

`robots.txt` is a standard file that websites use to tell crawlers and scraping bots which paths they should not request. It is advisory: it controls how well-behaved bots, such as search engine crawlers, traverse a site, but it does not technically block access. The `sitemap.xml` file, on the other hand, lists the URLs a site wants crawlers to discover, often with metadata such as the last modification date. Checking both files before scraping helps keep a scraper both polite and efficient.

Sample `robots.txt`:
```
User-agent: *
Disallow: /private/
Disallow: /temp/
```
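
A scraper can check these rules programmatically before requesting a page. The following is a minimal sketch using the standard-library `urllib.robotparser`, applied to the sample rules above; the user agent name `my-scraper` and the example URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from above, as a string (normally fetched from the site)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /temp/
"""

# Parse the rules from a list of lines instead of downloading them
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-scraper" is a hypothetical user agent name
allowed_home = parser.can_fetch("my-scraper", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "http://www.example.com/private/data.html")

print(allowed_home, allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.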

Sample `sitemap.xml`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2022-09-24</lastmod>
   </url>
</urlset>
```
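
The URLs in a sitemap can be read with the standard library. This is a minimal sketch, assuming the sitemap follows the sample above and has already been downloaded; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The sample sitemap from above, as bytes (normally the body of an HTTP response)
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2022-09-24</lastmod>
   </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP_XML)
sitemap_entries = [
    {
        "loc": url.findtext("sm:loc", namespaces=NS),
        "lastmod": url.findtext("sm:lastmod", namespaces=NS),
    }
    for url in root.findall("sm:url", NS)
]

print(sitemap_entries)
```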

For web scraping in Python, the `BeautifulSoup` and `Scrapy` packages are popular choices. While `BeautifulSoup` is ideal for parsing HTML and XML documents, `Scrapy` provides a powerful framework for large-scale web scraping tasks.

### BeautifulSoup

The example below fetches the webpage "https://books.toscrape.com/catalogue/page-2.html" with the `requests` library and saves the content as an HTML file named "books_page_2.html". It then parses the HTML with `BeautifulSoup` and finds all `article` tags with the class `product_pod`. Within each of these tags, it extracts the `title` and `href` attributes of the `a` tag nested inside the `h3` tag. Finally, it collects the results in a pandas dataframe that shows the title of each book alongside its link.

```{python}
#| eval: false
#| echo: true
#| output: true

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to fetch
books_url = 'https://books.toscrape.com/catalogue/page-2.html'

# Make a request to the website using the requests library
books_response = requests.get(books_url)
books_response.raise_for_status()  # stop early on an HTTP error
books_content = books_response.content

# Save the content to an HTML file
books_filename = 'books_page_2.html'
with open(books_filename, 'wb') as file:
    file.write(books_content)

# Parse the content with BeautifulSoup (the 'lxml' parser must be installed)
books_soup = BeautifulSoup(books_content, 'lxml')

# Look for all article tags of class 'product_pod'
product_pods = books_soup.find_all('article', class_='product_pod')

# For each article, read the title and href attributes of the a tag inside h3
books_data = []
for pod in product_pods:
    h3_tag = pod.find('h3')
    a_tag = h3_tag.find('a') if h3_tag else None
    books_data.append({
        'title': a_tag.get('title') if a_tag else None,
        'link': a_tag.get('href') if a_tag else None
    })

# Convert the data to a pandas dataframe
books_df = pd.DataFrame(books_data)
books_df.head()
```
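
The `href` values collected this way may be relative to the page they came from. The following is a minimal sketch, assuming that is the case, of turning them into absolute URLs with the standard-library `urllib.parse.urljoin`. The sample link is a hypothetical placeholder, not a value taken from the live site.

```python
from urllib.parse import urljoin

# Page the links were scraped from
page_url = "https://books.toscrape.com/catalogue/page-2.html"

# Hypothetical relative link, standing in for values in the 'link' column
relative_links = ["some-book_1/index.html"]

# Resolve each relative link against the URL of the page it appeared on
absolute_links = [urljoin(page_url, href) for href in relative_links]

print(absolute_links)
```

The same function could be applied to a dataframe column, for example with `books_df['link'].map(...)`.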

## Put it together!

- NLP
- data analysis
- summarize results
- visualization, wordcloud

## Online survey experiments

Work in progress. Check out [Streamlit](https://streamlit.io/) and the GitHub repository linked below for more info!

- [https://github.com/nils-holmberg/scom-expm](https://github.com/nils-holmberg/scom-expm)

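```{python}
#| eval: false
#| echo: true
#| output: true

# Install Streamlit first if needed:
#!pip install streamlit

import random

import streamlit as st

conditions = ["a", "b"]

# Streamlit reruns the script on every interaction, so store the randomly
# assigned condition in session state to keep it fixed within a session
if "condition" not in st.session_state:
    st.session_state["condition"] = random.choice(conditions)
selected_condition = st.session_state["condition"]

# Show the stimulus image for the assigned condition
if selected_condition == "a":
    st.image("img/exps-stim-c.png")
else:
    st.image("img/exps-stim-t.png")

# Run from a terminal:
#!streamlit hello
#!streamlit run app.py
```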

- [https://osm-exp-vetchcj6ybl.streamlit.app/](https://osm-exp-vetchcj6ybl.streamlit.app/)
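
Random assignment per session means a returning participant may land in a different condition. One alternative, sketched below, is to derive the condition from a participant identifier so that the assignment is repeatable. The function `assign_condition` and the identifier format are hypothetical and not part of the linked app.

```python
import hashlib

CONDITIONS = ["a", "b"]

def assign_condition(participant_id, conditions=CONDITIONS):
    """Map a participant ID to a condition in a repeatable way."""
    # Hash the ID so assignment does not depend on the order of arrival
    digest = hashlib.sha256(participant_id.encode("utf-8")).hexdigest()
    # Use the hash value to pick an index into the list of conditions
    return conditions[int(digest, 16) % len(conditions)]

# The same (hypothetical) ID always yields the same condition
first_visit = assign_condition("participant-001")
second_visit = assign_condition("participant-001")
print(first_visit == second_visit)
```

Hash-based assignment gives approximately, not exactly, equal group sizes, so strict balancing would need a different scheme.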

**Django vs. Flask: Python Web Frameworks**

Django and Flask are two prominent Python web frameworks that follow different philosophies. Django, which describes itself as "the web framework for perfectionists with deadlines", is a high-level, all-inclusive framework. It takes a "batteries-included" approach, offering a built-in admin panel, an ORM, and a conventional project layout, which makes it well suited to larger applications and rapid development. In contrast, Flask is a micro-framework. It is lightweight and flexible, and it leaves the choice of components to the developer. Flask does not prescribe a directory structure or bundle extras the way Django does, which makes it a good fit for small applications or microservices. Both have their merits, and the choice depends on the project's needs.
