# Web-scraping with `BeautifulSoup4`

This document covers basic usage of `bs4` (Beautiful Soup 4) for scraping a webpage.  We will primarily discuss extracting information from *one* webpage, and leave web-crawling to an advanced class on web scraping.

## To scrape or not to scrape

Unlike APIs, which are designed for programs and applications to access data, web scraping works directly with user-facing websites that were built for human readers.

| Web scraping benefits:                                      | Web scraping challenges:           |
|-------------------------------------------------------------|------------------------------------|
| Any content that can be viewed on a webpage can be scraped. | Rarely tailored for researchers.   |
| No API needed.                                              | Your IP can be blocked (HTTP 403). |
| No rate-limiting or authentication (usually).               | Messy, unstructured, inconsistent. |
|                                                             | Entirely site-dependent.           |

**Rule of thumb:**
Check if there is an API.  If not, then consider scraping.

## Ethics of web scraping

Consider the following before scraping:
- Read the site's terms and conditions of data use.
- Check the site's `robots.txt` for the paths it asks automated clients to avoid.
- Self-throttle your requests, as you would with an API.
- Web scrapers require regular maintenance (best coupled with CI/CD).
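
The `robots.txt` and self-throttling points above can be checked in code. The following is a minimal sketch using the standard-library `urllib.robotparser`; the rules in `sample_robots`, the user-agent name `my-course-scraper`, and the `example.com` URLs are made up for illustration, and a real scraper would load the site's own file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
sample_robots = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(sample_robots.splitlines())

agent = "my-course-scraper"  # assumed user-agent name

# can_fetch() reports whether the rules allow this agent to request a URL.
allowed = rp.can_fetch(agent, "https://example.com/weather/today")
blocked = rp.can_fetch(agent, "https://example.com/private/data")

# crawl_delay() returns the requested pause in seconds, or None if unset.
delay = rp.crawl_delay(agent)

print(allowed, blocked, delay)
```

Calling `time.sleep(delay)` between requests is one simple way to self-throttle when the site publishes a crawl delay.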

## Anatomy of a webpage

A website is typically built from some combination of a codebase and a database.  The front-end product combines HTML, CSS stylesheets, and JavaScript.

```{figure} ../img/anatomy-html.jpg
---
width: 60%
name: html-anatomy
---
Anatomy of a website (Adobe)
```

```{figure} ../img/anatomy-html-css.jpg
---
width: 60%
name: anatomy-html-css
---
Anatomy of a website, with CSS styles (Adobe)
```

## Parsing a website
Retrieving a website's content is not difficult; extracting exactly the information you need is.

### HTML, briefly


```{figure} ../img/html-doc.png
---
width: 75%
name: jb-html
---
HTML structure of this Jupyter notebook.
```

### HTML as a tree

```{figure} ../img/html-tree.png
---
width: 75%
name: html-tree
---
HTML as a tree.  Each branch is an **element**.
```
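
To make the tree picture concrete, here is a small sketch of moving between parents, children, and siblings with `bs4`. The HTML string is a made-up example, and the built-in `html.parser` is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document: <body> holds an <h1> and a <ul> with two items.
html = "<html><body><h1>Title</h1><ul><li>one</li><li>two</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul")

# Move up the tree: the element that contains <ul>.
parent_name = ul.parent.name

# Move down the tree: the elements directly inside <ul>.
child_names = [child.name for child in ul.children]

# Move sideways: the <h1> that precedes <ul> under the same parent.
sibling_text = ul.find_previous_sibling("h1").text

print(parent_name, child_names, sibling_text)
```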

### Three components of HTML (*Tags*, *Attributes*, and *Content*)

```{figure} ../img/html-element.png
---
width: 60%
name: html-element
---
An example of an HTML element.
```
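
The three components map onto three properties of a `bs4` element. The sketch below parses a single made-up `<a>` element and reads its tag name, attributes, and content.

```python
from bs4 import BeautifulSoup

# One made-up element: tag <a>, attributes href and class, content "Next page".
html = '<a href="page.html" class="nav">Next page</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")

tag_name = link.name      # the tag
attributes = link.attrs   # the attributes, as a dict
content = link.text       # the content

# class can hold several values, so bs4 stores it as a list.
print(tag_name, attributes, content)
```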

### Example tags

|          Tag         |                         Meaning                        |
|:--------------------:|:------------------------------------------------------:|
| `<head>`               | page header (metadata, etc.)                           |
| `<body>`               | holds all of the content                               |
| `<p>`                  | regular text (paragraph)                               |
| `<h1>,<h2>,<h3>`       | header text, levels 1, 2, 3                            |
| `<ol>,<ul>,<li>`       | ordered list, unordered list, list item                |
| `<a href="page.html">` | link to "page.html"                                    |
| `<table>,<tr>,<td>`    | table, table row, table item                           |
| `<div>,<span>`         | general containers (can contain CSS, JavaScript, etc.) |
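
As a quick illustration of how these tags are located in practice, the sketch below runs `find` and `find_all` on a small made-up page containing a heading, a list of links, and a table.

```python
from bs4 import BeautifulSoup

# Made-up page that uses several of the tags from the table above.
html = """
<html><head><title>Demo</title></head>
<body>
<h1>Forecast</h1>
<p>Two links follow.</p>
<ul>
  <li><a href="today.html">Today</a></li>
  <li><a href="tomorrow.html">Tomorrow</a></li>
</ul>
<table>
  <tr><td>High</td><td>70</td></tr>
  <tr><td>Low</td><td>56</td></tr>
</table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match.
heading = soup.find("h1").text
links = [a["href"] for a in soup.find_all("a")]

# Tables are nested: each <tr> row holds its own <td> cells.
rows = [[td.text for td in tr.find_all("td")] for tr in soup.find_all("tr")]

print(heading, links, rows)
```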

## Example with weather.com

In [79]:
# beautifulsoup4 package and lxml parser
!pip install beautifulsoup4
!pip install lxml



In [80]:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'}
url = 'https://weather.com/weather/tenday/l/7ee7c4d6a3b13fff1e3d9a93240e83530a6dc829aebde965f9b31d7a97d5c295'

r = requests.get(url, headers=headers)
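
Because a scraper can be blocked (for example with a 403), it may help to check the response before parsing it. The following is a hedged sketch, not part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the raw HTML bytes for url, or raise if the request failed."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # e.g. a 403 when the site has blocked the client.
    response.raise_for_status()
    return response.content


# Usage (makes a network request, so it is left commented out here):
# html = fetch_html(url, headers=headers)
```

Failing early on a bad status avoids silently parsing an error page as if it were the forecast.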

In [85]:
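# name the parser explicitly (lxml was installed above) so results do not depend on bs4's default choice
soup = BeautifulSoup(r.content, 'lxml')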

In [None]:
# accessing content

In [86]:
# using bs4

<!DOCTYPE html>
<html dir="ltr" lang="en-US"><head>
<meta charset="utf-8" data-react-helmet="true"/><meta content="width=device-width, initial-scale=1, viewport-fit=cover" data-react-helmet="true" name="viewport"/><meta content="max-image-preview:large" data-react-helmet="true" name="robots"/><meta content="index, follow" data-react-helmet="true" name="robots"/><meta content="origin" data-react-helmet="true" name="referrer"/><meta content="Be prepared with the most accurate 10-day forecast for Evanston, IL with highs, lows, chance of precipitation from The Weather Channel and Weather.com" data-react-helmet="true" name="description"/><meta content="#ffffff" data-react-helmet="true" name="msapplication-TileColor"/><meta content="/daily/assets/ms-icon-144x144.d353af.png" data-react-helmet="true" name="msapplication-TileImage"/><meta content="#ffffff" data-react-helmet="true" name="theme-color"/><meta content="app-id=295646461" data-react-helmet="true" name="apple-itunes-app"/><meta conten

In [91]:
# selecting by tags
# guard against <details> tags without a class attribute, for which x is None
details = soup.find_all('details', attrs={'class': lambda x: x is not None and 'DaypartDetails' in x})
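
Matching on part of a class name can also be written as a CSS selector. The sketch below compares the two approaches on made-up markup; the class names `DaypartDetails--demo`, `Disclosure--demo`, and `Other` are hypothetical stand-ins, not the site's real class names.

```python
from bs4 import BeautifulSoup

# Made-up markup: one matching card, one with another class, one with no class.
html = """
<details class="DaypartDetails--demo Disclosure--demo"><h2>Today</h2></details>
<details class="Other"><h2>Skip</h2></details>
<details><h2>No class</h2></details>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: a function filter; the None check covers tags with no class.
by_function = soup.find_all(
    "details",
    attrs={"class": lambda c: c is not None and "DaypartDetails" in c},
)

# Option 2: a CSS attribute selector; *= means "contains this substring".
by_css = soup.select('details[class*="DaypartDetails"]')

titles = [d.find("h2").text for d in by_css]
print(len(by_function), titles)
```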


In [None]:
# locate by tags

In [106]:
details[0].find_all('h2')[0].text

'Today'

In [104]:
details[0].find_all('h2')[2].text

'Tue 07 | Night'

In [109]:
# locate neighboring content
# .find_next('span', attrs={'data-testid':'TemperatureValue'}).text

'70°'

In [112]:
# .find_next('p', attrs={'data-testid': 'wxPhrase'}).text

'Scattered showers and thunderstorms. A few storms may be severe. High around 70F. Winds SW at 10 to 15 mph. Chance of rain 70%.'

In [105]:
# locate neighboring content
details[0].find_all('h2')[2].find_next('span', attrs={'data-testid':'TemperatureValue'}).text

'56°'
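
The `find_next` pattern can be tried offline on a simplified stand-in for one forecast card. The `data-testid` values are the ones used above, but the surrounding structure and the text values in this sketch are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Simplified, made-up card: a day section followed by a night section.
html = """
<details>
  <h2>Tue 07 | Day</h2>
  <span data-testid="TemperatureValue">70°</span>
  <p data-testid="wxPhrase">Scattered showers.</p>
  <h2>Tue 07 | Night</h2>
  <span data-testid="TemperatureValue">56°</span>
  <p data-testid="wxPhrase">Partly cloudy.</p>
</details>
"""
soup = BeautifulSoup(html, "html.parser")
day, night = soup.find_all("h2")

# find_next() searches forward in document order from the starting element,
# so each heading picks up the first temperature that comes after it.
day_temp = day.find_next("span", attrs={"data-testid": "TemperatureValue"}).text
night_temp = night.find_next("span", attrs={"data-testid": "TemperatureValue"}).text
night_text = night.find_next("p", attrs={"data-testid": "wxPhrase"}).text

print(day_temp, night_temp, night_text)
```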

## Practice 11 - BeautifulSoup

1. Locate the tags and attributes for the following items:

```{figure} ../img/weather-elements.png
---
width: 50%
name: weather-element
---
Fact card from weather.com.
```

2. Create a dataframe with the following columns:
   - DateDay
   - Temperature
   - Rain
   - UV
   - Description

3. Using `BeautifulSoup`, populate the table for the first day (May 7).
4. Repeat for the next nine days (May 8 - May 16).
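
One possible shape for steps 2 to 4 is to collect one dictionary per day and build the table from the list. The sketch below runs on made-up markup with hypothetical class names and values, and it fills only three of the columns; locating the Rain and UV elements on the real page is left to step 1.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for two daily cards.
html = """
<details class="DaypartDetails--demo">
  <h2>Tue 07</h2>
  <span data-testid="TemperatureValue">70°</span>
  <p data-testid="wxPhrase">Scattered showers.</p>
</details>
<details class="DaypartDetails--demo">
  <h2>Wed 08</h2>
  <span data-testid="TemperatureValue">65°</span>
  <p data-testid="wxPhrase">Mostly sunny.</p>
</details>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select('details[class*="DaypartDetails"]'):
    # One dictionary per card; the keys become the column names.
    rows.append({
        "DateDay": card.find("h2").text,
        "Temperature": card.find("span", attrs={"data-testid": "TemperatureValue"}).text,
        "Description": card.find("p", attrs={"data-testid": "wxPhrase"}).text,
    })

print(rows)
```

Passing `rows` to `pandas.DataFrame(rows)` would turn the list into a dataframe with one row per day.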

## Further reference
Read *The Legalities and Ethics of Web Scraping* {cite:p}`mitchell2018web` for a brief discussion on web-scraping ethics.