# Web Scraping Basics with HTML and BeautifulSoup

Use this notebook to learn how we move from a raw HTML page to the information we care about. Each step shows one idea at a time so you can practice as you read.

## Important Guidelines for Web Scraping

Keep these habits whenever you scrape a site:

1. Check the terms of service and the `robots.txt` file. Only scrape what is allowed.
2. Space out your requests. A short delay keeps the site responsive for everyone.
3. Re-run your code often. Sites change layouts, so your selectors may need updates.
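
As a minimal sketch of habit 2, the loop below pauses between consecutive requests. The helper name `fetch_politely`, its `fetch` parameter, and the example URLs are hypothetical and only serve the illustration; a stand-in fetcher replaces real HTTP calls so the cell runs offline.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> Dict[str, str]:
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Stand-in fetcher (an assumption for this sketch) so nothing touches the network.
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(list(pages))
```

For real pages, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`, with a delay at least as long as any `Crawl-delay` the site publishes.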

### Example: Checking robots.txt

A quick look at `robots.txt` tells us which sections of the site welcome bots and which ones do not.


## What is robots.txt?
`robots.txt` is a plain text file located at the root of a website (e.g., `https://example.com/robots.txt`). It tells web crawlers which parts of the site they may or may not access.

robots.txt example:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public-info/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
```

## What is sitemap.xml?
`sitemap.xml` is an XML file that lists URLs on a website that are available for crawling. It helps search engines index content more efficiently.

sitemap.xml example:
```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-10-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
    <lastmod>2025-10-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/private/public-info/</loc>
    <lastmod>2025-09-30</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```
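
A sitemap like the one above can be read with the standard library. The sketch below is one possible approach, not the only one: it parses a shortened copy of the example and collects each `<loc>` with its `<lastmod>`. The `sm` prefix is an arbitrary label chosen here for the sitemap namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example sitemap, kept as bytes so the
# encoding declaration on the first line is handled by the parser.
sitemap_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-10-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
    <lastmod>2025-10-15</lastmod>
  </url>
</urlset>
"""

# Every element lives in the sitemap namespace, so lookups need a prefix map.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

entries = [
    (url.findtext("sm:loc", namespaces=ns), url.findtext("sm:lastmod", namespaces=ns))
    for url in root.findall("sm:url", ns)
]
for loc, lastmod in entries:
    print(loc, lastmod)
```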

### Why look at robots.txt?

The file explains what each bot can do. A quick checklist:

- Open `https://example.com/robots.txt` in a browser.
- Rules under `User-agent: *` apply to every bot that has no more specific group of its own, which covers most scrapers.
- `Disallow` lines list paths you must skip; `Allow` lines give explicit approval.
- Some sites add extra rules in their terms of service, so follow those too.

Even when a path is allowed, send requests gently so you do not overload the server.
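
The checklist can also be automated. The sketch below uses Python's standard `urllib.robotparser` on rules like the earlier example, parsed from a string so it runs offline. One assumption to note: the `Allow` line is listed before the broader `Disallow: /private/`, because this parser returns the first rule that matches in file order.

```python
from urllib.robotparser import RobotFileParser

# Rules modelled on the example above; Allow comes first because
# urllib.robotparser applies the first matching rule it finds.
robots_txt = """\
User-agent: *
Allow: /private/public-info/
Disallow: /admin/
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for path in ["/blog/", "/admin/login", "/private/notes", "/private/public-info/"]:
    allowed = parser.can_fetch("*", "https://example.com" + path)
    print(f"{path}: {'allowed' if allowed else 'blocked'}")

print("Crawl delay:", parser.crawl_delay("*"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches the file instead of parsing a string.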

In [None]:
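import requests

# Fetch robots.txt to review allowed paths before scraping
url = 'https://www.esilv.fr/robots.txt'
response = requests.get(url, timeout=10)  # avoid hanging if the server is slow
response.raise_for_status()               # stop early on 4xx/5xx responses
print(response.text)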

## Installing and Importing Libraries

We rely on three core libraries throughout the course:

- `requests` handles HTTP calls.
- `beautifulsoup4` (imported as `bs4`) parses the HTML.
- `lxml` gives BeautifulSoup a fast parser backend.

Install them with any package manager you prefer:

**pip**
```
pip install requests beautifulsoup4 lxml
```

**conda**
```
conda install requests beautifulsoup4 lxml
```

**uv**
```
uv add requests beautifulsoup4 lxml
```

Once installed, import them as shown below.

In [None]:
# Load BeautifulSoup and requests for the upcoming examples
from bs4 import BeautifulSoup
import requests

## Our Sample HTML Page

To practice without hitting external sites we use a self-contained HTML snippet. It includes headings, paragraphs, lists, links, a table, and a footer so we can try different selectors.

In [1]:
# Local HTML fixture used for all parsing demos below
sample_html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Sample Web Scraping Page</title>
    <style>
        .highlight { color: blue; font-weight: bold; }
        #main-title { font-size: 24px; color: green; }
        table { border-collapse: collapse; }
        th, td { border: 1px solid black; padding: 5px; }
    </style>
</head>
<body>
    <h1 id="main-title">Welcome to Web Scraping</h1>
    <h2>Introduction</h2>
    <p class="intro">This is an introductory paragraph.</p>
    <p class="intro highlight">Another intro paragraph with <b>bold text</b>.</p>
    <p class="content">Content paragraph.</p>

    <h2>Libraries</h2>
    <p>Useful libraries for web scraping:</p>
    <ul>
        <li><a href="https://beautiful-soup-4.readthedocs.io/" class="library" id="link1">BeautifulSoup</a></li>
        <li><a href="https://docs.scrapy.org/" class="library" id="link2">Scrapy</a></li>
        <li><a href="https://selenium-python.readthedocs.io/" class="library" id="link3">Selenium</a></li>
    </ul>

    <h2>Data Table</h2>
    <table>
        <tr><th>Library</th><th>Purpose</th></tr>
        <tr><td>BeautifulSoup</td><td>HTML Parsing</td></tr>
        <tr><td>Requests</td><td>HTTP Requests</td></tr>
    </table>

    <div class="footer">
        <p>Footer text.</p>
        <!-- This is a comment -->
    </div>
</body>
</html>
"""

print("Sample HTML loaded.")

Sample HTML loaded.


### Creating the Soup Object

BeautifulSoup turns raw HTML into a searchable tree. Build the soup once and reuse it across the notebook.

In [None]:
# Parse the HTML string and view the formatted structure
soup = BeautifulSoup(sample_html, 'lxml') # html.parser can also be used
print(soup.prettify())

## What is HTML?

HTML (HyperText Markup Language) gives structure to every web page. Key pieces you will see often:

- `<!DOCTYPE html>` announces an HTML document.
- `<html>` wraps the full page.
- `<head>` stores metadata, the title, and CSS rules.
- `<body>` contains the visible content.
- Headings (`<h1>` … `<h6>`), paragraphs (`<p>`), and links (`<a>`) show the text.
- Lists use `<ul>` / `<ol>` with `<li>` items.
- Tables use rows `<tr>`, headers `<th>`, and cells `<td>`.
- `<div>` groups related content.
- Inline styles and comments sit inside the HTML as well.

Attributes such as `id`, `class`, and `href` help us target the right elements while scraping.
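
To make the table tags concrete, here is a small sketch that turns a copy of the sample table into a list of dictionaries. The names `table_soup`, `column_names`, and `records` are chosen for this illustration, and `html.parser` is used so the cell does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
    <tr><th>Library</th><th>Purpose</th></tr>
    <tr><td>BeautifulSoup</td><td>HTML Parsing</td></tr>
    <tr><td>Requests</td><td>HTTP Requests</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

rows = table_soup.find_all("tr")
# The first row holds the <th> header cells.
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
# Each remaining row becomes one dict keyed by the header names.
records = [
    dict(zip(column_names, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
```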

## What is CSS?

CSS (Cascading Style Sheets) styles the HTML. Its selectors target elements through two HTML attributes that appear everywhere:

- `id` labels a single element.
- `class` groups elements that share a style.

Scrapers reuse these same identifiers to find the content we need within the page structure.

### Selecting Elements by id and class

BeautifulSoup exposes simple helpers:

- `soup.find(id="main-title")` grabs the unique element with that id.
- `soup.select(".intro")` returns every element that carries the `intro` class.

### Finding Elements by Class

Pick the helper by how many matches you need:

- `soup.find(class_="intro")` returns the first match.
- `soup.select(".intro")` returns a list of matches for looping.
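
The difference between the helpers is easiest to see side by side. This sketch uses a trimmed copy of the sample paragraphs (the name `demo_soup` is just for this example) and also shows what happens when nothing matches.

```python
from bs4 import BeautifulSoup

demo_html = """
<p class="intro">First intro.</p>
<p class="intro highlight">Second intro.</p>
<p class="content">Content.</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

first = demo_soup.find(class_="intro")        # a single Tag: the first match
every = demo_soup.select(".intro")            # list of Tags matched by a CSS selector
same = demo_soup.find_all(class_="intro")     # same elements through find_all
both = demo_soup.select("p.intro.highlight")  # element must carry both classes
missing = demo_soup.find(class_="sidebar")    # no match, so find() returns None

print(first.text)
print([p.text for p in every])
print([p.text for p in both])
print(missing)
```

Because `find()` returns `None` when a layout changes, checking the result before calling `.text` avoids an `AttributeError`.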

In [None]:
# Grab the element with the unique 'id'
title = soup.find(id="main-title")
print("Title:", title)

In [None]:
# Collect every paragraph that uses the intro class
intros = soup.select(".intro")
print("Intro paragraphs:")
for p in intros:
    print(p.text)

### Finding Elements by Class (Alternative)

`find()` with the `class_` keyword (the trailing underscore avoids Python's reserved word `class`) is handy when you only need the first match.

In [None]:
# First .find() match for the content class
content = soup.find(class_="content")
print("Content:", content.text)

## Working with the Sample Page

From here on we reuse the `sample_html` content. Each section shows a new way to pull information from the same structure.

## Basics of the Title Tag

Title information helps identify the page quickly. We can examine the tag, the text inside it, and its position in the tree.

In [None]:
# Inspect the raw <title> tag
print("Title tag:", soup.title)

### Read the Title Text

`.text` strips away the tag and leaves the readable string, which is convenient for logging or saving.

In [None]:
# Pull just the text inside the <title> tag
print("Title text:", soup.title.text)

### Check the Tag Name

`.name` confirms the element type so you can branch logic when needed.

In [None]:
# Confirm the element type
print("Tag name:", soup.title.name)

### Look at the Parent

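`.parent` steps one level up the tree so you can inspect the surrounding structure. The cell below applies it to the first paragraph, whose parent is `<body>`; `soup.title.parent` would give `<head>` instead.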

In [None]:
# Move up one level to see where the paragraph lives
print("Parent:", soup.p.parent.name)

## Basics of Paragraph Tags

Paragraphs (`<p>`) often hold the text you want. We can grab the first one, inspect its classes, and loop through all of them.

In [None]:
# First paragraph element in the document
print("First p:", soup.p)

### Check the Paragraph Classes

Access the `class` attribute to see how the page labels each paragraph.

In [None]:
# View the class list attached to that paragraph
print("Class:", soup.p['class'])
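
A short sketch of how attribute access behaves, using a throwaway snippet (`attr_soup` is a name chosen here): `class` can hold several values, so BeautifulSoup returns it as a list, while `id` and `href` come back as plain strings.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup(
    '<p class="intro highlight">Hi</p><a href="/docs" id="link1">Docs</a>',
    "html.parser",
)
paragraph = attr_soup.p
link = attr_soup.a

print(paragraph["class"])   # multi-valued attribute: a list of class names
print(link["id"])           # single-valued attribute: a string
print(link.get("class"))    # .get() returns None when the attribute is absent
print(link.attrs)           # all attributes of the tag as a dictionary
```

Square brackets raise `KeyError` for a missing attribute, so `.get()` is the safer choice when an attribute may not be present.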

### Read the Paragraph Text

`.text` returns the text inside the tag, including text from any nested tags, ready for storage or further processing.

In [None]:
# Extract plain text from the paragraph
print("Text:", soup.p.text)

### Loop Through All Paragraphs

`find_all('p')` returns every paragraph tag so you can iterate and handle them together.

In [None]:
# Gather every <p> element and inspect the text
all_p = soup.find_all('p')
print("Number of p tags:", len(all_p))

for i, p in enumerate(all_p):
    print(f"P {i+1}: {p.text}")

## Basics of Anchor Tags

Links (`<a>`) contain two important pieces: the clickable text and the `href` that points to the destination. They can also carry classes and ids for styling.

In [None]:
# First anchor element in the document
print("First a:", soup.a)

### Read the Href Attribute

`href` stores the actual link target. We often capture it alongside the anchor text.

In [None]:
# Pull the destination URL from the link
print("Href:", soup.a['href'])

### Check Class and ID Attributes

Classes group similar links; an id marks one specific link. Both are useful for targeting elements precisely.

In [None]:
# Look at the label information attached to the link
print("Class:", soup.a['class'])
print("Id:", soup.a['id'])

### List Every Link

Looping through `find_all('a')` gives us every anchor. Store or print each `href` depending on your task.

In [None]:
# Loop through each <a> tag and print the target URL
all_a = soup.find_all('a')
print("All hrefs:")
for a in all_a:
    print(a.get('href'))

## Analyzing the HTML Structure

Understanding parents and children helps when the data you need sits near a known element.

In [None]:
# Walk up the tree from a single link
link = soup.a
print("Parents:")
for parent in link.parents:
    print(parent.name)

### List Every Tag

`.descendants` walks the entire tree, letting us audit which tag types appear on the page. Older code often uses `recursiveChildGenerator()`, a legacy name for the same iterator.

In [None]:
# Iterate through all tags present in the document
print("All tags:")
for child in soup.descendants:
    if child.name:  # text nodes and comments have no tag name
        print(child.name)

## Finding Elements by ID

An id should appear only once per page, making it a reliable way to select a single element.

In [None]:
# Locate the Selenium link by its id
link3 = soup.find(id="link3")
print("Link 3:", link3.text)
print("Href:", link3['href'])

## Working with Headers

Headers outline the structure of the page. Scraping them provides quick summaries or section navigation.

In [None]:
# First h2 element on the page
h2 = soup.find('h2')
print("H2 text:", h2.text)

### Collect All Headers

Pass a list of tag names to `find_all` to capture multiple heading levels in one call.

In [None]:
# Gather h1 and h2 text to understand section names
headers = soup.find_all(["h1", "h2"])
for h in headers:
    print(f"{h.name}: {h.text.strip()}")

## Extracting All URLs

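Every `<a>` tag in the sample carries an `href`. Printing them makes it easy to decide which pages to visit next. On real pages some anchors lack the attribute; `.get('href')` then returns `None` instead of raising an error.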

In [None]:
# Print each hyperlink so we can inspect or queue them
for link in soup.find_all('a'):
    print(link.get('href'))
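
Real pages often mix absolute and relative links. As a hedged sketch, the snippet below resolves each `href` against a page address with the standard `urllib.parse.urljoin` and skips anchors that have no `href`. The base URL and the markup are made up for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/blog/"  # hypothetical address of the scraped page
links_html = """
<a href="https://docs.scrapy.org/">Scrapy</a>
<a href="/about">About</a>
<a href="post-1.html">Post 1</a>
<a name="top">No href here</a>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

absolute_urls = [
    urljoin(base_url, a["href"])
    for a in links_soup.find_all("a", href=True)  # href=True skips anchors without href
]
for url in absolute_urls:
    print(url)
```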

## Getting All Text

`get_text()` removes every tag and returns the readable page content. It is handy for quick text exports or keyword searches.

In [None]:
# Dump all visible text from the page
print(soup.get_text())
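
By default `get_text()` joins text nodes with nothing in between, so neighbouring elements can run together. The sketch below, on a throwaway snippet named `text_soup`, shows the `separator` and `strip` arguments and the related `stripped_strings` generator.

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup(
    "<div><h1>Title</h1><p>First <b>bold</b> text.</p></div>",
    "html.parser",
)

raw = text_soup.get_text()                             # pieces glued together as-is
spaced = text_soup.get_text(separator=" ", strip=True)  # trim each piece, join with a space
pieces = list(text_soup.stripped_strings)               # the trimmed pieces as a list

print(repr(raw))
print(repr(spaced))
print(pieces)
```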

## Working with Strings and Comments

BeautifulSoup treats text inside tags as `NavigableString` objects and HTML comments as `Comment` objects, a subclass of `NavigableString`. You can read them like regular strings and swap them out with `replace_with()`.

In [None]:
# Strings inside tags behave like standard Python strings
soup_string = BeautifulSoup('<b class="type">Web Scraper</b>', 'html.parser')
tag = soup_string.b
print("String:", tag.string)
print("Type:", type(tag.string))

### Replacing Text Inside Tags

Use `replace_with()` when you need to update the content while keeping the tag in place.

In [None]:
# Swap the existing text for a friendlier message
tag.string.replace_with("Good web scraper")
print("After replace:", tag.string)

### Handling Comments

HTML comments are stored as a special string subclass. You can read them to see hidden notes or replace them if needed.

In [None]:
# Comments appear as Comment objects when parsed
markup = "<b><!--Hey, I wish to be a good web scraper--></b>"
soup_comment = BeautifulSoup(markup, 'html.parser')
comment = soup_comment.b.string
print("Comment:", comment)
print("Type:", type(comment))
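
Comments can also be collected across a whole document. This sketch, on a copy of the sample footer, filters strings by type and then removes the comments from the tree; `footer_soup` and `comment_texts` are names chosen for the example.

```python
from bs4 import BeautifulSoup, Comment

footer_html = '<div class="footer"><p>Footer text.</p><!-- This is a comment --></div>'
footer_soup = BeautifulSoup(footer_html, "html.parser")

# Keep only the strings that are Comment instances.
comments = footer_soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [c.strip() for c in comments]
print(comment_texts)

# extract() detaches each comment from the tree.
for c in comments:
    c.extract()
print(footer_soup)
```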

## Navigating Siblings

Use `next_sibling` and `previous_sibling` to move sideways between elements that share the same parent.

In [None]:
# Move from <b> to its sibling <c>
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print("Next sibling of b:", sibling_soup.b.next_sibling)

### Navigate Backward

`previous_sibling` walks in the opposite direction when you need the element that comes before the current one.

In [None]:
# Go from <c> back to its <b> sibling
print("Previous sibling of c:", sibling_soup.c.previous_sibling)
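
One caveat on indented HTML: the whitespace between tags is itself a text node, so `next_sibling` may return a newline rather than the next element. A minimal sketch, with names chosen for this example, shows how `find_next_sibling()` skips to the next matching tag.

```python
from bs4 import BeautifulSoup

list_html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
list_soup = BeautifulSoup(list_html, "html.parser")
first_item = list_soup.li

whitespace = first_item.next_sibling             # the newline between the two <li> tags
next_item = first_item.find_next_sibling("li")   # skips text nodes, returns the next <li>
no_item = first_item.find_previous_sibling("li")  # nothing before the first item

print(repr(whitespace))
print(next_item.text)
print(no_item)
```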

## Summary

You now have the core tools to parse HTML with BeautifulSoup:

- Respect site rules (`robots.txt`) and pace your requests.
- Inspect the HTML structure to choose the right selectors.
- Use BeautifulSoup to locate tags, attributes, and text.
- Traverse the tree when information sits near the elements you already know.

Practice these moves on real pages and adapt your selectors as layouts change.