### Discussion Week 6

We will review an example that shows why fluency in XPath syntax matters: the HTML our script receives can differ from what the browser renders, so we cannot rely on DevTools to inspect it.

```markdown

---

## 1. What is HTML?

**HTML** (HyperText Markup Language) is the standard language for creating web pages. A web page is structured by **tags** that tell a browser (or a parser like BeautifulSoup) the **role** of each piece of content: headings, paragraphs, images, links, etc.

---

## 2. Basic Structure of an HTML Document

A typical HTML file might look like:

\```html
<!DOCTYPE html>
<html>
<head>
  <title>My Web Page</title>
</head>
<body>

  <h1>Welcome to My Page</h1>
  <p>This is a paragraph of text.</p>
  <a href="https://example.com">A Link</a>

</body>
</html>
\```

In a nutshell:

- **`<html>`** is the root element; everything is inside it.  
- **`<head>`** contains metadata (e.g., `<title>`).  
- **`<body>`** contains the main content.

---

## 3. The Parts of an HTML Tag

HTML elements (often referred to as “tags”) generally have an **opening tag** and a **closing tag**:

\```html
<p>This is a paragraph</p>
\```

- The **opening tag** is `<p>` (paragraph).  
- The **closing tag** is `</p>`.  
- The text “This is a paragraph” is the **content** of the `<p>` element.

Some elements are self-closing (like `<img>` or `<br>`, sometimes written `<img />` or `<br />`) and have no separate closing tag.

---

## 4. Common Tags You’ll See in Scraping

1. **`<div>`**: A block-level “division” or container for grouping elements.  
2. **`<p>`**: A paragraph of text.  
3. **`<a>`**: An anchor (link). Often has an `href` attribute pointing to the URL.  
4. **`<span>`**: An inline container for text.  
5. **`<ul>`** & **`<li>`**: Unordered list (`<ul>`) and list items (`<li>`).  
6. **`<img>`**: An image tag. Often has a `src` attribute (image URL) and can have a `title` or `alt` attribute.  
7. **`<h1>`, `<h2>`, ...**: Headings.

---

## 5. Attributes: `class` and `id` (and more)

**Attributes** are key–value pairs in the tag’s opening bracket that provide extra information. For example:

\```html
<p class="intro" id="first-paragraph">Hello World!</p>
\```

- **class="intro"**: The `class` attribute is often used for styling (CSS) or identifying groups of elements.  
- **id="first-paragraph"**: The `id` attribute should be unique on the page. Often used for JavaScript targeting or to link to a specific section.  

When scraping, we commonly use:

- **`class_="some-class"`** in `BeautifulSoup` (note the underscore to avoid Python’s reserved word `class`)  
- **`id="some-id"`** in `BeautifulSoup`

We also encounter other attributes, like:

- **`href`** in `<a>` tags (the URL link).  
- **`src`** in `<img>` tags (the image source).  
- **`title`** or **`alt`** on various tags (extra descriptive text).

---

## 6. Nesting of Tags

HTML tags can be **nested**. For instance:

\```html
<div id="main-container">
  <h2>Section Title</h2>
  <p class="description">
    Here is some text with an <a href="https://example.com">example link</a>.
  </p>
</div>
\```

When you use `soup.prettify()`, you’ll see these tags indented to show that `<h2>` and `<p>` are inside the `<div>`, and that `<a>` is inside the `<p>`.

---

## 7. How This Relates to BeautifulSoup

When you run:

\```python
print(soup.prettify())
\```

You’ll see:

1. **Opening and closing tags**: like `<div> ... </div>`.  
2. **Attributes**: `<p class="description">`, `<a href="...">`.  
3. **Nested structure**: Indentation shows which tags are inside others.

### Common BeautifulSoup Methods/Concepts

1. **`find()`**: Returns the **first** matching element. Example:

   \```python
   container = soup.find('div', id='main-container')
   \```

   - Looks for a `<div>` with `id="main-container"`.

2. **`find_all()`**: Returns **all** matching elements as a list.

   \```python
   paragraphs = soup.find_all('p', class_='description')
   \```

   - Looks for **all** `<p>` tags that have `"description"` among their classes.

3. **CSS Selectors (`.select()`)**:

   \```python
   # "p.description" means: find <p> with class="description"
   paragraphs = soup.select("p.description")
   \```

   - This is like using CSS rules to find elements.

4. **Tag Text**:

   \```python
   my_paragraph = soup.find('p', class_='description')
   print(my_paragraph.get_text(strip=True))
   \```

   - `get_text(strip=True)` extracts the text content, stripping leading and trailing whitespace from each piece of text.

5. **Tag Attributes**:

   \```python
   link = soup.find('a')
   url = link['href']  # or link.get('href')
   \```

   - Accessing the `href` attribute of an `<a>` tag.

---

## 8. Tips for Reading HTML with `prettify()`

When you see something like:

\```html
<div id="seven-day-forecast-container">
  <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Monday</p>
        <p class="short-desc">Sunny</p>
        <p class="temp temp-high">High: 75 °F</p>
      </div>
    </li>
    ...
  </ul>
</div>
\```

you can read it as:

- A **`<div>`** with `id="seven-day-forecast-container"`.  
- Inside it, a **`<ul>`** with `id="seven-day-forecast-list"`.  
- Then each **`<li>`** (list item) has a `<div>` for the “tombstone” (forecast box).  
- That `<div>` contains three **`<p>`** elements for period name, short description, and temperature.

---

## 9. Summary

1. **HTML = tags + attributes + nested structure.**  
2. **`class`** and **`id`** attributes are your best friends for scraping—they help locate specific elements.  
3. **Use** BeautifulSoup’s methods like:
   - `.find()`, `.find_all()`, or `.select()` to locate tags by name, class, or id.  
   - `.get_text()` to extract text content.  
   - `.get(<attribute_name>)` to grab specific attribute values (like `href`, `title`, etc.).

Once you can **recognize** these tags, **navigate** their nesting, and **understand** how to target them by `class`, `id`, or tag name, you have the **foundation** you need to read `prettify()` output and scrape effectively with BeautifulSoup.

---

**That’s it!** Now you have a brief introduction to the parts of HTML that matter most for scraping. You don’t need to memorize all HTML tags or advanced layouts—just focus on **seeing** which tags hold the data you want and how you can target them with BeautifulSoup.
```
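
To make the pieces above concrete, here is a minimal sketch that parses the nested snippet from section 6 and applies `find()`, `select()`, `get_text()`, and attribute access. It assumes `beautifulsoup4` is installed; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The nested snippet from section 6, stored as a string so no network is needed.
html_doc = """
<div id="main-container">
  <h2>Section Title</h2>
  <p class="description">
    Here is some text with an <a href="https://example.com">example link</a>.
  </p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find(): the first <div> whose id is "main-container"
container = soup.find("div", id="main-container")

# Searching from the container limits the search to its descendants
heading_text = container.find("h2").get_text(strip=True)

# select(): CSS selector for <a> elements nested anywhere inside p.description
links = container.select("p.description a")
link_text = links[0].get_text(strip=True)
link_url = links[0]["href"]

print(heading_text)
print(link_text, "->", link_url)
```

Searching from `container` rather than from `soup` mirrors the nesting: each call only looks at the descendants of the element it is called on.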


## XPath Scraping Examples & Explanation


```markdown
# Discussion: XPath Fundamentals and Practical Scraping

XPath (XML Path Language) is a query language for selecting nodes from an XML/HTML document. With XPath, we can precisely locate elements in a structured document based on tags, attributes, text, and hierarchy.

In web scraping scenarios, especially when working with HTML documents, XPath offers powerful capabilities such as:
- **Selecting elements by tag** (e.g., `//a`, `//table`, etc.)
- **Selecting elements by attribute** (e.g., `//div[@class="content"]`)
- **Selecting elements containing specific text** (e.g., `//td[contains(text(), "Genre")]`)
- **Navigating the tree structure** (using child, sibling, or ancestor axes, such as `parent::`, `following-sibling::`, `preceding-sibling::`, etc.)

Often, the HTML you get via an HTTP library (e.g., `requests`) can differ from what you see in your browser, because many sites serve different HTML to mobile vs. desktop clients, or because scripts dynamically manipulate the DOM. This can make scraping challenging if you rely solely on DevTools to copy selectors from a rendered page. Below is an illustrative example using `requests` and `lxml` to scrape [imsdb.com](https://imsdb.com/).

---

## Scraping Genre Links Example

```python
import requests
import lxml.html as lx

# Step 1: Retrieve the page's HTML
result = requests.get('https://imsdb.com/')
result.raise_for_status()  # Ensure no HTTP errors
html_content = result.text

# Step 2: Parse the HTML content
html = lx.fromstring(html_content)

# Step 3: (Demonstration) A selector copied from DevTools, such as //table/tbody, may return nothing
# (the raw HTML can differ from the rendered DOM; for example, browsers insert <tbody> even when the raw markup omits it)
genre_table = html.xpath('//table/tbody')
print("Attempt to find table/tbody:", genre_table)  # Likely returns []

# Step 4: Different HTML is served depending on viewport/device. In some views,
# the 'Genres' section might appear as a table row with a specific <td> containing the text "Genre".
# In other (mobile) views, it might not appear at all (or only as a script-generated dropdown).
# Let's assume we have the "desktop" or large-viewport HTML.

# One trick: look for the table that has a row whose cell contains the text "Genre"
# (a partial match, in case there's trailing whitespace like "\r\n").
genres = html.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href')
print("Genre links found:", genres)

# Explanation:
#  - //table[tr/td[contains(text(), "Genre")]]: find a <table> that has a <tr>-><td> containing "Genre"
#  - /tr//a/@href: within that table, find all <a> elements inside <tr> and return their "href" attributes
```

In many real-world scenarios, you must carefully inspect the raw HTML returned by `requests` (rather than the rendered HTML in your browser) to craft XPath queries that match the actual structure you’re scraping.
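
The same ideas can be tried offline. The sketch below runs the queries from the example against a small, made-up HTML string (it is not the real imsdb.com markup) so the results are reproducible. It assumes `lxml` is installed.

```python
import lxml.html as lx

# Made-up HTML standing in for a raw server response (not the real imsdb.com markup).
raw_html = """
<html><body>
  <table>
    <tr><td>Genre
    </td></tr>
    <tr><td><a href="/genre/Action">Action</a></td></tr>
    <tr><td><a href="/genre/Comedy">Comedy</a></td></tr>
  </table>
  <table>
    <tr><td>Sponsor</td><td><a href="/ads">Ad</a></td></tr>
  </table>
</body></html>
"""

doc = lx.fromstring(raw_html)

# The raw markup has no <tbody>, so a selector copied from DevTools finds nothing.
tbody_hits = doc.xpath('//table/tbody')

# An exact match fails because the cell text carries a trailing newline.
exact_hits = doc.xpath('//td[text()="Genre"]')

# A partial match tolerates the stray whitespace.
genre_links = doc.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href')

print("tbody matches:", tbody_hits)
print("exact matches:", exact_hits)
print("genre links:", genre_links)
```

Only the first table has a cell containing "Genre", so the link in the second table is left out.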

---

## Scraping Script Date Example

In another scenario, suppose we want to retrieve the script date (and from it the year) from a page like:
[Interstellar Script](https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html).

After inspecting the HTML (mindful it may differ between desktop and mobile), we note that the script date is found as text after a `<b>` element with the text `"Script Date"`. For example:

```python
import requests
import lxml.html as lx

url = 'https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html'
response = requests.get(url)
response.raise_for_status()
html = lx.fromstring(response.text)

# We'll extract all text from <td> elements within the table that has class="script-details"
script_details_texts = html.xpath('//table[@class="script-details"]//td/text()')
print("Script details (all text):", script_details_texts)

# If we specifically want the text node immediately following the <b> element that has text "Script Date":
date_text = html.xpath('//b[text()="Script Date"]/following-sibling::text()[1]')
print("Raw script date text:", date_text)

# The returned text might contain additional words, whitespace, or punctuation.
# Next, we'd typically use regular expressions to isolate the four-digit year from the text.
# For now, we'll just demonstrate that the immediate text is captured.
```
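
The regular-expression step mentioned in the comments could look like the sketch below. The sample strings are hypothetical stand-ins for what the XPath query might return; print the real result first and adjust the pattern if it looks different. `extract_year` is an illustrative helper, not part of any library.

```python
import re

# Hypothetical examples of what the XPath query might return; the real
# strings depend on the page and should be checked by printing them.
sample_results = [
    [" : November 2014\r\n"],
    [" : 2008-03-07 "],
    [],  # the XPath query found nothing
]

def extract_year(text_nodes):
    """Return the first four-digit year (1900-2099) found in the text nodes, or None."""
    for text in text_nodes:
        match = re.search(r"\b(?:19|20)\d{2}\b", text)
        if match:
            return int(match.group())
    return None

years = [extract_year(nodes) for nodes in sample_results]
print(years)
```

Returning `None` for an empty result keeps a missing date from raising an `IndexError` when many pages are scraped in a loop.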

---

## Key Takeaways on XPath Usage

1. **Absolute vs. Relative Paths**  
   - `//tag` searches for `<tag>` anywhere in the document, while a path with a leading `/` (e.g., `/html/body`) starts at the document root and moves down one level of children per step.

2. **Attribute Conditions**  
   - `//div[@class="nav"]` selects `<div>` elements with `class="nav"`.

3. **Text Matching**  
   - `//td[text()="Genre"]` matches a `<td>` that has a text node equal to exactly `"Genre"`.
   - `//td[contains(text(),"Genre")]` matches a `<td>` whose first text node contains `"Genre"` as a substring.

4. **Handling Whitespace & Newlines**  
   - Real HTML often includes line breaks like `\r\n`. To tolerate them, use a partial match with `contains()` or collapse whitespace with `normalize-space()`.

5. **Navigation Axes**  
   - `following-sibling::`, `preceding-sibling::`, `parent::`, `child::`, etc. let you move in the document relative to a known node.

By understanding these XPath strategies, you can more flexibly navigate HTML structures that aren’t always consistent — especially when the site provides different renders (e.g., mobile vs. desktop) or dynamically alters the DOM via JavaScript.

---

**Note:**  
To handle tricky situations where the page is significantly different when rendered in a browser (due to JavaScript or device-based rendering), you may need to:
- Emulate a specific User-Agent and send the correct headers to get the “desktop” version.
- Use a headless browser solution (e.g., `Selenium`, `Playwright`) to execute JavaScript and get the fully rendered page.
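
A minimal sketch of the first option, sending desktop-style headers with `requests`, is shown below. The User-Agent string and the helper names are assumptions for illustration. Whether a given site varies its HTML by User-Agent has to be checked case by case.

```python
import requests

# Assumption: an example desktop User-Agent string; substitute a current one.
DESKTOP_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

def make_desktop_session():
    """Return a requests.Session that sends desktop-style headers on every request."""
    session = requests.Session()
    session.headers.update(DESKTOP_HEADERS)
    return session

def fetch_html(session, url, timeout=10):
    """Fetch a page and return its HTML text, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text

session = make_desktop_session()
# html_text = fetch_html(session, "https://imsdb.com/")  # uncomment to hit the network
```

Using a `Session` means the headers are set once and reused for every page fetched afterwards.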

```



## Extended XPath Overview & Examples


```markdown
# Key Takeaways on XPath Usage

## Absolute vs. Relative Paths
- `//tag` searches for `<tag>` anywhere in the document, at any depth.
- A leading `/` starts at the document root, and each further `/` steps to immediate children only, so `/html/body` is an absolute path.

**Example**:
```python
# Absolute path: each step from the document root is spelled out
html.xpath('/html/body/div/p')

# Descendant search: matches <p> at any depth in the document
html.xpath('//p')
```

## Attribute Conditions
- `//div[@class="nav"]` selects all `<div>` elements with `class="nav"`.

**Example**:
```python
# Select all <img> elements whose "alt" attribute equals "logo"
html.xpath('//img[@alt="logo"]')
```

## Text Matching
- `//td[text()="Genre"]` matches a `<td>` that has a text node equal to exactly `"Genre"`.
- `//td[contains(text(),"Genre")]` matches a `<td>` whose first text node contains `"Genre"` as a substring.

**Example**:
```python
# Exact text match
html.xpath('//span[text()="Subscribe"]')

# Partial text match (avoids issues with whitespace or additional text)
html.xpath('//span[contains(text(), "Subscribe")]')
```

## Handling Whitespace & Newlines
HTML often includes line breaks like `\r\n`. To handle partial matches, you can use:
- `contains()`
- Functions like `normalize-space()`

**Example**:
```python
# Using contains() to avoid missing text with stray newline characters
html.xpath('//td[contains(text(), "Genre")]')

# Using normalize-space() if there's excessive spacing
html.xpath('//td[normalize-space(text())="Genre"]')
```

## Navigation Axes
- `following-sibling::`, `preceding-sibling::`, `parent::`, `child::`, etc.  
  These allow you to move in the document relative to a known node.

**Example**:
```python
# Select the text node following a <b> element with text 'Script Date'
html.xpath('//b[text()="Script Date"]/following-sibling::text()[1]')

# Select any <div> that is the parent of an <img> with src="logo.png"
html.xpath('//img[@src="logo.png"]/parent::div')
```

---

# Additional XPath Grammar and Methods

Below we introduce more XPath concepts, including wildcard usage, union operators, and common functions for more powerful queries.

## Wildcards
- `*` matches any element node (regardless of its name).
- `@*` matches any attribute node.

**Example**:
```python
# Select all child elements under <div> of class "container", regardless of tag name
html.xpath('//div[@class="container"]/*')

# Select all attributes of the <img> elements
html.xpath('//img/@*')
```

## Union (|) Operator
- Combines multiple XPath expressions so you can select multiple sets of nodes.

**Example**:
```python
# Select all <div> or <span> elements
html.xpath('//div | //span')
```

## Common Functions
- `starts-with(string, substring)`: Tests if `string` starts with `substring`.
- `substring(string, start, length)`: Returns a portion of `string`.
- `string-length(string)`: Returns the length of a string.
- `count(node-set)`: Returns the number of nodes in a node set.

**Example**:
```python
# Select <a> elements whose href starts with "https"
html.xpath('//a[starts-with(@href, "https")]')

# Count how many <p> elements exist (lxml returns the count as a float, e.g. 3.0)
num_paragraphs = html.xpath('count(//p)')
print("Number of <p> elements:", int(num_paragraphs))
```

## Context Nodes and Parent/Child Notation
- `.` refers to the current context node.
- `..` refers to the parent of the current node.

**Example**:
```python
# From a known element "el", select its parent's sibling divs
el.xpath('../following-sibling::div')
```

## Putting It All Together
When crafting your XPath, you often combine these features:
1. Start with a known node or wildcard.
2. Use predicate filters on attributes/text/position.
3. Employ axes to move to siblings, parents, children, etc.
4. Apply string functions or partial matches to handle real-world HTML quirks.

**Example**:
```python
# 1. Find a table containing a <td> with text "Genre"
# 2. Then locate all <a> within that table (in any row).
genre_links = html.xpath('//table[tr/td[contains(text(), "Genre")]]//a/@href')

# 3. Move from a known <b> element to the first text node that follows it.
script_date_text = html.xpath('//b[text()="Script Date"]/following-sibling::text()[1]')

# 4. Use starts-with() to filter anchor links that begin with "/scripts".
script_links = html.xpath('//a[starts-with(@href, "/scripts")]/@href')
```

---

By understanding these XPath strategies and methods, you can become more agile in navigating and extracting data from HTML documents that vary in layout or contain dynamic elements. Always remember to inspect the **actual** HTML returned by your HTTP client (e.g., `requests`) rather than relying solely on the rendered DOM in a browser, which may include additional transformations or scripts.

```
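
The snippets in this overview all assume an existing `html` object. The sketch below is a self-contained version that runs several of them (an axis, `starts-with()`, `count()`, a union, and `..`) against made-up markup loosely modeled on a script details table. The markup and variable names are assumptions for illustration, and `lxml` must be installed.

```python
import lxml.html as lx

# Made-up markup, loosely modeled on a script details table.
page = lx.fromstring("""
<html><body>
  <table class="script-details">
    <tr><td>
      <b>Script Date</b> : March 2008<br>
      <a href="/scripts/Example.html">Read script</a>
      <a href="https://example.com/about">About</a>
    </td></tr>
  </table>
  <p>First</p><p>Second</p>
</body></html>
""")

# Axis: the first text node that follows the <b> element as a sibling
date_text = page.xpath('//b[text()="Script Date"]/following-sibling::text()[1]')[0]

# Function in a predicate: keep only hrefs that start with "/scripts"
script_links = page.xpath('//a[starts-with(@href, "/scripts")]/@href')

# count() returns a number, which lxml hands back as a float
paragraph_count = page.xpath('count(//p)')

# Union: <b> and <p> elements, returned in document order
tags = [el.tag for el in page.xpath('//b | //p')]

# Relative query from a known element: ".." is its parent
b_element = page.xpath('//b')[0]
parent_tag = b_element.xpath('..')[0].tag

print(date_text.strip())
print(script_links, paragraph_count, tags, parent_tag)
```

Note that `xpath()` returns a list for node queries but a plain number for `count()`, so the two kinds of result are handled differently.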


# Beautiful Soup

## Tutorial: Beautiful Soup Basics


```markdown
# Tutorial: Beautiful Soup Basics

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Here's a brief step-by-step tutorial that outlines how to get started:

---

## 1. Installation

Before using Beautiful Soup, you need to install it. If you haven’t already:

```bash
pip install beautifulsoup4
```

---

## 2. Importing and Creating a Soup Object

To begin parsing HTML, import both `requests` (or another HTTP library) and `BeautifulSoup`:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a webpage
url = "https://example.com"
response = requests.get(url)

# Create a BeautifulSoup object from the HTML text
soup = BeautifulSoup(response.text, "html.parser")

# Alternatively, parse an HTML string directly
html_doc = "<html><body><p>Hello!</p></body></html>"
soup_from_string = BeautifulSoup(html_doc, "html.parser")
```

A `BeautifulSoup` object (`soup` in these examples) acts as a structured representation of your HTML. You can navigate and search it like a tree.

---

## 3. Parsing and Navigating the HTML Tree

### 3.1 Accessing Elements by Tag

```python
# Access the first <title> tag found in the document
page_title = soup.title

# Access the first <body> tag
page_body = soup.body

# Access the first <p> tag
paragraph = soup.p
```

Remember that these direct accesses (`soup.p`, `soup.title`) only give you **the first** occurrence of that tag.

### 3.2 Going Down the Tree

- `.contents` gives a list of the **direct children** of a tag.
- `.children` is an **iterator** over those children.

```python
# If we want to see what's inside <body>
print(soup.body.contents)

# Or iterate over the children:
for child in soup.body.children:
    print(child)
```

### 3.3 Going Up the Tree

If you have a tag, you can find its parent:

```python
# Access a tag's parent
if soup.p:
    parent_of_p = soup.p.parent
    print("Parent of <p>:", parent_of_p.name)
```

---

## 4. Finding Elements

### 4.1 `find_all()`

`find_all()` returns **all** matches of your query:

```python
# All paragraph tags
paragraphs = soup.find_all("p")

# All tags that have class="important"
important_tags = soup.find_all(class_="important")
```

### 4.2 `find()`

`find()` returns **the first** match:

```python
# First <p> tag with class "important"
first_important_p = soup.find("p", class_="important")
```

### 4.3 CSS Selectors via `.select()`

Use `.select()` to match elements using CSS selectors (like in a browser’s DevTools):

```python
# All <p> tags
paragraphs_css = soup.select("p")

# The single <p> tag with id="best-paragraph" (select_one() returns one element or None)
best_paragraph = soup.select_one("p#best-paragraph")

# All <p> tags with class="important" (select() always returns a list)
important_paragraphs = soup.select("p.important")
```

---

## 5. Extracting Text and Attributes

### 5.1 `.get_text()`

`get_text()` returns **all** text within a tag (including its descendants), stripped of HTML tags:

```python
body_text = soup.body.get_text()
print(body_text)
```

### 5.2 Accessing Attributes

Tags can be treated like a dictionary to get/set attributes:

```python
some_link = soup.find("a")
if some_link:
    href_value = some_link["href"]   # Might raise KeyError if "href" missing
    safer_href = some_link.get("href", "No link available")
    print("Link:", safer_href)
```

You can also view the entire attributes dictionary:

```python
print(some_link.attrs)  # e.g., {"href": "https://example.com"}
```

---

## 6. Advanced Tasks

Below are some advanced tasks you can perform with Beautiful Soup:

1. **Filtering by Function**  
   You can pass a function to `find_all()` or `find()` to define a custom matching condition.

2. **Modifying the Parse Tree**  
   You can insert, delete, or reorder tags within the parsed structure.

3. **Handling Non-Standard Documents**  
   Beautiful Soup is forgiving of poorly formed HTML, but you might need to experiment with different parsers (e.g., `"html5lib"`).
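
The first two tasks can be sketched as follows. The HTML string, the `is_external_link` helper, and the `"ad"` class are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html_doc = """
<ul>
  <li><a href="https://example.com/a">External A</a></li>
  <li><a href="/local/page">Local page</a></li>
  <li><a>No href at all</a></li>
  <li><a href="https://example.com/b" class="ad">External B</a></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def is_external_link(tag):
    """Custom filter: True for <a> tags whose href starts with "http"."""
    return tag.name == "a" and tag.get("href", "").startswith("http")

# Filtering by function: find_all() calls the function on every tag
external = [a["href"] for a in soup.find_all(is_external_link)]

# Modifying the parse tree: remove every element with class "ad"
for ad in soup.find_all(class_="ad"):
    ad.decompose()

remaining = [a.get_text(strip=True) for a in soup.find_all("a")]

print(external)
print(remaining)
```

A function filter is handy when the condition cannot be written as a simple tag name or attribute value, such as "has an `href` that starts with `http`".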

---

## 7. Output and Debugging

### 7.1 `.prettify()`

```python
print(soup.prettify())      # Prints the entire HTML in a nicely formatted way
print(soup.body.prettify()) # Prints just the <body> section
```

This helps you understand the structure that Beautiful Soup sees, which may differ from the raw HTML if there are minor errors.

---



```


# Example 2: National Weather Service

Let's scrape the [National Weather Service](https://weather.gov/) for the weather forecast of Davis, CA.

## Annotated Example: Scraping the National Weather Service


```markdown
# Scraping the National Weather Service: Davis, CA Forecast

In this example, we’ll demonstrate how to:
1. Send an HTTP request to the National Weather Service page for Davis, CA.
2. Parse the HTML response with Beautiful Soup.
3. Extract specific weather data (period names, short descriptions, temperatures, and detailed descriptions).
4. Assemble the data into a pandas DataFrame.




In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# ----------------------------------------------------------------------------
# Step 1: Identify the Target URL
# ----------------------------------------------------------------------------
# This URL corresponds to the National Weather Service’s 7-day forecast
# for a specific latitude/longitude near Davis, CA.
url = "https://forecast.weather.gov/MapClick.php?lat=38.54669&lon=-121.74457#.Y9fY5vv565t"

# ----------------------------------------------------------------------------
# Step 2: Fetch the Page Content
# ----------------------------------------------------------------------------
try:
    # 'requests.get(url)' sends an HTTP GET request to the specified URL.
    # The server's response is stored in the 'response' variable.
    response = requests.get(url)
    
    # 'raise_for_status()' will raise a 'requests.exceptions.HTTPError' if
    # the server returned an unsuccessful status code (e.g., 404, 500).
    response.raise_for_status()

except requests.exceptions.RequestException as e:
    # If *any* Request-related error occurs (connection error, timeout, etc.),
    # print the error and exit the script gracefully.
    print(f"Error fetching {url}:\n{e}")
    raise SystemExit

# ----------------------------------------------------------------------------
# Step 3: Parse HTML with BeautifulSoup
# ----------------------------------------------------------------------------
# 'response.text' gives us the HTML content of the response.
# BeautifulSoup constructs a "soup" object from that text for easy parsing.
html_soup = BeautifulSoup(response.text, "html.parser")

# ----------------------------------------------------------------------------
# Step 4: Identify and Validate the "Seven-Day Forecast" Section
# ----------------------------------------------------------------------------
# The page structure (from NWS) typically includes a <div> with id="seven-day-forecast-container".
# We find it using 'html_soup.find(id="...")'.
seven_day = html_soup.find(id="seven-day-forecast-container")

# If 'find' returned None, it means it couldn’t locate that <div> — likely
# because the site’s structure changed or the page didn’t load fully.
if not seven_day:
    raise ValueError(
        "Could not find the element with id='seven-day-forecast-container'. "
        "The site structure may have changed."
    )

# ----------------------------------------------------------------------------
# Optional: Debugging / Inspect the snippet
# ----------------------------------------------------------------------------
# If you want to see exactly what we got in 'seven_day', you can uncomment the line:
# print(seven_day.prettify())

# ----------------------------------------------------------------------------
# Step 5: Extract the Forecast Period Names
# ----------------------------------------------------------------------------
# .find_all("p", class_="period-name") means:
# - Look inside the 'seven_day' object
# - Find ALL <p> tags whose 'class' attribute is "period-name".
# Because there might be multiple 'period-name' paragraphs (e.g., "Tonight", "Wednesday", etc.),
# .find_all() returns a list of matching <p> elements.
period_names = seven_day.find_all("p", class_="period-name")

# We then iterate over each found <p> element and use .get_text(strip=True)
# to extract only the text content, trimming whitespace.
period = [name.get_text(strip=True) for name in period_names]

# ----------------------------------------------------------------------------
# Step 6: Extract the Short Weather Descriptions
# ----------------------------------------------------------------------------
# Similarly, we look for <p> elements whose class is "short-desc",
# which typically shows a phrase like "Rain", "Mostly Sunny", "Snow Likely", etc.
descs = seven_day.find_all("p", class_="short-desc")

# Again, we strip each match’s text content.
description = [desc.get_text(strip=True) for desc in descs]

# ----------------------------------------------------------------------------
# Step 7: Extract the Temperatures
# ----------------------------------------------------------------------------
# Temperatures often appear in <p> tags whose class includes "temp",
# e.g. class="temp temp-high" or class="temp temp-low".
# We use a CSS selector "p[class*='temp']" to match any <p> whose class
# attribute CONTAINS the substring "temp".
temps = seven_day.select("p[class*='temp']")

# Then we extract and strip the text from each of these <p> tags.
temperature = [temp.get_text(strip=True) for temp in temps]

# ----------------------------------------------------------------------------
# Step 8: Extract Detailed Descriptions from <img> Tags
# ----------------------------------------------------------------------------
# Each forecast "tombstone" has an <img> whose 'title' attribute provides
# a more detailed description, often including chance of precipitation, wind info, etc.
images = seven_day.select("div.tombstone-container img")

# We'll collect the 'title' attribute from each <img>. Some images might not have one,
# so we use .get("title", "") instead of ['title'] to avoid KeyErrors.
details = []
for image in images:
    title_text = image.get("title", "")
    details.append(title_text)

# ----------------------------------------------------------------------------
# Step 9: Clean Up the Detailed Descriptions
# ----------------------------------------------------------------------------
# Many of these 'title' strings begin with the period name, followed by a colon.
# E.g. "Thursday: Rain, mainly after 4pm. High near 60..."
# We want just the forecast text after the colon.
def clean_detail(txt):
    # If there's no colon, just return the entire text trimmed.
    if ":" not in txt:
        return txt.strip()
    # If there is a colon, split into three parts with .partition(":")
    # and return index 2 (the substring after the colon).
    return txt.partition(":")[2].strip()

# Apply our cleanup function to every string in 'details'.
new_details = [clean_detail(d) for d in details]

# ----------------------------------------------------------------------------
# Step 10: Ensure Lists Have the Same Length (Optional Check)
# ----------------------------------------------------------------------------
# Sometimes, the site includes extra hazard items or fewer periods,
# causing mismatches in the list lengths. pandas raises a ValueError when
# the lists passed to DataFrame differ in length, so we truncate each list
# to the shortest one. Note that truncating does not realign the rows.
min_length = min(len(period), len(description), len(temperature), len(new_details))
period = period[:min_length]
description = description[:min_length]
temperature = temperature[:min_length]
new_details = new_details[:min_length]

# ----------------------------------------------------------------------------
# Step 11: Create a DataFrame
# ----------------------------------------------------------------------------
# We build a pandas DataFrame mapping each column name ("Period", "Description", etc.)
# to the lists we collected.
weather = pd.DataFrame({
    "Period": period,
    "Description": description,
    "Temperature": temperature,
    "Detail": new_details
})

# Finally, print the DataFrame to see a table of the extracted forecast data.
print(weather)


                  Period              Description  Temperature  \
1                Tonight                  Showers  High: 60 °F   
2               Thursday                  Showers   Low: 48 °F   
3         Thursday Night  Showers thenRain Likely  High: 61 °F   
4                 Friday              Chance Rain   Low: 40 °F   
5           Friday Night             Mostly Clear  High: 59 °F   
6               Saturday             Partly Sunny   Low: 42 °F   
7         Saturday Night            Mostly Cloudy  High: 59 °F   

                                              Detail  
0                                                     
1                                                     
2  Rain, mainly before 4pm, then showers and poss...  
3  Showers and possibly a thunderstorm before 10p...  
4  A 40 percent chance of rain, mainly before 10a...  
5  Mostly clear, with a low around 40. South sout...  
6                 Partly sunny, with a high near 59.  
7               Mostly cloudy, 

In [3]:
print(seven_day.prettify())



<div id="seven-day-forecast-container">
 <div class="current-hazard" id="headline-container" style="margin-left: 124px">
  <div id="headline-separator" style="top: 34px; height: 171px">
  </div>
  <div id="headline-info" onclick="$('#headline-detail').toggle(); $('#headline-detail-now').hide()" style="margin-top: 5px">
   <div id="headline-detail">
    <div>
     Flood Watch until February 14, 10:00pm
    </div>
    <div>
     Wind Advisory until February 14, 10:00am
    </div>
    <div>
    </div>
   </div>
   <span class="fa fa-info-circle">
   </span>
   Click here for hazard details and duration
  </div>
  <div class="headline-bar headline-watch" style="top: 40px; left: 19px; height: 165px; width: 518px">
   <div class="headline-title">
    Flood Watch
   </div>
  </div>
  <div class="headline-bar headline-advisory" style="top: 60px; left: 19px; height: 145px; width: 394px">
   <div class="headline-title">
    Wind Advisory
   </div>
  </div>
   <div class="headline-title">
   </di

In [5]:
# .find_all() returns a list of matching <p> elements.
period_names = seven_day.find_all("p", class_="period-name")

# We then iterate over each found <p> element and use .get_text(strip=True)
# to extract only the text content, trimming whitespace.
period = [name.get_text(strip=True) for name in period_names]
period

['NOW: Multiple hazards in effect',
 'Tonight',
 'Thursday',
 'Thursday Night',
 'Friday',
 'Friday Night',
 'Saturday',
 'Saturday Night',
 'Sunday']

In [None]:
<div id="seven-day-forecast-container">
  <ul id="seven-day-forecast-list">
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Tonight</p>
        <p>
          <img class="forecast-icon"
               src="newimages/medium/nshra100.png"
               title="Tonight: Rain likely. Low around 46. Chance of precipitation is 70%." />
        </p>
        <p class="short-desc">Rain Likely</p>
        <p class="temp temp-low">Low: 46 °F</p>
      </div>
    </li>

    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Thursday</p>
        <p>
          <img class="forecast-icon"
               src="newimages/medium/shra80.png"
               title="Thursday: Showers. High near 60. Chance of precipitation is 80%." />
        </p>
        <p class="short-desc">Showers</p>
        <p class="temp temp-high">High: 60 °F</p>
      </div>
    </li>

    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Thursday Night</p>
        <p>
          <img class="forecast-icon"
               src="DualImage.php?i=nshra&amp;j=nshra&amp;ip=100&amp;jp=70"
               title="Thursday Night: Showers likely. Low around 48. Chance of precipitation is 100%." />
        </p>
        <p class="short-desc">Showers Likely</p>
        <p class="temp temp-low">Low: 48 °F</p>
      </div>
    </li>
  </ul>
</div>
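
One way to keep the columns aligned, sketched below, is to loop over each `tombstone-container` and pull every field from inside that one box. A box with a missing field then yields `None` instead of shifting the rows that follow. The sample is a trimmed copy of the markup above plus one hypothetical hazard box; its structure is an assumption, so check `prettify()` output for the real layout. `text_or_none` is an illustrative helper.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the forecast markup shown above, plus one hypothetical
# hazard tombstone that has a period name but no description or temperature.
sample_html = """
<div id="seven-day-forecast-container">
  <ul id="seven-day-forecast-list">
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">NOW: Multiple hazards in effect</p>
      </div>
    </li>
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Tonight</p>
        <p><img class="forecast-icon" src="newimages/medium/nshra100.png"
                title="Tonight: Rain likely. Low around 46. Chance of precipitation is 70%." /></p>
        <p class="short-desc">Rain Likely</p>
        <p class="temp temp-low">Low: 46 °F</p>
      </div>
    </li>
  </ul>
</div>
"""

def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.select_one(selector)
    return node.get_text(" ", strip=True) if node is not None else None

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for box in soup.select("div.tombstone-container"):
    img = box.select_one("img")
    title = img.get("title", "") if img is not None else ""
    rows.append({
        "Period": text_or_none(box, "p.period-name"),
        "Description": text_or_none(box, "p.short-desc"),
        "Temperature": text_or_none(box, "p.temp"),
        # Keep only the text after the first colon, as clean_detail() does
        "Detail": title.partition(":")[2].strip() if ":" in title else title.strip(),
    })

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would produce the same four columns as the earlier cell, with every value taken from the same forecast box.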
