# INST447: Data Sources and Manipulation
## Lecture: Sourcing Data from the Web - HTML Parsing & APIs

---

## Part 1: Motivation - How Healthy is my Recipe?

* **Recap from last week:** We discussed structured data formats like JSON and XML. Unlike flat CSVs, they allow for nested, complex structures, which is how data is often organized on the web.

* **Today's Scenario:** Imagine we have a recipe for **Whole Wheat Bread with Nutella**. We want to write a program to determine its nutritional value.

* **The Problem:** We need a data source for nutritional information. A quick search leads us to a great resource: [Open Food Facts](https://world.openfoodfacts.org/).

* **The Manual Process:** We can go to the website and search for Nutella. We land on this page: [https://world.openfoodfacts.org/product/3017620425035/nutella-ferrero](https://world.openfoodfacts.org/product/3017620425035/nutella-ferrero).

    

* **The Core Question:** How can we get this information *programmatically*? How do we automate this process for not just Nutella, but potentially thousands of ingredients?

---

## Part 2: The Direct Approach - Web Scraping & HTML Parsing

Our first thought might be: "Let's just download the webpage and find the data in there."

This process is called **Web Scraping**.
1.  Download the raw HTML content of a webpage.
2.  Parse the HTML to navigate its structure.
3.  Extract the specific content we need.

### Step 2.1: Downloading the Webpage

We can use the `requests` library in Python to act like a web browser and download the page.

In [None]:
import requests

# The URL of the Nutella product page
url = 'https://world.openfoodfacts.org/product/3017620425035/nutella-ferrero'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    html_content = response.text

    # Save it to a file so we can examine it
    with open('nutella_page.html', 'w', encoding='utf-8') as f:
        f.write(html_content)

    print(f"Successfully downloaded {len(html_content)} characters of HTML.")
    print("Saved to: nutella_page.html")

    # Let's look at the first 1000 characters
    print("\n--- Start of HTML ---")
    print(html_content[:1000])
    print("--- End of Snippet ---")
else:
    print(f"Failed to download page. Status code: {response.status_code}")

**Observation:** That's... a lot of text. It's a mix of HTML tags (`<!DOCTYPE`, `<html>`, `<head>`), CSS styles, JavaScript, and our actual data somewhere in there. It's not easy to read for a human or a program.

**Try this:** Open the saved `nutella_page.html` file in any text editor (VS Code, Notepad, TextEdit, etc.) to see the full complexity of the raw HTML. This will help you appreciate why we need parsing tools!

### Step 2.2: Parsing HTML with Beautiful Soup

To navigate this complex structure, we need a special tool. `BeautifulSoup` is a Python library that converts a messy HTML string into a structured object that we can search.

Let's try to extract useful information from the page.

In [None]:
# You might need to install these libraries first!
# !pip install requests beautifulsoup4

from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Example 1: Find the product name
# By inspecting the page, we find it's in an <h1> tag
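product_name_element = soup.select_one('h1')

# select_one returns None when nothing matches, so check before reading .text
if product_name_element:
    product_name = product_name_element.text.strip()  # .strip() removes whitespace
    print(f"Product Name: {product_name}")
else:
    print("Could not find an <h1> element on the page.")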

In [None]:
# Example 2: Find the Nutri-Score
# This is in a specific section with an id
nutriscore_element = soup.select_one('#panel_nutriscore_2023 h4')

if nutriscore_element:
    nutriscore = nutriscore_element.text.strip()
    print(f"Nutri-Score: {nutriscore}")

In [None]:
# Example 3: Find nutritional information in a table
# We need to find the nutrition table and iterate through its rows
nutrition_table = soup.find('table', attrs={'aria-label': 'Nutrition facts'})

if nutrition_table:
    rows = nutrition_table.find_all('tr')
    print("Nutritional Information (first 5 items):")
    for row in rows[:5]:  # Show first 5 rows
        cells = row.find_all('td')
        if cells and len(cells) >= 2:
            nutrient = cells[0].text.strip()
            value = cells[1].text.strip()
            print(f"  {nutrient}: {value}")
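
The row-by-row logic of Example 3 can be tried offline on a small snippet. This sketch swaps Beautiful Soup for the standard library's `html.parser.HTMLParser` so it runs with no installs and no network access. The HTML string, its values, and the `TableRowParser` class are invented for illustration and do not reflect the real page.

```python
from html.parser import HTMLParser

# Made-up HTML for illustration only; the real page is far larger.
SAMPLE_HTML = """
<table aria-label="Nutrition facts">
  <tr><th>Nutrient</th><th>Per 100 g</th></tr>
  <tr><td>Energy</td><td>100 kcal</td></tr>
  <tr><td>Fat</td><td>1.5 g</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of <td> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            if self._current_row:  # header rows contain no <td>, so skip them
                self.rows.append(self._current_row)
            self._current_row = None

    def handle_data(self, data):
        if self._in_cell:
            self._current_row[-1] += data.strip()


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
nutrition = {name: value for name, value in parser.rows}
print(nutrition)
```

Even on this tiny table the parser needs bookkeeping for rows, cells, and header rows, which hints at how much custom logic a real page can demand.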

### Discussion: The Problems with Web Scraping

We got the data! But is this a good approach?

* **👎 Inefficient:** We downloaded over 189,000 characters of HTML, CSS, and JavaScript just to get a few dozen words of nutritional data. This wastes our bandwidth and the website's bandwidth.

* **👎 Fragile:** Our code relies on the website's specific HTML structure (e.g., `<h1>` for the name, `#panel_nutriscore_2023 h4` for the Nutri-Score, table with `aria-label='Nutrition facts'`). What if the web developers redesign the page tomorrow? They might change the IDs, restructure the table, or use different elements. **Our scraper would break instantly.**

* **👎 Complex:** Notice how we had to use different strategies for each piece of data:
    - Simple tag selection for product name
    - CSS selector with ID for Nutri-Score  
    - Table iteration for nutritional facts
    
    Every website is structured differently, so we'd need to write custom parsing logic for each site.

* **👎 Bad Etiquette:** Scraping can put a heavy load on a website's server, especially if done frequently. It's like sending a fleet of trucks to pick up a single letter. Many websites explicitly forbid aggressive scraping in their `robots.txt` file or Terms of Service.
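
A site's `robots.txt` rules can be checked in code before any page is fetched. The sketch below uses Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `inst447-demo` user agent, the `is_allowed` helper, and the `example.org` URLs are all made up for illustration; they are not Open Food Facts' actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
Allow: /product/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


product_ok = is_allowed(SAMPLE_ROBOTS_TXT, "inst447-demo", "https://example.org/product/123")
search_ok = is_allowed(SAMPLE_ROBOTS_TXT, "inst447-demo", "https://example.org/cgi/search.pl")
print(f"Product page allowed: {product_ok}")
print(f"Search page allowed: {search_ok}")
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would load the real file instead of a hard-coded string. Passing `robots.txt` is only part of the picture, since the Terms of Service may add further restrictions.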

---

## Part 3: A Better Way - Using an API

Wouldn't it be nice if the website offered a separate, clean, data-only channel for programmers?

That's exactly what an **API (Application Programming Interface)** is.

> **Analogy:** Scraping a website is like rummaging through a restaurant's kitchen to figure out a recipe. Using an API is like ordering from the menu: you make a specific, structured request and get a specific, structured response.

Fortunately, Open Food Facts has a free and open API!

### Step 3.1: Making a Basic API Request

The API documentation tells us we can get data for a specific product by using a URL like this:
`https://world.openfoodfacts.org/api/v0/product/[barcode].json`

Let's try it for Nutella (barcode: 3017620425035).

In [None]:
import json

api_url = 'https://world.openfoodfacts.org/api/v0/product/3017620425035.json'

api_response = requests.get(api_url)
api_response.raise_for_status()  # stop early with an HTTPError on a 4xx/5xx response

# The .json() method from the requests library automatically parses the JSON response
data = api_response.json()

# Let's print the whole thing to see the structure. It's a dictionary!
# We'll use the 'json' library to pretty-print it.
print(json.dumps(data, indent=4))

Look at that! It's pure, structured data. No HTML, no CSS, no ads. Just the information we need in a predictable format (JSON).

Now, extracting the data is trivial and robust.

In [None]:
# The path to the data is predictable based on the JSON structure
product_name_api = data['product']['product_name']
ingredients_api = data['product']['ingredients_text']
countries = data['product']['countries']

print(f"Product Name: {product_name_api}")
print(f"Ingredients: {ingredients_api}")
print(f"Sold in: {countries}")
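
Direct indexing such as `data['product']['product_name']` raises a `KeyError` when a product lacks that field, which can happen in a crowd-sourced database. A minimal sketch of a more forgiving lookup follows. The `get_nested` helper and the trimmed `sample_response` dictionary are hypothetical and exist only to show the pattern.

```python
# Hypothetical, heavily trimmed response used only for illustration;
# a real response contains many more fields.
sample_response = {
    "status": 1,
    "product": {
        "product_name": "Example Spread",
        "nutriments": {"energy-kcal_100g": 100},
    },
}


def get_nested(data, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


name = get_nested(sample_response, "product", "product_name", default="N/A")
kcal = get_nested(sample_response, "product", "nutriments", "energy-kcal_100g")
ingredients = get_nested(sample_response, "product", "ingredients_text", default="N/A")

print(f"Product Name: {name}")
print(f"Calories (per 100g): {kcal}")
print(f"Ingredients: {ingredients}")  # missing in the sample, so the default is used
```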

### Step 3.2: Adding Parameters to API Requests

Most APIs allow you to customize your request using **parameters** (also called query parameters). These are key-value pairs added to the URL.

For example, the Open Food Facts API allows us to search for products. Instead of manually building the URL string, we can pass parameters as a dictionary to `requests.get()`.

In [None]:
# Search API endpoint
search_url = 'https://world.openfoodfacts.org/cgi/search.pl'

# Parameters as a dictionary - much cleaner than building the URL manually!
params = {
    'search_terms': 'chocolate',
    'page_size': 5,  # Only get 5 results
    'json': 1  # Request JSON format
}

# Pass parameters to requests.get()
search_response = requests.get(search_url, params=params)
search_data = search_response.json()

# Display results
print(f"Found {search_data['count']} products matching 'chocolate'")
print("\nFirst 3 results:")
for i, product in enumerate(search_data['products'][:3], 1):
    name = product.get('product_name', 'N/A')
    brands = product.get('brands', 'N/A')
    print(f"{i}. {name} ({brands})")

**What happened?** The `requests` library automatically converted our `params` dictionary into a query string and added it to the URL:

```
https://world.openfoodfacts.org/cgi/search.pl?search_terms=chocolate&page_size=5&json=1
```

This is much cleaner than building that string manually, and it handles special characters automatically (like spaces, which get encoded as `+` in the query string).
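
The encoding step can be reproduced with the standard library alone. This sketch uses `urllib.parse.urlencode` as a stand-in for the form-style encoding that `requests` applies to `params` (treat that equivalence as an assumption and inspect `search_response.url` to confirm it on your setup). The search term with spaces is chosen just to make the encoding visible.

```python
from urllib.parse import quote, urlencode

params = {"search_terms": "whole wheat bread", "page_size": 5, "json": 1}

# Form-style encoding: key=value pairs joined by '&', spaces become '+'
query_string = urlencode(params)
full_url = "https://world.openfoodfacts.org/cgi/search.pl?" + query_string
print(full_url)

# Percent-encoding, by contrast, writes a space as %20
percent_encoded = quote("whole wheat bread")
print(percent_encoded)
```

Both `+` and `%20` represent a space, but they belong to different encoding styles, which is one more reason to let a library build the URL rather than concatenating strings by hand.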

### Step 3.3: Limiting Response Data with the `fields` Parameter

API responses can be very large, containing hundreds of fields. **Many APIs** (though not all) allow you to request only specific fields to reduce response size and improve performance.

The Open Food Facts API supports this through the `fields` parameter - you simply list the fields you want, separated by commas. Let's see how much difference this makes!

In [None]:
# First, let's see what a FULL response looks like
full_url = 'https://world.openfoodfacts.org/api/v2/product/3017620425035.json'
full_response = requests.get(full_url)
full_data = full_response.json()

print("=== Full Response ===")
print(f"Response size: {len(full_response.text):,} characters")
print(f"Number of top-level fields: {len(full_data['product'].keys())}")
print("\nThis is a LOT of data - most of which we might not need!")

In [None]:
# Now, let's request ONLY the fields we actually need
limited_url = 'https://world.openfoodfacts.org/api/v2/product/3017620425035.json'
params = {
    'fields': 'product_name,brands,nutriscore_grade,nutriments'
}

limited_response = requests.get(limited_url, params=params)
limited_data = limited_response.json()

print("=== Limited Response (with fields parameter) ===")
print(f"Response size: {len(limited_response.text):,} characters")
print(f"Number of top-level fields: {len(limited_data['product'].keys())}")
print(f"\nReduction: {100 - (len(limited_response.text) / len(full_response.text) * 100):.1f}% smaller!")
print()

# Show what we got
print("Data we requested:")
print(f"  Product: {limited_data['product']['product_name']}")
print(f"  Brand: {limited_data['product']['brands']}")
print(f"  Nutri-Score: {limited_data['product']['nutriscore_grade'].upper()}")
print(f"  Calories (per 100g): {limited_data['product']['nutriments']['energy-kcal_100g']} kcal")

**Key Takeaway:** Using the `fields` parameter can dramatically reduce response size! In this example, we went from receiving hundreds of fields to just the 4 we needed. This is especially important when:
- Making many API requests
- Working with slow internet connections
- Building mobile apps where data usage matters
- Wanting faster response times
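
When the list of wanted fields lives in a Python list, the comma-separated string can be built rather than typed by hand. The sketch below also isolates the percentage arithmetic used in the cell above. Both helper names are hypothetical, and the sizes passed to `percent_reduction` are made-up numbers, not measurements.

```python
def build_fields_param(fields):
    """Join field names into the comma-separated string the `fields` parameter expects."""
    return {"fields": ",".join(fields)}


def percent_reduction(full_size, limited_size):
    """Return how much smaller the limited response is, as a percentage."""
    if full_size <= 0:
        raise ValueError("full_size must be positive")
    return 100 - (limited_size / full_size * 100)


params = build_fields_param(["product_name", "brands", "nutriscore_grade", "nutriments"])
print(params)

# Made-up sizes purely to show the arithmetic (not measured values).
reduction = percent_reduction(full_size=200_000, limited_size=5_000)
print(f"Reduction: {reduction:.1f}% smaller")
```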

### Step 3.4: API Wrappers - Making APIs Even Easier

Using APIs directly with `requests` is already much better than web scraping. But there's an even more convenient approach: **API Wrappers**.

**What is an API Wrapper?**
An API wrapper (also called a client library or SDK) is a Python package that someone has built to make working with a specific API even easier. Instead of constructing URLs and parsing JSON manually, you call simple Python functions.

Think of it this way:
- **Raw API:** You're ordering food by calling the restaurant and describing exactly what you want
- **API Wrapper:** You're using a delivery app with a nice menu interface
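
To make the idea concrete, here is a minimal sketch of what a wrapper does internally: it hides the URL pattern and the JSON navigation behind a method call. `TinyFoodClient` is a hypothetical class written for this lecture, not part of any real package. The `fake_fetch` function stands in for the network so the example runs offline.

```python
import json
from urllib.request import urlopen


class TinyFoodClient:
    """Hypothetical mini-wrapper for illustration; not the real openfoodfacts package."""

    BASE_URL = "https://world.openfoodfacts.org/api/v0/product"

    def __init__(self, fetch_json=None):
        # Allow a custom fetch function so the client can be tested offline
        self._fetch_json = fetch_json or self._default_fetch

    @staticmethod
    def _default_fetch(url):
        with urlopen(url, timeout=10) as response:
            return json.load(response)

    def product_url(self, barcode):
        return f"{self.BASE_URL}/{barcode}.json"

    def get_product_name(self, barcode):
        """Return the product name, or None if the product is not found."""
        data = self._fetch_json(self.product_url(barcode))
        if data.get("status") != 1:
            return None
        return data.get("product", {}).get("product_name")


# Stand-in for a real HTTP call; returns an invented response.
def fake_fetch(url):
    return {"status": 1, "product": {"product_name": "Example Spread"}}


client = TinyFoodClient(fetch_json=fake_fetch)
name = client.get_product_name("0000000000000")
print(name)
```

The caller never sees a URL or a dictionary key, which is exactly the convenience a real wrapper offers at larger scale.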

Let's see an example using the `openfoodfacts` Python package.

In [None]:
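# First, you would install the wrapper: pip install openfoodfacts
# Note: the package's interface has changed across releases. The calls below
# follow the module-level style of older releases; if they raise an
# AttributeError, check the README of the version you installed.
# Then you can use it like this: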

import openfoodfacts

# Get product by barcode - much simpler!
product = openfoodfacts.products.get_product('3017620425035')

if product['status'] == 1:
    product_data = product['product']
    print(f"Product Name: {product_data['product_name']}")
    print(f"Brands: {product_data.get('brands', 'N/A')}")
    print(f"Nutri-Score: {product_data.get('nutriscore_grade', 'N/A').upper()}")

# Search for products - also simple!
search_results = openfoodfacts.products.search('whole wheat bread')
print(f"\nFound {search_results['count']} products matching 'whole wheat bread'")
print(f"First result: {search_results['products'][0]['product_name']}")

### Pros and Cons of API Wrappers

**Advantages (👍):**

* **🚀 Easier to Use:** Functions have clear names and parameters. You don't need to memorize URL patterns or JSON structures.
* **📚 Better Documentation:** Often includes examples and docstrings that are easier to read than raw API docs.
* **🛡️ Error Handling:** Good wrappers handle common errors gracefully and provide helpful error messages.
* **⚡ Less Code:** Compare the wrapper code above to manually constructing the URL and parsing JSON.
* **🔄 Automatic Updates:** If the API changes slightly, the wrapper maintainer can update the library without you changing your code.

**Disadvantages (👎):**

* **⏳ Not Always Available:** Many APIs don't have official wrappers. Third-party wrappers may exist, but quality varies.
* **📦 Extra Dependency:** You're adding another package to your project. If the wrapper is abandoned, you might have problems.
* **🔒 Limited Features:** Wrappers might not expose every feature of the API. If you need something advanced, you may still need to use `requests` directly.
* **🐛 Additional Layer of Bugs:** The wrapper itself could have bugs, separate from the API.
* **📖 Learning Curve:** You need to learn both the API *and* how the wrapper works.

**When to Use Each Approach:**

- **Use an API wrapper** when: A well-maintained wrapper exists, and you're doing common operations
- **Use `requests` directly** when: No good wrapper exists, you need advanced features, or you want full control
- **Avoid web scraping** when: An API is available (with or without a wrapper)

### Step 3.5: Being a Good API Citizen

APIs are powerful, but they are a shared resource. It's crucial to use them responsibly.

* **Authentication:** Many APIs require an **API Key**. This is a unique string that identifies you. It helps the provider track usage and prevent abuse. You usually get one by signing up on their website.

* **Rate Limiting:** Don't make too many requests too quickly! API documentation will specify a **rate limit**, like "100 requests per minute." If you exceed this, your key might be temporarily blocked. Always add a small delay (`time.sleep()`) in your code if you're making many requests in a loop.

* **Read the Documentation:** The API documentation is your contract with the data provider. It tells you the rules, the available data, and the correct way to ask for it. **Always read it first!**

* **Respect the Terms of Service:** Some APIs have restrictions on how you can use the data (e.g., no commercial use, must attribute the source). Always check and follow these rules.
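
The rate-limiting advice can be sketched as a loop that pauses between consecutive requests. The `fetch_politely` helper, the `fake_fetch` stand-in, the barcodes, and the delay value are all hypothetical; a real delay should come from the API's documented rate limit. The `sleep` argument is injectable only so the example can run instantly, and it defaults to `time.sleep`.

```python
import time


def fetch_politely(barcodes, fetch_one, delay_seconds=1.0, sleep=time.sleep):
    """Call `fetch_one` for each barcode, pausing between consecutive requests."""
    results = {}
    for index, barcode in enumerate(barcodes):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[barcode] = fetch_one(barcode)
    return results


# Stand-in for a real API call so the sketch runs without network access.
def fake_fetch(barcode):
    return {"code": barcode, "status": 1}


# Record the requested pauses instead of actually sleeping.
pauses = []
results = fetch_politely(["111", "222", "333"], fake_fetch, delay_seconds=0.5, sleep=pauses.append)
print(f"Fetched {len(results)} products with pauses of {pauses} seconds")
```

In real use, `fetch_one` would wrap a `requests.get()` call and the default `time.sleep` would provide the actual delay.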

## Part 4: Conclusion & The Data Sourcing Hierarchy

We've explored multiple ways to get data from the web. This leads to a clear hierarchy of preference when you're looking for a data source:

**🥇 Level 1: API with a good wrapper library**
   * **Pros:** Most convenient, clean code, often includes helpful features
   * **Example:** `openfoodfacts.products.get_product('123')`
   * **Use when:** A well-maintained wrapper exists for your API

**🥈 Level 2: Well-documented API (using `requests`)**
   * **Pros:** Efficient, stable, robust, and the intended method of access
   * **Example:** `requests.get('https://api.example.com/products/123')`
   * **Use when:** No wrapper exists, or you need full control

**🥉 Level 3: Structured Data Files (e.g., a direct CSV/JSON download link)**
   * **Pros:** Data is already in a clean format
   * **Cons:** Might not be the most up-to-date data

**🏅 Level 4: Web Scraping (Last Resort)**
   * **Pros:** Can get data from almost any website
   * **Cons:** Inefficient, fragile, complex, and ethically ambiguous
   * **Use only when:** No API or data download is available, and you have permission

**Key Takeaway:** Always look for an API first! And if an API exists, check if there's a wrapper to make your life even easier.

By understanding these methods, you are now equipped to programmatically source data from the vast resources of the internet, a crucial skill for any data scientist.