# USGS Earthquake API Deep Dive

This notebook demonstrates:
1. How the USGS Earthquake API works
2. The 20,000 record limit problem
3. Single-threaded partitioning (the bottleneck)
4. PySpark Custom Data Source (the scalable solution)

## Part 1: Understanding the USGS API

The USGS provides a free, public API for earthquake data. Let's explore how it works.

In [0]:
import requests
import json

# The USGS Earthquake API endpoint
USGS_API_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

# Let's fetch about a week of earthquake data (2000-01-01 up to 2000-01-07)
params = {
    "format": "geojson",
    "starttime": "2000-01-01",
    "endtime": "2000-01-07"
}

response = requests.get(USGS_API_URL, params=params)
data = response.json()

print(f"Response status: {response.status_code}")
print(f"Total earthquake events: {len(data['features'])}")

### Exploring the Response Structure

The API returns GeoJSON format with:
- `metadata`: Information about the query
- `features`: Array of earthquake events
- Each feature has `properties` (event details) and `geometry` (coordinates)
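
To make the structure above concrete, here is a minimal sketch that flattens one feature into a plain record. The sample values and the `flatten_feature` helper are hypothetical. The sketch assumes that `coordinates` is ordered longitude, latitude, depth and that `time` is milliseconds since the Unix epoch.

```python
from datetime import datetime, timezone

# Hypothetical sample shaped like one entry of data["features"];
# the values are made up for illustration.
sample_feature = {
    "type": "Feature",
    "id": "example0001",
    "properties": {"mag": 4.5, "place": "10 km N of Somewhere", "time": 946684800000},
    "geometry": {"type": "Point", "coordinates": [-122.0, 37.5, 8.2]},
}


def flatten_feature(feature: dict) -> dict:
    """Pull a few commonly used fields out of one GeoJSON feature."""
    props = feature.get("properties") or {}
    geom = feature.get("geometry") or {}
    coords = list(geom.get("coordinates") or [])
    # Pad so that missing coordinates become None instead of raising
    lon, lat, depth_km = (coords + [None, None, None])[:3]

    time_ms = props.get("time")
    event_time = (
        datetime.fromtimestamp(time_ms / 1000, tz=timezone.utc)
        if time_ms is not None
        else None
    )
    return {
        "id": feature.get("id"),
        "magnitude": props.get("mag"),
        "place": props.get("place"),
        "event_time": event_time,
        "longitude": lon,
        "latitude": lat,
        "depth_km": depth_km,
    }


print(flatten_feature(sample_feature))
```

The same helper could be mapped over `data['features']` to build a list of flat records.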

In [0]:
response.json()

In [0]:
# Let's look at the metadata
print("=== METADATA ===")
print(json.dumps(data['metadata'], indent=2))

In [0]:
# Now let's examine a single earthquake event
print("=== SAMPLE EVENT ===")
event = data['features'][0]
print(json.dumps(event, indent=2))

### Key Properties We Care About

Each earthquake event contains rich metadata. Let's convert the API response to a Spark DataFrame for easier viewing:

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, MapType

# Schema: one column per top-level key of a feature.
# The nested `properties` and `geometry` objects are kept as string-to-string maps.
schema = StructType([
            StructField("type", StringType(), True),
            StructField("properties", MapType(StringType(), StringType(), True), True),
            StructField("geometry", MapType(StringType(), StringType(), True), True),
            StructField("id", StringType(), True),
        ])

df = spark.createDataFrame(response.json()['features'], schema=schema)

df.display()

## Part 2: The 20,000 Record Limit Problem

The USGS API has a **hard limit of 20,000 records per request**.

If your query would return more than 20,000 records, **the API returns a 400 error** - it doesn't silently truncate, it fails completely.
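
Because an oversized query fails outright, one way to pick safe windows is to check the event count first and halve the range until every window fits. The sketch below is an illustration, not part of this notebook's pipeline. `split_until_under_limit` and `fake_count` are hypothetical names, and `count_events` stands in for whatever counting call you have available (for example, a wrapper around the service's count method, if you use it).

```python
from datetime import datetime, timezone

MAX_RECORDS = 20_000  # per-request limit described above


def split_until_under_limit(start, end, count_events, limit=MAX_RECORDS):
    """Recursively halve [start, end] until each window holds at most `limit` events."""
    too_short_to_split = (end - start).total_seconds() <= 1
    if count_events(start, end) <= limit or too_short_to_split:
        return [(start, end)]
    mid = start + (end - start) / 2
    return (
        split_until_under_limit(start, mid, count_events, limit)
        + split_until_under_limit(mid, end, count_events, limit)
    )


def fake_count(start, end):
    """Stand-in counter: pretends there are exactly 100 events per day."""
    days = (end - start).total_seconds() / 86400
    return int(days * 100)


year_start = datetime(2000, 1, 1, tzinfo=timezone.utc)
year_end = datetime(2001, 1, 1, tzinfo=timezone.utc)

windows = split_until_under_limit(year_start, year_end, fake_count)
for w_start, w_end in windows:
    print(w_start.isoformat(), "->", w_end.isoformat(), fake_count(w_start, w_end))
```

The rest of this notebook uses fixed, evenly sized partitions instead, which is simpler but relies on choosing a partition count that keeps every window under the limit.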

Let's see what happens when we try to fetch a full year of data:

In [0]:
# Try to fetch a full year - this will FAIL with a 400 error!
params_year = {
    "format": "geojson",
    "starttime": "2000-01-01",
    "endtime": "2001-01-01"
}

response_year = requests.get(USGS_API_URL, params=params_year)

print(f"Status Code: {response_year.status_code}")

## Part 3: Single-Threaded Partitioning (The Bottleneck)

One solution is to break the date range into smaller chunks and make multiple requests.

**Problem**: This runs sequentially - each request waits for the previous one to complete.

### Demo: Fetching 12 Months of Data (Sequential)

Watch how each chunk waits for the previous one to complete:

In [0]:
# Fetch 12 months of data using sequential approach
# This will take a while because it's single-threaded!

from datetime import datetime, timedelta, timezone

start_s = "2000-01-01"
end_s = "2001-01-01"
num_partitions = 25


# Explicit parsing to UTC
start = datetime.strptime(start_s, "%Y-%m-%d").replace(tzinfo=timezone.utc)
end = datetime.strptime(end_s, "%Y-%m-%d").replace(tzinfo=timezone.utc)

total_seconds = (end - start).total_seconds()
step = total_seconds / num_partitions

parts = []
for i in range(num_partitions):
    p_start = start + timedelta(seconds=step * i)
    p_end = start + timedelta(seconds=step * (i + 1))

    # Explicit ISO-8601 UTC formatting (USGS-compatible)
    start_iso = p_start.strftime("%Y-%m-%dT%H:%M:%SZ")
    end_iso = p_end.strftime("%Y-%m-%dT%H:%M:%SZ")

    parts.append({"start_iso":start_iso, "end_iso":end_iso, "index": i})

import requests

URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

all_features = []

for part in parts:
    params = {
        "format": "geojson",
        "starttime": part["start_iso"],
        "endtime": part["end_iso"],
    }

    # Each request blocks until the previous one has finished
    response = requests.get(URL, params=params, timeout=30)
    response.raise_for_status()

    data = response.json()                 # dict
    features = data.get("features", [])    # list

    all_features.extend(features)          # ✅ list.extend(list)

    # Report after the fetch so the counts reflect this partition
    print(f"Partition {part['index']}: {len(features)} events ({len(all_features)} total)")

print("Total events:", len(all_features))

### The Problem with Sequential Processing

For 12 months of data split into roughly 14-day partitions:
- ~25 partitions to process
- Each partition takes ~1-3 seconds (network latency + API processing)
- Total time: **20-40 seconds** (or more!)

For longer time periods, **it will take much longer**, because every extra partition adds another request to the queue.

And remember - we HAVE to partition because the API **fails with a 400 error** if we exceed 20,000 records. A single month is safe, but anything larger requires partitioning.

**This sequential approach doesn't scale!**

What if we could process all chunks **in parallel**?

## Part 4: PySpark Custom Data Source (The Solution)

Spark's **Custom Data Source API** allows us to:
1. Define how to partition the work (date ranges)
2. Let Spark distribute partitions across executors
3. Process all partitions **in parallel**

Let's build it step by step.

### Step 1: Define the InputPartition

An `InputPartition` represents **one unit of work** that can be processed independently.

In our case, each partition is a date range to fetch from the API.

In [0]:
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
from pyspark.sql.types import StructType, StructField, StringType, MapType
from datetime import datetime, timedelta, timezone


class TimeRangePartition(InputPartition):
    """
    Represents one partition of work - a specific date range to fetch.

    This object gets serialized and sent to executors.
    Keep it lightweight - just the data needed to fetch this chunk.
    """

    def __init__(self, starttime: str, endtime: str, pid: int):
        self.starttime = starttime  # ISO format string
        self.endtime = endtime      # ISO format string
        self.pid = pid              # Partition ID for logging/debugging
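
Since the partition object is shipped to executors, it helps to confirm that its state is small, plain data. The sketch below uses a stand-in class (`TimeRangePartitionSketch`, hypothetical, with the same three fields but without the Spark base class) and Python's `pickle` as a proxy. Spark's actual serialization mechanism may differ.

```python
import pickle


class TimeRangePartitionSketch:
    """Stand-in with the same fields as TimeRangePartition, minus the Spark base class."""

    def __init__(self, starttime: str, endtime: str, pid: int):
        self.starttime = starttime
        self.endtime = endtime
        self.pid = pid


# Illustrative values for the first of 25 partitions
partition = TimeRangePartitionSketch("2000-01-01T00:00:00", "2000-01-15T15:21:36", 0)

# Pickle the attribute dict: this is the state that has to travel to an executor
state = vars(partition)
payload = pickle.dumps(state)
restored = pickle.loads(payload)

print(f"Serialized state: {len(payload)} bytes")
print(f"Round-trip intact: {restored == state}")
```

Two strings and an integer serialize to a tiny payload. Holding a `requests.Session` or a large list on the partition would not.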

### Step 2: Define the DataSourceReader

The `DataSourceReader` has two key methods:

- **`partitions()`**: Runs on the **driver**. Creates the list of partitions.
- **`read(partition)`**: Runs on **executors**. Fetches data for one partition.

In [0]:
class USGSDataSourceReader(DataSourceReader):
    """
    Reader that creates partitions and fetches data from USGS API.

    - partitions() runs on the DRIVER (creates the work units)
    - read() runs on EXECUTORS (does the actual API calls)
    """

    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def _parse_dt_utc(self, s: str) -> datetime:
        """Parse date string to UTC datetime."""
        # Handle simple 'YYYY-MM-DD' format
        if len(s) == 10 and s[4] == "-" and s[7] == "-":
            return datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc)

        # Handle ISO8601 format
        dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc)

    def _to_usgs_iso(self, dt: datetime) -> str:
        """Convert datetime to USGS-compatible ISO string."""
        return dt.strftime("%Y-%m-%dT%H:%M:%S")

    def partitions(self):
        """
        Create partitions based on the date range and numPartitions.

        THIS RUNS ON THE DRIVER.

        Returns a list of TimeRangePartition objects that Spark will
        distribute across executors.
        """
        # Get options from the user
        start_s = self.options.get("starttime")
        end_s = self.options.get("endtime")

        if not start_s or not end_s:
            raise ValueError(
                "You must provide .option('starttime', ...) and .option('endtime', ...)"
            )

        num_partitions = int(self.options.get("numPartitions", "10"))

        if num_partitions <= 0:
            raise ValueError("numPartitions must be > 0")

        # Parse dates
        start = self._parse_dt_utc(start_s)
        end = self._parse_dt_utc(end_s)

        if end <= start:
            raise ValueError(f"endtime must be after starttime. Got {start_s} -> {end_s}")

        # Calculate time step per partition
        total_seconds = (end - start).total_seconds()
        step = total_seconds / num_partitions

        # Create partition objects
        parts = []
        for i in range(num_partitions):
            p_start = start + timedelta(seconds=step * i)
            p_end = start + timedelta(seconds=step * (i + 1))

            parts.append(
                TimeRangePartition(
                    starttime=self._to_usgs_iso(p_start),
                    endtime=self._to_usgs_iso(p_end),
                    pid=i
                )
            )

        print(f"✓ Created {len(parts)} partitions for date range {start_s} to {end_s}")
        return parts

    def read(self, partition: TimeRangePartition):
        """
        Fetch data for a single partition from USGS API.

        THIS RUNS ON EXECUTORS (potentially in parallel!).

        Yields tuples that match the schema defined in the DataSource.
        """
        # Import inside read() - this code runs on executors
        # which may not have the same imports as the driver
        import requests

        url = "https://earthquake.usgs.gov/fdsnws/event/1/query"

        params = {
            "format": "geojson",
            "starttime": partition.starttime,
            "endtime": partition.endtime
        }

        # Make the API request
        r = requests.get(url, params=params, timeout=30)

        # Handle errors with useful information
        if r.status_code != 200:
            raise RuntimeError(
                f"USGS HTTP {r.status_code} partition={partition.pid} "
                f"start={partition.starttime} end={partition.endtime}. "
                f"Response: {r.text[:500]}"
            )

        # Parse response and yield rows
        payload = r.json()
        features = payload.get("features", [])

        for f in features:
            # Yield tuple matching our schema:
            # (type, properties, geometry, id)
            yield (
                f.get("type"),
                f.get("properties"),  # MapType handles dict automatically
                f.get("geometry"),    # MapType handles dict automatically
                f.get("id")
            )
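
One detail of the partitioning above: adjacent partitions share a boundary timestamp (the end of partition `i` equals the start of partition `i + 1`). If the API treats both bounds as inclusive, which is an assumption not verified here, an event falling exactly on a boundary could be returned twice. A minimal sketch of de-duplicating by event `id` follows. `dedupe_by_id` is a hypothetical helper and the sample features are made up.

```python
def dedupe_by_id(features):
    """Keep the first occurrence of each feature id, preserving order."""
    seen = set()
    unique = []
    for feature in features:
        fid = feature.get("id")
        if fid in seen:
            continue
        seen.add(fid)
        unique.append(feature)
    return unique


# Made-up features: "b" appears in two neighbouring partitions
fetched = [
    {"id": "a", "type": "Feature"},
    {"id": "b", "type": "Feature"},
    {"id": "b", "type": "Feature"},
    {"id": "c", "type": "Feature"},
]

unique = dedupe_by_id(fetched)
print([f["id"] for f in unique])
```

On the resulting DataFrame, `df.dropDuplicates(["id"])` has the same effect.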

### Step 3: Define the DataSource

The `DataSource` is the entry point. It defines:
- **`name()`**: The format string used in `spark.read.format("usgs")`
- **`schema()`**: The output schema of the DataFrame
- **`reader()`**: Returns the DataSourceReader instance

In [0]:
class USGSDataSource(DataSource):
    """
    Custom Spark DataSource for USGS Earthquake API.

    Usage:
        spark.dataSource.register(USGSDataSource)

        df = spark.read.format("usgs") \\
            .option("starttime", "2000-01-01") \\
            .option("endtime", "2001-01-01") \\
            .option("numPartitions", "25") \\
            .load()
    """

    @classmethod
    def name(cls):
        """The format name used in spark.read.format()"""
        return "usgs"

    def schema(self):
        """
        Define the output schema.

        We keep properties and geometry as MapType to preserve
        all the nested data from the API response.
        """
        return StructType([
            StructField("type", StringType(), True),
            StructField("properties", MapType(StringType(), StringType(), True), True),
            StructField("geometry", MapType(StringType(), StringType(), True), True),
            StructField("id", StringType(), True),
        ])

    def reader(self, schema: StructType):
        """Return a reader instance with the user's options."""
        return USGSDataSourceReader(schema, self.options)

### Step 4: Register and Use the DataSource

Now let's register our custom DataSource with Spark and use it!

In [0]:
# Register the DataSource
spark.dataSource.register(USGSDataSource)
print("✓ USGS DataSource registered with Spark")

### Demo: Fetching 12 Months of Data (Parallel!)

Now let's fetch the same 12 months of data, but using our parallel DataSource:

In [0]:
# Fetch the same 12 months of data - but now in PARALLEL!
import time

start_time = time.time()

df = spark.read.format("usgs") \
    .option("starttime", "2000-01-01") \
    .option("endtime", "2001-01-01") \
    .option("numPartitions", "25") \
    .load()

# Force execution by counting rows
event_count = df.count()

end_time = time.time()

print(f"\n⏱️  Total elapsed time: {end_time - start_time:.2f} seconds")
print(f"📊 Total events fetched: {event_count:,}")

### Compare the Results!

| Approach | Time | Speedup |
|----------|------|---------|
| Sequential (single-threaded) | ~30-40 seconds | 1x |
| PySpark DataSource (parallel) | ~5-10 seconds | **4-6x faster!** |

The speedup comes from:
1. **Parallel API calls** - Multiple executors fetch data simultaneously
2. **Distributed processing** - Work is spread across the cluster
3. **No waiting** - Partitions don't wait for each other
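
A back-of-envelope model shows where a speedup of this size can come from. It assumes every request takes the same time and that partitions run in "waves" limited by the number of tasks the cluster can run at once. The figures below (2 seconds per request, 8 parallel slots) are illustrative assumptions, not measurements, and `estimate_seconds` is a hypothetical helper.

```python
import math


def estimate_seconds(num_partitions: int, seconds_per_request: float, parallel_slots: int = 1) -> float:
    """Estimate wall-clock time when partitions run in waves of `parallel_slots`."""
    waves = math.ceil(num_partitions / parallel_slots)
    return waves * seconds_per_request


sequential = estimate_seconds(25, 2.0)                    # one request at a time
parallel = estimate_seconds(25, 2.0, parallel_slots=8)    # 8 requests at a time

print(f"Sequential: {sequential:.1f} s")
print(f"Parallel:   {parallel:.1f} s")
print(f"Speedup:    {sequential / parallel:.2f}x")
```

Real runs also pay for Spark job startup, and the API may respond more slowly under concurrent load, so measured speedups tend to be lower than this ideal.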

And remember - partitioning isn't optional! The API returns a **400 error** if we exceed 20,000 records. The question is whether we partition sequentially (slow) or in parallel (fast).

In [0]:
# Let's look at the data we fetched
display(df.limit(10))

### Understanding the Schema

The `properties` and `geometry` columns are `MapType` - they contain all the nested data from the API:

In [0]:
# Extract specific properties from the map
from pyspark.sql.functions import col

df_expanded = df.select(
    col("id"),
    col("properties")["mag"].alias("magnitude"),
    col("properties")["place"].alias("place"),
    col("properties")["sig"].alias("significance"),
    col("properties")["time"].alias("event_time"),
    col("geometry")["coordinates"].alias("coordinates")
)

display(df_expanded.limit(10))

## Summary

| Component | Purpose | Runs On |
|-----------|---------|---------|
| `InputPartition` | Defines one unit of work (date range) | Serialized to executors |
| `DataSourceReader.partitions()` | Creates all partition objects | Driver |
| `DataSourceReader.read()` | Fetches data for one partition | Executors (parallel!) |
| `DataSource` | Entry point, defines schema and name | Driver |

**Key Insight**: The magic happens because `read()` runs on executors. Spark automatically:
1. Distributes partitions across available executors
2. Calls `read()` in parallel on each executor
3. Combines the results into a single DataFrame

This pattern works for any API or data source whose requests can be split into independent ranges - not just USGS!

## Next Steps

In the actual ETL pipeline (`etl/bronze_ingestion.py`), we use this exact DataSource to:
1. Ingest earthquake data into the Bronze layer
2. Store raw data in Delta format with Change Data Feed enabled
3. Track ingestion progress with checkpoints

The implementation in `utils/datasource.py` is identical to what we built here!