In [31]:
# This notebook assumes ioos_qc and erddapy are installed:
# pip install ioos-qc erddapy

# ERDDAP to ioos_qc Workflow

This notebook demonstrates a complete workflow for fetching oceanographic data from an ERDDAP server, running quality control checks using ioos_qc, and summarizing the results.

## Workflow Overview

1. **Fetch data from ERDDAP** using `erddapy`
2. **Run QC tests** using `ioos_qc.config.Config` and stream classes
3. **Post-process results** to count flags and assess data quality

### Design Note

ERDDAP data access and QC result summaries are handled **outside** ioos_qc core by design. This keeps the library focused on QC tests while allowing users flexibility in how they fetch data and interpret results. This notebook shows how to compose these tools together.


## 1. Imports


In [32]:
import numpy as np
import pandas as pd
from erddapy import ERDDAP

from ioos_qc.config import Config
from ioos_qc.qartod import QartodFlags
from ioos_qc.results import collect_results
from ioos_qc.streams import PandasStream


## 2. Fetch Data from ERDDAP

We'll use `erddapy` to query a public ERDDAP server. This example fetches sea surface temperature data from NOAA's CoastWatch ERDDAP.

The query is limited to:
- A single station (NDBC buoy 41013 - Frying Pan Shoals, NC)
- A short time range (7 days)
- One variable (sea surface temperature)
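
The three limits above end up as a variable list and constraint expressions in a tabledap request URL. The sketch below assembles such a URL by hand with the standard library, only to show what the query looks like. `build_tabledap_url` is a hypothetical helper written for this illustration, not an erddapy function; in practice erddapy builds the URL for you (for example via `e.get_download_url()`).

```python
from urllib.parse import quote

server = "https://coastwatch.pfeg.noaa.gov/erddap"
dataset_id = "cwwcNDBCMet"
variables = ["station", "time", "latitude", "longitude", "wtmp"]

# (variable, operator, value); string values are wrapped in double quotes
constraints = [
    ("station", "=", '"41013"'),
    ("time", ">=", "2024-01-01T00:00:00Z"),
    ("time", "<=", "2024-01-07T23:59:59Z"),
]


def build_tabledap_url(server, dataset_id, variables, constraints, response="csv"):
    """Assemble a tabledap URL by hand (illustrative helper, not part of erddapy)."""
    query = ",".join(variables)
    for name, op, value in constraints:
        # Percent-encode '<', '>', quotes and ':' but leave '=' readable
        query += f"&{name}{quote(op, safe='=')}{quote(value, safe='')}"
    return f"{server}/tabledap/{dataset_id}.{response}?{query}"


url = build_tabledap_url(server, dataset_id, variables, constraints)
print(url)
```

Each constraint narrows the rows the server returns, so keeping the station and time window tight keeps the download small.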


In [33]:
# Configure the ERDDAP connection
e = ERDDAP(
    server="https://coastwatch.pfeg.noaa.gov/erddap",
    protocol="tabledap",
)
e.dataset_id = "cwwcNDBCMet"

# Define the variables we want
e.variables = [
    "station",
    "time",
    "latitude",
    "longitude",
    "wtmp",  # Sea surface temperature (°C)
]

# Add constraints: single station, 7 days of data
e.constraints = {
    "station=": "41013",  # Frying Pan Shoals, NC buoy
    "time>=": "2024-01-01T00:00:00Z",
    "time<=": "2024-01-07T23:59:59Z",
}


In [34]:
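# Fetch data as a pandas DataFrame.
# Extra keyword arguments are forwarded to pandas.read_csv; parse_dates=True
# only attempts to parse the index, so `time` stays a string column here.
df = e.to_pandas(parse_dates=True)

# Strip the units suffix from column names, e.g. "time (UTC)" -> "time"
df.columns = [col.split(" (")[0] for col in df.columns]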

print(f"Fetched {len(df)} observations")
df.head(10)


Fetched 1008 observations


Unnamed: 0,station,time,latitude,longitude,wtmp
0,41013,2024-01-01T00:00:00Z,33.436,-77.743,17.1
1,41013,2024-01-01T00:10:00Z,33.436,-77.743,17.1
2,41013,2024-01-01T00:20:00Z,33.436,-77.743,17.1
3,41013,2024-01-01T00:30:00Z,33.436,-77.743,17.2
4,41013,2024-01-01T00:40:00Z,33.436,-77.743,17.2
5,41013,2024-01-01T00:50:00Z,33.436,-77.743,17.2
6,41013,2024-01-01T01:00:00Z,33.436,-77.743,17.2
7,41013,2024-01-01T01:10:00Z,33.436,-77.743,17.3
8,41013,2024-01-01T01:20:00Z,33.436,-77.743,17.3
9,41013,2024-01-01T01:30:00Z,33.436,-77.743,17.3


In [35]:
# Quick look at the data
print(f"Time range: {df['time'].min()} to {df['time'].max()}")
print(f"Temperature range: {df['wtmp'].min():.2f} to {df['wtmp'].max():.2f} °C")
print(f"Missing values: {df['wtmp'].isna().sum()}")


Time range: 2024-01-01T00:00:00Z to 2024-01-07T23:50:00Z
Temperature range: 15.80 to 21.70 °C
Missing values: 5


## 3. Run ioos_qc Quality Control Tests

We define a minimal QARTOD configuration and run it against the data using ioos_qc's existing APIs.

### Configuration

For sea surface temperature, we'll run:
- **gross_range_test**: Check if values fall within physically realistic bounds
- **spike_test**: Detect sudden jumps in the data


In [36]:
# Define QC configuration for sea surface temperature
# These thresholds are examples - adjust based on your region and season
qc_config = {
    "wtmp": {
        "qartod": {
            "gross_range_test": {
                "fail_span": [-2, 35],      # Fail if outside -2 to 35 °C
                "suspect_span": [5, 30],    # Suspect if outside 5 to 30 °C
            },
            "spike_test": {
                "suspect_threshold": 2.0,   # Suspect if spike > 2 °C
                "fail_threshold": 5.0,      # Fail if spike > 5 °C
            },
        },
    },
}

# Create the Config object
config = Config(qc_config)
print(f"Configured {len(config.calls)} QC tests")


Configured 2 QC tests


In [37]:
# Run QC tests using PandasStream
stream = PandasStream(df, time="time", lat="latitude", lon="longitude")
qc_results = stream.run(config)

# Collect results into a list of CollectedResult objects
results_list = collect_results(qc_results, how="list")

print(f"Collected {len(results_list)} test results")
for r in results_list:
    print(f"  - {r.stream_id}: {r.package}.{r.test}")


Collected 2 test results
  - wtmp: qartod.gross_range_test
  - wtmp: qartod.spike_test


**Note:** The QC summary in the next section is an example of post-processing. By design, ioos_qc produces QC flags and leaves aggregation and reporting to downstream workflows such as notebooks or applications, so adapt this logic to your own needs.


## 4. QC Summary (Post-processing)

The following function counts QARTOD flags in the results. This is **post-processing** logic that users can adapt for their own needs.

### QARTOD Flag Meanings

| Flag | Value | Meaning |
|------|-------|---------|
| GOOD | 1 | Data passed QC |
| UNKNOWN | 2 | QC could not be performed |
| SUSPECT | 3 | Data is questionable |
| FAIL | 4 | Data failed QC |
| MISSING | 9 | Data is missing |
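
As a small illustration of counting these values, the sketch below tallies a toy flag array with `np.unique`. The array and the `FLAG_NAMES` mapping are made up for this example. It also shows one detail worth knowing when results arrive as masked arrays: masked entries need to be filled before conversion, because `np.asarray()` on a masked array returns the underlying data and discards the mask.

```python
import numpy as np

# Flag value -> label, following the table above
FLAG_NAMES = {1: "good", 2: "unknown", 3: "suspect", 4: "fail", 9: "missing"}

# Toy flag array; the second entry is masked
flags = np.ma.masked_array(
    [1, 1, 3, 4, 2, 1],
    mask=[False, True, False, False, False, False],
)

# Replace masked entries with the MISSING value (9) before counting
filled = np.ma.filled(flags, 9)

values, counts = np.unique(filled, return_counts=True)
flag_counts = {FLAG_NAMES[int(v)]: int(c) for v, c in zip(values, counts)}
print(flag_counts)
# {'good': 2, 'unknown': 1, 'suspect': 1, 'fail': 1, 'missing': 1}
```

`np.ma.filled` also accepts plain arrays and returns them unchanged, so the same line works whether or not a mask is present.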


In [38]:
def summarize_qc_flags(results_list):
    """Count QARTOD flags for each test result.
    
    Parameters
    ----------
    results_list : list
        List of CollectedResult objects from collect_results()
    
    Returns
    -------
    pandas.DataFrame
        Summary table with flag counts per test
    """
    flag_names = {
        QartodFlags.GOOD: "good",
        QartodFlags.UNKNOWN: "unknown",
        QartodFlags.SUSPECT: "suspect",
        QartodFlags.FAIL: "fail",
        QartodFlags.MISSING: "missing",
    }
    
    summary_rows = []
    for result in results_list:
        # Fill masked entries before converting: np.asarray() would drop the mask
        flags = result.results
        if np.ma.isMaskedArray(flags):
            flags = flags.filled(QartodFlags.MISSING)
        flags = np.asarray(flags).flatten()
        total = len(flags)
        
        # Count each flag type
        counts = {}
        for flag_value, flag_name in flag_names.items():
            count = int(np.sum(flags == flag_value))
            counts[flag_name] = count
        
        # Assemble the summary row: identifiers first, then counts and percentages
        row = {
            "variable": result.stream_id,
            "test": result.test,
            "total": total,
        }
        for flag_name, count in counts.items():
            row[flag_name] = count
            row[f"{flag_name}_pct"] = round(100 * count / total, 1) if total > 0 else 0.0
        
        summary_rows.append(row)
    
    return pd.DataFrame(summary_rows)


In [39]:
# Generate QC summary
summary = summarize_qc_flags(results_list)
summary


Unnamed: 0,variable,test,total,good,good_pct,unknown,unknown_pct,suspect,suspect_pct,fail,fail_pct,missing,missing_pct
0,wtmp,gross_range_test,1008,1003,99.5,0,0.0,0,0.0,0,0.0,5,0.5
1,wtmp,spike_test,1008,991,98.3,12,1.2,0,0.0,0,0.0,5,0.5


In [40]:
# Display a cleaner view with just counts
summary[["variable", "test", "total", "good", "suspect", "fail", "unknown", "missing"]]


Unnamed: 0,variable,test,total,good,suspect,fail,unknown,missing
0,wtmp,gross_range_test,1008,1003,0,0,0,5
1,wtmp,spike_test,1008,991,0,0,12,5


## 5. Add QC Flags to Original DataFrame

For further analysis, we can add the QC flag arrays back to the original DataFrame.


In [46]:
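# Add QC results to the dataframe, one column per test
for result in results_list:
    col_name = f"{result.stream_id}_{result.package}_{result.test}"
    # Fill any masked entries with MISSING so the mask is not silently dropped
    df[col_name] = np.asarray(np.ma.filled(result.results, QartodFlags.MISSING)).flatten()

df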


Unnamed: 0,station,time,latitude,longitude,wtmp,wtmp_qartod_gross_range_test,wtmp_qartod_spike_test
0,41013,2024-01-01T00:00:00Z,33.436,-77.743,17.1,1,2
1,41013,2024-01-01T00:10:00Z,33.436,-77.743,17.1,1,1
2,41013,2024-01-01T00:20:00Z,33.436,-77.743,17.1,1,1
3,41013,2024-01-01T00:30:00Z,33.436,-77.743,17.2,1,1
4,41013,2024-01-01T00:40:00Z,33.436,-77.743,17.2,1,1
...,...,...,...,...,...,...,...
1003,41013,2024-01-07T23:10:00Z,33.436,-77.743,16.7,1,1
1004,41013,2024-01-07T23:20:00Z,33.436,-77.743,16.7,1,1
1005,41013,2024-01-07T23:30:00Z,33.436,-77.743,16.7,1,1
1006,41013,2024-01-07T23:40:00Z,33.436,-77.743,16.7,1,1


In [50]:
# Example: Filter to show only suspect or failed observations
# You can customize this filter using other QartodFlags (GOOD, UNKNOWN, MISSING, etc.)
flagged = df[
    df["wtmp_qartod_gross_range_test"].isin([QartodFlags.SUSPECT, QartodFlags.FAIL]) | 
    df["wtmp_qartod_spike_test"].isin([QartodFlags.SUSPECT, QartodFlags.FAIL])
]

print(f"Found {len(flagged)} observations with suspect or fail flags")
if len(flagged) > 0:
    display(flagged)


Found 0 observations with suspect or fail flags


## 6. Interpretation

### Understanding the Results

The QC summary provides a quick assessment of data quality:

- **High `good` percentage**: Data is mostly within expected ranges
- **Non-zero `suspect`**: Some values warrant closer inspection
- **Non-zero `fail`**: Values fall outside the gross range `fail_span` or exceed the spike `fail_threshold`
- **High `unknown`**: QC tests couldn't evaluate these points (e.g., edge effects for spike test)
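
The edge effect mentioned for the spike test can be seen in a simplified sketch. The version below compares each point with the mean of its two neighbours, which is one common formulation; it is an assumption of this example rather than a copy of the ioos_qc implementation, and `spike_sketch` is a hypothetical name.

```python
import numpy as np

# QARTOD flag values used in this sketch
GOOD, UNKNOWN, SUSPECT, FAIL = 1, 2, 3, 4


def spike_sketch(values, suspect_threshold, fail_threshold):
    """Compare each point with the mean of its neighbours (illustrative only)."""
    values = np.asarray(values, dtype=float)
    # Everything starts as UNKNOWN; only interior points can be evaluated
    flags = np.full(values.shape, UNKNOWN, dtype=np.uint8)
    if values.size < 3:
        return flags
    reference = (values[:-2] + values[2:]) / 2.0
    diff = np.abs(values[1:-1] - reference)
    inner = np.full(diff.shape, GOOD, dtype=np.uint8)
    inner[diff > suspect_threshold] = SUSPECT
    inner[diff > fail_threshold] = FAIL
    # A NaN in the point or either neighbour leaves the result undetermined
    inner[np.isnan(diff)] = UNKNOWN
    flags[1:-1] = inner
    return flags


series = [17.1, 17.2, 20.0, 17.3, 17.2, 25.0, 17.1]
spike_flags = spike_sketch(series, suspect_threshold=2.0, fail_threshold=5.0)
print(spike_flags.tolist())  # [2, 1, 3, 1, 3, 4, 2]
```

The first and last points have only one neighbour, so they stay `unknown`. In this sketch a point next to a missing value is also `unknown`, and a point next to a large spike can itself be flagged because the spike pulls its neighbour mean away.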

### Next Steps

Users can extend this workflow by:

1. **Adjusting thresholds** based on regional climatology
2. **Adding more tests** (flat_line_test, rate_of_change_test, etc.)
3. **Visualizing flagged data** to understand failure patterns
4. **Filtering data** for downstream analysis based on QC flags
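
As one way to approach the filtering step, the sketch below keeps only rows where every QC column is `GOOD`. It runs on a small made-up DataFrame whose column names follow the `{variable}_qartod_{test}` pattern used earlier; the strict "all tests good" policy is an assumption of this example, and other policies (such as also accepting `UNKNOWN`) may suit your data better.

```python
import pandas as pd

GOOD = 1  # QARTOD "good" flag value

# Toy data standing in for the flagged DataFrame
df = pd.DataFrame(
    {
        "wtmp": [17.1, 17.2, 40.0, 17.3, None],
        "wtmp_qartod_gross_range_test": [1, 1, 4, 1, 9],
        "wtmp_qartod_spike_test": [2, 1, 4, 1, 9],
    }
)

# Select every QC flag column by its naming pattern
qc_cols = [col for col in df.columns if "_qartod_" in col]

# Keep a row only if all of its QC flags are GOOD
passed_all = (df[qc_cols] == GOOD).all(axis=1)
clean = df.loc[passed_all, ["wtmp"]]
print(clean["wtmp"].tolist())  # [17.2, 17.3]
```

Note that the first row is dropped even though its value looks fine: its spike flag is `UNKNOWN`, which a strict policy treats as not passing.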

### Key Points

- ERDDAP data fetching uses `erddapy` (not part of ioos_qc)
- QC configuration and execution use ioos_qc's public APIs
- Flag summarization is user-side post-processing
- This pattern keeps ioos_qc focused on QC tests while giving users flexibility
