# Regression test suite for SAMBAH

This notebook provides condensed examples of using Harmony to make requests against the [Subsetter And Multi-dimensional Batched Aggregation in Harmony (SAMBAH)](https://stitchee.readthedocs.io/en/latest/sambah_readme/) service developed to process Level 2 data from the [Tropospheric Emissions: Monitoring of Pollution (TEMPO)](https://asdc.larc.nasa.gov/project/TEMPO) instrument. 

### Features of SAMBAH include:

* Variable subsetting, including required variables.
* Temporal subsetting.
* Bounding box spatial subsetting.
* Concatenation within TEMPO east-west scans.
* Concatenation across scans.

### Prerequisites

The dependencies for this notebook are listed in [environment.yaml](./environment.yaml). To test or install locally, create the papermill environment used in the automated regression testing suite:

`conda env create -f ./environment.yaml && conda activate papermill-sambah`

A `.netrc` file must also be located in the `test` directory of this repository.
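
Before running the notebook, it can help to confirm that the `.netrc` file actually contains an entry for the relevant Earthdata Login host. The following is a minimal sketch using the standard-library `netrc` module. It assumes the Earthdata Login hosts are `urs.earthdata.nasa.gov` (production) and `uat.urs.earthdata.nasa.gov` (UAT); the helper name `has_netrc_entry` is hypothetical and is not part of this repository.

```python
from netrc import netrc
from pathlib import Path


def has_netrc_entry(netrc_path, host):
    """Return True if the .netrc file at `netrc_path` has credentials for `host`.

    Hypothetical helper for illustration only.
    """
    path = Path(netrc_path)
    if not path.is_file():
        return False

    # authenticators() returns None when the file has neither an entry for
    # the host nor a `default` entry.
    return netrc(str(path)).authenticators(host) is not None
```

For example, `has_netrc_entry('.netrc', 'uat.urs.earthdata.nasa.gov')` could be called from the `test` directory before running against UAT.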

### Test reference files:

The reference files stored in the `harmony-regression-test` repository are JSON files containing hashed values derived from all groups and variables in each file. The raw netCDF4 files are hosted in the Harmony UAT AWS account, in the `harmony-uat-regression-tests` S3 bucket under the `sambah` folder.

# Setup

## Import required packages:

In [None]:
from datetime import datetime
from os.path import exists
from pathlib import Path
from tempfile import TemporaryDirectory

from earthdata_hashdiff import nc4_matches_reference_hash_file
from harmony import BBox, Client, Collection, Environment, Request

### Import shared utility functions:

In [None]:
import sys

sys.path.append('../shared_utils')
from utilities import print_success, submit_and_download

## Set default parameters:

`papermill` requires default values for the parameters used in the workflow. In this case, the only parameter is `harmony_host_url`.

In [None]:
harmony_host_url = 'https://harmony.uat.earthdata.nasa.gov'

### Identify Harmony environment (for easier reference):

In [None]:
host_environment = {
    'http://localhost:3000': Environment.LOCAL,
    'https://harmony.sit.earthdata.nasa.gov': Environment.SIT,
    'https://harmony.uat.earthdata.nasa.gov': Environment.UAT,
    'https://harmony.earthdata.nasa.gov': Environment.PROD,
}

harmony_environment = host_environment.get(harmony_host_url)

if harmony_environment is not None:
    harmony_client = Client(env=harmony_environment)
else:
    # Unrecognised host: no client can be created.
    harmony_client = None

The request collection and granules are different for UAT and PROD:

In [None]:
tempo_no2_test_data_uat = {
    # TEMPO NO2 tropospheric, stratospheric, and total columns V03
    # https://cmr.uat.earthdata.nasa.gov/search/concepts/C1262899916-LARC_CLOUD.html
    'collection': Collection(id='C1262899916-LARC_CLOUD'),
    'granule_id': [
        'G1269044486-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T153258Z_S007G07.nc
        'G1269044632-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T153935Z_S007G08.nc
        'G1269044623-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T154612Z_S007G09.nc
        'G1269044612-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T155308Z_S008G01.nc
        'G1269044756-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T155948Z_S008G02.nc
    ],
}

tempo_no2_test_data_prod = {
    # TEMPO NO2 tropospheric and stratospheric columns V03 (BETA)
    # https://cmr.earthdata.nasa.gov/search/concepts/C2930725014-LARC_CLOUD.html
    'collection': Collection(id='C2930725014-LARC_CLOUD'),
    'granule_id': [
        'G3181300053-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T153258Z_S007G07.nc
        'G3181300108-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T153935Z_S007G08.nc
        'G3181299889-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T154612Z_S007G09.nc
        'G3181345515-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T155308Z_S008G01.nc
        'G3181345531-LARC_CLOUD',  # TEMPO_NO2_L2_V03_20240801T155948Z_S008G02.nc
    ],
}

sambah_test_data_by_environment = {
    Environment.LOCAL: tempo_no2_test_data_uat,
    Environment.SIT: tempo_no2_test_data_uat,
    Environment.UAT: tempo_no2_test_data_uat,
    Environment.PROD: tempo_no2_test_data_prod,
}

# `.get` returns None for environments without SAMBAH test data.
sambah_test_data = sambah_test_data_by_environment.get(harmony_environment)

In [None]:
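# Only build the request parameters when test data exist for this environment.
# Unpacking `None` would raise a TypeError, and the tests below skip
# themselves when `request_info` is None.
if sambah_test_data is not None:
    request_info = {
        **sambah_test_data,
        'temporal': {
            'start': datetime(2024, 8, 1, 15, 34, 0),
            'stop': datetime(2024, 8, 1, 16, 0, 0),
        },
        'spatial': BBox(-170, 33, -10, 38),
        # The chosen variables include one variable from each group;
        # support_data/scattering_weights is a 3D variable.
        'variables': [
            'product/vertical_column_stratosphere',
            'qa_statistics/fit_rms_residual',
            'support_data/scattering_weights',
        ],
    }
else:
    request_info = None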

# Begin regression tests

SAMBAH is currently deployed to Sandbox, SIT, UAT and production.

## Test Strategy Overview

### Test Execution Process

Each test follows this standardized process:

1. Define a request to Harmony using `harmony-py`
2. Execute that request
3. Download the output file within a temporary directory
4. Ensure that the expected output file is downloaded (i.e., that the request was successful)
5. Convert the output to a JSON file that hashes the variables and groups in the output file
6. Compare that JSON file to a reference JSON file, which should be identical (except for the `history` and `history_json` attributes that contain timestamps, and `/subset_files`, which varies between UAT and production)
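
Steps 5 and 6 can be illustrated with a simplified sketch. This is not the `earthdata_hashdiff` implementation: it assumes each variable or group is available as raw bytes, uses SHA-256 as the hash, and the names `hash_contents` and `matches_reference` are hypothetical. It only shows the idea of comparing per-path hashes while ignoring paths that are expected to differ.

```python
import hashlib


def hash_contents(contents):
    """Map each variable or group path to a SHA-256 hex digest of its bytes."""
    return {
        path: hashlib.sha256(data).hexdigest() for path, data in contents.items()
    }


def matches_reference(output_hashes, reference_hashes, skipped=frozenset()):
    """Compare two hash mappings, ignoring any paths listed in `skipped`."""

    def without_skipped(hashes):
        return {path: value for path, value in hashes.items() if path not in skipped}

    return without_skipped(output_hashes) == without_skipped(reference_hashes)


# Toy byte strings standing in for the contents of two netCDF4 files.
reference = hash_contents({'/product/no2': b'\x01\x02', '/subset_files': b'uat'})
output = hash_contents({'/product/no2': b'\x01\x02', '/subset_files': b'prod'})

print(matches_reference(output, reference))  # False: /subset_files differs
print(matches_reference(output, reference, skipped={'/subset_files'}))  # True
```

Storing only hashes keeps the reference files small, at the cost of reporting that a variable differs without showing how.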

### Capabilities Tested

The following four tests validate different aspects of SAMBAH functionality:

| Test                       | Purpose                             | Key Features Tested                                                 |
|:---------------------------|:------------------------------------|:--------------------------------------------------------------------|
| **1. Full Capabilities**   | Integration test with all features  | Temporal + Spatial + Variable subsetting + Multi-scan concatenation |
| **2. Variable Subsetting** | Isolated variable filtering         | Variable selection + Intra-scan concatenation only                  |
| **3. Spatial Subsetting**  | Isolated spatial filtering          | Bounding box filtering + Intra- & Inter-scan concatenation          |
| **4. Minimal Processing**  | Edge case validation                | Single granule passthrough with no subsetting                       |

### Test Data

All tests use the same **TEMPO NO2 L2 V03** data from **2024-08-01**. Depending on the test, a request specifies some or all of:
- **5 granules** spanning **2 scans** (S007G07-G09, S008G01-G02)
- **Time range:** 15:34–16:00 UTC (26 minutes)
- **Geographic coverage:** North America and Pacific regions (bounding box 170°W to 10°W, 33°N to 38°N)
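
The scan and granule numbers are encoded in the `S###G##` suffix of each file name, so the grouping of the five granules into two scans can be checked with a short sketch. The regular expression and the helper name `group_by_scan` are illustrative assumptions based only on the five file names listed in this notebook.

```python
import re
from collections import defaultdict

# File names copied from the comments in the test data cell.
GRANULE_NAMES = [
    'TEMPO_NO2_L2_V03_20240801T153258Z_S007G07.nc',
    'TEMPO_NO2_L2_V03_20240801T153935Z_S007G08.nc',
    'TEMPO_NO2_L2_V03_20240801T154612Z_S007G09.nc',
    'TEMPO_NO2_L2_V03_20240801T155308Z_S008G01.nc',
    'TEMPO_NO2_L2_V03_20240801T155948Z_S008G02.nc',
]

# Assumed naming pattern: ..._S<3-digit scan>G<2-digit granule>.nc
SCAN_GRANULE_PATTERN = re.compile(r'_S(?P<scan>\d{3})G(?P<granule>\d{2})\.nc$')


def group_by_scan(file_names):
    """Group granule numbers by scan number, parsed from TEMPO file names."""
    scans = defaultdict(list)
    for name in file_names:
        match = SCAN_GRANULE_PATTERN.search(name)
        if match is None:
            raise ValueError(f'Unrecognised file name: {name}')
        scans[int(match['scan'])].append(int(match['granule']))
    return dict(scans)


print(group_by_scan(GRANULE_NAMES))  # {7: [7, 8, 9], 8: [1, 2]}
```

The first two names belong to scan 7, which is why a request limited to the first two granules exercises only concatenation within a scan.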


## Test Execution Helper Function

In [None]:
def execute_sambah_test(
    harmony_client_object,
    harmony_request,
    request_name,
    output_filename,
    reference_filepath,
):
    """Execute a SAMBAH test with temporary directory handling and validation."""
    with TemporaryDirectory() as tmp_dir:
        # `tmp_dir` is a string, so wrap it in a Path before joining.
        output_path = Path(tmp_dir) / output_filename
        submit_and_download(harmony_client_object, harmony_request, output_path)
        assert exists(output_path), f'Unsuccessful {request_name}.'
        assert nc4_matches_reference_hash_file(
            output_path,
            reference_filepath,
            skipped_variables_or_groups='/subset_files',
        ), f'{request_name}: Output and reference files do not match'

    print_success(request_name)

## SAMBAH Test 1: Full Capabilities

**Parameters:** 5 granules, time window, large bbox, 3 variables from different groups

**Tests:** Temporal + Spatial + Variable subsetting + Multi-scan (i.e., `extend` & `concatenate`) concatenation

In [None]:
if request_info is not None:
    # All 5 granules + temporal + variable + spatial subsetting
    temp_var_bbox_request = Request(
        collection=request_info['collection'],
        extend=['mirror_step'],
        concatenate=True,
        granule_id=request_info['granule_id'],  # All 5 granules
        temporal=request_info['temporal'],  # 26-min window
        variables=request_info['variables'],  # 3 selected variables
        spatial=request_info['spatial'],  # Bounding box
    )

    execute_sambah_test(
        harmony_client,
        temp_var_bbox_request,
        request_name='SAMBAH temporal + variable + bounding box request',
        output_filename='temp_var_bbox.nc4',
        reference_filepath='reference_files/temp_var_bbox.json',
    )
else:
    print(
        f'SAMBAH is not configured for environment: "{harmony_environment}" - skipping test.'
    )

## SAMBAH Test 2: Variable Subsetting

**Parameters:** 2 adjacent granules (same scan), 3 variables including one 3D variable

**Tests:** Variable selection + Intra-scan (i.e., `extend`) concatenation only

In [None]:
if request_info is not None:
    # First 2 granules + variable subsetting only
    var_only_request = Request(
        collection=request_info['collection'],
        extend=['mirror_step'],
        concatenate=True,
        granule_id=request_info['granule_id'][:2],  # First 2 granules only
        variables=request_info['variables'],  # 3 selected variables
    )

    execute_sambah_test(
        harmony_client,
        var_only_request,
        request_name='SAMBAH variable request',
        output_filename='var_only.nc4',
        reference_filepath='reference_files/var_only.json',
    )
else:
    print(
        f'SAMBAH is not configured for environment: "{harmony_environment}" - skipping test.'
    )

## SAMBAH Test 3: Spatial Subsetting

**Parameters:** 5 granules, large bbox, all variables

**Tests:** Bounding box filtering + Multi-scan (i.e., `extend` & `concatenate`) concatenation

In [None]:
if request_info is not None:
    # All 5 granules + spatial subsetting only
    spatial_only_request = Request(
        collection=request_info['collection'],
        extend=['mirror_step'],
        concatenate=True,
        granule_id=request_info['granule_id'],  # All 5 granules
        spatial=request_info['spatial'],  # Bounding box
    )

    execute_sambah_test(
        harmony_client,
        spatial_only_request,
        request_name='SAMBAH spatial request',
        output_filename='spatial_only.nc4',
        reference_filepath='reference_files/spatial_only.json',
    )
else:
    print(
        f'SAMBAH is not configured for environment: "{harmony_environment}" - skipping test.'
    )

## SAMBAH Test 4: Minimal Processing

**Parameters:** 1 granule, no subsetting operations

**Tests:** Single granule passthrough with no subsetting

In [None]:
if request_info is not None:
    # Single granule + no subsetting
    all_data_request = Request(
        collection=request_info['collection'],
        extend=['mirror_step'],
        concatenate=True,
        granule_id=request_info['granule_id'][:1],  # Single granule only
    )

    execute_sambah_test(
        harmony_client,
        all_data_request,
        request_name='SAMBAH no subset single file request',
        output_filename='all_data.nc4',
        reference_filepath='reference_files/all_data.json',
    )
else:
    print(
        f'SAMBAH is not configured for environment: "{harmony_environment}" - skipping test.'
    )