---
title: Indonesia GeoMAD Notebook
subtitle: Testing ODC-Stats Configuration for Cloud Cover Optimization and Picking the Right Study and Testing Area
date: 2025-11-13
authors:
  - name: Muhammad Taufik
    affiliations:
      - Badan Informasi Geospasial (BIG)
    email: muhammad.taufik@big.go.id
  - name: Fang Yuan
    affiliations:
      - Auspatious
    email: contact@fangyuan.space
  - name: Alex Leith
    affiliations:
      - Auspatious
    email: alex@auspatious.com

keywords:
  - GeoMAD
  - Sentinel-2
  - Open Data Cube
  - Cloud Cover
  - Indonesia
  - Remote Sensing
  - Earth Observation
project:
  license: CC-BY-4.0
  open_access: true
  github: https://github.com/your-repo/indonesia-geomad
abstract: |
  This notebook explores optimal cloud cover thresholds for generating 
  GeoMAD (Geometric Median and Median Absolute Deviation) composites 
  over Indonesia using Sentinel-2 L2A data. We compare different cloud 
  cover filtering strategies (≤100%, ≤80%, ≤60%) to balance data quality, 
  temporal coverage, and storage requirements. Additionally, we evaluate 
  suitable study areas for testing and validation across Indonesia's 
  diverse geographic conditions.
---

## Abstract

This notebook explores optimal cloud cover thresholds for generating GeoMAD (Geometric Median and Median Absolute Deviation) composites over Indonesia using Sentinel-2 L2A data. We compare three cloud cover filtering strategies (≤100%, ≤80%, ≤60%) to balance data quality and temporal coverage. We also evaluate suitable study areas for testing and validation across Indonesia's diverse geographic conditions.

## A. Objectives

1. Evaluate data distribution and availability across Indonesia under different cloud cover thresholds (100%, 80%, 60%)

2. Locate the tiles with the fewest datasets to serve as test subjects, alongside tiles that represent diverse geographic conditions

3. Run tests in Argo Workflows to record peak memory usage, especially on tiles with many datasets


## B. Initial Setup
### Libraries Used
pandas
: Python data analysis library for handling tabular data, dataframes, and statistical operations

odc-stats
: Open Data Cube statistics toolkit for generating temporal composites and summary statistics from Earth observation data

In [None]:
# import libraries 
import pandas as pd

> In piksel-sandbox, we need to upgrade odc-stats to the latest version.

In [None]:
!pip install --upgrade odc-stats
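
After the upgrade it can be useful to confirm which version the kernel actually sees. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of odc-stats. If odc-stats was already imported before the upgrade, the kernel may need a restart before the new version is picked up.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version found in the current environment (None if not installed)
print("odc-stats:", installed_version("odc-stats"))
```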

## C. Generate ODC-Stats Task Database

> We use terminal commands to imitate the production workflow, which runs in the odc-stats container.

The function below generates task databases filtered by cloud cover threshold.

In [None]:
def save_tasks(cloud_cover, output_db):
    """Generate odc-stats task database with cloud cover filter."""
    !odc-stats save-tasks \
        --frequency "annual" \
        --grid "EPSG:6933;10;5000" \
        --year "2024" \
        --input-products "s2_l2a" \
        --dataset-filter='{{"cloud_cover": [0,{cloud_cover}]}}' \
        {output_db}

### How save_tasks() Works

When executed, the `save-tasks` command performs the following operations:

1. **Query ODC Database** - Connects to the Open Data Cube and queries all indexed Sentinel-2 L2A datasets for the year 2024

2. **Apply Cloud Cover Filter** - The function takes two arguments, `cloud_cover` and `output_db`, and keeps only the datasets whose cloud cover falls within the specified range:

   ```python
   save_tasks(60, "tasks_cc60.db")   # cloud_cover: [0, 60]
   save_tasks(80, "tasks_cc80.db")   # cloud_cover: [0, 80]
   save_tasks(100, "tasks_cc100.db") # cloud_cover: [0, 100]
   ```
   
   - **First argument** (`cloud_cover`): Maximum cloud cover percentage (60, 80, or 100)
   - **Second argument** (`output_db`): Name of the output task database; the other generated files share its base name (for example, `tasks_cc100.csv`)
   - The filter keeps all datasets with cloud cover from 0% up to the specified threshold

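3. **Generate Spatial Grid** - Creates a processing grid in the EPSG:6933 projection covering Indonesia. The grid string `EPSG:6933;10;5000` follows the `crs;pixel_resolution;shape_in_pixels` form, so it describes 10 m pixels in tiles of 5000 × 5000 pixels (50 km × 50 km), not 10° tiles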

4. **Spatial Intersection** - Matches filtered datasets to their corresponding grid tiles based on spatial footprints

5. **Task Generation** - For each tile, generates processing tasks containing:
   - Tile identifier and spatial bounds
   - List of datasets intersecting that tile
   - Metadata for GeoMAD computation

6. **Database Storage** - Serializes all tasks into multiple output formats:
   - **`.db`** - SQLite database for efficient querying and task distribution
   - **`.csv`** - Tabular summary of tiles and dataset counts
   - **`.json`** - JSON manifest with complete task specifications
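
The tile size implied by the `--grid` value can be checked with a little arithmetic. This is a minimal sketch that assumes the string follows the `crs;pixel_resolution;shape_in_pixels` form with a single integer shape and a CRS measured in metres (as EPSG:6933 is); `tile_size_from_grid` is a hypothetical helper, not an odc-stats function.

```python
def tile_size_from_grid(grid):
    """Split 'crs;pixel_resolution;shape_in_pixels' and derive the tile edge length."""
    crs, resolution, shape = grid.split(";")
    resolution_m = float(resolution)   # pixel size in CRS units (metres assumed)
    pixels = int(shape)                # pixels along one tile edge
    return {
        "crs": crs,
        "resolution_m": resolution_m,
        "pixels_per_side": pixels,
        "tile_edge_km": resolution_m * pixels / 1000,
    }


info = tile_size_from_grid("EPSG:6933;10;5000")
print(info["tile_edge_km"])  # 50.0
```

Under that reading, each task covers a 50 km × 50 km tile at 10 m resolution, which is useful context when estimating peak memory per tile.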

:::{note}
This process takes several minutes to complete.
:::
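
Outside a notebook the `!` shell escape is not available, so the same command could be assembled as an argument list and passed to `subprocess`. The sketch below reuses only the flags that appear in `save_tasks()`; the helper names `build_save_tasks_cmd` and `run_save_tasks` are hypothetical, and actually running the command assumes odc-stats is installed and a datacube is configured.

```python
import json
import subprocess


def build_save_tasks_cmd(cloud_cover, output_db, year="2024"):
    """Build the odc-stats save-tasks argument list used in this notebook."""
    # json.dumps avoids hand-written brace escaping for the filter
    dataset_filter = json.dumps({"cloud_cover": [0, cloud_cover]})
    return [
        "odc-stats", "save-tasks",
        "--frequency", "annual",
        "--grid", "EPSG:6933;10;5000",
        "--year", str(year),
        "--input-products", "s2_l2a",
        f"--dataset-filter={dataset_filter}",
        output_db,
    ]


def run_save_tasks(cloud_cover, output_db):
    """Run the command; raises CalledProcessError if odc-stats exits with an error."""
    subprocess.run(build_save_tasks_cmd(cloud_cover, output_db), check=True)


cmd = build_save_tasks_cmd(60, "tasks_cc60.db")
print(" ".join(cmd))
```

Passing a list rather than a single shell string means the JSON filter needs no extra quoting.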


In [None]:
# Execute function
save_tasks(60, "tasks_cc60.db")
save_tasks(80, "tasks_cc80.db")
save_tasks(100, "tasks_cc100.db")

In [None]:
# Load the task summary CSV written alongside tasks_cc100.db
tasks_cc100 = pd.read_csv('tasks_cc100.csv')

In [None]:
print(f"\nSample tasks:")
print(tasks_cc100.head(10))

In [None]:
# Summarise the task table loaded from tasks_cc100.csv
print(f"Total tasks: {len(tasks_cc100)}")
print("\nDataset summary:")
print(tasks_cc100.describe())
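
To work towards the first two objectives, the three CSV summaries can be compared side by side and the sparsest tiles listed. The following is a sketch only: the helper names are hypothetical, and the name of the per-tile dataset count column is an assumption that should be replaced with whatever `tasks_cc100.head()` shows.

```python
import pandas as pd


def sparsest_tiles(tasks, count_col, n=5):
    """Return the n rows with the smallest value in count_col."""
    return tasks.nsmallest(n, count_col)


def compare_thresholds(frames, count_col):
    """Summarise task and dataset totals for each cloud cover threshold."""
    rows = [
        {
            "threshold": name,
            "tasks": len(df),
            "datasets": int(df[count_col].sum()),
            "median_per_tile": float(df[count_col].median()),
        }
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index("threshold")


# Hypothetical usage; "datasets" is an assumed column name:
# frames = {cc: pd.read_csv(f"tasks_cc{cc}.csv") for cc in (60, 80, 100)}
# print(compare_thresholds(frames, count_col="datasets"))
# print(sparsest_tiles(frames[100], count_col="datasets"))
```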