<a href='https://jupyterhub.user.eopf.eodc.eu/hub/login?next=/hub/spawn?next=/hub/user-redirect/git-pull?repo=https://github.com/eopf-toolkit/eopf-101&branch=main&urlpath=lab/tree/eopf-101/02_about_eopf_zarr/65_create_overviews.ipynb#fancy-forms-config={"profile":"choose-your-environment","image":"unlisted_choice","image:unlisted_choice":"4zm3809f.c1.de1.container-registry.ovh.net/eopf-toolkit-python/eopf-toolkit-python:latest","autoStart":"true"}' target="_blank">
  <button style="background-color:#0072ce; color:white; padding:0.6em 1.2em; font-size:1rem; border:none; border-radius:6px; margin-top:1em;">
    🚀 Launch this notebook in JupyterLab
  </button>
</a>

**By:** *[@christophenoel](https://github.com/christophenoel)*

### Introduction

Large satellite products, such as Sentinel-2 Level-2A scenes, contain tens of millions of pixels per band. Accessing or visualising them at full resolution is often unnecessary, especially for exploratory analysis, map rendering, or quality checks. Multiscale pyramids address this by providing progressively coarser versions of the original data, allowing client applications to request only the level of detail required. This approach reduces computational load, improves user experience, and aligns with modern cloud-native geospatial workflows.

Our approach consists of two notebooks:

* **Part 1: Creating Zarr Overviews**
* [Part 2: Visualising Multiscale Pyramids](./66_use_overviews.ipynb)

In this notebook, we will demonstrate how to create overviews (also called **multiscale pyramids**) for large Earth Observation datasets stored in Zarr format. Overviews are downscaled representations of gridded data that support efficient visualisation and scalable access to high-resolution datasets. They enable fast inspection at multiple zoom levels, reduce data transfer volumes, and improve performance when working with cloud-optimised storage.
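
To give a feel for the numbers, the following sketch estimates how many pixels each overview level holds for a 10980 × 10980 band and how much extra storage the whole pyramid adds. The helper `pyramid_overhead` and the factor list are illustrative assumptions rather than part of any library, and incomplete blocks at the edge are assumed to be dropped.

```python
def pyramid_overhead(size, factors):
    """Return per-level pixel counts and the pyramid's extra storage as a
    fraction of the full-resolution band (square band, edge blocks dropped)."""
    base_pixels = size * size
    level_pixels = {f: (size // f) ** 2 for f in factors}
    overhead = sum(level_pixels.values()) / base_pixels
    return level_pixels, overhead


level_pixels, overhead = pyramid_overhead(10980, [2, 4, 8, 16, 32, 64, 128])
for factor, pixels in level_pixels.items():
    print(f"factor {factor:>3}: {pixels:>10,} pixels")
print(f"extra storage: {overhead:.1%} of the base level")
```

Because each level holds at most a quarter of the pixels of the previous one, the complete pyramid stays below one third of the size of the base level.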

### What we will learn

- 🗂️ How to compute multiscale overview levels from high-resolution satellite data
- 📊 How to attach GeoZarr-compliant metadata to datasets
- 💾 How to write overview pyramids to Zarr storage

### Prerequisites

This notebook uses:
- **Dataset**: Sentinel-2 L2A reflectance data from EODC object storage (hosted by the STAC catalogue)
- **Resolution**: 10 m spatial resolution (10980 × 10980 pixels)
- **Bands**: Blue (b02), Green (b03), Red (b04), NIR (b08)

The workflow follows a **compute-then-write pattern** that separates in-memory computation from disk persistence, allowing validation before committing changes.

---

#### Import libraries

In [None]:
import os, warnings, json, s3fs, zarr, dask
import xarray as xr
from pathlib import Path

We prepare our credentials for S3 access:

In [None]:
warnings.filterwarnings("ignore")
try:
    bucket = os.environ["BUCKET_NAME"]
    access = os.environ["ACCESS_KEY"]
    secret = os.environ["SECRET_KEY"]
    bucket_endpoint = os.environ["BUCKET_ENDPOINT"]
except KeyError as e:
    raise RuntimeError(
        f"Missing required environment variable: {e}\n"
        "Please set BUCKET_NAME, ACCESS_KEY, SECRET_KEY, and BUCKET_ENDPOINT"
    ) from e

# S3 filesystem
fs = s3fs.S3FileSystem(
    key=access,
    secret=secret,
    client_kwargs={"endpoint_url": bucket_endpoint},
)
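
The credential check stops at the first missing variable. As a possible alternative, the sketch below collects every missing variable before raising, which can save a few round trips when setting up a new environment. The helper `read_s3_settings` and the placeholder values are hypothetical; a plain dictionary stands in for `os.environ`.

```python
REQUIRED_VARS = ("BUCKET_NAME", "ACCESS_KEY", "SECRET_KEY", "BUCKET_ENDPOINT")


def read_s3_settings(env):
    """Return the required settings from a mapping, or raise one error
    that names every missing (or empty) variable."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# A plain dict stands in for os.environ; the values are placeholders
example_env = {"BUCKET_NAME": "my-bucket", "ACCESS_KEY": "abc"}
try:
    read_s3_settings(example_env)
except RuntimeError as err:
    message = str(err)
    print(message)
```

In the notebook itself, the same function could be called as `read_s3_settings(os.environ)`.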


## Copy Remote Dataset to Local Storage

Our starting point is a "local", writable copy of the STAC Zarr item we are interested in. We make this copy for two main reasons:

* The remote dataset is read-only (object storage), and we need a writable copy to add the new overview groups (L1, L2, ...)
* Copying preserves the complete original structure


To make sure that we use a convenient scene, we select a source URL from the STAC catalogue.

In [None]:
remote_url = ("https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202601-s02msil2a-eu/30/products/cpm_v262/S2C_MSIL2A_20260130T102251_N0511_R065_T32TQT_20260130T142716.zarr")
store_path = ("S2C_MSIL2A_20260130T102251_N0511_R065_T32TQT_20260130T142716.zarr")
base_store = fs.get_mapper(f"{bucket}/{store_path}")

As a first step, we read the remote Zarr dataset and save a copy of it to our S3 bucket, ready for exploration.
Chunking helps large datasets load faster, especially in visualisation tools that read data in small spatial tiles.

In [None]:
print("Copying remote Zarr to local storage... (may take several minutes)")
try:
    s2l2a_remote = xr.open_datatree(remote_url, engine="zarr")
except Exception as e:
    raise RuntimeError(f"Failed to open remote dataset: {e}\nCheck network connectivity and URL") from e
print(f"Copying to s3://{bucket}/{store_path} ...")
s2l2a_remote.to_zarr(
    store=base_store,
    mode="w",
    consolidated=False,
    zarr_version=2,
    compute=True,
)
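
To make the remark about chunking concrete, the sketch below counts how many chunks a viewer has to read to draw one 256 × 256 tile under three chunk layouts. The chunk sizes, the tile position, and both helper functions are assumptions chosen for illustration; they do not describe the actual chunking of this product.

```python
import math


def chunks_touched(offset, length, chunk):
    """Number of chunks along one axis intersected by [offset, offset + length)."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1


def chunks_for_window(x0, y0, width, height, chunk):
    """Chunks a rectangular pixel window needs, assuming square chunks."""
    return chunks_touched(x0, width, chunk) * chunks_touched(y0, height, chunk)


side = 10980
for chunk in (side, 1024, 256):  # one chunk per band vs. two tiled layouts
    total = math.ceil(side / chunk) ** 2
    needed = chunks_for_window(3000, 3000, 256, 256, chunk)
    pixels_read = needed * chunk * chunk
    print(f"chunk {chunk:>5}: {needed} of {total} chunks, {pixels_read:,} pixels read")
```

With a single chunk per band, the whole band must be read to draw one tile, while smaller chunks keep the amount of data read close to the size of the tile.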

Then, we can access the dataset from the S3 bucket and look inside the group that contains the 10-metre reflectance data to understand which variables, dimensions, and coordinates it contains.

In [None]:
# --- Load Local Dataset and Inspect Structure ---
variable_group_path = "measurements/reflectance/r10m"
r10m_store = fs.get_mapper(
    f"{bucket}/{store_path}/{variable_group_path}"
)
dataset = xr.open_dataset(r10m_store, engine="zarr")
dataset

## Compute Overviews (In-Memory)

Now we compute the overview levels **in memory only** - no data is written to disk at this stage.

We extract the reflectance group at 10m resolution and automatically discover variables and dimensions.

**Key operations:**
- Reuse the 10 m reflectance group opened above as an xarray Dataset
- Automatically discover all variables and dimensions
- Identify the spatial dimension names (x, y)
- Coarsen the variables once per scale factor


In this step, we identify the spatial dimensions and variables in the dataset and define the scale levels that will be used to generate lower-resolution overviews.

In [None]:
scales = [2, 4, 8, 16, 32, 64, 128]  # Scale factors for each level
variables = [var for var in dataset.data_vars]  # Discover variables
spatial_dims = [dim for dim in dataset.dims]  # Discover dimensions
x_dim = next((d for d in spatial_dims if d in ['x', 'X', 'lon', 'longitude']), 'x')  # Identify x dimension
y_dim = next((d for d in spatial_dims if d in ['y', 'Y', 'lat', 'latitude']), 'y')  # Identify y dimension
print(f"Variables: {variables} | Dims: {spatial_dims} | Shape: {dataset[variables[0]].shape} | Using: x_dim='{x_dim}', y_dim='{y_dim}'\n")

Accessing such information allows us to generate a series of lower-resolution overview datasets directly in memory.

For each scale factor, we use xarray's `coarsen()` function to average groups of pixels along the spatial dimensions (x, y). Each coarsened version is stored under a level name like L1, L2, etc., representing progressively coarser spatial resolutions.

In [None]:
overviews = {}  # Generate in-memory overview datasets
for i, factor in enumerate(scales):
    level_id = f"L{i+1}"
    coarsened = dataset.coarsen({x_dim: factor, y_dim: factor}, boundary="trim").mean()
    overviews[level_id] = coarsened[variables]

print(f"Created {len(overviews)} overview levels:")
for level_id, level_ds in overviews.items():
    print(f"  {level_id}: shape {level_ds[variables[0]].shape}, dims {dict(level_ds.sizes)}")
print("\nOverview datasets created successfully (in memory only, not written to disk)")
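
To show what the block averaging does to the pixel values, here is a pure-Python sketch of the same idea on a tiny grid: non-overlapping `factor` × `factor` blocks are averaged, and rows or columns that do not fill a complete block are dropped, as with `boundary="trim"`. The helper `coarsen_mean` is hypothetical and is not part of xarray; it also ignores missing values.

```python
def coarsen_mean(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2D list.
    Rows and columns that do not fill a complete block are dropped."""
    rows = len(grid) // factor
    cols = len(grid[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [
                grid[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# 5 x 5 toy "band": the last row and column are trimmed for factor 2
band = [[row * 5 + col for col in range(5)] for row in range(5)]
coarse = coarsen_mean(band, 2)
print(coarse)  # [[3.0, 5.0], [13.0, 15.0]]
```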

## Attach Multiscales Metadata

Once the overviews have been computed, we attach GeoZarr-compliant metadata so that the dataset is self-describing and applications can discover the overview levels. The metadata records:

- **Version**: Schema version ("1.0")
- **Resampling method**: How data was aggregated ("average")
- **Variables**: Which bands have overviews
- **Layout**: The complete hierarchy including L0 (base) and all derived levels

The metadata is stored in `dataset.attrs["multiscales"]` following the GeoZarr Overviews specification. This ensures interoperability with GeoZarr-aware tools and libraries.

Here we prepare the information needed to describe the overview hierarchy in the GeoZarr metadata. We set `overview_path` to indicate where the overview groups will be stored, record the `resampling_method` (`"average"`) used to create them, and compute the base spatial resolutions (`x_res` and `y_res`) from the coordinate spacing.

In [None]:
overview_path = "overviews"  # Where overviews are written ("." for direct children)
resampling_method = "average"
x_res = abs(float(dataset[x_dim].values[1] - dataset[x_dim].values[0]))  # Use the discovered dimension names
y_res = abs(float(dataset[y_dim].values[1] - dataset[y_dim].values[0]))
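
Taking the difference of the first two coordinate values assumes a regular grid. The sketch below illustrates one way to verify that assumption before trusting the result; the helper `cell_size`, its tolerance, and the coordinate values are all made up for the example.

```python
def cell_size(coords, tolerance=1e-6):
    """Return the absolute spacing of a 1D coordinate list, checking that
    the spacing is constant within `tolerance`."""
    if len(coords) < 2:
        raise ValueError("need at least two coordinate values")
    steps = [b - a for a, b in zip(coords, coords[1:])]
    if any(abs(step - steps[0]) > tolerance for step in steps):
        raise ValueError("coordinate spacing is not uniform")
    return abs(float(steps[0]))


# Made-up pixel-centre coordinates in metres: x ascending, y descending
x = [600005.0 + 10 * i for i in range(5)]
y = [5000035.0 - 10 * i for i in range(5)]
print(cell_size(x), cell_size(y))  # 10.0 10.0
```

The absolute value matters because the y coordinate of north-up imagery usually decreases from the first row to the last.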

Now, we build the multiscales layout metadata that describes how all overview levels relate to the base dataset.

The first entry (L0) represents the original data, including its spatial resolution (cell_size). Each subsequent level (L1, L2, …) is added to the layout with information about its path, the level it was derived from, the scale factors applied, the resampling method, and its corresponding cell size.

Finally, this complete structure is stored in `dataset.attrs["multiscales"]` following the GeoZarr Overviews specification (draft). The printed JSON summary shows the final metadata layout that GeoZarr-aware tools can use to identify and navigate between resolution levels.

In [None]:
layout = [{"id": "L0", "path": ".", "cell_size": [x_res, y_res]}]  # Base level (native data at current group)
for i, factor in enumerate(scales):
    level_id = f"L{i+1}"
    level_path = level_id if overview_path == "." else f"{overview_path}/{level_id}"
    level_cell_size = [x_res * factor, y_res * factor]
    # Every level is coarsened directly from the base dataset, so it derives from L0
    layout.append({
        "id": level_id,
        "path": level_path,
        "derived_from": "L0",
        "factors": [factor, factor],
        "resampling_method": resampling_method,
        "cell_size": level_cell_size,
    })
dataset.attrs["multiscales"] = {"version": "1.0", "resampling_method": resampling_method, "variables": variables, "layout": layout}
print("Metadata structure:")
print(json.dumps(dataset.attrs["multiscales"], indent=2))
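
Before writing anything, it can be worth running a quick sanity check on the layout. The following sketch verifies that level ids are unique, that every `derived_from` points at a level listed earlier, and that no two levels share a path. These checks are our own assumptions about what a consistent layout looks like; `check_layout` is a hypothetical helper, not an official GeoZarr validator.

```python
def check_layout(layout):
    """Return a list of problems found in a multiscales layout (empty if none)."""
    problems = []
    seen = set()
    for entry in layout:
        level_id = entry["id"]
        if level_id in seen:
            problems.append(f"duplicate id {level_id}")
        parent = entry.get("derived_from")
        if parent is not None and parent not in seen:
            problems.append(f"{level_id} is derived from unknown level {parent}")
        seen.add(level_id)
    paths = [entry["path"] for entry in layout]
    if len(paths) != len(set(paths)):
        problems.append("two levels share the same path")
    return problems


example_layout = [
    {"id": "L0", "path": ".", "cell_size": [10.0, 10.0]},
    {"id": "L1", "path": "overviews/L1", "derived_from": "L0",
     "factors": [2, 2], "cell_size": [20.0, 20.0]},
    {"id": "L2", "path": "overviews/L2", "derived_from": "L9",
     "factors": [4, 4], "cell_size": [40.0, 40.0]},
]
print(check_layout(example_layout))
```

The deliberately broken third entry is reported, while the first two entries alone pass without complaints.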

In [None]:
new_attrs = dataset.attrs.copy()   # includes your multiscales metadata
json_bytes = json.dumps(new_attrs).encode("utf-8")
# FSMap stores values as bytes
r10m_store['.zattrs'] = json_bytes
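
The cell above serialises the attributes to JSON and stores them as the group's `.zattrs` entry. A slightly more defensive variant, sketched here under the assumption that the store behaves like a byte-valued mapping, reads the existing document first and merges the new attributes into it. The helper `update_zattrs` is hypothetical, and a plain dictionary stands in for the object returned by `fs.get_mapper()`.

```python
import json


def update_zattrs(store, new_attrs, key=".zattrs"):
    """Merge `new_attrs` into the JSON document stored under `key`,
    keeping attributes that are already there."""
    current = json.loads(store[key].decode("utf-8")) if key in store else {}
    current.update(new_attrs)
    store[key] = json.dumps(current).encode("utf-8")
    return current


# A dict stands in for the byte-valued mapping returned by fs.get_mapper()
fake_store = {".zattrs": json.dumps({"title": "r10m"}).encode("utf-8")}
merged = update_zattrs(fake_store, {"multiscales": {"version": "1.0"}})
print(json.loads(fake_store[".zattrs"]))
```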

## Write Overviews to Local Zarr Store

The final step consists of adding the computed overviews to the Zarr store copy.

**Overview path options:**
- `overview_path="."` - Write overviews as direct children (L1, L2, L3, ...)
- `overview_path="overviews"` - Write overviews in a subfolder (overviews/L1, overviews/L2, ...)

**Write operations:**
1. **Write L1-L7** - Add overview levels as subgroups
2. **Add metadata** - Update group attributes with multiscales metadata

**Key point:** Native data stays at the group level. The multiscales metadata uses `path: "."` for L0 to reference the existing native data without duplication.

**Result with `overview_path="."`:**
```
measurements/reflectance/r10m/
├── b02, b03, b04, b08  # Native data (L0 via path=".")
├── x, y                # Coordinates
├── L1/                 # Overview levels (direct children)
├── L2/
├── L3/
├── L4/
├── L5/
├── L6/
├── L7/
└── .zattrs             # multiscales metadata
```

**Alternative with `overview_path="overviews"`:**
```
measurements/reflectance/r10m/
├── b02, b03, b04, b08  # Native data (L0 via path=".")
├── x, y                # Coordinates
├── overviews/          # Overview levels in subfolder
│   ├── L1/
│   ├── L2/
│   ├── L3/
│   ├── L4/
│   ├── L5/
│   ├── L6/
│   └── L7/
└── .zattrs             # multiscales metadata
```
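
The two layouts differ only in how the path of each level is built. The sketch below shows one way to derive that path so that `overview_path="."` does not leave a literal `.` segment in the key, which object stores may not normalise the way a filesystem does. The helper `level_group_path` is hypothetical.

```python
import posixpath


def level_group_path(group, overview_path, level_id):
    """Path of an overview level relative to the store root.
    overview_path "." places levels directly under `group`."""
    if overview_path in (".", ""):
        return posixpath.join(group, level_id)
    return posixpath.join(group, overview_path, level_id)


group = "measurements/reflectance/r10m"
print(level_group_path(group, ".", "L1"))
print(level_group_path(group, "overviews", "L1"))
```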

In [None]:
print(f"Adding overviews to {variable_group_path} | Variables: {variables} | Path: '{overview_path}'\n")

# Open the root group of the copied store in append mode (created if missing)
zarr.open_group(store=base_store, mode="a", zarr_version=2)

print(f"Writing {len(overviews)} overview levels...")
for level_id, level_dataset in overviews.items():

    level_store = fs.get_mapper(
        f"{bucket}/{store_path}/"
        f"{variable_group_path}/{overview_path}/{level_id}"
    )

    level_dataset.to_zarr(
        store=level_store,
        mode="a",
        zarr_version=2,
    )

# Write coordinates + attrs into r10m group
coords_only = xr.Dataset(coords=dataset.coords, attrs=dataset.attrs)
coords_only.to_zarr(
    store=r10m_store,
    mode="a",
    zarr_version=2,
)

print(f"Generating consolidated metadata for {overview_path}/")
zarr.consolidate_metadata(store=base_store)

print(f"\nSuccessfully added overviews to {variable_group_path}\n")
print(f"Final structure:\n  {variable_group_path}/")
print(f"    ├── {', '.join(variables)}")
print(f"    ├── {x_dim}, {y_dim}")
print(f"    ├── {overview_path}/")
for level_id in overviews.keys():
    print(f"    │   ├── {level_id}/")
print("    └── .zattrs")


<hr>

## 💪 Now it is your turn

With everything we have learnt so far, you are now able to create multiscale overviews for your own datasets.

### Task 1: Experiment With Different Scale Factors

Try modifying the `scales` list to create different pyramid structures. For example:
- **Fewer levels**: `scales = [2, 4, 8]` for a smaller pyramid
- **More aggressive downsampling**: `scales = [4, 16, 64]` for rapid zoom levels
- **Fine-grained levels**: `scales = [2, 3, 4, 6, 8]` for smoother transitions
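
When choosing factors, it helps to know how many pixels are dropped at the edge when a factor does not divide the image size evenly. The sketch below reports the coarse size and the trimmed remainder for a 10980-pixel side, assuming that incomplete blocks are discarded; `trim_report` is a hypothetical helper.

```python
def trim_report(size, factors):
    """For each factor, return (coarse size, pixels dropped at the edge)."""
    return {f: (size // f, size % f) for f in factors}


report = trim_report(10980, [2, 3, 4, 6, 8, 16, 64, 128])
for factor, (coarse, dropped) in report.items():
    print(f"factor {factor:>3}: {coarse:>5} px per side, {dropped:>3} px trimmed")
```

Factors 2, 3, 4 and 6 divide 10980 exactly, whereas a factor of 128 drops 100 pixels along each axis.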

### Task 2: Apply To Your Own Dataset

Use this notebook as a template for your own Earth Observation data:
1. Replace the URL with your own Zarr dataset
2. Let the code discover variables and dimensions automatically
3. Adjust scale factors based on your data resolution
4. Validate and write the results

<br>
<hr>

## Conclusion

This tutorial demonstrated the complete workflow for creating GeoZarr-compliant multiscale overviews:

1. ✅ Load and discover dataset structure automatically
2. ✅ Compute overview levels in memory (no disk I/O)
3. ✅ Attach specification-compliant metadata
4. ✅ Write to Zarr storage

**Key takeaways:**
- The **compute-then-write pattern** separates computation from I/O
- **Dynamic discovery** makes code adaptable to different datasets

<br>
<hr>


## What's next?

In [the next notebook](./66_use_overviews.ipynb), we will focus on the second part of this workflow and visualise the generated overviews with the help of `matplotlib` and `ipyleaflet` for progressive rendering and web mapping.