# Virtual Zarr

At the end of [COG vs Zarr](./cog_vs_zarr.ipynb) we created a "ZOG": a native Zarr store containing a single shard that was also a valid COG. This was neat, but in practice people use virtual Zarr stores (stored as icechunk or kerchunk) when they want to access data as if it were Zarr while leaving the bytes on disk in whatever archival format they already have. It turns out that, thanks to @maxrjones' work on virtual-tiff, we can virtualize COGs!

In [1]:
import shutil
import urllib.request
from pathlib import Path

import obstore
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.registry import ObjectStoreRegistry
from virtual_tiff import VirtualTIFF

Zarr can emit a lot of warnings about Numcodecs codecs not being included in the Zarr version 3 specification yet, so let's suppress those.

In [2]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
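As a quick aside, the `message` argument of `warnings.filterwarnings` is a regular expression matched against the start of the warning text, so the pattern above works as a prefix match. The sketch below checks that with the standard library only. The helper name `emitted` and the two sample messages are made up for illustration and are not the exact text Zarr emits.

```python
import warnings

PATTERN = "Numcodecs codecs are not in the Zarr version 3 specification*"


def emitted(message: str) -> int:
    """Return how many warnings get through the filter for one message."""
    with warnings.catch_warnings(record=True) as caught:
        # Record everything by default, then put the ignore rule in front of it.
        warnings.simplefilter("always")
        warnings.filterwarnings("ignore", message=PATTERN, category=UserWarning)
        warnings.warn(message, UserWarning)
    return len(caught)


# Illustrative messages (assumptions, not copied from Zarr).
suppressed = emitted(
    "Numcodecs codecs are not in the Zarr version 3 specification and may change."
)
kept = emitted("Some unrelated warning")
print(suppressed, kept)
```

Only warnings whose text begins with the pattern are dropped; anything else still surfaces.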

## Downloading a COG

We're going to use the same Sentinel-2 L2A scene as in the previous notebook. Let's make sure we have it downloaded.

In [3]:
OUTDIR = Path('test_data')
OUTDIR.mkdir(exist_ok=True)

In [4]:
COG_HREF = 'https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B04.tif'
COG_FILE = OUTDIR / 'red.tif'

# skip the download if the COG is already on disk from an earlier run
if not COG_FILE.exists():
    with urllib.request.urlopen(urllib.request.Request(COG_HREF)) as response:
        COG_FILE.write_bytes(response.read())
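The cell above reads the whole response into memory before writing it out. For larger files, one possible variation is to stream the body to disk and only move it into place once the copy has finished. This is a minimal sketch using the standard library; the helper name `download_once` and the `.part` suffix are hypothetical, and the demo uses a local `file://` URL so it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_once(href: str, dest: Path) -> bool:
    """Stream href to dest unless dest already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if dest.exists():
        return False
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    partial = dest.with_suffix(dest.suffix + ".part")
    with urllib.request.urlopen(href) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return True


# Demo against a local stand-in file rather than the real COG.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"stand-in for COG bytes")
    target = Path(tmp) / "copy.bin"
    first = download_once(source.as_uri(), target)
    second = download_once(source.as_uri(), target)
    copied = target.read_bytes()
```

Calling it a second time is a no-op, which matches the rerun-safe behaviour of the original cell.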

## Create a Manifest Store

We can use the `VirtualTIFF` parser to create a manifest store.

Notice that we need to specify the IFD (image file directory) that we are interested in. We want the highest-resolution array, which is stored in IFD 0. This is different from rasterio, which opens IFD 0 by default without being told to.

In [5]:
registry = ObjectStoreRegistry({"file://": obstore.store.LocalStore()})
parser = VirtualTIFF(ifd=0)

manifest_store = parser(
    url=f"file://{COG_FILE.absolute()}",
    registry=registry,
)

This manifest store is a proper Zarr store, but instead of referencing chunks of data that sit alongside the Zarr metadata, it refers to them where they already live within the COG. The manifest store can be represented as a virtual `xarray.Dataset`:

In [6]:
vds = manifest_store.to_virtual_dataset()
vds

We can write that off to icechunk or kerchunk and use it later to read the COG as Zarr.
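
To make the idea of "referring to chunks where they already live" concrete, here is a toy illustration of a chunk manifest: a mapping from chunk key to a file path, byte offset, and byte length. It is a simplified sketch, not the data structure that VirtualiZarr or virtual-tiff actually use; the manifest layout, the helper `read_virtual_chunk`, and the fake archive file are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def read_virtual_chunk(manifest: dict, key: str) -> bytes:
    """Fetch one chunk by seeking into the file that already holds it."""
    path, offset, length = manifest[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    # A fake "archival" file: a 6-byte header followed by two 8-byte tiles.
    archive = Path(tmp) / "archive.bin"
    archive.write_bytes(b"HEADER" + b"tile-0-0" + b"tile-0-1")

    # The manifest stores no pixel data, only where each chunk lives.
    manifest = {
        "0.0": (str(archive), 6, 8),
        "0.1": (str(archive), 14, 8),
    }
    chunk = read_virtual_chunk(manifest, "0.1")

print(chunk)
```

The archive file is never rewritten; only the small manifest needs to be stored and shared.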

## Open as Zarr

We can also consume that same manifest store directly with regular `xarray.open_zarr`.

In [7]:
import xarray as xr

ds = xr.open_zarr(
    manifest_store,
    consolidated=False,
    zarr_format=3,
    chunks={},
)
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 229.95 MiB | 2.00 MiB |
| Shape | (10980, 10980) | (1024, 1024) |
| Dask graph | 121 chunks in 2 graph layers | 121 chunks in 2 graph layers |
| Data type | uint16 numpy.ndarray | uint16 numpy.ndarray |
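
The figures in the repr above follow directly from the array shape, the tile size, and the 2-byte `uint16` dtype. The short calculation below reproduces them; the variable names are just for illustration.

```python
import math

shape = (10980, 10980)   # full-resolution array
chunks = (1024, 1024)    # COG tile size, reused as the Zarr chunk shape
itemsize = 2             # bytes per uint16 value

# Tiles needed along each axis; the last tile on each axis is partial.
chunks_per_axis = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(chunks_per_axis)

# Valid pixels along each axis in the final, partial tile.
edge_pixels = tuple(s - (n - 1) * c for s, n, c in zip(shape, chunks_per_axis, chunks))

array_mib = math.prod(shape) * itemsize / 2**20
chunk_mib = math.prod(chunks) * itemsize / 2**20

print(chunks_per_axis, n_chunks, edge_pixels)
print(round(array_mib, 2), chunk_mib)
```

So the 121 chunks are an 11 by 11 grid, and the last row and column of tiles cover only 740 valid pixels each.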


## Check values

To check that the encoding is all hooked up correctly, we can compare the mean of the first tile in this new `ds` with the mean of the first tile read from the COG using `rasterio`.

In [8]:
%%time
ds['0'][:1024,:1024].mean().compute().values

CPU times: user 29.5 ms, sys: 7.94 ms, total: 37.4 ms
Wall time: 35.6 ms


array(1233.40614319)

Now let's open the downloaded COG with rasterio and do the same computation:

In [9]:
import rioxarray

ds_rio = rioxarray.open_rasterio(COG_FILE, chunks={})

In [10]:
%%time
ds_rio[0, :1024, :1024].mean().compute().values

CPU times: user 15.9 ms, sys: 5.79 ms, total: 21.7 ms
Wall time: 20.6 ms


array(1233.40614319)

## Serialize Manifest Store to Icechunk

So far we have an in-memory manifest store. That is fine for as long as we are in this Python session, but to share it with others we need to write it out, here to a local icechunk store. That icechunk store will contain the attrs and references to the individual chunks of data within the COG where they live: in this case a file sitting next to the icechunk store, but more commonly an object on S3.

In [11]:
import icechunk

ICECHUNK = OUTDIR / "icechunk"

# just in case we've run this before
if ICECHUNK.exists():
    shutil.rmtree(ICECHUNK)

icechunk_storage = icechunk.local_filesystem_storage(ICECHUNK)
config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix=f"file://{OUTDIR.absolute()}/",
        store=icechunk.local_filesystem_store(OUTDIR.absolute()),
    ),
)
virtual_credentials = icechunk.containers_credentials({f"file://{OUTDIR.absolute()}/": None})
repo = icechunk.Repository.create(
    icechunk_storage,
    config,
    authorize_virtual_chunk_access=virtual_credentials,
)

session = repo.writable_session("main")
vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")

  2025-10-03T20:06:14.883414Z  WARN icechunk::storage::object_store: The LocalFileSystem storage is not safe for concurrent commits. If more than one thread/process will attempt to commit at the same time, prefer using object stores.
    at icechunk/src/storage/object_store.rs:80



'PVXVAJQXSSBKACG3PDZG'

We can read that icechunk store in as a real (but still lazy) xarray dataset:

In [12]:
ds = xr.open_zarr(session.store)
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 229.95 MiB | 2.00 MiB |
| Shape | (10980, 10980) | (1024, 1024) |
| Dask graph | 121 chunks in 2 graph layers | 121 chunks in 2 graph layers |
| Data type | uint16 numpy.ndarray | uint16 numpy.ndarray |


Let's just double check our first chunk:

In [13]:
%%time
ds['0'][:1024,:1024].mean().compute().values

CPU times: user 27.9 ms, sys: 6.17 ms, total: 34.1 ms
Wall time: 31.8 ms


array(1233.40614319)