# Virtual Zarr

At the end of [COG vs Zarr](./cog_vs_zarr.ipynb) we created a "ZOG": a native Zarr store containing a single shard that was also a valid COG. This was neat, but in practice people use virtual Zarr stores (serialized as icechunk or kerchunk) when they want to access data as if it were Zarr while leaving the bytes on disk in whatever archival format they already have. It turns out that, thanks to @maxrjones' work on virtual-tiff, we can virtualize COGs!
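
To make the "leave the bytes where they are" idea concrete, here is a minimal standard-library sketch. It is not how VirtualiZarr, icechunk, or kerchunk are implemented; the file layout, the `manifest` dictionary, and the `read_chunk` helper are all hypothetical stand-ins. The point is only that a virtual store boils down to chunk keys mapped to a path, a byte offset, and a length.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for an archival file: a 16-byte header followed by
# two 8-byte "tiles". A real COG has a TIFF header, IFDs, and compressed tiles.
header = b"H" * 16
tiles = [b"A" * 8, b"B" * 8]
archive = Path(tempfile.mkdtemp()) / "archive.bin"
archive.write_bytes(header + b"".join(tiles))

# The "virtual" part: a manifest mapping chunk keys to byte ranges in the file.
# The key names and dict layout are illustrative, not a real library's classes.
manifest = {
    "0.0": {"path": str(archive), "offset": 16, "length": 8},
    "0.1": {"path": str(archive), "offset": 24, "length": 8},
}


def read_chunk(key: str) -> bytes:
    """Fetch one chunk's bytes by seeking into the original file."""
    ref = manifest[key]
    with open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        return f.read(ref["length"])


chunk = read_chunk("0.1")
```

Nothing is copied out of `archive.bin`; the manifest is the only new artifact, which is why virtual stores stay small even when the underlying data is large.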

In [1]:
import urllib.request
from pathlib import Path

import obstore
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.registry import ObjectStoreRegistry
from virtual_tiff import VirtualTIFF

Zarr can emit a lot of warnings about Numcodecs codecs not being included in the Zarr version 3 specification yet, so let's suppress those.

In [2]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
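
One detail worth knowing about the filter above: the `message` argument of `warnings.filterwarnings` is a regular expression matched against the start of the warning text, not a glob. The trailing `*` therefore means "zero or more `n`", which still matches because only a prefix match is required. The sketch below demonstrates this with made-up warning texts (they are illustrative, not the exact strings Zarr emits).

```python
import warnings

PATTERN = "Numcodecs codecs are not in the Zarr version 3 specification*"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    # Added after simplefilter, so this "ignore" rule is checked first.
    warnings.filterwarnings("ignore", message=PATTERN, category=UserWarning)

    # Hypothetical warning texts, only for demonstration.
    warnings.warn(
        "Numcodecs codecs are not in the Zarr version 3 specification and may change.",
        UserWarning,
    )
    warnings.warn("Some unrelated warning", UserWarning)

# Only the warning that did not match the pattern was recorded.
remaining = [str(w.message) for w in caught]
print(remaining)
```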

## Downloading a COG

We're going to use the same Sentinel-2 L2A scene as in the previous notebook. Let's make sure we have it downloaded.

In [3]:
OUTDIR = Path('test_data')
OUTDIR.mkdir(exist_ok=True)

In [4]:
COG_HREF = 'https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B04.tif'
COG_FILE = OUTDIR / 'red.tif'

# Skip the download if the COG is already on disk (e.g. when rerunning this cell)
if not COG_FILE.exists():
    with urllib.request.urlopen(urllib.request.Request(COG_HREF)) as response:
        COG_FILE.write_bytes(response.read())

Open that COG as a virtual Zarr store. Notice that we need to specify the IFD that we are interested in. We want the highest-resolution array, which is stored in IFD 0. This differs from rasterio, which defaults to IFD 0 without being told.

In [5]:
registry = ObjectStoreRegistry({"file://": obstore.store.LocalStore()})
parser = VirtualTIFF(ifd=0)

vds = open_virtual_dataset(
    url=f"file://{COG_FILE.absolute()}",
    registry=registry,
    parser=parser
)
vds

So that is a virtual xarray dataset. We can write it out to icechunk or kerchunk and use it later to read the COG as Zarr. What if we just want to use the virtual-tiff parser to go straight to a real xarray dataset?

## Open as zarr

We can use the `VirtualTIFF` parser to create a manifest store and consume that directly with regular `xarray.open_zarr`.

In [6]:
registry = ObjectStoreRegistry({"file://": obstore.store.LocalStore()})
parser = VirtualTIFF(ifd=0)

manifest_store = parser(
    url=f"file://{COG_FILE.absolute()}",
    registry=registry,
)

In [7]:
import xarray as xr

ds = xr.open_zarr(
    manifest_store,
    consolidated=False,
    zarr_format=3,
    chunks={},
)

In [8]:
ds['0']

| | Array | Chunk |
|---|---|---|
| Bytes | 229.95 MiB | 2.00 MiB |
| Shape | (10980, 10980) | (1024, 1024) |
| Dask graph | 121 chunks in 2 graph layers | 121 chunks in 2 graph layers |
| Data type | uint16 numpy.ndarray | uint16 numpy.ndarray |
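
The numbers in that summary follow directly from the array shape, the 1024 x 1024 tile size, and the 2-byte `uint16` dtype. As a quick sanity check, the arithmetic can be reproduced with the standard library (`chunk_summary` is a hypothetical helper written for this note, not part of dask or xarray):

```python
import math


def chunk_summary(shape, chunks, itemsize):
    """Return (number of chunks, array size in MiB, chunk size in MiB)."""
    # Edge chunks are partial, so round up along each dimension.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    array_mib = math.prod(shape) * itemsize / 2**20
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    return n_chunks, array_mib, chunk_mib


n_chunks, array_mib, chunk_mib = chunk_summary(
    shape=(10980, 10980), chunks=(1024, 1024), itemsize=2
)
print(n_chunks, f"{array_mib:.2f} MiB", f"{chunk_mib:.2f} MiB")
# 121 229.95 MiB 2.00 MiB
```

10980 / 1024 is about 10.7, so there are 11 tiles along each axis and 11 x 11 = 121 chunks in total, with the last row and column only partially filled.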


## Check values

To check that the encoding is all hooked up we can compare the mean of the first tile in this new `ds` with the mean of the first tile when you open the COG using `rasterio`.

In [9]:
%%time
ds['0'][:1024,:1024].mean().compute().values

CPU times: user 33.1 ms, sys: 7.87 ms, total: 40.9 ms
Wall time: 40.2 ms


array(1233.40614319)

Now let's open the downloaded COG with rasterio and do the same computation:

In [10]:
import rioxarray

ds_rio = rioxarray.open_rasterio(COG_FILE, chunks={})

In [11]:
%%time
ds_rio[0, :1024, :1024].mean().compute().values

CPU times: user 20.3 ms, sys: 3.95 ms, total: 24.2 ms
Wall time: 23.6 ms


array(1233.40614319)

## Straight from S3

We can do both of these experiments just as well while referencing the COG straight from S3. In this case the virtual dataset will point to the chunks on S3.
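
The `s3://` URL and the region used for the S3 store correspond to the HTTPS href used for the download earlier. As a rough sketch, assuming the href follows the virtual-hosted `<bucket>.s3.<region>.amazonaws.com` pattern, the helper below derives one from the other using only the standard library. `https_to_s3` is a hypothetical name, not part of obstore or VirtualiZarr.

```python
from urllib.parse import urlparse


def https_to_s3(href: str) -> tuple[str, str]:
    """Return (s3_url, region) for a virtual-hosted-style S3 HTTPS URL."""
    parsed = urlparse(href)
    # Assumes the host looks like "<bucket>.s3.<region>.amazonaws.com".
    bucket, _, rest = parsed.netloc.partition(".s3.")
    region = rest.removesuffix(".amazonaws.com")
    return f"s3://{bucket}{parsed.path}", region


href = (
    "https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/"
    "sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B04.tif"
)
s3_url, region = https_to_s3(href)
print(s3_url)
print(region)
```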

In [12]:
url = "s3://e84-earth-search-sentinel-data/sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B04.tif"

s3_store = obstore.store.from_url("s3://e84-earth-search-sentinel-data/", region="us-west-2", skip_signature=True)
registry = ObjectStoreRegistry({"s3://e84-earth-search-sentinel-data/": s3_store})
parser = VirtualTIFF(ifd=0)

open_virtual_dataset(
    url=url, 
    registry=registry,
    parser=parser,
)

## Construct a virtual zarr using multiple bands

The real power lies in concatenating multiple virtual datasets and serializing the result, which makes it possible to distribute virtual Zarr hierarchies. Let's take the red, green, and blue bands and concatenate them along a new dimension (called "band"). We will specify that any attrs that differ between the files (so all the STATISTICS* ones) can be dropped. There are many ways to specify this concatenation to get exactly the behavior you want.
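
To illustrate what dropping conflicting attrs means, here is a plain-Python sketch of the rule: an attribute survives only if every file that defines it agrees on its value. This is an illustration of the semantics, not xarray's implementation, and the attribute names and values are made up.

```python
def drop_conflicts(attr_dicts):
    """Merge attribute dicts, dropping any key whose values disagree."""
    merged, dropped = {}, set()
    for attrs in attr_dicts:
        for key, value in attrs.items():
            if key in dropped:
                continue  # already known to conflict
            if key in merged and merged[key] != value:
                del merged[key]
                dropped.add(key)
            else:
                merged[key] = value
    return merged


# Hypothetical per-band attrs; the values are invented for illustration.
red = {"AREA_OR_POINT": "Area", "STATISTICS_MEAN": 1.0}
green = {"AREA_OR_POINT": "Area", "STATISTICS_MEAN": 2.0}
blue = {"AREA_OR_POINT": "Area", "STATISTICS_MEAN": 3.0}

combined = drop_conflicts([red, green, blue])
print(combined)
# {'AREA_OR_POINT': 'Area'}
```

Per-band statistics necessarily differ between bands, so they disappear from the combined dataset, while attributes shared by all three files are kept.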

In [13]:
urls = [
    "s3://e84-earth-search-sentinel-data/sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B04.tif",
    "s3://e84-earth-search-sentinel-data/sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B03.tif",
    "s3://e84-earth-search-sentinel-data/sentinel-2-c1-l2a/10/T/FR/2023/12/S2B_T10TFR_20231223T190950_L2A/B02.tif",
]

vds_rgb = open_virtual_mfdataset(
    urls,
    registry=registry,
    parser=parser,
    concat_dim="band",
    combine="nested",
    combine_attrs="drop_conflicts"
)

In [17]:
vds_rgb

What we have is a virtual xarray dataset. To store it we will write it out to a local icechunk store. That icechunk store will contain the attrs and references to the individual chunks of data within the COGs where they live on S3.

In [16]:
import icechunk

icechunk_store = icechunk.local_filesystem_storage(OUTDIR / "icechunk")
config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix="s3://e84-earth-search-sentinel-data/",
        store=icechunk.s3_store(region="us-west-2", anonymous=True),
    ),
)
virtual_credentials = icechunk.containers_credentials(
    {"s3://e84-earth-search-sentinel-data/": icechunk.s3_anonymous_credentials()}
)
repo = icechunk.Repository.create(
    icechunk_store,
    config,
    authorize_virtual_chunk_access=virtual_credentials,
)

session = repo.writable_session("main")
vds_rgb.vz.to_icechunk(session.store)
session.commit("Create virtual store")

  2025-10-01T13:31:46.202151Z  WARN icechunk::storage::object_store: The LocalFileSystem storage is not safe for concurrent commits. If more than one thread/process will attempt to commit at the same time, prefer using object stores.
    at icechunk/src/storage/object_store.rs:80



'CBFVCHQ62HWE470Z0WGG'

We can read that icechunk store in as a real (but still lazy) xarray dataset.

In [20]:
ds = xr.open_zarr(session.store)
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 689.85 MiB | 2.00 MiB |
| Shape | (3, 10980, 10980) | (1, 1024, 1024) |
| Dask graph | 363 chunks in 2 graph layers | 363 chunks in 2 graph layers |
| Data type | uint16 numpy.ndarray | uint16 numpy.ndarray |


Let's just double check our first chunk:

In [32]:
%%time
ds['0'][0,:1024,:1024].mean().compute().values

CPU times: user 40.1 ms, sys: 5.42 ms, total: 45.6 ms
Wall time: 1.04 s


array(1233.40614319)

Don't compare that timing to the ones above: those were for reading from local disk, whereas this read is coming straight from S3.