An ongoing experiment to compare array virtualization in GDAL and in xarray/VirtualiZarr. This repo exists purely to describe two ways of connecting to a virtualized NetCDF store, using various software tools.
In the directory `remote/` there is a nested Parquet representation of a modelled ocean variable, `ocean_temp`.
Note that to actually connect to this we need the GitHub raw link, which isn’t a valid link in a browser but works for programmatic access; it looks like this:
https://raw.githubusercontent.com/mdsumner/virtualized/refs/heads/main/remote/ocean_temp_2023.parq
The array `ocean_temp` has dimensions `Time`, `st_ocean`, `yt_ocean`, `xt_ocean` with shape (5599, 51, 1500, 3600). It is composed of 180 monthly NetCDF files, each with a daily time step, that exist on the NCI high performance computing system. The model data is a product of the Bluelink Reanalysis by CSIRO. (We are using BRAN2023, but please note this is purely an infrastructure exercise for now, a way to prototype future workflows.)
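For a sense of scale, the full logical array at 8 bytes per value works out to roughly 12 TB, matching the size xarray reports below:

```python
# Back-of-envelope size of the full logical array (float64, all four dimensions).
n_values = 5599 * 51 * 1500 * 3600
print(f"{n_values * 8 / 1e12:.1f} TB")  # ~12.3 TB
```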
The store was built by using VirtualiZarr in-situ on NCI, pushing the resulting Parquet kerchunk store up to GitHub here, and then renaming the path references from the in-situ NCI paths to publicly available ones served via the NCI THREDDS server.
We are using kerchunk Parquet; ideally we would use Icechunk, but that won’t work in GDAL yet, and its tooling is more challenging in the environments we currently need to use.
There is no guarantee any of this will continue to work; if we commit to longer-term support it will be on object storage hosting the indexes (Parquet or Icechunk or materialized, or other).
The script to do the raw processing in-situ is in `py/` here. For some reason I am not seeing any benefit from parallelizing with concurrent.futures or with dask; it actually slows things down.
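For orientation, here is a minimal sketch of what that build step looks like, assuming the VirtualiZarr `open_virtual_dataset` / `.virtualize.to_kerchunk` API and hypothetical file paths; the actual script is in `py/`, and the subsequent path-renaming step is only noted in a comment:

```python
# Sketch only: build a virtual (kerchunk) Parquet store from a set of NetCDF files.
# Paths here are hypothetical; the real files live on NCI and are listed by the py/ script.
import glob
import xarray as xr
from virtualizarr import open_virtual_dataset

files = sorted(glob.glob("TMP/ocean_temp_*.nc"))

# Index each file serially (parallelizing this gave no speed-up in practice).
virtual = [open_virtual_dataset(f) for f in files]

# Concatenate the virtual references along Time and write them out as kerchunk Parquet.
combined = xr.concat(virtual, dim="Time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("ocean_temp_2023.parq", format="parquet")

# After this, the chunk references still point at the in-situ NCI paths; a separate
# step rewrites them to the publicly available NCI THREDDS URLs (not shown here).
```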
Running on one CPU for one variable looks like this:
Exit Status: 0
Service Units: 13.70
NCPUs Requested: 1 NCPUs Used: 1
CPU Time Used: 00:47:48
Memory Requested: 16.0GB Memory Used: 16.0GB
Walltime requested: 02:35:00 Walltime Used: 01:42:45
JobFS requested: 100.0MB JobFS used: 0B
Here we connect to the virtual Zarr store, which represents a very large logical array. We use different ways to read the same values from the store as a basic illustration.
First, we read it the canonical way with xarray:
import xarray
ds = xarray.open_dataset("https://raw.githubusercontent.com/mdsumner/virtualized/refs/heads/main/remote/ocean_temp_2023.parq", engine = "kerchunk", chunks = {})
ds
<xarray.Dataset> Size: 12TB
Dimensions: (Time: 5599, st_ocean: 51, yt_ocean: 1500, xt_ocean: 3600)
Coordinates:
* Time (Time) datetime64[ns] 45kB 2024-12-01T12:00:00 ... 2020-12-31T1...
* st_ocean (st_ocean) float64 408B 2.5 7.5 12.5 ... 3.603e+03 4.509e+03
* xt_ocean (xt_ocean) float64 29kB 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
* yt_ocean (yt_ocean) float64 12kB -74.95 -74.85 -74.75 ... 74.75 74.85 74.95
Data variables:
temp (Time, st_ocean, yt_ocean, xt_ocean) float64 12TB dask.array<chunksize=(1, 1, 300, 300), meta=np.ndarray>
Attributes:
filename: TMP/ocean_temp_2024_12_01.nc.0000
NumFilesInSet: 20
title: BRAN2023
grid_type: regular
history: Mon May 19 14:03:50 2025: ncrcat -4 --dfl_lvl 1 --cnk...
NCO: netCDF Operators version 5.0.5 (Homepage = http://nco...
catalogue_doi_url: https://dx.doi.org/10.25914/2wxj-vt48
acknowledgement: BRAN output is made freely available by CSIRO Bluelin...
Run a lazy subset selection and pull the values:
ds.temp.isel(Time = 0, st_ocean = 10, yt_ocean = 100)
<xarray.DataArray 'temp' (xt_ocean: 3600)> Size: 29kB
dask.array<getitem, shape=(3600,), dtype=float64, chunksize=(300,), chunktype=numpy.ndarray>
Coordinates:
Time datetime64[ns] 8B 2024-12-01T12:00:00
st_ocean float64 8B 65.67
* xt_ocean (xt_ocean) float64 29kB 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
yt_ocean float64 8B -64.95
Attributes:
long_name: Potential temperature
units: degrees C
valid_range: [-32767, 32767]
packing: 4
cell_methods: time: mean Time: mean
time_avg_info: average_T1,average_T2,average_DT
standard_name: sea_water_potential_temperature
v = ds.temp.isel(Time = 0, st_ocean = 10, yt_ocean = 100).values
v
#array([-1.54072672, -1.5329445 , -1.52516228, ..., -1.5329445 ,
# -1.54072672, -1.54850894], shape=(3600,))
Now use GDAL in multidim mode. We take an indexing approach with raw numbers, as with `isel()` above, using the multidim numpy-like syntax:
from osgeo import gdal
gdal.UseExceptions()
ds = gdal.OpenEx("ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/virtualized/refs/heads/main/remote/ocean_temp_2023.parq\"", gdal.OF_MULTIDIM_RASTER)
rg = ds.GetRootGroup()
rg.GetMDArrayNames()
#['Time', 'st_ocean', 'temp', 'xt_ocean', 'yt_ocean']
temp = rg.OpenMDArrayFromFullname("/temp")
temp.shape
# (5599, 51, 1500, 3600)
view = temp.GetView("[0,10,100,:]")
a = view.ReadAsArray()
a * temp.GetScale() + temp.GetOffset()
#array([-1.54072672, -1.5329445 , -1.52516228, ..., -1.5329445 ,
# -1.54072672, -1.54850894], shape=(3600,))
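We can pull the coordinate arrays from the root group in the same way, to map those raw indexes back to coordinate values (a small sketch using only the calls shown above):

```python
# Read the 1D coordinate arrays so we can see what st_ocean=10 and yt_ocean=100 refer to.
xt = rg.OpenMDArrayFromFullname("/xt_ocean").ReadAsArray()
st = rg.OpenMDArrayFromFullname("/st_ocean").ReadAsArray()
yt = rg.OpenMDArrayFromFullname("/yt_ocean").ReadAsArray()

st[10], yt[100], xt[:3]
# expected to correspond to the xarray coordinates above (st_ocean ~65.67, yt_ocean ~-64.95)
```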
Finally, a more traditional 2D view of this, though we burrow in just for argument’s sake.
(Deeper work on the multidim approach for GDAL is ongoing in gdalraster.)
library(terra)
#> terra 1.8.64
dsn <- "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/virtualized/refs/heads/main/remote/ocean_temp_2023.parq\":/temp:{10}:{0}"
(r <- rast(dsn))
#> class : SpatRaster
#> size : 1500, 3600, 1 (nrow, ncol, nlyr)
#> resolution : 0.1, 0.1 (x, y)
#> extent : -9.507305e-10, 360, -75, 75 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source : temp:{10}:{0}
#> name : temp:{10}:{0}
plot(r)
(I’m not yet getting exactly the same values out, but the plot shows the values are valid and mapped correctly.)
We can leverage the GDAL API to cast these slices to classic raster mode, among many other options.
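As one sketch of that, assuming GDAL’s multidim `AsClassicDataset()` works as expected against this driver, a 2D slice can be exposed as a classic dataset and then handled with the familiar raster API:

```python
# Sketch: take a 2D (yt_ocean, xt_ocean) slice and expose it as a classic GDAL dataset.
from osgeo import gdal
gdal.UseExceptions()

ds = gdal.OpenEx(
    "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/virtualized/refs/heads/main/remote/ocean_temp_2023.parq\"",
    gdal.OF_MULTIDIM_RASTER,
)
temp = ds.GetRootGroup().OpenMDArrayFromFullname("/temp")

# Slice Time=0, st_ocean=10, keeping the two horizontal dimensions.
slice2d = temp.GetView("[0,10,:,:]")

# AsClassicDataset(iXDim, iYDim): dimension 1 of the view is xt_ocean, dimension 0 is yt_ocean.
classic = slice2d.AsClassicDataset(1, 0)
print(classic.RasterXSize, classic.RasterYSize)  # 3600 1500

# From here normal raster workflows apply, e.g. writing a GeoTIFF copy.
gdal.Translate("temp_slice.tif", classic)
```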
Let’s try another variable (this won’t go on forever; I won’t host them all here, this is just a test).
library(terra)
dsn <- "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/virtualized/refs/heads/main/remote/ocean_salt_2023.parq\":/salt:{10}:{0}"
(r <- rast(dsn))
#> class : SpatRaster
#> size : 1500, 3600, 1 (nrow, ncol, nlyr)
#> resolution : 0.1, 0.1 (x, y)
#> extent : -9.507305e-10, 360, -75, 75 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source : salt:{10}:{0}
#> name : salt:{10}:{0}
plot(r)