A modified ASV suite of benchmark tests for gathering IO performance metrics in Pangeo environments. Sets of tests exist for cloud, HPC, and workstation-like environments across different mixtures of storage backends and APIs. We're mainly concerned with benchmarking Xarray/Dask performance in both single-processor and multiprocessor/multithreaded/clustered environments.
airspeed velocity (ASV) is the basis of these benchmarks, although the workflow has been modified to accommodate gathering IO statistics.
Basics and running the benchmarks.
ASV benchmarks are typically run through the ASV command-line tool, but in this implementation the runs are driven by a Python script:
usage: run_benchmarks.py [-h] -b BENCHMARK [BENCHMARK ...] [-n N_RUNS [N_RUNS ...]]
BENCHMARK is a regex matching the benchmark tests you'd like to run. For example, to run all the GCP Kubernetes read tests 10 times, you'd run:
python run_benchmarks.py -b gcp_kubernetes_read* -n 10
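Because BENCHMARK is a regex rather than a shell glob, the trailing `*` in the example applies to the preceding `d` (zero or more times) instead of meaning "anything". The sketch below illustrates this with Python's `re` module. The benchmark names and the `select` helper are hypothetical, and it assumes the pattern is applied as a search anywhere in the name; the actual script may match differently.

```python
import re

# Hypothetical benchmark names, for illustration only.
names = [
    "gcp_kubernetes_read_zarr.time_mean",
    "gcp_kubernetes_read_netcdf.time_mean",
    "gcp_kubernetes_write_zarr.time_write",
    "hpc_posix_read_zarr.time_mean",
]


def select(pattern, candidates):
    """Return the names in which the regex matches anywhere (re.search)."""
    rx = re.compile(pattern)
    return [name for name in candidates if rx.search(name)]


# "read*" means "rea" followed by zero or more "d" characters.
selected = select("gcp_kubernetes_read*", names)
print(selected)
```

Under these assumptions the pattern still selects the GCP Kubernetes read tests, since a search match on the prefix is enough. Quoting the pattern on the command line (for example `-b 'gcp_kubernetes_read*'`) keeps the shell from expanding the `*` against files in the current directory.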
This generates all the benchmark runs, scrapes the resulting JSON output, and appends the values to a CSV file. Data is collected from the most recent ASV JSON results file for the machine the tests are being run on. If your results directory contains results from a different machine, the script does not currently collect data from them.
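The collection step can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: the helper names `latest_json` and `append_results`, the directory layout, the CSV columns, and the flat `{"results": {name: value}}` JSON shape are all assumptions made for illustration. Real ASV result files are more detailed.

```python
import csv
import json
import os
import tempfile
from pathlib import Path


def latest_json(machine_dir):
    """Return the most recently modified results file in machine_dir, or None."""
    # Assumption: machine.json holds machine metadata, not results, so skip it.
    files = [p for p in Path(machine_dir).glob("*.json") if p.name != "machine.json"]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def append_results(json_path, csv_path):
    """Append one CSV row per benchmark in json_path; return the row count."""
    data = json.loads(Path(json_path).read_text())
    # Assumption: a simplified layout of {"results": {benchmark_name: value}}.
    rows = [
        (Path(json_path).name, name, value)
        for name, value in sorted(data["results"].items())
    ]
    new_file = not Path(csv_path).exists()
    with open(csv_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["source", "benchmark", "value"])  # header only once
        writer.writerows(rows)
    return len(rows)


# Demonstration with two fake result files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    machine_dir = Path(tmp) / "results" / "my-machine"
    machine_dir.mkdir(parents=True)
    old = machine_dir / "run_old.json"
    new = machine_dir / "run_new.json"
    old.write_text(json.dumps({"results": {"read.time_mean": 2.0}}))
    new.write_text(json.dumps({"results": {"read.time_mean": 1.5, "read.time_var": 1.7}}))
    os.utime(old, (1_000, 1_000))  # force distinct modification times
    os.utime(new, (2_000, 2_000))

    csv_path = Path(tmp) / "metrics.csv"
    picked = latest_json(machine_dir)
    n_rows = append_results(picked, csv_path)
    csv_lines = csv_path.read_text().splitlines()
```

Only the newest file for the one machine directory is read, which mirrors the limitation described above: results belonging to other machines are left untouched.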
Suite of Tests
The following performance tests are conducted:
- mean / variance
- convolution (smoothing) over spatial dimensions
- multidimensional Fourier transforms
- NetCDF -> POSIX -> local storage
- NetCDF -> POSIX -> some sort of disk presentation layer (e.g. FUSE) -> cloud bucket
- Zarr -> POSIX -> local storage
- Zarr -> cloud bucket
- h5netcdf -> POSIX -> local storage
- h5netcdf -> HSDS -> cloud bucket
- TileDB -> cloud (currently only S3)
- TileDB -> POSIX (local, Lustre, etc.)
- TileDB -> HDFS
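For readers new to ASV, the sketch below shows the general shape of a benchmark class: ASV calls `setup` before timing, times methods whose names start with `time_`, and calls `teardown` afterwards. The class is hypothetical and uses only the standard library, with a flat binary file standing in for a NetCDF or Zarr store on local POSIX storage. The real suite works with Xarray/Dask objects and is not reproduced here.

```python
import os
import statistics
import struct
import tempfile


class LocalReadMeanVariance:
    """ASV-style benchmark: read doubles from local storage, then reduce."""

    def setup(self):
        # Runs before timing: write a small file of float64 values.
        self.n = 10_000
        fd, self.path = tempfile.mkstemp(suffix=".bin")
        values = (float(i % 100) for i in range(self.n))
        with os.fdopen(fd, "wb") as fh:
            fh.write(struct.pack(f"<{self.n}d", *values))

    def teardown(self):
        # Runs after timing: remove the temporary file.
        os.remove(self.path)

    def _read(self):
        # The read is part of the timed work, so IO cost is included.
        with open(self.path, "rb") as fh:
            raw = fh.read()
        return struct.unpack(f"<{self.n}d", raw)

    def time_read_mean(self):
        return statistics.fmean(self._read())

    def time_read_variance(self):
        return statistics.pvariance(self._read())


# Drive the class by hand, roughly as a benchmark runner would.
bench = LocalReadMeanVariance()
bench.setup()
try:
    mean_value = bench.time_read_mean()
    var_value = bench.time_read_variance()
finally:
    bench.teardown()
```

Keeping the read inside each timed method, rather than in `setup`, is a deliberate choice in this sketch so that the measured time covers both the IO path and the reduction.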