# Pangeo Storage Benchmarks

This document provides more in-depth documentation and discussion of the Pangeo Storage Benchmark tests. It is intended to support discussion around the development of tests and to serve as precursor material for any future papers or publications that may result.

The benchmarks are meant to systematically determine the performance characteristics of storage backend/API combinations for processing geoscience data on clustered cloud computing resources, based on Xarray/Dask within the Python scientific computing framework. The focus is primarily on read performance, although write tests will be written as well.

## Use Cases
Here is an overview of the use cases covered or in development. 

|API/User Interface  | Environment       | Format       | Storage Layer     |
|:------------------ |:----------------- |:-------------|:------------------|
| Dask/Xarray        | GCP               | Zarr         | GCS via GCSFS     |
|                    |                   |              | FUSE              |
|                    |                   | NetCDF4      | FUSE              |
|                    |                   | H5NetCDF     | HSDS/GCS          |
|                    | AWS               | Zarr         | S3 via            |
|                    |                   | NetCDF4      |                   |
|                    |                   | H5NetCDF     | HSDS/S3           |
|                    | Azure             | TBD          |                   |
|                    | GCP               | TileDB       | TBD               |
| Numpy              | GCP               | Zarr         | GCS via GCSFS     |
|                    |                   |              | FUSE              |
|                    |                   | NetCDF4      | FUSE              |
|                    |                   | H5NetCDF     | HSDS/GCS          |
|                    | AWS               | Zarr         | S3 via            |
|                    |                   | NetCDF4      |                   |
|                    |                   | H5NetCDF     | HSDS/S3           |
|                    | Azure             |              |                   | 

## Benchmarks
---
[ASV](https://asv.readthedocs.io/en/latest/index.html) is the framework used to run the
benchmarks and collect results. For this repo, results are kept in the `benchmarks`
directory in JSON format. The benchmarks are roughly separated into two categories: distributed tests, which are run clustered via Dask across multiple processors and/or machines, and local tests.
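As a rough illustration of how ASV discovers and times benchmarks, here is a minimal sketch. The class name `SyntheticReadSuite`, the parameter values, and the array sizes are hypothetical and are not taken from this repo. The only ASV conventions relied on are that `setup` runs before timing, that methods prefixed with `time_` are timed, and that `params`/`param_names` define a parameter sweep.

```python
import numpy as np


class SyntheticReadSuite:
    """ASV-style benchmark class (hypothetical example, not from this repo)."""

    # ASV runs each benchmark once per value listed in ``params``.
    params = [32, 64]
    param_names = ["nz"]

    def setup(self, nz):
        # Runs before the timed method; its cost is not included in the timing.
        rng = np.random.default_rng(0)
        self.data = rng.random((nz, 100, 100))

    def time_mean(self, nz):
        # ASV times the body of every method whose name starts with ``time_``.
        self.data.mean()
```

The class can also be exercised by hand, for example `s = SyntheticReadSuite(); s.setup(32); s.time_mean(32)`, which is a convenient way to check a benchmark before running it through ASV.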

## Environments
---
#### Cloud Platforms (GCP/AWS/Azure)
These configurations comprise Kubernetes/container-based Dask clusters in which a configurable number of workers have read/write access to massively parallel object storage.

#### Server/HPC
Typical server or HPC environment found at research institutions. 

#### Single Workstation
A subset of tests run on laptop/workstation class hardware.

## Datasets
---
#### Dask/Xarray Synthetic
A randomly generated three-dimensional dataset with shape (1350, 1000, 1000)
is produced for each run, which amounts to roughly 10 GB of data. Tests are run across
a variety of parameters including chunk configuration, number of Dask workers, and...

```
import dask.array as da

chunks = (10, 1000, 1000)
size = (1350, 1000, 1000)
# Normally distributed values with mean 10 and standard deviation 0.1
dask_arr = da.random.normal(10, 0.1, size=size, chunks=chunks)
```
The `chunks` parameter varies the chunk configuration across the x-axis. The following executions are individually timed. In the current configuration, each test is averaged over five runs with three runs per test.

```
# Mean compute test
dask_arr.mean().compute()
# Write test (zarr_ds is assumed to be an existing Zarr array of matching shape)
dask_arr.store(zarr_ds, lock=False)
```
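The sizes quoted for the synthetic dataset can be checked with a little arithmetic. The sketch below assumes 8-byte `float64` elements, which is an assumption about the dtype rather than something stated in the benchmark configuration.

```python
import math

shape = (1350, 1000, 1000)
chunks = (10, 1000, 1000)
itemsize = 8  # bytes per element, assuming float64

# Total size of the array and of a single chunk, in bytes.
total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize

# Number of chunks: ceil(shape / chunk) along each axis, multiplied together.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(f"total: {total_bytes / 1e9:.1f} GB")  # total: 10.8 GB
print(f"chunk: {chunk_bytes / 1e6:.0f} MB")  # chunk: 80 MB
print(f"chunks: {n_chunks}")                 # chunks: 135
```

Under that assumption, the example chunking splits the array into 135 chunks of 80 MB each, which is the unit of work each Dask task reads or writes.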

#### Real World Tests

##### LOCA


##### LLC4320
MITgcm global ocean model output from a $1/48^{\circ}$ model (LLC4320) is used as real-world input data. The data resides in `storage-benchmarks` buckets on Pangeo environments in various formats. Currently, the data is available as NetCDF files accessible via FUSE, or on the GCS object store in Zarr format with a predetermined chunk configuration chosen so that each chunk is roughly 10 MB.

### Local Tests


The local tests use a randomly generated 3D NumPy array whose size along the $z$ axis defaults to $n=32$. This produces
a 250 MB dataset on which the synthetic tests are run.

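```
import random

import numpy as np

# _DATASET_NAME is assumed to be defined at module level in the benchmark code.


def readtest(f):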
    """Read all values of dataset and confirm they are in the expected range
    """
    dset = f[_DATASET_NAME]
    nz = dset.shape[0]
    for i in range(nz):
        arr = dset[i,:,:]
        mean = arr.mean()
        if mean < 0.4 or mean > 0.6:
            msg = "mean of {} for slice: {} is unexpected".format(mean, i)
            raise ValueError(msg)


def writetest(f, data):
    """Update all values of the dataset
    """
    dset = f[_DATASET_NAME]
    nz = dset.shape[0]
    for i in range(nz):
        dset[i,:,:] = data[i,:,:]

def tasmax_slicetest(f):
    """Check a random slice of the tasmax dataset
    """
    dset = f['tasmax']
    day = random.randrange(dset.shape[0])  # choose random day in year
    data = dset[day,:,:]  # get numpy array for given day
    vals = data[np.where(data<400.0)]  # cull fill values
    vmin = vals.min()  # named vmin/vmax to avoid shadowing the built-ins
    if vmin < 100.0:
        msg = "day: {} was unusually cold! (for degrees kelvin)".format(day)
        raise ValueError(msg)
    vmax = vals.max()
    if vmax > 350.0:
        msg = "day: {} was unusually hot! (for degrees kelvin)".format(day)
        raise ValueError(msg)
    if vmax - vmin < 20.0:
        msg = "day: {} expected more variation".format(day)
        raise ValueError(msg)
```