## Performance

Zarr arrays are [chunked](https://zarr.readthedocs.io/en/stable/tutorial.html?highlight=chunk#chunk-optimizations), meaning that they are split up into small pieces of equal size, and each chunk is stored in a separate file. Choice of the chunk size affects performance significantly.

Performance also depends strongly on the access pattern. A slice that only needs data from a single chunk is fast to read from disk, while a slice that crosses many chunks is slow.

An overview of some chunking performance considerations is [available here](https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html).

By default, *cellcutter* creates Zarr arrays with chunks of the size `[channels in TIFF, x cells, thumbnail width, thumbnail height]`, meaning that for a given cell, all channels and the entire thumbnail image are stored in the same chunk. The number of cells `x` per chunk is calculated internally so that each chunk has a total uncompressed size of about 32 MB.

The default chunk size works well for access patterns that request all channels and the entire thumbnail for a given range of cells. Ideally, the cells should be contiguous along the second dimension of the array.

In [1]:
import zarr
from numpy.random import default_rng

In [2]:
z = zarr.open("cellMaskThumbnails.zarr", mode="r")

In [3]:
z.shape

(12, 9522, 46, 46)

In [4]:
z.chunks

(12, 330, 46, 46)

The `chunks` property gives the size of each chunk in the array. In this example, all 12 channels, 330 cells, and the complete thumbnail are stored in a single chunk.
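As a rough sanity check, the chunk shape above can be related to the 32 MB target with a few lines of arithmetic. This is only a sketch and not the calculation *cellcutter* performs internally: the helper `cells_per_chunk` is hypothetical, and the item size of 4 bytes per pixel is an assumption, since the data type of the example array is not shown here.

```python
import math


def cells_per_chunk(n_channels, width, height, itemsize, target_bytes=32 * 1024**2):
    """Largest number of cells whose uncompressed chunk fits in target_bytes.

    Hypothetical helper for illustration; cellcutter's own logic may differ.
    """
    bytes_per_cell = n_channels * width * height * itemsize
    return max(1, target_bytes // bytes_per_cell)


shape = (12, 9522, 46, 46)  # (channels, cells, width, height) from the example
itemsize = 4  # ASSUMPTION: 4 bytes per pixel; check z.dtype.itemsize on real data

n_cells = cells_per_chunk(shape[0], shape[2], shape[3], itemsize)
chunk_bytes = n_cells * shape[0] * shape[2] * shape[3] * itemsize
n_chunks = math.ceil(shape[1] / n_cells)

print(f"cells per chunk: {n_cells}")
print(f"uncompressed chunk size: {chunk_bytes / 1024**2:.2f} MiB")
print(f"chunks along the cell axis: {n_chunks}")
```

Under these assumptions the result matches the 330 cells per chunk reported by `z.chunks`, and the 9522 cells are spread over 29 chunks along the second dimension.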

### Access patterns

#### 100 Random cells

In [5]:
rng = default_rng()
def rand_slice(n=100):
    return rng.choice(z.shape[1], size=n, replace=False)

In [6]:
%%timeit
_ = z.get_orthogonal_selection((slice(None), rand_slice()))

109 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### 100 Contiguous cells

In [7]:
%%timeit
_ = z[:, 1000:1100, ...]

4.13 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Accessing **100 random cells** from the Zarr array takes around 110 ms, whereas accessing **100 contiguous cells** (cells 1000 to 1099) takes only around 4 ms, a roughly 26-fold difference. This is because random cells are likely to be distributed across many separate chunks. Each of these chunks must be read into memory in full, even if only a single cell is requested from it. In contrast, contiguous cells are stored together in one or a few neighboring chunks, which minimizes the amount of data that has to be read from disk.
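The effect can be made concrete by counting how many chunks each selection touches. The sketch below uses only NumPy and the array and chunk sizes from the example; `chunks_touched` is a hypothetical helper, not part of Zarr or *cellcutter*.

```python
import numpy as np


def chunks_touched(cell_indices, cells_per_chunk):
    """Number of distinct chunks along the cell axis that contain the given cells."""
    chunk_ids = np.asarray(cell_indices) // cells_per_chunk
    return np.unique(chunk_ids).size


N_CELLS = 9522         # z.shape[1] in the example
CELLS_PER_CHUNK = 330  # z.chunks[1] in the example

rng = np.random.default_rng(0)
random_cells = rng.choice(N_CELLS, size=100, replace=False)
contiguous_cells = np.arange(1000, 1100)

# Random cells usually land in most of the 29 chunks along the cell axis.
print("random:", chunks_touched(random_cells, CELLS_PER_CHUNK))
# Cells 1000 to 1099 all fall inside the chunk covering cells 990 to 1319.
print("contiguous:", chunks_touched(contiguous_cells, CELLS_PER_CHUNK))
```

The contiguous selection needs a single chunk, whereas the random selection typically forces most chunks in the array to be read and decompressed.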

### Fast access to random cells

If access to random cells is required, for example for training a machine learning model, there is a workaround that avoids the performance penalty of requesting random cells. Instead of accessing cells at random positions, we can randomize the cell order before the Zarr array is created. Because the cell order is then already random, we can simply request contiguous cells during training.
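A minimal sketch of reading contiguous batches might look as follows. The generator `contiguous_batches` is a hypothetical helper, and a small NumPy array stands in for the Zarr array so that the example runs without any data on disk; the same slicing syntax works on an array opened with `zarr.open`.

```python
import numpy as np


def contiguous_batches(arr, batch_size):
    """Yield consecutive blocks of cells along the second (cell) dimension."""
    n_cells = arr.shape[1]
    for start in range(0, n_cells, batch_size):
        # Basic slicing reads only the chunks that overlap this range of cells.
        yield arr[:, start:start + batch_size, ...]


# Stand-in for a Zarr array of shape (channels, cells, width, height).
demo = np.zeros((2, 10, 3, 3), dtype="uint16")

for batch in contiguous_batches(demo, batch_size=4):
    print(batch.shape)
```

Choosing a batch size that is a multiple or divisor of the number of cells per chunk keeps batches aligned with chunk boundaries, so fewer batches read from two chunks.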

The simplest way to randomize cell order is to shuffle the order of rows in the CSV file that is passed to *cellcutter*, for example by using *pandas* `df.sample(frac=1)`.
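The shuffling step could be sketched like this. The column names and values are made up for illustration, and an in-memory CSV replaces the real file; in practice the input would be read from and written to paths of your choosing before running *cellcutter*.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real file would be loaded with pd.read_csv(path).
csv_text = """CellID,X_centroid,Y_centroid
1,10.5,20.1
2,33.0,41.7
3,58.2,12.9
4,71.4,65.3
5,90.8,30.6
"""
df = pd.read_csv(io.StringIO(csv_text))

# frac=1 samples every row exactly once, which is a full shuffle.
# A fixed random_state makes the order reproducible.
shuffled = df.sample(frac=1, random_state=42)

# Write without the old index so the file keeps its original columns.
out = io.StringIO()
shuffled.to_csv(out, index=False)
print(out.getvalue())
```

Fixing `random_state` is optional, but it makes it possible to recreate the same cell order later.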