# Zarr

Zarr is a format for the storage of chunked, compressed, N-dimensional arrays inspired by HDF5, h5py and bcolz.

## Installation

In [1]:
# %pip install zarr ipytree numcodecs

## High Dimensional Chunked Array in memory

Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but **whose data is divided into chunks, with each chunk compressed**.

In [2]:
import zarr
import numpy as np

In [3]:
# creates a 2-dimensional array of 64-bit floats, divided into chunks
# total array shape is 2^22 x 2^22; each chunk has shape 2^11 x 2^11
# only the required chunks are loaded into memory when needed
z = zarr.zeros((1<<22, 1<<22), chunks=(1<<11, 1<<11), dtype='f8')

# ERROR: NumPy cannot allocate 128 TiB (2^(22 + 22 + 3) bytes; each float64 takes 2^3 bytes)
# t = np.zeros((1<<22, 1<<22))

Reading and writing use slicing brackets, as in NumPy.
Unlike a NumPy array, a Zarr array can also be resized in place with `resize()`.

In [4]:
# Not really needed, just as a precaution
del z

### Chunk optimisation

In general, **chunks of at least 1 megabyte (1M) uncompressed size** seem to provide better performance, at least when using the Blosc compression library.

In the `chunks` argument, pass `None` for any dimension you don't wish to partition, e.g. `shape=(20000, 20000), chunks=(1000, None)` will create 20 chunks of size `1000 x 20000` each.

You can let Zarr guess a chunk shape for your data by providing `chunks=True`, which picks a chunk shape using simple heuristics.
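To check a candidate chunk shape against the 1 MB guideline, the chunk count and uncompressed chunk size can be worked out by hand. The helper below is a hypothetical illustration written for this note (it is not part of Zarr) and uses only NumPy and the standard library:

```python
import math
import numpy as np

def chunk_stats(shape, chunks, dtype):
    """Return (number of chunks, uncompressed bytes per chunk).

    Hypothetical helper for this sketch, not a Zarr function.
    A None entry in `chunks` means "do not partition this dimension".
    """
    # Replace None with the full extent of that dimension.
    full = tuple(s if c is None else c for s, c in zip(shape, chunks))
    # Chunks along each dimension, rounding up for a partial final chunk.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, full))
    chunk_bytes = math.prod(full) * np.dtype(dtype).itemsize
    return n_chunks, chunk_bytes

n_chunks, chunk_bytes = chunk_stats((20000, 20000), (1000, None), 'f8')
print(n_chunks, "chunks of", chunk_bytes / 2**20, "MiB each")
```

For the example in the text with 8-byte floats, this gives 20 chunks of 160,000,000 bytes each, well above the 1 MB guideline.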

### Copying large arrays

Data can be copied between large arrays without needing much memory.

Copying works chunk-by-chunk, extracting only the data from z1 required to fill each chunk in z2. 

The source of the data (z1) could equally be an h5py Dataset.

### Parallel Computation Support

**By default**, Zarr arrays are designed for use as either the **source** _or_ the **sink** for data in parallel computations (multi-threaded and multi-process).
Having multiple readers and writers operate on the same array concurrently is not supported.

During writing (sink), if each worker in a parallel computation is **writing to a separate region of the array**, and if region boundaries are **perfectly aligned with chunk boundaries**, then **no synchronization is required**. 

**Otherwise**, synchronization is required. A `synchronizer` can be passed to array creation functions such as `open()`, `create()` and `zeros()`.

Currently available synchronizers:

- Thread synchronizer `synchronizer=zarr.ThreadSynchronizer()`

- Process synchronizer `synchronizer=zarr.ProcessSynchronizer()`

### Compressor and Filters

Zarr divides the array into chunks.  

It also **compresses the data** before storing. For efficient compression, some **transformations are applied** on the raw data using **Filters**.

We can override the default compressor and filters by passing the `compressor` and `filters` arguments when creating a Zarr array.

Zarr uses the [NumCodecs](https://numcodecs.readthedocs.io/en/stable/) library, which contains multiple compressors and filters.

**About Numcodecs:**

Numcodecs is a Python package providing buffer compression and transformation codecs for use in data storage and communication applications.

Zarr uses numcodecs for compression and filter operations, and to supply the `object_codec` for object arrays.

See the [numcodecs](https://numcodecs.readthedocs.io/en/stable/index.html) documentation for more information.



### Configuring Blosc

See [Zarr Doc](https://zarr.readthedocs.io/en/stable/tutorial.html#configuring-blosc)

[Blosc](https://numcodecs.readthedocs.io/en/stable/blosc.html) is the default compressor.

The number of Blosc threads can be changed using:

```py
from numcodecs import blosc
blosc.set_nthreads(2)  
```

For multi-process programs, it is recommended to set `blosc.use_threads = False`.

## Persistent Array on Disk

Zarr arrays can also be stored on a file system, enabling persistence of data between sessions.

These arrays also support compression, filters, chunking, parallel computing etc. as discussed previously.

In [5]:
# data/example.zarr will be a directory
# initially it will contain only `.zarray`, which holds the metadata

z1 = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000),
               chunks=(1000, 1000), dtype='i4')

In [6]:
# filling all elements with a value
# this will generate 100 files, each corresponding to one chunk
z1[:] = 37

In [7]:
z2 = zarr.open('data/example.zarr', mode='r')

In [8]:
np.all(z2[:] == 37)

True

### Process Synchronization using file locks

`zarr.ProcessSynchronizer` provides synchronization between processes using file locks, via the `fasteners` package.

In [9]:
# creates lock files in process_sync.sync directory, 1 lock file per chunk
synchronizer = zarr.ProcessSynchronizer('data/process_sync.sync')

# process_sync.zarr will contain the actual data
z = zarr.open('data/process_sync.zarr', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='i4',
                    synchronizer=synchronizer)

# writing data
z[:] = 37

### Quickly Saving Numpy arrays with Zarr

Use the functions `zarr.save()` and `zarr.load()`.

In [10]:
a = np.arange(10)

# by default will store it in a single chunk
zarr.save('data/numpy.zarr', a)

In [11]:
zarr.load('data/numpy.zarr')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Storage Alternatives

More on this: https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives

We can store Zarr arrays in various storage backends:
1. DirectoryStore (default, native)
2. ZipStore
3. DBMStore
4. LMDBStore (lightning memory-mapped DB)
5. SQLiteStore
6. RedisStore
7. N5Store

Zarr also supports **Distributed/Cloud Storage options**:
1. AWS S3
2. HDFS (hadoop distributed)
3. Google Cloud

## Non-numeric dtypes

Zarr supports many of the non-numeric dtypes that NumPy supports.
See the [NumPy docs](https://numpy.org/doc/stable/reference/arrays.dtypes.html).

### Fixed length string

`dtype='Un'`, where `n` is the maximum length of each string and `U` means Unicode.

In [12]:
z = zarr.zeros(10, dtype='U6') # each element is a Unicode string of at most 6 characters
z[0] = 'Hello'
z[1] = 'world!!!' # extra characters are truncated
z[:]

array(['Hello', 'world!', '', '', '', '', '', '', '', ''], dtype='<U6')

### Variable length string

`dtype=str`, which Zarr treats as shorthand for `dtype=object` with a variable-length Unicode codec (`numcodecs.VLenUTF8`).

In [13]:
import numcodecs
text_data = ["Hello World!!", "I am Maneesh.", "Nice to see you!"]

z = zarr.array(text_data, dtype=str)
z[:]

array(['Hello World!!', 'I am Maneesh.', 'Nice to see you!'], dtype=object)

### Object Arrays

`dtype=object, object_codec=<codec>`, where `object_codec` can be one of:

1. `numcodecs.json.JSON`
2. `numcodecs.msgpacks.MsgPack`
3. `numcodecs.pickles.Pickle`

In [14]:
z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
z[0] = 42
z[1] = 'foo'
z[2] = ['bar', 'baz', 'qux']
z[3] = {'a': 1, 'b': 2.2}
z[:]

array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None],
      dtype=object)

### Datetime

Please refer to the [NumPy docs](https://numpy.org/doc/stable/reference/arrays.datetime.html#datetimes-and-timedeltas) and the [Zarr docs](https://zarr.readthedocs.io/en/stable/tutorial.html#datetimes-and-timedeltas).

`datetime64`: `dtype='M8[unit]'`

`timedelta64`: `dtype='m8[unit]'`

where `unit` can be, for example, `ms`, `s`, `m`, `h`, `D`, `W`, `M` or `Y`.
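The dtype strings above come from NumPy, so their behaviour can be sketched with NumPy alone. The dates below are arbitrary sample values; the same dtype strings are what would be passed as `dtype` when creating a Zarr array:

```python
import numpy as np

# 'M8[D]' is shorthand for datetime64 with day resolution.
dates = np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='M8[D]')

# Subtracting two datetimes yields a timedelta64 in the same unit ('m8[D]').
gap = dates[0] - dates[1]

# Units can be converted explicitly, here from days to hours.
gap_hours = gap.astype('m8[h]')

print(dates.dtype, gap, gap_hours)
```

Choosing the unit fixes the resolution of every element, so an array declared with `M8[D]` cannot hold times of day.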

## Some (maybe) useful Features

### Hierarchical organisation of arrays with Groups

See the [zarr doc](https://zarr.readthedocs.io/en/stable/tutorial.html#groups)

In [15]:
# root of hierarchy

root = zarr.group() # in memory
# or
root = zarr.open("data/group.zarr", mode="w") # on disk (mode="w" creates it, overwriting any existing data)

# groups
foo = root.create_group('foo')
# bar inside foo
bar = foo.create_group('bar')

In [16]:
z1 = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')

In [17]:
# to get more info
z1.info # works on root, foo, z1

| Field | Value |
|---|---|
| Name | /foo/bar/baz |
| Type | zarr.core.Array |
| Data type | int32 |
| Shape | (10000, 10000) |
| Chunk shape | (1000, 1000) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.DirectoryStore |
| No. bytes | 400000000 (381.5M) |


In [18]:
# to get tree view 
# if you have ipytree it will give interactive output
root.tree()

Tree(nodes=(Node(disabled=True, name='/', nodes=(Node(disabled=True, name='foo', nodes=(Node(disabled=True, na…

### User Attributes

User attributes are useful for attaching custom key-value metadata to an array or group.

They are stored in a separate `.zattrs` file.

In [19]:
z1 = zarr.open("data/attributes.zarr", shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')

In [20]:
z1.attrs["author"] = "maneesh"

### Advanced Indexing of zarr arrays

Many options are available for indexing, e.g.

**with coordinate arrays**

`set_coordinate_selection()`

`get_coordinate_selection()`

**with boolean mask arrays**

`get_mask_selection()`

`set_mask_selection()`