### Zarr Examples

In [2]:
import zarr
import numpy as np

#### Example 1
Zarr provides several functions for creating arrays. Below, we create an in-memory float32 array with 10^10 elements, split into 1,000 chunks of shape 1,000 x 10,000.

In [3]:
z = zarr.zeros((100_000, 100_000), chunks=(1_000, 10_000), dtype='f4')
z

<zarr.core.Array (100000, 100000) float32>
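The size and chunk layout of this array can be checked with plain arithmetic. The sketch below uses only the shape, chunk shape, and dtype size from the cell above; the variable names are illustrative and not part of the Zarr API.

```python
from math import ceil, prod

shape = (100_000, 100_000)
chunks = (1_000, 10_000)
itemsize = 4  # bytes per float32 ('f4') element

n_elements = prod(shape)
# number of chunks along each axis, rounding up for partial chunks
chunks_per_axis = tuple(ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = prod(chunks_per_axis)
chunk_nbytes = prod(chunks) * itemsize   # uncompressed size of one chunk
total_nbytes = n_elements * itemsize     # uncompressed size of the whole array

print(n_elements)                         # 10000000000
print(chunks_per_axis, n_chunks)          # (100, 10) 1000
print(chunk_nbytes)                       # 40000000
print(round(total_nbytes / 2**30, 1))     # 37.3
```

The last figure matches the `37.3G` that `z.info` reports as the uncompressed size.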

The snippet below prints information about the array. Note the "lazy execution": chunks are only physically stored once data is written to them. Initially, only 350 bytes are stored, which is the array metadata.

In [3]:
z.info

Type,zarr.core.Array
Data type,float32
Shape,"(100000, 100000)"
Chunk shape,"(1000, 10000)"
Order,C
Read-only,False
Compressor,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"
Store type,zarr.storage.KVStore
No. bytes,40000000000 (37.3G)
No. bytes stored,350


After populating part of the array, as in the cell below, the number of bytes stored increases to 160.9 KB.

In [22]:
# populate some parts of the z array
z[20_100:20_900, 0] = 3.14159
z.info

Type,zarr.core.Array
Data type,float32
Shape,"(100000, 100000)"
Chunk shape,"(1000, 10000)"
Order,C
Read-only,False
Compressor,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"
Store type,zarr.storage.KVStore
No. bytes,40000000000 (37.3G)
No. bytes stored,164733 (160.9K)
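To see why so little is stored, it helps to work out which chunks the assignment `z[20_100:20_900, 0]` actually touches. The helper `touched_chunks` below is a hypothetical illustration, not a Zarr function; it only does integer division on the slice bounds.

```python
def touched_chunks(selection, chunks):
    """Map per-axis half-open (start, stop) ranges to the chunk-grid index ranges they overlap."""
    return [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]

chunks = (1_000, 10_000)
# z[20_100:20_900, 0] expressed as half-open ranges on each axis
selection = [(20_100, 20_900), (0, 1)]

rows, cols = touched_chunks(selection, chunks)
n_touched = len(rows) * len(cols)

print(list(rows), list(cols))  # [20] [0]
print(n_touched)               # 1
```

Only one of the 1,000 chunks is materialized. Its 40 MB of mostly fill values compresses well, which is consistent with the roughly 160 KB reported above.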


#### Example 2: Persistent array
Here we initialize a Zarr array that is stored on the file system. Again, almost no disk space is consumed at this point: only 350 bytes are stored.

In [5]:
zfile = zarr.open('example_1.zarr', shape=(100_000, 100_000), chunks=(1_000, 10_000), dtype='f4', mode='w')
zfile.info

Type,zarr.core.Array
Data type,float32
Shape,"(100000, 100000)"
Chunk shape,"(1000, 10000)"
Order,C
Read-only,False
Compressor,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"
Store type,zarr.storage.DirectoryStore
No. bytes,40000000000 (37.3G)
No. bytes stored,350
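The on-disk footprint can also be checked independently of `zfile.info` by summing the file sizes under the store directory. The sketch below uses only the standard library; `dir_size` is a hypothetical helper, and the demo runs on a throwaway directory with a 350-byte placeholder file rather than on the real store.

```python
import os
import tempfile

def dir_size(path):
    """Return the total size in bytes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demo on a temporary directory holding a single 350-byte file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "meta.json"), "wb") as f:
        f.write(b"x" * 350)
    demo_size = dir_size(tmp)

print(demo_size)  # 350

# Assumed usage on the store created above:
# dir_size('example_1.zarr')
```

For a freshly created array in the Zarr v2 format, the directory is expected to contain only a small metadata file until chunks are written.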


#### Example 3: Bad Zarr example: be aware of chunk size!
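Running the following code snippet takes a very long time, because each single-row assignment only partially updates the 10 chunks that the row spans.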

In [None]:
%%time
tmp = np.arange(0, 1e5)
for i in range(11_000, 21_000):
    zfile[i, :] = tmp + i
zfile.info
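A rough count of chunk operations shows where the time goes. This is a back-of-the-envelope sketch that assumes every partial write to a chunk forces that chunk to be recompressed and rewritten; it does not measure actual Zarr behavior.

```python
from math import ceil

n_cols = 100_000
chunk_rows, chunk_cols = 1_000, 10_000
row_start, row_stop = 11_000, 21_000   # rows written by the loop

n_rows = row_stop - row_start
# one full row spans this many chunks
chunks_per_row_write = ceil(n_cols / chunk_cols)
# row-by-row: every row assignment partially updates each chunk it spans
row_by_row_updates = n_rows * chunks_per_row_write
# distinct chunks covered by the whole region (the row range is chunk-aligned)
distinct_chunks = (n_rows // chunk_rows) * chunks_per_row_write

print(chunks_per_row_write)                   # 10
print(row_by_row_updates)                     # 100000
print(distinct_chunks)                        # 100
print(row_by_row_updates // distinct_chunks)  # 1000
```

Under this assumption, each of the 100 affected chunks is rewritten 1,000 times, once per row it contains.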

Writing data to a Zarr array row by row is inefficient. Each chunk here holds 1,000 rows, so a single-row assignment touches only a small part of every chunk it spans, and the same chunks are compressed and written again and again. To improve write performance, writes should be aligned with the chunk layout. Below, we first prepare a NumPy array with all the data and then write it in a single assignment that covers 100 complete chunks (10 x 10). Each chunk is then compressed and written only once, which greatly reduces the overhead of the writing process.

In [6]:
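# open a new on-disk array with the same shape and chunking as before
zfile = zarr.open('example_2.zarr', shape=(100_000, 100_000), chunks=(1_000, 10_000), dtype='f4', mode='w')
# build the full 10,000 x 100,000 block in memory first (about 4 GB as float32)
data = np.zeros((10_000, 100_000), dtype='f4')
for i in range(10_000):
    data[i, :] = np.arange(0, 100_000) + 11_000 + i

# a single chunk-aligned assignment, so each chunk is compressed and written once
zfile[11_000:21_000, :] = data
zfile.info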

Type,zarr.core.Array
Data type,float32
Shape,"(100000, 100000)"
Chunk shape,"(1000, 10000)"
Order,C
Read-only,False
Compressor,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"
Store type,zarr.storage.DirectoryStore
No. bytes,40000000000 (37.3G)
No. bytes stored,27023364 (25.8M)
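The bulk write above holds the whole 10,000 x 100,000 block in memory at once. A possible middle ground, sketched here, is to write one chunk-aligned band of rows at a time, so that every chunk is still written only once but the in-memory buffer is ten times smaller. The generator `chunk_aligned_bands` is a hypothetical helper, not part of Zarr.

```python
def chunk_aligned_bands(start, stop, chunk_rows):
    """Yield (lo, hi) row ranges covering [start, stop), split at multiples of chunk_rows."""
    lo = start
    while lo < stop:
        # next chunk boundary after lo, clipped to the end of the region
        hi = min((lo // chunk_rows + 1) * chunk_rows, stop)
        yield lo, hi
        lo = hi

bands = list(chunk_aligned_bands(11_000, 21_000, 1_000))
print(len(bands), bands[0], bands[-1])  # 10 (11000, 12000) (20000, 21000)

# Assumed usage with the array from the cell above (not executed here):
# cols = np.arange(100_000, dtype='f4')
# for lo, hi in bands:
#     rows = np.arange(lo, hi, dtype='f4')[:, None]
#     zfile[lo:hi, :] = cols + rows   # one band of 1,000 x 100,000 values
```

Each band in the commented loop would be about 400 MB as float32 and would cover exactly one row of 10 chunks.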
