# Compression

If we look at the size of the final version of the "radar_trap.nix" file, it is greater than **80 MB** (depending somewhat on the number of images stored)!

The reason is that each image holds 1024 * 768 * 4 values of 4 bytes each (float32), which adds up to about 12.6 MB per picture.
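The per-image size can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and reading the third factor as 4 color channels is an assumption.

```python
def image_size_bytes(width=1024, height=768, channels=4, bytes_per_value=4):
    """Uncompressed size of one image stored as float32 values (4 bytes each)."""
    return width * height * channels * bytes_per_value


size = image_size_bytes()
# Report the size in decimal megabytes and in binary mebibytes.
print(f"{size} bytes = {size / 1e6:.2f} MB = {size / 2**20:.2f} MiB")
```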

An easy way to work around this is to enable dataset compression in the **HDF5** backend. Simply pass the ``DeflateNormal`` flag when creating the file.

``nixfile = nixio.File.open("radar_trap.nix", nixio.FileMode.Overwrite, compression=nixio.Compression.DeflateNormal)``

If a file is created with this flag, all DataArrays will be losslessly compressed using the embedded gzip algorithm.
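What "lossless" means can be illustrated with Python's standard ``zlib`` module, which implements the deflate algorithm that gzip uses. This is a stand-alone sketch on made-up, image-like data; it does not go through NIX or HDF5, and the compression level NIX uses internally may differ.

```python
import array
import zlib

# Made-up "image": a repeating gradient stored as float32 values.
values = array.array("f", [(i % 256) / 255.0 for i in range(256 * 256)])
raw = values.tobytes()

packed = zlib.compress(raw)         # deflate with zlib's default level
restored = zlib.decompress(packed)  # decompression returns the exact same bytes

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
print("identical after round trip:", restored == raw)
```

How much is saved depends strongly on the data: regular data like this gradient compresses very well, noisy data much less so.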

## Exercise

 1. Try it out and compare the file sizes with and without compression.
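For the comparison, a small helper based on ``os.path.getsize`` may be handy. The sketch below uses two stand-in files so that it runs on its own; the helper name and the file names are hypothetical, and in the exercise you would pass the paths of your two NIX files instead.

```python
import os
import tempfile


def size_report(plain_path, compressed_path):
    """Return both file sizes in bytes and the ratio plain / compressed."""
    plain = os.path.getsize(plain_path)
    packed = os.path.getsize(compressed_path)
    return plain, packed, plain / packed


# Stand-in files; replace them with the uncompressed and compressed NIX files.
with tempfile.TemporaryDirectory() as folder:
    plain_file = os.path.join(folder, "plain.bin")
    packed_file = os.path.join(folder, "packed.bin")
    with open(plain_file, "wb") as f:
        f.write(b"\0" * 4000)
    with open(packed_file, "wb") as f:
        f.write(b"\0" * 1000)
    plain, packed, ratio = size_report(plain_file, packed_file)

print(f"plain: {plain} bytes, compressed: {packed} bytes, ratio: {ratio:.1f}")
```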



## Compression comes at a price

The file size should be tremendously reduced. However, compression is not free: **it reduces read and write performance**.

For some data it might be preferable not to compress and to have a higher read/write performance instead. In such cases you can switch compression **on** (``nixio.Compression.DeflateNormal``) or **off** (``nixio.Compression.No``) when creating a **DataArray**.

``block.create_data_array("name", "type", data=data, compression=nixio.Compression.No)``

**Note:** Once a DataArray has been created, its compression setting cannot be changed.

# Chunking

When the backend reads or writes data from/to file, it does so in *chunks* (for experts: **DataArrays** are resizable, therefore chunking is always enabled).

![chunk large](resources/chunks_big.png)

The data may be accessed as a whole big chunk, or in smaller pieces. 

![chunk small](resources/chunks_small.png)

In reality, there is no such thing as a 2-D memory space; memory addresses are always contiguous. That means a chunk is laid out as one contiguous run of values, whatever its shape.
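How a 2-D position maps onto contiguous addresses can be sketched as follows. The example assumes row-major order (each row stored after the previous one), which is the C convention; the function name is made up for illustration.

```python
def flat_index(row, col, n_cols):
    """Position of element (row, col) in a row-major, flattened 2-D array."""
    return row * n_cols + col


n_rows, n_cols = 3, 4
# The "memory": 12 values stored one after the other.
flat = list(range(n_rows * n_cols))

# Element in row 1, column 2 sits at flat position 1 * 4 + 2 = 6.
position = flat_index(1, 2, n_cols)
print(position, flat[position])
```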

The *chunk* size affects a few things:

1. The read and write speed (large datasets can be read faster with larger chunks).
2. The resize performance and overhead.
3. The efficiency of the compression.



## Read/write performance

One might expect that large datasets can generally be written and read faster with large chunks. This is true unless the data is usually accessed in small pieces. In that case the backend has to read the full chunk into memory (and probably decompress it) before it can return the small piece of data the user requested.
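The cost of small reads from large chunks can be estimated with a little index arithmetic. The sketch below is a simplified 1-D model with a made-up function name; it ignores caching and only counts how many samples sit in the chunks that a request touches.

```python
def samples_loaded(start, stop, chunk_size):
    """Samples the backend must load to serve the slice [start, stop)."""
    first_chunk = start // chunk_size
    last_chunk = (stop - 1) // chunk_size
    return (last_chunk - first_chunk + 1) * chunk_size


# Requesting 10 samples from a dataset with huge chunks loads a full chunk ...
print(samples_loaded(500, 510, chunk_size=100000))
# ... while small chunks keep the overhead low.
print(samples_loaded(500, 510, chunk_size=100))
# A request that crosses a chunk border touches two chunks.
print(samples_loaded(95, 105, chunk_size=100))
```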

## Resize performance

Let's assume that we have already filled the full 9 by 9 chunk with data. Now we want to extend the dataset by another 3 by 3 block of data. With large chunks we would ask the backend to reserve another full 9 by 9 chunk and write just 9 data points into it. Reserving large amounts of space takes more time and, if the space is not filled with meaningful data, creates larger files than strictly necessary.
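The overhead in this scenario can be put into numbers. The sketch below is a simplified model with a made-up function name: it assumes the new 3 by 3 block is written next to the existing data, starting at row 0 and column 9, and that every chunk the block touches is reserved in full.

```python
def allocated_elements(block_origin, block_shape, chunk_shape):
    """Elements reserved when a block is written into a chunked dataset."""
    n_chunks = 1
    chunk_elements = 1
    for origin, extent, chunk in zip(block_origin, block_shape, chunk_shape):
        first = origin // chunk
        last = (origin + extent - 1) // chunk
        n_chunks *= last - first + 1  # chunks touched along this axis
        chunk_elements *= chunk       # elements per chunk
    return n_chunks * chunk_elements


# 9 new data points, written at rows 0-2, columns 9-11.
print(allocated_elements((0, 9), (3, 3), chunk_shape=(9, 9)))  # large chunks
print(allocated_elements((0, 9), (3, 3), chunk_shape=(3, 3)))  # small chunks
```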

## Compression performance

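How well compression works depends on the chunk size: the backend compresses each chunk on its own, so very small chunks give the algorithm little redundancy to exploit.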
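The effect of the chunk size on compression can be tried out with the standard ``zlib`` module. The sketch below compresses the same made-up signal in independent pieces of different sizes; it only imitates chunk-wise compression and does not use NIX or HDF5.

```python
import array
import math
import zlib

# Made-up recording: a periodic sine wave stored as 16-bit integers.
samples = array.array(
    "h", (int(1000 * math.sin(2 * math.pi * (i % 100) / 100)) for i in range(20000))
)
raw = samples.tobytes()


def compressed_size(raw_bytes, chunk_bytes):
    """Total size when every chunk is compressed independently."""
    total = 0
    for start in range(0, len(raw_bytes), chunk_bytes):
        total += len(zlib.compress(raw_bytes[start:start + chunk_bytes]))
    return total


sizes = {chunk_bytes: compressed_size(raw, chunk_bytes)
         for chunk_bytes in (64, 1024, len(raw))}
for chunk_bytes, size in sizes.items():
    print(f"chunk size {chunk_bytes:>6} bytes -> {size:>6} bytes compressed")
```

For this kind of regular data, larger chunks compress better because the compressor sees more of the repeating pattern at once.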


The chunk size is automatically defined upon creation of the **DataArray**.

``block.create_data_array("name", "type", data=data)``

The **HDF5** backend will try to figure out the optimal chunk size depending on the shape of the data. If one wants to influence the chunking and has a good idea of the usual read and write access patterns (e.g. I know that I will always read one second of data at a time), one can create the **DataArray** with a defined shape and write the data later.

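```python
# Create the DataArray with the desired shape but without data.
data_array = block.create_data_array("name", "id", dtype=nixio.DataType.Double,
                                     shape=(chunk_samples, number_of_channels), label="voltage", unit="mV")
# First dimension: time, sampled at 1 kHz; second dimension: the channels.
data_array.append_sampled_dimension(0.001, label="time", unit="s")
data_array.append_set_dimension(labels=["channel %i" % i for i in range(number_of_channels)])

# Write the data once it is available.
data_array.write_direct(data)
```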

**Note:** If we do not provide the data at the time of **DataArray** creation, we need to provide the data type *dtype*.
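To turn an access pattern such as "always read one second of data at a time" into a value for ``chunk_samples``, divide the duration by the sampling interval. A minimal sketch with a made-up helper name:

```python
def samples_per_read(duration_s, dt_s):
    """Number of samples covered by one read of the given duration."""
    return int(round(duration_s / dt_s))


# One second of data sampled every millisecond.
chunk_samples = samples_per_read(1.0, 0.001)
print(chunk_samples)
```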



## Exercise: Let's test the effects of chunking on the write performance.

1. Use the code below and extend it to test different chunk sizes (``chunk_samples`` controls how many samples per channel are written at a time). **Note:** make sure that the same total number of samples is written in each test.
2. Compare the write performance with and without compression.
3. Select the "optimal" chunking strategy and test the read performance with slices of varying size.


```python
import time

import nixio
import numpy as np


def record_data(samples, channels, dt):
    """Simulate a recording: one noisy, phase-shifted sine wave per channel."""
    data = np.zeros((samples, channels))
    t = np.arange(samples) * dt
    for i in range(channels):
        phase = i * 2 * np.pi / channels
        data[:, i] = np.sin(2 * np.pi * t + phase) + (np.random.randn(samples) * 0.1)

    return data


def write_nixfile(filename, chunk_samples=1000, number_of_channels=10, dt=0.001, chunk_count=100,
                  compression=nixio.Compression.No):
    """Write chunk_count blocks of chunk_samples samples each and return the time needed."""
    nixfile = nixio.File.open(filename, nixio.FileMode.Overwrite, compression=compression)
    block = nixfile.create_block("Session 1", "nix.recording_session")
    data_array = block.create_data_array("multichannel_data", "nix.sampled.multichannel", dtype=nixio.DataType.Double,
                                         shape=(chunk_samples, number_of_channels), label="voltage", unit="mV")
    # Use dt as the sampling interval so that it matches the simulated data.
    data_array.append_sampled_dimension(dt, label="time", unit="s")
    data_array.append_set_dimension(labels=["channel %i" % i for i in range(number_of_channels)])

    total_samples = chunk_count * chunk_samples
    data = record_data(total_samples, number_of_channels, dt)
    chunks_recorded = 0
    t0 = time.time()
    while chunks_recorded < chunk_count:
        start_index = chunk_samples * chunks_recorded
        if chunks_recorded == 0:
            # The first block fills the space reserved on creation.
            data_array.write_direct(data[start_index:start_index + chunk_samples, :])
        else:
            # All further blocks extend the DataArray along the time axis.
            data_array.append(data[start_index:start_index + chunk_samples, :], axis=0)
        chunks_recorded += 1
    total_time = time.time() - t0

    nixfile.close()
    return total_time


time_needed = write_nixfile("chunking_test.nix", chunk_samples=100000, chunk_count=10)
print(time_needed)
```


0.16677212715148926
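For step 1 of the exercise, the total number of samples has to stay the same for every chunk size. One possible way to arrange this, sketched with a made-up helper, is to derive ``chunk_count`` from a fixed total:

```python
def chunk_plan(total_samples, chunk_sizes):
    """Pair every chunk size with the chunk count that yields total_samples."""
    plan = []
    for chunk_samples in chunk_sizes:
        if total_samples % chunk_samples != 0:
            raise ValueError(f"{chunk_samples} does not divide {total_samples}")
        plan.append((chunk_samples, total_samples // chunk_samples))
    return plan


plan = chunk_plan(1000000, [100, 1000, 10000, 100000])
for chunk_samples, chunk_count in plan:
    # In the exercise, each pair would be passed on to write_nixfile, e.g.
    # write_nixfile("chunking_test.nix", chunk_samples=chunk_samples, chunk_count=chunk_count)
    print(chunk_samples, chunk_count)
```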
