## Check disk consumption of earrays vs tables

Let's suppose we have an 800-sample waveform and we want to store it on disk. Because the waveform has a fixed size, the most efficient way to store it is as an earray. If the waveform is zero-suppressed (ZS), many samples (90%?) are considered zeroes and therefore irrelevant. We remove them from the waveform because we don't want to store them. Or do we?

Regarding IO, earrays are much more efficient objects, which makes reading the file much faster. If, instead of removing the values below some threshold, we set them to exact zeroes, the compressor can exploit this structure and keep the file small. But does this really work? How does it affect IO time? Let's check.
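Before running the real test, the idea can be illustrated with a minimal sketch that uses only the standard library. The waveform below is hypothetical (random noise plus a short pulse), `zero_suppress` and `compressed_size` are illustrative helpers rather than functions from `wfmFunctions`, and plain `zlib` stands in for the HDF5 compression filters.

```python
import random
import zlib
from array import array

def zero_suppress(samples, threshold):
    """Set samples below threshold to exact zero, keeping the waveform length."""
    return [s if s >= threshold else 0 for s in samples]

def compressed_size(samples, level=9):
    """Bytes needed to store the samples as int16 after zlib compression."""
    raw = array('h', samples).tobytes()
    return len(zlib.compress(raw, level))

rng = random.Random(0)
# Hypothetical 800-sample waveform: noise in [0, 4] plus a 40-sample pulse
waveform = [rng.randint(0, 4) for _ in range(800)]
for k in range(400, 440):
    waveform[k] += 50

zeroed = zero_suppress(waveform, 5)
size_raw = compressed_size(waveform)
size_zeroed = compressed_size(zeroed)
print('raw: {} bytes, zeroed: {} bytes'.format(size_raw, size_zeroed))
```

The zeroed waveform keeps its fixed length of 800 samples, so it still fits in an earray, while the long runs of zeroes are what the compressor takes advantage of.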

In [33]:
h5tbl.close()
h5arr.close()

In [46]:
from __future__ import print_function
import tables as tb
import tblFunctions as tbl
import wfmFunctions as wfm
import Nh5

h5in = tb.open_file('/Users/Gonzalo/github/IC/data/ISIDORA1000.h5')

sipmrwf = h5in.root.RD.sipmrwf
nevt,NSIPM,SIPMWL = sipmrwf.shape
nevt = 10

def doit( name, lib, level):
    h5tbl = tb.open_file('/Users/Gonzalo/github/IC/data/tbltest_{}.h5'.format(name),'w',filters=tb.Filters(complib=lib, complevel=level))
    h5arr = tb.open_file('/Users/Gonzalo/github/IC/data/arrtest_{}.h5'.format(name),'w',filters=tb.Filters(complib=lib, complevel=level))

    h5tbl.create_group(h5tbl.root,'DATA')
    h5arr.create_group(h5arr.root,'DATA')
    
    table = h5tbl.create_table(h5tbl.root.DATA,"TBL",Nh5.SENSOR_WF,"TBL",tb.Filters(complib=lib, complevel=level) )
    
    array = h5arr.create_earray(h5arr.root.DATA, "ARR",
                                atom=tb.Int16Atom(),     #not Float32! bad for compression
                                shape=(0, NSIPM, SIPMWL),
                                expectedrows=nevt)



    for i in range(nevt):
        print('name {}, evt {}'.format(name,i))
        zswf = wfm.noise_suppression(sipmrwf[i],5)
        array.append( zswf.reshape(1,*zswf.shape) )

        zsdic = wfm.sensor_wise_zero_suppression(zswf,5.)
        tbl.store_wf(i,table,zsdic)

    array.flush()
    h5tbl.close()
    h5arr.close()


In [48]:
for lib in ['zlib','blosc','blosc:lz4hc']:
    for clevel in range(1,10,2):
        name = '{}_{}'.format(lib.replace(':',''), clevel)
        doit(name,lib,clevel)

name zlib_1, evt 0
name zlib_1, evt 1
name zlib_1, evt 2
name zlib_1, evt 3
name zlib_1, evt 4
name zlib_1, evt 5
name zlib_1, evt 6
name zlib_1, evt 7
name zlib_1, evt 8
name zlib_1, evt 9
name zlib_3, evt 0
name zlib_3, evt 1
name zlib_3, evt 2
name zlib_3, evt 3
name zlib_3, evt 4
name zlib_3, evt 5
name zlib_3, evt 6
name zlib_3, evt 7
name zlib_3, evt 8
name zlib_3, evt 9
name zlib_5, evt 0
name zlib_5, evt 1
name zlib_5, evt 2
name zlib_5, evt 3
name zlib_5, evt 4
name zlib_5, evt 5
name zlib_5, evt 6
name zlib_5, evt 7
name zlib_5, evt 8
name zlib_5, evt 9
name zlib_7, evt 0
name zlib_7, evt 1
name zlib_7, evt 2
name zlib_7, evt 3
name zlib_7, evt 4
name zlib_7, evt 5
name zlib_7, evt 6
name zlib_7, evt 7
name zlib_7, evt 8
name zlib_7, evt 9
name zlib_9, evt 0
name zlib_9, evt 1
name zlib_9, evt 2
name zlib_9, evt 3
name zlib_9, evt 4
name zlib_9, evt 5
name zlib_9, evt 6
name zlib_9, evt 7
name zlib_9, evt 8
name zlib_9, evt 9
name blosc_1, evt 0
name blosc_1, evt 1
name blosc

In [49]:
%ls -lha /Users/Gonzalo/github/IC/data/

total 7581096
drwxr-xr-x  46 Gonzalo  staff   1.5K Oct 19 12:14 ./
drwxr-xr-x  22 Gonzalo  staff   748B Oct 18 22:59 ../
-rw-r--r--   1 Gonzalo  staff   535K Oct 18 10:29 ANASTASIA0.h5
-rw-r--r--   1 Gonzalo  staff    12M Oct 19 00:46 ANASTASIA1000.h5
-rw-r--r--   1 Gonzalo  staff    10M Oct 18 23:49 DIOMIRA0.h5
-rw-r--r--   1 Gonzalo  staff   112M Oct 17 19:08 DIOMIRA0_bkup.h5
-rw-r--r--   1 Gonzalo  staff   284M Oct 18 15:16 DIOMIRA1000_bkup.h5
-rw-r--r--   1 Gonzalo  staff    38M Oct 18 00:21 ISIDORA0.h5
-rw-r--r--   1 Gonzalo  staff   119M Oct 17 20:29 ISIDORA0_bkup.h5
-rw-r--r--   1 Gonzalo  staff   299M Oct 19 01:50 ISIDORA1000.h5
-rw-r--r--   1 Gonzalo  staff   299M Oct 18 15:48 ISIDORA1000_bkup.h5
-rw-r--r--   1 Gonzalo  staff   6.9M Oct 19 11:57 arrtest.h5
-rw-r--r--   1 Gonzalo  staff   4.5M Oct 19 12:12 arrtest_blosc_1.h5
-rw-r--r--   1 Gonzalo  staff   1.4M Oct 19 12:12 arrtest_blosc_3.h5
-rw-r--r--   1 Gonzalo  staff   1.3M Oct 19 12:12

**BEST OF EARRAY: 54.5 kB/evt with zlib/9**

**BEST OF TABLE:  60.0 kB/evt with zlib/9**

Storing the zeroed waveform as an earray at the maximum compression level saves about **10%** of the space with respect to the table. How does it affect IO time?
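The notebook does not show how the per-event sizes were obtained, so the following is only a sketch of one way to do it. `kb_per_event` and `relative_saving` are hypothetical helpers, and 1 kB is assumed to be 1000 bytes.

```python
import os

def kb_per_event(path, nevt):
    """File size in kB (assuming 1 kB = 1000 bytes) divided by the number of events."""
    return os.path.getsize(path) / 1000.0 / nevt

def relative_saving(best, other):
    """Fraction of space saved by `best` with respect to `other`."""
    return (other - best) / other

# Best figures quoted above, in kB/evt
saving = relative_saving(54.5, 60.0)
print('earray saves {:.1%} with respect to the table'.format(saving))
```

With the numbers quoted above, the saving is 5.5/60.0, which is slightly below the 10% used as a round figure.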

In [67]:
from time import time
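# NOTE: read_where('event>-1') matches every row, so each call reads the whole
# table and ignores i, while readarray reads a single event.
readtable = lambda x,i: x.read_where('event>-1')
readarray = lambda x,i: x[i]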

tarray = {}
ttable = {}
for lib in ['zlib','blosc','blosc:lz4hc']:
    for clevel in range(1,10,2):
        tblname = '/Users/Gonzalo/github/IC/data/tbltest_{}_{}.h5'.format(lib.replace(':',''), clevel)
        arrname = '/Users/Gonzalo/github/IC/data/arrtest_{}_{}.h5'.format(lib.replace(':',''), clevel)
        h5tbl = tb.open_file(tblname)
        h5arr = tb.open_file(arrname)
        table = h5tbl.root.DATA.TBL
        array = h5arr.root.DATA.ARR
        
        t0 = time()
        for i in range(10):
            readarray(array,i)
        dt = time()-t0
        tarray[(lib,clevel)] = dt
        #print('ARRAY with lib {} and clevel {} takes {} s'.format(lib,clevel,dt))
        t0 = time()
        for i in range(10):
            readtable(table,i)
        dt = time()-t0
        ttable[(lib,clevel)] = dt
        #print('TABLE with lib {} and clevel {} takes {} s'.format(lib,clevel,dt))
    
for (lib,clevel),t in sorted(tarray.items(),key=lambda x:x[1]):
    print('ARRAY with lib {} and clevel {} takes {} s'.format(lib,clevel,t))
print('-----------------------------------------------------')
for (lib,clevel),t in sorted(ttable.items(),key=lambda x:x[1]):
    print('TABLE with lib {} and clevel {} takes {} s'.format(lib,clevel,t))


ARRAY with lib blosc:lz4hc and clevel 5 takes 0.023638010025 s
ARRAY with lib blosc:lz4hc and clevel 3 takes 0.0260388851166 s
ARRAY with lib blosc and clevel 1 takes 0.0277469158173 s
ARRAY with lib blosc:lz4hc and clevel 1 takes 0.0288469791412 s
ARRAY with lib blosc and clevel 3 takes 0.0305480957031 s
ARRAY with lib blosc:lz4hc and clevel 9 takes 0.0309579372406 s
ARRAY with lib blosc and clevel 7 takes 0.031261920929 s
ARRAY with lib blosc:lz4hc and clevel 7 takes 0.0314650535583 s
ARRAY with lib blosc and clevel 5 takes 0.0316119194031 s
ARRAY with lib blosc and clevel 9 takes 0.0316879749298 s
ARRAY with lib zlib and clevel 3 takes 0.0636599063873 s
ARRAY with lib zlib and clevel 1 takes 0.0814120769501 s
ARRAY with lib zlib and clevel 9 takes 0.0833299160004 s
ARRAY with lib zlib and clevel 5 takes 0.0855090618134 s
ARRAY with lib zlib and clevel 7 takes 0.0988512039185 s
-----------------------------------------------------
TABLE with lib blosc:lz4hc and clevel 5 takes 1.16335

**BEST OF EARRAY: 23.6 ms with blosc:lz4hc/5**

**BEST OF TABLE:  1.16 s with blosc:lz4hc/5**


For earrays, the best compression (zlib/9) reads about 3.5 times slower than the fastest option (blosc:lz4hc/5): 83 ms versus 24 ms for 10 events. But since it is "fast enough", we can take it.
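The ratio between the two earray timings listed above can be checked directly. The sketch below also includes a hypothetical helper, `time_reads`, which repeats the measurement and keeps the best run to reduce the effect of fluctuations; it is not the timing loop used in the notebook, and it needs Python 3 for `perf_counter`.

```python
from time import perf_counter

def time_reads(read, nevt, repeats=3):
    """Best wall-clock time (s) over `repeats` runs of read(i) for i in range(nevt)."""
    best = float('inf')
    for _ in range(repeats):
        t0 = perf_counter()
        for i in range(nevt):
            read(i)
        best = min(best, perf_counter() - t0)
    return best

# Earray timings listed above for 10 reads, in seconds
t_zlib9 = 0.0833299160004
t_lz4hc5 = 0.023638010025
ratio = t_zlib9 / t_lz4hc5
print('zlib/9 is {:.1f}x slower than blosc:lz4hc/5'.format(ratio))

# Hypothetical stand-in for array[i], only to exercise the helper
data = [list(range(1000)) for _ in range(10)]
elapsed = time_reads(lambda i: sum(data[i]), nevt=10)
```

In the notebook, `read` would be something like `lambda i: array[i]` for the earray.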