# Lossy compression

Lossy and lossless compression is an important feature of RootInteractive.

The achievable compression factor depends on the layout of the data. Large reduction factors are possible, partly due to **the entropy of the data** (e.g. Gaussian distribution), partly due to the **repetitions in the case of flattened arrays.**

For example, in the real use case of the dEdx simulation (clusters per track), the common track properties compress very well, and the charge properties also have a low entropy. Typically a reduction factor of **O(10-20), i.e. a compressed size of 5-10% of the original, is reached**, depending on the repetition and the entropy of the input data.
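
The effect of repetition in flattened arrays can be illustrated with a minimal sketch. It does not use RootInteractive; it only compares how well `zlib` compresses a per-cluster quantity and a per-track quantity that is repeated for every cluster. The sizes `n_tracks` and `n_clusters` and the helper `zip_ratio` are hypothetical choices for the illustration.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

def zip_ratio(arr):
    """Return compressed size / raw size for the bytes of the array."""
    raw = arr.tobytes()
    return len(zlib.compress(raw)) / len(raw)

n_tracks, n_clusters = 1000, 100  # hypothetical sizes for the illustration

# Per-cluster quantity: a different random value for every cluster.
per_cluster = rng.normal(size=n_tracks * n_clusters)

# Per-track quantity: one value per track, repeated for each of its clusters
# when the data is flattened to one row per cluster.
per_track = np.repeat(rng.normal(size=n_tracks), n_clusters)

ratio_cluster = zip_ratio(per_cluster)
ratio_track = zip_ratio(per_track)
print(f"per-cluster values:        {ratio_cluster:.3f}")
print(f"repeated per-track values: {ratio_track:.3f}")
```

The repeated column compresses far better than the unquantized per-cluster column, which is why common track properties contribute little to the final size.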


The code below shows how to parametrize lossy and lossless compression of the data.

Example use case for Q vectors along the track:

```
compressCDSPipe
Compresses 1 dNprimdx .* [('relative', 16), 'code', 'zip']
Compression factor 1502730 33602297 0.04472104987346549 1 dNprimdx
Compress 2 qVector .* [('relative', 16), 'code', 'zip']
Compress Factor 3637456 27522312 0.13216389669588804 2 qVector
Compress 3 region .* [('relative', 16), 'code', 'zip']
Compression factor 579220 27522277 0.02104549707133607 3 region
Compress 4 qMean .* [('relative', 16), 'code', 'zip']
Compress factor 1485573 33602294 0.04421046372607775 4 qMean
Compress 5 nTotVector .* [('relative', 16), 'code', 'zip']
Compress Factor 3336317 27522315 0.1212222518345568 5 nTotVector
Compress 6 nPrimMean .* [('relative', 16), 'code', 'zip']
Compress Factor 1502830 33602298 0.04472402452951283 6 nPrimMean
Compress 7 qStd .* [('relative', 16), 'code', 'zip']
Compress Factor 1474259 33602293 0.04387376182928945 7 qStd
Compress 8 nTotStd .* [('relative', 16), 'code', 'zip']
Compress factor 1488320 33602296 0.044292211460788274 8 nTotStd
Compress 9 nTotMean .* [('relative', 16), 'code', 'zip']
Compress Factor 1499257 33602297 0.04461769384396549 9 nTotMean
Compress 10 TransGEM .* [('relative', 16), 'code', 'zip']
Compress Factor 1461148 33602297 0.04348357494727221 10 TransGEM
Compress 11 nPrimStd .* [('relative', 16), 'code', 'zip']
Compress Factor 1487282 33602297 0.044261319397301914 11 nPrimStd
Compress 12 padLength .* [('relative', 16), 'code', 'zip']
Compress Factor 595667 27522314 0.021643056612172945 12 padLength
Compress 13 nPrimVector .* [('relative', 16), 'code', 'zip']
Compress Factor 2779472 27522316 0.10098975682133728 13 nPrimVector
Compress 14 lognPrimStd .* [('relative', 16), 'code', 'zip']
Compress Factor 1485884 33602300 0.0442197111507248 14 lognPrimStd
Compress 15 SatOn .* [('relative', 16), 'code', 'zip']
Compress Factor 150691 22962324 0.006562532607762176 15 SatOn
Compress 16 nSecSatur .* [('relative', 16), 'code', 'zip']
Compress Factor 1725854 33602298 0.05136118964244648 16 nSecSatur
Compress 17 logqStd .* [('relative', 16), 'code', 'zip']
Compress Factor 1470552 33602296 0.04376343806982713 17 logqStd
Compress 18 lognTotStd .* [('relative', 16), 'code', 'zip']
Compress Factor 1484058 33602299 0.04416537094679147 18 lognTotStd
Compress 19 lognSecSatur .* [('relative', 16), 'code', 'zip']
Compress factor 1474271 33602301 0.043874108502271914 19 lognSecSatur
Compress 20 region.factor() .* [('relative', 16), 'code', 'zip']
Compress factor 553452 6080146 0.09102610364948473 20 region.factor()
Compress 21 SatOn.factor() .* [('relative', 16), 'code', 'zip']
Compress factor 197180 6080146 0.03243014230250392 21 SatOn.factor()
Compress _all 31371473 609564013 0.051465428291285954 21

```
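
The numeric columns of the log can be read as compressed size, original size, their ratio, a running index and the column name. This reading is an assumption inferred from the numbers themselves, not from the RootInteractive documentation. The sketch below parses a few lines copied from the log and recomputes the ratio and the corresponding reduction factor.

```python
import re

# A few lines copied verbatim from the log above.
log = """\
Compression factor 1502730 33602297 0.04472104987346549 1 dNprimdx
Compress Factor 3637456 27522312 0.13216389669588804 2 qVector
Compress Factor 150691 22962324 0.006562532607762176 15 SatOn
Compress _all 31371473 609564013 0.051465428291285954 21
"""

# Assumed layout: <label> <compressed> <original> <ratio> <index> [<name>]
pattern = re.compile(r"(\d+) (\d+) (0\.\d+) (\d+)(?: (\S+))?$")

rows = []
for line in log.splitlines():
    m = pattern.search(line)
    if m is None:
        continue  # skip lines that do not report sizes
    compressed, original = int(m.group(1)), int(m.group(2))
    name = m.group(5) or "_all"  # the summary line carries no column name
    rows.append((name, compressed, original, float(m.group(3))))

for name, compressed, original, ratio in rows:
    print(f"{name:10s} ratio {compressed / original:.4f}  "
          f"reduction factor {original / compressed:.1f}")
```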



In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import output_file
from RootInteractive.InteractiveDrawing.bokeh.bokehDrawSA import bokehDrawSA
from RootInteractive.InteractiveDrawing.bokeh.bokehTools import bokehDrawArray
from RootInteractive.Tools.pandaTools import initMetadata
import pandas as pd
import numpy as np
import math
import logging
output_notebook()
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Generate data
* A and B from a normal distribution
* C from a uniform distribution on \[0, 1\]
* D from a Bernoulli distribution

* Add derived variables: two approximately normal (derivedE, derivedG) and one approximately exponential (derivedF), depending on A, B, C and D

In [None]:
npoints = 1000000
A = np.random.randn(npoints)
B = np.random.randn(npoints)
unifC = np.random.random_sample(npoints)
boolD = np.random.random_sample(npoints) > .47
derivedE = A+boolD*(A*.15-B*.4+.1)+.1*np.random.randn(npoints)
derivedF = np.random.exponential(1/((derivedE**2)+(np.sin(2*math.pi*unifC)+1.4)))
derivedG = 100+15*A+2*np.random.randn(npoints)
df = pd.DataFrame({"A":A,"B":B,"unifC":unifC,"boolD":boolD,"derivedE":derivedE,"derivedF":derivedF, "derivedG":derivedG})

## Make figures and selection widgets

In [None]:
parameterArray = [
    {"name": "size", "value":7, "range":[0, 30]},
    {"name": "legendFontSize", "value":"13px", "options":["9px", "11px", "13px", "15px"]},
    {"name": "legendVisible", "value":True},
    {"name": "nPointRender", "range":[0, 5000], "value": 1000},
]
figureArray = [
    [['derivedG'], ['derivedE'], {"colorZvar": "B"}],
    [['derivedE'], ['A','B']],
    [['unifC'], ['derivedF'], { "colorZvar": "derivedG"}],
    [['derivedF'], ['derivedG'], {"colorZvar": "derivedE", "errY": "10*A"} ],
    [['A'], ['B'], {"colorZvar": "derivedF"}],
    {"size":"size", "legend_options": {"label_text_font_size": "legendFontSize", "visible":"legendVisible"}}
]
layout = {
    "A": [
        [0, 1, 2, {'y_visible': 1, 'x_visible':1, 'plot_height': 300}],
        {'plot_height': 100, 'sizing_mode': 'scale_width', 'y_visible' : 2}
        ],
    "B": [
        [3, 4, {'y_visible': 3, 'x_visible':1, 'plot_height': 300}],
        {'plot_height': 100, 'sizing_mode': 'scale_width', 'y_visible' : 2}
        ]
}
widgetParams=[
    ['range', ['A']],
    ['range', ['B']],
    ['range', ['unifC']],
    ['multiSelect', ['boolD']],
    ['range', ['derivedE']],
    ['spinnerRange', ['derivedF']],
    ['range',["derivedG"]],
    ['toggle',['legendVisible'], {"name":"legendVisible"}],
    ['select',['legendFontSize'], {"name":"legendSize"}],
    ['slider',['size'], {"name":"markerSize"}],
    ['slider',['nPointRender'], {"name":"nPoint"}]
]
widgetLayoutDesc={
    "Selection": [[0, 1, 2], [3, 4], [5, 6], {'sizing_mode': 'scale_width'}],
    "Graphics": [["legendVisible", "nPoint"],["legendSize", "markerSize"]]
    }   

* Optimization
    * Compress the data
        * bokehDrawArray (and bokehDrawSA) take an arrayCompression parameter, which is a list of (regex, pipeline) pairs, where regex is the regular expression used to match column names
          and pipeline is a list of operations to be applied to the column. Supported values are "relative", "delta", "code", "zip" and "base64".
        * Example: 
            ``arrayCompressionParam = [
            (".conv.Sigma.*",[("relative",7), "code", "zip"]), 
            (".delta.",[("relative",10), "code", "zip"]), 
            (".i2.",[("relative",7), "code", "zip"]), 
            (".*",[("relative",8), "code", "zip"])]``
            * Variables are compressed in the given order. Once a variable has been compressed by a matching pattern, later patterns do not override it.
            * Tuple parameters: `(".conv.Sigma.*",[("relative",7), "code", "zip"])`
                * the first parameter is a regular expression matching the names of the columns to be compressed
                * the second parameter is a list of operations to be applied to the column
                    * most relevant for the user is the first element of the list, which defines the quantization
                        * "delta": precision in absolute units of the given variable, e.g. 0.0001
                        * "relative": relative precision in number of bits, e.g. 10
                    * "code", "zip"
                        * lossless compression
                        * code - factor the column into "codes" and "factors" - two columns
                        * at the time of writing this tutorial, "code" (factoring the columns) resulted in suboptimal compression because of a bug that will be fixed soon: the factors are not encoded properly
                        * zip - compress using gzip
                    * "base64"
                        * base64 encoding - as of the current version it is applied automatically where appropriate, so there should be no need to specify it
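
The two kinds of quantization described above can be sketched with plain NumPy. This is an illustration of the idea under stated assumptions, not the RootInteractive implementation: `round_relative`, `round_absolute` and `zipped_size` are hypothetical helpers, and `zlib` stands in for the "zip" step.

```python
import zlib
import numpy as np

def round_relative(x, nbits):
    """Keep about nbits bits of the mantissa (relative precision)."""
    mantissa, exponent = np.frexp(x)  # x == mantissa * 2**exponent
    mantissa = np.round(mantissa * 2**nbits) / 2**nbits
    return np.ldexp(mantissa, exponent)

def round_absolute(x, delta):
    """Round to the nearest multiple of delta (absolute precision)."""
    return np.round(x / delta) * delta

def zipped_size(arr):
    """Size in bytes of the zlib-compressed array buffer."""
    return len(zlib.compress(arr.tobytes()))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

x_rel = round_relative(x, 10)   # 10 bits of relative precision
x_abs = round_absolute(x, 0.01)  # precision of 0.01 in units of x

max_rel_err = np.max(np.abs(x_rel - x) / np.abs(x))
max_abs_err = np.max(np.abs(x_abs - x))
print("relative, 10 bits: max relative error", max_rel_err)
print("absolute, 0.01:    max absolute error", max_abs_err)
print("zipped bytes raw / relative / absolute:",
      zipped_size(x), zipped_size(x_rel), zipped_size(x_abs))
```

In this sketch the relative error stays below 2^-10 and the absolute error below half of the chosen step. Both rounded arrays compress to fewer bytes than the raw one, because rounding removes the effectively random low-order bits.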

In [None]:
arrayCompression = [
    ("unif.*", [("delta", .01), "zip"]),
    ("bool.*", ["zip"]),
    (".*", [("relative", 16), "zip"]),
]
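
To see which pipeline each column of `df` would receive under the "first matching pattern wins" rule, the selection can be sketched as follows. The helper `pick_pipeline` is hypothetical, and the use of `re.match` (anchored at the start of the column name) is an assumption about how the patterns are applied.

```python
import re

array_compression = [
    ("unif.*", [("delta", .01), "zip"]),
    ("bool.*", ["zip"]),
    (".*", [("relative", 16), "zip"]),
]
columns = ["A", "B", "unifC", "boolD", "derivedE", "derivedF", "derivedG"]

def pick_pipeline(column, rules):
    """Return the pipeline of the first rule whose regex matches the column name."""
    for regex, pipeline in rules:
        if re.match(regex, column):  # assumption: match anchored at the start
            return pipeline
    return None  # no rule matched: the column would stay uncompressed

selected = {column: pick_pipeline(column, array_compression) for column in columns}
for column, pipeline in selected.items():
    print(column, pipeline)
```

Because `".*"` matches every name, it has to come last; placed first, it would take over the `unif` and `bool` columns as well.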

In [None]:
output_file("test_compression.html")
bokehDrawSA.fromArray(df, None, figureArray, widgetParams, layout=layout,
                            widgetLayout=widgetLayoutDesc, nPointRender="nPointRender", parameterArray=parameterArray, arrayCompression=arrayCompression, useNotebook=False)

In [None]:
output_file("test_nocompression.html")
bokehDrawSA.fromArray(df, None, figureArray, widgetParams, layout=layout,
                            widgetLayout=widgetLayoutDesc, nPointRender="nPointRender", parameterArray=parameterArray, useNotebook=False)

## Option "code" in compressArray
* still to be optimized: in some cases it improves the compression, in others it does not - to be fixed in the next release
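
The idea behind "code" (splitting a column into "codes" and "factors") can be sketched with `np.unique`. This is only an illustration of the factoring step, not the RootInteractive code; the test column and the choice of the smallest unsigned integer type for the codes are assumptions.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# A column with few distinct values, e.g. after rounding to a step of 0.01.
column = np.round(rng.random(100_000) / 0.01) * 0.01

# factors: sorted distinct values; codes: index of each entry in factors.
factors, codes = np.unique(column, return_inverse=True)
# Store the codes in the smallest unsigned integer type that can hold them.
codes = codes.astype(np.min_scalar_type(len(factors) - 1))

restored = factors[codes]  # decoding is lossless

size_plain = len(zlib.compress(column.tobytes()))
size_coded = (len(zlib.compress(codes.tobytes()))
              + len(zlib.compress(factors.tobytes())))
print(len(factors), "distinct values, codes dtype", codes.dtype)
print("zipped bytes without / with coding:", size_plain, size_coded)
```

Whether factoring pays off depends on the number of distinct values and on how the codes and factors are stored afterwards.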

In [None]:
arrayCompression = [
    ("unif.*", [("delta", .01), "code", "zip"]),
    ("bool.*", ["zip"]),
    (".*", [("relative", 16), "code","zip"]),
]
output_file("test_compression_code.html")
bokehDrawSA.fromArray(df, None, figureArray, widgetParams, layout=layout,
                            widgetLayout=widgetLayoutDesc, nPointRender="nPointRender", parameterArray=parameterArray, arrayCompression=arrayCompression, useNotebook=False)

## In this particular case the compressed file is 34 % of the uncompressed size
```
-rw-r--r-- 1 miranov alice 27494089 Dec  6 20:19 test_compression.html
-rw-r--r-- 1 miranov alice 80092088 Dec  6 20:19 test_nocompression.html
-rw-r--r-- 1 miranov alice 27494247 Dec  6 20:19 test_compression_code.html
```
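
The quoted percentage follows directly from the file sizes in the listing. The short check below uses the byte counts copied from it:

```python
# File sizes in bytes, copied from the listing above.
sizes = {
    "test_nocompression.html": 80092088,
    "test_compression.html": 27494089,
    "test_compression_code.html": 27494247,
}
reference = sizes["test_nocompression.html"]
ratios = {name: size / reference for name, size in sizes.items()}
for name, ratio in ratios.items():
    print(f"{name:28s} {ratio:6.1%} of the uncompressed file")

# Difference between the outputs with and without the "code" step, in bytes.
code_overhead = sizes["test_compression_code.html"] - sizes["test_compression.html"]
print("extra bytes with 'code':", code_overhead)
```

In this example the output with "code" is 158 bytes larger than the one without it, so the factoring step brings no gain here.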