In [14]:
import sys
import timeit
import hashlib
import numpy as np

from sherpa.models.model import boolean_to_byte

In [51]:
# To first order, only the size of the token matters because it is just a long byte string,
# but for readability let's build it the same way the code does.
# Pick a few parameters, an integrate flag, and different sizes for the xlo array.
# Ignore xhi: for hashing it doesn't matter if we have e.g. 1000 values in xlo and 1000 values in xhi
# or 2000 values in xlo and 0 in xhi.
pars = (5, 1.23, 234, 234, 1e5)
integrate = True
# Make tokens of a few different sizes
data = {
    "typical binned x-ray spectrum": [
        np.array(pars).tobytes(),
        boolean_to_byte(integrate),
        np.linspace(0.1, 8.0).tobytes(),
    ],
    "unbinned ACIS spectrum": [
        np.array(pars).tobytes(),
        boolean_to_byte(integrate),
        np.arange(1024).tobytes(),  # 1024 values; arange(0) would be empty and contradict the 8.07 kB printed below
    ],
    "unbinned grating spectrum": [
        np.array(pars).tobytes(),
        boolean_to_byte(integrate),
        np.arange(1.0, 41.96, 0.005).tobytes(),
    ],
    "long UV / optical spectrum": [
        np.array(pars).tobytes(),
        boolean_to_byte(integrate),
        np.arange(int(1e5)).tobytes(),
    ],
}

token = {k: b''.join(v) for k, v in data.items()}

In [11]:
for k, v in token.items():
    print(f"size of token for {k}: {sys.getsizeof(v) / 1024} kB")

size of token for typical binned x-ray spectrum: 0.462890625 kB
size of token for unbinned ACIS spectrum: 8.072265625 kB
size of token for unbinned grating spectrum: 16.072265625 kB
size of token for long UV / optical spectrum: 781.322265625 kB
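
The size of the smallest token can be checked by hand: 8 bytes per parameter, 1 byte for the integrate flag, and 8 bytes per xlo value, plus the fixed overhead that sys.getsizeof reports for any bytes object. The following is a minimal sketch of that arithmetic. It assumes that boolean_to_byte returns a single byte; the stand-in defined here is hypothetical and only exists so that the block runs without Sherpa installed.

```python
import sys

import numpy as np


def boolean_to_byte(flag):
    # Hypothetical stand-in for sherpa.models.model.boolean_to_byte,
    # assumed to map a bool onto a single byte.
    return b"1" if flag else b"0"


pars = (5, 1.23, 234, 234, 1e5)
xlo = np.linspace(0.1, 8.0)  # np.linspace defaults to 50 float64 values

token = b"".join(
    [np.array(pars).tobytes(), boolean_to_byte(True), xlo.tobytes()]
)

# 8 bytes per float64 parameter + 1 flag byte + 8 bytes per xlo value
expected_len = 8 * len(pars) + 1 + 8 * xlo.size

# sys.getsizeof also counts the bytes object header, so it is always
# a little larger than len(token).
overhead = sys.getsizeof(token) - len(token)

print(f"payload: {len(token)} bytes, expected: {expected_len} bytes")
print(f"object overhead: {overhead} bytes")
```

The same formula applies to the larger tokens; only xlo.size changes.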


So, one of the main advantages of using a hash is that it is much smaller than the original data, but it's worth noting that for small data we could do without one. By default, each cache holds 5 items, so if a user has 20 different model instances in Sherpa, that's 100 cached tokens, which adds up to only a few MB even for the unbinned grating spectrum. The cached value always has the same length as the xlo array, so using the raw xlo array (and possibly the xhi array) as the key at the very most triples the size of the cache dict compared to hashing the xlo/xhi arrays. I posit that in most cases the size of the cache dict is either negligible (a few MB at most) or so big that holding the values alone is already too much.
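
To make the "at most triples" argument concrete, here is a rough back-of-the-envelope sketch. The function cache_bytes and its arguments are made up for illustration; it counts only the raw payload (key plus cached values), ignores Python object overhead and the handful of parameter bytes, and assumes 8-byte values and a 32-byte digest such as the one sha256 produces.

```python
def cache_bytes(n_points, key_bytes, n_models=20, cache_size=5, itemsize=8):
    """Rough payload size in bytes of all caches (hypothetical helper).

    n_points  : number of values in xlo (the cached result has the same length)
    key_bytes : size of one cache key in bytes
    """
    value_bytes = n_points * itemsize
    return n_models * cache_size * (key_bytes + value_bytes)


n_points = 8192   # roughly the unbinned grating case
digest_size = 32  # length of a sha256 digest in bytes

hashed = cache_bytes(n_points, key_bytes=digest_size)
raw_xlo = cache_bytes(n_points, key_bytes=8 * n_points)
raw_xlo_xhi = cache_bytes(n_points, key_bytes=2 * 8 * n_points)

for label, size in [
    ("hashed key", hashed),
    ("raw xlo key", raw_xlo),
    ("raw xlo + xhi key", raw_xlo_xhi),
]:
    print(f"{label:>18}: {size / 1e6:6.2f} MB ({size / hashed:.2f}x hashed)")
```

With a hashed key the cache is dominated by the stored values, so a raw xlo key roughly doubles the total and a raw xlo plus xhi key stays just under three times the total.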

Still, I see the value of using a hash if we can find one that's fast.

In [43]:
number = 1000

for k, v in token.items():
    print(f"hashing token for {k}:")
    for hashname in ['sha1', 'sha256', 'md5']:  # or: hashlib.algorithms_available
        try:
            hashfunc = getattr(hashlib, hashname)
        except AttributeError:
            # Not every algorithm is exposed as a module-level constructor.
            continue
        try:
            # Note: hashfunc(v) is evaluated once, before timing starts, so this
            # measures only repeated calls to .digest() on an existing hash object.
            elapsed = timeit.timeit(hashfunc(v).digest, number=number)
        except Exception:
            # Some algorithms (e.g. the SHAKE family) need extra arguments to digest().
            continue
        print(f"  {hashname}: {elapsed * 1e6 / number:.3f} microsec")

hashing token for typical binned x-ray spectrum:
  sha1: 0.654 microsec
  sha256: 0.499 microsec
  md5: 0.774 microsec
hashing token for unbinned ACIS spectrum:
  sha1: 0.402 microsec
  sha256: 0.356 microsec
  md5: 0.440 microsec
hashing token for unbinned grating spectrum:
  sha1: 0.746 microsec
  sha256: 0.437 microsec
  md5: 0.742 microsec
hashing token for long UV / optical spectrum:
  sha1: 0.346 microsec
  sha256: 0.329 microsec
  md5: 0.423 microsec


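We used the loop above to go over all available hashing algorithms, but there is a lot of variability between runs. Note also that the loop passes the bound method hashfunc(v).digest to timeit, so the hash object is built once outside the timed region and only the digest() call is measured; that is why those numbers do not scale with the token size. Using IPython's "%timeit" magic is more robust because it automatically chooses the number of loops, repeats the measurement several times, and reports the mean and standard deviation. So, I selected the most promising algorithms and checked those in detail. In the cells below, hashfunc(v) is re-evaluated in every loop, so the cost of hashing the token is included.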
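
For reference, a plain-Python timing loop that includes the hashing cost could look like the sketch below. The helper name time_hash and the zero-filled payloads are illustrative choices, not part of the original notebook; the payloads only match the lengths of the smallest and largest tokens. Wrapping the call in a lambda makes timeit rebuild the hash object on every iteration, and taking the minimum over several repeats reduces the run-to-run scatter.

```python
import hashlib
import timeit


def time_hash(hashname, payload, number=1000, repeat=5):
    """Return the best per-call time in seconds for hashing payload."""
    hashfunc = getattr(hashlib, hashname)
    # The lambda constructs a new hash object on every call, so digesting
    # the payload happens inside the timed region.
    timer = timeit.Timer(lambda: hashfunc(payload).digest())
    return min(timer.repeat(repeat=repeat, number=number)) / number


small = bytes(441)      # same length as the 50-bin token
large = bytes(800_041)  # same length as the 1e5-element token

results = {}
for name in ("sha1", "sha256", "md5"):
    t_small = time_hash(name, small)
    t_large = time_hash(name, large, number=20)
    results[name] = (t_small, t_large)
    print(
        f"{name}: {t_small * 1e6:.2f} microsec (441 B), "
        f"{t_large * 1e6:.1f} microsec (800 kB)"
    )
```

Measured this way, the time per call should grow with the payload size, in line with the "%timeit" results.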

In [48]:
hashfunc = hashlib.sha1
v = token["typical binned x-ray spectrum"]
%timeit hashfunc(v).digest


358 ns ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [49]:
hashfunc = hashlib.sha256
v = token["typical binned x-ray spectrum"]
%timeit hashfunc(v).digest


345 ns ± 28.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [50]:
hashfunc = hashlib.md5
v = token["typical binned x-ray spectrum"]
%timeit hashfunc(v).digest


764 ns ± 29.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [52]:
hashfunc = hashlib.sha1
v = token["unbinned grating spectrum"]
%timeit hashfunc(v).digest


27.7 μs ± 227 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [53]:
hashfunc = hashlib.sha256
v = token["unbinned grating spectrum"]
%timeit hashfunc(v).digest


28.2 μs ± 2.26 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [54]:
hashfunc = hashlib.md5
v = token["unbinned grating spectrum"]
%timeit hashfunc(v).digest


99 μs ± 36.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [46]:
hashfunc = hashlib.sha256
v = token["long UV / optical spectrum"]
%timeit hashfunc(v).digest

332 μs ± 5.53 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [45]:
hashfunc = hashlib.sha1
v = token["long UV / optical spectrum"]
%timeit hashfunc(v).digest

351 μs ± 41.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [47]:
hashfunc = hashlib.md5
v = token["long UV / optical spectrum"]
%timeit hashfunc(v).digest


1.25 ms ± 54.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
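
To compare the algorithms independently of token size, the timings for the largest token can be converted into throughput. This is only arithmetic on the mean values reported above; the token length assumes 8-byte array elements and a 1-byte integrate flag, as in the size check earlier.

```python
# Payload of the "long UV / optical spectrum" token:
# 5 parameters * 8 bytes + 1 flag byte + 1e5 xlo values * 8 bytes
token_bytes = 5 * 8 + 1 + 8 * 100_000

# Mean times per loop copied from the %timeit output above, in seconds
timings_s = {
    "sha256": 332e-6,
    "sha1": 351e-6,
    "md5": 1.25e-3,
}

# Throughput in GB/s (1 GB = 1e9 bytes)
throughput = {name: token_bytes / t / 1e9 for name, t in timings_s.items()}

for name, gb_per_s in sorted(throughput.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gb_per_s:.2f} GB/s")

print(f"md5 / sha256 time ratio: {timings_s['md5'] / timings_s['sha256']:.1f}")
```

On this machine sha1 and sha256 are close to each other, while md5 takes several times longer for the same token.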
