Skip to content

Any interest in adding a table based decode? #153

@Thell

Description

@Thell

Hello. I've been enjoying your module and associated writings/docs! May I 👏 and say 'Well done!'?

I'm an R and Rcpp/c++ kinda guy who had never really looked at Python until just a few weeks ago and immediately had a use for your module and have found it easy to use and instructive but since I was exploring I obviously was profiling a ton while learning and found two things that I thought might be of interest.

First let me present some benchmark numbers...

*** Large Sample Size (Symbols: 12, Max Code Length: 6, In Bits: 132074, Out Symbols: 42858)

Clock type: CPU
Ordered by: ttot, asc

name                                                                    ncall            tsub      ttot      tavg      
DecodeTableTest.py:81 ba_C_iterdecode                                   10000            0.017626  0.098321  0.000010
DecodeTableTest.py:89 ba_C_decode                                       10000            0.022404  0.111367  0.000011
DecodeTableTest.py:139 baTo01_Py_lookupDict                             10000            0.693772  0.715359  0.000072
DecodeTableTest.py:122 baTo01_Py_lookupTbl_builtin_int                  10000            0.863217  0.885142  0.000089
DecodeTableTest.py:177 rawBytes_Py_lookupTbl_intfrombytes_memoryview    10000            2.719999  4.090594  0.000409
DecodeTableTest.py:195 ba_Py_lookupTbl_builtin_int_baMemoryviewTo01     10000            3.397645  4.807507  0.000481
DecodeTableTest.py:155 ba_Py_lookupTbl_builtin_int_shift                10000            3.415027  4.837891  0.000484
DecodeTableTest.py:97 ba_Py_iterdecode                                  10000            0.033506  13.49097  0.001349
DecodeTableTest.py:105 ba_Py_lookupTbl_ba2int                           10000            3.263269  33.08122  0.003308
DecodeTableTest.py:211 ba_Py_lookupTbl_ba2int_baMemoryview              10000            3.923426  33.38520  0.003339

(where ba is bitarray, and baTo01 is the string representation and the table lookups are list lookups of symbol/freq tuples and the memoryview is on the first max code length bits but shifting the whole underlying bitarray)

My first interest was in using a table lookup for the decode instead of a tree so I figured if I compared the Python iterdecode() to the C I could get a feel for the gains to be had going from Python to the underlying C and as you already know its a massive improvement (ba_Py_iterdecode) and I am thinking that even if a table lookup is a fraction of that improvement then it could possibly blaze past the C tree decodes.

I haven't looked into the Python C bindings modules yet and that is my next step (or, preferably, C++ bindings) but if you are interested in it then you know your code and where it'd plug in. I was thinking along the lines of a .decode(decodeTable) method where the decodeTable would be treated in the same fashion as the decode dict / decodeTree setup.


The other thing I wanted to mention, and you can see in the benchmarks, was ba2int() is slow. I see what it is doing with the checks and padding and such but thought perhaps I am missing something or perhaps it is something added for convenience that hasn't had usage enough to decide if it is worth working on to make it faster... 🤷

/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:19 t_bitarray_ba2int_frombytes                   10000            0.020761  0.094871  0.000009
/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:13 t_bitarray_ba2int_fromstrings                 10000            0.013938  0.077259  0.000008
/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:41 t_builtin_int_frombytes                       10000            0.011092  0.016091  0.000002
/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:36 t_builtin_int_fromstrings                     10000            0.006036  0.006036  0.000001

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions