Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid 0xFF00 markers in Huffman data #27

Open
kornelski opened this issue Mar 19, 2014 · 1 comment
Open

Avoid 0xFF00 markers in Huffman data #27

kornelski opened this issue Mar 19, 2014 · 1 comment

Comments

@kornelski
Copy link
Member

0xFF bytes in Huffman data need special coding that adds another byte of overhead. If this overhead could be reliably minimized, then approximation of size if scan data described #15 would be more effective.

Here's an algorithm off the top of my head:

  1. Find which code happens to overlap the 0xFF bytes most frequently
  2. Swap the code with another code of the same length (if there isn't a code to swap with — maybe modify/regenerate the Huffman tree (e.g. flip all bits in all codes?))
  3. goto 1 as long as result keeps getting smaller

And/Or:

  1. For all symbols in the stream make histogram of pairs (code[symbol[n-1]], code[symbol[n]])
  2. Find which pairs have 8+ consecutive 1s.
  3. Starting from most frequent pair swap codes with other codes of the same length to break the streaks.

Likely it can be improved.

@frkay
Copy link

frkay commented Mar 19, 2014

I have to check how codes having the same Huffman code length are stored in the file, I seem to recall that it's by increasing value and nothing related to there actual frequency (a part from having the same Huffman code length which comes from the fact that they have similar frequencies).
i.e codes of length 5, 4 entries : 7 9 10 12
But if 12 appears a bit more frequently than 10 and unfortunately has a code ending with a lot of ones like 00111 it could be smart move to change the recording order to 7 9 12 10 and this time 12 would get a less "dangerous" 00110 code and 10 would get 00111.

I'm pretty sure the histogram of pairs will not work since there is often raw binary data in-between two Huffman codes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants