Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Literal/Length code with 256-wide sub-tree a problem #1

Closed
gp48k opened this issue Nov 5, 2017 · 3 comments
Closed

Literal/Length code with 256-wide sub-tree a problem #1

gp48k opened this issue Nov 5, 2017 · 3 comments
Assignees

Comments

@gp48k
Copy link

gp48k commented Nov 5, 2017

I'm pretty sure this file: rep16.bin.gz will not decode properly. The literal/length Huffman tree has 256 symbols whose code length is 9. This will overflow nBitCode_totalCount .

I imagine this is not a very common case but nonetheless I think it should be fixed. I've not the 6502 experience to suggest a good approach but it is clear that at most only one code length of the tree can possibly have 256 symbols. I expect handling that possible special case shouldn't add too much code or slow things down very much.

It pains me to report this bug as the subroutine is a wonderful piece of work with some very good insights into the deflate format from an 8-bit perspective.

@pfusik pfusik self-assigned this Nov 6, 2017
@pfusik
Copy link
Owner

pfusik commented Nov 6, 2017

Thank you for reporting this. How did you get this stream?

I will try to fix the routine. Is the workaround of detecting this case and recompressing the data to avoid the problem acceptable for you?

@pfusik pfusik closed this as completed in 02c09a5 Nov 6, 2017
@pfusik
Copy link
Owner

pfusik commented Nov 6, 2017

I fixed this by getting rid of nbitCode_totalCount. This even saves 7 bytes of code. The cost is slower fetchCode - zopfli-compressed GPL v3 takes 3% more time.

Now I wonder if:

  • more than 256 codes of same length
  • all 256 literal codes of same length

are possible? If you are able to generate such streams, please submit separate issues.

Thank you!

@gp48k
Copy link
Author

gp48k commented Nov 6, 2017

Wow, fast work. rep16.bin.gz comes from using standard gzip on a file I created specially. All I did was create some data that is compressible and only has 1 occurrence of literal values 0 .. 254. Since 256 (end of block code) appears once there are 256 symbols with the same frequency so the Huffman encoding gives them the same code length. Here's the little C program I wrote:

int main(int argc, char *argv[])
{
	int i, j;
	FILE *fp = fopen("rep16.bin", "wb");
	for (i = 0; i < 255; i++) {
		int n = i == 0 ? 1 : 16;
		for (j = 0; j < n; j++) {
			fputc(i, fp);
		}
	}
	fclose(fp);
	return 0;
}

I think both trees you suggest are possible by directly constructing the deflate data. I'm not so sure if one could make a data file that results in those trees when run through gzip. I'll have to think about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants