could compressFromAsciiToUTF16 achieve a better compression ratio? #17

Closed
urbien opened this issue Nov 24, 2013 · 3 comments

Comments

@urbien

urbien commented Nov 24, 2013

If we can guarantee that the original text (in our case JS, CSS, and HTML templates) is ASCII, that could potentially help achieve better compression. Does that make sense to you?

@pieroxy
Owner

pieroxy commented Nov 24, 2013

I'm not sure I understand the question. LZString makes no assumptions about its input. Its input is a sequence of 16-bit integers, in the form of a UTF-16 string. I "hacked" LZ by creating a special dictionary entry that introduces a new token in the input. That way, the decoder doesn't need to know in advance how many different characters there will be in the end. The less character diversity in the input, the better the compression. So in its current form, LZString already compresses ASCII much better than other encodings. But that's just because ASCII has only 128 possible characters (256 for its 8-bit extensions), whereas UTF-* has much more variety.
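
To make the "special dictionary entry" idea concrete, here is a minimal sketch of an LZW-style coder whose alphabet grows on demand. It is not LZString's actual code: the `ESC` token value, the function names `lzwEncode`/`lzwDecode`, and the token-array output (instead of a bit stream) are all assumptions made for illustration.

```javascript
// Sketch only: an LZW-style coder that starts with an empty alphabet.
// ESC (token 0) is a hypothetical "a new character follows" marker.
const ESC = 0;

function lzwEncode(input) {
  const dict = new Map(); // phrase -> token
  let next = 1;           // token 0 is reserved for ESC
  const out = [];
  let w = "";
  for (let i = 0; i < input.length; i++) {
    const c = input[i];
    if (!dict.has(c)) {
      // First sighting of c: flush the pending phrase, then emit a literal.
      if (w !== "") out.push(dict.get(w));
      out.push(ESC, c.charCodeAt(0));
      dict.set(c, next++);
      w = "";
    } else if (dict.has(w + c)) {
      w += c; // keep extending the current phrase
    } else {
      out.push(dict.get(w));
      dict.set(w + c, next++);
      w = c;
    }
  }
  if (w !== "") out.push(dict.get(w));
  return out;
}

function lzwDecode(tokens) {
  const dict = [null]; // index 0 mirrors the reserved ESC token
  let prev = "";
  let result = "";
  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i];
    if (t === ESC) {
      // The next token is a raw character code, not a dictionary index.
      const ch = String.fromCharCode(tokens[++i]);
      dict.push(ch);
      result += ch;
      prev = "";
      continue;
    }
    let entry;
    if (t < dict.length) entry = dict[t];
    else if (t === dict.length && prev !== "") entry = prev + prev[0];
    else throw new Error("corrupt token stream");
    result += entry;
    if (prev !== "") dict.push(prev + entry[0]);
    prev = entry;
  }
  return result;
}
```

Because the dictionary starts empty, an input that uses only a few dozen distinct characters keeps the token numbers small, which is the effect described above.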

Half of the compression happens in the LZ part, with a dictionary-based compression algorithm. This algorithm produces a bit stream. That stream is then either encoded in Base64 or stuffed into a string using 15 bits per character.
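
As a rough illustration of the "15 bits per character" encoding step, the sketch below packs an array of bits into a string. It is not taken from LZString's source: the names `packBits`/`unpackBits`, the bit-array input, and the offset of 32 are assumptions for this example.

```javascript
// Sketch only: store a bit stream in a string, 15 bits per UTF-16 code unit.
const BITS_PER_CHAR = 15;
const OFFSET = 32; // assumed offset that keeps values out of the control-character range

function packBits(bits) {
  let out = "";
  for (let i = 0; i < bits.length; i += BITS_PER_CHAR) {
    let value = 0;
    for (let j = 0; j < BITS_PER_CHAR; j++) {
      // Missing bits in the last chunk are padded with zeros.
      value = (value << 1) | (bits[i + j] || 0);
    }
    out += String.fromCharCode(value + OFFSET);
  }
  return out;
}

function unpackBits(str, bitCount) {
  const bits = [];
  for (let i = 0; i < str.length; i++) {
    const value = str.charCodeAt(i) - OFFSET;
    for (let j = BITS_PER_CHAR - 1; j >= 0; j--) {
      bits.push((value >> j) & 1);
    }
  }
  // The caller supplies the original length so the zero padding can be dropped.
  return bits.slice(0, bitCount);
}
```

In this sketch the largest code unit is 32767 + 32 = 32799, which stays below the surrogate range that starts at 0xD800, so every character is a valid standalone UTF-16 code unit.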

I'm not sure where a gain in compression ratio could be obtained. Plus, it would then be of a limited utility. JS, CSS and HTML templates should be in UTF-8 anyway just as a good practice.

I suggest you read the home page http://pieroxy.net/blog/pages/lz-string/index.html (especially the goal section.) This was originally meant for localStorage where the quota is in characters, not in bytes. Hence the effort to try and stuff a bitstream into an actual UTF-16 string.

@urbien
Author

urbien commented Nov 24, 2013

Sorry if I was not clear or if this seems like a dumb idea. My thinking was that I could arrange for the server to return ASCII rather than UTF-8 and receive it in XHR as an ArrayBuffer by setting responseType = 'arraybuffer'. That way, if lz-string took an ArrayBuffer rather than a string as input, I thought it could achieve better compression while still producing UTF-16-compatible output for saving in localStorage.

I understand lz-string assumes strings consist of 16-bit Unicode characters. I know that the FT Labs guys experimented with stuffing 2 ASCII chars into each UTF-16 char for 2x compression in localStorage. This made me think about ASCII and lz-string.
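
For reference, one way such two-per-character packing could work is sketched below. This is not necessarily the FT Labs scheme; `packAscii`/`unpackAscii` are hypothetical names, and the sketch assumes the input is 7-bit ASCII with no NUL characters (a zero low byte is used to mark odd-length padding).

```javascript
// Sketch only: store two 7-bit ASCII characters in each UTF-16 code unit.
function packAscii(text) {
  let out = "";
  for (let i = 0; i < text.length; i += 2) {
    const hi = text.charCodeAt(i);
    // An odd-length input gets a zero low byte as padding.
    const lo = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (hi > 0x7f || lo > 0x7f) throw new RangeError("input is not 7-bit ASCII");
    out += String.fromCharCode((hi << 8) | lo);
  }
  return out;
}

function unpackAscii(packed) {
  let out = "";
  for (let i = 0; i < packed.length; i++) {
    const unit = packed.charCodeAt(i);
    out += String.fromCharCode(unit >> 8);
    const lo = unit & 0xff;
    if (lo !== 0) out += String.fromCharCode(lo); // skip the padding byte
  }
  return out;
}
```

This halves the character count without any real compression; the largest packed value is 0x7F7F, so no surrogate code units are produced.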

But maybe changing lz-string to accept a buffer of 8-bit ASCII chars is non-trivial, or maybe you feel the compressed result would not be much smaller. Anyway, I appreciate your response and the great utility you provided to the dev community!

Btw, has anyone created a jsPerf for lz-string vs. uncompressed reads/writes to localStorage?

@pieroxy
Owner

pieroxy commented Nov 24, 2013

There are two things in LZString: the compression part (based on LZW) and the encoding part (stuffing 15 bits of data into each character of a UTF-16 string).

Obviously the second part is oblivious to the input type. The first one, however, is more sensitive to it. Maybe preinitializing the dictionary with all 256 chars could optimize something, but I seriously doubt it: code files use at most ~80-90 different chars. In other words, your bit space is vastly underfilled, and more likely than not your tokens could be encoded on 7 bits, maybe even 6 bits in some cases. That is precisely the optimization I made to LZW, since I was starting with a 16-bit token space (impossible to preinitialize the dictionary with every token). So no, I don't think it would give interesting results. But that is just a hunch ;-)
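
The 6-to-7-bit estimate follows from the alphabet size: n distinct symbols need ceil(log2(n)) bits each. A small sketch (the helper name `alphabetStats` is made up for this example) to check it on a given file's contents:

```javascript
// Sketch only: count the distinct UTF-16 code units in a string and the
// number of bits needed to give each one a fixed-width code.
function alphabetStats(text) {
  const distinct = new Set();
  for (let i = 0; i < text.length; i++) {
    distinct.add(text.charCodeAt(i));
  }
  const size = distinct.size;
  // Smallest bit width whose range covers `size` symbols (at least 1 bit).
  let bitsPerSymbol = 1;
  while ((1 << bitsPerSymbol) < size) bitsPerSymbol++;
  return { size, bitsPerSymbol };
}
```

With 65 to 128 distinct characters this gives 7 bits, and with 64 or fewer it gives 6 bits or less, compared with 8 bits for a dictionary preinitialized with all 256 byte values.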

AFAIK, no jsPerf focuses on I/O in localStorage.

pieroxy closed this as completed Nov 26, 2013