could compressFromAsciiToUTF16 achieve a better compression ratio? #17

If we can guarantee that the original text (in our case JS, CSS, HTML templates) is ASCII, this should potentially help achieve better compression. Does it make sense to you?

Comments
I'm not sure I understand the question. LZString makes no assumption about its input. Its input is a sequence of 16-bit integers, in the form of a UTF-16 string. I "hacked" LZ by creating a special dictionary entry which introduces a new token in the input. That way, the decoder doesn't need to know in advance how many different characters there will be in the end. The lower the character diversity in the input, the better the compression. So in its current form, LZString already compresses ASCII much better than other encodings, but that's just because ASCII happens to have only 256 possible characters where UTF-* has much more variety.

Half of the compression happens on the LZ side, with a dictionary-based compression algorithm. This algorithm produces a bit stream, which is then encoded either in Base64 or stuffed into a string using 15 bits per character. I'm not sure where a gain in compression ratio could be obtained. Plus, it would then be of limited utility: JS, CSS and HTML templates should be in UTF-8 anyway, just as a good practice.

I suggest you read the home page http://pieroxy.net/blog/pages/lz-string/index.html (especially the goal section). This was originally meant for localStorage, where the quota is in characters, not in bytes. Hence the effort to try and stuff a bitstream into an actual UTF-16 string.
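For illustration, here is the localStorage round-trip this was designed for: a minimal sketch using lz-string's public compressToUTF16/decompressFromUTF16 (the storage key and sample string are illustrative):

```js
// compressToUTF16 stuffs ~15 bits of compressed data into each UTF-16
// character, so the result both survives localStorage and counts fewer
// characters against the character-based quota.
var original = new Array(1000).join('JS, CSS and HTML templates '); // sample
var packed = LZString.compressToUTF16(original);
localStorage.setItem('cachedAsset', packed); // key is illustrative

// Later, read it back:
var restored = LZString.decompressFromUTF16(localStorage.getItem('cachedAsset'));
console.log(restored === original, packed.length + ' vs ' + original.length);
```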
Sorry if I was not clear or if this seems like a dumb idea. My thinking was that I could arrange for the server to return ASCII instead of UTF-8 and receive it in XHR as an ArrayBuffer by setting responseType = 'arraybuffer'. This way, if lz-string were to take not a string but an ArrayBuffer as input, then I thought it could achieve better compression, while still making the buffer UTF-16-compatible for saving in localStorage. I understand lz-string assumes strings consist of 16-bit Unicode characters.

I know that the FT Labs guys experimented with stuffing 2 ASCII chars into one UTF-16 char for 2x compression in localStorage. This made me think about ASCII and lz-string. But maybe a change in lz-string to accept a buffer of 8-bit ASCII chars is non-trivial, or maybe you feel the compressed result would not be much smaller. Anyway, I appreciate your response and the great utility you've provided to the dev community!

Btw, did anyone create a jsPerf for lz-string vs uncompressed read/write to localStorage?
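For reference, a minimal sketch of the FT Labs-style packing mentioned above, assuming plain 7-bit ASCII input (the function names are illustrative, and this is independent of lz-string):

```js
// Pack two 8-bit character codes into each 16-bit code unit. With 7-bit
// ASCII the high byte stays <= 0x7F, so no code unit lands in the
// surrogate range (0xD800-0xDFFF) that some storage layers mishandle.
function packAscii(str) {
  if (str.length % 2 === 1) str += '\0'; // pad to an even length
  var out = '';
  for (var i = 0; i < str.length; i += 2) {
    out += String.fromCharCode((str.charCodeAt(i) << 8) | str.charCodeAt(i + 1));
  }
  return out;
}

function unpackAscii(packed) {
  var out = '';
  for (var i = 0; i < packed.length; i++) {
    var c = packed.charCodeAt(i);
    out += String.fromCharCode(c >> 8, c & 0xFF);
  }
  return out.replace(/\0$/, ''); // strip the padding byte, if any
}

// localStorage quotas count characters, so this halves the cost of ASCII:
var source = 'function add(a, b) { return a + b; }';
localStorage.setItem('asset', packAscii(source));
console.log(unpackAscii(localStorage.getItem('asset')) === source); // true
```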
There are two things in LZString: the compression part (based on LZW) and the encoding part (stuffing 15 bits of data into each character of a UTF-16 string). Obviously, the second part is oblivious to the input type. The first one, however, is more sensitive to it. Maybe preinitializing the dictionary with all 256 chars could optimize something, but I seriously doubt it, since code files use at most ~80-90 different chars. In other words, your bit space is vastly underfilled, and more likely than not your tokens could be encoded on 7 bits, maybe even 6 bits in some cases. That is precisely the optimization I did in LZW, since I was starting with a 16-bit token space (impossible to preinitialize the dictionary with every token). So no, I don't think it would give interesting results. But that is just a hunch ;-)

AFAIK, no jsPerf focuses on IO in localStorage.
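The "~80-90 different chars" point is easy to check on your own assets; a hedged sketch (the sample string stands in for a real file's contents):

```js
// Count distinct characters in a source file and derive the token width
// a lazily-built dictionary needs to address them at the start.
function distinctChars(source) {
  var seen = {};
  for (var i = 0; i < source.length; i++) seen[source.charAt(i)] = true;
  return Object.keys(seen).length;
}

var source = 'function add(a, b) { return a + b; }'; // stand-in for a file
var n = distinctChars(source);
var bits = Math.ceil(Math.log2(n)); // ~7 for typical JS/CSS
console.log(n + ' distinct chars -> ' + bits + '-bit tokens to start');
// Preloading all 256 byte values would force 8-bit (and soon wider) tokens
// from the very first phrase, which is why the lazy dictionary wins here.
```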