Shrink generated packages #64
Emitting the data as an array of integers inflates it by a factor of
approximately 3.57.
10 byte values require 1 digit + comma
90 byte values require 2 digits + comma
156 byte values require 3 digits + comma
2*10/256 + 3*90/256 + 4*156/256 =~ 3.57
Base64, on the other hand, inflates the data by a factor of only 4/3.
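The arithmetic above can be checked with a short sketch. Like the estimate itself, it assumes every byte value from 0 to 255 is equally likely; the function name is made up for illustration.

```javascript
// Average number of characters per byte when each byte value 0..255 is
// written as a decimal number followed by a comma, as in "[0,1,2,...]".
// Assumption: all 256 byte values are equally likely.
function decimalArrayInflation() {
  let totalChars = 0;
  for (let value = 0; value < 256; value++) {
    totalChars += String(value).length + 1; // digits + comma
  }
  return totalChars / 256;
}

// Base64 maps every 3 input bytes to 4 output characters.
const base64Inflation = 4 / 3;

console.log(decimalArrayInflation().toFixed(2)); // 3.57
console.log(base64Inflation.toFixed(2)); // 1.33
```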
mathiasbynens left a comment:
Nice work!
Have you done any testing to confirm that the new generated output matches the old generated output from a user perspective? I.e. all exported arrays are still deeply equal to each other, etc.
I've run the existing test, and it passes. Then I took the list of filenames from that test, plus a few other than [...] The arrays are not deepEqual, but they contain the same code points. This is because the code points are not sorted in the released package, whereas here I needed to sort them for run-length encoding.
I'm surprised the arrays aren't already sorted. Let me take a look.
Independent of this patch, I think we should enforce + test sorting of code points. I was expecting them to already be sorted in the published packages.
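A minimal sketch of the kind of order-insensitive comparison described above. The helper name and the sample arrays are made up for illustration; this is not the check that was actually run.

```javascript
const { isDeepStrictEqual } = require('util');

// Hypothetical helper: true when both arrays hold the same code points,
// regardless of the order they are stored in.
function sameCodePoints(a, b) {
  const numerically = (x, y) => x - y;
  // Copy before sorting so the inputs are left untouched.
  return isDeepStrictEqual([...a].sort(numerically), [...b].sort(numerically));
}

const released = [0x62, 0x41, 0x61]; // unsorted, made-up sample
const regenerated = [0x41, 0x61, 0x62]; // sorted, as needed for run-length encoding

console.log(isDeepStrictEqual(released, regenerated)); // false
console.log(sameCodePoints(released, regenerated)); // true
```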
codePoints.length > 999 ? gzipInline : jsesc
)(codePoints)
);
let codePointsExports = `require('./ranges').flatMap(r=>Array.from(r.keys()))`;
- let codePointsExports = `require('./ranges').flatMap(r=>Array.from(r.keys()))`;
+ let codePointsExports = `require('./ranges.js').flatMap(r=>Array.from(r.keys()))`;
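To illustrate what the generated expression does, here is a sketch that expands a list of ranges into individual code points. The `CodePointRange` class is a hypothetical stand-in for whatever `ranges.js` exports; the only assumption is that each range has a `keys()` method that iterates over its code points.

```javascript
// Hypothetical stand-in for the objects exported by ranges.js:
// an inclusive range whose keys() yields every code point in it.
class CodePointRange {
  constructor(begin, end) {
    this.begin = begin;
    this.end = end;
  }

  *keys() {
    for (let codePoint = this.begin; codePoint <= this.end; codePoint++) {
      yield codePoint;
    }
  }
}

const ranges = [new CodePointRange(0x41, 0x43), new CodePointRange(0x61, 0x62)];

// Same shape as the generated export: flatten every range into code points.
const codePoints = ranges.flatMap((r) => Array.from(r.keys()));

console.log(codePoints); // [ 65, 66, 67, 97, 98 ]
```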
Thanks!
Thanks so much for your work on this!
Happy to have smaller dependencies ;)
There is a glaring inefficiency in the serialization of gzipped data, which the first commit in this PR addresses. Constructing the gunzip input buffer from a base64-encoded string instead of an array of numbers shrinks the generated unicode-13 package from ~150MB down to ~60MB.
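A rough sketch of the two serializations, using Node's built-in `zlib`. The payload is made up, and the emitted source strings only approximate what a generator might write; they are not the generator's actual output.

```javascript
const zlib = require('zlib');

// Made-up payload standing in for a generated data file.
const original = Buffer.from(
  JSON.stringify(Array.from({ length: 5000 }, (_, i) => i * 3))
);
const gzipped = zlib.gzipSync(original);

// Two ways to embed the gzipped bytes in generated source code.
const asNumberArray = `Buffer.from([${Array.from(gzipped).join(',')}])`;
const asBase64 = `Buffer.from('${gzipped.toString('base64')}','base64')`;

console.log(asNumberArray.length > asBase64.length); // true

// Decoding the base64 form and gunzipping recovers the original bytes.
const decoded = zlib.gunzipSync(
  Buffer.from(gzipped.toString('base64'), 'base64')
);
console.log(decoded.equals(original)); // true
```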
The next two commits are a bit more complicated.
The second commit compresses binary properties. It comes from the observation that many generated code-points.js and symbols.js files contain long sequences of adjacent code points. These can be run-length encoded, and the lengths (which are often short) then encoded in a variable-length base64 prefix code. This shrinks the generated unicode-13 package from ~60MB down to ~15.5MB.
The third commit applies run-length encoding to named properties, shrinking the generated package down to ~13.5MB.
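The run-length idea can be sketched as follows. This is a simplified illustration that stores runs as plain `[start, length]` pairs; it does not reproduce the variable-length base64 prefix code used in the PR, and the function names are made up.

```javascript
// Collapse sorted, de-duplicated code points into [start, length] runs.
function toRuns(sortedCodePoints) {
  const runs = [];
  for (const codePoint of sortedCodePoints) {
    const last = runs[runs.length - 1];
    if (last && last[0] + last[1] === codePoint) {
      last[1]++; // extends the current run of adjacent code points
    } else {
      runs.push([codePoint, 1]); // starts a new run
    }
  }
  return runs;
}

// Expand [start, length] runs back into individual code points.
function fromRuns(runs) {
  return runs.flatMap(([start, length]) =>
    Array.from({ length }, (_, i) => start + i)
  );
}

const codePoints = [0x41, 0x42, 0x43, 0x61, 0x62, 0x1f600];
const runs = toRuns(codePoints);

console.log(runs); // [ [ 65, 3 ], [ 97, 2 ], [ 128512, 1 ] ]
```

Because the encoder relies on adjacent values being next to each other, the input has to be sorted first, which is why the regenerated arrays differ in order from the released ones.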