Shrink generated packages #64

Merged
mathiasbynens merged 4 commits into node-unicode:master from lightmare:shrink
Jun 28, 2021

@lightmare
Contributor

The serialization of gzipped data has a glaring inefficiency, which the first commit in this PR addresses. Constructing the gunzip input buffer from a base64-encoded string instead of an array of numbers shrinks the generated unicode-13 package from ~150MB down to ~60MB.
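
To make the difference concrete, here is a minimal sketch of the two serializations, not the generator's actual code. The sample data and the variable names (`asNumberArray`, `asBase64`) are assumptions for illustration; only Node's built-in `zlib` and `Buffer` APIs are used.

```javascript
const zlib = require('zlib');

// Hypothetical sample data: 50,000 consecutive code points, serialized as JSON.
const codePoints = Array.from({ length: 50000 }, (_, i) => 0x4E00 + i);
const gzipped = zlib.gzipSync(JSON.stringify(codePoints));

// Old style: the gzipped bytes written out as an array literal of numbers.
const asNumberArray = `Buffer.from([${Array.from(gzipped).join(',')}])`;
// New style: the same bytes written out as a base64 string literal.
const asBase64 = `Buffer.from('${gzipped.toString('base64')}','base64')`;

// Source-text size of each form, in characters.
console.log(asNumberArray.length, asBase64.length);

// The base64 form decodes back to the same data when the module is loaded.
const decoded = JSON.parse(
  zlib.gunzipSync(Buffer.from(gzipped.toString('base64'), 'base64')).toString()
);
```

The compressed payload is identical in both cases; only the text used to embed it in the generated `.js` file changes.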

The next two commits are a bit more complicated.

The second commit compresses binary properties. It comes from the observation that many generated code-points.js and symbols.js files contain long sequences of adjacent code points. These can be run-length encoded, and the lengths (which are often short) then encoded in a variable-length base64 prefix code. This shrinks the generated unicode-13 package from ~60MB down to ~15.5MB.
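
The idea can be sketched as follows. This is an illustration under assumptions, not the encoding the commit actually uses: here each run is stored as a (gap, run length) pair, and each number is written as base64 digits carrying 5 payload bits plus a continuation bit, so small numbers take a single character. The function names are hypothetical.

```javascript
const B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encode one non-negative integer: 5 payload bits per base64 digit,
// with the 6th bit set when more digits follow.
function encodeNumber(n) {
  let out = '';
  do {
    let digit = n & 31;
    n >>>= 5;
    if (n > 0) digit |= 32;
    out += B64[digit];
  } while (n > 0);
  return out;
}

// Turn a sorted array of unique code points into (gap, runLength) pairs.
function encodeCodePoints(sorted) {
  let out = '';
  let prevEnd = 0; // one past the last code point of the previous run
  let i = 0;
  while (i < sorted.length) {
    let j = i + 1;
    while (j < sorted.length && sorted[j] === sorted[j - 1] + 1) j++;
    out += encodeNumber(sorted[i] - prevEnd) + encodeNumber(j - i);
    prevEnd = sorted[j - 1] + 1;
    i = j;
  }
  return out;
}

// Inverse of encodeCodePoints: read the numbers back, then expand the runs.
function decodeCodePoints(text) {
  const numbers = [];
  let value = 0;
  let shift = 0;
  for (const ch of text) {
    const digit = B64.indexOf(ch);
    value += (digit & 31) << shift;
    if (digit & 32) {
      shift += 5;
    } else {
      numbers.push(value);
      value = 0;
      shift = 0;
    }
  }
  const result = [];
  let next = 0;
  for (let k = 0; k < numbers.length; k += 2) {
    next += numbers[k];
    for (let n = 0; n < numbers[k + 1]; n++) result.push(next++);
  }
  return result;
}

// Example input: A-Z followed by a-z, i.e. two runs of 26 code points each.
const letters = [];
for (let cp = 0x41; cp <= 0x5A; cp++) letters.push(cp);
for (let cp = 0x61; cp <= 0x7A; cp++) letters.push(cp);
const encoded = encodeCodePoints(letters);
console.log(encoded.length, decodeCodePoints(encoded).length);
```

The input must be sorted for the runs to be found, which is why the ordering of the exported arrays matters later in this thread.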

The third commit applies run-length encoding to named properties, shrinking the generated package down to ~13.5MB.

Emitting the data as an array of integers inflates it by a factor of
approximately 3.57.

     10 byte values require 1 digit + comma
     90 byte values require 2 digits + comma
    156 byte values require 3 digits + comma

    2*10/256 + 3*90/256 + 4*156/256 =~ 3.57

Base64, on the other hand, inflates the data by a factor of only 4/3.
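
The arithmetic above can be checked with a few lines of JavaScript. This sketch assumes, as the estimate does, that all 256 byte values are equally likely, which is a reasonable approximation for gzip output.

```javascript
// Characters needed to write each byte value 0..255 in decimal, plus a comma.
let totalChars = 0;
for (let byte = 0; byte < 256; byte++) {
  totalChars += String(byte).length + 1;
}

const arrayFactor = totalChars / 256; // average characters per byte
const base64Factor = 4 / 3;           // 4 output characters per 3 input bytes

console.log(arrayFactor.toFixed(2));  // 3.57
console.log(base64Factor.toFixed(2)); // 1.33
```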
Collaborator

@mathiasbynens mathiasbynens left a comment

Nice work!

Have you done any testing to confirm that the new generated output matches the old generated output from a user perspective? I.e. all exported arrays are still deeply equal to each other, etc.

@lightmare
Contributor Author

lightmare commented Jun 27, 2021

I've run the existing test, and it passes. Then I took the list of filenames from that test, plus a few files other than code-points.js, and instead of comparing against ./expected/*, I compared them against @unicode/unicode-13.0.0/*. I haven't checked all files yet. So far I've found 2 directories that don't match exactly:

/Binary_Property/Composition_Exclusion/code-points.js
/Line_Break/Unknown/code-points.js

The arrays are not deepEqual, but contain the same code points. This is because the code points are not sorted in the released package, whereas here I needed to sort them for run-length encoding.

@mathiasbynens
Collaborator

I'm surprised the arrays aren't already sorted. Let me take a look.

@mathiasbynens
Collaborator

Independent of this patch, I think we should enforce + test sorting of code points. I was expecting them to already be sorted in the published packages.
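
One way such a check could be written is sketched here. `isStrictlyAscending` is a hypothetical helper name, and requiring strict ordering (which also rejects duplicates) is an assumption about what the test should enforce.

```javascript
// Returns true when every code point is greater than the one before it.
// Strict comparison means duplicates are rejected as well.
function isStrictlyAscending(codePoints) {
  for (let i = 1; i < codePoints.length; i++) {
    if (codePoints[i] <= codePoints[i - 1]) return false;
  }
  return true;
}

console.log(isStrictlyAscending([0x41, 0x42, 0x61])); // true
console.log(isStrictlyAscending([0x61, 0x41, 0x42])); // false
```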

codePoints.length > 999 ? gzipInline : jsesc
)(codePoints)
);
let codePointsExports = `require('./ranges').flatMap(r=>Array.from(r.keys()))`;
Collaborator

Suggested change
- let codePointsExports = `require('./ranges').flatMap(r=>Array.from(r.keys()))`;
+ let codePointsExports = `require('./ranges.js').flatMap(r=>Array.from(r.keys()))`;
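
For readers unfamiliar with the expression in that template string, the sketch below shows how it expands ranges into a flat list of code points. The `Range` class is purely hypothetical: it assumes each exported range object exposes a `keys()` iterator over its code points, which is all the expression relies on.

```javascript
// Hypothetical range type: half-open [begin, end), iterable through keys().
class Range {
  constructor(begin, end) {
    this.begin = begin;
    this.end = end;
  }
  *keys() {
    for (let cp = this.begin; cp < this.end; cp++) yield cp;
  }
}

// Stand-in for what require('./ranges.js') might export.
const ranges = [new Range(0x30, 0x3A), new Range(0x41, 0x44)];

// Same shape as the generated export: expand every range, then flatten.
const codePoints = ranges.flatMap(r => Array.from(r.keys()));
console.log(codePoints.length); // 13
```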

@mathiasbynens mathiasbynens merged commit b935f78 into node-unicode:master Jun 28, 2021
@mathiasbynens
Collaborator

Thanks!

mathiasbynens added a commit to node-unicode/unicode-9.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-8.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-7.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-6.3.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-6.2.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-6.1.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-6.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-5.2.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-5.1.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-5.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-4.1.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-4.0.1 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-4.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-3.2.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-3.1.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-3.0.1 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-3.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-2.1.9 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-2.1.8 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-2.1.5 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-2.1.2 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-2.0.14 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-13.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-12.1.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-12.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-1.1.5 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-11.0.0 that referenced this pull request Jun 28, 2021
mathiasbynens added a commit to node-unicode/unicode-10.0.0 that referenced this pull request Jun 28, 2021
@mathiasbynens
Collaborator

Thanks so much for your work on this!

@lightmare
Contributor Author

Happy to have smaller dependencies ;)
