Shrink generated packages by lightmare · Pull Request #64 · node-unicode/node-unicode-data

lightmare · 2021-06-26T23:35:38Z

There is a glaring inefficiency in the serialization of gzipped data, which the first commit in this PR addresses. Constructing the gunzip input buffer from a base64-encoded string instead of an array of numbers shrinks the generated unicode-13 package from ~150MB down to ~60MB.
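To make the difference concrete, here is a minimal sketch that gzips a sample payload and serializes the result both ways. The payload and the names `asNumberArray` and `asBase64` are made up for illustration; this is not the generator's actual code.

```javascript
const zlib = require('zlib');

// Sample payload standing in for a large code point list (illustrative only).
const payload = Buffer.from(
  JSON.stringify(Array.from({ length: 5000 }, (_, i) => i * 3))
);
const gzipped = zlib.gzipSync(payload);

// Old style: the gzipped bytes written out as an array literal of numbers.
const asNumberArray = JSON.stringify(Array.from(gzipped));
// New style: the same bytes written out as a base64 string.
const asBase64 = gzipped.toString('base64');

// Both forms decode back to the identical payload.
const fromArray = zlib.gunzipSync(Buffer.from(JSON.parse(asNumberArray)));
const fromBase64 = zlib.gunzipSync(Buffer.from(asBase64, 'base64'));

console.log('array literal chars:', asNumberArray.length);
console.log('base64 chars:', asBase64.length);
console.log('round trip ok:', fromArray.equals(payload) && fromBase64.equals(payload));
```

The exact sizes depend on the payload, but the base64 form should come out well under half the length of the array literal.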

The two subsequent commits are a bit more complicated.

The second commit compresses binary properties. It comes from the observation that many generated code-points.js and symbols.js files contain long sequences of adjacent code points. These can be run-length encoded, and the lengths (which are often short) then encoded in a variable-length base64 prefix code. This shrinks the generated unicode-13 package from ~60MB down to ~15.5MB.
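The general idea can be sketched as follows. This is only an illustration under assumed conventions (inclusive ranges, gap/length pairs, 5 payload bits per base64 digit with a continuation bit); the format actually implemented in static/decode-ranges.js may differ, and all function names here are hypothetical.

```javascript
const B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Collapse sorted code points into [start, end] ranges (end inclusive).
function toRanges(codePoints) {
  const ranges = [];
  for (const cp of codePoints) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) last[1] = cp;
    else ranges.push([cp, cp]);
  }
  return ranges;
}

// Encode one non-negative integer as base64 digits: 5 payload bits per digit,
// least significant first, with bit 32 set on every digit except the last.
function encodeNumber(n) {
  let out = '';
  do {
    let digit = n % 32;
    n = Math.floor(n / 32);
    if (n > 0) digit |= 32; // more digits follow
    out += B64[digit];
  } while (n > 0);
  return out;
}

// Encode each range as (gap from the previous end, length - 1).
function encodeRanges(ranges) {
  let out = '';
  let prev = 0;
  for (const [start, end] of ranges) {
    out += encodeNumber(start - prev) + encodeNumber(end - start);
    prev = end;
  }
  return out;
}

// Inverse of encodeRanges.
function decodeRanges(str) {
  const nums = [];
  let value = 0;
  let shift = 0;
  for (const ch of str) {
    const digit = B64.indexOf(ch);
    value += (digit & 31) * 2 ** shift;
    shift += 5;
    if (!(digit & 32)) { nums.push(value); value = 0; shift = 0; }
  }
  const ranges = [];
  let prev = 0;
  for (let i = 0; i < nums.length; i += 2) {
    const start = prev + nums[i];
    const end = start + nums[i + 1];
    ranges.push([start, end]);
    prev = end;
  }
  return ranges;
}

const codePoints = [0x41, 0x42, 0x43, 0x61, 0x62, 0x1F600];
const ranges = toRanges(codePoints);
const encoded = encodeRanges(ranges);
console.log(ranges, encoded, decodeRanges(encoded));
```

Short gaps and lengths take a single character each, which is where most of the savings would come from when runs are frequent.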

The third commit applies run-length encoding to named properties, shrinking the generated package down to ~13.5MB.
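As a rough sketch of what run-length encoding means for named properties, assume a hypothetical input with one property value per consecutive code point. The sample values and the `[value, count]` run shape are illustrative assumptions, not the layout the PR emits.

```javascript
// Hypothetical input: one property value per consecutive code point.
const values = ['Lu', 'Lu', 'Lu', 'Ll', 'Ll', 'Nd', 'Nd', 'Nd', 'Nd'];

// Collapse consecutive equal values into [value, count] runs.
function runLengthEncode(items) {
  const runs = [];
  for (const item of items) {
    const last = runs[runs.length - 1];
    if (last && last[0] === item) last[1]++;
    else runs.push([item, 1]);
  }
  return runs;
}

// Expand [value, count] runs back into the flat list.
function runLengthDecode(runs) {
  return runs.flatMap(([item, count]) => Array(count).fill(item));
}

const runs = runLengthEncode(values);
console.log(runs); // [ [ 'Lu', 3 ], [ 'Ll', 2 ], [ 'Nd', 4 ] ]
```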

Emitting the data as an array of integers inflates it by a factor of approximately 3.57:

10 byte values require 1 digit + comma
90 byte values require 2 digits + comma
156 byte values require 3 digits + comma

2*10/256 + 3*90/256 + 4*156/256 =~ 3.57

Base64, on the other hand, inflates the data by a factor of only 4/3.
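The arithmetic can be checked with a few lines, assuming (as the estimate above does) that all 256 byte values are equally likely:

```javascript
// Average characters per byte when each byte value 0..255 is written
// as a decimal number followed by a comma.
let totalChars = 0;
for (let byte = 0; byte < 256; byte++) {
  totalChars += String(byte).length + 1; // digits + comma
}
const arrayFactor = totalChars / 256;

// Base64 encodes every 3 bytes as 4 characters.
const base64Factor = 4 / 3;

console.log(arrayFactor.toFixed(2));  // 3.57
console.log(base64Factor.toFixed(2)); // 1.33
```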

mathiasbynens

Nice work!

Have you done any testing to confirm that the new generated output matches the old generated output from a user perspective? I.e. all exported arrays are still deeply equal to each other, etc.

scripts/utils.js

static/decode-ranges.js

lightmare · 2021-06-27T08:42:22Z

I've run the existing test, and it passes. Then I took the list of filenames from that test, plus a few files other than code-points.js, and instead of comparing against ./expected/*, I compared them against @unicode/unicode-13.0.0/*. I haven't checked all files yet. So far I've found 2 directories that don't match exactly:

/Binary_Property/Composition_Exclusion/code-points.js
/Line_Break/Unknown/code-points.js

The arrays are not deepEqual, but contain the same code points. This is because the code points are not sorted in the released package, whereas here I needed to sort them for run-length encoding.
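A comparison that ignores ordering could look like the sketch below. The two arrays are made-up placeholders, not the real contents of either file, and `sameCodePoints` is a hypothetical helper.

```javascript
// Placeholder data: same code points, different order.
const released = [0x0F43, 0x0958, 0x0959];
const regenerated = [0x0958, 0x0959, 0x0F43];

const byValue = (a, b) => a - b;

// True when both arrays hold the same code points, regardless of order.
function sameCodePoints(a, b) {
  if (a.length !== b.length) return false;
  const sortedA = [...a].sort(byValue);
  const sortedB = [...b].sort(byValue);
  return sortedA.every((cp, i) => cp === sortedB[i]);
}

console.log(sameCodePoints(released, regenerated)); // true
```

Note the explicit numeric comparator: the default `Array.prototype.sort` compares elements as strings, which would order code points incorrectly.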

mathiasbynens · 2021-06-28T06:03:45Z

I'm surprised the arrays aren't already sorted. Let me take a look.

mathiasbynens · 2021-06-28T07:34:04Z

Independent of this patch, I think we should enforce + test sorting of code points. I was expecting them to already be sorted in the published packages.
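One way such a check might look, as a sketch only: `isStrictlyAscending` is a hypothetical helper, and how it would be wired into the project's test suite is left open.

```javascript
// True when every element is greater than the one before it,
// i.e. the list is sorted and contains no duplicates.
function isStrictlyAscending(codePoints) {
  for (let i = 1; i < codePoints.length; i++) {
    if (codePoints[i] <= codePoints[i - 1]) return false;
  }
  return true;
}

console.log(isStrictlyAscending([0x41, 0x42, 0x61])); // true
console.log(isStrictlyAscending([0x61, 0x41, 0x42])); // false
```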

scripts/utils.js

mathiasbynens · 2021-06-28T09:14:41Z

scripts/utils.js

-				codePoints.length > 999 ? gzipInline : jsesc
-			)(codePoints)
-		);
+		let codePointsExports = `require('./ranges').flatMap(r=>Array.from(r.keys()))`;


Suggested change

-		let codePointsExports = `require('./ranges').flatMap(r=>Array.from(r.keys()))`;
+		let codePointsExports = `require('./ranges.js').flatMap(r=>Array.from(r.keys()))`;
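For readers unfamiliar with the generated expression, here is a small sketch of how it expands ranges into individual code points. It assumes each exported range object has a `keys()` method that iterates its code points, as the expression implies; the `Range` class below is a hypothetical stand-in with an exclusive end, not the project's own type.

```javascript
// Hypothetical stand-in for a range object; `end` is exclusive here.
class Range {
  constructor(begin, end) {
    this.begin = begin;
    this.end = end;
  }
  // Yield every code point in the range.
  *keys() {
    for (let cp = this.begin; cp < this.end; cp++) yield cp;
  }
}

const ranges = [new Range(0x30, 0x33), new Range(0x41, 0x43)];

// Same shape as the generated export: flatten all ranges into one array.
const codePoints = ranges.flatMap(r => Array.from(r.keys()));
console.log(codePoints); // [ 48, 49, 50, 65, 66 ]
```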

scripts/utils.js

mathiasbynens · 2021-06-28T09:18:59Z

Thanks!

Issue: node-unicode/node-unicode-data#64

mathiasbynens · 2021-06-28T09:37:11Z

Thanks so much for your work on this!

lightmare · 2021-06-28T11:53:54Z

Happy to have smaller dependencies ;)

lightmare added 3 commits June 26, 2021 10:55

RLE + base64 encode contiguous code point ranges

9ab29e5

RLE contiguous same property ranges

f80e310

mathiasbynens reviewed Jun 27, 2021

mathiasbynens reviewed Jun 28, 2021

Apply suggestions from code review

e410e2e

mathiasbynens approved these changes Jun 28, 2021

mathiasbynens merged commit b935f78 into node-unicode:master Jun 28, 2021

mathiasbynens added a commit to node-unicode/unicode-9.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

06373f6

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-8.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

7310696

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-7.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

7a10ebc

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.3.0 that referenced this pull request Jun 28, 2021

Further compress generated output

f5db0bf

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.2.0 that referenced this pull request Jun 28, 2021

Further compress generated output

161885d

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.1.0 that referenced this pull request Jun 28, 2021

Further compress generated output

d51b731

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

db6010f

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-5.2.0 that referenced this pull request Jun 28, 2021

Further compress generated output

1a41488

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-5.1.0 that referenced this pull request Jun 28, 2021

Further compress generated output

71925fe

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-5.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

3848c25

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-4.1.0 that referenced this pull request Jun 28, 2021

Further compress generated output

592b2f5

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-4.0.1 that referenced this pull request Jun 28, 2021

Further compress generated output

744be1f

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-4.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

0108e98

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.2.0 that referenced this pull request Jun 28, 2021

Further compress generated output

68d4f82

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.1.0 that referenced this pull request Jun 28, 2021

Further compress generated output

78b57bb

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.0.1 that referenced this pull request Jun 28, 2021

Further compress generated output

7388a51

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.0.0 that referenced this pull request Jun 28, 2021

Further compress generated output

902f577

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-2.1.9 that referenced this pull request Jun 28, 2021

Further compress generated output

acde249

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-8.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

17ef916

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-7.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

932ad7c

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.3.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

5dad365

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.2.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

00cb472

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.1.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

0eb699f

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-6.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

1c8ca01

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-5.2.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

1749fb1

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-5.1.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

f60ba27

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-5.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

40671e0

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-4.1.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

a27051b

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-4.0.1 that referenced this pull request Jun 28, 2021

Make requires more consistent

24f5683

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-4.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

87b13b8

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.2.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

dbb4830

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.1.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

bb2a53c

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.0.1 that referenced this pull request Jun 28, 2021

Make requires more consistent

a6e0ecc

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-3.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

ab8d96a

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-2.1.9 that referenced this pull request Jun 28, 2021

Make requires more consistent

7bd6a2e

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-2.1.8 that referenced this pull request Jun 28, 2021

Make requires more consistent

1adde7b

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-2.1.5 that referenced this pull request Jun 28, 2021

Make requires more consistent

e87001c

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-2.1.2 that referenced this pull request Jun 28, 2021

Make requires more consistent

431e66b

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-2.0.14 that referenced this pull request Jun 28, 2021

Make requires more consistent

d052df2

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-13.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

466ad16

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-12.1.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

abe3733

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-12.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

eaba707

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-1.1.5 that referenced this pull request Jun 28, 2021

Make requires more consistent

be33d34

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-11.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

dca6244

Issue: node-unicode/node-unicode-data#64

mathiasbynens added a commit to node-unicode/unicode-10.0.0 that referenced this pull request Jun 28, 2021

Make requires more consistent

9156751

Issue: node-unicode/node-unicode-data#64

JLHwung mentioned this pull request Nov 2, 2024

test: migrate to snapshot tests #77

Merged

mathiasbynens merged 4 commits into node-unicode:master from lightmare:shrink
