Output DnaWasm as gzip-compressed, base-64 encoded list of Strings #1958

pjkundert · 2019-12-07T00:01:02Z

PR summary

Output .dna.json files are unwieldy and large, because WASM is output as a single base-64 encoded String. It is difficult to view .dna.json files in pagination utilities (eg. less), due to the extreme length of the WASM String. Produces, instead, something like this (with a list of 128-symbol JSON strings of base-64, instead of a 6 Megabyte single base-64 string):

{
  ...,
  "zomes": {
    "transactions": {
      "code": {  ...
        "code": [
          "Abc1234gasdfoiuasd5as57asdfgTiiasdu..XyZ",
          ...
          "Qrs345ZUjUHjjasdi="
        ]
      } 
    }
  }
}

Also, since WASM contains much redundancy, Gzip-encoding yields ~75% reduction in sizes.

Serializes to a list of Strings, instead of just a big String; small Zomes (< 1KB, eg. for tests) still round-trip to a single String.

Historical single String (eg. { "code": "AbC...XyZ==" }) WASM will load fine, but will not produce an equivalent hc hash. Since hc hash was broken until recently, this should be fine; all future DnaWasms serialized will produce consistent hc package and hc hash hashes.

testing/benchmarking notes

( if any manual testing or benchmarking was/should be done, add notes and/or screenshots here )

followups

( any new tickets/concerns that were discovered or created during this work but aren't in scope for review here )

changelog

if this is a code change that effects some consumer (e.g. zome developers) of holochain core, then it has been added to our between-release changelog with the format

- summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)

documentation

this code has been documented according to our docs checklist

o Use DnaWasm type for all DNA WASM encoding, including in `hc package`

pjkundert · 2019-12-07T00:03:47Z

crates/cli/src/config_files/build.rs

-            Ok(base64::encode(&wasm_buf))
+            Ok(DnaWasm::from_bytes(wasm_buf))


We were actually not even using DnaWasm's Serialize methods to output WASM; we were just serializing to base-64, and it happened to be compatible...

pjkundert · 2019-12-07T00:04:35Z

crates/core_types/Cargo.toml

@@ -21,6 +21,7 @@ serde_derive = "=1.0.89"
 serde_json = { version = "=1.0.39", features = ["preserve_order"] }
 lazy_static = "=1.2.0"
 multihash = "=0.8.0"
+flate2 = "=1.0.12"


We're already using flate2 in core_types, so size increment should be small.

thedavidmeister · 2019-12-07T09:51:26Z

@pjkundert this sounds like a cool idea, would you be interested in looking into radix64 crate instead of base64 while you're at it?

it claims ~3-4x speedup on encode/decode logic by using the AVX2 algo

https://docs.rs/radix64/0.6.2/radix64/index.html

i'm bringing it up now because it seems topical and also might produce incompatible hashes moving forward if the encoding character set is different

lucksus · 2019-12-07T14:36:40Z

Hey, I would highly suggest we introduce a binary DNA format instead. Discussed several times already but never got to it because it wasn't a priority. But 99% of the bytes of a DNA are WASM binary code and don't make sense to be viewed in an editor anyway. Making it even bigger with base64 just to be able to stick to JSON, and then having to deal with the problems of that, feels very weird.

That is, if this working already, awesome and thanks for proactively taking this on. Just wanted to make visible that there might be another solution we might want to approach long-term anyway.

thedavidmeister · 2019-12-08T12:13:03Z

@lucksus what makes sense to you for the rest of the JSON data?

pjkundert · 2019-12-08T14:28:03Z

I really like the single .dna.json. Sure, a big blog of binary data in there is an eye-sore, but its a single, user-readable, easy to manage package. Separate binary files will present a more difficult DNA package consistency problem.

We seldom need to peer inside the DNA package, so even keeping the one long String is fine (as long as we compress it to keep the size down).

We've long had base-64 compressed binary data (see: usenet), and they're not bad. If we need to get that 5/4 expansion back (which is unlikely), it easily compresses away. Mostly, the change from big String to list of Strings is to support the rare occasion when we do need to take a quick look at some of the other properties in the Objects in the JSON, that's all.

pjkundert · 2019-12-10T15:18:03Z

I've left the println!s in, for now; it appears that this serialization is used lots, which is somewhat surprising. Encoding DNAs happens rarely, while loading/decoding DNAs should happen much more frequently. Why is there so much (expensive) serialization of DNAs going on?

thedavidmeister · 2019-12-18T18:31:08Z

@pjkundert @lucksus i feel like this discussion got lost a bit recently, is there something to merge or chase up here or should we close this?

pjkundert · 2019-12-18T19:57:30Z

I think that the question of what base64 package to use is orthogonal to the purpose of this pull -- which is to improve the size and JSON utility of the output DNAs.

This maintains backward compatibility to load existing DNAs, while compressing and new DNAs in a list of strings, instead of one multi-megabyte long string, which is vastly more useful for any human interaction with the DNA file...

I think, most importantly, it restores the usage of the DnaWasm's serialization to control the output of the .dna.json file; presently, it doesn't use the DnaWasm serialization at all!

pjkundert · 2019-12-18T19:59:09Z

I think its a step improvement in DNA WASM <-> JSON handling, and won't interfere too much with any future improvements we might want to do.

thedavidmeister · 2019-12-19T08:35:24Z

@pjkundert i'm happy with all that, makes sense to me, this isn't preventing us from doing anything @lucksus is suggesting in some future work so why not merge? :)

pjkundert · 2020-01-03T17:35:34Z

It would be nice, I think, to get this DNA WASM compression into the product before we get too far into open alpha. The DNA space savings are pretty dramatic, and the produced .dna.json file is quite a bit more user-friendly.

pjkundert added 5 commits December 6, 2019 15:02

Output DNA WASM as a String sequence instead of a huge String

c081305

Test round-trip of encoded compressed WASM

0366231

Round-trip compressed WASM encoded as String arrays

b40a916

o Use DnaWasm type for all DNA WASM encoding, including in `hc package`

Minor test output refinement, Cargo.lock

8565298

Default to a single raw String, for DNA WASM's below ~4KB

38750f5

pjkundert added the enhancement New feature or request label Dec 7, 2019

pjkundert commented Dec 7, 2019

View reviewed changes

pjkundert added 3 commits December 6, 2019 16:09

Revert back to only emitting tiny (1 line, eg. test) WASM as a String

9d70ddc

Correct test, due to changing Gzip compression factors

430ac0f

fmt

f8e28cc

Merge branch 'develop' into 201912-dna-wasm

f774732

pjkundert added 2 commits December 9, 2019 12:31

Merge branch 'develop' into 201912-dna-wasm

1799975

Merge branch 'develop' into 201912-dna-wasm

ca5117d

pjkundert marked this pull request as ready for review December 10, 2019 15:18

pjkundert added 2 commits December 10, 2019 16:20

Merge branch 'develop' into 201912-dna-wasm

ece9bd7

Merge branch 'develop' into 201912-dna-wasm

af9cef2

Merge branch 'develop' into 201912-dna-wasm

31e3ab0

Merge branch 'develop' into 201912-dna-wasm

7f4d730

Merge branch 'develop' into 201912-dna-wasm

056bff5

thedavidmeister requested a review from zippy December 19, 2019 08:35

thedavidmeister approved these changes Dec 19, 2019

View reviewed changes

thedavidmeister requested a review from lucksus December 19, 2019 08:35

pjkundert and others added 3 commits December 19, 2019 05:35

Merge branch 'develop' into 201912-dna-wasm

0e6a239

Remove diagnostic println statements

dc6f1c1

Add CHANGELOG entry

610d5c0

pjkundert force-pushed the 201912-dna-wasm branch from 08be816 to 610d5c0 Compare December 19, 2019 13:55

pjkundert and others added 2 commits December 19, 2019 06:18

Correct math regarding base-64 expansion, output 99-bytes/132-symbols

33fbb5e

Merge branch 'develop' into 201912-dna-wasm

7a969b8

pjkundert force-pushed the 201912-dna-wasm branch from fd9f885 to 7a969b8 Compare January 3, 2020 17:25

pjkundert and others added 2 commits January 3, 2020 09:32

Merge branch 'develop' into 201912-dna-wasm

d5da64a

Restore base64 crate (used by sim2h)

6a0bd26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Output DnaWasm as gzip-compressed, base-64 encoded list of Strings #1958

Output DnaWasm as gzip-compressed, base-64 encoded list of Strings #1958

pjkundert commented Dec 7, 2019 •

edited

Loading

pjkundert Dec 7, 2019

pjkundert Dec 7, 2019

thedavidmeister commented Dec 7, 2019 •

edited

Loading

lucksus commented Dec 7, 2019

thedavidmeister commented Dec 8, 2019

pjkundert commented Dec 8, 2019

pjkundert commented Dec 10, 2019

thedavidmeister commented Dec 18, 2019

pjkundert commented Dec 18, 2019

pjkundert commented Dec 18, 2019

thedavidmeister commented Dec 19, 2019

pjkundert commented Jan 3, 2020

		Ok(base64::encode(&wasm_buf))
		Ok(DnaWasm::from_bytes(wasm_buf))

Output DnaWasm as gzip-compressed, base-64 encoded list of Strings #1958

Are you sure you want to change the base?

Output DnaWasm as gzip-compressed, base-64 encoded list of Strings #1958

Conversation

pjkundert commented Dec 7, 2019 • edited Loading

PR summary

testing/benchmarking notes

followups

changelog

documentation

pjkundert Dec 7, 2019

Choose a reason for hiding this comment

pjkundert Dec 7, 2019

Choose a reason for hiding this comment

thedavidmeister commented Dec 7, 2019 • edited Loading

lucksus commented Dec 7, 2019

thedavidmeister commented Dec 8, 2019

pjkundert commented Dec 8, 2019

pjkundert commented Dec 10, 2019

thedavidmeister commented Dec 18, 2019

pjkundert commented Dec 18, 2019

pjkundert commented Dec 18, 2019

thedavidmeister commented Dec 19, 2019

pjkundert commented Jan 3, 2020

pjkundert commented Dec 7, 2019 •

edited

Loading

thedavidmeister commented Dec 7, 2019 •

edited

Loading