
Optimize CBOR representation of data in Cadence storage layer #738

Closed
fxamacker opened this issue Mar 29, 2021 · 7 comments

Comments

fxamacker (Member) commented Mar 29, 2021

Issue To Be Solved

The Cadence storage layer's use of CBOR can be optimized to improve speed, memory use, and stored data size.

Suggestion

Performance gains are possible by tweaking the CBOR representation of Cadence Value types; for example, CBOR maps can be replaced with CBOR arrays where feasible.

In addition to the performance gains, replacing CBOR maps with CBOR arrays preserves order (e.g. DictionaryValue.Entries).

UPDATE: Additional performance gains (especially for decoding) are possible by removing backwards compatible decoding (thanks @turbolent) and not saving key strings twice (thanks @SupunS).

No changes to fxamacker/cbor are required to support these optimizations.

The current CBOR representation in Cadence is well designed, so this optimization is a bit like taking a nicely normalized RDBMS schema and denormalizing parts of it to gain speed.

Performance Comparisons

A preliminary comparison (from PR #752) with non-production data showed potential performance improvements. Although faster execution speed was the primary objective, memory use and stored data size also improved.

Comparisons using production data (in PR #761) show that the performance gains are cumulative. On a very large value (776,155 bytes), optimizing the composite and dictionary types showed:

  • encoding is 44.96% faster, 41.99% less memory, 46.02% fewer allocs
  • decoding is 34.48% faster, 29.67% less memory, 18.56% fewer allocs

⚠️ NOTE: "faster" refers to the time delta reported by benchstat, so "50% faster" here means a 2x speedup.

Additional performance gains are possible by optimizing the remaining Value types. Smaller values will likely catch up in percentage improvement as the remaining types are optimized, while larger values dominated by big composites and dictionaries will see more modest additional gains.

Cumulative performance gains from PR #752, PR #761, PR #778, and PR #788 confirmed expectations about small data catching up in performance gains.

For a 167-byte LinkValue (extracted from a mainnet snapshot):

  • encoding is 50.54% faster, with 57.81% less allocated memory per op and 54.46% fewer allocs/op
  • decoding is 31.43% faster, with 47.60% less allocated memory per op and the same allocs/op

For a 776,155-byte CompositeValue (extracted from a mainnet snapshot):

  • encoding is 45.23% faster, with 48.25% less allocated memory per op and 51.57% fewer allocs/op
  • decoding is 35.83% faster, with 31.23% less allocated memory per op and 18.56% fewer allocs/op

Additional performance gains are possible by:

  • optimizing code in the Cadence storage layer
  • adding higher-performance API to fxamacker/cbor
  • using higher-performance API of fxamacker/cbor when available


Preliminary Comparisons (non-production dictionary data)

Stored CBOR Data Size Comparisons (for dictionaries, non-production data)

Dictionary Data Old Size (bytes) New Size (bytes) Delta
Small benchmark 3361 1941 -42.25%
Large benchmark 348802 199602 -42.77%

Size reduction is data dependent. Comparisons using production data are recommended.

Encoding Speed & Memory Comparisons (for dictionaries, non-production data)

name                  old time/op    new time/op    delta
EncodingSmallValue-4     199µs ± 0%     137µs ± 0%  -31.26%  (p=0.000 n=10+9)
EncodingLargeValue-4    19.4ms ± 0%    13.1ms ± 1%  -32.25%  (p=0.000 n=10+10)

name                  old alloc/op   new alloc/op   delta
EncodingSmallValue-4    57.6kB ± 0%    36.2kB ± 0%  -37.12%  (p=0.000 n=10+10)
EncodingLargeValue-4    4.67MB ± 0%    3.55MB ± 0%  -23.93%  (p=0.000 n=10+10)

name                  old allocs/op  new allocs/op  delta
EncodingSmallValue-4     1.10k ± 0%     0.80k ± 0%  -27.46%  (p=0.000 n=10+10)
EncodingLargeValue-4     92.1k ± 0%     70.8k ± 0%  -23.12%  (p=0.000 n=10+10)

Decoding Speed & Memory Comparisons (for dictionaries, non-production data)

name                  old time/op    new time/op    delta
DecodingSmallValue-4     192µs ± 0%     145µs ± 0%  -24.32%  (p=0.000 n=9+10)
DecodingLargeValue-4    19.9ms ± 1%    14.7ms ± 0%  -26.26%  (p=0.000 n=10+10)

name                  old alloc/op   new alloc/op   delta
DecodingSmallValue-4    73.9kB ± 0%    60.7kB ± 0%  -17.84%  (p=0.000 n=10+10)
DecodingLargeValue-4    7.49MB ± 0%    5.75MB ± 0%  -23.23%  (p=0.000 n=10+9)

name                  old allocs/op  new allocs/op  delta
DecodingSmallValue-4     1.57k ± 0%     1.36k ± 0%  -13.36%  (p=0.000 n=10+10)
DecodingLargeValue-4      143k ± 0%      122k ± 0%  -14.62%  (p=0.000 n=10+10)

Benchmarks used linux_amd64 with Go 1.15.10. See Caveats.
Performance of encoding and decoding is data dependent. Comparisons using production data are recommended.

Comparisons using production data (cumulative gains from PR #752, PR #761, PR #778, and PR #788):

LinkValue 🚀

Encoding Comparisons

name                                            old time/op    new time/op    delta
EncodeCBOR/link_v3_2392f05c3b72f235_167bytes-4    20.2µs ± 1%    10.0µs ± 0%  -50.54%  (p=0.000 n=10+10)
EncodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4    20.7µs ± 1%    10.0µs ± 0%  -51.44%  (p=0.000 n=10+10)

name                                            old alloc/op   new alloc/op   delta
EncodeCBOR/link_v3_2392f05c3b72f235_167bytes-4    5.18kB ± 0%    2.18kB ± 0%  -57.81%  (p=0.000 n=10+10)
EncodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4    5.50kB ± 0%    2.17kB ± 0%  -60.60%  (p=0.000 n=10+10)

name                                            old allocs/op  new allocs/op  delta
EncodeCBOR/link_v3_2392f05c3b72f235_167bytes-4       101 ± 0%        46 ± 0%  -54.46%  (p=0.000 n=10+10)
EncodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4       101 ± 0%        46 ± 0%  -54.46%  (p=0.000 n=10+10)

Decoding Comparisons

name                                            old time/op    new time/op    delta
DecodeCBOR/link_v3_2392f05c3b72f235_167bytes-4    11.9µs ± 0%     8.2µs ± 1%  -31.43%  (p=0.000 n=8+10)
DecodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4    11.9µs ± 1%     8.1µs ± 1%  -31.84%  (p=0.000 n=10+9)

name                                            old alloc/op   new alloc/op   delta
DecodeCBOR/link_v3_2392f05c3b72f235_167bytes-4    4.50kB ± 0%    2.36kB ± 0%  -47.60%  (p=0.000 n=10+10)
DecodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4    4.50kB ± 0%    2.36kB ± 0%  -47.60%  (p=0.000 n=10+10)

name                                            old allocs/op  new allocs/op  delta
DecodeCBOR/link_v3_2392f05c3b72f235_167bytes-4      55.0 ± 0%      55.0 ± 0%     ~     (all equal)
DecodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4      55.0 ± 0%      55.0 ± 0%     ~     (all equal)

Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.

CompositeValue without dictionaries 🚀

Encoding Comparisons

name                                               old time/op    new time/op    delta
EncodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4      899µs ± 1%     486µs ± 1%  -45.93%  (p=0.000 n=10+10)
EncodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4       20.8µs ± 1%    11.8µs ± 1%  -43.30%  (p=0.000 n=10+10)
EncodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4       15.8µs ± 1%     8.4µs ± 0%  -47.12%  (p=0.000 n=10+10)

name                                               old alloc/op   new alloc/op   delta
EncodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4      247kB ± 2%     138kB ± 0%  -44.32%  (p=0.000 n=10+9)
EncodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4       4.93kB ± 2%    2.40kB ± 0%  -51.30%  (p=0.000 n=9+10)
EncodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4       3.77kB ± 0%    1.62kB ± 0%  -57.15%  (p=0.000 n=9+10)

name                                               old allocs/op  new allocs/op  delta
EncodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4      5.03k ± 1%     2.54k ± 0%  -49.50%  (p=0.000 n=10+10)
EncodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4          104 ± 1%        55 ± 0%  -46.91%  (p=0.000 n=10+10)
EncodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4         76.0 ± 0%      37.0 ± 0%  -51.32%  (p=0.000 n=10+10)

Decoding Comparisons

name                                               old time/op    new time/op    delta
DecodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4      762µs ± 1%     516µs ± 1%  -32.27%  (p=0.000 n=10+10)
DecodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4       17.2µs ± 1%    11.6µs ± 1%  -32.59%  (p=0.000 n=10+10)
DecodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4       11.9µs ± 1%     8.0µs ± 1%  -32.96%  (p=0.000 n=10+10)

name                                               old alloc/op   new alloc/op   delta
DecodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4      298kB ± 0%     212kB ± 0%  -28.81%  (p=0.000 n=10+10)
DecodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4       5.58kB ± 0%    3.87kB ± 0%  -30.56%  (p=0.000 n=10+10)
DecodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4       3.91kB ± 0%    2.57kB ± 0%  -34.36%  (p=0.000 n=10+10)

name                                               old allocs/op  new allocs/op  delta
DecodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4      5.05k ± 0%     4.65k ± 0%   -7.98%  (p=0.000 n=10+10)
DecodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4         97.0 ± 0%      89.0 ± 0%   -8.25%  (p=0.000 n=10+10)
DecodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4         56.0 ± 0%      53.0 ± 0%   -5.36%  (p=0.000 n=10+10)

Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.

Large CompositeValue with dictionaries, etc. 🚀

Encoding Comparisons

name                                               old time/op    new time/op    delta
EncodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4    35.5ms ± 1%    19.4ms ± 1%  -45.23%  (p=0.000 n=10+9)

name                                               old alloc/op   new alloc/op   delta
EncodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4    7.62MB ± 4%    3.94MB ± 0%  -48.25%  (p=0.000 n=10+10)

name                                               old allocs/op  new allocs/op  delta
EncodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4      141k ± 0%       68k ± 0%  -51.57%  (p=0.000 n=10+10)

Decoding Comparisons

name                                               old time/op    new time/op    delta
DecodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4    40.4ms ± 1%    25.9ms ± 1%  -35.83%  (p=0.000 n=9+10)

name                                               old alloc/op   new alloc/op   delta
DecodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4    15.0MB ± 0%    10.3MB ± 0%  -31.23%  (p=0.000 n=9+10)

name                                               old allocs/op  new allocs/op  delta
DecodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4      265k ± 0%      216k ± 0%  -18.56%  (p=0.000 n=9+10)

Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.

Proposed Changes

Encoding and decoding CBOR arrays is faster than encoding and decoding CBOR maps, so the proposed change replaces CBOR maps with CBOR arrays wherever feasible.

As mentioned earlier, this change has the added benefit of preserving the order of DictionaryValue.Entries.

The changes are very simple and the minimally modified tests pass. Modified files are limited to encoding.go, encoding_test.go, and decoding.go.

For example, changes to func (e *Encoder) prepareDictionaryValue might include:

  • declaring entries as an array instead of a map here:

        entries := make(map[string]interface{}, v.Entries.Len())

  • replacing Content: cborMap{ with Content: cborArray{ here:

        return cbor.Tag{
            Number: cborTagDictionaryValue,
            Content: cborMap{
                encodedDictionaryValueKeysFieldKey:    keys,
                encodedDictionaryValueEntriesFieldKey: entries,
            },
        }, nil

As mentioned earlier, additional speed gains (especially for decoding) are possible by removing backwards compatible decoding (thanks @turbolent) and not saving key strings twice (thanks @SupunS).

Caveats

Changing the CBOR representation of data is a breaking change; hopefully it's just an internal implementation detail within the storage layer. Guidance from @turbolent would be appreciated to make sure I didn't miss anything.

I don't know how closely the speed gains will translate to other platforms (other CPU architectures, compilation to Wasm, etc.).

Having this change to CBOR representation completed before I proceed with API changes to my CBOR library can save time.

Performance comparisons are data dependent. Comparisons should use realistic Cadence data (most common mix of data types, values, & sizes encountered in production).

Updates (not exhaustive list)

April 9, 2021 at 2:30 PM PT - Include cumulative benchmark comparisons to master (commit 7c7dd55) with optimizations from PR #752, PR #761, PR #778, and PR #788 using production data (values extracted from mainnet snapshot).

April 5, 2021 at 6:03 AM PT - Include benchmark comparisons using production data. Remove section "Data Used for Benchmark Comparisons" because it described non-production data.

fxamacker (Member Author) commented:

I'd like to provide a PR for this unless there are any concerns or objections.

turbolent (Member) commented Mar 30, 2021

This looks great, thank you for writing this up Faye, and thanks for the great call discussing it.

We agreed to:

  • Make several separate optimizations in separate PRs
  • Target a feature branch for now and merge it back when several optimizations have been added, so the overhead of keeping backwards-compatibility is lower
  • Batch migrate all existing data to the new format, i.e. perform a state migration in the spork updating to the version of Cadence which only has support for the optimized format

Here are the next steps:

fxamacker (Member Author) commented:

I updated this issue's text and benchmark comparisons to include more recent info.

turbolent (Member) commented:

For reference, dapperlabs/flow-go#5422 will implement the state migration function.

sudojbird commented:

@turbolent @fxamacker: is this epic good to close?

fxamacker (Member Author) commented:

I think this is good to close; I'll defer the decision to Bastian.

The next phases of optimization are epics #794 and #833. We'll be adding issues to #833 (possibly a streaming decoder).

turbolent (Member) commented:

@studio1vc yes, all issues are done now 🎉
