Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fastpath cbor marshalers #88

Closed
wants to merge 3 commits into from
Closed

Conversation

whyrusleeping
Copy link
Member

This is the first PR as part of a broader effort to improve the speed of cbor-ipld.

I implemented the benchmark here elsewhere using the existing go-ipld-cbor, and get the following results:

why@why-MS-7B46 ~/c/c/cidtest> go test -bench=.
goos: linux
goarch: amd64
pkg: github.com/whyrusleeping/cbor-gen/cidtest
BenchmarkCBORMarshal-16      	 2000000	       712 ns/op	     449 B/op	       9 allocs/op
BenchmarkCBORUnmarshal-16    	 1000000	      1080 ns/op	     480 B/op	       9 allocs/op

The benchmarks here give me:

why@why-MS-7B46 ~/g/s/g/i/go-cid> go test -run=XXX -bench=.
goos: linux
goarch: amd64
pkg: github.com/ipfs/go-cid
BenchmarkCBORMarshal-16      	20000000	       165 ns/op	     128 B/op	       5 allocs/op
BenchmarkCBORUnmarshal-16    	 5000000	       317 ns/op	     208 B/op	       4 allocs/op

I'm sure it can be improved upon, but CIDs arent even the slow part that i'm trying to optimize.

@whyrusleeping
Copy link
Member Author

Note: go-ipld-cbor does not use this (yet) and this will not speed up any existing applications as is. Once things start using these methods all over, we should notice improvements.

"strings"

mbase "github.com/multiformats/go-multibase"
mh "github.com/multiformats/go-multihash"

cbg "github.com/whyrusleeping/cbor-gen"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭

But go doesn't make this easy, does it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative was to have the codegen stuff spit out the methods it needs into the generated file, which i'm not opposed to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Honestly, that may be a better approach. That way, we can optimize everything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @Stebalien but even more intensely: I don't think this is viable at all. Having go-cid import cbor packages seems almost backwards. It would put cbor things on the transitive import requirements for not-at-all-cbor things and that's not okay. An import like this from one of our most imported libraries would be a huge lock-in on that package as well, and I don't think that's something we should go into lightly, and probably isn't even intentional.

This has to go in the generated file. Not here.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only blocking complaint is that the minimum CID size needs to be reliably enforced and stored in a constant.

cid.go Outdated Show resolved Hide resolved
cid.go Outdated Show resolved Hide resolved
cid.go Outdated
@@ -522,6 +525,74 @@ func (c Cid) Prefix() Prefix {
}
}

func (c Cid) MarshalCBOR(w io.Writer) error {
tag := cbg.CborEncodeMajorType(6, 42)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to be manually parsing CBOR, let's at least get these into constants.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we'd have one function per major type:

  • WriteTag
  • WriteInt
  • ReadTag
  • ReadInt
  • WriteBytes

...

cid.go Outdated Show resolved Hide resolved
)

// UnsupportedVersionString just holds an error message
const UnsupportedVersionString = "<unsupported cid version>"

const CidMaxLen = 256
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this affect cids using identity multihashes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, do you make identity CIDs that are longer than 256 bytes? If so, that seems rather obscene... but I could be convinced to raise this to 512 bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we use it:
https://github.com/Peergos/Peergos/blob/master/src/peergos/shared/crypto/FragmentedPaddedCipherText.java#L71
Non huge directories and files < 4096 bytes are inlined to save a network round trip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't be a root object, but inside another ipld object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any restriction here that is not also enforced on identity multihashes would also mean we could construct Cids that would look fine, but just fail to serialize at runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the value is 4112 bytes for that example. Here is an example of a 4096 byte file being encrypted and wrapped in some other cbor (so it includes a 4112 byte identity multihash inlined) - in hex in the file here:

https://alpha.peergos.net/#{"secretLink":true,"link":"#6MDZhRRPT4ugkJuUfceM6bPnpQKEj5dB2NqLxD1RxFn3oA3CusXayN8RReauEh/6MDZhRRPT4ugkJuUfcRzRbPpFimcBNJx2N9TJDnL4W3ETYhwdsWdvgCkXkwipF/JAMW71tpHDbgcrYrb7SkzeoeD4u9KWLR5PpUvPiBkVuK/5Pf7SupJAsP3FoAbmb5rakBBPZDz8zA9K8siNbbZLKnKijW3pWf"}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ianopolous please don't do that. We set the default max to 32 in go-ipfs for a reason.

Any restriction here that is not also enforced on identity multihashes would also mean we could construct Cids that would look fine, but just fail to serialize at runtime.

You're right, we should be checking this in the NewCidV1 function as well (and probably when we decode CIDs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien I asked back in March, when I implemented this, and was told inlining 4 KiB identity mulithashes was fine. Now all our users are on it, so breaking that (it is a breaking change in go-ipfs) would break every Peergos user. Current go-ipfs happily accepts much larger values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did I say that that? Go-ipfs doesn't explicitly forbid this anywhere, it's just a bad idea. I had thought we left something somewhere saying inlining anything over 100 bytes was a bad idea but I can't find that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was @whyrusleeping I asked. It makes a bunch of stuff a lot faster for us.

@whyrusleeping
Copy link
Member Author

Actually, question: How should we handle 'empty' Cids?

Right now, its not handling that case properly.

I think currently refmt errors out on this too, saying you can't have an undefined Cid serialized.

Do we want to retain that behavior?

@Stebalien
Copy link
Member

Do we want to retain that behavior?

We should return null, that's what we're currently using empty CIDs for.

@whyrusleeping
Copy link
Member Author

@warpfork (who has so far abstained from commented) is really against having a cbor method on cids. I lean towards agreeing, even though it makes my life a bit harder.

The solution i'm going to move towards now is to have a separate function that serialized a cid in a different package, and then anything wanting to cbor serialize a CID will have to know to use that method.

Copy link
Member

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really, really, really be doing this from the outside. If the constructors exported by this package aren't good enough to do this from outside this package, that's a bug we should fix, and then do this feature from outside this package.


There are a host of reasons for this:

  • go-cid is a very foundational package for us. Its transitive imports should be as minimal as possible, to make it easier to adopt, rather than being perceived as "heavy". (Even if the dependency is "small": the sheer count should go down, not up from its already-too-high number.)

  • anything go-cid depends on becomes a highly-depended upon thing instantly, and becomes therefore very difficult to change. This means adding a dependency on a library that is itself new and has partial feature coverage should be regarded with extreme trepidation.

  • CIDs don't depend on CBOR and having the code do so when the concept doesn't should set off all sorts of alarm bells.

  • the sheer use of the word "CBOR" is memetically problematic here. Having the word "CBOR" appear in this library implies that there's one possible way that go-cid relates to CBOR, and that's just not correct.

  • CBOR isn't even a codec (in the sense of multicodec) -- "dag-cbor" is! Seeing the number "42" appear in the go-cid package is incorrect; and seeing the multibase byte appear in the go-cid package is also incorrect. These are details of a codec. Someday we might have another multicodec which is still visibly cbor-ish -- 'dag-cbor-harder' or something -- and it could choose to omit the multiformat null byte, for example; and this would be unambiguous and a valid design choice by virtue of being a different codec. Therefore we should not have details in go-cid that are dag-cbor specific, nor should we have concepts attached to the word cbor, because both are extremely likely to suggest inaccurate relationships and confuse in the future.

We should not pursue this patch.


Instead: let's do this, but do it from other packages that use go-cid. This should be especially approachable because one of the main reasons we're looking at this is because we're also exploring code-gen tactics across our other libraries and consuming applications.

For example: supposing we have even one other structure which surrounds the use of the CID:

type MyProtocol struct {
    sauce cid.Cid
}
func (x *MyProtocol) DecodeDagCborFastPath(io.Reader) { /*...*/ }

Since from this first structure, we already have a control flow defined by MyProtocol.DecodeDagCborFastPath, it can already call another custom function on the cid field -- and this can be done without that function being a method on cid.Cid.

This from-the-outside approach will lead to much better outcomes, because it neither adds a dependency to go-cid, nor locks us in to any codecs nor particular implementations of them, nor (most importantly) introduces suggestions of conceptual relationships that are semantically backwards.


EDIT: ahheh, sorry. I'm a slow typer.

@tobowers
Copy link

tobowers commented Sep 1, 2019

This is good stuff and the code generator looks great. Is the plan to have go-ipld-cbor do an interface check for MarshalCBOR and use that if it exists? If so, is there a timeline from that and are PRs welcome?

@Stebalien
Copy link
Member

@whyrusleeping what's the state of this?

@rvagg
Copy link
Member

rvagg commented Apr 4, 2023

Closing because stale. But, there could be a case to pick this up again and just implement the MarshalCBOR and UnmarshalCBOR methods, even without the dependency and needing to codegen and it'd be pretty straightforward.

@rvagg rvagg closed this Apr 4, 2023
@rvagg rvagg deleted the feat/cbor-marshal branch April 4, 2023 04:22
@rvagg rvagg restored the feat/cbor-marshal branch April 4, 2023 04:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants