Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAG Import/Export #7011

Closed
wants to merge 17 commits into from
Closed

DAG Import/Export #7011

wants to merge 17 commits into from

Conversation

ribasushi
Copy link
Contributor

@ribasushi ribasushi commented Mar 18, 2020

Implements #6870.

@ribasushi ribasushi force-pushed the feat/carfile-support branch 2 times, most recently from c6f9e3b to f5d5715 Compare March 19, 2020 21:26
core/commands/dag/dag.go Outdated Show resolved Hide resolved
@ribasushi
Copy link
Contributor Author

@Stebalien @rvagg @warpfork : this is the final implementation of the import/export functionality. The only thing missing are the sharness tests, which are coming some time Saturday: I am still not entirely happy with my test cases.

Sorry it took forever, there were a lot of... dusty corners to connect properly.

I specifically want someone to check my concurrency work ( around the progress meters ). I have not worked with such code a whole lot, and while there was a lot to crib from existing progress meters in ther commands, nothing was quite close to what I needed.

I will convert this into a proper PR once I push the sharness bits.

@Stebalien Stebalien marked this pull request as ready for review March 20, 2020 22:16
@Stebalien Stebalien changed the title Complete export support for pre-review DAG Import/Export Mar 20, 2020
Copy link
Contributor

@willscott willscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏎 💨

core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
// It is not guaranteed that a root in a header is actually present in the same ( or any )
// .car file. This is the case in version 1, and ideally in further versions too
// Accumulate any root CID seen in a header, and supplement its actual node if/when encountered
// We will attempt a pin *only* at the end in case all car files were well formed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attempt a pin only at the end in case all car files were well formed

I can't quite see this in the code. It looks like it's done on a per-root basis, not per-car.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - each individual root is opportunistically pinned, to maximize the benefit of what we just ingested being pinned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better change the comment then, it seems misleading talking about the "car files" being "well formed" since you may find one root and pin that and not find another and not pin that.

//
// if err := api.Pin().Add(req.Context, rp, options.Pin.Recursive(true)); err != nil {

ret := RootMeta{Cid: &c, PresentInCar: seen}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if seen is false but the CID exists in the blockstore already. I think this would mean that it gets pinned regardless of the "malformed" CAR. Should the next block of if/else's be contained in an if seen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this goes back to the question of "malformed" then, are you sure you want to have a lose edge like this? might be better to be a little more strict about the contents of a car - "you don't have the root you say you do, bad luck", instead you are opening up edges that could become important features for some in the future when we decide to be more strict with a future CAR format, or could lead to unexpected behaviour for others.

return sess.GetBlock(req.Context, c)
}
sb := gipselectorbuilder.NewSelectorSpecBuilder(gipfree.NodeBuilder())
car := gocar.NewSelectiveCar(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about the case where you don't have a complete graph available? I believe that the * selector is going to result in an error if it reaches an edge that sess.GetBlock() can't return for a CID. We just can't currently disambiguate the error type from ipld-prime to know whether that may be acceptable (e.g. if a --allow-incomplete type flag was introduced), but we could probably change it to get one if that's needed. (https://github.com/ipld/go-ipld-prime/blob/eb71617f4aeb7e73f36276f14e76e47fd90241ef/traversal/walk.go#L119)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg on export we want to either output the complete graph ( potentially grabbing it from the network ), or to fail. I.e. at present we do not at all have --allow-incomplete in spirit, so that the .car files produced by go-ipfs are exclusively "well formed and deterministic" ( deterministic as dictated by go-ipld-prime ).

The import of course is a complete free-for-all ( also by design ) as can be seen here:
https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export.sh#L41-L55
based on the datasets: https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export-data/README.md

Let me know if this is potentially problematic...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment above, it's not necessarily problematic, it just opens up API feature space that may be difficult to claw back in the future if we decide to be more strict. My personal inclination with such things these days is to be explicit and strict and open up feature space only in response to requests and demonstrated need (it also gives you a much more explicit space to construct tests around). Not my call here, so just a word of caution.

- Stop printing during Run: it does not execute on local tty
- Use tabs for easier parsing
- Ensure countstats do not appear in json events
It is not possible to stream out "binary" and emit events at
the same time. Thus switch from listing out how many objects
did we export, to how many *bytes* are being spat out.

Move everything to PostRun to keep emitting on Stderr of the
calling process.
On relatively large amount of objects things get into a livelock
( spinning fans, no progress )

specifically after importing: https://ipfs.io/ipfs/QmbnqojkPFjZtkCtzzCEmUbAVoKEkJzdY5u7Z7anR6vAte
this locks up: ipfs dag export bafy2bzaced4ueelaegfs5fqu4tzsh6ywbbpfk3cxppupmxfdhbpbhzawfw5oy
Failcases to come as a separate PR
We do not lose any blocks ( we are able to re-export ), so that's great
But the fact that `repo gc` comes back with non-existent Qm CIDs... ugh
@@ -54,6 +54,7 @@ require (
github.com/ipfs/go-unixfs v0.2.4
github.com/ipfs/go-verifcid v0.0.1
github.com/ipfs/interface-go-ipfs-core v0.2.6
github.com/ipld/go-car v0.0.5-0.20200316204026-3e2cf7af0fab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead and cut a CAR release when ready.

roots = ret.roots

progressTicker.Stop()
if err := res.Emit(&CarImportOutput{ObjectCounts: &counts}); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't safe. Emit can do anything it wants with the object, including send it to another goroutine.

Copy link
Contributor Author

@ribasushi ribasushi Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason &counts is a pointer ( here and your next comment ) is that this is the only way to activate omitempty higher up: https://github.com/ipfs/go-ipfs/pull/7011/files#diff-15172926c4147422f472139313467179R73

Want to make sure you agree with the rationale before I add a bunch of mutexes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine although I'm not sure why omitempty is important in this case. My point is that we should be copying this object safely instead of sharing it.


retCh := make(chan importResult, 1)
var counts ObjectCounts
go importWorker(req, &res, &api, &counts, retCh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely not safe. We can't update counts from a different thread without a lock.

core/commands/dag/dag.go Show resolved Hide resolved
core/commands/dag/dag.go Show resolved Hide resolved
core/commands/dag/dag.go Show resolved Hide resolved
core/commands/dag/dag.go Show resolved Hide resolved
for {
len, readErr := r.Read(buf)
if len > 0 {
if err := re.Emit(bytes.NewBuffer(buf[:len])); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may happen to work (should add a check to prevent it) when not running the daemon, but simply won't work when actually running the daemon. When running the daemon, we can only return:

  1. Repeated results of the same type (encoded as json). To do this, we have to specify the type.
  2. Streaming bytes (no type specified).

We can't return both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure abut both not being possible, so opted for the loop. Will simplify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misread this the first time. We can repeatedly stream new byte readers.

However, we really shouldn't read 4MiB, write 4MiB, etc. Ideally, we'd print out a rate and amount downloaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not read/write 4MiB. This is the max we can read from the pipe at once, but other than that we just block and pull through whatever slice is placed on the pipe.

https://github.com/ipfs/go-ipfs/pull/7037/files#diff-15172926c4147422f472139313467179R418-R421

@Stebalien
Copy link
Member

Returning progress on export seems insanely complicated and not particularly useful unless we already know the number of blocks to expect.

Additionally, the current method simply won't work with the current commands library. We can't mix data blocks and objects.

@Stebalien
Copy link
Member

Let's just drop progress on get. Progress on add is useful and should be pretty simple to implement.

@ribasushi
Copy link
Contributor Author

Returning progress on export seems insanely complicated and not particularly useful unless we already know the number of blocks to expect.

All I am doing is copying bytes from one pipe to another ( with a complication-conditional that I now realise is not needed ). export can mean going to the network. Without a progress of any sort it is not clear whether the dag is just humongous, or we are stuck on a particular spot of the DAG forever. Having this progress was also what allowed quick diagnosis of go-ipld-prime being broken. The entire pattern was copied over from: https://github.com/ipfs/go-ipfs/blob/master/core/commands/cat.go#L82-L103

Additionally, the current method simply won't work with the current commands library. We can't mix data blocks and objects.

This code works 100%, I just elected not to add tests for it. I could do that if you are unconvinced :)

Let's just drop progress on get.

I can do that, but having long-running network operations without any way to tell "is this thing alive?" is pretty bad UX. Please let me know this is really what you want to do.

Progress on add is useful and should be pretty simple to implement.

Progress on add ( import ) is already in this PR... what am I missing?


if silent || encType != "text" {
// force-disable progress unless on non-silent CLI
req.Options[progressOptionName] = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I ask for progress but set the encoding type to json, I'd expect json progress updates. I wouldn't mess with this in pre-run.

// enable progress implicitly if a TTY
errStat, _ := os.Stderr.Stat()
if 0 != (errStat.Mode() & os.ModeCharDevice) {
req.Options[progressOptionName] = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just set the progress option default to true in the options. That way, the client can override it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien
Copy link
Member

Ok, I've been mixing up parts of import/export. I'll need to re-read this but it looks more reasonable on a second pass.

However, I'd still like to punt some of this logic till a future version. This is very tricky code with subtle issues and addressing them all in one PR will be frustrating for all of us.

This code works 100%, I just elected not to add tests for it. I could do that if you are unconvinced.

You're right, I was mixing up import/export and https://github.com/ipfs/go-ipfs/pull/7011/files#r398257738 (I thought this case was reachable).

c, err := cid.Decode(req.Arguments[0])
if err != nil {
return fmt.Errorf(
"unable to parse selector (currently only bare CIDs are supported): %s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the format of this argument when proper selectors are allowed? I assume it'll allow parsing of @creationix's new string-form selectors at some point, but aren't we still going to need to specify a root separate to the selector? Is that going to need some extra syntax to squish CID+selector together or will this need to be 2 arguments, one for CID and one for selector? And in that case, should this argument just be renamed from "selector" to "root" for now with the expected addition of a selector argument to come later?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@warpfork warpfork Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 When selector support is added, I'd advocate for a tuple here, yes.

Selectors are a declarative document which, when interpreted, navigates from one starting node, to a series of reached nodes (being all those visited) and matched nodes (being those which the selector... selects).

Starting a selector at a particular CID is a common operation -- great; it composes cleanly with the above.

Appearances of func foo(CID, Selector) would look totally reasonable and idiomatic to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg
Copy link
Member

rvagg commented Mar 26, 2020

Been pondering the import semantics all day and want to continue on from my inline comment earlier: I think the first version of this should be fairly exclusive and not attempt to handle too much novelty. Rejecting zero-root CARs is probably a good idea for now (since it's essentially baked in go CARv1) and being more picky about how to handle the root-not-in-body case is probably a good idea too - it's badly formed. Don't even bother trying to pin that root.

There's two main reasons for this line of thinking:

  1. Don't make it easy for the user to give you bad data or bad tooling will proliferate and persist.
  2. Don't open your API too wide because it's much much harder to restrict an expansive API than to expand a restrictive one. Expand the functionality options in response to real, justified use-cases. Being too expansive is also going to make a more restrictive CARv2 more difficult, it'll end up being a more heavily breaking API when introduced here.

The two areas I can see justified for not making import a strict mirror of export in terms of its exclusiveness are: well ordered, structured and complete DAG ("deterministic", too hard, but we will want to make that reasonable to check in CARv2) and >1 roots (although export could easily be expanded here to accept more than one root already).

@ribasushi
Copy link
Contributor Author

I have addressed all EX port pieces in a separate PR to make thins simpler to review.
The Import part coming up in another 2 PRs...

@ribasushi
Copy link
Contributor Author

New PR's:
export functionality only: #7036
export with cli progress: #7037
import functionality only: #7038

Regarding @rvagg comment earlier: I still feel that dag import should be nothing more than a more convenient dag put. This does mean that the streams it accepts would be .car files that are not valid anywhere else but... so what?

If the group feels this is not the direction we should pursue: I will add further strictness checks as requested, and will in parallel write something else to provide the "more convenient dag put" I need for other purposes :)

@ribasushi
Copy link
Contributor Author

@Stebalien the sharness test is a fluke ( pubsub stuff failed ). In any case this PR is now too different from the "broken up" ones, I am closing it, and killing the branch, as all suggestions here got incorporated ( aside from the discussion on "is import ok as-is", which we will do in the import PR #7038

@ribasushi ribasushi closed this Mar 28, 2020
@ribasushi ribasushi deleted the feat/carfile-support branch March 28, 2020 03:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants