idea: support for transactional groups of writes #106

Open · ianopolous opened this issue Feb 1, 2016 · 36 comments

@ianopolous (Member) commented Feb 1, 2016

I had a brief discussion with @whyrusleeping and wanted to start the ball rolling on a public discussion here.

Motivation:
The ability to implement an ACID database or other transactional system in a highly concurrent environment. In my use case with Peergos, I am imagining many users concurrently writing to the same IPFS daemon. This means that, relative to any one thread, GC will essentially be called at random. Each user will be building up trees of objects using object.patch before eventually committing (pinning).

Implementation:
What I had in mind was an optional parameter on all IPFS write commands: a transaction ID (and maybe a timeout as well). All writes within a transaction are tagged with this temporary ID. IPFS just keeps a map from transaction ID to a set of hashes, and these act as additional object roots during GC. When a transaction is complete, you call completeTransaction(tid), which drops the mapping (or you let it time out). This guarantees that an object you've just written won't be GC'd, regardless of GC activity, before you are finished with it.
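
To make that concrete, here is a minimal Go sketch of the bookkeeping this implies (hypothetical names, not an existing go-ipfs API): the daemon keeps a map from transaction ID to the hashes written under it, GC treats those hashes as extra roots, and completing a transaction just drops the map entry.

package tx

import "sync"

// Registry is a minimal sketch of the proposed bookkeeping: each open
// transaction ID maps to the set of hashes written under it.
type Registry struct {
	mu   sync.Mutex
	open map[string]map[string]struct{} // tid -> set of hashes
}

func NewRegistry() *Registry {
	return &Registry{open: make(map[string]map[string]struct{})}
}

// Tag records that hash h was written under transaction tid.
func (r *Registry) Tag(tid, h string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.open[tid] == nil {
		r.open[tid] = make(map[string]struct{})
	}
	r.open[tid][h] = struct{}{}
}

// Roots returns every hash held by any open transaction; GC would treat
// these as additional roots, so none of them can be collected.
func (r *Registry) Roots() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var roots []string
	for _, hashes := range r.open {
		for h := range hashes {
			roots = append(roots, h)
		}
	}
	return roots
}

// Complete drops the mapping for tid (the completeTransaction(tid) call
// above), making its hashes collectable again unless pinned meanwhile.
func (r *Registry) Complete(tid string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.open, tid)
}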

@jbenet (Member) commented Feb 1, 2016

Good ideas 👍

@whyrusleeping (Member) commented Feb 1, 2016

This could be done on top of the files API, where the 'token' is just part of the path. Then we'd have the root of the database namespace 'pinned' with the soon-to-be-created ipfs files pin command.
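
For illustration, with the existing ipfs files commands, the token-in-the-path part might look like this (tx-1234 is the token; ipfs files pin is the part that doesn't exist yet):

ipfs files mkdir -p /db/tx-1234
ipfs files cp /ipfs/<hash> /db/tx-1234/obj1
... do work ...
ipfs files rm -r /db/tx-1234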

@jbenet (Member) commented Dec 22, 2016

@ianopolous thanks for bringing this up again (in IRC). I'm still not entirely sure what the best way to do this is. We should outline the needs and constraints, how operations function (how they may fail, how to recover), and how to implement proper consistency/safety (oplogs? journals?).

This also relates to ipfs-cluster and the "transaction oplog" we're building there. cc @hsanjuan

@wking commented Dec 22, 2016 (comment minimized)

@ianopolous (Member, Author) commented Dec 22, 2016

@wking I don't think that would give you any guarantees. Consider the example where a user is doing a single add of a large file, which can take arbitrarily long. Halfway through writing this file, someone else calls GC. I believe (correct me if I'm wrong, @whyrusleeping) this would GC all the objects written for the first half of the file. Even if that is handled correctly internally and the sub-objects can't be GC'd, it shows that the user needs to control the timeout of the transaction.

There isn't any overlap between my proposal and object pinning: it would work in conjunction with pinning, e.g.:

add object X with transactionID T, giving multihash H
pin (H)
completeTransaction(T)

or a more complex example:

add object X with transactionID T, giving multihash H1
add object Y (which has a merkle link to H1) with transactionID T, giving multihash H2
pin -r (H2)
completeTransaction(T)

To me this is a very simple mental model that gives you real guarantees and thus lets you construct algorithms that are robust under concurrency.

@wking commented Dec 23, 2016 (comment minimized)

@ianopolous (Member, Author) commented Dec 23, 2016

> With a configurable expiration time, you can still handle wide, bushy trees safely if the expiration times are configurable.

The timeout would have to be configurable on a call-by-call basis, which is exactly what I'm suggesting.

All the calls I'm suggesting have a user-controlled timeout (see the first comment); in my code examples I left it as an implicit parameter. This means the behaviour is well defined across ipfs crashes and restarts, assuming that the write command returning means the transaction ID, timeout, and hash have been "written to disk". Upon restart, ipfs just continues the timeout. To be clear, all transactions would have a timeout controlled by the user. This would also play well with ipfs-cluster, where "written to disk" means safely committed to as many nodes as the cluster setup requires, or erasure coded, or whatever.

Let's assume that ipfs can handle the internal writing of a large file correctly under concurrency. The timeout would need to run from the completion of writing the final root object, whichever write command was called. The timeout for a transaction would get updated with every write in the same transaction, so you can effectively "keep asking for more time" (another property lacking in an expiring-pin model).
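
A sketch of that refresh rule, under the same hypothetical-registry assumption as my first comment: every tagged write pushes the transaction's deadline out again, and only transactions whose deadline has passed become collectable.

package tx

import (
	"sync"
	"time"
)

// deadlines sketches per-transaction expiry (hypothetical, not a real
// API): every tagged write restarts the transaction's timeout.
type deadlines struct {
	mu sync.Mutex
	at map[string]time.Time // tid -> current deadline
}

// Touch is called on every write tagged with tid; the deadline is always
// pushed out to now+timeout, so an active transaction keeps asking for
// more time simply by continuing to write.
func (d *deadlines) Touch(tid string, timeout time.Duration) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.at == nil {
		d.at = make(map[string]time.Time)
	}
	d.at[tid] = time.Now().Add(timeout)
}

// Expired reports which transactions GC may now discard.
func (d *deadlines) Expired(now time.Time) []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	var out []string
	for tid, t := range d.at {
		if now.After(t) {
			out = append(out, tid)
		}
	}
	return out
}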

Expiring pins cannot help here. Consider the following interleaving:

add object X with transactionID T, timeout t, giving multihash H
gc (from another thread)
pin (H)
completeTransaction(T)

We've lost our object immediately because another thread called GC. This can't be fixed by any modification to the pin call, because the damage is already done before the pin call starts.

@wking commented Dec 23, 2016

> The timeout for a transaction would get updated with every write in the same transaction, so you can "keep asking for more time" effectively (another property lacking in an expiring pin model).

I agree that you'd need a way to ask for more time, but I don't see why you couldn't get that with expiring pins. And it would give you a way to guard an existing object:

err = pin(object.hash, expiration=estimated_time_to_patent + buffer)
if err == DoesNotExist:
    push(object.content, expiration=estimated_time_to_patent + buffer)

> ... the damage is already done before the pin call starts.

Yeah, you'd need either a default expiring pin on all objects that lasts long enough for the explicit pin placement, or a way to set (expiring) pins atomically with the push.

But these expiring pins look a lot like your expiring transactions. The trade-off is that transactions can be explicitly closed or canceled if you finish early but require per-(object, transaction) tracking on the server, while expiring pins are not cancelable (unless you're OK with Bob removing Alice's pin) and only require a single value (the max expiration time) to be tracked per object.
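
To make the trade-off concrete, the two bookkeeping models might be sketched like this (hypothetical Go types, nothing from go-ipfs):

package pins

import (
	"sync"
	"time"
)

// txTracking: the server tracks every (object, transaction) pair; a
// transaction can be closed early, releasing all of its objects at once.
type txTracking struct {
	mu     sync.Mutex
	blocks map[string]map[string]struct{} // tid -> object keys
}

// expiringPins: the server tracks one value per object, the latest
// expiration anyone requested; cheap, but there is nothing to cancel
// without letting Bob remove Alice's pin.
type expiringPins struct {
	mu      sync.Mutex
	expires map[string]time.Time // object key -> max expiration
}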

@jbenet (Member) commented Dec 23, 2016

We've considered a similar heuristic, like "don't gc things added in the last few seconds", that may help some but may introduce other problems.

I think good transactional semantics are useful to define, and ipfs (at least the repo business) is quite simple; this should be easy.

I think a place to start would be to go over the core API, refine the functions we want to "transactionize" (I suspect they're likely the same as for ipfs-cluster; cc @hsanjuan), and then propose how we would do the transaction on each.

How we do GC can definitely be rethought quite a bit to support these operations safely.

@ianopolous (Member, Author) commented Dec 24, 2016

For a concrete example of a complex operation I'd like to put in a transaction, have a look at Peergos' write to a merkle-btree, the recursive put method: https://github.com/Peergos/Peergos/blob/master/src/peergos/shared/merklebtree/TreeNode.java#L80

Note that it's written in an asynchronous way using CompletableFutures so that it can be automatically compiled to JavaScript and use native Promises in a browser. Ordinarily I'd just use synchronous Java code, which is easier to read. You can ignore the UserPublicKey argument; the ContentAddressedStorage argument is IPFS.

Essentially, it has to modify ~log(tree size) nodes, resulting in a new root which I then pin. Non-leaf nodes have merkle links to other nodes, and leaf nodes have merkle links to larger data objects.
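
The shape of that update is a path copy. Here is a simplified Go sketch (hypothetical helpers, not the actual Peergos code) of rewriting a leaf and rebuilding its ~log(n) ancestors into a new root:

package btree

import (
	"crypto/sha256"
	"encoding/hex"
)

// node is a simplified stand-in for a merkle-btree node: a payload plus
// merkle links to children.
type node struct {
	payload  []byte
	children []string // child hashes
}

func hashNode(n node) string {
	h := sha256.New()
	h.Write(n.payload)
	for _, c := range n.children {
		h.Write([]byte(c))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func writeNode(store map[string]node, n node) string {
	h := hashNode(n)
	store[h] = n // in reality: a block put, tagged with the transaction ID
	return h
}

// put rewrites one leaf and then rebuilds every ancestor on the path
// with the new child hash: ~log(tree size) new nodes and a new root.
// path holds the hashes from the root down to the leaf's parent, and
// childIdx the child position taken at each step.
func put(store map[string]node, path []string, childIdx []int, newLeaf node) string {
	h := writeNode(store, newLeaf)
	for i := len(path) - 1; i >= 0; i-- { // bottom-up path copy
		n := store[path[i]]
		n.children = append([]string(nil), n.children...) // copy-on-write
		n.children[childIdx[i]] = h
		h = writeNode(store, n)
	}
	return h // the caller pins this root recursively, then closes the transaction
}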

A much simpler use case at my day job is a single add-and-pin of a large file.

@ianopolous (Member, Author) commented Dec 24, 2016

@whyrusleeping Another, much simpler, idea I had to solve all these use cases is to give every write command an option to also pin (recursively) the result (the assumption is that each command itself is internally atomic with respect to GC). This would then allow the user to build up arbitrary structures and, once the final root is written and pinned, explicitly unpin all the child pins. Obviously the burden is then on the client to keep track of all their intermediate pins in a transaction, but that is easy.
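
As a sketch, assuming a --pin option on the write commands existed (it didn't at the time) and glossing over pin-type details, that would look like:

ipfs block put --pin ...        (returns H1, an intermediate node)
ipfs block put --pin ...        (returns H2, the root, linking to H1)
ipfs pin add -r H2              (recursive pin on the final root)
ipfs pin rm H1                  (drop the intermediate pin; H2's pin keeps H1 alive)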

@jbenet (Member) commented Dec 24, 2016

@ianopolous yeah, I think that would work; that, as a baseline, should be possible. I would like to support larger transaction batches though.

@whyrusleeping (Member) commented Dec 26, 2016 (comment minimized)

@whyrusleeping (Member) commented Sep 2, 2017

cc @Stebalien, I remember we were talking about making the concept of 'sessions' extend up into the API. That could potentially support this use case.

@Stebalien commented Sep 2, 2017

Yes: ipfs/go-ipfs#4138

Really, we already have blockservice sessions (you added them when you added bitswap sessions); we just need to refine them a bit and ensure that blocks don't get garbage collected while a session that has observed them is open.

@ianopolous (Member, Author) commented Nov 5, 2017

@whyrusleeping asked me to write my ideal API for this, so here goes:

ipfs transaction start $duration_millis => tid
ipfs transaction close $tid

Then write commands (block put in our case) get an optional tid parameter, so usage would look like:

ipfs transaction start $duration_millis => tid
ipfs block put .... tid=$tid
...
ipfs block put .... tid=$tid
ipfs pin add $hash
ipfs transaction close $tid

@momack2 commented Sep 27, 2018

@whyrusleeping does any of the new session work you've been doing affect this issue? This seems to be a main blocker for data safety in Peergos, which is something they want to work on in Q4. If it is applicable, would it be ready on that timeline? If not, is there something we could do to unblock them in the nearer term (other GC improvements?)

@Stebalien commented Sep 27, 2018

Not really. We'll probably use sessions with this but it's mostly orthogonal.

@momack2 commented Oct 4, 2018

Hey @michaelavila! I hear you are working on something close to this in pinning land. Peergos has some work that depends on this capability (effectively the ability to add temporary pins to a "don't garbage collect this yet" set that we clean up later) and we'd like to unblock them by early/mid Q4 so they can do more work on top of that. Would you be capable/interested in owning this?

@michaelavila commented Oct 5, 2018

@momack2 I don't mind taking on this task, but I'm out 10/20-11/4 and we have a few weeks of team gatherings in different countries before that, all of which I think will affect timing, which appears to be a constraint here.

@momack2 commented Oct 17, 2018

Let's discuss on Thursday at the London IPFS meetup to see if the current very basic "don't GC until this is done" solution is sufficient for Peergos folks to make progress. If something more adaptive is needed we can find another owner.

@hsanjuan commented Oct 21, 2018

Isn't there an mfs workaround for this, consisting of putting the CIDs of things that you want to prevent from being GCed into mfs? This puts them in a GC ignore list. After being done, you can remove them from mfs. This was suggested to me when I asked about similar GC issues when doing "refs pinning" (refs -r + pin add at the end).

@hsanjuan commented Oct 21, 2018

So putting and removing things from mfs effectively gives you a session for that thing?

@ianopolous (Member, Author) commented Oct 23, 2018

This sounds like it might be a way to implement the desired API. Can you tag the hashes in mfs with the transaction ID, so we can have multiple concurrent sessions/transactions in flight?

Presumably, though, we could just use whatever mechanism mfs uses to add them to a GC ignore list directly?

@hsanjuan commented Oct 23, 2018

You don't need transaction IDs. Just put the object in an mfs folder (with a random name known by your transaction actor) so it's not GCed. At the end of the transaction, the actor removes its folder. I think GC gets an ignore list with all the things in mfs, so as long as someone has put it in there it should be safe.

I'm just saying that if you want this functionality, there's a potential workaround that doesn't need any changes to the codebase and isn't crazy to implement on top of ipfs right now (again, in theory; maybe I'm missing something).
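
Concretely, with existing commands, the workaround might look like this (tx-42 being the random folder name that plays the role of the transaction ID):

ipfs files mkdir -p /transactions/tx-42
ipfs files cp /ipfs/<hash1> /transactions/tx-42/h1
ipfs files cp /ipfs/<hash2> /transactions/tx-42/h2
ipfs pin add -r <root-hash>
ipfs files rm -r /transactions/tx-42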

@ianopolous (Member, Author) commented Oct 23, 2018

We need transaction ids just to group objects that can be released together, without affecting concurrent transactions. As you say, the id could be used as the folder name in mfs.

In our case, we have to implement all new calls we use in our codebase, and we'd rather not do temporary workarounds (we've been waiting 3 years already for safe transactions in ipfs, so a little longer is fine :-) ). But it sounds like one could use mfs (or whatever it uses under the hood) to implement our proposed calls in ipfs.

@michaelavila commented Nov 15, 2018

@ianopolous are you OK with using a flag to pass the transaction ID in?

e.g.:

ipfs transaction start $duration_millis => tid
ipfs block put --transaction=$tid ...
...
ipfs block put --transaction=$tid ...
ipfs pin add $hash
ipfs transaction close $tid

We can have a shorthand for the transaction flag too: --transaction=<tid> | -t <tid>.

@ianopolous (Member, Author) commented Nov 15, 2018

@michaelavila Yep, that's how I imagined it'd be implemented.

@cboddy (Member) commented Nov 29, 2018

@michaelavila that's exciting! Just wondering if any work is proceeding on this feature, or when such a feature may be expected?

@momack2 commented Dec 4, 2018

@michaelavila @eingenito @Stebalien - can I ask that we freeze an interface for folks to implement against sooner rather than later (e.g., by end of week if possible), despite not having it implemented yet? If we can agree on that, we can do useful work in parallel that lines up along that interface. =]

@eingenito commented Dec 5, 2018

@cboddy and @momack2 - @michaelavila is currently not working on this feature, and it's unlikely to be addressed in the next two months with priorities as-is, which means that agreeing on the command-line API might not buy much, particularly since actually trying to invoke ipfs with these arguments will just cause it to complain.

That said, how about:

ipfs transaction start [--duration=<millis>], so we can reserve the right to create open-ended transactions
ipfs block put [--transaction-id=<tid>] <data> to specify the transaction, maybe leaving room for a boolean transaction flag or something like that later
ipfs transaction commit <tid> to commit an identified transaction

-t isn't a great short name because it's already used in ipfs add and that's a command that might well want to participate in a transaction, so maybe just punt on that for now?

Also, does pinning participate in the transaction? ipfs pin add [--transaction-id=<tid>]? I know that under the hood the first implementation is likely to just defer GC until transaction commit, but that's not something the user would automatically know (everyone here would).
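
One way to picture that "defer GC until commit" first implementation is a simple reader/writer lock (a sketch, not go-ipfs code): transactions hold read locks, and GC waits for the write lock.

package gclock

import "sync"

// GCLock sketches deferring GC until commit: open transactions hold a
// read lock, GC takes the write lock, so GC simply waits until no
// transaction is in flight.
type GCLock struct{ mu sync.RWMutex }

// Begin is called when a transaction starts; Commit releases it.
func (l *GCLock) Begin()  { l.mu.RLock() }
func (l *GCLock) Commit() { l.mu.RUnlock() }

// RunGC blocks until every open transaction has committed, then runs gc
// while preventing new transactions from starting.
func (l *GCLock) RunGC(gc func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	gc()
}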

What do people think?

@ianopolous (Member, Author) commented Dec 5, 2018

@eingenito We don't need a commit; that is covered by pin.

The idea is that you write a bunch of related IPLD objects inside the transaction, pin what you need at the end, and then close the transaction (which just allows the rest to be GC'd). The pin command doesn't need awareness of the transaction (though clearly the intermediate objects need to be pinnable).

FWIW, we only care about the HTTP API, and specifically only the block put command.
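
Over the HTTP API, a tagged write would then just be the existing block put endpoint plus the proposed parameter (a sketch; transaction-id is the proposal, not an existing option):

curl -X POST -F file=@block.bin "http://127.0.0.1:5001/api/v0/block/put?transaction-id=$TID"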

@momack2 commented Dec 14, 2018

I think we are agreed on @michaelavila's API from #106 (comment), with the caveat that --transaction-id=<tid> (without the short -t option) seems safer. (@Stebalien - shout if you disagree)

@ianopolous is there anything else blocking you?

@ianopolous (Member, Author) commented Dec 15, 2018

@momack2 That's great! That's all we need to move forward. Thank you!

@ianopolous referenced this issue Dec 24, 2018

@ianopolous (Member, Author) commented Jul 10, 2019

@momack2 Is there any possibility of making progress on an implementation of this?

@momack2 commented Jul 12, 2019

Our team is very focused on the identified needs of package managers this quarter, so I don't think we have the development bandwidth to take this on. But if you want to drive an implementation proposal and bring it to an upcoming core implementations meeting for discussion and design feedback, I think we could help review, advise, and unblock from that perspective.
