In-place staging mode #146

Open
daveh86 opened this issue Sep 16, 2019 · 8 comments

Comments

@daveh86

daveh86 commented Sep 16, 2019

Feature request here.

I'm wondering if you would consider adding an in-place option, similar to rsync's --inplace flag.

This would be a really useful feature for handling large files and for real-time files like logs.

@xenoscopic
Member

That makes sense. It would require a bit of modification to the way Mutagen's change application algorithm works (and some failure handling, because it could leave files in a corrupt state in the event of a network disconnect), but it's certainly theoretically possible (and I can see the value for large files). I'll put it on the roadmap for a future release.
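
For illustration, here's a rough Go sketch (not Mutagen's actual code; the function and temp-file naming are just for this example) of why staged application is robust against disconnects, and what in-place application would give up:

```go
package staging

import (
	"io"
	"os"
	"path/filepath"
)

// applyStaged stages new content in a temporary file on the same filesystem
// as the destination and then atomically renames it into place. A disconnect
// mid-transfer leaves the destination untouched; in-place application would
// trade away exactly this guarantee in exchange for avoiding a full copy.
func applyStaged(destination string, content io.Reader) error {
	staged, err := os.CreateTemp(filepath.Dir(destination), ".staging-*")
	if err != nil {
		return err
	}
	defer os.Remove(staged.Name()) // harmless if the rename below succeeds

	if _, err := io.Copy(staged, content); err != nil {
		staged.Close()
		return err
	}
	if err := staged.Close(); err != nil {
		return err
	}

	// Readers observe either the old file or the fully written new one.
	return os.Rename(staged.Name(), destination)
}
```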

@xenoscopic xenoscopic added this to the Unplanned milestone Sep 20, 2019
@xenoscopic
Member

This issue is indirectly related to guard/guard#924, in the sense that implementing this would likely solve that issue, though that issue might also be solved in other ways.

@xenoscopic xenoscopic changed the title Inplace Syncing Mode In-place staging mode Jan 27, 2020
@xenoscopic xenoscopic modified the milestones: Unplanned, v0.12.x Jan 27, 2020
@xenoscopic
Member

Just a minor update here: Although it's not an "in-place" staging mode, Mutagen v0.13 did add an internal staging mode that will at least place the staging directory inside the synchronization root. There's also a neighboring staging mode that will place the staging directory next to the synchronization root. Both of these options are designed to avoid cross-device renames.
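
To illustrate the cross-device issue (a rough sketch with a hypothetical helper, not Mutagen's code): os.Rename is only atomic, and only possible at all, within a single filesystem, which is why the staging directory's location matters.

```go
package staging

import (
	"errors"
	"os"
	"syscall"
)

// renameStaged attempts the atomic rename used to move a staged file into the
// synchronization root, and reports the case where the staging area lives on
// a different filesystem (EXDEV on Linux), which would otherwise force a
// non-atomic copy fallback.
func renameStaged(stagedPath, destinationPath string) error {
	err := os.Rename(stagedPath, destinationPath)
	if errors.Is(err, syscall.EXDEV) {
		return errors.New("staging area and sync root are on different filesystems; " +
			"internal/neighboring staging avoids this by staging on the same device")
	}
	return err
}
```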

@matthewtraughber

matthewtraughber commented May 10, 2022

tl;dr: "in-place" staging would be extremely useful for bidirectional data synchronization between CoW filesystems (which already provide checksumming, redundancy, snapshots, etc.).


I’ve been looking for a real-time, bidirectional syncing solution to use between two ZFS pools (with automated ZFS snapshotting on one pool). However, because ZFS is a copy-on-write filesystem that operates at the block level, most syncing solutions negate the storage efficiency of ZFS snapshots, since they update files via copy-and-rename instead of in place.

 

This example is with ZFS & Syncthing, but I believe the same principle applies with Mutagen:

Unfortunately, syncthing doesn't update the files in a way that's friendly to the block-based replication that zfs does. Look at their documentation closely:
Syncthing never writes directly to a destination file. Instead all changes are made to a temporary copy which is then moved in place over the old version.
This method of updating is not compatible with efficient ZFS snapshotting. ZFS expects changes to files to be made directly to the destination file, and only the specific changes needed. Syncthing instead makes a full copy of the file being synced (with any updates applied), then makes a rename() system call to overwrite the old file with the newly updated file. Because ZFS is a copy-on-write filesystem, it follows Syncthing's instructions: it makes an exact copy of all the data from the original file in the new temporary file, while leaving the original file in place as well. Since you have a snapshot referring to the blocks in the old file, and the active mount referring to the blocks in the new file, you end up storing both.

 

From my research, there isn’t a clear bidirectional file syncing solution that’s compatible with CoW filesystems.
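
For concreteness, here's what a snapshot-friendly in-place update looks like in a minimal Go sketch (the change representation is hypothetical): only the changed byte ranges are written into the existing file, so ZFS allocates new blocks for those ranges alone and a prior snapshot keeps sharing everything else.

```go
package inplace

import "os"

// change describes one modified region of a file.
type change struct {
	Offset int64
	Data   []byte
}

// applyInPlace writes each changed range directly at its offset in the
// destination, in contrast to the copy-then-rename() approach quoted above.
func applyInPlace(path string, changes []change) error {
	file, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer file.Close()
	for _, c := range changes {
		if _, err := file.WriteAt(c.Data, c.Offset); err != nil {
			return err
		}
	}
	return file.Sync()
}
```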

@xenoscopic
Member

@matthewtraughber Thanks for the additional discussion points and links. I definitely think your argument is one of the strongest motivating factors for an in-place staging mode (though, of course, there are many other valid reasons).

In fact, CoW filesystems are one of the few places where I think it might be "trivial" to implement in-place staging, because you can make a cheap copy of the base file using FICLONE/FICLONERANGE.
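
To make that concrete, a cheap clone of the base file would look roughly like this Linux-only sketch (assuming golang.org/x/sys/unix; the function is hypothetical, not Mutagen code):

```go
package reflink

import (
	"os"

	"golang.org/x/sys/unix"
)

// cloneBase creates stagingPath as a reflinked copy of basePath. On a
// filesystem with reflink support this shares all data blocks, so the "copy"
// is nearly free, and only the changed ranges would then need to be written.
func cloneBase(basePath, stagingPath string) error {
	src, err := os.Open(basePath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(stagingPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return err
	}
	defer dst.Close()

	// FICLONE shares the source file's extents with the destination. It fails
	// (e.g. with EOPNOTSUPP or EXDEV) where reflinks aren't supported or
	// across filesystems, in which case a regular copy would be needed.
	return unix.IoctlFileClone(int(dst.Fd()), int(src.Fd()))
}
```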

For other filesystems (e.g. ext4) you have to keep track of which rsync blocks in a file have been invalidated by being overwritten or shifted, and making mid-file insertions work efficiently is extremely difficult since you have to watch for overlapping writes/reads and update the tracking of block indices and so on. In fact, I think that somewhere in the rsync documentation, there's a caveat stating that the --inplace flag can reduce transfer efficiency for exactly that reason (i.e. that it might no longer be able to use base file blocks as they become invalidated), but I can't find it at the moment.

Anyway, I agree, though I still think it will be really tough to implement and validate.

@matthewtraughber

@xenoscopic

Appreciate the additional context; I wasn't aware of FICLONE/FICLONERANGE.

If the level of effort is significantly greater for non-CoW filesystems, I'd propose (albeit somewhat selfishly) that there's benefit in a "CoW mode" for Mutagen that utilizes those I/O control commands.

In full disclosure, though, this is well outside my area of expertise; I just wanted to add more data points on why this functionality would be needed.

@xenoscopic
Member

@matthewtraughber Understood. And your links are definitely appreciated!

For reference, can you tell me what types of files you're looking to sync on these filesystems? Are these large, append-only files like logs? Code? Media files? I'm a bit curious about the use-case drivers, because they might inform some of the heuristics for doing in-place staging more optimally.

@matthewtraughber

Realistically, it would be any file type (I know that's not particularly helpful): text (code), media (primarily H.265/H.264), containers, etc.

My use case is still being fleshed out, but essentially: I have a local server that hosts numerous applications, along with acting as a central backup for all devices on my network/VPN. The data on the server resides on multiple ZFS mirror pools, taking regular snapshots with Sanoid.

I'm transitioning from using the server directly for all computing needs to a new laptop (an M1 MacBook). I'd like to retain read/write access to all data on the server as if it were on the local machine (the laptop).

Initially I was looking into FUSE/SSHFS as a solution for accessing data on the primary server. However, there are bandwidth constraints for large files and streaming local media (and it requires a constant network connection). Naturally, the alternative is to maintain a copy of the data on both devices. If synchronization were one-way, then zfs send/receive would be the logical solution. However, the server will be creating new data (as well as having other devices back up their data to it). Using zfs send from the laptop would overwrite new data on the server, and zfs receive would overwrite any work done on the laptop. The only possible way around this (that I'm aware of) is to split the data into different datasets, but I'd prefer to avoid that route if possible.

Hopefully that gives a bit more context to the workflow I'm envisioning.

@xenoscopic xenoscopic modified the milestones: v0.12.x, Unplanned Jul 8, 2022