In-place staging mode #146

Open
daveh86 opened this issue Sep 16, 2019 · 8 comments

Comments

@daveh86

daveh86 commented Sep 16, 2019

Feature request here.

I'm wondering if you would consider adding an in-place option, similar to rsync's --inplace flag.

This would be a really useful feature for handling large files and for real-time files like logs.

@xenoscopic
Member

That makes sense. It would require a bit of modification to the way Mutagen's change application algorithm works (and some failure handling, because it could leave files in a corrupt state in the event of a network disconnect), but it's certainly theoretically possible (and I can see the value for large files). I'll put it on the roadmap for a future release.
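
For illustration, here's a rough Go sketch (not Mutagen's actual code; the function and temp-file naming are just for this example) of why staged application is robust against disconnects, and what in-place application would give up:

```go
package staging

import (
	"io"
	"os"
	"path/filepath"
)

// applyStaged stages new content in a temporary file on the same filesystem
// as the destination and then atomically renames it into place. A disconnect
// mid-transfer leaves the destination untouched; in-place application would
// trade away exactly this guarantee in exchange for avoiding a full copy.
func applyStaged(destination string, content io.Reader) error {
	staged, err := os.CreateTemp(filepath.Dir(destination), ".staging-*")
	if err != nil {
		return err
	}
	defer os.Remove(staged.Name()) // harmless if the rename below succeeds

	if _, err := io.Copy(staged, content); err != nil {
		staged.Close()
		return err
	}
	if err := staged.Close(); err != nil {
		return err
	}

	// Readers observe either the old file or the fully written new one.
	return os.Rename(staged.Name(), destination)
}
```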

@xenoscopic xenoscopic added this to the Unplanned milestone Sep 20, 2019
@xenoscopic
Member

This issue is indirectly related to guard/guard#924, in the sense that implementing this would likely solve that issue, though that issue might also be solved in other ways.

@xenoscopic xenoscopic changed the title Inplace Syncing Mode In-place staging mode Jan 27, 2020
@xenoscopic xenoscopic modified the milestones: Unplanned, v0.12.x Jan 27, 2020
@xenoscopic
Member

Just a minor update here: Although it's not an "in-place" staging mode, Mutagen v0.13 did add an internal staging mode that will at least place the staging directory inside the synchronization root. There's also a neighboring staging mode that will place the staging directory next to the synchronization root. Both of these options are designed to avoid cross-device renames.
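
To illustrate the cross-device issue (a rough sketch with a hypothetical helper, not Mutagen's code): os.Rename is only atomic, and only possible at all, within a single filesystem, which is why the staging directory's location matters.

```go
package staging

import (
	"errors"
	"os"
	"syscall"
)

// renameStaged attempts the atomic rename used to move a staged file into the
// synchronization root, and reports the case where the staging area lives on
// a different filesystem (EXDEV on Linux), which would otherwise force a
// non-atomic copy fallback.
func renameStaged(stagedPath, destinationPath string) error {
	err := os.Rename(stagedPath, destinationPath)
	if errors.Is(err, syscall.EXDEV) {
		return errors.New("staging area and sync root are on different filesystems; " +
			"internal/neighboring staging avoids this by staging on the same device")
	}
	return err
}
```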

@matthewtraughber

matthewtraughber commented May 10, 2022

tl;dr: "in-place" staging would be extremely useful for bidirectional data synchronization between CoW filesystems (which already provide checksumming, redundancy, snapshots, etc.).


I’ve been looking for a real-time, bidirectional syncing solution to use between two ZFS pools (with automated ZFS snapshotting on one pool). However, because ZFS is a copy-on-write filesystem that operates at the block level, most syncing solutions negate the storage efficiency of ZFS snapshots, since they update files via copy-and-rename instead of in place.

 

This example is with ZFS & Syncthing, but I believe the same principle applies with Mutagen:

Unfortunately, syncthing doesn't update the files in a way that's friendly to the block-based replication that zfs does. Look at their documentation closely:
Syncthing never writes directly to a destination file. Instead all changes are made to a temporary copy which is then moved in place over the old version.
This method of updating is not compatible with efficient ZFS snapshotting. ZFS expects changes to files to be made directly to the destination file, and only the specific changes needed. Syncthing instead makes a full copy of the file being synced (with any updates applied), then makes a rename() system call to overwrite the old file with the newly updated file. Because ZFS is a copy-on-write filesystem, it follows Syncthing's instructions: it makes an exact copy of all the data from the original file in the new temporary file, while leaving the original file in place as well. Since you have a snapshot referring to the blocks in the old file, and the active mount referring to the blocks in the new file, you end up storing both.

 

From my research, there isn’t a clear bidirectional file syncing solution that’s compatible with CoW filesystems.
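
For concreteness, here's what a snapshot-friendly in-place update looks like in a minimal Go sketch (the change representation is hypothetical): only the changed byte ranges are written into the existing file, so ZFS allocates new blocks for those ranges alone and a prior snapshot keeps sharing everything else.

```go
package inplace

import "os"

// change describes one modified region of a file.
type change struct {
	Offset int64
	Data   []byte
}

// applyInPlace writes each changed range directly at its offset in the
// destination, in contrast to the copy-then-rename() approach quoted above.
func applyInPlace(path string, changes []change) error {
	file, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer file.Close()
	for _, c := range changes {
		if _, err := file.WriteAt(c.Data, c.Offset); err != nil {
			return err
		}
	}
	return file.Sync()
}
```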

@xenoscopic
Member

@matthewtraughber Thanks for the additional discussion points and links. I definitely think your argument is one of the strongest motivating factors for an in-place staging mode (though, of course, there are many other valid reasons).

In fact, CoW filesystems are one of the few places where I think it might be "trivial" to implement in-place staging, because you can make a cheap copy of the base file using FICLONE/FICLONERANGE.
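
To make that concrete, a cheap clone of the base file would look roughly like this Linux-only sketch (assuming golang.org/x/sys/unix; the function is hypothetical, not Mutagen code):

```go
package reflink

import (
	"os"

	"golang.org/x/sys/unix"
)

// cloneBase creates stagingPath as a reflinked copy of basePath. On a
// filesystem with reflink support this shares all data blocks, so the "copy"
// is nearly free, and only the changed ranges would then need to be written.
func cloneBase(basePath, stagingPath string) error {
	src, err := os.Open(basePath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(stagingPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return err
	}
	defer dst.Close()

	// FICLONE shares the source file's extents with the destination. It fails
	// (e.g. with EOPNOTSUPP or EXDEV) where reflinks aren't supported or
	// across filesystems, in which case a regular copy would be needed.
	return unix.IoctlFileClone(int(dst.Fd()), int(src.Fd()))
}
```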

For other filesystems (e.g. ext4) you have to keep track of which rsync blocks in a file have been invalidated by being overwritten or shifted, and making mid-file insertions work efficiently is extremely difficult since you have to watch for overlapping writes/reads and update the tracking of block indices and so on. In fact, I think that somewhere in the rsync documentation, there's a caveat stating that the --inplace flag can reduce transfer efficiency for exactly that reason (i.e. that it might no longer be able to use base file blocks as they become invalidated), but I can't find it at the moment.

Anyway, I agree, though I still think it will be really tough to implement and validate.

@matthewtraughber

@xenoscopic

Appreciate the additional context; I wasn't aware of FICLONE/FICLONERANGE.

If the level of effort is significantly greater for non-CoW filesystems, I'd propose (albeit somewhat selfishly) that there's benefit in a "CoW mode" for Mutagen that utilizes those I/O control commands.

In full disclosure, though, this is well outside my area of expertise; I just wanted to add more data points on why this functionality would be needed.

@xenoscopic
Member

@matthewtraughber Understood. And your links are definitely appreciated!

For reference, can you tell me what types of files you're looking to sync on these filesystems? Are these large, append-only files like logs? Code? Media files? I'm a bit curious about the use-case drivers, because they might inform some of the heuristics for doing in-place staging more optimally.

@matthewtraughber

Realistically, it would be any file type (I know that's not particularly helpful): text (code), media (primarily H.265/H.264), containers, etc.

My use case is still being fleshed out, but essentially: I have a local server that hosts numerous applications, along with acting as a central backup for all devices on my network/VPN. The data on the server resides on multiple ZFS mirror pools, taking regular snapshots with Sanoid.

I'm transitioning from using the server directly for all computing needs to a new laptop (an M1 MacBook). I'd like to retain read/write access to all data on the server as if it were on the local machine (the laptop).

Initially I was looking into FUSE/SSHFS as a solution for accessing data on the primary server. However, there are bandwidth constraints for large files and streaming local media (and it requires a constant network connection). Naturally, the alternative is to maintain a copy of the data on both devices. If synchronization were one-way, then zfs send/receive would be the logical solution. However, the server will be creating new data (as well as having other devices back up their data to it). Using zfs send from the laptop would overwrite new data on the server, and zfs receive would overwrite any work done on the laptop. The only possible way around this (that I'm aware of) is to split the data into different datasets, but I'd prefer to avoid that route if possible.

Hopefully that gives a bit more context to the workflow I'm envisioning.

@xenoscopic xenoscopic modified the milestones: v0.12.x, Unplanned Jul 8, 2022