[FEATURE] Use fsfreeze instead of sync before snapshot #2187
Comments
Pre-merged Checklist
ref: #2029 (comment)
the manual test should be:
@khushboo-rancher @meldafrawi I am guessing this is already covered by our manual backup tests.
Verified with Longhorn master. Validation: Pass. The test steps from #2187 (comment) work fine. @joshimoo The snapshot and backup creation worked fine while writing was in progress, but this also works fine with v1.1.0, so how can I make sure the new fsfreeze path is actually being used? Regarding the test cases, we generally do this kind of testing during our pre-release testing, e.g. basic operations on a volume while writing is in progress, but I can't locate an exact test case; I'll add this to our pre-release tests.
Logs for
@khushboo-rancher Your best bet is the Froze Filesystem log message. You can also add status reporting. If you want to give this a real good test, you can spin up a node with 32 GB of memory, then create a 500 GB volume. Before doing the snapshot, run a large write workload so plenty of dirty data sits in the page cache; it will get flushed once the fsfreeze happens. The same behavior can be tested without a mounted filesystem; in that case we run a `sync`, which can take a while as well. Here is some info on the page cache.
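For anyone reproducing this, a rough sketch of such a test, assuming an illustrative mount point and invoking fsfreeze by hand to make the flush visible (in the feature itself the engine issues the freeze as part of the snapshot):

```bash
# Illustrative mount point of the Longhorn volume on the test node.
MNT=/mnt/longhorn-vol

# Write a few GiB without an explicit sync so the data sits dirty in the page cache.
dd if=/dev/urandom of="$MNT/dirty-data" bs=1M count=4096

# Check how much dirty page cache is waiting to be written back.
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Freezing the filesystem forces writeback before blocking new writes,
# so the Dirty value should drop sharply once the freeze completes.
fsfreeze --freeze "$MNT"
grep -E '^(Dirty|Writeback):' /proc/meminfo
fsfreeze --unfreeze "$MNT"
```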
Reopened due to #3125 reverting the implementation.
Is this still going to proceed?
The issue mentioned in #3125 is due to a potential race condition around freezing the filesystem. The race condition can be addressed by adding a device mapper (dm) device on top of the iSCSI block device. As a result, the dm device, rather than the raw iSCSI block device, would be exported as the Longhorn volume. If one wants to take a snapshot of the volume, it can execute `dmsetup suspend` on the dm device before the snapshot and `dmsetup resume` afterwards. Any feedback is appreciated. Thank you. cc @shuo-wu @innobead @james-munson @WebberHuang1118 @Vicente-Cheng
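A minimal sketch of the proposed snapshot sequence under this design, assuming the dm device already exists; the device name is illustrative and the snapshot step is only a placeholder, not the actual Longhorn implementation:

```bash
DM_NAME=pvc-example   # hypothetical dm device sitting on top of the iSCSI block device

# Suspend flushes in-flight I/O to the underlying device and queues new I/O.
dmsetup suspend "$DM_NAME"

# ... take the Longhorn snapshot of the underlying volume here ...

# Resume releases the queued I/O so the workload continues.
dmsetup resume "$DM_NAME"
```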
@ChipWolf Yes. Any concern or disadvantage of the current `sync` approach?
Stable snapshots, freezing disk writes.
Performance benchmark
As far as I know, the suspended DM device should sit at the top of the storage stack; according to the architecture, would there be another Longhorn device on top of it? Since DM introduces an additional layer for processing bios, it could bring some performance degradation. However, if we only leverage a simple dm-linear mapping, the extra overhead should be small. Aside from the above, LGTM, thanks.
If an existing volume gets reattached, then we are still able to apply this dm device, right? About live upgrade, this will be a big benefit for resource usage reduction, because it means there will no longer be two instance-manager pods in the long run if users enable live migration. We can do live migration in a later version, but this version focuses on data consistency. Also, the current implementation using `sync` flushes all filesystems on the host, so it can be expected to have extra performance costs. After discussing with @derekbit, introducing this extra dm device layer on top of the Longhorn volume should be independent of the volume type; it can apply to either a filesystem or a raw block volume.
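To illustrate the performance point: plain `sync` flushes every filesystem on the host, while a per-filesystem flush or freeze only touches the volume being snapshotted (the mount point below is illustrative):

```bash
# Current behavior: flushes dirty data for ALL filesystems on the node.
sync

# Flushes only the filesystem containing the given path (coreutils sync >= 8.24).
sync --file-system /mnt/longhorn-vol

# fsfreeze also targets a single filesystem, and additionally blocks new writes
# until the corresponding --unfreeze.
fsfreeze --freeze /mnt/longhorn-vol
fsfreeze --unfreeze /mnt/longhorn-vol
```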
The new storage stack will be like |
Does this mean we need two ways to handle existing volumes and new volumes? If yes, how do we differentiate them?
Thanks @derekbit for the detailed explanation. I do not have much concern about this design or the dm layer. Just curious about the client IO and the suspend period. TL;DR: I am curious about the trade-off between the dm suspension period and client IO.
From my understanding, during DM suspension the on-the-fly IOs would be kept in the fs layer or the dm layer, depending on the progress of the individual IOs. After the device is resumed, the queued IOs are released and completed. As for the suspension period, the effect depends on the application's implementation. However, @derekbit only plans to leverage DM suspend for the short window around the snapshot, and since the suspension should be brief, the impact on client IO should be limited. Thanks.
Correcting my earlier statement: I just discussed with dm expert @WebberHuang1118 and did a quick test. dm-linear is a simple device and won't change the data layout, so we can detach existing volumes and add the dm-linear device when attaching them. cc @innobead
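A sketch of why this is safe for existing volumes: a dm-linear table that maps the whole device 1:1 does not change the on-disk layout, so the wrapper can be added (or removed) whenever the volume is attached. The device path and name below are illustrative:

```bash
DEV=/dev/sdX        # existing iSCSI block device backing the volume (illustrative)
NAME=pvc-example    # hypothetical dm device name

# Map sector 0..N of the new dm device straight onto sector 0..N of the backing device.
dmsetup create "$NAME" --table "0 $(blockdev --getsz "$DEV") linear $DEV 0"

# The data is unchanged through either path; e.g. an existing filesystem on the
# volume is still recognized on the new device node.
blkid "/dev/mapper/$NAME"

# Removing the wrapper later leaves the data untouched.
dmsetup remove "$NAME"
```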
This is great! Good cooperation @derekbit @WebberHuang1118 |
When starting a v1 volume engine, there is a big controller lock protecting the controller creation, and WriteAt and ReadAt are also guarded by this big lock. Unlike the v2 volume implementation, we cannot directly add a dm-linear device in the startFrontend function; attempting to do so results in a deadlock during dm-linear creation, because there are IOs being written to the iSCSI device while the dm-linear device is being created. To address the issue, we have to do the following:
Additionally, the introduced dm-linear device also impacts volume expansion and other operations, so the scope is much bigger than this ticket. Before the implementation, it is essential to investigate whether the big controller lock is really necessary.
Hi @derekbit, can you provide your implementation? I'd like to take a look at it, thanks.
@WebberHuang1118 |
Based on my understanding, since the block device shouldn't receive any I/O before LH exposes it, the big controller lock in controller.Start() seems okay to remove. Is there any concern about holding the lock during controller start? IIRC, a DM device issues flush I/Os before table loading; maybe that's the reason @derekbit encountered the I/O deadlock. Thanks.
I have applications that need to save data to disk before write operations are paused for a snapshot, and to be notified when write operations resume. In the proposed change, will there be a way to do this, hooks or something? Some of my applications, due to a specific upstream dependency, hold crucial data in memory and only save it to disk occasionally. These applications don't handle extended write pauses well, but they expose some methods to suspend/resume write operations and to save any pending data to disk. As the volumes used by these applications are large, snapshots can take a while, making it important for our app to perform these operations before writes are suspended. Ideally we'd want to receive a pre-snapshot hook that our app must acknowledge before the snapshot takes place, then another hook on completion of the snapshot.
Status update: After discussing offline with @innobead and @derekbit, we have decided to pursue the fsfreeze approach (assuming we can prove its safety) for now, with the linear device potentially coming later if it helps to improve live engine upgrade for v1 volumes as predicted. I implemented an idea for making the fsfreeze approach safe in longhorn/longhorn-engine@34f4b2a. It opened file descriptors to all the volume's mount points, then chose one to verify (is it still a mount point?) and freeze. The open descriptors prevented anyone (e.g. the CSI driver) from unmounting the file system before the freeze command was issued. @PhanLe1010, @shuo-wu, and I discovered some drawbacks:
For 1, I am implementing a suggestion from @PhanLe1010 that we use our own mount of the file system, instead of a file descriptor on someone else's mount, to do the "lock" in longhorn/longhorn-engine@fd512c3. We control the mount, so there is virtually no danger of us losing it before fsfreeze. For 2, I am still working on it, but I think it will look something like a suggestion from @shuo-wu. When instance-manager starts up, it will briefly look for mounted Longhorn file systems and attempt to unfreeze them. (If instance-manager was previously down, there is no way these file systems are healthy.) Similarly, when an engine starts up, it will briefly look for mounts of itself and attempt to unfreeze one. We may also need instance-manager to periodically check for mounts of Longhorn file systems for which no engine is running, in case there is a scenario in which:
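A rough sketch of the "use our own mount" idea with illustrative paths; it relies on the fact that freezing a filesystem through any of its mount points freezes the shared superblock, so the workload's mount is quiesced as well. This only illustrates the concept, not the longhorn-engine code:

```bash
DEV=/dev/longhorn/pvc-example                  # illustrative volume device node
LOCK_MNT=/var/lib/longhorn/freeze/pvc-example  # illustrative private mount point

# Create a mount that we own; while it exists, the filesystem cannot be torn
# down underneath us the way a mount owned by the CSI driver could be.
mkdir -p "$LOCK_MNT"
mount "$DEV" "$LOCK_MNT"

# Freeze through our own mount; this also freezes the workload's mount of the same filesystem.
fsfreeze --freeze "$LOCK_MNT"

# ... take the snapshot ...

fsfreeze --unfreeze "$LOCK_MNT"
umount "$LOCK_MNT"
```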
While I think this feature will be good to implement, I'm a bit worried users will run into corner cases we do not anticipate. I think we should make fsfreeze functionality at least globally configurable (defaulting to on) to give users an out if their environment or use case doesn't play well with it. |
Another interesting caveat is that both my method of identifying a file system to freeze and the one implemented in https://github.com/longhorn/longhorn-engine/pull/565/files tend to fall apart for block volumes. Both implementations look for a file system whose source is the Longhorn volume's block device.
This is probably fine, and maybe even good.
Implementation is currently held up by an ongoing investigation into a manual test failure that can leave the fsfreeze approach AND the dmsetup approach with unkillable workload processes and the need to reboot. Posting the results here as I work through it: https://github.com/longhorn/longhorn/wiki/Freezing-File-Systems-With-dmsetup-suspend-Versus-fsfreeze. |
Is your feature request related to a problem? Please describe.
Currently, before the snapshot operation, we run a `sync` to flush the cache to disk. It would be even safer if we also ran `fsfreeze` against the filesystem, to stop the filesystem from writing data to the disk before we take the snapshot. This would ensure the snapshot we take contains a stable filesystem image.
Describe the solution you'd like
Snapshot process:
1. `fsfreeze --freeze <fs_mount_point>`, which should include the cache flush. See https://linux.die.net/man/8/fsfreeze
2. Take the snapshot.
3. `fsfreeze --unfreeze <fs_mount_point>`
Notice `fsfreeze` only works for certain filesystems. For the filesystems that don't support `fsfreeze`, `sync` will be used instead.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Since we have to keep the freeze period as short as possible, it's better to do it inside the engine. One additional complexity is how we can know where the block device is mounted inside the engine.
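A minimal sketch of the requested flow, assuming the engine can look up the mount point of its own device; the paths, the lookup, and the snapshot placeholder are illustrative rather than Longhorn's actual code:

```bash
DEV=/dev/longhorn/pvc-example   # illustrative device node exported by the engine

# Find where (if anywhere) the device is mounted.
MNT=$(findmnt --noheadings --first-only --source "$DEV" --output TARGET)

if [ -n "$MNT" ] && fsfreeze --freeze "$MNT"; then
    # Filesystem is flushed and quiesced: take the snapshot, then thaw.
    # ... take the snapshot here ...
    fsfreeze --unfreeze "$MNT"
else
    # No mounted filesystem found, or it does not support freezing:
    # fall back to a plain sync before the snapshot.
    sync
    # ... take the snapshot here ...
fi
```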