
[FEATURE] Use fsfreeze instead of sync before snapshot #2187

Open
yasker opened this issue Jan 15, 2021 · 30 comments
Labels
area/v1-data-engine (v1 data engine (iSCSI tgt)), area/volume-data-integrity (Volume Data integrity related), highlight (Important feature/issue to highlight), kind/feature (Feature request, new feature), priority/0 (Must be fixed in this release (managed by PO)), require/manual-test-plan (Require adding/updating manual test cases if they can't be automated)

@yasker
Member

yasker commented Jan 15, 2021

Is your feature request related to a problem? Please describe.
Currently, before a snapshot operation, we run a sync to flush the cache to the disk. It would be even safer if we also ran fsfreeze against the filesystem, to stop the filesystem from writing data to the disk before we take the snapshot. This would ensure the snapshot we take contains a stable filesystem image.

Describe the solution you'd like
Snapshot process:

  1. fsfreeze --freeze <fs_mount_point>, which should include the cache flush. See https://linux.die.net/man/8/fsfreeze
  2. Take a snapshot
  3. fsfreeze --unfreeze <fs_mount_point>

Note that fsfreeze only works for certain filesystems. For filesystems that don't support fsfreeze, sync will be used instead.
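For illustration, a minimal Go sketch of this flow, assuming the engine shells out to fsfreeze and falls back to sync when the freeze fails (the function and its signature are hypothetical, not the actual longhorn-engine API):

```go
package snapshot

import (
	"fmt"
	"os/exec"
	"syscall"
)

// snapshotWithFreeze freezes the filesystem mounted at mountPoint, takes the
// snapshot, then unfreezes it. If the freeze fails (e.g. the filesystem does
// not support fsfreeze), it falls back to a plain sync. takeSnapshot stands in
// for the real snapshot call.
func snapshotWithFreeze(mountPoint string, takeSnapshot func() error) error {
	frozen := false
	if err := exec.Command("fsfreeze", "--freeze", mountPoint).Run(); err != nil {
		// Fallback: flush caches the old way.
		syscall.Sync()
	} else {
		frozen = true
	}

	snapErr := takeSnapshot()

	if frozen {
		if err := exec.Command("fsfreeze", "--unfreeze", mountPoint).Run(); err != nil {
			return fmt.Errorf("snapshot finished but unfreezing %s failed: %v", mountPoint, err)
		}
	}
	return snapErr
}
```

The freeze window stays short because only the snapshot call sits between the freeze and the unfreeze.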


Additional context
Since we have to keep the freeze period as short as possible, it's better to do it inside the engine. One additional complexity is how the engine can determine where the block device is mounted.

@yasker yasker added this to the v1.1.1 milestone Jan 15, 2021
@yasker yasker added area/v1-data-engine v1 data engine (iSCSI tgt) kind/feature Feature request, new feature priority/0 Must be fixed in this release (managed by PO) require/manual-test-plan Require adding/updating manual test cases if they can't be automated labels Jan 15, 2021
@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 3, 2021

Pre-merged Checklist

  • Are the reproduce steps/test steps documented?

  • Is there a workaround for the issue? If so, is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, have both YAML file and Chart been updated in the PR?

  • Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at Freeze a volumes filesystem if mounted before taking a snapshot longhorn-engine#565
    Wrap underlying os.exec error so we can process the programs exit code go-iscsi-helper#38

  • Which areas/issues this PR might have potential impacts on?
    Area engine, snapshots, backups, mounted volumes
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at
    The automation test case PR is at

  • If labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at Freeze a volumes filesystem if mounted before taking a snapshot longhorn-engine#565

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@innobead
Member

innobead commented Mar 8, 2021

ref: #2029 (comment)

@joshimoo
Contributor

joshimoo commented Mar 9, 2021

the manual test should be:

  • create a kubernetes pod + pvc that mounts a lh volume
  • dd if=/dev/urandom of=/mnt/<volume>
  • while dd is running, initiate a snapshot
  • verify the snapshot succeeds and the dd operation completes
  • create another snapshot/backup.

@khushboo-rancher @meldafrawi I am guessing this is already covered by our manual backup tests.

@khushboo-rancher
Contributor

Verified with Longhorn-master 03/19/2021

Validation - Pass

The test steps from #2187 (comment) work fine.

@joshimoo The snapshot and backup creation worked fine while writing was in progress, but this also works fine with v1.1.0. How can I make sure that fsfreeze is being exercised by these test steps?

Regarding the test cases, we generally do this kind of testing as part of our pre-release testing (e.g. basic volume operations while writing is in progress), but I can't locate the exact test case. I'll add this to our pre-release tests.

@khushboo-rancher
Contributor

Logs for Froze filesystem

[instance-manager-e-424ff07d] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-e-93cf8583] time="2021-03-19T15:12:59Z" level=info msg="Starting snapshot" snapshot= volume=pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9 
[instance-manager-e-424ff07d] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-e-93cf8583] time="2021-03-19T15:13:00Z" level=info msg="Froze filesystem of volume mounted at /var/lib/kubelet/pods/7f3a44c7-fc27-4e08-ac8c-134eb79b3106/volumes/kubernetes.io~csi/pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9/mount" volume=pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9 
[instance-manager-r-8b09fc1c] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-r-9320ccbc] time="2021-03-19T15:13:00Z" level=info msg="Replica server starts to snapshot [d77debd9-7670-4cd1-ab7e-eecfd11c4026] volume, user created true, created time 2021-03-19T15:13:00Z, labels map[]" 
[instance-manager-r-8b09fc1c] time="2021-03-19T15:13:00Z" level=info msg="Starting to create disk" disk=d77debd9-7670-4cd1-ab7e-eecfd11c4026 
[instance-manager-r-8b09fc1c] time="2021-03-19T15:13:00Z" level=info msg="Finished creating disk" disk=d77debd9-7670-4cd1-ab7e-eecfd11c4026 
[instance-manager-e-424ff07d] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-e-93cf8583] time="2021-03-19T15:13:00Z" level=info msg="Starting to snapshot: 10.42.3.7:10000 d77debd9-7670-4cd1-ab7e-eecfd11c4026 UserCreated true Created at 2021-03-19T15:13:00Z, Labels map[]" 
[instance-manager-r-15523ad4] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-r-290693c4] time="2021-03-19T15:13:00Z" level=info msg="Replica server starts to snapshot [d77debd9-7670-4cd1-ab7e-eecfd11c4026] volume, user created true, created time 2021-03-19T15:13:00Z, labels map[]" 
[instance-manager-r-5bdc7ee4] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-r-cab61160] time="2021-03-19T15:13:00Z" level=info msg="Replica server starts to snapshot [d77debd9-7670-4cd1-ab7e-eecfd11c4026] volume, user created true, created time 2021-03-19T15:13:00Z, labels map[]" 
[instance-manager-r-5bdc7ee4] time="2021-03-19T15:13:00Z" level=info msg="Starting to create disk" disk=d77debd9-7670-4cd1-ab7e-eecfd11c4026 
[instance-manager-r-15523ad4] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-r-290693c4] time="2021-03-19T15:13:00Z" level=info msg="Starting to create disk" disk=d77debd9-7670-4cd1-ab7e-eecfd11c4026 
[instance-manager-r-5bdc7ee4] time="2021-03-19T15:13:00Z" level=info msg="Finished creating disk" disk=d77debd9-7670-4cd1-ab7e-eecfd11c4026 
[instance-manager-r-15523ad4] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-r-290693c4] time="2021-03-19T15:13:00Z" level=info msg="Finished creating disk" disk=d77debd9-7670-4cd1-ab7e-eecfd11c4026 
[instance-manager-e-424ff07d] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-e-93cf8583] time="2021-03-19T15:13:00Z" level=info msg="Starting to snapshot: 10.42.1.7:10000 d77debd9-7670-4cd1-ab7e-eecfd11c4026 UserCreated true Created at 2021-03-19T15:13:00Z, Labels map[]" 
[instance-manager-e-424ff07d] time="2021-03-19T15:13:00Z" level=info msg="Starting to snapshot: 10.42.2.7:10000 d77debd9-7670-4cd1-ab7e-eecfd11c4026 UserCreated true Created at 2021-03-19T15:13:00Z, Labels map[]" 
[instance-manager-e-424ff07d] time="2021-03-19T15:13:00Z" level=info msg="Finished to snapshot: 10.42.1.7:10000 d77debd9-7670-4cd1-ab7e-eecfd11c4026 UserCreated true Created at 2021-03-19T15:13:00Z, Labels map[]" 
[instance-manager-e-424ff07d] time="2021-03-19T15:13:00Z" level=info msg="Finished to snapshot: 10.42.2.7:10000 d77debd9-7670-4cd1-ab7e-eecfd11c4026 UserCreated true Created at 2021-03-19T15:13:00Z, Labels map[]" 
[instance-manager-e-424ff07d] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-e-93cf8583] time="2021-03-19T15:13:00Z" level=info msg="Finished to snapshot: 10.42.3.7:10000 d77debd9-7670-4cd1-ab7e-eecfd11c4026 UserCreated true Created at 2021-03-19T15:13:00Z, Labels map[]" 
[instance-manager-e-424ff07d] time="2021-03-19T15:13:00Z" level=info msg="Finished snapshot" snapshot= volume=pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9 
[instance-manager-e-424ff07d] [pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9-e-93cf8583] time="2021-03-19T15:13:00Z" level=info msg="Unfroze filesystem of volume mounted at /var/lib/kubelet/pods/7f3a44c7-fc27-4e08-ac8c-134eb79b3106/volumes/kubernetes.io~csi/pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9/mount" volume=pvc-43f5d029-511e-48ff-84bd-13545d3a7ef9 

@joshimoo
Contributor

joshimoo commented Mar 19, 2021

@khushboo-rancher Your best bet is the Froze Filesystem log.

You can also add status reporting: dd if=/dev/urandom of=/mnt/<volume> conv=sync status=progress; this should force progress output. But all of this is only relevant if the freeze takes longer than a second, which it shouldn't :)

If you want to give this a really good test, you can spin up a node with 32 GB of memory, then create a 500 GB volume.
Mount this volume and fill it, then start a new write with dd if=/dev/urandom of=/mnt/<volume> status=progress bs=16M. With the big 16M dd buffer, you should build up a large dirty page cache thanks to the 32 GB of memory.

Before taking the snapshot, run cat /proc/meminfo | grep Dirty.
After the fsfreeze, run cat /proc/meminfo | grep Dirty again and keep the results.

The dirty pages will get flushed once the fsfreeze happens. The same behavior can be tested without a mounted filesystem; in that case we run a sync, which can take a while as well.

Here is some info on the page cache.
https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
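As a side note, here is a small Go helper (purely illustrative, not part of Longhorn) that reads the same Dirty/Writeback counters from /proc/meminfo that the grep above looks at:

```go
package meminfo

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// PrintDirtyCounters prints the Dirty and Writeback lines from /proc/meminfo,
// which should drop sharply right after fsfreeze flushes the page cache.
func PrintDirtyCounters() error {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Dirty:") || strings.HasPrefix(line, "Writeback:") {
			fmt.Println(line)
		}
	}
	return scanner.Err()
}
```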

@khushboo-rancher
Contributor

I checked the dirty page count after taking the snapshot; it went down after the fsfreeze.
[Screenshot: Screen Shot 2021-03-23 at 1.48.45 PM]

@innobead innobead added the highlight label Apr 9, 2021
@innobead innobead changed the title from "[FEATURE]Use fsfreeze instead of sync before snapshot" to "[FEATURE] Use fsfreeze instead of sync before snapshot" Apr 16, 2021
@innobead innobead reopened this Oct 8, 2021
@innobead
Member

innobead commented Oct 8, 2021

Reopened, due to #3125 reverting the implementation.

@ChipWolf

ChipWolf commented Mar 3, 2022

Is this still going to proceed?

@innobead innobead removed this from the v1.1.1 milestone Mar 3, 2022
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Apr 13, 2023
@derekbit
Member

The issue mentioned in #3125 is due to fsfreeze freezing the mount point of whatever path it is given. Despite the man page stating the path should be the pathname of a mount point, it can actually be any path.

The potential race condition can be addressed by adding a device mapper (dm) device on top of the iSCSI block device and exporting that dm device as the Longhorn volume.

To take a snapshot of the volume, we can execute dmsetup suspend, and the filesystem is automatically frozen (code). In addition to preventing the race condition, the DM device can be used to live migrate a v1 volume to another instance-manager pod as described in #6001 (comment). Its disadvantage is that it cannot be applied to existing volumes.
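A rough Go sketch of that snapshot path, assuming the engine drives dmsetup through exec (the function name and wiring are illustrative only):

```go
package snapshot

import (
	"fmt"
	"os/exec"
)

// snapshotWithDMSuspend suspends the dm device sitting on top of the iSCSI
// block device (dmsetup suspend also freezes a mounted filesystem by default),
// takes the snapshot, then resumes the device. dmName is the device-mapper
// device name; takeSnapshot stands in for the real snapshot call.
func snapshotWithDMSuspend(dmName string, takeSnapshot func() error) error {
	if out, err := exec.Command("dmsetup", "suspend", dmName).CombinedOutput(); err != nil {
		return fmt.Errorf("dmsetup suspend %s: %v: %s", dmName, err, out)
	}
	// Always resume, even if the snapshot fails, so I/O is not stalled forever.
	defer exec.Command("dmsetup", "resume", dmName).Run()

	return takeSnapshot()
}
```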

Any feedback is appreciated. Thank you.

cc @shuo-wu @innobead @james-munson @WebberHuang1118 @Vicente-Cheng

@derekbit
Member

Is this still going to proceed?

@ChipWolf Yes. Any concerns or disadvantages with the current sync approach?

@ChipWolf

Is this still going to proceed?

@ChipWolf Yes. Any concerns or disadvantages with the current sync approach?

Stable snapshots, freezing disk writes

@derekbit
Member

derekbit commented Sep 12, 2023

Performance benchmark

  • AWS c5d.xlarge with direct NVMe disks

  • The device mapper doesn't introduce a significant performance impact.

  • Longhorn volume with single replica (v1.5.1)

    IOPS (Read/Write)
            Random:      15,277 / 9,014
        Sequential:     26,884 / 13,337

    Bandwidth in KiB/sec (Read/Write)
            Random:     180,662 / 78,641
        Sequential:     180,682 / 79,799

    Latency in ns (Read/Write)
            Random:     388,402 / 371,265
        Sequential:     324,336 / 372,944

  • Longhorn volume with single replica + device mapper linear device
    • dmsetup create testdisk --table "0 $(blockdev --getsz /dev/longhorn/testvol) linear /dev/longhorn/testvol 0"

    IOPS (Read/Write)
            Random:      15,394 / 9,001
        Sequential:     26,783 / 13,309

    Bandwidth in KiB/sec (Read/Write)
            Random:     180,626 / 78,629
        Sequential:     185,223 / 79,812

    Latency in ns (Read/Write)
            Random:     385,477 / 369,820
        Sequential:     318,747 / 367,316

cc @shuo-wu

@WebberHuang1118

The issue mentioned in #3125 is due to fsfreeze freezing the mount point of whatever path it is given. Despite the man page stating the path should be the pathname of a mount point, it can actually be any path.

The potential race condition can be addressed by adding a device mapper (dm) device on top of the iSCSI block device and exporting that dm device as the Longhorn volume.

To take a snapshot of the volume, we can execute dmsetup suspend, and the filesystem is automatically frozen (code). In addition to preventing the race condition, the DM device can be used to live migrate a v1 volume to another instance-manager pod as described in #6001 (comment). Its disadvantage is that it cannot be applied to existing volumes.

Any feedback is appreciated. Thank you.

cc @shuo-wu @innobead @james-munson @WebberHuang1118 @Vicente-Cheng

As far as I know, the suspended DM device should be at the top of the storage stack; according to the architecture, would there be another Longhorn device on top of it?

Since DM introduces an additional layer for processing bios, it could bring some performance degradation. However, if we only leverage the linear target, it can probably be ignored.

Aside from the above, LGTM, thanks.

@innobead
Member

innobead commented Sep 12, 2023

To take a snapshot of the volume, we can execute dmsetup suspend, and the filesystem is automatically frozen (code). In addition to preventing the race condition, the DM device can be used to live migrate a v1 volume to another instance-manager pod as described in #6001 (comment). Its disadvantage is that it cannot be applied to existing volumes.

If an existing volume gets reattached, then we are still able to apply this dm device, right?

Regarding live upgrade, this will be a big benefit for resource usage reduction, because it means there will no longer be two instance-manager pods in the long run if users enable live migration. We can do live migration in a later version; this version focuses on data consistency.

Also, the current implementation using sync applies to all filesystems on the host, so it can be expected to have extra performance costs.

After discussing with @derekbit: introducing this extra dm-device layer on top of the Longhorn volume should be independent of the volume type; it can apply to filesystem or raw block volumes.

@derekbit
Member

derekbit commented Sep 12, 2023

To take a snapshot of the volume, we can execute dmsetup suspend, and the filesystem is automatically frozen (code). In addition to preventing the race condition, the DM device can be used to live migrate a v1 volume to another instance-manager pod as described in #6001 (comment). Its disadvantage is that it cannot be applied to existing volumes.

If an existing volume gets reattached, then we are still able to apply this dm device, right?

After creating an iSCSI block device /dev/sdx, we will add a dm-linear device on top of the block device, then link the device to /dev/longhorn/<pvc name>. Thus, the block mappings are different from those in the current device implementation, and the dm solution cannot be applied to existing volumes.
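A hedged Go sketch of that attach path, shelling out to blockdev and dmsetup as in the benchmark command above (function names and paths are illustrative):

```go
package frontend

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// exportWithLinearDM puts a dm-linear device on top of the iSCSI block device
// (e.g. /dev/sdx) and links the dm device, not the raw iSCSI device, to
// /dev/longhorn/<volume name>.
func exportWithLinearDM(iscsiDev, volumeName string) error {
	// Size in 512-byte sectors, equivalent to `blockdev --getsz <dev>`.
	out, err := exec.Command("blockdev", "--getsz", iscsiDev).Output()
	if err != nil {
		return fmt.Errorf("blockdev --getsz %s: %v", iscsiDev, err)
	}
	sectors := strings.TrimSpace(string(out))

	// dmsetup create <name> --table "0 <sectors> linear <dev> 0"
	table := fmt.Sprintf("0 %s linear %s 0", sectors, iscsiDev)
	if out, err := exec.Command("dmsetup", "create", volumeName, "--table", table).CombinedOutput(); err != nil {
		return fmt.Errorf("dmsetup create %s: %v: %s", volumeName, err, out)
	}

	// Expose the dm device as the Longhorn volume.
	if err := os.MkdirAll("/dev/longhorn", 0755); err != nil {
		return err
	}
	return os.Symlink("/dev/mapper/"+volumeName, "/dev/longhorn/"+volumeName)
}
```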


@derekbit
Member

As far as I know, the suspended DM device should be at the top of the storage stack; according to the architecture, would there be another Longhorn device on top of it?

The new storage stack will look like #2187 (comment).

cc @WebberHuang1118

@innobead
Member

After creating an iSCSI block device /dev/sdx, we will add a dm-linear device on top of the block device, then link the device to /dev/longhorn/<pvc name>. Thus, the block mappings are different from those in the current device implementation, and the dm solution cannot be applied to existing volumes.

Does this mean we need two different ways to handle existing volumes and new volumes? If yes, how do we differentiate them?

@Vicente-Cheng

Vicente-Cheng commented Sep 13, 2023

Thanks @derekbit for the detailed explanation.
So briefly, the new design would be like #6001 (comment), right?
(meaning /dev/sdx would get a dm device added on top of the block device, and that dm device would be exported as the Longhorn device)

I do not have much concern about this design. It looks like the dm device and the suspend command would let us achieve the same effect as fsfreeze.

Just curious about the client I/O and the suspend period.
When we do dmsetup suspend, client I/O will stall.
Some sensitive clients will get errors or warnings if this suspension period takes a bit longer.
The same situation will also apply to #6001 (comment).

TL;DR, I am curious about the trade-off between the dm suspension period and client I/O.
Everything else looks good to me! Thanks!

@WebberHuang1118

WebberHuang1118 commented Sep 13, 2023

From my understanding,

During DM suspension, in-flight I/Os are held in the fs layer or the dm layer, depending on how far each I/O has progressed. After DM resume is issued, the held I/Os are processed again.

As for the suspension period, the impact depends on the application's implementation. For example, dd has no predefined timeout, so there are no side effects for it. But as @Vicente-Cheng mentioned, it could be a different situation for some sensitive applications.

However, @derekbit only plans to leverage DM suspend for table reloads of a linear target, which is a less complex operation (compared to taking snapshots, data migration, etc.). IMHO this should not require a long suspend period, but it is worth further verification.

Besides, since the DM linear target only performs a naive bio remap, it should be safe to add it to any storage layer dynamically. Maybe we can use this to provide backward compatibility.

Thanks.

@derekbit
Member

To take a snapshot of the volume, we can execute dmsetup suspend, and the filesystem is automatically frozen (code). In addition to preventing the race condition, the DM device can be used to live migrate a v1 volume to another instance-manager pod as described in #6001 (comment). Its disadvantage is that it cannot be applied to existing volumes.

If an existing volume gets reattached, then we are still able to apply this dm device, right?

After creating an iSCSI block device /dev/sdx, we will add a dm-linear device on top of the block device, then link the device to /dev/longhorn/<pvc name>. Thus, the block mappings are different from those in the current device implementation, and the dm solution cannot be applied to existing volumes.

Correcting my statement: I just discussed with dm expert @WebberHuang1118 and did a quick test. dm-linear is a simple device and won't change the data layout, so we can detach existing volumes and add the dm-linear device when reattaching them.

cc @innobead

@innobead
Member

This is great! Good cooperation @derekbit @WebberHuang1118

@derekbit
Member

derekbit commented Sep 26, 2023

When starting a v1 volume engine, there is a big controller lock protecting the controller creation, and WriteAt and ReadAt are also safeguarded by this lock. Unlike the v2 volume implementation, we cannot directly add a dm-linear device in the startFrontend function. Attempting to do so would result in a deadlock during dm-linear creation, because I/Os are issued to the iSCSI device while the dm-linear device is being created.

To address the issue, we have to:

  • modify startFrontend(): create an iSCSI block device and do not duplicate the iSCSI block device to /dev/longhorn/
  • introduce exposeFrontend(): add a dm-linear device on top of the iSCSI block device and duplicate the device /dev/mapper/<volume name> to /dev/longhorn/<volume name>. The function cannot be protected by the controller lock, to avoid a deadlock. However, will it introduce any issues?

Additionally, the introduced dm-linear device also impacts volume expansion and other operations, so the scope is much bigger than this ticket.

Before the implementation, it is essential to investigate the necessity of the big controller lock.
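A very rough Go sketch of the proposed split; the real controller, lock, and device helpers live in longhorn-engine, and every helper below is a hypothetical placeholder:

```go
package controller

import "sync"

// Illustrative sketch only: createISCSIDevice, createLinearDM, and
// linkLonghornDevice are hypothetical placeholders for the real helpers.

type Controller struct {
	lock       sync.Mutex
	volumeName string
	iscsiDev   string
}

func createISCSIDevice(volume string) (string, error) { return "/dev/sdx", nil }
func createLinearDM(volume, dev string) error         { return nil }
func linkLonghornDevice(volume string) error          { return nil }

// startFrontend runs under the big controller lock: it only creates the iSCSI
// block device and no longer duplicates it to /dev/longhorn/.
func (c *Controller) startFrontend() error {
	c.lock.Lock()
	defer c.lock.Unlock()
	dev, err := createISCSIDevice(c.volumeName)
	if err != nil {
		return err
	}
	c.iscsiDev = dev
	return nil
}

// exposeFrontend runs outside the controller lock: creating the dm-linear
// device issues I/O to the iSCSI device, which would deadlock if the lock
// (which also guards WriteAt/ReadAt) were held here.
func (c *Controller) exposeFrontend() error {
	if err := createLinearDM(c.volumeName, c.iscsiDev); err != nil {
		return err
	}
	// /dev/mapper/<volume name> -> /dev/longhorn/<volume name>
	return linkLonghornDevice(c.volumeName)
}
```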

cc @shuo-wu @Vicente-Cheng @WebberHuang1118 @innobead

@WebberHuang1118

Hi @derekbit,

Can you provide your implementation? I'd like to take a look at it, thanks.

@derekbit
Member

@WebberHuang1118
Not implemented because of the concerns mentioned in #2187 (comment).

@WebberHuang1118

Based on my understanding,
the LH block device is visible after controller.start()->controller.startFrontend()->Tgt.Startup().

Since the block device shouldn't receive any I/O before LH exposes it, the big controller lock in controller.Start() seems safe to remove. Is there any concern about holding the lock during controller start?

IIRC, a DM device flushes I/Os before table loading; maybe that's the reason @derekbit encountered the I/O deadlock.

Thanks.

@ChipWolf

ChipWolf commented Oct 26, 2023

I have applications that need to save data to disk before write operations are paused for a snapshot, and be notified when write operations resume. In the proposed change, will there be a way to do this, hooks or something?

Some of my applications, due to a specific upstream dependency, hold crucial data in memory and only save it to disk occasionally. These applications don't handle extended write pauses well, but they expose methods to suspend/resume write operations and to save any pending data to disk. As the volumes used by these applications are large, snapshots can take a while, making it important for our app to perform these operations before writes are suspended.

Ideally we'd want to receive a hook pre-snapshot that our app must acknowledge before the snapshot takes place, then another hook on completion of the snapshot.

Related/duplicate #2128 #2164

@ejweber
Contributor

ejweber commented Mar 28, 2024

Status update:

After discussing offline with @innobead and @derekbit, we have decided to pursue the fsfreeze approach (assuming we can prove its safety) for now, with the linear device potentially coming later if it helps improve live engine upgrade for v1 volumes as predicted.

I implemented an idea for making the fsfreeze approach safe in longhorn/longhorn-engine@34f4b2a. It opened file descriptors to all the volume's mount points, then chose one to verify (is it still a mount point) and freeze. The open descriptors prevented anyone (e.g. the CSI driver) from unmounting the file system before the freeze command was issued. @PhanLe1010, @shuo-wu, and I discovered some drawbacks:

  1. The file system can still be lazy unmounted (umount -l) while "locked" in this way. It is unlikely that anyone will lazy unmount the file system in the super brief time period between lock and freeze (since it doesn't appear that the CSI plugin or kubelet do it), but it is a concern.
  2. If the engine crashes mid-snapshot, it cannot unfreeze the file system. Subsequent attempts to unmount it will fail until some component tries to unfreeze it again.

For 1, I am implementing a suggestion from @PhanLe1010 that we use our own mount of the file system instead of a file descriptor on someone else's mount to do the "lock" in longhorn/longhorn-engine@fd512c3. We control the mount, so there is virtually no danger of us losing it before fsfreeze.
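A hedged Go sketch of that "own mount" idea; the real change is in longhorn/longhorn-engine@fd512c3, and everything below (names, error handling) is illustrative:

```go
package freeze

import (
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

// freezeViaOwnMount mounts the Longhorn device at a private mount point we
// control, freezes the filesystem through that mount, and returns an unfreeze
// function. Because we own the mount, the workload's mount being unmounted
// (even lazily) between "lock" and freeze is no longer a concern.
func freezeViaOwnMount(device, fsType string) (unfreeze func() error, err error) {
	mnt, err := os.MkdirTemp("", "longhorn-freeze-")
	if err != nil {
		return nil, err
	}
	if err := unix.Mount(device, mnt, fsType, 0, ""); err != nil {
		os.Remove(mnt)
		return nil, err
	}
	if err := exec.Command("fsfreeze", "--freeze", mnt).Run(); err != nil {
		unix.Unmount(mnt, 0)
		os.Remove(mnt)
		return nil, err
	}
	return func() error {
		defer func() {
			unix.Unmount(mnt, 0)
			os.Remove(mnt)
		}()
		return exec.Command("fsfreeze", "--unfreeze", mnt).Run()
	}, nil
}
```

Freezing through our own mount freezes the shared superblock, so the workload's mount is frozen too, but we never depend on a mount we don't control.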

For 2, I am still working on it, but I think it will look something like a suggestion from @shuo-wu. When instance-manager starts up, it will briefly look for mounted Longhorn file systems and attempt to unfreeze them. (If instance-manager was previously down, there is no way these file systems are healthy.) Similarly, when an engine starts up, it will briefly look for mounts of itself and attempt to unfreeze one. We may need to periodically check for mounts of Longhorn file systems for which no engine is running in instance-manager as well, in case there is a scenario in which:

  • An engine crashes.
  • Its instance-manager does not crash.
  • Longhorn never tries to restart the engine.
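A sketch of what that startup recovery could look like: scan the host mounts for Longhorn-backed filesystems and attempt to unfreeze each one (illustrative only, not an agreed implementation):

```go
package recovery

import (
	"bufio"
	"os"
	"os/exec"
	"strings"
)

// unfreezeLeftoverMounts walks /proc/mounts looking for filesystems backed by
// /dev/longhorn/* and tries to unfreeze them. fsfreeze --unfreeze on a
// filesystem that is not frozen fails harmlessly, so errors are ignored.
func unfreezeLeftoverMounts() error {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 || !strings.HasPrefix(fields[0], "/dev/longhorn/") {
			continue
		}
		mountPoint := fields[1]
		_ = exec.Command("fsfreeze", "--unfreeze", mountPoint).Run()
	}
	return scanner.Err()
}
```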

While I think this feature will be good to implement, I'm a bit worried users will run into corner cases we do not anticipate. I think we should make fsfreeze functionality at least globally configurable (defaulting to on) to give users an out if their environment or use case doesn't play well with it.

@ejweber
Contributor

ejweber commented Apr 3, 2024

Another interesting caveat is that both my method of identifying a file system to freeze and the one implemented in https://github.com/longhorn/longhorn-engine/pull/565/files tend to fall apart for block volumes.

Both implementations look for a file system with /dev/longhorn/... as its source. If the container consuming the block volume creates a file system on it and mounts it to some location:

  • Its source is /some/path/in/the/container.
  • The mount is completely invisible to longhorn-engine, since it does not exist in the host mount namespace.

This is probably fine, and maybe even good.

  • I don't think Kubernetes block volumes are generally used by workloads that create a file system on them and mount them. The Kubernetes docs suggest a more common use case is raw block access for databases.
  • If a workload IS partitioning a block volume and/or creating file systems on it, it probably assumes it has total control over it. It would be strange to seek out file systems that Longhorn didn't create itself for freezing.

@ejweber
Contributor

ejweber commented Apr 15, 2024

Implementation is currently held up by an ongoing investigation into a manual test failure that can leave the fsfreeze approach AND the dmsetup approach with unkillable workload processes and the need to reboot. Posting the results here as I work through it: https://github.com/longhorn/longhorn/wiki/Freezing-File-Systems-With-dmsetup-suspend-Versus-fsfreeze.
