
[BUG] Unable to backup volume after NFS server IP change #5856

Closed
roger-ryao opened this issue May 4, 2023 · 24 comments
Assignees
Labels
area/backup-store (Remote backup store related), backport/1.4.2, kind/bug, priority/0 (Must be implemented or fixed in this release, managed by PO), reproduce/always (100% reproducible), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated), severity/1 (Function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade)
Milestone

Comments

@roger-ryao

Describe the bug (🐛 if you encounter this issue)

I stopped the NFS server VM instance on AWS EC2 and started it again, and the instance's public IP changed to a new one. After updating the backup target to the new IP, the volume couldn't be backed up: the icon turned gray, and the backup page showed an error.

To Reproduce

Steps to reproduce the behavior:
Pre-requisite
To set up an external NFS server for backup store, perform the following steps:

  1. Install the nfs-kernel-server package using the following command:
    sudo zypper install nfs-kernel-server
  2. Enable and start the rpcbind.service and nfsserver.service services using the following commands:
    systemctl enable rpcbind.service
    systemctl start rpcbind.service
    systemctl enable nfsserver.service
    systemctl start nfsserver.service
  3. Create a directory to export and change its ownership to nobody:nogroup using the following commands:
    mkdir /var/nfs
    chown nobody:nogroup /var/nfs
  4. Edit the /etc/exports file and add the following line:
    /var/nfs     *(rw,no_root_squash,no_subtree_check)
  5. Run the following command to export the directory:
    exportfs -a
  6. In the Longhorn UI, go to Setting -> Backup Target and set it to nfs://(NFS server IP):/var/nfs

Note: To simulate network disconnection, download the network_down.sh script from the following link: https://github.com/longhorn/longhorn/files/4864127/network_down.sh.zip

The test steps

  1. Prepare one NFS server.
  2. Set up the backup target.
  3. Create a volume and then take a backup.
  4. Stop the NFS server instance on AWS EC2 and start it again. The instance public IP changes to a new one.
  5. Update the backup target with the new IP.
  6. Take a backup again.
  7. The volume couldn't be backed up. The icon turned gray, and the backup page showed an error.
    (Screenshots attached: Screenshot_20230504_172848, Screenshot_20230504_172950)

Expected behavior

The volume should be backed up, and the backup page should not show an error.

Log or Support bundle

supportbundle_81183726-5b70-4292-a359-3d41e96c9847_2023-05-04T09-29-55Z.zip

Environment

  • Longhorn version: v1.4.x / master
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: ubuntu
    • CPU per node: 4 core
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1


@innobead innobead added the severity/1, priority/0, and area/backup-store labels on May 4, 2023
@innobead innobead added this to the v1.5.0 milestone May 4, 2023
@derekbit
Member

derekbit commented May 4, 2023

This is due to filesystem.GetMnt() in https://github.com/longhorn/backupstore/blob/master/util/util.go#L316.
GetMnt() iterates over and collects information about all mount points in the mount table. If there is any dead mount point, the iteration hangs for a while, so the caller (backup ls) eventually runs into a timeout error.
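
For illustration, here is a minimal Go sketch (not the actual backupstore code) of the idea behind the fix: stat only the specific mount point rather than walking the whole mount table, and bound the call with a timeout so a dead NFS mount can stall at most this single check. The mount point path and the timeout value below are assumptions for the example.

// Minimal sketch only; not the actual backupstore implementation.
package main

import (
	"fmt"
	"syscall"
	"time"
)

// checkMountPoint stats a single path instead of iterating over the whole
// mount table, and bounds the call with a timeout so a dead NFS mount cannot
// hang the caller indefinitely.
func checkMountPoint(mountPoint string, timeout time.Duration) error {
	errCh := make(chan error, 1)
	go func() {
		var stat syscall.Statfs_t
		errCh <- syscall.Statfs(mountPoint, &stat)
	}()

	select {
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		// The goroutine may stay blocked on the dead mount, but the caller moves on.
		return fmt.Errorf("statfs on %v timed out after %v; the mount point is likely dead", mountPoint, timeout)
	}
}

func main() {
	// Hypothetical mount point path, used only for illustration.
	if err := checkMountPoint("/var/lib/longhorn-backupstore-mounts/nfs", 5*time.Second); err != nil {
		fmt.Println("mount point check failed:", err)
	}
}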

@innobead
Member

innobead commented May 4, 2023

@derekbit Can we just clean up the mount point without querying it, no matter whether it's a valid or a dead mount? Have we tested this case before?

@derekbit
Member

derekbit commented May 4, 2023

I will do two improvements

@derekbit
Member

derekbit commented May 4, 2023

no matter whether it's a valid or a dead mount? Have we tested this case before?

We have changed the backup target before, but the targets were healthy.
For the dead-target case, we didn't test it.

@innobead
Member

innobead commented May 4, 2023

I will do two improvements

I mean, can we just clean it up directly, since cleanupMount is best-effort?

	// mnt, err := filesystem.GetMount(mountPoint)
	// if err != nil {
	// 	return true, errors.Wrapf(err, "failed to get mount for %v", mountPoint)
	// }

	// if strings.Contains(mnt.FilesystemType, Kind) {
	// 	return true, nil
	// }

	log.Warnf("Cleaning up the mount point %v because the fstype %v is changed to %v", mountPoint, mnt.FilesystemType, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v (%v) for %v protocol", mnt.FilesystemType, mountPoint, Kind)
	}

	return false, nil

@derekbit
Member

derekbit commented May 4, 2023

I will do two improvements

I mean, can we just clean it up directly, since cleanupMount is best-effort?

	// mnt, err := filesystem.GetMount(mountPoint)
	// if err != nil {
	// 	return true, errors.Wrapf(err, "failed to get mount for %v", mountPoint)
	// }

	// if strings.Contains(mnt.FilesystemType, Kind) {
	// 	return true, nil
	// }

	log.Warnf("Cleaning up the mount point %v because the fstype %v is changed to %v", mountPoint, mnt.FilesystemType, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v (%v) for %v protocol", mnt.FilesystemType, mountPoint, Kind)
	}

	return false, nil

Yes, it is my fix.
But not using filesystem.GetMnt() and using syscall.Statfs instead is still necessary, because we need to check whether the filesystem type is expected.

@longhorn-io-github-bot

longhorn-io-github-bot commented May 4, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

#5856 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

longhorn/backupstore#129
longhorn/longhorn-engine#895
longhorn/longhorn-instance-manager#222
longhorn/longhorn-manager#1883

longhorn/backupstore#131
longhorn/longhorn-instance-manager#225
longhorn/longhorn-manager#1885
longhorn/longhorn-engine#898

  • Which areas/issues this PR might have potential impacts on?
    Area: backup target
    Issues

@innobead
Member

innobead commented May 4, 2023

Yes, it is my fix. But not using filesystem.GetMnt() and using syscall.Statfs instead is still necessary, because we need to check whether the filesystem type is expected.

I see, that makes sense. However, one question: right now we only have one backup target, so if we blindly clean up the mount point in this case without checking the filesystem type, will there be any side effects?

The fix in longhorn/backupstore@149c9aa3 is different from this scenario, because it cleans up mount points when the backup target is reset, so it can blindly clean up without any checks.

@innobead
Member

innobead commented May 4, 2023

Do we still need to do this on the longhorn-manager side? The original fix is for cleaning up the mount points when trying to mount the same mount point again, so fixing it in the backupstore should be good enough?

@derekbit
Member

derekbit commented May 4, 2023

Yes, it is my fix. But not using filesystem.GetMnt() and using syscall.Statfs instead is still necessary, because we need to check whether the filesystem type is expected.

I see, that makes sense. However, one question: right now we only have one backup target, so if we blindly clean up the mount point in this case without checking the filesystem type, will there be any side effects?

The fix in longhorn/backupstore@149c9aa3 is different from this scenario, because it cleans up mount points when the backup target is reset, so it can blindly clean up without any checks.

For v1.4.x, the check of the filesystem type is not needed, because it only supports NFS.
But for v1.5+, it is a must, because NFS and CIFS use the same mount point.

@derekbit
Member

derekbit commented May 4, 2023

Do we still need to do this on the longhorn-manager side? The original fix is for cleaning up the mount points when trying to mount the same mount point again, so fixing it in the backupstore should be good enough?

If the backup target's IP is changed, the old mount point still needs cleanup. I feel this task can be handled in longhorn-manager and should not impact the data path (I mean mount, ls, ...).

I think we can check whether there is a better way, and just remove the cleanup here to avoid the timeout issue. WDYT?

@innobead
Member

innobead commented May 4, 2023

For v1.4.x, the check of the filesystem type is not needed, because it only supports NFS. But for v1.5+, it is a must, because NFS and CIFS use the same mount point.

I still want to clarify a bit. When doing unmount, we can just umount it no matter what type of mount, right? The mounter in the code below is just a general one created by mount.New("") from the k8s lib. Probably something I misunderstood 🤔

/util/util.go#L269-L278

func cleanupMount(mountDir string, mounter mount.Interface, log logrus.FieldLogger) error {
	forceUnmounter, ok := mounter.(mount.MounterForceUnmounter)
	if ok {
		log.Infof("Trying to force clean up mount point %v", mountDir)
		return mount.CleanupMountWithForce(mountDir, forceUnmounter, false, forceCleanupMountTimeout)
	}

	log.Infof("Trying to clean up mount point %v", mountDir)
	return mount.CleanupMountPoint(mountDir, forceUnmounter, false)
}

If yes, even though the IP of backup store has been changed, we should still be able to force unmount it, or not?

P.S. About multiple backup stores, that will be another implementation for sure, because we can't just unmount all mount points.

@derekbit
Member

derekbit commented May 4, 2023

	var stat syscall.Statfs_t

	if err := syscall.Statfs(mountPoint, &stat); err != nil {
		return true, errors.Wrapf(err, "failed to statfs for mount point %v", mountPoint)
	}

	kind, err := fstypeToKind(stat.Type)
	if err != nil {
		return true, errors.Wrapf(err, "failed to get kind for mount point %v", mountPoint)
	}

	if strings.Contains(kind, Kind) {
		return true, nil
	}

	log.Warnf("Cleaning up the mount point %v because the fstype %v is changed to %v", mountPoint, kind, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v (%v) for %v protocol", kind, mountPoint, Kind)
	}

The mount point is mounted. However, it might be an NFS or a CIFS mount point. If the user changes the target from NFS to CIFS, we need to check the fstype of the existing mount point via if strings.Contains(kind, Kind). If the type is different from the new target, do the cleanup.
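
For reference, here is a hedged sketch of what a helper like fstypeToKind could look like (the actual backupstore implementation may differ): it maps the f_type magic number returned by statfs(2) to the protocol kind string that is then compared against Kind. The magic numbers are the standard values from linux/magic.h; the snippet assumes the "fmt" package is imported.

// Hedged sketch only; the real fstypeToKind in backupstore may differ.
// fstypeToKind maps the f_type magic number reported by statfs(2) to the
// protocol "kind" string that the caller compares against Kind.
func fstypeToKind(fstype int64) (string, error) {
	// Filesystem magic numbers from linux/magic.h.
	const (
		nfsSuperMagic = 0x6969     // NFS_SUPER_MAGIC
		cifsMagic     = 0xFF534D42 // CIFS_MAGIC_NUMBER
		smb2Magic     = 0xFE534D42 // SMB2_MAGIC_NUMBER
	)

	switch fstype {
	case nfsSuperMagic:
		return "nfs", nil
	case cifsMagic, smb2Magic:
		return "cifs", nil
	default:
		return "", fmt.Errorf("unknown filesystem type magic number 0x%x", fstype)
	}
}

With such a mapping, a mount point left over from an old nfs:// target would fail the strings.Contains(kind, Kind) check when the new target is cifs://, and would therefore be cleaned up.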

@derekbit
Member

derekbit commented May 4, 2023

If yes, even though the IP of backup store has been changed, we should still be able to force unmount it, or not?

Ideally, yes. But the cleanup needs to scan all mount points, and any dead mount points encountered during the cleanup will lead to the timeout.
So, cleaning them up here is not a good idea. That's why I want to move the cleanup to longhorn-manager instead.

@innobead
Member

innobead commented May 4, 2023

If yes, even though the IP of backup store has been changed, we should still be able to force unmount it, or not?

Ideally, yes. But the cleanup needs to scan all mount points, and any dead mount points encountered during the cleanup will lead to the timeout. So, cleaning them up here is not a good idea. That's why I want to move the cleanup to longhorn-manager instead.

To sum up.

  1. Cleaning up all mount points will get stuck due to https://github.com/longhorn/backupstore/blob/master/util/util.go#L316 (this can be fixed by Fix timeout caused by dead mount points backupstore#129)
  2. Checking the type of the backup store (NFS/CIFS) is to determine whether we need to clean up mount points (I am not quite sure about this part; let's discuss tomorrow)
  3. Even though Fix timeout caused by dead mount points backupstore#129 will fix the issue in item 1, there are still some uncertain concerns, so we want to move the cleanup to longhorn-manager? Probably we want to clean up the specific mount point instead of cleaning up all mounts.

Cleaning up the mount point is generic, no matter what type of backup store (NFS/CIFS).

@derekbit right?

@derekbit
Member

derekbit commented May 4, 2023

Yes.

I still want to clarify a bit. When doing unmount, we can just umount it no matter what type of mount, right?

EnsureMountPoint checks whether the mount point is valid. If it is invalid, it cleans up the mount point. So the purpose is to make sure the mount point's type is expected and the mount point is accessible; it does not just blindly unmount.

@innobead
Member

innobead commented May 4, 2023

Yes.

I still want to clarify a bit. When doing unmount, we can just umount it no matter what type of mount, right?

EnsureMountPoint checks whether the mount point is valid. If it is invalid, it cleans up the mount point. So the purpose is to make sure the mount point's type is expected and the mount point is accessible; it does not just blindly unmount.

Yeah, agreed with this. However, does this mean longhorn/backupstore@149c9aa3 would have issues, since it's doing a blind unmount when the backup target is reset?

@derekbit
Member

derekbit commented May 4, 2023

Yeah, agreed with this. However, does this mean longhorn/backupstore@149c9aa3 would have issues, since it's doing a blind unmount when the backup target is reset?

No, that one is for the unset case.
What I mentioned is the case of, for example, changing from nfs://10.20.90.100:/Longhorn to cifs://10.20.90.100:/Longhorn.

@innobead
Member

innobead commented May 4, 2023

It looks like there are different paths for cleaning up the mount points. It's good to clarify the test cases as well for QA.

  • backup target unset
  • backup target set, but unresponsive then responsive
  • backup target reset with the same protocol
  • backup target reset with a different protocol

Anything else? BTW, ideally the trigger should be the caller, so that's longhorn-manager.

cc @longhorn/qa

@derekbit
Member

derekbit commented May 4, 2023

Yeah, I think the cases are covered.
Changing from hard to soft mount mode makes error handling more complicated.

@innobead
Member

innobead commented May 4, 2023

Yeah, I think the cases are covered. Changing from hard to soft mount mode makes error handling more complicated.

But it will solve potential performance issues, so it is still worth doing. We just need to ensure the test coverage without regression.

@innobead
Member

innobead commented May 4, 2023

cc @ChanYiLin

@innobead
Member

innobead commented May 5, 2023

I discussed with @derekbit the action items below.

  • When ensuring the mount point, check the specific mount point instead of scanning the whole mount table (Fix timeout caused by dead mount points backupstore#129).
  • There are two places where a backup target mount point is created (in the replica and in longhorn-manager), so when reconciling a backup target setting change, clean up the old mount points from longhorn-manager. (@ChanYiLin has already handled the cleanup when the backup target is unset, so we can probably consolidate both.) @derekbit will work on this.

This issue is required to be fixed in 1.4.2.

@roger-ryao roger-ryao self-assigned this May 5, 2023
@innobead innobead added the require/auto-e2e-test label on May 5, 2023
@roger-ryao
Author

Verified on master-head 20230512

  • longhorn master-head (13bf7b6)
  • longhorn-manager master-head (a7dd20c)
  • backupstore master-head (6183661)
  • longhorn-engine master-head (1f57dd9)
  • longhorn-instance-manager master-head (0e0ec6d)

The test steps

#5856 (comment)

Result Passed

  1. After changing the backup target to the new IP, the volume can be backed up, and the backup page does not show an error.
