
[BUG] Unable to backup volume after NFS server IP change #5856

Closed
roger-ryao opened this issue May 4, 2023 · 24 comments
Assignees
Labels
area/backup-store (Remote backup store related), backport/1.4.2, kind/bug, priority/0 (Must be implemented or fixed in this release, managed by PO), reproduce/always (100% reproducible), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated), severity/1 (Function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade)
Milestone

Comments

@roger-ryao

Describe the bug (🐛 if you encounter this issue)

I stopped the NFS server VM instance on AWS EC2 and started it again, and the instance's public IP changed to a new one. After updating the backup target to the new IP, the volume couldn't be backed up: the icon turned gray, and the backup page showed an error.

To Reproduce

Steps to reproduce the behavior:
Pre-requisite
To set up an external NFS server for backup store, perform the following steps:

  1. Install the nfs-kernel-server package using the following command:
    sudo zypper install nfs-kernel-server
  2. Enable and start the rpcbind.service and nfsserver.service services using the following commands:
    systemctl enable rpcbind.service
    systemctl start rpcbind.service
    systemctl enable nfsserver.service
    systemctl start nfsserver.service
  3. Create a directory to export and change its ownership to nobody:nogroup using the following commands:
    mkdir /var/nfs
    chown nobody:nogroup /var/nfs
  4. Edit the /etc/exports file and add the following line:
    /var/nfs     *(rw,no_root_squash,no_subtree_check)
  5. Run the following command to export the directory:
    exportfs -a
  6. In the Longhorn UI, go to Setting -> Backup Target and set it to nfs://(NFS server IP):/var/nfs

Note: To simulate network disconnection, download the network_down.sh script from the following link: https://github.com/longhorn/longhorn/files/4864127/network_down.sh.zip

The test steps

  1. Prepare one NFS server.
  2. Set up the backup target.
  3. Create a volume and then take a backup.
  4. Stop the NFS server instance on AWS EC2 and start it again. The instance public IP changes to a new one.
  5. Update the backup target with the new IP.
  6. Take a backup again.
  7. The volume couldn't be backed up. The icon turned gray, and the backup page showed an error.
    (Screenshots attached: Screenshot_20230504_172848, Screenshot_20230504_172950)

Expected behavior

The volume should be backed up, and the backup page should not show an error.

Log or Support bundle

supportbundle_81183726-5b70-4292-a359-3d41e96c9847_2023-05-04T09-29-55Z.zip

Environment

  • Longhorn version: v1.4.x / master
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: ubuntu
    • CPU per node: 4 core
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1


@innobead innobead added the severity/1, priority/0, and area/backup-store labels on May 4, 2023
@innobead innobead added this to the v1.5.0 milestone May 4, 2023
@derekbit
Member

derekbit commented May 4, 2023

This is due to filesystem.GetMnt() in https://github.com/longhorn/backupstore/blob/master/util/util.go#L316.
GetMnt() iterates over and collects information about all mount points in the mount table. If there is any dead mount point, the iteration hangs for a while, so the caller (backup ls) eventually runs into a timeout error.
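
For illustration, here is a minimal Go sketch (not the actual backupstore code) of the idea behind the fix: stat only the specific mount point rather than walking the whole mount table, and bound the call with a timeout so a dead NFS mount can stall at most this single check. The mount point path and the timeout value below are assumptions for the example.

// Minimal sketch only; not the actual backupstore implementation.
package main

import (
	"fmt"
	"syscall"
	"time"
)

// checkMountPoint stats a single path instead of iterating over the whole
// mount table, and bounds the call with a timeout so a dead NFS mount cannot
// hang the caller indefinitely.
func checkMountPoint(mountPoint string, timeout time.Duration) error {
	errCh := make(chan error, 1)
	go func() {
		var stat syscall.Statfs_t
		errCh <- syscall.Statfs(mountPoint, &stat)
	}()

	select {
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		// The goroutine may stay blocked on the dead mount, but the caller moves on.
		return fmt.Errorf("statfs on %v timed out after %v; the mount point is likely dead", mountPoint, timeout)
	}
}

func main() {
	// Hypothetical mount point path, used only for illustration.
	if err := checkMountPoint("/var/lib/longhorn-backupstore-mounts/nfs", 5*time.Second); err != nil {
		fmt.Println("mount point check failed:", err)
	}
}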

@innobead
Member

innobead commented May 4, 2023

@derekbit Can we just clean up the mount point without querying it, no matter whether it's a valid or a dead mount? Have we tested this case before?

@derekbit
Member

derekbit commented May 4, 2023

I will do two improvements

@derekbit
Member

derekbit commented May 4, 2023

no matter whether it's a valid or a dead mount? Have we tested this case before?

We have changed the backup target before, but the targets were healthy.
For the dead-target case, we didn't test it.

@innobead
Member

innobead commented May 4, 2023

I will do two improvements

I mean, can we just clean it up directly, since cleanupMount is best-effort?

	// mnt, err := filesystem.GetMount(mountPoint)
	// if err != nil {
	// 	return true, errors.Wrapf(err, "failed to get mount for %v", mountPoint)
	// }

	// if strings.Contains(mnt.FilesystemType, Kind) {
	// 	return true, nil
	// }

	log.Warnf("Cleaning up the mount point %v because the fstype %v is changed to %v", mountPoint, mnt.FilesystemType, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v (%v) for %v protocol", mnt.FilesystemType, mountPoint, Kind)
	}

	return false, nil

@derekbit
Member

derekbit commented May 4, 2023

I will do two improvements

I mean, can we just clean it up directly, since cleanupMount is best-effort?

	// mnt, err := filesystem.GetMount(mountPoint)
	// if err != nil {
	// 	return true, errors.Wrapf(err, "failed to get mount for %v", mountPoint)
	// }

	// if strings.Contains(mnt.FilesystemType, Kind) {
	// 	return true, nil
	// }

	log.Warnf("Cleaning up the mount point %v because the fstype %v is changed to %v", mountPoint, mnt.FilesystemType, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v (%v) for %v protocol", mnt.FilesystemType, mountPoint, Kind)
	}

	return false, nil

Yes, it is my fix.
But not using filesystem.GetMnt() and using syscall.Statfs instead is still necessary, because we need to check whether the filesystem type is expected.

@longhorn-io-github-bot

longhorn-io-github-bot commented May 4, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

#5856 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

longhorn/backupstore#129
longhorn/longhorn-engine#895
longhorn/longhorn-instance-manager#222
longhorn/longhorn-manager#1883

longhorn/backupstore#131
longhorn/longhorn-instance-manager#225
longhorn/longhorn-manager#1885
longhorn/longhorn-engine#898

  • Which areas/issues this PR might have potential impacts on?
    Area: backup target
    Issues

@innobead
Member

innobead commented May 4, 2023

Yes, it is my fix. But not using filesystem.GetMnt() and using syscall.Statfs instead is still necessary, because we need to check whether the filesystem type is expected.

I see, that makes sense. However, one question: right now we only have one backup target, so if we blindly clean up the mount point in this case without checking the filesystem type, will there be any side effects?

The fix in longhorn/backupstore@149c9aa3 is different from this scenario, because it cleans up mount points when the backup target is reset, so it can blindly clean up without any checks.

@innobead
Member

innobead commented May 4, 2023

Do we still need to do this on the longhorn-manager side? The original fix is for cleaning up the mount points when trying to mount the same mount point again, so fixing it in the backupstore should be good enough?

@derekbit
Member

derekbit commented May 4, 2023

Yes, it is my fix. But not using filesystem.GetMnt() and using syscall.Statfs instead is still necessary, because we need to check whether the filesystem type is expected.

I see, that makes sense. However, one question: right now we only have one backup target, so if we blindly clean up the mount point in this case without checking the filesystem type, will there be any side effects?

The fix in longhorn/backupstore@149c9aa3 is different from this scenario, because it cleans up mount points when the backup target is reset, so it can blindly clean up without any checks.

For v1.4.x, the check of the filesystem type is not needed, because it only supports NFS.
But for v1.5+, it is a must, because NFS and CIFS use the same mount point.

@derekbit
Member

derekbit commented May 4, 2023

Do we still need to do this on the longhorn-manager side? The original fix is for cleaning up the mount points when trying to mount the same mount point again, so fixing it in the backupstore should be good enough?

If the backup target's IP is changed, the old mount point still needs cleanup. I feel this task can be handled in longhorn-manager and should not impact the data path (I mean mount, ls, ...).

I think we can check whether there is a better way, and just remove the cleanup here to avoid the timeout issue. WDYT?

@innobead
Member

innobead commented May 4, 2023

For v1.4.x, the check of the filesystem type is not needed, because it only supports NFS. But for v1.5+, it is a must, because NFS and CIFS use the same mount point.

I still want to clarify a bit. When doing unmount, we can just umount it no matter what type of mount, right? The mounter in the code below is just a general one created by mount.New("") from the k8s lib. Probably something I misunderstood 🤔

/util/util.go#L269-L278

func cleanupMount(mountDir string, mounter mount.Interface, log logrus.FieldLogger) error {
	forceUnmounter, ok := mounter.(mount.MounterForceUnmounter)
	if ok {
		log.Infof("Trying to force clean up mount point %v", mountDir)
		return mount.CleanupMountWithForce(mountDir, forceUnmounter, false, forceCleanupMountTimeout)
	}

	log.Infof("Trying to clean up mount point %v", mountDir)
	return mount.CleanupMountPoint(mountDir, forceUnmounter, false)
}

If yes, even though the IP of backup store has been changed, we should still be able to force unmount it, or not?

P.S. About multiple backup stores, that will be another implementation for sure, because we can't just unmount all mount points.

@derekbit
Member

derekbit commented May 4, 2023

	var stat syscall.Statfs_t

	if err := syscall.Statfs(mountPoint, &stat); err != nil {
		return true, errors.Wrapf(err, "failed to statfs for mount point %v", mountPoint)
	}

	kind, err := fstypeToKind(stat.Type)
	if err != nil {
		return true, errors.Wrapf(err, "failed to get kind for mount point %v", mountPoint)
	}

	if strings.Contains(kind, Kind) {
		return true, nil
	}

	log.Warnf("Cleaning up the mount point %v because the fstype %v is changed to %v", mountPoint, kind, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v (%v) for %v protocol", kind, mountPoint, Kind)
	}

The mount point is mounted. However, it might be an NFS or a CIFS mount point. If the user changes the target from NFS to CIFS, we need to check the fstype of the existing mount point via if strings.Contains(kind, Kind). If the type is different from the new target, do the cleanup.
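
For reference, here is a hedged sketch of what a helper like fstypeToKind could look like (the actual backupstore implementation may differ): it maps the f_type magic number returned by statfs(2) to the protocol kind string that is then compared against Kind. The magic numbers are the standard values from linux/magic.h; the snippet assumes the "fmt" package is imported.

// Hedged sketch only; the real fstypeToKind in backupstore may differ.
// fstypeToKind maps the f_type magic number reported by statfs(2) to the
// protocol "kind" string that the caller compares against Kind.
func fstypeToKind(fstype int64) (string, error) {
	// Filesystem magic numbers from linux/magic.h.
	const (
		nfsSuperMagic = 0x6969     // NFS_SUPER_MAGIC
		cifsMagic     = 0xFF534D42 // CIFS_MAGIC_NUMBER
		smb2Magic     = 0xFE534D42 // SMB2_MAGIC_NUMBER
	)

	switch fstype {
	case nfsSuperMagic:
		return "nfs", nil
	case cifsMagic, smb2Magic:
		return "cifs", nil
	default:
		return "", fmt.Errorf("unknown filesystem type magic number 0x%x", fstype)
	}
}

With such a mapping, a mount point left over from an old nfs:// target would fail the strings.Contains(kind, Kind) check when the new target is cifs://, and would therefore be cleaned up.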

@derekbit
Member

derekbit commented May 4, 2023

If yes, even though the IP of backup store has been changed, we should still be able to force unmount it, or not?

Ideally, yes. But the cleanup needs to scan all mount points, and any dead mount points encountered during the cleanup will lead to the timeout.
So, cleaning them up here is not a good idea. That's why I want to move the cleanup to longhorn-manager instead.

@innobead
Member

innobead commented May 4, 2023

If yes, even though the IP of backup store has been changed, we should still be able to force unmount it, or not?

Ideally, yes. But the cleanup needs to scan all mount points, and any dead mount points encountered during the cleanup will lead to the timeout. So, cleaning them up here is not a good idea. That's why I want to move the cleanup to longhorn-manager instead.

To sum up.

  1. Cleaning up all mount points will get stuck due to https://github.com/longhorn/backupstore/blob/master/util/util.go#L316 (this can be fixed by Fix timeout caused by dead mount points backupstore#129)
  2. Checking the type of the backup store (NFS/CIFS) is to determine whether we need to clean up mount points (I am not quite sure about this part; let's discuss tomorrow)
  3. Even though Fix timeout caused by dead mount points backupstore#129 will fix the issue in item 1, there are still some uncertain concerns, so we want to move the cleanup to longhorn-manager? Probably we want to clean up the specific mount point instead of cleaning up all mounts.

Cleaning up the mount point is generic, no matter what type of backup store (NFS/CIFS).

@derekbit right?

@derekbit
Member

derekbit commented May 4, 2023

Yes.

I still want to clarify a bit. When doing unmount, we can just umount it no matter what type of mount, right?

EnsureMountPoint checks whether the mount point is valid. If it is invalid, it cleans up the mount point. So the purpose is to make sure the mount point's type is expected and the mount point is accessible; it does not just blindly unmount.

@innobead
Member

innobead commented May 4, 2023

Yes.

I still want to clarify a bit. When doing unmount, we can just umount it no matter what type of mount, right?

EnsureMountPoint checks whether the mount point is valid. If it is invalid, it cleans up the mount point. So the purpose is to make sure the mount point's type is expected and the mount point is accessible; it does not just blindly unmount.

Yeah, agreed with this. However, does this mean longhorn/backupstore@149c9aa3 would have issues, since it's doing a blind unmount when the backup target is reset?

@derekbit
Member

derekbit commented May 4, 2023

Yeah, agreed with this. However, does this mean longhorn/backupstore@149c9aa3 would have issues, since it's doing a blind unmount when the backup target is reset?

No, that one is for the unset case.
What I mentioned is the case of, for example, changing from nfs://10.20.90.100:/Longhorn to cifs://10.20.90.100:/Longhorn.

@innobead
Member

innobead commented May 4, 2023

It looks like there are different paths for cleaning up the mount points. It's good to clarify the test cases as well for QA.

  • backup target unset
  • backup target set, but unresponsive then responsive
  • backup target reset with the same protocol
  • backup target reset with a different protocol

Anything else? BTW, ideally the trigger should be the caller, so that's longhorn-manager.

cc @longhorn/qa

@derekbit
Member

derekbit commented May 4, 2023

Yeah, I think the cases are covered.
Changing from hard to soft mount mode makes error handling more complicated.

@innobead
Member

innobead commented May 4, 2023

Yeah, I think the cases are covered. Changing from hard to soft mount mode makes error handling more complicated.

But it will solve potential performance issues, so it is still worth doing. We just need to ensure the test coverage without regression.

@innobead
Member

innobead commented May 4, 2023

cc @ChanYiLin

@innobead
Member

innobead commented May 5, 2023

I discussed with @derekbit the action items below.

  • When ensuring the mount point, check the specific mount point instead of scanning the whole mount table (Fix timeout caused by dead mount points backupstore#129).
  • There are two places where a backup target mount point is created (in the replica and in longhorn-manager), so when reconciling a backup target setting change, clean up the old mount points from longhorn-manager. (@ChanYiLin has already handled the cleanup when the backup target is unset, so we can probably consolidate both.) @derekbit will work on this.

This issue is required to be fixed in 1.4.2.

@roger-ryao roger-ryao self-assigned this May 5, 2023
@innobead innobead added the require/auto-e2e-test label on May 5, 2023
@roger-ryao
Author

Verified on master-head 20230512

  • longhorn master-head (13bf7b6)
  • longhorn-manager master-head (a7dd20c)
  • backupstore master-head (6183661)
  • longhorn-engine master-head (1f57dd9)
  • longhorn-instance-manager master-head (0e0ec6d)

The test steps

#5856 (comment)

Result Passed

  1. After changing the backup target to the new IP, the volume can be backed up, and the backup page does not show an error.
