Unable to unmount from EC2 #66

Closed
MrOlm opened this issue Oct 1, 2021 · 3 comments

@MrOlm
Contributor

MrOlm commented Oct 1, 2021

Hello,

I'm running aegea version 2.6.9 for compatibility reasons and am hitting the following error after my batch jobs finish:

2021-09-30 02:20:48+00:00 Detaching EBS volume vol-0d10807d4007a84bf
2021-09-30 02:20:49+00:00 umount: /mnt: target is busy
2021-09-30 02:20:49+00:00         (In some cases useful info about processes that
2021-09-30 02:20:49+00:00          use the device is found by lsof(8) or fuser(1).)
2021-09-30 02:20:49+00:00 Traceback (most recent call last):
2021-09-30 02:20:49+00:00   File "/usr/local/bin/aegea", line 23, in <module>
2021-09-30 02:20:49+00:00     aegea.main()
2021-09-30 02:20:49+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2021-09-30 02:20:49+00:00     result = parsed_args.entry_point(parsed_args)
2021-09-30 02:20:49+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 177, in detach
2021-09-30 02:20:49+00:00     subprocess.check_call(["umount", find_devnode(volume_id)])
2021-09-30 02:20:49+00:00   File "/usr/lib/python3.5/subprocess.py", line 271, in check_call
2021-09-30 02:20:49+00:00     raise CalledProcessError(retcode, cmd)
2021-09-30 02:20:49+00:00 subprocess.CalledProcessError: Command '['umount', '/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0d10807d4007a84bf-ns-1']' returned non-zero exit status 32
INFO:aegea:Job 6a0d141f-67e4-4510-998e-0a36ff9fd833: Essential container in task exited

The result is that the drives are left up after the jobs finish. Do you have any idea what could be causing this issue? I'm happy to play around in the aegea code of this version to fix things myself, I just don't know where to start.

Thanks in advance,
Matt

@kislyuk
Owner

kislyuk commented Oct 3, 2021

Hi, the per-batch job EBS provisioning code was retired in the latest version of aegea, and is not recommended because of hazards like the one you encountered. Note that after this failure to unmount, you will end up with an orphaned EBS volume that your organization will continue to pay for until you clean it up. This is quite hazardous if nobody is watching out for this type of error, because EBS volumes are not cheap.
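One way to watch for orphaned volumes is to filter the EC2 volume listing for volumes that are not attached to anything. The sketch below is only an illustration, not part of aegea: `unattached_volumes` is a hypothetical helper, and it assumes the input dict has the shape returned by boto3's `describe_volumes` call (a `Volumes` list whose entries carry `VolumeId`, `Size` and `State`).

```python
def unattached_volumes(describe_volumes_response):
    """Return (volume id, size in GiB) for every volume in the 'available' state.

    'available' is the EC2 state of a volume that exists but is attached to no
    instance, which is what an orphaned batch job volume would look like.
    """
    return [
        (vol["VolumeId"], vol["Size"])
        for vol in describe_volumes_response.get("Volumes", [])
        if vol.get("State") == "available"
    ]


# Made-up sample data for illustration only (the second ID is invented).
# With boto3 installed, the dict would instead come from something like:
#   boto3.client("ec2").describe_volumes()
sample_response = {
    "Volumes": [
        {"VolumeId": "vol-0d10807d4007a84bf", "Size": 500, "State": "available"},
        {"VolumeId": "vol-0aaaaaaaaaaaaaaaa", "Size": 8, "State": "in-use"},
    ]
}

orphans = unattached_volumes(sample_response)
```

Not every unattached volume is necessarily an orphan, so a list like this is a starting point for review rather than something to delete automatically.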

I recommend using the EFS auto-mount functionality in the latest version of aegea instead. That way your jobs can use a shared EFS filesystem as their scratch space, and Batch automatically manages mounting and unmounting it for you.

If you must continue to use the EBS batch job volume code, it was updated to incorporate tracking down and terminating processes that prevent a clean unmount:

https://github.com/kislyuk/aegea/blob/develop/aegea/ebs.py#L180

You could try forking your version of aegea and updating this line to see if it makes a difference.
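As a rough illustration of the idea (not the actual code behind the link above), a patched detach step could try a plain `umount` first and, only if that fails, terminate whatever is still using the filesystem and retry. The function name `unmount_with_cleanup` and the injectable `run` parameter are assumptions made for this sketch; `fuser -k -m` is the psmisc command that kills processes using a mounted filesystem.

```python
import subprocess


def unmount_with_cleanup(target, run=subprocess.call):
    """Unmount `target`; if it is busy, kill the processes holding it and retry once.

    `target` may be a mount point or a mounted device node.
    `run` takes an argv list and returns the exit status; it defaults to
    subprocess.call and can be swapped out for testing.
    Returns True if the filesystem ended up unmounted.
    """
    if run(["umount", target]) == 0:
        return True
    # -m selects every process using the mounted filesystem,
    # -k sends SIGKILL to each of them.
    run(["fuser", "-k", "-m", target])
    return run(["umount", target]) == 0
```

Killing processes this way discards any unsaved work they hold, and a killed process can take a moment to exit, so a real handler may want a short delay or a few retries before giving up.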

@kislyuk
Owner

kislyuk commented Oct 3, 2021

In addition to the line above, if you take a look at the commit that originally introduced it in v2.8.0, there is another change that makes the Batch cleanup handler perform a cd / before trying to unmount. This is important because if the top-level user process has descended into a subdirectory of the mount, its current working directory keeps that filesystem in use, and the mount will be seen as busy.

609d004
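The `cd /` step can be sketched in Python as follows. This is a hedged illustration and not the code from the commit: `leave_mount_then_unmount` and its `run` parameter are hypothetical names, and the sketch only moves the calling process out of the mount, not any other process.

```python
import os
import subprocess


def leave_mount_then_unmount(mountpoint, run=subprocess.call):
    """Change directory to / if the current process sits inside `mountpoint`,
    then attempt the unmount and return umount's exit status.

    `run` takes an argv list and returns the exit status (subprocess.call by default).
    """
    mount_root = os.path.realpath(mountpoint)
    cwd = os.path.realpath(os.getcwd())
    # commonpath equals mount_root exactly when cwd is the mount root or below it.
    if os.path.commonpath([cwd, mount_root]) == mount_root:
        os.chdir("/")
    return run(["umount", mountpoint])
```

The check is there only to avoid changing directory unnecessarily; an unconditional `os.chdir("/")` before the unmount would have the same effect for this purpose.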

@MrOlm
Contributor Author

MrOlm commented Oct 5, 2021

Hello,

Thanks again for all of this advice. I was going to keep an eye on the EBS volumes to make sure there weren't any orphaned ones, but after this message I decided to update to aegea 4.0.1 and make the dependencies work out. So far so good.

Thanks again for your quick and helpful reply, much appreciated.

Best,
MO

@MrOlm MrOlm closed this as completed Oct 5, 2021