[CSI] "volume in use" error deregistering volume previously associated with now stopped job #8057
I also just saw the
It's also probably worth noting that the node in question where these old allocs were held had been deployed with
So first, I'm looking for a way to forcefully get rid of the reference to this volume, and second, to lodge this as a bug :). I've also tried forcing
I'm a little confused by the CLI outputs you've provided here. It looks like you're showing me three volumes:
But then you're trying to run a job stop on the volume, which isn't an operation supported on volumes:
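To make the distinction concrete, roughly (names here are illustrative, not taken from your output):

```sh
# Jobs are stopped with the job subcommand:
nomad job stop my-job

# Volumes have their own subcommand, and there is no "stop" operation;
# they are inspected and deregistered instead:
nomad volume status my-volume-id
nomad volume deregister my-volume-id
```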
By any chance do the jobs have the same name as the volume? I'm wondering if that's part of the confusion here.

Moving on to some observations:
It looks like all 3 allocs that claim these 3 volumes are no longer around.

This is a little weird:
I'm not sure how we could get into a state where the job would be GC'd out from under us but the allocation could still be queried by the volume API to give us this information.
Is the
Hey @tgross, thanks for the response. You're right that there are 3 volumes that all claim the same volume ID. I was experimenting while developing my automation, and ended up creating 3 different volumes for the same EBS volume. The job was previously called simply
However, note that this plugin was broken for some time as I transitioned it to use Nomad ACLs (which I may be doing incorrectly since how to do that isn't documented - the only doc I could find assumed you didn't have Nomad ACLs). I fixed it around 12h ago, and I still can't deregister the volume(s). I'm now wondering if this has something to do with ACLs. I found some more logs floating around on the Nomad leader as I attempt to deregister the volume:
I also see some of this:
Hey @tgross, I've sent my plugin job definitions to nomad-oss-debug. Here's my Nomad anonymous ACL policy
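For context, an anonymous policy that grants the CSI-related capabilities generally looks something like the sketch below; the capability names are from memory and worth double-checking against the Nomad ACL documentation for your version.

```hcl
# Sketch of an anonymous ACL policy permitting CSI operations (illustrative, not
# the exact policy in use here; capability names should be verified in the docs).
namespace "default" {
  policy       = "write"
  capabilities = ["csi-register-plugin", "csi-write-volume", "csi-mount-volume", "csi-list-volume"]
}

node {
  policy = "read"
}

plugin {
  policy = "read"
}
```

It would be applied with something like `nomad acl policy apply anonymous anonymous.hcl`.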
Hey @tgross, any thoughts on how to work around this bug? I can't seem to get Nomad to "let go" of these volume definitions because it's still tracking those allocations, and of course none of the allocations exist when I query them.

Not sure if it's helpful or not, but I also tried deploying a job to use this volume, since it's theoretically defined, and here's the relevant placement failure:
Note that I've also recycled all nodes in the cluster at this point. The only thing I haven't done is totally destroy & recreate the Nomad raft DB from my deployment scripts. I'm fairly confident that will fix this, but I'd really like to figure out how to fix this in case we hit this in production at some point. However, if I can't get this fixed in the next day, a clean wipe of the DB is what I'm going to have to do so I can keep going on my automation here.
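The general shape of the checks and cleanup attempts described above is something like this (IDs illustrative):

```sh
# The allocation IDs reported by the volume no longer resolve to anything:
nomad alloc status 0c2b4895            # reports no matching allocation

# Yet releasing the volume still fails:
nomad volume deregister my-volume-id   # "volume in use"

# Forcing a garbage-collection cycle doesn't clear the stale claims either:
nomad system gc
```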
I got this to happen again. It seems like the problem is related to draining a node. As in: my job using the EBS volume was running on node x. I issued a drain on node x, and the job tried to move to node y. However, the node plugin job (which is a system job, as prescribed) didn't unmount the volume from the node. Then when I terminated the drained node, Nomad still thinks that the volume's write claims are exhausted, even though it's not claimed by any node (since the node originally claiming it has been destroyed).
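A sketch of that reproduction sequence (node and volume IDs illustrative):

```sh
# 1. Drain node x; the job using the EBS volume migrates to node y,
#    but the node plugin never unpublishes the volume from node x:
nomad node drain -enable -yes <node-x-id>

# 2. Terminate the drained instance in AWS.

# 3. Nomad still believes the volume's write claims are exhausted:
nomad volume status my-volume-id
nomad volume deregister my-volume-id   # still fails with "volume in use"
```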
I faced a similar issue when trying to deregister an absent volume. Here's how to reproduce:
Nomad still thinks my volume is there in the UI/CLI and I'm not able to deregister it:
Obviously my bad for doing things in the wrong order, but I would still be glad to have something like a force option for deregistering.
@holtwilkins I'm looking into this, but in the meantime Nomad 0.11.3 shipped with some improvements to the volume claim reaping (ref https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#0113-june-5-2020), so it might be worth your while to check that out.
Thanks @tgross, I've reproduced this same issue on Nomad 0.11.3 shortly after it was released, unfortunately.
Ok, so this is probably related to #8232 and #8080. The advice I gave to folks in those issues applies here as well, but it only helps prevent the issue in the future by reducing the risk of triggering the race condition between plugin teardown and volume claim reaping:
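Loosely summarized (this is a sketch of the pattern, not a quote from those issues), the idea is to avoid tearing down the node plugin before its volumes have been unpublished, for example by leaving system jobs running during a drain:

```sh
# Drain workloads off the node while keeping system jobs (including the
# CSI node plugin) running, so volumes can still be unmounted/unpublished:
nomad node drain -enable -ignore-system -yes <node-id>
```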
I've also opened #8251 as a follow-up.
Hi, I've also encountered the same problem. We have Nomad client nodes in an ASG, and when they got rotated/replaced we were stuck with multiple clients in the "ineligible" state even though they had long been terminated. Removing the job using the volume, the volume itself, and the plugin did not help. Probably worth mentioning: as part of autoscaling, clients are automatically drained by a Lambda before termination proceeds, and that worked as expected. Every time GC runs or one of the servers is restarted, there are multiple messages to this effect in the logs. I've even changed the GC defaults on the servers,
but that just made the messages appear more often, not only when forcing GC via the command line. Cheers
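For reference, the kind of server-side GC tuning described above would look roughly like the sketch below; the CSI-specific keys in particular are assumptions to verify against the server configuration docs for the Nomad version in use.

```hcl
# Sketch only: aggressive GC thresholds on the servers. Key names (especially
# the CSI ones) and values are assumptions, not the poster's actual config.
server {
  enabled = true

  job_gc_threshold              = "10m"
  node_gc_threshold             = "10m"
  csi_plugin_gc_threshold       = "10m"
  csi_volume_claim_gc_threshold = "10m"
}
```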
Is there a workaround for "unsticking" volumes stuck in limbo like this? I have one that's currently attached to a five-days-dead job that I don't know how to reclaim. |
Update: my five-days-dead job is now over two weeks dead, and the EBS volume is still attached. To be blunt, I don't know that CSI support can even be called "beta" at this point. The current bugs are severe enough to make CSI effectively unusable, and it doesn't appear from the release notes that any progress was made on it in 0.12. |
Hello, I'm facing a similar issue. Is there a release with the force option for Nomad 0.11?
I have the same message and can't
The error seems to come from https://github.com/hashicorp/nomad/blob/master/nomad/state/state_store.go#L723. Could it be that an empty NodeID is the problem here?

EDIT: my volume isn't attached, it is just not claimable.
The
I really can't disagree with you there, @jfcantu. The CSI project had been my primary focus but we were pretty short-handed for it during development. As you might note from some of the other open issues with the
There is not and we won't be backporting that. |
That's great info. Thanks for the update! I'm eagerly anticipating the next steps. |
Hello, I have just updated to 0.12.0, and it seems your fix is in. However, I am still unable to remove this ghost allocation that keeps the volume in a locked state.
Corresponding log
Ok, that's an interesting state: the allocation doesn't exist at all but is showing running status for the volume. That gives me a new clue on where to look. |
Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:
I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume.
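Assuming that escape hatch surfaces as a detach-style subcommand, usage would presumably look something like this (check the 0.12.2 CLI docs for the final form):

```sh
# Manually release a volume's claim against a specific node (IDs illustrative):
nomad volume detach my-volume-id <node-id>
```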
For the sake of our planning, I'm going to close this issue. We'll continue to track progress of this set of problems in #8100.
I'm still seeing this issue on 0.12.4. Even when all allocations are failed or stopped (or I garbage collect them so they don't exist anymore), the volume status still shows these ghost allocations. I can deregister if I use -force, but it doesn't seem to garbage collect these ghost allocations correctly.
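For reference, the forced deregistration mentioned above is along these lines (ID illustrative):

```sh
# Removes the volume record, but the ghost allocations never get cleaned up:
nomad volume deregister -force my-volume-id
```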
I see the same issue with aws-ebs-csi-driver after upgrading to v1.0.0. I stopped all nodes, replaced their Nomad binary, and started the Nomad service again. I guess the issue isn't reproduced on newer Nomad versions because I don't see any issues on GitHub. If my problem is in Nomad's environment, can someone suggest the correct way to update Nomad on nodes and clean up its environment?
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Getting a "volume in use" error when trying to deregister, long after the volume has stopped being in use. Possibly a recurrence or variation of #7625, but this time my whole cluster is running Nomad
0.11.2
, whereas that issue was on a0.11.0
RC.Nomad version
v0.11.2
Operating system and Environment details
Ubuntu 16
Issue
I have jobs that have been stopped for over a day, but they're still somehow preventing me from deregistering the CSI volume that was associated with them.
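The rough sequence is (names illustrative):

```sh
# The job was stopped more than a day ago:
nomad job stop my-job
nomad job status my-job                # everything shows as dead/complete

# But deregistering the volume it used still fails:
nomad volume deregister my-volume-id   # "volume in use"
```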
Nomad Client logs (if appropriate)
I've also verified via the AWS console that the volume itself is not attached to an instance, but is in the available state.

Nomad Server logs (if appropriate)