
Long lasting request (attach/detach) fail because token expiration #1394

Closed
mape90 opened this issue Feb 2, 2021 · 15 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mape90 commented Feb 2, 2021

What happened:
Cinder internally fails to attach or detach, causing a permanent failure for the volume.

The issue is due to an OpenStack design problem where user tokens are reused internally; there is a solution on the OpenStack side:
https://docs.openstack.org/cinder/latest/configuration/block-storage/service-token.html

However, not everyone has that fix, or is willing or able to fix their OpenStack installation.

What you expected to happen:
Detach and attach should work.

How to reproduce it:
Have an OpenStack without service tokens configured for the Cinder service, and attach or detach volumes around the time the token is about to expire. Attach/detach also needs to be a bit slow on the OpenStack side so that the window of failure is bigger.

Anything else we need to know?:
The fix is to create a new client every time we do an attach or detach (AttachVolume and DetachVolume in openstack_volume.go in csi/cinder/openstack):

```go
cli, err := openstack.NewComputeV2(os.compute.ProviderClient, os.epOpts)
```

This is already done for some OpenStack calls, like attach with multiattach, and in volume expansion.

The downside is that we need to make an extra call to the Keystone API, but this fixes many error situations on older OpenStacks that do not have service tokens defined by default. I am not sure whether those are enabled by default even on newer OpenStacks.

All the currently maintained binaries are:

  • cinder-csi-plugin

/kind bug

@k8s-ci-robot added the kind/bug label Feb 2, 2021
@kayrus (Contributor) commented Feb 2, 2021

> Cinder internally fails to attach or detach, causing a permanent failure for the volume.

What error/response code does the Cinder API return? If it is a 401, the OpenStack client should reauth automatically and retry.

UPD:

> The fix is to create a new client every time we do an attach or detach (AttachVolume and DetachVolume in openstack_volume.go in csi/cinder/openstack)

Creating a new service client won't cause a reauth; you need to create a brand new provider client and authenticate it.
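
For illustration, a rough sketch of the difference (a hypothetical helper; `authOpts` stands for whatever gophercloud.AuthOptions the driver was originally configured with):

```go
import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack"
)

// newComputeClient authenticates from scratch and returns a compute client
// backed by a fresh token.
func newComputeClient(authOpts gophercloud.AuthOptions, epOpts gophercloud.EndpointOpts) (*gophercloud.ServiceClient, error) {
    // openstack.NewComputeV2(existingProviderClient, epOpts) would only wrap
    // the SAME cached token; getting a new token requires a brand new,
    // freshly authenticated provider client:
    provider, err := openstack.AuthenticatedClient(authOpts)
    if err != nil {
        return nil, err
    }
    return openstack.NewComputeV2(provider, epOpts)
}
```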

@mape90 (Author) commented Feb 2, 2021

It is Nova where the attach and detach requests go. The API call returns normally, as if everything were fine. However, after the API call Nova internally starts handling the attach/detach using the same token the user used against it. If that token expires, the volume handling stalls and no further action is possible. Cinder CSI has no way of detecting or fixing this, as the volume is now permanently stuck in the attaching/detaching state; someone has to forcefully reset the state of the volume.

OK, yes, then we would need to call CreateOpenStackProvider() if I am not wrong. Or if we could reauthenticate at ~50% of the token lifetime, that could fix the problem. However, I do not think gophercloud implements such a feature, or is willing to implement it.

@kayrus (Contributor) commented Feb 2, 2021

I don't think that reauth in advance would help. The Nova API is asynchronous, and we probably need to introduce additional "waitfor" functions to verify the status of the volume, like it is done for LBaaS.
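
Something in the spirit of the following (a hypothetical helper; the function name, the poll interval, and using the blockstorage v3 volumes package are assumptions):

```go
import (
    "fmt"
    "time"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// waitForVolumeStatus polls the volume until it reaches the wanted status,
// so an asynchronous Nova attach/detach can be verified to have finished.
func waitForVolumeStatus(client *gophercloud.ServiceClient, volumeID, want string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        vol, err := volumes.Get(client, volumeID).Extract()
        if err != nil {
            return err
        }
        if vol.Status == want {
            return nil
        }
        time.Sleep(3 * time.Second)
    }
    return fmt.Errorf("volume %s did not reach status %q within %s", volumeID, want, timeout)
}
```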

@mape90 (Author) commented Feb 2, 2021

The problem isn't the waiting; attach and detach already implement waiting.

The problem is that OpenStack Nova itself is poorly implemented: it blindly reuses the token given by the user for its own future actions.

Example:
The user has a token that will expire in 1 min.
An attach request is sent to Nova with that token. Keystone validates the token and says OK, so Nova returns 200 and starts executing the attach (it changes the DB state of the volume to attaching). However, Nova does not perform the attach itself; that is done by Cinder, so Nova tries to delegate the attach to Cinder. It tries to use the user token against Cinder, but that token has now expired, and Nova does not implement reauthentication, so it cannot continue. It also does not have any rollback or recovery implementation, so it just stops handling the attachment: the volume is now indefinitely in the attaching state, and no attach to the VM has even been attempted.

Some OpenStacks introduced an optional feature where Nova creates its own service token that it manages itself. When Nova performs these delegated actions with the user token, it authorizes using both tokens (its own and the user's). This then allows accepting expired user tokens as long as Nova's own service token is still valid.

So if Cinder CSI created a new token/client every time it attaches or detaches, this would be resolved, as OpenStack is then usually capable of handling the request.

@kayrus (Contributor) commented Feb 2, 2021

Since you're talking about internal OpenStack service communication, are there related OpenStack bug reports?

Are the Nova/Cinder attach/detach actions the only actions that require a token renewal?

cc @Joker-official @RaphaelVogel: can this be related to our cases where volume attach/detach fails?

@jichenjc (Contributor) commented Feb 3, 2021

@kayrus I assume this will be fixed at the gophercloud layer, by adding a re-auth function on some time limit? Thus we don't need to consider creating a new client here?

Just curious, how does gophercloud handle token expiry now? E.g., we created a cloud provider client; if over time the token expires in the OCCM layer, what will happen? Thanks

@kayrus (Contributor) commented Feb 3, 2021

@jichenjc

> Just curious, how does gophercloud handle token expiry now?

Reauth is triggered by a 401 response code, so an API call must be performed to identify that the client token is expired.
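
For reference, a minimal sketch of how that 401-triggered reauth is enabled in gophercloud (reading credentials from environment variables is just an example):

```go
import (
    "log"

    "github.com/gophercloud/gophercloud/openstack"
)

func main() {
    opts, err := openstack.AuthOptionsFromEnv()
    if err != nil {
        log.Fatal(err)
    }
    // AllowReauth makes gophercloud re-authenticate and retry the request
    // when an API call returns 401; without it the 401 reaches the caller.
    opts.AllowReauth = true

    provider, err := openstack.AuthenticatedClient(opts)
    if err != nil {
        log.Fatal(err)
    }
    _ = provider // build service clients from this provider as usual
}
```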

@jichenjc (Contributor) commented Feb 3, 2021

> > Just curious, how does gophercloud handle token expiry now?
>
> Reauth is triggered by a 401 response code, so an API call must be performed to identify that the client token is expired.

Thanks, this makes sense to me. So the issue is that if the token expires in the middle of a long call, gophercloud can do nothing... thus we need to introduce a timely refresh mechanism.

@joker-at-work commented Feb 3, 2021

> Are the Nova/Cinder attach/detach actions the only actions that require a token renewal?

I would assume any action where Nova makes a call to another service can be problematic in some way. Adding/removing a port might thus be problematic too, but I would assume a retry helps there.

> cc @Joker-official @RaphaelVogel: can this be related to our cases where volume attach/detach fails?

We have had service tokens enabled for ~2 years, and it has helped a lot.

@mape90 (Author) commented Feb 3, 2021

The service token was introduced to OpenStack to fix this issue. However, not all production OpenStacks have this feature enabled. Usually users do not see this if they use the OpenStack CLI/Horizon for all their actions, as those create a new token for every request (sometimes they even create multiple tokens, as that code was never built to be efficient).

But if gophercloud had a feature to renew the token after X minutes, it would solve this issue. Or at least, if there were a long-lasting request, you would only need to tune the renew-after value in the CSI driver and/or update the OpenStack token expiration time.

Luckily, attach and detach are usually quite fast; even on slow systems they take just a few minutes. So being able to renew the token some time before it expires could solve the issue on those old OpenStacks.
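
A hypothetical renewal loop along those lines (`Reauthenticate` and `Token` are existing gophercloud `ProviderClient` methods; the function name and interval are assumptions, and the interval would have to be tuned to the deployment's token lifetime):

```go
import (
    "log"
    "time"

    "github.com/gophercloud/gophercloud"
)

// startTokenRenewal proactively re-authenticates well before the token can
// expire, so a long-running Nova/Cinder operation never inherits a token
// that is about to die. The provider must have been authenticated with
// AllowReauth so that its ReauthFunc is set.
func startTokenRenewal(provider *gophercloud.ProviderClient, interval time.Duration) {
    go func() {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for range ticker.C {
            if err := provider.Reauthenticate(provider.Token()); err != nil {
                log.Printf("proactive token renewal failed: %v", err)
            }
        }
    }()
}
```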

@kayrus (Contributor) commented Feb 3, 2021

> But if gophercloud had a feature to renew the token after X minutes, it would solve this issue. Or at least, if there were a long-lasting request, you would only need to tune the renew-after value in the CSI driver and/or update the OpenStack token expiration time.

Each token contains an expires_at attribute:

```json
{
  "token": {
    "audit_ids": [
      "abcdefg"
    ],
    "catalog": "***",
    "expires_at": "2021-02-03T19:50:45.000000Z",
...
}
```

This attribute can be checked in advance before each API call, and a reauth triggered if necessary. @mape90 it's better to discuss this topic directly in gophercloud.
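
A rough sketch of such a pre-call check (`ExpiresAt` is a real field of the identity v3 `tokens.Token` struct; the helper name and the safety margin, e.g. `5*time.Minute`, are assumptions):

```go
import (
    "time"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/identity/v3/tokens"
)

// reauthIfExpiring asks Keystone for the current token's expires_at and
// triggers a reauth when the token is within `margin` of expiring.
func reauthIfExpiring(provider *gophercloud.ProviderClient, identity *gophercloud.ServiceClient, margin time.Duration) error {
    token, err := tokens.Get(identity, provider.Token()).ExtractToken()
    if err != nil {
        // e.g. the token is already invalid: fall back to a full reauth
        return provider.Reauthenticate(provider.Token())
    }
    if time.Until(token.ExpiresAt) < margin {
        return provider.Reauthenticate(provider.Token())
    }
    return nil
}
```

Called right before each attach/detach request, this would keep long-running operations from starting with a nearly expired token.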

@fejta-bot commented
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 4, 2021
@fejta-bot commented
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 3, 2021
@fejta-bot commented
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
