
Long lasting request (attach/detach) fail because token expiration #1394

Closed
mape90 opened this issue Feb 2, 2021 · 15 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mape90 commented Feb 2, 2021

What happened:
Cinder internally fails to attach or detach, causing a permanent failure for the volume.

The issue is due to an OpenStack design problem where user tokens are reused internally; there is a solution on the OpenStack side:
https://docs.openstack.org/cinder/latest/configuration/block-storage/service-token.html

However, not everyone has that fix, or is willing or able to fix their OpenStack installation.

What you expected to happen:
Detach and attach should work.

How to reproduce it:
Have an OpenStack without service tokens configured for the Cinder service, and attach or detach volumes around the time the token is about to expire. Attach/detach also needs to be a bit slow on the OpenStack side so that the window of failure is bigger.

Anything else we need to know?:
The fix is to create a new client every time we do an attach or detach (AttachVolume and DetachVolume in openstack_volume.go in csi/cinder/openstack):

```go
cli, err := openstack.NewComputeV2(os.compute.ProviderClient, os.epOpts)
```

This is already done for some OpenStack calls, like attach with multiattach, and in volume expansion.

The downside is that we need to make an extra call to the Keystone API, but this fixes many error situations on older OpenStacks that do not have service tokens defined by default. I am not sure whether those are enabled by default even on newer OpenStacks.

All the currently maintained binaries are:

  • cinder-csi-plugin

/kind bug

@k8s-ci-robot added the kind/bug label Feb 2, 2021
@kayrus (Contributor) commented Feb 2, 2021

> Cinder internally fails to attach or detach, causing a permanent failure for the volume.

What error/response code does the Cinder API return? If it is a 401, the OpenStack client should reauth automatically and retry.

UPD:

> The fix is to create a new client every time we do an attach or detach (AttachVolume and DetachVolume in openstack_volume.go in csi/cinder/openstack)

Creating a new service client won't cause a reauth; you need to create a brand new provider client and authenticate it.
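
For illustration, a rough sketch of the difference (a hypothetical helper; `authOpts` stands for whatever gophercloud.AuthOptions the driver was originally configured with):

```go
import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack"
)

// newComputeClient authenticates from scratch and returns a compute client
// backed by a fresh token.
func newComputeClient(authOpts gophercloud.AuthOptions, epOpts gophercloud.EndpointOpts) (*gophercloud.ServiceClient, error) {
    // openstack.NewComputeV2(existingProviderClient, epOpts) would only wrap
    // the SAME cached token; getting a new token requires a brand new,
    // freshly authenticated provider client:
    provider, err := openstack.AuthenticatedClient(authOpts)
    if err != nil {
        return nil, err
    }
    return openstack.NewComputeV2(provider, epOpts)
}
```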

@mape90 (Author) commented Feb 2, 2021

It is Nova where the attach and detach requests go. The API call returns normally, as if everything were fine. However, after the API call Nova internally starts handling the attach/detach using the same token the user used against it. If that token expires, the volume handling stalls and no further action is possible. Cinder CSI has no way of detecting or fixing this, as the volume is now permanently stuck in the attaching/detaching state; someone has to forcefully reset the state of the volume.

OK, yes, then we would need to call CreateOpenStackProvider() if I am not wrong. Or if we could reauthenticate at ~50% of the token lifetime, that could fix the problem. However, I do not think gophercloud implements such a feature, or is willing to implement it.

@kayrus (Contributor) commented Feb 2, 2021

I don't think that reauth in advance would help. The Nova API is asynchronous, and we probably need to introduce additional "waitfor" functions to verify the status of the volume, like it is done for LBaaS.
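
Something in the spirit of the following (a hypothetical helper; the function name, the poll interval, and using the blockstorage v3 volumes package are assumptions):

```go
import (
    "fmt"
    "time"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// waitForVolumeStatus polls the volume until it reaches the wanted status,
// so an asynchronous Nova attach/detach can be verified to have finished.
func waitForVolumeStatus(client *gophercloud.ServiceClient, volumeID, want string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        vol, err := volumes.Get(client, volumeID).Extract()
        if err != nil {
            return err
        }
        if vol.Status == want {
            return nil
        }
        time.Sleep(3 * time.Second)
    }
    return fmt.Errorf("volume %s did not reach status %q within %s", volumeID, want, timeout)
}
```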

@mape90 (Author) commented Feb 2, 2021

The problem isn't the waiting; attach and detach already implement waiting.

The problem is that OpenStack Nova itself is poorly implemented: it blindly reuses the token given by the user for its own future actions.

Example:
The user has a token that will expire in 1 min.
An attach request is sent to Nova with that token. Keystone validates the token and says OK, so Nova returns 200 and starts executing the attach (it changes the DB state of the volume to attaching). However, Nova does not perform the attach itself; that is done by Cinder, so Nova tries to delegate the attach to Cinder. It tries to use the user token against Cinder, but that token has now expired, and Nova does not implement reauthentication, so it cannot continue. It also does not have any rollback or recovery implementation, so it just stops handling the attachment: the volume is now indefinitely in the attaching state, and no attach to the VM has even been attempted.

Some OpenStacks introduced an optional feature where Nova creates its own service token that it manages itself. When Nova performs these delegated actions with the user token, it authorizes using both tokens (its own and the user's). This then allows accepting expired user tokens as long as Nova's own service token is still valid.

So if Cinder CSI created a new token/client every time it attaches or detaches, this would be resolved, as OpenStack is then usually capable of handling the request.

@kayrus (Contributor) commented Feb 2, 2021

Since you're talking about internal OpenStack service communication, are there related OpenStack bug reports?

Are the Nova/Cinder attach/detach actions the only actions that require a token renewal?

cc @Joker-official @RaphaelVogel: can this be related to our cases where volume attach/detach fails?

@jichenjc (Contributor) commented Feb 3, 2021

@kayrus I assume this will be fixed at the gophercloud layer, by adding a re-auth function on some time limit? Thus we don't need to consider creating a new client here?

Just curious, how does gophercloud handle token expiry now? E.g., we created a cloud provider client; if over time the token expires in the OCCM layer, what will happen? Thanks

@kayrus (Contributor) commented Feb 3, 2021

@jichenjc

> Just curious, how does gophercloud handle token expiry now?

Reauth is triggered by a 401 response code, so an API call must be performed to identify that the client token is expired.
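
For reference, a minimal sketch of how that 401-triggered reauth is enabled in gophercloud (reading credentials from environment variables is just an example):

```go
import (
    "log"

    "github.com/gophercloud/gophercloud/openstack"
)

func main() {
    opts, err := openstack.AuthOptionsFromEnv()
    if err != nil {
        log.Fatal(err)
    }
    // AllowReauth makes gophercloud re-authenticate and retry the request
    // when an API call returns 401; without it the 401 reaches the caller.
    opts.AllowReauth = true

    provider, err := openstack.AuthenticatedClient(opts)
    if err != nil {
        log.Fatal(err)
    }
    _ = provider // build service clients from this provider as usual
}
```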

@jichenjc (Contributor) commented Feb 3, 2021

> > Just curious, how does gophercloud handle token expiry now?
>
> Reauth is triggered by a 401 response code, so an API call must be performed to identify that the client token is expired.

Thanks, this makes sense to me. So the issue is that if the token expires in the middle of a long call, gophercloud can do nothing... thus we need to introduce a timely refresh mechanism.

@joker-at-work commented Feb 3, 2021

> Are the Nova/Cinder attach/detach actions the only actions that require a token renewal?

I would assume any action where Nova makes a call to another service can be problematic in some way. Adding/removing a port might thus be problematic too, but I would assume a retry helps there.

> cc @Joker-official @RaphaelVogel: can this be related to our cases where volume attach/detach fails?

We have had service tokens enabled for ~2 years, and it has helped a lot.

@mape90 (Author) commented Feb 3, 2021

The service token was introduced to OpenStack to fix this issue. However, not all production OpenStacks have this feature enabled. Usually users do not see this if they use the OpenStack CLI/Horizon for all their actions, as those create a new token for every request (sometimes they even create multiple tokens, as that code was never built to be efficient).

But if gophercloud had a feature to renew the token after X minutes, it would solve this issue. Or at least, if there were a long-lasting request, you would only need to tune the renew-after value in the CSI driver and/or update the OpenStack token expiration time.

Luckily, attach and detach are usually quite fast; even on slow systems they take just a few minutes. So being able to renew the token some time before it expires could solve the issue on those old OpenStacks.
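
A hypothetical renewal loop along those lines (`Reauthenticate` and `Token` are existing gophercloud `ProviderClient` methods; the function name and interval are assumptions, and the interval would have to be tuned to the deployment's token lifetime):

```go
import (
    "log"
    "time"

    "github.com/gophercloud/gophercloud"
)

// startTokenRenewal proactively re-authenticates well before the token can
// expire, so a long-running Nova/Cinder operation never inherits a token
// that is about to die. The provider must have been authenticated with
// AllowReauth so that its ReauthFunc is set.
func startTokenRenewal(provider *gophercloud.ProviderClient, interval time.Duration) {
    go func() {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for range ticker.C {
            if err := provider.Reauthenticate(provider.Token()); err != nil {
                log.Printf("proactive token renewal failed: %v", err)
            }
        }
    }()
}
```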

@kayrus (Contributor) commented Feb 3, 2021

> But if gophercloud had a feature to renew the token after X minutes, it would solve this issue. Or at least, if there were a long-lasting request, you would only need to tune the renew-after value in the CSI driver and/or update the OpenStack token expiration time.

Each token contains an expires_at attribute:

```json
{
  "token": {
    "audit_ids": [
      "abcdefg"
    ],
    "catalog": "***",
    "expires_at": "2021-02-03T19:50:45.000000Z",
...
}
```

This attribute can be checked in advance before each API call, and a reauth triggered if necessary. @mape90 it's better to discuss this topic directly in gophercloud.
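
A rough sketch of such a pre-call check (`ExpiresAt` is a real field of the identity v3 `tokens.Token` struct; the helper name and the safety margin, e.g. `5*time.Minute`, are assumptions):

```go
import (
    "time"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/identity/v3/tokens"
)

// reauthIfExpiring asks Keystone for the current token's expires_at and
// triggers a reauth when the token is within `margin` of expiring.
func reauthIfExpiring(provider *gophercloud.ProviderClient, identity *gophercloud.ServiceClient, margin time.Duration) error {
    token, err := tokens.Get(identity, provider.Token()).ExtractToken()
    if err != nil {
        // e.g. the token is already invalid: fall back to a full reauth
        return provider.Reauthenticate(provider.Token())
    }
    if time.Until(token.ExpiresAt) < margin {
        return provider.Reauthenticate(provider.Token())
    }
    return nil
}
```

Called right before each attach/detach request, this would keep long-running operations from starting with a nearly expired token.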

@fejta-bot commented
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 4, 2021
@fejta-bot commented
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 3, 2021
@fejta-bot commented
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
