Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vault: avoid continual renewal of invalid token #18985

Merged
merged 2 commits into from Nov 8, 2023
Merged

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Nov 2, 2023

A series of errors may happen when a token is invalidated while the Vault client is waiting to renew it. The token may have been invalidated for several reasons, such as the alloc finished running and it's now terminal or the token may have been change directly on Vault out-of-band.

Most of the errors are caused by retries that will never succeed until Vault fully removes the token from its state.

This commit prevents the retries by making the error invalid lease ID a fatal error.

In earlier versions of Vault, this case was covered by the error lease not found or lease is not renewable, which is already considered to be a fatal error by Nomad:

https://github.com/hashicorp/vault/blob/2d0cde4ccc0323591d9414342cb15f5cb70271d7/vault/expiration.go#L636-L639

But hashicorp/vault#5346 introduced an earlier nil check that generates a different error message:

https://github.com/hashicorp/vault/blob/750ab337eaa0b049d9cf1535c00e860129e5e9a0/vault/expiration.go#L1362-L1364

Both errors happen for the same reason (le == nil) and so should be considered fatal on renewal.

Closes #12465

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work on this @lgfa29!

Comment on lines 373 to 376
// Add renewal request to the heap to start tracking the token.
c.lock.Lock()
c.heap.Push(renewalReq, time.Now())
c.lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the original heap.Push or at least change this comment?

// If the identifier is not already tracked, this is a first
// renewal request. In this case, add an entry into the heap
// with the next renewal time.

Actually, is that whole branch now an unreachable state because the if c.isTracked(req.id) conditional will always be true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum...good point. Looking at the code again, I think this change can actually add a new racing condition where the token is renewed twice!

When RenewToken() pushes the token to heap the run() loop may pick it up for renewal right away, but then RenewToken() calls c.renew() for an immediate renewal, resulting in a double renew.

I don't think that's a problem but it's certainly unnecessary.

I think we could make it so only the run() loop actually renew tokens and RenewToken() blocks until the renewal request completes, but that could add some latency to task start.

Another option is to revert the new !c.isTracked check and just adjust the error so it's considered fatal and we only get one error max.

The select race condition that it tries to address seems rather unlikely to happen, so a single sporadic error may not be too bad?

Another option would be to have different behaviours if c.renew() is called from RenewToken() or run(), like a new renewUntracked() function that doesn't check the heap 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make it so only the run() loop actually renew tokens and RenewToken() blocks until the renewal request completes, but that could add some latency to task start.

I like this idea, because trying to call renew from two different places seems to be the root of the race. The only potential added latency would come from another token/lease being renewed at the same time as the task starts, so that it hits the run loop while that loop is otherwise occupied. But I'm pretty sure we already have that same latency today, because only one token/lease can be renewed at the same time because renew takes the mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Implemented this in #18998.

Since it's a fair amount of changes I think it may be better to avoid backporting it. I then updated this PR to only include the extra error check which will prevent the multiple renewals and is much safer to backport.

A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons, such as the alloc finished running and it's now
terminal or the token may have been change directly on Vault
out-of-band.

Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID`
a fatal error.

In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:

https://github.com/hashicorp/vault/blob/2d0cde4ccc0323591d9414342cb15f5cb70271d7/vault/expiration.go#L636-L639

But hashicorp/vault#5346 introduced an earlier
`nil` check that generates a different error message:

https://github.com/hashicorp/vault/blob/750ab337eaa0b049d9cf1535c00e860129e5e9a0/vault/expiration.go#L1362-L1364

Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
@lgfa29 lgfa29 merged commit ab36cf0 into main Nov 8, 2023
25 of 26 checks passed
@lgfa29 lgfa29 deleted the b-fix-vault-renew-race branch November 8, 2023 00:50
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons, such as the alloc finished running and it's now
terminal or the token may have been change directly on Vault
out-of-band.

Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID`
a fatal error.

In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:

https://github.com/hashicorp/vault/blob/2d0cde4ccc0323591d9414342cb15f5cb70271d7/vault/expiration.go#L636-L639

But hashicorp/vault#5346 introduced an earlier
`nil` check that generates a different error message:

https://github.com/hashicorp/vault/blob/750ab337eaa0b049d9cf1535c00e860129e5e9a0/vault/expiration.go#L1362-L1364

Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons, such as the alloc finished running and it's now
terminal or the token may have been change directly on Vault
out-of-band.

Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID`
a fatal error.

In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:

https://github.com/hashicorp/vault/blob/2d0cde4ccc0323591d9414342cb15f5cb70271d7/vault/expiration.go#L636-L639

But hashicorp/vault#5346 introduced an earlier
`nil` check that generates a different error message:

https://github.com/hashicorp/vault/blob/750ab337eaa0b049d9cf1535c00e860129e5e9a0/vault/expiration.go#L1362-L1364

Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vault token renewals race with task shutdown
2 participants