Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vault: always renew tokens using the renewal loop #18998

Merged
merged 2 commits into from
Nov 8, 2023

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Nov 3, 2023

Previously, a Vault token could renewed either periodically via the renewal loop or immediately by calling RenewToken().

But a race condition in the renewal loop could cause an attempt to renew an expired token. If both updateCh and renewalCh are active (such as when a task stops at the same time its token is waiting for renewal), the following select picks a case at random.

select {
case <-renewalCh:
if err := c.renew(renewalReq); err != nil {
c.logger.Error("error renewing token", "error", err)
metrics.IncrCounter([]string{"client", "vault", "renew_token_error"}, 1)
}
case <-c.updateCh:
continue

If case <-renewalCh is picked, the token is incorrectly re-added to the heap, causing unnecessary renewals of a token that is already expired.

// If the identifier is not already tracked, this is a first
// renewal request. In this case, add an entry into the heap
// with the next renewal time.
if err := c.heap.Push(req, next); err != nil {
return fmt.Errorf("failed to push an entry to heap. err: %v", err)
}

To prevent this situation, the renew() function should only renew tokens that are currently in the heap, so RenewToken() must first push the token to the heap and wait for the renewal to happen instead of calling renew() directly since this could cause another race condition where the token is renewed twice: once by RenewToken() calling renew() directly and a second time if the renewal happens to pick the token as soon as RenewToken() adds it to the heap.

Extracted from #18985

Previously, a Vault token could renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.

But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.

https://github.com/hashicorp/nomad/blob/78f0c6b2a910c25522ac88ff63343bb371d72bff/client/vaultclient/vaultclient.go#L557-L564

If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.

https://github.com/hashicorp/nomad/blob/1604dba508b2ae8187b6b9405ce6d9b6ca8e8b4b/client/vaultclient/vaultclient.go#L505-L510

To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
// Any error returned during the periodical renewal will be written to a
// buffered channel and the channel is returned instead of an actual error.
// This helps the caller be notified of a renewal failure asynchronously for
// appropriate actions to be taken. The caller of this function need not have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing a closed channel causes a panic. If the renewal loop has ownership over closing the channel, should we be more firm with this comment and say that the caller of the function must not close the channel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the comment already existed, so I just kept it as is, but reading again it's indeed weirdly phrased 😅

It's also not necessary because RenewToken returns a read channel and trying to close it is a compilation error: https://go.dev/play/p/MQxK_XyGa-7

I will remove this last sentence.

errCh := make(chan error, 1)

// Create a renewal request and indicate that the identifier in the
// request is a token and not a lease
renewalReq := &vaultClientRenewalRequest{
renewalCh: renewalCh,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using a variable renewalCh in the run method to pick up timer events to start a renew call (something is about to happen in the near future). Whereas this channel with the same name is signaling that this particular request has been successfully renewed (something has just happened in the recent past). So maybe for clarity we should name this something like completeCh or renewedCh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum...good point. I renamed it to renewalLoopCh to be more explicit, but I found very tricky to find a good name 😅

I feel like completeCh and renewedCh are a bit misleading because the operation is not complete (the channel will still receive values) and the token may not have been renewed due to an error.

renewalLoopCh is kind of too specific because it leaks the notion that the renewal loop exists, but this is an internal struct so probably OK?

Comment on lines 442 to 452
// Always notify listeners that the request has been processed before
// exiting.
defer func() {
for {
select {
case req.renewalCh <- struct{}{}:
default:
return
}
}
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this block. If we send on the channel and it's blocked, we'll fall through the default and then return. Otherwise, we'll send on the channel and then loop again and try to send again. Only until the channel is blocked do we end up returning? If we're going to return if we're blocked anyways, why have the loop instead of something like this?

defer func() {
	select {
	case req.renewalCh <- struct{}{}:
	default:
	}
}()

Also, it looks like RenewToken is the only method that populates renewalCh and it doesn't return the channel for another caller. So we only ever read from this channel once (in RenewToken), right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I used a loop here in case there were multiple listeners but never actually happens so I removed the outer for loop.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@lgfa29 lgfa29 merged commit 7054fe1 into main Nov 8, 2023
25 of 26 checks passed
@lgfa29 lgfa29 deleted the f-refactor-vault-token-renewal branch November 8, 2023 00:49
lgfa29 added a commit that referenced this pull request Feb 12, 2024
lgfa29 added a commit that referenced this pull request Feb 13, 2024
tgross pushed a commit that referenced this pull request Feb 13, 2024
* Revert "vault: always renew tokens using the renewal loop (#18998)"
  This reverts commit 7054fe1.
* test: add case for concurrent Vault token renewal
tgross pushed a commit that referenced this pull request Feb 13, 2024
…se/1.7.x #19967

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
Previously, a Vault token could renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.

But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.

https://github.com/hashicorp/nomad/blob/78f0c6b2a910c25522ac88ff63343bb371d72bff/client/vaultclient/vaultclient.go#L557-L564

If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.

https://github.com/hashicorp/nomad/blob/1604dba508b2ae8187b6b9405ce6d9b6ca8e8b4b/client/vaultclient/vaultclient.go#L505-L510

To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
)

* Revert "vault: always renew tokens using the renewal loop (hashicorp#18998)"
  This reverts commit 7054fe1.
* test: add case for concurrent Vault token renewal
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
Previously, a Vault token could renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.

But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.

https://github.com/hashicorp/nomad/blob/78f0c6b2a910c25522ac88ff63343bb371d72bff/client/vaultclient/vaultclient.go#L557-L564

If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.

https://github.com/hashicorp/nomad/blob/1604dba508b2ae8187b6b9405ce6d9b6ca8e8b4b/client/vaultclient/vaultclient.go#L505-L510

To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
)

* Revert "vault: always renew tokens using the renewal loop (hashicorp#18998)"
  This reverts commit 7054fe1.
* test: add case for concurrent Vault token renewal
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants