Add "Alias" Check Type #4320
Add "Alias" Check Type #4320
Conversation
agent/agent.go
Outdated
```go
var rpcReq structs.NodeSpecificRequest
rpcReq.Datacenter = a.config.Datacenter
rpcReq.Token = a.tokens.AgentToken()
```
Is this the correct behavior? Should we use the agent token if none is specified, or the user token?
Hmm I think it actually depends on context.
For checks added during service registration, this should really use the token that the check/service was registered with, as the least surprising option - if you register with a token that can read some peer service which is otherwise denied to others, it seems logical that you are implicitly granting the check registered under that token the same access.
But that would be a new direction for checks so far I think - we'd have to somehow pass through or lookup the service registration token from local state here.
However, if this is a node-level check, we should prefer to use AgentToken as you have here, since that is the one preferred over the user token when registering the node.
These are all pretty big edge cases but we should think about them - it might actually make some sense to pass a token as an argument for the check in this case (internally at least) so that we can set it with the one that makes most sense in the context that is creating the check - e.g. during node registration or service registration etc.?
Needs more thought - maybe this is good enough for now as the others are likely edge cases, but it could be surprising ACL behaviour. I think AgentToken makes most sense short of the more invasive changes mentioned above.
Isn't this check pretty similar to a prepared query? For prepared queries we allow configuring the token in the query definition.
So maybe we should add a "Token" parameter to the check definition and use that if provided, or the anonymous token otherwise.
Checks actually already support a token used for anti-entropy for that check. That might satisfy the behavior @banks is expecting above. We've never passed that into a check implementation, so it'd require a bit more plumbing (it's not on the struct), but that might be worth it.
So proposed logic:

- If `token` is specified on the check, use that.
- Else, use the agent token if specified.
- Else, no token.
Thoughts?
@mitchellh I think the least surprising thing would be if we default to copying the current scope token (from node/service registration) into the Check's token if it's not otherwise specified.
After that, I agree with your proposed logic being sane. My worry is that having to specify the same token in the service definition and in each check body within that definition seems like bad UX, even if it's only needed in the edge case where the check can't anonymously get the access it needs.
I looked into this a bit further and this is the behavior with checks (with an asterisk... noted below):

- For checks embedded in a service registration, the token used is the token specified with the service registration (directly in the definition). If not specified, the user token is used.
  - Note on the above: if a token is explicitly specified in the embedded check, it is ignored. I don't see this documented anywhere and it's unlikely anyone has tried this, but that's how it works.
- For standalone checks, the token on the check definition is used, otherwise the agent token.

This aligns really nicely with what is proposed above and is therefore probably the right decision: 1) use the registration token, 2) else the agent token. That matches everything else.
I won't change the behavior of ignoring the embedded check token for service registration, since it seems designed that way.
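A minimal sketch of that precedence (illustrative helper and names, not the PR's actual code):

```go
package main

import "fmt"

// aliasCheckToken is a hypothetical helper showing the precedence above:
// a token from the check (or its enclosing service registration) wins,
// then the agent token, then no token at all.
func aliasCheckToken(checkToken, agentToken string) string {
	if checkToken != "" {
		return checkToken // explicit registration token
	}
	if agentToken != "" {
		return agentToken // agent token, matching node registration behavior
	}
	return "" // anonymous
}

func main() {
	fmt.Println(aliasCheckToken("", "agent-secret")) // -> agent-secret
}
```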
```go
// case is safer than a 100% CPU loop.
if args.MinQueryIndex < 1 {
	args.MinQueryIndex = 1
}
```
Learned my lesson (cc @banks). Even if this is theoretically not possible, I protect against it.
This looks awesome @mitchellh!
I think overall it's pretty close to being able to land. I think we should talk about the ACL thing a bit more, and there are some minor typos/nits inline.
I would argue for not holding this until 1.3. It's not a breaking change, and the fact that Connect will route traffic to unhealthy instances is a reasonably large wart that I hoped we'd squash in a rapid point release during the beta rather than push out. This is a prerequisite for fixing that.
agent/checks/alias.go
Outdated
```go
)

// CheckAlias is a check type that aliases the health of another service
// instance. If the service aliased has any critical health checks, then
```
nit: should we say "aliases another service instance or whole node"? That use case probably fell out of the implementation, but it's valid so we might as well mention it.
Yep, agreed. Updating.
agent/checks/alias.go
Outdated
```go
Checks() map[types.CheckID]*structs.HealthCheck
}

// Start is used to start a check ttl, runs until Stop()
func (c *CheckAlias) Start() {
```
typo?
agent/checks/alias.go
Outdated
```go
// NOTE(mitchellh): This currently returns ALL health checks for
// a node even though we also have the service ID. This can be
// optimized if we introduce a new RPC endpoint to filter both,
// but for blocking queries isn't that more efficient since the checks
```
typo: isn't that much more efficient?
agent/checks/alias.go
Outdated
```go
waitTime := (1 << (attempt - checkAliasBackoffMin)) * time.Second
if waitTime > checkAliasBackoffMaxWait {
	waitTime = checkAliasBackoffMaxWait
}
```
I refactored a loop like this in connect code to actually bounds check the shift - if we retry more than 32 + checkAliasBackoffMin times here then the wait goes to zero and we go into a hot loop.
Note that the clamp on checkAliasBackoffMaxWait doesn't save us here, because waitTime will be zero at that point so it won't compare greater than the max.
For a worked example, let attempts = 35, checkAliasBackoffMin = 3, checkAliasBackoffMaxWait = 1 minute:

- waitTime := (1 << (35 - 3)) * time.Second
- waitTime := (1 << 32) * time.Second
- waitTime := 0 (https://play.golang.org/p/xdeslj9CgUk)
- busy loop from now on.
That does require about 30 mins of failures from servers but that doesn't seem super unlikely!
Fixed by bounds checking the shift amount.
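For illustration, a hedged sketch of the fix's shape, reusing the constant names from the hunk above (the helper name and the cap of 31 are assumptions, not the PR's exact code):

```go
package main

import (
	"fmt"
	"time"
)

const (
	checkAliasBackoffMin     uint = 3               // attempts before backoff kicks in
	checkAliasBackoffMaxWait      = 1 * time.Minute // clamp on the computed wait
)

// backoffWait bounds the shift amount itself, so the computed duration
// can never wrap to zero and slip past the max-wait clamp below.
func backoffWait(attempt uint) time.Duration {
	if attempt < checkAliasBackoffMin {
		return 0
	}
	shift := attempt - checkAliasBackoffMin
	if shift > 31 {
		shift = 31 // bound the shift; 1 << 32 is where the wrap begins
	}
	waitTime := (1 << shift) * time.Second
	if waitTime > checkAliasBackoffMaxWait {
		waitTime = checkAliasBackoffMaxWait
	}
	return waitTime
}

func main() {
	fmt.Println(backoffWait(35)) // 1m0s, not a zero-wait hot loop
}
```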
agent/local/state.go
Outdated
```go
// Do not block. All notify channels should be buffered to at
// least 1 in which case not-blocking does not result in loss
// of data because a failed send means a notification is
// already queued.
```
I spent a while describing a race condition with this logic since the state output isn't actually updated at this point, but then realised it's all moot since this whole method is holding the state lock.
Maybe we could add a note here stating that as a necessary condition for the invariant you claimed? If an alias could drain its notification chan and load the local check state before the status and output are updated below, there would be a possible race. It's only mitigated because they can't get the lock to load the state until we're done with this whole method.
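For context, the non-blocking send under discussion looks roughly like this (a sketch assuming a notify channel buffered to 1, not the file's actual code):

```go
package main

import "fmt"

// notify drops the send if the buffer is full: a failed send means a
// notification is already queued, so the watcher will wake and re-read
// state anyway. As noted above, this is only safe because the caller
// holds the state lock until the status/output fields are updated.
func notify(ch chan struct{}) {
	select {
	case ch <- struct{}{}:
	default: // already queued; dropping is lossless
	}
}

func main() {
	ch := make(chan struct{}, 1)
	notify(ch)
	notify(ch) // dropped, but one wake-up is still pending
	<-ch
	fmt.Println("watcher wakes once and reads the latest state")
}
```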
```go
// Start is used to start the check, runs until Stop()
func (c *CheckAlias) Start() {
	c.stopLock.Lock()
```
Nothing to change here really, I just noticed that we have this pattern of signalling something to stop with a bool + mutex + channel. At some point (not necessarily in this PR) we should abstract this out into a struct to encapsulate the logic and stop rewriting it all over the place.
Agreed. I just copy/pasted the other checks. When we have some more breathing room I was going to propose unifying both this as well as a check interface so that we can have less repetition elsewhere (in the agent and config parsers).
I just whipped up a new type to mimic the Python threading.Event type (https://docs.python.org/2.7/library/threading.html#event-objects) with an extra function to get at the actual channel to wait on if asked.
https://gist.github.com/mkeeler/cb88cc762ca36733db0798ca80f1e73e
We could use that or do something else more sophisticated.
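For readers who skip the gist, a one-way latch in that spirit might look like the sketch below (not the gist's exact code; it omits Python's Clear, since a closed channel can't be reopened):

```go
package main

import (
	"fmt"
	"sync"
)

// Event mimics Python's threading.Event as a one-way latch, exposing the
// underlying channel so callers can select on it alongside other cases.
type Event struct {
	once sync.Once
	ch   chan struct{}
}

func NewEvent() *Event { return &Event{ch: make(chan struct{})} }

// Set fires the event; safe to call any number of times.
func (e *Event) Set() { e.once.Do(func() { close(e.ch) }) }

// IsSet reports whether the event has fired, without blocking.
func (e *Event) IsSet() bool {
	select {
	case <-e.ch:
		return true
	default:
		return false
	}
}

// WaitCh returns the channel to select on, e.g. in a check's run loop.
func (e *Event) WaitCh() <-chan struct{} { return e.ch }

func main() {
	stop := NewEvent()
	go stop.Set()
	<-stop.WaitCh()
	fmt.Println("stopped:", stop.IsSet())
}
```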
Nice. This is a bigger discussion for outside of this thread, but we've generally avoided abstracting "shutdown" channels because the shutdown behavior you want generally differs by the type of thing. Abstracting shutdown for checks, though (smaller scope with equivalent behavior), is reasonable to me, but we should talk about it more later when I'm back.
Looks great!
Personally I'd vote for merging this now and releasing it in 1.2.2. I realise it's a new feature, but I don't think our versioning semantics forbid minor new features in patch releases, and I can't think of any way it would be backwards incompatible.
My main motivation for landing it now is that it enables us to use this to fix the Proxy discovery problem earlier, which feels like a decent-sized paper cut in the current beta, and I'm keen to have people who are interested see progress on issues like that before "GA". It gives us more chance for feedback if it somehow doesn't work in edge cases, too.
Just a couple of questions.
agent/local/state_test.go
Outdated
```go
l.UpdateCheck(types.CheckID("c1"), api.HealthCritical, "")
select {
case <-notifyCh:
case <-time.After(100 * time.Millisecond):
```
I don't think we need a time.After here or in any of the other selects. Once the send to the channel in UpdateCheck has finished, the value should be immediately available to the select.
We need the time.After so we can fail if the channel didn't send. I'm specifically testing that the channel works. :)
I know we need a case other than case <-notifyCh, but within the same goroutine the data will already be written into the channel and be available. The time.After case could just be a default case, as the data will either be available or not.
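A sketch of the resulting test shape (hypothetical test; the direct channel send stands in for the one inside UpdateCheck):

```go
package state_test

import "testing"

// TestNotifySynchronous: because the send completes before UpdateCheck
// returns, a default case detects a missing notification immediately.
func TestNotifySynchronous(t *testing.T) {
	notifyCh := make(chan struct{}, 1)
	notifyCh <- struct{}{} // stands in for the send inside UpdateCheck

	select {
	case <-notifyCh:
		// notification was already buffered, as expected
	default:
		t.Fatal("expected notification")
	}
}
```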
Ah ah yes right. Thanks!
agent/local/state_test.go
Outdated
```go
}

// Update change and verify we get notified
l.UpdateCheck(types.CheckID("c1"), api.HealthPassing, "")
```
Maybe I am missing something, but why should we be notified here? Shouldn't the overall health of the service still be critical because check c2 is still critical?
I think I figured it out. Check c2 is not part of the alias check and therefore we should be notified.
Yeah, this is testing that the wrong check ID doesn't send an update.
@mkeeler Addressed your feedback
agent/local/state_test.go
Outdated
```go
case <-notifyCh:
	t.Fatal("notify received")

case <-time.After(50 * time.Millisecond):
```
change to default:
I thought these would be better as a short timer to help check we don't receive a channel send (i.e. if the code was changed to async by accident). 50ms is not too long, so maybe it's not even worth it, but from a test perspective we want to check we don't receive a channel send.
Behaviorally TODAY this could be default, but I was wondering if from a test perspective it's worth loosening to try to catch issues? Or should we just do default?
Putting the time.After here could, I think, provide a false sense of security. If we did change the function to be async, it's likely that 50 or 100ms wait times would not be enough (especially when heavily parallelized or running in environments like Travis). For the cases where we are checking that we don't receive a notification, the test would likely always pass (even when it shouldn't). For the checks where we are expecting a notification, a 50-100ms wait could be too little and we would have periodic test failures. The only way to make this test bullet-proof would be an exceedingly long timeout, but that would make running the test take much longer, even though we do not currently need it to.
My preference is to make the tests as accurate and fast as possible to reduce the feedback-loop time during development. In this case, assuming the underlying function is not async seems reasonable. Also, if the function does become async, the first check expecting a notification should fail quickly and immediately identify the need to change/rewrite the test to accommodate it.
Sounds good. I’ll make the change when I get back to a computer!
Done!
agent/local/state_test.go
Outdated
```go
case <-notifyCh:
	t.Fatal("notify received")

case <-time.After(50 * time.Millisecond):
```
change to default:
Note on merge: If this is approved, this should probably wait until 1.3? It has no breaking changes so I'm not sure.
This PR adds a new check type "alias" that aliases the health status of another service instance.
The use case here is that the health of a service may in fact depend on the health of another service, the most basic example being a sidecar proxy. A sidecar proxy should not be healthy if the service it is proxying to is unhealthy, because any connection to the sidecar will invariably fail when connecting to the upstream service instance.
An alias check can alias a service instance on any node. If a node isn't specified, it will alias the matching service on the current node.
Example
Aliasing a local "web" service:
Aliasing a remote "web" service:
Aliasing a remote node (no service):
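As a hedged sketch, all three cases registered via the Go API client (the AliasNode/AliasService field names follow this PR's additions to the api package; the node name web-node-1 is hypothetical):

```go
package main

import "github.com/hashicorp/consul/api"

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	regs := []*api.AgentCheckRegistration{
		// Alias a local "web" service (no node, so the current node is assumed).
		{
			Name:              "local-web-alias",
			AgentServiceCheck: api.AgentServiceCheck{AliasService: "web"},
		},
		// Alias a remote "web" service on a specific node.
		{
			Name: "remote-web-alias",
			AgentServiceCheck: api.AgentServiceCheck{
				AliasNode:    "web-node-1",
				AliasService: "web",
			},
		},
		// Alias a whole remote node (no service).
		{
			Name:              "remote-node-alias",
			AgentServiceCheck: api.AgentServiceCheck{AliasNode: "web-node-1"},
		},
	}

	for _, reg := range regs {
		if err := client.Agent().CheckRegister(reg); err != nil {
			panic(err)
		}
	}
}
```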
Behavior
I think and hope this behavior is all obvious, but to enumerate:

- The alias check mirrors the aliased target exactly: if the aliased service instance (or node) has any critical health checks, the alias check is critical; otherwise it passes.
- If no node is specified, the alias applies to the matching service on the current node.
Implementation
The alias check uses one of two methods (the remote case is sketched after the list):

- Blocking health query - If the alias is watching a remote node/service, then the check maintains a blocking query to the server to watch the health.
- Local state - If the alias is watching a local service, then the check uses notifications from the local state and in-memory data to watch the health.
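A hedged sketch of the remote (blocking query) loop's shape, with placeholder types standing in for the real structs and RPC plumbing:

```go
package main

import "time"

// Placeholder types standing in for the structs/RPC layer in the hunks above.
type nodeSpecificRequest struct {
	Datacenter    string
	Node          string
	MinQueryIndex uint64
}

type indexedHealthChecks struct {
	Index    uint64
	Critical bool // whether any check on the aliased target is critical
}

// rpc fakes the agent-to-server blocking query.
func rpc(req *nodeSpecificRequest) (*indexedHealthChecks, error) {
	time.Sleep(10 * time.Millisecond) // pretend to block until the index advances
	return &indexedHealthChecks{Index: req.MinQueryIndex + 1}, nil
}

// watchRemote clamps MinQueryIndex to at least 1 (index 0 returns
// immediately, which would be a 100% CPU loop), runs the blocking query,
// pushes the result into the alias check, and re-queries from the
// returned index.
func watchRemote(stopCh <-chan struct{}, update func(critical bool)) {
	args := &nodeSpecificRequest{Datacenter: "dc1", Node: "web-node-1"}
	for {
		select {
		case <-stopCh:
			return
		default:
		}

		if args.MinQueryIndex < 1 {
			args.MinQueryIndex = 1 // same guard as the hunk discussed earlier
		}
		out, err := rpc(args)
		if err != nil {
			continue // the real code backs off here (see the backoff thread)
		}
		update(out.Critical)
		args.MinQueryIndex = out.Index
	}
}

func main() {
	stop := make(chan struct{})
	go watchRemote(stop, func(bool) {})
	time.Sleep(50 * time.Millisecond)
	close(stop)
}
```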
Edge Cases
Alias Cycles
Alias cycles are fine, because the anti-entropy system won't trigger state changes if the health isn't changing. Since aliases mirror exactly, this can never result in an infinite loop. This applies to both local state watches and remote watches. So if you set up two mirroring aliases, they'll always behave correctly, although it's probably not what you want.