Add "Alias" Check Type #4320

Merged · merged 18 commits into from Jul 20, 2018

Conversation

@mitchellh (Contributor) commented Jun 30, 2018

Note on merge: If this is approved, this should probably wait until 1.3? It has no breaking changes so I'm not sure.

This PR adds a new check type "alias" that aliases the health status of another service instance.

The use case here is that the health of a service may in fact depend on the health of another service, the most basic example being a sidecar proxy. A sidecar proxy should not be healthy if the service it is proxying is unhealthy, because any connection to the sidecar will invariably fail when connecting to the upstream service instance.

An alias check can alias a service instance on any node. If a node isn't specified, it will alias the matching service on the current node.

Example

Aliasing a local "web" service:

{
  "check": {
    "name": "Aliasing a local web",
    "alias_service": "web"
  }
}

Aliasing a remote "web" service:

{
  "check": {
    "name": "Aliasing a remote web",
    "alias_node": "web-001",
    "alias_service": "web"
  }
}

Aliasing a remote node (no service):

{
  "check": {
    "name": "Aliasing a remote node, no service",
    "alias_node": "web-001"
  }
}

Behavior

I think and hope this behavior is all obvious, but to enumerate:

  • If the aliased service has any critical checks, then the alias is critical.
  • If the aliased service has no critical checks but has warning checks, then the alias is warning.
  • If the aliased service is only passing, then the alias is passing.
  • If the node of the aliased service is critical or warning, then the alias itself is critical or warning.
  • If the aliased service has no checks, it is considered passing.
  • If no aliased service is specified, but a node is specified, the alias only reflects the node state.
  • If no aliased node is specified, but a service is specified, the node defaults to the local node.
  • The output of the check mirrors the output of the aliased checks if they're failing.

Implementation

The alias check uses one of two methods:

  1. Blocking health query - If the alias is watching a remote node/service, then the check maintains a blocking query to the server to watch the health (a rough sketch follows this list).

  2. Local state - If the alias is watching a local service, then the check uses notifications from the local state and in-memory data to watch the health.
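
For method 1, here is a rough, hypothetical sketch of the remote watch loop. It assumes a minimal RPC interface and the Health.NodeChecks endpoint; the function and channel names are illustrative, not the PR's actual code:

package checks

import (
    "time"

    "github.com/hashicorp/consul/agent/structs"
)

// RPC is an assumed minimal interface for issuing RPCs to the servers.
type RPC interface {
    RPC(method string, args interface{}, reply interface{}) error
}

// watchRemote maintains a long-lived blocking query for the aliased
// node's health checks. MinQueryIndex advances with each result, so
// the RPC returns only when the watched data changes (or the server's
// wait time elapses).
func watchRemote(rpc RPC, node string, updateCh chan<- []*structs.HealthCheck, stopCh <-chan struct{}) {
    args := structs.NodeSpecificRequest{Node: node}
    args.MinQueryIndex = 1 // never watch index 0; guards against a busy loop
    for {
        select {
        case <-stopCh:
            return
        default:
        }

        var out structs.IndexedHealthChecks
        if err := rpc.RPC("Health.NodeChecks", &args, &out); err != nil {
            time.Sleep(time.Second) // the real code uses a capped backoff
            continue
        }
        args.MinQueryIndex = out.Index // resume the watch from this index
        updateCh <- out.HealthChecks
    }
}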

Edge Cases

Alias Cycles

Alias cycles are fine, because the anti-entropy system won't trigger state changes if the health isn't changing. Since aliases mirror exactly, this can never result in an infinite loop. This applies to both local state watches and remote watches. So if you set up two mirroring aliases, they'll always behave correctly, although it's probably not what you want.

@mitchellh requested review from banks and mkeeler, June 30, 2018 14:43
agent/agent.go Outdated

var rpcReq structs.NodeSpecificRequest
rpcReq.Datacenter = a.config.Datacenter
rpcReq.Token = a.tokens.AgentToken()
@mitchellh (Author):

Is this the correct behavior? Should we use the agent token if none is specified, or user token?

@banks (Member):

Hmm I think it actually depends on context.

For checks added during service registration, this should really use the token that the check/service was registered with as the least surprising option - if you register with a token that can read some peer service which is otherwise denied read to others it seems logical that you are implicitly granting the check you registered under that token the same access.

But that would be a new direction for checks so far I think - we'd have to somehow pass through or lookup the service registration token from local state here.

However, if this is a node-level check, we should prefer to use AgentToken as you have here since that is the one that is preferred over user token when registering the node.

These are all pretty big edge cases but we should think about them - it might actually make some sense to pass a token as an argument for the check in this case (internally at least) so that we can set it with the one that makes most sense in the context that is creating the check - e.g. during node registration or service registration etc.?

Needs more thought - maybe this is good enough for now, as the others are likely edge cases, but it could be surprising ACL behaviour. I think AgentToken makes most sense short of the more invasive changes mentioned above.

@mkeeler (Member), Jul 2, 2018:

Isn't this check pretty similar to a prepared query? For prepared queries we allow configuring the token in the query definition.

So maybe we should add a "Token" parameter to the check definition and use that if provided or the anonymous token otherwise.

@mitchellh (Author):

Checks actually already support a token used for anti-entropy for that check. That might satisfy the behavior @banks is expecting above. We've never passed that into a check implementation so it'd require a bit more plumbing (it's not on the struct), but that might be worth it.

So proposed logic:

  1. If token is specified on check, use that.
  2. Else, use the agent token if specified.
  3. Else, no token.

Thoughts?
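
A minimal sketch of that fallback order (checkQueryToken is a hypothetical helper for illustration, not an existing function):

// checkQueryToken picks the token an alias check would use for its
// health queries under the proposed logic. Hypothetical sketch only.
func checkQueryToken(checkToken, agentToken string) string {
    if checkToken != "" {
        return checkToken // 1. token specified on the check
    }
    if agentToken != "" {
        return agentToken // 2. agent token, if configured
    }
    return "" // 3. no token (anonymous)
}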

@banks (Member):

@mitchellh I think the least surprising thing would be if we default to copying the current scope token (from node/service registration) into the Check's token if it's not otherwise specified.

After that, I agree with your proposed logic being sane. My worry is that having to specify the same token in the service definition and each check body within that definition seems like bad UX, even if it is only needed in an edge case where the check doesn't just get the access it needs anonymously.

@mitchellh (Author), Jul 12, 2018:

I looked into this a bit further and this is the behavior with checks (with an asterisk, noted below):

  • For checks embedded in a service registration, the token used is the token specified with the service registration (directly in the definition). If not specified, the user token is used.
  • Note on the above: If a token is explicitly specified in the embedded check, it is ignored. I don't see this documented anywhere and it's unlikely anyone has tried this, but that's how it works.
  • For standalone checks, the token on the check definition is used, otherwise the agent token.

This aligns really nicely with what is proposed above and therefore is probably the right decision: 1) we should use the registration token, 2) else the agent token. That matches everything else.

I won't change the behavior of ignoring the embedded check token for service registration, since that seems designed that way.

// case is safer than a 100% CPU loop.
if args.MinQueryIndex < 1 {
args.MinQueryIndex = 1
}
@mitchellh (Author):

Learned my lesson (cc @banks). Even if this is theoretically not possible, I protect against it.

@banks (Member) left a comment:

This looks awesome @mitchellh!

I think overall it's pretty close to being able to land. I think we should talk about the ACL thing a bit more, and there are some minor typos/nits inline.

I would argue for not holding this until 1.3. It's not a breaking change and the fact that Connect will route traffic to unhealthy instances is a reasonably large wart that I hoped we'd squash in a rapid point release during beta rather than push out. This is a pre-requisite for fixing that.

)

// CheckAlias is a check type that aliases the health of another service
// instance. If the service aliased has any critical health checks, then
@banks (Member):

nit: should we say "aliases another service instance or whole node". That use case probably fell out of the implementation but it's valid so we might as well mention it.

@mitchellh (Author):

Yep, agreed. Updating.

Checks() map[types.CheckID]*structs.HealthCheck
}

// Start is used to start a check ttl, runs until Stop()
func (c *CheckAlias) Start() {
@banks (Member):

typo?

// NOTE(mitchellh): This currently returns ALL health checks for
// a node even though we also have the service ID. This can be
// optimized if we introduce a new RPC endpoint to filter both,
// but for blocking queries isn't that more efficient since the checks
@banks (Member):

typo:

isn't that much more efficient?

waitTime := (1 << (attempt - checkAliasBackoffMin)) * time.Second
if waitTime > checkAliasBackoffMaxWait {
waitTime = checkAliasBackoffMaxWait
}
@banks (Member):

I refactored a loop like this in connect code to actually bounds check the shift - if we retry more than 32 + checkAliasBackoffMin times here then the wait goes to zero and we go into a hot loop.

Note that the clamp on checkAliasBackoffMaxWait doesn't save us here, because waitTime will be zero at that point and so won't compare greater than the max.

For a worked example:

  • let attempts = 35, checkAliasBackoffMin = 3, checkAliasBackoffMaxWait = 1 minute
  • waitTime := (1 << (35 - 3)) * time.Second
  • waitTime := (1 << 32) * time.Second
  • waitTime := 0 (https://play.golang.org/p/xdeslj9CgUk)
  • busy loop from now on.

That does require about 30 mins of failures from servers but that doesn't seem super unlikely!

@mitchellh (Author):

Fixed by bounds checking the shift amount.
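
For illustration, a sketch of that kind of guard; the constant values and the exact clamp are assumptions, not necessarily the committed fix:

package checks

import "time"

const (
    checkAliasBackoffMin     = 3               // assumed: retries before backoff grows
    checkAliasBackoffMaxWait = 1 * time.Minute // assumed: upper bound on the wait
)

// backoffWait computes a capped exponential backoff, clamping the
// shift amount so a large attempt count can never shift the bit out
// of range and wrap the wait time to zero.
func backoffWait(attempt uint) time.Duration {
    if attempt < checkAliasBackoffMin {
        return 0 // no backoff for the first few attempts
    }
    shift := attempt - checkAliasBackoffMin
    if shift > 31 { // 1<<31 seconds already far exceeds the max wait
        shift = 31
    }
    waitTime := time.Duration(int64(1)<<shift) * time.Second
    if waitTime > checkAliasBackoffMaxWait {
        waitTime = checkAliasBackoffMaxWait
    }
    return waitTime
}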

// Do not block. All notify channels should be buffered to at
// least 1 in which case not-blocking does not result in loss
// of data because a failed send means a notification is
// already queued.
@banks (Member):

I spent a while describing a race condition with this logic since the state output isn't actually updated at this point, but then realised it's all moot since this whole method is holding the state lock.

Maybe we could add a note here stating that as a necessary condition for the invariant you claimed? If an alias could drain its notification chan and load the local check state before the status and output are updated below, then there is a possible race. It's only mitigated because they can't get a lock to load the state until we're done with this whole method.
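
For reference, the non-blocking send pattern being described, as a generic sketch (not the PR's exact code):

// notifyAll performs a non-blocking send to each registered channel.
// Every channel is buffered to at least 1, so a failed send means a
// notification is already queued and nothing is lost: the watcher
// still cannot read the state until this method releases the lock.
func notifyAll(notifyChs []chan struct{}) {
    for _, ch := range notifyChs {
        select {
        case ch <- struct{}{}: // queued a new notification
        default: // one is already pending; safe to drop
        }
    }
}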


// Start is used to start the check, runs until Stop()
func (c *CheckAlias) Start() {
c.stopLock.Lock()
@banks (Member):

Nothing to change here really, I just noticed that we have this pattern of signalling something to stop with a bool + mutex + channel. At some point (not necessarily in this PR) we should abstract this out into a struct to encapsulate this logic and stop rewriting it all over the place.

@mitchellh (Author):

Agreed. I just copy/pasted the other checks. When we have some more breathing room I was going to propose unifying both this as well as a check interface so that we can have less repetition elsewhere (in the agent and config parsers).

@mkeeler (Member), Jul 3, 2018:

I just whipped up a new type to mimic the Python threading.Event type (https://docs.python.org/2.7/library/threading.html#event-objects) with an extra function to get at the actual channel to wait on if asked.

https://gist.github.com/mkeeler/cb88cc762ca36733db0798ca80f1e73e

We could use that or do something else more sophisticated.

@mitchellh (Author):

Nice, so this is a bigger discussion for outside of this thread, but we've generally avoided abstracting "shutdown" channels because the shutdown behavior you want is generally different by type of thing. Abstracting shutdown for checks though (smaller scope with equivalent behavior) is reasonable to me, but we should talk about it more later when I'm back.

@mitchellh (Author):

@mkeeler @banks Fixed the review notes and also updated the website with docs for this check. Ready for review again and discussion of whether we want this in the 1.2.x series.

@banks (Member) left a comment:

Looks great!

Personally I'd vote for merging this now and releasing in 1.2.2. I realise it's a new feature but I don't think our versioning semantics forbid minor new features in patch releases and I can't think of any way it would be backwards incompatible.

My main motivation for landing it now is that it enables us to use this to fix the proxy discovery problem earlier, which feels like a decent sized paper cut in the current beta, and I'm keen to have interested people see progress on issues like that before "GA". It gives us more chance for feedback if it somehow doesn't work in edge cases too.

@mkeeler (Member) left a comment:

Just a couple of questions.

l.UpdateCheck(types.CheckID("c1"), api.HealthCritical, "")
select {
case <-notifyCh:
case <-time.After(100 * time.Millisecond):
@mkeeler (Member):

I don't think we need a time.After here or in any of the other selects. When the send to the channel in UpdateCheck finishes, the value is immediately available to the select.

@mitchellh (Author):

We need the time.After so we can fail if the channel didn't send. I'm specifically testing that the channel works. :)

@mkeeler (Member), Jul 18, 2018:

I know we need a case other than case <-notifyCh, but within the same goroutine the data will already be written into the channel and be available. The time.After case could just be a default case, as the data will either be available or not.
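
In other words, since the send is synchronous, the test can read the channel immediately (a sketch using the names from the snippet above; the failure message is illustrative):

l.UpdateCheck(types.CheckID("c1"), api.HealthCritical, "")
select {
case <-notifyCh:
    // the synchronous send already buffered the notification
default:
    t.Fatal("should have been notified")
}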

@mitchellh (Author):

Ah ah yes right. Thanks!

}

// Update change and verify we get notified
l.UpdateCheck(types.CheckID("c1"), api.HealthPassing, "")
@mkeeler (Member):

Maybe I am missing something, but why should we be notified here? Shouldn't the overall health of the service still be critical because check c2 is still critical?

@mkeeler (Member):

I think I figured it out. Check c2 is not part of the alias check and therefore we should be notified.

@mitchellh (Author):

Yeah, this is testing that the wrong check ID doesn't send an update.

@mitchellh (Author):

@mkeeler Addressed your feedback

case <-notifyCh:
t.Fatal("notify received")

case <-time.After(50 * time.Millisecond):
@mkeeler (Member):

change to default:

@mitchellh (Author):

I thought these would be better as a short timer to help check we don't receive a channel send (i.e. if the code was changed to async by accident). 50ms is not too long, so maybe it's not even worth it, but from a test perspective we want to check we don't receive a channel send.

Behaviorally TODAY this could be default, but I was wondering if, from a test perspective, it's worth loosening to try to catch issues? Or should we just do default?

@mkeeler (Member):

Putting the time.After here I think could provide a false sense of security. If we did change the function to be async, it's likely that 50 or 100ms wait times would not be enough (especially when heavily parallelized or running in environments like Travis). For the cases where we are checking that we don't receive a notification, it's likely it would always pass (even when it shouldn't). For the checks where we are expecting a notification, a 50-100ms wait could be too little and we would have periodic test failures. The only way to make this test bullet-proof would be to have an exceedingly long timeout, but the disadvantage is that it would make running this test take much longer, even though we do not currently need it to.

My preference is to make the tests as accurate and fast as possible to reduce the feedback loop time during development. In this case, making the assumption that the underlying function is not async seems reasonable. Also, if the function does become async, the first check expecting a notification should fail quickly and would immediately identify the need to change/rewrite the test to accommodate it being async.

@mitchellh (Author):

Sounds good. I’ll make the change when I get back to a computer!

@mitchellh (Author):

Done!

case <-notifyCh:
t.Fatal("notify received")

case <-time.After(50 * time.Millisecond):
@mkeeler (Member):

change to default:

@pearkes added this to the 1.2.2 milestone, Jul 19, 2018