Add "Alias" Check Type #4320

Merged · merged 18 commits into from Jul 20, 2018

Conversation

@mitchellh (Contributor) commented Jun 30, 2018

Note on merge: If this is approved, this should probably wait until 1.3? It has no breaking changes so I'm not sure.

This PR adds a new check type "alias" that aliases the health status of another service instance.

The use case here is that the health of a service may in fact depend on the health of another service, the most basic example being a sidecar proxy. A sidecar proxy should not be healthy if the service it is proxying is unhealthy, because any connection to the sidecar will invariably fail when connecting to the upstream service instance.

An alias check can alias a service instance on any node. If a node isn't specified, it will alias the matching service on the current node.

Example

Aliasing a local "web" service:

{
  "check": {
    "name": "Aliasing a local web",
    "alias_service": "web"
  }
}

Aliasing a remote "web" service:

{
  "check": {
    "name": "Aliasing a remote web",
    "alias_node": "web-001",
    "alias_service": "web"
  }
}

Aliasing a remote node (no service):

{
  "check": {
    "name": "Aliasing a remote node, no service",
    "alias_node": "web-001"
  }
}

Behavior

I think and hope this behavior is all obvious, but to enumerate:

  • If the aliased service has any critical checks, then the alias is critical.
  • If the aliased service has no critical checks but has warning checks, then the alias is warning.
  • If the aliased service is only passing, then the alias is passing.
  • If the node of the aliased service is critical or warning, then the alias itself is critical or warning.
  • If the aliased service has no checks, it is considered passing.
  • If no aliased service is specified, but a node is specified, the alias only reflects the node state.
  • If no aliased node is specified, but a service is specified, the node defaults to the local node.
  • The output of the check mirrors the output of the aliased checks if they're failing.

Implementation

The alias check uses one of two methods:

  1. Blocking health query - If the alias is watching a remote node/service, then the check maintains a blocking query to the server to watch the health (a rough sketch follows this list).

  2. Local state - If the alias is watching a local service, then the check uses notifications from the local state and in-memory data to watch the health.
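
For method 1, here is a rough, hypothetical sketch of the remote watch loop. It assumes a minimal RPC interface and the Health.NodeChecks endpoint; the function and channel names are illustrative, not the PR's actual code:

package checks

import (
    "time"

    "github.com/hashicorp/consul/agent/structs"
)

// RPC is an assumed minimal interface for issuing RPCs to the servers.
type RPC interface {
    RPC(method string, args interface{}, reply interface{}) error
}

// watchRemote maintains a long-lived blocking query for the aliased
// node's health checks. MinQueryIndex advances with each result, so
// the RPC returns only when the watched data changes (or the server's
// wait time elapses).
func watchRemote(rpc RPC, node string, updateCh chan<- []*structs.HealthCheck, stopCh <-chan struct{}) {
    args := structs.NodeSpecificRequest{Node: node}
    args.MinQueryIndex = 1 // never watch index 0; guards against a busy loop
    for {
        select {
        case <-stopCh:
            return
        default:
        }

        var out structs.IndexedHealthChecks
        if err := rpc.RPC("Health.NodeChecks", &args, &out); err != nil {
            time.Sleep(time.Second) // the real code uses a capped backoff
            continue
        }
        args.MinQueryIndex = out.Index // resume the watch from this index
        updateCh <- out.HealthChecks
    }
}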

Edge Cases

Alias Cycles

Alias cycles are fine, because the anti-entropy system won't trigger state changes if the health isn't changing. Since aliases mirror exactly, this can never result in an infinite loop. This applies to both local state watches and remote watches. So if you set up two mirroring aliases, they'll always behave correctly, although it's probably not what you want.

@mitchellh requested review from banks and mkeeler, June 30, 2018 14:43
agent/agent.go Outdated

var rpcReq structs.NodeSpecificRequest
rpcReq.Datacenter = a.config.Datacenter
rpcReq.Token = a.tokens.AgentToken()
@mitchellh (Author):

Is this the correct behavior? Should we use the agent token if none is specified, or user token?

@banks (Member):

Hmm I think it actually depends on context.

For checks added during service registration, this should really use the token that the check/service was registered with as the least surprising option - if you register with a token that can read some peer service which is otherwise denied read to others it seems logical that you are implicitly granting the check you registered under that token the same access.

But that would be a new direction for checks so far I think - we'd have to somehow pass through or lookup the service registration token from local state here.

However, if this is a node-level check, we should prefer to use AgentToken as you have here since that is the one that is preferred over user token when registering the node.

These are all pretty big edge cases but we should think about them - it might actually make some sense to pass a token as an argument for the check in this case (internally at least) so that we can set it with the one that makes most sense in the context that is creating the check - e.g. during node registration or service registration etc.?

Needs more thought - maybe this is good enough for now, as the others are likely edge cases, but it could be surprising ACL behaviour. I think AgentToken makes most sense short of the more invasive changes mentioned above.

@mkeeler (Member), Jul 2, 2018:

Isn't this check pretty similar to a prepared query? For prepared queries we allow configuring the token in the query definition.

So maybe we should add a "Token" parameter to the check definition and use that if provided or the anonymous token otherwise.

@mitchellh (Author):

Checks actually already support a token used for anti-entropy for that check. That might satisfy the behavior @banks is expecting above. We've never passed that into a check implementation so it'd require a bit more plumbing (it's not on the struct), but that might be worth it.

So proposed logic:

  1. If token is specified on check, use that.
  2. Else, use the agent token if specified.
  3. Else, no token.

Thoughts?
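
A minimal sketch of that fallback order (checkQueryToken is a hypothetical helper for illustration, not an existing function):

// checkQueryToken picks the token an alias check would use for its
// health queries under the proposed logic. Hypothetical sketch only.
func checkQueryToken(checkToken, agentToken string) string {
    if checkToken != "" {
        return checkToken // 1. token specified on the check
    }
    if agentToken != "" {
        return agentToken // 2. agent token, if configured
    }
    return "" // 3. no token (anonymous)
}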

@banks (Member):

@mitchellh I think the least surprising thing would be if we default to copying the current scope token (from node/service registration) into the Check's token if it's not otherwise specified.

After that, I agree with your proposed logic being sane. My worry is that having to specify the same token in the service definition and each check body within that definition seems like bad UX, even if it is only needed in an edge case where the check doesn't just get the access it needs anonymously.

@mitchellh (Author), Jul 12, 2018:

I looked into this a bit further and this is the behavior with checks (with an asterisk, noted below):

  • For checks embedded in a service registration, the token used is the token specified with the service registration (directly in the definition). If not specified, the user token is used.
  • Note on the above: If a token is explicitly specified in the embedded check, it is ignored. I don't see this documented anywhere and it's unlikely anyone has tried this, but that's how it works.
  • For standalone checks, the token on the check definition is used, otherwise the agent token.

This aligns really nicely with what is proposed above and therefore is probably the right decision: 1) we should use the registration token, 2) else the agent token. That matches everything else.

I won't change the behavior of ignoring the embedded check token for service registration, since that seems designed that way.

// case is safer than a 100% CPU loop.
if args.MinQueryIndex < 1 {
args.MinQueryIndex = 1
}
@mitchellh (Author):

Learned my lesson (cc @banks). Even if this is theoretically not possible, I protect against it.

@banks (Member) left a comment:

This looks awesome @mitchellh!

I think overall it's pretty close to being able to land. I think we should talk about the ACL thing a bit more, and there are some minor typos/nits inline.

I would argue for not holding this until 1.3. It's not a breaking change and the fact that Connect will route traffic to unhealthy instances is a reasonably large wart that I hoped we'd squash in a rapid point release during beta rather than push out. This is a pre-requisite for fixing that.

)

// CheckAlias is a check type that aliases the health of another service
// instance. If the service aliased has any critical health checks, then
@banks (Member):

nit: should we say "aliases another service instance or whole node". That use case probably fell out of the implementation but it's valid so we might as well mention it.

@mitchellh (Author):

Yep, agreed. Updating.

Checks() map[types.CheckID]*structs.HealthCheck
}

// Start is used to start a check ttl, runs until Stop()
func (c *CheckAlias) Start() {
@banks (Member):

typo?

// NOTE(mitchellh): This currently returns ALL health checks for
// a node even though we also have the service ID. This can be
// optimized if we introduce a new RPC endpoint to filter both,
// but for blocking queries isn't that more efficient since the checks
@banks (Member):

typo:

isn't that much more efficient?

waitTime := (1 << (attempt - checkAliasBackoffMin)) * time.Second
if waitTime > checkAliasBackoffMaxWait {
waitTime = checkAliasBackoffMaxWait
}
@banks (Member):

I refactored a loop like this in connect code to actually bounds check the shift - if we retry more than 32 + checkAliasBackoffMin times here then the wait goes to zero and we go into a hot loop.

Note that the clamp on checkAliasBackoffMaxWait doesn't save us here, because waitTime will be zero at that point and so won't compare greater than the max.

For a worked example:

  • let attempts = 35, checkAliasBackoffMin = 3, checkAliasBackoffMaxWait = 1 minute
  • waitTime := (1 << (35 - 3)) * time.Second
  • waitTime := (1 << 32) * time.Second
  • waitTime := 0 (https://play.golang.org/p/xdeslj9CgUk)
  • busy loop from now on.

That does require about 30 mins of failures from servers but that doesn't seem super unlikely!

@mitchellh (Author):

Fixed by bounds checking the shift amount.
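
For illustration, a sketch of that kind of guard; the constant values and the exact clamp are assumptions, not necessarily the committed fix:

package checks

import "time"

const (
    checkAliasBackoffMin     = 3               // assumed: retries before backoff grows
    checkAliasBackoffMaxWait = 1 * time.Minute // assumed: upper bound on the wait
)

// backoffWait computes a capped exponential backoff, clamping the
// shift amount so a large attempt count can never shift the bit out
// of range and wrap the wait time to zero.
func backoffWait(attempt uint) time.Duration {
    if attempt < checkAliasBackoffMin {
        return 0 // no backoff for the first few attempts
    }
    shift := attempt - checkAliasBackoffMin
    if shift > 31 { // 1<<31 seconds already far exceeds the max wait
        shift = 31
    }
    waitTime := time.Duration(int64(1)<<shift) * time.Second
    if waitTime > checkAliasBackoffMaxWait {
        waitTime = checkAliasBackoffMaxWait
    }
    return waitTime
}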

// Do not block. All notify channels should be buffered to at
// least 1 in which case not-blocking does not result in loss
// of data because a failed send means a notification is
// already queued.
@banks (Member):

I spent a while describing a race condition with this logic since the state output isn't actually updated at this point, but then realised it's all moot since this whole method is holding the state lock.

Maybe we could add a note here stating that as a necessary condition for the invariant you claimed? If an alias could drain its notification chan and load the local check state before the status and output are updated below, then there is a possible race. It's only mitigated because they can't get a lock to load the state until we're done with this whole method.
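
For reference, the non-blocking send pattern being described, as a generic sketch (not the PR's exact code):

// notifyAll performs a non-blocking send to each registered channel.
// Every channel is buffered to at least 1, so a failed send means a
// notification is already queued and nothing is lost: the watcher
// still cannot read the state until this method releases the lock.
func notifyAll(notifyChs []chan struct{}) {
    for _, ch := range notifyChs {
        select {
        case ch <- struct{}{}: // queued a new notification
        default: // one is already pending; safe to drop
        }
    }
}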


// Start is used to start the check, runs until Stop()
func (c *CheckAlias) Start() {
c.stopLock.Lock()
@banks (Member):

Nothing to change here really, I just noticed that we have this pattern of signalling something to stop with a bool + mutex + channel. At some point (not necessarily in this PR) we should abstract this out into a struct to encapsulate this logic and stop rewriting it all over the place.

@mitchellh (Author):

Agreed. I just copy/pasted the other checks. When we have some more breathing room I was going to propose unifying both this as well as a check interface so that we can have less repetition elsewhere (in the agent and config parsers).

@mkeeler (Member), Jul 3, 2018:

I just whipped up a new type to mimic the Python threading.Event type (https://docs.python.org/2.7/library/threading.html#event-objects) with an extra function to get at the actual channel to wait on if asked.

https://gist.github.com/mkeeler/cb88cc762ca36733db0798ca80f1e73e

We could use that or do something else more sophisticated.

@mitchellh (Author):

Nice, so this is a bigger discussion for outside of this thread, but we've generally avoided abstracting "shutdown" channels because the shutdown behavior you want is generally different by type of thing. Abstracting shutdown for checks though (smaller scope with equivalent behavior) is reasonable to me, but we should talk about it more later when I'm back.

@mitchellh (Author):

@mkeeler @banks Fixed the review notes and also updated the website with docs for this check. Ready for review again and discussion of whether we want this in the 1.2.x series.

@banks (Member) left a comment:

Looks great!

Personally I'd vote for merging this now and releasing in 1.2.2. I realise it's a new feature but I don't think our versioning semantics forbid minor new features in patch releases and I can't think of any way it would be backwards incompatible.

My main motivation for landing it now is that it enables us to use this to fix the proxy discovery problem earlier, which feels like a decent sized paper cut in the current beta, and I'm keen to have interested people see progress on issues like that before "GA". It gives us more chance for feedback if it somehow doesn't work in edge cases too.

@mkeeler (Member) left a comment:

Just a couple of questions.

l.UpdateCheck(types.CheckID("c1"), api.HealthCritical, "")
select {
case <-notifyCh:
case <-time.After(100 * time.Millisecond):
@mkeeler (Member):

I don't think we need a time.After here or in any of the other selects. When the send to the channel in UpdateCheck finishes, the value is immediately available to the select.

@mitchellh (Author):

We need the time.After so we can fail if the channel didn't send. I'm specifically testing that the channel works. :)

@mkeeler (Member), Jul 18, 2018:

I know we need a case other than case <-notifyCh, but within the same goroutine the data will already be written into the channel and be available. The time.After case could just be a default case, as the data will either be available or not.
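
In other words, since the send is synchronous, the test can read the channel immediately (a sketch using the names from the snippet above; the failure message is illustrative):

l.UpdateCheck(types.CheckID("c1"), api.HealthCritical, "")
select {
case <-notifyCh:
    // the synchronous send already buffered the notification
default:
    t.Fatal("should have been notified")
}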

@mitchellh (Author):

Ah ah yes right. Thanks!

}

// Update change and verify we get notified
l.UpdateCheck(types.CheckID("c1"), api.HealthPassing, "")
@mkeeler (Member):

Maybe I am missing something, but why should we be notified here? Shouldn't the overall health of the service still be critical because check c2 is still critical?

@mkeeler (Member):

I think I figured it out. Check c2 is not part of the alias check and therefore we should be notified.

@mitchellh (Author):

Yeah, this is testing that the wrong check ID doesn't send an update.

@mitchellh (Author):

@mkeeler Addressed your feedback

case <-notifyCh:
t.Fatal("notify received")

case <-time.After(50 * time.Millisecond):
@mkeeler (Member):

change to default:

@mitchellh (Author):

I thought these would be better as a short timer to help check we don't receive a channel send (i.e. if the code was changed to async by accident). 50ms is not too long, so maybe it's not even worth it, but from a test perspective we want to check we don't receive a channel send.

Behaviorally TODAY this could be default, but I was wondering if, from a test perspective, it's worth loosening to try to catch issues? Or should we just do default?

@mkeeler (Member):

Putting the time.After here I think could provide a false sense of security. If we did change the function to be async, it's likely that 50 or 100ms wait times would not be enough (especially when heavily parallelized or running in environments like Travis). For the cases where we are checking that we don't receive a notification, it's likely it would always pass (even when it shouldn't). For the checks where we are expecting a notification, a 50-100ms wait could be too little and we would have periodic test failures. The only way to make this test bullet-proof would be to have an exceedingly long timeout, but the disadvantage is that it would make running this test take much longer, even though we do not currently need it to.

My preference is to make the tests as accurate and fast as possible to reduce the feedback loop time during development. In this case, making the assumption that the underlying function is not async seems reasonable. Also, if the function does become async, the first check expecting a notification should fail quickly and would immediately identify the need to change/rewrite the test to accommodate it being async.

@mitchellh (Author):

Sounds good. I’ll make the change when I get back to a computer!

@mitchellh (Author):

Done!

case <-notifyCh:
t.Fatal("notify received")

case <-time.After(50 * time.Millisecond):
@mkeeler (Member):

change to default:

@pearkes added this to the 1.2.2 milestone, Jul 19, 2018