
l5d incorrectly reporting/caching endpoints after service is scaled to 0 #1635

Closed
jsurdilla opened this issue Sep 11, 2017 · 11 comments

@jsurdilla

Issue Type:

  • Bug report
  • Feature request

What happened:
l5d incorrectly reports the number of endpoints available after a service has been scaled all the way down to 0, and it stays in an invalid state even after scaling back up.

The linkerd logs report service failures, and the stale node IPs are not correctly removed:

|l5d-h5lwn::l5d              | E 0911 23:08:35.537 UTC THREAD28 TraceId:3656019fcecb838b: service failure: Failure(connection timed out: /10.44.5.204:9090 at remote address: /10.44.5.204:9090. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: /10.44.2.192:38684, Upstream Client Id: Not Available, Downstream Address: /10.44.5.204:9090, Downstream Client Id: %/io.l5d.k8s.localnode/10.44.5.200/#/io.l5d.k8s.ns/http/reservation-srv, Trace Id: 3656019fcecb838b.be266a254a63552e<:b60c58529334b85c
|l5d-2kdvb::l5d              | E 0911 23:08:35.536 UTC THREAD50 TraceId:d1893092c9eeb8e1: service failure: Failure(connection timed out: /10.44.2.190:9090 at remote address: /10.44.2.190:9090. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: /10.44.2.192:36334, Upstream Client Id: Not Available, Downstream Address: /10.44.2.190:9090, Downstream Client Id: %/io.l5d.k8s.localnode/10.44.2.192/#/io.l5d.k8s.ns/http/reservation-srv, Trace Id: d1893092c9eeb8e1.664bb4085c05d9f8<:7bdef9c6b046c20e

What you expected to happen:
l5d should correctly report new endpoints as available as a service's replica count goes up and down, regardless of the number of replicas.

How to reproduce it (as minimally and precisely as possible):

  1. Scale a service down to 0 replicas (at this point, the reported availability is already incorrect).
  2. Scale the service back up (still incorrect).

Anything else we need to know?:

  • Running l5d as a daemonset.
  • The dtab is correctly reporting the set of prod IPs, but the requests themselves are failing to resolve.

Environment:

  • linkerd/namerd version, config files: 1.2.0 config
  • Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes
  • Cloud provider or hardware configuration: GKE
@wmorgan
Member

wmorgan commented Sep 11, 2017

Thanks @jsurdilla! We'll dig into this.

@hawkw
Member

hawkw commented Sep 12, 2017

It seems possible to me that this issue is related to/shares the same root cause as #1626 ...

@Taik

Taik commented Sep 12, 2017

Thanks for linking to that issue @hawkw, that's helpful. It does sound like it; we'll run a test on the latest nightly and see if the issue still exists.

@hawkw
Member

hawkw commented Sep 12, 2017

Great, thanks @Taik! Please let me know if you can reproduce the issue discussed in #1626 as well.

@jsurdilla
Author

@hawkw, I'm able to reproduce this on the nightly build.

@hawkw
Member

hawkw commented Sep 12, 2017

@jsurdilla yeah, that is (unfortunately!) not too surprising...#1626 also still exists after that snapshot.

Just to double-check, though, are you using buoyantio/linkerd:1.2.0-SNAPSHOT or buoyantio/linkerd:nightly? The 1.2.0-SNAPSHOT tag contains patches that aren't on master (which the nightly tag is built from).

@jsurdilla
Author

@hawkw, tested it on both buoyantio/linkerd:1.2.0 and buoyantio/linkerd:nightly.

@hawkw
Member

hawkw commented Sep 12, 2017

@jsurdilla okay, thanks for confirming. I've done the same and determined that the SNAPSHOT image doesn't fix the problem. We're working on diagnosing what's going on here, stay tuned!

@jsurdilla
Author

Tried it quickly, can confirm that I'm able to reproduce on 1.2.0-SNAPSHOT. Thanks for looking into it!

@hawkw
Member

hawkw commented Sep 13, 2017

Hi @jsurdilla and @Taik, pull request #1638 should fix this issue. We'll try and get the fix out as soon as possible!

@jsurdilla
Author

That's amazing, looking forward to checking it out. Thank you!

hawkw added a commit that referenced this issue Sep 14, 2017
The way `stabilize()` is currently implemented means that when a service is scaled down and then scaled back up, or removed and then re-created, the namer returns a new `Var` with the new address of that service. However, when a client stack for the service is created, it's given the original `Var` and cached, so it doesn't receive the updated address.

I've changed `stabilize()` to update the previous `Var`, rather than creating a new one.

I've added tests in `EndpointsNamerTest` based on events fired by Kubernetes while I was trying to debug this issue: one test verifies that addresses are correctly updated after a scale-down/scale-up, and another verifies that references to the original `Var` are now properly updated.

Fixes #1626.
Fixes #1635.
Fixes #1636.
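For illustration, here is a minimal sketch of the idea behind that change. It is not the actual linkerd code; `ServiceAddrs`, `Addrs`, and `update` are hypothetical names, and it assumes `com.twitter.util.Var` from Twitter's util library. The point is that the namer keeps one long-lived, updatable `Var` per service and pushes new address sets into it, so client stacks that cached the original reference observe the change.

```scala
// Minimal sketch (hypothetical names, not the linkerd implementation),
// assuming com.twitter.util.Var from Twitter's util library.
import com.twitter.util.Var

object StabilizeSketch {
  // Stand-in for the set of endpoint addresses of a service.
  type Addrs = Set[String]

  final class ServiceAddrs {
    // One long-lived, updatable Var; client stacks cache a reference to it.
    private val underlying = Var[Addrs](Set.empty)

    // Read-only view handed out to callers.
    def addresses: Var[Addrs] = underlying

    // On a scale-down/scale-up (or delete/re-create) event, push the new
    // address set into the existing Var instead of returning a fresh one,
    // so cached observers see the change.
    def update(newAddrs: Addrs): Unit = underlying.update(newAddrs)
  }

  def main(args: Array[String]): Unit = {
    val svc    = new ServiceAddrs
    val cached = svc.addresses              // a client stack caches this Var
    svc.update(Set.empty)                   // service scaled down to 0
    svc.update(Set("10.44.5.204:9090"))     // service scaled back up
    println(Var.sample(cached))             // Set(10.44.5.204:9090)
  }
}
```

Before the fix, the scale-up path behaved more like allocating a brand-new `Var` and discarding the old one, so `cached` above would keep reporting the stale address set.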
Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this issue Dec 20, 2018
The `linkerd check` command was not validating whether data plane
proxies were successfully reporting metrics to Prometheus.

Introduce a new check that validates data plane proxies are found in
Prometheus. This is made possible via the existing `ListPods` endpoint
in the public API, which includes an `Added` field, indicating a pod's
metrics were found in Prometheus.

Fixes linkerd#1517

Signed-off-by: Andrew Seigner <siggy@buoyant.io>