
l5d incorrectly reporting/caching endpoints after service is scaled to 0 #1635

Closed
jsurdilla opened this issue Sep 11, 2017 · 11 comments

@jsurdilla

Issue Type:

  • Bug report
  • Feature request

What happened:
l5d incorrectly reports the number of endpoints available after a service has been scaled all the way down to 0, and it stays in an invalid state even after scaling back up.

The linkerd logs report service failures, and the stale node IPs are not correctly removed:

|l5d-h5lwn::l5d              | E 0911 23:08:35.537 UTC THREAD28 TraceId:3656019fcecb838b: service failure: Failure(connection timed out: /10.44.5.204:9090 at remote address: /10.44.5.204:9090. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: /10.44.2.192:38684, Upstream Client Id: Not Available, Downstream Address: /10.44.5.204:9090, Downstream Client Id: %/io.l5d.k8s.localnode/10.44.5.200/#/io.l5d.k8s.ns/http/reservation-srv, Trace Id: 3656019fcecb838b.be266a254a63552e<:b60c58529334b85c
|l5d-2kdvb::l5d              | E 0911 23:08:35.536 UTC THREAD50 TraceId:d1893092c9eeb8e1: service failure: Failure(connection timed out: /10.44.2.190:9090 at remote address: /10.44.2.190:9090. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: /10.44.2.192:36334, Upstream Client Id: Not Available, Downstream Address: /10.44.2.190:9090, Downstream Client Id: %/io.l5d.k8s.localnode/10.44.2.192/#/io.l5d.k8s.ns/http/reservation-srv, Trace Id: d1893092c9eeb8e1.664bb4085c05d9f8<:7bdef9c6b046c20e

What you expected to happen:
l5d should correctly report new endpoints as available as a service's replica count goes up and down, regardless of the number of replicas.

How to reproduce it (as minimally and precisely as possible):

  1. Scale a service down to 0 replicas (at this point, the reported availability is already incorrect).
  2. Scale the service back up (still incorrect).

Anything else we need to know?:

  • Running l5d as a daemonset.
  • The dtab is correctly reporting the set of prod IPs, but the requests themselves are failing to resolve.

Environment:

  • linkerd/namerd version, config files: 1.2.0 config
  • Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes
  • Cloud provider or hardware configuration: GKE
@wmorgan
Member

wmorgan commented Sep 11, 2017

Thanks @jsurdilla! We'll dig into this.

@hawkw
Member

hawkw commented Sep 12, 2017

It seems possible to me that this issue is related to/shares the same root cause as #1626 ...

@Taik

Taik commented Sep 12, 2017

Thanks for linking to that issue @hawkw, that's helpful. It does sound like it; we'll run a test on the latest nightly and see if the issue still exists.

@hawkw
Member

hawkw commented Sep 12, 2017

Great, thanks @Taik! Please let me know if you can reproduce the issue discussed in #1626 as well.

@jsurdilla
Author

@hawkw, I'm able to reproduce this on the nightly build.

@hawkw
Member

hawkw commented Sep 12, 2017

@jsurdilla yeah, that is (unfortunately!) not too surprising...#1626 also still exists after that snapshot.

Just to double-check, though, are you using buoyantio/linkerd:1.2.0-SNAPSHOT or buoyantio/linkerd:nightly? The 1.2.0-SNAPSHOT tag contains patches that aren't on master (which the nightly tag is built from).

@jsurdilla
Author

@hawkw, tested it on both buoyantio/linkerd:1.2.0 and buoyantio/linkerd:nightly.

@hawkw
Member

hawkw commented Sep 12, 2017

@jsurdilla okay, thanks for confirming. I've done the same and determined that the SNAPSHOT image doesn't fix the problem. We're working on diagnosing what's going on here, stay tuned!

@jsurdilla
Author

Tried it quickly, can confirm that I'm able to reproduce on 1.2.0-SNAPSHOT. Thanks for looking into it!

@hawkw
Member

hawkw commented Sep 13, 2017

Hi @jsurdilla and @Taik, pull request #1638 should fix this issue. We'll try and get the fix out as soon as possible!

@jsurdilla
Author

That's amazing, looking forward to checking it out. Thank you!

hawkw added a commit that referenced this issue Sep 14, 2017
The way `stabilize()` is currently implemented means that when a service is scaled down and then scaled back up, or removed and then re-created, the namer returns a new `Var` with the new address of that service. However, when a client stack for the service is created, it's given the original `Var` and cached, so it doesn't receive the updated address.

I've changed `stabilize()` to update the previous `Var`, rather than creating a new one.

I've added tests in `EndpointsNamerTest` based on events fired by Kubernetes while I was trying to debug this issue: one test verifies that addresses are correctly updated after a scale-down/scale-up, and another verifies that references to the original `Var` are now properly updated.

Fixes #1626.
Fixes #1635.
Fixes #1636.
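For illustration, here is a minimal sketch of the idea behind that change. It is not the actual linkerd code; `ServiceAddrs`, `Addrs`, and `update` are hypothetical names, and it assumes `com.twitter.util.Var` from Twitter's util library. The point is that the namer keeps one long-lived, updatable `Var` per service and pushes new address sets into it, so client stacks that cached the original reference observe the change.

```scala
// Minimal sketch (hypothetical names, not the linkerd implementation),
// assuming com.twitter.util.Var from Twitter's util library.
import com.twitter.util.Var

object StabilizeSketch {
  // Stand-in for the set of endpoint addresses of a service.
  type Addrs = Set[String]

  final class ServiceAddrs {
    // One long-lived, updatable Var; client stacks cache a reference to it.
    private val underlying = Var[Addrs](Set.empty)

    // Read-only view handed out to callers.
    def addresses: Var[Addrs] = underlying

    // On a scale-down/scale-up (or delete/re-create) event, push the new
    // address set into the existing Var instead of returning a fresh one,
    // so cached observers see the change.
    def update(newAddrs: Addrs): Unit = underlying.update(newAddrs)
  }

  def main(args: Array[String]): Unit = {
    val svc    = new ServiceAddrs
    val cached = svc.addresses              // a client stack caches this Var
    svc.update(Set.empty)                   // service scaled down to 0
    svc.update(Set("10.44.5.204:9090"))     // service scaled back up
    println(Var.sample(cached))             // Set(10.44.5.204:9090)
  }
}
```

Before the fix, the scale-up path behaved more like allocating a brand-new `Var` and discarding the old one, so `cached` above would keep reporting the stale address set.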
Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this issue Dec 20, 2018
The `linkerd check` command was not validating whether data plane
proxies were successfully reporting metrics to Prometheus.

Introduce a new check that validates data plane proxies are found in
Prometheus. This is made possible via the existing `ListPods` endpoint
in the public API, which includes an `Added` field, indicating a pod's
metrics were found in Prometheus.

Fixes linkerd#1517

Signed-off-by: Andrew Seigner <siggy@buoyant.io>