
Namerd not updated automatically #1669

Closed
jsenon opened this issue Oct 11, 2017 · 18 comments

jsenon commented Oct 11, 2017

Hello,

We are using Namerd for dynamic dtabs, but when we update a route, the change is not taken into account unless we restart the Namerd pod.

Kubernetes version:

NAME                                           STATUS    AGE       VERSION   EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION
ip-xxx.eu-central-1.compute.internal   Ready     7d        v1.7.6    <none>        Debian GNU/Linux 8 (jessie)   4.4.78-k8s
ip-xxx.eu-central-1.compute.internal   Ready     15d       v1.7.6    <none>        Debian GNU/Linux 8 (jessie)   4.4.78-k8s

Deployed over AWS.

Linkerd Configuration:

    # Custom HTTP Ingress Controller listening on port 80
    # Accessible Externaly
    - protocol: http
      label: http-ingress
      servers:
        - port: 80
          ip: 0.0.0.0
          clearContext: true
      interpreter:
        kind: io.l5d.namerd
        dst: /#/io.l5d.k8s/linkerd/4100/namerd
        namespace: external
      identifier:
        kind: io.l5d.path
        segments: 1
        consume: true

Namerd Configuration:

    admin:
      ip: 0.0.0.0
      port: 9991

    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001

    storage:
      kind: io.l5d.k8s
      host: localhost
      port: 8001
      namespace: linkerd

    interfaces:
    - kind: io.l5d.thriftNameInterpreter
      ip: 0.0.0.0
      port: 4100
    - kind: io.l5d.httpController
      ip: 0.0.0.0
      port: 4180

Namerd Service:

    apiVersion: v1
    kind: Service
    metadata:
      name: namerd
      namespace: linkerd
    spec:
      selector:
        app: namerd
      type: LoadBalancer
      ports:
      - name: thrift
        port: 4100
      - name: http
        port: 4180
      - name: admin
        port: 9991

Linkerd Metrics:

metrics.txt
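For context, this is the kind of dtab update that was not being picked up. A minimal sketch of the request against Namerd's `io.l5d.httpController` interface (port 4180 in the config above); the `PUT /api/1/dtabs/:namespace` path and `application/dtab` content type are from Namerd's HTTP API, while the host name and the example dtab are assumptions for this cluster. The request is only built here, not sent:

```python
# Hypothetical helper: build a PUT request that replaces the dtab stored
# under a given namerd namespace via the io.l5d.httpController interface.
import urllib.request

def build_dtab_update(namespace, dtab, host="namerd.linkerd:4180"):
    """Build (but do not send) a dtab-update request for `namespace`."""
    return urllib.request.Request(
        url=f"http://{host}/api/1/dtabs/{namespace}",
        data=dtab.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/dtab"},
    )

# Example: rewrite /svc to the io.l5d.k8s namer in the `external` namespace.
req = build_dtab_update("external", "/svc => /#/io.l5d.k8s/test/http;")
```

Sending `req` with `urllib.request.urlopen(req)` (or the equivalent `curl -X PUT`) is the update that, per this issue, only took effect after a Namerd restart.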

@klingerf
Member

Hey @jsenon, thanks for reporting this -- we'll take a look and get back to you. Would you mind also adding Namerd's metrics payload if you have it?

@jsenon
Author

jsenon commented Oct 11, 2017

I don't have the historical payload. Would current Namerd metrics or a Heapster screenshot help you?

@klingerf
Member

@jsenon That's no problem. We can look into reproducing with the info you added in the description. In the meantime, how hard would it be for you to switch to namerd's io.l5d.mesh interface? That interface is newer and it might not have the same correctness problem, so I think it would be worth a shot.
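For anyone following along, switching to the mesh interface is a config change on both sides. A hedged sketch based on the configs in the issue description: 4321 is the conventional mesh port, and the `mesh` port name is an assumption that would also need to be added to the namerd Service:

```yaml
# namerd config -- add the gRPC mesh interface alongside (or instead of)
# the thrift interface:
interfaces:
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4321

# linkerd config -- switch the router's interpreter to io.l5d.mesh;
# `root` selects the dtab namespace, matching `namespace: external` above:
interpreter:
  kind: io.l5d.mesh
  dst: /#/io.l5d.k8s/linkerd/mesh/namerd
  root: /external
```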

@jsenon
Author

jsenon commented Oct 12, 2017

@klingerf Unfortunately the environment is frozen for a demo. I will check on another cluster.

@klingerf
Member

I was able to reproduce this behavior in one of our Kubernetes test environments, running Kubernetes 1.6.10 and Linkerd/Namerd 1.3.0. It appears that Namerd fails to successfully re-establish watches after it encounters a 410 "too old resource version" error from the watch API. In the namerd logs, I see:

D 1012 20:59:08.359 UTC THREAD33 TraceId:08acbf261bc25b5e: json read eoc
D 1012 20:59:08.365 UTC THREAD33 TraceId:08acbf261bc25b5e: json reading chunk of 8192 bytes
D 1012 20:59:08.365 UTC THREAD33 TraceId:08acbf261bc25b5e: json read chunk: {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 65030480 (65047701)","reason":"Gone","code":410}}

D 1012 20:59:08.365 UTC THREAD33 TraceId:08acbf261bc25b5e: json chunk reading: [179, 180] 

D 1012 20:59:08.366 UTC THREAD33 TraceId:08acbf261bc25b5e: json chunk read: [0, 179] {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 65030480 (65047701)","reason":"Gone","code":410}} EndpointsError(Status(Some(Status),Some(v1),Some(ObjectMeta(None,None,None,None,None,None,None,None,None,None,None)),Some(Failure),Some(too old resource version: 65030480 (65047701)),Some(Gone),None,Some(410)))
D 1012 20:59:08.366 UTC THREAD33 TraceId:08acbf261bc25b5e: k8s returned 'too old resource version' error with incorrect HTTP status code, restarting watch
D 1012 20:59:08.366 UTC THREAD33 TraceId:08acbf261bc25b5e: k8s restarting watch on /api/v1/watch/namespaces/test/endpoints/hello1, resource version Some(65030480) was too old
D 1012 21:05:07.058 UTC THREAD37 TraceId:20a8f25b7c3e6bd5: json read eoc
D 1012 21:05:07.164 UTC THREAD37 TraceId:20a8f25b7c3e6bd5: json reading chunk of 8192 bytes

And after that message is printed, I see no further updates from /api/v1/watch/namespaces/test/endpoints/hello1. If I restart all of the pods in the hello1 service, service success rate drops to 0 once the pods have been fully rolled. If I restart namerd, service success rate returns to 100%.

We'll need to investigate why the watch is not successfully re-established after encountering the "too old resource version" error. This relates to #1649.
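To make the expected recovery concrete, here is an illustrative sketch (in Python, not Namerd's actual Scala code) of what a Kubernetes watch client needs to do: when the server reports the cached resourceVersion as too old (code 410, reason "Gone"), it must re-list to obtain a fresh resourceVersion and restart the watch from there, rather than silently abandoning the stream. The function names and event shapes below are simplified assumptions:

```python
# Sketch of watch recovery after a 410 "too old resource version" error.
# list_fn() -> (objects, resourceVersion); watch_fn(rv) yields watch events.

def run_watch(list_fn, watch_fn, max_restarts=5):
    objects, rv = list_fn()  # initial LIST gives a starting resourceVersion
    restarts = 0
    while True:
        restarted = False
        for event in watch_fn(rv):
            obj = event["object"]
            if event["type"] == "ERROR" and obj.get("code") == 410:
                # Stale version: a fresh LIST is required before re-watching.
                if restarts >= max_restarts:
                    raise RuntimeError("watch kept falling behind")
                objects, rv = list_fn()
                restarts += 1
                restarted = True
                break
            objects = apply_event(objects, event)
            rv = obj["metadata"]["resourceVersion"]
        if not restarted:
            return objects  # watch stream ended normally

def apply_event(objects, event):
    """Fold one ADDED/MODIFIED/DELETED event into the local cache."""
    name = event["object"]["metadata"]["name"]
    if event["type"] == "DELETED":
        return {k: v for k, v in objects.items() if k != name}
    return {**objects, name: event["object"]}
```

The bug described above corresponds to the 410 branch being taken but no new watch actually being established afterwards, so the endpoint cache goes stale forever.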

@jsenon
Author

jsenon commented Oct 12, 2017

@klingerf thanks for the investigation and for reproducing the error on your side.

@Taik

Taik commented Oct 16, 2017

Just piggybacking this issue. We're seeing the same behavior after deploying linkerd/namerd 1.3.0 to our staging env. Thanks for looking into it @klingerf.

@activeshadow

@klingerf any update on when this might get fixed?

@klingerf
Member

Hey @activeshadow, we are actively working on it. We have a repro but not a fix yet; hopefully we'll have one soon. Will update here when we do, at which point we can cut a bugfix release.

@wmorgan
Member

wmorgan commented Oct 18, 2017

Update: we have identified a possible root cause and are working to verify. This is a high priority issue.

@activeshadow

activeshadow commented Oct 19, 2017 via email

@klingerf
Member

Hey @jsenon, @activeshadow, @Taik -- we just merged a fix for this issue and are working on cutting a release candidate now. Can update in an hour or two once it's available, and then we'd love to have you verify it's fixed in your environment.

@klingerf
Member

Ok, please give buoyantio/linkerd:1.3.1-rc1 / buoyantio/namerd:1.3.1-rc1 a shot to see if they fix your issue, thanks!

@jsenon
Author

jsenon commented Oct 19, 2017

@klingerf sure, I'll test it tomorrow and give you quick feedback, thanks.

@activeshadow

@klingerf so far the fix looks to be working! I can restart pods, causing them to get a new endpoint IP, and linkerd picks it up and continues to send them data. w00t!

Thanks!

@jsenon
Author

jsenon commented Oct 19, 2017

Same for me: a dtab update made via curl against Namerd is now automatically taken into account. Thanks a lot for the quick fix, @klingerf, @wmorgan, and all.

@Taik

Taik commented Oct 20, 2017

@klingerf Looks good for me so far. I'll let it bake in our staging for a bit to see how things go.

Thanks for the quick turnaround!

@klingerf
Member

Ok, thanks for all of the feedback folks! We will get this officially released as part of 1.3.1 next week.
