
Namerd not updated automatically #1669

Closed
jsenon opened this issue Oct 11, 2017 · 18 comments

jsenon commented Oct 11, 2017

Hello,

We are using Namerd for dynamic dtabs, but when we update a route, the change is not taken into account unless we restart the Namerd pod.

Kubernetes version:

NAME                                           STATUS    AGE       VERSION   EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION
ip-xxx.eu-central-1.compute.internal   Ready     7d        v1.7.6    <none>        Debian GNU/Linux 8 (jessie)   4.4.78-k8s
ip-xxx.eu-central-1.compute.internal   Ready     15d       v1.7.6    <none>        Debian GNU/Linux 8 (jessie)   4.4.78-k8s

Deployed over AWS.

Linkerd Configuration:

    # Custom HTTP Ingress Controller listening on port 80
    # Accessible Externaly
    - protocol: http
      label: http-ingress
      servers:
        - port: 80
          ip: 0.0.0.0
          clearContext: true
      interpreter:
        kind: io.l5d.namerd
        dst: /#/io.l5d.k8s/linkerd/4100/namerd
        namespace: external
      identifier:
        kind: io.l5d.path
        segments: 1
        consume: true

Namerd Configuration:

    admin:
      ip: 0.0.0.0
      port: 9991

    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001

    storage:
      kind: io.l5d.k8s
      host: localhost
      port: 8001
      namespace: linkerd

    interfaces:
    - kind: io.l5d.thriftNameInterpreter
      ip: 0.0.0.0
      port: 4100
    - kind: io.l5d.httpController
      ip: 0.0.0.0
      port: 4180

Namerd Service:

    apiVersion: v1
    kind: Service
    metadata:
      name: namerd
      namespace: linkerd
    spec:
      selector:
        app: namerd
      type: LoadBalancer
      ports:
      - name: thrift
        port: 4100
      - name: http
        port: 4180
      - name: admin
        port: 9991

Linkerd Metrics:

metrics.txt
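For context, this is the kind of dtab update that was not being picked up. A minimal sketch of the request against Namerd's `io.l5d.httpController` interface (port 4180 in the config above); the `PUT /api/1/dtabs/:namespace` path and `application/dtab` content type are from Namerd's HTTP API, while the host name and the example dtab are assumptions for this cluster. The request is only built here, not sent:

```python
# Hypothetical helper: build a PUT request that replaces the dtab stored
# under a given namerd namespace via the io.l5d.httpController interface.
import urllib.request

def build_dtab_update(namespace, dtab, host="namerd.linkerd:4180"):
    """Build (but do not send) a dtab-update request for `namespace`."""
    return urllib.request.Request(
        url=f"http://{host}/api/1/dtabs/{namespace}",
        data=dtab.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/dtab"},
    )

# Example: rewrite /svc to the io.l5d.k8s namer in the `external` namespace.
req = build_dtab_update("external", "/svc => /#/io.l5d.k8s/test/http;")
```

Sending `req` with `urllib.request.urlopen(req)` (or the equivalent `curl -X PUT`) is the update that, per this issue, only took effect after a Namerd restart.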

@klingerf
Member

Hey @jsenon, thanks for reporting this -- we'll take a look and get back to you. Would you mind also adding Namerd's metrics payload if you have it?

@jsenon
Author

jsenon commented Oct 11, 2017

I don't have the historical payload. Would current Namerd metrics or a Heapster screenshot help you?

@klingerf
Member

@jsenon That's no problem. We can look into reproducing with the info you added in the description. In the meantime, how hard would it be for you to switch to namerd's io.l5d.mesh interface? That interface is newer and it might not have the same correctness problem, so I think it would be worth a shot.
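For anyone following along, switching to the mesh interface is a config change on both sides. A hedged sketch based on the configs in the issue description: 4321 is the conventional mesh port, and the `mesh` port name is an assumption that would also need to be added to the namerd Service:

```yaml
# namerd config -- add the gRPC mesh interface alongside (or instead of)
# the thrift interface:
interfaces:
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4321

# linkerd config -- switch the router's interpreter to io.l5d.mesh;
# `root` selects the dtab namespace, matching `namespace: external` above:
interpreter:
  kind: io.l5d.mesh
  dst: /#/io.l5d.k8s/linkerd/mesh/namerd
  root: /external
```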

@jsenon
Author

jsenon commented Oct 12, 2017

@klingerf Unfortunately the environment is frozen for a demo. I will check on another cluster.

@klingerf
Member

I was able to reproduce this behavior in one of our Kubernetes test environments, running Kubernetes 1.6.10 and Linkerd/Namerd 1.3.0. It appears that Namerd fails to successfully re-establish watches after it encounters a 410 "too old resource version" error from the watch API. In the namerd logs, I see:

D 1012 20:59:08.359 UTC THREAD33 TraceId:08acbf261bc25b5e: json read eoc
D 1012 20:59:08.365 UTC THREAD33 TraceId:08acbf261bc25b5e: json reading chunk of 8192 bytes
D 1012 20:59:08.365 UTC THREAD33 TraceId:08acbf261bc25b5e: json read chunk: {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 65030480 (65047701)","reason":"Gone","code":410}}

D 1012 20:59:08.365 UTC THREAD33 TraceId:08acbf261bc25b5e: json chunk reading: [179, 180] 

D 1012 20:59:08.366 UTC THREAD33 TraceId:08acbf261bc25b5e: json chunk read: [0, 179] {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 65030480 (65047701)","reason":"Gone","code":410}} EndpointsError(Status(Some(Status),Some(v1),Some(ObjectMeta(None,None,None,None,None,None,None,None,None,None,None)),Some(Failure),Some(too old resource version: 65030480 (65047701)),Some(Gone),None,Some(410)))
D 1012 20:59:08.366 UTC THREAD33 TraceId:08acbf261bc25b5e: k8s returned 'too old resource version' error with incorrect HTTP status code, restarting watch
D 1012 20:59:08.366 UTC THREAD33 TraceId:08acbf261bc25b5e: k8s restarting watch on /api/v1/watch/namespaces/test/endpoints/hello1, resource version Some(65030480) was too old
D 1012 21:05:07.058 UTC THREAD37 TraceId:20a8f25b7c3e6bd5: json read eoc
D 1012 21:05:07.164 UTC THREAD37 TraceId:20a8f25b7c3e6bd5: json reading chunk of 8192 bytes

And after that message is printed, I see no further updates from /api/v1/watch/namespaces/test/endpoints/hello1. If I restart all of the pods in the hello1 service, service success rate drops to 0 once the pods have been fully rolled. If I restart namerd, service success rate returns to 100%.

We'll need to investigate why the watch is not successfully re-established after encountering the "too old resource version" error. This relates to #1649.
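To make the expected recovery concrete, here is an illustrative sketch (in Python, not Namerd's actual Scala code) of what a Kubernetes watch client needs to do: when the server reports the cached resourceVersion as too old (code 410, reason "Gone"), it must re-list to obtain a fresh resourceVersion and restart the watch from there, rather than silently abandoning the stream. The function names and event shapes below are simplified assumptions:

```python
# Sketch of watch recovery after a 410 "too old resource version" error.
# list_fn() -> (objects, resourceVersion); watch_fn(rv) yields watch events.

def run_watch(list_fn, watch_fn, max_restarts=5):
    objects, rv = list_fn()  # initial LIST gives a starting resourceVersion
    restarts = 0
    while True:
        restarted = False
        for event in watch_fn(rv):
            obj = event["object"]
            if event["type"] == "ERROR" and obj.get("code") == 410:
                # Stale version: a fresh LIST is required before re-watching.
                if restarts >= max_restarts:
                    raise RuntimeError("watch kept falling behind")
                objects, rv = list_fn()
                restarts += 1
                restarted = True
                break
            objects = apply_event(objects, event)
            rv = obj["metadata"]["resourceVersion"]
        if not restarted:
            return objects  # watch stream ended normally

def apply_event(objects, event):
    """Fold one ADDED/MODIFIED/DELETED event into the local cache."""
    name = event["object"]["metadata"]["name"]
    if event["type"] == "DELETED":
        return {k: v for k, v in objects.items() if k != name}
    return {**objects, name: event["object"]}
```

The bug described above corresponds to the 410 branch being taken but no new watch actually being established afterwards, so the endpoint cache goes stale forever.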

@jsenon
Author

jsenon commented Oct 12, 2017

@klingerf thanks for the investigation and for reproducing the error on your side.

@Taik

Taik commented Oct 16, 2017

Just piggybacking this issue. We're seeing the same behavior after deploying linkerd/namerd 1.3.0 to our staging env. Thanks for looking into it @klingerf.

@activeshadow

@klingerf any update on when this might get fixed?

@klingerf
Member

Hey @activeshadow, we are actively working on it. We have a repro but not a fix yet; hopefully we'll have one soon. Will update here when we do, at which point we can cut a bugfix release.

@wmorgan
Member

wmorgan commented Oct 18, 2017

Update: we have identified a possible root cause and are working to verify. This is a high priority issue.

@activeshadow

activeshadow commented Oct 19, 2017 via email

@klingerf
Member

Hey @jsenon, @activeshadow, @Taik -- we just merged a fix for this issue and are working on cutting a release candidate now. Can update in an hour or two once it's available, and then we'd love to have you verify it's fixed in your environment.

@klingerf
Member

Ok, please give buoyantio/linkerd:1.3.1-rc1 / buoyantio/namerd:1.3.1-rc1 a shot to see if they fix your issue, thanks!

@jsenon
Author

jsenon commented Oct 19, 2017

@klingerf sure, I'll test it tomorrow and give you quick feedback, thanks.

@activeshadow

@klingerf so far the fix looks to be working! I can restart pods, causing them to get a new endpoint IP, and linkerd picks it up and continues to send them data. w00t!

Thanks!

@jsenon
Author

jsenon commented Oct 19, 2017

Same for me: a dtab update made via curl against Namerd is now automatically taken into account. Thanks a lot for the quick fix, @klingerf, @wmorgan, and all.

@Taik

Taik commented Oct 20, 2017

@klingerf Looks good for me so far. I'll let it bake in our staging for a bit to see how things go.

Thanks for the quick turnaround!

@klingerf
Member

Ok, thanks for all of the feedback folks! We will get this officially released as part of 1.3.1 next week.
