Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not observe changes to resolved names while building delegation tree #1763

Closed

Conversation

koiuo
Copy link
Contributor

@koiuo koiuo commented Dec 22, 2017

At NCBI we have very complicated dtabs. For a single name resolution consul appears 16 times in leaves of delegation tree, only 1 of 15 lookups in consul usually gives us expected result, the rest are expected to fail.

With such setup calls to delegator.json sometimes were causing namerd to start doing hundreeds to thousands of requests per second to consul and in some (unfortunately, not fully identified) cases, this state of namerd could even last for at least several days (until restarted).

This fixes the issue for us on our setup.

--- commit message
Delegator.json tries to walk through all branches of delegation tree.
Any resolved name is then observed for changes.
On large complicated dtabs this slows down delegation tree building
and in some cases may even cause StackOverflowError while unfolding
resulting structure of Activities and Vars.

This change makes Delegator.json to use only the first lookup result.

@dadjeibaah
Copy link
Contributor

Thanks for submitting the PR @edio we are taking a look at the changes you provided.

@adleong
Copy link
Member

adleong commented Dec 23, 2017

Thanks for this, @edio. This actually relates pretty closely to #1731. The pickFirst method you implement here is essentially just toFuture but coerced to an Activity. If the Activity always just uses the first value, then the delegate method should really have a return type of Future instead of Activity.

This points to a larger change. If we truly always only care about the initial value of a delegation (and I think this is probably the case) then we should change the signature of delegate in the Delegator trait to return a Future instead of an Activity. This allows us to replace the use of pickFirst with toFuture.

This also allows us to change the namerd delegation thrift method to be a simple unary request instead of a long-poll. In turn, this allows us to get rid of the delegation observation cache that is causing us trouble in #1731.

Of course, this is a much larger sweeping change and also a breaking API change to the namerd API so we should tread carefully. But I think going in this direction is much preferable to the isolated change here that makes the delegate implementation incongruous with it's type signature.

Let me know what you think.

@koiuo koiuo force-pushed the bugfix/delegator_json_observation branch from c1c497c to c43858a Compare December 23, 2017 14:22
Delegator.json tries to walk through all branches of delegation tree.
Any resolved name is then observed for changes.
On large complicated dtabs this slows down delegation tree building
and in some cases may even cause StackOverflowError while unfolding
resulting structure of Activities and Vars.

This change makes Delegator.json to use only initial delegation result.
@koiuo koiuo force-pushed the bugfix/delegator_json_observation branch from c43858a to 21b2a83 Compare December 23, 2017 14:24
@koiuo
Copy link
Contributor Author

koiuo commented Dec 23, 2017

@adleong thanks for explaining big picture and for pointint out conceptual difference between Activity and Future (this now obvious thing didn't occur to me).

I do see, that Delegator signature change will cause a lot of changes across codebase and more important, will change behavior of other parts of namerd.

And this doesn't look like something, that can be included in the nearest bugfix release. And the issue is real for us (although, I do understand, that without STR this is not very convincing).

Do you think we could implement some part of this big change asap? Namely, we'll try to return Future up to the point, where it conflicts with traits signatures. This, I hope, will be aligned with the bigger picture, while still being isolated.

Please, see the update to the code.

If you're not willing to include this, would reproduce scenario of the issue we have convince you? (i would really like to avoid that, because it appears to be much more effortful, than the fix itself, but if there are no other options... :))

@adleong
Copy link
Member

adleong commented Dec 26, 2017

Thanks, @edio. I will discuss this with the team in the new year once folks are back at work.

@adleong
Copy link
Member

adleong commented Jan 4, 2018

Hi @edio, we may have an alternate solution for you. Take a look at this PR: #1768 which adds a /client_state.json admin endpoint that returns the current address set for each client.
This should be more performant than delegator.json because it doesn't spin up a new Activity and it should also be more correct because it uses the existing Activity/watch rather than starting a new one.

Please take a look and let me know if that satisfies your requirements.

@koiuo
Copy link
Contributor Author

koiuo commented Jan 4, 2018

I guess we could block delegator.json for users and expose that new endpoint instead which would partially solve the issue for us.

@olix0r
Copy link
Member

olix0r commented Jan 4, 2018

@edio Can you share some more context around the use of delegator.json? What exactly are clients trying to do? Is this just to avoid the two calls to bind & addr?

@koiuo
Copy link
Contributor Author

koiuo commented Jan 4, 2018

@olix0r, oh sorry for not making it clear. It's only delegator UI, so we can check (mostly during troubleshooting), how a name is bound and resolved.

@adleong
Copy link
Member

adleong commented Jan 11, 2018

Hi @edio, sorry for the delay in getting back to you about this. After some internal discussion I think the best path forward is:

Change the signature of the delegate method on the Delegator trait to return a Future instead of an Activity. Add a new thrift method to the namerd thrift interface that returns just the current delegation and have the thrift interpreter client use that for the delegate implementation. Add a new grpc method to the namerd mesh interface that returns just the current delegation and have the mesh interpreter client use that for the delegate implementation. I believe there is already a non-streaming version of the delegate http endpoint that the http interpreter client can use.

This is certainly a larger change that what you have proposed here, but I think it is more correct and will be easier to maintain in the long run. If you'd like to take a crack at implementing my suggestion, I'd be happy to help out however I can. Otherwise, we'll likely have time to do this sometime next week.

Let me know what you think!

@koiuo
Copy link
Contributor Author

koiuo commented Jan 12, 2018

@adleong I'd be willing to take this opportunity to dig into code deeper, but I can do this not earlier than the next week too. And probably even only the week after the next one. So feel free to work on that if you have time.

I'll post here, when I'm ready to start work, and if it appears that you haven't started yet, I will work on the PR.

@adleong
Copy link
Member

adleong commented Jan 30, 2018

Hey @edio, I'm going to close this PR but I'm looking forward to working together with you on the solution we discussed above.

@adleong adleong closed this Jan 30, 2018
@ashald
Copy link
Member

ashald commented Mar 2, 2018

@adleong @olix0r so far this issue is the most frequent one bringing Namerd down in our environment. As far as I understand, you proposed some sort of a fundamental solution that requires some work to be implemented. Unfortunately at the moment @edio don't have availability to contribute that so I wonder if you can reconsider accepting this PR as is as as simple remedy to the issue or maybe we can ask you to implement a proper fix in the next version? Thanks

@adleong
Copy link
Member

adleong commented Mar 2, 2018

Thanks for reminding me about this! If you folks don't have the bandwidth to do this, I'll make sure we allocate time for it as part of 1.3.7. Does that work for you?

@ashald
Copy link
Member

ashald commented Mar 2, 2018

Would be amazing, thanks a lot!

Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this pull request Dec 20, 2018
…inkerd#1763)

Appending proxy-init to the end of the list ensures that it won't
interfere with other init containers from accessing the network,
before the proxy container is created.

This resolves bug linkerd#1760

Signed-off-by: ihcsim <ihcsim@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants