Circuit Breaker support? #2846

Open
benley opened this issue May 23, 2019 · 11 comments
Comments


@benley benley commented May 23, 2019

Feature Request

Linkerd 1.x and Istio (and various other service meshes) have documented methods of configuring Circuit Breakers.

It looks like linkerd 2 currently doesn't quite do the same thing, or at least it isn't documented clearly.

I found another issue inquiring about circuit breaking in this repo that's since been closed: #1255

@olix0r explained on slack:

currently we "circuit break" on tcp errors -- meaning that we won't try communicating with nodes (from the request's point of view) that cannot connect

but now that we have classification via service profiles, it's not conceptually hard to add a circuit breaking layer that keys off that

when that original issue was opened, we didn't have any classification

So: might linkerd2 get some sort of Circuit Breaker functionality soon?

Author

@benley benley commented May 23, 2019

To expand a bit on what I mean by Circuit Breakers, in the context of my team at work:

What we have right now is an in-process library that observes how long some block of code (usually representing an external network request) takes to finish, and aborts quickly ("trips the breaker") when the weighted average latency rises above some configurable threshold. While the breaker is tripped, it records aborted requests as having 0 latency in order to bring the weighted average back down until it's below the abort threshold, at which point the breaker is un-tripped and the external requests can resume.
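For illustration, here is a minimal sketch of that behaviour. It is not our actual library: the names `LatencyBreaker` and `BreakerOpen`, the `alpha` smoothing factor, and the use of an exponentially weighted average are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BreakerOpen(Exception):
    """Raised when a call is aborted because the breaker is tripped."""


class LatencyBreaker:
    """Trips when a weighted average of call latency exceeds a threshold.

    Hypothetical sketch: the weighting scheme (EWMA) is an assumption.
    """

    def __init__(self, threshold_s: float, alpha: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_s = threshold_s  # abort threshold, in seconds
        self.alpha = alpha              # weight given to the newest sample
        self.clock = clock              # injectable so tests can fake time
        self.avg_s = 0.0                # weighted average latency so far
        self.tripped = False

    def _record(self, latency_s: float) -> None:
        # Fold the new sample into the average, then re-evaluate the breaker.
        self.avg_s = self.alpha * latency_s + (1 - self.alpha) * self.avg_s
        self.tripped = self.avg_s > self.threshold_s

    def call(self, fn: Callable[[], T]) -> T:
        if self.tripped:
            # Aborted requests are recorded as 0 latency, which pulls the
            # average back down until the breaker un-trips.
            self._record(0.0)
            raise BreakerOpen("average latency above threshold")
        start = self.clock()
        try:
            return fn()
        finally:
            # Record latency whether the call succeeded or raised.
            self._record(self.clock() - start)
```

Wrapping an external request would look like `breaker.call(lambda: do_request())`, where `do_request` stands in for whatever the application calls. All of the state lives on the instance, so each worker process ends up with its own independent copy.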

This works decently, except that every process in a many-worker app without shared memory (e.g. Python gunicorn) has to discover upstream outages independently since they don't have any shared state among them. In some cases worker processes are restarted quite frequently, and all circuit breaker status is lost with each restart.

So, we are hoping to either (a) concoct a shared-state implementation of this and keep it in the application processes, or (b) rely on an external proxy implementation like linkerd to do it.

@olix0r olix0r added the priority/P1 label May 28, 2019
@olix0r olix0r added this to To do in 2.5 - Release via automation May 28, 2019
@olix0r olix0r self-assigned this Jun 10, 2019
@admc admc removed this from To do in 2.5 - Release Jun 11, 2019

@stale stale bot commented Sep 8, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 8, 2019
@wmorgan wmorgan added pinned and removed wontfix labels Sep 9, 2019
Member

@wmorgan wmorgan commented Sep 9, 2019

Keeping this ticket open. For those watching, we've done some preliminary design work on this feature and learned some good things.


@tomsanbear tomsanbear commented Jan 16, 2020

@wmorgan would you be able to expand on the design work/investigation?
I'd be interested in helping contribute if there was some more information/interest in moving forward with this.

Echoing @benley's statement above: a distributed implementation of what hystrix/resilience4j (at least in the Java world) provide is very intriguing.

Contributor

@grampelberg grampelberg commented Jan 16, 2020

@tomsanbear @adleong looked into some of the details previously and can probably give you a data dump there.

If you're interested in contributing, we've got a lightweight process to go through. Not everything is documented yet as we're still getting it set up. Happy to walk you through what's required if you're interested! Jump into #contributors on slack and we can start going through the details =)


@jensoncs jensoncs commented Aug 30, 2021

Are we planning to prioritize the circuit-breaking functionality? Is there any option in Linkerd to limit the number of requests and connections at the proxy level, or what is the right way to go about this?

@olix0r olix0r added this to the stable-2.12.0 milestone Aug 31, 2021
Member

@olix0r olix0r commented Aug 31, 2021

@jensoncs Richer client-side policies are planned for stable-2.12.0


@mailmahee mailmahee commented Oct 28, 2021

Glad this topic was discussed today. It looks like a write-up with diagrams is available here:
https://github.com/Ashish-Bansal/rfc/blob/circuit-breaking/design/0005-circuit-breaking.md

I'm not sure what the current state is; it seems like a good topic for a design doc or blog post.


@sherifkayad sherifkayad commented Feb 2, 2022

Are there any plans for working on that topic? I see that it has been almost 6 months since there has been an update.


@andrew-waters andrew-waters commented Feb 2, 2022

@sherifkayad there was a blog post at the turn of the year that mentioned it in the upcoming roadmap so I'd expect to see this implemented in the future


@sherifkayad sherifkayad commented Feb 2, 2022

@andrew-waters amazing! Keeping an eye out for that.

@adleong adleong modified the milestones: stable-2.12.0, stable-2.13.0 Jul 7, 2022
@adleong adleong added the priority/P0 label Jul 7, 2022

10 participants