Circuit Breaker support? #2846

Closed
benley opened this issue May 23, 2019 · 17 comments

@benley

benley commented May 23, 2019

Feature Request

Linkerd 1.x and Istio (and various other service meshes) have documented methods of configuring Circuit Breakers:

It looks like linkerd 2 currently doesn't quite do the same thing, or at least it isn't documented clearly.

I found another issue inquiring about circuit breaking in this repo that's since been closed: #1255

@olix0r explained on slack:

currently we "circuit break" on tcp errors -- meaning that we won't try communicating with nodes (from the request's point of view) that cannot connect

but now that we have classification via service profiles, it's not conceptually hard to add a circuit breaking layer that keys off that

when that original issue was opened, we didn't have any classification

So: might linkerd2 get some sort of Circuit Breaker functionality soon?

@benley
Author

benley commented May 23, 2019

To expand a bit on what I mean by Circuit Breakers, in the context of my team at work:

What we have right now is an in-process library that observes how long some block of code (usually representing an external network request) takes to finish, and aborts quickly ("trips the breaker") when the weighted average latency rises above a configurable threshold. While the breaker is tripped, it records aborted requests as having 0 latency in order to bring the weighted average back down. Once the average falls below the abort threshold, the breaker is un-tripped and the external requests can resume.
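For readers who want something concrete, here is a minimal sketch of the breaker described above. It is not the actual library: the type name `LatencyBreaker`, the methods `allow` and `record`, and the use of an exponentially weighted moving average with weight `alpha` are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical latency-based breaker. The names and the EWMA weighting
/// are assumptions for illustration, not the library described above.
#[derive(Debug)]
pub struct LatencyBreaker {
    /// Weight given to each new sample, in (0, 1].
    alpha: f64,
    /// Average latency (in seconds) above which requests are aborted.
    threshold_secs: f64,
    /// Exponentially weighted moving average of latency, in seconds.
    avg_secs: f64,
}

impl LatencyBreaker {
    pub fn new(alpha: f64, threshold: Duration) -> Self {
        Self {
            alpha,
            threshold_secs: threshold.as_secs_f64(),
            avg_secs: 0.0,
        }
    }

    /// True while the weighted average is above the threshold.
    pub fn is_tripped(&self) -> bool {
        self.avg_secs > self.threshold_secs
    }

    /// Folds one observed latency into the weighted average.
    pub fn record(&mut self, latency: Duration) {
        self.avg_secs =
            self.alpha * latency.as_secs_f64() + (1.0 - self.alpha) * self.avg_secs;
    }

    /// Call before each request. While tripped, the request is aborted
    /// (returns false) and counted as a zero-latency sample, so the
    /// average decays until it drops back below the threshold.
    pub fn allow(&mut self) -> bool {
        if self.is_tripped() {
            self.record(Duration::ZERO);
            return false;
        }
        true
    }
}
```

A caller would check `allow()` before the external request and, if it returns true, time the request and pass the elapsed duration to `record`. Because the state lives inside one struct in one process, this sketch has the same limitation as the original: each worker has to discover an outage on its own.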

This works decently, except that every process in a many-worker app without shared memory (e.g. Python gunicorn) has to discover upstream outages independently since they don't have any shared state among them. In some cases worker processes are restarted quite frequently, and all circuit breaker status is lost with each restart.

So, we are hoping to either (a) concoct a shared-state implementation of this and keep it in the application processes, or (b) rely on an external proxy implementation like linkerd to do it.

@olix0r olix0r added the priority/P1 Planned for Release label May 28, 2019
@olix0r olix0r self-assigned this Jun 10, 2019
@stale

stale bot commented Sep 8, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 8, 2019
@wmorgan wmorgan added pinned and removed wontfix labels Sep 9, 2019
@wmorgan
Member

wmorgan commented Sep 9, 2019

Keeping this ticket open. For those watching, we've done some preliminary design work on this feature and learned some good things.

@tomsanbear

@wmorgan would you be able to expand on the design work/investigation?
I'd be interested in contributing if there is more information on, or interest in, moving forward with this.

Echoing @benley's statement above: a distributed implementation of what hystrix/resilience4j provide (at least in the Java world) is very intriguing.

@grampelberg
Contributor

@tomsanbear @adleong looked into some of the details previously and can probably give you a data dump there.

If you're interested in contributing, we've got a lightweight process to go through. Not everything is documented yet as we're still getting it set up. Happy to walk you through what's required if you're interested! Jump into #contributors on slack and we can start going through the details =)

@jensoncs

Are we planning to prioritize the circuit-breaking functionality? Is there any option in linkerd to limit the number of requests and connections at the proxy level, or what is the right way to go about this?

@olix0r olix0r added this to the stable-2.12.0 milestone Aug 31, 2021
@olix0r
Member

olix0r commented Aug 31, 2021

@jensoncs Richer client-side policies are planned for stable-2.12.0

@mailmahee

Glad this topic was discussed today. It looks like a write-up and diagrams are available here:
https://github.com/Ashish-Bansal/rfc/blob/circuit-breaking/design/0005-circuit-breaking.md

Not sure what the current state is, but this seems like a good topic for a design doc or blog post.

@sherifkayad

Are there any plans for working on this topic? I see that it has been almost 6 months since the last update.

@andrew-waters

@sherifkayad there was a blog post at the turn of the year that mentioned it in the upcoming roadmap, so I'd expect to see this implemented in the future.

@sherifkayad

@andrew-waters amazing! keeping an eye for that

@adleong adleong modified the milestones: stable-2.12.0, stable-2.13.0 Jul 7, 2022
@adleong adleong added the priority/P0 Release Blocker label Jul 7, 2022
@jon-depop

are there any updates on this @adleong?

@adleong
Member

adleong commented Jan 3, 2023

Hi @jon-depop. The groundwork to support client policy (such as circuit breakers) in the proxy is currently in progress and you can follow along at https://github.com/linkerd/linkerd2-proxy/pulls.

@srinath-panda

@adleong any updates on this ?

@kleimkuhler
Contributor

We're still working towards support for this in 2.13. Unfortunately we are not referencing this issue too much in related PRs in linkerd2 and linkerd2-proxy, but you can follow along with PRs in those repositories if you are interested.

@hawkw
Contributor

hawkw commented Apr 6, 2023

Hi folks, I'm very excited to let you all know that yesterday, we released edge-23.4.1, a release candidate for Linkerd 2.13, which features initial support for request-level HTTP circuit breaking.

This circuit breaking is configured by adding annotations to Services that describe the failure accrual policy clients should use when communicating with that Service. We're still working on documentation for how to configure circuit breaking in Linkerd 2.13, but in the meantime, if you're interested in trying it out on the edge release, you can find the annotations in the source code, here:

fn parse_accrual_config(
    annotations: &std::collections::BTreeMap<String, String>,
) -> Result<Option<FailureAccrual>> {
    annotations
        .get("balancer.linkerd.io/failure-accrual")
        .map(|mode| {
            if mode == "consecutive" {
                let max_failures = annotations
                    .get("balancer.linkerd.io/failure-accrual-consecutive-max-failures")
                    .map(|s| s.parse::<u32>())
                    .transpose()?
                    .unwrap_or(7);
                let max_penalty = annotations
                    .get("balancer.linkerd.io/failure-accrual-consecutive-max-penalty")
                    .map(|s| parse_duration(s))
                    .transpose()?
                    .unwrap_or_else(|| time::Duration::from_secs(60));
                let min_penalty = annotations
                    .get("balancer.linkerd.io/failure-accrual-consecutive-min-penalty")
                    .map(|s| parse_duration(s))
                    .transpose()?
                    .unwrap_or_else(|| time::Duration::from_secs(1));
                let jitter = annotations
                    .get("balancer.linkerd.io/failure-accrual-consecutive-jitter-ratio")
                    .map(|s| s.parse::<f32>())
                    .transpose()?
                    .unwrap_or(0.5);
                if min_penalty > max_penalty {
                    bail!(
                        "min_penalty ({min_penalty:?}) cannot exceed max_penalty ({max_penalty:?})",
                    );
                }
                Ok(FailureAccrual::Consecutive {
                    max_failures,
                    backoff: Backoff {
                        min_penalty,
                        max_penalty,
                        jitter,
                    },
                })
            } else {
                bail!("unsupported failure accrual mode: {mode}");
            }
        })
        .transpose()
}
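To give a feel for what a "consecutive" failure accrual policy could mean at runtime, here is a rough sketch built on the parameters parsed above (`max_failures`, `min_penalty`, `max_penalty`). It is not the proxy's implementation: the `ConsecutiveFailures` type, its `record` method, and the doubling of the penalty on each repeated trip are assumptions for illustration, and the jitter ratio is left out to keep the example deterministic.

```rust
use std::time::Duration;

/// Mirrors the min/max penalty fields of the parsed config above.
/// Jitter is omitted in this sketch.
#[derive(Clone, Copy, Debug)]
pub struct Backoff {
    pub min_penalty: Duration,
    pub max_penalty: Duration,
}

/// Hypothetical consecutive-failures breaker (an assumption, not the
/// linkerd2-proxy implementation).
#[derive(Debug)]
pub struct ConsecutiveFailures {
    max_failures: u32,
    backoff: Backoff,
    /// Failures seen since the last success.
    consecutive: u32,
    /// Trips since the last success; drives the exponential backoff.
    trips: u32,
}

impl ConsecutiveFailures {
    pub fn new(max_failures: u32, backoff: Backoff) -> Self {
        Self { max_failures, backoff, consecutive: 0, trips: 0 }
    }

    /// Records one response. Returns `Some(penalty)` when the breaker
    /// trips, i.e. how long the endpoint should be avoided before a
    /// probe request is allowed. A success resets all state.
    pub fn record(&mut self, success: bool) -> Option<Duration> {
        if success {
            self.consecutive = 0;
            self.trips = 0;
            return None;
        }
        self.consecutive = self.consecutive.saturating_add(1);
        if self.consecutive < self.max_failures {
            return None;
        }
        // Assumed backoff: start at min_penalty and double on every
        // repeated trip, never exceeding max_penalty.
        let factor = 2u32.saturating_pow(self.trips);
        let penalty = self
            .backoff
            .min_penalty
            .saturating_mul(factor)
            .min(self.backoff.max_penalty);
        self.trips = self.trips.saturating_add(1);
        Some(penalty)
    }
}
```

With the defaults shown in the parser (7 failures, 1 second minimum penalty, 60 second maximum penalty), this sketch would trip on the seventh failure in a row and then lengthen the penalty each time a probe fails, until a success resets it.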

@risingspiral
Contributor

Introduced in https://github.com/linkerd/linkerd2/releases/tag/edge-23.4.1

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 11, 2023