Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PinUntilErrorChannel doesn't switch on 429 #661

Merged
merged 6 commits into from
Apr 17, 2020

Conversation

iamdanfox
Copy link
Contributor

Before this PR

In PDS-117063, a user of our internal atlas-replacement switched from c-j-r -> dialogue and saw server errors.

Looking at the pinuntilerror.nextNode metric, it seemed we switched channels 5 times during a supposedly transactional workflow. This meant that some requests landed on one node and others landed on a different node, which caused the second node to return a hard error.

cc @LucasIME and @jkozlowski

After this PR

==COMMIT_MSG==
PinUntilErrorChannel doesn't switch on 429, to unblock transactional workflows
==COMMIT_MSG==

Possible downsides?

@changelog-app
Copy link

changelog-app bot commented Apr 17, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

PinUntilErrorChannel doesn't switch on 429, to unblock transactional workflows

Check the box to generate changelog(s)

  • Generate changelog entry

@policy-bot policy-bot bot requested a review from fawind April 17, 2020 11:40
@iamdanfox iamdanfox requested review from carterkozak and ferozco and removed request for fawind April 17, 2020 11:40
@@ -17,7 +17,7 @@
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=89.9% client_mean=PT3.8310808S server_cpu=PT1H55M21.66S client_received=2500/2500 server_resps=2500 codes={200=2248, 500=252}
live_reloading[UNLIMITED_ROUND_ROBIN].txt: success=60.2% client_mean=PT2.84698S server_cpu=PT1H58M37.45S client_received=2500/2500 server_resps=2500 codes={200=1504, 500=996}
one_big_spike[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=79.0% client_mean=PT1.478050977S server_cpu=PT1M59.71393673S client_received=1000/1000 server_resps=790 codes={200=790, Failed to make a request=210}
one_big_spike[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT1.286733552S server_cpu=PT2M48.75S client_received=1000/1000 server_resps=1125 codes={200=1000}
one_big_spike[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT1.135007332S server_cpu=PT2M49.65S client_received=1000/1000 server_resps=1131 codes={200=1000}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As expected, the graph shows one big spike of requests (i.e. this exact use case) remains pinned to one uri. https://github.com/palantir/dialogue/blob/dfox/pin-until-error-fix/simulation/src/test/resources/report.md#one_big_spikeconcurrency_limiter_pin_until_error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to update this simulation to respond 429 instead of 503

Copy link
Contributor Author

@iamdanfox iamdanfox Apr 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intended to be representative of this exact workflow, so it responds 429 above some threshold:

 public void one_big_spike() {
        int capacity = 100;
        servers = servers(
                SimulationServer.builder()
                        .serverName("node1")
                        .simulation(simulation)
                        .handler(h -> h.respond200UntilCapacity(429, capacity).responseTime(Duration.ofMillis(150)))
                        .build(),
                SimulationServer.builder()
                        .serverName("node2")
                        .simulation(simulation)
                        .handler(h -> h.respond200UntilCapacity(429, capacity).responseTime(Duration.ofMillis(150)))
                        .build());

@bulldozer-bot bulldozer-bot bot merged commit e6ec9b2 into develop Apr 17, 2020
@bulldozer-bot bulldozer-bot bot deleted the dfox/pin-until-error-fix branch April 17, 2020 11:57
@svc-autorelease
Copy link
Collaborator

Released 1.23.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants