
Better behaviour in the presence of 429s #786

Merged: 30 commits merged into remoting on Sep 13, 2018

Conversation

j-baker (Contributor) commented Aug 2, 2018

If a client sends requests at a rate of 40 per second, but the server
demands only 20 per second by sending back 429s, the default behaviour
should be that the client's load is smoothed down to 20 per second.

Right now this does not work: we use exponential backoff with a maximum
number of retries, and each call backs off independently, so each
individual request is as likely to fail as any other.

What this means is that given sustained attempted load of 40 per second
and a limit of 20 requests per second, eventually some of the calls
will fail.

You can see this most easily right now with an endpoint that takes 1ms
to return and is rate limited to 1 request per second: calls will start
failing within a few seconds.

This PR attempts to fix the issue. It uses Netflix's concurrency-limits
library to apply a TCP-like strategy to request concurrency; basically,
provided a single request per second won't be limited into oblivion,
you can make progress.

I also include a test that will fail immediately unless this concurrency
limiting code exists.

At the moment, the scope is per-endpoint - I don't know how to do
better. We could later on define 'limiting domains' which get somehow
annotated by the service author according to rate limit groupings.

Note that we'd have to modify our internal retrofit codegen to properly
add the path template header.

There is scope to improve this later - particularly to do this on the
server too, and prioritise interactive over batch traffic.

The main downside of this code is that you have to make sure you always
clean up - otherwise you have a resource leak. To mitigate this, there is
code which detects a leaked permit, cleans it up, and logs when it does so.
This should avoid the IOException deadlock pain we've seen in the past.

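As a rough illustration of the acquire/release cycle described above, here is a minimal sketch against the Limiter API from Netflix's concurrency-limits library. The wrapper class and the sendRequest() call are hypothetical, and this is not the code in this PR.

import com.netflix.concurrency.limits.Limiter;
import java.io.IOException;
import java.util.Optional;
import okhttp3.Response;

final class LimitedCallSketch {
    private final Limiter<Void> limiter;  // built elsewhere from the concurrency-limits library

    LimitedCallSketch(Limiter<Void> limiter) {
        this.limiter = limiter;
    }

    Optional<Response> tryCall() throws IOException {
        // Ask for a concurrency permit; an empty Optional means we are over the current limit.
        Optional<Limiter.Listener> maybeListener = limiter.acquire(null);
        if (!maybeListener.isPresent()) {
            return Optional.empty();  // caller should back off and try again later
        }
        Limiter.Listener listener = maybeListener.get();
        try {
            Response response = sendRequest();  // hypothetical: perform the actual HTTP call
            if (response.code() == 429 || response.code() == 503) {
                listener.onDropped();   // server is shedding load: shrink the limit, TCP-style
            } else {
                listener.onSuccess();   // a successful round trip lets the limit grow
            }
            return Optional.of(response);
        } catch (IOException e) {
            listener.onIgnore();        // don't let unrelated I/O failures skew the limit
            throw e;
        }
    }

    private Response sendRequest() throws IOException {
        throw new UnsupportedOperationException("placeholder for the real HTTP call");
    }
}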
@j-baker j-baker requested a review from a team as a code owner August 2, 2018 02:45
j-baker (Contributor, Author) commented Aug 2, 2018

Note - testing is insufficient right now - this is to get validation on the approach, though I would like to get this merged soon :)

@j-baker j-baker changed the title from "Better behaviour in the presence of backing off" to "Better behaviour in the presence of 429s" on Aug 2, 2018
@@ -87,7 +87,7 @@ public static ClientConfiguration of(
.enableGcmCipherSuites(DEFAULT_ENABLE_GCM_CIPHERS)
.proxy(ProxySelector.getDefault())
.proxyCredentials(Optional.empty())
.maxNumRetries(uris.size())
.maxNumRetries(DEFAULT_MAX_NUM_RETRIES)

Contributor:

uhm, I thought we had merged such a change already?

Contributor Author:

yes, but we did it in only one of the two places (see above in this class)


public AsyncLimiter limiter(Request request) {
final String limiterKey;
String pathTemplate = request.header(OkhttpTraceInterceptor.PATH_TEMPLATE_HEADER);
Contributor:

this is a bit dodgy

return limiter(limiterKey);
}

static final class AsyncLimiter {
Contributor:

this construction is pretty complicated. minimally needs some docs to explain what's going on.

private final Queue<SettableFuture<Limiter.Listener>> waitingRequests = new LinkedBlockingQueue<>();
private final Limiter<Void> limiter;

public AsyncLimiter(
Contributor:

why do we need to make the limiter asynchronous? from https://github.com/Netflix/concurrency-limits/blob/master/concurrency-limits-core/src/main/java/com/netflix/concurrency/limits/Limiter.java it seems like Limiter#acquire returns immediately?

Contributor:

oh i see, nvm.

if (head == null) {
acquired.onIgnore();
} else {
head.set(acquired);
Contributor:

does this construction satisfy some basic fairness properties, i.e., every request will get scheduled/acquired eventually?

Contributor Author:

yes, via the FIFO queue

return;
}
Limiter.Listener acquired = maybeAcquired.get();
SettableFuture<Limiter.Listener> head = waitingRequests.poll();
Contributor:

would prefer to make the multi-threadedness here a little easier to understand by calling poll only

SettableFuture<Limiter.Listener> head;
// Note that different threads may be executing processQueue; this is safe because ...
while ((head = waitingRequests.poll()) != null) {
  ..
}

Contributor Author:

ah but if you do that, you're not guaranteed to have any permits to give them

Contributor:

i see. then add a comment, please

Contributor Author:

done
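
For readers following along, here is a hedged reconstruction of the pattern being discussed, pieced together from the snippets quoted in this thread rather than copied from the final diff: a permit is acquired first, and the FIFO queue of waiters is polled only once a permit is in hand, so a waiter is never dequeued without something to give it.

import com.google.common.util.concurrent.SettableFuture;
import com.netflix.concurrency.limits.Limiter;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

final class AsyncLimiterSketch {
    private final Queue<SettableFuture<Limiter.Listener>> waitingRequests = new LinkedBlockingQueue<>();
    private final Limiter<Void> limiter;

    AsyncLimiterSketch(Limiter<Void> limiter) {
        this.limiter = limiter;
    }

    // Several threads may run this concurrently; each iteration first secures a permit and
    // only then dequeues a waiter, so the FIFO order provides the fairness discussed above.
    void processQueue() {
        while (!waitingRequests.isEmpty()) {
            Optional<Limiter.Listener> maybeAcquired = limiter.acquire(null);
            if (!maybeAcquired.isPresent()) {
                return;  // no permit available right now; waiters stay queued
            }
            Limiter.Listener acquired = maybeAcquired.get();
            SettableFuture<Limiter.Listener> head = waitingRequests.poll();
            if (head == null) {
                acquired.onIgnore();  // another thread drained the queue; release the spare permit
            } else {
                head.set(acquired);   // hand the permit to the longest-waiting request
            }
        }
    }
}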

}

@Override
public void onResponse(Call call, Response response) throws IOException {
// Relay successful responses
if (response.code() / 100 <= 2) {
callback.onResponse(call, response);
listener.onSuccess();
Contributor:

the construction here is very brittle because you need to trace that the Listener is released (onSuccess, onDropped) on all code paths.

Contributor Author:

yeah, i'm gonna clean that up
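
One possible shape for that clean-up, sketched here as an assumption rather than what was eventually merged: wrap the Listener so that whichever code path fires first releases the permit exactly once.

import com.netflix.concurrency.limits.Limiter;
import java.util.concurrent.atomic.AtomicBoolean;

final class OnceOnlyListener implements Limiter.Listener {
    private final Limiter.Listener delegate;
    private final AtomicBoolean released = new AtomicBoolean(false);

    OnceOnlyListener(Limiter.Listener delegate) {
        this.delegate = delegate;
    }

    @Override
    public void onSuccess() {
        if (released.compareAndSet(false, true)) {  // only the first release path wins
            delegate.onSuccess();
        }
    }

    @Override
    public void onIgnore() {
        if (released.compareAndSet(false, true)) {
            delegate.onIgnore();
        }
    }

    @Override
    public void onDropped() {
        if (released.compareAndSet(false, true)) {
            delegate.onDropped();
        }
    }
}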

@@ -236,14 +272,18 @@ public Void visit(QosException.Throttle exception) {
}

Duration backoff = exception.getRetryAfter().orElse(nonAdvertizedBackoff.get());
Contributor:

difficult to grok what the interplay is between the Limiter and the backoffStrategy. they seem duplicative.

Contributor Author:

and yet, they are not :)

Contributor:

exactly

* limitations under the License.
*/

package com.palantir.remoting3.okhttp;
Contributor:

@iamdanfox what's the deal with remoting-vs-conjure in PRs?

}

private Limiter.Listener wrap(
Map<Limiter.Listener, Runnable> activeListeners, Limiter.Listener listener) {
Contributor:

remove activeListeners param?

Contributor:

maybe even make ConcurrencyLimiter non-static?

Contributor Author:

done


private final ConcurrentMap<String, ConcurrencyLimiter> limiters = new ConcurrentHashMap<>();

private static Limiter<Void> newLimiter() {
Contributor:

inline in ConcurrencyLimiter constructor?

Contributor Author:

done


ConcurrencyLimiter limiter(Request request) {
final String limiterKey;
String pathTemplate = request.header(OkhttpTraceInterceptor.PATH_TEMPLATE_HEADER);
Contributor:

this is still dodgy

Contributor Author:

I don't really see a way of avoiding this. It seems reasonable to do this by endpoint, and if you do that you end up with this.

Contributor:

could also see this being something that uses a dynamic proxy which makes it much easier to limit per method or per some annotation. think the only sad thing about this is relying on the tracing header, which is only ever passed around internally (never sent across the wire)

Contributor:

OK. But then I'd rename the code bits so that they're no longer "trace"-specific. Probably also need to stop deleting the header in the trace-specific code path
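
To make the per-endpoint scoping concrete, a hedged sketch of the keying being debated here; ConcurrencyLimiter and OkhttpTraceInterceptor are the PR's own classes, and the method-plus-template key and empty-string fallback are assumptions for illustration only.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import okhttp3.Request;

final class ConcurrencyLimitersSketch {
    private final ConcurrentMap<String, ConcurrencyLimiter> limiters = new ConcurrentHashMap<>();

    ConcurrencyLimiter limiter(Request request) {
        // The path template header is only attached internally (never sent over the wire),
        // so it can serve as a per-endpoint key; fall back to one shared key when it is absent.
        String pathTemplate = request.header(OkhttpTraceInterceptor.PATH_TEMPLATE_HEADER);
        String limiterKey = pathTemplate == null
                ? ""                                      // assumption: a single shared limiter as the fallback
                : request.method() + " " + pathTemplate;  // per-endpoint scope
        return limiters.computeIfAbsent(limiterKey, key -> new ConcurrencyLimiter());
    }
}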

public void run() {
for (int i = 0; i < 1001;) {
Limiter.Listener listener = Futures.getUnchecked(limiter.acquire());
//System.out.println(i);
Contributor:

meep

Contributor Author:

I'm going to rewrite these tests into a kinda contract test thing.

public void onFailure(Throwable t) {
callback.onFailure(
RemotingOkHttpCall.this,
new IOException(new AssertionError("This should never happen", t)));
Contributor:

explain a bit more in the message please

Contributor Author:

done


/**
* Remoting calls may observe 429 or 503 responses from the server, at which point they back off in order to
* reduce excess load. Unfortunately this state on backing off is stored per-call, so 429s or 503s in one call do not
Contributor:

maybe just describe how it works rather than lamenting ("Unfortunately") the decomp?

Flow control in Conjure is a collaborative effort between servers and clients: Servers advertise an overloaded state via 429/503 responses, and clients throttle the number of requests they are sending. The latter is implemented as a combination of two techniques, yielding a mechanism similar to flow control in TCP/IP: First, clients use the frequency of 429/503 responses to determine an estimate for the number of permissible concurrent requests. Second, each such request gets scheduled according to an exponential backoff algorithm. This class provides an asynchronous implementation of Netflix's concurrency-limits library for determining the above-mentioned concurrency estimates. [...]

Contributor Author:

done, thanks

uschi2000 (Contributor) left a comment:

I'm fine with the functionality. Decomp-wise, I sort of wonder if we should roll the limit and backoff functionality into the same class? Don't know exactly what that would look like.

Waiting for @iamdanfox 's feedback to see where we should merge this.

j-baker (Contributor, Author) commented Aug 3, 2018

got fixes to this coming in soon

ellisjoe (Contributor) commented Sep 3, 2018

have you seen the concurrency limiting in Dispatcher? https://github.com/square/okhttp/blob/master/okhttp/src/main/java/okhttp3/Dispatcher.java

It also seems like ideally we'd be able to implement the concurrency limiting at the Call level. You'd end up with a ForwardingCall that looks something like:

public Response execute() throws IOException {
    Limiter limiter = limiters.limit(pathTemplateHeader);
    Response response = delegate.execute();
    if (response.code() / 100 == 2) {
        limiter.success();
    } else if (....) {
        limiter.dropped();
    } else {
        limiter.ignored();
    }
    return response;
}

which lets you avoid the async extras. Doesn't necessarily need to be exactly this, but probably worth discussing an option like this?

ellisjoe (Contributor) commented Sep 4, 2018

* An interceptor for limiting the concurrency of requests to an endpoint.
*
* Requests must be tagged (before reaching this point) with a ConcurrencyLimitTag. At this point, we block on
* receiving a permit to run the request, and store the listener in the tag.
Contributor:

I think it would be better to just use the QosHandler directly in here rather than passing around this tag and requiring it to be set

return chain.proceed(chain.request());
}

public static Callback wrapCallback(Callback callback) {
Contributor:

don't follow why this is necessary?

}
}

private static final class ResourceDeallocator extends AsyncTimeout {
Contributor:

don't think it's worth worrying about this case. if clients aren't releasing resources properly they're going to lock things up eventually anyway. At a minimum it should be a separate change from the concurrency limiting

Contributor:

discussed more in person: going to time out on acquiring a limit rather than on releasing a limit, which has the added benefit of always allowing requests through after some period of time
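
A sketch of what that could look like, under the assumption that the limiter hands back a ListenableFuture of a Listener as in the code above; the one-minute timeout is an illustrative value, not the PR's actual setting.

import com.google.common.util.concurrent.ListenableFuture;
import com.netflix.concurrency.limits.Limiter;
import java.io.IOException;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LimitAcquisitionSketch {
    // Wait a bounded time for a permit; if none arrives, give up waiting so the request
    // still goes through, which is the behaviour described in the comment above.
    static Optional<Limiter.Listener> acquireWithTimeout(ListenableFuture<Limiter.Listener> pending)
            throws IOException {
        try {
            return Optional.of(pending.get(1, TimeUnit.MINUTES));
        } catch (TimeoutException e) {
            pending.cancel(false);  // stop waiting; no permit is held, so there is nothing to release later
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for a concurrency permit", e);
        } catch (ExecutionException e) {
            throw new IOException("Failed while waiting for a concurrency permit", e);
        }
    }
}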

* <li>Change code style to match Palantir baseline.</li>
* </ol>
*/
final class RemotingConcurrencyLimiter implements Limiter<Void> {
Contributor:

Not sure how to review this. Are the things listed above expected to merge upstream so we can remove this class at some point?

@j-baker j-baker changed the base branch from develop to remoting September 13, 2018 14:34
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
if (method.getName().equals("close") && !closed) {
closed = true;
listener.onSuccess();
Contributor:

So the closing mechanism is now implicit, in that as long as the response body is closed (either by reading it fully during json object mapping, or via inputstream.close() if streaming and want to finish earlier) we tag as successful?

Is it viable at some point to have more control over this? One thing that occasionally happens when streaming is your stream ends too early because an error was encountered once some data (esp. headers) was already sent. We would preferably mark those as failed.

* 429 and 503 response codes are used for backpressure, whilst 200 -> 399 request codes are used for determining
* new limits and all other codes are not factored in to timings.
* <p>
* Concurrency permits are only released when the response body is closed.
Contributor:

I think this should be more visible; the last sentence on a package-private class is not exactly discoverable. Perhaps javadoc on RetrofitClient.create (since that's the clients we use for streaming, I believe?)

Contributor:

I've just put a section about this in the README too
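
For completeness, a small usage example of the point above (my illustration, not taken from the README): since the permit is released only when the response body is closed, streaming callers should close the body explicitly, e.g. with try-with-resources.

import java.io.IOException;
import okhttp3.Call;
import okhttp3.Response;
import okhttp3.ResponseBody;

final class StreamingUsageSketch {
    static void consume(Call call) throws IOException {
        Response response = call.execute();
        // Reading the body fully or closing it is what releases the concurrency permit,
        // so close it deterministically when streaming.
        try (ResponseBody body = response.body()) {
            // read from body.byteStream() as needed
        }
    }
}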

@iamdanfox iamdanfox merged commit 14d8533 into remoting Sep 13, 2018
dansanduleac pushed a commit that referenced this pull request Sep 13, 2018
* Expose HostMetricsRegistry record methods (#780)

This is so that we can separately implement a HostMetricsSink for it (see #779) such that we can share host metrics when constructing both conjure and remoting3 clients

* publish BOM for jaxrs-client (#789)

* Excavator: Render CircleCI file using template specified in .circleci/template.sh (#791)

* Upgrade OkHttp to 3.11.0. (#794)

* AssertionErrors are converted into service exceptions with type internal (#727)

* No more ThreadLocals, delegate everything to 'palantir/tracing-java' (#799)

* Use BINTRAY_*_REMOTING envs (#802)

The project's default bintray creds are currently set up to publish to `conjure-java-runtime`.
Use these custom env vars to maintain ability to publish http-remoting.

* Better behaviour in the presence of 429s (#786)
@iamdanfox iamdanfox deleted the jbaker/better_429_behaviour branch September 14, 2018 09:40
6 participants