Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way for a user to initiate connection shutdown. #3715

Merged
merged 34 commits into from Aug 3, 2021

Conversation

alexc-db
Copy link
Contributor

@alexc-db alexc-db commented Jul 21, 2021

Motivation:

#3516. A user sometimes needs to send a GOAWAY frame manually when they wants to make a client establish a new connection explicitly. Armeria currently doesn't provide an API for this.

Modifications:

  • Add CompletableFuture<Void> initiateConnectionShutdown() and CompletableFuture<Void> initiateConnectionShutdown(Duration gracePeriod)methods to DefaultServiceRequestContext. CompletableFuture is completed when channel is closed. Mark as @UnstableApi to allow follow up modification. Calling this method sends InitiateConnectionShutdown event using ChannelPipeline.fireUserEventTriggered API. The event is handled by the respective protocol handler.
  • Implement handling for HTTP/1 and HTTP/2 on the server side.
    • For HTTP/1 results in destruction of KeepAliveHandler. This adds "Connection: close" header to the last response and closes connection after response is done.
    • For HTTP/2 behavior depends on configured or provided grace period duration.
      • If grace period is positive:
        • Step 1: On grace period start send GOAWAY frame with 2^31-1 and NO_ERROR code to signal clients that connection will be closed but still accept in flight requests (see https://datatracker.ietf.org/doc/html/rfc7540#section-6.8).
        • Step 2: On grace period end send GOAWAY frame with the latest stream ID and NO_ERROR code, this will cause new requests to be rejected.
        • Step 3: Close connection once all active streams are closed.
      • If grace period is 0 or negative - skip grace period and start from Step 2 as described above.
  • Add server-level configuration for the grace period.
  • Add documentation to AbstractHttp2ConnectionHandler.goAway which mentions that flushing is responsibility of the caller. Initially, I forgot to add the ctx.flush() and spent some time debugging unexpected behavior, hopefully extra documentation will help future readers.

Result:

@ikhoon
Copy link
Contributor

ikhoon commented Jul 22, 2021

Awesome! We look forward to more contributions from @databricks ❤️.

@alexc-db alexc-db force-pushed the goaway branch 2 times, most recently from 45b887b to 1ad81af Compare July 25, 2021 06:04
@codecov
Copy link

codecov bot commented Jul 26, 2021

Codecov Report

Merging #3715 (e5ab2dd) into master (f20fac2) will increase coverage by 0.01%.
The diff coverage is 84.25%.

❗ Current head e5ab2dd differs from pull request most recent head f4f34cc. Consider uploading reports for the commit f4f34cc to get more accurate results
Impacted file tree graph

@@             Coverage Diff              @@
##             master    #3715      +/-   ##
============================================
+ Coverage     73.35%   73.36%   +0.01%     
- Complexity    14613    14653      +40     
============================================
  Files          1283     1285       +2     
  Lines         56160    56272     +112     
  Branches       7181     7185       +4     
============================================
+ Hits          41194    41282      +88     
- Misses        11323    11338      +15     
- Partials       3643     3652       +9     
Impacted Files Coverage Δ
...linecorp/armeria/server/ServiceRequestContext.java 94.87% <ø> (ø)
...p/armeria/server/ServiceRequestContextWrapper.java 0.00% <0.00%> (ø)
...ecorp/armeria/server/ServerHttp2ObjectEncoder.java 74.28% <66.66%> (-2.04%) ⬇️
...p/armeria/server/DefaultServiceRequestContext.java 88.38% <71.42%> (-1.69%) ⬇️
...c/main/java/com/linecorp/armeria/common/Flags.java 59.17% <75.00%> (+0.18%) ⬆️
...rnal/common/GracefulConnectionShutdownHandler.java 76.19% <76.19%> (ø)
...ava/com/linecorp/armeria/server/ServerBuilder.java 80.83% <83.33%> (+0.03%) ⬆️
...nternal/common/AbstractHttp2ConnectionHandler.java 86.00% <100.00%> (+1.90%) ⬆️
...ia/internal/common/InitiateConnectionShutdown.java 100.00% <100.00%> (ø)
...m/linecorp/armeria/server/Http1RequestDecoder.java 81.08% <100.00%> (+0.41%) ⬆️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f20fac2...f4f34cc. Read the comment docs.

@alexc-db alexc-db changed the title Add a way for a user to send a GOAWAY frame. Add a way for a user to initiate connection shutdown. Jul 26, 2021
@trustin
Copy link
Member

trustin commented Jul 27, 2021

I see one test failure, which may or may not be a flaky one:

ServerMaxConnectionAgeTest > http2MaxConnectionAge(SessionProtocol) > com.linecorp.armeria.server.ServerMaxConnectionAgeTest.http2MaxConnectionAge(SessionProtocol)[2] FAILED
    org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a com.linecorp.armeria.server.ServerMaxConnectionAgeTest 
    Expecting a throwable with cause being an instance of:
     <com.linecorp.armeria.client.UnprocessedRequestException>
    but was an instance of:
     <com.linecorp.armeria.common.ClosedSessionException> within 10 seconds.
        at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:166)
        at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
        at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
        at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:939)
        at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:723)
        at com.linecorp.armeria.server.ServerMaxConnectionAgeTest.http2MaxConnectionAge(ServerMaxConnectionAgeTest.java:229)

        Caused by:
        java.lang.AssertionError: 
        Expecting a throwable with cause being an instance of:
         <com.linecorp.armeria.client.UnprocessedRequestException>
        but was an instance of:
         <com.linecorp.armeria.common.ClosedSessionException>
            at com.linecorp.armeria.server.ServerMaxConnectionAgeTest.lambda$http2MaxConnectionAge$6(ServerMaxConnectionAgeTest.java:246)

Could you take a look if your changes affected it?

@minwoox minwoox added this to the 1.10.0 milestone Jul 27, 2021
@alexc-db
Copy link
Contributor Author

Could you take a look if your changes affected it?

I think it's related to my change. It's flaky and doesn't happen all the time, but I was able to repro locally by running many tests in a loop in this branch, cannot reproduce in the master branch. Investigating.

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Left some minor comments.

core/src/main/java/com/linecorp/armeria/common/Flags.java Outdated Show resolved Hide resolved
*
* @param durationMicros the drain duration. {@code 0} or negative value disables the drain.
*/
public ServerBuilder connectionDrainDurationMicros(long durationMicros) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering network delay, microsecond precision of drain for in flight requests seems not effective.
How about using milliseconds for connection drain duration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://datatracker.ietf.org/doc/html/rfc7540#section-6.8 suggests that drain duration can be as low as round-trip time. Roundtrip inside the datacenter is < 1ms, around few hundred microseconds. If client is connected via localhost (e.g. sidecar process) it's gonna be few microseconds. I chose microsecond precision to give users an ability to configure in those lower ranges if they need to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Make sense. Let's keep it as it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat concerned about using micros for a long parameter because this is probably the only place we use micros instead of millis. Would it be a bad idea to accept a millis value here? A user can use Duration for micros precision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this builder method - its name contains Micros so there should be little room for the confusion. If you think that providing an override with Millis is valuable I can add that. But I think there's value in keeping Micros override too, to make it explicit that server builder supports that level of precision for connection drain duration.

Comment on lines 140 to 142
gracefulConnectionShutdownHandler.updateDrainDuration(
((InitiateConnectionShutdown) evt).drainDurationMicros());
ctx.channel().close();
Copy link
Contributor

@ikhoon ikhoon Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Add return after closing a channel?
Suggested change
gracefulConnectionShutdownHandler.updateDrainDuration(
((InitiateConnectionShutdown) evt).drainDurationMicros());
ctx.channel().close();
gracefulConnectionShutdownHandler.updateDrainDuration(
((InitiateConnectionShutdown) evt).drainDurationMicros());
ctx.channel().close();
return;
  1. I imagine that InitiateConnectionShutdown events are simultaneously triggered by a throttling decorator.
    It would be inefficient that ctx.channel().close() is constantly called from multiple requests.
    Could we reschedule the ongoing drainFuture using gracefulConnectionShutdownHandler.updateDrainDuration()
    and call ctx.channel().close() only once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize that calling ctx.channel().close() multiple times is expensive, fixed.

@alexc-db
Copy link
Contributor Author

@trustin re: #3715 (comment) I think it should be fixed by 1431a68, I'm not able to reproduce it anymore.

}

@Override
public CompletableFuture<Void> initiateConnectionShutdown(long drainDurationMicros) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat concerned about using micros for a long parameter because this is probably the only place we use micros instead of millis. Would it be a bad idea to accept a millis value here? A user can use Duration for micros precision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this one might be confusing, but in this case I think your earlier argument about GC pressure applies as well. What if user needs micros precision and their server handles O(million) connections and they use this API? Constructing and destructing Duration objects for each of the connections will cause high GC pressure.

What do you think about making the method name more descriptive? E.g. initiateConnectionShutdownWithDrainDurationMicros(long) for custom micros, initiateConnectionShutdownWithDrainDuration(Duration) for custom duration, initiateConnectionShutdown() for default? Sort of similar to the server builder API #3715 (comment).

*
* @param durationMicros the drain duration. {@code 0} or negative value disables the drain.
*/
public ServerBuilder connectionDrainDurationMicros(long durationMicros) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat concerned about using micros for a long parameter because this is probably the only place we use micros instead of millis. Would it be a bad idea to accept a millis value here? A user can use Duration for micros precision.

"/goaway_async?duration=1",
"/goaway_async?duration=100",
})
void initiateConnectionShutdownHttp1(String path) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making sure all test methods in this class check if the connection is actually closed by the server, even if the client doesn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, PTAL. Found that Armeria doesn't seem to propagate config.idleTimeoutMillis during the HTTP/2 upgrade, so it's always seem to be set to Netty default. Added https://github.com/line/armeria/pull/3715/files#diff-e055c2b1d925141bd5d258b22992c79d706c678dd614399284e4b5e54ea22ac3R42, please let me know if that looks reasonable to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: discussed offline with Trustin, for HTTP/2 server connection handler we can completely disable internal Netty gracefulShutdownTimeoutMillis because drain implemented in this PR is a better way to handle graceful shutdown. This also decouples keep-alive configuration (idleTimeoutMillis) from the graceful shutdown configuration (this PR).

alexc-db and others added 8 commits July 30, 2021 01:28
Motivation:

line#3516. A user sometimes needs to send a GOAWAY frame manually when they wants to make a client establish a new connection explicitly. Armeria currently doesn't provide an API for this.

Modifications:

- Add CompletableFuture<Void> initiateConnectionShutdown() method to ServiceRequestContext interface that returns CompletableFuture, which is completed when channel is closed. Mark as @UnstableApi to allow follow up modification, see *Future work* below, e.g. adding method parameters for grace period between first and second GOAWAY frames described there.
- Implement method for DefaultServiceRequestContext - calling the method for HTTP/2 requests results in sending a GOAWAY frame with stream ID 2^31-1 and NO_ERROR code. This signals to the client that a shutdown is imminent and that initiating further requests is prohibited https://datatracker.ietf.org/doc/html/rfc7540#section-6.8. HTTP/1 is currently not supported, if called for HTTP/1 request - CompletableFuture completes exceptionally with UnsupportedOperationException.
- Implement method for ServiceRequestContextWrapper.
- Add documentation to AbstractHttp2ConnectionHandler.goAway which mentions that flushing is responsibility of the caller. Initially, I forgot to add the `ctx.flush()` and spent some time debugging unexpected behavior, hopefully extra documentation will help future readers.

Result:

- Provided an API for sending GOAWAY frame from request handler.

Future work:

- HTTP/1 support - calling initiateConnectionShutdown() would result in sending `Connection: close` header.
- Support for sending second GOAWAY frame with latest stream ID after grace period, and rejecting incoming streams with higher IDs using RST_STREAM frame, error code REFUSED_STREAM, see https://datatracker.ietf.org/doc/html/rfc7540#section-8.1.4.
…eConnectionShutdown. HTTP/2 sends GOAWAY frames, HTTP/1 sends Connection: close.

Add grace period argument to the initiateConnectionShutdown to allow multiple GOAWAY frames to be sent as described in https://datatracker.ietf.org/doc/html/rfc7540#section-8.1.4.

Pass ChannelHandlerContext to the DefaultServiceRequestContext. Add FakeChannelHandlerContext private class to ServiceRequestContextBuilder - use-case similar to FakeChannel in AbstractRequestContextBuilder.
…tdown event to trigger graceful connection drain before shutdown.
…wn. Move graceful connection shutdown logic to the close() method of the handler.
…that doesn't take grace period and doesn't override current grace period value.
Co-authored-by: Trustin Lee <t@motd.kr>
alexc-db and others added 9 commits July 30, 2021 01:28
…rConnectionHandlerBuilder. Tune idleTimeoutMillis setting in tests to avoid flaky failures.
Co-authored-by: Ikhun Um <ih.pert@gmail.com>
…for HTTP/1. Update Http2ServerConnectionHandler build process - disable gracefulShutdownTimeoutMillis since it's replaced with drain functionality introduced in this PR.
…Test - CI machines are not very powerful and small timeouts cause various kinds of flakiness.
…ction alive by sending periodic requests, then verify that eventually it closes even without idle connection. I think the source of flakiness is previous test design where code inside untilAsserted was attempting to create requests **around the moment** when keep alive handler is trying to close connection due to max-age. It's inherently prone to race conditions because untilAsserted sleeps between polls and may schedule **after** the connection was closed, effectively opening another connection.
Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits. Looking great otherwise!

* @see #initiateConnectionShutdown(long)
*/
@UnstableApi
CompletableFuture<Void> initiateConnectionShutdown(Duration drainDuration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that comment is not relevant for this method since it only converts Duration to long and can be implemented as a default. PTAL.

Comment on lines 228 to 236
@Override
public CompletableFuture<Void> initiateConnectionShutdown(Duration drainDuration) {
return delegate().initiateConnectionShutdown(drainDuration);
}

@Override
public CompletableFuture<Void> initiateConnectionShutdown() {
return delegate().initiateConnectionShutdown();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove this if make them default methods. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alexc-db and others added 4 commits August 2, 2021 10:20
Co-authored-by: Trustin Lee <t@motd.kr>
Co-authored-by: Ikhun Um <ih.pert@gmail.com>
…inDuration)` a default method. Rolled back suggested change with default `public CompletableFuture<Void> initiateConnectionShutdown()` since it changes behavior.
@alexc-db
Copy link
Contributor Author

alexc-db commented Aug 2, 2021

Recent checks failed with the following memory leak

04:21:11.386 [armeria-common-worker-epoll-2-2] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
#1:
	io.netty.buffer.AbstractPooledDerivedByteBuf.deallocate(AbstractPooledDerivedByteBuf.java:86)
	io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
	io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
	io.netty.buffer.WrappedByteBuf.release(WrappedByteBuf.java:1037)
	io.netty.buffer.SimpleLeakAwareByteBuf.release(SimpleLeakAwareByteBuf.java:102)
	io.netty.buffer.AdvancedLeakAwareByteBuf.release(AdvancedLeakAwareByteBuf.java:941)
	com.linecorp.armeria.common.ByteBufHttpData.close(ByteBufHttpData.java:256)
	com.linecorp.armeria.client.grpc.protocol.UnaryGrpcClient$GrpcFramingDecorator$1.onError(UnaryGrpcClient.java:256)
	com.linecorp.armeria.client.grpc.protocol.UnaryGrpcClient$GrpcFramingDecorator$1.onNext(UnaryGrpcClient.java:242)

Note UnaryGrpcClient.java:256 - this branch isn't rebased on top of fcd2d72, so there's no line 256 in UnaryGrpcClient.java in this branch. Any idea why master branch changes leaked here @trustin @ikhoon @minwoox? Do checks rebase branch on top of master implicitly?

@trustin
Copy link
Member

trustin commented Aug 3, 2021

@alexc-db Re: leak:

That's weird. We don't merge the base branch at all in PR runs. Please run ./gradlew --parallel -Pleak test and grep -rFl LEAK: . as the source of truth if you keep seeing this. 🤔

@alexc-db
Copy link
Contributor Author

alexc-db commented Aug 3, 2021

@alexc-db Re: leak:

That's weird. We don't merge the base branch at all in PR runs. Please run ./gradlew --parallel -Pleak test and grep -rFl LEAK: . as the source of truth if you keep seeing this. 🤔

For the record https://github.com/line/armeria/runs/3224114462 - the leak report and stack trace can be found inside the reports-JVM-15 artifact.

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your patience. High quality code FTW 😄

@alexc-db
Copy link
Contributor Author

alexc-db commented Aug 3, 2021

@minwoox @ikhoon could you please give it a final pass and approve if everything looks good?

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 🎉 🎉
Thanks a lot for your hard work, @alexc-db!

Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! @alexc-db

@trustin trustin merged commit 3491204 into line:master Aug 3, 2021
@alexc-db
Copy link
Contributor Author

alexc-db commented Aug 3, 2021

Thanks for the thorough review @trustin @ikhoon @minwoox!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a way for a user to send a GOAWAY frame.
4 participants