Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring a timeout for Endpoint selection in EndpointGroup #4246

Merged
merged 17 commits into from Jul 4, 2022

Conversation

ikhoon
Copy link
Contributor

@ikhoon ikhoon commented May 10, 2022

Motivation:

A DynamicEndpointGroup lazily resolves Endpoints even after a
request is started. Armeria clients use a connection timeout to wait for
the late Endpoints.

return endpointGroup.select(this, temporaryEventLoop, connectTimeoutMillis()).handle((e, cause) -> {

The default connection timeout is 3.2 seconds which might be enough for a
single EndpointGroup.

However, HealthCheckedEndpointGroup wraps an EndpointGroup that
could be a DynamicEndpointGroup as well. Let's say the delegate
EndpointGroup is a DNS EndpointGroup. If a request is started, a DNS
resolution and HTTP health check should be preceded. If a server
starts, it could take more resources and 3.2 seconds that can not make
sense to a complex EndpointGroup structure.

Modifications:

  • Add selectionTimeoutMillis() to EndpointGroup so that a
    EndpointSelector uses the value to schedule a timeout.
  • Add selectionTimeout{Millis}(long) to
    AbstractDynamicEndpointGroupBuilder in order to override the
    timeout.
    • Set 0 to the default selection timeout for static endpoints.
    • Set the default connection timeout to the default selection timeout
      for DynamicEndpointGrouop.
    • Set the default response timeout to the default selection timeout
      for HealthCheckedEndpointGroup, ZooKeeperEndpointGroup,
      EurekaEndpointGroup and ConsulDnsEndpointGroup.
    • HealthCheckedEndpointGroup adds its selectionTimeout and
      delegate's selectionTimeout.
  • Raise EndpointSelectionTimeoutException if EndpointSelctor fails
    to select an Endpoint within a timeout.
  • Make ZooKeeperEndpointGroupBuilder and EurekaEndpointGroupBuilder
    implment DynamicEndpointGroupSetters.
  • Deprecate) AbstractEndpointSelector.select(ctx, executor, timeout)
    in favor of AbstractEndpointSelector.select(ctx, executor)

Result:

  • You can now configure a timeout for an Endpoint selection of an
    DynamicEndpointGroup.
    DnsAddressEndpointGroup delegate = DnsAddressEndpointGroup.of(...);
    HealthCheckedEndpointGroup healthGroup =
      HealthCheckedEndpointGroup
        .builder(delegate, "/health")
        .selectionTimeout(Duration.ofSeconds(10))  // 👈👈👈
        .build();

Motivation:

A `DynamicEndpointGroup` lazily resolves `Endpoint`s even after a
request is started. Armeria clients use a connection timeout to wait for
the late `Endpoint`s.
https://github.com/line/armeria/blob/cc1656e6c9f7ba6b04d115e0298e90b8cf7c0971/core/src/main/java/com/linecorp/armeria/client/DefaultClientRequestContext.java#L320
The default connection timeout is 3.2 seconds that might be enough for a
single `EndpointGroup`.

However, `HealthCheckedEndpointGroup` wraps an `EndpointGroup` that
could be a `DynamicEndpointGroup` as well. Let's say the delegate
`EndpointGroup` is a DNS `EndpointGroup`. If a request is started, a DNS
resolution and HTTP health check should be preceded.  If a server
starts, it could take more resources and 3.2 seconds that can not make
sense to a complex `EndpointGroup` structure.

Modifications:

- Add `selectionTimeoutMillis()` to `EndpointGroup` so that a
  `EndpointSelector` uses the value to schedule a timeout.
- Add `selectionTimeout{Millis}(long)` to
  `AbstractDynamicEndpointGroupBuilder` in order to override the
  timeout.
  - Set 0 to the default selection timeout for static endpoints.
  - Set the default connection timeout to the default selection timeout
    for `DynamicEndpointGrouop`.
  - Set the default response timeout to the default selection timeout
    for `HealthCheckedEndpointGroup`, `ZooKeeperEndpointGroup`,
    `EurekaEndpointGroup` and `ConsulDnsEndpointGroup`.
  - `HealthCheckedEndpointGroup` adds its `selectionTimeout` and
    delegate's `selectionTimeout`.
- Raise `EndpointSelectionTimeoutException` if `EndpointSelctor` fails
  to select an `Endpoint` within a timeout.
- Make `ZooKeeperEndpointGroupBuilder` and `EurekaEndpointGroupBuilder`
  implment `DynamicEndpointGroupSetters`.
- Deprecate) `AbstractEndpointSelector.select(ctx, executor, timeout)`
  in favor of `AbstractEndpointSelector.select(ctx, executor)`

Result:

- You can configure a timeout for an `Endpoint` selection of an
  `DynamicEndpointGroup`.
  ```
  DnsAddressEndpointGroup delegate = DnsAddressEndpointGroup.of(...);
  HealthCheckedEndpointGroup healthGroup =
    HealthCheckedEndpointGroup
      .builder(delegate, "/health")
      .selectionTimeout(Duration.ofSeconds(10))  // 👈👈👈
      .build();
  ```
@codecov
Copy link

codecov bot commented May 13, 2022

Codecov Report

Merging #4246 (eba5255) into master (6f256cb) will increase coverage by 0.18%.
The diff coverage is 80.86%.

@@             Coverage Diff              @@
##             master    #4246      +/-   ##
============================================
+ Coverage     73.27%   73.46%   +0.18%     
- Complexity    17253    17340      +87     
============================================
  Files          1473     1475       +2     
  Lines         64966    65093     +127     
  Branches       8158     8169      +11     
============================================
+ Hits          47606    47821     +215     
+ Misses        13218    13117     -101     
- Partials       4142     4155      +13     
Impacted Files Coverage Δ
...inecorp/armeria/client/endpoint/EndpointGroup.java 81.25% <ø> (ø)
...m/linecorp/armeria/internal/client/ClientUtil.java 84.78% <ø> (ø)
...a/client/endpoint/DynamicEndpointGroupBuilder.java 36.36% <33.33%> (-6.50%) ⬇️
.../internal/client/endpoint/StaticEndpointGroup.java 84.61% <33.33%> (-6.69%) ⬇️
...ain/java/com/linecorp/armeria/client/Endpoint.java 91.28% <50.00%> (+0.54%) ⬆️
...p/armeria/client/endpoint/OrElseEndpointGroup.java 61.53% <50.00%> (+0.42%) ⬆️
...eria/client/eureka/EurekaEndpointGroupBuilder.java 35.78% <71.42%> (+5.30%) ⬆️
...lient/zookeeper/ZooKeeperEndpointGroupBuilder.java 70.96% <73.33%> (+0.37%) ⬆️
.../endpoint/AbstractDynamicEndpointGroupBuilder.java 81.25% <75.00%> (-18.75%) ⬇️
...eck/AbstractHealthCheckedEndpointGroupBuilder.java 83.09% <76.19%> (-3.18%) ⬇️
... and 57 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f256cb...eba5255. Read the comment docs.

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good! Left some minor questions


/**
* Creates a new instance.
*/
protected AbstractDynamicEndpointGroupBuilder() {}
protected AbstractDynamicEndpointGroupBuilder() {
this(Flags.defaultConnectTimeoutMillis());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default for DnsEndpointGroupBuilder will be connectTimeout instead of responseTimeout. I wanted to check if this is your intention.

If the intention is to set responseTimeout as the default for pre-defined EndpointGroups in armeria, what do you think of just removing the empty constructor?
We can catch these type of mistakes at compile-time this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response.

I think the default for DnsEndpointGroupBuilder will be connectTimeout instead of responseTimeout. I wanted to check if this is your intention.

Currently, the default select timeout for DnsEndpointGroup is the default connection timeout.

assertSelectionTimeout(endpointGroup).isEqualTo(Flags.defaultConnectTimeoutMillis());

DNS servers usually return a response quickly because they cache DNS records. Therefore, 3.2 seconds may be reasonable.

By the way, I checked the default timeout for a DNS query. It is 5 seconds that can be a more sensible default timeout for `DnsEndpointGroup/

what do you think of just removing the empty constructor?

As this class is a public API, that could cause a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this class is a public API, that could cause a breaking change.

AbstractDynamicEndpointGroupBuilder is part of unstable APIs. Will remove the default constructor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds reasonable to use the same timeout for DnsEndpointGroup and our DNS resolver.

* ...
* .build();
* assert endpointGroup.selectionTimeoutMillis() == 13000;
* }</pre>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; what do you think of adding an explanation that setting this will set both selectionTimeoutMillis and initialSelectionTimeoutMillis


/**
* Creates a new instance.
* Creates a new instance with the default {@code selectionTimeoutMillis}.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Creates a new instance with the default {@code selectionTimeoutMillis}.
* Creates a new instance with the specified default {@code selectionTimeoutMillis}.

@@ -143,6 +144,13 @@ static EndpointGroup of(EndpointSelectionStrategy selectionStrategy,
@Override
Endpoint selectNow(ClientRequestContext ctx);

/**
* Returns the timeout to wait until a successful {@link Endpoint} selection.
* {@code 0} means {@link #selectNow(ClientRequestContext)} should always return an {@link Endpoint}.
Copy link
Collaborator

@trustin trustin May 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also document what happens on the initial selection attempt? Will it wait indefinitely or just return null?

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me once @trustin 's comments are addressed 👍 Thanks @ikhoon ! 🙇 🚀 👍

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!
LGTM once @trustin's comments are addressed. 😉

@@ -73,25 +82,41 @@ public final CompletableFuture<Endpoint> select(ClientRequestContext ctx,
return UnmodifiableFuture.completedFuture(endpoint);
}

final long selectionTimeoutMillis = endpointGroup.selectionTimeoutMillis();
if (selectionTimeoutMillis == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would it be better to use -1 to indicate StaticEndpointGroup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to differentiate the timeout of StaticEndpointGroup from others.
A DynamincEndpointGroup may have 0 selection timeout if it is made from a static resource such as a file or an environment variable.

Copy link
Collaborator

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment. Looks great otherwise!

* If unspecified, {@link Flags#defaultConnectTimeoutMillis()} is used by default.
*/
@UnstableApi
default DynamicEndpointGroupSetters selectionTimeout(Duration selectionTimeout) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selectionTimeout or selectTimeout? I feel like selectionTimeout is correct, but we have connectTimeout 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably selectionTimeout is better here because we have EndpointSelectionTimeoutException.

Copy link
Collaborator

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @ikhoon 🙇

@ikhoon ikhoon merged commit 554d596 into line:master Jul 4, 2022
@ikhoon ikhoon deleted the endpoint-selection-timeout branch July 4, 2022 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants