Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be able to retry the query via TCP if a query failed because of a timeout #13757

Merged
merged 5 commits into from
Jan 12, 2024

Conversation

normanmaurer
Copy link
Member

@normanmaurer normanmaurer commented Dec 29, 2023

Motivation:

We should allow people to retry the query via TCP if the query failed because of a
timeout when using UDP.

Modifications:

  • Move all the retry code for TCP into DnsQueryContext so we can reuse
    the same code for handling truncation and retry.
  • Retry on timeout if configured by user
  • Add unit tests

Result:

More robust resolver

…sing UDP

Motivation:

We should retry the query via TCP if the query failed because of a timeout when using UDP.

Modifications:

- Move all the retry code for TCP into DnsQueryContext so we can reuse the same code for handling truncation and retry.
- Retry on timeout if possible
- Add unit tests

Result:

More robust resolver
* @return {@code true} if retry via TCP is supported and so the ownership of
* {@code originalResult} was transferred, {@code false} otherwise.
*/
private boolean retryWithTCP(final Object originalResult) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate on the motivation in a bit more details? Where this is coming from?
In my experience, DNS timeouts are quite frequent and automatically retrying them via TCP may cause unexpected side-effects, like:

  1. People expecting the query to fail within a pre-defined timeout. Retrying it will double the total timeout.
  2. Spike of TCP traffic for DNS when timeouts are frequent enough.

I will propose to make this an opt-in feature, but mostly looking for a motivational story behind it.

Copy link

@hellojukay hellojukay Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@idelpivnitskiy you may read this https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts

Dns through tcp is real in needed, maybe tcp is not default behavior, but we should make people can enable tcp fallback option and set tcp timeout.

We use redisson watch dns change event, redisson dns watching implement by netty dns resolver, redisson dns udp timeout 5000ms frequently becase of udp dns race. redisson/redisson#5137 and #13705

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link and additional context.

My arguments (1) and (2) remain valid. If we merge this PR, it changes the current behavior and may cause unexpected outages under heavy load and scale. So, if we proceed with this change, it should be configurable (preferable to be opt-in by default) to let impacted users experiment in a safe fashion without affecting everyone.

After reading the article, I'm not sure this specific change is a good mitigation for the described problem. DNS timeouts are quite often, happen not only in k8s env, and may be caused by various different reasons. There are cases, when automatic retry over TCP can make everything even worse.
Good mitigation points from the article are:
a. disable parallel lookups
b. use TCP for lookups

With (b) and RFC7766 together, it makes more sense to let users chose TCP as the default protocol for their DNS client and skip UDP completely or let it be a fallback. However, it opens other questions around connection management and DoS protection described in RFC7766.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot assume what kind of network our program will run on. In any case, a timeout will occur. We should allow users to configure: when a failure occurs, switch to another network connection method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@idelpivnitskiy the problem is that there is really no way for people to to do a retry via TCP by themself (we might want to support also create a DnsNameResolver which only does TCP in the future). That said let's just make it configurable to also fallback to TCP on timeout.

@hellojukay
Copy link

Will this PR be merged in the futrue.

@normanmaurer
Copy link
Member Author

normanmaurer commented Jan 10, 2024 via email

Copy link
Member

@idelpivnitskiy idelpivnitskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the config flag @normanmaurer!

@normanmaurer normanmaurer merged commit 684dfd8 into 4.1 Jan 12, 2024
15 checks passed
@normanmaurer normanmaurer deleted the retry_tcp_on_timeout branch January 12, 2024 17:47
@normanmaurer
Copy link
Member Author

Will work on a patch for main

normanmaurer added a commit that referenced this pull request Jan 14, 2024
…eout when u… (#13757)

…sing UDP

Motivation:

We should allow people to retry the query via TCP if the query failed because of a
timeout when using UDP.

Modifications:

- Move all the retry code for TCP into DnsQueryContext so we can reuse
the same code for handling truncation and retry.
- Retry on timeout if configured by user
- Add unit tests

Result:

More robust resolver
@normanmaurer normanmaurer changed the title Retry the query via TCP if a query failed because of a timeout when u… Be able to retry the query via TCP if a query failed because of a timeout Jan 16, 2024
franz1981 pushed a commit to franz1981/netty that referenced this pull request Feb 9, 2024
…eout when u… (netty#13757)

…sing UDP

Motivation:

We should allow people to retry the query via TCP if the query failed because of a
timeout when using UDP.

Modifications:

- Move all the retry code for TCP into DnsQueryContext so we can reuse
the same code for handling truncation and retry.
- Retry on timeout if configured by user
- Add unit tests

Result:

More robust resolver
@paolosciamm
Copy link

Hi,
thanks for this release.
Just a question about this quote:

  • Retry on timeout if configured by user

Using the 4.1.105.Final version, how can I configure/enable the TCP fallback in Spring?
Is it needed a particular properties or bean config?

@normanmaurer
Copy link
Member Author

you would configure it via the DnsNameResolverBuilder.

@violetagg
Copy link
Member

Hi, thanks for this release. Just a question about this quote:

  • Retry on timeout if configured by user

Using the 4.1.105.Final version, how can I configure/enable the TCP fallback in Spring? Is it needed a particular properties or bean config?

If you use WebClient then you can configure the Reactor Netty HttpClient like this:

HttpClient client = HttpClient.create().resolver(spec -> spec.retryTcpOnTimeout(true));

See more here https://projectreactor.io/docs/netty/release/reference/index.html#_host_name_resolution_2

@paolosciamm
Copy link

paolosciamm commented Jul 1, 2024

Hi, thanks for this release. Just a question about this quote:

  • Retry on timeout if configured by user

Using the 4.1.105.Final version, how can I configure/enable the TCP fallback in Spring? Is it needed a particular properties or bean config?

If you use WebClient then you can configure the Reactor Netty HttpClient like this:

HttpClient client = HttpClient.create().resolver(spec -> spec.retryTcpOnTimeout(true));

See more here https://projectreactor.io/docs/netty/release/reference/index.html#_host_name_resolution_2

Thanks for the reply.
And if I won't use Reactor Netty, would adding the following configuration (according to @normanmaurer answer) be enough?

@Configuration
public class CustomDnsNameResolver {
    @Bean
    public DnsNameResolver customDnsNameResolverBuilder() {
        NioEventLoopGroup group = new NioEventLoopGroup();
        return new DnsNameResolverBuilder(group.next())
                .channelType(NioDatagramChannel.class)
                .resolvedAddressTypes(ResolvedAddressTypes.IPV4_ONLY)
                .nameServerProvider(DnsServerAddressStreamProviders.platformDefault())
                .queryTimeoutMillis(5000)
                .maxQueriesPerResolve(2)
                .build();
    }
}

@violetagg
Copy link
Member

@paolosciamm I would say, you need .socketChannelType(NioSocketChannel.class, true)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants