Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reresolve hostnames as fallback when all hosts are unreachable #1708

Merged
merged 1 commit into from
Jun 28, 2023

Conversation

wprzytula
Copy link
Contributor

If all nodes in the cluster change their IPs at one time, driver used to no longer be able to ever contact the cluster; the only solution was to restart the driver. A fallback is added to the control connection reconnect() logic so that when no known host is reachable, all hostnames provided in ClusterConfig (initial contact points) are reresolved and control connection is attempted to be opened to any of them. If this succeeds, a metadata fetch is issued normally and the whole cluster is discovered with its new IPs.

For the cluster to correctly learn new IPs in case that nodes are accessible indirectly (e.g. through a proxy), that is, by translated address and not rpc_address or broadcast_address, the code introduced in #1682 is extended to remove and re-add a host also when its translated address changed (even when its internal address stays the same).

As a bonus, a misnamed variable hostport is renamed to a suitable
hostaddr.

@wprzytula
Copy link
Contributor Author

@martin-sucha Could you please take a look?

Copy link
Contributor

@martin-sucha martin-sucha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wprzytula thank you for the pull request! This is indeed a useful fix. I've added a few comments inline.

control.go Outdated Show resolved Hide resolved
control.go Outdated Show resolved Hide resolved
control.go Outdated Show resolved Hide resolved
control.go Outdated Show resolved Hide resolved
control.go Outdated Show resolved Hide resolved
@wprzytula
Copy link
Contributor Author

I've addressed comments and retested.

control.go Outdated Show resolved Hide resolved
If all nodes in the cluster change their IPs at one time, driver used to
no longer be able to ever contact the cluster; the only solution was to
restart the driver. A fallback is added to the control connection
`reconnect()` logic so that when no known host is reachable,
all hostnames provided in ClusterConfig (initial contact points)
are reresolved and control connection is attempted to be opened to any
of them. If this succeeds, a metadata fetch is issued normally
and the whole cluster is discovered with its new IPs.

For the cluster to correctly learn new IPs in case that nodes are
accessible indirectly (e.g. through a proxy), that is, by translated
address and not `rpc_address` or `broadcast_address`, the code
introduced in apache#1682 was extended to remove and re-add a host also when
its translated address changed (even when its internal address stays the
same).

As a bonus, a misnamed variable `hostport` is renamed to a suitable
`hostaddr`.
@wprzytula
Copy link
Contributor Author

@martin-sucha Done.

@martin-sucha martin-sucha merged commit a507dae into apache:master Jun 28, 2023
16 checks passed
@martin-sucha
Copy link
Contributor

Merged. Thank you!

Would you be willing to contribute a test that would ensure that the function does not break in the future?

@wprzytula
Copy link
Contributor Author

I'm afraid that automated testing of DNS changes is too much effort. The manual tests that I ran were complex: they involved:

  • stopping systemd DNS service,
  • running custom local DNS service that maps hostnames to "old" IPs,
  • using a proxy on connections to all nodes, listening on "old" IPs,
  • running a crafted test that periodically sends queries,
  • breaking connections by stopping proxies,
  • changing DNS rules to resolve to new IPs,
  • reestablishing proxies on new IPs,
  • waiting until all pools get populated again,
  • asserting that it happens in reasonable time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants