Resolving DNS when connecting pool connections can lead to connection imbalances #1575
Comments
To reproduce, set up a local nameserver (bind9 in my case) with the local IPs as the A record entries:
Then create a new cluster/session using:
cluster := gocql.NewCluster("example.com")
cluster.NumConns = 100
session, err := gocql.NewSession(*cluster)
if err != nil {
	log.Fatalf("unable to connect session: %v", err)
}
Note the imbalanced connection counts:
When a hostname is used for contact points, it is resolved once as part of the initialization process, and `HostInfo` objects are then created with the original `hostname` and the resolved `connectAddress`. `hostname` is then used to dial the pool connections, which causes another DNS resolve that can return a different IP than the original `connectAddress`, because A records can change order on each resolve. This results in a connection pool for a given IP address containing connections to multiple different IP addresses. This patch removes the second resolve when dialing by setting the `hostname` member to the resolved IP in the initialization step. Resolves apache#1575
We currently have https://github.com/gocql/gocql/blob/bc256bbb90de7113a74ad4d777beeec75eb9c4e7/cluster.go#L166-L172 in docs. If you only want to resolve the IP addresses when creating the cluster, you can simply resolve the DNS name to IP addresses yourself and pass the list of IPs instead of the DNS name.
What is the desired behavior in case the DNS record changes?
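A minimal sketch of the "resolve it yourself" suggestion above, using only the standard library. The helper `resolveContactPoints` and the `lookupFunc` type are hypothetical names introduced here for illustration; they are not part of gocql. The sketch assumes the resulting addresses would be passed as contact points, for example via `gocql.NewCluster(ips...)`.

```go
package main

import (
	"fmt"
	"net"
)

// lookupFunc has the same signature as net.LookupHost, so a fake resolver
// can be substituted in tests. (Hypothetical type, not part of gocql.)
type lookupFunc func(host string) ([]string, error)

// resolveContactPoints resolves every name exactly once and returns the
// de-duplicated addresses in the order they were first seen.
func resolveContactPoints(lookup lookupFunc, names ...string) ([]string, error) {
	seen := make(map[string]bool)
	var ips []string
	for _, name := range names {
		addrs, err := lookup(name)
		if err != nil {
			return nil, fmt.Errorf("resolve %q: %w", name, err)
		}
		for _, a := range addrs {
			if !seen[a] {
				seen[a] = true
				ips = append(ips, a)
			}
		}
	}
	return ips, nil
}

func main() {
	ips, err := resolveContactPoints(net.LookupHost, "example.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// The addresses could then be used as contact points, e.g.:
	//   cluster := gocql.NewCluster(ips...)
	fmt.Println(ips)
}
```

Because the names are resolved once up front, later changes to the DNS record are not picked up, which is exactly the trade-off the question about changing DNS records is asking about.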
Thanks for the pointer in the docs. I wasn't aware of that. Sorry.
I'm trying to work this out myself. :) Pools are keyed based on the resolved IP. I'm trying to wrap my head around a case where the driver would want unresolved hosts. Maybe in the case of a total cluster outage in an environment (like k8s) where all the hosts' IPs have changed (but this would only make sense for re-establishing the control connection, not for pool connections) or some address translator scenario?
I'm looking into a similar situation when used on Kubernetes where you get a headless DNS that could return A records for 3 nodes. The problem I was trying to figure out is how to roll the nodes in a cluster and ensure the client has an updated host pool. I wasn't sure if the client driver regularly resolves the hosts or not. So I am wondering, is there any case where the client driver will resolve the host names again? Is there some kind of eventing that is not happening on my end when the new node pods start and the client doesn't see them? Or should I be using sticky IP addresses for the pods so they remain fixed after being rolled?
Okay, I've re-read the code and the original post. So we do resolve DNS names to IP addresses when establishing the initial control connection (during session initialization) and we build a pool out of that. The imbalance in the pool is because we re-resolve the hostname when dialing. That should be fixable by dialing the IP address instead of the hostname (for the TCP connection). Dialing the IP address instead of the hostname might break some dialers that expect a hostname (like in #1579, which might resolve through a proxy, but it seems such a dialer would not work anyway as we try to resolve hostnames to IP addresses first). We need to update the docs to reflect the current behaviour.
As for the rolling restart in Kubernetes, gocql receives events from the cluster about added/removed nodes. I think we should see some events from the cluster about the new IP address of the host (but I'm not sure about that). Currently we keep nodes in the pool by IP address. If we switch the dialer to the IP address, that would not help with the k8s rolling restart case as we'd not re-resolve the hostname. @justinfx would you mind opening a separate issue with a log of events (compile with the gocql_debug tag) that we get from the cluster during a rolling restart? It will be interesting to see what events we receive in that case.
I think we need a new dialer interface (that would get a HostInfo pointer instead of a simple address), a place to re-discover the initial hosts (when we lose all connections), a user-specified function to discover the hosts to connect to (called during session init and when we detect we lost all connections) and a way to construct HostInfo outside of the gocql package. That would help with #1579 and #1487. Being able to construct HostInfo would help with testing host selection policies as well.
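To make the "dial the IP address instead of the hostname" idea and the proposed host-aware dialer more concrete, here is a rough sketch. Everything in it is hypothetical: `hostInfo` is a simplified stand-in for gocql's `HostInfo`, and `hostDialer`, `ipDialer`, `newHostInfo`, and `dialAddress` are illustrative names, not an existing or planned gocql API.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strconv"
)

// hostInfo is a simplified, hypothetical stand-in for gocql's HostInfo.
type hostInfo struct {
	hostname       string // name originally supplied by the user
	connectAddress net.IP // address resolved during session init
	port           int
}

// newHostInfo builds a hostInfo from a textual IP address.
func newHostInfo(hostname, ip string, port int) *hostInfo {
	return &hostInfo{hostname: hostname, connectAddress: net.ParseIP(ip), port: port}
}

// hostDialer is a hypothetical dialer interface that receives the whole
// host record rather than a bare "host:port" string.
type hostDialer interface {
	DialHost(ctx context.Context, host *hostInfo) (net.Conn, error)
}

// dialAddress returns the address to dial: the already resolved IP, never
// the hostname, so no second DNS lookup happens at dial time.
func dialAddress(h *hostInfo) string {
	return net.JoinHostPort(h.connectAddress.String(), strconv.Itoa(h.port))
}

// ipDialer opens TCP connections to the resolved address of a host.
type ipDialer struct {
	d net.Dialer
}

func (p *ipDialer) DialHost(ctx context.Context, h *hostInfo) (net.Conn, error) {
	return p.d.DialContext(ctx, "tcp", dialAddress(h))
}

// Compile-time check that ipDialer satisfies hostDialer.
var _ hostDialer = (*ipDialer)(nil)

func main() {
	h := newHostInfo("example.com", "10.0.0.1", 9042) // placeholder values
	fmt.Println(dialAddress(h))
}
```

A dialer that needs the hostname (for example, one that resolves through a proxy) could implement the same interface and read `hostname` instead, which is the flexibility a bare address string does not offer.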
Thanks for looking into that, @martin-sucha. I will try to post a new issue with the debug output. From my tests so far, when I roll a cluster I do see events come in to the client. But the factor here is how fast you roll the cluster. If I roll the nodes one by one as soon as each one passes its health check, it seems to be too fast for the client, which ends up in a state where it thinks the entire pool is down. But if I manually roll the cluster slowly, I see the events come in for the new nodes and eventually the old down node stops logging. Unfortunately I don't think a cluster is always going to go down in that very nicely controlled fashion.
This issue is fixed by 7a6cf00. The same reproduction flow ends up with balanced connection counts:
Tested on |
What version of Cassandra are you using?
Reproducible with any version. Tested with 3.11.10.
What version of Gocql are you using?
bc256bb
What version of Go are you using?
go1.16.6 linux/amd64
What did you do?
Create a `Cluster` object using a DNS name with multiple A records:
What did you expect to see?
100 connections per host for the 3 hosts in the cluster and 1 extra for the control connection.
What did you see instead?
Imbalanced number of connections.
The problem
The host is created with the original DNS entry as the struct member `hostname`: https://github.com/gocql/gocql/blob/bc256bbb90de7113a74ad4d777beeec75eb9c4e7/control.go#L151
which causes it to be re-resolved when making the connection: https://github.com/gocql/gocql/blob/bc256bbb90de7113a74ad4d777beeec75eb9c4e7/conn.go#L247
using the `hostname` retrieved from `HostInfo.HostnameAndPort()`: https://github.com/gocql/gocql/blob/bc256bbb90de7113a74ad4d777beeec75eb9c4e7/host_source.go#L369
The problem is that pools are mapped using `ConnectAddress()` from the original DNS-resolved IP address (https://github.com/gocql/gocql/blob/bc256bbb90de7113a74ad4d777beeec75eb9c4e7/connectionpool.go#L224), but when the hostname is re-resolved in `dialer.DialContext()` it can resolve to a different address because A records don't always come back in the same order. This causes pools to contain connections to multiple addresses instead of one and results in an imbalance.
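The effect described above can be illustrated with a small standalone simulation. This is only a model under stated assumptions, not gocql code: the IPs are placeholders, `fillPools`, `firstRecord`, and `perAddress` are made-up names, and the varying A-record order is approximated by picking a random record as "first" on every re-resolve.

```go
package main

import (
	"fmt"
	"math/rand"
)

// firstRecord models a lookup whose A records come back in a varying order:
// the dialer effectively connects to whichever record happens to be first.
func firstRecord(rng *rand.Rand, addrs []string) string {
	return addrs[rng.Intn(len(addrs))]
}

// fillPools creates one pool per initially resolved address and opens
// numConns connections for each. It returns, per pool key, how many
// connections landed on each real address.
func fillPools(seed int64, addrs []string, numConns int, reResolve bool) map[string]map[string]int {
	rng := rand.New(rand.NewSource(seed))
	pools := make(map[string]map[string]int, len(addrs))
	for _, key := range addrs {
		pools[key] = make(map[string]int)
		for i := 0; i < numConns; i++ {
			target := key // dial the address the pool is keyed by
			if reResolve {
				target = firstRecord(rng, addrs) // dial the hostname again
			}
			pools[key][target]++
		}
	}
	return pools
}

// perAddress sums connections by the address they actually reached.
func perAddress(pools map[string]map[string]int) map[string]int {
	totals := make(map[string]int)
	for _, conns := range pools {
		for addr, n := range conns {
			totals[addr] += n
		}
	}
	return totals
}

func main() {
	addrs := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} // placeholder IPs
	fmt.Println("dial hostname:", perAddress(fillPools(1, addrs, 100, true)))
	fmt.Println("dial IP:      ", perAddress(fillPools(1, addrs, 100, false)))
}
```

When the pool key is dialed directly, every address ends up with exactly `numConns` connections. When the hostname is re-resolved per dial, the total stays the same but the per-address counts drift apart, and each pool holds connections to several addresses.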