Xandra.Cluster connects, but doesn't think so #314
I realized that there is telemetry not being logged, so I added a logger for the telemetry around connections.
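For reference, the logger I attached looks roughly like this. The exact telemetry event names depend on the Xandra version, so treat the event list below as illustrative and check `Xandra.Telemetry`'s docs for the one that applies to you:

```elixir
# Hedged sketch: a logger for Xandra's connection telemetry. The event
# names below are illustrative and vary across Xandra versions; check
# the Xandra.Telemetry docs for your version before copying this.
defmodule MyApp.XandraTelemetryLogger do
  require Logger

  @events [
    [:xandra, :connected],
    [:xandra, :disconnected],
    [:xandra, :cluster, :change_event],
    [:xandra, :cluster, :control_connection, :connected]
  ]

  def attach do
    :telemetry.attach_many("xandra-conn-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event(event, measurements, metadata, _config) do
    Logger.info(
      "Xandra telemetry: #{inspect(event)} " <>
        "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
    )
  end
end
```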
The telemetry reveals that it is not connecting to more than one host in the cluster. Additionally, every time the topology refresh runs, it swaps the two hosts that do not hold the control connection in and out of the cluster. This appears to have no effect on creating or stopping connections, however. Here is the startup of the application in a data center with 3 nodes.

It establishes the control connection to one of the hosts. From then on, when the topology refresh happens, it oscillates the other two hosts into and out of the cluster like this:
Something very odd is going on with the cluster discovery.
Hey Karl! Sorry for the late response, I've been a busy bee with other stuff. So, it looks like this is good up to that point, but then it goes haywire. I think I know what's going on: we introduced a "health check" for connected hosts, since Cassandra's internal change events are basically garbage. So, I think what's happening is that it's not quick enough to connect to the nodes, so the next time we do this health check, most nodes sort of report as "down", which causes the oscillation you're seeing. I'll take a look at merging #309 soon, and then try to fix this. I'll let you know when you'll have something to try out!
@relistan one question: if you query `system.peers` on the node that holds the control connection, do you see the other two hosts?
Thanks @whatyouhide, that makes a lot of sense. Yes, when I query the control connection node I see the other two in `system.peers`.
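A minimal way to run that check from an IEx session, connecting a single Xandra connection straight to the control-connection node (the node address here is a placeholder):

```elixir
# Connect one plain Xandra connection directly to the node that holds
# the control connection (address is a placeholder) and list the peers
# it knows about.
{:ok, conn} = Xandra.start_link(nodes: ["10.0.0.1:9042"])

{:ok, page} =
  Xandra.execute(conn, "SELECT peer, data_center, rpc_address FROM system.peers")

for row <- page, do: IO.inspect(row)
```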
Hey @relistan! I've been hard at work rewriting Xandra's clustering functionality. Wooph 😓 It should be more reliable now, but it's painful as hell to test this locally with small synthetic clusters. I'm working with folks at Veeps (where I work) to be able to better test this as well, but I wanted to tell you to give it a try!
Thanks @whatyouhide! I'm out until the middle of next week, but maybe @britto or @jacquelineIO can take a look before that if they have time.
That would be fantastic, as it would also help me figure out if something is wrong and start working on fixes. @britto and @jacquelineIO, please reach out if I can help in any way!
Xandra has changed so much in the past couple of months that I'm gonna go ahead and say that this is fixed 😄 It might not be, but I'm pretty sure any error would look different at this point!
I'm still digging to try to figure out what is going on with our inability to get Xandra to connect reliably. Any pointers appreciated!
We are now running 0.16.0 with the peers patch that is on the Draft PR.
I deployed a copy of our service that I could manually start in the container with `start_iex`. I ran `tcpdump` to capture the network traffic. What I see is that the first connection that goes out successfully connects, and it stays connected the whole time. Many other connections are made, all of which seem to establish a TCP session. I cannot see anything different at the TCP level between when Xandra succeeds in starting the cluster and when it fails. Because the traffic is over SSL (and I can't change that), I cannot see what is going on at the Cassandra protocol level. I assume that if there were an actual protocol issue, I would see Cassandra sending RSTs or FINs on the connection, which I don't see.

Attached is a screenshot of Wireshark displaying the first connection. You can see by the packet count that it stays connected until I exit the application.
I built a little startup watcher GenServer as a child of `Application` that prevents our processing pipelines from starting until `Xandra.Cluster` is up. It tells you how long it waited and how many tries that took. There seems to be no pattern in the amount of time it takes. A simplified sketch of the watcher is below.
I do not believe this is server-side, because we run `golang-migrate` on every app startup and it has no trouble connecting to Cassandra, only a few seconds before Xandra starts up.

Things I noticed:

- I set `topology_refresh_interval` to 30 seconds as a debugging experiment, and that may be the thing that sometimes kicks it in the pants if it doesn't connect the first time. Anecdotally there is a correlation timestamp-wise. (See the config sketch after this comment.)

My observations may point to some kind of timing issue, but I am not sure where to look. Something with internal messaging between the BEAM processes?
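That refresh experiment is just a start option on the cluster. A sketch, assuming the `:refresh_topology_interval` spelling from the Xandra docs (double-check the option name for your version; the node address and process name are placeholders):

```elixir
# Debugging experiment: crank the topology refresh down to 30 seconds.
# The option is spelled :refresh_topology_interval (milliseconds) in the
# Xandra versions I checked; node address and name are placeholders.
{:ok, _cluster} =
  Xandra.Cluster.start_link(
    nodes: ["10.0.0.1:9042"],
    refresh_topology_interval: :timer.seconds(30),
    name: MyApp.Cluster
  )
```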