Xandra.Cluster connects, but doesn't think so #314

Closed
relistan opened this issue Jul 31, 2023 · 9 comments

relistan commented Jul 31, 2023

I'm still digging to figure out what is going on with our inability to connect with Xandra reliably. Any pointers are appreciated!

We are now running 0.16.0 with the peers patch that is on the Draft PR.

I deployed a copy of our service that I could manually start in the container with start_iex. I ran tcpdump to capture the network traffic. What I see is that the first connection that goes out successfully connects. It stays connected the whole time. Many other connections are made, all of which seem to establish a TCP session. I cannot see anything different at the TCP level between when Xandra succeeds in starting the cluster and when it fails. Because it is over SSL (and I can't change that) I cannot see what is going on at the Cassandra protocol level. I assume if there were an actual protocol issue that I would see Cassandra sending RSTs or FINs on the connection, which I don't see.

Attached is a screenshot of Wireshark displaying the first connection. You can see by the packet count that it stays connected until I exit the application.

I built a little startup watcher GenServer as a child of Application that prevents our processing pipelines from starting until Xandra.Cluster is up. It tells you how long it waited and how many tries that took. There seems to be no pattern in the amount of time it takes.
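
Roughly, the watcher looks like this (a sketch, not our actual code: the module name, retry interval, and readiness query are placeholders; only the Store cluster name matches the telemetry logs later in this thread):

defmodule MyApp.XandraStartupWatcher do
  # Hypothetical sketch. Started in the Application supervision tree after the
  # Xandra.Cluster child and before the processing pipelines, so the pipelines
  # don't start until init/1 returns.
  use GenServer
  require Logger

  @retry_ms 250

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    started_at = System.monotonic_time(:millisecond)
    tries = wait_until_ready(1)
    waited_ms = System.monotonic_time(:millisecond) - started_at
    Logger.info("Xandra.Cluster ready after #{waited_ms}ms and #{tries} tries")
    {:ok, %{}}
  end

  # Assumed readiness probe: a cheap query routed through the cluster
  # (registered under the name Store).
  defp wait_until_ready(tries) do
    case Xandra.Cluster.execute(Store, "SELECT now() FROM system.local") do
      {:ok, _page} ->
        tries

      {:error, _reason} ->
        Process.sleep(@retry_ms)
        wait_until_ready(tries + 1)
    end
  end
end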

I do not believe this is server-side because we run golang-migrate on every app startup and it has no trouble connecting to Cassandra at all, only a few seconds before Xandra starts up.

Things I noticed:

  • I set the topology_refresh_interval to 30 seconds as a debugging experiment; that may be the thing that sometimes kicks it in the pants if it doesn't connect the first time. Anecdotally there is a timestamp-wise correlation. (A config sketch follows this list.)
  • When I set the CPU limit very low, it is almost always able to connect right away.
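
For reference, the interval goes in the cluster's child spec, roughly like the sketch below (the node address is a placeholder, and recent Xandra versions spell the option refresh_topology_interval, so check the docs for the version you run):

# Sketch only: "cassandra.internal" is a placeholder; the port matches this deployment.
{Xandra.Cluster,
 name: Store,
 nodes: ["cassandra.internal:11172"],
 # Debugging experiment: refresh the topology every 30s instead of the default.
 refresh_topology_interval: :timer.seconds(30)}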

My observations may point to some kind of timing issue. But I am not sure where to look. Something with internal messaging between the BEAM processes?

[Screenshot: Wireshark capture of the first connection, taken 2023-07-31 at 2:30:31 PM]

relistan commented:

I realized that there is telemetry we weren't logging, so I'm adding a Logger handler for the connection-related telemetry events.
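
Roughly like this (a sketch; the event names are inferred from the resulting log prefixes and Xandra's telemetry docs, so double-check them against the version you run). attach/0 gets called once at application start.

defmodule MyApp.XandraTelemetryLogger do
  # Sketch of a telemetry-to-Logger bridge for Xandra's connection events.
  require Logger

  @events [
    [:xandra, :connected],
    [:xandra, :cluster, :change_event],
    [:xandra, :cluster, :control_connection, :connected],
    [:xandra, :cluster, :discovered_peers]
  ]

  def attach do
    :telemetry.attach_many("xandra-connection-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:xandra, :cluster | event], _measurements, metadata, _config) do
    Logger.info("Xandra Cluster: #{inspect(event)}: #{inspect(metadata)}")
  end

  def handle_event([:xandra | event], _measurements, metadata, _config) do
    Logger.info("Xandra: #{inspect(event)}: #{inspect(metadata)}")
  end
end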


relistan commented Aug 1, 2023

The telemetry reveals that it is not connecting to more than one host in the cluster. Additionally, every time the topology refresh runs, it swaps the two hosts that do not hold the control connection in and out of the cluster. This appears to have no effect on creating or stopping connections, however.

Here is the startup of the application in a data center with 3 nodes.

1-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 42}, data_center: "redacted", host_id: "ea819914-e415-4a70-8726-9d6c91653ffa", port: 11172, rack: "usw2-az2", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.619 [info] Xandra Cluster: [:control_connection, :connected]: %{cluster_name: Store, cluster_pid: #PID<0.5486.0>, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:03.130 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:03.145 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:03.535 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.589 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.706 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.723 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.753 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.790 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.883 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.906 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.907 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.909 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.927 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.934 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.944 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.950 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.951 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}

It establishes the control connection to 10.4.0.9. It then adds 10.4.0.86 and 10.4.0.42. It then marks 10.4.0.9 down even while the control connection is still connected. It then marks 10.4.0.86 as down. Then the only host left is 10.4.0.42 so it opens all the connections to that host. However, the control connection remains open to 10.4.0.9 and the other two hosts are actually up.

From then on, when the topology refresh happens, it oscillates the other two hosts into and out of the cluster like this:

11:47:03.166 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:47:03.678 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:52:02.658 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_up, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:52:02.659 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_up, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}

Something very odd is going on with the cluster discovery.

whatyouhide commented:

Hey Karl! Sorry for the late response, I've been a busy bee with other stuff. So, it looks like this is good up to

11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 42}, data_center: "redacted", host_id: "ea819914-e415-4a70-8726-9d6c91653ffa", port: 11172, rack: "usw2-az2", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.619 [info] Xandra Cluster: [:control_connection, :connected]: %{cluster_name: Store, cluster_pid: #PID<0.5486.0>, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}

but then it goes haywire. I think I know what's going on: we introduced a "health check" for connected hosts, since Cassandra's internal change events are basically garbage. So, I think what's happening is that it's not quick enough to connect to the nodes, so the next time we do this health check, most nodes sort of report as "down", which causes the :host_down event. If you reduce CPU, it's probably not fast enough to do the health check so it doesn't disconnect.
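
To illustrate the suspected race (this is not Xandra's actual code, just a sketch): if the periodic check only looks at whether each host's connection pool has finished connecting, a CPU-starved startup makes perfectly healthy hosts look down.

# Illustration only, not Xandra's implementation.
defmodule HealthCheckRaceSketch do
  # hosts: the peers discovered via the control connection.
  # connected_pools: the hosts whose pools have finished connecting; on a slow
  # startup this set lags behind, so healthy hosts get reported as :down.
  def check_hosts(hosts, connected_pools) do
    Enum.map(hosts, fn host ->
      {host, if(MapSet.member?(connected_pools, host), do: :up, else: :down)}
    end)
  end
end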

I'll take a look at merging #309 soon, and then try to fix this. I'll let you know when you'll have something to try out!

whatyouhide commented:

@relistan one question: if you query 10.4.0.9 (the control connection node) for system.peers, you see the other peers, right? You should have a [:xandra, :cluster, :discovered_peers] Telemetry event for that.
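
Something like this should show it, from a plain single-node connection straight to 10.4.0.9 (sketch only; the encryption/auth options the cluster needs are omitted):

# Assumption: a bare Xandra connection to the control-connection node.
{:ok, conn} = Xandra.start_link(nodes: ["10.4.0.9:11172"])
{:ok, page} = Xandra.execute(conn, "SELECT peer, data_center, rack, host_id FROM system.peers")
Enum.each(page, &IO.inspect/1)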


relistan commented Aug 2, 2023 via email

whatyouhide modified the milestones: v0.15.0, 0.18.0 (Aug 3, 2023)
whatyouhide changed the title from "Xandra.Cluster connects, but doesn't think so (0.16.0)" to "Xandra.Cluster connects, but doesn't think so" (Aug 3, 2023)

whatyouhide commented:

Hey @relistan! I've been hard at work rewriting Xandra's clustering functionality. Wooph 😓

It should be more reliable now, but it's painful as hell to test this locally with small synthetic clusters. I’m working with folks at Veeps (where I work) to be able to better test this as well, but I wanted to tell you to give main a try if you want and see how it goes 😉
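
Pointing a project at main is the usual Mix git dependency, for example (repo path assumed from this project):

# In mix.exs, to try the unreleased clustering rewrite from main:
defp deps do
  [
    {:xandra, github: "whatyouhide/xandra", branch: "main"}
  ]
end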

relistan commented:

Thanks @whatyouhide! I'm out until the middle of next week but maybe @britto or @jacquelineIO can take a look before that if they have time.

whatyouhide commented:

That would be fantastic, as it would also help me figure out if something is wrong and start working on fixes. @britto and @jacquelineIO, please reach out if I can help in any way!

whatyouhide commented:

Xandra has changed so much in the past couple of months that I’m gonna go ahead and say that this is fixed 😄 It might not be, but I’m pretty sure any error would look different at this point!
