Xandra.Cluster connects, but doesn't think so #314
I realized that there is telemetry not being logged, so I added a logger for the telemetry around connections.
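For reference, the logger I attached looks roughly like this. The exact telemetry event names depend on the Xandra version, so treat the event list below as illustrative and check `Xandra.Telemetry`'s docs for the one that applies to you:

```elixir
# Hedged sketch: a logger for Xandra's connection telemetry. The event
# names below are illustrative and vary across Xandra versions; check
# the Xandra.Telemetry docs for your version before copying this.
defmodule MyApp.XandraTelemetryLogger do
  require Logger

  @events [
    [:xandra, :connected],
    [:xandra, :disconnected],
    [:xandra, :cluster, :change_event],
    [:xandra, :cluster, :control_connection, :connected]
  ]

  def attach do
    :telemetry.attach_many("xandra-conn-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event(event, measurements, metadata, _config) do
    Logger.info(
      "Xandra telemetry: #{inspect(event)} " <>
        "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
    )
  end
end
```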
The telemetry reveals that it is not connecting to more than one host in the cluster. Additionally, every time the topology refresh runs, it swaps the two hosts that do not hold the control connection in and out of the cluster. This appears to have no effect on creating or stopping connections, however. Here is the startup of the application in a data center with 3 nodes.

It establishes the control connection to one of the hosts. From then on, when the topology refresh happens, it oscillates the other two hosts into and out of the cluster like this:
Something very odd is going on with the cluster discovery.
Hey Karl! Sorry for the late response, I've been a busy bee with other stuff. So, it looks like this is good up to that point, but then it goes haywire. I think I know what's going on: we introduced a "health check" for connected hosts, since Cassandra's internal change events are basically garbage. So, I think what's happening is that it's not quick enough to connect to the nodes, so the next time we do this health check, most nodes sort of report as "down", which causes the oscillation you're seeing. I'll take a look at merging #309 soon, and then try to fix this. I'll let you know when you'll have something to try out!
@relistan one question: if you query `system.peers` on the node that holds the control connection, do you see the other two hosts?
Thanks @whatyouhide, that makes a lot of sense. Yes, when I query the control connection node I see the other two in `system.peers`.
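A minimal way to run that check from an IEx session, connecting a single Xandra connection straight to the control-connection node (the node address here is a placeholder):

```elixir
# Connect one plain Xandra connection directly to the node that holds
# the control connection (address is a placeholder) and list the peers
# it knows about.
{:ok, conn} = Xandra.start_link(nodes: ["10.0.0.1:9042"])

{:ok, page} =
  Xandra.execute(conn, "SELECT peer, data_center, rpc_address FROM system.peers")

for row <- page, do: IO.inspect(row)
```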
Hey @relistan! I've been hard at work rewriting Xandra's clustering functionality. Wooph 😓 It should be more reliable now, but it's painful as hell to test this locally with small synthetic clusters. I'm working with folks at Veeps (where I work) to be able to better test this as well, but I wanted to tell you to give it a try!
Thanks @whatyouhide! I'm out until the middle of next week, but maybe @britto or @jacquelineIO can take a look before that if they have time.
That would be fantastic, as it would also help me figure out if something is wrong and start working on fixes. @britto and @jacquelineIO, please reach out if I can help in any way!
Xandra has changed so much in the past couple of months that I'm gonna go ahead and say that this is fixed 😄 It might not be, but I'm pretty sure any error would look different at this point!
I'm still digging to try to figure out what is going on with our inability to get Xandra to connect reliably. Any pointers appreciated!
We are now running 0.16.0 with the peers patch that is on the Draft PR.
I deployed a copy of our service that I could manually start in the container with `start_iex`. I ran `tcpdump` to capture the network traffic. What I see is that the first connection that goes out successfully connects, and it stays connected the whole time. Many other connections are made, all of which seem to establish a TCP session. I cannot see anything different at the TCP level between when Xandra succeeds in starting the cluster and when it fails. Because the traffic is over SSL (and I can't change that), I cannot see what is going on at the Cassandra protocol level. I assume that if there were an actual protocol issue, I would see Cassandra sending RSTs or FINs on the connection, which I don't see.

Attached is a screenshot of Wireshark displaying the first connection. You can see by the packet count that it stays connected until I exit the application.
I built a little startup watcher GenServer as a child of `Application` that prevents our processing pipelines from starting until `Xandra.Cluster` is up. It tells you how long it waited and how many tries that took. There seems to be no pattern in the amount of time it takes. A simplified sketch of the watcher is below.
I do not believe this is server-side, because we run `golang-migrate` on every app startup and it has no trouble connecting to Cassandra, only a few seconds before Xandra starts up.

Things I noticed:

- I set `topology_refresh_interval` to 30 seconds as a debugging experiment, and that may be the thing that sometimes kicks it in the pants if it doesn't connect the first time. Anecdotally there is a correlation timestamp-wise. (See the config sketch after this comment.)

My observations may point to some kind of timing issue, but I am not sure where to look. Something with internal messaging between the BEAM processes?
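That refresh experiment is just a start option on the cluster. A sketch, assuming the `:refresh_topology_interval` spelling from the Xandra docs (double-check the option name for your version; the node address and process name are placeholders):

```elixir
# Debugging experiment: crank the topology refresh down to 30 seconds.
# The option is spelled :refresh_topology_interval (milliseconds) in the
# Xandra versions I checked; node address and name are placeholders.
{:ok, _cluster} =
  Xandra.Cluster.start_link(
    nodes: ["10.0.0.1:9042"],
    refresh_topology_interval: :timer.seconds(30),
    name: MyApp.Cluster
  )
```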