refresh ring when node UP event received for missing host #1669

sseidman · 2023-01-11T21:48:09Z

The current issue is that there are no events dispatched for NEW_NODE changes. In a 3 node cassandra cluster deployed on cloud infrastructure, a cassandra instance was replaced using the replace_address flag so that it was moved to a new cloud instance (new IP, new host ID). The following debug logs are an example of how the client is currently handling this change in topology.

2023/01/11 20:59:35 gocql: handling frame: [topology_change change=NEW_NODE host=10.128.191.122 port=9042]
2023/01/11 20:59:35 gocql: handling frame: [status_change change=UP host=10.128.191.122 port=9042]
2023/01/11 20:59:35 gocql: dispatching event: &{change:UP host:[10 128 191 122] port:9042}
2023/01/11 20:59:35 gocql: Session.handleNodeUp: 10.128.191.122:9042

There is no event dispatched for the NEW_NODE event and it jumps straight to processing the UP event, so the new node is never added to session ring. If all nodes in the cluster are replaced in this manner, the eventual outcome is that clients lose connection to the cluster and begin outputting gocql: no hosts available in the pool.

After bisecting recent commits, I found this PR introduced the bug into the client. It looks like there was seemingly a fail-safe for when the NEW_NODE event was missed, but this function was removed in the previously mentioned PR.

This change proposes to refresh the hostSource ring on UP events when the host cannot be found in the ring. This ensures that the hostMap stays up to date even if NEW_NODE events are not processed.

This should fix #1668, #1667, and #1582

sseidman · 2023-01-20T20:12:38Z

To be clear, this is occurs when multiple frames are passed to handleNodeEvent here. The NEW_NODE event can get overwritten by the UP event depending on what order they are processed:

2023/01/20 19:55:50 debug: begin frame processing loop
2023/01/20 19:55:50 debug: processing frame
2023/01/20 19:55:50 debug: topology frame event. change: NEW_NODE host: 10.131.0.96
2023/01/20 19:55:50 debug: host not duplicate, adding event to hostEvent map
2023/01/20 19:55:50 debug: setting event change type
2023/01/20 19:55:50 debug: processing frame
2023/01/20 19:55:50 debug: status frame event. change: UP host: 10.131.0.96
2023/01/20 19:55:50 debug: setting event change type
2023/01/20 19:55:50 debug: end frame processing loop

sseidman · 2023-02-02T19:29:33Z

any thoughts on this?

mikelococo · 2023-02-10T15:49:53Z

We're now running these changes on our sizeable production infrastructure, having built them off a private fork.

This issue doesn't just affect folks who cycle ip's on Cassandra restart, it also prevents node-rotation. Our IPs are stable across restarts, but we regularly perform vertical scaling operations with -Dcassandra.replace_address_first_boot=<dead_node_ip> where every Cassandra node is replaced in sequence with bigger hardware running on a new IP address. This operation was regularly causing serving impacts where the gocql clients would fail to re-establish connections to the new nodes resulting first in hotspots and then eventually in complete loss of connectivity to the cluster. It appears that a growing number of folks are impacted by this issue.

We (sseidman, myself, and the folks we work with) are happy to do any reasonable amount of work required to make this fix (or a better designed fix, if the maintainers think this is a non-optimal approach) upstreamable. Whether that's more tests, docs, testing alternate approaches... I think there's significant value in addressing this bug and we're motivated to try to get a fix merged if there's a world where that can happen. I don't think we plan on doing further nudges here to try to elicit a maintainer response, though.

gocql#1669 gocql#1670

StevenLacerda · 2023-03-13T17:34:07Z

@martin-sucha Any chance this can be reviewed and merged?

lwimmer · 2023-03-20T13:45:58Z

#915 seems to be related too. Do you think this will fix it too?

martin-sucha · 2023-03-20T16:48:33Z

Hi! Looks good to me. Sorry for the delay.

What do you guys think of merging #1680 instead? Would that one work for you? Seems that it handles more cases.

sseidman added 2 commits January 11, 2023 16:15

refresh ring when Node Up event received for unknown host

b9c80a0

add name/email to authors

d80175d

PierreF mentioned this pull request Feb 8, 2023

"gocql: no hosts available in the pool" cannot recover without restart #915

Open

mfamador added a commit to mfamador/gocql that referenced this pull request Feb 10, 2023

chore: fixes

96084af

gocql#1669 gocql#1670

joao-r-reis mentioned this pull request Mar 13, 2023

Always refresh ring on topology changes and reconnections #1680

Merged

lwimmer mentioned this pull request Mar 21, 2023

Fix k8s reconnect #1682

Merged

martin-sucha merged commit 1d26125 into gocql:master Mar 27, 2023

joao-r-reis mentioned this pull request Sep 29, 2023

The driver does not seem to reconnect when nodes are restarted in a rolling fashion. datastax/gocql-astra#5

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refresh ring when node UP event received for missing host #1669

refresh ring when node UP event received for missing host #1669

sseidman commented Jan 11, 2023 •

edited

sseidman commented Jan 20, 2023

sseidman commented Feb 2, 2023

mikelococo commented Feb 10, 2023 •

edited

StevenLacerda commented Mar 13, 2023

lwimmer commented Mar 20, 2023

martin-sucha commented Mar 20, 2023

refresh ring when node UP event received for missing host #1669

refresh ring when node UP event received for missing host #1669

Conversation

sseidman commented Jan 11, 2023 • edited

sseidman commented Jan 20, 2023

sseidman commented Feb 2, 2023

mikelococo commented Feb 10, 2023 • edited

StevenLacerda commented Mar 13, 2023

lwimmer commented Mar 20, 2023

martin-sucha commented Mar 20, 2023

sseidman commented Jan 11, 2023 •

edited

mikelococo commented Feb 10, 2023 •

edited