Tweak logs and attempt to avoid races around removing nodes #504

Merged
merged 2 commits into master from jsdw-logs-and-removing on Oct 10, 2022

Conversation

@jsdw (Collaborator) commented Oct 5, 2022

  • Updated some of the log messages to the nicer inline fmt style, and tweaked the odd message.
  • Don't queue up "remove node" messages from shard to core if disconnected; they won't make sense after a reconnect (see the sketch just below this list).
  • Try to avoid reusing IDs; reusing them might cause races around disconnects if messages are queued or received with "old" IDs while new nodes have been assigned those same IDs.
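
A minimal sketch of the second bullet, purely for illustration: ConnectionState, Message, and ShardToCore below are hypothetical stand-ins, not the actual telemetry types.

enum ConnectionState { Connected, Disconnected }

enum Message { Remove { id: u64 } }

struct ShardToCore {
    state: ConnectionState,
    queue: Vec<Message>,
}

impl ShardToCore {
    fn send_remove(&mut self, id: u64) {
        match self.state {
            // Still connected: safe to queue the removal for the core.
            ConnectionState::Connected => self.queue.push(Message::Remove { id }),
            // Disconnected: drop the message. After a reconnect the core
            // assigns fresh IDs, so a stale "remove" would be meaningless
            // (or worse, hit a new node that reused the ID).
            ConnectionState::Disconnected => {}
        }
    }
}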

// It's possible that some race between removing and disconnecting shards might lead to
// more than one remove message for the same node. This isn't really a problem, but we
// hope it won't happen so make a note if it does:
log::debug!(
A contributor commented on this diff:

Are we worried about the process consuming too much memory because we may never call remove_nodes_and_broadcast_result here, potentially leading to an OOM kill?

Would it be possible for a malicious party to force the shard to submit Remove { random_id }? And in that case, could we "mute" / disconnect the shard if it makes N wrong Remove calls?

@jsdw (Collaborator, PR author) replied:

It shouldn't be possible for any party to have any control over any of the ids; we assign them all internally.

And if the id doesn't exist for this remove message, it means the node has been removed already :)

Remove calls are also entirely internal; an external person cannot cause it to be called except by disconnecting (or when messages with certain ids go stale).
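
To make that idempotency point concrete, a rough sketch: Core, NodeId, and remove_node are hypothetical names; only the log::debug! pattern comes from the diff above, and it assumes the log crate is available, as in that diff.

use std::collections::HashMap;

type NodeId = u64;

struct Core {
    nodes: HashMap<NodeId, String>,
}

impl Core {
    fn remove_node(&mut self, id: NodeId) {
        if self.nodes.remove(&id).is_none() {
            // The ID is already gone: most likely the removal/disconnect
            // race described in the code comment above. Harmless, but
            // worth a note.
            log::debug!("Remove received for unknown node id {id}");
        }
    }
}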

// Leave the `current_id` as-is. Why? To avoid reusing IDs and risking
// race conditions where old messages can accidentally screw with new nodes
// that have been assigned the same ID.
self.mapping = BiMap::new();
@niklasad1 (Member) commented on this diff, Oct 5, 2022:

hmm, I don't understand.

Before, BiMap::new() reset the id to 0, so after a clear multiple nodes/entities could end up reusing id == 0; now we just leave the id alone and the next one assigned will be this.id + 1?

It also looks like the id will wrap around in release mode if this.id + 1 > usize::MAX.

Can you add an explicit panic or something so we can be sure it doesn't wrap around and end up replacing an "old ID"? Or what would happen then?

@jsdw (Collaborator, PR author) replied:

Yup, I did this because I worried that "old" IDs might be reused in some sorts of obscure race conditions, leading to new nodes being affected by "old" messages or something like that.

Indeed, eventually it'll overflow and panic, but since the ID is a 64-bit value it'll never realistically hit usize::MAX, so I wasn't worried about overflowing (I almost added an explicit wrapping_add, but meh).

@jsdw (Collaborator, PR author) added:

Since it was so easy to do, I did make the thing wrapping_add :)
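
For reference, a rough sketch of what the resulting ID assignment might look like, assuming the bimap crate; the struct and method names are guesses based on the snippets in this thread, not the actual telemetry code:

use bimap::BiMap;

struct NodeMapping {
    current_id: usize,
    mapping: BiMap<usize, String>,
}

impl NodeMapping {
    fn assign_id(&mut self, name: String) -> usize {
        let id = self.current_id;
        self.mapping.insert(id, name);
        // wrapping_add makes the overflow behaviour explicit (no panic in
        // debug builds); a 64-bit counter won't realistically get there.
        self.current_id = self.current_id.wrapping_add(1);
        id
    }

    fn clear(&mut self) {
        // Reset the map but deliberately leave `current_id` as-is, so IDs
        // are never reused across a disconnect.
        self.mapping = BiMap::new();
    }
}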

@jsdw jsdw merged commit e7d15d0 into master Oct 10, 2022
@jsdw jsdw deleted the jsdw-logs-and-removing branch October 10, 2022 12:12
Rose2161 pushed a commit to Cryptob3auty/mtsubstrate-telemen that referenced this pull request Apr 15, 2024
…ch#504)

* Tweak logs and attempt to avoid races around removing nodes

* wrapping_add in assign_id