Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix a bug where servers could be marked as up when they were failing #16506

Merged
merged 5 commits into from Oct 17, 2023

Conversation

clokep
Copy link
Contributor

@clokep clokep commented Oct 16, 2023

As described in #15226, the RetryDestinationLimiter seems to mark servers as up when the connection fails. It seems to be a regression from #12500, see #12500 (comment).

I think this will cause the notifier to be woken up and additional replication traffic. I think it will then cause federation traffic via a new transaction being sent to the unreachable server?

@@ -0,0 +1 @@
Fix a bug introduced in Synapse 1.59.0 where servers would be incorrectly marked as available when a request resulted in an error.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a more visible symptom? Perhaps it would cause things to be retried too often?

# the notifier.
self.replication_client.send_remote_server_up(self.destination)
# If the server was previously failing, but is no longer.
if previously_failing:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikjohnston this might need some thoughts from you as the original author of #12500 -- was this done on purpose and I'm missing some understanding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think this is not quite right still, it will end up calling this code if we were previously failing & still failing. I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think this is not quite right still, it will end up calling this code if we were previously failing & still failing. I think?

It should be OK now. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic could be simplified to only check not currently_failing, which depends on the earlier return to not kick off the background process. But I find this a bit clearer to include previously_failing and not currently_failing. 🤷

tests/util/test_retryutils.py Outdated Show resolved Hide resolved
@clokep clokep marked this pull request as ready for review October 16, 2023 19:40
@clokep clokep requested a review from a team as a code owner October 16, 2023 19:40
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems legit to me!

@clokep clokep merged commit 77dfc1f into develop Oct 17, 2023
36 of 38 checks passed
@clokep clokep deleted the clokep/server-up branch October 17, 2023 11:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify retry notification code
2 participants