Skip to content

Conversation

Anemy
Copy link
Member

@Anemy Anemy commented Jan 8, 2025

CLOUDP-286331

When we're on cloud we listen for non-retry-able errors on failed server heartbeats. These can happen when:

  • A user's session has ended.
  • The user's roles have changed (so they lack permissions).
  • The cluster / group they are trying to connect to has since been deleted.

When we encounter one we disconnect. This is to avoid polluting logs/metrics and to avoid constantly retrying to connect when we know it'll fail.

Screenshot 2025-01-08 at 6 17 27 PM

When a user runs a command after we've disconnected they end up with errors like this:

Screenshot 2025-01-08 at 6 18 23 PM

Which we surface ourselves. While they do have a message from the toast I'm thinking we probably want to give something less cryptic there. A bit more context, we also chatted about fully disconnecting from the connection, which would close all of their open tabs, but that may end up in them losing some work which would be a frustrating ux. Discussion: https://mongodb.slack.com/archives/C069YM25L8N/p1735662529296989

Not an easily testable flow as these are mostly internals of data service and mongo client, curious if folks think it's worth some mocks to facilitate tests for it. We'd need to simulate the serverHeartbeatFailed events on a mongo client and have the custom mongo client passed to the data service.

Custom close codes in mms: https://github.com/10gen/mms/blob/24f5af9c5318a5da746ad328547591f376449dc0/server/src/main/com/xgen/cloud/services/clusterconnection/runtime/res/CustomCloseCodes.java#L5
Where they are returned on pings which form the serverHeartbeatFailed events: https://github.com/10gen/mms/blob/24f5af9c5318a5da746ad328547591f376449dc0/server/src/main/com/xgen/cloud/services/clusterconnection/runtime/res/ClusterConnectionEndpoint.java#L92

Copy link
Collaborator

@syn-zhu syn-zhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another PR on the MMS side you're currently working on?

@gribnoysup
Copy link
Collaborator

Not an easily testable flow as these are mostly internals of data service and mongo client

I think you definitely should be able to add unit tests in the compass-connections for this by emitting the event manually on a mocked dataService, we have all the tooling to setup a test like that and at least it would cover the error parsing logic. Should be possible to add some e2e tests too if we want to by modifying our ws-proxy code, we can chat more about how to set that up

@Anemy
Copy link
Member Author

Anemy commented Jan 13, 2025

@syn-zhu

Is there another PR on the MMS side you're currently working on?

No, this is the only branch I've been working on. Is there work you foresee us needing on the mms side? I see there's already support for the ping/pongs checking the rolls and throwing these errors https://github.com/10gen/mms/blob/de2a9c463cfe530efb8e2a0941033e8207b6cb11/server/src/main/com/xgen/cloud/services/clusterconnection/runtime/res/ClusterConnectionEndpoint.java#L521
Maybe I'm missing some work we need to do there?

@Anemy Anemy marked this pull request as ready for review January 14, 2025 18:16
Copy link
Collaborator

@syn-zhu syn-zhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment, everything else looks good

@Anemy
Copy link
Member Author

Anemy commented Jan 15, 2025

Chatted a bit with Sergey in slack, not going to go for an e2e for this at least for now. I'm going to wait on merging until I can reliably get this working with our main branch of mms. It might take some backend changes, having trouble getting these expected heartbeat failures at the moment (I developed this by manually throwing them from ccs), it may go away once we merge https://github.com/10gen/mms/pull/116288

@Anemy
Copy link
Member Author

Anemy commented Jan 16, 2025

Tried this out with the pr Simon has open in mms and it worked nicely.

@Anemy Anemy merged commit ba5c36f into main Jan 16, 2025
28 of 30 checks passed
@Anemy Anemy deleted the CLOUDP-286331 branch January 16, 2025 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants