chore(connections): disconnect when we encounter a non-retryable error code on an atlas connection CLOUDP-286331 #6598

Anemy · 2025-01-08T23:53:32Z

When we're on cloud, we listen for non-retryable errors on failed server heartbeats. These can happen when:

A user's session has ended.
The user's roles have changed (so they lack permissions).
The cluster / group they are trying to connect to has since been deleted.

When we encounter one we disconnect. This is to avoid polluting logs/metrics and to avoid constantly retrying to connect when we know it'll fail.
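A minimal sketch of the behaviour described above, not the code in this PR. It assumes the failure surfaces as an error carrying a numeric `code`, and the code values, `isNonRetryable`, and `disconnectOnNonRetryableError` are all hypothetical names and placeholders; the real codes are the custom close codes defined in mms.

```typescript
import { EventEmitter } from "events";

// Hypothetical placeholder values; the real ones live in CustomCloseCodes.java in mms.
const NON_RETRYABLE_CODES: ReadonlySet<number> = new Set([4001, 4003]);

// Assumed shape for illustration: an error that carries a numeric `code`.
interface HeartbeatFailure extends Error {
  code?: number;
}

interface HeartbeatFailedEvent {
  failure: HeartbeatFailure;
}

function isNonRetryable(failure: HeartbeatFailure): boolean {
  return typeof failure.code === "number" && NON_RETRYABLE_CODES.has(failure.code);
}

// Subscribes to heartbeat failures and disconnects once, on the first
// non-retryable failure. Returns a function that removes the listener.
function disconnectOnNonRetryableError(
  client: EventEmitter,
  disconnect: (reason: HeartbeatFailure) => void
): () => void {
  const onFailed = (event: HeartbeatFailedEvent): void => {
    if (!isNonRetryable(event.failure)) {
      return; // transient failure: leave the normal retry behaviour alone
    }
    // Stop listening first so repeated heartbeats cannot trigger a second disconnect.
    client.off("serverHeartbeatFailed", onFailed);
    disconnect(event.failure);
  };
  client.on("serverHeartbeatFailed", onFailed);
  return () => {
    client.off("serverHeartbeatFailed", onFailed);
  };
}
```

Removing the listener before disconnecting is one way to make sure that further failed heartbeats do not trigger repeated disconnects or repeated log entries.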

When a user runs a command after we've disconnected they end up with errors like this:

Which we surface ourselves. While they do have a message from the toast, I'm thinking we probably want to give something less cryptic there. For a bit more context: we also chatted about fully disconnecting from the connection, which would close all of their open tabs, but that may end up with them losing some work, which would be a frustrating UX. Discussion: https://mongodb.slack.com/archives/C069YM25L8N/p1735662529296989

Not an easily testable flow as these are mostly internals of data service and mongo client. Curious if folks think it's worth adding some mocks to facilitate tests for it. We'd need to simulate the serverHeartbeatFailed events on a mongo client and have the custom mongo client passed to the data service.

Custom close codes in mms: https://github.com/10gen/mms/blob/24f5af9c5318a5da746ad328547591f376449dc0/server/src/main/com/xgen/cloud/services/clusterconnection/runtime/res/CustomCloseCodes.java#L5
Where they are returned on pings which form the serverHeartbeatFailed events: https://github.com/10gen/mms/blob/24f5af9c5318a5da746ad328547591f376449dc0/server/src/main/com/xgen/cloud/services/clusterconnection/runtime/res/ClusterConnectionEndpoint.java#L92

packages/compass-connections/src/stores/connections-store-redux.ts

syn-zhu

Is there another PR on the MMS side you're currently working on?

packages/compass-connections/src/stores/connections-store-redux.ts

gribnoysup · 2025-01-09T08:57:31Z

Not an easily testable flow as these are mostly internals of data service and mongo client

I think you should definitely be able to add unit tests in compass-connections for this by emitting the event manually on a mocked dataService. We have all the tooling to set up a test like that, and at least it would cover the error parsing logic. It should be possible to add some e2e tests too, if we want to, by modifying our ws-proxy code; we can chat more about how to set that up.
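A rough sketch of what such a unit test could look like, written without a test runner so it stays self-contained. Everything here is hypothetical: `MockDataService`, `parseCloseCode`, `watchHeartbeats`, the code values, and the assumption that the close code appears in the error message. The actual tooling in compass-connections and the real parsing logic may differ.

```typescript
import { EventEmitter } from "events";
import { strictEqual } from "assert";

// Hypothetical stand-in for a data service: emits events and counts disconnects.
class MockDataService extends EventEmitter {
  disconnectCalls = 0;
  disconnect(): void {
    this.disconnectCalls += 1;
  }
}

// Hypothetical placeholder codes, not the real mms close codes.
const HYPOTHETICAL_NON_RETRYABLE_CODES: readonly number[] = [4001, 4003];

// Assumption: the close code is embedded in the message, e.g. "... code: 4001".
function parseCloseCode(failure: Error): number | undefined {
  const match = /code:?\s*(\d+)/i.exec(failure.message);
  return match ? Number(match[1]) : undefined;
}

// Hypothetical code under test: disconnect when a parsed code is non-retryable.
function watchHeartbeats(service: MockDataService): void {
  service.on("serverHeartbeatFailed", (event: { failure: Error }) => {
    const code = parseCloseCode(event.failure);
    if (code !== undefined && HYPOTHETICAL_NON_RETRYABLE_CODES.includes(code)) {
      service.disconnect();
    }
  });
}

// The test itself: emit the event manually and check the outcome.
const dataService = new MockDataService();
watchHeartbeats(dataService);

// A failure without a recognised code should not disconnect.
dataService.emit("serverHeartbeatFailed", { failure: new Error("socket hang up") });
strictEqual(dataService.disconnectCalls, 0);

// A failure carrying a non-retryable code should disconnect.
dataService.emit("serverHeartbeatFailed", {
  failure: new Error("connection closed with code: 4001"),
});
strictEqual(dataService.disconnectCalls, 1);
```

In a real suite, the two scenarios would sit in separate test cases inside the project's test runner, with the mock replaced by whatever data service test helper the package already provides.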

packages/compass-connections/src/stores/connections-store-redux.ts

Anemy · 2025-01-13T17:51:06Z

@syn-zhu

Is there another PR on the MMS side you're currently working on?

No, this is the only branch I've been working on. Is there work you foresee us needing on the mms side? I see there's already support for the ping/pongs checking the roles and throwing these errors: https://github.com/10gen/mms/blob/de2a9c463cfe530efb8e2a0941033e8207b6cb11/server/src/main/com/xgen/cloud/services/clusterconnection/runtime/res/ClusterConnectionEndpoint.java#L521
Maybe I'm missing some work we need to do there?

packages/compass-connections/src/stores/connections-store-redux.ts

syn-zhu

one comment, everything else looks good

Anemy · 2025-01-15T21:36:35Z

Chatted a bit with Sergey on Slack; not going to go for an e2e for this, at least for now. I'm going to wait on merging until I can reliably get this working with our main branch of mms. It might take some backend changes: I'm having trouble getting these expected heartbeat failures at the moment (I developed this by manually throwing them from ccs). It may go away once we merge https://github.com/10gen/mms/pull/116288

Anemy · 2025-01-16T15:11:52Z

Tried this out with the PR Simon has open in mms and it worked nicely.

Anemy added 2 commits January 8, 2025 18:42

chore(connections): disconnect when we encounter a non-retryable erro…

1eccde7

…r code on an atlas connection

Merge branch 'main' into CLOUDP-286331

17c7605

Anemy commented Jan 8, 2025

syn-zhu reviewed Jan 9, 2025

gribnoysup reviewed Jan 9, 2025

gribnoysup reviewed Jan 9, 2025

Merge branch 'main' into CLOUDP-286331

e9da815

Anemy and others added 2 commits January 13, 2025 13:33

fixup: remove atlas restrictive check, add unit tests

1aa74b7

Merge branch 'main' into CLOUDP-286331

4c77061

Anemy marked this pull request as ready for review January 14, 2025 18:16

Merge branch 'main' into CLOUDP-286331

0e5ac49

syn-zhu reviewed Jan 14, 2025

syn-zhu approved these changes Jan 14, 2025

fixup: mention compass web in comment

acac16d

Anemy merged commit ba5c36f into main Jan 16, 2025
28 of 30 checks passed

Anemy deleted the CLOUDP-286331 branch January 16, 2025 15:12
