Skip to content

Address issues from earlier downgraded connection mode change#8468

Merged
vladsud merged 1 commit intomicrosoft:mainfrom
vladsud:DowngradedConnection
Dec 7, 2021
Merged

Address issues from earlier downgraded connection mode change#8468
vladsud merged 1 commit intomicrosoft:mainfrom
vladsud:DowngradedConnection

Conversation

@vladsud
Copy link
Copy Markdown
Contributor

@vladsud vladsud commented Dec 6, 2021

Addressing Issue #8398 - New Prod error: assert(this.runtime.deltaManager.active, 0x25d /* "We should never connect as 'read'" */);
Partial undo of #7753 that causes assert above.

The core of the problem - relay service sends leave op for "write" connection after 5 minutes of inactivity.
This leaves connection in weird state - most of the system believes it's "write" connection, but you can't send ops on that connection. Above mentioned PR attempted (among other things) to remove this discrepancy for loader layer, but I missed that runtime layer does not expect change in properties of connection.

This change constraints visible changes to DM layer, making other layers believe that we still work with "write" connection (see items below for more details).

It's worth raising that we can go in various ways about it, roughly in priority order:

  1. Best option IMHO - service disconnects "write" connection after 5 min of inactivity. Sending leave op and keeping connection alive is weird for client as some layers do not care about such op (DM itself for most part), while others expect properties of connection do not change over lifetime of connection.
  2. Same as above, but client does it (and reconnects as "read") on own join
    • I favor this direction as longterm direction at the moment (i.e., I think we need to implement it, this change is more stop-gap solution).
    • It has rather big con for R11S stack that does not have (ODSP) socket reuse logic though
  3. Same as above, but disconnect / reconnect is "virtual", i.e. we keep using same connection with same clientId, but expose it to other layers (runtime) as disconnect from "write" connection and reconnect to same connection (same clientId) but as "read".
    • I'm not sure I like it as it breaks fundamental gurantee that every new connection gets unique clientId.
  4. This approach - client keeps connection as if it's "write" connection, but DM internally knows it's "read" connection and will go through reconnection logic if any op is sent by client.
    • The pro of this change over next - we never see weird nacks like "nonexciting client". The con - summarizer keeps going on such connection without realizing it.
  5. Undo above mentioned PR altogether. Same con as above (plus weird nack/disconnect reasons).

Addressing microsoft#8398 - New Prod error: assert(this.runtime.deltaManager.active, 0x25d /* "We should never connect as 'read'" */);
Partial undo of https://github.com/microsoft/FluidFramework/pull/7753/files
@github-actions github-actions Bot added the area: loader Loader related issues label Dec 6, 2021
@agarwal-navin
Copy link
Copy Markdown
Contributor

Looks like all of these solutions have a same problem - summarizer will transition to "read" state in absence of activity. Do we every want this to happen? Should there be some configuration that tells the server to not disconnect summarizer client?

@agarwal-navin
Copy link
Copy Markdown
Contributor

Today, if the client gets a "leave" op, the connection is still maintainer as "write"? Basically, connection.mode will return "write"?
This looks to be the case from option 1 above, but making sure I understand this.

@vladsud
Copy link
Copy Markdown
Contributor Author

vladsud commented Dec 7, 2021

The main issue (and difference in solutions) we should look for - how the state of the system is exposed to various layers and is it consistent / up to expectations. We should strive to make sure that layers (for most part) are not concerned with complexity and that the message is clear - as long as connection is in flight, it has all properties of connection locked (not changed).

So, answering your question - yes, some options will transition to "read". However, the main question we should ask - is this visible? I.e., in solutions 4-5 summarizer does not know about it and acts as if it works with "write" connection, and it is consistently exposed to runtime as "write" connection (vs. current state where DM.active changes on the fly). But there is reconnection on any op being sent, so runtime is essentially lied to, but it has no way to discover that (as disconnection can happen at any moment in time for any reason).

Today, if the client gets a "leave" op, the connection is still maintainer as "write"? Basically, connection.mode will return "write"?
This looks to be the case from option 1 above but making sure I understand this.

After PR #7753 (current state of main), connection mode (as observed by most layers) will change to "read" on leave op. With this change, it will stay as "write".
Con of this change - summarizer in such mode will do unneeded work (generate and upload a summary) for nothing, as it will fail to send op.

Long term I want to move to # 2 solution, but take time to do it properly (this change is mostly back out of parts of previous change)

@vladsud
Copy link
Copy Markdown
Contributor Author

vladsud commented Dec 7, 2021

Note - opened #8483 to consider next steps, if any.

@vladsud vladsud merged commit e843bb9 into microsoft:main Dec 7, 2021
@vladsud vladsud deleted the DowngradedConnection branch December 7, 2021 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: loader Loader related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants