
r11s: enable socket.io long-polling as a fallback #8820

Merged · 6 commits merged on Feb 8, 2022

Conversation

@znewton (Contributor) commented Jan 20, 2022

This change will allow long-polling as a fallback/downgrade option in the r11s-driver and r11s server. We should be able to release this driver change before the server change, because if WebSockets are not available in a client's browser/network, then their scenario is already broken.

@github-actions bot added the labels "area: driver" (Driver related issues) and "area: server" (Server related issues, routerlicious) on Jan 20, 2022
@vladsud (Contributor) commented Jan 20, 2022

Some observations:

  • Please take a look at Experiment with long polling in ODSP driver #7964 and the article it references.
  • Based on what I read about socket.io protocols, adding long polling will result in it being used first, for some indeterminate period of time. I.e., it's less of a fallback (the way it's implemented), and communication latency might be negatively affected by this change for all connections that do support the websocket upgrade.
  • So it would be great to actually do some measurements and see whether this approach is good, or whether it's better to first try the websocket protocol and, if it fails, retry with long-polling.
  • It would also be nice to understand what problem is being solved (i.e., the particular user scenario). It's possible that we won't like long-polling perf and would want to use it only to formulate a better error message for users, i.e., isolate the case where we can't reach the relay service with any protocol, or can only reach it with long polling (but not websocket), to point them to the root of the problem, but not support FF over long polling.

@znewton (Contributor, Author) commented Jan 21, 2022

@vladsud

> Based on what I read about socket.io protocols, adding long polling will result in it being used first, for some indeterminate period of time. I.e., it's less of a fallback (the way it's implemented), and communication latency might be negatively affected by this change for all connections that do support the websocket upgrade.

This is correct when talking about the defaults used by socket.io (["polling", "websocket"]); however, order matters: by setting the value to ["websocket", "polling"], a WebSocket connection is attempted first, and polling is only used as a fallback. See socket.io's transports docs for more info.

import { io } from "socket.io-client";

const socket = io("https://example.com", {
  transports: ["websocket", "polling"] // use WebSocket first, if available
});

socket.on("connect_error", () => {
  // revert to classic upgrade (polling first, then upgrade to websocket if possible)
  socket.io.opts.transports = ["polling", "websocket"];
});

That being said, looking at the code example from socket.io above, it appears that for the failover to actually work, we might need to add an on("connect_error") handler that switches over to polling by default. I'm still working out how to validate this failover by reproducing the customer scenario that prompted this change.

> So it would be great to actually do some measurements and see whether this approach is good, or whether it's better to first try the websocket protocol and, if it fails, retry with long-polling.

See above; the "better" option here is what this PR does.

> It would also be nice to understand what problem is being solved (i.e., the particular user scenario). It's possible that we won't like long-polling perf and would want to use it only to formulate a better error message for users, i.e., isolate the case where we can't reach the relay service with any protocol, or can only reach it with long polling (but not websocket), to point them to the root of the problem, but not support FF over long polling.

We have some very specific customers who have this requirement. Please reach out offline, and I can connect you with the right people to give you more information about the specific need.

@vladsud (Contributor) commented Jan 21, 2022

It would be great to share your learnings in #7964, including the impact on the service (I believe there is non-zero impact here re: COGS).
We want to learn, as the same change is likely needed on the ODSP side.

@github-actions bot requested a review from vladsud January 21, 2022 20:43

@msfluid-bot (Collaborator) commented Jan 21, 2022

@fluid-example/bundle-size-tests: +565 Bytes

Metric Name             | Baseline Size | Compare Size | Size Diff
----------------------- | ------------- | ------------ | ----------
aqueduct.js             | 379.93 KB     | 379.93 KB    | No change
containerRuntime.js     | 180.61 KB     | 180.61 KB    | No change
loader.js               | 160.84 KB     | 160.84 KB    | No change
map.js                  | 50.64 KB      | 50.64 KB     | No change
matrix.js               | 145.45 KB     | 145.45 KB    | No change
odspDriver.js           | 193.51 KB     | 194.07 KB    | +565 Bytes
odspPrefetchSnapshot.js | 45.32 KB      | 45.32 KB     | No change
sharedString.js         | 166.37 KB     | 166.37 KB    | No change
Total Size              | 1.32 MB       | 1.32 MB      | +565 Bytes

Baseline commit: b71abdd

Generated by 🚫 dangerJS against 8fe1cf5

@znewton (Contributor, Author) commented Jan 21, 2022

I was able to mimic a WebSocket failure -> polling downgrade using a TamperMonkey script that closes a websocket connection before the handshake is able to complete. Using this test, I was able to confirm that some additional changes were needed to make the failover actually happen. Previously, driver-base/DocumentDeltaConnection would fail and terminate on any connect_error. However, we need to allow socket.io to re-attempt failover with polling once before terminating. Please take a look at the latest commit to see how I implemented that.
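
A rough sketch of that kind of userscript (hypothetical; not the exact script used here):

// TamperMonkey-style userscript body: force every WebSocket to close as soon
// as it opens, so the socket.io handshake never completes and the client is
// pushed onto the long-polling fallback path.
const NativeWebSocket = window.WebSocket;

window.WebSocket = class extends NativeWebSocket {
    constructor(url: string | URL, protocols?: string | string[]) {
        super(url, protocols);
        // Close immediately after the transport opens, before socket.io
        // finishes its own handshake over the websocket.
        this.addEventListener("open", () => this.close());
    }
} as typeof WebSocket;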

Scenarios I tested using Clicker, and the results:

Script Blocks WebSockets | Server Allows WebSockets | Server Allows Polling | Result
------------------------ | ------------------------ | --------------------- | ------
No                       | Yes                      | Yes                   | Successful WebSocket handshake and connection; no polling attempt made; collaboration works as expected
Yes                      | Yes                      | Yes                   | WebSocket error: closed before connection established; polling begins; collaboration works as expected
Yes                      | Yes                      | No                    | WebSocket error: closed before connection established; polling attempt fails with a 400 error; documentDeltaConnection fails with DeltaConnectionFailureToConnect {"delay":1000,"duration":1743,"online":1,"category":"error","error":"socket.io (connectError): xhr poll error: {\"type\":\"TransportError\",\"description\":400}"}

Ideally, we would be able to know when the connect_error is due to a transport layer failure versus a socket.io server middleware error. I will continue to look into adding more specificity.

TL;DR: this works, but I'm curious whether there are other ideas on how to allow 1 reconnect attempt.

@tanviraumi (Contributor) left a comment

@GaryWilber Could you take a look at this?

@vladsud (Contributor) commented Jan 24, 2022

One way I have heard of people blocking the websocket upgrade is a proxy that inspects traffic and drops the relevant HTTP headers (I can probably dig up where I heard about it).

@znewton (Contributor, Author) commented Jan 24, 2022

@vladsud that's a great idea. I think I can probably find a way to configure nginx in front of a local R11s instance to prevent the websocket upgrade.

@@ -365,7 +367,15 @@ export class DocumentDeltaConnection
                    description.target = undefined;
                }
            } catch(_e) {}
-           fail(true, this.createErrorObject("connectError", error));
+           if (!this.socket.io.reconnection() ||
+               this.reconnectAttempts >= (this.socket.io.reconnectionAttempts() ?? 0)) {
Review comment from a Contributor:

It would be great to incorporate proper telemetry into this flow from the start:

  • events for reconnection attempts (and their duration / protocol)
  • once a connection is established, log (in some form) whether long polling or websocket is being used.
    Ideally the form the telemetry takes is left to the caller, so maybe we just need hooks for each driver to be able to leverage? Maybe not, and this layer simply issuing events is fine.
    @jatgarg, @markfields - FYI
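
As a rough illustration of what that kind of logging could look like with socket.io-client (a hypothetical sketch: the event names and the logger object below are placeholders, not the driver's actual telemetry API):

import { io } from "socket.io-client";

// Placeholder logger; a real driver would route this through its telemetry logger.
const logger = { sendTelemetryEvent: (event: object) => console.log(event) };

const socket = io("https://example.com", {
    transports: ["websocket", "polling"],
});

socket.on("connect", () => {
    // engine.io exposes which transport was actually negotiated.
    logger.sendTelemetryEvent({
        eventName: "ConnectionTransport",
        transport: socket.io.engine.transport.name, // "websocket" or "polling"
    });

    // If we connected over polling, also log a later upgrade to websocket.
    socket.io.engine.on("upgrade", (transport) => {
        logger.sendTelemetryEvent({
            eventName: "TransportUpgrade",
            transport: transport.name,
        });
    });
});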

Review comment from a Member:

Seems like since the logic is shared, the logging can live here too, but maybe I'm not seeing the benefits of each driver having more flexibility.

I'll be curious to see how often this happens and, for tenants configured this way, how spammy these logs are. If there's some other error that triggers this reconnect but never succeeds, then we'll just sit and log the same error every 8 seconds (see connectCore). #8411 would help...

Review comment from the Author:

Created #8968 to track this as a follow-up to this PR.

@@ -33,9 +33,25 @@ export class R11sDocumentDeltaConnection extends DocumentDeltaConnection
                tenantId,
            },
            reconnection: false,
-           transports: ["websocket"],
+           // Enable long-polling as a downgrade option when WebSockets are not available
+           transports: ["websocket", "polling"],
Review comment from a Contributor:

Curious: does it matter if polling is added here, given that you change transports below (on error)? Or do both places need to have it in the list?
Maybe another way to ask: if (for example) a host wants to control this behavior, where would we put the IF condition to disable/enable long polling? Should we do it from the start (and bake it into the generic layer) instead of having it here, with each driver just controlling the policy? That would be easier down the road for ODSP and would avoid replicating bugs across layers due to duplicated code.

Review comment from the Author:

Putting it here allows polling to be used in scenarios where websockets are not available in the given runtime (e.g. IE 9). I think IE 9 is currently the only browser that doesn't support websockets, and engine.io-client has built-in isomorphic support, so it also works in Node.js via the ws package. Realistically, adding this line here adds very niche support, but I figured it isn't really hurting anything.

As for where to put an IF condition, I was under the impression that we didn't want to bake socket.io specifics into the generic layer in case a driver/service wanted to use a different WebSockets implementation, but that might be an outdated understanding. If it stays here in the per-driver layer, I believe the IF condition would have to go both around this config line and around the following .on("connect_error") handler, since that is really where the majority of polling use cases are covered.

Review comment from a Contributor:

Do we care about IE9? I think we should not, unless there is some strong case here (I'm not aware of one).

I'm not sure what you mean by a different WebSocket implementation. If a driver chooses to use this class, then it agrees to use socket.io and the transport protocols required / supported by this layer. It feels like a standard code-reuse question: I do not see much value in duplicating the same logic across drivers. If we expect the same logic in other drivers that want long polling, then I'd rather move it to the shared layer and expose a policy to enable / disable it.

I'd go even further: you want a single driver to have the ability to easily control it, and thus allow end experiences to evaluate / enable it slowly, observing impact (COGS, perf, etc.) and potentially cutting it off through dynamic feature gates if things go wrong.

I can also see (maybe it's in some distant future) requests in the form of: could you please cache that a given tenant does not support websockets, and thus move to long polling faster? I'd prefer to give other layers (hosts) the tools to implement something like that, and simply expose enough policy controls for them to do so, vs. baking it into drivers.
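
To make the policy idea concrete, the surface exposed by the shared layer could look roughly like this (hypothetical sketch; names are illustrative, not a final public API):

// Hypothetical policy surface on the shared DocumentDeltaConnection layer.
export interface DeltaConnectionPolicies {
    /**
     * When true, the shared layer may downgrade to long-polling after a
     * websocket transport error instead of failing the connection outright.
     * Drivers (and ultimately hosts) set the policy; the shared layer owns
     * the downgrade mechanics.
     */
    enableLongPollingDowngrades?: boolean;
}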

Review comment from the Author:

> Do we care about IE9? I think we should not, unless there is some strong case here (I'm not aware of one).

Nope; I just now found the FluidFramework browser support doc. I think I can safely remove this additional polling config, but I will verify that everything still works.

> Other stuff

Totally agree. Thank you for explaining that driver-base's DocumentDeltaConnection requires socket.io (looking at the types now, this should have been obvious). I will work on an update to this PR that moves this handling into the shared space.

@znewton requested a review from a team as a code owner January 31, 2022 22:11
@github-actions bot requested a review from vladsud and removed the request for a team January 31, 2022 22:12
@github-actions bot added the "public api change" (Changes to a public API) label Jan 31, 2022
@@ -365,7 +368,24 @@ export class DocumentDeltaConnection
                    description.target = undefined;
                }
            } catch(_e) {}
-           fail(true, this.createErrorObject("connectError", error));
+           if (this.enableLongPollingDowngrades) {
Review comment from a Contributor:

I'd prefer a bit of a "defense in depth" approach here.
For example, can we get into if (this.enableLongPollingDowngrades) multiple times? It feels like the answer should be no. If that's the case, would it be worth resetting this.enableLongPollingDowngrades so that if we were to get here a second time (for whatever reason), we would bail out with an error right away?
Similarly, what happens if this.enableLongPollingDowngrades && error?.type !== "TransportError"?
Or !this.enableLongPollingDowngrades && this.socket.io.reconnection()?
Are these supported modes?

Can we have a couple of special conditions (escaping), with all other code paths (a catch-all) resulting in fail() like it used to?

Review comment from the Author:

I hear what you're saying. Looking at this if block again, I agree that it's not clear what the behavior is in these scenarios, and even though the behavior is as expected for each of the scenarios you describe, the code path is not clean (it goes into if() blocks unnecessarily), so I will try to clean it up. The current logic also assumes that the driver layer initially sets reconnection: false, which I don't think is ideal.

Review comment from the Author:

Okay, I've updated the logic to be much more specific.
It will only enter the long-polling downgrade block if (1) it's a websocket transport error, (2) long-polling downgrade is enabled, and (3) the polling upgrade mechanism (transports: ["polling", ...]) is not already in place. Additionally, socket.io's internal reconnection behavior will not be altered if it was already enabled by the driver layer. Lastly, the "reconnection enabled and reconnect attempts < max reconnect attempts" check is now an escape rather than a condition to fail.
Let me know what you think, thanks!
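
For readers skimming the thread, here is a rough standalone sketch of that flow (hypothetical: the real change lives in driver-base's DocumentDeltaConnection, and the fail callback and reconnect counter are only stubbed here for illustration):

import { io, Socket } from "socket.io-client";

// Sketch of the described connect_error handling, expressed as a free function
// over a socket.io Socket rather than the actual class method.
function attachConnectErrorHandler(
    socket: Socket,
    enableLongPollingDowngrades: boolean,
    fail: (error: unknown) => void,
): void {
    let reconnectAttempts = 0;
    socket.io.on("reconnect_attempt", () => { reconnectAttempts++; });

    socket.on("connect_error", (error) => {
        let handleError = true;
        const isWebSocketTransportError =
            typeof error === "object" && (error as { type?: string })?.type === "TransportError";

        if (enableLongPollingDowngrades && isWebSocketTransportError
            && socket.io.opts.transports?.[0] !== "polling") {
            // (1) websocket transport failure, (2) downgrades enabled, and
            // (3) polling-first upgrade not already in place: downgrade so the
            // next attempt starts with long-polling.
            socket.io.opts.transports = ["polling", "websocket"];
            if (!socket.io.reconnection()) {
                // Only touch reconnection settings the driver didn't enable itself.
                socket.io.reconnection(true);
                socket.io.reconnectionAttempts(1);
            }
        }

        // Escape: socket.io will retry internally, so don't fail the connection yet.
        if (socket.io.reconnection() && reconnectAttempts < socket.io.reconnectionAttempts()) {
            handleError = false;
        }

        if (handleError) {
            fail(error);
        }
    });
}

// Usage mirroring the r11s driver defaults (websocket first, no auto-reconnect):
const socket = io("https://example.com", {
    transports: ["websocket", "polling"],
    reconnection: false,
});
attachConnectErrorHandler(socket, true, (e) => console.error("connection failed", e));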
