Irregular heartbeats with Safari #348

Open
ookke opened this Issue Mar 2, 2015 · 36 comments

Projects

None yet

4 participants

@ookke
ookke commented Mar 2, 2015

We are using primus 2.4.12 with the engine.io transport and we encountered a strange issue where some clients keep getting disconnected and reconnecting about once a minute. The common factor between these clients seems to be OS X 10.10.2 and Safari 8.0.3.

When investigating the issue we increased the server heartbeat timeout and added debug logging on the 'incoming::ping' event and noticed that the interval between heartbeats from the Safari clients is quite irregular (ping interval is set to 25 000 ms):

46998 ms since last ping or connect
46999 ms since last ping or connect
30992 ms since last ping or connect

Meanwhile a Google Chrome client will keep sending heartbeats regularly ~26 s apart. Any ideas what could be causing this and how to investigate it further? As a workaround we increased the server heartbeat timeout to 90 seconds which seems to mitigate the issue for now.

@lpinca
Member
lpinca commented Mar 3, 2015

I can consistently reproduce this. The ping interval respects the settings in my case but the client disconnects after 4/5 pings.
I will try to reproduce this using only Egine.IO to see where the bug lives.

@lpinca lpinca added the investigate label Mar 3, 2015
@lpinca
Member
lpinca commented Mar 4, 2015

The issue also happens when using sockjs or websockets as transformer, so I tried running this in Safari 8.0.3 keeping the app in background:

(function () {
  var start = +new Date();

  function timeout() {
    var now = +new Date();

    console.log(now - start);
    start = now;
    setTimeout(timeout, 25000);
  }

  setTimeout(timeout, 25000);
})();

This is what I got:

25005
25000
25567
35994
36000
26000
25007
25993
35999
36000
25001

We probably can't do much about this.

@ookke
ookke commented Mar 4, 2015

Fair enough, we'll just keep using the workaround of increasing the server heartbeat timeout then. FYI: I posted another similar issue about Android devices: #350.

@3rd-Eden
Member
3rd-Eden commented Mar 4, 2015

@lpinca @ookke so it's only limited to webkit based browsers? Firefox and IE are all good?

@3rd-Eden 3rd-Eden added bug labels Mar 4, 2015
@lpinca
Member
lpinca commented Mar 4, 2015

Chrome also works fine, I didn't test on mobile though. I also tried with Safari 8.0.5 and it acts exactly like 8.0.3.

@lpinca lpinca added Safari and removed investigate labels Mar 5, 2015
@lpinca
Member
lpinca commented Apr 12, 2015

Web Workers are not affected by this, so a possible solution is to tweak https://github.com/unshiftio/tick-tock to make it use Web Workers when possible.

@STRML
Contributor
STRML commented Sep 25, 2015

From my testing, Safari will execute timers more often if we simply have a timer running at a lower interval. I'm doing some investigation into its behavior.

Webworkers would be great, but I am not sure how long that behavior will last (until they are throttled properly too). I'm going to try forking tick-tock to give it a shot.

@STRML
Contributor
STRML commented Sep 25, 2015

It appears, in Safari, that having other timers active at lower intervals caps the effective range within which Safari will throttle. If I have a 10000ms and a 25000ms timer set, for example, the longest time I see print for the 25000ms timer is 30000ms.

We could effectively cap the error by running a noop setInterval at something like heartbeatTimeout / 4.

@lpinca
Member
lpinca commented Sep 26, 2015

@STRML I tried adding a bogus interval to my previous test case:

(function () {
  var start = +new Date()
    , count = 0;

  setInterval(function () {
    count++;
  }, 1000);


  function timeout() {
    var now = +new Date();

    console.log(now - start);
    start = now;
    setTimeout(timeout, 25000);
  }

  setTimeout(timeout, 25000);
})();

but I still get the same results in Safari 9.0.
The interval timer with 1000 ms delay does not affect the timeout timer which randomly gets fired after 35000 ms.

@lpinca
Member
lpinca commented Sep 26, 2015

Just another update:

var src = [
  'onmessage = function (e) {',
  '  var start = e.data;',
  '',
  '  setTimeout(function () {',
  '    var now = Date.now();',
  '    postMessage(now - start);',
  '  }, 25000);',
  '}'
];

var blob  = new Blob(src, { type: 'application/javascript' });
var worker = new Worker(URL.createObjectURL(blob));

worker.postMessage(Date.now());

worker.onmessage = function (e) {
  console.log(e.data);
  worker.postMessage(Date.now());
};

This doesn't help.

@STRML
Contributor
STRML commented Sep 26, 2015

Then I believe it's what I feared; that webworkers were only a temporary hack around this problem.

In Safari 8.0, this is the output I get with the webworker:

[Log] 25010
[Log] 25015
[Log] 25008
[Log] 25000
[Log] 24999 (x2)
[Log] 25000 (x6)
[Log] 24999 (x2)

IMO it's not worth putting together a websocket solution if it's going to break with new power-saving updates. We'll have to keep investigating. A server-sent heartbeat option would be a nice way around, assuming those callbacks actually trigger on time.

@STRML STRML referenced this issue in nodeca/tabex Sep 26, 2015
Open

Shared workers support? #2

@3rd-Eden
Member

I guess the only way to resolve this is to do a major refactor on our heartbeat system, making it more forgiving and have the timers only be present on the server, not the client.

@STRML
Contributor
STRML commented Sep 28, 2015

Yeah, this is unfortunately going to cripple any non-SharedWorker implementation of tab connection sharing as whichever tab has the connection could be throttled when inactive.

I'm curious, what led to the decision to move heartbeating to the client in the first place?

@lpinca
Member
lpinca commented Sep 29, 2015

@STRML no particular reason.
We only thought about this: "If the spark does no receive data in a given time frame kill it" and the implementation came naturally from that. We didn't think about the fact that a browser could have changed the delay of a timer.
I find this crazy. Yes you save energy, but you are completely destroying the timer and everything that is built on top of that. You are making timers useless and this is proven by the fact that we are thinking about moving the heartbeat timer on the server.

Furthermore, I doubt that Primus is the only library affected by this. If I remember correctly the heartbeat system in Engine.IO is implemented in the same way.

@STRML
Contributor
STRML commented Sep 29, 2015

Ah, I thought there was a particular reason because (for instance), protocol-level WS pings are from the server to the client.

I agree re: timer coalescing. Of course, if the browser developers provided a way to break the throttling, probably every ad network in the world would do it for their own reasons. Of course, one (very wasteful) workaround is to receive messages on a socket at intervals.

It will require some thought to figure out a way to reverse the heartbeat that is non-breaking. I am thinking it must be a parameter sent by the client on connection. If the server responds that it supports this mode, the client will expect heartbeats, otherwise it will send them as usual.

Otherwise of course we could break it entirely and release a major.

@lpinca
Member
lpinca commented Sep 29, 2015

Not necessarily, I think that either endpoint can send a ping frame in the WS protocol to which the other end should respond with a pong frame, but I may be wrong.

Yes it will probably be a major change, also I'm not sure how to handle the latency property if we move the timer to the server.

@3rd-Eden
Member

There are a few reasons on why client->server pings are a good idea. It is currently the only way to get a reliable indication of the message latency between client and server so you can optimize the message traffic from the client. E.g. by queueing messages to make larger updates instead of the more frequent updates. Removing the client->server pong would eliminate this functionality, but I would rather lose this ability instead of connections that break up.

@3rd-Eden
Member

Also ping packets from the WS protocol are initiated by the client not by the server.

@STRML
Contributor
STRML commented Sep 29, 2015

@3rd-Eden , it appears the spec doesn't specify, only saying that an endpoint (which I am reading to be client or server) can send a ping at any time and the other endpoint must send a pong. And at any time, either side can send pongs, which are meant to just be unanswered "I'm still here" messages.

Re: Latency, we could do a 3-step server ping/client pong/server pong heartbeat.

@lpinca
Member
lpinca commented Oct 1, 2015

Before refactoring the code I wonder if we can find how the timers are throttled.
I've seen that it doesn't matter what the delay of the timer is, it is always delayed by 10s.
If it is the case we can just make the server side timeout more tolerant by adding 10s, but ofc this workaround is very fragile as this behavior can change between browsers and versions.

@STRML
Contributor
STRML commented Jan 19, 2017

Something new in Chrome has made this much worse; I'm seeing timeouts as long as 80s when running the setInterval code you posted above.

We may have to disable heartbeating entirely for now as this is killing connections really regularly.

@STRML
Contributor
STRML commented Jan 19, 2017 edited

https://bugs.chromium.org/p/chromium/issues/detail?id=650594 may be related; I just noticed this happening a lot more recently and it lines up.

It must be some kind of heuristic based on how much CPU time the tab is taking up. On our testnet sites I can't get it to delay timers by more than 1s, but on our production site (where there is much more realtime data), I start seeing the following pattern after a while on a 1s timer:

screen shot 2017-01-19 at 4 05 42 pm

@lpinca
Member
lpinca commented Jan 19, 2017

@STRML Chrome 56? I cannot repro on Chrome 55.

@STRML
Contributor
STRML commented Jan 19, 2017

Yep, Version 56.0.2924.67 beta (64-bit).

It appears to be restricted to sites with a lot of background data or CPU usage. I'm testing on my own site, https://www.bitmex.com/app/trade/XBTUSD.

@lpinca
Member
lpinca commented Jan 19, 2017

From the link you posted above:

Each WebView has a budget (in seconds) for running timers in background.
A timer task is only allowed to run when the budget is non-negative.
After a timer has executed, its run time is subtracted from the budget.
The budget regenerates with time (at rate of 0.01 seconds per second).

@STRML
Contributor
STRML commented Jan 19, 2017

Well, that's utterly ridiculous without some way to define high- and low-priority timer tasks. What are they thinking?

@lpinca
Member
lpinca commented Jan 19, 2017

I agree this kinda sucks.

@STRML
Contributor
STRML commented Jan 19, 2017

Confirming this is how it is operating; I have triggered this even on localhost by just setting our market mocking bots on overdrive so the timers get more expensive.

Perhaps it is time we start doing heartbeats on the server.

@lpinca
Member
lpinca commented Jan 19, 2017

Perhaps it is time we start doing heartbeats on the server.

Yes, maybe the heartbeat system can also be improved by running only one timer for all connections instead of one per connection.

@STRML
Contributor
STRML commented Jan 19, 2017

That sounds worthwhile; a single timer could scan all open connections and close those that don't have a flag set, then unset the flag. Getting a response to the heartbeat sets the flag again.

@STRML
Contributor
STRML commented Jan 20, 2017

Great news; Chromium team is willing to bend on this.

They have also suggested putting the socket in a ServiceWorker. That should at least solve the heartbeating issue for us, but will require some development effort.

@lpinca
Member
lpinca commented Jan 20, 2017

Nice, I was also playing with server sent pings. I will publish a PoC in a PR if everything works.
It will be a breaking change which removes the latency property from the client but I think fixing this is more important.

@STRML
Contributor
STRML commented Jan 20, 2017

I think that's fine; I've never had any luck with latency being actually representative of client latency, probably due to timer coalescing.

@lpinca
Member
lpinca commented Jan 20, 2017

@STRML see #534.

@lpinca lpinca referenced this issue in socketio/engine.io Jan 22, 2017
Open

Allow disabling ping/pong? #312

@STRML
Contributor
STRML commented Jan 26, 2017

Just an FYI: Chrome has committed to delay this until M57 at least. The current implementation even throttles the response to WS messages, so reversing the flow will still break as of 56 beta. Hoping we'll see more positive progress; we got a lot of developers aware via HN & Reddit!

@lpinca
Member
lpinca commented Jan 26, 2017

The current implementation even throttles the response to WS messages

Ouch.
@STRML I've read your blog post, good stuff. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment