Irregular heartbeats with Safari #348

Closed
ookke opened this Issue Mar 2, 2015 · 37 comments

Comments


ookke commented Mar 2, 2015

We are using primus 2.4.12 with the engine.io transport and we encountered a strange issue where some clients keep getting disconnected and reconnecting about once a minute. The common factor between these clients seems to be OS X 10.10.2 and Safari 8.0.3.

When investigating the issue we increased the server heartbeat timeout and added debug logging on the 'incoming::ping' event and noticed that the interval between heartbeats from the Safari clients is quite irregular (ping interval is set to 25 000 ms):

46998 ms since last ping or connect
46999 ms since last ping or connect
30992 ms since last ping or connect

Meanwhile a Google Chrome client will keep sending heartbeats regularly ~26 s apart. Any ideas what could be causing this and how to investigate it further? As a workaround we increased the server heartbeat timeout to 90 seconds which seems to mitigate the issue for now.
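A rough sketch of that kind of debug logging, assuming the 'incoming::ping' event is observable per spark (the injectable `clock` parameter is hypothetical and exists only so the helper can be exercised without real timers):

```javascript
// Sketch of the debug logging described above. Assumes Primus's
// 'incoming::ping' event fires per spark; `clock` defaults to Date.now.
function makePingLogger(clock) {
  clock = clock || Date.now;
  var last = clock();

  return function onPing() {
    var now = clock();
    var elapsed = now - last;
    last = now;
    console.log(elapsed + ' ms since last ping or connect');
    return elapsed;
  };
}

// primus.on('connection', function (spark) {
//   spark.on('incoming::ping', makePingLogger());
// });
```
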

lpinca (Member) commented Mar 3, 2015

I can consistently reproduce this. The ping interval respects the settings in my case, but the client disconnects after 4 or 5 pings.
I will try to reproduce this using only Engine.IO to see where the bug lives.

@lpinca lpinca added the investigate label Mar 3, 2015

lpinca (Member) commented Mar 4, 2015

The issue also happens when using sockjs or websockets as transformer, so I tried running this in Safari 8.0.3 keeping the app in background:

(function () {
  var start = +new Date();

  function timeout() {
    var now = +new Date();

    console.log(now - start);
    start = now;
    setTimeout(timeout, 25000);
  }

  setTimeout(timeout, 25000);
})();

This is what I got:

25005
25000
25567
35994
36000
26000
25007
25993
35999
36000
25001

We probably can't do much about this.


ookke commented Mar 4, 2015

Fair enough, we'll just keep using the workaround of increasing the server heartbeat timeout then. FYI: I posted another similar issue about Android devices: #350.

3rd-Eden (Member) commented Mar 4, 2015

@lpinca @ookke so it's only limited to WebKit-based browsers? Firefox and IE are all good?

@3rd-Eden 3rd-Eden added the bug label Mar 4, 2015

lpinca (Member) commented Mar 4, 2015

Chrome also works fine, though I didn't test on mobile. I also tried Safari 8.0.5 and it behaves exactly like 8.0.3.

@lpinca lpinca added Safari and removed investigate labels Mar 5, 2015

lpinca (Member) commented Apr 12, 2015

Web Workers are not affected by this, so a possible solution is to tweak https://github.com/unshiftio/tick-tock to make it use Web Workers when possible.

STRML (Contributor) commented Sep 25, 2015

From my testing, Safari will execute timers more often if we simply have another timer running at a lower interval. I'm doing some investigation into its behavior.

Web Workers would be great, but I am not sure how long that behavior will last (until they are throttled properly too). I'm going to try forking tick-tock to give it a shot.

STRML (Contributor) commented Sep 25, 2015

It appears that, in Safari, having other timers active at lower intervals caps the range within which Safari will throttle. If I have a 10000 ms and a 25000 ms timer set, for example, the longest interval I see printed for the 25000 ms timer is 30000 ms.

We could effectively cap the error by running a no-op setInterval at something like heartbeatTimeout / 4.
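That workaround could look something like the sketch below; `heartbeatTimeout` is the existing config value in milliseconds, while the helper names and the 1-second floor are hypothetical:

```javascript
// Sketch of the proposed workaround: keep a cheap no-op interval alive so
// Safari throttles the heartbeat timer within a smaller window.
function noopDelay(heartbeatTimeout) {
  // a quarter of the heartbeat timeout, but never busier than once a second
  return Math.max(1000, Math.floor(heartbeatTimeout / 4));
}

function capThrottling(heartbeatTimeout) {
  return setInterval(function noop() {}, noopDelay(heartbeatTimeout));
}

// var handle = capThrottling(25000); // no-op fires every 6250 ms
// clearInterval(handle);             // clear it when the connection closes
```
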

lpinca (Member) commented Sep 26, 2015

@STRML I tried adding a bogus interval to my previous test case:

(function () {
  var start = +new Date()
    , count = 0;

  setInterval(function () {
    count++;
  }, 1000);


  function timeout() {
    var now = +new Date();

    console.log(now - start);
    start = now;
    setTimeout(timeout, 25000);
  }

  setTimeout(timeout, 25000);
})();

but I still get the same results in Safari 9.0.
The interval timer with a 1000 ms delay does not affect the timeout timer, which still randomly fires after 35000 ms.


lpinca (Member) commented Sep 26, 2015

Just another update:

var src = [
  'onmessage = function (e) {',
  '  var start = e.data;',
  '',
  '  setTimeout(function () {',
  '    var now = Date.now();',
  '    postMessage(now - start);',
  '  }, 25000);',
  '}'
];

var blob  = new Blob(src, { type: 'application/javascript' });
var worker = new Worker(URL.createObjectURL(blob));

worker.postMessage(Date.now());

worker.onmessage = function (e) {
  console.log(e.data);
  worker.postMessage(Date.now());
};

This doesn't help.


STRML (Contributor) commented Sep 26, 2015

Then I believe it's what I feared: Web Workers were only a temporary hack around this problem.

In Safari 8.0, this is the output I get with the webworker:

[Log] 25010
[Log] 25015
[Log] 25008
[Log] 25000
[Log] 24999 (x2)
[Log] 25000 (x6)
[Log] 24999 (x2)

IMO it's not worth putting together a WebSocket solution if it's going to break with new power-saving updates. We'll have to keep investigating. A server-sent heartbeat option would be a nice way around it, assuming those callbacks actually trigger on time.


3rd-Eden (Member) commented Sep 28, 2015

I guess the only way to resolve this is a major refactor of our heartbeat system: make it more forgiving, and have the timers only be present on the server, not the client.

STRML (Contributor) commented Sep 28, 2015

Yeah, this is unfortunately going to cripple any non-SharedWorker implementation of tab connection sharing, as whichever tab holds the connection could be throttled when inactive.

I'm curious, what led to the decision to move heartbeating to the client in the first place?

lpinca (Member) commented Sep 29, 2015

@STRML no particular reason.
We only thought about this: "If the spark does not receive data in a given time frame, kill it," and the implementation came naturally from that. We didn't consider that a browser could change the delay of a timer.
I find this crazy. Yes, you save energy, but you completely destroy the timer and everything built on top of it. You make timers useless, and this is proven by the fact that we are thinking about moving the heartbeat timer to the server.

Furthermore, I doubt that Primus is the only library affected by this. If I remember correctly, the heartbeat system in Engine.IO is implemented the same way.

STRML (Contributor) commented Sep 29, 2015

Ah, I thought there was a particular reason because, for instance, protocol-level WS pings are sent from the server to the client.

I agree re: timer coalescing. Of course, if the browser developers provided a way to break the throttling, probably every ad network in the world would use it for their own reasons. One (very wasteful) workaround is to receive messages on a socket at intervals.

It will require some thought to figure out a way to reverse the heartbeat that is non-breaking. I am thinking it must be a parameter sent by the client on connection. If the server responds that it supports this mode, the client will expect heartbeats; otherwise it will send them as usual.

Otherwise, of course, we could break it entirely and release a major.
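That negotiation might be sketched as follows; the `serverPing` query flag and the mode names are hypothetical, not part of Primus's actual handshake:

```javascript
// Sketch of a non-breaking negotiation for reversed heartbeats. Both the
// `serverPing` flag and the returned mode names are made up for illustration.
function chooseHeartbeatMode(clientQuery, serverSupportsServerPing) {
  if (clientQuery.serverPing === '1' && serverSupportsServerPing) {
    // both sides opt in: the server sends pings, the client only answers
    return 'server-ping';
  }
  // old client or old server: fall back to the existing client-sent pings
  return 'client-ping';
}
```
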

lpinca (Member) commented Sep 29, 2015

Not necessarily; I think either endpoint can send a ping frame in the WS protocol, to which the other end should respond with a pong frame, but I may be wrong.

Yes, it will probably be a major change. Also, I'm not sure how to handle the latency property if we move the timer to the server.

3rd-Eden (Member) commented Sep 29, 2015

There are a few reasons why client->server pings are a good idea. It is currently the only way to get a reliable indication of the message latency between client and server, so you can optimize the message traffic from the client, e.g. by queueing messages into larger, less frequent updates. Removing the client->server ping would eliminate this functionality, but I would rather lose this ability than have connections that break up.

3rd-Eden (Member) commented Sep 29, 2015

Also, ping packets from the WS protocol are initiated by the client, not by the server.

STRML (Contributor) commented Sep 29, 2015

@3rd-Eden, it appears the spec doesn't specify; it only says that an endpoint (which I read to mean client or server) can send a ping at any time and the other endpoint must send a pong. And at any time, either side can send unsolicited pongs, which are meant to be unanswered "I'm still here" messages.

Re: latency, we could do a 3-step server ping / client pong / server pong heartbeat.
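The 3-step exchange could look roughly like this on the server; the message names, the spark API usage, and the injectable `now` clock are all assumptions for illustration:

```javascript
// Sketch of the 3-step heartbeat: server ping -> client pong -> server pong.
// The server measures the round trip and reports it back so the client can
// still expose a latency value. Event/message names are hypothetical.
function serverPing(spark, now) {
  now = now || Date.now;
  var sent = now();

  spark.write('primus::ping');
  spark.once('incoming::pong', function () {
    var latency = now() - sent;              // step 2 completes the round trip
    spark.write('primus::pong::' + latency); // step 3 reports it to the client
  });
}
```
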

lpinca (Member) commented Oct 1, 2015

Before refactoring the code, I wonder if we can find out how the timers are throttled.
I've seen that it doesn't matter what the delay of the timer is; it is always delayed by 10 s.
If that is the case, we can just make the server-side timeout more tolerant by adding 10 s, but of course this workaround is very fragile, as this behavior can change between browsers and versions.

STRML (Contributor) commented Jan 19, 2017

Something new in Chrome has made this much worse; I'm seeing timeouts as long as 80 s when running the setInterval code you posted above.

We may have to disable heartbeating entirely for now, as this is killing connections very regularly.

STRML (Contributor) commented Jan 19, 2017

https://bugs.chromium.org/p/chromium/issues/detail?id=650594 may be related; I just noticed this happening a lot more often recently, and it lines up.

It must be some kind of heuristic based on how much CPU time the tab is taking up. On our testnet sites I can't get it to delay timers by more than 1 s, but on our production site (where there is much more realtime data), I start seeing the following pattern after a while on a 1 s timer:

[screenshot of delayed timer intervals, Jan 19, 2017]

lpinca (Member) commented Jan 19, 2017

@STRML Chrome 56? I cannot repro on Chrome 55.

STRML (Contributor) commented Jan 19, 2017

Yep, Version 56.0.2924.67 beta (64-bit).

It appears to be restricted to sites with a lot of background data or CPU usage. I'm testing on my own site, https://www.bitmex.com/app/trade/XBTUSD.

lpinca (Member) commented Jan 19, 2017

From the link you posted above:

Each WebView has a budget (in seconds) for running timers in background.
A timer task is only allowed to run when the budget is non-negative.
After a timer has executed, its run time is subtracted from the budget.
The budget regenerates with time (at a rate of 0.01 seconds per second).
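Those quoted rules can be illustrated with a toy simulation (an illustration of the described behavior, not Chrome's actual code; budget is tracked in integer milliseconds to keep the arithmetic exact, regenerating at 10 ms per second, i.e. 0.01 s/s):

```javascript
// Toy model of the quoted budget rules, not Chrome's implementation.
function simulateBudget(taskRunTimesMs, regenMsPerSec) {
  regenMsPerSec = regenMsPerSec || 10; // 0.01 seconds per second
  var budgetMs = 0;
  var clockSec = 0;
  var firedAt = [];

  taskRunTimesMs.forEach(function (runMs) {
    while (budgetMs < 0) {   // a task only runs when the budget is non-negative
      clockSec += 1;
      budgetMs += regenMsPerSec;
    }
    firedAt.push(clockSec);
    budgetMs -= runMs;       // run time is subtracted after execution
  });

  return firedAt;
}

// A 50 ms task ends up gated to roughly once every 5 seconds:
// simulateBudget([50, 50, 50]) -> [0, 5, 10]
```
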

STRML (Contributor) commented Jan 19, 2017

Well, that's utterly ridiculous without some way to define high- and low-priority timer tasks. What are they thinking?

lpinca (Member) commented Jan 19, 2017

I agree this kinda sucks.

STRML (Contributor) commented Jan 19, 2017

Confirming this is how it operates; I have triggered it even on localhost just by setting our market-mocking bots to overdrive so the timers get more expensive.

Perhaps it is time we start doing heartbeats on the server.

lpinca (Member) commented Jan 19, 2017

"Perhaps it is time we start doing heartbeats on the server."

Yes, and maybe the heartbeat system can also be improved by running only one timer for all connections instead of one per connection.

STRML (Contributor) commented Jan 19, 2017

That sounds worthwhile; a single timer could scan all open connections and close those that don't have a flag set, then unset the flag. Getting a response to the heartbeat sets the flag again.
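That single-timer sweep might look like this; the `alive` flag, the `end`/`write` methods on the connection objects, and the message names are assumptions, not Primus's actual internals:

```javascript
// Sketch of one server-side heartbeat timer shared by all connections.
function sweep(sparks) {
  var closed = [];

  sparks.forEach(function (spark) {
    if (!spark.alive) {
      spark.end();         // no pong since the last sweep: kill the connection
      closed.push(spark);
      return;
    }
    spark.alive = false;   // unset the flag; the pong handler sets it again
    spark.write('primus::ping');
  });

  return closed;
}

// One timer for every connection instead of one per connection:
// setInterval(function () { sweep(server.sparks); }, heartbeatInterval);
// spark.on('incoming::pong', function () { spark.alive = true; });
```
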

STRML (Contributor) commented Jan 20, 2017

Great news: the Chromium team is willing to bend on this.

They have also suggested putting the socket in a ServiceWorker. That should at least solve the heartbeating issue for us, but it will require some development effort.

lpinca (Member) commented Jan 20, 2017

Nice, I was also playing with server-sent pings. I will publish a PoC in a PR if everything works.
It will be a breaking change that removes the latency property from the client, but I think fixing this is more important.

STRML (Contributor) commented Jan 20, 2017

I think that's fine; I've never had any luck with latency actually being representative of client latency, probably due to timer coalescing.

lpinca (Member) commented Jan 20, 2017

@STRML see #534.

STRML (Contributor) commented Jan 26, 2017

Just an FYI: Chrome has committed to delaying this until M57 at least. The current implementation even throttles the response to WS messages, so reversing the flow will still break as of the 56 beta. Hoping we'll see more positive progress; we made a lot of developers aware via HN and Reddit!

lpinca (Member) commented Jan 26, 2017

"The current implementation even throttles the response to WS messages"

Ouch.
@STRML I've read your blog post, good stuff. Thanks.

lpinca (Member) commented May 4, 2017

After #534 this is no longer an issue, at least on Safari. Closing.

@lpinca lpinca closed this May 4, 2017
