This repository has been archived by the owner on Apr 19, 2023. It is now read-only.

[Investigation] SSE Stream Reconnection Problems #156

Open
suyashkumar opened this issue Jan 28, 2017 · 3 comments


@suyashkumar
Collaborator

suyashkumar commented Jan 28, 2017

Sometimes the SSE stream throws an error that it does not seem to recover from (we stop receiving publish events afterwards).

Sample of logs below where we get a 502 from Particle (see the last 4 lines):

{ coreid: '3e001d000e51353432393339',
  time: 2017-01-25T15:18:33.428Z,
  loc: 'ADPLKenyaN3763',
  temps:
   { HXCI: '24.1',
     HXCO: '-63.7',
     HTR: '22.2',
     HXHI: '20.9',
     HXHO: '22.0' },
  valveStatus: undefined }
{ coreid: '3e001d000e51353432393339',
  time: 2017-01-25T15:18:33.789Z,
  loc: 'ADPLKenyaN3763',
  data: 218 }
ERROR (Likely Event Source)
Event { type: 'error' }
ERROR (Likely Event Source)
Event { type: 'error', status: 502 }
@suyashkumar
Collaborator Author

Looking at the eventsource library source, it binds onConnectionClosed to the error event in connect(). onConnectionClosed has retry logic that is reproduced below:

function onConnectionClosed() {
  if (readyState === EventSource.CLOSED) return;
  readyState = EventSource.CONNECTING;
  _emit('error', new Event('error'));

  // The url may have been changed by a temporary
  // redirect. If that's the case, revert it now.
  if (reconnectUrl) {
    url = reconnectUrl;
    reconnectUrl = null;
  }
  setTimeout(function () {
    if (readyState !== EventSource.CONNECTING) {
      return;
    }
    connect();
  }, self.reconnectInterval);
}

The default reconnect interval is 1000 ms. The error messages we see are consistent with the additional error event emitted in onConnectionClosed(). We do know, however, that no further error events are emitted, which means either that subsequent errors are not handled properly or that there is a problem relaying events after an error-triggered reconnect.
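
One way to tell these two possibilities apart might be to log the stream's readyState alongside every error event. The sketch below is only an illustration: `describeStreamError` and `READY_STATES` are hypothetical names, not part of the eventsource library, and the wiring in the trailing comment assumes `es` is an EventSource instance. A CLOSED state after the last error would suggest the library has stopped retrying, whereas CONNECTING or OPEN would point toward a connection that is alive but not relaying events.

```javascript
// readyState values defined by the EventSource specification.
const READY_STATES = { 0: 'CONNECTING', 1: 'OPEN', 2: 'CLOSED' };

// Hypothetical helper (not part of the eventsource library): turn an error
// event plus the stream's current readyState into a single log line.
function describeStreamError(event, readyState, now = new Date()) {
  const state = READY_STATES[readyState] || `UNKNOWN(${readyState})`;
  const status = event && event.status !== undefined ? event.status : 'n/a';
  return `${now.toISOString()} SSE error status=${status} readyState=${state}`;
}

// Example wiring, assuming `es` is an EventSource instance:
//   es.onerror = (event) =>
//     console.error(describeStreamError(event, es.readyState));
console.log(describeStreamError({ type: 'error', status: 502 }, 0));
```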

@mlp6 mlp6 added the bug label Jan 29, 2017
@suyashkumar
Collaborator Author

[Screenshot taken 2017-02-23 at 10:15:38 AM; image not available]
More errors. To be clear, these are coming from the Particle server, which is returning 502s. Particle seems to have become much less reliable since we started. We appear to retry the stream a number of times, but it keeps failing on their end. After the stream fails, we don't get any more SSE messages until our server is restarted. I'm not sure why the retry logic in the library (in the comment above) does not continue indefinitely. Perhaps it gets a valid connection but receives no messages on it? That would be scary and hard to detect. We may need to invest some time in building more sophisticated retry logic on top of what the library already includes (exponential backoff, reconnecting if it has been x minutes since we last received a message). It is also becoming increasingly necessary for us to start firing off alert emails when things like this occur so that we maintain data integrity.
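
As a rough sketch of the two ideas mentioned above (exponential backoff and a "time since last message" check), something like the following could serve as a starting point. The function names, the 1000 ms base (chosen to match the library's default reconnect interval), and the 60 s cap are all assumptions for illustration, not existing code in this project or in the eventsource library.

```javascript
// Hypothetical helper: delay in ms before reconnect attempt `attempt`
// (0-based). Doubles on each attempt and is capped at `maxMs`.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: true when more than `maxSilenceMs` has elapsed
// since the last message, i.e. the stream looks dead even if it never
// reported an error.
function isStale(lastMessageAt, now, maxSilenceMs) {
  return now - lastMessageAt > maxSilenceMs;
}

// Print the delays for the first few attempts.
for (let attempt = 0; attempt < 8; attempt += 1) {
  console.log(`attempt ${attempt}: wait ${backoffDelay(attempt)} ms`);
}
```

The staleness check is what would catch the "valid connection but no messages" case, since it does not depend on an error event ever being emitted.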

@suyashkumar
Collaborator Author

Another one

{ coreid: '24002d000951343334363138',
  time: 2017-03-17T18:22:39.433Z,
  loc: 'Kenya-North',
  temps:
   { HXCI: '21.5',
     HXCO: '26.9',
     HTR: '47.2',
     HXHI: '27.7',
     HXHO: '24.9' },
  valveStatus: '1' }
{ coreid: '400057000a51343334363138',
  time: 2017-03-17T18:22:48.564Z,
  loc: 'Duke',
  temps:
   { HXCI: '21.6',
     HXCO: '21.2',
     HTR: '20.9',
     HXHI: '21.2',
     HXHO: '23.3' },
  valveStatus: '1' }
{ coreid: '400057000a51343334363138',
  loc: 'Duke',
  time: 2017-03-17T18:22:48.564Z,
  data: '2' }
ERROR (Likely Event Source)
Event { type: 'error' }
ERROR (Likely Event Source)
Event { type: 'error', status: 502 }

suyashkumar added a commit that referenced this issue Aug 23, 2018
This should address #156 by creating a heartbeat timer that will reset the SSE connection to particle if a certain number of messages are not received every heartbeat interval.
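
The referenced commit is not shown in this thread, so the following is only a guess at what such a heartbeat timer could look like, not the actual implementation. `HeartbeatMonitor`, its options, and the wiring in the trailing comment (`connectToParticle`) are hypothetical names; the only library calls assumed are `addEventListener` and `close` on an EventSource instance.

```javascript
// Hypothetical sketch: counts messages per heartbeat interval and calls
// `onStale` when fewer than `minMessages` arrived in that interval.
class HeartbeatMonitor {
  constructor({ minMessages, intervalMs, onStale }) {
    this.minMessages = minMessages;
    this.intervalMs = intervalMs;
    this.onStale = onStale;
    this.count = 0;
    this.timer = null;
  }

  // Call once for every SSE message received.
  recordMessage() {
    this.count += 1;
  }

  // Evaluate the interval that just ended, then reset the counter.
  // Returns true if the stream was considered stale.
  check() {
    const stale = this.count < this.minMessages;
    this.count = 0;
    if (stale) this.onStale();
    return stale;
  }

  start() {
    this.stop();
    this.timer = setInterval(() => this.check(), this.intervalMs);
  }

  stop() {
    if (this.timer !== null) clearInterval(this.timer);
    this.timer = null;
  }
}

// Example wiring (assumed names; `es` is an EventSource instance):
//   const monitor = new HeartbeatMonitor({
//     minMessages: 1,
//     intervalMs: 5 * 60 * 1000,
//     onStale: () => { es.close(); es = connectToParticle(); },
//   });
//   es.addEventListener('message', () => monitor.recordMessage());
//   monitor.start();
```

Keeping `check()` separate from the timer makes the stale/fresh decision testable without waiting for real intervals to elapse.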