The websocket connection can fail silently #99

Closed
petevg opened this Issue Mar 29, 2017 · 2 comments

Collaborator

petevg commented Mar 29, 2017

I don't have a clean repro yet, but I believe this to be the underlying cause of this bug in matrix: juju-solutions/matrix#92

Basically, matrix periodically checks on the "health" of a cluster by checking the status for each unit in each application in Model.applications, where Model is the Model class from python-libjuju's model.py. That data is supposed to be updated by a listener inside the Model object.

If matrix is running on a flaky connection, or if the Controller times out the websocket connection, the listener stops receiving updates, but no Exception is raised. I've verified that this is what's happening by running matrix in a debugger and checking the .is_open property of the connection. Eventually, it gets set to False because we have been disconnected from the Controller.
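To make the failure mode concrete, here is a minimal asyncio sketch of how a background receiver can die without the caller ever seeing an exception. This is not python-libjuju's code: `StubModel`, its attributes, and the `ConnectionError` are all hypothetical stand-ins, and the real listener may fail in a different way.

```python
import asyncio


class StubModel:
    """Hypothetical stand-in for a model whose state is fed by a listener."""

    def __init__(self):
        self.unit_status = "active"  # last value the listener cached
        self.is_open = True
        self._task = None

    async def _receiver(self):
        # Simulate the websocket dropping shortly after start-up.
        await asyncio.sleep(0)
        self.is_open = False
        raise ConnectionError("websocket dropped")

    def start(self):
        # The receiver runs as a background task that nobody awaits.
        self._task = asyncio.ensure_future(self._receiver())
        return self._task


async def demo():
    model = StubModel()
    task = model.start()
    await asyncio.sleep(0.01)  # give the receiver time to fail

    # Nothing was raised here; the cached status still looks healthy.
    observed = (model.unit_status, model.is_open, task.done())
    # The error only becomes visible if someone inspects the task.
    error = task.exception()
    return observed, error


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the caller keeps reading a stale "active" status even though the receiver task has already finished with an error, which matches the symptom described above.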

We need to either come up with an automagic reconnection scheme, or a recommended pattern for dealing with the case where the listener gets disconnected.

Note that this is semi-related to #98, though making a Pinger would only address Controller-initiated timeouts; a flaky connection would still kill the listener.

petevg referenced this issue in juju-solutions/matrix on Mar 29, 2017: "No units in model - timeout" #92 (Closed)

petevg commented Apr 4, 2017

I've been giving this one a lot of thought over the past couple of days.

I don't think that an auto-resume will work. There doesn't seem to be a facility in the websocket API for resuming a watcher, and there are a lot of ways to break things and miss messages if we try to hack something together on our end.

I also don't think that an elaborate Exception surfacing scheme is necessarily called for. At least, I haven't come up with one that isn't a headache for developers who aren't me to understand.

I think that the best solution would be to attach a "monitor" class to each connection object. The monitor class would have a "status" property that would be in one of three states:

    1. connected
    2. errored
    3. closed

"connected" means that everything is healthy. "closed" means that the websocket connection is closed due to the connection.close() method being called. "errored" means that the connection.receiver routine raised an Exception for whatever reason, and our connection is no longer functional (and probably disconnected due to an unexpected network issue).
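A rough sketch of what such a monitor might look like, to make the three states concrete. Every name here (`ConnectionMonitor`, `StubConnection`, `receiver_failed`, the attribute names) is made up for illustration and is not the merged python-libjuju API; the rule that a deliberate close takes precedence over a recorded error is also an assumption.

```python
class ConnectionMonitor:
    """Hypothetical monitor; names are illustrative only."""

    CONNECTED = "connected"
    ERRORED = "errored"
    CLOSED = "closed"

    def __init__(self):
        self.close_called = False   # set when close() is called on purpose
        self.receiver_error = None  # set when the receiver raises

    @property
    def status(self):
        # Assumption: a deliberate close() wins over a recorded error.
        if self.close_called:
            return self.CLOSED
        if self.receiver_error is not None:
            return self.ERRORED
        return self.CONNECTED


class StubConnection:
    """Hypothetical connection that owns a monitor."""

    def __init__(self):
        self.monitor = ConnectionMonitor()

    def close(self):
        self.monitor.close_called = True

    def receiver_failed(self, exc):
        # In a real connection the receiver loop would call this
        # from its exception handler before exiting.
        self.monitor.receiver_error = exc
```

Deriving `status` from two recorded facts, rather than storing a string, keeps the three states mutually exclusive without extra bookkeeping.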

The monitor could be used in the following ways:

  1. The "health" check in matrix can see that connection.monitor.status == 'errored', and raise an InfraFailure. This addresses juju-solutions/matrix#92.
  2. A tool with an interactive user interface could have a little "connection" icon that goes from green to red when the connection fails, prompting the user to take action to reconnect.
  3. A tool could track all its watchers, possibly via helpers in python-libjuju, and be able to restart them if the connection.monitor.status goes bad. The tool would be responsible for ensuring that any actions it takes on deltas in a watcher are idempotent. An auto-reconnect helper in the matrix tasks could be an eventual model for this pattern.
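The first usage could be sketched roughly as follows. The `InfraFailure` class below is a local stand-in for the matrix exception of the same name, `check_health` and `make_connection` are hypothetical helpers, and treating "active" as the healthy unit status is an assumption made only for this example.

```python
from types import SimpleNamespace


class InfraFailure(Exception):
    """Local stand-in for matrix's InfraFailure."""


def make_connection(status):
    # Hypothetical connection exposing only monitor.status.
    return SimpleNamespace(monitor=SimpleNamespace(status=status))


def check_health(connection, unit_statuses):
    """Return True if all units look healthy.

    Raises InfraFailure when the connection has errored, because the
    cached unit statuses can no longer be trusted.
    """
    if connection.monitor.status == "errored":
        raise InfraFailure("connection lost; cached unit data is stale")
    # Assumption: "active" is the status that counts as healthy.
    return all(status == "active" for status in unit_statuses.values())
```

The point of the check is ordering: the connection status is consulted before the cached unit data, so a dead listener surfaces as an infrastructure failure instead of a misleading health result.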

@tvansteenburgh @johnsca @abentley: thoughts on this approach? It isn't complicated, so I should be able to push a WIP PR soon ...

(Edit: I'd like to stick stuff in a "monitor" class, rather than attaching "status" directly to the connection because I think that it might help people reason about the purpose of the status, and use it more effectively.)

petevg commented Apr 26, 2017

Changes merged.

petevg closed this Apr 26, 2017
