The websocket connection can fail silently #99
Comments
petevg referenced this issue in juju-solutions/matrix on Mar 29, 2017: No units in model - timeout #92 (Closed)
I've been giving this one a lot of thought over the past couple of days. I don't think that an auto-resume will work. There doesn't seem to be a facility in the websocket api for resuming a watcher, and there are a lot of ways to break and miss messages if we try to hack something together on our end.
I also don't think that an elaborate Exception surfacing scheme is necessarily called for. At least, I haven't come up with one that isn't a headache for developers who aren't me to understand.
I think that the best solution would be to attach a "monitor" class to each connection object. The monitor class would have a "status" property that would be in one of three states:
"connected" means that everything is healthy.
"closed" means that the websocket connection is closed due to the connection.close() method being called.
"errored" means that the connection.receiver routine raised an Exception for whatever reason, and our connection is no longer functional (and probably disconnected due to an unexpected network issue).
The monitor could be used in the following ways:
@tvansteenburgh @johnsca @abentley: thoughts on this approach? It isn't complicated, so I should be able to push a WIP PR soon ...
(Edit: I'd like to stick stuff in a "monitor" class, rather than attaching "status" directly to the connection, because I think that it might help people reason about the purpose of the status, and use it more effectively.)
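To make the monitor idea above a bit more concrete, here is a rough sketch of how the three states might be derived. Everything in it is an assumption for illustration: `FakeConnection`, its `close_called` and `receiver_error` attributes, and the `Monitor` class itself are hypothetical stand-ins, not python-libjuju's actual API.

```python
class FakeConnection:
    """Hypothetical stand-in for a connection object (not the real libjuju class)."""

    def __init__(self):
        self.close_called = False    # assumed flag: set when close() is called
        self.receiver_error = None   # assumed slot: exception raised by the receiver

    def close(self):
        self.close_called = True

    def fail(self, exc):
        # Simulates the receiver routine raising an Exception.
        self.receiver_error = exc


class Monitor:
    """Reports the health of a connection as one of three states."""

    CONNECTED = "connected"
    CLOSED = "closed"
    ERRORED = "errored"

    def __init__(self, connection):
        self.connection = connection

    @property
    def status(self):
        # A deliberate close() takes precedence over any later receiver error.
        if self.connection.close_called:
            return self.CLOSED
        # The receiver raised, so the connection is no longer functional.
        if self.connection.receiver_error is not None:
            return self.ERRORED
        return self.CONNECTED
```

A caller could then check `monitor.status` before trusting cached model data, and treat "errored" as the signal to tear down and rebuild the connection.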
Changes merged.
petevg commented Mar 29, 2017
I don't have a clean repro yet, but I believe this to be the underlying cause of this bug in matrix: juju-solutions/matrix#92
Basically, matrix periodically checks on the "health" of a cluster by checking the status for each unit in each application in Model.applications, where Model is the Model class from python-libjuju's model.py. That data is supposed to be updated by a listener inside the Model object.
If matrix is running on a flaky connection, or if the Controller times the websocket connection out, the listener will stop receiving updates, but no Exception is thrown. I've verified that this is what's happening by running matrix in a debugger and checking the .is_open property of the connection. Eventually, it gets set to false because we have been disconnected from the Controller.
We need to either come up with an automagic reconnection scheme, or a recommended pattern for dealing with the case where the listener gets disconnected.
Note that this is semi-related to #98, though making a Pinger would only address Controller initiated timeouts; a flaky connection would still kill the listener.
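As one possible shape for the "recommended pattern" option, a caller could poll the connection's .is_open property and react when it flips. This is only a sketch under assumptions: `FakeConnection`, `watch_connection`, and the `on_disconnect` callback are hypothetical names, and only the `is_open` attribute comes from the description above.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in exposing only an is_open attribute."""

    def __init__(self):
        self.is_open = True


async def watch_connection(connection, on_disconnect, interval=0.01):
    """Poll connection.is_open and call on_disconnect once when it turns False."""
    while connection.is_open:
        await asyncio.sleep(interval)
    on_disconnect()


async def demo():
    conn = FakeConnection()
    events = []
    watcher = asyncio.create_task(
        watch_connection(conn, lambda: events.append("disconnected"))
    )
    await asyncio.sleep(0.03)
    conn.is_open = False  # simulate the Controller dropping the connection
    await watcher
    return events
```

Polling is crude, but it surfaces a dead listener without depending on an Exception reaching the caller. It does not recover any messages missed while disconnected.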