Cog doesn't stop or restart emqttd cleanly #897

Closed
kevsmith opened this issue Aug 3, 2016 · 1 comment

kevsmith commented Aug 3, 2016

Cog.BusDriver is a GenServer process responsible for configuring, starting, and stopping emqttd as an included application. Logic for configuring and starting emqttd is implemented in Cog.BusDriver.init/1. Logic for stopping the message bus is implemented in Cog.BusDriver.terminate/2. We've recently found two issues which point to implementation bugs in BusDriver:

  1. Relays don't always refresh their online status with Cog when the app restarts. This only happens sometimes and only when Cog's process tree is reinitialized (restarting the top level supervisor, for example). It never happens if the entire VM is restarted.
  2. Cog's test suite has started failing intermittently. The failures don't correlate with any particular test or test module and appear at random points during a run. Once the error occurs, all subsequent tests fail.
@kevsmith kevsmith added this to the Cog 0.12.0 milestone Aug 3, 2016
@kevsmith kevsmith self-assigned this Aug 3, 2016

kevsmith commented Aug 3, 2016

Further debugging by @christophermaier and myself has uncovered the following:

  1. Cog.BusDriver does not trap exits. This means Cog.BusDriver.terminate/2 is never called when Cog is shutting down or restarting, so emqttd is never stopped and therefore never restarted. Relay uses message bus connect/disconnect events to determine when to send an announcement to its parent Cog. In the described scenario Cog forgets the online status of all registered Relays, while the Relays don't send new announcements because they were never disconnected from the message bus. From a user's point of view, Cog appears to spontaneously forget about all the Relays until the VM is restarted.
  2. It turns out a change I introduced to Cog.BusDriver would cause emqttd to throw errors when accessing its private mnesia tables. I had modified Cog.BusDriver.init/1 to delete mnesia's data files before starting emqttd. My intention was to simplify the upgrade process to emqttd 1.1.2 by eliminating the possibility of emqttd crashing due to an obsolete mnesia schema. Cog doesn't currently use persistent messages, nor do we have plans to use them in the future, so any data kept in mnesia is technically disposable. Subsequent investigation uncovered ordering issues between mnesia, Cog, and emqttd during app startup and initialization. Deleting the data files in Cog.BusDriver.init/1 could delete data out from under running mnesia processes. This error case is most easily triggered in our test suite, as it starts and stops Cog many times. It's also possible for this error to occur during normal operation, especially if Cog's process tree is restarted.
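
For anyone following along, the first point can be illustrated with a minimal sketch. This is not Cog's actual code: `BusDriverSketch` and the `{:bus_stopped, reason}` message are hypothetical names made up for the example, and the real `terminate/2` would stop emqttd rather than send a message. The sketch only shows that a GenServer needs to trap exits for `terminate/2` to run when its supervisor shuts it down.

```elixir
defmodule BusDriverSketch do
  # Hypothetical stand-in for Cog.BusDriver; illustration only.
  use GenServer

  def start_link(notify_pid) do
    GenServer.start_link(__MODULE__, notify_pid)
  end

  @impl true
  def init(notify_pid) do
    # Without this flag, the :shutdown exit signal sent by the supervisor
    # kills the process immediately and terminate/2 is skipped.
    Process.flag(:trap_exit, true)
    {:ok, notify_pid}
  end

  @impl true
  def terminate(reason, notify_pid) do
    # The real process would stop emqttd here; the sketch only reports
    # that the callback ran and with which reason.
    send(notify_pid, {:bus_stopped, reason})
    :ok
  end
end
```

Starting this under a supervisor with `Supervisor.start_link([{BusDriverSketch, self()}], strategy: :one_for_one)` and then calling `Supervisor.stop/1` should deliver `{:bus_stopped, :shutdown}` to the caller. Removing the `Process.flag/2` call should make that message disappear, which matches the behavior described above.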

kevsmith pushed a commit that referenced this issue Aug 3, 2016
- `Cog.BusDriver` now traps exits if it successfully starts
  `emqttd`. This allows Cog to keep the message bus state in sync with
  Cog's overall state. If Cog is up, the message bus is up. If Cog is
  down, the message bus is down.

- Removed the `mnesia` data file deletion logic from
  `Cog.BusDriver`. It wound up causing more problems than it solved. We can
  tell users to delete the data dir in the release notes.

- Moved all of Cog core, modulo `Cog.Repo` and `Cog.BusDriver`, to a
  separate supervisor named `Cog.CoreSup` with a restart strategy of
  one-for-one. The top level supervisor is now responsible for
  `Cog.Repo`, `Cog.BusDriver`, and `Cog.CoreSup` and uses the
  one-for-all restart strategy. This should stabilize Cog as all
  processes will be restarted if the database or message bus goes down.

Fixes #897
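
The supervision layout described in the commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: `TreeSketch`, `Stub`, and the registered names `:repo_stub`, `:bus_driver_stub`, and `:core_worker` are hypothetical placeholders standing in for `Cog.Repo`, `Cog.BusDriver`, and the processes under `Cog.CoreSup`.

```elixir
defmodule TreeSketch do
  # Placeholder worker standing in for the real Cog processes.
  defmodule Stub do
    use Agent

    def start_link(name), do: Agent.start_link(fn -> :ok end, name: name)
  end

  # Stand-in for Cog.CoreSup: core processes restart independently.
  defmodule CoreSup do
    use Supervisor

    def start_link(_arg), do: Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)

    @impl true
    def init(:ok) do
      children = [Supervisor.child_spec({Stub, :core_worker}, id: :core_worker)]
      Supervisor.init(children, strategy: :one_for_one)
    end
  end

  # Stand-in for the top-level supervisor: if the database or message bus
  # stub dies, every sibling (including CoreSup) is restarted with it.
  def start_link do
    children = [
      Supervisor.child_spec({Stub, :repo_stub}, id: :repo_stub),
      Supervisor.child_spec({Stub, :bus_driver_stub}, id: :bus_driver_stub),
      CoreSup
    ]

    Supervisor.start_link(children, strategy: :one_for_all)
  end
end
```

With this layout, killing the process registered as `:bus_driver_stub` should cause `:core_worker` to come back under a new pid, while a crash of `:core_worker` alone leaves the two stubs at the top level untouched.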
@kevsmith kevsmith added the review label Aug 3, 2016
kevsmith pushed a commit that referenced this issue Aug 3, 2016
Fix emqttd mgmt;Refine process tree structure

Fixes #897
@kevsmith kevsmith removed the review label Aug 3, 2016