Apps and drivers fail after restarting docker (or host) running databox #187

Open
cgreenhalgh opened this issue Oct 31, 2017 · 8 comments

@cgreenhalgh

If docker is restarted (or if the host restarts) then the various databox services are re-created, including active drivers and apps, but in general they do not work. They seem to fail to connect to and/or authenticate correctly with the store(s) they are using. The driver-os-monitor makes repeated attempts (wait for store), then terminates (and is auto-restarted); the app-os-monitor fails, but this is only visible in the log (and in no data appearing).

Example output from app-os-monitor:

ttps://driver-os-monitor-store-json:8080
[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...{"target":"driver-os-monitor-store-json","path":"/ws","method":"GET"}
WSConnect::  401: Invalid API key
Token not in cache requesting new one
{"target":"driver-os-monitor-store-json","path":"/sub/loadavg1/ts","method":"GET"}
WSSubscribe dataSourceLoadavg1  401: Invalid API key
Token not in cache requesting new one

Example output from driver-os-monitor (I have seen Invalid API key, but also connection refused):

[waitForStoreStatus] Retrying in 1s...
{ Error: connect ECONNREFUSED 10.0.0.4:8080
    at Object._errnoException (util.js:1021:11)
    at _exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1175:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.0.0.4',
  port: 8080 }
[ERROR] { Error: connect ECONNREFUSED 10.0.0.4:8080
    at Object._errnoException (util.js:1021:11)
    at _exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1175:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.0.0.4',
  port: 8080 }
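
For readers unfamiliar with the pattern in the logs above, here is a minimal sketch of a bounded wait-for-store loop. It is not the databox implementation: `wait_for_store`, `check`, and the attempt count are hypothetical names and values chosen for illustration. Note that retrying only helps with the connection-refused case; it would not by itself resolve a 401.

```python
import time

def wait_for_store(check, attempts=5, delay_s=1.0, sleep=time.sleep):
    """Call check() until it returns True; give up after `attempts` tries.

    `check` is a hypothetical callable that probes the store's status.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except ConnectionRefusedError:
            pass  # store not listening yet (the ECONNREFUSED case)
        if attempt < attempts:
            print(f"[waitForStoreStatus] Retrying in {delay_s:g}s...")
            sleep(delay_s)
    raise TimeoutError(f"store not ready after {attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.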
@Toshbrown
Contributor

This happens because databox is run as a docker service, and by default, services are restarted on reboot or docker restart.

This is problematic because the arbiter holds its permissions in memory and the container manager does not reregister all the running components.
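
A toy model of this failure mode, as I understand it from the description above: the class and method names are hypothetical and this is not the arbiter's real API, only an illustration of why a key issued before a restart is rejected afterwards.

```python
import secrets

class InMemoryArbiter:
    """Toy stand-in for an arbiter that keeps registrations only in memory."""

    def __init__(self):
        # component name -> API key; this dict is lost when the process restarts
        self._keys = {}

    def register(self, name):
        key = secrets.token_hex(16)
        self._keys[name] = key
        return key

    def check(self, name, key):
        if self._keys.get(name) != key:
            return 401, "Invalid API key"
        return 200, "OK"

arbiter = InMemoryArbiter()
app_key = arbiter.register("app-os-monitor")
before_restart = arbiter.check("app-os-monitor", app_key)

# Simulate a restart: a fresh arbiter, while the app still holds its old key
# and nothing re-registers it.
arbiter = InMemoryArbiter()
after_restart = arbiter.check("app-os-monitor", app_key)
```

Here `before_restart` is accepted, while `after_restart` is rejected with the same 401 message seen in the app log.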

There are three solutions as I see it:

  1. Start databox with the --autorestart flag set to OFF (a simple fix, but it means you need to reinstall all apps and drivers after a restart).
  2. Add persistent storage to the arbiter (this would have to be a special-case store or a persistent volume mount). This could have security implications and would need encrypting.
  3. Alter the container manager to register the running stores, apps and drivers (this is only possible if the docker secrets persist).

cc @mor1: thoughts on how to proceed?

@cgreenhalgh
Author

I'm not certain if this is related, but I would suggest that the CM private key should definitely persist across restarts (whatever mode it is run in) as otherwise (in the secure UI version) users would have to install the new CA root certificate in their client(s) every time the databox restarted.

@mor1
Contributor

mor1 commented Oct 31, 2017

Thoughts:

  • We do need to add more persistence to components, including the arbiter.
  • Both stores and more ad hoc storage for core components should be encrypted at rest by default.
  • Auto-restart is probably useful and should be retained.

So, @Toshbrown I think that means no to 1, yes to 2 for sure, and I'm not sure I understand 3 correctly...?

@Toshbrown
Contributor

Toshbrown commented Oct 31, 2017

@mor1 If it's a yes to 2 then 3 is not needed (and, now I think about it, would not work).

There is a 4 as well (if secrets persist):

We could pass the arbiter its half of the key using secrets rather than an API call (this already happens for core components). Then, on restart, it can just reload the keys from /var/run/secrets.
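
A rough sketch of what reloading on restart might look like, assuming each secret is mounted as one file per name under the secrets directory. `load_secrets` and the secret name `ARBITER_TOKEN` are hypothetical, and the demo reads from a temporary directory rather than the real /var/run/secrets.

```python
import tempfile
from pathlib import Path

def load_secrets(secrets_dir="/var/run/secrets"):
    """Read every regular file in secrets_dir into a {name: bytes} dict.

    Returns an empty dict if the directory does not exist.
    """
    root = Path(secrets_dir)
    if not root.is_dir():
        return {}
    return {
        p.name: p.read_bytes().strip()
        for p in sorted(root.iterdir())
        if p.is_file()
    }

# Demo against a temporary directory standing in for the secrets mount.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ARBITER_TOKEN").write_text("example-token\n")  # hypothetical secret
    demo = load_secrets(tmp)
```

Because the files are re-read at startup, nothing needs to be re-sent over an API call after a restart, provided the secrets themselves persist.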

@cgreenhalgh the CM CA root certificate is persistent, as are the arbiter keys for core components.

What we decide here may also have implications for the core-network, so ccing @sevenEng just in case.

ccing @yousefamar as I may be missing some arbiter implementation details

@mor1
Contributor

mor1 commented Oct 31, 2017

@Toshbrown Ah! I understand 3 now too :) Yes, 4 seems better than either 2 or 3 to me, assuming secrets passing is indeed secret even for an on-host observer, which it surely must be (?)

What are the core-network implications you're thinking of? In terms of the configuration state, or something else?

@Toshbrown
Contributor

Toshbrown commented Oct 31, 2017

Configuration state, mainly. It also runs outside of the swarm, and hence is not part of the service, so it may not get restarted automatically.

@mor1
Contributor

mor1 commented Oct 31, 2017

@Toshbrown Ok thanks
@sevenEng Auto-restart worth noting as an issue for core-network?

@Toshbrown
Contributor

Fixed in 0.4.0 on Linux (see the databox-install-ubuntu-service script); still an issue on macOS.
