Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wait: no child processes #178

Closed
kenjones-cisco opened this issue Jun 15, 2016 · 14 comments
Closed

wait: no child processes #178

kenjones-cisco opened this issue Jun 15, 2016 · 14 comments

Comments

@kenjones-cisco
Copy link

I keep see the wait: no child processes show up in my log files but I've not been able to determine the root cause of why that happens.

Does this mean that a health check or onChange script execution couldn't be executed as there were no available child processes?

The only reference I saw previously dealt with the reapChildren work done previously, so does this mean that the attempt to reap found no children to reap?

@tgross
Copy link
Contributor

tgross commented Jun 15, 2016

I've run into this intermittently. The code section in question is in utils/run.go ExecuteAndWait. If you check out the golang source code for cmd.Run you'll see a race condition. The process is started and then we wait for it. But if the process completes and exits before the wait happens (because, say, the go runtime decides to do a GC pause right then or the goroutine yields for the syscall), then we'll get an error there.

You should be able to add some logging to your health/onChange handlers to verify this for yourself; you'll see your logs and then the wait: no child processes error.

That being said, I'd like to come up with a better way to handle this. We want to make sure we're logging if there's a real error, but we don't want to log in this case which is really just a limitation of the underlying API.

@tgross tgross added the bug label Jun 16, 2016
@tgross
Copy link
Contributor

tgross commented Jun 16, 2016

I'm going to tag this as a bug although it's really more of a UX thing.

@tgross
Copy link
Contributor

tgross commented Jul 12, 2016

@kenjones-cisco are you running consul-template in the container where you're seeing this? And if so, which version? I've been having trouble reproducing lately and I wonder if you're seeing hashicorp/consul-template#506 which was fixed in consul-template 0.13 which was released in Feb.

@kenjones-cisco
Copy link
Author

kenjones-cisco commented Jul 12, 2016

Yes, inside the same container.

consul-template v0.15

@tgross
Copy link
Contributor

tgross commented Jul 13, 2016

Ok, I have definitely been able to reproduce now... just need to figure out how to fix it next. 😀

@tgross
Copy link
Contributor

tgross commented Aug 5, 2016

I believe that I've fixed this in #184 and #199. We're no longer calling cmd.Run and instead have called cmd.Start and cmd.Wait separately for all health, onChange, and sensor handlers.

We may want to try to catch the error wait: no child processes here and just log.Debugf it, but let me see if I can verify that first.

@tgross tgross added the open PR label Aug 5, 2016
@tgross
Copy link
Contributor

tgross commented Aug 12, 2016

Released in https://github.com/joyent/containerpilot/releases/tag/2.4.0. Going to mark this as fixed tentatively and close it. If it comes up again let's reopen this to dive in.

@tgross
Copy link
Contributor

tgross commented Aug 18, 2016

I'm going to have to re-open this, as I've seen it again while working on some autopilotpattern/mysql features. Again, it's pretty harmless log noise but we can do better.

@tgross tgross reopened this Aug 18, 2016
@tgross
Copy link
Contributor

tgross commented Sep 23, 2016

I know it's been a bit since I've circled back to this but I've discovered that this really is not harmless log noise and can result in missed heartbeats. I'll be digging into this shortly.

@tgross
Copy link
Contributor

tgross commented Sep 23, 2016

With debug logging on it manifests as follows:

2016/09/23 19:27:51 mysql.health.run waiting for PID 326:
2016/09/23 19:27:51 mysql.health exited with error: wait: no child processes
2016/09/23 19:27:51 mysql.health.RunWithTimeout end
    2016/09/23 19:27:52 [WARN] agent: Check 'mysql-96fac388f99f' missed TTL, is now critical
    2016/09/23 19:27:52 [INFO] agent: Synced check 'mysql-96fac388f99f'

Which I can trace to where we split out the Run and Wait in waitForTimeout. We're returning all errors to the caller when we should be swallowing that particular one as a log-at-debug-only at most.

@tgross
Copy link
Contributor

tgross commented Sep 23, 2016

Opened #222 to fix.

@tgross
Copy link
Contributor

tgross commented Sep 26, 2016

@tgross tgross closed this as completed Sep 26, 2016
@tgross tgross added released and removed open PR labels Sep 26, 2016
@lucj
Copy link

lucj commented Apr 25, 2017

One of my service, an api that only depends on an underlying mongodb instance, keeps on talking about this same error (api.health exited with error: waitid: no child processes).
The service uses Consul 0.7.5 and ContainerPilot 2.7.2

api_1            | 2017/04/25 14:31:18 Received event:db-change
api_1            | 2017/04/25 14:31:18 DB-EVENT
api_1            | warn: DB reconnection triggered...
api_1            | 2017/04/25 14:31:51     2017/04/25 14:31:51 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:31:51     2017/04/25 14:31:51 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | error: error checking DB connection(2): failed to reconnect after 30 attempts with interval 1000 ms
api_1            | 2017/04/25 14:31:57     2017/04/25 14:31:57 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:32:22     2017/04/25 14:32:22 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:32:22     2017/04/25 14:32:22 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:33:06     2017/04/25 14:33:06 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:33:31     2017/04/25 14:33:31 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:33:31     2017/04/25 14:33:31 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:34:36     2017/04/25 14:34:36 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:35:01     2017/04/25 14:35:01 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:35:01     2017/04/25 14:35:01 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:36:36 api.health exited with error: waitid: no child processes
api_1            | 2017/04/25 14:36:46 api.health exited with error: waitid: no child processes
api_1            | 2017/04/25 14:37:46     2017/04/25 14:37:46 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:38:46     2017/04/25 14:38:46 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:38:46     2017/04/25 14:38:46 [INFO] agent: Synced check 'api-622a73c0bcc0'

As a result the healthcheck is not consistent as the service status switches between healthy and unhealthy even if the underlying DB is stopped.

@tgross
Copy link
Contributor

tgross commented Apr 25, 2017

@lucj can you provide debug logs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants