wait: no child processes #178

kenjones-cisco · 2016-06-15T20:04:41Z

I keep see the wait: no child processes show up in my log files but I've not been able to determine the root cause of why that happens.

Does this mean that a health check or onChange script execution couldn't be executed as there were no available child processes?

The only reference I saw previously dealt with the reapChildren work done previously, so does this mean that the attempt to reap found no children to reap?

The text was updated successfully, but these errors were encountered:

tgross · 2016-06-15T20:25:28Z

I've run into this intermittently. The code section in question is in utils/run.go ExecuteAndWait. If you check out the golang source code for cmd.Run you'll see a race condition. The process is started and then we wait for it. But if the process completes and exits before the wait happens (because, say, the go runtime decides to do a GC pause right then or the goroutine yields for the syscall), then we'll get an error there.

You should be able to add some logging to your health/onChange handlers to verify this for yourself; you'll see your logs and then the wait: no child processes error.

That being said, I'd like to come up with a better way to handle this. We want to make sure we're logging if there's a real error, but we don't want to log in this case which is really just a limitation of the underlying API.

tgross · 2016-06-16T20:10:44Z

I'm going to tag this as a bug although it's really more of a UX thing.

tgross · 2016-07-12T20:39:42Z

@kenjones-cisco are you running consul-template in the container where you're seeing this? And if so, which version? I've been having trouble reproducing lately and I wonder if you're seeing hashicorp/consul-template#506 which was fixed in consul-template 0.13 which was released in Feb.

kenjones-cisco · 2016-07-12T22:22:16Z

Yes, inside the same container.

consul-template v0.15

tgross · 2016-07-13T15:16:56Z

Ok, I have definitely been able to reproduce now... just need to figure out how to fix it next. 😀

tgross · 2016-08-05T19:45:45Z

I believe that I've fixed this in #184 and #199. We're no longer calling cmd.Run and instead have called cmd.Start and cmd.Wait separately for all health, onChange, and sensor handlers.

We may want to try to catch the error wait: no child processes here and just log.Debugf it, but let me see if I can verify that first.

tgross · 2016-08-12T18:54:19Z

Released in https://github.com/joyent/containerpilot/releases/tag/2.4.0. Going to mark this as fixed tentatively and close it. If it comes up again let's reopen this to dive in.

tgross · 2016-08-18T18:19:30Z

I'm going to have to re-open this, as I've seen it again while working on some autopilotpattern/mysql features. Again, it's pretty harmless log noise but we can do better.

tgross · 2016-09-23T19:30:40Z

I know it's been a bit since I've circled back to this but I've discovered that this really is not harmless log noise and can result in missed heartbeats. I'll be digging into this shortly.

tgross · 2016-09-23T19:41:13Z

With debug logging on it manifests as follows:

2016/09/23 19:27:51 mysql.health.run waiting for PID 326:
2016/09/23 19:27:51 mysql.health exited with error: wait: no child processes
2016/09/23 19:27:51 mysql.health.RunWithTimeout end
    2016/09/23 19:27:52 [WARN] agent: Check 'mysql-96fac388f99f' missed TTL, is now critical
    2016/09/23 19:27:52 [INFO] agent: Synced check 'mysql-96fac388f99f'

Which I can trace to where we split out the Run and Wait in waitForTimeout. We're returning all errors to the caller when we should be swallowing that particular one as a log-at-debug-only at most.

tgross · 2016-09-23T20:11:34Z

Opened #222 to fix.

tgross · 2016-09-26T15:41:04Z

Released in https://github.com/joyent/containerpilot/releases/tag/2.4.2

lucj · 2017-04-25T14:44:13Z

One of my service, an api that only depends on an underlying mongodb instance, keeps on talking about this same error (api.health exited with error: waitid: no child processes).
The service uses Consul 0.7.5 and ContainerPilot 2.7.2

api_1            | 2017/04/25 14:31:18 Received event:db-change
api_1            | 2017/04/25 14:31:18 DB-EVENT
api_1            | warn: DB reconnection triggered...
api_1            | 2017/04/25 14:31:51     2017/04/25 14:31:51 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:31:51     2017/04/25 14:31:51 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | error: error checking DB connection(2): failed to reconnect after 30 attempts with interval 1000 ms
api_1            | 2017/04/25 14:31:57     2017/04/25 14:31:57 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:32:22     2017/04/25 14:32:22 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:32:22     2017/04/25 14:32:22 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:33:06     2017/04/25 14:33:06 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:33:31     2017/04/25 14:33:31 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:33:31     2017/04/25 14:33:31 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:34:36     2017/04/25 14:34:36 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:35:01     2017/04/25 14:35:01 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:35:01     2017/04/25 14:35:01 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:36:36 api.health exited with error: waitid: no child processes
api_1            | 2017/04/25 14:36:46 api.health exited with error: waitid: no child processes
api_1            | 2017/04/25 14:37:46     2017/04/25 14:37:46 [INFO] agent: Synced check 'api-622a73c0bcc0'
api_1            | 2017/04/25 14:38:46     2017/04/25 14:38:46 [WARN] agent: Check 'api-622a73c0bcc0' missed TTL, is now critical
api_1            | 2017/04/25 14:38:46     2017/04/25 14:38:46 [INFO] agent: Synced check 'api-622a73c0bcc0'

As a result the healthcheck is not consistent as the service status switches between healthy and unhealthy even if the underlying DB is stopped.

tgross · 2017-04-25T15:52:40Z

@lucj can you provide debug logs?

tgross added the bug label Jun 16, 2016

tgross added the open PR label Aug 5, 2016

tgross closed this as completed Aug 12, 2016

tgross added fixed and removed open PR fixed labels Aug 12, 2016

tgross reopened this Aug 18, 2016

tgross mentioned this issue Sep 23, 2016

Snapshot TTL expiry causes Primary to be marked unhealthy autopilotpattern/mysql#59

Closed

tgross mentioned this issue Sep 23, 2016

Prevent 'no child process' err from breaking heartbeats #222

Merged

tgross added the open PR label Sep 23, 2016

tgross closed this as completed Sep 26, 2016

tgross added released and removed open PR labels Sep 26, 2016

neil-hickey mentioned this issue Oct 11, 2022

❄️ Flaky Test Collection ❄️ carvel-dev/kapp-controller#753

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wait: no child processes #178

wait: no child processes #178

kenjones-cisco commented Jun 15, 2016

tgross commented Jun 15, 2016 •

edited

tgross commented Jun 16, 2016

tgross commented Jul 12, 2016

kenjones-cisco commented Jul 12, 2016 •

edited

tgross commented Jul 13, 2016

tgross commented Aug 5, 2016

tgross commented Aug 12, 2016

tgross commented Aug 18, 2016

tgross commented Sep 23, 2016 •

edited

tgross commented Sep 23, 2016 •

edited

tgross commented Sep 23, 2016

tgross commented Sep 26, 2016

lucj commented Apr 25, 2017

tgross commented Apr 25, 2017

wait: no child processes #178

wait: no child processes #178

Comments

kenjones-cisco commented Jun 15, 2016

tgross commented Jun 15, 2016 • edited

tgross commented Jun 16, 2016

tgross commented Jul 12, 2016

kenjones-cisco commented Jul 12, 2016 • edited

tgross commented Jul 13, 2016

tgross commented Aug 5, 2016

tgross commented Aug 12, 2016

tgross commented Aug 18, 2016

tgross commented Sep 23, 2016 • edited

tgross commented Sep 23, 2016 • edited

tgross commented Sep 23, 2016

tgross commented Sep 26, 2016

lucj commented Apr 25, 2017

tgross commented Apr 25, 2017

tgross commented Jun 15, 2016 •

edited

kenjones-cisco commented Jul 12, 2016 •

edited

tgross commented Sep 23, 2016 •

edited

tgross commented Sep 23, 2016 •

edited