[cmd/opampsupervisor] Supervisor may leak collector process #32189

Closed
BinaryFissionGames opened this issue Apr 5, 2024 · 4 comments · Fixed by #32875

Contributor

BinaryFissionGames commented Apr 5, 2024

Component(s)

cmd/opampsupervisor, extension/opamp

Describe the issue you're reporting

In its current form, the Supervisor may leak a collector process if it is unexpectedly killed and doesn't get a chance to stop the collector. You can see this by issuing a kill -9 to the supervisor and observing that the collector process is still running. This will block subsequent startups of the supervisor, as multiple collectors will try to occupy the same ports (8888, plus the ports of any user-configured components).

Under normal circumstances, we shouldn't leak the collector process, but I think we can make this more robust so that when the supervisor unexpectedly dies, the collector dies too.

I thought through a couple of ideas (a collector PID file, doing some per-OS black magic to have the OS auto-kill the children), but I think they all end up way more complex than just having the collector monitor its ppid and exit if it changes.

In one of our old managed agents, this is how we handled cases where the supervisor might shut down without properly killing the child process.

To fully outline the proposal, it would be something like:

  1. Add an optional field to the OpAMP extension, supervisor_pid
  2. When supervisor_pid is configured, the OpAMP extension will poll (maybe every ~5 seconds) and ensure that the value of os.Getppid() equals the value of supervisor_pid
  3. If these values are not equal, the OpAMP extension reports a fatal error to trigger a collector shutdown.

Note: I think the second step here might need different logic on Windows, as the orphaned process may not be re-parented like it is on other systems.

Contributor

github-actions bot commented Apr 5, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 added the bug label and removed the needs triage label on Apr 5, 2024
@crobert-1
Member

Sounds like a valid bug to me. I'll defer to others regarding the proper approach here, but I've removed needs triage.

@evan-bradley
Contributor

This is a good catch. The approach you've outlined here makes sense to me. I don't know enough about Windows to make any suggestions, but at least for Unix-like operating systems, I think having the extension handle this will be the most reliable and portable approach.

evan-bradley added the priority:p2 (Medium) label on Apr 16, 2024
@tigrannajaryan
Member

Apparently there is something called a job object that can help with this on Windows, but I think the general approach of polling the parent PID should work on Windows too.

evan-bradley pushed a commit that referenced this issue May 3, 2024
…been orphaned (#32564)

**Description:**
* Allows the extension to monitor a passed-in ppid, which should be the
parent process ID for the collector. When the parent process exits,
the extension emits a fatal error event, which triggers a collector
shutdown.

**Link to tracking Issue:** This is part of #32189. It does not resolve
this issue because the supervisor still needs changes to pass its
PID in.

**Testing:**
Added some unit tests.

I've manually tested it on my MacBook with this PR:
observIQ#4715
(running the supervisor, kill -9 the supervisor, and checking the
agent logs to see it shut down).

I've tried testing this out on Windows, but the supervisor doesn't get
past bootstrapping (the Commander's Stop function does not work on
Windows), so I wasn't able to fully test it.

**Documentation:** 
Added new parameter to README
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…been orphaned (open-telemetry#32564)

jlg-io pushed a commit to jlg-io/opentelemetry-collector-contrib that referenced this issue May 14, 2024
…been orphaned (open-telemetry#32564)

evan-bradley pushed a commit that referenced this issue May 16, 2024
…sion (#32875)

**Description:**
* Configures the PPID of the opamp extension in the supervisor. This
allows the collector to detect if the supervisor exits and shut itself
down.
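
As a rough illustration of the supervisor side, the sketch below renders a collector config fragment that carries the supervisor's own PID. The `ppid` key name is an assumption taken from the wording of the earlier commit ("a passed in ppid"), and `renderExtensionConfig` and the template are hypothetical; the supervisor's real config generation may look quite different.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"text/template"
)

// opampExtensionTmpl is a hypothetical config fragment. The "ppid" key is an
// assumed name for the option that tells the extension which PID to watch.
const opampExtensionTmpl = `extensions:
  opamp:
    ppid: {{.SupervisorPID}}
`

// renderExtensionConfig fills the template with the given supervisor PID.
func renderExtensionConfig(supervisorPID int) (string, error) {
	tmpl, err := template.New("opamp").Parse(opampExtensionTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := struct{ SupervisorPID int }{SupervisorPID: supervisorPID}
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// The supervisor passes its own PID, which is the collector's parent PID
	// once the supervisor starts the collector as a child process.
	cfg, err := renderExtensionConfig(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, "render failed:", err)
		os.Exit(1)
	}
	fmt.Print(cfg)
}
```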

**Link to tracking Issue:** Closes #32189

**Testing:**
* Manually tested by starting the supervisor, then killing it with
kill -9. The collector previously would have kept running, but
now shuts itself down. Doing this, you can also see the following log:
```
2024-05-06T14:52:31.010-0400	error	otelcol@v0.99.1-0.20240503221155-67d37183e6ac/collector.go:278	Asynchronous error received, terminating process	{"error": "collector was orphaned, process with pid 38908 does not exist"}
```

**Documentation:**
Added `orphan_detection_interval` to the spec as a configurable option

---------

Co-authored-by: Tiffany Hrabusa <30397949+tiffany76@users.noreply.github.com>
Co-authored-by: Andrzej Stencel <andrzej.stencel@elastic.co>