[Bug]: Instana AutoTrace causes patroni process to hang #377
Comments
Hi @ikaakkola,
Hi, agreed, this is how AutoTracing works. The module is installed because the service supports taking backups to Google Cloud Storage, and hence it needs the relevant Python modules to be available, as the decision of 'where to back up' is a configuration option. So having modules that "nobody uses" isn't entirely correct; the service could be configured to use the module. I have not been able to get more than this. It would indeed be very useful to know how the traceback continues, but this is what the Python traceback gave me with PYTHONFAULTHANDLER=true:
Perhaps you could try
I tried to manually get the Instana instrumentation code to hang, but could not (what I tried was to 'pip3 install instana' and trigger the instrumentation manually). It seems like, based on "Once the Instana host agent is installed on your host, it will periodically scan for running Python processes and automatically apply the Instana Python instrumentation transparently." (quoted from the documentation at https://www.ibm.com/docs/en/obi/current?topic=package-python-configuration-configuring-instana#autotrace), that this automatic application of the instrumentation to the running process is what causes the hang. I will reconfigure the environment so that Instana is again using AutoTrace and will try the above debugging process once I get Patroni to hang again.
When com.instana.plugin.python -> autotrace -> enabled: true is set, all the existing 'patroni' processes in all the running service containers hung after a moment, so this seems to happen 100% of the time when AutoTrace is used against spilo+patroni postgresql pods.
And did you get the traceback from the shell?
I did not, because the shell does not open to any usable state; it is just stuck there.
When I run the service with INSTANA_DEBUG=true, I get the following on STDERR before the process hangs:
(This is exactly as written to a debug file with 2>/tmp/patroni.stderr.) It looks like "Stan is on the scene" appears twice; could this cause some weird deadlock issue? There is a zombie 'ldconfig.real' process as a child of the Patroni Python process, which appears at the same time as the Patroni process hangs (and that would be when AutoTrace starts doing the instrumentation stuff).
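As background on the zombie entry: a child shows up as a zombie when it has exited but its parent is blocked and never reaps it. A minimal, generic Python sketch of that situation (a hypothetical stand-in for the hung Patroni parent, not Instana's code):

```python
import subprocess
import time

# Spawn a child that exits immediately.
child = subprocess.Popen(["/bin/true"])

# While the parent is blocked (standing in for the hung process), the
# exited child stays in the process table as a zombie (<defunct> in ps).
time.sleep(10)

# Reaping the child with wait() is what clears the zombie entry.
child.wait()
```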
I added
into which provided me this (so this would be just before the process hangs):
When I add a random 1-10 second sleep into
@Ferenc- would you happen to have any idea why AutoTrace appears to be executed twice, at the same time, for these Python processes, or shall I turn to our Instana support for that instead? In any case, I would currently say that this isn't really a bug in Instana python-sensor directly.
From this, the only thing that appears to be "hanging" is Patroni itself scheduling its next run, and its watchdog waiting for
That is initiated by the agent, which is proprietary and should not be discussed here, only on different channels.
Yes, so I will go via Instana Support for that, which I believe is the cause of this. My guess is (but as you said, this is proprietary) that AutoTrace somehow manages to import modules twice, which I think should not happen in Python (there is a reload() to, well, reload modules).
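For reference, a quick demonstration of the CPython behavior referred to here: a repeated import is a cached no-op, and re-executing module code requires an explicit reload (a generic sketch, unrelated to Instana's internals):

```python
import importlib
import sys

import json                            # first import: executes and caches the module
assert "json" in sys.modules

again = __import__("json")             # second import: returns the cached object
assert again is sys.modules["json"]

importlib.reload(json)                 # only an explicit reload re-executes the module
```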
As there is no sign of
I'm fine with this being closed, as I believe it to be an AutoTrace issue and not directly related to 'python-sensor'; it is not a Patroni bug either. Here is a minimal reproduction of a similar problem where the process hangs (it assumes that the Kubernetes environment has instana-agent running and Python AutoTrace enabled). This process hangs earlier in the Instana Python code, when importing the profiler at line 10 in
1. Start a new Kubernetes pod with image
2. Install python3 and the required utils
3. Create a test Python file (a minimal stand-in sketch follows below)
4. Start the test Python process:
5. Wait for a moment for AutoTrace to attempt to instrument the process; it will hang. Note that the process hangs before INSTANA_DEBUG would print anything, so there will be no "Stan is on the scene" output.
6. Get the running processes
7. Note that the python3 process has a zombie 'ldconfig.real' child process
8. Get a traceback from the process (via PYTHONFAULTHANDLER)
9. Resulting traceback from SIGABRT:
Note: the process might need to be started a few times for the hang to happen on the main loop (while with Patroni it seems to happen every time). When it does, the log output of the process will stop. Even when the main test.py keeps outputting, the Instana instrumentation code is stuck in the above when
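The contents of the test file did not survive in this thread; a minimal stand-in matching the behavior described above (a long-running loop that keeps printing, with faulthandler enabled so a SIGABRT dumps a traceback) could look like this. This is an assumption, not the original file:

```python
# test.py - hypothetical stand-in for the lost reproduction file.
import faulthandler
import time

# Same effect as starting the process with PYTHONFAULTHANDLER=true:
# dump all thread tracebacks on fatal signals such as SIGABRT.
faulthandler.enable()

i = 0
while True:
    i += 1
    print(f"tick {i}", flush=True)  # output stops (or continues) as described above
    time.sleep(1)
```

Once AutoTrace attaches and the process hangs, kill -SIGABRT <pid> should then print the traceback, matching the procedure in the steps above.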
I can confirm the issue up to Python
Unfortunately I don't have any easy way to access newer Pythons in the container environments where we have Instana available. I think this is a combination of the way AutoTrace works (attaching to a live process with ptrace etc.) and some race-condition deadlock in Python calling ldconfig. Thanks for the help you provided here; at least I learned Python debugging 👍
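On the "Python calling ldconfig" part: one well-known path by which a CPython process spawns ldconfig on glibc-based Linux is ctypes.util.find_library, which shells out to ldconfig to resolve library names. A sketch of triggering that path directly (it is an assumption that this is the exact path AutoTrace hits):

```python
from ctypes.util import find_library

# On glibc-based Linux this runs ldconfig in a child process to map the
# short name to a shared object; a parent stuck mid-instrumentation would
# leave that child as the zombie seen in the process listing.
print(find_library("c"))  # e.g. "libc.so.6"
```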
Problem Description
A database cluster built with Zalando Spilo (https://github.com/zalando/spilo) and postgres-operator (https://github.com/zalando/postgres-operator) uses Patroni (https://github.com/zalando/patroni) for cluster state management, running on Kubernetes.
When Instana Python instrumentation is enabled, the Patroni Python process inside each 'postgresql' pod hangs a few seconds after startup.
The command line used to execute Patroni via Spilo is available at https://github.com/zalando/spilo/blob/master/postgres-appliance/runit/patroni/run#L29 (run via runsv by Spilo). When the last exec in this script is changed to include INSTANA_DISABLE_AUTO_INSTR=true after env, Patroni starts and runs normally.
The above was figured out after spending some time trying to work out what was wrong. This traceback (which fortunately was printed due to a mistake while manually trying to get a traceback printed from /usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py) provided the hint of what was causing Patroni to hang:
What pointed towards 'batch.py' was this traceback from Patroni when it got stuck (acquired via kill -SIGABRT when the process was started with PYTHONFAULTHANDLER=true).

While attempting to figure out what was wrong, based on the above stack trace, I noticed that uninstalling 'google-cloud-storage' fixed the problem. The curious part was that I could not find anything in Patroni that would end up calling 'google-cloud-storage'. While Patroni supports WAL backups to Google via WAL-E, and hence installs the google-cloud-storage Python module, our configuration did not use it. It turned out the problem was not that Patroni was importing it, but that Instana was, at https://github.com/instana/python-sensor/blob/master/instana/instrumentation/google/cloud/storage.py#L14, and as a consequence the whole Patroni process hangs.
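This import-at-sensor-load behavior is a common APM pattern: instrumenting a library requires importing it when the sensor initializes, so merely having the package installed pulls it into every traced process. A generic sketch of the pattern (not python-sensor's actual code):

```python
try:
    # Imported as soon as the sensor loads, even if the application
    # itself never touches Google Cloud Storage.
    from google.cloud import storage

    def _with_tracing(wrapped):
        def wrapper(*args, **kwargs):
            # ... start a span here, call through, record the result ...
            return wrapped(*args, **kwargs)
        return wrapper

    # Monkey-patch a method so future calls are traced.
    storage.Blob.download_to_file = _with_tracing(storage.Blob.download_to_file)
except ImportError:
    pass  # package not installed; nothing to instrument
```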
As said, if running strace on the Patroni pid (strace -f -v -s 128 -p <pid>), this does not occur and Patroni runs fine. If there are other ways I could provide further debugging details, I will be glad to assist.

Minimal, Complete, Verifiable, Example
Set up Kubernetes, install Zalando Spilo using postgres-operator, enable Instana instrumentation, and attempt to launch a Postgres cluster. After a few seconds (< 30), the Patroni API at localhost:8008 (inside any of the postgresql cluster containers) stops answering requests.
(Sorry, I was not able to replicate this in any minimal or complete way.)
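A quick way to check that symptom is to poll Patroni's REST API; a small probe sketch (the /health endpoint is standard Patroni, but treat the exact URL as an assumption for this setup):

```python
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8008/health", timeout=5) as resp:
        print("patroni API responding:", resp.status)
except Exception as exc:
    # After the hang, requests time out instead of returning a status.
    print("patroni API not responding:", exc)
```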
Python Version
Python 3.6
Python Modules
Python Environment