Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Instana AutoTrace causes patroni process to hang #377

Closed
ikaakkola opened this issue Jul 1, 2022 · 16 comments
Closed

[Bug]: Instana AutoTrace causes patroni process to hang #377

ikaakkola opened this issue Jul 1, 2022 · 16 comments

Comments

@ikaakkola
Copy link

ikaakkola commented Jul 1, 2022

Problem Description

A database cluster built with Zalando Spilo (https://github.com/zalando/spilo) and postgres-operator (https://github.com/zalando/postgres-operator) uses patroni (https://github.com/zalando/patroni) for cluster state management . Running on Kubernetes.

When Instana python instrumentation is enabled, the patroni python process inside each 'postgresql' pod hangs a few seconds after startup.

The command line used to execute patroni via Spilo is available at https://github.com/zalando/spilo/blob/master/postgres-appliance/runit/patroni/run#L29 (used via runsv by spilo) - when the last exec in this changed to include INSTANA_DISABLE_AUTO_INSTR=true after env , patroni starts and runs normally.

The above was figured out after spending some time trying to figure out what is wrong. This traceback (which fortunately was printed due to a mistake in manually trying to get print a traceback from /usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py) provided the hint of what is causing patroni to hang:

  File "<string>", line 1, in <module>
  File "/tmp/.instana/python/instana/__init__.py", line 206, in <module>
    boot_agent()
  File "/tmp/.instana/python/instana/__init__.py", line 155, in boot_agent
    from .instrumentation.google.cloud import storage
  File "/tmp/.instana/python/instana/instrumentation/google/cloud/storage.py", line 14, in <module>
    from google.cloud import storage
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/storage/__init__.py", line 35, in <module>
    from google.cloud.storage.batch import Batch
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py", line 30, in <module>
    traceback.print_stack(file=sys.stdout)
NameError: name 'sys' is not defined

What pointed towards 'batch.py' was this traceback from Patroni when it got stuck (acquired via kill -SIGABRT when the process was started with PYTHONFAULTHANDLER=true)

Current thread 0x00007fbe8a4f0740 (most recent call first):
  File "/usr/lib/python3.6/re.py", line 182 in search
  File "/usr/lib/python3.6/ctypes/util.py", line 283 in _findSoname_ldconfig
  File "/usr/lib/python3.6/ctypes/util.py", line 313 in find_library
  File "/usr/lib/python3/dist-packages/asn1crypto/_perf/_big_num_ctypes.py", line 35 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/lib/python3/dist-packages/asn1crypto/_int.py", line 56 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/lib/python3/dist-packages/asn1crypto/_elliptic_curve.py", line 51 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/lib/python3/dist-packages/asn1crypto/keys.py", line 22 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/lib/python3/dist-packages/cryptography/x509/extensions.py", line 13 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/lib/python3/dist-packages/cryptography/x509/base.py", line 16 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/lib/python3/dist-packages/cryptography/x509/__init__.py", line 8 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/local/lib/python3.6/dist-packages/google/auth/crypt/_cryptography_rsa.py", line 27 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/google/auth/crypt/rsa.py", line 20 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/google/auth/crypt/__init__.py", line 43 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/google/auth/_service_account_info.py", line 22 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/google/oauth2/service_account.py", line 77 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/google/auth/transport/requests.py", line 48 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/_helpers/__init__.py", line 31 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py", line 28 in <module>
  ...

While attempting to figure out what is wrong, based on the above stack trace, noticed that uninstalling 'google-cloud-storage' fixed the problem. The curious part was that I could not find anything in patroni that would end up calling 'google-cloud-storage. While patroni supports WAL backups to google via WAL-E, and hence installs the google-cloud-storage python module, our configuration did not use it. Turned out the problem was not that patroni was importing it, but instana was, at https://github.com/instana/python-sensor/blob/master/instana/instrumentation/google/cloud/storage.py#L14 and as a consequence the whole patroni process hangs.

As said, if running strace on the patroni pid (strace -f -v -s 128 -p <pid>) this does not occur and patroni runs fine. If there are other ways I could provide further debugging details, I will be glad to assist.

root@host# uname -a
Linux <redacted> 5.4.190-107.353.amzn2.x86_64 #1 SMP Wed Apr 27 21:16:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

root@host# cat /etc/issue
Ubuntu 18.04.6 LTS \n \l

Minimal, Complete, Verifiable, Example

setup kubernetes, install zalando-spilo using postgres-operator, enable instana instrumentation and attempt to launch a postgres cluster. After a few seconds (< 30), the patroni API at localhost:8008 (inside any of the postgresql cluster containers) stops answering requests.

(sorry, was not able to replicate this in any minimal nor complete way..)

Python Version

Python 3.6

Python Modules

`apt list '*python*' --installed `
libpython3-stdlib/now 3.6.7-1~18.04 amd64 [installed,local]
libpython3.6/now 3.6.9-1~18.04ubuntu1.7 amd64 [installed,local]
libpython3.6-minimal/now 3.6.9-1~18.04ubuntu1.7 amd64 [installed,local]
libpython3.6-stdlib/now 3.6.9-1~18.04ubuntu1.7 amd64 [installed,local]
postgresql-plpython3-10/now 10.21-1.pgdg18.04+1 amd64 [installed,local]
postgresql-plpython3-11/now 11.16-1.pgdg18.04+1 amd64 [installed,local]
postgresql-plpython3-12/now 12.11-1.pgdg18.04+1 amd64 [installed,local]
postgresql-plpython3-13/now 13.7-1.pgdg18.04+1 amd64 [installed,local]
postgresql-plpython3-14/now 14.4-1.pgdg18.04+1 amd64 [installed,local]
postgresql-plpython3-9.6/now 9.6.24-1.pgdg18.04+1 amd64 [installed,local]
python-babel-localedata/now 2.4.0+dfsg.1-2ubuntu1.1 all [installed,local]
python3/now 3.6.7-1~18.04 amd64 [installed,local]
python3-asn1crypto/now 0.24.0-1 all [installed,local]
python3-babel/now 2.4.0+dfsg.1-2ubuntu1.1 all [installed,local]
python3-blinker/now 1.4+dfsg1-0.1 all [installed,local]
python3-boto/now 2.44.0-1ubuntu2.18.04.1 all [installed,local]
python3-cachetools/now 2.0.0-2 all [installed,local]
python3-cdiff/now 1.0-1 all [installed,local]
python3-certifi/now 2018.1.18-2 all [installed,local]
python3-cffi/now 1.11.5-1 all [installed,local]
python3-cffi-backend/now 1.11.5-1 amd64 [installed,local]
python3-chardet/now 3.0.4-1 all [installed,local]
python3-click/now 6.7-3 all [installed,local]
python3-colorama/now 0.3.7-1 all [installed,local]
python3-consul/now 0.7.1-1 all [installed,local]
python3-cryptography/now 2.1.4-1ubuntu1.4 amd64 [installed,local]
python3-dateutil/now 2.6.1-1 all [installed,local]
python3-debtcollector/now 1.13.0-0ubuntu1 all [installed,local]
python3-dnspython/now 1.15.0-1 all [installed,local]
python3-docutils/now 0.14+dfsg-3 all [installed,local]
python3-etcd/now 0.4.3-2 all [installed,local]
python3-funcsigs/now 1.0.2-4 all [installed,local]
python3-gevent/now 1.2.2-2 amd64 [installed,local]
python3-greenlet/now 0.4.12-2 amd64 [installed,local]
python3-idna/now 2.6-1 all [installed,local]
python3-iso8601/now 0.1.11-1 all [installed,local]
python3-jwt/now 1.5.3+ds1-1 all [installed,local]
python3-kazoo/now 2.2.1-1ubuntu1 all [installed,local]
python3-keyring/now 10.6.0-1 all [installed,local]
python3-keystoneauth1/now 3.4.0-0ubuntu1 all [installed,local]
python3-keystoneclient/now 1:3.15.0-0ubuntu1 all [installed,local]
python3-lxml/now 4.2.1-1ubuntu0.6 amd64 [installed,local]
python3-meld3/now 1.0.2-2 amd64 [installed,local]
python3-minimal/now 3.6.7-1~18.04 amd64 [installed,local]
python3-monotonic/now 1.1-2 all [installed,local]
python3-msgpack/now 0.5.6-1 amd64 [installed,local]
python3-netaddr/now 0.7.19-1 all [installed,local]
python3-netifaces/now 0.10.4-0.1build4 amd64 [installed,local]
python3-oauthlib/now 2.0.6-1 all [installed,local]
python3-oslo.config/now 1:5.2.0-0ubuntu1 all [installed,local]
python3-oslo.i18n/now 3.19.0-0ubuntu1 all [installed,local]
python3-oslo.serialization/now 2.24.0-0ubuntu2 all [installed,local]
python3-oslo.utils/now 3.35.0-0ubuntu1.1 all [installed,local]
python3-pbr/now 3.1.1-3ubuntu3 all [installed,local]
python3-pkg-resources/now 39.0.1-2 all [installed,local]
python3-ply/now 3.11-1 all [installed,local]
python3-positional/now 1.1.1-3 all [installed,local]
python3-prettytable/now 0.7.2-3 all [installed,local]
python3-psutil/now 5.4.2-1ubuntu0.1 amd64 [installed,local]
python3-psycopg2/now 2.8.6-2~pgdg18.04+1 amd64 [installed,local]
python3-pyasn1/now 0.4.2-3 all [installed,local]
python3-pyasn1-modules/now 0.2.1-0.2 all [installed,local]
python3-pycparser/now 2.18-2 all [installed,local]
python3-pyparsing/now 2.2.0+dfsg1-2 all [installed,local]
python3-pystache/now 0.5.4-6 all [installed,local]
python3-requests/now 2.18.4-2ubuntu0.1 all [installed,local]
python3-rfc3986/now 0.3.1-2 all [installed,local]
python3-rsa/now 3.4.2-1 all [installed,local]
python3-six/now 1.11.0-2 all [installed,local]
python3-stevedore/now 1:1.28.0-0ubuntu1 all [installed,local]
python3-swiftclient/now 1:3.5.0-0ubuntu1 all [installed,local]
python3-tz/now 2018.3-2 all [installed,local]
python3-urllib3/now 1.22-1ubuntu0.18.04.2 all [installed,local]
python3-wrapt/now 1.9.0-3 amd64 [installed,local]
python3-yaml/now 3.12-1build2 amd64 [installed,local]
python3.6/now 3.6.9-1~18.04ubuntu1.7 amd64 [installed,local]
python3.6-minimal/now 3.6.9-1~18.04ubuntu1.7 amd64 [installed,local]

`pip3 list`
asn1crypto (0.24.0)
Babel (2.4.0)
blinker (1.4)
boto (2.44.0)
boto3 (1.23.10)
botocore (1.26.10)
cachetools (2.0.0)
cdiff (1.0)
certifi (2018.1.18)
cffi (1.11.5)
chardet (3.0.4)
click (6.7)
colorama (0.3.7)
cryptography (2.1.4)
debtcollector (1.13.0)
dnspython (1.15.0)
filechunkio (1.8)
funcsigs (1.0.2)
gevent (1.2.2)
google-api-core (2.8.2)
google-auth (2.8.0)
google-cloud-core (2.3.1)
google-cloud-storage (2.0.0)
google-crc32c (1.1.2)
google-resumable-media (2.3.3)
googleapis-common-protos (1.56.2)
greenlet (0.4.12)
idna (2.6)
iso8601 (0.1.11)
jmespath (0.10.0)
kazoo (2.2.1.dev0)
keystoneauth1 (3.4.0)
lxml (4.2.1)
meld3 (1.0.2)
monotonic (1.0)
msgpack (0.5.6)
netaddr (0.7.19)
netifaces (0.10.4)
oauthlib (2.0.6)
oslo.config (5.2.0)
oslo.i18n (3.19.0)
oslo.serialization (2.24.0)
oslo.utils (3.35.0)
patroni (2.1.4)
pbr (3.1.1)
pg-view (1.3.1)
pip (9.0.1)
ply (3.11)
positional (1.1.1)
prettytable (0.7.2)
protobuf (3.19.4)
psutil (5.4.2)
psycopg2 (2.8.6)
pyasn1 (0.4.2)
pyasn1-modules (0.2.1)
pycparser (2.18)
PyJWT (1.5.3)
pyparsing (2.2.0)
pystache (0.5.4)
python-consul (0.7.1)
python-dateutil (2.6.1)
python-etcd (0.4.3)
python-keystoneclient (3.15.0)
python-swiftclient (3.5.0)
pytz (2018.3)
PyYAML (3.12)
requests (2.18.4)
rfc3986 (0.3.1)
rsa (3.4.2)
s3transfer (0.5.2)
setuptools (59.6.0)
six (1.11.0)
stevedore (1.28.0)
urllib3 (1.22)
wal-e (1.1.1)
wrapt (1.9.0)
ydiff (1.2)

Python Environment

KUBERNETES_ROLE_LABEL=spilo-role
POD_IP=<redacted>
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT=tcp://172.20.0.1:443
HOSTNAME=<redacted>
PGHOME=/home/postgres
KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/postgresql/14/bin
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
KUBERNETES_SERVICE_PORT_HTTPS=443
POD_NAMESPACE=platform
KUBERNETES_SERVICE_HOST=172.20.0.1
LC_ALL=en_US.utf-8
HOME=/home/postgres
@ikaakkola ikaakkola added the bug label Jul 1, 2022
@Ferenc-
Copy link
Contributor

Ferenc- commented Jul 4, 2022

Hi @ikaakkola,
Well that's how AutoTracing works, it checks the available python modules, and instruments them.
If you have modules in your dependency trees that nobody uses, then it is in general a good idea to remove them.
Now from your traceback (provided it was from the time when you experienced hanging),
the only thing I can see is that the from google.cloud import storage import statement causes hanging,
but not that instana would have anything to do with it.
So if you would like to keep autotrace on, with the current dependency tree, some more input would be needed from you.
Perhaps you could insert from google.cloud import storage into that call path yourself, and see if that hangs anyway?
And perhaps it would be useful to know how the traceback continues after File "/usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py", line 30, as that should not be the end of the traceback,
because the way I see it, the import chain continues with from google.cloud.storage._http import Connection,
and that's where google module starts checking your environment variables, which are in fact system calls and can hang.

@Ferenc- Ferenc- removed the bug label Jul 4, 2022
@ikaakkola
Copy link
Author

Hi,

Agreed, this is how AutoTracing works. The module is installed because the service supports taking backups to Google Cloud storage, and hence it needs the relevant Python modules to be available, as the decision of 'where to backup' is a configuration option. So having modules that "nobody uses" isn't entirely correct, the service could be configured to use the module.

I have not been able to get from google.cloud import storage to hang (eg. via some test.py that does the import). It does however hang consistently when called via https://github.com/instana/python-sensor/blob/master/instana/instrumentation/google/cloud/storage.py - only not when strace or gdb is attached to the process doing the import.

It would indeed be very useful to know how the traceback continues, but this is what the Python traceback gave me with PYTHONFAULTHANDLER=true and kill -SIGABRT- I am not a python developer, so I do not know if there is some way to provide a longer traceback (is there some property to define the maximum length of traceback to return?).

@Ferenc-
Copy link
Contributor

Ferenc- commented Jul 4, 2022

Perhaps you could try pip3 install pyrasite, and then attach to the hanging process with pyrasite-shell ${PID_OF_YOUR_HANGING_PROCESS}"
and once you get that shell, then copy and execute this loop:

import sys, traceback

for thread_id, frame in sys._current_frames().items():
  print('Stack for thread {}'.format(thread_id))
  traceback.print_stack(frame)
  print('')

@ikaakkola
Copy link
Author

ikaakkola commented Jul 4, 2022

I tried to manually get the instana instrumentation code to hang, but could not. (what I tried is to 'pip3 install instana' and import instana inside the patroni process and then restart the patroni service).

It seems like, based on "Once the Instana host agent is installed on your host, it will periodically scan for running Python processes and automatically apply the Instana Python instrumentation transparently." (quoted from the documentation at https://www.ibm.com/docs/en/obi/current?topic=package-python-configuration-configuring-instana#autotrace) , that this automatic apply of the instrumentation to the running process is what causes it to hang.

I will reconfigure the environment so that instana is again using AutoTrace and will try the above debugging process once I get Patroni to hang again.

@ikaakkola
Copy link
Author

pyrasite-shell hangs when launched against the pid of the hanging process.

When com.instana.plugin.python -> autotrace -> enabled: true is set, all the existing 'patroni' processes in all the running service containers hanged after a moment , so this seems to happen 100% of the time when AutoTrace is used against spilo+patroni postgresql pods.

@ikaakkola ikaakkola changed the title [Bug]: Instana instrumentation causes patroni process to hang [Bug]: Instana AutoTrace causes patroni process to hang Jul 4, 2022
@Ferenc-
Copy link
Contributor

Ferenc- commented Jul 4, 2022

And did you get the traceback from the shell?

@ikaakkola
Copy link
Author

I did not, because the shell does not open to any usable state, it is just stuck there.

@ikaakkola
Copy link
Author

ikaakkola commented Jul 4, 2022

When I run the service with INSTANA_DEBUG=true I get the following to STDERR before the process hangs:

2022-07-04 10:14:04,724: 831 INFO instana: Stan is on the scene.  Starting Instana instrumentation version: 1.37.3
2022-07-04 10:14:04,724: 831 DEBUG instana: Loading Host Collector
2022-07-04 10:14:04,724: 831 DEBUG instana: Initializing host agent state machine
2022-07-04 10:14:04,724 INFO: Stan is on the scene.  Starting Instana instrumentation version: 1.37.3
2022-07-04 10:14:04,724 DEBUG: Loading Host Collector
2022-07-04 10:14:04,724 DEBUG: Initializing host agent state machine
2022-07-04 10:14:04,788: 831 DEBUG instana: Instrumenting asyncio
2022-07-04 10:14:04,788 DEBUG: Instrumenting asyncio

(this is exactly as written to a debug file with 2>/tmp/patroni.stderr)

It looks like Stan is on the scene twice , could this cause some weird deadlock issue?

There is a Zombie 'ldconfig.real' process as a child of the Patroni python process, which appears at the same time as the patroni process hangs (and that would be when AutoTrace starts doing the instrumentation stuff).

@ikaakkola
Copy link
Author

ikaakkola commented Jul 4, 2022

I added

import sys, traceback

for thread_id, frame in sys._current_frames().items():
  print('Stack for thread {}'.format(thread_id))
  traceback.print_stack(frame)
  print('')

into /usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py just before line 28

which provided me this (so this would be just before the process hangs):

  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1180, in run
    self.finished.wait(self.interval)
  File "/usr/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/log.py", line 174, in run
    record = self._queue_handler.queue.get(True)
  File "/usr/lib/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/usr/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/socketserver.py", line 236, in serve_forever
    ready = selector.select(poll_interval)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/callback_executor.py", line 29, in run
    self._condition.wait()
  File "/usr/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 654, in run
    self._build_cache()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 625, in _build_cache
    self._do_watch(objects.metadata.resource_version)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 611, in _do_watch
    for event in iter_response_objects(response):
  File "/usr/local/lib/python3.6/dist-packages/patroni/utils.py", line 383, in iter_response_objects
    for chunk in response.read_chunked(decode_content=False):
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 598, in read_chunked
    self._update_chunk_length()
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 540, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 654, in run
    self._build_cache()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 625, in _build_cache
    self._do_watch(objects.metadata.resource_version)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 611, in _do_watch
    for event in iter_response_objects(response):
  File "/usr/local/lib/python3.6/dist-packages/patroni/utils.py", line 383, in iter_response_objects
    for chunk in response.read_chunked(decode_content=False):
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 598, in read_chunked
    self._update_chunk_length()
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 540, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 143, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 135, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 105, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 117, in _run_cycle
    self.schedule_next_run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 99, in schedule_next_run
    elif self.ha.watch(nap_time):
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1570, in watch
    return self.dcs.watch(leader_index, timeout)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 1137, in watch
    return super(Kubernetes, self).watch(None, timeout + 0.5)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/__init__.py", line 939, in watch
    self.event.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "<string>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/tmp/.instana/python/instana/__init__.py", line 206, in <module>
    boot_agent()
  File "/tmp/.instana/python/instana/__init__.py", line 155, in boot_agent
    from .instrumentation.google.cloud import storage
  File "<frozen importlib._bootstrap>", line 1023, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/tmp/.instana/python/instana/instrumentation/google/cloud/storage.py", line 14, in <module>
    from google.cloud import storage
  File "<frozen importlib._bootstrap>", line 1023, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/storage/__init__.py", line 35, in <module>
    from google.cloud.storage.batch import Batch
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked

@ikaakkola
Copy link
Author

When I add a random 1-10 second sleep into /usr/local/lib/python3.6/dist-packages/google/cloud/storage/batch.py the process no-longer hangs, so the problem is very likely that 'Stan is on the scene' twice and the Google cloud storage library hits a python, or underlying operating system bug, when it ends up calling re.py twice at the same instant;

Current thread 0x00007f8f746a7740 (most recent call first):
  File "/usr/lib/python3.6/re.py", line 182 in search
  File "/usr/lib/python3.6/ctypes/util.py", line 283 in _findSoname_ldconfig
  File "/usr/lib/python3.6/ctypes/util.py", line 313 in find_library
  File "/usr/lib/python3/dist-packages/asn1crypto/_perf/_big_num_ctypes.py", line 35 in <module>

@Ferenc- would you happen to have any idea why AutoTrace appears to be executed twice, at the same time, for these python processes or shall I turn to our Instana support for that instead?

In any case, I would currently say that this isn't really a bug in Instana python-sensor directly.

@Ferenc-
Copy link
Contributor

Ferenc- commented Jul 4, 2022

  File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 99, in schedule_next_run
    elif self.ha.watch(nap_time):
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1570, in watch
    return self.dcs.watch(leader_index, timeout)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 1137, in watch
    return super(Kubernetes, self).watch(None, timeout + 0.5)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/__init__.py", line 939, in watch
    self.event.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)

From this the only thing appears to be "hanging" is patroni itself scheduling it's next run, and it's watchdog waiting. for loop_wait time. That loop_wait time appears to be configurable, so I would recommend try it out with double and half values, for confirming that is the value that your system is waiting for.

would you happen to have any idea why AutoTrace appears to be executed twice, at the same time, for these python processes or shall I turn to our Instana support for that instead?

That is initiated by the agent, which is proprietary. And should not be discussed here, only on different channels.
For that one would had over to for eaxmple (https://support.instana.com/hc/en-us), and since so far I am confident that has nothing to do with your issue, I see no reason for that.

@ikaakkola
Copy link
Author

That is initiated by the agent, which is proprietary. And should not be discussed here, only on different channels.
For that one would had over to for eaxmple (https://support.instana.com/hc/en-us), and since so far I am confident that has nothing to do with your issue, I see no reason for that.

Yes, so I will go via Instana Support for that, which I believe is the cause of this.

My guess is (but as you said, this is proprietary) that AutoTrace somehome manages to import modules twice, which I think should not happen in Python (there is a reload() to , well, reload modules).

@Ferenc-
Copy link
Contributor

Ferenc- commented Jul 5, 2022

As there is no sign of python-sensor causing any delay, and there is evidence of patroni waiting.
I am closing this for now. Can be reopened if there is some evidence of a python-sensor bug.

@Ferenc- Ferenc- closed this as completed Jul 5, 2022
@ikaakkola
Copy link
Author

ikaakkola commented Jul 5, 2022

I'm fine with this being closed as I believe it to be an AutoTrace issue and not directly related to 'python-sensor', but it is not a Patroni bug, here is a minimal reproduction of a similar problem where the process hangs (assumes that the kubernetes environment has instana-agent running and python auto trace is enabled). This process hangs earlier in the Instana python code, when importing profiler at line 10 in /tmp/.instana/python/instana/singletons.py:

Start a new Kubernetes pod with image ubuntu:18.04

kubectl run -n <namespace> -i --tty --rm python-debug-pod  --image="ubuntu:18.04" -- bash

Install python3 and required utils

apt-get update && apt-get install -y python3 runit

Create a test python file

#!/usr/bin/python3

# -*- coding: utf-8 -*-
import logging
from time import sleep
from threading import Thread

logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger("test-app")

class Example(Thread):

    def run (self):
        while True:
            logger.info("Sleeping..")
            sleep(1)
            logger.info("Continuing..")

Example().start()

Start the test python process:

unset TERM && PYTHONFAULTHANDLER=true INSTANA_DEBUG=true chpst -u daemon python3 test.py

Wait for a moment for AutoTrace to attempt to instrument this process and it will hang

Note that the process hangs before INSTANA_DEBUG would print anything, so there will be no "Stan is on the scene" output

Get the running processes

ps -ef
daemon    6272     1  0 07:55 pts/0    00:00:00 python3 test.py
daemon    6373  6272  0 07:56 pts/0    00:00:00 [ldconfig.real] <defunct>

Note that the python3 process has a zombie 'ldconfig.real' child process

Get a traceback from the process (via PYTHONFAULTHANDLER)

kill -SIGABRT 6272

resulting traceback from SIGABRT:

2022-07-05 07:56:19 INFO     Continuing..
2022-07-05 07:56:19 INFO     Sleeping..
Fatal Python error: Aborted

Thread 0x00007f7a699af700 (most recent call first):
  File "test.py", line 19 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f7a6af12740 (most recent call first):
  File "/usr/lib/python3.6/ctypes/util.py", line 283 in _findSoname_ldconfig
  File "/usr/lib/python3.6/ctypes/util.py", line 313 in find_library
  File "/usr/lib/python3.6/uuid.py", line 490 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/tmp/.instana/python/instana/autoprofile/profile.py", line 6 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/tmp/.instana/python/instana/autoprofile/sampler_scheduler.py", line 8 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/tmp/.instana/python/instana/autoprofile/profiler.py", line 13 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/tmp/.instana/python/instana/singletons.py", line 10 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/tmp/.instana/python/instana/__init__.py", line 125 in boot_agent
  File "/tmp/.instana/python/instana/__init__.py", line 206 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<string>", line 1 in <module>
  File "/usr/lib/python3.6/threading.py", line 1072 in _wait_for_tstate_lock
  File "/usr/lib/python3.6/threading.py", line 1056 in join
  File "/usr/lib/python3.6/threading.py", line 1294 in _shutdown
Aborted

Note: the process might need to be started a few times for the hang to happen (while with Patroni it seems to happen every time) on the main loop. When it does, the log output of the process will stop. Even when the main test.py keeps outputting, the Instana instrumentation code is stuck in the above when ps reports that the main python has a defunct ldconfig.real child process.

@Ferenc-
Copy link
Contributor

Ferenc- commented Jul 7, 2022

I can confirm the issue up to Python 3.8.13, but on3.9.13 and 3.10.13 I can't reproduce it anymore.
Could you check if you can reproduce it on 3.9.13 or 3.10.13?

@ikaakkola
Copy link
Author

Unfortunately I don't have any easy way to access new pythons in the container envs where we have Instana available.

I think this is a combination of the way the Auto Trace works (attaching to a live process with ptrace etc.) and some race condition deadlock in python calling ldconfig.

Thanks for the help you provided here, at least I learned python debugging 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants