- OS type and version:
Debian GNU/Linux 8 (jessie)
- Python version and virtual environment information (python --version):
Python 3.6.0
- google-cloud-python version (pip show google-cloud, pip show google-<service> or pip freeze):
aniso8601==1.2.0
cachetools==2.0.1
Cerberus==0.9.2
click==6.7
confluent-kafka==0.9.2
couchbase==2.2.0
coverage==4.2
dill==0.2.7.1
flake8==3.4.1
Flask==0.12
Flask-RESTful==0.3.5
future==0.16.0
google-auth==1.1.0
google-cloud-core==0.27.1
google-cloud-pubsub==0.28.3
google-gax==0.15.14
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.3
grpcio==1.4.0
httplib2==0.10.3
itsdangerous==0.24
Jinja2==2.9.5
jsonschema==2.5.1
MarkupSafe==0.23
mccabe==0.6.1
microservice==0.7.0
mysql-connector-python==2.2.1
oauth2client==3.0.0
pep8==1.7.0
ply==3.8
protobuf==3.4.0
psutil==5.3.1
py==1.4.32
pyasn1==0.3.4
pyasn1-modules==0.1.4
pycodestyle==2.3.1
pyflakes==1.5.0
pytest==2.8.5
pytest-mock==1.5.0
python-dateutil==2.6.0
pytz==2016.10
requests==2.13.0
rsa==3.4.2
six==1.10.0
statsd==3.2.1
uWSGI==2.0.14
Werkzeug==0.11.15
- Stacktrace if available:
N/A
- Steps to reproduce
- I run my Python PubSub consumer (in uWSGI with 1 thread) for a bit more than one hour
- there are no special log messages about the application restarting
- I then observe high CPU usage
Note: we have multiple consumers running. When I start, for example, 20 at the same time, they all run fine for slightly more than 1 hour. Then, somewhere between 1h and 1h20m (I was not watching the CPU usage the whole time), the CPU usage of almost all 20 jumps. Occasionally one of the consumers is not affected, but there might be another reason for that.
- Code example
summary of relevant parts (I cannot paste the entire application):
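from google.cloud import pubsub

# project_name, subscription_name and topic_name are defined elsewhere in the application
subscriber = pubsub.SubscriberClient()
project_path = subscriber.project_path(project_name)
subscription_path = subscriber.subscription_path(project_name, subscription_name)
topic_path = subscriber.topic_path(project_name, topic_name)
# keep a handle on the object returned by subscribe() so the callbacks can be attached
self.pubsub_consumer_sub = subscriber.subscribe(subscription_path)
self.pubsub_consumer_sub.on_exception = self._error_cb
self.pubsub_consumer_sub.open(self._message_callback)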
where _message_callback calls message.ack() if the application code succeeded and raises otherwise.
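For reference, the callback behaviour described above looks roughly like the sketch below. This is not the actual application code: `make_message_callback` and `process` are hypothetical names introduced here, and the only things assumed about the message object are the `data` attribute and the `ack()` method.

```python
import logging

logger = logging.getLogger(__name__)


def make_message_callback(process):
    """Build a callback that acks a message only if `process` succeeds.

    `process` is a hypothetical stand-in for the application logic; it
    receives the message payload and raises on failure.
    """

    def _message_callback(message):
        try:
            process(message.data)
        except Exception:
            # No ack on failure: log and re-raise so the error is visible.
            logger.exception("processing failed, message not acknowledged")
            raise
        # Only reached when the application code succeeded.
        message.ack()

    return _message_callback
```

The callback itself does nothing between messages, which is why the busy thread is more likely to be inside the library than in application code.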
Relevant debugging:
top -b -n1 -H -p 35 # pid 35 shows up as high cpu in "top"
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
711 root 20 0 1580028 80752 15568 R 98.6 0.3 11:36.88 uwsgi
35 root 20 0 1580028 80752 15568 S 0.0 0.3 0:01.43 uwsgi
37 root 20 0 1580028 80752 15568 S 0.0 0.3 0:01.24 uwsgi
43 root 20 0 1580028 80752 15568 S 0.0 0.3 0:01.13 uwsgi
Then take the thread with the high CPU usage (711) and strace it:
strace -e trace=all -p 711
(part of the output)
....
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 81) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 485753027}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 81) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 485879159}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486030327}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486165028}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486290705}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486428764}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486548305}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486670569}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486776253}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486896929}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 79) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 487011159}) = 0
....
and hundreds more of these
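To put a number on how tight this loop is, the CLOCK_MONOTONIC values in the strace output can be subtracted from each other. The helper below is a small sketch for that; the function names are made up for this example, and the sample contains only the first three clock_gettime lines from the output above.

```python
import re

# First three clock_gettime lines copied from the strace output above.
STRACE_SAMPLE = """\
clock_gettime(CLOCK_MONOTONIC, {690178, 485753027}) = 0
clock_gettime(CLOCK_MONOTONIC, {690178, 485879159}) = 0
clock_gettime(CLOCK_MONOTONIC, {690178, 486030327}) = 0
"""

CLOCK_RE = re.compile(r"clock_gettime\(CLOCK_MONOTONIC, \{(\d+), (\d+)\}\)")


def monotonic_timestamps_ns(strace_text):
    """Return every CLOCK_MONOTONIC reading in the text, in nanoseconds."""
    return [
        int(sec) * 1_000_000_000 + int(nsec)
        for sec, nsec in CLOCK_RE.findall(strace_text)
    ]


def mean_interval_ns(timestamps):
    """Average gap between consecutive readings, in nanoseconds."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    return (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)


if __name__ == "__main__":
    print(mean_interval_ns(monotonic_timestamps_ns(STRACE_SAMPLE)))
```

Over the eleven clock_gettime calls shown above, the first and last readings are 1,258,132 ns apart, so one poll + clock_gettime iteration takes roughly 126 microseconds. That is several thousand iterations per second, even though each poll call asks for a timeout of about 80 ms.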
Taking a look at the open sockets, something jumps out:
ss -t -a
State Recv-Q Send-Q Local Address:Port Peer Address:Port
CLOSE-WAIT 1 0 10.0.20.188:59188 74.125.206.84:https
CLOSE-WAIT 1 0 10.0.20.188:38434 74.125.206.84:https
... (others removed)...
Note that the peer address of these sockets (74.125.206.84) is probably PubSub-related, but I cannot confirm that. Debugging over HTTPS is obviously a bit difficult as well.
The loop of clock_gettime + poll (returning POLLHUP), combined with the two sockets in CLOSE-WAIT with a Recv-Q of 1, leads me to believe there is a bug in the Google PubSub libraries: an invalid or incomplete message is stuck in the receive queue, the library keeps trying to consume it and retries, and the result is a hot loop that burns a lot of CPU.
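One mechanism that would produce exactly this syscall pattern, whatever is sitting in the receive queue: poll() always reports POLLHUP in revents, even when the requested event mask is 0. A file descriptor that has hung up but stays in the poll set therefore makes every poll() call return immediately, regardless of the timeout. The sketch below reproduces that with a plain pipe on Linux. It only illustrates the mechanism and makes no claim about what grpc does internally; `poll_hung_up_fd` is a name invented for this example.

```python
import os
import select
import time


def poll_hung_up_fd(timeout_ms=80, iterations=1000):
    """Poll a hung-up pipe end with an empty event mask and time the loop.

    Returns (number of POLLHUP results, elapsed seconds).
    """
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # the peer goes away, so the read end is hung up

    poller = select.poll()
    poller.register(read_fd, 0)  # events=0, like fd 17 and fd 20 in the strace

    hup_count = 0
    start = time.monotonic()
    for _ in range(iterations):
        # Asks to wait up to timeout_ms, but returns at once with POLLHUP.
        for _fd, revents in poller.poll(timeout_ms):
            if revents & select.POLLHUP:
                hup_count += 1
    elapsed = time.monotonic() - start

    os.close(read_fd)
    return hup_count, elapsed


if __name__ == "__main__":
    count, seconds = poll_hung_up_fd()
    print(count, "POLLHUP results in", round(seconds, 4), "seconds")
```

If the timeout were honoured, 1000 iterations at 80 ms would take 80 seconds. With a hung-up descriptor in the set the loop finishes almost instantly, which matches a thread pinned at close to 100% CPU.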
Note that the application itself keeps consuming messages, but that might simply be because a uwsgi process is doing the work (not investigated).
I also did a gdb backtrace, which points to grpc:
(gdb) bt
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007fe5ea025c7e in now_impl (clock_type=<optimized out>) at src/core/lib/support/time_posix.c:92
#2 0x00007fe5ea025cfa in gpr_now (clock_type=clock_type@entry=GPR_CLOCK_MONOTONIC) at src/core/lib/support/time_posix.c:156
#3 0x00007fe5ea054a27 in cq_next (cc=0x2771790, deadline=..., reserved=<optimized out>) at src/core/lib/surface/completion_queue.c:833
#4 0x00007fe5ea05571e in grpc_completion_queue_next (cc=<optimized out>, deadline=..., reserved=reserved@entry=0x0) at src/core/lib/surface/completion_queue.c:873
#5 0x00007fe5e9ff9261 in __pyx_pf_4grpc_7_cython_6cygrpc_15CompletionQueue_2poll (__pyx_v_deadline=<optimized out>, __pyx_v_self=0x7fe5e8f6ab70)
at src/python/grpcio/grpc/_cython/cygrpc.c:10553
#6 __pyx_pw_4grpc_7_cython_6cygrpc_15CompletionQueue_3poll (__pyx_v_self=0x7fe5e8f6ab70, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
at src/python/grpcio/grpc/_cython/cygrpc.c:10413
#7 0x00007fe5f191ce8e in _PyCFunction_FastCallDict (func_obj=0x7fe5e827bfc0, args=0x7fe5e8262db0, nargs=<optimized out>, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:231
#8 0x00007fe5f191d1a7 in _PyCFunction_FastCallKeywords (func=func@entry=0x7fe5e827bfc0, stack=stack@entry=0x7fe5e8262db0, nargs=<optimized out>, kwnames=kwnames@entry=0x0)
at Objects/methodobject.c:295
#9 0x00007fe5f19b46d3 in call_function (pp_stack=pp_stack@entry=0x7fe5e1ffa5f0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4788
#10 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5e8262c18, throwflag=<optimized out>) at Python/ceval.c:3275
#11 0x00007fe5f19b434a in _PyEval_EvalCodeWithName (_co=0x1, globals=0x7fe5e1ffa290, locals=0x2771820, locals@entry=0x0, args=0x7fe5f3683060, argcount=9223372036854775807,
kwnames=0xa7b03, kwnames@entry=0x7fe5f3683060, kwargs=0x7fe5f3683068, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fe5e8f64f60, name=0x0, qualname=0x0)
at Python/ceval.c:4119
#12 0x00007fe5f19b490f in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7fe5f3683060, argcount=<optimized out>,
kws=kws@entry=0x7fe5f3683060, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fe5e8f64f60) at Python/ceval.c:4140
#13 0x00007fe5f18f645e in function_call (func=0x7fe5e82987b8, arg=0x7fe5f3683048, kw=0x7fe5e86d6708) at Objects/funcobject.c:604
#14 0x00007fe5f18c492a in PyObject_Call (func=0x7fe5e82987b8, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#15 0x00007fe5f19b98f6 in do_call_core (kwdict=0x7fe5e86d6708, callargs=0x7fe5f3683048, func=0x7fe5e82987b8) at Python/ceval.c:5057
#16 _PyEval_EvalFrameDefault (f=0x7fe5e86f6dd8, throwflag=<optimized out>) at Python/ceval.c:3357
#17 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#18 0x00007fe5f19b48ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fe5f04742f0) at Python/ceval.c:4905
#19 call_function (pp_stack=pp_stack@entry=0x7fe5e1ffaae0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#20 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5cc0008d8, throwflag=<optimized out>) at Python/ceval.c:3275
#21 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#22 0x00007fe5f19b48ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fe5f0474510) at Python/ceval.c:4905
#23 call_function (pp_stack=pp_stack@entry=0x7fe5e1ffacb0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#24 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5e826dbb8, throwflag=<optimized out>) at Python/ceval.c:3275
#25 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#26 0x00007fe5f19bd210 in _PyFunction_FastCallDict (func=func@entry=0x7fe5f0474378, args=args@entry=0x7fe5e1ffae60, nargs=1, kwargs=kwargs@entry=0x0) at Python/ceval.c:4972
#27 0x00007fe5f18c4b8e in _PyObject_FastCallDict (func=func@entry=0x7fe5f0474378, args=args@entry=0x7fe5e1ffae60, nargs=nargs@entry=1, kwargs=kwargs@entry=0x0)
at Objects/abstract.c:2295
#28 0x00007fe5f18c4c91 in _PyObject_Call_Prepend (func=0x7fe5f0474378, obj=<optimized out>, args=0x7fe5f3683048, kwargs=0x0) at Objects/abstract.c:2358
#29 0x00007fe5f18c492a in PyObject_Call (func=0x7fe5e86be988, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#30 0x00007fe5f19b4d91 in PyEval_CallObjectWithKeywords (func=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at Python/ceval.c:4709
#31 0x00007fe5f1a07db2 in t_bootstrap (boot_raw=0x7fe5e8f1bb48) at ./Modules/_threadmodule.c:998
#32 0x00007fe5f32e8064 in start_thread (arg=0x7fe5e1ffb700) at pthread_create.c:309
#33 0x00007fe5f135862d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
I have not yet dug into the details of grpc, but to me all clues point in that direction. We did not encounter issues like this with the old PubSub library (we used google-cloud-pubsub==0.26 before).