Python PubSub consumer high cpu usage after approx 1 hour #3965

@thijsterlouw

Description

  1. OS type and version:
    Debian GNU/Linux 8 (jessie)

  2. Python version and virtual environment information (python --version):
    Python 3.6.0

  3. google-cloud-python version (pip show google-cloud, pip show google-<service>, or pip freeze):

aniso8601==1.2.0
cachetools==2.0.1
Cerberus==0.9.2
click==6.7
confluent-kafka==0.9.2
couchbase==2.2.0
coverage==4.2
dill==0.2.7.1
flake8==3.4.1
Flask==0.12
Flask-RESTful==0.3.5
future==0.16.0
google-auth==1.1.0
google-cloud-core==0.27.1
google-cloud-pubsub==0.28.3
google-gax==0.15.14
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.3
grpcio==1.4.0
httplib2==0.10.3
itsdangerous==0.24
Jinja2==2.9.5
jsonschema==2.5.1
MarkupSafe==0.23
mccabe==0.6.1
microservice==0.7.0
mysql-connector-python==2.2.1
oauth2client==3.0.0
pep8==1.7.0
ply==3.8
protobuf==3.4.0
psutil==5.3.1
py==1.4.32
pyasn1==0.3.4
pyasn1-modules==0.1.4
pycodestyle==2.3.1
pyflakes==1.5.0
pytest==2.8.5
pytest-mock==1.5.0
python-dateutil==2.6.0
pytz==2016.10
requests==2.13.0
rsa==3.4.2
six==1.10.0
statsd==3.2.1
uWSGI==2.0.14
Werkzeug==0.11.15
  4. Stacktrace if available:
    N/A

  5. Steps to reproduce

  • Run the Python Pub/Sub consumer (in uwsgi with 1 thread) for a bit more than one hour.
  • No special log messages appear about the application restarting.
  • Observe high CPU usage.

Note: we run multiple consumers. When I start, for example, 20 at the same time, they all run fine for slightly more than one hour. Then, somewhere between 1h and 1h20m after start (I was not watching CPU usage the whole time), the CPU usage of almost all 20 increases sharply. Occasionally one of the consumers is not affected, but there might be another reason for that.

  6. Code example
    Summary of the relevant parts (I cannot paste the entire application):
        from google.cloud import pubsub

        # Build the client and the fully qualified resource paths.
        subscriber = pubsub.SubscriberClient()
        project_path = subscriber.project_path(project_name)
        subscription_path = subscriber.subscription_path(project_name, subscription_name)
        topic_path = subscriber.topic_path(project_name, topic_name)

        # Create the subscription consumer and register the error callback.
        self.pubsub_consumer_sub = subscriber.subscribe(subscription_path)
        self.pubsub_consumer_sub.on_exception = self._error_cb

        # Start consuming; each message is passed to _message_callback.
        self.pubsub_consumer_sub.open(self._message_callback)

Here _message_callback calls message.ack() if the application code handled the message successfully, and raises otherwise.
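The ack-or-raise behaviour of the callback can be sketched roughly as follows. This is not the real application code: FakeMessage is a hypothetical stand-in that models only the data attribute and ack() method of the Pub/Sub message, and make_message_callback and process are names invented for the illustration.

```python
class FakeMessage:
    """Hypothetical stand-in for a Pub/Sub message (only data and ack())."""

    def __init__(self, data):
        self.data = data
        self.acked = False

    def ack(self):
        self.acked = True


def make_message_callback(process):
    """Wrap application logic so the message is acked only on success."""

    def _message_callback(message):
        # Any exception raised by the application code propagates to the
        # caller, and the message is left unacknowledged.
        process(message.data)
        message.ack()

    return _message_callback
```

With this shape, a failing message is never acked, so redelivery is left to Pub/Sub.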

Relevant debugging:

 top -b -n1 -H -p 35    # pid 35 shows up as high cpu in "top"

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
  711 root      20   0 1580028  80752  15568 R 98.6  0.3  11:36.88 uwsgi
   35 root      20   0 1580028  80752  15568 S  0.0  0.3   0:01.43 uwsgi
   37 root      20   0 1580028  80752  15568 S  0.0  0.3   0:01.24 uwsgi
   43 root      20   0 1580028  80752  15568 S  0.0  0.3   0:01.13 uwsgi
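The same per-thread view can be read from /proc when top is not at hand. A minimal sketch, assuming Linux and the field layout documented in proc(5) for /proc/[pid]/task/[tid]/stat; the function names are made up for this illustration.

```python
import os


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/.../stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def busiest_threads(pid, top_n=3):
    """Return [(tid, ticks), ...] for the threads of pid, busiest first."""
    usage = []
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                usage.append((int(tid), cpu_ticks_from_stat(handle.read())))
        except OSError:
            continue  # the thread exited while we were scanning
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:top_n]
```

The tick counts are cumulative since thread start, so to get something comparable to %CPU you would sample twice and compare the difference.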

Then take the thread with the high CPU usage (711 in the output above) and strace it:

strace -e trace=all -p 711

(part of the output)
....
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 81) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 485753027}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 81) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 485879159}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486030327}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486165028}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486290705}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486428764}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486548305}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486670569}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486776253}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486896929}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 79) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 487011159}) = 0
....
and hundreds more of these

Taking a look at the open sockets, something jumps out:

ss -t -a
State       Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
CLOSE-WAIT  1       0       10.0.20.188:59188    74.125.206.84:https
CLOSE-WAIT  1       0       10.0.20.188:38434    74.125.206.84:https
... (others removed)...

Note that the peer IP address (74.125.206.84) is probably Pub/Sub related, but I cannot confirm that. Debugging is also harder because the traffic is HTTPS.
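To watch for such sockets from inside the container without ss, the kernel's table can be parsed directly. A rough sketch, assuming Linux on a little-endian machine and IPv4 only (IPv6 sockets live in /proc/net/tcp6 and are not handled); the helper names are invented for this example.

```python
import socket
import struct

TCP_CLOSE_WAIT = 0x08  # state code used in /proc/net/tcp


def _decode_ipv4(hex_addr):
    """Decode 'BC14000A:E734' into ('10.0.20.188', 59188)."""
    hex_ip, hex_port = hex_addr.split(":")
    # The address is a 32-bit value printed in host byte order
    # (little-endian on x86, which this sketch assumes).
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return [(local, remote, rx_queue), ...] for sockets in CLOSE-WAIT."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5 or int(fields[3], 16) != TCP_CLOSE_WAIT:
            continue
        # Field 4 is "tx_queue:rx_queue", both in hex.
        rx_queue = int(fields[4].split(":")[1], 16)
        found.append((_decode_ipv4(fields[1]), _decode_ipv4(fields[2]), rx_queue))
    return found
```

It could be fed with the contents of /proc/net/tcp, for example close_wait_sockets(open("/proc/net/tcp").read()), which lists the sockets of the current network namespace rather than of a single process.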

The loop of clock_gettime + poll (returning POLLHUP each time), combined with the two sockets in CLOSE-WAIT with a Recv-Q of 1, leads me to believe there is a bug in the Google Pub/Sub libraries: an invalid or incomplete message seems to be stuck in the receive queue, and the library keeps trying to consume it and retrying, which causes a hot loop that uses a lot of CPU.
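The mechanism behind such a hot loop can be reproduced in isolation. poll() reports POLLHUP regardless of the requested events, so a descriptor registered with events=0 (like fd 17 and fd 20 in the strace output) still makes every call return immediately once its peer has hung up, and the timeout never takes effect. The sketch below only demonstrates that behaviour with a pipe; it is not grpc's actual loop, and the function name and parameters are made up.

```python
import os
import select
import time


def count_poll_wakeups(duration_s=0.05, timeout_ms=80):
    """Poll a hung-up pipe for duration_s; return (wakeups, saw_hup)."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # the peer is gone: the read end now reports POLLHUP
    poller = select.poll()
    poller.register(read_fd, 0)  # events=0, as for fd 17 and fd 20
    wakeups = 0
    saw_hup = False
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            # If poll honoured the 80 ms timeout, this loop would run once.
            for _fd, revents in poller.poll(timeout_ms):
                saw_hup = saw_hup or bool(revents & select.POLLHUP)
            wakeups += 1
    finally:
        os.close(read_fd)
    return wakeups, saw_hup
```

A loop that keeps polling such a descriptor without closing or unregistering it will spin like this until the descriptor is removed from the poll set.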

Note that the application itself keeps consuming messages, but that might simply be because a uwsgi process is doing the work (not investigated).

I also did a gdb backtrace, which points to grpc:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fe5ea025c7e in now_impl (clock_type=<optimized out>) at src/core/lib/support/time_posix.c:92
#2  0x00007fe5ea025cfa in gpr_now (clock_type=clock_type@entry=GPR_CLOCK_MONOTONIC) at src/core/lib/support/time_posix.c:156
#3  0x00007fe5ea054a27 in cq_next (cc=0x2771790, deadline=..., reserved=<optimized out>) at src/core/lib/surface/completion_queue.c:833
#4  0x00007fe5ea05571e in grpc_completion_queue_next (cc=<optimized out>, deadline=..., reserved=reserved@entry=0x0) at src/core/lib/surface/completion_queue.c:873
#5  0x00007fe5e9ff9261 in __pyx_pf_4grpc_7_cython_6cygrpc_15CompletionQueue_2poll (__pyx_v_deadline=<optimized out>, __pyx_v_self=0x7fe5e8f6ab70)
    at src/python/grpcio/grpc/_cython/cygrpc.c:10553
#6  __pyx_pw_4grpc_7_cython_6cygrpc_15CompletionQueue_3poll (__pyx_v_self=0x7fe5e8f6ab70, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at src/python/grpcio/grpc/_cython/cygrpc.c:10413
#7  0x00007fe5f191ce8e in _PyCFunction_FastCallDict (func_obj=0x7fe5e827bfc0, args=0x7fe5e8262db0, nargs=<optimized out>, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:231
#8  0x00007fe5f191d1a7 in _PyCFunction_FastCallKeywords (func=func@entry=0x7fe5e827bfc0, stack=stack@entry=0x7fe5e8262db0, nargs=<optimized out>, kwnames=kwnames@entry=0x0)
    at Objects/methodobject.c:295
#9  0x00007fe5f19b46d3 in call_function (pp_stack=pp_stack@entry=0x7fe5e1ffa5f0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4788
#10 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5e8262c18, throwflag=<optimized out>) at Python/ceval.c:3275
#11 0x00007fe5f19b434a in _PyEval_EvalCodeWithName (_co=0x1, globals=0x7fe5e1ffa290, locals=0x2771820, locals@entry=0x0, args=0x7fe5f3683060, argcount=9223372036854775807, 
    kwnames=0xa7b03, kwnames@entry=0x7fe5f3683060, kwargs=0x7fe5f3683068, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fe5e8f64f60, name=0x0, qualname=0x0)
    at Python/ceval.c:4119
#12 0x00007fe5f19b490f in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7fe5f3683060, argcount=<optimized out>, 
    kws=kws@entry=0x7fe5f3683060, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fe5e8f64f60) at Python/ceval.c:4140
#13 0x00007fe5f18f645e in function_call (func=0x7fe5e82987b8, arg=0x7fe5f3683048, kw=0x7fe5e86d6708) at Objects/funcobject.c:604
#14 0x00007fe5f18c492a in PyObject_Call (func=0x7fe5e82987b8, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#15 0x00007fe5f19b98f6 in do_call_core (kwdict=0x7fe5e86d6708, callargs=0x7fe5f3683048, func=0x7fe5e82987b8) at Python/ceval.c:5057
#16 _PyEval_EvalFrameDefault (f=0x7fe5e86f6dd8, throwflag=<optimized out>) at Python/ceval.c:3357
#17 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#18 0x00007fe5f19b48ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fe5f04742f0) at Python/ceval.c:4905
#19 call_function (pp_stack=pp_stack@entry=0x7fe5e1ffaae0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#20 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5cc0008d8, throwflag=<optimized out>) at Python/ceval.c:3275
#21 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#22 0x00007fe5f19b48ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fe5f0474510) at Python/ceval.c:4905
#23 call_function (pp_stack=pp_stack@entry=0x7fe5e1ffacb0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#24 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5e826dbb8, throwflag=<optimized out>) at Python/ceval.c:3275
#25 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#26 0x00007fe5f19bd210 in _PyFunction_FastCallDict (func=func@entry=0x7fe5f0474378, args=args@entry=0x7fe5e1ffae60, nargs=1, kwargs=kwargs@entry=0x0) at Python/ceval.c:4972
#27 0x00007fe5f18c4b8e in _PyObject_FastCallDict (func=func@entry=0x7fe5f0474378, args=args@entry=0x7fe5e1ffae60, nargs=nargs@entry=1, kwargs=kwargs@entry=0x0)
    at Objects/abstract.c:2295
#28 0x00007fe5f18c4c91 in _PyObject_Call_Prepend (func=0x7fe5f0474378, obj=<optimized out>, args=0x7fe5f3683048, kwargs=0x0) at Objects/abstract.c:2358
#29 0x00007fe5f18c492a in PyObject_Call (func=0x7fe5e86be988, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#30 0x00007fe5f19b4d91 in PyEval_CallObjectWithKeywords (func=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at Python/ceval.c:4709
#31 0x00007fe5f1a07db2 in t_bootstrap (boot_raw=0x7fe5e8f1bb48) at ./Modules/_threadmodule.c:998
#32 0x00007fe5f32e8064 in start_thread (arg=0x7fe5e1ffb700) at pthread_create.c:309
#33 0x00007fe5f135862d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I have not yet dug into the details of grpc, but to me all clues point in that direction. We did not encounter issues like this with the old Pub/Sub library (we previously used google-cloud-pubsub==0.26).

Metadata

Labels

api: pubsub (Issues related to the Pub/Sub API.)
performance
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
