
Lots of unexpected logging in Python library #540

Closed
kamalaboulhosn opened this issue Nov 17, 2021 · 28 comments
Labels: api: pubsub, external, priority: p2, status: blocked, status: investigating, type: bug

Comments

@kamalaboulhosn
Contributor

Summarized from a customer issue:

Somehow, something with this Google Pub/Sub service is printing out mutex.cc logs at an enormous scale of ~3.5M entries per 5 minutes. The logs look as follows:

Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @

The information provided about versions:

  1. We are using version 2.3.0 of google-cloud-pubsub
  2. The spike in logs was not tied to an upgrade of pubsub or another library; we actually haven't touched the code in months

The customer suggested: "I believe the core of the issue is that pubsub is sending messages without a delay and no exponential backoff, thus causing massive spikes in utilization and an over-use of the logging module."
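
For reference only, not a confirmed cause or fix: if missing backoff were the concern, a caller can pin an explicit exponential backoff on publish calls via google.api_core's Retry. A minimal sketch with placeholder project and topic names:

```python
from google.api_core.retry import Retry
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Explicit exponential backoff: start at 0.1 s, multiply by 1.3 per attempt, cap at 60 s.
backoff = Retry(initial=0.1, maximum=60.0, multiplier=1.3)

future = publisher.publish(topic_path, b"payload", retry=backoff)
print(future.result(timeout=60))  # message ID on success
```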

product-auto-label bot added the api: pubsub label on Nov 17, 2021
@kamalaboulhosn
Contributor Author

@plamut would you be able to take a look into this while @pradn is out? Thanks!

kamalaboulhosn added the type: bug and priority: p2 labels on Nov 17, 2021
@plamut
Contributor

plamut commented Nov 17, 2021

@kamalaboulhosn This appears to be happening at the level below Python. Can the customer post some more details about their OS, the Python version, the output of $ pip freeze, and any other environment info that might be relevant? Is the issue consistently reproducible/observable?

The output doesn't look like the logging we do in our Pub/Sub client code (at the Python level), and it probably doesn't come from the interpreter itself either, which is written in C. It looks like output from some C++ code, maybe grpc?

For a start, I would at least try upgrading Pub/Sub to the latest version (2.9.0) and see if it solves the problem.
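
A minimal sketch for collecting the details requested above (OS, Python version, installed package versions); the package list is just the usual suspects, and importlib.metadata requires Python 3.8+:

```python
import platform
import sys
from importlib import metadata

print("OS:", platform.platform())
print("Python:", sys.version)

# Usual suspects for this stack; extend with anything else from `pip freeze`.
for pkg in ("google-cloud-pubsub", "google-api-core", "grpcio", "protobuf"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```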

@kamalaboulhosn
Contributor Author

@plamut I will ask for the additional information. I agree that it doesn't look like something at the Pub/Sub library level. My suspicion is that debug logging has been turned on at the gRPC level somehow.
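
A minimal sketch of how to rule gRPC's own logging in or out: the gRPC C core reads the documented GRPC_VERBOSITY and GRPC_TRACE environment variables. Whether these affect the abseil mutex.cc RAW lines is not confirmed in this thread:

```python
import os

# Pin gRPC C-core logging before the client library (and thus grpcio) is imported.
os.environ.setdefault("GRPC_VERBOSITY", "ERROR")  # DEBUG/INFO/ERROR; ERROR keeps it quiet
os.environ.setdefault("GRPC_TRACE", "")           # make sure no tracers are enabled

from google.cloud import pubsub_v1  # noqa: E402  (imported after the env vars on purpose)

subscriber = pubsub_v1.SubscriberClient()
```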

plamut added the external label on Nov 18, 2021
@plamut
Contributor

plamut commented Nov 18, 2021

@lidizheng A question for you: do these logs resemble grpc logs? Do they look familiar? And if so, is it a known issue with a quick fix?

Thanks!

@lidizheng

lidizheng commented Nov 18, 2021

@plamut gRPC today uses Abseil C++ for its mutex implementation; there is no mutex.cc inside the gRPC code base. If gRPC trace logging were turned on, it would be pretty obvious: the log would talk about polling strategies and plugins.

@plamut
Contributor

plamut commented Nov 18, 2021

@lidizheng Thanks.

@kamalaboulhosn had a look into Abseil and found the mutex.cc file and the very log messages discussed here. The most probable theory seems to be that the customer somehow enabled the debug logging level, resulting in the flood that was reported.

@kamalaboulhosn
Contributor Author

They say they have not adjusted log levels in their environment.

They use these Python libraries:

  • pymysql==0.8.1
  • google-cloud-pubsub==2.3.0
  • sqlalchemy==1.3.18
  • pdf2image==1.14.0

And these Alpine Linux libraries:

  • gcc
  • libc-dev
  • libffi-dev
  • mariadb-dev
  • libstdc++
  • openssh
  • jpeg-dev
  • zlib-dev
  • poppler
  • poppler-utils

@pradn
Contributor

pradn commented Nov 30, 2021

Another customer is seeing the same issue.

The logs are coming from here.

I asked the ABSL team for info on how the raw log verbosity levels are set. abseil/abseil-cpp#1069

@pradn
Contributor

pradn commented Jan 11, 2022

I pinged derekmauro, who commented on abseil/abseil-cpp#1069, via internal chat. Hoping he can help us further with this.

@pradn
Contributor

pradn commented Jan 11, 2022

The GRPC team doesn't think it's their issue and they can't do much unless we have repro instructions for them.

plamut removed their assignment on Feb 15, 2022
@acocuzzo
Contributor

acocuzzo commented Jul 12, 2022

Closing, as the issue has not reoccurred.

@dis-sid

dis-sid commented Jul 19, 2022

It happened to us on the night of 2022-07-11; I had to take down the service at the receiving end. It took a few days to notice, at ~$500 per day of Cloud Logging usage.

2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @

google-api-core==2.3.2
google-auth==2.3.3
google-cloud-pubsub==2.9.0
googleapis-common-protos==1.54.0
grpc-google-iam-v1==0.12.3
grpcio==1.43.0
grpcio-status==1.43.0
proto-plus==1.19.8
protobuf==3.19.1
requests==2.26.0

@davidjungermann

davidjungermann commented Sep 5, 2022

I just had the same thing happen over a weekend, which ended up costing us over $400.

This service is running as a Flask application in Cloud Run. No adjusted log levels.

Flask==2.1.0
flask-smorest==0.39.0
gunicorn==20.1.0
google-cloud-pubsub==2.13.6
gobbler-utils==1.0.1
jsonschema==4.14.0
pytz==2022.2.1
python-dateutil==2.8.2
google-cloud-logging==3.2.2

It would be nice to know why this occurs, and whether there's any way for us to prevent it. Paying several hundred bucks for an error deemed "not reoccurring" feels really unfair.

EDIT: This just started occurring again, after not occurring for several hours. No changes were made to the running container; it just started spamming us with logs out of the blue. We're running two instances of this container in two environments with the same dependencies, and are only getting the error in one environment. We're on europe-west1.

EDIT 2: For anyone having this issue, adding a log exclusion to the log sink seems to have done the trick for us. Docs here: https://cloud.google.com/logging/docs/routing/overview

@davidjungermann

Replying to "Would you care sharing the exclusion filter? tnx":

I did this: resource.type = "cloud_run_revision" AND severity = "DEFAULT"

Could definitely be extended to add some further granularity, since this could filter out potentially wanted logs from other Cloud Run containers, so use with caution! :)
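
For completeness, the same kind of exclusion can be created programmatically. A minimal sketch assuming the GAPIC config client shipped with google-cloud-logging 2.x (exclusion name and project ID are placeholders; the filter here matches the mutex payload rather than all DEFAULT-severity logs, and as noted later in the thread, excluded entries still consume entries.write quota):

```python
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

client = ConfigServiceV2Client()

exclusion = LogExclusion(
    name="exclude-mutex-raw-logs",  # placeholder exclusion name
    description="Drop abseil mutex.cc RAW log spam from Cloud Run revisions",
    filter='resource.type="cloud_run_revision" AND textPayload:"mutex.cc"',
)

client.create_exclusion(
    parent="projects/my-project",  # placeholder project ID
    exclusion=exclusion,
)
```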

acocuzzo self-assigned this on Sep 15, 2022
acocuzzo reopened this on Sep 15, 2022
acocuzzo added the status: investigating label on Sep 15, 2022
@acocuzzo
Contributor

I've commented again on abseil/abseil-cpp#1069.
It seems this issue has occurred for users on GKE, and not just when using the pubsub library, so it's likely an issue with gRPC or GKE.
Did these spikes occur after any change to the containers, such as an upgrade?

@jovoss

jovoss commented Nov 30, 2022

The excessive logging happened to us as well, beginning 2022-12-28, without any upgrades/updates to the Cloud Run container/deployment.

We are using the following grpc version in our Python module:
grpcio==1.44.0

Re-deploying the service did not fix the issue.
We implemented @davidjungermann's exclusion filter as a quick fix.

We are looking forward to a definitive answer as to why this is happening, @acocuzzo.

@dbalogVEP

dbalogVEP commented Nov 30, 2022

This happened to us at the beginning of the month: > $2000 in Log Volume charges.

My container was built on 2022-04-08 and uses Python 3.8.12 with these Google packages:

gcsfs==2022.3.0
google-api-core==1.31.5
google-auth==2.4.1
google-auth-oauthlib==0.5.1
google-cloud-aiplatform==1.9.0
google-cloud-bigquery==2.32.0
google-cloud-core==2.2.2
google-cloud-secret-manager==2.5.0
google-cloud-storage==2.1.0
google-crc32c==1.3.0
google-resumable-media==2.1.0
googleapis-common-protos==1.54.0
grpc-google-iam-v1==0.12.3
grpcio==1.43.0

@acocuzzo
Contributor

I'm unable to reproduce the issue. If anyone has a consistent repro, it will help me narrow things down for the grpc folks. Thanks, and I appreciate the patience and information.

@PaddyAlton

PaddyAlton commented Feb 6, 2023

We have just experienced the same problem, caught via a regular billing report after it had been running for three days.

I don't know if this was the same for others, but what seems to have happened is that our application code started throwing errors first.

A bad config change meant that the GKE pod that subscribes to Pub/Sub started to repeatedly throw errors in the callback function. This caused the pod to fall over and restart, repeatedly. Each time it would terminate as follows:

Top-level exception occurred in callback while processing a message
[logs relating to the application error]
Request ID: [some unique ID]

Then, after this had happened a few tens of times, the deluge of 'mutex' logs began.

What's interesting is that

  • this seems to have started up at the point when we'd expect the pod to terminate
  • on the earlier failures, the pod restarted after less than a second and we got the logs our application emits on startup; that didn't happen this time
  • instead, the pod begins to emit mutex logs by the bucketload from 6ms after the Request ID is logged
  • as best I can tell, the pod sits there in some unstable state, emitting these logs, rather than shutting down

Hope that helps to trace this behaviour!
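
The "Top-level exception occurred in callback while processing a message" line appears to be what the subscriber library logs when the user callback raises. A minimal sketch of a callback that catches application errors and nacks the message instead of letting the exception escape; the handler and resource names are placeholders, and this addresses the crash-loop, not necessarily the mutex flood itself:

```python
import logging

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.message import Message

logger = logging.getLogger(__name__)


def process(data: bytes) -> None:
    """Placeholder for the real application handler."""
    print(data.decode("utf-8"))


def callback(message: Message) -> None:
    try:
        process(message.data)
        message.ack()
    except Exception:
        # Keep the exception inside the callback: log it and nack so Pub/Sub
        # redelivers, instead of letting the subscriber log a top-level error.
        logger.exception("Failed to process message %s", message.message_id)
        message.nack()


subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")  # placeholders

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result()  # block until cancelled or an unrecoverable error
except KeyboardInterrupt:
    streaming_pull_future.cancel()
```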

@AndreasBergmeier6176

We see the same behavior when Pods run for a certain time in GKE.
We have by now removed our dependency on Pub/Sub, though, and it still happens.
So it's probably a more general problem with the Python packages for GCP.

acocuzzo added the status: blocked label on Mar 13, 2023
@acocuzzo
Contributor

@PaddyAlton Thanks so much for the detailed description.
@AndreasBergmeier6176
Can you share the Python packages you are using? And by "run a certain time", do you mean it happens at a certain frequency, or regularly at a specific time? Perhaps when updating?
Thanks, all, for your details; I will forward this to the GKE and grpc teams.

@AndreasBergmeier6176

We looked into this some more, and the problem seems to come from using the Abseil library "wrong".
For us there are indications that it is actually a problem with waitress, in which the Python code is running.

@acocuzzo
Contributor

Thanks for the update @AndreasBergmeier6176

@riordan

riordan commented Apr 19, 2023

This just occurred on my team's Cloud Composer cluster: a single container in the underlying GKE cluster (Autopilot enabled) began emitting 32k mutex logs/sec and incurred ~USD 700 in 24h. We caught it after a day and a half. Conditions appeared very similar to what @PaddyAlton describes.

It took multiple attempts to kill the pod before the logs finally stopped. It didn't rise to the level of detection by Autopilot.

@ragavan-vyn

Similar issue on our team's Cloud Composer cluster:

severity: "ERROR"
textPayload: "[mutex.cc : 435] RAW: Unlock 0x55aa0f0900b8   @"
timestamp: "2023-05-07T02:59:58.405571439Z"

We thought of setting up an exclusion filter as suggested here, but it may still consume the logging quota:

Log entries are excluded after they are received by the Logging API and therefore these log entries consume entries.write API quota. You can't reduce the number of entries.write API calls by excluding log entries.

https://cloud.google.com/logging/docs/routing/overview#exclusions

@acocuzzo
Contributor

acocuzzo commented Jul 6, 2023

I would like to move this issue to a broader audience so that it can get attention in the right place.
It seems the sticking point is reproduction. @ragavan-vyn, @riordan, @AndreasBergmeier6176, @PaddyAlton, would anyone be willing to share reproduction details that I can pass along? Thanks very much.

@AndreasBergmeier6176

For us the solution was simply to stop using waitress and Python altogether.

@acocuzzo
Contributor

acocuzzo commented Jul 13, 2023

In an effort to improve visibility of this issue and allow for collaboration across libraries, I've filed a public issue:
https://issuetracker.google.com/issues/291120267
I'm closing the bug here, but I encourage you to add whatever information you can to the public issue. Thanks, everyone, for your debugging efforts.
