
Lots of unexpected logging in Python library #540

Closed
kamalaboulhosn opened this issue Nov 17, 2021 · 28 comments
Labels: api: pubsub, external, priority: p2, status: blocked, status: investigating, type: bug

Comments

@kamalaboulhosn
Contributor

Summarized from a customer issue:

Somehow, something with this Google Pub/Sub service is printing out mutex.cc logs at an enormous scale of ~3.5M entries per 5 minutes. The logs look as follows:

Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Unlock 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock blocking 0x7f0c66b9cb78 @
Error  2021-11-03 15:08:55.861 EDT  [mutex.cc : 435] RAW: Lock returning 0x7f0c66b9cb78 @

The information provided about versions:

  1. We are using version 2.3.0 of google-cloud-pubsub
  2. The spike in logs was not tied to an upgrade of pubsub or another library; we actually haven't touched the code in months

The customer suggested: "I believe the core of the issue is that pubsub is sending messages without a delay and no exponential backoff, thus causing massive spikes in utilization and an over-use of the logging module."
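
For reference only, not a confirmed cause or fix: if missing backoff were the concern, a caller can pin an explicit exponential backoff on publish calls via google.api_core's Retry. A minimal sketch with placeholder project and topic names:

```python
from google.api_core.retry import Retry
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Explicit exponential backoff: start at 0.1 s, multiply by 1.3 per attempt, cap at 60 s.
backoff = Retry(initial=0.1, maximum=60.0, multiplier=1.3)

future = publisher.publish(topic_path, b"payload", retry=backoff)
print(future.result(timeout=60))  # message ID on success
```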

product-auto-label bot added the api: pubsub label on Nov 17, 2021
@kamalaboulhosn
Contributor Author

@plamut would you be able to take a look into this while @pradn is out? Thanks!

kamalaboulhosn added the type: bug and priority: p2 labels on Nov 17, 2021
@plamut
Contributor

plamut commented Nov 17, 2021

@kamalaboulhosn This appears to be happening at the level below Python. Can the customer post some more details about their OS, the Python version, the output of $ pip freeze, and any other environment info that might be relevant? Is the issue consistently reproducible/observable?

The output doesn't look like the logging we do in our Pub/Sub client code (at the Python level), and it probably doesn't come from the interpreter itself either, which is written in C. It looks like output from some C++ code, maybe grpc?

For a start, I would at least try upgrading Pub/Sub to the latest version (2.9.0) and see if it solves the problem.
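
A minimal sketch for collecting the details requested above (OS, Python version, installed package versions); the package list is just the usual suspects, and importlib.metadata requires Python 3.8+:

```python
import platform
import sys
from importlib import metadata

print("OS:", platform.platform())
print("Python:", sys.version)

# Usual suspects for this stack; extend with anything else from `pip freeze`.
for pkg in ("google-cloud-pubsub", "google-api-core", "grpcio", "protobuf"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```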

@kamalaboulhosn
Contributor Author

@plamut I will ask for the additional information. I agree that it doesn't look like something at the Pub/Sub library level. My suspicion is that debug logging has been turned on at the gRPC level somehow.
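
A minimal sketch of how to rule gRPC's own logging in or out: the gRPC C core reads the documented GRPC_VERBOSITY and GRPC_TRACE environment variables. Whether these affect the abseil mutex.cc RAW lines is not confirmed in this thread:

```python
import os

# Pin gRPC C-core logging before the client library (and thus grpcio) is imported.
os.environ.setdefault("GRPC_VERBOSITY", "ERROR")  # DEBUG/INFO/ERROR; ERROR keeps it quiet
os.environ.setdefault("GRPC_TRACE", "")           # make sure no tracers are enabled

from google.cloud import pubsub_v1  # noqa: E402  (imported after the env vars on purpose)

subscriber = pubsub_v1.SubscriberClient()
```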

plamut added the external label on Nov 18, 2021
@plamut
Contributor

plamut commented Nov 18, 2021

@lidizheng A question for you: do these logs resemble grpc logs? Do they look familiar? And if so, is it a known issue with a quick fix?

Thanks!

@lidizheng

lidizheng commented Nov 18, 2021

@plamut gRPC today uses Abseil C++ for its mutex implementation; there is no mutex.cc inside the gRPC code base. If gRPC trace logging were turned on, it would be pretty obvious: the log would talk about polling strategies and plugins.

@plamut
Contributor

plamut commented Nov 18, 2021

@lidizheng Thanks.

@kamalaboulhosn had a look into Abseil and found the mutex.cc file and the very log messages discussed here. The most probable theory seems to be that the customer somehow enabled the debug logging level, resulting in the flood that was reported.

@kamalaboulhosn
Contributor Author

They say they have not adjusted log levels in their environment.

They use these Python libraries:

  • pymysql==0.8.1
  • google-cloud-pubsub==2.3.0
  • sqlalchemy==1.3.18
  • pdf2image==1.14.0

And these Alpine Linux libraries:

  • gcc
  • libc-dev
  • libffi-dev
  • mariadb-dev
  • libstdc++
  • openssh
  • jpeg-dev
  • zlib-dev
  • poppler
  • poppler-utils

@pradn
Contributor

pradn commented Nov 30, 2021

Another customer is seeing the same issue.

The logs are coming from here.

I asked the ABSL team for info on how the raw log verbosity levels are set. abseil/abseil-cpp#1069

@pradn
Contributor

pradn commented Jan 11, 2022

I pinged derekmauro, who commented on abseil/abseil-cpp#1069, via internal chat. Hoping he can help us further with this.

@pradn
Contributor

pradn commented Jan 11, 2022

The GRPC team doesn't think it's their issue and they can't do much unless we have repro instructions for them.

plamut removed their assignment on Feb 15, 2022
@acocuzzo
Contributor

acocuzzo commented Jul 12, 2022

Closing, as the issue has not reoccurred.

@dis-sid

dis-sid commented Jul 19, 2022

It happened to us on the night of 2022-07-11; I had to take down the service at the receiving end. It took a few days to notice, at ~$500 per day of Cloud Logging usage.

2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Unlock 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock blocking 0x564c983a0378 @
2022-07-14 12:09:50.317 CEST [mutex.cc : 435] RAW: Lock returning 0x564c983a0378 @

google-api-core==2.3.2
google-auth==2.3.3
google-cloud-pubsub==2.9.0
googleapis-common-protos==1.54.0
grpc-google-iam-v1==0.12.3
grpcio==1.43.0
grpcio-status==1.43.0
proto-plus==1.19.8
protobuf==3.19.1
requests==2.26.0

@davidjungermann

davidjungermann commented Sep 5, 2022

I just had the same thing happen over a weekend, which ended up costing us over $400.

This service is running as a Flask application in Cloud Run. No adjusted log levels.

Flask==2.1.0
flask-smorest==0.39.0
gunicorn==20.1.0
google-cloud-pubsub==2.13.6
gobbler-utils==1.0.1
jsonschema==4.14.0
pytz==2022.2.1
python-dateutil==2.8.2
google-cloud-logging==3.2.2

It would be nice to know why this occurs, and whether there's any way for us to prevent it. Paying several hundred bucks for an error deemed "not reoccurring" feels really unfair.

EDIT: This just started occurring again, after not occurring for several hours. No changes were made to the running container; it just started spamming us with logs out of the blue. We're running two instances of this container in two environments with the same dependencies, and are only getting the error in one environment. We're on europe-west1.

EDIT 2: For anyone having this issue, adding a log exclusion to the log sink seems to have done the trick for us. Docs here: https://cloud.google.com/logging/docs/routing/overview

@davidjungermann

Replying to "Would you care sharing the exclusion filter? tnx":

I did this: resource.type = "cloud_run_revision" AND severity = "DEFAULT"

Could definitely be extended to add some further granularity, since this could filter out potentially wanted logs from other Cloud Run containers, so use with caution! :)
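
For completeness, the same kind of exclusion can be created programmatically. A minimal sketch assuming the GAPIC config client shipped with google-cloud-logging 2.x (exclusion name and project ID are placeholders; the filter here matches the mutex payload rather than all DEFAULT-severity logs, and as noted later in the thread, excluded entries still consume entries.write quota):

```python
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

client = ConfigServiceV2Client()

exclusion = LogExclusion(
    name="exclude-mutex-raw-logs",  # placeholder exclusion name
    description="Drop abseil mutex.cc RAW log spam from Cloud Run revisions",
    filter='resource.type="cloud_run_revision" AND textPayload:"mutex.cc"',
)

client.create_exclusion(
    parent="projects/my-project",  # placeholder project ID
    exclusion=exclusion,
)
```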

acocuzzo self-assigned this on Sep 15, 2022
acocuzzo reopened this on Sep 15, 2022
acocuzzo added the status: investigating label on Sep 15, 2022
@acocuzzo
Contributor

I've commented again on abseil/abseil-cpp#1069.
It seems this issue has occurred for users on GKE, and not just when using the pubsub library, so it's likely an issue with gRPC or GKE.
Did these spikes occur after any change to the containers, such as an upgrade?

@jovoss

jovoss commented Nov 30, 2022

The excessive logging happened to us as well, beginning 2022-12-28, without any upgrades/updates to the Cloud Run container/deployment.

We are using the following grpc version in our Python module:
grpcio==1.44.0

Re-deploying the service did not fix the issue.
We implemented @davidjungermann's exclusion filter as a quick fix.

We are looking forward to a definitive answer as to why this is happening, @acocuzzo.

@dbalogVEP

dbalogVEP commented Nov 30, 2022

This happened to us at the beginning of the month: > $2000 in Log Volume charges.

My container was built on 2022-04-08 and uses Python 3.8.12 with these Google packages:

gcsfs==2022.3.0
google-api-core==1.31.5
google-auth==2.4.1
google-auth-oauthlib==0.5.1
google-cloud-aiplatform==1.9.0
google-cloud-bigquery==2.32.0
google-cloud-core==2.2.2
google-cloud-secret-manager==2.5.0
google-cloud-storage==2.1.0
google-crc32c==1.3.0
google-resumable-media==2.1.0
googleapis-common-protos==1.54.0
grpc-google-iam-v1==0.12.3
grpcio==1.43.0

@acocuzzo
Contributor

I'm unable to reproduce the issue. If anyone has a consistent repro, it will help me narrow things down for the grpc folks. Thanks, and I appreciate the patience and information.

@PaddyAlton

PaddyAlton commented Feb 6, 2023

We have just experienced the same problem, caught via a regular billing report after it had been running for three days.

I don't know if this was the same for others, but what seems to have happened is that our application code started throwing errors first.

A bad config change meant that the GKE pod that subscribes to Pub/Sub started to repeatedly throw errors in the callback function. This caused the pod to fall over and restart, repeatedly. Each time it would terminate as follows:

Top-level exception occurred in callback while processing a message
[logs relating to the application error]
Request ID: [some unique ID]

Then, after this had happened a few tens of times, the deluge of 'mutex' logs began.

What's interesting is that

  • this seems to have started up at the point when we'd expect the pod to terminate
  • on the earlier failures, the pod restarted after less than a second and we got the logs our application emits on startup; that didn't happen this time
  • instead, the pod begins to emit mutex logs by the bucketload from 6ms after the Request ID is logged
  • as best I can tell, the pod sits there in some unstable state, emitting these logs, rather than shutting down

Hope that helps to trace this behaviour!
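
The "Top-level exception occurred in callback while processing a message" line appears to be what the subscriber library logs when the user callback raises. A minimal sketch of a callback that catches application errors and nacks the message instead of letting the exception escape; the handler and resource names are placeholders, and this addresses the crash-loop, not necessarily the mutex flood itself:

```python
import logging

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.message import Message

logger = logging.getLogger(__name__)


def process(data: bytes) -> None:
    """Placeholder for the real application handler."""
    print(data.decode("utf-8"))


def callback(message: Message) -> None:
    try:
        process(message.data)
        message.ack()
    except Exception:
        # Keep the exception inside the callback: log it and nack so Pub/Sub
        # redelivers, instead of letting the subscriber log a top-level error.
        logger.exception("Failed to process message %s", message.message_id)
        message.nack()


subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")  # placeholders

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result()  # block until cancelled or an unrecoverable error
except KeyboardInterrupt:
    streaming_pull_future.cancel()
```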

@AndreasBergmeier6176

We see the same behavior when Pods run for a certain time in GKE.
We have by now removed our dependency on Pub/Sub, though, and it still happens.
So it's probably a more general problem with the Python packages for GCP.

acocuzzo added the status: blocked label on Mar 13, 2023
@acocuzzo
Contributor

@PaddyAlton Thanks so much for the detailed description.
@AndreasBergmeier6176
Can you share the Python packages you are using? And by "run a certain time", do you mean it happens at a certain frequency, or regularly at a specific time? Perhaps when updating?
Thanks, all, for your details; I will forward this to the GKE and grpc teams.

@AndreasBergmeier6176

We looked into this some more, and the problem seems to come from using the Abseil library "wrong".
For us there are indications that it is actually a problem with waitress, in which the Python code is running.

@acocuzzo
Contributor

Thanks for the update @AndreasBergmeier6176

@riordan

riordan commented Apr 19, 2023

This just occurred on my team's Cloud Composer cluster: a single container in the underlying GKE cluster (Autopilot enabled) began emitting 32k mutex logs/sec and incurred ~USD 700 in 24h. We caught it after a day and a half. Conditions appeared very similar to what @PaddyAlton describes.

It took multiple attempts to kill the pod before the logs finally stopped. It didn't rise to the level of detection by Autopilot.

@ragavan-vyn

Similar issue on our team's Cloud Composer cluster:

severity: "ERROR"
textPayload: "[mutex.cc : 435] RAW: Unlock 0x55aa0f0900b8   @"
timestamp: "2023-05-07T02:59:58.405571439Z"

We thought of setting up an exclusion filter as suggested here, but it may still consume the logging quota:

Log entries are excluded after they are received by the Logging API and therefore these log entries consume entries.write API quota. You can't reduce the number of entries.write API calls by excluding log entries.

https://cloud.google.com/logging/docs/routing/overview#exclusions

@acocuzzo
Contributor

acocuzzo commented Jul 6, 2023

I would like to move this issue to a broader audience so that it can get attention in the right place.
It seems the sticking point is reproduction. @ragavan-vyn, @riordan, @AndreasBergmeier6176, @PaddyAlton, would anyone be willing to share reproduction details that I can pass along? Thanks very much.

@AndreasBergmeier6176

For us the solution was simply to stop using waitress and Python altogether.

@acocuzzo
Contributor

acocuzzo commented Jul 13, 2023

In an effort to improve visibility of this issue and allow for collaboration across libraries, I've filed a public issue:
https://issuetracker.google.com/issues/291120267
I'm closing the bug here, but I encourage you to add whatever information you can to the public issue. Thanks, everyone, for your debugging efforts.
