Publisher thread terminates, forever breaking publication when GCE metadata service blips #1173

Open
pgcamus opened this issue May 13, 2024 · 0 comments
Labels
api: pubsub Issues related to the googleapis/python-pubsub API.

Environment details

  • OS type and version: Ubuntu 22.04
  • Python version: 3.10.9
  • pip version: pip --version
  • google-cloud-pubsub version: 2.21.1

Steps to reproduce

Run a google-cloud-pubsub publisher on GCE during a metadata server outage such as https://status.cloud.google.com/incidents/u6rQ2nNVbhAFqGCcTm58.

Note that even a brief GCE metadata outage can trigger this: once the exception is raised a single time, the commit thread is dead for good. In our case the metadata server was unavailable only briefly, but all five retries ran within roughly 23 ms (see the timestamps below), so the exception was raised before the service recovered:

2024-04-26T07:30:45.783 Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7c1ca813c1c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-04-26T07:30:45.788 Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7c1ca8290730>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-04-26T07:30:45.794 Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7c1ca82918a0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-04-26T07:30:45.801 [...]
2024-04-26T07:30:45.806 [...]
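
To make the timing concrete, here is a small sketch that measures the retry window from the five log timestamps above and compares it with an exponential backoff schedule. The backoff parameters (1 s base delay, doubling each time) are purely illustrative assumptions, not what google-auth does or is planned to do.

```python
from datetime import datetime, timedelta

# Timestamps copied from the five log lines above.
stamps = [
    "2024-04-26T07:30:45.783",
    "2024-04-26T07:30:45.788",
    "2024-04-26T07:30:45.794",
    "2024-04-26T07:30:45.801",
    "2024-04-26T07:30:45.806",
]
times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f") for s in stamps]

# Time between the first and the last logged attempt, in whole milliseconds.
observed_window_ms = (times[-1] - times[0]) // timedelta(milliseconds=1)

# Hypothetical schedule: 5 attempts separated by 4 delays of 1, 2, 4, 8 seconds.
base_delay_s = 1.0
backoff_window_s = sum(base_delay_s * 2**i for i in range(len(stamps) - 1))

print(f"observed retry window: {observed_window_ms} ms")
print(f"hypothetical backoff window: {backoff_window_s} s")
```

With retries packed this tightly, any outage longer than a few tens of milliseconds exhausts all attempts.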

Stack trace

Traceback (most recent call last):
  File "/app/device/trimark/proxy/proxy.runfiles/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/app/device/trimark/proxy/proxy.runfiles/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/cloud/pubsub_v1/publisher/_batch/thread.py", line 274, in _commit
    response = self._client._gapic_publish(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/cloud/pubsub_v1/publisher/client.py", line 267, in _gapic_publish
    return super().publish(*args, **kwargs)
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/pubsub_v1/services/publisher/client.py", line 1058, in publish
    self._validate_universe_domain()
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/pubsub_v1/services/publisher/client.py", line 554, in _validate_universe_domain
    or PublisherClient._compare_universes(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/pubsub_v1/services/publisher/client.py", line 531, in _compare_universes
    credentials_universe = getattr(credentials, "universe_domain", default_universe)
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_auth/site-packages/google/auth/compute_engine/credentials.py", line 154, in universe_domain
    self._universe_domain = _metadata.get_universe_domain(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_auth/site-packages/google/auth/compute_engine/_metadata.py", line 284, in get_universe_domain
    universe_domain = get(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_auth/site-packages/google/auth/compute_engine/_metadata.py", line 217, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

Speculative analysis

It looks like the google-auth library raises a TransportError that the batch commit thread in this library does not catch. Potential fixes include catching it in Batch._commit, or catching it further down in google-cloud-pubsub and wrapping it in a GoogleAPIError.
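
A minimal, stdlib-only sketch of the first option, to show the shape of the fix rather than the library's actual code. The `TransportError` class, `flaky_publish`, and `commit` below are hypothetical stand-ins; the real `Batch._commit` has different arguments and more state to manage.

```python
import threading
from concurrent.futures import Future


class TransportError(Exception):
    """Local stand-in for google.auth.exceptions.TransportError (assumption)."""


def flaky_publish(messages):
    # Simulates the metadata lookup failing inside the publish call.
    raise TransportError("Compute Engine Metadata server unavailable")


def commit(publish, messages, futures):
    """Hypothetical commit routine that never lets an exception escape the thread."""
    try:
        message_ids = publish(messages)
    except Exception as exc:
        # Broad on purpose: anything that escapes here would end the thread
        # and leave every pending future unresolved.
        for future in futures:
            future.set_exception(exc)
        return
    for future, message_id in zip(futures, message_ids):
        future.set_result(message_id)


futures = [Future(), Future()]
worker = threading.Thread(target=commit, args=(flaky_publish, ["a", "b"], futures))
worker.start()
worker.join()

# Each caller now sees the failure on its own future instead of waiting forever.
for future in futures:
    print(type(future.exception(timeout=1)).__name__)
```

The point of the design is that the error is surfaced to callers through their futures, so they can retry or alert, while the thread itself exits cleanly.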
