Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using built-in DI registration functions like AddPublisherServiceApiClient can potentially cause thread starvation during high load at startup. #11092

Open
flagbug opened this issue Sep 26, 2023 · 4 comments
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@flagbug
Copy link

flagbug commented Sep 26, 2023

Using the built-in DI registration functions like AddPublisherServiceApiClient can potentially cause thread starvation during high load at startup.

Environment details

  • OS: Debian 11 in Docker
  • .NET version: 7.0
  • Package name and version: Google.Cloud.PubSub.V1->3.2.0
  • Running on Google Cloud Run

Steps to reproduce

Unfortunately there is no way to reliably reproduce it, we've encountered this issue only twice in the past two months

  1. Create a basic ASP.NET Core with the DI registration code in Program.cs: builder.Services.AddPublisherServiceApiClient();

  2. Add a basic controller that accepts an PublisherServiceApiClient in the constructor and publishes Pub/Sub messages using await PublisherServiceApiClient.PublishAsync

  3. Send a ton of traffic over a longer period of time.

When a new instance starts (e.g due to scaling), under rare circumstances this can cause the exact same issue as described here googleapis/google-api-dotnet-client#1807, where the fetching of the Application Default Credentials might cause thread starvation.

To summarize this previous issue:

  • Soon after application startup, you start receiving requests that try to obtain the Application Default Credentials from DI.
  • All but one of the request thread blocks until the singleton is ready, which will be ready once the default credential is fetched.
  • More requests keep coming in, and keep blocking on either of the DI singletons. The thread pool starts to deplete.
  • The "lucky" thread that went on to fetch the credential gets to a point where it needs to check whether we are running on Compute, and for that an HTTP request is made to the metadata server. This is an async operation that doesn't block the thread, which is freed to do other things (probably attent to yet another incoming request).
  • Still more requests keep coming in and more threads keep getting blocked waiting for the credential to be fetched, stressing the thread pool even more. The thread pool is probably growing at this point, but very slowly at first. Requests start being queued up, and some may start timing out.
  • When the async HTTP request to the metadata server ends it needs a thread to resume back into, and it's now competing for a thread from a starved thread pool. Basically, we need a thread to finish fetching the credential, while most threads are blocked waiting for the credential to be fetched.
  • Eventually the thread pool grows enough, the credential fetching can resume (if under the time we timeout at else, you get the error you initially reported), and the rest of the threads unblock with time.

As you can see in that issue, we've been hit by this exact issue 3 years ago, but in a different service of ours, under different circumstances using some manual initialization logic.

The reason I'm reporting this again is that this is now happening to us using the build-in DI functions like AddPublisherServiceApiClient, which one would assume to the right thing™ already and don't run into this issue.

Not sure what the best way to solve this is, as the async nature of the credentials fetching makes this inherently harder with DI. Maybe the library should have an internal synchronization mechanism when fetching the Application Default Credentials? Or should the documentation mention that we should do this manually beforehand?

@jskeet
Copy link
Collaborator

jskeet commented Sep 26, 2023

Thanks for the detailed report.

I'll assign this to Amanda who looked into this previously... I suspect there are no "easy" fixes for this, but we'll see.
(If we could tell ASP.NET Core to throttle requests for a little while, that could potentially help.)

The aspect I think I'd focus on is why threads are blocking rather than using async operations. If we're doing something there that could become async (or offer a new async overload) that might be helpful. But if it's the DI system itself which is blocking threads, that could be tricky.

@jskeet jskeet added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 26, 2023
@amanda-tarafa
Copy link
Contributor

OK, I might I have a fix for this. It's exactly the same issue as before, sorry we dindn't realize earlier and re-introduced it ourselves.

@amanda-tarafa
Copy link
Contributor

amanda-tarafa commented Sep 26, 2023

We need to discuss some things internally, but you can implement this workaround while we do so.

The workaround is basically to fetch the application default credential yourself so it has already been obtained by the time the clients are built. Note that you don't have to do anything with it, as the application default credential is cached the first time is fetched.

Your DI code would look like this:

// Workaround for https://github.com/googleapis/google-cloud-dotnet/issues/11092
GoogleCredential.GetApplicationDefaultCredential();
builder.Services.AddPublisherServiceApiClient();

I'll get back on this thread when we know more.

amanda-tarafa added a commit to amanda-tarafa/google-cloud-dotnet that referenced this issue Oct 4, 2023
amanda-tarafa added a commit to amanda-tarafa/google-cloud-dotnet that referenced this issue Oct 4, 2023
@amanda-tarafa
Copy link
Contributor

We are going to document the workaround (#11142) while we evaluate our options. So I'll downgrade this issue to a P3 not because we are not looking into it, but because there's a straightforward documented workaround.

I'll come back when I know more.

@amanda-tarafa amanda-tarafa added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Oct 4, 2023
amanda-tarafa added a commit to amanda-tarafa/google-cloud-dotnet that referenced this issue Oct 5, 2023
@amanda-tarafa amanda-tarafa added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p3 Desirable enhancement or fix. May not be included in next release. labels May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

3 participants