Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debuginfod failed-server cache #74757

Closed
wants to merge 2 commits into from

Conversation

kevinfrei
Copy link
Contributor

When using the Debuginfod library to pull symbols, a non-responding server is a major performance hit. I've added a "failed server" cache so that if a server has failed to respond, the library won't continue to query it. The performance hit is moderately catastrophic in my experience: 'target create /bin/ls' takes 45 seconds to fail on a system that's trying to talk to a server that is unreachable.

I'll be exposing more configuration settings for use of Debuginfod in the very near term, but for right now, this eliminates a pretty major negative user experience for folks on a Linux system sitting behind a proxy server, or in similar network-constrained situations.

@llvmbot
Copy link
Collaborator

llvmbot commented Dec 7, 2023

@llvm/pr-subscribers-debuginfo

Author: Kevin Frei (kevinfrei)

Changes

When using the Debuginfod library to pull symbols, a non-responding server is a major performance hit. I've added a "failed server" cache so that if a server has failed to respond, the library won't continue to query it. The performance hit is moderately catastrophic in my experience: 'target create /bin/ls' takes 45 seconds to fail on a system that's trying to talk to a server that is unreachable.

I'll be exposing more configuration settings for use of Debuginfod in the very near term, but for right now, this eliminates a pretty major negative user experience for folks on a Linux system sitting behind a proxy server, or in similar network-constrained situations.


Full diff: https://github.com/llvm/llvm-project/pull/74757.diff

1 Files Affected:

  • (modified) llvm/lib/Debuginfod/Debuginfod.cpp (+35-1)
diff --git a/llvm/lib/Debuginfod/Debuginfod.cpp b/llvm/lib/Debuginfod/Debuginfod.cpp
index 9df30ab55cbad..576cb8149cbb6 100644
--- a/llvm/lib/Debuginfod/Debuginfod.cpp
+++ b/llvm/lib/Debuginfod/Debuginfod.cpp
@@ -194,6 +194,26 @@ Error StreamedHTTPResponseHandler::handleBodyChunk(StringRef BodyChunk) {
   return Error::success();
 }
 
+// Create a file cache entry to remember if a given server failed
+Expected<AddStreamFn> CheckServerFailure(FileCache &Cache, uint &Task,
+                                         StringRef ServerUrl) {
+  SmallString<96> CachedServerFailurePath(ServerUrl);
+  llvm::transform(CachedServerFailurePath, CachedServerFailurePath.begin(),
+                  [&](char c) { return std::isalnum(c) ? c : '_'; });
+  CachedServerFailurePath.append(".failed");
+  return Cache(Task, CachedServerFailurePath, "");
+}
+
+void RegisterFailedServer(AddStreamFn &ServerFailureFn, uint &Task) {
+  Expected<std::unique_ptr<CachedFileStream>> FileStreamOrError =
+      ServerFailureFn(Task, "");
+  if (!FileStreamOrError)
+    consumeError(FileStreamOrError.takeError());
+  else
+    // Create the server-failure file
+    *FileStreamOrError.get()->OS << "";
+}
+
 // An over-accepting simplification of the HTTP RFC 7230 spec.
 static bool isHeader(StringRef S) {
   StringRef Name;
@@ -269,6 +289,17 @@ Expected<std::string> getCachedOrDownloadArtifact(
   HTTPClient Client;
   Client.setTimeout(Timeout);
   for (StringRef ServerUrl : DebuginfodUrls) {
+    // First, check to make sure we should keep asking this server.
+    Expected<AddStreamFn> ServerFailureFnOrErr =
+        CheckServerFailure(Cache, Task, ServerUrl);
+    if (!ServerFailureFnOrErr)
+      return ServerFailureFnOrErr.takeError();
+    AddStreamFn &ServerFailureFn = *ServerFailureFnOrErr;
+    if (!ServerFailureFn)
+      // We found a 'server failure' file in the cache which means
+      // this server has failed before: don't bother trying it again.
+      continue;
+
     SmallString<64> ArtifactUrl;
     sys::path::append(ArtifactUrl, sys::path::Style::posix, ServerUrl, UrlPath);
 
@@ -280,8 +311,11 @@ Expected<std::string> getCachedOrDownloadArtifact(
       HTTPRequest Request(ArtifactUrl);
       Request.Headers = getHeaders();
       Error Err = Client.perform(Request, Handler);
-      if (Err)
+      if (Err) {
+        // Put a server-failure marker in the cache so we don't keep trying it.
+        RegisterFailedServer(ServerFailureFn, Task);
         return std::move(Err);
+      }
 
       unsigned Code = Client.responseCode();
       if (Code && Code != 200)

llvm::transform(CachedServerFailurePath, CachedServerFailurePath.begin(),
[&](char c) { return std::isalnum(c) ? c : '_'; });
CachedServerFailurePath.append(".failed");
return Cache(Task, CachedServerFailurePath, "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the file cache for this will inherit whatever cache pruning policy is configured for debuginfod files. It seems prima facie pretty unlikely that sharing this policy is desirable; looking at the defaults, it takes a week for entries to expire, and the cache is only pruned once every 20 minutes. This means that a momentary connection failure would become one of a much much greater scope.

I'm not convinced that negative caching is the way to go here; looking broadly around, it seems rare to cache negative HTTP requests, while negative DNS requests are already typically cached and should fail quickly.

There's also already DEBUGINFOD_TIMEOUT; it looks like we didn't quite implement this correctly. It should be the timeout to receive 100K, so there shouldn't be any harm in setting it very short if desired. We used CURLOPT_TIMEOUT_MS instead of CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME, which is what both libdebuginfod's docs and implementation suggest. Fixing this seems like a better way to address this, since it doesn't require inventing any new mechanism.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on exposing some of the internal, fixed settings of Debuginfod as debugger settings currently. I disagree with your assessment of what "momentary" is, as the current setting waits for a full 45 seconds before it stops, so having an unreachable server is catastrophically bad (in my direct experience: My development machine lives in a data center where the distro has DEBUGINFOD_URLS configured to an unreachable server).

Even if we do set the timeout's much lower, the user experience is pretty horrific without some sort of negative caching, as each binary/.so triggers the timeout. So even setting it down to just a few seconds results in a downed server triggering a whole lot of user wait time.

Overall, this is, perhaps not the 'right' workaround. I'm working to shift Debuginfod access, generally, to an async model by default (or just off?) because the impact to user experience is just so very bad when it's misconfigured...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move ahead with configuring the timeouts to use more appropriate defaults, while also exposing them as LLDB options. I'd like to see the user experience after fixing those to see if it's still as bad as I'm worried it will be :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, by "momentary", I meant that if a server were actually to momentarily become unavailable, this caching implementation would make that failure last up to one additional week after the server had recovered. That's much worse than debuginfod not working well if misconfigured, since it would make it work much worse if properly configured. A user can fix configurations, but they can't usually make their internet connection more reliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it: I understand your concerns. There definitely needs to be a way for the user to adjust how sensitive a negative policy is, and a way to reset it easily. This implementation doesn't do the former, and the latter is...non-obvious. I'm going to dig into DEBUGINFO configuration more deeply... (Thanks for the push at misconfigured CURLOPT's). I'll abandon this particular PR.

@kevinfrei
Copy link
Contributor Author

Gonna accomplish the end results in a more user visible & configurable manner.

@kevinfrei kevinfrei closed this Dec 7, 2023
@kevinfrei kevinfrei deleted the debuginfod_failure_cache branch December 19, 2023 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants