Debuginfod failed-server cache #74757

kevinfrei · 2023-12-07T20:33:16Z

When using the Debuginfod library to pull symbols, a non-responding server is a major performance hit. I've added a "failed server" cache so that if a server has failed to respond, the library won't continue to query it. The performance hit is moderately catastrophic in my experience: 'target create /bin/ls' takes 45 seconds to fail on a system that's trying to talk to a server that is unreachable.

I'll be exposing more configuration settings for use of Debuginfod in the very near term, but for right now, this eliminates a pretty major negative user experience for folks on a Linux system sitting behind a proxy server, or in similar network-constrained situations.

llvmbot · 2023-12-07T20:34:03Z

@llvm/pr-subscribers-debuginfo

Author: Kevin Frei (kevinfrei)

Changes

When using the Debuginfod library to pull symbols, a non-responding server is a major performance hit. I've added a "failed server" cache so that if a server has failed to respond, the library won't continue to query it. The performance hit is moderately catastrophic in my experience: 'target create /bin/ls' takes 45 seconds to fail on a system that's trying to talk to a server that is unreachable.

I'll be exposing more configuration settings for use of Debuginfod in the very near term, but for right now, this eliminates a pretty major negative user experience for folks on a Linux system sitting behind a proxy server, or in similar network-constrained situations.

Full diff: https://github.com/llvm/llvm-project/pull/74757.diff

1 Files Affected:

(modified) llvm/lib/Debuginfod/Debuginfod.cpp (+35-1)

diff --git a/llvm/lib/Debuginfod/Debuginfod.cpp b/llvm/lib/Debuginfod/Debuginfod.cpp
index 9df30ab55cbad..576cb8149cbb6 100644
--- a/llvm/lib/Debuginfod/Debuginfod.cpp
+++ b/llvm/lib/Debuginfod/Debuginfod.cpp
@@ -194,6 +194,26 @@ Error StreamedHTTPResponseHandler::handleBodyChunk(StringRef BodyChunk) {
   return Error::success();
 }
 
+// Create a file cache entry to remember if a given server failed
+Expected<AddStreamFn> CheckServerFailure(FileCache &Cache, uint &Task,
+                                         StringRef ServerUrl) {
+  SmallString<96> CachedServerFailurePath(ServerUrl);
+  llvm::transform(CachedServerFailurePath, CachedServerFailurePath.begin(),
+                  [&](char c) { return std::isalnum(c) ? c : '_'; });
+  CachedServerFailurePath.append(".failed");
+  return Cache(Task, CachedServerFailurePath, "");
+}
+
+void RegisterFailedServer(AddStreamFn &ServerFailureFn, uint &Task) {
+  Expected<std::unique_ptr<CachedFileStream>> FileStreamOrError =
+      ServerFailureFn(Task, "");
+  if (!FileStreamOrError)
+    consumeError(FileStreamOrError.takeError());
+  else
+    // Create the server-failure file
+    *FileStreamOrError.get()->OS << "";
+}
+
 // An over-accepting simplification of the HTTP RFC 7230 spec.
 static bool isHeader(StringRef S) {
   StringRef Name;
@@ -269,6 +289,17 @@ Expected<std::string> getCachedOrDownloadArtifact(
   HTTPClient Client;
   Client.setTimeout(Timeout);
   for (StringRef ServerUrl : DebuginfodUrls) {
+    // First, check to make sure we should keep asking this server.
+    Expected<AddStreamFn> ServerFailureFnOrErr =
+        CheckServerFailure(Cache, Task, ServerUrl);
+    if (!ServerFailureFnOrErr)
+      return ServerFailureFnOrErr.takeError();
+    AddStreamFn &ServerFailureFn = *ServerFailureFnOrErr;
+    if (!ServerFailureFn)
+      // We found a 'server failure' file in the cache which means
+      // this server has failed before: don't bother trying it again.
+      continue;
+
     SmallString<64> ArtifactUrl;
     sys::path::append(ArtifactUrl, sys::path::Style::posix, ServerUrl, UrlPath);
 
@@ -280,8 +311,11 @@ Expected<std::string> getCachedOrDownloadArtifact(
       HTTPRequest Request(ArtifactUrl);
       Request.Headers = getHeaders();
       Error Err = Client.perform(Request, Handler);
-      if (Err)
+      if (Err) {
+        // Put a server-failure marker in the cache so we don't keep trying it.
+        RegisterFailedServer(ServerFailureFn, Task);
         return std::move(Err);
+      }
 
       unsigned Code = Client.responseCode();
       if (Code && Code != 200)

mysterymath · 2023-12-07T22:37:54Z

llvm/lib/Debuginfod/Debuginfod.cpp

+  llvm::transform(CachedServerFailurePath, CachedServerFailurePath.begin(),
+                  [&](char c) { return std::isalnum(c) ? c : '_'; });
+  CachedServerFailurePath.append(".failed");
+  return Cache(Task, CachedServerFailurePath, "");


Reusing the file cache for this will inherit whatever cache pruning policy is configured for debuginfod files. It seems prima facie pretty unlikely that sharing this policy is desirable; looking at the defaults, it takes a week for entries to expire, and the cache is only pruned once every 20 minutes. This means that a momentary connection failure would become one of a much much greater scope.

I'm not convinced that negative caching is the way to go here; looking broadly around, it seems rare to cache negative HTTP requests, while negative DNS requests are already typically cached and should fail quickly.

There's also already DEBUGINFOD_TIMEOUT; it looks like we didn't quite implement this correctly. It should be the timeout to receive 100K, so there shouldn't be any harm in setting it very short if desired. We used CURLOPT_TIMEOUT_MS instead of CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME, which is what both libdebuginfod's docs and implementation suggest. Fixing this seems like a better way to address this, since it doesn't require inventing any new mechanism.

I'm working on exposing some of the internal, fixed settings of Debuginfod as debugger settings currently. I disagree with your assessment of what "momentary" is, as the current setting waits for a full 45 seconds before it stops, so having an unreachable server is catastrophically bad (in my direct experience: My development machine lives in a data center where the distro has DEBUGINFOD_URLS configured to an unreachable server).

Even if we do set the timeout's much lower, the user experience is pretty horrific without some sort of negative caching, as each binary/.so triggers the timeout. So even setting it down to just a few seconds results in a downed server triggering a whole lot of user wait time.

Overall, this is, perhaps not the 'right' workaround. I'm working to shift Debuginfod access, generally, to an async model by default (or just off?) because the impact to user experience is just so very bad when it's misconfigured...

I'll move ahead with configuring the timeouts to use more appropriate defaults, while also exposing them as LLDB options. I'd like to see the user experience after fixing those to see if it's still as bad as I'm worried it will be :/

Sorry, by "momentary", I meant that if a server were actually to momentarily become unavailable, this caching implementation would make that failure last up to one additional week after the server had recovered. That's much worse than debuginfod not working well if misconfigured, since it would make it work much worse if properly configured. A user can fix configurations, but they can't usually make their internet connection more reliable.

Got it: I understand your concerns. There definitely needs to be a way for the user to adjust how sensitive a negative policy is, and a way to reset it easily. This implementation doesn't do the former, and the latter is...non-obvious. I'm going to dig into DEBUGINFO configuration more deeply... (Thanks for the push at misconfigured CURLOPT's). I'll abandon this particular PR.

kevinfrei · 2023-12-07T23:19:30Z

Gonna accomplish the end results in a more user visible & configurable manner.

llvmbot added the debuginfo label Dec 7, 2023

Kevin Frei added 2 commits December 7, 2023 13:56

Added failed server caching to Debuginfod

196eb33

Final cleanup: prep for public PR

63a4182

kevinfrei force-pushed the debuginfod_failure_cache branch from 6bb6c30 to 63a4182 Compare December 7, 2023 21:56

mysterymath reviewed Dec 7, 2023

View reviewed changes

kevinfrei closed this Dec 7, 2023

kevinfrei deleted the debuginfod_failure_cache branch December 19, 2023 23:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Debuginfod failed-server cache #74757

Debuginfod failed-server cache #74757

kevinfrei commented Dec 7, 2023

llvmbot commented Dec 7, 2023

mysterymath Dec 7, 2023

kevinfrei Dec 7, 2023

kevinfrei Dec 7, 2023

mysterymath Dec 7, 2023

kevinfrei Dec 7, 2023

kevinfrei commented Dec 7, 2023

Debuginfod failed-server cache #74757

Debuginfod failed-server cache #74757

Conversation

kevinfrei commented Dec 7, 2023

llvmbot commented Dec 7, 2023

mysterymath Dec 7, 2023

Choose a reason for hiding this comment

kevinfrei Dec 7, 2023

Choose a reason for hiding this comment

kevinfrei Dec 7, 2023

Choose a reason for hiding this comment

mysterymath Dec 7, 2023

Choose a reason for hiding this comment

kevinfrei Dec 7, 2023

Choose a reason for hiding this comment

kevinfrei commented Dec 7, 2023