Skip to content

Conversation

necto
Copy link
Contributor

@necto necto commented Oct 2, 2025

Also remove the gratuitoius spaces after "," that break strict CSV compliance.

As the debug function name does not uniquely identify an entry point, add the main-TU name and the USR values for each entry point snapshot to reduce the likelyhood of collisions between declarations across large projects.

While adding a filename to each row increases the file size substantially, the difference in size for the compressed is acceptable.

I evaluated it on our set of 200+ open source C and C++ projects with 3M entry points, and got the following results when adding these two columns:

  • Raw CSV file increased from 530MB to 1.1GB
  • Compressed file (XZ) increased from 54 MB to 78 MB

--

CPP-7098

Co-authored-by: Balazs Benics balazs.benics@sonarsource.com

Also remove the gratuitoius spaces after "," that break strict CSV compliance.

As the debug function name does not uniquely identify an entry point,
add the main-TU name and the USR values for each entry point snapshot to
reduce the likelyhood of collisions between declarations across large
projects.

While adding a filename to each row increases the file size
substantially, the difference in size for the compressed is acceptable.

I evaluated it on our set of 200+ opensource C and C++ projects with 3M
entry points, and got the following results when adding these two columns:
- Raw CSV file increased from 530MB to 1.1GB
- Compressed file (XZ) increased from 54 MB to 78 MB

--

CPP-7098

Co-authored-by: Balazs Benics <balazs.benics@sonarsource.com>
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:static analyzer labels Oct 2, 2025
@llvmbot
Copy link
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-clang-static-analyzer-1

@llvm/pr-subscribers-clang

Author: Arseniy Zaostrovnykh (necto)

Changes

Also remove the gratuitoius spaces after "," that break strict CSV compliance.

As the debug function name does not uniquely identify an entry point, add the main-TU name and the USR values for each entry point snapshot to reduce the likelyhood of collisions between declarations across large projects.

While adding a filename to each row increases the file size substantially, the difference in size for the compressed is acceptable.

I evaluated it on our set of 200+ open source C and C++ projects with 3M entry points, and got the following results when adding these two columns:

  • Raw CSV file increased from 530MB to 1.1GB
  • Compressed file (XZ) increased from 54 MB to 78 MB

--

CPP-7098

Co-authored-by: Balazs Benics <balazs.benics@sonarsource.com>


Full diff: https://github.com/llvm/llvm-project/pull/161663.diff

4 Files Affected:

  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/EntryPointStats.h (+1-1)
  • (modified) clang/lib/StaticAnalyzer/Core/EntryPointStats.cpp (+24-7)
  • (modified) clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp (+11-1)
  • (modified) clang/test/Analysis/analyzer-stats/entry-point-stats.cpp (+6-2)
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/EntryPointStats.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/EntryPointStats.h
index 633fb7aa8f72d..448e40269ca2d 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/EntryPointStats.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/EntryPointStats.h
@@ -25,7 +25,7 @@ class EntryPointStat {
 public:
   llvm::StringLiteral name() const { return Name; }
 
-  static void lockRegistry();
+  static void lockRegistry(llvm::StringRef CPPFileName);
 
   static void takeSnapshot(const Decl *EntryPoint);
   static void dumpStatsAsCSV(llvm::raw_ostream &OS);
diff --git a/clang/lib/StaticAnalyzer/Core/EntryPointStats.cpp b/clang/lib/StaticAnalyzer/Core/EntryPointStats.cpp
index b7f9044f65308..62ae62f2f2154 100644
--- a/clang/lib/StaticAnalyzer/Core/EntryPointStats.cpp
+++ b/clang/lib/StaticAnalyzer/Core/EntryPointStats.cpp
@@ -9,7 +9,9 @@
 #include "clang/StaticAnalyzer/Core/PathSensitive/EntryPointStats.h"
 #include "clang/AST/DeclBase.h"
 #include "clang/Analysis/AnalysisDeclContext.h"
+#include "clang/Index/USRGeneration.h"
 #include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/Support/FileSystem.h"
@@ -38,6 +40,7 @@ struct Registry {
   };
 
   std::vector<Snapshot> Snapshots;
+  std::string EscapedCPPFileName;
 };
 } // namespace
 
@@ -69,7 +72,7 @@ static void checkStatName(const EntryPointStat *M) {
   }
 }
 
-void EntryPointStat::lockRegistry() {
+void EntryPointStat::lockRegistry(llvm::StringRef CPPFileName) {
   auto CmpByNames = [](const EntryPointStat *L, const EntryPointStat *R) {
     return L->name() < R->name();
   };
@@ -78,6 +81,8 @@ void EntryPointStat::lockRegistry() {
   enumerateStatVectors(
       [](const auto &Stats) { llvm::for_each(Stats, checkStatName); });
   StatsRegistry->IsLocked = true;
+  llvm::raw_string_ostream OS(StatsRegistry->EscapedCPPFileName);
+  llvm::printEscapedString(CPPFileName, OS);
 }
 
 [[maybe_unused]] static bool isRegistered(llvm::StringLiteral Name) {
@@ -144,15 +149,27 @@ static std::vector<llvm::StringLiteral> getStatNames() {
   return Ret;
 }
 
+static std::string getUSR(const Decl *D) {
+  llvm::SmallVector<char> Buf;
+  if (index::generateUSRForDecl(D, Buf)) {
+    assert(false && "This should never fail");
+    return AnalysisDeclContext::getFunctionName(D);
+  }
+  return llvm::toStringRef(Buf).str();
+}
+
 void Registry::Snapshot::dumpAsCSV(llvm::raw_ostream &OS) const {
   OS << '"';
+  llvm::printEscapedString(getUSR(EntryPoint), OS);
+  OS << "\",\"";
+  OS << StatsRegistry->EscapedCPPFileName << "\",\"";
   llvm::printEscapedString(
       clang::AnalysisDeclContext::getFunctionName(EntryPoint), OS);
-  OS << "\", ";
+  OS << "\",";
   auto PrintAsBool = [&OS](bool B) { OS << (B ? "true" : "false"); };
-  llvm::interleaveComma(BoolStatValues, OS, PrintAsBool);
-  OS << ((BoolStatValues.empty() || UnsignedStatValues.empty()) ? "" : ", ");
-  llvm::interleaveComma(UnsignedStatValues, OS);
+  llvm::interleave(BoolStatValues, OS, PrintAsBool, ",");
+  OS << ((BoolStatValues.empty() || UnsignedStatValues.empty()) ? "" : ",");
+  llvm::interleave(UnsignedStatValues, OS, [&OS](unsigned U) { OS << U; }, ",");
 }
 
 static std::vector<bool> consumeBoolStats() {
@@ -181,8 +198,8 @@ void EntryPointStat::dumpStatsAsCSV(llvm::StringRef FileName) {
 }
 
 void EntryPointStat::dumpStatsAsCSV(llvm::raw_ostream &OS) {
-  OS << "EntryPoint, ";
-  llvm::interleaveComma(getStatNames(), OS);
+  OS << "USR,File,DebugName,";
+  llvm::interleave(getStatNames(), OS, [&OS](const auto &a) { OS << a; }, ",");
   OS << "\n";
 
   std::vector<std::string> Rows;
diff --git a/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp b/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
index 53466e7a75b0f..9e7538eb34600 100644
--- a/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
+++ b/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
@@ -65,6 +65,15 @@ STAT_MAX(MaxCFGSize, "The maximum number of basic blocks in a function.");
 
 namespace {
 
+StringRef getMainFileName(const CompilerInvocation &Invocation) {
+  if (!Invocation.getFrontendOpts().Inputs.empty()) {
+    const FrontendInputFile &Input = Invocation.getFrontendOpts().Inputs[0];
+    return Input.isFile() ? Input.getFile()
+                          : Input.getBuffer().getBufferIdentifier();
+  }
+  return {};
+}
+
 class AnalysisConsumer : public AnalysisASTConsumer,
                          public DynamicRecursiveASTVisitor {
   enum {
@@ -125,7 +134,8 @@ class AnalysisConsumer : public AnalysisASTConsumer,
         PP(CI.getPreprocessor()), OutDir(outdir), Opts(opts), Plugins(plugins),
         Injector(std::move(injector)), CTU(CI),
         MacroExpansions(CI.getLangOpts()) {
-    EntryPointStat::lockRegistry();
+
+    EntryPointStat::lockRegistry(getMainFileName(CI.getInvocation()));
     DigestAnalyzerOptions();
 
     if (Opts.AnalyzerDisplayProgress || Opts.PrintStats ||
diff --git a/clang/test/Analysis/analyzer-stats/entry-point-stats.cpp b/clang/test/Analysis/analyzer-stats/entry-point-stats.cpp
index 1ff31d114ee99..9cbe04550a8d3 100644
--- a/clang/test/Analysis/analyzer-stats/entry-point-stats.cpp
+++ b/clang/test/Analysis/analyzer-stats/entry-point-stats.cpp
@@ -5,7 +5,9 @@
 // RUN: %csv2json "%t.csv" | FileCheck --check-prefix=CHECK %s
 //
 // CHECK:      {
-// CHECK-NEXT:   "fib(unsigned int)": {
+// CHECK-NEXT:   "c:@F@fib#i#": {
+// CHECK-NEXT:     "File": "{{.*}}entry-point-stats.cpp",
+// CHECK-NEXT:     "DebugName": "fib(unsigned int)",
 // CHECK-NEXT:     "NumBlocks": "{{[0-9]+}}",
 // CHECK-NEXT:     "NumBlocksUnreachable": "{{[0-9]+}}",
 // CHECK-NEXT:     "NumCTUSteps": "{{[0-9]+}}",
@@ -40,7 +42,9 @@
 // CHECK-NEXT:     "MaxValidBugClassSize": "{{[0-9]+}}",
 // CHECK-NEXT:     "PathRunningTime": "{{[0-9]+}}"
 // CHECK-NEXT:   },
-// CHECK-NEXT:   "main(int, char **)": {
+// CHECK-NEXT:   "c:@F@main#I#**C#": {
+// CHECK-NEXT:     "File": "{{.*}}entry-point-stats.cpp",
+// CHECK-NEXT:     "DebugName": "main(int, char **)",
 // CHECK-NEXT:     "NumBlocks": "{{[0-9]+}}",
 // CHECK-NEXT:     "NumBlocksUnreachable": "{{[0-9]+}}",
 // CHECK-NEXT:     "NumCTUSteps": "{{[0-9]+}}",

@necto necto requested a review from NagyDonat October 2, 2025 13:25
@necto
Copy link
Contributor Author

necto commented Oct 2, 2025

This is an updated version of the patch by @balazs-benics-sonarsource that we successfully used to do performance evaluation of a few changes downstream.

Copy link
Contributor

@NagyDonat NagyDonat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, seems to be a reasonable change.

return Input.isFile() ? Input.getFile()
: Input.getBuffer().getBufferIdentifier();
}
return {};
Copy link
Contributor

@NagyDonat NagyDonat Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about returning a human-readable placeholder like "<no input>" in this case? I would guess that it would be more helpful (e.g. easier to search for) if someone runs into this rare corner case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense changed in 7118400

@necto necto merged commit cf86ef9 into llvm:main Oct 3, 2025
9 checks passed
@necto necto deleted the az/metrics-id branch October 3, 2025 14:29
@steakhal
Copy link
Contributor

steakhal commented Oct 3, 2025

I think this patch introduced a build failure. See:
https://lab.llvm.org/buildbot/#/builders/206/builds/7031/steps/7/logs/stdio

[6997/8097] Linking CXX shared library lib/libclangStaticAnalyzerCore.so.22.0git
FAILED: lib/libclangStaticAnalyzerCore.so.22.0git 
: && /usr/bin/c++ -fPIC -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-uninitialized -Wno-nonnull -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-array-bounds -Wno-stringop-overread -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -O3 -DNDEBUG  -Wl,-z,defs -Wl,-z,nodelete   -Wl,-rpath-link,/home/botworker/bbot/hip-third-party-libs-test/build/./lib  -Wl,--gc-sections -shared -Wl,-soname,libclangStaticAnalyzerCore.so.22.0git -o lib/libclangStaticAnalyzerCore.so.22.0git tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/APSIntType.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/AnalysisManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/AnalyzerOptions.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/BasicValueFactory.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/BlockCounter.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/BugReporter.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/BugReporterVisitors.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/BugSuppression.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CallDescription.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CallEvent.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/Checker.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CheckerContext.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CheckerHelpers.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CheckerManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CheckerRegistryData.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CommonBugCategories.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ConstraintManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/CoreEngine.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/DynamicExtent.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/DynamicType.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/EntryPointStats.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/Environment.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ExplodedGraph.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ExprEngine.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ExprEngineC.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ExprEngineCXX.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ExprEngineCallAndReturn.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ExprEngineObjC.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/FunctionSummary.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/HTMLDiagnostics.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/LoopUnrolling.cpp.o tools/clang/lib/StaticAna
icAnalyzerCore.dir/MemRegion.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/PlistDiagnostics.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/ProgramState.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/RangeConstraintManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/RangedConstraintManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/RegionStore.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SarifDiagnostics.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SimpleConstraintManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SimpleSValBuilder.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SMTConstraintManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/Store.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SValBuilder.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SVals.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/SymbolManager.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/TextDiagnostics.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/WorkList.cpp.o tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/Z3CrosscheckVisitor.cpp.o  -Wl,-rpath,"\$ORIGIN/../lib:/home/botworker/bbot/hip-third-party-libs-test/build/lib:"  lib/libclangCrossTU.so.22.0git  lib/libclangFrontend.so.22.0git  lib/libclangAnalysis.so.22.0git  lib/libclangASTMatchers.so.22.0git  lib/libclangAST.so.22.0git  lib/libclangToolingCore.so.22.0git  lib/libclangRewrite.so.22.0git  lib/libclangLex.so.22.0git  lib/libclangBasic.so.22.0git  lib/libLLVMFrontendOpenMP.so.22.0git  lib/libLLVMSupport.so.22.0git  -Wl,-rpath-link,/home/botworker/bbot/hip-third-party-libs-test/build/lib && :
/usr/bin/ld: tools/clang/lib/StaticAnalyzer/Core/CMakeFiles/obj.clangStaticAnalyzerCore.dir/EntryPointStats.cpp.o: in function `clang::ento::EntryPointStat::dumpStatsAsCSV(llvm::raw_ostream&) [clone .localalias]':
EntryPointStats.cpp:(.text._ZN5clang4ento14EntryPointStat14dumpStatsAsCSVERN4llvm11raw_ostreamE+0x66b): undefined reference to `clang::index::generateUSRForDecl(clang::Decl const*, llvm::SmallVectorImpl<char>&)'
collect2: error: ld returned 1 exit status

@NagyDonat
Copy link
Contributor

🤔 How did this happen? I see that the "Build and test linux" CI job passed before the merge -- what's the difference between that build and this one that failed?

@steakhal
Copy link
Contributor

steakhal commented Oct 3, 2025

I know necto is busy right now. I'll revert this change as it's not urgent for us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants