Description
Hi,
I've been struggling with an issue for nearly a year. I am building some large CUDA-related shared libraries (.so) with PGO in continuous mode, but I consistently encounter the following error:
if (UseBiasVar && BiasAddr == BiasDefaultAddr) {
PROF_ERR("%s\n", "__llvm_profile_counter_bias is undefined");
return;
}
Here:
BiasAddr = &__llvm_profile_counter_bias
BiasDefaultAddr = &__llvm_profile_counter_bias_default
From my understanding, __llvm_profile_counter_bias should be determined later, after passing this check, inside mmapForContinuousMode.
From the code of Value *InstrProfiling::getCounterAddress(InstrProfInstBase *I), I assume that when counters are inserted, the compiler should also insert __llvm_profile_counter_bias, but only as a placeholder. Since the counter size exceeds 100k, I assume this insertion has already happened, but the symbol might be lost because it is a global weak symbol? This seems similar to [this Rust issue](rust-lang/rust#120842).
However, PaddlePaddle's build system is too large, fragile, and complex, and I cannot pinpoint exactly where this symbol might be lost.
Problem Details
The __llvm_profile_counter_bias symbol is consistently reported as undefined, and I cannot determine why. Due to CUDA's limitations, I can only use LLVM-15. I also reviewed LLVM-19's source code and the logic looks very similar, so I suspect the root cause persists across these versions.
I am building the PaddlePaddle framework with the following flags:
-fprofile-instr-generate -fcoverage-mapping -g -mllvm -runtime-counter-relocation
The build produces four instrumented shared libraries (.so), which dlopen each other. The first three behave as expected, but the last one (libpaddle.so/libphi.so) always reports this error:
dll = ctypes.CDLL("/Paddle/build/python/paddle/libs/libphi.so")
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 32480
LLVM Profile Error: LALALA: BiasAddr 133895675533552
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LALALA: mmapForContinuousMode 2
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 92448
LLVM Profile Error: LALALA: BiasAddr 133896026068704
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LALALA: mmapForContinuousMode 2
LLVM Profile Error: LALALA: initializeProfileForContinuousMode
LLVM Profile Error: LALALA: CountersSize 7777352
LLVM Profile Error: LALALA: BiasAddr 133896653571776
LLVM Profile Error: LALALA: BiasDefaultAddr 133896653571776
LLVM Profile Error: __llvm_profile_counter_bias is undefined
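The comparison in this log can be automated when many libraries are loaded. Below is a minimal sketch in Python. It assumes the ad-hoc `LALALA:` debug lines keep exactly the layout shown (they come from my local changes to the runtime, not from upstream LLVM), and the function name `libraries_missing_bias` is made up for illustration:

```python
import re

# Matches the locally added debug lines, e.g. "... LALALA: BiasAddr 1338...".
# The "LALALA:" prefix is an assumption tied to this particular debug build.
FIELD_RE = re.compile(r"LALALA: (CountersSize|BiasAddr|BiasDefaultAddr) (\d+)")


def libraries_missing_bias(log_text):
    """Return CountersSize of each init block where BiasAddr == BiasDefaultAddr."""
    missing = []
    current = {}
    for line in log_text.splitlines():
        if "initializeProfileForContinuousMode" in line:
            current = {}  # a new library is being initialized
            continue
        match = FIELD_RE.search(line)
        if not match:
            continue
        current[match.group(1)] = int(match.group(2))
        if len(current) == 3:
            # Equal addresses mean the weak reference fell back to the default.
            if current["BiasAddr"] == current["BiasDefaultAddr"]:
                missing.append(current["CountersSize"])
            current = {}
    return missing
```

Fed the log above, this should flag only the block with CountersSize 7777352, which is the library that goes on to print the error.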
I inspected the compiled .so files with nm and found the following symbol table entries:
libcommon.so:
00000000001298f0 B __llvm_profile_counter_bias_default
00000000001299d8 V __llvm_profile_filename
...
libpaddle.so:
00000000268d9c08 B __llvm_profile_counter_bias_default
U __llvm_profile_filename
...
libphi.so:
000000002566c6c0 B __llvm_profile_counter_bias_default
U __llvm_profile_filename
...
libphi_kernel_gpu.so:
0000000014e496e0 B __llvm_profile_counter_bias_default
U __llvm_profile_filename
...
As seen, the symbol __llvm_profile_counter_bias_default exists in all shared libraries, but the actual __llvm_profile_counter_bias symbol is missing.
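To repeat this check across many files, the nm listing can be parsed with a small script. The following is a rough sketch in Python, not a definitive tool. It assumes the default `nm` text layout shown above (`<address> <type> <name>`, or `<type> <name>` for undefined symbols) and that an `nm` binary is on PATH; the helper names are hypothetical:

```python
import subprocess

BIAS = "__llvm_profile_counter_bias"


def symbol_types(nm_output, name=BIAS):
    """Return the nm type letter of every entry whose symbol is exactly `name`."""
    types = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Exact match on the last field, so "<name>_default" is not counted.
        if len(fields) >= 2 and fields[-1] == name:
            types.append(fields[-2])
    return types


def has_definition(nm_output, name=BIAS):
    """True if `name` is listed with a type other than undefined (U, w, v)."""
    return any(t not in ("U", "w", "v") for t in symbol_types(nm_output, name))


def check_file(path):
    """Run nm on one library or object file and look for the bias symbol."""
    result = subprocess.run(["nm", path], capture_output=True, text=True, check=True)
    return has_definition(result.stdout)
```

Running `check_file` over the intermediate object files and archives as well as the final .so files might help narrow down the build step after which the symbol no longer appears.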
Questions
- Can I safely ignore this check?

  if (UseBiasVar && BiasAddr == BiasDefaultAddr) { PROF_ERR("%s\n", "__llvm_profile_counter_bias is undefined"); return; }

  Since mmapForContinuousMode should eventually initialize the __llvm_profile_counter_bias symbol, would ignoring this check resolve the issue?
- What scenarios might cause __llvm_profile_counter_bias to be undefined?
  - From my understanding, if counters are being inserted (over 100k in this case), the compiler should have inserted __llvm_profile_counter_bias. Could this issue arise due to its nature as a weak global symbol and loss during linking?
Any insights or hints on debugging this issue further would be greatly appreciated. Thank you!