tcmalloc bug when it handles non-sequential CPUs #188

Closed

jakewang-stripe opened this issue Jun 9, 2023 · 18 comments
@jakewang-stripe

We opened an issue with envoyproxy (envoyproxy/envoy#27775) about Envoy crashing while validating its bootstrap config.

Call stack:

[external/envoy/source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x0
[external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[external/envoy/source/server/backtrace.h:92] Envoy version: 2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL
[external/envoy/source/server/backtrace.h:96] #0: __restore_rt [0x7f26c19d1420]
[external/envoy/source/server/backtrace.h:96] #1: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Refill() [0x55cf3f28ce2a]
[external/envoy/source/server/backtrace.h:96] #2: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Allocate<>()::Helper::Underflow() [0x55cf3f28df77]
[external/envoy/source/server/backtrace.h:96] #3: Envoy::Api::ValidationImpl::allocateDispatcher() [0x55cf3dd5b3de]
[external/envoy/source/server/backtrace.h:96] #4: Envoy::Server::ValidationInstance::ValidationInstance() [0x55cf3dd4b52f]
[external/envoy/source/server/backtrace.h:96] #5: Envoy::Server::validateConfig() [0x55cf3dd4aa65]
[external/envoy/source/server/backtrace.h:96] #6: Envoy::MainCommonBase::run() [0x55cf3dd11370]
[external/envoy/source/server/backtrace.h:96] #7: Envoy::MainCommon::main() [0x55cf3dd11a7d]
[external/envoy/source/server/backtrace.h:96] #8: main [0x55cf3dd0da4a]
[external/envoy/source/server/backtrace.h:96] #9: __libc_start_main [0x7f26c17ef083]

We found a potential root cause: a tcmalloc bug where it is unable to handle non-sequential online CPUs. The segfaults happen on EC2 instances with Nitro Enclaves enabled, so some CPUs are hot-unplugged, e.g.:

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0,2-8,10-15
Off-line CPU(s) list: 1,9

And the theory is that tcmalloc uses the CPU's ID to index into the per-CPU arrays that hold the per-CPU data structures. If tcmalloc allocates 14 entries because ncpu is 14, but the highest online CPU ID is 15, then its array access is out of bounds.
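A minimal sketch of that theory (illustrative only; the std::vector sizing and the sched_getcpu() indexing are stand-ins, not tcmalloc's actual code):

#include <sched.h>
#include <unistd.h>

#include <cstdio>
#include <vector>

int main() {
  // The per-CPU array is sized from the *count* of online CPUs,
  // which is what sysconf()/NumCPUs() report (14 on the box above).
  long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
  std::vector<int> per_cpu_cache(ncpu);

  // ...but it is indexed by the raw CPU *ID* the kernel reports for
  // this thread, which can be 15 on the box above.
  int cpu = sched_getcpu();
  if (cpu >= ncpu) {
    std::printf("CPU ID %d >= ncpu %ld: index would be out of bounds\n", cpu, ncpu);
    return 1;
  }
  per_cpu_cache[cpu]++;  // without the check above, this can be OOB
  return 0;
}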

Can you confirm whether that's the actual root cause, and whether it has been fixed by any commit?

Thanks

@jakewang-stripe
Author

jakewang-stripe commented Jun 9, 2023

We're using Envoy version 1.24, which uses a tcmalloc version from 2022-08-06.

@ckennelly
Collaborator

I believe this is a bug in absl::base_internal::NumCPUs(), which is where TCMalloc gets its CPU count from: https://github.com/abseil/abseil-cpp/blob/master/absl/base/internal/sysinfo.cc#L128

@clundquist-stripe

@ckennelly It looks like Abseil ends up calling:

https://man7.org/linux/man-pages/man3/sysconf.3.html

  • _SC_NPROCESSORS_ONLN
    The number of processors currently online (available).
    See also get_nprocs_conf(3).

Which is reasonable at first glance, and changing that behavior would probably break a lot of things.
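For illustration, a quick way to see the two counts diverge on a box with offline CPUs (a sketch; the expected output assumes the lscpu listing above):

#include <unistd.h>

#include <cstdio>

int main() {
  // Processors currently online -- what Abseil's NumCPUs() reports.
  long onln = sysconf(_SC_NPROCESSORS_ONLN);
  // Processors configured, whether online or not.
  long conf = sysconf(_SC_NPROCESSORS_CONF);
  // On the instance above this should print "online: 14, configured: 16".
  std::printf("online: %ld, configured: %ld\n", onln, conf);
  return 0;
}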

Restating the issue: the set of online CPUs is discontinuous, and we have holes in it. From my quick read of this loop,

for (int cpu = 0; cpu < absl::base_internal::NumCPUs(); ++cpu) {
  if (!NeedCpu(cpu, target)) {
    // unnecessary -- user doesn't care about synchronization on this cpu
    continue;
  }
  // ... (rest of the loop body)
}
this for loop will not reach the final few CPUs when we have holes in our set.

i.e. if we have CPUs [0,2,4,6] online, it seems like we'd only try to initialize 0-3 with our for loop. It looks like we'd abandon CPUs [1,3] after trying to pin there and failing, but I don't see a code path that would attempt other CPU numbers.

Admittedly I'm just reading the code now, so I could totally be missing another path!

We're happy to open an issue on the Abseil side, but I don't think they'll change things there, since they are specifically querying for the online CPU count.

What's the best path forward here?

@junyer
Contributor

junyer commented Jun 13, 2023

We're happy to open an issue on the Abseil side, but I don't think they'll change things there, since they are specifically querying for the online CPU count.

It's worth filing a bug, IMO, if only to signal that it's a real problem affecting real users. (Ping, @jyknight, because reasons.)

@ravenblackx

We're happy to open an issue on the Abseil side, but I don't think they'll change things there, since they are specifically querying for the online CPU count.

Yeah, I think "Number of logical processors" (in the description of NumCPUs) implies that it's intentionally returning only the number of online CPUs.

Maybe they'd be interested in adding a function that returns the other value (e.g. _SC_NPROCESSORS_CONF), but I don't think it's a bug in absl so much as an omission (and then still a bug in tcmalloc, populating a value that is expected to represent the number of physical cores with a value that actually represents the number of logical ones).

@clundquist-stripe

I don't think it's a bug in absl so much as an omission (and then still a bug in tcmalloc, populating a value that is expected to represent the number of physical cores with a value that actually represents the number of logical ones).

Yes, that's my outside perspective; however, each of you knows this code much better than I do.

Would writing a test similar to what I have below be a reasonable start?

i.e. if we have CPUs [0,2,4,6] online, it seems like we'd only try to initialize 0-3 with our for loop. It looks like we'd abandon CPUs [1,3] after trying to pin there and failing, but I don't see a code path that would attempt other CPU numbers

I've been out of the C++ game for a bit, and suspect someone here knows the desired style better. (I could probably write it, but it would probably be ugly, and we'd probably have code contribution agreement hoops)

Once we have a branch with a failing test, it seems like we can probably find a good fix.
Would having a specific owner help?

@ckennelly
Collaborator

@clundquist-stripe: SlowFence should only be used under fairly rare circumstances: we need to lack MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, which first appeared in Linux 5.10. Plumbing things so that the fallback path only considers populated (in CpuCache terminology) cores would be a bit involved, since that state would need to be passed around.

@clundquist-stripe

clundquist-stripe commented Jun 14, 2023

We should have that capability then; we're on roughly 5.15.0-1037-aws (https://packages.ubuntu.com/focal/linux-aws) across amd64/aarch64. I think we've only seen this on amd64 boxes so far, though [but I don't think we have attempted aarch64 reproductions].

I don't know the tcmalloc source, so I could have easily chased the wrong lead!
As Jake mentioned, we can reproduce this pretty consistently with Envoy on hosts that also use Nitro Enclaves (which mark CPUs offline to the host and use them to operate the enclave).

Our mitigation is taskset-ing Envoy (our TCMalloc user here) to specific CPUs that we know to be online, using a bash one-liner. This "works", but we'd hit the same snag for anything else that uses TCMalloc.

I don't think we can post a coredump here, but is there anything else we can help provide to debug this issue?
We'd be happy to set up a pairing session if that could be helpful!

Email-wise I'm clundquist@stripe.com, if it helps!

@ckennelly
Collaborator

I think the underlying problems are around NumCPUs(). We index into an array of that size (allocated by TcmallocSlab::Init, reallocated periodically by ResizeSlabs), so if we can get a higher CPU number than what NumCPUs() returned, the "cache" access will be out of bounds when we try to Push and Pop. At best, this is some memory that isn't being used for anything else (so full of zeroes); at worst, it's memory being used for another piece of TCMalloc metadata.

To separate out that issue, if you replaced NumCPUs() everywhere in TCMalloc with a hard-coded, implausibly high number of cores, do other issues manifest?

@clundquist-stripe

I think the underlying problems are around NumCPUs(). We index into an array of that size ... so if we might get a higher CPU number than what NumCPUs() returned

Yes, that matches my understanding.

i.e. if we have CPUs [0,2,4,6] online, it seems like we'd only try to initialize 0-3 with our for loop. It looks like we'd abandon CPUs [1,3] after trying to pin there and failing, but I don't see a code path that would attempt other CPU numbers

I very likely linked/found the wrong snippet of code though.

To separate out that issue, if you replaced NumCPUs() everywhere in TCMalloc with a hard-coded, implausibly high number of cores, do other issues manifest?

This would take us a while to test, since we'd have to rebuild Envoy.
We can poke at that, but I suspect if we choose a number wisely (like 512) it would "fix" it.

If it did "fix" it, what would be the end game though?
I don't think Envoy wants to fork/patch TCMalloc.

@ckennelly
Collaborator

Endgame would be to modify the implementation of absl::base_internal::NumCPUs(). I can discuss it a bit with the Abseil folks over internal channels as well.
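For what it's worth, a hedged sketch of the direction such a change could take: sizing by one past the highest possible CPU ID instead of the online count. MaxPossibleCpus() is a hypothetical name, and parsing /sys/devices/system/cpu/possible this way is an assumption, not Abseil's actual implementation:

#include <unistd.h>

#include <cstdio>
#include <cstdlib>

// Hypothetical: return one past the highest possible CPU ID, so that
// per-CPU arrays stay safe to index by raw kernel-reported CPU IDs.
int MaxPossibleCpus() {
  // The file holds a range list such as "0-15" or "0,2-8,10-15".
  std::FILE* f = std::fopen("/sys/devices/system/cpu/possible", "r");
  if (f == nullptr) {
    // Fall back to the online count if sysfs is unavailable.
    return static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
  }
  char buf[64] = {0};
  if (std::fgets(buf, sizeof(buf), f) == nullptr) buf[0] = '0';
  std::fclose(f);
  // The highest ID is the last number in the list, e.g. "0-15" -> 15.
  int max_id = std::atoi(buf);
  for (const char* p = buf; *p != '\0'; ++p) {
    if (*p == '-' || *p == ',') max_id = std::atoi(p + 1);
  }
  return max_id + 1;  // IDs are zero-based, so the count is max ID + 1.
}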

@jfernandez

jfernandez commented Jun 16, 2023

This is impacting Envoy when running inside a container that uses lxcfs to mount /sys/devices/system/cpu/online and has a constrained cpuset.

What is the CPU count used for in tcmalloc? Our workaround is to let the container read the host's /sys/devices/system/cpu/online, which can report up to 96 CPUs.

@clundquist-stripe

clundquist-stripe commented Jun 16, 2023

What is the CPU count used for in tcmalloc?

TCMalloc is "Thread Caching malloc".

It is used to initialize a per-CPU slab in user space to allocate from, for better cache coherency.

@jfernandez

It is used to initialize a per-CPU slab in user space to allocate from, for better cache coherency.

Are you aware of any potential downsides or drawbacks of allowing TCMalloc within a container to believe it has the CPU count of the host?

@clundquist-stripe

I can only speculate, since I don't know the TCMalloc code too well, but having written similar things for games: the downside would be slightly higher overhead when initializing structures, and less efficient heap usage.

In practice, it probably won't matter unless you have some pathological worst case.

@ckennelly
Collaborator

If the rseq syscall is available, TCMalloc uses per-CPU caches and does not use its eponymous thread caches.

The CPU count returned by NumCPUs() has to be consistent with the CPU IDs provided by the kernel to userspace. If the kernel provides us a CPU ID that's out of bounds, data corruption can/will occur.

@clundquist-stripe

@jfernandez I think what @ckennelly is getting at is that it could still be an issue, since the CPU IDs may not line up if there are holes in the CPU list; having more CPUs just reduces the chances.

We used something like this, which seemed to work for us:

taskset --cpu-list $(lscpu --online -p=core | grep -v '#' | sort | uniq | tr '\n' ',' | sed 's/,$/\n/') /path/to/envoy
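A slightly simpler variant of the same mitigation, assuming /sys/devices/system/cpu/online is readable (it already holds the online list in taskset's --cpu-list format):

# The sysfs file contains e.g. "0,2-8,10-15", which taskset accepts directly.
taskset --cpu-list "$(cat /sys/devices/system/cpu/online)" /path/to/envoy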

@clundquist-stripe

@ckennelly any traction here with the Abseil folks?

copybara-service bot pushed a commit that referenced this issue Jul 11, 2023
PiperOrigin-RevId: 547256513
Change-Id: I44c42b154241bbbba21efa0cce1e8e4bc0f6a625
copybara-service bot pushed a commit that referenced this issue Jul 24, 2023
PiperOrigin-RevId: 550628877
Change-Id: Id6e6bb6fe8148a692a0376aeb0d0172a9dc74038