Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
runtime: concurrent object placement on heap causes Linux kernel contention #33092
This applies to newer Go versions on any modern Linux kernel, but in this case:
go version go1.11.5 linux/amd64
If I allocate lots of small objects in parallel from many concurrent routines, they are all placed on the heap linearly, therefore each of the Go threads waiting for Linux pagefaults - which end up becoming more and more expensive because of spinlocking in that path.
In other cases of highly concurrent programming developers rely on allocators like jemalloc to do scalable placement of objects - that option does not exist with Go heap.
In some cases developers chose to have a background page that touches/cleans the pages before giving them to concurrent routines.
Having some spread in object placement would be a useful feature in the runtime.
A very basic program I have written to illustrate this in Go spends 50s in kernel while only 3s in userland, mostly from inflated costs in faulting:
perf report shows that most of the cost from running my program is in page fault spinlocks:
Source code for it:
Setting GOMAXPROCS=1 shows how little amount of work kernel has to do without contention:
Maybe I don't understand what you mean by "spread in object placement", but for a long time Go has placed objects in separate pages depending on which P ("processor") is allocating, and it's able to do this without any locking. This is similar to TCMalloc's design. Allocating pages requires a lock, but small object allocation already scales fairly well.
Furthermore, I'm not actually able to reproduce the exact behavior you're seeing on your machine with your example program. I get:
It's true there are many more page faults, and significantly more time is spent in the kernel, but overall elapsed time still improves.
GODEBUG=gctrace=1 on the other hand indicates that these page faults are coming from the heap growing and touching many new pages.
When we don't set GOMAXPROCS=1, the runtime assumes the heap just keeps growing. There are few GCs, but the heap grows to almost 400 MiB which is the cause of all those page faults (AFAICT).
When we set GOMAXPROCS=1, our heap stays small, and we end up re-using a lot of memory and don't have to fault in new pages quite so often.
For comparison, these are the parameters of the machine I was using:
go version go1.12.5 linux/amd64
There may be a legitimate bug here (I need to think more about the sample program), but the bug is likely not allocator scalability.
Weird, to have some common ground I set up a google cloud instance (n1-standard-64), downloaded latest stable golang:
go version go1.12.7 linux/amd64
Even named the binary 'thing':
domas_mituzas@my-vm:~/thing$ time ./thing
And indeed, allocator scalability itself is fantastic, which is why it slams into kernel scalability. I did not look much at how Go places objects - if really objects are placed into separate blocks, then it is hitting another page table contention that may be harder to work around (multiple mappings could do, though).
I have a very basic C testcase that shows how this breaks inside kernel:
Let me check the properties of random arena access vs linear.
update with latest stable kernel (5.2.1) and go1.12.7:
random access to same gigabyte does not show same contention signature:
meanwhile, single thread allocs seem to be much much more efficient:
(I'm having all the threads randomly walk the same memory, instead of linearly, with some adjustments to the .c above).
So, there's definitely ~25x pagefaulting cost increase in random allocs (very little of that in actual spinlocking), but the linear edge case shows ~1700x pagefaulting cost increase, mostly in spinlocks, which I trigger with my Go testcase too (which indicates that the page gets faulted from multiple threads at once).
By changing my C test to linearly allocate different maps in each thread I can see that the scaling behavior is very similar to random-order allocation - increased costs are from flush_tlb_mm_range, which would be hard to avoid, and down_read_trylock.
I'm really waving my hands here, but from what I see - the profiles of go testcase are indicative of pagefaults hitting same page and a spinlock, rather than TLB flushing.
In some of my other tests I ended up hitting runtime.(*mcentral).cacheSpan contention (simply because I was allocating lots of large objects).
The main way how I hit this pagefault issue is with fast mapassign functions, and this code in concurrent fashion will hit high % in kernel spinlocks:
i can reproduce this on my machine: centos 7.3.1611, and the results are weird :