Mismatched calls to VirtualAlloc/VirtualFree #213
Comments
FWIW, I tested this and it seems to work, although it's really horrible: https://hg.mozilla.org/try/diff/aa0824a23885/memory/jemalloc/src/src/chunk_mmap.c
I think Windows should just use the same defaults as other platforms. The key part is teaching jemalloc how to drop commit charge and then making that the default on platforms without overcommit, but with a way to disable it. It doesn't need to unmap any more than Linux does.
The reason jemalloc doesn't unmap on Linux is because unmapping led to bad things in the kernel VM, not because it's not necessary. It is necessary. The malloc implementation is not the only thing that can map memory (GPU drivers will, for example), and keeping mappings indefinitely can exhaust address space for those other uses. So, no, I don't think making all platforms default to not unmap is the solution.
No, not really. It's true that
It's not using an ever increasing amount of virtual memory. Fragmentation is even more of an issue if memory is being unmapped too, especially if an allocator knows how to reserve lots of space up-front. On 64-bit, there's 128TiB of address space... so even reserving a terabyte up-front is not a big deal and will significantly improve performance by reducing TLB misses and facilitating reuse of pages. It makes sense for the allocator to be slower on 32-bit where address space isn't plentiful, but unmapping / not reserving up-front is a pointless pessimization on 64-bit on Linux or Windows.
Yes, really. What you say is right, but has nothing to do with why jemalloc doesn't unmap. The reason for that is written clearly in INSTALL: https://github.com/jemalloc/jemalloc/blob/dev/INSTALL#L142
Bigger page size reduces TLB misses, but I doubt mmapping a huge part of the address space actually makes a difference there. Pages are still individually allocated and accounted for. Contiguity might help.
MADV_FREE is also something that's not even in the Linux kernel yet, and it's not clear to me that it is faster on the systems where it does exist (it doesn't look like it is on OSX; I don't know about FreeBSD). But yes, keeping address space on 64-bit is less of a problem. But I'm not looking for a solution for a future 64-bit Linux system.
No, and that's the reason I already gave above... It can only allocate at the bottom of the mmap in
It certainly makes a difference. The CPU is much better at caching pages that are closer together, even without the involvement of huge pages. The chunk size is smaller than the huge page size now, so transparent huge pages are yet another reason to pack them together. Reserving memory also avoids unusable gaps caused by other users of the address space. The major win from reserving memory is that it can be partitioned between the arenas. This means chunk / huge allocation can be fully parallel without fragmenting the address space between arenas or using any atomic operations beyond the arena locks.
It is significantly faster on the systems where it does exist. You were measuring with a program without lots of concurrency or huge swings in memory usage. Firefox doesn't even use thread caching and has a single arena... along with costs like junk filling on free. It's not a good way to measure jemalloc performance at all. MADV_FREE is currently in linux-next and the numbers prove that it's a big deal. For one thing, calls like
It's not less of a problem but rather a complete non-issue on today's hardware with 47-bit address ranges. Again, not unmapping does not cause a boundless increase in virtual memory usage by the allocator, as you claim. In fact, by not reserving ranges up-front there will be fragmentation that causes the peak virtual memory usage to be significantly higher. There's also nothing Linux-specific about this. Large ranges of address space can be reserved on Windows too, as long as the allocator knows how to toggle ranges of committed memory rather than just using the cheaper lazy
Other users of
Since virtual memory is a per-process resource, peak memory usage by the allocator is the important factor on 32-bit. I doubt that unmapping will decrease the peak VM size... if anything it will increase it both due to fragmentation between jemalloc's chunk caching and the unmapped ranges and because of external
And despite all that we do see madvise (yes, madvise, with MADV_FREE, and yes, the madvise call itself) having significant performance impact on some benchmarks on OSX, yet using mmap instead doesn't seem to make a difference, although I haven't looked at an actual profile to see if there's actually a difference. But that's way offtopic for the present issue.
It's not off-topic because this would be solved by using
Heh, that was one of my first proposals (last paragraph in #213 (comment)).
This is where we're disagreeing.
O_o do you mean fragmentation, not peak VM?
I'm saying that it should not unmap chunks, but simply purge them. Purging should also drop commit charge by default on operating systems without overcommit, or when overcommit is disabled on Linux. That means using decommit rather than just purging the page contents.

It would probably be helpful to map in many chunks at the same time now that the chunk size has been dropped quite a bit too. This would cut down on the long-term fragmentation caused by other users of the address space.
No, I mean what I said: opportunistically unmapping memory will increase peak virtual memory in typical long-running processes. Virtual memory fragmentation is a very real issue on 32-bit, especially since jemalloc has a hard requirement that chunks are naturally aligned. It cooperates very poorly with other users of VirtualAlloc.

The 4M chunk size meant there were only 512 valid chunk addresses in a 1:1 split 32-bit OS like Windows, and unmapping ends up leading to other VirtualAlloc users occupying those scarce chunk-aligned addresses. The only saving grace is that unmapping didn't happen in practice... because chunks would rarely completely drain. The significantly smaller chunk size alleviates one part of the problem, but since it makes unmapping much more likely it also makes it a much bigger problem in another way.
VM fragmentation makes peak VM lower, not higher. It makes you possibly OOM with a lower amount of mapped memory.
... except for huge allocations.
That's only the case if you're measuring virtual memory usage by looking at the count of mapped pages via VIRT. Committed memory and commit charge are global resources, and committed pages can't become unavailable due to physical fragmentation, so those statistics are useful. VIRT just tells you the number of mapped pages, which from the allocator's perspective isn't the same as how much is available. Mapping a bit under 512 4k pages spread out across the address space on 32-bit Windows would deplete it from the perspective of the current stable release of jemalloc.

When making a huge allocation, there isn't a substantial difference between virtual memory lost to unused space inside of chunks and virtual memory lost to partially used chunks (due to VirtualAlloc usage elsewhere). There's also the issue of overall VM fragmentation, but that's somewhat distinct from memory that's unusable due to the chunk alignment requirement. For example, you could leave the tail of huge allocations unmapped, but it will only decrease the usable virtual memory in the long run...
Huge allocations used to be very rare though. They're a lot more common now, but unmapping is still going to be very rare if the unmapping granularity is significantly larger than the chunk size. Nearly all chunks in typical applications are used for non-huge allocations.
@jasone, ideas?
I'm more or less on vacation until the end of April, so I'm not going to have the chance to dig into this for a while. I suspect that we're going to need to add additional logic to control whether adjacent mappings can be coalesced, but I haven't considered the various possibilities in enough detail yet to have good intuition regarding what is most likely to work well. |
@jasone, ping |
@glandium, I got a MinGW-w64 environment working yesterday, and I started cleaning up the ridiculous number of compilation warnings. Unfortunately I have a bunch of other work tasks that are likely to take up most of my time over the next few weeks, but I am trending in the direction of this issue and others that are blocking the 4.0.0 release. |
Can confirm this bug exists, and it makes jemalloc pretty much unusable in its current state.
I imagine that Any advice for building under
I did a workaround locally. Modify function
I THINK it should work in every case, but it's not optimal, for sure.
Bump? |
This issue is on my short list now. I've started working through MinGW/gcc build issues, but I don't know how to build using MSVC yet. |
From what I remember, I had no serious build problems. However, I use a (self-created) MSVC project for the build, as this library is part of a larger MSVC solution, so I mostly ignore everything that's not .h or .c.
But these are mostly small things. I haven't tried all the functionality/configs, but the only bug in the actual code I encountered is the one that spawned this ticket.
I think at two different points somewhere between 3.6 and the current tip, the way chunks are handled has changed in ways that make the use of VirtualAlloc and VirtualFree on Windows problematic.
The way they work, you need to match an addr = VirtualAlloc(addr, size, MEM_RESERVE) with a VirtualFree(addr, 0, MEM_RELEASE) (and the size passed to VirtualFree has to be 0, not the reserved size).
The problem is that while before we wouldn't end up calling pages_unmap with values of addr and size that don't match a previous pages_map, we now do. Essentially, we end up allocating multiple chunks in one go, and deallocating parts of them (leading or trailing) independently.
So in practice, we're doing things like this, for example:
which actually does:
and that does nothing, which is wasteful, but not entirely problematic (or aborts with --enable-debug, which is)
But worse, we're also doing:
which actually does
and that blows things away, since it releases the 6 chunks when we expect the remaining 4 to still be around. This was definitely made worse by the decrease in chunk size, which made it happen more often (it seems it was not happening before, but it might actually be the cause of the random crashes we're seeing).
I've attempted to work around this by making chunksize the canonical size we always VirtualAlloc in the end, but that likely adds a lot of overhead. At least it removes the crashes I'm seeing with the current tip, but it feels like we need something better than this.
I was thinking of maybe having bigger meta-chunks, making pages_map MEM_COMMIT and pages_unmap MEM_DECOMMIT ranges within them, and having a meta-chunk released when it is entirely decommitted (which, OTOH, requires some extra metadata, and that adds a chicken-and-egg problem: AIUI, even the base allocator uses the chunk allocation code and ends up in pages_map).
Thoughts?