
Mismatched calls to VirtualAlloc/VirtualFree #213

Closed
glandium opened this issue Mar 18, 2015 · 24 comments

@glandium
Contributor

I think at two different points somewhere between 3.6 and the current tip, the way chunks are handled has changed in ways that make the use of VirtualAlloc and VirtualFree on Windows problematic.

The way they work, you need to match an addr = VirtualAlloc(addr, size, MEM_RESERVE) with a VirtualFree(addr, 0, MEM_RELEASE) (and the size argument has to be 0, not size).
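
For reference, a minimal standalone sketch of that contract (plain Win32, not jemalloc code; error handling omitted):

    #include <windows.h>

    static void reserve_release_example(SIZE_T size) {
        /* The base address returned by the reserving VirtualAlloc is the
           only address that can later be passed with MEM_RELEASE. */
        void *addr = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
        if (addr != NULL) {
            /* dwSize must be 0 with MEM_RELEASE; passing the actual size,
               or an interior address, makes the call fail. */
            VirtualFree(addr, 0, MEM_RELEASE);
        }
    }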

The problem is that while we previously would never end up calling pages_unmap with values of addr and size that don't match a previous pages_map, we now do. Essentially, we end up allocating multiple chunks in one go, and deallocating parts of them (leading or trailing) independently.

So in practice, we're doing things like this, for example:

addr = pages_map(NULL, 6 * chunksize);
pages_unmap(addr + 4 * chunksize, 2 * chunksize);

which actually does:

addr = VirtualAlloc(NULL, 6 * chunksize, MEM_RESERVE);
VirtualFree(addr + 4 * chunksize, 0, MEM_RELEASE);

and that VirtualFree does nothing (it fails because the address is not an allocation base), which is wasteful, but not entirely problematic (or aborts with --enable-debug, which is).

But worse, we're also doing:

addr = pages_map(NULL, 6 * chunksize);
pages_unmap(addr, 2 * chunksize);

which actually does:

addr = VirtualAlloc(NULL, 6 * chunksize, MEM_RESERVE);
VirtualFree(addr, 0, MEM_RELEASE);

and that blows things away, since it releases all 6 chunks when we expect the remaining 4 to still be around. This was definitely made worse by the decrease in chunk size, which made it happen more often (it seems it was not happening before, but it might actually be the cause of the random crashes we're seeing).

I've attempted to work around this by making chunksize the canonical size we always VirtualAlloc in the end, but that likely adds a lot of overhead. At least it removes the crashes I'm seeing with the current tip, but it feels like we need something better than this.

I was thinking of maybe having bigger meta-chunks, making pages_map MEM_COMMIT and pages_unmap MEM_DECOMMIT ranges in there, and having the meta-chunks released when they are entirely decommitted. OTOH, that requires some extra metadata, which adds a chicken-and-egg problem: AIUI, even the base allocator uses the chunk allocation code and ends up in pages_map. Something along these lines (a rough sketch only; the names, the meta-chunk size, and the single-reservation simplification are all made up for illustration):
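
    #include <windows.h>

    #define METACHUNK_SIZE ((size_t)64 * 1024 * 1024) /* made-up size */

    static void *metachunk_base; /* a single reservation, for simplicity */

    static void *meta_pages_map(size_t offset, size_t size) {
        if (metachunk_base == NULL) {
            metachunk_base = VirtualAlloc(NULL, METACHUNK_SIZE,
                MEM_RESERVE, PAGE_NOACCESS);
            if (metachunk_base == NULL)
                return (NULL);
        }
        /* Commit a sub-range of the reservation; no new allocation base
           is created, so there is nothing to mismatch later. */
        return (VirtualAlloc((char *)metachunk_base + offset, size,
            MEM_COMMIT, PAGE_READWRITE));
    }

    static void meta_pages_unmap(void *addr, size_t size) {
        /* Decommit rather than release; the reservation stays intact. */
        VirtualFree(addr, size, MEM_DECOMMIT);
        /* Once the bookkeeping (not shown) says the whole meta-chunk is
           decommitted, it can be released at its base:
           VirtualFree(metachunk_base, 0, MEM_RELEASE); */
    }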

Thoughts?

@glandium
Contributor Author

FWIW, I tested this and it seems to work, although it's really horrible: https://hg.mozilla.org/try/diff/aa0824a23885/memory/jemalloc/src/src/chunk_mmap.c

@thestinger
Contributor

I think Windows should just use the same defaults as other platforms. The key part is teaching jemalloc how to drop commit charge and then making that the default on platforms without overcommit, but with a way to disable it. It doesn't need to unmap any more than Linux does.

@glandium
Contributor Author

The reason jemalloc doesn't unmap on Linux is because unmapping led to bad things in the kernel VM, not because it's not necessary. It is necessary. The malloc implementation is not the only thing that can map memory (GPU drivers will, for example), and keeping mappings indefinitely can exhaust address space for those other uses. So, no, I don't think making all platforms default to not unmap is the solution.

@thestinger
Contributor

The reason jemalloc doesn't unmap on Linux is because unmapping led to bad things in the kernel VM, not because it's not necessary.

No, not really. It's true that mmap has an O(n) worst case on Linux once there's no longer a large enough gap at the bottom of the lowest mmap allocation. This could be fixed if someone really cared, and the other operations are O(log n) or O(1). It's definitely not the full rationale for not unmapping, though.

It is necessary. The malloc implementation is not the only thing that can map memory (GPU drivers will, for example), and keeping mappings indefinitely can exhaust address space for those other uses.

It's not using an ever-increasing amount of virtual memory. Fragmentation is even more of an issue if memory is being unmapped too, especially if an allocator knows how to reserve lots of space up-front. On 64-bit, there's 128TiB of address space... so even reserving a terabyte up-front is not a big deal and will significantly improve performance by reducing TLB misses and facilitating reuse of pages. MADV_FREE is way faster than unmapping, so there's a pretty clear reason not to do it when you don't need to reduce commit charge. Even when you do need to reduce commit charge, it's still faster to hang onto the address space.

It makes sense for the allocator to be slower on 32-bit where address space isn't plentiful, but unmapping / not reserving up-front is a pointless pessimization on 64-bit on Linux or Windows.

@glandium
Contributor Author

No, not really.

Yes, really. What you say is right, but has nothing to do with why jemalloc doesn't unmap. The reason for that is spelled out in INSTALL: https://github.com/jemalloc/jemalloc/blob/dev/INSTALL#L142
See also 7ca0fdf

so even reserving a terabyte up-front is not a big deal and will significantly improve performance by reducing TLB misses and facilitating reuse of pages.

Bigger page size reduces TLB misses, but I doubt mmapping a huge part of the address space actually makes a difference there. Pages are still individually allocated and accounted for. Contiguity might help.

MADV_FREE is way faster than unmapping.

MADV_FREE is also something that's not even in the Linux kernel yet, and it's not clear to me that it's faster on the systems where it does exist (it doesn't look like it is for OSX; I don't know about FreeBSD).

But yes, keeping address space on 64-bit is less of a problem. I'm not looking for a solution for a future 64-bit Linux system, though.

@thestinger
Contributor

Yes, really. What you say is right, but has nothing to do with why jemalloc doesn't unmap. The reason for that is spelled out in INSTALL: https://github.com/jemalloc/jemalloc/blob/dev/INSTALL#L142

No, and that's the reason I already gave above... It can only allocate at the bottom of the mmap in O(log n) time and has to fall back to O(n) searches to find gaps in other places. It's not the only reason for avoiding unmapping even if it was the reason that drove the initial decision. It works this way because it only has a red-black tree keyed by addresses so it can't efficiently search for gaps based on sizes. It's guaranteed to find them if the efficient search path fails though. This could be fixed if it was considered a big enough problem.

Bigger page size reduces TLB misses, but I doubt mmapping a huge part of the address space actually makes a difference there. Pages are still individually allocated and accounted for. Contiguity might help.

It certainly makes a difference. The CPU is much better at caching pages that are closer together, even without the involvement of huge pages. The chunk size is smaller than the huge page size now, so transparent huge pages are yet another reason to pack them together. Reserving memory also avoids unusable gaps caused by other users of mmap.

The major win from reserving memory is that it can be partitioned between the arenas. This means chunk / huge allocation can be fully parallel without fragmenting the address space between arenas or using any atomic operations beyond the arena locks.

MADV_FREE is also something that's not even in the Linux kernel yet, and it's not clear to me that it's faster on the systems where it does exist (it doesn't look like it is for OSX; I don't know about FreeBSD).

It is significantly faster on the systems where it does exist. You were measuring with a program without lots of concurrency or huge swings in memory usage. Firefox doesn't even use thread caching and has a single arena... along with costs like junk filling on free. It's not a good way to measure jemalloc performance at all. MADV_FREE is currently in linux-next and the numbers prove that it's a big deal.

For one thing, calls like mmap, munmap and mprotect need to grab a global writer lock (mmap_sem) to modify the virtual memory data structures. They are inherently serial operations and do not scale. That's true on every OS that I've looked into, not just Linux. On the other hand, page faults and purging with MADV_DONTNEED / MADV_FREE is much more concurrent as it can use the reader lock. It also avoids TLB flushes and MADV_FREE avoids the usual mass zeroing unless there's memory pressure.
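
For concreteness, purging on Linux looks roughly like this (a sketch; the function name is made up, and MADV_FREE is only available once the linux-next patches land):

    #include <sys/mman.h>

    static void pages_purge_sketch(void *addr, size_t length) {
    #ifdef MADV_FREE
        /* Lazy: pages are only reclaimed (and zeroed) under memory
           pressure, so repeated purge/reuse cycles stay cheap. */
        if (madvise(addr, length, MADV_FREE) == 0)
            return;
    #endif
        /* Fallback: discards page contents immediately; the next touch
           faults in zeroed pages. */
        madvise(addr, length, MADV_DONTNEED);
    }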

But yes, keeping address space on 64-bit is less of a problem. I'm not looking for a solution for a future 64-bit Linux system, though.

It's not less of a problem but rather a complete non-issue on today's hardware with 47-bit address ranges. Again, not unmapping does not cause a boundless increase in virtual memory usage by the allocator as you claim. In fact, by not reserving ranges up-front there will be fragmentation that causes the peak virtual memory usage to be significantly higher.

There's also nothing Linux-specific about this. Large ranges of address space can be reserved on Windows too, as long as the allocator knows how to toggle ranges of committed memory rather than just using the cheaper lazy MEM_RESET purging.

@thestinger
Contributor

Other users of mmap generally don't need more than page alignment so they end up creating unaligned gaps that jemalloc can never use. For example, I think stdio allocates some buffers with mmap and a 4k allocation ends up wasting the entire chunk from jemalloc's perspective. This will also tend to force the allocator onto the slow path since the returned memory won't always be chunk aligned. The per-arena chunk allocation also adds a lot of virtual memory fragmentation if ranges aren't reserved along with missing out on a great optimization opportunity.

@thestinger
Contributor

Since virtual memory is a per-process resource, peak memory usage by the allocator is the important factor on 32-bit. I doubt that unmapping will decrease the peak VM size... if anything it will increase it both due to fragmentation between jemalloc's chunk caching and the unmapped ranges and because of external mmap users. Commit charge is a global resource on systems not using overcommit but dropping commit charge doesn't require unmapping.

@glandium
Contributor Author

It is significantly faster on the systems where it does exist. You were measuring with a program without lots of concurrency or huge swings in memory usage. Firefox doesn't even use thread caching and has a single arena... along with costs like junk filling on free.

And despite all that, we do see madvise (yes, madvise, with MADV_FREE, and yes, the madvise call itself) having a significant performance impact on some benchmarks on OSX, yet using mmap instead doesn't seem to make a difference, although I haven't looked at an actual profile to see if there actually is a difference.

But that's way offtopic for the present issue.

@thestinger
Contributor

It's not off-topic because this would be solved by using MEM_COMMIT / MEM_DECOMMIT rather than mapping ranges in and out. You're challenging whether that's a viable thing to do and I'm responding to it. I don't think lowering virtual memory usage when the application's memory usage drops is worth doing. It's definitely important to minimize peak virtual memory usage on 32-bit, but attempting to drop it back down actually ends up making the peak VM higher in most cases.

@glandium
Contributor Author

It's not off-topic because this would be solved by using MEM_COMMIT / MEM_DECOMMIT rather than mapping ranges in and out. You're challenging whether that's a viable thing to do.

Heh, that was one of my first proposals (last paragraph in #213 (comment) ).

I don't think lowering virtual memory usage when the application's memory usage drops is worth doing.

This is where we're disagreeing.

It's definitely important to minimize peak virtual memory usage on 32-bit, but attempting to drop it back down actually ends up making the peak VM higher in most cases.

O_o do you mean fragmentation, not peak VM?

@thestinger
Contributor

Heh, that was one of my first proposals (last paragraph in #213 (comment) ).

I'm saying that it should not unmap chunks, but simply purge them. Purging should also drop commit charge by default on operating systems without overcommit, or when overcommit is disabled on Linux. That means using MEM_DECOMMIT / MEM_COMMIT on Windows for both purging inside chunks and purging at the chunk level, unless it's specifically turned off via a tunable. The equivalent on Linux is toggling PROT_NONE on and off, and as long as memory isn't being regularly mapped and unmapped, the creation of extra VMAs isn't really important.
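
Concretely, the default purge path I'm describing would look something like this (a rough sketch; the function names are illustrative, not jemalloc's API):

    #ifdef _WIN32
    #include <windows.h>

    /* Drops commit charge but keeps the range reserved, so no
       allocation base ever disappears out from under the allocator. */
    static void purge_sketch(void *addr, size_t size) {
        VirtualFree(addr, size, MEM_DECOMMIT);
    }

    static int recommit_sketch(void *addr, size_t size) {
        return (VirtualAlloc(addr, size, MEM_COMMIT,
            PAGE_READWRITE) != NULL);
    }
    #else
    #include <sys/mman.h>

    /* The Linux equivalent: PROT_NONE stops the range from counting
       toward commit charge when overcommit is disabled, and
       MADV_DONTNEED returns the pages, all without unmapping. */
    static void purge_sketch(void *addr, size_t size) {
        mprotect(addr, size, PROT_NONE);
        madvise(addr, size, MADV_DONTNEED);
    }

    static int recommit_sketch(void *addr, size_t size) {
        return (mprotect(addr, size, PROT_READ | PROT_WRITE) == 0);
    }
    #endif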

It would probably be helpful to map in many chunks at the same time now that the chunk size has been dropped quite a bit too. This would cut down on the long-term fragmentation caused by other mmap users slicing and dicing the address space.

O_o do you mean fragmentation, not peak VM?

No, I mean what I said: opportunistically unmapping memory will increase peak virtual memory in typical long-running processes. Virtual memory fragmentation is a very real issue on 32-bit, especially since jemalloc has a hard requirement that chunks are naturally aligned. It cooperates very poorly with other users of mmap.

The 4M chunk size meant there were only 512 valid chunk addresses in a 1:1 split 32-bit OS like Windows, and unmapping ends up leading to other VirtualAlloc users making those chunk slots unusable over time. In many cases, jemalloc can't actually fill a gap that's large enough to hold a single unaligned chunk: it will ask for memory equal to the chunk size, which will fail since the result won't be aligned, and then it will increase the request size and end up missing that gap.
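
To illustrate that retry behavior (a simplified, mmap-based sketch of the usual over-allocate-then-trim alignment dance; chunksize must be a power of two, and error handling is mostly omitted):

    #include <stdint.h>
    #include <sys/mman.h>

    static void *chunk_alloc_aligned_sketch(size_t chunksize) {
        /* Optimistic path: a plain allocation of chunksize, usable only
           if the kernel happens to return an aligned address. */
        char *p = mmap(NULL, chunksize, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return (NULL);
        if (((uintptr_t)p & (chunksize - 1)) == 0)
            return (p);
        munmap(p, chunksize);

        /* Retry: over-allocate so an aligned chunk must fit, then trim
           the leading and trailing slop. A free gap of exactly chunksize
           bytes can never satisfy this larger request, which is how
           chunk-sized gaps get skipped. */
        size_t alloc_size = chunksize * 2;
        char *q = mmap(NULL, alloc_size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (q == MAP_FAILED)
            return (NULL);
        size_t lead = (chunksize - ((uintptr_t)q & (chunksize - 1)))
            & (chunksize - 1);
        if (lead != 0)
            munmap(q, lead);
        if (alloc_size - lead - chunksize != 0)
            munmap(q + lead + chunksize, alloc_size - lead - chunksize);
        return (q + lead);
    }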

The only saving grace is that unmapping didn't happen in practice... because chunks would rarely completely drain. The significantly smaller chunk size alleviates one part of the problem but since it makes unmapping much more likely it also makes it a much bigger problem in another way.

@glandium
Contributor Author

No, I mean what I said: opportunistically unmapping memory will increase peak virtual memory in typical long-running processes. Virtual memory fragmentation is a very real issue on 32-bit

VM fragmentation makes peak VM lower, not higher. It makes you possibly OOM with a lower amount of mapped memory.

The only saving grace is that unmapping didn't happen in practice...

... except for huge allocations.

@thestinger
Contributor

VM fragmentation makes peak VM lower, not higher. It makes you possibly OOM with a lower amount of mapped memory.

Only if you're measuring virtual memory usage by looking at the count of mapped pages via VIRT. Committed memory and commit charge are actually global resources, and committed virtual memory means those pages can't become unavailable due to physical fragmentation, so those statistics are useful. VIRT just tells you the number of mapped pages, which from the allocator's perspective isn't the same as how much is available.

Mapping a bit under 512 4k pages spread out across the address space on 32-bit Windows would eventually deplete it from the perspective of the current stable release of jemalloc. When making a huge allocation, there isn't a substantial difference between virtual memory lost to unused space inside of chunks and virtual memory lost to partially used chunks (due to VirtualAlloc usage elsewhere). There's also the issue of overall VM fragmentation, but that's somewhat distinct from memory that's unusable due to the chunk alignment requirement.

For example, you could leave the tail of huge allocations unmapped but it will only decrease the usable virtual memory in the long run...

... except for huge allocations.

Huge allocations used to be very rare though. They're a lot more common now, but unmapping is still going to be very rare if the unmapping granularity is significantly larger than the chunk size. Nearly all chunks in typical applications are used for non-huge allocations.

@glandium
Contributor Author

glandium commented Apr 8, 2015

@jasone, ideas?

@jasone
Member

jasone commented Apr 8, 2015

I'm more or less on vacation until the end of April, so I'm not going to have the chance to dig into this for a while. I suspect that we're going to need to add additional logic to control whether adjacent mappings can be coalesced, but I haven't considered the various possibilities in enough detail yet to have good intuition regarding what is most likely to work well.

@glandium
Contributor Author

@jasone, ping

@jasone
Member

jasone commented May 13, 2015

@glandium, I got a MinGW-w64 environment working yesterday, and I started cleaning up the ridiculous number of compilation warnings. Unfortunately I have a bunch of other work tasks that are likely to take up most of my time over the next few weeks, but I am trending in the direction of this issue and others that are blocking the 4.0.0 release.

jasone added this to the 4.0.0 milestone May 13, 2015
@mlvdm
Contributor

mlvdm commented Jun 24, 2015

Can confirm this bug exists; it makes jemalloc pretty much unusable in its current state.
I worked around it locally by assuming all VirtualAlloc calls are made with chunksize, so I can loop over VirtualFree calls to free blocks of chunksize. I could imagine this breaking when allocating huge blocks via jemalloc (i.e., a request larger than chunksize), so it's probably not a correct solution.

@ariccio

ariccio commented Jun 24, 2015

I imagine that /analyze would pick this up, as it's _very_ good at detecting this kind of misuse (I believe, largely due to the careful annotations in memoryapi.h).

Any advice for building under MSVC?

@mlvdm
Contributor

mlvdm commented Jun 25, 2015

I did a workaround locally.

Modify the function pages_unmap in chunk_mmap.c: replace the block from #ifdef _WIN32 to #else with

#ifdef _WIN32
    MEMORY_BASIC_INFORMATION mbi;
    assert(size >= chunksize);
    if (size == chunksize)
    {
        /* If size is exactly the chunksize, we know for certain we are only dealing with one VirtualAlloc call for the given range */
        if (VirtualFree(addr, 0, MEM_RELEASE) != 0)
        {
            size = 0;
        }
    }
    else while (size)
    {
        /* This is non-optimal, VirtualFree cannot release memory that was allocated with more than one call to VirtualAlloc */
        /* However, when we are passed a range of chunks (ie, more than one call to VirtualAlloc), we need to find out what calls to VirtualAlloc were made */
        /* This information should be stored somewhere (in the chunk?) but I don't want to touch the memory layout etc in there */
        /* As a work-around, we will instead use VirtualQuery to find out the boundaries of each original VirtualAlloc call */
        /* This works because we never touch the allocated chunks other than with MEM_RESET which won't change the page attributes as far as VirtualQuery cares */
        /* However, we pay one additional VirtualQuery syscall per VirtualFree */
        if (VirtualQuery(addr, &mbi, sizeof(mbi)) == 0) break;
        assert(mbi.BaseAddress == addr);
        assert(mbi.AllocationBase == addr);
        assert(mbi.AllocationProtect == PAGE_READWRITE);
        assert(mbi.RegionSize <= size);
        assert(mbi.State == MEM_COMMIT);
        assert(mbi.Type == MEM_PRIVATE);
        if (VirtualFree(addr, 0, MEM_RELEASE) == 0) break;
        size -= mbi.RegionSize;
        addr = mbi.RegionSize + (char *)addr;
    }
    if (size)
#else

I >THINK< it should work in every case, but it's certainly not optimal.
The only assumption I made is that there can never be a pages_map call with size < chunksize.

@ariccio

ariccio commented Jul 8, 2015

Any advice for building under MSVC?

Bump?

@jasone
Member

jasone commented Jul 8, 2015

This issue is on my short list now. I've started working through MinGW/gcc build issues, but I don't know how to build using MSVC yet.

@mlvdm
Contributor

mlvdm commented Jul 9, 2015

From what I remember, I had no serious build problems. However, I use a (self-created) MSVC project for building, as this library is part of a larger MSVC solution, so I mostly ignore everything that's not .h or .c.

  • There is one issue with the constructor and tls callback function pointers being optimized away with certain linker optimizations (even with the force-reference hacks in there), but I think that's because I built as a static library. Guessing it's because they are declared with static linkage. I would imagine the problem goes away if one were to make a dynamic library, but I didn't try that.
  • I manually walked through internal_defs.h and picked appropriate settings, but I haven't tried the make scripts (I have binaries for bash + some utilities on my windows, so I could run the .sh files to generate most headers, but not the make script). This was all one-time for me (the generated configuration headers are now checked in for my project).
  • The thread-local variable access (tsd_) implementation (with GetLastError-TlsGet-SetLastError sequence) is a bit slow, but the one implemented with __thread works fine as well if you #define __thread __declspec(thread)

But these are mostly small things. I haven't tried all the functionality/configs, but the only bug in the actual code I encountered is the one that spawned this ticket (VirtualFree calls don't match VirtualAlloc), and I worked around that one with the hack I posted above. Allocation seems to work otherwise, and performance is competitive with the Win32 heap, although I only did some artificial benchmarks.

jasone closed this as completed in d059b9d Jul 25, 2015