Unusually large (but constant) memory usage with transparent huge pages enabled #11077

Closed
kevinburke opened this Issue Jan 30, 2017 · 15 comments


kevinburke commented Jan 30, 2017

I'm working with a client that runs a simple Node process to consume messages from a queue (NSQ) and send them to a downstream API (over TLS). We are using superagent as an HTTP client. Some of the processes we are running were using over 1.8GB of memory.

After a long investigation we discovered that the memory bloat was because transparent huge pages were enabled. Disabling transparent huge pages immediately took the memory usage down to about 150MB per process. This is most dramatically visualized here:

We're unsure exactly what mechanism was responsible for allocating so much memory, but we took a core dump and noticed about 160,000 copies of the certificate list in memory - specifically, an engineer ran strings | grep -i verisign on the dump and found that string 160,000 times. We suspect it's related to TLS negotiation or zlib or both.

We were running Node v6.9.2. The symptoms were extremely high (but not growing) memory usage that was off the Node heap - we ran node-heapdump on the process and were only able to account for about 18MB of the 1.8GB. Checking /proc/<pid>/smaps revealed that most of the memory was in AnonHugePages.
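
For reference, here's a rough sketch of the smaps check (Linux-only; it just sums the AnonHugePages lines per mapping, assuming the standard smaps layout of a header line per mapping followed by key/value lines, so adjust for your kernel):

```js
// Rough sketch: sum AnonHugePages per mapping from /proc/self/smaps.
// Linux-only; assumes the standard smaps layout (a mapping header line
// followed by "Key:   <value> kB" lines).
'use strict';
const fs = require('fs');

const lines = fs.readFileSync('/proc/self/smaps', 'utf8').split('\n');

let current = '[anon]';   // mapping we're currently inside, e.g. "[heap]"
const perMapping = {};    // mapping name -> AnonHugePages in kB
let totalKb = 0;

lines.forEach((line) => {
  if (/^[0-9a-f]+-[0-9a-f]+\s/.test(line)) {
    // Header line, e.g. "00400000-00ad2000 r-xp 00000000 08:01 123 /usr/bin/node"
    const fields = line.trim().split(/\s+/);
    current = fields[5] || '[anon]';
  } else if (/^AnonHugePages:/.test(line)) {
    const kb = parseInt(line.split(/\s+/)[1], 10) || 0;
    perMapping[current] = (perMapping[current] || 0) + kb;
    totalKb += kb;
  }
});

console.log('AnonHugePages total: %d kB', totalKb);
Object.keys(perMapping).forEach((name) => {
  if (perMapping[name] > 0) {
    console.log('  %s: %d kB', name, perMapping[name]);
  }
});
```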

I'm mostly posting because there appear to be no Node-specific resources that point to transparent huge pages as a potential problem for memory management with Node. A Google search for "node transparent huge pages" yields nothing suggesting it might be a problem for Node users.

Several databases warn about the consequences of enabling transparent huge pages.

@addaleax addaleax added the memory label Jan 30, 2017

@kevinburke kevinburke changed the title from Unusually large (but constant) memory usage with transparent huge pages turned on to Unusually large (but constant) memory usage with transparent huge pages enabled Jan 30, 2017

kevinburke commented Jan 30, 2017

I found this blog post from 2011 about Node copying SSL certificates around:

Using the built-in Allocations instrument, I was looking for how memory was being used. I expected to just see a large blob of allocation being done inside v8, since Instruments and the DTrace machinery that powers it do not have visibility into the VM's internals. Unexpectedly, it quickly became apparent our main use of memory was the node::crypto::SecureContext::AddRootCerts function. After going back to the Javascript, we could see that for every new TLS connection being made, Node was re-parsing the list of root-certificate authorities from their string forms into the X509_STORE object used by OpenSSL.

Just by commenting out one line of Javascript, we were able to reduce memory usage by 20%, and increased the performance of the HTTPS server from 70 requests/second to 700 requests/second.

Ryan changed the Node crypto code to use a single global CA store for the default root certificates in 5c35dff. The current fix is a hack; the correct long-term fix is to use SSL_new with a shared SSL_CTX*, but that will require a larger refactoring of node_crypto.cc.

Any chance that hack is still in place?


Contributor

mscdex commented Jan 30, 2017

Member

indutny commented Jan 30, 2017

Hm... given that the paging scheme affects it, I suspect that it is V8 that causes it. V8 mmap()s memory pages at randomized addresses to facilitate allocation of JavaScript objects. Since those mappings may not be next to each other, the kernel may be allocating more memory than is actually needed.

Is the connection to TLS usage a tested hypothesis? FWIW, I think it may not be related to TLS at all.

cc @bnoordhuis, I wonder what your thoughts are on this?


Member

bnoordhuis commented Jan 31, 2017

V8 seems like the most likely culprit, yes. Since it maps memory in 4 kB chunks at randomized addresses, when huge pages are enabled, 98% or 99% of that memory is going to be wasted. Some sleuthing with perf(1) should be able to confirm that.

Problem is, I don't really know of a workaround except to recommend that users disable THP. There isn't really a way to opt out at the application level. Perhaps V8 could be taught to use 2 MB pages, but that's arguably less secure.

@ofrobots Have Chromium or V8 received bug reports about this before? Google didn't turn up anything except this issue. :-)


Contributor

ofrobots commented Jan 31, 2017

My search doesn't find anything for Chromium/V8 either, which perhaps isn't surprising, as people using browsers on the desktop aren't likely to have transparent huge pages enabled. V8 does randomize anonymous memory that it acquires from mmap.

AFAIK, the V8 heap spaces are allocated as contiguous 'heap pages', which happen to be large enough (512 KiB or 1 MiB depending on the version) that you shouldn't see this large a difference in memory usage. Pages with JITted executable code, however, might be a different matter?

Adding some V8 memory folks: @hannespayer, @mlippautz, @ulan.

I am not sure there is a real workaround here other than to disable transparent huge pages, or to switch them to madvise mode. Using large pages for code memory would indeed be less secure in the browser, but perhaps less of a concern for server-side use cases?


mlippautz commented Jan 31, 2017

V8 GC/memory engineer here.

TL;DR: What @indutny and @ofrobots already said. I can see system-wide THP causing problems with fragmentation since V8 doesn't integrate well with them.

Details:

As already mentioned, the most recent V8 version allocates memory for regular objects in 512 KiB pages (previously we used 1 MiB pages). We allocate using regular mmap, making full use of the 64-bit address space on 64-bit platforms. Since THP are, afaik, mapped eagerly by the kernel, you will definitely see a lot of fragmentation. Because the mappings are spread across that large address space, I expect the waste to grow roughly linearly with the huge page size, i.e., up to 4x for 2 MiB huge pages (each 512 KiB V8 page can end up occupying its own 2 MiB huge page).

Code space is special, as we already need to put code pages into ranges that are close together because of certain calling schemes, which should actually limit fragmentation. So code pages are only really randomized at a global level, and don't use the full 64-bit address space for individual code pages.


kevinburke commented Jan 31, 2017

Can we add additional documentation, or possibly a warning when Node starts? Is there currently Node documentation on debugging memory leaks or high memory usage?
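
To make that concrete, here's a purely hypothetical sketch of what such a startup check could look like (it reads the standard sysfs knob on Linux; the wording and placement are just illustrative, not a proposal for actual core code):

```js
// Hypothetical sketch of a startup check - not actual Node core code.
// On Linux, /sys/kernel/mm/transparent_hugepage/enabled looks like
// "always madvise [never]", with the active mode in brackets.
'use strict';
const fs = require('fs');

function warnIfTransparentHugePagesAlways() {
  if (process.platform !== 'linux') return;
  let mode;
  try {
    mode = fs.readFileSync('/sys/kernel/mm/transparent_hugepage/enabled', 'utf8');
  } catch (err) {
    return; // sysfs not available (containers, older kernels, etc.)
  }
  if (mode.indexOf('[always]') !== -1) {
    console.warn('Warning: transparent huge pages are set to "always"; ' +
                 'this can greatly inflate resident memory for Node processes. ' +
                 'Consider switching to "madvise" or "never".');
  }
}

warnIfTransparentHugePagesAlways();
```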


rbranson commented Jan 31, 2017

(I'm working with @kevinburke on this) We observed the hugepages issue entirely in the [heap] mapping (as in /proc/<pid>/smaps), which AFAICT isn't allocated by mmap, but is controlled by brk/sbrk syscalls. The V8 managed regions were (relatively) small.


Contributor

ofrobots commented Jan 31, 2017

Based on #11077 (comment) and what @mlippautz stated about code memory, I find it unlikely that V8 can cause this much fragmentation.

@rbranson: Perhaps you can run the application with strace and try to correlate the memory ranges that show up w/ high fragmentation in the memory map? strace has a -k option to print stacks on each system call. Beware that this might produce a lot of output.


Member

bnoordhuis commented Feb 1, 2017

perf record -g along with some tracepoints on brk/mmap/mmap2/etc. syscalls will probably be easier to work with.

We observed the hugepages issue entirely in the [heap] mapping (as in /proc/<pid>/smaps), which AFAICT isn't allocated by mmap, but is controlled by brk/sbrk syscalls.

Intuitively, that doesn't sound right. brk allocates a contiguous range of virtual memory. Barring kernel bugs I wouldn't expect that to suffer much from fragmentation or wastage.


Member

bnoordhuis commented Feb 7, 2017

@kevinburke @rbranson Any updates?

springmeyer commented Feb 13, 2017

@kevinburke I see you linked to https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/. Are you using an alternative allocator in this case? If you are using jemalloc, consider upgrading - refs jemalloc/jemalloc#243


rbranson commented Feb 22, 2017

@bnoordhuis I hear you about the fragmentation being unintuitive. The data segment is contiguous in virtual memory, whereas hugepages are contiguous physical memory pages. The issues happen when the program break is being continuously adjusted up and down. For instance, use brk to add 10MB to the heap: THP might cause this to be backed by five 2MB hugepages. If the program is able to free up the last 1MB of this allocation, it might decide to issue a brk to allow the OS to reclaim that 1MB of memory. In this case the freed half of that last hugepage isn't actually usable by the system at all. When the program extends the data segment again using brk, there's nothing that will force it to re-use the already-allocated hugepage at the tail end of the data segment (see the very bottom section of https://www.kernel.org/doc/Documentation/vm/transhuge.txt). THP has defrag functionality to theoretically address these problems.

There is additional memory usage that we're trying to address outside of the THP fix. THP just exacerbates an already problematic level of memory consumption by severely hindering the kernel's ability to quickly reclaim unused memory. Right now our most promising lead on extra memory consumption is that we're loading the CA certificates from disk ourselves, which causes unreasonable space and time overhead for HTTPS. This boils down to what is effectively a workaround for #4175.
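
For illustration, a simplified sketch of the direction we're heading (the bundle path and the exact client wiring are illustrative; the point is to parse the CA bundle once into a shared secure context and reuse one keep-alive agent, rather than passing ca on every request):

```js
// Sketch only: parse the CA bundle from disk once at startup into a single
// secure context, and share it via one keep-alive agent, instead of passing
// `ca` on every request (which re-parses the certs into a new SecureContext
// per request/connection). The bundle path below is hypothetical.
'use strict';
const fs = require('fs');
const tls = require('tls');
const https = require('https');

const caBundle = fs.readFileSync('/etc/ssl/certs/ca-bundle.pem'); // hypothetical path
const secureContext = tls.createSecureContext({ ca: caBundle });  // parsed once

const agent = new https.Agent({
  keepAlive: true,                // reuse TCP/TLS connections where possible
  secureContext: secureContext,   // shared context for new TLS connections
});

// Reuse the same agent for every outgoing request (with superagent: .agent(agent)).
https.get({ host: 'api.example.com', path: '/health', agent: agent }, (res) => {
  console.log('status:', res.statusCode);
}).on('error', (err) => {
  console.error(err);
});
```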

We don't have any additional metrics that were requested. This is code that is incredibly difficult to instrument in production.


Member

Trott commented Jul 26, 2017

Should this remain open? Or can it be closed?


Member

bnoordhuis commented Jul 28, 2017

I'll close it out; I don't think there is anything we can do here. (Node could print a warning when THP is enabled, but THP isn't always a problem, so I don't think that's the right thing to do.)

Suggestions welcome, though!

