New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak off heap on Windows when using File::Stat #3446

Closed
jsvd opened this Issue Nov 4, 2015 · 22 comments

Comments

Projects
None yet
8 participants
@jsvd

jsvd commented Nov 4, 2015

This affects all jruby versions since 1.7.20 inclusive:
1.7.11 works OK
1.7.18 works OK
1.7.19 works OK
1.7.20 NOT OK
1.7.22 NOT OK
9.0.3.0 NOT OK

Every time a File::Stat.new(path) or a File.stat(path) is executed, even if the context disappears there are chunks of memory that are not freed.

This can cause, for example, jruby instances to show a healthy jvm heap usage of < 150 mb while the process on windows uses 3gb.

This can be tested by executing bin\jruby -e " loop { File.stat('bin') }"and watching in VMMAP (https://technet.microsoft.com/en-us/sysinternals/dd535533.aspx) the amount of unusable memory.

In this image collection you can see the normal memory usage of 1.7.19 and the of the non-freed allocations in 1.7.20

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 4, 2015

Member

I switch both stat windows structs with threadlocal single instances which would eliminate the allocation of those structs from being a memory leak....The leak still happens! So reviewing 1.7.19->1.7.20 these jnr-related packages have changed:

jnr-enxio 0.4->0.9
jnr-unixsocket 0.3->0.8
jnr-posix 3.0.9->3.0.12
jnr-constants 0.8.6->0.8.7
jnr-ffi 2.0.1->2.0.3
jnr-jffi 1.2.7->1.2.9
Member

enebo commented Nov 4, 2015

I switch both stat windows structs with threadlocal single instances which would eliminate the allocation of those structs from being a memory leak....The leak still happens! So reviewing 1.7.19->1.7.20 these jnr-related packages have changed:

jnr-enxio 0.4->0.9
jnr-unixsocket 0.3->0.8
jnr-posix 3.0.9->3.0.12
jnr-constants 0.8.6->0.8.7
jnr-ffi 2.0.1->2.0.3
jnr-jffi 1.2.7->1.2.9
@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 4, 2015

Member

ok good news and bad news. I figured out the leak is due to the addition of the WString type added in 1.7.20. Current stat code has been completely changed to use Windows APIs directly but it also is using WString. So yay we figured out what is leaking...

Bad part is WString itself is written fine and jnr-ffi's memory manager seems to be broken by perhaps holding a hard reference to the native string made from WString. This means no pages of memory are ever getting released.

I can somewhat fix this by using byte[] directly and not using this helper but I suspect WString is not the only thing in jnr-posix hitting this now....

Member

enebo commented Nov 4, 2015

ok good news and bad news. I figured out the leak is due to the addition of the WString type added in 1.7.20. Current stat code has been completely changed to use Windows APIs directly but it also is using WString. So yay we figured out what is leaking...

Bad part is WString itself is written fine and jnr-ffi's memory manager seems to be broken by perhaps holding a hard reference to the native string made from WString. This means no pages of memory are ever getting released.

I can somewhat fix this by using byte[] directly and not using this helper but I suspect WString is not the only thing in jnr-posix hitting this now....

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 5, 2015

Member

Done working on this for today but we are getting closer to narrowing down where something is holding an extra reference. Somewhat sad to see this has been broken for months but I am guessing Windows has more requirements for special types than the Unixy platforms so people are noticing it a bit less.

Member

enebo commented Nov 5, 2015

Done working on this for today but we are getting closer to narrowing down where something is holding an extra reference. Somewhat sad to see this has been broken for months but I am guessing Windows has more requirements for special types than the Unixy platforms so people are noticing it a bit less.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 5, 2015

Member

Ok looks like we found a long-standing bug in jffi:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366892%28v=vs.85%29.aspx

        public void freePages(long address, int pageCount) {
            Foreign.VirtualFree(address, (int)this.pageSize() * pageCount, Foreign.MEM_RELEASE);
        }

This should be:

        public void freePages(long address, int pageCount) {
            Foreign.VirtualFree(address, 0, Foreign.MEM_RELEASE);
        }

Tip of the hat to @nirvdrum to helping debug this and notice the comment about MEM_RELEASE in the MSDN article. Since we spent more than a day on this I thought I would highlight the journey...

By yesterday, we realized that allocating the structs for stat() were not leaking. That only left WString. WString as a parameter to stat was added between 1.7.19 and 1.7.20 so it was likely part of the cause. WString can be a small object. jnr-ffi is smart in that it makes its own virtual paging system and stuffs all these small objects into its own managed pages. This ends up reducing fragmentation. The actual page management itself is very complicated and is heavily dependent on GC + Weak References + Phantom references. Figuring out the paging system was working took quite a while...but it was working.

This really only left on other avenue which was that we were not really freeing the native pages. This is what led @nirvdrum to the MSDN page on VirtualFree.

So tonight @nirvdrum will confirm the fix works (I have tested on my machine but I think we need a little more confidence). Once he confirms it we will rev jffi and dependent libs and then ask the logstash team to verify the leak is gone. Then we can put out 1.7.23 and hopefully 9.0.4.

Member

enebo commented Nov 5, 2015

Ok looks like we found a long-standing bug in jffi:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366892%28v=vs.85%29.aspx

        public void freePages(long address, int pageCount) {
            Foreign.VirtualFree(address, (int)this.pageSize() * pageCount, Foreign.MEM_RELEASE);
        }

This should be:

        public void freePages(long address, int pageCount) {
            Foreign.VirtualFree(address, 0, Foreign.MEM_RELEASE);
        }

Tip of the hat to @nirvdrum to helping debug this and notice the comment about MEM_RELEASE in the MSDN article. Since we spent more than a day on this I thought I would highlight the journey...

By yesterday, we realized that allocating the structs for stat() were not leaking. That only left WString. WString as a parameter to stat was added between 1.7.19 and 1.7.20 so it was likely part of the cause. WString can be a small object. jnr-ffi is smart in that it makes its own virtual paging system and stuffs all these small objects into its own managed pages. This ends up reducing fragmentation. The actual page management itself is very complicated and is heavily dependent on GC + Weak References + Phantom references. Figuring out the paging system was working took quite a while...but it was working.

This really only left on other avenue which was that we were not really freeing the native pages. This is what led @nirvdrum to the MSDN page on VirtualFree.

So tonight @nirvdrum will confirm the fix works (I have tested on my machine but I think we need a little more confidence). Once he confirms it we will rev jffi and dependent libs and then ask the logstash team to verify the leak is gone. Then we can put out 1.7.23 and hopefully 9.0.4.

@jsvd

This comment has been minimized.

Show comment
Hide comment

jsvd commented Nov 5, 2015

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 6, 2015

Member

@jsvd https://projectodd.ci.cloudbees.com/view/JRuby/job/jruby-1.7-dist/444/org.jruby$jruby-dist/

Can you test this. It should have the magic for the leak at least.

Member

enebo commented Nov 6, 2015

@jsvd https://projectodd.ci.cloudbees.com/view/JRuby/job/jruby-1.7-dist/444/org.jruby$jruby-dist/

Can you test this. It should have the magic for the leak at least.

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd Nov 8, 2015

@enebo something strange is happening with that build. Now the unusable memory goes up and down so it seems the allocations are being freed correcltly.

However, after running the file stat loop for 10-15 minutes (if I have jvisualvm looking at the instance it can happen in 1 minute or 2), the process crashes:

capture

I can replicate this quite reliably, but I haven't found a way to extract more information about what's happening to cause this.

jsvd commented Nov 8, 2015

@enebo something strange is happening with that build. Now the unusable memory goes up and down so it seems the allocations are being freed correcltly.

However, after running the file stat loop for 10-15 minutes (if I have jvisualvm looking at the instance it can happen in 1 minute or 2), the process crashes:

capture

I can replicate this quite reliably, but I haven't found a way to extract more information about what's happening to cause this.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 9, 2015

Member

@jsvd I did see it crash once but it was over an hour and a half or so and I was not sure if it was real or not. I did not get an actual crash file either. I will examine GetFileAttributesEx and see if there is some other obscure note I am missing.

Member

enebo commented Nov 9, 2015

@jsvd I did see it crash once but it was over an hour and a half or so and I was not sure if it was real or not. I did not get an actual crash file either. I will examine GetFileAttributesEx and see if there is some other obscure note I am missing.

@astefan

This comment has been minimized.

Show comment
Hide comment
@astefan

astefan Nov 17, 2015

Hi @enebo. Is there any progress on this issue? Do you have an ETA on when it might get fixed? Thanks.

astefan commented Nov 17, 2015

Hi @enebo. Is there any progress on this issue? Do you have an ETA on when it might get fixed? Thanks.

@donv

This comment has been minimized.

Show comment
Hide comment
@donv

donv Nov 17, 2015

Member

Hi @astefan !

I think this should be fixed in the 9.0.4.0 release. Have you tried it?

Member

donv commented Nov 17, 2015

Hi @astefan !

I think this should be fixed in the 9.0.4.0 release. Have you tried it?

@astefan

This comment has been minimized.

Show comment
Hide comment
@astefan

astefan Nov 18, 2015

Hi @donv. We will need this fix in the 1.7.x branch. I guess it should go in 1.7.23. Can you confirm the fix will go into this version as well?

astefan commented Nov 18, 2015

Hi @donv. We will need this fix in the 1.7.x branch. I guess it should go in 1.7.23. Can you confirm the fix will go into this version as well?

@donv

This comment has been minimized.

Show comment
Hide comment
@donv

donv Nov 18, 2015

Member

I am not responsible for doing it, but the fix should go into 1.7.23. It is however not a direct merge, so it has to be reimplemented for the jruby-1_7 branch, so if you are able to test the jruby-1_7 branch to confirm the fix that would ensure the fix is OK for 1.7.23.

Member

donv commented Nov 18, 2015

I am not responsible for doing it, but the fix should go into 1.7.23. It is however not a direct merge, so it has to be reimplemented for the jruby-1_7 branch, so if you are able to test the jruby-1_7 branch to confirm the fix that would ensure the fix is OK for 1.7.23.

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd Nov 19, 2015

@donv @enebo when the fix lands in the branch I can test it

jsvd commented Nov 19, 2015

@donv @enebo when the fix lands in the branch I can test it

@mkristian

This comment has been minimized.

Show comment
Hide comment
@mkristian

mkristian Nov 19, 2015

Member

@jsvd I double check it - current jruby-9.0.4.0 and jruby-1_7 are using the same jffi library which is the latest which includes the patch from @enebo. a recent dist file is here https://projectodd.ci.cloudbees.com/view/JRuby/job/jruby-1.7-dist/456/org.jruby$jruby-dist/

but I would say that is the same version you already tested and which crashes the process.

Member

mkristian commented Nov 19, 2015

@jsvd I double check it - current jruby-9.0.4.0 and jruby-1_7 are using the same jffi library which is the latest which includes the patch from @enebo. a recent dist file is here https://projectodd.ci.cloudbees.com/view/JRuby/job/jruby-1.7-dist/456/org.jruby$jruby-dist/

but I would say that is the same version you already tested and which crashes the process.

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd Nov 19, 2015

@mkristian right, I only know that Tom found another way to fix the problem after the previous fix made jruby crash. I don't know if it has reached github yet..

jsvd commented Nov 19, 2015

@mkristian right, I only know that Tom found another way to fix the problem after the previous fix made jruby crash. I don't know if it has reached github yet..

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 19, 2015

Member

@astefan @jsvd @donv @mkristian The current plan is to get 1.7.23 out this coming Monday. We tried to get it out before Rubyconf but we could not get it squeezed in. It should definitely be fixed for stat(). I was able to perform >2 billion stats (I interrupted after that).

Member

enebo commented Nov 19, 2015

@astefan @jsvd @donv @mkristian The current plan is to get 1.7.23 out this coming Monday. We tried to get it out before Rubyconf but we could not get it squeezed in. It should definitely be fixed for stat(). I was able to perform >2 billion stats (I interrupted after that).

@jakommo

This comment has been minimized.

Show comment
Hide comment
@jakommo

jakommo Nov 24, 2015

@enebo Looking at the release notes, is this issue the one in the highlight described as "Windows stat no longer leaks memory (broken since 1.7.20)" or didn't it make it into 1.7.23?

jakommo commented Nov 24, 2015

@enebo Looking at the release notes, is this issue the one in the highlight described as "Windows stat no longer leaks memory (broken since 1.7.20)" or didn't it make it into 1.7.23?

@enebo enebo added the windows label Nov 24, 2015

@enebo enebo added this to the JRuby 1.7.23 milestone Nov 24, 2015

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Nov 24, 2015

Member

@jakommo yes this was that bullet but it did not make it into the list. I will update on website at least...

Member

enebo commented Nov 24, 2015

@jakommo yes this was that bullet but it did not make it into the list. I will update on website at least...

@jakommo

This comment has been minimized.

Show comment
Hide comment
@jakommo

jakommo Nov 24, 2015

@enebo roger that, thanks for the clarification 👍

jakommo commented Nov 24, 2015

@enebo roger that, thanks for the clarification 👍

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd Nov 24, 2015

confirmed 1.7.23 fixes the problem. thanks for the awesome response and fix @enebo

jsvd commented Nov 24, 2015

confirmed 1.7.23 fixes the problem. thanks for the awesome response and fix @enebo

@jsvd jsvd closed this Nov 24, 2015

@colinsurprenant

This comment has been minimized.

Show comment
Hide comment

thanks 👍

@perlun

This comment has been minimized.

Show comment
Hide comment
@perlun

perlun Dec 16, 2015

Contributor

Where is the commit/PR for this on the 1.7 branch? Can't locate it, debugging #3505 since it breaks JRuby for Windows in our use case.

Contributor

perlun commented Dec 16, 2015

Where is the commit/PR for this on the 1.7 branch? Can't locate it, debugging #3505 since it breaks JRuby for Windows in our use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment