Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak off heap on Windows when using File::Stat #3446

Closed
jsvd opened this issue Nov 4, 2015 · 22 comments
Closed

Memory leak off heap on Windows when using File::Stat #3446

jsvd opened this issue Nov 4, 2015 · 22 comments
Labels
Milestone

Comments

@jsvd
Copy link

@jsvd jsvd commented Nov 4, 2015

This affects all jruby versions since 1.7.20 inclusive:
1.7.11 works OK
1.7.18 works OK
1.7.19 works OK
1.7.20 NOT OK
1.7.22 NOT OK
9.0.3.0 NOT OK

Every time a File::Stat.new(path) or a File.stat(path) is executed, even if the context disappears there are chunks of memory that are not freed.

This can cause, for example, jruby instances to show a healthy jvm heap usage of < 150 mb while the process on windows uses 3gb.

This can be tested by executing bin\jruby -e " loop { File.stat('bin') }"and watching in VMMAP (https://technet.microsoft.com/en-us/sysinternals/dd535533.aspx) the amount of unusable memory.

In this image collection you can see the normal memory usage of 1.7.19 and the of the non-freed allocations in 1.7.20

@enebo
Copy link
Member

@enebo enebo commented Nov 4, 2015

I switch both stat windows structs with threadlocal single instances which would eliminate the allocation of those structs from being a memory leak....The leak still happens! So reviewing 1.7.19->1.7.20 these jnr-related packages have changed:

jnr-enxio 0.4->0.9
jnr-unixsocket 0.3->0.8
jnr-posix 3.0.9->3.0.12
jnr-constants 0.8.6->0.8.7
jnr-ffi 2.0.1->2.0.3
jnr-jffi 1.2.7->1.2.9
@enebo
Copy link
Member

@enebo enebo commented Nov 4, 2015

ok good news and bad news. I figured out the leak is due to the addition of the WString type added in 1.7.20. Current stat code has been completely changed to use Windows APIs directly but it also is using WString. So yay we figured out what is leaking...

Bad part is WString itself is written fine and jnr-ffi's memory manager seems to be broken by perhaps holding a hard reference to the native string made from WString. This means no pages of memory are ever getting released.

I can somewhat fix this by using byte[] directly and not using this helper but I suspect WString is not the only thing in jnr-posix hitting this now....

@enebo
Copy link
Member

@enebo enebo commented Nov 5, 2015

Done working on this for today but we are getting closer to narrowing down where something is holding an extra reference. Somewhat sad to see this has been broken for months but I am guessing Windows has more requirements for special types than the Unixy platforms so people are noticing it a bit less.

@enebo
Copy link
Member

@enebo enebo commented Nov 5, 2015

Ok looks like we found a long-standing bug in jffi:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366892%28v=vs.85%29.aspx

        public void freePages(long address, int pageCount) {
            Foreign.VirtualFree(address, (int)this.pageSize() * pageCount, Foreign.MEM_RELEASE);
        }

This should be:

        public void freePages(long address, int pageCount) {
            Foreign.VirtualFree(address, 0, Foreign.MEM_RELEASE);
        }

Tip of the hat to @nirvdrum to helping debug this and notice the comment about MEM_RELEASE in the MSDN article. Since we spent more than a day on this I thought I would highlight the journey...

By yesterday, we realized that allocating the structs for stat() were not leaking. That only left WString. WString as a parameter to stat was added between 1.7.19 and 1.7.20 so it was likely part of the cause. WString can be a small object. jnr-ffi is smart in that it makes its own virtual paging system and stuffs all these small objects into its own managed pages. This ends up reducing fragmentation. The actual page management itself is very complicated and is heavily dependent on GC + Weak References + Phantom references. Figuring out the paging system was working took quite a while...but it was working.

This really only left on other avenue which was that we were not really freeing the native pages. This is what led @nirvdrum to the MSDN page on VirtualFree.

So tonight @nirvdrum will confirm the fix works (I have tested on my machine but I think we need a little more confidence). Once he confirms it we will rev jffi and dependent libs and then ask the logstash team to verify the leak is gone. Then we can put out 1.7.23 and hopefully 9.0.4.

@jsvd
Copy link
Author

@jsvd jsvd commented Nov 5, 2015

enebo added a commit that referenced this issue Nov 5, 2015
@enebo
Copy link
Member

@enebo enebo commented Nov 6, 2015

@jsvd https://projectodd.ci.cloudbees.com/view/JRuby/job/jruby-1.7-dist/444/org.jruby$jruby-dist/

Can you test this. It should have the magic for the leak at least.

@jsvd
Copy link
Author

@jsvd jsvd commented Nov 8, 2015

@enebo something strange is happening with that build. Now the unusable memory goes up and down so it seems the allocations are being freed correcltly.

However, after running the file stat loop for 10-15 minutes (if I have jvisualvm looking at the instance it can happen in 1 minute or 2), the process crashes:

capture

I can replicate this quite reliably, but I haven't found a way to extract more information about what's happening to cause this.

@enebo
Copy link
Member

@enebo enebo commented Nov 9, 2015

@jsvd I did see it crash once but it was over an hour and a half or so and I was not sure if it was real or not. I did not get an actual crash file either. I will examine GetFileAttributesEx and see if there is some other obscure note I am missing.

@astefan
Copy link

@astefan astefan commented Nov 17, 2015

Hi @enebo. Is there any progress on this issue? Do you have an ETA on when it might get fixed? Thanks.

@donv
Copy link
Member

@donv donv commented Nov 17, 2015

Hi @astefan !

I think this should be fixed in the 9.0.4.0 release. Have you tried it?

@astefan
Copy link

@astefan astefan commented Nov 18, 2015

Hi @donv. We will need this fix in the 1.7.x branch. I guess it should go in 1.7.23. Can you confirm the fix will go into this version as well?

@donv
Copy link
Member

@donv donv commented Nov 18, 2015

I am not responsible for doing it, but the fix should go into 1.7.23. It is however not a direct merge, so it has to be reimplemented for the jruby-1_7 branch, so if you are able to test the jruby-1_7 branch to confirm the fix that would ensure the fix is OK for 1.7.23.

@jsvd
Copy link
Author

@jsvd jsvd commented Nov 19, 2015

@donv @enebo when the fix lands in the branch I can test it

@mkristian
Copy link
Member

@mkristian mkristian commented Nov 19, 2015

@jsvd I double check it - current jruby-9.0.4.0 and jruby-1_7 are using the same jffi library which is the latest which includes the patch from @enebo. a recent dist file is here https://projectodd.ci.cloudbees.com/view/JRuby/job/jruby-1.7-dist/456/org.jruby$jruby-dist/

but I would say that is the same version you already tested and which crashes the process.

@jsvd
Copy link
Author

@jsvd jsvd commented Nov 19, 2015

@mkristian right, I only know that Tom found another way to fix the problem after the previous fix made jruby crash. I don't know if it has reached github yet..

@enebo
Copy link
Member

@enebo enebo commented Nov 19, 2015

@astefan @jsvd @donv @mkristian The current plan is to get 1.7.23 out this coming Monday. We tried to get it out before Rubyconf but we could not get it squeezed in. It should definitely be fixed for stat(). I was able to perform >2 billion stats (I interrupted after that).

@jakommo
Copy link

@jakommo jakommo commented Nov 24, 2015

@enebo Looking at the release notes, is this issue the one in the highlight described as "Windows stat no longer leaks memory (broken since 1.7.20)" or didn't it make it into 1.7.23?

@enebo enebo added the windows label Nov 24, 2015
@enebo enebo added this to the JRuby 1.7.23 milestone Nov 24, 2015
@enebo
Copy link
Member

@enebo enebo commented Nov 24, 2015

@jakommo yes this was that bullet but it did not make it into the list. I will update on website at least...

@jakommo
Copy link

@jakommo jakommo commented Nov 24, 2015

@enebo roger that, thanks for the clarification 👍

@jsvd
Copy link
Author

@jsvd jsvd commented Nov 24, 2015

confirmed 1.7.23 fixes the problem. thanks for the awesome response and fix @enebo

@jsvd jsvd closed this Nov 24, 2015
@colinsurprenant
Copy link
Contributor

@colinsurprenant colinsurprenant commented Nov 24, 2015

thanks 👍

@perlun
Copy link
Contributor

@perlun perlun commented Dec 16, 2015

Where is the commit/PR for this on the 1.7 branch? Can't locate it, debugging #3505 since it breaks JRuby for Windows in our use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants
You can’t perform that action at this time.