File.flock does not work on Solaris for 9k #3343

Closed
oblutak opened this Issue Sep 22, 2015 · 37 comments

Projects

None yet

5 participants

@oblutak
oblutak commented Sep 22, 2015

Same issue reported for 3254. The code change fixed the 1.7 branch (1.7.22) but issue is still in 9k. I thought it would be in 9.0.1.0 since it was released after:

$ ./jruby --version
jruby 9.0.1.0 (2.2.2) 2015-09-02 583f336 Java HotSpot(TM) 64-Bit Server VM 25.51-b03 on 1.8.0_51-b16 +jit [SunOS-amd64]

$ ./jruby -e "File.open('/tmp/test', File::RDWR|File::CREAT) { |f| f.flock(File::LOCK_EX) }"
AsmRuntime.java:40:in `newUnsatisifiedLinkError': java.lang.UnsatisfiedLinkError: unknown
        from null:-1:in `flock'
        from BaseNativePOSIX.java:527:in `flock'
        from CheckedPOSIX.java:416:in `flock'
        from LazyPOSIX.java:412:in `flock'
        from PosixShim.java:179:in `flock'
        from OpenFile.java:2488:in `run'
        from OpenFile.java:2485:in `run'
        from RubyThread.java:1352:in `executeTask'
        from OpenFile.java:2485:in `threadFlock'
        from RubyFile.java:298:in `flock'
        from RubyFile$INVOKER$i$1$0$flock.gen:-1:in `call'
        from CachingCallSite.java:313:in `cacheAndCall'
        from CachingCallSite.java:163:in `call'
        from -e:-1:in `invokeOther1:flock'
        from -e:1:in `RUBY$block$-e$0'
        from CompiledIRBlockBody.java:70:in `commonYieldPath'
        from IRBlockBody.java:140:in `doYield'
        from BlockBody.java:77:in `yield'
        from Block.java:147:in `yield'
        from RubyIO.java:1133:in `ensureYieldClose'
        from RubyIO.java:1126:in `open'
        from RubyIO$INVOKER$s$0$0$open.gen:-1:in `call'
        from DynamicMethod.java:209:in `call'
        from CachingCallSite.java:343:in `cacheAndCall'
        from CachingCallSite.java:205:in `callBlock'
        from CachingCallSite.java:209:in `call'
        from -e:-1:in `invokeOther7:open'
        from -e:1:in `RUBY$script'
        from MethodHandle.java:625:in `invokeWithArguments'
        from Compiler.java:111:in `load'
        from Ruby.java:822:in `runScript'
        from Ruby.java:814:in `runScript'
        from Ruby.java:752:in `runNormally'
        from Ruby.java:574:in `runFromMain'
        from Main.java:409:in `doRunFromMain'
        from Main.java:304:in `internalRun'
        from Main.java:231:in `run'
        from Main.java:200:in `main'
@oblutak
oblutak commented Oct 19, 2015

Hearing 9.0.2.0 coming out really soon. Any chance to squeak a fix for this issue? I just tried a snapshot from ci.jruby.com but problem is still present. It is a blocker for my to run 9k on Solaris, since I can't install gems
Last version I tried.
./jruby --version
jruby 9.0.2.0-SNAPSHOT (2.2.2) 2015-10-18 f37539a Java HotSpot(TM) 64-Bit Server VM 25.51-b03 on 1.8.0_51-b16 +jit [SunOS-amd64]

@headius
Member
headius commented Oct 19, 2015

Looking at the code, I see the same fix for #3254 applied in JRuby master:

if (real_fd != -1 && real_fd < FilenoUtil.FIRST_FAKE_FD && !Platform.IS_SOLARIS) {

The only way this check could pass is if we're not on Solaris, or if that check is not correctly detecting Solaris.

@headius headius added this to the JRuby 9.0.2.0 milestone Oct 19, 2015
@headius headius self-assigned this Oct 19, 2015
@oblutak
oblutak commented Oct 19, 2015

ok... weird
./jruby -e 'import org.jruby.platform.Platform; puts Platform::OS'
sunos
./jruby -e 'import org.jruby.platform.Platform; puts Platform::IS_SOLARIS'
false

@oblutak
oblutak commented Oct 19, 2015

I'm REALLY confused. 1.7.22 responds the same way but I don't have the issue on 1.7.22

./jruby --version
jruby 1.7.22 (1.9.3p551) 2015-08-20 c28f492 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_51-b16 +jit [SunOS-amd64]

./jruby -e "File.open('/tmp/test', File::RDWR|File::CREAT) { |f| f.flock(File::LOCK_EX) }"
io/console not supported; tty will not be manipulated

./jruby -e 'import org.jruby.platform.Platform; puts Platform::IS_SOLARIS'
io/console not supported; tty will not be manipulated
false

@oblutak
oblutak commented Oct 19, 2015

on irc, was asked for uname -a output
SunOS ltetools 5.10 Generic_144489-12 i86pc i386 i86pc

@headius
Member
headius commented Oct 19, 2015

I believe 1.7 did not have native flock logic, so it wouldn't hit the bug. It seems our IS_SOLARIS is faulty somehow. Will look into that.

@headius
Member
headius commented Oct 19, 2015

@oblutak Can you provide the output of the following code?

puts ENV_JAVA["os.name"]
puts ENV_JAVA["os.arch"]

The fact that io/console is giving you a warning makes me think we're not handling native code loading right on Solaris either.

@headius
Member
headius commented Oct 19, 2015

Ok, I have a fix for the IS_SOLARIS problem, but there's a larger issue.

In JRuby 9k, we tend to prefer real native channels even for opening files. Unfortunately that means we can't use the pure-Java fallback logic here anyway. We'll either need to not use native file channels on Solaris or, more likely, emulate flock with fcntl the same way MRI does.

@oblutak
oblutak commented Oct 20, 2015

Here is the requested info
./jruby -e 'puts ENV_JAVA["os.name"], ENV_JAVA["os.arch"]'
SunOS
amd64

On your last comment about flock emulation, is that something that would be done in jnr-posix? Unfortunately, I'm not a coder so can't help that way, but if there is anything I could help testing with, please let me know.

@headius
Member
headius commented Oct 20, 2015

@oblutak Yes, I'm working on the flock emulation in jnr-posix. I've hit some snags, though, so this won't make it into the next 9k release. It should be in the one after that.

@headius headius modified the milestone: JRuby 9.0.4.0, JRuby 9.0.2.0 Oct 20, 2015
@oblutak
oblutak commented Oct 20, 2015

Understood. Thanks for the update. Would you rather I open a different issue for the submission of the IS_SOLARIS fix, or will you just submit both fixes under this issue?

@headius
Member
headius commented Oct 20, 2015

@oblutak I'll do both fixes as part of this issue.

@headius headius modified the milestone: JRuby 9.0.5.0, JRuby 9.0.4.0 Oct 22, 2015
@headius
Member
headius commented Oct 22, 2015

Bumping to 9.0.5.0 because 9.0.4 will be another quick flip to fix some regressions.

@headius
Member
headius commented Jan 19, 2016

I never got back to this, but I did at least start to attempt a flock impl for Solaris based on fcntl. Unfortunately never completed and had to revert when I accidentally pushed it: jnr/jnr-posix@32b9c84

@headius
Member
headius commented Jan 21, 2016

Bumping to 9.1 so we can get 9.0.5 out.

@headius headius modified the milestone: JRuby 9.1.0.0, JRuby 9.0.5.0 Jan 21, 2016
@headius
Member
headius commented Mar 3, 2016

@oblutak For 9.1 the current code basically just falls back on the old JDK logic when we're on Solaris. Can you test out a 9.1 snapshot build and let me know if it's working ok for you? http://ci.jruby.org

@headius
Member
headius commented Mar 3, 2016

I believe I found the bug in my patch and will attempt to apply it now. Downloading a Solaris vbox VM to test against.

@headius
Member
headius commented Mar 3, 2016

Pushed my tweaks for the patch in jnr/jnr-posix@f34106c (patch was applied in previous commit).

@headius
Member
headius commented Mar 3, 2016

Ok, I was unable to get the flock test to pass on Solaris and I have another task I need to work on.

@oblutak If you're still out there, you could pull master from jnr-posix and try to get the flock test passing on Solaris. I'm guessing the struct layout is not quite right, but I'm not certain.

@headius
Member
headius commented Mar 3, 2016

Patch to apply when flock is working in jnr-posix:

diff --git a/core/src/main/java/org/jruby/util/io/PosixShim.java b/core/src/main/java/org/jruby/util/io/PosixShim.java
index 1c122dc..7a4ff67 100644
--- a/core/src/main/java/org/jruby/util/io/PosixShim.java
+++ b/core/src/main/java/org/jruby/util/io/PosixShim.java
@@ -187,7 +187,7 @@ public class PosixShim {

         int real_fd = fd.realFileno;

-        if (real_fd != -1 && real_fd < FilenoUtil.FIRST_FAKE_FD && !Platform.IS_SOLARIS) {
+        if (real_fd != -1 && real_fd < FilenoUtil.FIRST_FAKE_FD) {
             // we have a real fd and not on Solaris...try native flocking
             // see jruby/jruby#3254 and jnr/jnr-posix#60
             int result = posix.flock(real_fd, lockMode);
@headius headius modified the milestone: JRuby 9.1.1.0, JRuby 9.1.0.0 Mar 8, 2016
@headius
Member
headius commented Mar 8, 2016

Punting to 9.1.1.0. I need assistance.

@mcarpenter

I've taken a look at this but have also been unable to get it working. Sharing my thoughts in the hope that someone can punt this a little further...

The failing test: https://github.com/jnr/jnr-posix/blob/master/src/test/java/jnr/posix/FileTest.java#L172

flockTest(jnr.posix.FileTest)  Time elapsed: 0.029 sec  <<< FAILURE!
java.lang.AssertionError: expected:<-1> but was:<0>
...
        at jnr.posix.FileTest.flockTest(FileTest.java:172)

I wrote a small standalone Java program imitating the failing test to pass to truss. That produced the following output:

22950/2:        fcntl(275, F_SETLKW64, 0xB2210000)              = 0
22950/2:                typ=F_RDLCK  whence=SEEK_SET start=0     len=0     sys=0  pid=0
22950/2:        fcntl(275, F_SETLKW64, 0xB2210020)              = 0
22950/2:                typ=F_UNLCK  whence=SEEK_SET start=0     len=0     sys=0  pid=0
22950/2:        fcntl(275, F_SETLKW64, 0xB2210040)              = 0
22950/2:                typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=0  pid=0
22950/2:        fcntl(276, F_SETLK64, 0xB2210060)               = 0
22950/2:                typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=0  pid=0

The struct flock looks okay but that 4th non-blocking exclusive-lock call to fcntl (that the unit test expects to fail) continues to succeed. (Also I noticed that jnr-constant appears to be returning the 64 bit version of F_SETLK[W]64 but changing this didn't make any difference).

I started to doubt the assumption of the test, so wrote some C:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>                                                           
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, const char *argv[])
{
    system("/bin/rm -f hello ; /bin/touch hello ; /bin/chmod g+l hello");     
    int fd1 = open("hello", O_RDWR);
    int fd2 = open("hello", O_RDWR);

    struct flock lock;
    lock.l_type = 0;
    lock.l_whence = SEEK_SET;                                                 
    lock.l_start = 0;
    lock.l_len = 0;                                                           
    lock.l_sysid = 0;
    lock.l_pid = 0;

    lock.l_type = F_RDLCK;                                                    
    if(fcntl(fd1, F_SETLKW, &lock)) {
        fprintf(stderr, "flock1\n");                                          
        exit(1);                                                              
    }
    lock.l_type = F_UNLCK;                                                    
    if(fcntl(fd1, F_SETLKW, &lock)) {                                         
        fprintf(stderr, "flock1\n");
        exit(1);
    }
    lock.l_type = F_WRLCK;                                                    
    if(fcntl(fd1, F_SETLKW, &lock)) {                                         
        fprintf(stderr, "flock3\n");
        exit(1);                                                              
    }
    lock.l_type = F_WRLCK;  
    if(!fcntl(fd2, F_SETLK, &lock)) {                                         
        fprintf(stderr, "flock4\n");                                          
        exit(1);
    }                                                                         
    lock.l_type = F_UNLCK;
    if(fcntl(fd1, F_SETLKW, &lock)) {                                         
        fprintf(stderr, "flock5\n");                                          
        exit(1);
    }

    return 0;
}

Compiling and running this:

$ make foo
gcc     foo.c   -o foo
$ ./foo 
flock4
$ truss -tfcntl -vall ./foo
fcntl(3, F_SETLKW, 0xFEFFE994)                  = 0
        typ=F_RDLCK  whence=SEEK_SET start=0     len=0     sys=4267573568 pid=-16782700
fcntl(3, F_SETLKW, 0xFEFFE994)                  = 0
        typ=F_UNLCK  whence=SEEK_SET start=0     len=0     sys=4267573568 pid=-16782700
fcntl(3, F_SETLKW, 0xFEFFE994)                  = 0
        typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=4267573568 pid=-16782700
fcntl(4, F_SETLK, 0xFEFFE994)                   = 0
        typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=4267573568 pid=-16782700
flock4
$

Same result: that fourth "exclusive lock it again" call continues to succeed :(

Other things I've considered:

  • I'm using ZFS, and this is a supported file system (see https://docs.oracle.com/cd/E19683-01/806-4125/6jd7pe6bk/index.html).
  • Mandatory vs. advisory locking (the sgid bit set in the above by the call to chmod(1)): this makes
    no difference. (See above link).
  • I note locks are destroyed by close(2)ing any other file, but that doesn't happen here.
  • Lack of lock inheritance on fork(2) is irrelevant.
  • I've tested the C program on Solaris 10 SPARC and Solaris 11 x86 with same results.

So we conclude that... the unit test assumption is wrong? Why? Or there's a bug in Solaris fcntl(2) (which... seems unlikely...)?

Other possibilities?

@headius headius modified the milestone: JRuby 9.1.1.0, JRuby 9.1.2.0 May 11, 2016
@oblutak
oblutak commented May 17, 2016

My apologies, I've been away from all (j)ruby activities since xmas. As indicated before, I'm no designer/coder so I quickly fall out of my element in too technical discussion.
I would love to help, but I think I would use more of your time. I don't even know how to use maven and get to running the jnr-posix tests. I haven't given up yet so if you have any good resources for me, please send them my way

I also tried to read/search re: fcntl and flock. The behavior of the C code above just didn't make sense to me. Comments in stackoverflow discussion I think give a hint.

  • simply: fnctl locks work as a Process <--> File relationship, ignoring filedescriptors
  • flock() places it's locks via a FD on a "Open file description"
    So I think the unit test of locking the same file/region within the same process will never 'fail' for Solaris/fcntl.
    Using C code provided above and compiling to versions with 'sleeps' in it, I think shows that other process trying to get exclusive lock to same file region behaves as expected.
    So I think the unit test assumption is wrong but don't know how to write a different test to show 'behavior' between two different processes. Is that 'adequate' locking for the needs here? There is certainly difference and have no idea how you would make fcntl equivalent to flock.
@mcarpenter

I'm no designer/coder

You are so busted :) This was a really good comment, I think you're right on the money. I've since been looking at the Linux fcntl(2) man page and a couple of things jump out at me:

A single process can hold only one type of lock on a file region; if a new lock
is  applied  to  an already-locked region, then the existing lock is converted
to the new lock type.

and:

F_SETLK (struct flock *)
          Acquire a lock [...]  If a  conflicting lock is held *by another process*,
this call returns -1

(That's with *my emphasis*). Opengroup has similar wording "An exclusive lock shall prevent any other process from setting a shared lock or an exclusive lock" (again my emphasis).

Given the unit test exists (and presumably doesn't fail) for systems with native flock() I guess that's what we should be aiming for if we wanted absolutely correct (equivalent) behavior. Linux flock() man page explains the unit test:

If a process uses open(2) (or similar) to obtain more than one descriptor for
the same file, these descriptors are treated independently by flock(). An
attempt to lock the file using one of these file descriptors may be denied by
a lock that the calling process has already placed via another descriptor.

On the other hand:

duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer
to the same lock

I agree this seems hard to emulate with plain fcntl() and I have no good ideas on that for the moment. So now where?

@headius
Member
headius commented May 18, 2016

MRI supports some of these platforms that don't have flock...we really just need to do what they're doing.

I'd thought we might just be able to use pure-Java flocking, but then we'd need to make sure we're not using any native descriptors...which kills a bunch of more valuable features.

@headius
Member
headius commented May 18, 2016

See also ruby/spec#255.

@oblutak
oblutak commented May 19, 2016

If I understand the flock_spec from ruby/spec

 it "returns false if trying to lock an exclusively locked file" do
    @file.flock File::LOCK_EX

    ruby_exe(<<-END_OF_CODE, escape: true).should == "false"
      File.open('#{@name}', "w") do |f2|
        print f2.flock(File::LOCK_EX | File::LOCK_NB).to_s
      end
    END_OF_CODE
  end

The test is checking that a file with an exclusive lock appears locked to another process.
So I think the 'flock-fcntl bridge' implemented would satisfy (j)ruby's needs.

As for making jnr/posix tests passed for a flock-fcntl bridge, I don't know if that is possible. All existing flock test are perfectly valid for flock. Maybe fcntl locking needs to exists separately with its own set of unit tests (in jnr/posix) and at 'higher' layer (jruby?) need to make the distinction of using/implement a flock-fcntl bridge when a ruby 'flock' is called. (sorry, I said before I'm no coder so don't know the correct terminology but hopefully you understand what I am trying to say)

@headius
Member
headius commented May 20, 2016

@oblutak Actually that makes a lot of sense. jnr-posix should just implement and test both flock and fcntl, and if flock is not available it doesn't run the tests, like any other non-universal feature. We'd mimic MRI's logic for using fcntl as flock from the JRuby side.

So I think next steps would be to ensure that fcntl locking works with jnr-posix's fcntl today and write some jnr-posix tests for it. Then we look at how MRI emulates flock with fcntl, implement that in JRuby, and we're good?

@oblutak
oblutak commented May 20, 2016

I think the flock (using fnctl) as implemented in jnr-posix master in SolarisPOSIX.java replicates what is done in MRI.

Now for fcntl locking unit tests for jnr-posix, I wish I could help but I have no idea how. I don't know how you can 'launch' another process, required to try to get a lock on the same file. Maybe we just do the positive paths test for now just to validate that fcntl calls return successfully and rely on the flock_spec test at ruby/jruby to confirm 'flock' at (j)ruby layer behave as expected.

If you think there is a way I could help, let me know how, since I don't know java at all and never used tools like maven. I am trying a bit on my own. I got a copy of jnr-posix, got jdk-1.8u92 and maven 3.3.9 installed. Not sure what mvn commands to use. Doing mvn verify gave me some test failure as I would expect but I also get a core dump! I think asking to your help to just figuring out these basics steps would just impede and use your time needlessly. But I will look at the final code submitted and hope to learn from it and maybe one day I will be more helpful.

@enebo enebo modified the milestone: JRuby 9.1.2.0, JRuby 9.1.3.0 May 23, 2016
@headius
Member
headius commented Aug 17, 2016

@oblutak Can you recommend a good free Solaris (or friends) that I could use to verify the fix on a VirtualBox VM?

@headius
Member
headius commented Aug 17, 2016

@oblutak RE: testing...I'm not sure either. Is there a standard way we can attempt to lock a file without introducing a lot of dependencies? If not, then I think the next best option would be to simply have the test launch another JVM with some prepared code that tries to lock and prints its success.

If the jnr-posix tests fail (and especially if they crash) we should get that fixed.

@headius
Member
headius commented Aug 17, 2016

I think this is turning out to be a bit more work than we can spin in the next week, so we'll bump this to 9.1.4.0 and work with @oblutak to get jnr-posix up to scratch on Solaris.

@headius headius modified the milestone: JRuby 9.1.4.0, JRuby 9.1.3.0 Aug 17, 2016
@oblutak
oblutak commented Aug 23, 2016

Hi @headius, Sorry for the delay. I've spent vacations totally 'unplugged'. As for 'free Solaris and friends' I've only used Solaris on real hardware. On another issue reported, someone commented similar issue on SmartOS. There wiki has a section 'SmartOS as a Sandboxed VirtualBox Guest', but I don't have any personal experience to share.

@headius
Member
headius commented Sep 20, 2016

I have just pushed f8da225 for #4162, which implements flock like MRI using Ruby FFI. This should now be fixed.

Please test a build from http://ci.jruby.org once there's an updated snapshot.

@headius headius closed this Sep 20, 2016
@oblutak
oblutak commented Sep 21, 2016

Looks promising!
$ jruby --version jruby 9.1.6.0-SNAPSHOT (2.3.1) 2016-09-21 01030ab Java HotSpot(TM) 64-Bit Server VM 25.51-b03 on 1.8.0_51-b16 +jit [solaris-x86_64] $ jruby -e "File.open('/tmp/test', File::RDWR|File::CREAT) { |f| f.flock(File::LOCK_EX) }"
still unable to install gems, but looks like a totally different issue. I'll try to investigate a bit more and track under different issue, if required. Thanks for the fix!

@oblutak
oblutak commented Sep 21, 2016

In case someone stumble across my previous comment... Error is/was
jruby-9.1.6.0-SNAPSHOT/bin/jruby -S gem install builder LoadError: no such file to load -- rubygems/resolver/molinillo/lib/molinillo/dependency_graph/log <...>
I used the default 'tar' that ships with Solaris which had issue with long paths, so directory content looked like
ls jruby-9.1.6.0-SNAPSHOT/lib/ruby/stdlib/rubygems/resolver/molinillo/lib/molinillo/dependency_graph ac ad de lo se ta ve
instead of
ls jruby-9.1.6.0-SNAPSHOT/lib/ruby/stdlib/rubygems/resolver/molinillo/lib/molinillo/dependency_graph action.rb add_vertex.rb log.rb tag.rb add_edge_no_circular.rb detach_vertex_named.rb set_payload.rb vertex.rb
just use gnu tar instead to 'fix this issue'.

@headius
Member
headius commented Sep 21, 2016

Interesting. Narrow that down and file it...we'll get Solaris 100% yet!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment