Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locking problems on MACOS Ventura 13.1 (22C65) #2599

Closed
joshStillerman opened this issue Jul 26, 2023 · 24 comments · Fixed by #2603
Closed

Locking problems on MACOS Ventura 13.1 (22C65) #2599

joshStillerman opened this issue Jul 26, 2023 · 24 comments · Fixed by #2603
Assignees
Labels
branch/alpha This is present on or relates to the alpha branch bug An unexpected problem or unintended behavior US Priority

Comments

@joshStillerman
Copy link
Contributor

Affiliation
MIT Plasma Science and Fusion Center

Version(s) Affected
MDSplus version: 7.139.28

Release: alpha_release_7.139.28

Platform
MACOS
At least on Ventura 13.1 (22C65)

Describe the bug
When trying to read data from a local tree a lock error is detected that is reported
as:

Error: Unable to get data for this node.
Error message was: %TREE-W-LOCK_FAILURE, Error locking file, perhaps NFSLOCKING not enabled on this system

To Reproduce
Steps to reproduce the behavior:

  • install this version downloaded from mdsplus.org
  • load the MDSplus environment vars:
    . /usr/local/mdsplus/setup.sh
    (we should have better directions for installing / using)
  • define default_tree_path to point at some directory containing a tree on your computer:
    export default_tree_path=/users/jas/trees
  • run a program that reads data like python or TCL as below:
TCL> set tree camera /shot = 230706107
TCL> dir

\CAMERA::TOP

 :CAMERA       :POWER        :TRIGGER     

  OBSERVER    

Total of 4 nodes.
TCL> deco observer.vector
Error: Unable to get data for this node.
Error message was: %TREE-W-LOCK_FAILURE, Error locking file, perhaps NFSLOCKING not enabled on this system
mdsdcl: --> failed on line 'deco observer.vector'

Expected behavior
contents of the node displayed

Additional context
I have not tested this on any other version of MACOS

@joshStillerman joshStillerman added the bug An unexpected problem or unintended behavior label Jul 26, 2023
@mwinkel-dev mwinkel-dev self-assigned this Jul 28, 2023
@mwinkel-dev mwinkel-dev added the branch/alpha This is present on or relates to the alpha branch label Jul 28, 2023
@mwinkel-dev
Copy link
Contributor

Encountered this same locking issue when used jTraverser to open a tree. This was when using an experimental build of MDSplus for MacOS Ventura 13.4.1 on Apple Silicon.

Note that Josh's bug was for MacOS Ventura on Intel.

@mwinkel-dev
Copy link
Contributor

Another easy way to trigger the bug is to add a node. Following output was generated by an experimental build of MDSplus for MacOS Ventura 13.4.1 on Apple Silicon.

TCL> edit bug_2599 /shot=-1 /new
TCL> add node some_text/usage=text
Error adding node some_text
Error was: %TREE-W-LOCK_FAILURE, Error locking file, perhaps NFSLOCKING not enabled on this system
mdsdcl: --> failed on line 'add node some_text/usage=text'
TCL> 
TCL> add node a_num/usage=num
Error adding node a_num
Error was: %TREE-W-LOCK_FAILURE, Error locking file, perhaps NFSLOCKING not enabled on this system
mdsdcl: --> failed on line 'add node a_num/usage=num'

@mwinkel-dev
Copy link
Contributor

mwinkel-dev commented Aug 1, 2023

The "Open File Descriptor" (OFD) locks are failing on MacOS Ventura. Using the LLDB debugger, traced the problem to treeshr, and specifically to the io_lock_local() routine in RemoteAccess.c.

The problem is on the fcntl() call in the following code block. If is failing and thus setting err to -1.

  if (use_ofd_locks == 1)
  {
    flock.l_pid = 0;
    err = fcntl(fd, nowait ? F_OFD_SETLK : F_OFD_SETLKW, &flock);
    if (err != 0 && errno == EINVAL)

This investigation triggered the bug by adding a node to a tree. Experiment done with MacOS Ventura running on Apple Silicon, using an experimental build of MDSplus.

@mwinkel-dev
Copy link
Contributor

mwinkel-dev commented Aug 1, 2023

Disabling the OFD locks (and instead using "process" locks) allows both mdstcl and jTraverser to work.

Although "process" locks are OK for a single program that is multi-threaded, the "open file descriptor" locks are better if have multiple programs simultaneously accessing the same tree. This difference in the locking merits more investigation and discussion before a decision is made regarding a fix.

mwinkel@lap-winkel y_trees % mdstcl
TCL> edit xxx /shot=-1 /new
TCL> add node a/usage=num
TCL> add node b/usage=text
TCL> dir

\XXX::TOP

 :A            :B           

Total of 2 nodes.
TCL> write
TCL> close
TCL> quit

image

The above examples were with an experimental build of MDSplus for MacOS Ventura running on Apple Silicon. (Also need to run these examples on MacOS Ventura for Intel.)

@joshStillerman
Copy link
Contributor Author

I would recommend putting in this fix with a #ifdef for apple while we investigate

@dgarnier
Copy link
Contributor

dgarnier commented Aug 1, 2023

What is that file descriptor pointing to? The actual tree file, or some temporary lock file?

@mwinkel-dev
Copy link
Contributor

Because this bug is blocking issue #2597, it inherits the "U.S. Priority" label from that issue.

@mwinkel-dev
Copy link
Contributor

The LLDB debugger confirms that the file descriptors are pointing to the tree file, and not to a temporary lock file.

@mwinkel-dev
Copy link
Contributor

Using "process" locks instead of "open file descriptor" locks also eliminates the error message associated with the deco command (see initial bug report way above).

However, now the deco command segfaults (when running an experimental MDSplus for MacOS Ventura on Apple Silicon). If the segfault also occurs on MacOS Ventura for Intel, it will be submitted as a separate issue.

@mwinkel-dev
Copy link
Contributor

Using "process" locks instead of "OFD" locks works on Intel MacOS Ventura 13.5. There are no error messages about locking, nor segfaults. For this experiment, used an experimental "CMake" build of MDSplus that was done on an Intel Mac.

Additional testing with Aarch64 (aka Arm64) Ubuntu22 also didn't trigger a segfault with the deco command. The segfault is specific to Apple Silicon (aka Arm64) MacOS Ventura. Thus is just a part of Issue #2597 to port MDSplus to MacOS Ventura on Apple Silicon.

@mwinkel-dev
Copy link
Contributor

The previous post is not quite correct. The TreeSegmentTest fails on Intel MacOS Ventura when using "process" locks. (Experiments with Arm64 Ubuntu 22 show that the test requires "OFD" locks to pass.). However, all other C and Python tests pass (on Intel MacOS Ventura) with "process" locks.

If "OFD" locks are used on Intel MacOS Ventura, then three tests fail: TreeSegmentTest, TreeDeleteNodeTest and mdslib_ctest.

As for Apple Silicon MacOS Ventura, there are ~20 tests that segfault. (Perhaps a compiler flag is missing.). Conjecture is that when the segfaults are fixed, that the "process vs. OFD" locking issue triggered by TreeSegmentTest will also show up in the Apple Silicon build of MDSplus.

@mwinkel-dev
Copy link
Contributor

Correction to my post above (the one with the jTraverser screenshot) -- which got the facts backwards. The OFD locks are needed for multi-threaded programs; the standard "record" locks (bound to a process) are for synchronizing file access by cooperating processes.

@mwinkel-dev
Copy link
Contributor

mwinkel-dev commented Aug 4, 2023

The OFD locks do not appear in the fcntl.h file that ships with Apple's Xcode development tool.

On MacOS Ventura, the file is found at this location:

/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/fcntl.h

And although the PureDarwin project added OFD locks a few years ago, it is unclear if those changes are in MacOS Ventura. (Both MacOS and PureDarwin are derived from Apple's open source Darwin project, but the forks have likely diverged.)

@mwinkel-dev
Copy link
Contributor

The OFD locks fail on both the APFS and HFS+ file systems. (HFS+ is also known as "Mac OS Extended".)

@dgarnier
Copy link
Contributor

dgarnier commented Aug 4, 2023

The source code for ventura kernel is here: https://github.com/apple-oss-distributions/xnu. The kernel code for OFD locks are there, in the rel/xnu-8792 branch... which is the kernel for Ventura.

What I'm not understanding is how we even got OFD locks to compile? (see below!)

If they aren't defined in fcntl.h. In the kernel they are defined in xnu/bsd/sys/fcntl.h:

#ifdef PRIVATE
#define F_OFD_SETLK             90      /* Acquire or release open file description lock */
#define F_OFD_SETLKW            91      /* (as F_OFD_SETLK but blocking if conflicting lock) */
#define F_OFD_GETLK             92      /* Examine OFD lock */

These definitions seem to be added for xnu-3247... (and did not exist in kernel before that time.)

Internally this is implemented using the vfs calls.. where this is implemented. So.. looks like a private API.. which probably matches your non-finding of it in the library.

However it currently compiles... there is some code that puts THIS in:

#ifndef F_OFD_SETLK
#define F_OFD_GETLK 36
#define F_OFD_SETLK 37
#define F_OFD_SETLKW 38

So.. that was basically undocumented behavior at that point. Who knows what it is supposed to do? I assume it did a no-op.

I wonder what would happen if you add:

#ifndef F_OFD_SETLK
#if defined (__APPLE__)
#define F_OFD_SETLK      90    
#define F_OFD_SETLKW   91 
#define F_OFD_GETLK      92    
#else
#define F_OFD_GETLK 36
#define F_OFD_SETLK 37
#define F_OFD_SETLKW 38
#endif

This is definitely evil, but...

@mwinkel-dev
Copy link
Contributor

Thanks for digging into the MacOS Ventura kernel. I will conduct that "evil" test later today.

@mwinkel-dev
Copy link
Contributor

Hacked up the TreeSegmentTest so that when used with standard record locks, it launches processes (not threads). And confirmed that the test runs successfully with three processes (each single-threaded) writing segments concurrently to the same tree. The test used the default settings: 3 writing processes, 100 segments written by each, with every segment having 10,000 elements.

Thus if the undocumented OFD locks in MacOS Ventura don't work, the fallback option for MacOS users of MDSplus will be to only use single-threaded programs for I/O to trees.

The test was performed with an experimental build of MDSplus using CMake on Intel MacOS Ventura.

@dgarnier
Copy link
Contributor

dgarnier commented Aug 4, 2023

Looking back at the history of this code, seems to me that this change maybe should just be reversed. This is the merge request that effectively was evil, basically assuming that if it isn't windows, its linux. The threaded behavior was there before, so you can see where we could put it back.

#2163

Oh.. and obviously, if we want MacOS as a first class citizen, we need to run unittests on it as well.

@dgarnier
Copy link
Contributor

dgarnier commented Aug 4, 2023

I found another project on Github that uses my "evil" solution. Who knows.
https://github.com/acronis/pcs-io/blob/a246d934f8fc389346eed0b843c3fd3b00494182/libpcs_io/pcs_compat.h#L96

Also.. here's a blog post that says that the apple shipped version of sqlite3, which is custom and built into MacOS.. also uses this private api.

https://bonsaidb.io/blog/acid-on-apple/

@mwinkel-dev
Copy link
Contributor

Issue #2163 changed many files, but has been in place for two or three years. My hunch (perhaps incorrect) is that it would be safer to keep #2163 and instead just make the minor changes needed to get MDSplus to work with MacOS Ventura.

Issue #2198 is a related change that is incompatible with the existing multi-threaded TreeSegmentTest, because if an operating system does not have OFD locks, then #2198 instead uses standard record locks (not thread-safe). If we allow fallback to record locks, then TreeSegmentTest needs to support that scenario too.

  • On an OS that supports OFD locks, TreeSegmentTest runs its multi-threaded code (e.g., three writer threads).
  • On an OS that only supports record locks, TreeSegmentTest launches processes (e.g., three writer programs).

@mwinkel-dev
Copy link
Contributor

mwinkel-dev commented Aug 4, 2023

Using the undocumented OFD values for MacOS works!!! (Thanks to Darren for suggesting the "evil" experiment.)

An experimental build with OFD locks on Intel MacOS Ventura was able to run mdstcl and execute the deco command without throwing errors. And jTraverser was able to open and edit trees (including modifying data). Plus the TreeSegmentTest (threaded version) passed.

We should probably use the "evil" solution to fix this MacOS Ventura locking issue. Nonetheless, before using an undocumented feature of the kernel, would be good for the team to at least briefly discuss the pros / cons of doing so.

@dgarnier
Copy link
Contributor

dgarnier commented Aug 5, 2023

We should probably use the "evil" solution to fix this MacOS Ventura locking issue. Nonetheless, before using an undocumented feature of the kernel, would be good for the team to at least briefly discuss the pros / cons of doing so.

Given that Apple is using it, and it's been built in for 5 past MacOS versions, I think its pretty safe. Watching MacOS evolve over the past 20 years, I think this is much more likely to move from private API to public, then to be removed. It adds compatibility with other codes, is much better than the broken SysV and POSIX semantics, and I can't see a security issue with it.

Of course, we should document the code and keep the test software so the fault shows up quickly with never OS versions.

@mwinkel-dev
Copy link
Contributor

Apple introduced the OFD locks in the kernel, xnu-3247.1.106, which was used in MacOS El Capitan back in 2015. (Thus the OFD locks have been present in 8 versions of MacOS.) And the #defines for the OFD locks have been constant over that period (no change in the numbers).

So I concur that it is reasonable to use the undocumented OFD locks in MDSplus.

@mwinkel-dev
Copy link
Contributor

Apple uses open source software in MacOS, thus publishes the internals on GitHub. For more information about the OFD locks, refer to the following links.

https://opensource.apple.com/releases/

  • click on "Releases" in upper right corner of the page
  • select the desired release of MacOS, and click to go to GitHub
  • then navigate to: xnu/bsd/sys/fcntl.h

https://github.com/apple-oss-distributions/xnu/blob/rel/xnu-3247/bsd/sys/fcntl.h

  • the OFD locks were introduced in xnu-3247
  • OS X El Capitan 10.11 (released 30-Sep-2015) is based on xnu-3247.1.106

@mwinkel-dev mwinkel-dev linked a pull request Aug 23, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch/alpha This is present on or relates to the alpha branch bug An unexpected problem or unintended behavior US Priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants