
Unable to build openembedded system LuneOS, error in packaging linux-libc-headers #2665

Closed
ericblade opened this issue Nov 14, 2017 · 40 comments

@ericblade

ericblade commented Nov 14, 2017

  • Your Windows build number: Microsoft Windows [Version 10.0.16299.19]
    (WSL Ubuntu fresh install from store)

  • What you're doing and what's happening:

Build instructions for the LuneOS system are here: http://webos-ports.org/wiki/Build_for_Tenderloin

Error text:

ERROR: linux-libc-headers-4.10-r0 do_package: Error executing a python function in exec_python_func() autogenerated:

The stack trace of python calls that resulted in this exception/failure was:
File: 'exec_python_func() autogenerated', lineno: 2, function: <module>
     0001:
 *** 0002:sstate_task_postfunc(d)
     0003:
File: '/home/eric/build-lune/webos-ports-env/webos-ports/openembedded-core/meta/classes/sstate.bbclass', lineno: 702, function: sstate_task_postfunc
     0698:
     0699:    omask = os.umask(0o002)
     0700:    if omask != 0o002:
     0701:       bb.note("Using umask 0o002 (not %0o) for sstate packaging" % omask)
 *** 0702:    sstate_package(shared_state, d)
     0703:    os.umask(omask)
     0704:
     0705:    sstateinst = d.getVar("SSTATE_INSTDIR")
     0706:    d.setVar('SSTATE_FIXMEDIR', shared_state['fixmedir'])
File: '/home/eric/build-lune/webos-ports-env/webos-ports/openembedded-core/meta/classes/sstate.bbclass', lineno: 624, function: sstate_package
     0620:    for plain in ss['plaindirs']:
     0621:        pdir = plain.replace(workdir, sstatebuild)
     0622:        bb.utils.mkdirhier(plain)
     0623:        bb.utils.mkdirhier(pdir)
 *** 0624:        os.rename(plain, pdir)
     0625:
     0626:    d.setVar('SSTATE_BUILDDIR', sstatebuild)
     0627:    d.setVar('SSTATE_PKG', sstatepkg)
     0628:
Exception: PermissionError: [Errno 13] Permission denied: '/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/package' -> '/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/sstate-build-package//package'

ERROR: linux-libc-headers-4.10-r0 do_package: Function failed: sstate_task_postfunc
ERROR: Logfile of failure stored in: /home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/temp/log.do_package.26018
ERROR: Task (/home/eric/build-lune/webos-ports-env/webos-ports/openembedded-core/meta/recipes-kernel/linux-libc-headers/linux-libc-headers_4.10.bb:do_package) failed with exit code '1'

This seems to be some kind of permissions error, but I'm not sure what kind. This does not occur when following identical instructions on an actual Ubuntu 16.04 system.

This occurs a couple of hours into a system build on my laptop, approximately task #478.

I am not intimately familiar with the OE build system, so advice for tracking down further specifics would be helpful. I did notice the double "/" in the pathname at the end the last time I ran this, and I attempted to fix it, but that had no effect on the overall outcome. I also tested that double "//" in paths normally works correctly, and it seems to.

Building on DrvFS results in a hung process very early in the build; I'm not sure how to diagnose that.

  • What should be happening:
    It shouldn't error here; the OS build should complete.

  • Strace of the failing command: I've got a 290MB strace, I could use some direction on how to trim it down specifically to an area that is relevant to the problem

@bitcrazed
Contributor

bitcrazed commented Nov 14, 2017

Hey Eric. Thanks for filing, though in future please do complete the issue template; your issue above omits requested info:

  • precise repro steps
  • OS build version
  • strace
  • etc.

Please also attach the failure logfile noted at the end of your error dump above.

And since this looks like a permissions issue, please also include a long file listing (inc. permissions) of the offending '/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/package' -> '/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/sstate-build-package//package'

Thanks - appreciate your help diagnosing this issue.

@ericblade
Author

ericblade commented Nov 14, 2017

OK, so, it's not able to create the last bit there -- .../sstate-build-package//package. Paste of results of ls -al coming

ls -al output for requested files/dirs:

eric@DESKTOP-463MS78:~$ ls -al /home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/package
total 0
drwxr-xr-x 0 eric eric 512 Nov 14 00:32 .
drwxrwxrwx 0 eric eric 512 Nov 14 12:45 ..
drwxr-xr-x 0 eric eric 512 Nov 14 08:26 usr
eric@DESKTOP-463MS78:~$ ls -al /home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/sstate-build-package//package
ls: cannot access '/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/sstate-build-package//package': No such file or directory
eric@DESKTOP-463MS78:~$ ls -al /home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/sstate-build-package/
total 0
drwxr-xr-x 0 eric eric 512 Nov 14 12:45 .
drwxrwxrwx 0 eric eric 512 Nov 14 12:45 ..
eric@DESKTOP-463MS78:~$

@ericblade
Author

strace.zip

Attached zip of strace. Please let me know if I can be of any further assistance.

@therealkenc
Collaborator

therealkenc commented Nov 14, 2017

*** 0624:        os.rename(plain, pdir)

aka

24210 rename("/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/package", "/home/eric/build-lune/webos-ports-env/webos-ports/tmp-glibc/work/armv7ahf-neon-webos-linux-gnueabi/linux-libc-headers/4.10-r0/sstate-build-package/package" <unfinished ...>
[...]
24210 <... rename resumed> )            = -1 EACCES (Permission denied)

aka

#1529 #2243 #1726 #640 (message) et al

@ericblade
Author

ericblade commented Nov 15, 2017

Thanks for the super quick find, @therealkenc. So that seems to indicate it's not something that will go away in the near future? Which raises a question: what can be done to get around the problem? This is clearly happening to a number of other things, so having a general idea of how to change things to avoid it would be a good course of action, and I'd be happy to help in any way I can.

(I wonder if my idea of a loopback device + ext fs would be a way to get 100% fs compatibility...)

@therealkenc
Collaborator

So, I guess, there's a question involved -- what can be done to get around the problem?

I've been thinking about that too, since I posted #1529 and this turned out to be the cause of my #640 grief. It seems to me that with effort some cases (perhaps apt) could be worked around. While there are a bunch of dupes accumulating here, it isn't causing userspace to completely fall over; POSIX rename would by necessity have been addressed ages ago if it were a super common case.

Other than fd caching scenarios like #1529, I speculate (without any evidence) that the handles sometimes aren't being held open for any particular reason. It is just that no one even thinks about it on Real Linux because it isn't a problem you'd worry about. Maybe they're even benign handle leaks. A strategic close() might solve a lot of scenarios.

Or might not. It might take a wholesale refactoring, or be (practically speaking) impossible because the fd is open for a good reason. I have not looked at your specific fail closely.
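For context, here is a minimal sketch (mine, not from the thread) of why a lingering handle into a directory is a non-event on Real Linux: the rename succeeds anyway. On WSL builds of this era, the same sequence is what returned EACCES.

```python
import os
import tempfile

# Sketch: hold an open fd on a directory while renaming it.
# On real Linux this succeeds; WSL (at the time of this issue)
# returned EACCES for the same sequence.
base = tempfile.mkdtemp()
src = os.path.join(base, "package")
dst = os.path.join(base, "sstate-build-package", "package")
os.makedirs(src)
os.makedirs(os.path.dirname(dst))

fd = os.open(src, os.O_RDONLY)  # a "benign handle leak"
os.rename(src, dst)             # fine on Linux despite the open fd
os.close(fd)
```

This is exactly why nobody thinks about closing such handles before a rename on Real Linux.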

@therealkenc
Collaborator

therealkenc commented Nov 15, 2017

Also before you do anything else, run the thing as root just to make absolutely sure we're talking about the open directory handle problem and not something else. [Can't be too careful.]

@ericblade
Author

I'm not at all familiar with the Python code the problem is occurring in, though the number of times I see "python" in the issues you've linked does make me wonder if there's something weird in Python land. I'll have a look at it and see if anything obvious suggests it might be unintentionally holding a file open before doing a move. Or is it a different process entirely holding a file inside the directory open while Python tries to move it? There shouldn't be other things occurring in there, but perhaps something is being a little sticky, like in releases of Win8 when you'd sometimes have to wait minutes for files to be unlocked after a program had opened them before you could delete them.

I'm pretty sure OE will absolutely 100% refuse to run as root, but I'll try it tomorrow anyway. At least knowing the root cause of the problem (file somewhere in tree held open while a rename/move occurs) might allow me to cast a critical eye on it and find something.

@derekyu56

derekyu56 commented Nov 19, 2017

My yocto rocko build in WSL was also hit by this bug. I would appreciate suggestions on how to get around it. I tried running as root (by disabling sanity in sanity.conf) and still got hit by the bug.

@therealkenc
Collaborator

I tried running as root (by disabling sanity in sanity.conf) and still got hit by the bug.

That's basically the litmus test. If you get EACCES as root the problem is #1529.

@ericblade
Author

ericblade commented Nov 24, 2017

Hey @derekyu56, are you having the same problem in the same package? Others are reporting that different kinds of builds work (different distros; I don't recall which ones I read, but there were some people on the OE mailing list talking about it), or fail with different errors. I think it might be something specific to the linux-libc-headers recipe or its build/package process. I still haven't had time to open it up and check whether there's anything obvious.

@derekyu56

Yes, it is the same package with a different version:
ERROR: linux-libc-headers-4.12-r0 do_package: Error executing a python function in exec_python_func()

@ericblade
Author

Thanks @derekyu56, that gives me some insight that it might be package-specific. I have taken a very brief look at the Python script, and I didn't see anything that stuck out at me like a sore thumb. I'm starting to suspect that Linux Python in general can be problematic in WSL, based on the number of tickets therealkenc mentioned that also mention Python. I just had a thought: I wonder if the script's current working directory happens to be the directory that it's trying to move. Anyway, that's just speculation without real data to go on.

What I think I want to do next, when I have time, unless someone else gets to it before me (which is entirely possible, given my serious lack of time for personal projects right now):

in local.conf:

    BB_NUMBER_THREADS="1"
    BB_NUMBER_PARSE_THREADS="1"
    PARALLEL_MAKE="-j4"

then run strace bitbake linux-libc-headers and analyze it to see if I can find what is being held open. I think first, though, I'll go and print the CWD at the point of failure.

@ericblade
Author

FWIW, I've tracked down the problem a little when running on a Windows drive: a process called "Cooker" (according to top) spins the CPU up to max and just hangs there until terminated. (I let it run for 4 hours; normally it doesn't stick there for more than a few seconds on a real Linux machine, or when run from the WSL root drive.)

I'm also trying to put a little time into figuring out the problem with linux-libc-headers, and ran into another problem that seems to relate to fork() not working exactly as expected: the "db-native" package (not sure if it's included in the distro others are using) fails when configuring, with "configure: fork: Invalid argument" followed by "wait_for: No record of process 2857". I saw some traffic on the oe-dev mailing list suggesting that fork issues in WSL would tank efforts to make it work, but it wasn't specific about what the issues might be.

@derekyu56

Using the bitbake -k option to continue after errors, I found that linux-libc-headers-4.12-r0 is not the only package with this problem. At least two others fail the same way:

ERROR: update-rc.d-0.7-r5 do_package: Error executing a python function in exec_python_func() autogenerated:
ERROR: base-files-3.0.14-r89 do_package: Error executing a python function in exec_python_func() autogenerated:

This is not a package specific problem.

@ericblade
Author

ericblade commented Nov 25, 2017

So, weirdly enough, it seems to be caused in bitbake by some sort of race condition. I have no idea why this fixed it, or why it appears to have fixed the problems with db-native and fork, but:

I am currently running the build with a greatly overloaded number of tasks. I wanted to see whether that would crank up all the cores on my CPU, because the regular number didn't. With more than double the operations going on, I've managed to get through more than double the build I had previously, and it's still going. Either that, or those packages failed an hour ago and it's still carrying on with other tasks. :-)

I added this to my local.conf

BB_NUMBER_THREADS="16"
BB_NUMBER_PARSE_THREADS="32"
PARALLEL_MAKE="-j8"

The first time I ran it, it aborted fairly early in the process; when I ran it again, it seemed to rip right through both the spot it crashed on earlier (db-native) and linux-libc-headers, plus a few other packages that I thought had failed on me in the past.

It's not a great solution, but at least it gets farther into finding any more potential bugs, or having a built system.

@therealkenc
Collaborator

Try going the other way.

BB_NUMBER_THREADS="1"
BB_NUMBER_PARSE_THREADS="1"
PARALLEL_MAKE="-j1"

That is going to be slow as #$@&%* but might work. Possibly (probably?) one thread has not had a chance to close its handles before a different one tries to rename the directory. Why would increasing the count help? That's a pretty good question 🤷. Maybe more threads just mean a higher chance of getting lucky for obscure scheduling reasons. Maybe single-threaded fails all the same, but we won't know until someone tries.

WSL is pretty rock solid in terms of syscall level concurrency, for what it is worth. The whole edifice would collapse if it wasn't. With the usual caveat that anything is possible per Murphy.

@ericblade
Author

ericblade commented Nov 25, 2017

I have a working theory that I don't know how to specifically test, that a previous step in the build process for that package has locked a file, then released it, but that NTFS or WSL or Bitbake or someone else somewhere in between isn't releasing it all the way....

and that increasing the task concurrency causes the amount of time in between the individual steps to increase, giving whatever has that file open/locked time to release it all the way.

Unfortunately, I got a big ol' BSOD around the time I reached 2000 tasks completed in the build (out of 7000.. sigh) and now I get to clean up all the pieces of packages that are halfway through some step, and don't know how to recover.

I had intended on going down to 1 thread once I had all the dependencies built to get to the linux-libc-headers package.. and much to my surprise, it blew right through.

@derekyu56 thanks for the -k switch, didn't know about that. I wonder if there's anything common to those two packages and linux-libc-headers. My guess is that update-rc.d and linux-libc-headers probably don't have any sort of compilation step, and probably go straight from fetch to unpack to install then to package.

@ericblade
Author

ericblade commented Nov 25, 2017

So, running it with -k, everything passed until it hit towards the very end, when a ton of LuneOS specific packages hit, that were all dependent on the rest of the system. My suspicion about packages that don't include any involved build step seems to be right on -- 48 tasks failed, 47 of them with the exact same error in do_package. Every single one of those 47 packages appear to be files that are installed/packaged directly after the fetch step, with little to no handling in an intermediate configure/compile phase. The 48th was a compile error in gcc-cross-initial.

Sorry for spamming people's inboxes on a closed topic -- if there's a better forum to discuss this, I'd take it there. It's clearly OE related, but only in WSL . . and not clear yet that a workaround / fix would be appropriate for OE.

I see also that it didn't blow by linux-libc-headers .. it for some reason executed it towards the end of the build, instead of at the 900 package mark like usual. Possibly due to the amount of concurrency I enabled.

... before i lose the terminal output...

Summary: 48 tasks failed:
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-devtools/gcc/gcc-cross-initial_6.3.bb:do_compile
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-extended/shadow/shadow-securetty_4.2.1.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneui/recipes-webos/luna-webkit-api/luna-webkit-api.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneui/recipes-webos/luna-sysmgr-ipc-messages/luna-sysmgr-ipc-messages.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-core/os-release/os-release.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneui/recipes-webos/rdx-utils-stub/rdx-utils-stub.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-graphics/wayland/wayland-protocols_1.7.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-devtools/run-postinsts/run-postinsts_1.0.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneui/recipes-webos/cpushareholder-stub/cpushareholder-stub.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-core/base-files/base-files_3.0.14.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-kernel/linux-libc-headers/linux-libc-headers_4.10.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-core/volatile-binds/volatile-binds.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-core/update-rc.d/update-rc.d_0.7.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-support/iso-codes/iso-codes_3.74.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-support/cordova/cordova_2.3.0.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosinternals.preware.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.calculator.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/frameworks/mojoservice-frameworks.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/mojoloader/mojoloader.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-bsp/keymaps/keymaps_1.0.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/media-permission-service.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/messaging-accounts/messaging-accounts.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosinternals.tweaks.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-support/ca-certificates/ca-certificates_20161130.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/luna-systemui/luna-systemui.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-smartphone/meta-android/recipes-android/android-system-compat/android-system-compat_1.0.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/keymanager.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.filemanager.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.memos.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/app-services/app-services.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/org.webosports.service.licenses.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/frameworks/underscore.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/org.webosports.service.devmode.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.maps.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.tasks.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.pdf.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.testr.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/downloadmanager.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/frameworks/foundation-frameworks.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/org.webosports.cdav.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/luna-init/luna-init.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.contacts.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/services/org.webosports.service.messaging.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-luneos/apps/org.webosports.app.photos.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/meta-webos-ports/meta-luneos/recipes-webos/luna-applauncher/luna-applauncher.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-kernel/linux-firmware/linux-firmware_git.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-core/packagegroups/packagegroup-core-ssh-openssh.bb:do_package
  /home/eric/luneos/webos-ports-env/webos-ports/openembedded-core/meta/recipes-devtools/qemu/qemuwrapper-cross_1.0.bb:do_package
Summary: There were 3 WARNING messages shown.

@therealkenc
Collaborator

but that NTFS or WSL or Bitbake or someone else somewhere in between isn't releasing it all the way....

Anything is possible, but I am gonna say 'nah'. File operations (including close()) are atomic on NTFS and WSL, or the whole edifice would fall down. Bitbake (and everything in between) doesn't care whether they are closed, because whether they are or aren't isn't a concern on Real Linux.

Here's a work-around that might be worth trying. Track down that line in the Python script quoted in the OP:

 *** 0624:        os.rename(plain, pdir)

Guard that one line with try: and except OSError:, and if the exception is raised, sleep for 100ms and retry the rename. Loop until success, or re-throw after, say, 10 tries. If the problem is concurrency related (whichever theory), eventually the "other thing" will have released the handle and that os.rename() will succeed. Or something.
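A sketch of that guarded retry (illustrative; written against the sstate_package() excerpt in the OP, not tested inside bitbake itself):

```python
import os
import time

def rename_with_retry(src, dst, tries=10, delay=0.1):
    """Retry os.rename() when it raises PermissionError (EACCES/EPERM),
    re-raising after `tries` attempts. Sketch of the workaround
    suggested in this thread, not actual bitbake code."""
    for attempt in range(tries):
        try:
            os.rename(src, dst)
            return
        except PermissionError:
            if attempt == tries - 1:
                raise
            time.sleep(delay)  # give the "other thing" time to drop its handle
```

At line 0624, os.rename(plain, pdir) would become rename_with_retry(plain, pdir). As a later comment in this thread reports, the retry loop did not rescue this particular failure, which was itself useful evidence.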

@derekyu56

Looks like the solution is right in the python manual (https://docs.python.org/3/library/os.html):

quote
os.rename ....On Windows, if dst already exists, OSError will be raised even if it is a file.
...
If you want cross-platform overwriting of the destination, use replace().
unquote

I replaced the os.rename() calls at lines 359 and 620 of meta/classes/sstate.bbclass with os.replace(), and the linux-libc-headers-4.12-r0 package passed the build.

There are two other os.rename() calls in the same file; I did not change those.

These simple changes triggered a total rebuild. It really sucks.

@derekyu56

Sorry, my computer was too slow; it didn't actually fix the problem. The package failed later, and os.replace() still generates the EACCES error.

@ericblade
Author

ericblade commented Nov 26, 2017

bitbake-strace.zip

Attached is a strace of mostly just the failing packaging process for a single package, after I had modified the script to retry on failure, so there are a couple hundred retries after the initial failure. Retrying did not fix it. I tried to match up all the file opens to file closes and didn't see any mismatches; I also checked the cwd of the script, and that wasn't the culprit either. I just don't have any idea what is holding which file open that would prevent the os.rename from succeeding.

In this case, it's /home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/package that is failing to move to /home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/sstate-build-package//package
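The open/close matching attempted above can be sketched as a toy strace-log parser (a hypothetical helper of my own; real strace -f output also splits calls across '<unfinished ...>'/'<... resumed>' lines and creates fds via dup()/fcntl(F_DUPFD), neither of which this simple version handles):

```python
import re

def unclosed_fds(lines):
    """Toy matcher for strace output: per PID, record fds returned by
    open()/openat() and drop them on close(). Anything left over was
    never closed. Fds duplicated via dup()/fcntl(F_DUPFD) and calls
    split into unfinished/resumed pairs are not tracked here."""
    open_re = re.compile(
        r'^(\d+)\s+open(?:at)?\((?:AT_FDCWD,\s*)?"([^"]+)".*\)\s*=\s*(\d+)')
    close_re = re.compile(r'^(\d+)\s+close\((\d+)\)')
    held = {}  # (pid, fd) -> path
    for line in lines:
        m = open_re.match(line)
        if m:
            held[(m.group(1), m.group(3))] = m.group(2)
            continue
        m = close_re.match(line)
        if m:
            held.pop((m.group(1), m.group(2)), None)
    return held
```

As the next comments show, the fd that actually mattered here was created by fcntl(F_DUPFD), which is exactly the kind of handle a naive open/close matcher misses.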

@therealkenc
Collaborator

I just don't have any idea what is holding which file open that would prevent the os.rename from succeeding.

Reopening, since there's no positive evidence of an open handle. Thanks for digging.

@bitcrazed bitcrazed assigned SvenGroot and scooley and unassigned bitcrazed Nov 27, 2017
@bitcrazed
Contributor

bitcrazed commented Nov 27, 2017

Thanks for taking the time to capture & share the trace @ericblade.

Assigning to @scooley @SvenGroot @sunilmut for triage.

@therealkenc
Collaborator

therealkenc commented Nov 27, 2017

Curiosity killed the cat so I looked.

[...]
31182 open("/home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/pseudo/", O_RDONLY <unfinished ...>
31182 <... open resumed> )              = 6
[...]
31182 fcntl(6, F_DUPFD, 20 <unfinished ...>
31182 <... fcntl resumed> )             = 21
[...]
31182 rename("/home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/package", "/home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/sstate-build-package/package" <unfinished ...>
31182 <... rename resumed> )            = -1 EACCES (Permission denied)

FD 21 is never closed before the rename().

I don't know if I want to be so declarative as to dupe and close this issue since it has already been closed and re-opened once. But I'm about 98% sure directory rename() EACCES fails when running as root are going to be #1529 98% of the time (sic). Best course of action is probably going to be to address the known issue and then see how many suspected dupe dominoes fall.

@therealkenc
Collaborator

Maybe (spitballing, because I haven't thought about it hard enough) rename() should return EBUSY instead of EACCES, so we can identify cases where lack of POSIX directory rename is the cause. Just in the interim while a complete fix is worked out. If that is doable, we can just grep for the error instead of playing "find the open handle".

@ericblade
Author

ericblade commented Nov 28, 2017

OK, I think I want to know how you figured that out... and then I think I don't want to know, because it's probably crazy. :-)

So, just a few lines under the line you pointed out (line 612749),

31182 <... fcntl resumed> )             = 21

There is (line 612771)

fcntl(21, F_SETFD, FD_CLOEXEC <unfinished ...>

According to stackoverflow, this sets a "close-on-exec" flag, which causes the file descriptor to "be automatically (and atomically) closed when any of the exec family functions succeed".

Then the rename fails at line 714814.

So, 21 is a duplicate of fd 6, line 612747?

31182 fcntl(6, F_DUPFD, 20 <unfinished ...>

6 appears to be line 612733?

31182 open("/home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/pseudo/", O_RDONLY <unfinished ...>
31115 poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)
31115 clock_gettime(CLOCK_REALTIME, {1511643548, 752880600}) = 0
31115 clock_gettime(CLOCK_REALTIME, {1511643548, 752969900}) = 0
31182 <... open resumed> )              = 6

So... it looks like that is a handle to ....-r0/pseudo/, not -r0/package/?

Is CLOEXEC not working or not implemented? Are no exec() calls being executed between here and there? Are the exec() calls not entirely implemented?

(edit: I had originally pinpointed FD 6 to a different file entirely; I changed the last part of this text quite a bit after I realized I needed to be looking at a different fd 6)
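The close-on-exec behaviour discussed above can be demonstrated in a few lines (a sketch for this writeup, not bitbake code): an fd made inheritable survives exec in a child process until FD_CLOEXEC is set on it.

```python
import fcntl
import os
import subprocess
import sys

# Sketch: FD_CLOEXEC closes the fd atomically when an exec*() succeeds.
# It does nothing for a plain fork()/clone() without a subsequent exec.
r, w = os.pipe()
os.set_inheritable(r, True)  # let the fd survive exec by default

probe = (f"import os\n"
         f"try:\n    os.fstat({r}); print('open')\n"
         f"except OSError:\n    print('closed')")

def child_sees():
    # Spawn a fresh interpreter (fork + exec) and ask whether the fd
    # is still usable on the other side of the exec.
    out = subprocess.run([sys.executable, "-c", probe],
                         capture_output=True, text=True, close_fds=False)
    return out.stdout.strip()

before = child_sees()                            # 'open'
fcntl.fcntl(r, fcntl.F_SETFD, fcntl.FD_CLOEXEC)  # the call seen in the strace
after = child_sees()                             # 'closed'
```

Which is why the question above is the right one: the flag only helps if an exec actually happens between the fcntl and the rename.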

@therealkenc
Collaborator

Is CLOEXEC not working or not implemented? Are no exec() calls being executed between here and there? Are the exec() calls not entirely implemented?

Gold star for tenacity on this one 🌟. I think you might be onto something. I saw the fcntl(21, F_SETFD, FD_CLOEXEC) but convinced myself it didn't matter. I forgot the kids.....

Line 685930:

31182 clone(child_stack=0x7f33119efff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f33119f09d0, tls=0x7f33119f0700, child_tidptr=0x7f33119f09d0) = 31196

Is there any chance you can test this on Insider 17046? It could be that FD_CLOEXEC isn't working. But I have another theory...

FD_CLOEXEC works, but doesn't get a chance to work. Because #1878 (message). That clone() call should have blocked process 31182 until 31196 exits. But on your WSL build (FCU), it won't. If the rename() in 31182 gets to the fd first -- boom. That would explain a lot about your shove threads down the machine's throat test too.

Gotta say, if that does turn out to be the problem, getting from a CLONE_VM misbehaviour to a EACCES fail was a long bloody bridge to walk.

@therealkenc
Collaborator

therealkenc commented Nov 28, 2017

Couple of final data points for the night. The good news is that FD_CLOEXEC works as advertised on WSL; I checked because I was curious. This is true in both FCU and Insiders.

The bad news is that in doing the cursory test case, it looks like my first impression about that fcntl(21, F_SETFD, FD_CLOEXEC) was probably right: it doesn't matter. I set up a test case with what amounts to the same sequence of events, and the fd is open on both Real Linux and WSL.

[edit Below is total nonsense because I failed to register that there is no CLONE_VFORK flag]
But that doesn't mean it doesn't get closed by bitbake through some means I missed. That clone() I cited will behave differently on 17046, which could explain racey-like behaviour. The only way to know for sure is to try. But if I were a betting person, the smart money is still on a vanilla open-handle problem in PID 31182.
[end nonsense]

@ericblade
Author

Can you tell me if I was incorrect about the file handle appearing to point to a different directory than the one that it is failing to move?

I don't currently have a machine on Insiders (the one that got bit by the "does not boot on AMD devices" bug a few releases back had to be restored, and is now in use by someone else :| ), but I can probably get there.

The race-condition idea turned out to be a non-starter. Because I had enabled such high amounts of parallelization, the failing tasks got bumped far to the back of the queue instead of happening in the normal order, where you usually see all of the tasks for a specific package complete at once -- it was trying to do as much fetching and building as it could, and saving the packaging steps for later, or something like that.

On the bright side, I did just remember that it should be possible to run the failing step on its own, and maybe I can strace just that step, saving hundreds of thousands of lines.

@jstarks
Member

jstarks commented Nov 28, 2017

@therealkenc Only CLONE_VFORK would cause 31182 to wait for 31196 to exit, and that wasn't included. FCU supports the included set of clone flags perfectly (CLONE_VM doesn't work by itself, but CLONE_VM | CLONE_THREAD works fine).

@therealkenc
Collaborator

@jstarks
Dammit, my head saw a CLONE_VFORK in there (probably because I wanted there to be one) and clearly there isn't 🙄. Thanks.

@ericblade

it should be possible to run the failing step on its own

You can pull out the processes with grep. grep -E "^31182" bitbake-strace.txt. You can also grep for "clone(" in that output to find the kids to pull out. That's the secret sauce you were asking about. But don't. This one is pretty much "done".

@ericblade
Author

So… should I try on the Insider build? I'm still slightly confused as to the exact source of the problem. I'm trying to find out which file is open, so that I might have a hope of working around the problem somewhere in one of the bitbake classes or recipes. Now that I've narrowed it down to a common set of recipes that all share similar characteristics, there is probably something that should be closing the open handle but isn't, because of something these recipes have in common.

That's why I was trying to track down which file is actually open, which led me to the handle at -r0/psuedo instead of -r0/package. That must be incorrect, though, since the open handle that is blocking the rename would have to be in the directory tree being moved, and -r0/psuedo is not in -r0/package.

@therealkenc
Collaborator

therealkenc commented Nov 28, 2017

[edit with re-think]

So.. should I try on the Insider build?

Personally I wouldn't. #1878 is not going to change anything.

would have to be in the directory tree being moved, and -r0/psuedo is not in -r0/package

Okay I see what you mean now. You could put a big sleep in the Python code after catching the error, and then look at /proc to see what handles are open. But you are probably going to find "oh yeah that". Effort.
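The "big sleep after catching the error" idea would amount to something like the following; debuggable_rename is a hypothetical stand-in for wherever the failing os.rename() call lives in the bitbake code:

```python
import os
import time

def debuggable_rename(src, dst):
    """Hypothetical wrapper around the failing rename: on failure,
    sleep for a long time so the process's open handles can be
    inspected from another shell via /proc/<pid>/fd."""
    try:
        os.rename(src, dst)
    except OSError as err:
        print("rename %s -> %s failed (%s); pid %d sleeping for inspection"
              % (src, dst, err, os.getpid()))
        time.sleep(3600)  # window to poke around /proc
        raise
```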

@ericblade
Author

ericblade commented Nov 28, 2017

Well, I'm interested in putting in the effort to figure it out, so I can either post a workaround (fixing the recipes), or we can determine conclusively that there's not another bug hiding. My Linux builder is no longer supported in real Linux, so I'm itching to get rid of it :-)

Where do I look in /proc ? and I'd guess I probably need to compare that to another strace run to see what the handles actually point to?

@therealkenc
Collaborator

therealkenc commented Nov 29, 2017

ls -l /proc/<pid>/fd for the process that is running. You can get all the PIDs that are active at the point you sleep with pstree -p. You can have the active strace output file open in Sublime or similar to see what is going on as well. Or if you want to go hard, launch or attach the thing under gdb, break with ctrl-c at the sleep, and use the files command. That's a couple of common approaches, but there are other tools out there as well. If you are doing a deep dive you might want to contact whoever maintains bitbake, because people who live and breathe the code often (but not always) can tell you "oh, that's <reason>" if they are given a little background information on what's going wrong.

[edit] There's gotta be an idiomatic way of finding the open file handles in Python too, but I wouldn't know.
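On Linux (and WSL), one idiomatic way is to read the process's own /proc entry; this sketch assumes a /proc filesystem is present, and open_fds is just an illustrative name:

```python
import os

def open_fds(pid="self"):
    """Return {fd: target} for the given process by reading /proc --
    the same information as `ls -l /proc/<pid>/fd` from the shell."""
    fd_dir = "/proc/%s/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # A descriptor can vanish between listdir() and readlink().
            pass
    return result

# e.g. dump what this process currently has open:
for fd, target in sorted(open_fds().items()):
    print(fd, "->", target)
```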

@therealkenc
Collaborator

Duping into #1529. If 1529 flips status and building LuneOS still fails, open a new issue to that effect and we can take another run at it.

@ericblade
Author

ericblade commented Dec 3, 2018

I guess I come back to this about once a year, give or take. :-D I wanted to see if the newest Windows release (1809) had made any headway on this, so I dug it out again. It stops in exactly the same spots, same packages (linux-libc-headers and base-files). So I finally opened up the code that was failing and swapped the Python "os.rename" call for a "shutil.copy" call, and the build has gone from getting about 5% of the way before failing to about 15%, and it is still running without a failure so far. If I don't run out of disk space (I haven't attempted any of the hacks to move WSL to a non-boot disk yet), it's looking good, but it will take several hours to complete. (In WSL it's moving at a much slower clip than my native Linux box, which is slower, has less memory, and only has a spinning disk... but that's probably due to the sheer volume of disk access, which we know is a bit slow in WSL.)
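The swap described above amounts to something like the following sketch; move_tree is a hypothetical name, and a copy-based shutil.move stands in for the plain os.rename, trading speed for tolerance of the open-handle problem:

```python
import errno
import os
import shutil

def move_tree(src, dst):
    """Try the fast rename first; fall back to a copy-based move
    when os.rename() is refused (e.g. EACCES under WSL)."""
    try:
        os.rename(src, dst)
    except OSError as err:
        if err.errno != errno.EACCES:
            raise
        # Copying and deleting is much slower, but it does not require
        # the source tree to be free of open handles at rename time.
        shutil.move(src, dst)
```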

@therealkenc
Collaborator

@ericblade

I wanted to see if newest Windows release (1809) had made any headway on this

Nah, nothing new on the underlying problem.

As a time sink you might try running the LuneOS build under the recently posted fuzzyTew/nodejswsl. His work-around is aimed at NodeJS scenarios, but assuming your open handle is transient (and I think it might be), it has a better-than-nothing chance of working for you. The hook does a sleep/retry along the lines of what I was suggesting back in November 2017, only lower down, so you don't actually have to patch the Python code.
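The sleep/retry approach boils down to something like this sketch (not the actual nodejswsl hook; rename_with_retry and its parameters are illustrative):

```python
import errno
import os
import time

def rename_with_retry(src, dst, attempts=5, delay=0.5):
    """Retry os.rename() on EACCES, on the theory that the
    offending open handle is transient and will go away shortly."""
    for attempt in range(attempts):
        try:
            os.rename(src, dst)
            return
        except OSError as err:
            if err.errno != errno.EACCES or attempt == attempts - 1:
                raise
            time.sleep(delay)
```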
