Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload lock held for longer than necessary #11083

Open
jkarni opened this issue Jul 11, 2024 · 3 comments
Open

Upload lock held for longer than necessary #11083

jkarni opened this issue Jul 11, 2024 · 3 comments
Labels
performance remote build The SSH store, ssh:, ssh-ng:, ... (split from protocol label 2024-07) store Issues and pull requests concerning the Nix store

Comments

@jkarni
Copy link

jkarni commented Jul 11, 2024

Describe the bug

When building with remote builders, very often only one build will make progress; other ones are stuck for quite a long time. With sudo lslocks | grep upload, you can see that processes are waiting on the .upload-lock for the machine (and stracing them confirms that they're blocked on flock(5, LOCK_EX). This can take a very long time. Interestingly, stracing the process that does have the lock seems to indicate that it's past the upload phase anyway - I see hundred of thousands lines with type 105:

write(2, "@nix {\"action\":\"result\",\"fields\":[221366008,0,0,0],\"id\":17878179326722332,\"type\":105}\n", 86) = 86

This could I suppose this could be happening in parallel to trying to copy files, though I somewhat doubt that.

Steps To Reproduce

  1. Enable a remote builder
  2. Kick off a bunch of simultaenous builds with --max-jobs 0
  3. Keep track of .upload-locks in lslocks
  4. Strace the processes to see what they're doing

Expected behavior

I expect the builds to start building more quickly

nix-env --version output

nix-env (Nix) 2.18.4

Additional context

Some investigation shows this is happening here. The original motivation for this logic, according to comments in the Perl precursor to this module from a decade ago, is to prevent multiple processes from trying to copy the same derivation over and over again.

It seems like the lock is potentially held too long. But moreover, it's too "big" a lock - we should probably only have a lock per store path + remote. And the alarm of 15 minutes also seems very long.

Priorities

Add 👍 to issues you find important.

@jkarni jkarni added the bug label Jul 11, 2024
@roberth roberth added performance store Issues and pull requests concerning the Nix store remote build The SSH store, ssh:, ssh-ng:, ... (split from protocol label 2024-07) and removed bug labels Jul 11, 2024
@roberth
Copy link
Member

roberth commented Jul 11, 2024

The remote build feature could use some love. Would you be interested to contribute improvements?

(note that the remote build label is brand new; we used to have only protocol, and even that is fairly new) labels not helpful

@jkarni
Copy link
Author

jkarni commented Jul 11, 2024

Yup - am investigating the impact of some changes on build speed!

@jkarni
Copy link
Author

jkarni commented Jul 15, 2024

Reducing the timeout to 2 minutes dramatically improved the speed at which builds happen. The curious thing is that the wait was often quite long - much longer than the timeout. This wouldn't make much sense if the timeout was working as (I take it that was) intended. Moreoever, as I mentioned, the process holding the lock often seems to be past the copy stage. So though decreasing the timeout helps, I'm pretty sure there are other things to fix here - things which, if fixed, potentially would obviate the need to reduce the timeout.

In general, I'm not sure I understand yet how signals are working here. It seems a signal handling thread sometimes exist, and I presume it would be responsible for making sure these alarms only go to the right thread, but I also don't see the relevant sigmask.

Using signal (instead of sigaction) is also, it seems, a potential concern. Additionally, the existence of the second alarm is quite mysterious to me.

All of this is my quite untutored thoughts - just writing them here by way of update, and will continue investigating. Though for my own machine my inclination is to just disable this flock logic altogether.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance remote build The SSH store, ssh:, ssh-ng:, ... (split from protocol label 2024-07) store Issues and pull requests concerning the Nix store
Projects
None yet
Development

No branches or pull requests

2 participants