Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

git-lfs fetch fails ~50% of the time when using helper mode (macos, linux) #853

Closed
olsen232 opened this issue May 31, 2023 · 2 comments · Fixed by #856
Closed

git-lfs fetch fails ~50% of the time when using helper mode (macos, linux) #853

olsen232 opened this issue May 31, 2023 · 2 comments · Fixed by #856

Comments

@olsen232
Copy link
Collaborator

olsen232 commented May 31, 2023

git-lfs fetch can be called by kart clone, kart checkout, kart lfs+ fetch when working with tile-based datasets (point clouds, rasters). Of these kart clone is the most reliable way to reproduce since the lfs-cache is guaranteed to be empty at this point.

The crash is characterised by these two lines:
fatal error: signal_recv: inconsistent state at the top of the crash
Error: There was a problem with git-lfs fetch: Command '['git-lfs', 'fetch', 'origin', '<GIT_SHA>']' returned non-zero exit status 2. at the bottom of the crash

Re-running the command often succeeds.
Re-running the command with KART_USE_HELPER=0 always succeeds.
Since this git-lfs command shouldn't need to worry about if it's connected to a tty, or need to read from stdin, it's not clear what's going wrong with helper mode that's breaking git-lfs.

@olsen232
Copy link
Collaborator Author

This seems to pretty reliably toggle between a good-state and a bad state.
After the crash, it moves it into a good state: the next command will succeed.
The successful next command moves it into a bad state: the next command will fail.

Any kart command toggles this between good and bad state: it is not necessary that the command calls git-lfs. So, for instance, it is possible to reliably run kart clone without it ever crashing provided after each one you run kart --version or kart status or something that is guaranteed not to fail, and do so an odd number of times.

Killing the kart-helper process moves it into a bad state: the next command will always fail.

This is definitely related to #854: kart --help never works with helper mode, but the way in which kart --help fails can also tell you which state you are in. If kart --help has no output, you are moving into a bad state: the next command will fail. If kart --help has the output sh: 1: Bad file descriptor then you are moving into a good state: the next command will succeed.

@olsen232 olsen232 changed the title git-lfs fetch fails ~50% of the time when using helper mode (macos) git-lfs fetch fails ~50% of the time when using helper mode (macos, linux) Jun 1, 2023
@olsen232
Copy link
Collaborator Author

olsen232 commented Jun 1, 2023

Definitely something similar is happening on linux, where using the helper causes subprocesses to fail but only exactly half the time.

I was struggling to exercise this bug properly on my linux VM: man wasn't working, my SSH config wasn't working - so I wrote an ext-run script. The following script works half the time for me, on macos or linux:

import subprocess

def main(ctx, args):
    subprocess.check_call(["pwd"])

From here I discovered that if you explicitly set the following: stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr - then it works 100% of the time. This is a bit like a workaround I discovered earlier: when running the git-lfs subprocess, if you set stdin to dev-null when making the subprocess call, it will always work - but now it is clearer why.

Ordinarily, setting stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr would have no effect when running a subprocess - those are the default values. However, in helper mode, the values of sys.stdin, sys.stdout, sys.stderr change often. Presumably, failing to set the keyword-args means they take some default value, but not the latest, correct value. Or to put it another way: Python's "default-stdin-when-calling-a-subprocess" ends up out of sync with "sys.stdin", and sys.stdin has the value we want.
I still don't know how these values being out-of-sync results in a 2-state cycle of things being more or less broken.

Basically the fix is setting these values everywhere. This also (I think, I need to do further testing) even fixes interactive editors - the tty-capability of these streams is still working. But, I have some decisions about how to fix them:

  • if I don't expects a subprocess to listen for input, should I just give it stdin=/dev/null? Or should I give it sys.stdin, in case in rare circumstances it needs to interact with the user. Probably the second.
  • Should I write my own subprocess module that delegates to the existing subprocess module, but always sets stdin, stdout, and stderr to the appropriate values, so that to verify all of Kart is correct I just have to make sure I never import the wrong subprocess module? Probably I should: the only obstacle is that Kart already has some half-hearted wrappers on the subprocess module, which I need to look through to see how they fit in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant