Operations frequently canceled when accessing a gcsfuse mount with a Go application #122

Open
axolotlgeoff opened this issue May 26, 2022 · 2 comments

@axolotlgeoff

Here is what I suspect is happening:

  1. A Go application accesses the mount

  2. The application raises SIGURG signals due to a Go feature introduced in 1.14 for non-cooperative goroutine preemption

  3. FUSE handles the signal and raises an INTERRUPT:

    If a process issuing a FUSE filesystem request is interrupted, the following will happen:

    • If the request is not yet sent to userspace AND the signal is fatal (SIGKILL or unhandled fatal signal), then the request is dequeued and returns immediately.
    • If the request is not yet sent to userspace AND the signal is not fatal, then an interrupted flag is set for the request. When the request has been successfully transferred to userspace and this flag is set, an INTERRUPT request is queued.
    • If the request is already sent to userspace, then an INTERRUPT request is queued.
  4. The resulting InterruptOp is handled and cancels the in-flight operation

  5. This cancels the context that is passed to the HTTP request to GCS, resulting in errors such as:

    2022/05/25 15:20:21.417866 LookUpInode: operation canceled, clobbered: StatObject: not retrying StatObject("filename.example"): Get "https://storage.googleapis.com:443/storage/v1/b/example/o/filename.example?projection=full": net/http: request canceled
    fuse: 2022/05/25 15:20:21.417893 *fuseops.LookUpInodeOp error: operation canceled
    

Others, such as Docker and Gitea, have solved this by filtering out SIGURG, as referenced in golang/go#37942. However, as far as I can tell there is no option to ignore SIGURG here, since it is FUSE that handles the signal, and by the time it reaches this library as an interrupt the information about the originating signal is lost.

Our current workaround is to set GODEBUG="asyncpreemptoff=1" for the applications which use gcsfuse mounts.
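
If the affected application is launched from another Go program, the workaround can be applied when building the child's environment. The sketch below is one way to do that, under the assumption that any existing GODEBUG settings should be kept; `withAsyncPreemptOff` is a hypothetical helper and `your-app` is a placeholder binary name that is never started here.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// withAsyncPreemptOff returns a copy of env whose GODEBUG entry contains
// asyncpreemptoff=1. Other GODEBUG settings are preserved; any existing
// asyncpreemptoff value is replaced.
func withAsyncPreemptOff(env []string) []string {
	const key = "GODEBUG="
	const setting = "asyncpreemptoff=1"

	out := make([]string, 0, len(env)+1)
	found := false
	for _, kv := range env {
		if strings.HasPrefix(kv, key) {
			found = true
			var fields []string
			for _, f := range strings.Split(strings.TrimPrefix(kv, key), ",") {
				if f != "" && !strings.HasPrefix(f, "asyncpreemptoff=") {
					fields = append(fields, f)
				}
			}
			fields = append(fields, setting)
			kv = key + strings.Join(fields, ",")
		}
		out = append(out, kv)
	}
	if !found {
		out = append(out, key+setting)
	}
	return out
}

func main() {
	// "your-app" is a placeholder; the command is only configured, not run.
	cmd := exec.Command("your-app")
	cmd.Env = withAsyncPreemptOff(os.Environ())

	for _, kv := range cmd.Env {
		if strings.HasPrefix(kv, "GODEBUG=") {
			fmt.Println(kv)
		}
	}
}
```

Note that this disables asynchronous preemption for the whole child process, so it trades the cancelled operations for the pre-1.14 scheduling behaviour.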

Although the issue shows up when using gcsfuse, I think the solution lies in this package; please let me know if that's not right. I'm happy to create a PR for this, but I'm not entirely sure how to solve it. Can you think of any solutions? It seems more people are running into this issue: GoogleCloudPlatform/gcsfuse#288, GoogleCloudPlatform/gcsfuse#562.

@stapelberg
Collaborator

Thanks for the detective work!

I’m not sure I fully understand the issue yet, though. It sounds like the problem is in the Linux kernel: the SIGURG signal is received while the program is in a syscall that operates on a FUSE file system.

I would assume that the process’s signal mask should be respected by the kernel here, so if you ignore SIGURG in your program, the kernel wouldn’t interrupt?

Either way, do you have an easy way to reproduce this issue? Perhaps with one of the example libfuse file systems, and a minimal Go program that triggers the issue?

Thanks

@bjornleffler

This issue is easy to reproduce consistently: try cloning any Git repository into a directory on a FUSE mount. GCSFuse is one way to get such a mount.
