Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POSIX Spawn 2: Electric Boogaloo #487

Merged
merged 19 commits into from
Dec 10, 2021
Merged

POSIX Spawn 2: Electric Boogaloo #487

merged 19 commits into from
Dec 10, 2021

Conversation

Eugeny
Copy link
Contributor

@Eugeny Eugeny commented Aug 9, 2021

Here's another go at posix_spawn, with a helper handling cwd/uid/gid/CT/setuid and then exec'ing itself into the new process.

  • fork() is replaced with posix_spawn()ing a helper process, allowing 300x spawn speedup on macOS and 2 to 20x on Linux.
  • Added an option to close all parent FDs in the child process (closeFDs, UNIX only). This is free on macOS via POSIX_SPAWN_CLOEXEC and costs 300 ms on Linux due to having to close FDs one by one.

@Eugeny Eugeny mentioned this pull request Aug 9, 2021
@jerch
Copy link
Collaborator

jerch commented Aug 9, 2021

Wow, thats like a full rewrite of the whole process setup handling in node-pty, removing the fork stuff. 👍

The bad news are that posix_spawn() is still taking 300ms here

Hmm thats weird, any idea yet why? From what I've read, the problem with fork is the sandboxing on BigSur, where the memory transfer to the child process takes very long. In theory this problem should be gone with posix_spawn, since the process gets not clone-copied from the parent anymore.
Is that actually faster than fork? Is the slowdown related to the helper in between (is the direct spawning much faster)?

Lemme me know if you need some help or when to review, can at least assist for linux and freebsd to some degree.

Copy link
Collaborator

@jerch jerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early remarks from my side 😸

src/unix/pty.cc Outdated Show resolved Hide resolved
src/unix/pty.cc Outdated Show resolved Hide resolved
src/unix/pty.cc Outdated Show resolved Hide resolved
src/unix/pty.cc Outdated Show resolved Hide resolved
src/unix/pty.cc Outdated Show resolved Hide resolved
src/unix/pty.cc Outdated Show resolved Hide resolved
src/unix/spawn-helper.cc Outdated Show resolved Hide resolved
src/unix/spawn-helper.cc Outdated Show resolved Hide resolved
src/unix/spawn-helper.cc Outdated Show resolved Hide resolved
@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 9, 2021

Thanks for the review - I'll go through it tomorrow. It's still 2x faster than fork but I've expected it to be way faster than that. I've profiled it and it's posix_spawn itself taking these 300ms, even with eg. /bin/df as file.
I've just really didn't want to parse argv in the helper with my rusty C hence the socket stuff :D It was DGRAM based initially but can be replaced with a pipe now.

@jerch
Copy link
Collaborator

jerch commented Aug 10, 2021

Thanks for the review - I'll go through it tomorrow. It's still 2x faster than fork but I've expected it to be way faster than that. I've profiled it and it's posix_spawn itself taking these 300ms, even with eg. /bin/df as file.

I see, well thats unfortunate. Is there a mystic flag or something else, that can be set to speed up things?

I've just really didn't want to parse argv in the helper with my rusty C hence the socket stuff :D It was DGRAM based initially but can be replaced with a pipe now.

Well you'd have to parse the numbers back for gid/uid back, but thats quite easy. You dont need to decompose the whole argv thingy, for exec you can simply pass it on, e.g. execvp(argv[2], &argv[2]) (if the command starts in second third position).

@jerch
Copy link
Collaborator

jerch commented Aug 10, 2021

Btw solaris does not know the TIOCSCTTY ioctl. To properly attach a controlling terminal there, it is needed to do an open call to it, e.g. something like this (error handling skipped):

#if defined(TIOCSCTTY)
  ...
#else
  char *slave_path = ttyname(STDIN_FILENO);
  // open implicit attaches a process to a terminal device if:
  // - process has no controlling terminal yet
  // - O_NOCTTY is not set
  close(open(slave_path, O_RDWR));
#endif

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 10, 2021

I think it's worth considering, as the next step after this, doing the spawn itself in an async worker, to at least avoid locking up the Electron's UI thread.

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 10, 2021

Please take another look at it when you'll have time - I think this is as clean as it gets.

Still needs testing on Linux and other UNIXes that I don't have access to.

Ideally I would love this to get merged first and then work on an async spawn() separately

@jerch
Copy link
Collaborator

jerch commented Aug 10, 2021

Looks pretty cleaned up at a first glance, will review and test it later on under Ubuntu 18, FreeBSD 12 and OpenIndiana (solaris clone), which should cover most prominent POSIX systems. Dont have access to OpenBSD, NetBSD or AIX (to name a few more that are less common, not even sure if those have a working nodejs port).

@mofux
Copy link

mofux commented Aug 10, 2021

@Eugeny Do you know if the new spawn helper will have to be manually code-signed on OSX when packaging an Electron App? I'm a little bit concerned the extra helper may complicate thinks in that matter.

Copy link
Collaborator

@jerch jerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works so far under linux, but ofc I have a few remarks:

  • it is much much slower than the fork path (takes ~300ms on my machine to start bash)
  • slow startup leads to pty.write('abc') writes to show up on original tty (not sure yet why this happens). Compared to the current startup this is a showstopper / API breaker, as it means, that ppl prolly will have to wait for some onReady event, before they can interact with the pty process.
  • cwd and tty working correctly (thus setting cwd and controlling terminal works)
  • not tested: signals, uid/gid changes
  • more remarks in the code (see below)

Edit:

  • Linux ✔️
  • Solaris ✔️
  • FreeBSD ✔️

binding.gyp Outdated Show resolved Hide resolved
binding.gyp Outdated Show resolved Hide resolved
src/unix/spawn-helper.cc Show resolved Hide resolved
src/unix/spawn-helper.cc Outdated Show resolved Hide resolved
src/unix/spawn-helper.cc Outdated Show resolved Hide resolved
src/unix/spawn-helper.cc Outdated Show resolved Hide resolved
@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 10, 2021

@mofux yes, but existing native modules already need to be signed anyway. Good news is that codesign takes care of this automatically when signing an app bundle

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 10, 2021

@jerch re performance: turns out, this is due to file closing. Removing the fd closer loop gets it down to the same 1-3 ms as fork on my machine.

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 10, 2021

All done, perf is back to normal with the tradeoff of keeping FDs open. Error handling is revamped and passes errno back to the parent. Also added POSIX_SPAWN_USE_VFORK for a slight perf boost when available

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 10, 2021

After loading some large modules and allocating & filling a 100 MB buffer first:

  • fork: 15 to 20 ms
  • posix_spawn: still 1.5 ms

(all of the performance numbers here and above are on Linux)

@jerch
Copy link
Collaborator

jerch commented Aug 11, 2021

Oh well, yeah then lets not go with that security measure by default.

I suggest to keep it as configurable option, so ppl can opt-in here under more dangerous situations. Maybe with an additional check whether POSIX_SPAWN_CLOEXEC_DEFAULT was defined (assuming we dont need to loop-close FDs if the system already did it for us).

The danger is real, see https://labs.portcullis.co.uk/blog/exploiting-inherited-file-handles-in-setuid-programs/. And such a dangerous situation can easily be made up - imagine some terminal hub service, that pulls the authentication/authorization from some DB socket in the parent process, then uses node-pty to create ptys with dropped privileges - boom, the DB socket got leaked possibly open for shady manipulations on the auth parts.

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 11, 2021

I'm a dumbass who doesn't know how to benchmark stuff. Only the first spawn() takes 300 ms on macOS, subsequent spawns are all <5 ms.

Looks like an async spawn() won't be necessary after all!

@Eugeny Eugeny marked this pull request as ready for review August 11, 2021 15:37
@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 19, 2021

Released it in the latest Tabby beta - let's see how it goes.

@jerch
Copy link
Collaborator

jerch commented Aug 19, 2021

@Eugeny Sorry could not yet test it on FreeBSD, but I expect no bigger surprises there (the pty impl is very similar to linux). And it works really nice and fast on linux.

At first I was a bit unsure about the VFORK option, since it is blocking and sharing all memory with the parent, but after some reading it is prolly safe for the instant exec done by posix_spawn, and alot faster.

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 19, 2021

No problem. vfork() + exec() seems to be the recommended way to spawn processes in Linux, and although there's only a limited set of calls that are allowed to happen after a vfork, it should be safe to assume that glibc does it right.

@jerch
Copy link
Collaborator

jerch commented Aug 21, 2021

@Eugeny FYI - freebsd tested - works. (Well, some tests fail for other reasons, but not from your code...)

Edit:
Almost forgot to mention - freebsd does not include errno.h from any of the other headers in spawn_helper, thus needs

#include <errno.h>

explicit in spawn-helper.cc.

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 22, 2021

Done.

@Eugeny
Copy link
Contributor Author

Eugeny commented Aug 27, 2021

Let me know if anything else needs to be done to get this merged 🙌
Tabby v155 reports 421 active users on Mac and Linux, no bug reports and nothing coming in on Sentry.

@jerch
Copy link
Collaborator

jerch commented Sep 1, 2021

@Eugeny You get a 👍 from my side.

@Tyriar Any chance to look at this soon? Imho it affects quite some peeps...

@Tyriar
Copy link
Member

Tyriar commented Oct 7, 2021

@deepak1556 could you review this when you get a chance?

@deepak1556
Copy link
Contributor

@Eugeny @jerch Thanks for working on this change! I apologize for my delay in review, will test and add my comments this week.

@jerch
Copy link
Collaborator

jerch commented Dec 9, 2021

Isn't this embarrassing? A lot of ppl suffer from this, still this PR is not worth a ~15min review/feedback. Guys if you dont feel like maintaining this lib in a healthy state - just say so. Pretty sure there are enough ppl out there to take ownership in a responsible way, even without that CLA nonsense.
@Tyriar Told you more than once, that this lib has serious I/O issues, but nothing ever happened. Taking responsibility for OSS repos is not just about "hey it works for our vscode use case, so all fine" - nope, for infrastructure critical libs like this one it has to fit the broader picture dealing with all interface assumptions the underlying OS does. The fact, that this lib swallows I/O if used wrongly (mind you, that wrong usage means any command that does not enter a REPL loop itself within PIPE_BUF, e.g. not a shell), is just a big fault. Now ppl get a perceivable degradation on BigSur+, still nothing happens. Where is the bar for you to take action for real?

@deepak1556
Copy link
Contributor

@jerch not sure if you are following the upstream libuv refactor for posix_spawn which was the original inspiration for the suggestion of this change in the module #476. The PR also will suffer from the kernel bug mentioned in libuv/libuv#3257 (comment), I am waiting to follow what libuv does before taking any further action here from my side. Also to clarify, degradation should be perceived on Big Sur+ only if the module gets code signed, it doesn't happen in other cases as outlined in libuv/libuv#3050

I don't have any additional code reviews for this PR at the moment, I will let you and @Tyriar make the decision of merging the PR in its current state.

@deepak1556 deepak1556 removed their request for review December 9, 2021 12:18
@jerch
Copy link
Collaborator

jerch commented Dec 9, 2021

@deepak1556 Thx for the update on the issues behind, thats at least some information to work with (and no, I dont follow any upstream discussion on foreign repos). Itsn't a pity that I have to get cocky to get that information? Anyway, thx for the quick response.

@deepak1556
Copy link
Contributor

Sorry about that, I could have done a better communication from my side beforehand. I mostly get tied up with other core work that this kept falling through the cracks but definitely not a valid reason for slow response on this PR that is open for quite sometime.

Copy link
Member

@Tyriar Tyriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will pull into VS Code after the beta is released and see how that goes.

@Tyriar Tyriar merged commit 2f826f3 into microsoft:main Dec 10, 2021
@jerch
Copy link
Collaborator

jerch commented Dec 11, 2021

I want to apologize for my raging outburst above, which came out way too directed at @Tyriar. There is nothing to defend my words, in fact @Tyriar does a great job and is known for having moved things by big leaps in the past. Many thanks for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Big Sur hardened runtime versus fork yields large slowdown in node-pty.spawn()
5 participants