Threads/Exec/Wait Refactor #48

rennergade · 2020-02-15T22:11:23Z

I've started a document detailing the overall plan for the refactor. So far I've written about the thread infrastructure.

This document can be found here.

rennergade · 2020-02-20T21:50:17Z

I've finished my initial thoughts on this and documented it here.

From talking to Brendan, it seems like the best path forward is to try to work with the EXEC implementation we have, since it's not explicitly broken but inefficient and not exactly compliant (we launch a new cage instead of doing what exec natively does).

My gameplan right now is to consolidate the threading infrastructure to give us the flexibility to set up threads and their relationships correctly, and then debug/clarify the mess that is wait from there.

rennergade · 2020-03-09T18:21:16Z

I've "finished" the initial stage of my refactor, which unified the thread pipelines. I now can run ls through bash to completion without using the hack I threw in before.

It's still hanging in wait with multiple jobs, which I mostly expected. Something like bash ls | cat hangs infinitely, but it after it gives the correct output.

Unfortunately, I earlier thought this meant pipes were working, but after further debugging it seems that ls process is running normally and printing via stdout, and then cat is running separately. This could point to the cause of the hang, but something like bash ls | date also ends up in an infinite wait loop (after printing out both ls and dates outputs to stdout).

rennergade · 2020-03-09T19:45:41Z

My next step is to create a minimal ls | cat using fork, dup2, pipe, exec, and wait to try to ascertain if the issues are in any of those underlying syscalls or if they're in edge cases presented by bash.

rennergade · 2020-03-10T22:30:39Z

I found that it's writing to stdout and not to the pipe like it's expected, because it thinks it should be writing in Cage 0, which hasn't had dup2 configure the file descriptors properly.

It seems that the cageid's for stdin/stdout/stderr do not get setup properly because of an alternate route NaCl takes to configure them withe the host system. In the case of a dup followed by an exec, the cage id's for these descriptors get overwritten to zero, resulting in unexpected behavior.

I should put together a fix for this shortly.

rennergade · 2020-03-11T18:20:47Z

I fixed the above stdout descriptor/cage issue. ls is now properly writing into the pipe in bash/the basic example.

From that I uncovered two new issues:

In the basic example, the ls output is being piped, but Lind terminates before cat can even run. cat exec's but all of Lind terminates early in the loader. I need to debug this further but something is telling the initial thread that the program has finished running.
There seems to still be some dup problems when I'm running bash, this time with the stdin of cat not duplicating. Debug so far indicates that the problem is happening in SafePOSIX.

At this point I'm not sure if the "wait" problems I've been having are actually a combo of these in disguise. I'm going to debug these first and see the results.

rennergade · 2020-03-13T22:31:54Z

I've figured out the root of problem 2. The cagetable currently sets unallocated hostfd's (this is the number NaCl uses for desc identification) to zero, but that's also the number set for stdin. This surprisingly doesn't interfere elsewhere, but gives us problems when we try to use dup with stdin.

rennergade · 2020-03-14T18:56:59Z

Problem 1 is a little hard to track down. The output isn't deterministic and actually runs to completion (at least runs into problem 2 instead of shutting down in the loader) when I record with RR.

In the past when I've seen non-determinism it's usually been an issue with cageid's and the RPC, so I'll start there.

rennergade · 2020-03-17T19:12:48Z

I'm realizing the changes I'm making on top of changing the threading functions are starting to pile up.

I'm going to checkout a branch with just the threading changes and queue up a PR for that to separate these out.

rennergade · 2020-03-17T20:32:17Z

I'm going to use this issue to concentrate on threads, which is still problem one from above:

In the basic example, the ls output is being piped, but Lind terminates before cat can even run. cat exec's but all of Lind terminates early in the loader. I need to debug this further but something is telling the initial thread that the program has finished running.

rennergade · 2020-03-25T21:01:45Z

I was able to fix the sample programs problem of early termination. This was a mistake on my end of not implementing wait in the user code correctly.

My sample and bash are now getting stuck in wait in the same way. It's easiest to talk about this by naming the threads/cages that are spawned:

Cage 1 - Bash
Cage 2 - First Fork (for ls)
Cage 3 - Second Fork (for cat)
Cage 4 - ls (exec'ed)
Cage 5 - cat (exec'ed)

Both bash and the sample are getting stuck when tearing down Cage 3. At this point cages 2, 4, and 5 have all all exited and have gone through the teardown process. This means that both ls and cat have executed, but for some reason the forked cage is getting stuck after the exec'ed cage has run its course.

The line where its getting stuck is here in the teardown process.

This is somewhat due to our weird "new cage for exec" infrastructure, but I believe that this bug is probably pretty fixable. Will continue to dig into this.

rennergade · 2020-03-26T22:57:27Z

I believe I've found the (well at least one, hopefully only one) problem with wait.

When we're creating new nap's here its storing info about the new nap's in both our "master" and "parent" nap's as well. However the index for which these are stored is a simple "num_children" counter that's being incremented or decremented. However, the manner which these cages are constructed and destructed isn't linear so in this case it's overwriting a valid cage, and because of that cages later absence it gets stuck while waiting on it.

rennergade · 2020-03-27T17:12:25Z

It seems pretty redundant here that we have an array containing children ids when we already have a dynamic array of ptr's to all the children. I think there's a pretty clear refactor here to get rid of the id's array entirely instead of attempting to fix the bug. My feeling is this will be cleaner and won't take much more time than fixing the bug would.

rennergade · 2020-03-27T21:01:48Z

I removed the ID array and set it so wait relies solely on the dynamic array of children pointers we have.

This has fixed the wait issue! And Bash is able to run simple pipe scripts.

rennergade · 2020-04-09T19:03:48Z

This was complete with the wait() merge.

rennergade self-assigned this Feb 15, 2020

This was referenced Feb 15, 2020

Bash #33

Closed

Execve #38

Closed

rennergade closed this as completed Apr 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Threads/Exec/Wait Refactor #48

Threads/Exec/Wait Refactor #48

rennergade commented Feb 15, 2020

rennergade commented Feb 20, 2020

rennergade commented Mar 9, 2020

rennergade commented Mar 9, 2020

rennergade commented Mar 10, 2020

rennergade commented Mar 11, 2020

rennergade commented Mar 13, 2020

rennergade commented Mar 14, 2020

rennergade commented Mar 17, 2020

rennergade commented Mar 17, 2020

rennergade commented Mar 25, 2020

rennergade commented Mar 26, 2020

rennergade commented Mar 27, 2020

rennergade commented Mar 27, 2020

rennergade commented Apr 9, 2020

Threads/Exec/Wait Refactor #48

Threads/Exec/Wait Refactor #48

Comments

rennergade commented Feb 15, 2020

rennergade commented Feb 20, 2020

rennergade commented Mar 9, 2020

rennergade commented Mar 9, 2020

rennergade commented Mar 10, 2020

rennergade commented Mar 11, 2020

rennergade commented Mar 13, 2020

rennergade commented Mar 14, 2020

rennergade commented Mar 17, 2020

rennergade commented Mar 17, 2020

rennergade commented Mar 25, 2020

rennergade commented Mar 26, 2020

rennergade commented Mar 27, 2020

rennergade commented Mar 27, 2020

rennergade commented Apr 9, 2020