Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threads/Exec/Wait Refactor #48

Closed
rennergade opened this issue Feb 15, 2020 · 14 comments
Closed

Threads/Exec/Wait Refactor #48

rennergade opened this issue Feb 15, 2020 · 14 comments
Assignees

Comments

@rennergade
Copy link
Contributor

I've started a document detailing the overall plan for the refactor. So far I've written about the thread infrastructure.

This document can be found here.

@rennergade rennergade self-assigned this Feb 15, 2020
This was referenced Feb 15, 2020
Closed
@rennergade
Copy link
Contributor Author

I've finished my initial thoughts on this and documented it here.

From talking to Brendan, it seems like the best path forward is to try to work with the EXEC implementation we have, since it's not explicitly broken but inefficient and not exactly compliant (we launch a new cage instead of doing what exec natively does).

My gameplan right now is to consolidate the threading infrastructure to give us the flexibility to set up threads and their relationships correctly, and then debug/clarify the mess that is wait from there.

@rennergade
Copy link
Contributor Author

I've "finished" the initial stage of my refactor, which unified the thread pipelines. I now can run ls through bash to completion without using the hack I threw in before.

It's still hanging in wait with multiple jobs, which I mostly expected. Something like bash ls | cat hangs infinitely, but it after it gives the correct output.

Unfortunately, I earlier thought this meant pipes were working, but after further debugging it seems that ls process is running normally and printing via stdout, and then cat is running separately. This could point to the cause of the hang, but something like bash ls | date also ends up in an infinite wait loop (after printing out both ls and dates outputs to stdout).

@rennergade
Copy link
Contributor Author

My next step is to create a minimal ls | cat using fork, dup2, pipe, exec, and wait to try to ascertain if the issues are in any of those underlying syscalls or if they're in edge cases presented by bash.

@rennergade
Copy link
Contributor Author

I found that it's writing to stdout and not to the pipe like it's expected, because it thinks it should be writing in Cage 0, which hasn't had dup2 configure the file descriptors properly.

It seems that the cageid's for stdin/stdout/stderr do not get setup properly because of an alternate route NaCl takes to configure them withe the host system. In the case of a dup followed by an exec, the cage id's for these descriptors get overwritten to zero, resulting in unexpected behavior.

I should put together a fix for this shortly.

@rennergade
Copy link
Contributor Author

I fixed the above stdout descriptor/cage issue. ls is now properly writing into the pipe in bash/the basic example.

From that I uncovered two new issues:

  1. In the basic example, the ls output is being piped, but Lind terminates before cat can even run. cat exec's but all of Lind terminates early in the loader. I need to debug this further but something is telling the initial thread that the program has finished running.
  2. There seems to still be some dup problems when I'm running bash, this time with the stdin of cat not duplicating. Debug so far indicates that the problem is happening in SafePOSIX.

At this point I'm not sure if the "wait" problems I've been having are actually a combo of these in disguise. I'm going to debug these first and see the results.

@rennergade
Copy link
Contributor Author

I've figured out the root of problem 2. The cagetable currently sets unallocated hostfd's (this is the number NaCl uses for desc identification) to zero, but that's also the number set for stdin. This surprisingly doesn't interfere elsewhere, but gives us problems when we try to use dup with stdin.

@rennergade
Copy link
Contributor Author

Problem 1 is a little hard to track down. The output isn't deterministic and actually runs to completion (at least runs into problem 2 instead of shutting down in the loader) when I record with RR.

In the past when I've seen non-determinism it's usually been an issue with cageid's and the RPC, so I'll start there.

@rennergade
Copy link
Contributor Author

I'm realizing the changes I'm making on top of changing the threading functions are starting to pile up.

I'm going to checkout a branch with just the threading changes and queue up a PR for that to separate these out.

@rennergade
Copy link
Contributor Author

I'm going to use this issue to concentrate on threads, which is still problem one from above:

In the basic example, the ls output is being piped, but Lind terminates before cat can even run. cat exec's but all of Lind terminates early in the loader. I need to debug this further but something is telling the initial thread that the program has finished running.

@rennergade
Copy link
Contributor Author

I was able to fix the sample programs problem of early termination. This was a mistake on my end of not implementing wait in the user code correctly.

My sample and bash are now getting stuck in wait in the same way. It's easiest to talk about this by naming the threads/cages that are spawned:

Cage 1 - Bash
Cage 2 - First Fork (for ls)
Cage 3 - Second Fork (for cat)
Cage 4 - ls (exec'ed)
Cage 5 - cat (exec'ed)

Both bash and the sample are getting stuck when tearing down Cage 3. At this point cages 2, 4, and 5 have all all exited and have gone through the teardown process. This means that both ls and cat have executed, but for some reason the forked cage is getting stuck after the exec'ed cage has run its course.

The line where its getting stuck is here in the teardown process.

This is somewhat due to our weird "new cage for exec" infrastructure, but I believe that this bug is probably pretty fixable. Will continue to dig into this.

@rennergade
Copy link
Contributor Author

I believe I've found the (well at least one, hopefully only one) problem with wait.

When we're creating new nap's here its storing info about the new nap's in both our "master" and "parent" nap's as well. However the index for which these are stored is a simple "num_children" counter that's being incremented or decremented. However, the manner which these cages are constructed and destructed isn't linear so in this case it's overwriting a valid cage, and because of that cages later absence it gets stuck while waiting on it.

@rennergade
Copy link
Contributor Author

It seems pretty redundant here that we have an array containing children ids when we already have a dynamic array of ptr's to all the children. I think there's a pretty clear refactor here to get rid of the id's array entirely instead of attempting to fix the bug. My feeling is this will be cleaner and won't take much more time than fixing the bug would.

@rennergade
Copy link
Contributor Author

I removed the ID array and set it so wait relies solely on the dynamic array of children pointers we have.

This has fixed the wait issue! And Bash is able to run simple pipe scripts.

@rennergade
Copy link
Contributor Author

This was complete with the wait() merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant