
Race condition when multiple users use isolate #26

Closed
seirl opened this issue Mar 7, 2017 · 11 comments

@seirl
Contributor

seirl commented Mar 7, 2017

There is a race condition when a lot of users are constantly using isolate in parallel, even if they try to play nice with each other.

Let's say you want to get a new box for your program. The "proper" way to do that is to list the directories in /var/lib/isolate/ and to --init with an available ID. But between the time you check that an ID is available and the time you call --init, another program might have run --init for the same ID, and both programs end up with the same cgroup and sandbox.

I see two solutions for that:

  • The first one is really simple to implement in isolate: an option like --require-empty (name suggestions welcome) that makes isolate --init fail if the box folder already exists (mkdir returns EEXIST).
  • The second one is to move the responsibility of assigning box IDs from the calling program to isolate. When calling --init, you would be able to specify, say, --box-id new. Isolate would then look for an available box ID among the directories in /var/lib/isolate, try to mkdir it, and on EEXIST keep trying new IDs until a directory has been successfully created. When --init finishes, it would print the box ID along with the path of the box, so that --run and --cleanup can be called with that ID.
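The second scheme leans on mkdir(2) being atomic: the first process to create the box directory owns it, and EEXIST means someone else got there first. A minimal shell sketch of that allocation loop, using a temporary directory as a stand-in for /var/lib/isolate (the claim_box helper and the 0-99 ID range are illustrative, not part of isolate):

```shell
# Atomic box-id allocation: mkdir either creates the directory or fails
# with EEXIST, so two racing processes can never claim the same id.
box_root=$(mktemp -d)          # stand-in for /var/lib/isolate in this demo

claim_box() {
  for id in $(seq 0 99); do
    if mkdir "$box_root/$id" 2>/dev/null; then
      echo "$id"               # this id is now exclusively ours
      return 0
    fi
  done
  return 1                     # no free box id
}

mkdir "$box_root/0"            # simulate a box another process already holds
claim_box                      # prints 1: id 0 was taken, id 1 was free
```

A real implementation would invoke `isolate --init` with the claimed ID right afterwards, and treat an init failure as "taken, try the next one".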

I'm happy to implement either solution, as long as you approve one of them and comment on the API and the short/long option names (and any other implementation details you can think of).

@seirl
Contributor Author

seirl commented Mar 7, 2017

If we use the first solution, I also need to know how to signal that the failure happened because the box already existed. A different exit code? Or just exit(1), requiring the user to parse the output message themselves to know that they need to retry?

@gollux
Member

gollux commented Mar 7, 2017 via email

@hermanzdosilovic
Contributor

I agree with @gollux: isolate should not care about how you use its boxes. @seirl, you described the "proper" way of handling this race condition:

The "proper" way to do that is to list the directories in /var/lib/isolate/ and to --init with an available ID. But between the time you check that an ID is available and the time you call --init, another program might have done a --init for the same ID and you get the same cgroup and sandbox for both.

I don't think this is the "proper" way, because the problem lies in your own approach to handling race conditions. You should change your solution, not isolate. 😄

As an example, please take a look at the Judge0 API, a REST API that directly uses isolate to run untrusted programs. There, this race problem is solved by building a Submission model stored in the database; its unique database ID is used as the box ID.

@seirl
Contributor Author

seirl commented Mar 7, 2017

All of what you are describing is already what we are doing.

The problem is when someone decides to run two instances of our program, for instance a server that handles the requests (like what judge0 is doing) is already running, and then someone else decides to run the unit tests on the same machine.

I'm pretty sure running Judge0 alongside an isolate test suite will cause the same problem; from what I've seen, they simply ignore it.

@hermanzdosilovic
Contributor

I now understand your problem better, but I still believe that isolate should not be the one fixing it. Isolate provides a simple interface for running untrusted code on your server/machine. If you have concurrency problems when creating isolate boxes, you should build another layer of abstraction above isolate that solves them: a program with the same (or at least a similar) interface as isolate, which uses isolate "under the hood".

A "quick" fix for your problem could be one of the following:

  • Run tests on another server, not production server
  • On your server, compile two isolates: a production isolate that uses the default box_root, and another that uses a different box_root (see default.cf). Then use the production isolate for the production server and the other one for the test environment.
  • Wrap your application inside Docker
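The second bullet amounts to building a separate test copy of isolate whose compiled-in configuration points elsewhere. A sketch of what the test build's default.cf could look like (the -test paths are illustrative, and cg_root only matters if you use the cgroup mode):

```
# default.cf of the test build: keep it away from the production box root
box_root = /var/lib/isolate-test
cg_root = /sys/fs/cgroup/memory/isolate-test
```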

Also worth mentioning: yes, the Judge0 API has the same problem if you install it on your server in the "old school" way. But its main strength is its mobility from machine to machine, which is achieved with Docker: every instance of Judge0 runs inside a Docker container that has its own isolate.

@seirl
Contributor Author

seirl commented Mar 7, 2017

While what you're saying makes sense, I think this is the only issue that prevents different programs from using isolate at the same time, and fixing it is a pretty minor change.

Adding a --require-empty option is trivial and has real value even for single programs: it can also surface bugs where you reuse a box ID that has not been properly --cleanup'd. For instance, if you forget to --cleanup the lingering box IDs when you start your frontend after a server hard reset, it is nice to have an extra check that tells you "hey, you're trying to --init something that already exists, are you sure you want to do that?". It would also pretty much solve the race condition for test suites and the like.

Are you really opposed to making that small change?

@gollux
Member

gollux commented Mar 7, 2017 via email

@seirl
Contributor Author

seirl commented Mar 7, 2017 via email

seirl added a commit to seirl/isolate that referenced this issue Mar 7, 2017
@seirl
Contributor Author

seirl commented Mar 7, 2017

Added pull request #27 as a follow-up to @gollux's first suggestion.

@niemela
Member

niemela commented Mar 7, 2017

I think having --init failing when the directory isn't empty is the best way to go.

+1

Silently "fixing" the error when assumptions seem to have been broken feels scary. Much better to fail early and in a controlled way.

@seirl
Contributor Author

seirl commented Mar 8, 2017

I also realized that another advantage of that PR is that it ensures people check the exit code of their --cleanup call instead of assuming it always works :-)
For instance, if the cgroup has not been deleted for some reason, you really don't want to run anything in that box again.

It also forces you to clean up everything when you start your frontend, instead of assuming the directory is empty.
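The startup pattern this encourages can be sketched in shell; cleanup_box here is a hypothetical stand-in for invoking `isolate --box-id=$id --cleanup` and checking its exit status:

```shell
box_root=$(mktemp -d)            # stand-in for /var/lib/isolate in this demo
mkdir "$box_root/4"              # a box left over from before a restart

# Hypothetical stand-in for `isolate --box-id=$1 --cleanup`; a real
# frontend would run the binary and check its exit status the same way.
cleanup_box() { rm -rf "$box_root/$1"; }

# On frontend startup, clean every leftover box and refuse to continue
# if any cleanup fails, instead of silently reusing a dirty box.
for dir in "$box_root"/*; do
  id=${dir##*/}
  if ! cleanup_box "$id"; then
    echo "cleanup of box $id failed; refusing to reuse it" >&2
    exit 1
  fi
  echo "box $id cleaned"
done
```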

@gollux gollux closed this as completed in b9ce5b6 Jul 31, 2017
stefano-maggiolo added a commit to stefano-maggiolo/cms that referenced this issue Oct 23, 2017
Since the last version (see ioi/isolate#26), isolate fails on init if the box already exists. We didn't clean up in two situations: when the box was kept around, and when the worker was terminated in the middle of an execution. The former is easy to fix, the latter is not, so we essentially revert to the previous behaviour by always calling cleanup before init.
stefano-maggiolo added a commit to stefano-maggiolo/cms that referenced this issue Oct 30, 2017