Run reports in a separate Python process. #41

plietar · 2024-04-09T13:27:54Z

The runpy module we've been using to run reports provides very little isolation. In particular, "any side effects (such as cached imports of other modules) will remain in place after the functions have returned". This means that if a user splits their report into multiple files, with one file importing another, the latter will be cached and never reloaded by Python, even if it is modified. Similarly, if the module was already loaded before, the report runs against old code that does not match the files included in the packet.

Using a separate subprocess gives us much stronger isolation: the child process shares neither global variables nor previously loaded modules with the parent. A user can run a report, modify some of the imported code, run the report again, and get the expected behaviour.
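The caching behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from this PR: the file names `report.py` and `_sandbox_demo_helper.py` and the function `run_report_twice` are made up for illustration.

```python
import pathlib
import runpy
import sys
import tempfile


def run_report_twice():
    """Run a two-file report twice with runpy, editing the helper in between."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        helper = root / "_sandbox_demo_helper.py"  # hypothetical imported file
        report = root / "report.py"  # hypothetical report entry point
        helper.write_text("VALUE = 1\n")
        report.write_text(
            "import _sandbox_demo_helper\n"
            "result = _sandbox_demo_helper.VALUE\n"
        )
        # run_path does not add a script's directory to sys.path, so do it here.
        sys.path.insert(0, tmp)
        try:
            first = runpy.run_path(str(report))["result"]
            # Modify the imported file, as a user editing their report would.
            helper.write_text("VALUE = 2\n")
            second = runpy.run_path(str(report))["result"]
        finally:
            # Undo the side effects so the demo does not leak into the caller.
            sys.path.remove(tmp)
            sys.modules.pop("_sandbox_demo_helper", None)
        return first, second
```

Calling `run_report_twice()` returns `(1, 1)`: the second run still sees the old value because the helper module stays cached in `sys.modules` of the current process.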

The Packet class is refactored a little and its scope is reduced to just creating the packet and its metadata, without inserting it into the repository. The insertion is now handled by a separate function. This is done as a way of distinguishing what happens in the child process and what happens in the parent.

richfitz

I've looked through this a couple of times now, and am glad you've taken the time to sort this all out.

richfitz · 2024-05-03T07:32:05Z

src/outpack/sandbox.py

+from outpack.util import openable_temporary_file
+
+
+def run_in_sandbox(target, args=(), cwd=None, syspath=None):


We use callr for this sort of manoeuvre in R. Does Python not have something we can pull in to do this for us? This way is fine though!

plietar · May 3, 2024

All I could find is the multiprocessing module, but it's a bit inconvenient to use and puts a burden on the caller to set up their script in the right way.

richfitz · 2024-05-03T07:32:57Z

src/outpack/sandbox.py

+    except BaseException as e:
+        # This allows the traceback to be pickled and communicated out of
+        # the sandbox.
+        pickling_support.install(e)


nice job finding this

The sandbox is implemented by pickling the function and its arguments into a file, then starting a new Python process which unpickles the file and calls the target. Similarly, the function's return value, or any exception it raises, is pickled to an output file, which is read by the parent.

I tried to use the multiprocessing module to implement this, but it puts inconvenient requirements on the way the top-level source file is written. The module re-evaluates the file in the subprocess, meaning its contents must check for `__name__ == "__main__"` to avoid doing the work again. It also doesn't work too well with the REPL. Using our own implementation avoids these issues.
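The pickle-based sandbox described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `src/outpack/sandbox.py`: the helper name `run_in_subprocess` is hypothetical, the `cwd` and `syspath` parameters of the real `run_in_sandbox` are left out, and tracebacks are not carried across (the reviewed diff uses `pickling_support.install(e)` for that).

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Source run by the child interpreter: read the pickled call, execute it, and
# write back either ("ok", value) or ("error", exception).
_CHILD_SOURCE = """
import pickle, sys
with open(sys.argv[1], "rb") as f:
    target, args = pickle.load(f)
try:
    outcome = ("ok", target(*args))
except BaseException as e:
    outcome = ("error", e)
with open(sys.argv[2], "wb") as f:
    pickle.dump(outcome, f)
"""


def run_in_subprocess(target, args=()):
    """Hypothetical helper: call target(*args) in a fresh Python process."""
    with tempfile.TemporaryDirectory() as tmp:
        input_path = os.path.join(tmp, "input.pkl")
        output_path = os.path.join(tmp, "output.pkl")
        # Functions are pickled by reference, so the child must be able to
        # import the module that defines the target.
        with open(input_path, "wb") as f:
            pickle.dump((target, args), f)
        subprocess.run(
            [sys.executable, "-c", _CHILD_SOURCE, input_path, output_path],
            check=True,
        )
        with open(output_path, "rb") as f:
            status, value = pickle.load(f)
    if status == "error":
        # Re-raise the child's exception in the parent (without its traceback).
        raise value
    return value
```

For example, `run_in_subprocess(os.getpid)` returns a process id different from the parent's `os.getpid()`, and `run_in_subprocess(int, ("not a number",))` raises `ValueError` in the parent.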

plietar force-pushed the mrc-5227 branch from 21338d4 to ee8081c Compare April 9, 2024 14:55

plietar marked this pull request as ready for review April 9, 2024 18:00

plietar requested a review from richfitz April 9, 2024 18:00

plietar force-pushed the mrc-5228 branch 2 times, most recently from 0053fd1 to cedb387 Compare April 9, 2024 18:04

plietar force-pushed the mrc-5227 branch from ee8081c to 3c3e655 Compare April 9, 2024 18:05

plietar changed the base branch from mrc-5228 to main April 10, 2024 14:19

plietar force-pushed the mrc-5227 branch 3 times, most recently from 92c0ecf to 7f85351 Compare April 16, 2024 17:20

plietar force-pushed the mrc-5227 branch from 7f85351 to 9ce2f0c Compare April 23, 2024 11:09

richfitz approved these changes May 3, 2024

plietar force-pushed the mrc-5227 branch from 9ce2f0c to 6a20109 Compare May 3, 2024 11:56

plietar merged commit c42fca9 into main May 3, 2024
7 checks passed

plietar deleted the mrc-5227 branch May 3, 2024 12:01
