
Papermill on HPC/Dask #364

Open
hoangthienan95 opened this issue May 12, 2019 · 7 comments

hoangthienan95 commented May 12, 2019

Hello there,

Thank you for the very cool tool. I work in research and was developing my analysis pipeline interactively in Jupyter Notebook; now that the pipeline is perfected, I have to turn it into scripts and connect them. I will try Papermill this week, but I suspect it will save me a TON of time writing argparse boilerplate, and also save me from writing report files via pure Python file writing (file.write("\n") a million times).

I have a question regarding HPC and schedulers like LSF, SLURM, PBS, etc. Is there documentation or an example of submitting papermill jobs on a distributed cluster and then pulling the same variables back from multiple notebooks? From what I understand so far, my options are:

  1. Interact with the scheduler directly, e.g. bsub papermill input output -p param (on LSF), and use the parameter to construct file names to read back into the main executing notebook. But this way papermill has no idea of the execution status of the notebook jobs. Do I have to write a separate papermill engine for this to work?

  2. Use Dask with papermill's Python API (the execute_notebook function), as suggested in Dask (or threadpool) friendly functions #2 (see the sketch after this list). Then should I run the Dask scheduler in the main notebook and submit 1000 execute_notebook jobs from it, or call execute_notebook once and have that one input notebook start a Dask cluster that submits those 1000 jobs? My workflow is flexible enough to do either, but I wonder if there are any pros/cons that I have not thought about.

  3. Use a workflow scheduler like Airflow, Luigi, Prefect to connect the steps of the pipeline together, putting the bsub calls inside these schedulers' steps.
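For option 2, the shape I have in mind is roughly the sketch below (untested; I'm assuming dask-jobqueue's LSFCluster on the LSF side, and the notebook names, parameter names, and resource arguments are just placeholders):

```python
import papermill as pm
from dask.distributed import Client
from dask_jobqueue import LSFCluster  # assumes dask-jobqueue is installed on the cluster


def run_notebook(sample_id):
    """Execute one parameterized copy of the analysis notebook."""
    output = f"output/analysis_{sample_id}.ipynb"
    pm.execute_notebook(
        "analysis.ipynb",                      # templated input notebook
        output,                                # one executed copy per parameter set
        parameters={"sample_id": sample_id},
    )
    return output


# Resource arguments are placeholders for whatever the queue actually requires.
cluster = LSFCluster(cores=2, memory="8GB", queue="normal")
cluster.scale(50)          # ask LSF for 50 workers
client = Client(cluster)

futures = client.map(run_notebook, range(1000))
executed = client.gather(futures)   # block until all notebooks finish
```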

I hope my questions are clear enough. I will report more on what I find.

hoangthienan95 (Author)

From #239, it seems like if I go with option 2, then running execute_notebook once and letting the job notebook start up the Dask cluster might be better?
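In that case the driver side would just be a single call, something like this (sketch only; the fan-out notebook name and parameter are placeholders, and that notebook would create the Dask cluster and submit the per-sample work from inside its own cells):

```python
import papermill as pm

# One driver call; "fan_out.ipynb" (hypothetical) is the notebook that spins up
# the Dask cluster internally and distributes the 1000 per-sample runs itself.
pm.execute_notebook(
    "fan_out.ipynb",
    "fan_out_output.ipynb",
    parameters={"n_samples": 1000},
)
```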

MSeal (Member) commented May 12, 2019

  1. would work, but as you hinted, you'd need to monitor the tasks for their exit codes. The papermill CLI returns non-zero for failed notebooks. You wouldn't need a separate engine, but you would need a wrapper task to track submitted work (see the sketch after this list).

  2. will run into parallelism issues with both jupyter_client within the process and ipython within the subprocess. There are some open PRs across a few projects that should get merged within a month, but for now it will hit problems as is.

  3. this is a pretty common solution. We use something like this at my work, where we have a NotebookJob node that simply generates the papermill input.ipynb output.ipynb -p option1 foo ... call for a remote container and monitors it until it completes.
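For 1., the wrapper task can be pretty thin; a rough sketch (not our actual code, and the helper name and parameters here are made up for illustration):

```python
import subprocess


def run_papermill_job(input_nb, output_nb, parameters):
    """Run the papermill CLI and surface failures via its exit code.

    On a batch system you'd submit this command through the scheduler
    (e.g. bsub on LSF) rather than running it locally; the key point is
    that papermill's non-zero exit code tells you the notebook failed.
    """
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        cmd += ["-p", name, str(value)]

    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError(f"{input_nb} failed with exit code {result.returncode}")
    return output_nb
```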

If you wanted to, this could make a nice write-up section on our readthedocs page if you were so inclined to make a PR :)

hoangthienan95 (Author)

For solution 3, which scheduler do you use at work? Can you give me or point me to a minimum viable example of monitoring the papermill tasks and then pulling the results back from multiple output notebooks? I'm not familiar with any of them (Airflow, Luigi, Prefect), and some boilerplate would be very much appreciated to help me hit the ground running.

And yes, after I'm done with this deadline, I'll take a stab at documenting my use case!

hoangthienan95 (Author)

Would something like this work? https://docs.prefect.io/guide/tutorials/dask-cluster.html
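Something like the following is what I'm picturing from that tutorial (untested; I'm guessing at the Prefect imports and DaskExecutor usage from the docs, so treat them as assumptions, and the scheduler address and notebook names are placeholders):

```python
import papermill as pm
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor  # import path assumed from the tutorial


@task
def run_notebook(sample_id):
    output = f"output/analysis_{sample_id}.ipynb"
    pm.execute_notebook("analysis.ipynb", output, parameters={"sample_id": sample_id})
    return output


with Flow("papermill-on-dask") as flow:
    outputs = run_notebook.map(list(range(1000)))   # fan out one task per parameter set

# Point the executor at the Dask scheduler already running on the cluster.
flow.run(executor=DaskExecutor(address="tcp://scheduler-address:8786"))
```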

MSeal (Member) commented May 12, 2019

That Dask pattern could work depending on what you're doing, though usually Dask is used more to distribute a function across many executor nodes against subsets of rows. If you're running papermill on a notebook against a row and distributing that across the system, it could work. I'd note that it usually assumes you're running over many, many rows, not running expensive functions on a few rows.

For other examples, you could look at https://airflow.readthedocs.io/en/latest/papermill.html, which now has some basic setup examples for papermill. Also, https://github.com/timkpaine/paperboy has a scheduling + front-end solution -- though I haven't used it at all, so I can't comment on how well it's built.
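The Airflow page boils down to something like the sketch below (the operator's import path and argument names can differ between Airflow versions, so double-check against that doc; the paths and schedule are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.papermill_operator import PapermillOperator  # path as of Airflow 1.10.x

with DAG(
    dag_id="run_analysis_notebook",
    schedule_interval="@daily",
    start_date=datetime(2019, 5, 1),
    catchup=False,
) as dag:
    run_analysis = PapermillOperator(
        task_id="run_analysis",
        input_nb="/notebooks/analysis.ipynb",
        output_nb="/notebooks/out/analysis_{{ ds }}.ipynb",  # templated with the run date
        parameters={"run_date": "{{ ds }}"},
    )
```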

hoangthienan95 (Author)

@MSeal thank you for the help! I'm still wrestling with how to monitor the task when LSF has no API to ping back my local process when the job finishes, so I would have to either try Dask or poll the job status at some interval. Is that what you do too? Or do you work with clusters that somehow return some information when a job is done? Thank you for being patient, I'm new to this!
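For reference, the polling approach I have in mind looks roughly like this (untested; I'm assuming an LSF version where bjobs -noheader -o stat <jobid> is available):

```python
import subprocess
import time


def wait_for_lsf_job(job_id, poll_seconds=60):
    """Poll LSF until the job leaves the queue, then return its final state."""
    while True:
        stat = subprocess.run(
            ["bjobs", "-noheader", "-o", "stat", str(job_id)],
            capture_output=True, text=True,
        ).stdout.strip()
        if stat in ("DONE", "EXIT"):   # DONE = success, EXIT = failure in LSF
            return stat
        time.sleep(poll_seconds)
```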

MSeal (Member) commented May 25, 2019

Sorry for the late response, been very busy. We work with systems that generally abstract away job status management and let us react to state changes with our own coded responses. This is one of the advantages of using a scheduler or DAG execution engine intended for this type of work.
