
Papermill on HPC/Dask #364

Open
hoangthienan95 opened this issue May 12, 2019 · 7 comments

hoangthienan95 commented May 12, 2019

Hello there,

Thank you for the very cool tool. I work in research and was developing my analysis pipeline interactively in Jupyter Notebook; now that the pipeline is perfected, I have to turn it into scripts and connect them. I will try Papermill this week, but I suspect it will save me a TON of time writing argparse boilerplate, and also save me from writing report files via pure Python file writing (file.write("\n") a million times).

I have a question regarding HPC and schedulers like LSF, SLURM, PBS, etc. Is there documentation or an example of submitting papermill jobs on a distributed cluster and then pulling the same variables back from multiple notebooks? From what I understand so far, my options are:

  1. Interact with the scheduler directly, e.g. bsub papermill input output -p param (on LSF), and use the parameter to construct file names to read back into the main executing notebook. But this way papermill has no idea of the execution status of the notebook jobs. Do I have to write a separate papermill engine for this to work?

  2. Use Dask with papermill's Python API (the execute_notebook function), as suggested in Dask (or threadpool) friendly functions #2 (see the sketch after this list). Then should I run the Dask scheduler in the main notebook and submit 1000 execute_notebook jobs from it, or call execute_notebook once and have that one input notebook start a Dask cluster that submits those 1000 jobs? My workflow is flexible enough to do either, but I wonder if there are any pros/cons that I have not thought about.

  3. Use a workflow scheduler like Airflow, Luigi, Prefect to connect the steps of the pipeline together, putting the bsub calls inside these schedulers' steps.
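For option 2, the shape I have in mind is roughly the sketch below (untested; I'm assuming dask-jobqueue's LSFCluster on the LSF side, and the notebook names, parameter names, and resource arguments are just placeholders):

```python
import papermill as pm
from dask.distributed import Client
from dask_jobqueue import LSFCluster  # assumes dask-jobqueue is installed on the cluster


def run_notebook(sample_id):
    """Execute one parameterized copy of the analysis notebook."""
    output = f"output/analysis_{sample_id}.ipynb"
    pm.execute_notebook(
        "analysis.ipynb",                      # templated input notebook
        output,                                # one executed copy per parameter set
        parameters={"sample_id": sample_id},
    )
    return output


# Resource arguments are placeholders for whatever the queue actually requires.
cluster = LSFCluster(cores=2, memory="8GB", queue="normal")
cluster.scale(50)          # ask LSF for 50 workers
client = Client(cluster)

futures = client.map(run_notebook, range(1000))
executed = client.gather(futures)   # block until all notebooks finish
```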

I hope my questions are clear enough. I will report more on what I find.

hoangthienan95 (Author)

From #239, it seems like if I go with option 2, then running execute_notebook once and letting the job notebook start up the Dask cluster might be better?
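In that case the driver side would just be a single call, something like this (sketch only; the fan-out notebook name and parameter are placeholders, and that notebook would create the Dask cluster and submit the per-sample work from inside its own cells):

```python
import papermill as pm

# One driver call; "fan_out.ipynb" (hypothetical) is the notebook that spins up
# the Dask cluster internally and distributes the 1000 per-sample runs itself.
pm.execute_notebook(
    "fan_out.ipynb",
    "fan_out_output.ipynb",
    parameters={"n_samples": 1000},
)
```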

MSeal (Member) commented May 12, 2019

  1. would work, but as you hinted, you'd need to monitor the tasks for their exit codes. The papermill CLI returns non-zero for failed notebooks. You wouldn't need a separate engine, but you would need a wrapper task to track submitted work (see the sketch after this list).

  2. will run into parallelism issues with both jupyter_client within the process and ipython within the subprocess. There are some open PRs across a few projects that should get merged within a month, but for now it will hit problems as is.

  3. this is a pretty common solution. We use something like this at my work, where we have a NotebookJob node that simply generates the papermill input.ipynb output.ipynb -p option1 foo ... call for a remote container and monitors it until it completes.
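For 1., the wrapper task can be pretty thin; a rough sketch (not our actual code, and the helper name and parameters here are made up for illustration):

```python
import subprocess


def run_papermill_job(input_nb, output_nb, parameters):
    """Run the papermill CLI and surface failures via its exit code.

    On a batch system you'd submit this command through the scheduler
    (e.g. bsub on LSF) rather than running it locally; the key point is
    that papermill's non-zero exit code tells you the notebook failed.
    """
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        cmd += ["-p", name, str(value)]

    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError(f"{input_nb} failed with exit code {result.returncode}")
    return output_nb
```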

If you wanted to, this could make a nice write-up section on our readthedocs page if you were so inclined to make a PR :)

hoangthienan95 (Author)

For solution 3, which scheduler do you use at work? Can you give me or point me to a minimum viable example of monitoring the papermill tasks and then pulling the results back from multiple output notebooks? I'm not familiar with any of them (Airflow, Luigi, Prefect), and some boilerplate would be very much appreciated to help me hit the ground running.

And yes, after I'm done with this deadline, I'll take a stab at documenting my use case!

hoangthienan95 (Author)

Would something like this work? https://docs.prefect.io/guide/tutorials/dask-cluster.html
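Something like the following is what I'm picturing from that tutorial (untested; I'm guessing at the Prefect imports and DaskExecutor usage from the docs, so treat them as assumptions, and the scheduler address and notebook names are placeholders):

```python
import papermill as pm
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor  # import path assumed from the tutorial


@task
def run_notebook(sample_id):
    output = f"output/analysis_{sample_id}.ipynb"
    pm.execute_notebook("analysis.ipynb", output, parameters={"sample_id": sample_id})
    return output


with Flow("papermill-on-dask") as flow:
    outputs = run_notebook.map(list(range(1000)))   # fan out one task per parameter set

# Point the executor at the Dask scheduler already running on the cluster.
flow.run(executor=DaskExecutor(address="tcp://scheduler-address:8786"))
```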

MSeal (Member) commented May 12, 2019

That Dask pattern could work depending on what you're doing, though usually Dask is used more to distribute a function across many executor nodes against subsets of rows. If you're running papermill on a notebook against a row and distributing that across the system, it could work. I'd note that it usually assumes you're running over many, many rows, not running expensive functions on a few rows.

For other examples, you could look at https://airflow.readthedocs.io/en/latest/papermill.html, which now has some basic setup examples for papermill. Also, https://github.com/timkpaine/paperboy has a scheduling + front-end solution -- though I haven't used it at all, so I can't comment on how well it's built.
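The Airflow page boils down to something like the sketch below (the operator's import path and argument names can differ between Airflow versions, so double-check against that doc; the paths and schedule are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.papermill_operator import PapermillOperator  # path as of Airflow 1.10.x

with DAG(
    dag_id="run_analysis_notebook",
    schedule_interval="@daily",
    start_date=datetime(2019, 5, 1),
    catchup=False,
) as dag:
    run_analysis = PapermillOperator(
        task_id="run_analysis",
        input_nb="/notebooks/analysis.ipynb",
        output_nb="/notebooks/out/analysis_{{ ds }}.ipynb",  # templated with the run date
        parameters={"run_date": "{{ ds }}"},
    )
```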

hoangthienan95 (Author)

@MSeal thank you for the help! I'm still wrestling with how to monitor the task when LSF has no API to ping back my local process when the job finishes, so I would have to either try Dask or poll the job status at some interval. Is that what you do too? Or do you work with clusters that somehow return some information when a job is done? Thank you for being patient, I'm new to this!
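For reference, the polling approach I have in mind looks roughly like this (untested; I'm assuming an LSF version where bjobs -noheader -o stat <jobid> is available):

```python
import subprocess
import time


def wait_for_lsf_job(job_id, poll_seconds=60):
    """Poll LSF until the job leaves the queue, then return its final state."""
    while True:
        stat = subprocess.run(
            ["bjobs", "-noheader", "-o", "stat", str(job_id)],
            capture_output=True, text=True,
        ).stdout.strip()
        if stat in ("DONE", "EXIT"):   # DONE = success, EXIT = failure in LSF
            return stat
        time.sleep(poll_seconds)
```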

MSeal (Member) commented May 25, 2019

Sorry for the late response, been very busy. We work with systems that generally abstract away job status management and let us react to state changes with our own coded responses. This is one of the advantages of using a scheduler or DAG execution engine intended for this type of work.
