Papermill on HPC/Dask #364
From #239, it seems like if I go with option 2, then running one
This could make a nice write-up section on our readthedocs page, if you were so inclined to make a PR :)
For solution 3, which scheduler do you use at work? Can you give me, or point me to, a minimum viable example of monitoring the papermill tasks and then pulling the results back from multiple output notebooks? I'm not familiar with any of them (Airflow, Luigi, Prefect), and some boilerplate would be very much appreciated to hit the ground running. And yes, after I'm done with this deadline, I'll take a stab at documenting my use case!
Would something like this work? https://docs.prefect.io/guide/tutorials/dask-cluster.html
That Dask pattern could work depending on what you're doing, though usually Dask is used more to distribute a function across many executor nodes against subsets of rows. If you're running papermill on a notebook per row and distributing that across the system, it could work; I'd note that this usually assumes you're running over many, many rows rather than running expensive functions on a few rows. For other examples, you could look at https://airflow.readthedocs.io/en/latest/papermill.html, which now has some basic setup examples for papermill. Also, https://github.com/timkpaine/paperboy has a scheduling + front-end solution, though I haven't used it at all so I can't comment on how well it's built.
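As a rough sketch of that fan-out pattern, one could map `papermill.execute_notebook` over a parameter list with `dask.distributed`. The template path, output-naming scheme, parameter name, and scheduler address below are placeholders for illustration, not from this thread:

```python
# Sketch: fan papermill runs out across Dask workers, then collect the
# output notebook paths. Paths and parameter names are assumptions.

def output_path(param):
    """Derive a unique output notebook name for one parameter value."""
    return f"output_{param}.ipynb"

def run_one(param):
    """Execute the template notebook for a single parameter value."""
    import papermill as pm  # imported lazily so each worker resolves it locally
    out = output_path(param)
    pm.execute_notebook("template.ipynb", out, parameters={"param": param})
    return out

if __name__ == "__main__":
    from dask.distributed import Client
    client = Client("tcp://scheduler:8786")   # address of your Dask scheduler
    futures = client.map(run_one, [0, 1, 2])  # one notebook per parameter
    print(client.gather(futures))             # blocks until all runs finish
```

`client.gather` raises if any worker's `execute_notebook` call failed, so a failed notebook surfaces in the driver rather than silently producing a missing file.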
@MSeal thank you for the help! I'm still wrestling with how to monitor the tasks, since LSF has no API to ping my local process back when a job finishes, so I would have to either try Dask or check the job status every interval or so. Is that what you do too? Or do you work with clusters that somehow return some information when a job is done? Thank you for being patient, I'm new to this!
Sorry for the late response; I've been very busy. We work with systems that generally abstract job-status management and react to state changes with our coded responses. This is one of the advantages of using a scheduler or DAG execution engine intended for this type of work.
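Without a callback API, a minimal polling approach on LSF could look like the sketch below. The `bjobs -noheader -o stat` invocation and the `Job <id>` parsing assume typical LSF behavior (newer versions support `-o` format strings) and are not taken from this thread:

```python
# Sketch: poll LSF for job completion, since bsub has no callback API.
# Assumes bsub's usual confirmation line, e.g.
# "Job <12345> is submitted to queue <normal>."
import re
import subprocess
import time

def parse_job_id(bsub_output):
    """Extract the numeric job ID from bsub's confirmation line, or None."""
    match = re.search(r"Job <(\d+)>", bsub_output)
    return match.group(1) if match else None

def wait_for_job(job_id, interval=30):
    """Block until bjobs reports the job as DONE or EXIT."""
    while True:
        out = subprocess.run(
            ["bjobs", "-noheader", "-o", "stat", job_id],
            capture_output=True, text=True,
        ).stdout.strip()
        if out in ("DONE", "EXIT"):
            return out
        time.sleep(interval)
```

This is exactly the "check the job status every interval" approach; a scheduler like Airflow would handle the same polling and retries for you.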
Hello there,
Thank you for the very cool tool. I work in research and developed my analysis pipeline interactively in Jupyter Notebook; now that the pipeline is perfected, I have to turn it into scripts and connect them. I will try papermill this week, but I suspect it will save me a TON of time writing `argparse` stuff, and also save me from writing report files via pure Python file writing (`file.write("\n")` a million times).

I have a question regarding HPC and schedulers like LSF, SLURM, PBS, etc. Is there documentation or an example of submitting papermill jobs on a distributed cluster and then pulling the same variables from multiple notebooks back? From what I understand so far, my options are:

1. Interact with the scheduler directly, `bsub papermill input output -p param` (on LSF), and use the parameter to construct file names to read back into the main executing notebook. But this way papermill will have no idea of the execution status of the notebook jobs. Do I have to write a separate papermill engine for this to work?
2. Use Dask with papermill's Python API, the `execute_notebook` function, as suggested in Dask (or threadpool) friendly functions #2. Should I then run the Dask scheduler in the main notebook and submit 1000 `execute_notebook` jobs from there, or call `execute_notebook` once and have that one input notebook start a Dask cluster that submits those 1000 jobs? My workflow is flexible enough to do either, but I wonder whether there are pros/cons I have not thought of.
3. Use a workflow scheduler like Airflow, Luigi, or Prefect to connect the steps of the pipeline, putting the `bsub` calls inside these schedulers' steps.

I hope my questions are clear enough. I will report more on what I find.
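For pulling the same variables back out of multiple executed notebooks, nteract's scrapbook library is one option: each notebook calls `sb.glue("name", value)` during execution, and the driver reads the glued values back afterwards. A sketch, assuming each notebook glued a scrap named `"result"` and the hypothetical output-naming convention below:

```python
# Sketch: recover a glued value from each executed notebook with scrapbook
# (pip install nteract-scrapbook). Names below are illustrative assumptions.

def output_name(param):
    """Hypothetical naming convention tying each output notebook to its parameter."""
    return f"output_{param}.ipynb"

def collect_results(params, scrap_name="result"):
    """Map each parameter to the value its notebook glued under scrap_name."""
    import scrapbook as sb
    return {
        p: sb.read_notebook(output_name(p)).scraps[scrap_name].data
        for p in params
    }

if __name__ == "__main__":
    print(collect_results([0, 1, 2]))
```

This sidesteps writing report files by hand: the executed notebooks themselves carry the values, and the driver notebook aggregates them.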