New feature: Read input notebook from github #556

Closed
onevirus opened this issue Nov 30, 2020 · 7 comments · Fixed by #622

Comments

@onevirus
Contributor

In my org, we store every notebook in GitHub and have tweaked papermill to read notebooks from GitHub directly, because we use papermill heavily in production and need to version our notebooks.

This is how we use it:

import papermill as pm

pm.execute_notebook(
    'https://github.com/nteract/papermill/blob/main/papermill/tests/notebooks/read_check.ipynb',
    'path/to/output.ipynb',
    parameters=dict(alpha=0.6, ratio=0.1)
)

It takes just the URL of the notebook, as Binder and nbviewer do.
This has some advantages:
Some teams use only dev / master branches (read from the dev branch in the dev environment, and from the master branch in the prd environment).
Other teams use tags to version notebooks.
We don't need separate storage for notebooks (we put output notebooks in GCS).

What do you think?
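
For readers who want to try the idea before a dedicated handler exists, here is a minimal sketch of turning a GitHub "blob" page URL like the one above into the matching raw-file URL. The helper name `github_blob_to_raw` is hypothetical and not part of papermill. It assumes a public repository and a ref (branch or tag) without slashes in its name, since a blob URL alone cannot tell a slash in the ref apart from a slash in the path.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url: str) -> str:
    """Map https://github.com/<owner>/<repo>/blob/<ref>/<path> to its raw URL.

    Hypothetical helper for illustration; assumes <ref> contains no slashes.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / ref / path...
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref = parts[:4]
    path = "/".join(parts[4:])
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"
```

Swapping `main` for a tag or for a dev / master branch in the URL is what gives the per-environment versioning described above.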

@ronytesler

I'd like to have it.
We run a notebook remotely in a Google Cloud notebook instance. Is there a way to watch the progress of the notebook as it runs, instead of waiting for it to finish? (How do I know when it has finished, or whether it ran at all?)

@MSeal
Member

MSeal commented Dec 7, 2020

This is a good pattern to use for reading from git as a read-only source. If someone wanted to invest a little time in making a new IO handler for reading from git, GitPython would be a useful library for it. I'd be happy to review / merge such an improvement.

@MSeal
Member

MSeal commented Dec 7, 2020

We run a notebook remotely in a Google Cloud notebook instance. Is there a way to watch the progress of the notebook as it runs, instead of waiting for it to finish? (How do I know when it has finished, or whether it ran at all?)

If you're using the CLI, the terminal outputs progress (there are a few options to control this). Additionally, papermill saves the output notebook after each cell and periodically within a cell, so refreshing the destination location in a notebook browser will show progress as well, albeit not necessarily in real time.
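
Because the output notebook is saved as execution proceeds, progress can also be polled from a script rather than a notebook browser. The following is a rough sketch, not a papermill API: the function name `execution_progress` is hypothetical, and it assumes that code cells which have not run yet carry no `execution_count` in the saved file.

```python
import json
from pathlib import Path


def execution_progress(notebook_path):
    """Return (executed_code_cells, total_code_cells) for a saved .ipynb file.

    Assumption: a code cell that has not run yet has execution_count == None.
    """
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    done = sum(1 for c in code_cells if c.get("execution_count") is not None)
    return done, len(code_cells)
```

Calling this in a loop against the destination path gives a coarse progress readout. A caller may want to retry on a JSON decode error in case the file is read while it is being written.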

@ronytesler

@MSeal I use the instance's startup script, which uses papermill to execute the notebook. I run 'gcloud reset' on the machine so that it restarts and the startup script runs. Is there a different way to run the notebook remotely and also see its progress, as you described?

@onevirus
Contributor Author

onevirus commented Dec 9, 2020

@MSeal
I checked GitPython. IMHO, GitPython doesn't look suitable in this case. What we need is to download a single file from GitHub, and git doesn't have that functionality (git checks out a whole repo, not a single file). So I think we need to use the GitHub API directly, or a GitHub package like PyGithub. nbviewer uses the GitHub REST API. For GitLab, I think we would need a separate IO handler.
Anyway, if you don't mind using the GitHub API, I'll tackle it.
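
To make the single-file download concrete, here is a small sketch built on the GitHub REST API "repository contents" endpoint, which returns a JSON payload whose `content` field is base64-encoded. Both function names are hypothetical, and this is not necessarily how a papermill handler would be written. Authentication and the HTTP request itself are left out, and large files may need the raw media type instead of the JSON payload.

```python
import base64
from urllib.parse import quote


def contents_api_url(owner, repo, path, ref):
    """Build the REST URL for one file at a given branch, tag, or commit."""
    return (
        f"https://api.github.com/repos/{owner}/{repo}/contents/"
        f"{quote(path)}?ref={quote(ref, safe='')}"
    )


def decode_contents_payload(payload):
    """Decode the JSON body returned by the contents endpoint into text."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # b64decode ignores the newlines GitHub inserts into the content field.
    return base64.b64decode(payload["content"]).decode("utf-8")
```

The decoded text is the notebook JSON, which could then be handed to whatever reads notebooks from a string.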

@MSeal
Member

MSeal commented Dec 9, 2020

@onevirus That sounds reasonable for what you're targeting. I can imagine a more general git solution as well since there's a lot of git repos that aren't github/gitlab. But that being said github is the most popular in open source so I think optimizing for that end is worth the effort.

@MSeal
Member

MSeal commented Dec 9, 2020

@ronytesler this is a somewhat different topic from the issue that was opened here, but usually you have the startup script write to a logging sink that captures stdout/stderr and makes it available to view. Papermill in and of itself doesn't manage this, as it's a bit out of scope for the project. Managed execution of VMs or containers isn't the easiest to navigate, but most of the solutions involve monitoring those standard outputs and triggering the executors on demand in some execution context. In this story arc, papermill's responsibility is to output log text, save the notebook, and manage the kernel locally within that context.
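
As an illustration of the "capture stdout/stderr" part only, a startup script could wrap the papermill call in something like the sketch below. `run_and_log` is a hypothetical helper, and the log file stands in for whatever sink the platform actually provides.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run a command, appending its combined stdout/stderr to log_path.

    Returns the process exit code. Lines are flushed as they arrive so the
    log can be followed while the command is still running.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
        )
        for line in proc.stdout:
            log.write(line)
            log.flush()
        return proc.wait()
```

A call might look like `run_and_log(["papermill", "input.ipynb", "output.ipynb", "--log-output"], "papermill.log")`, where both notebook paths and the log path are placeholders.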


3 participants