schedule run notebook example #324

Open
otterotter408 opened this issue Feb 26, 2019 · 8 comments
@otterotter408

I need to schedule my notebook scripts to run on the first day of every month. I tried to follow the instructions on parameters but could not work out how to apply them. How should I set up the parameters in my case?
The example provided only mentions "alpha" and "ratio". What do they mean? Do I need to stick with these two variables for scheduling, and how do I make them represent "first day of every month"?

Thank you for your time.

@MSeal
Member

MSeal commented Feb 27, 2019

So papermill doesn't do scheduling by itself. Instead think of it as a tool for executing notebooks that's easy to pass information into.

In the provided examples, alpha and ratio are just the names of the inputs for that situation. You can pass any parameter name with any value into the notebook. Say you wanted to execute a notebook and pass the current date into it. You might call (assuming you're on Linux or Mac):

papermill my_notebook.ipynb result.ipynb -p today `date +'%Y%m%d'`

This would inject a variable called "today" into your notebook with a value of 20190226 (as of writing this post).
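
For reference, here is a minimal Python sketch of the same call through papermill's `execute_notebook` function. The helper names `build_parameters` and `run_notebook` are made up for illustration, and the date is passed as a string here (the command line may infer a numeric type for an all-digit value such as 20190226).

```python
from datetime import date


def build_parameters(run_date=None):
    """Return the parameters dict to inject, with the date formatted as YYYYMMDD."""
    if run_date is None:
        run_date = date.today()
    return {"today": run_date.strftime("%Y%m%d")}


def run_notebook(input_path, output_path, run_date=None):
    """Execute the notebook with a 'today' parameter (hypothetical helper)."""
    # Imported lazily so the date helper can be used without papermill installed.
    import papermill as pm

    return pm.execute_notebook(
        input_path,
        output_path,
        parameters=build_parameters(run_date),
    )
```

Calling `run_notebook("my_notebook.ipynb", "result.ipynb")` would then write the executed copy to result.ipynb, as the command above does.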

To schedule this execution you can try following these directions on using crontab. This will show you how to run the script above on a schedule. To run at midnight on the first day of every month you'd add this to your crontab (note that inside a crontab a % has to be escaped as \%, otherwise cron treats it as a line break):

0 0 1 * * papermill my_notebook.ipynb result.ipynb -p today `date +\%Y\%m\%d`

Hope that helps!

@MSeal MSeal added the question label Feb 27, 2019
@otterotter408
Author

Thank you for your comments! It makes more sense now. I'm using a Windows laptop, and I heard that crontab is not available on Windows. Could you suggest another method?

@MSeal
Member

MSeal commented Mar 1, 2019

https://stackoverflow.com/questions/132971/what-is-the-windows-version-of-cron links a few options depending on your OS version.
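
If the scheduler you pick can only trigger a script daily, or you simply prefer to keep the date logic in Python, one option is a small wrapper that exits early unless it is the first of the month. This is only a sketch: `is_first_of_month`, `main`, and the notebook paths are hypothetical, and it assumes papermill is installed in the Python environment the scheduler uses.

```python
from datetime import date


def is_first_of_month(day=None):
    """Return True when the given date (default: today) is the 1st of a month."""
    if day is None:
        day = date.today()
    return day.day == 1


def main(input_path="my_notebook.ipynb", output_path="result.ipynb"):
    """Run the notebook only on the first day of the month; otherwise do nothing."""
    today = date.today()
    if not is_first_of_month(today):
        return False

    # Imported lazily so the date check works even where papermill is absent.
    import papermill as pm

    pm.execute_notebook(
        input_path,
        output_path,
        parameters={"today": today.strftime("%Y%m%d")},
    )
    return True
```

Saved as a script with a final `main()` call, this could be triggered once a day by whichever Windows scheduling option you choose.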

@mbrio

mbrio commented Mar 13, 2019

I use a combination of Apache Airflow (https://airflow.apache.org/) and Papermill for very complex tasks that are scheduled and it works REALLY well. You'll need to write your own handler, an example could be:

import os
import papermill as pm
from datetime import datetime, timedelta
from airflow import DAG
# Airflow 1.x import path; in Airflow 2 the module is airflow.operators.python
from airflow.operators.python_operator import PythonOperator

def execute_python_notebook_task(**context):
    # With provide_context=True, op_kwargs arrive merged into the task context.
    notebook_path = context['notebook_path']
    out_path = context['out_path']
    out_dir = os.path.dirname(out_path)
    statement_parameters = context.get('statement_parameters')

    # Create the output directory if needed; out_dir is '' for a bare filename.
    if out_dir and not os.path.exists(out_dir):
        os.makedirs(out_dir)

    # Parameters may be a dict or a callable that builds one from the context.
    if callable(statement_parameters):
        statement_parameters = statement_parameters(context)

    pm.execute_notebook(
        notebook_path,
        out_path,
        parameters=statement_parameters
    )

seven_days_ago = datetime.combine(
    datetime.today() - timedelta(7),
    datetime.min.time()
)

default_args = {
    'owner': 'airflow',
    'start_date': seven_days_ago,
    'provide_context': True,
}

dag_name = 'runnin_notebooks_yo'
schedule_interval = '@monthly'

with DAG(dag_name, default_args=default_args, schedule_interval=schedule_interval) as dag:
    run_some_notebook_task = PythonOperator(
        task_id='run_some_notebook_task',
        python_callable=execute_python_notebook_task,
        op_kwargs={
            'notebook_path': 'path_to_some_notebook.ipynb',
            'out_path': 'path_to_some_notebook.out.ipynb',
            'statement_parameters': {
                'parameter_1': 'some_value'
            }
        }
    )

Please note that Airflow is a full-featured tool that can also run branching dependencies of tasks. It may be overkill for what you want, but it is a pretty good tool for handling this sort of scheduling.
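
The handler above also accepts a callable for `statement_parameters`, which it calls with the task context at run time. Below is a sketch of such a callable, assuming Airflow's `ds_nodash` context value (the run's logical date as YYYYMMDD) is available. The function name `monthly_parameters` and the parameter name `today` are made up for illustration.

```python
def monthly_parameters(context):
    """Build the notebook parameters from the Airflow task context.

    Assumes the context carries 'ds_nodash', the run's logical date
    formatted as YYYYMMDD.
    """
    return {"today": context["ds_nodash"]}
```

To use it, pass `'statement_parameters': monthly_parameters` in `op_kwargs` instead of the literal dict.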

@pybokeh

pybokeh commented Mar 17, 2019

@mbrio @otterotter408 If I'm not mistaken, Apache Airflow is a pain to install on Windows.

@mbrio

mbrio commented Mar 20, 2019

@MSeal
Member

MSeal commented Mar 20, 2019

I can attest that the bash on Windows approach works quite well for 99% of tasks (though I haven't tried Airflow explicitly with this) :D

@yosefbs

yosefbs commented Sep 13, 2021

Airflow now has a PapermillOperator :)
Airflow Papermill Operator
