Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

## Goal

This notebook shows how running a notebook via papermill as a pipeline differs from using `Experiment.submit()`. For the `Experiment.submit()` examples, see [simple-pm-run.ipynb](simple-pm-run.ipynb). This notebook assumes that one has already been run to create the required resources.

This notebook started as a copy of the Pipelines Getting Started Notebook [here](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-getting-started.ipynb).

## Dependencies

This notebook requires that `azureml-sdk` is installed in the environment in which it is run.

### Azure Machine Learning Imports

In this first code cell, we import key Azure Machine Learning modules that we will use below. 

In [21]:
import os

import azureml.core
from azureml.core import Workspace, Experiment
from azureml.core.compute import AmlCompute
from azureml.core.runconfig import CondaDependencies, RunConfiguration
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

## load pipeline dependencies
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.15


### Initialize Workspace

Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration or, if no configuration is found, get it from Azure.

In [100]:
aml_compute_target = "aml-compute-d2" ## 2-16 characters
exp_name = 'papermill-in-a-pipeline'
# project folder
project_folder = './projectDir'

In [3]:
if os.path.isdir('aml_config'):
    print('Loading Workspace information from configuration')
    ws = Workspace.from_config()
else:
    print('Getting Workspace information from Variables. You must set these or this will fail!')
    SUBSCRIPTION_ID = os.getenv("AZ_SUB","")
    RESOURCE_GROUP = os.getenv("RESOURCE_GROUP","")
    WS_NAME = os.getenv("WS_NAME","")
    WS_LOCATION = 'eastus'
    ws=Workspace.get(name=WS_NAME,
                    resource_group=RESOURCE_GROUP,
                    subscription_id=SUBSCRIPTION_ID)

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')


Loading Workspace information from configuration
Found the config file in: C:\Users\jeremr\Documents\GitHub\papermill_execution_azureml\aml_config\config.json
jeremr_top10_mvl_aml
jeremr_top10_mvl
eastus
03909a66-bef8-4d52-8e9a-a346604e0902
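The fallback branch above silently passes empty strings to `Workspace.get()` when the environment variables are unset. As a minimal sketch, the lookup settings could be validated first. The helper name `workspace_settings_from_env` is hypothetical and not part of the Azure ML SDK; the variable names are the ones used in the cell above.

```python
import os

# Environment variables read by the fallback branch in the cell above.
REQUIRED_VARS = ("AZ_SUB", "RESOURCE_GROUP", "WS_NAME")


def workspace_settings_from_env(env=None):
    """Return keyword arguments for a workspace lookup, or raise if any are unset.

    Hypothetical helper for illustration; not part of the Azure ML SDK.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {
        "subscription_id": env["AZ_SUB"],
        "resource_group": env["RESOURCE_GROUP"],
        "name": env["WS_NAME"],
    }
```

With this in place, the `else` branch could call `Workspace.get(**workspace_settings_from_env())` and fail early with a readable message instead of a service error.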


## Compute Targets

We pick up from the prior notebook and focus on cloud compute. As before, we use AmlCompute to execute our pipeline step.



#### List of Compute Targets on the workspace

In [4]:
cts = ws.compute_targets
for ct in cts:
    print(ct)

jeremr-top10-adb
jeremr-top10-mvl
top10-mvl-d4v2
aml-compute-d2


In [23]:
## run_config.load() does not seem to work:
# my_aml_run_config = RunConfiguration()
# my_aml_run_config.load(path='.', name=aml_compute_target)
# print(my_aml_run_config.target) ## still prints 'local'
# it does not load the values...

## so recreate
aml_compute = AmlCompute(ws, aml_compute_target)

cd = CondaDependencies.create(pip_packages=["ipykernel", "papermill", "azureml-sdk"])
my_aml_run_config = RunConfiguration(conda_dependencies=cd)
my_aml_run_config.target = aml_compute_target
my_aml_run_config.environment.docker.enabled = True
my_aml_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE

## Run as a Pipeline ...

To run as a pipeline, create a `PythonScriptStep`, build a pipeline from it, and then submit the pipeline.

In [74]:
# Uses default values for the PythonScriptStep constructor.

step1nohash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False
                        )
print("Step1 created")

Step1 created
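The step above points at `papermill_run_notebook.py`, which is not shown in this notebook. As a rough idea of what such a script might contain, here is a hypothetical sketch. The function names, the `outputs` folder, and the output file naming are assumptions for illustration; the only library call used is `papermill.execute_notebook`.

```python
import os


def build_output_path(input_path, out_dir):
    """Derive an output notebook path such as outputs/hello_world.output.ipynb."""
    base, ext = os.path.splitext(os.path.basename(input_path))
    return os.path.join(out_dir, base + ".output" + ext)


def run_notebook(input_path, out_dir="outputs", parameters=None):
    """Execute a notebook with papermill and return the path of the executed copy.

    Hypothetical sketch; the actual script used by this notebook may differ.
    """
    # Imported lazily so build_output_path can be used without papermill installed.
    import papermill as pm

    os.makedirs(out_dir, exist_ok=True)
    output_path = build_output_path(input_path, out_dir)
    pm.execute_notebook(input_path, output_path, parameters=parameters or {})
    return output_path
```

A script along these lines would call something like `run_notebook("hello_world.ipynb")` from the step's source directory.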


### Build, Validate, and Submit the pipeline
You can optionally [validate](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#validate) the pipeline before submitting it. The platform runs validation checks, such as detecting circular dependencies and checking parameters, even if you do not explicitly call the `validate` method.
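To make the circular-dependency check concrete, the following is a generic sketch of cycle detection over a step graph using depth-first search. It illustrates the idea only and is not the Azure ML implementation; the `has_cycle` name and the dictionary format are assumptions.

```python
def has_cycle(dependencies):
    """Return True if the step graph contains a circular dependency.

    `dependencies` maps a step name to the list of step names it depends on.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(step):
        state[step] = IN_PROGRESS
        for dep in dependencies.get(step, ()):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # We reached a step that is still on the current DFS path.
                return True
            if dep_state == UNVISITED and visit(dep):
                return True
        state[step] = DONE
        return False

    return any(
        state.get(step, UNVISITED) == UNVISITED and visit(step)
        for step in dependencies
    )
```

The single-step pipeline in this notebook has no dependencies, so a check like this would trivially pass.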

### Submit the pipeline
[Submitting](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#submit) the pipeline involves creating an [Experiment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment?view=azure-ml-py) object and providing the built pipeline for submission. 

In [75]:
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")

Pipeline is built
Pipeline validation complete
Created step Use papermill to run a notebook [cd15b7bd][cfd165e6-8440-4062-8a76-9f5a9925338c], (This step will run and generate new outputs)
Submitted pipeline run: e9d66309-0ccc-4565-bdb6-62fa0e2c4137
Pipeline is submitted for execution


## Check for file changes

Pipelines behave differently from experiments when source files change. Try editing `msg` in `hello_world.ipynb`, and then rebuilding and rerunning the same pipeline.


In [87]:
## Edit `msg` first!!
step1nohash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False
                        )
print("Step1 created")
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")

Step1 created
Pipeline is built
Step Use papermill to run a notebook is ready to be created [7f50341b]
Pipeline validation complete
Created step Use papermill to run a notebook [7f50341b][c4b665d4-b695-4805-8c5d-bc1fa52b4230], (This step will run and generate new outputs)
Submitted pipeline run: ac016077-28b5-44e3-aaf7-745278243281
Pipeline is submitted for execution


## Results of Experiments

If you try various ways of updating that file, the upshot is that, by default, you only get updates from the notebook if you also update the script. Updating the script triggers an update of the entire `source_directory`. It doesn't matter whether you rebuild the pipeline, or even redefine the step: if you only update the notebook, you won't see new results.



| run num |  pipeline updates              | msg status (nb)  | msg2 status (script) | msg log    | msg2 log |
|---------|--------------------------------|-------------------|----------------------|------------|----------|
| run 1   |  clean                         | hello world           | run1 | hello world     | run1      |
| run 2   | **only tried Experiment.submit()** | hello world           | **run2** | hello world     | run1  |
| run 3   | **built, validated, and submitted** | hello world           | run2  | hello world     | **run2** |
| run 4   | built, validated, and submitted | hello world           | **run4**  | hello world     | **run4**  |
| run 5   | built, validated, and submitted | **'tst 2 - hello world'** | run4  | 'hello world'   | run4  |
| run 6   | built, validated, and submitted | **'change - hello world'**  | **run7** | **'change - hello world'**  | **run7** |
| run 7   | built, validated, and submitted | **'chng 2 - hello world'**  | run7 | 'change - hello world'  | run7 |
| run 8   | built, validated, and submitted | **'hello world'**  | run7 | 'change - hello world'  | run7 |
| run 9   | built, validated, and submitted | 'hello world'      | **run9** | **'hello world'**  | **run9** |
| run 10  | **step def., built, validated, and submitted** | 'hello world'      | **run10** | 'hello world'  | **run10** |
| run 11  | step def., built, validated, and submitted | **'delta: hello world'**      | run10 | 'hello world'  | run10 |
| run 12  | step def., built, validated, and submitted | **'no show: hello world'**      | run10 | 'hello world'  | run10 |
| run 13  | step def., built, validated, and submitted | 'no show: hello world'      | **run13** | **'no show: hello world'**  | **run13** |

In order to change this behavior, you can do a few things.

- Use the `hash_paths` argument so that key files (like the notebook) are checked for changes and the step's outputs are regenerated when they change.
- Set `regenerate_outputs=True` when you submit.

To see these two things in action, see the following sections.
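One way to picture the default behavior and the effect of `hash_paths` is a toy fingerprint model: by default only the script contributes to the fingerprint, and listing extra files adds their contents. This is an assumed mental model that matches the observations above, not a description of how Azure ML actually computes its hashes; `step_fingerprint` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def step_fingerprint(source_dir, script_name, hash_paths=()):
    """Toy fingerprint: hash the script plus any explicitly listed files."""
    digest = hashlib.sha256()
    for rel_path in [script_name, *sorted(hash_paths)]:
        digest.update(rel_path.encode("utf-8"))
        with open(os.path.join(source_dir, rel_path), "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()


# Demo with stand-in files; the contents are placeholders, not the real files.
with tempfile.TemporaryDirectory() as source_dir:
    script = "papermill_run_notebook.py"
    notebook = "hello_world.ipynb"
    with open(os.path.join(source_dir, script), "w") as f:
        f.write("msg2 = 'run1'\n")
    with open(os.path.join(source_dir, notebook), "w") as f:
        f.write("msg = 'hello world'\n")

    plain_before = step_fingerprint(source_dir, script)
    hashed_before = step_fingerprint(source_dir, script, [notebook])

    # Edit only the notebook, as in the experiment above.
    with open(os.path.join(source_dir, notebook), "w") as f:
        f.write("msg = 'change - hello world'\n")

    plain_after = step_fingerprint(source_dir, script)
    hashed_after = step_fingerprint(source_dir, script, [notebook])

print("script-only fingerprint changed:", plain_before != plain_after)
print("fingerprint with hash_paths changed:", hashed_before != hashed_after)
```

In this model, editing only the notebook leaves the script-only fingerprint unchanged, while the fingerprint that includes the notebook changes.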


## Using hash_paths

Reset the `hello_world` notebook, and rebuild the pipeline with that file listed in `hash_paths`.

In [115]:
# Uses default values for the PythonScriptStep constructor, plus hash_paths.
step1withhash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False,
                         hash_paths=['hello_world.ipynb']
                        )
print("Step1 created")
pipeline1 = Pipeline(workspace=ws, steps=[step1withhash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")


Step1 created
Pipeline is built
Step Use papermill to run a notebook is ready to be created [6bd9084b]
Pipeline validation complete
Created step Use papermill to run a notebook [6bd9084b][39f4878c-edb2-4abf-a70c-26b04c2ffa50], (This step will run and generate new outputs)
Submitted pipeline run: fdd43239-ba45-4897-bdea-317288ccbb25
Pipeline is submitted for execution


In [99]:

pipeline1 = Pipeline(workspace=ws, steps=[step1withhash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")

Pipeline is built
Pipeline validation complete
Created step Use papermill to run a notebook [2aaf0bb9][8cde8f88-c183-4737-aac4-44429b9ecea7], (This step will run and generate new outputs)
Submitted pipeline run: f20903f1-f0e1-41df-9f0a-0752121b8c3f
Pipeline is submitted for execution


## Results of Experiments with hash_paths

The upshot is that if you specify `hash_paths`, the logged value updates whenever a listed file changes. It works as intended.


| run num |  pipeline updates              | msg status (nb)  | msg2 status (script) | msg log    | msg2 log |
|---------|--------------------------------|-------------------|----------------------|------------|----------|
| run 1   |  clean                         | hello world           | run1 | hello world     | run1      |
| run 2   | **Experiment.submit()**        | hello world           | **run2** | hello world     | run1  |
| run 3   | **built, validated, and submitted** | hello world           | run2  | hello world     | **run2** |
| run 4   | built, validated, and submitted | hello world           | **run4**  | hello world     | **run4**  |
| run 5   | built, validated, and submitted | **'chng: Hello World!'** | run4  | **'chng: Hello World!'**  | run4  |
| run 6   | built, validated, and submitted | **'Hello World again!'** | run4  | **'Hello World again!'**  | run4  |
| run 7   | **step def., built, validated, and submitted** | 'Hello World again!'   | run4 | 'Hello World again!'  | run4 |
| run 8    | step def., built, validated, and submitted | **'Hello World!'**      | run4 | **'Hello World!'**  | run4 |


## Explore regenerate_outputs

We can try the same experiment (with no hashing) to see whether `regenerate_outputs` has an effect. Because we know script changes trigger an update, we'll vary only the notebook, the pipeline updates, and whether `regenerate_outputs` is true or false on the submit call.


In [101]:
## Edit `msg` first!!
step1nohash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False
                        )
print("Step1 created")
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1, regenerate_outputs=False)
print("Pipeline is submitted for execution")

Step1 created
Pipeline is built
Pipeline validation complete
Created step Use papermill to run a notebook [787419c2][cfd165e6-8440-4062-8a76-9f5a9925338c], (This step will run and generate new outputs)
Submitted pipeline run: bf1ce206-8e2d-41f8-a3f4-20f7b0695045
Pipeline is submitted for execution


In [112]:
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1, regenerate_outputs=True)
print("Pipeline is submitted for execution")

Pipeline is built
Pipeline validation complete
Created step Use papermill to run a notebook [15ef3e31][bcd166f1-0f5c-4f13-a473-6c277a396cee], (This step will run and generate new outputs)
Submitted pipeline run: f7916073-d1a5-4fdc-b8eb-1cb9c9009360
Pipeline is submitted for execution


## Results of Experiments with regenerate_outputs

In all cases, `hash_paths=None`.

Setting `regenerate_outputs=True` appears to force a new build. The default appears to be equivalent to `regenerate_outputs=False`.


| run num |  regenerate_outputs | pipeline updates | msg status (nb)  | msg2 status (script) | msg log    | msg2 log |
|---------|---------------------|-----------|-------------------|----------------------|------------|----------|
| run 1   | 0 | clean                         | hello world           | run1 | hello world     | run1      |
| run 2   | 0 | **Experiment.submit()**        | **tst2**          | run1 | hello world     | run1  |
| run 3   | **1** | Experiment.submit()        | tst2          | run1 | hello world     | run1  |
| run 4   | 1 | Experiment.submit()        | tst2          | **run4** | hello world     | run1  |
| run 5   | 1 | **built, validated, and submitted** | tst2           | **run1**  | **tst2**     | run1 |
| run 6   | 1 | built, validated, and submitted | **change**           | run1  | **change**     | run1 |
| run 7   | 1 | built, validated, and submitted | change           | run1  | change     | run1 |
| run 8   | 1 | built, validated, and submitted | **delta**           | run1  | **delta**     | run1 |
| run 9   | 1 | built, validated, and submitted | **gamma**           | run1  | **gamma**     | run1 |
| run 10   | **0** | built, validated, and submitted | **epsilon**           | run1  | gamma     | run1 |
| run 11   | 0 | built, validated, and submitted | **alpha**           | run1  | gamma     | run1 |
| run 12   | **1** | built, validated, and submitted | alpha           | run1  | **alpha**   | run1 |


In [59]:
## Edit `msg` first!!
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1, regenerate_outputs=False)
print("Pipeline is submitted for execution")


Pipeline is built
Step Use papermill to run a notebook is ready to be created [13443c3c]
Pipeline validation complete
Created step Use papermill to run a notebook [13443c3c][653e657b-dce6-4f99-b43e-5107c295b821], (This step will run and generate new outputs)
Submitted pipeline run: aea00d1e-732f-423a-acae-1c8c9257a3ef
Pipeline is submitted for execution


**Note:** If regenerate_outputs is set to True, a new submit will always force generation of all step outputs, and disallow data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse the results of this run.
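The semantics in the note can be sketched with a small in-memory cache. This is a simplified model for illustration, assuming reuse is keyed on some step fingerprint; `StepCache` is hypothetical and not an Azure ML class.

```python
class StepCache:
    """Toy model of output reuse, keyed on a step fingerprint."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0

    def submit(self, fingerprint, compute, regenerate_outputs=False):
        """Run `compute` if forced or if nothing is cached; otherwise reuse."""
        if regenerate_outputs or fingerprint not in self._outputs:
            self.executions += 1
            # The fresh result replaces any cached one, so later runs can reuse it.
            self._outputs[fingerprint] = compute()
        return self._outputs[fingerprint]


cache = StepCache()
first = cache.submit("step-a", lambda: "hello world")
reused = cache.submit("step-a", lambda: "alpha")  # cached value is returned
forced = cache.submit("step-a", lambda: "alpha", regenerate_outputs=True)
after_forced = cache.submit("step-a", lambda: "omega")  # reuses the forced result
```

In this model, a forced run always executes, and the run after it reuses the forced run's result, which mirrors the pattern in the results table above.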


### Examine the pipeline run

#### Use RunDetails Widget
The RunDetails widget lets you click each row to get more details on the step runs. The cell below takes a simpler route and prints the status and job log of each step run directly.

In [114]:
step_runs = pipeline_run1.get_children()
for step_run in step_runs:
    status = step_run.get_status()
    print('Script:', step_run.name, 'status:', status)
    
    # Fetch the job log for every step, regardless of its status.
    joblog = step_run.get_job_log()
    print('job log:', joblog)

Script: Use papermill to run a notebook status: Finished


ErrorResponseException: (BadRequest) Response status code does not indicate success: 404 (Not Found).
Could not find node status for node [f7916073-d1a5-4fdc-b8eb-1cb9c9009360/5746942a-0be2-453c-95ee-d2b2d6fcbdc4]
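The error above stops the loop as soon as one job log cannot be retrieved. A defensive variant could record the failure and keep going. This is a sketch under the assumption that each step run exposes `name`, `get_status()`, and `get_job_log()` as used in the cell above; `summarize_step_runs` is a hypothetical helper.

```python
def summarize_step_runs(step_runs):
    """Collect (name, status, log) per step run, tolerating log retrieval failures."""
    summaries = []
    for step_run in step_runs:
        status = step_run.get_status()
        try:
            log = step_run.get_job_log()
        except Exception as exc:  # for example, the 404 response shown above
            log = "<job log unavailable: {}>".format(exc)
        summaries.append((step_run.name, status, log))
    return summaries
```

It could be called as `summarize_step_runs(pipeline_run1.get_children())` and the resulting tuples printed one per line.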

In [None]:
step_run