Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

## Goal

The goal of this notebook is to show how running a notebook via `papermill` as an **AzureML pipeline** differs from using a `ScriptRunConfig`. For the `ScriptRunConfig` examples, see this [notebook](simple-pm-run.ipynb).

This notebook assumes [simple-pm-run.ipynb](simple-pm-run.ipynb) has already been run to create resources such as the configuration for your AzureML workspace and your compute target.

This notebook started as a copy of the Pipelines Getting Started Notebook [here](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-getting-started.ipynb).

## Dependencies

This notebook requires that `azureml-sdk` is installed in the environment in which it is run.

## Workload

This notebook submits the `papermill_run_notebook.py` script in the `./projectDir` directory, which in turn executes the `hello_world.ipynb` notebook (also in `./projectDir`). It does this by defining the entry script as a `PythonScriptStep()` in a pipeline and submitting the pipeline. The pipeline is then resubmitted several times after various files are changed, to show the effects of the `hash_paths` and `regenerate_outputs` parameters.

This entry script (`papermill_run_notebook.py`) is intended as a simplified version of the one used in the [Microsoft/Recommenders repository](https://github.com/Microsoft/Recommenders/blob/jumin/dnn/reco_utils/aml/wide_deep.py).
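
The contents of `papermill_run_notebook.py` are not reproduced in this notebook. As a rough sketch only, an entry script of this kind could look like the code below. It assumes papermill's `execute_notebook(input_path, output_path, parameters=...)` call; the helper names `output_path_for` and `run_notebook` and the `_output` suffix are hypothetical and not taken from the actual script.

```python
import os


def output_path_for(notebook_path, suffix="_output"):
    """Return a sibling path for the executed copy of a notebook (hypothetical helper)."""
    root, ext = os.path.splitext(notebook_path)
    return f"{root}{suffix}{ext}"


def run_notebook(notebook_path, parameters=None):
    """Execute a notebook with papermill and return the path of the executed copy."""
    # Imported lazily so output_path_for() can be used without papermill installed.
    import papermill as pm

    out_path = output_path_for(notebook_path)
    pm.execute_notebook(notebook_path, out_path, parameters=parameters)
    return out_path


def main():
    # Hypothetical script-side value; the tables later in this notebook
    # track a variable like this as `msg2`.
    msg2 = "run1"
    print("msg2:", msg2)
    print("executed notebook:", run_notebook("hello_world.ipynb"))
```

A real entry script would call `main()` at the end of the file; it is left uncalled here so the sketch can be imported without running a notebook.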

### Azure Machine Learning Imports

In this first code cell, we import key Azure Machine Learning modules that we will use below. 

In [None]:
import os

import azureml.core
from azureml.core import Workspace, Experiment
from azureml.core.compute import AmlCompute  # used below to look up the compute target
from azureml.core.runconfig import CondaDependencies, RunConfiguration
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

## load pipeline dependencies
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

### Initialize Workspace

Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration, or retrieve it from Azure using the environment variables read in the cell below.

In [None]:
aml_compute_target = "aml-compute-d2" ## 2-16 characters
exp_name = 'papermill-in-a-pipeline'
# project folder
project_folder = './projectDir'

In [None]:
if os.path.isdir('aml_config'):
    print('Loading Workspace information from configuration')
    ws = Workspace.from_config()
else:
    print('Getting Workspace information from Variables. You must set these or this will fail!')
    SUBSCRIPTION_ID = os.getenv("AZ_SUB","")
    RESOURCE_GROUP = os.getenv("RESOURCE_GROUP","")
    WS_NAME = os.getenv("WS_NAME","")
    WS_LOCATION = 'eastus'
    ws=Workspace.get(name=WS_NAME,
                    resource_group=RESOURCE_GROUP,
                    subscription_id=SUBSCRIPTION_ID)

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
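
When an environment variable is unset, the fallback branch above passes an empty string to `Workspace.get()`. A small check before the call can give a clearer error message. The sketch below is one way to do that; the helper name `read_workspace_settings` is hypothetical, and the variable names (`AZ_SUB`, `RESOURCE_GROUP`, `WS_NAME`) are the ones read in the cell above.

```python
import os

# Maps Workspace.get() keyword arguments to the environment variables used above.
REQUIRED_VARS = {
    "subscription_id": "AZ_SUB",
    "resource_group": "RESOURCE_GROUP",
    "name": "WS_NAME",
}


def read_workspace_settings(env=None):
    """Collect Workspace.get() keyword arguments from environment variables.

    Raises ValueError naming every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    settings = {kw: env.get(var, "") for kw, var in REQUIRED_VARS.items()}
    missing = [REQUIRED_VARS[kw] for kw, value in settings.items() if not value]
    if missing:
        raise ValueError("Set these environment variables first: " + ", ".join(missing))
    return settings
```

With this helper, the fallback branch could read `ws = Workspace.get(**read_workspace_settings())`.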


## Compute Targets

We pick up where the prior notebook left off and continue to use the AmlCompute target created there to execute our pipeline step.



#### List of Compute Targets on the workspace

In [None]:
cts = ws.compute_targets
for ct in cts:
    print(ct)

In [None]:
## run_config.load() does not seem to work:
# my_aml_run_config = RunConfiguration()
# my_aml_run_config.load(path='.', name=aml_compute_target)
# print(my_aml_run_config.target) ## still prints 'local'
# it does not load the values...

## so recreate
aml_compute = AmlCompute(ws, aml_compute_target)

cd = CondaDependencies.create(pip_packages=["ipykernel", "papermill", "azureml-sdk"])
my_aml_run_config = RunConfiguration(conda_dependencies=cd)
my_aml_run_config.target = aml_compute_target
my_aml_run_config.environment.docker.enabled = True
my_aml_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE

## Run as a Pipeline ...

To run the script as a pipeline, create a `PythonScriptStep`, build a pipeline from it, and then submit the pipeline.

In [None]:
# Uses default values for PythonScriptStep construct.

step1nohash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False
                        )
print("Step1 created")

### Build, Validate, and Submit the pipeline
You can [validate](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#validate) the pipeline before submitting it. The platform still runs validation checks, such as detecting circular dependencies and checking parameters, even if you do not call `validate()` explicitly.

### Submit the pipeline
[Submitting](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#submit) the pipeline involves creating an [Experiment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment?view=azure-ml-py) object and providing the built pipeline for submission. 

In [None]:
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")

## Check for file changes

Pipelines behave differently from experiments submitted with a `ScriptRunConfig`. Try editing `msg` in `hello_world.ipynb`, then rebuild and resubmit the same pipeline.


In [None]:
## Edit `msg` first!!
step1nohash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False
                        )
print("Step1 created")
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")

## Results of Experiments

Across runs, you can try changing various things. The table below documents an example set of runs. 

Entries are bolded if they are different from the entry in the prior row.

Column 1 is the run number. Column 2 lists the pipeline steps that were re-executed after a file was changed. Column 3 is the value of the `msg` variable in the **notebook**, and column 4 is the value of the `msg2` variable in the **script**. Columns 5 and 6 show what the logger recorded for `msg` and `msg2` in each run.

If you try different changes (for example, changing only the notebook and not the script), you will see that by default, with no additional parameters passed, updates to the notebook are only propagated if you also update the script (and at least rebuild the pipeline).

Updating the script triggers an update of the entire `source_directory`. If you only update the notebook, you won't see new results, whether you rebuild the pipeline or even redefine the `PythonScriptStep`. The key rows that show this are runs 5, 7, 8, 11, and 12. Runs 6, 9, and 13 show that both the notebook and the script are updated when the script is changed.



| run num |  pipeline updates              | msg status (nb)  | msg2 status (script) | msg log    | msg2 log |
|---------|--------------------------------|-------------------|----------------------|------------|----------|
| run 1   |  clean                         | hello world           | run1 | hello world     | run1      |
| run 2   | **only Experiment.submit()** | hello world           | **run2** | hello world     | run1  |
| run 3   | **built, validated, and submitted** | hello world           | run2  | hello world     | **run2** |
| run 4   | built, validated, and submitted | hello world           | **run4**  | hello world     | **run4**  |
| run 5   | built, validated, and submitted | **'tst 2 - hello world'** | run4  | 'hello world'   | run4  |
| run 6   | built, validated, and submitted | **'change - hello world'**  | **run7** | **'change - hello world'**  | **run7** |
| run 7   | built, validated, and submitted | **'chng 2 - hello world'**  | run7 | 'change - hello world'  | run7 |
| run 8   | built, validated, and submitted | **'hello world'**  | run7 | 'change - hello world'  | run7 |
| run 9   | built, validated, and submitted | 'hello world'      | **run9** | **'hello world'**  | **run9** |
| run 10  | **step def., built, validated, and submitted** | 'hello world'      | **run10** | 'hello world'  | **run10** |
| run 11  | step def., built, validated, and submitted | **'delta: hello world'**      | run10 | 'hello world'  | run10 |
| run 12  | step def., built, validated, and submitted | **'no show: hello world'**      | run10 | 'hello world'  | run10 |
| run 13  | step def., built, validated, and submitted | 'no show: hello world'      | **run13** | **'no show: hello world'**  | **run13** |


### How to change this behavior

There are (at least) two ways to change this behavior.

- Use the `hash_paths` argument to `PythonScriptStep()` in order to make sure that key files (like the notebook) are checked for changes prior to submission.
- Set `regenerate_outputs=True` when you call `Experiment.submit()`.

To see these two options in action, see the following sections.


## Using hash_paths

Reset the hello_world notebook, and rebuild the `PythonScriptStep` and pipeline by defining that file's path as a `hash_path`.

In [None]:
# Same step definition as before, plus hash_paths so that the notebook is checked for changes.
step1withhash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False,
                         hash_paths=['hello_world.ipynb']
                        )
print("Step1 created")
pipeline1 = Pipeline(workspace=ws, steps=[step1withhash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")


In [None]:
## Now change the script and notebook as in the runs above, and resubmit

pipeline1 = Pipeline(workspace=ws, steps=[step1withhash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1)
print("Pipeline is submitted for execution")

## Results of Experiments with hash_paths

This table has the same rows and columns as the one above. The only difference in these pipeline runs is that the `PythonScriptStep` is created with `hash_paths=['hello_world.ipynb']`. With that additional argument, the behavior is very different: the output of each pipeline run tracks changes whenever either the script or the notebook changes.

| run num |  pipeline updates              | msg status (nb)  | msg2 status (script) | msg log    | msg2 log |
|---------|--------------------------------|-------------------|----------------------|------------|----------|
| run 1   |  clean                         | hello world           | run1 | hello world     | run1      |
| run 2   | **Experiment.submit()**        | hello world           | **run2** | hello world     | run1  |
| run 3   | **built, validated, and submitted** | hello world           | run2  | hello world     | **run2** |
| run 4   | built, validated, and submitted | hello world           | **run4**  | hello world     | **run4**  |
| run 5   | built, validated, and submitted | **'chng: Hello World!'** | run4  | **'chng: Hello World!'**  | run4  |
| run 6   | built, validated, and submitted | **'Hello World again!'** | run4  | **'Hello World again!'**  | run4  |
| run 7   | **step def., built, validated, and submitted** | 'Hello World again!'   | run4 | 'Hello World again!'  | run4 |
| run 8   | step def., built, validated, and submitted | **'Hello World!'**      | run4 | **'Hello World!'**  | run4 |
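
One way to reason about the difference between the two tables is to imagine that a step is identified by a fingerprint of selected files. The sketch below is a toy model of the observed behavior only, not AzureML's implementation: the function `step_fingerprint` is hypothetical, and it assumes that only the entry script is fingerprinted by default while `hash_paths` adds further files.

```python
import hashlib
import os
import tempfile


def step_fingerprint(source_dir, script_name, hash_paths=None):
    """Hash the entry script plus any extra files listed in hash_paths (toy model)."""
    digest = hashlib.sha256()
    for rel_path in [script_name] + sorted(hash_paths or []):
        with open(os.path.join(source_dir, rel_path), "rb") as f:
            digest.update(rel_path.encode("utf-8"))
            digest.update(f.read())
    return digest.hexdigest()


def write(path, text):
    with open(path, "w") as f:
        f.write(text)


SCRIPT = "papermill_run_notebook.py"
NOTEBOOK = "hello_world.ipynb"

with tempfile.TemporaryDirectory() as d:
    # Stand-in file contents; the real files live in ./projectDir.
    write(os.path.join(d, SCRIPT), "msg2 = 'run1'\n")
    write(os.path.join(d, NOTEBOOK), "msg = 'hello world'\n")

    plain_before = step_fingerprint(d, SCRIPT)
    hashed_before = step_fingerprint(d, SCRIPT, hash_paths=[NOTEBOOK])

    # Edit only the notebook, as in run 5 of both tables.
    write(os.path.join(d, NOTEBOOK), "msg = 'chng: Hello World!'\n")

    plain_after = step_fingerprint(d, SCRIPT)
    hashed_after = step_fingerprint(d, SCRIPT, hash_paths=[NOTEBOOK])

change_seen_without_hash_paths = plain_before != plain_after
change_seen_with_hash_paths = hashed_before != hashed_after
print(change_seen_without_hash_paths, change_seen_with_hash_paths)  # False True
```

In this model, a notebook edit leaves the default fingerprint unchanged, which matches the stale `msg log` entries in the first table, while the fingerprint that includes the notebook changes, which matches the table directly above.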


## Explore regenerate_outputs

We can try the same experiment (this time with `hash_paths=None`) to see if the `regenerate_outputs` parameter in `Experiment.submit()` has a similar effect. 

In this case, because we know that script changes trigger an update, we'll only vary the notebook, the pipeline updates, and whether `regenerate_outputs` is `True` or `False` in the submit call.


In [None]:
## Edit `msg` first!!
step1nohash = PythonScriptStep(name="Use papermill to run a notebook",
                         script_name="papermill_run_notebook.py", 
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         runconfig=my_aml_run_config,
                         allow_reuse=False
                        )
print("Step1 created")
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1, regenerate_outputs=False)
print("Pipeline is submitted for execution")

In [None]:
## Try changing the notebook and the value of regenerate_outputs, then rerun
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1, regenerate_outputs=True)
print("Pipeline is submitted for execution")

## Results of Experiments with regenerate_outputs

In all cases, `hash_paths=None`.

This table is similar to the tables above, but column 2 now shows the value of the `regenerate_outputs` parameter.

`regenerate_outputs` enforces a new build. Not passing it as an argument seems analogous to `regenerate_outputs=False`, but this appears to be undocumented. If `regenerate_outputs=False`, changes to the notebook are not propagated to the run. If `regenerate_outputs=True`, changes to the notebook are propagated, provided the pipeline is rebuilt before it is submitted (compare runs 3 and 4 with run 5).


| run num |  regenerate_outputs | pipeline updates | msg status (nb)  | msg2 status (script) | msg log    | msg2 log |
|---------|---------------------|-----------|-------------------|----------------------|------------|----------|
| run 1   | 0 | clean                         | hello world           | run1 | hello world     | run1      |
| run 2   | 0 | **Experiment.submit()**        | **tst2**          | run1 | hello world     | run1  |
| run 3   | **1** | Experiment.submit()        | tst2          | run1 | hello world     | run1  |
| run 4   | 1 | Experiment.submit()        | tst2          | **run4** | hello world     | run1  |
| run 5   | 1 | **built, validated, and submitted** | tst2           | **run1**  | **tst2**     | run1 |
| run 6   | 1 | built, validated, and submitted | **change**           | run1  | **change**     | run1 |
| run 7   | 1 | built, validated, and submitted | change           | run1  | change     | run1 |
| run 8   | 1 | built, validated, and submitted | **delta**           | run1  | **delta**     | run1 |
| run 9   | 1 | built, validated, and submitted | **gamma**           | run1  | **gamma**     | run1 |
| run 10  | **0** | built, validated, and submitted | **epsilon**           | run1  | gamma     | run1 |
| run 11  | 0 | built, validated, and submitted | **alpha**           | run1  | gamma     | run1 |
| run 12  | **1** | built, validated, and submitted | alpha           | run1  | **alpha**   | run1 |


In [None]:
## Edit `msg` first!!
pipeline1 = Pipeline(workspace=ws, steps=[step1nohash])  # `step1` is never defined in this notebook
print ("Pipeline is built")
pipeline1.validate()
print("Pipeline validation complete")
pipeline_run1 = Experiment(ws, exp_name).submit(pipeline1, regenerate_outputs=False)
print("Pipeline is submitted for execution")


**Note:** If regenerate_outputs is set to True, a new submit will always force generation of all step outputs, and disallow data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse the results of this run.
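
The reuse rule in the note can be pictured with a small cache. The class below is a toy stand-in, not the AzureML implementation; `ToyStepCache` and the string fingerprint are hypothetical. It assumes the step fingerprint stays the same because the script is unchanged, as in runs 9 to 12 of the table above.

```python
class ToyStepCache:
    """Toy stand-in for step-output reuse; not the AzureML implementation."""

    def __init__(self):
        self._outputs = {}

    def submit(self, fingerprint, compute, regenerate_outputs=False):
        """Return (output, reused). `compute` is only called when nothing is reused."""
        if not regenerate_outputs and fingerprint in self._outputs:
            return self._outputs[fingerprint], True
        output = compute()
        # Store the fresh result so that later submits may reuse it.
        self._outputs[fingerprint] = output
        return output, False


cache = ToyStepCache()
script_fingerprint = "script-v1"  # unchanged script, so the fingerprint never changes

first = cache.submit(script_fingerprint, lambda: "gamma")
stale = cache.submit(script_fingerprint, lambda: "epsilon")
forced = cache.submit(script_fingerprint, lambda: "alpha", regenerate_outputs=True)
reused = cache.submit(script_fingerprint, lambda: "omega")

print(first)   # ('gamma', False)
print(stale)   # ('gamma', True)
print(forced)  # ('alpha', False)
print(reused)  # ('alpha', True)
```

The forced submit recomputes even though a cached result exists, and the submit after it reuses the forced result, which is the sequence the note describes.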


### Examine the pipeline run

#### Use RunDetails Widget
The `RunDetails` widget (from `azureml.widgets`) gives an interactive view of a pipeline run. The cell below takes the plain-text route instead: it lists each step run of the pipeline and prints its status and job log.

In [None]:
step_runs = pipeline_run1.get_children()
for step_run in step_runs:
    status = step_run.get_status()
    print('Script:', step_run.name, 'status:', status)
    
    # The job log is printed for every step, whether or not the step succeeded.
    joblog = step_run.get_job_log()
    print('job log:', joblog)

In [None]:
step_run
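
If only failing steps are of interest, the loop above can be narrowed with a small filter. This is a sketch under assumptions: the helper `failed_step_logs` and the `FakeStepRun` stand-in are hypothetical, and the status string `"Failed"` is assumed to be what `get_status()` reports for a failed step. The helper relies only on the members used in the cell above (`name`, `get_status()`, `get_job_log()`).

```python
def failed_step_logs(step_runs):
    """Return {step name: job log} for the steps whose status is 'Failed'."""
    logs = {}
    for step_run in step_runs:
        if step_run.get_status() == "Failed":
            logs[step_run.name] = step_run.get_job_log()
    return logs


class FakeStepRun:
    """Stand-in used to exercise the helper without a workspace."""

    def __init__(self, name, status, log):
        self.name = name
        self._status = status
        self._log = log

    def get_status(self):
        return self._status

    def get_job_log(self):
        return self._log


runs = [
    FakeStepRun("ok step", "Finished", "done"),
    FakeStepRun("bad step", "Failed", "Traceback ..."),
]
print(failed_step_logs(runs))  # {'bad step': 'Traceback ...'}
```

Against a real run, the call would be `failed_step_logs(list(pipeline_run1.get_children()))`; wrapping the children in `list()` keeps them available for a second pass.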