# Multi-step pipeline example

In this example, we'll build a two-step pipeline that passes data from the first step (prepare) to the second step (train).

**Note:** This example requires that you have run the notebook from the first tutorial, so that the dataset and compute cluster are set up.

In [1]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

Azure ML SDK version: 1.28.0


First, we connect to the workspace. `Workspace.from_config()` will either:
* read the local `config.json` containing the workspace reference, if it exists, or
* use the `az` CLI to connect to the workspace that was attached via `az ml folder attach -g <resource group> -w <workspace name>`.

In [2]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')

WS name: demo-ent-ws
Region: westeurope
Subscription id: bcbf34a7-1936-4783-8840-8f324c37f354
Resource group: demo
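
For reference, the `config.json` that `Workspace.from_config()` looks for is a small JSON file with three keys: `subscription_id`, `resource_group`, and `workspace_name`. The sketch below writes such a file to a temporary directory and reads it back using only the standard library. The subscription ID is a placeholder, the helper names are made up for illustration, and the snippet never contacts Azure.

```python
import json
import tempfile
from pathlib import Path

# Keys expected in config.json; the subscription ID is a placeholder.
workspace_config = {
    "subscription_id": "<subscription id>",
    "resource_group": "demo",
    "workspace_name": "demo-ent-ws",
}


def write_config(directory, config):
    """Write config.json into `directory` and return its path (illustrative helper)."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def read_config(path):
    """Load the workspace reference from a config.json file (illustrative helper)."""
    return json.loads(Path(path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(tmp, workspace_config)
    loaded_config = read_config(config_path)

print(loaded_config["workspace_name"])
```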


Next, let's reference our training dataset from the last tutorial, so that we can use it as the pipeline input for the prepare step:

In [3]:
# Use our dataset as the default (applies if the user does not set the parameter when invoking the pipeline)
default_training_dataset = Dataset.get_by_name(ws, "german-credit-train-tutorial")

# Pass the German credit dataset as the default value of a PipelineParameter.
# The parameter can be overridden when submitting a run of the pipeline.
training_dataset_parameter = PipelineParameter(name="training_dataset", default_value=default_training_dataset)
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset_parameter).as_download()


Let's also define a `PipelineData` placeholder, which is used to persist data and pipe it from the prepare step to the train step. The `PipelineData` name becomes the name of the blob storage folder that holds the first step's output data (*default storage --> azureml-blobstore-XXXXXXX... --> azureml --> step Run ID --> PipelineData name*):

In [4]:
default_datastore = ws.get_default_datastore()
prepared_data = PipelineData("prepared_data", datastore=default_datastore)
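
To make the storage layout described above concrete, the following sketch assembles the folder path of a step output inside the default blob container. The helper `pipeline_data_blob_path` is hypothetical and only mirrors the *azureml --> step Run ID --> PipelineData name* layout stated above; the run ID is a placeholder.

```python
import posixpath


def pipeline_data_blob_path(step_run_id, pipeline_data_name):
    """Return the folder path of a step output inside the blob container.

    Hypothetical helper that follows the layout described in the text:
    azureml / <step run id> / <PipelineData name>.
    """
    return posixpath.join("azureml", step_run_id, pipeline_data_name)


# "<step-run-id>" is a placeholder for the run ID of the prepare step.
example_path = pipeline_data_blob_path("<step-run-id>", "prepared_data")
print(example_path)
```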


If you want the outputs written to a specific blob storage, register that storage as a *Datastore* first and then use the Python [OutputFileDatasetConfig class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig?view=azure-ml-py) (see [these examples](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#use-outputfiledatasetconfig-for-intermediate-data)). It is used in the next tutorial (*parallel_run_step_pipeline*).

Next, we can create our two-step pipeline, which runs some preprocessing on the data and then pipes the output to the training code. In this case, we use a separate `runconfig` for each step. The dependency graph is resolved automatically from the data inputs and outputs, but we could also define it ourselves if desired:

In [5]:
prepare_runconfig = RunConfiguration.load("prepare_runconfig.yml")

prepare_step = PythonScriptStep(name="prepare-step",
                        runconfig=prepare_runconfig,
                        source_directory="./",
                        script_name=prepare_runconfig.script,
                        arguments=['--data-input-path', training_dataset_consumption,
                                   '--data-output-path', prepared_data],
                        inputs=[training_dataset_consumption],
                        outputs=[prepared_data],
                        allow_reuse=False)

train_runconfig = RunConfiguration.load("train_runconfig.yml")

train_step = PythonScriptStep(name="train-step",
                        runconfig=train_runconfig,
                        source_directory="./",
                        script_name=train_runconfig.script,
                        arguments=['--data-path', prepared_data],
                        inputs=[prepared_data],
                        allow_reuse=False)

train_step.run_after(prepare_step)  # redundant here (the data dependency already orders the steps); shown for illustration
steps = [prepare_step, train_step]
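
On the script side, each step receives its paths as ordinary command-line arguments. The sketch below shows what a prepare script could look like, assuming the input is a folder of CSV files and that "preparing" simply means dropping rows with empty fields. The actual script is whatever `prepare_runconfig.yml` points to, so the function names and the cleaning logic here are purely illustrative.

```python
import argparse
import csv
import os


def parse_args(argv=None):
    """Parse the two paths passed via the step's `arguments` list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-input-path", required=True)
    parser.add_argument("--data-output-path", required=True)
    return parser.parse_args(argv)


def prepare(input_dir, output_dir):
    """Copy every CSV from input_dir to output_dir, skipping rows with empty fields."""
    os.makedirs(output_dir, exist_ok=True)  # create the output folder if it is missing
    written = []
    for file_name in sorted(os.listdir(input_dir)):
        if not file_name.endswith(".csv"):
            continue
        src_path = os.path.join(input_dir, file_name)
        dst_path = os.path.join(output_dir, file_name)
        with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                # Illustrative cleaning rule: keep only fully populated rows.
                if row and all(field.strip() for field in row):
                    writer.writerow(row)
        written.append(file_name)
    return written


def main(argv=None):
    """Entry point: read the arguments and run the preparation."""
    args = parse_args(argv)
    return prepare(args.data_input_path, args.data_output_path)
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard. A train script would read `--data-path` in the same way and find the files written by `prepare` there.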

Finally, we can create our pipeline object and validate it. Validation checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [6]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

Step prepare-step is ready to be created [af65b269]
Step train-step is ready to be created [0c2f9780]


[]
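
The idea of deriving the step order from data inputs and outputs, and of rejecting cyclic graphs, can be sketched with the standard library's `graphlib` module (Python 3.9+). This illustrates the concept and is not the SDK's actual implementation; the `step_io` dictionary is a hand-written stand-in for the two `PythonScriptStep` objects.

```python
from graphlib import CycleError, TopologicalSorter

# Stand-in description of each step: the data it consumes and produces.
step_io = {
    "prepare-step": {"inputs": ["training_dataset"], "outputs": ["prepared_data"]},
    "train-step": {"inputs": ["prepared_data"], "outputs": []},
}


def execution_order(step_io):
    """Order steps so that every producer runs before its consumers."""
    producer_of = {
        output: name for name, io in step_io.items() for output in io["outputs"]
    }
    # Map each step to the set of steps that produce its inputs.
    graph = {
        name: {producer_of[i] for i in io["inputs"] if i in producer_of}
        for name, io in step_io.items()
    }
    # static_order() raises CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(step_io):
    """Return True if the steps can be ordered, False if they depend on each other."""
    try:
        execution_order(step_io)
    except CycleError:
        return False
    return True


print(execution_order(step_io))
```

Inputs that no step produces, such as `training_dataset`, are treated as external data and do not create a dependency.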

Lastly, we can submit the pipeline against an experiment:

In [7]:
pipeline_run = Experiment(ws, 'mlops-workshop-pipelines-20210524').submit(pipeline)
pipeline_run.wait_for_completion()

Created step prepare-step [af65b269][a5cd884e-449b-4691-8d1d-ef21cc0fbe5d], (This step will run and generate new outputs)
Created step train-step [0c2f9780][1d830300-c223-46ed-91f3-966f4e06617e], (This step will run and generate new outputs)

Submitted PipelineRun e972bea0-a7c8-4af0-8644-ca63beb52a63
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/e972bea0-a7c8-4af0-8644-ca63beb52a63?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws&tid=1f053027-5c7a-4f10-8444-ca55e5715f27
PipelineRunId: e972bea0-a7c8-4af0-8644-ca63beb52a63
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/e972bea0-a7c8-4af0-8644-ca63beb52a63?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws&tid=1f053027-5c7a-4f10-8444-ca55e5715f27
PipelineRun Status: Running


StepRunId: 1976bc36-d674-4382-a3c8-f3336dfd807c
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/1976bc36-d674-4382


Streaming azureml-logs/75_job_post-tvmps_36b3f3ebd83225e8bc2dc7d55ce1be36c710b2ebeba1bf09dc2117cd331435af_d.txt
[2021-05-24T17:52:39.551421] Entering job release
[2021-05-24T17:52:41.231670] Starting job release
[2021-05-24T17:52:41.232462] Logging experiment finalizing status in history service.
Starting the daemon thread to refresh tokens in background for process with pid = 163
[2021-05-24T17:52:41.234982] job release stage : upload_datastore starting...
[2021-05-24T17:52:41.245965] job release stage : start importing azureml.history._tracking in run_history_release.
[2021-05-24T17:52:41.246032] job release stage : execute_job_release starting...
[2021-05-24T17:52:41.246608] job release stage : copy_batchai_cached_logs starting...
[2021-05-24T17:52:41.246727] job release stage : copy_batchai_cached_logs completed...
[2021-05-24T17:52:41.317565] Entering context manager injector.
[2021-05-24T17:52:41.343936] job release stage : send_run_telemetry starting...
[2021-05-24T17:52:41.372




StepRunId: 56a05f73-be56-4f4e-a605-b2e8e9d3bab6
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/56a05f73-be56-4f4e-a605-b2e8e9d3bab6?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws&tid=1f053027-5c7a-4f10-8444-ca55e5715f27
StepRun( train-step ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_36b3f3ebd83225e8bc2dc7d55ce1be36c710b2ebeba1bf09dc2117cd331435af_d.txt
2021-05-24T17:53:07Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/demo-ent-ws/azureml/56a05f73-be56-4f4e-a605-b2e8e9d3bab6/mounts/workspaceblobstore
2021-05-24T17:53:08Z Starting output-watcher...
2021-05-24T17:53:08Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2021-05-24T17:53:08Z Executing 'Copy ACR Details file' on 10.0.0.5
2021-05-24T17:53:08Z Copy ACR Details file succeeded on 10.0.0.5. Output: 
>>>   
>>>   
Login Succeeded
Using default tag: latest
latest: Pulling from azure


Streaming azureml-logs/75_job_post-tvmps_36b3f3ebd83225e8bc2dc7d55ce1be36c710b2ebeba1bf09dc2117cd331435af_d.txt
bash: /azureml-envs/azureml_9cfcbdf246a71704d446a597691d2bad/lib/libtinfo.so.5: no version information available (required by bash)
[2021-05-24T17:53:26.836967] Entering job release
[2021-05-24T17:53:27.924360] Starting job release
[2021-05-24T17:53:27.930398] Logging experiment finalizing status in history service.
[2021-05-24T17:53:27.930580] job release stage : upload_datastore starting...
Starting the daemon thread to refresh tokens in background for process with pid = 177
[2021-05-24T17:53:27.931039] job release stage : start importing azureml.history._tracking in run_history_release.
[2021-05-24T17:53:27.931781] job release stage : execute_job_release starting...
[2021-05-24T17:53:27.934624] job release stage : copy_batchai_cached_logs starting...
[2021-05-24T17:53:27.941492] job release stage : copy_batchai_cached_logs completed...
[2021-05-24T17:53:27.942019] Enterin



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'e972bea0-a7c8-4af0-8644-ca63beb52a63', 'status': 'Completed', 'startTimeUtc': '2021-05-24T17:47:52.925482Z', 'endTimeUtc': '2021-05-24T17:53:38.305968Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.e972bea0-a7c8-4af0-8644-ca63beb52a63/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=CJVeu7yU1Z3ZeUwsK9ZjV5tolXN4K6pJoTnGNHbekDU%3D&st=2021-05-24T17%3A43%3A39Z&se=2021-05-25T01%3A53%3A39Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.e972bea0-a7c8-4af0-8644-ca63beb52a63/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=%2FN5u7XCV73JzykWBZtk%2BL%2FlWbCc7nBgXKXddlEvdibk%3D&st=2021-05-24T17%3A43%3A39Z&s

'Finished'
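
The summary dictionary printed at the end of the run carries timestamps that are handy for quick checks. As a small sketch, the snippet below computes the wall-clock duration from the `startTimeUtc` and `endTimeUtc` values shown in the summary above. The `run_duration_seconds` helper is hypothetical and not part of the SDK; it assumes you have a dictionary with those two fields at hand.

```python
from datetime import datetime

# Timestamp layout used by the run summary, e.g. 2021-05-24T17:47:52.925482Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration_seconds(run_details):
    """Return the time between start and end of a run, in seconds."""
    start = datetime.strptime(run_details["startTimeUtc"], TIMESTAMP_FORMAT)
    end = datetime.strptime(run_details["endTimeUtc"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the execution summary above.
run_details = {
    "startTimeUtc": "2021-05-24T17:47:52.925482Z",
    "endTimeUtc": "2021-05-24T17:53:38.305968Z",
}

duration = run_duration_seconds(run_details)
print(f"Pipeline ran for {duration:.0f} seconds")
```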

Alternatively, we can publish the pipeline as a REST endpoint. In that case, the dataset can be specified when the pipeline is invoked. This is easy to do in the Studio UI: go to `Endpoints`, then `Pipeline Endpoints`, and select the pipeline. After you click the submit button, you can choose the dataset at the bottom of the window.

In [8]:
published_pipeline = pipeline.publish('mlops-multi-step-pipeline')
published_pipeline

| Name | Id | Status | Endpoint |
| --- | --- | --- | --- |
| mlops-multi-step-pipeline | 2140ab37-928e-4424-9f15-96bf7712d835 | Active | REST Endpoint |
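
Outside the Studio UI, a published pipeline is typically triggered with an authenticated HTTP POST to its REST endpoint. The sketch below only builds such a request with the standard library and does not send it. The endpoint URL and token are placeholders, `build_pipeline_request` is a hypothetical helper, and `ExperimentName` is the only body field used here. Overriding the dataset parameter needs additional body fields, which should be taken from the Azure ML documentation rather than from this sketch.

```python
import json
import urllib.request


def build_pipeline_request(endpoint_url, aad_token, experiment_name):
    """Build (but do not send) a POST request that would trigger a pipeline run."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; in practice the URL would come from the published
# pipeline and the token from an Azure Active Directory login.
request = build_pipeline_request(
    "https://example.invalid/pipeline-endpoint",
    "<AAD token>",
    "mlops-workshop-pipelines",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, once a real endpoint URL and a valid token are in place.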
