# Multi-step pipeline example

In this example, we'll build a two-step pipeline that passes data from the first step (prepare) to the second step (train).

**Note:** This example requires that you've run the notebook from the first tutorial, so that the dataset and compute cluster are set up.

In [1]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

Azure ML SDK version: 1.20.0


First, we will connect to the workspace. The command `Workspace.from_config()` will either:
* Read the local `config.json` containing the workspace reference (if it exists), or
* Use the `az` CLI to connect to the workspace that was attached via `az ml folder attach -g <resource group> -w <workspace name>`

In [2]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')

WS name: demo-ent-ws
Region: westeurope
Subscription id: bcbf34a7-1936-4783-8840-8f324c37f354
Resource group: demo
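
To make the first option more concrete, here is a minimal sketch of what a `config.json` workspace reference contains and how it could be read. The three keys are the ones such a file normally holds; the helper `read_workspace_config` and the placeholder subscription id are hypothetical and only illustrate the file format, not how the SDK loads it internally.

```python
import json
import tempfile
from pathlib import Path

# Keys a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def read_workspace_config(path):
    """Return the workspace reference stored in a config.json file."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}


# Write an example file (placeholder subscription id), then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps({
        "subscription_id": "<subscription id>",
        "resource_group": "demo",
        "workspace_name": "demo-ent-ws",
    }))
    workspace_ref = read_workspace_config(config_path)

print(workspace_ref["workspace_name"])
```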


Next, let's reference our training dataset from the last tutorial, so that we can use it as the pipeline input for the prepare step:

In [3]:
# Use our dataset as the default (applies if the user does not set the parameter when invoking the pipeline)
default_training_dataset = Dataset.get_by_name(ws, "german-credit-train-tutorial")

# Pass the German credit dataset as the default value of a PipelineParameter.
# The parameter can be overridden when submitting a run of the pipeline.
training_dataset_parameter = PipelineParameter(name="training_dataset", default_value=default_training_dataset)
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset_parameter).as_download()


Let's also define a `PipelineData` placeholder, which will be used to persist and pipe data from the prepare step to the train step. The PipelineData name becomes the name of the blob storage folder that contains the first step's output data (*default storage --> azureml-blobstore-XXXXXXX... --> azureml --> step Run ID --> PipelineData name*):

In [4]:
default_datastore = ws.get_default_datastore()
prepared_data = PipelineData("prepared_data", datastore=default_datastore)


If you want to write the outputs to a specific blob storage, first register it as a *Datastore*; you can then use the Python [OutputFileDatasetConfig Class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig?view=azure-ml-py) (see [these examples](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#use-outputfiledatasetconfig-for-intermediate-data)). It will be used in the next tutorial (*parallel_run_step_pipeline*).

Next, we can create our two-step pipeline, which runs some preprocessing on the data and then pipes the output to the training code. In this case, we use a separate `runconfig` for each step. The dependency graph is automatically resolved through the data inputs and outputs, but we could also define it ourselves (if desired):

In [5]:
prepare_runconfig = RunConfiguration.load("prepare_runconfig.yml")

prepare_step = PythonScriptStep(name="prepare-step",
                        runconfig=prepare_runconfig,
                        source_directory="./",
                        script_name=prepare_runconfig.script,
                        arguments=['--data-input-path', training_dataset_consumption,
                                   '--data-output-path', prepared_data],
                        inputs=[training_dataset_consumption],
                        outputs=[prepared_data],
                        allow_reuse=False)

train_runconfig = RunConfiguration.load("train_runconfig.yml")

train_step = PythonScriptStep(name="train-step",
                        runconfig=train_runconfig,
                        source_directory="./",
                        script_name=train_runconfig.script,
                        arguments=['--data-path', prepared_data],
                        inputs=[prepared_data],
                        allow_reuse=False)

train_step.run_after(prepare_step) # not really needed here, just for illustration
steps = [prepare_step, train_step]
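
The prepare step receives its paths through the `--data-input-path` and `--data-output-path` arguments defined above. The real script is whatever `prepare_runconfig.script` points to and is not shown in this notebook, so the following is only a minimal sketch of how such a script could consume the two arguments. The functions `parse_args` and `prepare` are hypothetical, and the file copy stands in for the actual preprocessing. The output folder is created defensively, on the assumption that it may not exist yet when the script starts.

```python
import argparse
import os
import shutil
import tempfile


def parse_args(argv=None):
    """Parse the two arguments that the prepare step passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-input-path", required=True)
    parser.add_argument("--data-output-path", required=True)
    return parser.parse_args(argv)


def prepare(input_path, output_path):
    """Copy every file from input_path to output_path (placeholder for real preprocessing)."""
    os.makedirs(output_path, exist_ok=True)  # create the output folder if needed
    copied = []
    for name in sorted(os.listdir(input_path)):
        source = os.path.join(input_path, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(output_path, name))
            copied.append(name)
    return copied


# Local dry run with temporary folders instead of pipeline-provided paths.
with tempfile.TemporaryDirectory() as tmp_dir:
    input_dir = os.path.join(tmp_dir, "input")
    output_dir = os.path.join(tmp_dir, "prepared_data")
    os.makedirs(input_dir)
    with open(os.path.join(input_dir, "data.csv"), "w") as handle:
        handle.write("a,b\n1,2\n")
    args = parse_args(["--data-input-path", input_dir,
                       "--data-output-path", output_dir])
    copied_files = prepare(args.data_input_path, args.data_output_path)

print(copied_files)
```

In the pipeline itself, the script would call `parse_args()` without an argument list, so that the values come from the command line assembled by the step.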

Finally, we can create our pipeline object and validate it. Validation checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic. `validate()` returns a list of errors, so an empty list means that no problems were found:

In [6]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

Step prepare-step is ready to be created [b3e97ee8]
Step train-step is ready to be created [b53a2c3d]


[]
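
To illustrate what "acyclic" means for a step graph, here is a small sketch that orders steps by their dependencies and fails if there is a cycle. It uses Kahn's algorithm for topological sorting; this is an illustration only and not a description of how the SDK's `validate()` is implemented. The function `topological_order` and the dictionary layout are assumptions made for this example.

```python
from collections import deque


def topological_order(dependencies):
    """Return the steps in a valid execution order.

    `dependencies` maps each step name to the list of steps it depends on.
    Every dependency must itself be a key of the dictionary.
    Raises ValueError if the graph contains a cycle.
    """
    remaining = {step: len(deps) for step, deps in dependencies.items()}
    dependents = {step: [] for step in dependencies}
    for step, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(step)

    # Start with the steps that do not depend on anything.
    ready = deque(step for step, count in remaining.items() if count == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for dependent in dependents[step]:
            remaining[dependent] -= 1
            if remaining[dependent] == 0:
                ready.append(dependent)

    # Steps that never became ready are part of a cycle.
    if len(order) != len(dependencies):
        raise ValueError("cycle detected in step graph")
    return order


order = topological_order({"prepare-step": [], "train-step": ["prepare-step"]})
print(order)
```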

Lastly, we can submit the pipeline against an experiment:

In [7]:
pipeline_run = Experiment(ws, 'mlops-workshop-pipelines').submit(pipeline)
pipeline_run.wait_for_completion()

Created step prepare-step [b3e97ee8][4825aed4-f018-4dbf-b919-34b48df82007], (This step will run and generate new outputs)
Created step train-step [b53a2c3d][2cb53e3d-df07-4d72-9582-667692e7d6ff], (This step will run and generate new outputs)
Submitted PipelineRun cd8baca6-b8a0-40c5-908a-dafdc467920a
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipelines/runs/cd8baca6-b8a0-40c5-908a-dafdc467920a?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws
PipelineRunId: cd8baca6-b8a0-40c5-908a-dafdc467920a
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipelines/runs/cd8baca6-b8a0-40c5-908a-dafdc467920a?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws
PipelineRun Status: Running


StepRunId: f0d9f292-d1fc-4586-a10d-297aac749093
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipe


StepRun(prepare-step) Execution Summary
StepRun( prepare-step ) Status: Finished
{'runId': 'f0d9f292-d1fc-4586-a10d-297aac749093', 'target': 'cluster', 'status': 'Completed', 'startTimeUtc': '2021-01-19T10:22:43.196178Z', 'endTimeUtc': '2021-01-19T10:23:28.854237Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '754fec0f-5d63-43c3-a186-1460a2b36604', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.moduleid': '4825aed4-f018-4dbf-b919-34b48df82007', 'azureml.nodeid': 'b3e97ee8', 'azureml.pipelinerunid': 'cd8baca6-b8a0-40c5-908a-dafdc467920a', '_azureml.ComputeTargetType': 'amlcompute', 'ProcessInfoFile': 'azureml-logs/process_info.json', 'ProcessStatusFile': 'azureml-logs/process_status.json'}, 'inputDatasets': [{'dataset': {'id': '73b4c537-e008-4d3c-8770-055011622520'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'training_dataset', 'mechanism': 'Download'}}], 'outputDatasets': [], 'runDefinition': {'script': 'prepar




StepRunId: 92f2ce94-8aa7-4218-ba4c-98ae3549f00b
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipelines/runs/92f2ce94-8aa7-4218-ba4c-98ae3549f00b?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws
StepRun( train-step ) Status: NotStarted
StepRun( train-step ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_3b58b23ebb37469ac1f07c2ab41fbc7f43557fb2270f063578cd53530d107746_d.txt
2021-01-19T10:24:04Z Starting output-watcher...
2021-01-19T10:24:04Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2021-01-19T10:24:05Z Executing 'Copy ACR Details file' on 10.0.0.4
2021-01-19T10:24:05Z Copy ACR Details file succeeded on 10.0.0.4. Output: 
>>>   
>>>   
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_2e16031b3872f79cf3d953cc9fca1d91
Digest: sha256:036493314eb9df954dae7e4f412f3269a1de0e6ae06801fe94518fbb33300963
Status: Image is up to date fo


Streaming azureml-logs/75_job_post-tvmps_3b58b23ebb37469ac1f07c2ab41fbc7f43557fb2270f063578cd53530d107746_d.txt
bash: /azureml-envs/azureml_9cfcbdf246a71704d446a597691d2bad/lib/libtinfo.so.5: no version information available (required by bash)
[2021-01-19T10:24:30.498519] Entering job release
[2021-01-19T10:24:31.675703] Starting job release
[2021-01-19T10:24:31.682181] Logging experiment finalizing status in history service.
[2021-01-19T10:24:31.682336] job release stage : upload_datastore starting...
Starting the daemon thread to refresh tokens in background for process with pid = 171
[2021-01-19T10:24:31.682857] job release stage : start importing azureml.history._tracking in run_history_release.
[2021-01-19T10:24:31.691208] job release stage : execute_job_release starting...
[2021-01-19T10:24:31.691442] job release stage : copy_batchai_cached_logs starting...
[2021-01-19T10:24:31.691796] job release stage : copy_batchai_cached_logs completed...
[2021-01-19T10:24:31.692479] Enterin



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'cd8baca6-b8a0-40c5-908a-dafdc467920a', 'status': 'Completed', 'startTimeUtc': '2021-01-19T10:22:11.192611Z', 'endTimeUtc': '2021-01-19T10:24:54.341741Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.cd8baca6-b8a0-40c5-908a-dafdc467920a/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=evRaGuxDxJdjDcIvGJZETcpObwvrbv%2BqUfzVmPbaCmE%3D&st=2021-01-19T10%3A14%3A55Z&se=2021-01-19T18%3A24%3A55Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.cd8baca6-b8a0-40c5-908a-dafdc467920a/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=Fz785Z7l%2Br8fRhixUGGDTmrCbDKjD4dGnQRjthHCD14%3D&st=2021-01-19T10%3A14%3A55Z&se=

'Finished'
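
The run above used the default dataset, since no parameter was set at submission. As a sketch of how the `training_dataset` parameter could be overridden, `Experiment.submit` accepts a `pipeline_parameters` dictionary for pipelines. The wrapper `submit_with_dataset` is a hypothetical helper introduced only for this example.

```python
def submit_with_dataset(experiment, pipeline, dataset):
    """Submit `pipeline`, overriding the `training_dataset` PipelineParameter."""
    return experiment.submit(
        pipeline,
        pipeline_parameters={"training_dataset": dataset},
    )
```

With the objects from this notebook, a call could look like `submit_with_dataset(Experiment(ws, 'mlops-workshop-pipelines'), pipeline, Dataset.get_by_name(ws, "<another dataset name>"))`, where the dataset name is a placeholder for a dataset registered in your workspace.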

Alternatively, we can publish the pipeline as a REST endpoint. In this case, you can specify the dataset when invoking the pipeline. This can also be done in the Studio UI: go to `Endpoints`, then `Pipeline Endpoints`, and select the pipeline. Once you hit the submit button, you can select the dataset at the bottom of the window.

In [9]:
published_pipeline = pipeline.publish('mlops-multi-step-pipeline')
published_pipeline

| Name | Id | Status | Endpoint |
| --- | --- | --- | --- |
| mlops-multi-step-pipeline | ee437e8c-7715-43c8-ace2-9835361a5212 | Active | REST Endpoint |
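
When the published pipeline is invoked over REST, the dataset is passed in the JSON body of the request. The sketch below only builds such a body. The field names `ExperimentName`, `DataSetDefinitionValueAssignments` and `SavedDataSetReference` are written from memory of the Azure ML documentation and should be treated as assumptions to verify against the current docs. The helper `build_run_request` and the dataset id placeholder are hypothetical.

```python
import json


def build_run_request(experiment_name, dataset_id, parameter_name="training_dataset"):
    """Build the JSON body for invoking a published pipeline with a dataset parameter."""
    return {
        "ExperimentName": experiment_name,
        # Assumed request shape: maps the PipelineParameter name to a registered dataset id.
        "DataSetDefinitionValueAssignments": {
            parameter_name: {"SavedDataSetReference": {"Id": dataset_id}}
        },
    }


payload = build_run_request("mlops-workshop-pipelines", "<dataset id>")
print(json.dumps(payload, indent=2))
```

The body would then be sent as a POST request to `published_pipeline.endpoint`, together with an Azure AD bearer token in the `Authorization` header.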
