# Single-step pipeline examples

In this example, we'll build a very simple pipeline that contains just a single training step. The dataset and compute cluster created in this tutorial will be re-used in the subsequent examples in this module.

In [2]:
!pip install azureml-sdk --upgrade

Collecting azureml-sdk
  Downloading azureml_sdk-1.28.0-py3-none-any.whl (4.4 kB)
Collecting azureml-train-automl-client~=1.28.0
  Downloading azureml_train_automl_client-1.28.0-py3-none-any.whl (120 kB)
Collecting azureml-core~=1.28.0
  Downloading azureml_core-1.28.0-py3-none-any.whl (2.2 MB)
Collecting azureml-dataset-runtime[fuse]~=1.28.0
  Downloading azureml_dataset_runtime-1.28.0-py3-none-any.whl (3.5 kB)
Collecting azureml-pipeline~=1.28.0
  Downloading azureml_pipeline-1.28.0-py3-none-any.whl (3.7 kB)
Collecting azureml-train~=1.28.0
  Downloading azureml_train-1.28.0-py3-none-any.whl (3.3 kB)
Collecting azureml-telemetry~=1.28.0
  Downloading azureml_telemetry-1.28.0-py3-none-any.whl (30 kB)
Collecting azureml-automl-core~=1.28.0
  Downloading azureml_automl_core-1.28.0-py3-none-any.whl (207 kB)

Collecting azureml-dataprep-native<34.0.0,>=33.0.0
  Downloading azureml_dataprep_native-33.0.0-cp36-cp36m-manylinux1_x86_64.whl (1.3 MB)
Collecting azureml-dataprep-rslex<1.14.0a,>=1.13.0dev0
  Downloading azureml_dataprep_rslex-1.13.0-cp36-cp36m-manylinux2010_x86_64.whl (9.8 MB)
Collecting azureml-train-restclients-hyperdrive~=1.28.0
  Downloading azureml_train_restclients_hyperdrive-1.28.0-py3-none-any.whl (19 kB)
ERROR: azureml-widgets 1.22.0 has requirement azureml-core~=1.22.0, but you'll have azureml-core 1.28.0 which is incompatible.
ERROR: azureml-widgets 1.22.0 has requirement azureml-telemetry~=1.22.0, but you'll have azureml-telemetry 1.28.0 which is incompatible.
ERROR: azureml-train-automl 1.22.0 has requirement azureml-automl-core~=1.22.0, but you'll have azureml-automl-core 1.28.0 which is incompatible.

    Uninstalling azureml-core-1.22.0:
      Successfully uninstalled azureml-core-1.22.0
  Attempting uninstall: azureml-telemetry
    Found existing installation: azureml-telemetry 1.22.0
    Uninstalling azureml-telemetry-1.22.0:
      Successfully uninstalled azureml-telemetry-1.22.0
  Attempting uninstall: azureml-dataprep-native
    Found existing installation: azureml-dataprep-native 29.0.0
    Uninstalling azureml-dataprep-native-29.0.0:
      Successfully uninstalled azureml-dataprep-native-29.0.0
  Attempting uninstall: azureml-dataprep-rslex
    Found existing installation: azureml-dataprep-rslex 1.7.0
    Uninstalling azureml-dataprep-rslex-1.7.0:
      Successfully uninstalled azureml-dataprep-rslex-1.7.0
  Attempting uninstall: azureml-dataprep
    Found existing installation: azureml-dataprep 2.9.1
    Uninstalling azureml-dataprep-2.9.1:
      Successfully uninstalled azureml-dataprep-2.9.1
  Attempting uninstall: azureml-dataset-runtime
    Found existing installation: 

In [1]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

Azure ML SDK version: 1.28.0


First, we will connect to the workspace. `Workspace.from_config()` will either:
* Read the workspace reference from a local `config.json`, if one exists, or
* Use the `az` CLI to connect to the workspace that was attached to the folder via `az ml folder attach -g <resource group> -w <workspace name>`

In [2]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')

WS name: demo-ent-ws
Region: westeurope
Subscription id: bcbf34a7-1936-4783-8840-8f324c37f354
Resource group: demo
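
For reference, the `config.json` that `Workspace.from_config()` looks for is a small JSON file with three keys: `subscription_id`, `resource_group` and `workspace_name`. The sketch below uses only the standard library to write and re-read such a file. The values are placeholders, and the helper `load_workspace_config` is hypothetical and not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def load_workspace_config(path):
    """Read a workspace config file and check that the expected keys are present."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return config

# Placeholder values; replace them with your own workspace details.
example = {
    "subscription_id": "<subscription id>",
    "resource_group": "<resource group>",
    "workspace_name": "<workspace name>",
}

# Write the file to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2))
    loaded = load_workspace_config(config_path)

print(sorted(loaded))
```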


# Preparation

Let's quickly create a compute cluster named `cluster`, in case it does not exist yet.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

aml_compute_target = "cluster"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", min_nodes = 0, max_nodes = 1,
                                                   idle_seconds_before_scaledown=3600)
    aml_compute = ComputeTarget.create(ws, aml_compute_target, config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Furthermore, we'll create a new dataset and register it with the workspace. We'll also use this dataset in the subsequent pipelines. If you have already created this dataset, skip to the next cell.

The dataset contains one entry per person who takes out a loan from a bank. Each person is classified as a good or bad credit risk according to a set of attributes. The target variable is "Risk" (values "good", "bad").

In [4]:
from azureml.core import Dataset

datastore = ws.get_default_datastore()
datastore.upload(src_dir='../data-training', target_path='german-credit-train-tutorial', overwrite=True)
ds = Dataset.File.from_files(path=[(datastore, 'german-credit-train-tutorial')])
ds.register(ws, name='german-credit-train-tutorial', description='Dataset for workshop tutorials', create_new_version=True)

Uploading an estimated of 1 files
Uploading ../data-training/german_credit_data.csv
Uploaded ../data-training/german_credit_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


{
  "source": [
    "('workspaceblobstore', 'german-credit-train-tutorial')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "73b4c537-e008-4d3c-8770-055011622520",
    "name": "german-credit-train-tutorial",
    "version": 1,
    "description": "Dataset for workshop tutorials",
    "workspace": "Workspace.create(name='demo-ent-ws', subscription_id='bcbf34a7-1936-4783-8840-8f324c37f354', resource_group='demo')"
  }
}

Next, let's reference our newly created training dataset, so that we can use it as the pipeline input:

In [6]:
training_dataset = Dataset.get_by_name(ws, "german-credit-train-tutorial")
# Download the dataset to the compute node - we can also use .as_mount() if the dataset does not fit on the node
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset).as_download()

Next, we can create a `PythonScriptStep` that runs our training code. In this case, we load a `runconfig` from a YAML file ([`runconfig.yml`](runconfig.yml)) that defines our training job (target compute cluster, conda environment, etc.). Have a look at it.

In [7]:
runconfig = RunConfiguration.load("runconfig.yml")

train_step = PythonScriptStep(name="train-step",
                              source_directory="./",
                              script_name="train.py",
                              arguments=['--data-path', training_dataset_consumption],
                              inputs=[training_dataset_consumption],
                              runconfig=runconfig,
                              allow_reuse=False)

steps = [train_step]
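
The training script itself is not shown in this notebook. As a rough sketch of how a script could consume the `--data-path` argument passed above, assume the argument resolves to a folder that contains the uploaded CSV file. The function names and the tiny sample file below are made up for illustration; the actual `train.py` may look quite different.

```python
import argparse
import csv
import tempfile
from collections import Counter
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", type=str, required=True,
                        help="Folder that contains the training CSV file(s)")
    return parser.parse_args(argv)

def count_labels(data_path, target_column="Risk"):
    """Count target values across all CSV files found under data_path."""
    counts = Counter()
    for csv_file in sorted(Path(data_path).rglob("*.csv")):
        with open(csv_file, newline="") as handle:
            for row in csv.DictReader(handle):
                counts[row[target_column]] += 1
    return counts

# Simulate the downloaded dataset folder with a made-up three-row sample.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("Age,Risk\n35,good\n22,bad\n49,good\n")
    args = parse_args(["--data-path", tmp])
    label_counts = count_labels(args.data_path)

print(dict(label_counts))
```

A real training script would fit a model at this point instead of counting labels; the argument handling stays the same.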

Finally, we can create our pipeline object and validate it. Validation checks that inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [8]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

Step train-step is ready to be created [fe4ff6af]


[]
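
To make the requirement that the pipeline graph contains no cycles more concrete, here is a small, SDK-independent sketch that checks a step dependency graph using Kahn's topological sort. The dictionary format and the step names are assumptions for illustration; this is not how `Pipeline.validate()` is implemented internally.

```python
from collections import deque

def find_execution_order(dependencies):
    """Return a valid step order, or raise ValueError if the graph has a cycle.

    `dependencies` maps each step name to the list of steps it depends on.
    """
    # Collect every step, including ones that only appear as a dependency.
    steps = set(dependencies)
    for upstream in dependencies.values():
        steps.update(upstream)

    # Number of unfinished upstream steps per step, plus the reverse edges.
    remaining = {step: len(set(dependencies.get(step, []))) for step in steps}
    downstream = {step: [] for step in steps}
    for step, upstream in dependencies.items():
        for parent in set(upstream):
            downstream[parent].append(step)

    # Repeatedly "run" steps whose upstream steps are all done.
    ready = deque(sorted(step for step, count in remaining.items() if count == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for child in sorted(downstream[step]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Steps that never became ready are part of (or behind) a cycle.
    if len(order) != len(steps):
        raise ValueError("The step graph contains a cycle")
    return order

valid_order = find_execution_order(
    {"prepare": [], "train": ["prepare"], "evaluate": ["train"]}
)

try:
    find_execution_order({"a": ["b"], "b": ["a"]})
    has_cycle = False
except ValueError:
    has_cycle = True

print(valid_order, has_cycle)
```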

Lastly, we can submit the pipeline to an experiment:

In [12]:
pipeline_run = Experiment(ws, 'mlops-workshop-pipelines-20210524').submit(pipeline)
pipeline_run.wait_for_completion()

Submitted PipelineRun 2266837e-5afb-4569-9fbf-f809a7669654
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/2266837e-5afb-4569-9fbf-f809a7669654?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws&tid=1f053027-5c7a-4f10-8444-ca55e5715f27
PipelineRunId: 2266837e-5afb-4569-9fbf-f809a7669654
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/2266837e-5afb-4569-9fbf-f809a7669654?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws&tid=1f053027-5c7a-4f10-8444-ca55e5715f27
PipelineRun Status: Running


StepRunId: 10ea349a-7de2-4df6-b588-ad45e77901ab
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/10ea349a-7de2-4df6-b588-ad45e77901ab?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws&tid=1f053027-5c7a-4f10-8444-ca55e5715f27
StepRun( train-step ) Status: NotStarted
StepRun( train-step ) Status: Running



Streaming azureml-logs/75_job_post-tvmps_78d47faa751acb7f51c6eaf1922c5bfd3baad2437fbd8fee673a40746f6cd22e_d.txt
[2021-05-24T15:18:45.144196] Entering job release
[2021-05-24T15:18:46.716286] Starting job release
[2021-05-24T15:18:46.717192] Logging experiment finalizing status in history service.
Starting the daemon thread to refresh tokens in background for process with pid = 163
[2021-05-24T15:18:46.718321] job release stage : upload_datastore starting...
[2021-05-24T15:18:46.718721] job release stage : start importing azureml.history._tracking in run_history_release.
[2021-05-24T15:18:46.719211] job release stage : execute_job_release starting...
[2021-05-24T15:18:46.726173] job release stage : copy_batchai_cached_logs starting...
[2021-05-24T15:18:46.734431] job release stage : copy_batchai_cached_logs completed...
[2021-05-24T15:18:46.881032] job release stage : execute_job_release completed...
[2021-05-24T15:18:46.885823] job release stage : send_run_telemetry starting...
[2021-



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '2266837e-5afb-4569-9fbf-f809a7669654', 'status': 'Completed', 'startTimeUtc': '2021-05-24T15:14:22.075513Z', 'endTimeUtc': '2021-05-24T15:18:59.634912Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.2266837e-5afb-4569-9fbf-f809a7669654/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=jS82E3ItDwSP15s4PjmOz0CZM8t36DlrLVtZtE88smw%3D&st=2021-05-24T15%3A05%3A35Z&se=2021-05-24T23%3A15%3A35Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.2266837e-5afb-4569-9fbf-f809a7669654/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=5yUzAEpuUa%2BnIA8LVYwJwIUOuSdz32ZzES26gDkn%2Bnw%3D&st=2021-05-24T15%3A05%3A35Z&se=

'Finished'

Alternatively, we can publish the pipeline as a RESTful API endpoint (the date in the endpoint name is there only for demo purposes, as it doesn't make sense to version your pipeline through its name):

In [13]:
published_pipeline = pipeline.publish('mlops-training-pipeline-20210524')
published_pipeline

| Name | Id | Status | Endpoint |
|---|---|---|---|
| mlops-training-pipeline-20210524 | b79a0a2d-ff7c-4e1e-8f79-847c30378810 | Active | REST Endpoint |
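
A published pipeline is triggered with an authenticated HTTP POST to its REST endpoint. The sketch below only builds such a request with the standard library and does not send it. The URL and token are placeholders, and `build_trigger_request` is a hypothetical helper. In practice, the URL would come from `published_pipeline.endpoint` and the bearer token from an Azure AD authentication flow.

```python
import json
import urllib.request

def build_trigger_request(endpoint_url, bearer_token, experiment_name):
    """Build (but do not send) the POST request that starts a pipeline run."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(endpoint_url, data=body, headers=headers, method="POST")

# Placeholder URL and token, for illustration only.
request = build_trigger_request(
    endpoint_url="https://example.invalid/placeholder-pipeline-endpoint",
    bearer_token="<access token>",
    experiment_name="mlops-workshop-pipelines-20210524",
)

print(request.get_method(), json.loads(request.data))
# Sending it would look like: urllib.request.urlopen(request)
```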


What if we want to continuously publish new pipelines, but keep them available under the same URL as the previous version? For this, we can use [`PipelineEndpoint`](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelineendpoint?view=azure-ml-py), which keeps multiple versioned `PublishedPipeline`s behind a single endpoint URL. It lets us set a `default_version`, which determines the `PublishedPipeline` (that is, the version) to which requests are routed.

In [16]:
from azureml.pipeline.core import PipelineEndpoint

endpoint_name = "mlops-training-pipeline-existing"

# Try to find the endpoint with the name defined above.
# If it does not exist, create a new endpoint with that name.
try:
    pipeline_endpoint = PipelineEndpoint.get(workspace=ws, name=endpoint_name)
    # Add the new version as default - this only works with a PublishedPipeline
    pipeline_endpoint.add_default(published_pipeline)
    print(f"Pipeline endpoint '{endpoint_name}' found; published pipeline set as default.")
except Exception:
    pipeline_endpoint = PipelineEndpoint.publish(workspace=ws,
                                                 name=endpoint_name,
                                                 pipeline=pipeline,
                                                 description="New Training Pipeline Endpoint")
    print(f"Pipeline endpoint '{endpoint_name}' not found. New endpoint created")


Pipeline endpoint 'mlops-training-pipeline-existing' not found. New endpoint created
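
Conceptually, the endpoint acts as a stable name in front of a set of pipeline versions, one of which is marked as the default. The toy class below models only that routing idea in plain Python. `ToyPipelineEndpoint` and its methods are invented for illustration; they do not mirror the SDK's internals or its exact version numbering.

```python
class ToyPipelineEndpoint:
    """In-memory stand-in: one stable name, many pipeline versions."""

    def __init__(self, name, first_pipeline_id):
        self.name = name
        self.versions = {"0": first_pipeline_id}
        self.default_version = "0"

    def add_default(self, pipeline_id):
        """Register a new version and route future requests to it."""
        version = str(len(self.versions))
        self.versions[version] = pipeline_id
        self.default_version = version
        return version

    def resolve(self, version=None):
        """Return the pipeline id a request would be routed to."""
        return self.versions[version or self.default_version]

endpoint = ToyPipelineEndpoint("mlops-training-pipeline-existing", "pipeline-a")
endpoint.add_default("pipeline-b")

print(endpoint.resolve())     # routed to the newly added default version
print(endpoint.resolve("0"))  # the older version stays addressable
```

Callers keep using the same endpoint name, while the version that actually runs changes whenever a new default is added.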
