# Parallel Batch Scoring pipeline example

In this example, we'll build a pipeline that batch scores data in parallel on one or more nodes. The same pattern can be used either to score large amounts of data or to train many models in parallel.

In [1]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig
from azureml.data import OutputFileDatasetConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)


Azure ML SDK version: 1.20.0


First, we will connect to the workspace. The command `Workspace.from_config()` will either:
* Read the workspace reference from the local `config.json`, if it exists, or
* Use the `az` CLI to connect to the workspace that the folder was attached to via `az ml folder attach -g <resource group> -w <workspace name>`
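
To make the first option more concrete, here is a minimal sketch of what such a `config.json` contains and how it could be checked locally. The three key names are the ones the SDK expects; the helper `load_workspace_config` and the placeholder values are illustrative assumptions and not part of the SDK:

```python
import json
import tempfile
from pathlib import Path

# The three keys a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a config.json and return its workspace reference as a dict.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}


# Placeholder values; substitute your own subscription, group and workspace.
example = {
    "subscription_id": "<subscription id>",
    "resource_group": "<resource group>",
    "workspace_name": "<workspace name>",
}

# Write the example to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2))
    loaded = load_workspace_config(config_path)

print(loaded["workspace_name"])
```

If a file like this sits in the working directory (or in a `.azureml` subfolder), `Workspace.from_config()` can pick it up without any further arguments.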

In [2]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')

WS name: demo-ent-ws
Region: westeurope
Subscription id: bcbf34a7-1936-4783-8840-8f324c37f354
Resource group: demo


# Preparation

Let's register the provided `model.pkl` as a model in our workspace. We'll use this model for batch scoring in the pipeline:

In [3]:
from azureml.core.model import Model
Model.register(model_path="model.pkl",
               model_name="credit_model_tutorial",
               description="Example model for batch scoring tutorial",
               workspace=ws)

Registering model credit_model_tutorial


Model(workspace=Workspace.create(name='demo-ent-ws', subscription_id='bcbf34a7-1936-4783-8840-8f324c37f354', resource_group='demo'), name=credit_model_tutorial, id=credit_model_tutorial:1, version=1, tags={}, properties={})

Let's also register the dataset that we want to use for batch scoring. This dataset is different from the one already registered for training: it consists of multiple files, so it needs to be registered separately:

In [5]:
from azureml.core import Dataset

datastore = ws.get_default_datastore()
datastore.upload(src_dir='../data-batch-scoring', target_path='german-credit-batch-tutorial', overwrite=True)
ds = Dataset.File.from_files(path=[(datastore, 'german-credit-batch-tutorial')])
ds.register(ws, name='german-credit-batch-tutorial', description='Dataset for batch scoring tutorial', create_new_version=True)

Uploading an estimated of 4 files
Uploading ../data-batch-scoring/german_credit_data_batch_test_00.csv
Uploaded ../data-batch-scoring/german_credit_data_batch_test_00.csv, 1 files out of an estimated total of 4
Uploading ../data-batch-scoring/german_credit_data_batch_test_01.csv
Uploaded ../data-batch-scoring/german_credit_data_batch_test_01.csv, 2 files out of an estimated total of 4
Uploading ../data-batch-scoring/german_credit_data_batch_test_02.csv
Uploaded ../data-batch-scoring/german_credit_data_batch_test_02.csv, 3 files out of an estimated total of 4
Uploading ../data-batch-scoring/german_credit_data_batch_test_03.csv
Uploaded ../data-batch-scoring/german_credit_data_batch_test_03.csv, 4 files out of an estimated total of 4
Uploaded 4 files


{
  "source": [
    "('workspaceblobstore', 'german-credit-batch-tutorial')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "e2fb1a77-e1c7-48bf-b897-9816ea287564",
    "name": "german-credit-batch-tutorial",
    "version": 1,
    "description": "Dataset for batch scoring tutorial",
    "workspace": "Workspace.create(name='demo-ent-ws', subscription_id='bcbf34a7-1936-4783-8840-8f324c37f354', resource_group='demo')"
  }
}

Next, let's reference our newly created batch scoring dataset, so that we can use it as the pipeline input:

In [6]:
batch_dataset = Dataset.get_by_name(ws, "german-credit-batch-tutorial")
batch_dataset_consumption = DatasetConsumptionConfig("batch_dataset", batch_dataset).as_download()

Now let's create an output dataset that will contain our predictions. This gives us full control over where the predictions are stored on the datastore:

In [7]:
#output_dataset = PipelineData(name='batch_output', datastore=ws.get_default_datastore()).as_dataset()
#output_dataset = output_dataset.register(name='batch-scoring-results', create_new_version=True)

datastore = ws.get_default_datastore()

# This will put the output results into a pre-defined folder on our datastore and optionally register it as a dataset (not required)
output_dataset = OutputFileDatasetConfig(name='batch_results',
                                         destination=(datastore, 'batch-scoring-results/{run-id}')).register_on_complete(name='batch-scoring-results')


Next, we can create a [`ParallelRunStep`](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallelrunstep?view=azure-ml-py) that runs our batch scoring code in parallel on one or more nodes. In this case, we load a [`ParallelRunConfig`](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_config.parallelrunconfig?view=azure-ml-py) from a YAML file that defines our batch scoring job (source script, environment, parallelization, target cluster, etc.).

More details about Azure ML Batch Inference using the `ParallelRunStep` are available [here](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/parallel-run).

In [8]:
parallel_run_config = ParallelRunConfig.load_yaml(workspace=ws, path="parallel_runconfig.yml")

batch_step = ParallelRunStep(
    name="batch-inference-step",
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'credit_model_tutorial'],
    inputs=[batch_dataset_consumption],
    side_inputs=[],
    output=output_dataset,
    allow_reuse=False
)

steps = [batch_step]

Finally, we can create our pipeline object and validate it. This checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [9]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

Step batch-inference-step is ready to be created [bc8f27e1]


[]
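
`validate()` returns a list of errors, so the empty list above means that no problems were found. To illustrate what the acyclicity check means, here is a small sketch using Kahn's algorithm on a dictionary of step dependencies. This is not how the SDK implements `validate()`; the function `is_acyclic` and the step names other than `batch-inference-step` are hypothetical:

```python
from collections import deque


def is_acyclic(dependencies):
    """Return True if the step graph has no cycle (Kahn's algorithm).

    `dependencies` maps each step name to the steps whose outputs it consumes.
    """
    # Collect every step, including ones that only appear as a dependency.
    steps = set(dependencies)
    for upstream in dependencies.values():
        steps.update(upstream)

    # Count unresolved upstream steps and record who consumes whom.
    remaining = {step: len(set(dependencies.get(step, ()))) for step in steps}
    consumers = {step: [] for step in steps}
    for step, upstream in dependencies.items():
        for parent in set(upstream):
            consumers[parent].append(step)

    # Repeatedly "run" steps whose upstream steps have all been visited.
    ready = deque(step for step, count in remaining.items() if count == 0)
    visited = 0
    while ready:
        step = ready.popleft()
        visited += 1
        for child in consumers[step]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any step left unvisited is part of, or depends on, a cycle.
    return visited == len(steps)


print(is_acyclic({"batch-inference-step": []}))            # True
print(is_acyclic({"prepare": [], "score": ["prepare"]}))   # True
print(is_acyclic({"a": ["b"], "b": ["a"]}))                # False
```

A pipeline with a single step, like ours, is trivially acyclic; the check only becomes relevant once steps consume each other's outputs.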

Now we can submit the pipeline to an experiment and wait for it to complete:

In [10]:
pipeline_run = Experiment(ws, 'mlops-workshop-pipelines').submit(pipeline)
pipeline_run.wait_for_completion()

Created step batch-inference-step [bc8f27e1][7849970e-626d-4c67-b7be-7219300bd4b6], (This step will run and generate new outputs)
Submitted PipelineRun cbb1d6bf-a063-4c2f-b6e8-275668c69483
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipelines/runs/cbb1d6bf-a063-4c2f-b6e8-275668c69483?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws
PipelineRunId: cbb1d6bf-a063-4c2f-b6e8-275668c69483
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipelines/runs/cbb1d6bf-a063-4c2f-b6e8-275668c69483?wsid=/subscriptions/bcbf34a7-1936-4783-8840-8f324c37f354/resourcegroups/demo/workspaces/demo-ent-ws
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: d55363a8-be80-4418-a530-30ceb9656e40
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/mlops-workshop-pipelines/runs/d55363a8-be80-4418-a530-30ceb9656e40?wsid=/subscriptions/bcbf34a7-1936


pip-20.2.4           | 1.1 MB    |            |   0%
pip-20.2.4           | 1.1 MB    | ########   |  81%
pip-20.2.4           | 1.1 MB    | #########6 |  96%
pip-20.2.4           | 1.1 MB    | ########## | 100%

tk-8.5.19            | 1.9 MB    |            |   0%
tk-8.5.19            | 1.9 MB    |            |   1%
tk-8.5.19            | 1.9 MB    | #######7   |  78%
tk-8.5.19            | 1.9 MB    | ########9  |  89%
tk-8.5.19            | 1.9 MB    | #########8 |  98%
tk-8.5.19            | 1.9 MB    | ########## | 100%

ca-certificates-2020 | 137 KB    |            |   0%
ca-certificates-2020 | 137 KB    | ########## | 100%

zlib-1.2.11          | 106 KB    |            |   0%
zlib-1.2.11          | 106 KB    | ########## | 100%
Downloading and Extracting Packages
Preparing transaction: ...working... done
Verifying transaction: ...working... done

Successfully installed PyJWT-1.7.1 SecretStorage-3.3.0 adal-1.2.5 azure-common-1.1.26 azure-core-1.10.0 azure-graphrbac-0.61.1 azure-identity-1.4.1 azure-mgmt-authorization-0.61.0 azure-mgmt-containerregistry-2.8.0 azure-mgmt-keyvault-2.2.0 azure-mgmt-resource-12.0.0 azure-mgmt-storage-11.2.0 azureml-core-1.20.0 azureml-dataprep-2.8.2 azureml-dataprep-native-28.0.0 azureml-dataprep-rslex-1.6.0 backports.tempfile-1.0 backports.weakref-1.0.post1 cffi-1.14.4 chardet-4.0.0 cloudpickle-1.6.0 contextlib2-0.6.0.post1 cryptography-3.3.1 distro-1.5.0 docker-4.4.1 dotnetcore2-2.1.20 fusepy-3.0.1 idna-2.10 importlib-metadata-3.4.0 isodate-0.6.0 jeepney-0.6.0 jmespath-0.10.0 joblib-0.13.2 jsonpickle-1.5.0 msal-1.8.0 msal-extensions-0.2.2 msrest-0.6.19 msrestazure-0.6.4 ndg-httpsclient-0.5.1 numpy-1.19.5 oauthlib-3.1.0 pandas-0.25.3 pathspec-0.8.1 portalocker-1.7.1 pyarrow-1.0.1 pyasn1-0.4.8 pycparser-2.20 pyopenssl-19.1.0 python-dateutil-2.8.1 pytz-2020.5 requests-2.25.1 requests-oauthlib-1.3.0 ru

699b75ff4717: Pull complete
b177109c9d16: Pull complete
59cea07bb66c: Pull complete
d54d011de0e3: Pull complete
ec2c061b6e79: Pull complete
45be97372f16: Pull complete
741ed879c2f2: Pull complete
dcb42b399f96: Pull complete
c5158f856775: Pull complete
Digest: sha256:96a223a2d683aab4b4f91719ba3f705a79883c430ca39e73845fd2ba36704f14
Status: Downloaded newer image for viennaglobal.azurecr.io/azureml/azureml_9d4fa30783fc98f2c7c7f19c6a312f30:latest
viennaglobal.azurecr.io/azureml/azureml_9d4fa30783fc98f2c7c7f19c6a312f30:latest
2021-01-19T14:36:00Z Check if container d55363a8-be80-4418-a530-30ceb9656e40 already exist exited with 0, 

b641b945fe6f78d9a9b39d8a2063d8966967d8140ae62504675fff0f940012ea
2021/01/19 14:36:07 Starting App Insight Logger for task:  containerSetup
2021/01/19 14:36:07 Version: 3.0.01464.0002 Branch: HotfixRemoveLogErrForACRIdentity Commit: 71d969c
2021/01/19 14:36:07 /dev/infiniband/uverbs0 found (implying presence of InfiniBand)?: false
2021/01/19 14:36:07 /dev/infiniba

[2021-01-19T14:36:44.954694] Ran Sidecar prep cmd.
[2021-01-19T14:36:44.954743] Running Context Managers in Sidecar complete.

Streaming azureml-logs/70_driver_log.txt
bash: /azureml-envs/azureml_0a9c9c6cfe46ea8f73af74fbc3c2f5b8/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_0a9c9c6cfe46ea8f73af74fbc3c2f5b8/lib/libtinfo.so.5: no version information available (required by bash)
2021/01/19 14:37:54 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2021/01/19 14:37:54 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2021-01-19T14:37:55.750253] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['driver/amlbi_main.py', '--client_sdk_version', 



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'cbb1d6bf-a063-4c2f-b6e8-275668c69483', 'status': 'Completed', 'startTimeUtc': '2021-01-19T14:23:56.031361Z', 'endTimeUtc': '2021-01-19T14:39:22.198596Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.cbb1d6bf-a063-4c2f-b6e8-275668c69483/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=Ai24%2B1PPdJLMvPFhNUiAA95xEFO%2FzmW2cU8O3Uwltac%3D&st=2021-01-19T14%3A29%3A24Z&se=2021-01-19T22%3A39%3A24Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://demoentws5367325393.blob.core.windows.net/azureml/ExperimentRun/dcid.cbb1d6bf-a063-4c2f-b6e8-275668c69483/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=he7QvNzC13GR%2BA2VB8%2B87tDOVvqt79nhFUUKER0vp40%3D&st=2021-01-19T14%3A29%3A25Z

'Finished'

Last but not least, we can now download the resulting dataset and have a look at our predictions. For ease of use, we'll just download it here to a folder named `temp`:

In [12]:
Dataset.get_by_name(ws, "batch-scoring-results").download(target_path="temp/", overwrite=True)
with open('temp/batch-predictions.txt','r') as f:
    print(f.read())

0 0.06820236095865362 0.9317976390413464
1 0.6843944764926586 0.31560552350734145
2 0.14786576475019952 0.8521342352498005
3 0.6406113601900081 0.3593886398099919
4 0.48906412482859263 0.5109358751714074
5 0.2699403411724228 0.7300596588275772
6 0.07079286684323505 0.929207133156765
7 0.6106057088728849 0.38939429112711516
8 0.018350514598001078 0.9816494854019989
9 0.5025497165991059 0.49745028340089414
10 0.545672695374676 0.454327304625324
11 0.8555835633596343 0.14441643664036577
12 0.3589346747081955 0.6410653252918045
13 0.39338672905056804 0.606613270949432
14 0.6314221266671478 0.36857787333285225
15 0.4375982497304032 0.5624017502695968
16 0.032246317122612056 0.9677536828773879
17 0.43043651877220335 0.5695634812277967
18 0.5802322548178415 0.41976774518215854
19 0.08642474551715895 0.913575254482841
20 0.07493746803499945 0.9250625319650005
21 0.23053073655002376 0.7694692634499762
22 0.42507834477789685 0.5749216552221031
23 0.2294804827823228 0.7705195172176772
24 0.044238