In [1]:
from azureml.core import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Workspace name: ws01ent
Azure region: westus2
Subscription id: 0e9bace8-7a81-4922-83b5-d995ff706507
Resource group: azureml


### Create or attach an existing compute resource for Python steps
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

**Creating the compute takes approximately 5 minutes. If an AmlCompute with that name already exists in your workspace, the code skips the creation step.**

In [2]:
import os
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.data.data_reference import DataReference

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "worker-cpu")
# environment variables are always strings, so coerce the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D12_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target. just use it. worker-cpu


In [3]:
from azureml.core import Workspace
from azureml.core import Keyvault
import os

keyvault = ws.get_default_keyvault()
keyvault.set_secret(name="adlsgen6key", value = 'AcDil/MwM9KlDvJu0LBcBIQxogAncv306NMRYABtjphXfWgaDTV3yjZgoSNckUb/3nhG04ND2Nqn553fq36Pqw==')

In [52]:

# Register the Blob containers as datastores
from azureml.core import Datastore

key = keyvault.get_secret('adlsgen6key')
account_name = 'adlsdatalakegen6'
datastore_sourcefiles = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='adlsgen6landing',
    container_name='landing',
    account_name=account_name,
    account_key=key,
    create_if_not_exists=False)

datastore_preprocess = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='adlsgen6process',
    container_name='process',
    account_name=account_name,
    account_key=key,
    create_if_not_exists=False)




In [50]:
import pandas as pd
# pd.read_excel()



### Intermediate/Output Data
Intermediate data (or output of a Step) is represented by [PipelineData](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) object. PipelineData can be produced by one step and consumed in another step by providing the PipelineData object as an output of one step and the input of one or more steps.

**Constructing PipelineData**
- name: [Required] Name of the data item within the pipeline graph
- datastore: Datastore to write this output to
- output_name: Name of the output
- output_mode: Specifies "upload" or "mount" modes for producing output (default: mount)
- output_path_on_compute: For "upload" mode, the path to which the module writes this output during execution
- output_overwrite: Flag to overwrite pre-existing data
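Because a data item produced by one step and consumed by another creates a dependency between them, the step order follows from the data wiring. The sketch below models that idea in plain Python and does not use the Azure ML SDK. The `steps` dictionary and the `execution_order` helper are hypothetical names made up for illustration; the step and output names mirror the ones used later in this notebook. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical description of each step: the data items it consumes and produces.
steps = {
    "Databricks_loading": {"inputs": ["process_output"], "outputs": []},
    "process": {"inputs": ["preprocess_output"], "outputs": ["process_output"]},
    "preprocess": {"inputs": ["grouping_output"], "outputs": ["preprocess_output"]},
    "grouping": {"inputs": [], "outputs": ["grouping_output"]},
}


def execution_order(steps):
    """Order steps so that every producer runs before its consumers."""
    # Map each data item to the step that produces it.
    producer = {out: name for name, spec in steps.items() for out in spec["outputs"]}
    # For each step, collect the steps it depends on (inputs without a producer are external).
    graph = {
        name: {producer[item] for item in spec["inputs"] if item in producer}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

The steps are declared in reverse on purpose: the order comes from the shared data items, not from the declaration order.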

In [43]:
from azure.servicebus import ServiceBusClient
# Only needed by the commented-out Storage Queue code below
from azure.storage.queue import QueueClient, BinaryBase64DecodePolicy
import json
# QueueMessageFormat=BinaryBase64DecodePolicy()

# con_str ='Endpoint=sb://amlservibus.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=wLZ9JTCNKdErXxKxjRr+aBGZ9dsIxkmf+veeCE7XeGc='
con_str='Endpoint=sb://amlservibus.servicebus.windows.net/;SharedAccessKeyName=access;SharedAccessKey=tXe8OSx7WPLjKXzX9iNGwPN36Hcns24z2A735qHdJZ8=;EntityPath=landing'
sb_client = ServiceBusClient.from_connection_string(con_str)
queue_client =sb_client.get_queue("landing")
# queue_client = QueueClient.from_connection_string(con_str, "landing")
# sb_client = ServiceBusClient.from_connection_string(con_str)

#     message_count=0
#     messages = client.receive_messages(messages_per_page=32)
#     for msg_batch in messages.by_page():
#         for message in msg_batch:
#             content =QueueMessageFormat.decode(message.content,None).decode('utf8')
#             json_content = ast.literal_eval(content)
#             data_content = json_content['data']
#             file_name =data_content['url']
#             file_list.append(file_name)
#             source_name =data_content['url'].split("/")[-2]
#             source_list.append(source_name)
#             print(source_name,file_name)
# #             client.delete_message(message)
#             message_count+=1
#     if message_count==0:
#         break
# Receive messages from the queue and extract the blob URL and its source folder
with queue_client.get_receiver() as queue_receiver:
    messages = queue_receiver.fetch_next(timeout=3)
    for message in messages:
        json_content = json.loads(str(message))
        file_name = json_content['data']['url']
        # the source is the folder that directly contains the file
        source_name = file_name.split("/")[-2]
        print(source_name, file_name)
        # message.complete()  # uncomment to settle the message and remove it from the queue


source1 https://adlsdatalakegen6.blob.core.windows.net/landing/source1/leaderboard15.xlsx
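The parsing done in the cell above can be isolated into a small function that is easy to test without a Service Bus connection. This is a minimal sketch: `parse_blob_event` is a hypothetical helper, and the sample body is an assumption that contains only the `data.url` field the notebook reads. A real message may carry more fields.

```python
import json
from urllib.parse import urlparse


def parse_blob_event(body):
    """Return (source_name, file_url) from a JSON message with a data.url field."""
    event = json.loads(body)
    file_url = event["data"]["url"]
    # The URL path looks like /<container>/<source>/<file>; the source is the parent folder.
    parts = urlparse(file_url).path.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"expected <container>/<source>/<file>, got {file_url!r}")
    return parts[-2], file_url


# Hypothetical message body; the URL is the one printed in the cell output above.
sample_body = json.dumps({
    "data": {
        "url": "https://adlsdatalakegen6.blob.core.windows.net/landing/source1/leaderboard15.xlsx"
    }
})
source_name, file_url = parse_blob_event(sample_body)
print(source_name, file_url)
```

Splitting the parsed path rather than the raw URL keeps a query string, if present, out of the file name.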


In [42]:
json_content['data']['url']

'https://adlsdatalakegen6.blob.core.windows.net/landing/source1/leaderboard13.xlsx'

In [53]:
from azureml.pipeline.core import Pipeline, PipelineData,PipelineParameter


account_url_val = 'https://hungrywizardstorage.queue.core.windows.net'
queue_name_val='landingzonequeue'
sas_token_val='?sv=2019-12-12&ss=q&srt=sco&sp=rwdlacup&se=2021-07-27T02:57:46Z&st=2020-07-26T18:57:46Z&spr=https&sig=rJ%2Bi5ttwFzxeUjTsmZ%2BCCCw6R88Oh7V59suPRpC4Bvw%3D'

scripts_folder = "scripts"
# NOTE: this PipelineData shares the name "preprocess_output" with preprocess_output_dir below
datastore_sourcefiles_dir = PipelineData(name="preprocess_output",
                      datastore=datastore_sourcefiles, output_path_on_compute="datastore_sourcefiles")

grouping_output_dir = PipelineData(name="grouping_output", 
                          datastore=datastore_sourcefiles, 
                          output_path_on_compute="grouping_output")
grouping_output_ds = grouping_output_dir.as_dataset().parse_delimited_files()

preprocess_output_dir = PipelineData(name="preprocess_output", 
                      datastore=datastore_preprocess,output_path_on_compute="preprocess_output")
preprocess_output_ds = preprocess_output_dir.as_dataset()

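# datastore_process is not defined in this notebook; datastore_preprocess (container 'process') is assumed here
process_output_dir = PipelineData(name="process_output",
                      datastore=datastore_preprocess, output_path_on_compute="process_output")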

account_url = PipelineParameter(name="account_url", default_value=account_url_val)
queue_name = PipelineParameter(name="queue_name", default_value=queue_name_val)
sas_token = PipelineParameter(name="sas_token", default_value=sas_token_val)

In [56]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# create a new runconfig object
grouping_run_config = RunConfiguration()

# enable Docker 
grouping_run_config.environment.docker.enabled = True

# set Docker base image to the default CPU-based image
grouping_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE

# use conda_dependencies.yml to create a conda environment in the Docker image for execution
grouping_run_config.environment.python.user_managed_dependencies = False

# specify CondaDependencies obj
grouping_run_config.environment.python.conda_dependencies = CondaDependencies.create(pip_packages=['azure-storage-queue','azureml-defaults','pandas'])


In [None]:

from azureml.pipeline.steps import PythonScriptStep

grouping_step = PythonScriptStep(
    script_name="data_grouping_example.py",
    arguments=["--account_url", account_url, "--queue_name", queue_name,
               "--sas_token", sas_token, "--grouping_output", grouping_output_ds],
    outputs=[grouping_output_ds],
    compute_target=compute_target,
    source_directory=scripts_folder,
    runconfig=grouping_run_config,
    allow_reuse=False
)
    

In [60]:
from azureml.pipeline.core import PipelineParameter,StepSequence
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig
from azureml.core import Environment

preprocess_conda_deps = CondaDependencies.create(pip_packages=["azureml-defaults","azureml-dataset-runtime[fuse,pandas]","xlrd","azure-storage-queue",'azure-storage-blob'])
preprocess_env = Environment(name="preprocess_environment")
preprocess_env.python.conda_dependencies = preprocess_conda_deps
preprocess_env.docker.enabled = True
preprocess_env.docker.base_image = DEFAULT_CPU_IMAGE



preprocess_parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    
    entry_script="preprocessing_example.py",
    mini_batch_size=PipelineParameter(name="preprocess_batch_size_param", default_value="1000"),
    error_threshold=1,
    output_action="summary_only",
    append_row_file_name="outputs.txt",
    environment=preprocess_env,
    compute_target=compute_target,
#     process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
    node_count=2)

preprocess_parallelrun_step = ParallelRunStep(
    name="preprocess",
    arguments=["--account_url", account_url, "--queue_name", queue_name,"--sas_token", sas_token,"--preprocess_output",preprocess_output_ds,"--datastore_sourcefiles_dir",datastore_sourcefiles_dir],

    parallel_run_config=preprocess_parallel_run_config,
    inputs=[grouping_output_ds],
    output=preprocess_output_ds,
    allow_reuse=False
)

ValueError: Input/Output preprocess_output appears in arguments list but is not in the input/output lists
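The error message states a rule: a data object referenced in `arguments` must also be listed among the step's inputs or outputs. A plausible cause here, not confirmed by the source, is that `datastore_sourcefiles_dir` is passed in `arguments` without being bound as an input or output; it is also named `preprocess_output`, which would explain the name in the message. The sketch below models the rule in plain Python. `check_step_bindings` is a hypothetical helper and not SDK code, data objects are modeled as dictionaries, and comparing by object identity is an assumption of this sketch.

```python
def check_step_bindings(arguments, inputs, outputs):
    """Raise ValueError if a data object in `arguments` is not an input or output.

    Data objects are modeled as dicts with a "name" key; plain strings are CLI flags.
    """
    # Assumption: objects are matched by identity, so two objects may share a name.
    bound = {id(obj) for obj in [*inputs, *outputs]}
    for arg in arguments:
        if isinstance(arg, dict) and id(arg) not in bound:
            raise ValueError(
                f"Input/Output {arg['name']} appears in arguments list "
                "but is not in the input/output lists"
            )


grouping_output = {"name": "grouping_output"}
preprocess_output = {"name": "preprocess_output"}
sourcefiles_dir = {"name": "preprocess_output"}  # same name, different object

message = None
try:
    check_step_bindings(
        arguments=["--preprocess_output", preprocess_output,
                   "--datastore_sourcefiles_dir", sourcefiles_dir],
        inputs=[grouping_output],
        outputs=[preprocess_output],
    )
except ValueError as err:
    message = str(err)
print(message)
```

Under this model, the check passes once every data object in `arguments` is also bound as an input or output, or is removed from `arguments`.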

In [None]:

process_conda_deps = CondaDependencies.create(pip_packages=["azureml-defaults","azureml-dataprep[fuse]","xlrd",
                                                          "celery==4.3.0", "pandas==0.24.2", "urllib3==1.25.3",
                                                          "nameparser==1.0.4","python-dateutil==2.8.0",
                                                         "python-crontab==2.3.9","psutil==5.6.3","azure-storage-queue",'azure-storage-blob',
                                                         "detect-delimiter==0.1.1","xmltodict==0.12.0","SQLAlchemy==1.3.6","openpyxl"])
process_env = Environment(name="process_environment")
process_env.python.conda_dependencies = process_conda_deps
process_env.docker.enabled = True
process_env.docker.base_image = DEFAULT_CPU_IMAGE



process_parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    
    entry_script="processing_example.py",
    mini_batch_size=PipelineParameter(name="process_batch_size_param", default_value="2"),
    error_threshold=1,
    output_action="summary_only",
    append_row_file_name="outputs.txt",
    environment=process_env,
    # compute_target2 is not defined in this notebook; the cluster created above is assumed here
    compute_target=compute_target,
#     process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
    node_count=2
)

process_parallelrun_step = ParallelRunStep(
    name="process",
    arguments=["--account_url", account_url, "--queue_name", queue_name,"--sas_token", sas_token,"--process_output",process_output_dir],

    parallel_run_config=process_parallel_run_config,
    inputs=[preprocess_output_ds],
    output=process_output_dir,
    allow_reuse=False
)

In [None]:


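from azureml.pipeline.steps import DatabricksStep

# databricks_compute must be a Databricks compute target already attached to the workspace;
# it is not defined in this notebook
notebook_path = "/patient_matching/samples/data_loading"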
databricks_loading_step = DatabricksStep(
    name="Databricks_loading",
    instance_pool_id='0802-004752-cones91-pool-SEQ0qDvt',
    spark_version = '7.0.x-scala2.12',
#     existing_cluster_id='0616-180808-font886',
    num_workers=2,
    notebook_path=notebook_path,
    inputs=[process_output_dir],
    run_name='data_loading',
    compute_target=databricks_compute,
    allow_reuse=False
)





In [None]:
from azureml.core import Experiment
# step_sequence = StepSequence(steps=[grouping_step, preprocess_parallelrun_step, process_parallelrun_step])

pipeline = Pipeline(workspace=ws, steps=[grouping_step, preprocess_parallelrun_step, process_parallelrun_step, databricks_loading_step])
experiment = Experiment(ws, 'hungrywizard_eligibility')
pipeline_run = experiment.submit(pipeline)

In [55]:
pipeline_run.publish_pipeline(name="hunrgy_wizard_demo", description="Demo of the multi-step hungry wizard pipeline", version=1.0)

Name,Id,Status,Endpoint
hunrgy_wizard_demo,e0e1272a-74e7-40c0-a6a0-f8e454b16a00,Active,REST Endpoint


In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()


### Optional: View detailed logs (streaming) 

In [21]:
pipeline_run.wait_for_completion(show_output=True)

PipelineRunId: 270ce7d9-61a6-4bff-a87f-9e3aa78636a0
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/hungrywizard_eligibility/runs/270ce7d9-61a6-4bff-a87f-9e3aa78636a0?wsid=/subscriptions/accde124-a45f-47be-aefa-b09362b2c5fc/resourcegroups/rg-HungryWizard/workspaces/hungrywizardml
PipelineRun Status: Running


StepRunId: 2d3ab6f9-739f-4381-a390-609e2b265dbd
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/hungrywizard_eligibility/runs/2d3ab6f9-739f-4381-a390-609e2b265dbd?wsid=/subscriptions/accde124-a45f-47be-aefa-b09362b2c5fc/resourcegroups/rg-HungryWizard/workspaces/hungrywizardml
StepRun( eligibility ) Status: Running

Streaming azureml-logs/70_driver_log.txt
Entering context manager injector. Current time:2020-07-02T19:51:18.470434
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 148
Set Dataset eligibility_param_config's target path to /mnt/batch/tasks/shared/LS_root/jo

ExperimentExecutionException: ExperimentExecutionException:
	Message: The output streaming for the run interrupted.
But the run is still executing on the compute target. 
Details for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "The output streaming for the run interrupted.\nBut the run is still executing on the compute target. \nDetails for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run"
    }
}

### Resubmit with a different dataset
Because the input is a `PipelineParameter`, we can resubmit the pipeline with a different dataset without creating a new experiment. The commented-out cells below use the same datastore but only a single image.
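Resubmitting with new parameter values amounts to overriding some defaults and keeping the rest. The sketch below shows that merge in plain Python, without the SDK. `resolve_parameters` is a hypothetical helper, the parameter names and defaults are the ones declared earlier in this notebook, and rejecting unknown names is a choice made in this sketch, not documented SDK behavior.

```python
def resolve_parameters(defaults, overrides=None):
    """Merge per-submission overrides into the pipeline's default parameter values."""
    overrides = overrides or {}
    # Sketch-only safeguard: fail early on names the pipeline never declared.
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"unknown pipeline parameters: {unknown}")
    return {**defaults, **overrides}


defaults = {
    "queue_name": "landingzonequeue",
    "preprocess_batch_size_param": "1000",
    "process_batch_size_param": "2",
}

first_run = resolve_parameters(defaults)
second_run = resolve_parameters(defaults, {"process_batch_size_param": "1"})
print(second_run)
```

In the SDK, the overrides are passed as the `pipeline_parameters` dictionary of `experiment.submit`, as in the commented-out cell below.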

In [None]:
# path_on_datastore = mnist_data.path('mnist')
# single_image_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

In [None]:
# pipeline_run_2 = experiment.submit(pipeline, 
#                                    pipeline_parameters={"mnist_param": single_image_ds, 
#                                                         "batch_size_param": "1",
#                                                         "process_count_param": 1}
# )

In [None]:
# pipeline_run_2.wait_for_completion(show_output=True)