# Introduction

This notebook demonstrates how to use a partitioning strategy to solve large-scale route optimization problems. The rationale is that most optimization problems of this kind are NP-hard, so solving a large instance directly can be impractical. To trade off result optimality against running time, one can partition the big problem into many smaller problems, solve each smaller problem, and finally combine all results into the final result. The whole pipeline is illustrated in the figure below.

<img src=../docs/media/pipeline.png width="90%" />

There are 4 main steps in the pipeline:
1.  Reduce: Assign some of the orders to truck routes heuristically. The remaining unscheduled orders are passed to the later steps for optimization. This step is optional: one can bypass it and let the optimizer search for a solution over all orders. However, a good heuristic can significantly reduce the search space, which makes it easier for the optimization solver to find a good solution.
2.  Partition: This is the core step, which partitions the big problem into smaller problems.
3.  Solve: Solve each small problem individually using any optimization solver.
4.  Merge: Combine the results of all the small problems into the final result.
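
The four steps above can be sketched in plain Python on a toy problem. This is only an illustration of the data flow, not the logic in `reduce.py`, `partition.py`, `solve.py`, or `merge.py`: the order tuples, the truck capacity, the "full truck" heuristic, and the first-fit solver are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy orders: (order_id, source depot, destination, weight)
orders = [
    ("o1", "depot_A", "x", 10),
    ("o2", "depot_A", "y", 4),
    ("o3", "depot_B", "x", 3),
    ("o4", "depot_B", "z", 5),
]
TRUCK_CAPACITY = 10  # assumed capacity, for illustration only


def reduce_orders(orders):
    """Heuristic: an order that fills a truck by itself is scheduled directly."""
    scheduled = [[o] for o in orders if o[3] == TRUCK_CAPACITY]
    remaining = [o for o in orders if o[3] != TRUCK_CAPACITY]
    return scheduled, remaining


def partition_orders(orders):
    """Split the remaining orders into independent subproblems, one per source."""
    groups = defaultdict(list)
    for order in orders:
        groups[order[1]].append(order)
    return list(groups.values())


def solve(subproblem):
    """Stand-in solver: pack orders into trucks with a first-fit rule."""
    trucks = []
    for order in subproblem:
        for truck in trucks:
            if sum(o[3] for o in truck) + order[3] <= TRUCK_CAPACITY:
                truck.append(order)
                break
        else:
            trucks.append([order])
    return trucks


def merge(partial, results):
    """Concatenate the heuristic routes with the routes of every subproblem."""
    return partial + [truck for result in results for truck in result]


partial, remaining = reduce_orders(orders)
routes = merge(partial, [solve(p) for p in partition_orders(remaining)])
```

In the real pipeline each function corresponds to one Azure ML step, and the intermediate values are passed between steps as files in the datastore rather than as Python objects.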




# 1.0 Load libraries

We implement the pipeline with Azure ML pipelines. Specifically, the partitioned subproblems are solved in parallel by the ParallelRunStep in the Azure ML SDK.

In [None]:
import os
from dotenv import load_dotenv, find_dotenv

import azureml.core
from azureml.core.authentication import AzureCliAuthentication
from azureml.core import Workspace, Experiment, Datastore, Environment, Dataset
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

from azureml.pipeline.core import Pipeline, PipelineParameter, PipelineData
from azureml.pipeline.steps import PythonScriptStep

from azureml.data import OutputFileDatasetConfig

from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig
from azureml.data.datapath import DataPath, DataPathComputeBinding

# 1.1 Set up the environment
## 1.1.1 Load variables

Some parameters are managed by environment variables. To specify your values, create a .env file in the root folder of the repository and set the values for the following parameters.
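
A `.env` file holds one `KEY=value` pair per line, for example `AML_WORKSPACE_NAME=<your-workspace-name>`. As a small optional aid, the sketch below reports which of the three variables read by this notebook are still unset. The helper name `missing_settings` is hypothetical and not part of the repository.

```python
import os

# Variable names taken from the cell that reads them in this notebook.
REQUIRED_VARS = ("AML_WORKSPACE_NAME", "AML_SUBSCRIPTION_ID", "AML_RESOURCE_GROUP")


def missing_settings(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```

Call `missing_settings()` after `load_dotenv()` so that values from the `.env` file are already in `os.environ`.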

In [None]:
load_dotenv()

ws_name = os.environ['AML_WORKSPACE_NAME']
subscription_id = os.environ['AML_SUBSCRIPTION_ID']
resource_group = os.environ['AML_RESOURCE_GROUP']


print('---- Check Azure setting ----')
print(f'AML Workspace name       : {ws_name}')
print(f'Subscription ID          : {subscription_id}')
print(f'Resource group           : {resource_group}')

## 1.1.2 Azure authentication and Load Azure ML Workspace

We use Azure CLI to authenticate.

In [None]:
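try:
    # Authenticate with the credentials from `az login`
    cli_auth = AzureCliAuthentication()
    ws = Workspace(subscription_id=subscription_id, resource_group=resource_group,
                   workspace_name=ws_name, auth=cli_auth)
    print(ws.get_details())
except Exception as e:
    print(f"Workspace not accessible. Change your parameters or create a new workspace. ({e})")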

## 1.1.3 Get Compute Cluster

The compute target name is set in the cell below. If it does not exist in the Azure ML workspace, a new compute target is created.

In [None]:
# Retrieve or create an Aml compute
aml_compute_target = 'opcluster'

min_nodes = 0
max_nodes = 10
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating new compute target")
    

    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2",
                                                                min_nodes = min_nodes, 
                                                                max_nodes = max_nodes)    
    aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

## 1.1.4 Create AML Environment and Run Configuration

In [None]:
# Default datastore (Azure blob storage)
def_blob_store = ws.get_default_datastore()

# source directory
source_directory = '../src/'
print(f'Source code is in {source_directory} directory.')

In [None]:
# environment
env_name = 'op-env'

try:
    env = Environment.get(ws, env_name)
    print("Found existing environment.")

except Exception:
    print("Creating new environment")
    env = Environment(env_name)

    # enable Docker 
    env.docker.enabled = True
    # set Docker base image to the default CPU-based image
    env.docker.base_image = DEFAULT_CPU_IMAGE
    # use conda_dependencies.yml to create a conda environment in the Docker image for execution
    env.python.user_managed_dependencies = False
    # specify CondaDependencies obj
    env.python.conda_dependencies = CondaDependencies.create(
            conda_packages=['pandas']
            ,pip_packages=['ortools'
                            ,'azureml-defaults']
        )

    env.register(workspace=ws)

In [None]:
# create a new runconfig object
run_config = RunConfiguration()

# set environment
run_config.environment = env

# 1.2 Set up Azure ML Pipeline

This section contains the main logic of the optimization pipeline.

## 1.2.1 Reduce the search space of the problem

The first step is to reduce the search space by assigning some of the orders heuristically. The detailed logic is implemented in reduce.py. In general, a well-chosen heuristic gives a good trade-off between result optimality and running time.
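
To make the idea concrete, here is a minimal sketch of one possible heuristic. It is not the logic in `reduce.py`: the field names (`source`, `destination`, `weight`), the capacity, and the 90% fill threshold are assumptions chosen for illustration. Orders that share a source and a destination are packed into one truck, and the truck is committed only if it ends up nearly full; everything else is left for the optimizer.

```python
from collections import defaultdict


def reduce_orders(orders, capacity, min_fill=0.9):
    """Return (routes fixed by the heuristic, orders left for the solver)."""
    # Group orders that travel the same lane (source -> destination).
    lanes = defaultdict(list)
    for order in orders:
        lanes[(order["source"], order["destination"])].append(order)

    routes, remaining = [], []
    for lane_orders in lanes.values():
        load, truck, rest = 0, [], []
        # Heaviest orders first, added while they still fit in the truck.
        for order in sorted(lane_orders, key=lambda o: o["weight"], reverse=True):
            if load + order["weight"] <= capacity:
                truck.append(order)
                load += order["weight"]
            else:
                rest.append(order)
        if load >= min_fill * capacity:
            routes.append(truck)          # nearly full: fix this route now
            remaining.extend(rest)
        else:
            remaining.extend(lane_orders)  # not worth fixing: leave to the solver
    return routes, remaining


orders = [
    {"id": "a", "source": "A", "destination": "x", "weight": 6},
    {"id": "b", "source": "A", "destination": "x", "weight": 4},
    {"id": "c", "source": "A", "destination": "x", "weight": 3},
    {"id": "d", "source": "A", "destination": "y", "weight": 2},
]
routes, remaining = reduce_orders(orders, capacity=10)
```

With these toy orders, `a` and `b` fill one truck exactly and are fixed, while `c` and `d` remain for the later steps. A stricter `min_fill` fixes fewer routes and leaves more freedom to the solver.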

In [None]:
# Please replace it with the path you uploaded to the Azure ML default datastore.
order_file = 'model_input/order_small.csv' 
distance_file = 'model_input/distance.csv'

model_input = Dataset.File.from_files((def_blob_store, order_file))
distance = Dataset.File.from_files((def_blob_store, distance_file))

# Naming the intermediate data 
model_result_partial = PipelineData("model_result_partial",datastore=def_blob_store)
model_input_reduced = PipelineData("model_input_reduced",datastore=def_blob_store)

reduce_step = PythonScriptStep(
    script_name="reduce.py", 
    arguments=["--model_input", model_input.as_named_input('model_input').as_download(path_on_compute='order_file'),
                "--distance", distance.as_named_input('distance').as_download(path_on_compute='distance_file'),
                "--model_result_partial", model_result_partial,
                "--model_input_reduced", model_input_reduced],
    outputs=[model_result_partial, model_input_reduced],
    compute_target=aml_compute, 
    source_directory=source_directory,
    runconfig=run_config
)

## 1.2.2 Partition the problem

For large-scale optimization problems, the search space is too big to solve in practical time. A common approach is to partition the big problem into many smaller problems, solve each one individually, and combine the results into the final result. In some cases the partitioning does not affect result optimality; for example, in route optimization we can partition the orders by delivery source. In other cases, partitioning introduces a trade-off between result optimality and running time.
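
Partitioning by delivery source can be sketched as follows. This is a hedged illustration rather than the contents of `partition.py`: the column name `source`, the output file naming, and the sample rows are assumptions. The function writes one CSV file per source, so that each file can later be solved as an independent subproblem.

```python
import csv
import os
import tempfile
from collections import defaultdict


def partition_orders(order_csv, output_dir, key="source"):
    """Write one CSV per distinct value of `key`; return the paths written."""
    groups = defaultdict(list)
    with open(order_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[key]].append(row)

    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for value, rows in sorted(groups.items()):
        path = os.path.join(output_dir, f"orders_{value}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths


# Small local demonstration with made-up orders.
work_dir = tempfile.mkdtemp()
order_csv = os.path.join(work_dir, "orders.csv")
with open(order_csv, "w", newline="") as f:
    f.write("order_id,source,weight\no1,depot_A,4\no2,depot_B,3\no3,depot_A,5\n")
paths = partition_orders(order_csv, os.path.join(work_dir, "parts"))
```

Writing one file per partition fits a file-based parallel step, where each worker receives a batch of file paths.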

In [None]:
# Naming the intermediate data 
model_input_list = PipelineData("model_input_list",datastore=def_blob_store).as_dataset()

parition_step = PythonScriptStep(
    script_name="partition.py", 
    arguments=["--model_input_reduced", model_input_reduced,
                "--distance", distance.as_named_input('distance').as_download(path_on_compute='distance_file'),
                "--model_input_list", model_input_list],
    inputs=[model_input_reduced],
    outputs=[model_input_list],
    compute_target=aml_compute, 
    source_directory=source_directory,
    runconfig=run_config
)

## 1.2.3 Solve individual problem

After the problem is partitioned, we can solve each individual subproblem with any optimization solver. The solver itself may use multiple processes to speed up the search. That level of parallelism is controlled entirely by the solver, not by our Azure ML pipeline.
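
A ParallelRunStep entry script exposes two functions: `init()`, called once per worker process, and `run(mini_batch)`, called once per mini-batch. For file inputs, `mini_batch` is a list of file paths. The sketch below shows that shape only. It is not the repository's `solve.py`: the `weight` column, the capacity value, and the first-fit packing that stands in for a real solver are all assumptions.

```python
import csv
import os


def init():
    """Called once per worker process; load shared resources here."""
    global truck_capacity
    truck_capacity = 10  # assumed value, for illustration only


def first_fit(orders, capacity):
    """Stand-in solver: return the load of each truck after first-fit packing."""
    loads = []
    for order in orders:
        weight = float(order["weight"])
        for i, load in enumerate(loads):
            if load + weight <= capacity:
                loads[i] += weight
                break
        else:
            loads.append(weight)
    return loads


def run(mini_batch):
    """Called once per mini-batch; returns one result row per input file."""
    results = []
    for path in mini_batch:
        with open(path, newline="") as f:
            orders = list(csv.DictReader(f))
        loads = first_fit(orders, truck_capacity)
        results.append(f"{os.path.basename(path)},{len(loads)}")
    return results
```

With `output_action="append_row"`, each element of the returned list becomes one row of the aggregated output file, which is why `run` returns one string per processed file.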

In [None]:
import uuid

# Naming the intermediate data 
model_result_list = PipelineData("model_result_list", datastore=def_blob_store)

# pass distance file as side input
local_path = "/tmp/{}".format(str(uuid.uuid4()))
distance_config = distance.as_named_input("distance").as_mount(local_path)


parallel_run_config = ParallelRunConfig(
    source_directory=source_directory,
    entry_script='solve.py',
    mini_batch_size="5",
    error_threshold=-1,
    output_action="append_row",
    append_row_file_name="model_result_list.txt",
    environment=env,
    compute_target=aml_compute,
    run_invocation_timeout=600,
    node_count=max_nodes)

solve_step = ParallelRunStep(
    name="solve",
    inputs=[model_input_list.as_named_input('model_input_list')],
    output=model_result_list,
    arguments=["--distance", distance_config],
    side_inputs=[distance_config],
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)

## 1.2.4 Merge the results

Once all the smaller problems are solved, we combine their results into the final result. If the earlier partitioning affected global optimality, this step is also a chance to optimize further. For example, one may move packages from two separate results onto the same truck if the combined route is more cost-efficient.
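
As an illustration of such a post-merge improvement, the sketch below consolidates under-filled trucks. It is a simplified assumption, not the logic in `merge.py`: it looks only at weight and capacity, and it presumes the trucks being combined can serve each other's orders (for example, they start from the same depot). A real implementation would also compare route costs using the distance data.

```python
def load(truck):
    """Total weight carried by a truck, given as a list of (order_id, weight)."""
    return sum(weight for _, weight in truck)


def consolidate(trucks, capacity):
    """Greedily merge whole trucks whose combined load fits in one truck."""
    merged = []
    # Fullest trucks first, so that small loads fill the remaining gaps.
    for truck in sorted(trucks, key=load, reverse=True):
        for target in merged:
            if load(target) + load(truck) <= capacity:
                target.extend(truck)
                break
        else:
            merged.append(list(truck))
    return merged


# Routes coming from three separately solved subproblems (made-up data).
trucks = [[("o1", 7)], [("o2", 3)], [("o3", 6)]]
consolidated = consolidate(trucks, capacity=10)
```

Here the trucks carrying `o1` and `o2` are combined, so three trucks become two.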

In [None]:
# Please replace it with the path you uploaded to the Azure ML default datastore.
model_output_path = 'model_output'

# Naming the intermediate data 
model_result_final = OutputFileDatasetConfig(destination=(def_blob_store, model_output_path))

merge_step = PythonScriptStep(
    script_name="merge.py", 
    arguments=["--model_input", model_input.as_named_input('model_input').as_download(path_on_compute='order_file'), 
    "--distance", distance.as_named_input('distance').as_download(path_on_compute='distance_file'),
    "--model_result_partial", model_result_partial, 
    "--model_result_list", model_result_list, 
    "--model_result_final", model_result_final],
    inputs=[model_result_partial, model_result_list],
    outputs=[model_result_final],
    compute_target=aml_compute, 
    source_directory=source_directory,
    runconfig=run_config
)

## 1.2.5 Run the Pipeline

Finally, we chain all steps into a single Azure ML pipeline and submit it to run.
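
The execution order of the steps follows from their data dependencies: a step can start only after the steps that produce its inputs have finished. The sketch below reproduces that ordering outside Azure ML, using the input and output names from the step definitions in this notebook. It uses the standard-library `graphlib` module (Python 3.9+) and is only meant to illustrate the dependency graph, not how the SDK schedules steps internally.

```python
from graphlib import TopologicalSorter

# Step -> (inputs, outputs), copied from the step definitions in this notebook.
steps = {
    "reduce": ((), ("model_result_partial", "model_input_reduced")),
    "partition": (("model_input_reduced",), ("model_input_list",)),
    "solve": (("model_input_list",), ("model_result_list",)),
    "merge": (("model_result_partial", "model_result_list"), ("model_result_final",)),
}

# Map every intermediate output to the step that produces it.
producer = {out: name for name, (_, outs) in steps.items() for out in outs}

# A step depends on the producers of all of its inputs.
graph = {name: {producer[i] for i in ins} for name, (ins, _) in steps.items()}

order = list(TopologicalSorter(graph).static_order())
print(order)
```

Note that `merge` depends on both `reduce` (for the partial result) and `solve` (for the subproblem results), so it always runs last.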

In [None]:
pipeline = Pipeline(workspace=ws, steps=[reduce_step, parition_step, solve_step, merge_step])

print("Pipeline is built")

pipeline_run = Experiment(ws, 'optimization_example').submit(pipeline)
print("Pipeline is submitted for execution")

RunDetails(pipeline_run).show()

pipeline_run.wait_for_completion(show_output=True)

# 1.3 Check the Model Result

In [None]:
model_output = Dataset.Tabular.from_delimited_files(path=(def_blob_store, model_output_path))

print(model_output.to_pandas_dataframe())

# 1.4 Publish the Pipeline

In [None]:
# Publish the pipeline to Azure ML 

published_pipeline = pipeline_run.publish_pipeline(
    name='Route Optimization', description='Demo for route optimization', version='1.0')

published_pipeline

In [None]:
# Print the endpoint of the pipeline

rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)