# Batch processing with Azure pipelines
Azure Machine Learning pipelines can be created either in the designer or with the Python azureml SDK.
In this lab we are going to create a simple Azure pipeline for batch processing. The pipeline consists of two steps: preprocessing and scoring.
Be aware that we are going to use experimental features of azureml, which should not be used in a production environment.
Let's first import all needed packages:

In [None]:
import os
import pandas as pd
from azureml.core.model import Model
from azureml.core import Workspace
from azureml.core import Experiment

from azureml.core.dataset import Dataset
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.data.output_dataset_config import OutputFileDatasetConfig
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import RunConfiguration

## Connect to workspace, set up dataset and compute
To make the setting more realistic, we do not use our registered dataset but read the CSV file with the raw credit data directly. Be aware that this means we are predicting on our training data. This is acceptable for demonstration purposes only; it is not something you would do in production. We create a DatasetConsumptionConfig as the data input at the beginning of the pipeline. Two OutputFileDatasetConfig objects serve as the intermediate and final locations for the output files. The result_data output is registered as a new dataset (batch-scoring-results) by calling register_on_complete.

In [None]:
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'german_credit_dataset.csv')])
input_data = DatasetConsumptionConfig("input_dataset", dataset)
intermediate_data = OutputFileDatasetConfig(name='intermediate_dataset', destination=(datastore, 'intermediate/{run-id}'))
result_data = OutputFileDatasetConfig(name='result_dataset', destination=(datastore, 'result/{run-id}')).register_on_complete('batch-scoring-results')
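
The `{run-id}` placeholder in the two destination paths is what keeps the outputs of different pipeline runs apart: to our understanding, Azure ML replaces it with the ID of the run when the output is written. The sketch below only imitates that substitution with plain string handling to show the resulting folder layout. The helper `resolve_destination` and the run IDs are hypothetical and not part of azureml.

```python
def resolve_destination(template: str, run_id: str) -> str:
    """Imitate the {run-id} substitution (hypothetical helper, not an azureml API)."""
    return template.replace("{run-id}", run_id)


# Made-up run IDs, for illustration only
for run_id in ["run-001", "run-002"]:
    print(resolve_destination("intermediate/{run-id}", run_id))
    print(resolve_destination("result/{run-id}", run_id))
```

Each run therefore ends up with its own `intermediate/...` and `result/...` folder on the datastore, so a new run does not overwrite the files of an earlier one.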


If the compute target "batch-comp" does not exist in your workspace yet, it will be created.

In [None]:
compute_name = 'batch-comp'

# checks to see if compute target already exists in workspace, else create it
if compute_name in ws.compute_targets:
    compute_target = ComputeTarget(workspace=ws, name=compute_name)
else:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS11_V2",
                                                   vm_priority="lowpriority",
                                                   min_nodes=1,
                                                   max_nodes=2)

    compute_target = ComputeTarget.create(workspace=ws, name=compute_name, provisioning_configuration=config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Next, we create a run configuration based on the conda dependencies that the scripts need.

In [None]:
conda_dep = CondaDependencies()
conda_dep.add_pip_package("scikit-learn==0.22")
config = RunConfiguration(conda_dependencies=conda_dep)
config

## Prepare the pipeline steps
We create two PythonScriptStep objects. For each object we need to supply a Python script. The scripts are already prepared in the batch_scripts folder; we load them here only to have a look at them. You can find other kinds of pipeline steps [here](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps?view=azure-ml-py).

In [None]:
with open("batch_scripts/preprocessing_step.py", "r") as f:
    print(f.read())

In [None]:
with open("batch_scripts/scoring_step.py", "r") as f:
    print(f.read())

The two scripts, together with the data locations and the compute target, are passed to the PythonScriptStep constructors. The allow_reuse flag lets a step reuse the results of an earlier run, provided such results exist and the step has not changed since that run.

In [None]:
preprocessing_step = PythonScriptStep(
    script_name="preprocessing_step.py",
    name='preprocessing_step',
    arguments=['--intermediate-data-path', intermediate_data],
    compute_target=compute_target,
    runconfig=config,
    inputs=[input_data],
    outputs=[intermediate_data],
    source_directory='./batch_scripts',
    allow_reuse=True
)
scoring_step = PythonScriptStep(
    script_name="scoring_step.py",
    name='scoring_step',
    arguments=['--intermediate-data-path', intermediate_data, '--result-data-path', result_data],
    compute_target=compute_target,
    runconfig=config,
    inputs=[intermediate_data],
    outputs=[result_data],
    source_directory='./batch_scripts'
)
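
Each step hands its output location to the script as a command-line argument, so the script has to parse that argument and write its files into the given folder. The following is a minimal sketch of that script side only, assuming the argument name used above. It is not the content of the real `preprocessing_step.py`; the function names, the file name `preprocessed.csv` and the sample row are hypothetical, and it uses only the standard library so that it runs outside Azure ML.

```python
import argparse
import csv
import os
import tempfile


def parse_args(argv=None):
    """Read the output folder that the pipeline step passes on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--intermediate-data-path", dest="intermediate_data_path", required=True)
    return parser.parse_args(argv)


def write_rows(folder, rows, filename="preprocessed.csv"):
    """Write a list of dicts as CSV into the output folder (hypothetical file name)."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename)
    with open(target, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return target


# Local dry run with a temporary folder standing in for the datastore path
with tempfile.TemporaryDirectory() as tmp:
    args = parse_args(["--intermediate-data-path", tmp])
    written = write_rows(args.intermediate_data_path, [{"Age": 35, "Job": 2}])
    print(os.path.basename(written))
```

Inside a real pipeline run the argument list comes from the step definition instead of being passed by hand, and the folder points to the mounted or uploaded output location.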

## Run the pipeline
We can now combine the steps into a pipeline and submit it as a new experiment run. You can find all logs in your workspace. The intermediate and final files can be found in your Azure blob storage, which was created automatically.

In [None]:
scoring_pipeline = Pipeline(workspace=ws, steps=[preprocessing_step, scoring_step])
pipeline_run = Experiment(ws, 'batch-score').submit(scoring_pipeline)
pipeline_run.wait_for_completion(show_output=False)
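
Note that the position of a step in the `steps` list is not what decides when it runs. Azure ML derives the execution order from the data dependencies, so `scoring_step` waits for `preprocessing_step` because it consumes `intermediate_data`. The sketch below mirrors this idea with the standard library's `graphlib` (Python 3.9+). It illustrates dependency-based ordering in general and makes no claim about how azureml implements it; the dictionary describing the steps is a hypothetical stand-in for the step objects.

```python
from graphlib import TopologicalSorter

# Which data each step consumes and produces (names taken from the pipeline above).
# The steps are deliberately listed in the "wrong" order.
steps = {
    "scoring_step": {"inputs": ["intermediate_dataset"], "outputs": ["result_dataset"]},
    "preprocessing_step": {"inputs": ["input_dataset"], "outputs": ["intermediate_dataset"]},
}

# Map every output to the step that produces it
producer = {out: name for name, io in steps.items() for out in io["outputs"]}

# A step depends on the producers of its inputs; external inputs have no producer
graph = {
    name: {producer[i] for i in io["inputs"] if i in producer}
    for name, io in steps.items()
}

execution_order = list(TopologicalSorter(graph).static_order())
print(execution_order)  # ['preprocessing_step', 'scoring_step']
```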

As you know from the designer, you can monitor the pipeline while it is running in the Experiments section of your workspace (open the specific run).
<img src="images/pipeline-steps.jpg" alt="Pipeline" style="width: 800px;"/>

## Results
Let us have a look at the resulting data. We can easily access the results through the registered dataset: the result was automatically registered as batch-scoring-results, as defined when we created the output location above. For comparison, we open the original credit risk dataset that we registered in lab 3. We can see the added column "prediction". Of course, in a real-life scenario the data would be unlabeled, i.e. you would not have the "Risk" column.

In [None]:
dataset = Dataset.get_by_name(ws, name='batch-scoring-results', version="latest")
df_path = dataset.download('data/batch_scoring_results', overwrite=True)
pd.read_csv(df_path[0]).head()

In [None]:
dataset = Dataset.get_by_name(ws, name='german_credit_dataset', version="latest")
ds_df = dataset.to_pandas_dataframe()
ds_df.head()


## Disclaimer

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.