# Publish a Training Pipeline
In this notebook, we show how to automate the training and retraining of a model using HyperDrive and how to register the best model. Once the training pipeline is published, it provides a REST endpoint that can be called to run the pipeline, or we can create a schedule to run it.

The steps in this notebook are
- [import libraries](#import),
- [read in the Azure ML workspace](#workspace),
- [upload the data to the cloud](#upload),
- [define a hyperparameter search configuration](#configuration),
- [create an estimator](#estimator),
- [get an overview of Azure Machine Learning Pipelines](#aml_pipeline_overview),
- [create AML Pipeline HyperDrive step](#aml_pipeline_hd_step),
- [create AML Pipeline PythonScript step](#aml_pipeline_ps_step),
- [create AML Pipeline](#create_aml_pipeline),
- [publish AML Pipeline](#publish_aml_pipeline), and
- [run published pipeline using its REST endpoint](#run_publish_aml_pipeline).


## Imports  <a id='import'></a>

In [None]:
import os
import pandas as pd
from azureml.core import Workspace, Experiment
from azureml.train.hyperdrive import HyperDriveRun
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

from azureml.core.datastore import Datastore
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import HyperDriveStep, PythonScriptStep, EstimatorStep
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.core.runconfig import RunConfiguration, CondaDependencies

from azureml.train.hyperdrive import (
    RandomParameterSampling, choice, PrimaryMetricGoal,
    HyperDriveConfig, MedianStoppingPolicy)

import azureml.core
from get_auth import get_auth
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

## Read in the Azure ML workspace  <a id='workspace'></a>
Read in the workspace created in a previous notebook.

In [None]:
auth = get_auth()
ws = Workspace.from_config(auth=auth)
ws_details = ws.get_details()
print('Name:\t\t{}\nLocation:\t{}'
      .format(ws_details['name'],
              ws_details['location']))

## Upload the data to the cloud <a id='upload'></a>
We put the data in a particular directory on the workspace's default data store. This will show up in the same location in the file system of every job running on the Batch AI cluster.

Get a handle to the workspace's default data store.

In [None]:
ds = ws.get_default_datastore()

Upload the data. We use `overwrite=False` to avoid taking the time to re-upload the data should files with the same names be already present. If you change the data and want to refresh what's uploaded, use `overwrite=True`.

In [None]:
ds.upload(src_dir=os.path.join('.', 'data'), target_path='data', overwrite=False, show_progress=True)

## Define a hyperparameter search configuration <a id='configuration'></a>
Define the hyperparameter space for a random search.  We will use a constant value for the number of estimators that is enough to let us reliably identify the best of the parameter configurations. Once we have the best combination, we will build a model using a larger number of estimators to boost the performance. The table below should give you an idea of the trade-off between the number of estimators and the modeling run time, model size, and model gain.

| Estimators | Run time (s) | Size (MB) | Gain@1 | Gain@2 | Gain@3 |
|------------|--------------|-----------|------------|------------|------------|
|        100 |           40 |  2 | 25.02% | 38.72% | 47.83% |
|       1000 |          177 |  4 | 46.79% | 60.80% | 69.11% |
|       2000 |          359 |  7 | 51.38% | 65.93% | 73.09% |
|       4000 |          628 | 12 | 53.39% | 67.40% | 74.74% |
|       8000 |          904 | 22 | 54.62% | 67.77% | 75.35% |


In [None]:
hyperparameter_sampling = RandomParameterSampling({
    'ngrams': choice(range(1, 5)),
    'match': choice(range(2, 41)),
    'min_child_samples': choice(range(1, 31)),
    'unweighted': choice('Yes', 'No')
})

This hyperparameter space specifies a grid of 9,360 unique configuration points (4 `ngrams` × 39 `match` × 30 `min_child_samples` × 2 `unweighted`). We control the resources used by the search by specifying the maximum number of configuration points to sample as `max_total_runs`.

In [None]:
max_total_runs = 96
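
As a quick check of the arithmetic, the sketch below counts the grid points from the same ranges used in the sampling definition and shows how small a share of the grid 96 random samples cover. The variable names are illustrative and not part of the tutorial's scripts.

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring the search space above.
space_sizes = {
    'ngrams': len(range(1, 5)),              # 4 values
    'match': len(range(2, 41)),              # 39 values
    'min_child_samples': len(range(1, 31)),  # 30 values
    'unweighted': len(('Yes', 'No')),        # 2 values
}

grid_points = prod(space_sizes.values())
sampled_fraction = 96 / grid_points  # 96 is the max_total_runs value set above

print('grid points: {}'.format(grid_points))
print('fraction sampled: {:.2%}'.format(sampled_fraction))
```

Random sampling therefore visits roughly one percent of the grid, which is why the early termination policy described later matters for keeping the search cheap.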

It is also possible to specify a maximum duration for the tuning experiment by setting `max_duration_minutes`. If both of these parameters are specified, any remaining runs are terminated once `max_duration_minutes` have passed.

Specify the primary metric to be optimized as the gain at 3, and that it should be maximized. This metric is logged by the training script.

In [None]:
primary_metric_name = "gain@3"
primary_metric_goal = PrimaryMetricGoal.MAXIMIZE

The training script logs the metric throughout training, so we may specify an early termination policy. If no policy is specified, the hyperparameter tuning service will let all training runs run to completion. We use a median stopping policy that terminates runs whose best metrics on the tune dataset are worse than the median of the running averages of the metrics on all training runs, and we delay the policy's application until each run's fifth metric report.

In [None]:
policy = MedianStoppingPolicy(delay_evaluation=5)
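
To make the rule concrete, here is a minimal plain-Python sketch of a median stopping decision for a metric that is being maximized. It is a simplified illustration of the description above, not the HyperDrive service's implementation; the function name `should_stop` and the example metric values are made up for this sketch.

```python
from statistics import mean, median


def should_stop(run_metrics, other_runs_metrics, delay_evaluation=5):
    """Simplified median stopping rule for a metric that is being maximized.

    run_metrics: metric values reported so far by the run under evaluation.
    other_runs_metrics: list of metric histories, one per other run.
    """
    n = len(run_metrics)
    # Do not apply the policy before the run has reported enough metrics.
    if n < delay_evaluation:
        return False
    # Running average of every other run over the same number of reports.
    running_averages = [mean(history[:n])
                        for history in other_runs_metrics if history]
    if not running_averages:
        return False
    # Stop when the run's best value so far is below the median running average.
    return max(run_metrics) < median(running_averages)


# Made-up metric histories, used only to exercise the rule.
others = [[0.40, 0.45, 0.50, 0.55, 0.60],
          [0.30, 0.35, 0.40, 0.45, 0.50],
          [0.50, 0.55, 0.60, 0.65, 0.70]]
weak_run = [0.20, 0.25, 0.30, 0.35, 0.40]
strong_run = [0.45, 0.50, 0.55, 0.60, 0.65]

print(should_stop(weak_run, others))      # best value is below the median
print(should_stop(strong_run, others))    # best value is above the median
print(should_stop(weak_run[:4], others))  # too few reports to evaluate
```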

## Create an estimator <a id='estimator'></a>
Create an estimator that specifies the location of the script, sets up its fixed parameters, including the location of the data, the compute target, and specifies the packages needed to run the script. It may take a while to prepare the run environment the first time an estimator is used, but that environment will be used until the list of packages is changed.

In [None]:
compute_target = 'hypetuning'
estimator = Estimator(source_directory=os.path.join('.', 'scripts'),
                      entry_script='TrainClassifier.py',
#                       script_params={'--data-folder': ds.as_mount(),
#                                      '--estimators': 1000},
                      compute_target=compute_target,
                      conda_packages=['pandas==0.23.4',
                                      'scikit-learn==0.20.0'],
                      pip_packages=['lightgbm==2.1.2'])

Put the estimator and the configuration information together into a HyperDrive run configuration object.

In [None]:
hyperdrive_run_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=hyperparameter_sampling,
    policy=policy,
    primary_metric_name=primary_metric_name,
    primary_metric_goal=primary_metric_goal,
    max_total_runs=max_total_runs)

## Azure Machine Learning Pipelines: Overview <a id='aml_pipeline_overview'></a>

A common scenario when using machine learning components is to have a data workflow that includes the following steps:

- Preparing/preprocessing a given dataset for training, followed by
- Training a machine learning model on this data, and then
- Deploying this trained model in a separate environment, and finally
- Running a batch scoring task on another data set, using the trained model.

Azure's Machine Learning pipelines give you a way to combine multiple steps like these into one configurable workflow, so that multiple agents/users can share and/or reuse this workflow. Machine learning pipelines thus provide a consistent, reproducible mechanism for building, evaluating, deploying, and running ML systems.

To get more information about Azure machine learning pipelines, please read our [Azure Machine Learning Pipelines overview](https://aka.ms/pl-concept), or the [getting started notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-getting-started.ipynb).



Let's first create the pipeline data that will be used to share data between the steps of the pipeline.


In [None]:
between_steps_data = PipelineData("steps_data", datastore=ds)

Let's create a data reference for the raw data to be used in the HyperDrive run.

In [None]:
data_folder = DataReference(datastore=ds, data_reference_name="data_folder")

## Create AML Pipeline HyperDrive Step <a id='aml_pipeline_hd_step'></a>
We create a HyperDrive step in the AML pipeline to perform a search for hyperparameters.

In [None]:
estimators = PipelineParameter(name="estimators", default_value=1000)

In [None]:
hd_step_name = "hd_step"
hd_step = HyperDriveStep(
    name=hd_step_name,
    hyperdrive_config=hyperdrive_run_config,
    estimator_entry_script_arguments=["--data-folder", data_folder,
                                      "--estimators", estimators,
                                      "--steps-data", between_steps_data,
                                      "--save-run-id", "run_id.txt"],
    inputs=[data_folder],
    outputs=[between_steps_data],
    allow_reuse=False)
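
The training script itself is not shown in this notebook, so the following is only a hypothetical sketch of how a script such as `TrainClassifier.py` might read these arguments. It assumes the script uses `argparse` and that each sampled hyperparameter arrives as an extra `--name value` pair; the function name, the `dest` names, and the example paths are placeholders.

```python
import argparse


def parse_training_args(argv):
    """Parse the arguments a HyperDrive child run might receive (hypothetical)."""
    parser = argparse.ArgumentParser(
        description='Hypothetical training script arguments')
    # Fixed arguments, taken from estimator_entry_script_arguments above.
    parser.add_argument('--data-folder', dest='data_folder')
    parser.add_argument('--estimators', dest='estimators', type=int,
                        default=1000)
    parser.add_argument('--steps-data', dest='steps_data')
    parser.add_argument('--save-run-id', dest='save_run_id')
    # Sampled hyperparameters, assumed to arrive as "--name value" pairs.
    parser.add_argument('--ngrams', type=int)
    parser.add_argument('--match', type=int)
    parser.add_argument('--min_child_samples', type=int)
    parser.add_argument('--unweighted', choices=['Yes', 'No'])
    return parser.parse_args(argv)


# Placeholder values; in a real run the pipeline supplies the paths.
args = parse_training_args([
    '--data-folder', '/mnt/data', '--estimators', '1000',
    '--steps-data', '/mnt/steps', '--save-run-id', 'run_id.txt',
    '--ngrams', '2', '--match', '20', '--min_child_samples', '10',
    '--unweighted', 'No'])
print(args.estimators, args.ngrams, args.unweighted)
```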

## Create AML Pipeline PythonScript Step <a id='aml_pipeline_ps_step'></a>
We create a simple Python script step to get the best hyperparameters found by a HyperDrive step, and use them to train the best model. This step is equivalent to the task performed in notebook 05_Train_Best_Model.ipynb. The Best_Run.py script gets the best run from the set of HyperDrive runs and obtains its hyperparameters.

In [None]:
%%writefile scripts/Best_Run.py

from __future__ import print_function
import os
import argparse

from azureml.core import Run, Workspace, Experiment
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import HyperDriveRun
import azureml.core

if __name__ == "__main__":
    
    print("azureml.core.VERSION={}".format(azureml.core.VERSION))

    parser = argparse.ArgumentParser(description="Retrieve the hyperparameters "
                                     "of the best run")
    parser.add_argument("--hd-step", dest="hd_step",
                        help="the name of the HyperDrive step",
                        default="hd_step")
    parser.add_argument("--steps-data", dest="steps_data",
                        help="to share data between different steps in a pipeline",
                        default="outputs")
    args = parser.parse_args()
    
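    # Get the HyperDrive run, which is the first child of the HyperDrive step.
    run = Run.get_context()
    pipeline_run = run.parent
    hd_run = None
    for step in pipeline_run.get_steps():
        if step.name == args.hd_step:
            hd_run = list(step.get_children())[0]
            break
    if hd_run is None:
        raise Exception("No step named {} was found".format(args.hd_step))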
    
    # Get the best run.
    best_run = hd_run.get_best_run_by_primary_metric()
    if best_run is None:
        raise Exception("No best run was found")
    print("Best Run run id is {}".format(best_run.id))
    
    print(best_run.get_file_names())
    
    # Register Best run model
    model = best_run.register_model(model_name="FAQ_ranker",
                                    model_path=os.path.join("outputs",
                                                            "FAQ_ranker.pkl"))
    print("Best run model registered")
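
The selection that `get_best_run_by_primary_metric` performs can be pictured with a small plain-Python sketch. This is an illustration of the idea only, not the SDK's implementation; the record layout, the function name, and the metric values are hypothetical.

```python
def best_run_by_metric(runs, metric_name='gain@3', maximize=True):
    """Return the run record with the best value of metric_name, or None.

    runs: list of dicts such as {'id': 'run_1', 'metrics': {'gain@3': 0.69}};
    this record layout is hypothetical and only serves the illustration.
    """
    # Ignore runs that never logged the primary metric.
    scored = [r for r in runs if metric_name in r.get('metrics', {})]
    if not scored:
        return None
    pick = max if maximize else min
    return pick(scored, key=lambda r: r['metrics'][metric_name])


# Made-up run records, used only to exercise the function.
example_runs = [
    {'id': 'run_a', 'metrics': {'gain@3': 0.61}},
    {'id': 'run_b', 'metrics': {'gain@3': 0.69}},
    {'id': 'run_c', 'metrics': {}},  # no primary metric logged
]
best = best_run_by_metric(example_runs)
print(best['id'])
```

Returning `None` when no run logged the metric mirrors the check in `Best_Run.py`, which raises an exception if no best run is found.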

Create a run configuration to specify the environment for the PythonScript step.

In [None]:
run_config = RunConfiguration(conda_dependencies=CondaDependencies.create(
                      conda_packages=['pandas==0.23.4',
                                      'scikit-learn==0.20.0'],
                      pip_packages=['lightgbm==2.1.2',
                                    'azureml-sdk',
                                    'azure-cli'])
                             )
run_config.environment.docker.enabled = True

Create a PythonScript step for the AML pipeline to register the best model.

In [None]:
register_best_model = PythonScriptStep(
    name="Register Best Model",
    script_name='Best_Run.py',
    compute_target=compute_target,
    source_directory=os.path.join('.', 'scripts'),
    arguments=['--steps-data', between_steps_data],
    runconfig=run_config,
    inputs=[between_steps_data],
    allow_reuse=False,
)


Configure the Register Best Model step to run after the HyperDrive step.

In [None]:
register_best_model.run_after(hd_step)
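
An ordering constraint like this one lets a pipeline that is built from only its final step still include the steps that must run before it. The sketch below illustrates that kind of dependency resolution in plain Python. It is not how the SDK builds its graph; the function name and the dictionary layout are assumptions made for the illustration, and it does not detect cycles.

```python
def execution_order(final_steps, run_after):
    """Return step names so that every step follows the steps it depends on.

    final_steps: names of the steps passed to the pipeline.
    run_after: dict mapping a step name to the names it must run after.
    """
    order, visited = [], set()

    def visit(step):
        if step in visited:
            return
        visited.add(step)
        # Schedule all prerequisites before the step itself.
        for dependency in run_after.get(step, []):
            visit(dependency)
        order.append(step)

    for step in final_steps:
        visit(step)
    return order


order = execution_order(
    final_steps=['Register Best Model'],
    run_after={'Register Best Model': ['hd_step']})
print(order)  # ['hd_step', 'Register Best Model']
```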

## Create & Run the pipeline <a id='create_aml_pipeline'></a>

In [None]:
exp = Experiment(workspace=ws, name='hypetuning')
pipeline = Pipeline(workspace=ws, steps=[register_best_model])

# Run the pipeline without publishing it.
pipeline_run = exp.submit(pipeline, continue_on_step_failure=True)
pipeline_run.wait_for_completion(show_output=True)

The [next notebook](08_Tear_Down.ipynb) shows how to delete the components created by this tutorial.

## Publish a Pipeline  <a id='publish_aml_pipeline'></a>
Read more about why to publish a pipeline and how it can be triggered [here]()

In [None]:
published_pipeline = pipeline.publish(name="HyperDrive Pipeline", description="HyperDrive Pipeline", continue_on_step_failure=True)
published_pipeline.endpoint

## Run published pipeline using its REST endpoint <a id='run_publish_aml_pipeline'></a>
This step shows how to call the REST endpoint of a published pipeline to trigger a pipeline run.

In [None]:
import requests

aad_token = auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint

print("You can perform HTTP POST on URL {} to trigger this pipeline".format(rest_endpoint))

# Specify the experiment name and run source when triggering the pipeline.
response = requests.post(rest_endpoint, 
                         headers=aad_token, 
                         json={"ExperimentName": "hypetuning",
                               "RunSource": "SDK"})
run_id = response.json()["Id"]

print(run_id)
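
The request above runs the pipeline with the default value of the `estimators` pipeline parameter. To override it, the request body can carry a `ParameterAssignments` entry. The helper below is a small sketch of building such a body; the function name is hypothetical and the value 2000 is only an example.

```python
import json


def build_pipeline_request(experiment_name, parameters=None):
    """Build the JSON body for a POST to a published pipeline's REST endpoint."""
    body = {"ExperimentName": experiment_name, "RunSource": "SDK"}
    if parameters:
        # ParameterAssignments maps PipelineParameter names to values for this run.
        body["ParameterAssignments"] = dict(parameters)
    return body


body = build_pipeline_request("hypetuning", {"estimators": 2000})
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the `json` argument of the same `requests.post(rest_endpoint, headers=aad_token, json=body)` call used above.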