# Udacity - Azure ML Engineer Nanodegree - Project 2: Operationalizing Machine Learning

This notebook demonstrates the use of `AutoMLStep` in an Azure Machine Learning Pipeline. It is a modified version of the starter notebook provided by Udacity, which can be found [here](https://github.com/udacity/nd00333_AZMLND_C2/blob/master/Exercise_starter_files/aml-exercise-pipelines-with-automated-machine-learning-step.ipynb).

## Introduction
In this example we showcase how you can use AzureML Dataset to load data for AutoML via AML Pipeline. 

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook we will execute the following steps:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` cluster to the workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep` to train the model on `AmlCompute`.
6. Explore the results and test the best fitted model.
7. Deploy the best fitted model as an Azure Container Instance.
8. Publish and deploy the entire pipeline.
9. Clean up.

We use the UCI Bank Marketing dataset to present our pipeline; it is described [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing).
Please make sure that the project is set up properly before execution (e.g. IDs and secrets are available as environment variables).
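
A quick pre-flight check can make a missing variable obvious before any Azure call fails. The sketch below is only an illustration: `missing_env_vars` is a hypothetical helper, `AZURE_SUBSCRIPTION_ID` is the variable read later in this notebook, and `AZURE_TENANT_ID` is an assumed second name used purely for the demo.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# A plain dict stands in for os.environ so the demo does not depend on your shell.
demo_env = {"AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000"}

# AZURE_TENANT_ID is a hypothetical extra variable, included only to show a miss.
required = ["AZURE_SUBSCRIPTION_ID", "AZURE_TENANT_ID"]

missing = missing_env_vars(required, demo_env)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Calling `missing_env_vars(required)` without the second argument checks the real process environment, for example right after `load_dotenv()`.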

### 0. Package-imports 

In [None]:
import azureml.core
from azureml.core import Workspace, Experiment, Datastore, ScriptRunConfig
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.model import Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.environment import Environment
from azureml.pipeline.core import PipelineData, TrainingOutput

from azureml.core.dataset import Dataset
from azureml.data.data_reference import DataReference
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, quniform
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.steps import AutoMLStep


import os
import joblib
import json
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from pathlib import Path
load_dotenv()

%config Completer.use_jedi = False

### 1. Initialize Workspace
Alternative: use previously persisted config with e.g.
```
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
```

In [None]:
ws = Workspace.get(name="udacity", subscription_id=os.getenv('AZURE_SUBSCRIPTION_ID'))

### 2. Create an Azure ML experiment
We use an experiment (named `ml-experiment-1` in the code cell of this section) and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the `source_directory` for the step. This reduces the size of the snapshot created for the step (only the specific folder is snapshotted). Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, this also allows the step to be reused when nothing in its `source_directory` has changed.
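
As a minimal sketch of that layout, the snippet below creates one folder per step inside a temporary directory. The helper `make_step_folder` and the step names `train` and `prep` are hypothetical and only illustrate the idea; in this notebook the corresponding root would be `project_folder`.

```python
import tempfile
from pathlib import Path


def make_step_folder(root, step_name):
    """Create (if needed) and return a dedicated folder for one pipeline step."""
    folder = Path(root) / step_name
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for the project folder in this demo.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = make_step_folder(tmp, "train")  # hypothetical step name
    prep_dir = make_step_folder(tmp, "prep")    # hypothetical step name

    # Only the training folder receives a script; editing it later would not
    # touch the (separate) snapshot of the prep folder.
    (train_dir / "train.py").write_text("print('training')\n")

    train_files = sorted(p.name for p in train_dir.iterdir())
    prep_files = sorted(p.name for p in prep_dir.iterdir())

print(train_files, prep_files)
```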

*Udacity Note:* There is no need to create a new Azure ML experiment; re-use the experiment that was already created.


In [None]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing experiment name
experiment_name = 'ml-experiment-1'
project_folder = './pipeline-project'

experiment = Experiment(ws, experiment_name)
experiment

### 3.  Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.

**Udacity Note:** There is no need to create a new compute target; the previous cluster can be re-used.

In [None]:

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "auto-ml-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_DS12_V2',  # for GPU, use "STANDARD_NC6"
        # vm_priority='lowpriority',  # optional
        min_nodes=1,
        max_nodes=6)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 0, timeout_in_minutes = 10)
# For a more detailed view of current AmlCompute status, use get_status().

### 4. Upload, reference datasets.

**Udacity note:** Make sure the `key` matches the name of the uploaded dataset, and that the description matches as well. If the name is unknown or hard to find, loop over `ws.datasets.keys()` and `print()` each entry.
If the dataset *isn't* found because it was deleted, it can be recreated from the CSV link used in the code.
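
The three dataset cells in this section all follow the same "look it up, otherwise create it" pattern. The sketch below shows that pattern in isolation with a hypothetical helper, `get_or_create`, and a plain dict standing in for `ws.datasets`; with the real SDK the creation branch would call `Dataset.Tabular.from_delimited_files(...)` followed by `register(...)` rather than assigning into a dict.

```python
def get_or_create(registry, key, factory):
    """Return (value, created): reuse registry[key] or build it with factory()."""
    if key in registry:
        return registry[key], False
    # Stand-in for "create and register"; the real workspace is not a writable dict.
    registry[key] = factory()
    return registry[key], True


# A dict plays the role of ws.datasets; the values are placeholders, not datasets.
registry = {"BankMarketing Dataset - Train": "existing-train"}

# List the known keys, as the note above suggests doing with ws.datasets.keys().
for name in registry.keys():
    print(name)

train, train_created = get_or_create(
    registry, "BankMarketing Dataset - Train", lambda: "new-train")
test, test_created = get_or_create(
    registry, "BankMarketing Dataset - Test", lambda: "new-test")

print(train, train_created)
print(test, test_created)
```

Passing the creation logic as a callable means it only runs when the key is missing, which mirrors the `if not found:` branches in the cells of this section.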

In [None]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
found = False
key = "BankMarketing Dataset - Train"
description_text = "Bank Marketing train DataSet for Udacity Course 2"

if key in ws.datasets.keys(): 
        found = True
        dataset_train = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
        dataset_train = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset_train = dataset_train.register(workspace=ws,
                                   name=key,
                                   description=description_text)


df_train = dataset_train.to_pandas_dataframe()
df_train.describe()

In [None]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
found = False
key = "BankMarketing Dataset - Test"
description_text = "Bank Marketing test DataSet for Udacity Course 2"

if key in ws.datasets.keys(): 
        found = True
        dataset_test = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv'
        dataset_test = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset_test = dataset_test.register(workspace=ws,
                                   name=key,
                                   description=description_text)


df_test = dataset_test.to_pandas_dataframe()
df_test.describe()

In [None]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
found = False
key = "BankMarketing Dataset - Valid"
description_text = "Bank Marketing valid DataSet for Udacity Course 2"

if key in ws.datasets.keys(): 
        found = True
        dataset_valid = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_validate.csv'
        dataset_valid = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset_valid = dataset_valid.register(workspace=ws,
                                   name=key,
                                   description=description_text)


df_valid = dataset_valid.to_pandas_dataframe()
df_valid.describe()

### 5. Train
This creates a general AutoML settings object. We will reference these configs in the AutoML pipeline step.
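
The settings dict in this section carries two timeout keys in different units (`experiment_timeout_minutes` and `experiment_timeout_hours`). Which of the two takes effect when both are present is not stated in this notebook, so a small check like the hypothetical `timeouts_in_minutes` helper below can at least make the mismatch visible before submitting. This is a sketch, not part of the SDK.

```python
def timeouts_in_minutes(settings):
    """Map every experiment timeout key in `settings` to its value in minutes."""
    factors = {
        "experiment_timeout_minutes": 1,
        "experiment_timeout_hours": 60,
    }
    return {key: settings[key] * factor
            for key, factor in factors.items()
            if key in settings}


# Same values as the settings dict used in this section.
automl_settings = {
    "experiment_timeout_minutes": 20,
    "experiment_timeout_hours": 1,
    "max_concurrent_iterations": 5,
    "primary_metric": "AUC_weighted",
}

timeouts = timeouts_in_minutes(automl_settings)
if len(set(timeouts.values())) > 1:
    print("Conflicting timeouts (in minutes):", timeouts)
```

Keeping only one of the two keys removes the ambiguity.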

In [None]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "experiment_timeout_hours":1,
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset_train,
                             validation_data = dataset_valid,
                             label_column_name="y",   
                             path = project_folder,
                             enable_early_stopping= True,
                             model_explainability=True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

#### Create Pipeline and AutoMLStep

You can define outputs for the AutoMLStep using TrainingOutput.

In [None]:
ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

In [None]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    enable_default_model_output=True,
    allow_reuse=True)

#### Build pipeline

In [None]:
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

#### Execute pipeline

In [None]:
pipeline_run = experiment.submit(pipeline)

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

In [None]:
pipeline_run.wait_for_completion()

### 6. Examine Results

#### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline.
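
Once the metrics file is parsed, a common next question is which child run scored best. The sketch below assumes the JSON maps each child run ID to a dict of metric names, each holding a list of values; that layout and the sample numbers are assumptions made for illustration, and `best_run` is a hypothetical helper.

```python
import json


def best_run(metrics_json, metric):
    """Return (run_id, score) for the run with the highest last value of `metric`."""
    metrics = json.loads(metrics_json)
    scores = {run_id: values[metric][-1]
              for run_id, values in metrics.items()
              if values.get(metric)}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


# Made-up sample in the assumed layout; real run IDs and scores will differ.
sample = json.dumps({
    "run_0": {"AUC_weighted": [0.93], "accuracy": [0.90]},
    "run_1": {"AUC_weighted": [0.95], "accuracy": [0.91]},
    "run_2": {"accuracy": [0.89]},  # a run that did not log AUC_weighted
})

winner, score = best_run(sample, "AUC_weighted")
print(winner, score)
```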

In [None]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

In [None]:
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

#### Retrieve the Best Model

In [None]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

In [None]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

In [None]:
best_model.steps

#### Test model performance on the holdout (test) set.

In [None]:
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)

In [None]:
from sklearn.metrics import confusion_matrix
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [None]:
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)
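
A confusion matrix is easier to interpret alongside a few derived numbers, especially when one class is much rarer than the other. The sketch below computes accuracy, precision and recall from two label lists using only the standard library. It assumes the label column `y` holds the strings `"yes"` and `"no"`; the sample labels are made up and `binary_report` is a hypothetical helper.

```python
def binary_report(y_true, y_pred, positive="yes"):
    """Count TP/FP/FN/TN and derive accuracy, precision and recall."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only.
y_true_demo = ["no", "no", "yes", "yes", "no", "yes"]
y_pred_demo = ["no", "yes", "yes", "no", "no", "yes"]

report = binary_report(y_true_demo, y_pred_demo)
print(report)
```

With the notebook's own data, `binary_report(list(y_test), list(ypred))` would give the same breakdown for the best model, under the same label assumption.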

### 7. Deploy the model as a Web Service on Azure Container Instance

As the next step, we deploy the model as an Azure Container Instance. To do so, we download the artifacts from the run (model, scoring script and conda environment) and use them as inputs for the deployment.

In [None]:
pipeline_steps = [step for step in pipeline_run.get_steps()]
automl_run = AutoMLRun(experiment = experiment, run_id=pipeline_steps[0].id)

In [None]:
automl_run.get_best_child().download_file('outputs/model.pkl', 'outputs/model.pkl')
automl_run.get_best_child().download_file('outputs/scoring_file_v_1_0_0.py', 'outputs/score_aml.py')
automl_run.get_best_child().download_file('automl_driver.py', 'outputs/automl_driver.py')
automl_run.get_best_child().download_file('outputs/conda_env_v_1_0_0.yml', 'outputs/conda_env.yml')

myenv = Environment.from_conda_specification(name="myenv", file_path='outputs/conda_env.yml')

In [None]:
# Tip: When model_path is set to a directory, you can use the child_paths parameter to include
#      only some of the files from the directory
model = Model.register(model_path = 'outputs/model.pkl',
                       model_name = automl_run.get_best_child().properties['model_name'],
                       description = "",
                       workspace = ws)

In [None]:
script_file_name = 'outputs/score_aml.py'

In [None]:
inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1.8, 
                                               memory_gb = 4, 
                                               auth_enabled=True,
                                               enable_app_insights = True,
                                               tags = {'area': "bmData", 'type': "automl_classification"}, 
                                               description = 'sample service for automl classification')

aci_service_name = 'automl-model-bankmarketing'
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig, overwrite=True)
aci_service.wait_for_deployment(True)
print(aci_service.state)

In [None]:
aci_service.get_logs()
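
Once the service reports a healthy state, it can be called over HTTP. The sketch below only builds the request body and headers and does not send anything. It assumes the generated scoring script expects a JSON object with a `data` list of records and that key-based authentication uses a `Bearer` header; the record shown is made up and incomplete, and `build_scoring_request` is a hypothetical helper.

```python
import json


def build_scoring_request(records, key):
    """Return (body, headers) for a POST to a key-protected scoring endpoint."""
    body = json.dumps({"data": records})  # assumed payload schema
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    return body, headers


# Illustrative, incomplete record; a real request needs every feature column.
records = [{"age": 35, "job": "technician", "marital": "married"}]

body, headers = build_scoring_request(records, key="dummy-key")
print(body)
print(headers)
```

With the deployed service, the URL and key would typically come from `aci_service.scoring_uri` and `aci_service.get_keys()`, and the request could be sent with `requests.post(uri, data=body, headers=headers)`.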

In [None]:
# optional cleanup

#aci_service.delete()

### 8. Publish and run from REST endpoint

Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.

Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.
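
To illustrate the "any HTTP library" point, the sketch below prepares the same kind of POST request with the standard library's `urllib` instead of `requests`. Nothing is sent; the endpoint URL and token are placeholders, and `build_pipeline_request` is a hypothetical helper. The `ExperimentName` payload key is the one used in the request cell of this section.

```python
import json
import urllib.request


def build_pipeline_request(endpoint, auth_header, experiment_name):
    """Prepare (but do not send) a POST that triggers a published pipeline."""
    payload = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    headers = dict(auth_header)  # copy so the caller's dict is left untouched
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        endpoint, data=payload, headers=headers, method="POST")


# Placeholder values; the real ones come from the published pipeline and login.
req = build_pipeline_request(
    "https://example.invalid/pipelines/placeholder-id",
    {"Authorization": "Bearer placeholder-token"},
    "pipeline-bankmarketing-rest-endpoint",
)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen(req)` would submit it.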


In [None]:
# Ensure that the workspace is once again available
ws = Workspace.from_config()
#print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

In [None]:
experiment_name = 'ml-experiment-1'
project_folder = './pipeline-project'

experiment = Experiment(ws, experiment_name)
experiment

In [None]:
from azureml.pipeline.core import PipelineRun

run_id = pipeline_run.id
pipeline_run_1 = PipelineRun(experiment, run_id)

In [None]:
published_pipeline = pipeline_run_1.publish_pipeline(
    name="Bankmarketing Train", description="Training bankmarketing pipeline", version="1.0")

published_pipeline

Authenticate once again to retrieve the `auth_header`, which is required to call the endpoint.

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header, and add a JSON payload object with the experiment name.

Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.

In [None]:
from azureml.pipeline.core import PublishedPipeline

published_pipeline = PublishedPipeline.get(workspace=ws, id="7543c045-f3ff-46b3-bc17-ca626b516909")

In [None]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-bankmarketing-rest-endpoint"}
                        )

In [None]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Use the run id to monitor the status of the new run. This will take another 10-15 min to run and will look similar to the previous pipeline run, so if you don't need to see another pipeline run, you can skip watching the full output.

In [None]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-bankmarketing-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

In [None]:
# Wait for the run that was triggered through the REST endpoint
published_pipeline_run.wait_for_completion()

### 9. Cleanup all resources

As the last step, we remove the deployed model and delete the provisioned compute cluster.
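
When several resources have to be removed, it helps if one failure does not stop the rest. The sketch below is a hypothetical best-effort helper, `cleanup_all`, demonstrated with stand-in objects; it relies only on each resource exposing a `delete()` method, as `aci_service` and `compute_target` do in this notebook.

```python
def cleanup_all(resources):
    """Call delete() on every resource; return {name: error message} for failures."""
    failures = {}
    for name, resource in resources.items():
        try:
            resource.delete()
        except Exception as exc:  # keep going so the remaining resources are tried
            failures[name] = str(exc)
    return failures


class FakeResource:
    """Stand-in for a deployed service or compute target (demo only)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.deleted = False

    def delete(self):
        if self.fail:
            raise RuntimeError("resource not found")
        self.deleted = True


service = FakeResource()
cluster = FakeResource(fail=True)

failures = cleanup_all({"aci_service": service, "compute_target": cluster})
print(failures)
```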

In [None]:
try:
    aci_service.delete()
    print("Deployed model deleted")
except Exception as exc:
    # Report the reason instead of silently swallowing every error
    print("Could not delete the deployed model:", exc)

In [None]:
try:
    compute_target.delete()
    print('Computetarget deleted')
except ComputeTargetException:
    print('Computetarget not found')