# Automated Machine Learning
_**Classification with Deployment using a Bank Marketing Dataset**_

### Optimizing an ML Pipeline

## Introduction

- We use the UCI Bank Marketing dataset to show how to use AutoML for a classification problem and deploy the resulting model to an Azure Container Instance (ACI). The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
- The dataset contains 32950 training examples in a CSV file:
  https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv

### Creating a Workspace

- In the Microsoft Azure portal, create a new Machine Learning resource, specifying the subscription, resource group, and workspace name.
- Alternatively, use the Azure Machine Learning Python SDK to create or load a workspace. The following code loads an existing workspace named `udacity-project` and creates an experiment in it (assuming the Azure ML SDK for Python is installed and a valid workspace configuration is available):


In [None]:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()

ws = Workspace.get(name="udacity-project")
exp = Experiment(workspace=ws, name="udacity-project")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()
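
If the workspace does not exist yet, creating it from the SDK might look like the sketch below. The resource group name, region, and the helper names `workspace_settings` and `create_workspace` are assumptions made for illustration; the subscription ID is a placeholder that you would replace with your own. The Azure call is wrapped in a function with a lazy import so the settings can be inspected without the SDK installed.

```python
def workspace_settings(subscription_id, name="aml-workspace",
                       resource_group="aml-resources", location="eastus"):
    """Collect the keyword arguments that would be passed to Workspace.create()."""
    return {
        "name": name,
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "create_resource_group": True,  # create the resource group if it is missing
        "location": location,
    }


def create_workspace(settings):
    """Create the workspace and save its details for Workspace.from_config()."""
    # Imported lazily so this sketch loads even when the SDK is not installed.
    from azureml.core import Workspace

    ws = Workspace.create(**settings)
    ws.write_config()  # writes .azureml/config.json under the current directory
    return ws


settings = workspace_settings(subscription_id="<your-subscription-id>")
print(settings["name"], settings["location"])
# ws = create_workspace(settings)  # uncomment to run against a real subscription
```

Once `write_config()` has run, later sessions can reload the same workspace with `Workspace.from_config()`.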

### Create compute resources for your training experiments



To create a cluster, you specify a compute configuration that defines the type of machine to use and the scaling behavior. You then choose a cluster name that is unique within the workspace; this name is used to address the cluster later.

The cluster parameters are:
* vm_size - this describes the virtual machine type and size used in the cluster. All machines in the cluster are the same type. You can get the list of VM sizes available in your region by using the CLI command:

```shell
az vm list-skus -o tsv
```
* min_nodes - this sets the minimum size of the cluster.  If you set the minimum to 0 the cluster will shut down all nodes while not in use.  Setting this number to a value higher than 0 will allow for faster start-up times, but you will also be billed when the cluster is not in use.
* max_nodes - this sets the maximum size of the cluster. Setting this to a larger number allows for more concurrency and more distributed processing of scale-out jobs.
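
The effect of `min_nodes` and `max_nodes` can be pictured with a small sketch. It assumes, purely for illustration, that each queued job asks for one node. The helper `target_node_count` is hypothetical and does not reproduce the actual Azure scheduler.

```python
def target_node_count(pending_jobs, min_nodes=0, max_nodes=4):
    """Clamp the number of requested nodes to the cluster's configured range."""
    return max(min_nodes, min(pending_jobs, max_nodes))


# An idle cluster, a lightly loaded one, and one with more work than nodes.
for jobs in (0, 2, 12):
    print(jobs, "queued jobs ->", target_node_count(jobs), "nodes")
```

With `min_nodes=0` the idle cluster costs nothing but needs time to start its first node; with `min_nodes=1` one node stays allocated (and billed) even when nothing is queued.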


To create a **CPU** cluster now, run the cell below. The autoscale settings mean that the cluster will scale down to 0 nodes when inactive and up to 4 nodes when busy.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# Create the compute cluster.
# The provisioning configuration uses vm_size "STANDARD_D2_V2" and at most 4 nodes.
# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    # Create the cluster with the specified name and configuration 
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    # Wait for the cluster to complete, show the output log
    cpu_cluster.wait_for_completion(show_output=True)
    print("Azure Machine Learning Compute cluster attached")

### Scikit-learn Pipeline with Hyperdrive
 - We build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model, and create a training script (`train.py`) for hyperparameter tuning.
 - Pipeline architecture
 - Classification algorithm
 - Parameter sampler: random sampling supports both discrete and continuous hyperparameters.
 - Early stopping policy
 - Configuring and running a HyperDrive experiment: to prepare the experiment, use a `HyperDriveConfig` object to configure the run, as shown in the following example code:

In [None]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform,choice
from azureml.core import ScriptRunConfig
from azureml.core import Environment
from azureml.core import Experiment
import os

# Specify parameter sampler
param_sampling = RandomParameterSampling( {
        "--C":uniform(1,100),
        "--max_iter":choice(15,30, 45,60,75,90,100)
    }
) 

# Specify a Policy
policy = BanditPolicy(evaluation_interval=2,slack_factor=0.2)

if "training" not in os.listdir():
    os.mkdir("./training")
# Create a SKLearn estimator for use with train.py
est = SKLearn(source_directory='.',
                    entry_script='train.py',
                    compute_target=cpu_cluster
                    )

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.

hyperdrive_config = HyperDriveConfig(estimator=est,
                              hyperparameter_sampling=param_sampling,
                              policy=policy,
                              primary_metric_name='Accuracy',
                              primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                              max_total_runs=12,
                              max_concurrent_runs=4)
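
To make the sampling step concrete, here is a plain-Python sketch of what drawing from this search space amounts to. It imitates `uniform(1, 100)` and `choice(...)` with the standard `random` module. The tuple-based `search_space` and the function `sample_configuration` are illustrative names, not part of the Azure ML SDK.

```python
import random

# Same ranges as the RandomParameterSampling configuration above.
search_space = {
    "--C": ("uniform", 1, 100),
    "--max_iter": ("choice", [15, 30, 45, 60, 75, 90, 100]),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, independently of the others."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "uniform":
            _, low, high = spec
            config[name] = rng.uniform(low, high)  # continuous value
        elif kind == "choice":
            config[name] = rng.choice(spec[1])     # discrete value
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [sample_configuration(search_space, rng) for _ in range(12)]
for trial in trials[:3]:
    print(trial)
```

The twelve draws correspond to `max_total_runs=12`: each child run receives one such configuration as command-line arguments to `train.py`.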

### Visualize experiment

In [None]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
from azureml.widgets import RunDetails
hyperdrive_run = exp.submit(hyperdrive_config, show_output=True)
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)
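
While the run is in progress, the `BanditPolicy` configured earlier decides which child runs to stop early. For a metric that is maximized, the documented rule is that a run is cancelled when its best metric falls below `best / (1 + slack_factor)`. The sketch below applies that rule to made-up accuracy values; the function names and the sample metrics are assumptions for illustration, and the real service applies the check at every `evaluation_interval`.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric a run may report without being cancelled (maximize goal)."""
    return best_metric / (1 + slack_factor)


def runs_to_cancel(metrics_by_run, slack_factor=0.2):
    """Return the runs whose metric falls below the slack-adjusted best metric."""
    best = max(metrics_by_run.values())
    cutoff = bandit_cutoff(best, slack_factor)
    return sorted(run for run, metric in metrics_by_run.items() if metric < cutoff)


# Hypothetical accuracies reported by three child runs at one evaluation interval.
interval_metrics = {"run_1": 0.90, "run_2": 0.80, "run_3": 0.70}
print(round(bandit_cutoff(0.90, 0.2), 3))
print(runs_to_cancel(interval_metrics))
```

With `slack_factor=0.2` and a best accuracy of 0.90, the cutoff is 0.75, so only `run_3` would be stopped.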

### Get the best run and save its model

In [None]:
# Get your best run and save the model from that run.
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run)
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric) 

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['Accuracy'])

In [None]:
from azureml.core import Model
# Download the model file from the best run, then register it
best_run.download_file("outputs/model.joblib", "./outputs/model.joblib")
model = best_run.register_model(model_name='model',
                                model_path='outputs/model.joblib',
                                tags={'Training context': 'Hyperdrive'},
                                properties={'Accuracy': best_run_metrics['Accuracy']})
# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

In [None]:
best_run

### Access data using TabularDatasetFactory

In [None]:
from azureml.data.dataset_factory import TabularDatasetFactory
# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
ds = TabularDatasetFactory.from_delimited_files(path)

In [None]:
from train import clean_data

# Use the clean_data function from train.py to clean the dataset.
x, y = clean_data(ds)

### Load data and prepare for training
Load the data from https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv and split it into training and test sets using `train_test_split()`. The split holds back data that the model has not seen during training so that it can be used for testing afterwards.
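
The idea behind an 80/20 split can be sketched with the standard library alone. This is not scikit-learn's implementation (the shuffle order will differ from `train_test_split`); `split_rows` is a hypothetical helper that only shows how a seeded shuffle followed by a cut yields two disjoint sets.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten dataset rows
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42`: repeated runs produce the same split, so results stay comparable.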

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=42)

# Concatenate x_train and y_train into a single training DataFrame
train_data = pd.concat([x_train, y_train], axis=1)
# Show the first 5 rows of the dataset
train_data.head()

In [None]:
# Save training data to a csv file 
train_data.to_csv("./training/bankmarket_training.csv")

### Configure experiment

- Configure a classification experiment that uses `train_data` as the training data and accuracy as the primary metric, with `experiment_timeout_minutes` set to 30. If you wish to run the experiment longer, you will need to run this notebook in your own Azure tenant, which will incur personal costs.

In [None]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task="classification",
    primary_metric="accuracy",
    training_data=train_data,
    label_column_name="y",
    n_cross_validations=2,
    max_concurrent_iterations=4,
    featurization='auto',
    # Needed so the ONNX version of the best model can be retrieved later
    enable_onnx_compatible_models=True)
print("Ready for Auto ML run.")
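
The setting `n_cross_validations=2` means each candidate model is validated on two folds. The sketch below shows the basic idea with contiguous folds over row indices. AutoML may shuffle or stratify the rows differently, so treat `k_fold_indices` as a hypothetical helper that only illustrates how every row ends up in a validation set exactly once.

```python
def k_fold_indices(n_rows, n_folds=2):
    """Return (training, validation) index lists, one pair per fold."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_rows) if i < start or i >= stop]
        folds.append((training, validation))
        start = stop
    return folds


for training, validation in k_fold_indices(6, n_folds=2):
    print("train:", training, "validate:", validation)
```

The reported metric for a model is then an aggregate over the folds, which is less sensitive to a lucky or unlucky single split.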

### Submit training experiment
- Submit the configuration to automatically select the algorithm and hyperparameters and train the model.

In [None]:
# Submit your automl run
print('Submitting AutoMLConfig experiment...')
automl_run = exp.submit(automl_config, show_output=True)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)
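
Conceptually, the run scores a number of candidate models on validation data and keeps the one with the best primary metric. The toy sketch below mimics only that final comparison. The two rule-based "models", the `duration` values, and the labels are invented for illustration and say nothing about which algorithms AutoML actually tries.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def always_no(row):
    """Baseline candidate: predict that nobody subscribes."""
    return "no"


def long_call_yes(row):
    """Candidate that predicts 'yes' for calls longer than 300 seconds."""
    return "yes" if row["duration"] > 300 else "no"


candidates = {"always_no": always_no, "long_call_yes": long_call_yes}

# Made-up validation rows and labels, used only to exercise the comparison.
validation_rows = [{"duration": 120}, {"duration": 450},
                   {"duration": 90}, {"duration": 600}]
validation_labels = ["no", "yes", "no", "yes"]

scores = {name: accuracy([model(row) for row in validation_rows], validation_labels)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "best:", best_name)
```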

### Retrieve the Best Model's explanation
- Retrieve the best run and its fitted model from the AutoML run, then register the model. The best run also includes explanations for engineered features and raw features; make sure that the run generating explanations for the best model has completed before retrieving them.

In [None]:
# Retrieve the best AutoML run and the model it produced, then register the model.
from azureml.core import Model

best_run, fitted_model = automl_run.get_output()
print(best_run)
print(fitted_model)

best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)

# Register the model. AutoML normally saves the fitted model as outputs/model.pkl;
# adjust model_path if your run stores it under a different name.
model = best_run.register_model(model_path='outputs/model.pkl', model_name='automl_model',
                                tags={'Training context': 'Auto ML'},
                                properties={'Accuracy': best_run_metrics['accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print('\t', tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print('\t', prop_name, ':', prop)
    print('\n')

In [None]:
# Retrieve the best AutoML model in ONNX format.
# This requires enable_onnx_compatible_models=True in the AutoMLConfig.
best_run, onnx_mdl = automl_run.get_output(return_onnx_model=True)

In [None]:
from azureml.automl.runtime.onnx_convert import OnnxConverter
onnx_fl_path = "./training/best_model.onnx"
OnnxConverter.save_onnx_model(onnx_mdl, onnx_fl_path)