# Automated Machine Learning
_**Classification with Deployment using a Bank Marketing Dataset**_

### Optimizing an ML Pipeline

## Introduction

- We use the UCI Bank Marketing dataset to showcase how you can use AutoML for a classification problem and deploy the resulting model to an Azure Container Instance (ACI). The data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, and more than one contact with the same client was often required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
- The dataset contains 32950 training examples in a CSV file:
  https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv

### Creating a Workspace

- In the Microsoft Azure portal, create a new Machine Learning resource, specifying the subscription, resource group and workspace name.
- Use the Azure Machine Learning Python SDK to create or connect to a workspace from code. For example, the following code connects to the existing workspace quick-starts-ws-125934 (assuming the Azure ML SDK for Python is installed and you have access to a valid subscription) and starts an experiment in it:


In [1]:
from azureml.core import Workspace, Experiment

ws = Workspace.get(name="quick-starts-ws-125934")
exp = Experiment(workspace=ws, name="quick-starts-ws-125934")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: quick-starts-ws-125934
Azure region: southcentralus
Subscription id: 8ca4d354-cab8-4c39-9d61-ea227d2daaad
Resource group: aml-quickstarts-125934


### Create compute resources for your training experiments



To create a cluster, you need a compute configuration that specifies the type of machine to use and the scaling behavior. You then choose a cluster name that is unique within the workspace; this name is used to address the cluster later.

The cluster parameters are:
* vm_size - this describes the virtual machine type and size used in the cluster. All machines in the cluster are the same type. You can list the VM sizes available in your region with the CLI command:

```shell
az vm list-skus -o tsv
```
* min_nodes - this sets the minimum size of the cluster. If you set the minimum to 0, the cluster shuts down all nodes while not in use. Setting this to a value higher than 0 allows faster start-up times, but you are also billed while the cluster is not in use.
* max_nodes - this sets the maximum size of the cluster. Setting this to a larger number allows more concurrency and wider distribution of scale-out jobs.


To create a **CPU** cluster now, run the cell below. The autoscale settings mean that the cluster will scale down to 0 nodes when inactive and up to 4 nodes when busy.
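To make the effect of `min_nodes` and `max_nodes` concrete, here is a minimal sketch of the clamping idea in plain Python. It is a simplified model, not the service's actual scaling algorithm: the function name `target_node_count` is hypothetical, and it assumes one queued job occupies one node.

```python
def target_node_count(queued_jobs: int, min_nodes: int = 0, max_nodes: int = 4) -> int:
    """Clamp the number of nodes requested by queued jobs to the cluster limits.

    Simplifying assumption: each queued job needs exactly one node.
    """
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("require 0 <= min_nodes <= max_nodes")
    return max(min_nodes, min(queued_jobs, max_nodes))


# With the settings described above (0 to 4 nodes):
for jobs in (0, 2, 12):
    print(jobs, "queued ->", target_node_count(jobs), "nodes")
```

With `min_nodes=0` an idle cluster costs nothing but has to start nodes before the first job runs; raising `min_nodes` trades that start-up delay for idle billing.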

### Compute Target

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# Create the compute cluster, or reuse it if it already exists.
# The provisioning configuration uses vm_size = "Standard_D2_V2"
# and max_nodes no greater than 4.
# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    # Create the cluster with the specified name and configuration 
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    # Wait for the cluster to complete, show the output log
    cpu_cluster.wait_for_completion(show_output=True)
    print("Azure Machine Learning Compute cluster attached")

Found existing cluster, use it.


### Scikit-learn Pipeline with Hyperdrive
 - We build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model, and create a training script (`train.py`) for hyperparameter tuning.
 - Pipeline Architecture
 - Classification Algorithm
 - Parameter Sampler: random sampling, which supports both discrete and continuous hyperparameters.
 - Early Stopping Policy, and configuring and running a hyperdrive experiment: to prepare the hyperdrive experiment, you must use a HyperDriveConfig object to configure the experiment run, as shown in the following example code:

In [3]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform,choice
from azureml.core import ScriptRunConfig
from azureml.core import Environment
from azureml.core import Experiment
import os

# Specify parameter sampler
param_sampling = RandomParameterSampling( {
        "--C":uniform(1,100),
        "--max_iter":choice(15,30, 45,60,75,90,100)
    }
) 

# Specify a Policy
policy = BanditPolicy(evaluation_interval=2,slack_factor=0.2)

if "training" not in os.listdir():
    os.mkdir("./training")
# Create a SKLearn estimator for use with train.py
est = SKLearn(source_directory='.',
                    entry_script='train.py',
                    compute_target=cpu_cluster
                    )

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.

hyperdrive_config = HyperDriveConfig(estimator=est,
                              hyperparameter_sampling=param_sampling,
                              policy=policy,
                              primary_metric_name='Accuracy',
                              primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                              max_total_runs=12,
                              max_concurrent_runs=4)
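
As a rough illustration of what random sampling over this search space means, the sketch below draws 12 configurations with the Python standard library. It only mirrors the ranges defined above (`--C` uniform on 1 to 100, `--max_iter` from a fixed list); the helper names `sample_config` and `sample_runs` are hypothetical, and HyperDrive's own sampler is not implemented this way as far as this document shows.

```python
import random

# Discrete choices copied from the RandomParameterSampling definition above.
MAX_ITER_CHOICES = (15, 30, 45, 60, 75, 90, 100)


def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the search space."""
    return {
        "--C": rng.uniform(1, 100),                   # continuous hyperparameter
        "--max_iter": rng.choice(MAX_ITER_CHOICES),   # discrete hyperparameter
    }


def sample_runs(total_runs: int = 12, seed: int = 0) -> list:
    """Draw one independent configuration per run (max_total_runs=12 above)."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(total_runs)]


configs = sample_runs()
for i, cfg in enumerate(configs[:3]):
    print(i, cfg)
```

Because each configuration is drawn independently, the runs can be evaluated in parallel, which is what `max_concurrent_runs=4` takes advantage of.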

### Visualize experiment

In [4]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
from azureml.widgets import RunDetails
hyperdrive_run = exp.submit(hyperdrive_config,show_output=True )
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)




RunId: HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0
Web View: https://ml.azure.com/experiments/quick-starts-ws-125934/runs/HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0?wsid=/subscriptions/8ca4d354-cab8-4c39-9d61-ea227d2daaad/resourcegroups/aml-quickstarts-125934/workspaces/quick-starts-ws-125934

Streaming azureml-logs/hyperdrive.txt

[2020-11-13T12:28:51.546547][API][INFO]Experiment created
[2020-11-13T12:28:52.025189][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space
[2020-11-13T12:28:52.210285][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.
[2020-11-13T12:28:53.2649510Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.

Execution Summary
RunId: HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0
Web View: https://ml.azure.com/experiments/quick-starts-ws-125934/runs/HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0?wsid=/subsc

{'runId': 'HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-11-13T12:28:51.251109Z',
 'endTimeUtc': '2020-11-13T12:38:12.059986Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '32847538-0abf-4f48-9c1e-c5742e3816e9',
  'score': '0.9113808801213961',
  'best_child_run_id': 'HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0_5',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg125934.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=ip9vSFQ3QLfYS92J78vBIejiAScSpbU6Dfn5Fsh%2FIx8%3D&st=2020-11-13T12%3A28%3A13Z&se=2020-11-13T20%3A38%3A13Z&sp=r'}}
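
The `BanditPolicy(evaluation_interval=2, slack_factor=0.2)` configured earlier cancels a run when its best metric, scaled up by the slack factor, is still below the best metric seen so far. The sketch below shows that comparison for a maximize goal. It is an illustration only: the function name `bandit_should_cancel` is hypothetical, and the final `score` from the summary above is used as a stand-in for the best metric, whereas the real policy compares metrics reported at the same evaluation interval while runs are still in progress.

```python
def bandit_should_cancel(run_best: float, overall_best: float,
                         slack_factor: float = 0.2) -> bool:
    """Return True when a run falls outside the slack band (maximize goal)."""
    return run_best * (1 + slack_factor) < overall_best


best_accuracy = 0.9113808801213961   # 'score' from the run summary above
cutoff = best_accuracy / (1 + 0.2)   # accuracy below which a run would be cancelled

print(f"cutoff accuracy: {cutoff:.4f}")
print("run at 0.70 cancelled:", bandit_should_cancel(0.70, best_accuracy))
print("run at 0.89 cancelled:", bandit_should_cancel(0.89, best_accuracy))
```

With a slack factor of 0.2 the band is wide, so only runs far behind the leader are stopped early.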

### Retrieve the best run and save its model

In [5]:
# Get your best run and save the model from that run.
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run)
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric) 

Run(Experiment: quick-starts-ws-125934,
Id: HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0_5,
Type: azureml.scriptrun,
Status: Completed)
Regularization Strength: 61.06579587601447
Max iterations: 15
Accuracy 0.9113808801213961


In [6]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['Accuracy'])

Best Run Id:  HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0_5

 Accuracy: 0.9113808801213961


In [7]:
from azureml.core import Model

# Download the best run's model file, then register it in the workspace
best_run.download_file("outputs/model.joblib", "./outputs/model.joblib")
model = best_run.register_model(model_name='model',
                                model_path='outputs/model.joblib',
                                tags={'Training context': 'Hyperdrive'},
                                properties={'Accuracy': best_run_metrics['Accuracy']})

# List registered models with their tags and properties
# (a separate loop variable keeps `model` pointing at the model registered above)
for registered_model in Model.list(ws):
    print(registered_model.name, 'version:', registered_model.version)
    for tag_name, tag in registered_model.tags.items():
        print('\t', tag_name, ':', tag)
    for prop_name, prop in registered_model.properties.items():
        print('\t', prop_name, ':', prop)
    print('\n')

model version: 1
	 Training context : Hyperdrive
	 Accuracy : 0.9113808801213961




In [8]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
quick-starts-ws-125934,HD_95c96715-ae4f-4ce3-aa5f-4721ae142bb0_5,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Access data using TabularDataset Factory

In [9]:
from azureml.data.dataset_factory import TabularDatasetFactory
# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
ds = TabularDatasetFactory.from_delimited_files(path)

In [10]:
from train import clean_data

# Use the clean_data function from train.py to split the dataset into features (x) and labels (y).
x, y = clean_data(ds)

### Load data and prepare for training
The data loaded from "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv" and cleaned above is now split into training and test sets using train_test_split(). This function holds back part of the data so that, after training, the model can be tested on examples it has not seen.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

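# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=42)

# Concatenate x_train and y_train into a single training DataFrame
train_data = pd.concat([x_train, y_train], axis=1)
# Show the first 5 rows of the training data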
train_data.head()

Unnamed: 0,age,marital,default,housing,loan,month,day_of_week,duration,campaign,pdays,...,contact_telephone,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown,y
26465,52,1,0,1,0,8,3,13,14,999,...,0,0,0,0,1,0,0,0,0,0
1387,41,1,0,1,0,6,4,84,3,999,...,1,1,0,0,0,0,0,0,0,0
22628,27,0,0,0,0,7,1,331,2,999,...,0,0,0,0,0,0,1,0,0,0
8336,58,1,0,1,0,6,2,137,3,999,...,1,1,0,0,0,0,0,0,0,0
19898,36,0,0,1,0,5,1,143,4,999,...,1,1,0,0,0,0,0,0,0,0
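
The 80/20 split above can be pictured without scikit-learn: shuffle the row indices with a fixed seed, then cut the list in two. The sketch below is a standard-library stand-in, not `train_test_split` itself, so the shuffled order differs from scikit-learn's even with the same seed. The helper name `split_indices` is hypothetical, and the row count of 32950 comes from the dataset description in the introduction.

```python
import random


def split_indices(n_rows: int, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle row indices and cut them into disjoint train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # fixed seed makes the split reproducible
    n_train = round(n_rows * train_fraction)    # number of rows kept for training
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(32950)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

The training share works out to 26360 rows, which matches the "Number of samples in the training data" that the AutoML data guardrails report later in this document.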


In [12]:
# Save training data to a csv file 
train_data.to_csv("./training/bankmarket_training.csv")

### Configure experiment

- Configure a classification experiment with accuracy as the primary metric, `train_data` as the training data, and the experiment timeout set to 30 minutes. If you wish to run the experiment longer, you will need to run this notebook in your own Azure tenant, which will incur personal costs.

In [13]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(
                            experiment_timeout_minutes=30,
                            task="classification",
                            primary_metric="accuracy",
                            training_data=train_data,
                            label_column_name="y",
                            n_cross_validations=2,
                            max_concurrent_iterations=4,
                            featurization='auto')
print("Ready for Auto ML run.")

Ready for Auto ML run.


### Submit training experiment
- Submit the configuration so that AutoML automatically selects the algorithm and hyperparameters and trains the model.

In [14]:
# Submit your automl run
print('Submitting AutoMLConfig experiment...')
automl_run = exp.submit(automl_config, show_output=True)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

Submitting AutoMLConfig experiment...
Running on local machine
Parent Run ID: AutoML_d77d62f2-6720-42b7-9f1a-3f67a5381e7f

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+---------------------------------+---------------------------------+--------------------------------------+
|Size of the smallest class       |Name/Label of the smallest class |Number of samples in the training data|
|2951                             |1                                |26360                                 |
+---------------------------------+---------------------------------+--------------------------------------+

********************************************

{'runId': 'AutoML_d77d62f2-6720-42b7-9f1a-3f67a5381e7f',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2020-11-13T12:43:14.469986Z',
 'endTimeUtc': '2020-11-13T13:14:57.02686Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '2',
  'target': 'local',
  'DataPrepJsonString': None,
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.18.0", "azureml-train": "1.18.0", "azureml-train-restclients-hyperdrive": "1.18.0", "azureml-train-core": "1.18.0", "azureml-train-automl": "1.18.0", "azureml-train-automl-runtime": "1.18.0", "azureml-train-automl-client": "1.18.0", "azureml-tensorboard": "1.18.0", "azureml-telemetry": "1.18.0", "azureml-sdk": "1.18.0", "azureml-samples": "0+unknown
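
The class balancing alert above matters because accuracy is the primary metric. A quick calculation with the two numbers from the guardrail table shows how high accuracy can get without learning anything. This is a back-of-the-envelope sketch that assumes a binary label ('yes'/'no', as described in the introduction); the variable names are illustrative.

```python
smallest_class = 2951    # "Size of the smallest class" from the guardrail table
training_rows = 26360    # "Number of samples in the training data"

minority_share = smallest_class / training_rows
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = 1 - minority_share

print(f"minority share:    {minority_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

A model that always predicts the majority class would already score roughly 0.89, so the accuracies of about 0.91 to 0.92 reported in this document should be read against that baseline rather than against 0.5.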

### Retrieve the Best Model
- Retrieve the best run and the fitted model it produced from the AutoML run. Make sure the AutoML run has completed before retrieving them.

In [15]:
# Retrieve the best AutoML run and the fitted model it produced
best_run, fitted_model = automl_run.get_output()
print(best_run)
best_run_metrics = best_run.get_metrics()
print(fitted_model)
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric) 


Run(Experiment: quick-starts-ws-125934,
Id: AutoML_d77d62f2-6720-42b7-9f1a-3f67a5381e7f_34,
Type: None,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                    min_samples_split=0.29105263157894734,
                                                                                                    min_weight_fraction_leaf=0.0,


In [23]:
# Register the AutoML model
from azureml.core import Model
best_run.register_model(model_path='outputs/model.pkl', model_name='automl_model',
                        tags={'Training context':'Auto ML'},
                        properties={'Accuracy': best_run_metrics['accuracy']})

Model(workspace=Workspace.create(name='quick-starts-ws-125934', subscription_id='8ca4d354-cab8-4c39-9d61-ea227d2daaad', resource_group='aml-quickstarts-125934'), name=automl_model, id=automl_model:1, version=1, tags={'Training context': 'Auto ML'}, properties={'Accuracy': '0.9176403641881639'})

In [35]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
quick-starts-ws-125934,AutoML_d77d62f2-6720-42b7-9f1a-3f67a5381e7f_34,,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Viewing Registered Models

In [25]:
# List registered models with their tags and properties
# (call list() on the Model class rather than on a previously registered instance)
for registered_model in Model.list(ws):
    print(registered_model.name, 'version:', registered_model.version)
    for tag_name, tag in registered_model.tags.items():
        print('\t', tag_name, ':', tag)
    for prop_name, prop in registered_model.properties.items():
        print('\t', prop_name, ':', prop)
    print('\n')

automl_model version: 1
	 Training context : Auto ML
	 Accuracy : 0.9176403641881639


model version: 1
	 Training context : Hyperdrive
	 Accuracy : 0.9113808801213961
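
To compare the two registered models programmatically, the sketch below picks the one with the higher `Accuracy` property. It works on plain dictionaries that copy the listing above, so it runs without a workspace; the `registered` list and the `best_by_accuracy` helper are hypothetical. Property values are kept as strings because the registration output earlier shows them stored that way.

```python
# Values copied from the registered-model listing above.
registered = [
    {"name": "automl_model", "version": 1,
     "properties": {"Accuracy": "0.9176403641881639"}},
    {"name": "model", "version": 1,
     "properties": {"Accuracy": "0.9113808801213961"}},
]


def best_by_accuracy(models):
    """Return the entry with the highest Accuracy, converting strings to floats."""
    return max(models, key=lambda m: float(m["properties"]["Accuracy"]))


best = best_by_accuracy(registered)
accuracies = [float(m["properties"]["Accuracy"]) for m in registered]
gap = max(accuracies) - min(accuracies)

print("best model:", best["name"])
print(f"accuracy gap: {gap:.4f}")
```

The gap is small, and the two accuracies were not necessarily measured on the same evaluation data (the AutoML run used 2-fold cross-validation), so the comparison is indicative rather than conclusive.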




In [36]:
# Clean up the compute cluster once the training process is complete
cpu_cluster.delete()