# Udacity - Azure ML Engineer Nanodegree - Project 1: Optimizing an ML Pipeline in Azure





## 1. Set up Azure env

We store the Azure subscription ID in a `.env` file in the project folder. Do not commit or push the `.env` file to any version control system. The workspace must already exist, so make sure the resource group and the workspace are defined beforehand.
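The `.env` approach can be sketched as follows. This is a minimal example, not part of the original notebook: `require_env` is a hypothetical helper, and the only assumption taken from the notebook is the variable name `AZURE_SUBSCRIPTION_ID`. It fails early with a readable message instead of passing `None` to `Workspace.get`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Example .env content (keep this file out of version control):
#   AZURE_SUBSCRIPTION_ID=<your-subscription-id>
```

After `load_dotenv()` has run, the helper could be used as `subscription_id = require_env("AZURE_SUBSCRIPTION_ID")`.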

In [1]:
from azureml.core import Workspace, Experiment, Datastore, ScriptRunConfig
from azureml.data.data_reference import DataReference
from azureml.core.model import Model
import azureml.core

from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, quniform
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

import os
import joblib
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from pathlib import Path
load_dotenv()

True

In [4]:
ws = Workspace.get(name="udacity", subscription_id=os.getenv('AZURE_SUBSCRIPTION_ID'))

### 1.a. - Optional - make sure the AutoML model can be executed locally

Uncomment and run the cell below to install `xgboost` and the Azure AutoML runtime locally.

In [None]:
#!poetry run pip install xgboost==0.90 azureml-train-automl-runtime 

## 2. Create compute cluster

We use `vm_size="STANDARD_D2_V2"` and set `max_nodes=4`, since the project instructions allow at most 4 nodes.

In [5]:
# Choose a name for your CPU cluster
amlcompute_cluster_name = "cpu-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    aml_compute = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    aml_compute = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

aml_compute.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## 3. Hyperdrive run logistic regression model

This step uses a scikit-learn based pipeline that minimally preprocesses the data and fits a logistic regression model to solve our binary classification problem. We use Azure HyperDrive to optimize two hyperparameters: `C`, the inverse regularization strength, and `max_iter`, the maximum number of iterations allowed for the solver to converge.
To terminate poorly performing runs early we use the `BanditPolicy`; details can be found [here](https://docs.microsoft.com/hu-hu/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy?view=azure-ml-py).

As a next step after execution, we retrieve the optimized model object, save it locally, and evaluate it on our holdout (test) dataset.
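To make the early-termination rule more concrete, here is a simplified sketch of the slack-factor check as it is described in the Azure ML documentation for a metric that is maximized: a run is cancelled when its best metric falls below `best_metric / (1 + slack_factor)`. The function name `bandit_should_terminate` and the sample accuracies are illustrative assumptions; this is not the service's actual implementation.

```python
def bandit_should_terminate(run_metric: float, best_metric: float, slack_factor: float) -> bool:
    """Simplified Bandit check for a primary metric that is being maximized."""
    # Runs below this threshold are considered too far behind the best run.
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


best = 0.91  # illustrative accuracy of the best run so far
for accuracy in (0.90, 0.85, 0.80):
    print(accuracy, bandit_should_terminate(accuracy, best, slack_factor=0.1))
```

With `slack_factor=0.1` the threshold is roughly 0.827, so only the run with accuracy 0.80 would be cancelled. The `evaluation_interval=2` setting used in this notebook controls how often the policy is applied, which the sketch does not model.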

In [42]:
exp = Experiment(workspace=ws, name="udacity-hyperdrive")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: udacity
Azure region: westeurope
Resource group: udacity


In [43]:
# Specify parameter sampler
ps = RandomParameterSampling({'--C': uniform(0.1, 1),
                              '--max_iter': quniform(100, 1500, 100),})
# Specify a Policy
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

if "training" not in os.listdir():
    os.mkdir("./training")

# Create an SKLearn estimator for use with logit_train.py
est = SKLearn("./scripts",
              compute_target=aml_compute,
              entry_script="logit_train.py")

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(
    estimator=est,
    hyperparameter_sampling=ps,
    policy=policy,
    primary_metric_name='Accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=25,
    max_concurrent_runs=4,
)
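The search space above can be illustrated with a small local simulation. This is a sketch only: `sample_uniform` and `sample_quniform` are hypothetical helpers that mimic the documented behaviour of the `uniform(low, high)` and `quniform(low, high, q)` expressions (the latter is described as `round(uniform(low, high) / q) * q`); they are not the SDK's sampler.

```python
import random


def sample_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_quniform(low: float, high: float, q: float, rng: random.Random) -> float:
    """Draw a uniform value and snap it to the nearest multiple of q."""
    return round(rng.uniform(low, high) / q) * q


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [
    {
        "--C": sample_uniform(0.1, 1, rng),
        "--max_iter": sample_quniform(100, 1500, 100, rng),
    }
    for _ in range(5)
]
for sample in samples:
    print(sample)
```

Each printed dictionary corresponds to one candidate configuration: `--C` is a continuous value in [0.1, 1], while `--max_iter` is always a multiple of 100 between 100 and 1500.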



In [44]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
run_hyperdrive = exp.submit(config=hyperdrive_config)



In [45]:
RunDetails(run_hyperdrive).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [49]:
# Get your best run and save the model from that run.

best_run = run_hyperdrive.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])

['--C', '0.8354666550884321', '--max_iter', '300']
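The argument list printed above is a flat sequence of flags and values. A small helper can turn it into a dictionary that is easier to log or reuse. `arguments_to_dict` is a hypothetical helper, and the sketch assumes the list contains only `--flag value` pairs with numeric values, as in the output above.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def arguments_to_dict(arguments: List[str]) -> Dict[str, Number]:
    """Turn ['--C', '0.83', '--max_iter', '300'] into {'C': 0.83, 'max_iter': 300}."""
    params: Dict[str, Number] = {}
    # Walk the list two items at a time: first the flag, then its value.
    for flag, raw in zip(arguments[0::2], arguments[1::2]):
        value = float(raw)
        params[flag.lstrip("-")] = int(value) if value.is_integer() else value
    return params


best_args = ['--C', '0.8354666550884321', '--max_iter', '300']
print(arguments_to_dict(best_args))
```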


In [50]:
model = best_run.register_model(model_name='bankmarketing-logit', model_path='outputs/bankmarketing-logit-model.joblib')
model.download(target_dir="outputs", exist_ok=True)

'outputs/bankmarketing-logit-model.joblib'

In [2]:
# Evaluation of model perf on our holdout-set.

from scripts.logit_train import clean_data
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


factory = TabularDatasetFactory()
test_data_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv"
test_ds = factory.from_delimited_files(test_data_path)
X_test, y_test = clean_data(test_ds)

logit_model = joblib.load('outputs/bankmarketing-logit-model.joblib')

# Predict once and reuse the predictions for both reports
y_pred = logit_model.predict(X_test)
print(logit_model.score(X_test, y_test))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

0.9111650485436893
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      3636
           1       0.71      0.41      0.52       484

    accuracy                           0.91      4120
   macro avg       0.82      0.69      0.73      4120
weighted avg       0.90      0.91      0.90      4120

[[3557   79]
 [ 287  197]]
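The figures in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, i.e. `[[TN, FP], [FN, TP]]`; `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


logit_cm = [[3557, 79], [287, 197]]  # confusion matrix printed above
for name, value in binary_metrics(logit_cm).items():
    print(f"{name}: {value:.2f}")
```

The result reproduces the row for class `1` in the report: precision 0.71, recall 0.41 and F1 0.52, with an overall accuracy of 0.91. The low recall shows that the model misses more than half of the positive cases even though its accuracy looks high.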




## 4. Hyperdrive run lightgbm model
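As in the previous step, we set up an experiment, this time based on LightGBM. Since it has many more hyperparameters that we would like to optimize, we set the total number of runs somewhat higher than in the previous example. We use Bayesian parameter sampling here; it does not support early termination policies, which is why the policy is commented out below.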

In [52]:
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import BayesianParameterSampling
from azureml.train.hyperdrive.parameter_expressions import uniform, quniform, loguniform


# Specify parameter sampler

ps_bayes = BayesianParameterSampling({'--learning_rate': uniform(0.001, 0.3),
                                  '--max_depth': quniform(3, 15, 1),
                                  '--num_leaves': quniform(50, 500, 10),
                                  '--min_data_in_leaf': quniform(1, 25, 1),
                                  '--num_iterations': quniform(100, 1500, 50), })
# Specify a Policy
#policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

if "training" not in os.listdir():
    os.mkdir("./training")

# Create a generic Estimator for use with lgbm_train.py
# ('scikit-learn' is the PyPI package name; 'sklearn' is a deprecated alias)
est_lgbm = Estimator("./scripts",
              compute_target=aml_compute,
              entry_script="lgbm_train.py",
              pip_packages=['lightgbm', 'scikit-learn'])

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hyperdrive_config_lgbm = HyperDriveConfig(
    estimator=est_lgbm,
    hyperparameter_sampling=ps_bayes,
    # policy=policy,
    primary_metric_name='Accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=100,
    max_concurrent_runs=4,
)



In [53]:
run_hyperdrive = exp.submit(config=hyperdrive_config_lgbm)



In [54]:
from azureml.widgets import RunDetails

RunDetails(run_hyperdrive).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [57]:
best_run_lgbm = run_hyperdrive.get_best_run_by_primary_metric()
print(best_run_lgbm.get_details()['runDefinition']['arguments'])

['--learning_rate', '0.05295908360909255', '--max_depth', '15', '--num_leaves', '50', '--min_data_in_leaf', '15', '--num_iterations', '200']
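HyperDrive passes the sampled values to the entry script as command-line arguments, as the list above shows. The script `lgbm_train.py` is not included here, so the following is only a hypothetical sketch of how such a script might parse them with `argparse`. The function names and the default values are assumptions for illustration.

```python
import argparse


def int_from_number(raw: str) -> int:
    """Accept '15' as well as '15.0', in case a quantized value arrives as a float."""
    return int(float(raw))


def parse_lgbm_args(argv=None):
    """Parse the hyperparameters that the sampler in this section defines."""
    parser = argparse.ArgumentParser(description="Hypothetical LightGBM training arguments")
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--max_depth", type=int_from_number, default=-1)
    parser.add_argument("--num_leaves", type=int_from_number, default=31)
    parser.add_argument("--min_data_in_leaf", type=int_from_number, default=20)
    parser.add_argument("--num_iterations", type=int_from_number, default=100)
    return parser.parse_args(argv)


best_args = ['--learning_rate', '0.05295908360909255', '--max_depth', '15',
             '--num_leaves', '50', '--min_data_in_leaf', '15', '--num_iterations', '200']
args = parse_lgbm_args(best_args)
print(vars(args))
```

Passing the argument list explicitly, as done here, makes the parser easy to try out in a notebook; inside a training script, calling `parse_lgbm_args()` without arguments would read `sys.argv` instead.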


In [58]:
lgbm_model = best_run_lgbm.register_model(model_name='bankmarketing-lgbm', model_path='outputs/bankmarketing-lgbm-model.joblib')
lgbm_model.download(target_dir="outputs", exist_ok=True)

'outputs/bankmarketing-lgbm-model.joblib'

In [3]:
from scripts.logit_train import clean_data
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score



factory = TabularDatasetFactory()

test_data_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv"
test_ds = factory.from_delimited_files(test_data_path)
X_test, y_test = clean_data(test_ds)

lgbm_model = joblib.load('outputs/bankmarketing-lgbm-model.joblib')

# Round the model output to 0/1 class labels once and reuse the result
y_pred = lgbm_model.predict(X_test).round(0).astype(int)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

0.9174757281553398
              precision    recall  f1-score   support

           0       0.94      0.97      0.95      3636
           1       0.68      0.55      0.61       484

    accuracy                           0.92      4120
   macro avg       0.81      0.76      0.78      4120
weighted avg       0.91      0.92      0.91      4120

[[3512  124]
 [ 216  268]]
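The evaluation cell above rounds the model output before scoring, which suggests that `predict` returns scores between 0 and 1 rather than class labels (an assumption, since the training script is not shown). Rounding amounts to a decision threshold of 0.5. The sketch below makes that threshold explicit; `to_labels` and the sample scores are hypothetical.

```python
def to_labels(scores, threshold=0.5):
    """Convert scores in [0, 1] to 0/1 labels using an explicit threshold."""
    return [int(score >= threshold) for score in scores]


scores = [0.03, 0.49, 0.5, 0.51, 0.97]  # made-up scores for illustration

print(to_labels(scores))                 # explicit threshold at 0.5
print([round(score) for score in scores])  # rounding; note the value at exactly 0.5
print(to_labels(scores, threshold=0.3))  # a lower threshold flags more positives
```

The two approaches differ only at exactly 0.5, because Python's `round` (like NumPy's) rounds halves to the nearest even number, so 0.5 becomes 0. An explicit threshold also makes it possible to trade precision for recall on the minority class, which may matter for this imbalanced dataset.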


Note for running LightGBM locally on macOS: if you install LightGBM from PyPI via ``pip install lightgbm``, you do not need to install the gcc compiler.
Instead, you need to install the OpenMP library, which is required for running LightGBM on systems that use the Apple Clang compiler.
You can install the OpenMP library with the following command: ``brew install libomp``.


## 5. AutoML pipeline

For the AutoML pipeline we define a new experiment and make the datasets available for the run. Next, we define the AutoML config and execute it on the compute cluster set up earlier. Once the experiment has finished, we download the resulting pickled model object and evaluate it locally on the test set, as in the previous steps.

In [6]:
exp_automl = Experiment(workspace=ws, name="udacity-automl")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: udacity
Azure region: westeurope
Resource group: udacity


In [8]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

datastore = ws.get_default_datastore()
factory = TabularDatasetFactory()
data_path_train = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
data_path_valid = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_validate.csv"
data_path_test = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv"


ds_train = factory.from_delimited_files(data_path_train)
ds_valid = factory.from_delimited_files(data_path_valid)
ds_test = factory.from_delimited_files(data_path_test)

In [9]:
import logging
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.

label="y"

automl_settings = {
    "enable_early_stopping" : True,
    "iteration_timeout_minutes": 5,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "primary_metric": 'accuracy',
    "featurization": 'auto',
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(experiment_timeout_minutes=60,
                             task = 'classification',
                             debug_log = 'automl_errors.log',
                             compute_target=aml_compute,
                             experiment_exit_score = 0.9984,
                             blocked_models = ['KNN','LinearSVM'],
                             enable_onnx_compatible_models=True,
                             training_data = ds_train,
                             label_column_name = label,
                             validation_data = ds_valid,
                             n_cross_validations=5,
                             **automl_settings
                            )

In [10]:
# Submit your automl run

remote_run = exp_automl.submit(automl_config, show_output = False)

Running on remote.


In [11]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [None]:
# Retrieve and save your best automl model, evaluate locally on hold out set

In [14]:
best_run_aml, fitted_model_aml = remote_run.get_output()
model_name = best_run_aml.properties['model_name']

In [16]:
best_run_aml.download_file('outputs/model.pkl', 'outputs/bankmarketing-aml-model.pkl')
best_run_aml.download_file('outputs/scoring_file_v_1_0_0.py', 'outputs/score_aml.py')
best_run_aml.download_file('automl_driver.py', 'outputs/automl_driver.py')

In [17]:
import pickle

# Load the downloaded AutoML model; the context manager closes the file for us
with open("outputs/bankmarketing-aml-model.pkl", "rb") as model_file:
    aml_model = pickle.load(model_file)

In [29]:
from scripts.logit_train import clean_data
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


factory = TabularDatasetFactory()
test_data_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv"
test_ds = pd.read_csv(test_data_path)
y_test = test_ds[['y']]


In [34]:
# Predict once on the feature columns and reuse the predictions
y_pred = aml_model.predict(test_ds.drop(columns=['y']))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

0.9162621359223301
              precision    recall  f1-score   support

          no       0.94      0.97      0.95      3636
         yes       0.68      0.55      0.61       484

    accuracy                           0.92      4120
   macro avg       0.81      0.76      0.78      4120
weighted avg       0.91      0.92      0.91      4120

[[3510  126]
 [ 219  265]]
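To put the three test accuracies into perspective, it can help to compare them with a majority-class baseline, i.e. a model that always predicts `no`. The sketch below uses only numbers printed in this notebook (the class supports and the three accuracy scores); the comparison itself is an illustrative addition, not part of the original analysis.

```python
# Class supports on the test set, taken from the classification reports.
support = {"no": 3636, "yes": 484}

# Test accuracies printed in the three evaluation cells of this notebook.
accuracies = {
    "logistic regression (HyperDrive)": 0.9111650485436893,
    "LightGBM (HyperDrive)": 0.9174757281553398,
    "AutoML": 0.9162621359223301,
}

total = sum(support.values())
baseline = max(support.values()) / total  # always predict the majority class

print(f"majority-class baseline: {baseline:.4f}")
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f} (+{acc - baseline:.4f} over baseline)")
```

Because about 88% of the test rows belong to class `no`, all three models improve on the baseline by only three to four percentage points of accuracy. The recall for the `yes` class (0.41 for logistic regression, 0.55 for LightGBM and AutoML) is arguably the more informative figure for comparing them.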


## 6. Clean-up

As a final step, we delete the previously defined compute target.

In [None]:
try:
    aml_compute.delete()
    print('Compute target deleted')
except ComputeTargetException:
    print('Compute target not found')