# Automated ML

In the cell below, we import all the dependencies that we will need to complete the project.

In [9]:
from azureml.core import Workspace, Dataset, ComputeTarget
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails
from azureml.pipeline.steps import AutoMLStep
from azureml.core.compute import AmlCompute
from azureml.core.resource_configuration import ResourceConfiguration

## Dataset

### Overview

For our capstone project, we use [the Kaggle heart failure dataset](https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data), which we uploaded and registered in the workspace beforehand. The dataset stems from [a publication on heart failure prediction using machine learning](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5). It contains medical records of 299 patients with heart failure, along with their survival encoded as a binary variable (`DEATH_EVENT`). The goal of our AutoML experiment is to predict survival, so the task is a binary classification problem.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'capstone-experiment'
experiment=Experiment(ws, experiment_name)

# print some information about the experiment
print(f'Workspace name: {ws.name} / AZ region: {ws.location} ' \
    f'/ Subscription ID: {ws.subscription_id} / Resource group: {ws.resource_group}')

# start logging and get the compute target
run = experiment.start_logging()
cluster_name = "capstone-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print(f"Found existing compute target: {compute_target}")
except Exception as e:
    print(f"Creating a new compute target (error: {e})")
    compute_cnfg = AmlCompute.provisioning_configuration(
        vm_size = "Standard_DS3_V2",
        min_nodes = 0,
        max_nodes = 4,
    )
    compute_target = ComputeTarget.create(
        ws,
        cluster_name,
        compute_cnfg,
    )
    compute_target.wait_for_completion(
        show_output=True,
        min_node_count=None,
        timeout_in_minutes=60,
    )

# message if ready
print(f'compute target: {compute_target.get_status().serialize()}')

Workspace name: quick-starts-ws-240002 / AZ region: westeurope / Subscription ID: 6971f5ac-8af1-446e-8034-05acea24681f / Resource group: aml-quickstarts-240002
Found existing compute target: AmlCompute(workspace=Workspace.create(name='quick-starts-ws-240002', subscription_id='6971f5ac-8af1-446e-8034-05acea24681f', resource_group='aml-quickstarts-240002'), name=capstone-cluster, id=/subscriptions/6971f5ac-8af1-446e-8034-05acea24681f/resourceGroups/aml-quickstarts-240002/providers/Microsoft.MachineLearningServices/workspaces/quick-starts-ws-240002/computes/capstone-cluster, type=AmlCompute, provisioning_state=Succeeded, location=westeurope, tags={})
compute target: {'currentNodeCount': 4, 'targetNodeCount': 4, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 4, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2023-08-09T18:03:36.267000+00:00', 'errors': None, 'creati

In [3]:
ds_name = 'heart_failure_kaggle_ml'
dataset = Dataset.get_by_name(workspace=ws, name=ds_name)

In [4]:
# inspect the dataframe 
pd_dataset = dataset.to_pandas_dataframe()
pd_dataset.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [5]:
pd_dataset.tail()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
294,62.0,0,61,1,38,1,155000.0,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.0,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.0,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.0,1.4,140,1,1,280,0
298,50.0,0,196,0,45,0,395000.0,1.6,136,1,1,285,0
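
The `Unnamed: 0` column in the previews above looks like a row index that was saved along with the CSV file rather than a clinical feature. The following is a minimal sketch, assuming that is the case, of how such a column could be dropped and the label distribution checked. It uses only the standard library and four rows copied from the previews (not the full dataset); `drop_columns` is a hypothetical helper that is not part of the original notebook. On the pandas frame itself, `pd_dataset.drop(columns=["Unnamed: 0"])` would be the equivalent call.

```python
import csv
import io
from collections import Counter

# Four preview rows copied from the head()/tail() outputs above (not the full dataset).
PREVIEW_CSV = """Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
297,45.0,0,2413,0,38,0,140000.0,1.4,140,1,1,280,0
298,50.0,0,196,0,45,0,395000.0,1.6,136,1,1,285,0
"""


def drop_columns(rows, columns):
    """Return copies of the row dicts without the given columns (hypothetical helper)."""
    unwanted = set(columns)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


rows = list(csv.DictReader(io.StringIO(PREVIEW_CSV)))

# Assumption: "Unnamed: 0" is only a saved row index and carries no clinical information.
features = drop_columns(rows, ["Unnamed: 0"])

# Count how often each label value occurs in the preview rows.
label_counts = Counter(row["DEATH_EVENT"] for row in features)

print(list(features[0])[:3])
print(dict(label_counts))
```

Whether the column should really be removed depends on how the dataset was registered, so this is something to verify against the original CSV before retraining.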


## AutoML Configuration

The AutoML configuration below uses the following settings:
- the task, which is set to classification for our use case
- the primary metric used to rank the models; here we choose accuracy
- the training data, which is the registered Azure dataset described above
- the name of the target column in the dataset: `DEATH_EVENT`
- the number of cross-validation folds (5), so that each model is trained and evaluated on several splits and its performance estimate is more reliable
- the compute target that runs our AutoML job, as set up above
- a flag to enable early stopping, so that the experiment ends when the scores no longer improve
- the experiment timeout, which we set to 20 minutes so that we can react quickly if the experiment goes wrong
- a maximum of 5 concurrent iterations, to avoid overloading our compute
- the path for storing the experiment results: `./automl-run`
- featurization set to `"auto"`, so that we benefit from AutoML's built-in data preprocessing techniques
- the name of the AutoML debug log file: `automl_errors.log`
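
To make the cross-validation setting concrete: with 5 folds, every record is used for validation exactly once and for training in the other four folds. Below is a rough sketch in plain Python of how 299 records could be divided. `kfold_indices` is a hypothetical helper for illustration only; it uses simple contiguous splits and does not reproduce how AutoML creates its splits internally.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) index lists for contiguous k-fold splits."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(kfold_indices(299, 5))
fold_sizes = [len(valid) for _, valid in folds]
print(fold_sizes)  # [60, 60, 60, 60, 59]
```

Each model is therefore validated on roughly 60 patients per fold, which is why the reported metrics are averages over five small validation sets.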

In [6]:
target_column = "DEATH_EVENT"

# automl experiment settings here
automl_settings = {
    "task": "classification",
    "primary_metric": "accuracy",
    "training_data": dataset,
    "label_column_name": target_column,
    "n_cross_validations": 5,
    "compute_target": compute_target,
    "enable_early_stopping": True,
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "path": "./automl-run",
    "featurization": "auto",
    "debug_log": "automl_errors.log",
}

# automl config here
automl_config = AutoMLConfig(**automl_settings)
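
One detail worth checking: the cluster above is provisioned with `max_nodes = 4`, while `max_concurrent_iterations` is set to 5. As far as we understand the Azure ML documentation, an AmlCompute cluster runs one AutoML iteration per node, so a fifth concurrent iteration would be queued rather than run in parallel. The sketch below shows one way to keep the two values consistent; `cap_concurrency` is a hypothetical helper, not an Azure ML SDK function, and the settings dictionary is a shortened stand-in for `automl_settings`.

```python
def cap_concurrency(settings, max_nodes):
    """Return a copy of the settings with concurrency limited to the node count."""
    capped = dict(settings)
    requested = capped.get("max_concurrent_iterations", 1)
    capped["max_concurrent_iterations"] = min(requested, max_nodes)
    return capped


# Shortened stand-in for the automl_settings dictionary above.
example_settings = {"task": "classification", "max_concurrent_iterations": 5}

capped_settings = cap_concurrency(example_settings, max_nodes=4)
print(capped_settings["max_concurrent_iterations"])  # 4
```

The helper returns a copy, so the original settings stay untouched and the capped version can be passed to `AutoMLConfig` instead.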

In [7]:
# submit the experiment
automl_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
capstone-experiment,AutoML_1bf9c79d-ba88-49bd-9ceb-d358da8460e4,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

In the cell below, we use the `RunDetails` widget to monitor the AutoML run and its child runs, and then wait for the run to complete.

In [8]:
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

Experiment,Id,Type,Status,Details Page,Docs Page
capstone-experiment,AutoML_1bf9c79d-ba88-49bd-9ceb-d358da8460e4,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

**********************************************************************************

{'runId': 'AutoML_1bf9c79d-ba88-49bd-9ceb-d358da8460e4',
 'target': 'capstone-cluster',
 'status': 'Completed',
 'startTimeUtc': '2023-08-09T18:25:22.207065Z',
 'endTimeUtc': '2023-08-09T18:41:03.00866Z',
 'services': {},
   'message': 'No scores improved over last 10 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'capstone-cluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"791d29a5-0e01-469f-addb-b112bc99ccab\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-

## Best Model

In the following cells, we retrieve the best run and its fitted model from the AutoML experiment, print the metrics of the best run, and display the properties of the model.

In [13]:
best_run, best_model = automl_run.get_output()
best_run_metrics = best_run.get_metrics()
best_run_parameter_values = best_run.get_details()["runDefinition"]["arguments"]

print(f"ID for best model run: {best_run.id}\n")
print(f"Best metrics collected from the best run: {best_run_metrics}\n")
print(f"Reached accuracy: {best_run_metrics['accuracy']}\n")

Package:azureml-automl-runtime, training version:1.52.0.post1, current version:1.51.0.post1
Package:azureml-core, training version:1.52.0, current version:1.51.0
Package:azureml-dataprep, training version:4.11.4, current version:4.10.8
Package:azureml-dataprep-rslex, training version:2.18.4, current version:2.17.12
Package:azureml-dataset-runtime, training version:1.52.0, current version:1.51.0
Package:azureml-defaults, training version:1.52.0, current version:1.51.0
Package:azureml-interpret, training version:1.52.0, current version:1.51.0
Package:azureml-mlflow, training version:1.52.0, current version:1.51.0
Package:azureml-pipeline-core, training version:1.52.0, current version:1.51.0
Package:azureml-responsibleai, training version:1.52.0, current version:1.51.0
Package:azureml-telemetry, training version:1.52.0, current version:1.51.0
Package:azureml-train-automl-client, training version:1.52.0, current version:1.51.0.post1
Package:azureml-train-automl-runtime, training version:1.

ID for best model run: AutoML_1bf9c79d-ba88-49bd-9ceb-d358da8460e4_86

Best metrics collected from the best run: {'recall_score_macro': 0.8619047619047621, 'average_precision_score_micro': 0.918318078003167, 'log_loss': 0.3594652250175302, 'recall_score_weighted': 0.8897175141242938, 'precision_score_macro': 0.8961048938444603, 'balanced_accuracy': 0.8619047619047621, 'AUC_macro': 0.9172676264304173, 'recall_score_micro': 0.8897175141242938, 'matthews_correlation': 0.7553025119722379, 'weighted_accuracy': 0.908408072959493, 'accuracy': 0.8897175141242938, 'AUC_weighted': 0.9172676264304173, 'AUC_micro': 0.9183955759839126, 'f1_score_micro': 0.8897175141242938, 'f1_score_weighted': 0.8859539915484156, 'norm_macro_recall': 0.7238095238095238, 'precision_score_weighted': 0.9048272317148681, 'precision_score_micro': 0.8897175141242938, 'average_precision_score_weighted': 0.9240433534998737, 'average_precision_score_macro': 0.8989552076089244, 'f1_score_macro': 0.8660905548676316, 'accuracy

In [14]:
print(f"Metrics results of the best run: {best_run.get_metrics()}")

Metrics results of the best run: {'recall_score_macro': 0.8619047619047621, 'average_precision_score_micro': 0.918318078003167, 'log_loss': 0.3594652250175302, 'recall_score_weighted': 0.8897175141242938, 'precision_score_macro': 0.8961048938444603, 'balanced_accuracy': 0.8619047619047621, 'AUC_macro': 0.9172676264304173, 'recall_score_micro': 0.8897175141242938, 'matthews_correlation': 0.7553025119722379, 'weighted_accuracy': 0.908408072959493, 'accuracy': 0.8897175141242938, 'AUC_weighted': 0.9172676264304173, 'AUC_micro': 0.9183955759839126, 'f1_score_micro': 0.8897175141242938, 'f1_score_weighted': 0.8859539915484156, 'norm_macro_recall': 0.7238095238095238, 'precision_score_weighted': 0.9048272317148681, 'precision_score_micro': 0.8897175141242938, 'average_precision_score_weighted': 0.9240433534998737, 'average_precision_score_macro': 0.8989552076089244, 'f1_score_macro': 0.8660905548676316, 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_1bf9c79d-ba88-49bd-9ceb-d35
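
The metrics dictionary mixes scalar scores with references to stored artifacts such as `accuracy_table`, which makes the raw printout hard to read. Here is a small sketch, using a handful of values copied from the output above, of how the scalar scores could be filtered and sorted. `scalar_metrics` is a hypothetical helper, and the `accuracy_table` value is a placeholder because the real one is truncated in the output.

```python
# A few entries copied from the best run's metrics output above.
metrics_excerpt = {
    "accuracy": 0.8897175141242938,
    "AUC_weighted": 0.9172676264304173,
    "balanced_accuracy": 0.8619047619047621,
    "f1_score_macro": 0.8660905548676316,
    "log_loss": 0.3594652250175302,
    "matthews_correlation": 0.7553025119722379,
    "accuracy_table": "aml://artifactId/...",  # placeholder: truncated in the output
}


def scalar_metrics(metrics):
    """Keep only numeric entries and drop artifact references (hypothetical helper)."""
    return {
        name: value
        for name, value in metrics.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


scalars = scalar_metrics(metrics_excerpt)

# Sort from highest to lowest value; note that lower is better for log_loss.
ranked = sorted(scalars.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name:<22}{value:.4f}")
```

In the notebook, the same helper could be applied directly to `best_run.get_metrics()`.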

In [15]:
print(f"Overview over the best model and its details: {best_model}")

Overview over the best model and its details: Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
                 PreFittedSoftVotingClassifier(classification_labels=array([0, 1]), estimators=[('65', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='l2')), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.9, eta=0.3, gamma=0, max_depth=6, max_leaves=0, n_estimators=50, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'cpu'}), random_state=0, reg_alpha=1.7708333333333335, reg_lambda=1.7708333333333335, subsample=0.9, tree_method='auto'))]

In [16]:
# get the best child run, look up its model name and list the files it produced
best_run = automl_run.get_best_child()
best_model_name = best_run.properties["model_name"]
best_run.get_file_names()

['accuracy_table',
 'automl_driver.py',
 'confusion_matrix',
 'explanation/6b2a27aa/classes.interpret.json',
 'explanation/6b2a27aa/eval_data_viz.interpret.json',
 'explanation/6b2a27aa/expected_values.interpret.json',
 'explanation/6b2a27aa/features.interpret.json',
 'explanation/6b2a27aa/global_names/0.interpret.json',
 'explanation/6b2a27aa/global_rank/0.interpret.json',
 'explanation/6b2a27aa/global_values/0.interpret.json',
 'explanation/6b2a27aa/local_importance_values.interpret.json',
 'explanation/6b2a27aa/per_class_names/0.interpret.json',
 'explanation/6b2a27aa/per_class_rank/0.interpret.json',
 'explanation/6b2a27aa/per_class_values/0.interpret.json',
 'explanation/6b2a27aa/rich_metadata.interpret.json',
 'explanation/6b2a27aa/true_ys_viz.interpret.json',
 'explanation/6b2a27aa/visualization_dict.interpret.json',
 'explanation/6b2a27aa/ys_pred_proba_viz.interpret.json',
 'explanation/6b2a27aa/ys_pred_viz.interpret.json',
 'explanation/9117bc8a/classes.interpret.json',
 'expl

In [17]:
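# the first positional argument of download_files is a name prefix, not a target folder,
# so we pass both explicitly; outputs/model.pkl is then saved to ./outputs/model.pkl
best_run.download_files(prefix="outputs/", output_directory=".")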


## Model Deployment

Only one of the two models trained in this project has to be deployed, but both have to be registered. As the two models are roughly similar, we chose to **just register** the AutoML model here and not to deploy it.

In [18]:
# registering the model
best_model_name = best_run.properties["model_name"]
model = best_run.register_model(
    model_name=best_model_name, 
    model_path="./outputs",
    resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=1),
)

print(f"Model name: {model.name}")
print(f"Version: {model.version}")
print(f"RunID: {model.run_id}")

Model name: AutoML1bf9c79db86
Version: 1
RunID: AutoML_1bf9c79d-ba88-49bd-9ceb-d358da8460e4_86


In [19]:
print(f"Best run model name is : {best_model_name}")

Best run model name is : AutoML1bf9c79db86


In [None]:
compute_target.delete()

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
