# Automated ML

In the next cells, we update the environment's SDK to the latest AutoML version to avoid incompatibility issues, and then import the libraries needed to run this notebook.
IMPORTANT: Restart the kernel after running the update so that the updated libraries are loaded into the environment.

In [1]:
# To avoid SDK version mismatches between AutoML training and later serving of the model on the compute cluster, update the AutoML SDK.
# Use Python 3.6: at the time of writing, it is the version compatible with the latest AutoML SDK.

import sys

! {sys.executable} -m pip install --upgrade azureml-sdk[automl]

Requirement already up-to-date: azureml-sdk[automl] in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.36.0)
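
After restarting the kernel, it can help to confirm which SDK version was actually picked up. The helper below is a minimal sketch and not part of the original notebook: `installed_version` is a hypothetical name, and the two package names queried are assumed to be the ones of interest here. It uses the standard-library metadata API where available and falls back to `pkg_resources` on older interpreters such as Python 3.6.

```python
def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        # Python 3.8+: standard-library metadata API.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Older interpreters (e.g. Python 3.6): fall back to setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names are assumptions; adjust them to the packages you care about.
for name in ("azureml-sdk", "azureml-core"):
    print(name, installed_version(name))
```

A value of `None` means the package is not visible to the running kernel, which usually points to a different environment than the one `pip` upgraded.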


In [2]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core import Datastore
import joblib
from pprint import pprint
import requests
import json
import pandas as pd
from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice



## Dataset

### Overview

We first configure the workspace so that we can register the dataset and create a compute cluster for AutoML training and model deployment.

The dataset is the "Heart failure prediction dataset" from Kaggle (https://www.kaggle.com/fedesoriano/heart-failure-prediction). It is intended to support the early detection of severe heart disease by relating several health indicators to the occurrence of the disease. It combines 5 different heart-disease datasets (see the Kaggle URL above for details).

A copy of the dataset is provided in the GitHub repository, but it can also be accessed via a URL.

In [3]:
ws = Workspace.from_config()

# Choose a name for the experiment
experiment_name = 'Udacitycapsproject'

experiment=Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = experiment.start_logging()

Workspace name: quick-starts-ws-165696
Azure region: southcentralus
Subscription id: 510b94ba-e453-4417-988b-fbdc37b55ca7
Resource group: aml-quickstarts-165696


In [4]:
cluster_name="udacityprojclust"

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('CPU cluster already exists. Using it.')
except ComputeTargetException:

    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D2_V2', max_nodes=6)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

InProgress..
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [5]:
data = pd.read_csv('./heart.csv')
datastore = Datastore.get(ws, 'workspaceblobstore')

dataset = TabularDatasetFactory.register_pandas_dataframe(data, target=datastore, name='udacitycapsprojdata')

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/c96dd65b-d930-4cbb-88ae-848bb0da6922/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


## AutoML Configuration

The AutoML task is a binary classification: the label is the yes/no column 'HeartDisease'. The settings aim for the best results while limiting resource consumption. We selected the AUC metric (`AUC_weighted`) so that any class imbalance in the dataset is handled well, and we use 5-fold cross-validation for a sound evaluation of each model. We also enabled early stopping and set an experiment timeout to limit the use of the compute cluster.
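
To make the two evaluation choices concrete, the sketch below shows, in plain Python, how 5-fold splitting partitions the rows and how a binary AUC can be computed as the fraction of correctly ranked (positive, negative) pairs. It only illustrates the ideas: the function names and toy data are hypothetical, and AutoML's own splitting and metric code may differ (for example in shuffling, stratification, or class weighting).

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) lists for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop


def binary_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the heart dataset.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(binary_auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly

folds = list(k_fold_indices(10, 5))
print([valid for _, valid in folds])  # each row is validated exactly once
```

Because AUC depends only on how examples are ranked, not on a decision threshold, it is less sensitive to class imbalance than plain accuracy.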

In [8]:
import logging  # required by the "verbosity" setting below

# automl settings
automl_settings = {
       "n_cross_validations": 5,
       "primary_metric": 'AUC_weighted',
       "enable_early_stopping": True,
       "experiment_timeout_hours": 1.0,
       "iterations" : 10,
       "max_concurrent_iterations": 5,
       "max_cores_per_iteration": -1,
       "enable_onnx_compatible_models": True,
       "blocked_models": ['XGBoostClassifier'],
       "verbosity": logging.INFO}

# automl config
automl_config = AutoMLConfig(task = 'classification',
                               compute_target = cluster_name,
                               training_data = dataset,
                               label_column_name = 'HeartDisease',
                               **automl_settings)

In [9]:
# Submit the AutoML experiment
remote_run = experiment.submit(automl_config)

Submitting remote run.


| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| Udacitycapsproject | AutoML_a3fce369-7ba2-4397-8adc-7ad9a4d6c61a | automl | NotStarted | Link to Azure Machine Learning studio | Link to Documentation |


## Run Details

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [10]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [11]:
# Retrieve and get insights from your best automl model.

best_run_AutoML, fitted_model_AutoML = remote_run.get_output()

print(hasattr(fitted_model_AutoML, 'steps'))

True


In [12]:
# Function to list the hyperparameters 

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators' : list(e[0] for e in step[1].estimators), 'weights' : step[1].weights})
            print()

            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        
        else:
            pprint(step[1].get_params())
            print()
        
print_model(fitted_model_AutoML)

datatransformer
{'enable_dnn': False,
 'enable_feature_sweeping': False,
 'feature_sweeping_config': {},
 'feature_sweeping_timeout': 86400,
 'featurization_config': None,
 'force_text_dnn': False,
 'is_cross_validation': True,
 'is_onnx_compatible': True,
 'observer': None,
 'task': 'classification',
 'working_dir': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/notebook165696/code/Users/odl_user_165696'}

prefittedsoftvotingclassifier
{'estimators': ['4', '3', '7', '8', '6', '2', '0'],
 'weights': [0.14285714285714285,
             0.14285714285714285,
             0.14285714285714285,
             0.14285714285714285,
             0.14285714285714285,
             0.14285714285714285,
             0.14285714285714285]}

4 - sparsenormalizer
{'copy': True, 'norm': 'l2'}

4 - randomforestclassifier
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': 'balanced',
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min
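
The output above shows a soft-voting ensemble of seven fitted pipelines, each with weight 1/7. As a rough illustration of soft voting in general, the sketch below averages per-model class probabilities with the given weights and picks the class with the highest average. The function name and the probability values are hypothetical, and this is not the actual code of the AutoML ensemble class.

```python
def soft_vote(probabilities, weights):
    """Return the weighted average of per-model class-probability vectors."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [sum(w * p[c] for w, p in zip(weights, probabilities)) / total
            for c in range(n_classes)]


# Hypothetical [P(no disease), P(disease)] outputs of seven ensemble members.
member_probs = [
    [0.20, 0.80], [0.40, 0.60], [0.30, 0.70], [0.60, 0.40],
    [0.10, 0.90], [0.50, 0.50], [0.35, 0.65],
]
weights = [1 / 7] * 7  # equal weights, as in the printed ensemble

avg = soft_vote(member_probs, weights)
predicted = max(range(len(avg)), key=avg.__getitem__)  # index of the most probable class
print([round(p, 3) for p in avg], predicted)
```

With equal weights, soft voting reduces to a plain mean of the members' probabilities, so a single dissenting member (the fourth one here) does not flip the prediction.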

In [13]:
# Get information from guardrails.

print(remote_run.get_guardrails())


********************************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

********************************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality featu
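
The class-balancing guardrail can be sanity-checked by hand. The sketch below computes each class's share and the minority-to-majority ratio for a list of labels. The function name and sample labels are hypothetical, and the threshold AutoML uses to flag imbalance is not shown in this notebook, so none is assumed here.

```python
from collections import Counter


def class_balance(labels):
    """Return (share of each class, minority/majority count ratio)."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    ratio = min(counts.values()) / max(counts.values())
    return shares, ratio


# Hypothetical labels; in the notebook you could pass data['HeartDisease'] instead.
sample_labels = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
shares, ratio = class_balance(sample_labels)
print(shares, round(ratio, 3))
```

A ratio close to 1 means the classes are nearly balanced; a ratio close to 0 means one class dominates.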

In [14]:
# Save the best model found by AutoML

joblib.dump(fitted_model_AutoML, 'AutoML.model')

['AutoML.model']

## Model Deployment

Remember that you only have to deploy one of the two models you trained, but you still need to register both models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model.

In [15]:

df_y = data['HeartDisease']
df_x = data.drop('HeartDisease', axis=1)

model = Model.register(workspace=ws,
                       model_name='my-udacityproj3-automlmodel',                # Name of the registered model in your workspace.
                       model_path='./AutoML.model',  # Local file to upload and register as a model.
                       model_framework=Model.Framework.CUSTOM,  # Framework used to create the model.
                       #sample_input_dataset=df_x,
                       #sample_output_dataset=df_y,
                       resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=1.0),
                       description='AutoML model for heart disease prediction.',
                       tags={'area': 'heartdisease', 'type': 'classification'})

print('Name:', model.name)
print('Version:', model.version)

Registering model my-udacityproj3-automlmodel
Name: my-udacityproj3-automlmodel
Version: 1


TODO: In the cells below, create the environment and the inference config, and deploy the model as a web service.

In [16]:
# Create environment with its dependencies.

environment = Environment('my-AutoML-environment')
environment.python.conda_dependencies = CondaDependencies.create(pip_packages=[
    'azureml-defaults',
    'azureml-automl-core',
    'azureml-automl-runtime',
    'inference-schema[numpy-support]',
    'joblib',
    'numpy',
    'pandas',
    'scikit-learn',
    'xgboost',
    'packaging'
])

In [17]:
# Create the webservice

inference_config = InferenceConfig(entry_script='./score_3.py', environment=environment)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, enable_app_insights=True)


service_name = 'my-udacityproj3-service'

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-12-13 15:59:07+00:00 Creating Container Registry if not exists..
2021-12-13 16:09:07+00:00 Registering the environment..
2021-12-13 16:09:09+00:00 Building image..
2021-12-13 16:14:52+00:00 Generating deployment configuration..
2021-12-13 16:14:55+00:00 Submitting deployment to compute..
2021-12-13 16:14:59+00:00 Checking the status of deployment my-udacityproj3-service..
2021-12-13 16:20:16+00:00 Checking the status of inference endpoint my-udacityproj3-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [18]:
print(service.get_logs())

2021-12-13T16:19:57,893049000+00:00 - iot-server/run 
2021-12-13T16:19:57,894433700+00:00 - rsyslog/run 
2021-12-13T16:19:57,893048100+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2021-12-13T16:19:58,088945300+00:00 - nginx/run 
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2021-12-13T16:19:58,783336500+00:00 - iot-server/finish 1 0
2021-12-13T16:19:58,785953000+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (68)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 93
SPARK_HOME not set. Skipping PySpark Initialization.
Initializing logger
2021-12-13 16:20:02,371 | root | INFO | Starting up app insights client
logging socket was found. logging is available.
logging socket was found. logging is available.
2021-12-13 16:20:02,373 | root | INFO | Starting up request id generator
2021-12-13 16:20:02,373 | root | INFO | Star

TODO: In the cells below, send a request to the deployed web service to test it, then delete the service.

In [22]:
# Get the data to test the endpoint in the right way.

data_tosend = df_x[0:1].values.tolist()

print(data_tosend)

data_tosend = [data_tosend[0][:]]

print(data_tosend)


[[40, 'M', 'ATA', 140, 289, 0, 'Normal', 172, 'N', 0.0, 'Up']]
[[40, 'M', 'ATA', 140, 289, 0, 'Normal', 172, 'N', 0.0, 'Up']]


In [23]:
input_payload = json.dumps({
    'data': data_tosend,
    'method': 'predict'  # If you have a classification model, you can get probabilities by changing this to 'predict_proba'.
})

output = service.run(input_payload)

print(output)

Run Error: DataException:
	Message: Expected column(s) 0 not found in fitted data.
	InnerException: None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Expected column(s) 0 not found in fitted data.",
        "target": "X",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "MissingColumnsInData"
            }
        },
        "reference_code": "17049f70-3bbe-4060-a63f-f06590e784e5"
    }
}
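
The error message suggests that the model received a frame whose columns were named by position (`0`, `1`, ...), while it was fitted on named columns. One possible fix is to send each row as a record keyed by column name. The sketch below builds such a payload with the standard library only. `build_payload` is a hypothetical helper, the column list is assumed to match the feature columns of the Kaggle CSV in file order, and it is also an assumption that the entry script (`score_3.py`, not shown here) turns the `data` field into a DataFrame.

```python
import json

# Assumed feature names of the heart dataset, in file order.
# In the notebook, prefer list(df_x.columns) over a hard-coded list.
FEATURE_COLUMNS = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope",
]


def build_payload(rows, columns, method="predict"):
    """Serialize rows as a JSON list of {column: value} records."""
    records = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(
                "row has %d values but %d columns were given" % (len(row), len(columns))
            )
        records.append(dict(zip(columns, row)))
    return json.dumps({"data": records, "method": method})


# The sample row printed earlier in this notebook.
rows = [[40, 'M', 'ATA', 140, 289, 0, 'Normal', 172, 'N', 0.0, 'Up']]
input_payload = build_payload(rows, FEATURE_COLUMNS)
print(input_payload)
```

Inside the notebook, `json.loads(df_x[0:1].to_json(orient='records'))` should give an equivalent list of records straight from the DataFrame. Whether the service accepts this format depends on how the entry script parses the request.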


In [None]:
# Remove WebService endpoint

service.delete()

# Remove compute cluster

cpu_cluster.delete()

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
