# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import json
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep
from azureml.pipeline.core import Pipeline
from azureml.pipeline.core import PipelineData, TrainingOutput

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()


cluster_name = "azuremlCluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_D2_V2',  # for GPU, use "STANDARD_NC6"
        # vm_priority='lowpriority',  # optional
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count=1, timeout_in_minutes=10)



Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview
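TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.

This project uses the Stroke Prediction Dataset from Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). Each row describes one patient: gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, and smoking status. The task is classification: predict the `stroke` column from the other features.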


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
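# Utility function to clean the raw stroke data

def clean_data(data):
    # Drop rows with missing values ('N/A' marks a missing entry), remove the
    # 'id' column, and make sure 'bmi' is numeric. No one-hot encoding happens
    # here; categorical columns are left to AutoML featurization.
    x_df = data.replace('N/A', np.nan).dropna()
    x_df = x_df.drop(columns='id')
    x_df['bmi'] = x_df['bmi'].astype(float)

    return x_df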
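
As a quick sanity check, the cleaning step can be exercised on a tiny in-memory frame before touching the real CSV. The sketch below repeats an equivalent `clean_data` so that it runs on its own; the three sample rows are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def clean_data(data):
    # Same steps as the notebook's helper: drop rows with missing values,
    # remove the 'id' column, and cast 'bmi' to float.
    x_df = data.replace('N/A', np.nan).dropna()
    x_df = x_df.drop(columns='id')
    x_df['bmi'] = x_df['bmi'].astype(float)
    return x_df


# Hypothetical rows: the second one has a missing bmi and should be dropped.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [67.0, 61.0, 80.0],
    "bmi": ["36.6", "N/A", "32.5"],
    "stroke": [1, 1, 0],
})

cleaned = clean_data(sample)
print(cleaned)
```

Checking the row count and the `bmi` dtype on a sample like this is a cheap way to catch a silent change in the cleaning logic before the parquet file is uploaded.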

In [3]:
# choose a name for experiment
experiment_name = 'stroke-prediction-training-automl'

experiment=Experiment(ws, experiment_name)

project_folder = './automl'

dataset_name = 'stroke-dataset'
dataset_description = ''' https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
'''
start_filename = 'healthcare-dataset-stroke-data.csv'
out_filename = 'stroke_data_cleaned.parquet'

if dataset_name in ws.datasets.keys():
    dataset = ws.datasets[dataset_name]
else:
    ds = ws.get_default_datastore()
    df = pd.read_csv(start_filename)
    clean_df = clean_data(df)
    clean_df.to_parquet(out_filename, index=False)
    ds.upload_files(
        files=[
            out_filename
        ],
        target_path='stroke_data',
        overwrite=True,
        show_progress=True
    )

    dataset = Dataset.Tabular.from_parquet_files(path=(ds,'stroke_data/*.parquet'))

    # register() returns the registered dataset, so keep that reference
    dataset = dataset.register(
        workspace=ws,
        name=dataset_name,
        description=dataset_description
    )

# https://towardsdatascience.com/azure-machine-learning-service-where-is-my-data-pjainani-86a77b93ab52

## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

In [8]:
# TODO: Put your automl settings here
automl_settings = {
    "experiment_timeout_minutes": 20,   # cap on the total experiment time
    "max_concurrent_iterations": 4,     # matches max_nodes of the compute cluster
    "primary_metric": 'AUC_weighted'    # metric AutoML optimizes for
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             training_data=dataset,
                             label_column_name="stroke",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log="automl_errors.log",
                             **automl_settings
                            )

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [9]:
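from azureml.widgets import RunDetails
# TODO: Submit your experiment
# Note: this is a separate experiment named 'automl_experiment'; the
# `experiment` object created earlier is not used for this run.
automl_experiment = Experiment(ws, 'automl_experiment')
run = automl_experiment.submit(config=automl_config)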

RunDetails(run).show()
run.wait_for_completion(show_output=True)

Submitting remote run.


| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| automl_experiment | AutoML_7cb00053-71eb-430d-810f-c18bcc422676 | automl | NotStarted | Link to Azure Machine Learning studio | Link to Documentation |





Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|3                                |
+---------------------------------+

****************************************************************************************************

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
           

{'runId': 'AutoML_7cb00053-71eb-430d-810f-c18bcc422676',
 'target': 'azuremlCluster',
 'status': 'Completed',
 'startTimeUtc': '2022-06-15T03:57:03.401342Z',
 'endTimeUtc': '2022-06-15T04:11:12.463908Z',
 'services': {},
   'message': 'No scores improved over last 10 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'azuremlCluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"e985c1b4-a8d6-4c85-a165-13e58ca7eb7c\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azurem
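
The class balancing alert in the run output above indicates that the `stroke` label is imbalanced. Below is a minimal sketch for quantifying that before rerunning, using only the standard library. `class_shares`, `imbalance_ratio`, and the sample labels are hypothetical names and values, not part of the notebook.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up labels for illustration: 19 negatives and 1 positive.
labels = [0] * 19 + [1]

shares = class_shares(labels)
ratio = imbalance_ratio(labels)
print(shares, ratio)
```

In the notebook the labels could be taken from `dataset.to_pandas_dataframe()['stroke']`. A large ratio is one reason to prefer a metric such as `AUC_weighted` over plain accuracy, since a model that always predicts the majority class can still score a high accuracy.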

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [14]:
best_run, best_model = run.get_output(return_onnx_model=False)

print(best_run)
print(best_model)

# Make sure the local download directory exists (also creates ./automl if needed)
os.makedirs("./automl/outputs", exist_ok=True)

best_run.download_files(output_directory='./automl')

print(best_run.get_file_names())
print(best_run.get_environment())

best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('Accuracy: ', best_run_metrics['accuracy'])
print('AUC_Weighted: ', best_run_metrics['AUC_weighted'])

Package:azureml-automl-runtime, training version:1.38.0, current version:1.34.0
Package:azureml-core, training version:1.38.0, current version:1.34.0
Package:azureml-dataprep, training version:2.26.0, current version:2.22.2
Package:azureml-dataprep-rslex, training version:2.2.0, current version:1.20.1
Package:azureml-dataset-runtime, training version:1.38.0, current version:1.34.0
Package:azureml-defaults, training version:1.38.0, current version:1.34.0
Package:azureml-inference-server-http, training version:0.4.2, current version:0.3.1
Package:azureml-interpret, training version:1.38.0, current version:1.34.0
Package:azureml-mlflow, training version:1.38.0, current version:1.34.0
Package:azureml-pipeline-core, training version:1.38.0, current version:1.34.0
Package:azureml-responsibleai, training version:1.38.0, current version:1.34.0
Package:azureml-telemetry, training version:1.38.0, current version:1.34.0
Package:azureml-train-automl-client, training version:1.38.0, current version

Run(Experiment: automl_experiment,
Id: AutoML_7cb00053-71eb-430d-810f-c18bcc422676_36,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
), random_state=0, reg_alpha=0.8333333333333334, reg_lambda=1.9791666666666667, subsample=0.8, tree_method='auto'))], verbose=False))], flatten_transform=None, weights=[0.2, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.2, 0.06666666666666667]))],
         verbose=False)
['accuracy_table', 'automl_driver.py', 'confusion_matrix', 'explanation/5a1505a7/

In [15]:
#TODO: Save the best model
model_name = 'stroke-prediction-automl-model'
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py#azureml-core-run-run-register-model

model = run.register_model(model_name=model_name)



## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [16]:
from azureml.core import Environment
from azureml.core.model import InferenceConfig

env = best_run.get_environment()

inference_config = InferenceConfig(environment=env, source_directory='./automl/outputs', entry_script='./scoring_file_v_2_0_0.py')

In [17]:
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model

deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1, enable_app_insights=True)

# Deploy 
service = Model.deploy(
    ws,
    "strokepredictor",
    [model],
    inference_config,
    deployment_config,
    overwrite=True,
)
service.wait_for_deployment(show_output=True)
print(service.state)


Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-06-15 04:24:03+00:00 Creating Container Registry if not exists.
2022-06-15 04:24:03+00:00 Registering the environment.
2022-06-15 04:24:04+00:00 Use the existing image.
2022-06-15 04:24:04+00:00 Generating deployment configuration.
2022-06-15 04:24:05+00:00 Submitting deployment to compute.
2022-06-15 04:24:08+00:00 Checking the status of deployment strokepredictor..
2022-06-15 04:30:07+00:00 Checking the status of inference endpoint strokepredictor.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


TODO: In the cell below, send a request to the web service you deployed to test it.

In [22]:
import requests
import json

uri = service.scoring_uri

headers = {"Content-Type": "application/json"}
data =  {
  "Inputs": {
    "data": [
      {
        "gender": "Male",
        "age": 67,
        "hypertension": 0,
        "heart_disease": 1,
        "ever_married": "Yes",
        "work_type": "Private",
        "Residence_type": "Urban",
        "avg_glucose_level": 228.69,
        "bmi": 36.6,
        "smoking_status": "formerly smoked"
      },
      {
        "gender": "Female",
        "age": 102,
        "hypertension": 1,
        "heart_disease": 1,
        "ever_married": "No",
        "work_type": "Self-employed",
        "Residence_type": "Rural",
        "avg_glucose_level": 202.21,
        "bmi": 38.5,
        "smoking_status": "never smoked"
      }
    ]
  },
  "GlobalParameters": {
    "method": "predict"
  }
}
data = json.dumps(data)
response = requests.post(uri, data=data, headers=headers)
print(response.json())

{'Results': [0, 0]}
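
The request body above follows an `Inputs` / `GlobalParameters` layout. Below is a small sketch of helpers that build that body from a list of patient records and read the `Results` field back. `build_payload` and `parse_results` are hypothetical helper names, and the response shape is assumed from the output shown above.

```python
import json


def build_payload(records, method="predict"):
    """Serialize patient records into the JSON body used in the request above."""
    body = {
        "Inputs": {"data": list(records)},
        "GlobalParameters": {"method": method},
    }
    return json.dumps(body)


def parse_results(response_body):
    """Return the predictions from a decoded response, e.g. {'Results': [0, 0]}."""
    if "Results" not in response_body:
        raise ValueError(f"Unexpected response: {response_body!r}")
    return response_body["Results"]


# One record copied from the request above.
patient = {
    "gender": "Male",
    "age": 67,
    "hypertension": 0,
    "heart_disease": 1,
    "ever_married": "Yes",
    "work_type": "Private",
    "Residence_type": "Urban",
    "avg_glucose_level": 228.69,
    "bmi": 36.6,
    "smoking_status": "formerly smoked",
}

payload = build_payload([patient])
predictions = parse_results({"Results": [0, 0]})
```

With the deployed service, `payload` would be passed as `data=payload` to `requests.post`, and `parse_results(response.json())` would return the predicted labels. Raising on a missing `Results` key surfaces error responses early instead of failing later on a `KeyError`.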


TODO: In the cell below, print the logs of the web service and delete the service

In [23]:
# Print logs
logs = service.get_logs()

for line in logs.split('\n'):
    print(line)

# delete service
service.delete()
model.delete()

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
