# Automated ML - Stroke Prediction

Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [15]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
from azureml.core import Environment, ScriptRunConfig
import joblib
import os
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core import Dataset
import pandas as pd
from azureml.train.automl import AutoMLConfig
from azureml.core.webservice import AciWebservice
import requests # Used for http post request
import json
from azureml.core.model import InferenceConfig
from azureml.core import Model # Used to get model information

## Dataset
### Context

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### Overview
The data is provided via the following Kaggle source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

The data is provided as a .csv file and is structured as follows.

Attribute Information:
1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
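
The attribute list above can be turned into a lightweight sanity check before training. The following is a minimal sketch and not part of the original notebook: the names `ALLOWED_VALUES` and `find_invalid_fields` are hypothetical, the check assumes raw CSV rows before any type conversion, and `work_type` uses the spelling `Govt_job` found in the data.

```python
# Allowed categorical values, copied from the attribute list.
# Assumption: records are raw CSV rows, before any type conversion.
ALLOWED_VALUES = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}


def find_invalid_fields(record):
    """Return the names of categorical fields whose value is not in the allowed set."""
    return [
        field
        for field, allowed in ALLOWED_VALUES.items()
        if record.get(field) not in allowed
    ]


# Hypothetical example record for illustration.
example_record = {
    "gender": "Female",
    "ever_married": "Yes",
    "work_type": "Private",
    "Residence_type": "Urban",
    "smoking_status": "never smoked",
}
print(find_invalid_fields(example_record))
```

A record with an unexpected value, for example a misspelled `work_type`, would be reported by field name so it can be inspected before the dataset is registered.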

## Create Workspace

In [2]:
ws = Workspace.from_config()

## Load data from Datastore

In [3]:
key = "Stroke Dataset"
description_text = "This dataset is used to predict whether a patient is likely to get stroke."

if key in ws.datasets.keys():
    # Reuse the dataset that is already registered in the workspace
    dataset = ws.datasets[key]
else:
    # Create the AML dataset from the CSV file
    example_data = 'https://raw.githubusercontent.com/jmtaverne/Udacity--Machine-Learning-Azure-Nanodegree/main/Project%20-%203_%20Capstone%20Project/healthcare-dataset-stroke-data.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    # Register the dataset in the workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | stroke |
|---|---|---|---|---|---|---|
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 1.0 |


## Create compute cluster

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "Udactiy-Project-Cluster"

# Verify that cluster does not exist already
try:
    aml_compute = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D2_V2',
                                                           max_nodes=4)
    aml_compute = ComputeTarget.create(ws, cluster_name, compute_config)

aml_compute.wait_for_completion(show_output=True)



InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Create experiment

In [5]:
# choose a name for experiment
experiment_name = 'Stroke_prediction'

experiment=Experiment(ws, experiment_name)

## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.
To limit the runtime of AutoML, I set the experiment timeout to 30 minutes. The task is classification, with accuracy as the primary metric. To reduce the risk of overfitting, I use cross-validation (2 folds) and enable early stopping.
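
When reading the accuracy metric, it can help to know what a trivial model would score. The sketch below is an illustration rather than part of the original run: it takes the row count and the mean of `stroke` from the `df.describe()` output earlier in this notebook and computes the accuracy of always predicting the majority class (no stroke). The variable names are hypothetical.

```python
# Values taken from the df.describe() output earlier in this notebook.
n_rows = 5110
stroke_rate = 0.048728  # mean of the binary 'stroke' column

# Approximate number of positive (stroke) rows.
n_stroke = round(n_rows * stroke_rate)

# Accuracy of a model that always predicts "no stroke".
baseline_accuracy = (n_rows - n_stroke) / n_rows

print(n_stroke)                     # 249
print(round(baseline_accuracy, 4))  # 0.9513
```

Because the classes are this imbalanced, metrics such as `balanced_accuracy` and `AUC_weighted`, which AutoML also reports for each run, are worth reading alongside accuracy.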

In [10]:
# TODO: Put your automl settings here
automl_settings = {
    "experiment_timeout_minutes": 30,
    "primary_metric": "accuracy",
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(
    task="classification",
    training_data=dataset,
    label_column_name="stroke",
    n_cross_validations=2,
    compute_target=aml_compute,
    enable_early_stopping=True,
    **automl_settings
)

In [11]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
Stroke_prediction,AutoML_e175bf45-5083-47e4-816e-9cdeb563239e,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details
TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [12]:
RunDetails(remote_run).show()
remote_run.wait_for_completion(show_output=True)


Experiment,Id,Type,Status,Details Page,Docs Page
Stroke_prediction,AutoML_e175bf45-5083-47e4-816e-9cdeb563239e,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+------------------------------+--------------------------------+--------------------------------------+
|Size of the smallest class    |Name/Label of the smallest class|Nu

{'runId': 'AutoML_e175bf45-5083-47e4-816e-9cdeb563239e',
 'target': 'Udactiy-Project-Cluster',
 'status': 'Completed',
 'startTimeUtc': '2022-05-25T08:11:11.973591Z',
 'endTimeUtc': '2022-05-25T08:37:02.15817Z',
 'services': {},
   'message': 'No scores improved over last 10 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '2',
  'target': 'Udactiy-Project-Cluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"93eec123-d4d4-4be2-846a-13197de012b4\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions

## Best Model
TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.

In [13]:
# Get best run and model
best_run, fitted_model = remote_run.get_output()

# Print the best run
print(best_run)

# Get all metrics of the best run
best_run_metrics = best_run.get_metrics()

# Print all metrics of the best run
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)

Run(Experiment: Stroke_prediction,
Id: AutoML_e175bf45-5083-47e4-816e-9cdeb563239e_45,
Type: azureml.scriptrun,
Status: Completed)
f1_score_micro 0.7879439518496862
log_loss 0.5562988854684747
AUC_weighted 0.8433999771146986
recall_score_weighted 0.7879439518496862
balanced_accuracy 0.7880306866058427
precision_score_macro 0.7884097992534509
f1_score_weighted 0.7878846136565671
average_precision_score_weighted 0.8337891203489847
AUC_micro 0.841470528992134
norm_macro_recall 0.576061373211667
precision_score_micro 0.7879439518496862
average_precision_score_micro 0.8391374733762549
recall_score_micro 0.7879439518496862
weighted_accuracy 0.7658624421060931
accuracy 0.7879439518496805
AUC_macro 0.8433999771146986
recall_score_macro 0.7880306866058427
f1_score_macro 0.7877474200233474
precision_score_weighted 0.7887704922377754
matthews_correlation 0.5764402348320458
average_precision_score_macro 0.8335924711599372
confusion_matrix aml://artifactId/ExperimentRun/dcid.AutoML_e175bf45-5083-47
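
Two of the metrics printed above are directly related. As a sketch, assuming the definition given in the Azure ML metrics documentation, `norm_macro_recall` rescales `recall_score_macro` so that random guessing maps to 0: `(recall_score_macro - R) / (1 - R)` with `R = 1 / number_of_classes`. The helper name below is hypothetical; the input value is copied from the output above.

```python
def norm_macro_recall(recall_macro, n_classes=2):
    """Rescale macro recall so random guessing scores 0 and a perfect model scores 1."""
    random_recall = 1.0 / n_classes  # expected macro recall of random predictions
    return (recall_macro - random_recall) / (1.0 - random_recall)


recall_score_macro = 0.7880306866058427  # from the best run's metrics
print(norm_macro_recall(recall_score_macro))
```

The result agrees with the reported `norm_macro_recall` of roughly 0.576. For a binary problem, `recall_score_macro` and `balanced_accuracy` are the same quantity, which is why both show 0.788 in the output.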

In [14]:
# Print detailed parameters of the fitted model
def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            print({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            print(step[1].get_params())
            print()

print_model(fitted_model)


datatransformer
{'task': 'classification', 'is_onnx_compatible': False, 'enable_feature_sweeping': True, 'enable_dnn': False, 'force_text_dnn': False, 'feature_sweeping_timeout': 86400, 'featurization_config': None, 'is_cross_validation': True, 'feature_sweeping_config': {}, 'observer': None, 'working_dir': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/notebook196660/code/Users/odl_user_196660'}

prefittedsoftvotingclassifier
{'estimators': ['34', '24', '38', '29', '44', '17', '8', '7', '37', '3', '43', '9'], 'weights': [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.07142857142857142]}

34 - maxabsscaler
{'copy': True}

34 - extratreesclassifier
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.5, 'max_leaf_nodes': None, 'max_sample
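
The `prefittedsoftvotingclassifier` step above combines twelve pipelines using the listed weights. As a rough sketch of what weighted soft voting generally means (not the AutoML implementation itself), each estimator's predicted probability is averaged using its weight. The weights are copied from the output; the per-estimator probabilities and the function name `soft_vote` are made up for illustration.

```python
# Weights from the printed ensemble: eleven estimators at 1/14, one (estimator '3') at 3/14.
weights = [1 / 14] * 9 + [3 / 14] + [1 / 14] * 2

# Hypothetical predicted probabilities of class 1 (stroke) for one patient,
# one value per estimator, in the same order as the weights.
probabilities = [0.2, 0.4, 0.3, 0.5, 0.1, 0.6, 0.3, 0.2, 0.4, 0.9, 0.3, 0.5]


def soft_vote(probs, wts):
    """Weighted average of class probabilities (weights are normalized first)."""
    total = sum(wts)
    return sum(p * w for p, w in zip(probs, wts)) / total


ensemble_probability = soft_vote(probabilities, weights)
predicted_label = int(ensemble_probability >= 0.5)
print(round(ensemble_probability, 4), predicted_label)
```

In this made-up case the estimator with weight 3/14 votes strongly for class 1, but the weighted average still stays below 0.5, so the ensemble predicts class 0.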

In [17]:
#TODO: Save the best model
myModel = best_run.register_model(model_path='outputs/model.pkl', model_name='capstoneModel_automl',
                        tags={'Training context':'Auto ML'},
                        properties={'Accuracy': best_run_metrics['accuracy']})

print(myModel)



Model(workspace=Workspace.create(name='quick-starts-ws-196660', subscription_id='6971f5ac-8af1-446e-8034-05acea24681f', resource_group='aml-quickstarts-196660'), name=capstoneModel_automl, id=capstoneModel_automl:1, version=1, tags={'Training context': 'Auto ML'}, properties={'Accuracy': '0.7879439518496805'})


In [18]:
# List registered models to verify if model has been saved
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')


capstoneModel_automl version: 1
	 Training context : Auto ML
	 Accuracy : 0.7879439518496805




## Model Deployment

Remember that you only have to deploy one of the two models you trained, but you still need to register both models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [25]:
# Download scoring file
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'scoreScript.py')

# Download environment file
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'envFile.yml')


inference_config = InferenceConfig(entry_script='scoreScript.py',
                                    environment=best_run.get_environment())

# deploy
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)
service = Model.deploy(ws, "myservice", [myModel], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)
print(service.state)

print(service.scoring_uri)

print(service.swagger_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-05-25 11:59:25+00:00 Creating Container Registry if not exists.
2022-05-25 11:59:25+00:00 Registering the environment.
2022-05-25 11:59:26+00:00 Use the existing image.
2022-05-25 11:59:27+00:00 Submitting deployment to compute.
2022-05-25 11:59:29+00:00 Checking the status of deployment myservice..
2022-05-25 12:01:46+00:00 Checking the status of inference endpoint myservice.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
http://671ef973-9953-4a3b-939f-a4a05b9c74c2.southcentralus.azurecontainer.io/score
http://671ef973-9953-4a3b-939f-a4a05b9c74c2.southcentralus.azurecontainer.io/swagger.json


TODO: In the cell below, send a request to the web service you deployed to test it.

In [22]:
# Sample five rows from the original pandas DataFrame (df) as test data
test_df = df.sample(5)
# Remove the label column so that only the features are sent to the service
label_df = test_df.pop('stroke')

test_sample = json.dumps({'data': test_df.to_dict(orient='records')})

print(test_sample)


# Set the content type
headers = {'Content-type': 'application/json'}


response = requests.post(service.scoring_uri, test_sample, headers=headers)

# Print results from the inference
print(response.text)

{"data": [{"id": 68794, "gender": "Female", "age": 79.0, "hypertension": 0, "heart_disease": 0, "ever_married": true, "work_type": "Self-employed", "Residence_type": "Urban", "avg_glucose_level": 228.7, "bmi": "26.6", "smoking_status": "never smoked"}, {"id": 62019, "gender": "Male", "age": 54.0, "hypertension": 0, "heart_disease": 0, "ever_married": true, "work_type": "Govt_job", "Residence_type": "Rural", "avg_glucose_level": 87.85, "bmi": "31.1", "smoking_status": "smokes"}, {"id": 63804, "gender": "Female", "age": 27.0, "hypertension": 0, "heart_disease": 0, "ever_married": false, "work_type": "Private", "Residence_type": "Rural", "avg_glucose_level": 55.93, "bmi": "20.3", "smoking_status": "smokes"}, {"id": 25676, "gender": "Female", "age": 7.0, "hypertension": 0, "heart_disease": 0, "ever_married": false, "work_type": "children", "Residence_type": "Rural", "avg_glucose_level": 89.38, "bmi": "19", "smoking_status": "Unknown"}, {"id": 28265, "gender": "Female", "age": 42.0, "hypert
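
The request body printed above has the shape `{"data": [record, ...]}`. A minimal sketch of building such a payload without pandas is shown below. The helper name `build_payload` is hypothetical; the feature values are copied from the first record in the output above, and the label value is invented for illustration.

```python
import json


def build_payload(records, label_column="stroke"):
    """Serialize feature records as {"data": [...]}, dropping the label column if present."""
    features = [
        {key: value for key, value in record.items() if key != label_column}
        for record in records
    ]
    return json.dumps({"data": features})


sample_record = {
    "id": 68794,
    "gender": "Female",
    "age": 79.0,
    "hypertension": 0,
    "heart_disease": 0,
    "ever_married": True,
    "work_type": "Self-employed",
    "Residence_type": "Urban",
    "avg_glucose_level": 228.7,
    "bmi": "26.6",
    "smoking_status": "never smoked",
    "stroke": 0,  # hypothetical label value, removed before sending
}

payload = build_payload([sample_record])
print(payload)
```

The resulting string can be passed to `requests.post` together with the `Content-type: application/json` header, in the same way as `test_sample` in the cell above.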

TODO: In the cell below, print the logs of the web service and delete the service

In [23]:
print(service.get_logs())

2022-05-25T08:55:18,339418200+00:00 - iot-server/run 
2022-05-25T08:55:18,343561800+00:00 - gunicorn/run 
2022-05-25T08:55:18,383469400+00:00 - rsyslog/run 
2022-05-25T08:55:18,366945900+00:00 | gunicorn/run | 
2022-05-25T08:55:18,412249000+00:00 | gunicorn/run | ###############################################
2022-05-25T08:55:18,415461700+00:00 - nginx/run 
2022-05-25T08:55:18,435710300+00:00 | gunicorn/run | AzureML Container Runtime Information
2022-05-25T08:55:18,472516600+00:00 | gunicorn/run | ###############################################
2022-05-25T08:55:18,473813300+00:00 | gunicorn/run | 
2022-05-25T08:55:18,485711100+00:00 | gunicorn/run | 
2022-05-25T08:55:18,597795400+00:00 | gunicorn/run | AzureML image information: openmpi3.1.2-ubuntu18.04:20220516.v1
2022-05-25T08:55:18,621958700+00:00 | gunicorn/run | 
2022-05-25T08:55:18,650301500+00:00 | gunicorn/run | 
2022-05-25T08:55:18,671155500+00:00 | gunicorn/run | PATH environment variable: /azureml-envs/azureml_78c80f660ea0

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.


In [24]:
service.delete()