## Sample Diabetes Classification Model Training using AutoML
This sample notebook demonstrates how to create a machine learning model that predicts the likelihood of diabetes using Azure Machine Learning service.

* Azure AutoML is used to automatically select the algorithm and hyperparameters.
* The sample also shows how to decorate the scoring code with schema decorators so that the inferencing web service can be easily consumed by Power BI.

This is a two-part solution. This first notebook trains the model and creates a Docker image to be used for inferencing; the second notebook, <a href="./deploy_model.ipynb">deploy_model</a>, shows how to deploy the trained model to an Azure Kubernetes Service cluster.

**Please note that this is only a sample that demonstrates the capability of the service; it does not guarantee quality beyond the scope of this demo.**


In [None]:
import azureml.core
import logging
import os
import pandas as pd

from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import AksCompute, AmlCompute, ComputeTarget
from azureml.core import Datastore
from azureml.core.runconfig import DataReferenceConfiguration
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialization
Initialize the connection to the AML Workspace and set the variables used throughout the notebook.

In [None]:
#TODO: Update the config settings as per your environment
subscription_id = "<TODO>"
resource_group = "<TODO>"
workspace_name = "<TODO>"

In [None]:
try:
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    # write the details of the workspace to a configuration file to the notebook library
    ws.write_config()
    print("Workspace configuration succeeded. Skip the workspace creation steps below")
except Exception:
    print("Workspace not accessible. Change your parameters or create a new workspace below")

In [None]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

In [None]:
# Choose a name for your training cluster.
amlcompute_cluster_name = "traincluster"
experiment_name = 'diabetes-classification'
project_folder = './project-temp-files'

image_name = "diabclassprob"

## Compute Target for Training
Training is performed on a remote AML Compute cluster. The AML Workspace is queried for its existing Compute Targets: if a cluster with the chosen name and the type `AmlCompute` already exists, it is reused; otherwise a new cluster is created.

In [None]:
found = False

# Check if this compute target already exists in the workspace.

cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]

if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 2)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)

    # For a more detailed view of current AmlCompute status, use get_status().

### Initialize AzureML Experiment object

In [None]:
# Create the experiment object using the experiment name chosen earlier.
experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

### Setup Training Data
Training data is included in the repository with the code. It is read into a Pandas DataFrame to show a few sample rows. Since training occurs on a remote cluster, the data is also uploaded to the AML Workspace default datastore (Azure Blob storage) so that the remote compute can use it for training.

In [None]:
data_folder = os.path.join(os.getcwd(),'data')
data_file = os.path.join(data_folder, 'diabetes_classification_dataset.csv')
print(data_folder)
print(data_file)

df = pd.read_csv(data_file)
df.head()

In [None]:
ds = ws.get_default_datastore()
ds.upload(src_dir=data_folder, target_path='diabetes_classification', overwrite=True, show_progress=True)

### Setup Run Configuration for Aml Compute Nodes
Training is performed on AML Compute nodes, so the run-time dependencies need to be specified. These include the packages needed for training as well as a reference describing how the data is made available to the training code.
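
The download location configured in the data reference and the path read by the training script have to agree. As a rough sketch, assuming the data reference places the `path_on_datastore` folder beneath `path_on_compute` (which is what the hard-coded path in `get_data.py` suggests), the file path can be composed as follows. The variable `expected_path` is only an illustrative name and not part of the SDK.

```python
import posixpath

# Values taken from the upload and DataReferenceConfiguration cells of this notebook.
path_on_compute = "/tmp/azureml_runs"
path_on_datastore = "diabetes_classification"
file_name = "diabetes_classification_dataset.csv"

# Assumption: the folder is downloaded to <path_on_compute>/<path_on_datastore>.
expected_path = posixpath.join(path_on_compute, path_on_datastore, file_name)
print(expected_path)
```

If either path setting is changed, the path inside `get_data.py` needs the matching change.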

In [None]:
dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore='diabetes_classification', 
                   path_on_compute='/tmp/azureml_runs',
                   mode='download', # download files from datastore to compute target
                   overwrite=False)

In [None]:
# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target
conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE

# set the data reference of the run configuration
conda_run_config.data_references = {ds.name: dr}

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])
conda_run_config.environment.python.conda_dependencies = cd

In [None]:
# Create a project_folder if it doesn't exist
if not os.path.exists(project_folder):
    os.makedirs(project_folder)


In [None]:
%%writefile ./project-temp-files/get_data.py
import pandas as pd
import os

def get_data():     
    df = pd.read_csv("/tmp/azureml_runs/diabetes_classification/diabetes_classification_dataset.csv")
    print('after pd.read_csv')    
    # get integer labels
    y = df["diabetes"]
    df = df.drop("diabetes", axis=1)    
    return { "X" : df, "y" : y.values }

### Setup AutoML 
Initialize AutoML configuration and submit the training run
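
The iteration settings used in this notebook's AutoML configuration give a rough feel for how long the training iterations can take. The sketch below is only a back-of-the-envelope estimate: it assumes iterations run in full batches of `max_concurrent_iterations` and that each one takes at most `iteration_timeout_minutes`, and it ignores cluster provisioning, environment setup, and any extra work the service performs.

```python
import math

# Settings copied from the AutoMLConfig cell in this notebook.
iterations = 10
max_concurrent_iterations = 2
iteration_timeout_minutes = 10

# Assumption: iterations are processed in batches ("waves") of concurrent runs.
waves = math.ceil(iterations / max_concurrent_iterations)
upper_bound_minutes = waves * iteration_timeout_minutes
print(waves, upper_bound_minutes)
```

The concurrency value matches the `max_nodes = 2` setting of the training cluster created earlier, so both iterations in a batch can run on their own node.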

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = project_folder,
                             #compute_target = compute_target,
                             run_configuration=conda_run_config,
                             data_script = project_folder + "/get_data.py",
                             iteration_timeout_minutes = 10,
                             iterations = 10,
                             n_cross_validations = 5,
                             primary_metric = 'AUC_weighted',
                             preprocess = True,
                             max_concurrent_iterations = 2,
                             verbosity= logging.INFO
                            )

In [None]:
remote_run = experiment.submit(automl_config, show_output = False)
remote_run

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

In [None]:
# Wait until the run finishes.
remote_run.wait_for_completion(show_output = True)

## Inspect and Register best Model to Model Registry
AutoML creates multiple models. The best model is retrieved for inspection and then registered with the Model Registry, to be used in Docker image creation and eventually for inferencing. Models generated by AutoML can be inspected to see which transformations were applied to the features.

In [None]:
best_run, fitted_model = remote_run.get_output()

In [None]:
fitted_model.named_steps['datatransformer'].get_engineered_feature_names()

In [None]:
fitted_model.named_steps['datatransformer'].get_featurization_summary()

In [None]:
model = best_run.register_model(model_name = 'diabclassmodel', model_path= 'outputs/model.pkl')

### Create Docker Image to be used for Inferencing
Create a Docker image containing the scoring file, the trained model, and the Conda dependencies, so that it can expose a web service for inferencing.

##### Azure Machine Learning and Power BI Integration
An important detail in the scoring file is the pair of decorators <i>input_schema</i> and <i>output_schema</i>. They result in a Swagger endpoint being exposed, which Power BI uses to identify the input parameters of the service call as well as the shape of the results.
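
To illustrate what the input schema implies for a caller, here is a minimal sketch of a request body. It assumes the service expects the DataFrame under the `data` key (the name passed to `input_schema`) as a list of row records; the exact wire format depends on the inference-schema version, so the generated Swagger document remains the authoritative reference.

```python
import json

# Column names and sample values copied from input_dict in score.py.
record = {
    "pregnancies": 6,
    "plasma glucose": 148,
    "blood pressure": 72,
    "triceps skin thickness": 35,
    "insulin": 0,
    "bmi": 33.6,
    "diabetes pedigree": 0.627,
    "age": 50,
}

# Assumption: one JSON object per row, wrapped in a list under the "data" key.
payload = json.dumps({"data": [record]})
print(payload)
```

Several rows can be scored in one call by adding more records to the list.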


In [None]:
%%writefile score.py
# Scoring Script will need model id from registered model
import json
import numpy as np
import os
import pickle
import pandas as pd
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression

from azureml.core.model import Model
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType

import azureml.train.automl

def init():
    global model
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path('diabclassmodel') # update this based on previously registered model
    print(model_path)
    model = joblib.load(model_path)

input_dict = {
    "pregnancies": [6],
    "plasma glucose": [148] ,
    "blood pressure": [72],
    "triceps skin thickness": [35],
    "insulin": [0],
    "bmi": [33.6],
    "diabetes pedigree": [0.627],
    "age": [50]
}

output_dict = {
    "prediction": [1],
    "probability": [.89]
}

input_sample = pd.DataFrame(input_dict)
output_sample =  pd.DataFrame(output_dict)
#output_sample = np.array([("1",.90), ("0",.84)])

@input_schema('data', PandasParameterType(input_sample))
@output_schema(PandasParameterType(output_sample))
def run(data):
    # grab and prepare the data
    # make prediction
    try:
        print('inside the method')                      
        result_df = pd.DataFrame(columns = ["prediction","probability"]) 
        
        pred = model.predict(data)
        prob = model.predict_proba(data)
        
        print(pred)
        print(prob)
                
        for idx,val in enumerate(pred):
            print("index:",idx, "value:", val)
            print(val)
            print(prob[idx][int(val)])
            result_df = result_df.append({"prediction": val, "probability": prob[idx][int(val)]}, ignore_index=True)
            
    except Exception as e:
        print("Exception Caught")
        print(str(e))
        return ["exception", str(e)]    
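    # Use a dedicated name: assigning to `str` would make it a local variable
    # and break the str(e) calls in the except block above.
    result_json = result_df.to_json(orient = 'records')
    return json.loads(result_json)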
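
The loop in `run()` pairs each predicted label with the probability of that same label by indexing the probability row with the prediction. The sketch below reproduces that indexing with plain lists. The values are hypothetical, and the approach assumes the class labels are the integers 0 and 1, in the same order as the columns returned by `predict_proba` (scikit-learn orders these columns by the model's `classes_`).

```python
# Hypothetical model output for two rows; columns are P(class 0), P(class 1).
pred = [1, 0]
prob = [[0.11, 0.89], [0.84, 0.16]]

# Pick, for every row, the probability that belongs to the predicted class.
results = [
    {"prediction": label, "probability": row[int(label)]}
    for label, row in zip(pred, prob)
]
print(results)
```

If the labels were strings or did not start at 0, the column would have to be looked up through `classes_` instead of using the label as an index.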

In [None]:
myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'], pip_packages=['azureml-sdk[automl]', 'inference-schema[numpy-support,pandas-support]'])

conda_env_file_name = 'mydeployenv.yml'
myenv.save_to_file('.', conda_env_file_name)

In [None]:
from azureml.core.image import Image, ContainerImage

image_config = ContainerImage.image_configuration(runtime= "python",
                                 execution_script="score.py",
                                 conda_file="mydeployenv.yml",
                                 tags = {'area': "diabetes", 'type': "classification"},
                                 description = "Diabetes Classification with probability implemented using AutoML")

image = Image.create(name = image_name,
                     # this is the model object. note you can pass in 0-n models via this list-type parameter
                     # in case you need to reference multiple models, or none at all, in your scoring script.
                     models = [model],
                     image_config = image_config, 
                     workspace = ws)

In [None]:
image.wait_for_creation(show_output = True)