# Tutorial: Train a diabetes prediction model with Azure Machine Learning and score with ADX
Open data set from: # TODO edit link
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

# Prerequisites
- Enable the Python plugin on your ADX cluster (see the Onboarding section of the [python() plugin](https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/pythonplugin?pivots=azuredataexplorer) doc)
- Whitelist a blob container so that the ADX Python sandbox can access it (see the Appendix section of the doc)
- Create a Python environment (conda or virtual env) that reflects the Python sandbox image
- Install the AML SDK in that environment
- Install the Azure Blob Storage SDK in that environment (install the older version, v2.1, because the newer version is currently incompatible with the azure-kusto-ingest package)

# Set up your AML environment
1. Import Python packages
2. Create (or connect to) an AML workspace
3. Create (or connect to) a remote compute target to use for training
4. Create an experiment to track all your runs

## Instructions:
- [Kusto Query Language (KQL) overview](https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/) 
- This module was created when ADX supported Python 3.6.* and azure-storage-blob 2.1, so it uses those specific library versions. ADX is going to upgrade its Python version to 3.9.12, so in the future you might be able to use the latest library versions.


# Importing AML packages

In [None]:
import sys
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies


print(sys.version)
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Create workspace
If the workspace already exists, connect to it

In [None]:
ws = Workspace.create(
    name = "Your Workspace Name",
    subscription_id = "Your Subscription Id",
    resource_group = "Your Resource Group", 
    location = "Your location",  # e.g "westus"
    exist_ok = True,
    show_output = True)

ws.write_config()

In [3]:
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

vmalu-ml	uksouth	vmalu-rg


## Create experiment
Create an experiment to track the runs in your workspace

In [None]:
exp = Experiment(workspace=ws, name="Prediction-Diabetes")

## Create or attach existing compute resource
Azure Machine Learning Compute is a managed service that lets data scientists train machine learning models on clusters of Azure virtual machines. Here you create an Azure Machine Learning Compute target for model training.

**Creation of compute takes approximately 5 minutes**. If the AmlCompute with that name is already in your workspace the code will skip the creation process.

In [None]:
compute_name = "vmalu-ci"
vm_sku = "STANDARD_E4AS_V4"

if compute_name in ws.compute_targets:
    # Reuse the compute target that already exists under this name
    compute_target = ws.compute_targets[compute_name]
    if compute_target and isinstance(compute_target, azureml.core.compute.computeinstance.ComputeInstance):
        print("found compute target: " + compute_name)
else:
    # No compute target with this name yet, so provision a new AmlCompute cluster
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_sku, min_nodes=1, max_nodes=2)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=10)

## Install Kqlmagic

In [None]:
!pip install Kqlmagic --no-cache-dir  --upgrade

**Note: The KQLMAGIC_AZUREML_COMPUTE environment variable should be set; otherwise, popup windows might not function properly.**

Set it to the web address of your compute:
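
The address follows a fixed pattern, so it can be assembled from the workspace details instead of being edited by hand. The following is a minimal sketch, assuming the four `{-}` placeholders stand for the compute name, subscription ID, resource group, and workspace name, in that order. The helper name `build_compute_url` and the sample values are hypothetical.

```python
import os

# Template copied from the cell below; only the placeholder names are added.
COMPUTE_URL_TEMPLATE = (
    "https://ml.azure.com/compute/{compute_name}/details"
    "?wsid=/subscriptions/{subscription_id}"
    "/resourcegroups/{resource_group}"
    "/providers/Microsoft.MachineLearningServices"
    "/workspaces/{workspace_name}"
)


def build_compute_url(compute_name, subscription_id, resource_group, workspace_name):
    """Fill the Azure ML compute details URL with workspace-specific values."""
    return COMPUTE_URL_TEMPLATE.format(
        compute_name=compute_name,
        subscription_id=subscription_id,
        resource_group=resource_group,
        workspace_name=workspace_name,
    )


# Placeholder values for illustration only; substitute your own.
url = build_compute_url("my-compute", "0000-sub-id", "my-rg", "my-workspace")
os.environ["KQLMAGIC_AZUREML_COMPUTE"] = url
print(url)
```

With a connected workspace, `ws.subscription_id`, `ws.resource_group`, and `ws.name` can supply the last three arguments.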


In [7]:
import os
os.environ['KQLMAGIC_AZUREML_COMPUTE']="https://ml.azure.com/compute/{-}/details?wsid=/subscriptions/{-}/resourcegroups/{-}/providers/Microsoft.MachineLearningServices/workspaces/{-}"

In [None]:
# Load kql 
%reload_ext Kqlmagic

## Connect to the Azure Data Explorer
Authenticate by following the steps shown in the cell output.

In [None]:
%kql kusto://code;cluster='your adx cluster';database='your database'

## Explore data
Before you train a model, you need to understand the data that you are using to train it.
- Fetch the diabetic prediction dataset from Kusto using KqlMagic
- Display some records

In [None]:
# Fetch the Diabetes table from ADX and store the result in res
%kql res << Diabetes
df = res.to_dataframe() 
print(df.shape)
df[:4]

## Copy the data from ADX to a blob container to access it from AML
Notes:
- We copy the input data using KqlMagic to a blob container in the storage account that was allocated for the AML workspace
- You can create the blob container using Azure Storage Explorer, and extract its SAS token by right-clicking it
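
The copy step comes down to two small operations: building an `.export` command that points at the container, and reading the blob name from the path that the command returns. The following is a minimal sketch of both. The helper names, the account name, and the sample path are hypothetical, and the command text mirrors the one used in this notebook.

```python
def build_export_command(account, container, storage_key, table, name_prefix):
    """Build a Kusto .export command that writes `table` as CSV to a blob container."""
    # h@'...' marks the literal as obfuscated so the storage key is hidden in Kusto logs.
    container_uri = f"https://{account}.blob.core.windows.net/{container};{storage_key}"
    return (
        f".export to csv (h@'{container_uri}') "
        f"with(namePrefix={name_prefix}, includeHeaders=all) <| {table}"
    )


def blob_name_from_path(path):
    """Return the last path segment, i.e. the blob name, of an exported file path."""
    return path.rstrip("/").split("/")[-1]


# Placeholder values for illustration only.
command = build_export_command("myaccount", "kusto", "<storage-key>", "Diabetes", "Diabetes")
print(command)
print(blob_name_from_path("https://myaccount.blob.core.windows.net/kusto/Diabetes_1.csv"))
```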

In [None]:
aml_storage_account = "Your storage account" # you can use the storage account that was created automatically as part of the AML workspace
aml_container_name = "kusto"
aml_sas_token = "Your SAS Token for this container"
aml_storage_key = "Your storage account key"
os.environ['AZURE_STORAGE_CONNECTION_STRING']= "Your storage account connection string"

In [None]:
blob_container_uri = f"https://{aml_storage_account}.blob.core.windows.net/{aml_container_name};{aml_storage_key}"
copy_query = f".export to csv (h@'{blob_container_uri}') with(namePrefix=Diabetes, includeHeaders=all) <| Diabetes"
print(copy_query)

%kql res << -query copy_query
data_blob_name = res.to_dataframe()["Path"][0].split('/')[-1]
print("\ndata blob name is: ", data_blob_name)


**Note:** This notebook supports the latest version of azure-storage-blob. The cells below use the latest API and modules to download and upload blobs.

In [None]:
# Test downloading the blob

from azure.storage.blob import BlobServiceClient, ContainerClient, __version__
import pandas as pd
try:
    print("Azure Blob Storage v" + __version__ + " - Python quickstart sample")
    connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    container_client = blob_service_client.get_container_client(aml_container_name)
    local_path = './'
    download_file_path = os.path.join(local_path, 'Diabetes.csv')
    with open(download_file_path, "wb") as download_file:
        download_file.write(container_client.download_blob(data_blob_name).readall())
        print( data_blob_name + " downloaded")

except Exception as ex:
    print('Exception:')
    print(ex)

df = pd.read_csv('Diabetes.csv')
print(df.shape)
df[-4:]

## Training on a remote cluster
Here we submit the job to run on the remote compute target we set up earlier. To submit a job we:
- Create a directory for all files to be uploaded to the remote cluster
- Create a training script
- Create a script run configuration
- Submit the job

## Create a directory
Create a directory that holds all files to upload to the remote cluster

In [63]:
import os
script_folder = os.path.join(os.getcwd(), "to-upload")
os.makedirs(script_folder, exist_ok=True)

## Create a training script
To submit the job to the cluster, we need to create a training script. Here we create train.py in the to-upload directory

In [None]:
%%writefile "$script_folder/train.py"

# Import libraries
from azureml.core import Run
import argparse
import pandas as pd
import numpy as np
import pickle
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from azure.storage.blob import BlockBlobService

# Get the experiment run context
run = Run.get_context()

parser = argparse.ArgumentParser()

parser.add_argument('--account', type=str,  help='storage account name')
parser.add_argument('--container', type=str, help='blob container name')
parser.add_argument('--blob', type=str, help='blob name')
parser.add_argument('--sas', type=str,  help='SAS token')

args = parser.parse_args()

storage_account = args.account
container_name = args.container
blob_name = args.blob
sas_token = args.sas

try:
    block_blob_service = BlockBlobService(account_name=storage_account, sas_token=sas_token)
    block_blob_service.get_blob_to_path(container_name, blob_name, 'diabetes.csv')

except Exception as ex:
    print('Exception:')
    print(ex)


# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization hyperparameter
reg = 0.01

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)

with open('outputs/diabetes_model.pkl', 'wb') as handle:
    pickle.dump(model, handle)

run.complete()
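
The two metrics logged by the script can be reproduced by hand, which helps when reading the run record. The following is a minimal sketch in plain Python with made-up labels and scores (not taken from the diabetes data). Accuracy is the share of matching predictions, and AUC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The helper names are hypothetical; the script itself relies on scikit-learn's `roc_auc_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]                 # predicted probability of class 1
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("Accuracy:", accuracy(y_true, y_pred))   # 0.75
print("AUC:", auc(y_true, scores))             # 0.75
```

In `LogisticRegression`, `C` is the inverse of the regularization strength, so `reg = 0.01` gives `C = 100`, which means weak regularization.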


## Run the training script as an experiment
Now you're ready to run the script as an experiment. The default environment does not include the ML and Azure packages that the script needs, so you must add them explicitly to the configuration; here they come from the conda specification in environment.yml. The conda environment is built on demand the first time the experiment runs and is cached for later runs that use the same configuration, so the first run takes a little longer.
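
The run configuration loads a conda specification from a file named `environment.yml`, which is not shown in this notebook. The following is a minimal sketch of how such a file could be generated. The package list is an assumption: Python 3.6 and azure-storage-blob 2.1 follow the version notes in this tutorial, and every other entry is a guess that should be adapted to your sandbox image. The helper name `write_environment_file` is hypothetical.

```python
import os
import tempfile

# Assumed package list: Python 3.6 and azure-storage-blob 2.1 follow the
# version notes in this tutorial; every other entry is a guess to adapt.
ENVIRONMENT_YAML = """\
name: adx_sandbox_env
dependencies:
  - python=3.6
  - pip
  - pip:
      - azureml-defaults
      - azure-storage-blob==2.1.0
      - scikit-learn
      - pandas
      - numpy
"""


def write_environment_file(path="environment.yml"):
    """Write the conda specification to `path` and return the path."""
    with open(path, "w") as handle:
        handle.write(ENVIRONMENT_YAML)
    return path


# Demonstration in a temporary directory so no existing file is overwritten.
demo_path = write_environment_file(os.path.join(tempfile.mkdtemp(), "environment.yml"))
with open(demo_path) as handle:
    print(handle.read())
```

Calling `write_environment_file()` without arguments writes the file next to the notebook, where `Environment.from_conda_specification` expects it.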

In [None]:

# Create an environment that installs the required packages
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.widgets import RunDetails

env = Environment.from_conda_specification("adx_sandbox_env", "environment.yml")

# Pass the storage details to train.py and run it on the remote compute target
script_config = ScriptRunConfig(source_directory=script_folder,
              script='train.py',
              arguments=['--account', aml_storage_account, '--container', aml_container_name, '--blob', data_blob_name, '--sas', aml_sas_token],
              compute_target=compute_target,
              environment=env)

run = exp.submit(config=script_config)
RunDetails(run).show()
# specify show_output to True for a verbose log
run.wait_for_completion(show_output=True)

## Register model
The training script pickled the model to a file and wrote it to a directory named outputs on the VM of the cluster where the job ran. outputs is a special directory: all of its content is automatically uploaded to the workspace and appears in the run record of the experiment. Hence, the model file is now also available in the workspace.

In [None]:
model = run.register_model(model_name='Diabetic-Prediction', model_path='outputs/diabetes_model.pkl')
print(model.name, model.id, model.version, sep='\t')

## Scoring in ADX
There are two options for retrieving the model for scoring:

- Serialize the model to a string and store it in a standard table in ADX. This works only for small models; otherwise you get the error "_Request is invalid and cannot be processed: Syntax error: SYN0009: Query length (354646170) too large (max: 2097152) [line:position=0:0]_"
- Copy the model to a blob container (one that was previously whitelisted for access by the ADX Python sandbox). This tutorial uses this option.
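
The first option can be sketched as follows. This is a minimal illustration, assuming the model is pickled and hex-encoded so that it fits in a string column; the helper names are hypothetical, and a small dictionary stands in for a trained model. The size check uses the limit quoted in the error message above.

```python
import pickle

MAX_QUERY_LENGTH = 2097152  # limit quoted in the error message above


def serialize_model(model):
    """Pickle an object and return it as a hex string that fits in a string column."""
    return pickle.dumps(model).hex()


def deserialize_model(text):
    """Rebuild the object from the hex string produced by serialize_model."""
    return pickle.loads(bytes.fromhex(text))


def fits_in_query(text, limit=MAX_QUERY_LENGTH):
    """Rough check: the serialized model alone must be shorter than the query limit."""
    return len(text) < limit


# A small dictionary stands in for a trained model.
stand_in_model = {"coef": [0.1, -0.2, 0.3], "intercept": 0.5}
encoded = serialize_model(stand_in_model)
print(len(encoded), fits_in_query(encoded))
print(deserialize_model(encoded) == stand_in_model)
```

Hex encoding doubles the size of the pickled bytes, so the string approach runs into the query limit at about half the limit in raw model size.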

In [None]:
model_path = model.download(exist_ok=True)
model_path

## Copy the model to a blob container 

In [None]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__
try:
    print("Azure Blob Storage v" + __version__ + " - Python quickstart sample")
    connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

    model_name = 'diabetes_model.pkl'
    upload_file_path = os.path.join('./', model_path)
    
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    blob_client = blob_service_client.get_blob_client(container=aml_container_name, blob=model_name)

    print("\nUploading to Azure Storage as blob:\n\t" + model_name)

    # Upload the created file
    with open(upload_file_path, "rb") as data:
        blob_client.upload_blob(data)
    
except Exception as ex:
    print('Exception:')
    print(ex)

In [None]:
# Copy Blob SAS URL of the uploaded model 
model_uri = " Blob SAS URL "

## Scoring from model which is stored in blob storage

In [None]:
scoring_from_blob_query = r'''
let classify_sf=(samples:(*), model_sas:string, features_cols:dynamic, pred_col:string)
{
    let kwargs = pack('model_sas', model_sas, 'features_cols', features_cols, 'pred_col', pred_col);
    let code =
    '\n'
    'import pickle\n'
    '\n'
    'model_sas = kargs["model_sas"]\n'
    'features_cols = kargs["features_cols"]\n'
    'pred_col = kargs["pred_col"]\n'
    'with open("/Temp/model.pkl", "rb") as f:\n'
    '   bmodel = f.read()\n'
    'clf1 = pickle.loads(bmodel)\n'
    'df1 = df[features_cols]\n'
    'predictions = clf1.predict(df1)\n'
    '\n'
    'result = df\n'
    'result[pred_col] = pd.DataFrame(predictions, columns=[pred_col])'
    '\n'
    ;
    samples | evaluate python(typeof(*), code, kwargs,
        external_artifacts=pack('model.pkl', model_sas))
};
Diabetes 
| extend pred_Diabetic=0
| invoke classify_sf('$model_uri$',
                     pack_array('Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age'), 'pred_Diabetic')
| summarize n=count() by Diabetic, pred_Diabetic      //  confusion matrix
'''

In [None]:
scoring_from_blob_query = scoring_from_blob_query.replace('$model_uri$', model_uri)

In [None]:
%kql res << -query scoring_from_blob_query
df = res.to_dataframe()
print('Confusion Matrix')
df
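
The confusion matrix rows can be condensed into a few summary figures. The following is a minimal sketch, assuming each row holds the actual label, the predicted label, and a count, with labels encoded as 0 and 1. The helper name is hypothetical and the counts are made up for illustration; they are not results from this tutorial.

```python
def summarize_confusion(rows):
    """Compute accuracy, precision, and recall from (actual, predicted, count) rows."""
    counts = {(actual, predicted): n for actual, predicted, n in rows}
    tp = counts.get((1, 1), 0)  # diabetic, predicted diabetic
    tn = counts.get((0, 0), 0)  # not diabetic, predicted not diabetic
    fp = counts.get((0, 1), 0)  # not diabetic, predicted diabetic
    fn = counts.get((1, 0), 0)  # diabetic, predicted not diabetic
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up counts for illustration only.
rows = [(0, 0, 50), (0, 1, 10), (1, 0, 15), (1, 1, 25)]
metrics = summarize_confusion(rows)
print(metrics)
```

If the Diabetic and pred_Diabetic columns of the query result hold 0/1 values, the rows could be taken from `df[['Diabetic', 'pred_Diabetic', 'n']].itertuples(index=False)`.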

# Summary
In this tutorial you learned how to train a model in AML and then use ADX for scoring.
- ADX scoring is done near the data, on the same ADX compute nodes, enabling near-real-time processing of large amounts of new data. There is no need to export the data to an external scoring service and import the results back. Consequently, the scoring architecture is simpler, and performance is much faster and more scalable.