# Tutorial: Train a diabetes prediction model and deploy it as a web service
[This dataset](https://github.com/maluvinita/Tutorials/tree/main/data) is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

This tutorial covers the following:
- Register the diabetes data as tabular and file datasets in the workspace.
- Train a model using the file dataset and register it.
- Deploy the model as a real-time inferencing web service and use the service to predict whether a patient is diabetic.

## Importing AML packages

In [None]:
import sys
import os
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

print(sys.version)
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Create workspace
If the workspace already exists, the code below connects to it instead of creating a new one.

In [None]:
ws = Workspace.create(
    name = "Your Workspace Name",
    subscription_id = "Your Subscription Id",
    resource_group = "Your Resource Group",
    location = "Your location",  # e.g. "westus"
    exist_ok = True,
    show_output = True)

ws.write_config()

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

## Create or attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Here you create an Azure Machine Learning Compute cluster for model training.

Creating the compute takes approximately **5 minutes**. If an AmlCompute target with that name already exists in your workspace, the code skips the creation step.

In [None]:
compute_name = "Your compute name"
vm_sku = "your VM sku"

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    # Reuse the existing target only if it is an AmlCompute cluster
    if isinstance(compute_target, AmlCompute):
        print("found compute target: " + compute_name)
    else:
        print(compute_name + " exists but is not an AmlCompute cluster: " + type(compute_target).__name__)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_sku, min_nodes=1, max_nodes=2)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=10)

## Create a folder to store the diabetes dataset

In [None]:
script_folder = os.path.join(os.getcwd(), "diabetes-data")
os.makedirs(script_folder, exist_ok=True)

## Work with Data
Managing data centrally in the cloud makes it accessible to teams of data scientists who run experiments and train models on multiple workstations. In Azure ML, datastores are references to storage locations, such as Azure Storage blob containers. Every workspace has a default datastore, usually the Azure storage blob container that was created with the workspace. If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.

## View datastores
Run the following code to determine the datastores in your workspace:

In [None]:
# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

## Upload data to a datastore
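We can upload files from our local file system to a datastore so that they are accessible to experiments running in the workspace, regardless of where the experiment script is actually being run. Before running the cell, copy the diabetes CSV file(s) from the dataset link above into the **diabetes-data** folder created earlier.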

In [None]:
# uploading file to workspace blob store

from azureml.core import Dataset
from azureml.data.datapath import DataPath

Dataset.File.upload_directory(src_dir='diabetes-data',
                              target=DataPath(default_ds, 'diabetes-data/')
                              )

## Work with datasets
AML provides an abstraction for data in the form of datasets. A dataset is a versioned reference to a specific set of data that we may want to use in an experiment. Datasets can be tabular or file-based.

## Create a tabular dataset
Let's create a dataset from the diabetes data you uploaded to the datastore, and view the first 20 records.

In [None]:
#Create a tabular dataset from the path on the datastore (this may take a short while)
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))

# Display the first 20 rows as a Pandas dataframe
tab_data_set.take(20).to_pandas_dataframe()


## Create a file Dataset
We might need to work with data that is unstructured, or simply want to handle reading the data from files ourselves. To accomplish this, we can use a file dataset, which creates a list of file paths in a virtual mount point that we can use to read the data in the files.

In [None]:
from azureml.core import Dataset
from azureml.data.datapath import DataPath

#Create a file dataset from the path on the datastore (this may take a short while)
file_data_set = Dataset.File.from_files(path=(default_ds, 'diabetes-data/*.csv'))

# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)
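
Once the files of a file dataset are available in a folder, reading them is ordinary file handling. The following is a minimal sketch using only the standard library; the folder, file names and rows are made up for illustration and stand in for the real mount or download location.

```python
import csv
import glob
import os
import tempfile

def read_csv_folder(folder):
    """Read every *.csv file in a folder into one list of row dictionaries."""
    rows = []
    # sorted() makes the file order deterministic; glob gives no ordering guarantee
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

# Build a tiny stand-in for the dataset folder (hypothetical rows, not real patient data)
demo_folder = tempfile.mkdtemp()
for name, lines in [("part1.csv", ["Age,Diabetic", "22,0", "45,1"]),
                    ("part2.csv", ["Age,Diabetic", "31,0"])]:
    with open(os.path.join(demo_folder, name), "w", newline="") as f:
        f.write("\n".join(lines) + "\n")

rows = read_csv_folder(demo_folder)
print(len(rows), "rows read from", demo_folder)
```

The training script later in this tutorial follows the same pattern, but concatenates the files into a pandas DataFrame instead of a list of dictionaries.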

We'll register the tabular dataset as **diabetes dataset**, and the file dataset as **diabetes file**.

In [None]:
# Register the tabular dataset

try:
    tab_data_set = tab_data_set.register(workspace=ws, 
                                        name='diabetes dataset',
                                        description='diabetes data set',
                                        tags = {'format':'CSV'},
                                        create_new_version=True)
except Exception as ex:
    print(ex)

# Register the file dataset
try:
    file_data_set = file_data_set.register(workspace=ws,
                                            name='diabetes file',
                                            description='diabetes data file',
                                            tags = {'format':'CSV'},
                                            create_new_version=True)
except Exception as ex:
    print(ex)

print('Datasets registered')

We can view and manage datasets on the **Datasets** page for your workspace in [Azure Machine Learning studio](https://ml.azure.com). You can also get a list of datasets from the workspace object:

In [None]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

## Train a model from a file dataset
Run the following two code cells to create:

- A folder named diabetes_training_from_file_dataset
- A script that trains a classification model by using a file dataset that is passed to it as an input.

In [None]:
import os

# Create a folder for the experiment files
experiment_folder = 'diabetes_training_from_file_dataset'
os.makedirs(experiment_folder, exist_ok=True)
print(experiment_folder, 'folder created')

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
from azureml.core import Dataset, Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import glob

# Get script arguments (regularization rate and file dataset mount point)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--input-data', type=str, dest='dataset_folder', help='data mount point')
args = parser.parse_args()

# Set regularization hyperparameter (passed as an argument to the script)
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
print("Loading Data...")
data_path = args.dataset_folder # Folder containing the training CSV files, passed in via --input-data

# Read the files
all_files = glob.glob(data_path + "/*.csv")
diabetes = pd.concat((pd.read_csv(f) for f in all_files), sort=False)

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))  # built-in float: np.float was removed in NumPy 1.24
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# Note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()
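
To make the two logged metrics concrete, here is a small sketch that computes them by hand. It is not how scikit-learn implements `roc_auc_score`; it uses the equivalent pairwise definition of AUC (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counting half). The labels, scores and the 0.5 threshold are made-up illustration values.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold the scores to get class labels

print("Accuracy:", accuracy(y_true, y_pred))   # 3 of 4 labels match -> 0.75
print("AUC:", pairwise_auc(y_true, scores))    # 3 of 4 pairs ranked correctly -> 0.75
```

Accuracy depends on the classification threshold, while AUC only depends on how the scores rank the patients, which is why the script logs both.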

## Run the training script as an experiment
The conda environment is built on-demand the first time the experiment is run, and cached for future runs that use the same configuration; so the first run will take a little longer.
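
The cell below reads the environment definition from a file named `environment.yml`, which this tutorial does not list. As a sketch only, the following writes one possible version of that file. The package list and the version pins are assumptions: they were chosen to cover the imports in the training script and to mirror the Python and scikit-learn versions in the name of the inference environment used later, so adjust them to your setup.

```python
import os

# Hypothetical environment definition: the package list and pins are assumptions,
# not the original author's file
ENVIRONMENT_YML = """\
name: simple-env
dependencies:
  - python=3.7
  - scikit-learn=0.24.1
  - pandas
  - numpy
  - pip
  - pip:
      - azureml-defaults
"""

def write_environment_file(path="environment.yml"):
    """Write the conda specification to disk and return its absolute path."""
    with open(path, "w") as f:
        f.write(ENVIRONMENT_YML)
    return os.path.abspath(path)

print("wrote", write_environment_file())
```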

In [None]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.widgets import RunDetails
import shutil

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes file")

# Create a Python environment for the experiment (from a .yml file)
env = Environment.from_conda_specification("simple-env", "environment.yml")

# Create a script config
# Note: no compute_target is passed, so the run executes on the machine running this notebook;
# pass compute_target=compute_target to run on the cluster created earlier
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='diabetes_training.py',
                                arguments = ['--regularization', 0.1, # Regularization rate parameter
                                             '--input-data', diabetes_ds.as_download(path_on_compute="/tmp/training_files")], # Reference to dataset location
                                environment=env, # Use the environment created previously
                                )


#Create an experiment to track the runs in your workspace
experiment_name = 'predict-diabetes'
experiment = Experiment(workspace=ws, name=experiment_name)

# Remove any previous download, because existing files are not overwritten
download_file_path = "/tmp/training_files"
if os.path.exists(download_file_path):
    shutil.rmtree(download_file_path)

# submit the experiment
run = experiment.submit(config=script_config)
RunDetails(run).show()
run.wait_for_completion(show_output=True)
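
The `arguments` list in the script config reaches the training script as command-line strings. The sketch below reuses the two `argparse` definitions from the training script to show what the script sees. It assumes the dataset reference resolves to the download path `/tmp/training_files`; the helper name `parse_training_args` is introduced here for illustration only.

```python
import argparse

def parse_training_args(argv):
    """Parse the same two arguments that diabetes_training.py defines."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01,
                        help='regularization rate')
    parser.add_argument('--input-data', type=str, dest='dataset_folder',
                        help='data mount point')
    return parser.parse_args(argv)

# Command-line values always arrive as strings; type=float converts the rate
args = parse_training_args(['--regularization', '0.1', '--input-data', '/tmp/training_files'])
print(args.reg_rate, args.dataset_folder)

# With no arguments, the default rate applies and no data folder is set
defaults = parse_training_args([])
print(defaults.reg_rate, defaults.dataset_folder)
```

Because `--input-data` has no default, the training script fails when it tries to build the file pattern from `None`, so the argument is effectively required.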

We can retrieve the metrics and outputs from the **Run** object.

In [None]:
# Get logged metrics and files
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

## Register the trained model
Note that the outputs of the experiment include the trained model file (**diabetes_model.pkl**). We can register this model in your AML workspace, making it possible to track model versions and retrieve them later.

In [None]:
# Register model

from azureml.core import Model

run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes-prediction-model',
                   tags={'Training context':'Script'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

Next, we retrieve the model we want to deploy. By default, if we specify a model name, the latest version is returned.

In [None]:
model = ws.models['diabetes-prediction-model']
print(model.name, 'version', model.version)

## Deploy the model as a web service
We have trained and registered a machine learning model that classifies patients based on the likelihood of them having diabetes. This model could be used in a production environment, so we will deploy it as a web service.

We're going to create a web service to host this model, and this will require some code and configuration files.

In [None]:
# deploying the model

import os
# Create a folder for the deployment files
deployment_folder = './diabetes_service'
os.makedirs(deployment_folder, exist_ok=True)
print(deployment_folder, 'folder created.')

# Set path for scoring script
script_file = 'score_diabetes.py'
script_path = os.path.join(deployment_folder,script_file)

The web service where we deploy the model will need some Python code to load the input data, get the model from the workspace, and generate and return predictions. We'll save this code in an *entry script* (often called a *scoring script*) that will be deployed to the web service.

The script consists of two functions:

- **init**: This function is called when the service is initialized, and is generally used to load the model. Note that the scoring script uses the **AZUREML_MODEL_DIR** environment variable to determine the folder where the model is stored.
- **run**: This function is called each time a client application submits new data, and is generally used to generate predictions from the model.

In [None]:
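# Point AZUREML_MODEL_DIR at the local outputs folder of the training run.
# This only affects the current notebook process (for example, to try the scoring
# script locally); the deployed service sets this variable itself.
run_id = run.id
os.environ['AZUREML_MODEL_DIR'] = '/tmp/azureml_runs/'+run_id+'/outputs'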

The web service will be hosted in a container, and the container will need to install any required Python dependencies when it gets initialized. In this case, our scoring code requires **scikit-learn** and some Azure Machine Learning specific packages that are used by the scoring web service, so we'll use a curated environment that includes these. Then we'll add that environment to an *inference configuration* along with the scoring script, and define a *deployment configuration* for the container in which the environment and script will be hosted.

We can then deploy the model as a service based on these configurations.

> **More Information**: For more details about model deployment, and options for target execution environments, see the [documentation](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-and-where).

Deployment will take some time as it first runs a process to create a container image, and then runs a process to create a web service based on the image. When deployment has completed successfully, you'll see a status of **Healthy**.

In [None]:
%%writefile $script_path
import json
import joblib
import numpy as np
import os

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load it
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'diabetes_model.pkl')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)
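
The request and response contract of the scoring script can be tried without a deployment. The sketch below mirrors the `init`/`run` structure but swaps the trained model for a hypothetical stand-in, `ThresholdModel`, that flags any row whose second feature (PlasmaGlucose) is above 140. The threshold is arbitrary and only serves to produce both classes; it says nothing about the real model.

```python
import json

class ThresholdModel:
    """Hypothetical stand-in for the trained model: flags plasma glucose above 140."""
    def predict(self, rows):
        return [1 if row[1] > 140 else 0 for row in rows]

model = None

def init():
    global model
    model = ThresholdModel()  # the real script loads diabetes_model.pkl with joblib here

def run(raw_data):
    # Same steps as the scoring script: parse JSON, predict, map to class names
    data = json.loads(raw_data)['data']
    predictions = model.predict(data)
    classnames = ['not-diabetic', 'diabetic']
    return json.dumps([classnames[p] for p in predictions])

init()
request_body = json.dumps({"data": [[2, 180, 74, 24, 21, 23.9, 1.49, 22],
                                    [0, 100, 58, 11, 179, 39.2, 0.16, 45]]})
response = run(request_body)
print(response)
```

Note that `run` both receives and returns a JSON string, so a client has to decode the result before it can index into the list of class names.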

In [None]:
from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core import Model

# Configure the scoring environment
service_env = Environment.get(workspace=ws, name="AzureML-sklearn-0.24.1-ubuntu18.04-py37-cpu-inference")
service_env.inferencing_stack_version="latest"

inference_config = InferenceConfig(source_directory=deployment_folder,
                                   entry_script=script_file,
                                   environment=service_env)

# Configure the web service container
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

# Deploy the model as a service
print('Deploying model...')
service_name = "diabetes-prediction-service"
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config, overwrite=True)
service.wait_for_deployment(True)
print(service.state)

Take a look at your workspace in [Azure Machine Learning Studio](https://ml.azure.com) and view the **Endpoints** page, which shows the deployed services in your workspace.

We can also retrieve the names of web services in our workspace by running the following code:

In [None]:
for webservice_name in ws.webservices:
    print(webservice_name)

## Use the web service
With the service deployed, now we can consume it from a client application.
We can send multiple patient observations to the service, and get back a prediction for each one.

In [None]:
import json

# Our input is an array of two feature arrays (one per patient)
x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
         [0,148,58,11,179,39.19207553,0.160829008,45]]

# Convert the array of arrays to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Call the web service, passing the input data
predictions = service.run(input_data = input_json)

# Get the predicted classes.
predicted_classes = json.loads(predictions)
   
for i in range(len(x_new)):
    print ("Patient {}".format(x_new[i]), predicted_classes[i] )

The code above uses the Azure Machine Learning SDK to connect to the containerized web service and use it to generate predictions from our diabetes classification model. In production, a model is likely to be consumed by business applications that do not use the Azure Machine Learning SDK, but simply make HTTP requests to the web service.

Let's determine the URL to which these applications must submit their requests:

In [None]:
endpoint = service.scoring_uri
print(endpoint)

Now that you know the endpoint URI, an application can simply make an HTTP request, sending the patient data in JSON format, and receive back the predicted class(es).

In [None]:
import requests
import json

x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
         [0,148,58,11,179,39.19207553,0.160829008,45]]

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Set the content type
headers = { 'Content-Type':'application/json' }

predictions = requests.post(endpoint, input_json, headers = headers)
predicted_classes = json.loads(predictions.json())

for i in range(len(x_new)):
    print ("Patient {}".format(x_new[i]), predicted_classes[i] )
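
The call `json.loads(predictions.json())` decodes the response twice. The sketch below reproduces that with plain strings, assuming (as the cell above implies) that the service JSON-encodes the string returned by the scoring script's `run` function a second time when it builds the HTTP response body.

```python
import json

predicted = ["diabetic", "not-diabetic"]

# What run() returns in the scoring script: a JSON string
script_return_value = json.dumps(predicted)
# Assumption: the service serializes that string as JSON once more for the response body
response_body = json.dumps(script_return_value)

first_pass = json.loads(response_body)   # equivalent of predictions.json(): still a str
second_pass = json.loads(first_pass)     # now a Python list of class names

print(type(first_pass).__name__, second_pass)
```

If the scoring script returned the list itself instead of a `json.dumps` string, a single decode would be enough on the client side.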

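## Summary

In this tutorial, we registered the diabetes data as tabular and file datasets, trained and registered a classification model, deployed it as a real-time inferencing web service, and used the service endpoint to make predictions.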