<center><h1>PART 3: DEPLOY A MODEL ON AZURE MACHINE LEARNING SERVICE</h1></center>
<br>
In this notebook, we will deploy the model we trained in Part 1 to Azure Machine Learning Service.

#### ABOUT THE MODEL & DATA

Using data from Taarifa and the Tanzanian Ministry of Water, we will predict which pumps are functional, which need some repairs, and which don't work at all. The labels encompass three classes, and the training data is based on a number of variables describing what kind of pump is operating, when it was installed, and how it is managed. A good understanding of which waterpoints will fail can improve maintenance operations and help ensure that clean, potable water is available to communities across Tanzania. This competition is hosted on [DrivenData](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

#### PYTHON DEPENDENCIES
```
This notebook was developed with the following packages:
azureml-sdk[databricks]
category-encoders==1.3.0
numpy==1.15.0
pandas==0.24.1
scikit-learn==0.20.2

```

#### APPROACH

Azure Kubernetes Service (AKS) offers orchestrated, elastic container clusters. The containers we deploy to the cluster encapsulate the scoring logic and the model itself. The steps involved are:
1. Configure the development environment  
2. Test the model locally
3. Create an execution script called `score.py`
4. Configure the cluster image
5. Deploy the cluster
6. Test the web service

Reference: [Deploy models with the Azure Machine Learning service](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where)

In [2]:
from __future__ import print_function 
import os
import numpy as np
import pandas as pd
from sklearn.externals import joblib
import azureml.core
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.image import ContainerImage
from azureml.core.model import Model
from azureml.core.webservice import Webservice, AksWebservice
from azureml.train.estimator import Estimator

# set Pandas display options
pd.options.display.max_columns = None

# check Azure SDK version
print("Azure ML SDK Version: ", azureml.core.VERSION)

#### 3.1 CONFIGURE THE DEVELOPMENT ENVIRONMENT
Configuring the development environment involves steps such as setting up an Azure ML Workspace and connecting to remote storage and compute resources.

For local disk storage during development we will use the Databricks FileStore folder. `/FileStore` is a special folder within DBFS where you can save files and also download files to your local machine via a browser.  
Use the Databricks Data menu/UI to upload `pumps_data.csv` and `new_pumps_data.csv` to the `/FileStore/tables/pumps` directory (skip this step if you already uploaded the data in Part 1 or 2).
```
/FileStore
  ├── tables                     -> Databricks by default stores data here
  │   └──pumps                   -> we will create this project specific folder
  │      ├── new_pumps_data.csv  -> scoring dataset with no labels
  │      └── pumps_data.csv      -> training dataset with labels
  └── users/jason/pumps          -> we will create this folder as our project root folder
      ├───models                 -> store all pickle files
      │    ├──  local            -> pickle files created by training locally in the notebook
      │    ├──  rf.pkl           -> Random Forest estimator trained on AMLS
      │    ├──  le.pkl           -> Preprocessing transformer trained on AMLS
      │    ├──  ohc.pkl          -> Preprocessing transformer trained on AMLS
      │    ├──  y_le.pkl         -> Preprocessing transformer trained on AMLS
      └── scripts                -> scripts such as train.py and score.py for AMLS
```

Let's create a `Config` class to hold all the pertinent configurations and storage locations.

In [4]:
class Config(object):

    # define Azure ML Workspace configuration variables---------
    SUBSCRIPTION_ID = 'your_key_here'
    RESOURCE_GROUP = 'ML_SANDBOX'
    WORKSPACE_NAME = 'ML_SANDBOX'
    WORKSPACE_REGION = 'East US 2'
    
    # setup Azure Machine Learning Service compute for training
    TRAIN_COMPUTE = 'dev-vm'
    
    # Kubernetes cluster for deployment
    DEPLOY_COMPUTE = 'dev-cluster'
    IMAGE_NAME = 'pumps_rf_image'
    AKS_SERVICE_NAME = 'pumps-aks-service-1'
    
    # define DBFS paths for sub-directories---------------------
    # dbutils requires filepaths without the '/dbfs' prefix, so we will use this
    # variable largely with dbutils functions.
    PROJECT_DIR = '/FileStore/users/jason/pumps' 
    
    # for Python to understand filepaths, you need to prefix '/dbfs'
    # (note the leading '/' before each sub-directory name)
    MODELS_DIR = '/dbfs'+PROJECT_DIR+'/models'
    SCRIPTS_DIR = '/dbfs'+PROJECT_DIR+'/scripts'
    
    # set location for uploading data
    # default location for data is /FileStore/tables but we will use a pumps sub-directory
    DATA_DIR = '/dbfs/FileStore/tables/pumps'  
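
Because the same location is written in two forms (without `/dbfs` for `dbutils`, with `/dbfs` for plain Python file APIs), it is easy to drop a separator when building paths by hand. The following is a minimal sketch of two conversion helpers; `to_local_path` and `to_dbfs_path` are hypothetical names introduced here for illustration and are not part of Databricks or the Azure ML SDK.

```python
import posixpath

DBFS_MOUNT = "/dbfs"  # prefix that plain Python file APIs need on Databricks


def to_local_path(dbfs_path, *parts):
    """Join path parts and add the '/dbfs' prefix (hypothetical helper)."""
    joined = posixpath.join(dbfs_path, *parts)
    return DBFS_MOUNT + "/" + joined.lstrip("/")


def to_dbfs_path(local_path):
    """Strip the '/dbfs' prefix to get the form dbutils expects (hypothetical helper)."""
    if local_path == DBFS_MOUNT or local_path.startswith(DBFS_MOUNT + "/"):
        return local_path[len(DBFS_MOUNT):] or "/"
    return local_path


project_dir = "/FileStore/users/jason/pumps"
models_dir = to_local_path(project_dir, "models")
print(models_dir)                # path for joblib, open(), pandas, ...
print(to_dbfs_path(models_dir))  # path for dbutils.fs functions
```

Using `posixpath.join` rather than string concatenation means a missing or doubled `/` between directory names cannot slip in unnoticed.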

If you previously ran this code from the Part 1 notebook, you can skip this.

In [6]:
# connect to workspace
try:
    ws = Workspace(subscription_id = Config.SUBSCRIPTION_ID, resource_group = Config.RESOURCE_GROUP, workspace_name = Config.WORKSPACE_NAME)
    # write the details of the workspace to a configuration file to the notebook library
    ws.write_config()
    print("Workspace configuration succeeded.")
except Exception as e:
    # report the underlying error and stop, because the cells below need `ws`
    print("Workspace not accessible:", e)
    raise
    
# set up experiment
experiment_name = 'pumps-exp1'
from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

#### 3.2 TEST THE MODEL LOCALLY
Before deploying the model on a remote cluster, let's test it locally to ensure it is working correctly.

In [8]:
# NOT CURRENTLY WORKING----------------
# download the registered pumps_rf model to the model folder in ADSL
rf = Model(ws, 'pumps_rf')
le = Model(ws, 'pumps_le')
ohc = Model(ws, 'pumps_ohc')

rf.download(target_dir=Config.MODELS_DIR, exist_ok=True)
le.download(target_dir=Config.MODELS_DIR, exist_ok=True)
ohc.download(target_dir=Config.MODELS_DIR, exist_ok=True)

# verify the downloaded model file
dbutils.fs.ls(Config.MODELS_DIR)

In [9]:
dbutils.fs.ls(Config.PROJECT_DIR+'/models')

Define some helper functions:

In [11]:
def print_nans(df):

    print('Checking for NANs:............................')
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " +
          str(df.shape[1]) +
          " columns and " +
          str(len(df)) +
          " rows \n" "There are " +
          str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    print('..............................................')
    return mis_val_table_ren_columns


def data_frame_imputer(df):
    fill = pd.Series([df[c].value_counts().index[0]
                      if df[c].dtype == np.dtype('O') else df[c].mean() for c in df],
                     index=df.columns)
    return df.fillna(fill)


def replace_with_grouped_mean(df, value, column, to_groupby):

    invalid_mask = (df[column] == value)

    # get the mean without the invalid value
    means_by_group = (df[~invalid_mask].groupby(to_groupby)[column].mean())

    # get an array of the means for all of the data
    means_array = means_by_group[df[to_groupby].values].values

    # assign the invalid values to means
    df.loc[invalid_mask, column] = means_array[invalid_mask]

    return df


def log_transformer(df, base, c=1):

    if base == 'e' or base == np.e:
        log = np.log

    elif base == '10' or base == 10:
        log = np.log10

    else:
        def log(x): return np.log(x) / np.log(base)

    # apply log(x + c) to every column; the offset c keeps zero values finite
    return df.apply(lambda x: log(x + c))
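
To make the behaviour of `replace_with_grouped_mean` concrete, here is a toy illustration on invented data. It is a sketch of the same idea (replace a sentinel value with the mean of the valid values in the same group), written with `Series.map` instead of the array indexing used in the helper above; the column values are made up for the example.

```python
import pandas as pd

# toy data: 0 marks an invalid population reading
df = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B"],
    "population": [100.0, 0.0, 300.0, 0.0, 50.0],
})

invalid = df["population"] == 0

# per-region mean computed from the valid rows only
group_means = df.loc[~invalid].groupby("region")["population"].mean()

# look up each invalid row's region mean and write it back
df.loc[invalid, "population"] = df.loc[invalid, "region"].map(group_means)
print(df)
```

Region A has valid values 100 and 300, so its invalid row becomes 200; region B has a single valid value of 50, so its invalid row becomes 50.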

The `process_data()` function will clean and preprocess any new data and output it in the format that the random forest model expects to see.

In [13]:
def process_data(x, le, ohc):
    """
    Gets new data ready for scoring

    :param x: new data in the form of a dataframe
    :param le: the pumps pickled label encoder transformer
    :param ohc: the pumps pickled one-hot encoding transformer
    :return:  dataframe ready for prediction
    """
  
    useful_columns = ['amount_tsh',
                      'gps_height',
                      'longitude',
                      'latitude',
                      'region',
                      'population',
                      'construction_year',
                      'extraction_type_class',
                      'management_group',
                      'quality_group',
                      'source_type',
                      'waterpoint_type']

    # subset to columns we care about; copy so the assignments below
    # modify this dataframe and not a view of the caller's data
    x = x[useful_columns].copy()

    # for column construction_year, values < 1000 are probably bad
    invalid_rows = x['construction_year'] < 1000
    valid_mean = int(x.construction_year[~invalid_rows].mean())
    x.loc[invalid_rows, "construction_year"] = valid_mean

    # in some columns 0 is an invalid value
    x = replace_with_grouped_mean(df=x, value=0, column='longitude', to_groupby='region')
    x = replace_with_grouped_mean(df=x, value=0, column='population', to_groupby='region')

    # set latitude to the proper value
    x = replace_with_grouped_mean(df=x, value=-2e-8, column='latitude', to_groupby='region')

    # replace zero values of amount_tsh with the regional mean
    x = replace_with_grouped_mean(df=x, value=0, column='amount_tsh', to_groupby='region')

    # remove na's
    x = data_frame_imputer(df=x)

    # print nans in the dataframe if any
    print_nans(x)

    # log transform numerical columns
    num_cols = ['amount_tsh', 'population']
    x[num_cols] = log_transformer(df=x[num_cols], base='e', c=1)
    
    print("data shape: ", x.shape)
    print("Running label and one-hot encoding on the new data...")
    x = le.transform(x)
    x = ohc.transform(x)
    print("Processed data shape: ", x.shape)
    print("done.")
    
    return x

In [14]:
# load pickled models & transformers
rf = joblib.load(Config.MODELS_DIR+'/rf.pkl')
le = joblib.load(Config.MODELS_DIR+'/le.pkl')
ohc = joblib.load(Config.MODELS_DIR+'/ohc.pkl')

In [15]:
# get the data ready for prediction
df = pd.read_csv(Config.DATA_DIR+'/new_pumps_data.csv', index_col=0)
df = process_data(df, le, ohc)

In [16]:
# make prediction
predictions = rf.predict(df)
print(predictions)

#### 3.3 CREATE AN EXECUTION SCRIPT
The execution script receives data submitted to a deployed image, and passes it to the model. It then takes the response returned by the model and returns that to the client. The script is specific to your model; it must understand the data that the model expects and returns. The script usually contains two functions that load and run the model:

`init()`: Typically this function loads the model into a global object. This function is run only once when the Docker container is started.

`run(input_data)`: This function uses the model to predict a value based on the input data. Inputs and outputs to the run typically use JSON for serialization and de-serialization. You can also work with raw binary data. You can transform the data before sending to the model, or before returning to the client.
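
The contract between the two functions can be sketched without any Azure dependencies. The example below is a minimal illustration only: `_StubModel` is a hypothetical stand-in for a trained estimator, and the `{"data": ...}` payload shape and the error-reporting convention are assumptions made for this sketch, not the format used by the script later in this notebook.

```python
import json


class _StubModel:
    """Hypothetical stand-in for a trained estimator."""

    def predict(self, rows):
        return ["functional" if sum(r) > 0 else "non functional" for r in rows]


model = None  # populated once by init()


def init():
    """Runs once when the container starts: load the model into a global."""
    global model
    model = _StubModel()


def run(raw_data):
    """Runs per request: decode JSON, predict, and encode the response."""
    try:
        rows = json.loads(raw_data)["data"]
        return json.dumps({"predictions": model.predict(rows)})
    except Exception as exc:
        # return the error to the client instead of crashing the service
        return json.dumps({"error": str(exc)})


init()
print(run(json.dumps({"data": [[1, 2], [-3, 0]]})))
```

Keeping the model in a module-level global means the expensive load happens once in `init()`, while `run()` stays cheap for every request.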

To encode and decode JSON, we will use the Pandas JSON functions with the `table` format. This format encodes not only the data but also the columns, index, and schema, which makes the JSON explicit and clear.
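
As a small illustration of what the `table` format contains, the sketch below serializes a two-row dataframe and inspects the result with the standard `json` module. The two columns are a subset chosen for brevity; the real payload carries every column of the dataset.

```python
import json

import pandas as pd

# tiny dataframe with a named index, mirroring the 'id' index of the pumps data
df = pd.DataFrame(
    {"amount_tsh": [0, 500], "region": ["Rukwa", "Iringa"]},
    index=pd.Index([61848, 48451], name="id"),
)

payload = df.to_json(orient="table")
parsed = json.loads(payload)

print(sorted(parsed.keys()))                            # top-level sections
print(parsed["schema"]["primaryKey"])                   # index column(s)
print([f["name"] for f in parsed["schema"]["fields"]])  # column names
print(parsed["data"][0])                                # first record
```

The payload has a `schema` section describing each field and the primary key, and a `data` section with one record per row, so the receiving side does not have to guess column names or the index.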

In [18]:
%%writefile /dbfs/mnt/jason/pumps/scripts/score.py

import json
import numpy as np
import pandas as pd  # used by the helper functions and by run()
from sklearn.externals import joblib

from azureml.core.model import Model

###########################################################################################
# HELPER FUNCTIONS
###########################################################################################

def print_nans(df):

    print('Checking for NANs:............................')
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " +
          str(df.shape[1]) +
          " columns and " +
          str(len(df)) +
          " rows \n" "There are " +
          str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    print('..............................................')
    return mis_val_table_ren_columns


def data_frame_imputer(df):
    fill = pd.Series([df[c].value_counts().index[0]
                      if df[c].dtype == np.dtype('O') else df[c].mean() for c in df],
                     index=df.columns)
    return df.fillna(fill)


def replace_with_grouped_mean(df, value, column, to_groupby):

    invalid_mask = (df[column] == value)

    # get the mean without the invalid value
    means_by_group = (df[~invalid_mask].groupby(to_groupby)[column].mean())

    # get an array of the means for all of the data
    means_array = means_by_group[df[to_groupby].values].values

    # assign the invalid values to means
    df.loc[invalid_mask, column] = means_array[invalid_mask]

    return df


def log_transformer(df, base, c=1):

    if base == 'e' or base == np.e:
        log = np.log

    elif base == '10' or base == 10:
        log = np.log10

    else:
        def log(x): return np.log(x) / np.log(base)

    # apply log(x + c) to every column; the offset c keeps zero values finite
    return df.apply(lambda x: log(x + c))


def process_data(x, le, ohc):
    """
    Gets new data ready for scoring

    :param x: new data in the form of a dataframe
    :param le: the pumps pickled label encoder transformer
    :param ohc: the pumps pickled one-hot encoding transformer
    :return:  dataframe ready for prediction
    """
  
    useful_columns = ['amount_tsh',
                      'gps_height',
                      'longitude',
                      'latitude',
                      'region',
                      'population',
                      'construction_year',
                      'extraction_type_class',
                      'management_group',
                      'quality_group',
                      'source_type',
                      'waterpoint_type']

    # subset to columns we care about; copy so the assignments below
    # modify this dataframe and not a view of the caller's data
    x = x[useful_columns].copy()

    # for column construction_year, values < 1000 are probably bad
    invalid_rows = x['construction_year'] < 1000
    valid_mean = int(x.construction_year[~invalid_rows].mean())
    x.loc[invalid_rows, "construction_year"] = valid_mean

    # in some columns 0 is an invalid value
    x = replace_with_grouped_mean(df=x, value=0, column='longitude', to_groupby='region')
    x = replace_with_grouped_mean(df=x, value=0, column='population', to_groupby='region')

    # set latitude to the proper value
    x = replace_with_grouped_mean(df=x, value=-2e-8, column='latitude', to_groupby='region')

    # replace zero values of amount_tsh with the regional mean
    x = replace_with_grouped_mean(df=x, value=0, column='amount_tsh', to_groupby='region')

    # remove na's
    x = data_frame_imputer(df=x)

    # print nans in the dataframe if any
    print_nans(x)

    # log transform numerical columns
    num_cols = ['amount_tsh', 'population']
    x[num_cols] = log_transformer(df=x[num_cols], base='e', c=1)
    
    print("data shape: ", x.shape)
    print("Running label and one-hot encoding on the new data...")
    x = le.transform(x)
    x = ohc.transform(x)
    print("Processed data shape: ", x.shape)
    print("done.")
    
    return x

###########################################################################################
# MAIN
###########################################################################################

# load the model
def init():
    """
    Loads the models and estimators into a global scope
    """
    
    global rf
    global le
    global ohc

    # retrieve model
    rf_path = Model.get_model_path('pumps_rf')
    rf = joblib.load(rf_path)
    
    # retrieve transformers
    le_path = Model.get_model_path('pumps_le')
    ohc_path = Model.get_model_path('pumps_ohc')
    le = joblib.load(le_path)
    ohc = joblib.load(ohc_path)

# Passes data to the model and returns the prediction
def run(raw_data):
    """
    Processes the incoming data, passes it to the
    model and outputs predictions in JSON format
    """
    # raw_data is a JSON string in the pandas 'table' format,
    # so it can be passed to read_json directly
    df = pd.read_json(raw_data, orient='table')
    
    # process data
    processed_data = process_data(df, le, ohc)
    
    # make prediction
    preds = rf.predict(processed_data)
    
    # join predictions to index and jsonify
    # TODO
    
    return json.dumps(preds.tolist())

#### 3.4 CONFIGURE AN AKS CLUSTER
Deployed models are packaged as an image that contains the dependencies needed to run the model. We need to create an environment file (`myenv.yml`) that specifies all of the scoring script's package dependencies. Azure ML uses this file to install those dependencies in the Docker image.

In [20]:
package_list = [
  'category-encoders==1.3.0',
  'numpy==1.15.0',
  'pandas==0.24.1',
  'scikit-learn==0.20.2']

# Conda environment configuration
myenv = CondaDependencies.create(pip_packages=package_list)

with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())
    
print(myenv.serialize_to_string())

For Azure Container Instance, Azure Kubernetes Service, and Azure IoT Edge deployments, the `azureml.core.image.ContainerImage` class is used to create an image configuration. The image configuration is then used to create a new Docker image.

In [22]:
# retrieve cloud representations of the models
rf = Model(workspace=ws, name='pumps_rf')
le = Model(workspace=ws, name='pumps_le')
ohc = Model(workspace=ws, name='pumps_ohc')
print(rf); print(le); print(ohc)

In [23]:
# Image configuration
image_config = ContainerImage.image_configuration(execution_script='score.py', 
                                                  runtime='python', 
                                                  conda_file='myenv.yml',
                                                  description='Pumps Random Forest model')


# Register the image from the image configuration
image = ContainerImage.create(name = Config.IMAGE_NAME, 
                              models = [rf, le, ohc],
                              image_config = image_config,
                              workspace = ws)

#### 3.5 DEPLOY THE AKS CLUSTER
To begin with, let's attach an existing AKS cluster to the AML workspace.

In [25]:
# Attach the cluster to your workspace
attach_config = AksCompute.attach_configuration(resource_group = Config.RESOURCE_GROUP,
                                                cluster_name = Config.DEPLOY_COMPUTE)
aks_target = ComputeTarget.attach(workspace=ws, 
                                  name=Config.DEPLOY_COMPUTE, 
                                  attach_configuration=attach_config)

# Wait for the operation to complete
aks_target.wait_for_completion(True)

Next, let's deploy the web service to the cluster using the image we configured in the previous section.

In [27]:
# Set configuration and service name
aks_config = AksWebservice.deploy_configuration()

# Deploy from image
service = Webservice.deploy_from_image(workspace = ws,
                                       name = Config.AKS_SERVICE_NAME,
                                       image = image,
                                       deployment_config = aks_config,
                                       deployment_target = aks_target)
# Wait for the deployment to complete
service.wait_for_deployment(show_output = True)
print(service.state)

In [28]:
# in case of issues, you can check the logs
service.get_logs()

In [29]:
# This is the HTTP endpoint that accepts REST client calls
print(service.scoring_uri)
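
Clients that do not use the SDK can call the scoring URI with a plain HTTP POST. The sketch below only builds the request and does not send it. The URI and key are placeholders, and the `Authorization: Bearer <key>` header assumes key-based authentication is enabled on the service; `build_scoring_request` is a hypothetical helper name used for illustration.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, payload, api_key=None):
    """Build (but do not send) a POST request for a scoring endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # assumption: the service uses key-based authentication
        headers["Authorization"] = "Bearer " + api_key
    body = payload if isinstance(payload, str) else json.dumps(payload)
    return urllib.request.Request(
        scoring_uri, data=body.encode("utf-8"), headers=headers, method="POST"
    )


# placeholder values; substitute service.scoring_uri and a real key
req = build_scoring_request(
    "http://example.invalid/score", '{"data": []}', api_key="hypothetical-key"
)
print(req.get_method(), req.full_url)

# to send the request and read the response:
# response = urllib.request.urlopen(req).read()
```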

#### 3.6 TEST THE WEB SERVICE
Let's send the data as a JSON string to the web service hosted in AKS and use the SDK's `run` API to invoke the service. Here we will take a few rows from the unlabeled scoring dataset to predict on.

In [31]:
# take a few rows from the scoring dataset
sample = pd.read_csv(Config.DATA_DIR+'/new_pumps_data.csv', index_col=0).head(5)

# serialize the rows with the pandas 'table' format
input_data = sample.to_json(orient='table')

# invoke the web service through the SDK
result = service.run(input_data=input_data)
print(result)

#### TEST JSON

In [33]:
test_data ='''
{
    "schema": {
        "pandas_version": "0.20.0",
        "primaryKey": ["id"],
        "fields": [
            {"name": "id","type": "integer"},
            {"name": "amount_tsh","type": "integer"},
            {"name": "date_recorded","type": "string"},
            {"name": "funder","type": "string"},
            {"name": "gps_height","type": "integer"},
            {"name": "installer","type": "string"},
            {"name": "longitude","type": "number"},
            {"name": "latitude","type": "number"},
            {"name": "wpt_name","type": "string"},
            {"name": "num_private","type": "integer"},
            {"name": "basin","type": "string"},
            {"name": "subvillage","type": "string"},
            {"name": "region","type": "string"},
            {"name": "region_code","type": "integer"},
            {"name": "district_code","type": "integer"},
            {"name": "lga","type": "string"},
            {"name": "ward","type": "string"},
            {"name": "population","type": "integer"},
            {"name": "public_meeting","type": "string"},
            {"name": "recorded_by","type": "string"},
            {"name": "scheme_management","type": "string"},
            {"name": "scheme_name","type": "string"},
            {"name": "permit","type": "string"},
            {"name": "construction_year","type": "integer"},
            {"name": "extraction_type","type": "string"},
            {"name": "extraction_type_group","type": "string"},
            {"name": "extraction_type_class","type": "string"},
            {"name": "management","type": "string"},
            {"name": "management_group","type": "string"},
            {"name": "payment","type": "string"},
            {"name": "payment_type","type": "string"},
            {"name": "water_quality","type": "string"},
            {"name": "quality_group","type": "string"},
            {"name": "quantity","type": "string"},
            {"name": "quantity_group","type": "string"},
            {"name": "source","type": "string"},
            {"name": "source_type","type": "string"},
            {"name": "source_class","type": "string"},
            {"name": "waterpoint_type","type": "string"},
            {"name": "waterpoint_type_group","type": "string"}
        ]
    },
    "data": [
        {
            "id": 61848,
            "amount_tsh": 0,
            "date_recorded": "2011-08-04",
            "funder": "Rudep",
            "gps_height": 1645,
            "installer": "DWE",
            "longitude": 31.44412134,
            "latitude": -8.27496163,
            "wpt_name": "Kwa Juvenal Ching'Ombe",
            "num_private": 0,
            "basin": "Lake Tanganyika",
            "subvillage": "Tunzi",
            "region": "Rukwa",
            "region_code": 15,
            "district_code": 2,
            "lga": "Sumbawanga Rural",
            "ward": "Mkowe",
            "population": 200,
            "public_meeting": true,
            "recorded_by": "GeoData Consultants Ltd",
            "scheme_management": "VWC",
            "scheme_name": null,
            "permit": false,
            "construction_year": 1991,
            "extraction_type": "swn 80",
            "extraction_type_group": "swn 80",
            "extraction_type_class": "handpump",
            "management": "vwc",
            "management_group": "user-group",
            "payment": "never pay",
            "payment_type": "never pay",
            "water_quality": "soft",
            "quality_group": "good",
            "quantity": "enough",
            "quantity_group": "enough",
            "source": "machine dbh",
            "source_type": "borehole",
            "source_class": "groundwater",
            "waterpoint_type": "hand pump",
            "waterpoint_type_group": "hand pump"
        },
        {
            "id": 48451,
            "amount_tsh": 500,
            "date_recorded": "2011-07-04",
            "funder": "Unicef",
            "gps_height": 1703,
            "installer": "DWE",
            "longitude": 34.64243884,
            "latitude": -9.10618458,
            "wpt_name": "Kwa John Mtenzi",
            "num_private": 0,
            "basin": "Rufiji",
            "subvillage": "Kidudumo",
            "region": "Iringa",
            "region_code": 11,
            "district_code": 4,
            "lga": "Njombe",
            "ward": "Mdandu",
            "population": 35,
            "public_meeting": true,
            "recorded_by": "GeoData Consultants Ltd",
            "scheme_management": "WUA",
            "scheme_name": "wanging'ombe water supply s",
            "permit": true,
            "construction_year": 1978,
            "extraction_type": "gravity",
            "extraction_type_group": "gravity",
            "extraction_type_class": "gravity",
            "management": "wua",
            "management_group": "user-group",
            "payment": "pay monthly",
            "payment_type": "monthly",
            "water_quality": "soft",
            "quality_group": "good",
            "quantity": "dry",
            "quantity_group": "dry",
            "source": "river",
            "source_type": "river\/lake",
            "source_class": "surface",
            "waterpoint_type": "communal standpipe",
            "waterpoint_type_group": "communal standpipe"
        }
    ]
}
'''

In [34]:
new_df = pd.read_json(test_data, orient='table')
new_df
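
Before sending a payload like the one above to the service, it can be useful to check that its schema lists every column that `process_data()` selects. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the small `sample` payload is invented to keep the example short.

```python
import json

# columns selected by process_data()
REQUIRED = ['amount_tsh', 'gps_height', 'longitude', 'latitude', 'region',
            'population', 'construction_year', 'extraction_type_class',
            'management_group', 'quality_group', 'source_type',
            'waterpoint_type']


def missing_columns(table_json, required=REQUIRED):
    """Return the required column names absent from a 'table' format payload."""
    schema = json.loads(table_json)["schema"]
    present = {field["name"] for field in schema["fields"]}
    return [name for name in required if name not in present]


# invented two-column payload for demonstration
sample = json.dumps({
    "schema": {
        "fields": [{"name": "id", "type": "integer"},
                   {"name": "region", "type": "string"}],
        "primaryKey": ["id"],
    },
    "data": [],
})
print(missing_columns(sample, required=["region", "population"]))
```

An empty list means the payload carries every column the scoring script needs; any names returned point to fields the client forgot to include.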