# Scoring Pipeline

In this notebook we create a pipeline for scoring the 12,000 models that we built in the Training Pipeline notebook. We set up the pipeline for batch scoring and again use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize the process.

Batch inference (or batch scoring) provides cost-effective inference, with unparalleled throughput for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data. Batch prediction is optimized for high throughput, fire-and-forget predictions for a large collection of data.

# Prerequisites 

This example runs on an Azure Machine Learning Notebook VM. We are calling models that have already been trained and registered to the Workspace. If you have already run the Environment Setup and Training Pipeline notebooks you are all set.

In [1]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Run, Datastore, Dataset
from azureml.core.compute import AmlCompute
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.core import Environment
from azureml.core.runconfig import CondaDependencies, DEFAULT_CPU_IMAGE
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig
from azureml.core.model import Model
import joblib

## Set up the Workspace, Datastore, Experiment and Compute

As we did in the Training Pipeline notebook, we need to call the Workspace and set up an Experiment. We also want to create variables for the datastore and compute cluster. 

### Connect to the workspace

Create a workspace object. Workspace.from_config() reads the file config.json and loads the details into an object named ws.

In [None]:
from azureml.core import Workspace 

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

#ws = Workspace(subscription_id="bbd86e7d-3602-4e6d-baa4-40ae2ad9303c", resource_group="ManyModelsSA", workspace_name="ManyModelsSAv1")
#ws.get_details()
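
For reference, config.json is a small JSON file with three keys that identify the workspace. The sketch below writes such a file with placeholder values and reads it back. The values and the temporary path are hypothetical; in practice the file is usually downloaded from the Azure portal rather than written by hand.

```python
import json
import os
import tempfile

# Placeholder values (hypothetical); substitute your own workspace details.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write to a temporary directory so this sketch never overwrites a real config.json.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read the file back, as a loader would.
with open(config_path) as f:
    loaded = json.load(f)

print(sorted(loaded))
```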

### Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train and score machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this notebook, you use Azure Machine Learning Compute to run the scoring pipeline. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

**Creation of compute takes approximately 5 minutes. If the AmlCompute with that name is already in your workspace the code will skip the creation process.**

In [None]:
# define the compute cluster
import os
from azureml.core.compute import AmlCompute, ComputeTarget

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# environment variables are strings, so cast the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses a CPU VM. To use a GPU VM, set the SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

# the ParallelRunConfig below refers to the cluster as `compute`
compute = compute_target

### Call the Datastore containing the Orange Juice sales data
In the Generate Data notebook, we uploaded a CSV file for each Store and Brand combination. Use .get_default_datastore() to get a reference to the datastore that holds those files.

In [None]:
dstore = ws.get_default_datastore()

## Set up the Environment

In [3]:
# set up the batch environment settings
# 'scikit-learn' is the pip package name; the 'sklearn' alias on PyPI is deprecated
batch_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn','pmdarima'])

batch_env = Environment(name="manymodels_environment")
batch_env.python.conda_dependencies = batch_conda_deps
batch_env.docker.enabled = True
batch_env.docker.base_image = DEFAULT_CPU_IMAGE

In [4]:
# call the models
from azureml.core.model import Model 

model1 = Model(ws, 'arima_Store5_tropicana')
model2 = Model(ws, 'arima_Store2_dominicks')
model3 = Model(ws, 'arima_Store8_minute.maid')

model_list = [model1, model2, model3]

In [5]:
type(model1)

azureml.core.model.Model

In [None]:
from azureml.pipeline.core import Pipeline, PipelineData

# dataset = Dataset.get_by_name(ws, name='Store2_dominicks')

dataset1 = Dataset.File.from_files(path = (dstore, '3modelsdata/Store2_dominicks.csv'))
dataset2 = Dataset.File.from_files(path = (dstore, '3modelsdata/Store5_tropicana.csv'))
dataset3 = Dataset.File.from_files(path = (dstore, '3modelsdata/Store8_minute.maid.csv'))

output_dir = PipelineData(name="3_models", 
                          datastore=dstore, 
                          output_path_on_compute="3models/")
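
The registered model names above ('arima_Store5_tropicana', and so on) line up with the data file names, which is how the scoring script can find the right model for each input file. A minimal sketch of that mapping follows; the helper name `model_name_for` is hypothetical and is not part of the Azure ML SDK.

```python
import os

def model_name_for(file_path, prefix="arima_"):
    """Map a data file path to the matching registered model name.

    Hypothetical helper: strips the directory and the final extension,
    then adds the 'arima_' prefix used when the models were registered.
    """
    stem, _ext = os.path.splitext(os.path.basename(file_path))
    return prefix + stem

paths = [
    "3modelsdata/Store2_dominicks.csv",
    "3modelsdata/Store5_tropicana.csv",
    "3modelsdata/Store8_minute.maid.csv",
]

for p in paths:
    print(p, "->", model_name_for(p))
```

Using os.path.splitext removes only the final '.csv', so a brand name that contains a dot (minute.maid) is kept intact.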


## Define the ParallelRunConfig

In [12]:
# Create the parallel run config
workercount = 3
nodecount = 1
timeout = 3000

tags1 = {}
tags1['nodes'] = nodecount
tags1['workers-per-node'] = workercount
tags1['timeout'] = timeout 

parallel_run_config = ParallelRunConfig(
    source_directory = './scripts',
    entry_script = 'score.py',
    mini_batch_size = '1',
    run_invocation_timeout = timeout, 
    error_threshold = 10,
    output_action = 'summary_only', 
    environment = batch_env, 
    process_count_per_node = workercount, 
    compute_target = compute, 
    node_count = nodecount
)
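
To see how these settings interact: for file inputs, mini_batch_size is the number of files passed to each run() call, and the number of concurrent workers is node_count times process_count_per_node. The sketch below does that arithmetic in plain Python. The function name and the idea of "waves" are illustrative assumptions; the real scheduler hands out mini-batches dynamically, so treat the wave count as a rough lower bound rather than a description of actual scheduling.

```python
import math

def parallel_layout(n_files, mini_batch_size, node_count, process_count_per_node):
    """Rough estimate of how a ParallelRunStep job is divided up (illustrative only)."""
    # Each run() call receives up to mini_batch_size files.
    n_batches = math.ceil(n_files / mini_batch_size)
    # Every node runs process_count_per_node worker processes.
    n_workers = node_count * process_count_per_node
    # Idealized number of rounds if all mini-batches took equally long.
    waves = math.ceil(n_batches / n_workers)
    return n_batches, n_workers, waves

# The three-model configuration used in this notebook.
print(parallel_layout(n_files=3, mini_batch_size=1, node_count=1, process_count_per_node=3))

# The same cluster settings applied to all 12,000 models.
print(parallel_layout(n_files=12000, mini_batch_size=1, node_count=1, process_count_per_node=3))
```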

## Set up the ParallelRunStep

In [13]:
datasetname = 'store'
output_dir = PipelineData(name = 'scoringOutput', 
                         datastore = dstore, 
                         output_path_on_compute = 'scoringOutput/')

parallelrun_step = ParallelRunStep(
    name="many-models-scoring",
    parallel_run_config=parallel_run_config,
    # one named input per dataset; the list must have at least one element
    inputs=[dataset1.as_named_input(datasetname),
            dataset2.as_named_input(datasetname),
            dataset3.as_named_input(datasetname)],
    output=output_dir,
    models= model_list, # this is just for logging
    # score.py parses --n_test_set (and --timestamp_column, which must also be supplied here)
    arguments=['--n_test_set', 6],
    allow_reuse = False
)

## Submit and Run the Pipeline

In [2]:
# set up the experiment
experiment = Experiment(ws, 'scoring-pipeline-AP')

In [14]:
pipeline = Pipeline(ws, steps=[parallelrun_step])

run = experiment.submit(pipeline, tags=tags1)

Created step many-models-scoring [4e56035e][3ccd7e72-45fd-42dd-a81f-a099067e9b41], (This step will run and generate new outputs)
Using data reference store5_0 for StepId [512dd7d7][e79bdf2c-906d-4a00-a053-8d1dc436f342], (Consumers of this data are eligible to reuse prior runs.)
Submitted PipelineRun 18c39a34-306b-4ab2-b756-e215c6657122




Link to Azure Machine Learning studio: https://ml.azure.com/experiments/scoring-pipeline-AP/runs/18c39a34-306b-4ab2-b756-e215c6657122?wsid=/subscriptions/bbd86e7d-3602-4e6d-baa4-40ae2ad9303c/resourcegroups/ManyModelsSA/workspaces/ManyModelsSAv1


## Review the Output from the Pipeline
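The scoring script puts the predictions back into blob storage. The cell below downloads the step output so the results can be inspected locally.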

In [None]:
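import pandas as pd

# the first child of the pipeline run is the ParallelRunStep run
prediction_run = next(run.get_children())
# the output is looked up by the PipelineData name used in the step
prediction_output = prediction_run.get_output_data("scoringOutput")

prediction_output.download(local_path="training_results")

# locate the aggregated results file
# (it is only written when output_action is 'append_row')
result_file = None
for root, dirs, files in os.walk("training_results"):
    for file in files:
        if file.endswith('parallel_run_step.txt'):
            result_file = os.path.join(root,file)

if result_file is None:
    raise FileNotFoundError("parallel_run_step.txt not found under training_results")

df = pd.read_csv(result_file, delimiter=" ", header=None) 
df.head()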

## Scoring Script 

In [11]:
%%writefile ./scripts/score.py
from azureml.core.run import Run
from azureml.core.model import Model
from entry_script_helper import EntryScriptHelper

import argparse
import datetime
import logging
import os
import uuid
from datetime import timedelta

# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib
import numpy as np
import pandas as pd
import pmdarima as pm
from sklearn.metrics import mean_squared_error, mean_absolute_error 


thisrun = Run.get_context()
#childrun=thisrun

LOG_NAME = "user_log"

parser = argparse.ArgumentParser("score")
parser.add_argument("--n_test_set", type=int, help="number of most recent weekly periods to predict and score")
parser.add_argument("--timestamp_column", type=str, help="name of the timestamp column in the input CSV files")
#parser.add_argument("--start_date", type=str, help="date to start predictions")

args, unknown = parser.parse_known_args()
# args = parser.parse_args()

print("Argument 1(n_test_set): %s" % args.n_test_set)
print("Argument 2(timestamp_column): %s" % args.timestamp_column)


def init():
    EntryScriptHelper().config(LOG_NAME)
    logger = logging.getLogger(LOG_NAME)
    output_folder = os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), "temp/output")
    logger.info(f"{__file__}.output_folder:{output_folder}")
    logger.info("init()")    
    return

def run(data):
    print("begin run ")
    logger = logging.getLogger(LOG_NAME)
    os.makedirs('./outputs', exist_ok=True)
    
    predictions = pd.DataFrame()
    
    logger.info('making predictions...')
    
    for file in data:
        u1 = uuid.uuid4()
        mname='arima'+str(u1)[0:16]

        #for w in range(0,1):
        with thisrun.child_run(name=mname) as childrun:
            for w in range(0,5):
                thisrun.log(mname,str(w))
            
            date1=datetime.datetime.now()
            logger.info('starting ('+file+') ' + str(date1))
            childrun.log(mname,'starttime-'+str(date1))
            
            # 0. Unpickle Model 
            # registered models are named 'arima_<file name without .csv>'
            model_name = 'arima_' + os.path.splitext(os.path.basename(file))[0]
            print(model_name)
            model_path = Model.get_model_path(model_name)         
            model = joblib.load(model_path)
            
            # 1. Make Predictions 
            prediction_list, conf_int = model.predict(args.n_test_set, return_conf_int = True)
            print("MAKING PREDICTIONS")
            
             
            # 2. Splitting the data for test set  
            # use a separate name so the list of input files in `data` is not overwritten
            df = pd.read_csv(file, header=0)
            df = df.set_index(args.timestamp_column)
            df.index = pd.to_datetime(df.index)
            split_date = df.index.max() - timedelta(days=7*args.n_test_set)
            #train = df[df.index <= split_date]
            # copy so that adding the Predictions column does not modify a view of df
            test = df[df.index > split_date].copy()
                
            test['Predictions'] = prediction_list
            print(test.head())
            
            # 3. Calculating Accuracy Metrics            
            metrics = []
            mse = mean_squared_error(test['Quantity'], test['Predictions'])
            rmse = np.sqrt(mse)
            mae = mean_absolute_error(test['Quantity'], test['Predictions'])
            act, pred = np.array(test['Quantity']), np.array(test['Predictions'])
            mape = np.mean(np.abs((act - pred)/act)*100)

            metrics.append(mse)
            metrics.append(rmse)
            metrics.append(mae)
            metrics.append(mape)

            print(metrics)
            
            # 4. Save the output back to blob storage 
            ws1 = childrun.experiment.workspace
            output_path = os.path.join('./outputs/', model_name)
            test.to_csv(path_or_buf=output_path+'.csv', index = False)
            dstore = ws1.get_default_datastore()
            dstore.upload_files([output_path+'.csv'], target_path='oj_predictions', overwrite=False, show_progress=True)
            
            # Log metrics 
            date2=datetime.datetime.now()
            logger.info('ending ('+str(file)+') ' + str(date2))

            childrun.log(mname,'endtime-'+str(date2))
            childrun.log(mname,'auc-1')
            
            # 5. Append the predictions to return a dataframe if desired 
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            predictions = pd.concat([predictions, test])
        
    return predictions

Overwriting ./scripts/score.py
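
To make steps 2 and 3 of the scoring script easier to follow, here is a small standalone sketch of the same idea in plain Python: hold out the last n weekly observations, then compute MSE, RMSE, MAE and MAPE against a set of predictions. The function names, the toy series and the predicted values are all made up for illustration; the script itself uses pandas and scikit-learn for these steps.

```python
import math
from datetime import date, timedelta

def split_last_weeks(rows, n_test_set):
    """Split (date, quantity) rows so the last n_test_set weeks form the test set."""
    max_date = max(d for d, _ in rows)
    split_date = max_date - timedelta(days=7 * n_test_set)
    train = [r for r in rows if r[0] <= split_date]
    test = [r for r in rows if r[0] > split_date]
    return train, test

def error_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and MAPE (in percent) for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the actual values, so it is undefined if any actual is zero.
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "mape": mape}

# Toy weekly series (made-up data): eight weeks of steadily rising sales.
start = date(2020, 1, 2)
rows = [(start + timedelta(weeks=i), 100 + 10 * i) for i in range(8)]

train, test = split_last_weeks(rows, n_test_set=2)
actual = [quantity for _, quantity in test]
predicted = [150, 180]  # hypothetical model output for the two held-out weeks

metrics = error_metrics(actual, predicted)
print(len(train), len(test), metrics)
```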
