# Step 3: Scoring Pipeline

We assume you have already created a model to use in this scoring script. You can create a model using the `.\notebooks\2_Training_Pipeline` notebook.

The scoring pipeline consists of two steps:

  1. Transform the raw data into a scoring data set.
  2. Score the scoring data set.
  
We assume that new data is arriving into the data store as it's generated. A scoring workflow would poll the data store for new data on whatever schedule is convenient (realtime, hourly, daily...), transform and manipulate the new data just as was done for the training step, then predict the label for the new observations.

We can run the feature engineering notebook (`2a_feature_engineering`) with the correct parameters to simulate the polling process. Since all of our data has already been ingested and we are not simulating the collection of new data, we select observations dated after the training data set end date of "2015-10-30" and transform them into the observation format the model expects. Note that in a true production setting, it might be more convenient to split the `2a_feature_engineering` notebook into a featurizing notebook and a labeller notebook that calculates labels separately and joins them to the features. The scoring process would then need only the features, and the labels could be collected later and compared to the predictions in post processing to monitor model accuracy.
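
The label comparison described above can be sketched in plain Python. This is only an illustration under assumptions: the function name `prediction_accuracy`, the `(machine ID, date)` keys, and the sample values are hypothetical and are not part of the scenario notebooks, which would do this join in Spark.

```python
from typing import Dict, Hashable


def prediction_accuracy(predictions: Dict[Hashable, int],
                        labels: Dict[Hashable, int]) -> float:
    """Return the fraction of observations whose prediction matches the
    label collected later. Observations without a label yet are skipped."""
    matched = [key for key in predictions if key in labels]
    if not matched:
        raise ValueError("no observation has both a prediction and a label")
    correct = sum(1 for key in matched if predictions[key] == labels[key])
    return correct / len(matched)


# Hypothetical sample data keyed by (machine ID, observation date).
predictions = {(1, "2015-11-15"): 0, (1, "2015-11-16"): 1, (2, "2015-11-15"): 0}
# Labels arrive later; machine 2 has no label yet.
labels = {(1, "2015-11-15"): 0, (1, "2015-11-16"): 0}

accuracy = prediction_accuracy(predictions, labels)
```

Keeping predictions and labels in separate keyed collections means scoring never waits for labels, and accuracy can be recomputed whenever more labels arrive.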

Second, we run the scoring notebook (`3b_model_scoring`) to read the scoring data set, generate predictions for each of those observations using the specified `model` and store the results in the `results_data` dataset. 

Since the scoring data set is only used between these two notebooks, we remove the scoring table after scoring the data.

This notebook should take about 2-3 minutes to complete.

In [2]:
from pyspark.sql import SparkSession

# The scoring uses the same feature engineering script used to train the model
scoring_table = 'scoring_input'
results_table = 'results_output'
model = 'RandomForest' # Use 'DecisionTree' or 'RandomForest'

We use Databricks widgets as parameters to customize the runs.

In [4]:
dbutils.widgets.removeAll()
dbutils.widgets.text("results_data", results_table)

dbutils.widgets.text("model", model)

dbutils.widgets.text("start_date", '2015-11-15')

dbutils.widgets.text("to_date", '2016-04-30')
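
Widget values are plain strings, so it may be worth checking the date range before launching the longer-running notebooks. The following is a minimal sketch, assuming ISO-formatted dates and the training end date of "2015-10-30" mentioned earlier; the helper `validate_scoring_window` is hypothetical and is not part of the scenario notebooks.

```python
from datetime import date
from typing import Tuple

# Training data set end date stated in the introduction.
TRAINING_END_DATE = date(2015, 10, 30)


def validate_scoring_window(start_date: str, to_date: str,
                            training_end: date = TRAINING_END_DATE
                            ) -> Tuple[date, date]:
    """Parse the two date strings and check that the scoring window is
    ordered correctly and starts after the training period."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(to_date)
    if start <= training_end:
        raise ValueError(f"start_date {start} must fall after {training_end}")
    if end < start:
        raise ValueError(f"to_date {end} must not precede start_date {start}")
    return start, end


# In the notebook, the strings would come from dbutils.widgets.get(...).
start, end = validate_scoring_window("2015-11-15", "2016-04-30")
```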


## Feature Engineering

The scenario is constructed as a pipeline flow, where each notebook is optimized to perform in a batch setting for each of the ingest, feature engineering, model building, and model scoring operations. To accomplish this, the `2a_feature_engineering` notebook is designed to generate a general data set for any of the training, calibration, test, or scoring operations. In this scenario, we use a temporal split strategy for these operations, so the notebook parameters set the date range filter. The notebook creates a labeled data set using the parameters `start_date` and `to_date` to select the time period, and stores it in the table specified by `features_table`. After this cell completes, you should see the data set under the Databricks Data icon.

Create a scoring data set using the parameters `start_date` and `to_date` to select the time period for scoring. Store those results in the `scoring_table` specified.

In [6]:
dbutils.notebook.run("2a_feature_engineering", 600, {"features_table": scoring_table, 
                                                     "start_date": dbutils.widgets.get("start_date"), 
                                                     "to_date": dbutils.widgets.get("to_date")})

## Scoring observations

Using the `model` specified, predict the probability of component failures for the observations in the `scoring_table`. Store the resulting probabilities in the `results_data` set, which will be available in the Databricks data store for later post processing.

In [8]:
dbutils.notebook.run("3b_model_scoring", 600, {"scoring_data": scoring_table, 
                                               "results_data": dbutils.widgets.get("results_data"), 
                                               "model": dbutils.widgets.get("model")})

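A call to `dbutils.notebook.run` raises an exception if the child notebook fails or does not finish within the timeout (600 seconds in the cells above). One possible way to tolerate transient failures is a small retry wrapper. This is a sketch under assumptions, not part of the scenario: `run_with_retry` and `fake_run` are hypothetical names, and `fake_run` only stands in for `dbutils.notebook.run` so that the example runs outside Databricks.

```python
from typing import Callable, Dict


def run_with_retry(run_notebook: Callable[[str, int, Dict[str, str]], str],
                   path: str, timeout_seconds: int,
                   arguments: Dict[str, str], max_attempts: int = 3) -> str:
    """Call run_notebook until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_notebook(path, timeout_seconds, arguments)
        except Exception as error:  # a failed or timed-out run raises
            last_error = error
            print(f"Attempt {attempt} of {max_attempts} failed: {error}")
    raise RuntimeError(f"{path} failed after {max_attempts} attempts") from last_error


# Hypothetical stand-in for dbutils.notebook.run: fails once, then succeeds.
calls = []

def fake_run(path: str, timeout_seconds: int, arguments: Dict[str, str]) -> str:
    calls.append(path)
    if len(calls) < 2:
        raise RuntimeError("simulated transient failure")
    return "ok"


result = run_with_retry(fake_run, "3b_model_scoring", 600,
                        {"model": "RandomForest"})
```

Inside a Databricks notebook, `dbutils.notebook.run` would be passed in place of `fake_run`. Retrying is only safe when the child notebook can be rerun without side effects, for example when it overwrites its output table.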
## Cleanup temporary data

Since we only need the `scoring_table` data to pass observations from the featurizer to the scoring notebook, we can safely remove the table.

In [10]:
# Since we created the scoring data set, we should remove it to keep things clean.
spark = SparkSession.builder.getOrCreate()
# IF EXISTS keeps the cell from failing if the table has already been dropped.
spark.sql("DROP TABLE IF EXISTS " + scoring_table)

# Conclusion

The scenario is constructed as a pipeline flow, where each notebook is optimized to perform in a batch setting for each of the ingest, feature engineering, model building, and model scoring operations. The included notebooks can be customized for your specific batch scoring use case.