# Step 3B: Model Scoring

Using a scoring data set constructed in the `2a_feature_engineering` notebook and a model constructed in the `2b_model_building` notebook (run through the `2_Training_Pipeline` notebook), this notebook loads the data and predicts the probability of component failure with the provided model.

This notebook can be run through the `3_Scoring_Pipeline` notebook, which creates a temporary scoring data set before scoring the data with this notebook.

We provide the `3b_model_scoring_evaluation` notebook to examine the output of the scoring process.

**IMPORTANT NOTE** This notebook depends on a `scoring_data` set being available in the Databricks data store. You can score any data set constructed with the `2a_feature_engineering` notebook, but you must specify that data set in the `scoring_data` parameter above, or this notebook will fail.

**Note:** This notebook should take less than a minute to execute all cells, depending on the compute configuration you have set up.

In [2]:
# import the libraries
from pyspark.ml import PipelineModel
# for creating pipelines and model
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

# The scoring uses the same feature engineering script used to train the model
scoring_table = 'testing_data'
results_table = 'results_output'

model = 'RandomForest' # Use 'DecisionTree' or 'RandomForest'

Databricks parameters to customize the runs.

In [4]:
dbutils.widgets.removeAll()
dbutils.widgets.text("scoring_data", scoring_table)
dbutils.widgets.text("results_data", results_table)

dbutils.widgets.text("model", model)

We need to run the feature engineering on the data we're interested in scoring (`2a_feature_engineering`). Spark MLlib models require a vectorized data frame. We transform the dataset here for model consumption. In a general scoring operation, we do not know the labels so we only need to construct the features.

In [6]:
sqlContext.refreshTable(dbutils.widgets.get("scoring_data")) 
score_data = spark.table(dbutils.widgets.get("scoring_data"))

# The known label and the key columns that identify each record.
label_var = ['label_e']
key_cols = ['machineID', 'dt_truncated']

# Start from every column name in the scoring data.
input_features = score_data.columns

# Exclude the label, the key columns, and
# a few extra columns that are not model features.
remove_names = label_var + key_cols + ['failure', 'model_encoded', 'model']

# Drop the excluded names if they are in the input_features list.
remove_set = set(remove_names)
input_features = [x for x in input_features if x not in remove_set]

# Assemble the remaining feature columns into a single vector column.
va = VectorAssembler(inputCols=input_features, outputCol='features')

# Apply the assembler and keep only the columns needed for scoring.
score_data = va.transform(score_data).select('machineID', 'dt_truncated', 'label_e', 'features')

# set maxCategories so features with > 10 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", 
                               outputCol="indexedFeatures", 
                               maxCategories=10).fit(score_data)
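
The column filtering in the cell above can be checked in isolation, without a Spark session. The following is a minimal sketch in plain Python; `select_feature_columns` is a hypothetical helper name, and the example column names (such as `volt_rollingmean_12` and `age`) are illustrative assumptions rather than the actual output of `2a_feature_engineering`.

```python
def select_feature_columns(columns, exclude):
    """Return the entries of `columns` that are not in `exclude`, keeping their order."""
    excluded = set(exclude)
    return [c for c in columns if c not in excluded]


# Illustrative column names only; in the notebook this list comes from score_data.columns.
example_columns = ['machineID', 'dt_truncated', 'volt_rollingmean_12',
                   'age', 'label_e', 'failure']

# Same exclusion list as the notebook: label, keys, and extra non-feature columns.
example_exclude = ['label_e', 'machineID', 'dt_truncated',
                   'failure', 'model_encoded', 'model']

feature_columns = select_feature_columns(example_columns, example_exclude)
print(feature_columns)  # ['volt_rollingmean_12', 'age']
```

Names in the exclusion list that are absent from the data (here `model_encoded` and `model`) are simply ignored, which is why the same list can be reused across data sets that carry different extra columns.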

To evaluate this model, we predict the component failures over the test data set. Since the test set has been created from data the model has not seen before, it simulates future data. The evaluation can then be generalized to how the model could perform when operationalized and used to score data in real time.

In [8]:
# Load the saved pipeline model from DBFS storage
model_pipeline = PipelineModel.load("dbfs:/storage/models/" + dbutils.widgets.get("model") + ".pqt")

# score the data. The Pipeline does all the same operations on this dataset
predictions = model_pipeline.transform(score_data)

# Write results to the data store for persistence.
predictions.write.mode('overwrite').saveAsTable(dbutils.widgets.get("results_data"))
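
Because the model path is built by string concatenation from the `model` parameter, a mistyped name only surfaces when `PipelineModel.load` fails. One possible guard is sketched below in plain Python; `build_model_path`, `ALLOWED_MODELS`, and `MODEL_DIR` are hypothetical names introduced for illustration, while the two model names and the path layout are taken from the cells above.

```python
# Model names this notebook documents as valid values for the `model` parameter.
ALLOWED_MODELS = ('DecisionTree', 'RandomForest')

# Directory and suffix as used in the PipelineModel.load call above.
MODEL_DIR = 'dbfs:/storage/models/'
MODEL_SUFFIX = '.pqt'


def build_model_path(model_name, model_dir=MODEL_DIR, allowed=ALLOWED_MODELS):
    """Return the storage path for `model_name`, or raise ValueError if it is not allowed."""
    if model_name not in allowed:
        raise ValueError(
            "Unknown model %r; expected one of %s" % (model_name, ', '.join(allowed))
        )
    return model_dir + model_name + MODEL_SUFFIX


print(build_model_path('RandomForest'))  # dbfs:/storage/models/RandomForest.pqt
```

In the notebook, the result could be passed to the load call, for example `PipelineModel.load(build_model_path(dbutils.widgets.get("model")))`, so that an invalid parameter fails early with a readable message.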

# Conclusion

We have provided an additional notebook (`3b_model_scoring_evaluation`) to examine the output of the scoring process.