# Step 2: Training Pipeline

This notebook executes the model training pipeline. This includes transforming the raw data sets into a training data set and then using that data to train the machine learning model. You must have already run the `1_data_ingestion` notebook to download the raw predictive maintenance scenario data before running this notebook. You can either `Run All` cells, or use the Databricks CLI to create a Databricks Job that runs the same process automatically.

The model training pipeline takes the raw data as it would arrive from the machines we're interested in, transforms it into a training data set, and then fits a machine learning model to predict the outcome of interest.

## Feature Engineering

The training data set is constructed in the `2a_feature_engineering` notebook, which transforms the machine features used to predict the outcome and creates the outcome label we're interested in predicting. In this case, the labels are based on a boolean (TRUE/FALSE) indicator of a component failing. There are four components we're interested in, so this gives us a label with possible values in `{0,1,2,3,4}`, where `0` indicates a healthy machine and `{1,2,3,4}` indicate which of the four components will fail within the next 7 days.
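
As a rough illustration of this labeling scheme, the following sketch assigns a label to a single machine record given a list of failure events. The function name `label_record`, the `(failure_date, component)` tuple layout, and the choice to count the failure day itself as inside the window are all assumptions made for illustration. The `2a_feature_engineering` notebook builds the labels with Spark and may treat edge cases differently.

```python
from datetime import date, timedelta

# 7-day look-ahead window, taken from the description above.
FAILURE_WINDOW = timedelta(days=7)


def label_record(record_date, failures):
    """Return 0 for a healthy record, or the number (1-4) of the component
    whose failure falls within the 7 days following record_date.

    failures: iterable of (failure_date, component_number) tuples for one
    machine (a hypothetical layout used only for this sketch).
    """
    upcoming = [
        (failure_date, component)
        for failure_date, component in failures
        if record_date <= failure_date <= record_date + FAILURE_WINDOW
    ]
    if not upcoming:
        return 0
    # If several failures fall inside the window, keep the earliest one.
    return min(upcoming)[1]


failures = [(date(2015, 3, 10), 2), (date(2015, 6, 1), 4)]
labels = [
    label_record(date(2015, 3, 1), failures),   # more than 7 days before a failure
    label_record(date(2015, 3, 5), failures),   # 5 days before component 2 fails
    label_record(date(2015, 3, 11), failures),  # the day after the failure
    label_record(date(2015, 5, 28), failures),  # 4 days before component 4 fails
]
print(labels)  # [0, 2, 0, 4]
```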

To examine the Spark analysis data sets constructed in the `2a_feature_engineering` notebook, the `2a_feature_exploration` notebook has been included in the repository and copied to your Azure Databricks Workspace. You must run this notebook, or the `2a_feature_engineering` notebook, before running the `2a_feature_exploration` notebook, which details the feature data set.

## Model Training

The training data, including the labels, are then used to build the model. Since we have labeled data, we will build a _supervised_ classification model. Since we are interested in more than a boolean (TRUE/FALSE) indicator, we will construct a multi-class model, where the outcomes correspond to the label values `{0,1,2,3,4}`. 

The model is constructed in the `./notebooks/2b_model_building` notebook. We can choose between the Spark [DecisionTree model](https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-tree-classifier) and the Spark [RandomForest model](https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier). The resulting model is stored directly on the Databricks File System (DBFS).

Similarly, to test the model constructed in the `2b_model_building` notebook, the `2b_model_testing` notebook has been included in the repository. You must run this notebook, or the `2b_model_building` notebook, before running the `2b_model_testing` notebook, which calculates some performance metrics for this predictive maintenance model.

## Pipeline

This notebook takes parameters for which model to build (`Model`), where to store the training data (`features_table`), and the start (`start_date`) and end (`to_date`) dates to use when creating the training data.

Using these parameters, it creates the training data by calling the `./notebooks/2a_feature_engineering` notebook. When the `./notebooks/2a_feature_engineering` notebook completes, the `./notebooks/2b_model_building` notebook runs. The resulting model is stored on the Databricks file system for use in the `./notebooks/3_Scoring_Pipeline` notebook.

**Note:** This notebook will take about 4-6 minutes to execute all cells, depending on the compute configuration you have set up.

In [2]:
# This is the default feature training data file.
training_table = 'training_data'

# The model is a Random Forest
model_type = 'RandomForest' # Use 'DecisionTree' or 'RandomForest'

Databricks parameters to customize the runs.

In [4]:
dbutils.widgets.removeAll()
dbutils.widgets.text("features_table", training_table)
dbutils.widgets.text("Model", model_type)

dbutils.widgets.text("start_date", '2000-01-01')

dbutils.widgets.text("to_date", '2015-10-30')
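
Widget values are returned as strings, so a typo in a model name or date would only surface once a downstream notebook fails. The sketch below shows one way to check the values before starting the pipeline. The helper `validate_parameters` and the `YYYY-MM-DD` date format are assumptions inferred from the defaults in the cell above; they are not part of the original notebooks.

```python
from datetime import datetime

# Model names accepted by the `Model` widget, per the comment in the cell above.
SUPPORTED_MODELS = ("DecisionTree", "RandomForest")


def validate_parameters(model, start_date, to_date):
    """Check the string parameters and return the parsed (start, end) dates.

    Raises ValueError when the model name is unknown, a date is not in
    YYYY-MM-DD form, or the date range is empty.
    """
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unknown model {model!r}; expected one of {SUPPORTED_MODELS}")
    # strptime raises ValueError on malformed dates.
    start = datetime.strptime(start_date, "%Y-%m-%d").date()
    end = datetime.strptime(to_date, "%Y-%m-%d").date()
    if start >= end:
        raise ValueError(f"start_date {start} must be earlier than to_date {end}")
    return start, end


# Same values as the widget defaults above.
start, end = validate_parameters("RandomForest", "2000-01-01", "2015-10-30")
print(start, end)
```

Inside the notebook, the three arguments would come from `dbutils.widgets.get("Model")`, `dbutils.widgets.get("start_date")`, and `dbutils.widgets.get("to_date")`.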

## Feature Engineering

The scenario is constructed as a pipeline flow, where each notebook is optimized to run in a batch setting for one of the ingest, feature engineering, model building, and model scoring operations. To accomplish this, the feature engineering notebook is designed to generate a general data set for any of the training, calibration, testing, or scoring operations. In this scenario, we use a temporal split strategy for these operations, so the notebook parameters are used to set date range filtering.
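
To make the temporal split concrete, here is a minimal sketch in plain Python rather than Spark. The `filter_by_date` helper, the record layout, the half-open `[start_date, to_date)` range, and the end date of the testing period are assumptions chosen for illustration; the actual boundary handling is defined in the `2a_feature_engineering` notebook.

```python
from datetime import date


def filter_by_date(records, start_date, to_date):
    """Keep records whose date lies in the half-open range [start_date, to_date)."""
    return [r for r in records if start_date <= r["date"] < to_date]


# Hypothetical records; real data would hold telemetry features per machine.
records = [
    {"machine": 1, "date": date(2015, 9, 1)},
    {"machine": 1, "date": date(2015, 10, 29)},
    {"machine": 1, "date": date(2015, 10, 30)},
    {"machine": 1, "date": date(2015, 12, 15)},
]

# Everything before the split date trains the model; later data is held out.
split_date = date(2015, 10, 30)
training = filter_by_date(records, date(2000, 1, 1), split_date)
testing = filter_by_date(records, split_date, date(2016, 1, 1))
print(len(training), len(testing))  # 2 2
```

Splitting on time instead of sampling at random keeps future observations out of the training set, which matters when the label itself looks ahead in time.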

The `2a_feature_engineering` notebook creates a labeled training data set using the parameters `start_date` and `to_date` to select the time period for training. This data set is stored in the `features_table` specified. After this cell completes, you should see the dataset under the Databricks `Data` icon.

In [6]:
dbutils.notebook.run("2a_feature_engineering", 600, {"features_table": dbutils.widgets.get("features_table"), 
                                                     "start_date": dbutils.widgets.get("start_date"), 
                                                     "to_date": dbutils.widgets.get("to_date")})
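
`dbutils.notebook.run` raises an exception if the called notebook fails or does not finish within the timeout (600 seconds here). If transient failures are a concern, one option is a small retry wrapper such as the sketch below. The helper `run_with_retry` and the stand-in `fake_run` are hypothetical and not part of this repository; `fake_run` exists only so the example can run outside Databricks.

```python
def run_with_retry(run, path, timeout_seconds, arguments, max_retries=2):
    """Call run(path, timeout_seconds, arguments), retrying on failure.

    `run` is injected so the helper can be exercised outside Databricks;
    inside a notebook you would pass dbutils.notebook.run.
    """
    attempts = 0
    while True:
        try:
            return run(path, timeout_seconds, arguments)
        except Exception as error:
            attempts += 1
            if attempts > max_retries:
                raise
            print(f"Attempt {attempts} for {path} failed ({error}); retrying")


# Stand-in for dbutils.notebook.run that fails once, then succeeds.
calls = []


def fake_run(path, timeout_seconds, arguments):
    calls.append(path)
    if len(calls) == 1:
        raise RuntimeError("transient cluster error")
    return "done"


result = run_with_retry(
    fake_run,
    "2a_feature_engineering",
    600,
    {"features_table": "training_data",
     "start_date": "2000-01-01",
     "to_date": "2015-10-30"},
)
print(result)
```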

## Model Building

Build the model selected by the `Model` parameter using the labeled training data set stored in `features_table`.

In [8]:
dbutils.notebook.run("2b_model_building", 600, {"training_table": dbutils.widgets.get("features_table"), 
                                                "model": dbutils.widgets.get("Model")})

## Conclusion

Now that we have a model stored on the Databricks DBFS filesystem, we can run the `./notebooks/3_Scoring_Pipeline` notebook to score new data as it arrives in our data storage location.