## **Train** an ML model locally, **Deploy** the model in Monitor, and perform real-time **Prediction** in Maximo-Monitor

You can use this notebook as a template for training your model locally using data in Maximo-Monitor, and deploying the model in Monitor. At the end of this notebook there are further instructions on getting started with real-time inference.

**Note** The following file is required to run this notebook and is not provided with this template:
1. monitor-credentials.json. Contains credentials to connect to Maximo Monitor. Its contents are explained in section **I. Credentials setup**

*Create your own files with the correct file paths. In the template, all required inputs are marked with the `user-input-required` comment as a guide.*

**Pre-requisites**
1. Install iotfunctions package
2. Install any other packages required to train your machine learning models

In [None]:
import json # loading credentials
import pandas as pd
import numpy as np

## I. Credentials setup
You must create one credentials file:
1. Credentials that allow Maximo Monitor to access its database and APIs.

The credentials are saved in a JSON file. In the following cell, define the path of the credentials file relative to this notebook, and then load the file into a local variable.

**Note**: Keep all credentials safe and hidden

**More information about credentials in Maximo Monitor**

[Maximo Monitor documentation](https://www.ibm.com/docs/en/maximo-monitor/8.5.0?topic=monitor-connection-parameters)

In [None]:
MONITOR_CREDENTIALS_FILE_PATH = './dev_resources/monitor-credentials.json'  # user-input-required. Set this to the path of your monitor-credentials file relative to this notebook

In [None]:
with open(MONITOR_CREDENTIALS_FILE_PATH, 'r') as f:
    monitor_credentials = json.load(f)  # parse the credentials file into a dict
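
If the credentials file is missing or malformed, the cell above fails with a fairly generic error. The following is a minimal sketch of a more defensive loader. The helper name `load_credentials` is hypothetical and not part of iotfunctions; it only checks that the file exists and holds a JSON object, and it makes no assumption about which keys Monitor expects.

```python
import json
from pathlib import Path


def load_credentials(path):
    """Load a JSON credentials file and fail with a clear message if it is unusable."""
    credentials_path = Path(path)
    if not credentials_path.is_file():
        raise FileNotFoundError(f'Credentials file not found: {credentials_path.resolve()}')
    with credentials_path.open('r') as f:
        credentials = json.load(f)
    if not isinstance(credentials, dict):
        raise ValueError('Expected the credentials file to contain a JSON object')
    return credentials
```

With this helper, the load step could be written as `monitor_credentials = load_credentials(MONITOR_CREDENTIALS_FILE_PATH)`.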

## II. Train Model

[Read the article about Machine Learning](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) if you are interested in getting familiar with some of the basics.

This notebook can be modified and used for regression or classification tasks. [A short refresher for classification vs regression tasks](https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/)

The template loads an entity's data and guides you to train a machine learning model locally using that data. An entity in Monitor has several database tables that store metrics, dimensions, derived metrics, and alerts. To load these features for an entity, follow this step-by-step guide. 

**Steps to train a model in a notebook**
1. Load training data from Monitor database (for a specified entity)
2. Prepare dataset for training
3. Create a training pipeline
4. Train the model

**III.1 Load training data from Monitor database**

An entity in Monitor has several database tables that store raw metrics, dimensions, and alerts. To load these features, you perform two database reads. The first read fetches raw metrics and dimensions, and the second read fetches the alerts for the entity type that you specify. The two data sets are stored in separate comma-separated value (.csv) files and are available to manipulate as required.


Complete the following steps to load training data:
1. Specify entity_name (and any other user input)
2. Connect to the database
3. Save the data in a csv to use later

**Note** If you already have data available in a csv file, go to section **III.2 Prepare data**

Set the following variables:
- (Required) `entity_name`: Name of the entity that you want to retrieve data from. Use the same name that is displayed on the user interface in Monitor.
- (Optional) `start_ts`: Start fetching data from this date and time.
- (Optional) `end_ts`: Fetch data until this date and time.

**III.1.1 Specify entity_name (and any other user input)**

In [None]:
# (User Input) user-input-required

entity_name = 'shraddha_robot' # user-input-required. Set to entity name you'd like to retrieve data from. This is the same name as displayed on the Monitor UI

start_ts = None # user-input-required. (Optional) Fetch data starting from this date/time. Format 'YYYY-MM-DD-HH.MM.ss.mmmmm'. Set to None to disable

end_ts = None # user-input-required. (Optional) Fetch data until this date/time. Format 'YYYY-MM-DD-HH.MM.ss.mmmmm'. Set to None to disable

**III.1.2 Connect to the database**

Use the Monitor credentials loaded in section **I. Credentials setup** to connect to the Monitor database

In [None]:
from iotfunctions.db import Database

db = Database(credentials = monitor_credentials, entity_type=entity_name)
db_schema = None #  set if you are not using the default

In [None]:
entity_type = db.get_entity_type(name=entity_name)

**loaded data: metrics and dimensions**

The metrics and dimension data for the entity type you specified is loaded in a dataframe that is indexed with an (id, evt_timestamp) index. `id` is the device ID and `evt_timestamp` is the event time. This is the same index scheme that the pipeline uses on its dataframe.
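
To make the index scheme concrete, here is a small sketch that builds a synthetic dataframe with the same two-level index and shows two common ways of working with it. The device names and values are made up for illustration; only the level names `id` and `evt_timestamp` come from the description above.

```python
import pandas as pd

# Synthetic stand-in for the dataframe returned by get_data (values are made up)
index = pd.MultiIndex.from_tuples(
    [('robot_1', pd.Timestamp('2021-01-01 00:00:00')),
     ('robot_1', pd.Timestamp('2021-01-01 00:05:00')),
     ('robot_2', pd.Timestamp('2021-01-01 00:00:00'))],
    names=['id', 'evt_timestamp'])
sample_df = pd.DataFrame({'speed': [1.0, 1.5, 0.8], 'acc': [0.1, 0.2, 0.3]}, index=index)

# Select all rows for one device; the result is indexed by evt_timestamp only
robot_1_df = sample_df.xs('robot_1', level='id')

# Move the index levels into ordinary columns, for example before saving or merging
flat_df = sample_df.reset_index()
```

The same `xs` and `reset_index` calls should apply to `metric_df` once it is loaded, provided it carries the index described above.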

In [None]:
metric_df = entity_type.get_data(start_ts = start_ts, end_ts=end_ts) # fetch metric and dimension data

In [None]:
metric_df.head()

In [None]:
# (Setup to fetch alerts)
alert_table_name = "DM_WIOT_AS_ALERT"
alert_table_timestamp_col = "timestamp"
alert_filters = {"entity_type_id": [entity_type._entity_type_id]}

**loaded data: alerts**

Alert data for the specified entity type is loaded in a dataframe.

In [None]:
query, _ = db.query(table_name=alert_table_name, schema=db_schema,
                    timestamp_col=alert_table_timestamp_col,
                    start_ts=start_ts, end_ts=end_ts,
                    filters=alert_filters)  # query for alert data
alert_df = db.read_sql_query(sql=query.statement) # fetch alert data

In [None]:
alert_df.head()

In [None]:
# you can perform further data evaluations and manipulations in this cell

**III.1.3 Save data in csv**

The metric (with dimensions) and alert data that was loaded in the previous steps can now be saved into .csv files. Running the following step saves this data into files named **metric_{{entity_name}}.csv** and **alert_{{entity_name}}.csv** respectively. In your environment, `{{entity_name}}` is replaced by the name you specified during the load data stage.


Both files are saved in the same directory as this notebook. You can create a separate directory for the data by appending the relative directory structure to the following file path. For example, to save the data to a folder named `data` under the directory that contains this notebook, modify the file path as follows:
```python3
metric_data_filepath = f'./data/metric_{entity_name}.csv'
```
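
Note that pandas `to_csv` does not create missing directories, so the `data` folder has to exist before the save step runs. A minimal sketch of one way to handle this with `pathlib` follows; the helper name `build_data_paths` is hypothetical.

```python
from pathlib import Path


def build_data_paths(entity_name, data_dir='./data'):
    """Create data_dir if needed and return the metric and alert csv paths inside it."""
    directory = Path(data_dir)
    directory.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    metric_path = directory / f'metric_{entity_name}.csv'
    alert_path = directory / f'alert_{entity_name}.csv'
    return str(metric_path), str(alert_path)
```

It could be used as `metric_data_filepath, alert_data_filepath = build_data_paths(entity_name)`.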

In [None]:
metric_data_filepath = f'metric_{entity_name}.csv'
alert_data_filepath = f'alert_{entity_name}.csv'

In [None]:
metric_df.to_csv(metric_data_filepath, index=False,header=True) # save metric and dimension data
alert_df.to_csv(alert_data_filepath, index=False,header=True) # save alert data
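
Be aware that `index=False` leaves the dataframe index out of the file. If the device ID and event timestamp are held only in the index, as described earlier for the metric dataframe, they are not written to the csv. The sketch below shows one possible way to keep them, using synthetic data and an in-memory buffer in place of a real file.

```python
import io
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('robot_1', pd.Timestamp('2021-01-01 00:00:00')),
     ('robot_2', pd.Timestamp('2021-01-01 00:00:00'))],
    names=['id', 'evt_timestamp'])
sample_df = pd.DataFrame({'torque': [3.2, 2.9]}, index=index)

buffer = io.StringIO()  # stands in for a csv file on disk
sample_df.reset_index().to_csv(buffer, index=False)  # index levels become columns

buffer.seek(0)
restored_df = pd.read_csv(buffer, parse_dates=['evt_timestamp'])
restored_df = restored_df.set_index(['id', 'evt_timestamp'])  # rebuild the two-level index
```

Whether you need this depends on your use case; the template itself trains only on the metric columns.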

**III.2 Prepare data**

Load the data that you saved earlier and manipulate it further to prepare it for training.

**Read data back**

In [None]:
raw_data_df = pd.read_csv(metric_data_filepath) # assumes you saved the data using the steps above. If you have previously saved data, replace `metric_data_filepath` with the path to that .csv. The path should be relative to this notebook
raw_data_df.head()

In [None]:
alert_data_df = pd.read_csv(alert_data_filepath) # assumes you saved the data using the steps above. If you have previously saved data, replace `alert_data_filepath` with the path to that .csv. The path should be relative to this notebook
alert_data_df.head()

**Create a dataset to train the model**

This template uses the variables `speed` and `acc` to predict `torque`. Depending on your problem statement, you can use all or part of the available data to define your feature vector as well as the prediction variables.

In [None]:
feature_vector = ['acc', 'speed'] # user-input-required. Define feature vector. These are the variables used to train the model
target_variable = 'torque' # user-input-required. Define target variable. This is the feature you want to predict. For regression models this is a continuous variable and for classification models this is a discrete variable

In [None]:
required_data_items = []
required_data_items.extend(feature_vector)
required_data_items.append(target_variable)

In [None]:
required_data_items, feature_vector, target_variable

In [None]:
data_df = raw_data_df[required_data_items]

In [None]:
data_df.head()

In [None]:
# further data manipulation can be performed in this cell
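
As one example of further manipulation, the sketch below checks that every required column is present and drops rows where the target is missing, since those rows cannot be used for supervised training. Missing feature values are left in place on the assumption that an imputer in the training pipeline handles them. The helper name `select_training_columns` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def select_training_columns(df, features, target):
    """Return a copy with only the needed columns, dropping rows where the target is missing."""
    required = list(features) + [target]
    missing = [column for column in required if column not in df.columns]
    if missing:
        raise KeyError(f'Columns missing from the data: {missing}')
    return df[required].dropna(subset=[target]).copy()


# Synthetic example data (values are made up)
example_df = pd.DataFrame({'acc': [0.1, 0.2, 0.3],
                           'speed': [1.0, np.nan, 1.4],
                           'torque': [3.1, 3.3, np.nan],
                           'unused': [0, 0, 0]})
clean_df = select_training_columns(example_df, ['acc', 'speed'], 'torque')
```

In the notebook, the equivalent call would be `select_training_columns(raw_data_df, feature_vector, target_variable)`.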

In [None]:
from sklearn.model_selection import train_test_split

Y = data_df[target_variable]
X = data_df[feature_vector]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=143) # (Optional) user-input-required. Change/add parameters to train_test_split function call

In [None]:
# Use cell for train, test data verification
X.head()

**III.3 Create training pipeline**

In this template, a sample training pipeline is provided. The training pipeline uses [sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that implements data transformation steps followed by an estimator (the model). Using the pipeline is optional.

Learning resources
- [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
- [Sklearn ColumnTransformer for heterogeneous data](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)
- [Sklearn Feature Selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html?highlight=selectfrommodel#sklearn.feature_selection.SelectFromModel)
- [Sklearn Hyperparameter tuning](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search)
- [Python packages for machine learning](https://github.com/ml-tooling/best-of-ml-python#time-series-data)

The following code builds a preprocessor to transform input data. Numeric and categorical data are transformed using separate methods. In the sample code, you'll find a pipeline with an imputer (to fill in missing data) and a scaler to transform numerical data. The categorical features are transformed into one-hot encoded vectors. Both of these steps are combined to build the `preprocessor` module, which performs the input data transformation.

Use the sklearn learning resources to learn more about the individual parts of the `ColumnTransformer` module and to learn about the different options for data imputers, scalers, and text data transformers that are available.


```python3
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
    
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
```

`numeric_features` is a list of all numerical feature names. In our example `numeric_features=['acc', 'speed']` <br>
`categorical_features` is a list of all categorical feature names
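
To see what the preprocessor produces, here is a self-contained sketch that runs it on a small synthetic dataframe. The `mode` column is a hypothetical categorical feature added for illustration, and `sparse_threshold=0` is set only so that the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data; 'mode' is a hypothetical categorical column
toy_df = pd.DataFrame({'acc': [0.1, 0.2, np.nan, 0.4],
                       'speed': [1.0, 1.2, 1.1, np.nan],
                       'mode': ['idle', 'run', 'run', 'idle']})
toy_numeric_features = ['acc', 'speed']
toy_categorical_features = ['mode']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the column median
    ('scaler', StandardScaler())])                  # zero mean, unit variance
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, toy_numeric_features),
        ('cat', categorical_transformer, toy_categorical_features)],
    sparse_threshold=0)  # always return a dense array

transformed = toy_preprocessor.fit_transform(toy_df)
print(transformed.shape)  # two scaled numeric columns plus one column per category
```

The output has no missing values, because the imputer fills the gaps in `acc` and `speed` before scaling.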


To create and use relevant features, you can use `SelectFromModel` for an additional `feature_selection` stage in the training pipeline

```python3
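# For a regression target, select features with an L1-penalized regressor such as Lasso
# (for a classification target, LinearSVC(penalty="l1", dual=False) plays the same role)
Pipeline(steps=[('preprocessor', preprocessor),
                 ('feature_selection', SelectFromModel(Lasso())),
                 ('classification', LinearRegression())])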
```
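
The following runnable sketch shows a feature-selection stage on synthetic regression data. It assumes a regression target and uses `Lasso` as the selector, which is one reasonable choice rather than the only one; the data, step names, and `alpha` value are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 6 features, only 2 of which carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=2,
                                 noise=1.0, random_state=0)

selection_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_selection', SelectFromModel(Lasso(alpha=1.0))),  # L1 penalty shrinks weak coefficients to zero
    ('regression', LinearRegression())])
selection_pipeline.fit(X_demo, y_demo)

# Boolean mask of the features kept by the selection stage
selected_mask = selection_pipeline.named_steps['feature_selection'].get_support()
print(selected_mask.sum(), 'of', X_demo.shape[1], 'features kept')
```

Inspecting `get_support()` after fitting is a quick way to confirm which input features actually reach the final estimator.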

In [None]:
numeric_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = list(set(X.columns.tolist()).difference(numeric_features))

In [None]:
numeric_features, categorical_features

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import LinearSVC
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.compose import ColumnTransformer


## Example Pipeline
transformers = []  # only add a transformer when matching columns exist
if numeric_features:
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
    transformers.append(('num', numeric_transformer, numeric_features))
if categorical_features:
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    transformers.append(('cat', categorical_transformer, categorical_features))

preprocessor = ColumnTransformer(transformers=transformers)

skl_pipeline = Pipeline(steps=[('preprocessor', preprocessor), # (Optional) preprocessor stage
                               ('classification', GradientBoostingRegressor())]) # user-input-required. Estimator/machine learning model you want to run

**III.4 Train model**

You can run the training pipeline by completing the setup in the previous cell, or you can set up hyperparameter tuning as follows and run the search pipeline. The training pipeline runs a single training job, while the hyperparameter search runs several jobs to find the best model parameters across the ranges specified in `param_grid`. The search pipeline is more computationally intensive, so it takes longer to run.


In [None]:
use_hyperparameter_tuning = False # (optional) user-input-required. Control for turning hyperparameter_tuning on or off. Run this cell to turn hyperparameter tuning off if you ran the cell below to turn it on

**(Optional) Set up hyperparameter tuning**

- Parameters of pipelines can be set using `__` separated parameter names
- Specify the parameters to tune in `param_grid`
- Specify other parameters as needed

Refer to [Tuning hyperparameter](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search) for further details
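
The `__` naming convention can be sketched as follows. The example lists the tunable names exposed by a pipeline step and then runs a small randomized search. The synthetic data, the step name `classification` (chosen to match the template), and the parameter ranges are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=1.0, random_state=0)

demo_pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                                ('classification', GradientBoostingRegressor(random_state=0))])

# Every tunable name has the form '<step name>__<parameter name>'
tunable_names = [name for name in demo_pipeline.get_params()
                 if name.startswith('classification__')]

demo_grid = {'classification__n_estimators': [20, 40],
             'classification__max_depth': [2, 3]}
# n_iter limits how many parameter combinations are sampled from the grid
demo_search = RandomizedSearchCV(demo_pipeline, demo_grid, n_iter=3, cv=2, random_state=0)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

Printing `tunable_names` for your own pipeline is a quick way to find valid keys for `param_grid`.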

In [None]:
from sklearn.model_selection import RandomizedSearchCV
use_hyperparameter_tuning = True 
param_grid = {
    'classification__loss': ['squared_error', 'absolute_error']  # named 'ls' and 'lad' in scikit-learn releases before 1.0
}

search = RandomizedSearchCV(skl_pipeline, param_grid, n_jobs=-1, cv=3)

**Run Training Job**

Depending on factors such as the model you selected, whether you are running hyperparameter tuning, and the computational power you have available to run this notebook, the following cell might take several minutes to run.

In [None]:
if 'use_hyperparameter_tuning' in globals() and use_hyperparameter_tuning:
    print('Running the Hyperparameter tuning ...')
    model = search
else:
    print('Running training pipeline ...')
    model = skl_pipeline

model.fit(X_train, y_train)  # fit once, whichever object was selected above
print('Finished training job')

**Run additional validation on model**

In [None]:
y_pred = model.predict(X_test)
rmse = np.mean((np.round(y_pred) - y_test.values)**2)**0.5
print('RMSE: {}'.format(rmse))
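
The cell above rounds the predictions before computing the error, which may suit integer-valued targets. For a continuous target, you might prefer metrics on the unrounded predictions. The sketch below is one way to collect a few standard regression metrics with scikit-learn; the helper name `regression_report` is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for a set of predictions."""
    return {'rmse': float(np.sqrt(mean_squared_error(y_true, y_pred))),
            'mae': float(mean_absolute_error(y_true, y_pred)),
            'r2': float(r2_score(y_true, y_pred))}
```

In the notebook this could be called as `regression_report(y_test, model.predict(X_test))`.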

**(Optional) Save a local copy of the model**

In [None]:
import pickle
# save the model to disk
model_filename = 'monitor_wml_model.sav'
with open(model_filename, 'wb') as f:  # the with block closes the file after writing
    pickle.dump(model, f)

**(Optional) Load the local copy of the model**

In [None]:
import pickle
with open(model_filename, 'rb') as f:  # the with block closes the file after reading
    model = pickle.load(f)

## III. Deploy Model in Monitor

Use the connection to the Monitor database that you made in **III.1.2 Connect to the database**. When you deploy the trained model, store the feature vector in addition to the model. The feature vector is used for data validation within the custom function used for inference.

In [None]:
model

The model and feature vector information is stored in the Monitor database and accessed by Monitor's custom function at inference time. The method shown here is one way to retain this information. When making a custom function, you need **the keys to access the model and features** as well as the **model_name**.

In [None]:
# save model_name and feature vectors
model_name = 'my_gbm_regressor' # user-input-required
cache_model_and_features = { 'model': model,
                             'features': feature_vector}

In [None]:
cache_model_and_features['features']

In [None]:
db.model_store.store_model(model_name, cache_model_and_features)

## (Optional) IV. Test predictions using deployed model

**Fetch model**

In [None]:
# test the model was saved correctly by accessing it
deployed_model_metadata = db.model_store.retrieve_model(model_name)
deployed_model_metadata

**Separate the model and features**

Use the feature vector to perform data validation. A basic data validation is to make sure that all the specified features are present and in correct order in the dataframe used for prediction. 

Use the model to generate prediction. 

In [None]:
deployed_model = deployed_model_metadata['model']
deployed_features = deployed_model_metadata['features']
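
The basic validation described above can be sketched as a small helper that checks for missing features and returns the columns in training order. The function name `validate_features` and the sample data are hypothetical.

```python
import pandas as pd


def validate_features(df, features):
    """Check that all trained features are present and return them in training order."""
    missing = [name for name in features if name not in df.columns]
    if missing:
        raise ValueError(f'Input data is missing features used in training: {missing}')
    return df[list(features)]


# Synthetic scoring data with the columns in a different order plus an extra column
scoring_df = pd.DataFrame({'speed': [1.0, 1.2], 'extra': [0, 1], 'acc': [0.1, 0.2]})
validated_df = validate_features(scoring_df, ['acc', 'speed'])
```

Selecting columns by the stored feature list both drops unrelated columns and restores the order the model was trained with.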

**Test prediction**

In this sample the test split from earlier is used to validate the model

In [None]:
y_pred = deployed_model.predict(X_test)
rmse = np.mean((np.round(y_pred) - y_test.values)**2)**0.5
print('RMSE: {}'.format(rmse))

**(Optional) Sample code to delete any old model**

In [None]:
db.model_store.delete_model(model_name="my_gbm_regressor")

At this stage, you have deployed a trained model in Monitor Database. Congratulations!

### Setup for real-time inference
The last part of this machine learning experience is to use the deployed model for inference. For this step you need to [create a custom function](https://github.com/ibm-watson-iot/functions/tree/advanced_custom_function_starter). [Example custom function used for inference](https://github.com/singhshraddha/custom-functions/blob/development/custom/forecast.py#L156)

Within the `execute` method of the custom-function you will
1. retrieve the model and feature vector from the Monitor database
2. perform data validation using the feature vector
3. call the prediction method for the estimator you used
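
The three steps above can be sketched in plain Python as follows. This is not a real custom function: `InMemoryModelStore` is a hypothetical stand-in for the Monitor model store, `run_inference` is a hypothetical helper, and the data is synthetic. A real custom function would put equivalent logic inside its `execute` method and follow the custom function starter linked above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


class InMemoryModelStore:
    """Hypothetical stand-in for the Monitor model store, backed by a dict."""

    def __init__(self):
        self._models = {}

    def store_model(self, model_name, model):
        self._models[model_name] = model

    def retrieve_model(self, model_name):
        return self._models.get(model_name)


def run_inference(df, model_store, model_name, output_name):
    """Mirror the three steps: retrieve, validate, predict."""
    cached = model_store.retrieve_model(model_name)        # 1. retrieve model and features
    if cached is None:
        raise ValueError(f'No model stored under the name {model_name!r}')
    model, features = cached['model'], cached['features']
    missing = [name for name in features if name not in df.columns]
    if missing:                                            # 2. validate the input data
        raise ValueError(f'Missing features: {missing}')
    result = df.copy()
    result[output_name] = model.predict(df[features])      # 3. predict, using training column order
    return result


# Synthetic end-to-end check (torque = 2 * acc + speed is a made-up relationship)
train_df = pd.DataFrame({'acc': [0.0, 1.0, 2.0, 3.0], 'speed': [1.0, 0.0, 1.0, 0.0]})
train_df['torque'] = 2 * train_df['acc'] + train_df['speed']
fitted = LinearRegression().fit(train_df[['acc', 'speed']], train_df['torque'])

store = InMemoryModelStore()
store.store_model('demo_model', {'model': fitted, 'features': ['acc', 'speed']})
scored_df = run_inference(train_df[['speed', 'acc']], store, 'demo_model', 'predicted_torque')
```

The cached dictionary uses the same `model` and `features` keys as the deployment step earlier in this notebook, so the retrieval logic carries over unchanged.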

The custom function should accept the model name and the feature names as parameters. You define these parameters in the `build_ui` and `__init__` methods. To select features regardless of datatype,
[set the corresponding UI item's datatype=None](https://github.com/singhshraddha/custom-functions/blob/1923713d7cfc71c8a0fb9f4bc9da365a40bc733e/custom/forecast.py#L156) *in this case UIMultiItem*. This setting lets you select features name across all metrics, dimensions, and alerts. 

After creating and registering a custom function, follow these steps to get the predictions on the UI <br>
**From the Monitor UI setup assets data tab** <br>
1. Choose your custom function from the catalog <br>
2. Set saved_model_name parameter to model_name <br>
3. Select the raw metric/dimension/alert features used for training the model as the features parameter from the drop down <br>
4. Specify the target/output name <br>