## **Train** an ML model locally, **Deploy** the model in Watson Studio, and perform real-time **Prediction** in Maximo-Monitor


You can train and deploy a machine learning model to Watson Studio in several ways. You can use this notebook as a template for training your model locally using data from Maximo Monitor. Then, deploy your model to Watson Studio. At the end of this notebook, you'll find more information about how to get started with real-time inferencing.


If you are new to Watson Machine Learning and Cloud Pak for Data, see the documentation at the following links to learn more:
- [Watson Machine Learning Documentation](https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=deploying-managing-models-functions)
- [Jupyter Notebook examples for deploying and using Watson Studio models](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-samples-overview.html?context=cpdaas&audience=wdp)
- This notebook is based on [Watson Studio example notebook](https://github.com/IBM/watson-machine-learning-samples/blob/master/cpd4.0/notebooks/python_sdk/deployments/custom_library/Use%20scikitlearn%20and%20custom%20library%20to%20predict%20temperature.ipynb)


**Note** The following files are required to run this notebook and they are not provided with this template: 
1. monitor-credentials.json. Contains credentials to connect to Maximo Monitor. Contents explained in section **I. Credentials setup**
2. wml-credentials.json. Contains credentials to connect to the Watson Machine Learning API. Contents explained in section **I. Credentials setup**

Create your own files with the correct file paths. In the template, all required inputs are marked with the `user-input-required` comment as a guide.

**Pre-requisites**
1. Install the iotfunctions package
2. Install any other packages that are required to train specific machine learning models

In [None]:
from ibm_watson_machine_learning import APIClient # watson studio
import json # loading credentials
import pandas as pd
import numpy as np

## I. Credentials setup
You must create two sets of credentials:

1. Credentials to allow Maximo Monitor to access its database and APIs.
2. Credentials to allow Watson Machine Learning to access Watson Studio deployment spaces. 

The credentials are saved in JSON files. In the following cell, define the relative path (that is, relative to this notebook) of both of these credentials files. The cells that follow load the files into local variables.

**Note**: Keep all credentials safe and hidden

In [None]:
MONITOR_CREDENTIALS_FILE_PATH = './dev_resources/monitor-credentials.json'  # user-input-required. Set it to the path of your monitor-credentials file relative to this notebook
WML_CREDENTIALS_FILE_PATH = './dev_resources/wml-credentials.json' # user-input-required. Set it to the path of your wml-credentials file relative to this notebook

**More information about credentials in Maximo Monitor**

[Maximo Monitor documentation](https://www.ibm.com/docs/en/maximo-monitor/8.5.0?topic=monitor-connection-parameters)

In [None]:
with open(MONITOR_CREDENTIALS_FILE_PATH, 'r') as f: 
    monitor_credentials = json.loads(f.read())

**More information about credentials in Watson Machine Learning**

1. [Generate new API key](https://cloud.ibm.com/iam/apikeys) for Watson Machine Learning services or use an existing API key

2. Save your Watson Machine Learning credentials to a JSON file

```json
{
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "xxxx"
}
```
**Note**: Depending on the region of your provisioned service instance, use one of the following as your url:
```
Dallas: https://us-south.ml.cloud.ibm.com
London: https://eu-gb.ml.cloud.ibm.com
Frankfurt: https://eu-de.ml.cloud.ibm.com
Tokyo: https://jp-tok.ml.cloud.ibm.com
```

[More information](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-authentication.html?context=cpdaas&audience=wdp) about WML Authentication

In [None]:
with open(WML_CREDENTIALS_FILE_PATH, 'r') as f: 
    wml_credentials = json.loads(f.read())
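A missing key in a credentials file tends to surface only later, as an authentication error that is hard to trace. The following sketch checks that a loaded JSON file contains the `url` and `apikey` keys described above before the credentials are used. The helper name `load_credentials` and the temporary demo file are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def load_credentials(path, required_keys):
    """Load a JSON credentials file and fail early if a key is missing.

    Hypothetical helper, not part of iotfunctions or the WML client.
    """
    with open(path, 'r') as f:
        credentials = json.load(f)
    missing = [key for key in required_keys if key not in credentials]
    if missing:
        raise KeyError(f'Missing keys in {path}: {missing}')
    return credentials


# Demo with a throwaway file so that the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'wml-credentials.json')
    with open(demo_path, 'w') as f:
        json.dump({'url': 'https://us-south.ml.cloud.ibm.com', 'apikey': 'xxxx'}, f)
    demo_credentials = load_credentials(demo_path, required_keys=['url', 'apikey'])

print(sorted(demo_credentials))
```

In the notebook itself, the same check could be applied to `WML_CREDENTIALS_FILE_PATH` in place of the demo file.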

## II. Connect to Watson Machine Learning API Client

Connect to the API Client using the credentials that you loaded in the previous step. This client is used to 
 1. Extract deployment spaces
 2. Set the deployment space to the space where you want to deploy your model
 3. Extract the software specification that you will use to deploy the model
 4. Extract the hardware specification that you will use to deploy the model
 
[Watson Machine Learning API Client Documentation](http://ibm-wml-api-pyclient.mybluemix.net)

In [None]:
client = APIClient(wml_credentials)

### Deployment Space

To deploy a model to Watson Machine Learning services, you must connect to a deployment space:

1. [Create a deployment space](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-spaces_local.html?context=cpdaas&audience=wdp) if you don't have one yet
2. Set the space ID manually using the list generated from `client.spaces.list()`

Note: if `client.spaces.list()` is empty you will need to create a deployment space as specified in Step 1

In [None]:
client.spaces.list(limit=10) #lists deployment spaces

In [None]:
client.set.default_space('select-and-set-space-id') # user-input-required. Select a space id from the list generated in the above cell

**Software Specification**

[Learn more about software specification and why you need them](https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=overview-specifying-model-type-software-specification)

In [None]:
client.software_specifications.list(limit=5)

**Hardware Specification**


Hardware specifications determine how many CPUs and how much RAM can be used during inferencing. Different specifications allow you to scale the deployment as needed (but at a cost).

[Learn more about CUH costs for different hardware specifications](https://dataplatform.cloud.ibm.com/docs/content/wsj/landings/wml-plans.html?context=cpdaas&audience=wdp)

In [None]:
client.hardware_specifications.list()

## III. Train Model

[Read the article about Machine Learning](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) if you are interested in getting familiar with some of the basics.

This notebook can be modified and used for regression or classification tasks. [A short refresher for classification vs regression tasks](https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/)
 
The template loads an entity's data and guides you to train a machine learning model locally by using that data. An entity in Monitor has several database tables that store metrics, dimensions, derived metrics, and alerts. To load these features for an entity, follow this step-by-step guide.

**Steps to train a model in a notebook**
1. Load training data from Monitor database (for a specified entity)
2. Prepare dataset for training
3. Create a training pipeline
4. Train the model

**III.1 Load training data from Monitor database**

An entity in Monitor has several database tables that store raw metrics, dimensions, and alerts. To load these features, you perform two database reads. The first read fetches raw metrics and dimensions, and the second read fetches the alerts for the user-provided entity type. Both data streams are stored in separate comma-separated value (.csv) files and are available to manipulate as required.


Complete the following steps to load training data:
1. Connect to the database
2. Specify entity_name (and any other user input)
3. Save the data in a csv to use later

**Note** If you already have data available in a csv file, go to section **III.2 Prepare data**

Set the following variables:
- (Required) entity_name  # Name of the entity that you want to retrieve data from. Use the same name that is displayed on the user interface in Monitor.
- (Optional) start_ts # Start fetching data from this date and time.
- (Optional) end_ts # Fetch data until this date and time.

**III.1.1 Connect to the database**

Use the monitor credentials loaded in section **I. Credentials setup** to connect to the Monitor database

In [None]:
from iotfunctions.db import Database

db = Database(credentials = monitor_credentials)
db_schema = None #  set if you are not using the default

**III.1.2 Specify entity_name (and any other user input)**

In [None]:
# (User Input) user-input-required

entity_name = 'shraddha_robot' # user-input-required. Set to entity name you'd like to retrieve data from. This is the same name as displayed on the Monitor UI

start_ts = None # user-input-required. (Optional) Fetch data starting from this date/time. Format 'YYYY-MM-DD-HH.MM.ss.mmmmm'. Set to None to disable

end_ts = None # user-input-required. (Optional) Fetch data until this date/time. Format 'YYYY-MM-DD-HH.MM.ss.mmmmm'. Set to None to disable

In [None]:
entity_type = db.get_entity_type(name=entity_name)

In [None]:
metric_df = entity_type.get_data(start_ts = start_ts, end_ts=end_ts) # fetch metric and dimension data

**loaded data: metrics and dimension**

The metrics and dimension data for the entity type you specified is loaded in a dataframe that is indexed with a (id, evt_timestamp) index. `id` is the device ID and `evt_timestamp` is the event time. This is the same index scheme that the pipeline uses on its dataframe.
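To make the `(id, evt_timestamp)` index scheme concrete, here is a minimal pandas sketch on synthetic data. The device names and values are made up, and the dataframe only imitates the shape of the data that is returned from the Monitor database.

```python
import pandas as pd

# Synthetic stand-in for the dataframe returned by the Monitor database read
demo_index = pd.MultiIndex.from_tuples(
    [('robot_1', pd.Timestamp('2021-01-01 00:00:00')),
     ('robot_1', pd.Timestamp('2021-01-01 00:05:00')),
     ('robot_2', pd.Timestamp('2021-01-01 00:00:00'))],
    names=['id', 'evt_timestamp'])
demo_metric_df = pd.DataFrame(
    {'acc': [0.1, 0.2, 0.3], 'speed': [1.0, 1.5, 2.0], 'torque': [10.0, 12.0, 15.0]},
    index=demo_index)

# Select all rows for one device by the first index level
robot_1_df = demo_metric_df.xs('robot_1', level='id')

# Move the index levels into ordinary columns, for example before writing a
# CSV with index=False, which would otherwise drop them
flat_df = demo_metric_df.reset_index()
print(flat_df.columns.tolist())
```

Calling `reset_index()` is only needed if the device ID or the event time should survive as columns, for instance to build time-based features later.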

In [None]:
metric_df.head()

In [None]:
# (Setup to fetch alerts)
alert_table_name = "DM_WIOT_AS_ALERT"
alert_table_timestamp_col = "timestamp"
alert_filters = {"entity_type_id": [entity_type._entity_type_id]}

In [None]:
query, _ = db.query(table_name=alert_table_name, schema=db_schema,timestamp_col=alert_table_timestamp_col, start_ts=start_ts, end_ts=end_ts, filters=alert_filters) #query for alert data
alert_df = db.read_sql_query(sql=query.statement) # fetch alert data

**loaded data: alerts**

Alert data for the specified entity type is loaded in a dataframe.

In [None]:
alert_df.head()

In [None]:
# you can perform further data evaluations and manipulations in this cell

**III.1.3 Save data in csv**

The metric (with dimensions) and alert data that was loaded in the previous steps can now be saved into .csv files. Running the following step saves this data into files named **metric_{{entity_name}}.csv** and **alert_{{entity_name}}.csv** respectively. In your environment, `{{entity_name}}` is replaced by the name you specified during the load data stage.


Both files are saved in the same directory as this notebook. You can create a separate directory for the data by appending the relative directory structure to the following file path. For example, to save the data to a folder named `data` under the directory that contains this notebook, modify the file path as follows:
```python3
metric_data_filepath = f'./data/metric_{entity_name}.csv'
```
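Note that `to_csv` does not create missing directories, so a `data` folder has to exist before the save step runs. The following sketch creates the folder first. It uses a temporary directory in place of the notebook directory and a made-up entity name, so it can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

entity_name_demo = 'demo_entity'  # hypothetical entity name
demo_df = pd.DataFrame({'acc': [0.1, 0.2], 'speed': [1.0, 1.5]})

# A temporary directory stands in for the notebook directory
with tempfile.TemporaryDirectory() as notebook_dir:
    data_dir = Path(notebook_dir) / 'data'
    data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    demo_filepath = data_dir / f'metric_{entity_name_demo}.csv'
    demo_df.to_csv(demo_filepath, index=False, header=True)
    reloaded_df = pd.read_csv(demo_filepath)

print(reloaded_df.shape)
```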

In [None]:
metric_data_filepath = f'metric_{entity_name}.csv'
alert_data_filepath = f'alert_{entity_name}.csv'

In [None]:
metric_df.to_csv(metric_data_filepath, index=False,header=True) # save metric and dimension data
alert_df.to_csv(alert_data_filepath, index=False,header=True) # save alert data

**III.2 Prepare data**

Load the data that you saved earlier and manipulate it further to prepare it for training.

**Read data back**

In [None]:
raw_data_df = pd.read_csv(metric_data_filepath) # assumes you saved data using steps above. If you have pre-saved data, replace `metric_data_filepath` with the path to that .csv. The path should be relative to this notebook
raw_data_df.head()

In [None]:
alert_data_df = pd.read_csv(alert_data_filepath) # assumes you saved data using steps above. If you have pre-saved data, replace `alert_data_filepath` with the path to that .csv. The path should be relative to this notebook
alert_data_df.head()

**Create a dataset to train the model**

This template uses the variables `speed` and `acc` to predict `torque`. Depending on your problem statement, you can use all or part of the available data to define your feature vector as well as the prediction variables.

In [None]:
feature_vector = ['acc', 'speed'] # user-input-required. Define feature vector. These are the variables used to train the model
target_variable = 'torque' # user-input-required. Define target variable. This is the feature you want to predict. For regression models this is a continuous variable and for classification models this is a discrete variable

In [None]:
required_data_items = []
required_data_items.extend(feature_vector)
required_data_items.append(target_variable)

In [None]:
data_df = raw_data_df[required_data_items]

In [None]:
data_df.head()

In [None]:
# further data manipulation can be performed in this cell

In [None]:
from sklearn.model_selection import train_test_split

Y = data_df[target_variable] # target/prediction/classification variable, as defined in `target_variable` above
X = data_df[feature_vector] # feature vector, as defined in `feature_vector` above

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=143) # create a train-test split

In [None]:
# Use cell for train, test data verification
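A few quick checks can confirm that a train-test split behaves as expected. The sketch below runs on randomly generated data with the same column names as the template. The `_demo` variable names are illustrative only, and the checks are one possible selection, not a complete validation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'acc': rng.normal(size=100),
    'speed': rng.normal(size=100),
    'torque': rng.normal(size=100),
})

X_demo = demo_df[['acc', 'speed']]
y_demo = demo_df['torque']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=143)

# The split should use every row exactly once
print(X_tr.shape, X_te.shape)
assert len(X_tr) + len(X_te) == len(X_demo)
assert X_tr.index.intersection(X_te.index).empty

# Features and target must stay aligned row by row
assert X_tr.index.equals(y_tr.index)

# Missing values per column, which an imputer would have to handle
print(X_tr.isna().sum().to_dict())
```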

**III.3 Create training pipeline**

In this template, a sample training pipeline is provided. The training pipeline uses [sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that implements data transformation steps followed by an estimator (the model). Using the pipeline is optional.

Learning resources
- [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
- [Sklearn Transformers for text data](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)
- [Sklearn Feature Selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html?highlight=selectfrommodel#sklearn.feature_selection.SelectFromModel)
- [Sklearn Hyperparameter tuning](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search)
- [Python packages for machine learning](https://github.com/ml-tooling/best-of-ml-python#time-series-data)

The following code builds a preprocessor to transform input data. The numeric and categorical data are transformed using separate methods. In the sample code, you'll find a pipeline with an imputer (to complete missing data) and a scaler to transform numerical data. The categorical features are transformed into one-hot encoded vectors. Both of these steps are combined to build the `preprocessor` module to perform input data transformation.

Use the sklearn learning resources to learn more about the individual parts of the `ColumnTransformer` module and to learn about the different options for data imputers, scalers, and text data transformers that are available.


```python3
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
    
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
```

`numeric_features` is a list of all numerical feature names. In our example `numeric_features=['acc', 'speed']` <br>
`categorical_features` is a list of all categorical feature names
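The preprocessor snippet above depends on imports and feature lists that are defined elsewhere. As a self-contained illustration, the following sketch applies the same structure to a small synthetic dataframe. The `mode` column is a hypothetical categorical feature that is not part of the template data, and the `_demo` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data; the 'mode' column is a hypothetical categorical feature
demo_X = pd.DataFrame({
    'acc': [0.1, np.nan, 0.3, 0.4],
    'speed': [1.0, 1.5, np.nan, 2.5],
    'mode': ['idle', 'run', 'run', 'idle'],
})
numeric_features_demo = ['acc', 'speed']
categorical_features_demo = ['mode']

numeric_transformer_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer_demo = OneHotEncoder(handle_unknown='ignore')

preprocessor_demo = ColumnTransformer(transformers=[
    ('num', numeric_transformer_demo, numeric_features_demo),
    ('cat', categorical_transformer_demo, categorical_features_demo)])

transformed = preprocessor_demo.fit_transform(demo_X)
# The result can be a sparse matrix when most entries are zero
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

# Two scaled numeric columns plus one column per category ('idle', 'run')
print(transformed.shape)
```

The imputer fills the two missing values with the column medians before scaling, so the transformed array contains no missing values.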


To create and use relevant features, you can use `SelectFromModel` for an additional `feature_selection` stage in the training pipeline

```python3
Pipeline(steps=[('preprocessor', preprocessor),
                 ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
                 ('classification', LinearRegression())])
```

In [None]:
numeric_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = list(set(X.columns.tolist()).difference(numeric_features))

In [None]:
numeric_features, categorical_features

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import LinearSVC
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.compose import ColumnTransformer


## Example Pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])

skl_pipeline = Pipeline(steps=[('preprocessor', preprocessor), # (Optional) preprocessor stage
                               ('classification', GradientBoostingRegressor())]) # user-input-required estimator/machine learning model you want to run

**III.4 Train model**

You can run the training pipeline by completing the setup in the previous cell or you can set up hyperparameter tuning as follows and run the search pipeline. The training pipeline runs a single training job while the hyperparameter search runs several jobs to find the best model parameters across the ranges specified in `param_grid`. The search pipeline is more computationally intensive and so it takes longer to run.


**(Optional) Set up hyperparameter tuning**

- Parameters of pipelines can be set using `__` separated parameter names
- Specify parameters to train in `param_grid`
- Specify other parameters as needed

Refer to [Tuning hyperparameter](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search) for further details
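One way to find valid `__` separated names is to inspect the pipeline's parameters. The sketch below builds a small pipeline whose estimator step is named `classification`, as in the template, lists the parameters of that step, and sets one of them. The pipeline and the chosen value are illustrative and are not the template's configuration.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GradientBoostingRegressor())])

# Every tunable name has the form <step name>__<parameter name>
estimator_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith('classification__'))
print(estimator_params[:5])

# The same naming works for setting a value directly on the pipeline
demo_pipeline.set_params(classification__n_estimators=50)
n_estimators = demo_pipeline.named_steps['classification'].n_estimators
```

Any name printed this way can be used as a key in `param_grid`.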

In [None]:
from sklearn.model_selection import RandomizedSearchCV
use_hyperparameter_tuning = True # set to False to run the training pipeline without a parameter search

param_grid = {
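    'classification__loss': ['ls', 'lad'] # loss names for scikit-learn 0.23. scikit-learn 1.2 and later accept 'squared_error' and 'absolute_error' instead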
}

search = RandomizedSearchCV(skl_pipeline, param_grid, n_jobs=-1, cv=3)

**Run Training Job**

Depending on factors such as the model you selected, whether you are running hyperparameter training, and the computational power you have available to run this notebook, the following cell might take several minutes to run.

In [None]:
if 'use_hyperparameter_tuning' in globals() and use_hyperparameter_tuning:
    print('Running the Hyperparameter tuning ...')
    search.fit(X_train, y_train) # fits every candidate, then refits the best one on the full training set
    model = search
else:
    print('Running training pipeline ...')
    skl_pipeline.fit(X_train, y_train)
    model = skl_pipeline

print('Finished training job')

**Run additional validation on model**

In [None]:
y_pred = model.predict(X_test)
rmse = np.mean((y_pred - y_test.values)**2)**0.5 # root mean squared error on the unrounded predictions
print('RMSE: {}'.format(rmse))
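RMSE is the square root of the mean squared difference between predicted and observed values. The following sketch computes it by hand on three made-up values and cross-checks the result against `sklearn.metrics.mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# RMSE written out: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_pred_demo - y_true_demo) ** 2))

# The same value from scikit-learn's mean squared error
rmse_sklearn = mean_squared_error(y_true_demo, y_pred_demo) ** 0.5

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a single large miss (here the error of 2 on the last value) dominates the result.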

**(Optional) Save a local copy of the model**

In [None]:
import pickle
# save the model to disk
model_filename = 'monitor_wml_model.sav'
with open(model_filename, 'wb') as f: # the with block closes the file after writing
    pickle.dump(model, f)

**(Optional) Load the local copy of the model**

In [None]:
with open(model_filename, 'rb') as f: # the with block closes the file after reading
    model = pickle.load(f)

## IV. Deploy the Model in Watson Machine Learning

In this step, you use the connection to the WML Python client that you made in **Step II. Connect to Watson Machine Learning API Client**. When you deploy the trained model, you must **a.** set the metadata, **b.** retain the model, and **c.** deploy the model to Watson Machine Learning.

For **a.**, set the following metadata:
1. Model name
2. Software specification. Run the following command to view a list of software specification options: `client.software_specifications.list()`. Pick the specification best suited for your model.
3. Hardware specification. Run the following command to view a list of hardware specification options: `client.hardware_specifications.list()`. Pick the specification best suited for your model.

(These specifications were extracted when you connected to the Watson Studio Client)

In [None]:
base_sw_spec_uid = client.software_specifications.get_uid_by_name("default_py3.7") # user-input-required

The following code sets up the software specifications when using only scikit-learn for modeling purposes. For information about more intricate software setup when using custom packages see the [Watson ML sample notebooks](https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=apis-machine-learning-python-example-notebooks) documentation.

In [None]:
model_props = {
    client.repository.ModelMetaNames.NAME: "MaxTemp prediction model",
    client.repository.ModelMetaNames.TYPE: 'scikit-learn_0.23', # user-input-required
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: base_sw_spec_uid
    
}

**b. Retain model**

To save the trained model, specify a software specification. Learn more about the available [software specifications](https://dataplatform.cloud.ibm.com/docs/content/wsj/wmls/wmls-deploy-python-types.html?context=cpdaas&audience=wdp) options and their usage

In [None]:
published_model = client.repository.store_model(model=model, meta_props=model_props)

In [None]:
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

**c. Deploy model**

For this deployment, a hardware specification that requires the least amount of CUHs is used. The decision to use a different hardware specification can be guided by speed, performance and other variables. <br>
[More Information on CUH usage for different hardware specifications](https://dataplatform.cloud.ibm.com/docs/content/wsj/landings/wml-plans.html?context=cpdaas&audience=wdp)


The default hardware specification is S(2CPUs 16 GB). In this template, the specification is changed to XXS(1CPU 8 GB)

In [None]:
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "Deployment of test model",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC : { "id":  "b128f957-581d-46d0-95b6-8af5cd5be580"} # user-input-required
}

created_deployment = client.deployments.create(published_model_uid, meta_props=metadata)

**Extract deployment uuid and other information**

The deployment uuid is required to make predictions.

Set the **wml_auth** constant field as defined in [Maximo Monitor documentation](https://www.ibm.com/docs/en/maximo-monitor/8.6.0?topic=detectors-using-externall-model). This field is used by the `InvokeWatsonStudio` catalog function to make real-time predictions by using the deployed model.

In [None]:
deployment_uid = client.deployments.get_uid(created_deployment)
print(deployment_uid)

In [None]:
client.deployments.get_details(deployment_uid)

**Note** Information needed for scoring
1. wml_credentials
2. Deployment space_id
3. Deployment ID
4. Feature vector

## (Optional) V. Test predictions using the deployed model

In [None]:
scoring_endpoint = client.deployments.get_scoring_href(created_deployment)
print(scoring_endpoint)

In [None]:
scoring_payload = {
    "input_data": [{
        'fields': ['acc', 'speed'],
        'values': [[22, 23]]}]
}
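For more than one row, the payload can be built from a dataframe instead of being typed by hand. The sketch below wraps dataframe columns and rows in the same `input_data` / `fields` / `values` layout as the cell above. The helper name `build_scoring_payload` and the sample values are hypothetical.

```python
import pandas as pd


def build_scoring_payload(feature_df):
    """Hypothetical helper: wrap dataframe rows in the scoring payload layout."""
    return {
        'input_data': [{
            'fields': feature_df.columns.tolist(),
            'values': feature_df.values.tolist(),
        }]
    }


demo_features = pd.DataFrame({'acc': [22, 24], 'speed': [23, 25]})
demo_payload = build_scoring_payload(demo_features)
print(demo_payload)
```

The column order of the dataframe determines the order of `fields`, so it should match the feature vector that the model was trained on.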

In [None]:
predictions = client.deployments.score(deployment_uid, scoring_payload)

In [None]:
print(json.dumps(predictions, indent=2))
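The printed response can be turned into a dataframe for further analysis. The sketch below assumes that the response mirrors the payload layout, with a `predictions` list whose entries carry `fields` and `values`. That layout and the sample numbers are assumptions written by hand, not output captured from a real deployment, so compare them with the output of the previous cell before relying on them.

```python
import pandas as pd

# Hand-written sample; the layout is an assumption, not captured from a real deployment
sample_predictions = {
    'predictions': [{
        'fields': ['prediction'],
        'values': [[41.7], [39.2]],
    }]
}

first_result = sample_predictions['predictions'][0]
predictions_df = pd.DataFrame(first_result['values'], columns=first_result['fields'])
print(predictions_df)
```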

At this stage, you have deployed a trained model by using Watson Machine Learning services. Congratulations!

Use the catalog function **InvokeWatsonStudio**, as described in the [Maximo Monitor documentation](https://www.ibm.com/docs/en/maximo-monitor/8.6.0?topic=detectors-using-externall-model), to make predictions from the Monitor pipeline by using the deployed model.