In [1]:
# The code was removed by Watson Studio for sharing.

<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# IBM Watson OpenScale - Generate Configuration Archive

This notebook demonstrates how to generate a configuration archive for monitoring deployments in IBM Watson OpenScale. This configuration is targetted for `System-Managed` monitored deployments.

***Target audience for this notebook:***
This notebook is targetted for users who fall in either of the below categories:
- Users who cannot provide training data (*as CSV or in DB2 or COS*) while configuring a deployment for monitoring in IBM Watson OpenScale
- Users who have large amount of training data (> 500MB) and as such can not be used for creating artifacts in IBM Watson OpenScale
- Users who are looking for automation and/or more granular control using Python SDK.

User must provide the necessary inputs where marked. Generated configuration package can be used in IBM Watson OpenScale UI while configuring monitoring of a model deployment in IBM Watson OpenScale.

**Contents:**
1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Training Data](#training-data) - Read the training data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM Watson OpenScale Services and their configuration
4. [Generate Configuration Archive](#generate-configuration-archive)
5. [Helper Methods](#helper-methods)
6. [Definitions](#definitions)

## Setting up the environment

In [2]:
%pip install --upgrade "ibm-metrics-plugin>=4.7.0" "ibm-watson-openscale>=3.0.29.6" | tail -n 1

Note: you may need to restart the kernel to use updated packages.


In [3]:
import  ibm_watson_openscale
import pandas as pd
import requests
import json
import numpy as np

In [4]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2018, 2023
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "6.0.0"

# Version history

# 6.0.0 : Complete Re-design of the notebook; Official notebook for IBM CP4D 4.7.x
#         Upgrade ibm-metrics-plugin to >= 4.7.0; Includes Support for Drift v2 Archive
#         Refactored for Configuration Package; Deprecate the Drift Archive
# 5.4.7 : Official notebook for IBM CP4D 4.6.4
#         Upgrade ibm-metrics-plugin to >= 4.6.4.0
# 5.4.6 : Take optional class labels input for global explainability
# 5.4.5 : Remove numpy and scipy versions to be installed.
# 5.4.4 : Add support for lime global explanation
# 5.4.3 : Add numpy and scipy versions to be installed.
# 5.4.2 : Remove explainability configuration while saving training_distribution
# 5.4.1 : Add sample size for generating global explanation
# 5.4.0 : Add support for SHAP Global explanation
# 5.3.6 : Fix issue with explainability archive generation for regression model
# 5.3.5 : Official notebook for IBM CPD 4.5.0. 
#         Upgrade ibm-wos-utils to 4.5.0. 
#         Added code to generate explainability perturbations archive.
# 5.3.4 : Upgrade ibm-wos-utils to 4.1.1 (scikit-learn has been upgraded to 1.0.2)
# 5.3.3 : Upgrade ibm-wos-utils to 4.0.34
# 5.3.2 : Upgrade ibm-wos-utils to 4.0.31
# 5.3.1 : Official notebook for IBM CPD 4.0.5

## Training Data
*Note: Pandas' read\_csv method converts the columns to its data types. If you want the column type to not be interpreted, specify the dtype param to read_csv method in this cell. More on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)*

*Note: By default NA values will be dropped while computing training data distribution. Please ensure to handle the NA values during Pandas' read\_csv method*

In [5]:
# import pandas as pd
# training_data_df = pd.read_csv("TO_BE_EDITED")

# print(training_data_df.head())
# print("Columns:{}".format(list(training_data_df.columns.values)))

In [6]:
# Download data asset from project storage and store it in the local file system
wslib.download_file("epp_train.csv", "epp_train.csv")

{'file_name': 'epp_train.csv', 'summary': ['loaded data', 'saved to file']}

In [7]:
# Read data from the CSV file into a DataFrame
training_data_df = pd.read_csv("epp_train.csv")

# Change the order of the columns
training_data_df = training_data_df[['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'kpis_met_above_80_percent', 'any_awards_won',
       'avg_training_score', 'is_promoted']]

training_data_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_above_80_percent,any_awards_won,avg_training_score,is_promoted
0,45709,Sales & Marketing,region_31,Bachelor's,f,other,1,29,,1,0,0,49,0
1,66874,Sales & Marketing,region_27,Bachelor's,f,other,1,30,,1,0,0,50,0
2,36904,Sales & Marketing,region_15,Bachelor's,m,other,1,29,3.0,2,0,0,51,0
3,32877,Sales & Marketing,region_2,Bachelor's,f,other,1,40,3.0,12,0,0,50,0
4,58415,Sales & Marketing,region_7,Bachelor's,m,other,1,45,4.0,5,0,0,50,0


In [8]:
# training_data_df = pd.read_csv("../data/employee-promotion-train.csv")
# training_data_df = training_data_df.rename(columns={'awards_won?': 'any_awards_won', 'KPIs_met >80%': 'kpis_met_above_80_percent'})
print("Columns:{}".format(list(training_data_df.columns.values)))

Columns:['employee_id', 'department', 'region', 'education', 'gender', 'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating', 'length_of_service', 'kpis_met_above_80_percent', 'any_awards_won', 'avg_training_score', 'is_promoted']


In [9]:
# training_data_df = pd.read_csv("../data/employee-promotion-train.csv")
# training_data_df = training_data_df.rename(columns={'awards_won?': 'any_awards_won', 'KPIs_met >80%': 'kpis_met_above_80_percent'})
# print("Columns:{}".format(list(training_data_df.columns.values)))

In [10]:
# Convert the DataFrame to a list of dictionaries
# Sort the DataFrame by a specific column (e.g., 'column_name') and get the top 10 records

df_test = training_data_df.drop(columns=["employee_id", "recruitment_channel", "region"])
df_test["education"].fillna(df_test["education"].mode()[0], inplace=True)
df_test["previous_year_rating"].fillna(1, inplace=True)
# 
# Encode categorical columns
categorical_columns = df_test.select_dtypes(include=['object']).columns.tolist()
X_encoded_1 = pd.get_dummies(df_test, columns=categorical_columns, drop_first=True)
X_encoded = X_encoded_1.drop("is_promoted", axis=1)
X_encoded_test = X_encoded[:30]
X_encoded_test.shape

(30, 18)

## User Inputs Section

##### _1. Provide Common Parameters_:

Provide the common parameters like the basic model details like type, feature columns, etc. Also, enable/disable the different monitors you would like th artifacts for. Read more about these [here](#common-parameters). 

##### _2. Provide Fairness Parameters_
The fairness parameters are required if `enable_fairness` is set to `True`. Read more about these parameters [here](#fairness-parameters)

##### _3. Provide Explainability Parameters_
The explainability parameters are required if `enable_explainability` is set to `True`. Read more about these parameters [here](#explainability-parameters)

*When LIME global explanation is enabled, the explainability archive upload and explainability monitor enablement should be done using python sdk/api.*
*LIME global explanation configuration is not supported from IBM Watson OpenScale GUI.*

##### _4. Provide Drift v2 Parameters_
Read more about these parameters [here](#drift-v2-parameters)

##### _5. *DEPRECATED* Provide Drift Parameters_
Read more about these parameters [here](#deprecated-drift-parameters)

##### _6. Provide a scoring function_
The scoring function is required if any of  `enable_explainability`, `enable_drift_v2` or `enable_drift` is set to `True`. The scoring function should adhere to the following guidelines.

- The input of the scoring function should accept a `pandas.DataFrame` containing all the `feature_columns` used to build the model.
- The output of the scoring function should return:
    - a `tuple` of `(probabilities, predictions)` for classification problems. Both `probabilities` and `predictions` are of type `numpy.ndarray`
    - a `numpy.ndarray` of `predictions` for regression problems.
- The data type of the label column and prediction column should be same. Moreover, the label column and the prediction column array should have the same unique class labels
- A host of different scoring function templates are provided [here](https://github.com/IBM/watson-openscale-samples/wiki/Score-function-templates-for-IBM-Watson-OpenScale)

In [11]:
def scoring_fn(df):

    def row_to_input(df):
        fields = df.columns.tolist()
        values = df.values.tolist()
        return {'fields': fields, 'values': values}

    # Initialize the structured dictionary
    structured_dict = {'input_data': []}

    # Extract the first 10 rows from your DataFrame (X_encoded_test)
    data_subset = X_encoded_test.iloc[:30]

    # Convert all fields and values for the 10 rows
    structured_dict['input_data'].append(row_to_input(data_subset))

    deployment_url = "http://localhost:8000"
    endpoint = f"{deployment_url}/v2/predict" # endpoint implemented in serving engine

    try:
        response = requests.post(endpoint, json=structured_dict)
        
        #print("-----response ",response.text)
        if response.status_code == 200:
            # The request was successful, print the response content
            score_predictions = json.loads(response.text)['predictions'][0]
            prob_col_index = list(score_predictions.get("fields")).index('probability')
            predict_col_index = list(score_predictions.get("fields")).index('prediction')

            if prob_col_index < 0 or predict_col_index < 0:
                raise Exception("Missing prediction/probability column in the scoring response")

            
            probability_array = np.array([value[prob_col_index] for value in score_predictions.get("values")])
            prediction_vector = np.array([value[predict_col_index] for value in score_predictions.get("values")])
            return probability_array,prediction_vector
        else:
            # Handle other response codes
            print(f"Error: {response.status_code} - {response.text}")

    except requests.exceptions.RequestException as e:
        # Handle connection or request errors
        print(f"Request Exception: {e}")

In [12]:
common_parameters = {
    "problem_type" : "binary", 
    "label_column": "is_promoted",
    "prediction_column": "prediction",
    "probability_column": "probability", # <- Not required for Regression problems.

    "feature_columns": ['no_of_trainings', 'age', 'previous_year_rating', 'length_of_service',
           'kpis_met_above_80_percent', 'any_awards_won', 'avg_training_score',
           'department_Finance', 'department_HR', 'department_Legal',
           'department_Operations', 'department_Procurement', 'department_R&D',
           'department_Sales & Marketing', 'department_Technology',
           'education_Below Secondary', "education_Master's & above", 'gender_m'],
    
    "categorical_columns": [],
    
    'most_important_features': ['any_awards_won','kpis_met_above_80_percent','previous_year_rating', 'length_of_service'],
    "enable_quality": True,
    "enable_fairness": True,
    "enable_explainability": True,
    "enable_drift_v2": False,
    "notebook_version": VERSION
}

fairness_parameters = {
    "fairness_attributes": [
        {"type":"int",
         "feature": "gender_m",
         "majority": [[1,1]],
         "minority": [[0,0]],
         "threshold": 0.8
         }
    ],
    "min_records" : 30,
    "max_records": None,
    "favourable_class" : [1],
    "unfavourable_class": [0]
}

explainability_parameters = {
    "global_explanation": {
        "enabled": True,
        "sample_size": 50,
        "training_data_sample_size": 30,
        "explanation_method": "lime"
    },
    "shap": {
        "enabled": False,
        "perturbations_count": 100,
        "background_data_set": None,
        "background_data_sets": []
        
    },
    "lime": {
        "enabled": True,
        "perturbations_count": 5000
    },
    "local_explanation_method": "lime",
}
drift_v2_parameters = {
    # "max_samples": 10000
}

## Generate Configuration Archive

Run the following code to generate the configuration archive for the IBM Watson OpenScale monitors. This archive is used as is by IBM Watson OpenScale UI/SDK to onboard model for monitoring. UI/SDK will identify the different artifacts and appropriately upload to respective monitors.

In [13]:
from ibm_watson_openscale.utils.configuration_utility import ConfigurationUtility

config_util = ConfigurationUtility(
    training_data=X_encoded_1,
    common_parameters=common_parameters,
    scoring_fn=scoring_fn if "scoring_fn" in locals() else None)

config_util.create_configuration_package(
    explainability_parameters=explainability_parameters if "explainability_parameters" in locals() else None,
    drift_v2_parameters=drift_v2_parameters if "drift_v2_parameters" in locals() else {},
    fairness_parameters=fairness_parameters if "fairness_parameters" in locals() else None,
    display_link=True)

Training Statistics generated.
Fairness Statistics generated.
Please install evaluate package from Huggingface
Please install evaluate package from Huggingface
Please install evaluate package from Huggingface
Please install evaluate package from Huggingface
2023-11-10 18:02:48,297 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,321 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,341 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,363 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be s

Traceback (most recent call last):
  File "ibm_metrics_plugin/metrics/explainability/explainers/lime/lime_tabular_explainer.py", line 262, in ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer.LimeTabularExplainer.__explain_row
  File "ibm_metrics_plugin/metrics/explainability/explainers/lime/lime_tabular_explainer.py", line 318, in ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer.LimeTabularExplainer.__get_classification_explanation
  File "lime/lime_tabular.py", line 420, in lime.lime_tabular.LimeTabularExplainer.explain_instance
  File "lime/lime_base.py", line 195, in lime.lime_base.LimeBase.explain_instance_with_data
  File "lime/lime_base.py", line 125, in lime.lime_base.LimeBase._explain_instance_with_data_helper
  File "lime/lime_base.py", line 96, in lime.lime_base.LimeBase.feature_selection
  File "<__array_function__ internals>", line 180, in average
  File "/opt/conda/envs/Python-3.10/lib/python3.10/site-packages

2023-11-10 18:02:48,504 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,523 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,544 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,565 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,586 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:0

Traceback (most recent call last):
  File "ibm_metrics_plugin/metrics/explainability/explainers/lime/lime_tabular_explainer.py", line 262, in ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer.LimeTabularExplainer.__explain_row
  File "ibm_metrics_plugin/metrics/explainability/explainers/lime/lime_tabular_explainer.py", line 318, in ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer.LimeTabularExplainer.__get_classification_explanation
  File "lime/lime_tabular.py", line 420, in lime.lime_tabular.LimeTabularExplainer.explain_instance
  File "lime/lime_base.py", line 195, in lime.lime_base.LimeBase.explain_instance_with_data
  File "lime/lime_base.py", line 125, in lime.lime_base.LimeBase._explain_instance_with_data_helper
  File "lime/lime_base.py", line 96, in lime.lime_base.LimeBase.feature_selection
  File "<__array_function__ internals>", line 180, in average
  File "/opt/conda/envs/Python-3.10/lib/python3.10/site-packages

2023-11-10 18:02:48,705 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,728 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,748 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,768 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:02:48,789 451  140684795806720   ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer ERROR    Axis must be specified when shapes of a and weights differ.
2023-11-10 18:0

Traceback (most recent call last):
  File "ibm_metrics_plugin/metrics/explainability/explainers/lime/lime_tabular_explainer.py", line 262, in ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer.LimeTabularExplainer.__explain_row
  File "ibm_metrics_plugin/metrics/explainability/explainers/lime/lime_tabular_explainer.py", line 318, in ibm_metrics_plugin.metrics.explainability.explainers.lime.lime_tabular_explainer.LimeTabularExplainer.__get_classification_explanation
  File "lime/lime_tabular.py", line 420, in lime.lime_tabular.LimeTabularExplainer.explain_instance
  File "lime/lime_base.py", line 195, in lime.lime_base.LimeBase.explain_instance_with_data
  File "lime/lime_base.py", line 125, in lime.lime_base.LimeBase._explain_instance_with_data_helper
  File "lime/lime_base.py", line 96, in lime.lime_base.LimeBase.feature_selection
  File "<__array_function__ internals>", line 180, in average
  File "/opt/conda/envs/Python-3.10/lib/python3.10/site-packages

Explain Archive generated.
