## 0. Imports and install shap

In [2]:
!pip install shap==0.40.0 -q

from ibm_watson_studio_lib import access_project_or_space
import pandas as pd
import numpy as np
import shap
import os

## 1. Read job env variables

This notebook is meant to be run as a job, with its parameters read from environment variables. During development, the cells below let you override some of these parameters for testing.

In [3]:
PROJECT_TOKEN = os.environ.get('PROJECT_TOKEN')
DATASET_NAME = os.environ.get('DATASET_NAME', 'heloc_dataset_v1.csv')

APIKEY = os.environ.get('APIKEY')
SPACE_ID = os.environ.get('SPACE_ID', '9d6b2070-54a7-4ea1-89b8-a900fd845763')
MODEL_ID = os.environ.get('MODEL_ID', 'b8cc86f0-5e8c-4671-8df1-9afca24651f4')

In [7]:
if (PROJECT_TOKEN is None) or (APIKEY is None):
    from getpass import getpass
    print("It looks like your credentials are missing. Please enter them.")
    PROJECT_TOKEN = getpass("Enter your Watson Studio project token: ")
    APIKEY = getpass("Enter your IBM Cloud API key: ")
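
When the notebook runs as a non-interactive job, there may be nobody to answer a `getpass` prompt, so it can be preferable to report the missing parameters up front. The following is a minimal sketch under that assumption; `missing_params` and `REQUIRED_PARAMS` are hypothetical names introduced here for illustration and are not part of `ibm_watson_studio_lib`.

```python
# Assumption: these are the only parameters without a usable default.
REQUIRED_PARAMS = ("PROJECT_TOKEN", "APIKEY")


def missing_params(env, required=REQUIRED_PARAMS):
    """Return the names in `required` that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Toy environment for illustration; in the notebook you would pass os.environ.
demo_env = {"PROJECT_TOKEN": "abc", "APIKEY": ""}
print(missing_params(demo_env))  # ['APIKEY']
```

In a job, a non-empty result could be turned into an exception instead of a prompt, so the run fails with a readable message.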

## 2. Load data

In [9]:
wslib = access_project_or_space(params=dict(token=PROJECT_TOKEN))

df = pd.read_csv(wslib.load_data(DATASET_NAME))
display(df.head())
df.shape

| | RiskPerformance | ExternalRiskEstimate | MSinceOldestTradeOpen | MSinceMostRecentTradeOpen | AverageMInFile | NumSatisfactoryTrades | NumTrades60Ever2DerogPubRec | NumTrades90Ever2DerogPubRec | PercentTradesNeverDelq | MSinceMostRecentDelq | ... | PercentInstallTrades | MSinceMostRecentInqexcl7days | NumInqLast6M | NumInqLast6Mexcl7days | NetFractionRevolvingBurden | NetFractionInstallBurden | NumRevolvingTradesWBalance | NumInstallTradesWBalance | NumBank2NatlTradesWHighUtilization | PercentTradesWBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bad | 55 | 144 | 4 | 84 | 20 | 3 | 0 | 83 | 2 | ... | 43 | 0 | 0 | 0 | 33 | -8 | 8 | 1 | 1 | 69 |
| 1 | Bad | 61 | 58 | 15 | 41 | 2 | 4 | 4 | 100 | -7 | ... | 67 | 0 | 0 | 0 | 0 | -8 | 0 | -8 | -8 | 0 |
| 2 | Bad | 67 | 66 | 5 | 24 | 9 | 0 | 0 | 100 | -7 | ... | 44 | 0 | 4 | 4 | 53 | 66 | 4 | 2 | 1 | 86 |
| 3 | Bad | 66 | 169 | 1 | 73 | 28 | 1 | 1 | 93 | 76 | ... | 57 | 0 | 5 | 4 | 72 | 83 | 6 | 4 | 3 | 91 |
| 4 | Bad | 81 | 333 | 27 | 132 | 12 | 0 | 0 | 100 | -7 | ... | 25 | 0 | 1 | 1 | 51 | 89 | 3 | 1 | 0 | 80 |


(10459, 24)

## 2. Load model

In [25]:
from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": APIKEY
}

client = APIClient(wml_credentials)
client.set.default_space(SPACE_ID)

'SUCCESS'

In [26]:
model = client.repository.load(MODEL_ID)
model_details = client.repository.get_model_details(MODEL_ID)
type(model)

sklearn.pipeline.Pipeline

In [29]:
autoai_details = model_details['entity'].get('hybrid_pipeline_software_specs')
if autoai_details is not None:
    autoai_details = autoai_details[0].get('name')

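if not autoai_details or 'autoai' not in autoai_details:
    # Adjacent string literals avoid the stray indentation a backslash continuation would embed in the message
    raise Exception(
        "This notebook has only been tested for an AutoAI model. "
        "For other model types, you will need to adapt the cells below."
    )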

## 3. Use `shap.Explainer` on prepped data

In [30]:
X_prep = model[:-1].transform(df.drop(columns=['RiskPerformance']).sample(1000).values)

In [31]:
# see https://github.com/slundberg/shap/issues/1042#issuecomment-590112711
model[-1].booster_.params['objective'] = 'binary'
exp = shap.Explainer(model[-1], features=X_prep)
# exp = shap.KernelExplainer(model[-1].predict_proba, data=X_prep)

In [32]:
# AutoAI feature names:
# - original features that appear in the feature importances keep their original order
# - all engineered features ("NewFeature_<i>_...") are then appended at the end, ordered by <i>
# The code below uses that logic to retrieve the feature names in order.

fi = model_details['entity']['metrics'][0]['context']['features_importance'][0]['features']

new_features = [c for c in fi if c.startswith("NewFeature")]
new_features = sorted(new_features, key=lambda s: int(s.split('_')[1]))

autoai_feature_names = [c for c in df.columns if c in fi.keys()] + new_features
autoai_feature_names

['ExternalRiskEstimate',
 'MSinceOldestTradeOpen',
 'MSinceMostRecentTradeOpen',
 'AverageMInFile',
 'NumSatisfactoryTrades',
 'NumTrades60Ever2DerogPubRec',
 'NumTrades90Ever2DerogPubRec',
 'PercentTradesNeverDelq',
 'MSinceMostRecentDelq',
 'MaxDelq2PublicRecLast12M',
 'MaxDelqEver',
 'NumTotalTrades',
 'NumTradesOpeninLast12M',
 'PercentInstallTrades',
 'MSinceMostRecentInqexcl7days',
 'NumInqLast6M',
 'NumInqLast6Mexcl7days',
 'NetFractionRevolvingBurden',
 'NetFractionInstallBurden',
 'NumRevolvingTradesWBalance',
 'NumInstallTradesWBalance',
 'NumBank2NatlTradesWHighUtilization',
 'PercentTradesWBalance',
 'NewFeature_0_sum(ExternalRiskEstimate__MSinceMostRecentTradeOpen)',
 'NewFeature_1_sum(ExternalRiskEstimate__AverageMInFile)',
 'NewFeature_2_sum(ExternalRiskEstimate__NumSatisfactoryTrades)',
 'NewFeature_3_sum(ExternalRiskEstimate__PercentTradesNeverDelq)',
 'NewFeature_4_sum(ExternalRiskEstimate__NumTotalTrades)',
 'NewFeature_5_sum(ExternalRiskEstimate__MSinceMostRecentInq
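
The ordering logic used above can be checked in isolation on toy inputs. This is a sketch only: `ordered_feature_names` is a hypothetical helper name, and the column names and importances are made up for illustration rather than taken from the HELOC model.

```python
def ordered_feature_names(original_columns, feature_importance):
    """Kept original columns (in original order), then NewFeature_<i>_... sorted by <i>."""
    kept = [c for c in original_columns if c in feature_importance]
    engineered = sorted(
        (c for c in feature_importance if c.startswith("NewFeature")),
        key=lambda name: int(name.split("_")[1]),  # numeric index after "NewFeature_"
    )
    return kept + engineered


# Toy inputs: "target" and "b" are absent from the importances, so they are skipped.
columns = ["target", "a", "b", "c"]
importance = {
    "NewFeature_10_sum(a__c)": 0.1,
    "c": 0.3,
    "NewFeature_2_sum(a__b)": 0.2,
    "a": 0.4,
}
print(ordered_feature_names(columns, importance))
# ['a', 'c', 'NewFeature_2_sum(a__b)', 'NewFeature_10_sum(a__c)']
```

Sorting on the integer index matters once there are ten or more engineered features: a plain string sort would place `NewFeature_10` before `NewFeature_2`.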

In [33]:
shap_values = exp.shap_values(X_prep)  # with shap 0.40.0, this is a list of N arrays, one per class

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
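
The warning above is a reminder that the container returned by `shap_values` depends on the shap version. With the pinned 0.40.0 it is a list with one array per class; other releases may return a single array instead. A defensive sketch is shown below, where `positive_class_values` is a hypothetical helper and the 3-D layout `(n_samples, n_features, n_classes)` is an assumption to verify against the installed version.

```python
import numpy as np


def positive_class_values(shap_values, class_index=1):
    """Return an (n_samples, n_features) array for one class, whatever the container."""
    if isinstance(shap_values, list):  # one array per class
        return np.asarray(shap_values[class_index])
    arr = np.asarray(shap_values)
    if arr.ndim == 3:  # assumed layout: (n_samples, n_features, n_classes)
        return arr[:, :, class_index]
    return arr  # already a single 2-D array


# Toy arrays standing in for real SHAP output: 4 samples, 3 features, 2 classes.
per_class = [np.zeros((4, 3)), np.ones((4, 3))]
print(positive_class_values(per_class).shape)  # (4, 3)
```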


In [36]:
print("Shap values successfully computed.")

Shap values successfully computed.


## 4. Store the SHAP values as additional metadata for the saved model

Note: **Model metadata is limited in size. This call succeeds because we only store SHAP values for a sample of 1000 rows**, but larger payloads could fail. In that case, the best solution would be to store the values as a data asset (in the WML space, for example) and store the ID of that data asset below instead of the raw values.
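
One way to act on this note is to measure the serialized payload before deciding where it goes. The sketch below is illustrative only: `MAX_INLINE_BYTES` is a hypothetical threshold (the actual metadata limit is not stated here), `shap_metadata` and the `data_asset_id` key are names invented for this example, and creating the data asset itself is left out.

```python
import json

# Hypothetical threshold: replace with the real limit once you know it.
MAX_INLINE_BYTES = 1_000_000


def json_size_bytes(payload):
    """Size of the payload once serialized to UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def shap_metadata(feature_names, expected_value, values, data,
                  asset_id=None, max_bytes=MAX_INLINE_BYTES):
    """Return inline SHAP metadata if it is small enough, else a pointer to a data asset."""
    inline = {"shap": {"feature_names": feature_names,
                       "expected_value": expected_value,
                       "values": values,
                       "data": data}}
    if json_size_bytes(inline) <= max_bytes:
        return inline
    if asset_id is None:
        raise ValueError("Payload too large for inline storage and no asset_id was provided")
    # "data_asset_id" is an illustrative key name, not a WML convention.
    return {"shap": {"feature_names": feature_names,
                     "expected_value": expected_value,
                     "data_asset_id": asset_id}}


# Tiny toy payload: small enough to stay inline.
small = shap_metadata(["f0", "f1"], 0.5, [[0.1, -0.2]], [[1.0, 2.0]])
print("values" in small["shap"])  # True
```

The returned dictionary could then be used as the value of the `CUSTOM` meta property in the cell below.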

In [34]:
meta_props = {
    client.repository.ModelMetaNames.CUSTOM: {
        'shap': {
            'feature_names': autoai_feature_names,
            'expected_value': exp.expected_value[1],  # only keep class 1
            'values': shap_values[1].tolist(), # only keep class 1; use tolist() to make it json serializable
            'data': X_prep.tolist()
        }
    }
}
new_model_details = client.repository.update_model(MODEL_ID, meta_props)

In [37]:
print("Shap values successfully stored as model metadata.")

Shap values successfully stored as model metadata.
