# Classification with SHAP

- This tutorial demonstrates binary classification of structured data with Keras, starting from a raw CSV file, and explores how to explain the model's predictions with SHAP.


- This example builds on [](structured_data_classification_intro.ipynb): it relies on more helper functions and needs less code.

## Setup

In [74]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras import layers

import shap

import warnings
warnings.filterwarnings('ignore')

print(f"TensorFlow: {tf.__version__}")
print(f"SHAP: {shap.__version__}")

TensorFlow: 2.8.1
SHAP: 0.40.0


## Data

- We use the features below to predict whether a patient has heart disease (`Target`).

Column | Description | Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | normal; fixed defect; reversible defect | Categorical (string)
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

In [89]:
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(file_url)

In [90]:
X = df.drop("target", axis=1)

## Model

### Model import

- Load the model (see [](structured_data_classification_layers.ipynb)): 

In [6]:
model = tf.keras.models.load_model('my_hd_classifier')



- To get a prediction for a new sample, you can simply call the Keras `Model.predict` method.

- There are just two things you need to do:

  - Wrap scalars into a list so as to have a batch dimension (Models only process batches of data, not single samples).

  - Call `tf.convert_to_tensor` on each feature.

In [92]:
x = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

In [93]:
input_dict = {name: tf.convert_to_tensor([value]) for name, value in x.items()}

In [94]:
predictions = model.predict(input_dict)

In [84]:
predictions

array([[0.6244012]], dtype=float32)

In [6]:
print(
    "This particular patient had a %.1f percent probability "
    "of having a heart disease, as evaluated by our model." % (100 * predictions[0][0],)
)

This particular patient had a 62.4 percent probability of having a heart disease, as evaluated by our model.


In [95]:
def f(X):
    return model.predict([X[:,i] for i in range(X.shape[1])]).flatten()

In [96]:
explainer = shap.KernelExplainer(f, X)

Provided model function fails when applied to the provided data set.


ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
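
- The `ValueError` above most likely arises because `X` contains the string column `thal`: when SHAP converts the DataFrame to a NumPy array, every column ends up with `object` dtype, which TensorFlow cannot turn into a tensor. The sketch below shows one possible workaround, assuming the model accepts a dict of named NumPy arrays (as the `input_dict` call above suggests). The helpers `encode_for_shap` and `make_predict_fn` are hypothetical names, not part of SHAP or Keras, and the sketch has not been run against the saved model.

```python
import numpy as np
import pandas as pd


def encode_for_shap(frame, string_col="thal"):
    """Return an all-float copy of `frame` plus the labels behind the integer codes."""
    codes, labels = pd.factorize(frame[string_col])
    encoded = frame.copy()
    encoded[string_col] = codes
    return encoded.astype("float64"), np.asarray(labels).astype(str)


def make_predict_fn(predict, columns, int_cols, labels, string_col="thal"):
    """Wrap `predict` so it accepts the 2-D float array that KernelExplainer passes."""

    def f(data):
        data = np.atleast_2d(np.asarray(data, dtype="float64"))
        inputs = {}
        for i, name in enumerate(columns):
            col = data[:, i]
            if name == string_col:
                # Map integer codes back to the original strings
                inputs[name] = labels[col.round().astype(int)]
            elif name in int_cols:
                inputs[name] = col.round().astype("int64")
            else:
                inputs[name] = col.astype("float32")
        return np.asarray(predict(inputs)).flatten()

    return f


# Intended use (an assumption, untested with the saved model):
#   X_enc, thal_labels = encode_for_shap(X)
#   int_cols = set(X.columns) - {"oldpeak", "thal"}
#   f = make_predict_fn(model.predict, X.columns, int_cols, thal_labels)
#   explainer = shap.KernelExplainer(f, X_enc.iloc[:50])
```

- The wrapper uses only its own argument, so SHAP can call it with any perturbed batch, and the background data passed to `KernelExplainer` is purely numeric.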

In [50]:
numeric_feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
numeric_features = X_train[numeric_feature_names]
numeric_features.head()

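Index | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal
------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------
0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed
1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal
2 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal
3 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal
4 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | normal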


In [51]:
numeric_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       242 non-null    int64   
 1   sex       242 non-null    int64   
 2   cp        242 non-null    int64   
 3   trestbps  242 non-null    int64   
 4   chol      242 non-null    int64   
 5   fbs       242 non-null    int64   
 6   restecg   242 non-null    int64   
 7   thalach   242 non-null    int64   
 8   exang     242 non-null    int64   
 9   oldpeak   242 non-null    float64 
 10  slope     242 non-null    int64   
 11  ca        242 non-null    int64   
 12  thal      242 non-null    category
dtypes: category(1), float64(1), int64(11)
memory usage: 23.3 KB


In [58]:
# Define a function to create our tensors
def dataframe_to_dataset(dataframe, shuffle=True, batch_size=32):
    
    # Make a copy of our dataframe
    df = dataframe.copy()
    # Obtain label and drop target from dataframe
    labels = df.pop("target")

    # Transform data to tensor dataset
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    
    # Shuffle data
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    
    # Create batches 
    ds = ds.batch(batch_size)
    # Prefetch data for computational efficiency
    ds = ds.prefetch(batch_size)

    return ds

In [59]:
batch_size = 5

ds_train_test = dataframe_to_dataset(X_train, shuffle=True, batch_size=batch_size)

In [None]:
input_dict_2 = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}

In [63]:
predictions_2 = model.predict(ds_train_test)

In [65]:
def f(X):
    return model.predict([ds_train_test[:,i] for i in range(ds_train_test.shape[1])]).flatten()

In [97]:
# Select background data for SHAP
background = X[np.random.choice(X.shape[0], 100, replace=False)]

KeyError: "None of [Int64Index([251, 285, 154, 277, 115,  98, 260,  55, 157,  43,  88,  29, 270,\n            192, 131,  31, 130, 204, 133, 283, 169,  68, 264, 197,  51, 242,\n             66, 284,  45,  34,   6,  33, 219, 103, 114, 230, 212,  24, 171,\n             70, 301,   8, 234, 147,  30,  97, 138, 134, 182, 275, 205,  59,\n            227,  49, 184, 235,  54,  42, 297, 221, 273,   2, 176,  85, 187,\n            151, 118,  41, 293,  10, 156,   3, 292,  47,  18, 199, 281, 179,\n            228,  78, 111, 143, 126,  87, 290, 193,  16, 287,  74, 218, 123,\n            279, 116,  90,  50, 124,  77, 189, 240,  25],\n           dtype='int64')] are in the [columns]"
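
- The `KeyError` arises because indexing a DataFrame with `X[...]` selects columns by label, so the random integers are looked up as column names. Positional row selection goes through `.iloc` instead. A small sketch on a toy frame that stands in for `X` (the frame and the names `toy`, `rows`, and `sampled` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `X`
toy = pd.DataFrame({"age": [63, 67, 37, 41], "chol": [233, 286, 250, 204]})

# Draw row positions without replacement
rng = np.random.default_rng(0)
rows = rng.choice(toy.shape[0], size=2, replace=False)

# toy[rows] would treat the integers as column labels and raise KeyError
background = toy.iloc[rows]               # positional row selection
sampled = toy.sample(n=2, random_state=0)  # pandas' built-in row sampling
```

- Applied to the cell above, the same idea would read `X.iloc[np.random.choice(X.shape[0], 100, replace=False)]`.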

In [71]:
explainer = shap.KernelExplainer(f, X_train.iloc[:50,:])

Provided model function fails when applied to the provided data set.


AttributeError: 'BatchDataset' object has no attribute 'shape'

In [None]:
explainer = shap.KernelExplainer(f, X.iloc[:50,:])
shap_values = explainer.shap_values(X.iloc[299,:], nsamples=500)
shap.force_plot(explainer.expected_value, shap_values, X.iloc[299,:])

In [66]:
explainer = shap.KernelExplainer(model.predict, input_dict)


AssertionError: Unknown type passed as data object: <class 'dict'>
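
- The `AssertionError` is raised because `KernelExplainer` expects its background data as a NumPy array or a pandas DataFrame, not as a dict of tensors. A minimal sketch that turns the single-sample dict `x` from above into a one-row DataFrame (the name `background_row` is illustrative):

```python
import pandas as pd

# Same sample as defined earlier in the notebook
x = {
    "age": 60, "sex": 1, "cp": 1, "trestbps": 145, "chol": 233,
    "fbs": 1, "restecg": 2, "thalach": 150, "exang": 0,
    "oldpeak": 2.3, "slope": 3, "ca": 0, "thal": "fixed",
}

# One row per sample, one column per feature
background_row = pd.DataFrame([x])
```

- This is only a starting point: the frame still holds the string column `thal`, so it converts to an `object` array, and a single row is a very small background set.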

In [None]:
shap_values = explainer.shap_values(X_test.iloc[0,:])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[0,:])