**Author**: Yap Jheng Khin

**FYP II Title**: Used car dealership web application

**Purpose**:
1. This notebook explains:
    - The reason that the `interventional` approach must be used in calculating SHAP value.
2. Input: 
    - Car train and car test dataset. The <a href="https://colab.research.google.com/drive/1alFnJwZVOKntfmjxA0q8Rcni1irur5fn?usp=sharing">Google Colab notebook</a> explains how the data is scraped from a website and preprocessed.
    - Dictionaries containing the extracted tree weights for pre-trained car price models and pre-trained lead scoring models.
    - A total of 10 dictionaries representing car price models in 10 training checkpoints.
    - A total of 10 dictionaries representing lead scoring models in 10 training checkpoints.

**Execution time**: At most 5 minutes in Jupyter Notebook.

In [1]:
import json
import shap
import numpy as np
import pandas as pd
import pickle
from general_utils import deserialize_arf
from tqdm import tqdm

# Setup

Ensure that the current Python interpreter path is correct. For example, if the **SHAP conda environment** is named as **arf_conda_exp_env**, the expected `sys.executable` should be C:\Users\User\miniconda3\envs\\**arf_conda_exp_env**\\python.exe.

In [2]:
import sys
print(sys.executable)

C:\Users\User\miniconda3\envs\arf_conda_exp_env\python.exe


Load data preprocessors.

In [3]:
with open(f'outputs/car_price/data_preprocessor.pkl', 'rb') as f:
    cp_data_pp = pickle.load(f)

with open(f'outputs/lead_scoring/data_preprocessor.pkl', 'rb') as f:
    ls_data_pp = pickle.load(f)

Load data.

In [4]:
# Load car price test set
car_test_set = pd.read_csv(f'outputs/car_price/car_test_processed.csv')

# Get and preprocess car price test set
cp_X_test = car_test_set.copy().drop(columns=['price', 'model'], axis=1)
cp_X_test_pp = cp_data_pp.preprocess(cp_X_test)

# Load car price test subsample to initialize explainer that uses interventional approach
cp_X_test_truth_av_subsample = pd.read_csv('outputs/car_price/X_test_truth_av_subsample.csv')

In [5]:
# Load lead scoring test set
lead_scoring_test_set = pd.read_csv(f'outputs/lead_scoring/test_set.csv')

# Get and preprocess lead scoring test set
ls_X_test = lead_scoring_test_set.copy().drop(columns=['converted'], axis=1)
ls_X_test_pp = ls_data_pp.preprocess(ls_X_test)

# Load lead scoring test subsample to initialize explainer that uses interventional approach
ls_X_test_truth_av_subsample = pd.read_csv('outputs/lead_scoring/X_test_truth_av_subsample.csv')

# Car Price Explainers

First, the adaptive random forest regressor is trained with test set and 10 model checkpoints are created. Then, the SHAP values are calculcated for each model checkpoint. The difference of the SHAP values is computed by substracting the ingested model predictions with the expected value. The difference should be close to 0 to indicate that the tree SHAP explainer is accurate in explaining model's predictions. In other words, a larger SHAP values difference indicates more inaccurate tree SHAP explainer.

In [6]:
def get_largest_shap_diff_rg(tree_explainer):
    # Calculcate SHAP values
    cp_shap_values = tree_explainer.shap_values(cp_X_test_pp, check_additivity=False)
    # Original model's prediction output
    true_y_pred = tree_explainer.model.predict(cp_X_test_pp)
    # Sum of the SHAP values and the expected value
    y_pred_calulated_fr_shap_val = tree_explainer.expected_value + cp_shap_values.sum(1)
    # Get the biggest difference
    largest_diff = np.abs(true_y_pred - y_pred_calulated_fr_shap_val).max()
    return largest_diff

def check_node_sample_weight(cur_arf_dict):
    df = pd.DataFrame([], columns=['parent', 'left', 'right'])
    node_count_df = pd.DataFrame([], columns=['count'])
    node_id = -1

    progress_unit = 1
    desc = f'Checking Hoeffding trees'
    with tqdm(total=len(cur_arf_dict["trees"]), position=0, leave=True, desc=desc) as pbar:
        for base_learner_no in range(len(cur_arf_dict["trees"])):
            node_count_df.loc[base_learner_no, :] = [len(cur_arf_dict["trees"][base_learner_no]["node_sample_weight"])]
            queue = []
            # Separately track the node index position for both dictionaries
            queue.append(0)

            while len(queue) > 0:
                ht_node_idx = queue.pop(0)
                ht_node_idx = int(ht_node_idx)

                # Retrieve the node_sample_weight from the dictionaries
                ht_node_sw = cur_arf_dict["trees"][base_learner_no]["node_sample_weight"][ht_node_idx]

                # Add records to dataframe for debugging
                node_id += 1

                # Retrieve the left child index position from the dictionaries
                ht_left_ch_idx = int(cur_arf_dict["trees"][base_learner_no]["children_left"][ht_node_idx])

                # Retrieve the right child index position from the dictionaries
                ht_right_ch_idx = int(cur_arf_dict["trees"][base_learner_no]["children_right"][ht_node_idx])

                ht_left_node_sw = -1
                ht_right_node_sw = -1

                # Only enqueue if the child node is a branch node
                if ht_left_ch_idx != -1:
                    queue.append(ht_left_ch_idx)
                    # Retrieve the left child node_sample_weight from the dictionaries
                    ht_left_node_sw = cur_arf_dict["trees"][base_learner_no]["node_sample_weight"][int(ht_left_ch_idx)]
                if ht_right_ch_idx != -1:
                    queue.append(ht_right_ch_idx)
                    # Retrieve the right child node_sample_weight from the dictionaries
                    ht_right_node_sw = cur_arf_dict["trees"][base_learner_no]["node_sample_weight"][int(ht_right_ch_idx)]

                df.loc[node_id, :] = [ht_node_sw, ht_left_node_sw, ht_right_node_sw]
                # Update the progress bar.
            pbar.update(progress_unit)

    bool_index = (df['parent'] - df['left'] + df['right']) < 0
    print('Invalid node_sample_weight:')
    display(df[bool_index])

The result below shows the SHAP values difference for tree SHAP explainer that used `tree_path_dependent` approach. The tree SHAP explainer becomes more inaccurate as more samples are trained. Further analysis is conducted in the following cells.

In [7]:
increment = 10

df = pd.DataFrame([], columns=['Largest difference in SHAP values'])

progress_unit = 1
desc = f'Initializing explainers'
with tqdm(total=increment+1, position=0, leave=True, desc=desc) as pbar:
    # Load base model
    with open('outputs/car_price/arf_rg.json', 'r') as f:
        cp_arf_dict_serializable = json.load(f)
    base_cp_arf_dict = deserialize_arf(cp_arf_dict_serializable)
    # Initialize tree SHAP explainer
    cp_exp_tree_path_dependent = shap.TreeExplainer(
        model = base_cp_arf_dict, 
        feature_perturbation = 'tree_path_dependent', 
    )
    # Get the largest SHAP difference
    df.loc[0, :] = [get_largest_shap_diff_rg(cp_exp_tree_path_dependent)]
    # Update progress bar
    pbar.update(progress_unit)
    
    for idx in range(1, increment+1):
        # Load the model at ith checkpoint
        with open(f'outputs/explainer_validation/car_price/arf_rg_{idx}.json', 'r') as f:
            cp_arf_dict_serializable = json.load(f)
        cp_arf_dict = deserialize_arf(cp_arf_dict_serializable)
        # Initialize tree SHAP explainer
        cp_exp_tree_path_dependent = shap.TreeExplainer(
            model = cp_arf_dict, 
            feature_perturbation = 'tree_path_dependent', 
        )
        # Get the largest SHAP difference
        df.loc[idx, :] = [get_largest_shap_diff_rg(cp_exp_tree_path_dependent)]
        # Update the progress bar.
        pbar.update(progress_unit)

df

Initializing explainers: 100%|█████████████████| 11/11 [00:29<00:00,  2.66s/it]


Unnamed: 0,Largest difference in SHAP values
0,0.0
1,5149.22402
2,10833.002394
3,16265.587504
4,21541.62672
5,28325.577186
6,33378.069441
7,38454.312109
8,43884.79866
9,49323.235785


By analyzing the results below, it turned out that the Hoeffding tree regressor did not update `node_sample_weight` on the parent node. Tree SHAP explainer expects that for every parent node *i*, the `node_sample_weight` of the left and right child of parent node *i* must always be smaller than parent node *i* itself. Since the River only update `node_sample_weight` at the leaf node, the expectation is broken. With the expectation broken, the SHAP values calculation will be inaccurate since the `tree_path_dependent` approach calculates SHAP values by using `node_sample_weight`. Therefore, the River developer should give the option to update the `node_sample_weight` to ensure compatibility and interoperability of River models with other Python libraries like SHAP.

The code below checks the expectation on the pre-trained car price model. The expectation holds true since the tree weights are transferred from the Scikit-learn random forest regressor.

In [8]:
check_node_sample_weight(base_cp_arf_dict)

Checking Hoeffding trees: 100%|████████████████| 15/15 [00:02<00:00,  5.24it/s]

Invalid node_sample_weight:





Unnamed: 0,parent,left,right


The result below checks the expectation on the pre-trained car price model that is trained with new samples. The expectation does not hold true since the Hoeffding tree regressor did not update `node_sample_weight` on the parent node.

In [9]:
with open(f'outputs/explainer_validation/car_price/arf_rg_{10}.json', 'r') as f:
    cp_arf_dict_serializable = json.load(f)
cp_arf_dict = deserialize_arf(cp_arf_dict_serializable)
check_node_sample_weight(cp_arf_dict)

Checking Hoeffding trees: 100%|████████████████| 15/15 [00:04<00:00,  3.66it/s]

Invalid node_sample_weight:





Unnamed: 0,parent,left,right
53,74.0,115.0,12.0
72,79.0,123.0,8.0
82,64.0,91.0,7.0
155,73.0,106.0,18.0
165,86.0,110.0,12.0
...,...,...,...
8463,163.0,206.0,22.0
8471,72.0,122.0,18.0
8491,562.0,596.0,31.0
8555,166.0,184.0,7.0


Fortunately, the model predictions can still be explained using tree SHAP explainer with the `interventional` approach instead of `tree_path_dependent` approach. The result below shows that the SHAP value difference are consistent in all the model checkpoints. The differences are negligible since the car price prediction values are in thousands and not sensitive to SHAP value differences that are less than 1.

In [10]:
increment = 10

df = pd.DataFrame([], columns=['Largest difference in SHAP values'])

progress_unit = 1
desc = f'Initializing explainers'
with tqdm(total=increment+1, position=0, leave=True, desc=desc) as pbar:
    # Load base model
    with open('outputs/car_price/arf_rg.json', 'r') as f:
        cp_arf_dict_serializable = json.load(f)
    base_cp_arf_dict = deserialize_arf(cp_arf_dict_serializable)
    # Initialize tree SHAP explainer
    cp_exp_interventional = shap.TreeExplainer(
        model = cp_arf_dict, 
        feature_perturbation = 'interventional', 
        data = cp_X_test_truth_av_subsample
    )
    # Get the largest SHAP difference
    df.loc[0, :] = [get_largest_shap_diff_rg(cp_exp_interventional)]
    # Update progress bar
    pbar.update(progress_unit)
    
    for idx in range(1, increment+1):
        # Load the model at ith checkpoint
        with open(f'outputs/explainer_validation/car_price/arf_rg_{idx}.json', 'r') as f:
            cp_arf_dict_serializable = json.load(f)
        cp_arf_dict = deserialize_arf(cp_arf_dict_serializable)
        # Initialize tree SHAP explainer
        cp_exp_interventional = shap.TreeExplainer(
            model = cp_arf_dict, 
            feature_perturbation = 'interventional', 
            data = cp_X_test_truth_av_subsample
        )
        # Get the largest SHAP difference
        df.loc[idx, :] = [get_largest_shap_diff_rg(cp_exp_interventional)]
        # Update progress bar
        pbar.update(progress_unit)

df

Initializing explainers: 100%|█████████████████| 11/11 [00:35<00:00,  3.24s/it]


Unnamed: 0,Largest difference in SHAP values
0,0.010849
1,0.014626
2,0.009889
3,0.010842
4,0.01027
5,0.008185
6,0.007507
7,0.008405
8,0.009215
9,0.007914


# Lead Scoring Explainers

The same experiment is also conducted for adaptive random forest classifier.

In [11]:
def get_largest_shap_diff_cf(tree_explainer):
    # Calculcate SHAP values
    ls_shap_values = tree_explainer.shap_values(ls_X_test_pp, check_additivity=False)
    # Original model's prediction output
    true_y_pred = tree_explainer.model.predict(ls_X_test_pp)
    # Sum of the SHAP values and the expected value
    y_pred_calulated_fr_shap_val = \
    tree_explainer.expected_value[:, np.newaxis] + \
    np.sum(np.array(ls_shap_values), axis=-1) 
    # Get the biggest difference
    largest_diff = np.abs(true_y_pred - y_pred_calulated_fr_shap_val.T).flatten().max()
    return largest_diff

The result below shows the SHAP values difference for tree SHAP explainer that used `tree_path_dependent` approach. The tree SHAP explainer becomes more inaccurate as more samples are trained. The differences are significant since the model outputs prediction probabilities ranging from 0 to 1.

In [12]:
increment = 10

df = pd.DataFrame([], columns=['Largest difference in SHAP values'])

progress_unit = 1
desc = f'Initializing explainers'
with tqdm(total=increment+1, position=0, leave=True, desc=desc) as pbar:
    # Load base model
    with open('outputs/lead_scoring/arf_cf.json', 'r') as f:
        ls_arf_dict_serializable = json.load(f)
    base_ls_arf_dict = deserialize_arf(ls_arf_dict_serializable)
    # Initialize tree SHAP explainer
    ls_exp_tree_path_dependent = shap.TreeExplainer(
        model = base_ls_arf_dict, 
        feature_perturbation = 'tree_path_dependent', 
    )
    # Get the largest SHAP difference
    df.loc[0, :] = [get_largest_shap_diff_cf(ls_exp_tree_path_dependent)]
    # Update progress bar
    pbar.update(progress_unit)
    
    for idx in range(1, increment+1):
        # Load the model at ith checkpoint
        with open(f'outputs/explainer_validation/lead_scoring/arf_cf_{idx}.json', 'r') as f:
            ls_arf_dict_serializable = json.load(f)
        ls_arf_dict = deserialize_arf(ls_arf_dict_serializable)
        # Initialize tree SHAP explainer
        ls_exp_tree_path_dependent = shap.TreeExplainer(
            model = ls_arf_dict, 
            feature_perturbation = 'tree_path_dependent', 
        )
        # Get the largest SHAP difference
        df.loc[idx, :] = [f'{get_largest_shap_diff_cf(ls_exp_tree_path_dependent):4g}']
        # Update the progress bar.
        pbar.update(progress_unit)

df

Initializing explainers: 100%|█████████████████| 11/11 [00:02<00:00,  5.13it/s]


Unnamed: 0,Largest difference in SHAP values
0,0.0
1,0.132005
2,0.137843
3,0.0855889
4,0.0440363
5,0.0764652
6,0.0929509
7,0.137198
8,0.148903
9,0.167896


By analyzing the results below, it turned out that the Hoeffding tree classifier also did not update `node_sample_weight` on the parent node. 

The code below checks the expectation on the pre-trained lead scoring model. The expectation holds true since the tree weights are directly transferred from the Scikit-learn random forest classifier.

In [13]:
check_node_sample_weight(base_ls_arf_dict)

Checking Hoeffding trees: 100%|████████████████| 20/20 [00:00<00:00, 57.12it/s]

Invalid node_sample_weight:





Unnamed: 0,parent,left,right


The result below checks the expectation on the pre-trained lead scoring model that is trained with new samples. The expectation does not hold true since the Hoeffding tree classifier did not update `node_sample_weight` on the parent node.

In [14]:
with open(f'outputs/explainer_validation/lead_scoring/arf_cf_{10}.json', 'r') as f:
    ls_arf_dict_serializable = json.load(f)
ls_arf_dict = deserialize_arf(ls_arf_dict_serializable)
check_node_sample_weight(ls_arf_dict)

Checking Hoeffding trees: 100%|████████████████| 20/20 [00:00<00:00, 32.41it/s]

Invalid node_sample_weight:





Unnamed: 0,parent,left,right
8,165.0,383.0,15.0
47,446.0,875.351869,233.648131
88,188.0,420.0,25.0
89,114.0,221.0,13.0
93,199.0,456.0,111.0
...,...,...,...
1285,187.0,389.0,142.0
1287,270.0,548.0,134.0
1304,941.0,963.0,2.0
1315,578.0,710.0,2.0


Fortunately, the model predictions can still be explained using tree SHAP explainer with the `interventional` approach instead of `tree_path_dependent` approach. The result below shows that the SHAP value difference are consistent in all the model checkpoints. The differences are negligible since the SHAP value differences are less than 1e-7.

In [15]:
increment = 10

df = pd.DataFrame([], columns=['Largest difference in SHAP values'])

progress_unit = 1
desc = f'Initializing explainers'
with tqdm(total=increment+1, position=0, leave=True, desc=desc) as pbar:
    # Load base model
    with open('outputs/lead_scoring/arf_cf.json', 'r') as f:
        ls_arf_dict_serializable = json.load(f)
    base_ls_arf_dict = deserialize_arf(ls_arf_dict_serializable)
    # Initialize tree SHAP explainer
    ls_exp_interventional = shap.TreeExplainer(
        model = ls_arf_dict, 
        feature_perturbation = 'interventional', 
        data = ls_X_test_truth_av_subsample
    )
    # Get the largest SHAP difference
    df.loc[0, :] = [get_largest_shap_diff_cf(ls_exp_interventional)]
    # Update progress bar
    pbar.update(progress_unit)
    
    for idx in range(1, increment+1):
        # Load the model at ith checkpoint
        with open(f'outputs/explainer_validation/lead_scoring/arf_cf_{idx}.json', 'r') as f:
            ls_arf_dict_serializable = json.load(f)
        ls_arf_dict = deserialize_arf(ls_arf_dict_serializable)
        # Initialize tree SHAP explainer
        ls_exp_interventional = shap.TreeExplainer(
            model = ls_arf_dict, 
            feature_perturbation = 'interventional', 
            data = ls_X_test_truth_av_subsample
        )
        # Get the largest SHAP difference
        df.loc[idx, :] = [f'{get_largest_shap_diff_cf(ls_exp_interventional):4g}']
        # Update progress bar
        pbar.update(progress_unit)

df

Initializing explainers: 100%|█████████████████| 11/11 [00:23<00:00,  2.15s/it]


Unnamed: 0,Largest difference in SHAP values
0,0.0
1,2.41219e-08
2,2.38361e-08
3,2.46521e-08
4,1.92897e-08
5,1.93313e-08
6,1.8947e-08
7,2.07608e-08
8,1.78671e-08
9,2.11324e-08


Thank you for reading.