**Author**: Yap Jheng Khin

**FYP II Title**: Used car dealership web application

**Purpose**:
1. This notebook describes:
    - The tree weights extraction process for adaptive random forest regressor and classifier.
2. Input: 
    - Fitted car price model (River adaptive random forest regressor).
    - Fitted lead scoring model (River adaptive random forest classifier).
3. Output: 
    - Dictionary containing the extracted weights for the car price model.
    - Dictionary containing the extracted weights for the lead scoring model.

**Execution time**: At most 1 minute in Jupyter Notebook.

**Important**: Some sections of this notebook have been moved to a new file named *FYP2_ARF_to_Dict_Validation.ipynb* so that their code can run on a different Python interpreter. This is because, as of April 2021, the SHAP and River libraries have conflicting dependencies on NumPy: SHAP requires NumPy version 1.21.5 while River requires NumPy version 1.22.3. Importing either library fails if its requirement is not met.

# Setup

Ensure that the current Python interpreter path is correct. For example, if the **River conda environment** is named **arf_conda_env**, the expected `sys.executable` is C:\Users\User\miniconda3\envs\\**arf_conda_env**\\python.exe.

In [1]:
import sys
print(sys.executable)

C:\Users\User\miniconda3\envs\arf_conda_env\python.exe
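
Because the two environments differ mainly in their NumPy version, it can also help to confirm which package versions the active interpreter sees. The snippet below is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions relevant to the River/SHAP dependency conflict
for name in ("numpy", "river", "shap"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```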


In [2]:
# Standard libraries
import json
import numpy as np
import pandas as pd
import pickle

# User-defined libraries
from arf_to_dict_conversion import extract_cf_arf, extract_rg_arf
from general_utils import serialize_arf
from rf_rg_performance_eval import arf_predict as rg_arf_predict
from rf_cf_performance_eval import arf_predict_proba as cf_arf_predict_proba

# Overview

When a Tree SHAP explainer is initialized, it internally extracts the tree weights from an API-specific tree model and stores them in a `TreeEnsemble` class instance. The explainer directly supports weight extraction for tree models from commonly used APIs such as Scikit-learn and XGBoost. However, it does not directly support weight extraction for `AdaptiveRandomForestRegressor` and `AdaptiveRandomForestClassifier` from the River API. Instead, it accepts a Python dictionary that stores the tree weights, namely `children_left`, `children_right`, `features`, `thresholds`, `node_sample_weight`, and `values`. Thus, the tree weights must be manually extracted into a dictionary.
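
As a minimal sketch of that layout, the hand-written example below encodes a single tree with one split and two leaves. The feature index, threshold, sample weights, and leaf values are made up for illustration, and plain lists stand in for the NumPy arrays that the real dictionary holds. Leaves are marked with `-1` in the child arrays and `-2` in `features`, as in the extracted output shown later in this notebook.

```python
# Hypothetical single-tree example: node 0 splits on feature 1 at 563.5,
# nodes 1 and 2 are leaves. All numbers are illustrative only.
toy_tree = {
    "children_left":      [1, -1, -1],
    "children_right":     [2, -1, -1],
    "features":           [1, -2, -2],
    "thresholds":         [563.5, -2.0, -2.0],   # leaf entries are placeholders
    "node_sample_weight": [100.0, 60.0, 40.0],
    "values":             [[0.4], [0.1], [0.85]],
}


def route(tree, x):
    """Follow `x[feature] <= threshold` from the root and return the leaf index."""
    node = 0
    while tree["children_left"][node] != -1:
        feature = tree["features"][node]
        if x[feature] <= tree["thresholds"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    return node


# Every array has exactly one entry per node
assert len({len(v) for v in toy_tree.values()}) == 1

print(route(toy_tree, [3.0, 120.0]))  # feature 1 is 120.0 <= 563.5, so leaf 1
```

Each array is indexed by node, so entry `i` of every array describes the same node.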

The pseudocode below shows how the tree weights of the adaptive random forest regressor and classifier are extracted into a dictionary.

```
INIT arf_dict
INIT arf_dict["internal_dtype"]
INIT arf_dict["input_dtype"]
INIT arf_dict["objective"]
INIT arf_dict["tree_output"]
INIT arf_dict["base_offset"]
INIT arf_dict["trees"] as an empty list

FOR each hoeffding_tree
    INIT a dictionary named hoeffding_dict
    INIT next_index to 0
    ENQUEUE (root_node, 0) to queue

    WHILE queue is not empty
        DEQUEUE (node, node_index) from queue
        RESET flip flag to False

        IF node is a leaf node
            Update hoeffding_dict at node_index

        ELSE IF node is a branch node

            IF node is a NumericBinaryBranch class instance
                split_threshold = node.threshold

            ELSE IF node is a NominalBinaryBranch class instance
                split_threshold = node.value

                IF split_threshold is 1
                    SET flip flag to True

                split_threshold = 0.5

            next_index = next_index + 1
            left_child_index = next_index
            IF flip
                ENQUEUE (right_child_node, left_child_index) to queue
            ELSE
                ENQUEUE (left_child_node, left_child_index) to queue

            next_index = next_index + 1
            right_child_index = next_index
            IF flip
                ENQUEUE (left_child_node, right_child_index) to queue
            ELSE
                ENQUEUE (right_child_node, right_child_index) to queue

            Update hoeffding_dict at node_index

    APPEND hoeffding_dict to arf_dict["trees"]
```
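
The traversal can be sketched in runnable form as follows. The `Leaf`, `NumericBranch`, and `NominalBranch` classes are hypothetical stand-ins for River's node classes, not River's actual API, and the function fills only some of the arrays. It illustrates the breadth-first indexing scheme and is not the code in `arf_to_dict_conversion`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Leaf:
    value: float


@dataclass
class NumericBranch:
    feature: int
    threshold: float
    left: object
    right: object


@dataclass
class NominalBranch:
    feature: int
    value: int     # 0 or 1, since nominal features are one-hot encoded
    left: object   # reached when x[feature] == value
    right: object


def tree_to_arrays(root):
    """Convert a toy tree into parallel lists, numbering nodes breadth-first."""
    out = {"children_left": [], "children_right": [], "features": [],
           "thresholds": [], "values": []}
    queue = deque([root])
    next_index = 0  # highest node index handed out so far

    while queue:
        # Nodes leave the queue in index order, so appending keeps lists aligned
        node = queue.popleft()

        if isinstance(node, Leaf):
            row = (-1, -1, -2, -2.0, node.value)
        else:
            first, second = node.left, node.right
            if isinstance(node, NumericBranch):
                threshold = node.threshold
            else:
                # Split value 1 sends matching samples left, so swap the
                # children before replacing the split value with 0.5
                if node.value == 1:
                    first, second = second, first
                threshold = 0.5
            queue.extend([first, second])
            # 0.0 is a placeholder for the branch node's own value
            row = (next_index + 1, next_index + 2, node.feature, threshold, 0.0)
            next_index += 2

        for key, item in zip(out, row):
            out[key].append(item)
    return out


toy_root = NumericBranch(
    feature=0, threshold=563.5,
    left=NominalBranch(feature=1, value=1, left=Leaf(0.9), right=Leaf(0.2)),
    right=Leaf(0.5),
)
arrays = tree_to_arrays(toy_root)
print(arrays["children_left"])   # [1, 3, -1, -1, -1]
print(arrays["values"])          # [0.0, 0.0, 0.5, 0.2, 0.9]
```

In the toy tree, the nominal node originally sends samples with feature value 1 to the leaf holding 0.9. After conversion that leaf sits at index 4, the right child, which is where `x <= 0.5` being false leads.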

There are two types of branch nodes in a Hoeffding tree. If the split feature is numerical, a `NumericBinaryBranch` class instance represents the branch node. If the split feature is nominal, a `NominalBinaryBranch` class instance represents it.

The initialization values for `internal_dtype`, `input_dtype`, `objective`, `tree_output`, and `base_offset` depend on whether the model is a regressor or a classifier. The exact initialization values and other details, such as the weights extracted from each Hoeffding tree, are discussed in *FYP2_ARF_to_Dict_Validation.ipynb*.

The reason for swapping the left and right child nodes is discussed in the next section.

## Conversion of split value

In Scikit-learn's decision tree, regardless of whether the split feature is **numerical** or **nominal**, the samples in each node are split using the following logic:

```python
if x[feature] <= threshold:
    return 0 # Go to left
return 1     # Go to right
```

In River's Hoeffding tree, if the split feature is **numerical**, the samples in each node are split using the following logic:

```python
# Source: https://github.com/online-ml/river/blob/main/river/tree/nodes/branch.py#L52

if x[feature] <= threshold:
    return 0 # Go to left
return 1     # Go to right
```

In River's Hoeffding tree, if the split feature is **nominal**, the samples in each node are split using the following logic:

```python
# Source: https://github.com/online-ml/river/blob/main/river/tree/nodes/branch.py#L89

if x[feature] == value:
    return 0 # Go to left
return 1     # Go to right
```

Since all the nominal attributes are one-hot encoded when training the adaptive random forest regressor and classifier, `value` can only be 0 or 1. As the table below shows, River's sample splitting matches Scikit-learn's when the split value is 0: a feature value of 0 goes to the left in both. Since the Tree SHAP explainer expects Scikit-learn's splitting logic, the table also lists the conversion needed for the explainer to interpret the model correctly.

<table>
    <tr>
        <th style="text-align:center">Model type</th>
        <th style="text-align:center">Split Value</th>
        <th style="text-align:center">Split direction<br> if feature value<br> is 0</th>
        <th style="text-align:center">Conversion</th>
    </tr>
    <tr>
        <td style="text-align:left">Scikit-learn random tree</td>
        <td style="text-align:center">0.5</td>
        <td style="text-align:left">Go to left</td>
        <td style="text-align:center">-</td>
    </tr>
    <tr>
        <td rowspan="2" style="text-align:left">Adaptive random tree</td>
        <td style="text-align:center">0</td>
        <td style="text-align:left">Go to left</td>
        <td style="text-align:left">Change the split value to 0.5</td>
    </tr>
    <tr>
        <td style="text-align:center">1</td>
        <td style="text-align:left">Go to right</td>
        <td style="text-align:left">Change the split value to 0.5. Flip the left and right child of the current node.</td>
    </tr>
</table>

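When the split value is 1, swapping the left and right children of the current node makes the node behave as if the split value were 0. In both cases, the split value is then replaced with the threshold 0.5, so that `x[feature] <= 0.5` sends a feature value of 0 to the left and a feature value of 1 to the right.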
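
The equivalence claimed in the table can be checked exhaustively, since a one-hot encoded feature only takes the values 0 and 1. The sketch below uses two hypothetical helper functions, one mimicking River's nominal rule and one mimicking the threshold rule after conversion.

```python
def river_reaches_left_child(x, value):
    """River-style nominal split: go left when the feature equals the split value."""
    return x == value


def converted_reaches_left_child(x, value):
    """Threshold rule after conversion, reported in terms of the ORIGINAL left child.

    The threshold is always 0.5; the children are swapped when value == 1.
    """
    goes_to_left_slot = x <= 0.5
    flipped = value == 1
    # With swapped children, the original left child sits in the right slot
    return (not goes_to_left_slot) if flipped else goes_to_left_slot


# Check every combination of split value and one-hot feature value
for value in (0, 1):
    for x in (0, 1):
        assert river_reaches_left_child(x, value) == converted_reaches_left_child(x, value)
```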

# Adaptive Random Forest Regressor

Load the fitted car price ARF model.

In [3]:
# Load the car price ARF model
with open('outputs/car_price/arf_rg.pkl', 'rb') as f:
    cp_arf_model = pickle.load(f)

Load the car price data preprocessor to retrieve the preprocessed feature names.

In [4]:
# Load the car price data preprocessor
with open('outputs/car_price/data_preprocessor.pkl', 'rb') as f:
    cp_data_pp = pickle.load(f)
    
cp_feature_names = cp_data_pp.features
cp_feature_names[:10]

['manufacture_year',
 'mileage',
 'length_mm',
 'engine_cc',
 'wheel_base_mm',
 'width_mm',
 'seat_capacity',
 'peak_power_hp',
 'height_mm',
 'peak_torque_nm']

In [5]:
cp_arf_dict = extract_rg_arf(cp_arf_model, cp_feature_names)
cp_arf_dict

{'internal_dtype': numpy.float64,
 'input_dtype': numpy.float64,
 'objective': 'squared_error',
 'tree_output': 'raw_value',
 'base_offset': 0,
 'trees': [{'children_left': array([  1.,   3.,   5.,   7.,   9.,  11.,  13.,  15.,  17.,  19.,  -1.,
           21.,  -1.,  23.,  25.,  27.,  29.,  31.,  -1.,  33.,  35.,  37.,
           -1.,  39.,  -1.,  -1.,  41.,  43.,  45.,  47.,  49.,  51.,  -1.,
           -1.,  53.,  55.,  -1.,  57.,  59.,  -1.,  61.,  63.,  65.,  67.,
           69.,  71.,  -1.,  73.,  75.,  77.,  79.,  81.,  -1.,  83.,  -1.,
           -1.,  -1.,  85.,  87.,  89.,  -1.,  91.,  -1.,  93.,  -1.,  95.,
           97.,  99., 101., 103.,  -1., 105., 107., 109., 111., 113.,  -1.,
          115.,  -1., 117.,  -1.,  -1.,  -1., 119.,  -1., 121.,  -1., 123.,
          125.,  -1., 127., 129.,  -1., 131.,  -1., 133.,  -1., 135.,  -1.,
          137.,  -1., 139., 141., 143., 145., 147.,  -1., 149.,  -1., 151.,
          153., 155.,  -1.,  -1.,  -1., 157., 159., 161.,  -1., 163., 

In [6]:
# Serialize the dictionary to convert to JSON
cp_arf_dict_serializable = serialize_arf(cp_arf_dict)

# Export to JSON
with open('outputs/car_price/arf_rg.json', 'w') as f:
    json.dump(cp_arf_dict_serializable, f)
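
`json.dump` cannot encode NumPy arrays or type objects such as `numpy.float64` directly, which is why the dictionary passes through `serialize_arf` first. The implementation of `serialize_arf` lives in `general_utils` and is not shown in this notebook. The function below, with the hypothetical name `to_jsonable`, is only a guess at the kind of conversion involved and may differ from the actual helper.

```python
import json

import numpy as np


def to_jsonable(obj):
    """Recursively replace NumPy objects with JSON-friendly Python equivalents."""
    if isinstance(obj, dict):
        return {key: to_jsonable(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(item) for item in obj]
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):   # NumPy scalar such as np.float64(1.5)
        return obj.item()
    if isinstance(obj, type):         # a type object such as np.float64 itself
        return obj.__name__
    return obj


# Illustrative fragment shaped like the extracted dictionary
example = {
    "internal_dtype": np.float64,
    "base_offset": 0,
    "trees": [{"children_left": np.array([1.0, -1.0, -1.0])}],
}
encoded = json.dumps(to_jsonable(example))
print(encoded)
```

Storing the dtype as a string means the reader of the JSON file has to map it back to a NumPy type before use.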

## Validation

**Test 1**: Each tree in the car price ARF model must have the same number of nodes as its converted dictionary.

In [7]:
tree_infos = {
    'ARF': [],
    'ARF Converted': [],
    'match': []
}

for idx, (ht, converted_ht) in enumerate(zip(cp_arf_model, cp_arf_dict['trees'])):
    ht_nc = ht.model.n_nodes
    ht_nc_converted = len(converted_ht['values'])
    tree_infos['ARF'].append(ht_nc)
    tree_infos['ARF Converted'].append(ht_nc_converted)
    tree_infos['match'].append(ht_nc_converted == ht_nc)

node_counts = pd.DataFrame(tree_infos)
node_counts.index = [idx + 1 for idx in range(len(cp_arf_model))]
node_counts

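|  | ARF | ARF Converted | match |
|---|---|---|---|
| 1 | 467 | 467 | True |
| 2 | 417 | 417 | True |
| 3 | 349 | 349 | True |
| 4 | 399 | 399 | True |
| 5 | 419 | 419 | True |
| 6 | 439 | 439 | True |
| 7 | 397 | 397 | True |
| 8 | 399 | 399 | True |
| 9 | 447 | 447 | True |
| 10 | 413 | 413 | True |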


Further validation is conducted in *FYP2_ARF_to_Dict_Validation.ipynb*.

## Prediction

In *FYP2_Car_Price_Explainer.ipynb*, the explainer's validation tests require the River model's predictions. The River model cannot be loaded there because the River library is incompatible with the SHAP library. Therefore, the predictions are computed here and exported as a NumPy array in npy format.

Load the dataset used to initialize the explainers. The details of the dataset are discussed in *FYP2_Car_Price_Explainer.ipynb*.

In [8]:
# Load car inventory records
car_records = pd.read_csv('outputs/seed_data/car_inventory_records.csv')

# Remove columns that are not needed to perform inference
cp_X_test = car_records.copy().drop(columns=['price', 'model', 'pred_price',
                                             'update_analytics'])

# Preprocess the data
cp_X_test = cp_data_pp.preprocess(cp_X_test)

# Filter the data
truth_available = car_records['update_analytics'] == 1
cp_X_test_truth_av = cp_X_test.loc[truth_available, :].copy()

Save the prediction to a file to be used in *FYP2_Car_Price_Explainer.ipynb*.

In [9]:
# Save the prediction to a file to be used in FYP2_Car_Price_Explainer.ipynb
cp_y_test_truth_av_pred = rg_arf_predict(cp_arf_model, cp_X_test_truth_av)
cp_y_test_truth_av_pred = np.array(cp_y_test_truth_av_pred, dtype=np.float64)

with open('outputs/car_price/y_test_truth_av-prediction.npy', 'wb') as f:
    np.save(f, cp_y_test_truth_av_pred)

Performing predictions: 100%|██████████████| 547/547 [00:00<00:00, 2757.65it/s]
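
Since the exported file is read back in a different environment, a quick round trip is a cheap way to confirm that the dtype and values survive. The sketch below uses made-up predictions and a temporary directory rather than the project's `outputs` folder.

```python
import os
import tempfile

import numpy as np

# Illustrative values only, not real model predictions
preds = np.array([41250.0, 18990.5, 73400.0], dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prediction.npy")

    # Write the array in npy format, as done for the real predictions
    with open(path, "wb") as f:
        np.save(f, preds)

    # Read it back the way the explainer notebook would
    restored = np.load(path)

assert restored.dtype == np.float64
assert np.array_equal(preds, restored)
```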


Randomly choose 100 samples from the test set to serve as the background dataset for the explainer. The background dataset is exported because the sampling result is not reproducible in another Python environment.

In [10]:
num_subsamples = 100

rng = np.random.default_rng(2022)
idx_arr = np.arange(len(cp_X_test_truth_av))
rng.shuffle(idx_arr)
idx_arr = idx_arr[:num_subsamples]

# Randomly choose 100 samples from the test set
cp_X_test_truth_av_subsample = cp_X_test_truth_av.iloc[idx_arr, :].copy()

# Export the subsample since the result differs across environments,
# even if the same seed is used
cp_X_test_truth_av_subsample.to_csv('outputs/car_price/X_test_truth_av_subsample.csv', index=False)

print(f'The size of the subsample: {cp_X_test_truth_av_subsample.shape}')

The size of the subsample: (100, 76)
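
An alternative way to draw the indices is `Generator.choice` with `replace=False`, which avoids shuffling the full index array. This is a sketch of an alternative, not the code used above, and it should not be expected to select the same rows as the shuffle-based cell even with the same seed. The row count of 547 is taken from the prediction progress output earlier in this section.

```python
import numpy as np

n_rows = 547          # number of rows with ground truth, per the output above
num_subsamples = 100

# Draw distinct row positions without replacement
rng = np.random.default_rng(2022)
idx_arr = rng.choice(n_rows, size=num_subsamples, replace=False)

# Within one environment, the same seed reproduces the same draw
idx_again = np.random.default_rng(2022).choice(n_rows, size=num_subsamples, replace=False)

assert len(np.unique(idx_arr)) == num_subsamples
assert np.array_equal(idx_arr, idx_again)
```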


Save the prediction to a file to be used in *FYP2_Car_Price_Explainer.ipynb*.

In [11]:
cp_y_test_truth_av_subsample_pred = rg_arf_predict(cp_arf_model, 
                                                   cp_X_test_truth_av_subsample)
cp_y_test_truth_av_subsample_pred = np.array(cp_y_test_truth_av_subsample_pred, dtype=np.float64)

with open('outputs/car_price/y_test_truth_av_subsample-prediction.npy', 'wb') as f:
    np.save(f, cp_y_test_truth_av_subsample_pred)

Performing predictions: 100%|██████████████| 100/100 [00:00<00:00, 2729.44it/s]


# Adaptive Random Forest Classifier

Load the fitted lead scoring ARF model.

In [12]:
# Load the lead scoring ARF model
with open('outputs/lead_scoring/arf_cf.pkl', 'rb') as f:
    ls_arf_model = pickle.load(f)

Load the lead scoring data preprocessor to retrieve the preprocessed feature names.

In [13]:
# Load the lead scoring data preprocessor
with open('outputs/lead_scoring/data_preprocessor.pkl', 'rb') as f:
    ls_data_pp = pickle.load(f)
    
ls_feature_names = ls_data_pp.features
ls_feature_names[:10]

['total_site_visit',
 'total_time_spend_on_site',
 'avg_page_view_per_visit',
 'dont_email_Yes',
 'occupation_Businessman',
 'occupation_Student',
 'occupation_Unemployed',
 'occupation_Working Professional',
 'received_free_copy_Yes']

In [14]:
ls_arf_dict = extract_cf_arf(ls_arf_model, ls_feature_names)
ls_arf_dict

{'internal_dtype': numpy.float64,
 'input_dtype': numpy.float64,
 'objective': 'binary_crossentropy',
 'tree_output': 'probability',
 'base_offset': 0,
 'trees': [{'children_left': array([ 1.,  3.,  5.,  7.,  9., 11., 13., 15., 17., 19., 21., -1., 23.,
          25., -1., 27., -1., -1., -1., 29., -1., 31., 33., 35., -1., 37.,
          -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.]),
   'children_right': array([ 2.,  4.,  6.,  8., 10., 12., 14., 16., 18., 20., 22., -1., 24.,
          26., -1., 28., -1., -1., -1., 30., -1., 32., 34., 36., -1., 38.,
          -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.]),
   'features': array([ 1.,  0.,  7.,  7.,  6.,  0.,  2.,  8.,  3.,  2.,  0., -2.,  3.,
           0., -2.,  6., -2., -2., -2.,  2., -2.,  0.,  3.,  4., -2.,  0.,
          -2., -2., -2., -2., -2., -2., -2., -2., -2., -2., -2., -2., -2.]),
   'thresholds': array([ 5.63500000e+02,  5.00000000e-01,  5.00000000e-01,  5.00000000e-01,
           5.000000

In [15]:
# Serialize the dictionary to convert to JSON
ls_arf_dict_serializable = serialize_arf(ls_arf_dict)

with open('outputs/lead_scoring/arf_cf.json', 'w') as f:
    json.dump(ls_arf_dict_serializable, f)

## Validation

**Test 1**: Each tree in the lead scoring ARF model must have the same number of nodes as its converted dictionary.

In [16]:
tree_infos = {
    'ARF': [],
    'ARF Converted': [],
    'match': []
}

for idx, (ht, converted_ht) in enumerate(zip(ls_arf_model, ls_arf_dict['trees'])):
    ht_nc = ht.model.n_nodes
    ht_nc_converted = len(converted_ht['values'])
    tree_infos['ARF'].append(ht_nc)
    tree_infos['ARF Converted'].append(ht_nc_converted)
    tree_infos['match'].append(ht_nc_converted == ht_nc)

node_counts = pd.DataFrame(tree_infos)
node_counts.index = [idx + 1 for idx in range(len(ls_arf_model))]
node_counts

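|  | ARF | ARF Converted | match |
|---|---|---|---|
| 1 | 39 | 39 | True |
| 2 | 45 | 45 | True |
| 3 | 41 | 41 | True |
| 4 | 45 | 45 | True |
| 5 | 43 | 43 | True |
| 6 | 43 | 43 | True |
| 7 | 39 | 39 | True |
| 8 | 41 | 41 | True |
| 9 | 39 | 39 | True |
| 10 | 41 | 41 | True |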


Further validation is conducted in *FYP2_ARF_to_Dict_Validation.ipynb*.

## Prediction

In *FYP2_Lead_Scoring_Explainer.ipynb*, the explainer's validation tests require the River model's predictions. The River model cannot be loaded there because the River library is incompatible with the SHAP library. Therefore, the predictions are computed here and exported as a NumPy array in npy format.

Load the dataset used to initialize the explainers. The details of the dataset are discussed in *FYP2_Lead_Scoring_Explainer.ipynb*.

In [17]:
# Load lead records
lead_records = pd.read_csv('outputs/seed_data/lead_records.csv')

# Remove columns that are not needed to perform inference
ls_test = lead_records.copy().drop(columns=['pred_score', 'lead_status'])

ls_target_attr = 'converted'
ls_X_test = ls_test.drop(columns=ls_target_attr)
ls_y_test = ls_test[ls_target_attr]

# Preprocess the data
ls_X_test = ls_data_pp.preprocess(ls_X_test)

# Filter the data
truth_available = lead_records['lead_status'] != 'Active'
ls_X_test_truth_av = ls_X_test.loc[truth_available, :].copy()
ls_y_test_truth_av = ls_y_test[truth_available].copy()

Save the prediction to a file to be used in *FYP2_Lead_Scoring_Explainer.ipynb*.

In [18]:
# Save the prediction to a file to be used in FYP2_Lead_Scoring_Explainer.ipynb
ls_y_test_truth_av_pred = cf_arf_predict_proba(ls_arf_model, ls_X_test_truth_av)

with open('outputs/lead_scoring/y_test_truth_av-prediction.npy', 'wb') as f:
    np.save(f, ls_y_test_truth_av_pred)

Performing predictions: 100%|██████████████| 652/652 [00:00<00:00, 1044.95it/s]


Randomly choose 100 samples from the test set to serve as the background dataset for the explainer. The background dataset is exported because the sampling result is not reproducible in another Python environment.

In [19]:
num_subsamples = 100

rng = np.random.default_rng(2022)
# Shuffle the positions of the whole test set, not just the first 100 rows
idx_arr = np.arange(len(ls_X_test_truth_av))
rng.shuffle(idx_arr)
idx_arr = idx_arr[:num_subsamples]

# Randomly choose 100 samples from the test set
ls_X_test_truth_av_subsample = ls_X_test_truth_av.reset_index(drop=True).iloc[idx_arr, :].copy()
ls_y_test_truth_av_subsample = ls_y_test_truth_av.reset_index(drop=True).iloc[idx_arr].copy()

print(f'The size of the subsample: {ls_X_test_truth_av_subsample.shape}')

# Export the subsample since the result differs across environments,
# even if the same seed is used
ls_test_truth_av_subsample = ls_X_test_truth_av_subsample.copy()
ls_test_truth_av_subsample['converted'] = ls_y_test_truth_av_subsample
ls_test_truth_av_subsample.to_csv('outputs/lead_scoring/test_truth_av_subsample.csv', index=False)

The size of the subsample: (100, 9)


Save the prediction to a file to be used in *FYP2_Lead_Scoring_Explainer.ipynb*.

In [20]:
ls_y_test_truth_av_subsample_pred = cf_arf_predict_proba(ls_arf_model, 
                                                         ls_X_test_truth_av_subsample)

with open('outputs/lead_scoring/y_test_truth_av_subsample-prediction.npy', 'wb') as f:
    np.save(f, ls_y_test_truth_av_subsample_pred)

Performing predictions: 100%|██████████████| 100/100 [00:00<00:00, 1043.27it/s]


The end.