# Tree Influence

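tree influence: a package that collects instance-based explanation methods (methods that surface the training examples that influenced a prediction) for decision-tree-based algorithms.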

- Repository: [jjbrophy47/tree_influence: Influence Estimation for Gradient-Boosted Decision Trees](https://github.com/jjbrophy47/tree_influence)
- Paper: [Brophy, J., Hammoudeh, Z., & Lowd, D. (2023). Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees. J. Mach. Learn. Res., 24(154), 1-48.](https://jmlr.org/papers/v24/22-0449.html)


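At one point the computation builds an np.array of shape $n$ × (number of trees), where $n$ is the number of records. With 1 million records and 50,000 decision trees, for example, that array takes about 186 GiB even as float32.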
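The figure above follows from simple arithmetic, sketched below. The helper name `dense_array_gib` is hypothetical and not part of tree_influence; it only multiplies the array shape by the element size.

```python
def dense_array_gib(n_rows: int, n_cols: int, itemsize: int = 4) -> float:
    """Size in GiB of a dense (n_rows, n_cols) array with `itemsize` bytes per element."""
    return n_rows * n_cols * itemsize / 2**30


# 1 million records x 50,000 trees, float32 (4 bytes per element)
size_gib = dense_array_gib(1_000_000, 50_000, itemsize=4)
print(f"{size_gib:.1f} GiB")  # 186.3 GiB
```

The size grows linearly in both dimensions, so halving either the number of records or the number of trees halves the footprint.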

In [3]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from tree_influence.explainers import BoostIn

# load iris data
data = load_iris()
X, y = data['data'], data['target']

# use two classes, then split into train and test
idxs = np.where(y != 2)[0]
X, y = X[idxs], y[idxs]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

# train GBDT model
model = LGBMClassifier().fit(X_train, y_train)

# fit influence estimator
explainer = BoostIn().fit(model, X_train, y_train)

# estimate training influences on each test instance
influence = explainer.get_local_influence(X_test, y_test)  # shape=(no. train, no. test)

# extract influence values for the first test instance
values = influence[:, 0]  # shape=(no. train,)

# sort training examples from:
# - most positively influential (decreases loss of the test instance the most), to
# - most negatively influential (increases loss of the test instance the most)
training_idxs = np.argsort(values)[::-1]

[LightGBM] [Info] Number of positive: 43, number of negative: 47
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001302 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 68
[LightGBM] [Info] Number of data points in the train set: 90, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.477778 -> initscore=-0.088947
[LightGBM] [Info] Start training from score -0.088947
[9.99978883e-01 1.87684488e-05 9.99832895e-01 9.99832895e-01
 9.99832895e-01 1.87684488e-05 1.92746335e-05 1.55569815e-03
 9.99901429e-01 1.87684488e-05 5.69284309e-05 9.99980781e-01
 9.99901133e-01 9.99463304e-01 1.93438545e-05 5.68060108e-05
 9.99977217e-01 1.87684488e-05 1.85110744e-05 2.42050712e-04
 9.99490484e-01 9.99979921e-01 9.99980817e-01 9.99832872e-01
 9.99792503e-01 1.93241772e-05 1.87684488e-05 9.99971114e-01
 1.86029015e-03 1.29047237e
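The ranking step in the cell above can be tried on a small hand-made matrix without training a model. This is a minimal sketch: `toy_influence` is made-up data standing in for the output of `get_local_influence`, and `top_k_influential` is a hypothetical helper, not a tree_influence function.

```python
import numpy as np


def top_k_influential(influence: np.ndarray, test_idx: int, k: int = 3):
    """Return (most positive, most negative) training-row positions for one test instance."""
    values = influence[:, test_idx]      # shape=(no. train,)
    order = np.argsort(values)[::-1]     # descending by influence value
    return order[:k], order[::-1][:k]


# toy stand-in with shape=(no. train, no. test) = (5, 2)
toy_influence = np.array([
    [0.2, -0.1],
    [-0.5, 0.4],
    [0.9, 0.0],
    [0.0, 0.3],
    [-0.3, -0.7],
])

helpful, harmful = top_k_influential(toy_influence, test_idx=0, k=2)
print(helpful)  # [2 0]
print(harmful)  # [1 4]
```

Both returned arrays hold row positions in the training set that was passed to `fit`, ordered from the strongest effect to the weakest.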

In [10]:
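i = 0
# i is a test-instance index (a column of `influence`), so X_test[i] would show
# the instance being explained; X_train[i], as evaluated here, is simply training row i.
X_train[i]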

array([5.8, 2.6, 4. , 1.2])

In [11]:
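values = influence[:, i]
training_idxs = np.argsort(values)[::-1]
# Caution: training_idxs are row positions in X_train, while X is the unsplit two-class array,
# so X_train[training_idxs[:10]] is the lookup that returns the most influential training rows.
X[training_idxs[:10],]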

array([[5.1, 3.4, 1.5, 0.2],
       [4.9, 2.4, 3.3, 1. ],
       [5. , 2. , 3.5, 1. ],
       [4.3, 3. , 1.1, 0.1],
       [6. , 2.7, 5.1, 1.6],
       [5.4, 3.4, 1.7, 0.2],
       [5.7, 2.8, 4.5, 1.3],
       [5.4, 3.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.2]])
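Positions obtained from `np.argsort` on an influence column refer to rows of `X_train`, not to rows of the unsplit `X`. The sketch below illustrates the difference on toy data, using a manual shuffle in place of `train_test_split`; the names `perm`, `train_rows`, and `test_rows` are assumptions introduced for illustration.

```python
import numpy as np

X = np.arange(20, dtype=float).reshape(10, 2)   # toy data: row r is [2r, 2r + 1]

# shuffle row positions and hold out the last 2 rows as a test set
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
train_rows, test_rows = perm[:8], perm[8:]
X_train = X[train_rows]

# pretend these positions came from np.argsort(values)[::-1][:3]
training_idxs = np.array([5, 0, 7])

top_train = X_train[training_idxs]              # look up directly in the training set
top_in_X = X[train_rows[training_idxs]]         # same rows, mapped back to the original array

print(np.array_equal(top_train, top_in_X))      # True
```

With `train_test_split`, the same mapping can be kept by passing `np.arange(len(X))` as an extra array, since the function splits every array it receives consistently.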