# Table of Contents
1. [Installation](#install)
2. [Configuration](#config)
3. [Compression Baselines](#compression)
4. [Custom Methods](#custom)
5. [Explanation Methods](#explain)
6. [Evaluation](#eval)

<a id='install'></a>
----------
## 1. Install ThinX package
-------------

1. For testing, create a new environment in a terminal (Python 3.10+ is required):

`conda create --name test_env python=3.11 -y` 

`conda activate test_env`

`conda install -n test_env ipykernel --update-deps --force-reinstall`

2. Now select `test_env` as the kernel in Jupyter Notebook.

In [1]:
# install patched `goodpoints` package into the environment
! pip install ./goodpoints-main

# install patched `sage` package
! pip install ./sage-main

# install the `thinx` package
! pip install -e ./thinx-main

# afterwards, it is best to restart the kernel

Processing ./goodpoints-main
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: goodpoints
  Building wheel for goodpoints (pyproject.toml) ... done
  Created wheel for goodpoints: filename=goodpoints-0.3.5-cp311-cp311-macosx_11_0_arm64.whl size=1759551 sha256=dfa71cce3a7888ae3019f0aaf8731bdcf7d3a344261963d11e79cb481881c56f
  Stored in directory: /Users/milannapahasian_user_account/Library/Caches/pip/wheels/a1/2c/3e/edd07f5b244bf8e028af2fd4dd0c40f9579391013b895dd653
Successfully built goodpoints
Installing collected packages: goodpoints
  Attempting uninstall: goodpoints
    Found existing installation: goodpoints 0.3.5
    Uninstalling goodpoints-0.3.5:
      Successfully uninstalled goodpoints-0.3.5
Successfully installed goodpoints-0.3.5
Processing ./sage-main
  Installing build dependencies ... done
  Getti

<a id='config'></a>
------
## 2. Load the configuration (dataset and model) for experiments
-----

In [1]:
import thinx

loader = thinx.DataLoader()
# let's first load the jm1 dataset (dataset id = 1053) and a neural network model trained on the train set
# NOTE: Reproducibility is guaranteed. The train-test split is fixed by the dataset ID.
# Both preprocessing by TabularPreprocessor and model training are deterministic based on the seed inside DataLoader, 
# ensuring identical results for the same (dataset, model) pair.
dataset_name, X_train, y_train, X_test, y_test, nn_model, preprocessor = loader.load_from_openml(
    dataset_id=1053,
    model_name="nn"
)
# let's now load the same jm1 dataset (dataset id = 1053) with an XGBoost model
dataset_name, X_train, y_train, X_test, y_test, xgb_model, preprocessor = loader.load_from_openml(
    dataset_id=1053,
    model_name="xgboost"
)

# keep only the first 30 test samples to speed up later computations:
X_test = X_test[:30]
y_test = y_test[:30]

  from .autonotebook import tqdm as notebook_tqdm


Early stopping at epoch 43


So now we have:
1. dataset_name - the original name of the dataset.
2. X_train - samples after preprocessing, on which both XGBoost and NN have been trained.
3. y_train - target values (labels) corresponding to `X_train`.
4. X_test – preprocessed samples for future explanation process, strictly separate from training but processed identically.
5. y_test - ground truth target values (labels) corresponding to `X_test`.
6. nn_model - the trained Sequential Neural Network (`PyTorchNN` class) with structure: INPUT (one unit per preprocessed feature; 21 for jm1) → Linear (100) → ReLU → Linear (100) → ReLU → Linear (2) → OUTPUT
7. xgb_model - the trained XGBoost model (n_estimators=200). XGBClassifier or XGBRegressor.
8. preprocessor - an instance of `TabularPreprocessor`; it can be used to inspect the preprocessing applied to the samples.

You can get some information about the decisions made while preprocessing the data using the `TabularPreprocessor`.

In [2]:
preprocessor.report_

PreprocessReport(dropped_constant=[], dropped_id_like=[], kept_raw_columns=['loc', 'v(g)', 'ev(g)', 'iv(g)', 'n', 'v', 'l', 'd', 'i', 'e', 'b', 't', 'lOCode', 'lOComment', 'lOBlank', 'locCodeAndComment', 'uniq_Op', 'uniq_Opnd', 'total_Op', 'total_Opnd', 'branchCount'], numeric_columns=['loc', 'v(g)', 'ev(g)', 'iv(g)', 'n', 'v', 'l', 'd', 'i', 'e', 'b', 't', 'lOCode', 'lOComment', 'lOBlank', 'locCodeAndComment', 'uniq_Op', 'uniq_Opnd', 'total_Op', 'total_Opnd', 'branchCount'], categorical_columns=[], task_type='classification')

<a id='compression'></a>
------
## 3. Distribution Compression - different baselines
------

### 1. IID Sampling
This method performs independent and identically distributed (IID) sampling of rows from the dataset.

* **`target_size`**: Can be set to any positive integer between 1 and `len(X_test)`. 

In [3]:
pre = thinx.Preprocessor(
    X=X_test.copy(),
    y=y_test.copy(),
    compression_method="iid",
    seed=0
)

X_comp, y_comp, idx_comp, t_comp = pre.preprocess()

print("--- iid Compression ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp.shape[0])
print("Selected indices:", idx_comp)
print("Compression time (s):", t_comp)

No target_size specified, defaulting to sqrt(largest power of 4 <= n))
--- iid Compression ---
Original size: 30
Compressed size: 4
Selected indices: [ 8 14 22 17]
Compression time (s): 0.0010991096496582031
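
The log line above states the default size rule. For n = 30 the largest power of 4 not exceeding n is 16, and its square root is 4, which matches the compressed size printed above. The following is a minimal standard-library sketch of that arithmetic together with a uniform subsample. The function names are hypothetical and are not part of the `thinx` API, and sampling without replacement is an assumption.

```python
import random


def default_target_size(n: int) -> int:
    """Return sqrt(largest power of 4 <= n), following the log message."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    power, root = 1, 1
    # Grow the power of 4 while it still fits; its square root doubles each step.
    while power * 4 <= n:
        power *= 4
        root *= 2
    return root


def iid_subsample(n: int, target_size: int, seed: int = 0) -> list[int]:
    """Draw `target_size` distinct row indices uniformly at random (assumed behaviour)."""
    rng = random.Random(seed)
    return rng.sample(range(n), target_size)


size = default_target_size(30)
indices = iid_subsample(30, size, seed=0)
print(size, indices)
```

The selected indices will generally differ from the ones printed by `thinx`, because the random number generator is not the same.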


### 2. ARF Sampling
This method leverages Adversarial Random Forests (ARF) to model the underlying data distribution and generate synthetic samples.

* **`target_size`**: Can be set to any positive integer greater than 1. Instead of selecting existing rows, the algorithm generates new artificial samples.

In [4]:
pre_arf = thinx.Preprocessor(
    X=X_test.copy(),
    y=y_test.copy(),
    compression_method="arfpy",
    seed=0
)
X_comp_arf, y_comp_arf, idx_arf, t_arf = pre_arf.preprocess(target_size=4)

print("--- ARFPY Compression ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_arf.shape[0])
print("Selected indices:", idx_arf, " Returns -1 for synthetic points")
print("One of selected synthetic points:", X_comp_arf[0])
print("Compression time (s):", t_arf)

--- ARFPY Compression ---
Original size: 30
Compressed size: 4
Selected indices: [-1 -1 -1 -1]  Returns -1 for synthetic points
One of selected synthetic points: [-3.71555582e-02  2.89397521e-01  3.02446081e-04  2.17690659e+00
 -3.92633545e-02  2.13101280e+00 -5.07327390e-01  1.51434680e+00
 -3.34694539e-01  2.55446592e-01  7.14236087e-01  1.90245469e-01
  1.45732350e+00  1.68113315e+00 -3.32911393e-01 -2.39940926e-01
  4.49775187e-01 -9.39340533e-02 -4.69033798e-01  2.64095319e-01
 -8.20775263e-01]
Compression time (s): 0.2751460075378418



### 3. Stein Thinning

This method performs gradient-based compression by iteratively selecting points that minimize the Stein discrepancy to approximate the underlying distribution.

* **`target_size`**: Can be set to any positive integer. The algorithm uses a greedy selection process to pick the most representative samples one by one until the specified size is reached.
* **`grad_type`**: Defaults to `b'gaussian'`, which is the simplest and recommended option. While `b'kde'` and `b'gmm'` are also available, they are **strictly experimental** and we do not recommend them.

In [5]:
pre_stein = thinx.Preprocessor(
    X=X_test.copy(),
    y=y_test.copy(),
    compression_method="stein_thinning",
    seed=0
)

X_comp_st, y_com_st, idx_st, t_st = pre_stein.preprocess(
    target_size=15,
    grad_type=b'kde'
)

print("--- Stein Thinning Compression ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_st.shape[0])
print("Selected indices:", idx_st)
print("Compression time (s):", t_st)

--- Stein Thinning Compression ---
Original size: 30
Compressed size: 15
Selected indices: [16 22  9 23 14 25 20 15  4 26  2 12 27 19 18]
Compression time (s): 0.0033872127532958984
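
To make the greedy selection concrete, here is a one-dimensional sketch of Stein thinning in plain Python. It is not the code used by `thinx`. It assumes that the Gaussian option corresponds to the score of a Gaussian fitted to the data, and it uses a Gaussian base kernel with a hypothetical bandwidth `h`. All function names are illustrative.

```python
import math


def stein_kernel(x, y, score_x, score_y, h=1.0):
    """Langevin Stein kernel built on a 1-D Gaussian base kernel with bandwidth h."""
    d = x - y
    k = math.exp(-d * d / (2 * h * h))
    dk_dx = -d / (h * h) * k                      # derivative of k in its first argument
    dk_dy = d / (h * h) * k                       # derivative of k in its second argument
    d2k = (1 / (h * h) - d * d / h ** 4) * k      # mixed second derivative
    return d2k + dk_dx * score_y + dk_dy * score_x + k * score_x * score_y


def stein_thin(points, target_size, h=1.0):
    """Greedily pick indices that keep the kernel Stein discrepancy small."""
    n = len(points)
    mean = sum(points) / n
    var = sum((p - mean) ** 2 for p in points) / n
    if var == 0:
        raise ValueError("points must not all be identical")
    # Score of the fitted Gaussian: d/dx log p(x) = -(x - mean) / var (assumption).
    scores = [-(p - mean) / var for p in points]
    diag = [stein_kernel(p, p, s, s, h) for p, s in zip(points, scores)]
    running = [0.0] * n  # running[i] = sum of k_p(x_j, x_i) over selected j
    selected = []
    for _ in range(target_size):
        best = min(range(n), key=lambda i: diag[i] / 2 + running[i])
        selected.append(best)
        for i in range(n):
            running[i] += stein_kernel(points[best], points[i], scores[best], scores[i], h)
    return selected


print(stein_thin([-2.0, -1.0, 0.0, 1.0, 2.0], 3))
```

The greedy rule may pick the same index more than once; the sketch does not prevent that.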


### 4. Influence-based Compression
This method utilizes Influence Functions (via the `pydvl` library) to identify and select the most "impactful" samples from the dataset based on the model's loss surface. This method requires a PyTorch model to be passed via `model`.

* **`target_size`**: Can be set to any positive integer. The algorithm computes self-influence scores for all points and uses them to sample a representative subset.

In [6]:
pre = thinx.Preprocessor(
    X=X_test,
    y=y_test,
    model=nn_model,
    compression_method="influence",
)
X_comp_i, y_comp_i, idx_i, t_comp_i, matrix = pre.preprocess(target_size=15)

print("--- Influence-based Compression ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_i.shape[0])
print("Selected indices:", idx_i)
print("Compression time (s):", t_comp_i)

Classification identified
--- Influence-based Compression ---
Original size: 30
Compressed size: 15
Selected indices: [19  8  1  0 24 27 18 22 16 28 25 26  6 21 13]
Compression time (s): 0.14124798774719238


### 5. Kernel Thinning
This method utilizes kernel-based thinning sequences to minimize the Maximum Mean Discrepancy (MMD).

* **`target_size`**: Must be a power of 2 and a positive integer smaller than the original size; otherwise the algorithm tries to find the closest power of 2 based on `target_size`. Important: the possible values are constrained by `g` (the oversampling parameter) and `num_bins` (the number of bins used in the Compress++ algorithm).
* **`kernel`**: Supports various types including `"gaussian"`, `"sobolev"`, `"inverse_multiquadric"`, and `"matern"`. The kernel choice defines the function space in which the discrepancy between the original and compressed sets is minimized.

In [7]:
# test different sizes and kernels

pre = thinx.Preprocessor(
    X=X_test.copy(),
    y=y_test.copy(),
    compression_method="kernel_thinning",
    seed=0
)

X_comp_kt_g, y_com_kt_g, idx_kt_g, t_kt_g = pre.preprocess(
    g=4,
    num_bins=4, 
    target_size=2, 
    kernel="gaussian"
)

print("--- Kernel Thinning Compression - Gaussian ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_kt_g.shape[0])
print("Selected indices:", idx_kt_g)
print("Compression time (s):", t_kt_g)

X_comp_kt_s, y_com_kt_s, idx_kt_s, t_kt_s = pre.preprocess(
    g=4,
    num_bins=4,
    target_size=4,
    kernel="sobolev"
)

print(" \n --- Kernel Thinning Compression - Sobolev ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_kt_s.shape[0])
print("Selected indices:", idx_kt_s)
print("Compression time (s):", t_kt_s)

X_comp_kt_imq, y_com_kt_imq, idx_kt_imq, t_kt_imq = pre.preprocess(
    g=4,
    num_bins=4,
    target_size=8,
    kernel="inverse_multiquadric"
)

print(" \n --- Kernel Thinning Compression - Inverse-Multiquadric ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_kt_imq.shape[0])
print("Selected indices:", idx_kt_imq)
print("Compression time (s):", t_kt_imq)

X_comp_kt_m, y_com_kt_m, idx_kt_m, t_kt_m = pre.preprocess(
    g=4,
    num_bins=4,
    target_size=16,
    kernel="matern"
)

print(" \n --- Kernel Thinning Compression - Matern ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_kt_m.shape[0])
print("Selected indices:", idx_kt_m)
print("Compression time (s):", t_kt_m)

Gaussian kernel: lambda^2 = 9.155884507140055
--- Kernel Thinning Compression - Gaussian ---
Original size: 30
Compressed size: 2
Selected indices: [ 1 11]
Compression time (s): 0.0008299350738525391
Sobolev kernel: parameters [1. 2. 3.]
 
 --- Kernel Thinning Compression - Sobolev ---
Original size: 30
Compressed size: 4
Selected indices: [ 9  9  9 23]
Compression time (s): 0.0002391338348388672
Inverse-Multiquadric kernel: c = 9.155884507140055
 
 --- Kernel Thinning Compression - Inverse-Multiquadric ---
Original size: 30
Compressed size: 8
Selected indices: [ 0  5  1  3 15 13 25 27]
Compression time (s): 0.00025916099548339844
Matérn kernel: parameters [1.         1.51293461 0.5        1.         3.02586922 1.5
 1.         6.05173843 2.5       ]
 
 --- Kernel Thinning Compression - Matern ---
Original size: 30
Compressed size: 16
Selected indices: [ 0  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29]
Compression time (s): 0.00021195411682128906
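
Since kernel thinning targets the Maximum Mean Discrepancy, it can help to see how that quantity is computed. The sketch below implements the biased (V-statistic) estimate of the squared MMD with a Gaussian kernel in plain Python. The kernel parameterization `exp(-||a - b||^2 / (2 * bandwidth_sq))` and the function names are assumptions made for illustration; they are not taken from `thinx` or `goodpoints`.

```python
import math


def gaussian_kernel(a, b, bandwidth_sq=1.0):
    """Gaussian kernel between two points given as sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * bandwidth_sq))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate: mean k(x, x') - 2 * mean k(x, y) + mean k(y, y')."""
    def mean_kernel(first, second):
        total = sum(kernel(p, q) for p in first for q in second)
        return total / (len(first) * len(second))

    return mean_kernel(xs, xs) - 2 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


full = [[0.0], [1.0], [2.0], [3.0]]
subset = [[0.0], [3.0]]        # two rows taken from `full`
far_away = [[100.0], [101.0]]  # points that do not resemble `full`

print("subset:", mmd_squared(full, subset))
print("far away:", mmd_squared(full, far_away))
```

A compressed set that resembles the original data gives a smaller value than an unrelated set, which is the property the thinning algorithm tries to exploit.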


<a id='custom'></a>
-------
## 4. Custom Methods - Stratified and Joined
-------
You can also give the preprocessor a custom data modification method, which is applied before compression: none, stratified (classification only), or joined.

* **`data_modification_method="joined"`**: This method augments the feature matrix $X$ by appending the model's output as additional columns. For **regression** tasks, it appends a single column containing the predicted values. For **classification** tasks, it appends $n$ columns containing the predicted probabilities for each of the $n$ classes. This allows the compression algorithm to account for the model's decision manifold during the thinning process.

In [8]:
pre = thinx.Preprocessor(
    X=X_test.copy(),
    y=y_test.copy(),
    model=nn_model, # Model is required (nn_model or xgb_model can be used)
    compression_method="arfpy",
    seed=0,
    data_modification_method="joined"
)

X_comp_st, y_com_st, idx_st, t_st = pre.preprocess(
    target_size=11,
    grad_type=b'kde'
)

print("--- Arfpy Compression using custom joined adjustment ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_st.shape[0])
print("Selected indices:", idx_st)
print("Compression time (s):", t_st)

--- Arfpy Compression using custom joined adjustment ---
Original size: 30
Compressed size: 11
Selected indices: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
Compression time (s): 0.37145495414733887
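
The joined modification described above amounts to a column-wise concatenation. The sketch below shows the idea on plain Python lists. The helper name `join_model_outputs` is hypothetical, and the model outputs are hard-coded stand-ins rather than real predictions.

```python
def join_model_outputs(X, model_outputs):
    """Append the model output for each row to that row's features.

    A scalar output (regression) adds one column; a sequence of class
    probabilities (classification) adds one column per class.
    """
    if len(X) != len(model_outputs):
        raise ValueError("X and model_outputs must have the same number of rows")
    joined = []
    for row, output in zip(X, model_outputs):
        extra = list(output) if isinstance(output, (list, tuple)) else [output]
        joined.append(list(row) + extra)
    return joined


X = [[0.1, 0.2], [0.3, 0.4]]
class_probabilities = [[0.9, 0.1], [0.2, 0.8]]  # stand-in for predicted probabilities
regression_predictions = [1.5, 2.5]             # stand-in for predicted values

print(join_model_outputs(X, class_probabilities))
print(join_model_outputs(X, regression_predictions))
```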


* **`data_modification_method="stratified"`**: This method performs compression independently across groups defined by the model's predicted classes: the compression algorithm is applied to each group separately.

In [9]:
pre = thinx.Preprocessor(
    X=X_test.copy(),
    y=y_test.copy(),
    model=nn_model, # Model is required (nn_model or xgb_model can be used)
    compression_method="stein_thinning",
    seed=0,
    data_modification_method="stratified"
)

X_comp_st, y_com_st, idx_st, t_st = pre.preprocess(target_size=6)

print("--- Stein Thinning Compression using custom stratified adjustment ---")
print("Original size:", X_test.shape[0])
print("Compressed size:", X_comp_st.shape[0])
print("Selected indices:", idx_st)
print("Compression time (s):", t_st)

--- Stein Thinning Compression using custom stratified adjustment ---
Original size: 30
Compressed size: 6
Selected indices: [23 22 25  0  8 28]
Compression time (s): 0.013226032257080078
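
The stratified modification can be pictured as "group by predicted class, then compress each group". The sketch below is one possible way to do this and is not the `thinx` implementation. The proportional allocation of the budget, the minimum of one point per group, and the names `stratified_compress` and `take_first` are all assumptions.

```python
from collections import defaultdict


def take_first(group_size, quota):
    """Placeholder compressor: return the first `quota` positions of a group."""
    return list(range(min(quota, group_size)))


def stratified_compress(predicted_classes, target_size, compress_fn=take_first):
    """Apply `compress_fn` separately inside each predicted-class group."""
    groups = defaultdict(list)
    for index, label in enumerate(predicted_classes):
        groups[label].append(index)

    n = len(predicted_classes)
    selected = []
    for label, members in sorted(groups.items()):
        # Proportional share of the budget, at least one point per group (assumption).
        quota = max(1, round(target_size * len(members) / n))
        quota = min(quota, len(members))
        # compress_fn returns positions inside the group; map them back to row indices.
        selected.extend(members[pos] for pos in compress_fn(len(members), quota))
    return selected


predicted = [0] * 20 + [1] * 10
print(stratified_compress(predicted, target_size=6))
```

Because of rounding, the per-group quotas in this sketch do not always add up to exactly `target_size`.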


<a id='explain'></a>
--------
## 5. Explanation Methods
---------

### 1. SHAP Kernel - local explanations

In [10]:
explainer = thinx.Explainer(
    model=xgb_model, 
    explainer_name="shap",
    strategy="kernel",
    task_type="classification",
    seed=0
)

shap_values, exp_time = explainer.explain(X_foreground=X_test, X_background=X_comp, n_jobs=1)
print("time for explanation (s):", exp_time)
print("shap values for sample 0:", shap_values[0])

Using 1 CPU cores for parallel processing.
Explaining with SHAP. 30 samples to explain using 4 background samples.
Running SHAP explanation in parallel using 1 jobs, 4 batches of size 10
  → Finished batch 1/4
  → Finished batch 2/4
  → Finished batch 3/4
  → Finished all 30 samples
time for explanation (s): 1.2518579959869385
shap values for sample 0: [ 0.01813701  0.         -0.03539474  0.         -0.02175547  0.03759694
  0.01001287  0.          0.01740682  0.          0.          0.
  0.          0.          0.01052346  0.          0.01333673  0.
 -0.01097273 -0.04374334  0.        ]
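
To illustrate the role of the background set, the sketch below computes exact Shapley values by enumerating every feature coalition and filling the "missing" features with values from each background row. This is brute-force enumeration, not the Kernel SHAP estimator, which approximates the same quantities by a weighted regression over sampled coalitions. It is only feasible for a handful of features. The function name and the toy linear model are illustrative.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, marginalizing over `background` rows."""
    d = len(x)

    def value(subset):
        # Features in `subset` come from x, the rest from each background row.
        total = 0.0
        for b in background:
            z = [x[i] if i in subset else b[i] for i in range(d)]
            total += f(z)
        return total / len(background)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def linear_model(z):
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[2]


x = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
print(shapley_values(linear_model, x, background))
```

For a linear model each value equals the weight times the difference between the feature and its background mean, and the values sum to the prediction minus the mean background prediction. A smaller background set makes `value` cheaper, which is why compression speeds up the explanation.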


### 2. SAGE Permutation - global explanations

In [11]:
explainer = thinx.Explainer(
    model=nn_model, 
    explainer_name="sage",
    strategy="permutation",
    task_type="classification",
    seed=0
)

sage_values, exp_time = explainer.explain(X_foreground=X_test, 
                                          X_background=X_comp_arf,
                                          y_background=y_comp_arf, # sage requires background labels
                                          y_foreground=y_test, # sage requires foreground labels
                                          n_jobs=4)
print("time for explanation (s):", exp_time)
print("sage values:", sage_values)

Using 4 CPU cores for parallel processing.
Explaining with SAGE. 30 samples to explain using 4 background samples.
PermutationEstimator will use 4 jobs
StdDev Ratio = 0.0840 (Converge at 0.0250)
StdDev Ratio = 0.0662 (Converge at 0.0250)
StdDev Ratio = 0.0545 (Converge at 0.0250)
StdDev Ratio = 0.0460 (Converge at 0.0250)
StdDev Ratio = 0.0409 (Converge at 0.0250)
StdDev Ratio = 0.0365 (Converge at 0.0250)
StdDev Ratio = 0.0329 (Converge at 0.0250)
StdDev Ratio = 0.0305 (Converge at 0.0250)
StdDev Ratio = 0.0284 (Converge at 0.0250)
StdDev Ratio = 0.0271 (Converge at 0.0250)
StdDev Ratio = 0.0256 (Converge at 0.0250)
StdDev Ratio = 0.0246 (Converge at 0.0250)
Detected convergence
time for explanation (s): 4.38506007194519
sage values: [ 0.02800407  0.00258646 -0.01364999  0.00214314  0.00045693 -0.00314154
 -0.00918725  0.02363003  0.00688503 -0.00024612  0.00139636 -0.00018399
  0.0058765  -0.00755452  0.02119821  0.00265083  0.01126081  0.00208434
 -0.00121262  0.00145024  0.01368105]

### 3. SHAP-IQ Kernel - contributions of pairs of features

In [12]:
explainer = thinx.Explainer(
    model=nn_model, 
    explainer_name="shapiq",
    strategy="kernel",
    task_type="classification",
    seed=0
)

shapiq_values, exp_time = explainer.explain(X_foreground=X_test, X_background=X_comp_arf, n_jobs=4, verbose=False)
print("time for explanation (s):", exp_time)
print("shapiq values for sample 0:", shapiq_values[0])

Using 4 CPU cores for parallel processing.
Explaining with ShapIQ. 30 samples to explain using 4 background samples.
time for explanation (s): 3.903041362762451


### 4. Expected gradients - local explanations - `only for neural networks!`

In [13]:
explainer = thinx.Explainer(
    model=nn_model, 
    explainer_name="expected_gradients",
    strategy="na",
    task_type="classification",
    seed=0
)

exp_grad_values, exp_time = explainer.explain(X_foreground=X_test, X_background=X_comp_arf, n_jobs=4, verbose=False)
print("time for explanation (s):", exp_time)
print("expected gradients values for sample 0:", exp_grad_values[0])

Using 4 CPU cores for parallel processing.
Explaining with Expected Gradients. 30 samples to explain using 4 background samples.
time for explanation (s): 0.4687843322753906
expected gradients values for sample 0: tensor([ 0.1665,  0.1349,  0.0009, -0.0019,  0.0360,  0.1464, -0.2587, -0.2174,
        -0.0080,  0.0184, -0.0186,  0.0061, -0.0889, -0.0439,  0.1128,  0.0257,
         0.0665,  0.0331,  0.0047, -0.0601,  0.0058])
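
Expected gradients averages "(input minus baseline) times gradient" over baselines drawn from the background set and over random points on the straight line between baseline and input. The sketch below is a Monte Carlo version in plain Python that takes an analytic gradient function instead of a neural network. The function names, the sample count, and the toy linear model are assumptions made for illustration.

```python
import random


def expected_gradients(grad_f, x, background, n_samples=200, seed=0):
    """Monte Carlo estimate of expected gradients attributions at x."""
    rng = random.Random(seed)
    d = len(x)
    attributions = [0.0] * d
    for _ in range(n_samples):
        baseline = rng.choice(background)  # baseline row from the background set
        alpha = rng.random()               # position on the baseline-to-input path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_f(point)
        for i in range(d):
            attributions[i] += (x[i] - baseline[i]) * grad[i]
    return [a / n_samples for a in attributions]


def linear_gradient(_point):
    # Gradient of f(z) = 2 * z[0] - z[1]; constant for a linear model.
    return [2.0, -1.0]


print(expected_gradients(linear_gradient, [1.0, 3.0], [[0.0, 0.0]]))
```

With a linear model and a single baseline the estimate is exact: each attribution is the weight times the difference between input and baseline.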


<a id='eval'></a>
## 6. Evaluation
-----

### 0. Ground truth explanation calculation

In [14]:
explainer = thinx.Explainer(
    model=xgb_model, 
    explainer_name="shap",
    strategy="kernel",
    task_type="classification",
    seed=0
)

shap_values_gt, exp_time_gt = explainer.explain(X_foreground=X_test, X_background=X_test, n_jobs=1)
print("time for explanation (s):", exp_time_gt)
print("shap values for sample 0:", shap_values_gt[0])

Using 1 CPU cores for parallel processing.
Explaining with SHAP. 30 samples to explain using 30 background samples.
Running SHAP explanation in parallel using 1 jobs, 4 batches of size 10
  → Finished batch 1/4
  → Finished batch 2/4
  → Finished batch 3/4
  → Finished all 30 samples


In [15]:
evaluator = thinx.Evaluator(ground_truth_explanation=shap_values_gt, ground_truth_points=X_test)

print("--- Evaluation of ARFPY compressed applied to SHAP explanations ---")
metrics_comp_eval = evaluator.evaluate_compression(
    compressed_points=X_comp_arf,
)
metrics_exp_eval = evaluator.evaluate_explanation(
    explanation=shap_values, 
    time_elapsed=t_arf,
    num_samples=len(X_comp_arf)
)
print("Compression evaluation metrics:", metrics_comp_eval) 
print("Explanation evaluation metrics:", metrics_exp_eval)


--- Evaluation of ARFPY compressed applied to SHAP explanations ---
Compression evaluation metrics: {'mmd': 0.18292712301271497}
Explanation evaluation metrics: {'mae': 0.011498427798072128, 'top_k': 0.5666666666666665, 'explanation_time': 0.2751460075378418, 'size': 4}
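
The exact definitions behind `mae` and `top_k` are not given in this tutorial. The sketch below shows one plausible reading, stated as an assumption: `mae` as the mean absolute difference between the ground-truth and approximate attributions, and `top_k` as the average overlap between the k features with the largest absolute attribution in each. The function names and the value of `k` are hypothetical.

```python
def mean_absolute_error(reference, approx):
    """Mean absolute difference over all samples and features."""
    diffs = [
        abs(r - a)
        for ref_row, approx_row in zip(reference, approx)
        for r, a in zip(ref_row, approx_row)
    ]
    return sum(diffs) / len(diffs)


def top_k_overlap(reference, approx, k=3):
    """Average share of top-k features (by absolute value) that both rows agree on."""
    def top(row):
        order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        return set(order[:k])

    scores = [len(top(r) & top(a)) / k for r, a in zip(reference, approx)]
    return sum(scores) / len(scores)


reference = [[0.5, -0.3, 0.1, 0.0]]
approx = [[0.4, 0.0, 0.2, -0.3]]
print(mean_absolute_error(reference, approx))
print(top_k_overlap(reference, approx, k=2))
```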
