# All‐Together: XTREES Exploration & Debug

This notebook imports and sets up the key XTREES modules (VisTree, TreePlot, TreeDash, ForestBasedTree, etc.) for experimentation and example demonstrations.

The first code cell below imports everything needed. Run each section step by step to verify functionality.


## Setup

In [1]:
from src.xtrees.dash import *
from src.xtrees.model.fbt import *
from src.utils import _fmt, show_df
from src.xtrees.dash.vis_tree import VisTree

import pandas as pd

from jupyter_dash import JupyterDash

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

from sklearn.datasets import load_iris, load_breast_cancer, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

seed = 42


## Data Loading & Splitting

Load the Iris, Breast Cancer, and California Housing datasets, convert them to pandas DataFrames, infer feature types, and split each into training and test sets.


In [2]:
# --------------------
# Iris Dataset
# --------------------
iris_data = load_iris()
iris_X = iris_data.data
iris_y = iris_data.target

iris_class_names = iris_data.target_names
iris_feature_names = iris_data.feature_names
iris_X = pd.DataFrame(iris_X, columns=iris_feature_names)

iris_feature_types = iris_X.dtypes  # iris_X is already a DataFrame

iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=0.3, random_state=seed
)


# --------------------
# Breast Cancer Dataset
# --------------------
bc_data = load_breast_cancer()
bc_X = bc_data.data
bc_y = bc_data.target

bc_class_names = bc_data.target_names
bc_feature_names = bc_data.feature_names
bc_X = pd.DataFrame(bc_X, columns=bc_feature_names)

bc_feature_types = bc_X.dtypes  # bc_X is already a DataFrame

bc_X_train, bc_X_test, bc_y_train, bc_y_test = train_test_split(
    bc_X, bc_y, test_size=0.3, random_state=seed
)


# --------------------
# California Housing Dataset
# --------------------
calif_data = fetch_california_housing()
calif_X = calif_data.data
calif_y = calif_data.target

calif_feature_names = calif_data.feature_names
calif_X = pd.DataFrame(calif_X, columns=calif_feature_names)

calif_feature_types = calif_X.dtypes  # calif_X is already a DataFrame

calif_X_train, calif_X_test, calif_y_train, calif_y_test = train_test_split(
    calif_X, calif_y, test_size=0.2, random_state=seed
)
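
The three loading blocks above share one pattern, so they could be folded into a small helper. The following is a minimal sketch, assuming a hypothetical function name `load_and_split` (it is not part of XTREES) and relying on scikit-learn's `as_frame=True` option to get DataFrames directly:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def load_and_split(loader, test_size, seed=42):
    """Hypothetical helper: load a scikit-learn dataset as DataFrames and split it."""
    bunch = loader(as_frame=True)            # .data is a DataFrame, .target a Series
    X, y = bunch.data, bunch.target.to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    return {
        "X": X,
        "feature_names": list(X.columns),
        "feature_types": X.dtypes,
        "split": (X_train, X_test, y_train, y_test),
    }


iris = load_and_split(load_iris, test_size=0.3)
X_train, X_test, y_train, y_test = iris["split"]
print(X_train.shape, X_test.shape)
```

The same call with `load_breast_cancer` or `fetch_california_housing` would cover the other two datasets; the explicit per-dataset variables in the cell above remain easier to reference from later cells.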


## Train Models

### on Iris (0.6s)

Train a RandomForestClassifier on the Iris training set, then build and fit a ForestBasedTree (FBT) with the parameters set in the cell. Finally, display the head of the conjunction-set DataFrame and compare the test accuracy of the forest, the FBT, and a depth-4 decision tree.
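
The training cells in this section derive `min_samples_leaf` as 2% of the training rows, truncated to an integer and floored at 1. The sketch below works through that arithmetic for the three training-set sizes implied by the splits above; the function name `leaf_floor` is introduced here only for illustration. For comparison, scikit-learn also accepts a float for `min_samples_leaf` and then rounds up rather than down, which the second printed column shows.

```python
import math


def leaf_floor(n_train, fraction=0.02):
    """Integer leaf size as computed in the notebook: truncate, but never below 1."""
    return max(1, int(fraction * n_train))


# Training-set sizes implied by the splits (150, 569 and 20640 rows in total).
for name, n_train in [("iris", 105), ("breast_cancer", 398), ("california", 16512)]:
    # First number: notebook formula. Second: what ceil(0.02 * n) would give.
    print(name, leaf_floor(n_train), math.ceil(0.02 * n_train))
```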


In [3]:
# RandomForest parameters for Iris
num_of_estimators = 20
max_depth = 5
min_sample_leaf = max(1, int(0.02 * len(iris_X_train)))

iris_rf = RandomForestClassifier(
    n_estimators=num_of_estimators,
    max_depth=max_depth,
    min_samples_leaf=min_sample_leaf,
    random_state=seed
)
iris_rf.fit(iris_X_train, iris_y_train)


# ForestBasedTree parameters
minimal_forest_size = 10
max_number_of_branches = 50
exclusion_threshold = 0.8

iris_fbt = ForestBasedTree(random_state=seed)
iris_fbt.fit(
    iris_rf,
    iris_X_train,
    iris_y_train,
    iris_feature_types,
    iris_feature_names,
    minimal_forest_size=minimal_forest_size,
    amount_of_branches_threshold=max_number_of_branches,
    exclusion_threshold=exclusion_threshold
)

cs_df = iris_fbt.cs_df.copy()

h = show_df(cs_df)

print("-"*50)


iris_y_pred_rf = iris_rf.predict(iris_X_test)
rf_accuracy = accuracy_score(iris_y_test, iris_y_pred_rf)


iris_y_pred_fbt = iris_fbt.predict(iris_X_test)
fbt_accuracy = accuracy_score(iris_y_test, iris_y_pred_fbt)


iris_dt = DecisionTreeClassifier(max_depth=4, random_state=seed)
iris_dt.fit(iris_X_train, iris_y_train)

iris_y_pred_dt = iris_dt.predict(iris_X_test)
dt_accuracy = accuracy_score(iris_y_test, iris_y_pred_dt)


print(f"RandomForest Accuracy: {rf_accuracy:.2f}")
print(f"ForestBasedTree Accuracy: {fbt_accuracy:.2f}")
print(f"Decision Tree Accuracy:   {dt_accuracy:.2f}")


  0_upper 0_lower 1_upper 1_lower 2_upper 2_lower 3_upper 3_lower n_samples branch_prob              probas     0     1     2
0    5.45    -inf     inf    -inf    2.45    -inf    0.70    -inf        18        0.03  [1.00, 0.00, 0.00]  1.00  0.00  0.00
1    5.75    5.45    2.95    -inf    2.50    -inf    0.70    -inf        16        0.01  [0.81, 0.19, 0.00]  0.81  0.19  0.00
2    5.75    5.45    3.60    2.95    2.50    -inf    0.70    -inf        15        0.01  [0.84, 0.16, 0.00]  0.84  0.16  0.00
3     inf    5.75    3.60    -inf    2.50    -inf    0.70    -inf        16        0.04  [0.81, 0.19, 0.00]  0.81  0.19  0.00
4    5.45    -inf     inf    -inf    2.45    -inf    1.35    0.80        16        0.02  [0.16, 0.84, 0.00]  0.16  0.84  0.00
--------------------------------------------------
RandomForest Accuracy: 1.00
ForestBasedTree Accuracy: 1.00
Decision Tree Accuracy:   1.00
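
Each row of the conjunction-set table reads as one rule: the `i_lower`/`i_upper` column pairs bound feature `i`, and the remaining columns hold the class probabilities for that branch. The sketch below retypes two of the printed rows and finds which rule contains a given sample. It assumes a sample belongs to a branch when `lower < x <= upper` holds for every feature (mirroring scikit-learn's "`x <= threshold` goes left" rule); the boundary handling inside ForestBasedTree may differ, and `matching_branches` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

inf = np.inf
# Rows 0 and 4 of the Iris table printed above, retyped by hand.
cs = pd.DataFrame(
    {
        "0_upper": [5.45, 5.45], "0_lower": [-inf, -inf],
        "1_upper": [inf, inf],   "1_lower": [-inf, -inf],
        "2_upper": [2.45, 2.45], "2_lower": [-inf, -inf],
        "3_upper": [0.70, 1.35], "3_lower": [-inf, 0.80],
        "probas": [[1.00, 0.00, 0.00], [0.16, 0.84, 0.00]],
    },
    index=[0, 4],
)


def matching_branches(cs_df, x):
    """Return the index labels of the rows whose intervals contain the sample x."""
    mask = np.ones(len(cs_df), dtype=bool)
    for i, value in enumerate(x):
        lower = cs_df[f"{i}_lower"].to_numpy()
        upper = cs_df[f"{i}_upper"].to_numpy()
        # Assumed convention: lower bound exclusive, upper bound inclusive.
        mask &= (lower < value) & (value <= upper)
    return list(cs_df.index[mask])


sample = [5.0, 3.4, 1.5, 0.2]   # sepal length, sepal width, petal length, petal width
print(matching_branches(cs, sample))
```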


### on Breast Cancer (4.5s)

Train a RandomForestClassifier on the Breast Cancer training set, then build and fit a ForestBasedTree (FBT) with the same parameters. Display the head of the resulting conjunction-set DataFrame.


In [4]:
# RandomForest parameters for Breast Cancer
num_of_estimators = 20
max_depth = 5
min_sample_leaf = max(1, int(0.02 * len(bc_X_train)))

bc_rf = RandomForestClassifier(
    n_estimators=num_of_estimators,
    max_depth=max_depth,
    min_samples_leaf=min_sample_leaf,
    random_state=seed
)
bc_rf.fit(bc_X_train, bc_y_train)

# ForestBasedTree parameters
minimal_forest_size = 10
max_number_of_branches = 50
exclusion_threshold = 0.8

bc_fbt = ForestBasedTree(random_state=seed)
bc_fbt.fit(
    bc_rf,
    bc_X_train,
    bc_y_train,
    bc_feature_types,
    bc_feature_names,
    minimal_forest_size=minimal_forest_size,
    amount_of_branches_threshold=max_number_of_branches,
    exclusion_threshold=exclusion_threshold
)

h = show_df(bc_fbt.cs_df)

print("-"*50)

bc_y_pred_rf = bc_rf.predict(bc_X_test)
rf_accuracy = accuracy_score(bc_y_test, bc_y_pred_rf)


bc_y_pred_fbt = bc_fbt.predict(bc_X_test)
fbt_accuracy = accuracy_score(bc_y_test, bc_y_pred_fbt)


bc_dt = DecisionTreeClassifier(max_depth=4, random_state=seed)
bc_dt.fit(bc_X_train, bc_y_train)

bc_y_pred_dt = bc_dt.predict(bc_X_test)
dt_accuracy = accuracy_score(bc_y_test, bc_y_pred_dt)


print(f"RandomForest Accuracy: {rf_accuracy:.2f}")
print(f"ForestBasedTree Accuracy: {fbt_accuracy:.2f}")
print(f"Decision Tree Accuracy:   {dt_accuracy:.2f}")


  0_upper 0_lower 1_upper 1_lower 2_upper 2_lower 3_upper 3_lower 4_upper 4_lower 5_upper 5_lower 6_upper 6_lower 7_upper 7_lower 8_upper 8_lower 9_upper 9_lower 10_upper 10_lower 11_upper 11_lower 12_upper 12_lower 13_upper 13_lower 14_upper 14_lower 15_upper 15_lower 16_upper 16_lower 17_upper 17_lower 18_upper 18_lower 19_upper 19_lower 20_upper 20_lower 21_upper 21_lower 22_upper 22_lower 23_upper 23_lower 24_upper 24_lower 25_upper 25_lower 26_upper 26_lower 27_upper 27_lower 28_upper 28_lower 29_upper 29_lower n_samples branch_prob        probas     0     1
0   14.15    -inf   19.36    -inf     inf    -inf  690.50    -inf     inf    -inf     inf    -inf     inf    -inf    0.05    -inf     inf    -inf     inf    -inf     0.38     -inf     1.37     -inf     2.81     -inf    31.25     -inf      inf     -inf      inf     -inf      inf     -inf      inf     -inf      inf     -inf      inf     -inf    14.48     -inf    29.80     -inf   101.85     -inf      inf     -inf      inf     -in

### on California Housing (28s)

Train a RandomForestRegressor on the California Housing training set, then fit a ForestBasedTree. Display the head of the resulting conjunction-set DataFrame.


In [5]:
feature_names = calif_feature_names
feature_types = calif_feature_types

num_of_estimators = 20
max_depth = 5
min_sample_leaf = max(1, int(0.02 * len(calif_X_train)))

calif_rf = RandomForestRegressor(
    n_estimators=num_of_estimators,
    max_depth=max_depth,
    min_samples_leaf=min_sample_leaf,
    random_state=seed
)
calif_rf.fit(calif_X_train, calif_y_train)

minimal_forest_size = 10
max_number_of_branches = 50
exclusion_threshold = 0.8

calif_fbt = ForestBasedTree(random_state=seed)
calif_fbt.fit(
    calif_rf,
    calif_X_train,
    calif_y_train,
    feature_types=feature_types,
    feature_names=feature_names,
    minimal_forest_size=minimal_forest_size,
    amount_of_branches_threshold=max_number_of_branches,
    exclusion_threshold=exclusion_threshold
)


h = show_df(calif_fbt.cs_df)

print("-"*50)


calif_y_pred_rf = calif_rf.predict(calif_X_test)
rf_mse = mean_squared_error(calif_y_test, calif_y_pred_rf)



calif_y_pred_fbt = calif_fbt.predict(calif_X_test)
fbt_mse = mean_squared_error(calif_y_test, calif_y_pred_fbt)


calif_dt = DecisionTreeRegressor(random_state=seed)
calif_dt.fit(calif_X_train, calif_y_train)

calif_y_pred_dt = calif_dt.predict(calif_X_test)
dt_mse = mean_squared_error(calif_y_test, calif_y_pred_dt)


print(f"RandomForest Mean Squared Error: {rf_mse:.2f}")
print(f"ForestBasedTree Mean Squared Error: {fbt_mse:.2f}")
print(f"Decision Tree Mean Squared Error:   {dt_mse:.2f}")


  0_upper 0_lower 1_upper 1_lower 2_upper 2_lower 3_upper 3_lower 4_upper 4_lower 5_upper 5_lower 6_upper 6_lower 7_upper 7_lower n_samples branch_prob regressions
0    3.06    2.20     inf    -inf    3.90    -inf     inf    -inf     inf    -inf    2.44    -inf     inf    -inf     inf    -inf       453        0.01        2.17
1    3.06    2.20     inf    -inf    3.90    -inf     inf    -inf     inf    -inf    2.91    2.51     inf    -inf     inf    -inf       460        0.01        2.11
2    3.06    2.20     inf    -inf    3.90    -inf     inf    -inf     inf    -inf     inf    3.11     inf    -inf     inf    -inf       423        0.01        1.57
3    3.06    2.46     inf    -inf    4.21    3.90     inf    -inf     inf    -inf     inf    3.01     inf    -inf     inf    -inf       413        0.00        1.62
4    2.32    -inf     inf    -inf     inf    4.31     inf    -inf     inf    -inf    2.43    -inf   34.52    -inf     inf    -inf       390        0.02        1.15
----------------

## Predictions


### on Iris

Use the already-fitted ForestBasedTree (`iris_fbt`) to predict on the Iris test set, compute accuracy, then build a VisTree from `iris_fbt`, inspect its attributes, and verify that VisTree’s predictions match the ForestBasedTree predictions.


In [6]:
X_train = iris_X_train
X_test = iris_X_test
y_train = iris_y_train
y_test = iris_y_test
class_names = iris_class_names

fbt_ypred = iris_fbt.predict(X_test)
accuracy = accuracy_score(y_test, fbt_ypred)
print(f"ForestBasedTree Accuracy: {accuracy:.4f}")


iris_fbt_viz = VisTree(iris_fbt, X=iris_X, class_names=class_names)
print("VisTree is_classifier flag:", iris_fbt_viz.is_classifier)

print("="*50)
fbtviz_ypred = iris_fbt_viz.predict(X_test)
print("VisTree Predictions:", fbtviz_ypred)
print("="*50)

accuracy = accuracy_score(y_test, fbtviz_ypred)
print(f"VisTree Accuracy: {accuracy:.4f}")
print("="*50)
iris_fbt_viz.print_nodes()



ForestBasedTree Accuracy: 1.0000
VisTree is_classifier flag: True
VisTree Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]
VisTree Accuracy: 1.0000
Node 0 (parent=None, is_left=None) | value=[0.27, 0.39, 0.34] | n_train=150
  feature:   petal length (cm)
  threshold: 5.05
  left_id:   1
  right_id:  30
  n_samples: 0
  Node 1 (parent=0, is_left=True) | value=[0.37, 0.49, 0.14] | n_train=108
    feature:   petal width (cm)
    threshold: 0.70
    left_id:   2
    right_id:  13
    n_samples: 0
    Node 2 (parent=1, is_left=True) | value=[0.80, 0.12, 0.08] | n_train=50
      feature:   sepal length (cm)
      threshold: 5.75
      left_id:   3
      right_id:  4
      n_samples: 0
      Leaf 3 (parent=2, is_left=True) | value=[0.80, 0.12, 0.08] | n_train=49
      Node 4 (parent=2, is_left=False) | value=[0.81, 0.19, 0.00] | n_train=1
        feature:   petal length (cm)
        threshold: 2.50
        left_id:   5
        right_id:
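
Equal accuracy does not by itself prove that two models predict the same labels, so an element-wise comparison is a stricter check. A minimal sketch follows, using short stand-in arrays; in this notebook the inputs would be `fbt_ypred` and `fbtviz_ypred`, and the helper name `agreement` is an assumption made for this example.

```python
import numpy as np


def agreement(pred_a, pred_b):
    """Fraction of positions where two prediction vectors agree."""
    pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
    if pred_a.shape != pred_b.shape:
        raise ValueError("prediction vectors must have the same shape")
    return float(np.mean(pred_a == pred_b))


# Stand-in arrays; b differs from a in the last two positions.
a = np.array([1, 0, 2, 1, 1, 0])
b = np.array([1, 0, 2, 1, 0, 1])
print(agreement(a, b), np.array_equal(a, b))
```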

### on Breast Cancer

Use the already-fitted ForestBasedTree (`bc_fbt`) to predict on the Breast Cancer test set, compute accuracy, then build a VisTree from `bc_fbt`, inspect its attributes, and verify that VisTree’s predictions match the ForestBasedTree predictions.



In [7]:
X_train = bc_X_train
X_test  = bc_X_test
y_train = bc_y_train
y_test  = bc_y_test
class_names = bc_class_names

# Wrapper built with positional arguments; reused as `cancer_fbt_viz` in the Visualization section
cancer_fbt_viz = VisTree(bc_fbt, bc_X, class_names)

fbt_ypred = bc_fbt.predict(X_test)
accuracy = accuracy_score(y_test, fbt_ypred)
print(f"ForestBasedTree Accuracy: {accuracy:.4f}")

# Same wrapper built with keyword arguments (VisTree is already imported in Setup)
bc_fbt_viz = VisTree(bc_fbt, X=bc_X, class_names=class_names)
print("VisTree is_classifier flag:", bc_fbt_viz.is_classifier)

print("="*50)
fbtviz_ypred = bc_fbt_viz.predict(X_test)
print("VisTree Predictions:", fbtviz_ypred)
print("="*50)

accuracy = accuracy_score(y_test, fbtviz_ypred)
print(f"VisTree Accuracy: {accuracy:.4f}")
print("="*50)
bc_fbt_viz.print_nodes()



ForestBasedTree Accuracy: 0.6491
VisTree is_classifier flag: True
VisTree Predictions: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1]
VisTree Accuracy: 0.6491
Node 0 (parent=None, is_left=None) | value=[0.14, 0.86] | n_train=569
  feature:   area error
  threshold: 31.25
  left_id:   1
  right_id:  60
  n_samples: 0
  Node 1 (parent=0, is_left=True) | value=[0.14, 0.86] | n_train=358
    feature:   mean concave points
    threshold: 0.05
    left_id:   2
    right_id:  3
    n_samples: 0
    Leaf 2 (parent=1, is_left=True) | value=[0.12, 0.88] | n_train=314
    Node 3 (parent=1, is_left=False) | value=[0.25, 0.75] | n_train=44
      feature:   mean radius
      threshold: 15.23
      left_i
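
The prediction vector printed above is almost entirely class 1, so a quick first diagnostic is to compare predicted label counts with the true label counts and with a majority-class baseline. The sketch below uses short hypothetical label arrays purely for illustration; in the notebook, `y_test` and `fbt_ypred` would take their place. Both helper names are assumptions of this example.

```python
import numpy as np


def label_counts(y, n_classes):
    """Count how often each class index occurs in y."""
    return np.bincount(np.asarray(y), minlength=n_classes)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent true label."""
    counts = np.bincount(np.asarray(y_true))
    return counts.max() / counts.sum()


# Hypothetical labels for illustration only; substitute y_test and fbt_ypred.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1])
print("true counts:      ", label_counts(y_true, 2))
print("predicted counts: ", label_counts(y_pred, 2))
print("majority baseline:", majority_baseline_accuracy(y_true))
```

A model whose accuracy sits close to the majority baseline, with predicted counts concentrated in one class, is likely collapsing to that class rather than learning the split.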

In [8]:
# !python3 -m src.xtrees.dash.vis_tree


## Visualization

Below, we demonstrate various interactive visualizations for:
1. The pruned breast-cancer ForestBasedTree (FBT) as a Sankey plot.
2. An interactive Dash app for the pruned FBT on the Iris test set.
3. The original decision tree (DT) for the breast-cancer dataset as a Sankey plot.
4. The same breast-cancer DT as a Dash app.
5. An interactive RandomForest dashboard.
6. A combined FBT + RF dashboard for Iris.
7. A Sankey plot of a pruned regression tree on the California housing dataset.
8. A combined FBT + RF dashboard for California housing.


In [9]:
from src.xtrees.dash.tree_dash import *
from src.xtrees.dash.tree_plot import *
from src.xtrees.dash.vis_tree import *


In [10]:
# 1) Pruned FBT Sankey plot for breast‐cancer

# Prune the breast-cancer FBT wrapper (bc_fbt_viz) to depth 10
pruned_bc_fbt = bc_fbt_viz.prune(max_depth=10)
bc_fbt_sankey = SankeyTreePlot(pruned_bc_fbt)

# Show the interactive Sankey plot of the pruned FBT
bc_fbt_sankey.show()


*Explanation:*  
We prune the FBT built on the breast‐cancer RandomForest down to depth 10, then pass it into `SankeyTreePlot`.  
The `.show()` call renders an interactive Sankey diagram of all leaf‐to‐root flows (branch importances → values).
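
A Sankey diagram is described by parallel lists of link sources, link targets, and link values. The sketch below derives those lists from a parent/child node table. The table is hypothetical and does not reflect how `SankeyTreePlot` stores its data internally; it only illustrates the kind of input a Sankey renderer needs.

```python
# Hypothetical node table: id -> (parent id, training samples reaching the node).
nodes = {
    0: (None, 150),
    1: (0, 108),
    2: (0, 42),
    3: (1, 50),
    4: (1, 58),
}


def sankey_links(nodes):
    """Build parallel source/target/value lists, one entry per parent-child edge."""
    source, target, value = [], [], []
    for node_id, (parent, n_samples) in sorted(nodes.items()):
        if parent is None:      # the root has no incoming link
            continue
        source.append(parent)
        target.append(node_id)
        value.append(n_samples)
    return source, target, value


source, target, value = sankey_links(nodes)
print(source, target, value)
```

These three lists have the shape that Plotly's `go.Sankey(link=dict(source=..., target=..., value=...))` accepts, should you want to render such a table yourself.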


In [11]:
# 2) FBT Dash app for Iris test set

# Prune the Iris FBT to a large depth to keep all nodes
pruned_iris_fbt = iris_fbt_viz.prune(max_depth=100)  
iris_fbt_dash = VisTreeDashboard(pruned_iris_fbt, iris_X_test, iris_y_test)

# Launch at port 8060
iris_fbt_dash.run(port=8060)


*Explanation:*  
This cell creates a Dash dashboard (`VisTreeDashboard`) for the pruned Iris FBT.  
It will spin up a local web app (http://localhost:8060) where you can interactively explore how each test sample flows through the FBT and inspect leaf‐value predictions.
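
To make "how a sample flows through the tree" concrete, the sketch below traces one sample through a small dictionary-based tree and returns the visited node ids together with the leaf value. The tree, its thresholds, and the helper name `trace_sample` are hypothetical; the dashboard's own traversal logic is not shown in this notebook.

```python
def trace_sample(tree, node_id, sample):
    """Follow a sample from node_id down to a leaf; return (visited ids, leaf value)."""
    path = [node_id]
    node = tree[node_id]
    while "feature" in node:                      # internal nodes carry a split
        go_left = sample[node["feature"]] <= node["threshold"]
        node_id = node["left"] if go_left else node["right"]
        path.append(node_id)
        node = tree[node_id]
    return path, node["value"]


# Hypothetical three-leaf tree; feature names follow the Iris columns.
tree = {
    0: {"feature": "petal length (cm)", "threshold": 2.45, "left": 1, "right": 2},
    1: {"value": [1.0, 0.0, 0.0]},
    2: {"feature": "petal width (cm)", "threshold": 1.75, "left": 3, "right": 4},
    3: {"value": [0.0, 0.9, 0.1]},
    4: {"value": [0.0, 0.1, 0.9]},
}

sample = {"petal length (cm)": 4.7, "petal width (cm)": 1.4}
print(trace_sample(tree, 0, sample))
```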


In [12]:
# 3) Original Decision Tree Sankey plot for breast‐cancer

# Build a Sankey plot of the original breast‐cancer decision tree
bc_dt_viz = VisTree(
    model=bc_dt,
    X=bc_X,
    class_names=bc_class_names
)

bc_dt_sankey = SankeyTreePlot(bc_dt_viz)
bc_dt_sankey.show()


*Explanation:*  
We wrap the fitted breast‐cancer decision tree (`bc_dt`) in a `VisTree` object (`bc_dt_viz`) and visualize it as a Sankey plot.  
Each node’s probability distribution is shown, and flows highlight how training samples split at each threshold.
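
The sample flows in such a plot can be read straight from a fitted scikit-learn tree through its `tree_` attribute. The sketch below is independent of `VisTree`: it fits a depth-4 tree on the full breast-cancer data (not the notebook's train split, to keep the example short) and lists how many training samples pass along each parent-to-child edge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)
t = clf.tree_

links = []  # (parent id, child id, training samples flowing into the child)
for parent in range(t.node_count):
    left, right = t.children_left[parent], t.children_right[parent]
    if left == -1:              # scikit-learn marks leaves with child id -1
        continue
    for child in (left, right):
        links.append((parent, int(child), int(t.n_node_samples[child])))

for parent, child, n in links:
    print(f"{parent} -> {child}: {n} samples")
```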


In [13]:
# 4) Original Decision Tree Dash app for breast‐cancer

bc_tree_dash = VisTreeDashboard(bc_dt_viz, bc_X_test, bc_y_test)
bc_tree_dash.run(port=8061)


*Explanation:*  
This launches an interactive Dash app (http://localhost:8061) for the breast‐cancer DT (`bc_dt_viz`).  
You can hover over nodes to see split conditions, inspect leaf probabilities, and see how test samples travel through the tree.
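
Outside the dashboard, scikit-learn can report the same per-sample route with `decision_path` and `apply`. The sketch below refits a tree with the notebook's split settings so that it is self-contained; the variable names are local to this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
dt = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

indicator = dt.decision_path(X_test)   # sparse matrix: rows are samples, columns are nodes
leaf_ids = dt.apply(X_test)            # id of the leaf each sample ends in

sample_id = 0
start, stop = indicator.indptr[sample_id], indicator.indptr[sample_id + 1]
path = indicator.indices[start:stop]   # node ids visited by this sample, root first
print("path:", path.tolist(), "leaf:", int(leaf_ids[sample_id]))
```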


In [14]:
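# Pruned FBT Sankey plot for breast-cancer, via the cancer_fbt_viz wrapper
pruned_fbt = cancer_fbt_viz.prune(10)
fbt_sankey = SankeyTreePlot(pruned_fbt)

fbt_sankey.show()

*Explanation:*  
This cell repeats the pruned breast‐cancer FBT Sankey plot from step 1, using the `cancer_fbt_viz` wrapper and passing the depth positionally.  
The RandomForest dashboard named in the overview (an `RFDashboard` for `bc_rf` that shows train vs. test metrics, feature importances, and individual tree contributions) is not constructed in this cell.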
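
The quantities associated with a RandomForest dashboard (train vs. test metrics, feature importances, per-tree behavior) can also be computed directly with scikit-learn. The sketch below does not use `RFDashboard`, whose constructor is not shown in this notebook; it refits a simplified forest (without the `min_samples_leaf` setting used earlier) so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42).fit(X_train, y_train)

# Train vs. test accuracy of the whole forest.
print("train:", accuracy_score(y_train, rf.predict(X_train)))
print("test: ", accuracy_score(y_test, rf.predict(X_test)))

# Five most important features by impurity-based importance.
top = np.argsort(rf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25s} {rf.feature_importances_[i]:.3f}")

# Accuracy of each individual tree on the test set.
# Sub-estimators predict encoded class indices, so map them back through rf.classes_.
per_tree = [
    accuracy_score(y_test, rf.classes_[est.predict(X_test).astype(int)])
    for est in rf.estimators_
]
print("per-tree test accuracy range:", min(per_tree), max(per_tree))
```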


In [15]:
# 6) Combined FBT + RF dashboard for Iris

combined_iris_dash = CombinedDashboard(
    iris_fbt_viz,       # the VisTree wrapper around the Iris FBT
    iris_X_test,        # test‐set features
    iris_y_test,        # test‐set labels
    iris_X,             # full Iris data (for train visualization)
    iris_rf,            # the RandomForestClassifier on Iris
    iris_class_names    # class‐name labels
)
combined_iris_dash.run(port=8063)


*Explanation:*  
This combined dashboard (`CombinedDashboard`) lets you compare FBT and RF side by side on Iris.  
It shows FBT‐based branches, test‐set accuracy, plus standard RF metrics and feature importances in one app at http://localhost:8063.


In [16]:
# 7) Pruned regression tree Sankey plot for California housing

# Regression tree: class names do not apply, so none are passed
calif_dt_viz = VisTree(
    model=calif_dt,
    X=calif_X
)

pruned_calif_viz = calif_dt_viz.prune(max_depth=4)
calif_sankey = SankeyTreePlot(pruned_calif_viz, show_text=False)
calif_sankey.show()


*Explanation:*  
We prune a `VisTree` built on the California‐housing regression tree (`calif_dt_viz`) down to depth 4,  
then visualize it as a Sankey diagram (without node‐text labels) to focus on numeric flows.
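
Depth pruning can be pictured as turning every node at the depth limit into a leaf that keeps its own value. The sketch below shows that idea on a nested-dictionary tree. It is an assumption about what `prune(max_depth=...)` does conceptually, not the `VisTree` implementation, and the tree values are hypothetical (only the feature names come from the California housing columns).

```python
def prune_to_depth(node, max_depth, depth=0):
    """Return a copy of the tree in which nodes at max_depth become leaves."""
    is_leaf = "left" not in node
    if is_leaf or depth >= max_depth:
        # Keep only the node's own prediction and drop any subtree below it.
        return {"value": node["value"]}
    return {
        "feature": node["feature"],
        "threshold": node["threshold"],
        "value": node["value"],
        "left": prune_to_depth(node["left"], max_depth, depth + 1),
        "right": prune_to_depth(node["right"], max_depth, depth + 1),
    }


def depth_of(node):
    """Number of edges on the longest root-to-leaf path."""
    if "left" not in node:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))


def leaf(value):
    """Build a leaf node holding a mean target value."""
    return {"value": value}


# Hypothetical regression tree of depth 3; "value" is the mean target at the node.
tree = {
    "feature": "MedInc", "threshold": 5.0, "value": 2.07,
    "left": {
        "feature": "AveOccup", "threshold": 2.4, "value": 1.7,
        "left": leaf(2.2),
        "right": {"feature": "HouseAge", "threshold": 30.0, "value": 1.5,
                  "left": leaf(1.4), "right": leaf(1.7)},
    },
    "right": leaf(3.5),
}

pruned = prune_to_depth(tree, max_depth=2)
print(depth_of(tree), depth_of(pruned))
```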


In [17]:
# 8) Combined FBT + RF dashboard for California housing

calif_fbt_viz = VisTree(calif_fbt, calif_X)

combined_calif_dash = CombinedDashboard(
    calif_fbt_viz,       # the VisTree wrapper around the California FBT
    calif_X_test,        # test‐set features
    calif_y_test,        # test‐set labels
    calif_X,             # full California data (train)
    calif_rf,            # the RandomForestRegressor on California
    calif_feature_names  # feature‐name list
)
combined_calif_dash.run(port=8064)


*Explanation:*  
Finally, we build a `CombinedDashboard` for the California‐housing FBT and RF.  
Visit http://localhost:8064 to interactively compare how the ForestBasedTree approximates the RandomForestRegressor on numeric targets (e.g., median house value).
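
How well a single tree approximates a forest can also be quantified as fidelity: the error between the surrogate's predictions and the forest's predictions, separate from either model's error on the true targets. The sketch below is self-contained and uses stand-ins: synthetic data from `make_regression` and a plain `DecisionTreeRegressor` trained on the forest's outputs in place of ForestBasedTree. In the notebook, `calif_rf.predict(calif_X_test)` and `calif_fbt.predict(calif_X_test)` would be compared the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=42).fit(X_train, y_train)

# Stand-in surrogate: a single tree trained to imitate the forest's outputs.
surrogate = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train, rf.predict(X_train))

rf_pred = rf.predict(X_test)
sur_pred = surrogate.predict(X_test)

print("fidelity MSE (surrogate vs forest):", mean_squared_error(rf_pred, sur_pred))
print("fidelity R^2 (surrogate vs forest):", r2_score(rf_pred, sur_pred))
print("test MSE, forest:   ", mean_squared_error(y_test, rf_pred))
print("test MSE, surrogate:", mean_squared_error(y_test, sur_pred))
```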


