# Analysis of (Conditional) Mutual Information with Supervised Forests

Unlike unsupervised forests, supervised forests predict class probabilities, which makes them useful for estimating entropy directly. They also lend themselves well to computing (C)MI between discrete and continuous variables, i.e. when Y is categorical and X is continuous in the (X, Y) input data.

- MI: I(X;Y) = H(Y) - H(Y|X)
- CMI: I(X;Y | Z) = H(Y|Z) - H(Y | X, Z)
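
As a quick sanity check of the two identities above, the sketch below evaluates them on small discrete probability tables. The helper names (`entropy`, `mi`, `cmi`) and the XOR toy table are illustrative assumptions and are not part of `sktree`; conditional entropies are obtained through the chain rule, e.g. H(Y|X) = H(X,Y) - H(X).

```python
import numpy as np


def entropy(p):
    """Shannon entropy (nats) of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))


def mi(p_xy):
    """I(X;Y) = H(Y) - H(Y|X) for a table indexed as p_xy[x, y]."""
    h_y = entropy(p_xy.sum(axis=0))
    h_y_given_x = entropy(p_xy) - entropy(p_xy.sum(axis=1))
    return h_y - h_y_given_x


def cmi(p_xyz):
    """I(X;Y|Z) = H(Y|Z) - H(Y|X,Z) for a table indexed as p_xyz[x, y, z]."""
    h_y_given_z = entropy(p_xyz.sum(axis=0)) - entropy(p_xyz.sum(axis=(0, 1)))
    h_y_given_xz = entropy(p_xyz) - entropy(p_xyz.sum(axis=1))
    return h_y_given_z - h_y_given_xz


# Toy example (hypothetical): X and Z are independent fair bits, Y = X XOR Z.
p_xyz = np.zeros((2, 2, 2))
for x in (0, 1):
    for z in (0, 1):
        p_xyz[x, x ^ z, z] = 0.25

xor_mi = mi(p_xyz.sum(axis=2))  # marginalize Z out to get p(x, y)
xor_cmi = cmi(p_xyz)
print(xor_mi, xor_cmi)
```

In the XOR table, X alone carries no information about Y, yet X and Z together determine it, so the MI is approximately 0 while the CMI is approximately log 2 nats.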

Specifically, we will extend the separated Gaussian simulations of the uncertainty forest paper.

$$
\begin{aligned}
    I(X; Y | Z) &= H(X | Z) - H(X | Y, Z)  \quad \text{(Entropy identity)}\\
    &= (H(X, Z) - H(Z)) - (H(X, Z | Y) - H(Z | Y)) \quad \text{(By chain rule)} \\
    &= H(X, Z) - H(Z) - H(X, Z | Y) + H(Z | Y)  \quad \text{(Simplify)}
\end{aligned}
$$

Each quantity is now directly computable, since $(X, Z)$ follows a multivariate Gaussian for each class of the categorical $Y$.
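
To make the four-term decomposition above concrete, here is a minimal sketch that evaluates it for a two-class Gaussian mixture. It assumes the class-conditional terms use the closed-form Gaussian entropy and the unconditional (mixture) terms are approximated by Monte Carlo. The function names, the toy means, and the number of draws are hypothetical; this is not necessarily how `cmi_separated_gaussians` computes its value.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian: 0.5 * log det(2*pi*e*cov)."""
    cov = np.atleast_2d(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (cov.shape[0] * np.log(2 * np.pi * np.e) + logdet)


def sample_mixture(means, covs, weights, n_draws, rng):
    """Draw from the mixture: pick a class label, then sample its Gaussian."""
    labels = rng.choice(len(weights), size=n_draws, p=weights)
    draws = np.empty((n_draws, len(means[0])))
    for k in range(len(weights)):
        mask = labels == k
        draws[mask] = rng.multivariate_normal(means[k], covs[k], size=int(mask.sum()))
    return draws


def mixture_entropy(draws, means, covs, weights):
    """Monte Carlo estimate of -E[log p(x)] under the mixture density."""
    density = sum(
        w * multivariate_normal(mean=m, cov=c).pdf(draws)
        for w, m, c in zip(weights, means, covs)
    )
    return float(-np.mean(np.log(density)))


def cmi_mixture(means, covs, weights, z_idx, n_draws=20_000, seed=0):
    """I(X;Y|Z) = H(X,Z) - H(Z) - H(X,Z|Y) + H(Z|Y), with Z = columns z_idx."""
    rng = np.random.default_rng(seed)
    draws = sample_mixture(means, covs, weights, n_draws, rng)
    z_means = [m[z_idx] for m in means]
    z_covs = [c[np.ix_(z_idx, z_idx)] for c in covs]
    h_xz = mixture_entropy(draws, means, covs, weights)
    h_z = mixture_entropy(draws[:, z_idx], z_means, z_covs, weights)
    h_xz_given_y = sum(w * gaussian_entropy(c) for w, c in zip(weights, covs))
    h_z_given_y = sum(w * gaussian_entropy(c) for w, c in zip(weights, z_covs))
    return h_xz - h_z - h_xz_given_y + h_z_given_y


# Hypothetical parameters: two unit-covariance classes with shifted means.
toy_means = [np.zeros(3), np.ones(3)]
toy_covs = [np.eye(3), np.eye(3)]
toy_weights = [0.5, 0.5]
toy_true_cmi = cmi_mixture(toy_means, toy_covs, toy_weights, z_idx=[1, 2])

# If the X coordinate has the same distribution in both classes, the CMI vanishes.
null_means = [np.zeros(3), np.array([0.0, 1.0, 1.0])]
null_cmi = cmi_mixture(null_means, toy_covs, toy_weights, z_idx=[1, 2])
print(toy_true_cmi, null_cmi)
```

Reusing the same draws for the $H(X, Z)$ and $H(Z)$ terms is a deliberate choice in this sketch: the Monte Carlo errors of the two terms are correlated, so much of the noise cancels in their difference.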



In [3]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black


In [22]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import scipy.spatial
import seaborn as sns
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted
from tqdm import tqdm

import sktree
from sktree import (
    HonestForestClassifier,
    NearestNeighborsMetaEstimator,
    ObliqueRandomForestClassifier,
    ObliqueRandomForestRegressor,
    UnsupervisedObliqueRandomForest,
    UnsupervisedRandomForest,
)
from sktree.experimental import SupervisedInfoForest, mutual_info_ksg
from sktree.experimental.ksg import _compute_radius_nbrs, mutual_info_ksg_nn
from sktree.experimental.mutual_info import (
    cmi_from_entropy,
    cmi_gaussian,
    entropy_gaussian,
    entropy_weibull,
    mi_from_entropy,
    mi_gamma,
    mi_gaussian,
)
from sktree.experimental.simulate import (
    cmi_separated_gaussians,
    embed_high_dims,
    mi_separated_gaussians,
    simulate_helix,
    simulate_multivariate_gaussian,
    simulate_separate_gaussians,
    simulate_sphere,
)
from sktree.neighbors import forest_distance
from sktree.tree import ObliqueDecisionTreeClassifier, compute_forest_similarity_matrix

## Define Hyperparameters of the Simulation

In [6]:
seed = 12345
rng = np.random.default_rng(seed)

In [7]:
n_jobs = -1
n_estimators = 100
feature_combinations = 2.0
n_nbrs = 5

# hyperparameters of the simulation
n_samples = 1000
n_noise_dims = 20
alpha = 0.001
n_classes = 2

# dimensionality of mvg
n_dims = 3

# for sphere
radius = 1.0

# for helix
radius_a = 0.0
radius_b = 2.0

# manifold parameters
radii_func = lambda: rng.uniform(0, 1)

## Set up a single simulation

To demonstrate what the data look like for a single parameterized simulation, we walk through the entire workflow, from data generation to analysis and the output value.

In [8]:
# simulate separated Gaussians (one multivariate Gaussian per class)
X, y, means, sigmas, pi = simulate_separate_gaussians(
    n_dims=n_dims, n_samples=n_samples, n_classes=n_classes, seed=seed
)

print(X.shape, y.shape)

(1000, 3) (1000,)


In [9]:
print(len(means), len(sigmas))
print(means[0].shape, pi)

2 2
(3,) [0.5 0.5]


In [10]:
print(sigmas)

[array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]]), array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])]


In [11]:
# ground-truth MI, I(X;Y); the ground-truth CMI, I(X;Y|Z), is computed in the next cell
true_mi = mi_separated_gaussians(means, sigmas, pi, seed=seed)

In [13]:
true_cmi = cmi_separated_gaussians(means, sigmas, pi, condition_idx=[1, 2], seed=seed)

4.351104834412363 2.8469010556693393 4.2568155996140185 2.8378770664093453


In [14]:
condition_idx = [1, 2]
z_sigmas = [sigma[np.ix_(condition_idx, condition_idx)] for sigma in sigmas]
print([sigma.shape for sigma in z_sigmas])

[(2, 2), (2, 2)]


In [16]:
print(true_mi)
print(true_cmi)

0.09428923479834417
0.08526524553835024


In [17]:
_X = embed_high_dims(X[:, [1]], n_dims=n_noise_dims, random_state=seed)
_Z = embed_high_dims(X[:, [2]], n_dims=n_noise_dims, random_state=seed)
highdim_X = np.hstack((_X, _Z))

print(X.shape, y.shape, _X.shape, _Z.shape)

(1000, 3) (1000,) (1000, 21) (1000, 21)


# Set up SupervisedInfoForest

`SupervisedInfoForest` is a meta-estimator that I wrote to expose a scikit-learn-style API for computing MI/CMI. It proceeds by fitting multiple forests, which are stored in fitted attributes such as `estimator_yxz_` and `estimator_yz_`.
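
To illustrate the idea of fitting multiple forests, the sketch below estimates I(X;Y|Z) = H(Y|Z) - H(Y|X,Z) with two classifiers, one trained on Z alone and one on (X, Z). It is an assumption-laden stand-in, not the internals of `SupervisedInfoForest`: it uses scikit-learn's `RandomForestClassifier` with a held-out split in place of honest trees, and the helper names and toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def conditional_entropy(features, y, seed=0):
    """Estimate H(Y | features) in nats from held-out forest posteriors."""
    f_train, f_test, y_train, _ = train_test_split(
        features, y, test_size=0.5, random_state=seed, stratify=y
    )
    forest = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=10, random_state=seed
    )
    forest.fit(f_train, y_train)
    # Clip to avoid log(0) when a posterior entry is exactly zero.
    proba = np.clip(forest.predict_proba(f_test), 1e-12, 1.0)
    return float(-np.mean(np.sum(proba * np.log(proba), axis=1)))


def forest_cmi(x, y, z=None, seed=0):
    """I(X;Y|Z) = H(Y|Z) - H(Y|X,Z); falls back to I(X;Y) when z is None."""
    if z is None:
        _, counts = np.unique(y, return_counts=True)
        prior = counts / counts.sum()
        h_y = float(-np.sum(prior * np.log(prior)))
        return h_y - conditional_entropy(x, y, seed)
    h_y_given_z = conditional_entropy(z, y, seed)
    h_y_given_xz = conditional_entropy(np.hstack((x, z)), y, seed)
    return h_y_given_z - h_y_given_xz


# Hypothetical toy data: X depends on Y, Z is pure noise.
toy_rng = np.random.default_rng(0)
n_toy = 2000
y_toy = toy_rng.integers(0, 2, size=n_toy)
x_toy = (3.0 * y_toy + toy_rng.normal(size=n_toy)).reshape(-1, 1)
z_toy = toy_rng.normal(size=(n_toy, 1))

toy_cmi = forest_cmi(x_toy, y_toy, z_toy)
print(toy_cmi)
```

Because Z is uninformative here, the forest trained on Z alone should return posteriors near the class prior, so nearly all of the estimated CMI comes from X.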

In [29]:
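# oblique tree variant (overridden by the reassignment below)
tree_estimator = ObliqueDecisionTreeClassifier(
    feature_combinations=feature_combinations,
    random_state=seed,
)
# NOTE: this reassignment replaces the oblique tree above, so the forest
# below is built from axis-aligned DecisionTreeClassifier trees
tree_estimator = DecisionTreeClassifier(
    # feature_combinations=feature_combinations,
    random_state=seed,
)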

estimator = HonestForestClassifier(
    n_estimators=n_estimators,
    n_jobs=n_jobs,
    random_state=seed,
    tree_estimator=tree_estimator,
)
est = SupervisedInfoForest(
    estimator=estimator, y_categorical=True, n_jobs=n_jobs, random_state=seed
)

In [30]:
est.fit(_X, y, _Z)

In [33]:
check_is_fitted(est)

In [38]:
print(est.estimator_yxz_.apply(X))

NotFittedError: This HonestTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

In [31]:
print(est.estimator_yxz_)
print(check_is_fitted(est))
print(check_is_fitted(est.estimator_yxz_))
print(check_is_fitted(est.estimator_yz_))

HonestForestClassifier(n_jobs=-1, random_state=12345,
                       tree_estimator=DecisionTreeClassifier(random_state=424562602))
None
None
None


In [27]:
est.estimator_yxz_.apply(X)

NotFittedError: This HonestTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

In [20]:
pred_cmi = est.predict_cmi(_X, y, _Z)

NotFittedError: This HonestTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

In [None]:
pred_mi = est.predict_cmi(_X, y)

In [69]:
print(pred_cmi, pred_mi)

-0.009832259764467277 -0.19976608189849127


In [62]:
print(pred_cmi, pred_mi)

-0.006589316434129411 -0.19854781628986418


# Analysis of dimensionality 