# Benchmarking Analysis for Multiview Hypothesis Testers

Here, we analyze the performance of multi-view trees on two views of data when one view
has a significantly greater number of dimensions than the other.

For example, in liquid biopsies from cancer patients, we might obtain features describing the
lengths of the DNA fragments found, or features describing the fragment ends (e.g. the ratio of base pairs
found at the ends). The two feature sets might have wildly different dimensionalities, such as 1,000 vs. 10,000,
depending on the level of complexity at which each biological characteristic is described.

So we want a hypothesis test that i) is aware that ``X`` comprises two feature sets, and ii)
can handle wildly different dimensionalities across the feature sets, with sufficient power
and a controlled type-I error rate, to provide useful answers to cancer biomarker hypotheses.
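To make "power" and "type-I error rate" concrete, the following is a minimal sketch of how both can be estimated by Monte Carlo simulation. It does not use the forest-based tests from this notebook: it swaps in a simple permutation test on a difference in means, and the helper names `permutation_pvalue` and `rejection_rate` are hypothetical and introduced only for illustration.

```python
import numpy as np


def permutation_pvalue(x, y, n_perm=200, rng=None):
    """Permutation p-value for a difference in means of x between y == 0 and y == 1."""
    rng = np.random.default_rng(rng)
    observed = abs(x[y == 1].mean() - x[y == 0].mean())
    null = np.empty(n_perm)
    for i in range(n_perm):
        # shuffling the labels breaks any association between x and y
        y_perm = rng.permutation(y)
        null[i] = abs(x[y_perm == 1].mean() - x[y_perm == 0].mean())
    # the add-one correction keeps the p-value strictly positive
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


def rejection_rate(effect, n_reps=200, n_samples=50, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which the null is rejected at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        y = np.repeat([0, 1], n_samples)
        # `effect` shifts the mean of class 1; effect=0 means the null is true
        x = rng.standard_normal(2 * n_samples) + effect * y
        rejections += int(permutation_pvalue(x, y, rng=rng) <= alpha)
    return rejections / n_reps


# Rejection rate when the null is true estimates the type-I error rate.
type_one_error = rejection_rate(effect=0.0)
# Rejection rate when the null is false estimates the power.
power = rejection_rate(effect=1.0)
print(f"type-I error: {type_one_error:.3f}, power: {power:.3f}")
```

A useful test keeps the first number close to the nominal level `alpha` while pushing the second as close to one as possible; the benchmarks in this notebook ask how both behave as the second view grows.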

In [5]:
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.special import expit
from sklearn.datasets import make_blobs

from sktree import HonestForestClassifier, RandomForestClassifier, RandomForestRegressor
from sktree.stats import (
    FeatureImportanceForestClassifier,
    FeatureImportanceForestRegressor,
    PermutationForestRegressor,
)

import mvlearn
from mvlearn.datasets import make_gaussian_mixture

seed = 12345
rng = np.random.default_rng(seed)

# Simulate data with varying dimensionality

Here, we will implement a function to simulate data with a varying number of dimensions in the second view.

We will implement a simulation that leverages two views stemming from the graphical model:

$(X1 \rightarrow Y \leftarrow X2; X1 \leftrightarrow X2)$

or

$(X1 \rightarrow Y \leftarrow X2)$

where X1 and X2 are the two views and Y is the target variable. The bidirected edge between X1 and X2 simply allows
the two views to be correlated.

We will also use the package `mvlearn` to simulate a two-view dataset with its `make_gaussian_mixture` function (https://mvlearn.github.io/references/datasets.html#data-simulator). This simulates an ``X`` and a ``y``, and then applies a transformation to ``X`` to produce the second view. This corresponds to the graphical model:

$(X1 \rightarrow Y \leftarrow X2; X1 \rightarrow X2)$

where X1 is the original dataset and X2 is the transformed dataset.
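As an illustration of the first graphical model, here is a minimal NumPy-only sketch that draws two views whose correlation comes from a shared latent factor, and a binary target that depends on both views. The function name `simulate_two_views`, the parameter `rho`, and the choice to let only the first feature of each view drive `y` are assumptions made for this example; they are not the simulation used in the rest of the notebook.

```python
import numpy as np


def simulate_two_views(n_samples=500, n_features_1=5, n_features_2=20, rho=0.5, seed=0):
    """Draw (X1, X2, y) from X1 -> Y <- X2 with an optional X1 <-> X2 dependence."""
    rng = np.random.default_rng(seed)

    # A shared latent factor induces the bidirected edge X1 <-> X2.
    # Setting rho = 0 removes it and gives the second graphical model.
    latent = rng.standard_normal((n_samples, 1))
    noise_1 = rng.standard_normal((n_samples, n_features_1))
    noise_2 = rng.standard_normal((n_samples, n_features_2))
    X1 = np.sqrt(rho) * latent + np.sqrt(1 - rho) * noise_1
    X2 = np.sqrt(rho) * latent + np.sqrt(1 - rho) * noise_2

    # Y depends on one feature from each view: X1 -> Y <- X2.
    logits = X1[:, 0] + X2[:, 0]
    prob = 1.0 / (1.0 + np.exp(-logits))
    y = rng.binomial(1, prob)
    return X1, X2, y


X1, X2, y = simulate_two_views(n_samples=2000, rho=0.5)
cross_corr = np.corrcoef(X1[:, 0], X2[:, 0])[0, 1]
print(X1.shape, X2.shape, y.shape, round(cross_corr, 2))
```

With this construction every feature has unit variance and any feature of X1 has population correlation `rho` with any feature of X2, so `rho` gives a single knob for how strongly the views depend on each other.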

In [4]:
def make_multiview_classification(
    n_samples=100,
    n_features_1=100,
    n_features_2=10000,
    cluster_std_first=2.0,
    cluster_std_second=5.0,
    X0_first=None,
    y0=None,
    X1_first=None,
    y1=None,
    seed=None,
):
    rng = np.random.default_rng(seed=seed)

    if X0_first is None and y0 is None:
        # Create a high-dimensional multiview dataset with a low-dimensional informative
        # subspace in one view of the dataset.
        X0_first, y0 = make_blobs(
            n_samples=n_samples,
            cluster_std=cluster_std_first,
            n_features=n_features_1,
            random_state=rng.integers(1, 10000),
            centers=1,
        )

        X1_first, y1 = make_blobs(
            n_samples=n_samples,
            cluster_std=cluster_std_second,
            n_features=n_features_1,
            random_state=rng.integers(1, 10000),
            centers=1,
        )
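    # make_blobs with centers=1 labels every sample 0, so relabel the second blob as
    # class 1 (on a copy, to avoid mutating a caller-supplied array)
    y1 = np.ones_like(y1)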
    X0 = np.concatenate(
        [X0_first, rng.standard_normal(size=(n_samples, n_features_2))], axis=1
    )
    X1 = np.concatenate(
        [X1_first, rng.standard_normal(size=(n_samples, n_features_2))], axis=1
    )
    X = np.vstack((X0, X1))
    y = np.hstack((y0, y1)).T

    X = X + rng.standard_normal(size=X.shape)

    return X, y

In [None]:
# Simulate data
# -------------
# We simulate a 2-view dataset in which the first view contains the informative
# low-dimensional features. The first view has five dimensions, while the second view
# will vary from five to 20,000 dimensions. The sample size will be kept fixed, so we
# can compare the performance of regular random forests with multi-view random forests.

n_samples = 500
n_features_views = np.linspace(5, 20000, 5).astype(int)

datasets = []

# make the signal portions of the dataset
X0_first, y0 = make_blobs(
    n_samples=n_samples,
    cluster_std=5.0,
    n_features=5,
    random_state=rng.integers(1, 10000),
    centers=1,
)
X1_first, y1 = make_blobs(
    n_samples=n_samples,
    cluster_std=10.0,
    n_features=5,
    random_state=rng.integers(1, 10000),
    centers=1,
)

# increasingly add noise dimensions to the second view
for idx, n_features in enumerate(n_features_views):
    X, y = make_multiview_classification(
        n_samples=n_samples,
        n_features_1=5,
        n_features_2=n_features,
        cluster_std_first=5.0,
        cluster_std_second=10.0,
        # X0_first=X0_first, y0=y0,
        # X1_first=X1_first, y1=y1,
        seed=seed + idx,
    )
    datasets.append((X, y))
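Because the two views are concatenated column-wise, any view-aware estimator needs to know where one view ends and the next begins. The sketch below recomputes the layout of each simulated dataset from the settings above using only NumPy; the `layouts` list and its tuple format are hypothetical names introduced for illustration, and how the boundaries are passed to a particular multi-view estimator is left open.

```python
import numpy as np

n_samples = 500  # samples per class, as in the simulation above
n_features_1 = 5  # width of the first (signal) view
n_features_views = np.linspace(5, 20000, 5).astype(int)

layouts = []
for n_features_2 in n_features_views:
    n_features_2 = int(n_features_2)
    # two classes are stacked row-wise, so each dataset has 2 * n_samples rows
    n_rows = 2 * n_samples
    n_cols = n_features_1 + n_features_2
    # columns [0, view_1_end) belong to view 1, [view_1_end, view_2_end) to view 2
    view_1_end = n_features_1
    view_2_end = n_cols
    layouts.append((n_rows, n_cols, view_1_end, view_2_end))

for n_rows, n_cols, view_1_end, view_2_end in layouts:
    print(f"X: ({n_rows}, {n_cols})  view 1: [0, {view_1_end})  view 2: [{view_1_end}, {view_2_end})")
```

Given one of the simulated `X` arrays, `X[:, :view_1_end]` and `X[:, view_1_end:view_2_end]` would then recover the two views separately.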