# Benchmarking Analysis for Multiview Hypothesis Testers

Here, we analyze the performance of multi-view trees on data with two views,
where one view has a significantly greater number of dimensions than the other.

For example, in liquid biopsies from cancer patients, we might obtain features describing the
lengths of the DNA fragments found, or features describing the fragment ends (e.g. the ratio of base pairs
found at the ends). These feature sets might have wildly different dimensionalities, such as 1,000 vs. 10,000,
if we consider different levels of complexity for each of these biological characteristics.

So we want a hypothesis test that i) is aware that ``X`` is composed of two feature sets, and ii)
can handle wildly different dimensionalities in each feature set, with sufficient power
and a controlled type-I error rate to provide useful answers to cancer biomarker hypotheses.
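
To make the two-view setup concrete, here is a minimal sketch of how ``X`` might be assembled from two feature sets of very different sizes. The helper `stack_views` and the variable name `feature_set_ends` are hypothetical, chosen for illustration only. The views are filled with plain Gaussian noise and do not represent real biopsy features or the simulation used later in this notebook.

```python
import numpy as np


def stack_views(n_samples, n_dims_view1, n_dims_view2, seed=12345):
    """Hypothetical helper: concatenate two views column-wise.

    Returns the stacked matrix and the column index at which each view ends,
    so that downstream code can tell the two feature sets apart.
    """
    rng = np.random.default_rng(seed)
    # Each view is pure Gaussian noise here (an assumption for illustration).
    view1 = rng.standard_normal((n_samples, n_dims_view1))
    view2 = rng.standard_normal((n_samples, n_dims_view2))
    X = np.hstack([view1, view2])
    # Cumulative column counts mark the exclusive end index of each view.
    feature_set_ends = np.cumsum([n_dims_view1, n_dims_view2])
    return X, feature_set_ends


X, feature_set_ends = stack_views(n_samples=50, n_dims_view1=1000, n_dims_view2=10000)
print(X.shape, feature_set_ends.tolist())
```

Keeping the view boundaries alongside ``X`` is what lets a multi-view method treat the 1,000-dimensional and 10,000-dimensional feature sets separately rather than as one flat matrix.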

```python
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.special import expit

from sktree import HonestForestClassifier, RandomForestClassifier, RandomForestRegressor
from sktree.stats import (
    FeatureImportanceForestClassifier,
    FeatureImportanceForestRegressor,
    PermutationForestRegressor,
)

seed = 12345
```



# Simulate data with varying dimensionality

Here, we will implement a function to simulate data with a varying number of dimensions in the second view.