Fast HHG 2-Sample Test (#314)
* Integrated Fast HHG into original HHG code

* Update hhg.py

* Create Fast HHG Tester.ipynb

* Update hhg.py

* Update Fast HHG Tester.ipynb

* Update Fast HHG Tester.ipynb

* FastHHG Tester

* Update

* Update Fast HHG Tester.ipynb

* Added Power Sample Size Results

* Fast HHG updated with Hoeffding's and docstrings

* Update hhg.py

* Edited independence tutorial

* Delete Fast HHG Tester.ipynb

* Delete indep_power_sampsize (center).pdf

* Delete OGTest1Figure.png

* Delete OGTest1.png

* Delete Function Outline.docx

* release hyppo 0.2.2 (#236) (#237)

* fix incorrect p-value in unit tests

* fix MGC unit tests

* update version to 0.2.2

* Black sampling and edits to tutorial

* Black format check

* Imported IndependenceTestOutput

* Formatting Changes

* Black formatting

* Black Formatting

* Adjustments to Requested Changes

Removed cdist, replacing it with compute_dist; modified the description, changed 'fast' to 'auto', and simplified Hoeffding's implementation to the basic formula.

* Edited Hoeffding Function

* Added Unit Tests for Fast HHG

Unit tests cover replicability, Type I error, the distance statistic implementation, and a 1D linear test.

* Add citations to refs and to code

* Black Formatting Sweep

* Minor Edits

* Transfer Fast HHG description to notes section

* Removed Fast HHG description

Added 'fast implementation' as a pro instead.

* Updated Unit Tests

* Adjustments

Made 'auto' an option in the test function instead of __init__, with adjustments all around to match.

Adjusted the Hoeffding function to apply JIT to its inner function, achieving much higher speeds.

* Added Toy Implementation to tutorial

* Update independence.py

* Black Formatting Sweep

* Adjustments

* Formatting sweep

* Updated test to cover Hoeffding Statistic Function

* Disabled JIT during test to possibly improve code coverage

* Formatting

* Removed unneeded import.

* Fixed JIT coverage error + Black Format Sweep

* [skip-ci] link to p-value method

* Create HHG KSample Tester.ipynb

* Adding Fast HHG K-Sample Class

Currently center-of-mass AD version

* Completed addition of Multipoint Version

* Renaming module file to match independence format

* Begin code of unit tests for HHG Ksample

* Edit ksample init.py

* Update Unit Test

Added Type I error and replicability tests

* Corrected error in hhg ksample module

Allowed option of mode

* Remove Center of Mass Version

* Update documentation

* Black formatting

* Update Stat vs test function

* Start user guide

* Update documentation.

* Update unit tests

* Updated Ksample tutorial

* Black formatting

* Change unit test for better coverage

* Consolidated functions

* Correct error in tutorial

* Removed mistakenly added testing notebook

* Change class name to KSampleHHG

Distinguishes it from the independence HHG; all references updated accordingly.

* change file name to ksamplehhg

* Change test function to use statistic function

* Separate function to add numba to relevant section

* Added parameter headings

* Complete parameter headings

* Black Formatting

* Removed accidental loop

* Fix for coverage

* Requested changes made!

* Updated API index for class doc

Co-authored-by: Sambit Panda <36676569+sampan501@users.noreply.github.com>
Co-authored-by: Sambit Panda <sampanda501@gmail.com>
3 people committed May 16, 2022
1 parent 32a06d0 commit 1760db8
Showing 5 changed files with 244 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/api/index.rst
@@ -57,6 +57,7 @@
Hotelling
SmoothCFTest
MeanEmbeddingTest
KSampleHHG


.. automodule:: hyppo.time_series
2 changes: 2 additions & 0 deletions hyppo/ksample/__init__.py
@@ -5,6 +5,7 @@
from .ksamp import KSample
from .manova import MANOVA
from .mmd import MMD
from .ksamplehhg import KSampleHHG

from .smoothCF import SmoothCFTest, smooth_cf_distance
from .mean_embedding import MeanEmbeddingTest, mean_embed_distance
@@ -21,4 +22,5 @@
"mmd": MMD,
"smoothCF": SmoothCFTest,
"mean_embedding": MeanEmbeddingTest,
"ksamplehhg": KSampleHHG,
}
167 changes: 167 additions & 0 deletions hyppo/ksample/ksamplehhg.py
@@ -0,0 +1,167 @@
import numpy as np
from numba import jit
from hyppo.ksample.base import KSampleTest
from hyppo.ksample._utils import _CheckInputs
from sklearn.metrics import pairwise_distances
from scipy.stats import ks_2samp


class KSampleHHG(KSampleTest):
r"""
HHG 2-Sample test statistic.

This is a 2-sample multivariate test based on univariate test statistics.
It inherits the computational complexity of the univariate tests to achieve
faster speeds than classic multivariate tests.
The univariate test used is the Kolmogorov-Smirnov 2-sample test
:footcite:p:`hellerMultivariateTestsOfAssociation2016`.

Parameters
----------
compute_distance : str, callable, or None, default: "euclidean"
A function that computes the distance among the samples within each
data matrix.
Valid strings for ``compute_distance`` are, as defined in
:func:`sklearn.metrics.pairwise_distances`,
- From scikit-learn: [``"euclidean"``, ``"cityblock"``, ``"cosine"``,
``"l1"``, ``"l2"``, ``"manhattan"``] See the documentation for
:mod:`sklearn.metrics.pairwise` for details
on these metrics.
- From scipy.spatial.distance: [``"braycurtis"``, ``"canberra"``,
``"chebyshev"``, ``"correlation"``, ``"dice"``, ``"hamming"``,
``"jaccard"``, ``"kulsinski"``, ``"mahalanobis"``, ``"minkowski"``,
``"rogerstanimoto"``, ``"russellrao"``, ``"seuclidean"``,
``"sokalmichener"``, ``"sokalsneath"``, ``"sqeuclidean"``,
``"yule"``] See the documentation for :mod:`scipy.spatial.distance` for
details on these metrics.
Set to ``None`` or ``"precomputed"`` if ``x`` and ``y`` are already distance
matrices. To call a custom function, either create the distance matrix
before-hand or create a function of the form ``metric(x, **kwargs)``
where ``x`` is the data matrix for which pairwise distances are
calculated and ``**kwargs`` are extra arguments to send to your custom
function.
**kwargs
Arbitrary keyword arguments for ``compute_distance``.
Notes
-----
The statistic can be derived as follows
:footcite:p:`hellerMultivariateTestsOfAssociation2016`:

Let :math:`x` and :math:`y` be :math:`(n, p)` and :math:`(m, p)` samples of random
variables :math:`X` and :math:`Y \in \mathbb{R}^p`. Let there be a center point
:math:`z \in \mathbb{R}^p`.
For every sample point, calculate its distance from the center point, denoted
:math:`d_x(x_i)` for points in :math:`x` and :math:`d_y(y_i)` for points in
:math:`y`. This yields a 1D collection of distances for each sample group.
Then apply the KS 2-sample test to these center-point distances. This classic test
compares the empirical distribution functions of the two samples and takes
the supremum of the absolute difference between them. See the Notes of
:func:`scipy.stats.ks_2samp` for more details.
To achieve better power, the above process is repeated with each sample point
:math:`x_i` and :math:`y_i` as the center point. The resulting :math:`n + m`
p-values are then pooled for use in a Bonferroni test of the global null
hypothesis. The HHG statistic is the KS statistic associated with the smallest
p-value from the pool, while the HHG p-value is the smallest p-value multiplied
by the number of sample points.

References
----------
.. footbibliography::
"""

def __init__(self, compute_distance="euclidean", **kwargs):
self.compute_distance = compute_distance
KSampleTest.__init__(self, compute_distance=compute_distance, **kwargs)

def statistic(self, x, y):
"""
Calculates K-Sample HHG test statistic.
Parameters
----------
x,y : ndarray of float
Input data matrices. ``x`` and ``y`` must have the same number of
dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where
`n` and `m` are the number of samples and `p` is the number of
dimensions.
Returns
-------
stat : float
The computed KS test statistic associated with the lowest p-value.
"""
xy = np.concatenate((x, y), axis=0)
distxy = _centerpoint_dist(xy, self.compute_distance, 1)
distx = distxy[:, 0 : len(x)]
disty = distxy[:, len(x) : len(x) + len(y)]
stats, pvalues = _distance_score(distx, disty)
minP = min(pvalues)
stat = stats[pvalues.index(minP)]
self.minP = minP
self.stat = stat
return self.stat

def test(self, x, y):
"""
Calculates K-Sample HHG test statistic and p-value.
Parameters
----------
x,y : ndarray of float
Input data matrices. ``x`` and ``y`` must have the same number of
dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where
`n` and `m` are the number of samples and `p` is the number of
dimensions.
Returns
-------
stat : float
The computed KS test statistic associated with the lowest p-value.
pvalue : float
The computed HHG p-value, equivalent to the lowest KS p-value multiplied
by the total number of samples.
"""
check_input = _CheckInputs(inputs=[x, y],)
x, y = check_input()
N = x.shape[0] + y.shape[0]

stat = self.statistic(x, y)
pvalue = self.minP * N
return stat, pvalue


def _centerpoint_dist(xy, metric, workers=1, **kwargs):
"""Gives pairwise distances - each row corresponds to center-point distances
where one sample point is the center point"""
distxy = pairwise_distances(xy, metric=metric, n_jobs=workers, **kwargs)
return distxy


def _distance_score(distx, disty):
dist1, dist2 = _group_distances(distx, disty)
stats = []
pvalues = []
for i in range(len(distx)):
stat, pvalue = ks_2samp(dist1[i], dist2[i])
stats.append(stat)
pvalues.append(pvalue)
return stats, pvalues


@jit(nopython=True, cache=True)
def _group_distances(distx, disty): # pragma: no cover
# For each center point (row), collect the distances to the x samples and
# to the y samples, dropping the first entry of each row.
dist1 = []
dist2 = []
for i in range(len(distx)):
distancex = np.delete(distx[i], 0)
distancey = np.delete(disty[i], 0)
distancex = distancex.reshape(-1)
distancey = distancey.reshape(-1)
dist1.append(distancex)
dist2.append(distancey)
return dist1, dist2
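
For reference, here is a minimal standalone sketch of the center-point procedure described in the Notes of the docstring above, using only NumPy and SciPy. It is not part of the commit: the function and variable names are illustrative, and unlike the committed module it keeps every distance in each row and caps the Bonferroni-corrected p-value at 1.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp


def ksample_hhg_sketch(x, y):
    """KS-test the two groups of center-point distances for every choice of
    center point; return the KS statistic paired with the smallest p-value
    and the Bonferroni-corrected p-value."""
    xy = np.vstack((x, y))
    n, m = len(x), len(y)
    dists = cdist(xy, xy)  # (n + m, n + m) pairwise Euclidean distances
    best_stat, best_p = 0.0, 1.0
    for i in range(n + m):  # every sample point acts as the center point
        dx = dists[i, :n]   # distances from the center to the x sample
        dy = dists[i, n:]   # distances from the center to the y sample
        stat, pvalue = ks_2samp(dx, dy)
        if pvalue < best_p:
            best_stat, best_p = stat, pvalue
    return best_stat, min(best_p * (n + m), 1.0)  # Bonferroni over n + m centers


rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(50, 2))
y = rng.normal(1, 1, size=(50, 2))
print(ksample_hhg_sketch(x, y))  # expect a small p-value: the means differ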
46 changes: 46 additions & 0 deletions hyppo/ksample/tests/test_hhg.py
@@ -0,0 +1,46 @@
import numpy as np
import pytest
from numpy.testing import assert_almost_equal

from ...tools import rot_ksamp
from .. import KSampleHHG


class TestKSampleHHG:
@pytest.mark.parametrize(
"n, obs_stat, obs_pvalue", [(100, 0.515, 4.912e-10), (10, 0.777, 0.125874),],
)
def test_linear_oned(self, n, obs_stat, obs_pvalue):
np.random.seed(123456789)
x, y = rot_ksamp("linear", n, 1, k=2, noise=False)
stat, pvalue = KSampleHHG().test(x, y)

assert_almost_equal(stat, obs_stat, decimal=3)
assert_almost_equal(pvalue, obs_pvalue, decimal=3)

@pytest.mark.parametrize(
"n", [(100)],
)
def test_rep(self, n):
np.random.seed(123456789)
x, y = rot_ksamp("linear", n, 1, k=2, noise=False)
MPstat1 = KSampleHHG().statistic(x, y)
MPstat2 = KSampleHHG().statistic(x, y)

assert MPstat1 == MPstat2


class TestKSampleHHGTypeIError:
def test_oned(self):
np.random.seed(123456789)
rejections = 0
for i in range(1000):
x, y = rot_ksamp(
"multimodal_independence", n=100, p=1, noise=True, degree=90
)
stat, pvalue = KSampleHHG().test(x, y)
if pvalue < 0.05:
rejections += 1
est_power = rejections / 1000

assert_almost_equal(est_power, 0, decimal=2)
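
Mirroring the Type I error loop above, here is a hedged sketch of how one might estimate power under the alternative, reusing the "linear" simulation from the other tests. The loop below is illustrative and not part of the committed test suite.

import numpy as np
from hyppo.tools import rot_ksamp
from hyppo.ksample import KSampleHHG

np.random.seed(123456789)
reps = 100
rejections = 0
for _ in range(reps):
    # dependent samples: a rotated linear relationship with noise
    x, y = rot_ksamp("linear", n=100, p=1, k=2, noise=True)
    _, pvalue = KSampleHHG().test(x, y)
    if pvalue < 0.05:
        rejections += 1
print("estimated power:", rejections / reps)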
28 changes: 28 additions & 0 deletions tutorials/ksample.py
@@ -239,6 +239,34 @@
print("5 degrees of freedom (stat, pval):\n", stat1, pval1)
print("10 degrees of freedom (stat, pval):\n", stat2, pval2)

########################################################################################
# Univariate-Based Test
# --------------------------------------------
# The **Heller Heller Gorfine (HHG) 2-Sample Test** is a non-parametric two-sample
# statistical test. It compares the distributions of the distances of the sample
# vectors from a center point using a univariate K-sample test. If the distribution
# of the samples differs across categories, then so does the distribution of distances
# of the vectors from almost every point z. The univariate test used is the
# Kolmogorov-Smirnov 2-sample test, which looks at the largest absolute deviation
# between the cumulative distribution functions of the samples.
# More info can be found at :class:`hyppo.ksample.KSampleHHG`.
#
# .. note::
#
# :Pros: - Very fast computation time
# :Cons: - Lower power than more computationally complex algorithms
# - Inherits the assumptions of the KS univariate test

from hyppo.ksample import KSampleHHG

np.random.seed(1234)

x, y = rot_ksamp("linear", n=100, p=1, k=2, noise=False)

stat, pvalue = KSampleHHG().test(x, y)
print(stat, pvalue)
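
########################################################################################
# As a contrast, a quick sketch (not part of the committed tutorial): the
# "multimodal_independence" simulation used in the unit tests draws two samples whose
# distributions do not differ, so the test should return a large p-value here.

x0, y0 = rot_ksamp("multimodal_independence", n=100, p=1, noise=True, degree=90)
stat0, pvalue0 = KSampleHHG().test(x0, y0)
print(stat0, pvalue0)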


########################################################################################
# .. _[1]: https://link.springer.com/article/10.1007/s10182-020-00378-1
# .. _[2]: https://arxiv.org/abs/1910.08883
