Fast HHG 2-Sample Test (#314)
* Integrated Fast HHG into original HHG code

* Update hhg.py

* Create Fast HHG Tester.ipynb

* Update hhg.py

* Update Fast HHG Tester.ipynb

* Update Fast HHG Tester.ipynb

* FastHHG Tester

* Update

* Update Fast HHG Tester.ipynb

* Added Power Sample Size Results

* Fast HHG updated with Hoeffding's and docstrings

* Update hhg.py

* Edited independence tutorial

* Delete Fast HHG Tester.ipynb

* Delete indep_power_sampsize (center).pdf

* Delete OGTest1Figure.png

* Delete OGTest1.png

* Delete Function Outline.docx

* release hyppo 0.2.2 (#236) (#237)

* fix incorrect p-value in unit tests

* fix MGC unit tests

* update version to 0.2.2

* Black sampling and edits to tutorial

* Black format check

* Imported IndependenceTestOutput

* Formatting Changes

* Black formatting

* Black Formatting

* Adjustments to Requested Changes

Removed cdist, replacing it with compute_dist; modified the description, changed 'fast' to 'auto', and simplified Hoeffding's implementation to the basic formula.

* Edited Hoeffding Function

* Added Unit Tests for Fast HHG

Unit tests cover replicability, Type I error, the distance statistic implementation, and a 1D linear test.

* Add citations to refs and to code

* Black Formatting Sweep

* Minor Edits

* Transfer Fast HHG description to notes section

* Removed Fast HHG description

Added 'fast implementation' as a pro instead.

* Updated Unit Tests

* Adjustments

Made 'auto' an option in the test function instead of __init__, with adjustments all around to match.

Adjusted the Hoeffding function to apply JIT to its inner function, achieving much higher speeds.

* Added Toy Implementation to tutorial

* Update independence.py

* Black Formatting Sweep

* Adjustments

* Formatting sweep

* Updated test to cover Hoeffding Statistic Function

* Disabled JIT during test to possibly improve code coverage

* Formatting

* Removed unneeded import.

* Fixed JIT coverage error + Black Format Sweep

* [skip-ci] link to p-value method

* Create HHG KSample Tester.ipynb

* Adding Fast HHG K-Sample Class

Currently center-of-mass AD version

* Completed addition of Multipoint Version

* Renaming module file to match independence format

* Begin code of unit tests for HHG Ksample

* Edit ksample init.py

* Update Unit Test

Added Type I error and replicability tests

* Corrected error in hhg ksample module

Allowed option of mode

* Remove Center of Mass Version

* Update documentation

* Black formatting

* Update Stat vs test function

* Start user guide

* Update documentation.

* Update unit tests

* Updated Ksample tutorial

* Black formatting

* Change unit test for better coverage

* Consolidated functions

* Correct error in tutorial

* Removed mistakenly added testing notebook

* Change class name to KSampleHHG

Distinguishes it from the independence HHG; all references updated accordingly.

* change file name to ksamplehhg

* Change test function to use statistic function

* Separate function to add numba to relevant section

* Added parameter headings

* Complete parameter headings

* Black Formatting

* Removed accidental loop

* Fix for coverage

* Requested changes made!

* Updated API index for class doc

Co-authored-by: Sambit Panda <36676569+sampan501@users.noreply.github.com>
Co-authored-by: Sambit Panda <sampanda501@gmail.com>
3 people committed May 16, 2022
1 parent 32a06d0 commit 1760db8
Showing 5 changed files with 244 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/api/index.rst
@@ -57,6 +57,7 @@
Hotelling
SmoothCFTest
MeanEmbeddingTest
KSampleHHG


.. automodule:: hyppo.time_series
2 changes: 2 additions & 0 deletions hyppo/ksample/__init__.py
@@ -5,6 +5,7 @@
from .ksamp import KSample
from .manova import MANOVA
from .mmd import MMD
from .ksamplehhg import KSampleHHG

from .smoothCF import SmoothCFTest, smooth_cf_distance
from .mean_embedding import MeanEmbeddingTest, mean_embed_distance
@@ -21,4 +22,5 @@
"mmd": MMD,
"smoothCF": SmoothCFTest,
"mean_embedding": MeanEmbeddingTest,
"ksamplehhg": KSampleHHG,
}
167 changes: 167 additions & 0 deletions hyppo/ksample/ksamplehhg.py
@@ -0,0 +1,167 @@
import numpy as np
from numba import jit
from hyppo.ksample.base import KSampleTest
from hyppo.ksample._utils import _CheckInputs
from sklearn.metrics import pairwise_distances
from scipy.stats import ks_2samp


class KSampleHHG(KSampleTest):
r"""
HHG 2-Sample test statistic.

This is a 2-sample multivariate test based on univariate test statistics.
It inherits the computational complexity of the univariate tests to achieve
faster speeds than classic multivariate tests.
The univariate test used is the Kolmogorov-Smirnov 2-sample test
:footcite:p:`hellerMultivariateTestsOfAssociation2016`.

Parameters
----------
compute_distance : str, callable, or None, default: "euclidean"
A function that computes the distance among the samples within each
data matrix.
Valid strings for ``compute_distance`` are, as defined in
:func:`sklearn.metrics.pairwise_distances`,
- From scikit-learn: [``"euclidean"``, ``"cityblock"``, ``"cosine"``,
``"l1"``, ``"l2"``, ``"manhattan"``] See the documentation for
:mod:`sklearn.metrics.pairwise` for details
on these metrics.
- From scipy.spatial.distance: [``"braycurtis"``, ``"canberra"``,
``"chebyshev"``, ``"correlation"``, ``"dice"``, ``"hamming"``,
``"jaccard"``, ``"kulsinski"``, ``"mahalanobis"``, ``"minkowski"``,
``"rogerstanimoto"``, ``"russellrao"``, ``"seuclidean"``,
``"sokalmichener"``, ``"sokalsneath"``, ``"sqeuclidean"``,
``"yule"``] See the documentation for :mod:`scipy.spatial.distance` for
details on these metrics.
Set to ``None`` or ``"precomputed"`` if ``x`` and ``y`` are already distance
matrices. To call a custom function, either create the distance matrix
before-hand or create a function of the form ``metric(x, **kwargs)``
where ``x`` is the data matrix for which pairwise distances are
calculated and ``**kwargs`` are extra arguments to send to your custom
function.
**kwargs
Arbitrary keyword arguments for ``compute_distance``.
Notes
-----
The statistic can be derived as follows
:footcite:p:`hellerMultivariateTestsOfAssociation2016`:

Let :math:`x` and :math:`y` be :math:`(n, p)` and :math:`(m, p)` samples of random
variables :math:`X` and :math:`Y \in \mathbb{R}^p`. Let there be a center point
:math:`z \in \mathbb{R}^p`.
For every sample point, calculate its distance from the center point, denoted
:math:`d_x(x_i)` for points in :math:`x` and :math:`d_y(y_i)` for points in
:math:`y`. This yields a 1D collection of distances for each sample group.
Then apply the KS 2-sample test to these center-point distances. This classic test
compares the empirical distribution functions of the two samples and takes
the supremum of the absolute difference between them. See the Notes of
:func:`scipy.stats.ks_2samp` for more details.
To achieve better power, the above process is repeated with each sample point
:math:`x_i` and :math:`y_i` as the center point. The resulting :math:`n + m`
p-values are then pooled for use in a Bonferroni test of the global null
hypothesis. The HHG statistic is the KS statistic associated with the smallest
p-value from the pool, while the HHG p-value is the smallest p-value multiplied
by the number of sample points.

References
----------
.. footbibliography::
"""

def __init__(self, compute_distance="euclidean", **kwargs):
self.compute_distance = compute_distance
KSampleTest.__init__(self, compute_distance=compute_distance, **kwargs)

def statistic(self, x, y):
"""
Calculates K-Sample HHG test statistic.
Parameters
----------
x,y : ndarray of float
Input data matrices. ``x`` and ``y`` must have the same number of
dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where
`n` and `m` are the number of samples and `p` is the number of
dimensions.
Returns
-------
stat : float
The computed KS test statistic associated with the lowest p-value.
"""
xy = np.concatenate((x, y), axis=0)
distxy = _centerpoint_dist(xy, self.compute_distance, 1)
distx = distxy[:, 0 : len(x)]
disty = distxy[:, len(x) : len(x) + len(y)]
stats, pvalues = _distance_score(distx, disty)
minP = min(pvalues)
stat = stats[pvalues.index(minP)]
self.minP = minP
self.stat = stat
return self.stat

def test(self, x, y):
"""
Calculates K-Sample HHG test statistic and p-value.
Parameters
----------
x,y : ndarray of float
Input data matrices. ``x`` and ``y`` must have the same number of
dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where
`n` and `m` are the number of samples and `p` is the number of
dimensions.
Returns
-------
stat : float
The computed KS test statistic associated with the lowest p-value.
pvalue : float
The computed HHG p-value, equivalent to the lowest KS p-value multiplied
by the total number of samples.
"""
check_input = _CheckInputs(inputs=[x, y],)
x, y = check_input()
N = x.shape[0] + y.shape[0]

stat = self.statistic(x, y)
pvalue = self.minP * N
return stat, pvalue


def _centerpoint_dist(xy, metric, workers=1, **kwargs):
"""Gives pairwise distances - each row corresponds to center-point distances
where one sample point is the center point"""
distxy = pairwise_distances(xy, metric=metric, n_jobs=workers, **kwargs)
return distxy


def _distance_score(distx, disty):
dist1, dist2 = _group_distances(distx, disty)
stats = []
pvalues = []
for i in range(len(distx)):
stat, pvalue = ks_2samp(dist1[i], dist2[i])
stats.append(stat)
pvalues.append(pvalue)
return stats, pvalues


@jit(nopython=True, cache=True)
def _group_distances(distx, disty): # pragma: no cover
# For each center point (row), collect the distances to the x samples and
# to the y samples, dropping the first entry of each row.
dist1 = []
dist2 = []
for i in range(len(distx)):
distancex = np.delete(distx[i], 0)
distancey = np.delete(disty[i], 0)
distancex = distancex.reshape(-1)
distancey = distancey.reshape(-1)
dist1.append(distancex)
dist2.append(distancey)
return dist1, dist2
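
For reference, here is a minimal standalone sketch of the center-point procedure described in the Notes of the docstring above, using only NumPy and SciPy. It is not part of the commit: the function and variable names are illustrative, and unlike the committed module it keeps every distance in each row and caps the Bonferroni-corrected p-value at 1.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp


def ksample_hhg_sketch(x, y):
    """KS-test the two groups of center-point distances for every choice of
    center point; return the KS statistic paired with the smallest p-value
    and the Bonferroni-corrected p-value."""
    xy = np.vstack((x, y))
    n, m = len(x), len(y)
    dists = cdist(xy, xy)  # (n + m, n + m) pairwise Euclidean distances
    best_stat, best_p = 0.0, 1.0
    for i in range(n + m):  # every sample point acts as the center point
        dx = dists[i, :n]   # distances from the center to the x sample
        dy = dists[i, n:]   # distances from the center to the y sample
        stat, pvalue = ks_2samp(dx, dy)
        if pvalue < best_p:
            best_stat, best_p = stat, pvalue
    return best_stat, min(best_p * (n + m), 1.0)  # Bonferroni over n + m centers


rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(50, 2))
y = rng.normal(1, 1, size=(50, 2))
print(ksample_hhg_sketch(x, y))  # expect a small p-value: the means differ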
46 changes: 46 additions & 0 deletions hyppo/ksample/tests/test_hhg.py
@@ -0,0 +1,46 @@
import numpy as np
import pytest
from numpy.testing import assert_almost_equal

from ...tools import rot_ksamp
from .. import KSampleHHG


class TestKSampleHHG:
@pytest.mark.parametrize(
"n, obs_stat, obs_pvalue", [(100, 0.515, 4.912e-10), (10, 0.777, 0.125874),],
)
def test_linear_oned(self, n, obs_stat, obs_pvalue):
np.random.seed(123456789)
x, y = rot_ksamp("linear", n, 1, k=2, noise=False)
stat, pvalue = KSampleHHG().test(x, y)

assert_almost_equal(stat, obs_stat, decimal=3)
assert_almost_equal(pvalue, obs_pvalue, decimal=3)

@pytest.mark.parametrize(
"n", [(100)],
)
def test_rep(self, n):
np.random.seed(123456789)
x, y = rot_ksamp("linear", n, 1, k=2, noise=False)
MPstat1 = KSampleHHG().statistic(x, y)
MPstat2 = KSampleHHG().statistic(x, y)

assert MPstat1 == MPstat2


class TestKSampleHHGTypeIError:
def test_oned(self):
np.random.seed(123456789)
rejections = 0
for i in range(1000):
x, y = rot_ksamp(
"multimodal_independence", n=100, p=1, noise=True, degree=90
)
stat, pvalue = KSampleHHG().test(x, y)
if pvalue < 0.05:
rejections += 1
est_power = rejections / 1000

assert_almost_equal(est_power, 0, decimal=2)
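
Mirroring the Type I error loop above, here is a hedged sketch of how one might estimate power under the alternative, reusing the "linear" simulation from the other tests. The loop below is illustrative and not part of the committed test suite.

import numpy as np
from hyppo.tools import rot_ksamp
from hyppo.ksample import KSampleHHG

np.random.seed(123456789)
reps = 100
rejections = 0
for _ in range(reps):
    # dependent samples: a rotated linear relationship with noise
    x, y = rot_ksamp("linear", n=100, p=1, k=2, noise=True)
    _, pvalue = KSampleHHG().test(x, y)
    if pvalue < 0.05:
        rejections += 1
print("estimated power:", rejections / reps)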
28 changes: 28 additions & 0 deletions tutorials/ksample.py
@@ -239,6 +239,34 @@
print("5 degrees of freedom (stat, pval):\n", stat1, pval1)
print("10 degrees of freedom (stat, pval):\n", stat2, pval2)

########################################################################################
# Univariate-Based Test
# --------------------------------------------
# The **Heller Heller Gorfine (HHG) 2-Sample Test** is a non-parametric two-sample
# statistical test. It compares the distributions of the distances of the sample
# vectors from a center point using a univariate K-sample test. If the distribution
# of the samples differs across categories, then so does the distribution of distances
# of the vectors from almost every point z. The univariate test used is the
# Kolmogorov-Smirnov 2-sample test, which looks at the largest absolute deviation
# between the cumulative distribution functions of the samples.
# More info can be found at :class:`hyppo.ksample.KSampleHHG`.
#
# .. note::
#
# :Pros: - Very fast computation time
# :Cons: - Lower power than more computationally complex algorithms
# - Inherits the assumptions of the KS univariate test

from hyppo.ksample import KSampleHHG

np.random.seed(1234)

x, y = rot_ksamp("linear", n=100, p=1, k=2, noise=False)

stat, pvalue = KSampleHHG().test(x, y)
print(stat, pvalue)
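
########################################################################################
# As a contrast, a quick sketch (not part of the committed tutorial): the
# "multimodal_independence" simulation used in the unit tests draws two samples whose
# distributions do not differ, so the test should return a large p-value here.

x0, y0 = rot_ksamp("multimodal_independence", n=100, p=1, noise=True, degree=90)
stat0, pvalue0 = KSampleHHG().test(x0, y0)
print(stat0, pvalue0)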


########################################################################################
# .. _[1]: https://link.springer.com/article/10.1007/s10182-020-00378-1
# .. _[2]: https://arxiv.org/abs/1910.08883
