Fast HHG 2-Sample Test #314

Merged
merged 91 commits into from May 16, 2022
Commits
f370c83
Integrated Fast HHG into original HHG code
TacticalFallacy Oct 28, 2021
5fa9bbc
Update hhg.py
TacticalFallacy Oct 28, 2021
49128eb
Create Fast HHG Tester.ipynb
TacticalFallacy Oct 28, 2021
39c2923
Update hhg.py
TacticalFallacy Oct 28, 2021
3ed46bf
Update Fast HHG Tester.ipynb
TacticalFallacy Oct 28, 2021
66b8d61
Update Fast HHG Tester.ipynb
TacticalFallacy Oct 28, 2021
af331c6
FastHHG Tester
TacticalFallacy Nov 4, 2021
924a7aa
Update
TacticalFallacy Nov 4, 2021
9feb4f8
Update Fast HHG Tester.ipynb
TacticalFallacy Nov 4, 2021
39e0c43
Added Power Sample Size Results
TacticalFallacy Nov 11, 2021
3a5d2f3
Fast HHG updated with Hoeffdings and docstrings
TacticalFallacy Dec 2, 2021
6eda881
Update hhg.py
TacticalFallacy Dec 2, 2021
7842c4e
Edited independence tutorial
TacticalFallacy Dec 7, 2021
93e28d5
Delete Fast HHG Tester.ipynb
TacticalFallacy Dec 7, 2021
a079dfb
Delete indep_power_sampsize (center).pdf
TacticalFallacy Dec 7, 2021
d42046a
Delete OGTest1Figure.png
TacticalFallacy Dec 7, 2021
0c1364e
Delete OGTest1.png
TacticalFallacy Dec 7, 2021
551b6d7
Delete Function Outline.docx
TacticalFallacy Dec 7, 2021
8cf9d7b
release hyppo 0.2.2 (#236) (#237)
sampan501 Dec 7, 2021
284ab2e
Black sampling and edits to tutorial
TacticalFallacy Dec 7, 2021
c758cc4
Merge branch 'fasthgg' of https://github.com/TacticalFallacy/hyppo in…
TacticalFallacy Dec 7, 2021
35af78a
Merge branch 'staging' into fasthgg
TacticalFallacy Dec 7, 2021
c8e01e5
Black format check
TacticalFallacy Dec 7, 2021
4cc0277
Imported IndependenceTestOutput
TacticalFallacy Dec 7, 2021
6c0d3f3
Formatting Changes
TacticalFallacy Dec 11, 2021
1c09a18
Black formating
TacticalFallacy Dec 11, 2021
9a92842
Black Formatting
TacticalFallacy Dec 11, 2021
379d2db
Merge branch 'dev' into fasthgg
sampan501 Dec 13, 2021
0c40d5f
Adjustments to Requested Changes
TacticalFallacy Dec 13, 2021
5815d55
Edited Hoeffding Function
TacticalFallacy Dec 16, 2021
9b1a519
Added Unit Tests for Fast HHG
TacticalFallacy Dec 16, 2021
af25066
Add citations to refs and to code
TacticalFallacy Dec 16, 2021
d2e3447
Black Formatting Sweep
TacticalFallacy Dec 16, 2021
3bddbc2
Minor Edits
TacticalFallacy Dec 16, 2021
f7fe90b
Transfer Fast HHG description to notes section
TacticalFallacy Dec 16, 2021
12da32d
Removed Fast HHG description
TacticalFallacy Dec 16, 2021
a7a430c
Updated Unit Tests
TacticalFallacy Dec 16, 2021
0e9995e
Merge branch 'dev' into fasthgg
sampan501 Dec 17, 2021
be3ec46
Adjustments
TacticalFallacy Dec 20, 2021
8293f9a
Merge branch 'fasthgg' of https://github.com/TacticalFallacy/hyppo in…
TacticalFallacy Dec 20, 2021
e5e5cb7
Added Toy Implementation to tutorial
TacticalFallacy Dec 20, 2021
defbec8
Update independence.py
TacticalFallacy Dec 20, 2021
49389e4
Black Formatting Sweep
TacticalFallacy Dec 20, 2021
3e96d6b
Adjustments
TacticalFallacy Dec 20, 2021
54ffbf9
Formatting sweep
TacticalFallacy Dec 20, 2021
375e53d
Updated test to cover Hoeffding Statistic Function
TacticalFallacy Dec 20, 2021
d24d8ba
Disabled JIT during test to possibly improve code coverage
TacticalFallacy Dec 20, 2021
bdcd5ac
Formatting
TacticalFallacy Dec 20, 2021
3b01d23
Merge branch 'dev' into fasthgg
TacticalFallacy Dec 20, 2021
86a3c6b
Removed unneeded import.
TacticalFallacy Dec 20, 2021
f720a15
Merge branch 'fasthgg' of https://github.com/TacticalFallacy/hyppo in…
TacticalFallacy Dec 20, 2021
18bda28
Fixed JIT coverage error + Black Format Sweep
TacticalFallacy Dec 20, 2021
94ca825
[skip-ci] link to p-value method
sampan501 Dec 20, 2021
9a48b8a
Merge branch 'fasthgg'
TacticalFallacy Feb 17, 2022
42606d2
Create HHG KSample Tester.ipynb
TacticalFallacy Feb 17, 2022
9c33b99
Adding Fast HHG K-Sample Class
TacticalFallacy Apr 7, 2022
8bcb5c1
Completed addition of Multipoint Version
TacticalFallacy Apr 21, 2022
8d4a148
Renaming module file to match independence format
TacticalFallacy Apr 21, 2022
60af56a
Begin code of unit tests for HHG Ksample
TacticalFallacy Apr 21, 2022
d691f65
Edit ksample init.py
TacticalFallacy Apr 21, 2022
8acf999
Update Unit Test
TacticalFallacy Apr 21, 2022
bcff2b4
Corrected error in hhg ksample module
TacticalFallacy Apr 21, 2022
9fd4cb5
Remove Center of Mass Version
TacticalFallacy Apr 28, 2022
3de1ccc
Merge branch 'main' into hhgksample
TacticalFallacy Apr 28, 2022
8f7c17d
Update documentation
TacticalFallacy Apr 28, 2022
c03a38d
Black formatting
TacticalFallacy Apr 28, 2022
efbd6c2
Merge branch 'neurodata:dev' into hhgksample
TacticalFallacy Apr 28, 2022
e5848d4
Update Stat vs test function
TacticalFallacy Apr 28, 2022
59c5aed
Start user guide
TacticalFallacy Apr 28, 2022
9781c18
Merge branch 'dev' into pr/314
sampan501 May 5, 2022
674209f
Update documentation.
TacticalFallacy May 6, 2022
8d0b30b
Merge branch 'hhgksample' of https://github.com/TacticalFallacy/hyppo…
TacticalFallacy May 6, 2022
4294a99
Update unit tests
TacticalFallacy May 12, 2022
86781ba
Updated Ksample tutorial
TacticalFallacy May 12, 2022
a1681df
Black formating
TacticalFallacy May 12, 2022
deb2e6e
Change unit test for better coverage
TacticalFallacy May 12, 2022
6685874
Consolidated functions
TacticalFallacy May 12, 2022
e0f2b8e
Correct error in tutorial
TacticalFallacy May 12, 2022
aa8251a
Removed mistakenly added testing notebook
TacticalFallacy May 13, 2022
03fc95b
Change class name to KSampleHHG,
TacticalFallacy May 13, 2022
066a390
Merge branch 'hhgksample' of https://github.com/TacticalFallacy/hyppo…
TacticalFallacy May 13, 2022
2c537ae
change file name to ksamplehhg
TacticalFallacy May 13, 2022
a3f3a30
Change test function to use statistic function
TacticalFallacy May 13, 2022
309dc50
Separate function to add numba to relevant section
TacticalFallacy May 13, 2022
4cc471a
Added parameter headings
TacticalFallacy May 13, 2022
6e3fa76
Complete parameter headings
TacticalFallacy May 13, 2022
dd8efb9
Black Formatting
TacticalFallacy May 13, 2022
ae56617
Removed accidental loop
TacticalFallacy May 13, 2022
941b1cf
Fix for coverage
TacticalFallacy May 13, 2022
398e9a6
Requested changes made!
TacticalFallacy May 16, 2022
5bf4689
Updated API index for class doc
TacticalFallacy May 16, 2022
1 change: 1 addition & 0 deletions docs/api/index.rst
@@ -57,6 +57,7 @@ Independence
Hotelling
SmoothCFTest
MeanEmbeddingTest
KSampleHHG


.. automodule:: hyppo.time_series
2 changes: 2 additions & 0 deletions hyppo/ksample/__init__.py
@@ -5,6 +5,7 @@
from .ksamp import KSample
from .manova import MANOVA
from .mmd import MMD
from .ksamplehhg import KSampleHHG

from .smoothCF import SmoothCFTest, smooth_cf_distance
from .mean_embedding import MeanEmbeddingTest, mean_embed_distance
@@ -21,4 +22,5 @@
"mmd": MMD,
"smoothCF": SmoothCFTest,
"mean_embedding": MeanEmbeddingTest,
"ksamplehhg": KSampleHHG,
}
167 changes: 167 additions & 0 deletions hyppo/ksample/ksamplehhg.py
@@ -0,0 +1,167 @@
import numpy as np
from numba import jit
from hyppo.ksample.base import KSampleTest
from hyppo.ksample._utils import _CheckInputs
from sklearn.metrics import pairwise_distances
from scipy.stats import ks_2samp


class KSampleHHG(KSampleTest):
    r"""
    HHG 2-Sample test statistic.

    This is a 2-sample multivariate test based on univariate test statistics.
    It inherits the computational complexity from the univariate tests to achieve
    faster speeds than classic multivariate tests.
    The univariate test used is the Kolmogorov-Smirnov 2-sample test
    :footcite:p:`hellerMultivariateTestsOfAssociation2016`.

    Parameters
    ----------
    compute_distance : str, callable, or None, default: "euclidean"
        A function that computes the distance among the samples within each
        data matrix.
        Valid strings for ``compute_distance`` are, as defined in
        :func:`sklearn.metrics.pairwise_distances`,

            - From scikit-learn: [``"euclidean"``, ``"cityblock"``, ``"cosine"``,
              ``"l1"``, ``"l2"``, ``"manhattan"``] See the documentation for
              :mod:`scipy.spatial.distance` for details
              on these metrics.
            - From scipy.spatial.distance: [``"braycurtis"``, ``"canberra"``,
              ``"chebyshev"``, ``"correlation"``, ``"dice"``, ``"hamming"``,
              ``"jaccard"``, ``"kulsinski"``, ``"mahalanobis"``, ``"minkowski"``,
              ``"rogerstanimoto"``, ``"russellrao"``, ``"seuclidean"``,
              ``"sokalmichener"``, ``"sokalsneath"``, ``"sqeuclidean"``,
              ``"yule"``] See the documentation for :mod:`scipy.spatial.distance` for
              details on these metrics.

        Set to ``None`` or ``"precomputed"`` if ``x`` and ``y`` are already distance
        matrices. To call a custom function, either create the distance matrix
        before-hand or create a function of the form ``metric(x, **kwargs)``
        where ``x`` is the data matrix for which pairwise distances are
        calculated and ``**kwargs`` are extra arguments to send to your custom
        function.
    **kwargs
        Arbitrary keyword arguments for ``compute_distance``.

    Notes
    -----
    The statistic can be derived as follows
    :footcite:p:`hellerMultivariateTestsOfAssociation2016`:

    Let :math:`x`, :math:`y` be :math:`(n, p)`, :math:`(m, p)` samples of random
    variables :math:`X` and :math:`Y \in \R^p`. Let there be a center point
    :math:`z \in \R^p`.
    For every sample :math:`i`, calculate the distances from the center point
    in :math:`x` and :math:`y` and denote these as :math:`d_x(x_i)`
    and :math:`d_y(y_i)`. This will create a 1D collection of distances for each
    sample group.

    Then apply the KS 2-sample test on these center-point distances. This classic
    test compares the empirical distribution functions of the two samples and takes
    the supremum of the difference between them. See Notes under
    :func:`scipy.stats.ks_2samp` for more details.

    To achieve better power, the above process is repeated with each sample point
    :math:`x_i` and :math:`y_i` as the center point. The resultant :math:`n+m`
    p-values are then pooled for use in the Bonferroni test of the global null
    hypothesis. The HHG statistic is the KS statistic associated with the smallest
    p-value from the pool, while the HHG p-value is the smallest p-value multiplied
    by the number of sample points.

    References
    ----------
    .. footbibliography::
    """

    def __init__(self, compute_distance="euclidean", **kwargs):
        self.compute_distance = compute_distance
        KSampleTest.__init__(self, compute_distance=compute_distance, **kwargs)

    def statistic(self, x, y):
        """
        Calculates K-Sample HHG test statistic.

        Parameters
        ----------
        x,y : ndarray of float
            Input data matrices. ``x`` and ``y`` must have the same number of
            dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where
            `n` and `m` are the number of samples and `p` is the number of
            dimensions.

        Returns
        -------
        stat : float
            The computed KS test statistic associated with the lowest p-value.
        """
        xy = np.concatenate((x, y), axis=0)
        # Each row of distxy uses one pooled sample point as the center point.
        distxy = _centerpoint_dist(xy, self.compute_distance, 1)
        distx = distxy[:, 0 : len(x)]
        disty = distxy[:, len(x) : len(x) + len(y)]
        stats, pvalues = _distance_score(distx, disty)
        # Keep the KS statistic that belongs to the smallest p-value.
        minP = min(pvalues)
        stat = stats[pvalues.index(minP)]
        self.minP = minP
        self.stat = stat
        return self.stat

    def test(self, x, y):
        """
        Calculates K-Sample HHG test statistic and p-value.

        Parameters
        ----------
        x,y : ndarray of float
            Input data matrices. ``x`` and ``y`` must have the same number of
            dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where
            `n` and `m` are the number of samples and `p` is the number of
            dimensions.

        Returns
        -------
        stat : float
            The computed KS test statistic associated with the lowest p-value.
        pvalue : float
            The computed HHG p-value. Equivalent to the lowest p-value multiplied
            by the total number of samples.
        """
        check_input = _CheckInputs(inputs=[x, y],)
        x, y = check_input()
        N = x.shape[0] + y.shape[0]

        stat = self.statistic(x, y)
        # Bonferroni correction over the N center points.
        pvalue = self.minP * N
        return stat, pvalue


def _centerpoint_dist(xy, metric, workers=1, **kwargs):
    """Gives pairwise distances - each row corresponds to center-point distances
    where one sample point is the center point"""
    distxy = pairwise_distances(xy, metric=metric, n_jobs=workers, **kwargs)
    return distxy


def _distance_score(distx, disty):
    dist1, dist2 = _group_distances(distx, disty)
    stats = []
    pvalues = []
    # One KS 2-sample test per center point (one per row).
    for i in range(len(distx)):
        stat, pvalue = ks_2samp(dist1[i], dist2[i])
        stats.append(stat)
        pvalues.append(pvalue)
    return stats, pvalues


@jit(nopython=True, cache=True)
def _group_distances(distx, disty):  # pragma: no cover
    dist1 = []
    dist2 = []
    for i in range(len(distx)):
        # Drop the first entry of each row, then flatten to 1D.
        distancex = np.delete(distx[i], 0)
        distancey = np.delete(disty[i], 0)
        distancex = distancex.reshape(-1)
        distancey = distancey.reshape(-1)
        dist1.append(distancex)
        dist2.append(distancey)
    return dist1, dist2
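Reviewer note: the procedure described in the Notes section of the docstring can be sketched independently of hyppo as below. This is only an illustration under stated assumptions, not the code merged in this PR: `hhg_two_sample_sketch` is a hypothetical name, it uses `scipy.spatial.distance.cdist` with the default Euclidean metric instead of `pairwise_distances`, it keeps every distance in each row rather than dropping an entry, and it caps the Bonferroni-adjusted p-value at 1, which the PR code does not do.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp


def hhg_two_sample_sketch(x, y):
    """Hypothetical helper (not part of hyppo): min-p KS test over center points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate((x, y), axis=0)

    # Row i holds the distances from center point pooled[i] to each sample.
    dist_to_x = cdist(pooled, x)
    dist_to_y = cdist(pooled, y)

    # One univariate KS 2-sample test per center point.
    results = [ks_2samp(dx, dy) for dx, dy in zip(dist_to_x, dist_to_y)]
    stats = np.array([r.statistic for r in results])
    pvalues = np.array([r.pvalue for r in results])

    # Statistic tied to the smallest p-value, then Bonferroni over n + m tests.
    best = int(np.argmin(pvalues))
    # Assumption of this sketch: cap the adjusted p-value at 1.
    return float(stats[best]), float(min(1.0, pvalues[best] * len(pooled)))


rng = np.random.default_rng(0)
same = hhg_two_sample_sketch(rng.normal(size=(50, 2)), rng.normal(size=(50, 2)))
shifted = hhg_two_sample_sketch(
    rng.normal(size=(50, 2)), rng.normal(loc=3.0, size=(50, 2))
)
print("same distribution (stat, pvalue):", same)
print("shifted distribution (stat, pvalue):", shifted)
```

The cap only matters when the smallest raw p-value exceeds `1 / (n + m)`; below that threshold the sketch and the uncapped product agree.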
46 changes: 46 additions & 0 deletions hyppo/ksample/tests/test_hhg.py
@@ -0,0 +1,46 @@
import numpy as np
import pytest
from numpy.testing import assert_almost_equal

from ...tools import rot_ksamp
from .. import KSampleHHG


class TestKSampleHHG:
    @pytest.mark.parametrize(
        "n, obs_stat, obs_pvalue", [(100, 0.515, 4.912e-10), (10, 0.777, 0.125874),],
    )
    def test_linear_oned(self, n, obs_stat, obs_pvalue):
        np.random.seed(123456789)
        x, y = rot_ksamp("linear", n, 1, k=2, noise=False)
        stat, pvalue = KSampleHHG().test(x, y)

        assert_almost_equal(stat, obs_stat, decimal=3)
        assert_almost_equal(pvalue, obs_pvalue, decimal=3)

    @pytest.mark.parametrize(
        "n", [(100)],
    )
    def test_rep(self, n):
        np.random.seed(123456789)
        x, y = rot_ksamp("linear", n, 1, k=2, noise=False)
        MPstat1 = KSampleHHG().statistic(x, y)
        MPstat2 = KSampleHHG().statistic(x, y)

        assert MPstat1 == MPstat2


class TestKSampleHHGTypeIError:
    def test_oned(self):
        np.random.seed(123456789)
        rejections = 0
        for i in range(1000):
            x, y = rot_ksamp(
                "multimodal_independence", n=100, p=1, noise=True, degree=90
            )
            stat, pvalue = KSampleHHG().test(x, y)
            if pvalue < 0.05:
                rejections += 1
        est_power = rejections / 1000

        assert_almost_equal(est_power, 0, decimal=2)
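Reviewer note on the Type I error check above: `assert_almost_equal(..., decimal=2)` passes when the absolute difference is below `1.5e-2`, so the test effectively requires a rejection rate under 1.5% rather than one near the nominal 0.05 level. A minimal sketch of that tolerance follows; `passes_type1_check` is a hypothetical helper and the rejection counts are made up for illustration.

```python
from numpy.testing import assert_almost_equal


def passes_type1_check(rejections, trials=1000):
    """Hypothetical helper: applies the same tolerance as the assertion in test_oned."""
    try:
        assert_almost_equal(rejections / trials, 0, decimal=2)
    except AssertionError:
        return False
    return True


# Made-up rejection counts, chosen away from the 0.015 boundary.
ok_low = passes_type1_check(14)   # rate 0.014 is within 1.5e-2 of 0
ok_high = passes_type1_check(50)  # rate 0.050 (the nominal level) is not
print(ok_low, ok_high)
```

A rate this far below 0.05 is consistent with the Bonferroni correction being conservative, since the per-center-point KS tests are computed from the same pooled sample.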
28 changes: 28 additions & 0 deletions tutorials/ksample.py
@@ -239,6 +239,34 @@
print("5 degrees of freedom (stat, pval):\n", stat1, pval1)
print("10 degrees of freedom (stat, pval):\n", stat2, pval2)

########################################################################################
# Univariate-Based Test
# --------------------------------------------
# The **Heller Heller Gorfine (HHG) 2-Sample Test** is a non-parametric two-sample
# statistical test. This test is based on testing the independence of the distances of
# sample vectors from a center point by a univariate K-sample test. If the distribution
# of samples differs across categories, then so does the distribution of distances of
# the vectors from almost every point z. The univariate test used is the
# Kolmogorov-Smirnov 2-sample test, which looks at the largest absolute deviation
# between the cumulative distribution functions of the samples.
# More info can be found at :class:`hyppo.ksample.KSampleHHG`.
#
# .. note::
#
#    :Pros: - Very fast computation time
#    :Cons: - Lower power than more computationally complex algorithms
#           - Inherits the assumptions of the KS univariate test

from hyppo.ksample import KSampleHHG

np.random.seed(1234)

x, y = rot_ksamp("linear", n=100, p=1, k=2, noise=False)

stat, pvalue = KSampleHHG().test(x, y)
print(stat, pvalue)


########################################################################################
# .. _[1]: https://link.springer.com/article/10.1007/s10182-020-00378-1
# .. _[2]: https://arxiv.org/abs/1910.08883