Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Integrated Fast HHG into original HHG code * Update hhg.py * Create Fast HHG Tester.ipynb * Update hhg.py * Update Fast HHG Tester.ipynb * Update Fast HHG Tester.ipynb * FastHHG Tester * Update * Update Fast HHG Tester.ipynb * Added Power Sample Size Results * Fast HHG updated with Hoeffdings and docstrings * Update hhg.py * Edited independence tutorial * Delete Fast HHG Tester.ipynb * Delete indep_power_sampsize (center).pdf * Delete OGTest1Figure.png * Delete OGTest1.png * Delete Function Outline.docx * release hyppo 0.2.2 (#236) (#237) * fix inccorect p-value in unit tests * fix MGC unit tests * update version to 0.2.2 * Black sampling and edits to tutorial * Black format check * Imported IndependenceTestOutput * Formatting Changes * Black formating * Black Formatting * Adjustments to Requested Changes Remove cdist - replace with compute_dist, modified description, changed 'fast' to 'auto' and simplifying of hoeffding's implementation to basic formula. * Edited Hoeffding Function * Added Unit Tests for Fast HHG Unit tests cover replicability, Type 1 Error, distance statistic implementation and 1D linear test. * Add citations to refs and to code * Black Formatting Sweep * Minor Edits * Transfer Fast HHG description to notes section * Removed Fast HHG description Added 'fast implementation' as a pro instead. * Updated Unit Tests * Adjustments Made 'auto' in test function instead of _init_. Adjustments all-around to match. Adjusted Hoeffding function to implement jit on inner function. Achieved much higher speeds thanks to it. * Added Toy Implementation to tutorial * Update independence.py * Black Formatting Sweep * Adjustments * Formatting sweep * Updated test to cover Hoeffding Statistic Function * Disabled JIT during test to possibly improve code coverage * Formatting * Removed unneeded import. * Fixed JIT coverage error + Black Format Sweep * [skip-ci] link to p-value method * Create HHG KSample Tester.ipynb * Adding Fast HHG K-Sample Class Currently center-of-mass AD version * Completed addition of Multipoint Version * Renaming module file to match independence format * Begin code of unit tests for HHG Ksample * Edit ksample init.py * Update Unit Test Added Type 1 Error and replicability test * Corrected error in hhg ksample module Allowed option of mode * Remove Center of Mass Version * Update documentation * Black formatting * Update Stat vs test function * Start user guide * Update documentation. * Update unit tests * Updated Ksample tutorial * Black formating * Change unit test for better coverage * Consolidated functions * Correct error in tutorial * Removed mistakenly added testing notebook * Change class name to KSampleHHG, Distinguish from independence HHG. All references also updated as well. * change file name to ksamplehhg * Change test function to use statistic function * Separate function to add numba to relevant section * Added parameter headings * Complete parameter headings * Black Formatting * Removed accidental loop * Fix for coverage * Requested changes made! * Updated API index for class doc Co-authored-by: Sambit Panda <36676569+sampan501@users.noreply.github.com> Co-authored-by: Sambit Panda <sampanda501@gmail.com>
- Loading branch information
1 parent
32a06d0
commit 1760db8
Showing
5 changed files
with
244 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,6 +57,7 @@ Independence | |
Hotelling | ||
SmoothCFTest | ||
MeanEmbeddingTest | ||
KSampleHHG | ||
|
||
|
||
.. automodule:: hyppo.time_series | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
import numpy as np | ||
from numba import jit | ||
from hyppo.ksample.base import KSampleTest | ||
from hyppo.ksample._utils import _CheckInputs | ||
from sklearn.metrics import pairwise_distances | ||
from scipy.stats import ks_2samp | ||
|
||
|
||
class KSampleHHG(KSampleTest): | ||
r""" | ||
HHG 2-Sample test statistic. | ||
This is a 2-sample multivariate test based on univariate test statistics. | ||
It inherits the computational complexity from the unvariate tests to achieve | ||
faster speeds than classic multivariate tests. | ||
The univariate test used is the Kolmogorov-Smirnov 2-sample test. | ||
:footcite:p:`hellerMultivariateTestsOfAssociation2016`. | ||
Parameters | ||
---------- | ||
compute_distance : str, callable, or None, default: "euclidean" | ||
A function that computes the distance among the samples within each | ||
data matrix. | ||
Valid strings for ``compute_distance`` are, as defined in | ||
:func:`sklearn.metrics.pairwise_distances`, | ||
- From scikit-learn: [``"euclidean"``, ``"cityblock"``, ``"cosine"``, | ||
``"l1"``, ``"l2"``, ``"manhattan"``] See the documentation for | ||
:mod:`scipy.spatial.distance` for details | ||
on these metrics. | ||
- From scipy.spatial.distance: [``"braycurtis"``, ``"canberra"``, | ||
``"chebyshev"``, ``"correlation"``, ``"dice"``, ``"hamming"``, | ||
``"jaccard"``, ``"kulsinski"``, ``"mahalanobis"``, ``"minkowski"``, | ||
``"rogerstanimoto"``, ``"russellrao"``, ``"seuclidean"``, | ||
``"sokalmichener"``, ``"sokalsneath"``, ``"sqeuclidean"``, | ||
``"yule"``] See the documentation for :mod:`scipy.spatial.distance` for | ||
details on these metrics. | ||
Set to ``None`` or ``"precomputed"`` if ``x`` and ``y`` are already distance | ||
matrices. To call a custom function, either create the distance matrix | ||
before-hand or create a function of the form ``metric(x, **kwargs)`` | ||
where ``x`` is the data matrix for which pairwise distances are | ||
calculated and ``**kwargs`` are extra arguements to send to your custom | ||
**kwargs | ||
Arbitrary keyword arguments for ``compute_distance``. | ||
Notes | ||
----- | ||
The statistic can be derived as follows: | ||
:footcite:p:`hellerMultivariateTestsOfAssociation2016`. | ||
Let :math:`x`, :math:`y` be :math:`(n, p)`, :math:`(m, p)` samples of random variables | ||
:math:`X` and :math:`Y \in \R^p` . Let there be a center point | ||
:math:`\in \R^p`. | ||
For every sample :math:`i`, calculate the distances from the center point | ||
in :math:`x` and :math:`y` and denote this as :math:`d_x(x_i)` | ||
and :math:`d_y(y_i)`. This will create a 1D collection of distances for each | ||
sample group. | ||
Then apply the KS 2-sample test on these center-point distances. This classic test | ||
compares the empirical distribution function of the two samples and takes | ||
the supremum of the difference between them. See Notes under scipy.stats.ks_2samp | ||
for more details. | ||
To achieve better power, the above process is repeated with each sample point | ||
:math:`x_i` and :math:`y_i` as center points. The resultant :math:`n+m` p-values | ||
are then pooled for use in the Bonferroni test of the global null hypothesis. | ||
The HHG statistic is the KS stat associated with the smallest p-value from the pool, | ||
while the HHG p-value is the smallest p-value multipled by the number of sample points. | ||
References | ||
---------- | ||
.. footbibliography:: | ||
""" | ||
|
||
def __init__(self, compute_distance="euclidean", **kwargs): | ||
self.compute_distance = compute_distance | ||
KSampleTest.__init__(self, compute_distance=compute_distance, **kwargs) | ||
|
||
def statistic(self, x, y): | ||
""" | ||
Calculates K-Sample HHG test statistic. | ||
Parameters | ||
---------- | ||
x,y : ndarray of float | ||
Input data matrices. ``x`` and ``y`` must have the same number of | ||
dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where | ||
`n` and are the number of samples and `p` is the number of | ||
dimensions. | ||
Returns | ||
------- | ||
stat : float | ||
The computed KS test statistic associated with the lowest p-value. | ||
""" | ||
xy = np.concatenate((x, y), axis=0) | ||
distxy = _centerpoint_dist(xy, self.compute_distance, 1) | ||
distx = distxy[:, 0 : len(x)] | ||
disty = distxy[:, len(x) : len(x) + len(y)] | ||
stats, pvalues = _distance_score(distx, disty) | ||
minP = min(pvalues) | ||
stat = stats[pvalues.index(minP)] | ||
self.minP = minP | ||
self.stat = stat | ||
return self.stat | ||
|
||
def test(self, x, y): | ||
""" | ||
Calculates K-Sample HHG test statistic and p-value. | ||
Parameters | ||
---------- | ||
x,y : ndarray of float | ||
Input data matrices. ``x`` and ``y`` must have the same number of | ||
dimensions. That is, the shapes must be ``(n, p)`` and ``(m, p)`` where | ||
`n` and `m` are the number of samples and `p` is the number of | ||
dimensions. | ||
Returns | ||
------- | ||
stat : float | ||
The computed KS test statistic associated with the lowest p-value. | ||
pvalue : float | ||
The computed HHG pvalue. Equivalent to the lowest p-value multiplied by the total number | ||
of samples. | ||
""" | ||
check_input = _CheckInputs(inputs=[x, y],) | ||
x, y = check_input() | ||
N = x.shape[0] + y.shape[0] | ||
|
||
stat = self.statistic(x, y) | ||
pvalue = self.minP * N | ||
return stat, pvalue | ||
|
||
|
||
def _centerpoint_dist(xy, metric, workers=1, **kwargs): | ||
"""Gives pairwise distances - each row corresponds to center-point distances | ||
where one sample point is the center point""" | ||
distxy = pairwise_distances(xy, metric=metric, n_jobs=workers, **kwargs) | ||
return distxy | ||
|
||
|
||
def _distance_score(distx, disty): | ||
dist1, dist2 = _group_distances(distx, disty) | ||
stats = [] | ||
pvalues = [] | ||
for i in range(len(distx)): | ||
stat, pvalue = ks_2samp(dist1[i], dist2[i]) | ||
stats.append(stat) | ||
pvalues.append(pvalue) | ||
return stats, pvalues | ||
|
||
|
||
@jit(nopython=True, cache=True) | ||
def _group_distances(distx, disty): # pragma: no cover | ||
dist1 = [] | ||
dist2 = [] | ||
for i in range(len(distx)): | ||
distancex = np.delete(distx[i], 0) | ||
distancey = np.delete(disty[i], 0) | ||
distancex = distancex.reshape(-1) | ||
distancey = distancey.reshape(-1) | ||
dist1.append(distancex) | ||
dist2.append(distancey) | ||
return dist1, dist2 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
import numpy as np | ||
import pytest | ||
from numpy.testing import assert_almost_equal | ||
|
||
from ...tools import rot_ksamp | ||
from .. import KSampleHHG | ||
|
||
|
||
class TestKSampleHHG: | ||
@pytest.mark.parametrize( | ||
"n, obs_stat, obs_pvalue", [(100, 0.515, 4.912e-10), (10, 0.777, 0.125874),], | ||
) | ||
def test_linear_oned(self, n, obs_stat, obs_pvalue): | ||
np.random.seed(123456789) | ||
x, y = rot_ksamp("linear", n, 1, k=2, noise=False) | ||
stat, pvalue = KSampleHHG().test(x, y) | ||
|
||
assert_almost_equal(stat, obs_stat, decimal=3) | ||
assert_almost_equal(pvalue, obs_pvalue, decimal=3) | ||
|
||
@pytest.mark.parametrize( | ||
"n", [(100)], | ||
) | ||
def test_rep(self, n): | ||
np.random.seed(123456789) | ||
x, y = rot_ksamp("linear", n, 1, k=2, noise=False) | ||
MPstat1 = KSampleHHG().statistic(x, y) | ||
MPstat2 = KSampleHHG().statistic(x, y) | ||
|
||
assert MPstat1 == MPstat2 | ||
|
||
|
||
class TestKSampleHHGTypeIError: | ||
def test_oned(self): | ||
np.random.seed(123456789) | ||
rejections = 0 | ||
for i in range(1000): | ||
x, y = rot_ksamp( | ||
"multimodal_independence", n=100, p=1, noise=True, degree=90 | ||
) | ||
stat, pvalue = KSampleHHG().test(x, y) | ||
if pvalue < 0.05: | ||
rejections += 1 | ||
est_power = rejections / 1000 | ||
|
||
assert_almost_equal(est_power, 0, decimal=2) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters