Skip to content
This repository was archived by the owner on Jan 22, 2026. It is now read-only.

Add Mann-Whitney U test methods (bug 1395571)#174

Merged
robhudson merged 1 commit intomasterfrom
add-mann-whitney
Oct 16, 2017
Merged

Add Mann-Whitney U test methods (bug 1395571)#174
robhudson merged 1 commit intomasterfrom
add-mann-whitney

Conversation

@robhudson
Copy link
Copy Markdown

@robhudson robhudson commented Sep 7, 2017

TODO:

  • Also return p-values, similar to scipy

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.4%) to 78.84% when pulling 9d87447 on add-mann-whitney into 414f6f3 on master.

Copy link
Copy Markdown
Contributor

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, Rob!

Comment thread tests/test_stats.py Outdated
SAMPLE2 = {1: 1, 2: 5, 3: 2, 4: 4, 5: 3}

# Basic test.
assert stats.mann_whitney_u(SAMPLE1, SAMPLE2) == 94.5
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should import scipy's mannwhitneyu test and compare its test statistic to yours. Could even have comparisons on some generated histograms.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This project doesn't currently depend on scipy. Is that an ok requirement to add?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to simply add it to the setup.py similar to numpy but it requires a fortran compiler and failed. It'll be a bit more complicated to add this it seems?

@robhudson
Copy link
Copy Markdown
Author

It was noted on IRC that I can borrow the method for calculating p-values from the lua code here:
https://github.com/mozilla-services/lua_sandbox_extensions/blob/d5d65adbe4e075825aa99f5abee01ffff93bcced/lsb/modules/lsb/stats.lua#L311-L319

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.4%) to 79.157% when pulling 7876bcb on add-mann-whitney into bf624ef on master.

@mozilla mozilla deleted a comment from fbertsch Oct 4, 2017
@robhudson robhudson force-pushed the add-mann-whitney branch 3 times, most recently from 73dccce to 9389509 Compare October 10, 2017 16:47
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.01%) to 81.378% when pulling 9389509 on add-mann-whitney into 26e5deb on master.

Comment thread moztelemetry/stats.py Outdated
from collections import namedtuple


def rank(sample):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function will use the floor if the input counts are integers:

rank({1: 2, 2: 2, 3: 2, 4: 2}) == {1: 1, 2: 3, 3: 5, 4: 7}
rank({1: 2.0, 2: 2.0, 3: 2.0, 4: 2.0}) == {1: 1.5, 2: 3.5, 3: 5.5, 4: 7.5}

Please convert them to doubles.

Please write a brief description of what this function is doing, and specify exactly what the rank will be for buckets with more than one count (i.e. rank is the median rank of all ranks in that bucket).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will return doubles since I'm importing division from the __future__, which will be more Python 3 if this project ever moves to it. W/o that import you're correct (and living in the past).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, good point! I've gotta switch my local default to 3.6...

Comment thread moztelemetry/stats.py Outdated
return ranks


def tie_correct(sample):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a link to tie correction for MWU on wikipedia.

Comment thread moztelemetry/stats.py
tie_correction = tie_correct(sample)
sd_u = math.sqrt(tie_correction * n1 * n2 * (n1 + n2 + 1) / 12.0)
mean_rank = n1 * n2 / 2.0 + 0.5 * use_continuity
z = abs((max(u1, u2) - mean_rank) / sd_u)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the min(u1, u2), see here

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In step 10 they're using U=32 to compute z, which is the max(Ua, Ub) in the example.

Comment thread moztelemetry/stats.py Outdated
return tc


def ndtr(v):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ndtr takes the integral from -infinity to v over the gaussian distribution. This is why the input below is -abs(z). Can we do two things here:

  • Add tests that this function returns the same value as scipy ndtr

  • Document what this function is doing, with a reference to the above link

I do like calculating this separately, rather than importing scipy.

Comment thread tests/test_stats.py
"""
This module implements test coverage for the stats functions in stats.py.
"""
import itertools
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more tests, with some different distributions? We could add, for example:

  • Normally distributed histograms
  • Skewed distributions
  • Uniformly distributed histograms

And combinations therein. We can create them randomly, and check that our result is within scipy's by some value.

Comment thread moztelemetry/stats.py
{1: 5, 2: 20, 3: 12, ...}

Returns the U statistic, equal to min(U for sample1, U for sample2).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comments to match code.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.05%) to 81.418% when pulling 5955828 on add-mann-whitney into 26e5deb on master.

@robhudson
Copy link
Copy Markdown
Author

Updated with requested changes.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.05%) to 81.418% when pulling 9f1760a on add-mann-whitney into 26e5deb on master.

Comment thread moztelemetry/stats.py Outdated

"""
try:
a = float(a)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably more pythonic to let this exception be raised. Perhaps just line 72 should be float(a) * sqrth. Alternatively, we could just remove validation entirely.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.0%) to 81.338% when pulling 44cd983 on add-mann-whitney into 26e5deb on master.

Comment thread tests/test_stats.py Outdated


def test_mann_whitney_u():
for sample in ('normalized', 'uniform', 'skewed'):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are close to what I was thinking, but all of these are comparing the same distribution. I would change one of the distributions (for each of: normalized, uniform, and skewed) to be different (different mean, different std dev, different number of samples, etc.), and in addition add tests that compare e.g. normalized to uniform, uniform to skewed, etc.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.0%) to 81.338% when pulling b46e40a on add-mann-whitney into 26e5deb on master.

Copy link
Copy Markdown
Contributor

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frack yeah! :shipit:

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.0%) to 81.338% when pulling 7b704bb on add-mann-whitney into 26e5deb on master.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+1.0%) to 81.338% when pulling d426a26 on add-mann-whitney into 26e5deb on master.

@robhudson robhudson merged commit 55c1646 into master Oct 16, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants