NF: Outlier scoring #1624

Merged
merged 28 commits into nipy:master from kesshijordan:outlier_cci on Oct 24, 2018

Conversation

@kesshijordan
Contributor

kesshijordan commented Aug 24, 2018

This is an outlier scoring method that compares the pathways of each streamline in a bundle (pairwise) and scores each streamline by how many other streamlines have similar pathways. I still need to write tests, but any initial feedback would be appreciated.

Here is a notebook demonstrating the use to filter streamlines
https://github.com/kesshijordan/shared_notebooks/blob/master/CCI_Demo.ipynb

Here is a paper about the method
https://onlinelibrary.wiley.com/doi/abs/10.1111/jon.12467
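
A minimal sketch of the filtering workflow (using the cluster_confidence name the function ends up with in this PR; the toy bundle and the threshold of 1 are purely illustrative):

import numpy as np
from dipy.tracking.streamline import Streamlines, cluster_confidence

# Toy bundle: three nearby ~40 mm streamlines plus one far-away outlier.
base = np.column_stack([np.linspace(0, 40, 30),
                        np.zeros(30),
                        np.zeros(30)])
bundle = Streamlines([base + [0, off, 0] for off in (0.0, 1.0, 2.0)])
bundle.append(base + [0, 30.0, 0])          # outlier pathway

cci = cluster_confidence(bundle)            # one score per streamline
keep = Streamlines(sl for sl, score in zip(bundle, cci)
                   if score >= 1)           # drop poorly supported streamlines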

@codecov-io


codecov-io commented Aug 24, 2018

Codecov Report

Merging #1624 into master will decrease coverage by 0.03%.
The diff coverage is 20%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1624      +/-   ##
==========================================
- Coverage   87.33%   87.29%   -0.04%     
==========================================
  Files         246      246              
  Lines       32164    32179      +15     
  Branches     3497     3499       +2     
==========================================
+ Hits        28089    28092       +3     
- Misses       3242     3254      +12     
  Partials      833      833
Impacted Files Coverage Δ
dipy/tracking/utils.py 85.88% <20%> (-3.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1e1f2e...3a73599. Read the comment docs.

@codecov-io


codecov-io commented Aug 25, 2018

Codecov Report

Merging #1624 into master will increase coverage by 0.02%.
The diff coverage is 97.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1624      +/-   ##
==========================================
+ Coverage   87.31%   87.34%   +0.02%     
==========================================
  Files         246      246              
  Lines       32613    32687      +74     
  Branches     3552     3555       +3     
==========================================
+ Hits        28477    28549      +72     
- Misses       3275     3276       +1     
- Partials      861      862       +1
Impacted Files Coverage Δ
dipy/tracking/utils.py 88.92% <ø> (ø) ⬆️
dipy/tracking/tests/test_streamline.py 98.44% <100%> (+0.19%) ⬆️
dipy/tracking/tests/test_utils.py 99.29% <100%> (ø) ⬆️
dipy/tracking/streamline.py 69.37% <90.9%> (+1.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e28942...978ab60. Read the comment docs.

@kesshijordan kesshijordan requested a review from skoudoro Aug 25, 2018

@skoudoro
Member

skoudoro left a comment

Overall, it looks good, @kesshijordan. It would be nice if you could add an example in the streamlines analysis section of the docs, like this one.
Otherwise, I did not comment on them individually, but there are a lot of PEP 8 errors in test_utils.py (missing spaces around operators). Can you check them?
Thank you for this, @kesshijordan!


for i, sl in enumerate(subsamp_sls):
    mdf_mx = bundles_distances_mdf([subsamp_sls[i]], subsamp_sls)
    if (1 * mdf_mx == 0).sum() > 1:

@skoudoro

skoudoro Aug 27, 2018

Member

why do you need to multiply by 1?

@kesshijordan

kesshijordan Aug 28, 2018

Contributor

I guess I don't... I did that to make sure the boolean is converted to an integer so it gets added correctly but I just tried it and True + True = 2. Thanks!
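
(Quick illustrative check that the mask already sums as integers, so the 1 * is redundant:)

import numpy as np

mdf_mx = np.array([[0.0, 0.0, 3.2]])
# Boolean arrays sum as 0/1, so both expressions count the two zero entries.
assert (1 * mdf_mx == 0).sum() == (mdf_mx == 0).sum() == 2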

@skoudoro

skoudoro Sep 8, 2018

Member

Can you do this correction?

@@ -64,6 +64,8 @@
from numpy import (asarray, ceil, dot, empty, eye, sqrt)
from dipy.io.bvectxt import ornt_mapping
from dipy.tracking import metrics
from dipy.tracking.streamline import set_number_of_points

@skoudoro

skoudoro Aug 27, 2018

Member

This line generates a circular import. This is really problematic because I cannot run your tests on Windows.
For the moment, you can replace it with from dipy.tracking.streamlinespeed import set_number_of_points.

I feel like we have a design problem in the tracking module @Garyfallidis @gabknight @jchoude @StongeEtienne. This is not the first time I have encountered this circular-import problem in the tracking module.

@arokem

arokem Aug 28, 2018

Member

I think this means that the CCI calculation is not a utility. I would put it in the streamlines module

@kesshijordan

kesshijordan Aug 28, 2018

Contributor

Thanks, @skoudoro and @arokem. Could one of you please explain this point a bit? I'm not following.

@arokem

arokem Aug 28, 2018

Member

In this line, you are importing a name from the dipy.tracking.streamline, but that module in turn imports dipy.tracking.utils, so that creates a circular import. Apparently that works fine on some platforms (maybe because you are only importing a name, rather than the entire namespace?), but doesn't generally. As @skoudoro points out, it probably reflects a lack of clarity on how this module is organized.

I would advocate for putting utils at the bottom of the hierarchy, and have that sub-module collect functions that are used by other sub-modules within the tracking module. That means that utils should be imported by other sub-modules, but shouldn't be importing other sub-modules (I see that structure is already being violated).
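
A rough sketch of the dependency direction being proposed (illustrative comments only, not the actual file contents):

# Bottom of the hierarchy: imports nothing from its sibling sub-modules.
#   dipy/tracking/utils.py
# May import utils:
#   dipy/tracking/streamline.py   ->  from dipy.tracking import utils
#   dipy/tracking/<other>.py      ->  from dipy.tracking import utils, streamline
#
# The cycle in this PR came from reversing one arrow:
#   utils.py:       from dipy.tracking.streamline import set_number_of_points
#   streamline.py:  imports dipy.tracking.utils   ->  circular import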

@kesshijordan

kesshijordan Aug 28, 2018

Contributor

I see. Thank you for explaining, @arokem. So this is a decision about the structure of the tracking module as a whole: utils gets imported by the other tracking sub-modules but should only import from modules outside tracking (unless it's from _utils, since that is lower in the hierarchy). I think this explains some other things that puzzled me while debugging imports... I might be an offender on this point myself, so let me know if I can help fix the module if it's decided to enforce the hierarchy : )

@kesshijordan

kesshijordan Aug 29, 2018

Contributor

I took your suggestion, @arokem, and moved the function to the streamlines sub-module in tracking

def test_cci():
    # two identical streamlines should raise an error
    mysl = np.array([np.arange(10)] * 3).T
    test_streamlines = list([mysl])*2

@skoudoro

skoudoro Aug 27, 2018

Member

Due to memory issues, you should avoid having a list of np.array as much as possible. It is for this reason that we created the Streamlines object. Can you replace the list with Streamlines?
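
A small sketch of the substitution being requested (toy arrays, not the PR's actual test data):

import numpy as np
from dipy.tracking.streamline import Streamlines

mysl = np.array([np.arange(10)] * 3, dtype=float).T

# Instead of a plain Python list of arrays ...
test_streamlines = [mysl, mysl + 1]
# ... use Streamlines, which stores all points in one contiguous buffer.
test_streamlines = Streamlines([mysl, mysl + 1])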

@kesshijordan

kesshijordan Aug 28, 2018

Contributor

Thanks, @skoudoro.

Something I noticed while fixing this: you get an error if the type isn't double (my toy streamlines in tests were initially integers)... I just fixed it in the test to make the toys floats. Can we assume that inputs to set_number_of_points will always be floats or is this an issue?

dipy.tracking.streamlinespeed.set_number_of_points()
ValueError: Buffer dtype mismatch, expected 'double' but got 'long'
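
A quick illustration of the dtype behaviour described above (assuming set_number_of_points keeps requiring double precision):

import numpy as np
from dipy.tracking.streamline import set_number_of_points

sl_int = np.array([np.arange(10)] * 3).T          # integer dtype
# set_number_of_points(sl_int, 12)                # raises: Buffer dtype mismatch
sl_float = sl_int.astype(np.float64)              # cast to double first
resampled = set_number_of_points(sl_float, 12)    # 12 equally spaced points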

@skoudoro

skoudoro Sep 8, 2018

Member

Indeed, I think this is an issue, but I would like to have @Garyfallidis's opinion.

assert_raises(ValueError, calculate_cci, test_streamlines)

# 3 offset collinear streamlines
test_streamlines = list([mysl])+[mysl+1]+[mysl+2]

@skoudoro

skoudoro Aug 27, 2018

Member

same as above


test_streamlines_p1 = list([mysl])+[mysl2]+[mysl3]
test_streamlines_p2 = list([mysl])+[mysl3]+[mysl4]
test_streamlines_p3 = list([mysl])+[mysl2]+[mysl3]+[mysl5]

@skoudoro

skoudoro Aug 27, 2018

Member

same as above


# error if any streamlines are shorter than 20mm
lengths = list(length(streamlines))
if np.array(lengths).min() < 20 and not override:

@skoudoro

skoudoro Aug 27, 2018

Member

I think you do not need to convert it to an np.array; you could just write min(lengths).

raise ValueError('Identical streamlines. CCI calculation invalid')
mdf_mx_oi = (mdf_mx > 0) & (mdf_mx < max_mdf) & ~ np.isnan(mdf_mx)
mdf_mx_oi_only = mdf_mx[mdf_mx_oi]
cci_score = np.sum(np.divide(1, np.power(mdf_mx_oi_only, power)))

@skoudoro

skoudoro Aug 27, 2018

Member

Any chance that mdf_mx_oi_only contains a 0? If yes, make sure to avoid a ZeroDivisionError.

@kesshijordan

kesshijordan Aug 28, 2018

Contributor

I don't think so... here's my reasoning:

  • mdf_mx can only be zero when two streamlines are identical (it's the MDF distance, computed by summing Euclidean distances between subsampled points on the streamlines; this can't be negative, so the only way to get zero is for all of the points to be identical).
  • Since we raise a ValueError when there are identical streamlines, we should never get a zero into mdf_mx_oi_only, since that's a subset of mdf_mx.
  • I added the ValueError for identical streamlines because, if you think about the purpose of this outlier calculation, identical streamlines don't really tell you anything: they would give strong support to a pathway that was probably gained "unfairly". For example: 1) if this is used on deterministic tracking with two seeds that sit on top of each other, that wouldn't actually give you confidence in the pathway... only a seed that's a bit further away would be informative in a deterministic system; 2) the method is really intended for probabilistic tracking, which shouldn't produce identical streamlines unless something is wrong with the way the bundle was generated (e.g. duplicated samples), in which case the user needs to address that to avoid misleading results.

Does this sound reasonable?
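
(A toy check of that reasoning, with names following the PR diff:)

import numpy as np

# One row of MDF distances for a streamline vs. the bundle; 0.0 could only come
# from an identical streamline, which the ValueError above rules out.
mdf_mx = np.array([[0.0, 2.5, 7.0, np.nan]])
max_mdf, power = 5.0, 1

mdf_mx_oi = (mdf_mx > 0) & (mdf_mx < max_mdf) & ~np.isnan(mdf_mx)
mdf_mx_oi_only = mdf_mx[mdf_mx_oi]        # array([2.5]); zeros never reach the division
cci_score = np.sum(np.divide(1, np.power(mdf_mx_oi_only, power)))   # 0.4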

@skoudoro

skoudoro Sep 8, 2018

Member

yep, good for me. Thank you for your explanation

mdf_mx = bundles_distances_mdf([subsamp_sls[i]], subsamp_sls)
if (1 * mdf_mx == 0).sum() > 1:
    raise ValueError('Identical streamlines. CCI calculation invalid')
mdf_mx_oi = (mdf_mx > 0) & (mdf_mx < max_mdf) & ~ np.isnan(mdf_mx)

@skoudoro

skoudoro Aug 27, 2018

Member

Nice! One day I should check whether this syntax is more efficient than using numpy functions.

the cci. High values of power make the contribution value
degrade much faster. Example: a streamline with 5mm MDF
similarity contributes 1/5 to the cci if power is 1, but
only contributes 1/5^2 = 1/25 if power is 2.

@arokem

arokem Aug 28, 2018

Member

override is not documented

@arokem

Member

arokem commented Aug 28, 2018

Regarding the naming: I wonder whether it would be better to call the function something like cluster_confidence, omitting the calculate and expanding the acronym out a bit.

@skoudoro

Member

skoudoro commented Aug 28, 2018

+1 for the naming suggestion

@kesshijordan

Contributor

kesshijordan commented Aug 29, 2018

Thanks so much for all your help, @skoudoro and @arokem! I addressed most of your comments (though I still need to write the example and finish formatting, see below).

I have a few formatting questions:

  • Do you have a suggestion for the best way to screen for PEP 8 errors? I have flake8 in my editor and I added autopep8 (Atom), but it isn't catching what @skoudoro pointed out. It catches missing spaces after commas, but not around operators. What are you using, @skoudoro?
  • I'm getting a Flake8 warning on some things that are imported but not used, but they are not part of my updates, so I want to make sure I understand before changing anything:
    • run_module_suite in test_streamline seems redundant with npt.run_module_suite called at the bottom
    • import nose in test_utils, setup_test in utils, and setup_test & compress_streamlines in streamline: I'm not sure about these... is it ever the case that something is imported and not used, but should still be there?
@kesshijordan

Contributor

kesshijordan commented Aug 29, 2018

I had forgotten to raise the streamline-length ValueError... I wanted to get the last line of test coverage, which is how I found the error : ) I think this is all set now, besides the formatting and the example.

@skoudoro
Member

skoudoro left a comment

Sorry for getting back to you late, @kesshijordan, and thank you for this update! Below you will find my answers. Can you rebase your PR?

What are you using

I am using pycodestyle, pylint and pydocstyle.

I'm getting a Flake8 warning on some things that are imported but not used

Most of them are useful and are used during nosetests initialization
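
For reference, a minimal way to run one of those checkers programmatically (pycodestyle also ships a command-line tool; the paths below are examples):

import pycodestyle

style = pycodestyle.StyleGuide()
report = style.check_files(['dipy/tracking/streamline.py',
                            'dipy/tracking/tests/test_streamline.py'])
print(report.total_errors, 'PEP 8 violation(s) found')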

@kesshijordan kesshijordan force-pushed the kesshijordan:outlier_cci branch from d32fa5a to 8a3e107 Sep 10, 2018

@kesshijordan

Contributor

kesshijordan commented Sep 10, 2018

No problem, @skoudoro. Thanks for all the time you put into reviewing my code! I added a tutorial and rebased on master.

@skoudoro
Member

skoudoro left a comment

I just need to test it on Windows, but otherwise it looks good to me.


for i, sl in enumerate(subsamp_sls):
    mdf_mx = bundles_distances_mdf([subsamp_sls[i]], subsamp_sls)
    if (1 * mdf_mx == 0).sum() > 1:

@skoudoro

skoudoro Sep 13, 2018

Member

In the end, you kept this multiplication by 1. Why?

@kesshijordan

kesshijordan Oct 16, 2018

Contributor

I did that to change the boolean to an integer for summing... I tested it in IPython and it's not necessary... True sums like 1 : ) Thanks for the pointer, @skoudoro!

@skoudoro
Member

skoudoro left a comment

I just added some comments on the tutorial, but this is a nice one!
Everything works like a charm.

cci = cluster_confidence(long_streamlines)

# Visualize the streamlines, colored by cci
ren = window.renderer()

@skoudoro

skoudoro Sep 13, 2018

Member

Capital R here. window.Renderer() (class object) instead of window.renderer() (deprecated function)

"""

lengths = list(length(streamlines))
long_streamlines = []

@skoudoro

skoudoro Sep 13, 2018

Member

Can you replace long_streamlines = [] with long_streamlines = Streamlines()? In general, try to avoid lists for streamline objects. Streamlines works like a list but improves memory management.

"""

keep_streamlines = []

@skoudoro

skoudoro Sep 13, 2018

Member

Same as above: can you replace keep_streamlines = [] with keep_streamlines = Streamlines()?

keep_streamlines.append(sl)

# Visualize the streamlines we kept
ren = window.renderer()

@skoudoro

skoudoro Sep 13, 2018

Member

Same as above, capital R here. window.Renderer() (class object) instead of window.renderer() (deprecated function)

@skoudoro

Member

skoudoro commented Oct 15, 2018

Any chance you can address these quick requests and rebase this PR, @kesshijordan? It would be great to see this PR in the next release this month. Thanks!

@kesshijordan

Contributor

kesshijordan commented Oct 16, 2018

what's the release date, @skoudoro?

@skoudoro

Member

skoudoro commented Oct 16, 2018

We do not have a date yet, but it will definitely be this month. It sounds like the last week of October...

@skoudoro

Member

skoudoro commented Oct 16, 2018

Awesome! Thank you for this update @kesshijordan!

I will merge this as soon as Travis and AppVeyor are happy. So, if someone wants to add a comment, you have one hour.

@arokem

Member

arokem commented Oct 16, 2018

+1 for the merge!

@kesshijordan

Contributor

kesshijordan commented Oct 17, 2018

@skoudoro the Travis error is in the FORECAST model, which I didn't touch in this PR. Is this a known issue?

======================================================================
FAIL: dipy.reconst.tests.test_forecast.test_forecast_positive_constrain

Traceback (most recent call last):
File "/home/travis/build/nipy/dipy/venv/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/home/travis/build/nipy/dipy/venv/lib/python3.5/site-packages/numpy/testing/_private/decorators.py", line 155, in skipper_func
return f(*args, **kwargs)
File "/home/travis/build/nipy/dipy/venv/lib/python3.5/site-packages/dipy/reconst/tests/test_forecast.py", line 53, in test_forecast_positive_constrain
assert_almost_equal(fodf[fodf < 0].sum(), 0, 2)
File "/home/travis/build/nipy/dipy/venv/lib/python3.5/site-packages/numpy/testing/_private/utils.py", line 584, in assert_almost_equal
raise AssertionError(_build_err_msg())
AssertionError:
Arrays are not almost equal to 2 decimals
ACTUAL: -0.015931951387651192
DESIRED: 0

@skoudoro

Member

skoudoro commented Oct 17, 2018

It appears on my PR too; it seems new, @kesshijordan. I need to look at this because it will block all our PRs.

@arokem arokem referenced this pull request Oct 17, 2018

Closed

Test failure in FORECAST #1654

@arokem

Member

arokem commented Oct 17, 2018

I think this is because of the newly released version of cvxpy. See: #1654

@skoudoro

Member

skoudoro commented Oct 22, 2018

Hi @kesshijordan, we just fixed the issue. You can rebase your PR and we will merge it after that.

Thank you

@kesshijordan

Contributor

kesshijordan commented Oct 23, 2018

Thanks, @skoudoro and @arokem! I think the Python 2.7 environment on AppVeyor is failing because of a copied-and-pasted non-ASCII dash in a citation.

I tried this in my Python 3 and 2.7 environments, and only 2.7 trips on the dash.
https://stackoverflow.com/questions/21639275/python-syntaxerror-non-ascii-character-xe2-in-file

ERROR: Failure: SyntaxError (Non-ASCII character '\xe2' in file c:\projects\dipy\dipy\tracking\streamline.py on line 517, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details (streamline.py, line 516))

References
----------
[Jordan17] Jordan K. Et al., Cluster Confidence Index: A Streamline‐Wise

@arokem

arokem Oct 23, 2018

Member

Something about this reference is wonky and throwing Python 2.7 off. Could you rewrite it from scratch by typing it manually into the file, instead of copy-pasting?

@kesshijordan

kesshijordan Oct 23, 2018

Contributor

Thanks, @arokem. There was a rogue dash, but I replaced it and then Python 2.7 passed.
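
A small, illustrative helper for hunting down characters like that rogue dash (the path is an example):

def find_non_ascii(path):
    # Print the location of every byte outside the ASCII range.
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, 1):
            for col, byte in enumerate(bytearray(line), 1):
                if byte > 127:
                    print('{0}:{1}:{2} byte 0x{3:02x}'.format(path, lineno, col, byte))

find_non_ascii('dipy/tracking/streamline.py')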

@skoudoro

Member

skoudoro commented Oct 24, 2018

Thank you for all the hard work, @kesshijordan, and thank you for the review, @arokem! Merging! 🎉

@skoudoro skoudoro merged commit 3638ee5 into nipy:master Oct 24, 2018

4 checks passed

codecov/patch 97.59% of diff hit (target 87.31%)
Details
codecov/project 87.34% (+0.02%) compared to f887b77
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@kesshijordan

Contributor

kesshijordan commented Oct 24, 2018

Thanks, @skoudoro and @arokem for all the work you put into reviews!
