New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quickbundles 2.1 #629

Merged
merged 15 commits into from May 19, 2015

Conversation

Projects
None yet
5 participants
@MarcCote
Contributor

MarcCote commented Apr 13, 2015

This PR mainly adds new Feature classes (ResampleFeature, MidPointFeature, ArcLengthFeature and VectorBetweenEndpointsFeature) and a new Metric class (CosineMetric).

Moreover, a new tutorial has been added explaining how to make new Feature and new Metric for Quickbundles.

This PR also fixes some bugs and adds more sanity checks.

super(CosineMetric, self).__init__(feature=feature)
cdef int c_are_compatible(CosineMetric self, Shape shape1, Shape shape2) nogil except -1:
return same_shape(shape1, shape2) and shape1.dims[0] == 1

This comment has been minimized.

@omarocegueda

omarocegueda Apr 13, 2015

Contributor

if same_shape(shape1, shape2)!=0 and shape1.dims[0] == 1:
return 1
else:
return 0
to avoid Cython trying to create a boolean object:
https://travis-ci.org/nipy/dipy/jobs/58217418#L775

This comment has been minimized.

@MarcCote
return shape1 == shape2 and shape1[0] == 1
def dist(self, v1, v2):
norm = lambda x: np.sqrt(np.sum(x**2, dtype=np.float64))

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

You might want to use np.linalg.norm

@@ -416,7 +417,7 @@ class QuickBundles(Clustering):
metric : str or `Metric` object (optional)
The distance metric to use when comparing two streamlines. By default,
the Minimum average Direct-Flip (MDF) distance [Garyfallidis12]_ is
used and requires streamlines to have the same number of points.
used and streamlines are automatically resampled so they have 18 points.

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

I assume there's some reason 18 is used here?

This comment has been minimized.

@MarcCote

MarcCote May 8, 2015

Contributor

No, not really. I used 18 since that is what @Garyfallidis used in the example about streamlines segmentation. I don't know what you guys prefer: force the user to provide the number of points, or set one by default and mention it clearly in the documentation?

This comment has been minimized.

@arokem

arokem May 9, 2015

Member

I would prefer this be a kwarg that users can easily change if they want to. Would that be hard to implement?

This comment has been minimized.

@MarcCote

MarcCote May 9, 2015

Contributor

It wouldn't be hard to implement. However, I don't know if this would make sense to add a parameter like that in the QuickBundles constructor since QuickBundles can accept any kind of metric that doesn't necessarily need a nb_points.

I would argue that users who just want to use the QuickBundles algorithm as a black box (I'm guessing it would be the majority) won't care about changing the number of points. This parameter has a direct impact on the speed of the algorithm but also the quality. I'm afraid uninformed users would simply decrease that number to gain speed without necessarily considering the validity of the results.

An informed user can "easily" and explicitly change that parameter as follows

feature = ResampleFeature(nb_points=42)
metric = AveragePointwiseEuclideanMetric(feature)
qb = QuickBundles(threshold=10., metric=metric)
clusters = qb.cluster(streamlines)

and I should probably add a note about that in the tutorial. What do you think? @Garyfallidis ?

This comment has been minimized.

@arokem

arokem May 9, 2015

Member

Yeah. How about creating a few usage examples in the docstring of this
class? Include simpler examples, but also something like this example. You
can certainly reuse code you have in the tests for that. I think that some
users will look at the docstring of the class, even rather than looking at
the tutorial (for IPython users in particular, this is the natural thing to
do).

This kind of hard-coded parameters needs to be really well documented - why
you chose a certain setting (here, this follows Eleftherios et al.'s paper,
I understand?), and what the implications of changing it would be (here a
speed-accuracy tradeoff). You can certainly state these things in the
docstring, as you show the example. Warn people about changing this and
what effect that will have on the results.

On Sat, May 9, 2015 at 9:07 AM, Marc-Alexandre Côté <
notifications@github.com> wrote:

In dipy/segment/clustering.py
#629 (comment):

@@ -416,7 +417,7 @@ class QuickBundles(Clustering):
metric : str or Metric object (optional)
The distance metric to use when comparing two streamlines. By default,
the Minimum average Direct-Flip (MDF) distance [Garyfallidis12]_ is

  •    used and requires streamlines to have the same number of points.
    
  •    used and streamlines are automatically resampled so they have 18 points.
    

It wouldn't be hard to implement. However, I don't know if this would make
sense to add a parameter like that in the QuickBundles constructor since
QuickBundles can accept any kind of metric that doesn't necessarily need
a nb_points.

I would argue that users who just want to use the QuickBundles algorithm
as a black box (I'm guessing it would be the majority) won't care about
changing the number of points. This parameter has a direct impact on the
speed of the algorithm but also the quality. I'm afraid uninformed users
would simply decrease that number to gain speed without necessarily
considering the validity of the results.

An informed user can "easily" and explicitly change that parameter as
follows

feature = ResampleFeature(nb_points=42)
metric = AveragePointwiseEuclideanMetric(feature)
qb = QuickBundles(threshold=10., metric=metric)
clusters = qb.cluster(streamlines)

and I should probably add a note about that in the tutorial. What do you
think? @Garyfallidis https://github.com/Garyfallidis ?


Reply to this email directly or view it on GitHub
https://github.com/nipy/dipy/pull/629/files#r29993774.

This comment has been minimized.

@Garyfallidis

Garyfallidis May 15, 2015

Member

@MarcCote you can use the same default parameter in what was in the initial Quickbundles algorithm. If not as @arokem suggests you have to explain what will be the effect of the changes and give examples.
I really like this new design 👍

feature = ResampleFeature(nb_points=42)
metric = AveragePointwiseEuclideanMetric(feature)
qb = QuickBundles(threshold=10., metric=metric)
clusters = qb.cluster(streamlines)

This comment has been minimized.

@MarcCote

MarcCote May 16, 2015

Contributor

Ok, the default is 12 points, I'll change it.

A sequence of N-dimensional points is represented as a 2D array with
shape (nb_points, nb_dimensions).
The feature being extracted consists in one N-dimensional point representing

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

"consists of"

A sequence of N-dimensional points is represented as a 2D array with
shape (nb_points, nb_dimensions).
The feature being extracted consists in one N-dimensional point representing

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

IIUC, this feature is not really a point in the same way that the inputs are sequences of points, but rather a vector in the N-dimensional space pointing from one end-point to the other.

This comment has been minimized.

@MarcCote

MarcCote May 8, 2015

Contributor

Yes, you are right, I'll change it.

int d
for d in range(D):
out[0, d] = datum[N-1, d] - datum[0, d]

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

Can you index using datum[-1, d]?

This comment has been minimized.

@MarcCote

MarcCote May 8, 2015

Contributor

No, since I disabled bounds checking. At the top of the file: # cython: wraparound=False, cdivision=True, boundscheck=False

However, using -1 won't fail at compilation time. Instead at runtime it will behave unexpectedly! -1 refers to memory before the beginning of the array.

This comment has been minimized.

@arokem

arokem May 9, 2015

Member

Cool. I learn new things here every time I do a review :-)

from dipy.segment.metricspeed import (dist,
distance_matrix)
# Creates aliases
EuclideanMetric = SumPointwiseEuclideanMetric

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

Is this here for backwards compatibility?

This comment has been minimized.

@MarcCote

MarcCote May 8, 2015

Contributor

No. Depending on the context, I find it clearer. Particularly, when comparing scalars or N-dimensional points. I think it makes more sense to use the term EuclideanMetric when computing Euclidean distance between two points and use SumPointwiseEuclideanMetric when comparing two sequences. Since SumPointwiseEuclideanMetric is generic it can easily handles the scalar and single Nd point cases.

For example, somewhere in the tutorial about extending features, we have to compare scalars: metric = EuclideanMetric(feature=ArcLengthFeature()).

This comment has been minimized.

@arokem

arokem May 9, 2015

Member

Would this be confusing if someone were to look for the docstring of EuclideanMetric and found the docstring of SumPointWiseEuclideanMetric? Maybe better to implement explicitly as a class with its own docstring, but inheriting all of its properties from the parent?

I am not sure - just thinking out loud here.

This comment has been minimized.

@MarcCote

MarcCote May 9, 2015

Contributor

What you said makes sense, I'll create a new class that subclass SumPointWiseEuclideanMetric. Thanks for the idea.

This comment has been minimized.

@MarcCote

MarcCote May 17, 2015

Contributor

@Garyfallidis Ok, I've been rethinking the framework. I'll make another PR explaining why but the gist of it is: what the hell I was thinking making Metric dependent on Features!

For now, I will remove the alias and modify the tutorial where it was used (i.e. the only place it was used anyway).

@@ -191,6 +204,9 @@ cdef class SumPointwiseEuclideanMetric(CythonMetric):
is equal to $a+b+c$ where $a$ is the Euclidean distance between s1[0] and
s2[0], $b$ between s1[1] and s2[1] and $c$ between s1[2] and s2[2].
"""
#def __init__(SumPointwiseEuclideanMetric self, Feature feature=ResampleFeature(18)):

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

If this is not needed, you can delete it.

This comment has been minimized.

@MarcCote

MarcCote May 8, 2015

Contributor

Yep, I'll delete it.

r""" Computes the cosine distance between two vectors.
A vector of N-dimensional points is represented as a 2D array with
shape (1, nb_dimensions).

This comment has been minimized.

@arokem

arokem Apr 24, 2015

Member

This is just one N-dimensional point?

This comment has been minimized.

@MarcCote

MarcCote May 8, 2015

Contributor

Yep, typo.

@@ -0,0 +1,233 @@
"""
===========================================================
How to make new `Feature` and new `Metric` for Quickbundles

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

Hi @MarcCote, this title is not very attractive.
I would say something more general. For example,
Enhancing QuickBundles with different metrics and features.
or
Introducing the advanced capabilities of the clustering framework. A use case
with adding new metrics to Quickbundles.

This comment has been minimized.

@MarcCote

MarcCote May 18, 2015

Contributor

I find the former better, for now. The second one might be goof once we will have more tutorials describing all the potential of the framework.

===========================================================
This tutorial shows how to create new `Feature` and new `Metric` classes that
can be used by QuickBundles.

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

For the purpose of ...

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

You should explain a bit why this is important.

sequential data (e.g. streamlines).
A *sequential datum* in Dipy is represented as a numpy array of size
:math:`(N \times D)` where each row of the array represent a D dimensional

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

represents

number of sequences in the set.
This clustering framework is modular and divided in three parts:
1) features extraction

This comment has been minimized.

@Garyfallidis
This clustering framework is modular and divided in three parts:
1) features extraction
2) distance computation
3) data partitioning

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

You mean clustering algorithm?

2) distance computation
3) data partitioning
The features extraction part regroups any preprocessing needed to be done on

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

feature
regroups is not clear here. Clarify please. An example of a simple feature would be helpful here.

the data before computing distances between them. To define new way of
extracting features, one has to subclass `Feature` (see below).
The distance computation part regroups any metric capable of evaluating a

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

Regroups is not informative here. Part is not needed too.

The distance computation part regroups any metric capable of evaluating a
distance between two set of features previously extracted from the data. To
define new way of extracting features, one has to subclass `Metric` (see below).

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

a new way

The data partitionning part represents the clustering algorithm itself
(e.g. QuickBundles, K-means, Hierarchical Clustering). More precisely, it
regroups any algorithms taking as input a list of sequential data and

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

Regroups again not clear? What do you want to say?

Extending `Feature`
===================
This section will guide you through the creation of a new features extraction

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

feature extraction

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

correct everywhere please or say extraction of features

"""
The new distance `CosineMetric` is ready to be used. Let's use
it to cluster a set of streamlines according the cosine distance of the

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

according to

from dipy.data import get_data
from dipy.viz import fvtk
fname = get_data('fornix')

This comment has been minimized.

@Garyfallidis

Garyfallidis May 14, 2015

Member

If you want to use a different bundle than the fornix. I have some simple cingulum bundles (cb) in the data. Look in get_data.

#cdef extern from "stdlib.h" nogil:
# ctypedef unsigned long size_t
# void free(void *ptr)
# void *malloc(size_t size)

This comment has been minimized.

@Garyfallidis

Garyfallidis May 15, 2015

Member

If you don't need the commented imports/cimport and cdef can you remove them?

This comment has been minimized.

@MarcCote

MarcCote May 18, 2015

Contributor

Oups. Done.

@@ -187,13 +188,39 @@ cdef class IdentityFeature(CythonFeature):
out[n, d] = datum[n, d]
cdef class ResampleFeature(CythonFeature):

This comment has been minimized.

@Garyfallidis

Garyfallidis May 15, 2015

Member

Resample is very general. We need a more precise name.

This comment has been minimized.

@MarcCote

MarcCote May 18, 2015

Contributor

You when me to put SetNumberOfPointsFeature :/

This comment has been minimized.

@Garyfallidis

Garyfallidis May 19, 2015

Member

No, it's okay. Keep it as it is.

@@ -2,6 +2,7 @@
import numpy as np

This comment has been minimized.

@Garyfallidis

Garyfallidis May 15, 2015

Member

I noticed that the coverage in clustering.py is at 86% is that the best it can be?

This comment has been minimized.

@MarcCote

MarcCote May 18, 2015

Contributor

I'll check what I can do.

This comment has been minimized.

@MarcCote

MarcCote May 18, 2015

Contributor

Done

@@ -1,5 +1,5 @@
from cythonutils cimport Data2D, Shape
cimport numpy as np

This comment has been minimized.

@matthew-brett

matthew-brett May 18, 2015

Member

I personally prefer 'cimport numpy as cnp' so it is clearer which is Python numpy, which Cython.

This comment has been minimized.

@MarcCote

MarcCote May 18, 2015

Contributor

I agree, I'll change it everywhere it applies.

@MarcCote

This comment has been minimized.

Contributor

MarcCote commented May 18, 2015

It seems the QuickBundles2 code is now failling with Cython 0.22.1. I think it has something to do with except *, I'll try to fix that.

FYI, according to Stefan Behnel from Cython:
"I understand that developers need Cython to work on kivy. I was just
wondering why users need Cython in order to install kivy. Usually,
shipping the generated (and tested) C sources avoids that. "
See https://groups.google.com/d/msg/cython-users/r3dQSfOKMjI/U8vabWagDQkJ

@Garyfallidis

This comment has been minimized.

Member

Garyfallidis commented May 19, 2015

Very well done! Next PR please! 👍

Garyfallidis added a commit that referenced this pull request May 19, 2015

@Garyfallidis Garyfallidis merged commit 7fc0cfd into nipy:master May 19, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@arokem

This comment has been minimized.

Member

arokem commented May 19, 2015

Nice!

@MarcCote MarcCote deleted the MarcCote:quickbundles2.1 branch May 20, 2015

@arokem arokem referenced this pull request May 26, 2015

Closed

Build-bot status #656

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment