Measures for over- and under-segmentation of chord labels by fdlm · Pull Request #263 · mir-evaluation/mir_eval

fdlm · 2017-10-04T09:51:08Z

An attempt at addressing #262.

The MIREX MeanSeg metric is not implemented, because it is the dataset average of min(overseg, underseg). I don't think it's meaningful to add another field to the scores that just returns the minimum of the two.
The results can be slightly different than the ones used in MIREX due to a philosophical difference. This implementation uses the adjusted intervals (i.e., assumes a "N" detection if the estimation does not output anything), while in MIREX, positions without an explicit estimation are ignored.
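For orientation, the measures discussed here can be sketched as follows. This is a minimal illustration assembled from the diff fragments quoted in the review comments of this thread, not the merged code. The function name `directional_hamming_distance_sketch` and the toy intervals are assumptions, and no input validation or interval adjustment is performed.

```python
import numpy as np


def directional_hamming_distance_sketch(reference_intervals,
                                        estimated_intervals):
    """Illustrative only: fraction of the reference duration that is not
    covered by the longest estimated piece inside each reference interval."""
    # All distinct boundary times of the estimated segmentation
    est_ts = np.unique(estimated_intervals.flatten())
    seg = 0.
    for start, end in reference_intervals:
        dur = end - start
        # Estimated boundaries that fall inside this reference interval
        between_start_end = est_ts[(est_ts >= start) & (est_ts < end)]
        seg_ts = np.hstack([start, between_start_end, end])
        # Everything except the longest piece counts as "fragmented"
        seg += dur - np.diff(seg_ts).max()
    return seg / (reference_intervals[-1, 1] - reference_intervals[0, 0])


# Toy data (assumed): one reference chord, split into three by the estimate
ref_intervals = np.array([[0.0, 4.0]])
est_intervals = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 4.0]])

overseg = 1 - directional_hamming_distance_sketch(ref_intervals,
                                                  est_intervals)
underseg = 1 - directional_hamming_distance_sketch(est_intervals,
                                                   ref_intervals)
seg = min(overseg, underseg)
print(overseg, underseg, seg)  # 0.5 1.0 0.5
```

In this toy case the estimate over-segments the single reference chord, so only the over-segmentation score drops below 1.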

Let me know what you think.

bmcfee · 2017-10-04T10:46:54Z

+    seg = 0.
+    for start, end in reference_intervals:
+        dur = end - start
+        between_start_end = est_ts[(est_ts > start) & (est_ts < end)]


Should only one of these comparisons be strict?
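To make the question concrete, here is a small sketch comparing strict, half-open, and closed selections of the estimated boundaries. It assumes that the selected boundaries are subsequently stacked between `start` and `end` and that only the longest resulting piece matters, as in the later revision of the diff quoted further down this thread. The helper `longest_piece` and the toy values are hypothetical.

```python
import numpy as np

# Toy boundary times (assumed); two of them coincide with start and end
est_ts = np.array([0.0, 1.0, 2.0, 4.0])
start, end = 0.0, 2.0


def longest_piece(inner):
    """Length of the longest piece after stacking start, inner, end."""
    seg_ts = np.hstack([start, inner, end])
    return np.diff(seg_ts).max()


strict = est_ts[(est_ts > start) & (est_ts < end)]       # [1.]
half_open = est_ts[(est_ts >= start) & (est_ts < end)]   # [0., 1.]
closed = est_ts[(est_ts >= start) & (est_ts <= end)]     # [0., 1., 2.]

results = [longest_piece(sel) for sel in (strict, half_open, closed)]
print(results)
```

Under these assumptions a boundary that coincides with `start` or `end` only adds a zero-length piece, so the longest piece is the same in all three cases.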

bmcfee · 2017-10-04T10:48:13Z

+        directional hamming distance between reference intervals and
+        estimated intervals.
+    """
+    est_ts = np.unique(estimated_intervals.flatten())


Seems like this should have input validation.

Something like if estimated_intervals.shape[1] != 2, or should the check go deeper and ensure, e.g., that there are no overlaps in the intervals?

Calling util.validate_intervals would probably suffice. It doesn't check for disjoint/complete segmentation (and maybe it should), but it does cover the basics of shape matching and valid interval timings.

bmcfee

The implementation looks good, thanks! A few minor comments about validation and edge cases.

Also, it looks like there are no unit tests for the new functions?

fdlm · 2017-10-05T09:35:30Z

Right, I missed that - somehow I thought having numbers in the outputXX.json is enough, but I will write tests that cover some edge cases.

craffel · 2017-10-05T17:36:57Z

Thanks for reviewing, @bmcfee. I am slammed right now, so I can give it a final look-over once things are squared away.

fdlm · 2017-10-11T13:54:18Z

So, I think I addressed all concerns @bmcfee pointed out, so from my side, this PR is ready to merge. Let me know if there is anything else I can improve!

bmcfee · 2017-10-11T13:59:41Z

+        between_start_end = est_ts[(est_ts >= start) & (est_ts < end)]
+        seg_ts = np.hstack([start, between_start_end, end])
+        seg += dur - np.diff(seg_ts).max()
+    return seg / (reference_intervals[-1, 1] - reference_intervals[0, 0])


Possible division by zero here, since validate_intervals doesn't check for non-emptiness. Maybe add an explicit check, and document the behavior when the reference intervals are empty.

I think it does implicitly in the following check (util.py:770):

    if intervals.ndim != 2 or intervals.shape[1] != 2:
        raise ValueError('Intervals should be n-by-2 numpy ndarray, '
                         'but shape={}'.format(intervals.shape))

Here, np.atleast_2d(np.array([])).shape[1] == 0, and thus the ValueError will be raised. So, at least one interval must be present, and it must have a positive duration, and thus reference_intervals[-1, 1] - reference_intervals[0, 0] must be positive.
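The argument can be checked with a short sketch. The function `validate_shape_sketch` below is a hypothetical stand-in that copies only the shape check quoted above; it is not the library's `validate_intervals`, which may perform further checks.

```python
import numpy as np


def validate_shape_sketch(intervals):
    """Hypothetical stand-in reproducing only the quoted shape check."""
    if intervals.ndim != 2 or intervals.shape[1] != 2:
        raise ValueError('Intervals should be n-by-2 numpy ndarray, '
                         'but shape={}'.format(intervals.shape))


# An empty input becomes a 1-by-0 array, so the second dimension is not 2
empty = np.atleast_2d(np.array([]))
print(empty.shape)  # (1, 0)

raised = False
try:
    validate_shape_sketch(empty)
except ValueError as exc:
    raised = True
    print(exc)
```

With this check in place, an empty array never reaches the division, because the `ValueError` is raised first.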

Ah, quite correct, thanks. Sorry I missed that before.

LGTM!

craffel · 2017-10-19T15:21:10Z

Great. I am about to travel to ISMIR but should get a chance to do a final pass over in the next few days. Thanks!

craffel

Minor documentation question, otherwise looks good.

craffel · 2017-10-19T20:09:36Z

+    ...     ref_intervals, est_intervals)
+    >>> underseg = 1 - mir_eval.chord.directional_hamming_distance(
+    ...     est_intervals, ref_intervals)
+    >>> meanseg = min(overseg, underseg)


I'm confused by this line --

It's called "meanseg" but is computed as the min.

Is this a thing people measure? If so should we have a separate metric function for it?

I agree that the naming is weird, but I wanted to be consistent with MIREX (see http://www.music-ir.org/mirex/wiki/2017:Audio_Chord_Estimation_Results).

I'm not sure if it is meaningful to have this as a separate metric for each song (it is, after all, just min(overseg, underseg)). However, it might then be easier to recreate the metrics as used in the MIREX task, where "MeanSeg" means the mean of the min(os, us) for each song. In this case, I would prefer to call it just something like "Segmentation".

What do you think?

So they are calling it "mean" because it's the mean of the min across all songs? I guess that bothers me because all of the metrics are mean across songs, correct? In terms of what I think, I honestly don't know in this case! But I will say typically it's ok to have an extra metric function if it's something that people measure, even if it's a one-liner. The idea being that evaluate returns everything that you might want to measure to report your results.

Sure, let's add a function for it, then! The only negative now is that underseg() and overseg() get computed twice when calling evaluate(). If this is a problem, I can replace scores['seg'] = seg(...) with scores['seg'] = min(scores['underseg'], scores['overseg']).

> If this is a problem, I can replace scores['seg'] = seg(...) with scores['seg'] = min(scores['underseg'], scores['overseg']).

This seems like the best option. (and sorry for the delay)
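As a rough sketch of the option agreed on here, the per-song 'seg' score can be derived from scores that are already computed, and a MIREX-style "MeanSeg" is then the average of that value over songs. The helper name `add_seg_score` and the per-song numbers are hypothetical, and the dictionary keys follow the ones mentioned in this thread.

```python
import numpy as np


def add_seg_score(scores):
    """Hypothetical helper: reuse the already computed scores instead of
    recomputing over- and under-segmentation a second time."""
    scores['seg'] = min(scores['underseg'], scores['overseg'])
    return scores


# Made-up per-song scores, for illustration only
per_song = [
    {'overseg': 0.5, 'underseg': 1.0},
    {'overseg': 0.9, 'underseg': 0.8},
]
per_song = [add_seg_score(s) for s in per_song]

# Dataset-level value: mean over songs of min(overseg, underseg)
mean_seg = float(np.mean([s['seg'] for s in per_song]))
print([s['seg'] for s in per_song], mean_seg)
```

Keeping the per-song value in the scores dictionary leaves the dataset-level averaging to whoever aggregates the results.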

No worries! Should be done now. If there is anything I can do to further improve this PR, let me know.

craffel · 2017-10-19T20:21:42Z

+    estimated intervals as defined by [#harte2010towards]_ and used for MIREX
+    'OverSeg', 'UnderSeg' and 'MeanSeg' measures.
+
+    Examples


Technically, since this isn't a metric function, you don't need an "Examples" section, but it certainly doesn't hurt :)


craffel · 2017-11-30T01:55:17Z

Merged! Thanks so much!

* implemented over- and under-segmentation measures
* added function to merge intervals; added tests
* added more tests and input validation for directional hamming distance
* added function for "segmentation" measure
* fixed computation of 'segmentation' measure for chord evaluation; fixed chord interval overlap check; added tests for chord interval overlap check.

fdlm added 2 commits October 2, 2017 15:37

implemented over- and under-segmentation measures

7c12878

added function to merge intervals; added tests

de46707

bmcfee reviewed Oct 4, 2017


added more tests and input validation for directional hamming distance

35045e1

bmcfee reviewed Oct 11, 2017


craffel reviewed Oct 19, 2017


fdlm added 2 commits October 25, 2017 16:39

added function for "segmentation" measure

c029071

fixed computation of 'segmentation' measure for chord evaluation;

0b84ec3

fixed chord interval overlap check; added tests for chord interval overlap check.

craffel merged commit 58ad872 into mir-evaluation:master Nov 30, 2017
