# Precision, recall and f-measure 
### George Tzanetakis, University of Victoria 

In this notebook we go over the terminology of retrieval metrics: precision, recall and f-measure and how they are used in a variety of Music Information Retrieval (MIR) tasks. We first examine their classic original use as measures for the effectiveness of text information retrieval systems. Then we explore how they are used for evaluating classification systems, auto-tagging systems, as well as systems that estimate time boundaries (such as beat tracking and structure segmentation). 


## Text Retrieval 

The original scenario for which retrieval metrics were proposed is topic-based text retrieval. Today we are all familiar with this concept from search engines. The idea is that the user submits a query and the text retrieval system returns a set of what the system thinks are relevant documents. To evaluate the system an expert rates each returned document as relevant or not. 



In [1]:
import numpy as np 

# suppose that our query is beatles and we are interested in retrieving documents 
# about the corresponding insect species 
query = 'beatles'

# Let's say a set of 10 documents is returned - their exact representation is not important 
# and could be a bag-of-words representation, text, a list of keywords etc 

# an expert/user goes over these automatically retrieved documents and marks 
# them as relevant or not relevant. 
# The documents that are about the insects are marked with a 1 and the 
# ones that are not relevant (perhaps about the music group) are marked
# with a zero. 
retrieved = np.array([1,1,0,1,1,1,1,0,1,0])


Precision is defined as the number of relevant retrieved documents divided by the number of returned documents. So for our particular example the precision is 0.70 or 70%. 

In [2]:
import numpy as np

def precision(retrieved):
    retrieved_relevant = np.count_nonzero(retrieved == 1)
    return (retrieved_relevant / len(retrieved))


print("Precision = %.2f\n" % (precision(retrieved)))

Precision = 0.70



Notice that we can improve the precision in this case by returning fewer items. For example, the precision when returning only the first two items is 1 or 100%. 

In [3]:
less_retrieved = np.array([1,1])
print("Precision = %.2f\n" % (precision(less_retrieved)))

Precision = 1.00



Now suppose that in the set of documents we are considering there are 15 documents about beatles (the insect species). Our set of retrieved results contains only 7 of them, and therefore the recall in this case is 7/15 = 0.47.

In [4]:
def recall(retrieved, num_relevant):
    retrieved_relevant = np.count_nonzero(retrieved == 1)
    return (retrieved_relevant / num_relevant)

In [5]:
retrieved = np.array([1,1,0,1,1,1,1,0,1,0])
print("Recall = %.2f\n" % (recall(retrieved,15)))

Recall = 0.47



We can trivially increase recall by returning more items, with the extreme case of returning all documents as relevant, at the expense of diminishing precision. Alternatively, we can return fewer and fewer documents, increasing precision at the expense of recall. An effective retrieval system should achieve both high precision and high recall. The most common measure that balances these two metrics is the f1-score or f-measure, defined as the harmonic mean of precision and recall. 

\begin{equation} 
F_{1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{equation} 

In [6]:
def f1_score(precision, recall): 
    return 2 * ((precision * recall) / (precision+recall))

# use separate names for the values so that the precision() and recall() 
# functions defined above are not overwritten 
p = 0.7 
r = 0.47
f1 = f1_score(p, r)
print("F1 = %.2f\n" % f1)

F1 = 0.56
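
The trade-off described above can be made concrete by scoring only the first k returned documents, for every k. The following is a minimal sketch, assuming the relevance list from the earlier cells is ordered by rank and that 15 relevant documents exist in total; the helper name `precision_recall_at_k` is made up for this illustration.

```python
import numpy as np

def precision_recall_at_k(ranked_relevance, num_relevant):
    """Precision and recall after each cutoff k = 1..len(ranked_relevance)."""
    ranked_relevance = np.asarray(ranked_relevance)
    hits = np.cumsum(ranked_relevance)            # relevant documents seen so far
    k = np.arange(1, len(ranked_relevance) + 1)   # number of documents returned
    return hits / k, hits / num_relevant

ranked = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1, 0])
p_at_k, r_at_k = precision_recall_at_k(ranked, num_relevant=15)

for k, (p, r) in enumerate(zip(p_at_k, r_at_k), start=1):
    print("k=%2d  precision=%.2f  recall=%.2f" % (k, p, r))
```

Recall can only stay the same or grow as k increases, while precision drops every time a non-relevant document is added.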



## Binary Classification 

For a binary classification problem we can consider the set of predictions as the retrieved documents and the ground truth as the annotations of relevance. For example, suppose that we have a music/speech classification system that predicts 100 instances, and let's say that 50 of them are labeled as music and 50 of them are labeled as speech. A system that predicts everything as music would have a recall of 1 for music (i.e., all music instances according to the ground truth would be predicted correctly) but a precision of 0.5, as half of the predicted instances (the ones labeled speech according to the ground truth) would not be correctly predicted. A system that predicts a single instance of music correctly as music and all the other instances as speech would have a precision of 1 but a really bad recall of 1/50. Notice that in any binary classification system that outputs a class probability we can trade precision against recall by adjusting the threshold for classification. The F1-score is defined, as in text retrieval, as the harmonic mean of precision and recall. 
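
To illustrate the threshold trade-off, here is a small sketch with made-up data: `y_true` and `p_music` are hypothetical ground truth labels and classifier probabilities, not the output of any real system, and `precision_recall` is a helper defined only for this example.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    predicted_pos = np.sum(y_pred == 1)
    actual_pos = np.sum(y_true == 1)
    precision = tp / predicted_pos if predicted_pos > 0 else 0.0
    recall = tp / actual_pos if actual_pos > 0 else 0.0
    return precision, recall

# hypothetical ground truth (1 = music, 0 = speech) and predicted probabilities of music
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
p_music = np.array([0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30, 0.20, 0.10])

# a higher threshold predicts music less often: precision rises, recall falls
for threshold in [0.0, 0.5, 0.75, 0.92]:
    y_pred = (p_music >= threshold).astype(int)
    p, r = precision_recall(y_true, y_pred)
    print("threshold=%.2f  precision=%.2f  recall=%.2f" % (threshold, p, r))
```

A threshold of 0 reproduces the "predict everything as music" system, and the highest threshold reproduces the system that predicts a single instance as music.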

Another way to view precision and recall is through the terminology used for binary classification problems in medical tests, which is in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). It is easy to see that using this terminology precision and recall can be defined as: 

\begin{equation}
Precision = \frac{TP}{TP+FP} 
\end{equation} 

\begin{equation} 
Recall = \frac{TP}{TP+FN} 
\end{equation} 
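
As a quick check of these two formulas, the sketch below counts TP, TN, FP and FN for the two music/speech systems described earlier. The label arrays are constructed for illustration (1 = music, 0 = speech) and `confusion_counts` is a hypothetical helper.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with label 1 as the positive class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# 50 music instances followed by 50 speech instances
y_true = np.array([1] * 50 + [0] * 50)

# system 1: predicts everything as music
all_music = np.ones(100, dtype=int)
# system 2: predicts a single (correct) music instance, everything else as speech
one_music = np.zeros(100, dtype=int)
one_music[0] = 1

for name, y_pred in [("all music", all_music), ("one music", one_music)]:
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print("%s: TP=%d TN=%d FP=%d FN=%d precision=%.2f recall=%.2f"
          % (name, tp, tn, fp, fn, tp / (tp + fp), tp / (tp + fn)))
```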

## Multi-class and multi-label classification

For a multi-class classification problem with K classes we cannot directly calculate retrieval metrics as described. There are three common approaches. In macro-averaging the metrics are computed independently for each class and then averaged (this means that each class is treated equally even though some classes might have more instances/support than others). A micro-average pools the contributions of all classes (true positives, false positives and false negatives) and computes the metric overall, which means that over-represented classes will affect the metrics more. Finally, a weighted average computes the metric for each class and averages the results weighted by the support of each class. 
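
The three averaging schemes can be sketched from a confusion matrix as follows. The 3-class matrix is made up for illustration (rows are true classes, columns are predicted classes), `averaged_f1` is a hypothetical helper, and the code assumes every class is predicted at least once so that no division by zero occurs.

```python
import numpy as np

def averaged_f1(cm):
    """Per-class, macro, micro and support-weighted F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp      # belonging to the class but predicted elsewhere
    support = cm.sum(axis=1)      # number of true instances per class
    per_class = 2 * tp / (2 * tp + fp + fn)
    macro = per_class.mean()
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    weighted = np.sum(per_class * support) / support.sum()
    return per_class, macro, micro, weighted

# made-up unbalanced example: one class with 100 instances, two with 10 each
cm = [[90, 5, 5],
      [4, 6, 0],
      [6, 0, 4]]
per_class, macro, micro, weighted = averaged_f1(cm)
print("per-class F1:", np.round(per_class, 2))
print("macro=%.2f micro=%.2f weighted=%.2f" % (macro, micro, weighted))
```

In this example the large, well-classified class dominates the micro and weighted averages, while the macro average is pulled down by the two small classes.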

In multi-label classification (auto-tagging) each instance can be assigned more than one label. The approach to calculating retrieval metrics is similar: macro-averaging means that retrieval metrics are calculated separately for each tag column and then averaged, and micro-averaging means that retrieval metrics are calculated over the entire matrix of predictions. 
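
A minimal sketch of the two averages for auto-tagging, using a made-up binary tag matrix (rows are tracks, columns are tags). The tag names and the helper `f1_from_counts` are assumptions for this example only.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# hypothetical tags, one column each: ['guitar', 'vocals', 'fast']
y_true = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [1, 1, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# counts per tag column
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

# macro: one F1 per tag column, then the average over tags
macro_f1 = np.mean([f1_from_counts(t, p, n) for t, p, n in zip(tp, fp, fn)])
# micro: pool the counts over the entire matrix, then a single F1
micro_f1 = f1_from_counts(tp.sum(), fp.sum(), fn.sum())

print("macro F1 = %.2f, micro F1 = %.2f" % (macro_f1, micro_f1))
```

The rare tag that is never predicted correctly lowers the macro average much more than the micro average.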

Let's look at a particular example from musical genre classification. First audio features are extracted for the GTZAN dataset. 




In [7]:


import glob
import librosa
import numpy as np

fnames = glob.glob("/Users/georgetzanetakis/data/sound/genres/*/*.wav")

genres = ['classical', 'country', 'disco', 'hiphop', 'jazz', 'rock', 'blues', 'reggae', 'pop', 'metal']

# allocate matrix for audio features and target 
audio_features = np.zeros((len(fnames), 40))
target = np.zeros(len(fnames))

# compute the features 
for (i,fname) in enumerate(fnames): 
    print("Processing %d %s" % (i, fname))
    # the genre label is inferred from the file path 
    for (label,genre) in enumerate(genres): 
        if genre in fname: 
            audio, srate = librosa.load(fname)
            # 20 MFCCs per frame (the librosa default), summarized over time 
            # by their mean and standard deviation: 40 features per track 
            mfcc_matrix = librosa.feature.mfcc(y=audio, sr=srate)
            mean_mfcc = np.mean(mfcc_matrix,axis=1)
            std_mfcc = np.std(mfcc_matrix, axis=1)
            audio_fvec = np.hstack([mean_mfcc, std_mfcc])
            audio_features[i] = audio_fvec
            target[i] = label
            break

print(audio_features.shape)


Processing 0 /Users/georgetzanetakis/data/sound/genres/pop/pop.00027.wav
...
Processing 952 /Users/georgetzanetakis/data/sound/genres/jazz/jazz.00041.wav

We can view the confusion matrix and classification report with micro and macro average retrieval metrics. You can observe from the confusion matrix that classical is the easiest genre to classify, with 89/100 instances classified correctly. This is reflected in the corresponding f1-score of 0.89. Because the classes are balanced there is essentially no difference between macro and micro averaging. 

In [8]:

from sklearn.model_selection import cross_val_predict
from sklearn import svm, metrics
clf = svm.SVC(gamma='scale', kernel='linear')

# perform 10-fold cross-validation to calculate accuracy and confusion matrix 
predicted = cross_val_predict(clf, audio_features, target, cv=10)

print("Confusion matrix:\n%s" % metrics.confusion_matrix(target, predicted))
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(target, predicted, target_names=genres)))

Confusion matrix:
[[89  1  1  0  8  0  0  1  0  0]
 [ 2 61  4  0  4  8 11  5  4  1]
 [ 1  9 52  9  0 12  3  7  6  1]
 [ 0  2  9 62  0  3  3 16  1  4]
 [ 7  6  2  0 76  6  2  1  0  0]
 [ 0  7 17  4  3 41 17  4  1  6]
 [ 0 18  8  1  9  8 46  2  0  8]
 [ 1  8  9 22  1  5  7 40  6  1]
 [ 0  4 10  5  2  2  0  4 73  0]
 [ 0  2  4  7  0  9  1  1  0 76]]
Classification report for classifier SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False):
              precision    recall  f1-score   support

   classical       0.89      0.89      0.89       100
     country       0.52      0.61      0.56       100
       disco       0.45      0.52      0.48       100
      hiphop       0.56      0.62      0.59       100
        jazz       0.74      0.76      0.75       100
        rock       0.44      0.41      0.42       100
     

In [9]:
# merge classical and jazz into a single genre with more support (200 instances) than the other genres (100 instances)
# the names follow the order of the remaining labels (0,1,2,3,5,6,7,8,9) in the original genres list 
new_genres = ['classical_jazz', 'country', 'disco', 'hiphop', 'rock', 'blues', 'reggae', 'pop', 'metal']
new_target = [0 if k==4 else k for k in target]
print(new_genres)

['classical_jazz', 'country', 'disco', 'hiphop', 'rock', 'blues', 'reggae', 'pop', 'metal']


In [10]:
# perform 10-fold cross-validation to calculate accuracy and confusion matrix 
predicted = cross_val_predict(clf, audio_features, new_target, cv=10)

# the confusion matrix is computed against the original 10-genre target, so row 4 (jazz) 
# shows the jazz tracks being absorbed by the merged class 0 and column 4 is empty 
print("Confusion matrix:\n%s" % metrics.confusion_matrix(target, predicted))
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(new_target, predicted, target_names=new_genres)))

Confusion matrix:
[[95  1  1  0  0  1  0  2  0  0]
 [ 7 59  3  0  0  9 11  6  4  1]
 [ 2  8 52  9  0 12  3  7  6  1]
 [ 0  2  9 62  0  3  3 16  1  4]
 [83  5  2  0  0  5  4  1  0  0]
 [ 7  6 17  4  0 36 17  4  3  6]
 [ 6 15 10  1  0  8 48  5  0  7]
 [ 2  7 10 23  0  4  7 40  6  1]
 [ 2  4 10  5  0  2  0  4 73  0]
 [ 0  2  4  7  0  9  1  1  0 76]]
Classification report for classifier SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False):
                precision    recall  f1-score   support

classical_jazz       0.87      0.89      0.88       200
       country       0.54      0.59      0.56       100
         disco       0.44      0.52      0.48       100
        hiphop       0.56      0.62      0.59       100
          rock       0.40      0.36      0.38       100
         blues       0.51      0.48     

Notice that because of the unbalanced support of the different classes after merging classical and jazz the values of the micro-average f1-score and macro-average f1-score are different. 
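
The difference can be checked directly from the confusion matrix printed above. This is a minimal sketch, assuming rows are true labels and columns are predictions (the scikit-learn convention); the jazz row (index 4) is folded into class 0 and the empty index 4 is dropped to obtain the 9-class matrix.

```python
import numpy as np

# confusion matrix copied from the output above (rows use the original 10 genres)
cm10 = np.array([[95,  1,  1,  0, 0,  1,  0,  2,  0,  0],
                 [ 7, 59,  3,  0, 0,  9, 11,  6,  4,  1],
                 [ 2,  8, 52,  9, 0, 12,  3,  7,  6,  1],
                 [ 0,  2,  9, 62, 0,  3,  3, 16,  1,  4],
                 [83,  5,  2,  0, 0,  5,  4,  1,  0,  0],
                 [ 7,  6, 17,  4, 0, 36, 17,  4,  3,  6],
                 [ 6, 15, 10,  1, 0,  8, 48,  5,  0,  7],
                 [ 2,  7, 10, 23, 0,  4,  7, 40,  6,  1],
                 [ 2,  4, 10,  5, 0,  2,  0,  4, 73,  0],
                 [ 0,  2,  4,  7, 0,  9,  1,  1,  0, 76]])

# fold the jazz row into the merged class 0, then drop row and column 4
cm = cm10.copy()
cm[0] += cm[4]
cm = np.delete(np.delete(cm, 4, axis=0), 4, axis=1)

tp = np.diag(cm)
support = cm.sum(axis=1)        # true instances per class
predicted = cm.sum(axis=0)      # predicted instances per class

per_class_f1 = 2 * tp / (support + predicted)
macro_f1 = per_class_f1.mean()
micro_f1 = tp.sum() / cm.sum()  # for single-label data micro-F1 equals accuracy

print("macro F1 = %.3f, micro F1 = %.3f" % (macro_f1, micro_f1))
```

The merged class has twice the support and a high f1-score, so it pulls the micro average above the macro average.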

## MIR tasks with time markers 

Finally, let's look at a last usage of retrieval metrics (precision, recall and f-measure): tasks where the predicted output and the ground truth are sets of time markers corresponding to things like structure boundaries, beat locations, downbeats etc. 

In this case a time tolerance window is typically defined, so that if a time marker is predicted near (within that window of) a ground truth marker it is considered correct or relevant. 

One can easily see how over-segmentation will affect precision. For example, consider a predicted beat track output that is twice as fast as the ground truth. All the ground truth markers will be matched (recall will be 1) but only half of the predicted markers will be correct (precision 0.5). Alternatively, if the predicted beat track output marks only every second ground truth beat, then the precision will be 1 (all the predicted beats are correct) but the recall will be 0.5 (half of the ground truth beat markers are predicted). 

Again, the f1-score provides a balance between precision and recall. As an example, if the marker detection depends on some parameters, we can perform a grid search over their values and select the setting that provides the best f1-score. 
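
Such a grid search could look like the following sketch. Everything here is hypothetical: the "detector" is just a list of candidate times with made-up strengths, the only parameter is a strength threshold, and `marker_f1` is a helper defined for this example with a tolerance of +/-1.

```python
import numpy as np

def marker_f1(ref_times, est_times, tolerance=1.0):
    """F1 for time markers; a marker counts as matched within +/- tolerance."""
    if len(ref_times) == 0 or len(est_times) == 0:
        return 0.0
    est_hit = [np.any(np.abs(e - ref_times) <= tolerance) for e in est_times]
    ref_hit = [np.any(np.abs(r - est_times) <= tolerance) for r in ref_times]
    precision, recall = np.mean(est_hit), np.mean(ref_hit)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref_times = np.array([0, 4, 8, 12, 16])

# made-up detector output: candidate marker times and their strengths
cand_times = np.array([0, 2, 4, 7, 10, 12, 14, 16])
cand_strength = np.array([0.9, 0.2, 0.8, 0.6, 0.3, 0.7, 0.1, 0.5])

# grid search: keep the threshold that gives the best F1
best_threshold, best_f1 = None, -1.0
for threshold in [0.0, 0.25, 0.45, 0.65, 0.85]:
    est_times = cand_times[cand_strength >= threshold]
    f1 = marker_f1(ref_times, est_times)
    print("threshold=%.2f  markers=%d  F1=%.2f" % (threshold, len(est_times), f1))
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print("best threshold:", best_threshold)
```

Low thresholds over-segment (low precision) and high thresholds under-segment (low recall); the f1-score peaks in between.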

Depending on the application domain we might be more interested in precision or recall. For example, in an automatic video generation system for music that takes into account segmentation boundaries, precision might be more important, as we want to make sure that if a boundary is detected it is a true boundary with high probability. At the same time we might not care as much about recall, because if a boundary is missed occasionally the video will still look ok. 


In [11]:
# Let's consider a simple toy example for calculating the retrieval metrics based on time markers 
# To make things simple the tolerance is +/-1 i.e an estimated time marker is considered correct if it is 
# within +/-1 of a reference ground truth marker. 

import numpy as np

ref_times = np.array([0, 4, 8, 12, 16])
est_times = np.array([0, 9, 11, 18])


# precision is the number of "correct" estimated times divided by the number of estimates 
precision = 0 
for e in est_times: 
    diff = np.abs(e - ref_times)
    # count each estimate at most once, even if it is close to several references 
    if np.any(diff <= 1): 
        precision += 1
precision /= len(est_times)
print(precision)

# Note that the precision is 0.75 because the 4th estimated time marker (18) is not within +/-1 of any ground truth marker 


# recall is the fraction of the reference time markers that are matched by an estimate 
recall = 0 
for r in ref_times:
    diff = np.abs(r - est_times)
    # count each reference marker at most once 
    if np.any(diff <= 1): 
        recall += 1 
recall /= len(ref_times)
print(recall)

# Note that the recall is 0.6 because out of the 5 reference time markers only 3 have an 
# estimated marker within +/-1 


def f1_score(precision, recall): 
    return 2 * ((precision * recall) / (precision+recall))

f1 = f1_score(precision,recall)
print("F1 = %.2f\n" % f1)

0.75
0.6
F1 = 0.67



In [12]:
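def segment_metrics(ref_times, est_times, tolerance=1): 
    # precision: fraction of estimated markers within the tolerance of some reference marker 
    correct_est = 0 
    for e in est_times: 
        if np.any(np.abs(e - ref_times) <= tolerance): 
            correct_est += 1
    precision = correct_est / len(est_times)
    
    # recall: fraction of reference markers within the tolerance of some estimated marker 
    matched_ref = 0 
    for r in ref_times:
        if np.any(np.abs(r - est_times) <= tolerance): 
            matched_ref += 1 
    recall = matched_ref / len(ref_times)

    # guard against division by zero when nothing matches 
    if precision + recall == 0: 
        return (precision, recall, 0.0)
    f1 = 2 * ((precision * recall) / (precision+recall))
    return (precision, recall, f1)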

segment_metrics(ref_times, est_times)

(0.75, 0.6, 0.6666666666666665)

In [13]:
# high precision by undersegmentation 
ref_times = np.array([0, 4, 8, 12, 16])
est_times = np.array([9, 11])
segment_metrics(ref_times, est_times)

(1.0, 0.4, 0.5714285714285715)

In [14]:
# high recall by oversegmentation 
ref_times = np.array([0, 4, 8, 12, 16])
est_times = np.array([0,2,4,6,8,10,12,14,16])
segment_metrics(ref_times, est_times)

(0.5555555555555556, 1.0, 0.7142857142857143)

## MIR tasks with pair-wise labels

Another use of retrieval metrics is when there are cluster/segment labels per frame, as in chord detection or structure segmentation. This evaluation combines information from the precision of the boundaries with a check that corresponding segments are marked by the same label. 

The input consists of two sequences of labels (one corresponding to the reference ground truth and one 
corresponding to the estimated labels) using the same time granularity - typically frames or beats. The vocabulary of the labels can be different between the two segmentations. The retrieval metrics are computed over all possible pairs with a pair considered valid/correct/relevant if both items share a label. 
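
One common way to read this definition, which the sketch below assumes, is that a pair of positions is "relevant" when both positions share a label in the reference and "retrieved" when both share a label in the estimate. The function name `pairwise_metrics` and the two toy label sequences are made up for illustration.

```python
import numpy as np

def pairwise_metrics(ref_labels, est_labels):
    """Pairwise precision, recall and F1 for two equal-length label sequences."""
    ref = np.asarray(ref_labels)
    est = np.asarray(est_labels)
    # same_ref[i, j] is True when positions i and j share a reference label
    same_ref = ref[:, None] == ref[None, :]
    same_est = est[:, None] == est[None, :]
    # consider each unordered pair of distinct positions once (i < j)
    upper = np.triu(np.ones((len(ref), len(ref)), dtype=bool), k=1)
    n_ref = np.sum(same_ref & upper)               # relevant pairs
    n_est = np.sum(same_est & upper)               # retrieved pairs
    n_both = np.sum(same_ref & same_est & upper)   # relevant and retrieved
    precision = n_both / n_est if n_est > 0 else 0.0
    recall = n_both / n_ref if n_ref > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# the two segmentations use different label vocabularies
ref = ['A', 'A', 'A', 'B', 'B', 'B']
est = ['x', 'x', 'y', 'y', 'z', 'z']
print(pairwise_metrics(ref, est))
```

Because only agreement within each sequence is compared, the two label vocabularies never need to be mapped onto each other. Here the estimate over-segments the reference, so precision (2/3) is higher than recall (1/3).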














