<a href="https://colab.research.google.com/github/jdavibedoya/f0-jingju/blob/master/Singing_pitch_estimation_in_jingju_music.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Singing pitch estimation in jingju music
This notebook is designed to:
1. Extract singing pitch estimations in jingju music using three different algorithms (CREPE, pYIN and MELODIA).
2. Evaluate the accuracy of these pitch estimations.
3. Compare the performance of the algorithms used.

Singing pitch estimation may refer to either monophonic melody extraction (a single unaccompanied singer) or predominant melody extraction from polyphonic music signals (a singer with musical accompaniment).
- For monophonic melody extraction, this notebook uses the recordings from the <a href = " https://zenodo.org/record/832736 ">Jingju a cappella singing pitch contour segmentation ground truth dataset</a> (hereafter the a cappella recordings).
- For predominant melody extraction, it uses the recordings studied in <a href = " https://repositori.upf.edu/handle/10230/34975 ">Comparison of the singing style of two jingju schools</a> (hereafter the commercial recordings).

Note: Each cell begins with a comment that explains what is done in that cell.

In [0]:
# installing the required packages
%pip install --upgrade tensorflow  # if you don't already have tensorflow >= 2.0.0
%pip install crepe
%pip install essentia

In [0]:
# imports
import os
import shutil
import csv
import numpy as np
import matplotlib.pyplot as plt
import mir_eval
import crepe
import warnings
warnings.filterwarnings('ignore')

from essentia.standard import PitchYinProbabilistic, MonoLoader, PitchMelodia, EqualLoudness
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
main_dir = "drive/Shared drives/MIR Project/"

Mounted at /content/drive/


## A cappella

Ideally, the recordings would come with metadata and annotations expressed in terms of the jingju musical system. Such annotations would also make it possible to study singing pitch estimation with respect to the elements of that system, which is not addressed here.

The a cappella recordings lack these metadata and annotations, but the <a href = " https://zenodo.org/record/3251761 ">Jingju a Cappella Recordings Collection</a> (JaCRC), which includes these recordings (under different file names) and many more, provides them.

The following cell defines and executes a function that matches the <a href = " https://zenodo.org/record/ ">a cappella recordings</a> with the appropriate <a href = " https://zenodo.org/record/3251761 ">JaCRC</a> recordings using a simple similarity function (cosine of the angle between audio vectors). The results of this matching are stored in the `file_matching.csv` file.
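
As a minimal illustration of the similarity measure described above (not the notebook's matching function itself), the following sketch computes the cosine of the angle between two equal-length sample vectors. The helper name `cosine_similarity` and the toy signals are assumptions made for this example, and only the standard library is used.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length sample vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a silent signal is treated as "no match"
    return dot / (norm_x * norm_y)

# Toy signals (hypothetical): the same waveform at a different gain, and inverted.
signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
louder_copy = [2 * s for s in signal]
inverted = [-s for s in signal]

print(cosine_similarity(signal, louder_copy))  # close to 1: gain does not matter
print(cosine_similarity(signal, inverted))     # close to -1: opposite polarity
```

Because the measure is normalized, two copies of the same recording at different levels still score close to 1, whereas a time offset between the copies would lower the score.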

In [0]:
# function that matches the a cappella recordings with the appropriate JaCRC recordings using a cosine similarity function
unlabeled_audio = main_dir + "raw data/Audio_melody"               

def match_audio(x, unlabeled):
    '''
    x: unlabeled audio
    unlabeled: unlabeled audio location
    '''
    labeled_audio = main_dir + "raw data/JaCRC"
    better_match = [unlabeled, None, 0] # [unlabeled, labeled, similarity]
    max_similarity = 0
    fs = 44100
    for root, dirs, files in os.walk(labeled_audio):
        if root == labeled_audio: # exclude Accompaniment and Mixing folders
            for file in files:
                if file.endswith('.wav') or file.endswith('.WAV'): # WAV files
                    file_name = os.path.join(root,file)
                    y = MonoLoader(filename = file_name, downmix = 'mix', sampleRate = fs)() # load labeled audio 
                    if (np.abs(len(x)-len(y)) < 10): # only compare audios whose lengths differ by fewer than 10 samples
                        minor_length = min(len(x), len(y))
                        # trim local copies so that x keeps its full length for the next comparison
                        x_trim = x[:minor_length]
                        y_trim = y[:minor_length]
                        similarity = np.dot(x_trim, y_trim) / ( np.linalg.norm(x_trim)*np.linalg.norm(y_trim) ) # cosine of the angle between audio vectors
                        if similarity  > max_similarity:
                            max_similarity = similarity 
                            better_match[1] = file_name
                            better_match[2] = similarity 
    return better_match

# executing the match_audio function in the a cappella recordings
matches = 0
with open(main_dir + 'file_matching.csv', mode='w') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',')
  
    for root, dirs, files in os.walk(unlabeled_audio):
        for file in files:
            if file.endswith('.wav'): # WAV files
                file_name = os.path.join(root,file)
                x = MonoLoader(filename = file_name)()
                match = match_audio(x, file_name)
                print(file, match)
                if match[2] != 0:
                    matches += 1
                    csv_writer.writerow(match)

print('{} Matches'.format(matches))

44 Matches


The following two cells copy the pitch track annotations and their respective a cappella recordings from the `raw data` folder, which contains the original datasets, and then rename them using the `file_matching.csv` file.
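
The renaming relies on deriving file names from the paths stored in `file_matching.csv`. A rough sketch of that derivation follows; the helper names and the example paths are hypothetical, and `pathlib` is used here instead of the string splitting in the cells below.

```python
from pathlib import PurePosixPath

def pitchtrack_name(audio_path):
    """Annotation file name built from the parent folder and the audio file stem."""
    p = PurePosixPath(audio_path)
    return f"{p.parent.name}_{p.stem}_pitchtrack.csv"

def renamed_csv(jacrc_path):
    """Target annotation name: the stem of the matched JaCRC recording plus '.csv'."""
    return PurePosixPath(jacrc_path).stem + ".csv"

# Hypothetical row of file_matching.csv: [unlabeled path, matched JaCRC path, similarity]
row = ["raw data/Audio_melody/female/song01.wav", "raw data/JaCRC/recording_A.wav", "0.99"]
print(pitchtrack_name(row[0]))  # female_song01_pitchtrack.csv
print(renamed_csv(row[1]))      # recording_A.csv
```

Unlike `split(".")[-2]`, `stem` keeps any extra dots that appear inside a file name, so the two approaches only agree for names with a single dot.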

In [0]:
# copying the pitch track annotations and renaming them using the file_matching.csv file
test_csv = main_dir + "a cappella/f0_ground_truth/"
#os.mkdir(test_csv)
count_copied_and_renamed = 0

# copying csv files
ground_truth_folder = main_dir + "raw data/SMC2016-master/dataset/groundtruth"
for root, dirs, files in os.walk(ground_truth_folder):
    for file in files:
        if file.endswith('pitchtrack.csv'): # pitch track CSV files
            file_name = os.path.join(root,file)
            shutil.copy(file_name, test_csv)

# renaming csv files
with open(main_dir + 'file_matching.csv', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        source_name = row[0].split("/")[-2] + "_" + row[0].split("/")[-1].split(".")[-2] + "_pitchtrack.csv"
        source = test_csv + source_name
        if os.path.isfile(source): # check if file exists
            count_copied_and_renamed += 1
            target = test_csv + row[1].split("/")[-1].split(".")[-2] + ".csv"
            os.rename(source, target)

print('{} csv files copied and renamed'.format(count_copied_and_renamed))

41 csv files copied and renamed


In [0]:
# copying the a cappella recordings and renaming them using the file_matching.csv file
test_audio = main_dir + "a cappella/audio/"
test_csv = main_dir + "a cappella/f0_ground_truth/"
#os.mkdir(test_audio)
count_copied_and_renamed = 0

with open(main_dir + 'file_matching.csv', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        if row[1] != "":
            pitch_track = row[1].split("/")[-1].split(".")[-2] + ".csv"
            source = test_csv + pitch_track
            if os.path.isfile(source): # check if pitch track exists
                count_copied_and_renamed += 1
                copy = shutil.copy(row[0], test_audio)
                target = test_audio + row[1].split("/")[-1].split(".")[-2] + ".wav"
                os.rename(copy, target)

print('{} wav files copied and renamed'.format(count_copied_and_renamed))

41 wav files copied and renamed


### pYIN

The <a href = " https://essentia.upf.edu/reference/std_PitchYinProbabilistic.html ">pYIN</a> algorithm computes the pitch track of a mono audio signal.

The following cell runs pYIN on the a cappella recordings for three frame sizes `[512, 1024, 2048]` and three hop sizes `[256, 441, 512]`. For each recording, it then stores the prediction with the best `Raw Pitch Accuracy` in the `a cappella/pYin/` folder.
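
The hop size determines the time stamp assigned to each pitch estimate: frame `i` is placed at `i * hopSize / fs` seconds. The sketch below, with the hypothetical helper `frame_times`, shows the spacing that the three hop sizes produce at a sample rate of 44100 Hz.

```python
def frame_times(n_frames, hop_size, sample_rate=44100):
    """Time stamp (in seconds) of each analysis frame."""
    return [i * hop_size / sample_rate for i in range(n_frames)]

for hop in (256, 441, 512):
    times_ms = [round(t * 1000, 2) for t in frame_times(3, hop)]
    print(hop, times_ms)
```

A hop of 441 samples corresponds to exactly 10 ms at 44100 Hz, which makes it convenient when the ground truth is sampled on a 10 ms grid.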

Here, the best `Raw Pitch Accuracy` performance for each audio is determined with the `melody.evaluate` method of the `mir_eval` library. A tolerance of 50 cents is used.
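
To make the metric concrete, here is a simplified sketch of what Raw Pitch Accuracy measures: the fraction of voiced reference frames whose estimate lies within the tolerance, with the distance expressed in cents. It is not the `mir_eval` implementation, which also aligns the two series in time; this version assumes both sequences already share the same time grid, and the function name and toy values are hypothetical.

```python
import math

def raw_pitch_accuracy(ref_hz, est_hz, cent_tolerance=50.0):
    """Fraction of voiced reference frames estimated within cent_tolerance.

    A reference value of 0 marks an unvoiced frame. Negative estimates are
    treated as pitch guesses for frames flagged as unvoiced, so their absolute
    value is compared; an estimate of 0 always counts as a miss.
    """
    voiced = [(r, abs(e)) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    hits = 0
    for r, e in voiced:
        if e > 0 and abs(1200.0 * math.log2(e / r)) < cent_tolerance:
            hits += 1
    return hits / len(voiced)

ref = [220.0, 220.0, 0.0, 440.0]     # hypothetical ground truth (Hz)
est = [221.0, 233.0, 100.0, -445.0]  # hypothetical estimate (Hz)
for tol in (10, 25, 50):
    print(tol, round(raw_pitch_accuracy(ref, est, tol), 3))
```

In this toy case the first frame is off by roughly 8 cents, the second by roughly 99 cents and the last by roughly 20 cents, so the score drops from 2/3 to 1/3 when the tolerance is tightened from 25 to 10 cents.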


In [0]:
# executing and storing pYin - a cappella
test_csv = main_dir + "a cappella/f0_ground_truth/"
test_audio = main_dir + "a cappella/audio/"
pYin_csv = main_dir + "a cappella/pYin/"
#os.mkdir(pYin_csv)

frameSizes = [512, 1024, 2048]
hopSizes = [256, 441, 512]
fs = 44100
count_predictions = 0

for root, dirs, files in os.walk(test_csv):
    for file in files:
        file_name = os.path.join(root,file)
        # reading ground-truth
        gt_time = []
        gt_pitch = []
        with open(file_name, mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                gt_time.append(float(row[1]))
                gt_pitch.append(float(row[2]))

        best_score = 0
        x = MonoLoader(filename = test_audio + file.split(".")[-2] + ".wav", downmix = 'mix', sampleRate = fs)() # load labeled audio
        for frameSize in frameSizes:
            for hopSize in hopSizes:
                # executing pYin
                pYIN = PitchYinProbabilistic(frameSize = frameSize, hopSize = hopSize, preciseTime = True, outputUnvoiced = 'negative')
                pyin_pitch , pyin_probabilities = pYIN(x)
                pyin_time = np.arange(0,len(pyin_pitch))*hopSize/fs

                # pyin_pitch[np.argwhere(pyin_probabilities < 0.75)] = 0 # discard prediction with confidence < 0.75
                scores = mir_eval.melody.evaluate(gt_time, gt_pitch, pyin_time, pyin_pitch, cent_tolerance=50)
                if scores['Raw Pitch Accuracy'] > best_score:
                    best_score = scores['Raw Pitch Accuracy']
                    best_frameSize = frameSize
                    best_hopSize = hopSize
                    best_pyin_time = pyin_time
                    best_pyin_pitch = pyin_pitch
                    best_pyin_probabilities = pyin_probabilities
        
        # storing pYIN prediction
        with open(pYin_csv + file.split(".")[-2] + '.' + str(best_frameSize) + '.' + str(best_hopSize) + ".csv", 'w') as csv_file:
            count_predictions +=1
            csv_writer = csv.writer(csv_file, delimiter=',')
            for i in range(len(best_pyin_pitch)):
                csv_writer.writerow([best_pyin_time[i], best_pyin_pitch[i], best_pyin_probabilities[i]])

print('{} pYIN predictions stored'.format(count_predictions))

41 pYIN predictions stored


The next cell computes the pYIN performance on the a cappella recordings, using the predictions stored by the previous cell. The metric is the `Raw Pitch Accuracy`, computed with the `mir_eval.melody.evaluate` method at tolerances of 10, 25 and 50 cents. The reported mean and standard deviation are weighted by the number of ground-truth frames of each recording.
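
The evaluation cell weights each recording's score by the length of its ground truth (`len(gt_time)`), so longer recordings count more. The following sketch reproduces that weighted mean and standard deviation in plain Python; the function name and the two example recordings are hypothetical.

```python
import math

def weighted_mean_std(values, weights):
    """Weighted mean and weighted (population) standard deviation."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)

scores = [0.90, 0.80]    # hypothetical per-recording Raw Pitch Accuracy
n_frames = [3000, 1000]  # hypothetical ground-truth lengths used as weights
mean, std = weighted_mean_std(scores, n_frames)
print("{0:.3f} ± {1:.3f}".format(mean, std))
```

With these weights the mean is pulled toward the longer recording (0.875 instead of the unweighted 0.85).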

In [0]:
# evaluating pYin - a cappella
test_csv = main_dir + "a cappella/f0_ground_truth/"
pYin_csv = main_dir + "a cappella/pYin/"

cent_tolerances = [10, 25, 50]
pYin_RPA = {10:[[],[]], 25:[[],[]], 50:[[],[]]} # raw pitch accuracy

for root, dirs, files in os.walk(pYin_csv):
    for file in files:
        file_name = os.path.join(root,file)
        # reading pYin prediction
        pYin_time = []
        pYin_pitch = []
        pYin_probabilities = []
        with open(file_name, mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                pYin_time.append(float(row[0]))
                pYin_pitch.append(float(row[1]))
                pYin_probabilities.append(float(row[2]))

        pYin_time = np.array(pYin_time)
        pYin_pitch = np.array(pYin_pitch)
        pYin_probabilities = np.array(pYin_probabilities)

        # reading ground-truth
        gt_time = []
        gt_pitch = []
        with open(test_csv + file.split('.')[-4] + '.csv', mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                gt_time.append(float(row[1]))
                gt_pitch.append(float(row[2]))
  
        # pyin_pitch[np.argwhere(pyin_probabilities < 0.75)] = 0 # discard prediction with confidence < 0.75    
        for cent_tolerance in cent_tolerances:
            scores = mir_eval.melody.evaluate(gt_time, gt_pitch, pYin_time, pYin_pitch, cent_tolerance=cent_tolerance)
            pYin_RPA[cent_tolerance][0].append( scores['Raw Pitch Accuracy'] )
            pYin_RPA[cent_tolerance][1].append( len(gt_time) )

print('pYin - a cappella audio')
for cent_tolerance in cent_tolerances:
    RPA_mean = np.average( pYin_RPA[cent_tolerance][0], weights = pYin_RPA[cent_tolerance][1] )
    RPA_std = np.sqrt( np.average((pYin_RPA[cent_tolerance][0] - RPA_mean)**2, weights = pYin_RPA[cent_tolerance][1]) )
    print(str(cent_tolerance) + ' cents \tRaw Pitch Accuracy: ', "{0:.3f}".format(RPA_mean), ' ± ', "{0:.3f}".format(RPA_std))

pYin - a cappella audio
10 cents 	Raw Pitch Accuracy:  0.705  ±  0.088
25 cents 	Raw Pitch Accuracy:  0.912  ±  0.050
50 cents 	Raw Pitch Accuracy:  0.971  ±  0.017


### CREPE

<a href = " https://github.com/marl/crepe ">CREPE </a> is a monophonic pitch tracker based on a deep convolutional neural network operating directly on the time-domain waveform input.

The following cell runs CREPE on the a cappella recordings for three step sizes `[5, 10, 15]` (in milliseconds), and then stores, for each recording, the prediction with the best `Raw Pitch Accuracy` in the `a cappella/crepe/` folder.
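
The per-recording selection used throughout this notebook follows a simple "keep the best configuration" pattern. The sketch below isolates that pattern; `pick_best` and the scores are hypothetical stand-ins for running the tracker and evaluating it. It starts from negative infinity rather than 0, so that a configuration is always selected even when every score is 0.

```python
def pick_best(candidates, score_fn):
    """Return (best_candidate, best_score); ties keep the earliest candidate."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Hypothetical Raw Pitch Accuracy obtained for each step size (ms).
hypothetical_scores = {5: 0.97, 10: 0.98, 15: 0.96}
best_step, best_score = pick_best([5, 10, 15], hypothetical_scores.get)
print(best_step, best_score)
```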

Here, the best `Raw Pitch Accuracy` performance for each audio is determined with the `melody.evaluate` method of the `mir_eval` library. A tolerance of 50 cents is used.


In [0]:
# executing and storing crepe - a cappella
test_csv = main_dir + "a cappella/f0_ground_truth/"
test_audio = main_dir + "a cappella/audio/"
crepe_csv = main_dir + "a cappella/crepe/"
os.makedirs(crepe_csv, exist_ok=True) # create the output folder if it does not exist yet

fs = 44100
steps = [5, 10, 15]
count_predictions = 0

for root, dirs, files in os.walk(test_csv):
    for file in files:
        file_name = os.path.join(root,file)
        # reading ground-truth
        gt_time = []
        gt_pitch = []
        with open(file_name, mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                gt_time.append(float(row[1]))
                gt_pitch.append(float(row[2]))

        best_score = 0
        audio = MonoLoader(filename = test_audio + file.split(".")[-2] + ".wav", downmix = 'mix', sampleRate = fs)() # load labeled audio 
        for step in steps:
            # executing crepe
            crp_time, crp_frequency, crp_confidence, activation = crepe.predict(audio, fs, step_size= step, viterbi=True, verbose=0)

            #crp_freq[np.argwhere(crp_confidence < 0.75)] = 0 # discard prediction with confidence < 0.75
            scores = mir_eval.melody.evaluate(gt_time, gt_pitch, crp_time, crp_frequency, cent_tolerance=50)
            if scores['Raw Pitch Accuracy'] > best_score:
                best_score = scores['Raw Pitch Accuracy']
                best_step = step
                best_crp_time = crp_time
                best_crp_frequency = crp_frequency
                best_crp_confidence = crp_confidence

        # storing crepe output
        with open(crepe_csv + file.split(".")[-2] + '.' + str(best_step) + ".csv", mode='w') as csv_file:
            count_predictions +=1
            csv_writer = csv.writer(csv_file, delimiter=',')
            for i in range(len(best_crp_time)):
                csv_writer.writerow([best_crp_time[i], best_crp_frequency[i], best_crp_confidence[i]])

print('{} CREPE predictions stored'.format(count_predictions))

41 CREPE predictions stored


The next cell computes the CREPE performance on the a cappella recordings, using the predictions stored by the previous cell. The metric is the `Raw Pitch Accuracy`, computed with the `mir_eval.melody.evaluate` method at tolerances of 10, 25 and 50 cents.
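
For intuition about the three tolerances, a cent is one hundredth of an equal-tempered semitone, so an interval of `c` cents corresponds to a frequency ratio of `2 ** (c / 1200)`. The sketch below converts each tolerance into the frequency band it allows around a reference pitch; the helper names and the 440 Hz reference are chosen only for illustration.

```python
def cents_to_ratio(cents):
    """Frequency ratio corresponding to an interval in cents."""
    return 2.0 ** (cents / 1200.0)

def tolerance_band_hz(freq_hz, cents):
    """Lowest and highest frequency within +/- cents of freq_hz."""
    ratio = cents_to_ratio(cents)
    return freq_hz / ratio, freq_hz * ratio

for tolerance in (10, 25, 50):
    low, high = tolerance_band_hz(440.0, tolerance)
    print("{} cents: {:.2f} Hz to {:.2f} Hz".format(tolerance, low, high))
```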

In [0]:
# evaluating crepe - a cappella
test_csv = main_dir + "a cappella/f0_ground_truth/"
crepe_csv = main_dir + "a cappella/crepe/"

cent_tolerances = [10, 25, 50]
crp_RPA = {10:[[],[]], 25:[[],[]], 50:[[],[]]} # raw pitch accuracy

for root, dirs, files in os.walk(crepe_csv):
    for file in files:
        file_name = os.path.join(root,file)
        # reading crepe prediction
        crp_time = []
        crp_freq = []
        crp_confidence = []
        with open(file_name, mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                crp_time.append(float(row[0]))
                crp_freq.append(float(row[1]))
                crp_confidence.append(float(row[2]))

        crp_time = np.array(crp_time)
        crp_freq = np.array(crp_freq)
        crp_confidence = np.array(crp_confidence)

        # reading ground-truth
        gt_time = []
        gt_pitch = []
        with open(test_csv + file.split('.')[-3] + '.csv', mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                gt_time.append(float(row[1]))
                gt_pitch.append(float(row[2]))
  
        #crp_freq[np.argwhere(crp_confidence < 0.75)] = 0 # discard prediction with confidence < 0.75    
        for cent_tolerance in cent_tolerances:
            scores = mir_eval.melody.evaluate(gt_time, gt_pitch, crp_time, crp_freq, cent_tolerance=cent_tolerance)
            crp_RPA[cent_tolerance][0].append( scores['Raw Pitch Accuracy'] )
            crp_RPA[cent_tolerance][1].append( len(gt_time) )

print('crepe - a cappella audio')
for cent_tolerance in cent_tolerances:
    RPA_mean = np.average( crp_RPA[cent_tolerance][0], weights = crp_RPA[cent_tolerance][1] )
    RPA_std = np.sqrt( np.average((crp_RPA[cent_tolerance][0] - RPA_mean)**2, weights = crp_RPA[cent_tolerance][1]) )
    print(str(cent_tolerance) + ' cents \tRaw Pitch Accuracy: ', "{0:.3f}".format(RPA_mean), ' ± ', "{0:.3f}".format(RPA_std))

crepe - a cappella audio
10 cents 	Raw Pitch Accuracy:  0.804  ±  0.060
25 cents 	Raw Pitch Accuracy:  0.942  ±  0.023
50 cents 	Raw Pitch Accuracy:  0.977  ±  0.012


## Commercial

The commercial recordings used in <a href = " https://repositori.upf.edu/handle/10230/34975 ">Comparison of the singing style of two jingju schools</a> are identified by a MusicBrainz ID. With the recording information available on MusicBrainz, the corresponding audio files can be found in the <a href = " https://zenodo.org/record/1475846 ">Jingju Audio Recordings Collection</a>.


### MELODIA

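The <a href = "https://essentia.upf.edu/reference/std_PredominantPitchMelodia.html ">MELODIA</a> algorithm estimates the fundamental frequency of the predominant melody from polyphonic music signals. Note that the cell below instantiates Essentia's `PitchMelodia`, the MELODIA-based estimator intended for monophonic signals, rather than the `PredominantPitchMelodia` algorithm that the link documents.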

The following cell runs MELODIA on the commercial recordings for four frame sizes `[512, 1024, 2048, 4096]` and five hop sizes `[128, 256, 441, 512, 1024]`. For each recording, it then stores the prediction with the best `Raw Pitch Accuracy` in the `commercial/melodia/` folder.
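
The two parameter lists define a grid of 4 × 5 = 20 configurations per recording. As a small sketch, the grid can be enumerated with `itertools.product`; the check for hops larger than the frame is an addition of this example, not something the notebook does.

```python
from itertools import product

frame_sizes = [512, 1024, 2048, 4096]
hop_sizes = [128, 256, 441, 512, 1024]

# Every (frame size, hop size) pair, in the order of the nested loops below.
grid = list(product(frame_sizes, hop_sizes))

# A hop larger than the frame means consecutive frames do not overlap
# and some samples between them are never analysed.
with_gaps = [(frame, hop) for frame, hop in grid if hop > frame]

print(len(grid), "configurations")
print("hop larger than frame:", with_gaps)
```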

Here, the best `Raw Pitch Accuracy` performance for each audio is determined with the `melody.evaluate` method of the `mir_eval` library. A tolerance of 50 cents is used.

In [0]:
# executing and storing melodia - commercial
test_txt = main_dir + "commercial/f0_ground_truth/"
test_audio = main_dir + "commercial/audio/"
melodia_csv = main_dir + "commercial/melodia/"
os.makedirs(melodia_csv, exist_ok=True) # create the output folder if it does not exist yet

frameSizes = [512, 1024, 2048, 4096]
hopSizes = [128, 256, 441, 512, 1024]
fs = 44100
count_predictions = 0

for root, dirs, files in os.walk(test_txt):
    for file in files:
        file_name = os.path.join(root,file)
        # reading ground-truth
        gt_time, gt_pitch = mir_eval.io.load_time_series(file_name)

        best_score = 0
        x = MonoLoader(filename = test_audio + file.split(".")[-2] + ".flac", downmix = 'mix', sampleRate = fs)() # load labeled audio
        x = EqualLoudness()(x)
        for frameSize in frameSizes:
            for hopSize in hopSizes:
                # executing melodia
                melodia = PitchMelodia(guessUnvoiced = True, frameSize = frameSize, hopSize = hopSize, sampleRate=fs)
                melodia_pitch , melodia_confidence = melodia(x)
                melodia_time = np.arange(0,len(melodia_pitch))*hopSize/fs

                # melodia_pitch[np.argwhere(melodia_confidence < 0.75)] = 0 # discard prediction with confidence < 0.75
                scores = mir_eval.melody.evaluate(gt_time, gt_pitch, melodia_time, melodia_pitch, cent_tolerance=50)
                if scores['Raw Pitch Accuracy'] > best_score:
                    best_score = scores['Raw Pitch Accuracy']
                    best_frameSize = frameSize
                    best_hopSize = hopSize     
                    best_melodia_time = melodia_time
                    best_melodia_pitch = melodia_pitch
                    best_melodia_confidence = melodia_confidence
        
        # storing melodia prediction
        with open(melodia_csv + file.split(".")[-2] + '.' + str(best_frameSize) + '.' + str(best_hopSize) + ".csv", 'w') as csv_file:
            count_predictions +=1
            csv_writer = csv.writer(csv_file, delimiter=',')
            for i in range(len(best_melodia_pitch)):
                csv_writer.writerow([best_melodia_time[i], best_melodia_pitch[i], best_melodia_confidence[i]])

print('{} melodia predictions stored'.format(count_predictions))

8 melodia predictions stored


The next cell computes the MELODIA performance on the commercial recordings, using the predictions stored by the previous cell. The metric is the `Raw Pitch Accuracy`, computed with the `mir_eval.melody.evaluate` method at tolerances of 10, 25 and 50 cents.
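
The stored predictions encode the selected parameters in the file name, as `<recording>.<frameSize>.<hopSize>.csv`. A possible way to recover the three parts is sketched below; the helper name and the example file names are hypothetical. Splitting from the right keeps recording names that themselves contain dots intact.

```python
def parse_prediction_name(file_name):
    """Split '<recording>.<frameSize>.<hopSize>.csv' into its three parts."""
    recording, frame_size, hop_size, _extension = file_name.rsplit(".", 3)
    return recording, int(frame_size), int(hop_size)

print(parse_prediction_name("recording_A.2048.128.csv"))
print(parse_prediction_name("take.1.2048.128.csv"))
```

The CREPE predictions carry a single parameter (`<recording>.<step>.csv`), so they would need `rsplit(".", 2)` instead.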

In [0]:
# evaluating melodia - commercial
test_txt = main_dir + "commercial/f0_ground_truth/"
melodia_csv = main_dir + "commercial/melodia/"

cent_tolerances = [10, 25, 50]
melodia_RPA = {10:[[],[]], 25:[[],[]], 50:[[],[]]} # raw pitch accuracy

for root, dirs, files in os.walk(melodia_csv):
    for file in files:
        file_name = os.path.join(root,file)
        # reading melodia prediction
        melodia_time = []
        melodia_pitch = []
        melodia_confidence = []
        with open(file_name, mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                melodia_time.append(float(row[0]))
                melodia_pitch.append(float(row[1]))
                melodia_confidence.append(float(row[2]))

        melodia_time = np.array(melodia_time)
        melodia_pitch = np.array(melodia_pitch)
        melodia_confidence = np.array(melodia_confidence)

        # reading ground-truth
        gt_time, gt_pitch = mir_eval.io.load_time_series(test_txt + file.split('.')[-4] + '.txt')
  
        # melodia_pitch[np.argwhere(melodia_confidence < 0.75)] = 0 # discard prediction with confidence < 0.75    
        for cent_tolerance in cent_tolerances:
            scores = mir_eval.melody.evaluate(gt_time, gt_pitch, melodia_time, melodia_pitch, cent_tolerance=cent_tolerance)
            melodia_RPA[cent_tolerance][0].append( scores['Raw Pitch Accuracy'] )
            melodia_RPA[cent_tolerance][1].append( len(gt_time) )

print('melodia - commercial')
for cent_tolerance in cent_tolerances:
    RPA_mean = np.average( melodia_RPA[cent_tolerance][0], weights = melodia_RPA[cent_tolerance][1] )
    RPA_std = np.sqrt( np.average((melodia_RPA[cent_tolerance][0] - RPA_mean)**2, weights = melodia_RPA[cent_tolerance][1]) )
    print(str(cent_tolerance) + ' cents \tRaw Pitch Accuracy: ', "{0:.3f}".format(RPA_mean), ' ± ', "{0:.3f}".format(RPA_std))

melodia - commercial
10 cents 	Raw Pitch Accuracy:  0.695  ±  0.084
25 cents 	Raw Pitch Accuracy:  0.775  ±  0.107
50 cents 	Raw Pitch Accuracy:  0.794  ±  0.115


### CREPE

Although CREPE is not designed to estimate the predominant melody from polyphonic music signals, it is applied here to the commercial recordings to provide a reference against which the MELODIA performance can be compared. The same approach is followed, with a different dataset (<a href = " http://mac.citi.sinica.edu.tw/ikala/" >iKala</a>), in <a href = " https://arxiv.org/abs/1807.03046 ">Deep Learning for Singing Processing: Achievements, Challenges and Impact on Singers and Listeners</a>.

The following cell runs CREPE on the commercial recordings for eight step sizes `[5, 10, 15, 20, 25, 30, 35, 40]` (in milliseconds), and then stores, for each recording, the prediction with the best `Raw Pitch Accuracy` in the `commercial/crepe/` folder.

Here, the best `Raw Pitch Accuracy` performance for each audio is determined with the `melody.evaluate` method of the `mir_eval` library. A tolerance of 50 cents is used.
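
The cells in this notebook contain a commented-out line that discards predictions with a confidence below 0.75 by setting their frequency to 0. A minimal sketch of that masking step is given below; the function name and the toy values are hypothetical, and plain lists are used instead of NumPy arrays.

```python
def mask_low_confidence(frequencies, confidences, threshold=0.75):
    """Set to 0 every frequency whose confidence is below the threshold."""
    return [f if c >= threshold else 0.0
            for f, c in zip(frequencies, confidences)]

freqs = [220.0, 230.0, 440.0]  # hypothetical pitch estimates (Hz)
confs = [0.90, 0.50, 0.75]     # hypothetical confidences
print(mask_low_confidence(freqs, confs))
```

Since an estimate of 0 cannot match a voiced reference frame, enabling such a mask trades pitch accuracy on low-confidence frames for fewer spurious estimates, which may matter more on accompanied recordings.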

In [0]:
# executing and storing crepe - commercial
test_txt = main_dir + "commercial/f0_ground_truth/"
test_audio = main_dir + "commercial/audio/"
crepe_csv = main_dir + "commercial/crepe/"
os.makedirs(crepe_csv, exist_ok=True) # create the output folder if it does not exist yet

fs = 44100
steps = [5, 10, 15, 20, 25, 30, 35, 40]
count_predictions = 0

for root, dirs, files in os.walk(test_txt):
    for file in files:
        file_name = os.path.join(root,file)
        # reading ground-truth
        gt_time, gt_pitch = mir_eval.io.load_time_series(file_name)

        best_score = 0
        audio = MonoLoader(filename = test_audio + file.split(".")[-2] + ".flac", downmix = 'mix', sampleRate = fs)() # load labeled audio 
        for step in steps:
            # executing crepe
            crp_time, crp_frequency, crp_confidence, activation = crepe.predict(audio, fs, step_size= step, viterbi=True, verbose=0)

            #crp_freq[np.argwhere(crp_confidence < 0.75)] = 0 # discard prediction with confidence < 0.75
            scores = mir_eval.melody.evaluate(gt_time, gt_pitch, crp_time, crp_frequency, cent_tolerance=50)
            if scores['Raw Pitch Accuracy'] > best_score:
                best_score = scores['Raw Pitch Accuracy']
                best_step = step
                best_crp_time = crp_time
                best_crp_frequency = crp_frequency
                best_crp_confidence = crp_confidence

        # storing crepe output
        with open(crepe_csv + file.split(".")[-2] + '.' + str(best_step) + ".csv", mode='w') as csv_file:
            count_predictions +=1
            csv_writer = csv.writer(csv_file, delimiter=',')
            for i in range(len(best_crp_time)):
                csv_writer.writerow([best_crp_time[i], best_crp_frequency[i], best_crp_confidence[i]])

print('{} crepe predictions stored'.format(count_predictions))

8 crepe predictions stored


The next cell computes the CREPE performance on the commercial recordings, using the predictions stored by the previous cell. The metric is the `Raw Pitch Accuracy`, computed with the `mir_eval.melody.evaluate` method at tolerances of 10, 25 and 50 cents.

In [0]:
# evaluating crepe - commercial
test_txt = main_dir + "commercial/f0_ground_truth/"
crepe_csv = main_dir + "commercial/crepe/"

cent_tolerances = [10, 25, 50]
crp_RPA = {10:[[],[]], 25:[[],[]], 50:[[],[]]} # raw pitch accuracy

for root, dirs, files in os.walk(crepe_csv):
    for file in files:
        file_name = os.path.join(root,file)
        # reading crepe prediction
        crp_time = []
        crp_freq = []
        crp_confidence = []
        with open(file_name, mode='r') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                crp_time.append(float(row[0]))
                crp_freq.append(float(row[1]))
                crp_confidence.append(float(row[2]))

        crp_time = np.array(crp_time)
        crp_freq = np.array(crp_freq)
        crp_confidence = np.array(crp_confidence)

        # reading ground-truth
        gt_time, gt_pitch = mir_eval.io.load_time_series(test_txt + file.split('.')[-3] + '.txt')

        #crp_freq[np.argwhere(crp_confidence < 0.75)] = 0 # discard prediction with confidence < 0.75    
        for cent_tolerance in cent_tolerances:
            scores = mir_eval.melody.evaluate(gt_time, gt_pitch, crp_time, crp_freq, cent_tolerance=cent_tolerance)
            crp_RPA[cent_tolerance][0].append( scores['Raw Pitch Accuracy'] )
            crp_RPA[cent_tolerance][1].append( len(gt_time) )

print('crepe - commercial')
for cent_tolerance in cent_tolerances:
    RPA_mean = np.average( crp_RPA[cent_tolerance][0], weights = crp_RPA[cent_tolerance][1] )
    RPA_std = np.sqrt( np.average((crp_RPA[cent_tolerance][0] - RPA_mean)**2, weights = crp_RPA[cent_tolerance][1]) )
    print(str(cent_tolerance) + ' cents \tRaw Pitch Accuracy: ', "{0:.3f}".format(RPA_mean), ' ± ', "{0:.3f}".format(RPA_std))

crepe - commercial
10 cents 	Raw Pitch Accuracy:  0.603  ±  0.084
25 cents 	Raw Pitch Accuracy:  0.802  ±  0.075
50 cents 	Raw Pitch Accuracy:  0.859  ±  0.066
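
As a possible way to carry out the comparison step, the sketch below gathers the weighted mean Raw Pitch Accuracy values printed by the four evaluation cells above and reports the per-tolerance gaps. The dictionary layout and the helper `rpa_gap` are assumptions of this example; the numbers are copied from the outputs above.

```python
# Weighted mean Raw Pitch Accuracy per tolerance (cents), as printed above.
reported_rpa = {
    "a cappella": {
        "pYIN": {10: 0.705, 25: 0.912, 50: 0.971},
        "CREPE": {10: 0.804, 25: 0.942, 50: 0.977},
    },
    "commercial": {
        "MELODIA": {10: 0.695, 25: 0.775, 50: 0.794},
        "CREPE": {10: 0.603, 25: 0.802, 50: 0.859},
    },
}

def rpa_gap(dataset, first, second):
    """Per-tolerance difference (first minus second) on one dataset."""
    a = reported_rpa[dataset][first]
    b = reported_rpa[dataset][second]
    return {tol: round(a[tol] - b[tol], 3) for tol in sorted(a)}

for dataset, algorithms in reported_rpa.items():
    print(dataset)
    for name, scores in algorithms.items():
        row = "  ".join("{} cents: {:.3f}".format(tol, scores[tol]) for tol in sorted(scores))
        print("  {:<8} {}".format(name, row))

print("CREPE - pYIN (a cappella):", rpa_gap("a cappella", "CREPE", "pYIN"))
print("CREPE - MELODIA (commercial):", rpa_gap("commercial", "CREPE", "MELODIA"))
```

On these figures, CREPE scores higher than pYIN at every tolerance on the a cappella recordings, while on the commercial recordings it trails MELODIA at 10 cents and leads at 25 and 50 cents. The two sets differ in size (41 and 8 recordings), and the parameters were selected per recording against the ground truth, so the gaps should be read with that in mind.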
