# AcousticBrainz Duplicate Analysis
by Marc Jones, written for Python 3.6+

In this notebook I analyze duplicate [AcousticBrainz](http://acousticbrainz.org) entries, attempting to identify mislabeled entries by combining metadata with audio descriptors extracted using [Essentia](http://essentia.upf.edu/documentation/documentation.html). A given AcousticBrainz entry can have multiple submissions of extracted data, some of which may be misclassified without the user knowing, because AcousticBrainz does not store audio on its servers. By using the entries' associated metadata and extracted audio data we can look for outliers that may indicate a submission was improperly labeled. I've chosen the following features to analyze for anomalies:

#### Metadata Feature
- track length (in seconds)

#### Tonal Feature
- mean of the harmonic pitch class profile (or 'HPCP')

#### Rhythmic Features
- beats count, bpm, onset rate

#### LowLevel Features
- average loudness, dynamic complexity, mean of the dissonance, mean of the spectral flux, mean of the pitch salience, mean of the spectral complexity, mean of the zero crossing rate (or 'zcr'), & mean of the high frequency content (or 'hfc')

 

In [7]:
# Import necessary libraries and packages
import os, sys, json, requests, tarfile
from collections import Counter
from essentia.standard import *
from essentia import Pool, array
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Downloading the Data
Using the full [dataset](http://www.dtic.upf.edu/~aporter/amplab/2017/ab-duplicates1000-2016-03-02.tar.bz2) (provided by Alastair Porter) of AcousticBrainz Duplicates, we have 1000 AcousticBrainz entries along with their duplicates. For demonstrative purposes a [sample dataset](http://www.dtic.upf.edu/~aporter/amplab/2017/ab-duplicates100-2016-03-02.tar.bz2) is available with just 100 entries. If you don't already have a dataset downloaded in the local directory of this notebook, running the cell below will download and extract it for you.

In [8]:
# sample dataset ~190mb compressed => ~440mb uncompressed
url = 'http://www.dtic.upf.edu/~aporter/amplab/2017/ab-duplicates100-2016-03-02.tar.bz2'
# full dataset ~1.2gb compressed => ~2.4gb uncompressed (uncomment the below line to download the full dataset)
# url = 'http://www.dtic.upf.edu/~aporter/amplab/2017/ab-duplicates1000-2016-03-02.tar.bz2'

filename = url[url.rfind('/')+1:]

# download and extract the compressed data only if neither the archive nor the extracted folder already exists
if not os.path.exists(filename) and not os.path.exists(filename[:-8]):  # [:-8] strips '.tar.bz2'
    # downloading
    print('connecting to:', url)
    print('downloading:', filename, '\n(considering the file size, this may take a while...)')
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # stop here rather than saving an error page as the archive
    with open(filename, 'wb') as f:
        f.write(r.content)
    print('download successful')
    # extracting
    print('extracting data from:', filename)
    with tarfile.open(filename, 'r:bz2') as tar:
        tar.extractall()
    print('extraction complete')

### File Organization
Entries are organized by their [MusicBrainz ID](https://musicbrainz.org/doc/MusicBrainz_Identifier), or __"MBID"__, into folders named after the MBID prefix (its first two characters; e.g. for MBID '[00c47ea6-3a10-4a32-b1f1-990ac756c6a0](http://acousticbrainz.org/00c47ea6-3a10-4a32-b1f1-990ac756c6a0)' the prefix is '00'). Within the aforementioned folders, each submission is distinguished by a version suffix (e.g. an MBID followed by '-19' indicates the 20th version of the entry, because the numbering is zero-based). Each entry is stored as a JSON file with a simple naming convention: 'MBID' + '-##' + '.json', where _##_ is the version number. To ensure this notebook has access to the data, be sure the dataset you intend to use is extracted in the same directory as this notebook.
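
As an illustration of the naming convention above, here is a minimal sketch that splits a path into its MBID and version number. The helper name `split_entry_path` and the example path are hypothetical (the path only follows the described layout; it is not taken from the dataset listing).

```python
import os

def split_entry_path(path):
    """Split '<root>/<prefix>/<MBID>-<n>.json' into (mbid, version_number)."""
    name = os.path.basename(path)              # '<MBID>-<n>.json'
    stem, ext = os.path.splitext(name)         # drop the '.json' extension
    if ext != '.json':
        raise ValueError('not a JSON entry: ' + path)
    # the MBID itself contains hyphens, so split on the LAST one only
    mbid, _, version = stem.rpartition('-')
    return mbid, int(version)

# hypothetical path that follows the layout described above
example = 'ab-duplicates100-2016-03-02/00/00c47ea6-3a10-4a32-b1f1-990ac756c6a0-19.json'
mbid, version = split_entry_path(example)
print(mbid[:2], mbid, version)
```

Note that the cell below keeps a slightly different representation: it uses the full path up to the version suffix as the dictionary key and stores each version as a string such as '-19.json'.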

In [2]:
mbids = {}

# sample dataset
root = 'ab-duplicates100-2016-03-02'
# full dataset : uncomment the below line to use the full dataset
# root = 'ab-duplicates1000-2016-03-02'

# read file paths into a dictionary keyed by each 'MusicBrainz ID' or 'mbid'
for folder in os.listdir(root):
    # ignore hidden files
    if not folder.startswith('.'): 
        for file in os.listdir(root+'/'+folder):
            if not file.endswith('.json'):
                continue # skip hidden or stray files inside the prefix folders
            path = root+'/'+folder+'/'+file 
            mbid = path[:path.rfind('-')] # full path up to (not including) the version suffix
            version = path[path.rfind('-'):] # version suffix as a string, incl. the extension, e.g. '-1.json'
            if mbid in mbids:
                mbids[mbid].append(version)
            else:
                mbids[mbid] = [version]

### Determining Candidates for Being Mislabeled
In determining which entries might have been mislabeled, I've used a 'voting' system in which each chosen metric from every version of an MBID is compared to the mean and standard deviation of that metric across all versions of the same MBID. If an individual value meets the criterion (usually being more than 1.5 standard deviations away from the mean*), the version it belongs to receives a 'vote' by being added to a list of 'mislabeled candidates.' The more votes (occurrences in the list) a version has, the more likely it is to have been mislabeled. Each metric is weighted equally, and the threshold at which a mislabeled candidate makes it into the final list is __7 votes__ out of a possible 13, meaning that over 50% of the criteria metrics were far enough from the mean.

On the sample dataset (100 MBIDs) this analysis can be completed in just under a minute; on the full dataset (1000 MBIDs), expect at least 3-5 minutes of runtime.


_*unlike the other metrics, track length earns a vote when it is more than an arbitrary threshold of 15 seconds away from the mean_
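
The voting rule described above can be sketched in a few lines. This is a simplified illustration, not the cell used for the analysis: the function `count_votes`, its parameters, and the toy values are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def count_votes(metrics, n_std=1.5, abs_thresholds=None):
    """metrics: {metric_name: {version: value}}; returns a Counter of votes per version."""
    abs_thresholds = abs_thresholds or {}
    votes = Counter()
    for name, by_version in metrics.items():
        values = np.array(list(by_version.values()), dtype=float)
        mean, std = values.mean(), values.std()    # NumPy's default is the population std
        # fixed cutoff for metrics such as track length, otherwise n_std * std
        limit = abs_thresholds.get(name, n_std * std)
        for version, value in by_version.items():
            if abs(value - mean) > limit:
                votes[version] += 1
    return votes

# toy values (made up for illustration): version '-4.json' is the odd one out
toy = {
    'track_length': {'-0.json': 200.0, '-1.json': 201.0, '-2.json': 199.0,
                     '-3.json': 200.0, '-4.json': 250.0},
    'bpm': {'-0.json': 120.0, '-1.json': 121.0, '-2.json': 119.0,
            '-3.json': 120.0, '-4.json': 87.0},
}
votes = count_votes(toy, abs_thresholds={'track_length': 15})
print(votes)
```

One property of this rule is worth keeping in mind: with the population standard deviation, a single value can be at most sqrt(N-1) standard deviations from the mean of N values. For N = 2 or 3 that bound is below 1.5, so an MBID needs at least four versions before any of the standard-deviation criteria can cast a vote.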



In [3]:
# output container
filtered_candidates = []

for mbid in mbids:
    mislabeled_candidates = []
    # store metrics for each version of the mbid in a dict where the key = version and the val = metric value
    # example: track_length['-1.json'] = 199.24
    # metadata metrics
    track_length = {} # in seconds
    # tonal metrics
    hpcp_mean = {}
    # rhythm metrics
    beats_count = {} 
    bpm = {}
    onset_rate = {}
    # lowlevel metrics
    average_loudness = {}
    dynamic_complexity = {}
    dissonance_mean = {}
    spectral_flux_mean = {}
    pitch_salience_mean = {}
    spectral_complexity_mean = {}
    zcr_mean = {} # zero crossing rate
    hfc_mean = {} # high frequency content
    
    for version in mbids[mbid]: 
        # load json file into memory and extract the desired metrics for each track
        with open(mbid+version) as f:
            # strict=False lets the parser accept control characters inside JSON strings
            data = json.load(f, strict=False)
        # metadata
        track_length[version] = data['metadata']['audio_properties']['length']
        # tonal
        hpcp_mean[version] = np.mean(data['tonal']['hpcp']['mean'])
        # rhythm
        beats_count[version] = data['rhythm']['beats_count']
        bpm[version] = data['rhythm']['bpm']
        onset_rate[version] = data['rhythm']['onset_rate']
        # lowlevel
        average_loudness[version] = data['lowlevel']['average_loudness']
        dynamic_complexity[version] = data['lowlevel']['dynamic_complexity']
        dissonance_mean[version] = data['lowlevel']['dissonance']['mean']
        spectral_flux_mean[version] = data['lowlevel']['spectral_flux']['mean']
        pitch_salience_mean[version] = data['lowlevel']['pitch_salience']['mean']
        spectral_complexity_mean[version] = data['lowlevel']['spectral_complexity']['mean']
        zcr_mean[version] = data['lowlevel']['zerocrossingrate']['mean']
        hfc_mean[version] = data['lowlevel']['hfc']['mean']
        
    # create numpy arrays (easy mean & std calculation) from the list of values associated with each metric
    # metadata
    all_track_length = np.array(list(track_length.values()))
    # tonal
    all_hpcp_mean = np.array(list(hpcp_mean.values()))
    # rhythm
    all_beats_count = np.array(list(beats_count.values()))
    all_bpm = np.array(list(bpm.values()))
    all_onset_rate = np.array(list(onset_rate.values()))
    # lowlevel
    all_average_loudness = np.array(list(average_loudness.values()))
    all_dynamic_complexity = np.array(list(dynamic_complexity.values()))
    all_dissonance_mean = np.array(list(dissonance_mean.values()))
    all_spectral_flux_mean = np.array(list(spectral_flux_mean.values()))
    all_pitch_salience_mean = np.array(list(pitch_salience_mean.values()))
    all_spectral_complexity_mean = np.array(list(spectral_complexity_mean.values()))
    all_zcr_mean = np.array(list(zcr_mean.values()))
    all_hfc_mean = np.array(list(hfc_mean.values()))
    
    # find the mean and std dev for each metric across all versions                       
    # metadata
    mean_all_track_length = all_track_length.mean()
    std_all_track_length = all_track_length.std()
    # tonal
    mean_all_hpcp_mean = all_hpcp_mean.mean()
    std_all_hpcp_mean = all_hpcp_mean.std()
    # rhythm (perhaps irrelevant because of alternative implementation; see: conclusions/future work)
    mean_all_beats_count = all_beats_count.mean()
    mean_all_bpm = all_bpm.mean()
    mean_all_onset_rate = all_onset_rate.mean()
    std_all_beats_count = all_beats_count.std()
    std_all_bpm = all_bpm.std()
    std_all_onset_rate = all_onset_rate.std()
    # lowlevel
    mean_all_average_loudness = all_average_loudness.mean()
    mean_all_dynamic_complexity = all_dynamic_complexity.mean()
    mean_all_dissonance_mean = all_dissonance_mean.mean()
    mean_all_spectral_flux_mean = all_spectral_flux_mean.mean()
    mean_all_pitch_salience_mean = all_pitch_salience_mean.mean()
    mean_all_spectral_complexity_mean = all_spectral_complexity_mean.mean()
    mean_all_zcr_mean = all_zcr_mean.mean()
    mean_all_hfc_mean = all_hfc_mean.mean()
    std_all_average_loudness = all_average_loudness.std()
    std_all_dynamic_complexity = all_dynamic_complexity.std()
    std_all_dissonance_mean = all_dissonance_mean.std()
    std_all_spectral_flux_mean = all_spectral_flux_mean.std()
    std_all_pitch_salience_mean = all_pitch_salience_mean.std()
    std_all_spectral_complexity_mean = all_spectral_complexity_mean.std()
    std_all_zcr_mean = all_zcr_mean.std()
    std_all_hfc_mean = all_hfc_mean.std()

    # find outliers
    for version in mbids[mbid]:
        # metadata metrics
        if abs(track_length[version]-mean_all_track_length) > 15:
            mislabeled_candidates.append(version)
        # tonal
        if abs(hpcp_mean[version]-mean_all_hpcp_mean) > 1.5*std_all_hpcp_mean:
            mislabeled_candidates.append(version)
        # rhythm
        if abs(beats_count[version]-mean_all_beats_count) > 1.5*std_all_beats_count:
            mislabeled_candidates.append(version)
        if abs(bpm[version]-mean_all_bpm) > 1.5*std_all_bpm:
            mislabeled_candidates.append(version)
        if abs(onset_rate[version]-mean_all_onset_rate) > 1.5*std_all_onset_rate:
            mislabeled_candidates.append(version)
        # lowlevel
        if abs(average_loudness[version]-mean_all_average_loudness) > 1.5*std_all_average_loudness:
            mislabeled_candidates.append(version)
        if abs(dynamic_complexity[version]-mean_all_dynamic_complexity) > 1.5*std_all_dynamic_complexity:
            mislabeled_candidates.append(version)
        if abs(dissonance_mean[version]-mean_all_dissonance_mean) > 1.5*std_all_dissonance_mean:
            mislabeled_candidates.append(version)
        if abs(spectral_flux_mean[version]-mean_all_spectral_flux_mean) > 1.5*std_all_spectral_flux_mean:
            mislabeled_candidates.append(version)
        if abs(pitch_salience_mean[version]-mean_all_pitch_salience_mean) > 1.5*std_all_pitch_salience_mean:
            mislabeled_candidates.append(version)
        if abs(spectral_complexity_mean[version]-mean_all_spectral_complexity_mean) > 1.5*std_all_spectral_complexity_mean:
            mislabeled_candidates.append(version)
        if abs(zcr_mean[version]-mean_all_zcr_mean) > 1.5*std_all_zcr_mean:
            mislabeled_candidates.append(version)
        if abs(hfc_mean[version]-mean_all_hfc_mean) > 1.5*std_all_hfc_mean:
            mislabeled_candidates.append(version)
    
    # count up votes converting list of candidates to dict of versions and the number of votes
    mislabeled_candidates = dict(Counter(mislabeled_candidates))
    for version,votes in mislabeled_candidates.items():
        if votes >= 7: # threshold is 7/13 possible votes
            filtered_candidates.append(mbid+version)

### Runtime Analysis : O(KN) 

_where K = # of MBID & N = # of versions per MBID_

Because the data is stored in the JSON file format, we can easily load the files into memory and access them as a Python dictionary, thus providing a very efficient __O(1)__ access time for the desired metrics. We must iterate over the N versions of each MBID to store the metrics (by MBID) in their respective containers. Summary statistics (mean and standard deviation) are efficiently calculated using NumPy; the conversion [from dict values to list](https://wiki.python.org/moin/TimeComplexity) to NumPy array is __O(N)__. To identify candidates we must iterate once more over each version of a given MBID, comparing the individual metric values against the summary statistics to determine whether the version is a candidate for having been mislabeled, for another runtime of __O(N)__. Finally, votes are counted in __O(N)__ in the container of mislabeled candidates before said candidates are filtered by a threshold and added to the final output.

In [4]:
# write out candidates to file, with a blank line between the versions of different MBIDs
with open('mislabled_candidates.txt','w') as outfile:
    prev = ''
    for i in filtered_candidates:
        if prev[:prev.rfind('-')] not in i:
            outfile.write('\n')
        outfile.write(i+'\n')
        prev = i

### Conclusions and Future Work

The output file titled 'mislabled_candidates.txt' contains a list of filepaths representing potentially mislabeled entries - such that they matched over 50% of the 13 criteria metrics measured and evaluated above. So that the output file is easily readable, there is a line-break in between each set of MBID versions. 
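
For anyone consuming the output file programmatically, the blank-line-separated layout described above can be read back into one list per MBID. The sketch below is only an illustration: `group_candidates` is a hypothetical helper, and the sample text uses placeholder paths rather than real results.

```python
def group_candidates(text):
    """Split blank-line-separated blocks of file paths into one list per MBID."""
    groups = []
    for block in text.split('\n\n'):
        paths = [line for line in block.splitlines() if line.strip()]
        if paths:
            groups.append(paths)
    return groups

# placeholder content in the same layout as the output file (not real results)
sample = 'root/aa/mbid-a-0.json\nroot/aa/mbid-a-3.json\n\nroot/bb/mbid-b-1.json\n'
groups = group_candidates(sample)
print(groups)
```

With the real file, the same function could be applied to the string returned by reading 'mislabled_candidates.txt'.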

Taking a closer look at rhythmic features such as the 'BPM' and 'Beats_Count' metrics, it becomes quite apparent that a difference of more than 1.5 standard deviations from the mean may not be an accurate means of separating improperly labeled entries from the group. These rhythmic features require a bit of nuance when identifying errors because the algorithms used to extract that information are not necessarily precise; for example, a track with a ground truth BPM of 160 might be analyzed and determined to have a BPM of 80. This problem of doubling/halving relative to the ground truth value is also apparent in the Beats_Count metric.

In a future iteration of this 'AcousticBrainz Duplicate Analysis' I will account for the intricacies in determining whether or not the aforementioned rhythmic features are actually evidence of mislabeling. This may include separating the Beats_Count values by K-means clustering, or the BPM values into two bins (0-100 and 100-200), identifying which of the two bins holds the majority of the values, then checking whether the BPM values in the opposite bin, once halved or doubled, are actually closer to the majority mean.
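
One possible shape for the two-bin idea is sketched below. It is only an assumption about how such a correction might look, not an implemented part of this analysis: the function `fold_bpm`, the split at 100 BPM, the tie-breaking toward the upper bin, and the toy values are all hypothetical.

```python
import numpy as np

def fold_bpm(bpms, split=100.0):
    """Halve/double minority-bin BPM values when that moves them closer to the majority mean."""
    bpms = np.asarray(bpms, dtype=float)
    low = bpms < split                                # True for values in the lower bin
    majority_is_low = low.sum() > (~low).sum()        # ties fall to the upper bin
    majority = bpms[low] if majority_is_low else bpms[~low]
    target = majority.mean()
    folded = bpms.copy()
    for i, value in enumerate(bpms):
        if low[i] == majority_is_low:
            continue                                  # already in the majority bin
        candidate = value / 2 if majority_is_low else value * 2
        if abs(candidate - target) < abs(value - target):
            folded[i] = candidate                     # treat as a halving/doubling error
    return folded

# toy values: 80.1 looks like a halved 160.2, while 95.0 stays far from the rest
folded = fold_bpm([160.2, 159.8, 80.1, 161.0, 95.0])
print(folded)
```

After folding, the usual mean and standard deviation test could be applied to the folded values, so that a halved or doubled estimate no longer counts against an otherwise consistent version.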

Lastly, the main block of code accomplishing the analysis is quite verbose; certain variables, such as the NumPy arrays of dict values along with their respective mean and standard deviation calculations, could be compacted by being encapsulated in a dictionary of tuples.
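
A rough sketch of that compaction follows, assuming the same JSON key paths used in the analysis cell. Only three of the thirteen metrics are listed to keep it short; the names `METRIC_PATHS`, `extract`, and `summarize`, as well as the toy documents, are hypothetical.

```python
import numpy as np

# metric name -> path of keys inside the parsed AcousticBrainz JSON document
METRIC_PATHS = {
    'hpcp_mean': ('tonal', 'hpcp', 'mean'),
    'bpm': ('rhythm', 'bpm'),
    'dissonance_mean': ('lowlevel', 'dissonance', 'mean'),
}

def extract(data, path):
    """Follow a key path and reduce the result to a single float."""
    for key in path:
        data = data[key]
    return float(np.mean(data))   # collapses the HPCP vector; scalars pass through unchanged

def summarize(versions):
    """versions: {version: parsed JSON dict} -> {metric: (values_by_version, mean, std)}."""
    summary = {}
    for name, path in METRIC_PATHS.items():
        values = {v: extract(d, path) for v, d in versions.items()}
        arr = np.array(list(values.values()))
        summary[name] = (values, arr.mean(), arr.std())
    return summary

# toy documents containing only the keys used above (values are made up)
versions = {
    '-0.json': {'tonal': {'hpcp': {'mean': [0.1, 0.3]}},
                'rhythm': {'bpm': 120.0},
                'lowlevel': {'dissonance': {'mean': 0.4}}},
    '-1.json': {'tonal': {'hpcp': {'mean': [0.2, 0.4]}},
                'rhythm': {'bpm': 130.0},
                'lowlevel': {'dissonance': {'mean': 0.5}}},
}
summary = summarize(versions)
print(summary['bpm'])
```

With this layout, adding or removing a metric becomes a one-line change to the table, and the outlier test can loop over `summary` instead of repeating one `if` statement per metric.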