# Automated Instrument Identification
#### Nicholas Gibb

## Background

The [NSynth dataset](https://magenta.tensorflow.org/datasets/nsynth) is a large, high-quality, publicly available dataset of annotated musical notes. It consists of 305,979 four-second audio files, each containing a single instrument playing a single note. Each note is annotated with one of the following instrument families:


| Index | ID         |
|-------|------------|
| 0     | bass       |
| 1     | brass      |
| 2     | flute      |
| 3     | guitar     |
| 4     | keyboard   |
| 5     | mallet     |
| 6     | organ      |
| 7     | reed       |
| 8     | string     |
| 9     | synth_lead |
| 10    | vocal      |


The task of this project is to develop an algorithm that classifies which instrument is playing the musical note.

## Methodology

First, the non-audio JSON data from the training and test datasets was inspected and explored (note: the validation data provided in the NSynth dataset was not needed). The training data had 289,205 examples, too many to process in a reasonable time on a standard PC (feature extraction for the reduced set alone took several hours). To reduce the size, 4500 examples were drawn at random from each instrument family. Instrument 9 ("synth_lead") was discarded because it is not present in the test dataset. The training set was thus reduced to 45,000 examples (10 instrument families, 4500 examples each).
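
The balanced subsampling described above can be sketched on a toy table. This is a minimal illustration rather than the notebook's exact code: the toy dataframe, the `n_per_class` name, and the fixed `random_state` are assumptions added here so that the draw is repeatable, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for the NSynth metadata: three families with unequal counts (hypothetical data)
toy = pd.DataFrame({
    "instrument_family": [0] * 6 + [1] * 4 + [9] * 5,
    "pitch": range(15),
})

n_per_class = 3  # the project used 4500

# Drop the class that is absent from the test set, then draw the same
# number of rows from every remaining class.
balanced = (
    toy[toy["instrument_family"] != 9]
    .groupby("instrument_family")
    .sample(n=n_per_class, random_state=0)
)

counts = balanced["instrument_family"].value_counts().sort_index()
print(counts)
```

Filtering before sampling avoids drawing rows that would be discarded immediately; the resulting class balance is the same as sampling first and filtering afterwards.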

Next, audio features were extracted. Features in the JSON annotation, such as pitch and velocity, were not considered. Instead, the Python library librosa was used to extract the following features from the audio itself:

- [Harmonic or percussive](https://librosa.github.io/librosa/generated/librosa.effects.hpss.html)

- [Mel-frequency cepstral coefficients (MFCCs)](https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html)

- [Mel-scaled spectrogram](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html)

- [Chroma energy](https://librosa.github.io/librosa/generated/librosa.feature.chroma_cens.html)

- [Spectral contrast](https://librosa.github.io/librosa/generated/librosa.feature.spectral_contrast.html)

Note: I have no audio engineering background and lacked sufficient time to research the technical significance of these features. However, I encountered other users of the NSynth dataset who found these features to be useful. In particular, [an analysis](https://github.com/NadimKawwa/NSynth) published by GitHub user NadimKawwa was informative and greatly accelerated this analysis.

The five audio features listed above were extracted for each of the 45,000 training examples and 4096 test examples. The instrument family (e.g. "vocal") was extracted from the filename of the audio clip (e.g. "vocal_synthetic_003-092-050"). A dataframe was then constructed using these five features and the instrument family (i.e. the target).
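
As an illustration of reading the label from the filename, here is a small sketch. It assumes that NSynth note names follow the pattern `<family>_<source>_<instrument>-<pitch>-<velocity>` seen in the examples in this document. The helper name `family_from_filename` is hypothetical, and it splits from the right so that a family name containing an underscore (`synth_lead`) stays intact.

```python
INSTRUMENT_FAMILIES = ['bass', 'brass', 'flute', 'guitar', 'keyboard', 'mallet',
                       'organ', 'reed', 'string', 'synth_lead', 'vocal']

def family_from_filename(name):
    """Return (family_name, family_index) for a note name such as
    'vocal_synthetic_003-092-050'."""
    # Split off the last two underscore-separated fields:
    # the source and the instrument-pitch-velocity suffix
    family, _source, _rest = name.rsplit('_', 2)
    return family, INSTRUMENT_FAMILIES.index(family)

print(family_from_filename('vocal_synthetic_003-092-050'))  # ('vocal', 10)
print(family_from_filename('bass_acoustic_000-024-025'))    # ('bass', 0)
```

Unlike a substring search, this raises a `ValueError` for a name whose family is not in the list, which makes malformed filenames visible early.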

Finally, a random forest classifier was trained to predict the instrument family from the audio features. Random forest is a widely used supervised learning algorithm, available in the [scikit-learn Python library](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). It consists of a collection of decision trees, each trained on a different random sample of the data and considering a random subset of features at each split; hence the name *random* forest. Combining the votes of many such trees reduces overfitting, which is a common problem with individual decision trees.
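
To make the idea of a collection of differing trees concrete, the sketch below fits a small forest on synthetic data (generated with scikit-learn's `make_classification`, not NSynth features) and checks that the individual trees do not all make the same predictions. The dataset sizes and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 300 samples, 20 features, 3 classes
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y)

# Each tree is trained on a bootstrap sample and considers a random subset
# of features at each split, so the trees generally disagree on some samples.
per_tree_preds = np.array([tree.predict(X) for tree in forest.estimators_])
n_distinct_trees = len({tuple(p) for p in per_tree_preds})

print("trees in the forest:", len(forest.estimators_))
print("distinct prediction vectors:", n_distinct_trees)
```

The forest's final prediction combines the outputs of all trees, which is what smooths out the errors of any single tree.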


## Results and discussion

The algorithm made 4096 predictions on the test data, of which 2325 were correct and 1771 were incorrect, giving an accuracy of 56.8%. With ten possible instrument families, a uniformly random guess would be right only about 10% of the time, so 56.8% is well above chance.
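
The arithmetic behind these figures can be checked directly. The sketch below recomputes the accuracy from the counts above, and also the accuracy of always guessing the most common test class, a reference point derived here from the per-family test counts listed in the code section of this document.

```python
correct, incorrect = 2325, 1771
total = correct + incorrect
accuracy = correct / total

# Test-set counts per instrument family index (instrument 9 is absent)
test_counts = {0: 843, 1: 269, 2: 180, 3: 652, 4: 766,
               5: 202, 6: 502, 7: 235, 8: 306, 10: 141}

# Guessing uniformly at random among the 10 families present
uniform_baseline = 1 / len(test_counts)
# Always guessing the most common family (index 0, bass)
majority_baseline = max(test_counts.values()) / sum(test_counts.values())

print(f"accuracy:          {accuracy:.1%}")
print(f"uniform baseline:  {uniform_baseline:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
```

Because the test set is not balanced across families, the majority-class figure is the stricter of the two baselines to compare against.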

An extension of this work would be to conduct a sensitivity analysis to determine which features were the most useful. It is possible that a similarly good result could be obtained with fewer features. This would avoid redundant computation and allow the size of the training dataset to be expanded.
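
One way such a sensitivity analysis might start is with the impurity-based importances that scikit-learn's random forest exposes as `feature_importances_`. The sketch below uses synthetic data and made-up column names rather than the NSynth features, and sums importances by column-name prefix to mirror the `mfcc_` and `chroma_` naming used in the code. It is an illustration of the approach, not a result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 6 columns with hypothetical names in two prefix groups
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
columns = ["mfcc_0", "mfcc_1", "mfcc_2", "chroma_0", "chroma_1", "chroma_2"]
X = pd.DataFrame(X, columns=columns)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across all columns
importances = pd.Series(clf.feature_importances_, index=columns)

# Sum importances within each feature group (the text before the last underscore)
by_family = importances.groupby(lambda name: name.rsplit("_", 1)[0]).sum()
print(by_family.sort_values(ascending=False))
```

Impurity-based importances can be misleading when features are correlated, so a follow-up check with `sklearn.inspection.permutation_importance` on held-out data would be a reasonable next step.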

---

## Code

All code that follows is in Python. Code is documented and explained through comments.

### Data exploration and clean up

In [1]:
import numpy as np
import pandas as pd
import librosa
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Load training and test data to dataframe
df_train = pd.read_json("data/nsynth-train/examples.json", orient='index')
df_test = pd.read_json("data/nsynth-test/examples.json", orient='index')

In [3]:
# Inspect the training data
df_train.sample(3)

Unnamed: 0,instrument,instrument_family,instrument_family_str,instrument_source,instrument_source_str,instrument_str,note,note_str,pitch,qualities,qualities_str,sample_rate,velocity
bass_acoustic_000-024-025,70,0,bass,0,acoustic,bass_acoustic_000,239963,bass_acoustic_000-024-025,24,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]",[dark],16000,25
bass_acoustic_000-024-050,70,0,bass,0,acoustic,bass_acoustic_000,168968,bass_acoustic_000-024-050,24,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]",[dark],16000,50
bass_acoustic_000-024-075,70,0,bass,0,acoustic,bass_acoustic_000,222873,bass_acoustic_000-024-075,24,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]",[dark],16000,75


In [4]:
# Inspect the test data
df_test.sample(3)

Unnamed: 0,instrument,instrument_family,instrument_family_str,instrument_source,instrument_source_str,instrument_str,note,note_str,pitch,qualities,qualities_str,sample_rate,velocity
bass_electronic_018-022-100,759,0,bass,1,electronic,bass_electronic_018,158391,bass_electronic_018-022-100,22,"[0, 0, 0, 1, 0, 0, 0, 1, 0, 0]","[fast_decay, percussive]",16000,100
bass_electronic_018-023-025,759,0,bass,1,electronic,bass_electronic_018,202327,bass_electronic_018-023-025,23,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]",[percussive],16000,25
bass_electronic_018-023-075,759,0,bass,1,electronic,bass_electronic_018,286022,bass_electronic_018-023-075,23,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",[],16000,75


In [5]:
# Confirm that the training data has examples from every instrument class.
df_train['instrument_family'].value_counts().reindex(np.arange(0,11,1))

0     65474
1     12675
2      8773
3     32690
4     51821
5     34201
6     34477
7     13911
8     19474
9      5501
10    10208
Name: instrument_family, dtype: int64

In [6]:
# The test data does not have any examples from instrument 9!
df_test['instrument_family'].value_counts().reindex(np.arange(0,11,1))

0     843.0
1     269.0
2     180.0
3     652.0
4     766.0
5     202.0
6     502.0
7     235.0
8     306.0
9       NaN
10    141.0
Name: instrument_family, dtype: float64

In [7]:
# There is a tonne of training data...
df_train.shape

(289205, 13)

In [8]:
df_test.shape

(4096, 13)

In [9]:
# Group the df_train by instrument family, and then sample 4500 from each
samples_per_instrument = 4500
df_train_reduced = df_train.groupby('instrument_family', as_index=False, group_keys=False).apply(
    lambda df: df.sample(samples_per_instrument))
# Remove synth lead (instrument 9) as it is missing from test data
df_train_reduced = df_train_reduced[df_train_reduced['instrument_family'] != 9]
df_train_reduced.head(2)

Unnamed: 0,instrument,instrument_family,instrument_family_str,instrument_source,instrument_source_str,instrument_str,note,note_str,pitch,qualities,qualities_str,sample_rate,velocity
bass_synthetic_113-100-075,862,0,bass,2,synthetic,bass_synthetic_113,273694,bass_synthetic_113-100-075,100,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]",[long_release],16000,75
bass_synthetic_029-046-100,381,0,bass,2,synthetic,bass_synthetic_029,181441,bass_synthetic_029-046-100,46,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]",[dark],16000,100
bass_synthetic_054-065-075,571,0,bass,2,synthetic,bass_synthetic_054,252298,bass_synthetic_054-065-075,65,"[0, 1, 0, 0, 0, 0, 0, 0, 1, 0]","[dark, reverb]",16000,75


In [31]:
# The training dataset is reduced to a more reasonable size
df_train_reduced.shape

(45000, 13)

In [10]:
# Confirming that we have 4500 training examples from each instrument class (except instrument 9)
df_train_reduced['instrument_family'].value_counts()

10    4500
8     4500
7     4500
6     4500
5     4500
4     4500
3     4500
2     4500
1     4500
0     4500
Name: instrument_family, dtype: int64

### Extracting features

In [11]:
# This function takes a file and returns features in an array
def extract_audio_features(file):
   
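    # Load the wav file. Note: librosa.load resamples to 22,050 Hz by default;
    # pass sr=None to keep the file's native rate (16,000 Hz for NSynth).
    y, sr = librosa.load(file)
        
    # Flag the sound as harmonic (1) or percussive (0) by comparing the mean
    # sample value of the two components returned by HPSS
    y_harmonic, y_percussive = librosa.effects.hpss(y)
    harmonic = int(np.mean(y_harmonic) > np.mean(y_percussive))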
        
    # Mel-frequency cepstral coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Temporal averaging
    mfcc=np.mean(mfcc,axis=1)
    
    # Mel-scaled spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,fmax=8000)  
    # Temporal averaging
    mel_spectrogram = np.mean(mel_spectrogram, axis = 1)
    
    # Determine chroma energy
    chroma = librosa.feature.chroma_cens(y=y, sr=sr)
    # Temporal averaging
    chroma = np.mean(chroma, axis = 1)
    
    # Determine spectral contrast
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    # Temporal averaging
    spectral_contrast = np.mean(spectral_contrast, axis= 1)
    
    audio_features = [harmonic, mfcc, mel_spectrogram, chroma, spectral_contrast]
    return audio_features
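
The "temporal averaging" steps in the function above collapse each feature matrix, which has one column per analysis frame, into a single vector per clip. The sketch below shows the shape change on a random array standing in for an MFCC matrix. The frame count of 173 is an arbitrary placeholder, and no audio is loaded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a librosa feature matrix: shape (n_mfcc, n_frames)
fake_mfcc = rng.normal(size=(13, 173))

# Averaging over axis=1 (time) leaves one value per coefficient
mfcc_mean = np.mean(fake_mfcc, axis=1)

print(fake_mfcc.shape, "->", mfcc_mean.shape)
```

Averaging over time yields a fixed-length vector regardless of how many frames a clip has, at the cost of discarding how the sound evolves over the note.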

In [12]:
instrument_types = ['bass', 'brass', 'flute', 'guitar', 'keyboard', 'mallet', 'organ', 'reed', 'string', 'synth_lead', 'vocal']

# Function to extract the instrument family index from a filename
def instrument_code(filename):
    for index, instrument in enumerate(instrument_types):
        if instrument in filename:
            return index
    # Should never get this far
    return None

#### Test data

In [13]:
#create dictionary to store all test features
test_features_dict = {}
filenames_test = df_test.index.tolist()

#loop over every file in the list
for file in filenames_test:
    #extract the features
    file_path = 'data/nsynth-test/audio/'+ file + '.wav'
    audio_features = extract_audio_features(file_path)
    #add dictionary entry
    test_features_dict[file] = audio_features



In [14]:
#convert dict to dataframe
features_test = pd.DataFrame.from_dict(test_features_dict, orient='index',
                                       columns=['harmonic', 'mfcc', 'spectro', 'chroma', 'contrast'])

features_test.head()

Unnamed: 0,harmonic,mfcc,spectro,chroma,contrast
vocal_synthetic_003-092-050,1,"[-373.57192233234747, -33.36105661703314, -27....","[729.4009942434144, 0.0027017951858890115, 0.0...","[0.0015458412067324828, 0.0015458412067324828,...","[51.46653956248316, 14.840266535438966, 25.813..."
bass_synthetic_135-060-127,0,"[-457.00787873883735, 28.898613116509534, 18.3...","[0.3436867655622912, 0.35550269999881584, 0.38...","[0.6756612403346377, 0.5141054400032908, 0.011...","[15.326957907261994, 35.63678924329, 26.317790..."
string_acoustic_056-067-025,0,"[-465.73313659822185, 17.321712690553028, 7.17...","[0.0008105553402491008, 0.00024985939913425024...","[0.18972148689028717, 0.15967222704598705, 0.0...","[18.788956368002847, 31.210328814912906, 34.67..."
organ_electronic_007-019-050,1,"[-358.67841386126287, 184.25233465145075, 100....","[426.2791337777077, 351.4662981416424, 167.739...","[0.001789669933149414, 0.2420925108506149, 0.4...","[13.312955962033707, 16.38899225439365, 16.225..."
organ_electronic_057-056-025,1,"[-406.32360278343657, 96.17874454016462, 36.05...","[0.003399323228392822, 0.027184807515206286, 0...","[0.05217717952335415, 0.077518447971031, 0.027...","[25.755901046219236, 29.918044307829252, 34.57..."


In [15]:
#extract mfccs
mfcc_test = pd.DataFrame(features_test.mfcc.values.tolist(),index=features_test.index)
mfcc_test = mfcc_test.add_prefix('mfcc_')

#extract spectro
spectro_test = pd.DataFrame(features_test.spectro.values.tolist(),index=features_test.index)
spectro_test = spectro_test.add_prefix('spectro_')

#extract chroma
chroma_test = pd.DataFrame(features_test.chroma.values.tolist(),index=features_test.index)
chroma_test = chroma_test.add_prefix('chroma_')

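#extract contrast
contrast_test = pd.DataFrame(features_test.contrast.values.tolist(),index=features_test.index)
# Prefix the contrast columns themselves. Prefixing chroma_test here would
# overwrite them with a copy of the chroma features named contrast_chroma_*.
contrast_test = contrast_test.add_prefix('contrast_')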

#drop the old columns
features_test = features_test.drop(labels=['mfcc', 'spectro', 'chroma', 'contrast'], axis=1)

#concatenate
df_features_test=pd.concat([features_test, mfcc_test, spectro_test, chroma_test, contrast_test],
                           axis=1, join='inner')
df_features_test.head(2)

Unnamed: 0,harmonic,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,...,contrast_chroma_2,contrast_chroma_3,contrast_chroma_4,contrast_chroma_5,contrast_chroma_6,contrast_chroma_7,contrast_chroma_8,contrast_chroma_9,contrast_chroma_10,contrast_chroma_11
vocal_synthetic_003-092-050,1,-373.571922,-33.361057,-27.369076,34.193883,-18.958225,44.740011,-15.474436,4.263017,-0.601554,...,0.001546,0.001546,0.001546,0.001546,0.001546,0.436386,0.718565,0.541034,0.01405,0.000781
bass_synthetic_135-060-127,0,-457.007879,28.898613,18.354827,15.474206,9.697761,7.271243,1.082034,-3.5347,-7.939138,...,0.011721,0.017333,0.02392,0.015035,0.022643,0.032344,0.018143,0.003452,0.00795,0.50095
string_acoustic_056-067-025,0,-465.733137,17.321713,7.174124,5.700109,1.728119,-1.210466,-4.24426,-6.091183,-7.490178,...,0.050275,0.054103,0.100527,0.039607,0.378199,0.558875,0.458417,0.043233,0.044282,0.087742
organ_electronic_007-019-050,1,-358.678414,184.252335,100.074377,45.046813,32.310046,29.813086,20.930312,13.147634,10.103831,...,0.413198,0.194556,0.003137,0.011438,0.421284,0.62536,0.395452,0.008238,0.01024,0.030497
organ_electronic_057-056-025,1,-406.323603,96.178745,36.050094,50.751388,27.297679,-10.885754,-6.829087,-15.531379,-19.470244,...,0.027525,0.08433,0.029006,0.047196,0.055036,0.380008,0.583555,0.52603,0.153189,0.104294


In [16]:
# Extract the instrument from the file name
targets_test = []
for name in df_features_test.index.tolist():
    targets_test.append(instrument_code(name))
# Add new column to dataframe -- the instrument name
df_features_test['targets'] = targets_test

In [17]:
df_features_test.shape

(4096, 167)

#### Training data

In [18]:
# Warning: Executing this cell will take several hours
#create dictionary to store all training features
training_features_dict = {}

filenames_train = df_train_reduced.index.tolist()

#loop over every file in the list
for file in filenames_train:
    #extract the features
    file_path = 'data/nsynth-train/audio/'+ file + '.wav'
    features = extract_audio_features(file_path) 
    #add dictionary entry
    training_features_dict[file] = features

In [19]:
#convert dict to dataframe
features_train = pd.DataFrame.from_dict(training_features_dict, orient='index',
                                       columns=['harmonic', 'mfcc', 'spectro', 'chroma', 'contrast'])

features_train.head(2)

Unnamed: 0,harmonic,mfcc,spectro,chroma,contrast
flute_acoustic_018-088-075,1,"[-417.01840264841826, -9.04578478470823, -46.6...","[0.44449052298927844, 0.04662231259904716, 0.0...","[0.03694305474549144, 0.056594355564031605, 0....","[26.29176465990957, 11.21200285612929, 19.1967..."
brass_acoustic_039-072-050,0,"[-544.3391514128634, 21.051549976198814, -10.9...","[0.004308175531897194, 0.0016578828477332886, ...","[0.30373709210371824, 0.27451049428378804, 0.1...","[24.57603886807756, 21.803992378021853, 26.334..."
string_acoustic_016-040-075,0,"[-276.546715371918, 161.82940587617142, 22.976...","[0.25745008956667503, 5.982371622789718, 1574....","[0.002594858980885267, 0.0, 0.0135968191104551...","[28.367361980031166, 25.17603567661357, 26.274..."
brass_acoustic_000-081-127,1,"[-365.0960537772607, 15.390363008632992, -92.6...","[0.0009537202885831016, 0.0006988966150524941,...","[0.03759062641409593, 0.025828711797790074, 0....","[16.962525271849305, 13.976145287575568, 16.26..."
organ_electronic_102-023-050,1,"[-199.96430395050618, 176.31389145230807, 3.60...","[157.44627480338673, 352.5860168181026, 446.81...","[0.5159334781833347, 0.013697974746562017, 0.0...","[17.81785152367051, 15.668335207848635, 21.530..."


In [20]:
#extract mfccs
mfcc_train = pd.DataFrame(features_train.mfcc.values.tolist(),
                          index=features_train.index)
mfcc_train = mfcc_train.add_prefix('mfcc_')

#extract spectro
spectro_train = pd.DataFrame(features_train.spectro.values.tolist(),
                             index=features_train.index)
spectro_train = spectro_train.add_prefix('spectro_')


#extract chroma
chroma_train = pd.DataFrame(features_train.chroma.values.tolist(),
                            index=features_train.index)
chroma_train = chroma_train.add_prefix('chroma_')


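#extract contrast
contrast_train = pd.DataFrame(features_train.contrast.values.tolist(),
                              index=features_train.index)
# Prefix the contrast columns themselves. Prefixing chroma_train here would
# overwrite them with a copy of the chroma features named contrast_chroma_*.
contrast_train = contrast_train.add_prefix('contrast_')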

#drop the old columns
features_train = features_train.drop(labels=['mfcc', 'spectro', 'chroma', 'contrast'], axis=1)

#concatenate
df_features_train=pd.concat([features_train, mfcc_train, spectro_train, chroma_train, contrast_train],
                           axis=1, join='inner')
df_features_train.head(2)

Unnamed: 0,harmonic,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,...,contrast_chroma_2,contrast_chroma_3,contrast_chroma_4,contrast_chroma_5,contrast_chroma_6,contrast_chroma_7,contrast_chroma_8,contrast_chroma_9,contrast_chroma_10,contrast_chroma_11
flute_acoustic_018-088-075,1,-417.018403,-9.045785,-46.667953,26.53562,-6.161555,23.745347,6.148696,10.404978,-15.271874,...,0.053021,0.475228,0.678006,0.507703,0.014001,0.010619,0.003961,0.000811,0.007598,0.014822
brass_acoustic_039-072-050,0,-544.339151,21.05155,-10.918888,-5.509842,1.169068,0.617842,-0.927245,-0.874914,-1.039703,...,0.164938,0.152696,0.1005,0.055011,0.108957,0.163284,0.043727,0.009289,0.050591,0.157467
string_acoustic_016-040-075,0,-276.546715,161.829406,22.976585,14.003924,11.823727,18.535303,14.77918,4.008552,0.132176,...,0.013597,0.553803,0.665533,0.487591,0.003646,0.0,0.000912,0.0,0.0,0.065584
brass_acoustic_000-081-127,1,-365.096054,15.390363,-92.682955,-0.816596,-35.520141,6.336497,-11.100114,19.258636,18.265654,...,0.031372,0.084329,0.20417,0.084551,0.015903,0.012418,0.507097,0.649947,0.475271,0.017051
organ_electronic_102-023-050,1,-199.964304,176.313891,3.609491,57.997492,8.848982,20.751573,-1.620565,7.390697,2.838251,...,0.006226,0.039657,0.001806,0.076412,0.269692,0.09758,0.0,0.00407,0.436745,0.652498


In [21]:
# Extract the instrument from the file name
targets_train = []
for name in df_features_train.index.tolist():
    targets_train.append(instrument_code(name))

# Add new column to dataframe -- the instrument name
df_features_train['targets'] = targets_train

In [23]:
df_features_train.shape

(45000, 167)

### Random forest model 

In [24]:
#get training and testing data
X_train = df_features_train.drop(labels=['targets'], axis=1)
y_train = df_features_train['targets']

X_test = df_features_test.drop(labels=['targets'], axis=1)
y_test = df_features_test['targets']

In [25]:
#instantiate the random forest
clf_Rf = RandomForestClassifier(n_estimators=20, max_depth=50, warm_start=True)

In [26]:
clf_Rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0, warm_start=True)

In [41]:
# Predict the instrument family for every test example
y_pred_RF = clf_Rf.predict(X_test)
# The algorithm made 4096 predictions, of which 2325 were true, 1771 were false
(y_pred_RF == y_test).value_counts()

True     2325
False    1771
Name: targets, dtype: int64

In [30]:
random_forest_accuracy = np.mean(y_pred_RF == y_test)
print("The accuracy of the random forest algorithm is {0:.1%}".format(random_forest_accuracy))

The accuracy of the random forest algorithm is 56.8%


### References

Nadim Kawwa, *[Instrument classification on the NSynth dataset using supervised learning and CNNs](https://github.com/NadimKawwa/NSynth)*. 2019.