# Automated Instrument Identification
#### Nicholas Gibb

## Background

The [NSynth dataset](https://magenta.tensorflow.org/datasets/nsynth) is a large, high-quality, publicly available dataset of annotated musical notes. It consists of 305,979 four-second audio files, each containing a single instrument playing a single note. Each note is annotated with one of the following instrument families:


| Index | ID         |
|-------|------------|
| 0     | bass       |
| 1     | brass      |
| 2     | flute      |
| 3     | guitar     |
| 4     | keyboard   |
| 5     | mallet     |
| 6     | organ      |
| 7     | reed       |
| 8     | string     |
| 9     | synth_lead |
| 10    | vocal      |


The task of this project is to develop an algorithm that identifies the instrument family playing each musical note.

## Methodology

First, the non-audio JSON metadata from the training and test datasets was inspected (note: the validation data provided in the NSynth dataset was not needed). The training data has 289,205 examples, which is too much to process in a reasonable time on a standard PC. To reduce the size, 4500 examples were randomly drawn from each instrument family. Instrument 9 ("synth_lead") was then discarded because it is not present in the test dataset. The training set was thus reduced to 45,000 examples (10 instrument families, 4500 examples from each).
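
The balanced downsampling described above can be sketched on a toy table. This is a minimal illustration rather than the notebook's exact cell: the `toy` frame, the small `n_per_class` value, and the fixed `random_state` (added here so that the draw is repeatable) are all assumptions.

```python
import pandas as pd

# Toy stand-in for the NSynth annotations: three families of unequal size.
toy = pd.DataFrame({
    "instrument_family": [0] * 10 + [1] * 6 + [9] * 4,
    "pitch": range(20),
})

n_per_class = 3  # the notebook uses 4500

# Draw the same number of rows from every family, then drop family 9.
balanced = (
    toy.groupby("instrument_family", group_keys=False)
       .sample(n=n_per_class, random_state=0)
)
balanced = balanced[balanced["instrument_family"] != 9]

print(balanced["instrument_family"].value_counts().sort_index())
```

Fixing the seed means that a later re-run selects the same audio files, which matters when feature extraction takes hours.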

Next, audio features were extracted. Features in the JSON annotation, such as pitch and velocity, were not considered. Instead, the Python library Librosa was used to extract audio features that were expected to have more predictive power:

1. [Harmonic or percussive](https://librosa.github.io/librosa/generated/librosa.effects.hpss.html)
2. [Mel-frequency cepstral coefficients (MFCCs)](https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html)
3. [Mel-scaled spectrogram](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html)
4. [Chroma energy](https://librosa.github.io/librosa/generated/librosa.feature.chroma_cens.html)
5. [Spectral contrast](https://librosa.github.io/librosa/generated/librosa.feature.spectral_contrast.html)

Note: I have no audio engineering background and lacked sufficient time to research the technical significance of these features. However, I encountered other users of the NSynth dataset who found these features to be useful. In particular, [an analysis](https://github.com/NadimKawwa/NSynth) published by GitHub user NadimKawwa was informative and greatly accelerated this analysis.

The five audio features listed above were extracted for each of the 45,000 training examples and 4096 test examples. The instrument family (e.g. "vocal") was extracted from the filename of the audio clip (e.g. "vocal_synthetic_003-092-050"). A dataframe was then constructed using these five features and the instrument family (i.e. the target).
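
To make the shape of one feature row concrete, the sketch below averages each feature matrix over time and concatenates the results. It uses random NumPy arrays as stand-ins for the Librosa outputs. The frame count is hypothetical, and the row counts (13 MFCCs, 128 mel bands, 12 chroma bins, and 7 spectral-contrast rows from Librosa's default of 6 bands plus one) are assumptions based on the parameters used in this notebook and on Librosa defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 173  # hypothetical frame count for one clip

# Stand-ins for librosa outputs, each shaped (n_coefficients, n_frames).
mfcc = rng.normal(size=(13, n_frames))      # n_mfcc=13
mel = rng.random(size=(128, n_frames))      # n_mels=128
chroma = rng.random(size=(12, n_frames))    # 12 pitch classes
contrast = rng.random(size=(7, n_frames))   # 6 bands + 1 (assumed default)
harmonic = 1                                # binary flag

# Temporal averaging: collapse the frame axis, one value per coefficient.
parts = [m.mean(axis=1) for m in (mfcc, mel, chroma, contrast)]

# One flat row per clip: 1 + 13 + 128 + 12 + 7 values.
row = np.concatenate([[harmonic], *parts])
print(row.shape)
```

Averaging over time discards how the sound evolves within the four-second clip, but it yields a fixed-length vector that a tabular classifier can consume directly.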

Finally, a random forest classifier was trained to predict the instrument family from the audio features. Random forest is a widely used supervised learning algorithm, available in the [scikit-learn Python library](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). It consists of a collection of decision trees. Each tree is trained on a random bootstrap sample of the data and considers a random subset of the features at each split, hence the name *random* forest. Averaging over many such trees reduces overfitting, which is a common problem with single decision trees.
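
As a rough illustration of the difference between one tree and a forest, the sketch below fits both on a synthetic ten-class problem. The data from `make_classification` and all of its parameters are assumptions chosen for the example and have nothing to do with NSynth. Only `n_estimators=20` mirrors the setting used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 10-class problem standing in for the audio features.
X, y = make_classification(
    n_samples=2000, n_features=40, n_informative=15,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single unpruned tree versus a forest of 20 randomized trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)

print("single tree    :", tree.score(X_test, y_test))
print("random forest  :", forest.score(X_test, y_test))
print("trees in forest:", len(forest.estimators_))
```

On data of this kind the forest will often score higher on held-out examples than the single tree, though the size of the gap depends on the data.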


## Results and discussion

The algorithm made 4096 predictions on the test data. In all, 2369 predictions were correct and 1727 were incorrect, giving an accuracy of 57.8%. With ten possible instrument families, a uniformly random guess would be right about 10% of the time, and always predicting the most common test class (bass, 843 of 4096 examples) would be right about 20.6% of the time. An accuracy of 57.8% is therefore well above both baselines.
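
The arithmetic behind these figures can be checked directly. The sketch below uses the counts recorded in the notebook output (2369 correct out of 4096 test examples) and the bass count from the test-set class table. The variable names are illustrative only.

```python
# Counts taken from the notebook output and the test-set class table.
n_test = 4096
n_correct = 2369
n_classes = 10
n_majority = 843  # bass, the most common family in the test set

accuracy = n_correct / n_test
chance = 1 / n_classes
majority_baseline = n_majority / n_test

print(f"accuracy          : {accuracy:.1%}")
print(f"uniform guessing  : {chance:.1%}")
print(f"always-bass guess : {majority_baseline:.1%}")
```

Because the test classes are imbalanced, the majority-class figure is a stricter baseline than uniform guessing.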

An extension of this work would be to conduct a sensitivity analysis to determine which features were the most useful. It is possible that a similarly good result could be obtained with fewer features. This would avoid redundant computation and free up resources to enlarge the training set. The model might also improve substantially if more of the available training data were used, at the cost of more computing resources.
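
One possible starting point for such an analysis is the impurity-based importance score that scikit-learn exposes on a fitted forest. The sketch below is run on synthetic data, not on the NSynth features. The dataset, the `feat_i` column names, and the forest size are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 5 columns are the informative ones.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=0,
    n_classes=3, shuffle=False, random_state=0,
)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances: one score per column, normalized to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]

for i in ranking[:5]:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```

Applied to this project, the same attribute on the fitted classifier would give one score per feature column, and the scores could be summed per feature group (MFCC, spectrogram, chroma, contrast) to compare the groups.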

---

## Code
All code that follows is in Python. There are four required libraries: numpy, pandas, librosa, and scikit-learn (imported as `sklearn`). Code is documented and explained through comments.

### Data exploration and clean up

In [1]:
import numpy as np
import pandas as pd
import librosa
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Load training and test data to dataframe
df_train = pd.read_json("data/nsynth-train/examples.json", orient='index')
df_test = pd.read_json("data/nsynth-test/examples.json", orient='index')

In [3]:
# Inspect the training data
df_train.sample(2)

|  | instrument | instrument_family | instrument_family_str | instrument_source | instrument_source_str | instrument_str | note | note_str | pitch | qualities | qualities_str | sample_rate | velocity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bass_synthetic_084-039-127 | 726 | 0 | bass | 2 | synthetic | bass_synthetic_084 | 283156 | bass_synthetic_084-039-127 | 39 | [1, 0, 1, 0, 0, 0, 1, 0, 0, 0] | [bright, distortion, nonlinear_env] | 16000 | 127 |
| guitar_acoustic_033-098-025 | 679 | 3 | guitar | 0 | acoustic | guitar_acoustic_033 | 94583 | guitar_acoustic_033-098-025 | 98 | [0, 0, 0, 1, 0, 0, 0, 1, 0, 0] | [fast_decay, percussive] | 16000 | 25 |


In [4]:
# Inspect the test data
df_test.sample(2)

|  | instrument | instrument_family | instrument_family_str | instrument_source | instrument_source_str | instrument_str | note | note_str | pitch | qualities | qualities_str | sample_rate | velocity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mallet_acoustic_047-102-050 | 488 | 5 | mallet | 0 | acoustic | mallet_acoustic_047 | 96966 | mallet_acoustic_047-102-050 | 102 | [0, 0, 0, 1, 0, 0, 0, 1, 1, 0] | [fast_decay, percussive, reverb] | 16000 | 50 |
| organ_electronic_028-068-100 | 440 | 6 | organ | 1 | electronic | organ_electronic_028 | 16849 | organ_electronic_028-068-100 | 68 | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | [dark] | 16000 | 100 |


In [5]:
# Confirm that the training data has examples from every instrument class.
df_train['instrument_family'].value_counts().reindex(np.arange(0,11,1))

0     65474
1     12675
2      8773
3     32690
4     51821
5     34201
6     34477
7     13911
8     19474
9      5501
10    10208
Name: instrument_family, dtype: int64

In [6]:
# The test data does not have any examples from instrument 9!
# Instrument 9 will therefore be removed from the data later
df_test['instrument_family'].value_counts().reindex(np.arange(0,11,1))

0     843.0
1     269.0
2     180.0
3     652.0
4     766.0
5     202.0
6     502.0
7     235.0
8     306.0
9       NaN
10    141.0
Name: instrument_family, dtype: float64

In [7]:
# There is a tonne of training data!
df_train.shape

(289205, 13)

In [8]:
# The test data is reasonably-sized
df_test.shape

(4096, 13)

In [9]:
# Group the df_train by instrument family, and then sample 4500 from each
samples_per_instrument = 4500
df_train_reduced = df_train.groupby('instrument_family', as_index=False, group_keys=False).apply(lambda df: df.sample(samples_per_instrument)) 

# Remove synth lead (instrument 9) as it is missing from test data
df_train_reduced = df_train_reduced[df_train_reduced['instrument_family']!=9]
df_train_reduced.head(2)

|  | instrument | instrument_family | instrument_family_str | instrument_source | instrument_source_str | instrument_str | note | note_str | pitch | qualities | qualities_str | sample_rate | velocity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bass_synthetic_078-026-127 | 704 | 0 | bass | 2 | synthetic | bass_synthetic_078 | 225351 | bass_synthetic_078-026-127 | 26 | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | [dark] | 16000 | 127 |
| bass_synthetic_106-061-127 | 833 | 0 | bass | 2 | synthetic | bass_synthetic_106 | 264743 | bass_synthetic_106-061-127 | 61 | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] | [distortion] | 16000 | 127 |


In [10]:
# Confirm that training data has been reduced to reasonable size
df_train_reduced.shape

(45000, 13)

In [11]:
# Confirming that we have 4500 training examples from each instrument class (except instrument 9)
df_train_reduced['instrument_family'].value_counts()

10    4500
8     4500
7     4500
6     4500
5     4500
4     4500
3     4500
2     4500
1     4500
0     4500
Name: instrument_family, dtype: int64

### Extracting features

In [12]:
# This function takes a file and returns features in an array
def extract_audio_features(file):
   
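    # Load the wav file. Note: librosa.load resamples to 22,050 Hz by default;
    # pass sr=None to keep the file's native rate (16,000 Hz for NSynth).
    y, sr = librosa.load(file)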
        
    # Determine if sound is harmonic or percussive
    y_harmonic, y_percussive = librosa.effects.hpss(y)
    if np.mean(y_harmonic)>np.mean(y_percussive):
        harmonic = 1
    else:
        harmonic = 0
        
    # Mel-frequency cepstral coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Temporal averaging
    mfcc = np.mean(mfcc,axis=1)
    
    # Mel-scaled spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,fmax=8000)  
    # Temporal averaging
    mel_spectrogram = np.mean(mel_spectrogram, axis = 1)
    
    # Determine chroma energy
    chroma = librosa.feature.chroma_cens(y=y, sr=sr)
    # Temporal averaging
    chroma = np.mean(chroma, axis = 1)
    
    # Determine spectral contrast
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    # Temporal averaging
    spectral_contrast = np.mean(spectral_contrast, axis= 1)
    
    # Return audio features as array
    audio_features = [harmonic, mfcc, mel_spectrogram, chroma, spectral_contrast]
    return audio_features
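
The harmonic flag above compares the means of the two separated waveforms. Audio samples oscillate around zero, so a waveform's mean is typically close to zero however loud it is. One possible alternative, which is an assumption and not what this notebook does, is to compare mean squared amplitude instead. The sketch below uses two synthetic NumPy signals as hypothetical stand-ins for the outputs of `librosa.effects.hpss`.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr  # one second of sample times

# Synthetic stand-ins for the two HPSS components (hypothetical signals).
y_harmonic = 0.8 * np.sin(2 * np.pi * 440 * t)     # loud sustained tone
y_percussive = 0.05 * np.sin(2 * np.pi * 100 * t)  # quiet component

# The plain means are both close to zero, so they say little about loudness.
print(np.mean(y_harmonic), np.mean(y_percussive))

# Mean squared amplitude (signal power) separates the two clearly.
power_h = np.mean(y_harmonic ** 2)
power_p = np.mean(y_percussive ** 2)
harmonic = int(power_h > power_p)
print(power_h, power_p, harmonic)
```

Swapping this into the extraction function would change the `harmonic` column, so the results reported in this notebook would not carry over unchanged.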

In [13]:
instrument_types = ['bass', 'brass', 'flute', 'guitar', 'keyboard', 'mallet', 'organ', 'reed', 'string', 'synth_lead', 'vocal']

# Function to extract the instrument family index from a filename
def instrument_code(filename):
    for index, instrument in enumerate(instrument_types):
        if instrument in filename:
            return index
    return None  # Should never get this far
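
The substring search above works for the NSynth family names. A stricter variant, sketched below as one possible alternative, parses the filename structure instead. It assumes filenames follow the pattern `<family>_<source>_<id>-<pitch>-<velocity>` seen in the annotations. The function name `family_from_filename` and the `synth_lead` filename in the example are hypothetical.

```python
instrument_types = ['bass', 'brass', 'flute', 'guitar', 'keyboard', 'mallet',
                    'organ', 'reed', 'string', 'synth_lead', 'vocal']

def family_from_filename(filename):
    """Return the family index parsed from '<family>_<source>_<id>-<pitch>-<velocity>'."""
    # Split off the last two underscore-separated fields (source and id-pitch-velocity).
    # What remains is the family name, including multi-word names like 'synth_lead'.
    family = filename.rsplit('_', 2)[0]
    return instrument_types.index(family)  # raises ValueError for unknown names

print(family_from_filename('vocal_synthetic_003-092-050'))
print(family_from_filename('synth_lead_synthetic_000-060-100'))  # hypothetical filename
```

Unlike the substring version, this raises an error on an unexpected name instead of silently returning `None`.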

#### Feature extraction from test audio data

In [14]:
# WARNING: Executing this cell will take several hours!

# Create dictionary to store all test features
# This will be converted to dataframe after
test_features_dict = {}
filenames_test = df_test.index.tolist()

#loop over every audio file in test set
for file in filenames_test:
    
    # feature extraction; function returns array
    file_path = 'data/nsynth-test/audio/'+ file + '.wav'
    audio_features = extract_audio_features(file_path) 
    
    # store features in dictionary for each file
    test_features_dict[file] = audio_features
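
Since this loop takes hours, it may be worth saving its result so that later sessions can skip it. The sketch below shows one way to do that with the standard library. The helper `load_or_compute`, the cache filename, and the placeholder `slow_extraction` function are hypothetical and are not part of the original notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return cached results if cache_path exists, else compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Stand-in for the slow extraction loop (hypothetical placeholder values).
def slow_extraction():
    return {'vocal_synthetic_003-092-050': [1, [0.1, 0.2]]}

cache_path = os.path.join(tempfile.mkdtemp(), 'test_features.pkl')
first = load_or_compute(cache_path, slow_extraction)   # computes and writes
second = load_or_compute(cache_path, slow_extraction)  # reads from disk
print(first == second)
```

In the notebook, the `compute` argument would wrap the loop that fills `test_features_dict`, and the cache would live next to the data rather than in a temporary directory.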



In [15]:
# Convert dict to dataframe
features_test = pd.DataFrame.from_dict(test_features_dict, orient='index',
                                       columns=['harmonic', 'mfcc', 'spectro', 'chroma', 'contrast'])

features_test.head()

|  | harmonic | mfcc | spectro | chroma | contrast |
|---|---|---|---|---|---|
| vocal_synthetic_003-092-050 | 1 | [-373.57192233234747, -33.36105661703314, -27.... | [729.4009942434144, 0.0027017951858890115, 0.0... | [0.0015458412067324828, 0.0015458412067324828,... | [51.46653956248316, 14.840266535438966, 25.813... |
| bass_synthetic_135-060-127 | 0 | [-457.00787873883735, 28.898613116509534, 18.3... | [0.3436867655622912, 0.35550269999881584, 0.38... | [0.6756612403346377, 0.5141054400032908, 0.011... | [15.326957907261994, 35.63678924329, 26.317790... |
| string_acoustic_056-067-025 | 0 | [-465.73313659822185, 17.321712690553028, 7.17... | [0.0008105553402491008, 0.00024985939913425024... | [0.18972148689028717, 0.15967222704598705, 0.0... | [18.788956368002847, 31.210328814912906, 34.67... |
| organ_electronic_007-019-050 | 1 | [-358.67841386126287, 184.25233465145075, 100.... | [426.2791337777077, 351.4662981416424, 167.739... | [0.001789669933149414, 0.2420925108506149, 0.4... | [13.312955962033707, 16.38899225439365, 16.225... |
| organ_electronic_057-056-025 | 1 | [-406.32360278343657, 96.17874454016462, 36.05... | [0.003399323228392822, 0.027184807515206286, 0... | [0.05217717952335415, 0.077518447971031, 0.027... | [25.755901046219236, 29.918044307829252, 34.57... |


In [16]:
# Extract mfccs
mfcc_test = pd.DataFrame(features_test.mfcc.values.tolist(),index=features_test.index)
mfcc_test = mfcc_test.add_prefix('mfcc_')

# Extract spectro
spectro_test = pd.DataFrame(features_test.spectro.values.tolist(),index=features_test.index)
spectro_test = spectro_test.add_prefix('spectro_')

# Extract chroma
chroma_test = pd.DataFrame(features_test.chroma.values.tolist(),index=features_test.index)
chroma_test = chroma_test.add_prefix('chroma_')

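# Extract contrast
contrast_test = pd.DataFrame(features_test.contrast.values.tolist(),index=features_test.index)
# Prefix the spectral-contrast frame itself. Prefixing chroma_test here instead
# would only duplicate the chroma columns as contrast_chroma_*, which is what
# the recorded output below shows.
contrast_test = contrast_test.add_prefix('contrast_')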

# Drop the old columns
features_test = features_test.drop(labels=['mfcc', 'spectro', 'chroma', 'contrast'], axis=1)

#concatenate
df_features_test = pd.concat([features_test, mfcc_test, spectro_test, chroma_test, contrast_test],
                           axis=1, join='inner')
df_features_test.head(2)

|  | harmonic | mfcc_0 | mfcc_1 | mfcc_2 | mfcc_3 | mfcc_4 | mfcc_5 | mfcc_6 | mfcc_7 | mfcc_8 | ... | contrast_chroma_2 | contrast_chroma_3 | contrast_chroma_4 | contrast_chroma_5 | contrast_chroma_6 | contrast_chroma_7 | contrast_chroma_8 | contrast_chroma_9 | contrast_chroma_10 | contrast_chroma_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vocal_synthetic_003-092-050 | 1 | -373.571922 | -33.361057 | -27.369076 | 34.193883 | -18.958225 | 44.740011 | -15.474436 | 4.263017 | -0.601554 | ... | 0.001546 | 0.001546 | 0.001546 | 0.001546 | 0.001546 | 0.436386 | 0.718565 | 0.541034 | 0.01405 | 0.000781 |
| bass_synthetic_135-060-127 | 0 | -457.007879 | 28.898613 | 18.354827 | 15.474206 | 9.697761 | 7.271243 | 1.082034 | -3.5347 | -7.939138 | ... | 0.011721 | 0.017333 | 0.02392 | 0.015035 | 0.022643 | 0.032344 | 0.018143 | 0.003452 | 0.00795 | 0.50095 |


In [17]:
# df_features_test doesn't have the target labels yet, so add them now
# Extract the instrument family index from each file name
targets_test = []
for name in df_features_test.index.tolist():
    targets_test.append(instrument_code(name))
# Add new column to dataframe: the instrument family index
df_features_test['targets'] = targets_test

In [18]:
df_features_test.shape

(4096, 167)

#### Feature extraction from training audio data

In [19]:
# Now we do it all again, this time with the training data

# WARNING: Executing this cell will take several hours!

# Dictionary to store all training features
# This will be converted to dataframe after
training_features_dict = {}

filenames_train = df_train_reduced.index.tolist()

#loop over every audio file in training set
for file in filenames_train:
    
    # feature extraction; function returns array
    file_path = 'data/nsynth-train/audio/'+ file + '.wav'
    features = extract_audio_features(file_path) 
    
    # store features in dictionary for each file
    training_features_dict[file] = features

In [20]:
# Convert dict to dataframe
features_train = pd.DataFrame.from_dict(training_features_dict, orient='index',
                                       columns=['harmonic', 'mfcc', 'spectro', 'chroma', 'contrast'])

features_train.head(2)

|  | harmonic | mfcc | spectro | chroma | contrast |
|---|---|---|---|---|---|
| string_acoustic_019-053-050 | 0 | [-435.11749120248277, 57.522890486385286, 45.5... | [0.005743224907793942, 0.0008756911777530139, ... | [0.0, 0.0, 0.0007373366518613547, 0.0022483638... | [39.8663779193007, 26.978123499677192, 25.6377... |
| organ_electronic_002-025-025 | 1 | [-410.6611963186803, 101.48094777084113, 33.50... | [0.2284447254434572, 29.304125976725576, 150.8... | [0.4292591243437037, 0.5678657757812192, 0.429... | [31.465076018721827, 21.043539543764012, 22.50... |


In [21]:
# Extract mfccs from array stored in one column to separate columns
mfcc_train = pd.DataFrame(features_train.mfcc.values.tolist(),
                          index=features_train.index)
# convenient column renaming function in pandas
# generates mfcc_0, mfcc_1, ..
mfcc_train = mfcc_train.add_prefix('mfcc_')

# Extract spectro
spectro_train = pd.DataFrame(features_train.spectro.values.tolist(),
                             index=features_train.index)
spectro_train = spectro_train.add_prefix('spectro_')


# Extract chroma
chroma_train = pd.DataFrame(features_train.chroma.values.tolist(),
                            index=features_train.index)
chroma_train = chroma_train.add_prefix('chroma_')


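# Extract contrast
contrast_train = pd.DataFrame(features_train.contrast.values.tolist(),
                              index=features_train.index)
# Prefix the spectral-contrast frame itself. Prefixing chroma_train here instead
# would only duplicate the chroma columns as contrast_chroma_*, which is what
# the recorded output below shows.
contrast_train = contrast_train.add_prefix('contrast_')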

# Drop the old columns
features_train = features_train.drop(labels=['mfcc', 'spectro', 'chroma', 'contrast'], axis=1)

# Concatenate all the columns 
df_features_train = pd.concat([features_train, mfcc_train, spectro_train, chroma_train, contrast_train],
                           axis=1, join='inner')
df_features_train.head(2)

|  | harmonic | mfcc_0 | mfcc_1 | mfcc_2 | mfcc_3 | mfcc_4 | mfcc_5 | mfcc_6 | mfcc_7 | mfcc_8 | ... | contrast_chroma_2 | contrast_chroma_3 | contrast_chroma_4 | contrast_chroma_5 | contrast_chroma_6 | contrast_chroma_7 | contrast_chroma_8 | contrast_chroma_9 | contrast_chroma_10 | contrast_chroma_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| string_acoustic_019-053-050 | 0 | -435.117491 | 57.52289 | 45.55878 | 34.311198 | 25.078864 | 18.16553 | 13.74652 | 10.890065 | 8.637233 | ... | 0.000737 | 0.002248 | 0.515533 | 0.683511 | 0.516565 | 0.003048 | 0.0 | 0.0 | 0.0 | 0.0 |
| organ_electronic_002-025-025 | 1 | -410.661196 | 101.480948 | 33.507414 | 61.627004 | 60.105843 | 29.676009 | 22.198454 | 14.750659 | 6.99517 | ... | 0.42918 | 0.069715 | 0.074528 | 0.032688 | 0.017369 | 0.104263 | 0.355976 | 0.191591 | 0.079765 | 0.082488 |


In [22]:
# df_features_train doesn't have the target labels yet, so add them now
# Extract the instrument family index from each file name
targets_train = []
for name in df_features_train.index.tolist():
    targets_train.append(instrument_code(name))

# Add new column to dataframe: the instrument family index
df_features_train['targets'] = targets_train

In [23]:
df_features_train.shape

(45000, 167)

### Random forest model 

In [24]:
#get training and testing data
X_train = df_features_train.drop(labels=['targets'], axis=1)
y_train = df_features_train['targets']

X_test = df_features_test.drop(labels=['targets'], axis=1)
y_test = df_features_test['targets']

In [25]:
# Generate the random forest
clf_Rf = RandomForestClassifier(n_estimators=20, max_depth=50, warm_start=True)

In [30]:
# Fit the model
clf_Rf.fit(X_train, y_train)

  warn("Warm-start fitting without increasing n_estimators does not "


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0, warm_start=True)
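
The warning above appears when a forest created with `warm_start=True` is fitted again without raising `n_estimators`, in which case no new trees are trained. The sketch below shows the intended use of warm starting on a small synthetic dataset. The data and the tree counts are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# warm_start=True keeps the trees from earlier fits.
clf = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)
clf.fit(X, y)
n_before = len(clf.estimators_)

# Raising n_estimators and refitting trains only the additional trees.
clf.set_params(n_estimators=30)
clf.fit(X, y)
n_after = len(clf.estimators_)

print(n_before, n_after)
```

If the forest is only ever fitted once, as in this notebook, `warm_start` can be left at its default of `False`.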

In [27]:
# Predict the y values for the x_test values
y_pred_RF = clf_Rf.predict(X_test)

In [28]:
# The algorithm made 4096 predictions, of which 2369 were correct and 1727 were incorrect
(y_pred_RF == y_test).value_counts()

True     2369
False    1727
Name: targets, dtype: int64

In [29]:
random_forest_accuracy = np.mean(y_pred_RF == y_test)
print("Random forest algorithm accuracy: {0:.1%}".format(random_forest_accuracy))

Random forest algorithm accuracy: 57.8%
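
Overall accuracy hides which families are confused with one another. A per-class breakdown could be computed as sketched below. The short `y_true` and `y_hat` series are hypothetical toy labels standing in for `y_test` and `y_pred_RF`.

```python
import pandas as pd

# Toy labels standing in for y_test and y_pred_RF (hypothetical values).
y_true = pd.Series([0, 0, 0, 3, 3, 4, 4, 4], name='actual')
y_hat = pd.Series([0, 0, 3, 3, 3, 4, 0, 4], name='predicted')

# Confusion table: rows are true families, columns are predicted families.
confusion = pd.crosstab(y_true, y_hat)
print(confusion)

# Per-class recall: fraction of each true family that was predicted correctly.
recall = (y_true == y_hat).groupby(y_true).mean()
print(recall)
```

With the real predictions, off-diagonal cells of the confusion table would show which instrument families the model mixes up most often.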


### References

Nadim Kawwa, *[Instrument classification on the NSynth dataset using supervised learning and CNNs](https://github.com/NadimKawwa/NSynth)*. 2019.