# NSynth Dataset

[NSynth](https://magenta.tensorflow.org/datasets/nsynth#files) is "<i>a large-scale and high-quality dataset of annotated musical notes</i>" <br>
[Download](https://magenta.tensorflow.org/datasets/nsynth#files) the train, validation, and test data.

In [None]:
import numpy as np
import pandas as pd
import os
import glob
import librosa
import librosa.display
import re
import seaborn as sns  # used for the bar plots in the Feature Summary section
import matplotlib.pyplot as plt
import scipy
import scipy.stats  # provides scipy.stats.skew, used in the summaries below
import time
import collections
import itertools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [None]:
print('numpy version ', np.__version__)
print('pandas version ', pd.__version__)
print('librosa version ', librosa.__version__)
print('seaborn version ', sns.__version__)
print('scipy version ', scipy.__version__)

# Instruments & Sources
### Instrument Index

| Index | Name |
|------|------|
|   0  | bass|
|   1  | brass|
|   2  | flute|
|   3  | guitar|
|   4  | keyboard|
|   5  | mallet|
|   6  | organ|
|   7  | reed|
|   8  | string|
|   9  | synth_lead|
|   10  | vocal|

### Source Index

| Index | Name |
|------|------|
|   0  | acoustic|
|   1  | electronic|
|   2  | synthetic|

In [None]:
# Instrument family names, in index order (see the table above)
class_names=['bass', 'brass', 'flute', 'guitar', 
             'keyboard', 'mallet', 'organ', 'reed', 
             'string', 'synth_lead', 'vocal']
# Instrument source names, in index order
source_names=['acoustic', 'electronic', 'synthetic']

# Anatomy of a Wave File

Explore the features of a wave file by picking one example recording per instrument family and plotting its features.

### Pick a file

In [None]:
class_names

In [None]:
# One hand-picked example file per instrument family from the NSynth validation split
# Adjust this path to wherever nsynth-valid was extracted
nsynth_dir = '/home/zaibachkhoa/Documents/Music-Genre-Classification-From-Audio-Files/Music_Instrument_Classification/dataset/valid/nsynth-valid'
bass_file = f'{nsynth_dir}/audio/bass_electronic_018-047-075.wav'
brass_file = f'{nsynth_dir}/audio/brass_acoustic_006-031-050.wav'
flute_file = f'{nsynth_dir}/audio/flute_synthetic_000-035-127.wav'
guitar_file = f'{nsynth_dir}/audio/guitar_acoustic_010-086-100.wav'
keyboard_file = f'{nsynth_dir}/audio/keyboard_acoustic_004-041-100.wav'
mallet_file = f'{nsynth_dir}/audio/mallet_acoustic_056-065-050.wav'
organ_file = f'{nsynth_dir}/audio/organ_electronic_007-015-100.wav'
reed_file = f'{nsynth_dir}/audio/reed_acoustic_023-042-127.wav'
string_file = f'{nsynth_dir}/audio/string_acoustic_012-035-100.wav'
# NOTE: this is a flute recording used as a stand-in, not a synth_lead sample;
# replace it with a synth_lead file (e.g. from the training split) to get a real synth_lead example
synth_lead_file = f'{nsynth_dir}/audio/flute_acoustic_002-093-075.wav'
vocal_file = f'{nsynth_dir}/audio/vocal_synthetic_003-030-050.wav'

In [None]:
sample_files = [bass_file, brass_file, flute_file, guitar_file, keyboard_file,
                mallet_file, organ_file, reed_file, string_file, synth_lead_file, vocal_file]

## Feature Extraction

### Waveform & Sampling Rate: [librosa.core.load](https://librosa.github.io/librosa/generated/librosa.core.load.html)

A sound is a continuous-time signal that is usually represented by a <b>waveform</b>; to be stored digitally, this continuous signal is sampled and converted into a discrete-time signal. The <b>sample rate</b> is the number of audio samples recorded per second. It determines the highest audio frequency that can be reproduced: the maximum representable frequency (the Nyquist frequency) is half the sample rate. Most humans can hear frequencies between roughly 20 Hz and 20,000 Hz, so audio CDs use a sample rate of 44,100 Hz; that is, the analog signal is sampled 44,100 times per second. <br>

Sources:<br>
[Frequency Range of Human Hearing](https://hypertextbook.com/facts/2003/ChrisDAmbrose.shtml)<br>
[7 Questions About Sample Rate](https://www.sweetwater.com/insync/7-things-about-sample-rate/)
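The Nyquist limit can be illustrated with plain NumPy. The sketch below is not part of the original notebook, and `peak_frequency` is a hypothetical helper: it samples one second of a pure tone at 16 kHz (the sample rate NSynth uses) and reports the strongest FFT bin. A 3 kHz tone is recovered correctly, whereas a 9 kHz tone lies above the 8 kHz Nyquist frequency and aliases to 16,000 - 9,000 = 7,000 Hz.

```python
import numpy as np

sample_rate = 16000                      # Hz; NSynth audio is sampled at 16 kHz
nyquist = sample_rate / 2                # highest representable frequency
times = np.arange(sample_rate) / sample_rate  # sample times for one second of audio

def peak_frequency(freq_hz):
    """Sample a pure tone of freq_hz and return the frequency (Hz) of the strongest FFT bin."""
    x = np.sin(2 * np.pi * freq_hz * times)
    spectrum = np.abs(np.fft.rfft(x))
    # With exactly one second of samples, bin k corresponds to k * sample_rate / len(x) = k Hz
    return np.argmax(spectrum) * sample_rate / len(x)

print(nyquist)               # 8000.0
print(peak_frequency(3000))  # 3000.0: below Nyquist, recovered correctly
print(peak_frequency(9000))  # 7000.0: above Nyquist, aliased to 16000 - 9000
```

The names avoid `sr` and `y` so that running this cell does not overwrite the notebook's lists of loaded audio.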

In [None]:
# Load each audio file as a waveform y and store its sampling rate in sr
# sr=None preserves the file's native sampling rate (16 kHz for NSynth)
# instead of resampling to librosa's default of 22,050 Hz
y= []
sr = []
for file in sample_files:
    y_out, sr_out = librosa.load(file, sr=None)
    y.append(y_out)
    sr.append(sr_out)
for instrument, y_out, sr_out in zip(class_names, y, sr):
    print("{} has {} samples with a sampling rate of {}".format(instrument, np.size(y_out), sr_out))

### Harmonic & Percussive Component: [librosa.effects.hpss](https://librosa.github.io/librosa/generated/librosa.effects.hpss.html)

A <b>percussion</b> instrument is simply an instrument that you strike: think of drums, gongs, xylophones, triangles, etc. A <b>harmonic</b> instrument produces a series of sounds whose fundamental frequencies are all integer multiples of the lowest fundamental frequency; think of guitars, violins, trumpets, etc. `librosa.effects.hpss` applies the same distinction within a single recording, splitting the signal into a harmonic component (sustained, pitched energy) and a percussive component (short, broadband transients).<br>

Sources:<br>
[Percussion Instruments](https://ccrma.stanford.edu/CCRMA/Courses/152/percussion.html)<br>
[IEV ref 801-30-04](http://www.electropedia.org/iev/iev.nsf/display?openform&ievref=801-30-04)
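librosa's HPSS is based on median filtering of the spectrogram (Fitzgerald, 2010): sustained harmonic energy forms horizontal lines (smooth along time), while percussive hits form vertical lines (smooth along frequency). The sketch below is a simplified version of that idea using `scipy.ndimage.median_filter`; `hpss_masks` is a hypothetical helper, and librosa's own implementation adds options (such as separation margins and hard masks) that are omitted here.

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_masks(S, kernel_size=31, power=2.0, eps=1e-10):
    """Simplified median-filtering HPSS on a magnitude spectrogram S of shape (freq, time).

    Returns soft masks (mask_harmonic, mask_percussive); wherever either filtered
    spectrogram is non-zero, the two masks sum to approximately 1.
    """
    S = np.asarray(S, dtype=float)
    # Harmonic energy is steady over time: median-filter along the time axis (columns)
    harmonic = median_filter(S, size=(1, kernel_size))
    # Percussive energy is spread over frequency: median-filter along the frequency axis (rows)
    percussive = median_filter(S, size=(kernel_size, 1))
    h, p = harmonic ** power, percussive ** power
    total = h + p + eps  # eps avoids division by zero in silent bins
    return h / total, p / total

# Toy spectrogram: a sustained tone (horizontal line) plus a click (vertical line)
toy = np.zeros((64, 64))
toy[20, :] = 1.0  # harmonic: one frequency bin active in every frame
toy[:, 40] = 1.0  # percussive: every frequency bin active in one frame
mask_h, mask_p = hpss_masks(toy)
print(mask_h[20, 10], mask_p[5, 40])  # both close to 1
```

With a real recording, the masks would be computed from `np.abs(librosa.stft(...))` and multiplied with the complex STFT before inverting it with `librosa.istft`.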

In [None]:
len(y[0])

In [None]:
y_harmonic = []
y_percussive = []

for sample in y:
    # hpss returns (harmonic, percussive); call it once per sample instead of twice
    harmonic, percussive = librosa.effects.hpss(sample)
    y_harmonic.append(harmonic)
    y_percussive.append(percussive)

In [None]:
os.makedirs('plots', exist_ok=True)  # savefig fails if the output folder does not exist
for i, instrument in enumerate(class_names):
    plt.figure(figsize=(10,4))
    librosa.display.waveshow(y_harmonic[i], sr=sr[i], alpha=0.25)
    librosa.display.waveshow(y_percussive[i], sr=sr[i], color='r', alpha=0.5)
    plt.legend(['Harmonic', 'Percussive'])
    plt.title("Amplitude Envelope of Harmonic and Percussive Components for " + instrument)
    plt.savefig('plots/soundsAsArrays_' + instrument + '.png')

plt.show()



### Beat & Tempo: [librosa.beat.beat_track](https://librosa.github.io/librosa/generated/librosa.beat.beat_track.html)

The <b>tempo</b> is the rate of speed of a musical piece or passage indicated by one of a series of directions (such as largo, presto, or allegro) and often by an exact metronome marking.<br>
Source:<br>
[Merriam Webster](https://www.merriam-webster.com/dictionary/tempo)

In [None]:
tempo, beat_frames = librosa.beat.beat_track(y=y[0], sr=sr[0])
print("Tempo ", tempo)
print("Beat frames ", beat_frames)

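Each NSynth example contains a single note, so there is no meaningful tempo or beat to track here; whatever `beat_track` reports should not be interpreted as a musical tempo.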
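Tempo is usually expressed in beats per minute (BPM), so once beat times are known it follows from the spacing between beats: BPM = 60 / (seconds per beat). The sketch below uses a hypothetical helper, `tempo_from_beat_times`; it is not how `librosa.beat.beat_track` works internally (that function estimates tempo from the onset-strength envelope before tracking beats), it only illustrates the arithmetic.

```python
import numpy as np

def tempo_from_beat_times(beat_times):
    """Estimate tempo in BPM from beat timestamps (seconds) using the median inter-beat interval."""
    intervals = np.diff(np.asarray(beat_times, dtype=float))
    if intervals.size == 0:
        return 0.0  # fewer than two beats: no tempo can be inferred
    return 60.0 / np.median(intervals)

print(tempo_from_beat_times([0.0, 0.5, 1.0, 1.5]))  # 120.0 (one beat every half second)
```

Beat frames from `beat_track` can be converted to seconds with `librosa.frames_to_time(beat_frames, sr=sr[0])`. The median is used instead of the mean so that a single missed or spurious beat does not dominate the estimate.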

### Chroma Energy: [librosa.feature.chroma_cens](https://librosa.github.io/librosa/generated/librosa.feature.chroma_cens.html)

Two pitches are perceived as similar in “color” if they differ by an octave. Based on this observation, a pitch can be separated into two components, referred to as tone height and chroma. Assuming the equal-tempered scale, the <b>chromas</b> correspond to the set {C, C♯, D, …, B} of the twelve pitch spelling attributes used in Western music notation. Thus, a chroma feature is represented by a 12-dimensional vector:<br>
$x = (x(1), x(2), \ldots, x(12))^T$<br>
where x(1) corresponds to chroma C, x(2) to chroma C♯, and so on. In the feature extraction step, a given audio signal is converted into a sequence of chroma features, each expressing how the short-time energy of the signal is spread over the twelve chroma bands. The CENS variant (Chroma Energy Normalized Statistics) computed by `chroma_cens` additionally normalizes and temporally smooths these vectors.<br>

Source:<br>
[Meinard Müller and Sebastian Ewert Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2011.](http://ismir2011.ismir.net/papers/PS2-8.pdf)
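The folding of frequencies onto the twelve chroma bins can be sketched with the equal-tempered mapping MIDI = 69 + 12 log2(f / 440), where A4 = 440 Hz is MIDI note 69 and pitch class = MIDI mod 12. `pitch_class` and `chroma_vector` are hypothetical helpers; `librosa.feature.chroma_cens` builds on a constant-Q representation and adds normalization and smoothing, so it does not fold raw FFT bins like this.

```python
import numpy as np

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq_hz):
    """Map a frequency to its equal-tempered pitch-class index (0 = C), with A4 = 440 Hz = MIDI 69."""
    midi = 69 + 12 * np.log2(freq_hz / 440.0)
    return int(np.round(midi)) % 12

def chroma_vector(freqs, magnitudes):
    """Fold spectral energy into 12 chroma bins (a simplified, unsmoothed chroma), normalized to sum 1."""
    chroma = np.zeros(12)
    for f, m in zip(freqs, magnitudes):
        if f > 0:  # skip the DC bin: log2(0) is undefined
            chroma[pitch_class(f)] += m ** 2  # accumulate energy (squared magnitude)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma

# A4 (440 Hz) and A5 (880 Hz) are an octave apart, so both land in the same chroma bin
print(PITCH_CLASSES[pitch_class(440.0)], PITCH_CLASSES[pitch_class(880.0)])  # A A
```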

In [None]:
chromas = []
for y_, sr_ in zip(y, sr):
    chromas.append(librosa.feature.chroma_cens(y=y_, sr=sr_))

In [None]:
for i, (chroma, instrument) in enumerate(zip(chromas, class_names)):
    plt.figure(figsize=(10,4))
    # Pass sr so the time axis matches the audio; specshow assumes 22,050 Hz otherwise
    librosa.display.specshow(chroma, sr=sr[i], y_axis='chroma', x_axis='time')
    plt.colorbar()
    plt.title("CENS for " + instrument)
    plt.savefig('plots/ChromaEnergy_' + instrument + '.png')

plt.show()



### Mel Spectrogram [librosa.feature.melspectrogram](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html)

The <b>spectrogram</b> is a basic tool in audio spectral analysis and other fields. It has been applied extensively in speech analysis. The spectrogram can be defined as an intensity plot (usually on a log scale, such as dB) of the Short-Time Fourier Transform (STFT) magnitude. The STFT is simply a sequence of Fast Fourier Transforms (FFT) of windowed data segments, where the windows are usually allowed to overlap in time, typically by 25-50%. It is an important representation of audio data because human hearing is based on a kind of real-time spectrogram encoded by the cochlea of the inner ear. <br>The spectrogram has been used extensively in the field of computer music as a guide during the development of sound synthesis algorithms. When working with an appropriate synthesis model, matching the spectrogram often corresponds to matching the sound extremely well. In fact, spectral modeling synthesis (SMS) is based on synthesizing the short-time spectrum directly by some means.<br>

The <b>mel</b> is a unit of pitch. The mel scale is a scale of pitches judged by listeners to be equal in distance one from another.


Sources:<br>
[Spectrograms](https://ccrma.stanford.edu/~jos/st/Spectrograms.html)<br>
[Stevens, Stanley Smith; Volkmann; John & Newman, Edwin B. (1937). "A scale for the measurement of the psychological magnitude pitch"](https://archive.is/20130414065947/http://asadl.org/jasa/resource/1/jasman/v8/i3/p185_s1)
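One common closed form for the mel scale is the HTK formula m = 2595 log10(1 + f / 700), which maps 1000 Hz to roughly 1000 mel. The sketch below uses it to place band edges equally spaced in mel and shows that the corresponding widths in Hz grow with frequency. The helper names are hypothetical, and note that librosa's mel functions default to a different variant (the Slaney formula, linear below 1 kHz and logarithmic above) unless `htk=True` is passed.

```python
import numpy as np

def hz_to_mel_htk(f_hz):
    """HTK-style mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz_htk(m):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# 12 band edges equally spaced in mel between 0 Hz and 8 kHz (the fmax used below)
edges_hz = mel_to_hz_htk(np.linspace(0.0, hz_to_mel_htk(8000.0), 12))
print(np.round(np.diff(edges_hz)))  # band widths in Hz increase with frequency
```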

In [None]:
spectrograms_mel = []

for y_, sr_ in zip(y, sr):
    # Passing through arguments to the Mel filters
    spectrograms_mel.append(librosa.feature.melspectrogram(y=y_, sr=sr_, n_mels=128,fmax=8000))

In [None]:
for i, (S, instrument) in enumerate(zip(spectrograms_mel, class_names)):
    plt.figure(figsize=(10,4))
    # Pass sr so the time and mel axes are labelled correctly for 16 kHz audio
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max), sr=sr[i],
                             y_axis='mel', fmax=8000, x_axis='time')
    plt.colorbar(format='%+2.0f dB')
    plt.title("Mel spectrogram for " + instrument)
    plt.savefig('plots/mel_spectro_' + instrument + '.png')

plt.show()

In [None]:
spectrograms = []

for y_, sr_ in zip(y, sr):
    # Magnitude STFT (librosa.stft centers frames by default), in dB relative to the peak
    D = np.abs(librosa.stft(y_))
    spectro = librosa.amplitude_to_db(D, ref=np.max)
    spectrograms.append(spectro)

for i, (S, instrument) in enumerate(zip(spectrograms, class_names)):
    plt.figure(figsize=(10,4))
    # Pass sr so the frequency and time axes match the 16 kHz audio
    librosa.display.specshow(S, sr=sr[i], y_axis='log', x_axis='time')
    plt.colorbar(format='%+2.0f dB')
    plt.title("Spectrogram for " + instrument)
    plt.savefig('plots/spectro_' + instrument + '.png')

plt.show()
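The spectrogram cell relies on `librosa.stft`. The mechanics described earlier (overlapping windowed segments, one FFT per segment) can be sketched in NumPy. `stft_magnitude` is a hypothetical helper that uses librosa's default sizes (`n_fft=2048`, `hop_length=512`, i.e. 75% overlap) and a periodic Hann window, but left-aligned frames without padding, so its frame count differs from `librosa.stft`'s default `center=True`.

```python
import numpy as np

def stft_magnitude(x, n_fft=2048, hop_length=512):
    """Magnitude STFT with left-aligned, Hann-windowed frames.

    Returns shape (1 + n_fft // 2, n_frames): frequency bins x frames, the same layout librosa uses.
    Requires len(x) >= n_fft.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window of length n_fft
    n_frames = 1 + (len(x) - n_fft) // hop_length
    # Stack frames as columns: frame i covers samples [i * hop, i * hop + n_fft)
    frames = np.stack([x[i * hop_length:i * hop_length + n_fft] for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames * window[:, None], axis=0))

# One second of a 1 kHz tone at 16 kHz: bin spacing is 16000 / 2048 = 7.8125 Hz
demo_rate = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(demo_rate) / demo_rate)
mag = stft_magnitude(tone)
print(mag.shape)              # (1025, 28)
print(np.argmax(mag[:, 0]))   # 128, i.e. 128 * 7.8125 Hz = 1000 Hz
```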

### Mel-Frequency Cepstral Coefficients: [librosa.feature.mfcc](https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html)

The <b>mel-frequency cepstrum (MFC)</b> is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC. Because the frequency bands are equally spaced on the mel scale, they approximate the human auditory system more closely than the linearly spaced bands of a normal spectrum.<br>

MFCCs are commonly derived as follows:<br>

1) Take the Fourier transform of (a windowed excerpt of) a signal.<br>
2) Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.<br>
3) Take the logs of the powers at each of the mel frequencies.<br>
4) Take the discrete cosine transform of the list of mel log powers, as if it were a signal.<br>
5) The MFCCs are the amplitudes of the resulting spectrum.<br>

Sources:<br>
[Speech Processing for Machine Learning](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html)<br>
[Wikipedia: Mel Frequency Cepstrum](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)<br>
[Mel Frequency Cepstral Coefficient (MFCC) tutorial](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/#why-do-we-do-these-things)
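Steps 3 to 5 of the recipe above can be sketched on an already computed mel power spectrogram (steps 1 and 2, e.g. the output of `librosa.feature.melspectrogram`). `mfcc_from_mel_power` is a hypothetical helper; `librosa.feature.mfcc` follows the same outline (dB-scaled mel spectrogram, then an orthonormal type-II DCT by default), but its dB conversion also clips to a dynamic-range floor, so values may differ slightly.

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_mel_power(mel_power, n_mfcc=13, amin=1e-10):
    """MFCCs from a mel power spectrogram of shape (n_mels, n_frames)."""
    # Step 3: log of the mel band powers, in dB (amin avoids log10(0))
    log_mel = 10.0 * np.log10(np.maximum(np.asarray(mel_power, dtype=float), amin))
    # Step 4: DCT-II along the mel axis, treating each frame's log powers as a signal
    coeffs = dct(log_mel, type=2, axis=0, norm='ortho')
    # Step 5: keep the lowest n_mfcc coefficients
    return coeffs[:n_mfcc]

flat = np.full((128, 5), 10.0)   # 128 mel bands x 5 frames, every band at the same power
coeffs = mfcc_from_mel_power(flat)
print(coeffs.shape)              # (13, 5)
```

A perfectly flat spectrum puts all of its energy into coefficient 0; the higher coefficients describe the shape of the spectral envelope.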

In [None]:
mfccs = []
for y_, sr_ in zip(y, sr):
    mfccs.append(librosa.feature.mfcc(y=y_, sr=sr_, n_mfcc=13))

In [None]:
for i, (mfcc, instrument) in enumerate(zip(mfccs, class_names)):
    plt.figure(figsize=(10,4))
    # Pass sr so the time axis matches the 16 kHz audio
    librosa.display.specshow(mfcc, sr=sr[i], x_axis='time')
    plt.colorbar()
    plt.title('Mel-frequency cepstral coefficients for ' + instrument)
    plt.savefig('plots/mfcc' + instrument + '.png')
plt.show()

### Spectral Centroid: [librosa.feature.spectral_centroid](https://librosa.github.io/librosa/generated/librosa.feature.spectral_centroid.html)

The <b>spectral centroid</b> is commonly used as a measure of the brightness of a sound. It is the “center of gravity” of the spectrum, obtained from the frequency and magnitude information of the Fourier transform: for each spectral frame, the centroid is the magnitude-weighted mean frequency, i.e. the sum of the bin frequencies weighted by their magnitudes divided by the sum of the magnitudes, $c = \sum_k f_k |X(k)| / \sum_k |X(k)|$.<br>

Source:<br>
[Spectral Centroid](https://www.cs.cmu.edu/~music/icm/slides/05-algorithmic-composition.pdf)
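The weighted-mean definition translates directly into NumPy. `spectral_centroid` below is a hypothetical helper operating on a magnitude spectrogram; applied to `np.abs(librosa.stft(y[0]))` with bin frequencies from `librosa.fft_frequencies(sr=sr[0], n_fft=2048)`, it should be close to `librosa.feature.spectral_centroid`, though it is a sketch rather than librosa's implementation.

```python
import numpy as np

def spectral_centroid(S, freqs):
    """Per-frame spectral centroid: sum_k f_k * S[k, t] / sum_k S[k, t].

    S: magnitude spectrogram of shape (n_bins, n_frames); freqs: center frequency of each bin (n_bins,).
    """
    S = np.asarray(S, dtype=float)
    # Normalize each frame (column) so its magnitudes sum to 1; guard against silent frames
    weights = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-12)
    return (np.asarray(freqs, dtype=float)[:, None] * weights).sum(axis=0)

toy_freqs = np.array([100.0, 200.0, 300.0, 400.0])
toy_S = np.array([[1.0, 0.0],
                  [0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])
print(spectral_centroid(toy_S, toy_freqs))  # [200. 400.]
```

In the first frame, equal energy at 100 Hz and 300 Hz balances at 200 Hz; in the second, all energy sits at 400 Hz.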

In [None]:
centroids = []
spectrograms = []
phases = []

for y_, sr_ in zip(y, sr):
    centroids.append(librosa.feature.spectral_centroid(y=y_, sr=sr_))
    # Compute the STFT once and split it into magnitude and phase
    magnitude, phase = librosa.magphase(librosa.stft(y=y_))
    spectrograms.append(magnitude)
    phases.append(phase)

In [None]:
# Plot the spectral centroid of each frame, with log scaling on the y axis
for i, (C, instrument, S) in enumerate(zip(centroids, class_names, spectrograms)):
    plt.figure(figsize=(10,4))
    plt.subplot(2, 1, 1)
    plt.semilogy(C.T, label='Spectral Centroid for ' + instrument)
    plt.ylabel('Hz')
    plt.xticks(np.arange(0, C.shape[-1], 20))
    plt.xlim([0, C.shape[-1]])
    plt.xlabel('Frames')
    plt.legend()
    #plt.subplot(2, 1, 2)
    #librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max), sr=sr[i], y_axis='log', x_axis='time')
    #plt.title('log Power spectrogram for ' + instrument)
    #plt.tight_layout()
    plt.savefig('plots/spectral_centroid' + instrument + '.png')

plt.show()

### Spectral Contrast: [librosa.feature.spectral_contrast](https://librosa.github.io/librosa/generated/librosa.feature.spectral_contrast.html)

<b>Spectral contrast</b> is defined as the level difference between peaks and valleys in the spectrum. Octave-based Spectral Contrast considers the spectral peak, spectral valley and their difference in each sub-band.<br>
Source: <br>
[Jiang, Dan-Ning, Lie Lu, Hong-Jiang Zhang, Jian-Hua Tao, and Lian-Hong Cai. “Music type classification by spectral contrast feature.” In Multimedia and Expo, 2002. ICME‘02. Proceedings. 2002 IEEE International Conference on, vol. 1, pp. 113-116. IEEE, 2002.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.583.7201&rep=rep1&type=pdf)
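For a single sub-band in a single frame, the contrast measure of Jiang et al. can be sketched as follows: sort the bin magnitudes, average the strongest and the weakest fraction `alpha` of them, and take the difference of their logarithms. `band_contrast` is a hypothetical helper using natural logarithms; librosa's `spectral_contrast` applies the same idea to octave-based sub-bands (with a default quantile of 0.02) and reports the difference on a decibel scale unless `linear=True`.

```python
import numpy as np

def band_contrast(band_magnitudes, alpha=0.02):
    """Spectral contrast of one sub-band in one frame: log(peak) - log(valley)."""
    x = np.sort(np.asarray(band_magnitudes, dtype=float))
    n = max(int(round(alpha * len(x))), 1)  # use at least one bin at each end
    valley = np.log(np.mean(x[:n]) + 1e-12)  # mean of the weakest bins
    peak = np.log(np.mean(x[-n:]) + 1e-12)   # mean of the strongest bins
    return peak - valley

print(band_contrast(np.ones(10)))             # ~0: a flat band has no contrast
print(band_contrast([1.0] * 9 + [np.e]))      # ~1: one peak e times above the floor
```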

In [None]:
contrasts = []
for y_, sr_ in zip(y, sr):
    contrasts.append(librosa.feature.spectral_contrast(y=y_, sr=sr_))

In [None]:
for i, (contrast, instrument) in enumerate(zip(contrasts, class_names)):
    plt.figure(figsize=(10,4))
    # Pass sr so the time axis matches the 16 kHz audio
    librosa.display.specshow(contrast, sr=sr[i], x_axis='time')
    plt.colorbar()
    plt.ylabel('Frequency bands')
    plt.title('Spectral contrast for ' + instrument)
    plt.savefig('plots/spectral_contrast_' + instrument + '.png')

plt.show()

### Spectral Rolloff Frequency: [librosa.feature.spectral_rolloff](https://librosa.github.io/librosa/generated/librosa.feature.spectral_rolloff.html)

From the function docstring:<br>
The <b>roll-off frequency</b> is defined for each frame as the center frequency for a spectrogram bin such that at least roll_percent (0.85 by default) of the energy of the spectrum in this frame is contained in this bin and the bins below. This can be used to, e.g., approximate the maximum (or minimum) frequency by setting roll_percent to a value close to 1 (or 0).
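The roll-off definition amounts to a cumulative sum and a threshold search. `rolloff_frequency` is a hypothetical helper for a single frame: it returns the frequency of the first bin at which the cumulative spectrum reaches `roll_percent` of its total. `librosa.feature.spectral_rolloff` applies the same rule to every frame of a spectrogram.

```python
import numpy as np

def rolloff_frequency(spectrum, freqs, roll_percent=0.85):
    """Frequency of the lowest bin where the cumulative spectrum reaches roll_percent of the total."""
    cumulative = np.cumsum(np.asarray(spectrum, dtype=float))
    threshold = roll_percent * cumulative[-1]
    k = np.searchsorted(cumulative, threshold)  # first index with cumulative >= threshold
    return freqs[k]

toy_spectrum = np.ones(10)            # equal weight in 10 bins
toy_freqs = np.arange(10) * 100.0     # 0, 100, ..., 900 Hz
print(rolloff_frequency(toy_spectrum, toy_freqs))        # 800.0
print(rolloff_frequency(toy_spectrum, toy_freqs, 0.5))   # 400.0
```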

In [None]:
rolloffs = []
for y_, sr_ in zip(y, sr):
    rolloffs.append(librosa.feature.spectral_rolloff(y=y_, sr=sr_))
    

In [None]:
for i, (rolloff, instrument) in enumerate(zip(rolloffs, class_names)):
    plt.figure(figsize=(10,4))
    plt.semilogy(rolloff.T, label='Rolloff frequency for ' + instrument)
    plt.ylabel('Hz')
    plt.xlim([0, rolloff.shape[-1]])
    plt.xlabel('Frames')
    plt.xticks(np.arange(0, rolloff.shape[-1], 20))
    plt.title("Rolloff Frequency for " + instrument)
    plt.savefig('plots/rolloff_freq' + instrument + '.png')

plt.show()

### Zero Crossings: [librosa.core.zero_crossings](https://librosa.github.io/librosa/generated/librosa.core.zero_crossings.html)

Find the zero crossings of a signal `y`, i.e. the positions where consecutive samples change sign, and count them.

In [None]:
z_crossings = []
for y_ in y:
    z = librosa.zero_crossings(y_)
    #Find number of zero crossings
    z_crossings.append(np.sum(z))

### Zero Crossing Rate: [librosa.feature.zero_crossing_rate](https://librosa.github.io/librosa/generated/librosa.feature.zero_crossing_rate.html)

Compute the zero-crossing rate of an audio time series: for each frame, the fraction of samples at which the signal changes sign.
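Both quantities can be sketched with sign comparisons between consecutive samples. `zero_crossing_count` and `zero_crossing_rate` are hypothetical helpers: zero is treated as positive, and frames are left-aligned with librosa's default sizes (`frame_length=2048`, `hop_length=512`). `librosa.feature.zero_crossing_rate` centers (pads) frames by default and applies a small threshold around zero, so its values can differ slightly, especially at the edges.

```python
import numpy as np

def zero_crossing_count(x):
    """Number of sign changes between consecutive samples (zero counts as positive)."""
    negative = np.signbit(np.asarray(x, dtype=float))
    return int(np.count_nonzero(negative[1:] != negative[:-1]))

def zero_crossing_rate(x, frame_length=2048, hop_length=512):
    """Crossings per sample in each left-aligned frame. Requires len(x) >= frame_length."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_length) // hop_length
    return np.array([
        zero_crossing_count(x[i * hop_length:i * hop_length + frame_length]) / frame_length
        for i in range(n_frames)
    ])

alternating = np.tile([1.0, -1.0], 2048)  # 4096 samples that flip sign every sample
print(zero_crossing_count(alternating))    # 4095
print(zero_crossing_rate(alternating).shape)  # (5,)
```

Noisy or bright sounds tend to have high zero-crossing rates, while low, sustained tones cross zero less often.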

In [None]:
zrates = []
for y_ in y:
    zrates.append(librosa.feature.zero_crossing_rate(y=y_))

In [None]:
for i, (zrate, instrument) in enumerate(zip(zrates, class_names)):
    plt.figure(figsize=(10,4))
    plt.semilogy(zrate.T, label='Fraction')
    plt.ylabel('Fraction Per Frame')
    plt.xlabel('Frames')
    # Ticks and limits based on this feature's own number of frames
    plt.xticks(np.arange(0, zrate.shape[-1], 20))
    plt.xlim([0, zrate.shape[-1]])
    plt.title("Zero crossing rate for " + instrument)
    plt.legend()
    plt.savefig('plots/zerocross_' + instrument + '.png')

plt.show()

In [None]:
zrate.shape

## Feature Summary

### Chroma Energy Summary

In [None]:
chroma_means = []
chroma_stds = []

for chroma in chromas:
    chroma_means.append(np.mean(chroma, axis=1))
    chroma_stds.append(np.std(chroma, axis=1))

[Background on octaves and equal temperament scale](https://en.wikipedia.org/wiki/Equal_temperament)

In [None]:
# Pitch-class labels for the 12 chroma bins (the notes of one equal-tempered octave)
octave=['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']

In [None]:
for chroma_mean,  instrument in zip(chroma_means,class_names):
    plt.figure(figsize=(10,4))
    plt.title("Mean Chroma Energy for "+ instrument)
    sns.barplot(x=octave, y=chroma_mean)

In [None]:
for chroma_std, instrument in zip(chroma_stds,class_names):
    plt.figure(figsize=(10,4))
    plt.title("Standard Deviation of Chroma Energy for " + instrument)
    sns.barplot(x=octave, y=chroma_std)

### Mel-Frequency Cepstral Coefficients Summary

In [None]:
mfcc_means = []
mfcc_stds = []
coefficients = []

for mfcc in mfccs:
    mfcc_means.append(np.mean(mfcc,axis=1))
    mfcc_stds.append(np.std(mfcc,axis=1))
    coefficients.append(np.arange(0,mfcc.shape[0]))

In [None]:
for mfcc_mean, coefficient, instrument in zip(mfcc_means, coefficients,class_names):
    plt.figure(figsize=(10,4))
    plt.title("Mean Mel-Frequency Cepstral Coefficients for " + instrument)
    sns.barplot(x=coefficient, y=mfcc_mean)

In [None]:
for mfcc_std, coefficient, instrument in zip(mfcc_stds, coefficients,class_names):
    plt.figure(figsize=(10,4))
    plt.title("Standard Deviation Mel-Frequency Cepstral Coefficients for " + instrument)
    sns.barplot(x=coefficient, y=mfcc_std)

### Spectral Centroid Summary

In [None]:
centroid_means = []
centroid_stds = []
centroid_skews = []

for i in range(len(centroids)):
    centroid_means.append(np.mean(centroids[i]))
    centroid_stds.append(np.std(centroids[i]))
    centroid_skews.append(scipy.stats.skew(centroids[i], axis=1)[0])
    print("Centroid mean of {} is {}".format(class_names[i], centroid_means[i]))
    print("Centroid standard deviation of {} is {}".format(class_names[i], centroid_stds[i]))
    print("Centroid skewness of {} is {} \n".format(class_names[i], centroid_skews[i]))

### Spectral Contrast Summary

In [None]:
contrast_means= []
contrast_stds= []

for C in contrasts:
    contrast_means.append(np.mean(C, axis=1))
    contrast_stds.append(np.std(C, axis=1))

In [None]:
for C, contrast, instrument in zip(contrast_means, contrasts,class_names):
    plt.figure(figsize=(10,4))
    n_contrast = np.arange(0, contrast.shape[0])
    plt.title("Mean Spectral Contrast for " + instrument)
    sns.barplot(x=n_contrast, y=C)

In [None]:
for C, contrast, instrument in zip(contrast_stds, contrasts,class_names):
    plt.figure(figsize=(10,4))
    n_contrast = np.arange(0, contrast.shape[0])
    plt.title("Standard Deviation of Spectral Contrast for " + instrument)
    sns.barplot(x=n_contrast, y=C)

### Spectral Rolloff Summary

In [None]:
rolloff_means = []
rolloff_stds = []
rolloff_skews = []

for i in range(len(rolloffs)):
    rolloff_means.append(np.mean(rolloffs[i]))
    rolloff_stds.append(np.std(rolloffs[i]))
    rolloff_skews.append(scipy.stats.skew(rolloffs[i], axis=1)[0])
    print("Rolloff mean of {} is {}".format(class_names[i], rolloff_means[i]))
    print("Rolloff standard deviation of {} is {}".format(class_names[i], rolloff_stds[i]))
    print("Rolloff skewness of {} is {} \n".format(class_names[i], rolloff_skews[i]))

### Zero Crossing Rate Summary

In [None]:
zrate_means = []
zrate_stds = []
zrate_skews = []

for i in range(len(zrates)):
    zrate_means.append(np.mean(zrates[i]))
    zrate_stds.append(np.std(zrates[i]))
    zrate_skews.append(scipy.stats.skew(zrates[i], axis=1)[0])
    print("Zero Crossing mean of {} is {}".format(class_names[i], zrate_means[i]))
    print("Zero Crossing standard deviation of {} is {}".format(class_names[i], zrate_stds[i]))
    print("Zero Crossing skewness of {} is {} \n".format(class_names[i], zrate_skews[i]))