# Neural Networks for Musical Instrument Classification

In this assignment, we will attempt a musical instrument classification problem.  Given a sample of music, we want to determine which instrument (e.g. trumpet, violin, piano) is playing.  

*This assignment is closely based on one by Sundeep Rangan, from his [IntroML GitHub repo](https://github.com/sdrangan/introml/).*


In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Audio Feature Extraction with Librosa

The key to audio classification is to extract useful features. The `librosa` package in Python has a rich set of methods for extracting the features of audio samples commonly used in machine learning tasks, such as speech recognition and sound classification. 


In [None]:
import librosa
import librosa.display
import librosa.feature

In this lab, we will use a set of music samples from the website:

http://theremin.music.uiowa.edu

We will use the `wget` command to retrieve one file to our Google Colab storage area. (We can run `wget` and many other basic Linux commands in Colab by prefixing them with a `!` or `%`.)

In [None]:
!wget "http://theremin.music.uiowa.edu/sound files/MIS/Woodwinds/sopranosaxophone/SopSax.Vib.pp.C6Eb6.aiff"

Now, if you click on the small folder icon on the far left of the Colab interface, you can see the files in your Colab storage. You should see the "SopSax.Vib.pp.C6Eb6.aiff" file appear there.

In order to listen to this file, we'll first convert it into the `wav` format. Again, we'll use the `!` to run a basic command-line utility: `ffmpeg`, a powerful tool for working with audio and video files.

In [None]:
aiff_file = 'SopSax.Vib.pp.C6Eb6.aiff'
wav_file = 'SopSax.Vib.pp.C6Eb6.wav'

!ffmpeg -y -i $aiff_file $wav_file

Now, we can play the file directly from Colab. If you listen to it, you will hear a soprano saxophone (with vibrato) playing four notes (C, C#, D, Eb).

In [None]:
import IPython.display as ipd
ipd.Audio(wav_file) 

Next, use the `librosa.load` function to read the audio file `aiff_file` and get the samples `y` and the sample rate `sr`. (By default, `librosa.load` resamples the audio to 22,050 Hz and mixes it down to mono.)

In [None]:
y, sr = librosa.load(aiff_file)

Feature engineering from audio files is an entire course in its own right. A commonly used set of features is the Mel Frequency Cepstral Coefficients (MFCCs). These are derived from the so-called mel spectrogram, which extracts features that correlate with human audio perception.

You can run the code below to display the mel spectrogram from the audio sample.

You can easily see the four notes played in the audio track. You can also see the 'harmonics' of each note, which are other tones at integer multiples of the fundamental frequency of that note.

In [None]:
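# melspectrogram returns a power spectrogram by default, so convert to dB with power_to_db
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                         sr=sr, y_axis='mel', fmax=8000, x_axis='time')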
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()
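
To make the idea of harmonics concrete, the sketch below synthesizes a tone from a fundamental and two harmonics using only NumPy, then recovers the strongest frequencies from its spectrum. The sample rate, the fundamental frequency (roughly the pitch C6, rounded to a whole number of Hz so that it falls exactly on an FFT bin), and the harmonic amplitudes are illustrative assumptions, not values measured from the saxophone recording.

```python
import numpy as np

# Illustrative assumptions: these values are not taken from the recording.
sr_demo = 22050                 # sample rate in Hz
f0 = 1047.0                     # fundamental in Hz (approximately C6)
duration = 1.0                  # seconds
amplitudes = [1.0, 0.5, 0.25]   # fundamental, 2nd harmonic, 3rd harmonic

t = np.arange(int(sr_demo * duration)) / sr_demo

# Sum sinusoids at integer multiples of the fundamental frequency.
tone = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
           for k, a in enumerate(amplitudes))

# Magnitude spectrum and the matching frequency axis.
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1.0 / sr_demo)

# Frequencies of the three largest spectral peaks, in ascending order.
peak_freqs = np.sort(freqs[np.argsort(spectrum)[-3:]])
print(peak_freqs)
```

The three peaks sit at one, two, and three times `f0`, which is the same pattern of evenly spaced bands that appears above each note in the spectrogram.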

## Downloading the Data


Using the MFCC features described above, [Prof. Juan Bello](http://steinhardt.nyu.edu/faculty/Juan_Pablo_Bello) and his former PhD student Eric Humphrey have created a complete data set that can be used for instrument classification. Essentially, they collected a number of data files from the website above. For each audio file, they segmented the track into notes and then extracted 120 MFCCs for each note. The goal is to recognize the instrument from the 120 MFCCs. The process of feature extraction is quite involved, so we will just use their processed data.

To retrieve their data, visit

https://github.com/marl/dl4mir-tutorial/blob/master/README.md

and note the password listed on that page. Click on the link for "Instrument Dataset", enter the password, click on `instrument_dataset` to open the folder, and download the four files there. (You can "direct download" straight from this site; you don't need a Dropbox account.)


Then, upload the files to your Google Colab storage: click on the folder icon on the left to see your storage, if it isn't already open, and then click on "Upload". Wait until _all_ uploads have completed.

Then, load the files with:

In [None]:
Xtr = np.load('uiowa_train_data.npy')
ytr = np.load('uiowa_train_labels.npy')
Xts = np.load('uiowa_test_data.npy')
yts = np.load('uiowa_test_labels.npy')

Examine the data you have just loaded in:

* How many training and test samples are there?
* What is the number of features for each sample?
* How many classes (i.e. instruments) are there?

Write some code to find these values and print them.
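
The NumPy attributes and functions that answer questions like these can be tried first on a small made-up array. In the sketch below, the names `X_toy` and `y_toy` and their contents are hypothetical stand-ins, not the assignment data.

```python
import numpy as np

# Hypothetical stand-in data: 6 samples, 4 features, integer class labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 4))
y_toy = np.array([0, 2, 1, 2, 0, 1])

n_samples, n_features = X_toy.shape   # rows are samples, columns are features
classes = np.unique(y_toy)            # sorted array of the distinct labels
n_classes = len(classes)

print(f"samples: {n_samples}, features: {n_features}, classes: {n_classes}")
```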


In [None]:
# TODO 1 

Then, standardize the training and test data, `Xtr` and `Xts`, by removing the mean of each feature and scaling to unit variance. 


You can do this manually, or using `sklearn`'s [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). 

Make sure you standardize both the training and test data using the mean and variance of the *training data only*.  (If using a `StandardScaler`: create a single `StandardScaler`, call `fit` with the training data, then call `transform` with the training data, and finally call `transform` with the test data.)

<small>Standardizing input data can make the gradient descent easier; see [this video](https://www.youtube.com/watch?reload=9&v=UIp2CMI0748) for further explanation.</small>
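
As an illustration of the "training statistics only" rule, here is a minimal sketch of the manual approach on toy data. The arrays `X_train_toy` and `X_test_toy` are hypothetical and far smaller than the real data set.

```python
import numpy as np

# Hypothetical toy data: 5 training and 2 test samples with 3 features.
X_train_toy = np.array([[1.0, 10.0, 100.0],
                        [2.0, 20.0, 200.0],
                        [3.0, 30.0, 300.0],
                        [4.0, 40.0, 400.0],
                        [5.0, 50.0, 500.0]])
X_test_toy = np.array([[3.0, 10.0, 500.0],
                       [6.0, 30.0, 300.0]])

# The statistics come from the training data only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)

# The same shift and scale are applied to both sets.
X_train_std = (X_train_toy - mu) / sigma
X_test_std = (X_test_toy - mu) / sigma

print(X_train_std.mean(axis=0))  # close to 0 for every feature
print(X_train_std.std(axis=0))   # close to 1 for every feature
print(X_test_std.mean(axis=0))   # generally not 0: test data reuses training statistics
```

The standardized test data will usually not have exactly zero mean or unit variance, and that is expected: the test set must be treated as unseen data.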

In [None]:
# TODO 2 Scale the training and test matrices
# Xtr_scale = ...
# Xts_scale = ...

## Building a Neural Network Classifier

Following the example in the [demo you have seen](https://colab.research.google.com/drive/1t2OeBGcfB5HSDFl6FPQFaQKbmeEAPPgG?usp=sharing), prepare and create a neural network with the following configuration:

* 256 hidden units in a single dense hidden layer
* sigmoid activation at hidden units
* `softmax` activation at the output (since this is a multi-class classification problem)
* Cross-entropy loss
* Adam optimizer with a learning rate of 0.001
* print the model summary
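
As a reminder of what this configuration computes, the following NumPy sketch runs one forward pass through a network with a single sigmoid hidden layer and a softmax output, then evaluates the cross-entropy loss. It is not the Keras model the assignment asks for: the weights are random, the batch is made up, and the sizes `n_in`, `n_hidden`, and `n_out` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the assignment's dimensions).
n_in, n_hidden, n_out = 5, 8, 3
batch = 4

X = rng.normal(size=(batch, n_in))   # made-up inputs
y = np.array([0, 2, 1, 2])           # made-up integer class labels

# Small random weights and zero biases.
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

hidden = sigmoid(X @ W1 + b1)        # hidden activations lie in (0, 1)
probs = softmax(hidden @ W2 + b2)    # each row sums to 1

# Cross-entropy: mean negative log-probability of the true class.
loss = -np.mean(np.log(probs[np.arange(batch), y]))

# Trainable parameters: weights plus biases of both layers.
n_params = (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)
print(f"loss = {loss:.4f}, parameters = {n_params}")
```

The same parameter-count formula can be used to check the numbers reported by the model summary once the real input and output sizes are known.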


In [None]:
# TODO 3 construct the model, print model summary, and compile the model

Fit the model for 10 epochs (passes through the entire data). Use the scaled training data to fit the model, and also pass the test data as "validation data" so that the loss and accuracy will be computed on the test data as well.

Use a batch size of 128.  Your final accuracy should be >99%.

In [None]:
# TODO 4 fit the model

Plot the training and test accuracy vs. epochs on one subplot, and the training and test loss vs. epoch on another subplot.  Use a log scale for the vertical axis on the loss plot.

You should see that the test accuracy saturates at a little higher than 99%.  After that it may "bounce around" due to the noise in the stochastic mini-batch gradient descent.

In [None]:
# TODO 5 two subplots: one of accuracy vs. epochs, one of loss vs. epochs
# in each subplot, show training in one color and test in another color

## Varying the Learning Rate

One challenge in training neural networks is the selection of the learning rate.  Rerun the above code, trying four learning rates as shown in the vector `rates`.  For each learning rate, 

* clear the session
* prepare a neural network model as described above, with the appropriate learning rate
* train the model for 20 epochs
* save the accuracy and losses

In [None]:
rates = [0.1, 0.01, 0.001, 0.0001]

# TODO 6
for lr in rates:
    # do stuff here...
    pass  # placeholder so the cell runs; replace with your code


Plot the training loss vs. the epoch for all of the learning rates on one plot.  You should see that the lower learning rates are more stable but converge more slowly, while with a learning rate that is too high, the gradient descent may fail to move towards weights that decrease the loss function.
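
This trade-off can be reproduced on a one-dimensional toy problem. The sketch below applies plain gradient descent (not Adam, and not a neural network) to the quadratic loss `f(w) = 0.5 * a * w**2` with an assumed curvature `a = 25`, using the same four rates. Each step multiplies `w` by `1 - lr * a`, so the iteration diverges whenever that factor has magnitude greater than 1.

```python
# Toy problem (an assumption for illustration): minimize f(w) = 0.5 * a * w**2.
a = 25.0
rates = [0.1, 0.01, 0.001, 0.0001]
n_steps = 20

final_loss = {}
for lr in rates:
    w = 1.0                      # same starting point for every rate
    for _ in range(n_steps):
        grad = a * w             # derivative of 0.5 * a * w**2
        w = w - lr * grad        # plain gradient descent update
    final_loss[lr] = 0.5 * a * w ** 2

for lr, loss in final_loss.items():
    print(f"lr = {lr:<7} final loss = {loss:.3e}")
```

On this toy problem the rate 0.1 makes the loss grow, 0.01 converges quickly, and the two smallest rates reduce the loss steadily but slowly. Which rate counts as "too high" depends on the curvature of the loss, so the thresholds for the neural network will differ.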

In [None]:
# TODO 7 one plot showing training loss vs. epoch
# use a different color for each learning rate