In [1]:
from util_functs import *
from pathlib import Path
from IPython.display import Audio
import time
import os
import shutil
from os.path import isdir
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import imageio
import h5py
import librosa  # used explicitly below; previously only available via the util_functs star import

from keras.models import load_model
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%pylab inline

np.random.seed(1)

Using TensorFlow backend.


Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


Here, we test the trained model on the test data and see how it performs!

In [3]:
#Testing the model on the unseen test data. 

#Define the path to the model and the path to the 6-second audio clips
model_path = 'models/model_isaac.h5'
audio_path = 'input_clips/'

#Generate the image & label data - as well as the dataframe containing filename-label pairs - using the
#generate_data function
X_test, y_test, ms_df_test = generate_data((34,50,4), ms_dir='test_ms/', csv_name='test.csv')

#Load the model & make predictions on the test data. The softmax predictions need to be converted to
#(1-based) index predictions before they can be compared with the true labels
model = load_model(model_path)

predictions = model.predict(X_test)
predictions = np.argmax(predictions, axis=1) + 1
predictions = predictions.reshape(len(ms_df_test),1)

#Calculate the difference between the predictions and true labels, then find where the difference is 0 (correct predictions) 
diff_array = predictions - y_test
correct_preds = diff_array[diff_array == 0]

#Finally, calculate the accuracy of the model on the test data
acc_percent = (len(correct_preds) / len(diff_array)) * 100
acc_percent

79.6116504854369

The model does well, achieving ~80% accuracy on the test set. Considering that the data isn't the cleanest, this is decent!
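The accuracy calculation above can also be written as a direct comparison of predicted and true labels, which makes a per-song breakdown easy to obtain. The following is a minimal sketch, not the notebook's own code: the function name `accuracy_from_softmax`, its `label_offset` parameter and the toy arrays are all hypothetical, and labels are assumed to start at 1 as they do in the cell above.

```python
import numpy as np

def accuracy_from_softmax(probs, labels, label_offset=1):
    """Return overall accuracy (%) and a dict of per-class accuracy (%).

    probs:  (n_samples, n_classes) array of softmax outputs.
    labels: integer labels, shape (n_samples,) or (n_samples, 1).
    label_offset: value of the smallest label (assumed to be 1 here).
    """
    probs = np.asarray(probs)
    labels = np.asarray(labels).ravel()  # flatten so the shapes always match
    preds = np.argmax(probs, axis=1) + label_offset
    correct = preds == labels
    overall = float(correct.mean() * 100)
    per_class = {int(c): float(correct[labels == c].mean() * 100)
                 for c in np.unique(labels)}
    return overall, per_class

# Toy data for illustration only (these are not the notebook's results)
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.2, 0.6]])
toy_labels = np.array([[1], [2], [3], [3]])

overall, per_class = accuracy_from_softmax(toy_probs, toy_labels)
print(overall, per_class)
```

With the real data, this would presumably be called as `accuracy_from_softmax(model.predict(X_test), y_test)`. The per-class values show which songs the model confuses most, which a single overall figure hides.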

In [6]:
#Quick demonstration of the predictions on the test data: pick a random test image
i = np.random.randint(len(ms_df_test))
prediction = np.argmax(model.predict(np.array([X_test[i]]))[0]) + 1

print("For the image {}, the model predicts that the song is {}".format(ms_df_test.iloc[i].fname, str(prediction) + ": " + LABEL_DICT[prediction]))

#Play the clip associated with the above image
filename = audio_path + (os.path.splitext(ms_df_test.iloc[i].fname)[0] + '.wav')
clip, sample_rate = librosa.load(filename, sr=None)
Audio(clip, rate=sample_rate)

For the image 2_s_clip229.png, the model predicts that the song is 2: Survivor - Eye of the Tiger


# Conclusion/Future Work

The model trained here used only one voice as its input source, so it would likely generalize better if more voices were included. The songs were chosen carefully: each contains unique, distinctive frequency patterns. The assumption was that few voices are needed when training on these songs, since the frequency patterns of the 6-second recordings should look similar across gender and race. Moreover, this approach clearly cannot be used at scale (with millions of songs and more released every day, the model would need constant re-training!), but it serves as a cool little approach for a pet project such as this :)

With that said, the clear next step is to incorporate more voices into training, particularly female voices. Originally, 5 voices were recorded for the project, but the audio quality of the other recordings was not great. Thus, for the purposes of this project, only the one voice was used, but it would be very easy to get other people to record high-quality audio and improve the generalization of the model despite the constraints of little data.

As a little aside, you can try the model with your own clips! Simply specify the clip's filename and the directories that store your raw recordings, input clips and input melspecs:

In [8]:
## INSERT LOCATION OF CLIP HERE ##
your_clip_name = '1.mp3'
##################################
## SPECIFY RAW AUDIO DIR (your_recording_dir), INPUT CLIP DIR (your_clip_dir), INPUT MELSPEC DIR (your_ms_dir) ##
your_recording_dir = "your_recordings/" 
your_clip_dir = "your_audio_clips/"
your_ms_dir = "your_ms/"

#-------------------------------------------------------#

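# Load the first 6 seconds of the clip and write it to .wav format
# (librosa.output.write_wav was removed in librosa 0.8; on newer versions, soundfile.write can be used instead)
clip, sample_rate = librosa.load(your_recording_dir + your_clip_name, sr=None, duration=6)
librosa.output.write_wav(your_clip_dir + os.path.splitext(your_clip_name)[0] + ".wav", clip, sample_rate)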

#Generate the mel-spectrograms from the clips
generate_melspecs(input_dir=your_clip_dir, ms_output=your_ms_dir)

#Generate the input data
X, _ = initialise_data(img_folder=your_ms_dir, image_shape=(34,50,4))
img_name = os.path.splitext(your_clip_name)[0] + '.png'
img = imageio.imread(your_ms_dir + img_name)
X[0] = img

#Calculate the prediction on the input image, then get the best and second best predictions from the softmax vector
prediction = model.predict(np.array([X[0]]))[0] #softmax vector 
ind = np.argpartition(prediction, -2)[-2:] #get indices of top two preds
best_guess = ind[np.argsort(prediction[ind])][1] + 1
second_best = ind[np.argsort(prediction[ind])][0] + 1

#Print the predictions
print("For the image {}, best guess: {}, 2nd best guess: {}".format(img_name, LABEL_DICT[best_guess], LABEL_DICT[second_best]))

#Play the clip associated with the above image
filename = your_clip_dir + os.path.splitext(your_clip_name)[0] + ".wav"
clip, sample_rate = librosa.load(filename, sr=None)
Audio(clip, rate=sample_rate)

For the image 1.png, best guess: Michael Jackson - Thriller, 2nd best guess: Survivor - Eye of the Tiger
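The best and second-best guesses above are picked out of the softmax vector with `np.argpartition`. For a handful of classes, a plain `np.argsort` is arguably easier to read and extends to any number of guesses. The following is a minimal sketch under that assumption: the helper `top_k_labels` and the toy vector are hypothetical and not part of `util_functs`, and labels are assumed to be 1-based, as in `LABEL_DICT`.

```python
import numpy as np

def top_k_labels(softmax_vec, k=2, label_offset=1):
    """Return the k most probable (label, probability) pairs, best first.

    softmax_vec:  1-D array of class probabilities.
    label_offset: value of the smallest label (assumed to be 1 here).
    """
    softmax_vec = np.asarray(softmax_vec)
    order = np.argsort(softmax_vec)[::-1][:k]  # indices sorted by descending probability
    return [(int(i) + label_offset, float(softmax_vec[i])) for i in order]

# Toy softmax vector for illustration only (not real model output)
toy_vec = np.array([0.05, 0.30, 0.10, 0.55])
ranked = top_k_labels(toy_vec, k=2)
print(ranked)
```

In the cell above, `top_k_labels(prediction)` would give the same two labels as `best_guess` and `second_best`, together with their probabilities, which can then be looked up in `LABEL_DICT`.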


## References
[1] Fast.ai experimental audio classification module: https://github.com/sevenfx/fastai_audio <br/>
[2] Article for fast.ai audio module: https://towardsdatascience.com/audio-classification-using-fastai-and-on-the-fly-frequency-transforms-4dbe1b540f89 <br/>
[3] A CNN architecture for classifying digits with spectrograms: https://medium.com/x8-the-ai-community/audio-classification-using-cnn-coding-example-f9cbd272269e <br/> https://github.com/Jakobovski/free-spoken-digit-dataset