# SE_25 Mitchell Thomson SoundHunt
## A Sound Organization Project






    

SoundHunt attempts to cluster similar sounds by reading the chroma vector of each sound file, clustering the results, and then reducing the dimension of the 12-note matrix. The main question is whether these methods can give an accurate comparison measurement. They proved effective for comparing very distinct sounds, but the information was too vague to tell apart sound files that are similar yet different.

The goal of the 'SoundHunt' project is to build a tool that gives sound designers faster access to their sounds and a better organization method, based on the actual sound data rather than on categories tagged by a user. What is an efficient way to import, read, analyse and categorize large numbers of sound files that is both effective and usable for sound designers? Breaking down this question poses a lot of challenges and there are many ways to approach it; answering it was the goal of the project. The main question for this Jupyter notebook is how to first find a comparison between sound data, and then how that compared data can be clustered and organized in a meaningful way. The methods showcased are the FFT, feature extraction, dimension reduction, chroma vector comparison, and K-means.


In [None]:
from scipy import signal
from scipy.fftpack import fft,fftfreq
import numpy as np
import pandas as pb
import wavio
import sys
import os
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.cluster import homogeneity_score
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import ShortTermFeatures
from pyAudioAnalysis import Myfeature
import librosa
import librosa.display
%matplotlib inline

In [None]:
path = '/mnt/c/users/mitch/projects/soundproject/soundhunt/sounds'
saveToPath='/mnt/c/users/mitch/desktop'
path2 = '/mnt/c/users/mitch/projects/soundproject/soundhunt/selectivesounds'

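titles = []
sounds = []
for r,d,f in os.walk(path):
    for file in f:
        # match on the extension only, so names that merely contain ".wav" are skipped
        if file.endswith(".wav"):
            titles.append(file)
            # join with the directory being walked (r) so files in subfolders load too
            s = wavio.read(os.path.join(r, file))
            sounds.append(s)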


title = {'Title':titles}
titlesDF = pb.DataFrame(title)

print(titlesDF)


To begin, the sound files need to be read into the program.
Here we use wavio, which reads each sound file in with all of its data points (samples) and its sampling rate.

We then store the file names in a dataframe; above is the dataframe of titles for all the incoming sounds.

In [None]:
plt.plot(sounds[5].data)

Above is an example of what one of our .wav files looks like as a waveform. The orange and blue waves show that the sound comes in on both the left and right channels, with one trace for each.

It is a useful reference for seeing how the wave changes as we start extracting features.

sounds[5] is plotted through its .data attribute because each element of sounds holds both the data points of the waveform and the sampling rate. The sampling rate is not useful for these graphs, so we ignore it for now.

In [None]:
times = []
amplitude = []
for curSound in sounds:
    data = curSound.data                 # shape: (samples, channels)
    samples = data.shape[0]
    rate = curSound.rate
    # axis=0 runs the FFT along time; the default (last axis) would
    # transform across the channels of each sample instead
    fftdata = abs(fft(data, axis=0))
    curtime = samples/rate               # duration in seconds
    maxampfft = np.amax(fftdata)
    times.append(curtime)
    amplitude.append(maxampfft)
    
SoundDF = {'Titles':titles,'Time':times,'Max Amp':amplitude}
SoundDF = pb.DataFrame(SoundDF)
print(SoundDF)

The cell above is the beginning of the feature extraction.

For now we want to experiment with feature extraction, and we are interested in the maximum amplitude of each sound as well as its duration. This will hopefully show whether a similarity measurement can be found using only time and max amplitude.

Finding the length of each sound is easy: wavio gives us the sound's sampling rate and the number of samples stored in the file, so the duration in seconds is just samples / sampling rate.

Max amplitude was a bit more work. To extract it, a Fast Fourier Transform (FFT) was applied to the sound. An FFT takes a sound wave and breaks it into its frequency-domain representation, which makes it easier to apply filters and remove data we do not need.

The result of the FFT is a numpy array, so we can call np.amax() on its absolute value to find the largest value in the array. This is what gives us the max amplitude (strictly, the magnitude of the strongest frequency component).

The output is again a pandas dataframe of the same sound files, this time with the length of each file and its maximum amplitude.
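As a rough illustration of the two features described above, here is a minimal sketch on a synthetic tone rather than one of the project's files. The sampling rate, tone frequency and amplitude are made-up values, and it uses numpy's own FFT routines (`np.fft.rfft`, `np.fft.rfftfreq`) instead of `scipy.fftpack`, purely to keep the example self-contained.

```python
import numpy as np

# Synthetic stand-in for a loaded file: 0.5 s of a 440 Hz tone (hypothetical values).
rate = 8000                          # samples per second
samples = 4000                       # number of samples in the "file"
t = np.arange(samples) / rate
wave = 0.8 * np.sin(2 * np.pi * 440 * t)

duration = samples / rate            # length in seconds

spectrum = np.abs(np.fft.rfft(wave))           # magnitude of each frequency bin
freqs = np.fft.rfftfreq(samples, d=1 / rate)   # frequency (Hz) of each bin
max_amp = np.amax(spectrum)                    # the "Max Amp" feature
peak_freq = freqs[np.argmax(spectrum)]         # frequency where that maximum sits

print(duration, peak_freq, max_amp)
```

The position of the maximum (`peak_freq`) is arguably as informative as its height, since the height scales with both the loudness and the length of the recording.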


In [None]:
# axis=0 takes the FFT along time for each channel
plt.plot(abs(fft(sounds[5].data, axis=0)))

For comparison, here is the same sound file we looked at earlier, now with the FFT applied

In [None]:
soundData =SoundDF[['Time','Max Amp']]
time = soundData['Time']
amp = soundData['Max Amp']
# one point per sound: duration on X, max FFT amplitude on Y
plt.scatter(time, amp)
plt.xlabel('Time (s)')
plt.ylabel('Max Amp')

Above is the first comparison of the sound files, drawn as a scatterplot.

The X axis is the duration of each sound and the Y axis is its maximum amplitude.

In [None]:
X=soundData.rename_axis('ID').values
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
Y = kmeans.predict(X)
plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')

This was the first attempt at extracting features, comparing and clustering the sound data, and as seen above it was not successful at grouping similar sounds together.

There are a few reasons why I believe this dataset did not yield a usable result:

1. Time and maximum amplitude alone are not enough data to compare sounds to each other. Two explosion sounds may have drastically different values for both, which would put them in separate clusters rather than together.

2. The amplitude should have been normalized before comparing the files. Without normalization, the raw recording level of each file dominates the value, and because K-means works on plain distances, the much larger amplitude numbers also swamp the time values.
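One simple way the normalization in point 2 could be done is peak normalization, sketched below. This is an assumption about what "normalized" would mean here, not something the project implemented; `peak_normalize` is a hypothetical helper and the sample values are invented.

```python
import numpy as np

def peak_normalize(samples):
    """Scale a waveform so its largest absolute sample is 1.0."""
    samples = np.asarray(samples, dtype=float)
    peak = np.max(np.abs(samples))
    if peak == 0:                    # silent file: nothing to scale
        return samples
    return samples / peak

quiet = np.array([0.0, 0.1, -0.2, 0.05])
loud = quiet * 50                    # same shape, recorded 50x louder

# After normalization the two recordings are indistinguishable.
print(np.allclose(peak_normalize(quiet), peak_normalize(loud)))
```

Applied before the FFT, this would make the amplitude feature reflect the shape of the sound rather than how loudly it happened to be recorded.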


## New Approach

From the results above it was clear that a new approach was needed. There were a few options that we began to try:

1. Changing the library used to open and analyze sound files
At this point we switched to pyAudioAnalysis as the main library for both opening and analyzing sound files. wavio is fine for opening files and would also have worked; it just made sense to open the files with the same library that analyzes them.

2. Extracting different features
Researching which other features could be extracted from sound files turned up several options: zero crossing rate, energy, or the chroma vector. I decided to start with the zero crossing rate.

The zero crossing rate measures how often the waveform crosses zero, that is, how often the signal changes sign. I was able to measure this for all sound files but was unsure how to use it to start comparing them, so I quickly switched over to the chroma vector.
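To make the zero crossing rate concrete, here is a small sketch that counts sign changes between neighbouring samples. The function name and the test tones are illustrative assumptions, and the exact normalization used inside pyAudioAnalysis may differ from this one.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of neighbouring sample pairs whose signs differ."""
    signs = np.sign(frame)
    crossings = np.count_nonzero(np.diff(signs))
    return crossings / (len(frame) - 1)

rate = 8000
t = np.arange(rate) / rate                      # one second of audio
low = np.sin(2 * np.pi * 100 * t + 0.1)         # 100 Hz tone
high = np.sin(2 * np.pi * 1000 * t + 0.1)       # 1000 Hz tone

print(zero_crossing_rate(low), zero_crossing_rate(high))
```

A higher-pitched or noisier sound crosses zero more often, so this single number gives a crude measure of how "bright" or noisy a sound is.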

The chroma vector describes how the energy of a sound is spread across the 12 pitch classes (notes) of the musical scale; "chroma" refers to the "colour" of a pitch. There are 12 chroma vector values, one per note:
"G#", "G", "F#", "F", "E", "D#", "D", "C#", "C", "B", "A#", "A"

Notes that are an octave apart are perceived as having a similar colour, so a pitch can be broken into two components: pitch height (which octave) and chroma (which note).

If you break a sound down into these chroma values to find how often and how strongly each note occurs, you can begin comparing sounds to each other.
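The octave idea can be shown with a few lines of arithmetic. The sketch below maps a frequency to its nearest note name while discarding the octave, assuming equal temperament with A4 = 440 Hz; `pitch_class` and the note ordering starting at A are illustrative choices, not the library's internals.

```python
import math

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def pitch_class(freq_hz, ref_a4=440.0):
    """Name of the nearest equal-tempered note, ignoring the octave."""
    semitones_from_a = round(12 * math.log2(freq_hz / ref_a4))
    return NOTE_NAMES[semitones_from_a % 12]

# Doubling or halving a frequency moves it one octave but keeps its chroma.
print(pitch_class(220.0), pitch_class(440.0), pitch_class(880.0))
print(pitch_class(261.63))   # roughly middle C
```

A chroma vector is essentially this mapping applied to every frequency bin of a spectrum, with the energy of all bins that land on the same note added together.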

In [None]:
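filenames = []
titles = []
# separate loop names keep os.walk from overwriting `path`,
# which is still needed below to build each file's location
for (root, dirnames, files) in os.walk(path):
    filenames.append(files)

filenames = filenames[0]   # files in the top-level folder only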

for sound in filenames:
    if '.wav' in sound:
        [sr, data] = audioBasicIO.read_audio_file(path + '/' + sound)
        feature, feature_categories = ShortTermFeatures.feature_extraction(data[ :, 0 ],sr, 0.050*sr,0.025*sr)
        file = sound.replace('.wav','')
        titles.append(file)


Above, the same audio files are read in again.

sr is the sampling rate of the audio file being read
data again holds the data points of the incoming sound wave

The next line is new and gives us 'feature'.
ShortTermFeatures.feature_extraction is a call from the pyAudioAnalysis library that takes the data of a sound and its sampling rate and then applies window analysis. Only the first channel, data[:, 0], is passed in.

Window analysis uses two variables, a window and a step:
the window is 0.050*sr samples (50 ms)
the step is 0.025*sr samples (25 ms)
This method slides a 50 ms window over the sound data and analyzes each window separately.
To make sure that no data is lost we specify the step, in this case half the length of the window. That means every 25 ms we open a new 50 ms wide window onto the data.

This 50% overlap makes sure no data is missed and gives a more accurate measurement overall. The smaller the window and step, the more windows, and so the more data points, you get per sound.
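The window and step arithmetic can be sketched without any audio library. The helper below is hypothetical and only computes where each window would start; the 16 kHz rate and one-second length are example values, not properties of the project's files.

```python
def frame_starts(num_samples, rate, window_s=0.050, step_s=0.025):
    """Start index of every full analysis window in a signal."""
    window = round(window_s * rate)   # samples per window
    step = round(step_s * rate)       # samples between window starts
    starts = list(range(0, num_samples - window + 1, step))
    return starts, window

# One second of audio at 16 kHz (made-up example values).
starts, window = frame_starts(num_samples=16000, rate=16000)
print(window, len(starts), starts[:3])
```

Each window starts halfway through the previous one, which is the 50% overlap described above, and the number of windows is what later shows up as the number of columns in the feature matrix.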

ShortTermFeatures uses the data and the window analysis to first perform a time-specific FFT on each window (rather than one taken over the entire sound wave like before). It then computes several audio features for each window, including the chroma vectors that we need for the comparison.

The cell then appends the titles for later storage; this time I removed the .wav from the end of each, just for cleaner looking dataframes.

In [None]:
pb.DataFrame(feature,index = feature_categories)

In [None]:
featureDF=pb.DataFrame(feature,index = feature_categories)
featureDF.to_csv(saveToPath +"/featuresOneSound.csv")

Here we write out the entire feature extraction of one sound (the last file read in the loop above). It is a very large number of data points; to see it in full, open the csv file that the cell writes.

Each row is one short-term feature, labelled with what it represents.

We are only interested in rows 21-32 (chroma_1 to chroma_12), as those are the chroma vectors mentioned before.

The additional data is not used in this project.

In [None]:
for sound in filenames:
    if '.wav' in sound:
        [sr, data] = audioBasicIO.read_audio_file(path + '/' + sound)
        feature,_ = ShortTermFeatures.feature_extraction(data[ :, 0 ],sr, 0.050*sr,0.025*sr)
        chrono = feature[21]
        for i in range(22,33):
            temp = feature[i]
            chrono = np.vstack([chrono,temp])
pb.DataFrame(chrono)

The cell above is the same as before: it loads the sound files and does the feature extraction, except that now we take only the part we need, the chroma vector values, and stack them into the array chrono. These are the same rows as chroma_1 to chroma_12 above.

This loop ended up taking very long to run, so the following cells get rid of the inner loop and instead slice the needed rows [21:33] out of each array, which makes the runtime much faster.

In [None]:
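chroma_rows = []
for sound in filenames:
    if '.wav' in sound:
        [sr, data] = audioBasicIO.read_audio_file(path + '/' + sound)
        feature, _ = ShortTermFeatures.feature_extraction(data[ :, 0 ],sr, 0.050*sr,0.025*sr)
        
        # rows 21-32 hold chroma_1 .. chroma_12
        chrono = feature[21:33]
        
        notes = chrono.shape[1]              # number of analysis windows
        noteVal = chrono.argmax(axis = 0)    # strongest note index (0-11) per window
        # range=(0, 12) pins one bin to each note; without it the bins span only
        # min..max of noteVal and stop lining up with the 12 notes
        x, _ = np.histogram(noteVal, bins = 12, range = (0, 12))
        normalData = x.reshape( 1, 12 ).astype( float ) / notes
        nb =pb.DataFrame(normalData)
        chroma_rows.append(nb)

# DataFrame.append was removed in pandas 2.0; concatenate the rows once instead
ChromaData = pb.concat(chroma_rows, ignore_index = True)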


Here again we extract the chroma vectors needed to calculate the notes.
However, as seen in the dataframe above, there are 686 columns of data for each sound (one per analysis window), and we need 12 values, one for each note.
To do this we normalize the data for each sound.
notes is the number of data points (windows) for the sound, in this case again 686. noteVal gives, for each window, the index of the most prominent note.
The numpy histogram function then counts how many times each of the 12 notes was the most prominent.
The data is finally normalized by reshaping the histogram counts into a single row of 12 note categories and dividing by notes, so each value is the fraction of windows in which that note dominated.
Finally we store the row in a temporary dataframe, nb, and add it to the main storage, ChromaData.
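The reduction from many windows to 12 values can be followed on a toy matrix. The sketch below uses made-up chroma values and swaps in `np.bincount` with `minlength=12` for the histogram call, since it guarantees exactly one slot per note index; treat it as an illustration of the idea rather than the notebook's exact code.

```python
import numpy as np

# Toy chroma matrix: 12 note rows x 5 analysis windows (made-up values).
chroma = np.zeros((12, 5))
chroma[0, [0, 1, 2]] = 1.0     # note 0 is strongest in three windows
chroma[7, [3, 4]] = 1.0        # note 7 is strongest in two windows

dominant = chroma.argmax(axis=0)                 # strongest note per window
counts = np.bincount(dominant, minlength=12)     # one slot per note, 0..11
profile = counts / chroma.shape[1]               # fraction of windows per note

print(profile)
```

Whatever the length of the sound, the result is always a 12-value profile that sums to 1, which is what makes sounds of different durations comparable.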
You can see the results below


In [None]:
pb.DataFrame(ChromaData.values,columns = ["G#", "G", "F#", "F", "E", "D#", "D", "C#", "C", "B", "A#", "A"])

In [None]:
df=ChromaData

X = df.values

model = KMeans(n_clusters = 3)
model.fit(X)

Y = model.predict(X) # --> 0-max_amount_cluster
all_values = np.append(model.cluster_centers_, X, axis=0)

pca = PCA(n_components=2)
pca.fit(all_values)
X_transformed = pca.transform(X)
cluster_centers_transformed = pca.transform(model.cluster_centers_)

plt.scatter(X_transformed[:,0], X_transformed[:,1],c=model.labels_,cmap = 'rainbow')
plt.scatter(cluster_centers_transformed[:,0], cluster_centers_transformed[:,1],color = 'black')

finalDF = {'Title':titles,'Cluster #':model.labels_}


Here is where the data is clustered into its organized categories,

using the K-means clustering method from sklearn.

K-means clustering starts by picking "center points", one for each of the K clusters you ask for; in the basic algorithm these are picked at random.

The distance from every data point to each center point is measured, and the point joins the cluster of whichever center is closest.

After all data points are assigned, the mean of each cluster is computed to give that cluster a new center point. From there the data is measured and reassigned again.

This step repeats until the cluster assignments no longer change.
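The loop described above can be written out in a few lines of numpy. This is a bare-bones sketch of the textbook procedure with random starting centers, not sklearn's implementation (which by default uses the smarter k-means++ initialization); the function name and the four sample points are invented for the example.

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means: assign to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # new center = mean of its members (an empty cluster keeps its old center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):    # nothing moved: converged
            break
        centers = new_centers
    return labels, centers

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(points, k=2)
print(labels)
```

The two nearby pairs end up in the same cluster regardless of which points are drawn as starting centers, although which cluster gets label 0 depends on the random draw.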

After feeding the clustering algorithm the data and the number of clusters (the n_clusters value in the cell above), each data point is assigned to a cluster.

However, visualizing the clusters on a 2D graph runs into an issue.

Each sound so far is a 1x12 row containing all the note values, and 12 dimensions can't be displayed on a scatter plot like the one above.

This is where PCA (Principal Component Analysis) from sklearn comes in. It is a linear dimension reduction; the data is not scaled, but it is recentered.

I also had to make sure that the center points were reduced in the same way as the other data points, so that the plot stays consistent and does not mix reduced data with non-reduced data.

In the plot above each color is one cluster, showing how the clusters spread apart and group together. The black dots are the center points of the clusters.
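To show what "recentered but not scaled" and "reduced in the same way" mean in practice, here is a sketch of the projection step done by hand with numpy's SVD. The random matrix stands in for the 12-note profiles and the first three rows stand in for cluster centers; both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 12))             # stand-in for 20 sounds x 12 notes
centers = X[:3]                      # stand-in for three cluster centers

mean = X.mean(axis=0)                # PCA recenters on the data mean...
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:2]                  # ...and keeps the top two directions

X_2d = (X - mean) @ components.T               # project the sounds
centers_2d = (centers - mean) @ components.T   # same mean, same axes

print(X_2d.shape, centers_2d.shape)
```

Because the centers are shifted by the same mean and projected onto the same two directions, they land in the correct place relative to the projected sounds.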

In [None]:
pb.DataFrame(finalDF)

The data displayed in the cluster graph is not exactly what I was looking for. It seems to be very spread out, while also jumbling several categories together.

I believe this could be because the group of sounds I am using is very similar, resulting in the large middle cluster, while the sounds towards the top and right are very different.

It may also be caused by a wrong K value. There was not yet time to make a proper elbow plot, which could give a better idea of how many clusters to use.

The last reason, which I am unsure about, is that the scale of the graph may be too small for this many sound files and could be giving a misleading display. This may be a reach, but in the future I would like to experiment with that factor.
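The elbow plot mentioned above could be produced along these lines. This is a sketch on synthetic data with three obvious groups, not on the project's sounds; it relies on the `inertia_` attribute of sklearn's `KMeans` (the summed squared distance of points to their cluster center), and the range of K values tried is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in blob_centers])

ks = range(1, 7)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
# plt.plot(list(ks), inertias, marker="o") would draw the elbow curve
```

The "elbow" is the K after which adding more clusters stops reducing the inertia by much; on real chroma data the bend may be far less sharp than in this toy case.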

In [None]:
filenames = []
titles = []
# separate loop names keep os.walk from overwriting `path`
for (root, dirnames, files) in os.walk(path2):
    filenames.append(files)

filenames = filenames[0]   # files in the top-level folder only
chroma_rows = []
for sound in filenames:
    if '.wav' in sound:
        [sr, data] = audioBasicIO.read_audio_file(path2 + '/' + sound)
        feature, _ = ShortTermFeatures.feature_extraction(data[ :, 0 ],sr, 0.050*sr,0.025*sr)
        
        file = sound.replace('.wav','')
        titles.append(file)
        chrono = feature[21:33]
        
        notes = chrono.shape[1]
        noteVal = chrono.argmax(axis = 0)
        # range=(0, 12) gives exactly one bin per note index 0-11
        x, _ = np.histogram(noteVal, bins = 12, range = (0, 12))
        normalData = x.reshape( 1, 12 ).astype( float ) / notes
        nb =pb.DataFrame(normalData)
        chroma_rows.append(nb)

# DataFrame.append was removed in pandas 2.0; concatenate the rows once instead
ChromaData = pb.concat(chroma_rows, ignore_index = True)


In [None]:
df=ChromaData

X = df.values
model = KMeans(n_clusters = 3)
model.fit(X)
Y = model.predict(X) # --> 0-max_amount_cluster
all_values = np.append(model.cluster_centers_, X, axis=0)

pca = PCA(n_components=2)
pca.fit(all_values)
X_transformed = pca.transform(X)
cluster_centers_transformed = pca.transform(model.cluster_centers_)

plt.scatter(X_transformed[:,0], X_transformed[:,1],c=model.labels_,cmap = 'rainbow')
plt.scatter(cluster_centers_transformed[:,0], cluster_centers_transformed[:,1],color = 'black')
finalDF = {'Title':titles,'Cluster #':model.labels_}

Here we have the same method on a different, smaller dataset. This dataset is a selection of sounds I picked from very clearly separate categories: penguin sounds, punching sounds and explosions. When this data is clustered it makes three very clear clusters, matching the K I give.

So far what this tells me is that using the chroma vector to compare sounds is still very uncertain. It appears to work better with smaller datasets, and the clusters work out very well when the sounds are distinctly different.

In comparison, the random mix of samples in the first dataset contains many similar sounds and much more data to jumble up, resulting in less accurate clusters.

In [None]:
pb.DataFrame(finalDF)

Above is the same data again, just without the plot. The sounds that are similar do get placed together, but with a dataset so small and so distinct it is hard to tell whether it was really a success or an artifact of the dataset.

From what I can see in the data, chroma vectors are not the only data needed to determine whether sounds are similar or different. Within large datasets with a wide range of sounds, sounds which should be paired together split into separate clusters away from one another. It is unclear whether a wider graph would help. Additional data is needed to clarify these questions and hypotheses about which methods can be used for similarity comparison. Combining additional data such as the energy reading or the zero crossing rate might help separate sounds of similar categories from each other. In future I would also try to implement an actual comparison reading for each individual sound file, so that files can be related to each other individually rather than just by where they end up in a cluster, but I am unsure how to proceed with that idea. It is possible to create some organization of sounds with these methods, but it is not accurate enough for a final product.
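One possible starting point for the per-file comparison mentioned above is a pairwise similarity score between note profiles. The sketch below uses cosine similarity, which is only one candidate measure and was not part of this project; the function name and the three short profiles are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(profiles):
    """Cosine similarity between every pair of rows (1.0 = same direction)."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.where(norms == 0, 1.0, norms)   # guard all-zero rows
    return unit @ unit.T

# Hypothetical note profiles for three sounds (4 notes shown for brevity).
profiles = [
    [0.7, 0.2, 0.1, 0.0],   # sound A
    [0.6, 0.3, 0.1, 0.0],   # sound B, close to A
    [0.0, 0.1, 0.2, 0.7],   # sound C, very different
]
sim = cosine_similarity_matrix(profiles)
print(np.round(sim, 2))
```

Each entry of the matrix relates one sound directly to another, so a designer could ask for "the five sounds most like this one" without going through cluster labels at all. Extra columns such as energy or zero crossing rate could be appended to each profile, though they would need to be brought to a comparable scale first.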