# UrbanSound8K sound classification visual approach comparison - Linear and Log Spectrograms vs Mel Spectrograms

## About the UrbanSound8K dataset

UrbanSound8K is a dataset of 8,732 labeled sound excerpts, each up to 4 seconds long, drawn from 10 classes. The [UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html) classes are:

1.  air_conditioner
2.  car_horn
3.  children_playing
4.  dog_bark
5.  drilling
6.  engine_idling
7.  gun_shot
8.  jackhammer
9.  siren
10. street_music


## Background and objectives for this notebook

[Research with this dataset as of 2019](https://www.researchgate.net/publication/335862311_Evaluation_of_Classical_Machine_Learning_Techniques_towards_Urban_Sound_Recognition_on_Embedded_Systems) reported that optimized ML approaches, as of late 2019, reached a classification accuracy of 74% with a k-nearest neighbours (KNN) algorithm. A deep neural network trained from scratch obtained 76% accuracy.


![Accuracy metrics](https://www.researchgate.net/profile/Bruno-Silva-172/publication/335862311/figure/fig2/AS:804132151652353@1568731453277/Achieved-accuracy-of-the-classifiers-with-their-default-and-optimized-configuration.png "research")

*(accuracy metrics for research article)*

State-of-the-art methods for audio classification treat the problem as an image classification task. Three common ways to transform audio samples into images for this purpose are:

1. Linear Spectrograms
2. Log Spectrograms
3. Mel Spectrograms

You can learn more about these three transformations in [Scott Duda's article](https://scottmduda.medium.com/urban-environmental-audio-classification-using-mel-spectrograms-706ee6f8dcc1) and [Ketan Doshi's writing](https://towardsdatascience.com/audio-deep-learning-made-simple-part-2-why-mel-spectrograms-perform-better-aad889a93505), which explain why Mel Spectrograms generally perform better as visual transformations of audio files.
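
One intuition behind the mel argument can be sketched numerically: the mel scale gives low frequencies far more resolution than high ones. The snippet below is a minimal sketch that uses the HTK-style mel formula; `hz_to_mel` is a hypothetical helper written for illustration, and librosa's default mel filterbank uses a different (Slaney) variant, so exact values will differ from what librosa computes.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mel with the HTK-style formula (an assumption; not librosa's default)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# Compare how many mels a 1 kHz-wide band covers at the bottom and near the top of the spectrum.
low_band = hz_to_mel(1000) - hz_to_mel(0)
high_band = hz_to_mel(11000) - hz_to_mel(10000)

print(f"0-1 kHz spans {low_band:.0f} mel, 10-11 kHz spans {high_band:.0f} mel")
```

On a mel axis the lowest kilohertz therefore occupies much more of the image height than a kilohertz near the top of the range.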

This notebook tests these three transforms on the UrbanSound8K dataset and compares how they perform with a pre-trained vision model (ResNet-34) using fastai v2. It then checks whether other pre-trained models can improve on the ResNet-34 results.


This notebook converts these sounds to spectrograms, then uses the fastai v2 code base to classify them. Code and approach in this notebook

### Setup for AWS

In [None]:
# One-time installs. On AWS, use the conda_pytorch_p38 environment on an ml.p3.2xlarge instance for this notebook
!pip install librosa
!pip install fastbook

In [None]:
#all the one time imports for this nb
import pandas as pd

from fastai.vision.all import *
from fastai.data.all import *
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
import librosa
import librosa.display
import numpy as np
from pathlib import Path
import os
import random
import IPython
from tqdm import tqdm

from collections import OrderedDict

In [None]:
# One time download files to local S3 folder
# !wget https://goo.gl/8hY5ER  #download
# !tar xf 8hY5ER #unpack tar file

In [None]:
df = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')  #classification information across folds as provided from Urbansounds
df.head()

##### Class distribution across the sound types

In [None]:
df.groupby('class').classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('count')
plt.title('Class distribution in the dataset')

In [None]:
df.groupby(['fold']).classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('Files in each fold')
plt.title('Files in each fold')

##### Inspect the files - audio and a single transform of an audio file

In [None]:
audio_file= 'UrbanSound8K/audio/fold5/100032-3-0-0.wav'   #dog bark in fold 5

IPython.display.Audio(audio_file)
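
The file name above already encodes the label: according to the dataset documentation, slices are named `[fsID]-[classID]-[occurrenceID]-[sliceID].wav`. The sketch below decodes that pattern; `parse_slice_name` and `CLASS_NAMES` are hypothetical names introduced for illustration, and the 0-based class order is assumed to match the `classID` column in the metadata CSV.

```python
# Class names indexed by classID (0-based), assumed to match UrbanSound8K.csv.
CLASS_NAMES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]

def parse_slice_name(file_name):
    """Split '[fsID]-[classID]-[occurrenceID]-[sliceID].wav' into its fields."""
    stem = file_name.rsplit(".", 1)[0]  # drop the extension
    fs_id, class_id, occurrence_id, slice_id = (int(part) for part in stem.split("-"))
    return {
        "fs_id": fs_id,
        "class_id": class_id,
        "class_name": CLASS_NAMES[class_id],
        "occurrence_id": occurrence_id,
        "slice_id": slice_id,
    }

info = parse_slice_name("100032-3-0-0.wav")
print(info["class_name"])
```

This is only a cross-check; the notebook itself takes labels from the metadata CSV.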

##### Linear Spectrogram

In [None]:
samples, sample_rate = librosa.load(audio_file)
# take the magnitude of the complex STFT before converting to dB, scaled relative to the peak
Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=np.max)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
plt.colorbar()
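
To make the linear spectrogram less of a black box, here is a minimal NumPy-only sketch of what the cell above computes: windowed FFT frames, magnitudes, then decibels relative to the peak. It is an approximation under stated assumptions, not librosa's implementation: `stft_magnitude_db` is a hypothetical helper, the frame size and hop mirror librosa's defaults (`n_fft=2048`, `hop_length=512`), the -80 dB floor mirrors the default `top_db=80`, and unlike librosa the signal is not padded and the window is symmetric.

```python
import numpy as np

def stft_magnitude_db(samples, n_fft=2048, hop_length=512):
    """Return a (frequency bins, frames) array in dB relative to the loudest bin."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop_length
    # Slice the signal into overlapping frames and apply the window to each one.
    frames = np.stack([
        samples[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    db = 20.0 * np.log10(np.maximum(magnitude, 1e-10) / magnitude.max())
    return np.maximum(db, -80.0)  # clip everything quieter than 80 dB below the peak

# Synthetic test signal: one second of a 1 kHz tone at librosa's default 22050 Hz rate.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

db = stft_magnitude_db(tone)
peak_bin = int(db.mean(axis=1).argmax())
print(db.shape, peak_bin * sr / 2048)  # array shape and the frequency of the strongest bin
```

The log and mel views plot the same kind of time-frequency energy, only with the frequency axis rescaled.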

##### Log Spectrogram

In [None]:
plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar()

##### Mel Spectrogram

In [None]:
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
Sdb = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Sdb, sr=sample_rate, x_axis='time', y_axis='mel')
plt.colorbar()

In [None]:
audio_path = Path('UrbanSound8K/audio/')  # unzipped source audio files are in this location as wav files
tranform_store_path = 'UrbanSoundTransforms/'  #destination folder for each transformed image state

In [None]:
#make initial folders once
#os.mkdir(tranform_store_path)
# os.mkdir(tranform_store_path +'linear_spectrogram')
# os.mkdir(tranform_store_path +'log_spectrogram')
# os.mkdir(tranform_store_path +'mel_spectrogram')

In [None]:
# for fold in np.arange (1,11):
#     print(f'Processing fold {fold}')
#     try:
#         os.mkdir(tranform_store_path+'linear_spectrogram/'+ str(fold))
#         os.mkdir(tranform_store_path+'log_spectrogram/'+ str(fold))
#         os.mkdir(tranform_store_path+'mel_spectrogram/'+str(fold))
#     except:
#         pass #Folder exists
#     for audio_file in tqdm(list(Path(audio_path/f'fold{fold}').glob('*.wav'))):
#         samples, sample_rate = librosa.load(audio_file)  # load once with librosa, reuse for all three transforms
        
#         #plot for linear spectrogram - without axis, tight 
        
#         fig = plt.figure(figsize=[0.72,0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=np.max)
#         LS = librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
#         filename  = tranform_store_path + 'linear_spectrogram/'+str(fold) +'/'+ str(audio_file).split('/')[-1:][0].replace('.wav','.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight',pad_inches=0)
#         plt.close('all')
        
#         #plot for log  spectrogram - without axis, tight 
#         fig = plt.figure(figsize=[0.72,0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         LogS = librosa.display.specshow(Ydb, sr=sample_rate,x_axis='time', y_axis='log')
#         filename  = tranform_store_path + 'log_spectrogram/'+str(fold) +'/'+ str(audio_file).split('/')[-1:][0].replace('.wav','.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight',pad_inches=0)
#         plt.close('all')
        
#         #plot for mel spectrogram - without axis, tight
        
#         fig = plt.figure(figsize=[0.72,0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         melS = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
#         librosa.display.specshow(librosa.power_to_db(melS, ref=np.max))
#         filename  = tranform_store_path + 'mel_spectrogram/'+str(fold) +'/'+ str(audio_file).split('/')[-1:][0].replace('.wav','.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight',pad_inches=0)
#         plt.close('all')
        

##### Validate that all files were transformed into the destination folders

In [None]:
transforms = ['linear_spectrogram/','log_spectrogram/','mel_spectrogram/']
for transform in transforms:
    count = 0
    for fold in np.arange (1,11):
        count += len(list(Path(tranform_store_path+transform+str(fold)).glob('*.png')))
    print ('%s file count is %s'%(transform[:-1],count))
    assert (len(df)==count)

In [None]:
classes = OrderedDict(sorted(df.set_index('classID').to_dict()['class'].items()))
classes

In [None]:
fig, ax = plt.subplots(10,3, figsize=(16,16))
for k,v in classes.items():
    sample = df[df['class']==v].sample(1)
    sample_fold = sample['fold'].values[0]
    sample_file = sample['slice_file_name'].values[0].replace('.wav','.png')
    t_counter=0
    for transform in transforms:
        img = plt.imread(tranform_store_path+transform+str(sample_fold)+'/'+sample_file)
        ax[k][t_counter].imshow(img, aspect='equal')
        ax[k][t_counter].set_title(v+' transformed with '+ transform[:-1])
        ax[k][t_counter].title.set_size(10)
        ax[k][t_counter].set_axis_off()
        
        t_counter+=1
fig.tight_layout()
plt.show()

##### Fast AI classification of these spectrograms

In [None]:
df['fname'] = df['slice_file_name'].str[:-4] + '.png'  # spectrogram file name for each audio slice

In [None]:
my_dict = dict(zip(df.fname,df['class']))

In [None]:
def label_func(f_name):
    f_name = Path(f_name).name  # keep only the file name, which is the key in my_dict
    return my_dict[f_name]

In [None]:
all_folds = list(np.arange(1,11))
all_folders = [str(i) for i in all_folds]
all_folders

In [None]:
results = pd.DataFrame()

In [None]:
for transform in transforms:
    all_files = get_image_files(path=tranform_store_path+transform,recurse=True, folders =all_folders )
    
    for test_folder in all_folds:
        
        dblock = DataBlock(blocks=(ImageBlock,CategoryBlock),
                   get_y     = label_func,
                   
                   splitter  = FuncSplitter(lambda s: Path(s).parent.name==str(test_folder)),
                   
                  )
        dl = dblock.dataloaders(all_files)
        
        print ('Train has {0} images and test has {1} images. Test is on folder {2} of transform type {3}.' .format(len(dl.train_ds),len(dl.valid_ds),test_folder,transform[:-1]))
        learn = vision_learner(dl, resnet34, metrics=accuracy)
        learn.fine_tune(3)
        r = learn.validate()
        results.at[test_folder,transform[:-1]] = r[1]
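
The loop above amounts to leave-one-fold-out evaluation: each of the 10 predefined folds is held out once while the other nine are used for training. The sketch below isolates the splitting rule that the `FuncSplitter` lambda applies, using plain `pathlib`. `split_by_fold` and the `clip{n}.png` file names are hypothetical and exist only for illustration.

```python
from pathlib import Path

def split_by_fold(paths, test_folder):
    """Send files whose parent folder equals test_folder to test, the rest to train."""
    train, test = [], []
    for p in paths:
        # Compare the whole folder name so that fold '1' never matches fold '10'.
        (test if Path(p).parent.name == str(test_folder) else train).append(p)
    return train, test

# Hypothetical layout: 10 fold folders with 3 placeholder images each.
paths = [
    f"UrbanSoundTransforms/mel_spectrogram/{fold}/clip{n}.png"
    for fold in range(1, 11)
    for n in range(3)
]

for test_folder in range(1, 11):
    train, test = split_by_fold(paths, test_folder)
    print(f"fold {test_folder}: {len(train)} train files, {len(test)} test files")
```

Splitting on the predefined folders, rather than shuffling all files, keeps every file of a fold on the same side of the split.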
        

In [None]:
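learn_rate = learn.lr_find(suggest_funcs=(minimum, steep))  # assumption: learn_rate was undefined in the original, so take two suggestions from the LR finder
learn.fit(3, lr=(learn_rate[0] + learn_rate[1]) / 2)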

In [None]:
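transform ='mel_spectrogram/'  # rerun a single split on the mel spectrograms; test_folder keeps its last value from the loop above (10)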

In [None]:
all_files = get_image_files(path=tranform_store_path+transform,recurse=True, folders =all_folders )

In [None]:
dblock = DataBlock(blocks=(ImageBlock,CategoryBlock),
                   get_y     = label_func,
                   
                   splitter  = FuncSplitter(lambda s: Path(s).parent.name==str(test_folder)),
                   
                  )
dl = dblock.dataloaders(all_files)

print ('Train has {0} images and test has {1} images. Test is on folder {2} of transform type {3}.' .format(len(dl.train_ds),len(dl.valid_ds),test_folder,transform[:-1]))
learn = vision_learner(dl, resnet34, metrics=accuracy)
learn.fine_tune(3)
r = learn.validate()

In [None]:
r[1]