# Environment sound classification

This project is divided into three files: two Jupyter notebooks and one Python script.

- env-sound-classify-part1.ipynb 
- env-sound-classify-part2.ipynb 
- env-sound-classify-part3.py 


## Part 1 - Data Preparation

Just like any Machine Learning project, a Deep Learning project involves preparing the data before training.

In this project, you will use the Environment Sound Classification (ESC) dataset available from Kaggle. 
  https://www.kaggle.com/mmoreaux/environmental-sound-classification-50
  
The sound classification dataset contains a total of 50 classes of environment sounds, with 40 audio recordings of 5 seconds for each class.

For this project, we will use only 10 of those classes for recognition. Your job is to extract the audio recordings of these 10 classes so that we can use them in Colab for training.

In [13]:
# Run the following code as it is
#
import pandas as pd
import cv2
import librosa
import numpy as np
from shutil import copyfile
import os



In [21]:
# Set the folder to point to where you downloaded the ESC dataset,
# and also the folder to point to where you intend to save the processed data
#
user_folder = os.path.expanduser("~")
csv_folder = user_folder + "/Downloads/environmental-sound-classification-50/"

# This folder holds the audio files downloaded from Kaggle
input_folder = user_folder + "/Downloads/environmental-sound-classification-50/audio/"

audio_folder = user_folder + "/Downloads/environmental-sound-classification-50/esc10/"
output_folder = user_folder + "/Downloads/environmental-sound-classification-50/npydata/"


In [22]:
# Run the following code as it is

# Each of our samples (22.05 kHz) lasts exactly 5 seconds, i.e. 22050 * 5 samples.
#
spec_hop_length = 512
mfcc_hop_length = 512
spec_max_frames = int(22050 * 5 / spec_hop_length) + 1 # about (22050 * 5) / 512 frames, plus one
mfcc_max_frames = int(22050 * 5 / mfcc_hop_length) + 1

print ("MFCC Frames (for 5 sec audio):     %d" % (mfcc_max_frames))
print ("Spectral Frames (for 5 sec audio): %d" % (spec_max_frames))


num_classes = 10
max_samples = 22050 * 5  # 5 seconds
max_mfcc_features = 40

# Scale the values to be between -1 and 1 by dividing by the largest absolute value
def scale(arr):
    #arr = arr - arr.mean()
    safe_max = np.abs(arr).max()
    if safe_max == 0:
        safe_max = 1
    arr = arr / safe_max
    return arr


# Load a file and convert its audio signal into a series of MFCC
# This will return a 2D numpy array.
#
def convert_mfcc(file_name):
    signal, sample_rate = librosa.load(file_name) 
    signal = librosa.util.normalize(signal)
    signal_trimmed, index = librosa.effects.trim(signal, top_db=60)
    signal_trimmed = librosa.util.fix_length(signal_trimmed, size=max_samples)
    
    feature = librosa.feature.mfcc(y=signal_trimmed, sr=sample_rate, n_mfcc=max_mfcc_features, hop_length=mfcc_hop_length).T
    #print (feature.shape)
    if (feature.shape[0] > mfcc_max_frames):
        feature = feature[0:mfcc_max_frames, :]
    if (feature.shape[0] < mfcc_max_frames):
        feature = np.pad(feature, pad_width=((0, mfcc_max_frames - feature.shape[0]), (0,0)), mode='constant')
    
    # Zero out the first MFCC coefficient; it mostly tracks the overall signal level and may not be meaningful.
    #
    feature[:,0] = 0
        
    feature = scale(feature)
    #print(feature)
    return feature


# Load a file and convert its audio signal into a spectrogram
# This will return a 2D numpy array.
#
def convert_spectral(file_name):
    signal, sample_rate = librosa.load(file_name) 
    signal = librosa.util.normalize(signal)
    signal_trimmed, index = librosa.effects.trim(signal, top_db=60)
    signal_trimmed = librosa.util.fix_length(signal_trimmed, size=max_samples)
    
    feature = np.abs(librosa.stft(y=signal_trimmed, hop_length=spec_hop_length, win_length=spec_hop_length*4, n_fft=spec_hop_length*4, center=False).T)

    if (feature.shape[0] > spec_max_frames):
        feature = feature[0:spec_max_frames, :]
    if (feature.shape[0] < spec_max_frames):
        feature = np.pad(feature, pad_width=((0, spec_max_frames - feature.shape[0]), (0,0)), mode='constant')
        
    feature = librosa.amplitude_to_db(feature)
    feature = cv2.resize(feature, (224, 224), interpolation = cv2.INTER_CUBIC)
    feature = scale(feature)
    #print(feature)

    return feature    





MFCC Frames (for 5 sec audio):     216
Spectral Frames (for 5 sec audio): 216
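
The two frame counts printed above come from simple arithmetic on the clip length and the hop length. The sketch below reproduces that arithmetic without loading any audio. The framing formulas in the comments describe how librosa frames a signal with center=True (its default, used by the MFCC call) and center=False (used by the STFT call), and n_fft = 2048 mirrors spec_hop_length * 4 in convert_spectral.

```python
# Frame-count arithmetic for a fixed-length clip (no audio needed).
sample_rate = 22050      # librosa.load resamples to 22050 Hz by default
clip_seconds = 5
hop_length = 512
n_fft = 2048             # spec_hop_length * 4, as in convert_spectral

n_samples = sample_rate * clip_seconds

# Upper bound used in the notebook: int(n_samples / hop_length) + 1
max_frames = int(n_samples / hop_length) + 1

# center=True pads the signal, giving 1 + n_samples // hop_length frames.
frames_centered = 1 + n_samples // hop_length

# center=False needs a full window per frame:
# 1 + (n_samples - n_fft) // hop_length frames.
frames_uncentered = 1 + (n_samples - n_fft) // hop_length

print(max_frames, frames_centered, frames_uncentered)
```

The uncentered STFT yields fewer frames than spec_max_frames, which is why convert_spectral pads its result with zeros before resizing.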


## Copying WAV Files Into Our Custom Structure 

In the following section of code, we will copy the 40 recordings of each of the ESC-10 sound classes into the following folder structure (similar to what we used during Practical 3):

- **[audio_folder]**
  - chainsaw
    - 1-19898-A-41.wav
    - 1-19898-B-41.wav
    - ...
  - clock_tick
  - crackling_fire
  - crying_baby
  - dog
  - helicopter
  - rain
  - rooster
  - sea_waves
  - sneezing
 
NOTE: audio_folder is a variable declared in an earlier cell. If you didn't make any changes to that folder, it should point to:
- <user_folder>/Downloads/environmental-sound-classification-50/esc10/
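
The per-class layout can be tried out on empty placeholder files before touching the real dataset. This is a minimal sketch under stated assumptions: copy_into_class_folder is a hypothetical helper (not part of the notebook), the folders live in a temporary directory, and the two file names are borrowed from the notebook output purely as examples.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into_class_folder(src_file: Path, category: str, dest_root: Path) -> Path:
    """Copy src_file into dest_root/<category>/, creating the folder if needed."""
    class_dir = dest_root / category
    class_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src_file, class_dir))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "audio"      # stands in for input_folder
    dest = Path(tmp) / "esc10"     # stands in for audio_folder
    src.mkdir()

    # (filename, category) pairs; the files are empty placeholders, not real audio.
    rows = [("1-19898-A-41.wav", "chainsaw"), ("1-21934-A-38.wav", "clock_tick")]
    for filename, category in rows:
        (src / filename).write_bytes(b"")
        copy_into_class_folder(src / filename, category, dest)

    # Collect the resulting layout relative to the destination root.
    layout = sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.wav"))

print(layout)
```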

In [23]:
# Declare the labels we are using.
#
import shutil
labels = ["chainsaw", "clock_tick", "crackling_fire", "crying_baby", "dog", "helicopter", "rain", "rooster", "sea_waves", "sneezing"]

# Create the folder for containing your output data.
#
#os.makedirs(audio_folder, exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

# Load up the CSV file containing our data.
#
df = pd.read_csv(csv_folder + 'esc50.csv')
df = df.sort_values(['target', 'fold'], ascending=[True, True])

print ("Copying...")

# Walk through the CSV rows and copy only the files whose
# category is one of our labels.
# 
for index, row in df.iterrows():
    
    if(row['category'] in labels):
        # TODO:
        # Inspect the data in the Pandas DataFrame to discover
        # the filename, the label. Then copy the file from its
        # source folder into the target folder above.
        #
        # ..................... CODES START HERE ..................... #
        target_dir = os.path.join(audio_folder, row['category'])
        os.makedirs(target_dir, exist_ok=True)
        shutil.copy(os.path.join(input_folder, row['filename']), target_dir)
        # ...................... CODES END HERE ...................... #

print ("Copy complete.")

Copying...
Copy complete.
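
The row filter used in the copy loop can be checked on a few inline rows without the dataset. The sketch below uses the standard-library csv module instead of pandas so that it runs anywhere. The column names are the ones the notebook reads from esc50.csv, the first two rows reuse file names from the notebook output, and the third row is made up to show a category that gets skipped.

```python
import csv
import io

labels = {"chainsaw", "clock_tick", "crackling_fire", "crying_baby", "dog",
          "helicopter", "rain", "rooster", "sea_waves", "sneezing"}

# Inline stand-in for esc50.csv; the last row is hypothetical.
csv_text = """filename,fold,target,category
1-19898-A-41.wav,1,41,chainsaw
1-21934-A-38.wav,1,38,clock_tick
hypothetical-file.wav,1,99,not_an_esc10_class
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Keep only the rows whose category is one of the 10 labels.
selected = [(r["filename"], r["category"]) for r in rows if r["category"] in labels]

print(selected)
```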


## Extracting the MFCC, Spectral Features 

Here, let's loop through the files in our folders and extract their spectral and MFCC features.

We will create the following arrays:

- **Training**
    - x_spec_train: The input training data for the spectrograms
    - x_mfcc_train: The input training data for the MFCC features
    - y_train: The one-hot expected labels for the training data

- **Validation**
    - x_spec_test: The input validation data for the spectrograms
    - x_mfcc_test: The input validation data for the MFCC features
    - y_test: The one-hot expected labels for the validation data
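
The split and the one-hot encoding can be sketched on their own, without any audio. This is an illustration only, assuming 40 recordings per class with the first 28 going to training and the remaining 12 to validation; one_hot is a hypothetical helper name, and plain Python lists stand in for the feature arrays.

```python
labels = ["chainsaw", "clock_tick", "crackling_fire", "crying_baby", "dog",
          "helicopter", "rain", "rooster", "sea_waves", "sneezing"]
num_classes = len(labels)
samples_per_class = 40   # recordings per class in the dataset
train_per_class = 28     # samples 0-27 train, samples 28-39 validate


def one_hot(class_index, num_classes):
    """Return a list of zeros with a single 1 at class_index."""
    vec = [0] * num_classes
    vec[class_index] = 1
    return vec


y_train, y_test = [], []
for class_index in range(num_classes):
    for sample_number in range(samples_per_class):
        # Pick the destination list based on the sample's position in its class.
        target = y_train if sample_number < train_per_class else y_test
        target.append(one_hot(class_index, num_classes))

print(len(y_train), len(y_test))
```

With 10 classes this gives 280 training labels and 120 validation labels, matching the first dimension of the arrays saved later.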





In [25]:
# Process our wave files into spectral features.
#
print ("Processing...")

x_spec_train = []
x_mfcc_train = []
y_train = []

x_spec_test = []
x_mfcc_test = []
y_test = []

# TODO:
# Write a loop to loop through all labels.
#
for i in range(0, len(labels)):
    
    label = labels[i]
    print ("Label: " + labels[i])
    
    sample_number = 0
    
    # TODO:
    # Write the loop to walk through all files in the folder
    # corresponding to the label
    #
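    for root, dirs, files in os.walk(audio_folder + "/" + label, topdown=False):
        
        # Sort the file names so the train/test split is the same on every run.
        for file in sorted(files):
        
            # Skip anything that is not a WAV file.
            if not file.endswith(".wav"):
                continue
                
            print(file)
            filepath = audio_folder + "/" + label + "/" + file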
                
            # TODO:
            # Extract the spectral features and append them to
            # the x_spec_train / x_spec_test array.
            #
            # Samples 0-27 go into the train array.
            # Samples 28-39 go into the test array.
            # ..................... CODES START HERE ..................... #
            spec_feat = convert_spectral(filepath)
            if sample_number <= 27:
                x_spec_train.append(spec_feat)
            else:
                x_spec_test.append(spec_feat)
            # ...................... CODES END HERE ...................... #
                
                
            # TODO:
            # Extract the MFCC features and append them to
            # the x_mfcc_train / x_mfcc_test array.
            #
            # Samples 0-27 go into the train array.
            # Samples 28-39 go into the test array.
            # ..................... CODES START HERE ..................... #
            mfcc_feat = convert_mfcc(filepath)
            if sample_number <= 27:
                x_mfcc_train.append(mfcc_feat)
            else:
                x_mfcc_test.append(mfcc_feat)
            # ...................... CODES END HERE ...................... #
            
            
            # TODO:
            # Create a one-hot vector corresponding to the label
            # of this class. 
            #
            # Samples 0-27 go into the train array.
            # Samples 28-39 go into the test array.
            # ..................... CODES START HERE ..................... #
            y_hot = [0] * num_classes
            y_hot[i] = 1
            if sample_number <= 27:
                y_train.append(y_hot)
            else:
                y_test.append(y_hot)
            # ...................... CODES END HERE ...................... #
                
            sample_number = sample_number + 1

print ("Processing complete.")

Processing...
Label: chainsaw
1-116765-A-41.wav
1-19898-A-41.wav
1-19898-B-41.wav
1-19898-C-41.wav
1-47250-A-41.wav
1-47250-B-41.wav
1-64398-A-41.wav
1-64398-B-41.wav
2-50667-A-41.wav
2-50667-B-41.wav
2-50668-A-41.wav
2-50668-B-41.wav
2-68391-A-41.wav
2-68391-B-41.wav
2-77945-A-41.wav
2-77945-B-41.wav
3-118656-A-41.wav
3-118657-A-41.wav
3-118657-B-41.wav
3-118658-A-41.wav
3-118658-B-41.wav
3-118972-A-41.wav
3-118972-B-41.wav
3-165856-A-41.wav
4-149294-A-41.wav
4-149294-B-41.wav
4-157611-A-41.wav
4-157611-B-41.wav
4-165823-A-41.wav
4-165823-B-41.wav
4-169127-A-41.wav
4-169127-B-41.wav
5-170338-A-41.wav
5-170338-B-41.wav
5-171653-A-41.wav
5-185579-A-41.wav
5-185579-B-41.wav
5-216370-A-41.wav
5-216370-B-41.wav
5-222524-A-41.wav
Label: clock_tick
1-21934-A-38.wav
1-21935-A-38.wav
1-35687-A-38.wav
1-42139-A-38.wav
1-48413-A-38.wav
1-57163-A-38.wav
1-62849-A-38.wav
1-62850-A-38.wav
2-119748-A-38.wav
2-127108-A-38.wav
2-131943-A-38.wav
2-134700-A-38.wav
2-135728-A-38.wav
2-140147-A-38.wav
2-1

In [26]:
# Run the following code as it is

# Convert x_train, y_train, x_test, y_test into Numpy arrays.
#
x_spec_train = np.array(x_spec_train)
x_mfcc_train = np.array(x_mfcc_train)
y_train = np.array(y_train)

x_spec_test = np.array(x_spec_test)
x_mfcc_test = np.array(x_mfcc_test)
y_test = np.array(y_test)



In [27]:
# Run the following code as it is

# Ensure that the following arrays are converted to Numpy
# arrays and have the following shapes:
#
#    x_spec_train.shape    (280, 224, 224)
#    x_mfcc_train.shape    (280, 216, 40)
#    y_train.shape         (280, 10)
#
#    x_spec_test.shape     (120, 224, 224)
#    x_mfcc_test.shape     (120, 216, 40)
#    y_test.shape          (120, 10)
#
print (x_spec_train.shape)
print (x_mfcc_train.shape)
print (y_train.shape)


print (x_spec_test.shape)
print (x_mfcc_test.shape)
print (y_test.shape)

(280, 224, 224)
(280, 216, 40)
(280, 10)
(120, 224, 224)
(120, 216, 40)
(120, 10)


In [28]:
# Run the following code as it is

# Save x_train, y_train, x_test, y_test into their respective 
# .npy files. 
#
np.save(output_folder + '/x_spec_train.npy', np.array(x_spec_train))
np.save(output_folder + '/x_mfcc_train.npy', np.array(x_mfcc_train))
np.save(output_folder + '/y_train.npy', np.array(y_train))

np.save(output_folder + '/x_spec_test.npy', np.array(x_spec_test))
np.save(output_folder + '/x_mfcc_test.npy', np.array(x_mfcc_test))
np.save(output_folder + '/y_test.npy', np.array(y_test))




Once you've saved the files, upload them to your Google Drive for training.
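
Before uploading, a quick round trip can confirm that a saved file reloads with the shape and dtype you expect. The sketch below assumes NumPy is installed. The array x_mfcc_demo and its file name are hypothetical stand-ins filled with zeros, using the per-sample MFCC shape from this notebook rather than real features.

```python
import os
import tempfile

import numpy as np

# Stand-in array: 3 samples with the notebook's per-sample MFCC shape (216 x 40).
x_mfcc_demo = np.zeros((3, 216, 40), dtype=np.float32)

with tempfile.TemporaryDirectory() as demo_folder:
    path = os.path.join(demo_folder, "x_mfcc_demo.npy")
    np.save(path, x_mfcc_demo)          # write the array in .npy format
    loaded = np.load(path)              # read it back into memory
    size_bytes = os.path.getsize(path)  # raw data plus a small header

print(loaded.shape, loaded.dtype, size_bytes)
```

The same np.load call, pointed at the real files, is what a training notebook would typically use to read the data back.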

For reference, your files should be roughly the following sizes:

- x_mfcc_test.npy: about 4.1 MB
- x_mfcc_train.npy: about 9.7 MB
- x_spec_test.npy: about 24.1 MB
- x_spec_train.npy: about 56.2 MB
- y_test.npy: about 10 KB
- y_train.npy: about 23 KB
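
These sizes follow from the array shapes. The rough check below assumes the feature arrays are stored as 32-bit floats and the one-hot labels as 64-bit integers; both are assumptions about the dtypes on your platform, and the small fixed header that the .npy format adds is not counted.

```python
from math import prod

# Item sizes are assumptions; the shapes are the ones printed by the notebook.
FLOAT32_BYTES = 4   # assumed dtype of the feature arrays
INT64_BYTES = 8     # assumed dtype of the one-hot label arrays

arrays = {
    "x_mfcc_test.npy": ((120, 216, 40), FLOAT32_BYTES),
    "x_mfcc_train.npy": ((280, 216, 40), FLOAT32_BYTES),
    "x_spec_test.npy": ((120, 224, 224), FLOAT32_BYTES),
    "x_spec_train.npy": ((280, 224, 224), FLOAT32_BYTES),
    "y_test.npy": ((120, 10), INT64_BYTES),
    "y_train.npy": ((280, 10), INT64_BYTES),
}

# Raw data size = number of elements * bytes per element.
payload_bytes = {name: prod(shape) * item for name, (shape, item) in arrays.items()}

for name, n in payload_bytes.items():
    print(f"{name}: {n / 1e6:.2f} MB")
```

If your files differ noticeably from these figures, the array shapes or dtypes are the first things to check.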