# **TUTORIAL: analyze and classify sounds with AI**

*A guide to analyzing and classifying marine mammal sounds.*

## Introduction

Audio or sound classification is a technique with multiple applications in the field of AI and data science.

Use cases:
- chatbots
- automated speech translators
- virtual assistants
- music genre identification 
- text-to-speech applications
- ...

Audio classification comes in many forms, such as the classification of acoustic data, music, natural language and environmental sounds.

## Objective

The aim of this notebook is to use the **AI NOTEBOOKS** product to train a model to **classify marine mammal sounds**.

Here, the sounds in the dataset are in `.wav` format.

To train a model on them, you first have to pre-process the data in several steps:

- Analyse one of these audio recordings
- Extract features from each sound file and store them as one row of a `.csv` file
- Train your model from the `.csv` file

**USE CASE:** [Best of Watkins Marine Mammal Sound Database](https://www.kaggle.com/shreyj1729/best-of-watkins-marine-mammal-sound-database/version/3)

![](./assets/categories.png)

This dataset is composed of **55 different folders**, one per marine mammal. Each folder stores several sound files of that animal.

You can get more information about this dataset on this [website](https://cis.whoi.edu/science/B/whalesounds/index.cfm).

The data distribution is as follows:

![](./assets/data.png)

#### ⚠️ *For this example, we choose only the first 45 classes (or folders).*

Let’s follow the different steps!

![](./assets/plan.png)


## Step 1 - Import dependencies

In [None]:
# audio libraries
import librosa
import librosa.display as lplt
import IPython

# import matplotlib to be able to display graphs
import matplotlib.pyplot as plt

# transform .wav into .csv
import csv
import os
import numpy as np
import pandas as pd

# preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# model
import keras
import tensorflow as tf
from tensorflow.keras.models import Sequential


## Step 2 - Audio libraries

### 1. Loading an audio file with Librosa

**Librosa**: Python module for audio signal analysis. 

By using **Librosa**, you can extract key features from the audio samples:
- Tempo
- Chroma Energy Normalized
- Mel-Frequency Cepstral Coefficients
- Spectral Centroid, Spectral Contrast 
- Spectral Rolloff
- Zero Crossing Rate

If you want to know more about this library, refer to the [documentation](https://librosa.org/doc/latest/index.html).

You can start by looking at your data by displaying different parameters using the **Librosa** library.

First, you can do a test on a file.

In [None]:
test_sound = "/workspace/data/AtlanticSpottedDolphin/61025001.wav"

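Load and decode the audio. By default, Librosa resamples the signal to 22,050 Hz and converts it to mono.

In [None]:
data, sr = librosa.load(test_sound)
print(type(data), type(sr))

In [None]:
# the sampling rate can also be set explicitly with the sr parameter
librosa.load(test_sound, sr = 45600)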

### 2. Playing Audio with IPython.display.Audio

[IPython.display.Audio](https://ipython.org/ipython-doc/stable/api/generated/IPython.display.html#IPython.display.Audio) lets you play audio directly in a **Jupyter notebook**.

Use **IPython.display.Audio** to play the audio.

In [None]:
IPython.display.Audio(data, rate = sr)

## Step 3 - Visualizing Audio

### 1. Waveforms

**Waveforms**: visual representations of sound as time on the x-axis and amplitude on the y-axis. They allow for quick analysis of audio data.

You can display the audio array using **librosa.display.waveplot**.

In [None]:
# note: recent Librosa releases replace waveplot with librosa.display.waveshow
librosa.display.waveplot(data, sr = sr)
plt.show()

### 2. Spectrograms

**Spectrogram**: visual representation of the intensity of a signal over time at the various frequencies present in a waveform.
> Some warnings may appear. You can safely ignore them and run the next steps of the notebook.

In [None]:
# short-time Fourier transform; np.abs keeps the magnitude of the complex result
stft = librosa.stft(data)
plt.colorbar(librosa.display.specshow(np.abs(stft), sr = sr, x_axis = 'time', y_axis = 'hz'))

In [None]:
# convert the magnitudes to decibels to make quieter components visible
stft_db = librosa.amplitude_to_db(np.abs(stft))
plt.colorbar(librosa.display.specshow(stft_db, sr = sr, x_axis = 'time', y_axis = 'hz'))
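
The decibel conversion used for the second spectrogram can be sketched in plain NumPy. This is a simplified illustration, not Librosa's implementation: `to_decibels` is a hypothetical helper, and its default values (`ref`, `amin`, `top_db`) are assumptions chosen to resemble a typical amplitude-to-dB conversion.

```python
import numpy as np

def to_decibels(magnitude, ref=1.0, amin=1e-5, top_db=80.0):
    """Convert amplitudes to decibels relative to ref (hypothetical helper)."""
    # clamp very small amplitudes so the logarithm stays finite
    db = 20.0 * np.log10(np.maximum(amin, magnitude) / ref)
    # limit the dynamic range to top_db below the loudest value
    return np.maximum(db, db.max() - top_db)

# made-up amplitudes, each 10x (or more) quieter than the previous one
magnitudes = np.array([1.0, 0.1, 0.01, 1e-9])
decibels = to_decibels(magnitudes)
print(decibels)
```

Each division of the amplitude by 10 lowers the value by 20 dB, which is why faint components that are invisible on the linear spectrogram show up on the dB one.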

### 3. Spectral Rolloff

**Spectral Rolloff**: frequency below which a specified percentage of the total spectral energy lies (85% by default in Librosa).

**librosa.feature.spectral_rolloff** calculates the roll-off frequency for each frame of a signal.

In [None]:
# roll-off frequency of each frame; the plot below shows the waveform only
spectral_rolloff = librosa.feature.spectral_rolloff(y = data + 0.01, sr = sr)[0]
librosa.display.waveplot(data, sr = sr, alpha = 0.4)
plt.show()
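
To make the definition concrete, here is a minimal NumPy sketch of the roll-off computation for a single frame. The spectrum values are made up, and `rolloff_frequency` is a hypothetical helper written for this illustration, not a Librosa function.

```python
import numpy as np

def rolloff_frequency(magnitudes, freqs, roll_percent=0.85):
    """Lowest frequency at which the cumulative magnitude reaches roll_percent of the total."""
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    # index of the first bin whose cumulative sum is >= threshold
    index = np.searchsorted(cumulative, threshold)
    return freqs[index]

# made-up spectrum of one frame: bin frequencies (Hz) and their magnitudes
freqs = np.array([0.0, 100.0, 200.0, 300.0, 400.0])
magnitudes = np.array([1.0, 4.0, 3.0, 1.0, 1.0])

rolloff = rolloff_frequency(magnitudes, freqs)
print("roll-off frequency:", rolloff, "Hz")
```

Here the cumulative sums are 1, 5, 8, 9 and 10, so 85% of the total (8.5) is first reached at the 300 Hz bin.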

### 4. Chroma Feature

**Chroma features**: representation of the spectral energy across the 12 pitch classes. They are well suited to analyzing music whose pitches can be meaningfully categorized and whose tuning is close to the equal temperament scale.

In [None]:
chroma = librosa.feature.chroma_stft(y = data, sr = sr)
lplt.specshow(chroma, sr = sr, x_axis = "time" ,y_axis = "chroma", cmap = "coolwarm")
plt.colorbar()
plt.title("Chroma Features")
plt.show()

### 5. Zero Crossing Rate

**Zero crossing**: occurs when two successive samples have different algebraic signs.

- The rate at which zero crossings occur is a simple measure of the frequency content of a signal.
- The number of zero crossings is the number of times, in a given time interval, that the amplitude of the signal passes through zero.

In [None]:
# zoom in on 200 samples to make individual zero crossings visible
start = 1000
end = 1200
plt.plot(data[start:end])
plt.grid()
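
The zoomed plot shows the crossings visually; counting them takes only a few lines. The sketch below uses a synthetic sine wave instead of the dolphin recording (an assumption, so that it runs without the dataset), and `count_zero_crossings` is a hypothetical helper rather than a Librosa function.

```python
import numpy as np

def count_zero_crossings(signal):
    """Count how many times consecutive samples change sign."""
    # np.signbit is True for negative samples; a crossing is a change of that flag
    negative = np.signbit(signal)
    return int(np.count_nonzero(negative[1:] != negative[:-1]))

# synthetic tone (assumption): 5 Hz sine, 1 second, 1000 samples per second
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 5 * t + 0.1)  # small phase offset keeps samples off exact zero

crossings = count_zero_crossings(tone)
print("zero crossings:", crossings)
print("zero crossing rate:", crossings / len(tone))
```

A 5 Hz tone crosses zero twice per cycle, hence 10 crossings in one second. In the notebook, `librosa.zero_crossings(data[start:end], pad = False)` returns a boolean array whose sum gives a comparable count for the plotted segment.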

## Step 4 - Data preprocessing

### 1. Data transformation

To train your model, the data must be pre-processed. First of all, you have to convert the `.wav` files into a single `.csv` file.

- Define the column names:

In [None]:
# one column per value written for each file:
# filename, 6 feature means, 20 MFCC means (Librosa's default) and the label
header = ["filename", "chroma_stft_mean", "rms_mean", "spectral_centroid_mean",
          "spectral_bandwidth_mean", "rolloff_mean", "zero_crossing_rate_mean"]
header += [f"mfcc{i}_mean" for i in range(1, 21)]
header.append("label")

- Create the `data.csv` file:

In [None]:
with open("/workspace/data/csv/data.csv", "w", newline = "") as file:
    writer = csv.writer(file)
    writer.writerow(header)

- Define the list of marine mammal names (45):

There are 45 different marine mammals, i.e. 45 classes.

In [None]:
marine_mammals = "AtlanticSpottedDolphin BeardedSeal Beluga_WhiteWhale BlueWhale BottlenoseDolphin Boutu_AmazonRiverDolphin BowheadWhale ClymeneDolphin \
        Commerson'sDolphin CommonDolphin Dall'sPorpoise DuskyDolphin FalseKillerWhale Fin_FinbackWhale FinlessPorpoise Fraser'sDolphin Grampus_Risso'sDolphin \
        GraySeal GrayWhale HarborPorpoise HarbourSeal HarpSeal Heaviside'sDolphin HoodedSeal HumpbackWhale IrawaddyDolphin JuanFernandezFurSeal KillerWhale \
        LeopardSeal Long_FinnedPilotWhale LongBeaked(Pacific)CommonDolphin MelonHeadedWhale MinkeWhale Narwhal NewZealandFurSeal NorthernRightWhale \
        PantropicalSpottedDolphin RibbonSeal RingedSeal RossSeal Rough_ToothedDolphin SeaOtter Short_Finned(Pacific)PilotWhale SouthernRightWhale SpermWhale".split()

- Transform each `.wav` file into a `.csv` row:
> Some warnings may appear. You can safely ignore them and run the next steps of the notebook.
>
> This step can be very long.

In [None]:
for animal in marine_mammals:

    for filename in os.listdir(f"/workspace/data/{animal}/"):

        # load at most the first 30 seconds of the recording, mixed down to mono
        sound_name = f"/workspace/data/{animal}/{filename}"
        y, sr = librosa.load(sound_name, mono = True, duration = 30)

        # frame-level features
        chroma_stft = librosa.feature.chroma_stft(y = y, sr = sr)
        rmse = librosa.feature.rms(y = y)
        spec_cent = librosa.feature.spectral_centroid(y = y, sr = sr)
        spec_bw = librosa.feature.spectral_bandwidth(y = y, sr = sr)
        rolloff = librosa.feature.spectral_rolloff(y = y, sr = sr)
        zcr = librosa.feature.zero_crossing_rate(y)
        mfcc = librosa.feature.mfcc(y = y, sr = sr)

        # one row per file: filename, mean of each feature over time, label
        row = [filename, np.mean(chroma_stft), np.mean(rmse), np.mean(spec_cent),
               np.mean(spec_bw), np.mean(rolloff), np.mean(zcr)]
        row += [np.mean(e) for e in mfcc]
        row.append(animal)

        # append the row to the csv file
        with open('/workspace/data/csv/data.csv', 'a', newline = '') as file:
            csv.writer(file).writerow(row)
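
The layout of each row written by the loop can be checked without the audio files. The sketch below is an illustration under assumptions: the feature values are placeholders, the column names are hypothetical, and an in-memory buffer stands in for `data.csv`.

```python
import csv
import io

# hypothetical column names: filename, 6 feature means, 20 MFCC means, label
columns = ["filename", "chroma_stft_mean", "rms_mean", "spectral_centroid_mean",
           "spectral_bandwidth_mean", "rolloff_mean", "zero_crossing_rate_mean"]
columns += [f"mfcc{i}_mean" for i in range(1, 21)]
columns.append("label")

# placeholder row with the same layout as the rows written by the loop
row = ["61025001.wav"] + [0.0] * 26 + ["AtlanticSpottedDolphin"]

# write to an in-memory buffer instead of /workspace/data/csv/data.csv
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerow(row)

# read the buffer back and pair each value with its column name
buffer.seek(0)
record = next(csv.DictReader(buffer))
print(len(columns), "columns")
print(record["filename"], "->", record["label"])
```

The 26 numeric columns between the filename and the label are the ones selected later with `df.iloc[:, 1:27]`.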

- Display the `data.csv` file:

In [None]:
df = pd.read_csv('/workspace/data/csv/data.csv')
df.head()

In [None]:
# dataframe shape
df.shape

In [None]:
# dataframe types
df.dtypes

### 2. Feature extraction

In the preprocessing of the data, **feature extraction** is necessary before running the training. The purpose is to define the **inputs** and **outputs** of the neural network.

- **OUTPUT** (y): last column which is the `label`.

You cannot use text labels directly for training. You have to encode them with the **LabelEncoder** class of **sklearn.preprocessing**.

So, before running a model, convert this categorical text data into numerical data that the model can understand.

In [None]:
class_list = df.iloc[:,-1]
encoder = LabelEncoder()
y = encoder.fit_transform(class_list)
print("y: ", y)
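
What the encoder does can be reproduced on a toy example with NumPy alone. This is only a sketch of the idea (sorted class names mapped to integers), using a made-up list of four labels; it is not the scikit-learn implementation.

```python
import numpy as np

# made-up labels for four recordings
labels = np.array(["SpermWhale", "Narwhal", "SpermWhale", "BlueWhale"])

# classes are the sorted unique names; encoded holds the index of each label
classes, encoded = np.unique(labels, return_inverse=True)
print("classes:", classes)
print("encoded:", encoded)

# indexing the classes with the integers recovers the original names
decoded = classes[encoded]
print("decoded:", decoded)
```

The last step is the counterpart of `encoder.inverse_transform`, which the notebook uses later to turn predicted class numbers back into animal names.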

- **INPUTS** (X): all other columns are input parameters of the neural network except the `filename`.

You remove the first column which does not provide any information for the training (the filename) and the last one which corresponds to the output.

In [None]:
input_parameters = df.iloc[:, 1:27]
scaler = StandardScaler()
X = scaler.fit_transform(np.array(input_parameters))
print("X:", X)
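
Standardization rescales each column to zero mean and unit variance. A minimal NumPy sketch on made-up values, assuming the population standard deviation (`ddof = 0`):

```python
import numpy as np

# made-up feature matrix: 3 recordings, 2 features on very different scales
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])

# statistics are computed per column on the training data only
mean = features.mean(axis=0)
std = features.std(axis=0)  # population standard deviation (ddof = 0)
scaled = (features - mean) / std
print(scaled)

# a new sample is scaled with the SAME mean and std, never with its own
new_sample = np.array([[2.5, 300.0]])
new_scaled = (new_sample - mean) / std
print(new_scaled)
```

Reusing the training statistics for new data is why the notebook later calls `scaler.transform` (and not `fit_transform`) on the test sounds.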

### 3. Split dataset for training

In [None]:
# training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)
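
`train_test_split` shuffles the rows and holds out 20% of them for validation. The same idea can be sketched in NumPy with a fixed seed, so that the split is reproducible. `shuffled_split` is a hypothetical helper for illustration, not the scikit-learn implementation.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(np.ceil(test_size * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# made-up data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_va, y_tr, y_va = shuffled_split(X_demo, y_demo)
print("training samples:", len(X_tr))
print("validation samples:", len(X_va))
```

With scikit-learn itself, passing `random_state` to `train_test_split` fixes the shuffle in the same way, which makes accuracy figures comparable between runs.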

## Step 5 - Building the model

The first step is to build the model and display the summary.

The model is a fully connected (dense) neural network: all hidden layers use a **ReLU** activation function, the output layer uses a **Softmax** function, and **Dropout** is used to limit overfitting.

In [None]:
model = tf.keras.models.Sequential([
    # input layer: one value per feature column
    tf.keras.layers.Dense(512, activation = 'relu', input_shape = (X_train.shape[1],)),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(256, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(128, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    
    tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    
    # output layer: one probability per class (45 marine mammals)
    tf.keras.layers.Dense(45, activation = 'softmax'),
])

model.summary()
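
The parameter counts shown in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output, and dropout layers add none. The sketch below assumes 26 input features, which is the number of columns selected by `df.iloc[:, 1:27]`.

```python
# layer widths: 26 inputs (assumed), four hidden layers, 45 output classes
layer_sizes = [26, 512, 256, 128, 64, 45]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one dense layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print("parameters per dense layer:", per_layer)
print("total trainable parameters:", total)
```

Most of the parameters sit in the second dense layer (512 x 256), so narrowing the first two layers is the most direct way to shrink the model.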

## Step 6 - Model training and evaluation

The **Adam** optimizer is used to train the model over 100 epochs. It was chosen because it gave better results for this task.

The loss is calculated with the **sparse_categorical_crossentropy** function, which works directly with the integer labels produced by the encoder.

In [None]:
def trainModel(model, epochs, optimizer):
    batch_size = 128
    model.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
    return model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = epochs, batch_size = batch_size)
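
For intuition, the loss can be written in a few lines of NumPy: for each sample, take the probability the model assigned to the true class and average the negative logarithms. This is a simplified sketch on made-up probabilities, not the Keras implementation; the function name is reused only for readability.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean negative log-probability of the true class (integer labels)."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(np.mean(-np.log(picked)))

# made-up softmax outputs for 2 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])  # integer class indices, as produced by the encoder

loss = sparse_categorical_crossentropy(y_true, probs)
print("loss:", loss)
```

The "sparse" variant takes class indices directly, so the labels do not need to be one-hot encoded first.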

Now, you can launch the training!
> This step can be very long.

In [None]:
model_history = trainModel(model = model, epochs = 100, optimizer = 'adam')

- Display **loss** curves:

In [None]:
loss_train_curve = model_history.history["loss"]
loss_val_curve = model_history.history["val_loss"]
plt.plot(loss_train_curve, label = "Train")
plt.plot(loss_val_curve, label = "Validation")
plt.legend(loc = 'upper right')
plt.title("Loss")
plt.show()

- Display **accuracy** curves:

In [None]:
acc_train_curve = model_history.history["accuracy"]
acc_val_curve = model_history.history["val_accuracy"]
plt.plot(acc_train_curve, label = "Train")
plt.plot(acc_val_curve, label = "Validation")
plt.legend(loc = 'lower right')
plt.title("Accuracy")
plt.show()

In [None]:
# evaluate on the validation set (no separate test set was held out)
val_loss, val_acc = model.evaluate(X_val, y_val, batch_size = 128)
print("The validation loss is: ", val_loss)
print("The validation accuracy is: ", val_acc*100)

## Step 7 - Make predictions on test data

To test your model and predict which classes new sounds belong to, you can import sounds into the `/workspace/data/data_test` folder.

Here we are testing **2 new sounds**.

### 1. Test data preprocessing

To test your model, the new sounds need the same pre-processing.

- Define the column names:

In [None]:
# header => for test data, the "label" column is removed because the class is unknown
header_test = ["filename", "chroma_stft_mean", "rms_mean", "spectral_centroid_mean",
               "spectral_bandwidth_mean", "rolloff_mean", "zero_crossing_rate_mean"]
header_test += [f"mfcc{i}_mean" for i in range(1, 21)]

- Create the `data_test.csv` file:

In [None]:
with open('/workspace/data/csv/data_test.csv', 'w', newline = '') as file:
    writer = csv.writer(file)
    writer.writerow(header_test)

- Transform each `.wav` file into a `.csv` row:

In [None]:
for filename in os.listdir("/workspace/data/data_test/"):
    # same loading and feature extraction as for the training data
    sound_name = f"/workspace/data/data_test/{filename}"
    y, sr = librosa.load(sound_name, mono = True, duration = 30)
    chroma_stft = librosa.feature.chroma_stft(y = y, sr = sr)
    rmse = librosa.feature.rms(y = y)
    spec_cent = librosa.feature.spectral_centroid(y = y, sr = sr)
    spec_bw = librosa.feature.spectral_bandwidth(y = y, sr = sr)
    rolloff = librosa.feature.spectral_rolloff(y = y, sr = sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y = y, sr = sr)

    # one row per file: filename and mean of each feature (no label)
    row = [filename, np.mean(chroma_stft), np.mean(rmse), np.mean(spec_cent),
           np.mean(spec_bw), np.mean(rolloff), np.mean(zcr)]
    row += [np.mean(e) for e in mfcc]

    # append the row to the csv file
    with open('/workspace/data/csv/data_test.csv', 'a', newline = '') as file:
        csv.writer(file).writerow(row)

- Display the `data_test.csv` file:

In [None]:
df_test = pd.read_csv('/workspace/data/csv/data_test.csv')
df_test.head()

In [None]:
X_test = scaler.transform(np.array(df_test.iloc[:, 1:27]))
print("X_test:", X_test)

### 2. Predictions

In [None]:
# generate predictions for samples
predictions = model.predict(X_test)
print(predictions)

In [None]:
# generate argmax for predictions
classes = np.argmax(predictions, axis = 1)
print(classes)

In [None]:
# transform classes number into classes name
result = encoder.inverse_transform(classes)
print(result)
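
`np.argmax` keeps only the most likely class. When two species sound alike, it can help to look at the runners-up as well. The sketch below lists the top candidates for made-up probabilities over three classes; `top_k` is a hypothetical helper.

```python
import numpy as np

# made-up class names and softmax outputs for 2 sounds
class_names = np.array(["BlueWhale", "Narwhal", "SpermWhale"])
demo_predictions = np.array([[0.05, 0.15, 0.80],
                             [0.60, 0.30, 0.10]])

def top_k(probs, names, k=2):
    """Return the k most likely (class name, probability) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(str(names[i]), float(probs[i])) for i in order]

ranked = [top_k(row, class_names) for row in demo_predictions]
for candidates in ranked:
    print(candidates)
```

In the notebook, the class names in encoding order are available as `encoder.classes_`, so they could be passed together with one row of `predictions`.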

## Step 8 - Save the model for future inference

> To save your model, you should create another Object Storage container (with write rights) and mount it in your workspace (`saved_model` in this example).

You can now save your model in a dedicated folder.

In [None]:
model.save('/workspace/saved_model/my_model')

In [None]:
# my_model directory
%ls /workspace/saved_model/

In [None]:
# contains an assets folder, saved_model.pb, and variables folder.
%ls /workspace/saved_model/my_model

In [None]:
model = tf.keras.models.load_model('/workspace/saved_model/my_model')
model.summary()

## Conclusion

The accuracy of the model can be improved by increasing the number of epochs, but beyond a certain number it reaches a plateau, so the number of epochs should be chosen accordingly.

The accuracy obtained on the validation set is **93.71 %**, which is a satisfactory result.

#### *I hope you have enjoyed this tutorial. Try for yourself!*