# Urban 8k Audio Classification Project

## Dataset Description
This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. For a detailed description of the dataset and how it was compiled, please refer to the paper.
All excerpts are taken from field recordings uploaded to www.freesound.org. The files are pre-sorted into ten folds (folders named fold1-fold10) to help in the reproduction of and comparison with the automatic classification results reported in the article above.

In addition to the sound excerpts, a CSV file containing metadata about each excerpt is also provided.

**8732 audio files of urban sounds (see description above) in WAV format. The sampling rate, bit depth, and number of channels are the same as those of the original file uploaded to Freesound (and hence may vary from file to file).**

## Task Description
The dataset contains 10 classes, and our task is to classify each audio file into one of them.
```
A numeric identifier of the sound class:
0 = air_conditioner
1 = car_horn
2 = children_playing
3 = dog_bark
4 = drilling
5 = engine_idling
6 = gun_shot
7 = jackhammer
8 = siren
9 = street_music
```



## How to get the Urban8K Dataset?
Go to the [Urban8K Website](https://urbansounddataset.weebly.com/urbansound8k.html) and fill in a short form to download the dataset. Since the dataset is over 5 GB even in compressed form, it is better to copy the download link, fetch the file directly into Colab with `!wget`, and then move it to Google Drive for further use.

### The code block below downloads the dataset and extracts the tarball; this may take a while.

In [1]:
!wget https://goo.gl/8hY5ER
!tar -xvf 8hY5ER -C ./



* Mount Google Drive and create a folder named Urban8KZip in it.
* Copy the downloaded archive (`8hY5ER`) to the Urban8KZip directory.

In [2]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!mkdir /content/drive/MyDrive/Urban8KZip
!cp ./8hY5ER /content/drive/MyDrive/Urban8KZip

### In case you run the notebook again

If you run the notebook again, there is no need to download the dataset and move it to Drive again: just mount Google Drive and extract the archive from the Urban8KZip folder into Colab.

**Run the next cell only when running the notebook again. Skip the cell above and the download cells at the start; just mount Google Drive and run the next cell.**

In [None]:
!tar -xvf /content/drive/MyDrive/Urban8KZip/8hY5ER -C ./

## Data Analysis, Exploration and Visualizations

### Some Information About Dataset.
This dataset contains audio belonging to 10 classes. Audio data is similar to image data in some respects, but it also has characteristics that are unique to audio.

Before we can even use an audio dataset, we face a first challenge: sound is a continuous signal, but computers can only store discrete data. So how do we digitize it in a way that is easy to store?

The solution is to represent the continuous wave with discrete measurements. That may sound contradictory, but <b>instead of storing a continuous value, we just sample it at fixed time intervals</b> (at 22,050 Hz, for example, one sample is taken every 1/22050 s, about 45 µs). The rate at which it is sampled is known as the **sampling rate**.

<figure>
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2017/08/23210623/sound.png" width="350" alt="my img"/>
  <figcaption> Image Courtesy Analytics Vidhya </figcaption>
</figure>

When we say we sample it, what are we actually doing?
> We are recording the amplitude at that point.


* So audio data is nothing but the amplitude stored at regular intervals, which mimics the original continuous wave.

* We take and store thousands of measurements per second. If we can take tons of measurements extremely quickly with enough possible amplitude values, we can effectively use these snapshots to reconstruct the resolution and complexity of an analog wave.
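
To make the idea concrete, here is a small NumPy sketch (an illustration, not part of the original pipeline) that "samples" a 440 Hz sine wave at a fixed rate: each stored value is simply the wave's amplitude at the instant n / sr, so one second of audio always yields sr samples. The tone frequency and the rates are arbitrary choices.

```python
import numpy as np

def sample_sine(freq_hz, sr, duration_s):
    """Sample a sine wave of freq_hz at sr samples per second for duration_s seconds."""
    # Sample instants n / sr for n = 0, 1, ..., sr * duration_s - 1
    n_samples = int(round(sr * duration_s))
    t = np.arange(n_samples) / sr
    # Each stored sample is the amplitude of the continuous wave at that instant
    return t, np.sin(2 * np.pi * freq_hz * t)

for rate in (100, 2000, 22050):
    t, x = sample_sine(440, rate, duration_s=1.0)
    # One second of audio produces `rate` samples, spaced 1 / rate seconds apart
    print(f"sr={rate:>5} Hz -> {len(x):>5} samples, spacing {1000 / rate:.4f} ms")
```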


<figure>
  <img src="https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth/_jcr_content/root/sectioncontainer_main/flexcontainer/flexcontainer_center/flexcontainer_center_top/image_1558274996.coreimg.82.1280.jpeg/1590799241393/reconstructing-the-original-signal.jpeg" width="500" alt="my img"/>
  <figcaption> Image Courtesy izotope </figcaption>
</figure>

* [Read more about the audio concepts involved, such as bit depth and the Nyquist rate](https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth.html)
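
The Nyquist rate has a practical consequence: a sampled signal can only represent frequencies up to half the sampling rate (sr / 2). The NumPy sketch below, an illustration that is not part of the original notebook, samples a 1500 Hz tone and a 3000 Hz tone at only 4000 Hz and uses an FFT to find the strongest frequency in each. The first comes out correctly; the second "aliases" to 1000 Hz.

```python
import numpy as np

def dominant_frequency(x, sr):
    """Return the frequency (Hz) of the strongest FFT bin of a real signal x sampled at sr."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    return freqs[np.argmax(spectrum)]

sr = 4000                              # sampling rate, so the Nyquist frequency is 2000 Hz
t = np.arange(sr) / sr                 # one second of sample instants
below = np.sin(2 * np.pi * 1500 * t)   # 1500 Hz < 2000 Hz: represented correctly
above = np.sin(2 * np.pi * 3000 * t)   # 3000 Hz > 2000 Hz: aliases to |3000 - 4000| = 1000 Hz

print(round(dominant_frequency(below, sr)))  # 1500
print(round(dominant_frequency(above, sr)))  # 1000
```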


In [None]:
!pip install librosa

In [None]:
import IPython.display as ipd #Allows Audio files to be played directly in the notebook
import librosa #library we will use to analyze sounds
import librosa.display #library module which helps visualize the waves

In [None]:
import os
import glob
import numpy as np
import pandas as pd
import random

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

### Reading audio files

In [None]:
# Find all .wav files one level below the audio directory (fold1 ... fold10)
audios = glob.glob(os.path.join("/content/UrbanSound8K/audio", "*", "*.wav"))

print(f"Total Audio Files : {len(audios)}")

In [None]:
audios[:5] #Contains path of all audios

### Hit the play icon below and listen to the sound. Try to remember the sound and the plot it generates, repeat this several times, and try to pick up the patterns yourself.

To read audio files, we will use:

```python
data, sample_rate = librosa.load(audio_path)
```

By default, `librosa.load` resamples the audio to a sampling rate (`sr`) of 22050 Hz and mixes it down to mono. If you want a custom sampling rate, pass `sr=your_rate` to `librosa.load`, or `sr=None` to keep the file's native rate.

In [None]:
plt.figure(figsize=(12,4))

play_audio = random.choice(audios)
data, sample_rate = librosa.load(play_audio)
librosa.display.waveshow(data)
plt.title("Sound")
ipd.Audio(play_audio)

#### **If you managed to pick up patterns, you can see that this becomes a visual problem as well. Just a thought.**

In [None]:
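plt.figure(figsize=(12,4))
play_audio = random.choice(audios)
# Load at a deliberately low sampling rate of 100 Hz
data, sample_rate = librosa.load(play_audio, sr=100)
# Pass sr so the time axis is correct (waveshow assumes 22050 Hz otherwise)
librosa.display.waveshow(data, sr=sample_rate)
plt.title("Sound sampled at 100 Hz")
ipd.Audio(play_audio)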

### Plotting same audio at different sampling rate

The higher the sampling rate, the more detailed the plot: with more samples, there is less error when recreating the original audio. The lower the sampling rate, the coarser the approximation.

In [None]:
fig, axs = plt.subplots(2,2, figsize=(15,12))
fig.suptitle("Sampling of Same Audio File With Different values")
axs = np.reshape(axs,-1)
play_audio = random.choice(audios)
srs = [100, 2000, 22050, 44100]
for ax,sr in zip(axs,srs):
  try:
    data, sample_rate = librosa.load(play_audio, sr=sr)
    librosa.display.waveshow(data, sr=sample_rate, ax=ax)
    ax.set_title(f"Sampling Rate {sr} Hz")
  except Exception as e:
    print(f"Could not load or plot at {sr} Hz, try running the cell again: {e}")

ipd.Audio(play_audio)

### Reading Metadata for the dataset

In [None]:
metadata = pd.read_csv("/content/UrbanSound8K/metadata/UrbanSound8K.csv")

metadata.head()

#### Anatomy of Metadata csv
* `UrbanSound8K`, the main dataset directory, contains two folders: `audio` and `metadata`. The `audio` folder is where all the audio files are located. If you are running this in Colab, you can see both folders by expanding the UrbanSound8K folder in the Files pane.
* The `audio` folder itself contains 10 folders named fold1, fold2, ..., fold10. These folds contain the audio files described in the CSV (dataframe) above.

* Joining "UrbanSound8K/audio/" + "fold" + fold value (1, 2, ..., 10) + "/" + slice_file_name gives the path to the audio file. (This logic will be implemented later, so keep it in mind until then.)
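
As a sketch of that logic, the hypothetical helper `audio_path_for` below (not a function from the original notebook) builds the path from the `fold` and `slice_file_name` columns by name rather than by position (`row.values[-3]`, `row.values[0]`), so it keeps working if the columns are reordered. The default root is the Colab path used throughout this notebook; the two-row DataFrame uses made-up file names.

```python
import pandas as pd

AUDIO_ROOT = "/content/UrbanSound8K/audio"

def audio_path_for(row, root=AUDIO_ROOT):
    """Build <root>/fold<fold>/<slice_file_name> for one metadata row."""
    return f"{root}/fold{int(row['fold'])}/{row['slice_file_name']}"

# Tiny stand-in for the metadata DataFrame (only the columns the helper needs)
demo = pd.DataFrame({"slice_file_name": ["clip_a.wav", "clip_b.wav"], "fold": [5, 10]})
print([audio_path_for(row) for _, row in demo.iterrows()])
```

On the real metadata, `metadata.apply(audio_path_for, axis=1)` would produce one path per row.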

#### Selecting random row for each class in metadata

In [None]:
unique_audios = metadata.groupby(['class']).apply(lambda sub_df : sub_df.sample()).reset_index(drop = True)

In [None]:
unique_audios.head()

#### Plotting selected rows

In [None]:
fig, axs = plt.subplots(5,2,figsize=(15,8),constrained_layout=True)
axs = np.reshape(axs, -1)

for (index,row),ax in zip(unique_audios.iterrows(),axs):
  ax.set_title(row.values[-1])
  data, sr = librosa.load(f"/content/UrbanSound8K/audio/fold{row.values[-3]}/" + row.values[0])
  # Use the rate returned by librosa.load, not a sample_rate left over from an earlier cell
  _ = librosa.display.waveshow(data, sr=sr, ax=ax)

#### Let's see the number of instances for each class

In [None]:
instance_counts = metadata['class'].value_counts()
sns.set(rc={'figure.figsize':(11.7,4.14)})
sns.barplot(x=instance_counts.values, y=instance_counts.index, orient='h');

* Apart from car_horn and gun_shot, all of the classes appear to be balanced. These two classes have fewer instances, which might be a problem we will try to address later.
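
One common way to address this (an option, not something this notebook does) is to weight the loss by inverse class frequency. The sketch below computes such "balanced" weights, n_samples / (n_classes * class_count), from a list of class names; the labels are made-up placeholders, not the real UrbanSound8K counts. Classes come out in sorted order, which matches the order of `pd.Categorical` codes, so the resulting array could be passed as `weight=torch.tensor(weights, dtype=torch.float32)` to `F.cross_entropy`.

```python
import numpy as np
import pandas as pd

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), ordered by sorted class name."""
    counts = pd.Series(labels).value_counts().sort_index()
    weights = len(labels) / (len(counts) * counts.to_numpy())
    return counts.index.tolist(), weights

# Placeholder labels: two frequent classes and one rare one
labels = ["dog_bark"] * 100 + ["siren"] * 100 + ["gun_shot"] * 40
classes, weights = inverse_frequency_weights(labels)
for name, w in zip(classes, weights):
    print(f"{name:>10}: {w:.3f}")
```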

### **What are MFCCs and why are we using them?**

> MFCCs (Mel-Frequency Cepstral Coefficients) are derived from a Fourier transform of the signal; however, MFCCs use Mel scaling to model the way the human auditory system perceives sounds, rather than describing them on a purely frequency (Hz) basis. Concretely, the power spectrum from a short-time Fourier transform is passed through Mel-spaced filters, converted to a log scale, and decorrelated with a discrete cosine transform, keeping the first few coefficients. This means that the MFCCs should represent the texture or timbre of a sound as we might hear it.


In [None]:
unique_audios_5 = unique_audios.head()
fig, axs = plt.subplots(5,3,figsize=(15,8),constrained_layout=True)

for (index,row),ax in zip(unique_audios_5.iterrows(),axs):
  
  file_path = f"/content/UrbanSound8K/audio/fold{row.values[-3]}/" + row.values[0]
  ax[0].set_title(f"Waveform : {row.values[-1]}")
  data, sr = librosa.load(file_path)
  # Use the rate returned by librosa.load for the time axis
  _ = librosa.display.waveshow(data, sr=sr, ax=ax[0])

  ax[1].set_title(f"MFCC 40 : {row.values[-1]}")
  mfccs = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=40)
  librosa.display.specshow(mfccs, sr=sr, x_axis='time', ax=ax[1])

  ax[2].set_title(f"MFCC 30 : {row.values[-1]}")
  mfccs = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=30)
  librosa.display.specshow(mfccs, sr=sr, x_axis='time', ax=ax[2])

In [None]:
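def extract_features(file_name, n_mfcc=40):
    """Return the time-averaged MFCC vector (length n_mfcc) of one file, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name)
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
        # Average each coefficient over time: (n_mfcc, frames) -> (n_mfcc,)
        mfccsscaled = np.mean(mfccs.T, axis=0)

    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None
    return mfccsscaled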

In [None]:
#Takes time
from tqdm.notebook import tqdm 

features = []
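for index,row in tqdm(metadata.iterrows(), total=len(metadata)):
    file_path = f"/content/UrbanSound8K/audio/fold{row.values[-3]}/" + row.values[0]
    class_label = row.values[-1]
    # extract_features catches its own errors and returns None, so every metadata row
    # gets exactly one entry here and featuresdf stays row-aligned with metadata
    data = extract_features(file_path)
    features.append([data, class_label])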
      

featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

In [None]:
featuresdf[featuresdf.feature.notnull()]

In [None]:
featuresdf['class_label'] = pd.Categorical(featuresdf['class_label'])
featuresdf['class_category'] = featuresdf['class_label'].cat.codes

featuresdf['feature_len'] = featuresdf['feature'].apply(lambda x: len(x))
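
`pd.Categorical` assigns codes in sorted (alphabetical) order of the category names. For the ten UrbanSound8K class names, that order coincides with the numeric IDs listed in the Task Description, so `class_category` matches those IDs. The small check below confirms this on the class names alone; the scrambled input order is arbitrary.

```python
import pandas as pd

# Class names and IDs as listed in the Task Description
class_ids = {
    "air_conditioner": 0, "car_horn": 1, "children_playing": 2, "dog_bark": 3,
    "drilling": 4, "engine_idling": 5, "gun_shot": 6, "jackhammer": 7,
    "siren": 8, "street_music": 9,
}

# Feed the names in a scrambled order, as they would appear in the metadata
names = ["siren", "dog_bark", "air_conditioner", "street_music", "car_horn",
         "jackhammer", "gun_shot", "drilling", "engine_idling", "children_playing"]
codes = pd.Categorical(names).codes

print(all(class_ids[n] == c for n, c in zip(names, codes)))  # True
```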

In [None]:
featuresdf.head()

In [None]:
featuresdf.to_json("features.json", orient='records')
!cp ./features.json /content/drive/MyDrive/Urban8KZip/

In [None]:
!cp /content/drive/MyDrive/Urban8KZip/features.json ./

In [None]:
import pandas as pd
featuresdf = pd.read_json("features.json", orient='records')

In [None]:
featuresdf.head()

In [None]:
featuresdf.class_category.unique()

## Model Building & Training

### Converting Data to Required Format

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision

from torch.utils.data import TensorDataset,DataLoader

The data is not yet in the format PyTorch expects, so we will convert it in the following steps:
1. Split the dataset using sklearn.
2. Convert the split data into torch tensors.
3. Convert the torch tensors into TensorDatasets.
4. Create DataLoaders from these TensorDatasets.
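
The four steps can be sketched end to end on synthetic data. The random 40-dimensional vectors and labels below are placeholders standing in for `featuresdf.feature` and `featuresdf.class_category`, and every name carries a `demo_` prefix so running the sketch does not overwrite the notebook's real variables. Split and batch sizes mirror the cells that follow.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

rng = np.random.default_rng(0)
demo_X = rng.standard_normal((500, 40))      # placeholder MFCC feature vectors
demo_y = np.repeat(np.arange(10), 50)        # placeholder labels, 50 per class

# 1. Stratified 80/20 split
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y)

# 2. Torch tensors: float32 features, int64 (long) labels as cross_entropy expects
# 3. TensorDatasets pair each feature vector with its label
demo_train_ds = TensorDataset(torch.tensor(dx_train, dtype=torch.float32),
                              torch.tensor(dy_train, dtype=torch.long))
demo_test_ds = TensorDataset(torch.tensor(dx_test, dtype=torch.float32),
                             torch.tensor(dy_test, dtype=torch.long))

# 4. DataLoaders yield (shuffled, for training) mini-batches
demo_train_dl = DataLoader(demo_train_ds, batch_size=64, shuffle=True)
demo_test_dl = DataLoader(demo_test_ds, batch_size=128)

xb, yb = next(iter(demo_train_dl))
print(len(demo_train_ds), len(demo_test_ds), xb.shape, yb.dtype)
```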

#### 1. Splitting Dataset using sklearn

In [None]:
from sklearn.model_selection import train_test_split 
X = featuresdf.feature
y = featuresdf.class_category
x_train, x_test, y_train, y_test = train_test_split(list(X.values), list(y.values), test_size=0.2, random_state = 42, stratify=y)

In [None]:


x_train_base = x_train
x_test_base = x_test

#### 2. Converting Data into Torch tensors

In [None]:
import numpy as np

x_train = np.array(x_train_base)
x_test = np.array(x_test_base)

y_train = np.array(y_train)
y_test = np.array(y_test)

In [None]:
x_train_tensor = torch.tensor(x_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)


x_test_tensor = torch.tensor(x_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

In [None]:
print(x_train_tensor.type(),y_train_tensor.type())

#### 3 & 4. Converting into TensorDatasets and feeding them to DataLoaders

In [None]:
train_dataset = TensorDataset(x_train_tensor,y_train_tensor)
test_dataset = TensorDataset(x_test_tensor,y_test_tensor)

train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dl = DataLoader(test_dataset, batch_size=64*2)

### Model Building

#### Fixing Input and Output Size

In [None]:
input_size = 40
output_size = featuresdf['class_category'].nunique()

In [None]:
output_size

#### Model

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision

class UrbanSoundBase(nn.Module):

  def training_step(self, batch_images):
    images,labels = batch_images
    outputs = self(images)
    loss = F.cross_entropy(outputs, labels)
    acc = self.accuracy(outputs, labels)
    return {'loss' : loss, 'acc' : acc}

  def accuracy(self, outputs, labels):
    # Softmax does not change which class scores highest, so take the argmax of the raw outputs
    predictions = torch.argmax(outputs, dim=1)
    assert predictions.shape == labels.shape
    # Fraction of correct predictions in the batch, as a float tensor
    return (predictions == labels).float().mean()

  def validation_step(self, batch_images, device):
    images,labels = batch_images
    images = images.to(device)
    labels = labels.to(device)
    outputs = self(images)
    loss = F.cross_entropy(outputs, labels)
    acc = self.accuracy(outputs, labels)

    return {'val_loss' : loss, 'val_acc' : acc}

  def epoch_end(self, history):
    loss = torch.stack([batch['val_loss'] for batch in history]).mean().item()
    accuracy = torch.stack([batch['val_acc'] for batch in history]).mean().item()

    return {'val_loss': loss, 'val_acc' : accuracy}

  @torch.no_grad()
  def evaluate_validation(self, valid_dataloader, device):
    return [self.validation_step(image_label_batch, device) for image_label_batch in valid_dataloader]

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision

class NNModel_3Layers(UrbanSoundBase):
    """
    Layer Size : 128, 256, 10
    Drop Outs : True, True, True
    Drop Values : 0.2, 0.3, 0.3
    """
    def __init__(self, input_size, output_size):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2))
        
        self.block2 = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Dropout(0.3))
        
        self.block3 = nn.Sequential(
            nn.Linear(256, output_size),
            nn.ReLU(),
            nn.Dropout(0.3))

        
    def forward(self, xb):
        out = self.block1(xb)
        out = self.block2(out)
        return self.block3(out)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision

class NNModel_4Layers(UrbanSoundBase):
    """
    Layer Size : 128, 256, 256, 10
    Drop Outs : True, True, True, True
    Drop Values : 0.3, 0.3, 0.3, 0.3
    """
    def __init__(self, input_size, output_size):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.3))
        
        self.block2 = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Dropout(0.3))
        
        self.block3 = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Dropout(0.3))

        self.block4 = nn.Sequential(
            nn.Linear(256, output_size),
            nn.ReLU(),
            nn.Dropout(0.3))          
        
    def forward(self, xb):
        out = self.block1(xb)
        out = self.block2(out)
        out = self.block3(out)
        return self.block4(out)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision

class NNModel_5Layers(UrbanSoundBase):
    """
    Layer Size : 128, 128, 256, 256, 10
    Drop Outs : True, True, True, True, True
    Drop Values : 0.2, 0.3, 0.3, 0.2, 0.1
    """
    def __init__(self, input_size, output_size):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2))
        
        self.block2 = nn.Sequential(
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Dropout(0.3))
        
        self.block3 = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Dropout(0.3))

        self.block4 = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Dropout(0.2))
        
        self.block5 = nn.Sequential(
            nn.Linear(256, output_size),
            nn.ReLU(),
            nn.Dropout(0.1)) 
        
    def forward(self, xb):
        out = self.block1(xb)
        out = self.block2(out)
        out = self.block3(out)
        out = self.block4(out)
        return self.block5(out)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision

class NNModel_6Layers(UrbanSoundBase):
    """
    Layer Size : 128, 256, 512, 512, 256, 10
    Drop Outs : True, True, False, True, True, True
    Drop Values : 0.2, 0.3, _, 0.3, 0.1, 0.1
    """
    def __init__(self, input_size, output_size):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2))
        
        self.block2 = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Dropout(0.3))
        
        self.block3 = nn.Sequential(
            nn.Linear(256, 512),
            nn.ReLU())

        self.block4 = nn.Sequential(
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Dropout(0.3))   
        
        self.block5 = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.1)) 
        
        self.block6 = nn.Sequential(
            nn.Linear(256, output_size),
            nn.ReLU(),
            nn.Dropout(0.1)) 
        
    def forward(self, xb):
        out = self.block1(xb)
        out = self.block2(out)
        out = self.block3(out)
        out = self.block4(out)
        out = self.block5(out)
        return self.block6(out)
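
A caveat that applies to all four models above: the last block applies `nn.ReLU()` (and `nn.Dropout`) to the layer that produces the class scores. `F.cross_entropy` expects unnormalized logits that can take any real value, and ReLU clamps every negative logit to 0, so several classes can end up tied at exactly zero. A common alternative is to end with a plain `nn.Linear`. The sketch below shows the clamping on a fixed logit vector and a hypothetical variant, `NNModel_3Layers_LinearHead`, which is not the architecture behind the results reported in this notebook; to train it with `fit`, it would need to inherit from `UrbanSoundBase` instead of `nn.Module`.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, -3.0, 0.5]])
# ReLU maps every negative score to 0, erasing the ranking between classes 1 and 2
print(nn.ReLU()(logits))

class NNModel_3Layers_LinearHead(nn.Module):
    """Hypothetical variant of NNModel_3Layers whose last layer outputs raw logits."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, output_size),  # no ReLU/Dropout: logits go straight to cross_entropy
        )

    def forward(self, xb):
        return self.net(xb)

linear_head_model = NNModel_3Layers_LinearHead(40, 10)
print(linear_head_model(torch.randn(8, 40)).shape)  # torch.Size([8, 10])
```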

### Model Training

#### Accuracy and Loss Before Training

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
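# Instantiate the 4-layer model and move it to the same device as the validation batches
model_4_layers = NNModel_4Layers(input_size, output_size).to(device)
model_4_layers.eval()
valid_history = model_4_layers.evaluate_validation(test_dl, device)
acc = torch.stack([valid['val_acc'] for valid in valid_history]).mean().item()
loss = torch.stack([valid['val_loss'] for valid in valid_history]).mean().item()

print(f"Accuracy : {acc:.3f}, Loss : {loss:.3f}")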
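
As a sanity check for these numbers: a classifier that spreads its probability evenly over C classes has a cross-entropy loss of ln(C), about 2.303 for the 10 classes here, and an accuracy of roughly 1/C = 10%. The check below feeds all-zero logits (which softmax turns into a uniform distribution) to `F.cross_entropy`; values far from these before training can hint at a bug. The target labels are arbitrary.

```python
import math

import torch
import torch.nn.functional as F

n_classes = 10
uniform_logits = torch.zeros(6, n_classes)      # softmax of equal logits is uniform
targets = torch.tensor([0, 3, 5, 7, 9, 1])      # arbitrary true classes
loss_uniform = F.cross_entropy(uniform_logits, targets)

print(f"{loss_uniform.item():.4f} vs ln(10) = {math.log(n_classes):.4f}")  # 2.3026 vs ln(10) = 2.3026
```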

#### Fit Function

In [None]:
def fit(model,train_dataloader, valid_dataloader, n_epochs,lr = 0.001, optim = torch.optim.Adam, device = device):
  model_history = {'train_loss' : [], 'valid_loss' : [], 'train_acc' : [], 'valid_acc' : []}
  # Move the parameters to the training device before building the optimizer
  model.to(device)
  optimizer = optim(model.parameters(), lr)

  for epoch in range(n_epochs):
    model.train()
    train_loss = []
    train_accuracy = []

    for image_label_batch in train_dataloader:
      image, label = image_label_batch[0].to(device), image_label_batch[1].to(device)
      loss_acc = model.training_step((image, label))
      loss = loss_acc['loss']
      acc = loss_acc['acc']
      loss.backward()
      train_loss.append(loss.detach())  # detach so each batch's autograd graph is freed
      train_accuracy.append(acc)

      optimizer.step()
      optimizer.zero_grad()
    
    model.eval()
    
    with torch.no_grad():
        val_history = model.evaluate_validation(valid_dataloader, device)
        result = model.epoch_end(val_history)

    model_history['train_loss'].append(torch.stack(train_loss).mean().item())
    model_history['valid_loss'].append(result['val_loss'])
    model_history['train_acc'].append(torch.stack(train_accuracy).mean().item())
    model_history['valid_acc'].append(result['val_acc'])
    print(f"Epoch : {epoch}, Train Loss : {model_history['train_loss'][-1]:.2f}, Train Accuracy : {model_history['train_acc'][-1]:.2f}, Validation Loss : {result['val_loss']:.2f}, Validation Accuracy : {result['val_acc']:.2f}")


  return model_history

In [None]:
import matplotlib.pyplot as plt

#### Test Case : n_epoch = 300, lr=0.001, SGD, final_acc = 18%

In [None]:
history = fit(model_4_layers, train_dl, test_dl, 300, 0.001, optim = torch.optim.SGD, device = device)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_4_layers_simple : 300 Epochs, 0.001 SGD", fontsize=16)
ax1.plot(history['train_acc'], label='train accuracy')
ax1.plot(history['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history['train_loss'], label='train loss')
ax2.plot(history['valid_loss'], label='valid loss')
ax2.legend()

fig.show()
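
The same plotting code is repeated after every experiment below. A small helper such as the hypothetical `plot_history` sketched here could replace those cells; it expects the dictionary returned by `fit` (keys `train_acc`, `valid_acc`, `train_loss`, `valid_loss`). The three-epoch history in the usage lines is made up.

```python
import matplotlib.pyplot as plt

def plot_history(history, title):
    """Plot accuracy and loss curves from the dict returned by fit()."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    fig.suptitle(title, fontsize=16)

    # Left: accuracy per epoch
    ax1.plot(history['train_acc'], label='train accuracy')
    ax1.plot(history['valid_acc'], label='valid accuracy')
    ax1.set_xlabel('epoch')
    ax1.legend()

    # Right: loss per epoch
    ax2.plot(history['train_loss'], label='train loss')
    ax2.plot(history['valid_loss'], label='valid loss')
    ax2.set_xlabel('epoch')
    ax2.legend()
    return fig

# Usage with a made-up three-epoch history
demo_history = {'train_acc': [0.2, 0.4, 0.5], 'valid_acc': [0.2, 0.35, 0.45],
                'train_loss': [2.2, 1.8, 1.5], 'valid_loss': [2.2, 1.9, 1.7]}
demo_fig = plot_history(demo_history, "demo : 3 Epochs")
```

In the notebook, a call would look like `plot_history(history2, "model_nn_4_layers_simple : 1000 Epochs, 0.001 SGD")`.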

#### Test Case : n_epoch = 1000, lr=0.001, SGD, final_acc = 74%

In [None]:
model_4_layers = NNModel_4Layers(input_size, output_size)
history2 = fit(model_4_layers, train_dl, test_dl, 1000, 0.001, optim=torch.optim.SGD, device = device)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_4_layers_simple : 1000 Epochs, 0.001 SGD", fontsize=16)
ax1.plot(history2['train_acc'], label='train accuracy')
ax1.plot(history2['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history2['train_loss'], label='train loss')
ax2.plot(history2['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

#### Test Case : n_epoch = 500, lr=0.01, SGD, final_acc = 73%

In [None]:
model_4_layers = NNModel_4Layers(input_size, output_size)
history3 = fit(model_4_layers, train_dl, test_dl, 500, 0.01, optim=torch.optim.SGD)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_4_layers_simple : 500 Epochs, 0.01 SGD", fontsize=16)
ax1.plot(history3['train_acc'], label='train accuracy')
ax1.plot(history3['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history3['train_loss'], label='train loss')
ax2.plot(history3['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

In [None]:
model_5_layers = NNModel_5Layers(input_size, output_size)
history4 = fit(model_5_layers, train_dl, test_dl, 500, 0.01, optim=torch.optim.SGD)

In [None]:

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_5_layers_simple : 500 Epochs, 0.01 SGD", fontsize=16)
ax1.plot(history4['train_acc'], label='train accuracy')
ax1.plot(history4['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history4['train_loss'], label='train loss')
ax2.plot(history4['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

In [None]:
model_3_layers = NNModel_3Layers(input_size, output_size)
history5 = fit(model_3_layers, train_dl, test_dl, 500, 0.001, optim=torch.optim.SGD)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_3_layers_simple : 500 Epochs, 0.001 SGD", fontsize=16)
ax1.plot(history5['train_acc'], label='train accuracy')
ax1.plot(history5['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history5['train_loss'], label='train loss')
ax2.plot(history5['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

In [None]:
model_3_layers = NNModel_3Layers(input_size, output_size)
history6 = fit(model_3_layers, train_dl, test_dl, 500, 0.001, optim=torch.optim.Adam)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_3_layers_simple : 500 Epochs, 0.001 Adam", fontsize=16)
ax1.plot(history6['train_acc'], label='train accuracy')
ax1.plot(history6['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history6['train_loss'], label='train loss')
ax2.plot(history6['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

In [None]:
model_3_layers = NNModel_3Layers(input_size, output_size)
history7 = fit(model_3_layers, train_dl, test_dl, 500, 3e-4, optim=torch.optim.Adam)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_3_layers_simple : 500 Epochs, 3e-4 Adam", fontsize=16)
ax1.plot(history7['train_acc'], label='train accuracy')
ax1.plot(history7['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history7['train_loss'], label='train loss')
ax2.plot(history7['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

In [None]:
model_6_layers = NNModel_6Layers(input_size, output_size)
history8 = fit(model_6_layers, train_dl, test_dl, 500, 0.001, optim=torch.optim.SGD)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_6_layers_simple : 500 Epochs, 0.001 SGD", fontsize=16)
ax1.plot(history8['train_acc'], label='train accuracy')
ax1.plot(history8['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history8['train_loss'], label='train loss')
ax2.plot(history8['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

In [None]:
model_5_layers = NNModel_5Layers(input_size, output_size)
history4 = fit(model_5_layers, train_dl, test_dl, 500, 0.001, optim=torch.optim.SGD)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))

fig.suptitle(f"model_nn_5_layers_simple : 500 Epochs, 0.001 SGD", fontsize=16)
ax1.plot(history4['train_acc'], label='train accuracy')
ax1.plot(history4['valid_acc'], label='valid accuracy')
ax1.legend()


ax2.plot(history4['train_loss'], label='train loss')
ax2.plot(history4['valid_loss'], label='valid loss')
ax2.legend()

fig.show()

## Inference

In [None]:
metadata.head()

In [None]:
featuresdf.head()

In [None]:
meta_feature = pd.concat([metadata, featuresdf], axis=1)

In [None]:
meta_feature.head()

**Finally using model_5_layers.**

Creating Noise to Index Mapping and Vice-Versa.

In [None]:
meta_feature.loc[:,["class_label", "class_category"]]

In [None]:
#For converting the numerical codes of the labels into categorical values for readability
noise_2_idx = dict(set(meta_feature[['class_label', 'class_category']].to_records(index=False).tolist()))
idx_2_noise = { idx : noise for noise,idx in noise_2_idx.items()}

In [None]:
random_samples = meta_feature.sample(5)
msfcc_feature, labels = random_samples.feature.values.tolist(), random_samples.class_category.values.tolist() 

In [None]:
random_samples

In [None]:
msfcc_feature = torch.tensor(msfcc_feature)
labels = torch.tensor(labels)

print(msfcc_feature.shape, labels.shape)

In [None]:
model_5_layers

In [None]:
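# Inference: disable dropout and gradient tracking, and put the inputs on the model's device
model_5_layers.eval()
model_device = next(model_5_layers.parameters()).device
with torch.no_grad():
    outputs = model_5_layers(msfcc_feature.to(model_device))

# The argmax over the outputs is the predicted class index (softmax would not change it)
_, preds = torch.max(outputs, dim=1)

preds = preds.cpu().numpy().tolist()
preds_labels = [idx_2_noise[pred] for pred in preds]


labels = labels.numpy().tolist()
labels_names = [idx_2_noise[label] for label in labels]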
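
The inference steps above (evaluation mode, no gradient tracking, inputs on the model's device, argmax over the outputs) can be wrapped in a small reusable function. `predict_classes` is a hypothetical name, and the toy linear model in the usage lines only stands in for `model_5_layers` so that the sketch runs on its own.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_classes(model, features):
    """Return predicted class indices (a Python list) for a batch of feature vectors."""
    model.eval()  # disable dropout
    device = next(model.parameters()).device
    inputs = torch.as_tensor(features, dtype=torch.float32, device=device)
    outputs = model(inputs)
    # argmax over raw outputs; softmax would not change which index is largest
    return outputs.argmax(dim=1).cpu().tolist()

# Toy stand-in for model_5_layers: 40 inputs -> 10 classes
toy_model = nn.Linear(40, 10)
with torch.no_grad():
    toy_model.weight.zero_()
    toy_model.bias.copy_(torch.arange(10, dtype=torch.float32))  # class 9 always scores highest

print(predict_classes(toy_model, [[0.0] * 40, [1.0] * 40]))  # [9, 9]
```

In the notebook, `[idx_2_noise[p] for p in predict_classes(model_5_layers, random_samples.feature.tolist())]` would give the predicted class names.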

## Final Inferences

In [None]:
random_samples.iloc[0]

In [None]:
audio_path = f"/content/UrbanSound8K/audio/fold{random_samples.iloc[0].values[5]}/" + random_samples.iloc[0].values[0]
data, sr = librosa.load(audio_path)

plt.figure(figsize=(15,6))
_ = librosa.display.waveshow(data)

plt.title(f"Original Sound : {labels_names[0]}, Predicted : {preds_labels[0]}", fontsize=16)
ipd.Audio(audio_path)

In [None]:
idx = 1
audio_path = f"/content/UrbanSound8K/audio/fold{random_samples.iloc[idx].values[5]}/" + random_samples.iloc[idx].values[0]
data, sr = librosa.load(audio_path)

plt.figure(figsize=(15,6))
_ = librosa.display.waveshow(data)

plt.title(f"Original Sound : {labels_names[idx]}, Predicted : {preds_labels[idx]}", fontsize=16)
ipd.Audio(audio_path)

In [None]:
idx = 2
audio_path = f"/content/UrbanSound8K/audio/fold{random_samples.iloc[idx].values[5]}/" + random_samples.iloc[idx].values[0]
data, sr = librosa.load(audio_path)

plt.figure(figsize=(15,6))
_ = librosa.display.waveshow(data)

plt.title(f"Original Sound : {labels_names[idx]}, Predicted : {preds_labels[idx]}", fontsize=16)
ipd.Audio(audio_path)

In [None]:
idx = 3
audio_path = f"/content/UrbanSound8K/audio/fold{random_samples.iloc[idx].values[5]}/" + random_samples.iloc[idx].values[0]
data, sr = librosa.load(audio_path)

plt.figure(figsize=(15,6))
_ = librosa.display.waveshow(data)

plt.title(f"Original Sound : {labels_names[idx]}, Predicted : {preds_labels[idx]}", fontsize=16)
ipd.Audio(audio_path)

In [None]:
idx = 4
audio_path = f"/content/UrbanSound8K/audio/fold{random_samples.iloc[idx].values[5]}/" + random_samples.iloc[idx].values[0]
data, sr = librosa.load(audio_path)

plt.figure(figsize=(15,6))
_ = librosa.display.waveshow(data)

plt.title(f"Original Sound : {labels_names[idx]}, Predicted : {preds_labels[idx]}", fontsize=16)
ipd.Audio(audio_path)

## Future Scope and Other Methods to Try
* MFCCs are not the only features that can be used in audio analysis; besides them, you can try volume, energy, pitch, zero-crossing rate, spectral centroid, etc. as additional features.

* Instead of (or in addition to) extracting features via MFCCs, a dimensionality-reduction technique such as PCA can be used. t-SNE is better suited to visualizing the features, since it does not learn a mapping that can be applied to new samples.