# Method 1: Using a CNN in Pytorch

In [1]:
import os
import numpy as np
import pandas as pd
import librosa
import torch
import torchaudio
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
import warnings
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import torch.nn.functional as F

warnings.filterwarnings('ignore')

In [2]:
dataset_path = "/kaggle/input/urbansound8k"
# Load the metadata CSV file containing labels and file information
metadata = pd.read_csv(os.path.join(dataset_path, 'UrbanSound8K.csv'))

def extract_features(file_path):
    """
    Extracts audio features from a given audio file.

    Parameters:
        file_path (str): The path to the audio file.

    Returns:
        np.ndarray: A feature vector containing aggregated audio features.
    """
    # Load the audio file using librosa
    audio, sample_rate = librosa.load(file_path, sr=None)

    # Compute the Mel Frequency Cepstral Coefficients (MFCCs)
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)

    # Calculate the first-order delta (rate of change) of the MFCCs
    mfcc_delta = librosa.feature.delta(mfccs, width=3)
    # Calculate the second-order delta (acceleration) of the MFCCs
    mfcc_delta2 = librosa.feature.delta(mfccs, order=2, width=3)
    
    # Aggregate features by calculating the mean and variance over time for each feature
    feature_vector = np.hstack([
        np.mean(mfccs, axis=1), np.var(mfccs, axis=1),      # Mean and variance of MFCCs
        np.mean(mfcc_delta, axis=1), np.var(mfcc_delta, axis=1),  # Mean and variance of first-order delta MFCCs
        np.mean(mfcc_delta2, axis=1), np.var(mfcc_delta2, axis=1) # Mean and variance of second-order delta MFCCs
    ])
    
    return feature_vector

def prepare_data(metadata, dataset_path):
    """
    Prepares the dataset by extracting features and labels from the audio files.

    Parameters:
        metadata (DataFrame): The DataFrame containing file metadata.
        dataset_path (str): The base path to the dataset files.

    Returns:
        tuple: A tuple containing an array of features and an array of labels.
    """
    features = []  # List to store feature vectors
    labels = []    # List to store corresponding labels
    
    # Use tqdm to show progress while iterating through the metadata rows
    for i, row in tqdm(metadata.iterrows(), total=len(metadata), desc="Extracting features"):
        # Construct the full file path for the audio file
        file_path = os.path.join(dataset_path, f'fold{row["fold"]}', row["slice_file_name"])
        # Extract features from the audio file
        feature_vector = extract_features(file_path)
        features.append(feature_vector)  # Append the feature vector to the list
        labels.append(row["classID"])     # Append the corresponding label to the list
    
    return np.array(features), np.array(labels)  # Return features and labels as numpy arrays

# Prepare the dataset by extracting features and labels from the metadata
features, labels = prepare_data(metadata, dataset_path)

Extracting features: 100%|██████████| 8732/8732 [10:41<00:00, 13.61it/s]


In [3]:
class UrbanSoundDataset(Dataset):
    """
    Custom dataset class for the UrbanSound dataset.
    
    This class is used to wrap the features and labels into a format suitable for 
    PyTorch's DataLoader, allowing for easy iteration and batching during training.
    """
    
    def __init__(self, features, labels):
        """
        Initializes the UrbanSoundDataset with features and labels.
        
        Parameters:
            features (np.ndarray): The array of extracted audio features.
            labels (np.ndarray): The array of corresponding labels for the features.
        """
        self.features = features  # Store the extracted features
        self.labels = labels      # Store the corresponding labels
    
    def __len__(self):
        """
        Returns the total number of samples in the dataset.
        
        Returns:
            int: The number of feature-label pairs in the dataset.
    """
        return len(self.features)
    
    def __getitem__(self, idx):
        """
        Retrieves a sample from the dataset at the specified index.
        
        This method is called by the DataLoader to fetch individual samples.
        
        Parameters:
            idx (int): The index of the sample to retrieve.
        
        Returns:
            tuple: A tuple containing the feature tensor and the label tensor.
        """
        return (
            torch.tensor(self.features[idx], dtype=torch.float32),  # Convert feature to tensor
            torch.tensor(self.labels[idx], dtype=torch.long)        # Convert label to tensor
        )

# Split the dataset into training, testing, and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(features, labels, test_size=0.3)  # 70% training, 30% temporary
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5)        # Split temporary set into 15% test and 15% validation

# Create dataset instances for each split
train_dataset = UrbanSoundDataset(X_train, y_train)  # Training dataset
test_dataset = UrbanSoundDataset(X_test, y_test)      # Testing dataset
val_dataset = UrbanSoundDataset(X_val, y_val)          # Validation dataset

# Create data loaders for each dataset to facilitate batching and shuffling
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)  # Training data loader with shuffling
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)    # Testing data loader without shuffling
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)      # Validation data loader without shuffling

In [4]:
class EnhancedSoundCNN(nn.Module):
    """
    A Convolutional Neural Network (CNN) for sound classification.

    This model employs 1D convolutional layers to learn features from audio data, 
    followed by fully connected layers to classify the audio into predefined categories.
    """
    
    def __init__(self):
        """
        Initializes the EnhancedSoundCNN model by defining the architecture.

        The model consists of several convolutional layers, pooling layers, 
        dropout for regularization, and fully connected layers.
        """
        super(EnhancedSoundCNN, self).__init__()
        
        # First convolutional layer: 
        # Input channels = 1 (for mono audio), Output channels = 32, Kernel size = 3
        self.conv1 = nn.Conv1d(1, 32, kernel_size=3, stride=1, padding=1)
        
        # Second convolutional layer:
        # Input channels = 32 (from previous layer), Output channels = 64
        self.conv2 = nn.Conv1d(32, 64, kernel_size=3, stride=1, padding=1)
        
        # Third convolutional layer:
        # Input channels = 64 (from previous layer), Output channels = 128
        self.conv3 = nn.Conv1d(64, 128, kernel_size=3, stride=1, padding=1)
        
        # Max pooling layer to reduce the dimensionality of the output from the convolutional layers
        self.pool = nn.MaxPool1d(2)  # Pooling size of 2
        
        # Fully connected layers:
        # The input size for the first fully connected layer is adjusted based on the output size from the convolutional layers
        self.fc1 = nn.Linear(128 * 30, 256)  # Example size, adjust based on the actual flattened output
        self.fc2 = nn.Linear(256, 128)        # Second fully connected layer
        self.fc3 = nn.Linear(128, 10)         # Output layer for 10 classes
        
        # Dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        """
        Defines the forward pass of the model.

        The input tensor passes through the convolutional layers, followed by pooling, 
        and then through the fully connected layers. 

        Parameters:
            x (torch.Tensor): Input tensor containing audio features.

        Returns:
            torch.Tensor: Output tensor containing the class scores for the input audio.
        """
        # Apply the first convolutional layer followed by ReLU activation and max pooling
        x = self.pool(F.relu(self.conv1(x)))
        
        # Apply the second convolutional layer followed by ReLU activation and max pooling
        x = self.pool(F.relu(self.conv2(x)))
        
        # Apply the third convolutional layer followed by ReLU activation and max pooling
        x = self.pool(F.relu(self.conv3(x)))

        # Flatten the output for the fully connected layers
        x = x.view(x.size(0), -1)

        # Pass through the first fully connected layer with ReLU activation
        x = F.relu(self.fc1(x))
        
        # Apply dropout for regularization
        x = self.dropout(x)
        
        # Pass through the second fully connected layer with ReLU activation
        x = F.relu(self.fc2(x))
        
        # Final output layer (class scores)
        x = self.fc3(x)
        return x

# Determine the device to use (GPU if available, else CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Instantiate the EnhancedSoundCNN model and move it to the appropriate device
model = EnhancedSoundCNN().to(device)

In [5]:
def train_model(model, train_loader, criterion, optimizer, num_epochs=10):
    """
    Trains the specified model using the provided training data and evaluation criteria.

    Parameters:
        model (nn.Module): The neural network model to be trained.
        train_loader (DataLoader): DataLoader providing training data in batches.
        criterion (loss function): The loss function used to evaluate the model.
        optimizer (Optimizer): The optimization algorithm used to update model weights.
        num_epochs (int): The number of training epochs.
    """
    model.train()  # Set the model to training mode

    best_val_acc = 0  # Initialize the best validation accuracy tracker
    
    for epoch in range(num_epochs):
        running_loss = 0.0  # Initialize loss for the current epoch
        correct_train = 0    # Counter for correct predictions in training
        total_train = 0      # Total number of training samples
        
        # Iterate over batches in the training DataLoader with a progress bar
        for inputs, labels in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
            inputs = inputs.unsqueeze(1)  # Add a channel dimension for the model input
            optimizer.zero_grad()          # Zero the gradients from the previous iteration
            outputs = model(inputs)        # Forward pass: compute model outputs
            loss = criterion(outputs, labels)  # Calculate the loss using the criterion
            loss.backward()                # Backward pass: compute gradients
            optimizer.step()               # Update model parameters
            
            running_loss += loss.item()    # Accumulate the loss

            # Calculate training accuracy
            _, predicted = torch.max(outputs, 1)  # Get the predicted class indices
            correct_train += (predicted == labels).sum().item()  # Count correct predictions
            total_train += labels.size(0)         # Update total training samples

        # Compute average training loss and accuracy for this epoch
        train_accuracy = 100 * correct_train / total_train
        train_loss = running_loss / len(train_loader)

        # Validation Phase
        model.eval()  # Set the model to evaluation mode
        correct_val = 0        
        total_val = 0

        with torch.no_grad():  # Disable gradient calculation for validation
            for batch in val_loader:
                inputs, labels = batch
                inputs = inputs.unsqueeze(1)  # Add a channel dimension
                inputs, labels = inputs.to(device), labels.to(device)  # Move to the appropriate device

                # Forward pass for validation
                outputs = model(inputs)
                loss = criterion(outputs, labels)  # Calculate validation loss

                # Calculate validation accuracy
                _, predicted = torch.max(outputs, 1)  # Get the predicted class indices
                correct_val += (predicted == labels).sum().item()  # Count correct predictions
                total_val += labels.size(0)  # Update total validation samples
        
        # Compute validation accuracy for this epoch
        val_accuracy = 100 * correct_val / total_val

        # Log the statistics for this epoch
        print(f"Epoch [{epoch+1}/{num_epochs}], "
              f"Train Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy:.2f}%, "
              f"Validation Accuracy: {val_accuracy:.2f}%")

        # Save the best model based on validation accuracy
        if val_accuracy > best_val_acc:
            best_val_acc = val_accuracy  # Update the best validation accuracy
            torch.save(model.state_dict(), "best_model.pth")  # Save the model parameters
            print(f"Model saved with validation accuracy: {val_accuracy:.2f}%")

    print(f"Best Validation Accuracy: {best_val_acc:.2f}%")  # Log the best validation accuracy

def load_best_model(model, save_path='best_model.pth'):
    """
    Loads the model parameters from the best model file.

    Parameters:
        model (nn.Module): The neural network model to load parameters into.
        save_path (str): The path to the saved model parameters file.
    """
    model.load_state_dict(torch.load(save_path))  # Load the model parameters from file
    model.eval()  # Set the model to evaluation mode
    print("Best model loaded.")  # Confirm loading of the best model

# Train the model with specified parameters
train_model(model, train_loader, nn.CrossEntropyLoss(), optim.Adam(model.parameters(), lr=0.001), num_epochs=100)

# Load the best model after training
load_best_model(model, 'best_model.pth')

Training Epoch 1/100: 100%|██████████| 48/48 [00:03<00:00, 12.64it/s]


Epoch [1/100], Train Loss: 3.0609, Train Accuracy: 33.69%, Validation Accuracy: 53.13%
Model saved with validation accuracy: 53.13%


Training Epoch 2/100: 100%|██████████| 48/48 [00:03<00:00, 13.64it/s]


Epoch [2/100], Train Loss: 1.1382, Train Accuracy: 61.49%, Validation Accuracy: 70.84%
Model saved with validation accuracy: 70.84%


Training Epoch 3/100: 100%|██████████| 48/48 [00:03<00:00, 13.36it/s]


Epoch [3/100], Train Loss: 0.8421, Train Accuracy: 72.17%, Validation Accuracy: 76.26%
Model saved with validation accuracy: 76.26%


Training Epoch 4/100: 100%|██████████| 48/48 [00:04<00:00, 11.78it/s]


Epoch [4/100], Train Loss: 0.6895, Train Accuracy: 77.36%, Validation Accuracy: 80.46%
Model saved with validation accuracy: 80.46%


Training Epoch 5/100: 100%|██████████| 48/48 [00:03<00:00, 13.30it/s]


Epoch [5/100], Train Loss: 0.5513, Train Accuracy: 81.68%, Validation Accuracy: 82.52%
Model saved with validation accuracy: 82.52%


Training Epoch 6/100: 100%|██████████| 48/48 [00:03<00:00, 13.61it/s]


Epoch [6/100], Train Loss: 0.4713, Train Accuracy: 85.23%, Validation Accuracy: 82.21%


Training Epoch 7/100: 100%|██████████| 48/48 [00:03<00:00, 13.46it/s]


Epoch [7/100], Train Loss: 0.4108, Train Accuracy: 86.75%, Validation Accuracy: 85.80%
Model saved with validation accuracy: 85.80%


Training Epoch 8/100: 100%|██████████| 48/48 [00:03<00:00, 13.90it/s]


Epoch [8/100], Train Loss: 0.3436, Train Accuracy: 88.84%, Validation Accuracy: 84.73%


Training Epoch 9/100: 100%|██████████| 48/48 [00:03<00:00, 13.63it/s]


Epoch [9/100], Train Loss: 0.3057, Train Accuracy: 90.43%, Validation Accuracy: 86.56%
Model saved with validation accuracy: 86.56%


Training Epoch 10/100: 100%|██████████| 48/48 [00:03<00:00, 13.58it/s]


Epoch [10/100], Train Loss: 0.2520, Train Accuracy: 91.97%, Validation Accuracy: 88.02%
Model saved with validation accuracy: 88.02%


Training Epoch 11/100: 100%|██████████| 48/48 [00:03<00:00, 13.29it/s]


Epoch [11/100], Train Loss: 0.2557, Train Accuracy: 91.43%, Validation Accuracy: 87.25%


Training Epoch 12/100: 100%|██████████| 48/48 [00:03<00:00, 13.58it/s]


Epoch [12/100], Train Loss: 0.2311, Train Accuracy: 92.29%, Validation Accuracy: 89.39%
Model saved with validation accuracy: 89.39%


Training Epoch 13/100: 100%|██████████| 48/48 [00:04<00:00, 11.14it/s]


Epoch [13/100], Train Loss: 0.1836, Train Accuracy: 94.19%, Validation Accuracy: 89.16%


Training Epoch 14/100: 100%|██████████| 48/48 [00:03<00:00, 13.04it/s]


Epoch [14/100], Train Loss: 0.1563, Train Accuracy: 95.17%, Validation Accuracy: 90.69%
Model saved with validation accuracy: 90.69%


Training Epoch 15/100: 100%|██████████| 48/48 [00:03<00:00, 12.46it/s]


Epoch [15/100], Train Loss: 0.1402, Train Accuracy: 95.53%, Validation Accuracy: 89.24%


Training Epoch 16/100: 100%|██████████| 48/48 [00:04<00:00, 11.81it/s]


Epoch [16/100], Train Loss: 0.1397, Train Accuracy: 95.44%, Validation Accuracy: 90.31%


Training Epoch 17/100: 100%|██████████| 48/48 [00:04<00:00, 11.96it/s]


Epoch [17/100], Train Loss: 0.0961, Train Accuracy: 97.27%, Validation Accuracy: 91.91%
Model saved with validation accuracy: 91.91%


Training Epoch 18/100: 100%|██████████| 48/48 [00:04<00:00, 11.99it/s]


Epoch [18/100], Train Loss: 0.1142, Train Accuracy: 95.98%, Validation Accuracy: 91.37%


Training Epoch 19/100: 100%|██████████| 48/48 [00:03<00:00, 12.34it/s]


Epoch [19/100], Train Loss: 0.0891, Train Accuracy: 96.96%, Validation Accuracy: 90.00%


Training Epoch 20/100: 100%|██████████| 48/48 [00:03<00:00, 12.21it/s]


Epoch [20/100], Train Loss: 0.0837, Train Accuracy: 97.41%, Validation Accuracy: 90.69%


Training Epoch 21/100: 100%|██████████| 48/48 [00:04<00:00, 10.75it/s]


Epoch [21/100], Train Loss: 0.0830, Train Accuracy: 97.28%, Validation Accuracy: 92.37%
Model saved with validation accuracy: 92.37%


Training Epoch 22/100: 100%|██████████| 48/48 [00:03<00:00, 12.33it/s]


Epoch [22/100], Train Loss: 0.0501, Train Accuracy: 98.54%, Validation Accuracy: 92.06%


Training Epoch 23/100: 100%|██████████| 48/48 [00:03<00:00, 12.11it/s]


Epoch [23/100], Train Loss: 0.0415, Train Accuracy: 98.79%, Validation Accuracy: 90.99%


Training Epoch 24/100: 100%|██████████| 48/48 [00:03<00:00, 12.31it/s]


Epoch [24/100], Train Loss: 0.0566, Train Accuracy: 98.46%, Validation Accuracy: 91.91%


Training Epoch 25/100: 100%|██████████| 48/48 [00:03<00:00, 12.24it/s]


Epoch [25/100], Train Loss: 0.0830, Train Accuracy: 97.35%, Validation Accuracy: 91.45%


Training Epoch 26/100: 100%|██████████| 48/48 [00:03<00:00, 12.49it/s]


Epoch [26/100], Train Loss: 0.0562, Train Accuracy: 98.07%, Validation Accuracy: 92.14%


Training Epoch 27/100: 100%|██████████| 48/48 [00:03<00:00, 12.22it/s]


Epoch [27/100], Train Loss: 0.0462, Train Accuracy: 98.56%, Validation Accuracy: 92.44%
Model saved with validation accuracy: 92.44%


Training Epoch 28/100: 100%|██████████| 48/48 [00:04<00:00, 10.38it/s]


Epoch [28/100], Train Loss: 0.0431, Train Accuracy: 98.61%, Validation Accuracy: 92.21%


Training Epoch 29/100: 100%|██████████| 48/48 [00:04<00:00, 11.77it/s]


Epoch [29/100], Train Loss: 0.0484, Train Accuracy: 98.46%, Validation Accuracy: 91.60%


Training Epoch 30/100: 100%|██████████| 48/48 [00:03<00:00, 12.36it/s]


Epoch [30/100], Train Loss: 0.1211, Train Accuracy: 95.98%, Validation Accuracy: 90.84%


Training Epoch 31/100: 100%|██████████| 48/48 [00:03<00:00, 12.31it/s]


Epoch [31/100], Train Loss: 0.2141, Train Accuracy: 95.52%, Validation Accuracy: 80.76%


Training Epoch 32/100: 100%|██████████| 48/48 [00:03<00:00, 12.48it/s]


Epoch [32/100], Train Loss: 0.2196, Train Accuracy: 93.52%, Validation Accuracy: 89.62%


Training Epoch 33/100: 100%|██████████| 48/48 [00:03<00:00, 12.42it/s]


Epoch [33/100], Train Loss: 0.0618, Train Accuracy: 98.05%, Validation Accuracy: 90.84%


Training Epoch 34/100: 100%|██████████| 48/48 [00:03<00:00, 12.62it/s]


Epoch [34/100], Train Loss: 0.0376, Train Accuracy: 98.97%, Validation Accuracy: 92.21%


Training Epoch 35/100: 100%|██████████| 48/48 [00:03<00:00, 12.64it/s]


Epoch [35/100], Train Loss: 0.0275, Train Accuracy: 99.20%, Validation Accuracy: 92.37%


Training Epoch 36/100: 100%|██████████| 48/48 [00:04<00:00, 10.58it/s]


Epoch [36/100], Train Loss: 0.0146, Train Accuracy: 99.56%, Validation Accuracy: 93.21%
Model saved with validation accuracy: 93.21%


Training Epoch 37/100: 100%|██████████| 48/48 [00:03<00:00, 12.26it/s]


Epoch [37/100], Train Loss: 0.0127, Train Accuracy: 99.66%, Validation Accuracy: 93.51%
Model saved with validation accuracy: 93.51%


Training Epoch 38/100: 100%|██████████| 48/48 [00:03<00:00, 12.24it/s]


Epoch [38/100], Train Loss: 0.0148, Train Accuracy: 99.57%, Validation Accuracy: 93.51%


Training Epoch 39/100: 100%|██████████| 48/48 [00:03<00:00, 12.39it/s]


Epoch [39/100], Train Loss: 0.0131, Train Accuracy: 99.61%, Validation Accuracy: 93.82%
Model saved with validation accuracy: 93.82%


Training Epoch 40/100: 100%|██████████| 48/48 [00:03<00:00, 12.33it/s]


Epoch [40/100], Train Loss: 0.0082, Train Accuracy: 99.72%, Validation Accuracy: 93.97%
Model saved with validation accuracy: 93.97%


Training Epoch 41/100: 100%|██████████| 48/48 [00:03<00:00, 12.21it/s]


Epoch [41/100], Train Loss: 0.0106, Train Accuracy: 99.69%, Validation Accuracy: 93.44%


Training Epoch 42/100: 100%|██████████| 48/48 [00:03<00:00, 12.15it/s]


Epoch [42/100], Train Loss: 0.0107, Train Accuracy: 99.66%, Validation Accuracy: 93.51%


Training Epoch 43/100: 100%|██████████| 48/48 [00:03<00:00, 12.24it/s]


Epoch [43/100], Train Loss: 0.0121, Train Accuracy: 99.64%, Validation Accuracy: 93.89%


Training Epoch 44/100: 100%|██████████| 48/48 [00:04<00:00, 10.28it/s]


Epoch [44/100], Train Loss: 0.0102, Train Accuracy: 99.75%, Validation Accuracy: 93.44%


Training Epoch 45/100: 100%|██████████| 48/48 [00:03<00:00, 12.16it/s]


Epoch [45/100], Train Loss: 0.0097, Train Accuracy: 99.71%, Validation Accuracy: 93.36%


Training Epoch 46/100: 100%|██████████| 48/48 [00:04<00:00, 11.99it/s]


Epoch [46/100], Train Loss: 0.0096, Train Accuracy: 99.72%, Validation Accuracy: 93.82%


Training Epoch 47/100: 100%|██████████| 48/48 [00:04<00:00, 11.90it/s]


Epoch [47/100], Train Loss: 0.0113, Train Accuracy: 99.67%, Validation Accuracy: 92.29%


Training Epoch 48/100: 100%|██████████| 48/48 [00:04<00:00, 11.91it/s]


Epoch [48/100], Train Loss: 0.0220, Train Accuracy: 99.31%, Validation Accuracy: 92.21%


Training Epoch 49/100: 100%|██████████| 48/48 [00:04<00:00, 11.95it/s]


Epoch [49/100], Train Loss: 0.0980, Train Accuracy: 96.81%, Validation Accuracy: 90.61%


Training Epoch 50/100: 100%|██████████| 48/48 [00:03<00:00, 12.39it/s]


Epoch [50/100], Train Loss: 0.0563, Train Accuracy: 98.05%, Validation Accuracy: 91.91%


Training Epoch 51/100: 100%|██████████| 48/48 [00:04<00:00, 11.97it/s]


Epoch [51/100], Train Loss: 0.0456, Train Accuracy: 98.76%, Validation Accuracy: 92.29%


Training Epoch 52/100: 100%|██████████| 48/48 [00:04<00:00, 10.99it/s]


Epoch [52/100], Train Loss: 0.0268, Train Accuracy: 99.05%, Validation Accuracy: 91.60%


Training Epoch 53/100: 100%|██████████| 48/48 [00:03<00:00, 12.23it/s]


Epoch [53/100], Train Loss: 0.0419, Train Accuracy: 98.48%, Validation Accuracy: 91.68%


Training Epoch 54/100: 100%|██████████| 48/48 [00:04<00:00, 11.76it/s]


Epoch [54/100], Train Loss: 0.0318, Train Accuracy: 98.90%, Validation Accuracy: 92.82%


Training Epoch 55/100: 100%|██████████| 48/48 [00:03<00:00, 12.23it/s]


Epoch [55/100], Train Loss: 0.0173, Train Accuracy: 99.48%, Validation Accuracy: 92.52%


Training Epoch 56/100: 100%|██████████| 48/48 [00:03<00:00, 12.17it/s]


Epoch [56/100], Train Loss: 0.0107, Train Accuracy: 99.62%, Validation Accuracy: 93.28%


Training Epoch 57/100: 100%|██████████| 48/48 [00:03<00:00, 12.30it/s]


Epoch [57/100], Train Loss: 0.0088, Train Accuracy: 99.64%, Validation Accuracy: 93.74%


Training Epoch 58/100: 100%|██████████| 48/48 [00:04<00:00, 11.97it/s]


Epoch [58/100], Train Loss: 0.0081, Train Accuracy: 99.69%, Validation Accuracy: 93.66%


Training Epoch 59/100: 100%|██████████| 48/48 [00:04<00:00, 10.10it/s]


Epoch [59/100], Train Loss: 0.0084, Train Accuracy: 99.67%, Validation Accuracy: 94.27%
Model saved with validation accuracy: 94.27%


Training Epoch 60/100: 100%|██████████| 48/48 [00:04<00:00, 11.98it/s]


Epoch [60/100], Train Loss: 0.0053, Train Accuracy: 99.79%, Validation Accuracy: 93.97%


Training Epoch 61/100: 100%|██████████| 48/48 [00:04<00:00, 12.00it/s]


Epoch [61/100], Train Loss: 0.0060, Train Accuracy: 99.79%, Validation Accuracy: 94.27%


Training Epoch 62/100: 100%|██████████| 48/48 [00:04<00:00, 11.80it/s]


Epoch [62/100], Train Loss: 0.0081, Train Accuracy: 99.79%, Validation Accuracy: 94.73%
Model saved with validation accuracy: 94.73%


Training Epoch 63/100: 100%|██████████| 48/48 [00:04<00:00, 11.87it/s]


Epoch [63/100], Train Loss: 0.0053, Train Accuracy: 99.79%, Validation Accuracy: 94.50%


Training Epoch 64/100: 100%|██████████| 48/48 [00:04<00:00, 11.80it/s]


Epoch [64/100], Train Loss: 0.0057, Train Accuracy: 99.77%, Validation Accuracy: 93.97%


Training Epoch 65/100: 100%|██████████| 48/48 [00:04<00:00, 11.55it/s]


Epoch [65/100], Train Loss: 0.0060, Train Accuracy: 99.75%, Validation Accuracy: 94.43%


Training Epoch 66/100: 100%|██████████| 48/48 [00:04<00:00, 11.74it/s]


Epoch [66/100], Train Loss: 0.0044, Train Accuracy: 99.79%, Validation Accuracy: 94.27%


Training Epoch 67/100: 100%|██████████| 48/48 [00:04<00:00, 10.07it/s]


Epoch [67/100], Train Loss: 0.0052, Train Accuracy: 99.77%, Validation Accuracy: 94.81%
Model saved with validation accuracy: 94.81%


Training Epoch 68/100: 100%|██████████| 48/48 [00:04<00:00, 11.72it/s]


Epoch [68/100], Train Loss: 0.0049, Train Accuracy: 99.79%, Validation Accuracy: 94.73%


Training Epoch 69/100: 100%|██████████| 48/48 [00:03<00:00, 12.00it/s]


Epoch [69/100], Train Loss: 0.0039, Train Accuracy: 99.84%, Validation Accuracy: 94.66%


Training Epoch 70/100: 100%|██████████| 48/48 [00:03<00:00, 12.20it/s]


Epoch [70/100], Train Loss: 0.0044, Train Accuracy: 99.79%, Validation Accuracy: 94.66%


Training Epoch 71/100: 100%|██████████| 48/48 [00:04<00:00, 11.84it/s]


Epoch [71/100], Train Loss: 0.0039, Train Accuracy: 99.79%, Validation Accuracy: 94.66%


Training Epoch 72/100: 100%|██████████| 48/48 [00:04<00:00, 11.88it/s]


Epoch [72/100], Train Loss: 0.0044, Train Accuracy: 99.79%, Validation Accuracy: 95.11%
Model saved with validation accuracy: 95.11%


Training Epoch 73/100: 100%|██████████| 48/48 [00:04<00:00, 11.88it/s]


Epoch [73/100], Train Loss: 0.0054, Train Accuracy: 99.79%, Validation Accuracy: 94.27%


Training Epoch 74/100: 100%|██████████| 48/48 [00:04<00:00, 10.64it/s]


Epoch [74/100], Train Loss: 0.0043, Train Accuracy: 99.85%, Validation Accuracy: 95.27%
Model saved with validation accuracy: 95.27%


Training Epoch 75/100: 100%|██████████| 48/48 [00:04<00:00, 11.85it/s]


Epoch [75/100], Train Loss: 0.0039, Train Accuracy: 99.82%, Validation Accuracy: 94.66%


Training Epoch 76/100: 100%|██████████| 48/48 [00:04<00:00, 11.74it/s]


Epoch [76/100], Train Loss: 0.0049, Train Accuracy: 99.80%, Validation Accuracy: 94.50%


Training Epoch 77/100: 100%|██████████| 48/48 [00:04<00:00, 11.47it/s]


Epoch [77/100], Train Loss: 0.0052, Train Accuracy: 99.77%, Validation Accuracy: 93.97%


Training Epoch 78/100: 100%|██████████| 48/48 [00:04<00:00, 11.41it/s]


Epoch [78/100], Train Loss: 0.0029, Train Accuracy: 99.82%, Validation Accuracy: 94.27%


Training Epoch 79/100: 100%|██████████| 48/48 [00:04<00:00, 11.49it/s]


Epoch [79/100], Train Loss: 0.0050, Train Accuracy: 99.75%, Validation Accuracy: 92.98%


Training Epoch 80/100: 100%|██████████| 48/48 [00:04<00:00, 11.77it/s]


Epoch [80/100], Train Loss: 0.0974, Train Accuracy: 96.99%, Validation Accuracy: 88.17%


Training Epoch 81/100: 100%|██████████| 48/48 [00:03<00:00, 12.25it/s]


Epoch [81/100], Train Loss: 0.2427, Train Accuracy: 92.51%, Validation Accuracy: 83.82%


Training Epoch 82/100: 100%|██████████| 48/48 [00:04<00:00, 10.70it/s]


Epoch [82/100], Train Loss: 0.1663, Train Accuracy: 94.75%, Validation Accuracy: 90.84%


Training Epoch 83/100: 100%|██████████| 48/48 [00:04<00:00, 11.49it/s]


Epoch [83/100], Train Loss: 0.0651, Train Accuracy: 97.79%, Validation Accuracy: 91.30%


Training Epoch 84/100: 100%|██████████| 48/48 [00:04<00:00, 11.37it/s]


Epoch [84/100], Train Loss: 0.0328, Train Accuracy: 98.72%, Validation Accuracy: 91.91%


Training Epoch 85/100: 100%|██████████| 48/48 [00:04<00:00, 11.49it/s]


Epoch [85/100], Train Loss: 0.0312, Train Accuracy: 99.10%, Validation Accuracy: 91.37%


Training Epoch 86/100: 100%|██████████| 48/48 [00:04<00:00, 11.15it/s]


Epoch [86/100], Train Loss: 0.0281, Train Accuracy: 99.03%, Validation Accuracy: 92.98%


Training Epoch 87/100: 100%|██████████| 48/48 [00:04<00:00, 11.02it/s]


Epoch [87/100], Train Loss: 0.0161, Train Accuracy: 99.46%, Validation Accuracy: 91.68%


Training Epoch 88/100: 100%|██████████| 48/48 [00:04<00:00, 10.98it/s]


Epoch [88/100], Train Loss: 0.0147, Train Accuracy: 99.48%, Validation Accuracy: 93.66%


Training Epoch 89/100: 100%|██████████| 48/48 [00:04<00:00,  9.93it/s]


Epoch [89/100], Train Loss: 0.0091, Train Accuracy: 99.67%, Validation Accuracy: 94.12%


Training Epoch 90/100: 100%|██████████| 48/48 [00:04<00:00, 10.83it/s]


Epoch [90/100], Train Loss: 0.0045, Train Accuracy: 99.80%, Validation Accuracy: 93.74%


Training Epoch 91/100: 100%|██████████| 48/48 [00:04<00:00, 11.03it/s]


Epoch [91/100], Train Loss: 0.0042, Train Accuracy: 99.84%, Validation Accuracy: 94.27%


Training Epoch 92/100: 100%|██████████| 48/48 [00:04<00:00, 10.99it/s]


Epoch [92/100], Train Loss: 0.0055, Train Accuracy: 99.74%, Validation Accuracy: 93.82%


Training Epoch 93/100: 100%|██████████| 48/48 [00:04<00:00, 10.91it/s]


Epoch [93/100], Train Loss: 0.0048, Train Accuracy: 99.82%, Validation Accuracy: 94.12%


Training Epoch 94/100: 100%|██████████| 48/48 [00:04<00:00, 10.81it/s]


Epoch [94/100], Train Loss: 0.0045, Train Accuracy: 99.82%, Validation Accuracy: 93.97%


Training Epoch 95/100: 100%|██████████| 48/48 [00:04<00:00, 10.71it/s]


Epoch [95/100], Train Loss: 0.0041, Train Accuracy: 99.79%, Validation Accuracy: 93.97%


Training Epoch 96/100: 100%|██████████| 48/48 [00:04<00:00,  9.66it/s]


Epoch [96/100], Train Loss: 0.0044, Train Accuracy: 99.79%, Validation Accuracy: 94.27%


Training Epoch 97/100: 100%|██████████| 48/48 [00:04<00:00, 10.47it/s]


Epoch [97/100], Train Loss: 0.0036, Train Accuracy: 99.80%, Validation Accuracy: 94.05%


Training Epoch 98/100: 100%|██████████| 48/48 [00:04<00:00, 10.59it/s]


Epoch [98/100], Train Loss: 0.0039, Train Accuracy: 99.84%, Validation Accuracy: 94.58%


Training Epoch 99/100: 100%|██████████| 48/48 [00:04<00:00, 10.46it/s]


Epoch [99/100], Train Loss: 0.0041, Train Accuracy: 99.79%, Validation Accuracy: 93.82%


Training Epoch 100/100: 100%|██████████| 48/48 [00:04<00:00, 10.86it/s]


Epoch [100/100], Train Loss: 0.0029, Train Accuracy: 99.84%, Validation Accuracy: 93.97%
Best Validation Accuracy: 95.27%
Best model loaded.


In [6]:
def evaluate_model(model, test_loader):
    """
    Evaluates the performance of the trained model on the test dataset.

    Parameters:
        model (nn.Module): The neural network model to evaluate.
        test_loader (DataLoader): DataLoader providing test data in batches.
    """
    model.eval()  # Set the model to evaluation mode to disable dropout and batch normalization
    correct = 0   # Initialize counter for correct predictions
    total = 0     # Initialize counter for total samples processed

    # Disable gradient calculation for evaluation to save memory and computation
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.unsqueeze(1)  # Add a channel dimension to the input tensor
            outputs = model(inputs)        # Forward pass: compute model outputs
            
            # Get the predicted class indices with the highest scores
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)        # Update total number of samples processed
            correct += (predicted == labels).sum().item()  # Count correct predictions

    # Calculate and print the accuracy of the model on the test set
    print(f'Accuracy: {100 * correct / total}%')

# Evaluate the model using the test dataset
evaluate_model(model, test_loader)

Accuracy: 92.90076335877863%


### Overview

#### Data Preparation

1. **Label Encoding**:
   - Categorical labels representing different sound classes are transformed into numerical format through label encoding. This process ensures that the model can effectively interpret the class information, as many machine learning algorithms require numerical inputs.

2. **Data Splitting**:
   - The dataset is divided into training, validation, and test sets using a stratified approach. This method maintains the class distribution across the splits. Typically, 70% of the data is used for training, with the remaining 30% equally divided into validation and test sets. This ensures that the model is evaluated on unseen data, which is crucial for assessing its generalization capabilities.

#### Feature Extraction

3. **Audio Feature Extraction**:
   - **Mel-Frequency Cepstral Coefficients (MFCCs)**: MFCCs are one of the most common features used in audio classification tasks. They provide a compact representation of the audio spectrum by mimicking the human ear's perception of sound. The extraction process typically involves:
     - **Framing**: The audio signal is divided into overlapping frames, usually around 20-40 ms in length.
     - **Windowing**: Each frame is multiplied by a window function (like the Hamming window) to minimize signal discontinuities at the edges.
     - **Fast Fourier Transform (FFT)**: The windowed frames are transformed into the frequency domain using FFT, which provides information on the frequency components of the audio signal.
     - **Mel Filter Bank**: The FFT results are passed through a set of triangular filters that are spaced according to the Mel scale, emphasizing frequencies that align with human auditory perception.
     - **DCT (Discrete Cosine Transform)**: The log-magnitude of the filtered frequencies is then transformed using DCT to produce the MFCCs, which are commonly used as input features for neural networks.
   - **Spectrograms**: Another effective method for feature extraction involves converting audio signals into spectrograms, which visualize the frequency content over time. Spectrograms are generated by:
     - Applying FFT to overlapping time windows of the audio signal.
     - Mapping the resulting frequency information into a 2D representation, where one axis represents time and the other represents frequency. The intensity of colors in the spectrogram indicates the amplitude of frequencies at different time intervals.
   - **Zero-Crossing Rate and Spectral Features**: Other audio features such as zero-crossing rate (the rate at which the signal changes sign) and spectral centroid (the "center of mass" of the spectrum) can also provide valuable information for classification tasks. These features help characterize the tonal qualities of the audio signals, enhancing the model's ability to distinguish between different sound classes.

#### Model Architecture

4. **Building the Model**:
   - The **Enhanced Sound CNN** is constructed using convolutional layers specifically designed for processing audio signals. The architecture includes three convolutional layers, each capable of learning hierarchical features from the input audio data. The first layer accepts a single-channel input, while subsequent layers extract progressively complex features, helping the model learn intricate patterns in the data.

5. **Activation Functions**:
   - ReLU (Rectified Linear Unit) activation functions are employed in the hidden layers. ReLU introduces non-linearity into the model, which is crucial for learning complex relationships inherent in the audio data.

6. **Regularization**:
   - Dropout layers are integrated between dense layers to combat overfitting. By randomly deactivating a fraction of neurons during training, the model becomes less dependent on specific features, enhancing its generalization ability on unseen data.

#### Model Compilation

7. **Compilation**:
   - The model is compiled using the **Adam optimizer**, which is known for its adaptive learning rate capabilities, allowing for efficient training. The categorical crossentropy loss function is utilized for multi-class classification tasks, quantifying the discrepancy between predicted and actual class distributions.

#### Model Training

8. **Training the Model**:
   - The model undergoes training over multiple epochs, processing the training dataset and updating its weights based on computed losses. The validation dataset is used to monitor performance and make adjustments to prevent overfitting, helping the model learn to classify audio signals accurately.

#### Model Evaluation

9. **Evaluation**:
   - After training, the model's performance is assessed on the test dataset, yielding key metrics such as accuracy and loss. These metrics are vital for understanding how well the model generalizes to new, unseen data, which is critical for its application in real-world scenarios.

### Summary
This comprehensive methodology integrates data preparation, feature extraction techniques, model design, and training protocols to ensure robust sound classification. Each step is essential for building a model that effectively generalizes and classifies sound events in the UrbanSound8K dataset. By leveraging established practices in deep learning and audio processing, the model is well-equipped to tackle challenges in audio classification tasks.

---

# Method 2: Using a MLP in Tensorflow

In [7]:
# Encode the labels into a one-hot format for multi-class classification
le = LabelEncoder()  # Initialize the label encoder to convert class labels to numerical format
yy = to_categorical(le.fit_transform(labels))  # Transform labels into categorical (one-hot encoded) format

# Split the features and encoded labels into training and temporary sets (70% training, 30% for validation/testing)
X_train, X_temp, y_train, y_temp = train_test_split(features, yy, test_size=0.3)

# Further split the temporary set into testing and validation sets (50% testing, 50% validation of the temporary set)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5)

In [8]:
# Build the Multi-Layer Perceptron (MLP) model for sound classification
model = Sequential()  # Initialize the sequential model to stack layers

# Add the input layer and first hidden layer with 512 neurons
# The input shape is set to (240,) to match the feature vector size
model.add(Dense(512, input_shape=(240,), activation='relu'))  

# Apply dropout regularization to reduce overfitting by randomly setting 30% of the input units to 0 during training
model.add(Dropout(0.3))  

# Add the second hidden layer with 256 neurons
model.add(Dense(256, activation='relu'))  

# Apply dropout to the second hidden layer
model.add(Dropout(0.3))  

# Add the third hidden layer with 128 neurons
model.add(Dense(128, activation='relu'))  

# Apply dropout to the third hidden layer
model.add(Dropout(0.3))  

# Add the fourth hidden layer with 64 neurons
model.add(Dense(64, activation='relu'))  

# Apply dropout to the fourth hidden layer
model.add(Dropout(0.3))  

# Add the output layer with 10 neurons for the 10 classes in the UrbanSound8K dataset
# The softmax activation function is used to output probabilities for each class
model.add(Dense(10, activation='softmax'))  

In [9]:
# Compile the model with the specified optimizer, loss function, and metrics for evaluation
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model on the training dataset while validating on the validation set
history = model.fit(X_train, y_train, epochs=100, batch_size=128, validation_data=(X_val, y_val), verbose=1)

# Evaluate the trained model's performance on the test dataset
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=1)

# Print the test accuracy to see how well the model generalizes to new data
print(f'Test Accuracy: {test_acc}')

Epoch 1/100
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 18ms/step - accuracy: 0.1454 - loss: 96.0347 - val_accuracy: 0.3359 - val_loss: 3.5937
Epoch 2/100
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.1741 - loss: 13.7648 - val_accuracy: 0.2290 - val_loss: 2.2126
Epoch 3/100
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.1881 - loss: 7.2390 - val_accuracy: 0.2389 - val_loss: 2.3076
Epoch 4/100
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - accuracy: 0.1825 - loss: 4.2332 - val_accuracy: 0.2840 - val_loss: 2.0631
Epoch 5/100
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - accuracy: 0.2224 - loss: 3.0310 - val_accuracy: 0.2939 - val_loss: 2.0455
Epoch 6/100
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - accuracy: 0.2341 - loss: 2.5069 - val_accuracy: 0.2847 - val_loss: 2.0337
Epoch 7/100
[1m48/48[0m 

### Overview

#### Data Preparation

1. **Label Encoding and One-Hot Encoding**:
   - Categorical labels representing different sound classes are first transformed into numerical format using label encoding. This step is essential for algorithms that require numerical inputs.
   - Following this, one-hot encoding converts these integer labels into a binary format suitable for multi-class classification. This ensures that each class is represented distinctly, allowing the model to treat each class independently during training.

2. **Data Splitting**:
   - The dataset is split into training, validation, and test sets using a stratified approach. Initially, the dataset is divided so that 70% is allocated for training, while the remaining 30% is set aside for further evaluation.
   - This temporary set is subsequently split equally into validation and test sets, ensuring that the model is evaluated on unseen data at each stage.

#### Model Architecture

3. **Building the Model**:
   - The Multi-Layer Perceptron (MLP) model is constructed using a sequential architecture, where layers are stacked linearly. This structure is beneficial for capturing complex relationships in data.
   - Each dense layer in the network connects all neurons from the previous layer to the current one. The model begins with a substantial number of neurons (512 in the first hidden layer) and progressively decreases the number of neurons in subsequent layers. This approach facilitates the extraction of hierarchical features.

4. **Activation Functions**:
   - The ReLU (Rectified Linear Unit) activation function is utilized in the hidden layers, introducing non-linearity to the model. This capability is crucial for learning complex patterns inherent in the audio data.

5. **Regularization**:
   - Dropout layers are integrated between dense layers to mitigate overfitting. By randomly setting a fraction of the inputs to zero during training, the model becomes less reliant on specific neurons, enhancing its generalization capabilities.

#### Model Compilation

6. **Compilation**:
   - The model is compiled with the Adam optimizer, which is known for its efficiency in adjusting learning rates based on the training dynamics. This optimizer helps achieve faster convergence compared to traditional stochastic gradient descent methods.
   - The loss function employed is categorical crossentropy, ideal for multi-class classification tasks. It quantifies the difference between the predicted probability distribution and the actual distribution of classes, guiding the optimization process effectively.

#### Model Training

7. **Training the Model**:
   - The model undergoes training using the `fit` method, where it processes the training dataset over multiple epochs. During training, the model updates its weights based on the computed loss, gradually learning to classify sounds accurately.
   - The validation dataset is also used during training to monitor performance, allowing for adjustments to be made to avoid overfitting.

#### Model Evaluation

8. **Evaluation**:
   - After training, the model’s performance is assessed on the test dataset using the `evaluate` method. This step provides crucial metrics, such as loss and accuracy, which indicate how well the model generalizes to unseen data.
   - Finally, the test accuracy is printed, giving an immediate understanding of the model's effectiveness in classifying sound events from the UrbanSound8K dataset.

### Summary
This systematic approach combines data preparation, model architecture design, regularization techniques, and rigorous training and evaluation methods. Each phase is crucial for ensuring that the final model is robust, generalizes well, and effectively classifies sound events within the UrbanSound8K dataset.