# **Audio Embeddings With Tabular Classification Model**

In this example, we will build an audio classification model using `PyTorch` and `Wav2Vec2`, a pretrained model for processing audio data. This guide will walk you through each step of the process, including setting up the environment, loading and preprocessing data, defining and training a model, and evaluating its performance.

## Tips
For best performance, ensure that the runtime is set to use a GPU (`Runtime > Change runtime type > T4 GPU`).

## Help & Questions

If you have any questions, please reach out on our [Discord](https://discord.gg/dncQwFdN9m).

You can also use our [documentation](https://docs.modlee.ai/README.html) as a reference for using our package.


## Step 1: Environment Setup

First, mount your Google Drive in the Colab environment so that the notebook can read the dataset. The packages themselves are installed in Step 3.

Once Drive is mounted, you can access its files through the `/content/drive` directory.
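
A minimal sketch of the mount step, assuming the notebook runs in Google Colab, where the `google.colab` package is available. The `mount_drive` helper name is ours; outside Colab it does nothing and returns `None`.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return the mount point or None."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        return None
    drive.mount(mount_point)  # Asks for authorization on first use
    return mount_point


mounted_at = mount_drive()
print(mounted_at)
```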

## Step 2: Downloading the Dataset from Kaggle

For this example, we will manually download an animal audio dataset from Kaggle and upload it to your Google Colab environment.

1. Visit the [Human words Audio Classification](https://www.kaggle.com/datasets/warcoder/cats-vs-dogs-vs-birds-audio-classification?resource=download) dataset on Kaggle.
2. Click the "Download" button to save the dataset to your local machine.
3. Upload the `Animals` directory to your Google Drive.
4. In your Colab notebook, click on the file icon on the left side and look for the `Animals` directory in your mounted Google Drive.

The dataset is now ready for use in the subsequent steps. Copy the path to the dataset from your Colab environment. It will look something like `/content/drive/MyDrive/Animals`, and you will use it as `data_dir` in the data processing code.
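
Before moving on, it can help to confirm that the uploaded folder has one subdirectory per class with `.wav` files inside, since the loading code in Step 7 relies on that layout. The sketch below is an illustration only: `count_wav_files` is a helper name of ours, and the demo builds a throwaway directory tree with hypothetical class and file names rather than reading the real dataset.

```python
import tempfile
from pathlib import Path


def count_wav_files(data_dir):
    """Return {class_name: number of .wav files} for each subdirectory."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix == ".wav"
            )
    return counts


# Demo on a temporary directory tree (hypothetical class and file names).
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_files in [("bird", 2), ("cat", 1), ("dog", 3)]:
        class_path = Path(tmp) / class_name
        class_path.mkdir()
        for i in range(n_files):
            (class_path / f"clip_{i}.wav").touch()
    demo_counts = count_wav_files(tmp)

print(demo_counts)
```

On the real data you would call `count_wav_files` with the path copied from Colab. A class with zero files usually means the upload was incomplete.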

## Step 3: Importing Required Libraries
Now, install and import the libraries needed for the rest of the project.

In [None]:
!pip3 install modlee torch torchvision torchaudio pytorch-lightning torchtext==0.18.0 soundfile

In [12]:

import os

import torch
import torchaudio
import modlee
import lightning.pytorch as pl
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, TensorDataset
from transformers import Wav2Vec2Model

# Select the audio backend (deprecated in recent torchaudio releases, which emit a warning here)
torchaudio.set_audio_backend("sox_io")



Now we will set our Modlee API key and initialize the Modlee package.
Make sure that you have a Modlee account and an API key [from the dashboard](https://www.dashboard.modlee.ai/).
Replace `replace-with-your-api-key` with your API key.

In [13]:
# Set the API key to an environment variable,
# to simulate setting this in your shell profile
os.environ['MODLEE_API_KEY'] = "replace-with-your-api-key"
modlee.init(api_key=os.environ['MODLEE_API_KEY'])
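
If you prefer not to keep the key in the notebook, one option is to set it in your shell profile and read it at startup, failing early when it is missing. This is a sketch under our own assumptions: `read_api_key` and the `DEMO_MODLEE_API_KEY` variable are hypothetical names used for illustration, and only the `replace-with-your-api-key` placeholder comes from this guide.

```python
import os

PLACEHOLDER = "replace-with-your-api-key"


def read_api_key(env_var="MODLEE_API_KEY"):
    """Return the key stored in env_var, or raise if it is unset or a placeholder."""
    key = os.environ.get(env_var, "")
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set {env_var} to your API key before continuing.")
    return key


# Demo with a hypothetical variable so that a real key is never touched.
os.environ["DEMO_MODLEE_API_KEY"] = "demo-key"
demo_key = read_api_key("DEMO_MODLEE_API_KEY")
print(demo_key)
```

The returned value can then be passed to `modlee.init(api_key=...)` as in the cell above.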


## Step 4: Loading the Pretrained Wav2Vec2 Model
This snippet loads the Wav2Vec2 model. Wav2Vec2 is a model designed for speech processing. We'll use it to convert audio into embeddings.

In [14]:
# Set device to GPU if available, otherwise use CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the pre-trained Wav2Vec2 model and move it to the specified device.
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").to(device)


Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 5: Extracting Wav2Vec2 Embeddings
This function converts raw audio waveforms into embeddings using the Wav2Vec2 model. The embeddings are used as features for our classifier.

In [15]:
def get_wav2vec_embeddings(waveforms):
    with torch.no_grad():  # Turn off gradients to save memory during inference
        # Ensure the waveforms are a float tensor (no copy if they already are) and move them to the chosen device
        inputs = torch.as_tensor(waveforms, dtype=torch.float32).to(device)
        # Get embeddings from the Wav2Vec2 model
        embeddings = wav2vec(inputs).last_hidden_state.mean(dim=1)
    return embeddings
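
To see what `.mean(dim=1)` averages over: the convolutional feature encoder in Wav2Vec2 downsamples the waveform before the transformer layers, so `last_hidden_state` holds one 768-dimensional vector per frame, not per audio sample. The sketch below computes the number of frames from the kernel sizes and strides in the published `facebook/wav2vec2-base` configuration. Those values are written out by hand here as an assumption, not read from the loaded model, and `num_frames` is a helper name of ours.

```python
# Kernel sizes and strides of the seven convolutional layers in the
# wav2vec2-base feature encoder (assumed from the published configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of time frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution
        length = (length - kernel) // stride + 1
    return length


print(num_frames(16000))  # 49
```

With one-second clips of 16,000 samples, each embedding returned by `get_wav2vec_embeddings` is therefore the mean of 49 frame vectors.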


## Step 6: Creating a Custom Dataset Class
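The `AudioDataset` class handles loading and preprocessing of audio files. This includes padding or truncating audio samples to a fixed length. Wav2Vec2 expects audio sampled at 16 kHz, so the default `target_length=16000` corresponds to one second of audio. The class does not resample, so clips recorded at a different rate should be resampled beforehand.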

In [16]:
class AudioDataset(torch.utils.data.Dataset):  # Map-style dataset; files are loaded lazily in __getitem__
    def __init__(self, audio_paths, labels, target_length=16000):
        self.audio_paths = audio_paths  # List of paths to audio files
        self.labels = labels  # List of labels corresponding to audio files
        self.target_length = target_length  # Desired length for audio clips

    def __len__(self):
        return len(self.audio_paths)  # Number of items in the dataset

    def __getitem__(self, idx):
        audio_path = self.audio_paths[idx]  # Get the path of the audio file
        label = self.labels[idx]  # Get the label for the audio file
        waveform, sample_rate = torchaudio.load(audio_path, normalize=True)  # Load and normalize the audio
        waveform = waveform.mean(dim=0)  # Convert to mono by averaging channels

        # Pad or truncate the waveform to the target length
        if waveform.size(0) < self.target_length:
            waveform = torch.cat([waveform, torch.zeros(self.target_length - waveform.size(0))])
        else:
            waveform = waveform[:self.target_length]

        return waveform, label  # Return the processed waveform and its label


## Step 7: Loading and Preprocessing the Dataset
This function loads audio files and their corresponding labels from a directory structure.

In [17]:
def load_dataset(data_dir):
    audio_paths = []  # List to store paths to audio files
    labels = []  # List to store labels corresponding to each audio file

    # Loop through each subdirectory in the data directory
    for label_dir in os.listdir(data_dir):
        label_dir_path = os.path.join(data_dir, label_dir)
        if os.path.isdir(label_dir_path):  # Check if it's a directory
            # Loop through each file in the directory
            for file_name in os.listdir(label_dir_path):
                if file_name.endswith('.wav'):  # Check if the file is a .wav file
                    audio_paths.append(os.path.join(label_dir_path, file_name))  # Add file path to list
                    labels.append(label_dir)  # Add label (directory name) to list

    return audio_paths, labels  # Return lists of file paths and labels


## Step 8: Defining the Classifier Model
We define a simple Multi-Layer Perceptron (MLP) model for classification. This model takes the embeddings from `Wav2Vec2` as input.

In [18]:
class MLP(modlee.model.TabularClassificationModleeModel):
    def __init__(self, input_size, num_classes):
        super().__init__()
        # Define the model using nn.Sequential for simplicity
        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_size, 256),  # First fully connected layer
            torch.nn.ReLU(),                   # ReLU activation
            torch.nn.Linear(256, 128),          # Second fully connected layer
            torch.nn.ReLU(),                   # ReLU activation
            torch.nn.Linear(128, num_classes)   # Output layer
        )
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def forward(self, x):
        # Forward pass through the model
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y_target = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y_target) # Calculate the loss
        return {"loss": loss}

    def validation_step(self, val_batch, batch_idx):
        x, y_target = val_batch
        y_pred = self(x)
        val_loss = self.loss_fn(y_pred, y_target)  # Calculate validation loss
        return {'val_loss': val_loss}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.001, momentum=0.9)  # Define the optimizer
        return optimizer
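
As a quick check on the size of this classifier, the number of trainable parameters can be worked out from the layer widths alone: each `Linear` layer has `in_features * out_features` weights plus `out_features` biases. The sketch assumes three classes (cats, dogs, birds) and the 768-dimensional embeddings used later in this guide; the helper names are ours.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_param_count(input_size, num_classes, hidden=(256, 128)):
    """Total parameters of an MLP with the given hidden layer widths."""
    sizes = (input_size, *hidden, num_classes)
    return sum(linear_params(a, b) for a, b in zip(sizes, sizes[1:]))


print(mlp_param_count(768, 3))  # 230147
```

At roughly 230 thousand parameters, the MLP is tiny compared with the Wav2Vec2 encoder, which is why precomputing the embeddings once and training only the classifier is cheap.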

## Step 9: Precomputing Audio Embeddings Using `Wav2Vec2`
`Wav2Vec2` transforms raw audio data into numerical embeddings that a model can interpret. We preprocess the audio by normalizing and padding it to a fixed length. Then, `Wav2Vec2` generates embeddings for each audio clip.

In [19]:
def precompute_embeddings(dataloader):
    embeddings_list = []
    labels_list = []
    for inputs, labels in dataloader:
        inputs = inputs.to(device)
        embeddings = get_wav2vec_embeddings(inputs)  # Precompute embeddings
        embeddings_list.append(embeddings.cpu())
        labels_list.append(labels)
    embeddings_list = torch.cat(embeddings_list, dim=0)  # Stack all embeddings
    labels_list = torch.cat(labels_list, dim=0)  # Stack all labels
    return embeddings_list, labels_list

## Step 10: Training and Evaluating the Model
We create functions to train and evaluate our model. Training involves adjusting the model parameters to minimize the loss, while evaluation measures the model's performance on a validation set.

In [20]:
def train_model(model, dataloader, num_epochs=1):
    # Define the loss function and optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    model.train()  # Set the model to training mode

    for epoch in range(num_epochs):
        running_loss = 0.0
        for embeddings, labels in dataloader:
            embeddings = embeddings.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()  # Clear previous gradients
            outputs = model(embeddings)  # Get model predictions
            loss = criterion(outputs, labels)  # Compute the loss
            loss.backward()  # Backpropagate the loss
            optimizer.step()  # Update model weights

            running_loss += loss.item()  # Accumulate loss

        # Print average loss for the epoch
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss / len(dataloader):.4f}')

def evaluate_model(model, dataloader):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient calculation
        for embeddings, labels in dataloader:
            embeddings = embeddings.to(device)
            labels = labels.to(device)

            outputs = model(embeddings)  # Get model predictions
            _, predicted = torch.max(outputs, 1)  # Get predicted class labels
            total += labels.size(0)  # Update total count
            correct += (predicted == labels).sum().item()  # Count correct predictions

    accuracy = correct / total  # Compute accuracy
    print(f'Accuracy: {accuracy * 100:.2f}%')  # Print accuracy percentage
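
Overall accuracy can hide a class that the model rarely gets right. A per-class breakdown is a small extension of the same counting logic. The sketch below works on plain Python lists of integer labels, for example collected from `labels.tolist()` and `predicted.tolist()` inside the evaluation loop; the helper name and the demo labels are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples with that true label predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, prediction in zip(y_true, y_pred):
        total[target] += 1
        if target == prediction:
            correct[target] += 1
    return {label: correct[label] / total[label] for label in sorted(total)}


# Hypothetical labels for illustration only.
demo_true = [0, 0, 1, 1, 2, 2, 2, 2]
demo_pred = [0, 1, 1, 1, 2, 2, 0, 2]
demo_accuracy = per_class_accuracy(demo_true, demo_pred)
print(demo_accuracy)  # {0: 0.5, 1: 1.0, 2: 0.75}
```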


## Step 11: Running the Main Script
Finally, we load the dataset, preprocess it, and train the model.

Add your path to the dataset in `data_dir`.

In [21]:
if __name__ == "__main__":
    # Path to dataset
    data_dir = '/content/drive/MyDrive/Animals'  # Replace with your path; the folder holds one subdirectory per class

    # Load dataset
    audio_paths, labels = load_dataset(data_dir)

    # Encode labels
    label_encoder = LabelEncoder()
    labels = label_encoder.fit_transform(labels)

    # Split dataset into training and validation sets
    train_paths, val_paths, train_labels, val_labels = train_test_split(audio_paths, labels, test_size=0.2, random_state=42)

    # Create datasets and dataloaders
    target_length = 16000  # Define the length for padding/truncation
    train_dataset = AudioDataset(train_paths, train_labels, target_length=target_length)
    val_dataset = AudioDataset(val_paths, val_labels, target_length=target_length)
    train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=4, shuffle=False)

    # Precompute embeddings
    print("Precomputing embeddings for training and validation data...")
    train_embeddings, train_labels = precompute_embeddings(train_dataloader)
    val_embeddings, val_labels = precompute_embeddings(val_dataloader)

    # Create TensorDataset for precomputed embeddings and labels
    train_embedding_dataset = TensorDataset(train_embeddings, train_labels)
    val_embedding_dataset = TensorDataset(val_embeddings, val_labels)

    # Create DataLoaders for the precomputed embeddings
    train_embedding_loader = DataLoader(train_embedding_dataset, batch_size=4, shuffle=True)
    val_embedding_loader = DataLoader(val_embedding_dataset, batch_size=4, shuffle=False)

    # Define number of classes
    num_classes = len(label_encoder.classes_)
    mlp_audio = MLP(input_size=768, num_classes=num_classes).to(device)

    # Train and evaluate the model
    train_model(mlp_audio, train_embedding_loader)
    evaluate_model(mlp_audio, val_embedding_loader)


Precomputing embeddings for training and validation data...
Epoch [1/1], Loss: 0.6355
Accuracy: 82.79%
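
The label encoding step in the script maps each class name to an integer id, with ids assigned in sorted order of the names. A plain-Python sketch of the same mapping, which also gives the reverse lookup for turning predicted ids back into names; the function name and the demo labels are hypothetical.

```python
def build_label_maps(labels):
    """Map sorted unique class names to integer ids, and back."""
    classes = sorted(set(labels))
    to_id = {name: index for index, name in enumerate(classes)}
    to_name = {index: name for name, index in to_id.items()}
    return to_id, to_name


# Hypothetical directory names for illustration only.
demo_labels = ["dog", "cat", "bird", "cat"]
to_id, to_name = build_label_maps(demo_labels)
encoded = [to_id[name] for name in demo_labels]
print(encoded)      # [2, 1, 0, 1]
print(to_name[0])   # bird
```

In the script itself, `label_encoder.inverse_transform` performs the reverse lookup on arrays of predicted ids.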


# Amazing Work!

We've successfully completed a machine learning project focused on audio classification using the `Wav2Vec2` model and a custom Multi-Layer Perceptron (MLP). Here's a quick recap of what we accomplished:

- Loaded and prepared an audio dataset.
- Extracted audio features using the `Wav2Vec2` model.
- Built and trained a custom MLP for classification.
- Evaluated the model's performance.

This project has provided you with a solid understanding of how to leverage pre-trained models for feature extraction and integrate them with custom architectures for classification tasks. With this knowledge, you are well-equipped to experiment with other types of data, fine-tune model architectures, and continue advancing your machine learning skills. Keep exploring and building!
