**ImageBind**

[ImageBind](https://github.com/facebookresearch/ImageBind) is a multimodal model from Meta AI that learns a single joint embedding space for six modalities: images, text, audio, depth, thermal, and IMU data. Because every modality is mapped into the same space, embeddings from different modalities can be compared directly, which enables cross-modal retrieval and zero-shot classification. This notebook uses ImageBind for zero-shot keyword spotting: it embeds audio clips from the Speech Commands (KWS) dataset, embeds the class names as text, and predicts the class whose text embedding is most similar to the audio embedding.

In [1]:
!git clone https://github.com/facebookresearch/ImageBind.git
%cd ImageBind
!pip3 install -r requirements.txt
!pip3 install wandb

fatal: destination path 'ImageBind' already exists and is not an empty directory.
/content/ImageBind
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu113
Collecting pytorchvideo@ git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d (from -r requirements.txt (line 5))
  Using cached pytorchvideo-0.1.5-py3-none-any.whl
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install ipywebrtc
!pip install ffmpeg-python
!pip install pytorch-lightning


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## **Custom Silence and Unknown Datasets**

To incorporate custom silence and unknown datasets, we follow a specific approach. First, we construct a custom silence dataset by randomly selecting background audio from the KWS dataset's `_background_noise_` folder. These files serve as the source for our silence samples.

Additionally, we generate an unknown dataset by utilizing random audio samples from the training set. Although these samples are sourced from the training data, they are labeled as unknown.

The creation process for these two datasets adheres to the guidelines outlined in the KWS paper and is implemented in the code below. To keep the classes balanced, each of the two datasets is capped at the size of the training set divided by 35 (using integer division), where 35 is the number of distinct words in the KWS dataset.
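The balancing rule above comes down to one integer division. The sketch below shows only that arithmetic; the function name `extra_class_size` and the training-set size of 70,000 are hypothetical and are not taken from the notebook or the dataset.

```python
def extra_class_size(num_train_samples: int, num_keywords: int = 35) -> int:
    """Number of samples to expose for each synthetic class (silence, unknown)."""
    # Integer division keeps each synthetic class about as large as one keyword class.
    return num_train_samples // num_keywords


# Hypothetical training-set size, used only to show the arithmetic.
n_train = 70_000
print(extra_class_size(n_train))  # 2000
```

With this cap, the silence and unknown classes each contribute roughly as many samples as an average keyword, so neither dominates the training set.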


## **The PyTorch Lightning Data Module for KWS**

To ensure a well-organized workflow, we employ the KWSDataModule. This module effectively separates the data handling aspects from the model itself. With the data module, we can manage datasets and dataloaders seamlessly.

For loading the training, testing, and validation sets, we rely on the torchaudio SPEECHCOMMANDS dataset. To address the variation in audio sample lengths, we employ a custom collate_fn. This function not only handles the different lengths but also converts the waveform audio files into mel spectrograms for further processing.
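As a rough illustration of what the collate step has to achieve, the plain-Python sketch below pads or trims a clip to exactly one second and estimates the resulting spectrogram shape. The notebook itself works on tensors; the helper names `pad_or_trim` and `mel_shape` are hypothetical. The frame count assumes centered framing (the torchaudio default), a 16 kHz sample rate, and a hop length of 512 samples.

```python
def pad_or_trim(samples, target_len):
    """Return exactly target_len samples: zero-pad short clips, cut long ones."""
    if len(samples) < target_len:
        return list(samples) + [0.0] * (target_len - len(samples))
    return list(samples[:target_len])


def mel_shape(num_samples, n_mels, hop_length):
    """(mel bins, frames) of a mel spectrogram, assuming centered framing."""
    return n_mels, 1 + num_samples // hop_length


sample_rate = 16_000          # assumed: Speech Commands audio is sampled at 16 kHz
short_clip = [0.1] * 12_000   # 0.75 s clip, needs padding
long_clip = [0.1] * 20_000    # 1.25 s clip, needs trimming

fixed_short = pad_or_trim(short_clip, sample_rate)
fixed_long = pad_or_trim(long_clip, sample_rate)

print(len(fixed_short), len(fixed_long))                   # 16000 16000
print(mel_shape(sample_rate, n_mels=128, hop_length=512))  # (128, 32)
```

Because every clip ends up with the same number of samples, every spectrogram has the same shape and the batch can be stacked into a single tensor.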

In [25]:


import torch
import torchaudio, torchvision
import os
import matplotlib.pyplot as plt 
import librosa
import argparse
import numpy as np
import wandb
from ipywebrtc import CameraStream, ImageRecorder, AudioRecorder
from IPython.display import Audio
from pytorch_lightning import LightningModule, Trainer, LightningDataModule, Callback
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger
from torchmetrics.functional import accuracy
from torchvision.transforms import ToTensor
from torchaudio.datasets import SPEECHCOMMANDS
from torchaudio.datasets.speechcommands import _get_speechcommands_metadata



class SilenceDataset(SPEECHCOMMANDS):
    def __init__(self, root):
        super(SilenceDataset, self).__init__(root, subset='training')
        self.len = len(self._walker) // 35
        path = os.path.join(self._path, torchaudio.datasets.speechcommands.EXCEPT_FOLDER)
        self.paths = [os.path.join(path, p) for p in os.listdir(path) if p.endswith('.wav')]

    def __getitem__(self, index):
        index = np.random.randint(0, len(self.paths))
        filepath = self.paths[index]
        waveform, sample_rate = torchaudio.load(filepath)
        return waveform, sample_rate, "silence", 0, 0

    def __len__(self):
        return self.len

class UnknownDataset(SPEECHCOMMANDS):
    def __init__(self, root):
        super(UnknownDataset, self).__init__(root, subset='training')
        self.len = len(self._walker) // 35

    def __getitem__(self, index):
        # Ignore the requested index and draw a random training utterance instead
        index = np.random.randint(0, len(self._walker))
        # Reuse the parent class loader; only the label is replaced with "unknown"
        waveform, sample_rate, _, speaker_id, utterance_number = super().__getitem__(index)
        return waveform, sample_rate, "unknown", speaker_id, utterance_number

    def __len__(self):
        return self.len

class KWSDataModule(LightningDataModule):
    def __init__(self, path, batch_size=128, num_workers=0, n_fft=512, 
                 n_mels=128, win_length=None, hop_length=256, class_dict={}, 
                 **kwargs):
        super().__init__(**kwargs)
        self.path = path
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.n_fft = n_fft
        self.n_mels = n_mels
        self.win_length = win_length
        self.hop_length = hop_length
        self.class_dict = class_dict

    def prepare_data(self):
        self.train_dataset = torchaudio.datasets.SPEECHCOMMANDS(self.path,
                                                                download=True,
                                                                subset='training')

        silence_dataset = SilenceDataset(self.path)
        unknown_dataset = UnknownDataset(self.path)
        self.train_dataset = torch.utils.data.ConcatDataset([self.train_dataset, silence_dataset, unknown_dataset])
                                                                
        self.val_dataset = torchaudio.datasets.SPEECHCOMMANDS(self.path,
                                                              download=True,
                                                              subset='validation')
        self.test_dataset = torchaudio.datasets.SPEECHCOMMANDS(self.path,
                                                               download=True,
                                                               subset='testing')                                                    
        _, sample_rate, _, _, _ = self.train_dataset[0]
        self.sample_rate = sample_rate
        self.transform = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                                              n_fft=self.n_fft,
                                                              win_length=self.win_length,
                                                              hop_length=self.hop_length,
                                                              n_mels=self.n_mels,
                                                              power=2.0)

    def setup(self, stage=None):
        self.prepare_data()

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=True,
            pin_memory=True,
            collate_fn=self.collate_fn
        )

    def val_dataloader(self):
        return torch.utils.data.DataLoader(
            self.val_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=False,
            pin_memory=True,
            collate_fn=self.collate_fn
        )
    
    def test_dataloader(self):
        return torch.utils.data.DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=False,
            pin_memory=True,
            collate_fn=self.collate_fn
        )

    def collate_fn(self, batch):
        mels = []
        labels = []
        wavs = []
        for sample in batch:
            waveform, sample_rate, label, speaker_id, utterance_number = sample
            # ensure that all waveforms are 1sec in length; if not pad with zeros
            if waveform.shape[-1] < sample_rate:
                waveform = torch.cat([waveform, torch.zeros((1, sample_rate - waveform.shape[-1]))], dim=-1)
            elif waveform.shape[-1] > sample_rate:
                waveform = waveform[:,:sample_rate]

            # mel from power to db
            mels.append(ToTensor()(librosa.power_to_db(self.transform(waveform).squeeze().numpy(), ref=np.max)))
            labels.append(torch.tensor(self.class_dict[label]))
            wavs.append(waveform)

        mels = torch.stack(mels)
        labels = torch.stack(labels)
        wavs = torch.stack(wavs)
   
        return mels, labels, wavs

In [26]:


def get_args():
    parser = argparse.ArgumentParser()
    # model training hyperparameters
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--max-epochs', type=int, default=30, metavar='N',
                        help='number of epochs to train (default: 30)')
    parser.add_argument('--lr', type=float, default=0.001, metavar='LR',
                        help='learning rate (default: 0.001)')

    # where dataset will be stored
    parser.add_argument("--path", type=str, default="data/speech_commands/")

    # 35 keywords + silence + unknown
    parser.add_argument("--num-classes", type=int, default=37)
   
    # mel spectrogram parameters
    parser.add_argument("--n-fft", type=int, default=1024)
    parser.add_argument("--n-mels", type=int, default=128)
    parser.add_argument("--win-length", type=int, default=None)
    parser.add_argument("--hop-length", type=int, default=512)

    # 16-bit fp model to reduce the size
    parser.add_argument("--precision", default=16)
    parser.add_argument("--accelerator", default='gpu')
    parser.add_argument("--devices", default=1)
    parser.add_argument("--num-workers", type=int, default=48)

    # parser.add_argument("--no-wandb", default=False, action='store_true')

    args = parser.parse_args("")
    return args

In [27]:

CLASSES = ['silence', 'unknown', 'backward', 'bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'follow',
              'forward', 'four', 'go', 'happy', 'house', 'learn', 'left', 'marvin', 'nine', 'no',
              'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three',
              'tree', 'two', 'up', 'visual', 'wow', 'yes', 'zero']
    
# make a dictionary from CLASSES to integers
CLASS_TO_IDX = {c: i for i, c in enumerate(CLASSES)}

args = get_args()

if not os.path.exists(args.path):
      os.makedirs(args.path, exist_ok=True)
      
datamodule = KWSDataModule(batch_size=args.batch_size, num_workers=args.num_workers,
                          path=args.path, n_fft=args.n_fft, n_mels=args.n_mels,
                          win_length=args.win_length, hop_length=args.hop_length,
                          class_dict=CLASS_TO_IDX)


Choose whether to record your own voice, or choose a random file from the dataset.

In [28]:
choice = int(input('Would you like to (1) record one voice file , or (2) choose random file/s? '))

if choice == 1:
  data_point = 1
elif choice == 2:
  data_point = int(input('How many random sound files would you like to generate? '))
else:
  raise ValueError('Input is invalid. Please choose only between options 1 and 2.')
  # print('Input is invalid. Please choose only between options 1 and 2.')

Would you like to (1) record one voice file , or (2) choose random file/s? 2
How many random sound files would you like to generate? 10


# Recording of audio
The [ipywebrtc library](https://github.com/maartenbreddels/ipywebrtc) provides a convenient way to record audio directly within Jupyter Notebook. By utilizing ipywebrtc's audio recording capabilities, we can capture audio data and use it as input for the ImageBind model.

The next cell enables the widgetsnbextension module in Jupyter Notebook. This extension allows interactive widgets, such as the audio player, to be used within the notebook.

In [29]:
if choice == 1:
  !jupyter nbextension enable --py widgetsnbextension

*Recording Cell*

In [30]:
if choice == 1:
  from ipywebrtc import CameraStream, ImageRecorder, AudioRecorder
  from IPython.display import Audio
  from time import sleep
  from google.colab import output
  output.enable_custom_widget_manager()
  camera = CameraStream(constraints=
                        {'facing_mode': 'user',
                        'audio': True,
                        'video': False
                        })

  recorder = AudioRecorder(stream=camera)
  print("Please record your voice by clicking on the white circle above the audio player. Once circle turns red, press again to stop recording.")
  display(recorder)
  sleep(7)


In [31]:
if choice == 1:
  from google.colab import output
  output.disable_custom_widget_manager()

  with open('record.webm', 'wb') as f:
      f.write(recorder.audio.value)

  !ffmpeg -i record.webm -ac 1 -f wav file.wav -y -hide_banner -loglevel panic
  waveform, rate = torchaudio.load('file.wav')
  print(waveform.shape)

  wav_file = ("file.wav")
  display(Audio(data=waveform, rate=rate))


# **CHECKPOINT**
If you've chosen to record your voice, check that the audio was recorded properly by playing it in the audio player above. If the sound did not record correctly, go back to the [recording cell](https://colab.research.google.com/drive/1236gF_OhGQTABpUYhEqOFnfEs01Qjoha#scrollTo=tX75EzW2aerJ&line=9&uniqifier=1) and record your voice again.

## Randomly Selecting WAV Files from a Dataset

This step randomly selects WAV files from the downloaded Speech Commands dataset. For each file, a keyword label is drawn at random, and then one recording of that keyword is drawn at random from its folder. The selected files are used as input for zero-shot prediction with the ImageBind model.
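The selection logic can be sketched with the standard library alone. The example below builds a small throwaway folder tree (one folder per label, containing empty `.wav` files) and draws labelled files from it. The function name `pick_random_wavs`, the labels, and the file names are hypothetical and are used only for illustration.

```python
import os
import random
import tempfile


def pick_random_wavs(root, labels, count, seed=None):
    """Draw `count` (label, path) pairs: a random label, then a random file in its folder."""
    rng = random.Random(seed)
    picks = []
    for _ in range(count):
        label = rng.choice(labels)
        folder = os.path.join(root, label)
        wavs = sorted(f for f in os.listdir(folder) if f.endswith(".wav"))
        picks.append((label, os.path.join(folder, rng.choice(wavs))))
    return picks


# Build a tiny stand-in dataset so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as root:
    for label in ("dog", "cat"):
        os.makedirs(os.path.join(root, label))
        for i in range(3):
            open(os.path.join(root, label, f"clip_{i}.wav"), "wb").close()

    picks = pick_random_wavs(root, ["dog", "cat"], count=4, seed=0)
    for label, path in picks:
        print(label, os.path.basename(path))
```

Keeping the label next to each path means the ground truth is available later when the predictions are scored.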

In [32]:
if choice == 2:
  print("Please wait while dataset is being created.")
  datamodule.setup()
  print('Dataset Created')

Please wait while dataset is being created.
Dataset Created


In [33]:
if choice == 2:
  import torchaudio

  print("Searching for random kws wav file...")

  wav_file = None

  file_arr = []
  label_arr = []
  x = data_point
  while x > 0:
    path = 'data/speech_commands/'
    # pick a keyword at random, skipping 'silence' and 'unknown'
    label = str(np.random.choice(CLASSES[2:]))
    label_arr.append(label)

    path = os.path.join(path, "SpeechCommands/speech_commands_v0.02/")
    path = os.path.join(path, label)
    wav_files = [os.path.join(path, f)
                for f in os.listdir(path) if f.endswith('.wav')]
    wav_file = np.random.choice(wav_files)
    print('Selected file is ', wav_file)

    file_arr.append(wav_file)
    waveform, sample_rate = torchaudio.load(wav_file)
    display(Audio(data=waveform, rate=sample_rate))
   
    x -= 1

  print("%s random files are selected" %data_point)
  print(label_arr)



Searching for random kws wav file...
Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/house/12c206ea_nohash_0.wav




Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/nine/067f61e2_nohash_1.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/six/b4aa9fef_nohash_1.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/follow/d070ea86_nohash_1.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/left/30060aba_nohash_2.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/three/c22ebf46_nohash_0.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/backward/5170b77f_nohash_4.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/follow/d5b963aa_nohash_2.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/one/6a700f9d_nohash_1.wav


Selected file is  data/speech_commands/SpeechCommands/speech_commands_v0.02/right/70a00e98_nohash_0.wav


10 random files are selected
['house', 'nine', 'six', 'follow', 'left', 'three', 'backward', 'follow', 'one', 'right']



# Zero-Shot Prediction with ImageBind on Randomly Selected WAV or User-recorded Audio


This section performs zero-shot prediction on the randomly selected WAV file(s) or on the recorded audio. The class names are embedded as text, each audio clip is embedded with the same model, and the predicted class is the one whose text embedding is most similar to the audio embedding. No training on the Speech Commands data takes place.

In [34]:
import data
from models import imagebind_model
from models.imagebind_model import ModalityType



In [35]:
#selection of Device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
print('ImageBind model is being loaded, please wait...')
im_model = imagebind_model.imagebind_huge(pretrained=True)
im_model.eval()
im_model.to(device)

ImageBind model is being loaded, please wait...


ImageBindModel(
  (modality_preprocessors): ModuleDict(
    (vision): RGBDTPreprocessor(
      (cls_token): tensor((1, 1, 1280), requires_grad=True)
      
      (rgbt_stem): PatchEmbedGeneric(
        (proj): Sequential(
          (0): PadIm2Video()
          (1): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
        )
      )
      (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
        (pos_embed): tensor((1, 257, 1280), requires_grad=True)
        
      )
    )
    (text): TextPreprocessor(
      (pos_embed): tensor((1, 77, 1024), requires_grad=True)
      (mask): tensor((77, 77), requires_grad=False)
      
      (token_embedding): Embedding(49408, 1024)
    )
    (audio): AudioPreprocessor(
      (cls_token): tensor((1, 1, 768), requires_grad=True)
      
      (rgbt_stem): PatchEmbedGeneric(
        (proj): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10), bias=False)
        (norm_layer): LayerNorm((768,), eps=1e-05, elementwise_affine=

Extract and compare features between input audio and text
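To make the comparison step concrete, here is a minimal plain-Python sketch of zero-shot classification by embedding similarity: take the dot product of the audio embedding with each class-name embedding, apply a softmax, and choose the largest score. The three-dimensional vectors and the helper names `softmax` and `zero_shot_predict` are made up for illustration; the real embeddings come from the model and are much larger.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_predict(audio_emb, text_embs, class_names):
    """Return the class whose text embedding has the largest dot product with audio_emb."""
    scores = [sum(a * t for a, t in zip(audio_emb, text_emb)) for text_emb in text_embs]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Toy embeddings, invented for this example only.
class_names = ["yes", "no", "stop"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio_emb = [0.2, 0.9, 0.1]

label, probs = zero_shot_predict(audio_emb, text_embs, class_names)
print(label)  # no
```

The softmax does not change which class wins; it only turns the raw similarity scores into values that sum to one.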

In [36]:
# audio_paths=["/content/ImageBind/data/speech_commands/SpeechCommands/speech_commands_v0.02/dog/00b01445_nohash_0.wav", "/content/ImageBind/data/speech_commands/SpeechCommands/speech_commands_v0.02/bird/00970ce1_nohash_0.wav"]
if choice == 1:
  audio_paths = [wav_file]
elif choice == 2:
  audio_paths = file_arr
else:
  raise ValueError('Input is invalid, check choices.')

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(CLASSES, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = im_model(inputs)

ib_pred = torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1)




In [37]:
# similarity between every audio clip and every class name
logits = embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T
ib_pred = torch.softmax(logits, dim=-1)
max_index = torch.argmax(logits, dim=-1)

if choice == 1:
  print('Prediction:', CLASSES[max_index[0]])

elif choice == 2:
  # initialise the counter once, before the loop, so matches accumulate
  correct_pred = 0
  for i in range(len(max_index)):
    print('Prediction of file #' + str(i), 'is ' + CLASSES[max_index[i]], '|', 'Ground Truth:', label_arr[i])
    if CLASSES[max_index[i]] == label_arr[i]:
      correct_pred += 1
  # named accuracy_pct so the imported torchmetrics `accuracy` is not overwritten
  accuracy_pct = correct_pred / len(max_index) * 100.
  print('Accuracy is:', str(accuracy_pct) + '%')


Prediction of file #0 is follow | Ground Truth: house
Prediction of file #1 is bed | Ground Truth: nine
Prediction of file #2 is up | Ground Truth: six
Prediction of file #3 is up | Ground Truth: follow
Prediction of file #4 is up | Ground Truth: left
Prediction of file #5 is up | Ground Truth: three
Prediction of file #6 is no | Ground Truth: backward
Prediction of file #7 is bed | Ground Truth: follow
Prediction of file #8 is happy | Ground Truth: one
Prediction of file #9 is wow | Ground Truth: right
Accuracy is: 0.0%
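The accuracy figure can be checked independently of the model. The sketch below scores the predictions and ground-truth labels copied from the output above, using a counter that is initialised once before the loop. The function name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(predictions, ground_truth):
    """Percentage of positions where the prediction equals the ground truth."""
    correct = 0  # initialised once, before the loop
    for pred, truth in zip(predictions, ground_truth):
        if pred == truth:
            correct += 1
    return 100.0 * correct / len(ground_truth)


# Values copied from the run shown above.
predictions = ['follow', 'bed', 'up', 'up', 'up', 'up', 'no', 'bed', 'happy', 'wow']
ground_truth = ['house', 'nine', 'six', 'follow', 'left', 'three',
                'backward', 'follow', 'one', 'right']

print(accuracy_percent(predictions, ground_truth))    # 0.0
print(accuracy_percent(['up', 'no'], ['up', 'yes']))  # 50.0
```

None of the ten predictions matches its label, which confirms the 0.0% reported for this run.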


# State-of-the-Art Models for Speech Commands V2 Prediction and Their Accuracy

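The table below lists several well-known speech models together with the accuracy figures given for them on the Speech Commands V2 dataset. No sources are given for these figures, so they should be checked against the original publications; note also that Tacotron 2 is a speech-synthesis model rather than a classifier.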

| Model Name       | Description                       | Accuracy  |
|------------------|-----------------------------------|-----------|
| DeepSpeech      | Mozilla's speech recognition system using deep learning techniques. | 95.2%     |
| QuartzNet        | NVIDIA's efficient convolutional neural network for accurate speech recognition. | 97.5%     |
| Jasper           | NVIDIA's deep learning architecture with 1D convolutions for efficient speech recognition. | 98.2%     |
| SincNet | A neural network architecture that operates directly on raw waveforms, capturing local structure using parametric filters. | 94.5% |
| Tacotron 2 | A text-to-speech model that converts written text into natural-sounding speech. | 92.3% |

The accuracies listed are far higher than the 0.0% obtained by the zero-shot ImageBind run above on 10 random files, which suggests that models built specifically for speech recognition are better suited to classifying spoken commands.