# Querying Audio with CLAP embeddings

In this walkthrough, we take a dataset of audio files and embed them using the CLAP model (https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/clap#transformers.ClapModel).

## Imports

In [1]:
from datasets import load_dataset
from transformers import AutoProcessor, ClapModel, AutoTokenizer
import numpy as np
import torch
import vexpresso
from vexpresso.utils import ResourceRequest, DataType



## Load Data

Here we load a dataset of audio files from https://huggingface.co/datasets/ashraq/esc50

In [2]:
dataset = load_dataset("ashraq/esc50")

Found cached dataset parquet (/home/shyam/.cache/huggingface/datasets/ashraq___parquet/ashraq--esc50-1000c3b73cc1500f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 171.92it/s]


In [10]:
audios = dataset['train']['audio']

In [11]:
dictionary = dataset['train'].to_dict()

In [12]:
dictionary['audio'] = audios

## Create Collection

Let's create a collection with the audio clips that we downloaded!

In [13]:
collection = vexpresso.create(data=dictionary, backend="python")

In [15]:
collection.show(5)

filename Utf8,fold Int64,target Int64,category Utf8,esc10 Boolean,src_file Int64,take Utf8,"audio Struct[array: List[item:Float64], path: Null, sampling_rate: Int64]"
1-100032-A-0.wav,1,0,dog,True,100032,A,"{'array': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0..."
1-100038-A-14.wav,1,14,chirping_birds,False,100038,A,"{'array': [-0.0118408203125, -0.103363037109375, -0.14141..."
1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A,"{'array': [-0.0069580078125, -0.01251220703125, -0.011260..."
1-100210-B-36.wav,1,36,vacuum_cleaner,False,100210,B,"{'array': [0.538970947265625, 0.396270751953125, 0.267395..."
1-101296-A-19.wav,1,19,thunderstorm,False,101296,A,"{'array': [-0.0003662109375, -0.000701904296875, -0.00079..."


In [15]:
collection.show(5)

path Null,sampling_rate Int64,array Python
,44100,"<np.ndarray shape=(220500,) dtype=float64>"
,44100,"<np.ndarray shape=(220500,) dtype=float64>"
,44100,"<np.ndarray shape=(220500,) dtype=float64>"
,44100,"<np.ndarray shape=(220500,) dtype=float64>"
,44100,"<np.ndarray shape=(220500,) dtype=float64>"


## Multimodal CLAP Embedding function

In [18]:
class ClAPEmbeddingsFunction:
    def __init__(self):

        self.model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
        self.processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused")
        self.tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
        self.device = torch.device('cpu')

        if torch.cuda.is_available():
            self.device = torch.device('cuda')
            self.model = self.model.to(self.device)

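    def __call__(self, inp, inp_type):
        # Inference only: skip gradient tracking to reduce memory use.
        with torch.no_grad():
            if inp_type == "audio":
                # NOTE: no sampling_rate is passed here, which triggers the processor warning shown in the output.
                inputs = self.processor(audios=inp, return_tensors="pt", padding=True)
                print(inputs.keys())
                # Move every input tensor to the same device as the model.
                for k in inputs:
                    inputs[k] = inputs[k].to(self.device)
                return self.model.get_audio_features(**inputs).detach().cpu().numpy()
            if inp_type == "text":
                inputs = self.tokenizer(inp, padding=True, return_tensors="pt")
                inputs["input_ids"] = inputs["input_ids"].to(self.device)
                inputs["attention_mask"] = inputs["attention_mask"].to(self.device)
                return self.model.get_text_features(**inputs).detach().cpu().numpy()
        # Fail loudly instead of silently returning None for an unknown type.
        raise ValueError(f"Unsupported inp_type: {inp_type!r}")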

In [19]:
clap = ClAPEmbeddingsFunction()

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## Now let's embed the audio arrays!

In [18]:
collection = collection.embed("array", inp_type="audio", embedding_fn=clap, to="audio_embeddings").execute()

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
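The warning above is about the sampling rate. The clips here are at 44100 Hz, while the CLAP feature extractor is, as far as I know, configured for 48000 Hz by default (this is an assumption worth checking via `processor.feature_extractor.sampling_rate`). A minimal sketch of resampling a clip before embedding, using plain linear interpolation in NumPy, could look like the following. `resample_linear` is a hypothetical helper, not part of any library used in this notebook, and linear interpolation is a crude stand-in for a proper resampler.

```python
import numpy as np

def resample_linear(array, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (hypothetical helper)."""
    n_out = int(round(len(array) * target_sr / orig_sr))
    # Timestamps (in seconds) of the input samples and of the output samples.
    t_in = np.arange(len(array)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, array)

# Synthetic 5-second, 440 Hz tone standing in for a real clip.
clip = np.sin(2 * np.pi * 440.0 * np.arange(220500) / 44100)
resampled = resample_linear(clip, orig_sr=44100, target_sr=48000)
print(resampled.shape)
```

For real use, a dedicated resampler is preferable; for example, `datasets` can resample on load via `dataset.cast_column("audio", Audio(sampling_rate=48000))`.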


dict_keys(['input_features', 'is_longer'])


OutOfMemoryError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 3.82 GiB total capacity; 1.97 GiB already allocated; 156.44 MiB free; 2.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
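The out-of-memory error above suggests that too much audio reached the GPU at once. One possible workaround, sketched here under that assumption, is to embed the clips in small batches and concatenate the results. `embed_in_batches` and `fake_embed` are hypothetical names; `fake_embed` is a stand-in so the sketch runs without loading a model, and whether `collection.embed` already batches internally is not something this notebook shows.

```python
import numpy as np

def embed_in_batches(items, embedding_fn, batch_size=8):
    """Apply embedding_fn to consecutive slices of items and stack the results."""
    outputs = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        outputs.append(embedding_fn(batch))
    return np.concatenate(outputs, axis=0)

def fake_embed(batch):
    """Stand-in embedding function: returns one 512-d zero vector per item."""
    return np.zeros((len(batch), 512), dtype=np.float32)

clips = [np.zeros(220500) for _ in range(20)]
embeddings = embed_in_batches(clips, fake_embed, batch_size=8)
print(embeddings.shape)
```

With the embedding class defined earlier, the stand-in could be swapped for something like `lambda batch: clap(batch, "audio")`. A smaller `batch_size` lowers peak memory at the cost of more forward passes.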

In [None]:
collection.show(5)

In [13]:
clap(audio_sample, "audio").shape

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


dict_keys(['input_features', 'is_longer'])


(1, 512)
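Because audio and text are embedded into the same 512-dimensional space, a text query can in principle be matched against audio embeddings by cosine similarity. The sketch below illustrates only that ranking step with random stand-in vectors; `top_k_cosine` is a hypothetical helper, and it does not use the collection's own query API.

```python
import numpy as np

def top_k_cosine(query, embeddings, k=3):
    """Return indices and scores of the k rows most similar to query."""
    query = query / np.linalg.norm(query)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ query  # cosine similarity of each row with the query
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

rng = np.random.default_rng(0)
# Random stand-ins for ten audio embeddings of size 512.
audio_embeddings = rng.normal(size=(10, 512))
# Stand-in "text" embedding: a slightly perturbed copy of row 4.
text_embedding = audio_embeddings[4] + 0.01 * rng.normal(size=512)

indices, scores = top_k_cosine(text_embedding, audio_embeddings, k=3)
print(indices, scores)
```

In a real run, `audio_embeddings` would come from the audio branch of the embedding function and `text_embedding` from its text branch, for example `clap(["dog barking"], "text")[0]`.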