# Audio Similarity Search using Vector Embeddings
This notebook demonstrates how to create vector embeddings of audio files, store them in a LanceDB vector store, and then search for similar audio files.
We will use the [panns_inference package](https://github.com/qiuqiangkong/panns_inference) to tag the audio and create embeddings, and this [HuggingFace dataset](https://huggingface.co/datasets/ashraq/esc50) for the audio files. The dataset contains 2,000 labeled sounds.

### Installing dependencies

In [1]:
!pip install panns-inference tqdm --q
!pip3 install datasets
!pip install lancedb



### Importing all the libraries

In [2]:
import lancedb

**NOTE**: If you get an error while importing lancedb, restart the runtime and run the import again.

In [3]:
from datasets import load_dataset
from panns_inference import AudioTagging
from tqdm import tqdm
from IPython.display import Audio, display
import numpy as np

On machines with CUDA installed, you can install a CUDA-enabled build of torch:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
If you don't have CUDA or a GPU, or you are on a different operating system, pick the matching torch install command here: https://pytorch.org/get-started/locally/
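
Since the right `device` value depends on which torch build ended up installed, it can help to choose it at runtime instead of hard-coding it. The following is a minimal sketch; `pick_device` is a hypothetical helper name introduced here for illustration, and it falls back to `"cpu"` when torch is missing or reports no GPU.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

The resulting string can then be passed as the `device` argument when constructing `AudioTagging`.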

### Load data

In [4]:
dataset = load_dataset("ashraq/esc50", split="train")
at = AudioTagging(checkpoint_path=None, device="cuda")  # device="cpu" for CPU inference


Checkpoint path: /root/panns_data/Cnn14_mAP=0.431.pth
GPU number: 1


In [5]:
dataset

Dataset({
    features: ['filename', 'fold', 'target', 'category', 'esc10', 'src_file', 'take', 'audio'],
    num_rows: 2000
})

### Create Embeddings
Now, to create the data embeddings! We start by splitting the data into batches of 50, keeping track of the two most important columns: `category` and `audio`.

In [7]:
batches = [batch["audio"] for batch in dataset.iter(50)]
meta_batches = [batch["category"] for batch in dataset.iter(50)]
audio_data = [np.array([audio["array"] for audio in batch]) for batch in batches]
meta_data = [np.array([meta for meta in batch]) for batch in meta_batches]
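
To make the batch arithmetic concrete: with 2,000 rows and a batch size of 50, the cell above produces 40 batches. The sketch below reproduces that split on a plain list of integers standing in for the dataset rows. `iter_batches` is a hypothetical helper written for this illustration, not part of the `datasets` library.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


rows = list(range(2000))  # stand-in for the 2,000 dataset rows
toy_batches = list(iter_batches(rows, 50))
print(len(toy_batches), "batches of", len(toy_batches[0]), "rows")
```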

Next, we iterate through these batches and use the AudioTagging embedder to extract an embedding for each audio file. We then store the embedding, the audio, the sampling rate, and the category name in a list of dictionaries. Because we do not provide an embedding function, each dictionary must contain a `vector` field before it can be added to the LanceDB table.

In [8]:
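for i in tqdm(range(len(audio_data))):
    # Embed one batch of waveforms; the first return value (tag scores) is not needed here.
    (_, embedding) = at.inference(audio_data[i])
    # Pair each audio entry with its embedding to build one record per clip.
    # Note: `data` is rebuilt on every iteration, so after the loop it holds
    # only the last batch; `insert_audio()` below writes every batch to the table.
    data = [
        {
            "audio": x[0]["array"],
            "vector": x[1],
            "sampling_rate": x[0]["sampling_rate"],
            "category": meta_data[i][j],
        }
        for j, x in enumerate(zip(batches[i], embedding))
    ]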

100%|██████████| 40/40 [00:13<00:00,  2.99it/s]
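
Before handing records like these to a vector store, it can be worth checking that every one carries a `vector` field and that all vectors share the same length. The sketch below does that on made-up toy records. `check_records` is a hypothetical helper, and the three-element vectors are far shorter than real embeddings.

```python
def check_records(records):
    """Return the shared vector length, or raise if a record is malformed."""
    if not records:
        raise ValueError("no records to check")
    dims = set()
    for i, rec in enumerate(records):
        if "vector" not in rec:
            raise KeyError(f"record {i} has no 'vector' field")
        dims.add(len(rec["vector"]))
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(dims)}")
    return dims.pop()


toy_records = [
    {"vector": [0.0, 0.7, 0.1], "sampling_rate": 44100, "category": "water_drops"},
    {"vector": [0.9, 0.0, 0.4], "sampling_rate": 44100, "category": "car_horn"},
]
dim = check_records(toy_records)
print("vector dimension:", dim)
```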


Once we have this data list, we can create a LanceDB table by first connecting to a directory and then calling `db.create_table()`. If the table already exists, we open it and add the data instead.

### Add the VectorStore

In [14]:
# Connect to the local directory where LanceDB stores its tables
db = lancedb.connect("data/audio-lancedb")
table_name = "audio-search"

if table_name not in db.table_names():
    print("Created Table")
    tbl = db.create_table(table_name, data)
else:
    print("Inserting data")
    tbl = db.open_table(table_name)
    tbl.add(data)

Created Table


We can now combine all of this into a single function:

### Composite function

In [11]:
def insert_audio():
    batches = [batch["audio"] for batch in dataset.iter(20)]
    meta_batches = [batch["category"] for batch in dataset.iter(20)]
    audio_data = [np.array([audio["array"] for audio in batch]) for batch in batches]
    meta_data = [np.array([meta for meta in batch]) for batch in meta_batches]
    print("Start")
    for i in tqdm(range(len(audio_data))):
        (_, embedding) = at.inference(audio_data[i])
        data = [
            {
                "audio": x[0]["array"],
                "vector": x[1],
                "sampling_rate": x[0]["sampling_rate"],
                "category": meta_data[i][j],
            }
            for j, x in enumerate(zip(batches[i], embedding))
        ]
        if table_name not in db.table_names():
            tbl = db.create_table(table_name, data)
        else:
            tbl = db.open_table(table_name)
            tbl.add(data)

In [12]:
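import shutil

# Delete any existing table on disk so that insert_audio() starts from scratch.
# ignore_errors=True avoids a failure when the table has not been created yet.
shutil.rmtree("data/audio-lancedb/audio-search.lance", ignore_errors=True)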

**NOTE**: If you run out of memory, restart the runtime, run all cells again, and uncomment the `insert_audio()` line below.

In [None]:
# insert_audio()
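
As a rough guide to why memory can run short, the sketch below estimates the size of the decoded audio alone. Two of its inputs are assumptions rather than facts stated in this notebook: that every clip is 5 seconds long, and that samples are held as 8-byte floats. The real footprint depends on how the audio is decoded and on how many copies the batching lists keep.

```python
num_clips = 2000        # rows in the dataset
sampling_rate = 44_100  # Hz, the sampling rate stored with each clip
clip_seconds = 5        # assumption: every clip is 5 seconds long
bytes_per_sample = 8    # assumption: samples are decoded as float64

samples_per_clip = sampling_rate * clip_seconds
total_bytes = num_clips * samples_per_clip * bytes_per_sample
print(f"{samples_per_clip} samples per clip, about {total_bytes / 1e9:.2f} GB in total")
```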

Great! We now have a fully populated table with all the necessary information. The next step would be to query the table and find those similar audio files. We can do this by first opening the table, and then getting the specific audio file we want to search for.

### Query the database

In [15]:
tbl = db.open_table(table_name)
audio = dataset[50]["audio"]["array"]
category = dataset[50]["category"]
display(Audio(audio, rate=dataset[50]["audio"]["sampling_rate"]))
print("Category:", category)

Category: water_drops


Next, we call the embedding function again to create those embeddings, which would allow us to search our table.

In [16]:
(_, embedding) = at.inference(audio[None, :])
result = tbl.search(embedding[0]).limit(5).to_pandas()  # to_df() is a deprecated alias
print(result)

                                               audio  \
0  [0.00506591796875, 0.00653076171875, 0.0051574...   
1  [-0.157318115234375, -0.122344970703125, -0.17...   
2  [-0.0162353515625, -0.015716552734375, -0.0150...   
3  [-0.0008544921875, -0.000762939453125, -0.0005...   
4  [-0.003753662109375, -0.004119873046875, -0.00...   

                                              vector  sampling_rate  \
0  [0.0, 0.70255554, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...          44100   
1  [0.0, 0.68818694, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...          44100   
2  [0.0, 0.58163136, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...          44100   
3  [0.0, 1.0475253, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...          44100   
4  [0.0, 0.45124823, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...          44100   

           category  _distance  
0       water_drops  52.260319  
1       water_drops  57.536579  
2       water_drops  75.637405  
3  drinking_sipping  76.979073  
4       water_drops  77.981728  
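
To build intuition for the `_distance` column, here is a brute-force sketch of nearest-neighbor search over a handful of made-up vectors. It assumes a squared Euclidean distance, which may not match the metric or index LanceDB actually uses. `squared_l2` and `brute_force_search` are hypothetical helpers written for this illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, rows, k=5):
    """Score every row against the query and return the k closest rows."""
    scored = [dict(row, _distance=squared_l2(query, row["vector"])) for row in rows]
    return sorted(scored, key=lambda r: r["_distance"])[:k]


# Toy rows with invented three-element vectors
toy_rows = [
    {"category": "water_drops", "vector": [0.0, 0.7, 0.1]},
    {"category": "car_horn", "vector": [0.9, 0.0, 0.4]},
    {"category": "drinking_sipping", "vector": [0.1, 0.5, 0.2]},
]
hits = brute_force_search([0.0, 0.6, 0.1], toy_rows, k=2)
for hit in hits:
    print(hit["category"], round(hit["_distance"], 4))
```

Smaller distances mean closer matches, which is why the results are listed in ascending order of `_distance`.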



In [17]:
for i in range(len(result)):
    print(str(i) + ". Category:", result["category"][i])
    display(Audio(result["audio"][i], rate=result["sampling_rate"][i]))

0. Category: water_drops


1. Category: water_drops


2. Category: water_drops


3. Category: drinking_sipping


4. Category: water_drops


Nice! It seems to be working! We can wrap these steps in a function that takes the `id` of an audio clip, from 0 to 1,999.

### Search Audio using IDs

In [18]:
def search_audio(id):
    tbl = db.open_table(table_name)
    audio = dataset[id]["audio"]["array"]
    category = dataset[id]["category"]
    display(Audio(audio, rate=dataset[id]["audio"]["sampling_rate"]))
    print("Category:", category)

    (_, embedding) = at.inference(audio[None, :])
    result = tbl.search(embedding[0]).limit(5).to_pandas()  # to_df() is a deprecated alias
    print(result)
    for i in range(len(result)):
        print(str(i) + ". Category:", result["category"][i])
        display(Audio(result["audio"][i], rate=result["sampling_rate"][i]))

In [19]:
search_audio(125)

Category: car_horn
                                               audio  \
0  [-0.022979736328125, -0.021820068359375, -0.02...   
1  [0.313934326171875, 0.312774658203125, 0.31698...   
2  [0.0655517578125, 0.011505126953125, -0.024536...   
3  [0.063690185546875, 0.065216064453125, 0.07296...   
4  [-0.006866455078125, -0.007476806640625, -0.00...   

                                              vector  sampling_rate  \
0  [0.0, 0.12407931, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...          44100   
1  [0.0, 0.5878662, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...          44100   
2  [0.0, 0.7369921, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...          44100   
3  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...          44100   
4  [0.0, 0.42053863, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...          44100   

          category   _distance  
0         airplane   85.660736  
1  washing_machine   91.059029  
2   vacuum_cleaner  110.453621  
3         clapping  111.933441  
4        footsteps  115.770401  
0. Category: airplane


1. Category: washing_machine


2. Category: vacuum_cleaner


3. Category: clapping


4. Category: footsteps
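
One simple way to put a number on how well a query worked is the fraction of returned clips whose category matches the query's category. The sketch below applies this to the two result lists printed in this notebook. `match_fraction` is a hypothetical helper, and this is an informal check rather than a full evaluation.

```python
def match_fraction(query_category, result_categories):
    """Fraction of results whose category equals the query's category."""
    if not result_categories:
        return 0.0
    matches = sum(1 for c in result_categories if c == query_category)
    return matches / len(result_categories)


water = match_fraction(
    "water_drops",
    ["water_drops", "water_drops", "water_drops", "drinking_sipping", "water_drops"],
)
horn = match_fraction(
    "car_horn",
    ["airplane", "washing_machine", "vacuum_cleaner", "clapping", "footsteps"],
)
print("water_drops query:", water)
print("car_horn query:", horn)
```

For the outputs shown above, four of the five `water_drops` results match, while none of the `car_horn` results do, so the quality of the search varies from query to query.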
