# Audio Embedding Models

### pyannote/embedding
https://huggingface.co/pyannote/embedding

### panns_inference 
https://www.elastic.co/blog/searching-by-music-leveraging-vector-search-audio-information-retrieval

https://github.com/qiuqiangkong/panns_inference

### Resemblyzer
https://github.com/resemble-ai/Resemblyzer

### OpenL3
https://github.com/marl/openl3?tab=readme-ov-file

https://www.e2enetworks.com/blog/audio-driven-search-leveraging-vector-databases-for-audio-information-retrieval

### Towhee
https://towhee.io/audio-embedding/vggish

https://github.com/towhee-io/examples/blob/main/audio/audio_classification/music_genre_classification.ipynb

https://towhee.io/tasks/detail/operator?field_name=Audio&task_name=Audio-Embedding

# pyannote

In [None]:
!pip install pyannote.audio

In [None]:
import os

import torch
from pyannote.audio import Inference, Model

# Read the Hugging Face access token from the environment instead of hard-coding it.
# HF_TOKEN is an assumed variable name; set it before running this cell.
model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token=os.environ["HF_TOKEN"])
inference = Inference(model, window="whole")
inference.to(torch.device("cuda"))  # optional: run on GPU

# Whole file
embedding1 = inference("BabyElephantWalk60.wav")  # 512 dimensions

## Excerpt
# from pyannote.core import Segment
# excerpt = Segment(13.37, 19.81)
# embedding = inference.crop("audio.wav", excerpt)
# `embedding` is a (1 x D) numpy array extracted from the file excerpt.

## Sliding Window
# inference = Inference(model, window="sliding",
#                       duration=3.0, step=1.0)
# embeddings = inference("audio.wav")
# `embeddings[i]` is the embedding of sliding window i, i.e. from [i * step, i * step + duration].
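
To make the window arithmetic above concrete, here is a minimal sketch that lists the `[i * step, i * step + duration]` intervals for a clip of known length. The helper `sliding_windows` is hypothetical and not part of pyannote; it also assumes that only windows fitting entirely inside the clip are kept, which may differ from how the library treats the last partial window.

```python
def sliding_windows(total_duration, duration=3.0, step=1.0):
    """Return (start, end) pairs for windows [i * step, i * step + duration].

    Hypothetical helper for illustration only; assumes windows that would
    run past the end of the clip are dropped.
    """
    windows = []
    i = 0
    while i * step + duration <= total_duration:
        start = i * step
        windows.append((start, start + duration))
        i += 1
    return windows


# Example: a 10-second clip with 3-second windows advanced by 1 second.
windows = sliding_windows(total_duration=10.0, duration=3.0, step=1.0)
print(len(windows))
print(windows[0], windows[-1])
```

With these settings consecutive windows overlap by `duration - step` seconds, so a smaller `step` gives denser but more redundant embeddings.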

In [None]:
# embedding1
print(embedding1.shape)  # the last axis is the embedding dimension

In [None]:
# from scipy.spatial.distance import cdist
# distance = cdist(embedding1, embedding2, metric="cosine")[0,0]
# `distance` is a `float` describing how dissimilar speakers 1 and 2 are.
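
The commented-out cell above compares two embeddings with cosine distance. As a rough sketch of what that metric computes, the snippet below implements it with NumPy on synthetic vectors. The function name `cosine_distance` and the random stand-in embeddings are assumptions for illustration, not pyannote output.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b); inputs are flattened to 1-D first."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for two (1 x 512) embeddings; real ones come from `inference(...)`.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1, 512))
emb_b = rng.normal(size=(1, 512))

print(cosine_distance(emb_a, emb_a))  # identical vectors: distance close to 0
print(cosine_distance(emb_a, emb_b))  # unrelated vectors: some value in [0, 2]
```

The distance ranges from 0 (same direction) to 2 (opposite direction), so smaller values mean more similar audio.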

# panns_inference

In [None]:
! pip install panns-inference

In [3]:
import librosa
import panns_inference
from panns_inference import AudioTagging, SoundEventDetection, labels

model = AudioTagging(checkpoint_path=None, device='cuda')
# The panns_inference README loads audio at 32 kHz (sr=32000); this run used 44.1 kHz.
a, _ = librosa.load("BabyElephantWalk60.wav", sr=44100)

# Add a leading batch dimension: the model expects input of shape (batch_size, num_samples).
query_audio = a[None, :]

# Run inference; the model returns (clipwise_output, embedding).
_, emb = model.inference(query_audio)  # emb has shape (1, 2048)

Checkpoint path: /home/jovyan/panns_data/Cnn14_mAP=0.431.pth
GPU number: 1


In [12]:
print(len(emb[0]))
print(emb)

2048
[[0.         0.         0.         ... 0.26346177 0.8452306  0.        ]]


# Resemblyzer

In [None]:
!pip install Resemblyzer

In [7]:
from resemblyzer import VoiceEncoder, preprocess_wav
from pathlib import Path
import numpy as np

fpath = Path("BabyElephantWalk60.wav")
wav = preprocess_wav(fpath)

encoder = VoiceEncoder()
embed = encoder.embed_utterance(wav)
np.set_printoptions(precision=3, suppress=True)
print(len(embed)) # 256 dimensions
print(embed)


Loaded the voice encoder model on cuda in 0.02 seconds.
256
[0.006 0.009 0.006 0.    0.048 0.002 0.023 0.038 0.013 0.103 0.145 0.174
 0.052 0.017 0.048 0.012 0.11  0.017 0.03  0.003 0.004 0.081 0.058 0.
 0.001 0.087 0.001 0.112 0.029 0.008 0.101 0.047 0.02  0.002 0.018 0.062
 0.071 0.042 0.031 0.006 0.    0.031 0.088 0.102 0.051 0.077 0.023 0.014
 0.066 0.08  0.    0.036 0.003 0.    0.009 0.002 0.002 0.    0.016 0.052
 0.022 0.001 0.006 0.096 0.07  0.028 0.124 0.049 0.005 0.009 0.044 0.058
 0.024 0.044 0.034 0.032 0.    0.039 0.003 0.033 0.002 0.    0.14  0.078
 0.011 0.007 0.018 0.096 0.005 0.139 0.034 0.006 0.011 0.013 0.131 0.
 0.047 0.    0.    0.004 0.006 0.04  0.06  0.072 0.034 0.036 0.014 0.074
 0.017 0.147 0.04  0.074 0.    0.08  0.004 0.    0.05  0.005 0.065 0.006
 0.084 0.07  0.116 0.089 0.069 0.068 0.127 0.139 0.071 0.094 0.047 0.
 0.028 0.1   0.056 0.057 0.031 0.03  0.13  0.03  0.001 0.008 0.019 0.039
 0.052 0.008 0.1   0.002 0.132 0.127 0.044 0.088 0.004 0.018 0.026 0.14
 

# Towhee

In [None]:
!pip install towhee

In [None]:
from towhee import pipe, ops

p = (
      pipe.input('path')
          .map('path', 'frame', ops.audio_decode.ffmpeg())
          .map('frame', 'vecs', ops.audio_embedding.vggish())
          .output('vecs')
)

In [14]:
embedd = p('BabyElephantWalk60.wav').get()[0]  # array of shape (num_frames, 128)
print(len(embedd[0]))  # 128 dimensions per frame
print(len(embedd))  # 62 frames; each VGGish frame covers about 1 second (0.96 s)
print(embedd)

128
62
[[-0.27  -0.055 -0.137 ... -0.162  0.229  0.178]
 [-0.103 -0.073  0.483 ... -0.342 -0.149 -0.216]
 [ 0.014 -0.22   0.661 ... -0.405 -0.133 -0.235]
 ...
 [-0.73  -0.093  0.773 ... -0.863 -0.113 -0.031]
 [-0.436  0.073  0.486 ... -0.612 -0.114 -0.035]
 [-0.412  0.082  0.489 ... -0.582 -0.202 -0.118]]
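
Unlike the other models in this notebook, VGGish returns one 128-dimensional vector per frame rather than a single vector per file. One common way to get a single clip-level vector is to average the frames and L2-normalize the result. The sketch below shows that on synthetic data; `pool_frames` is a hypothetical helper, and mean pooling is an assumption here, not something Towhee does for you.

```python
import numpy as np


def pool_frames(frames):
    """Average frame-level embeddings into one L2-normalized clip vector."""
    frames = np.asarray(frames, dtype=np.float32)
    clip = frames.mean(axis=0)        # (num_frames, dim) -> (dim,)
    norm = np.linalg.norm(clip)
    return clip / norm if norm > 0 else clip


# Synthetic stand-in for the (62, 128) VGGish output printed above.
frames = np.random.default_rng(0).normal(size=(62, 128))
clip_vec = pool_frames(frames)
print(clip_vec.shape)
```

Averaging discards temporal order, so for long or varied recordings it may be better to store the per-frame vectors and search over those instead.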
