# AMPLAB Module 4 Machine Listening - Embeddings Extractor

This notebook contains code to extract audio embeddings that you can then use in your other machine listening tasks. It does not extract embeddings for the BSD10k audio files; you'll have to provide your own audio files. If you want to re-analyze BSD10k, you can download it by following the details in the [BSD10k repository](https://github.com/allholy/BSD10k). You should be able to run this notebook locally or in Google Colab without problems.

To run this notebook locally, create a Python virtual environment and install the requirements (`pip install -r requirements.txt`). You'll also need to download the file `amplab_machine_listening_module_data.zip`, which [you'll find in this shared folder](https://drive.google.com/drive/folders/1FHEmzEXgBV1CCAWo_F3KDpw9QM5ecuZf?usp=sharing), and place it uncompressed next to this notebook (the uncompressed folder should be named `amplab_machine_listening_module_data`).

If running in Google Colab, make a copy of this notebook somewhere in your Google Drive and add a shortcut to the `amplab_data` shared folder next to your notebook (the shortcut must have the same name as the folder, `amplab_data`). Then run the cells normally. Before running the first cell, update the `%cd ...` path so that the working directory is the folder in your Google Drive where the notebook (and the shortcut) are placed. In Colab, the first cell will take a few minutes to run because it needs to copy and unzip some data.

This work is similar to that of a paper we published at DCASE 2024:
[Anastasopoulou, Panagiota, et al. "Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset." DCASE Workshop (2024)](https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Anastasopoulou_39.pdf).


In [7]:
try:
  from google.colab import drive
  # If this does not fail, it means we're running in a Colab environment

  # First mount google drive
  drive.mount('/content/drive')

  # Set the working directory to the directory where this notebook has been placed.
  # This directory should have a Google Drive shortcut to the "amplab_data" shared folder.
  # Edit the below to point to the Google Drive directory where this notebook is located.
  %cd '/content/drive/MyDrive/SMC/AMPLab2425/AMPLAB 2025 Module 4 - Machine Listening'

  # Now copy data files to the colab runtime local storage and uncompress the .zip file.
  # By placing data files in the notebook runtime local storage, we will make data loading much faster in the cells below.
  !cp "amplab_data/amplab_machine_listening_module_data.zip" /content/amplab_machine_listening_module_data.zip
  !unzip  -u /content/amplab_machine_listening_module_data.zip -d /content/
  DATA_FOLDER = '/content/amplab_machine_listening_module_data'

  # Install dependencies (if not running in Colab, this will need to be installed manually)
  !pip install laion_clap essentia-tensorflow

except ImportError:
  # Not running in Colab
  DATA_FOLDER = './amplab_machine_listening_module_data'

import os
import json
import numpy as np
import laion_clap
import librosa
import essentia.standard as estd
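
Before running the extractors, it can help to confirm that the data folder is where the notebook expects it. The sketch below assumes the entries that the cells in this notebook read from the data folder; `find_missing_data_files` and `EXPECTED_ENTRIES` are hypothetical names, not part of any library.

```python
import os

# Files and folders that the cells in this notebook read from the data folder.
EXPECTED_ENTRIES = (
    'gaia_pca_dataset_history.json',
    'fsd-sinet-vgg42-tlpf_aps-1.pb',
    'test_sounds',
)

def find_missing_data_files(data_folder, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist inside data_folder."""
    return [name for name in expected
            if not os.path.exists(os.path.join(data_folder, name))]
```

Calling `find_missing_data_files(DATA_FOLDER)` should return an empty list when everything is in place; a non-empty list usually means the zip file was not uncompressed next to the notebook or the working directory is wrong.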

# MFCC "EMBEDDINGS"

In [8]:
mfcc_algo = estd.MFCC()
w_algo = estd.Windowing(type='blackmanharris62')
spectrum_algo = estd.Spectrum()

def get_mfcc_embeddings(audio_path):
  loader = estd.MonoLoader(filename=audio_path, sampleRate=48000)
  audio = loader()
  mfcc_frames = []
  for frame in estd.FrameGenerator(audio, frameSize=2048, hopSize=1024):
    spec = spectrum_algo(w_algo(frame))
    _, mfcc_coeffs = mfcc_algo(spec)
    mfcc_frames.append(mfcc_coeffs)
  mfcc_frames = np.array(mfcc_frames)
  mfcc_average = np.mean(mfcc_frames, axis=0)
  return mfcc_average

# FREESOUND SIMILARITY EMBEDDINGS

In [9]:
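# Load the transformation history of the legacy Gaia dataset (normalization coefficients and PCA matrix)
with open(os.path.join(DATA_FOLDER, 'gaia_pca_dataset_history.json')) as f:
    gaia_pca_dataset_history = json.load(f)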
normalization_coefficients = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "normalize"][0]["Applier parameters"]["coeffs"]
normalization_additional_info = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "normalize"][0]["Additional info"]
dimensions_per_descriptor = {descriptor_name: len(descriptor_stats['mean']) for descriptor_name, descriptor_stats in normalization_additional_info.items()}
pca_descriptor_names = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "pca"][0]["Applier parameters"]["descriptorNames"]
pca_matrix_raw = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "pca"][0]["Applier parameters"]["matrix"]
pca_matrix_raw = pca_matrix_raw[2:]
pca_matrix = []
for i in range(0, len(pca_matrix_raw), len(pca_matrix_raw)//100):
    pca_matrix.append(pca_matrix_raw[i:i+len(pca_matrix_raw)//100])
pca_matrix = np.matrix(pca_matrix).transpose()

def project_sound_to_legacy_similarity_space(features):
    # Normalize
    normed_descriptors = {}
    for descriptor_name in pca_descriptor_names:
        value = features.get(descriptor_name[1:]
                             .replace('spectral_contrast.', 'spectral_contrast_coeffs.')
                             .replace('scvalleys.', 'spectral_contrast_valleys.')
                             .replace('erb_bands.', 'erbbands.')
                             .replace('frequency_bands.', 'barkbands.'))  # descriptor names have '.' at the beginning, and some have changed
        if type(value) == np.ndarray:
            value = list(value)  # Make sure this is not ndarray at this point
        if value is not None and 'frequency_bands.' in descriptor_name:
            value += [value[-2]]  # frequency_bands descriptor (which is "same" as barkbands?) has one more dimension in the legacy extractor (and that missing dimension seems to be usually similar to the penultimate). Skipped if the descriptor is missing

        if type(value) == list:
            value_dimensionality = len(value)
        else:
            value_dimensionality = 1
        have_same_dimension = value_dimensionality == dimensions_per_descriptor[descriptor_name]
        if value is not None and have_same_dimension:
            coeffs = normalization_coefficients[descriptor_name]
            if type(value) != list:
                norm_value = value * coeffs['a'][0] + coeffs['b'][0]
            else:
                norm_value = [v * coeffs['a'][i] + coeffs['b'][i] for i, v in enumerate(value)]
            normed_descriptors[descriptor_name] = norm_value
        else:
            # If a descriptor is missing, we set it to 0
            # This might (will) happen if some sounds don't have values for all descriptors
            #print('Unaligned descriptor', descriptor_name)
            if dimensions_per_descriptor[descriptor_name] > 1:
                normed_descriptors[descriptor_name] = [0.0 for i in range(0, dimensions_per_descriptor[descriptor_name])]
            else:
                normed_descriptors[descriptor_name] = 0.0

    # Project to pca space
    # First concatenate all values into one flat list
    vector = []
    for descriptor_name in pca_descriptor_names:
        normed_value = normed_descriptors[descriptor_name]
        if type(normed_value) == list:
            vector += normed_value
        else:
            vector.append(normed_value)
    # Then multiply by pca matrix
    pca_vector = list(np.squeeze(np.asarray(np.matmul(np.matrix(vector), pca_matrix))))
    return pca_vector

def get_freesound_similarity_embeddings(audio_path):
  fs_pool, _ = estd.FreesoundExtractor()(audio_path)
  features = dict()
  for descriptor in fs_pool.descriptorNames():
    features[descriptor] = fs_pool[descriptor]
  sim_vector = project_sound_to_legacy_similarity_space(features)
  return np.array(sim_vector)
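
The projection above boils down to two linear steps: each descriptor dimension is rescaled as `value * a + b`, and the concatenated vector is then multiplied by the PCA matrix. A minimal sketch with made-up numbers follows; the coefficients and the 3x2 matrix are illustrative assumptions, not values from `gaia_pca_dataset_history.json`, and `normalize_and_project` is a hypothetical helper.

```python
import numpy as np

def normalize_and_project(values, a, b, pca_matrix):
    """Apply the per-dimension affine normalization, then project with the PCA matrix."""
    normed = np.asarray(values) * np.asarray(a) + np.asarray(b)
    return normed @ np.asarray(pca_matrix)

# Toy example: 3 input dimensions projected down to 2.
toy_values = [2.0, 4.0, 6.0]
toy_a = [0.5, 0.25, 0.5]   # hypothetical scale coefficients
toy_b = [0.0, 0.0, -1.0]   # hypothetical offsets
toy_pca = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]     # hypothetical (input_dims x output_dims) matrix

projected = normalize_and_project(toy_values, toy_a, toy_b, toy_pca)
print(projected)  # [3. 3.]
```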

# FSD-SINET EMBEDDINGS

In [10]:
model_embeddings = estd.TensorflowPredictFSDSINet(graphFilename=os.path.join(DATA_FOLDER, "fsd-sinet-vgg42-tlpf_aps-1.pb"), output="model/global_max_pooling1d/Max")

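def add_silence(audio, sr, silence_duration=0.5):
    # Prepend silence to the audio. Keep the dtype of the audio (float32 from MonoLoader)
    # so the result is not upcast to float64; Essentia algorithms expect float32 input.
    silence = np.zeros(int(silence_duration * sr), dtype=audio.dtype)
    padded_audio = np.concatenate((silence, audio))
    return padded_audio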

def get_fsdsinet_embeddings(audio_path):
  loader = estd.MonoLoader(filename=audio_path, sampleRate=44100)
  audio = loader()
  if len(audio)/44100 < 0.5:
    audio = add_silence(audio, 44100)
  embeddings = model_embeddings(audio).mean(axis=0)  # Take mean of frame embeddings
  return embeddings


# CLAP EMBEDDINGS

In [None]:
model = laion_clap.CLAP_Module(enable_fusion=True)
model.load_ckpt(model_id=3) # download the default pretrained checkpoint, this might take some time...

def get_clap_embeddings_from_audio(audio_path):
    audio, _ = librosa.load(audio_path, sr=48000)
    np.random.seed(0)  # Make CLAP's random slice selection for >10s sounds deterministic so we get consistent results when re-run
    audio_embed = model.get_audio_embedding_from_data(x=[audio], use_tensor=False)
    audio_embed = audio_embed[0, :]
    return audio_embed

def get_clap_embeddings_from_text(text):
    text_embed = model.get_text_embedding([text])
    text_embed = text_embed[0, :]
    return text_embed

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# RUN EMBEDDING EXTRACTORS

In [6]:
test_sound_path = os.path.join(DATA_FOLDER, 'test_sounds', '93100__cgeffex__whip-crack-01.wav')  # This should be a valid path to an audio file

print('\nMFCC')
mfcc_embed = get_mfcc_embeddings(test_sound_path)
print(mfcc_embed.shape)
print(mfcc_embed)

print('\nFreesound similarity')
fssim_embed = get_freesound_similarity_embeddings(test_sound_path)
print(fssim_embed.shape)
print(fssim_embed[0:20])

print('\nFSD-SINET')
fsdsinet_embed = get_fsdsinet_embeddings(test_sound_path)
print(fsdsinet_embed.shape)
print(fsdsinet_embed[0:20])

print('\nCLAP')
audio_embed = get_clap_embeddings_from_audio(test_sound_path)
print(audio_embed.shape)
print(audio_embed[0:20])

text_embed = get_clap_embeddings_from_text("The sound of a baby crying")
print(text_embed.shape)
print(text_embed[0:20])

# NOTE: if you want to do nearest-neighbor (NN) search in the CLAP space now, you could use NearestNeighbors from scikit-learn, similarly to what we did in the audio mosaicing notebooks


MFCC


RuntimeError: Error while configuring MonoLoader: AudioLoader: Could not open file "amplab_machine_listening_module_data/test_sounds/93100__cgeffex__whip-crack-01.wav", error = No such file or directory
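
To build your own embedding collection, one option is to loop over a folder of audio files and store one `.npy` file per sound, which is the layout that `load_embeddings_for_dataset` reads. This is a sketch and not part of the original notebook: `extract_embeddings_for_folder`, its parameters, and the list of extensions are hypothetical, and `extractor` can be any of the `get_*_embeddings` functions defined above.

```python
import os
import numpy as np

AUDIO_EXTENSIONS = ('.wav', '.mp3', '.ogg', '.flac')  # assumed set of audio extensions

def extract_embeddings_for_folder(audio_folder, output_folder, extractor):
    """Run `extractor` on every audio file in audio_folder and save one .npy per file.

    Returns the list of filenames for which extraction failed.
    """
    os.makedirs(output_folder, exist_ok=True)
    failed = []
    for filename in sorted(os.listdir(audio_folder)):
        name, extension = os.path.splitext(filename)
        if extension.lower() not in AUDIO_EXTENSIONS:
            continue  # skip anything that does not look like an audio file
        try:
            embedding = extractor(os.path.join(audio_folder, filename))
        except Exception as error:  # keep going if one file cannot be analyzed
            print(f'Could not analyze {filename}: {error}')
            failed.append(filename)
            continue
        np.save(os.path.join(output_folder, f'{name}.npy'), np.asarray(embedding))
    return failed
```

For example, `extract_embeddings_for_folder('my_sounds', 'my_embeddings/mfcc', get_mfcc_embeddings)`, where both folder names are placeholders. Note that `load_embeddings_for_dataset` looks for files named `<sound_id>.npy`, so the audio files would need to be named after their sound IDs for the two to work together.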

# LANGUAGE-BASED AUDIO RETRIEVAL EXAMPLE USING CLAP EMBEDDING SPACE

In [None]:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import sys
from IPython.display import display, IFrame

def load_embeddings_for_dataset(df, embeddings_folder):
  # Returns a numpy array of shape (n, d) where "n" is the number of sounds in the dataset and "d" is the number of dimensions of the embeddings
  # Available embedding types: "clap", "fs_similarity", "fsdsinet", "mfcc", "fsdsinet_frames", "mfcc_frames"
  # NOTE: if you are loading embeddings which have been stored frame by frame (i.e. those ending with "_frames"), you'll need to add some code
  # here to summarize them into a one-dimensional vectors before adding them to the returned numpy array.

  base_dir = os.path.join(DATA_FOLDER, 'embeddings', embeddings_folder)
  filenames = [os.path.join(base_dir, f'{sound_id}.npy') for sound_id in df["sound_id"]]
  example_embedding_vector = np.load(filenames[0])
  num_dimensions = len(example_embedding_vector)

  print(f'Will load {len(filenames)} points of data with {num_dimensions} dimensions each')
  X = np.zeros((len(filenames), num_dimensions))
  for i, fn in enumerate(filenames):
    if (i + 1) % 100 == 0:
      sys.stdout.write(f'\r{i + 1}/{len(filenames)}                  ')
      sys.stdout.flush()
    X[i, :] = np.load(fn)
  sys.stdout.write(f'\rLoaded {len(filenames)} embeddings from "{embeddings_folder}"!')
  print()
  return X

def show_sound_player(sound_id):
  display(IFrame(f'https://freesound.org/embed/sound/iframe/{sound_id}/simple/medium/', width=696, height=100))

dataset_df = pd.read_csv(os.path.join(DATA_FOLDER, 'BSD10k_metadata.csv'))
X = load_embeddings_for_dataset(dataset_df, embeddings_folder="clap")

nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
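
For the frame-level embeddings mentioned in the comments of `load_embeddings_for_dataset` (those ending in `_frames`), each file holds one vector per frame, so it has to be collapsed into a single vector before it can be stored in `X`. A minimal sketch, assuming each file stores an array of shape `(num_frames, num_dimensions)`; `summarize_frames` is a hypothetical helper, and mean or mean-plus-standard-deviation pooling are just two common choices.

```python
import numpy as np

def summarize_frames(frame_embeddings, include_std=False):
    """Collapse a (num_frames, num_dimensions) array into a single 1-D vector."""
    frames = np.atleast_2d(np.asarray(frame_embeddings))
    mean = frames.mean(axis=0)
    if not include_std:
        return mean
    # Concatenating mean and standard deviation keeps some information about variation over time
    return np.concatenate([mean, frames.std(axis=0)])

frames = np.array([[1.0, 2.0], [3.0, 6.0]])
print(summarize_frames(frames))                    # [2. 4.]
print(summarize_frames(frames, include_std=True))  # [2. 4. 1. 2.]
```

Inside `load_embeddings_for_dataset`, this would mean summarizing both the example vector used to compute `num_dimensions` and each loaded file, for instance `X[i, :] = summarize_frames(np.load(fn))`.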


In [None]:
#target = get_clap_embeddings_from_audio(os.path.join(DATA_FOLDER, 'test_sounds', '93100__cgeffex__whip-crack-01.wav'))
#target = get_clap_embeddings_from_audio(os.path.join(DATA_FOLDER, 'test_sounds', '15.wav'))
target = get_clap_embeddings_from_text('the sound of a baby crying')
distances, indices = nbrs.kneighbors([target])

for count, (distance, idx) in enumerate(zip(distances[0], indices[0])):
  fs_id = dataset_df.iloc[idx]["sound_id"]
  print(count + 1, '!', distance, fs_id)
  show_sound_player(fs_id)
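
`NearestNeighbors` uses Euclidean distance by default, whereas matching text and audio embeddings is often described in terms of cosine similarity. The sketch below ranks by cosine similarity with plain NumPy; `rank_by_cosine_similarity` is a hypothetical helper that assumes non-zero vectors. For unit-length embeddings, both rankings coincide.

```python
import numpy as np

def rank_by_cosine_similarity(query, embeddings, top_k=5):
    """Return (indices, similarities) of the top_k rows of `embeddings` most similar to `query`."""
    query = np.asarray(query, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalize to unit length so that the dot product equals the cosine similarity
    query = query / np.linalg.norm(query)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = embeddings @ query
    order = np.argsort(-similarities)[:top_k]  # highest similarity first
    return order, similarities[order]

toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, sims = rank_by_cosine_similarity([2.0, 0.0], toy_embeddings, top_k=2)
print(indices)  # [0 2]
```

With the variables from the cells above, this could be called as `rank_by_cosine_similarity(target, X)`, and the returned indices used with `dataset_df.iloc[...]` in the same way as the output of `kneighbors`.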

In [None]:
# n_neighbors cannot be larger than the number of fitted points (3 label prompts here)
nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit([
    get_clap_embeddings_from_text("Music"),  # Music, index 0
    get_clap_embeddings_from_text("Music bla bla"),  # Music, index 1
    get_clap_embeddings_from_text("Sound Effects"),  # Sound effects, index 2
])

distances, indices = nbrs.kneighbors([get_clap_embeddings_from_audio('xxx')])  # 'xxx' is a placeholder for the path to an audio file
indices[0][0]  # Index of the closest label prompt, e.g. 0 = Music
