<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-research/perch-hoplite/blob/main/perch_hoplite/agile/1_embed_audio_v2.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/google-research/perch-hoplite/blob/main/perch_hoplite/agile/1_embed_audio_v2.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# Overview

This notebook uses `perch-hoplite` to compute and save embeddings for a set of audio files using a pre-trained model. This is the first step in the agile modeling process. If the data you wish to search and classify has already been embedded with a pre-trained model into a perch-hoplite database, skip this notebook and proceed to the step 2 Colab notebook ([02_agile_modeling.ipynb](perch_hoplite/agile/02_agile_modeling.ipynb)).

## [Optional] perch-hoplite installation for hosted runtimes

If you have not already installed `perch-hoplite` (particularly if you are using a hosted Colab runtime), install it from the GitHub source to get the most recent version. After installation, restart your runtime before running anything else: in the top menu, select `Runtime`, then `Restart Session`.

**If you want to use the Perch V2 model, you must additionally install TensorFlow version 2.20 or later.**

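To decide whether the TensorFlow install cell is needed, you can check the version already present in the runtime. The following is a minimal sketch using only the standard library; the helper names `parse_major_minor` and `tensorflow_meets_minimum` are hypothetical, and it assumes TensorFlow is installed under the distribution name `tensorflow` (a differently named build would be reported as missing).

```python
import re
from importlib import metadata


def parse_major_minor(version: str) -> tuple[int, int]:
  """Extracts (major, minor) from a version string such as '2.20.0'."""
  match = re.match(r'(\d+)\.(\d+)', version)
  if match is None:
    raise ValueError(f'Unrecognized version string: {version!r}')
  return int(match.group(1)), int(match.group(2))


def tensorflow_meets_minimum(minimum: tuple[int, int] = (2, 20)) -> bool:
  """Returns True if the installed TensorFlow is at least `minimum`."""
  try:
    installed = metadata.version('tensorflow')
  except metadata.PackageNotFoundError:
    return False  # No distribution named 'tensorflow' is installed.
  # Tuples compare element by element, so (2, 9) < (2, 20) as intended.
  return parse_major_minor(installed) >= minimum


print('TensorFlow >= 2.20 installed:', tensorflow_meets_minimum())
```

If this prints `False` and you plan to use Perch V2, run the TensorFlow install cell and restart the runtime.
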
In [None]:
# @title Only run this code if you need to install perch-hoplite

!pip install git+https://github.com/google-research/perch-hoplite.git

In [None]:
# @title If you plan to use Perch V2, you must install this version (or later) of TensorFlow and CUDA

!pip install tensorflow[and-cuda]~=2.20

In [None]:
# @title Imports

from etils import epath
from IPython.display import display
import ipywidgets as widgets
from ml_collections import config_dict
import numpy as np

from perch_hoplite.agile import colab_utils
from perch_hoplite.agile import embed
from perch_hoplite.agile import source_info
from perch_hoplite.db import brutalism
from perch_hoplite.db import interface
from perch_hoplite.zoo import taxonomy_model_tf

# Embed the audio data

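The configuration cell below takes a base path and a file glob. As a rough way to see which files a pattern would select, the sketch below tries the three example patterns against a throwaway directory tree. It uses the standard-library `pathlib` on a local temporary directory as a stand-in; the notebook itself resolves paths through `perch-hoplite`, which may behave differently (for example on `gs://` paths). The helper `match_audio_files` and the file names are hypothetical.

```python
import pathlib
import tempfile


def match_audio_files(base_path: str, file_glob: str) -> list[str]:
  """Returns paths (relative to base_path) matched by file_glob, sorted."""
  base = pathlib.Path(base_path)
  return sorted(p.relative_to(base).as_posix() for p in base.glob(file_glob))


# Build a small, empty directory tree to try the patterns against.
with tempfile.TemporaryDirectory() as base_path:
  base = pathlib.Path(base_path)
  for rel in ('top.wav', 'site_XYZ/audio_ABC.wav',
              'site_XYZ/audio_DEF.WAV', 'site_XYZ/notes.txt'):
    (base / rel).parent.mkdir(parents=True, exist_ok=True)
    (base / rel).touch()

  top_level = match_audio_files(base_path, '*.wav')       # Base directory only.
  one_level_down = match_audio_files(base_path, '*/*.wav')  # One subfolder deep.
  single_file = match_audio_files(base_path, 'top.wav')   # One specific file.

print(top_level)
print(one_level_down)
print(single_file)
```

On a case-sensitive platform such as a hosted Colab runtime, `*.wav` selects only `top.wav`, while `*/*.wav` selects only `site_XYZ/audio_ABC.wav`: the `.WAV` file is skipped because the extension's case does not match the pattern.
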
In [None]:
# @title Configuration {vertical-output: true}

# @markdown Configure the raw dataset and output location(s).  The format is a mapping from
# @markdown a dataset_name to a (base_path, fileglob) pair.  Note that the file
# @markdown globs are case sensitive.  The dataset name can be anything you want.
#
# @markdown This structure allows you to move your data around without having to
# @markdown re-embed the dataset.  The generated embedding database will be
# @markdown placed in the base path. This allows you to simply swap out
# @markdown the base path here if you ever move your dataset.

# @markdown By default we only process one dataset at a time.  Re-run this entire notebook
# @markdown once per dataset.

# @markdown For example, we might set dataset_base_path to `/home/me/myproject`,
# @markdown and use the glob `*/*.wav` if all of the audio files have filepaths
# @markdown like `/home/me/myproject/site_XYZ/audio_ABC.wav` (i.e. audio files are contained
# @markdown in subfolders of the base directory).

# @markdown 1. Create a unique name for the database that will store the embeddings for the
# @markdown target data.
dataset_name = 'powdermill'  # @param {type: 'string'}

# @markdown 2. Input the filepath for the folder that contains the input audio files.
dataset_base_path = 'gs://chirp-public-bucket/soundscapes/powdermill'  # @param {type: 'string'}

# @markdown 3. Input the file pattern for the audio files within that folder that you want
# @markdown to embed. Some examples for how to input:
# @markdown - All files in the base directory of a specific type (not subdirectories): e.g.
# @markdown `*.wav` (or `*.flac` etc) will generate embeddings for all .wav files (or whichever
# @markdown format) in the `dataset_base_path`.
# @markdown - All files in one level of subdirectories within the base directory: `*/*.flac`
# @markdown will generate embeddings for all .flac files.
# @markdown - Single file: `myfile.wav` will only embed the audio from that specific file.
dataset_fileglob = '*/*.wav'  # @param {type:'string'}

# @markdown 4. [Optional] If saving the embeddings database to a new directory, specify here.
# @markdown Otherwise, leave blank - by default the embeddings database output will be saved
# @markdown within `dataset_base_path` where the audio is located. You do not need to specify
# @markdown `db_path` unless you want to maintain multiple distinct embedding databases, or
# @markdown if you would like to save the output in a different folder. If your input audio
# @markdown data is accessed from a public URL, we recommend specifying a separate output
# @markdown directory here.
db_path = '/tmp/hoplite'  # @param {type:'string'}
if not db_path or db_path.lower() == 'none':
  db_path = None

# @markdown 5. Choose a supported model to generate embeddings. `perch_v2` is the latest Perch
# @markdown model and was trained on multiple taxa including birds, mammals, insects and amphibians.
# @markdown `perch_v2` has also demonstrated high performance for marine audio transfer learning
# @markdown tasks. **NOTE: `perch_v2` only works with GPU runtimes - see the instructions above.**
# @markdown `perch_8` is the last updated version of Perch V1, trained only on birds, and
# @markdown `birdnet_V2.3` is also a common option for birds. Other choices include `surfperch`
# @markdown for coral reefs or `multispecies_whale` for marine mammals.
model_choice = 'perch_v2'  # @param ['perch_v2', 'perch_8', 'humpback', 'multispecies_whale', 'surfperch', 'birdnet_V2.3']

# @markdown 6. [Optional] Shard the audio for embeddings. File sharding automatically splits audio
# @markdown files into smaller chunks for creating embeddings. This limits both system and GPU
# @markdown memory usage, which is especially useful when working with long files (>1 hour).
use_file_sharding = True  # @param {type:'boolean'}
# @markdown If you want to change the length in seconds for the shards, specify here.
shard_length_in_seconds = 60  # @param {type:'number'}

audio_glob = source_info.AudioSourceConfig(
    dataset_name=dataset_name,
    base_path=dataset_base_path,
    file_glob=dataset_fileglob,
    min_audio_len_s=1.0,
    # -2 requests the embedding model's own sample rate.
    target_sample_rate_hz=-2,
    shard_len_s=float(shard_length_in_seconds) if use_file_sharding else None,
)

configs = colab_utils.load_configs(
    source_info.AudioSources((audio_glob,)),
    db_path,
    model_config_key=model_choice,
    db_key='sqlite_usearch',
)
configs

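File sharding, as configured above, means a long recording is handled as a sequence of shorter pieces rather than loaded in one go. The sketch below illustrates the idea by computing the time ranges for consecutive shards. It is an illustration only and not the `perch-hoplite` implementation: the function `shard_boundaries` is hypothetical, and the library may treat a short trailing shard differently (for example with respect to `min_audio_len_s`).

```python
def shard_boundaries(
    duration_s: float, shard_len_s: float
) -> list[tuple[float, float]]:
  """Splits [0, duration_s) into consecutive (start_s, end_s) shards."""
  if shard_len_s <= 0:
    raise ValueError('shard_len_s must be positive')
  shards = []
  start = 0.0
  while start < duration_s:
    # The final shard is clipped to the end of the recording.
    shards.append((start, min(start + shard_len_s, duration_s)))
    start += shard_len_s
  return shards


# A hypothetical 150-second recording with the notebook's default 60 s shards.
shards = shard_boundaries(duration_s=150.0, shard_len_s=60.0)
print(shards)  # [(0.0, 60.0), (60.0, 120.0), (120.0, 150.0)]
```

Only one shard's worth of audio has to be held in memory at a time, which is why a smaller `shard_length_in_seconds` lowers peak memory use for long files.
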
In [None]:
# @title Initialize the hoplite database (DB) {vertical-output: true}

global db
db = configs.db_config.load_db()
num_embeddings = db.count_embeddings()

print('Initialized DB located at:', configs.db_config.db_config.db_path)

def drop_and_reload_db(_) -> None:
  # Rebind the notebook-level `db`. Without this declaration the reload below
  # would only assign a local variable and later cells would keep the old DB.
  global db
  db_path = epath.Path(configs.db_config.db_config.db_path)
  for fp in db_path.glob('hoplite.sqlite*'):
    fp.unlink()
  (db_path / 'usearch.index').unlink()
  print('\n Deleted previous db at: ', configs.db_config.db_config.db_path)
  db = configs.db_config.load_db()

# @markdown If `drop_existing_db` is set to True and the database already exists and contains
# @markdown embeddings, those existing embeddings will be erased. You will be prompted
# @markdown to confirm that you wish to delete them. To keep the existing embeddings
# @markdown in the database, set this to False, which appends the new
# @markdown embeddings to the database.
drop_existing_db = False  # @param {type: 'boolean'}

if num_embeddings > 0 and drop_existing_db:
  print('Existing DB contains projects: ', db.get_all_projects())
  print('num embeddings: ', num_embeddings)
  print('\n\nClick the button below to confirm you really want to drop the database at ')
  print(f'{configs.db_config.db_config.db_path}\n')
  print(f'This will permanently delete all {num_embeddings} embeddings from the existing database.\n')
  print('If you do NOT want to delete this data, set `drop_existing_db` above to `False` and re-run this cell.\n')

  button = widgets.Button(description='Delete database?')
  button.on_click(drop_and_reload_db)
  display(button)

In [None]:
# @title Run the embedding {vertical-output: true}

print(f'Embedding dataset as a new db project: {audio_glob.dataset_name}')

worker = embed.EmbedWorker(
    audio_sources=configs.audio_sources_config,
    db=db,
    model_config=configs.model_config,
)

worker.process_all(target_dataset_name=audio_glob.dataset_name)

print('\n\nEmbedding complete, total embeddings: ', db.count_embeddings())

In [None]:
# @title Per project statistics {vertical-output: true}

for project in db.get_all_projects():
  window_ids = db.match_window_ids(
      deployments_filter=config_dict.create(eq=dict(project=project))
  )
  print('Project:', project)
  print('>>> num embeddings:', len(window_ids))
  print()

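The example search in the final cell compares one query embedding against every embedding in the database and keeps the best-scoring matches. The sketch below shows that idea in plain NumPy with dot-product scoring on random stand-in data. It is not `brutalism.brute_search` itself: the function `brute_force_top_k`, the array shapes, and the random embeddings are assumptions made for illustration, and `k` is assumed to be at least 1.

```python
import numpy as np


def brute_force_top_k(embeddings: np.ndarray, query: np.ndarray, k: int):
  """Returns (indices, scores) of the k rows with the largest dot product."""
  scores = embeddings @ query  # One score per stored embedding.
  k = min(k, scores.shape[0])
  # argpartition picks the k best in linear time; then sort only those k.
  top = np.argpartition(-scores, k - 1)[:k]
  top = top[np.argsort(-scores[top])]
  return top, scores[top]


# Stand-in data: 1000 random 16-dimensional "embeddings".
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 16)).astype(np.float32)
query = embeddings[42]  # Query with a stored embedding, as the notebook does.

indices, scores = brute_force_top_k(embeddings, query, k=5)
print(indices)
print(scores)
```

Because every stored embedding is scored, the cost grows linearly with the size of the database, and the scores come back sorted from best to worst.
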
In [None]:
# @title Show example embedding search
# @markdown As an example (and to show that the embedding process worked), this selects a single
# @markdown embedding from the database and prints the window IDs of its top-k (k = 128)
# @markdown nearest neighbors in the database.

q = db.get_embedding(db.match_window_ids(limit=1)[0])
%time results, scores = brutalism.brute_search(worker.db, query_embedding=q, search_list_size=128, score_fn=np.dot)
print([int(r.window_id) for r in results])