## Notebook Description

This notebook extracts embeddings from audio datasets using the **birdnetlib** library and saves the results to CSV files. It handles several bird species datasets, writes results in batches to limit memory use, and reports progress as it runs.

### **Main Tasks:**

1. **Load and Organize Audio Files:**
   - Dynamically retrieves `.wav` audio files from the `Extended_audios` directory for the following datasets:
     - `chiffchaff-fg`
     - `littleowl-fg`
     - `pipit-fg`
     - `rtbc-begging`
     - `littlepenguin-display_call-exhale`
     - `greatTit_song-files`
     - `KiwiTrimmed`
   - Lets you pick one dataset from a dropdown, then prints the number of audio files found for it and the name of its output CSV file.

2. **Set Up Output Directory and CSV Files:**
   - Creates the output directory (`Embeddings_from_3sPadding`) if it does not exist.
   - Defines separate CSV files for each dataset to store the extracted embeddings.

3. **Extract Embeddings Using BirdNET:**
   - Initializes the BirdNET model (`Analyzer`).
   - Iterates through the audio files, skipping already processed files (tracked in the CSV).
   - For each audio file:
     - Loads the file into BirdNET.
     - Extracts embeddings and metadata (e.g., start time, end time).
     - Appends the embeddings to the corresponding CSV file every 200 files, so that only one batch of rows is held in memory at a time.

4. **Handle Errors and Memory Management:**
   - Skips files that cannot be processed and prints the error for each one.
   - Releases each recording after it is processed (`del` followed by `gc.collect()`) and empties the row buffer after each batch is written.

5. **Post-Processing Analysis:**
   - Compares the original list of audio files with the processed files in the CSV.
   - Reports the total number of processed and unprocessed files.
   - Displays a list of unprocessed audio files for further investigation.
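
The resume-and-batch pattern described in steps 3 to 5 can be sketched without BirdNET. This is a minimal sketch, not the notebook's own code: `fake_embed` is a hypothetical stand-in for the embedding extraction, `process_in_batches`, `already_done`, and `flush` are names introduced here for illustration, and the standard `csv` module replaces pandas so that the example runs without extra dependencies.

```python
import csv
import os
import tempfile

FIELDS = ["file_name", "start_time", "end_time", "dim_1", "dim_2"]


def fake_embed(file_name):
    """Hypothetical stand-in for BirdNET: one 3 s segment, two made-up dimensions."""
    return [{"file_name": file_name, "start_time": 0.0, "end_time": 3.0,
             "dim_1": 0.1, "dim_2": 0.2}]


def already_done(csv_path):
    """Return the file names already present in the CSV (empty set if no CSV yet)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as handle:
        return {row["file_name"] for row in csv.DictReader(handle)}


def flush(buffer, csv_path):
    """Append buffered rows; write the header only when the file is created."""
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(buffer)
    buffer.clear()


def process_in_batches(file_names, csv_path, batch_size=2):
    """Process unseen files, flushing to disk every batch_size files."""
    done = already_done(csv_path)
    buffer, processed = [], 0
    for name in file_names:
        if name in done:
            continue  # already in the CSV from an earlier run
        buffer.extend(fake_embed(name))
        processed += 1
        if processed % batch_size == 0:
            flush(buffer, csv_path)
    if buffer:
        flush(buffer, csv_path)  # remaining rows of the last partial batch
    return processed


csv_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.csv")
first_run = process_in_batches(["a.wav", "b.wav", "c.wav"], csv_path)
second_run = process_in_batches(["a.wav", "b.wav", "c.wav", "d.wav"], csv_path)
print(first_run, second_run, sorted(already_done(csv_path)))
```

The second call processes only `d.wav`, because the other three names are already recorded in the CSV. Rows buffered since the last flush are lost if the run is interrupted, which is the trade-off of batching.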

### **Output:**
- CSV files containing embeddings for each dataset, saved in the directory:
  ```
  Output_files/Embeddings_from_3sPadding/
  ```
  Examples: `greatTit_embeddings.csv`, `kiwi_embeddings.csv`.
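
A downstream workflow would typically load one of these files and separate the metadata from the embedding columns. The following is a minimal sketch that assumes the column layout written by the extraction cell (`file_name`, `start_time`, `end_time`, followed by `dim_*` columns). The small in-memory CSV and its values are made up, and `split_embeddings` is a hypothetical helper, not part of the notebook.

```python
import io

import pandas as pd

# Made-up stand-in for a file such as kiwi_embeddings.csv
# (real files contain many more dim_* columns).
demo_csv = io.StringIO(
    "file_name,start_time,end_time,dim_1,dim_2,dim_3\n"
    "a.wav,0.0,3.0,0.1,0.0,0.3\n"
    "a.wav,3.0,6.0,0.2,0.0,0.4\n"
    "b.wav,0.0,3.0,0.5,0.0,0.6\n"
)


def split_embeddings(df):
    """Return (metadata DataFrame, 2-D NumPy array built from the dim_* columns)."""
    dim_cols = [c for c in df.columns if c.startswith("dim_")]
    meta = df[["file_name", "start_time", "end_time"]]
    return meta, df[dim_cols].to_numpy()


df = pd.read_csv(demo_csv)
meta, X = split_embeddings(df)
print(X.shape)                      # (segments, dimensions)
print(meta["file_name"].nunique())  # number of distinct recordings
```

Each row is one time segment of one recording, so a recording longer than one segment contributes several rows to `X`.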

### **Key Features:**
- Efficient batch processing with progress updates every 200 files.
- Skips already processed files to avoid duplication.
- Handles large datasets with memory management and error handling.
- Provides detailed statistics on processed and unprocessed files.
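
One possible way to combine the error handling and memory management listed above is a `try`/`except`/`finally` block per file, so that cleanup also runs when a file fails. This is a sketch under stated assumptions rather than the notebook's implementation: `process_all` and `fake_extract` are hypothetical names, and failures are collected in a list instead of only being printed.

```python
import gc


def process_all(paths, extract):
    """Run extract(path) on each path; collect results and failures separately."""
    results, failures = [], []
    for path in paths:
        item = None
        try:
            item = extract(path)
            results.extend(item)
        except Exception as exc:
            # Keep going, but remember which file failed and why.
            failures.append((path, str(exc)))
        finally:
            # Runs on success and on failure, so cleanup is never skipped.
            del item
            gc.collect()
    return results, failures


def fake_extract(path):
    """Hypothetical extractor that fails on paths containing 'bad'."""
    if "bad" in path:
        raise ValueError("unreadable audio")
    return [{"file_name": path}]


results, failures = process_all(["a.wav", "bad.wav", "c.wav"], fake_extract)
print(len(results), failures)
```

Keeping the failures in a list makes it possible to retry or inspect them after the run instead of searching the printed output.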

This notebook ensures that audio embeddings are extracted and saved efficiently, making the datasets ready for further analysis or machine learning workflows.

In [None]:
import csv
import pandas as pd
from pathlib import Path
import gc
import soundfile as sf
import numpy as np
import glob
from birdnetlib import Recording
from birdnetlib.analyzer import Analyzer
import os
import ipywidgets as widgets
from IPython.display import display

# Resolve the project root (two levels above the current working directory)
cwd = Path.cwd()
project_root = cwd.parents[1]
# Base path of the input audio files
base_path = project_root / 'Output_files' / 'Extended_audios'
# Create the output directory for embeddings
output_dir = project_root / 'Output_files' / 'Embeddings_from_3sPadding'
output_dir.mkdir(parents=True, exist_ok=True)


# Function to recursively retrieve .wav files (the extension match is case-insensitive)
def get_audio_files(base_path, folder_name, pattern="**/*.[Ww][Aa][Vv]"):
    return glob.glob(f"{str(base_path)}/{folder_name}/{pattern}", recursive=True)


# Use the function to get audio files for each bird species
audios_chiffchaff = get_audio_files(base_path, "chiffchaff-fg")
audios_littleowl = get_audio_files(base_path, "littleowl-fg")
audios_pipit = get_audio_files(base_path, "pipit-fg")
audios_rtbc = get_audio_files(base_path, "rtbc-begging")
audios_littlepenguin = get_audio_files(base_path, "littlepenguin-display_call-exhale")
audios_greatTit = get_audio_files(base_path, "greatTit_song-files")
audios_kiwi = get_audio_files(base_path, "KiwiTrimmed")


# Define the CSV files for each dataset
csv_chiffchaff = output_dir / 'chiffchaff_embeddings.csv'
csv_littleowl = output_dir / 'littleowl_embeddings.csv'
csv_pipit = output_dir / 'pipit_embeddings.csv'
csv_rtbc = output_dir / 'rtbc_embeddings.csv'
csv_littlepenguin = output_dir / 'littlepenguin_embeddings.csv'
csv_greatTit = output_dir / 'greatTit_embeddings.csv'
csv_kiwi = output_dir / 'kiwi_embeddings.csv'


# Dataset mapping
dataset_map = {
    "chiffchaff-fg": (audios_chiffchaff, csv_chiffchaff),
    "littleowl-fg": (audios_littleowl, csv_littleowl),
    "pipit-fg": (audios_pipit, csv_pipit),
    "rtbc-begging": (audios_rtbc, csv_rtbc),
    "littlepenguin-display_call-exhale": (audios_littlepenguin, csv_littlepenguin),
    "greatTit_song-files": (audios_greatTit, csv_greatTit),
    "KiwiTrimmed": (audios_kiwi, csv_kiwi) 
}

# Create the dropdown; the first option (value None) is selected by default
dataset_dropdown = widgets.Dropdown(
    options=[('Select a dataset', None)] + [(k, k) for k in dataset_map.keys()],
    description='Select dataset:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%'),
)

# Callback function
def on_selection_change(change):
    if change['new'] is not None:
        global audios, csv_filename
        selected_key = change['new']
        audios, csv_filename = dataset_map[selected_key]
        print(f"\nSelected: {selected_key}")
        print(f"Total audio files: {len(audios)}")
        print(f"CSV output file: {csv_filename.name}")

# Observe changes
dataset_dropdown.observe(on_selection_change, names='value')

# Display the dropdown
display(dataset_dropdown)


Dropdown(description='Select dataset:', layout=Layout(width='50%'), options=(('Select a dataset', None), ('chi…


Selected: KiwiTrimmed
Total audio files: 455
CSV output file: kiwi_embeddings.csv


In [2]:
# Make sure audios and csv_filename were defined
try:
    # Check if the CSV file already exists
    if os.path.exists(csv_filename):
        df_embeddings = pd.read_csv(csv_filename)
        if "file_name" in df_embeddings.columns:
            processed_files = set(df_embeddings["file_name"].unique())
        else:
            processed_files = set()
    else:
        df_embeddings = pd.DataFrame()
        processed_files = set()

    # Extract filenames from paths
    original_audios = set(Path(audio).name for audio in audios)
    unprocessed_audios = original_audios - processed_files

    # Display results
    print(f"Total audios in the original list: {len(original_audios)}")
    print(f"Total audios already processed: {len(processed_files)}")
    print(f"Total audios not yet processed: {len(unprocessed_audios)}")

except NameError:
    print("❌ Please run the dataset selection cell first.")

Total audios in the original list: 455
Total audios already processed: 0
Total audios not yet processed: 455


In [3]:
# Check that required variables are defined (from the selection and status cells)
missing = [name for name in ("audios", "csv_filename", "unprocessed_audios") if name not in globals()]
if missing:
    raise RuntimeError(f"❌ {', '.join(missing)} not defined. Please run the dataset selection and status cells first.")

if len(unprocessed_audios) > 0: 
    # Initialize the BirdNET model
    analyzer = Analyzer()

    # Check if the CSV file already exists to retrieve processed audio files
    if os.path.exists(csv_filename):
        df_existing = pd.read_csv(csv_filename)
        processed_files = set(df_existing["file_name"])  # Already processed files
    else:
        processed_files = set()

    # Buffer of embedding rows waiting for the next CSV write
    all_data = []
    processed_count = 0  # Counter for files processed in this execution
    batch_size = 200     # Save after every 200 audios

    # Iterate through each file in the list of audio files
    for audio_path in audios:
        try:
            # Extract the file name without the path
            file_name = os.path.basename(audio_path)

            # Skip if the file has already been processed
            if file_name in processed_files:
                continue

            # Load the recording into BirdNET
            recording = Recording(analyzer, audio_path)
            # Extract embeddings
            recording.extract_embeddings()

            # Process the extracted embeddings
            for emb in recording.embeddings:
                row = {
                    "file_name": file_name,
                    "start_time": emb['start_time'],
                    "end_time": emb['end_time']
                }
                for i, value in enumerate(emb["embeddings"]):
                    row[f"dim_{i+1}"] = value

                all_data.append(row)

            processed_count += 1

            # Save every batch_size audios
            if processed_count % batch_size == 0:
                df_partial = pd.DataFrame(all_data)
                df_partial.to_csv(csv_filename, mode='a', header=not os.path.exists(csv_filename), index=False)
                all_data = []
                print(f"{processed_count} audios processed and saved.")

            # Clear memory
            del recording
            gc.collect()

        except Exception as e:
            print(f"Error processing {audio_path}: {e}")

    # Save remaining data at the end
    if all_data:
        df_partial = pd.DataFrame(all_data)
        df_partial.to_csv(csv_filename, mode='a', header=not os.path.exists(csv_filename), index=False)
        print(f"Saved remaining {len(all_data)} entries.")

    print(f"Embeddings saved to {csv_filename}")
else:
    print("✅ All audio files have already been processed. No new files to process.")
    print(f"Embeddings saved to {csv_filename}")



Labels loaded.
load model True
Model loaded.
Labels loaded.
load_species_list_model
Meta model loaded.
read_audio_data


INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


read_audio_data: complete, read  11 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_20121217_222458_125-GSKmale_trim.wav
read_audio_data
read_audio_data: complete, read  12 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_20121229_233001_185-GSKmale_trim.wav
read_audio_data
read_audio_data: complete, read  8 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_20121103_213004_0-GSKmale_trim.wav
read_audio_data
read_audio_data: complete, read  8 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_20121223_214502_326-GSKmale_trim.wav
read_audio_data
read_audio_data: complete, read  10 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_20121221_213002_623-GSKmale_trim.wav
read_audio_data
read_audio_data: complete, read  9 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_20121228_214502_351-GSKmale_trim.wav
read_audio_data
read_audio_data: complete, read  10 chunks.
extract_embeddings_for_recording Hawdon-EbbNest201213_2

In [4]:
# Check if the CSV file already exists to retrieve processed audio files
if os.path.exists(csv_filename):
    df_embeddings = pd.read_csv(csv_filename)
    processed_audios = set(df_embeddings["file_name"].unique())  # Already processed files
else:
    df_embeddings = pd.DataFrame()
    processed_audios = set()  # No CSV yet; avoids a KeyError on the empty DataFrame

original_audios = set(Path(audio).name for audio in audios)  # File names of all audios in the dataset
unprocessed_audios = original_audios - processed_audios  # Audios that were not processed

# Display results
print(f"Total audios processed: {len(processed_audios)}")
print(f"Total audios not processed: {len(unprocessed_audios)}")

# Audios that were not processed (sorted for a stable listing)
print("Unprocessed audios:")
print("\n".join(sorted(unprocessed_audios)))

print("structure of the DataFrame:")
print(df_embeddings.head() if not df_embeddings.empty else "DataFrame is empty.")
# Clean up memory
del df_embeddings

Total audios processed: 455
Total audios not processed: 0
Unprocessed audios:

structure of the DataFrame:
                                           file_name  start_time  end_time  \
0  Hawdon-EbbNest201213_20121217_222458_125-GSKma...         0.0       3.0   
1  Hawdon-EbbNest201213_20121217_222458_125-GSKma...         3.0       6.0   
2  Hawdon-EbbNest201213_20121217_222458_125-GSKma...         6.0       9.0   
3  Hawdon-EbbNest201213_20121217_222458_125-GSKma...         9.0      12.0   
4  Hawdon-EbbNest201213_20121217_222458_125-GSKma...        12.0      15.0   

      dim_1  dim_2     dim_3     dim_4     dim_5     dim_6     dim_7  ...  \
0  0.280927    0.0  0.169676  0.231509  0.190414  0.204095  0.484960  ...   
1  0.318225    0.0  0.322079  0.188719  0.280874  0.118702  0.156187  ...   
2  0.601878    0.0  0.373245  0.013728  0.435837  0.096302  0.107153  ...   
3  0.184520    0.0  0.199776  0.348397  0.379622  0.059166  0.012470  ...   
4  0.005157    0.0  0.133940  0.319355 