## Notebook Description

This notebook extracts BirdNET embeddings from audio datasets with the **birdnetlib** library and saves them to CSV files. It covers several bird species datasets, processes one selected dataset per run, and supports resuming, batched writes, and progress reporting.

### **Main Tasks:**

1. **Load and Organize Audio Files:**
   - Recursively collects `.wav` audio files (extension matched in any letter case) from the `Output_files/Extended_audios` directory for the following datasets:
     - `chiffchaff-fg`
     - `littleowl-fg`
     - `pipit-fg`
     - `rtbc-begging`
     - `littlepenguin-display_call-exhale`
     - `greatTit_song-files`
   - Prints the number of audio files found for each dataset and the first five file paths for a visual check.

2. **Set Up Output Directory and CSV Files:**
   - Creates the output directory `Output_files/Embeddings_from_3sPadding` if it does not exist.
   - Defines separate CSV files for each dataset to store the extracted embeddings.

3. **Extract Embeddings Using BirdNET:**
   - Initializes the BirdNET model (`Analyzer`).
   - Iterates through the audio files, skipping already processed files (tracked in the CSV).
   - For each audio file:
     - Loads the file into BirdNET.
     - Extracts embeddings and metadata (e.g., start time, end time).
     - Saves the embeddings to the corresponding CSV file in batches of 200 files to optimize memory usage.

4. **Handle Errors and Memory Management:**
   - Skips files that cannot be processed and logs errors.
   - Empties the in-memory row buffer after each batch is written and runs garbage collection after every file.

5. **Post-Processing Analysis:**
   - Compares the original list of audio files with the processed files in the CSV.
   - Reports the number of files in the original list, the number processed, and the number not processed.
   - Displays a list of unprocessed audio files for further investigation.
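
The resume-and-batch logic in tasks 3 and 4 can be sketched independently of BirdNET. The following is a minimal sketch, not the notebook's implementation: it uses the standard-library `csv` module instead of pandas, and `already_processed`, `append_rows`, `process_in_batches`, `fake_extract`, and the demo paths are hypothetical names introduced only for illustration.

```python
import csv
import os
import tempfile


def already_processed(csv_path):
    """Return the set of file names already in the CSV (empty if it does not exist)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as fh:
        return {row["file_name"] for row in csv.DictReader(fh)}


def append_rows(csv_path, rows):
    """Append rows to the CSV, writing the header only when the file is new."""
    if not rows:
        return
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


def process_in_batches(paths, csv_path, extract, batch_size=200):
    """Run `extract` on each unprocessed path and flush rows every `batch_size` files."""
    done = already_processed(csv_path)
    buffer, count = [], 0
    for path in paths:
        if os.path.basename(path) in done:
            continue  # processed in an earlier run
        try:
            buffer.extend(extract(path))
        except Exception as exc:
            print(f"Error processing {path}: {exc}")
            continue
        count += 1
        if count % batch_size == 0:
            append_rows(csv_path, buffer)
            buffer = []  # free the rows that were just written
    append_rows(csv_path, buffer)  # write whatever is left
    return count


def fake_extract(path):
    """Hypothetical stand-in for BirdNET: one 3-second segment, 2-value embedding."""
    return [{"path": path, "file_name": os.path.basename(path),
             "start_time": 0.0, "end_time": 3.0, "dim_1": 0.1, "dim_2": 0.2}]


demo_csv = os.path.join(tempfile.mkdtemp(), "demo_embeddings.csv")
demo_paths = [f"/data/clip_{i}.wav" for i in range(5)]
first_run = process_in_batches(demo_paths, demo_csv, fake_extract, batch_size=2)
second_run = process_in_batches(demo_paths, demo_csv, fake_extract, batch_size=2)
print(first_run, second_run)
```

Running the function a second time on the same list processes nothing new, because every file name is already recorded in the CSV. This is the property that makes an interrupted run safe to restart.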

### **Output:**
- CSV files containing embeddings for each dataset, saved in the directory:
  ```
  Output_files/Embeddings_from_3sPadding/
  ```
  One file per dataset, for example `greatTit_embeddings.csv`.
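
The dataset folders and CSV names listed above could be kept in a single mapping, so that switching datasets means changing one key. This is only a sketch under that assumption: `DATASET_FOLDERS`, `find_wavs`, `dataset_io`, and the temporary demo directory are hypothetical and do not appear in the notebook.

```python
from pathlib import Path
import tempfile

# Short dataset key -> folder name under Extended_audios (names taken from the list above)
DATASET_FOLDERS = {
    "chiffchaff": "chiffchaff-fg",
    "littleowl": "littleowl-fg",
    "pipit": "pipit-fg",
    "rtbc": "rtbc-begging",
    "littlepenguin": "littlepenguin-display_call-exhale",
    "greatTit": "greatTit_song-files",
}


def find_wavs(folder):
    """Recursively list .wav files, matching the extension in any letter case."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".wav"
    )


def dataset_io(base_path, output_dir, name):
    """Return (audio file list, CSV path) for one dataset key."""
    audio_files = find_wavs(Path(base_path) / DATASET_FOLDERS[name])
    csv_path = Path(output_dir) / f"{name}_embeddings.csv"
    return audio_files, csv_path


# Demo on a throwaway directory tree instead of the real audio folders
demo_root = Path(tempfile.mkdtemp())
(demo_root / "pipit-fg" / "sub").mkdir(parents=True)
(demo_root / "pipit-fg" / "a.wav").touch()
(demo_root / "pipit-fg" / "sub" / "b.WAV").touch()
(demo_root / "pipit-fg" / "notes.txt").touch()

files, csv_path = dataset_io(demo_root, demo_root / "out", "pipit")
print(len(files), csv_path.name)
```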

### **Key Features:**
- Efficient batch processing with progress updates every 200 files.
- Skips already processed files to avoid duplication.
- Handles large datasets with memory management and error handling.
- Provides detailed statistics on processed and unprocessed files.

This notebook ensures that audio embeddings are extracted and saved efficiently, making the datasets ready for further analysis or machine learning workflows.
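
For downstream use, the CSV rows can be read back and grouped by audio file. The sketch below assumes the column layout written by this notebook (`path`, `file_name`, `start_time`, `end_time`, then `dim_1`, `dim_2`, ...). The helper `load_embeddings` and the two-dimensional sample data are hypothetical and only illustrate the idea; real files have many more `dim_` columns.

```python
import csv
import io


def load_embeddings(fh):
    """Group embedding vectors by file name from an open CSV handle."""
    reader = csv.DictReader(fh)
    # Keep the dim_* columns in the order they appear in the header
    dim_columns = [c for c in reader.fieldnames if c.startswith("dim_")]
    by_file = {}
    for row in reader:
        vector = [float(row[c]) for c in dim_columns]
        by_file.setdefault(row["file_name"], []).append(vector)
    return by_file


# Tiny in-memory sample standing in for one of the embeddings CSV files
sample = io.StringIO(
    "path,file_name,start_time,end_time,dim_1,dim_2\n"
    "/data/a.wav,a.wav,0.0,3.0,0.1,0.2\n"
    "/data/a.wav,a.wav,3.0,6.0,0.3,0.4\n"
    "/data/b.wav,b.wav,0.0,3.0,0.5,0.6\n"
)
by_file = load_embeddings(sample)
print({name: len(vectors) for name, vectors in by_file.items()})
```

Each file name maps to one vector per analysed segment, so recordings that span several segments keep all of their rows.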

In [1]:
import librosa
import pandas as pd
from pathlib import Path
import gc
import soundfile as sf
import numpy as np
import glob
from birdnetlib import Recording
from birdnetlib.analyzer import Analyzer
import os



# Locate the project root, assumed to be two levels above the working directory
cwd = Path.cwd()
project_root = cwd.parents[1]


# Define a function that collects the .wav files dynamically
def get_audio_files(base_path, folder_name, pattern="**/*.[Ww][Aa][Vv]"):
    return glob.glob(f"{str(base_path)}/{folder_name}/{pattern}", recursive=True)

# Define the base path
base_path = project_root / 'Output_files' / 'Extended_audios'

# Use the function for each dataset
audios_chiffchaff = get_audio_files(base_path, "chiffchaff-fg")
audios_littleowl = get_audio_files(base_path, "littleowl-fg")
audios_pipit = get_audio_files(base_path, "pipit-fg")
audios_rtbc = get_audio_files(base_path, "rtbc-begging")
audios_littlepenguin = get_audio_files(base_path, "littlepenguin-display_call-exhale")
audios_greatTit = get_audio_files(base_path, "greatTit_song-files")

print("Files found:")
print(f"Audios chiffchaff: {len(audios_chiffchaff)}")
print(f"Audios littleowl: {len(audios_littleowl)}")
print(f"Audios pipit: {len(audios_pipit)}")
print(f"Audios rtbc: {len(audios_rtbc)}")
print(f"Audios littlepenguin: {len(audios_littlepenguin)}")
print(f"Audios greatTit: {len(audios_greatTit)}")

# Print the first few file paths for verification
print(audios_chiffchaff[:5])  # Verify the first few paths generated for chiffchaff
print(audios_littleowl[:5])  # Verify the first few paths generated for littleowl
print(audios_pipit[:5])  # Verify the first few paths generated for pipit
print(audios_rtbc[:5])  # Verify the first few paths generated for rtbc
print(audios_littlepenguin[:5])  # Verify the first few paths generated for littlepenguin
print(audios_greatTit[:5])  # Verify the first few paths generated for greatTit


# Create the output directory if it does not exist
output_dir = project_root / 'Output_files' / 'Embeddings_from_3sPadding'
output_dir.mkdir(parents=True, exist_ok=True)

# Define the CSV files for each dataset

csv_chiffchaff = output_dir / 'chiffchaff_embeddings.csv'
csv_littleowl = output_dir / 'littleowl_embeddings.csv'
csv_pipit = output_dir / 'pipit_embeddings.csv'
csv_rtbc = output_dir / 'rtbc_embeddings.csv'
csv_littlepenguin = output_dir / 'littlepenguin_embeddings.csv'
csv_greatTit = output_dir / 'greatTit_embeddings.csv'


# Decide the audio files and CSV filename to work with.
audios = audios_littlepenguin
csv_filename = csv_littlepenguin

Files found:
Audios chiffchaff: 6762
Audios littleowl: 952
Audios pipit: 1364
Audios rtbc: 1785
Audios littlepenguin: 2429
Audios greatTit: 74048
['/teamspace/studios/this_studio/Output_files/Extended_audios/chiffchaff-fg/cutted_day1_PC1101_0000.wav', '/teamspace/studios/this_studio/Output_files/Extended_audios/chiffchaff-fg/cutted_day1_PC1101_0001.wav', '/teamspace/studios/this_studio/Output_files/Extended_audios/chiffchaff-fg/cutted_day1_PC1101_0002.wav', '/teamspace/studios/this_studio/Output_files/Extended_audios/chiffchaff-fg/cutted_day1_PC1101_0003.wav', '/teamspace/studios/this_studio/Output_files/Extended_audios/chiffchaff-fg/cutted_day1_PC1101_0004.wav']
['/teamspace/studios/this_studio/Output_files/Extended_audios/littleowl-fg/littleowl2017fg_test_108_0000.wav', '/teamspace/studios/this_studio/Output_files/Extended_audios/littleowl-fg/littleowl2017fg_test_108_0001.wav', '/teamspace/studios/this_studio/Output_files/Extended_audios/littleowl-fg/littleowl2017fg_test_108_0002.wav

In [2]:
# Initialize the BirdNET model
analyzer = Analyzer()

# Check if the CSV file already exists to retrieve processed audio files
if os.path.exists(csv_filename):
    df_existing = pd.read_csv(csv_filename)
    processed_files = set(df_existing["file_name"])  # Already processed files
else:
    #df_existing = pd.DataFrame()
    processed_files = set()

# Rows waiting to be written to the CSV
all_data = []
processed_count = 0  # Counter for files processed in this execution
batch_size = 200     # Save after every 200 audios

# Iterate through each file in the list of audio files
for audio_path in audios:
    try:
        # Extract the file name without the path
        file_name = os.path.basename(audio_path)

        # Skip if the file has already been processed (matched by base name,
        # so file names must be unique within the dataset)
        if file_name in processed_files:
            # print(f"Skipping {file_name}, already processed.")
            continue

        # Load the recording into BirdNET
        recording = Recording(analyzer, audio_path)
        recording.extract_embeddings()

        # Process the extracted embeddings
        for emb in recording.embeddings:
            row = {
                "path": audio_path,  # Full path of the file
                "file_name": file_name,  # File name without the path
                "start_time": emb['start_time'],
                "end_time": emb['end_time']
            }
            # Add the 1024 embedding values
            for i, value in enumerate(emb["embeddings"]):
                row[f"dim_{i+1}"] = value

            all_data.append(row)

        # Increment counter
        processed_count += 1

        # Save every batch_size audios
        if processed_count % batch_size == 0:
            df_partial = pd.DataFrame(all_data)
            df_partial.to_csv(csv_filename, mode='a', header=not os.path.exists(csv_filename), index=False)
            all_data = []  # Clear memory
            print(f"{processed_count} audios processed and saved.")

        # Release the recording and run garbage collection before the next file
        del recording
        gc.collect()

    except Exception as e:
        print(f"Error processing {audio_path}: {e}")

# Save remaining data at the end
if all_data:
    df_partial = pd.DataFrame(all_data)
    df_partial.to_csv(csv_filename, mode='a', header=not os.path.exists(csv_filename), index=False)
    print(f"Saved remaining {len(all_data)} entries.")

print(f"Embeddings saved to {csv_filename}")


Labels loaded.
load model True
Model loaded.
Labels loaded.
load_species_list_model
Meta model loaded.


INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


Embeddings saved to /teamspace/studios/this_studio/Output_files/Embeddings_from_3sPadding/littlepenguin_embeddings.csv


In [3]:

# Check if the CSV file already exists to retrieve processed audio files
if os.path.exists(csv_filename):
    df_embeddings = pd.read_csv(csv_filename)
    processed_files = set(df_embeddings["file_name"])  # Already processed files
else:
    # Keep the expected columns so the lookups below also work on an empty frame
    df_embeddings = pd.DataFrame(columns=["path", "file_name"])
    processed_files = set()

num_processed_audios = df_embeddings["path"].nunique()
print(f"{num_processed_audios} different audios were successfully processed.")

# Convert the original list of audios to a set
original_audios = set(audios)

# Get the list of processed audios from the DataFrame
processed_audios = set(df_embeddings["path"].unique())

# Identify the audios that were not processed
unprocessed_audios = original_audios - processed_audios

# Display results
print(f"Total audios in the original list: {len(original_audios)}")
print(f"Total audios processed: {len(processed_audios)}")
print(f"Total audios not processed: {len(unprocessed_audios)}")

# List the audios that were not processed (sorted for a stable order)
print("Unprocessed audios:")
print("\n".join(sorted(unprocessed_audios)))


2429 different audios were successfully processed.
Total audios in the original list: 2429
Total audios processed: 2429
Total audios not processed: 0
Unprocessed audios:

