## Notebook Description

This notebook **analyzes audio file durations and adds silence padding to make them compatible with BirdNET processing**.

**Main tasks:**

1. **Dataset selection**: Choose from 7 available bird datasets (chiffchaff-fg, littleowl-fg, pipit-fg, littlepenguin-display_call-exhale, rtbc-begging, Great tit, Great Kiwi)

2. **Duration analysis**: 
   - Recursively scans all WAV files in the selected dataset
   - Calculates and displays audio duration statistics (mean, min, max)
   - Identifies files shorter than 3 seconds requiring padding

3. **Padding calculation**: Computes silence needed to make each file's duration a multiple of 3 seconds (BirdNET requirement)

4. **Batch processing**: 
   - Adds calculated silence to the end of each audio file
   - Preserves original folder structure in output directory
   - Skips already processed files to avoid duplication
   - Provides progress updates every 200 files

5. **Output management**: 
   - Saves duration analysis as CSV with padding times
   - Stores padded audio files in organized output folders
   - Catches per-file errors so one bad file does not stop the batch, and frees each file's arrays after processing

**Result**: All audio files are padded to a duration that is a multiple of 3 seconds, ready for BirdNET, with the dataset's folder structure preserved and a CSV of durations and padding times.
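
As a worked example of the padding rule in step 3, the sketch below computes the silence needed for a few durations. The helper name `padding_needed` and the constant `SEGMENT_SECONDS` are illustrative and not part of the notebook. The notebook itself uses `np.ceil`; `math.ceil` is swapped in here so the snippet needs only the standard library.

```python
import math

SEGMENT_SECONDS = 3  # segment length this notebook pads to (assumed constant name)


def padding_needed(duration: float, segment: float = SEGMENT_SECONDS) -> float:
    """Return the seconds of silence needed to reach the next multiple of `segment`."""
    return math.ceil(duration / segment) * segment - duration


# Two durations from the Kiwi dataset output, plus an exact multiple of 3.
for d in (28.616, 13.448, 30.0):
    print(f"{d:.3f} s -> pad {padding_needed(d):.3f} s")
```

A file that is already an exact multiple of 3 seconds gets no padding, so running the calculation again on padded files should return zero, up to floating-point error.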

# Select the dataset to process. It can be any of the following 7:

- chiffchaff-fg
- littleowl-fg
- pipit-fg
- littlepenguin-display_call-exhale
- rtbc-begging
- Great tit
- Great Kiwi

To analyze a different dataset, change the `selected_folder` variable.
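
One way to make this choice less error-prone is to look the dataset up by name, so that a typo fails immediately instead of silently scanning a folder that does not exist. The sketch below is hypothetical: the `DATASET_FOLDERS` mapping, the `dataset_path` helper, and the example project root are not in the notebook, although the folder names are the ones this notebook uses under `Original_datasets`.

```python
from pathlib import Path

# Display name -> folder under Original_datasets (folder names as used in this notebook).
DATASET_FOLDERS = {
    "chiffchaff-fg": "chiffchaff-fg",
    "littleowl-fg": "littleowl-fg",
    "pipit-fg": "pipit-fg",
    "littlepenguin-display_call-exhale": "littlepenguin-display_call-exhale",
    "rtbc-begging": "rtbc-begging",
    "Great tit": "greatTit_song-files",
    "Great Kiwi": "KiwiTrimmed",
}


def dataset_path(project_root: Path, name: str) -> Path:
    """Resolve a dataset name to its folder, failing loudly on unknown names."""
    if name not in DATASET_FOLDERS:
        raise ValueError(f"Unknown dataset {name!r}; choose from {sorted(DATASET_FOLDERS)}")
    return project_root / "Original_datasets" / DATASET_FOLDERS[name]


# Hypothetical project root, used only to show the resulting path.
selected_folder = dataset_path(Path("/path/to/project"), "Great Kiwi")
print(selected_folder)
```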


In [1]:
"""
Dependencies:
- pathlib
- librosa
- soundfile
- numpy
- pandas
- gc

Usage:
- Set 'selected_folder' to the dataset path.
- Run to print file durations and pad files as needed.
"""



import librosa
import pandas as pd
from pathlib import Path
import gc
import soundfile as sf
import numpy as np


# Get the current working directory
cwd = Path.cwd()
project_root = cwd.parents[1]

# Paths to the available datasets.
chiffchaff_fg_audios = project_root / 'Original_datasets' / 'chiffchaff-fg'
littleowl_fg_audios = project_root / 'Original_datasets' / 'littleowl-fg'
pipit_fg_audios = project_root / 'Original_datasets' / 'pipit-fg'
littlepenguin_audios = project_root / 'Original_datasets' / 'littlepenguin-display_call-exhale'
rtbc_begging_audios = project_root / 'Original_datasets' / 'rtbc-begging'
greatTit_audios = project_root / 'Original_datasets' / 'greatTit_song-files'
kiwi_audios = project_root / 'Original_datasets' / 'KiwiTrimmed'


# Select here which dataset to analyze.
selected_folder = kiwi_audios

audio_lengths = []
for file_path in sorted(selected_folder.rglob('*.wav'), key=lambda x: x.name):
    if file_path.is_file():
        audio, sr = librosa.load(str(file_path), sr=None)
        length_seconds = librosa.get_duration(y=audio, sr=sr)
        audio_lengths.append((file_path.name, length_seconds))

df_lengths = pd.DataFrame(audio_lengths, columns=['File', 'Length (s)'])
print(df_lengths)


print("Average duration:", df_lengths['Length (s)'].mean(), "seconds")
print("Maximum duration:", df_lengths['Length (s)'].max(), "seconds")
print("Minimum duration:", df_lengths['Length (s)'].min(), "seconds")

print("Summary statistics for audio durations (seconds):")
print(df_lengths['Length (s)'].describe().to_frame().rename(columns={'Length (s)': 'Duration (s)'}))

                                                  File  Length (s)
0                         10_2020_10_14_21_07_trim.wav   28.616000
1                         10_2020_10_15_21_57_trim.wav   28.648000
2                          11_2020_10_12_2_07_trim.wav   29.768000
3                         11_2020_10_13_21_08_trim.wav   31.656000
4                          11_2020_10_2_22_58_trim.wav   31.016000
..                                                 ...         ...
450  Hawdon-RangiNest201213_20121108_004503_6-GSKfe...   23.784000
451  Hawdon-RangiNest201213_20121109_013002_149-GSK...   19.848000
452  Hawdon-RangiNest201213_20121228_214502_790-GSK...   26.152125
453  Hawdon-RangiNest201213_20121229_021501_233-GSK...   27.496125
454  Hawdon-RangiNest201415_20150114_021502_138-GSK...   27.464125

[455 rows x 2 columns]
Average duration: 27.64113186813187 seconds
Maximum duration: 40.136 seconds
Minimum duration: 13.448 seconds
Summary statistics for audio durations (seconds):
       Durat
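
The scan above decodes every file in full just to measure it. If the files are plain PCM WAV (an assumption, since the standard-library `wave` module rejects other encodings), the duration can be read from the header alone. This sketch swaps `librosa` for `wave`. `wav_duration` and `scan_durations` are illustrative names, and the demo file is synthetic.

```python
import tempfile
import wave
from pathlib import Path


def wav_duration(path: Path) -> float:
    """Duration in seconds from the WAV header: frame count / sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def scan_durations(folder: Path) -> list[tuple[str, float]]:
    """Recursively collect (file name, duration) pairs, sorted by file name."""
    files = sorted(folder.rglob("*.wav"), key=lambda p: p.name)
    return [(p.name, wav_duration(p)) for p in files if p.is_file()]


# Demo on a synthetic file: 8000 frames of silence at 8000 Hz is 1 second.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.wav"
    with wave.open(str(demo), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(8000)
        wf.writeframes(b"\x00\x00" * 8000)
    durations = scan_durations(Path(tmp))
print(durations)
```

The resulting list of pairs can be passed to `pd.DataFrame(..., columns=['File', 'Length (s)'])` in the same way as `audio_lengths`.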

In [2]:
"""
This cell analyzes a DataFrame of audio file lengths.
It counts how many audio files are shorter than 3 seconds and prints that number. 
Then, it filters the DataFrame to show only those short audio files and prints the resulting subset. 
This helps identify and review audio files that may be too brief for further processing or analysis.
"""

num_short_audios = (df_lengths['Length (s)'] < 3).sum()
print(f"Number of audio files shorter than 3 seconds: {num_short_audios}")

short_audios_df = df_lengths[df_lengths['Length (s)'] < 3]
print(short_audios_df)


Number of audio files shorter than 3 seconds: 0
Empty DataFrame
Columns: [File, Length (s)]
Index: []


In [3]:
"""
This cell computes the silence padding each file needs and saves the result.
It adds a `Padding_Time` column holding the seconds required to reach the next multiple of 3,
uses the selected folder's name to generate a CSV filename, builds the output directory path,
and saves the DataFrame `df_lengths` as a CSV file without the index.
"""

# Function to calculate how much is needed to reach the next multiple of 3
def time_to_next_multiple(duration):
    next_multiple = np.ceil(duration / 3) * 3  # Find the next multiple of 3
    return next_multiple - duration  # Amount needed to reach the multiple

# Apply the function to the "Length (s)" column and create a new column "Padding_Time"
df_lengths["Padding_Time"] = df_lengths['Length (s)'].apply(time_to_next_multiple)

# Preview the DataFrame with the additional column
print(df_lengths.head())

folder_name = selected_folder.name

# Build the output file name and path
output_dir = project_root / 'Output_files' / 'audio_durations'
output_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
file_name = f"{folder_name}_durations_with_padding.csv"
output_path = output_dir / file_name

df_lengths.to_csv(output_path, index=False)
print(f"Audio durations with padding saved to: {output_path}")

                           File  Length (s)  Padding_Time
0  10_2020_10_14_21_07_trim.wav      28.616         1.384
1  10_2020_10_15_21_57_trim.wav      28.648         1.352
2   11_2020_10_12_2_07_trim.wav      29.768         0.232
3  11_2020_10_13_21_08_trim.wav      31.656         1.344
4   11_2020_10_2_22_58_trim.wav      31.016         1.984
Audio durations with padding saved to: /teamspace/studios/this_studio/Output_files/audio_durations/KiwiTrimmed_durations_with_padding.csv
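
Because `Padding_Time` is a float, converting it back to a sample count later can come out one sample short. A possible alternative, sketched here under the assumption that each file's frame count and sample rate are known, is to compute the padding directly in samples. `padding_samples` is a hypothetical helper, and the 8000 Hz rate in the example is assumed for illustration rather than read from the dataset.

```python
def padding_samples(n_frames: int, sample_rate: int, segment_seconds: int = 3) -> int:
    """Return how many silent samples are needed to reach the next whole segment."""
    segment_frames = sample_rate * segment_seconds
    # Modulo of a negative number in Python gives the distance up to the next multiple.
    return (-n_frames) % segment_frames


# 28.616 s at an assumed 8000 Hz is 228928 frames.
pad = padding_samples(228_928, 8_000)
print(pad, pad / 8_000)
```

Working in integers keeps the padded length an exact multiple of the segment size, with no rounding step.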


In [4]:

"""
This cell processes a batch of audio files by appending silence to each file based on 
its specified padding time.

It ensures that all pending audio files are correctly padded and saved, 
avoiding reprocessing of already processed files.

"""

base_input_folder = project_root / 'Original_datasets' / selected_folder.name
base_output_folder = project_root / 'Output_files' / 'Extended_audios' / selected_folder.name

df_durations = df_lengths.copy()
# df_sample = df_durations.head(14)

# Get only the names of the processed files
processed_files = set(f.name for f in base_output_folder.rglob("*.wav"))
# Filter the pending audio files by file name
df_pending = df_durations[~df_durations["File"].isin(list(processed_files))]

start_index = df_pending.index.min()

print(f"🔹 Processed audios: {len(processed_files)}")
print(f"🔹 Pending audios to process: {len(df_pending)}")
print(f"🔹 Starting from index: {start_index}")

for idx, row in df_pending.iterrows():
    # Search for the file by name in all subfolders
    matches = list(base_input_folder.rglob(row["File"]))
    if not matches:
        print(f"File not found: {row['File']}")
        continue
    file_path = matches[0]

    padding_time = row["Padding_Time"]

    # Calculate the relative path with respect to the base folder
    relative_path = file_path.relative_to(base_input_folder)
    output_path = base_output_folder / relative_path

    # Create the destination folder if it doesn't exist
    output_path.parent.mkdir(parents=True, exist_ok=True)

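    try:
        audio, sr = sf.read(str(file_path))
        # Round instead of truncating so float error cannot drop a sample
        n_silence = int(round(sr * padding_time))
        silence = np.zeros((n_silence,) + audio.shape[1:])
        padded_audio = np.concatenate([audio, silence])
        sf.write(str(output_path), padded_audio, sr)
        if (idx - start_index + 1) % 200 == 0:
            print(f"🔹 {idx} audios saved so far...")
        # Free the arrays here: after a failed read they would not exist,
        # and deleting them outside the try block would raise NameError
        del audio, silence, padded_audio
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
    gc.collect()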
print("All audio files processed and saved with padding.")



🔹 Processed audios: 0
🔹 Pending audios to process: 455
🔹 Starting from index: 0


🔹 199 audios saved so far...
🔹 399 audios saved so far...
All audio files processed and saved with padding.
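
After a batch run it can be worth confirming that every output file really is a whole number of 3-second segments. The check below is a sketch and not part of the notebook: `find_unpadded` and `write_silence` are hypothetical helpers, the standard-library `wave` module is used (so PCM WAV output is assumed), and the two demo files are synthetic.

```python
import tempfile
import wave
from pathlib import Path


def find_unpadded(folder: Path, segment_seconds: int = 3) -> list[str]:
    """Return names of WAV files whose length is not a whole number of segments."""
    unpadded = []
    for path in sorted(folder.rglob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            segment_frames = wf.getframerate() * segment_seconds
            if wf.getnframes() % segment_frames != 0:
                unpadded.append(path.name)
    return unpadded


def write_silence(path: Path, n_frames: int, sample_rate: int = 8000) -> None:
    """Write a mono 16-bit WAV of silence (demo helper)."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * n_frames)


with tempfile.TemporaryDirectory() as tmp:
    write_silence(Path(tmp) / "padded.wav", 24_000)  # exactly 3 s
    write_silence(Path(tmp) / "short.wav", 8_000)    # 1 s, still needs padding
    problems = find_unpadded(Path(tmp))
print(problems)
```

Pointing `find_unpadded` at the padded output folder and getting an empty list would indicate that the batch completed as intended.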
