# Audio Data Cleaning Pipeline

This notebook implements the same preprocessing pipeline as the original codebase, adapted for audio datasets hosted on HuggingFace.

## Pipeline Overview

1. Loading multiple subsets from datasets on HuggingFace
2. Combining the dataset
3. Splitting into train/validation/test sets
4. Resampling audio to 16kHz
5. Processing audio into uniform 1-second chunks
6. Normalizing audio to consistent volume level
7. Filtering invalid or silent clips
8. Pushing processed dataset back to HuggingFace

**Note:** This notebook is only a preprocessing assistant tool for model training and will not be used in the production environment. The final system will be deployed on a Raspberry Pi-like device for continuous environmental sound monitoring.

## Setup & Imports

First, we import the necessary libraries for audio processing and dataset handling.

In [None]:
# Install required packages
# !pip install datasets

In [None]:
import numpy as np
import random
from datasets import load_dataset, DatasetDict, Audio
import logging

In [None]:
# ID of the main dataset repository on HuggingFace
repo_id = "username/my_test_audio_dataset"

# Names of the configurations (subsets) to load from the repository
config_names = [

]

# Holds each loaded subset, keyed by configuration name
all_subsets = {}

## Processing Parameters

Define the key parameters for audio processing and parallelization: the target sample rate, the chunk length, the normalization level, and the number of worker processes.

In [None]:
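# Configuration settings
# Audio processing parameters
TARGET_SR = 16000             # target sample rate in Hz
CHUNK_DURATION_MS = 1000      # chunk duration in milliseconds
CHUNK_LENGTH = int(TARGET_SR * CHUNK_DURATION_MS / 1000)  # samples per chunk

NORMALIZATION_DB = -20.0      # target RMS level in dB relative to full scale

# Number of worker processes for Dataset.map (adjust to the available CPU cores)
NUM_PROC = 48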
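
As a quick check of what these values imply, the sketch below recomputes the chunk length in samples and converts the normalization target from dB to a linear RMS amplitude. The names `chunk_length` and `target_rms` are illustrative and do not appear elsewhere in the notebook.

```python
# Recompute the derived quantities from the configuration values above.
TARGET_SR = 16000
CHUNK_DURATION_MS = 1000
NORMALIZATION_DB = -20.0

chunk_length = int(TARGET_SR * CHUNK_DURATION_MS / 1000)  # samples per chunk
target_rms = 10 ** (NORMALIZATION_DB / 20)                # linear RMS for -20 dB

print(chunk_length)          # 16000
print(round(target_rms, 3))  # 0.1
```

A 1-second chunk at 16 kHz therefore holds 16000 samples, and a -20 dB target corresponds to an RMS amplitude of 0.1 on a signal scaled to the range [-1, 1].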


## Load and Process HuggingFace Dataset

Load each configuration from the repository. A subset that fails to load is reported and skipped. Modify the repository ID and configuration names as needed.

In [None]:
# Load every configuration listed in config_names
for config in config_names:
    try:
        ds = load_dataset(repo_id, config, split="train", cache_dir='./cache')
        all_subsets[config] = ds
        print(f"Successfully loaded dataset {config} with {ds.num_rows} examples")
    except Exception as e:
        print(f"Error loading {config}: {e}")

all_subsets = DatasetDict(all_subsets)

In [None]:
print(all_subsets)

## Combine Loaded Datasets

After loading each subset individually, we concatenate them into a single unified dataset.

In [None]:
from datasets import concatenate_datasets

datasets_to_concatenate = list(all_subsets.values())

combined_dataset = concatenate_datasets(datasets_to_concatenate)

In [None]:
print(combined_dataset)

## Dataset Splitting

Split the combined dataset into train (90%), validation (5%), and test (5%) sets, stratifying on the label column so that each split keeps the same class distribution.

In [None]:
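# Hold out 10% of the data, then split the held-out part in half:
# 90% train, 5% validation, 5% test.
# Note: stratify_by_column requires "label" to be a ClassLabel feature.
first_split = combined_dataset.train_test_split(test_size=0.10, stratify_by_column="label")
second_split = first_split["test"].train_test_split(test_size=0.50, stratify_by_column="label")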
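
To illustrate what stratification does, here is a minimal pure-Python sketch that is independent of the `datasets` library. `stratified_split` is a hypothetical helper written for this illustration; it is not how `train_test_split` is implemented internally, and the toy label list is made up.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 80% class 0, 20% class 1
labels = [0] * 800 + [1] * 200

# Same two-stage scheme as above: hold out 10%, then halve the held-out part
train_idx, held_idx = stratified_split(labels, test_size=0.10)
held_labels = [labels[i] for i in held_idx]
valid_rel, test_rel = stratified_split(held_labels, test_size=0.50)

print(len(train_idx), len(valid_rel), len(test_rel))  # 900 50 50
print(Counter(labels[i] for i in train_idx))
```

Every split keeps the 80/20 class ratio of the full label list, which is the property the stratified calls above rely on.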

## Save Processed Dataset to HuggingFace

Assemble the splits into the final `DatasetDict`. Once preprocessing is complete, the last cell of the notebook uploads it back to HuggingFace for use in model training, with a maximum shard size of 2GB to keep file sizes manageable.

In [None]:
final_dataset = DatasetDict({
    "train": first_split["train"],
    "valid": second_split["train"],
    "test": second_split["test"],
})

In [None]:
print(final_dataset)

## Audio Processing Utilities

Define the utility function for audio normalization. It handles:

1. Audio normalization to a target dB level
2. Detection and skipping of silent clips
3. Prevention of clipping

In [None]:
# Utils
def normalize_audio(signal: np.ndarray, label: int, target_db: float) -> tuple[np.ndarray, bool]:
    """
    Normalize audio to a target RMS level in dB and detect silence.

    Returns (signal, keep_flag). Silent clips are dropped (keep_flag=False)
    when label == 1 (drone) and kept unchanged for any other label.
    """
    # A clip counts as silent if every sample quantizes to zero at 16-bit
    # resolution. This also covers all-zero and near-zero signals.
    signal_int16 = (signal * np.iinfo(np.int16).max).astype(np.int16)
    if np.all(signal_int16 == 0):
        if label == 1:
            logging.warning(f"Skipping silent signal for label {label}")
            return signal, False
        # Silent clips with other labels are kept without normalization
        return signal, True

    try:
        # Calculate current RMS and dB
        rms = np.sqrt(np.mean(signal ** 2))
        current_db = 20 * np.log10(max(rms, 1e-10))

        # Calculate gain needed
        gain_db = target_db - current_db
        gain_factor = 10 ** (gain_db / 20)

        # Apply gain
        normalized_signal = signal * gain_factor

        # Prevent clipping if needed
        if np.max(np.abs(normalized_signal)) > 0.95:  # Using 0.95 as a safety margin
            normalized_signal = 0.95 * normalized_signal / np.max(np.abs(normalized_signal))

        # Validate result
        if not np.isfinite(normalized_signal).all():
            logging.warning("Invalid normalized signal")
            return signal, False

        return normalized_signal, True

    except Exception as e:
        logging.warning(f"Normalization failed: {e}")
        return signal, False
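
The gain calculation can be followed on a concrete signal. This is a minimal sketch that repeats the same RMS and gain arithmetic on a synthetic sine wave; the 440 Hz test tone and the variable names are assumptions made for illustration only.

```python
import numpy as np

TARGET_DB = -20.0  # same value as NORMALIZATION_DB above

# One second of a 440 Hz sine at 16 kHz with peak amplitude 0.5
t = np.arange(16000) / 16000
signal = 0.5 * np.sin(2 * np.pi * 440 * t)

rms = np.sqrt(np.mean(signal ** 2))
current_db = 20 * np.log10(rms)                      # about -9.03 dB
gain_factor = 10 ** ((TARGET_DB - current_db) / 20)  # linear gain below 1
normalized = signal * gain_factor

new_db = 20 * np.log10(np.sqrt(np.mean(normalized ** 2)))
peak = np.max(np.abs(normalized))
print(f"{current_db:.2f} dB -> {new_db:.2f} dB, peak {peak:.3f}")
```

Here the peak stays well below 0.95, so the clipping guard would not trigger. For signals with a high peak-to-RMS ratio the guard rescales the peak to 0.95, and the resulting RMS level then ends up below the target.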


## Chunk Processing Function

This function is the core of the preprocessing pipeline. It processes audio in batches and:

1. Skips clips shorter than half a chunk and pads other short clips with zeros
2. Splits longer clips into non-overlapping 1-second chunks
3. Normalizes audio volume
4. Filters out invalid or silent clips
5. Preserves original labels

In [None]:
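def process_chunk(batch):
    """
    Takes a batch of resampled, mono audio arrays and, for each one:
    - If shorter than half of CHUNK_LENGTH: skips it
    - If no longer than CHUNK_LENGTH: pads with zeros at a random position
    - If longer than CHUNK_LENGTH: splits into non-overlapping chunks and drops the trailing remainder
    - Normalizes audio to target dB level
    - Filters out silent drone clips
    Returns a batch of fixed-size clips with their labels.
    """
    try:
        all_audio = batch["audio"]
        all_labels = batch["label"]
    except Exception as e:
        logging.warning(f"Error loading batch: {e}")
        # Return an empty batch so the names above are never used undefined
        return {"audio": [], "label": []}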

    chunk_clips = []
    chunk_labels = []

    try:
        for audio, label in zip(all_audio, all_labels):
            try:
                y = audio["array"]
                sr = audio["sampling_rate"]
                n = len(y)

                # Validate audio format; invalid samples are skipped
                if y.ndim != 1:
                    logging.error(f"Skipping sample: {audio['path']} with invalid number of channels {y.ndim}")
                    continue

                if sr != TARGET_SR:
                    logging.error(f"Skipping sample: {audio['path']} with invalid sampling rate {sr}")
                    continue

                # Handle audio file shorter than CHUNK_LENGTH
                if n <= CHUNK_LENGTH:
                    if n < CHUNK_LENGTH // 2:
                        logging.warning(f"Skipping sample: {audio['path']} with too short length {len(y)}")
                        continue

                    # Pad short signals with zeros, placing the signal at a random offset
                    pad_width = CHUNK_LENGTH - n
                    pad_left = random.randint(0, pad_width)
                    pad_right = pad_width - pad_left
                    y_pad = np.pad(y, (pad_left, pad_right), mode="constant")

                    # Normalize and add if valid
                    normalized_signal, norm_success = normalize_audio(y_pad, label, NORMALIZATION_DB)
                    if norm_success:
                        chunk_clips.append({
                            "array": normalized_signal,
                            "sampling_rate": sr
                        })
                        chunk_labels.append(label)


                # Handle longer clips (chunk into segments)
                else:
                    num_full_chunks = n // CHUNK_LENGTH
                    for i in range(num_full_chunks):
                        start = i * CHUNK_LENGTH
                        end = start + CHUNK_LENGTH
                        chunk_data = y[start:end]

                        # Normalize and add if valid
                        normalized_signal, norm_success = normalize_audio(chunk_data, label, NORMALIZATION_DB)
                        if norm_success:
                            chunk_clips.append({
                                "array": normalized_signal,
                                "sampling_rate": sr
                            })
                            chunk_labels.append(label)

            except Exception as e:
                logging.warning(f"Error processing individual sample {audio['path']}: {e}")
                continue

        return {
            "audio": chunk_clips,
            "label": chunk_labels,
        }

    except Exception as e:
        logging.error(f"Error processing batch: {e}")
        return {
            "audio": [],
            "label": []
        }
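
The length handling can be tried in isolation on plain NumPy arrays. The following is a minimal sketch under the same thresholds as `process_chunk`; `split_into_chunks` is a hypothetical helper for illustration, and it leaves out normalization, labels, and logging.

```python
import numpy as np

CHUNK_LENGTH = 16000  # 1 second at 16 kHz, as configured above

def split_into_chunks(y, chunk_length=CHUNK_LENGTH, rng=None):
    """Hypothetical helper: return a list of fixed-length chunks for one clip."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)
    if n < chunk_length // 2:
        return []  # shorter than half a chunk: skipped
    if n <= chunk_length:
        # Pad with zeros, placing the signal at a random offset
        pad_width = chunk_length - n
        pad_left = int(rng.integers(0, pad_width + 1))  # upper bound is exclusive
        return [np.pad(y, (pad_left, pad_width - pad_left), mode="constant")]
    # Non-overlapping full chunks; the trailing remainder is dropped
    return [y[i * chunk_length:(i + 1) * chunk_length] for i in range(n // chunk_length)]

for n in (5000, 10000, 16000, 40000):
    chunks = split_into_chunks(np.ones(n, dtype=np.float32))
    print(n, len(chunks), [len(c) for c in chunks])
```

A 40000-sample clip yields two chunks and loses its last 8000 samples, while a 5000-sample clip yields none.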


## Running Preprocessing Loop

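Now we process each split in parallel with the HuggingFace datasets library. The audio column is first cast to the target sample rate, so audio is resampled when it is decoded. The mapping then applies our preprocessing to each audio file, resulting in a dataset with uniform 1-second chunks at a consistent volume level.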

In [None]:
# Cast the audio column so audio is resampled to the target sample rate
final_dataset = final_dataset.cast_column("audio", Audio(sampling_rate=TARGET_SR, mono=True))
# Run the audio preprocessing
final_dataset = final_dataset.map(
    process_chunk,
    batched=True,
    batch_size=500,
    num_proc=NUM_PROC,
    remove_columns=final_dataset["train"].column_names
)
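
After mapping, it can be worth spot-checking that every clip has the expected shape. This is a minimal sketch, assuming clips are dicts with `array` and `sampling_rate` keys like those built in `process_chunk`. `check_clips` is a hypothetical helper, and the toy clips below stand in for rows taken from `final_dataset`.

```python
import numpy as np

TARGET_SR = 16000
CHUNK_LENGTH = 16000

def check_clips(clips, chunk_length=CHUNK_LENGTH, target_sr=TARGET_SR):
    """Return indices of clips with a wrong length, sample rate, or non-finite samples."""
    bad = []
    for i, clip in enumerate(clips):
        array = np.asarray(clip["array"])
        ok = (
            array.ndim == 1
            and len(array) == chunk_length
            and clip["sampling_rate"] == target_sr
            and bool(np.isfinite(array).all())
        )
        if not ok:
            bad.append(i)
    return bad

clips = [
    {"array": np.zeros(16000), "sampling_rate": 16000},  # valid
    {"array": np.zeros(8000), "sampling_rate": 16000},   # wrong length
    {"array": np.zeros(16000), "sampling_rate": 44100},  # wrong sample rate
]
print(check_clips(clips))  # [1, 2]
```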


## Class Imbalance Analysis

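Check the class distribution to understand the balance between drone (label 1) and non-drone (label 0) sounds in the dataset. The counts are taken from `combined_dataset`, that is, from the original clips before chunking and filtering. This helps determine whether balancing techniques are needed during model training.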

In [None]:
drone_label_counts = combined_dataset.filter(lambda example: example['label'] == 1).num_rows
no_drone_label_counts = combined_dataset.filter(lambda example: example['label'] == 0).num_rows

total = combined_dataset.num_rows
minority_share = min(drone_label_counts, no_drone_label_counts) / total
imb_ratio      = max(drone_label_counts, no_drone_label_counts) / min(drone_label_counts, no_drone_label_counts)

print(f"Minority share: {minority_share:.3%}")
print(f"Imbalance ratio IR: {imb_ratio:.1f}:1")

In [None]:
print(f"Drone label count: {drone_label_counts}")
print(f"No Drone label count: {no_drone_label_counts}")
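
The same statistics can be computed in a single pass over the labels. This is a minimal sketch using `collections.Counter` on a plain list; `imbalance_stats` is a hypothetical helper and the toy labels are made up. In the notebook, the list would come from the label column of the dataset, which is an assumption about how the labels are extracted.

```python
from collections import Counter

def imbalance_stats(labels):
    """Return (counts, minority_share, imbalance_ratio) for a list of class labels."""
    counts = Counter(labels)
    minority = min(counts.values())
    majority = max(counts.values())
    return counts, minority / sum(counts.values()), majority / minority

# Toy labels: 250 drone clips (1) and 750 non-drone clips (0)
labels = [1] * 250 + [0] * 750
counts, minority_share, imb_ratio = imbalance_stats(labels)

print(f"Minority share: {minority_share:.3%}")   # 25.000%
print(f"Imbalance ratio IR: {imb_ratio:.1f}:1")  # 3.0:1
```

Counting labels directly avoids running two separate filters over the dataset, which matters when each filtered example would otherwise have its audio decoded.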

In [None]:
final_dataset.push_to_hub('username/my_test_audio_dataset', commit_message="processed dataset", max_shard_size="2GB")