# Human Voice Removal & Audio Chunking

In this notebook, I implemented a preprocessing step to **remove segments of human speech** from the training clips using `webrtcvad`. This was motivated by the fact that **some training audio files contain spoken words**, which are irrelevant and may introduce noise into the feature learning process.
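
As a rough illustration of what the VAD needs as input, here is a minimal sketch. `webrtcvad` accepts only 16-bit mono PCM at 8, 16, 32, or 48 kHz, in frames of 10, 20, or 30 ms. The helper names `frame_length` and `float_to_pcm16` are hypothetical and are not part of the library or of the function used later in this notebook. The clipping step is an added safeguard so that a sample of exactly +1.0 does not overflow `int16`.

```python
import numpy as np

# Sample rates and frame lengths accepted by the WebRTC VAD.
VALID_SAMPLE_RATES = (8000, 16000, 32000, 48000)
VALID_FRAME_MS = (10, 20, 30)

def frame_length(sr, frame_ms):
    """Return samples per VAD frame, or raise if the VAD would reject the setup."""
    if sr not in VALID_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sr}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    return sr * frame_ms // 1000

def float_to_pcm16(frame):
    """Convert float samples in [-1.0, 1.0] to 16-bit PCM bytes.

    Values are clipped to the int16 range so +1.0 maps to 32767
    instead of wrapping around to -32768.
    """
    scaled = np.clip(np.asarray(frame, dtype=np.float64) * 32768.0, -32768, 32767)
    return scaled.astype(np.int16).tobytes()
```

With a 30 ms frame at an assumed 32 kHz sample rate, `frame_length(32000, 30)` gives 960 samples, and the resulting PCM buffer is twice that many bytes.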

### Why Remove Human Voice?

- Prevent model bias toward human speech patterns.
- Focus training on actual wildlife vocalizations.
- Improve model robustness to non-bird interference.

### Chunking

After removing human voice, I **chunked each clip into 5-second segments**, following the competition's submission format. This chunking aligns with how test soundscapes will be evaluated (e.g., one row per 5s segment).
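
The chunking arithmetic can be sketched on its own, assuming the same rule as the function below: only full chunks are kept, and each chunk is labelled by its end time in seconds. `split_into_chunks` is a hypothetical helper used only for this illustration, and the 100 Hz sample rate is a toy value chosen to keep the numbers small.

```python
import numpy as np

def split_into_chunks(audio, sr, chunk_duration_sec=5):
    """Yield (end_time_sec, chunk) pairs for each full chunk; the remainder is dropped."""
    chunk_size = chunk_duration_sec * sr
    total_chunks = len(audio) // chunk_size
    for i in range(total_chunks):
        start = i * chunk_size
        yield (i + 1) * chunk_duration_sec, audio[start:start + chunk_size]

# 12 seconds of silence at a toy sample rate of 100 Hz: two full chunks,
# and the trailing 2 seconds are discarded.
chunks = list(split_into_chunks(np.zeros(1200, dtype=np.float32), sr=100))
print([end for end, _ in chunks])  # [5, 10]
```

The end times become the filename suffixes, so a clip named `X` would produce `X_5`, `X_10`, and so on.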

### Final Decision

This cleaned and chunked data was used to train the model. Excluding human voice **improved signal clarity** and helped the model generalize better to natural soundscapes.



In [None]:
import os
import numpy as np
import webrtcvad

def trim_and_chunk_files(
    input_dir='processed/raw_waveforms',
    output_dir='processed/chunks',
    chunk_duration_sec=5,
    frame_duration_ms=30,
    aggressiveness=2,
    verbose=True
):
    os.makedirs(output_dir, exist_ok=True)
    vad = webrtcvad.Vad(aggressiveness)

    for fname in sorted(os.listdir(input_dir)):
        if not fname.endswith('.npz'):
            continue

        input_path = os.path.join(input_dir, fname)
        data = np.load(input_path)
        y = data['waveform']
        sr = int(data['sample_rate'])

        # webrtcvad only accepts 8, 16, 32, or 48 kHz audio in 10, 20, or 30 ms frames.
        frame_size = int(sr * frame_duration_ms / 1000)
        trimmed_audio = []

        # Step 1: Remove speech frames
        for i in range(0, len(y), frame_size):
            frame = y[i:i + frame_size]

            # The VAD only accepts frames of exactly frame_size samples, so a
            # shorter final frame (e.g. only 15 ms left) is zero-padded for
            # detection only. The original, unpadded frame is what gets kept.
            vad_frame = frame
            if len(frame) < frame_size:
                vad_frame = np.pad(frame, (0, frame_size - len(frame)), mode='constant')

            # webrtcvad expects 16-bit PCM (pulse-code modulation) audio, where each
            # sample is an integer from -32768 to +32767. The stored waveform is
            # floating point in [-1.0, +1.0] (as returned by librosa.load()), so
            # scale by 32768 (2**15), clip to the int16 range so that +1.0 does
            # not wrap around, then cast to int16.
            pcm = np.clip(vad_frame * 32768, -32768, 32767).astype(np.int16).tobytes()

            if not vad.is_speech(pcm, sr):
                trimmed_audio.extend(frame)

        trimmed_audio = np.array(trimmed_audio, dtype=np.float32)
        chunk_size = chunk_duration_sec * sr
        # Any leftover audio that doesn't make a full 5s chunk 
        # (e.g. last 2 seconds) is not saved or processed at all.
        total_chunks = len(trimmed_audio) // chunk_size

        if verbose:
            print(f"{fname}: trimmed {len(y)} → {len(trimmed_audio)} samples → {total_chunks} chunks")

        # Step 2: Save each chunk separately with proper time-based naming
        base = os.path.splitext(fname)[0]  # without .npz

        for i in range(total_chunks):
            start = i * chunk_size
            end = start + chunk_size
            chunk = trimmed_audio[start:end]
            end_time = (i + 1) * chunk_duration_sec
            out_name = f"{base}_{end_time}.npz"
            out_path = os.path.join(output_dir, out_name)
            np.savez_compressed(out_path, y=chunk, sr=sr)

            if verbose:
                print(f"  Saved chunk: {out_name}")


In [19]:
trim_and_chunk_files()

CSA18786.npz: trimmed 2273804 → 169920 samples → 1 chunks
  Saved chunk: CSA18786_5.npz
CSA35130.npz: trimmed 6239441 → 1178880 samples → 7 chunks
  Saved chunk: CSA35130_5.npz
  Saved chunk: CSA35130_10.npz
  Saved chunk: CSA35130_15.npz
  Saved chunk: CSA35130_20.npz
  Saved chunk: CSA35130_25.npz
  Saved chunk: CSA35130_30.npz
  Saved chunk: CSA35130_35.npz
CSA35146.npz: trimmed 8881899 → 4966080 samples → 31 chunks
  Saved chunk: CSA35146_5.npz
  Saved chunk: CSA35146_10.npz
  Saved chunk: CSA35146_15.npz
  Saved chunk: CSA35146_20.npz
  Saved chunk: CSA35146_25.npz
  Saved chunk: CSA35146_30.npz
  Saved chunk: CSA35146_35.npz
  Saved chunk: CSA35146_40.npz
  Saved chunk: CSA35146_45.npz
  Saved chunk: CSA35146_50.npz
  Saved chunk: CSA35146_55.npz
  Saved chunk: CSA35146_60.npz
  Saved chunk: CSA35146_65.npz
  Saved chunk: CSA35146_70.npz
  Saved chunk: CSA35146_75.npz
  Saved chunk: CSA35146_80.npz
  Saved chunk: CSA35146_85.npz
  Saved chunk: CSA35146_90.npz
  Saved chunk: CSA35

In [6]:
import IPython.display as ipd
# Load a saved 5-second chunk
npz_path = '/Users/istiak/Desktop/birdCLEF2025/processed/chunks/CSA18786_5.npz'
data = np.load(npz_path)
y = data['y']
sr = int(data['sr'])

# Play the chunk (speech frames removed)
ipd.Audio(y, rate=sr)

In [7]:
# Load another saved 5-second chunk
npz_path = '/Users/istiak/Desktop/birdCLEF2025/processed/chunks/iNat210020_80.npz'
data = np.load(npz_path)
y = data['y']
sr = int(data['sr'])

# Play the chunk (speech frames removed)
ipd.Audio(y, rate=sr)