# Audio Preprocessing Pipeline - Interactive Walkthrough

**Purpose:** Understand and test audio preprocessing methods for archival speech-to-text.

**What you'll do:**
1. Review audio quality analysis results from 1000 VHP test files
2. Step through preprocessing functions on a single example
3. Compare raw vs. preprocessed audio (spectrograms, metrics, transcriptions)
4. Test different preprocessing variants as a "playground"

**Preprocessing Methods:**
- **Loudness Normalization** - Normalize to a target integrated loudness (-16 LUFS in this notebook)
- **High-Pass Filter** - Remove low-frequency rumble (<80 Hz)
- **Noise Reduction** - Suppress stationary background noise with the `noisereduce` library
- **EQ High-Frequency Boost** - Enhance bandwidth-limited audio (TODO)

**For production batch processing:** Use `scripts/preprocess_audio.py` (functions are decoupled from this notebook)

**For STT inference:** Use existing `infer_*.ipynb` notebooks (they support custom blob prefixes for preprocessed audio)

In [None]:
import sys
sys.path.append("../scripts")

import torch
import torchaudio
import numpy as np
import pandas as pd
from pathlib import Path
import pyloudnorm as pyln
from scipy import signal
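import librosa
import librosa.display  # explicit import; older librosa versions do not expose the submodule automatically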
import matplotlib.pyplot as plt
from tqdm import tqdm
import io

# Project utilities
from cloud.azure_utils import download_blob_to_memory

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path='../credentials/creds.env')

---

## Section 1: Load & Review Analysis Results

Review the audio quality analysis to understand what preprocessing is needed.
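
The `issues` and `recommended_preprocessing` columns hold one list of labels per file, so counting labels means flattening those lists first. As a minimal sketch (the toy DataFrame and its audio IDs are made up for illustration and stand in for `df_success`), pandas `explode()` can do the flattening:

```python
import pandas as pd

# Toy stand-in for df_success; IDs and labels are illustrative only.
toy = pd.DataFrame({
    "audio_id": ["a", "b", "c"],
    "issues": [["low_loudness"], [], ["low_loudness", "high_noise_snr"]],
})

# explode() emits one row per list element; an empty list becomes NaN,
# which value_counts() ignores by default.
issue_counts = toy["issues"].explode().value_counts()

# Share of files affected by each issue, in percent.
issue_pct = (issue_counts / len(toy) * 100).round(1)

print(issue_counts.to_dict())
print(issue_pct.to_dict())
```

This gives the same tallies as a `Counter` over a flattened list, while keeping the result as a Series that can be joined or plotted directly.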

In [None]:
# Load analysis results
analysis_path = Path('../learnings/audio_quality_analysis/audio_quality_analysis.parquet')
df_analysis = pd.read_parquet(analysis_path)

print(f"Loaded analysis for {len(df_analysis)} files")
print(f"\nColumns: {df_analysis.columns.tolist()}")
print(f"\nSuccess rate: {(df_analysis['status'] == 'success').sum()} / {len(df_analysis)}")

# Show first few rows
df_analysis.head(3)

In [None]:
# Summary statistics for successful files
df_success = df_analysis[df_analysis['status'] == 'success']

print("=" * 70)
print("AUDIO QUALITY METRICS SUMMARY")
print("=" * 70)
print(df_success[[
    'snr_db', 
    'spectral_rolloff_hz', 
    'spectral_flatness',
    'spectral_centroid_hz',
    'zcr_mean',
    'loudness_lufs',
    'low_freq_energy_ratio'
]].describe())

In [None]:
# Distribution of issues
from collections import Counter

# Flatten all issues
all_issues = [issue for issues in df_success['issues'] for issue in issues]
issue_counts = Counter(all_issues)

print("=" * 70)
print("ISSUE DISTRIBUTION")
print("=" * 70)
for issue, count in issue_counts.most_common():
    pct = count / len(df_success) * 100
    print(f"{issue:<30s}: {count:4d} files ({pct:5.1f}%)")

print(f"\nFiles with NO issues: {(df_success['issues'].str.len() == 0).sum()}")
print(f"Files with issues: {(df_success['issues'].str.len() > 0).sum()}")

In [None]:
# Distribution of preprocessing recommendations
all_recs = [rec for recs in df_success['recommended_preprocessing'] for rec in recs]
rec_counts = Counter(all_recs)

print("=" * 70)
print("PREPROCESSING RECOMMENDATIONS")
print("=" * 70)
for rec, count in rec_counts.most_common():
    pct = count / len(df_success) * 100
    print(f"{rec:<30s}: {count:4d} files ({pct:5.1f}%)")

In [None]:
# Sample representative files for each issue category
print("=" * 70)
print("SAMPLE FILES FOR TESTING")
print("=" * 70)

# Bandwidth-limited severe
bandwidth_severe = df_success[df_success['issues'].apply(lambda x: 'bandwidth_limited_severe' in x)]
if len(bandwidth_severe) > 0:
    sample = bandwidth_severe.iloc[0]
    print(f"\nBandwidth-limited (severe): {sample['audio_id']}")
    print(f"  Rolloff: {sample['spectral_rolloff_hz']:.0f} Hz")
    print(f"  Centroid: {sample['spectral_centroid_hz']:.0f} Hz")

# High noise
high_noise = df_success[df_success['issues'].apply(lambda x: 'high_noise_snr' in x or 'high_noise_zcr' in x)]
if len(high_noise) > 0:
    sample = high_noise.iloc[0]
    print(f"\nHigh noise: {sample['audio_id']}")
    print(f"  SNR: {sample['snr_db']:.1f} dB")
    print(f"  ZCR: {sample['zcr_mean']:.4f}")

# Low loudness
low_loud = df_success[df_success['issues'].apply(lambda x: 'low_loudness' in x)]
if len(low_loud) > 0:
    sample = low_loud.iloc[0]
    print(f"\nLow loudness: {sample['audio_id']}")
    print(f"  Loudness: {sample['loudness_lufs']:.1f} LUFS")

# Good quality (no issues)
good_quality = df_success[df_success['issues'].str.len() == 0]
if len(good_quality) > 0:
    sample = good_quality.iloc[0]
    print(f"\nGood quality (no issues): {sample['audio_id']}")
    print(f"  SNR: {sample['snr_db']:.1f} dB, Rolloff: {sample['spectral_rolloff_hz']:.0f} Hz")

---

## Section 2: Preprocessing Functions - Step-by-Step Walkthrough

**Goal:** Understand each preprocessing method by applying it to a single example audio file.

**Pattern:** 
1. Show the **core code** (2-3 lines) that does the preprocessing
2. Apply it to the example audio
3. Show before/after metrics
4. (Optional) Play the audio to hear the difference

**Note:** After understanding the logic, these functions are available in `scripts/preprocess_audio.py` for production use.
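
For the "before/after metrics" part of the pattern, a small level helper can be reused across every step. The sketch below is an assumption about what such a helper could look like (the names `level_metrics` and `report_change` are hypothetical and not part of `scripts/preprocess_audio.py`); it reports peak and RMS level in dBFS for a float waveform scaled to [-1, 1].

```python
import numpy as np

def level_metrics(waveform: np.ndarray) -> dict:
    """Return peak and RMS level in dBFS for a float waveform in [-1, 1]."""
    eps = 1e-12  # avoids log10(0) on digital silence
    peak = float(np.max(np.abs(waveform)))
    rms = float(np.sqrt(np.mean(np.square(waveform))))
    return {
        "peak_dbfs": 20 * np.log10(peak + eps),
        "rms_dbfs": 20 * np.log10(rms + eps),
    }

def report_change(before: np.ndarray, after: np.ndarray) -> dict:
    """Return the after-minus-before difference for each level metric, in dB."""
    m_before, m_after = level_metrics(before), level_metrics(after)
    return {name: m_after[name] - m_before[name] for name in m_before}

# Synthetic check: halving the amplitude should lower both levels by about 6 dB.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
delta = report_change(tone, 0.5 * tone)
print(delta)
```

In the notebook, `report_change(waveform_raw, waveform_normalized)` would give a quick sanity check alongside the LUFS numbers.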

### Step 1: Choose an Example Audio File

Pick a file from the analysis results. You can either:
- Pick a random file: `df_success.sample(1).iloc[0]`
- Pick a specific VHP index: `df_success[df_success['audio_id'] == '1234'].iloc[0]`
- Pick by issue type: `df_success[df_success['issues'].apply(lambda x: 'bandwidth_limited' in str(x))].iloc[0]`
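
Each of these selections ends in `.iloc[0]`, which raises `IndexError` when nothing matches. A hedged sketch of a selection helper with a fallback follows; the function name `pick_example` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def pick_example(df: pd.DataFrame, issue_substring: str, seed: int = 42) -> pd.Series:
    """Return the first row whose issues mention issue_substring, else a random row."""
    mask = df["issues"].apply(lambda issues: issue_substring in str(issues))
    if mask.any():
        return df[mask].iloc[0]
    # No file has the requested issue: fall back to a reproducible random pick.
    return df.sample(1, random_state=seed).iloc[0]

# Toy stand-in for df_success; IDs and labels are illustrative only.
toy = pd.DataFrame({
    "audio_id": ["101", "102"],
    "issues": [["low_loudness"], ["bandwidth_limited_severe"]],
})

hit = pick_example(toy, "bandwidth_limited")
miss = pick_example(toy, "clipping")
print(hit["audio_id"], miss["audio_id"])
```

The substring match mirrors the `'bandwidth_limited' in str(x)` expression above, so it also catches labels such as `bandwidth_limited_severe`.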

In [None]:
# Option 1: Pick a file with bandwidth limitation (recommended for testing)
example_file = df_success[df_success['issues'].apply(lambda x: 'bandwidth_limited' in str(x))].iloc[0]

# Option 2: Pick a random file
# example_file = df_success.sample(1, random_state=42).iloc[0]

# Option 3: Pick specific VHP index
# example_file = df_success[df_success['audio_id'] == '1234'].iloc[0]

print(f"Selected file: {example_file['audio_id']}")
print(f"\nIssues detected: {example_file['issues']}")
print(f"Recommended preprocessing: {example_file['recommended_preprocessing']}")
print(f"\nMetrics:")
print(f"  SNR: {example_file['snr_db']:.1f} dB")
print(f"  Rolloff: {example_file['spectral_rolloff_hz']:.0f} Hz")
print(f"  Centroid: {example_file['spectral_centroid_hz']:.0f} Hz")
print(f"  Loudness: {example_file['loudness_lufs']:.1f} LUFS")

### Step 2: Download the Audio File

In [None]:
# Helper function to load audio from bytes (we'll use this throughout)
def load_audio_bytes(audio_bytes, target_sr=16000):
    """Load audio from bytes to numpy array."""
    from pydub import AudioSegment
    audio = AudioSegment.from_file(io.BytesIO(audio_bytes))
    audio = audio.set_channels(1).set_frame_rate(target_sr)
    samples = np.array(audio.get_array_of_samples())
    
    # Normalize integer PCM to float32 in [-1, 1), scaling by the sample width
    full_scale = float(1 << (8 * audio.sample_width - 1))
    waveform = samples.astype(np.float32) / full_scale
    
    return waveform, target_sr

print("✓ Helper function defined")

In [None]:
# Download from Azure blob
audio_id = example_file['audio_id']

# Try audio.mp3 first, then video.mp4
try:
    blob_path = f"loc_vhp/{audio_id}/audio.mp3"
    audio_bytes_raw = download_blob_to_memory(blob_path)
    print(f"✓ Downloaded: {blob_path}")
except Exception:
    # Fall back to the video container when no standalone audio blob exists
    blob_path = f"loc_vhp/{audio_id}/video.mp4"
    audio_bytes_raw = download_blob_to_memory(blob_path)
    print(f"✓ Downloaded: {blob_path}")

print(f"  Size: {len(audio_bytes_raw)/1024/1024:.1f} MB")

# Load to waveform for analysis
waveform_raw, sr = load_audio_bytes(audio_bytes_raw, target_sr=16000)
print(f"  Duration: {len(waveform_raw)/sr:.1f} seconds")
print(f"  Sample rate: {sr} Hz")

### Step 3: Apply Preprocessing Functions One-by-One

Test each preprocessing function independently to understand what it does.

**⚠️ IMPORTANT:** Based on research, Whisper and Parakeet already include optimized preprocessing. These examples are educational. For production inference with those models, use raw audio.

**When to use these:** AWS Transcribe (test with validation), or understanding preprocessing concepts.
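
To see what an 80 Hz high-pass does in isolation, the sketch below runs one on a synthetic mix rather than on VHP audio: a 50 Hz tone stands in for hum or rumble and a 440 Hz tone for speech-band content. It uses second-order sections (`output="sos"` with `sosfiltfilt`), which is a swapped-in variant of the `(b, a)` form used in this notebook and tends to be numerically more robust at low cutoffs. The helper name `band_power` is hypothetical.

```python
import numpy as np
from scipy import signal

sr = 16000
t = np.arange(2 * sr) / sr
hum = 0.5 * np.sin(2 * np.pi * 50 * t)     # stand-in for mains hum / tape rumble
voice = 0.5 * np.sin(2 * np.pi * 440 * t)  # stand-in for speech-band content
mix = hum + voice

# 5th-order Butterworth high-pass at 80 Hz, applied forward and backward
# (zero phase), expressed as second-order sections.
sos = signal.butter(5, 80, btype="highpass", fs=sr, output="sos")
filtered = signal.sosfiltfilt(sos, mix)

def band_power(x: np.ndarray, lo_hz: float, hi_hz: float) -> float:
    """Sum of squared FFT magnitudes for bins in [lo_hz, hi_hz)."""
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    power = np.abs(np.fft.rfft(x)) ** 2
    return float(power[(freqs >= lo_hz) & (freqs < hi_hz)].sum())

# Fraction of power that survives the filter in each band.
hum_kept = band_power(filtered, 0, 80) / band_power(mix, 0, 80)
voice_kept = band_power(filtered, 300, 600) / band_power(mix, 300, 600)
print(f"Power kept below 80 Hz: {hum_kept:.4f}")
print(f"Power kept in 300-600 Hz: {voice_kept:.4f}")
```

Real recordings will not separate this cleanly, so treat the numbers as an illustration of the filter's shape rather than an expected result on archival audio.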

In [None]:
# 3a. Loudness Normalization
print("=" * 70)
print("LOUDNESS NORMALIZATION")
print("=" * 70)

# Core logic: measure integrated loudness, then apply the gain needed to reach the target
meter = pyln.Meter(sr)
current_lufs = meter.integrated_loudness(waveform_raw)
waveform_normalized = pyln.normalize.loudness(waveform_raw.copy(), current_lufs, -16.0)  # positional args
waveform_normalized = np.clip(waveform_normalized, -1.0, 1.0)  # Hard-limit to [-1, 1]; samples pushed past full scale are clipped

# Measure after
after_lufs = meter.integrated_loudness(waveform_normalized)

print(f"Before: {current_lufs:.1f} LUFS")
print(f"After:  {after_lufs:.1f} LUFS")
print(f"Change: {after_lufs - current_lufs:+.1f} LUFS")
print(f"\n✓ Normalized to target -16.0 LUFS")

print("\n📝 RESEARCH NOTE:")
print("   - Whisper: Already does amplitude normalization to [-1, 1]")
print("   - Parakeet: Already includes normalization (per-feature or all-features)")
print("   - Literature: Conservative normalization helps, aggressive can hurt WER")
print("   - For Whisper/Parakeet: This step is REDUNDANT (educational only)")
print("   - For AWS Transcribe: TEST with validation before using")

# Optional: Play audio (uncomment if running locally)
# import IPython.display as ipd
# print("\nPlay BEFORE:")
# ipd.display(ipd.Audio(waveform_raw, rate=sr))
# print("\nPlay AFTER:")
# ipd.display(ipd.Audio(waveform_normalized, rate=sr))

In [None]:
# 3b. High-Pass Filter (remove low-frequency rumble)
print("=" * 70)
print("HIGH-PASS FILTER (80 Hz cutoff)")
print("=" * 70)

# Core logic (3 lines):
nyquist = sr / 2.0
b, a = signal.butter(5, 80 / nyquist, btype='high', analog=False)
waveform_highpassed = signal.filtfilt(b, a, waveform_raw.copy())

# Calculate low-frequency energy before/after
freqs = np.fft.rfftfreq(len(waveform_raw), 1/sr)
fft_before = np.abs(np.fft.rfft(waveform_raw))
fft_after = np.abs(np.fft.rfft(waveform_highpassed))

low_freq_mask = freqs < 80
energy_before = np.sum(fft_before[low_freq_mask] ** 2)  # energy = sum of squared magnitudes
energy_after = np.sum(fft_after[low_freq_mask] ** 2)

print(f"Low-freq energy (<80 Hz) before: {energy_before:.2e}")
print(f"Low-freq energy (<80 Hz) after:  {energy_after:.2e}")
print(f"Reduction: {(1 - energy_after/energy_before)*100:.1f}%")
print(f"\n✓ Removed low-frequency rumble")

print("\n📝 RESEARCH NOTE:")
print("   - Literature: High-pass filtering shows CONSISTENT improvements")
print("   - Removes low-frequency ambient noise (AC hum, tape rumble)")
print("   - Generally SAFE: speech fundamentals sit above the 80 Hz cutoff (roughly 85 Hz and up for adult voices)")
print("   - Chu et al.: 'Significant improvements' for low-freq noise")
print("   - Recommended for: AWS Transcribe, or any model with archival audio")

# Optional: Play audio
# import IPython.display as ipd
# print("\nPlay BEFORE:")
# ipd.display(ipd.Audio(waveform_raw, rate=sr))
# print("\nPlay AFTER (rumble removed):")
# ipd.display(ipd.Audio(waveform_highpassed, rate=sr))

In [None]:
# 3c. Noise Reduction
print("=" * 70)
print("NOISE REDUCTION")
print("=" * 70)

# Core logic (1 line - using noisereduce library):
import noisereduce as nr
waveform_denoised = nr.reduce_noise(y=waveform_raw.copy(), sr=sr, stationary=True, prop_decrease=1.0)

# Calculate SNR before/after (simple RMS-based estimate)
def simple_snr(waveform, sr):
    # Assume first 0.5s is noise
    noise_samples = int(0.5 * sr)
    noise_rms = np.sqrt(np.mean(waveform[:noise_samples]**2))
    signal_rms = np.sqrt(np.mean(waveform**2))
    return 20 * np.log10(signal_rms / noise_rms) if noise_rms > 0 else 0

snr_before = simple_snr(waveform_raw, sr)
snr_after = simple_snr(waveform_denoised, sr)

print(f"SNR before: {snr_before:.1f} dB")
print(f"SNR after:  {snr_after:.1f} dB")
print(f"Improvement: {snr_after - snr_before:+.1f} dB")
print(f"\n✓ Noise reduced")

print("\n⚠️ RESEARCH WARNING:")
print("   - Models: Whisper/Parakeet rely on robustness, NO explicit denoising")
print("   - Literature: RNNoise and ASTEROID shown to HARM ASR")
print("   - Problem: Aggressive noise reduction removes speech info")
print("   - This library (noisereduce): NOT tested in research papers")
print("   - Current setting (prop_decrease=1.0): May be too aggressive")
print("   - Recommendation:")
print("     * For Whisper/Parakeet: SKIP this step (use raw audio)")
print("     * For AWS Transcribe: TEST with conservative settings (prop_decrease=0.5)")
print("     * ALWAYS validate with WER before using in production")

# Optional: Play audio
# import IPython.display as ipd
# print("\nPlay BEFORE:")
# ipd.display(ipd.Audio(waveform_raw, rate=sr))
# print("\nPlay AFTER (denoised):")
# ipd.display(ipd.Audio(waveform_denoised, rate=sr))

---

### Using the Preprocessing Functions in Production

**You've now seen the core logic!** These functions are available in `scripts/preprocess_audio.py` for production use.

---

#### 🎯 Decision Matrix: Which Preprocessing for Which Model?

Based on research findings (see `learnings/preprocessing_research_findings.md`):

| Model | Loudness Norm | High-Pass Filter | Noise Reduction | Recommendation |
|-------|---------------|------------------|-----------------|----------------|
| **Whisper (pretrained/fine-tuned)** | ❌ Redundant | ⚠️ Test cautiously | ❌ Skip | **Use raw audio** |
| **Parakeet-TDT** | ❌ Redundant | ⚠️ Test cautiously | ❌ Skip | **Use raw audio** |
| **Google Chirp 2/3** | ❌ Skip | ❌ Skip | ✅ Use built-in | **Use API params** (`denoise_audio=true`) |
| **AWS Transcribe** | ⚠️ Test | ✅ Safe | ⚠️ Test conservatively | **Test with validation** |

**Key insights:**
- ✅ **High-pass filter (80 Hz)**: Lowest-risk step; removes rumble without harming speech, but still validate per model as noted in the table
- ⚠️ **Loudness normalization**: Whisper/Parakeet already do this (redundant)
- ❌ **Noise reduction**: Literature shows RNNoise/ASTEROID harm ASR; this library untested
- 🎯 **Google Chirp**: Use built-in preprocessing (`denoise_audio=true`, `snr_threshold=150`)

**For this project:**
- Primary models (Whisper/Parakeet): Use **raw audio** only
- AWS Transcribe experiments: Test high-pass filter with WER validation
- Educational purpose: These examples help understand preprocessing concepts
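
The decision matrix can also be written down as a lookup so that experiment code does not apply a step by accident. This is only a sketch: the model keys follow the `MODEL_CHOICE` options used later in this notebook, the method name follows the `methods=[...]` strings shown here, and the names `RECOMMENDED_METHODS` and `methods_for` are hypothetical rather than part of `scripts/preprocess_audio.py`.

```python
from typing import Dict, List

# Starting points transcribed from the decision matrix; an empty list means "use raw audio".
RECOMMENDED_METHODS: Dict[str, List[str]] = {
    "whisper-large-v3": [],       # raw audio
    "whisper-large-v3-lora": [],  # raw audio
    "parakeet": [],               # raw audio
    "chirp": [],                  # rely on the API's built-in denoising instead
    "aws": ["highpass_filter"],   # add other steps only after WER validation
}

def methods_for(model: str) -> List[str]:
    """Return a copy of the recommended method list, or raise on an unknown model key."""
    try:
        return list(RECOMMENDED_METHODS[model])
    except KeyError:
        known = sorted(RECOMMENDED_METHODS)
        raise ValueError(f"Unknown model {model!r}; expected one of {known}") from None

print(methods_for("aws"))
print(methods_for("parakeet"))
```

Returning a copy keeps callers from mutating the shared table while they experiment with extra steps.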

---

#### Option 1: Import and use in notebooks
```python
from scripts.preprocess_audio import (
    load_audio_bytes,
    save_audio_bytes,
    loudness_normalize,
    highpass_filter,
    noise_reduce,
    preprocess_audio
)

# Apply full pipeline
audio_bytes_processed = preprocess_audio(
    audio_bytes_raw,
    methods=['highpass_filter', 'noise_reduction', 'loudness_normalization']
)
```

#### Option 2: Batch process via CLI (for AWS Transcribe experiments)

Run this on a cloud GPU instance (RunPod A6000, Azure VM, etc.) for fast batch processing:

```bash
# Example: Process entire test set with full pipeline
python scripts/preprocess_audio.py \
    --parquet data/raw/loc/veterans_history_project_resources_pre2010_test.parquet \
    --variant full_pipeline \
    --output_prefix loc_vhp_preprocessed/full_pipeline \
    --batch_size 100

# Available variants:
# - raw: No preprocessing (baseline - use this for Whisper/Parakeet)
# - normalized: Loudness normalization only
# - normalized_eq: Loudness + EQ boost (TODO: implement EQ)
# - normalized_denoised: Loudness + noise reduction
# - full_pipeline: Highpass + denoise + EQ + loudness
```

#### Option 3: Google Chirp API (recommended approach)

```python
# For Chirp, use built-in preprocessing parameters (NO upstream preprocessing needed)
from google.cloud import speech_v2

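config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
    ),
    # Assumption: in the v2 client the denoiser settings are passed via `denoiser_config`
    # rather than RecognitionFeatures; confirm against your installed google-cloud-speech version.
    denoiser_config={
        "denoise_audio": True,  # Use built-in denoising
        "snr_threshold": 150,   # SNR threshold for noise reduction
    },
)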
```

**You can test the functions here, or skip to Section 3 for the inference playground!**

In [None]:
# OPTIONAL: Test the imported functions (you can skip this)
# Uncomment to run:

# from scripts.preprocess_audio import preprocess_audio, save_audio_bytes

# # Apply full pipeline using the imported function
# audio_bytes_preprocessed = preprocess_audio(
#     audio_bytes_raw,
#     methods=['highpass_filter', 'noise_reduction', 'loudness_normalization'],
#     audio_id=audio_id,
#     metrics=example_file.to_dict()
# )

# # Load it back to waveform for comparison
# from scripts.preprocess_audio import load_audio_bytes
# waveform_full_pipeline, _ = load_audio_bytes(audio_bytes_preprocessed)

# print(f"✓ Preprocessed audio ready ({len(audio_bytes_preprocessed)/1024/1024:.1f} MB WAV)")

# # Play it
# import IPython.display as ipd
# ipd.display(ipd.Audio(waveform_full_pipeline, rate=sr))

print("✓ Skipped (optional cell)")
print("  Uncomment the code above to test the imported preprocessing function")

### Step 4: Visualize Before/After (Optional)

Compare spectrograms to see the effect of preprocessing.

In [None]:
# Create full pipeline by chaining the preprocessing steps
waveform_full_pipeline = waveform_highpassed.copy()  # Start with highpassed
waveform_full_pipeline = nr.reduce_noise(y=waveform_full_pipeline, sr=sr, stationary=True)  # Then denoise
# Finally normalize (use positional args)
current_lufs = pyln.Meter(sr).integrated_loudness(waveform_full_pipeline)
waveform_full_pipeline = pyln.normalize.loudness(waveform_full_pipeline, current_lufs, -16.0)

# Plot spectrograms: Raw vs. Full Pipeline
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Raw
D_raw = librosa.stft(waveform_raw)
S_db_raw = librosa.amplitude_to_db(np.abs(D_raw), ref=np.max)
img1 = librosa.display.specshow(S_db_raw, sr=sr, x_axis='time', y_axis='hz', ax=axes[0], cmap='viridis')
axes[0].set_ylim([0, 4000])
axes[0].set_title('RAW AUDIO (No Preprocessing)', fontsize=14, fontweight='bold')
fig.colorbar(img1, ax=axes[0], format='%+2.0f dB')

# Preprocessed
D_processed = librosa.stft(waveform_full_pipeline)
S_db_processed = librosa.amplitude_to_db(np.abs(D_processed), ref=np.max)
img2 = librosa.display.specshow(S_db_processed, sr=sr, x_axis='time', y_axis='hz', ax=axes[1], cmap='viridis')
axes[1].set_ylim([0, 4000])
axes[1].set_title('PREPROCESSED (Highpass → Denoise → Normalize)', fontsize=14, fontweight='bold')
fig.colorbar(img2, ax=axes[1], format='%+2.0f dB')

plt.tight_layout()
plt.show()

print("✓ Spectrogram comparison complete")

# Optional: Play both versions
# import IPython.display as ipd
# print("\nPlay RAW:")
# ipd.display(ipd.Audio(waveform_raw, rate=sr))
# print("\nPlay PREPROCESSED (full pipeline):")
# ipd.display(ipd.Audio(waveform_full_pipeline, rate=sr))

### Step 5: Intelligibility Check with STOI (Optional)

**STOI** (Short-Time Objective Intelligibility) predicts whether preprocessing helps or hurts ASR.

**From research:** Pearson r=0.79 correlation between STOI and ASR performance.

**How it works:** Compares preprocessed audio to a reference. Higher score (0-1) means better intelligibility.

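**Limitation for archival audio:** STOI expects a "clean reference" signal. For degraded VHP audio we use the raw recording as the reference (a pragmatic workaround), so the score measures how closely the preprocessed audio tracks the raw speech. A low score flags likely damage, but a high score cannot show that preprocessing improved anything.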

**Best validation:** Test WER directly in Section 3!
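
Two small helpers make the check easier to reuse. The sketch below assumes the metric needs equal-length reference and processed signals, and it encodes the same 0.7 / 0.5 bands used in this notebook; the names `align_lengths` and `interpret_stoi` are hypothetical.

```python
import numpy as np

def align_lengths(reference: np.ndarray, processed: np.ndarray):
    """Truncate both signals to the shorter length so they can be compared sample by sample."""
    n = min(len(reference), len(processed))
    return reference[:n], processed[:n]

def interpret_stoi(score: float) -> str:
    """Map a STOI score (0-1) to the bands used in this notebook."""
    if score > 0.7:
        return "high"      # preprocessing likely safe for ASR
    if score > 0.5:
        return "moderate"  # validate with WER
    return "low"           # preprocessing may hurt ASR

ref, proc = align_lengths(np.zeros(1000), np.zeros(990))
print(len(ref), len(proc))
print(interpret_stoi(0.82), interpret_stoi(0.6), interpret_stoi(0.3))
```

The band boundaries are this notebook's working thresholds, not values defined by the STOI metric itself.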

In [None]:
# STOI - Intelligibility Check
# Predicts if preprocessing maintains speech intelligibility (r=0.79 with ASR)

try:
    from pystoi import stoi
    
    # For degraded archival audio, we use raw as "reference" (not ideal, but pragmatic)
    # STOI score > 0.7: preprocessing maintains intelligibility
    # STOI score 0.5-0.7: moderate intelligibility (validate with WER)
    # STOI score < 0.5: preprocessing likely degrades quality
    
    stoi_score = stoi(waveform_raw, waveform_full_pipeline, sr, extended=False)
    
    print("=" * 70)
    print("STOI INTELLIGIBILITY CHECK")
    print("=" * 70)
    print(f"STOI Score: {stoi_score:.3f} (range: 0-1, higher = better)")
    print(f"\nInterpretation:")
    
    if stoi_score > 0.7:
        print("  ✓ High intelligibility maintained")
        print("  → Preprocessing likely safe for ASR")
    elif stoi_score > 0.5:
        print("  ⚠ Moderate intelligibility")
        print("  → Validate with WER testing in Section 3")
    else:
        print("  ✗ Low intelligibility")
        print("  → Preprocessing may hurt ASR performance")
    
    print(f"\n📝 Note: For archival audio, STOI has limitations.")
    print(f"   Best validation: Compare WER on sample set (Section 3)")
    
except ImportError:
    print("=" * 70)
    print("STOI NOT AVAILABLE")
    print("=" * 70)
    print("Install with: pip install pystoi")
    print("\nSTOI (Short-Time Objective Intelligibility) helps predict")
    print("if preprocessing will help or hurt ASR (r=0.79 correlation)")
    print("\nFor VHP project: WER testing in Section 3 is the gold standard!")

---

## Section 3: Inference Playground - Test Preprocessing Impact

**Goal:** Run STT inference on the same audio with different preprocessing to see transcription differences.

**Pattern:** Raw audio → transcribe → Preprocessed audio → transcribe → Compare outputs

**Note:** This is a "smoke test" playground. For large-scale inference, use the decoupled `infer_*.ipynb` notebooks and `scripts/run_inference.py`.
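
The pattern above amounts to running one transcription function over several versions of the same audio. A minimal sketch, assuming each model can be wrapped as a function from audio bytes to text (`compare_variants` and `fake_transcribe` are hypothetical names, and the stub stands in for real inference):

```python
from typing import Callable, Dict

def compare_variants(
    transcribe: Callable[[bytes], str],
    variants: Dict[str, bytes],
) -> Dict[str, str]:
    """Run the same transcription function over each named audio variant."""
    return {name: transcribe(audio) for name, audio in variants.items()}

def fake_transcribe(audio: bytes) -> str:
    """Stub transcriber for illustration; a real one would call the chosen model."""
    return f"<{len(audio)} bytes>"

results = compare_variants(
    fake_transcribe,
    {"raw": b"\x00" * 4, "preprocessed": b"\x00" * 8},
)
for name, text in results.items():
    print(f"{name:>12}: {text}")
```

Keeping the model behind a single callable means the same loop works for local models and API-based services once their wrappers exist.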

### Choose a Model for Testing

Pick one of the available STT models. You can test different models to see which benefits most from preprocessing.

In [None]:
# Choose a model to test
# Options: 'whisper-large-v3', 'whisper-large-v3-lora', 'parakeet', 'chirp', 'aws'

MODEL_CHOICE = 'whisper-large-v3'  # Change this to test different models

print(f"Selected model: {MODEL_CHOICE}")
print(f"\nAvailable models:")
print("  - whisper-large-v3: Pretrained Whisper (baseline)")
print("  - whisper-large-v3-lora: Fine-tuned on VHP data")
print("  - parakeet: NVIDIA Parakeet-TDT (optimized for noisy audio)")
print("  - chirp: Google Chirp 3 (commercial API)")
print("  - aws: AWS Transcribe (commercial API)")

### Transcribe Raw Audio (Baseline)

In [None]:
# Transcribe raw audio
# TODO: Implement model inference logic based on MODEL_CHOICE
# This is a placeholder - replace with actual inference code

print("=" * 70)
print(f"TRANSCRIBING RAW AUDIO with {MODEL_CHOICE}")
print("=" * 70)

# Placeholder - replace with actual inference
if MODEL_CHOICE == 'whisper-large-v3':
    # from faster_whisper import WhisperModel
    # model = WhisperModel("models/faster-whisper/whisper-large-v3")
    # segments, info = model.transcribe(io.BytesIO(audio_bytes_raw))  # transcribe() takes a path, file-like object, or array, not raw bytes
    # transcript_raw = " ".join([s.text for s in segments])
    transcript_raw = "[TODO] Implement Whisper inference"
    
elif MODEL_CHOICE == 'whisper-large-v3-lora':
    # Load fine-tuned model
    transcript_raw = "[TODO] Implement fine-tuned Whisper inference"
    
elif MODEL_CHOICE == 'parakeet':
    # Load Parakeet model
    transcript_raw = "[TODO] Implement Parakeet inference"
    
else:
    transcript_raw = "[TODO] Implement API-based inference"

print(f"Raw transcript: {transcript_raw}")

### Transcribe Preprocessed Audio

In [None]:
# We already have waveform_full_pipeline from Section 2!
# Let's convert it to bytes for inference (if needed by some models)

from pydub import AudioSegment

# Convert waveform to 16-bit PCM bytes (WAV format); clip first so out-of-range samples do not wrap around
waveform_int16 = (np.clip(waveform_full_pipeline, -1.0, 1.0) * 32767).astype(np.int16)
audio_segment = AudioSegment(
    waveform_int16.tobytes(),
    frame_rate=sr,
    sample_width=2,
    channels=1
)
buffer = io.BytesIO()
audio_segment.export(buffer, format='wav')
audio_bytes_preprocessed = buffer.getvalue()

print("✓ Using preprocessed audio from Section 2")
print(f"  Methods applied: Highpass → Denoise → Normalize")
print(f"  Size: {len(audio_bytes_preprocessed)/1024/1024:.1f} MB (WAV format)")
print(f"\nReady for inference comparison!")

# Transcribe preprocessed audio
print("=" * 70)
print(f"TRANSCRIBING PREPROCESSED AUDIO with {MODEL_CHOICE}")
print("=" * 70)

# TODO: Same inference logic as above
transcript_preprocessed = "[TODO] Implement inference on preprocessed audio"

print(f"Preprocessed transcript: {transcript_preprocessed}")

### Compare Transcriptions

Compare the raw vs. preprocessed transcriptions to see if preprocessing improved accuracy.
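
When a ground-truth transcript is available, word error rate (WER) is the usual way to quantify the difference. The sketch below is a plain-Python word-level edit distance, shown only to make the metric concrete; it lowercases and splits on whitespace, which may differ from the text normalization in the project's own `calculate_wer`. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat", "the bat sat"))
```

WER can exceed 1.0 when the hypothesis contains many inserted words, so a lower value is better but the scale is not capped at 100%.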

In [None]:
# Compare transcriptions
print("=" * 70)
print("COMPARISON")
print("=" * 70)

print(f"\nModel: {MODEL_CHOICE}")
print(f"Audio: {audio_id}")
print(f"\nRaw transcript:")
print(f"  {transcript_raw}")
print(f"\nPreprocessed transcript:")
print(f"  {transcript_preprocessed}")

# TODO: If ground truth is available, calculate WER for both
# from scripts.eval.evaluate import calculate_wer
# ground_truth = example_file['transcript']  # Get from parquet
# wer_raw = calculate_wer(ground_truth, transcript_raw)
# wer_preprocessed = calculate_wer(ground_truth, transcript_preprocessed)
# print(f"\nWER (raw): {wer_raw:.1%}")
# print(f"WER (preprocessed): {wer_preprocessed:.1%}")
# print(f"Improvement: {(wer_raw - wer_preprocessed)*100:+.1f} percentage points")

print("\n[TODO] Add WER calculation if ground truth is available")

---

## Next Steps

**You've completed the interactive walkthrough!** Here's what to do next:

### 1. Complete the TODOs in this notebook:
- Implement EQ high-frequency boost in `scripts/preprocess_audio.py`
- Add inference logic for your chosen models
- Add WER calculation for comparing raw vs. preprocessed
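
For the EQ item, one possible starting point is a high-shelf biquad in the Audio EQ Cookbook form. This is a sketch under assumptions, not the planned implementation: the 3000 Hz corner and +6 dB gain are illustrative defaults, and the names `high_shelf_coeffs` and `eq_high_boost` are hypothetical.

```python
import numpy as np
from scipy import signal

def high_shelf_coeffs(sr: int, cutoff_hz: float, gain_db: float):
    """Biquad high-shelf coefficients (Audio EQ Cookbook form, shelf slope S = 1)."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * cutoff_hz / sr
    cos_w0 = np.cos(w0)
    alpha = np.sin(w0) / 2 * np.sqrt(2)  # simplification of the cookbook alpha for S = 1
    k = 2 * np.sqrt(A) * alpha
    b = np.array([
        A * ((A + 1) + (A - 1) * cos_w0 + k),
        -2 * A * ((A - 1) + (A + 1) * cos_w0),
        A * ((A + 1) + (A - 1) * cos_w0 - k),
    ])
    a = np.array([
        (A + 1) - (A - 1) * cos_w0 + k,
        2 * ((A - 1) - (A + 1) * cos_w0),
        (A + 1) - (A - 1) * cos_w0 - k,
    ])
    return b / a[0], a / a[0]  # normalize so that a[0] == 1

def eq_high_boost(waveform: np.ndarray, sr: int,
                  cutoff_hz: float = 3000.0, gain_db: float = 6.0) -> np.ndarray:
    """Boost content above cutoff_hz, then hard-limit to [-1, 1]."""
    b, a = high_shelf_coeffs(sr, cutoff_hz, gain_db)
    return np.clip(signal.lfilter(b, a, waveform), -1.0, 1.0)

# Check the shelf: gain should be ~0 dB at DC and ~gain_db at Nyquist.
b, a = high_shelf_coeffs(16000, 3000.0, 6.0)
_, h = signal.freqz(b, a, worN=np.array([0.0, np.pi]))
gain_db_at_dc, gain_db_at_nyquist = 20 * np.log10(np.abs(h))
print(f"DC: {gain_db_at_dc:+.2f} dB, Nyquist: {gain_db_at_nyquist:+.2f} dB")
```

A shelf cannot restore frequencies that the recording never captured; it only raises what is already there, noise included, so it should go through the same WER validation as the other steps.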

### 2. For production batch processing:
Use the decoupled script with your preferred preprocessing variant:
```bash
# Example: Process test set with full pipeline
python scripts/preprocess_audio.py \
    --parquet data/raw/loc/veterans_history_project_resources_pre2010_test.parquet \
    --variant full_pipeline \
    --output_prefix loc_vhp_preprocessed/full_pipeline
```

### 3. For STT inference on preprocessed audio:
Use existing `infer_*.ipynb` notebooks - they support custom blob prefixes:
```python
# In infer_whisper.ipynb or similar:
CONFIG = {
    "blob_prefix": "loc_vhp_preprocessed/full_pipeline",  # Point to preprocessed audio
    # ... other config
}
```

### 4. Generate WER Matrix:
Run inference for each model × preprocessing variant combination to build your evaluation matrix:

| Model | Raw | Normalized | Normalized+EQ | Normalized+Denoised | Full Pipeline |
|-------|-----|------------|---------------|---------------------|---------------|
| Whisper v3 | ? | ? | ? | ? | ? |
| Whisper v3 LoRA | ? | ? | ? | ? | ? |
| Parakeet | ? | ? | ? | ? | ? |
| Chirp 3 | ? | ? | ? | ? | ? |
| AWS Transcribe | ? | ? | ? | ? | ? |
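
One way to keep this matrix as data while results come in is an initially empty DataFrame that is filled one cell at a time. The sketch below uses the row and column labels from the table; the helper name `record_wer` is hypothetical, and no WER values are assumed.

```python
import pandas as pd

MODELS = ["Whisper v3", "Whisper v3 LoRA", "Parakeet", "Chirp 3", "AWS Transcribe"]
VARIANTS = ["Raw", "Normalized", "Normalized+EQ", "Normalized+Denoised", "Full Pipeline"]

# Every cell starts as NaN, matching the "?" placeholders in the table.
wer_matrix = pd.DataFrame(index=MODELS, columns=VARIANTS, dtype=float)

def record_wer(matrix: pd.DataFrame, model: str, variant: str, wer: float) -> None:
    """Store one WER result, rejecting labels that are not part of the matrix."""
    if model not in matrix.index or variant not in matrix.columns:
        raise KeyError(f"Unknown cell: ({model!r}, {variant!r})")
    matrix.loc[model, variant] = wer

# Number of cells still waiting for a result.
remaining = int(wer_matrix.isna().sum().sum())
print(f"{remaining} of {wer_matrix.size} cells still to fill")
```

Rejecting unknown labels catches typos early, since assigning to a new label with `.loc` would otherwise silently add a row or column.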