# Tutorial: Whisper Inference Pipeline

This notebook demonstrates how to run the Whisper inference pipeline on a single sample file.

**What this notebook does:**
- Uses production orchestration (Azure blob storage)
- Supports both faster-whisper (pretrained) and HuggingFace transformers (fine-tuned)
- Processes just 1 sample file
- Shows outputs (inference results, hypothesis text)

**Models supported:**
- Standard Whisper models (tiny, base, small, medium, large-v3) via faster-whisper
- Fine-tuned VHP model (large-v3-vhp-lora) via HuggingFace transformers

## 1. Setup: Import and Configure Parameters

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd

# Load Azure credentials from .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path='../credentials/creds.env')

# Disable wandb for notebook runs (tutorials don't need tracking)
os.environ['WANDB_MODE'] = 'disabled'

# Add scripts directory to path
sys.path.insert(0, str(Path.cwd().parent / "scripts"))

# Import both inference modules - we'll choose which one to use based on model
from infer_whisper import run as run_faster_whisper
from infer_whisper_hf import run as run_hf_whisper

print("💡 Note: Wandb logging is disabled for tutorial runs")
print("💡 Note: Both faster-whisper and HuggingFace inference available")
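
If the credentials file is missing or incomplete, the failure tends to surface only later, during the blob download. Below is a minimal sketch of an early check, assuming the pipeline reads its Azure settings from environment variables. The helper `missing_env_vars` is illustrative, and the variable name `AZURE_STORAGE_CONNECTION_STRING` is a placeholder for whatever keys `creds.env` actually defines.

```python
import os

def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]

# Placeholder variable name (an assumption); replace with the keys in creds.env.
REQUIRED_VARS = ["AZURE_STORAGE_CONNECTION_STRING"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

Running this right after `load_dotenv(...)` gives a readable message before any model is loaded.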

## 2. Select Model

Choose a Whisper model to use (including the fine-tuned model):

In [None]:
# Available Whisper models
MODEL_OPTIONS = {
    "tiny": ("../models/faster-whisper/models--Systran--faster-whisper-tiny", "faster-whisper"),
    "base": ("../models/faster-whisper/models--Systran--faster-whisper-base", "faster-whisper"),
    "small": ("../models/faster-whisper/models--Systran--faster-whisper-small", "faster-whisper"),
    "medium": ("../models/faster-whisper/models--Systran--faster-whisper-medium", "faster-whisper"),
    "large-v3": ("../models/faster-whisper/models--Systran--faster-whisper-large-v3", "faster-whisper"),
    "large-v3-vhp-lora": ("../models/hf-whisper/whisper-large-v3-vhp-lora", "hf")  # Fine-tuned (HF format)
}

# SELECT YOUR MODEL HERE
SELECTED_MODEL = "large-v3-vhp-lora"  # Fine-tuned model; change to e.g. "large-v3" for a pretrained model

model_path, inference_type = MODEL_OPTIONS[SELECTED_MODEL]
print(f"Selected model: {SELECTED_MODEL}")
print(f"Model path: {model_path}")
print(f"Inference type: {inference_type}")
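
A mistyped model name raises a bare `KeyError`, and a missing model directory only fails once the model is loaded. The sketch below shows one way to fail earlier with a clearer message. The helper `resolve_model` and the `demo_options` table are hypothetical and not part of the pipeline scripts.

```python
from pathlib import Path

def resolve_model(name, options):
    """Look up (path, inference_type) for `name`, with a readable error."""
    if name not in options:
        raise KeyError(f"Unknown model {name!r}; choose one of {sorted(options)}")
    path, inference_type = options[name]
    return Path(path), inference_type

# Stand-in table for illustration; in the notebook, pass MODEL_OPTIONS instead.
demo_options = {"tiny": ("models/tiny", "faster-whisper")}
demo_path, demo_type = resolve_model("tiny", demo_options)
if not demo_path.is_dir():
    print(f"Model directory not found: {demo_path}")
```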

## 3. Configure Inference Parameters

Set parameters directly in Python (no YAML file needed):

In [None]:
# Build configuration dictionary
# Choose model name based on inference type
if inference_type == "hf":
    model_name = "whisper-hf"  # Use HuggingFace inference
    compute_type_default = "float32"  # HF uses float32 on CPU
else:
    model_name = "whisper"  # Use faster-whisper
    compute_type_default = "int8"  # faster-whisper uses int8 on CPU

cfg = {
    "experiment_id": "tutorial-whisper-sample1",
    
    # Model settings
    "model": {
        "name": model_name,
        "dir": model_path,  # Use selected model path
        "device": "cpu",  # Use "cuda" if GPU available
        "compute_type": compute_type_default,  # Optimized for inference type
        "batch_size": 16 if inference_type == "faster-whisper" else 1,  # HF doesn't batch
        
        # Transcription parameters (explicit for reproducibility)
        "beam_size": 5,  # Beam search width (1=greedy, 5=default balance)
        "temperature": 0.0,  # Sampling temperature (0.0=deterministic)
        "vad_filter": True,  # Voice activity detection to skip silence (faster-whisper only)
        "no_speech_threshold": 0.6,  # Threshold for detecting no speech
        "word_timestamps": False,  # Don't generate word-level timestamps (faster)
        "initial_prompt": None,  # No initial prompt
        "suppress_tokens": [-1],  # Default: -1 suppresses the default set of non-speech symbol tokens
        "condition_on_previous_text": True,  # Use context from previous segments (important for long-form!)
    },
    
    # Input settings
    "input": {
        "source": "azure_blob",
        "parquet_path": "../data/raw/loc/veterans_history_project_resources_pre2010_test.parquet",  # Test set only
        "blob_prefix": "loc_vhp",
        "sample_size": 1,  # Just 1 file for tutorial
        "duration_sec": 30, 
        "sample_rate": 16000
    },
    
    # Output settings
    "output": {
        "dir": "../outputs/tutorial-whisper-sample1",  # Relative to notebook dir
        "save_per_file": True  # Save individual hypothesis files
    },
    
    # Evaluation (optional)
    "evaluation": {
        "use_whisper_normalizer": True
    }
}

print("Configuration:")
print(f"  Model: {SELECTED_MODEL}")
print(f"  Inference type: {inference_type}")
print(f"  Model path: {cfg['model']['dir']}")
print(f"  Device: {cfg['model']['device']}")
print(f"  Compute type: {cfg['model']['compute_type']}")
print(f"  Beam size: {cfg['model']['beam_size']}")
print(f"  Sample size: {cfg['input']['sample_size']} file(s)")
print(f"  Duration: {cfg['input']['duration_sec']}s")
print(f"  Output: {cfg['output']['dir']}")
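
The summary above prints a hand-picked subset of the settings. To log every value, for example when comparing two runs, the nested dictionary can be flattened into dotted keys. This is a minimal sketch: the `flatten` helper and `demo_cfg` are illustrative and are not provided by the pipeline.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into {"section.key": value} pairs."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))  # recurse into sub-sections
        else:
            flat[dotted] = value
    return flat

# Small stand-in config; in the notebook, call flatten(cfg) instead.
demo_cfg = {"model": {"device": "cpu", "beam_size": 5}, "input": {"sample_size": 1}}
for dotted_key, value in flatten(demo_cfg).items():
    print(f"{dotted_key} = {value}")
```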

## 4. Run Inference

This will:
1. Load 1 sample from the parquet file (test set)
2. Download audio from Azure blob storage
3. Preprocess audio (trim to `duration_sec`, which is 30 s in this tutorial; convert to 16 kHz mono WAV)
4. Load the Whisper model (faster-whisper or HuggingFace transformers, depending on the selected model)
5. Run transcription with the configured parameters
6. Save results
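
The pipeline's own preprocessing code is not shown in this notebook. To illustrate what step 3 does conceptually, here is a pure-Python sketch that downmixes to mono, trims to a duration, and resamples. It is an assumption-laden stand-in, not the project's implementation: the function `preprocess` is hypothetical, and it uses plain linear interpolation with no anti-aliasing filter, which a production resampler would normally include.

```python
def preprocess(samples, source_rate, target_rate=16000, duration_sec=None):
    """Downmix, trim, and resample audio given as a list of per-frame channel tuples."""
    # Downmix: average the channels of each frame.
    mono = [sum(frame) / len(frame) for frame in samples]
    # Trim: keep at most duration_sec seconds (None keeps everything).
    if duration_sec is not None:
        mono = mono[: int(duration_sec * source_rate)]
    if source_rate == target_rate or len(mono) < 2:
        return mono
    # Resample: interpolate linearly between neighbouring source samples.
    n_out = int(len(mono) * target_rate / source_rate)
    step = source_rate / target_rate
    out = []
    for i in range(n_out):
        pos = min(i * step, len(mono) - 1)   # clamp to the last sample
        left = min(int(pos), len(mono) - 2)
        frac = pos - left
        out.append(mono[left] * (1 - frac) + mono[left + 1] * frac)
    return out
```

With the tutorial settings (`duration_sec` of 30 and `sample_rate` of 16000), the output of such a step holds at most 30 × 16000 = 480000 samples per file.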

In [None]:
# Run the appropriate inference pipeline based on model type
if inference_type == "hf":
    print("Using HuggingFace transformers inference (for fine-tuned model)...")
    run_hf_whisper(cfg)
else:
    print("Using faster-whisper inference (optimized for pretrained models)...")
    run_faster_whisper(cfg)

## 5. View Results

### 5.1 Inference Results (Parquet)

In [None]:
# Load results parquet
results_path = Path(cfg['output']['dir']) / "inference_results.parquet"
df_results = pd.read_parquet(results_path)

print(f"Results shape: {df_results.shape}")
print(f"\nColumns: {list(df_results.columns)}")
print("\nFirst row:")
df_results.head()

### 5.2 Hypothesis Text

In [None]:
# Display the transcription
row = df_results.iloc[0]

print("=" * 70)
print(f"File ID: {row['file_id']}")
print(f"Collection: {row['collection_number']}")
print(f"Duration: {row['duration_sec']:.1f}s")
print(f"Processing time: {row['processing_time_sec']:.1f}s")
print(f"Status: {row['status']}")
print("=" * 70)
print("\nTranscription (hypothesis):\n")
hypothesis = row['hypothesis']
print(hypothesis[:500] + "..." if len(hypothesis) > 500 else hypothesis)

### 5.3 Individual Hypothesis File

In [None]:
# Check individual hypothesis file
hyp_file = Path(cfg['output']['dir']) / f"hyp_{row['file_id']}.txt"

if hyp_file.exists():
    hyp_text = hyp_file.read_text()  # Read the file once
    print(f"Hypothesis file: {hyp_file}")
    print("\nContent:\n")
    print(hyp_text[:500] + "..." if len(hyp_text) > 500 else hyp_text)
else:
    print(f"Hypothesis file not found: {hyp_file}")

## 6. Understanding the Output

**Key output files:**
- `inference_results.parquet`: All results in structured format (hypothesis, duration, status, etc.)
- `hyp_{file_id}.txt`: Individual hypothesis text files (one per audio file)
- `hyp_whisper-{size}.txt`: Combined hypothesis text (all transcriptions concatenated)
- `whisper_log_*.txt`: Detailed inference log

**Key result columns:**
- `file_id`: Unique identifier
- `collection_number`: VHP collection number
- `hypothesis`: Model transcription output
- `ground_truth`: Reference transcript (for evaluation)
- `duration_sec`: Audio duration processed
- `processing_time_sec`: Time taken for inference
- `status`: success/error
- `model_name`: Model identifier (e.g., whisper-base, whisper-large-v3-vhp-lora)
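
The `duration_sec` and `processing_time_sec` columns can be combined into a real-time factor (processing time divided by audio duration), which is a common way to compare speed across models. A minimal sketch follows, assuming both columns are numeric. The helper `real_time_factor` is illustrative and works on a plain dict or on a row taken from `df_results`.

```python
def real_time_factor(row):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    duration = row["duration_sec"]
    if not duration:
        return None  # avoid dividing by zero for empty or failed files
    return row["processing_time_sec"] / duration

# Made-up numbers for illustration only.
demo_row = {"duration_sec": 30.0, "processing_time_sec": 12.0}
print(f"RTF: {real_time_factor(demo_row):.2f}")
```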

**Whisper model sizes:**
- `tiny`: ~39M parameters, fastest, lowest accuracy
- `base`: ~74M parameters
- `small`: ~244M parameters
- `medium`: ~769M parameters
- `large-v3`: ~1550M parameters, best accuracy, slowest
- `large-v3-vhp-lora`: ~1550M parameters, fine-tuned on VHP archival audio

## 7. Next Steps

To run on more files:
1. Increase `sample_size` (e.g., 10, 50, 100)
2. Set `"duration_sec": None` to process the full audio
3. Set `"device": "cuda"` (with `"compute_type": "float16"`) if you have a GPU
4. Switch models by changing `SELECTED_MODEL` (the fine-tuned `"large-v3-vhp-lora"` is selected by default in this notebook)
5. Compare results between pretrained `"large-v3"` and fine-tuned `"large-v3-vhp-lora"`
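
Rather than editing the tutorial configuration in place, a larger run can be derived from it. The sketch below merges nested overrides into a deep copy so the original dictionary stays untouched. The helper `with_overrides` and the demo dictionaries are hypothetical; the override values mirror the suggestions in the list above.

```python
import copy

def with_overrides(base, overrides):
    """Return a deep copy of `base` with nested `overrides` merged in."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)  # merge sub-section
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Stand-in for the notebook's cfg; in practice, pass cfg as the base.
demo_base = {"model": {"device": "cpu", "compute_type": "int8"},
             "input": {"sample_size": 1, "duration_sec": 30}}
demo_big = with_overrides(demo_base, {
    "model": {"device": "cuda", "compute_type": "float16"},
    "input": {"sample_size": 50, "duration_sec": None},
})
```

It may also be worth overriding `experiment_id` and the output `dir` in the same call, so that a larger run does not write into the tutorial's output folder.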

**Note on fine-tuned model:**
- The fine-tuned model uses HuggingFace transformers (not faster-whisper)
- It uses 128 mel filterbanks, like its `large-v3` base, instead of the 80 used by earlier Whisper models
- Inference is slower but works with the LoRA-merged weights
- The notebook automatically selects the correct inference engine

To evaluate results:
```python
from scripts.eval.evaluate import evaluate_results
metrics = evaluate_results(df_results, use_whisper_normalizer=True)
print(f"WER: {metrics['wer']:.2%}")
```
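
The internals of `evaluate_results` are not shown here. For intuition about the metric itself, word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is an illustration under that definition only: `word_error_rate` is a hypothetical helper, it is not the project's evaluation code, and it skips the text normalization that `use_whisper_normalizer=True` implies.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(f"WER: {word_error_rate('a b c d', 'a x c'):.2%}")
```

In the example, one substitution and one deletion against four reference words give a WER of 50%.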