# Tutorial: Whisper Inference Pipeline

This notebook demonstrates how to run the Whisper inference pipeline on a single sample file.

**What this notebook does:**
- Uses production orchestration (Azure blob storage)
- Runs a small model (the configuration below points at a local `faster-whisper-base` checkpoint)
- Processes just 1 sample file
- Shows outputs (inference results, hypothesis text)

## 1. Setup: Import and Configure Parameters

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd

# Load Azure credentials from .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path='../credentials/creds.env')

# Disable wandb for notebook runs (tutorials don't need tracking)
os.environ['WANDB_MODE'] = 'disabled'

# Add scripts directory to path
sys.path.insert(0, str(Path.cwd().parent / "scripts"))

from infer_whisper import run

print("💡 Note: Wandb logging is disabled for tutorial runs")

## 2. Configure Inference Parameters

Set parameters directly in Python (no YAML file needed):

In [None]:
# Build configuration dictionary
cfg = {
    "experiment_id": "tutorial-whisper-sample1",
    
    # Model settings
    "model": {
        "name": "whisper",
        "dir": "../models/faster-whisper/models--Systran--faster-whisper-base",  # Local model directory
        "device": "cpu",  # Use "cuda" if GPU available
        "compute_type": "int8",  # Options: int8, float16, float32
        "batch_size": 16
    },
    
    # Input settings
    "input": {
        "source": "azure_blob",
        "parquet_path": "../data/raw/loc/veterans_history_project_resources_pre2010.parquet",  # Relative to notebook dir
        "blob_prefix": "loc_vhp",
        "sample_size": 1,  # Just 1 file for tutorial
        "duration_sec": 300,  # First 5 minutes only (faster)
        "sample_rate": 16000
    },
    
    # Output settings
    "output": {
        "dir": "../outputs/tutorial-whisper-sample1",  # Relative to notebook dir
        "save_per_file": True  # Save individual hypothesis files
    },
    
    # Evaluation (optional)
    "evaluation": {
        "use_whisper_normalizer": True
    }
}

print("Configuration:")
print(f"  Model: {cfg['model']['dir']}")
print(f"  Device: {cfg['model']['device']}")
print(f"  Compute type: {cfg['model']['compute_type']}")
print(f"  Sample size: {cfg['input']['sample_size']} file(s)")
print(f"  Duration: {cfg['input']['duration_sec']}s (first 5 minutes)")
print(f"  Output: {cfg['output']['dir']}")

## 3. Run Inference

This will:
1. Load 1 sample from the parquet file
2. Download audio from Azure blob storage
3. Preprocess audio (trim to 300 s, convert to 16 kHz mono WAV)
4. Load Whisper model (using faster-whisper for efficiency)
5. Run transcription
6. Save results
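
Step 3 can be pictured with the minimal sketch below. It is not the code inside `infer_whisper`, which this notebook does not show. The function names (`to_mono`, `resample_linear`, `preprocess`) are hypothetical, and the linear-interpolation resampler is a simple stand-in for a production resampler with anti-aliasing.

```python
from typing import List, Optional, Sequence


def to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels into a single mono signal."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative only, no anti-aliasing)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def preprocess(
    channels: Sequence[Sequence[float]],
    src_rate: int,
    duration_sec: Optional[float] = 300,
    sample_rate: int = 16000,
) -> List[float]:
    """Downmix to mono, keep the first `duration_sec` seconds, then resample."""
    mono = to_mono(channels)
    if duration_sec is not None:               # None means: keep the full audio
        mono = mono[: int(duration_sec * src_rate)]
    return resample_linear(mono, src_rate, sample_rate)
```

The defaults mirror `duration_sec` and `sample_rate` from the configuration above; trimming before resampling keeps the amount of work proportional to the retained audio.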

In [None]:
# Run the inference pipeline
run(cfg)

## 4. View Results

### 4.1 Inference Results (Parquet)

In [None]:
# Load results parquet
results_path = Path(cfg['output']['dir']) / "inference_results.parquet"
df_results = pd.read_parquet(results_path)

print(f"Results shape: {df_results.shape}")
print(f"\nColumns: {list(df_results.columns)}")
print(f"\nFirst row:")
df_results.head()

### 4.2 Hypothesis Text

In [None]:
# Display the transcription
row = df_results.iloc[0]

print("=" * 70)
print(f"File ID: {row['file_id']}")
print(f"Collection: {row['collection_number']}")
print(f"Duration: {row['duration_sec']:.1f}s")
print(f"Processing time: {row['processing_time_sec']:.1f}s")
print(f"Status: {row['status']}")
print("=" * 70)
print("\nTranscription (hypothesis):\n")
hypothesis = row['hypothesis']
# Show at most the first 500 characters
print(hypothesis[:500] + "..." if len(hypothesis) > 500 else hypothesis)

### 4.3 Individual Hypothesis File

In [None]:
# Check individual hypothesis file
hyp_file = Path(cfg['output']['dir']) / f"hyp_{row['file_id']}.txt"

if hyp_file.exists():
    content = hyp_file.read_text()  # Read the file once and reuse the text
    print(f"Hypothesis file: {hyp_file}")
    print("\nContent:\n")
    # Show at most the first 500 characters
    print(content[:500] + "..." if len(content) > 500 else content)
else:
    print(f"Hypothesis file not found: {hyp_file}")

## 5. Understanding the Output

**Key output files:**
- `inference_results.parquet`: All results in structured format (hypothesis, duration, status, etc.)
- `hyp_{file_id}.txt`: Individual hypothesis text files (one per audio file)
- `hyp_whisper-{size}.txt`: Combined hypothesis text (all transcriptions concatenated)
- `whisper_log_*.txt`: Detailed inference log
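
To check which of these files a run actually produced, a small helper along the following lines may be useful. It is a sketch: `list_outputs` and the grouping keys are hypothetical names, and the glob patterns are taken directly from the list above.

```python
from pathlib import Path
from typing import Dict, List

# Glob patterns derived from the file list above; the dictionary keys are illustrative.
OUTPUT_PATTERNS = {
    "results": "inference_results.parquet",
    "combined_hypothesis": "hyp_whisper-*.txt",
    "log": "whisper_log_*.txt",
    "per_file_hypotheses": "hyp_*.txt",
}


def list_outputs(output_dir) -> Dict[str, List[str]]:
    """Group the file names found in `output_dir` by output type."""
    out_dir = Path(output_dir)
    found = {
        key: sorted(p.name for p in out_dir.glob(pattern))
        for key, pattern in OUTPUT_PATTERNS.items()
    }
    # "hyp_*.txt" also matches the combined file, so remove it from the per-file group.
    combined = set(found["combined_hypothesis"])
    found["per_file_hypotheses"] = [
        name for name in found["per_file_hypotheses"] if name not in combined
    ]
    return found
```

In this notebook it could be called as `list_outputs(cfg['output']['dir'])`. Note the assumption that no `file_id` itself starts with `whisper-`; such a file would be grouped with the combined hypothesis.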

**Key result columns:**
- `file_id`: Unique identifier
- `collection_number`: VHP collection number
- `hypothesis`: Model transcription output
- `ground_truth`: Reference transcript (for evaluation)
- `duration_sec`: Audio duration processed
- `processing_time_sec`: Time taken for inference
- `status`: success/error
- `model_name`: Model identifier (e.g., whisper-tiny)
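
These columns are enough for a quick throughput summary. The sketch below works on plain row dictionaries so that it runs without pandas; `summarize` is a hypothetical helper, and it assumes `status` takes the values `success` or `error` as listed above.

```python
from typing import Dict, Iterable, Mapping


def summarize(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Summarize success counts and speed from result rows."""
    rows = list(rows)
    ok = [r for r in rows if r["status"] == "success"]
    audio_sec = sum(r["duration_sec"] for r in ok)
    processing_sec = sum(r["processing_time_sec"] for r in ok)
    return {
        "n_files": len(rows),
        "n_success": len(ok),
        "audio_sec": audio_sec,
        "processing_sec": processing_sec,
        # Real-time factor: seconds of compute per second of audio (lower is faster).
        "rtf": processing_sec / audio_sec if audio_sec else float("nan"),
    }
```

With the DataFrame loaded earlier, `summarize(df_results.to_dict("records"))` passes each row as a dictionary.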

**Whisper model sizes:**
- `tiny`: ~39M parameters, fastest, lowest accuracy
- `base`: ~74M parameters
- `small`: ~244M parameters
- `medium`: ~769M parameters
- `large-v3`: ~1550M parameters, best accuracy, slowest

## 6. Next Steps

To run on more files:
1. Increase `sample_size` (e.g., 10, 50, 100)
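2. Set `duration_sec` to `None` (`null` in a YAML config) to process the full audio
3. Set `device` to `"cuda"` if you have a GPU
4. Try a larger model such as `large-v3` (point the model `dir` at its local directory) for best accuracy
5. Set `compute_type` to `"float16"` on GPU for faster inference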
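
The adjustments listed above can be applied to a copy of the tutorial configuration rather than by editing it in place. The helper below is a sketch: `scaled_config` and its parameters are hypothetical, and the `../outputs/<experiment_id>` layout is assumed from the output directory used earlier in this notebook.

```python
import copy
from typing import Optional


def scaled_config(
    base: dict,
    sample_size: int,
    full_audio: bool = True,
    use_gpu: bool = False,
    model_dir: Optional[str] = None,
    experiment_id: Optional[str] = None,
) -> dict:
    """Return a modified deep copy of `base`; the original dict is left untouched."""
    cfg = copy.deepcopy(base)
    cfg["input"]["sample_size"] = sample_size
    if full_audio:
        cfg["input"]["duration_sec"] = None      # None: do not trim the audio
    if use_gpu:
        cfg["model"]["device"] = "cuda"
        cfg["model"]["compute_type"] = "float16"
    if model_dir is not None:
        cfg["model"]["dir"] = model_dir          # local directory of a larger model
    if experiment_id is not None:
        cfg["experiment_id"] = experiment_id
        # Assumed layout: one output directory per experiment, so runs do not overwrite each other.
        cfg["output"]["dir"] = f"../outputs/{experiment_id}"
    return cfg
```

For example, `run(scaled_config(cfg, sample_size=50, use_gpu=True, experiment_id="whisper-sample50"))` would start a larger GPU run while leaving `cfg` unchanged.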

To evaluate results:
```python
from scripts.eval.evaluate import evaluate_results
metrics = evaluate_results(df_results, use_whisper_normalizer=True)
print(f"WER: {metrics['wer']:.2%}")
```
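
For reference, word error rate (WER) is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below implements that textbook definition; it is not the repository's `evaluate_results`, whose internals are not shown here, and it applies no text normalization (case and punctuation differences count as errors).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words seen so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.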