# Tutorial: NVIDIA Canary-Qwen-2.5B Inference

**Quick Start Guide** for running Canary-Qwen-2.5B speech-to-text inference.

**What is Canary-Qwen?**
- Speech-Augmented Language Model (SALM) with 2.5B parameters
- Combines NVIDIA Canary ASR with Qwen3-1.7B LLM
- Top performer on the Hugging Face Open ASR Leaderboard (WER: 5.63%)
- Supports English with punctuation and capitalization

**Requirements:**
- GPU: ~10 GB VRAM (tested on T4 16GB)
- RAM: ~5 GB system memory
- Audio: 16kHz mono WAV/FLAC files
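
Before running inference, it can help to confirm that an input file matches the audio requirement above. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav` is hypothetical and not part of NeMo. It reads PCM WAV headers only, so FLAC files would need a different reader.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems found in a PCM WAV header (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

An empty list means the header matches 16 kHz mono; otherwise the file probably needs resampling or downmixing first.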

---

## 1. Setup

Configure the environment and check available resources:

In [None]:
import os
import sys
from pathlib import Path
import torch

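# Disable Weights & Biases logging and cache model downloads next to the project.
# Set these before importing any library that reads them (e.g. NeMo, transformers).
os.environ['WANDB_MODE'] = 'disabled'
os.environ['HF_HOME'] = str(Path.cwd().parent / "models/canary")
# TRANSFORMERS_CACHE is deprecated in recent transformers releases in favor of
# HF_HOME; it is kept here for older versions.
os.environ['TRANSFORMERS_CACHE'] = str(Path.cwd().parent / "models/canary")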

# Check GPU
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"  VRAM: {torch.cuda.get_device_properties(0).total_memory/1024**3:.1f} GB")
else:
    print("❌ No GPU found - Canary requires CUDA")

print("\n✓ Setup complete")

## 2. Load Model

Load Canary-Qwen-2.5B using the SALM API:

In [None]:
from nemo.collections.speechlm2.models import SALM

print("Loading Canary-Qwen-2.5B...")
print("(First run downloads ~5 GB model)\n")

model = SALM.from_pretrained("nvidia/canary-qwen-2.5b")
model = model.cuda()

print("\n✓ Model loaded (uses ~5 GB RAM, ~10 GB GPU)")
print(f"  Audio placeholder tag: {model.audio_locator_tag}")

## 3. Transcribe Audio

**Method 1: Single file (simple)**

In [None]:
# Example: Transcribe a WAV file
audio_path = "/path/to/your/audio.wav"  # Change this

# Format prompt with audio placeholder
prompts = [[
    {
        "role": "user",
        "content": f"Transcribe the following: {model.audio_locator_tag}",
        "audio": [audio_path]
    }
]]

# Generate transcription
with torch.no_grad():
    answer_ids = model.generate(prompts=prompts, max_new_tokens=512)

# Decode result
transcript = model.tokenizer.ids_to_text(answer_ids[0].cpu()).strip()

print("Transcript:")
print(transcript)

**Method 2: Batch processing (production)**

Use the production script for multiple files:

In [None]:
sys.path.insert(0, str(Path.cwd().parent / "scripts"))
from infer_canary import run

# Configure inference run
config = {
    "experiment_id": "my-transcription",
    
    "model": {
        "name": "canary",
        "path": "nvidia/canary-qwen-2.5b",
        "device": "cuda",
        "prompt": "Transcribe the following:",
        "max_new_tokens": 512
    },
    
    "input": {
        "source": "azure_blob",  # or "local_files"
        "parquet_path": "../data/raw/loc/veterans_history_project_resources_pre2010.parquet",
        "blob_prefix": "loc_vhp",
        "sample_size": 10,  # Number of files
        "duration_sec": 300,  # First 5 minutes (or None for full)
        "sample_rate": 16000
    },
    
    "output": {
        "dir": "../outputs/my-transcription",
        "save_per_file": True
    }
}

# Run inference
run(config)

## 4. View Results

In [None]:
import pandas as pd

# Load results
results = pd.read_parquet("../outputs/my-transcription/inference_results.parquet")

print(f"Processed {len(results)} files")
print(f"\nColumns: {list(results.columns)}")
print("\nFirst transcript:")
print(results.iloc[0]['hypothesis'][:500])

## Tips & Best Practices

**Audio Requirements:**
- Format: 16kHz, mono, WAV/FLAC
- Max duration tested: 40 seconds per prompt
- For longer audio: split into segments of 40 seconds or less and transcribe each segment separately
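
One simple way to chunk longer audio is to cut the waveform at fixed sample offsets. The sketch below only computes the `(start, end)` sample indices; the function name `chunk_bounds` and its defaults are assumptions for illustration, taken from the 16 kHz and 40 second figures above. Fixed-size cuts can split a word in half, so silence-based splitting may give cleaner transcripts.

```python
def chunk_bounds(num_samples, sample_rate=16000, max_seconds=40.0):
    """Return (start, end) sample indices covering the audio in pieces
    of at most max_seconds each. The last piece may be shorter."""
    chunk = int(sample_rate * max_seconds)
    if num_samples < 0 or chunk < 1:
        raise ValueError("num_samples must be >= 0 and the chunk length >= 1 sample")
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# 100 seconds of 16 kHz audio -> two 40 s chunks and one 20 s chunk
bounds = chunk_bounds(100 * 16000)
```

Each pair can then slice a loaded waveform (`waveform[start:end]`) before the segment is written out and transcribed.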

**Prompting:**
- Always include the `{model.audio_locator_tag}` placeholder in the prompt content
- Customize prompt: `"Transcribe this interview:"`, `"Convert speech to text:"`
- Model handles punctuation/capitalization automatically
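
The prompting rules above can be wrapped in a small helper so the placeholder is never forgotten. This is a sketch: `build_prompt` and `EXAMPLE_TAG` are hypothetical names, and the tag value below is a stand-in for illustration only. In the notebook, pass `model.audio_locator_tag` instead.

```python
def build_prompt(audio_path, audio_tag, instruction="Transcribe the following:"):
    """Build one single-turn conversation in the format used in Method 1,
    appending the audio placeholder if the instruction does not contain it."""
    if audio_tag in instruction:
        content = instruction
    else:
        content = f"{instruction} {audio_tag}"
    return [{"role": "user", "content": content, "audio": [str(audio_path)]}]


# Stand-in tag for illustration; use model.audio_locator_tag in practice.
EXAMPLE_TAG = "<audio>"
prompts = [
    build_prompt(path, EXAMPLE_TAG, "Transcribe this interview:")
    for path in ["a.wav", "b.wav"]
]
```

The resulting list has the same shape as `prompts` in Method 1. Passing more than one conversation per `generate` call is not exercised in this tutorial, so treat batching this way as untested.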

**Performance:**
- RTFx: 418 as reported for the model (audio is processed 418× faster than real time; actual throughput depends on hardware and batch size)
- Batch size: 1 for development, increase for production
- GPU memory: ~10 GB baseline + ~0.1 GB per concurrent inference
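
The figures above translate into rough planning numbers with simple arithmetic. The sketch below uses the RTFx and memory values from this list as defaults; both function names are hypothetical. Treat the results as back-of-the-envelope estimates, not measurements.

```python
def estimate_processing_seconds(audio_seconds, rtfx=418.0):
    """Processing time = audio duration / real-time factor."""
    return audio_seconds / rtfx


def estimate_gpu_gb(concurrent, baseline_gb=10.0, per_inference_gb=0.1):
    """GPU memory = baseline + per-inference overhead * concurrent inferences."""
    return baseline_gb + per_inference_gb * concurrent


# One hour of audio at RTFx 418 takes roughly 8.6 seconds of compute.
one_hour = estimate_processing_seconds(3600)
# Eight concurrent inferences need roughly 10.8 GB under the linear assumption.
eight_way = estimate_gpu_gb(8)
```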

**Troubleshooting:**
- "Missing audio_locator_tag": Make sure prompt includes `{model.audio_locator_tag}`
- "CUDA out of memory": Reduce batch size or process sequentially
- "Expected N replacements": Audio placeholder missing from prompt
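
Placeholder problems can be caught before calling `model.generate` with a quick check over the prompt list. This sketch assumes that each audio file in a turn needs exactly one placeholder in that turn's content; that rule is inferred from the errors above rather than confirmed against the NeMo source, and `find_prompt_problems` is a hypothetical helper.

```python
def find_prompt_problems(prompts, audio_tag):
    """Report turns whose placeholder count differs from their audio count.

    Assumption: one placeholder is expected per audio file in a turn.
    """
    problems = []
    for i, conversation in enumerate(prompts):
        for j, turn in enumerate(conversation):
            n_tags = turn.get("content", "").count(audio_tag)
            n_audio = len(turn.get("audio", []))
            if n_tags != n_audio:
                problems.append(
                    f"prompt {i}, turn {j}: {n_tags} placeholder(s) "
                    f"for {n_audio} audio file(s)"
                )
    return problems
```

Calling it as `find_prompt_problems(prompts, model.audio_locator_tag)` returns an empty list when every turn is consistent.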

---

## Resources

- Model card: https://huggingface.co/nvidia/canary-qwen-2.5b
- Production script: `scripts/infer_canary.py`
- Config examples: `configs/runs/vhp-canary-*.yaml`