# Tutorial: NVIDIA Canary-1B-v2 Inference Pipeline

This notebook demonstrates how to run the NVIDIA Canary-1B-v2 inference pipeline on a single sample file.

**Model**: [nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2) (978M params, 25 languages, 8.40% WER on Fleurs-25)

**What this notebook does:**
- Uses production orchestration (Azure blob storage)
- Runs Canary-1B-v2 model (pure ASR, faster than Canary-Qwen)
- Processes just 1 sample file for testing
- Shows outputs (inference results, hypothesis text)

**Requirements:**
- GPU: ~5 GB VRAM (less than Canary-Qwen)
- RAM: ~3 GB system memory
- Azure credentials configured (for blob storage)

## 1. Setup: Import and Configure Parameters

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import torch
import gc

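from dotenv import load_dotenv

# Load credentials (e.g. for Azure blob storage) from the local env file;
# the path is relative to the notebook directory
load_dotenv(dotenv_path='../credentials/creds.env')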

# Disable wandb for notebook runs (tutorials don't need tracking)
os.environ['WANDB_MODE'] = 'disabled'

# Set HuggingFace cache to use local models directory (avoid re-downloading)
os.environ['HF_HOME'] = str(Path.cwd().parent / "models/canary")
os.environ['TRANSFORMERS_CACHE'] = str(Path.cwd().parent / "models/canary")

# Add scripts directory to path
sys.path.insert(0, str(Path.cwd().parent / "scripts"))

from infer_canary_1b import run

# System and GPU Memory Check
print("=" * 70)
print("Memory Status Check")
print("=" * 70)

# Check system RAM
try:
    import psutil
    mem = psutil.virtual_memory()
    ram_used = mem.used / 1024**3
    ram_total = mem.total / 1024**3
    ram_available = mem.available / 1024**3
    
    print(f"System RAM: {ram_used:.1f}/{ram_total:.1f} GB used")
    print(f"Available: {ram_available:.1f} GB")
    
    if ram_available < 5:
        print(f"⚠️  WARNING: Only {ram_available:.1f} GB RAM available")
        print(f"   Canary-1B needs ~3 GB free for model loading")
except ImportError:
    print("psutil not installed, skipping RAM check")

# Check GPU memory
if torch.cuda.is_available():
    print()
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory / 1024**3
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
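        # Approximation: subtracts only the memory reserved by this process,
        # so memory held by other processes on the same GPU is not reflected
        free = total - reserved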
        
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"  Total: {total:.2f} GB")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved: {reserved:.2f} GB")
        print(f"  Free: {free:.2f} GB")
        
        if free < 5.0:
            print(f"  ⚠️  WARNING: Only {free:.2f} GB GPU free")
            print(f"     Canary-1B needs ~5 GB GPU VRAM")
else:
    print("\n⚠️  No CUDA GPU available - Canary requires GPU!")

print("=" * 70)
print("\n💡 Note: Wandb logging is disabled for tutorial runs")
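
The setup cell loads `creds.env` but does not confirm that the expected variables were actually set. A minimal sketch of such a check follows. The variable names in `REQUIRED_VARS` are hypothetical placeholders (the real keys in `creds.env` are not shown in this notebook), and the helper reports only names, never values.

```python
import os

# Hypothetical variable names: replace with the keys that creds.env actually defines.
REQUIRED_VARS = ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty, without exposing values."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("All expected credential variables are set")
```

Running this right after `load_dotenv(...)` would surface a wrong path or an incomplete env file before the pipeline tries to reach blob storage.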

## 1.1 GPU Memory Cleanup (Run if needed)

If you see "CUDA out of memory" errors, run this cell to clear GPU memory from previous runs:

In [None]:
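# Clean up GPU memory
import torch
import gc

# Run Python garbage collection first so unreferenced tensors are released
gc.collect()

# Then return PyTorch's cached, now-unused CUDA blocks to the driver
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("✓ GPU memory cleared")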

# Show updated memory status
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        total = torch.cuda.get_device_properties(i).total_memory / 1024**3
        free = total - reserved
        print(f"GPU {i}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved, {free:.2f} GB free")
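
The `total - reserved` figure above only reflects memory held by the current process. As a complementary sketch, `torch.cuda.mem_get_info` asks the CUDA driver for device-wide free and total memory, which also accounts for other processes. The helper name `gpu_free_gib` is introduced here for illustration and is not part of the pipeline.

```python
def gpu_free_gib():
    """Return {device_index: (free_gib, total_gib)} as reported by the CUDA driver.

    Returns an empty dict when PyTorch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}

    report = {}
    for i in range(torch.cuda.device_count()):
        # mem_get_info returns (free_bytes, total_bytes) for the whole device
        free_bytes, total_bytes = torch.cuda.mem_get_info(i)
        report[i] = (free_bytes / 1024**3, total_bytes / 1024**3)
    return report


for idx, (free_gib, total_gib) in gpu_free_gib().items():
    print(f"GPU {idx}: {free_gib:.2f} GB free of {total_gib:.2f} GB (device-wide)")
```

If this number stays low after the cleanup cell, another process (for example a second notebook kernel) is probably holding the memory.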

## 2. Configure Inference Parameters

Set parameters directly in Python (no YAML file needed):

In [None]:
# Build configuration dictionary
cfg = {
    "experiment_id": "tutorial-canary-1b-v2-sample1",
    
    # Model settings
    "model": {
        "name": "canary-1b-v2",
        "dir": "nvidia/canary-1b-v2",  # v2: 978M params, 25 languages, 8.40% WER
        "device": "cuda",  # Canary requires GPU (use "cuda")
        "batch_size": 1  # Process one file at a time
        # Note: No max_new_tokens (Canary-1B is pure ASR, not LLM-based)
    },
    
    # Input settings
    "input": {
        "source": "azure_blob",
        "parquet_path": "../data/raw/loc/veterans_history_project_resources_pre2010.parquet",  # Relative to notebook dir
        "blob_prefix": "loc_vhp",
        "sample_size": 1,  # Just 1 file for tutorial
        "duration_sec": 300,  # First 5 minutes only (faster)
        "sample_rate": 16000  # Canary expects 16kHz
    },
    
    # Output settings
    "output": {
        "dir": "../outputs/tutorial-canary-1b-v2-sample1",  # Relative to notebook dir
        "save_per_file": True  # Save individual hypothesis files
    },
    
    # Evaluation (optional)
    "evaluation": {
        "use_whisper_normalizer": True
    }
}

print("Configuration:")
print(f"  Model: {cfg['model']['dir']}")
print(f"  Version: v2 (978M params, 25 languages, trained on 1.7M hours)")
print(f"  Memory: ~3 GB RAM, ~5 GB GPU (less than Canary-Qwen)")
print(f"  Device: {cfg['model']['device']} (GPU required)")
print(f"  Sample size: {cfg['input']['sample_size']} file(s)")
print(f"  Duration: {cfg['input']['duration_sec']}s (first 5 minutes)")
print(f"  Output: {cfg['output']['dir']}")
print(f"\n⚠️  Note: Canary requires a CUDA-enabled GPU")
print(f"\n💡 Tip: First run will download model (~1.5GB for v2). Subsequent runs use cached model.")
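
Because the configuration is a plain dictionary, a typo in a key only shows up once the pipeline is running. The sketch below checks the structure up front. The set of required keys is an assumption inferred from the tutorial config above; `run()` itself may accept defaults for some of them.

```python
def validate_cfg(cfg):
    """Return a list of problems found in cfg; an empty list means every check passed."""
    # Assumed required keys, taken from the tutorial config rather than from run() itself.
    required = {
        "model": ("name", "dir", "device", "batch_size"),
        "input": ("source", "parquet_path", "blob_prefix", "sample_size", "sample_rate"),
        "output": ("dir",),
    }
    problems = []
    for section, keys in required.items():
        block = cfg.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        problems.extend(f"missing key: {section}.{k}" for k in keys if k not in block)

    inp = cfg.get("input") if isinstance(cfg.get("input"), dict) else {}
    size = inp.get("sample_size")
    if size is not None and (not isinstance(size, int) or size < 1):
        problems.append("input.sample_size must be a positive integer")
    duration = inp.get("duration_sec")
    if duration is not None and (not isinstance(duration, (int, float)) or duration <= 0):
        problems.append("input.duration_sec must be positive or None")
    rate = inp.get("sample_rate")
    if rate is not None and rate != 16000:
        problems.append("input.sample_rate should be 16000 for Canary")
    return problems
```

Calling `validate_cfg(cfg)` before `run(cfg)` and stopping when the list is non-empty would catch these mistakes before any audio is downloaded.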

## 3. Run Inference

This will:
1. Load 1 sample from the parquet file
2. Download audio from Azure blob storage
3. Preprocess audio (trim to 300 s, convert to 16 kHz mono WAV)
4. Load Canary-1B-v2 model (~3 GB RAM, ~5 GB GPU)
5. Run transcription with pure ASR model
6. Save results

**Note**: The first run downloads the model (~1.5 GB for v2); later runs use the cached copy.
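
The preprocessing in step 3 happens inside the pipeline and is not shown in this notebook. As an illustration of what "trim to 300 s, convert to 16 kHz mono WAV" amounts to, here is a sketch that builds an equivalent `ffmpeg` command. Using `ffmpeg` is an assumption; the pipeline may rely on a different audio library. The helper `build_ffmpeg_cmd` and the file names are hypothetical.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, duration_sec=300, sample_rate=16000):
    """Build an ffmpeg command that trims, downmixes to mono, and resamples."""
    cmd = ["ffmpeg", "-y", "-i", str(src)]       # -y: overwrite dst if it exists
    if duration_sec is not None:
        cmd += ["-t", str(duration_sec)]         # keep only the first N seconds
    cmd += ["-ac", "1", "-ar", str(sample_rate), str(dst)]  # mono, target sample rate
    return cmd


cmd = build_ffmpeg_cmd("interview.mp3", "interview_16k.wav")
print(" ".join(cmd))
# To actually convert a file (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```

Passing `duration_sec=None` drops the `-t` option, which mirrors the full-audio setting of the config.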

In [None]:
# Run the inference pipeline
run(cfg)

## 4. View Results

### 4.1 Inference Results (Parquet)

In [None]:
# Load results parquet
results_path = Path(cfg['output']['dir']) / "inference_results.parquet"
df_results = pd.read_parquet(results_path)

print(f"Results shape: {df_results.shape}")
print(f"\nColumns: {list(df_results.columns)}")
print(f"\nFirst row:")
df_results.head()

### 4.2 Hypothesis Text

In [None]:
# Display the transcription
row = df_results.iloc[0]

print("=" * 70)
print(f"File ID: {row['file_id']}")
print(f"Collection: {row['collection_number']}")
print(f"Duration: {row['duration_sec']:.1f}s")
print(f"Processing time: {row['processing_time_sec']:.1f}s")
print(f"Status: {row['status']}")
print("=" * 70)
print(f"\nTranscription (hypothesis):\n")
hyp_text = row['hypothesis']
# Show at most the first 500 characters
print(hyp_text[:500] + "..." if len(hyp_text) > 500 else hyp_text)

### 4.3 Individual Hypothesis File

In [None]:
# Check individual hypothesis file
hyp_file = Path(cfg['output']['dir']) / f"hyp_{row['file_id']}.txt"

if hyp_file.exists():
    print(f"Hypothesis file: {hyp_file}")
    print(f"\nContent:\n")
    # Read the file once instead of three times
    file_text = hyp_file.read_text()
    print(file_text[:500] + "..." if len(file_text) > 500 else file_text)
else:
    print(f"Hypothesis file not found: {hyp_file}")

## 5. Understanding the Output

**Key output files:**
- `inference_results.parquet`: All results in structured format (hypothesis, duration, status, etc.)
- `hyp_{file_id}.txt`: Individual hypothesis text files (one per audio file)
- `hyp_canary1b.txt`: Combined hypothesis text (all transcriptions concatenated)
- `canary1b_log_*.txt`: Detailed inference log

**Key result columns:**
- `file_id`: Unique identifier
- `collection_number`: VHP collection number
- `hypothesis`: Model transcription output
- `ground_truth`: Reference transcript (for evaluation)
- `duration_sec`: Audio duration processed
- `processing_time_sec`: Time taken for inference
- `status`: success/error
- `model_name`: Model identifier
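
The `duration_sec`, `processing_time_sec`, and `status` columns are enough to estimate throughput as a multiple of real time. The sketch below works on plain row dictionaries so that it has no dependencies; `speed_factor` is a name introduced here, and the sample rows are made-up values for illustration only.

```python
def speed_factor(rows):
    """Return audio seconds processed per wall-clock second over successful rows.

    Returns None when no successful row has a positive processing time.
    """
    audio_sec = 0.0
    wall_sec = 0.0
    for row in rows:
        if row.get("status") != "success":
            continue  # failed files would distort the estimate
        audio_sec += float(row["duration_sec"])
        wall_sec += float(row["processing_time_sec"])
    return audio_sec / wall_sec if wall_sec > 0 else None


# Made-up example rows, not real results
example_rows = [
    {"status": "success", "duration_sec": 300.0, "processing_time_sec": 15.0},
    {"status": "error", "duration_sec": 300.0, "processing_time_sec": 2.0},
]
print(f"{speed_factor(example_rows):.1f}x realtime")
```

With the results DataFrame from section 4.1, the same function can be called as `speed_factor(df_results.to_dict("records"))`.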

**Model used:**
- **[nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2)**: 978M params, 25 languages, 8.40% WER on Fleurs-25
- Pure ASR model (not LLM-based), faster than Canary-Qwen
- Trained on 1.7M hours of multilingual audio (Granary dataset)

## 6. Understanding Canary-1B-v2

**Model**: [nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2)

Canary-1B-v2 is a **pure ASR (Automatic Speech Recognition)** model, unlike Canary-Qwen, which is a SALM (Speech-Augmented Language Model).

**Key features:**
- 978M parameters (32 encoder + 8 decoder layers)
- Fast-Conformer architecture (pure ASR, not LLM-based)
- 25 languages supported (vs 4 in v1)
- 8.40% WER on Fleurs-25 benchmark
- Trained on 1.7M hours of multilingual audio (Granary dataset)
- English speech-to-text with basic punctuation
- Can generate timestamps (unlike Canary-Qwen)
- Uses `transcribe()` method instead of LLM `generate()`
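
For reference, calling the model directly through NeMo, outside this notebook's `run()` wrapper, might look like the sketch below. It follows the usage pattern published on the model card; the `source_lang` and `target_lang` keyword arguments and the `.text` attribute on each result are assumptions to verify against the installed NeMo version. The function name and the file path in the usage note are hypothetical, and `infer_canary_1b.run` may load and call the model differently.

```python
def transcribe_with_nemo(wav_paths, model_name="nvidia/canary-1b-v2", lang="en"):
    """Transcribe 16 kHz mono WAV files and return one text string per file."""
    # Imported lazily because NeMo is a heavy dependency
    from nemo.collections.asr.models import ASRModel

    asr_model = ASRModel.from_pretrained(model_name=model_name)
    # Same source and target language means transcription rather than translation
    # (keyword names assumed from the model card)
    outputs = asr_model.transcribe(list(wav_paths), source_lang=lang, target_lang=lang)
    return [out.text for out in outputs]
```

A call such as `transcribe_with_nemo(["sample_16k.wav"])` would download the model on first use and needs a CUDA GPU for reasonable speed.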

**Memory requirements (tested on Tesla T4):**
- GPU: ~6 GB VRAM (versus ~10 GB for Canary-Qwen)
- RAM: ~3 GB system memory (less than Canary-Qwen)
- Fits comfortably on T4 16GB GPU

**Performance:**
- Faster than Canary-Qwen (no LLM decoding overhead)
- Lower memory footprint
- Good accuracy for clear audio
- May be less robust than Canary-Qwen on degraded/archival audio

**Differences from Canary-Qwen:**

| Aspect | Canary-1B-v2 | Canary-Qwen-2.5B |
|--------|--------------|------------------|
| **HuggingFace** | [nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2) | [nvidia/canary-qwen-2.5b](https://huggingface.co/nvidia/canary-qwen-2.5b) |
| **Architecture** | Pure ASR (Fast-Conformer) | SALM (ASR + LLM) |
| **Parameters** | 978M | 2.5B |
| **Languages** | 25 | Multilingual |
| **WER (Fleurs-25)** | 8.40% | 5.63% (better) |
| **GPU Memory** | ~6 GB | ~10 GB |
| **Timestamps** | ✅ Yes | ❌ No |
| **Punctuation** | Basic | ✅ Full (LLM-based) |
| **Speed** | Faster (~23x realtime on T4) | Slower |
| **Method** | `transcribe()` | `generate()` |
| **max_new_tokens** | N/A (not LLM) | Required |
| **Prompt** | N/A | Required |

## 7. Next Steps

To run on more files:
1. Increase `sample_size` (e.g., 10, 50, 500)
2. Set `duration_sec` to `None` to process the full audio
3. **No prompt needed** (Canary-1B-v2 is pure ASR, not LLM-based)

To evaluate results:
```python
# This will be available after implementing evaluation
from scripts.eval.evaluate import evaluate_results
metrics = evaluate_results(df_results, use_whisper_normalizer=True)
print(f"WER: {metrics['wer']:.2%}")
```
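
Until `evaluate_results` exists, a rough WER can be computed by hand. The sketch below is a plain word-level edit distance divided by the reference length. It only lowercases and splits on whitespace, so it does not reproduce the Whisper normalizer referenced by `use_whisper_normalizer` and will report higher error rates than a normalized evaluation. The function name is introduced here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


print(f"WER: {word_error_rate('the cat sat down', 'the dog sat'):.2%}")
```

Applied to one result row, this would be `word_error_rate(row['ground_truth'], row['hypothesis'])`, assuming both columns are non-empty strings.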

**Recommended config for 10-sample test:**
```python
cfg = {
    "experiment_id": "vhp-canary-1b-v2-sample10",
    "model": {
        "name": "canary-1b-v2",
        "dir": "nvidia/canary-1b-v2",  # v2: 978M params, 25 languages
        "device": "cuda",
    },
    "input": {
        "sample_size": 10,  # Test with 10 files
        "duration_sec": None  # Full audio
    },
    "output": {
        "dir": "../outputs/vhp-canary-1b-v2-sample10"
    }
}
```

**Recommended config for 500-sample benchmark:**
```python
cfg = {
    "experiment_id": "vhp-canary-1b-v2-500",
    "model": {
        "name": "canary-1b-v2",
        "dir": "nvidia/canary-1b-v2",  # v2: 978M params, 25 languages
        "device": "cuda",
    },
    "input": {
        "sample_size": 500,  # Full benchmark
        "duration_sec": None  # Full audio
    },
    "output": {
        "dir": "../outputs/vhp-canary-1b-v2-500"
    }
}
```
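
The two recommended configs above list only the keys that change and leave out entries such as `source`, `parquet_path`, `blob_prefix`, and `sample_rate` from section 2. Whether `run()` supplies defaults for them is not shown here, so a safer pattern may be to copy the full tutorial config and override the fields that differ. The helper `scaled_cfg` below is a hypothetical sketch of that approach, and the small `base` dictionary stands in for the full `cfg` from section 2.

```python
import copy


def scaled_cfg(base_cfg, experiment_id, sample_size, duration_sec=None):
    """Return a deep copy of base_cfg with the run size and output location overridden."""
    cfg = copy.deepcopy(base_cfg)  # deep copy so nested dicts in base_cfg stay untouched
    cfg["experiment_id"] = experiment_id
    cfg["input"]["sample_size"] = sample_size
    cfg["input"]["duration_sec"] = duration_sec  # None means full audio
    cfg["output"]["dir"] = f"../outputs/{experiment_id}"
    return cfg


# Stand-in for the full tutorial config from section 2
base = {
    "experiment_id": "tutorial-canary-1b-v2-sample1",
    "input": {"source": "azure_blob", "sample_size": 1, "duration_sec": 300},
    "output": {"dir": "../outputs/tutorial-canary-1b-v2-sample1"},
}
cfg_10 = scaled_cfg(base, "vhp-canary-1b-v2-sample10", sample_size=10)
cfg_500 = scaled_cfg(base, "vhp-canary-1b-v2-500", sample_size=500)
print(cfg_10["input"], cfg_500["output"])
```

Every key from the base config carries over unchanged, so the larger runs use the same data source and model settings as the tutorial run.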