# Tutorial: NVIDIA Parakeet-TDT-0.6B-v3 Inference Pipeline

This notebook demonstrates how to run the NVIDIA Parakeet-TDT-0.6B-v3 inference pipeline on a single sample file.

**Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (600M params, 25 languages, 6.32% English WER, **fastest multilingual ASR on the HF leaderboard by RTFx**)

**What this notebook does:**
- Uses production orchestration (Azure blob storage)
- Runs Parakeet-TDT-0.6B-v3 model (fastest multilingual ASR on HF leaderboard)
- Processes just 1 sample file for testing
- Shows outputs (inference results, hypothesis text)

**Why Parakeet?**
- **4.4x faster than Canary-1B-v2** (RTFx: 3332.74 vs 749)
- **Better English WER**: 6.32% vs 8.40%
- **Smaller model**: 600M params vs 978M
- **Auto language detection**: No lang params needed
- **Rich timestamps**: Word/segment/char-level (optional)

**Requirements:**
- GPU: ~4 GB VRAM (less than Canary-1B)
- RAM: ~2-3 GB system memory
- Azure credentials configured (for blob storage)
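
Before running anything, it can help to confirm that the credentials file loaded in the setup cell actually populated the environment. The sketch below is one way to do that. The two variable names are hypothetical placeholders (this notebook does not list the keys stored in `creds.env`), so substitute whatever your file defines.

```python
import os

# Hypothetical variable names; replace with the keys your credentials file defines.
REQUIRED_VARS = ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("All expected credentials are set")
```

Run it after `load_dotenv(...)` so that values from the file are visible in `os.environ`.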

## 1. Setup: Import and Configure Parameters

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import torch
import gc

from dotenv import load_dotenv
load_dotenv(dotenv_path='../credentials/creds.env')

# Disable wandb for notebook runs (tutorials don't need tracking)
os.environ['WANDB_MODE'] = 'disabled'

# Set HuggingFace cache to use local models directory (avoid re-downloading)
os.environ['HF_HOME'] = str(Path.cwd().parent / "models/parakeet")
os.environ['TRANSFORMERS_CACHE'] = str(Path.cwd().parent / "models/parakeet")

# Add scripts directory to path
sys.path.insert(0, str(Path.cwd().parent / "scripts"))

from infer_parakeet import run

# System and GPU Memory Check
print("=" * 70)
print("Memory Status Check")
print("=" * 70)

# Check system RAM
try:
    import psutil
    mem = psutil.virtual_memory()
    ram_used = mem.used / 1024**3
    ram_total = mem.total / 1024**3
    ram_available = mem.available / 1024**3
    
    print(f"System RAM: {ram_used:.1f}/{ram_total:.1f} GB used")
    print(f"Available: {ram_available:.1f} GB")
    
    if ram_available < 3:
        print(f"⚠️  WARNING: Only {ram_available:.1f} GB RAM available")
        print(f"   Parakeet needs ~2-3 GB free for model loading")
except ImportError:
    print("psutil not installed, skipping RAM check")

# Check GPU memory
if torch.cuda.is_available():
    print()
    for i in range(torch.cuda.device_count()):
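        # mem_get_info reports device-wide free memory, so usage by other processes is included
        free_bytes, total_bytes = torch.cuda.mem_get_info(i)
        total = total_bytes / 1024**3
        free = free_bytes / 1024**3
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3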
        
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"  Total: {total:.2f} GB")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved: {reserved:.2f} GB")
        print(f"  Free: {free:.2f} GB")
        
        if free < 4.0:
            print(f"  ⚠️  WARNING: Only {free:.2f} GB GPU free")
            print(f"     Parakeet needs ~4 GB GPU VRAM")
else:
    print("\n⚠️  No CUDA GPU available - Parakeet requires GPU!")

print("=" * 70)
print("\n💡 Note: Wandb logging is disabled for tutorial runs")

## 1.1 GPU Memory Cleanup (Run if needed)

If you see "CUDA out of memory" errors, run this cell to clear GPU memory from previous runs:

In [None]:
# Clean up GPU memory
import torch
import gc

# Run Python garbage collection first so unreferenced tensors are released
gc.collect()

# Then return cached, unused blocks to the GPU driver
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("✓ GPU memory cleared")

# Show updated memory status
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        total = torch.cuda.get_device_properties(i).total_memory / 1024**3
        free = total - reserved
        print(f"GPU {i}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved, {free:.2f} GB free")

## 2a. Configure Inference Parameters

Set parameters directly in Python (no YAML file needed):

In [None]:
# Build configuration dictionary
cfg = {
    "experiment_id": "tutorial-parakeet-sample1",
    
    # Model settings
    "model": {
        "name": "parakeet-tdt-0.6b-v3",
        "dir": "nvidia/parakeet-tdt-0.6b-v3",  # HuggingFace model ID
        "device": "cuda",  # Parakeet requires GPU
        "enable_timestamps": False,  # Set True for word/segment/char timestamps
        "use_local_attention": False,  # Set True for very long audio (>24 min, up to 3 hrs)
        "att_context_size": [256, 256]  # Context size for local attention
    },
    
    # Input settings
    "input": {
        "source": "azure_blob",
        "parquet_path": "../data/raw/loc/veterans_history_project_resources_pre2010.parquet",
        "blob_prefix": "loc_vhp",
        "sample_size": 1,  # Just 1 file for tutorial
        "duration_sec": 300,  # First 5 minutes only (faster)
        "sample_rate": 16000  # Parakeet expects 16kHz
    },
    
    # Output settings
    "output": {
        "dir": "../outputs/tutorial-parakeet-sample1",
        "save_per_file": True  # Save individual hypothesis files
    },
    
    # Evaluation (optional)
    "evaluation": {
        "use_whisper_normalizer": True
    }
}

print("Configuration:")
print(f"  Model: {cfg['model']['dir']}")
print(f"  Features: Fastest multilingual ASR (RTFx: 3332.74)")
print(f"  Languages: 25 European languages (auto-detected)")
print(f"  WER: 6.32% (English), 9.7% (multilingual avg)")
print(f"  Memory: ~2-3 GB RAM, ~4 GB GPU (smallest NVIDIA model)")
print(f"  Device: {cfg['model']['device']} (GPU required)")
print(f"  Sample size: {cfg['input']['sample_size']} file(s)")
print(f"  Duration: {cfg['input']['duration_sec']}s (first 5 minutes)")
print(f"  Output: {cfg['output']['dir']}")
print(f"\n💡 Advantages over Canary-1B-v2:")
print(f"   - 4.4x faster (RTFx: 3332 vs 749)")
print(f"   - Better English WER (6.32% vs 8.40%)")
print(f"   - Smaller model (600M vs 978M params)")
print(f"   - Auto language detection (no lang params needed)")
print(f"\n💡 Tip: First run will download model (~1.2GB). Subsequent runs use cached model.")
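
Because the configuration is a plain dictionary, a typo in a key only surfaces once the pipeline is already running. The sketch below shows one way to catch such problems early. `validate_cfg` is a hypothetical helper, not part of `infer_parakeet`, and the set of required keys is an assumption based on the config shown in this section.

```python
def validate_cfg(cfg):
    """Return a list of human-readable problems found in a config dict (empty if none)."""
    problems = []
    # Assumed minimal set of required keys, based on the tutorial config.
    required = {
        "model": ("dir", "device"),
        "input": ("sample_size",),
        "output": ("dir",),
    }
    for section, keys in required.items():
        block = cfg.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"missing key: {section}.{key}")

    inp = cfg.get("input") if isinstance(cfg.get("input"), dict) else {}
    model = cfg.get("model") if isinstance(cfg.get("model"), dict) else {}

    size = inp.get("sample_size")
    if size is not None and (not isinstance(size, int) or size < 1):
        problems.append("input.sample_size must be a positive integer")

    # The tutorial states that Parakeet expects 16 kHz audio.
    if inp.get("sample_rate", 16000) != 16000:
        problems.append("input.sample_rate should be 16000")

    duration = inp.get("duration_sec")
    if duration is not None and duration <= 0:
        problems.append("input.duration_sec must be positive or None")

    if model.get("use_local_attention"):
        ctx = model.get("att_context_size")
        if not (isinstance(ctx, (list, tuple)) and len(ctx) == 2):
            problems.append("model.att_context_size must be [left, right]")
    return problems


example_cfg = {
    "model": {"dir": "nvidia/parakeet-tdt-0.6b-v3", "device": "cuda"},
    "input": {"sample_size": 1, "duration_sec": 300, "sample_rate": 16000},
    "output": {"dir": "../outputs/tutorial-parakeet-sample1"},
}
print(validate_cfg(example_cfg) or "config looks consistent")
```

Calling `validate_cfg(cfg)` on the tutorial config before `run(cfg)` would report any of these issues as a list of strings.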

## 2b. (optional) Advanced: Testing Local Attention for Long Audio

**Local Attention** allows Parakeet to process audio longer than 24 minutes (up to 3 hours).

### What is Local Attention?

- **NOT traditional chunking** - it's an architectural change to the attention mechanism
- Each audio frame only attends to nearby frames (sliding window) instead of all frames
- **Single forward pass** through the model (not multiple inference calls)
- Memory usage: O(n) instead of O(n²)
- Completely handled by NeMo internally (blackbox)
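
To make the O(n) versus O(n²) point concrete, the sketch below counts how many (query, key) frame pairs each scheme evaluates. It assumes a symmetric sliding window whose `left` and `right` sizes mirror the `[256, 256]` shape of `att_context_size`. It only illustrates the scaling argument and is not NeMo's implementation.

```python
def local_attention_pairs(n_frames, left, right):
    """Count (query, key) pairs when each frame sees `left` past and `right` future frames."""
    total = 0
    for i in range(n_frames):
        first = max(0, i - left)               # window is clipped at the start of the audio
        last = min(n_frames - 1, i + right)    # and at the end
        total += last - first + 1
    return total


def full_attention_pairs(n_frames):
    """Every frame attends to every frame."""
    return n_frames * n_frames


for n in (1_000, 10_000, 100_000):
    full = full_attention_pairs(n)
    local = local_attention_pairs(n, 256, 256)
    print(f"{n:>7} frames: full={full:,}  local={local:,}  ratio={full / local:.1f}x")
```

The local count is bounded by `n_frames * (left + right + 1)`, so it grows linearly with audio length, while the full count grows quadratically.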

### When to Use:
- ✅ Audio >24 minutes (up to 3 hours)
- ❌ Audio <24 minutes (use default full attention - better quality)

**Important**: This is different from the manual chunking we use for Canary-Qwen. Local attention is a model-level setting that enables processing longer audio in a single `transcribe()` call.

**For very long audio (>24 minutes):**
```python
cfg['model']['use_local_attention'] = True  # Enables processing up to 3 hours
cfg['model']['att_context_size'] = [256, 256]  # Attention context size
```

In [None]:
cfg['model']['use_local_attention'] = True  # Enables processing up to 3 hours
cfg['model']['att_context_size'] = [256, 256]  # Attention context size
cfg['input']['duration_sec'] = None  # ✅ FULL AUDIO - test with complete file

## 3. Run Inference

This will:
1. Load 1 sample from the parquet file
2. Download audio from Azure blob storage
3. Preprocess audio (trim to 300s, convert to 16kHz mono WAV)
4. Load Parakeet-TDT-0.6B-v3 model (~2-3GB RAM, ~4GB GPU)
5. Run transcription with auto language detection
6. Save results

**Note**: First run will download the model (~1.2GB). Cached after that.
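
The pipeline's own preprocessing code is not shown in this notebook. As one possible way to perform step 3 by hand, assuming `ffmpeg` is installed, the sketch below builds a command that keeps the first 300 seconds and writes 16 kHz mono WAV. `build_ffmpeg_cmd` and the file names are hypothetical.

```python
import shlex


def build_ffmpeg_cmd(src, dst, duration_sec=300, sample_rate=16000):
    """Build an ffmpeg argument list that writes a mono WAV at the given sample rate."""
    cmd = ["ffmpeg", "-y", "-i", str(src)]          # -y: overwrite the output if it exists
    if duration_sec is not None:
        cmd += ["-t", str(duration_sec)]            # keep only the first N seconds
    cmd += ["-ac", "1", "-ar", str(sample_rate), str(dst)]  # 1 channel, target sample rate
    return cmd


cmd = build_ffmpeg_cmd("input.mp3", "output.wav")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing `duration_sec=None` omits the `-t` option, which matches the "full audio" setting used elsewhere in this tutorial.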

In [None]:
# Run the inference pipeline
result = run(cfg)

## 4. View Results

### 4.1 Inference Results (Parquet)

In [None]:
# Load results parquet
results_path = Path(cfg['output']['dir']) / "inference_results.parquet"
df_results = pd.read_parquet(results_path)

print(f"Results shape: {df_results.shape}")
print(f"\nColumns: {list(df_results.columns)}")
print(f"\nFirst row:")
df_results.head()

### 4.2 Hypothesis Text

In [None]:
# Display the transcription
row = df_results.iloc[0]

print("=" * 70)
print(f"File ID: {row['file_id']}")
print(f"Collection: {row['collection_number']}")
print(f"Duration: {row['duration_sec']:.1f}s")
print(f"Processing time: {row['processing_time_sec']:.1f}s")
print(f"Status: {row['status']}")
print("=" * 70)
print(f"\nTranscription (hypothesis):\n")
hypothesis = row['hypothesis']
print(hypothesis[:500] + "..." if len(hypothesis) > 500 else hypothesis)

### 4.3 Individual Hypothesis File

In [None]:
# Check individual hypothesis file
hyp_file = Path(cfg['output']['dir']) / f"hyp_{row['file_id']}.txt"

if hyp_file.exists():
    print(f"Hypothesis file: {hyp_file}")
    print(f"\nContent:\n")
    content = hyp_file.read_text()  # read the file once instead of three times
    print(content[:500] + "..." if len(content) > 500 else content)
else:
    print(f"Hypothesis file not found: {hyp_file}")

## 5. Understanding the Output

**Key output files:**
- `inference_results.parquet`: All results in structured format (hypothesis, duration, status, etc.)
- `hyp_{file_id}.txt`: Individual hypothesis text files (one per audio file)
- `hyp_parakeet.txt`: Combined hypothesis text (all transcriptions concatenated)
- `parakeet_log_*.txt`: Detailed inference log

**Key result columns:**
- `file_id`: Unique identifier
- `collection_number`: VHP collection number
- `hypothesis`: Model transcription output
- `ground_truth`: Reference transcript (for evaluation)
- `duration_sec`: Audio duration processed
- `processing_time_sec`: Time taken for inference
- `status`: success/error
- `model_name`: Model identifier
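
The `duration_sec` and `processing_time_sec` columns are enough to compute your own RTFx (inverse real-time factor: seconds of audio transcribed per second of processing). The sketch below uses plain dictionaries with made-up values purely for illustration. They are not real pipeline results.

```python
def rtfx(duration_sec, processing_time_sec):
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_time_sec <= 0:
        raise ValueError("processing_time_sec must be positive")
    return duration_sec / processing_time_sec


# Illustrative rows only; the values are invented for this example.
rows = [
    {"duration_sec": 300.0, "processing_time_sec": 6.0, "status": "success"},
    {"duration_sec": 120.0, "processing_time_sec": 0.0, "status": "error"},
]

# Aggregate over successful rows so failed files do not distort the figure.
ok = [r for r in rows if r["status"] == "success"]
total_audio = sum(r["duration_sec"] for r in ok)
total_time = sum(r["processing_time_sec"] for r in ok)
print(f"Aggregate RTFx: {rtfx(total_audio, total_time):.1f}")
```

With the real results, the same ratio can be taken over `df_results` after filtering on `status`. Expect the value to differ from leaderboard figures, since it depends on hardware, batch size, and whether model loading is included in the timing.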

**Model used:**
- **[nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)**: 600M params, 25 languages
- **Fastest multilingual ASR** on HuggingFace leaderboard (RTFx: 3332.74)
- **Better English WER** than Canary-1B-v2 (6.32% vs 8.40%)
- Trained on ~660,000 hours from the Granary ASR subset

## 6. Understanding Parakeet-TDT-0.6B-v3

**Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)

Parakeet-TDT-0.6B-v3 is a **FastConformer-TDT (Token-and-Duration Transducer)** model optimized for speed and efficiency.

**Key features:**
- 600M parameters (smallest NVIDIA multilingual ASR model)
- FastConformer encoder with a TDT (Token-and-Duration Transducer) decoder
- **Fastest multilingual ASR on HuggingFace leaderboard** (RTFx: 3332.74)
- 25 European languages with auto language detection
- 6.32% WER on English (HF leaderboard)
- 9.7% WER on multilingual average (24 languages)
- Trained on ~660,000 hours from Granary ASR subset
- Auto punctuation and capitalization
- Word/segment/char-level timestamps (optional)
- Supports up to 24 min audio (full attention) or 3 hrs (local attention)

**Memory requirements (tested on Tesla T4):**
- GPU: ~4 GB VRAM (smallest NVIDIA model)
- RAM: ~2-3 GB system memory
- Fits comfortably on T4 16GB GPU

**Performance:**
- **4.4x faster than Canary-1B-v2** (RTFx: 3332.74 vs 749)
- **Better English WER** than Canary-1B-v2 (6.32% vs 8.40%)
- Lower memory footprint (600M vs 978M params)
- Slightly lower multilingual accuracy vs Canary-1B-v2 (9.7% vs 8.1% WER)

**Comparison with NVIDIA models:**

| Aspect | Parakeet-TDT-0.6B-v3 | Canary-1B-v2 | Canary-Qwen-2.5B |
|--------|---------------------|--------------|------------------|
| **HuggingFace** | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) | [nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2) | [nvidia/canary-qwen-2.5b](https://huggingface.co/nvidia/canary-qwen-2.5b) |
| **Architecture** | FastConformer-TDT | FastConformer encoder + Transformer decoder | SALM (ASR+LLM) |
| **Parameters** | 600M | 978M | 2.5B |
| **Speed (RTFx)** | **3332.74** (FASTEST) | 749 | ~60-100 |
| **English WER** | 6.32% | 8.40% | **5.63%** (BEST) |
| **Multilingual WER** | 9.7% | **8.1%** (BEST) | N/A (English only) |
| **GPU Memory** | **~4 GB** (SMALLEST) | ~6 GB | ~10 GB |
| **Timestamps** | ✅ Word/segment/char | ✅ Yes | ❌ No |
| **Punctuation** | ✅ Auto | Basic | ✅ Full (LLM) |
| **Language Detection** | ✅ Auto | Manual params | N/A (English only) |
| **Method** | `transcribe()` | `transcribe()` | `generate()` |
| **Use Case** | **High-throughput, real-time** | Balanced speed/accuracy | Best English accuracy |

**When to use Parakeet:**
- ✅ Need fastest processing (4.4x faster than Canary-1B-v2)
- ✅ High-throughput scenarios (transcribing large volumes)
- ✅ Real-time transcription requirements
- ✅ Limited GPU memory (~4 GB)
- ✅ English-primary workloads (best English WER among non-LLM models)
- ✅ Need timestamps (word/segment/char level)
- ❌ Need absolute best multilingual accuracy (use Canary-1B-v2 instead)

## 7. Next Steps

To run on more files:
1. Increase `sample_size` (e.g., 10, 50, 500)
2. Set `duration_sec: None` for full audio
3. Enable `enable_timestamps: True` for word-level timestamps (if needed)
4. Set `use_local_attention: True` for very long audio (>24 min, up to 3 hrs)
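
When a batch mixes short and long recordings, the attention mode can be chosen per file from its duration. The sketch below encodes the 24-minute and 3-hour limits stated in this tutorial. `attention_settings` is a hypothetical helper, and whether `run` accepts per-file overrides is not shown here, so treat this as a way to decide the setting rather than a pipeline feature.

```python
FULL_ATTENTION_LIMIT_SEC = 24 * 60        # 24 minutes, as stated in this tutorial
LOCAL_ATTENTION_LIMIT_SEC = 3 * 60 * 60   # 3 hours, as stated in this tutorial


def attention_settings(audio_duration_sec):
    """Return model-config overrides suited to the given audio duration."""
    if audio_duration_sec > LOCAL_ATTENTION_LIMIT_SEC:
        raise ValueError("audio exceeds the 3-hour limit; split it before transcribing")
    if audio_duration_sec > FULL_ATTENTION_LIMIT_SEC:
        return {"use_local_attention": True, "att_context_size": [256, 256]}
    return {"use_local_attention": False}


# Example: a 45-minute recording needs local attention.
model_cfg = {"name": "parakeet-tdt-0.6b-v3", "device": "cuda"}
model_cfg.update(attention_settings(45 * 60))
print(model_cfg)
```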

To evaluate results:
```python
# This will be available after implementing evaluation
from scripts.evaluate import evaluate_results
metrics = evaluate_results(df_results, use_whisper_normalizer=True)
print(f"WER: {metrics['wer']:.2%}")
```
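
Until `scripts.evaluate` exists, a rough WER can be computed with a word-level edit distance. The sketch below is a minimal stand-in and makes two simplifying assumptions: it lowercases and splits on whitespace only, and it does not apply the Whisper normalizer referenced in the config, so its numbers will not match a normalized evaluation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2%}")
```

For the tutorial output, the inputs would be the `ground_truth` and `hypothesis` columns of a result row.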

**Recommended config for 10-sample test:**
```python
cfg = {
    "experiment_id": "vhp-parakeet-sample10",
    "model": {
        "name": "parakeet-tdt-0.6b-v3",
        "dir": "nvidia/parakeet-tdt-0.6b-v3",
        "device": "cuda",
        "enable_timestamps": False,
        "use_local_attention": False  # Full attention for <24 min audio
    },
    "input": {
        "sample_size": 10,  # Test with 10 files
        "duration_sec": None  # Full audio
    },
    "output": {
        "dir": "../outputs/vhp-parakeet-sample10"
    }
}
```

**Recommended config for 500-sample benchmark:**
```python
cfg = {
    "experiment_id": "vhp-parakeet-500",
    "model": {
        "name": "parakeet-tdt-0.6b-v3",
        "dir": "nvidia/parakeet-tdt-0.6b-v3",
        "device": "cuda",
        "enable_timestamps": False,
        "use_local_attention": False  # Most VHP files <24 min
    },
    "input": {
        "sample_size": 500,  # Full benchmark
        "duration_sec": None  # Full audio
    },
    "output": {
        "dir": "../outputs/vhp-parakeet-500"
    }
}
```

**For very long audio (>24 minutes, up to 3 hours):**
```python
cfg = {
    "experiment_id": "vhp-parakeet-long-audio",
    "model": {
        "name": "parakeet-tdt-0.6b-v3",
        "dir": "nvidia/parakeet-tdt-0.6b-v3",
        "device": "cuda",
        "enable_timestamps": False,
        "use_local_attention": True,  # ✅ Enable for >24 min
        "att_context_size": [256, 256]  # Default window size
    },
    "input": {
        "sample_size": 10,
        "duration_sec": None  # Full audio (will handle up to 3 hours)
    },
    "output": {
        "dir": "../outputs/vhp-parakeet-long-audio"
    }
}
```

**Key Configuration Options:**

| Config Parameter | Values | When to Use |
|-----------------|--------|-------------|
| `use_local_attention` | `False` (default) | Audio <24 minutes (better quality) |
|  | `True` | Audio 24 min - 3 hours (enables longer processing) |
| `att_context_size` | `[256, 256]` (default) | Standard attention window (~13 sec) |
|  | `[128, 128]` | Smaller window (faster, less context) |
|  | `[512, 512]` | Larger window (slower, more context) |
| `enable_timestamps` | `False` (default) | Don't need word-level timestamps |
|  | `True` | Need word/segment/char timestamps |

**See Also:**
- `learnings/nvidia-models-comparison.md` - Full model comparison
- `learnings/nvidia-models-chunking-mechanisms.md` - Understanding local attention vs chunking
