# Canary-Qwen-2.5B Resource Testing

**Purpose:** Measure RAM and GPU memory requirements for Canary-Qwen-2.5B

**Test Results:**
- Model loading: **5.1 GB RAM, 9.6 GB GPU**
- Inference overhead: **+0.01 GB RAM, +0.01 GB GPU** (negligible)
- **Total: ~5 GB RAM, ~10 GB GPU** needed

**VM Requirements:**
- ✅ Fits on NC4as_T4_v3 (28 GB RAM, 16 GB T4 GPU)
- ✅ No need for larger VM
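
The fit claim above comes down to simple subtraction. A minimal sketch of that arithmetic, using only the figures quoted in this notebook (the nominal VM capacities, not the slightly lower amounts a running system typically reports as usable); the `headroom` helper is a hypothetical name introduced for illustration:

```python
# Measured figures from the test results above, in GB (load + inference overhead).
measured = {"ram": 5.1 + 0.01, "gpu": 9.6 + 0.01}

# Nominal NC4as_T4_v3 capacities as quoted in this notebook, in GB.
capacity = {"ram": 28.0, "gpu": 16.0}

def headroom(measured, capacity):
    """Return the remaining capacity per resource, rounded to 2 decimals."""
    return {name: round(capacity[name] - measured[name], 2) for name in measured}

margins = headroom(measured, capacity)
for name, margin in margins.items():
    status = "fits" if margin > 0 else "does NOT fit"
    print(f"{name.upper()}: {margin:.2f} GB spare ({status})")
```

The GPU margin is the tighter of the two, so that is the number to re-check if batch size or audio length grows.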

---

## Setup

In [None]:
import os
import psutil
import torch
import gc
from datetime import datetime
from pathlib import Path

# Configure environment
os.environ['WANDB_MODE'] = 'disabled'
model_cache = Path.cwd().parent / "models" / "canary"
os.environ['HF_HOME'] = str(model_cache)
os.environ['TRANSFORMERS_CACHE'] = str(model_cache)

def print_resources(label=""):
    """Print current RAM and GPU memory usage."""
    print("="*80)
    print(f"RESOURCE CHECK: {label}")
    print(f"Timestamp: {datetime.now().strftime('%H:%M:%S')}")
    print("="*80)
    
    # System RAM
    mem = psutil.virtual_memory()
    print(f"\n📊 RAM: {mem.used/1024**3:.2f}/{mem.total/1024**3:.2f} GB ({mem.percent:.1f}%)")
    print(f"   Available: {mem.available/1024**3:.2f} GB")
    
    # GPU
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            total = torch.cuda.get_device_properties(i).total_memory / 1024**3
            print(f"\n🎮 GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"   Allocated: {allocated:.2f}/{total:.2f} GB")
    
    print("="*80 + "\n")

print("✓ Setup complete")

## Test 1: Load Canary-Qwen-2.5B

In [None]:
# Clean start
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print_resources("BEFORE loading model")

# Capture baseline
ram_before = psutil.virtual_memory().used / 1024**3
gpu_before = torch.cuda.memory_allocated(0) / 1024**3 if torch.cuda.is_available() else 0

# Load model
from nemo.collections.speechlm2.models import SALM
model = SALM.from_pretrained("nvidia/canary-qwen-2.5b")
if torch.cuda.is_available():
    model = model.cuda()

print("\n✓ Model loaded\n")
print_resources("AFTER loading model")

# Calculate delta
ram_after = psutil.virtual_memory().used / 1024**3
gpu_after = torch.cuda.memory_allocated(0) / 1024**3 if torch.cuda.is_available() else 0

print("="*80)
print("📊 MEMORY DELTA:")
print(f"  RAM: +{ram_after - ram_before:.2f} GB")
print(f"  GPU: +{gpu_after - gpu_before:.2f} GB")
print("="*80)
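
`psutil.virtual_memory().used` is a system-wide figure, so anything else running on the VM during the load also shows up in the delta. A minimal sketch of an alternative that samples the resident set size (RSS) of the current process instead; `process_rss_bytes` and `memory_delta` are hypothetical helper names, not part of this notebook or of psutil, and RSS has its own caveats (for example shared pages):

```python
from contextlib import contextmanager

GIB = 1024**3

def process_rss_bytes():
    """Resident set size of the current Python process, in bytes (requires psutil)."""
    import psutil  # imported lazily so memory_delta also works with other samplers
    return psutil.Process().memory_info().rss

@contextmanager
def memory_delta(label, sampler=process_rss_bytes):
    """Sample memory before and after the with-block; store the difference in GiB."""
    report = {"label": label}
    before = sampler()
    try:
        yield report
    finally:
        # Filled in on exit, even if the block raised.
        report["delta_gib"] = (sampler() - before) / GIB
```

In this notebook the load step would go inside the block, for example `with memory_delta("model load") as report: model = SALM.from_pretrained("nvidia/canary-qwen-2.5b")`, followed by `print(report)`. Passing a different `sampler` (such as a function wrapping `torch.cuda.memory_allocated`) reuses the same pattern for GPU memory.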

## Test 2: Run Inference

In [None]:
import numpy as np
import soundfile as sf
from tempfile import NamedTemporaryFile

# Create dummy audio (10s silence)
audio = np.zeros(16000 * 10, dtype=np.float32)

print_resources("BEFORE inference")
ram_before = psutil.virtual_memory().used / 1024**3
gpu_before = torch.cuda.memory_allocated(0) / 1024**3 if torch.cuda.is_available() else 0

# Run inference
with NamedTemporaryFile(suffix=".wav", delete=True) as tmp:
    sf.write(tmp.name, audio, 16000)
    
    prompts = [[
        {
            "role": "user",
            "content": f"Transcribe the following: {model.audio_locator_tag}",
            "audio": [tmp.name]
        }
    ]]
    
    with torch.no_grad():
        answer_ids = model.generate(prompts=prompts, max_new_tokens=512)
    
    result = model.tokenizer.ids_to_text(answer_ids[0].cpu()).strip()

print(f"\n✓ Transcription: '{result}'\n")
print_resources("AFTER inference")

ram_after = psutil.virtual_memory().used / 1024**3
gpu_after = torch.cuda.memory_allocated(0) / 1024**3 if torch.cuda.is_available() else 0

print("="*80)
print("📊 INFERENCE OVERHEAD:")
print(f"  RAM: +{ram_after - ram_before:.2f} GB")
print(f"  GPU: +{gpu_after - gpu_before:.2f} GB")
print("="*80)
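
The before/after readings in Test 2 capture what is still allocated once `generate` has returned, so transient buffers freed during the call would not appear in the delta. A minimal sketch of measuring the peak instead, using PyTorch's peak-memory counters; `peak_gpu_overhead_gib` is a hypothetical helper name, and the guarded import is there only so the sketch also runs on a machine without PyTorch or a GPU:

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None

GIB = 1024**3

def peak_gpu_overhead_gib(fn, device=0):
    """Run fn() and return (result, peak extra GiB PyTorch allocated on `device` during the call)."""
    if torch is None or not torch.cuda.is_available():
        return fn(), 0.0  # nothing to measure without CUDA
    torch.cuda.synchronize(device)
    baseline = torch.cuda.memory_allocated(device)
    torch.cuda.reset_peak_memory_stats(device)  # peak counter restarts at the current allocation
    result = fn()
    torch.cuda.synchronize(device)  # make sure queued kernels have finished
    peak = torch.cuda.max_memory_allocated(device)
    return result, (peak - baseline) / GIB

# Stand-in workload; in this notebook fn would wrap the model.generate(...) call.
result, overhead = peak_gpu_overhead_gib(lambda: sum(range(10)))
print(f"result={result}, peak GPU overhead: +{overhead:.2f} GB")
```

In Test 2 the workload would be `lambda: model.generate(prompts=prompts, max_new_tokens=512)`, called inside the existing `torch.no_grad()` block. These counters only cover memory managed by PyTorch's allocator, so the CUDA context itself is not included.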

## Summary

**Canary-Qwen-2.5B Memory Requirements:**
- Model loading: ~5 GB RAM, ~10 GB GPU
- Inference overhead: Negligible (<0.1 GB)

**Conclusion:**
- ✅ Runs on NC4as_T4_v3 (28 GB RAM, 16 GB T4)
- ✅ Previous "20-25GB RAM" estimate was incorrect
- ✅ No need for larger VM or quota increase