# Fine-Tuning Parakeet-v3 ASR Model with NeMo 2.5+

This notebook demonstrates how to fine-tune the NVIDIA Parakeet-v3 (parakeet-tdt-0.6b-v3) ASR model using NeMo 2.5+.

## Key Updates from Original Tutorial:
- Updated to work with NeMo 2.5+ (from 1.23)
- Uses Parakeet-v3 model instead of original Parakeet
- Compatible with Modal GPU infrastructure for training
- Updated API calls and configuration structure

## Model Information:
- **Model**: nvidia/parakeet-tdt-0.6b-v3
- **Architecture**: FastConformer-TDT
- **Parameters**: 600M
- **Languages**: 25 European languages
- **License**: CC BY 4.0

## 1. Environment Setup and Dependencies
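
Because this notebook targets NeMo 2.5+, it can help to confirm the installed version before running the later cells. The following is a minimal sketch, not part of the original tutorial: the helper names `parse_version`, `meets_minimum`, and `installed_version` are hypothetical, and the parsing is deliberately simplistic (pre-release tags such as `rc1` are not handled precisely; `packaging.version` is more robust if it is available).

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    numbers = [int(p) for p in re.findall(r"\d+", text.split("+")[0])[:3]]
    # Pad short versions such as "2.5" so that they compare equal to "2.5.0".
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed, minimum):
    """Compare versions numerically, so that "2.10.1" counts as newer than "2.5.0"."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("nemo_toolkit")
if found is None:
    print("nemo_toolkit is not installed")
else:
    print(f"nemo_toolkit {found}: meets 2.5.0 minimum = {meets_minimum(found, '2.5.0')}")
```

A numeric comparison is used because comparing the raw strings would rank "2.10.1" below "2.5.0".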

In [None]:
# Install system dependencies
!apt-get update && apt-get install -y sox libsndfile1 ffmpeg libsox-fmt-mp3 jq wget

# Install Python dependencies
!pip install text-unidecode "matplotlib>=3.3.2" Cython librosa soundfile
!pip install "huggingface-hub>=0.23.2"

# Install NeMo 2.5+ (the quotes keep the shell from treating ">=" as an output redirect)
!pip install "nemo_toolkit[asr]>=2.5.0"

print("Dependencies installed successfully!")

In [None]:
# Import required libraries
import os
import json
import librosa
import glob
import subprocess
import torch
from pathlib import Path

# NeMo imports for v2.5+
import nemo
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models import ASRModel
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager

# Check NeMo version
print(f"NeMo version: {nemo.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## 2. Data Preparation

We'll use the AN4 dataset for demonstration, but you can replace this with your own dataset.
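
The manifests built in this section are JSON Lines files: one JSON object per line with the keys `audio_filepath`, `duration`, and `text`. As a rough sketch of that format, the snippet below writes and re-reads a tiny manifest. The helper names and the two example entries are hypothetical and only illustrate the layout; the audio files they name do not exist.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("audio_filepath", "duration", "text")


def write_manifest(entries, manifest_path):
    """Write one JSON object per line."""
    with open(manifest_path, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")


def read_manifest(manifest_path):
    """Read a manifest and fail early if an entry lacks a required key."""
    entries = []
    with open(manifest_path, "r", encoding="utf-8") as fin:
        for line_no, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in entry]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            entries.append(entry)
    return entries


# Hypothetical entries, for illustration only.
example_entries = [
    {"audio_filepath": "wavs/utt_0001.wav", "duration": 2.5, "text": "hello world"},
    {"audio_filepath": "wavs/utt_0002.wav", "duration": 1.25, "text": "good morning"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_manifest.json")
    write_manifest(example_entries, path)
    loaded = read_manifest(path)

total_seconds = sum(entry["duration"] for entry in loaded)
print(f"{len(loaded)} utterances, {total_seconds:.2f} s of audio")
```

Running the same check on your own manifest before training surfaces malformed lines early, rather than partway through data loading.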

In [None]:
# Set up data directories
DATA_DIR = os.getcwd()
os.environ["DATA_DIR"] = DATA_DIR

# Download AN4 dataset into DATA_DIR
if not os.path.exists(f"{DATA_DIR}/an4_sphere.tar.gz"):
    !wget -P "$DATA_DIR" https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz

# Extract dataset; the archive unpacks to an `an4` directory inside DATA_DIR
if not os.path.exists(f"{DATA_DIR}/an4"):
    !tar -xvf "$DATA_DIR/an4_sphere.tar.gz" -C "$DATA_DIR"

print("Dataset downloaded and extracted successfully!")

In [None]:
def an4_build_manifest(transcripts_path, manifest_path, target_wavs_dir):
    """Build an AN4 manifest from a given transcript file."""
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(') - 1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(') + 1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(target_wavs_dir, file_id + '.wav')

                if os.path.exists(audio_path):
                    # `path=` replaces the deprecated `filename=` argument (librosa >= 0.10)
                    duration = librosa.get_duration(path=audio_path)
                    # Write the metadata to the manifest
                    metadata = {"audio_filepath": audio_path, "duration": duration, "text": transcript}
                    json.dump(metadata, fout)
                    fout.write('\n')

# Process AN4 dataset
source_data_dir = f"{DATA_DIR}/an4"
target_data_dir = f"{DATA_DIR}/an4_converted"

if not os.path.exists(source_data_dir):
    raise ValueError(f"Data not found at `{source_data_dir}`. Please ensure the AN4 dataset is properly extracted.")

# Convert SPH files to WAV files
sph_list = glob.glob(os.path.join(source_data_dir, '**/*.sph'), recursive=True)
target_wavs_dir = os.path.join(target_data_dir, 'wavs')

if not os.path.exists(target_wavs_dir):
    print(f"Creating directories for {target_wavs_dir}.")
    os.makedirs(target_wavs_dir, exist_ok=True)

print(f"Converting {len(sph_list)} SPH files to WAV...")
for sph_path in sph_list:
    wav_path = os.path.join(target_wavs_dir, os.path.splitext(os.path.basename(sph_path))[0] + '.wav')
    if not os.path.exists(wav_path):
        cmd = ["sox", sph_path, wav_path]
        subprocess.run(cmd, check=True)

# Build AN4 manifests
train_transcripts = os.path.join(source_data_dir, 'etc/an4_train.transcription')
train_manifest = os.path.join(target_data_dir, 'train_manifest.json')
an4_build_manifest(train_transcripts, train_manifest, target_wavs_dir)

test_transcripts = os.path.join(source_data_dir, 'etc/an4_test.transcription')
test_manifest = os.path.join(target_data_dir, 'test_manifest.json')
an4_build_manifest(test_transcripts, test_manifest, target_wavs_dir)

print("Data preprocessing completed!")
print(f"Train manifest: {train_manifest}")
print(f"Test manifest: {test_manifest}")

## 3. Load Parakeet-v3 Model

We'll load the pre-trained Parakeet-v3 model from Hugging Face.

In [None]:
# Load the pre-trained Parakeet-v3 model
model_name = "nvidia/parakeet-tdt-0.6b-v3"

print(f"Loading {model_name}...")
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

print("Model loaded successfully!")
print(f"Model type: {type(asr_model)}")
print(f"Model architecture: {asr_model.__class__.__name__}")

# Display model configuration
print("\nModel configuration:")
print(f"Encoder: {asr_model.encoder.__class__.__name__}")
print(f"Decoder: {asr_model.decoder.__class__.__name__}")
print(f"Vocabulary size: {asr_model.decoder.vocab_size if hasattr(asr_model.decoder, 'vocab_size') else 'N/A'}")

## 4. Test Pre-trained Model

Let's test the pre-trained model on a sample audio file before fine-tuning.
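
The dataset configuration later in this notebook uses a 16 kHz sample rate, so it is worth checking that a test file matches before transcribing it. The sketch below reads the WAV header with Python's standard `wave` module. It assumes PCM WAV files and mono input, and the helper names `wav_info` and `matches_model_input` are hypothetical. The demo writes one second of silence instead of relying on a downloaded file.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Read sample rate, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration": wav.getnframes() / rate,
        }


def matches_model_input(info, sample_rate=16000, channels=1):
    """Check the file against the expected input format (assumed: 16 kHz mono)."""
    return info["sample_rate"] == sample_rate and info["channels"] == channels


# Demo: write one second of 16-bit silence, then inspect it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "silence.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000)
    info = wav_info(demo_path)

print(info, matches_model_input(info))
```

Files that fail the check can be resampled with `sox` or `librosa`, both of which are installed in the setup cell.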

In [None]:
# Download a sample audio file for testing
sample_audio = "2086-149220-0033.wav"
if not os.path.exists(sample_audio):
    !wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

# Test transcription
print("Testing pre-trained model...")
output = asr_model.transcribe([sample_audio])
print(f"Transcription: {output[0].text}")

# Test with timestamps
print("\nTesting with timestamps...")
output_with_timestamps = asr_model.transcribe([sample_audio], timestamps=True)
if hasattr(output_with_timestamps[0], 'timestamp') and output_with_timestamps[0].timestamp:
    word_timestamps = output_with_timestamps[0].timestamp.get('word', [])
    print("Word-level timestamps:")
    for stamp in word_timestamps[:5]:  # Show first 5 words
        print(f"  {stamp['start']:.2f}s - {stamp['end']:.2f}s : {stamp['word']}")
else:
    print("Timestamps not available in this output format")

## 5. Fine-tuning Configuration

Set up the configuration for fine-tuning with NeMo 2.5+.
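
The optimizer settings in the next cell pair a short warmup with cosine annealing (`warmup_steps`, `min_lr`). To build intuition for how those values shape the learning rate, here is a simplified sketch of linear warmup followed by cosine decay. It illustrates the general technique and is not NeMo's exact `CosineAnnealing` implementation; `max_steps=1000` is an arbitrary illustrative value.

```python
import math


def warmup_cosine_lr(step, base_lr=1e-4, min_lr=1e-6, warmup_steps=100, max_steps=1000):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates do not disturb the pre-trained weights.
        return base_lr * (step + 1) / (warmup_steps + 1)
    # Fraction of the decay phase completed, clamped to 1.0 past max_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {warmup_cosine_lr(step):.2e}")
```

The rate peaks at `base_lr` when warmup ends and reaches `min_lr` at `max_steps`. The ratio of warmup steps to total steps therefore matters more than the raw warmup count.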

In [None]:
from omegaconf import OmegaConf, DictConfig
# NeMo 2.x builds on `lightning.pytorch`, not the legacy `pytorch_lightning` package
import lightning.pytorch as pl
from nemo.utils.exp_manager import exp_manager

# Create fine-tuning configuration
def create_finetune_config():
    config = OmegaConf.create({
        'model': {
            'train_ds': {
                'manifest_filepath': train_manifest,
                'sample_rate': 16000,
                'batch_size': 8,  # Adjust based on GPU memory
                'shuffle': True,
                'num_workers': 4,
                'pin_memory': True,
                'use_start_end_token': False,
            },
            'validation_ds': {
                'manifest_filepath': test_manifest,
                'sample_rate': 16000,
                'batch_size': 8,
                'shuffle': False,
                'num_workers': 4,
                'pin_memory': True,
                'use_start_end_token': False,
            },
            'optim': {
                'name': 'adamw',
                'lr': 1e-4,  # Lower learning rate for fine-tuning
                'weight_decay': 0.001,
                'sched': {
                    'name': 'CosineAnnealing',
                    'warmup_steps': 100,
                    'min_lr': 1e-6,
                }
            }
        },
        'trainer': {
            'devices': 1,
            'max_epochs': 10,  # Adjust as needed
            'precision': 'bf16-mixed' if torch.cuda.is_available() else 32,
            'accelerator': 'gpu' if torch.cuda.is_available() else 'cpu',
            'strategy': 'auto',
            'enable_checkpointing': False,  # exp_manager attaches the checkpoint callback
            'logger': False,  # exp_manager attaches the TensorBoard logger
            'log_every_n_steps': 10,
            'val_check_interval': 1.0,
            'gradient_clip_val': 1.0,
        },
        'exp_manager': {
            'exp_dir': f'{DATA_DIR}/checkpoints',
            'name': 'parakeet_v3_finetune',
            'version': 'v1',
            'use_datetime_version': False,
            'create_tensorboard_logger': True,
            'create_checkpoint_callback': True,
            'checkpoint_callback_params': {
                'monitor': 'val_wer',
                'mode': 'min',
                'save_top_k': 3,
                'save_last': True,
            }
        }
    })
    return config

# Create configuration
cfg = create_finetune_config()
print("Fine-tuning configuration created!")
print(f"Training manifest: {cfg.model.train_ds.manifest_filepath}")
print(f"Validation manifest: {cfg.model.validation_ds.manifest_filepath}")
print(f"Max epochs: {cfg.trainer.max_epochs}")
print(f"Learning rate: {cfg.model.optim.lr}")

## 6. Fine-tuning Process

Now we'll fine-tune the Parakeet-v3 model on our dataset.
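
Before launching a run, it is useful to estimate how many optimizer steps it will take, since the warmup length and logging interval are expressed in steps. The sketch below is a back-of-the-envelope estimate that assumes the last partial batch is kept. The function name `training_steps` and the utterance count of 1000 are hypothetical; substitute the number of lines in your own training manifest.

```python
import math


def training_steps(num_utterances, batch_size, max_epochs, devices=1, accumulate_grad_batches=1):
    """Estimate optimizer steps per epoch and in total."""
    # Each device sees its own share of the data, batch_size items at a time.
    batches_per_epoch = math.ceil(num_utterances / (batch_size * devices))
    # With gradient accumulation, several batches contribute to one optimizer step.
    steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
    return steps_per_epoch, steps_per_epoch * max_epochs


# Hypothetical dataset size; batch size and epochs match the configuration above.
per_epoch, total = training_steps(num_utterances=1000, batch_size=8, max_epochs=10)
warmup_fraction = 100 / total
print(f"{per_epoch} steps/epoch, {total} total, warmup covers {warmup_fraction:.0%} of training")
```

If the warmup ends up covering a large share of the total steps, as can happen on small datasets, consider lowering `warmup_steps` or training for more epochs.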

In [None]:
# Setup experiment manager
trainer = pl.Trainer(**cfg.trainer)
exp_dir = exp_manager(trainer, cfg.exp_manager)

# Update model configuration for fine-tuning
asr_model.set_trainer(trainer)

# Setup data loaders
asr_model.setup_training_data(cfg.model.train_ds)
asr_model.setup_validation_data(cfg.model.validation_ds)

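# Apply the fine-tuning optimizer settings. Calling configure_optimizers() here
# would reuse the optimizer config stored with the pre-trained model; Lightning
# calls it automatically inside trainer.fit().
from omegaconf import open_dict
with open_dict(asr_model.cfg):
    asr_model.cfg.optim = cfg.model.optim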

print("Starting fine-tuning...")
print(f"Experiment directory: {exp_dir}")

# Start training
trainer.fit(asr_model)

print("Fine-tuning completed!")

## 7. Model Evaluation

Evaluate the fine-tuned model on the test set.
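
When several checkpoints are saved, the one with the lowest validation WER can often be identified from its filename. The sketch below assumes filenames that embed the monitored metric as `val_wer=<value>` and mark the most recent checkpoint with a `-last.ckpt` suffix. Both conventions are assumptions to verify against your own checkpoint directory, and the example paths are made up.

```python
import os
import re

WER_PATTERN = re.compile(r"val_wer=(\d+(?:\.\d+)?)")


def pick_lowest_wer(checkpoint_paths):
    """Return the path whose filename reports the lowest val_wer, or None."""
    scored = []
    for path in checkpoint_paths:
        name = os.path.basename(path)
        if name.endswith("-last.ckpt"):
            continue  # the "last" checkpoint is not necessarily the best one
        match = WER_PATTERN.search(name)
        if match:
            scored.append((float(match.group(1)), path))
    return min(scored)[1] if scored else None


# Hypothetical filenames, for illustration only.
example_paths = [
    "checkpoints/demo--val_wer=0.1520-epoch=3.ckpt",
    "checkpoints/demo--val_wer=0.0985-epoch=7.ckpt",
    "checkpoints/demo--val_wer=0.0985-epoch=7-last.ckpt",
]
print(pick_lowest_wer(example_paths))
```

Parsing filenames is a fallback. When the trainer object is still in memory, the checkpoint callback's `best_model_path` attribute is the more direct source.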

In [None]:
# Load the best checkpoint
checkpoint_dir = f"{exp_dir}/checkpoints"
checkpoint_files = glob.glob(f"{checkpoint_dir}/*.ckpt")

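if checkpoint_files:
    # Prefer the checkpoint the callback tracked as best (lowest validation WER)
    ckpt_callback = getattr(trainer, "checkpoint_callback", None)
    best_checkpoint = getattr(ckpt_callback, "best_model_path", None) or None
    
    if best_checkpoint is None:
        # Fall back to any checkpoint other than the "last" one, if present
        candidates = [c for c in checkpoint_files if "last" not in os.path.basename(c)]
        best_checkpoint = (candidates or checkpoint_files)[0]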
    
    print(f"Loading checkpoint: {best_checkpoint}")
    
    # Load the fine-tuned model
    # Use the concrete model class; the abstract ASRModel base cannot be instantiated directly
    finetuned_model = type(asr_model).load_from_checkpoint(best_checkpoint)
    
    # Test the fine-tuned model
    print("\nTesting fine-tuned model...")
    output_finetuned = finetuned_model.transcribe([sample_audio])
    print(f"Fine-tuned transcription: {output_finetuned[0].text}")
    
    # Compare with original
    print("\nComparison:")
    print(f"Original model: {output[0].text}")
    print(f"Fine-tuned model: {output_finetuned[0].text}")
    
else:
    print("No checkpoints found. Using the current model state.")
    finetuned_model = asr_model

## 8. Model Export

Export the fine-tuned model for deployment.
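
A `.nemo` file is a tar archive that bundles the model configuration and weights (typically `model_config.yaml` and `model_weights.ckpt`, plus artifacts such as tokenizer files). The sketch below lists an archive's members with the standard `tarfile` module, which is a quick way to inspect an exported model without loading it. The demo builds a small stand-in archive with placeholder contents rather than a real model, and `list_archive_members` is a hypothetical helper name.

```python
import io
import os
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the sorted member names of a tar archive (compressed or not)."""
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(member.name for member in archive.getmembers())


# Demo: build a stand-in archive with placeholder files, then list it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.nemo")
    with tarfile.open(demo_path, "w") as archive:
        for name in ("model_config.yaml", "model_weights.ckpt"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    members = list_archive_members(demo_path)

print(members)
```

Member names in a real export may carry a leading `./` prefix and include additional tokenizer files.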

In [None]:
# Save the fine-tuned model in NeMo format
output_model_path = f"{DATA_DIR}/parakeet_v3_finetuned.nemo"

if 'finetuned_model' in locals():
    finetuned_model.save_to(output_model_path)
    print(f"Fine-tuned model saved to: {output_model_path}")
    
    # Verify the saved model can be loaded
    print("\nVerifying saved model...")
    loaded_model = ASRModel.restore_from(output_model_path)
    test_output = loaded_model.transcribe([sample_audio])
    print(f"Loaded model transcription: {test_output[0].text}")
    print("Model verification successful!")
    
else:
    print("No fine-tuned model available to save.")

## 9. Batch Evaluation (Optional)

Transcribe a subset of the test set and compare the predictions with the ground-truth transcripts.
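
Word error rate (WER) can be computed from prediction and reference pairs like the ones printed by the cell below. The following is a minimal pure-Python sketch based on word-level edit distance. The function names and the example sentences are hypothetical, and no text normalization (casing, punctuation) is applied, so normalize both sides consistently before comparing real transcripts.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_errors += edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


# Hypothetical transcripts, for illustration only.
references = ["turn the lights on", "play some music"]
hypotheses = ["turn the light on", "play music"]
wer = corpus_wer(references, hypotheses)
print(f"WER: {wer:.2%}")
```

Summing errors over the whole corpus before dividing weights long utterances more heavily than averaging per-utterance WER would. The corpus-level sum is the usual convention for reporting ASR results.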

In [None]:
# Batch evaluation on test set
if 'finetuned_model' in locals():
    print("Running batch evaluation on test set...")
    
    # Read test manifest
    test_files = []
    test_texts = []
    
    with open(test_manifest, 'r') as f:
        for line in f:
            data = json.loads(line)
            test_files.append(data['audio_filepath'])
            test_texts.append(data['text'])
    
    num_eval = min(10, len(test_files))  # Limit to the first 10 files for the demo
    print(f"Evaluating on {num_eval} of {len(test_files)} test files...")
    
    # Transcribe the selected test files
    predictions = finetuned_model.transcribe(test_files[:num_eval])
    
    # Display some results
    print("\nSample results:")
    for i, (pred, true_text) in enumerate(zip(predictions[:5], test_texts[:5])):
        print(f"\nSample {i+1}:")
        print(f"  Ground truth: {true_text}")
        print(f"  Prediction:   {pred.text}")
    
    print("\nBatch evaluation completed!")
else:
    print("No fine-tuned model available for batch evaluation.")

## 10. Summary and Next Steps

This notebook demonstrated how to:

1. **Set up NeMo 2.5+** environment with all required dependencies
2. **Load Parakeet-v3** model from Hugging Face
3. **Prepare training data** in the correct format
4. **Fine-tune the model** on custom data
5. **Evaluate and export** the fine-tuned model

### Key Differences from NeMo 1.23:
- Updated import statements and API calls
- New configuration structure with OmegaConf
- Updated trainer and experiment manager setup
- Improved model loading from Hugging Face

### Next Steps:
1. **Scale up training** with more epochs and larger datasets
2. **Experiment with hyperparameters** (learning rate, batch size, etc.)
3. **Deploy the model** using NVIDIA Riva or other inference frameworks
4. **Evaluate on domain-specific data** for your use case

### For Production Use:
- Use larger batch sizes and multiple GPUs for faster training
- Implement proper validation and early stopping
- Add comprehensive logging and monitoring
- Consider using distributed training for very large datasets