# Amapiano AI - Preflight Training Pipeline (Local JupyterLab)

**Purpose**: Validate the complete training pipeline locally before AWS deployment

**Prerequisites**:
- Python 3.9+
- CUDA-compatible GPU (recommended)
- ~10GB disk space for model + dataset
- Jupyter Lab installed
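
The Python and disk-space prerequisites above can be checked programmatically before running anything else. The following is a minimal sketch using only the standard library; the helper name `check_prerequisites` and its thresholds are illustrative assumptions taken from the list above, not part of the project.

```python
import shutil
import sys


def check_prerequisites(path=".", min_python=(3, 9), min_free_gb=10.0):
    """Report whether the interpreter and free disk space meet the listed minimums.

    Hypothetical helper: the defaults mirror the prerequisites list (Python 3.9+, ~10GB).
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= min_python,
        "free_gb": free_gb,
        "disk_ok": free_gb >= min_free_gb,
    }


report = check_prerequisites()
print(f"Python OK: {report['python_ok']}")
print(f"Free disk: {report['free_gb']:.1f} GB (OK: {report['disk_ok']})")
```

The GPU check is left to TEST 1, since it needs `torch` to be installed first.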

**Tests**:
1. Environment setup
2. Dataset preparation
3. Model initialization
4. Training execution
5. Checkpoint validation
6. Checkpoint inference (optional)

## Setup: Install Dependencies

In [None]:
# Install requirements into the environment of the running kernel
# (a bare `pip` inside %%bash can target a different Python installation)
%pip install -r ai-service/requirements.txt

## TEST 1: Environment Validation

In [None]:
import torch
import transformers
import audiocraft
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Audiocraft version: {audiocraft.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️  No GPU detected - training will be VERY slow")

print("\n✅ TEST 1: ENVIRONMENT SETUP - PASS")
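
The cell above stops with an `ImportError` at the first missing package, so only one problem is reported per run. As a hedged alternative, the sketch below lists every missing package at once without importing any of them. The helper name `find_missing_packages` is an assumption for illustration; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util


def find_missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by TEST 1
required = ["torch", "transformers", "audiocraft"]
missing = find_missing_packages(required)
if missing:
    print(f"TEST 1: FAILED - missing packages: {', '.join(missing)}")
else:
    print("TEST 1: all required packages found")
```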

## TEST 2: Dataset Setup

In [None]:
%%bash
# Run dataset setup script
cd ai-service
python dataset_setup.py

In [None]:
# Verify dataset was created
from pathlib import Path
import json

dataset_dir = Path("./ai-service/data/amapiano_dataset")
metadata_file = dataset_dir / "metadata.jsonl"

if metadata_file.exists():
    with open(metadata_file, 'r') as f:
        # Skip blank lines so a trailing newline does not break json.loads
        samples = [json.loads(line) for line in f if line.strip()]

    print(f"Dataset samples: {len(samples)}")
    if samples:
        print("\nSample entry:")
        print(json.dumps(samples[0], indent=2))

    # Count audio files
    audio_files = list(dataset_dir.glob("*.wav"))
    print(f"\nAudio files found: {len(audio_files)}")

    if not samples:
        print("\n❌ TEST 2: FAILED - metadata.jsonl contains no samples")
    elif len(audio_files) == len(samples):
        print("\n✅ TEST 2: DATASET PREPARED - PASS")
    else:
        print(f"\n❌ TEST 2: FAILED - Mismatch between metadata ({len(samples)}) and audio files ({len(audio_files)})")
else:
    print("❌ TEST 2: FAILED - metadata.jsonl not found")
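
Comparing counts alone can pass even when the metadata and the audio files refer to different names. A stricter check compares the file names themselves, as in the sketch below. It assumes each metadata entry stores its audio file name under a `file_name` key; that key is a guess, so adjust `audio_key` to whatever `dataset_setup.py` actually writes. The function name is also hypothetical.

```python
import json
from pathlib import Path


def find_dataset_mismatches(dataset_dir, audio_key="file_name"):
    """Compare audio names listed in metadata.jsonl with the .wav files on disk.

    `audio_key` is an assumed field name, not confirmed by the dataset script.
    Returns (listed_but_missing, on_disk_but_unlisted), both sorted.
    """
    dataset_dir = Path(dataset_dir)
    listed = set()
    with open(dataset_dir / "metadata.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # Keep only the base name so relative paths still match
                listed.add(Path(json.loads(line)[audio_key]).name)
    on_disk = {p.name for p in dataset_dir.glob("*.wav")}
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Calling `find_dataset_mismatches("./ai-service/data/amapiano_dataset")` returns two empty lists when every entry has a matching file and every file has an entry.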

## TEST 3: Model Initialization

In [None]:
from audiocraft.models import MusicGen

print("Loading MusicGen model (this may take a few minutes)...")
model = MusicGen.get_pretrained('facebook/musicgen-small')

print(f"\nModel loaded successfully")
print(f"Model device: {next(model.lm.parameters()).device}")
print(f"Model parameters: {sum(p.numel() for p in model.lm.parameters()) / 1e6:.1f}M")

# Test generation to ensure model works
print("\nTesting base model generation...")
model.set_generation_params(duration=5)
wav = model.generate(["upbeat electronic music"])

print(f"Generated audio shape: {wav.shape}")
print("\n✅ TEST 3: MODEL INITIALIZED - PASS")

## Configure Training

In [None]:
# Create local training config
import json
from pathlib import Path

import torch

config = {
    "model_name": "facebook/musicgen-small",
    "dataset_path": "./ai-service/data/amapiano_dataset",
    "output_dir": "./training_output",
    "num_epochs": 1,
    "batch_size": 2,
    "learning_rate": 1e-5,
    "gradient_accumulation_steps": 4,
    "save_steps": 50,
    "logging_steps": 10,
    "max_duration": 10.0,
    "sample_rate": 32000,
    "use_fp16": torch.cuda.is_available()
}

config_path = Path("./ai-service/config_local.json")
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("Training configuration:")
print(json.dumps(config, indent=2))
print(f"\nConfig saved to: {config_path}")
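
With `batch_size` 2 and `gradient_accumulation_steps` 4, each optimizer update sees 2 × 4 = 8 samples. The sketch below works out how many updates one epoch produces, which helps judge whether `save_steps` and `logging_steps` will ever trigger on a small preflight dataset. The function is hypothetical, `num_samples=100` is a made-up dataset size, and the arithmetic assumes the last partial batch is kept and a partial accumulation is flushed at the end of the epoch; `train_musicgen.py` may behave differently.

```python
import math


def training_schedule(num_samples, batch_size, gradient_accumulation_steps, num_epochs):
    """Estimate batch and optimizer-step counts for a run (assumptions in the lead-in)."""
    effective_batch_size = batch_size * gradient_accumulation_steps
    batches_per_epoch = math.ceil(num_samples / batch_size)
    optimizer_steps = math.ceil(batches_per_epoch / gradient_accumulation_steps) * num_epochs
    return {
        "effective_batch_size": effective_batch_size,
        "batches_per_epoch": batches_per_epoch,
        "optimizer_steps": optimizer_steps,
    }


# Hypothetical dataset of 100 samples with the values from the config above
schedule = training_schedule(
    num_samples=100, batch_size=2, gradient_accumulation_steps=4, num_epochs=1
)
print(schedule)
```

For this hypothetical dataset the run has 50 batches and 13 optimizer steps. If the training script counts `save_steps` in optimizer steps, a value of 50 would never be reached in one epoch, so TEST 5 would depend on the script also saving a checkpoint when training ends.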

## TEST 4 + 5: Training Execution & Checkpoint Validation

In [None]:
%%time
# Run training script
!python ./ai-service/train_musicgen.py \
  --config ./ai-service/config_local.json \
  2>&1 | tee training_test.log

In [None]:
from pathlib import Path

# TEST 4: Verify training executed
if Path('training_test.log').exists():
    with open('training_test.log', 'r') as f:
        log_content = f.read()
        if 'Epoch 1/' in log_content and 'avg_loss' in log_content:
            print("✅ TEST 4: TRAINING EXECUTED - PASS")
        else:
            print("❌ TEST 4: TRAINING FAILED - Check logs above")
else:
    print("❌ TEST 4: No training log found")

# TEST 5: Verify checkpoint saved
checkpoint_dir = Path("./training_output/checkpoints")

if checkpoint_dir.exists():
    checkpoints = list(checkpoint_dir.glob("*.pt")) + list(checkpoint_dir.glob("*.ckpt"))
    if checkpoints:
        print("\n✅ TEST 5: CHECKPOINT SAVED - PASS")
        print(f"Found {len(checkpoints)} checkpoint(s):")
        for ckpt in checkpoints[:5]:
            size_mb = ckpt.stat().st_size / 1e6
            print(f"  - {ckpt.name} ({size_mb:.1f} MB)")
    else:
        print("\n❌ TEST 5: No checkpoint files found")
else:
    print("\n❌ TEST 5: Checkpoint directory not found")
    print(f"Expected: {checkpoint_dir}")
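
TEST 4 only checks that the string `avg_loss` appears in the log, so a run that diverged to `nan` would still pass. The sketch below extracts the logged values and checks that they are finite. The log format is an assumption: it expects `avg_loss` followed by `:` or `=` and a number, which may not match what `train_musicgen.py` prints, so adapt the pattern to the real log lines. Both function names are hypothetical.

```python
import math
import re

# Assumed format: "avg_loss: 2.31" or "avg_loss=2.31" (also matches nan/inf)
LOSS_PATTERN = re.compile(
    r"avg_loss\s*[:=]\s*([-+]?(?:(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?|nan|inf))",
    re.IGNORECASE,
)


def extract_avg_losses(log_text):
    """Return every avg_loss value found in the log text, in order of appearance."""
    return [float(value) for value in LOSS_PATTERN.findall(log_text)]


def losses_look_healthy(losses):
    """True when at least one loss was logged and all logged losses are finite."""
    return bool(losses) and all(math.isfinite(x) for x in losses)
```

In the notebook this could be applied as `losses_look_healthy(extract_avg_losses(log_content))` after reading `training_test.log`.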

## TEST 6 (Optional): Checkpoint Inference Test

In [None]:
# Load the fine-tuned checkpoint and generate a sample
import torch
from audiocraft.models import MusicGen
from pathlib import Path
import IPython.display as ipd

checkpoint_dir = Path("./training_output/checkpoints")
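# Sort by modification time so the newest checkpoint comes last;
# name order can mis-rank step numbers (e.g. "step_100" sorts before "step_50")
checkpoints = sorted(checkpoint_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime)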

if checkpoints:
    latest_checkpoint = checkpoints[-1]
    print(f"Loading checkpoint: {latest_checkpoint.name}")
    
    # Load base model
    model = MusicGen.get_pretrained('facebook/musicgen-small')
    
    # Load fine-tuned weights
    checkpoint = torch.load(latest_checkpoint, map_location='cpu')
    model.lm.load_state_dict(checkpoint['model_state_dict'])
    
    print("\nGenerating 5-second amapiano sample...")
    model.set_generation_params(duration=5)
    wav = model.generate(["upbeat amapiano with log drums and piano"])
    
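    # Save and play
    output_path = Path("./test_generation.wav")
    import torchaudio
    # wav has shape [batch, channels, samples]; save the first item at the model's own sample rate
    torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=model.sample_rate)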
    
    print("\n✅ TEST 6: INFERENCE SUCCESSFUL")
    print(f"Audio saved to: {output_path}")
    
    # Play in notebook
    display(ipd.Audio(str(output_path)))
else:
    print("❌ TEST 6: No checkpoint found to test")

## Summary

If all tests pass:
- ✅ Environment is configured correctly
- ✅ Dataset generation works
- ✅ Model initialization works
- ✅ Training executes successfully
- ✅ Checkpoints are saved
- ✅ Fine-tuned model can generate audio

**Next Steps**: Deploy to AWS with confidence! 🚀