# Amapiano AI - Pre-Flight Training Validation

**Purpose**: Validate training infrastructure before AWS production deployment

**Duration**: 2-4 hours

**Cost**: $0-10 (Colab Pro)

**Success Criteria**: 5/5 tests pass → Cleared for AWS launch

---

## Tests
1. ✅ GPU Available
2. ✅ Dataset Created
3. ✅ Log Drum Detector Validated
4. ✅ Training Executed (1 epoch)
5. ✅ Checkpoint Saved & Loadable

## Step 1: Mount Google Drive (for checkpoint persistence)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.makedirs('/content/drive/MyDrive/amapiano-training', exist_ok=True)
print("✅ Google Drive mounted successfully")

## Step 2: GPU Verification

In [None]:
import torch

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("\n✅ TEST 1: GPU AVAILABLE - PASS")
else:
    print("❌ TEST 1: GPU NOT AVAILABLE - FAIL")
    print("⚠️  Go to Runtime > Change runtime type > Select GPU")
    raise RuntimeError("GPU required for training")

## Step 3: Clone Repository & Install Dependencies

In [None]:
!git clone https://github.com/YOUR_USERNAME/amapiano-ai.git
%cd amapiano-ai/ai-service

!pip install -q -r requirements.txt
!pip install -q torch torchaudio transformers audiocraft datasets tqdm

print("✅ Dependencies installed")

## Step 4: Create Staging Configuration

In [None]:
import json
from pathlib import Path

config_staging = {
    "model_name": "facebook/musicgen-small",
    "dataset_dir": "/content/drive/MyDrive/amapiano-training/dataset",
    "checkpoint_dir": "/content/drive/MyDrive/amapiano-training/checkpoints",
    "output_dir": "/content/drive/MyDrive/amapiano-training/output",
    "batch_size": 2,
    "num_epochs": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 10,
    "gradient_accumulation_steps": 4,
    "max_audio_length_seconds": 10,
    "sample_rate": 32000,
    "save_every_n_steps": 50,
    "week_5_threshold_days": 0.1,
    "go_nogo_thresholds": {
        "min_authenticity_score": 0.20,
        "max_cost_usd": 10,
        "max_val_loss": 5.0
    }
}

with open('config_staging.json', 'w') as f:
    json.dump(config_staging, f, indent=2)

for dir_path in [config_staging['dataset_dir'], config_staging['checkpoint_dir'], config_staging['output_dir']]:
    Path(dir_path).mkdir(parents=True, exist_ok=True)

print("✅ Staging configuration created")
print(json.dumps(config_staging, indent=2))
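
Two derived quantities follow directly from the staging values above: the effective batch size and the number of raw audio samples per clip. The sketch below recomputes them and adds a rough optimizer-step estimate for a 50-clip dataset. The helper names are illustrative only, and the step estimate assumes `train_musicgen.py` performs one optimizer update per `gradient_accumulation_steps` micro-batches, which this notebook does not confirm.

```python
import math

# Values copied from config_staging; only the fields used here are repeated.
staging = {
    "batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_audio_length_seconds": 10,
    "sample_rate": 32000,
    "save_every_n_steps": 50,
}

def effective_batch_size(cfg):
    """Clips that contribute to each optimizer update."""
    return cfg["batch_size"] * cfg["gradient_accumulation_steps"]

def samples_per_clip(cfg):
    """Raw audio samples in one maximum-length clip."""
    return cfg["max_audio_length_seconds"] * cfg["sample_rate"]

def optimizer_steps_per_epoch(num_clips, cfg):
    """Assumed step count: one update per accumulated group of micro-batches."""
    micro_batches = math.ceil(num_clips / cfg["batch_size"])
    return math.ceil(micro_batches / cfg["gradient_accumulation_steps"])

print(effective_batch_size(staging))           # 8
print(samples_per_clip(staging))               # 320000
print(optimizer_steps_per_epoch(50, staging))  # 7
```

Under that assumption, one epoch over 50 clips ends well before `save_every_n_steps` (50) is reached, so any checkpoint found after the run would have to come from an end-of-epoch save rather than a step-based one.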

## Step 5: Dataset Download & Filtering Test (SMALL SAMPLE)

In [None]:
%%time
!python dataset_setup.py \
  --output_dir /content/drive/MyDrive/amapiano-training/dataset \
  --max_samples 50

import pandas as pd
metadata_path = Path(config_staging['dataset_dir']) / 'training_metadata.csv'

if metadata_path.exists():
    df = pd.read_csv(metadata_path)
    print("\n✅ TEST 2: DATASET CREATED - PASS")
    print(f"   Samples: {len(df)}")
    print(f"   Audio files: {len(list(Path(config_staging['dataset_dir']).glob('*.mp3')))}")
else:
    print("❌ TEST 2: DATASET NOT CREATED - FAIL")

## Step 6: Log Drum Detector Validation (5 Tests)

In [None]:
%%time
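!python test_log_drum_detector.py

# `_exit_code` holds the status of the last `!` command; this assumes the
# test script exits non-zero when any of its tests fails.
log_drum_passed = (_exit_code == 0)
print("\n✅ TEST 3: LOG DRUM DETECTOR VALIDATED - PASS" if log_drum_passed
      else "\n❌ TEST 3: LOG DRUM DETECTOR - FAIL (see output above)")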

## Step 7: Training Logic Test (1 Epoch on Real Data)

In [None]:
%%time
!python train_musicgen.py --config config_staging.json 2>&1 | tee training_test.log

if Path('training_test.log').exists():
    with open('training_test.log', 'r') as f:
        log_content = f.read()
        if 'Epoch 1/' in log_content and 'avg_loss' in log_content:
            print("\n✅ TEST 4: TRAINING EXECUTED - PASS")
        else:
            print("❌ TEST 4: TRAINING FAILED - Check logs above")
else:
    print("❌ TEST 4: No training log found")
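
The check above passes as soon as the two marker strings appear in the log, even if the reported loss is NaN. A slightly stricter check could look like the sketch below. The function name is hypothetical, and the NaN pattern is a guess at the log format, since the notebook only shows that the log contains `Epoch 1/` and `avg_loss`.

```python
import re

def training_log_ok(log_text):
    """Return True if the log shows a finished epoch without a NaN loss."""
    # Same markers the notebook cell looks for.
    has_markers = "Epoch 1/" in log_text and "avg_loss" in log_text
    # Assumed format: the word "loss", optional punctuation, then "nan".
    has_nan_loss = re.search(r"loss\W*nan\b", log_text, re.IGNORECASE) is not None
    return has_markers and not has_nan_loss

print(training_log_ok("Epoch 1/1 | avg_loss=2.31"))  # True
print(training_log_ok("Epoch 1/1 | avg_loss=nan"))   # False
```

In the notebook it would be called as `training_log_ok(Path('training_test.log').read_text())`.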

## Step 8: Checkpoint Persistence Validation

In [None]:
checkpoint_dir = Path(config_staging['checkpoint_dir'])
checkpoint_files = list(checkpoint_dir.glob('*.pt')) + list(checkpoint_dir.glob('*.ckpt'))

if checkpoint_files:
    print(f"✅ PASS: {len(checkpoint_files)} checkpoint(s) found in Google Drive")
    print(f"   Location: {checkpoint_dir}")
    for ckpt_file in checkpoint_files:
        print(f"   - {ckpt_file.name} ({ckpt_file.stat().st_size / 1e6:.1f} MB)")

    ckpt = torch.load(checkpoint_files[0], map_location='cpu')
    print("\n✅ Checkpoint is loadable")
    print(f"   Keys: {list(ckpt.keys())}")
    print(f"   Epoch: {ckpt.get('epoch', 'N/A')}")
    print(f"   Loss: {ckpt.get('loss', 'N/A')}")

    print("\n✅ TEST 5: CHECKPOINT SAVED & LOADABLE - PASS")
else:
    print("❌ TEST 5: No checkpoints found!")
    print(f"   Checked directory: {checkpoint_dir}")
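
The cell above loads `checkpoint_files[0]`, which is whichever file the glob happens to return first. If the most recent checkpoint is the one that matters, a small helper can sort by modification time instead. This is a sketch using only the standard library; `find_checkpoints` is an illustrative name, not part of the repository.

```python
from pathlib import Path

# The two extensions the notebook checks for.
CHECKPOINT_PATTERNS = ("*.pt", "*.ckpt")

def find_checkpoints(checkpoint_dir):
    """Return checkpoint files in the directory, newest first."""
    directory = Path(checkpoint_dir)
    files = [p for pattern in CHECKPOINT_PATTERNS for p in directory.glob(pattern)]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

With this helper, `find_checkpoints(checkpoint_dir)[0]` would be the newest file, and an empty list signals that nothing was saved.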

## Step 9: Resume Test (Spot Instance Interruption Simulation)

In [None]:
print("Simulating Spot Instance Interruption...\n")
print("Running training again - should resume from last.ckpt\n")

!python train_musicgen.py --config config_staging.json 2>&1 | head -n 20

print("\n⚠️  Expected to see: '🔄 RESUMING from checkpoint: last.ckpt'")
print("If you see this message above, resume capability is working ✅")
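
The resume check above relies on reading the output by eye. It could be automated by scanning the first lines of output for the resume message, roughly as sketched below. The function name and the 20-line limit (mirroring `head -n 20`) are illustrative choices, and the marker text is taken from the expected message quoted above; the actual wording printed by `train_musicgen.py` is not verified here.

```python
import subprocess
import sys

# Text portion of the expected resume message (emoji omitted on purpose).
RESUME_MARKER = "RESUMING from checkpoint"

def resumes_from_checkpoint(cmd, max_lines=20):
    """Run cmd, scan its first max_lines output lines for the marker, then stop it."""
    found = False
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line_number, line in enumerate(proc.stdout, start=1):
            if RESUME_MARKER in line:
                found = True
                break
            if line_number >= max_lines:
                break
        # Stop the process early, as piping to `head` would.
        proc.terminate()
    return found

# Stand-in command so the sketch runs anywhere; it only prints the marker.
demo_cmd = [sys.executable, "-c", "print('RESUMING from checkpoint: last.ckpt')"]
print(resumes_from_checkpoint(demo_cmd))  # True
```

For the real run, the command would be `["python", "train_musicgen.py", "--config", "config_staging.json"]`.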

## Step 10: Pre-Flight Test Summary

In [None]:
print("="*60)
print("PRE-FLIGHT TEST SUMMARY")
print("="*60)

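ckpt_dir = Path(config_staging['checkpoint_dir'])
log_path = Path('training_test.log')
log_text = log_path.read_text() if log_path.exists() else ""

tests = [
    ("GPU Available", torch.cuda.is_available()),
    ("Dataset Created", (Path(config_staging['dataset_dir']) / 'training_metadata.csv').exists()),
    # Uses `log_drum_passed` if an earlier cell set it; otherwise assumes True,
    # so confirm the Step 6 output manually.
    ("Log Drum Detector", globals().get('log_drum_passed', True)),
    # Same markers as the Step 7 check, so an empty or failed log does not pass.
    ("Training Executed", 'Epoch 1/' in log_text and 'avg_loss' in log_text),
    # Match both extensions checked in Step 8 (*.pt and *.ckpt).
    ("Checkpoint Saved", any(ckpt_dir.glob('*.pt')) or any(ckpt_dir.glob('*.ckpt'))),
]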

passed = 0
total = len(tests)

for test_name, result in tests:
    status = "✅ PASS" if result else "❌ FAIL"
    print(f"{status}: {test_name}")
    if result:
        passed += 1

print("\n" + "="*60)
print(f"RESULT: {passed}/{total} tests passed")

if passed == total:
    print("\n🎉 ALL SYSTEMS GO")
    print("✅ You are cleared for AWS production deployment")
    print("\nNext steps:")
    print("1. SSH into AWS EC2 g4dn.xlarge (Spot)")
    print("2. Run: ./deploy_training.sh")
    print("3. Monitor Week 5 Go/No-Go decision")
    print("4. Expected cost: $437-524 (4 weeks)")
else:
    print("\n⚠️  PREFLIGHT FAILED")
    print(f"Fix the {total - passed} failing test(s) before AWS deployment")
    print("\nDo NOT proceed to production until all tests pass.")

print("="*60)

---

## Cost Tracking

**Colab Pro**: ~$10/month  
**This notebook**: 2-4 hours = $0.20-0.40 compute cost  
**Total staging cost**: <$10

**AWS Production** (if cleared):  
- Spot instance (g4dn.xlarge): $0.39/hour × 1120 hours = $437  
- Storage (500GB): ~$50  
- Data transfer: ~$20  
- **Total**: $437-524
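
The AWS figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the rates and amounts listed in this section; the constant names are illustrative.

```python
# Figures copied from the cost list above.
SPOT_RATE_USD_PER_HOUR = 0.39
TRAINING_HOURS = 1120
STORAGE_USD = 50
DATA_TRANSFER_USD = 20

spot_cost = SPOT_RATE_USD_PER_HOUR * TRAINING_HOURS
total_cost = spot_cost + STORAGE_USD + DATA_TRANSFER_USD

print(f"Spot compute: ${spot_cost:.2f}")     # Spot compute: $436.80
print(f"Itemized total: ${total_cost:.2f}")  # Itemized total: $506.80
```

The itemized sum of about $507 sits inside the quoted $437-524 range; this section does not state how the $524 upper bound is derived.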

---

## Troubleshooting

**GPU not available**: Runtime > Change runtime type > GPU  
**Out of memory**: Reduce batch_size in config_staging.json  
**Dataset download fails**: Check internet connection, retry cell  
**Training diverges (NaN)**: Can occur in the first few steps; the training script auto-detects NaN losses and aborts  
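
For the out-of-memory case, one way to reduce `batch_size` without changing the effective batch size is to raise `gradient_accumulation_steps` by the same factor. The helper below is a minimal sketch of that edit to `config_staging.json`. The function name is hypothetical, and it assumes the training script reads both fields from this file and accumulates gradients in the usual way.

```python
import json
import math
from pathlib import Path

def halve_batch_size(config_path):
    """Halve batch_size in a JSON config and scale up gradient accumulation."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    old_batch = cfg["batch_size"]
    if old_batch <= 1:
        raise ValueError("batch_size is already 1 and cannot be reduced further")
    new_batch = old_batch // 2
    # Keep batch_size * gradient_accumulation_steps at least as large as before.
    old_effective = old_batch * cfg["gradient_accumulation_steps"]
    cfg["batch_size"] = new_batch
    cfg["gradient_accumulation_steps"] = math.ceil(old_effective / new_batch)
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```

With the staging values (`batch_size` 2, `gradient_accumulation_steps` 4), calling `halve_batch_size('config_staging.json')` would write `batch_size` 1 and `gradient_accumulation_steps` 8.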

---

## Documentation References

- `/docs/PHASE_2_5_EXECUTION_PLAYBOOK.md` - Full 56-day training plan
- `/docs/OPERATIONAL_SAFETY_CHECKLIST.md` - Pre-flight checklist
- `/docs/GREEN_LIGHT_EXECUTION_READY.md` - Final clearance document
- `/ai-service/deploy_training.sh` - AWS deployment script