# Amapiano AI - Simplified Training (Dataset Already Downloaded)

**Status**: ✅ Dataset ready (1,582 clips in `/content/datasets/amapiano_proxy/`)

**Next Step**: Train MusicGen model

**Training Options**:
- Quick Test: 1 epoch, 100 samples (~15-30 min)
- Short: 5 epochs, full dataset (~4-6 hours)
- Full: 20 epochs, full dataset (~16-20 hours)

---

## Step 1: Verify GPU & Dataset

In [None]:
import torch
from pathlib import Path
import pandas as pd

# Check GPU
print("="*60)
print("GPU CHECK")
print("="*60)
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"   CUDA: {torch.version.cuda}")
else:
    print("❌ NO GPU FOUND")
    print("⚠️  Go to Runtime > Change runtime type > GPU")
    raise RuntimeError("GPU required")

# Check dataset
print("\n" + "="*60)
print("DATASET CHECK")
print("="*60)

dataset_path = Path('/content/datasets/amapiano_proxy')
metadata_path = dataset_path / 'training_metadata.csv'
audio_path = dataset_path / 'audio'

if metadata_path.exists():
    df = pd.read_csv(metadata_path)
    audio_files = list(audio_path.glob('*.mp3'))
    
    print(f"✅ Dataset found")
    print(f"   Location: {dataset_path}")
    print(f"   Metadata entries: {len(df)}")
    print(f"   Audio files: {len(audio_files)}")
    print(f"   Total size: {sum(f.stat().st_size for f in audio_files) / 1e9:.2f} GB")
    print(f"   Avg score: {df['score'].mean():.2f}")
    
    print(f"\n   Top characteristics:")
    for char, count in df['characteristics'].value_counts().head(5).items():
        pct = count / len(df) * 100
        print(f"   - {char}: {count} clips ({pct:.1f}%)")
else:
    print(f"❌ Dataset not found at {dataset_path}")
    print("   Please run the dataset setup notebook first")
    raise RuntimeError("Dataset missing")

print("\n" + "="*60)
print("✅ ALL CHECKS PASSED - Ready to train!")
print("="*60)
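
The dataset check above compares only the two totals, so a missing or extra file can go unnoticed when the counts happen to match. A minimal sketch of a per-file cross-check follows. It assumes the metadata CSV has a column holding each clip's file name, here called `filename`; that column name is hypothetical and not confirmed by the source, so adjust it to whatever `training_metadata.csv` actually uses.

```python
import csv
from pathlib import Path


def cross_check(metadata_csv, audio_dir, filename_column="filename"):
    """Return (missing_audio, unlisted_audio) as sorted lists of file names.

    filename_column is a HYPOTHETICAL column name; change it to match the CSV.
    """
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        # Keep only the base name so relative or absolute paths both compare cleanly
        listed = {Path(row[filename_column]).name for row in csv.DictReader(f)}
    on_disk = {p.name for p in Path(audio_dir).glob("*.mp3")}
    # Listed in metadata but absent on disk, and on disk but absent from metadata
    return sorted(listed - on_disk), sorted(on_disk - listed)


# Example call, using the paths defined in the cell above:
# missing, unlisted = cross_check(metadata_path, audio_path)
# print(f"Missing audio: {len(missing)}, unlisted audio: {len(unlisted)}")
```

Two empty lists mean every metadata row has a matching `.mp3` and vice versa.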

## Step 2: Install Training Dependencies

In [None]:
%%time
!pip install -q torch torchaudio transformers audiocraft accelerate
!pip install -q datasets librosa soundfile

print("✅ Training dependencies installed")

## Step 3: Upload Training Script

**Action Required**: Upload `train_musicgen.py` using the file browser on the left:
1. Click the folder icon 📁 on the left sidebar
2. Click the upload button ⬆️
3. Select `/ai-service/train_musicgen.py` from your local files
4. Wait for upload to complete
5. Then run the verification cell below

In [None]:
# Verify training script is uploaded
training_script = Path('/content/train_musicgen.py')

if training_script.exists():
    print(f"✅ Training script found ({training_script.stat().st_size / 1024:.1f} KB)")
else:
    print("❌ Training script not found")
    print("   Please upload train_musicgen.py to /content/")
    print("   Use the file browser (📁) on the left sidebar")

## Step 4: Mount Google Drive (for saving checkpoints)

In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')

# Create output directories
os.makedirs('/content/drive/MyDrive/amapiano-models', exist_ok=True)
os.makedirs('/content/drive/MyDrive/amapiano-models/quick-test', exist_ok=True)
os.makedirs('/content/drive/MyDrive/amapiano-models/5-epoch', exist_ok=True)
os.makedirs('/content/drive/MyDrive/amapiano-models/20-epoch', exist_ok=True)

print("✅ Google Drive mounted")
print("   Models will be saved to: /content/drive/MyDrive/amapiano-models/")

## Step 5: Choose Your Training Mode

**Run ONLY ONE of the following cells:**

### Option A: Quick Pipeline Test ⚡

**Duration**: 15-30 minutes  
**Purpose**: Verify everything works  
**Config**: 1 epoch, 100 samples, batch size 1

In [None]:
%%time
!python /content/train_musicgen.py \
  --data_dir /content/datasets/amapiano_proxy/audio \
  --metadata /content/datasets/amapiano_proxy/training_metadata.csv \
  --output_dir /content/drive/MyDrive/amapiano-models/quick-test \
  --epochs 1 \
  --batch_size 1 \
  --max_samples 100 \
  --learning_rate 1e-5

print("\n" + "="*60)
print("✅ QUICK TEST COMPLETE")
print("="*60)
print("Check the output above for:")
print("  - Loss values (should decrease)")
print("  - Any error messages")
print("  - Checkpoint saved to Google Drive")
print("\nIf successful, you can run the 5-epoch or 20-epoch training next.")

### Option B: Short Training (5 epochs) 🚀

**Duration**: 4-6 hours  
**Purpose**: Get initial trained model  
**Config**: 5 epochs, full dataset, batch size 2

In [None]:
%%time
!python /content/train_musicgen.py \
  --data_dir /content/datasets/amapiano_proxy/audio \
  --metadata /content/datasets/amapiano_proxy/training_metadata.csv \
  --output_dir /content/drive/MyDrive/amapiano-models/5-epoch \
  --epochs 5 \
  --batch_size 2 \
  --learning_rate 1e-5 \
  --save_every 500

print("\n" + "="*60)
print("✅ 5-EPOCH TRAINING COMPLETE")
print("="*60)
print("Model saved to: /content/drive/MyDrive/amapiano-models/5-epoch/")
print("\nNext steps:")
print("  1. Download the model from Google Drive")
print("  2. Test generation in your app")
print("  3. Evaluate authenticity (target: 15-25%)")
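
To get a feel for how many periodic checkpoints `--save_every 500` produces, the arithmetic can be sketched as below. This assumes that `--save_every` counts optimizer steps, that there is one step per batch with the final partial batch kept, and that no gradient accumulation is used. None of this is confirmed by the source for `train_musicgen.py`, so treat the numbers as estimates.

```python
import math


def training_steps(num_clips, batch_size, epochs):
    """Total steps, assuming one step per batch and a kept partial batch."""
    return math.ceil(num_clips / batch_size) * epochs


def periodic_checkpoints(total_steps, save_every):
    """Number of checkpoints written at multiples of save_every (assumed step-based)."""
    return total_steps // save_every


# 1,582 clips at batch size 2, as in Options B and C
steps_5 = training_steps(1582, batch_size=2, epochs=5)
steps_20 = training_steps(1582, batch_size=2, epochs=20)
print(steps_5, periodic_checkpoints(steps_5, 500))    # 3955 7
print(steps_20, periodic_checkpoints(steps_20, 500))  # 15820 31
```

If the script also writes a final or per-epoch checkpoint, the count on Drive will be higher than these figures.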

### Option C: Full Training (20 epochs) 🎯

**Duration**: 16-20 hours  
**Purpose**: Best results with current dataset  
**Config**: 20 epochs, full dataset, batch size 2  
**Note**: Requires Colab Pro, or run it overnight

In [None]:
%%time
!python /content/train_musicgen.py \
  --data_dir /content/datasets/amapiano_proxy/audio \
  --metadata /content/datasets/amapiano_proxy/training_metadata.csv \
  --output_dir /content/drive/MyDrive/amapiano-models/20-epoch \
  --epochs 20 \
  --batch_size 2 \
  --learning_rate 1e-5 \
  --save_every 500

print("\n" + "="*60)
print("✅ FULL 20-EPOCH TRAINING COMPLETE")
print("="*60)
print("Model saved to: /content/drive/MyDrive/amapiano-models/20-epoch/")
print("\nThis is your best model with the MagnaTagATune dataset.")
print("\nExpected performance:")
print("  - Authenticity: 15-25% (vs 10-20% baseline)")
print("  - Better rhythms and beats")
print("  - Some piano elements")
print("  - Foundation for further training with real Amapiano samples")

## Step 6: Monitor Training Progress

In [None]:
# This cell helps you monitor training while it's running
# Run this in a separate cell while training is ongoing

import time
from pathlib import Path

# Check which training is running
output_dirs = {
    'Quick Test': '/content/drive/MyDrive/amapiano-models/quick-test',
    '5 Epoch': '/content/drive/MyDrive/amapiano-models/5-epoch',
    '20 Epoch': '/content/drive/MyDrive/amapiano-models/20-epoch'
}

print("Checking for training progress...\n")

for name, path in output_dirs.items():
    checkpoint_dir = Path(path)
    if checkpoint_dir.exists():
        checkpoints = list(checkpoint_dir.glob('*.pt')) + list(checkpoint_dir.glob('*.pth'))
        if checkpoints:
            latest = max(checkpoints, key=lambda p: p.stat().st_mtime)
            age_seconds = time.time() - latest.stat().st_mtime
            age_minutes = age_seconds / 60
            
            print(f"📊 {name}:")
            print(f"   Latest checkpoint: {latest.name}")
            print(f"   Size: {latest.stat().st_size / 1e6:.1f} MB")
            print(f"   Last updated: {age_minutes:.1f} minutes ago")
            print(f"   Total checkpoints: {len(checkpoints)}")
            print()

print("💡 Tip: Rerun this cell every few minutes to track progress")

## Step 7: Verify Training Completed Successfully

In [None]:
import torch
from pathlib import Path

print("="*60)
print("TRAINING VERIFICATION")
print("="*60)

output_dirs = {
    'Quick Test': '/content/drive/MyDrive/amapiano-models/quick-test',
    '5 Epoch': '/content/drive/MyDrive/amapiano-models/5-epoch',
    '20 Epoch': '/content/drive/MyDrive/amapiano-models/20-epoch'
}

for name, path in output_dirs.items():
    checkpoint_dir = Path(path)
    if checkpoint_dir.exists():
        checkpoints = list(checkpoint_dir.glob('*.pt')) + list(checkpoint_dir.glob('*.pth'))
        
        if checkpoints:
            print(f"\n📦 {name}:")
            print(f"   Location: {path}")
            print(f"   Checkpoints found: {len(checkpoints)}")
            
            # Try to load the latest checkpoint
            latest = max(checkpoints, key=lambda p: p.stat().st_mtime)
            print(f"   Latest: {latest.name} ({latest.stat().st_size / 1e6:.1f} MB)")
            
            try:
                ckpt = torch.load(latest, map_location='cpu')
                print(f"   ✅ Checkpoint is loadable")
                
                if 'epoch' in ckpt:
                    print(f"   Epoch: {ckpt['epoch']}")
                if 'loss' in ckpt:
                    print(f"   Loss: {ckpt['loss']:.4f}")
                if 'model_state_dict' in ckpt:
                    print(f"   ✅ Model weights present")
                    
            except Exception as e:
                print(f"   ⚠️  Warning: {str(e)[:100]}")

print("\n" + "="*60)
print("\n✅ Verification complete!")
print("\nYour trained models are saved in Google Drive and will persist")
print("even after this Colab session ends.")

## Summary & Next Steps

In [None]:
print("="*60)
print("TRAINING COMPLETE - SUMMARY")
print("="*60)

print("\n📊 Dataset Used:")
print("   - Source: MagnaTagATune (filtered)")
print("   - Clips: 1,582")
print("   - Duration: ~12.7 hours")
print("   - Characteristics: Electronic/techno with drums, piano, bass")

print("\n💾 Model Location:")
print("   - Google Drive: /MyDrive/amapiano-models/")
print("   - Download to use in your app")

print("\n🎯 Expected Results:")
print("   - Baseline MusicGen: 10-20% Amapiano authenticity")
print("   - Your trained model: 15-25% authenticity (estimated)")
print("   - Improvement: Better rhythm/beat patterns")
print("   - Limitation: Not authentic Amapiano yet (dataset constraint)")

print("\n📝 Next Steps:")
print("   1. Download your trained model from Google Drive")
print("   2. Load it in your Amapiano AI application")
print("   3. Generate test samples with different prompts")
print("   4. Evaluate authenticity compared to baseline")
print("   5. If promising (>20%), collect real Amapiano samples")
print("   6. Fine-tune on 500-1000 real tracks for 40-50% authenticity")

print("\n💡 Recommendations:")
print("   - If results show improvement, proceed to Phase 3")
print("   - Collect authentic Amapiano tracks from:")
print("     • Kabza De Small")
print("     • DJ Maphorisa")
print("     • Kelvin Momo")
print("     • Focalistic")
print("   - Target: 500-1000 tracks for next training phase")
print("   - Expected authenticity with real data: 40-50%")

print("\n💰 Cost Breakdown:")
print("   - Colab (free tier): $0")
print("   - Colab Pro (if used): $10/month")
print("   - GPU time: Already included")
print("   - Total for this phase: $0-10")

print("\n" + "="*60)
print("🎉 Congratulations on completing the training!")
print("="*60)

---

## Troubleshooting Guide

### GPU Issues
**Problem**: No GPU available  
**Solution**: Runtime > Change runtime type > Hardware accelerator: GPU > Save

### Memory Issues
**Problem**: CUDA out of memory  
**Solution**: Reduce `--batch_size` to 1 in the training command

### Session Disconnected
**Problem**: Colab session timed out during training  
**Solution**: Checkpoints are saved in Google Drive. Training will resume from last checkpoint if you rerun the same command

### Training Script Not Found
**Problem**: `train_musicgen.py` not found  
**Solution**: 
1. Use file browser (📁) on left
2. Click upload button (⬆️)
3. Select `train_musicgen.py` from `/ai-service/` folder
4. Verify it appears in `/content/`

### Loss is NaN
**Problem**: Training shows NaN loss  
**Solution**: This can happen in early steps. If it persists after 100 steps, reduce learning rate to 5e-6
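
The rule of thumb above (tolerate NaN early, act if it persists past 100 steps) can be sketched as a small helper. It assumes you have collected the per-step loss values into a Python list. How to extract them from the `train_musicgen.py` log is not specified in the source, so that part is left out.

```python
import math


def nan_persists(losses, warmup_steps=100):
    """True if any loss recorded after the first warmup_steps steps is NaN."""
    return any(math.isnan(x) for x in losses[warmup_steps:])


def suggested_learning_rate(losses, current_lr=1e-5, reduced_lr=5e-6, warmup_steps=100):
    """Keep the current rate unless NaN shows up after the warmup window."""
    return reduced_lr if nan_persists(losses, warmup_steps) else current_lr


# NaN only in the first few steps: keep 1e-5
early_only = [float("nan")] * 5 + [2.0] * 200
print(suggested_learning_rate(early_only))

# NaN after step 100: drop to 5e-6
late = [2.0] * 150 + [float("nan")]
print(suggested_learning_rate(late))
```

The returned value would then be passed as `--learning_rate` when rerunning the training command.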

### Slow Training
**Problem**: Training is very slow  
**Expected**: 
- Quick test: 15-30 min
- 5 epochs: 4-6 hours
- 20 epochs: 16-20 hours

If significantly slower, check GPU is being used (Step 1)
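
To judge "significantly slower" before a run finishes, you can project the total time from a seconds-per-step figure read off the training log. The sketch below assumes one optimizer step per batch, a kept final partial batch, and no gradient accumulation, which the source does not confirm for `train_musicgen.py`. The `slack` factor of 1.5 is an arbitrary illustrative threshold, not a value from the source.

```python
import math


def projected_hours(seconds_per_step, num_clips=1582, batch_size=2, epochs=5):
    """Projected wall-clock hours for a run at the measured step time."""
    steps = math.ceil(num_clips / batch_size) * epochs
    return steps * seconds_per_step / 3600


def is_significantly_slower(seconds_per_step, epochs, upper_bound_hours, slack=1.5):
    """True if the projection exceeds the expected upper bound by the slack factor."""
    return projected_hours(seconds_per_step, epochs=epochs) > upper_bound_hours * slack


# 5-epoch run, expected 4-6 hours according to the list above
print(f"{projected_hours(4.0):.1f} h")                             # 4.4 h
print(is_significantly_slower(4.0, epochs=5, upper_bound_hours=6))   # False
print(is_significantly_slower(12.0, epochs=5, upper_bound_hours=6))  # True
```

Under these assumptions, the stated 4-6 hours for 5 epochs corresponds to roughly 3.6-5.5 seconds per step.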

---

## Files You Need to Upload

From your local `/ai-service/` folder:
1. ✅ `train_musicgen.py` - Main training script (REQUIRED)

Everything else is handled by pip install.

---

## References

- **MagnaTagATune Dataset**: https://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset
- **MusicGen Paper**: https://arxiv.org/abs/2306.05284
- **AudioCraft GitHub**: https://github.com/facebookresearch/audiocraft
- **Project Docs**: See `/docs/` folder in repository

---

## Contact & Support

If you encounter issues:
1. Check the troubleshooting section above
2. Review training logs for error messages
3. Verify all prerequisites (GPU, dataset, script uploaded)
4. Check Google Drive has sufficient space (~5-10 GB)
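
The free-space check in item 4 can be sketched with the standard library's `shutil.disk_usage`. The 10 GB default mirrors the upper end of the estimate above. Note that the figure reported for a mounted Drive folder may not match the quota shown in the Drive web UI, so treat it as a rough indicator.

```python
import shutil


def free_gb(path):
    """Free space at path in GB (1e9 bytes, matching the other cells)."""
    return shutil.disk_usage(path).free / 1e9


def has_space(path, required_gb=10):
    """True if at least required_gb of free space is reported at path."""
    return free_gb(path) >= required_gb


# Runs anywhere; in Colab, point it at the mounted Drive instead:
# has_space('/content/drive/MyDrive')
print(f"{free_gb('.'):.1f} GB free in the current directory")
```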

---

**Version**: 1.0  
**Last Updated**: 2025-11-29  
**Status**: Ready for production use ✅