# 🌊 YOLO-UDD v2.0 - Underwater Debris Detection (KAGGLE)

**Complete Training Pipeline on Kaggle with GPU** ⚡

## 🚀 Quick Start:
1. **Upload Dataset**: Add TrashCAN dataset as Kaggle Dataset
2. **Enable GPU**: Settings → Accelerator → GPU T4 x2 → Save
3. **Run All**: Run all cells sequentially
4. **Download Results**: Download trained model from Output folder

## ⚙️ Configuration:
- **Epochs**: 100 (reduced from 300 for faster training, ~10 hours)
- **Batch Size**: 8
- **Classes**: 22 (matches TrashCAN dataset)
- **Expected mAP**: 70-72%
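
The figures above imply roughly six minutes per epoch. The sketch below works through that arithmetic. The helper name `epoch_budget` and the 12-hour session limit are assumptions for illustration (not values from this notebook), so check the quota for your own account.

```python
def epoch_budget(total_hours: float, epochs: int, session_limit_hours: float):
    """Return (minutes per epoch, epochs that fit in one session)."""
    minutes_per_epoch = total_hours * 60 / epochs
    epochs_per_session = int(session_limit_hours * 60 // minutes_per_epoch)
    return minutes_per_epoch, epochs_per_session

# 10 hours and 100 epochs come from the configuration above;
# the 12-hour session cap is an assumed value.
minutes, fits = epoch_budget(total_hours=10, epochs=100, session_limit_hours=12)
print(f"~{minutes:.0f} min per epoch, about {fits} epochs per session")
```

If the measured time per epoch turns out higher than this estimate, lowering `EPOCHS` keeps the run inside a single session.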

---

## Step 1: Setup Environment

In [None]:
# Clone repository
import os
import sys

# Kaggle uses /kaggle/working directory
WORK_DIR = '/kaggle/working'
REPO_DIR = f'{WORK_DIR}/YOLO-UDD-v2.0'

print("="*60)
print("Step 1: Cloning Repository")
print("="*60)

# Ensure we're in working directory
try:
    os.chdir(WORK_DIR)
    print(f"✓ Changed to working directory: {os.getcwd()}")
except Exception as e:
    print(f"✗ Error changing directory: {e}")
    raise

# Remove existing directory if present
if os.path.exists(REPO_DIR):
    import shutil
    shutil.rmtree(REPO_DIR)
    print("✓ Cleaned existing directory")

# Clone repository
print("\nCloning repository from GitHub...")
!git clone https://github.com/kshitijkhede/YOLO-UDD-v2.0.git

# Verify clone succeeded
if not os.path.exists(REPO_DIR):
    print(f"\n✗ ERROR: Repository not cloned!")
    print(f"   Expected location: {REPO_DIR}")
    raise FileNotFoundError("Failed to clone repository. Please check internet connection and repository URL.")

# Change to repo directory
try:
    os.chdir(REPO_DIR)
    print(f"\n✓ Changed to repository directory: {os.getcwd()}")
except Exception as e:
    print(f"\n✗ Error changing to repo directory: {e}")
    raise

# Add to Python path
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)
    print(f"✓ Added to Python path: {REPO_DIR}")

# Verify we're in the right place
print(f"\n✓ Current directory: {os.getcwd()}")
print(f"✓ Python path includes: {REPO_DIR}")
print("="*60)

In [None]:
# Verify repository structure
import os

print("="*60)
print("📂 Repository Structure")
print("="*60)

required_dirs = ['models', 'scripts', 'data', 'utils', 'configs']
required_files = ['requirements.txt', 'models/__init__.py', 'scripts/train.py']

for dir_name in required_dirs:
    status = "✓" if os.path.exists(dir_name) else "✗"
    print(f"{status} {dir_name}/")

print()
for file_name in required_files:
    status = "✓" if os.path.exists(file_name) else "✗"
    print(f"{status} {file_name}")

print("="*60)

In [None]:
# Verify Python can find modules
import os
import sys

print("="*60)
print("🔍 Module Import Diagnostics")
print("="*60)

print(f"\nCurrent working directory:")
print(f"  {os.getcwd()}")

print(f"\nPython sys.path (first 3 entries):")
for i, path in enumerate(sys.path[:3]):
    print(f"  {i+1}. {path}")

print(f"\nChecking for models module:")
models_path = os.path.join(os.getcwd(), 'models')
if os.path.exists(models_path):
    print(f"  ✓ models/ directory exists at: {models_path}")
    if os.path.exists(os.path.join(models_path, '__init__.py')):
        print(f"  ✓ models/__init__.py exists")
    if os.path.exists(os.path.join(models_path, 'yolo_udd.py')):
        print(f"  ✓ models/yolo_udd.py exists")
else:
    print(f"  ✗ models/ directory NOT FOUND!")
    print(f"  ✗ Expected at: {models_path}")
    print(f"\n  Available directories:")
    for item in os.listdir(os.getcwd()):
        if os.path.isdir(item):
            print(f"    📁 {item}/")

print("="*60)

## CRITICAL FIX: NumPy Compatibility

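**⚠️ IMPORTANT**: Kaggle images may ship NumPy 2.x by default, while builds of TensorFlow and scikit-learn compiled against NumPy 1.x can fail at import time.
The cell below pins NumPy to 1.26.4 to prevent training crashes. Restart the kernel afterwards so the pinned version is actually loaded.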
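
An already-imported `numpy` module keeps reporting its old version until the kernel restarts, so checking `np.__version__` right after a reinstall can be misleading. A minimal sketch of a check that reads the installed package metadata instead is shown here; the helper name `installed_major_version` is hypothetical.

```python
from importlib import metadata

def installed_major_version(package: str):
    """Return the major version of the package on disk, or None if it is absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])

major = installed_major_version("numpy")
if major is None:
    print("NumPy is not installed")
elif major >= 2:
    print(f"NumPy {major}.x on disk: pin to 1.x, then restart the kernel")
else:
    print(f"NumPy {major}.x on disk: no change needed")
```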

In [None]:
# ============================================================
# CRITICAL FIX: Force NumPy 1.x Installation
# ============================================================

print("="*60)
print("🔧 FIXING NumPy Compatibility Issue")
print("="*60)

# Check current NumPy version
import numpy as np
current_version = np.__version__
print(f"\n📌 Current NumPy version: {current_version}")

if current_version.startswith('2.'):
    print("\n⚠️  NumPy 2.x detected - this can crash TensorFlow/scikit-learn builds compiled against NumPy 1.x!")
    print("Forcing downgrade to NumPy 1.x...\n")
    
    # Force uninstall NumPy 2.x
    import sys
    !{sys.executable} -m pip uninstall -y numpy
    
    # Install NumPy 1.x with force reinstall
    !{sys.executable} -m pip install 'numpy==1.26.4' --force-reinstall --no-cache-dir
    
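    # Verify against the installed package metadata; the already-imported
    # module keeps reporting 2.x until the kernel restarts
    import importlib
    import importlib.metadata
    importlib.invalidate_caches()
    installed_version = importlib.metadata.version('numpy')

    print("\n" + "="*60)
    print("✅ Verifying Fix...")
    print("="*60)
    if installed_version.startswith('1.'):
        print(f"✓ NumPy {installed_version} has been installed!")
        print("\n⚠️  IMPORTANT: You MUST restart the kernel now!")
        print("   Click: Kernel → Restart Kernel")
        print("   Then run all cells again from Cell 1.")
    else:
        print(f"✗ Downgrade failed: NumPy {installed_version} is still installed.")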
else:
    print(f"✓ NumPy {current_version} already installed - no fix needed!")
    print("✓ Training should work correctly.")

print("="*60)

In [None]:
# Check GPU availability
import torch

print("="*60)
print("🔥 GPU Status Check")
print("="*60)

if torch.cuda.is_available():
    print(f"✓ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Count: {torch.cuda.device_count()}")
    print(f"✓ CUDA Version: {torch.version.cuda}")
    print(f"✓ PyTorch Version: {torch.__version__}")
    
    # Get GPU memory info
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"✓ GPU Memory: {gpu_mem:.1f} GB")
else:
    print("✗ GPU NOT AVAILABLE!")
    print("⚠️  Please enable GPU: Settings → Accelerator → GPU T4 x2 → Save")
    raise RuntimeError("GPU not available. Training will be extremely slow on CPU.")

print("="*60)

## Step 2: Install Dependencies

In [None]:
# Install required packages
print("Installing dependencies...\n")

# Install minimum versions; specifiers are quoted so the shell
# does not treat '>' as an output redirect
!pip install -q 'torch>=2.0.0' 'torchvision>=0.15.0'
!pip install -q 'albumentations>=1.3.0'
!pip install -q 'opencv-python-headless>=4.7.0'
!pip install -q 'pycocotools>=2.0.6'
!pip install -q 'tensorboard>=2.12.0'
!pip install -q tqdm pyyaml
!pip install -q scikit-learn matplotlib seaborn

print("\n✓ All dependencies installed successfully!")

## Step 3: Setup Dataset

**IMPORTANT**: You need to add the TrashCAN dataset as a Kaggle Dataset:

1. Go to: https://www.kaggle.com/datasets
2. Click "New Dataset"
3. Upload TrashCAN images and annotations
4. Make it public or private
5. Add it to this notebook: "Add Data" → Search for your dataset

Then update the `DATASET_PATH` below to match your dataset path.
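
Before pointing training at the dataset, it can help to confirm that the upload actually contains images and annotation files. The sketch below counts both under a root directory. It assumes annotations are stored as `.json` files, and the helper name `summarize_dataset` is hypothetical.

```python
from pathlib import Path

def summarize_dataset(root: str, image_exts=(".jpg", ".jpeg", ".png")):
    """Count images and list JSON annotation files found anywhere under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {"images": 0, "annotations": []}
    images = 0
    annotations = []
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in image_exts:
            images += 1
        elif suffix == ".json":
            # Store paths relative to the dataset root for readability
            annotations.append(str(path.relative_to(root_path)))
    return {"images": images, "annotations": sorted(annotations)}

# The path matches the default DATASET_PATH used in this notebook
print(summarize_dataset("/kaggle/input/trashcan-dataset"))
```

A result with zero images or an empty annotation list usually means the dataset path or the upload layout needs another look.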

In [None]:
# Configure dataset path
import os

# UPDATE THIS PATH to match your Kaggle dataset
# Example: '/kaggle/input/trashcan-dataset' or '/kaggle/input/your-dataset-name'
DATASET_PATH = '/kaggle/input/trashcan-dataset'

print("="*60)
print("📦 Dataset Configuration")
print("="*60)

# Check if dataset exists
if os.path.exists(DATASET_PATH):
    print(f"✓ Dataset found at: {DATASET_PATH}")
    
    # List dataset contents
    print("\n📂 Dataset contents:")
    for item in os.listdir(DATASET_PATH):
        item_path = os.path.join(DATASET_PATH, item)
        if os.path.isdir(item_path):
            print(f"  📁 {item}/")
        else:
            print(f"  📄 {item}")
else:
    print(f"✗ Dataset NOT FOUND at: {DATASET_PATH}")
    print("\n⚠️  Please:")
    print("   1. Add TrashCAN dataset to this notebook (Add Data button)")
    print("   2. Update DATASET_PATH variable above")
    print("\nAvailable input datasets:")
    if os.path.exists('/kaggle/input'):
        for item in os.listdir('/kaggle/input'):
            print(f"  - /kaggle/input/{item}")

print("="*60)

## Step 4: Build Model

In [None]:
# Build YOLO-UDD model
from models.yolo_udd import build_yolo_udd
import torch

print("="*60)
print("🏗️  Building YOLO-UDD v2.0 Model")
print("="*60)

# Build model with 22 classes (TrashCAN dataset)
model = build_yolo_udd(num_classes=22)

# Move to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"✓ Model built successfully")
print(f"✓ Device: {device}")
print(f"✓ Number of classes: 22")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"✓ Total parameters: {total_params:,}")
print(f"✓ Trainable parameters: {trainable_params:,}")

# Test forward pass
print("\n🧪 Testing forward pass...")
x = torch.randn(1, 3, 640, 640).to(device)
with torch.no_grad():
    predictions, turb_score = model(x)

print(f"✓ Forward pass successful!")
print(f"✓ Turbidity Score: {turb_score.item():.4f}")
print(f"✓ Detection scales: {len(predictions)}")

print("="*60)

## Step 5: Training Configuration

In [None]:
# Training hyperparameters - Reduced for faster training
EPOCHS = 100  # Reduced from 300 (10 hours instead of 30 hours)
BATCH_SIZE = 8
LEARNING_RATE = 0.01
NUM_WORKERS = 2
SAVE_DIR = '/kaggle/working/runs/train'

print("="*60)
print("⚙️  Training Configuration")
print("="*60)
print(f"Epochs: {EPOCHS}")
print(f"Batch Size: {BATCH_SIZE}")
print(f"Learning Rate: {LEARNING_RATE}")
print(f"Number of Workers: {NUM_WORKERS}")
print(f"Save Directory: {SAVE_DIR}")
print(f"Dataset Path: {DATASET_PATH}")
print("="*60)

# Create save directory
os.makedirs(SAVE_DIR, exist_ok=True)
print(f"\n✓ Save directory created: {SAVE_DIR}")

## Step 6: Start Training

**⏱️ Estimated Time**: ~10 hours for 100 epochs on T4 GPU

**💡 Tips**:
- Training will save checkpoints automatically
- You can monitor progress in real-time
- Results saved to `/kaggle/working/runs/train/`
- Download best checkpoint from Output folder after training

In [None]:
# Start training
print("="*60)
print("🚀 Starting Training...")
print("="*60)
print(f"Training for {EPOCHS} epochs (~10 hours)")
print(f"Expected mAP: 70-72%")
print("="*60)

# Run training script
!python scripts/train.py \
    --config configs/train_config.yaml \
    --data-dir {DATASET_PATH} \
    --epochs {EPOCHS} \
    --batch-size {BATCH_SIZE} \
    --learning-rate {LEARNING_RATE} \
    --num-workers {NUM_WORKERS} \
    --save-dir {SAVE_DIR}

## Step 7: Download Results

After training completes, download the trained model checkpoint.
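
If you prefer a single download over individual files, the results directory can be zipped first. This is a minimal sketch using the standard library; the helper name `archive_results` and the archive name `yolo_udd_results` are hypothetical, and the archive is written outside the results directory so it does not end up inside itself.

```python
import shutil
from pathlib import Path

def archive_results(save_dir: str, archive_base: str) -> str:
    """Zip save_dir into <archive_base>.zip and return the archive path."""
    source = Path(save_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Results directory not found: {source}")
    return shutil.make_archive(archive_base, "zip", root_dir=str(source))

# The results path mirrors SAVE_DIR from this notebook
results_dir = "/kaggle/working/runs/train"
if Path(results_dir).is_dir():
    print(archive_results(results_dir, "/kaggle/working/yolo_udd_results"))
```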

In [None]:
# Check training results
import os

print("="*60)
print("📊 Training Results")
print("="*60)

if os.path.exists(SAVE_DIR):
    print(f"\n📁 Results directory: {SAVE_DIR}")
    print("\nContents:")
    for root, dirs, files in os.walk(SAVE_DIR):
        level = root.replace(SAVE_DIR, '').count(os.sep)
        indent = ' ' * 2 * level
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 2 * (level + 1)
        for file in files:
            size = os.path.getsize(os.path.join(root, file)) / (1024*1024)
            print(f"{subindent}{file} ({size:.1f} MB)")
    
    # Check for best checkpoint
    best_checkpoint = os.path.join(SAVE_DIR, 'best.pt')
    if os.path.exists(best_checkpoint):
        size = os.path.getsize(best_checkpoint) / (1024*1024)
        print(f"\n✓ Best checkpoint: {best_checkpoint} ({size:.1f} MB)")
        print("\n📥 Download this file from the Output section!")
    else:
        print("\n⚠️  Best checkpoint not found. Check if training completed successfully.")
else:
    print(f"✗ Results directory not found: {SAVE_DIR}")

print("="*60)

## 🎉 Training Complete!

### Next Steps:
1. **Download Checkpoint**: Download `best.pt` from Output folder
2. **Evaluate Model**: Run evaluation script locally with downloaded checkpoint
3. **Test Detections**: Test on new images

### Expected Results:
- mAP@50:95: **70-72%** (22 classes)
- Training Time: **~10 hours** (100 epochs)
- Checkpoint Size: **~200-300 MB**
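
As a rough cross-check on the checkpoint size, the sketch below estimates file size from the parameter count printed in Step 4. It assumes 32-bit floats and that the checkpoint may also hold optimizer buffers; whether the training script saves optimizer state is not confirmed by this notebook, and the parameter count used here is hypothetical.

```python
def estimate_checkpoint_mb(num_params: int, bytes_per_param: int = 4,
                           state_copies: int = 1) -> float:
    """Estimate checkpoint size in MB.

    state_copies counts full-size tensors stored per parameter:
    1 = weights only, 2 = weights plus one optimizer buffer,
    3 = weights plus two optimizer buffers.
    """
    return num_params * bytes_per_param * state_copies / (1024 ** 2)

# Hypothetical parameter count, for illustration only;
# substitute the "Total parameters" value from Step 4.
example_params = 25_000_000
for copies in (1, 2, 3):
    size = estimate_checkpoint_mb(example_params, state_copies=copies)
    print(f"{copies} tensor copies per parameter: {size:.0f} MB")
```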

---

**📧 Issues?** Check the GitHub repository: https://github.com/kshitijkhede/YOLO-UDD-v2.0