# 🚀 YOLO-UDD v2.0 Training on Kaggle - Fixed Version

**Last Updated:** November 2, 2025

## 📋 Prerequisites
1. Upload **TrashCAN annotations** dataset to Kaggle
2. Upload **TrashCAN images** dataset to Kaggle
3. Enable **GPU** in notebook settings (T4 or P100)
4. Enable **Internet** in notebook settings
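
Once both datasets are attached, listing `/kaggle/input` shows the folder names that `ANNOTATIONS_PATH` and `IMAGES_PATH` must match later on. This is a minimal sketch that assumes the standard Kaggle mount point; `list_input_datasets` is a hypothetical helper name, not part of the repository.

```python
import os

def list_input_datasets(root="/kaggle/input"):
    """Return the sorted names of dataset folders mounted under `root`."""
    if not os.path.isdir(root):
        return []  # not running on Kaggle, or no datasets attached
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )

for name in list_input_datasets():
    print(f"/kaggle/input/{name}")
```

If the list is empty, the datasets are probably not attached to the notebook yet.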

---

## 🔧 Step 1: Install Dependencies
**Uses albumentations 1.3.1** - Stable version without albucore dependency issues
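
To see at a glance whether the environment matches the pinned versions, the installed versions can be compared against the expected ones with the standard library's `importlib.metadata`. The sketch below is illustrative only: `PINS` repeats a few of the versions used in this notebook, and `find_mismatches` is a hypothetical helper name.

```python
from importlib import metadata

# A few of the versions pinned by the install cell in this notebook.
PINS = {
    "numpy": "1.26.4",
    "albumentations": "1.3.1",
    "scikit-learn": "1.3.2",
    "tensorboard": "2.16.2",
}

def find_mismatches(pins):
    """Map package name -> (expected, installed) for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

for name, (expected, installed) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

Running it before and after the install cell shows which packages the install actually changed.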

In [None]:
%%bash
# Clone repository
if [ ! -d "YOLO-UDD-v2.0" ]; then
    git clone https://github.com/kshitijkhede/YOLO-UDD-v2.0.git
fi
cd YOLO-UDD-v2.0
echo "✅ Repository cloned"

In [None]:
%cd YOLO-UDD-v2.0

In [None]:
# EMERGENCY FIX: Install dependencies with AGGRESSIVE NumPy 1.26.4 locking
import subprocess
import sys

print("📦 Installing dependencies with LOCKED versions...\n")
print("⚠️  CRITICAL: NumPy MUST be 1.26.4 (NOT 2.x) to prevent TensorBoard/scikit-learn crashes\n")
print("="*70)

# STEP 1: Nuclear uninstall - remove ALL potentially conflicting packages
print("🗑️  Step 1/6: Removing all conflicting packages...")
subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', 
                'numpy', 'scipy', 'scikit-learn', 'tensorflow', 
                'tensorboard', 'keras', 'matplotlib', 'albumentations', 'albucore'],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# STEP 2: Install NumPy 1.26.4 FIRST (CRITICAL!)
print("📍 Step 2/6: Installing NumPy 1.26.4 (LOCKED with --no-cache-dir)...")
subprocess.run([sys.executable, '-m', 'pip', 'install', 
                '--no-cache-dir', '--force-reinstall', 'numpy==1.26.4', '-q'],
               check=True)

# STEP 3: Install PyTorch with CUDA 11.8
print("🔥 Step 3/6: Installing PyTorch 2.2.2 with CUDA 11.8...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-q',
                'torch==2.2.2', 'torchvision==0.17.2', 'torchaudio==2.2.2',
                '--index-url', 'https://download.pytorch.org/whl/cu118'],
               check=True)

# STEP 4: Install compatible scipy and matplotlib
print("Step 4/6: Installing scipy 1.11.4 and matplotlib 3.7.5...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '-q',
                'scipy==1.11.4', 'matplotlib==3.7.5'],
               check=True)

# STEP 5: Install other core packages
print("📦 Step 5/6: Installing core packages...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-q',
                'opencv-python-headless==4.9.0.80', 'pillow==10.3.0', 
                'pycocotools==2.0.7', 'pyyaml==6.0.1', 'tqdm==4.66.4'],
               check=True)

print("📊 Installing TensorBoard 2.16.2...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '-q',
                'tensorboard==2.16.2'],
               check=True)

print("🎨 Installing albumentations 1.3.1 (no albucore dependency)...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-q',
                'albumentations==1.3.1'],
               check=True)

print("⚙️  Installing timm 0.9.16...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-q',
                'timm==0.9.16'],
               check=True)

# STEP 6: Install scikit-learn LAST (needs stable NumPy)
print("🔬 Step 6/6: Installing scikit-learn 1.3.2 (LAST)...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '-q',
                'scikit-learn==1.3.2'],
               check=True)

print("="*70)
print("✅ Dependencies installed!\n")
print("⚠️  You may see warnings about cesium/tsfresh/umap-learn - IGNORE THEM!")
print("    These are harmless Kaggle pre-installed packages that won't interfere.\n")
print("🔍 Next cell will verify NumPy 1.26.4 is correctly installed...")

In [None]:
# Verify critical dependencies and NumPy version
print("🔍 Verifying installations...\n")
print("="*70)

import numpy as np
import torch
import cv2
import albumentations as A

print(f"✅ NumPy: {np.__version__}")
print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ OpenCV: {cv2.__version__}")
print(f"✅ Albumentations: {A.__version__}")

# CRITICAL CHECK: NumPy MUST be 1.x (not 2.x)
numpy_version = np.__version__
if numpy_version.startswith('2.'):
    print("\n" + "="*70)
    print("❌ CRITICAL ERROR: NumPy 2.x detected!")
    print(f"   Current version: {numpy_version}")
    print("   This will cause TensorBoard and scikit-learn to crash!")
    print("\n🔧 EMERGENCY FIX: Running aggressive NumPy downgrade...")
    print("="*70 + "\n")
    
    # Emergency fix
    import subprocess
    import sys
    subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', 
                    'numpy', 'scipy', 'scikit-learn'],
                   stdout=subprocess.DEVNULL)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
                    'numpy==1.26.4', 'scipy==1.11.4'],
                   check=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
                    'scikit-learn==1.3.2'],
                   check=True)
    
    print("\n✅ NumPy downgraded to 1.26.4")
    print("⚠️  IMPORTANT: Click 'Kernel' → 'Restart & Run All' NOW!")
    raise SystemExit("NumPy fixed - RESTART KERNEL and re-run all cells!")

print(f"✅ NumPy version is CORRECT: {numpy_version}")

# Track non-fatal failures so the summary below reflects what actually happened
failed_checks = []

# Test scikit-learn
try:
    import sklearn
    print(f"✅ scikit-learn: {sklearn.__version__}")
except Exception as e:
    print(f"⚠️  scikit-learn import warning: {e}")
    failed_checks.append("scikit-learn")

# Test TensorBoard (this is the critical one that crashes with NumPy 2.x)
try:
    from torch.utils.tensorboard import SummaryWriter
    print("✅ TensorBoard: Import successful (NumPy compatibility confirmed)")
except Exception as e:
    print(f"❌ TensorBoard import FAILED: {e}")
    print("   This means NumPy is incompatible - restart kernel!")
    raise

# Test albumentations
try:
    transform = A.Compose([A.HorizontalFlip(p=0.5)])
    print("✅ Albumentations: Transform test passed")
except Exception as e:
    print(f"❌ Albumentations test failed: {e}")
    failed_checks.append("albumentations")

print("="*70)
if failed_checks:
    print(f"⚠️  Checks that did NOT pass: {', '.join(failed_checks)}")
    print("   Re-run the install cell before training.\n")
else:
    print("✅ ALL VERIFICATIONS PASSED!")
    print(f"📍 NumPy {numpy_version} is locked and compatible with all packages")
    print("🚀 Safe to proceed with training!\n")

In [None]:
# FINAL SAFETY CHECK: Verify NumPy before training starts
import subprocess
import sys

print("🔒 Final NumPy version check before training...\n")

result = subprocess.run([sys.executable, '-m', 'pip', 'show', 'numpy'], 
                       capture_output=True, text=True)
current_numpy = None
for line in result.stdout.split('\n'):
    if line.startswith('Version:'):
        current_numpy = line.split(':')[1].strip()
        break

if current_numpy and current_numpy.startswith('2.'):
    print("="*70)
    print(f"❌ CRITICAL: NumPy {current_numpy} detected!")
    print("   This will crash during training!")
    print("\n🔧 EMERGENCY DOWNGRADE IN PROGRESS...")
    print("="*70 + "\n")
    
    subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', 
                    'numpy', 'scipy', 'scikit-learn'],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
                    'numpy==1.26.4', 'scipy==1.11.4'],
                   check=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
                    'scikit-learn==1.3.2'],
                   check=True)
    
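    print("✅ Fixed! Downgraded to NumPy 1.26.4")
    print("⚠️  NOW: Click 'Kernel' → 'Restart & Run All'")
    raise SystemExit("NumPy fixed - restart required!")
elif current_numpy is None:
    # `pip show numpy` returned no version, so NumPy is not installed
    print("❌ Could not determine the installed NumPy version")
    print("   Re-run the dependency installation cell above")
    raise SystemExit("NumPy not found - reinstall dependencies!")
else:
    print(f"✅ NumPy {current_numpy} is CORRECT")
    print("✅ TensorBoard and scikit-learn will work properly")
    print("✅ Safe to proceed with training")
    print("\n🚀 Starting training in next cell...")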

## 📊 Step 2: Set Up Dataset Paths

In [None]:
import os
import shutil
import json

print("🔍 Setting up dataset paths...\n")

# Create directory structure
os.makedirs('data/trashcan/annotations', exist_ok=True)
os.makedirs('data/trashcan/images', exist_ok=True)

# === MODIFY THESE PATHS TO MATCH YOUR KAGGLE DATASETS ===
ANNOTATIONS_PATH = '/kaggle/input/trashcan-annotations-coco-format/annotations'
IMAGES_PATH = '/kaggle/input/trashcan/images'

# Alternative paths (uncomment and modify if needed)
# ANNOTATIONS_PATH = '/kaggle/input/YOUR-ANNOTATIONS-DATASET-NAME/'
# IMAGES_PATH = '/kaggle/input/YOUR-IMAGES-DATASET-NAME/'

print(f"Annotations source: {ANNOTATIONS_PATH}")
print(f"Images source: {IMAGES_PATH}")
print("\n" + "="*70)

In [None]:
# Copy annotations into the working directory
print("📋 Copying annotations...")

train_json = os.path.join(ANNOTATIONS_PATH, 'train.json')
val_json = os.path.join(ANNOTATIONS_PATH, 'val.json')

if os.path.exists(train_json) and os.path.exists(val_json):
    shutil.copy(train_json, 'data/trashcan/annotations/train.json')
    shutil.copy(val_json, 'data/trashcan/annotations/val.json')
    
    # Verify
    with open('data/trashcan/annotations/train.json', 'r') as f:
        train_data = json.load(f)
    with open('data/trashcan/annotations/val.json', 'r') as f:
        val_data = json.load(f)
    
    print(f"✅ Train: {len(train_data['images'])} images, {len(train_data['annotations'])} annotations")
    print(f"✅ Val: {len(val_data['images'])} images, {len(val_data['annotations'])} annotations")
    print(f"✅ Categories: {len(train_data['categories'])}")
else:
    print("❌ Annotations not found!")
    print(f"   Looking for: {train_json}")
    print(f"                {val_json}")
    print("   Please update ANNOTATIONS_PATH in the cell above")

In [None]:
# Link images (symbolic links to save space)
print("🖼️  Linking images...")

train_imgs_src = os.path.join(IMAGES_PATH, 'train')
val_imgs_src = os.path.join(IMAGES_PATH, 'val')

train_imgs_dst = 'data/trashcan/images/train'
val_imgs_dst = 'data/trashcan/images/val'

# Remove old links
for path in [train_imgs_dst, val_imgs_dst]:
    if os.path.exists(path):
        if os.path.islink(path):
            os.unlink(path)
        else:
            shutil.rmtree(path)

# Create symbolic links
if os.path.exists(train_imgs_src) and os.path.exists(val_imgs_src):
    os.symlink(train_imgs_src, train_imgs_dst)
    os.symlink(val_imgs_src, val_imgs_dst)
    
    train_count = len([f for f in os.listdir(train_imgs_dst) if f.endswith('.jpg')])
    val_count = len([f for f in os.listdir(val_imgs_dst) if f.endswith('.jpg')])
    
    print(f"✅ Train images: {train_count}")
    print(f"✅ Val images: {val_count}")
    
    if train_count > 0 and val_count > 0:
        print("\n🎉 Dataset is ready for training!")
    else:
        print("\n⚠️  No .jpg files found in at least one split - check IMAGES_PATH")
else:
    print("❌ Images not found!")
    print(f"   Looking for: {train_imgs_src}")
    print(f"                {val_imgs_src}")
    print("   Please update IMAGES_PATH in the cell above")
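
Before training, it can help to confirm that every image listed in an annotation file is actually present in the linked image folder. The sketch below assumes the standard COCO layout, where each entry in `images` has a `file_name` field; `find_missing_images` is a hypothetical helper, not part of the YOLO-UDD repository.

```python
import json
import os

def find_missing_images(annotation_file, image_dir):
    """Return file names listed in the COCO JSON but absent from image_dir."""
    with open(annotation_file, "r") as f:
        coco = json.load(f)
    present = set(os.listdir(image_dir))
    return sorted(
        img["file_name"] for img in coco["images"]
        if os.path.basename(img["file_name"]) not in present
    )

ann_file = "data/trashcan/annotations/train.json"
img_dir = "data/trashcan/images/train"
if os.path.exists(ann_file) and os.path.isdir(img_dir):
    missing = find_missing_images(ann_file, img_dir)
    print(f"Missing train images: {len(missing)}")
```

A non-zero count usually points to a mismatch between the annotations dataset and the images dataset.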

## 🔍 Step 3: Verify GPU and PyTorch

In [None]:
import torch

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    torch.cuda.empty_cache()
else:
    print("⚠️  WARNING: GPU not available!")
    print("   Go to Settings → Accelerator → Select GPU T4 or P100")

## ⚙️ Step 4: Create Optimized Training Config

In [None]:
import os  # used below for os.makedirs; imported here so the cell runs on its own
import yaml

# Create optimized config for Kaggle
config = {
    'model': {
        'name': 'YOLO-UDD-v2.0',
        'num_classes': 22,
        'pretrained_path': None
    },
    'data': {
        'dataset_name': 'TrashCAN-1.0',
        'data_dir': 'data/trashcan',
        'img_size': 640,
        'class_names': [
            "rov", "plant", "animal_fish", "animal_starfish", "animal_shells",
            "animal_crab", "animal_eel", "animal_etc", "trash_clothing", "trash_pipe",
            "trash_bottle", "trash_bag", "trash_snack_wrapper", "trash_can", "trash_cup",
            "trash_container", "trash_unknown_instance", "trash_branch", "trash_wreckage",
            "trash_tarp", "trash_rope", "trash_net"
        ]
    },
    'training': {
        'epochs': 100,
        'batch_size': 8,           # Optimized for T4 GPU
        'num_workers': 2,
        'optimizer': 'AdamW',
        'learning_rate': 0.001,    # Lower initial LR for stability
        'weight_decay': 0.0005,
        'scheduler': 'CosineAnnealing',
        'lr_min': 0.00001,
        'early_stopping_patience': 30,
        'grad_clip_norm': 10.0,
        'use_amp': True            # Mixed precision
    },
    'loss': {
        'lambda_box': 5.0,
        'lambda_obj': 1.0,
        'lambda_cls': 1.0,
        'focal_loss_gamma': 2.0,
        'iou_type': 'CIoU'
    },
    'augmentation': {
        'use_augmentation': True,
        'horizontal_flip_prob': 0.5,
        'color_jitter': True,
        'gaussian_blur': False,     # Disabled to reduce training time
        'underwater_augmentation': True
    },
    'checkpoints': {
        'save_dir': '/kaggle/working/checkpoints',
        'save_interval': 10,
        'save_best_only': False
    },
    'logging': {
        'use_tensorboard': True,
        'log_dir': '/kaggle/working/runs',
        'log_interval': 50
    },
    'eval': {
        'conf_threshold': 0.001,
        'nms_threshold': 0.6,
        'eval_interval': 5
    }
}

# Save config
os.makedirs('configs', exist_ok=True)
with open('configs/kaggle_config.yaml', 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

print("✅ Training config created!")
print("\nKey settings:")
print(f"  - Batch size: {config['training']['batch_size']}")
print(f"  - Epochs: {config['training']['epochs']}")
print(f"  - Learning rate: {config['training']['learning_rate']}")
print(f"  - Image size: {config['data']['img_size']}")
print(f"  - Mixed precision: {config['training']['use_amp']}")
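
For intuition about the `CosineAnnealing` setting, the learning rate implied by `learning_rate`, `lr_min`, and `epochs` can be computed directly. This is a sketch of the standard cosine annealing formula, lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)). It assumes `scripts/train.py` follows that formula without warmup or restarts, which the config alone does not confirm.

```python
import math

def cosine_lr(epoch, total_epochs=100, lr_max=0.001, lr_min=0.00001):
    """Learning rate at `epoch` under plain cosine annealing."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Print the schedule at a few points across the 100-epoch run
for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

The schedule starts at `learning_rate`, passes the midpoint of the two values halfway through, and ends at `lr_min`.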

## 🚀 Step 5: Start Training

In [None]:
import glob
import os

# Check for existing checkpoints to resume from
checkpoint_dir = '/kaggle/working/runs/train/checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)

# Look for latest.pt checkpoint (new format)
latest_checkpoint = os.path.join(checkpoint_dir, 'latest.pt')

if os.path.exists(latest_checkpoint):
    print(f"🔄 Found checkpoint: {latest_checkpoint}")
    print("   Will resume training from this checkpoint\n")
    
    # Load checkpoint to show progress
    import torch
    ckpt = torch.load(latest_checkpoint, map_location='cpu')
    print("   Previous progress:")
    print(f"      - Completed epoch: {ckpt['epoch']}")
    print(f"      - Best mAP: {ckpt['best_map']:.4f}")
    print(f"      - Resuming from epoch: {ckpt['epoch'] + 1}\n")
    
    resume_flag = f"--resume {latest_checkpoint}"
else:
    print("🆕 No previous checkpoint found")
    print("   Starting fresh training from epoch 0\n")
    resume_flag = ""

print("="*70)
print("🚀 Starting/Resuming YOLO-UDD v2.0 Training")
print("="*70)

In [None]:
# Run training
!python scripts/train.py --config configs/kaggle_config.yaml {resume_flag}

## 💾 Step 6: Check Checkpoints (Auto-Saved)
**Checkpoints are automatically saved every epoch!** Use this cell to view them.

In [None]:
import shutil
import glob
import os

print("💾 Checkpoint Information...\n")

# Check checkpoint directory
checkpoint_dir = '/kaggle/working/runs/train/checkpoints'

if os.path.exists(checkpoint_dir):
    checkpoints = glob.glob(f'{checkpoint_dir}/*.pt')
    
    if checkpoints:
        print(f"✅ Found {len(checkpoints)} checkpoint(s):\n")
        
        for ckpt in checkpoints:
            size = os.path.getsize(ckpt) / (1024*1024)
            name = os.path.basename(ckpt)
            print(f"   📦 {name} ({size:.1f} MB)")
            
            # Show details for latest checkpoint
            if 'latest.pt' in name:
                import torch
                ckpt_data = torch.load(ckpt, map_location='cpu')
                print(f"      - Epoch: {ckpt_data['epoch']}")
                print(f"      - Best mAP: {ckpt_data['best_map']:.4f}")
        
        print(f"\n✅ Checkpoints are in: {checkpoint_dir}")
        print("💡 These checkpoints persist and allow training to resume!")
        
        # Show download instructions
        print("\n📥 To download checkpoints:")
        print("   1. Click 'Output' in right sidebar")
        print("   2. Find checkpoint files")
        print("   3. Click download icon")
    else:
        print("⚠️  No .pt checkpoint files found")
        print("   Training may still be in progress or just started")
else:
    print("⚠️  Checkpoint directory not found")
    print("   Training has not started yet")

## 📊 Step 7: View Training Logs (TensorBoard)

In [None]:
# Load TensorBoard
%load_ext tensorboard
%tensorboard --logdir /kaggle/working/runs

## 🎯 Step 8: Evaluate Model (Optional)

In [None]:
# Find best checkpoint
import glob

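# Check the current checkpoint location first, then the legacy one
best_ckpt = (glob.glob('/kaggle/working/runs/train/checkpoints/best.pt')
             + glob.glob('/kaggle/working/checkpoints/best.pth'))

if best_ckpt:
    print(f"📊 Evaluating model: {best_ckpt[0]}\n")
    !python scripts/evaluate.py \
        --checkpoint {best_ckpt[0]} \
        --data-dir data/trashcan \
        --split val
else:
    print("⚠️  No best checkpoint (best.pt / best.pth) found")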
    print("   Training may still be in progress")

## 🖼️ Step 9: Run Detection on Sample Images (Optional)

In [None]:
# Run detection on validation images
import glob

# Check the current checkpoint location first, then the legacy one
best_ckpt = (glob.glob('/kaggle/working/runs/train/checkpoints/best.pt')
             + glob.glob('/kaggle/working/checkpoints/best.pth'))

if best_ckpt:
    print(f"🎯 Running detection with: {best_ckpt[0]}\n")
    !python scripts/detect.py \
        --checkpoint {best_ckpt[0]} \
        --source data/trashcan/images/val/ \
        --output /kaggle/working/results/ \
        --max-images 10
else:
    print("⚠️  No checkpoint found for detection")

In [None]:
# Display detection results
import matplotlib.pyplot as plt
from PIL import Image
import glob
import os

result_images = glob.glob('/kaggle/working/results/*.jpg')[:6]

if result_images:
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx, img_path in enumerate(result_images):
        img = Image.open(img_path)
        axes[idx].imshow(img)
        axes[idx].axis('off')
        axes[idx].set_title(f'Detection {idx+1}')
    
    # Hide empty subplots
    for idx in range(len(result_images), 6):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠️  No detection results found")
    print("   Run the detection cell above first")

## 📥 Step 10: Download Checkpoints (Optional)

In [None]:
# List all available checkpoints
import glob
import os

# Collect checkpoints from the current location and the legacy one
checkpoints = (glob.glob('/kaggle/working/runs/train/checkpoints/*.pt')
               + glob.glob('/kaggle/working/checkpoints/*.pth'))

if checkpoints:
    print("📦 Available checkpoints:\n")
    for ckpt in sorted(checkpoints):
        size = os.path.getsize(ckpt) / (1024*1024)
        print(f"  - {ckpt} ({size:.1f} MB)")
    
    print("\n💡 To download, you can:")
    print("  1. Use Kaggle's file browser (right sidebar)")
    print("  2. Navigate to the checkpoint folder listed above")
    print("  3. Right-click on files to download")
else:
    print("⚠️  No checkpoints found")

---

## 🎉 Training Complete!

### Next Steps:
1. **Download checkpoints** from `/kaggle/working/runs/train/checkpoints/`
2. **View TensorBoard** logs to analyze training
3. **Run evaluation** to see final metrics
4. **Test on new images** using `scripts/detect.py`

### Tips for Better Results:
- Train for more epochs (increase `epochs` in config)
- Adjust learning rate if loss plateaus
- Try different batch sizes based on GPU memory
- Enable more augmentations for better generalization
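
Most of these tips amount to changing a few keys under `training` in the config. Below is a minimal sketch of applying such overrides to a nested dictionary shaped like the `config` built in Step 4. `apply_overrides` is a hypothetical helper, and the override values are placeholders, not recommendations.

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of `config` with nested `overrides` merged in."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)  # recurse into sections
        else:
            merged[key] = value
    return merged

base = {"training": {"epochs": 100, "batch_size": 8, "learning_rate": 0.001}}
tuned = apply_overrides(base, {"training": {"epochs": 150, "batch_size": 4}})
print(tuned["training"])
```

The merged dictionary can then be written back to `configs/kaggle_config.yaml` with `yaml.dump`, the same way the config cell does it.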

---

## 🔄 How Checkpoint Resume Works

**Your training is protected!** Checkpoints are saved automatically.

### What Gets Saved:
- ✅ **latest.pt** - Saved after every epoch
- ✅ **best.pt** - Saved when validation improves
- 📁 **Location**: `/kaggle/working/runs/train/checkpoints/`
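
The two files follow a simple rule: `latest.pt` is rewritten every epoch, while `best.pt` is rewritten only when the validation metric improves. Below is a sketch of that bookkeeping, assuming mAP is the tracked metric (the checkpoints in this notebook expose a `best_map` key). `files_to_save` is a hypothetical name; the real logic lives in `scripts/train.py` and may differ.

```python
def files_to_save(epoch_map, best_map):
    """Return (checkpoint names to write this epoch, updated best mAP)."""
    names = ["latest.pt"]        # always refreshed
    if epoch_map > best_map:     # validation improved
        names.append("best.pt")
        best_map = epoch_map
    return names, best_map

best = 0.0
for epoch, val_map in enumerate([0.10, 0.25, 0.20]):
    names, best = files_to_save(val_map, best)
    print(f"epoch {epoch}: write {names}, best mAP so far {best:.2f}")
```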

### Auto-Resume:
If training stops (timeout, disconnect, etc.), just **re-run the notebook**:
1. Re-run every setup cell before training (dependencies, dataset paths, GPU check, config)
2. The first cell under "Start Training" will **auto-detect** `latest.pt`
3. Training **continues** from where it stopped!

### Manual Resume (if needed):
```python
!python scripts/train.py \
    --config configs/kaggle_config.yaml \
    --resume /kaggle/working/runs/train/checkpoints/latest.pt
```

**No progress lost!** 🎉