# CADI AI Model Training Pipeline

A clean, modular training pipeline built on external Python scripts.
This notebook orchestrates training by running those scripts with `!python` commands.

## Pipeline Overview:
1. **Environment Setup** - Install dependencies and set paths
2. **Dataset Preparation** - Create data.yaml and validate dataset
3. **Model Training** - Train YOLO model with optimized settings
4. **Evaluation** - Validate and test the trained model

## Requirements:
- Python scripts: `dataset_utils.py`, `train.py`
- Configuration file: `config.yaml`
- Dataset in proper YOLO format
- GPU recommended for training
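
The "proper YOLO format" requirement can be checked before anything else runs. The following is a minimal sketch, assuming the split folders are named `train`, `valid`, and `test` (matching the paths this notebook later writes to `data.yaml`) and that each holds an `images/` and a `labels/` subfolder with one `.txt` label file per image. `check_yolo_layout` is a hypothetical helper for illustration and is not part of `dataset_utils.py`.

```python
from pathlib import Path

# Image extensions to count; an assumption, extend as needed for your dataset
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def check_yolo_layout(dataset_root, splits=("train", "valid", "test")):
    """Return {split: (n_images, n_labels, images_without_a_label_file)}."""
    root = Path(dataset_root)
    report = {}
    for split in splits:
        images_dir = root / split / "images"
        labels_dir = root / split / "labels"

        # A missing folder is reported as zero files rather than raising
        images = []
        if images_dir.is_dir():
            images = [p for p in images_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS]
        label_stems = set()
        if labels_dir.is_dir():
            label_stems = {p.stem for p in labels_dir.glob("*.txt")}

        # An image is "missing" when no label file shares its stem
        missing = sorted(p.name for p in images if p.stem not in label_stems)
        report[split] = (len(images), len(label_stems), missing)
    return report
```

Calling `check_yolo_layout(DATASET_PATH)` after the environment cell should give image and label counts that agree for each split. An image with no objects is still expected to have an (empty) label file under this sketch's assumption.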

In [None]:
# !git pull origin main

In [1]:
# Install required packages
!pip install -U -q ultralytics roboflow opencv-python supervision PyYAML "numpy <2"

# Import basic libraries
import os
import sys
import yaml
from pathlib import Path

print("✅ Dependencies installed successfully!")


In [3]:
# Environment Configuration
# Adjust these paths based on your environment (Kaggle, Colab, or local)

# For Kaggle:
if '/kaggle' in os.getcwd():
    # This is where the repository will be cloned
    PROJECT_DIR = '/kaggle/working/cadi-ai'
    # Assumes the dataset is in /kaggle/input
    DATASET_PATH = '/kaggle/input/cadi-ai-retraining/combined_dataset/combined_dataset'  # Update this
    WORKING_DIR = '/kaggle/working/cadi-training-2508'
    ENVIRONMENT = 'kaggle'

# For Google Colab:
elif '/content' in os.getcwd():
    PROJECT_DIR = '/content/cadi-ai'
    DATASET_PATH = '/content/dataset'  # Update this
    WORKING_DIR = '/content/cadi-training-2508'
    ENVIRONMENT = 'colab'

# For local development:
else:
    PROJECT_DIR = r'c:\Users\Mecha Mino 5 Outlook\Documents\Mino Health AI labs\cadi-ai'
    DATASET_PATH = r'c:\Users\Mecha Mino 5 Outlook\Documents\Mino Health AI labs\cadi-ai\dataset'  # Update this
    WORKING_DIR = r'c:\Users\Mecha Mino 5 Outlook\Documents\Mino Health AI labs\cadi-ai\training_outputs'
    ENVIRONMENT = 'local'

# Create working directory
os.makedirs(WORKING_DIR, exist_ok=True)

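# On Kaggle, clone the repo first (skipped if a previous run already cloned it)
if ENVIRONMENT == 'kaggle' and not os.path.isdir(PROJECT_DIR):
    !git clone https://github.com/minoHealth/cadi-ai.git

# Every environment runs the scripts from the project directory
os.chdir(PROJECT_DIR)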

print(f"🔧 Environment: {ENVIRONMENT}")
print(f"📁 Project directory: {os.getcwd()}")
print(f"💾 Working directory: {WORKING_DIR}")


Cloning into 'cadi-ai'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 33 (delta 13), reused 27 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (33/33), 23.73 KiB | 3.95 MiB/s, done.
Resolving deltas: 100% (13/13), done.
🔧 Environment: kaggle
📁 Project directory: /kaggle/working/cadi-ai
💾 Working directory: /kaggle/working/cadi-training-2508


In [4]:
# Recreate data.yaml with improved class name detection
print("🔄 Recreating data.yaml with CADI AI class names...")

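# Paths are quoted so that directories containing spaces (e.g. the local Windows paths) survive the shell
!python dataset_utils.py --create-yaml "{DATASET_PATH}" --output-path "{os.path.join(PROJECT_DIR, 'data.yaml')}" --cache-dir "{os.path.join(WORKING_DIR, 'cache')}"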

# print("\n📊 Re-validating with proper class names...")
!python dataset_utils.py --validate {os.path.join(PROJECT_DIR, 'data.yaml')}

🔄 Recreating data.yaml with CADI AI class names...
🔍 Detected 3 classes from label files: [0, 1, 2]
🎯 Using CADI AI class names: ['abiotic', 'disease', 'insect']
⚠️  Dataset in read-only location, caching disabled
✅ Created data.yaml at: /kaggle/working/cadi-ai/data.yaml
📊 Dataset Validation Report
✅ train: 2813 images, 2813 labels
✅ val  :  501 images,  501 labels
✅ test :  340 images,  340 labels

📈 Class Distribution:
  abiotic   :  7762 objects ( 57.2%)
  disease   :  3566 objects ( 26.3%)
  insect    :  2236 objects ( 16.5%)

📋 Summary:
  Total Images: 3654
  Total Labels: 3654
  Total Objects: 13564
  Classes: 3
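
The percentages in the report follow directly from the object counts. As a quick sanity check, the sketch below recomputes them from the numbers printed above and adds two derived figures that the report does not show (the imbalance ratio and the mean number of objects per image); the variable names are illustrative only.

```python
# Object counts and image total copied from the validation report above
counts = {"abiotic": 7762, "disease": 3566, "insect": 2236}
total_images = 3654

total_objects = sum(counts.values())

# Share of each class, in percent, rounded to one decimal like the report
shares = {name: round(100 * n / total_objects, 1) for name, n in counts.items()}

# Ratio between the most and least frequent class
imbalance_ratio = max(counts.values()) / min(counts.values())

# Average number of annotated objects per image
objects_per_image = total_objects / total_images

print(total_objects)                 # 13564
print(shares)                        # {'abiotic': 57.2, 'disease': 26.3, 'insect': 16.5}
print(round(imbalance_ratio, 2))     # 3.47
print(round(objects_per_image, 2))   # 3.71
```

The roughly 3.5:1 gap between `abiotic` and `insect` is worth keeping in mind when reading per-class metrics after training.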


In [5]:
# Check current data.yaml and config.yaml cache settings
print("📋 Current data.yaml content:")
data_yaml_path = os.path.join(PROJECT_DIR, 'data.yaml')
if os.path.exists(data_yaml_path):
    with open(data_yaml_path, 'r') as f:
        data_content = f.read()
        print(data_content)
else:
    print("❌ data.yaml not found")

print("\n⚙️ Current config.yaml cache setting:")
config_yaml_path = os.path.join(PROJECT_DIR, 'config.yaml')
if os.path.exists(config_yaml_path):
    with open(config_yaml_path, 'r') as f:
        config = yaml.safe_load(f)
        cache_setting = config.get('cache', 'not set')
        print(f"cache: {cache_setting}")
else:
    print("❌ config.yaml not found")

print(f"\n📁 Cache directory: {os.path.join(WORKING_DIR, 'cache')}")
print(f"📁 Project directory: {PROJECT_DIR}")
print(f"📁 Working directory: {WORKING_DIR}")

📋 Current data.yaml content:
cache: false
names:
- abiotic
- disease
- insect
nc: 3
path: /kaggle/input/cadi-ai-retraining/combined_dataset/combined_dataset
test: /kaggle/input/cadi-ai-retraining/combined_dataset/combined_dataset/test/images
train: /kaggle/input/cadi-ai-retraining/combined_dataset/combined_dataset/train/images
val: /kaggle/input/cadi-ai-retraining/combined_dataset/combined_dataset/valid/images


⚙️ Current config.yaml cache setting:
cache: False

📁 Cache directory: /kaggle/working/cadi-training-2508/cache
📁 Project directory: /kaggle/working/cadi-ai
📁 Working directory: /kaggle/working/cadi-training-2508
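
Printing `data.yaml` shows its content but does not check it. The sketch below is one way to verify that `nc` matches the number of class names and that the split entries are present. `check_data_config` is a hypothetical helper, not something provided by `dataset_utils.py`; it takes the already-parsed dictionary so that it can be tried without any files on disk.

```python
import os


def check_data_config(cfg, require_paths=True):
    """Return a list of human-readable problems found in a parsed data.yaml."""
    problems = []

    # 'names' may be a list or an {index: name} mapping; len() works for both
    names = cfg.get("names") or []
    if "nc" in cfg and cfg["nc"] != len(names):
        problems.append(f"nc={cfg['nc']} but {len(names)} class names are listed")

    base = cfg.get("path", "")
    for split in ("train", "val"):
        entry = cfg.get(split)
        if not entry:
            problems.append(f"missing '{split}' entry")
            continue
        # os.path.join keeps 'entry' unchanged when it is already absolute
        full = os.path.join(base, entry)
        if require_paths and not os.path.isdir(full):
            problems.append(f"{split} directory not found: {full}")
    return problems
```

In this notebook it could be called as `check_data_config(yaml.safe_load(open(data_yaml_path)))`; an empty list means no problems were detected by these particular checks.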


In [6]:
# Step 2: Find optimal batch size
print("🔍 Finding optimal batch size for your hardware...")
print("This may take a few minutes as it tests different batch sizes.")

# We use the --find-batch argument from our updated train.py script
!python train.py --config config.yaml --find-batch

🔍 Finding optimal batch size for your hardware...
This may take a few minutes as it tests different batch sizes.
Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Finding optimal batch size...
🔍 Finding optimal batch size...
  Testing batch size: 32
Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo1
Ultralytics 8.3.174 🚀 Python-3.11.13 torch-2.6.0+cu124 CUDA:0 (Tesla P100-PCIE-16GB, 16269MiB)
[34m[1mengine/trainer: [0magnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=32, bgr=0.0, box=7.5, cache=False, cfg=None, classes=None, close_mosaic=10, cls=0.5, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=data.yaml, degrees=0.0, deterministic=True

In [7]:
# Step 3: Start training with optimal settings
# The `train.py` script will automatically use the settings from `config.yaml`
# including the auto-detected batch size if you set it to "auto" in the config.

print("🚀 Starting training...")
print("Training will save results to the output directory specified in config.yaml.")
print("You can monitor progress in the output below.")

!python train.py --config config.yaml

🚀 Starting training...
Training will save results to the output directory specified in config.yaml.
You can monitor progress in the output below.
🚀 Starting CADI AI Model Training
💻 System Information:
  Python: 3.11.13
  PyTorch: 2.6.0+cu124
  CUDA available: True
  GPU: Tesla P100-PCIE-16GB
  GPU Memory: 17.1 GB

📁 Validating data paths...
  ✅ train: 2813 images
  ✅ val: 501 images

🔍 Finding optimal batch size...
  Testing batch size: 32
Ultralytics 8.3.174 🚀 Python-3.11.13 torch-2.6.0+cu124 CUDA:0 (Tesla P100-PCIE-16GB, 16269MiB)
engine/trainer: agnostic_nms=False, amp=True, augment

In [None]:
# Step 4: Evaluate the trained model
import glob
from ultralytics import YOLO

# Read the output directory from config.yaml (falls back to 'runs')
with open("config.yaml", 'r') as f:
    config = yaml.safe_load(f)
output_dir = config.get('output_dir', 'runs')

# Collect every best.pt found under the output directory
weight_files = glob.glob(os.path.join(output_dir, '**', 'weights', 'best.pt'), recursive=True)

if weight_files:
    # glob returns matches in arbitrary order, so pick the newest by modification time
    best_weights = max(weight_files, key=os.path.getmtime)
    print(f"📊 Evaluating model: {best_weights}")

    # Load the trained model
    model = YOLO(best_weights)

    # Run validation against the data.yaml created earlier in PROJECT_DIR
    print("\n🧪 Running validation...")
    val_results = model.val(data=os.path.join(PROJECT_DIR, 'data.yaml'))

else:
    print("❌ No trained model weights found. Please run training first.")
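
The evaluation cell stores `val_results` but does not report anything from it. For detection models, the object returned by `model.val()` exposes summary metrics on its `box` attribute. The sketch below collects the usual four into a dictionary; `summarize_box_metrics` is a hypothetical helper, and the attribute names (`mp`, `mr`, `map50`, `map`) are those of current Ultralytics releases, so they should be checked against the installed version.

```python
def summarize_box_metrics(results):
    """Collect headline detection metrics from an Ultralytics validation result."""
    box = results.box
    return {
        "precision": float(box.mp),    # mean precision over classes
        "recall": float(box.mr),       # mean recall over classes
        "mAP50": float(box.map50),     # mAP at IoU threshold 0.50
        "mAP50-95": float(box.map),    # mAP averaged over IoU 0.50 to 0.95
    }
```

Inside the `if weight_files:` branch this could be used as `print(summarize_box_metrics(val_results))`. To cover the "test" half of the Evaluation step, the same call can be pointed at the held-out split with `model.val(data=..., split='test')`.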