# CADI AI Model Training Pipeline

A modular training pipeline built on external Python scripts.
This notebook orchestrates training by running those scripts with `!python` commands.

## Pipeline Overview:
1. **Environment Setup** - Install dependencies and set paths
2. **Dataset Preparation** - Create data.yaml and validate dataset
3. **Model Training** - Train YOLO model with optimized settings
4. **Evaluation** - Validate and test the trained model

## Requirements:
- Python scripts: `dataset_utils.py`, `train.py`
- Configuration file: `config.yaml`
- Dataset in proper YOLO format
- GPU recommended for training
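
Before running the pipeline, it can help to sanity-check the dataset layout. The sketch below is not part of `dataset_utils.py`; it assumes a layout of `<root>/<split>/images` and `<root>/<split>/labels` with one `.txt` label file per image, which may differ from how this dataset is organized. The function name `count_split` and the suffix list are illustrative.

```python
from pathlib import Path

# Image extensions to count; an assumption, adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_split(dataset_root, split):
    """Return (image_count, label_count, images_without_labels) for one split.

    Assumes the layout <root>/<split>/images and <root>/<split>/labels.
    """
    root = Path(dataset_root) / split
    images = [p for p in (root / "images").glob("*")
              if p.suffix.lower() in IMAGE_SUFFIXES]
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    # An image is "missing" a label when no .txt file shares its stem.
    missing = sorted(p.name for p in images if p.stem not in label_stems)
    return len(images), len(label_stems), missing
```

Calling `count_split(DATASET_PATH, "train")` for each of `train`, `val`, and `test` gives a quick picture of whether images and labels line up before the real validation step runs.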

In [1]:
# Install required packages
!pip install -U -q ultralytics roboflow opencv-python supervision PyYAML

# Import basic libraries
import os
import sys
import yaml
from pathlib import Path

print("✅ Dependencies installed successfully!")

In [2]:
# Environment Configuration
# Adjust these paths based on your environment (Kaggle, Colab, or local)

# For Kaggle:
if '/kaggle' in os.getcwd():
    # This is where the repository will be cloned
    PROJECT_DIR = '/kaggle/working/cadi-ai'
    # Assumes the dataset is in /kaggle/input
    DATASET_PATH = '/kaggle/input/cadi-ai-retraining/combined_dataset/combined_dataset'  # Update this
    WORKING_DIR = '/kaggle/working/cadi-training-2508'
    ENVIRONMENT = 'kaggle'

# For Google Colab:
elif '/content' in os.getcwd():
    PROJECT_DIR = '/content/cadi-ai'
    DATASET_PATH = '/content/dataset'  # Update this
    WORKING_DIR = '/content/cadi-training-2508'
    ENVIRONMENT = 'colab'

# For local development:
else:
    PROJECT_DIR = r'c:\Users\Mecha Mino 5 Outlook\Documents\Mino Health AI labs\cadi-ai'
    DATASET_PATH = r'c:\Users\Mecha Mino 5 Outlook\Documents\Mino Health AI labs\cadi-ai\dataset'  # Update this
    WORKING_DIR = r'c:\Users\Mecha Mino 5 Outlook\Documents\Mino Health AI labs\cadi-ai\training_outputs'
    ENVIRONMENT = 'local'

# Create working directory
os.makedirs(WORKING_DIR, exist_ok=True)

# In Kaggle, clone the repo first (skipped if it already exists, so the cell can be re-run)
if ENVIRONMENT == 'kaggle' and not os.path.isdir(PROJECT_DIR):
    !git clone https://github.com/minoHealth/cadi-ai.git {PROJECT_DIR}
os.chdir(PROJECT_DIR)

print(f"🔧 Environment: {ENVIRONMENT}")
print(f"📁 Project directory: {os.getcwd()}")
print(f"💾 Working directory: {WORKING_DIR}")


Cloning into 'cadi-ai'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 18 (delta 4), reused 15 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (18/18), 13.81 KiB | 441.00 KiB/s, done.
Resolving deltas: 100% (4/4), done.
🔧 Environment: kaggle
📁 Project directory: /kaggle/working/cadi-ai
💾 Working directory: /kaggle/working/cadi-training-2508


In [4]:
os.getcwd()

'/kaggle/working/cadi-ai'

In [3]:
# Quick Environment Diagnostic
# Note: path validation fails until data.yaml has been created in Step 1 below.
!python train.py --config config.yaml --validate-paths-only

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Running path validation only...
📁 Validating data paths...
💥 Training pipeline failed: Data file not found: data.yaml


In [None]:
# Step 1: Create data.yaml and Validate Dataset
# The dataset_utils.py script handles:
# - Auto-detection of class names
# - Cache directory configuration
# - Read-only dataset compatibility

print("📋 Creating data.yaml with auto-detected classes and cache configuration...")

# Create data.yaml with the cache directory in a writable location
!python dataset_utils.py --create-yaml {DATASET_PATH} --output-path {os.path.join(PROJECT_DIR, 'data.yaml')} --cache-dir {os.path.join(WORKING_DIR, 'cache')}

print("\n🔍 Validating the created dataset...")
# Validate the newly created dataset
!python dataset_utils.py --validate {os.path.join(PROJECT_DIR, 'data.yaml')}

✅ Created data.yaml at: /kaggle/working/cadi-ai/data.yaml
📊 Dataset Validation Report
✅ train: 2813 images, 2813 labels
✅ val  :  501 images,  501 labels
✅ test :  340 images,  340 labels

📈 Class Distribution:
  abiotic   :  7762 objects ( 57.2%)
  disease   :  3566 objects ( 26.3%)
  insect    :  2236 objects ( 16.5%)

📋 Summary:
  Total Images: 3654
  Total Labels: 3654
  Total Objects: 13564
  Classes: 3
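
The class distribution in the report can be reproduced from the label files. This is a minimal sketch rather than the code in `dataset_utils.py`: it assumes each label line starts with an integer class id (the usual YOLO text format) and that ids 0, 1, 2 map to `abiotic`, `disease`, `insect` in that order, which the report does not confirm. The function name `class_distribution` is illustrative.

```python
from collections import Counter


def class_distribution(label_lines, class_names):
    """Count objects per class from YOLO label lines ("<class_id> x y w h").

    Returns {class_name: (object_count, percent_of_all_objects)}.
    """
    counts = Counter()
    for line in label_lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[class_names[int(parts[0])]] += 1
    total = sum(counts.values())
    return {
        name: (counts[name], round(100 * counts[name] / total, 1) if total else 0.0)
        for name in class_names
    }
```

With the object counts from the report (7762, 3566, and 2236 out of 13564), the same arithmetic gives 57.2%, 26.3%, and 16.5%, matching the percentages shown.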


In [None]:
# Verify the created data.yaml configuration
print("📋 Current data.yaml configuration:")
data_yaml_path = os.path.join(PROJECT_DIR, 'data.yaml')
if os.path.exists(data_yaml_path):
    with open(data_yaml_path, 'r') as f:
        print(f.read())
else:
    print("⚠️ data.yaml not found. Please run the dataset creation step above.")

In [10]:
!pip install "numpy<2.0"

Collecting numpy<2.0
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)

In [11]:
# Step 2: Find optimal batch size
print("🔍 Finding optimal batch size for your hardware...")
print("This may take a few minutes as it tests different batch sizes.")

# We use the --find-batch argument from our updated train.py script
!python train.py --config config.yaml --find-batch

🔍 Finding optimal batch size for your hardware...
This may take a few minutes as it tests different batch sizes.
Finding optimal batch size...
🔍 Finding optimal batch size...
  Testing batch size: 32
Ultralytics 8.3.173 🚀 Python-3.11.13 torch-2.6.0+cu124 CUDA:0 (Tesla P100-PCIE-16GB, 16269MiB)
engine/trainer: agnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=32, bgr=0.0, box=7.5, cache=False, cfg=None, classes=None, close_mosaic=10, cls=0.5, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=data.yaml, degrees=0.0, deterministic=True, device=None, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, epochs=1, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, 

In [None]:
# Step 3: Start training with optimal settings
# The `train.py` script will automatically use the settings from `config.yaml`
# including the auto-detected batch size if you set it to "auto" in the config.

print("🚀 Starting training...")
print("Training will save results to the output directory specified in config.yaml.")
print("You can monitor progress in the output below.")

!python train.py --config config.yaml
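
How `train.py` resolves an "auto" batch size is not shown in this notebook. One common approach, sketched here as an assumption rather than as the script's actual method, is to start from a candidate size (the earlier log shows a trial at 32) and halve it whenever a short trial run fails with an out-of-memory error. `find_batch_size` and `try_batch` are hypothetical names; `try_batch` stands in for whatever short trial the real script runs.

```python
def find_batch_size(try_batch, start=32, minimum=1):
    """Return the largest batch size, halving from `start`, that try_batch accepts.

    try_batch(batch) is a hypothetical callable that runs a short trial and
    raises RuntimeError or MemoryError when the batch does not fit in memory.
    """
    batch = start
    while batch >= minimum:
        try:
            try_batch(batch)
            return batch  # first size that completes without error
        except (RuntimeError, MemoryError):
            batch //= 2  # too large: halve and retry
    raise RuntimeError("no batch size fits in memory")
```

Passing the trial as a callable keeps the search logic independent of the training framework, so it can be exercised with a stub before being wired to a real one-epoch run.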

In [None]:
# Step 4: Evaluate the trained model
import glob
from ultralytics import YOLO

# Find the best model weights in the output directory specified in config.yaml
with open("config.yaml", 'r') as f:
    config = yaml.safe_load(f) or {}
output_dir = config.get('output_dir', 'runs')  # Fall back to 'runs' if unset

weight_files = glob.glob(os.path.join(output_dir, '**/weights/best.pt'), recursive=True)

if weight_files:
    # glob returns matches in arbitrary order, so pick the most recently modified file
    best_weights = max(weight_files, key=os.path.getmtime)
    print(f"📊 Evaluating model: {best_weights}")

    # Load the trained model
    model = YOLO(best_weights)

    # Run validation against the data.yaml created in Step 1 (written to PROJECT_DIR)
    print("\n🧪 Running validation...")
    val_results = model.val(data=os.path.join(PROJECT_DIR, 'data.yaml'))

else:
    print("❌ No trained model weights found. Please run training first.")