# Dental Cavities Detection - YOLOv5 Training Notebook

This notebook trains a YOLOv5 model to detect dental cavities in X-ray images.

## Overview
- **Task**: Object Detection (Bounding Boxes)
- **Dataset**: Dental Cavities (Custom)
- **Model**: YOLOv5 (Small/Medium/Large)
- **Output**: Trained model weights for cavity detection

## Steps
1. **Colab Setup** (Run this first if using Google Colab)
2. Setup and Environment Check
3. Dataset Preparation
4. Training Configuration
5. Start Training
6. View Results
7. Test Inference


## 1. Colab Setup (Run This First!)

**If you're using Google Colab**, run this cell first to set up the environment.  
**If you're running locally**, skip this cell and go to "Setup and Environment Check".


In [5]:
import torch
import sys
from pathlib import Path

# Check PyTorch and CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Check if we're in the right directory
print(f"\nCurrent directory: {Path.cwd()}")
# __file__ is undefined inside a notebook, so this falls back to the working directory
print(f"YOLOv5 root: {Path(__file__).parent if '__file__' in globals() else Path.cwd()}")

# Import YOLOv5 utilities
try:
    import utils
    display = utils.notebook_init()  # checks
    print("\n✅ Setup complete!")
except Exception as e:
    print(f"\n⚠️ Setup check failed: {e}")
    print("Make sure you're running this notebook from the yolov5 directory")


PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4
CUDA memory: 15.83 GB

Current directory: /content
YOLOv5 root: /content

⚠️ Setup check failed: No module named 'utils'
Make sure you're running this notebook from the yolov5 directory
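
The `No module named 'utils'` message above is what this check reports when the YOLOv5 repository is not in the working directory. Below is a minimal sketch of a setup step. It assumes the code comes from the public `ultralytics/yolov5` GitHub repository and is cloned into a folder named `yolov5`; the notebook does not show its own setup commands, so both are assumptions. The helper names `setup_commands` and `run_setup` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/ultralytics/yolov5"  # assumed source repository


def setup_commands(repo_dir: str = "yolov5"):
    """Build the commands needed to fetch YOLOv5 and install its requirements."""
    commands = []
    # Clone only if the repository is not already present
    if not (Path(repo_dir) / "train.py").exists():
        commands.append(["git", "clone", "--depth", "1", REPO_URL, repo_dir])
    requirements = str(Path(repo_dir) / "requirements.txt")
    commands.append([sys.executable, "-m", "pip", "install", "-q", "-r", requirements])
    return commands


def run_setup(repo_dir: str = "yolov5"):
    """Run each command in order, stopping at the first failure."""
    for command in setup_commands(repo_dir):
        subprocess.run(command, check=True)


# Preview the commands without executing anything
for command in setup_commands():
    print(" ".join(command))
```

Calling `run_setup()` and then changing into the `yolov5` folder (for example with `%cd yolov5`) should let the environment check find `train.py` and import `utils`.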


## 2. Setup and Environment Check

Check if we have all required dependencies and verify the environment.


In [6]:
import torch
import sys
from pathlib import Path

# Check PyTorch and CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Check if we're in the right directory
current_dir = Path.cwd()
print(f"\nCurrent directory: {current_dir}")

# Check if we're in yolov5 directory (look for train.py)
if Path("train.py").exists():
    print("✅ Found train.py - we're in the YOLOv5 directory")
elif Path("yolov5/train.py").exists():
    print("⚠️ Found yolov5/train.py - changing to yolov5 directory...")
    %cd yolov5
    print(f"✅ Changed to: {Path.cwd()}")
else:
    print("⚠️ Warning: train.py not found!")
    print("Make sure you're in the yolov5 directory or have run the Colab setup")

# Import YOLOv5 utilities
try:
    import utils
    display = utils.notebook_init()  # checks
    print("\n✅ Setup complete!")
except Exception as e:
    print(f"\n⚠️ Setup check failed: {e}")
    print("Try running the Colab setup cell above if you're in Colab")


PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4
CUDA memory: 15.83 GB

Current directory: /content
Make sure you're in the yolov5 directory or have run the Colab setup

⚠️ Setup check failed: No module named 'utils'
Try running the Colab setup cell above if you're in Colab


## 3. Upload Dataset to Colab (Colab Only)

**If using Google Colab**, you need to upload your dataset.  
**If running locally**, skip this cell.


In [None]:
# ============================================
# UPLOAD DATASET TO COLAB - OPTION 1: Direct Upload
# ============================================
# IMPORTANT: Run this cell fresh in your browser session!
# If you see "Upload widget is only available..." error, 
# refresh the page and run this cell again.

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    print("="*60)
    print("OPTION 1: Direct Upload (Small files < 2GB)")
    print("="*60)
    print("\n📤 Upload your dataset:")
    print("   1. If you have a zip file with your Dataset folder, upload it")
    print("   2. If you already prepared the dataset, upload the 'datasets' folder as zip")
    print("   3. Or upload the YAML config file if dataset is already in Colab")
    print("\n💡 TIP: If upload widget doesn't appear, refresh the page and run this cell again")
    print("="*60)
    
    from google.colab import files
    from zipfile import ZipFile
    from pathlib import Path
    import shutil
    
    # This will show the upload widget - make sure to run this cell fresh!
    print("\n⏳ Waiting for file upload...")
    uploaded = files.upload()

    if uploaded:
        print(f"\n✅ Received {len(uploaded)} file(s)")

        for filename in uploaded.keys():
            print(f"\n📦 Processing {filename}...")
            
            if filename.endswith('.zip'):
                # Extract zip file
                print(f"   Extracting {filename}...")
                with ZipFile(filename, 'r') as zip_ref:
                    zip_ref.extractall('.')
                print(f"   ✅ Extracted {filename}")

                # If it contains Dataset folder, prepare it
                if Path("Dataset").exists():
                    print("\n🔄 Preparing dataset...")
                    !python prepare_dental_dataset.py
                elif Path("datasets").exists():
                    print("   ✅ Found prepared dataset folder")
                else:
                    print("   ⚠️ No Dataset or datasets folder found in zip")
                    
            elif filename.endswith('.yaml') or filename.endswith('.yml'):
                # Copy YAML file to data directory
                data_dir = Path("data")
                data_dir.mkdir(exist_ok=True)
                shutil.copy(filename, "data/dental_cavities.yaml")
                print(f"   ✅ Copied {filename} to data/dental_cavities.yaml")
            else:
                print(f"   ⚠️ Unknown file type: {filename}")

        print("\n" + "="*60)
        print("✅ Upload and processing complete!")
        print("="*60)
    else:
        print("\n⚠️ No files uploaded. Try running this cell again.")
else:
    print("⏭️ Skipping upload (running locally)")
    print("Make sure your dataset is in the correct location")


OPTION 1: Direct Upload (Small files < 2GB)

📤 Upload your dataset:
   1. If you have a zip file with your Dataset folder, upload it
   2. If you already prepared the dataset, upload the 'datasets' folder as zip
   3. Or upload the YAML config file if dataset is already in Colab

💡 TIP: If upload widget doesn't appear, refresh the page and run this cell again

⏳ Waiting for file upload...


## 3b. Mount Google Drive (Alternative - For Large Datasets)

**Recommended for large datasets!** Mount your Google Drive to access files stored there.

**Why use Drive instead of upload?**
- No file size limits (upload has ~2GB limit)
- Faster for large datasets
- Files persist between Colab sessions
- No need to re-upload if you disconnect


In [2]:
# ============================================
# MOUNT GOOGLE DRIVE (Alternative Method)
# ============================================
# Use this if your dataset is already in Google Drive
# This is better for large datasets that take too long to upload

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import drive
    from pathlib import Path
    import shutil
    
    print("🔗 Mounting Google Drive...")
    print("   (You'll need to authorize access - click the link and sign in)")
    drive.mount('/content/drive')

    print("\n✅ Drive mounted!")
    print("\n📁 Now you can access files from your Drive:")
    print("   Example: /content/drive/MyDrive/your_dataset_folder")
    
    # Ask user for the path to their dataset in Drive
    print("\n" + "="*60)
    print("Copy your dataset from Drive to Colab workspace")
    print("="*60)
    
    # Example: Copy from Drive to current directory
    drive_dataset_path = input("\nEnter path to your dataset in Drive (e.g., /content/drive/MyDrive/dental_cavities/Dataset): ").strip()
    
    if drive_dataset_path and Path(drive_dataset_path).exists():
        print(f"\n📦 Copying dataset from {drive_dataset_path}...")

        # Accept either the Dataset folder itself or the directory that contains it
        drive_path = Path(drive_dataset_path)
        source = drive_path if drive_path.name == "Dataset" else drive_path / "Dataset"
        dest = Path("Dataset")
        if source.exists():
            # Replace any earlier copy so stale files don't linger
            if dest.exists():
                shutil.rmtree(dest)
            shutil.copytree(source, dest)
            print("✅ Copied Dataset folder")

            # Prepare dataset
            if Path("prepare_dental_dataset.py").exists():
                print("\n🔄 Preparing dataset...")
                !python prepare_dental_dataset.py
        else:
            print("⚠️ Dataset folder not found at that path")
    else:
        print("⚠️ Path not found or not provided")
        print("\n💡 Manual steps:")
        print("   1. Find your dataset in /content/drive/MyDrive/...")
        print("   2. Copy it to the current directory")
        print("   3. Run: !python prepare_dental_dataset.py")
else:
    print("⏭️ Skipping Drive mount (running locally)")


🔗 Mounting Google Drive...
   (You'll need to authorize access - click the link and sign in)


KeyboardInterrupt: 

## 3c. Download from URL (Alternative)

If your dataset is hosted online (GitHub, Dropbox, etc.), download it directly.


In [None]:
# ============================================
# DOWNLOAD DATASET FROM URL
# ============================================
# Use this if your dataset is hosted online (GitHub, Dropbox, Google Drive share link, etc.)

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    import urllib.request
    from zipfile import ZipFile
    from pathlib import Path
    
    print("🌐 Download dataset from URL")
    print("\n   Supported:")
    print("   - Direct download links (.zip files)")
    print("   - Google Drive share links (need to convert)")
    print("   - GitHub releases")
    
    # Example: Download from URL
    dataset_url = input("\nEnter dataset URL (or press Enter to skip): ").strip()
    
    if dataset_url:
        print(f"\n📥 Downloading from {dataset_url}...")

        # Download file
        filename = "dataset.zip"
        try:
            urllib.request.urlretrieve(dataset_url, filename)
            print(f"✅ Downloaded to {filename}")

            # Extract
            print("\n📦 Extracting...")
            with ZipFile(filename, 'r') as zip_ref:
                zip_ref.extractall('.')
            print("✅ Extracted")

            # Prepare if Dataset folder exists
            if Path("Dataset").exists():
                print("\n🔄 Preparing dataset...")
                !python prepare_dental_dataset.py
        except Exception as e:
            print(f"❌ Error: {e}")
            print("\n💡 Tips:")
            print("   - Make sure the URL is a direct download link")
            print("   - For Google Drive: Use 'Download' link, not 'View' link")
            print("   - For large files, consider using Drive mount instead")
    else:
        print("⏭️ Skipped URL download")
else:
    print("⏭️ Skipping URL download (running locally)")

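The cell above notes that Google Drive share links "need to convert" before they can be downloaded directly. One way this could look is sketched below. The function name `drive_direct_url` is hypothetical, and the two URL shapes it recognizes are assumptions about common share-link formats rather than a complete list. Large files may still return a confirmation page instead of the archive, in which case mounting Drive is the safer route.

```python
import re
from typing import Optional

# Two common share-link shapes (assumed, not exhaustive)
_ID_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),  # .../file/d/<id>/view
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),   # ...open?id=<id>
)


def drive_direct_url(share_url: str) -> Optional[str]:
    """Return a direct-download URL for a Drive share link, or None if no file id is found."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(share_url)
        if match:
            return f"https://drive.google.com/uc?export=download&id={match.group(1)}"
    return None


# Placeholder id for illustration only
print(drive_direct_url("https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"))
```

The returned string can be entered at the `dataset_url` prompt in place of the original share link.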

In [8]:
from pathlib import Path
import yaml

# Check dataset structure
dataset_path = Path("datasets/dental_cavities")
yaml_path = Path("data/dental_cavities.yaml")

print("Checking dataset...")
print(f"Dataset path exists: {dataset_path.exists()}")
print(f"YAML config exists: {yaml_path.exists()}")

if dataset_path.exists():
    train_images = len(list((dataset_path / "images" / "train").glob("*.*")))
    train_labels = len(list((dataset_path / "labels" / "train").glob("*.txt")))
    val_images = len(list((dataset_path / "images" / "val").glob("*.*")))
    val_labels = len(list((dataset_path / "labels" / "val").glob("*.txt")))
    
    print(f"\nDataset statistics:")
    print(f"  Train images: {train_images}")
    print(f"  Train labels: {train_labels}")
    print(f"  Val images: {val_images}")
    print(f"  Val labels: {val_labels}")
    
    if train_images == train_labels and val_images == val_labels:
        print("\n✅ Dataset is ready!")
    else:
        print("\n⚠️ Mismatch between images and labels!")
else:
    print("\n⚠️ Dataset not found. Run prepare_dental_dataset.py first!")
    print("   Command: python prepare_dental_dataset.py")

# Load and display YAML config
if yaml_path.exists():
    with open(yaml_path, 'r') as f:
        config = yaml.safe_load(f)
    print(f"\nDataset configuration:")
    print(f"  Path: {config.get('path')}")
    print(f"  Classes: {config.get('nc')}")
    print(f"  Class names: {config.get('names')}")


Checking dataset...
Dataset path exists: False
YAML config exists: False

⚠️ Dataset not found. Run prepare_dental_dataset.py first!
   Command: python prepare_dental_dataset.py
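
The check above only compares file counts, so it cannot say which files are unpaired. A small sketch that matches images and labels by file stem follows. The helper name `unmatched_files` and the list of image suffixes are assumptions for illustration. Note that YOLOv5 treats an image without a label file as a background image, so an unlabeled image is not necessarily an error.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed set of image formats; adjust to match the dataset
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def unmatched_files(images_dir: Path, labels_dir: Path) -> Tuple[List[str], List[str]]:
    """Return (image stems without a label file, label stems without an image)."""
    image_stems = {p.stem for p in images_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}
    label_stems = {p.stem for p in labels_dir.glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


root = Path("datasets/dental_cavities")
for split in ("train", "val"):
    images_dir = root / "images" / split
    labels_dir = root / "labels" / split
    # Skip splits that have not been prepared yet
    if images_dir.is_dir() and labels_dir.is_dir():
        no_label, no_image = unmatched_files(images_dir, labels_dir)
        print(f"{split}: {len(no_label)} images without labels, {len(no_image)} labels without images")
```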


## 4. Training Configuration

Configure your training parameters. Adjust these based on your system's memory and requirements.


In [None]:
# ============================================
# TRAINING CONFIGURATION
# ============================================
# Adjust these parameters based on your needs

# Dataset
DATA_YAML = "data/dental_cavities.yaml"  # Path to dataset config

# Model selection (choose one)
# - yolov5n.pt: Nano (smallest, fastest, least memory)
# - yolov5s.pt: Small (recommended for baseline)
# - yolov5m.pt: Medium (better accuracy)
# - yolov5l.pt: Large
# - yolov5x.pt: Extra Large (best accuracy, most memory)
WEIGHTS = "yolov5s.pt"  # Pretrained weights or '' for training from scratch

# Training parameters
EPOCHS = 10  # Number of epochs (10 for baseline, 100+ for full training)
BATCH_SIZE = 4  # Batch size (reduce if out of memory: try 2, 4, 8)
IMG_SIZE = 416  # Image size (320, 416, 512, 640 - smaller uses less memory)

# Advanced options
DEVICE = ""  # Leave empty for auto-detect, or specify 'cpu', '0', '0,1,2,3'
WORKERS = 4  # Number of dataloader workers (reduce if memory issues)
PROJECT = "runs/train"  # Project directory
NAME = "dental_cavities"  # Experiment name

# Optional: Resume from checkpoint
# RESUME = "runs/train/dental_cavities/weights/last.pt"  # Uncomment to resume
RESUME = False

print("Training Configuration:")
print(f"  Dataset: {DATA_YAML}")
print(f"  Weights: {WEIGHTS}")
print(f"  Epochs: {EPOCHS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Image size: {IMG_SIZE}")
print(f"  Device: {DEVICE if DEVICE else 'Auto'}")
print(f"  Workers: {WORKERS}")
print(f"  Project: {PROJECT}/{NAME}")


Training Configuration:
  Dataset: data/dental_cavities.yaml
  Weights: yolov5s.pt
  Epochs: 10
  Batch size: 4
  Image size: 416
  Device: Auto
  Workers: 4
  Project: runs/train/dental_cavities
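
`DATA_YAML` points at a dataset config that the notebook reads for its `path`, `nc`, and `names` keys. As a rough sketch of what such a file might contain, the snippet below builds the text for a single-class config. The class name `cavity`, the `images/train` and `images/val` sub-paths, and the helper name `build_dataset_yaml` are assumptions based on the folder layout checked earlier, not the output of `prepare_dental_dataset.py`.

```python
from pathlib import Path


def build_dataset_yaml(root: str, class_names: list) -> str:
    """Return YOLOv5-style dataset config text for the given root and class names."""
    lines = [
        f"path: {root}  # dataset root",
        "train: images/train  # relative to path",
        "val: images/val  # relative to path",
        f"nc: {len(class_names)}  # number of classes",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


yaml_text = build_dataset_yaml("datasets/dental_cavities", ["cavity"])
print(yaml_text)

# To save it (left commented so the sketch does not overwrite an existing config):
# Path("data").mkdir(exist_ok=True)
# Path("data/dental_cavities.yaml").write_text(yaml_text)
```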


## 5. Start Training

Run the training. This may take a while depending on your configuration.


In [None]:
# Import training module
import train
from utils.callbacks import Callbacks

# Prepare training arguments
training_args = {
    'data': DATA_YAML,
    'weights': WEIGHTS,
    'epochs': EPOCHS,
    'batch_size': BATCH_SIZE,
    'imgsz': IMG_SIZE,
    'device': DEVICE,
    'workers': WORKERS,
    'project': PROJECT,
    'name': NAME,
    'exist_ok': True,  # Overwrite existing experiment
}

# Add resume if specified
if RESUME:
    training_args['resume'] = RESUME

print("Starting training...")
print("=" * 60)

# Run training
try:
    opt = train.run(**training_args)
    print("\n✅ Training completed!")
except KeyboardInterrupt:
    print("\n⚠️ Training interrupted by user")
except Exception as e:
    print(f"\n❌ Training failed: {e}")
    import traceback
    traceback.print_exc()


ModuleNotFoundError: No module named 'train'
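
`import train` only works when the notebook runs from the YOLOv5 repository root, which is why the cell above fails with `ModuleNotFoundError`. An alternative is to call `train.py` as a script. The sketch below builds that command from the same settings; the helper name `build_train_command` is hypothetical, and it assumes the repository sits in a `yolov5` folder.

```python
import sys


def build_train_command(data, weights, epochs, batch_size, img_size,
                        device="", workers=4, project="runs/train", name="exp"):
    """Build the argument list for running YOLOv5's train.py as a script."""
    command = [
        sys.executable, "train.py",
        "--data", data,
        "--weights", weights,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--imgsz", str(img_size),
        "--workers", str(workers),
        "--project", project,
        "--name", name,
        "--exist-ok",
    ]
    # An empty device string means auto-detect, so the flag is left out
    if device:
        command += ["--device", device]
    return command


cmd = build_train_command("data/dental_cavities.yaml", "yolov5s.pt", 10, 4, 416,
                          name="dental_cavities")
print(" ".join(cmd))

# To launch it from outside the repository (assumed folder name):
# import subprocess; subprocess.run(cmd, cwd="yolov5", check=True)
```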

## 6. View Training Results

Display training results, metrics, and visualizations.


In [None]:
from pathlib import Path
from IPython.display import Image, display
import glob

# Find the latest training run
runs_dir = Path(PROJECT) / NAME
if not runs_dir.exists():
    # Try to find the latest exp directory
    exp_dirs = sorted(Path(PROJECT).glob("exp*"), key=lambda x: x.stat().st_mtime, reverse=True)
    if exp_dirs:
        runs_dir = exp_dirs[0]
        print(f"Using latest run: {runs_dir}")

if runs_dir.exists():
    print(f"Results directory: {runs_dir}")
    
    # Display training results
    results_img = runs_dir / "results.png"
    if results_img.exists():
        print("\n📊 Training Results:")
        display(Image(str(results_img), width=800))
    
    # Display confusion matrix
    confusion_img = runs_dir / "confusion_matrix.png"
    if confusion_img.exists():
        print("\n📈 Confusion Matrix:")
        display(Image(str(confusion_img), width=600))
    
    # Display validation batch
    val_batch = runs_dir / "val_batch0_labels.jpg"
    if val_batch.exists():
        print("\n🔍 Validation Batch (Ground Truth):")
        display(Image(str(val_batch), width=600))
    
    val_pred = runs_dir / "val_batch0_pred.jpg"
    if val_pred.exists():
        print("\n🎯 Validation Batch (Predictions):")
        display(Image(str(val_pred), width=600))
    
    # Show best weights location
    best_weights = runs_dir / "weights" / "best.pt"
    last_weights = runs_dir / "weights" / "last.pt"
    
    print("\n💾 Model Weights:")
    print(f"  Best model: {best_weights}")
    print(f"  Last model: {last_weights}")
    print(f"  Best model exists: {best_weights.exists()}")
    print(f"  Last model exists: {last_weights.exists()}")
    
else:
    print(f"❌ Results directory not found: {runs_dir}")
    print("Training may not have completed successfully.")
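
Beyond the plots, the per-epoch metrics can be read as numbers. The sketch below assumes the run directory contains a `results.csv` with an `epoch` column and a `metrics/mAP_0.5` column, and that header names may be padded with spaces; check the actual header of your file, since column names can differ between versions. The helper name `best_epoch` is hypothetical.

```python
import csv
from pathlib import Path


def best_epoch(results_csv: Path, metric: str = "metrics/mAP_0.5"):
    """Return (epoch, value) for the row with the highest metric, or None if unavailable."""
    with open(results_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        # Strip padding from both column names and values
        rows = [
            {key.strip(): value.strip() for key, value in row.items()
             if key is not None and value is not None}
            for row in reader
        ]
    if not rows or metric not in rows[0] or "epoch" not in rows[0]:
        return None
    best = max(rows, key=lambda row: float(row[metric]))
    return int(float(best["epoch"])), float(best[metric])


csv_path = Path("runs/train/dental_cavities/results.csv")
if csv_path.exists():
    print(best_epoch(csv_path))
```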


## 7. Test Inference on Sample Images

Test your trained model on validation images or new images.


In [None]:
from pathlib import Path
from IPython.display import Image, display
import detect

# Find the best weights
weights_path = Path(PROJECT) / NAME / "weights" / "best.pt"
if not weights_path.exists():
    # Try last.pt
    weights_path = Path(PROJECT) / NAME / "weights" / "last.pt"

if weights_path.exists():
    print(f"Using weights: {weights_path}")
    
    # Test on validation images
    val_images_dir = Path("datasets/dental_cavities/images/val")

    if val_images_dir.exists():
        # Confirm the directory is not empty before running detection
        sample_images = list(val_images_dir.glob("*.*"))

        if sample_images:
            print(f"\nRunning detection on {len(sample_images)} validation images...")

            # Run detection on the whole validation directory
            detect.run(
                weights=str(weights_path),
                source=str(val_images_dir),
                imgsz=(IMG_SIZE, IMG_SIZE),  # detect.run expects (height, width)
                conf_thres=0.25,  # Confidence threshold
                save_txt=True,
                save_conf=True,
                project="runs/detect",
                name="dental_cavities_test",
                exist_ok=True
            )

            # Display the first few annotated images
            results_dir = Path("runs/detect/dental_cavities_test")
            if results_dir.exists():
                result_images = list(results_dir.glob("*.jpg"))[:5]
                print("\n📸 Detection Results:")
                for img_path in result_images:
                    display(Image(str(img_path), width=400))
        else:
            print("No images found in validation directory")
    else:
        print(f"Validation images directory not found: {val_images_dir}")
else:
    print(f"❌ Model weights not found: {weights_path}")
    print("Training may not have completed successfully.")
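
With `save_txt=True` and `save_conf=True`, each image also gets a text file of detections under the run's `labels` folder. The sketch below parses one such line, assuming the format `class x_center y_center width height confidence` with coordinates normalized to the image size. The names `Detection`, `parse_detection_line`, and `to_pixel_box` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float
    confidence: float


def parse_detection_line(line: str) -> Detection:
    """Parse one 'class x y w h conf' line with normalized coordinates."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"Expected 6 fields, got {len(parts)}: {line!r}")
    return Detection(int(parts[0]), *(float(value) for value in parts[1:]))


def to_pixel_box(det: Detection, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a normalized center box to pixel corners (x1, y1, x2, y2)."""
    x1 = (det.x_center - det.width / 2) * img_w
    y1 = (det.y_center - det.height / 2) * img_h
    x2 = (det.x_center + det.width / 2) * img_w
    y2 = (det.y_center + det.height / 2) * img_h
    return x1, y1, x2, y2


# Made-up example line, not real model output
example = parse_detection_line("0 0.5 0.5 0.2 0.4 0.87")
print(example)
print(to_pixel_box(example, img_w=100, img_h=200))
```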


## 8. Continue Training (Optional)

To train for more epochs after a run has finished, start a new run initialized from the last checkpoint's weights. YOLOv5's `resume` option is meant for continuing an interrupted run with its original settings, not for extending a completed one.


In [None]:
# Uncomment and modify to continue training
# This starts a new run from the last checkpoint's weights and trains for additional epochs.
# resume is not used here: it restores an interrupted run rather than extending a finished one.

# CONTINUE_EPOCHS = 50  # Additional epochs to train
# LAST_WEIGHTS = f"{PROJECT}/{NAME}/weights/last.pt"
# 
# if Path(LAST_WEIGHTS).exists():
#     print(f"Continuing training from {LAST_WEIGHTS}")
#     train.run(
#         data=DATA_YAML,
#         weights=LAST_WEIGHTS,
#         epochs=CONTINUE_EPOCHS,
#         batch_size=BATCH_SIZE,
#         imgsz=IMG_SIZE,
#         device=DEVICE,
#         workers=WORKERS,
#         project=PROJECT,
#         name=f"{NAME}_continued",  # separate folder so the first run is kept
#     )
# else:
#     print(f"Last weights not found: {LAST_WEIGHTS}")

print("To continue training, uncomment the code above and adjust CONTINUE_EPOCHS")


## Tips and Troubleshooting

### Memory Issues
- Reduce `BATCH_SIZE` (for example, from 4 to 2)
- Reduce `IMG_SIZE` (for example, from 416 to 320)
- Use `yolov5n.pt` instead of `yolov5s.pt`
- Reduce `WORKERS` to 2

### Better Results
- Train for more epochs (100+)
- Use larger model (`yolov5m.pt` or `yolov5l.pt`)
- Use larger image size (640)
- Ensure dataset quality and annotation accuracy

### Quick Baseline
- Use 10 epochs for quick testing
- Use smaller model and image size
- Check if training is working before full training

### Resume Training
- Set `RESUME = "runs/train/dental_cavities/weights/last.pt"` to continue an interrupted run from its checkpoint
- To train a finished run for more epochs, use the "Continue Training" section above


