Kidney stone detection with yolo26 #24093

CihangirEmre · 2026-04-01T20:17:04Z

Hello, I'm training a model for my graduation project to detect kidney stones in CT images. I'm using YOLO26 (in my benchmarks it gave better results than RT-DETR). My dataset currently contains 2564 training images.
Test results:

mAP50 : 0.85
mAP50_95 : 0.55
precision : 0.86
recall : 0.79

I wanted to get better results, so I did some research:
How can recall be increased?
How can mAP50 be increased?
How can the model detect smaller objects?
(For my purposes, recall is the most important metric.)

What I found was a bunch of custom YOLO models (different backbones, a P2 aux head, attention blocks, etc.) that reported higher recall or mAP50, even if the gains were small, so I tried to implement these changes. My results decreased pretty dramatically, and I couldn't figure out why. How can I achieve what I'm aiming for? Am I looking in the wrong direction? What could be the reason for the low results?

P.S. I'm fine-tuning the model

Answered by glenn-jocher

Apr 2, 2026

UltralyticsAssistant · 2026-04-01T20:17:34Z

Maintainer

👋 Hello @CihangirEmre, thank you for your interest in Ultralytics 🚀 and for sharing details about your kidney stone detection project with YOLO26. This is an automated response to help get the discussion moving quickly, and an Ultralytics engineer will also assist you soon.

We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

Since this appears to be a custom training ❓ question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results. In particular, it would help if you could share:

Your YOLO26 model variant and training command
A few representative CT images with labels
Training and validation curves/logs
Image size, augmentation settings, and dataset YAML
Whether your validation set has similar stone sizes and scan characteristics as training
What exact architecture changes you tested and how you evaluated them

Join the Ultralytics community where it suits you best. For real-time chat, head to Discord 🎧. Prefer in-depth discussions? Check out Discourse. Or dive into threads on our Subreddit to share knowledge with the community.

Upgrade

Upgrade to the latest ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8 to verify your issue is not already resolved in the latest version:

pip install -U ultralytics

Environments

YOLO may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

0 replies

Nickalus12 · 2026-04-01T23:08:30Z

I've hit this exact wall with medical datasets before. Standard YOLO architectures often struggle when the target is only a few pixels wide. The reason those custom P2 heads and attention blocks likely tanked your results is that they are notoriously hard to converge on smaller datasets, like your 2.5k images, without massive pre-training.

Instead of fighting the architecture, I would look at SAHI (Slicing Aided Hyper Inference). By tiling your CT scans during inference, the model sees the stones at their original resolution instead of downsampled blobs. It is usually the fastest way to spike recall for tiny objects.
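To make the slicing idea concrete, here is a minimal sketch of how an image can be cut into overlapping tiles, which is the core of slicing-aided inference. It does not call the SAHI library itself; the function name `tile_grid`, the 256-pixel tile size, the 20% overlap, and the 512x512 image size are illustrative assumptions, not values from this thread.

```python
def tile_grid(width, height, tile=256, overlap=0.2):
    """Return (x1, y1, x2, y2) tiles that cover the image with overlap."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(tile * (1 - overlap)))

    def starts(size):
        # Slide by `step`, then add a final tile flush with the far edge
        # so the last pixels are always covered.
        if size <= tile:
            return [0]
        pos = list(range(0, size - tile, step))
        pos.append(size - tile)
        return pos

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Example: a 512x512 slice cut into 256-pixel tiles with 20% overlap.
tiles = tile_grid(512, 512, tile=256, overlap=0.2)
print(len(tiles), tiles[0], tiles[-1])
```

In a full pipeline each tile would be passed to the detector separately, the predicted boxes shifted by the tile's (x1, y1) offset back into full-image coordinates, and overlapping duplicates merged, for example with NMS.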

Also, check your imgsz. If you are training at 640 but the stones are tiny in the raw CT, they might be getting lost in the stride-32 downsampling. Try pushing imgsz to 1024.
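A rough back-of-the-envelope check of this point: the sketch below estimates how many feature-map cells a stone spans after resizing and downsampling. The 512-pixel source size and 6-pixel stone width are made-up example values, and strides 8/16/32 are the usual YOLO detection-head strides, assumed here rather than confirmed for YOLO26.

```python
def cells_spanned(obj_px, src_size, imgsz, stride):
    """Approximate object width in feature-map cells at a given stride."""
    scale = imgsz / src_size          # resize factor applied to the image
    return obj_px * scale / stride    # width in cells after downsampling


# Compare a hypothetical 6 px stone in a 512 px slice at two training sizes.
for imgsz in (640, 1024):
    row = {s: round(cells_spanned(6, 512, imgsz, s), 2) for s in (8, 16, 32)}
    print(imgsz, row)
```

Under these assumptions the stone covers well under one cell at stride 32 for either size, and only exceeds one cell at stride 8 once imgsz is raised to 1024, which is the intuition behind trying a larger imgsz.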

One gotcha: Watch your conf threshold. For medical use cases where recall is king, try dropping it to 0.1 or 0.15 during validation to see how many stones you're actually missing vs just filtering out.
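The effect of the threshold can be illustrated with a small toy sweep. The detections below are invented purely for illustration, and `recall_at` is a hypothetical helper rather than an Ultralytics function; it assumes each true positive matches a distinct ground-truth stone.

```python
def recall_at(detections, num_gt, conf):
    """Recall when only detections with score >= conf are kept.

    detections: list of (score, is_true_positive) pairs, where each true
    positive is assumed to match a distinct ground-truth stone.
    """
    tp = sum(1 for score, is_tp in detections if is_tp and score >= conf)
    return tp / num_gt if num_gt else 0.0


# Toy, made-up detections for 10 ground-truth stones.
dets = [(0.92, True), (0.81, True), (0.66, True), (0.48, True),
        (0.31, True), (0.22, True), (0.18, False), (0.12, True),
        (0.09, False)]

for conf in (0.5, 0.25, 0.1):
    print(f"conf={conf}: recall={recall_at(dets, 10, conf):.2f}")
```

Stones that only appear once the threshold is lowered were being filtered out, while stones that never appear at any threshold are genuinely missed by the model. With Ultralytics, the same comparison can be run by passing different `conf` values to `model.val()` or `model.predict()`.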

3 replies

glenn-jocher Apr 2, 2026
Maintainer

You’re likely looking in the right place, but with a 2.5k-image medical dataset the biggest gains usually come from data and training setup rather than custom heads. For small stones, a higher imgsz, a lower validation/inference conf when recall matters, and image tiling are typically more effective; our model evaluation and fine-tuning guide specifically notes tiling for small objects and trying warmup_epochs=0 when fine-tuning. Since your mAP50 is strong but mAP50-95 is much lower, the bottleneck is often localization quality, so I’d focus on annotation consistency and small-object visibility first rather than architecture changes. If you can share a few sample CTs and your training args, we can suggest more targeted tweaks.
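Why a strong mAP50 can coexist with a much lower mAP50-95 on tiny targets can be seen with a little IoU arithmetic. The sketch below shifts a box by 2 pixels and compares the resulting IoU for a small and a large box; the box sizes and the 2-pixel error are illustrative assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in pixels."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted(box, dx):
    """Return the box moved horizontally by dx pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1, x2 + dx, y2)


small = (100, 100, 108, 108)   # 8x8 px box, e.g. a tiny stone
large = (100, 100, 180, 180)   # 80x80 px box

for name, box in (("small", small), ("large", large)):
    print(name, round(iou(box, shifted(box, 2)), 3))
```

The same 2-pixel error leaves the large box above an IoU of 0.95, while the small box falls to about 0.6. That still counts as a hit at the 0.5 threshold used by mAP50 but fails the stricter thresholds that mAP50-95 averages over, so slightly loose labels or predictions on tiny stones are penalized heavily.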

Nickalus12 Apr 2, 2026

Good point on the imgsz and augmentation tuning - for small structures like kidney stones that's probably where the biggest gains are. Thanks for the additional context.

glenn-jocher Apr 2, 2026
Maintainer

Yes, that’s usually the safer path here. For tiny medical targets, architecture changes often hurt more than help on smaller datasets, so I’d keep the base YOLO26 model and iterate on imgsz, tiling, label quality, and validation conf. If @CihangirEmre shares a few representative CT crops plus the exact training args here, we can suggest more targeted next steps.

CihangirEmre · 2026-04-02T15:33:04Z

Author

Thank you for the advice. While I was searching I came across SAHI too but didn't try it; I will definitely check it out. @glenn-jocher, here are some validation and training batch results and my training args. I have never tried imgsz=1024 with this dataset, so I'll take note of that.

hyperparameters = {
    # Data
    'data': '/content/drive/MyDrive/kidney_stone_roboflow_filtered/data.yaml',

    'epochs': 200,
    'batch': 16,
    'imgsz': 640,
    'lr0': 0.001,
    'lrf': 0.01,
    'momentum': 0.937,
    'weight_decay': 0.0005,

    'optimizer': 'AdamW',
    'cos_lr': True,

    # Regularization
    'dropout': 0.0,
    'patience': 40,

    # Augmentation
    'hsv_h': 0.015,
    'hsv_s': 0.7,
    'hsv_v': 0.4,
    'degrees': 0.0,
    'translate': 0.1,
    'scale': 0.5,
    'fliplr': 0.5,
    'mosaic': 1.0,
    'mixup': 0.0,
    'close_mosaic': 10,

    'device': '0',
    'workers': 8,
    'amp': True,
    'cache': False,

    'project': 'runs/train',
    'name': experiment_name,
    'exist_ok': True,
    'save': True,
    'save_period': 10,

    'plots': True,
    'verbose': True,
}
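One possible way to try the suggestions from this thread on top of arguments like these is sketched below. The helper names `with_overrides` and `train`, the `yolo26n.pt` weights file name, the shortened `data.yaml` path, and the reduced batch size are all assumptions for illustration; only `imgsz=1024` and `warmup_epochs=0` come from the replies above.

```python
def with_overrides(base_args, **overrides):
    """Return a copy of the training args with selected keys replaced."""
    args = dict(base_args)
    args.update(overrides)
    return args


def train(args, weights="yolo26n.pt"):
    """Fine-tune with Ultralytics; imported lazily so the helper above
    can be used without the package installed."""
    from ultralytics import YOLO

    model = YOLO(weights)          # weights file name is an assumption
    return model.train(**args)


# Subset of the posted arguments, for illustration only.
base = {"data": "data.yaml", "epochs": 200, "batch": 16, "imgsz": 640}

# Suggestions from this thread: larger images, no warmup when fine-tuning.
# A smaller batch is an assumption to offset the extra memory at 1024 px.
trial = with_overrides(base, imgsz=1024, warmup_epochs=0, batch=8)
print(trial)
# train(trial)  # uncomment to launch training
```

Keeping the baseline dictionary untouched and changing one or two keys per run makes it easier to tell which change actually moved recall.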

Again, thank you for your advice. I just discovered this Q&A section and I think I will be spending time here.

3 replies

glenn-jocher Apr 2, 2026
Maintainer

Suggested reply:

Thanks for sharing the batches. The gap between mAP50 and mAP50-95 suggests the main issue is likely localization on very small targets, so I’d pause backbone/head changes. Instead, check annotation tightness and review the class-wise model.val() outputs together with the confusion matrix and PR/F1 curves. For fine-tuning, try warmup_epochs=0 and training-time image tiling before making architecture edits. The model evaluation guide and performance metrics guide are the best references here. (docs.ultralytics.com)
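For training-time tiling, the labels have to be remapped into each tile's coordinate frame. The following is a minimal sketch of that step under stated assumptions: boxes are given in absolute pixels, `boxes_in_tile` is a hypothetical helper (not part of Ultralytics), and boxes with less than half of their area inside the tile are dropped.

```python
def boxes_in_tile(boxes, tile, min_visible=0.5):
    """Map absolute-pixel boxes into one tile as YOLO-normalized labels.

    boxes: list of (cls, x1, y1, x2, y2) in full-image pixels.
    tile:  (tx1, ty1, tx2, ty2) in full-image pixels.
    Boxes with less than `min_visible` of their area inside the tile are dropped.
    Returns a list of (cls, xc, yc, w, h) normalized to the tile size.
    """
    tx1, ty1, tx2, ty2 = tile
    tw, th = tx2 - tx1, ty2 - ty1
    out = []
    for cls, x1, y1, x2, y2 in boxes:
        # Clip the box to the tile.
        cx1, cy1 = max(x1, tx1), max(y1, ty1)
        cx2, cy2 = min(x2, tx2), min(y2, ty2)
        if cx2 <= cx1 or cy2 <= cy1:
            continue                      # no overlap with this tile
        area = (x2 - x1) * (y2 - y1)
        visible = (cx2 - cx1) * (cy2 - cy1) / area
        if visible < min_visible:
            continue                      # too little of the box is visible
        out.append((cls,
                    ((cx1 + cx2) / 2 - tx1) / tw,
                    ((cy1 + cy2) / 2 - ty1) / th,
                    (cx2 - cx1) / tw,
                    (cy2 - cy1) / th))
    return out


boxes = [(0, 300, 300, 316, 316),   # fully inside the tile below
         (0, 248, 400, 260, 412),   # mostly outside (straddles the left edge)
         (0, 40, 40, 50, 50)]       # belongs to a different tile
labels = boxes_in_tile(boxes, tile=(256, 256, 512, 512))
print(labels)
```

Each returned tuple can be written as one line of a YOLO label file for the cropped tile image. Tiles that end up with no labels can still be kept as background examples, though how many to keep is a judgment call.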

Answer selected by CihangirEmre

CihangirEmre Apr 3, 2026
Author

Thank you for the reply. I have an additional question:
How can I measure the quality of a dataset, and which parameters should I check?
What are the key things?

glenn-jocher Apr 10, 2026
Maintainer

The biggest dataset-quality checks are annotation tightness and consistency, complete labeling of every stone, duplicate or near-duplicate patient slices across train/val, class and object-size balance, and image quality/resolution; for small CT targets especially, even slightly loose or missing boxes can hurt recall a lot, so I’d start with a manual audit of 100 to 200 samples and use the data-centric AI guide plus the model testing guide as a practical checklist.
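Two of these checks can be partly automated before the manual audit. The sketch below flags patients whose slices appear in more than one split and counts very small boxes in YOLO-format labels. The file-naming scheme `<patient>_<slice>.jpg`, the 8-pixel "tiny" cutoff, and both helper names are hypothetical and would need adapting to the actual dataset.

```python
from collections import defaultdict


def leaked_patients(split_files, patient_of):
    """Return patient IDs whose slices appear in more than one split.

    split_files: dict like {"train": [names...], "val": [names...]}.
    patient_of:  function mapping a file name to a patient ID (the naming
                 scheme is dataset-specific and assumed here).
    """
    seen = defaultdict(set)
    for split, names in split_files.items():
        for name in names:
            seen[patient_of(name)].add(split)
    return sorted(p for p, splits in seen.items() if len(splits) > 1)


def box_size_summary(label_lines, imgsz, tiny_px=8):
    """Count YOLO-format boxes whose longer side is below `tiny_px` pixels."""
    sides = []
    for line in label_lines:
        parts = line.split()
        if len(parts) != 5:
            continue                      # skip malformed or empty lines
        w, h = float(parts[3]), float(parts[4])
        sides.append(max(w, h) * imgsz)
    tiny = sum(1 for s in sides if s < tiny_px)
    return {"boxes": len(sides), "tiny": tiny}


# Hypothetical file names of the form "<patient>_<slice>.jpg".
splits = {"train": ["p01_003.jpg", "p02_010.jpg"],
          "val": ["p02_011.jpg", "p03_001.jpg"]}
print(leaked_patients(splits, lambda n: n.split("_")[0]))

labels = ["0 0.50 0.50 0.010 0.012", "0 0.30 0.70 0.050 0.040"]
print(box_size_summary(labels, imgsz=640))
```

A non-empty leakage list means validation metrics may be optimistic, and a large share of tiny boxes is a hint that tiling or a larger imgsz is worth testing. Annotation tightness and missing stones still need to be checked by eye.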
