Training failed using yolov6l on 1GPU, Assertion `target_val >= zero && target_val <= one` failed, Data is verified but still training fails #1038

Yahya-Younes · 2024-04-22T12:18:12Z

Before Asking

I have read the README carefully.
I want to train my custom dataset, and I have read the tutorials for training your custom data carefully and organized my dataset correctly. (FYI: We recommend applying the xx_finetune.py config files.)
I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

I have searched the YOLOv6 issues and found no similar questions.

Question

I am training yolov6l, cloned from this repo, on my custom dataset on a single GPU for 350 or 500 epochs with different batch sizes. Each time, training fails with this error. When I resume, training continues for a while and then stops again; each resume runs for fewer epochs, until it reaches epoch 105, where it cannot continue at all.

I verified my data and labels, and they are normalized. The GPU I am using runs fine with YOLOv5 and YOLOv8, so I don't know what the problem is here.
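For reference, one way to double-check the "labels are normalized" claim is a small standalone script. This is a minimal sketch, assuming YOLO-format text labels (one `class x_center y_center width height` line per box, all coordinates in [0, 1]); the function name `find_bad_labels` and the tolerance value are hypothetical and not part of YOLOv6.

```python
from pathlib import Path


def find_bad_labels(label_dir, num_classes, eps=1e-6):
    """Return (file_name, line_number, reason) for every suspicious label line."""
    problems = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        for line_no, line in enumerate(path.read_text().splitlines(), start=1):
            if not line.strip():
                continue  # ignore blank lines
            parts = line.split()
            if len(parts) != 5:
                problems.append((path.name, line_no, "expected 5 fields"))
                continue
            try:
                cls, x, y, w, h = (float(p) for p in parts)
            except ValueError:
                problems.append((path.name, line_no, "non-numeric field"))
                continue
            if not (cls.is_integer() and 0 <= cls < num_classes):
                problems.append((path.name, line_no, "class id out of range"))
            # NaN fails every comparison, so it is reported here as well.
            if not all(0.0 <= v <= 1.0 for v in (x, y, w, h)):
                problems.append((path.name, line_no, "coordinate outside [0, 1]"))
            elif w <= 0.0 or h <= 0.0:
                problems.append((path.name, line_no, "zero-size box"))
            elif (x - w / 2 < -eps or x + w / 2 > 1 + eps
                  or y - h / 2 < -eps or y + h / 2 > 1 + eps):
                problems.append((path.name, line_no, "box extends past image edge"))
    return problems
```

Calling `find_bad_labels(label_dir, num_classes)` on the training label directory (its location is not shown in this thread) should return an empty list if every line is well formed. A clean result rules out the labels as the source of out-of-range values but says nothing about values computed during training.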

I have another question: is early stopping possible in YOLOv6, like the patience parameter in YOLOv5, for example?
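To make the early-stopping question concrete, the patience idea can be sketched independently of any framework. This is a generic illustration, not a YOLOv6 feature: the class `EarlyStopper` and its parameters are hypothetical, and it would have to be wired into the training loop by hand, fed with whatever validation metric is available (for example mAP).

```python
class EarlyStopper:
    """Signal a stop when the metric has not improved for `patience` epochs."""

    def __init__(self, patience=30, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1

    def should_stop(self, epoch, metric):
        # A NaN metric never compares greater, so it counts as "no improvement".
        if metric > self.best + self.min_delta:
            self.best = metric
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience


# Usage: higher metric is better; stop after 3 epochs without improvement.
stopper = EarlyStopper(patience=3)
for epoch, val_metric in enumerate([0.10, 0.20, 0.30, 0.30, 0.29, 0.30]):
    if stopper.should_stop(epoch, val_metric):
        break
```

The best value and its epoch stay available on the object afterwards, which helps when deciding which checkpoint to keep.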

Thank you so much for your help!
When I launch training:

When I resume training:
img record infomation path is:../dataset/images/.train_cache.json
Train: Final numbers of valid images: 10000/ labels: 10000.
0.6s for dataset initialization.
img record infomation path is:../dataset/images/.validation_cache.json
Convert to COCO format
100%|██████████| 1036/1036 [00:00<00:00, 5118.47it/s]
Convert to COCO format finished. Resutls saved in ../dataset/annotations/instances_validation.json
Val: Final numbers of valid images: 1036/ labels: 1036.
0.5s for dataset initialization.
Training start...

 Epoch        lr  iou_loss  dfl_loss  cls_loss

105/349 0.006246 0.1285 0.2691 0.3886: 24%|██▍ | 96/400 [00:37<01:45, 2.88it/s../aten/src/ATen/native/cuda/Loss.cu:95: operator(): block: [12844,0,0], thread: [32,0,0] Assertion target_val >= zero && target_val <= one failed.
105/349 0.006246 0.1285 0.2691 0.3886: 24%|██▍ | 96/400 [00:37<01:58, 2.57it/s
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
File "/partage////app/YOLOv6/yolov6/core/engine.py", line 121, in train
self.train_one_epoch(self.epoch)
File "/partage/*****///app/YOLOv6/yolov6/core/engine.py", line 135, in train_one_epoch
self.train_in_steps(epoch_num, self.step)
File "/partage////app/YOLOv6/yolov6/core/engine.py", line 169, in train_in_steps
total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num,
File "/partage////app/YOLOv6/yolov6/models/losses/loss.py", line 163, in __call__
loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "///*****//app/YOLOv6/yolov6/models/losses/loss.py", line 209, in forward
loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
File "/home/*/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3127, in binary_cross_entropy
return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/partage///***/app/YOLOv6/tools/train.py", line 143, in <module>

Additional

No response

Dingerscat · 2024-05-09T04:16:52Z

The learning rate is too high and the model has diverged. Try lowering it.
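One possible reading of why a learning-rate problem could surface as this particular assertion: the failing check requires every target passed to `F.binary_cross_entropy` to lie in [0, 1], and a NaN fails any such comparison. If training diverges and a NaN reaches `gt_score`, the assertion would fire even with clean labels. This is an unconfirmed hypothesis for this issue, and the helper names below are hypothetical. The sketch uses plain Python floats to show the comparison behaviour.

```python
import math


def passes_bce_target_check(value):
    """Mirror the asserted condition: target_val >= zero && target_val <= one."""
    return value >= 0.0 and value <= 1.0


def summarize_targets(values):
    """Count values that are fine, out of range, or non-finite (NaN/inf)."""
    counts = {"ok": 0, "out_of_range": 0, "non_finite": 0}
    for v in values:
        if not math.isfinite(v):
            counts["non_finite"] += 1
        elif passes_bce_target_check(v):
            counts["ok"] += 1
        else:
            counts["out_of_range"] += 1
    return counts


# NaN is neither >= 0 nor <= 1, so it fails the check just like 1.2 does.
print(passes_bce_target_check(float("nan")))
print(summarize_targets([0.0, 0.7, 1.0, 1.2, float("nan")]))
```

On real tensors, a comparable diagnostic just before the loss call would be a check such as `torch.isfinite(gt_score).all()`, which separates "model diverged" from "target genuinely out of range".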

Yahya-Younes · 2024-05-14T08:06:00Z

First, thank you for your answer! However, I changed the learning rate in the configs/yolov6l.py file as follows:
solver=dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.001,
    lrf=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=5.0,
    warmup_momentum=0.8,
    warmup_bias_lr=0.05
)
and I still encounter the same error.
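To illustrate what `lr0` and `lrf` would mean under a cosine schedule, here is a minimal sketch. It assumes the common YOLO-style formulation in which the rate starts at `lr0` and decays to `lr0 * lrf`; the exact formula and the function name `cosine_lr` are assumptions, warmup is ignored, and the repository's own scheduler code is the authority.

```python
import math


def cosine_lr(epoch, total_epochs, lr0, lrf):
    """Learning rate at `epoch`, decaying from lr0 to lr0 * lrf along a cosine."""
    factor = ((1 - math.cos(epoch * math.pi / total_epochs)) / 2) * (lrf - 1) + 1
    return lr0 * factor


# With the values from the config above and a 350-epoch run (assumed):
for epoch in (0, 105, 175, 350):
    print(epoch, cosine_lr(epoch, total_epochs=350, lr0=0.001, lrf=0.01))
```

Under these assumptions, the rate would begin at 0.001 and end at 0.00001. Comparing the `lr` column printed during training against such a curve is one way to confirm whether an edited config is actually being picked up.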

Yahya-Younes added the question Further information is requested label Apr 22, 2024

Yahya-Younes changed the title ~~Training failed using 1GPU, Assertion target_val >= zero && target_val <= one failed,~~ Training failed using yolov6l on 1GPU, Assertion target_val >= zero && target_val <= one failed, Data is verified but still training fails Apr 22, 2024
