Training failed using yolov6l on 1GPU, Assertion target_val >= zero && target_val <= one failed, Data is verified but still training fails #1038

Open
4 tasks done
Yahya-Younes opened this issue Apr 22, 2024 · 2 comments
Labels
question Further information is requested

Comments

@Yahya-Younes

Before Asking

  • I have read the README carefully.

  • I want to train my custom dataset, and I have read the tutorials for training custom data carefully and organized my dataset correctly. (FYI: We recommend using the xx_finetune.py config files.)

  • I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the YOLOv6 issues and found no similar questions.

Question

I am training yolov6l, cloned from this repo, on my custom dataset on a single GPU for 350 or 500 epochs with different batch sizes. Each time, training fails with this error. When I resume, it continues training for a while and then stops again, each time after fewer epochs, until it reaches epoch 105, where it cannot continue at all.

I verified my data and labels, and they are normalized. The GPU I am using works fine with YOLOv5 and YOLOv8, so I don't know what the problem is here.
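One way to double-check the normalization claim is to scan every label file. The following is a minimal sketch, assuming the usual YOLO text format (one `class x_center y_center width height` row per object, with coordinates in [0, 1]). The function name `find_bad_label_rows` and the `num_classes` argument are hypothetical and not part of YOLOv6.

```python
import math
from pathlib import Path


def find_bad_label_rows(label_dir, num_classes):
    """Return (file name, line number, reason) for every suspicious label row."""
    problems = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if not line.strip():
                continue  # ignore blank lines
            parts = line.split()
            if len(parts) != 5:
                problems.append((path.name, lineno, "expected 5 fields"))
                continue
            try:
                cls = float(parts[0])
                box = [float(v) for v in parts[1:]]
            except ValueError:
                problems.append((path.name, lineno, "non-numeric field"))
                continue
            if not cls.is_integer() or not 0 <= cls < num_classes:
                problems.append((path.name, lineno, "class id out of range"))
            # NaN and inf are rejected here too, since they are not finite.
            elif not all(math.isfinite(v) and 0.0 <= v <= 1.0 for v in box):
                problems.append((path.name, lineno, "coordinate outside [0, 1]"))
            elif box[2] <= 0.0 or box[3] <= 0.0:
                problems.append((path.name, lineno, "zero-sized box"))
    return problems
```

An empty result would support the statement that the labels are clean, which would suggest the out-of-range values are produced during training rather than read from disk.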

I have another question: is early stopping possible in YOLOv6, for example with something like the patience parameter in YOLOv5?
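Whether YOLOv6 exposes a patience option is not answered in this thread. As a generic illustration of what patience-based early stopping does, here is a minimal sketch. The `EarlyStopper` class and its arguments are hypothetical and are not a YOLOv6 or YOLOv5 API.

```python
class EarlyStopper:
    """Signal a stop when the metric has not improved for `patience` epochs."""

    def __init__(self, patience=30, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest change that counts as improvement
        self.best = float("-inf")
        self.best_epoch = -1

    def step(self, epoch, metric):
        """Record this epoch's metric (higher is better); return True to stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience
```

A training loop would call `step(epoch, validation_metric)` after each evaluation and break out of the loop when it returns True.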

Thank you so much for your help!
When I launch training:
image (1)

When I resume training:
img record infomation path is:../dataset/images/.train_cache.json
Train: Final numbers of valid images: 10000/ labels: 10000.
0.6s for dataset initialization.
img record infomation path is:../dataset/images/.validation_cache.json
Convert to COCO format
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1036/1036 [00:00<00:00, 5118.47it/s]
Convert to COCO format finished. Resutls saved in ../dataset/annotations/instances_validation.json
Val: Final numbers of valid images: 1036/ labels: 1036.
0.5s for dataset initialization.
Training start...

 Epoch        lr  iou_loss  dfl_loss  cls_loss

105/349 0.006246 0.1285 0.2691 0.3886: 24%|██▍ | 96/400 [00:37<01:45, 2.88it/s
../aten/src/ATen/native/cuda/Loss.cu:95: operator(): block: [12844,0,0], thread: [32,0,0] Assertion target_val >= zero && target_val <= one failed.
105/349 0.006246 0.1285 0.2691 0.3886: 24%|██▍ | 96/400 [00:37<01:58, 2.57it/s
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/partage////app/YOLOv6/yolov6/core/engine.py", line 121, in train
    self.train_one_epoch(self.epoch)
  File "/partage/*****///app/YOLOv6/yolov6/core/engine.py", line 135, in train_one_epoch
    self.train_in_steps(epoch_num, self.step)
  File "/partage////app/YOLOv6/yolov6/core/engine.py", line 169, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num,
  File "/partage////app/YOLOv6/yolov6/models/losses/loss.py", line 163, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "///*****//app/YOLOv6/yolov6/models/losses/loss.py", line 209, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "/home/*/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3127, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/partage///***/app/YOLOv6/tools/train.py", line 143, in <module>

Additional

No response

Yahya-Younes added the question (Further information is requested) label on Apr 22, 2024.
Yahya-Younes changed the title on Apr 22, 2024, from "Training failed using 1GPU, Assertion target_val >= zero && target_val <= one failed," to "Training failed using yolov6l on 1GPU, Assertion target_val >= zero && target_val <= one failed, Data is verified but still training fails".
@Dingerscat

The learning rate is too high and the model has diverged. Lower it.
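A diverged model would be consistent with the assertion in the log: once the values passed to the loss contain NaN, a range check such as `target_val >= zero && target_val <= one` fails, because NaN compares false against everything. A minimal sketch of that check in plain Python, runnable without a GPU, follows. `first_invalid_target` is a hypothetical helper; in the training loop the analogous test on a tensor could use `torch.isfinite` plus a range comparison just before the `F.binary_cross_entropy` call.

```python
import math


def first_invalid_target(values):
    """Return (index, value, reason) for the first value outside [0, 1], else None."""
    for i, v in enumerate(values):
        if math.isnan(v):
            return i, v, "nan"
        if not 0.0 <= v <= 1.0:
            return i, v, "out of range"
    return None
```

Knowing whether the first offending value is NaN or merely slightly outside the range helps separate divergence from a data or assignment problem.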

@Yahya-Younes
Author

Yahya-Younes commented May 14, 2024

First, thank you for your answer! However, I changed the lr in the configs/yolov6l.py file as follows:

    solver=dict(
        optim='SGD',
        lr_scheduler='Cosine',
        lr0=0.001,
        lrf=0.01,
        momentum=0.937,
        weight_decay=0.0005,
        warmup_epochs=5.0,
        warmup_momentum=0.8,
        warmup_bias_lr=0.05
    )

and I still encounter the same error.
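To see what these `lr0` and `lrf` values imply across the run, the schedule can be tabulated offline. This is a minimal sketch, assuming a plain cosine decay from `lr0` down to `lr0 * lrf` and ignoring the warmup settings. The exact formula YOLOv6 applies is not shown in this thread, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(epoch, total_epochs, lr0, lrf):
    """Learning rate at `epoch` for a cosine decay from lr0 to lr0 * lrf (assumed form)."""
    factor = ((1 - math.cos(epoch * math.pi / total_epochs)) / 2) * (lrf - 1) + 1
    return lr0 * factor


if __name__ == "__main__":
    # Print the assumed schedule at a few epochs of a 350-epoch run.
    for epoch in (0, 105, 175, 349):
        print(epoch, cosine_lr(epoch, 350, lr0=0.001, lrf=0.01))
```

Under this assumption the rate never exceeds `lr0`, so if the error persists with `lr0=0.001`, the learning rate alone may not explain it.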
