Training on new dataset #37

Open
ljtruong opened this issue Jul 15, 2018 · 4 comments
Comments

@ljtruong

I'm attempting to train on a new dataset, but I'm having trouble understanding where I should change the number of classes. So far I've changed it where I feed it into the network, in box_coder, and in the multibox loss.
I'm having an error here when I feed in my network.

cls_targets = 1 + labels[index.clamp(min=0)]

I've removed the 1 + and was able to continue training, but I'm sure this isn't the correct fix.

I have 37 classes, including background at index 0. What number of classes should I pass to the network?
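To make the question concrete, here is a plain-Python sketch of what the `1 +` line appears to do. It assumes that `labels` holds object class ids starting at 0 without a background entry, that `index` holds the matched object for each anchor, and that unmatched anchors carry a negative index and become background. That last part is an assumption about the surrounding code, which is not shown in this thread. The function name `encode_cls_targets` is made up for illustration.

```python
def encode_cls_targets(labels, index):
    """Mimic `1 + labels[index.clamp(min=0)]` with plain lists.

    labels: class id of each ground-truth object, starting at 0 (no background).
    index:  matched object index for each anchor, negative when unmatched
            (assumed convention, not confirmed in this thread).
    """
    targets = []
    for i in index:
        if i < 0:
            targets.append(0)              # unmatched anchor -> background
        else:
            targets.append(1 + labels[i])  # shift object ids up by one
    return targets


labels = [0, 36]          # two objects: first and last of 37 object classes
index = [-1, 0, 1, 1]     # four anchors, the first one unmatched
targets = encode_cls_targets(labels, index)
print(targets)            # [0, 1, 37, 37]
print(max(targets) + 1)   # 38 outputs needed to cover every target
```

Under these assumptions the shift reserves 0 for background, so a label file that already puts background at 0 would be shifted a second time.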

@ahkarami

Dear @Worulz,
Don't change the original code. Just note that when you use SSDLoss, you must set:

num_classes = Number of Classes in your data set + 1 (For background)
# Example, in your case:
num_classes = 38  # because 37 + 1 = 38

And when you use Focal Loss, you must set:

num_classes = Number of Classes in your data set
# Example, in your case:
num_classes = 37  # because you have really 37 object classes

Note that these changes must be applied in both

criterion = SSDLoss(num_classes=21)
and
# net = SSD512(num_classes=21)

Good Luck
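
The rule above can be sketched as a small helper. `num_classes_for` and `NUM_OBJECT_CLASSES` are hypothetical names, not part of torchcv, and the sketch assumes the object count excludes background. Computing the value once and passing it to both the model and the loss keeps the two from drifting apart.

```python
def num_classes_for(loss_name, num_object_classes):
    """Hypothetical helper: value to pass as num_classes.

    "ssd"   -> object classes + 1 (extra slot for background)
    "focal" -> object classes only
    """
    if loss_name == "ssd":
        return num_object_classes + 1
    if loss_name == "focal":
        return num_object_classes
    raise ValueError("unknown loss: %r" % (loss_name,))


NUM_OBJECT_CLASSES = 37  # object classes only, background not counted (assumption)

num_classes = num_classes_for("ssd", NUM_OBJECT_CLASSES)
print(num_classes)  # 38

# The same value would then go to both constructors, for example:
# net = SSD512(num_classes=num_classes)
# criterion = SSDLoss(num_classes=num_classes)
```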

@ljtruong
Author

@ahkarami

Thank you for your guidance. I have made the changes: I set num_classes to match my classes, plus 1 for background.

Here:

net = FPNSSD512(num_classes=21)

and here:

criterion = SSDLoss(num_classes=21)

I still experience the same error.

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [748,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [19,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [599,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    loss, loc_loss, cls_loss = criterion(bbox_preds, boxes, cls_preds, labels)
  File "/home/ubuntu/py3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/projects/DeepLearningSSD/lib/model/detector/ssd_loss.py", line 59, in forward
    cls_loss[cls_targets<0] = 0  # set ignored loss to 0
RuntimeError: copy_if failed to synchronize: device-side assert triggered

It happens in the SSD loss when cls_targets contains the last class. It's very weird. It suggests there is a class mismatch. Is there a dependency anywhere else?
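
The assertion `cur_target >= 0 && cur_target < n_classes` can be checked by hand before the loss is called. The sketch below uses plain lists and a hypothetical function `find_bad_targets`; it assumes that a single negative value (here `-1`) marks ignored anchors, which the `cls_targets<0` line in the traceback hints at but does not confirm. With real tensors, comparing the minimum and maximum target against `num_classes` would serve the same purpose, and running the failing batch on the CPU usually gives a readable Python error instead of a device-side assert.

```python
def find_bad_targets(cls_targets, num_classes, ignore_index=-1):
    """Return the sorted distinct targets outside [0, num_classes).

    ignore_index is skipped; treating -1 as "ignored" is an assumption.
    """
    bad = {
        t for t in cls_targets
        if t != ignore_index and not (0 <= t < num_classes)
    }
    return sorted(bad)


targets = [0, 1, 37, -1, 20]

# Loss still built with the default 21 classes: target 37 is out of range.
print(find_bad_targets(targets, num_classes=21))  # [37]

# Loss built with 38 classes: every target fits.
print(find_bad_targets(targets, num_classes=38))  # []
```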

@ahkarami

Dear @Worulz,
Please pay attention: you are following torchcv/examples/ssd (i.e., the SSD CNN model example for detection), but you have created net = FPNSSD512(num_classes=21) (i.e., the FPN model).
If you are using the torchcv/examples/ssd code, create an SSD model; if you want to use the FPN model, use its corresponding code in torchcv/examples/fpnssd.
Also note that, I think, the SSD model code is based on PyTorch 0.3 and the FPN model code on PyTorch 0.4. You can use both versions of PyTorch on one system, as I describe in
https://github.com/ahkarami/Ubuntu-for-Deep-Learning#install-2-different-versions-of-a-package-eg-pytorch-on-a-single-system

@ljtruong
Author

@ahkarami thanks for the help. I'll give it another try. I assume I may have made an error when writing my own example scripts.
