Training on new dataset #37

ljtruong · 2018-07-15T04:33:39Z

I'm attempting to train on a new dataset, but I'm having trouble understanding where I should change the number of classes. I've changed it where I construct the network, in box_coder, and in the multibox loss.
I get an error here when I run the network:

torchcv/torchcv/models/ssd/box_coder.py

Line 88 in 6291f3e

cls_targets = 1 + labels[index.clamp(min=0)]

I've removed the 1 + and was able to continue training, but I'm sure this isn't the correct fix.

I have 37 classes, including background at index 0. What number of classes should I pass to the network?
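For readers following along, here is a minimal plain-Python sketch of what the quoted line appears to do to the labels. It is not the torchcv code: `encode_cls_targets` is a hypothetical name, and the step that resets unmatched anchors to 0 (background) is an assumption, since only line 88 is quoted in this thread.

```python
def encode_cls_targets(labels, index):
    """Plain-Python analogue of `1 + labels[index.clamp(min=0)]`.

    labels: class ids of the ground-truth boxes in one image.
    index:  for each anchor, the matched ground-truth box, or -1 if unmatched.
    """
    # clamp(min=0) maps -1 to 0 so the lookup never fails; "1 +" shifts every id up by one.
    targets = [1 + labels[max(i, 0)] for i in index]
    # Assumption: unmatched anchors are then reset to 0, i.e. background.
    return [0 if i < 0 else t for i, t in zip(index, targets)]


labels = [0, 36]         # two boxes, dataset labels running from 0 to 36
index = [-1, 0, 1, -1]   # four anchors, two of them unmatched
targets = encode_cls_targets(labels, index)
print(targets, "-> largest target:", max(targets))
```

Under these assumptions, labels 0 to 36 become targets 1 to 37, so the classifier would need 38 outputs for the largest target to be valid.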

ahkarami · 2018-07-15T08:15:20Z

Dear @Worulz,
Don't change the original code. Just note that when you use SSDLoss you must set:

num_classes = Number of Classes in your data set + 1 (For background)
# Example, in your case:
num_classes = 38  # because 37 + 1 = 38

and when you use Focal Loss you must set:

num_classes = Number of Classes in your data set
# Example, in your case:
num_classes = 37  # because you have really 37 object classes
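The rule above can be written as a small helper, so the same value is passed to both the model and the loss. This is only a sketch: `num_classes_for` and the loss names "ssd" and "focal" are hypothetical and not part of torchcv.

```python
def num_classes_for(loss_name, num_object_classes):
    """Return the num_classes value to pass to the network and the criterion.

    loss_name is a hypothetical tag: "ssd" for SSDLoss, "focal" for Focal Loss.
    """
    if loss_name == "ssd":
        # SSDLoss: one extra output for the background class.
        return num_object_classes + 1
    if loss_name == "focal":
        # Focal Loss: object classes only.
        return num_object_classes
    raise ValueError("unknown loss: %r" % loss_name)


n = num_classes_for("ssd", 37)
print(n)  # use this same value for both the model and the criterion
```

Computing the value once and reusing it avoids the case where the model and the loss are built with different class counts.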

Note that these changes must be applied in

torchcv/examples/ssd/train.py

Line 91 in 6291f3e

criterion = SSDLoss(num_classes=21)

&

torchcv/examples/ssd/train.py

Line 36 in 6291f3e

# net = SSD512(num_classes=21)

.
Good Luck

ljtruong · 2018-07-15T11:33:48Z

@ahkarami

Thank you for your guidance. I have made the changes: I've set num_classes to my number of classes plus 1 for background.

Here:

torchcv/examples/ssd/train.py

Line 37 in 6291f3e

net = FPNSSD512(num_classes=21)

and here:

torchcv/examples/ssd/train.py

Line 91 in 6291f3e

criterion = SSDLoss(num_classes=21)

I still experience the same error.

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [748,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [19,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [599,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    loss, loc_loss, cls_loss = criterion(bbox_preds, boxes, cls_preds, labels)
  File "/home/ubuntu/py3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/projects/DeepLearningSSD/lib/model/detector/ssd_loss.py", line 59, in forward
    cls_loss[cls_targets<0] = 0  # set ignored loss to 0
RuntimeError: copy_if failed to synchronize: device-side assert triggered

It happens in the SSD loss when cls_targets contains the last class. That is strange; it suggests a class-count mismatch. Is there a dependency on the number of classes anywhere else?
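One way to narrow this down is to check the targets against the condition in the assertion, `cur_target >= 0 && cur_target < n_classes`, before they reach the loss. The sketch below uses plain Python lists and a hypothetical helper name, `out_of_range_targets`; with tensors the same comparison would be applied to the values of cls_targets.

```python
def out_of_range_targets(cls_targets, n_classes):
    """Return the distinct target values that violate 0 <= target < n_classes."""
    return sorted({t for t in cls_targets if not (0 <= t < n_classes)})


# Example values only: targets that include class id 37.
cls_targets = [0, 1, 37, 0]

bad_with_37 = out_of_range_targets(cls_targets, 37)
bad_with_38 = out_of_range_targets(cls_targets, 38)
print(bad_with_37)  # target values that are invalid when n_classes is 37
print(bad_with_38)  # target values that are invalid when n_classes is 38
```

Because device-side asserts are reported asynchronously, the Python line shown in the traceback may not be the line that actually failed. Running once on the CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a more precise location.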

ahkarami · 2018-07-19T06:32:02Z

Dear @Worulz,
Please pay attention: you are using torchcv/examples/ssd (i.e., the SSD model example for detection), but you construct net = FPNSSD512(num_classes=21) (i.e., the FPN model)!
If you use the torchcv/examples/ssd code, then build an SSD model; if you want to use the FPN model, use its corresponding code in torchcv/examples/fpnssd.
Also note that I think the SSD model code is based on PyTorch 0.3 and the FPN model code on PyTorch 0.4. You can use both versions of PyTorch, as I have described in
https://github.com/ahkarami/Ubuntu-for-Deep-Learning#install-2-different-versions-of-a-package-eg-pytorch-on-a-single-system

ljtruong · 2018-07-19T10:56:29Z

@ahkarami thanks for the help. I'll give it another try. I assume I may have made an error when writing my own example scripts.
