
questions/issues on training segfix with own data #30

Closed
marcok opened this issue Aug 20, 2020 · 18 comments

@marcok

marcok commented Aug 20, 2020

I was excited to try segfix training on my own data.

I could produce the mat files for train and val data.
Training works with run_h_48_d_4_segfix.sh and the loss converges, but on validation the IoU is more or less random (I have 2 classes):

2020-08-20 10:47:41,932 INFO [base.py, 32] Result for mask
2020-08-20 10:47:41,932 INFO [base.py, 48] Mean IOU: 0.7853758111568029
2020-08-20 10:47:41,933 INFO [base.py, 49] Pixel ACC: 0.9692584678389714
2020-08-20 10:47:41,933 INFO [base.py, 54] F1 Score: 0.7523384841507573 Precision: 0.7928424176432377 Recall: 0.7157718538603068
2020-08-20 10:47:41,933 INFO [base.py, 32] Result for dir (mask)
2020-08-20 10:47:41,933 INFO [base.py, 48] Mean IOU: 0.5390945167184129
2020-08-20 10:47:41,933 INFO [base.py, 49] Pixel ACC: 0.7248566725097775
2020-08-20 10:47:41,933 INFO [base.py, 32] Result for dir (GT)
2020-08-20 10:47:41,934 INFO [base.py, 48] Mean IOU: 0.41990305666871003
2020-08-20 10:47:41,934 INFO [base.py, 49] Pixel ACC: 0.6007717101395131

To investigate the issue further, I tried to analyse the predicted mat files with
bash scripts/cityscapes/segfix/run_h_48_d_4_segfix.sh segfix_pred_val 1

with "input_size": [640, 480] this exception happens:
File "/home/rsa-key-20190908/openseg.pytorch/lib/datasets/tools/collate.py", line 108, in collate
assert pad_height >= 0 and pad_width >= 0
After fixing it more or less, I got results similar to the validation during training.
The predicted mat files were around 3 KB instead of ~70 KB.
Btw, it took the "input_size": [640, 480] config from the "test": { section instead of the "val": { section.

Is it possible that validation only works with "input_size": [2048, 1024]?
Can you give me any hints on how to manually verify the correctness of the .mat files? Currently I'm diving into 2007.04269.pdf and the code of dt_offset_generator.py to get an understanding.
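
So far I just dump the contents of a predicted .mat file with a small script like this (it only assumes the file was saved with scipy.io.savemat; the key names are simply whatever loadmat reports, nothing repo-specific):

import sys
import numpy as np
import scipy.io as sio

# Load one predicted .mat file and report what it contains.
mat = sio.loadmat(sys.argv[1])
for key, value in mat.items():
    if key.startswith('__'):  # skip loadmat's metadata entries
        continue
    arr = np.asarray(value)
    print(key, arr.shape, arr.dtype,
          'min/max:', arr.min(), arr.max(),
          'first unique values:', np.unique(arr)[:10])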

@PkuRainBow
Contributor

Could you share more details of your training setup on your own dataset, e.g., batch size and training crop size?
Besides, what is the crop-size setting during the evaluation stage?

@hsfzxjy Please help to check the mentioned issues.

@marcok
Author

marcok commented Aug 20, 2020

I don't use cropping (the image is the same size as the crop). Since the object is always in the same position this improves results, and with such a large amount of training data it isn't needed as an augmentation.
I use the standard config, except for input_size and the number of classes.
I think I will transform it to the standard Cityscapes format and start to debug from there. In the meantime I have figured out what the depth maps are and will analyse the (wrongly) predicted maps.

{
  "dataset": "cityscapes",
  "method": "fcn_segmentor",
  "data": {
    "image_tool": "cv2",
    "input_mode": "BGR",
    "num_classes": 2,
    "label_list": [0, 1],
    "workers": 1, #for debug
    "pred_dt_offset": true
  },
  "train": {
    "batch_size": 16,
    "data_transformer": {
      "size_mode": "fix_size",
      "input_size": [640, 480],
      "align_method": "only_pad",
      "pad_mode": "random"
    }
  },
  "val": {
    "batch_size": 4,
    "mode": "ss_test",
    "data_transformer": {
      "size_mode": "fix_size",
      "input_size": [640, 480],
      "align_method": "only_pad"
    }
  },
  "test": {
    "batch_size": 16,
    "mode": "ss_test",
    "data_transformer": {
      "size_mode": "fix_size",
      "input_size": [640, 480],
      "align_method": "only_pad"
    }
  },
  "train_trans": {
    "trans_seq": ["random_hflip", "random_brightness"],
    "random_brightness": {
      "ratio": 1.0,
      "shift_value": 10
    },
    "random_hflip": {
      "ratio": 0.5,
      "swap_pair": []
    }
  },
  "val_trans": {
    "trans_seq": ["random_hflip"],
    "random_hflip": {
      "ratio": 0.5,
      "swap_pair": []
    }
  },

@hsfzxjy
Contributor

hsfzxjy commented Aug 20, 2020

@marcok Hi. I don't quite understand what your problem is. What do you mean by "But on the validation the IoU is more or less random"? I cannot see that phenomenon in the log you posted.

@marcok
Author

marcok commented Aug 20, 2020

@hsfzxjy Simply said, /run_h_48_d_4_segfix.sh segfix_pred_val doesn't generate reasonable mat files.
But I haven't given up on finding out why by myself.
Are you sure the w/h of the crop in tester_offset.py isn't assigned the wrong way around? I think it should be like this:

class Tester(object):
    def __init__(self, configer):
        self.crop_size = configer.get('train',
                                      'data_transformer')['input_size']
        # the proposed change: swap width and height
        self.crop_size[0], self.crop_size[1] = self.crop_size[1], self.crop_size[0]

@hsfzxjy
Contributor

hsfzxjy commented Aug 21, 2020

@marcok thanks for your feedback. Does your problem still exist? We could help you figure it out.

marcok closed this as completed Aug 21, 2020
@marcok
Author

marcok commented Aug 21, 2020

Hi @hsfzxjy

Thanks for fixing it!
Validation during training uses different code; I suspect something there is partly shaky too.
Could you explain the difference between these 3 metrics?
Result for mask
Result for dir (mask)
Result for dir (GT)

@hsfzxjy
Contributor

hsfzxjy commented Sep 5, 2020

@marcok

  • "mask" is the accuracy of binary boundary prediction;
  • "dir (mask)" is the accuracy of direction prediction (within the predicted boundary);
  • "dir (GT)" is the accuracy of direction prediction (within the ground truth boundary).

All of them print mIoU too, but only the accuracy / F-score is useful.
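
Roughly, in numpy terms (just an illustration, not our actual evaluator code; it assumes boolean HxW boundary masks and integer HxW direction labels):

import numpy as np

def segfix_metrics_sketch(pred_boundary, gt_boundary, pred_dir, gt_dir):
    # "mask": accuracy of the binary boundary prediction over all pixels
    mask_acc = (pred_boundary == gt_boundary).mean()
    # "dir (mask)": direction accuracy restricted to the *predicted* boundary pixels
    dir_mask_acc = (pred_dir[pred_boundary] == gt_dir[pred_boundary]).mean()
    # "dir (GT)": direction accuracy restricted to the *ground-truth* boundary pixels
    dir_gt_acc = (pred_dir[gt_boundary] == gt_dir[gt_boundary]).mean()
    return mask_acc, dir_mask_acc, dir_gt_acc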

@i-amgeek

i-amgeek commented Sep 28, 2020

Hey @hsfzxjy

I was able to train SegFix on binary segmentation data. Both the mask loss and the direction loss converged well.

But there wasn't any improvement in the Pixel ACC of either dir (mask) or dir (GT). They just kept jumping around 0.4 ± 0.04, even though the direction loss converged from 2.09 to 0.8. I trained for about 50000 iterations.

Is it okay for this to happen? And could you share one of your training logs?

@hsfzxjy
Contributor

hsfzxjy commented Sep 28, 2020

Hi @i-amgeek

I think it's a little strange. The ACC should increase along with training. The attachment is SegFix's training log for only 20k iters. The log is in an old format, where "Result for 1" is dir (mask) and "Result for 3" is dir (GT). You may check it for comparison.

Btw, did you generate the ground-truth offsets correctly? It may also be due to incorrect supervision.
mask_offset_hrnext_hrnext20_mask_dt_offset_loss_crop_bs16_lr4x_1.log

@i-amgeek

Thanks for sharing the log. I will look into it.

I used the script lib/datasets/preprocess/cityscapes/dt_offset_generator.py to generate the offsets, after changing the label_list variable to [0, 255].
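
For a sanity check of my generated offsets, this is roughly the idea behind a distance-transform based generator (a simplified sketch of my own, not the actual dt_offset_generator.py, which differs in details such as how the directions are defined and quantized):

import numpy as np
from scipy.ndimage import distance_transform_edt

def sketch_dt_offsets(label, num_bins=8):
    # Mark boundary pixels: any pixel whose 4-neighbourhood contains a different class.
    pad = np.pad(label, 1, mode='edge')
    boundary = ((pad[1:-1, 1:-1] != pad[:-2, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[2:, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[1:-1, :-2]) |
                (pad[1:-1, 1:-1] != pad[1:-1, 2:]))
    # Distance to the nearest boundary pixel, plus its coordinates, for every pixel.
    dist, (near_y, near_x) = distance_transform_edt(~boundary, return_indices=True)
    ys, xs = np.indices(label.shape)
    # Vector from the nearest boundary pixel to this pixel, i.e. pointing away from the boundary.
    dy, dx = ys - near_y, xs - near_x
    angle = np.arctan2(dy, dx)  # in (-pi, pi]
    direction = ((angle + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    return dist, direction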

@YAwei666

@i-amgeek would you mind telling me how to change H_SEGFIX.json for the loss part?
[screenshot]

@hsfzxjy
Contributor

hsfzxjy commented Dec 10, 2020

@YAwei666 It's already specified in the Bash script, so there's no need to edit the config file.

See https://github.com/openseg-group/openseg.pytorch/blob/master/scripts/cityscapes/segfix/run_hx_20_d_2_segfix.sh#L41

@cnnAndBn

cnnAndBn commented May 28, 2021

Hi @hsfzxjy @PkuRainBow @LayneH,

  1. As for the log file you shared, i.e. mask_offset_hrnext_hrnext20_mask_dt_offset_loss_crop_bs16_lr4x_1.log, I notice the result after 20000 iterations is not very satisfactory.
    [screenshot: the left is the result of the first test, the right is the result of the last test in the log file]
    The final mean IoU for the boundary mask is 0.387 and the pixel ACC is 0.603; for the direction they are 0.327 and 0.564 respectively.
    Can this result benefit the segmentation result? And is the log file complete?

  2. I trained SegFix on my own dataset, details below.
    The loss at the start of training is:
    Train Epoch: 0 Train Iteration: 20 Time 97.117s / 20iters, (4.856) Forward Time 46.520s / 20iters, (2.326) Backward Time 36.260s / 20iters, (1.813) Loss Time 4.666s / 20iters, (0.233) Data load 9.671s / 20iters, (0.483533)
    Learning rate = [0.03999087988446927, 0.03999087988446927] Loss = 2.08182430 (ave = 2.22278991)

The result of the first test on the validation set is:
[screenshot]

After 54000 training iterations, the loss is:
2021-05-19 18:50:26,765 INFO [trainer.py, 212] Train Epoch: 47 Train Iteration: 54000 Time 54.218s / 20iters, (2.711) Forward Time 24.295s / 20iters, (1.215) Backward Time 28.376s / 20iters, (1.419) Loss Time 1.312s / 20iters, (0.066) Data load 0.234s / 20iters, (0.011699)
Learning rate = [0.012720987671203884, 0.012720987671203884] Loss = 1.13631892 (ave = 1.16415203)

The test result at 54000 iterations is:
[screenshot]

From the above, the loss decreased, but the test result on the validation set does not seem to improve. Can you help me?

  3. The last question:
    in https://github.com/openseg-group/openseg.pytorch/blob/master/MODEL_ZOO.md#how-to-reproduce-the-hrnet--ocr-with-mapillary-pretraining ,
    [screenshot]
    after getting the refined labels, how do I get metrics like mIoU on the test dataset?

@hsfzxjy
Contributor

hsfzxjy commented May 28, 2021

@dadada101 Perhaps you can open a new issue so that we can discuss this better.

  1. In fact, the direction IoU in the log is meaningless. We print it out just because we re-used the evaluator designed for segmentation results. The one you should really be concerned with is the direction accuracy; simply ignore the mIoU. Practically, an accuracy above 0.60 can bring a boost in the segmentation result.
  2. Your screenshot shows that the accuracy is above 0.76, which is already quite good. The accuracy of dir (mask) gets boosted, so the training process should be normal.
  3. Be careful that scripts/cityscapes/segfix.py is designed for Cityscapes. To apply SegFix to your own dataset, you should change the label_list inside, specifically Lines 28-29, to match the definition of your dataset. After you get the refined labels, you can use any semantic segmentation evaluator to compute their mIoU, just as you did with the original labels (see the sketch after this list).
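
A minimal confusion-matrix mIoU sketch over the refined label images (generic numpy/cv2 code, not an evaluator from this repo; the directory layout and single-channel label format are assumptions):

import os
import numpy as np
import cv2

def compute_miou(pred_dir, gt_dir, num_classes, ignore_label=255):
    # Accumulate a confusion matrix over all (prediction, ground-truth) label pairs.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for name in sorted(os.listdir(gt_dir)):
        gt = cv2.imread(os.path.join(gt_dir, name), cv2.IMREAD_GRAYSCALE).astype(np.int64)
        pred = cv2.imread(os.path.join(pred_dir, name), cv2.IMREAD_GRAYSCALE).astype(np.int64)
        valid = gt != ignore_label
        conf += np.bincount(num_classes * gt[valid] + pred[valid],
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    # IoU per class = TP / (TP + FP + FN); mIoU is the mean over classes.
    ious = np.diag(conf) / (conf.sum(0) + conf.sum(1) - np.diag(conf) + 1e-10)
    return ious.mean()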

@cnnAndBn

Oh, thanks for your instant reply @hsfzxjy.
you said "in fact, the direction IoU in the log is meaningless. We print it out just because we re-used the evaluator designed for segmentation result. The one you should really concern is the direction accuracy, and simply ignore the mIoU. "
Why is the direction IoU in the log meaningless? Is it because some of the directions (in my training I used 8 directions) may be absent from the ground truth?

@hsfzxjy
Contributor

hsfzxjy commented May 28, 2021

@dadada101 IoU measures the quality of area prediction, whereas directions have no such concept of area; especially around complicated edges, some direction classes can be very fragmented. Of course IoU is positively correlated with accuracy and should be as high as possible, but currently our models are not accurate enough, so the IoU value is heavily degraded and is not a useful reference for (roughly) judging the quality of direction prediction. In contrast, the pixel accuracy is more sensitive to small improvements in direction quality, although I have to admit it is still not an ideal metric. After all, the metrics mentioned above are only references for picking a better SegFix model; the quality of a SegFix model should be judged by how much improvement it brings to a segmentation baseline.

BTW, if you want to tune SegFix models on a different dataset, we have observed that the LR and crop size matter most. For the LR you may try 0.01 to 0.04. For the crop size you may try a smaller one than segmentation models use; e.g. on Cityscapes, segmentation models may use 769x769 or 512x1024, while SegFix adopts 512x512.

@cnnAndBn

@hsfzxjy, thanks a lot for your prompt and detailed reply!
I get your meaning. mIoU is a metric that is insensitive to the class imbalance problem in the ground-truth labels, right? That's why at the beginning I thought mIoU was a better metric than accuracy.
In addition, thanks for the training trick advice. May I ask why adopting a smaller crop than the segmentation model is beneficial?

