
questions/issues on training segfix with own data #30

Closed
marcok opened this issue Aug 20, 2020 · 18 comments

@marcok

marcok commented Aug 20, 2020

I was excited to try segfix training on my own data.

I could produce the mat files for train and val data.
Training works with run_h_48_d_4_segfix.sh and the loss converges, but on validation the IoU is more or less random (I have 2 classes):

2020-08-20 10:47:41,932 INFO [base.py, 32] Result for mask
2020-08-20 10:47:41,932 INFO [base.py, 48] Mean IOU: 0.7853758111568029
2020-08-20 10:47:41,933 INFO [base.py, 49] Pixel ACC: 0.9692584678389714
2020-08-20 10:47:41,933 INFO [base.py, 54] F1 Score: 0.7523384841507573 Precision: 0.7928424176432377 Recall: 0.7157718538603068
2020-08-20 10:47:41,933 INFO [base.py, 32] Result for dir (mask)
2020-08-20 10:47:41,933 INFO [base.py, 48] Mean IOU: 0.5390945167184129
2020-08-20 10:47:41,933 INFO [base.py, 49] Pixel ACC: 0.7248566725097775
2020-08-20 10:47:41,933 INFO [base.py, 32] Result for dir (GT)
2020-08-20 10:47:41,934 INFO [base.py, 48] Mean IOU: 0.41990305666871003
2020-08-20 10:47:41,934 INFO [base.py, 49] Pixel ACC: 0.6007717101395131

To investigate the issue further, I tried to analyse the predicted mat files with
bash scripts/cityscapes/segfix/run_h_48_d_4_segfix.sh segfix_pred_val 1

with "input_size": [640, 480] this exception happens:
File "/home/rsa-key-20190908/openseg.pytorch/lib/datasets/tools/collate.py", line 108, in collate
assert pad_height >= 0 and pad_width >= 0
After fixing it more or less, I got results similar to the validation during training.
The predicted mat files were around 3 KB instead of ~70 KB.
Btw, it took the "input_size": [640, 480] config from the "test": { section instead of the "val": { section.

Is it possible that validation only works with "input_size": [2048, 1024]?
Can you give me any hints on how to manually verify the correctness of the .mat files? Currently I'm diving into 2007.04269.pdf and the code of dt_offset_generator.py to get an understanding.
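
So far I just dump the contents of a predicted .mat file with a small script like this (it only assumes the file was saved with scipy.io.savemat; the key names are simply whatever loadmat reports, nothing repo-specific):

import sys
import numpy as np
import scipy.io as sio

# Load one predicted .mat file and report what it contains.
mat = sio.loadmat(sys.argv[1])
for key, value in mat.items():
    if key.startswith('__'):  # skip loadmat's metadata entries
        continue
    arr = np.asarray(value)
    print(key, arr.shape, arr.dtype,
          'min/max:', arr.min(), arr.max(),
          'first unique values:', np.unique(arr)[:10])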

@PkuRainBow
Contributor

Could you share more details of your training setup on your own dataset, e.g., batch size and training crop size?
Besides, what is the crop-size setting during the evaluation stage?

@hsfzxjy Please help to check the mentioned issues.

@marcok
Author

marcok commented Aug 20, 2020

I don't use cropping (the image is the same size as the crop). Since the object is always in the same position this improves results, and with such a large amount of training data it isn't needed as an augmentation.
I use the standard config, except for input_size and the number of classes.
I think I will transform it to the standard Cityscapes format and start to debug from there. In the meantime I have figured out what the depth maps are and will analyse the (wrongly) predicted maps.

{
  "dataset": "cityscapes",
  "method": "fcn_segmentor",
  "data": {
    "image_tool": "cv2",
    "input_mode": "BGR",
    "num_classes": 2,
    "label_list": [0, 1],
    "workers": 1, #for debug
    "pred_dt_offset": true
  },
  "train": {
    "batch_size": 16,
    "data_transformer": {
      "size_mode": "fix_size",
      "input_size": [640, 480],
      "align_method": "only_pad",
      "pad_mode": "random"
    }
  },
  "val": {
    "batch_size": 4,
    "mode": "ss_test",
    "data_transformer": {
      "size_mode": "fix_size",
      "input_size": [640, 480],
      "align_method": "only_pad"
    }
  },
  "test": {
    "batch_size": 16,
    "mode": "ss_test",
    "data_transformer": {
      "size_mode": "fix_size",
      "input_size": [640, 480],
      "align_method": "only_pad"
    }
  },
  "train_trans": {
    "trans_seq": ["random_hflip", "random_brightness"],
    "random_brightness": {
      "ratio": 1.0,
      "shift_value": 10
    },
    "random_hflip": {
      "ratio": 0.5,
      "swap_pair": []
    }
  },
  "val_trans": {
    "trans_seq": ["random_hflip"],
    "random_hflip": {
      "ratio": 0.5,
      "swap_pair": []
    }
  },

@hsfzxjy
Contributor

hsfzxjy commented Aug 20, 2020

@marcok Hi. I don't quite understand what your problem is. What do you mean by "But on the validation the IoU is more or less random"? I cannot see that phenomenon in the log you posted.

@marcok
Author

marcok commented Aug 20, 2020

@hsfzxjy Simply said, /run_h_48_d_4_segfix.sh segfix_pred_val doesn't generate reasonable mat files.
But I haven't given up on finding out why by myself.
Are you sure the w/h of the crop in tester_offset.py isn't assigned the wrong way around? I think it should be like this:

class Tester(object):
    def __init__(self, configer):
        self.crop_size = configer.get('train',
                                      'data_transformer')['input_size']
        # the proposed change: swap width and height
        self.crop_size[0], self.crop_size[1] = self.crop_size[1], self.crop_size[0]

@hsfzxjy
Contributor

hsfzxjy commented Aug 21, 2020

@marcok thanks for your feedback. Does your problem still exist? We could help you figure it out.

marcok closed this as completed Aug 21, 2020
@marcok
Author

marcok commented Aug 21, 2020

Hi @hsfzxjy

Thanks for fixing it!
Validation during training uses different code; I suspect something there is partly shaky too.
Could you explain the difference between these 3 metrics?
Result for mask
Result for dir (mask)
Result for dir (GT)

@hsfzxjy
Contributor

hsfzxjy commented Sep 5, 2020

@marcok

  • "mask" is the accuracy of binary boundary prediction;
  • "dir (mask)" is the accuracy of direction prediction (within the predicted boundary);
  • "dir (GT)" is the accuracy of direction prediction (within the ground truth boundary).

All of them print mIoU too, but only the accuracy / F-score is useful.
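
Roughly, in numpy terms (just an illustration, not our actual evaluator code; it assumes boolean HxW boundary masks and integer HxW direction labels):

import numpy as np

def segfix_metrics_sketch(pred_boundary, gt_boundary, pred_dir, gt_dir):
    # "mask": accuracy of the binary boundary prediction over all pixels
    mask_acc = (pred_boundary == gt_boundary).mean()
    # "dir (mask)": direction accuracy restricted to the *predicted* boundary pixels
    dir_mask_acc = (pred_dir[pred_boundary] == gt_dir[pred_boundary]).mean()
    # "dir (GT)": direction accuracy restricted to the *ground-truth* boundary pixels
    dir_gt_acc = (pred_dir[gt_boundary] == gt_dir[gt_boundary]).mean()
    return mask_acc, dir_mask_acc, dir_gt_acc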

@i-amgeek

i-amgeek commented Sep 28, 2020

Hey @hsfzxjy

I was able to train SegFix on binary segmentation data. Both the mask loss and the direction loss converged well.

But there wasn't any improvement in the Pixel ACC of either dir (mask) or dir (GT). They just kept jumping around 0.4 ± 0.04, even though the direction loss converged from 2.09 to 0.8. I trained for about 50000 iterations.

Is it okay for this to happen? And could you share one of your training logs?

@hsfzxjy
Contributor

hsfzxjy commented Sep 28, 2020

Hi @i-amgeek

I think it's a little strange. The ACC should increase along with training. The attachment is SegFix's training log for only 20k iters. The log is in an old format, where "Result for 1" is dir (mask) and "Result for 3" is dir (GT). You may check it for comparison.

Btw, did you generate the ground-truth offsets correctly? It may also be due to incorrect supervision.
mask_offset_hrnext_hrnext20_mask_dt_offset_loss_crop_bs16_lr4x_1.log

@i-amgeek

Thanks for sharing the log. I will look into it.

I used the script lib/datasets/preprocess/cityscapes/dt_offset_generator.py to generate the offsets, after changing the label_list variable to [0, 255].
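
For a sanity check of my generated offsets, this is roughly the idea behind a distance-transform based generator (a simplified sketch of my own, not the actual dt_offset_generator.py, which differs in details such as how the directions are defined and quantized):

import numpy as np
from scipy.ndimage import distance_transform_edt

def sketch_dt_offsets(label, num_bins=8):
    # Mark boundary pixels: any pixel whose 4-neighbourhood contains a different class.
    pad = np.pad(label, 1, mode='edge')
    boundary = ((pad[1:-1, 1:-1] != pad[:-2, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[2:, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[1:-1, :-2]) |
                (pad[1:-1, 1:-1] != pad[1:-1, 2:]))
    # Distance to the nearest boundary pixel, plus its coordinates, for every pixel.
    dist, (near_y, near_x) = distance_transform_edt(~boundary, return_indices=True)
    ys, xs = np.indices(label.shape)
    # Vector from the nearest boundary pixel to this pixel, i.e. pointing away from the boundary.
    dy, dx = ys - near_y, xs - near_x
    angle = np.arctan2(dy, dx)  # in (-pi, pi]
    direction = ((angle + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    return dist, direction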

@YAwei666

@i-amgeek would you mind telling me how to change H_SEGFIX.json for the loss part?
[screenshot]

@hsfzxjy
Contributor

hsfzxjy commented Dec 10, 2020

@YAwei666 It's already specified in the Bash script, so there's no need to edit the config file.

See https://github.com/openseg-group/openseg.pytorch/blob/master/scripts/cityscapes/segfix/run_hx_20_d_2_segfix.sh#L41

@cnnAndBn

cnnAndBn commented May 28, 2021

Hi @hsfzxjy @PkuRainBow @LayneH,

  1. As for the log file you shared, i.e. mask_offset_hrnext_hrnext20_mask_dt_offset_loss_crop_bs16_lr4x_1.log, I notice the result after 20000 iterations is not very satisfactory.
    [screenshot: the left is the result of the first test, the right is the result of the last test in the log file]
    The final mean IoU for the boundary mask is 0.387 and the pixel ACC is 0.603; for the direction they are 0.327 and 0.564 respectively.
    Can this result benefit the segmentation result? And is the log file complete?

  2. I trained SegFix on my own dataset, details below.
    The loss at the start of training is:
    Train Epoch: 0 Train Iteration: 20 Time 97.117s / 20iters, (4.856) Forward Time 46.520s / 20iters, (2.326) Backward Time 36.260s / 20iters, (1.813) Loss Time 4.666s / 20iters, (0.233) Data load 9.671s / 20iters, (0.483533)
    Learning rate = [0.03999087988446927, 0.03999087988446927] Loss = 2.08182430 (ave = 2.22278991)

The result of the first test on the validation set is:
[screenshot]

After 54000 training iterations, the loss is:
2021-05-19 18:50:26,765 INFO [trainer.py, 212] Train Epoch: 47 Train Iteration: 54000 Time 54.218s / 20iters, (2.711) Forward Time 24.295s / 20iters, (1.215) Backward Time 28.376s / 20iters, (1.419) Loss Time 1.312s / 20iters, (0.066) Data load 0.234s / 20iters, (0.011699)
Learning rate = [0.012720987671203884, 0.012720987671203884] Loss = 1.13631892 (ave = 1.16415203)

The test result at 54000 iterations is:
[screenshot]

From the above, the loss decreased, but the test result on the validation set does not seem to improve. Can you help me?

  3. The last question:
    in https://github.com/openseg-group/openseg.pytorch/blob/master/MODEL_ZOO.md#how-to-reproduce-the-hrnet--ocr-with-mapillary-pretraining ,
    [screenshot]
    after getting the refined labels, how do I get metrics like mIoU on the test dataset?

@hsfzxjy
Contributor

hsfzxjy commented May 28, 2021

@dadada101 Perhaps you can open a new issue so that we can discuss this better.

  1. In fact, the direction IoU in the log is meaningless. We print it out just because we re-used the evaluator designed for segmentation results. The one you should really be concerned with is the direction accuracy; simply ignore the mIoU. Practically, an accuracy above 0.60 can bring a boost in the segmentation result.
  2. Your screenshot shows that the accuracy is above 0.76, which is already quite good. The accuracy of dir (mask) gets boosted, so the training process should be normal.
  3. Be careful that scripts/cityscapes/segfix.py is designed for Cityscapes. To apply SegFix to your own dataset, you should change the label_list inside, specifically Lines 28-29, to match the definition of your dataset. After you get the refined labels, you can use any semantic segmentation evaluator to compute their mIoU, just as you did with the original labels (see the sketch after this list).
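
A minimal confusion-matrix mIoU sketch over the refined label images (generic numpy/cv2 code, not an evaluator from this repo; the directory layout and single-channel label format are assumptions):

import os
import numpy as np
import cv2

def compute_miou(pred_dir, gt_dir, num_classes, ignore_label=255):
    # Accumulate a confusion matrix over all (prediction, ground-truth) label pairs.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for name in sorted(os.listdir(gt_dir)):
        gt = cv2.imread(os.path.join(gt_dir, name), cv2.IMREAD_GRAYSCALE).astype(np.int64)
        pred = cv2.imread(os.path.join(pred_dir, name), cv2.IMREAD_GRAYSCALE).astype(np.int64)
        valid = gt != ignore_label
        conf += np.bincount(num_classes * gt[valid] + pred[valid],
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    # IoU per class = TP / (TP + FP + FN); mIoU is the mean over classes.
    ious = np.diag(conf) / (conf.sum(0) + conf.sum(1) - np.diag(conf) + 1e-10)
    return ious.mean()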

@cnnAndBn

Oh, thanks for your instant reply @hsfzxjy.
you said "in fact, the direction IoU in the log is meaningless. We print it out just because we re-used the evaluator designed for segmentation result. The one you should really concern is the direction accuracy, and simply ignore the mIoU. "
Why is the direction IoU in the log meaningless? Is it because some of the directions (in my training I used 8 directions) may be absent from the ground truth?

@hsfzxjy
Contributor

hsfzxjy commented May 28, 2021

@dadada101 IoU measures the quality of area prediction, whereas directions have no such concept of area; especially around complicated edges, some direction classes can be very fragmented. Of course IoU is positively correlated with accuracy and should be as high as possible, but currently our models are not accurate enough, so the IoU value is heavily degraded and is not a useful reference for (roughly) judging the quality of direction prediction. In contrast, the pixel accuracy is more sensitive to small improvements in direction quality, although I have to admit it is still not an ideal metric. After all, the metrics mentioned above are only references for picking a better SegFix model; the quality of a SegFix model should be judged by how much improvement it brings to a segmentation baseline.

BTW, if you want to tune SegFix models on a different dataset, we have observed that the LR and crop size matter most. For the LR you may try 0.01 to 0.04. For the crop size you may try a smaller one than segmentation models use; e.g. on Cityscapes, segmentation models may use 769x769 or 512x1024, while SegFix adopts 512x512.

@cnnAndBn

@hsfzxjy, thanks a lot for your prompt and detailed reply!
I get your meaning. mIoU is a metric that is insensitive to the class imbalance problem in the ground-truth labels, right? That's why at the beginning I thought mIoU was a better metric than accuracy.
In addition, thanks for the training trick advice. May I ask why adopting a smaller crop than the segmentation model is beneficial?

