2 questions about calibrations #38

Closed
sdimantsd opened this issue Jan 24, 2021 · 33 comments

Comments

@sdimantsd

sdimantsd commented Jan 24, 2021

  1. Do I need an annotations file for calibration?
  2. I am using TensorRT FP16 optimization (--use_fp16_tensorrt), but the network doesn't find even one object (without TensorRT it works perfectly).
    Is it possible that FP16 doesn't do the calibration?
    It looks like it from the code (https://github.com/haotian-liu/yolact_edge/blob/662d760f8b2d8b4409d385aaf172e155aaa3a3d8/utils/tensorrt.py#L38)

Thanks

@haotian-liu
Collaborator

  1. No, you don't; just put the calibration images in a directory and specify it with the --calib_images option (see the sketch after this list).
  2. FP16 does not have the option to calibrate. This is weird, as in our experiments (in the paper), converting to FP16 affects the AP very little. So, AP-wise, what is the difference between the FP32 AP and the FP16 AP? And if you evaluate with our pretrained models, does FP16 conversion give you good results?
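
As a minimal sketch (the image paths here are hypothetical, the model and flags are the ones used elsewhere in this thread, and whether INT8 calibration is triggered by default or needs an extra flag depends on the eval.py options in your checkout):

Command: mkdir -p ./calib_images && cp /path/to/your/images/*.jpg ./calib_images/
Command: python3 eval.py --trained_model=./weights/yolact_edge_54_800000.pth --calib_images=./calib_images --score_threshold=0.3 --top_k=100 --image=cars.png:seg_out_int8.jpg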

@sdimantsd
Author

Thanks for your response :-)

On the COCO dataset with your weights it looks the same (with TensorRT FP16 and without TensorRT), but on my dataset the results are much worse...
Here are 2 images for example:

Image 1 (without TensorRT):
Command: python3 eval.py --yolact_transfer --disable_tensorrt --trained_model=./weights/yolact_resnet101_im400_12_340000.pth --images=/home/ws/images/imgs_in/:/home/ws/imgs_out --top_k=10 --score_threshold=0.3

Image 2 (with TensorRT FP16):
Command: python3 eval.py --yolact_transfer --use_fp16_tensorrt --trained_model=./weights/yolact_resnet101_im400_12_340000.pth --images=/home/ws/images/imgs_in/:/home/ws/imgs_out --top_k=10 --score_threshold=0.3

As you can see, with TensorRT FP16 the network didn't recognize even one car!

Can I train the network with FP16 optimization?
How can I do it?

Thanks! :-)

@haotian-liu
Collaborator

Hi, I tried the model on your image with FP32/FP16/INT8, and the results are reasonable. I uploaded them here.

The command and the model I am using (FP16 as an example):
python eval.py --trained_model=./weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --use_fp16_tensorrt --image=cars.png:seg_out_fp16.jpg

Can you try with the model I am using and see whether it is due to the model file or to TensorRT? Also, please pull the newest commit from our repo, because the newest code will automatically convert YOLACT weights.

@sdimantsd
Author

I will try it now.
What about training with TensorRT FP16/INT8?

@haotian-liu
Collaborator

We do not need to train with FP16. Simply train a full-precision model, and you can convert the trained model to FP16/INT8 using TensorRT.
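
For example (a sketch only: my_model.pth and the image paths are placeholders, and the flags are taken from the commands earlier in this thread), a single evaluation run with --use_fp16_tensorrt performs the conversion, and the generated TensorRT engines are cached next to the weights as *.trt files:

Command: python3 eval.py --yolact_transfer --use_fp16_tensorrt --trained_model=./weights/my_model.pth --images=/path/to/imgs_in/:/path/to/imgs_out --top_k=10 --score_threshold=0.3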

@sdimantsd
Author

(Attached image: yolact_edge_fp16)
This is the result with TensorRT FP16; the results without TensorRT look the same.

If it matters, I am using a Jetson Nano (and this is the reason I need FP16 and not INT8: Jetson Nano does not support INT8).

@haotian-liu
Collaborator

Can you try removing all TensorRT caches with rm /path/to/weights/*.trt and re-running the model with your previous command? It might be that you once ran the evaluation without --yolact_transfer and the converted TensorRT model (with incorrect weights) was stored in the cache.

@sdimantsd
Author

This is what I did:
ws@PC_2:~/DL/yolact_edge$ rm weights/*.trt
ws@PC_2:~/DL/yolact_edge$ ls weights/
yolact_edge_54_800000.pth yolact_nets_shuk_resnet101_im400_12_340000.pth
yolact_edge_vid_847_50000.pth yolact_resnet101_im350_low_height_58_460000.pth
ws@PC_2:~/DL/yolact_edge$ python3 eval.py --use_fp16_tensorrt --trained_model=./weights/yolact_nets_shuk_resnet101_im400_12_340000.pth --images=/home/ws/images/imgs_in/:/home/ws/imgs_out --top_k=10 --score_threshold=0.3
(matplotlib 3.3 rcParam deprecation warnings omitted)
Config not specified. Parsed yolact_nets_shuk_resnet101_im400_config from the file name.

[01/25 11:22:15 yolact.eval]: Loading model...
[01/25 11:22:34 yolact.eval]: Model loaded.
[01/25 11:22:34 yolact.eval]: Converting to TensorRT...
[01/25 11:22:34 yolact.eval]: Converting backbone to TensorRT...
[01/25 11:23:54 yolact.eval]: Converting protonet to TensorRT...
[01/25 11:24:15 yolact.eval]: Converting FPN to TensorRT...
Warning: Encountered known unsupported method torch.zeros
[01/25 11:24:38 yolact.eval]: Converting PredictionModule to TensorRT...
[01/25 11:24:59 yolact.eval]: Converted to TensorRT.

/home/ws/images/imgs_in/cars.jpg -> /home/ws/imgs_out/cars.png

but the results are the same...


@sdimantsd
Author

The strikethrough over the first lines is because of the '~' characters in the Linux prompt.

@sdimantsd
Author

BTW, this is the config:

yolact_nets_shuk_resnet101_im400_config = yolact_edge_config.copy({
    'name': 'yolact_nets_shuk_resnet101_im400',

    # Dataset stuff
    'dataset': nets_shuk_dataset,
    'num_classes': len(nets_shuk_dataset.class_names) + 1,

    'masks_to_train': 100,
    'max_num_detections': 50,
    'max_size': 400,

    'backbone': yolact_base_config.backbone.copy({
        'pred_scales': [[int(x[0] / yolact_base_config.max_size * 400)] for x in
                        yolact_base_config.backbone.pred_scales],
    }),
})
NETS_CAR_TRUCK_BUS = ('car', 'bus', 'truck')
NETS_CAR_TRUCK_BUS_LABEL_MAP = {1: 1, 2: 2, 3: 3}

@haotian-liu
Collaborator

So you basically trained your own model on this dataset with an image size of 400x400? Would you mind sharing the trained model with me by email so that I can test it? We haven't seen a model with such a huge performance difference between FP16 and FP32.

@sdimantsd
Author

Yes, I can share it with you.
What is your email?

Thanks

@haotian-liu
Collaborator

liuhaotian.cn at gmail

@sdimantsd
Author

OK, I am not at work right now (it's 21:00 our time).
I will send it tomorrow.

@sdimantsd
Author

@haotian-liu I sent it now. Thanks

@haotian-liu
Collaborator

My collaborator and I will take a look later this week, and will let you know with the updates, thanks.

@sdimantsd
Author

Thanks!

@sdimantsd
Author

Hi @haotian-liu
Anything new about it?

@haotian-liu
Collaborator

@sdimantsd Hi, we found that it is due to the TensorRT conversion of the prediction module/FPN: when we disable these two conversions and only use the backbone/protonet conversion, everything works fine. Could you try this on your model/dataset? We also found that native PyTorch FP16 conversion works fine. We have decided to contact the upstream TensorRT and torch2trt maintainers for more information and help.

@sdimantsd
Author

Thanks!
How can I disable conversion for the FPN?

@haotian-liu
Collaborator

Setting these two options to False in the config allows you to disable TensorRT for the FPN (similarly for the other modules):

{
    'torch2trt_fpn': False,
    'torch2trt_fpn_int8': False,
}
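
The prediction module can presumably be disabled the same way; the key names below follow the same naming pattern but are an assumption, so check the config file in your checkout for the exact option names:

{
    'torch2trt_prediction_module': False,
    'torch2trt_prediction_module_int8': False,
}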

@haotian-liu
Collaborator

Hi, I am closing this issue and merging the discussion related to TensorRT conversion issues after training on a custom dataset into issue #47, as it is quite hard for me to track so many open issues. I hope you understand, thanks.

@haotian-liu
Collaborator

haotian-liu commented Feb 7, 2021

I somehow figured out the cause and applied a fix; details of the solution are explained in #47. Please take a look to see if the issue is resolved.
If the issue persists, please reply directly to #47 (this will be the main thread for related issues for now) with your experiment configuration (details are also explained there). Thanks.

@sdimantsd
Author

OK.
Thanks :-)

@chingi071

@sdimantsd Hello, I would like to ask: do you run inference on a Jetson Nano? The backbone is ResNet-101 and the image size is 400, right? How much memory did you use during the conversion to TensorRT and during inference? I couldn't run inference on a Jetson Nano 2GB; the process was killed. My backbone is MobileNetV2, and I tried image sizes of 320, 160, and 80. I'm considering whether to switch to a Jetson Nano 4GB. In addition, will pred_scales affect the result? I see that you have changed them. Thank you.
I am also very grateful to @haotian-liu for open-sourcing this; it is very good work.

@haotian-liu
Collaborator

@chingi071 You can try setting cfg.torch2trt_max_calibration_images to a lower value (e.g. 5); if it still OOMs, you might set it to use TensorRT FP16 with --use_fp16_tensorrt.
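
A minimal sketch of one place to set it (assuming a custom config that copies yolact_edge_config, in the style of the config posted earlier in this thread; the config name and the value 5 are only illustrative):

my_custom_config = yolact_edge_config.copy({
    # ... your existing dataset/backbone options ...
    # Use fewer images for INT8 calibration to reduce peak memory.
    'torch2trt_max_calibration_images': 5,
})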

@chingi071

@haotian-liu Hello, I tried setting cfg.torch2trt_max_calibration_images to a smaller value (1, 5) and used --use_fp16_tensorrt, but it was still killed... My current resolution is 320; do I need to set it smaller?

@haotian-liu
Collaborator

What if you only use the PyTorch version? We haven't tested our method on the Jetson Nano, so I cannot provide much advice.

@chingi071

I am using the PyTorch version. Then I will try a Jetson Nano 4GB to see if it can run inference. Thank you.

@haotian-liu
Collaborator

@chingi071 Not sure if I was being clear: I mean use --disable_tensorrt for pure PyTorch inference.

@chingi071

Oh, I misunderstood. I haven't tried --disable_tensorrt. I will try it, thank you very much!

@sdimantsd
Author

Hi @chingi071
I am using a Jetson Nano 4GB, not 2GB.
With 4GB it works with an input size of 500x500 and FP16, at ~540 ms per frame (around 1.85 FPS).
Hope that helps you.

@chingi071

@sdimantsd Thank you very much! This is very useful information for me.
