
Retraining custom object detector #24

Closed · ghost opened this issue Jan 3, 2020 · 15 comments

@ghost commented Jan 3, 2020

Hey, can you share the particular commit or version of the Object Detection API with which you trained the hand-detection model? I have a custom dataset that I retrained with TensorFlow 1.15 and the latest Object Detection API (November 2019 commit), but I was unable to build the TensorRT engine. I encountered the following error:

[TensorRT] ERROR: UffParser: Validator error: FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_4_3x3_s2_256/BatchNorm/FusedBatchNormV3: Unsupported operation _FusedBatchNormV3
[TensorRT] ERROR: Network must have at least one output

TensorFlow version for training and freezing: 1.15 (GPU: Tesla K80)
Model used: http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz

  • MobileNet SSD v2 COCO

JetPack on Jetson Nano (used for UFF conversion and building the TRT engine, as per your trt_ssd tutorial)
TensorRT version: 5.x
TensorFlow: 1.14

So I would like to retrain on my custom dataset with the same versions of TensorFlow and the Object Detection API that you used for this tutorial. Could you please share some details regarding this, or suggest a workaround for the error?
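(For reference: the _FusedBatchNormV3 op comes from freezing the graph with TensorFlow 1.15, and the TensorRT 5.x UFF parser predates it. One commonly reported workaround is to rewrite those nodes to the V1 op before conversion. A minimal sketch, assuming the graphsurgeon package that ships with TensorRT; the file names are placeholders:)

    import graphsurgeon as gs

    # Rename unsupported FusedBatchNormV3 nodes to the V1 op that the UFF
    # parser understands. For inference graphs this is usually safe, since
    # both variants take the same five inputs and downstream layers only
    # consume output 0.
    dynamic_graph = gs.DynamicGraph('frozen_inference_graph.pb')
    for node in dynamic_graph.find_nodes_by_op('FusedBatchNormV3'):
        node.op = 'FusedBatchNorm'
    dynamic_graph.write('frozen_inference_graph_fixed.pb')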

@jkjung-avt (Owner) commented

For training, I used tensorflow-1.12.0. (I think both 1.11.x and 1.12.x would work.) And I used the exact version ('6518c1c') of the Object Detection API that's on my GitHub repo.


I was able to optimize these trained SSD egohands models with TensorRT and run inference on Jetson Nano/TX2. Details can be found at: TensorRT UFF SSD.

@ghost (Author) commented Jan 6, 2020

Hey, thank you so much! I was able to successfully retrain both MobileNet SSD v1 and v2 on my custom dataset and convert them to .uff and .bin, and I got FPS rates of around 25 on average.

My setup:

Training: Tesla K80 GPU with tensorflow-gpu 1.12, TensorFlow Object Detection commit '6518c1c'

Inference and conversion: Jetson Nano
TensorFlow 1.14, TensorRT and UFF API version 5.x
FPS: ~25

Thanks again, jkjung, for your quick responses and suggestions. Cheers.

@ghost closed this as completed Jan 6, 2020
@jkjung-avt (Owner) commented

Thanks for letting me know about this update as well.

@SahilChachere commented

@siddharthrameshiisc, what was your CUDA version with tensorflow==1.12.0?

@ghost (Author) commented Jan 6, 2020

@SahilChachere

This is my CUDA setup (nvidia-smi output below):

NVIDIA-SMI 418.87.00   Driver Version: 418.87.00   CUDA Version: 10.1

Also, I trained it in a conda environment and did not install TensorFlow via bazel. This setup and conversion work well with tensorflow-gpu 1.11 as well. (Note that nvidia-smi reports the driver's CUDA capability; the conda environment below pulls in its own cudatoolkit 9.2, which is what TensorFlow actually uses.)

Just copy-paste the following contents into a ".yml" file and you can replicate my environment:


name: tf_1.10
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.1.0=gpu
  - absl-py=0.8.1=py36_0
  - astor=0.8.0=py36_0
  - blas=1.0=mkl
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2019.11.27=0
  - certifi=2019.11.28=py36_0
  - cudatoolkit=9.2=0
  - cudnn=7.6.5=cuda9.2_0
  - cupti=9.2.148=0
  - gast=0.3.2=py_0
  - grpcio=1.16.1=py36hf8bcb03_1
  - h5py=2.9.0=py36h7918eee_0
  - hdf5=1.10.4=hb1b8bf9_0
  - intel-openmp=2019.4=243
  - keras-applications=1.0.8=py_0
  - keras-preprocessing=1.1.0=py_1
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libprotobuf=3.11.2=hd408876_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - markdown=3.1.1=py36_0
  - mkl=2019.4=243
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.0.15=py36ha843d7b_0
  - mkl_random=1.1.0=py36hd6b4f25_0
  - ncurses=6.1=he6710b0_1
  - numpy=1.17.4=py36hc1035e2_0
  - numpy-base=1.17.4=py36hde5b4d6_0
  - openssl=1.1.1d=h7b6447c_3
  - pip=19.3.1=py36_0
  - protobuf=3.11.2=py36he6710b0_0
  - python=3.6.9=h265db76_0
  - readline=7.0=h7b6447c_5
  - scipy=1.3.2=py36h7c811a0_0
  - setuptools=42.0.2=py36_0
  - six=1.13.0=py36_0
  - sqlite=3.30.1=h7b6447c_0
  - tensorboard=1.11.0=py36hf484d3e_0
  - tensorflow=1.11.0=gpu_py36h9c9050a_0
  - tensorflow-base=1.11.0=gpu_py36had579c0_0
  - tensorflow-gpu=1.11.0=h0d30ee6_0
  - termcolor=1.1.0=py36_1
  - tk=8.6.8=hbc83047_0
  - werkzeug=0.16.0=py_0
  - wheel=0.33.6=py36_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
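(To replicate the environment: save the above as, say, tf_1.10.yml, then run "conda env create -f tf_1.10.yml" and "conda activate tf_1.10"; the environment name comes from the "name:" field, not the file name.)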
 

@SahilChachere commented

Thanks @siddharthrameshiisc

@bhavitvyamalik commented

@jkjung-avt I used tensorflow-gpu 1.12 and TensorFlow Object Detection commit '6518c1c' for training. For conversion on the Jetson Nano I used TF 1.15, but it still gives the following error:

[TensorRT] ERROR: FeatureExtractor/MobilenetV2/Conv/BatchNorm/Const: volume of dimensions is not consistent with weights size
[TensorRT] ERROR: FeatureExtractor/MobilenetV2/Conv/BatchNorm/Const_1: volume of dimensions is not consistent with weights size
[TensorRT] ERROR: UffParser: Parser error: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm: Invalid Batchnorm inputs for layer FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm
[TensorRT] ERROR: Network must have at least one output
Traceback (most recent call last):
  File "build_engine.py", line 218, in <module>
    main()
  File "build_engine.py", line 212, in main
    buf = engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'

@jkjung-avt (Owner) commented

I'm not sure what the problem is. As I shared in my JetPack-4.3 for Jetson Nano blog post, I was able to use tensorflow-1.15.0 and the UFF converter (JetPack-4.3, TensorRT 6) to convert my custom-trained "egohands" model to a TensorRT engine.

If you'd like to do the comparison, my trained ssd_mobilenet_v2_egohands model checkpoint can be found here: #21 (comment)

And the frozen_inference_graph.pb can be downloaded from my "jkjung-avt/tensorrt_demos" repository: https://github.com/jkjung-avt/tensorrt_demos/blob/master/ssd/ssd_mobilenet_v2_egohands.pb

@bhavitvyamalik commented

I used your ssd_mobilenet_v2_egohands.pb file, and even the original .pb file from their repo, for conversion, but to my dismay both of them gave the same error (a different error this time):

UFF Output written to /home/jetson-beta/tensorrt_demos/ssd/tmp_v2_egohands.uff
UFF Text Output written to /home/jetson-beta/tensorrt_demos/ssd/tmp_v2_egohands.pbtxt


[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.
python3: nmsPlugin.cpp:139: virtual void nvinfer1::plugin::DetectionOutput::configureWithFormat(const nvinfer1::Dims*, int, const nvinfer1::Dims*, int, nvinfer1::DataType, nvinfer1::PluginFormat, int): Assertion `numPriors * numLocClasses * 4 == inputDims[param.inputOrder[0]].d[0]' failed.
Aborted (core dumped)

@jkjung-avt (Owner) commented

Have you modified the 'input_order' in this line of code?

https://github.com/jkjung-avt/tensorrt_demos/blob/master/ssd/build_engine.py#L69
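(For context: the converted NMS_TRT node takes three inputs — box locations, class confidences, and prior boxes — and 'input_order' tells the plugin which parsed tensor is which. Graphs frozen with different TensorFlow versions emit these inputs in different orders, which is what trips the numPriors assertion above. A sketch of the kind of per-model spec involved, modeled loosely on build_engine.py; the exact indices are the part to adjust:)

    # Hypothetical spec entry, modeled loosely on build_engine.py.
    # 'input_order' gives the positions of [loc_data, conf_data,
    # priorbox_data] among the NMS plugin's inputs.
    MODEL_SPECS = {
        'ssd_mobilenet_v2_egohands': {
            'input_pb': 'ssd_mobilenet_v2_egohands.pb',
            'num_classes': 2,          # 1 'hand' class + 1 background
            'input_order': [0, 2, 1],  # try e.g. [1, 0, 2] if the assertion fails
        },
    }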

@bhavitvyamalik commented

Yes, now it is working perfectly. I think I was messing up the classes somewhere. Even though your model detects only a single class, in your build_engine.py you had two classes? I think I'll now just retrain my model using your repo.

@jkjung-avt (Owner) commented

Right. For tensorflow "ssd_mobilenet_v2_xxx" models, you need to add 1 to "num_classes" (for "background"). So "num_classes" is set to 2 for the egohands model.
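(To make the bookkeeping concrete, a worked example with assumed numbers: a 300x300 TF-OD-API SSD emits 1917 prior boxes, and the plugin's tensor sizes must line up with the class count that includes background — a size mismatch here is exactly what the nmsPlugin assertion earlier in this thread checks:)

    # Worked size check (assumed anchor layout for a 300x300 TF-OD-API SSD:
    # feature maps 19,10,5,3,2,1 cells wide with 3,6,6,6,6,6 boxes per cell).
    num_priors = 19*19*3 + 10*10*6 + 5*5*6 + 3*3*6 + 2*2*6 + 1*1*6  # 1917
    num_classes = 1 + 1                    # one 'hand' class + background
    loc_size = num_priors * 4              # 7668 box coordinates
    conf_size = num_priors * num_classes   # 3834 class scores
    print(num_priors, loc_size, conf_size)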

@bhavitvyamalik commented

Thanks! My model is now running perfectly on custom data.

@1208overlord commented

Hello, I trained a custom mobilenetv2_fn model with 6 classes on TensorFlow 1.15, then converted it to a .pb file. But when building the TensorRT model, I get the following error:

Using output node NMS
Converting to UFF graph
Warning: No conversion function registered for layer: NMS_TRT yet.
Converting NMS as custom op: NMS_TRT
Warning: No conversion function registered for layer: FlattenConcat_TRT yet.
Converting concat_box_conf as custom op: FlattenConcat_TRT
Warning: No conversion function registered for layer: Unpack yet.
Converting Preprocessor/unstack as custom op: Unpack
Warning: No conversion function registered for layer: GridAnchor_TRT yet.
Converting GridAnchor as custom op: GridAnchor_TRT
Warning: No conversion function registered for layer: FlattenConcat_TRT yet.
Converting concat_box_loc as custom op: FlattenConcat_TRT
DEBUG [/usr/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['NMS'] as outputs
No. nodes: 612
UFF Output written to ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph_1.uff
[TensorRT] ERROR: UffParser: Validator error: Preprocessor/unstack: Unsupported operation _Unpack
Building TensorRT engine, this may take a few minutes...
[TensorRT] ERROR: Network must have at least one output
Traceback (most recent call last):
  File "main.py", line 371, in <module>
    buf = trt_engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'

Please help me fix this. I've attached the .pb file and the Python file I used (link below). Hoping for good news soon.

https://drive.google.com/drive/folders/1DrCFP3T0mFSm1GNzRp8aude-Ona6SoMz?usp=sharing
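(For anyone hitting the same Unpack error: the unsupported op lives in the model's 'Preprocessor' namespace, and the usual UFF-SSD conversion scripts handle it by collapsing that whole namespace into a plain input placeholder before parsing. A minimal sketch, assuming the graphsurgeon package bundled with TensorRT and the usual 300x300 ssd_mobilenet_v2 input; a real script also maps the anchor, NMS, and concat nodes, as the log above shows:)

    import graphsurgeon as gs
    import tensorflow as tf

    # Collapse the 'Preprocessor' namespace into a static placeholder so
    # unsupported ops such as Unpack never reach the UFF parser.
    dynamic_graph = gs.DynamicGraph('frozen_inference_graph.pb')
    Input = gs.create_plugin_node(name='Input', op='Placeholder',
                                  dtype=tf.float32, shape=[1, 3, 300, 300])
    dynamic_graph.collapse_namespaces({'Preprocessor': Input})
    dynamic_graph.write('frozen_inference_graph_fixed.pb')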

@jkjung-avt (Owner) commented

Duplicate: #38

This issue was closed.