
Problem running post quantization fine-tuning #60

Open · trivedisarthak opened this issue Jul 17, 2023 · 8 comments

@trivedisarthak commented Jul 17, 2023

Hi,

I'm trying to convert a YOLOv7 model trained on the CrowdHuman dataset to an HEF. I followed the optimization tutorial and I'm optimizing the network with the optimization level set to 2, with all the other options set according to the alls files provided in the Hailo Model Zoo for YOLOv7. I'm using the latest Hailo Software Suite Docker image.
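
For context, my convert.py roughly follows the Dataflow Compiler Python API. This is only a simplified sketch, not my literal script (file names, the target architecture, and the exact model-script contents are placeholders):

```python
import numpy as np
from hailo_sdk_client import ClientRunner

# Target architecture assumed to be Hailo-8 here
runner = ClientRunner(hw_arch="hailo8")

# Parse the ONNX model (file name below is a placeholder)
runner.translate_onnx_model("yolov7_crowdhuman.onnx", "yolov7")

# Model script: the YOLOv7 alls from the Hailo Model Zoo, with the
# optimization level raised to 2 (only the relevant line is shown)
runner.load_model_script("model_optimization_flavor(optimization_level=2)\n")

# Calibration set: 1500 preprocessed CrowdHuman images, NHWC
calib_dataset = np.load("calib_set.npy")

# This is the call that fails during the fine-tune stage
runner.optimize(calib_dataset)
```

When I run this, the optimization fails during fine-tuning with the following error: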

[info] Translation completed on ONNX model yolov7
[2023-07-17 10:01:27,146][hailo_sdk.client][INFO] - Translation completed on ONNX model yolov7
[info] Initialized runner for yolov7
[2023-07-17 10:01:27,703][hailo_sdk.client][INFO] - Initialized runner for yolov7
[info] Loading model script to yolov7 from string
[2023-07-17 10:01:31,186][hailo_sdk.client][INFO] - Loading model script to yolov7 from string
[info] Starting Model Optimization
[2023-07-17 10:03:11,265][hailo_sdk.client][IMPORTANT] - Starting Model Optimization
2023-07-17 10:03:11.617384: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.624954: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.625081: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.625693: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-17 10:03:11.626656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.626763: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.626856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938492: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938559: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-07-17 10:03:11.938578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19157 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6
[info] Using calibration set of 1500 entries
[2023-07-17 10:03:13,403][hailo_sdk.client][INFO] - Using calibration set of 1500 entries
[info] Assigning 16bit activation to output layer yolov7/output_layer3
[2023-07-17 10:03:13,405][hailo_sdk.client][INFO] - Assigning 16bit activation to output layer yolov7/output_layer3
[info] Assigning 16bit activation to output layer yolov7/output_layer2
[2023-07-17 10:03:13,407][hailo_sdk.client][INFO] - Assigning 16bit activation to output layer yolov7/output_layer2
[info] Starting auto 4bit weights
[2023-07-17 10:03:13,408][hailo_sdk.client][INFO] - Starting auto 4bit weights
[info] Assigning 4bit weights to layer yolov7/conv91 with 4719.62k parameters
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Assigning 4bit weights to layer yolov7/conv91 with 4719.62k parameters
[info] Assigning 4bit weights to layer yolov7/conv35 with 2359.81k parameters
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Assigning 4bit weights to layer yolov7/conv35 with 2359.81k parameters
[info] Assigning 4bit weights to layer yolov7/conv46 with 2359.81k parameters
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Assigning 4bit weights to layer yolov7/conv46 with 2359.81k parameters
[info] Ratio of weights in 4bit is 0.26
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Ratio of weights in 4bit is 0.26
[info] auto4bit completion time 00:00:00.00
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - auto4bit completion time 00:00:00.00
[info] Auto 4bit weights is done
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Auto 4bit weights is done
[info] Starting Stats Collector
[2023-07-17 10:03:17,383][acceleras][INFO] - Starting Stats Collector
Calibration:   0%|                                                                                                                                                                                                                                     | 0/1500 [00:00<?, ?entries/s]2023-07-17 10:03:19.256664: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-07-17 10:03:19.970407: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8101
2023-07-17 10:03:20.385517: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-17 10:03:20.386139: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-17 10:03:20.386149: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2023-07-17 10:03:20.386542: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-17 10:03:20.386575: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2023-07-17 10:03:55.010791: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 314572800 exceeds 10% of free system memory.
2023-07-17 10:03:55.010824: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 314572800 exceeds 10% of free system memory.
Calibration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [02:37<00:00,  9.55entries/s]
[info] Stats Collector is done (completion time is 00:02:38.38)
[2023-07-17 10:05:55,764][acceleras][INFO] - Stats Collector is done (completion time is 00:02:38.38)
[info] Bias Correction skipped
[2023-07-17 10:06:08,818][acceleras][INFO] - Bias Correction skipped
[info] Adaround skipped
[2023-07-17 10:06:08,821][acceleras][INFO] - Adaround skipped
[info] Starting Fine Tune
[2023-07-17 10:06:08,822][acceleras][INFO] - Starting Fine Tune
Epoch 1/6
2023-07-17 10:07:10.005476: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:903] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape inSelectV2_2-2-TransposeNHWCToNCHW-LayoutOptimizer
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2023-07-17 10:07:16.352546: W tensorflow/core/framework/op_kernel.cc:1733] UNKNOWN: JIT compilation failed.
Error executing job with overrides: []
Traceback (most recent call last):
  File "convert.py", line 18, in main
    convert_obj.optimizer_har()
  File "convert.py", line 79, in optimizer_har
    self.runner.optimize(self.calib_dataset)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1783, in optimize
    self._optimize(calib_data, data_type=data_type, work_dir=work_dir)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1671, in _optimize
    self._sdk_backend.full_quantization(calib_data, data_type=data_type, work_dir=work_dir,
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 869, in full_quantization
    self._full_acceleras_run()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1019, in _full_acceleras_run
    optimization_flow.run()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 101, in run
    self.post_quantization_optimization()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 129, in post_quantization_optimization
    self._finetune()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 265, in _finetune
    _, results = finetune.run()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/algorithm_base.py", line 119, in run
    self._run_int()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 301, in _run_int
    self.run_qft(self._model_native, self._model, metrics=self.metrics)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 358, in run_qft
    qft_distiller.fit(self.train_dataset, verbose=1,
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'Adam/mod' defined at (most recent call last):
    File "convert.py", line 94, in <module>
      main()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
      _run_hydra(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
      _run_app(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
      run_and_report(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
      return func()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
      lambda: hydra.run(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 119, in run
      ret = run_job(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
      ret.return_value = task_function(task_cfg)
    File "convert.py", line 18, in main
      convert_obj.optimizer_har()
    File "convert.py", line 79, in optimizer_har
      self.runner.optimize(self.calib_dataset)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
      return func(self, *args, **kwargs)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1783, in optimize
      self._optimize(calib_data, data_type=data_type, work_dir=work_dir)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
      return func(self, *args, **kwargs)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1671, in _optimize
      self._sdk_backend.full_quantization(calib_data, data_type=data_type, work_dir=work_dir,
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 869, in full_quantization
      self._full_acceleras_run()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1019, in _full_acceleras_run
      optimization_flow.run()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 101, in run
      self.post_quantization_optimization()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 129, in post_quantization_optimization
      self._finetune()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 265, in _finetune
      _, results = finetune.run()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/algorithm_base.py", line 119, in run
      self._run_int()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 301, in _run_int
      self.run_qft(self._model_native, self._model, metrics=self.metrics)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 358, in run_qft
      qft_distiller.fit(self.train_dataset, verbose=1,
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/acceleras/model/distiller.py", line 109, in train_step
      self.optimizer.apply_gradients(zip(gradients_f, trainable_vars_f))
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 672, in apply_gradients
      apply_state = self._prepare(var_list)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 992, in _prepare
      self._prepare_local(var_device, var_dtype, apply_state)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/adam.py", line 130, in _prepare_local
      super(Adam, self)._prepare_local(var_device, var_dtype, apply_state)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 998, in _prepare_local
      lr_t = tf.identity(self._decayed_lr(var_dtype))
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 1056, in _decayed_lr
      lr_t = tf.cast(lr_t(local_step), var_dtype)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 58, in __call__
      step = step % self.steps_per_epoch
Node: 'Adam/mod'
JIT compilation failed.
         [[{{node Adam/mod}}]] [Op:__inference_train_function_682919]
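
For what it's worth, the repeated "Can't find libdevice directory" messages suggest that XLA cannot locate CUDA's libdevice inside the container, which is presumably why the JIT compilation of the Adam/mod node fails. A minimal workaround sketch (the CUDA path is an assumption for this container, and it has to be set before the SDK initializes TensorFlow):

```python
import os

# Assumption: CUDA is installed under /usr/local/cuda inside the container
# and contains nvvm/libdevice (install or symlink libdevice there if not).
cuda_root = "/usr/local/cuda"
assert os.path.isdir(os.path.join(cuda_root, "nvvm", "libdevice"))

# Point XLA at the CUDA root; must happen before TensorFlow is initialized.
os.environ["XLA_FLAGS"] = f"--xla_gpu_cuda_data_dir={cuda_root}"
```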
@nadaved1 commented Jul 17, 2023 via email

@trivedisarthak (Author) commented Jul 18, 2023

Yes, I have installed the device drivers, and I can load and run the same script with optimization level 0 (i.e. without fine-tuning) and it works perfectly. I can also run the example demo for YOLOv7 using the optimized model.

Additionally, the TensorFlow installed in the container can see the GPU on the machine.

[Screenshot from 2023-07-18 10-12-46: TensorFlow detecting the GPU inside the container]
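
For reference, the check in the screenshot is essentially the standard TensorFlow device query, something like:

```python
import tensorflow as tf

# An empty list here would mean TensorFlow cannot see the GPU;
# in this container it lists the RTX 3090 as GPU:0.
print(tf.config.list_physical_devices("GPU"))
```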

@nadaved1 commented Jul 18, 2023 via email

@trivedisarthak (Author)

Yes, that is already installed. TensorFlow can access the GPU from inside the Docker container.

@nadaved1

Hmm... what is the GPU model?

@trivedisarthak (Author)

Okay, I tried downgrading the Docker image from 2023.04 to 2022.10, and the fine-tuning works there. Is this an internal bug in the Dataflow Compiler library? I'm using an RTX 3090.

@nadaved1

The Docker image you're referring to is the Suite? If so, does it work with the latest 2023.07?

@trivedisarthak (Author)

Yes, I'm referring to the Software Suite Docker image. I'll try it with 2023.07 and let you know.
