Full Integer Quantization Not Working #53
I'm attempting the same thing, but mine died with a seg fault before reaching that point. I traced it back to line 51 of /site-packages/tensorflow_core/lite/python/optimize/calibrator.py. Did you encounter that issue too?
I got the same error, but it could be resolved. In my case, the cause was that the image file paths written in my dataset .txt file were wrong. When the conversion succeeds, the log looks like this (the bare numbers appear to be indices printed for each image the calibration generator found):

```
2020-05-28 08:22:36.338159: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-05-28 08:22:36.341075: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-05-28 08:22:36.413729: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:814] Optimization results for grappler item: graph_to_optimize
2020-05-28 08:22:36.416676: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816] function_optimizer: function_optimizer did nothing. time = 0.017ms.
2020-05-28 08:22:36.422199: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816] function_optimizer: function_optimizer did nothing. time = 0.003ms.
2020-05-28 08:22:49.053607: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-05-28 08:22:49.056863: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-05-28 08:22:53.839949: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:814] Optimization results for grappler item: graph_to_optimize
2020-05-28 08:22:53.842214: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816] constant_folding: Graph size after: 1356 nodes (-541), 3100 edges (-541), time = 2270.63403ms.
2020-05-28 08:22:53.846921: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:816] constant_folding: Graph size after: 1356 nodes (0), 3100 edges (0), time = 692.577ms.
0 <== HERE
2 <== HERE
29 <== HERE
50 <== HERE
54 <== HERE
57 <== HERE
60 <== HERE
67 <== HERE
79 <== HERE
82 <== HERE
88 <== HERE
91 <== HERE
95 <== HERE
I0528 08:24:40.245354 140080696526656 convert_tflite.py:83] model saved to: ./data/yolov4-full_int8.tflite
I0528 08:24:40.433028 140080696526656 convert_tflite.py:88] tflite model loaded
```

Would this help you?
No, I don't think I ran into that problem. Could you post your error? I forgot to mention that I originally had an error that LeakyReLU was not supported by full 8-bit quantization. I changed it to regular ReLU, retrained, and got past that error when quantizing. I now think my new error is because the upsampling operation of YOLO is not supported: see core.common.upsample, which uses the function tf.image.resize. Not sure what I can replace that with yet.
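For reference, the nearest-neighbor resize that core.common.upsample performs is, for integer scale factors, just a pixel repeat. Here is a NumPy illustration of the op (my own stand-in for the purposes of this thread, not the repo's code):

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # Repeat each pixel along height and width; for integer factors this
    # matches nearest-neighbor resizing (RESIZE_NEAREST_NEIGHBOR in TFLite).
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

feat = np.arange(4, dtype=np.float32).reshape(1, 2, 2, 1)  # toy NHWC tensor
print(upsample_nearest(feat).shape)  # (1, 4, 4, 1)
```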
Thank you. I am providing it with my own dataset, and I created my own .txt file similar to val2017.txt, using hard-coded paths as you suggested. I don't know if it's my images that are causing the problem. Where did you get your dataset? Just to be sure, are you doing full int8 quantization and setting the flag --quantize_mode full_int8?
Thanks, guys. And sorry for distracting from the issue you're facing - hopefully, if I can get to that point, I can help figure it out. I'm pretty sure my seg fault isn't caused by bad paths to the images. I'd already copied the coco/ directory to the same level as the convert_tflite.py script, and separately called the representative_data_gen() function to make sure it's finding the files (it is). But something is clearly going wrong in the representative-data calibration step, as the seg fault disappears if I remove the line converter.representative_dataset = representative_data_gen. And as mentioned, it's dying at line 51 of calibrator.py:

```
Current thread 0x00007f5065f37740 (most recent call first):
```

(Line numbers in convert_tflite are slightly different because I've added code.)
I think something might be going wrong in the step before calibration (in the script lite.py):

This step is common to int8, float16, and full_int8 quantization, and produces a tflite flatbuffer, which in the case of int8 and float16 is saved to a .tflite file. full_int8 has an extra step that runs the calibration before the file is saved, and that's where the above failure occurs for me.

I tried int8 and float16 too. Both fail initially unless I add a supported_ops line; without it, these two options fail with the error:

```
Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: CONCATENATION, CONV_2D, EXP, LEAKY_RELU, LOG, LOGISTIC, MAX_POOL_2D, MUL, PACK, PAD, RESHAPE, RESIZE_NEAREST_NEIGHBOR, SHAPE, SPLIT_V, STRIDED_SLICE, TANH. Here is a list of operators for which you will need custom implementations: AddV2.
```

This looks similar to the original issue of this thread. Perhaps you're using int8 instead of full_int8? If so, you could try inserting this supported_ops line to allow those ops.

Continuing: int8 and float16 do produce a .tflite file. But then when the convert_tflite script calls demo(), it fails on the first line:

```
Traceback (most recent call last):
```

Looks like something is wrong with the .tflite file, and that's why I think the failure in all cases is happening in the toco_convert_impl(...) step. What do you guys think?
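For anyone hitting the AddV2 error above, here's a hedged sketch of how the two converter configurations differ in the TF 2.x tf.lite API (the `configure` helper and the mode strings are mine for illustration, not convert_tflite.py's exact code):

```python
import tensorflow as tf

def configure(converter, mode, representative_data_gen=None):
    # Illustrative helper: mode names mirror the script's --quantize_mode values.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if mode == "full_int8":
        # Restrict to the int8 builtin kernels and calibrate on real inputs.
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
        converter.representative_dataset = representative_data_gen
    else:
        # Let ops the standard TF Lite runtime lacks (e.g. AddV2 at the time)
        # fall back to full TensorFlow kernels.
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS,
            tf.lite.OpsSet.SELECT_TF_OPS,
        ]
    return converter
```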
I'm using the MS COCO dataset, following the instructions written in the README (here). Yes, I'm doing full integer quantization:

```
python convert_tflite.py \
  --weights ./data/yolov4.weights \
  --output ./data/yolov4-full_int.tflite \
  --quantize_mode full_int8 \
  --model yolov4 \
  --dataset /full_path_to/val2017.txt
```

My env is:

```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic
$ python --version
Python 3.6.9
$ pip freeze | grep tensorflow
tensorflow==2.1.0
tensorflow-addons==0.9.1
tensorflow-estimator==2.1.0
```

Have you confirmed that
I'm using the coco dataset too, and my setup is: Ubuntu 18.04. Frustrating that we're all doing essentially the same thing but getting different errors (though it's no big surprise with TensorFlow). Could you try with int8 and/or float16 and see what happens? That takes representative_data_gen completely out of the picture.
Hi @mm7721,
Both int8 and float16 quantization worked well in my env. I forgot to say, but I'm working with Ubuntu on WSL. And this might not help you, but I'm attaching the log (int8_float16.zip).
@in-die-nibelungen @sterlingrpi
@in-die-nibelungen
@mm7721 I just pulled the latest and can confirm that it does successfully convert and save the full int8 tflite file. But now, just like you, it fails on the first line of demo, although with a different error: ValueError: Didn't find op for builtin opcode 'LEAKY_RELU' version '2'. This is running TF 2.2. I've been trying all different versions, but not since getting the latest. I might try 2.1 now.
I think I'm going to continue trying to
@mm7721,
Unfortunately, no changes have been made to the code in the repo. The HEAD is
Well, that was easier than expected. I only had to change one line and the whole thing runs now: I replaced the leaky_relu operation with regular relu in core/common.py on line 36. This is still with TF 2.2; I like using the current version of things. I'm going to test latency on the RPi, which is what my project runs on. I might need to switch to v3 or tiny to get the latency I need. Then ultimately I will need to retrain on my dataset, with the operation changed from leaky_relu to regular relu.
Tragic: yolov4 latency is 16 secs on the RPi and slows to 18 secs after the CPU heats up, even with a heat sink. YOLOv3 tiny (is there a v4 tiny?) latency is 1.1 sec all day, so that's what I'll be going with. I might also try converting the inputs and outputs to int8 to help speed, but I might start another thread on that one, as it's another topic.
Glad you made progress. I did too, by switching to a different machine. There are a few differences in that machine's setup, but mostly it's the same (Ubuntu 18.04, Anaconda, TF 2.1/2.2, etc.). No clue why, but my seg fault is gone... mysterious, but at least I have something working.

And I see exactly the same thing as you: it fails on leaky relu. As TF's documentation states, only relu and relu6 are supported for TF Lite models. So your change to relu enables it to pass - that's great. However, it breaks the model. You can see this by running the original detection on the kite image with the updated common.py:

```
python detect.py --weights data/yolov4.weights --framework tf --size 608 --image ./data/kite.jpg
```

Nothing is detected now. Not sure what the best solution is, but one thing I'm going to investigate is training the original Darknet model with relu (or, even better, relu6). Will let you know if I get that working.

Also wanted to point out that "full_int8" quantization as defined in convert_tflite.py is not actually full integer quantization. It only quantizes the weights (hence the tflite file size being about 25% of the original weights file); the input and activations are still fp32. There are some extra lines that can force additional quantization, but I haven't tried to get them working yet.
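The breakage is easy to see numerically: leaky_relu keeps a scaled copy of negative inputs, while relu zeroes them, so weights trained against leaky_relu see different activations. A quick illustration with NumPy stand-ins for tf.nn.leaky_relu / tf.nn.relu:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):  # what the pretrained weights expect
    return np.where(x > 0, x, alpha * x)

def relu(x):                   # what the swapped-in model now computes
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # negatives survive, scaled by alpha
print(relu(x))        # negatives clamped to zero
```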
@mm7721 thank you for confirming that changing leaky_relu to relu broke the model. I'm not surprised, though, and anticipated needing to retrain. Hopefully it won't be too far off and will train pretty quickly, with similar accuracy. Do you recommend using the darknet repo for training? What's your process flow like? I looked at the darknet training instructions and it looked intense. You'll have to change the leaky_relu operator there as well. Currently I am looking to train in this repo, but train.py doesn't seem to support tiny too well: I get an error where the prediction shape doesn't match the model shape. I suspect the prediction shape is different for YOLO tiny than for regular YOLO.
I had to do some really crude hacks, but I got tiny YOLO training in this repo. First I commented out all the layer freezing because that didn't work. Then I changed the strides in the cfg file; I doubled the values here. Then there ought to be only two boxes predicted per zone in tiny YOLO, so for the optimizing process stages in train.py I changed "for i in range(3):" to "for i in range(2):" (had to do this in two places). Not ideal, but at least it's working.
You are right about the inputs and outputs still being float32. I have changed those to int8 before for another project. However, you have to scale the input data from 0-255 instead of 0-1; otherwise there isn't enough dynamic range to learn from. Also, labels should be 0 or 255. Here are the commands I used:

It's my understanding that when you set the flag converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8], you get 8-bit weights AND 8-bit activations, while regular tflite is only 8-bit weights, which just makes the file smaller. Quantizing the activations has made models even faster in my experience.
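As a hedged sketch of the settings that push the model's interface itself to int8 in the TF 2.x tf.lite API (attribute behavior varies across TF versions, and this may not match the exact commands used above):

```python
import tensorflow as tf

def full_int8_io(converter, representative_data_gen):
    # Quantize weights and activations to int8...
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # ...and the model boundary too, so callers feed/read int8 directly
    # (scale pixels to 0-255 rather than 0-1, as noted above).
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter
```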
@sterlingrpi Setting up darknet is a bit of a pain - you have to get a bunch of dependencies in place, like CUDA, the full OpenCV version, etc. I still don't have everything working perfectly, but I can run training and single-image detection (video detection isn't working yet). I've already retrained with relu6 in place of leaky_relu, imported the results into the current repo, and evaluated some images/videos. Performance looks fairly similar on the small set I tested. There are some new issues with "full int8" conversion to TF Lite, which I haven't debugged yet... but I was able to use int8 mode.

Regarding TF Lite conversion, your comments make sense and align with my understanding. What worries me is that there are lots of posts out there suggesting the conversion process doesn't always work as expected... including one post suggesting it's impossible to get everything into int8 if you're doing post-training quantization like we are. Not to mention that it's really hard to debug TF Lite models. If you get a fully int8-quantized version working, could you share your code? So far I haven't explored yolo-tiny, but I'll keep your comments in mind. Thanks.
@mm7721 I agree the tflite conversion is a bit of a black box, but I think the full int8 mode is doing something: I got 16 sec latency with yolov4 on my RPi, while a regular tflite yolov3 took over 60 sec. So I believe it is replacing quite a few float32 calculations with 8-bit ones. I'm happy to share any of the code I did on this, but I think I've already shared all the steps I took to get to this part. Is there anything specific that would help? I even got training to work and loaded the trained weights into the demo. Still have to validate it for accuracy.
@sterlingrpi That's a nice speedup factor. So yeah, you're right, full_int8 is definitely doing something different. Viewing the two networks, int8 ends up with no quantize/dequantize operations, whereas full_int8 ends up with lots of them. For example, every mish activation is built as: Lots of other examples of -->dequantize-->op-->quantize in the full_int8 network too. Not very pretty. My ultimate goal is a model that is integer-quantized from input to output, without all these intermediate operations. With my current version that uses relu6, the full_int8 version converts successfully and saves to a tflite file, but it crashes on the interpreter.allocate_tensors() line in demo(). Not sure why yet (it's one of those pesky Fatal Python Error: Aborted messages).
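One quick way to see how far a conversion actually got is to tally the tensor dtypes in the resulting file. This helper is mine, built on the output of the standard tf.lite.Interpreter.get_tensor_details() call:

```python
import numpy as np

def summarize_dtypes(tensor_details):
    """Tally dtypes from tf.lite.Interpreter(model_path=...).get_tensor_details().

    A weights-only model stays mostly float32; a fully quantized one
    should be nearly all int8/int32.
    """
    counts = {}
    for detail in tensor_details:
        name = np.dtype(detail["dtype"]).name
        counts[name] = counts.get(name, 0) + 1
    return counts

# Toy input standing in for real interpreter output:
print(summarize_dtypes([{"dtype": np.int8}, {"dtype": np.int8}, {"dtype": np.float32}]))
# {'int8': 2, 'float32': 1}
```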
@sterlingrpi would you mind sharing any changes you made in
Hmm, I'm also having this problem: my yolov4 is clocking in at 55 secs per inference after doing a full int8 conversion.
@a-rich are you sure you are doing full int8 quantization by setting the flag --quantize_mode full_int8?
@sterlingrpi not sure about @a-rich, but here is a quick reference of my ~57 sec inference :(
FYI: unfortunately, that setting
is deprecated for TF 2.x, at least for now, so it is expected to see the IO stay float, with some quantize/dequantize ops in between.
Hi, |
Also, does any of you mind sharing your representative_data_gen function?
```python
def representative_data_gen():
    fimage = open(FLAGS.dataset).read().split()
    for input_value in range(100):
        if os.path.exists(fimage[input_value]):
            original_image = cv2.imread(fimage[input_value])
            original_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
            image_data = utils.image_preporcess(np.copy(original_image),
                                                [FLAGS.input_size, FLAGS.input_size])
            img_in = image_data[np.newaxis, ...].astype(np.float32)
            print(input_value)
            yield [img_in]
        else:
            continue
```
Thanks!
@sterlingrpi I'm very surprised yolov4 is this slow after quantization on ARM silicon - which RPi are you running on? Out of curiosity, are you using TF from Python to run inference, or C++? Have you benchmarked your application using the TFLite benchmarking tool, or are you using a custom application? For reference, on a dual-core 1.3 GHz ARMv8-based SoC I'm able to get 20 seconds per inference on the 416 x 416 model without quantization. I'm eager to benchmark an int8-quantized yolov4 on this setup, but alas, quantization remains broken in this repo, so I haven't been able to generate a model for testing. Using the benchmarking tool on the float model, ~50% of the inference time is spent in CONV_2D ops. I believe these would be quantized, and hence able to take advantage of ARM's accelerations. Because of this, I'd also expect a sizeable speedup on the RPi for the quantized model vs. the float one.
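For rough apples-to-apples numbers without the C++ benchmark tool, a minimal Python timing harness is enough (the interpreter wiring is assumed, not shown; `invoke_fn` would be a bound interpreter.invoke after the input tensors are set):

```python
import time

def mean_latency(invoke_fn, warmup=3, runs=10):
    # Warm up first so caches/threads settle before measuring steady state.
    for _ in range(warmup):
        invoke_fn()
    start = time.perf_counter()
    for _ in range(runs):
        invoke_fn()
    return (time.perf_counter() - start) / runs

# Example with a stand-in workload instead of a real interpreter:
print(round(mean_latency(lambda: time.sleep(0.001), warmup=1, runs=3), 4))
```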
You can wrap code in `<pre>` tags in HTML, or use a fenced code block, to make GitHub keep the indentation.
Did you manage to replace the resize function? I'm getting this error: RuntimeError: Quantization not yet supported for op: EXP
I tried to get full int8 quantization by running convert_tflite.py and setting the flag --quantize_mode full_int8. However, I got the following error:

```
RuntimeError: Quantization not yet supported for op: RESIZE_NEAREST_NEIGHBOR
```

I gave it a representative dataset, and I have all the requirements installed. Has anyone else been able to do a full int8 quantization of YOLO? Thank you!