response map fusion implementation #77

Open
meiqua opened this issue May 5, 2020 · 48 comments
Labels
enhancement New feature or request

Comments

@meiqua
Owner

meiqua commented May 5, 2020

Motivation

According to the Halide paper, fusion can greatly speed up the creation of the response map. However, configuring Halide is not easy, and our response map does not need many of Halide's features either. So implementing a simple tile-based fusion method is preferred. This is also what OpenCV 4 is doing.

Related issues

Current works

Currently, a simple tile-based fusion pipeline is implemented, and the gaussian / sobel / mag / phase / hist / spread ... stages are finished and tested. Refer to the fusion by hand branch for more info. The basic idea is to implement tile-based fusion only, and to do Halide's compilation work by hand... Though it is not as fancy as Halide, it simplifies the job a lot and is easy to use too.
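
To illustrate the tile-based idea, here is a minimal sketch, not the code from the fusion by hand branch. It fuses two toy stages (a horizontal 1x3 sum followed by a vertical 3x1 sum) over row tiles, so the intermediate buffer holds only `tile_rows + 2` rows instead of a full image. All names (`clampi`, `hsum3_row`, `box3_unfused`, `box3_fused`) and the replicate-border choice are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: clamp an index into [0, n-1] (replicate border).
static int clampi(int v, int n) { return std::max(0, std::min(v, n - 1)); }

// Stage 1: horizontal 1x3 sum of input row r, written to dst (w elements).
static void hsum3_row(const std::vector<int32_t>& img, int w, int r, int32_t* dst) {
    const int32_t* src = img.data() + static_cast<std::size_t>(r) * w;
    for (int c = 0; c < w; ++c)
        dst[c] = src[clampi(c - 1, w)] + src[c] + src[clampi(c + 1, w)];
}

// Unfused reference: full-size intermediate image, then a vertical 3x1 sum.
std::vector<int32_t> box3_unfused(const std::vector<int32_t>& img, int w, int h) {
    std::vector<int32_t> tmp(img.size()), out(img.size());
    for (int r = 0; r < h; ++r)
        hsum3_row(img, w, r, tmp.data() + static_cast<std::size_t>(r) * w);
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c)
            out[r * w + c] = tmp[clampi(r - 1, h) * w + c] + tmp[r * w + c] +
                             tmp[clampi(r + 1, h) * w + c];
    return out;
}

// Fused version: for each tile of output rows, run stage 1 only on the rows
// that tile needs (the tile plus a one-row halo above and below), then run
// stage 2 straight from the small buffer.
std::vector<int32_t> box3_fused(const std::vector<int32_t>& img, int w, int h, int tile_rows) {
    std::vector<int32_t> out(img.size());
    std::vector<int32_t> buf(static_cast<std::size_t>(tile_rows + 2) * w);
    for (int r0 = 0; r0 < h; r0 += tile_rows) {
        const int r1 = std::min(r0 + tile_rows, h);
        // Stage 1 on rows r0-1 .. r1 (inclusive), clamped to the image.
        for (int r = r0 - 1; r <= r1; ++r)
            hsum3_row(img, w, clampi(r, h),
                      buf.data() + static_cast<std::size_t>(r - (r0 - 1)) * w);
        // Stage 2 reads three consecutive buffer rows per output row.
        for (int r = r0; r < r1; ++r) {
            const int32_t* up = buf.data() + static_cast<std::size_t>(r - r0) * w;
            const int32_t* mid = up + w;
            const int32_t* down = mid + w;
            for (int c = 0; c < w; ++c)
                out[r * w + c] = up[c] + mid[c] + down[c];
        }
    }
    return out;
}
```

The halo rows are recomputed at every tile boundary; that redundant work is the usual price paid for keeping the intermediate small enough to stay in cache.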

Results and TODOs

The speed is roughly 10x faster than using OpenCV. We will use it to create the response map in the future.

See test_fusion.cpp for more examples. Also, any discussion, tests, or improvements are welcome!

Update

Now we pass all tests and the match function can be used as usual! It's about 6x faster for the full pipeline of creating the response map, and there is no need to crop images to 16n as before.

Update

Now RGB images are also supported, by calling cvtColor first. After investigating many solutions, we found that using OpenCV is the cleanest way... Compared with using a gray image, cvtColor only costs ~5% more.

@meiqua meiqua added the enhancement New feature or request label May 5, 2020
@DennisLiu-elogic
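
DennisLiu-elogic commented May 11, 2020

meiqua, it's me again (awkward)

I tried the hand-written fusion right away, and it breaks here:
image

image

The image is the same one with lots of hearts.

image

In VS, Enable Enhanced Instruction Set is set to SSE2.

With AVX2 selected, it breaks in a different place:
image
image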

@meiqua
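Owner Author

meiqua commented May 11, 2020

Usually this is because MIPP has not implemented some functions for certain instruction sets. Does it run correctly with SIMD turned off first?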

@DennisLiu-elogic
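
> Usually this is because MIPP has not implemented some functions for certain instruction sets. Does it run correctly with SIMD turned off first?

It now breaks somewhere else:
image

image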

@meiqua
Owner Author

meiqua commented May 11, 2020

What is the error?

@DennisLiu-elogic

> What is the error?

image

I found that op_row was set wrong. After changing it to 5:
image
image
image
image
image

@meiqua
Owner Author

meiqua commented May 11, 2020

Was this run directly on the latest code? I'll find a Windows laptop and try.

@DennisLiu-elogic

fusion by hand branch

It's this fusion.h. I changed a few places where pointers use gauss_size so that VS would compile it. MIPP also comes from there.

@meiqua
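Owner Author

meiqua commented May 11, 2020

@DennisLiu-elogic I tried it: with a vector used where gauss_size is, and SIMD turned off, it runs fine.
With SIMD on, everything except AVX2 fails... I'll see how to fill in everything MIPP leaves undefined.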

@DennisLiu-elogic
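
> @DennisLiu-elogic I tried it: with a vector used where gauss_size is, and SIMD turned off, it runs fine.
> With SIMD on, everything except AVX2 fails... I'll see how to fill in everything MIPP leaves undefined.

That's strange. Does changing int32_t* parent_buf_ptr [gauss_size] ---> int32_t* parent_buf_ptr [5] cause errors even without SIMD...?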

@aemior

aemior commented May 19, 2020

@meiqua Is there a plan to update fusion for RGB images soon?

@meiqua
Owner Author

meiqua commented May 20, 2020

@aemior I plan to solve this SIMD problem first and then do RGB2GRAY fusion. RGB fusion is somewhat troublesome and doesn't feel very necessary.

@aemior
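
aemior commented May 22, 2020

@meiqua OK. I'm working on an RGB pipeline on my side. RGB should improve accuracy when detecting different targets in natural scenes; for industrial scenes it is indeed unnecessary.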

@mangoeffect

Hello, I tested some of the fusion code, and it fails to compile in VS.
image
Is it OK to define an array like this?
Also, the -Wno-sign-compare flag has no effect in VS.

@meiqua
Owner Author

meiqua commented May 24, 2020

@mangosroom The VS compiler doesn't support variable-length arrays. The new commit changes it to a vector, which works.
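
As a minimal sketch of that change (the surrounding function is hypothetical; only the names `gauss_size` and `parent_buf_ptr` come from this thread), a run-time-sized array of row pointers can be replaced by a `std::vector` sized at run time, which is standard C++ and compiles under MSVC, GCC, and Clang alike:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Non-standard form (a GCC/Clang extension that MSVC rejects):
//     int32_t* parent_buf_ptr[gauss_size];   // gauss_size known only at run time
//
// Portable form: a std::vector of pointers with the same indexing syntax.
// Hypothetical demo function; assumes every row holds at least one element.
int64_t sum_first_elems(std::vector<std::vector<int32_t>>& rows) {
    const std::size_t gauss_size = rows.size();
    std::vector<int32_t*> parent_buf_ptr(gauss_size);
    for (std::size_t i = 0; i < gauss_size; ++i)
        parent_buf_ptr[i] = rows[i].data();

    int64_t sum = 0;
    for (std::size_t i = 0; i < gauss_size; ++i)
        sum += parent_buf_ptr[i][0];  // indexing is unchanged from the array form
    return sum;
}
```

If the size is a genuine compile-time constant, `std::array<int32_t*, N>` would avoid the heap allocation; whether that applies to `gauss_size` in fusion.h is not stated in this thread.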

@mangoeffect

Yes, that's how I changed it too. Algorithm-level code is best written in standard C++.

@meiqua
Owner Author

meiqua commented May 31, 2020

@DennisLiu-elogic SSE2 should work now. Previously, the tested result was that SSE4 and AVX2 worked.

@DennisLiu-elogic

> @DennisLiu-elogic SSE2 should work now. Previously, the tested result was that SSE4 and AVX2 worked.

Could you show me the changes to line2Dup.h / .cpp?

@DennisLiu-elogic

DennisLiu-elogic commented Jun 4, 2020

It breaks here:
image

image

image

The ROI's x and y are both -4. Calling .ptr() like this is bound to fail, isn't it?

@meiqua
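Owner Author

meiqua commented Jun 4, 2020

Those two files were not changed. What changed is MIPP, which gained mul<int32_t>, abs<int32_t>, and cvt<int16_t,int32_t>.
Without this debug assert there is no problem, because a range check follows. You can move this statement after the range check, or just use in.at(r, c).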

@DennisLiu-elogic
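
> Those two files were not changed. What changed is MIPP, which gained mul<int32_t>, abs<int32_t>, and cvt<int16_t,int32_t>.
> Without this debug assert there is no problem, because a range check follows. You can move this statement after the range check, or just use in.at(r, c).

I didn't notice the check afterwards...
However, in copyToBound, putting the if inside the for loop wastes some time. Couldn't out be filled with 0 first and then filled in according to the ROI?
Or is there something I missed?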

@meiqua
Owner Author

meiqua commented Jun 6, 2020

Filling with 0 first is not as fast as this, because it adds an extra write pass. But this is not the hot path, so the time difference is small.
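
The two approaches being compared can be sketched as follows. This is an illustration only, not the project's copyToBound: the `Roi` struct, both function names, and the raw-pointer interface are assumptions. Variant A checks bounds per pixel and writes each output element once; variant B zero-fills the output and then copies the valid intersection, so in-bounds elements are written twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Roi { int x, y, w, h; };  // hypothetical ROI; x and y may be negative

// Variant A: per-pixel range check, one write per output element.
void copy_checked(const int16_t* in, int in_w, int in_h, Roi roi, int16_t* out) {
    for (int r = 0; r < roi.h; ++r) {
        for (int c = 0; c < roi.w; ++c) {
            const int sr = roi.y + r, sc = roi.x + c;
            const bool inside = sr >= 0 && sr < in_h && sc >= 0 && sc < in_w;
            out[r * roi.w + c] = inside ? in[sr * in_w + sc] : int16_t(0);
        }
    }
}

// Variant B: zero-fill first (an extra write pass), then copy the overlap.
void copy_zero_first(const int16_t* in, int in_w, int in_h, Roi roi, int16_t* out) {
    std::fill(out, out + static_cast<std::size_t>(roi.w) * roi.h, int16_t(0));
    const int r_begin = std::max(0, -roi.y), r_end = std::min(roi.h, in_h - roi.y);
    const int c_begin = std::max(0, -roi.x), c_end = std::min(roi.w, in_w - roi.x);
    if (c_begin >= c_end) return;  // no horizontal overlap with the image
    for (int r = r_begin; r < r_end; ++r) {
        const int16_t* src = in + (roi.y + r) * in_w + roi.x;
        std::copy(src + c_begin, src + c_end, out + r * roi.w + c_begin);
    }
}
```

Both variants produce the same output; which one is faster depends on the tile size and the compiler, and the thread itself only reports that the single-pass form was faster in this pipeline.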

@DennisLiu-elogic

DennisLiu-elogic commented Jun 8, 2020

I updated line2Dup.h / .cpp from the fusion branch and ran the original matching flow.
Without SIMD:
image
image
The out_header array goes out of bounds.

With SIMD:
image

image

test_fusion.cpp runs without problems.
---- Correction
In test_fusion.cpp, with use_simd=true:
image
image
image

image

@meiqua
Owner Author

meiqua commented Jun 8, 2020

If use_simd = true but SIMD is not configured, it will indeed fail. With use_simd = false it runs fine for me. Are you using the latest code?

@DennisLiu-elogic
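
DennisLiu-elogic commented Jun 8, 2020

> If use_simd = true but SIMD is not configured, it will indeed fail. With use_simd = false it runs fine for me. Are you using the latest code?

I didn't explain clearly: the Visual Studio compiler option always had SSE2 enabled; the only thing I changed was use_simd.

So it is actually test_fusion that fails with use_simd=true and SSE2 enabled in the compiler options.
With use_simd=false and SSE2 enabled in the compiler, it works.

fusion.h is indeed the new code.

The new line2Dup.h / .cpp were tested with angle_test() from the original test.cpp. That part was not updated; I'll try tomorrow.

-----0609
I checked angle_test(): only the rotated-template part (use_rot) was updated, and I already have the new code.

--
With use_simd=false and the compiler option also turned off:
at the Gaussian node, the size of out_header is wrong when r=8; for other values of r it is fine.
image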

@meiqua
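Owner Author

meiqua commented Jun 14, 2020

It does go out of bounds; a condition should be added. It still ran fine before because the out-of-bounds value happened not to be used, and the compiler does no bounds checking.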

@DennisLiu-elogic
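
> It does go out of bounds; a condition should be added. It still ran fine before because the out-of-bounds value happened not to be used, and the compiler does no bounds checking.

After adding the check, this part is fine.

But with use_simd=true and SSE2 enabled in the compiler, it still fails.

In update_simd(), when dxint16.r = 0:
image

Test image:
https://drive.google.com/file/d/1FTuiw5dEgCmpNi3bnPTc8QwAmcVS0zFu/view?usp=sharing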

@meiqua
Owner Author

meiqua commented Jun 15, 2020

What is the error?

@DennisLiu-elogic

The call stack order looks like this:
Line 748
image

image

image

@meiqua
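Owner Author

meiqua commented Jun 15, 2020

It looks like low<int16_t> is undefined, but it is actually already defined here. This should happen when use_simd=true and SSE2 is not configured. Are you sure SSE2 is enabled? You can run mipp_test() to check.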

@DennisLiu-elogic

image
It turns out that on my machine enabling SSE2 has no effect; only AVX2 does... Why is that?

@meiqua
Owner Author

meiqua commented Jun 16, 2020

MIPP enters the SSE branch via the macro here. I'm not sure whether the VS compiler defines it.

@XuleiTao

With VS I can also only use AVX2, but my CPU doesn't support the AVX instruction set. How can I use MIPP in this case? MIPP appears to support SSE.

@meiqua
Owner Author

meiqua commented Jun 25, 2020

Is it the same problem as above: SSE is enabled, but MIPP doesn't enter the SSE branch?

@meiqua
Owner Author

meiqua commented Jun 25, 2020

I searched, and that really is the case:

> According to their documentation (msdn.microsoft.com/en-us/library/b0084kay.aspx), Visual Studio doesn't set the SSEn macros (but they do set AVX and AVX2). - Stephen Canon, May 22 '14 at 15:27
> Typical, I suppose - everybody else defines the SSEn macros, but not Microsoft. - Paul R, May 22 '14 at 15:39

Try this branch and see whether it fixes the problem.

@XuleiTao

It still doesn't seem to work. I'm compiling for x86. The VS documentation says SSE and SSE2 are available only when building for the x86 architecture.
image
image

@meiqua
Owner Author

meiqua commented Jun 25, 2020

That doesn't matter much. With SSE2, the __SSE__ macro should also be defined. I changed it; can you try again?
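
A minimal sketch of such a shim is shown below; it is not the actual change made in the branch. It relies on MSVC's documented predefined macros: `_M_IX86_FP` (1 for /arch:SSE, 2 for /arch:SSE2 and above, on 32-bit x86) and `_M_X64` (64-bit builds, where SSE2 is always available). The placement (before the SIMD wrapper header is included) and the helper `simd_baseline` are assumptions for illustration. Note that names beginning with a double underscore are reserved, so defining them yourself is a pragmatic workaround rather than strictly conforming code.

```cpp
// Assumption: this block is seen before the SIMD wrapper header is included.
#if defined(_MSC_VER) && !defined(__clang__)
#  if defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#    ifndef __SSE__
#      define __SSE__ 1   // SSE2 implies SSE, so define both
#    endif
#    ifndef __SSE2__
#      define __SSE2__ 1
#    endif
#  elif defined(_M_IX86_FP) && _M_IX86_FP == 1
#    ifndef __SSE__
#      define __SSE__ 1
#    endif
#  endif
#endif

// Hypothetical helper: reports the highest baseline the preprocessor can see,
// which is handy for checking that the expected branch was taken.
inline const char* simd_baseline() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "scalar";
#endif
}
```

The shim only covers SSE and SSE2. MSVC has no /arch switch for SSE3 through SSE4.2, so macros for those levels would have to be defined by hand if the target CPUs are known to support them.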

@XuleiTao
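
It works now, nice. However, in my tests MIPP does not make much difference on VS.
In all tests the template has 128 feature points.

  1. Code without MIPP: I added two OpenMP lines in gradient spreading and gradient response, which saved about 10 ms (130 ms -> 120 ms). In matchClass I used the parallel snippet you provided, which saved about 20 ms (30 ms -> 10 ms). Image: 2 MP (1600x1200); CPU: i7-6700; VS2015

  2. The master code with MIPP, AVX2 enabled: the gradient response part takes about 110-120 ms and matching about 10 ms. Image: 2 MP (1600x1200); CPU: i7-6700; VS2015

However, this runs fast on Linux: with padding=500 and an image larger than 2 MP, the total time is about 80 ms. CPU: i7-8700
With the same parameters, VS2015 with AVX2 takes about 150 ms.

Then, with VS2017, AVX2, CPU: i5-6300, the same master code and padding=500, it takes about 280 ms.

  1. CPU: i3, VS2017, image: 2 MP. I compared the code with MIPP against the code without MIPP. The MIPP version, with SSE2 enabled, takes about 300-400 ms; the version without MIPP also takes around 300 ms and is slightly faster on average.

Then, the fusion code: (1) image 2 MP, VS2017, SSE2, CPU: i3: about 100-110 ms with AVX2 either on or off. (2) image 2 MP, VS2015, CPU: i7-6700: about 80 ms with AVX2 either on or off.

The environments here are a bit mixed, but on VS MIPP barely improves speed, while on Linux the improvement is obvious. According to the MIPP notes, do I need to upgrade to VS2019?
> On msvc 14.10 (Microsoft Visual Studio 2017), the performances are reduced compared to the other compilers, the compiler is not able to fully inline all the MIPP methods. This has been fixed on msvc 14.21 (Microsoft Visual Studio 2019) and now you can expect high performances.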

@meiqua
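Owner Author

meiqua commented Jun 27, 2020

MIPP should not improve speed much over the original SSE implementation; it was added so that the code can be used on ARM. Being somewhat faster on Linux is possible: first, OpenCV's speed may differ across versions and build options; second, inlining may be done better, as mentioned here.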

@XuleiTao

I see. The fusion code takes about 70-80 ms on a 2 MP image, CPU: i7, OpenCV: 3.4.6. Is that normal?

@meiqua
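Owner Author

meiqua commented Jun 28, 2020

That's not normal; on Ubuntu 16.04 with an i7 it takes 20 ms for me. You can change this line to false to turn MIPP off first and see whether it is an inlining problem. With it off, I get about 40 ms.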

@XuleiTao
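
With the bundled image and padding=500, it also takes 20 ms on Ubuntu 16.04 with an i7, and about 50 ms with MIPP turned off.
On VS2015 it takes about 60-70 ms with MIPP either on or off. This CPU is an i7-8700, a bit faster than the 80 ms on the earlier i7-6700. Could it be a VS problem? Does VS need to be upgraded?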

@meiqua
Owner Author

meiqua commented Jun 28, 2020

It looks that way. Since the fusion code doesn't call OpenCV, the cause is probably insufficient compiler optimization.

@XuleiTao

OK, I'll find a machine with VS2017 and try later. Thanks a lot.

@XuleiTao

VS2017 does improve the speed. It seems VS2015 doesn't support MIPP either.

@zzqusst

zzqusst commented Nov 23, 2021

When a single image contains multiple instances of the template, you need to add cv_dnn_nms::NMSBoxes, set a suitable overlap ratio, and then do ICP registration.
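
For readers unfamiliar with the non-maximum suppression step mentioned above, here is a plain C++ stand-in that shows the idea. It is a generic greedy NMS, not the NMSBoxes implementation itself; the `Box` struct, the function names, and the threshold semantics (suppress when IoU is strictly greater than the threshold) are assumptions for the example.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Box { int x, y, w, h; float score; };  // hypothetical match box

// Intersection over union of two axis-aligned boxes.
static float iou(const Box& a, const Box& b) {
    const int x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
    const int x2 = std::min(a.x + a.w, b.x + b.w);
    const int y2 = std::min(a.y + a.h, b.y + b.h);
    const float inter = static_cast<float>(std::max(0, x2 - x1)) *
                        static_cast<float>(std::max(0, y2 - y1));
    const float uni = static_cast<float>(a.w) * a.h +
                      static_cast<float>(b.w) * b.h - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy NMS: visit boxes from highest to lowest score and keep a box only if
// it does not overlap an already-kept box by more than iou_thresh.
// Returns the indices of the kept boxes, in descending score order.
std::vector<int> greedy_nms(const std::vector<Box>& boxes, float iou_thresh) {
    std::vector<int> order(boxes.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return boxes[a].score > boxes[b].score; });
    std::vector<int> keep;
    for (int i : order) {
        bool suppressed = false;
        for (int k : keep) {
            if (iou(boxes[i], boxes[k]) > iou_thresh) { suppressed = true; break; }
        }
        if (!suppressed) keep.push_back(i);
    }
    return keep;
}
```

Each surviving index would then correspond to one template instance, which can be refined separately (for example by the ICP registration the comment mentions).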

@wiekern
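
wiekern commented May 6, 2022

Test image: 1200x1200
Training: padding=100, angles [-60, 60] with one template per degree, 121 templates in total, only one scale (line2Dup::Detector detector(128, {4});)
Testing: padding=250, top-1 only, stride=16
CPU: Intel Xeon E3-1270, supports the AVX2 instruction set
OS: Win11
Build environment: Qt Creator (I added a #pragma message after the #define SSE2 and could see at compile time that this branch was taken, so SSE2 is enabled), Qt_6_2_4_MinGW_64; the default release build enables O2 optimization (the compile output shows g++ -c -fno-keep-inline-dllexport -O2)
Branch used: fusion_fix_memo
The timings are below, around 250 ms in total, which does not reach the 70-80 ms for 2 MP mentioned above. What did I fail to set up correctly? Any advice is appreciated, thanks!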

----------thread 1---------
bgr2gray: 2.3253ms
gauss1x5: 8.8146ms
gauss5x1: 8.4448ms
sobel1x3_sxx_syx: 1.5282ms
sobel3x1_sxy_syy: 1.4955ms
mag_phase_quant1x1: 15.6051ms
hist3x3: 47.7554ms
spread1xn: 0.595ms
spreadnx1: 1.6778ms
response1x1: 1.9622ms
linearizeTxT: 17.3772ms
-----------------------------------------
fusion time
elasped time:0.114451s

match time
elasped time:0.138171s

@wiekern

wiekern commented May 6, 2022

I ran the test program on the fusion_by_hand branch; the results are below. The first printed fusion time is much longer than the second:

MIPP tests
----------

Instr. type:       SSE
Instr. full type:  SSE3
Instr. version:    3
Instr. size:       128 bits
Instr. lanes:      1
64-bit support:    yes
Byte/word support: yes
in this SIMD, int8 max is not inplemented by MIPP
in this SIMD, int8 shuff is not inplemented by MIPP
----------

test img size: 2356800

fusion time
elasped time:0.100045s

fusion time
elasped time:0.0262209s

match time
elasped time:0.027269s

match total time
elasped time:0.156801s

matches.size(): 7

match.template_id: 340
match.similarity: 100
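
Since the first fusion call in the log above is several times slower than the second, one way to report a steadier figure is to time several runs and discard the first. The sketch below assumes nothing about the project's own timer; `time_runs_ms` and `steady_state_ms` are hypothetical names, and the thread does not establish why the first call is slow (cold caches or first-touch allocation are only guesses).

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Runs f() `runs` times and returns each wall-clock duration in milliseconds.
template <class F>
std::vector<double> time_runs_ms(F&& f, int runs) {
    std::vector<double> ms;
    ms.reserve(runs > 0 ? runs : 0);
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Drops the first (warm-up) sample when there is more than one, then returns
// the upper median of the remaining samples.
double steady_state_ms(std::vector<double> ms) {
    if (ms.size() > 1) ms.erase(ms.begin());
    if (ms.empty()) return 0.0;
    std::sort(ms.begin(), ms.end());
    return ms[ms.size() / 2];
}
```

Applied to the two fusion timings printed above, this would report the second value (about 26 ms) rather than the first (about 100 ms).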

@DennisLiu1993

@wiekern
@zzqusst
@XuleiTao
Everyone, you can refer to my GitHub. It has an alternative to shape matching that can replace it in some application scenarios.
https://github.com/DennisLiu1993/Fastest_Image_Pattern_Matching
