Skip to content

๐Ÿ”Ž Image template matching using Apple vDSP.

Notifications You must be signed in to change notification settings

osnr/template-matching

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

15 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

template-matching

Template matching using Apple Accelerate (vDSP) API in pure C.

(TODO: see my !!con 2022 talk about this)

Implements 2D normalized cross-correlation (normxcorr2) to quickly find instances of template (patch) image on another image.

run

$ make slow && ./slow
image 1032x515, templ 62x19 -> result 1093x533 in 1.276023 sec
hit at (142, 108)
hits: 1
$ make fast && ./fast
image 1032x515, templ 62x19 -> result 1093x533 in 0.036434 sec
hit at (142, 108)
hits: 1

(also check out result.png and hits.png after each run)

sources and notes

Based on a mix of stuff (Googling normxcorr2 is a good place to start, that seems to be a nice well-defined shorthand name that means the same function when people use it):

I literally implemented it in terms of basic numpy operations and FFT first so I could understand it and compare the code to known-good code, then went and translated each line to the 2-5 equivalent lines of C.

Key things are:

  • use multiplication of the FFTs instead of naive convolution (because the kernel is really big, it's not your 3x3 or 5x5 or whatever)

  • use summed-area tables to help with normalization (some of the implementations above get lazy and use convolution with a table of ones to get sums, and that's slower)

  • use the Apple vDSP APIs, don't use for loops if you can avoid it

  • precision is important -- floats are not good enough for the running sum if you're summing over arrays of floats, you need to go 1 level higher-precision (doubles)

potential performance wins

We're still maybe 50% slower than OpenCV.

  • use real FFT instead of complex FFT. I honestly could not figure out how to do this -- the packing is so weird -- and it doesn't seem to speed it up that much (you can try just stubbing it in and seeing how long it takes to run even if you get garbage values)

  • get rid of remaining memcpys?

  • incrementality? if you have optical flow or otherwise know how an image changed

  • cache FFT of template in memory. this saves about 1/7 of runtime I think, but it complicates the code a lot right now

random thoughts

  • it would be nice to free all the buffers at the end, or have an arena allocator or something

  • it would be nice to automatically be able to view every stage of the image processing

  • there are various baseline things that it's good to know the speed of -- how long does an FFT pass take? how long does a for loop through a whole image take?

About

๐Ÿ”Ž Image template matching using Apple vDSP.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published