New superpixel algorithm (F-DBSCAN) #3093

Merged 44 commits into opencv:4.x on Nov 29, 2021

scloke (Contributor) commented Oct 31, 2021

Implementation of a new superpixel algorithm, "Accelerated superpixel image segmentation with a parallelized DBSCAN algorithm".

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux x64,Win64,Mac,Android armeabi-v7a,Docs,iOS,Win32,ARMv7,ARMv8,Linux x64 Debug

Implementation of a new superpixel algorithm, "Accelerated superpixel image segmentation with a parallelized DBSCAN algorithm".
added newline at end of file
added newline at end of file
trailing whitespace removal
editing changes
editing changes
sturkmen72 (Contributor) commented Nov 7, 2021

@scloke, first of all, thank you for the contribution.
As a regular OpenCV user, I compiled the branch, tested it with the following Python code,
and want to share my experience.

Code:

import cv2 as cv

img = cv.imread('d:/test/hajandrade.jpg')
# arguments: image width, image height, number of superpixels, slices (threads), merge_small
ss = cv.ximgproc.createScanSegment(img.shape[1], img.shape[0], 500, 1, True)
tm = cv.TickMeter()
tm.start()
ss.iterate(img)
res = ss.getLabelContourMask(True)
tm.stop()
print(tm.getTimeMilli())  # elapsed time in milliseconds
res = cv.cvtColor(res, cv.COLOR_GRAY2BGR)
res = cv.add(res, img)  # overlay the contour mask on the input image
cv.imshow("Output", res)
cv.waitKey()
cv.destroyAllWindows()

  1. Is it possible to create one instance of ScanSegment and use it for different-sized images?
  2. Passing different values for the threads parameter of createScanSegment produces different results.

threads = 1

image

threads = 4

image

  3. (image) Is this pattern intentional?

image

scloke (Contributor Author) commented Nov 7, 2021

Hi,

Thanks for the comments, much appreciated. I am a bit of a newbie at open source, so I will try to explain the background of this contribution in some detail.

This algorithm was developed to speed up superpixel segmentation. The underlying principle is straightforward: it parallelises two important processes, namely 1) the actual segmentation, which is DBSCAN-based, and 2) the merging of small segments.

Normally DBSCAN is considered too slow for real-time image work, and quite hard to parallelise since many small segments are created and large segments formed by different processes will overlap. The innovation here is to limit the cluster size and convert the colour difference function to simple integer-based arithmetic. This can be processed very fast and segments are limited in size and hence don't overlap.
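
To make the two ideas in this paragraph concrete, here is a minimal sketch, not the code in this PR: a size-capped region-growing step that compares colours with integer arithmetic only. The names `colourDistSq` and `growCluster`, the squared-distance metric, and the 4-connected growth are all assumptions chosen for illustration.

```cpp
#include <array>
#include <cstdint>
#include <queue>
#include <vector>

using Pixel = std::array<std::uint8_t, 3>;  // e.g. one 8-bit Lab pixel

// Squared colour distance using only integer arithmetic (no sqrt, no floats).
inline int colourDistSq(const Pixel& a, const Pixel& b)
{
    const int d0 = int(a[0]) - int(b[0]);
    const int d1 = int(a[1]) - int(b[1]);
    const int d2 = int(a[2]) - int(b[2]);
    return d0 * d0 + d1 * d1 + d2 * d2;
}

// Hypothetical helper: grow one cluster from `seed` over 4-connected
// neighbours whose colour is within `threshSq` of the seed colour, stopping
// once `maxSize` pixels are labelled. `labels` must hold -1 for unassigned
// pixels. Returns the number of pixels given `label`.
inline int growCluster(const std::vector<Pixel>& img, int width, int height,
                       int seed, int label, int threshSq, int maxSize,
                       std::vector<int>& labels)
{
    if (maxSize < 1 || labels[seed] != -1)
        return 0;
    std::queue<int> frontier;
    labels[seed] = label;
    frontier.push(seed);
    int size = 1;
    while (!frontier.empty() && size < maxSize) {
        const int cur = frontier.front();
        frontier.pop();
        const int x = cur % width, y = cur / width;
        const int nx[4] = { x - 1, x + 1, x, x };
        const int ny[4] = { y, y, y - 1, y + 1 };
        for (int k = 0; k < 4 && size < maxSize; ++k) {
            if (nx[k] < 0 || nx[k] >= width || ny[k] < 0 || ny[k] >= height)
                continue;  // neighbour lies outside the image
            const int n = ny[k] * width + nx[k];
            if (labels[n] != -1 || colourDistSq(img[seed], img[n]) > threshSq)
                continue;  // already assigned, or colour too different
            labels[n] = label;
            frontier.push(n);
            ++size;
        }
    }
    return size;
}
```

The size cap bounds how far a cluster can spread from its seed, which is the property the paragraph above relies on. In this sequential sketch, overlap is additionally ruled out by the shared `labels` check; a parallel version would need disjoint regions instead.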

Similarly, merging of small segments is quite slow as I am using an adaptation of OpenCV's watershed algorithm. I managed to parallelise it efficiently by dividing it in the same way as the segmentation, and using a window with a surrounding margin of 1 pixel.

This explains a few of the things you noticed:

1) The inverted-T pattern follows from the way the DBSCAN clustering works. In a large, homogeneous, textureless image, this pattern is repeated throughout.
2) With 4 threads there is a clear horizontal and vertical division, which comes from the merging algorithm.
3) The output is deterministic when the number of threads is fixed (this is noted in the comments of the .hpp file), but different thread counts will give different segmentation results.
4) The initialisation function only allocates the buffers. Each run of the iterate function segments a new image using the allocated buffers. No problems.
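
As an illustration of why the result depends on the thread count, here is a sketch of splitting an image into bands with a 1-pixel margin. `Band` and `splitRows` are hypothetical names, and the sketch only splits along rows; the actual partitioning in the PR may well differ (the 4-thread result shows both a horizontal and a vertical division).

```cpp
#include <algorithm>
#include <vector>

// Rows are half-open ranges: [coreBegin, coreEnd) is owned by the band,
// [extBegin, extEnd) additionally includes the surrounding margin.
struct Band { int coreBegin, coreEnd, extBegin, extEnd; };

// Hypothetical helper: split `height` rows into `slices` bands, each
// extended by `margin` rows on both sides and clipped to the image.
inline std::vector<Band> splitRows(int height, int slices, int margin = 1)
{
    std::vector<Band> bands;
    if (height <= 0)
        return bands;
    slices = std::max(1, std::min(slices, height));
    for (int s = 0; s < slices; ++s) {
        const int begin = (int)((long long)height * s / slices);
        const int end = (int)((long long)height * (s + 1) / slices);
        Band b = { begin, end,
                   std::max(0, begin - margin),
                   std::min(height, end + margin) };
        bands.push_back(b);
    }
    return bands;
}
```

With a different number of slices the band boundaries move, so any per-band processing sees different neighbourhoods. That is consistent with the output being deterministic for a fixed thread count but not across thread counts.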

This algorithm has quadratic complexity, O(n²), so it is better suited to smaller images. When tested on the Berkeley Segmentation Dataset (smaller images), it is about six times faster than SEEDS or any of the other OpenCV algorithms (the new algorithm is labelled F-DBSCAN on the chart).

Fig6

The segmentation accuracy is about the same as the OpenCV routines.
Fig9

When tested on this random 2 megapixel image on my hard drive with 200 superpixels, it gave 199 segments. In comparison, SEEDS gave 144 segments using the default settings. Runtime was about 32.9 s for 1000 iterations for F-DBSCAN and 58.3 s for 1000 iterations for SEEDS on a 10-core i9 Windows machine.
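
For reference, the per-image figures implied by these totals can be worked out directly; this uses only the numbers quoted above.

```latex
\begin{align*}
t_{\text{F-DBSCAN}} &= \frac{32.9\ \text{s}}{1000} = 32.9\ \text{ms per image}
  &&\Rightarrow\quad \frac{1000\ \text{ms/s}}{32.9\ \text{ms}} \approx 30.4\ \text{images/s} \\
t_{\text{SEEDS}} &= \frac{58.3\ \text{s}}{1000} = 58.3\ \text{ms per image}
  &&\Rightarrow\quad \frac{1000\ \text{ms/s}}{58.3\ \text{ms}} \approx 17.2\ \text{images/s} \\
\frac{t_{\text{SEEDS}}}{t_{\text{F-DBSCAN}}} &= \frac{58.3}{32.9} \approx 1.77
\end{align*}
```

So on this 2 MP image the speed-up is roughly 1.8×, smaller than the roughly sixfold figure quoted for the Berkeley images, which is the direction one would expect from the quadratic complexity.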

Original
7bda7bd38a24f2d02c8e80acf8e966d2

F-DBSCAN
7bda7bd38a24f2d02c8e80acf8e966d2_Segmented_199

SEEDS
7bda7bd38a24f2d02c8e80acf8e966d2_Seeds_144

The original algorithm was written in ISO C++17, so I had to adapt it to C++11 for OpenCV and use the built-in parallelisation routines. The speed differs from the original, but the segmentation accuracy should be the same. I intend to submit this first and, if it is approved, use the finalised code to build a short comparison sample program to test performance against the other OpenCV algorithms. I will also write a proper usage guide with examples.

SC

alalek (Member) left a comment:

Thank you for the contribution!

Please take a look at the comments below.

Resolved review threads (outdated):
modules/ximgproc/include/opencv2/ximgproc/scansegment.hpp (4 threads)
modules/ximgproc/src/scansegment.cpp (4 threads)
indents removed
extra indents removed
license agreement updated
license agreement updated
reference moved to ximgproc.bib
reference moved to ximgproc.bib
c++ def removed
changed threads param
changed threads param
tab indents replaced with 4 spaces
removed trailing whitespace
replace malloc with autobuffer
updated header guard
fixed process threads to the number of slices
alalek (Member) left a comment:

Thank you for the updates!

dr = std::abs((ptr1)[2] - (ptr2)[2]); \
diff = ws_max(db, dg); \
diff = ws_max(diff, dr); \
assert(0 <= diff && diff <= 255); \
Member:

Use of C-style assert() is not allowed.

  • use CV_Assert() / CV_DbgAssert() instead
  • or prefer code with std::clamp() logic instead of checks (std::clamp is a C++17 feature and not yet available by default, so simulate it with std::min/std::max)

Other cases should be fixed too.
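
A minimal sketch of the second option, using only C++11 facilities. `clampValue` and `maxChannelDiff` are illustrative names, not functions from the PR; the channel logic mirrors the quoted macro (maximum of the three absolute channel differences).

```cpp
#include <algorithm>
#include <cstdlib>

// C++11 stand-in for std::clamp (C++17): limits v to the range [lo, hi].
template <typename T>
inline T clampValue(T v, T lo, T hi)
{
    return std::max(lo, std::min(v, hi));
}

// Largest absolute per-channel difference between two 3-channel 8-bit
// pixels, clamped to [0, 255] instead of being checked with an assertion.
inline int maxChannelDiff(const unsigned char* p1, const unsigned char* p2)
{
    const int db = std::abs(int(p1[0]) - int(p2[0]));
    const int dg = std::abs(int(p1[1]) - int(p2[1]));
    const int dr = std::abs(int(p1[2]) - int(p2[2]));
    const int diff = std::max(db, std::max(dg, dr));
    return clampValue(diff, 0, 255);
}
```

For 8-bit inputs the clamp never changes the value, so it acts purely as a guard that keeps the result usable as a table index even if the inputs were ever widened.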

Contributor Author:

substituted with CV_Assert()

Comment on lines 596 to 598
else \
q[idx].first = node; \
q[idx].last = node; \
Member:

Add {} braces? Or add indentation to clarify that this code is not broken.

Contributor Author:

indentation added
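
For comparison, this is what the braced alternative could look like, written as an inline function rather than a macro. `Node`, `Queue`, and `pushNode` are simplified, hypothetical stand-ins for the queue structures in the quoted snippet, and it is an assumption here that index 0 means "no node".

```cpp
#include <vector>

struct Node { int next; };          // index of the next node, 0 = none
struct Queue { int first, last; };  // head and tail indices, 0 = empty

// Append `node` to the tail of queue `q`; nodes live in `storage`.
inline void pushNode(std::vector<Node>& storage, Queue& q, int node)
{
    storage[node].next = 0;
    if (q.last) {
        storage[q.last].next = node;  // link behind the current tail
    } else {
        q.first = node;               // queue was empty: node is also the head
    }
    q.last = node;                    // runs in both cases: node is the new tail
}
```

The braces make it explicit that only the `q.first` assignment belongs to the `else` branch, while the tail update always runs.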

//! @addtogroup ximgproc_superpixel
//! @{

class ScanSegmentImpl : public ScanSegment
Member:

consider using CV_FINAL (final)

Contributor Author:

done

ScanSegmentImpl::ScanSegmentImpl(int image_width, int image_height, int num_superpixels, int slices, bool merge_small)
{
// set the number of process threads
processthreads = std::thread::hardware_concurrency();
Member:

Please use cv::getNumThreads() instead, drop #include <thread>

Contributor Author:

replaced


// start at the center of the rect, then run through the remainder
labBuffer = reinterpret_cast<cv::Vec3b*>(src.data);
cv::parallel_for_(cv::Range(0, (int)indexNeighbourVec.size()), PP1(reinterpret_cast<ScanSegmentImpl*>(this)));
Member:

BTW, using C++11 lambdas with parallel_for_ looks like this:

    parallel_for_(Range(0, (int)indexNeighbourVec.size()), [&](const Range& range) {
        for (int i = range.start; i < range.end; i++) {
            OP1(i);
        }
    });

OP4 case:

    // copy back to labels mat
    parallel_for_(Range(0, (int)indexProcessVec.size()), [&](const Range& range) {
        for (int i = range.start; i < range.end; i++) {
            OP4(indexProcessVec[i]);
        }
    });

scloke (Contributor Author), Nov 21, 2021:

Thanks, I appreciate the help. I compiled it on my system with C++14 and had no problems. All replaced.
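
The contract behind this idiom is that the body receives an arbitrary sub-range and must loop over it itself. The sketch below shows that contract with a sequential stand-in so that it compiles without OpenCV; `Range`, `stripedFor`, and `squares` are hypothetical names, and `stripedFor` does not run anything in parallel.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Range { int start, end; };  // half-open [start, end)

// Sequential stand-in for a parallel loop: splits `range` into up to
// `stripes` sub-ranges and passes each one to `body` in turn.
inline void stripedFor(const Range& range, int stripes,
                       const std::function<void(const Range&)>& body)
{
    const int total = range.end - range.start;
    if (total <= 0)
        return;
    stripes = std::max(1, std::min(stripes, total));
    for (int s = 0; s < stripes; ++s) {
        Range sub = { range.start + (int)((long long)total * s / stripes),
                      range.start + (int)((long long)total * (s + 1) / stripes) };
        body(sub);
    }
}

// Example body: fill out[i] = i * i for n >= 0, whatever the stripe layout.
inline std::vector<int> squares(int n, int stripes)
{
    std::vector<int> out(n, 0);
    stripedFor(Range{ 0, n }, stripes, [&](const Range& r) {
        for (int i = r.start; i < r.end; i++)
            out[i] = i * i;  // each index is written by exactly one stripe
    });
    return out;
}
```

Because every index belongs to exactly one sub-range, the result does not depend on how the range is split, which is the property a lambda body needs before it can be handed to a real parallel loop.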

Comment on lines 63 to 70
cv::AutoBuffer<cv::Rect> _seedRects; // autobuffer of seed rectangles
cv::AutoBuffer<cv::Rect> _seedRectsExt; // autobuffer of extended seed rectangles
cv::AutoBuffer<cv::Rect> _offsetRects; // autobuffer of offset rectangles
cv::AutoBuffer<cv::Point> _neighbourLoc;// autobuffer of neighbour locations
cv::Rect* seedRects; // array of seed rectangles
cv::Rect* seedRectsExt; // array of extended seed rectangles
cv::Rect* offsetRects; // array of offset rectangles
cv::Point* neighbourLoc; // neighbour locations
Member:

In general, it is dangerous to store separate raw pointers (a raw pointer cannot control the lifetime of the allocated buffer).
It also doesn't make sense, because the .data() method is as fast as a raw pointer.
Moreover, operator[](size_t i) checks that the index is in the valid range in debug builds.
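
A small sketch of the design being suggested, with `std::vector` standing in for `cv::AutoBuffer` so that it compiles without OpenCV. `Rect` and `SeedStore` are hypothetical, simplified types. The class keeps only the owning container and goes through an accessor on every use, so there is no cached pointer that could be left dangling when the buffer is reallocated.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Rect { int x, y, width, height; };

class SeedStore
{
public:
    explicit SeedStore(std::size_t n) : rects_(n) {}

    // Reallocation is safe here because no raw pointer into rects_ is kept.
    void resize(std::size_t n) { rects_.resize(n); }

    // Bounds-checked access; throws std::out_of_range for a bad index.
    Rect& at(std::size_t i) { return rects_.at(i); }

    std::size_t size() const { return rects_.size(); }

private:
    std::vector<Rect> rects_;  // sole owner of the storage, no Rect* member
};
```

Note one difference from the review comment: `std::vector::at` always checks the index, whereas the reviewer describes the `AutoBuffer` check as active in debug builds only.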

Contributor Author:

The .data() method is new to me, so I used the CLAHE implementation code in OpenCV as a template. The code there converts the .data() result to a raw pointer:
cv::AutoBuffer<int> _tileHist(histSize);
int* tileHist = _tileHist.data();

There is a significant speed difference when switching to .data() without dedicated raw pointers. I ran the code with and without dedicated raw pointers over a thousand cycles, twice, and the timings were 31.1 s (with raw pointers) vs 37.3 s (without). I managed to improve the speed a bit by converting neighbourLoc to a pre-initialised buffer, but otherwise I left it unchanged.

I think that the use of raw pointers should be safe enough in this case, since they are obtained from AutoBuffers that are class members, allocated and deallocated with the lifetime of the class. The evidence for this is:

  1. Running several thousand cycles in both debug and release builds showed no instability, memory leaks, or buffer overruns.
  2. The CLAHE module in OpenCV uses the same method and there have been no reports of instability.

Member:

I really have several questions about such micro-benchmarks... I have no idea what they measure.

https://github.com/opencv/opencv/blob/ac4b592b4e550a0ced1977e9aa19e8059a796e3c/modules/core/include/opencv2/core/utility.hpp#L127-L142

scloke (Contributor Author), Nov 22, 2021:

From the code you quoted, .data() should give a straight reference to the aligned pointer of the data in the AutoBuffer. The operator should also function similarly. Hence, there should not be any speed difference as you said.

Previously, I ran the entire code to process the 2 MP image I described earlier 1000 times. Now I have written a microbenchmark to test the AutoBuffer specifically.

// test 10000 autobuffer with raw vs without raw
void testIterate(int iterate, int buffersize, int* rawtime, int* norawtime)
{
    cv::AutoBuffer<int> _testBuffer = cv::AutoBuffer<int>(buffersize);
    int* testBuffer = _testBuffer.data();
    std::fill(testBuffer, testBuffer + buffersize, 0);

    auto tstart1 = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterate; i++) {
        for (int j = 0; j < buffersize; j++) {
            testBuffer[j] = 0;
            testBuffer[j] = testBuffer[j] + 1;
        }
    }

    auto tend1 = std::chrono::high_resolution_clock::now();
    *rawtime = (int)std::chrono::duration_cast<std::chrono::microseconds>(tend1 - tstart1).count();


    cv::AutoBuffer<int> testBuffer2 = cv::AutoBuffer<int>(buffersize);
    std::fill(testBuffer2.data(), testBuffer2.data() + buffersize, 0);

    auto tstart2 = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterate; i++) {
        for (int j = 0; j < buffersize; j++) {
            testBuffer2.data()[j] = 0;
            testBuffer2.data()[j] = testBuffer2.data()[j] + 1;
        }
    }

    auto tend2 = std::chrono::high_resolution_clock::now();
    *norawtime = (int)std::chrono::duration_cast<std::chrono::microseconds>(tend2 - tstart2).count();
}

These were the results in Debug and Release mode respectively, with the numbers in microseconds, iterating 50000 times, with a buffer size of 50000.

DEBUG
Debug

RELEASE
Release

There is a large difference between the use of RAWs and without, so much so that I am wondering if this is a compiler optimisation effect we are looking at.

If this is so, then the use of RAW pointers may be easier for the compiler to optimise, hence the speed difference we are seeing.

What do you think?

Contributor Author:

I tried again using Xcode on iOS, and here are the new values:

DEBUG
iOS Debug

RELEASE
iOS Release

This time the release numbers are looking more like expected. Once again, compiler differences? I am open to suggestions and have kept both versions on my system. If you think that it's better to go completely without RAWs, then I will upload the new version. Do let me know.

Member:

The Release difference is about 0.1%, which is within measurement accuracy. I would say we don't lose performance here.
The Debug difference is expected: the .data() method and other functions are not inlined by default, and more checks are involved to validate code assumptions (e.g., CV_DbgAssert). Debug builds are usually not tracked for performance.

It is better to replace the raw pointers from a code safety/security perspective.
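
One way to make such a comparison less sensitive to noise and to optimiser effects is sketched below. It is not the benchmark used in this thread: `bestOfMicros`, `bumpViaPointer`, and `bumpViaAccessor` are hypothetical helpers, and `std::vector` again stands in for `cv::AutoBuffer`. Each loop returns a sum so that its work has an observable result, and the timer reports the fastest of several runs.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Runs fn() `repeats` times and returns the fastest run in microseconds.
template <typename Fn>
double bestOfMicros(int repeats, Fn fn)
{
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count();
        best = std::min(best, us);
    }
    return best;
}

// Increment every element through a hoisted raw pointer; returns the sum.
inline std::uint64_t bumpViaPointer(std::vector<int>& buf)
{
    int* p = buf.data();
    std::uint64_t sum = 0;
    for (std::size_t j = 0; j < buf.size(); ++j) {
        p[j] += 1;
        sum += static_cast<std::uint64_t>(p[j]);
    }
    return sum;
}

// Same work, but calling data() on every access instead of caching it.
inline std::uint64_t bumpViaAccessor(std::vector<int>& buf)
{
    std::uint64_t sum = 0;
    for (std::size_t j = 0; j < buf.size(); ++j) {
        buf.data()[j] += 1;
        sum += static_cast<std::uint64_t>(buf.data()[j]);
    }
    return sum;
}
```

A caller would store each returned sum in a `volatile` variable inside the timed lambda, for example `bestOfMicros(5, [&] { sink = bumpViaPointer(buf); })`, so the compiler cannot discard the loop.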

Contributor Author:

I have replaced the raw pointers in the code except for labBuffer, since this is taken from Mat.data rather than an AutoBuffer. This should be safe since the lifetime of the pointer is short (it is only used in OP1 and read just before invocation), and it is read-only for operator values.

scloke and others added 7 commits November 22, 2021 12:18
C++ 11 lambdas used instead of cv::ParallelLoopBody
changed neighbours location buffer to array
remove whitespace
RAW pointers removed
alalek (Member) left a comment:

Pushed an update with fixed coding style and an added smoke test.

Please take a look at the comments below.

Comment on lines 336 to 339
void ScanSegmentImpl::OP2(std::pair<int, int> const& p)
{
std::pair<int, int>& q = const_cast<std::pair<int, int>&>(p);
for (int i = q.first; i < q.second; i++) {
Member:

Why do we need this alias?

std::pair<int, int>& q = const_cast<std::pair<int, int>&>(p);

The same note applies to OP4.

Contributor Author:

My apologies. This was some old code that got carried over. Have updated it.
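
Since the loop only reads the two bounds, the const reference can be used directly. A minimal sketch with a hypothetical function `indicesIn` (the real OP2 body is not shown in this thread):

```cpp
#include <utility>
#include <vector>

// Collects the indices in the half-open range [p.first, p.second),
// reading the bounds straight from the const reference.
inline std::vector<int> indicesIn(const std::pair<int, int>& p)
{
    std::vector<int> out;
    for (int i = p.first; i < p.second; i++)
        out.push_back(i);
    return out;
}
```

No `const_cast` is needed because nothing writes to the pair.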

Comment on lines +70 to +71
@param image_width Image width.
@param image_height Image height.
Member:

BTW, prefer to use Size image_size instead of 2 dedicated values

Contributor Author:

I was thinking of this also, but in the end I went with following the pattern in createSuperpixelSEEDS which used two dedicated values. If you prefer cv::Size, let me know and I will change.

existing superpixel segmentation methods. When tested on the Berkeley Segmentation Dataset, the average processing speed is 175 frames/s
with a Boundary Recall of 0.797 and an Achievable Segmentation Accuracy of 0.944. The computational complexity is quadratic O(n2) and
more suited to smaller images, but can still process a 2MP colour image faster than the SEEDS algorithm in OpenCV. The output is deterministic
when the number of processing threads is fixed, and requires the source image to be in Lab colour format.
Member:

Please add @cite loke2021accelerated in this documentation section.

Contributor Author:

added.

alalek (Member) left a comment:

Great job! Thank you for the contribution 👍

@alalek alalek merged commit a5cc475 into opencv:4.x Nov 29, 2021
scloke (Contributor Author) commented Nov 29, 2021 via email

@alalek alalek mentioned this pull request Dec 30, 2021
@alalek alalek mentioned this pull request Feb 22, 2022
3 participants