WIP: HPX backend for OpenCV #11897

Merged
merged 4 commits into opencv:master on Aug 31, 2018

Conversation

4 participants
@Jakub-Golinowski
Contributor

Jakub-Golinowski commented Jul 5, 2018

Introduction

My name is Jakub Golinowski and I am a student enrolled in Google Summer of Code 2018. I am working with the Ste||ar Group on developing an HPX backend for parallelism in OpenCV. Here is the link to the GSoC project: https://summerofcode.withgoogle.com/projects/#5375652711104512. The main goal of the project is to improve interoperability for HPX applications that use OpenCV. The primary use case we have in mind is an application that already uses the HPX runtime environment and also performs computer vision operations using OpenCV. Including the HPX backend in OpenCV will make such an application easier to implement and will let the user control its parallelization behaviour using the standard HPX approach.

So far there are 2 versions of HPX backend:

  • HPX - primary version, described above, which assumes that user’s application starts the runtime before making calls to OpenCV functions that use cv::parallel_for_(). As mentioned in the introduction, this would be the preferred and hopefully only way to use the backend.
  • HPX_STARTSTOP - back-up version of backend in which HPX runtime is started and stopped for each cv::parallel_for_() call.
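
To make the difference between the two versions concrete, here is a minimal sketch using only the C++ standard library. Every name in it (`ScopedRuntime`, `parallel_for_startstop`, `parallel_for_primary`, the counters) is hypothetical and is not part of HPX, OpenCV, or this PR; the loop runs sequentially as a stand-in for the real parallel loop. It only illustrates where the runtime start and stop happen in each version.

```cpp
#include <cassert>
#include <functional>

// Hypothetical counters, used only to make the start/stop cost visible.
static int g_starts = 0;
static int g_stops = 0;

// Hypothetical RAII guard standing in for "start the runtime, then stop it".
// A real runtime would create and tear down its worker threads here.
struct ScopedRuntime {
    ScopedRuntime() { ++g_starts; }
    ~ScopedRuntime() { ++g_stops; }
};

// Start/stop flavour: every call pays for one runtime start and one stop.
void parallel_for_startstop(int begin, int end,
                            const std::function<void(int)>& body) {
    assert(begin <= end);
    ScopedRuntime runtime;  // started here, stopped when the call returns
    for (int i = begin; i < end; ++i) body(i);  // sequential stand-in
}

// Primary flavour: the caller is assumed to have started the runtime already,
// so the call itself does no runtime management.
void parallel_for_primary(int begin, int end,
                          const std::function<void(int)>& body) {
    assert(begin <= end);
    for (int i = begin; i < end; ++i) body(i);  // sequential stand-in
}
```

In this sketch, three calls to `parallel_for_startstop` bump the start and stop counters three times each, while calls to `parallel_for_primary` leave them untouched; that per-call cost is what the start/stop version trades for not requiring changes to user code.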

The backend was tested locally using small applications (they can be found here: https://github.com/Jakub-Golinowski/opencv_hpx_backend). The applications were behaving as expected.

This pull request changes

  • The first commit (9006fb9) contains all source code changes required for the backend functionality and the proposed CMake changes for including the HPX library.

  • The second commit (1cea2da) contains the proposed changes required for OpenCV with the HPX backend so that the primary version of the backend can pass the tests.

Summary

This is still work in progress, but I would like to ask for feedback early especially on the style of changes introduced to CMake config files.

Also, I would like to ask what would be required to have the HPX backend tested using OpenCV's CI? For now I added the cmake option WITH_HPX to choose HPX as the parallel backend and WITH_HPX_STARTSTOP to choose its back-up version (described above).

@Jakub-Golinowski

Contributor

Jakub-Golinowski commented Jul 5, 2018

Answers to questions about this PR:

Could this backend be used to distribute the algorithms among multiple workers? Especially for DNN this might be interesting.

HPX supports distributed computation, so it would be possible to distribute algorithms among multiple worker machines, but the current implementation is analogous to the other backends and uses only a local parallel_for_(). However, this is a very interesting idea which requires some work in order to figure out, for example, whether it is necessary to back the cv::Mat data structure with HPX distributed data structures (like partitioned_vector).

HPX_STARTSTOP seems like a too big hack. Isn't there a better way to initialize HPX? Could you use a static initialization like tbb and Concurrency? Or initialize on a first use of parallel for?

As for HPX_STARTSTOP, static initialization is possible with HPX and can be considered as an alternative version of the HPX_STARTSTOP backend (the current STARTSTOP version is just the simplest version of the HPX backend that does not require any changes to user code).

I wonder, if you can provide any benchmarks, e.g. comparing HPX against TBB or pthreads or OpenMP?

I was using the mandelbrot example from OpenCV documentation, the results for 4 cores can be found here:
https://github.com/Jakub-Golinowski/opencv_hpx_backend/blob/master/experiments/2018-06-17-22.33-all_over_workload_num_threasds_0-100_nstripes/images/Speed-up_as_function_of_image_size.png

I found that w/o START_STOP option you need to put #include "...hpx_main.h" into every application, is this correct? If so, it's very big limitation. We shall think of some workaround.

As for the primary backend version (runtime assumed to be started by the user), it is true that the user has to launch the HPX runtime explicitly and make calls to cv::parallel_for_() from within the runtime. One way of achieving this is to put #include "...hpx_main.hpp" into the cpp file of the application that contains the main() function; then all the user code runs within the HPX runtime. We realize this might be a limitation for some applications, but on the other hand this approach makes things easier for the use case we have in mind. Since HPX is a library focused on parallelization that lets the user build a DAG of the workflow instead of following the classic fork-join approach, the use case we are thinking about is a user of the HPX library who wants to use OpenCV functionality in an application as part of the above-mentioned workflow DAG. In that use case, using the backend in the primary version is far easier and more intuitive for the user, whereas any type of STARTSTOP backend would make it really difficult to write such an application. Therefore, we propose two versions configurable by a build option: the first is for the user who is using HPX and adds OpenCV functionality, and the second (STARTSTOP) is for the user who simply wants to use the HPX backend for cv::parallel_for_().

Also, we found that some OpenMP implementations have very aggressive timeout policies, for the sake of better performance in synthetic benchmarks. But in reality it means that the CPU is 100% busy most of the time and the other parallel frameworks run out of CPU power at the same time. E.g. imagine that we have a parallel TBB section, then a parallel pthreads section, then a parallel HPX section, then TBB again etc. How quickly will HPX threads fall asleep? Can the timeout be configured?

As for the timeout policies (when the worker threads are put to sleep), they can also be aggressive in HPX but they are configurable so this aspect can be easily tuned. However, I am not sure if I understand the example with multiple backends, I thought that parallel backend is a compile-time option and there cannot be multiple backends in OpenCV? Or are we talking about more general case?

@vpisarev

Contributor

vpisarev commented Jul 6, 2018

@Jakub-Golinowski, the "with start/stop" option looks very inconvenient for OpenCV users. Basically, as I said, your PR is far from being complete in terms of support of this option - you put the include directive just to opencv_test_core without putting it to all the other tests and samples. And I don't think that would be a good idea to add this thing to every single test app and sample. Even with this option, if I interpreted the performance charts you provided correctly, HPX is slower than the existing parallel backends. With start/stop option it's even slower. So, what's the benefit of this parallel framework then?

@Jakub-Golinowski

Contributor

Jakub-Golinowski commented Jul 6, 2018

@vpisarev The main benefit of including the “primary” version of HPX backend in OpenCV is solving the problem of competing parallel backends. Currently if a user develops an application using parallel capabilities of HPX runtime and wants to use OpenCV functionality as part of his application, then his application will spawn 2 parallel backends competing for the resources. Achieving high and predictable performance in this case is not trivial and introduces extra work for the user, discouraging him from combining HPX with OpenCV. However, with “primary” version of HPX backend in place the user would be able to easily include OpenCV in his HPX-based application and fully control its parallel behaviour.

As for the “start/stop” version of the backend, I would like to clarify that in this case the backend is not at all inconvenient for OpenCV users, as it does not require runtime management (no need to include hpx_init.hpp) and calls to cv::parallel_for_() can be made as with the other backends. However, the main downside of this backend version is that starting and stopping the HPX runtime environment for each cv::parallel_for_() call introduces overhead and, as you noted, makes it the slowest backend.

I would like to also mention that in the benchmark I presented, the backend providing highest performance is the dedicated “pthreads” backend which was developed specifically to support cv::parallel_for_() calls. Other backends (tbb, omp, hpx) achieve lower performance - I see it as a trade-off between specialized implementation for highest performance and general implementation for serving multiple purposes.

As for including hpx_main.hpp in the accuracy and performance tests, I am currently taking care of that and at the same time running the tests locally. In my latest commit I added a conditional #include of hpx_main.hpp to all the accuracy and performance tests that were built on my machine.

Summing up, the HPX “primary” backend version allows for “parallel compatibility” between OpenCV and HPX preserving the full control over the runtime environment in hands of the user. Additionally, we propose HPX “start/stop” backend allowing for calls to cv::parallel_for_() in the same way it is done for other backends for completeness.

@vpisarev

Contributor

vpisarev commented Jul 9, 2018

@Jakub-Golinowski, according to the chart, even though tbb and omp are slower than pthreads, they are still faster than HPX. I think, if you want some real heavy workload, you may want to run opencv_perf_dnn (and we (the OpenCV team) would be very interested to see the results); for that you need to clone the opencv_extra repository, run opencv_extra/testdata/dnn/download_models.py, set the environment variable OPENCV_TEST_DATA_PATH=<path to opencv_extra>/testdata and then use the following guide to run the test and compare the performance of HPX and other parallel frameworks.

Also, you are saying that using HPX has the advantage that the utilization of CPU cores is balanced between different components. Is the advantage preserved when start/stop mode is used? And what happens in non-start/stop mode? What does that #include <hpx/hpx_main.hpp> actually do? Does it retrieve and make use of the 'application-global' thread pool? Can we take the content of hpx_main.hpp, rework it and put it inside DllMain, in the DLL_PROCESS_ATTACH case (in the case of Windows), or add some automatically initialized singleton in the case of Linux? Basically, we can place the check and single-time initialization inside cv::parallel_for_, so that OpenCV users do not have to insert #include <hpx/hpx_main.hpp> into every single app. It's just really inconvenient. In the current form, I would say, it's so inconvenient that it cannot be integrated.
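
The "check and single-time initialization inside cv::parallel_for_" idea can be sketched with a function-local static, which C++11 initializes once and in a thread-safe way. This is only an illustration under stated assumptions: `LazyRuntime` and `parallel_for_lazy` are hypothetical names, the class is a stand-in and not an HPX type, and the loop is sequential. Whether the HPX runtime can or should be started this way is exactly the open question in this thread.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a parallel runtime; this is not an HPX class.
class LazyRuntime {
public:
    // Function-local static: constructed once, on first use.
    // C++11 guarantees that this initialization is thread-safe.
    static LazyRuntime& instance() {
        static LazyRuntime runtime;
        return runtime;
    }

    // Number of times the stand-in runtime has been started.
    static int start_count() { return starts_; }

    // Sequential stand-in for dispatching the loop body to worker threads.
    void for_each_index(int begin, int end,
                        const std::function<void(int)>& body) {
        assert(begin <= end);
        for (int i = begin; i < end; ++i) body(i);
    }

private:
    LazyRuntime() { ++starts_; }  // "starts" the runtime exactly once
    static int starts_;
};

int LazyRuntime::starts_ = 0;

// The first call initializes the runtime; later calls reuse it.
void parallel_for_lazy(int begin, int end,
                       const std::function<void(int)>& body) {
    LazyRuntime::instance().for_each_index(begin, end, body);
}
```

With this shape, user code calls `parallel_for_lazy` without any extra include, and `LazyRuntime::start_count()` stays at 1 no matter how many calls are made.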

@biddisco

biddisco commented Jul 9, 2018

To help clarify this HPX PR, I'd like to explain further ...

OpenMP and TBB are parallelism libraries that specify parallel regions in which for_loops and such like can be run, and within which, make use of threading resources. The OpenCV pthreads backend is a special case that is 'hand-implemented'.

HPX does not use parallel regions as such. When the user starts his/her application, the whole of int main(...) is started on an HPX thread and all tasks that are created from that point on can be executed on thread pools and distributed between compute resources. You could say that the whole application is a parallel region. This requires the user to tell HPX to start before int main is reached and is usually done by using #include <hpx/hpx_main.hpp>. (In general, HPX does not encourage a fork-join style of parallelism and tasks can be enqueued and run asynchronously from any part of the code. parallel::for_loops can be nested arbitrarily and can be run as tasks themselves.)

Starting the runtime before int main is a decision that should be made on a per application basis because using hpx_main can have unintended consequences if you make calls to system functions that might block when in fact your code is on an HPX thread.

The use case for this PR is for a user who wishes to use HPX in their code, has already started the HPX runtime and created tasks, but may also want to run an OpenCV algorithm and make use of the existing thread pools on HPX threads. The OpenCV algorithm will generally be run within an existing HPX task and will be run on all threads assigned to the HPX runtime (or subset thereof defined by an executor/pool etc).

The secondary use-case is when a user does not already use HPX in their code, but wishes to start/stop the HPX runtime for each OpenCV algorithm (analogous to parallel regions for TBB/OpenMP). (In practice this is unlikely to be used if performance is worse than OpenMP and we could drop this support if it improved the chances of the first use-case being accepted).

We shall investigate the performance of both operating modes relative to OpenMP/pthreads. This PR represents the first version of HPX integration and may be subject to performance improvements.

@vpisarev

Contributor

vpisarev commented Jul 16, 2018

@biddisco, sorry for delay with followup. So, what will happen if start/stop option is not used and one forgot to add this magic clause into an app?

#if defined(HAVE_HPX) && !defined(HPX_STARTSTOP)
#include <hpx/hpx_main.hpp>
#endif

I think, we should eliminate start/stop option support, since it's extremely slow. And then think on how to make non-start/stop variant more convenient (or at least report a proper error when we forgot to add that include).
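
A minimal sketch of what "report a proper error" could look like, using only the standard library. The flag `g_runtime_running`, the function `parallel_for_guarded`, and the message text are all hypothetical; a real backend would query the runtime itself rather than a global flag, and this is not how the PR implements it.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical flag standing in for "the user has started the runtime".
static bool g_runtime_running = false;

// Fails early with a descriptive message instead of an obscure error
// raised from somewhere deep inside the runtime.
void parallel_for_guarded(int begin, int end,
                          const std::function<void(int)>& body) {
    assert(begin <= end);
    if (!g_runtime_running) {
        throw std::logic_error(
            "parallel_for_: the parallel runtime is not running; "
            "start it before calling OpenCV functions");
    }
    for (int i = begin; i < end; ++i) body(i);  // sequential stand-in
}
```

Calling `parallel_for_guarded` before the flag is set throws `std::logic_error` with the message above; once the flag is set, the loop body runs normally.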

@biddisco

biddisco commented Jul 16, 2018

I believe that an exception would be thrown to the effect of "Trying to call an HPX routine before the runtime has been started". It's possible that the exception would be a bit more obscure - like "runtime pointer undefined" or something of that sort. Perhaps @Jakub-Golinowski could try it and see what actually happens.

@Jakub-Golinowski

Contributor

Jakub-Golinowski commented Jul 17, 2018

As for the error reported when one forgets to include hpx_main.hpp, it depends on what will be the first call to functions depending on the runtime. For example in the mandelbrot opencv benchmark (link) one gets the following error when hpx_main.hpp is not included:

terminate called after throwing an instance of 'std::invalid_argument' what(): hpx::resource::get_partitioner() can be called only after the resource partitioner has been allowed to parse the command line options.

At the moment we have the opportunity to rework the error messages caused by not including hpx_main.hpp, as another GSoC student is working on improvements to the hpx_main implementation and I can collaborate with him on this. Moreover, including hpx_main is not the only way of starting the runtime, and we assume that users should be familiar with the following chapter of the HPX documentation: link. It describes different ways of starting the backend; some of them are easier to use and others give the user greater control over the runtime. Finally, since the runtime is the core construct in HPX, the user who decides to build OpenCV with the HPX backend will most likely be aware of the above-mentioned documentation chapter and will not try to write an application without starting the runtime.

As for the DNN benchmark, the results are presented in the following HTML file (produced with the OpenCV run.py and summary.py scripts): link. This run was on a 4-core machine with fixed CPU frequency (ensuring that results are comparable). As can be seen in the summary HTML file, OpenCV with the HPX backend has comparable or better performance than the pthreads backend in roughly 60% of tests. For the remaining tests the relative performance of HPX against pthreads is mainly in the range 5%-25%, with a few outliers.

We agree that the start/stop backend could be removed.

@vpisarev

Contributor

vpisarev commented Aug 6, 2018

@Jakub-Golinowski, thank you! The DNN perf test comparison table looks pretty good! Were the results achieved with start/stop or without?

Jakub-Golinowski added some commits Jun 27, 2018

Add HPX backend for OpenCV implementation
Adds an HPX backend for cv::parallel_for_() calls, respecting the nstripes chunking parameter. C++ code for the backend is added to modules/core/parallel.cpp. Also, the necessary changes to the CMake files are introduced.
The backend can operate in 2 versions (selectable by the CMake build option WITH_HPX_STARTSTOP): hpx (runtime always on) and hpx_startstop (start and stop the backend for each cv::parallel_for_() call).

WIP: Conditionally include hpx_main.hpp to tests in core module
Header hpx_main.hpp is included in both core/perf/perf_main.cpp and core/test/test_main.cpp.
The changes to the CMake files for linking the HPX library to the above-mentioned test executables are proposed but have issues.
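
As a rough illustration of what "respecting the nstripes chunking parameter" means, the sketch below splits a range into at most `nstripes` contiguous stripes and runs the body once per stripe. It is not the code from modules/core/parallel.cpp: `IndexRange` and `striped_parallel_for` are hypothetical names, `std::async` is used purely as a stand-in for an HPX parallel loop, and treating a non-positive `nstripes` as a single stripe is a simplification made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for cv::Range: half-open interval [start, end).
struct IndexRange {
    int start;
    int end;
};

// Splits `whole` into at most `nstripes` contiguous stripes and runs `body`
// once per stripe, each as its own task. std::async is only a stand-in for
// an HPX parallel loop.
void striped_parallel_for(IndexRange whole,
                          const std::function<void(IndexRange)>& body,
                          int nstripes) {
    assert(whole.start <= whole.end);
    const int len = whole.end - whole.start;
    if (len == 0) return;
    // Simplification: non-positive nstripes falls back to a single stripe.
    const int stripes = std::max(1, std::min(nstripes, len));

    std::vector<std::future<void>> tasks;
    tasks.reserve(stripes);
    for (int i = 0; i < stripes; ++i) {
        // Integer arithmetic spreads the remainder across the stripes.
        const int s = whole.start +
            static_cast<int>(static_cast<long long>(len) * i / stripes);
        const int e = whole.start +
            static_cast<int>(static_cast<long long>(len) * (i + 1) / stripes);
        tasks.push_back(std::async(std::launch::async, [&body, s, e] {
            body(IndexRange{s, e});
        }));
    }
    for (auto& t : tasks) t.get();  // wait for all stripes; rethrows errors
}
```

For a range of 10 elements and `nstripes = 3`, this sketch produces the stripes [0, 3), [3, 6) and [6, 10). On some toolchains, code using `std::async` needs to be linked with thread support (for example `-pthread`).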
@Jakub-Golinowski

Contributor

Jakub-Golinowski commented Aug 8, 2018

@vpisarev The results presented in the DNN perf test comparison table were achieved without start/stop, i.e. with the primary version, which requires the user to manage the runtime himself and therefore gives him maximum control over what happens in his application.

As discussed before, I dropped the start/stop backend in the most recent commit. As the tests have shown, for the common use case choosing the pthreads backend is optimal and the start/stop version is superfluous.

Summing up, the primary (and now the only) version of the HPX backend is suitable for an HPX application that uses OpenCV library. User can seamlessly integrate the OpenCV calls within his execution DAG and does not have to worry about competing backends.

@vpisarev

Contributor

vpisarev commented Aug 30, 2018

@Jakub-Golinowski, looks great! I'm fine with it 👍

@vpisarev vpisarev self-assigned this Aug 30, 2018

@alalek alalek added this to the 4.0 milestone Aug 31, 2018

@alalek alalek merged commit 9f1218b into opencv:master Aug 31, 2018

1 check passed

default Required builds passed