Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Add OpenMP support for widget rendering (#118) #189

Merged
merged 13 commits into from
Dec 20, 2019

Conversation

hyugabokko
Copy link
Contributor

@hyugabokko hyugabokko commented Dec 11, 2019

Purpose

Add OpenMP support for widget rendering (#118) .

Changes

  • Modified src/Makefile.am and added compiler option.
  • Modified src/display.c and added preprocessor directives for OpenMP.
  • To check whether multithreading is possible, the string is output to standard output. However, I will remove it before merging.

Screenshots

  • When OpenMP is not supported(change before)
    openmp_is_not_supported

  • When OpenMP is supported(change after)
    openmp_is_supported

I printed thread_num to confirm that multiple threads were used.
If there are no problems with the review, I will remove the thread_num output.

To reproduce

$ make
$ make test

In make test, there is only one rectangle in rects, the execution time was slightly longer due to overhead.
I measured it with time make test.


IssueHunt Summary

Referenced issues

This pull request has been submitted to:


IssueHunt has been backed by the following sponsors. Become a sponsor

- Modified src/Makefile.am and added compiler option.
- Modified src/display.c and added preprocessor directives for OpenMP.
- To check whether multithreading is possible, the string is output to standard output. However, I will remove it before merging.
Copy link
Owner

@lc-soft lc-soft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The openMP feature should give the user the option to turn it on or off, for example:

# OpenMP is enabled by default
./configure

# Disable OpenMP
./configure --with-openmp=no

@@ -30,6 +30,7 @@
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a test program to compare the performance before and after OpenMP is enabled.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up with a suitable test:

  • Set window size to 1920x1080
  • Create 21384 widgets, they're all 10x10 in size
  • Render 1000 frames
  • For each frame, set the background color of each widget randomly

This is an optional task, if you are interested in it, you can try do it.

src/Makefile.am Outdated
@@ -1,5 +1,5 @@
AUTOMAKE_OPTIONS=foreign
AM_CFLAGS = -I$(abs_top_srcdir)/include $(CODE_COVERAGE_CFLAGS)
AM_CFLAGS = -I$(abs_top_srcdir)/include $(CODE_COVERAGE_CFLAGS) -fopenmp
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be an option for the configure script, you can refer to this:

LCUI/configure.ac

Lines 62 to 75 in 460354c

# pthread
want_thread=no
thread_name=pthread
AC_ARG_WITH(pthread, AC_HELP_STRING([--with-pthread], [use pthread (default)]))
if test "x$with_pthread" != "xno"; then
AX_PTHREAD([
want_thread=yes
CC="$PTHREAD_CC"
CFLAGS="$CFLAGS $PTHREAD_CFLAGS"
PACKAGE_LIBS="$PACKAGE_LIBS $PTHREAD_LIBS"
AC_DEFINE_UNQUOTED([LCUI_THREAD_PTHREAD], 1, [Define to 1 if you are using pthread support.])
], [AC_MSG_ERROR([The support could not be configured for the POSIX thread programming interface.])])
fi

@hyugabokko
Copy link
Contributor Author

Thank you for your review and suggestions for solutions!
I am a beginner in OSS development, so it would be very helpful if you could make such a review!

  • Enabling or disabling OpenMP with the configure option
    I have an outlook on modifying. Please wait for a little while longer.

  • Compare the performance before and after OpenMP is enabled
    I think it will take quite a while to implement.
    It would be helpful if you submit a new issue so that anyone other than me could implement it.

@lc-soft
Copy link
Owner

lc-soft commented Dec 12, 2019

I think it will take quite a while to implement.
It would be helpful if you submit a new issue so that anyone other than me could implement it.

@d4yvector This task is simple, I'll take the time to do it.

@lc-soft
Copy link
Owner

lc-soft commented Dec 12, 2019

I pushed a test program into your branch, you can run it to check rendering performance.

hyugabokko@d42a4e8

Widget_Append(root, box);
Widget_Append(root, status);

#ifdef WITH_WINDOW
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to experience the actual rendering effect, you can define the WITH_WINDOW macro

@lc-soft
Copy link
Owner

lc-soft commented Dec 13, 2019

The current test program is not suitable for testing OpenMP performance, when there is only one render area of 1920x1080 size, only one thread is working.

I 'll make some changes to split it into several small areas, for example, split 1920x1080 area into four areas: (0, 0, 960, 540), (960, 0, 960, 540), (0, 540, 960, 540), (960, 540, 960, 540), and assign them to each thread for rendering.

@hyugabokko
Copy link
Contributor Author

hyugabokko commented Dec 13, 2019

Hi, @lc-soft

The openMP feature should give the user the option to turn it on or off, for example:

# OpenMP is enabled by default
./configure

# Disable OpenMP
./configure --with-openmp=no

I modified configure.ac to add OpenMP configure option (b82315b).

I pushed a test program into your branch, you can run it to check rendering performance.

d4yvector@d42a4e8

Thank you for your test program.

I 'll make some changes to split it into several small areas, for example, split 1920x1080 area into four areas: (0, 0, 960, 540), (960, 0, 960, 540), (0, 540, 960, 540), (960, 540, 960, 540), and assign them to each thread for rendering.

I will wait for your changes.

src/display.c Outdated
for (LinkedList_Each(rn, &rects)) {
#ifdef USE_OPENMP
#pragma omp task firstprivate(rn)
printf("thread_num: %d\n", omp_get_thread_num());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printf() takes a long time to call, you should remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed printf() in d5cf404.

src/display.c Outdated
@@ -27,9 +27,14 @@
* POSSIBILITY OF SUCH DAMAGE.
*/

#include <LCUI/config.h>
Copy link
Owner

@lc-soft lc-soft Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config.h file is for internal use, you should use #include "config.h"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed include path of config.h in d5cf404.

@hyugabokko hyugabokko marked this pull request as ready for review December 13, 2019 11:29
Copy link
Owner

@lc-soft lc-soft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiling with vs2017 will output an error:

error C3001: 'taskwait' : expected an OpenMP directive name

Maybe you should refer to this: https://stackoverflow.com/questions/23545930/openmp-tasks-in-visual-studio

src/display.c Outdated
#endif
/* Repaint dirty rectangles of surface */
for (count = 0, i = 0; i < 4; ++i) {
count += LCUIDisplay_RenderSurfaceEx(record, &rects_group[i]);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d4yvector the changes have been pushed, check here.

@lc-soft
Copy link
Owner

lc-soft commented Dec 16, 2019

First performance test results

Run the test three times

OpenMP disabled:

rendered 120 frames in 17.10s, rendering speed is 7.02 fps
rendered 120 frames in 17.73s, rendering speed is 6.77 fps
rendered 120 frames in 17.20s, rendering speed is 6.98 fps

OpenMP enabled:

rendered 120 frames in 23.50s, rendering speed is 5.11 fps
rendered 120 frames in 24.32s, rendering speed is 4.93 fps
rendered 120 frames in 25.01s, rendering speed is 4.80 fps

The effect is not good

@hyugabokko
Copy link
Contributor Author

I have a question about a for loops to be parallelized.

LCUI/src/display.c

Lines 186 to 198 in 05fff83

LinkedList_Init(&rects_group[0]);
LinkedList_Init(&rects_group[1]);
LinkedList_Init(&rects_group[2]);
LinkedList_Init(&rects_group[3]);
SurfaceRecord_DumpRects(record, rects_group);
#ifdef USE_OPENMP
#pragma omp parallel
#pragma omp single
#endif
/* Repaint dirty rectangles of surface */
for (count = 0, i = 0; i < 4; ++i) {
count += LCUIDisplay_RenderSurfaceEx(record, &rects_group[i]);
}

I think the characteristics of loops that are effective for parallelization are:

  • case1. Huge number of loops
  • case2. Although the number of loops is small, one loop takes a lot of time

However, the number of loops is 4 in this loop, so case1 is not satisfied.
Also, it does not take much time for one loop because rects_group[x].length is all 1, so case2 is not satisfied.

Similarly for the loop pointed out in #118, I think the performance test (test/test_render.c) can't measure the effect of parallelizing these loops.

t = clock();
for (i = 0; i < 120; ++i) {
UpdateFrame(box);
LCUIWidget_Update();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d4yvector

I found that the performance of LCUiWidget_Update() is very low.

  ...
  UpdateFrame(box);
  LCUIWidget_Update();
+ continue;
  LCUIDisplay_Update();
 ...

After adding the continue statement, the program took 11.48 seconds to run.

rendered 120 frames in 11.48s, rendering speed is 10.45 fps

Please wait for me to optimize it.

@lc-soft
Copy link
Owner

lc-soft commented Dec 18, 2019

Performance test results

I updated the test code:

  • Number of widgets reduced to 3600
  • Resize window to 1600x900

OpenMP disabled:

running rendering performance test
rendered 120 frames in 6.47s, rendering speed is 18.55 fps

OpenMP enabled:

running rendering performance test
rendered 120 frames in 6.48s, rendering speed is 18.52 fps

It looks the same, maybe there are still problems to be solved.


rectArray = (LCUI_Rect **)malloc(sizeof(LCUI_Rect*) * rects.length);
i = 0;
for (LinkedList_Each(node, &rects)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier I was curious how OpenMP would traverse the nodes in the linked list. Do your changes mean that the array is more suitable for OpenMP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is.
LinkedList_Each() macro's loop form does not follow "Canonical Loop Form" defined in OpenMP.
I rewritten the loop form to "Canonical Loop Form" by converting the rects linked list to an array.
This is a change to address the error in the previous review (#189 (review)) and is not related to performance.
Would an you try compiling with vs2017?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some debug message output, OpenMP does work.

rects: 4
rect: (0,0,800,450), thread: 0
rect: (800,0,800,450), thread: 1
rect: (0,450,800,450), thread: 2
rect: (800,450,800,450), thread: 3
count: 3609
rects: 4
rect: (800,0,800,450), thread: 1
rect: (0,0,800,450), thread: 0
rect: (0,450,800,450), thread: 2
rect: (800,450,800,450), thread: 3
count: 3609
rects: 4
rect: (0,0,800,450), thread: 0
rect: (800,450,800,450), thread: 3
rect: (800,0,800,450), thread: 1
rect: (0,450,800,450), thread: 2
count: 3609
rects: 4
rect: (0,0,800,450), thread: 0
rect: (800,450,800,450), thread: 3
rect: (800,0,800,450), thread: 1
rect: (0,450,800,450), thread: 2
count: 3609

Maybe some code in the renderer is affecting performance, I will debug the performance of the renderer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@hyugabokko
Copy link
Contributor Author

To check if the parallelized loop is working as expected, I displayed the thread number, loop number, and time taken for one loop in each thread.

$ ./test/test_render 
running rendering performance test
thread_num: 0, i: 0, time: 10
thread_num: 2, i: 2, time: 18
thread_num: 3, i: 3, time: 26
thread_num: 1, i: 1, time: 34

thread_num: 0, i: 0, time: 8
thread_num: 2, i: 2, time: 19
thread_num: 3, i: 3, time: 27
thread_num: 1, i: 1, time: 35

thread_num: 2, i: 2, time: 5
thread_num: 1, i: 1, time: 24
thread_num: 3, i: 3, time: 31
thread_num: 0, i: 0, time: 36

I think parallelized loop is working as expected from the thread number and loop number.

However, as the loop number increases, the time required for one loop increases.
The variable for time measurement is not shared by each thread, so this is not the accumulated time in all loops but the time in one loop.

Therefore, execution time may be uneven depending on the drawing area each thread is responsible for.
So, the same display was made with parallelization disabled.

$ ./test/test_render 
running rendering performance test
thread_num: 0, i: 0, time: 9
thread_num: 0, i: 1, time: 8
thread_num: 0, i: 2, time: 7
thread_num: 0, i: 3, time: 6

thread_num: 0, i: 0, time: 8
thread_num: 0, i: 1, time: 8
thread_num: 0, i: 2, time: 8
thread_num: 0, i: 3, time: 6

thread_num: 0, i: 0, time: 7
thread_num: 0, i: 1, time: 9
thread_num: 0, i: 2, time: 5
thread_num: 0, i: 3, time: 6

However, it was found that even if the loop number increased, the time taken for one loop was not biased.
One possible cause of this result is waiting in the next thread until the previous thread terminates.
Do you have any idea what causes the waiting time?

@lc-soft
Copy link
Owner

lc-soft commented Dec 18, 2019

@d4yvector Does malloc() and free() make threads wait?

@lc-soft
Copy link
Owner

lc-soft commented Dec 18, 2019

LCUIWidget_Update() has low performance.

openmp-enabled
image

static size_t LCUIDisplay_RenderSurface(SurfaceRecord record)
{
size_t count = 0;
int i;
Copy link
Owner

@lc-soft lc-soft Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move int i to top, and change to size_t i = 0

LinkedList rects;
LinkedListNode *node;
LCUI_BOOL can_render;
LCUI_Rect **rectArray;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename rectArray to rect_array

LCUI_PaintContext paint;
LCUI_SysEventRec ev;
LinkedList rects;
LinkedListNode *node;
Copy link
Owner

@lc-soft lc-soft Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move LinkedList to bottom, like this:

LCUI_BOOL can_render;
LCUI_Rect **rectArray;
LinkedList rects;
LinkedListNode *node;

SurfaceRecord_DumpRects(record, &rects);

rectArray = (LCUI_Rect **)malloc(sizeof(LCUI_Rect*) * rects.length);
i = 0;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line, because the initial value of the `i' variable is already 0.

- i = 0;

@hyugabokko
Copy link
Contributor Author

Although it is different from the performance improvement of LCUIWidget_Update(),
I found that the cause of waiting for other threads in OpenMP parallelization is in Surface_BeginPaint().
I guess that the actual function executed is X11Surface_BeginPaint() in the case of Ubuntu.
I think that the cause of waiting other threads is that LCUIMutex_Lock() in X11Surface_BeginPaint() locks the surface.

If this problem is not solved, it will not be possible to improve performance by parallelizing the requested in #118.
If you split a surface into multiples, such as rects, you might be able to assign separate surfaces to different threads to avoid locking (though I don't know how to achieve this).
Do you have any other ideas?

@lc-soft lc-soft merged commit f08bb7e into lc-soft:develop Dec 20, 2019
lc-soft added a commit that referenced this pull request Jan 19, 2020
* perf: Add OpenMP support for widget rendering (#118)

- Modified src/Makefile.am and added compiler option.
- Modified src/display.c and added preprocessor directives for OpenMP.
- To check whether multithreading is possible, the string is output to standard output. However, I will remove it before merging.

* test: add rendering performance test

* build: add OpenMP configure option

* refactor: change include path of `config.h` and remove printf()

* perf: split the dirty rectangles into four parts for rendering

* fix(linux): missing surface size access method

* refactor(display): update dirty rectangle collection method

* test: update test_render.c

* build: add vs project file for test render

* fix(display): Convert `rects` list to array and follow a "Canonical Loop Form" defined in OpenMP

* refactor: Change where variable i is initialized

* fix: Widget_GenerateHash() not work

* test: improved widget update performance

Co-authored-by: Liu <lc-soft@live.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants