
Thread wakeup may be a bottleneck on systems with many cores #61

Open
xhebox opened this issue Aug 4, 2023 · 8 comments

Comments

@xhebox (Collaborator) commented Aug 4, 2023

Result: a boost from 3.6–3.8 token/s to 4.2–4.3 token/s on SG2042.

Analysis: I added Tracy to trace the execution in detail and observed that worker wakeup is problematic: some workers only pick up their task after other workers have already completed theirs.

That means the execution time is sometimes twice what it should be. I suspect this is caused by thread::yield(), which switches the thread out of the busy-waiting loop.

I am thinking about a busier, higher-performance polling loop that can still drop into an idle state while waiting for user input.

@chenqy4933 (Collaborator) commented

Can you show the trace result? Or you could comment out the thread::yield() call and trace again.

@xhebox (Collaborator, Author) commented Aug 7, 2023

> Can you show the trace result? Or you could comment out the thread::yield() call and trace again.

Later. I've modified the code base a lot. It looks like this:

main    | add_task                                     | add_task
thread1 | matmul_int8
thread2 | ................ matmul_int8   (this is the problematic part: if it could start much earlier, the total time would drop a lot)
...
threadx | ..matmul_int8

@chenqy4933 (Collaborator) commented

It seems that when a task is dispatched, not all threads are scheduled right away; some threads have yielded out. If so, commenting out thread::yield and testing again should be faster — but I tested that and got the same result.

@xhebox (Collaborator, Author) commented Aug 7, 2023

> It seems that when a task is dispatched, not all threads are scheduled right away; some threads have yielded out. If so, commenting out thread::yield and testing again should be faster — but I tested that and got the same result.

It could be an advantage of x64 processors; I mean, those processors may simply be more responsive anyway.

And yes, removing the yield works — but not always. It sometimes makes the processors too busy with atomic operations, and the resulting contention slows down the whole system. I am still experimenting with how to improve the loop.

I've got a pretty good result so far. With some other new optimizations, I got 5 token/s.

@chenqy4933 (Collaborator) commented

Great! Looking forward to your optimizations.

@xhebox (Collaborator, Author) commented Aug 7, 2023

> Great! Looking forward to your optimizations.

#62

Check out this demo. Adding ZoneScopedNS() in the lambda function of matmul will give you the trace result.

@xhebox (Collaborator, Author) commented Aug 18, 2023

@chenqy4933 Got 4.4 token/s with master. I guess the optimization works, though I did not get as high as 5 token/s.

@chenqy4933 (Collaborator) commented

> @chenqy4933 Got 4.4 token/s with master. I guess the optimization works, though I did not get as high as 5 token/s.

You can keep optimizing it; I only optimized it with the CPU-level yield.
