
Thread wakeup may be a bottleneck on systems with many cores #61

Open
xhebox opened this issue Aug 4, 2023 · 8 comments

Comments

@xhebox (Collaborator) commented Aug 4, 2023

Result: a boost from 3.6–3.8 token/s to 4.2–4.3 token/s on SG2042.

Analysis: I added Tracy to trace the execution in detail and observed that worker wakeup is problematic: some workers only pick up their task after other workers have already completed theirs.

That means the execution time is sometimes twice what it should be. I suspect this is caused by thread::yield(), which switches the thread out of the busy-waiting loop.

I am thinking about a busier, higher-performance polling loop that can still drop into an idle state while waiting for user input.

@chenqy4933 (Collaborator) commented

Can you show the trace result? Or you could comment out the thread::yield() call and trace again.

@xhebox (Collaborator, Author) commented Aug 7, 2023

> Can you show the trace result? Or you could comment out the thread::yield() call and trace again.

Later. I've modified the code base a lot. It looks like this:

main    | add_task                                     | add_task
thread1 | matmul_int8
thread2 | ................ matmul_int8   (this is the problematic part: if it could start much earlier, the total time would drop a lot)
...
threadx | ..matmul_int8

@chenqy4933 (Collaborator) commented

It seems that when a task is dispatched, not all threads are scheduled right away; some threads have yielded out. If so, commenting out thread::yield and testing again should be faster — but I tested that and got the same result.

@xhebox (Collaborator, Author) commented Aug 7, 2023

> It seems that when a task is dispatched, not all threads are scheduled right away; some threads have yielded out. If so, commenting out thread::yield and testing again should be faster — but I tested that and got the same result.

It could be an advantage of x64 processors; I mean, those processors may simply be more responsive anyway.

And yes, removing the yield works — but not always. It sometimes makes the processors too busy with atomic operations, and the resulting contention slows down the whole system. I am still experimenting with how to improve the loop.

I've got a pretty good result so far. With some other new optimizations, I got 5 token/s.

@chenqy4933 (Collaborator) commented

Great! Looking forward to your optimizations.

@xhebox (Collaborator, Author) commented Aug 7, 2023

> Great! Looking forward to your optimizations.

#62

Check out this demo. Adding ZoneScopedNS() in the lambda function of matmul will give you the trace result.

@xhebox (Collaborator, Author) commented Aug 18, 2023

@chenqy4933 Got 4.4 token/s with master. I guess the optimization works, though I did not get as high as 5 token/s.

@chenqy4933 (Collaborator) commented

> @chenqy4933 Got 4.4 token/s with master. I guess the optimization works, though I did not get as high as 5 token/s.

You can keep optimizing it; I only optimized it with the CPU-level yield.
