Conversation

@sandeepnmenon (Collaborator) commented Apr 8, 2024

Details before change

Loss

python_time_ws=2_rk=0.log
python_time_ws=2_rk=1.log

GPU time

gpu_time_ws=2_rk=0.log
gpu_time_ws=2_rk=1.log

Details after change

Loss

python_time_ws=2_rk=0.log
python_time_ws=2_rk=1.log

GPU time

gpu_time_ws=2_rk=0.log
gpu_time_ws=2_rk=1.log

TODO

Use a prefix sum to optimize the computation of block_count from compute_locally.
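As a rough illustration of the TODO, here is a minimal sketch of deriving a block count from a mask with a prefix sum. It assumes compute_locally is a per-tile boolean mask marking the tiles this rank processes, and that block_count is the number of set entries; neither assumption is confirmed by this PR. The helper name compact_block_indices is hypothetical, and the real code would presumably do this on the GPU rather than in pure Python.

```python
from itertools import accumulate


def compact_block_indices(compute_locally):
    """Hypothetical helper: count locally computed tiles with a prefix sum.

    ASSUMPTION: compute_locally is a flat sequence of booleans, one per tile.
    Returns (block_count, offsets), where offsets[i] is the number of locally
    computed tiles that come before tile i (an exclusive prefix sum).
    """
    flags = [1 if flag else 0 for flag in compute_locally]
    # Inclusive prefix sum: inclusive[i] = number of set flags in flags[0..i].
    inclusive = list(accumulate(flags))
    # The last inclusive value is the total number of local tiles.
    block_count = inclusive[-1] if inclusive else 0
    # Exclusive prefix sum: subtract each tile's own flag from its inclusive sum.
    offsets = [total - flag for total, flag in zip(inclusive, flags)]
    return block_count, offsets


mask = [True, False, True, True, False]
block_count, offsets = compact_block_indices(mask)
```

For a tile whose flag is set, offsets[i] doubles as its slot in a compacted list of local tiles, so a single scan yields both the count and the compaction indices.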

@TarzanZhao (Collaborator)

The modification looks good. It speeds things up as expected and the code is clean. We should merge it.

@TarzanZhao TarzanZhao merged commit d898cb4 into dist Apr 12, 2024
