[stdlib] benchmark sort scalar list #3022

mzaks · 2024-06-12T16:18:37Z

This PR adds a benchmark for sorting a list of scalars.

The benchmarks are performed on the public sort and internal _insertion_sort and _small_sort functions.
The intention of the benchmark are following:

Evaluate results on tiny lists (size 2 to 5) and observe how much faster the _small_sort function is compared to _insertion_sort.
Evaluate result on tiny lists (size 2 to 5) and observe how much overhead does the sort function produce compared to internal functions
Evaluate results on small lists (size 10 to 100) and observe how much overhead does the sort function produce compared to calling the internal insertion sort function

My observations running the benchmarks on an Apple M1:

_small_sort is sometimes ~10% faster
sort function performs 5 - 6x slower than internal functions
sort function performs 3 - 100x slower than internal insertion sort

Suggestions:

We should run the benchmark on other architectures to identify if my observations are consistent.
Consider removing the _small_sort function if it does really provided almost no benefits compared to insertion sort.
Expose insertion sort as top level function, as it has some nice characteristics which users might want to utilise directly.
Consider simplifying the _quicksort function and maybe provide a parameter, which defines upper bound of the list size to be sorted with insertion sort.
Consider adding similar parameter to sort function.

Dan13llljws · 2024-06-17T16:35:07Z

I ran the benchmark on Apple M3 and I got see that sort performs 3-4x worse than direct calls to internal functions for small arrays.

To reduce overhead, I added a function _delegate_small_sort() and in sort I have:

    if len <= 5:
        _delegate_small_sort[Scalar[type], _less_than_equal](buff, len)
        return
    if len < 32:
        _insertion_sort[Scalar[type], _less_than_equal](buff, 0, len)
        return
    else:
        _quicksort[Scalar[type], _less_than_equal](buff, len)

This makes sort as fast as _small_sort for $N <= 5$. (I also see no benefit for using small sort instead of insertion sort).
This also makes sort as fast as _insertion_sort for $N < 32$, and when $N = 32$ I observed that sort starts to perform 9-25x worse than direct calls to _insertion_sort, which strongly indicates that the overhead in _quicksort is BAD.

Then I rewrote _quicksort to use direct recursion instead of a stack. This improve the runtime to be only 2-10x worse than direct calls.

Then I rewrote _partition as follows:

    if start == end:
        return end

    var left = start
    var right = end - 2
    var pivot_value = array[end - 1]

    while True:
        while cmp_fn(array[left], pivot_value):
            left += 1
        while left < right and not cmp_fn(array[right], pivot_value):
            right -= 1
        if left >= right:
            swap(array[left], array[end - 1])
            return left
        swap(array[left], array[right])
        left += 1
        right -= 1

This makes sort as run about the same speed as _insertion_sort. When I extended your benchmark for list of size 1000, I actually see sort running in around 0.9x the time of _insertion_sort (But I was expecting much better results). Then instead picking the last element as the pivot. I decided to pick a random one:

    var pivot = int(random.random_si64(start, end - 1))
    var pivot_value = array[pivot]

which makes sort run in around 0.7x the time of _insertion_sort.

Conclusion

Maybe we should remove _small_sort network for now as it doesn't provide any value.
Rewrite _partition.
Either improve the overhead for stack (which is currently implemented using a List), or use direct recursion.

mzaks · 2024-06-17T18:02:22Z

Would be great if someone could run the benchmark on Intel/AMD and other architectures.

Dan13llljws · 2024-06-18T14:59:43Z

BTW, I just noticed that the random_scalar_list function returns only lists of 0 and 1s. Does this happen to anyone else?

Dan13llljws · 2024-06-18T15:08:17Z

BTW, I just noticed that the random_scalar_list function returns only lists of 0 and 1s. Does this happen to anyone else?

This is likely because of #3045

stdlib/benchmarks/builtin/bench_sort_scalars.mojo

Dan13llljws · 2024-06-19T14:40:58Z

It will also be great if you can make everything in one chunk of tests run on same set of data. Currently, you are generating a new set of list for every iteration.

mzaks · 2024-06-20T07:17:08Z

It will also be great if you can make everything in one chunk of tests run on same set of data. Currently, you are generating a new set of list for every iteration.

Thanks for this and the other suggestions, I implemented them in one go, please have a look.

stdlib/benchmarks/builtin/bench_sort_scalars.mojo

Dan13llljws · 2024-06-27T20:11:14Z

!sync

Dan13llljws · 2024-06-27T20:17:29Z

Hey @mzaks do you mind fixing the merge conflict? We are ready to import this PR :)

EDIT: It's a quick fix. I did it :)

Dan13llljws · 2024-06-28T20:44:51Z

@mzaks Can you reformat bench_sort_scalars.mojo with mojo format?

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

@Dan13llljws

…dback by @Dan13llljws Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Co-authored-by: Joe Loser <joeloser@fastmail.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

mzaks · 2024-06-29T14:23:24Z

@mzaks Can you reformat bench_sort_scalars.mojo with mojo format?

@Dan13llljws all problems should be resolved now.

Dan13llljws · 2024-06-29T19:32:05Z

!sync

modularbot · 2024-07-03T20:55:53Z

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

[External] [stdlib] benchmark sort scalar list This PR adds a benchmark for sorting a list of scalars. The benchmarks are performed on the public `sort` and internal `_insertion_sort` and `_small_sort` functions. The intention of the benchmark are following: 1. Evaluate results on tiny lists (size 2 to 5) and observe how much faster the `_small_sort` function is compared to `_insertion_sort`. 2. Evaluate result on tiny lists (size 2 to 5) and observe how much overhead does the `sort` function produce compared to internal functions 3. Evaluate results on small lists (size 10 to 100) and observe how much overhead does the `sort` function produce compared to calling the internal insertion sort function My observations running the benchmarks on an Apple M1: 1. `_small_sort` is sometimes ~10% faster 2. `sort` function performs 5 - 6x slower than internal functions 3. `sort` function performs 3 - 100x slower than internal insertion sort Suggestions: - We should run the benchmark on other architectures to identify if my observations are consistent. - Consider removing the `_small_sort` function if it does really provided almost no benefits compared to insertion sort. - Expose insertion sort as top level function, as it has some nice characteristics which users might want to utilise directly. - Consider simplifying the `_quicksort` function and maybe provide a parameter, which defines upper bound of the list size to be sorted with insertion sort. - Consider adding similar parameter to `sort` function. Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com> Closes #3022 MODULAR_ORIG_COMMIT_REV_ID: 86e412f3310544c3198b7be355cc3c527b584cc5

modularbot · 2024-07-04T17:03:55Z

Landed in 3822acd! Thank you for your contribution 🎉

[External] [stdlib] benchmark sort scalar list This PR adds a benchmark for sorting a list of scalars. The benchmarks are performed on the public `sort` and internal `_insertion_sort` and `_small_sort` functions. The intention of the benchmark are following: 1. Evaluate results on tiny lists (size 2 to 5) and observe how much faster the `_small_sort` function is compared to `_insertion_sort`. 2. Evaluate result on tiny lists (size 2 to 5) and observe how much overhead does the `sort` function produce compared to internal functions 3. Evaluate results on small lists (size 10 to 100) and observe how much overhead does the `sort` function produce compared to calling the internal insertion sort function My observations running the benchmarks on an Apple M1: 1. `_small_sort` is sometimes ~10% faster 2. `sort` function performs 5 - 6x slower than internal functions 3. `sort` function performs 3 - 100x slower than internal insertion sort Suggestions: - We should run the benchmark on other architectures to identify if my observations are consistent. - Consider removing the `_small_sort` function if it does really provided almost no benefits compared to insertion sort. - Expose insertion sort as top level function, as it has some nice characteristics which users might want to utilise directly. - Consider simplifying the `_quicksort` function and maybe provide a parameter, which defines upper bound of the list size to be sorted with insertion sort. - Consider adding similar parameter to `sort` function. Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com> Closes modularml#3022 MODULAR_ORIG_COMMIT_REV_ID: 86e412f3310544c3198b7be355cc3c527b584cc5 Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>

mzaks requested a review from a team as a code owner June 12, 2024 16:18

mzaks force-pushed the feature/benchmark-sort-scalar-list branch from 76c78b4 to 7a865b0 Compare June 13, 2024 15:28

Dan13llljws reviewed Jun 18, 2024

View reviewed changes

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

Dan13llljws reviewed Jun 18, 2024

View reviewed changes

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

Dan13llljws reviewed Jun 18, 2024

View reviewed changes

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

Dan13llljws reviewed Jun 18, 2024

View reviewed changes

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

mzaks force-pushed the feature/benchmark-sort-scalar-list branch from 7a865b0 to 21c9059 Compare June 20, 2024 07:15

mzaks requested a review from Dan13llljws June 20, 2024 07:17

mzaks force-pushed the feature/benchmark-sort-scalar-list branch from 21c9059 to 959a788 Compare June 20, 2024 07:20

JoeLoser approved these changes Jun 20, 2024

View reviewed changes

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

Dan13llljws reviewed Jun 21, 2024

View reviewed changes

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Show resolved Hide resolved

stdlib/benchmarks/builtin/bench_sort_scalars.mojo Show resolved Hide resolved

modular-automation bot assigned Dan13llljws Jun 27, 2024

mzaks and others added 7 commits June 29, 2024 13:26

Add benchmark for sorting list of scalars

cc0e16f

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Make sure that benchmarks run on the same list as suggested in PR fee…

8d33197

…dback by @Dan13llljws Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update stdlib/benchmarks/builtin/bench_sort_scalars.mojo

054a2e7

Co-authored-by: Joe Loser <joeloser@fastmail.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update stdlib/benchmarks/builtin/bench_sort_scalars.mojo

ff463c3

Co-authored-by: Joe Loser <joeloser@fastmail.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update bench_sort_scalars.mojo

e0db070

Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update bench_sort_scalars.mojo

9f6e639

Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update bench_sort_scalars.mojo

b48a47b

Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

mzaks force-pushed the feature/benchmark-sort-scalar-list branch from b343bcd to b48a47b Compare June 29, 2024 11:28

format code

ec8a8de

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Remove unnecessary comments

ed440a2

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

modularbot added the imported-internally Signals that a given pull request has been imported internally. label Jun 29, 2024

modularbot added merged-internally Indicates that this pull request has been merged internally merged-externally Merged externally in public mojo repo labels Jul 3, 2024

modularbot closed this Jul 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[stdlib] benchmark sort scalar list #3022

[stdlib] benchmark sort scalar list #3022

mzaks commented Jun 12, 2024

Dan13llljws commented Jun 17, 2024 •

edited

Loading

mzaks commented Jun 17, 2024

Dan13llljws commented Jun 18, 2024

Dan13llljws commented Jun 18, 2024

Dan13llljws commented Jun 19, 2024

mzaks commented Jun 20, 2024

Dan13llljws commented Jun 27, 2024

Dan13llljws commented Jun 27, 2024 •

edited

Loading

Dan13llljws commented Jun 28, 2024

mzaks commented Jun 29, 2024

Dan13llljws commented Jun 29, 2024

modularbot commented Jul 3, 2024

modularbot commented Jul 4, 2024

[stdlib] benchmark sort scalar list #3022

[stdlib] benchmark sort scalar list #3022

Conversation

mzaks commented Jun 12, 2024

Dan13llljws commented Jun 17, 2024 • edited Loading

mzaks commented Jun 17, 2024

Dan13llljws commented Jun 18, 2024

Dan13llljws commented Jun 18, 2024

Dan13llljws commented Jun 19, 2024

mzaks commented Jun 20, 2024

Dan13llljws commented Jun 27, 2024

Dan13llljws commented Jun 27, 2024 • edited Loading

Dan13llljws commented Jun 28, 2024

mzaks commented Jun 29, 2024

Dan13llljws commented Jun 29, 2024

modularbot commented Jul 3, 2024

modularbot commented Jul 4, 2024

Dan13llljws commented Jun 17, 2024 •

edited

Loading

Dan13llljws commented Jun 27, 2024 •

edited

Loading