Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] benchmark sort scalar list #3022

Closed

Conversation

mzaks
Copy link
Contributor

@mzaks mzaks commented Jun 12, 2024

This PR adds a benchmark for sorting a list of scalars.

The benchmarks are performed on the public sort and internal _insertion_sort and _small_sort functions.
The intention of the benchmark are following:

  1. Evaluate results on tiny lists (size 2 to 5) and observe how much faster the _small_sort function is compared to _insertion_sort.
  2. Evaluate result on tiny lists (size 2 to 5) and observe how much overhead does the sort function produce compared to internal functions
  3. Evaluate results on small lists (size 10 to 100) and observe how much overhead does the sort function produce compared to calling the internal insertion sort function

My observations running the benchmarks on an Apple M1:

  1. _small_sort is sometimes ~10% faster
  2. sort function performs 5 - 6x slower than internal functions
  3. sort function performs 3 - 100x slower than internal insertion sort

Suggestions:

  • We should run the benchmark on other architectures to identify if my observations are consistent.
  • Consider removing the _small_sort function if it does really provided almost no benefits compared to insertion sort.
  • Expose insertion sort as top level function, as it has some nice characteristics which users might want to utilise directly.
  • Consider simplifying the _quicksort function and maybe provide a parameter, which defines upper bound of the list size to be sorted with insertion sort.
  • Consider adding similar parameter to sort function.

@mzaks mzaks requested a review from a team as a code owner June 12, 2024 16:18
@mzaks mzaks force-pushed the feature/benchmark-sort-scalar-list branch from 76c78b4 to 7a865b0 Compare June 13, 2024 15:28
@Dan13llljws
Copy link
Contributor

Dan13llljws commented Jun 17, 2024

I ran the benchmark on Apple M3 and I got see that sort performs 3-4x worse than direct calls to internal functions for small arrays.

To reduce overhead, I added a function _delegate_small_sort() and in sort I have:

    if len <= 5:
        _delegate_small_sort[Scalar[type], _less_than_equal](buff, len)
        return
    if len < 32:
        _insertion_sort[Scalar[type], _less_than_equal](buff, 0, len)
        return
    else:
        _quicksort[Scalar[type], _less_than_equal](buff, len)

This makes sort as fast as _small_sort for $N &lt;= 5$. (I also see no benefit for using small sort instead of insertion sort).
This also makes sort as fast as _insertion_sort for $N &lt; 32$, and when $N = 32$ I observed that sort starts to perform 9-25x worse than direct calls to _insertion_sort, which strongly indicates that the overhead in _quicksort is BAD.

Then I rewrote _quicksort to use direct recursion instead of a stack. This improve the runtime to be only 2-10x worse than direct calls.

Then I rewrote _partition as follows:

    if start == end:
        return end

    var left = start
    var right = end - 2
    var pivot_value = array[end - 1]

    while True:
        while cmp_fn(array[left], pivot_value):
            left += 1
        while left < right and not cmp_fn(array[right], pivot_value):
            right -= 1
        if left >= right:
            swap(array[left], array[end - 1])
            return left
        swap(array[left], array[right])
        left += 1
        right -= 1

This makes sort as run about the same speed as _insertion_sort. When I extended your benchmark for list of size 1000, I actually see sort running in around 0.9x the time of _insertion_sort (But I was expecting much better results). Then instead picking the last element as the pivot. I decided to pick a random one:

    var pivot = int(random.random_si64(start, end - 1))
    var pivot_value = array[pivot]

which makes sort run in around 0.7x the time of _insertion_sort.

Conclusion

  • Maybe we should remove _small_sort network for now as it doesn't provide any value.
  • Rewrite _partition.
  • Either improve the overhead for stack (which is currently implemented using a List), or use direct recursion.

@mzaks
Copy link
Contributor Author

mzaks commented Jun 17, 2024

Would be great if someone could run the benchmark on Intel/AMD and other architectures.

@Dan13llljws
Copy link
Contributor

BTW, I just noticed that the random_scalar_list function returns only lists of 0 and 1s. Does this happen to anyone else?

@Dan13llljws
Copy link
Contributor

BTW, I just noticed that the random_scalar_list function returns only lists of 0 and 1s. Does this happen to anyone else?

This is likely because of #3045

@Dan13llljws
Copy link
Contributor

It will also be great if you can make everything in one chunk of tests run on same set of data. Currently, you are generating a new set of list for every iteration.

@mzaks mzaks force-pushed the feature/benchmark-sort-scalar-list branch from 7a865b0 to 21c9059 Compare June 20, 2024 07:15
@mzaks
Copy link
Contributor Author

mzaks commented Jun 20, 2024

It will also be great if you can make everything in one chunk of tests run on same set of data. Currently, you are generating a new set of list for every iteration.

Thanks for this and the other suggestions, I implemented them in one go, please have a look.

@mzaks mzaks requested a review from Dan13llljws June 20, 2024 07:17
@mzaks mzaks force-pushed the feature/benchmark-sort-scalar-list branch from 21c9059 to 959a788 Compare June 20, 2024 07:20
stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved
stdlib/benchmarks/builtin/bench_sort_scalars.mojo Outdated Show resolved Hide resolved
@Dan13llljws
Copy link
Contributor

!sync

@Dan13llljws
Copy link
Contributor

Dan13llljws commented Jun 27, 2024

Hey @mzaks do you mind fixing the merge conflict? We are ready to import this PR :)

EDIT: It's a quick fix. I did it :)

@Dan13llljws
Copy link
Contributor

@mzaks Can you reformat bench_sort_scalars.mojo with mojo format?

mzaks and others added 7 commits June 29, 2024 13:26
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
…dback by @Dan13llljws

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: Joe Loser <joeloser@fastmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: Joe Loser <joeloser@fastmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: Wangshu Jiang <59179986+Dan13llljws@users.noreply.github.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
@mzaks mzaks force-pushed the feature/benchmark-sort-scalar-list branch from b343bcd to b48a47b Compare June 29, 2024 11:28
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
@mzaks
Copy link
Contributor Author

mzaks commented Jun 29, 2024

@mzaks Can you reformat bench_sort_scalars.mojo with mojo format?

@Dan13llljws all problems should be resolved now.

@Dan13llljws
Copy link
Contributor

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Jun 29, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added merged-internally Indicates that this pull request has been merged internally merged-externally Merged externally in public mojo repo labels Jul 3, 2024
modularbot pushed a commit that referenced this pull request Jul 4, 2024
[External] [stdlib] benchmark sort scalar list

This PR adds a benchmark for sorting a list of scalars.

The benchmarks are performed on the public `sort` and internal
`_insertion_sort` and `_small_sort` functions.
The intention of the benchmark are following:
1. Evaluate results on tiny lists (size 2 to 5) and observe how much
faster the `_small_sort` function is compared to `_insertion_sort`.
2. Evaluate result on tiny lists (size 2 to 5) and observe how much
overhead does the `sort` function produce compared to internal functions
3. Evaluate results on small lists (size 10 to 100) and observe how much
overhead does the `sort` function produce compared to calling the
internal insertion sort function

My observations running the benchmarks on an Apple M1:
1. `_small_sort` is sometimes ~10% faster
2. `sort` function performs 5 - 6x slower than internal functions
3. `sort` function performs 3 - 100x slower than internal insertion sort

Suggestions:
- We should run the benchmark on other architectures to identify if my
observations are consistent.
- Consider removing the `_small_sort` function if it does really
provided almost no benefits compared to insertion sort.
- Expose insertion sort as top level function, as it has some nice
characteristics which users might want to utilise directly.
- Consider simplifying the `_quicksort` function and maybe provide a
parameter, which defines upper bound of the list size to be sorted with
insertion sort.
- Consider adding similar parameter to `sort` function.

Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com>
Closes #3022
MODULAR_ORIG_COMMIT_REV_ID: 86e412f3310544c3198b7be355cc3c527b584cc5
@modularbot
Copy link
Collaborator

Landed in 3822acd! Thank you for your contribution 🎉

@modularbot modularbot closed this Jul 4, 2024
martinvuyk pushed a commit to martinvuyk/mojo that referenced this pull request Jul 10, 2024
[External] [stdlib] benchmark sort scalar list

This PR adds a benchmark for sorting a list of scalars.

The benchmarks are performed on the public `sort` and internal
`_insertion_sort` and `_small_sort` functions.
The intention of the benchmark are following:
1. Evaluate results on tiny lists (size 2 to 5) and observe how much
faster the `_small_sort` function is compared to `_insertion_sort`.
2. Evaluate result on tiny lists (size 2 to 5) and observe how much
overhead does the `sort` function produce compared to internal functions
3. Evaluate results on small lists (size 10 to 100) and observe how much
overhead does the `sort` function produce compared to calling the
internal insertion sort function

My observations running the benchmarks on an Apple M1:
1. `_small_sort` is sometimes ~10% faster
2. `sort` function performs 5 - 6x slower than internal functions
3. `sort` function performs 3 - 100x slower than internal insertion sort

Suggestions:
- We should run the benchmark on other architectures to identify if my
observations are consistent.
- Consider removing the `_small_sort` function if it does really
provided almost no benefits compared to insertion sort.
- Expose insertion sort as top level function, as it has some nice
characteristics which users might want to utilise directly.
- Consider simplifying the `_quicksort` function and maybe provide a
parameter, which defines upper bound of the list size to be sorted with
insertion sort.
- Consider adding similar parameter to `sort` function.

Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com>
Closes modularml#3022
MODULAR_ORIG_COMMIT_REV_ID: 86e412f3310544c3198b7be355cc3c527b584cc5

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants