
Ideas for improving performance of transducers and reducers #351

Open
jackfirth opened this issue Nov 15, 2019 · 12 comments
Labels
performance This is related to the performance of a Rebellion library very complex This is hard and will take a lot of careful thought. Write lots of tests and docs.

Comments

jackfirth (Owner) commented Nov 15, 2019

  • Make variants intern themselves upon construction when they're wrapping values that are interned, like fixnums, symbols, booleans, keywords, and singletons. Could turn operations like into-nth and into-sum into zero-allocation operations (sometimes), and it will make all stateless transitions (i.e. returning (variant #:<kw> #f) from a state transition handler) zero-allocation.

  • Consider adding a way for users to force a variant to be interned, so that transducer/reducer implementations that use mutable state can be zero-allocation.

  • Add special case dispatching to transduce and reduce-all when the input sequence is a list. Possibly add special case dispatching to transduce when the #:into argument is into-list.

  • Use transducer composition to implement transduce, instead of repeatedly folding over the input sequence with in-transduced. This should remove a bunch of generic interface dynamic dispatch overhead.

  • Add no-contract submodules and use them in various places.
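The list fast-path idea above could be sketched roughly like this. This is illustrative only: `reduce-all*` is a hypothetical stand-in, not Rebellion's actual `reduce-all`, and the "generic path" here is just `for/fold` over an arbitrary sequence rather than Rebellion's real reducer protocol.

```racket
#lang racket/base

;; Hypothetical sketch of special-case dispatch: when the input is a
;; list, fold over the pairs directly instead of going through the
;; generic sequence machinery.
(define (reduce-all* f init seq)
  (if (list? seq)
      (for/fold ([acc init]) ([x (in-list seq)]) (f acc x))  ; tight list loop
      (for/fold ([acc init]) ([x seq]) (f acc x))))          ; generic sequence path

(reduce-all* + 0 '(1 2 3 4))  ; → 10
```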

@jackfirth jackfirth added the performance This is related to the performance of a Rebellion library label Nov 15, 2019
@jackfirth jackfirth added this to To do in Stream Processing Library via automation Nov 15, 2019
jackfirth (Owner, Author):

cc @samdphillips and @rocketnia

jackfirth (Owner, Author) commented Nov 15, 2019

Another idea: make variant a macro that statically expands into a direct call to the internal variant constructor when it's used in a first-order way. So (variant #:foo (some-expression)) would expand into (constructor:variant '#:foo (some-expression)). When used in a higher-order fashion, like with keyword-apply, then fall back to the normal function-like behavior.

This would reduce the need to go through the make-keyword-function slow path, and eliminate the need to check that only one keyword argument is given. Since variants are often constructed by transducers and reducers in tight loops, and the current constructor is never used in a higher-order fashion, that could save a significant chunk of time.
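The proposed macro might look something like the following sketch. Note that `constructor:variant` and `variant-procedure` are stand-in definitions for illustration, not Rebellion's actual internals.

```racket
#lang racket/base
(require (for-syntax racket/base))

;; Stand-ins for Rebellion's internal constructor and the ordinary
;; keyword-function fallback (hypothetical, for illustration only).
(define (constructor:variant kw v) (cons kw v))
(define variant-procedure
  (make-keyword-procedure
   (lambda (kws kw-args) (constructor:variant (car kws) (car kw-args)))))

;; First-order uses like (variant #:foo expr) expand directly to the
;; internal constructor, skipping keyword-argument dispatch entirely.
(define-syntax (variant stx)
  (syntax-case stx ()
    [(_ kw arg)
     (keyword? (syntax-e #'kw))
     #'(constructor:variant 'kw arg)]       ; static fast path
    [(_ args ...) #'(variant-procedure args ...)]
    [_ #'variant-procedure]))               ; higher-order fallback

(variant #:foo 42)  ; expands to (constructor:variant '#:foo 42)
```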

jackfirth (Owner, Author):

See also #287 and #301.

samdphillips (Collaborator) commented Nov 15, 2019

Here is a branch (in my fork) where I used no contract versions of rebellion/collection/list and rebellion/base/comparators from the sorting transducer.

I don't have the numbers handy, but IIRC in the sorting case contracts were taking about 60% of the running time. I'm planning on running the contract profiler again tomorrow on some test programs, maybe to see how something like (transduce (in-range 10000) (mapping values) #:into (into-for-each void)) performs.

samdphillips (Collaborator) commented Nov 16, 2019

EDIT: the headline of this should probably be Running time is 57.74% contracts

I tested this program with the contract profiler. The changes I made in the branch above were against a sorting program and don't improve this baseline transducer case, so there are still contracts that can be removed, or at least discussed. Here are the profile results for this program.

#lang racket/base

(require rebellion/streaming/reducer
         rebellion/streaming/transducer)

(transduce (in-range 1000000)
           (mapping values)
           #:into
           (into-for-each void))

jackfirth (Owner, Author) commented Nov 16, 2019

Yikes, checking the consumer contract of (mapping values) takes 5 seconds on its own, and the emitter contract takes another 4 seconds. Those checks would be repeated for every transducer in the chain, so I suspect this:

(transduce (in-range 1000000)
           (mapping values)
           (mapping values)
           (mapping values)
           (mapping values)
           (mapping values)
           #:into
           (into-for-each void))

...would add at least another 40 seconds or so. Maybe we need transducer fusion?
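For adjacent `mapping` steps, the fusion idea could be as simple as function composition, so each element crosses the contracted transducer boundary once instead of twice. This is illustrative only, not Rebellion's actual fusion mechanism (which is discussed in #358), and `fuse-mappings` is a hypothetical helper name:

```racket
#lang racket/base
(require rebellion/streaming/transducer)

;; Illustrative fusion rule: two adjacent `mapping` transducers
;; collapse into one, halving the per-element contract checks.
(define (fuse-mappings f g)
  (mapping (lambda (x) (g (f x)))))
```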

jackfirth (Owner, Author):

Expanded on an idea for transducer fusion in #358.

samdphillips (Collaborator):

Transducing (mapping values) five times:

Running time is 58.62% contracts
131420/224180 ms

jackfirth (Owner, Author):

My goodness

@jackfirth jackfirth added the very complex This is hard and will take a lot of careful thought. Write lots of tests and docs. label Nov 17, 2019
jackfirth (Owner, Author):

I've written a small microbenchmarking library called Atomichron. Hopefully we can use this to get a better sense of the performance problems in Rebellion.

jackfirth (Owner, Author) commented Dec 12, 2019

I made an example Atomichron benchmark that measures the performance of (transduce <some vector> #:into into-list) compared to (vector->list <some vector>). For a vector with 1000 elements on my machine:

  • vector->list takes 30.1 microseconds
  • The transducer-based implementation takes 61.5 milliseconds (that's 2000x slower!)
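A rough version of that comparison using plain `time` (Atomichron's actual benchmark API may differ, and absolute numbers will vary by machine):

```racket
#lang racket/base
(require rebellion/streaming/reducer
         rebellion/streaming/transducer)

;; Compare direct conversion against the transducer-based pipeline
;; for a 1000-element vector.
(define vec (build-vector 1000 values))
(time (void (vector->list vec)))
(time (void (transduce vec #:into into-list)))
```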

samdphillips added a commit to samdphillips/rebellion that referenced this issue Dec 17, 2019
Improves jackfirth#351

Because there are contracts on both the creator and the accessors for
the transducer callbacks, each applies the same contract to the callbacks.
This change weakens the contract on the accessors, and should not change
any contractual obligations provided transducers are created with
`make-transducer`.

This provides modest performance improvements:

- [Vector to List - 1000000 elements](https://gist.github.com/samdphillips/92dd56e1fc78884230dbc59aabf6f1c0#file-vector-list-rkt): 32s -> 22.4s
- [1000000 Range to Void](https://gist.github.com/samdphillips/92dd56e1fc78884230dbc59aabf6f1c0#file-identity-rkt): 30s -> 19.8s
- [1000000 Range pass through 5 `mapping` to Void](https://gist.github.com/samdphillips/92dd56e1fc78884230dbc59aabf6f1c0#file-map5-rkt): 283s -> 206s
samdphillips (Collaborator):

Linking performance benchmarks Jack has written here, so we can use them for comparisons.

https://gist.github.com/jackfirth/9596dd48536344b4b94c1938a36de4c9
