implement incremental lists via fusion & stepped sub-expression interpretation #25
potential sources of bugs:
I now have a sketch of the toplevel API for SSEI. it's a little funky, because we basically are evaluating "one lambda at a time", either in normal interpretation mode, or in SSEI incremental mode (for each variable), until we reach an atomic value (non lambda) which cannot be evaluated further. example:
so here our first argument is a single integer. the way I'm imagining we will interpret this is to first process there will be a strange dance between the caller and the callee to discover that the inner function is SSEI-able, and the caller will provide an iterator to incrementally provide those values.
it is looking increasingly like laziness is the way to go. which means that SSEI would get scrapped. fusion, however, will likely be incorporated at some point.
the completion of #34 subsumes / eclipses / removes the necessity of this.
this ticket sketches out a possible approach for interpreting incremental lists in a principled way.
it supersedes #12.
motivation for fusion
we want to be able to define functions which process potentially large quantities of data.
for example:
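a minimal sketch in Python (not this project's own syntax) of `mean` written as two left folds. it mirrors the `(foldl + 0 xs)` and `(foldl (lam [acc x] (+ acc 1)) 0 xs)` subexpressions that appear later in this ticket. `functools.reduce` stands in for `foldl`, and the helper names are illustrative assumptions:

```python
from functools import reduce


def foldl(f, init, xs):
    """Left fold: combine the elements of xs into one accumulator, left to right."""
    return reduce(f, xs, init)


def mean(xs):
    # xs is an ordinary in-memory list here, so it can be traversed twice.
    total = foldl(lambda acc, x: acc + x, 0, xs)  # (foldl + 0 xs)
    count = foldl(lambda acc, x: acc + 1, 0, xs)  # (foldl (lam [acc x] (+ acc 1)) 0 xs)
    return total / count


print(mean([1, 2, 3, 4]))  # 2.5
```

each fold only carries a single number as its accumulator, which is why neither one needs the whole list in memory.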
suppose we want to apply `mean` to a large number of values - enough to cause an out of memory error on the machine we are using if all values are initially loaded into memory. in principle, there is no reason to load all values - both folds could be interpreted in constant space, as all they have to maintain is an integer accumulating parameter.

now take another example:
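a Python sketch of this kind of two-stage pipeline (sum of squares via `map` followed by `foldl`), written the way a naive interpreter would run it. Python's own `map` is lazy, so the `list(...)` call is there on purpose to model the materialized intermediate list. `sum_sq_naive` is a hypothetical name:

```python
from functools import reduce


def sum_sq_naive(xs):
    # Stage 1: build the complete list of squares before doing anything else.
    squares = list(map(lambda x: x * x, xs))
    # Stage 2: fold over the fully materialized intermediate list.
    return reduce(lambda acc, x: acc + x, squares, 0)


print(sum_sq_naive([1, 2, 3]))  # 14
```

memory use here grows with the length of the input, because `squares` holds one element per input value.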
we would also like this function to run in constant space, and be able to process OOM-inducing volumes of values. however, a naive interpreter would insist on producing the entire list of "squares" (the output of `map`) before feeding that to `foldl`. so the intermediate list would trigger an OOM and ruin our fun.

now consider a rewritten `sumSq` which processes everything "all in one go" and thus avoids the OOM-y intermediate list. this definition represents a "fusion" of the separate `foldl` and `map` of the previous definition.

goal of fusion
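as a concrete target for fusion, a Python sketch of the "all in one go" sum of squares described above: the squaring moves inside the step function of a single fold, so no intermediate list exists. `sum_sq_fused` and `numbers` are hypothetical names, and the generator stands in for an input too large to hold in memory:

```python
from functools import reduce


def sum_sq_fused(xs):
    # One fold: square each element as it arrives and add it to the accumulator.
    return reduce(lambda acc, x: acc + x * x, xs, 0)


def numbers(n):
    """Yield 1..n one value at a time, without building a list."""
    for i in range(1, n + 1):
        yield i


print(sum_sq_fused(numbers(1000)))  # 333833500
```

only the accumulator and the current element are live at any point, regardless of how many values the stream produces.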
allow the programmer to write arbitrary pipelines of list-processing operations, and automatically combine them into (ideally) single `foldl`s, which can be incrementally interpreted across large datasets.

motivation for "stepped sub-expression interpretation"
(working name, subject to revision, open to suggestions)
let's revisit `mean`: how would we interpret this, incrementally, over a massive dataset?
sketch of implementation approach for "stepped sub-expression interpretation"
in the above example, we really mean interpreting the body, in an environment where `xs` is understood to not be a normal list (represented in memory) but an iterator-stream-incremental-thing, which we can pull one value out of at a time. we want to interpret `(/ (foldl + 0 xs) (foldl (lam [acc x] (+ acc 1)) 0 xs))`.

of note, we are performing two folds across the same list. do we need to interpret things separately and thus run the risk of duplicating calls to the outside world to provision the input values? it is likely out of scope for now, but we could perform automated detection of shared data, and interpret multiple subexpressions which consume the same data simultaneously - in this case we would interpret both `foldl`s at the same time, feeding in each value of `xs` to each.

let's leave the sharing optimization aside for now. if we are ok with duplicating the "fetching" work, we could take the following approach:
(note: this part is rough)

1. `(foldl + 0 xs)`
2. `(foldl (lam [acc x] (+ acc 1)) 0 xs)`
3. `(/ var_0 var_1)`

- interpret `var_0` to a value (using the captured environment from (1))
- interpret `var_1` to a value (using the captured environment from (2))
- interpret `(/ var_0 var_1)`
by chunking up the program into subexpressions which process lists, and interpreting those incrementally, we can then take their outputs and interpret the remaining program which depends on those values.
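the steps above can be sketched in Python as follows. this is an illustration under assumptions, not the project's API: `ssei`, `steps`, `remainder` and `open_stream` are hypothetical names, and the chunking into fold subexpressions is done by hand rather than discovered by an interpreter. each fold gets its own fresh stream, so the "fetching" work is duplicated, as discussed:

```python
from functools import reduce


def ssei(steps, remainder, open_stream):
    """Interpret each fold step over a fresh stream, then the remaining program.

    steps:       list of (step_function, initial_accumulator) pairs, one per
                 list-consuming subexpression
    remainder:   function of the step results (the "remaining program")
    open_stream: zero-argument callable returning a fresh iterator over the
                 input; it is called once per step
    """
    results = [reduce(f, open_stream(), init) for f, init in steps]
    return remainder(*results)


def open_xs():
    # Stand-in for an external data source that yields one value at a time.
    return iter(range(1, 101))


result = ssei(
    steps=[
        (lambda acc, x: acc + x, 0),  # var_0 = (foldl + 0 xs)
        (lambda acc, x: acc + 1, 0),  # var_1 = (foldl (lam [acc x] (+ acc 1)) 0 xs)
    ],
    remainder=lambda var_0, var_1: var_0 / var_1,  # (/ var_0 var_1)
    open_stream=open_xs,
)
print(result)  # 50.5
```

passing a stream opener rather than a single iterator is a deliberate choice in this sketch: a plain iterator would be exhausted by the first fold and leave nothing for the second.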
requirement of fusion for maximal usefulness of "stepped sub-expression interpretation" (SSEI)
note that we do not need fusion to interpret the above example of `mean`. both `foldl`s are in position to be interpreted incrementally.

however, the original `sumSq` from above is not a "single fold" - it has an intermediate list which would cause us to OOM on a large dataset.

if we were to fuse `sumSq` to the collapsed "single foldl" version above, we would then be able to interpret it using the approach from the previous section. it would only have 1 subexpression of interest, and no "remaining program" to interpret after that was finished.

non-necessity of fusion for early prototyping / early adopters
note that we do not need fusion for the SSEI system to be usable for complex programs. however it will be up to the programmer to manually do the work of transforming programs into "single foldl"s (as in the case of `sumSq` above). for small programs this may be fine, but for larger programs it may be quite tedious & error-prone.

for the SSEI system to work, all we would need is a program which is a closure/function, which receives a number of list input arguments, and performs "single foldl"s on them, reducing them to "atomic" values (i.e. non-lists).