implement incremental lists via fusion & stepped sub-expression interpretation #25

mhuesch · 2021-05-23T14:11:26Z

this ticket sketches out a possible approach for interpreting incremental lists in a principled way.

it supersedes #12.

motivation for fusion

we want to be able to define functions which process potentially large quantities of data.

for example:

(defn mean
  (lam [xs]
    (/ (foldl + 0 xs)
       (foldl (lam [acc x] (+ acc 1)) 0 xs))))

suppose we want to apply mean a large number of values - enough to cause an out of memory error on the machine we are using if all values are initially loaded into memory. in principle, there is no reason to load all values - both folds could be interpreted in constant space, as all they have to maintain is an integer accumulating parameter.

now take another example:

(defn sqSum
  (lam [xs]
    (foldl + 0 (map (lam [x] (* x x)) xs))))

we would also like this function to run in constant space, and be able to process OOM-inducing volumes of values. however, a naive interpreter would insist on producing the incremental list of "squares" (the output of map) before feeding that to foldl. so the intermediate list would trigger an OOM and ruin our fun.

now consider a rewritten sqSum which processes everything "all in one go" and thus avoids the OOM-y intermediate list:

(defn sqSum
  (lam [xs]
    (foldl (lam [acc x] (+ acc (* x x))) 0 xs)))

this definition represents a "fusion" of the separate foldl and map of the previous definition.

goal of fusion

allow the programmer to write arbitrary pipelines of list-processing operations, and automatically combine them into (ideally) single foldls, which can be incrementally interpreted across large datasets.

motivation for "stepped sub-expression interpretation"

(working name, subject to revision, open to suggestions)

let's revisit mean:

(defn mean
  (lam [xs]
    (/ (foldl + 0 xs)
       (foldl (lam [acc x] (+ acc 1)) 0 xs))))

how would we interpret this, incrementally, over a massive dataset?

sketch of implementation approach for "stepped sub-expression interpretation"

in the above example, we really mean interpreting the body, in an environment where xs is understood to not be a normal list (represented in memory) but an iterator-stream-incremental-thing, which we can pull one value out of at a time. we want to interpret:

(/ (foldl + 0 xs)
   (foldl (lam [acc x] (+ acc 1)) 0 xs))

of note, we are performing two folds across the same list. do we need to interpret things separately and thus run the risk of duplicating calls to the outside world to provision the input values? it is likely out of scope for now, but we could perform automated detection of shared data, and interpret multiple subexpressions which consume the same data simultaneously - in this case we would interpret both foldls at the same time, feeding in each value of xs to each.

let's leave the sharing optimization aside for now. if we are ok with duplicating the "fetching" work, we could take the following approach:

(note: this part is rough)

capture the environment of (foldl + 0 xs)
capture the environment of (foldl (lam [acc x] (+ acc 1)) 0 xs)
create freshnames to represent the two folds and substitute them into the "toplevel program": (/ var_0 var_1)
interpret var_0 to a value (using the captured environment from (1))
interpret var_1 to a value (using the captured environment from (2))
substitute those into the original toplevel environment and interpret (/ var_0 var_1)

by chunking up the program into subexpressions which process lists, and interpreting those incrementally, we can then take their outputs and interpret the remaining program which depends on those values.

requirement of fusion for maximal usefulness of "stepped sub-expression interpretation" (SSEI)

note that we do need fusion to interpret the above example of mean. both foldls are in position to be interpreted incrementally.

however, the original sumSq from above is not a "single fold" - it has an intermediate list which would cause us to OOM on a large dataset.

if we were to fuse sumSq to the collapsed "single foldl" version above, we would then be able to interpret it using the approach from the previous section. it would only have 1 subexpression of interest, and no "remaining program" to interpret after that was finished.

non-necessity of fusion for early prototyping / early adopters

note that we do not need fusion for the SSEI system to be usable for complex programs. however it will be up to the programmer to manually do the work of transforming programs into "single foldl"s (as in the case of sumSq above). for small programs this may be fine, but for larger programs it may be quite tedious & error-prone.

for the SSEI system to work, all we would need is a program which is a closure/function, which receives a number of list input arguments, and performs "single foldl"s on them, reducing them to "atomic" values (i.e. non-lists).

The text was updated successfully, but these errors were encountered:

mhuesch · 2021-06-05T00:56:18Z

potential sources of bugs:

variable de-aliasing may cause variable capture

mhuesch · 2021-06-05T15:12:46Z

I now have a sketch of the toplevel API for SSEI.

it's a little funky, because we basically are evaluating "one lambda at a time", either in normal interpretation mode, or in SSEI incremental mode (for each variable), until we reach an atomic value (non lambda) which cannot be evaluated further.

example:

(defn coeffSum
  (lam [coeff]
    (lam [xs]
      (* coeff
         (foldl + 0 xs)))))

so here our first argument is a single integer, coeff, which will not be SSEI-able (because it's not a list, just a single value). our second argument, xs, could be SSEI-d, because it's a list of numbers.

they way I'm imagining we will interpret this is to first process coeff via the "normal interpreter". in this case, that will evaluate the application of a lambda to some value, creating a binding for coeff in the interpreter's environment. then we will het the inner lambda and return it.

there will be a strange dance between the caller and the callee to discover that the inner function is SSEI-able, and the caller will provide an iterator to incrementally provide those values.

mhuesch · 2021-08-18T04:29:01Z

it is looking increasingly like laziness is the way to go. which means that SSEI would get scrapped.

fusion, however, will likely be incorporated at some point.

mhuesch · 2021-08-23T05:39:56Z

the completion of #34 subsumes / eclipses / removes necessity of this.

mhuesch self-assigned this May 23, 2021

mhuesch mentioned this issue May 23, 2021

implement "incremental lists" #12

Closed

2 tasks

mhuesch mentioned this issue Aug 9, 2021

investigate lazy evaluation as a means of implementing incremental list processing #34

Closed

mhuesch closed this as completed Aug 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement incremental lists via fusion & stepped sub-expression interpretation #25

implement incremental lists via fusion & stepped sub-expression interpretation #25

mhuesch commented May 23, 2021 •

edited

mhuesch commented Jun 5, 2021

mhuesch commented Jun 5, 2021

mhuesch commented Aug 18, 2021

mhuesch commented Aug 23, 2021

implement incremental lists via fusion & stepped sub-expression interpretation #25

implement incremental lists via fusion & stepped sub-expression interpretation #25

Comments

mhuesch commented May 23, 2021 • edited

motivation for fusion

goal of fusion

motivation for "stepped sub-expression interpretation"

sketch of implementation approach for "stepped sub-expression interpretation"

requirement of fusion for maximal usefulness of "stepped sub-expression interpretation" (SSEI)

non-necessity of fusion for early prototyping / early adopters

mhuesch commented Jun 5, 2021

mhuesch commented Jun 5, 2021

mhuesch commented Aug 18, 2021

mhuesch commented Aug 23, 2021

mhuesch commented May 23, 2021 •

edited