Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement incremental lists via fusion & stepped sub-expression interpretation #25

Closed
mhuesch opened this issue May 23, 2021 · 4 comments
Assignees

Comments

@mhuesch
Copy link
Contributor

mhuesch commented May 23, 2021

this ticket sketches out a possible approach for interpreting incremental lists in a principled way.

it supersedes #12.

motivation for fusion

we want to be able to define functions which process potentially large quantities of data.

for example:

(defn mean
  (lam [xs]
    (/ (foldl + 0 xs)
       (foldl (lam [acc x] (+ acc 1)) 0 xs))))

suppose we want to apply mean a large number of values - enough to cause an out of memory error on the machine we are using if all values are initially loaded into memory. in principle, there is no reason to load all values - both folds could be interpreted in constant space, as all they have to maintain is an integer accumulating parameter.

now take another example:

(defn sqSum
  (lam [xs]
    (foldl + 0 (map (lam [x] (* x x)) xs))))

we would also like this function to run in constant space, and be able to process OOM-inducing volumes of values. however, a naive interpreter would insist on producing the incremental list of "squares" (the output of map) before feeding that to foldl. so the intermediate list would trigger an OOM and ruin our fun.

now consider a rewritten sqSum which processes everything "all in one go" and thus avoids the OOM-y intermediate list:

(defn sqSum
  (lam [xs]
    (foldl (lam [acc x] (+ acc (* x x))) 0 xs)))

this definition represents a "fusion" of the separate foldl and map of the previous definition.

goal of fusion

allow the programmer to write arbitrary pipelines of list-processing operations, and automatically combine them into (ideally) single foldls, which can be incrementally interpreted across large datasets.

motivation for "stepped sub-expression interpretation"

(working name, subject to revision, open to suggestions)

let's revisit mean:

(defn mean
  (lam [xs]
    (/ (foldl + 0 xs)
       (foldl (lam [acc x] (+ acc 1)) 0 xs))))

how would we interpret this, incrementally, over a massive dataset?

sketch of implementation approach for "stepped sub-expression interpretation"

in the above example, we really mean interpreting the body, in an environment where xs is understood to not be a normal list (represented in memory) but an iterator-stream-incremental-thing, which we can pull one value out of at a time. we want to interpret:

(/ (foldl + 0 xs)
   (foldl (lam [acc x] (+ acc 1)) 0 xs))

of note, we are performing two folds across the same list. do we need to interpret things separately and thus run the risk of duplicating calls to the outside world to provision the input values? it is likely out of scope for now, but we could perform automated detection of shared data, and interpret multiple subexpressions which consume the same data simultaneously - in this case we would interpret both foldls at the same time, feeding in each value of xs to each.

let's leave the sharing optimization aside for now. if we are ok with duplicating the "fetching" work, we could take the following approach:

(note: this part is rough)

  1. capture the environment of (foldl + 0 xs)
  2. capture the environment of (foldl (lam [acc x] (+ acc 1)) 0 xs)
  3. create freshnames to represent the two folds and substitute them into the "toplevel program": (/ var_0 var_1)
  4. interpret var_0 to a value (using the captured environment from (1))
  5. interpret var_1 to a value (using the captured environment from (2))
  6. substitute those into the original toplevel environment and interpret (/ var_0 var_1)

by chunking up the program into subexpressions which process lists, and interpreting those incrementally, we can then take their outputs and interpret the remaining program which depends on those values.

requirement of fusion for maximal usefulness of "stepped sub-expression interpretation" (SSEI)

note that we do need fusion to interpret the above example of mean. both foldls are in position to be interpreted incrementally.

however, the original sumSq from above is not a "single fold" - it has an intermediate list which would cause us to OOM on a large dataset.

if we were to fuse sumSq to the collapsed "single foldl" version above, we would then be able to interpret it using the approach from the previous section. it would only have 1 subexpression of interest, and no "remaining program" to interpret after that was finished.

non-necessity of fusion for early prototyping / early adopters

note that we do not need fusion for the SSEI system to be usable for complex programs. however it will be up to the programmer to manually do the work of transforming programs into "single foldl"s (as in the case of sumSq above). for small programs this may be fine, but for larger programs it may be quite tedious & error-prone.

for the SSEI system to work, all we would need is a program which is a closure/function, which receives a number of list input arguments, and performs "single foldl"s on them, reducing them to "atomic" values (i.e. non-lists).

@mhuesch mhuesch self-assigned this May 23, 2021
@mhuesch
Copy link
Contributor Author

mhuesch commented Jun 5, 2021

potential sources of bugs:

  • variable de-aliasing may cause variable capture

@mhuesch
Copy link
Contributor Author

mhuesch commented Jun 5, 2021

I now have a sketch of the toplevel API for SSEI.

it's a little funky, because we basically are evaluating "one lambda at a time", either in normal interpretation mode, or in SSEI incremental mode (for each variable), until we reach an atomic value (non lambda) which cannot be evaluated further.

example:

(defn coeffSum
  (lam [coeff]
    (lam [xs]
      (* coeff
         (foldl + 0 xs)))))

so here our first argument is a single integer, coeff, which will not be SSEI-able (because it's not a list, just a single value). our second argument, xs, could be SSEI-d, because it's a list of numbers.

they way I'm imagining we will interpret this is to first process coeff via the "normal interpreter". in this case, that will evaluate the application of a lambda to some value, creating a binding for coeff in the interpreter's environment. then we will het the inner lambda and return it.

there will be a strange dance between the caller and the callee to discover that the inner function is SSEI-able, and the caller will provide an iterator to incrementally provide those values.

@mhuesch
Copy link
Contributor Author

mhuesch commented Aug 18, 2021

it is looking increasingly like laziness is the way to go. which means that SSEI would get scrapped.

fusion, however, will likely be incorporated at some point.

@mhuesch
Copy link
Contributor Author

mhuesch commented Aug 23, 2021

the completion of #34 subsumes / eclipses / removes necessity of this.

@mhuesch mhuesch closed this as completed Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant