Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: Go 2: function values as iterators #43557

Open
mknyszek opened this issue Jan 6, 2021 · 85 comments
Open

proposal: Go 2: function values as iterators #43557

mknyszek opened this issue Jan 6, 2021 · 85 comments
Labels
Go2 A change that can only be done in Go 2 LanguageChange Proposal
Milestone

Comments

@mknyszek
Copy link
Contributor

mknyszek commented Jan 6, 2021

Proposal: Function values as iterators

Motivation

A number of proposals have come up over the years of supporting some first-class notion of iterator such that loops that use the range keyword may be used to iterate over some custom data structure. The Go community in general has also wondered about the "right" way to describe an iterator, evidenced by this blog post from 2013 that describes many of the ways Go programmers abstract away iteration, and much of this is true today as well.

Overall, however, iteration over a non-builtin data structure may be somewhat error-prone and clunky. To refresh on the various ways that iteration may be abstracted away in Go, let's work through some examples.

Firstly, we may have some tree type with some dedicated iterator type:

type IntTree struct {
    ...
}

func (t *IntTree) Iter() IntTreeIterator {
    ...
}

type IntTreeIterator struct {
    ...
}

func (t IntTreeIterator) Next() (int, bool) {
    ...
}

One way such an "iterator" could be used is,

iter := tree.Iter()
for {
    i, ok := iter.Next()
    if !ok {
        break
    }
    // Do stuff with i.
}

or even,

iter := tree.Iter()
for i, ok := iter.Next(); ok; i, ok = iter.Next() {
    // Do stuff with i.
}

The former usage works fine, but suffers from readability issues. The meaning of the code is obscured somewhat as the reader first sees an apparently infinite for-loop and must look for the "break" statement to understand what's going. Luckily this condition is usually present at the top of the loop, but it requires a more careful look. Furthermore, the iteration condition needs to be written explicitly. Writing it once may not be a problem, but writing it 100 times might be.

The latter usage also works fine and the intent is more clear, but it has a similar problem with the iteration condition. There's also an element of repetition which on the surface is fine, but it does harm the readability of the loop. Especially with variable names like "i" it becomes easy to get lost in punctuation.

Another way to abstract iteration away is to pass a function value to a function that iterates on behalf of the caller. For example:

type IntTree struct {
    ...
}

func (t *IntTree) ForEach(func(i int)) {
    ...
}

tree.ForEach(func(i int) {
    // Do stuff with i.
})

This method works well in many scenarios, but is decidedly less flexible as it separates the loop body from the surrounding code. Capturing local variables in that function value helps, but potentially at the cost of some efficiency, depending on the complexity of the iteration. One advantage of this method, though, is that defer may be used to perform clean up on each loop iteration, without allocating a defer (thanks to open-coded defers).

Prior art

A previous proposal (#40605) suggested allowing types with a Next method to have that method repeatedly called when used in a range loop:

iter := tree.Iter()
for i := range iter {
    // Do stuff with i.
}

This works fine, but from my perspective, doesn't feel very Go-like. Having a language construct be aware of a type's methods for the sake of syntactic sugar is not a pattern found anywhere else in the language (yet). In the parlance of the generic design, the existing range keyword matches on the underlying type used, not the type's methods.

Furthermore, it usually requires defining a new type which is a bit more work for the writer of the code as well as the reader. Overall a bit clunky, but not bad. It lines up well with how other languages work. Rust, for instance, uses a trait (analogous to an interface) to determine whether a type is an iterator.

Another previous proposal (#21855) suggested supporting a two-clause for-loop to make iterating less error-prone, such as:

iter := tree.Iter()
for i, ok := iter.Next(); ok {
    // Do stuff with i.
}

Unfortunately, a two-clause loop form is itself error-prone, as the placement of a single semicolon has a significant effect on semantics (specifically, the second semicolon which distinguishes between the 2-clause and 3-clause forms). This proposal was rejected because that semicolon was considered too dangerous.

Other proposals (#24282) have been made to fundamentally change how for loops work, indicating at least some degree of friction.

Proposal

Rolling with the idea of closures, and with the observation that the range form matches on types largely structurally, I would like to propose that range loops repeatedly apply function values of a particular form.

More specifically, I propose allowing the for-range statement to accept values of type func() (T, bool) or func() (T, S, bool) (where T and S are placeholder types; they may be any substituted for any other type) and will repeatedly call these values until their bool result is false.

Iterators may then be written in the following way:

type IntTree struct {
    ...
}

func (t *IntTree) Iter() func() (int, bool) {
    ...
    curr := t.root
    return func() (int, bool) {
        // Use curr to keep track of your position, repeatedly find the successor of curr
        // until there is none.
    }
}

for i := range tree.Iter() {
    // Do stuff with i.
}

More precisely, the last for statement "de-sugars" into the following code, where tmp is a temporary variable not visible to the program:

tmp := tree.Iter()
for i, ok := tmp(); ok; i, ok = tmp()  {
    // Do stuff with i.
}

The limitation to variables of type func() (T, bool) or func() (T, S, bool) (instead of allowing an arbitrary number of return values) is to keep range loops looking familiar and to avoid a misuse of the syntax.

Discussion and observations

Pros:

  • Efficient: with appropriate inlining, the Go compiler can stack-allocate any values captured by the function value returned by Iter and perform a direct function call for each iteration (theoretically that could be inlined further, but it depends).
  • Shifts API complexity to the writer: users get all the benefits of the range form, but writers have a little more work to do to generate the function value.
  • Lightweight: allows the writer to avoid defining a new type in many cases.
  • Sets the standard: by picking this structure, it sanctions a particular way of writing iterators at the language level.
  • Composes well: with generics, iterators defined this way can be easily transformed (map, filter, zip, etc.).

Cons:

  • Hidden behavior: similar to operator overloading, there's now one more thing that range could mean, and each call could be arbitrarily expensive (e.g. a series of HTTP requests).
    • Counter-point: (credit to @mvdan) channels may cause a for-range loop to block for an arbitrarily long time, and copies created during iteration could also cause surprising CPU usage.
  • Outsized impact: a whole language change for something that is arguably only a minor issue.
  • Unfamiliar: users coming from other popular statically-typed languages might find this behavior surprising.
  • Ambiguity: the range keyword doesn't always make sense to the reader (it does for custom data structures and in some other cases, though).
  • Ambiguity: the type func() (T, bool) may not always mean "iterator." A Next method is more explicit.
    • Counter-point: the fact that it doesn't accept any arguments is a pretty good signal, though.

This idea was inspired by the way Python generators are used as iterators. I recently had the realization that Python generators can approximated by repeated application of closures. I thought that this would have been proposed already, but I couldn't find anything like it. Interestingly, this proposal also allows for iterating over infinite generators and such. It can also be used to write Python-like loops (for better or for worse):

// Linspace returns an iterator that iterates from start (inclusive)
// to stop (exclusive) by stride. start must be <= stop.
func Linspace(start, stop, stride int) func() (int, bool) {
    i := start
    return func() (int, bool) {
        i += stride
        return i, i < stop
    }
}

for i := range Linspace(0, 10, 2) {
    // i = 0, 2, 4, 6, 8
}

It might also be worth allowing the iterator's final return value to be an error instead of a bool, though I'm not totally sure how to surface that to the user.

I'm not even totally convinced myself that this is worth doing, since the benefit seems minor at best, but I figured I should at least put the idea out there.

@mknyszek mknyszek added LanguageChange Go2 A change that can only be done in Go 2 Proposal labels Jan 6, 2021
@gopherbot gopherbot added this to the Proposal milestone Jan 6, 2021
@Deleplace
Copy link

Deleplace commented Jan 6, 2021

Iiuc we can achieve something similar by returning a chan.
I mean, same simplicity at the range clause, and similar moderate complexity of the iterator implementation. Writing to an unbuffered chan is like yielding in a continuation.

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Jan 6, 2021

@Deleplace Using a channel can work, but it has some difficulties.

  1. It means that there has to be a separate goroutine that writes to the channel. If the loop breaks or returns early, some mechanism is required to tell that goroutine to exit.
  2. Because each iteration requires context switching between the two goroutines, it's significantly slower. So much so that in practice people don't use this approach.

@zikaeroh
Copy link
Contributor

zikaeroh commented Jan 7, 2021

One thing I'd consider to be missing would be a way to inform the iterator that the caller is "finished" with the iterator, i.e. if the range loop were exited early for some reason. I can see some iterators wanting some element of cleanup; for example, building up a type safe iterator for rows.Next() and needing to call Close() at some point. Right now the only cleanup appears to be the eventual GC of whatever the closure referred to. Otherwise, you basically have to build an object that satisfies some interface and then use it in two places, deal with calling it the right way, etc. If this isn't going to be some multi-method interface implementation (and just the function), maybe return two functions, or have the one function have a exiting bool parameter, but I don't see much benefit over having a well-known interface.

I'd also wonder if this would need to wait for generics, such that the definition of this pattern could be well-defined with a generic interface somewhere rather than being "well-known" (and in regards to the above cleanup, potentially have some extra method an iterator could provide to be called on loop exit).

@smasher164
Copy link
Member

smasher164 commented Jan 7, 2021

for i := range tree.Iter() {
   // Do stuff with i.
}

This code example is the same for #40605 and the current proposal, the only difference being matching on an interface type as opposed to a function type.

You mention that matching on an interface's methods as opposed to a concrete type is not Go-like, but in the same vein, the error handling proposals thus far have treated error specially. Why couldn't the same approach be taken for a hypothetical OneClauseIterator and TwoClauseIterator?

@ivila
Copy link

ivila commented Jan 7, 2021

I see many of the repositories provide iterator in other way, for example: func ForEachCell in "github.com/tealeg/xlsx"

// ForEachCell will call the provided CellVisitorFunc for each
// currently defined cell in the Row.  Optionally you may pass one or
// more CellVisitorOption to affect how ForEachCell operates.  For
// example you may wish to pass SkipEmptyCells to only visit cells
// which are populated.
func (r *Row) ForEachCell(cvf CellVisitorFunc, option ...CellVisitorOption) error {
	return r.cellStoreRow.ForEachCell(cvf, option...)
}

and call it by like

err = r.ForEachCell(func(c *Cell) error) {
        // do something here ...
}

and can also define a common error like ErrEarlyBreak to know what happen while iter through all data, just like ErrNil for telling people key not exist in redis

err = r.ForEachCell(func(c *Cell) error) {
        // do something here ...
        if someCondition {
                return ErrEarlyBreak
        }
}

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 7, 2021

@Deleplace What Ian said. I also don't foresee the compiler gaining the ability to optimize that down at any point. It gets quite hard, and the optimization becomes fairly fragile.

@zikaeroh It may be true that some iterators want an element of cleanup, but I'd argue most don't. To me this is a signal that whatever the code is going iterate over is fairly complex, and maybe there's a better abstraction for that. For instance, the ForEach example above could be good for this because the cleanup is more succinctly captured by ForEach. I'm also wary about hiding more calls to code. If there was some cleanup function, its call ends up somewhat hidden (moreso than the original repeated call already is, since at least the value is visible).

RE: not seeing a benefit over having an interface, I think I get that. You also make a good point about this pattern being surprising, especially if it's not backed by some formal definition. It's true that most other languages do this differently and that's perhaps what folks have come to expect.

@smasher164 In my defense, none of those proposals actually made it yet. :) But, that's a good point. You certainly could go that route instead. FWIW, the "Go-like" in the proposal is absolutely subjective. Interfaces are "Go-like" too.

To me, range is like an operator that acts on types of a certain structure. That structure is visible in the type via brackets ([]), the chan keyword, and the map keyword. I'm proposing adding the func keyword (in some limited way) to the mix, basically.

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 7, 2021

@ivila I mention that (or a simpler version of that) in the proposal. I think that pattern has its uses and isn't going to go away (we already have e.g. filepath.Walk that works similarly). I'm not sure yet that there's a nice way to pass errors up with what I've defined.

And maybe its limited use is a good reason not to support it. I do think there are still cases where the iterator I've proposed can be useful. It works nicely for iterating over common data structures (or some combination thereof), optimizes nicely in the common case, and it composes nicely with map and filter functions (though it gets harder for the compiler to optimize that, of course). That last bit about composition gets easier with generics, but you can still do it without.

@bboreham
Copy link
Contributor

bboreham commented Jan 7, 2021

Could you say what is meant in the T, S case? I scanned the text several times looking for S; apologies if I missed it.

My guess is it’s “key, value” like when range-ing over a map.

@mvdan
Copy link
Member

mvdan commented Jan 7, 2021

I don't have a strong feeling about this proposal, but I also have to say it's probably the nicest approach to native iterators I've seen.

Hidden behavior: similar to operator overloading, there's now one more thing that range could mean, and each call could be arbitrarily expensive (e.g. a series of HTTP requests).

As a counter-point, ranges can already block for a long time (e.g. channel receive) or consume a surprising amount of CPU (e.g. if copying huge arrays). This proposal could make these cases more common, but I don't think it would be something new. The syntax containing a function call also helps hint at running arbitrary code for each iteration.

Outsized impact: a whole language change for something that is arguably only a minor issue.

Out of all the downsides you list, I think this is the only worrying one. At the same time, it's a problem with any proposal to add iterators to the language, so I don't think it's a problem with this approach in particular.

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 7, 2021

Could you say what is meant in the T, S case? I scanned the text several times looking for S; apologies if I missed it.

My guess is it’s “key, value” like when range-ing over a map.

Yeah, sorry about that, the S kind of comes out of nowhere. It's just supposed to be another placeholder for a type (I thought K, V might be more confusing). I added a line around where I introduce the func() (T, bool) form to explain that these are placeholder types that may be substituted with any other type. (@bboreham)

@DeedleFake
Copy link

DeedleFake commented Jan 7, 2021

As proposed, this seems to allow the use of method values, meaning that it is actually essentially compatible with the original Next() method iterator proposal:

iter := tree.Iter()
for i := range iter.Next {
    // Do stuff with i.
}

@tv42
Copy link

tv42 commented Jan 7, 2021

The word "closure" here is a red herring and an implementation detail of the example code. Closures (in Go terminology) are function literals, but there's no reason to enforce that, or to really even know. The word wanted here is "function value" (and method values are function values).

I would suggest retitling to "proposal: Go 2: function values as iterators"

@tv42
Copy link

tv42 commented Jan 7, 2021

"More specifically, a for-range statement may accept variables" should say "More specifically, a for-range statement may accept values"

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 7, 2021

I don't have a strong feeling about this proposal, but I also have to say it's probably the nicest approach to native iterators I've seen.

Thanks! 😀

As a counter-point, ranges can already block for a long time (e.g. channel receive) or consume a surprising amount of CPU (e.g. if copying huge arrays). This proposal could make these cases more common, but I don't think it would be something new. The syntax containing a function call also helps hint at running arbitrary code for each iteration.

I thought about writing down that counter-point, but what stopped me was the thought that it could do an arbitrary amount of CPU work. But you raise a good point that that's possible today, via copies. I may go back and add that counter-point, then.

Out of all the downsides you list, I think this is the only worrying one. At the same time, it's a problem with any proposal to add iterators to the language, so I don't think it's a problem with this approach in particular.

Agreed.

@prattmic
Copy link
Member

prattmic commented Jan 7, 2021

As a counter-point, ranges can already block for a long time (e.g. channel receive) or consume a surprising amount of CPU (e.g. if copying huge arrays). This proposal could make these cases more common, but I don't think it would be something new. The syntax containing a function call also helps hint at running arbitrary code for each iteration.

I totally agree here, though I will note that, as I understand it, the proposed syntax uses a function value as the argument to range, not a call (like the go statement), so it is not necessarily completely obvious that there are calls. So this may be a direct function reference (like @DeedleFake's example), or a function value defined in the function (for v := range fn {).

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 7, 2021

The word "closure" here is a red herring and an implementation detail of the example code. Closures (in Go terminology) are function literals, but there's no reason to enforce that, or to really even know. The word wanted here is "function value" (and method values are function values).

I would suggest retitling to "proposal: Go 2: function values as iterators"

Right. I'm aware of the distinction, though I figured it was important to make it clear that you can't really do much with an iterator that is a function value but not a closure. You can, for instance, return some value infinitely, and I suppose you can also iterate over some value in global variable as well.

Overall, what you suggest is indeed more precise. I'll update the proposal.

"More specifically, a for-range statement may accept variables" should say "More specifically, a for-range statement may accept values"

You're probably right. I had some trouble with this because "range" isn't an operator. The most precise phrasing might be "More specifically, a for-range statement may accept expressions that evaluate to a function value of type ..." That way you get both a semantic and a syntactic description all in one.

EDIT: Re-reading that passage, I think you're 100% right. I modified it to fit your phrasing. :)

Thank you for your suggestions!

@mknyszek mknyszek changed the title proposal: Go 2: closures as iterators proposal: Go 2: function values as iterators Jan 7, 2021
@tv42
Copy link

tv42 commented Jan 7, 2021

to make it clear that you can't really do much with an iterator that is a function value but not a closure.

You can replace any closure by defining a type that captures the free variables, returning a method value on that type. That often leads to more debuggable & readable code, too.

@bcmills
Copy link
Member

bcmills commented Jan 7, 2021

A method value is, in a very real sense, a closure over the receiver.

(function literal∶closure∷square∶rectangle)

@tv42
Copy link

tv42 commented Jan 7, 2021

Sure, but in that case I posit the term "closure" is not defined in the spec, and shouldn't be used for that reason alone.

Function literals are closures: they may refer to variables defined in a surrounding function.

That reads as an example, not an exhaustive normative definition, but it is the only place the word appears.

One of the strengths of Go is that the spec is very clean and easy to refer to.

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Jan 19, 2021

Just to clarify, it seems that with this proposal

for i := range f { ... }

is equivalent to

for i, next := f(); next; i, next = f() { ... }

@josharian
Copy link
Contributor

josharian commented Jan 20, 2021

And maybe:

for := range f { ... }

is equivalent to:

for next := f(); next; next = f() { ... }

And maybe for higher numbers of yielded values than 2?

Except presumably the iteration vars have the same capturing semantics that we all love so well.

Further into esoterica, range loops over channels zero out the iteration vars at the end of each iteration, to avoid pinning large objects in memory while waiting an arbitrary length of time for a channels receive. It might not be a bad idea to do that same here, were this accepted.

@josharian
Copy link
Contributor

josharian commented Jan 20, 2021

What about:

for i := range f { ... }

when f returns two values plus a bool?

If allowed, I presume it is equivalent to:

for i, _, next := f(); next; i, _, next = f() { ... }

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 20, 2021

@ianlancetaylor Yup, that's pretty much all it is, though also extended to the 3-return-value form (i.e. T, S, bool) as @josharian points out.

@josharian I think clearing the variable at the end of the loop iteration is a good idea, if this goes anywhere. RE: generating a new variable on each iteration would be nice, but I suspect it's still probably better to keep things consistent. Maybe that one experiment to look at all of GitHub will pan out. :)

@griesemer
Copy link
Contributor

griesemer commented Jan 20, 2021

  1. Is there any reason for restricting the function signature to a fixed number (2 or 3) results? It seems to me that one could allow any signature (without incoming arguments) where the last result is of type bool. For instance:
for a1, a2, ... an, ok := range f { ... }

is translated as explained above (for just two results). This could even make sense if there's only a single (boolean) result. I think @josharian alluded to this as well, above.

  1. I'm not convinced that the suggested syntactic sugar (that's what it is, after all) buys that much. That said, if we chose this translation:
for {
   a1, a2, ... an, ok := f()
   if !ok { break }
   ...
}

we would get new iteration variables for each iteration which would address the variable capture problem without the need for a new range syntax.

  1. One issue that this proposal doesn't address is the problem of (efficiently) supporting iteration over recursive data structures such as a tree. Languages (e.g. Python) that support generators typically have some form of a yield operation which allows the iterator/generator to be written in a natural fashion. In Go we'd have to use a channel for cases like these. (Of course, the reason why we decided to co-opt channels for this is that running a generator function with a yield "in parallel" with an iteration using the yielded values is exactly because we have two coroutines in that case. It's just that we would prefer a more lightweight switchover.)

@deanveloper
Copy link

deanveloper commented Aug 16, 2021

I think I may open a proposal which is a reconsideration of #19702, perhaps with some more details and a more persuasive case. It seems to have been closed for a few reasons:

  1. There should be communication between the reader and writer of a channel when the channel is closed. (While I largely agree with this, it doesn't make as much sense for iterators).
  2. Deadlock stack traces would not be useful, as the goroutines would be removed. (Perhaps under that propsal this is true. However, under my prospective proposal, channels which still have references at both ends would not be garbage collected. This means that in actual deadlocked situations, where references to both ends of a channel are still held, we will still see deadlocked goroutines).

We already have generator-based iterators in the form of channels, but we cannot use them because we would need to exhaust the channel in order to stop the goroutine. In my eyes, the only blocker that I'm not really sure about is defer and whether or not defers/finalizers should be ran when the goroutine is GC'd. I'm thinking not, but I'll think more about this either when I write the proposal or a discussion is opened.

@carlmjohnson
Copy link
Contributor

carlmjohnson commented Aug 16, 2021

#19702 is locked, so can't comment there, but what about something like runtime.Deadlocked() <-chan struct{} used like this?

func numbers() <-chan int {
	ch := make(chan int)
	go func() {
		for i := 0; ; i++ {
			select {
			case ch <- i:
			case <-runtime.Deadlocked():
				return
			}
		}
	}()
	return ch
}

ISTM the trouble with the issue as opened before was A) it's expensive to find the dead channels B) there was ambiguity about how to handle the exit (runtime.Goexit vs. defers, etc). Having an explicit select statement resolves those, no?

@carlmjohnson
Copy link
Contributor

carlmjohnson commented Aug 16, 2021

Even better, with generics, you could write:

func yielder[T any](ch chan T) func(t T) {
	return func(t T) {
			select {
			case ch <- t:
			case <-runtime.Deadlocked():
				return
			}
	}
}

func numbers() <-chan int {
	ch := make(chan int)
	yield := yielder(ch)
	go func() {
		for i := 0; ; i++ {
			yield(i)
		}
	}()
	return ch
}

😆

Okay, that's a sort of pointless way to get out of writing a select statement, but it's interesting in principle as part of thinking about iterator functions. ISTM, we already have a really good general purpose iteration mechanism (channels), but we don't use it because A) it's too slow/expensive and B) there's no good way to handle closing. Can we solve problems A and B and thus get out of adding a whole new mechanism for iteration?

@deanveloper
Copy link

deanveloper commented Aug 16, 2021

Here is a pretty good wrapper function that could be in the standard library, provided that runtime.Deadlocked would exist sometime in the future:

package chans

func Generator[T any](generator func(yield func(T) bool)) <-chan T {
	ch := make(chan T)

	go func() {
		generator(func(incoming T) bool {
			select {
			case ch <- incoming:
				return true
			case <-runtime.Deadlocked():
				close(ch)
				return false
			}
		})
		close(ch)
	}()
	return ch
}

which would be used like:

package main

func Fibonacci() <-chan int {
	return chans.Generator(func(yield func(int) bool) {
		var a, b = 0, 1
		for {
			ok := yield(a)
			if !ok {
				return
			}
			c := a + b
			a = b
			b = c
		}
	})
}

The ok return mainly exists for infinite generators to prevent infinite looping, but in the common case, if it is ignored it shouldn't break programs (unlike the original yield/emit solution). Alternatively we could do some sort of panic, but I think that may cause some issues.

go2go link seems to time out sometimes, but it works other times: https://play.golang.org/p/28dKug56JdQ

However this seems to be an issue with go2go because the non-generic solution works fine in the regular playground: https://play.golang.org/p/28dKug56JdQ

@carlmjohnson
Copy link
Contributor

carlmjohnson commented Aug 16, 2021

That can be simplified to

func Fibonacci() <-chan int {
	return chans.Generator(func(yield func(int)) {
		var a, b = 0, 1
		for yield(a) {
			c := a + b
			a = b
			b = c
		}
	})
}

Having an explicit deadlock test also makes it easier to deal with the "how to put the queue item back" problem, since there's an explicit way to check for the loop being exhausted and then run arbitrary code.

@DeedleFake
Copy link

DeedleFake commented Aug 16, 2021

While I'm not sure if runtime.Deadlocked() is exactly the right approach, I do think that it's a good idea. Cooperative cancellation, as with context.Context, feels a lot more Go-like. Maybe, at least to start, it might make sense to make it runtime.deadlocked() instead and then only create a iter.Generator() wrapper around it with an unsafe hook into the unexported runtime API. If you export it, it's going to encourage people to rely on other people checking for that deadlock and you're right back to the same situation as the ignored boolean, but now for potentially every single channel usage that crosses an API boundary.

In whichever case, though, I also worry a little about potential performance issues stemming from having to multi-thread this. I'm not entirely sure that just directly using a channel is the correct approach for a generic iterator system, either. I think that this should just be one possible way to create an iterator, be it iter.Iter[T] or whatever else. Otherwise middleware functionality like map and filter is going to be kind of weird.

@deanveloper

go2go link seems to time out sometimes, but it works other times

The go2go playground's been having that problem a lot recently. I had a ton of trouble with it the other day.

Also, several of the examples above forgot the bool return for yield in the function signature.

@deanveloper
Copy link

deanveloper commented Aug 16, 2021

Also, several of the examples above forgot the bool return for yield in the function signature.

My bad - I had added the bool in later and forgot to re-copy the examples into my comment. I've fixed that now.

@deanveloper
Copy link

deanveloper commented Aug 16, 2021

In whichever case, though, I also worry a little about potential performance issues stemming from having to multi-thread this.

Correct me if I am wrong, but pretty much every Generator uses some form of coroutine in order to implement the pattern. I've checked Kotlin and Rust at the very least. However I also believe that they control the coroutines a bit better so that it still operates single-threaded, and the value-passing also works differently from using a channel. The goroutine/channel overhead is quite the problem at large scale, in this case I have a binary tree of 10,000 elements and show how slow iteration is on each one: https://play.golang.org/p/1asKVzsdLNn. You cannot get results in the go playground, as time is fixed. Feel free to try on your own machine.

Results on my machine are ~1ms for purely-function-call based iteration, and ~480ms for the channel-based iteration. I expected the performance to be worse, but not that bad. Perhaps some optimizations could be done to chans.Generator specifically, but I'm not very hopeful that we can get close enough.

EDIT - As a test, I implemented it in Javascript, and it runs for ~9ms. There definitely has to be a more optimal way to do this without goroutine overhead. https://gist.github.com/deanveloper/32d6c0c9b915f464cb917d5f76c34dd9

EDIT 2 - Did some more testing: https://play.golang.org/p/87HxaEmecno (again, execute locally for correct timings). Even with only a single goroutine created, the channel overhead is quite high. To iterate over a 10k element slice, it takes 20ms with chans, and 0.2ms with functions. That's 2 orders of magnitude... I'm not sure if we can get optimizations to help that much.

@DeedleFake
Copy link

DeedleFake commented Aug 16, 2021

I think without language-level support, there's going to be some pretty bad performance issues that aren't really easily resolvable. It might be possible to optimize the channel for this specific usage and cut out some of the overhead, as I'm guessing that quite a bit of that comes from locking that isn't useful in this particular situation.

@carlmjohnson
Copy link
Contributor

carlmjohnson commented Aug 17, 2021

Setting GOMAXPROCS to 1 cuts the tree time by 20% and the slice time by 50%, but clearly there are still orders of magnitude of overhead to work out. I do think it is theoretically doable though, since many other languages have a coroutine mechanism. There "just" (ha!) needs to be some way to detect that the routines will always yield to each other and set them to a simpler scheduler once that's detected.

@jba
Copy link
Contributor

jba commented Aug 23, 2021

I think the best way to handle functions that need to have some sort of cleanup or close is to return a second func() value:

gen,  close := chans.Generator(func(yield func(int) bool) {...})
defer close()

No runtime support is necessary, and the pattern is familiar to Go programmers from context.Cancel.

@deanveloper
Copy link

deanveloper commented Aug 23, 2021

Unfortunately that remove the ability to have "inline" generators, like my earlier examples (ie for i := range Fibonacci()). That is, unless we add a language change, such that range on a function call that returns (<-chan T, func()) will call the 2nd return value when the for loop exits. That seems like a strange language feature outside of iterators and it would only work on function calls.

It may be quite bothersome to manually call an extra close() when iterating, and it makes passing around iterators a lot more difficult as well. In IO, pipelining is a lot easier, since it is all handled in a single variable (which may implement io.Closer). But needing to pass around 2 variables for a single iterator is a bit cumbersome.

@jba
Copy link
Contributor

jba commented Aug 23, 2021

@deanveloper, I don't think those are big problems.

Unfortunately that remove the ability to have "inline" generators, like my earlier examples (ie for i := range Fibonacci()).

gen, close := Fibonacci()
defer close()
for i := range gen ...

But needing to pass around 2 variables for a single iterator is a bit cumbersome.

You normally wouldn't do that. Think about how you use *os.File: the code that opens the file is responsible for closing it. For iterators it would be:

gen, close := Fibonacci()
defer close()
process(gen)

I'm sure there are situations where the close needs to happen somewhere else, but I'm guessing they're rare.

As for this all being cumbersome, that's true, but I don't think the language or runtime features you're hoping for are likely to happen. You could also do something like

type Iter[T any] interface {
    Next() (T, bool)
    Close()
}

and then it would be more like a file, but it turns out that forgetting to call Close is a real problem, so I think the clumsiness is actually helpful.

@jba
Copy link
Contributor

jba commented Aug 23, 2021

I've actually been working for a bit on speeding up this conversion from the "yield" style iterator to a one-at-time iterator. I think the first step is to remove the explicit channel:

type Iter func() (int, bool)

You get the next element by calling the function; a second return value of false tells you you're done. (@bcmills has designed a nice iterator package with generics using a similar definition.)

Now that the channel is hidden, you can play tricks like using a chan []T to buffer the collected elements. The fastest I've been able to come up with so far takes a user-supplied buffer to avoid allocation, splits it in two, and swaps the halves between the producer and consumer in a kind of double-buffering:

func ToIter(b []int, ranger func(f func(int) bool)) (Iter, func()) {
    c := make(chan []int)
    done := make(chan struct{})
    n := len(b) / 2
    if n == 0 {
        panic("buffer too small")
    }
    go func() {
        b1 := b[:0]
        b2 := b[n:n]

        // producer
        ranger(func(v int) bool {
            b1 = append(b1, v)
            if len(b1) >= n {
                select {
                case c <- b1:
                case <-done:
                    return false
                }
                // Since channel is unbuffered, at this point the consumer                                                                                                                                                                    
                // has taken b1, meaning it no longer needs b2.                                                                                                                                                                               
                b1, b2 = b2[:0], b1
            }
            return true
        })

        // consumer
        if len(b1) > 0 {
            select {
            case c <- b1:
            case <-done:
            }
        }
        close(c)
    }()

    return iterFromSliceChannel(c), func() { close(done) }
}

func iterFromSliceChannel(c chan []int) Iter {
    var (
        s  []int
        i  int
        ok bool
    )
    return func() (int, bool) {
        if i >= len(s) {
            s, ok = <-c
            if !ok {
                return 0, false
            }
            i = 0
        }
        i++
        return s[i-1], true
    }
}

For enumerating the values of a simple binary tree, this code is about 40% slower than calling the yield iterator directly, for large enough trees (with a 16K buffer, around a million nodes). It's much worse for smaller trees, because of the initial overhead.

For small trees it's probably better to just use the yield iterator to populate a slice, and iterate over the slice.

@deanveloper
Copy link

deanveloper commented Aug 24, 2021

Ahaha, and we're back at where we started. Earlier, I was thinking about if we could possibly get yield-style iterators to work in the more classical iterator fashion, which would be the most ideal way to do it. My idea was similar to yours, before realizing that we can simply use channels as iterators if we got the needed performance improvements.

To organize some of my thoughts, here is what I think of the main proposed solutions, in descending order of preference:

  1. Using channels as iterators with performance improvements and runtime.deadlocked() <-chan struct{}
    • This allows us to have efficient iterators without any language changes. Iterators would be able to be GC'd properly.
  2. Using channels as iterators with performance improvements, and a returned cancel function
    • Again, this allows us to have efficient iterators without any language changes, as well as runtime changes.
    • Iterators are not something that need to be "closed" in other languages, I would expect that they get garbage collected like anything else.
  3. The original proposed iterator, with utility functions to convert yield-style iterators to "real" iterators
    • This is nice because we wouldn't need to make optimizations to goroutines and channels. I'm not sure how hard they are to optimize, but synchronization in general is a pretty hard problem.
    • This does require making language changes if we want it to be range-able, but likely for the better.
    • Without the utility functions, these iterators may be extremely hard to write.
  4. The generator-pattern/yield-style/"Do" iterators
    • Very easy to write iterators.
    • Poorly written iterators which ignore the return value will have drastic consequences.
    • Will likely never be range-able

Also I had a new idea, would it be possible to use a runtime finalizer to free the deadlocked goroutine? I'm unsure, but it could possibly work. I'll start testing something out. (from a few tries at using finalizers, this doesn't seem possible. i'm not bothered by it since it's a pretty hacky solution anyway.)

@jba
Copy link
Contributor

jba commented Aug 24, 2021

from a few tries at using finalizers, this doesn't seem possible

@ianlancetaylor used a finalizer to clean up a deadlocked goroutine in the generics proposal.

@DeedleFake
Copy link

DeedleFake commented Aug 24, 2021

Relevant code:

// Ranger provides a convenient way to exit a goroutine sending values
// when the receiver stops reading them.
//
// Ranger returns a Sender and a Receiver. The Receiver provides a
// Next method to retrieve values. The Sender provides a Send method
// to send values and a Close method to stop sending values. The Next
// method indicates when the Sender has been closed, and the Send
// method indicates when the Receiver has been freed.
func Ranger[T any]() (*Sender[T], *Receiver[T]) {
	c := make(chan T)
	d := make(chan bool)
	s := &Sender[T]{values: c, done: d}
	r := &Receiver[T]{values: c, done: d}
	// The finalizer on the receiver will tell the sender
	// if the receiver stops listening.
	runtime.SetFinalizer(r, r.finalize)
	return s, r
}

// A Sender is used to send values to a Receiver.
type Sender[T any] struct {
	values chan<- T
	done   <-chan bool
}

// Send sends a value to the receiver. It reports whether any more
// values may be sent; if it returns false the value was not sent.
func (s *Sender[T]) Send(v T) bool {
	select {
	case s.values <- v:
		return true
	case <-s.done:
		// The receiver has stopped listening.
		return false
	}
}

// Close tells the receiver that no more values will arrive.
// After Close is called, the Sender may no longer be used.
func (s *Sender[T]) Close() {
	close(s.values)
}

// A Receiver receives values from a Sender.
type Receiver[T any] struct {
	values <-chan T
	done  chan<- bool
}

// Next returns the next value from the channel. The bool result
// reports whether the value is valid. If the value is not valid, the
// Sender has been closed and no more values will be received.
func (r *Receiver[T]) Next() (T, bool) {
	v, ok := <-r.values
	return v, ok
}

// finalize is a finalizer for the receiver.
// It tells the sender that the receiver has stopped listening.
func (r *Receiver[T]) finalize() {
	close(r.done)
}

Something to note: This method of doing it makes it unsafe to give the raw channel to the user, which in turn means that the user can't directly use it with a select or with range.

Edit: Also, if the documentation for runtime.SetFinalizer() is correct, the code above has a bug, as the function passed to it above is a func(), but it expects a func(*Receiver). Hmmm... Makes me wonder if it could be changed to SetFinalizer[T constraints.Pointer[E], E any](obj T, finalizer func(T)). That would break the allowance for any number of ignored return values for finalizer, though.

@deanveloper
Copy link

deanveloper commented Aug 24, 2021

Something that's neat about that Sender/Receiver pattern is that it satisfies func() (T, bool), so it works with the original proposed iterator. We could move Generator from package chans to package iterator, and redefine it like so:

package iterator

func Generator[T any](generator func(yield func(T) bool)) func() (T, bool) {
	send, recv := chans.Ranger()

	go func() {
		generator(send.Send)
		send.Close()
	}()

	return recv.Next
}

We could also add in the buffering that @jba mentioned to get a significant speedup, assuming that we don't get any goroutine/channel performance improvements that make it necessary. Even with the performance improvements, it will likely be beneficial to iterate over slices instead of using a buffered channel.

@DeedleFake
Copy link

DeedleFake commented Aug 24, 2021

For a proper performance improvement, it could also be reimplemented at some point later the way that a bunch of other languages do it, thus getting rid of the channel and background thread, and the API could be left as is.

I'm still not thrilled about the need for yield to return a value to indicate that there's nothing receiving anymore, though. It would be much preferable for the generator function to somehow just return on its own, running deferred functions on the way. Would it make sense for the yield implementation that gets passed to just call runtime.Goexit() instead of returning something?

@deanveloper
Copy link

deanveloper commented Aug 25, 2021

Would it make sense for the yield implementation that gets passed to just call runtime.Goexit() instead of returning something?

Unfortunately I can't seem to get it to work. I'm not sure what's preventing recv from being GC'd, feel free to take a crack at it yourself. It may be that the original implementation of Ranger never called the finalizer either, I never tested it. https://go2goplay.golang.org/p/pPFUHTQyqI9

(Or, the Go 1 version which I was testing on my local machine: https://play.golang.org/p/6SaBW8TmNsy)

@DeedleFake
Copy link

DeedleFake commented Aug 25, 2021

@deanveloper

I fixed it. It seems like the problem was that the use of a method value for the finalizer function resulted in an extra reference being kept around, so it never got garbage collected. I changed it to a method expression and it worked.

@adonovan
Copy link
Member

adonovan commented Sep 20, 2021

[gri, Jan 20] 1. I'm not convinced that the suggested syntactic sugar (that's what it is, after all) buys that much. That said, if we chose this translation:

for {
   a1, a2, ... an, ok := f()
   if !ok { break }
   ...
}

we would get new iteration variables for each iteration which would address the variable capture problem without the need for a new range syntax.

The proposal should not attempt to solve the variable capture problem (in which "for x := range seq" creates a single variable x for the whole loop, which is sometimes undesirable). It would be surprising and inconsistent for the compiler to create one variable per loop if seq is a slice, but one variable per iteration if seq is a function. Also, if the variable escapes, it would turn a single allocation into a linear number of allocations.

I'm not a fan of this proposal. Although the range syntax is neat, it is confusing at first sight, unlike most Go constructs; and as @jba points out, it creates for the first time a function call where there is no call syntax, and doesn't address the great variety of iteration constructs that come up in practice.

@rogpeppe
Copy link
Contributor

rogpeppe commented Jan 13, 2022

@tv42 wrote:

You can replace any closure by defining a type that captures the free variables, returning a method value on that type. That often leads to more debuggable & readable code, too.

I tend to disagree with the second sentence there. In my view a closure is significantly less debuggable than an interface value because there's no way to tell what's going on inside it: we can't find out what type of iterator it is, and there's no way to define other methods on it that could be used to help debugging.

I also think that having an explicit type is clearer. When I see a func() (T, bool) it's not immediately clear that to me that it's an iterator.

@tv42
Copy link

tv42 commented Jan 13, 2022

@rogreppe I think you misunderstood what I was trying to say. I also think closures are harder to debug than explicit types.

@sammy-hughes
Copy link

sammy-hughes commented Apr 13, 2022

In case it meaningfully develops the discussion, I'm actually a vote in favor of closures.

My introduction to Go was in context of an API controller, backed by SQL. The standard database/sql Rows type had an iterator that carried a "Next()bool" method and "Scan(...interface{}) error" method. It was convenient, straight forward, and if you pre-allocated appropriate to the return size, was blisteringly fast.

I adopted it as an iterator in a two-closure fashion such as below. It's entered generically, but typically I would have redefined it per declared type. I get that closures aren't considered ergonomic, but I dissent. Excepting generics, the following is a pattern that is in use, in production code.

type IterRead[T any] func()*T
type IterWrite[T any] func(*T) bool
type IterNext[T any] func() bool
type Data[T any] []T

func iterSeek[T any](s *[]T, i int) (int, bool) {
    switch {
    case s == nil: return i, false
    case len(*s) > i+1: return i+1, true
    default: return i, false
    }
}
func iterScan[T any](s *[]T, v *T, i int) (int, bool) {
    i, ok := iterSeek(s, i)
    if ok { *v = (*s)[i] } // could just pointer+SizeOf(T)...I guess I haven't played with unsafe+generics since the preview on 1.17
    return i, ok
}

func Map(dP *Data[T]), f func(*T) T, z func() T) (IterRead[T], IterNext[T]) {
    var v T
    i, ok := -1, false
    return (
        func() *T { return &f(v) },
        func() bool {
            if ok { i, ok = iterScan(dP, &v, i) }
            if !ok { v = z() }
            return ok
        }
    )
}

It might feel heavy, but it's fairly easy to bundle stuff up out of the way. When used with concrete types, performance is quite good, and a read/write version is just func(p []T, z func()T) (r func()*T, w func(*T) bool, n func()bool). Of course, that all extends to a struct type with methods, but the pattern with closures just feels like less work to manage.

for instance, supposing the above, to transform a slice of any inferable type:

f, n := Map(data, mapFunc, zeroFunc)
for n() {
  fmt.Printf("%T: %v", f(), f())
}

And considering that there's not actually a dependency on generics here, as I use the pattern with concretes, it's an example of something that absolutely works. It could as easily be used with generics as with interfaces, meaningful or empty interfaces, as everything is defined by the caller, and type-assertions can be performed as appropriate by the caller.

I don't know what coroutines would look like as implemented in Go, but this pattern serves me as a poor-mans thread-local coroutine. To performance, this gets better if a pattern following this model could be guaranteed-inlined. In terms of ergonomics, I am stoked for #21498 to be finalized. Even in "today Go", I think this pattern has potential. If you then want to associate those independent closures with a struct, sure. Instantiate an interface-literal, and attach them? Sure.

@johan-bolmsjo
Copy link

johan-bolmsjo commented Jul 14, 2022

One thing I'd consider to be missing would be a way to inform the iterator that the caller is "finished" with the iterator, i.e. if the range loop were exited early for some reason.

That could be solved with something like this:

type RangeIterator[T any] func() (value T, ok bool)

type ResourceIterator[T any] interface {
        Begin() RangeIterator[T]
        Close()
}

iter := tree.Iter()
for i := range iter.Begin() {
        // Do stuff with i
}
iter.Close()

Note that the range aspect of the proposal would be unaltered. The cleanup is just built into the outer type according to its particular needs.

@johan-bolmsjo
Copy link

johan-bolmsjo commented Jul 14, 2022

I'm in favor of this proposal. Iterating over user collections or generators is something that I've always felt is a bit clunky in Go. I'm a huge fan, but nothing is perfect.

Today I wrote yet another generator that did not feel right and was thinking about writing a proposal myself. Researching the open issues revealed that multiple proposals already exist. I'll leave my comments here instead in the hope that they may influencing the decision to do something about the situation.

I landed in two solutions myself of which one is identical to the "range over closure" proposal presented here.

The second solution that I've always had in mind is to just introduce "proper while loops" as they exist in C. This suggestion was shut down in #21855 but I think the risk of errors would be eliminated by using a "while" keyword instead of overloading another behavior with "for". Such a while would accept the same forms as "if".

Example: Procedural approach

iter := tree.Iter()
while i, ok := iter.Next(); ok {
        // Do stuff with i
}

Not to side track this proposal with another one. I just wanted to mention this alternative to get it off my chest.

As already listed, I'm aware of thee common ways to iterate over things.

  1. For loop with manual exit (most verbose).
  2. Thee clause for loop, repeating the iterator stepping call.
  3. The scanner idiom

I don't think channel based approaches are reasonable for performance reasons.

Iterating over collections is something that is done fairly often. Having support for expressing this naturally would make it easier to read code (and IMHO make it slightly more elegant).

@bboreham
Copy link
Contributor

bboreham commented Aug 18, 2022

Folks following this proposal will likely be interested in discussion: standard iterator interface

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Go2 A change that can only be done in Go 2 LanguageChange Proposal
Projects
None yet
Development

No branches or pull requests