Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: slices: ChunkFunc to determine chunks of a slice using a function #69123

Open
DeedleFake opened this issue Aug 29, 2024 · 8 comments
Open
Labels
Milestone

Comments

@DeedleFake
Copy link

Proposal Details

I propose adding a new function to the slices package that would return an iterator yielding variable-length subslices of a given slice based on a provided function. There are two possible implementations of this.

Version 1

The first is something I have personally used several times before, as well as being similar to a function in the Elixir standard library, looks like

func ChunkFunc[T any, C comparable, S ~[]T](s S, chunker func(T) C) iter.Seq[[]T]

This function returns an iterator that yields subslices of s composed of consecutive runs of elements for which chunker returns the same value. In other words,

ChunkFunc([]int{-1, -2, -3, 1, 2, -1, 3, 2, 1}, func(v int) bool { return v > 0 })

would yield []int{-1, -2, -3}, []int{1, 2}, []int{-1}, and then []int{3, 2, 1}. This is useful for a number of different things. For example, let's say that you have a slice of lines of output from something, some of which are to stdout and some to stderr. If you want to output those with a header to indicate which is which, being able to group the consecutive runs of lines that were to each is very useful, and a function like this can do it quite efficiently.

groups := slices.ChunkFunc(outputs, func(output Output) string { return output.Destination })
for group := range groups {
  fmt.Printf("%v:\n", group[0].Destination) // Like Chunk(), never yields an empty slice.
  for _, output := range group {
    fmt.Println(output.Text)
  }
}

Version 2

The other possible implementation is to simply make the function always return a bool, and then define that each time it returns true is the start of a new chunk.

While this is not a function that I've personally had a use for, I'm not really stuck on either specifically simply because I'm pretty sure that both can be implemented using the other. I think it's probably easier to implement the comparable one using the bool version, though, as it would simply be something like the following untested function:

func ChunkBy[T any, C comparable, S ~[]S](s S, chunker func(T) C) iter.Seq[[]T] {
  return func(yield func([]T) bool) {
    first := true
    var prev C
    chunks := ChunkFunc(s, func(v T) bool {
      check := chunker(v)
      if first {
        prev = check
        return true
      }
      start := check != prev
      prev = check
      return start
    })
    chunks(yield)
  }
}

Version 3

Another alternative is to simply provide both as, say, ChunkFunc() for the bool version and ChunkBy() for the comparable one.

@gopherbot gopherbot added this to the Proposal milestone Aug 29, 2024
@gabyhelp
Copy link

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@earthboundkid
Copy link
Contributor

earthboundkid commented Aug 29, 2024

If it's called "ChunkFunc" I would expect to be able to implement Chunk as a trivial composition with it, like how slices.Sort is just SortFunc + cmp.Compare. However, implementing Chunk with this would involve what to me feels like a non-trivial use of closures. So I don't love the name. It does seem useful. It's similar to Python's itertools.GroupBy, although I believe Python returns the equivalent of iter.Seq2[Key, iter.Seq[Value]], which in my experience wasn't very convenient because I always ended up wanting the inner iterator to be a list.

@earthboundkid
Copy link
Contributor

Slightly more thought: I think it should be called xiter.GroupFunc and xiter.Group is the identity sequence:

func GroupFunc[K comparable, V any](seq iter.Seq[V], key func(V) K) iter.Seq2[K, []V]

func Group[T comparable](seq iter.Seq[T]) iter.Seq2[T, []T] {
    return GroupFunc(seq, func(v T) T { return v })
}

@DeedleFake
Copy link
Author

DeedleFake commented Aug 29, 2024

I agree with you on the name not being quite consistent with the other Func functions, but I don't think Group is right, either. Group is usually collecting all elements that match something into the same group, where as this function is only consecutive chunks of elements. Two elements that match but have a different one in between would be in different chunks.

A Group function can be written using this, though, by simply sorting the slice first so that all elements that match are in consecutive chunks. It's a pretty efficient way to do it, depending on the circumstances, because it can be done without allocations. Much better than the simple approach of building a map[K][]T if you want the lists sorted anyways. I use that in a project to produce a list of sublists of remote nodes grouped by their location with each sublist sorted as well to present in a UI.

Back on topic with name choice, though maybe just copying Elixir and calling it ChunkBy would make the most sense.

@jimmyfrasche
Copy link
Member

I would expect Group to return a map[K][]V. (That can be generalized to a GroupReduce[K comparable, S, T any](sum S, reduce func(S, T) S, seq iter.Seq2[K, T]) map[K]S making regular group GroupReduce(nil, append, seq) (well you need a wrapper around append but you get the idea)).

In the original Chunk thread I posted a generalization that's similar to this but instead of having a key func the predicate takes two values and returns true if it is should start a new chunk. It's a bit fussier to implement but I think this proposal could be implemented in terms of that more easily than the other way around. #53987 (comment) I'm not sure that's quite right either, though. When I implemented it I ended up needing to replace the func with an interface that had a reset method because otherwise it became difficult to write combinators.

@adonovan
Copy link
Member

adonovan commented Aug 29, 2024

Related:

@earthboundkid's Group idea is a generalization of this issue and that one.

@earthboundkid
Copy link
Contributor

Group is usually collecting all elements that match something into the same group, where as this function is only consecutive chunks of elements.

FWIW, Python's groupby works like this.

>>> from itertools import groupby
>>> list(groupby("aabbaa"))
[('a', <itertools._grouper object at 0x100d24b20>), 
('b', <itertools._grouper object at 0x100d25ae0>), 
('a', <itertools._grouper object at 0x100d25b10>)]

@josharian
Copy link
Contributor

FWIW, we have an xiter package internally and we have this:

// ChunkFunc returns an iterator over chunks of n, splitting the input sequence wherever split returns true.
func ChunkFunc[Slice ~[]E, E any](s Slice, split func(a, b E) bool) iter.Seq[Slice]

Looking at our uses of it, I think it'd be hard to rewrite it into a "convert each element into a key" style function. Not impossible. But I think it'd also harm readability, at least in our cases, where the splitting mechanism is pretty intricate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

7 participants