proposal: container/set: new package to provide a generic set type (discussion) #47331
Jul 21, 2021 · 37 comments · 275 replies

This is a discussion that is intended to turn into a proposal.

This proposal is for use with #43651. We propose defining a new package, container/set, that will introduce a new set type. This package will not be in the Go 1.18 release (which we currently expect to be the first release that implements #43651), but is for consideration for a later release.

The description below focuses on the API, not the implementation. In general the implementation will be straightforward.

See also the slices proposal at #45955 (discussion at #47203) and the maps proposal discussion at #47330.

// Package set defines a Set type that holds a set of elements.
package set

// A Set is a set of elements of some comparable type.
// Sets are implemented using maps, and have similar performance characteristics.
// Like maps, Sets are reference types.
// That is, for Sets s1 = s2 will leave s1 and s2 pointing to the same set of elements:
// changes to s1 will be reflected in s2 and vice-versa.
// Unlike maps, the zero value of a Set is usable; there is no equivalent to make.
// As with maps, concurrent calls to functions and methods that read values are fine;
// concurrent calls to functions and methods that write values are racy.
type Set[Elem comparable] struct {
	// contains filtered or unexported fields
}

// Of returns a new set containing the listed elements.
func Of[Elem comparable](v ...Elem) Set[Elem]

// Add adds elements to a set.
func (s *Set[Elem]) Add(v ...Elem)

// AddSet adds the elements of set s2 to s.
func (s *Set[Elem]) AddSet(s2 Set[Elem])

// Remove removes elements from a set.
// Elements that are not present are ignored.
func (s *Set[Elem]) Remove(v ...Elem)

// RemoveSet removes the elements of set s2 from s.
// Elements present in s2 but not s are ignored.
func (s *Set[Elem]) RemoveSet(s2 Set[Elem])

// Contains reports whether v is in the set.
func (s *Set[Elem]) Contains(v Elem) bool

// ContainsAny reports whether any of the elements in s2 are in s.
func (s *Set[Elem]) ContainsAny(s2 Set[Elem]) bool

// ContainsAll reports whether all of the elements in s2 are in s.
func (s *Set[Elem]) ContainsAll(s2 Set[Elem]) bool

// Values returns the elements in the set s as a slice.
// The values will be in an indeterminate order.
func (s *Set[Elem]) Values() []Elem

// Equal reports whether s and s2 contain the same elements.
func (s *Set[Elem]) Equal(s2 Set[Elem]) bool

// Clear removes all elements from s, leaving it empty.
func (s *Set[Elem]) Clear()

// Clone returns a copy of s.
// The elements are copied using assignment,
// so this is a shallow clone.
func (s *Set[Elem]) Clone() Set[Elem]

// Filter deletes any elements from s for which keep returns false.
func (s *Set[Elem]) Filter(keep func(Elem) bool)

// Len returns the number of elements in s.
func (s *Set[Elem]) Len() int

// Do calls f on every element in the set s,
// stopping if f returns false.
// f should not change s.
// f will be called on values in an indeterminate order.
func (s *Set[Elem]) Do(f func(Elem) bool)

// Union constructs a new set containing the union of s1 and s2.
func Union[Elem comparable](s1, s2 Set[Elem]) Set[Elem]

// Intersection constructs a new set containing the intersection of s1 and s2.
func Intersection[Elem comparable](s1, s2 Set[Elem]) Set[Elem]

// Difference constructs a new set containing the elements of s1 that
// are not present in s2.
func Difference[Elem comparable](s1, s2 Set[Elem]) Set[Elem]

Replies


Needs some sort of performance promise. I think it would be ok to promise that Add and Has are amortized O(log(n)).
Not sure we need to enumerate that everywhere, but maybe a note at the bottom of package docs would list promises for all the functions.

6 replies
@Merovius

Why O(log(n))? My assumption would be that we'd use a hash-set (especially given the comparable constraint) which should provide O(1)?

@randall77

There's no obvious way to hash comparables. Although we might get a backdoor to the runtime to do that.

Another natural implementation would be a red-black tree, which is O(log(n)). Although that requires orderables.

It's all a question of how much we want to commit to the underlying implementation. O(log(n)) gives us some wiggle room. O(1) basically means it must be a hash table.
The "amortized" word is also doing some heavy lifting, as far as giving us some flexibility in the implementation.

@Merovius

There's no obvious way to hash comparables. Although we might get a backdoor to the runtime to do that.

It seems there is an obvious way to build a hashset though - using map[T]struct{}. I agree that it's a bit weird to use comparable as a stand-in for "hashable", but that's kind of the world we're in with Go, right? To ask another way: Is there a reason not to use a hashset, given that we already have a really good hashmap implementation? ISTM it gives the tightest complexity bounds with the widest requirements on the type (at least as far as Go is concerned, where everything that's comparable can be used as a map key).

FWIW, if I read in the docs that Add/Remove are O(log(n)) I would likely take that as an argument not to use it.

@Merovius

While we're at it: Personally, I would like us to add func Value[T comparable](h *Hash, v comparable)¹ to hash/maphash, to enable users to write generic hashmaps with the same power and good hash function as map - which could also be used to write a hashset, of course.

[1] Naming is hard. Ideally, this would be a method on *Hash. Or maybe a type Hasher[T comparable]? In any case - I'd like us to expose the runtime hash functions for comparable values in hash/maphash.

@ianlancetaylor

Yes, the intent of this package is to implement Set[K] using map[K]struct{}.
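As a rough sketch (not the actual implementation; the unexported field name m is invented here), the map-backed type could look something like:

type Set[Elem comparable] struct {
	m map[Elem]struct{} // nil until the first Add, so the zero value is usable
}

func (s *Set[Elem]) Add(v ...Elem) {
	if s.m == nil {
		s.m = make(map[Elem]struct{})
	}
	for _, e := range v {
		s.m[e] = struct{}{}
	}
}

func (s *Set[Elem]) Contains(v Elem) bool {
	_, ok := s.m[v] // a lookup in a nil map simply reports false
	return ok
}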

Regarding the naming, it seems inconsistent to me to have standalone functions named Union, Intersection, and Difference, when the set type has methods named AddSet and RemoveSet. It almost suggests that AddSet and RemoveSet perform a different operation (I understand they work in place).

4 replies
@nickkeets

FWIW Swift calls the in-place versions formUnion and formIntersection.

@ianlancetaylor

I don't think it's inconsistent. The Union function forms a union of two sets. The AddSet function adds one set to another. While clearly the concepts are very similar I think it would be more confusing to have a Union function that returns a new set and a Union method that changes an existing set.

@ajlive

I agree the function and the method should have different names, but I think Swift is at least on the right track here.

Scanning the API proposal above, I did a double-take at AddSet and had to check the comment to make sure I knew what it did.

FormUnion or, say, UnionWith, I think are more clear about both what the method does and what makes it different from the Union function.

@deanveloper

I agree with Ian here - AddSet is a good name for a mutable function. Union (or anything containing the word) makes me think of immutable mathematical operations. FormUnion is okay I guess, but IMO it still could be mistaken for s1.FormUnion(s2) creating a new map. s1.AddSet(s2) is very clear that it adds s2 to s1.

// (Note: New and Of are separate because New will always require a type parameter
// whereas Of will often be able to infer the type parameter from the arguments.
// For example, set.New[int] or set.Of(1, 2), but not set.Of[int]().)

I don't understand this argument (also, set.New[int]()). As Of is the more general function, this comment would suggest that New saves us having to write type-parameters - but it, of course, doesn't. "Always having to write a type-parameter" is clearly not a beneficial property of a function.

IMO there should just be one function New[Elem comparable](v ...Elem) Set[Elem]. It still allows you to write the same New call as before and you can get type-inference if you list the elements. Seems strictly better to me.
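In other words, with a single variadic constructor (using the name New here only for illustration), both call styles remain available:

set.New[int]()   // explicit type argument, empty set
set.New(1, 2, 3) // element type inferred from the arguments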

4 replies
@Merovius

In other words: AIUI Of[T]() will be equivalent to New[T]() so New is redundant and should be removed. Freeing up the canonical New name for Of.

@ianlancetaylor

I wrote both because they read differently. But, yes, we could remove New.

@DeedleFake

While New() seems more idiomatic, Of() actually reads quite well for both:

set.Of(1, 2, 3)
set.Of[int]()
@ianlancetaylor

I've removed New to provide just Of.

I've also changed to pointer receivers, so that the zero value is useful.

I find it awkward to say "Like maps, Sets are reference types" and having to explain that, instead of just making it a pointer. It might feel weird to always have to pass around a Set via a pointer, but for declared types, that's not super uncommon. And if we made every method use a pointer, we can make the zero value of Set equivalent to an empty set - maps are one of the main reasons why I often have to write constructors instead of having my zero values be useful, so I would prefer to have useful zero values.

4 replies
@bcmills

bcmills Jul 22, 2021
Maintainer

Agreed — having to make all of the map-typed fields within a struct gets pretty tedious. The meaningful zero-value is one of the few things about sync.Map that is more ergonomic than a built-in map.

@ianlancetaylor

I've made the zero value useful by changing the methods to take pointer receivers.
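For example, with pointer receivers the zero value can be used directly, much like a zero bytes.Buffer (a sketch assuming the API above):

var tags set.Set[string] // no constructor needed
tags.Add("go", "generics")
fmt.Println(tags.Len()) // 2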

@cespare

I've made the zero value useful by changing the methods to take pointer receivers.

That seems like an improvement, but I think we should further encourage sets to be passed by pointer by having all functions that accept or return sets use *Set[Elem] rather than Set[Elem]. (So change Of, Clone, HasAny, Union, etc.) As it is right now, this type uses a mix of pointer and non-pointer types that looks pretty unusual for a data structure type.

@jhenstridge

In particular, it means that none of the defined methods are in the method set of the value returned by Of(). While in many cases this doesn't matter, I suspect it will lead to confusion when people try to stuff sets into interface variables. Take the following contrived example, which would fail to compile:

type HasLen interface {
    Len() int
}
var foo HasLen = set.Of(1, 2, 3, 4, 5)

Any reason why Union, Intersection and Difference are functions instead of methods? image.Rectangle has Union, Intersect and other binary operations as methods, image.Point and the types in math/big as well.

Methods would feel more like infix notation. My suggestion would be Set[Elem].Union, Set[Elem].Intersect, Set[Elem].Diff

1 reply
@Merovius

Any reason why Union, Intersection and Difference are functions instead of methods?

Union is AddSet and Difference is RemoveSet, except that they operate in-place. There's definitely value in having both an in-place version and a not-in-place version, so adding Union as a method would mean it's the not-in-place version. Personally, I find it unsavory to return a new Set from a method (Clone is an obvious exception). image.Rectangle is different, to me, because it has a constant (small) size.

Why the requirement not to modify the set during iteration? Maps don't have that requirement and that's a very helpful property. Also, iterating by using a function is awkward (you can't just return from the outer function, or break more than one level of loop), so I wonder if it might be nicer to support some kind of iterator so iteration can be done with a straightforward for loop.

11 replies
@Merovius

Another huge difficulty, of course, is that when iterating []T, you might want to get the index you're at, for something like map[K]V you might want to get the key you're at and for something like chan T there is neither. So, a "general purpose iteration API" also needs to concern itself with how to resolve that.

FWIW we can always add methods to work with iterators, once we have an iterator interface.

@fzipp

@Merovius I see the challenges, but in my opinion it is worth directing more energy towards figuring out an iteration concept before adding packages for collection types, i.e. lay the foundation before building castles. I see an iteration interface on the same level of fundamentality as io.Reader/Writer that turned out to be incredibly beneficial to the standard library.

@Merovius

Note that io.Reader/Writer came after their concrete implementations, though.

@fzipp

Note that io.Reader/Writer came after their concrete implementations, though.

@Merovius In 2008, before the public release (called io.Read and io.Write back then) 😁, long before a compatibility promise was in sight, and it actually seems like they were defined at the same time as the first concrete Read(er) implementations.

I don't find "we can always add it later" compelling in this case. It concerns 9 out of the 18 proposed functions/methods (50%). Adding the more general functions later will increase the API surface unnecessarily, and the good function names will be taken by the obsolete functions.

@ianlancetaylor

It is not clear to me that we will want methods on Set that use iterators. Maybe we will, maybe we won't. In Go it seems to me more likely that iterator operations will take the form of functions. Containers will most likely need methods that provide an iteration mechanism, and then we might have a function like func AddTo(container.Addable, container.Iterator). (This may be a good argument that Set.Add should take a single element rather than a variadic, I'm not sure.)

Separately I'll note that Go already has a builtin iteration mechanism in the for/range statement. A general iterator mechanism should work cleanly with that, which will require language changes one way or another.

"Len" is an unusual name for the cardinality of a set, the colloquial term is usually "Size". Of course, "Len" would fit the Len method in sort.Interface (but the proposed Set type can't be sorted), and other types are measured via the "len" builtin function, so that's probably where it comes from.

2 replies
@DeedleFake

Both bytes.Buffer and strings.Builder have a Len() method and aren't sortable, though they are ordered. While I agree that Len() isn't the usual name for the cardinality, I think that I prefer the consistency over whatever's going on with Java's standard library.

@ianlancetaylor

I think Len is the standard Go name here. Even len(m) works for any map m.

Can we get explicit promises for the iteration order? And maybe another hashset without such promises?

10 replies
@Cyberax

Yeah, I don't necessarily want to impose order on all sets, just make it very explicit what kind of collection you're dealing with.

@leighmcculloch

A set that preserves not just order from one iteration to the next but insert order would be very useful.

@ianlancetaylor

This type does not provide any ordering guarantee. This is intentional. I expect that over time there will be other set types that provide different performance/ordering guarantees.

@fzipp

I expect that over time there will be other set types that provide different performance/ordering guarantees.

Interoperability between different set types will be difficult, because these methods and functions take concrete types: AddSet, RemoveSet, HasAny, HasAll, Equal, Union, Intersection, Difference

@ianlancetaylor

I don't yet see a reason for an AddSet method that takes a generalized set. That's a role for a function that takes a container.Addable and a container.Iterator.

Making Add and Remove variadic doesn't feel quite right to me. These are two of the fundamental primitives for sets, and having those methods involve slices seems like extra complexity that doesn't sit flush with the rest of the API. I suspect that the vast majority of Add and Remove calls will have exactly one argument anyway, and there is some small performance penalty to the variadic version.

6 replies
@cespare

Personally, from an API perspective, I very much like that Add and Remove are variadic. It easily allows adding both a single element and a slice - and I don't even think adding a slice is that uncommon.

IMO the main reason to use a variadic function is if it will mostly be called with a variable number of arguments in the call. Sometimes in code review I see that someone has written a variadic function but then all the call sites look like f(s...). In that case, it would be better for the function to take a normal, non-variadic slice.

Here, my intuition is that the usage will look like this:

s.Add(v)            // very common
s.Add(vs...)        // an order of magnitude less common
s.Add(v0, v1, v2)   // another order of magnitude less

If that is roughly correct, then the purpose of this being variadic is a kind of backdoor function overloading to allow accepting both a single value and a slice, and mostly not to support s.Add(v0, v1, v2). That's what doesn't feel quite right to me. When I look through all the variadic functions in the standard library (there aren't many) I can't spot any that seem like they are used this way. Most of them exist to support printf-like functionality.

If we want to support passing in a single value and also passing in a slice, I think that two separate functions (Add(Elem) and AddAll([]Elem)) are better.

I would personally expect them to be inlineable and thus have no penalty.

That does not follow. There is more work to be done to handle a slice than there is to handle a single value: preparing the slice in the caller; looping over the slice in the callee. I'm sure the compiler can do a good job of reducing the overhead, but will it get clever enough to reduce it to zero?

Just as an illustration, this benchmark compares Add(string) vs. Add(...string) using a trivial implementation of a string set using a map[string]struct{}. Both functions are inlined. At tip, this gives me

name    old time/op  new time/op  delta
Add-12  10.9ns ± 5%  11.7ns ± 2%  +6.95%  (p=0.000 n=10+10)
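(The benchmark itself is only linked above; a rough reconstruction of the kind of comparison being made could look like the following, where the type and function names are illustrative rather than the actual code.)

package set_test

import "testing"

type stringSet map[string]struct{}

func (s stringSet) Add(v string) {
	s[v] = struct{}{}
}

func (s stringSet) AddVariadic(v ...string) {
	for _, e := range v {
		s[e] = struct{}{}
	}
}

func BenchmarkAddSingle(b *testing.B) {
	s := stringSet{}
	for i := 0; i < b.N; i++ {
		s.Add("k")
	}
}

func BenchmarkAddVariadic(b *testing.B) {
	s := stringSet{}
	for i := 0; i < b.N; i++ {
		s.AddVariadic("k")
	}
}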
@Merovius

Fair enough. I find it surprising that there is a performance difference (albeit a small one, TBH), but I can't argue with the numbers. I was obviously making wrong assumptions.

@ianlancetaylor

I think we should fix the performance difference. We shouldn't decide between single argument and variadic argument on the basis of performance, only on the basis of readability.

@cespare

My main argument is that using variadic methods here is a worse API than using methods which take single values (so I wrote that down first and mentioned performance afterward).

However, I'm curious what it could mean to "fix" the performance difference. ISTM that the variadic methods imply passing values inside slices which inherently brings more complexity/instructions than passing a single value. Or are you positing some compiler optimization where it notices a variadic call with a single value and then "unrolls" the loop from the caller?

@ianlancetaylor

My main argument is that using variadic methods here is a worse API than using methods which take single values (so I wrote that down first and mentioned performance afterward).

Right, sorry for misrepresenting.

But from a readability perspective I don't see much difference. s.Add(1) and s.Add(1, 2) both seem clear to me. Even s.Add(vals...) is reasonably clear. So to me the variadic API doesn't seem like it involves any extra complexity to the reader.

As far as performance goes, we need to teach the compiler to unroll a range loop over a known slice value with less than N elements where N is some value larger than 1. Then I assume the performance will be the same after inlining.

I expect it will look something like the following sequence of rewrites:

    s.Add(1, 2)
    s.Add([]int{1, 2}...)
    for _, v := range []int{1, 2} { s.m[v] = struct{}{} }
    s.m[1] = struct{}{}; s.m[2] = struct{}{}

I would expect EqualFunc, as in maps.

1 reply
@ianlancetaylor

The proposed maps EqualFunc function uses the function to compare map values, not map keys. Sets have no values, only keys, so there is no need for EqualFunc.

This proposal uses Has but Contains is used everywhere else (strings, bytes, the proposed slices). To be clear, I prefer Has on its own, but the inconsistency makes it harder to learn and I think that's more important.

1 reply
@ianlancetaylor

Switched to Contains.

@bcmills

// Filter deletes any elements from s for which keep returns false.
func (s Set[Elem]) Filter(keep func(Elem) bool)

In functional programming, Filter has the connotation of returning a new data structure. Can we find a different name for this?

Personally, I like Prune — which, to me, carries the connotation of lopping off parts (of a plant) in-place.

6 replies
@fzipp

@bcmills With Prune I would expect the predicate function to be negated.

@Sajmani

I was thinking Keep, which is the name of the function arg, too.

s.Keep(func(i int) bool { return i > 0 })
@leighmcculloch

The term keep doesn't tell me for sure that it isn't in place.

Could we have .Filter and .FilterInPlace? The first a copy, the second in place. It's clear.

Providing both copy and in place methods are common in some stdlibs. E.g. in Ruby there are both but functions that mutate typically have a ! as the last character in their name: filter, filter!. Having the two functions with similar names helps me understand the API.

@ianlancetaylor

I'm leaning toward RemoveIf, which would be an in-place method.

I'm not sure we need a method that combines copying and removal. At this point that seems like a frill that we can consider adding if a lot of code winds up needing it.

@sbstp

Rust names this functionality retain. The method name makes it obvious what it does and if the condition is negated or not.

Parallel to #47330 (comment) for maps, would it make sense to reserve the Values() name for an iterator, and name the method that returns a slice something like ValueSlice() instead?

3 replies
@cespare

Or ToSlice or AsSlice or just Slice?

@leighmcculloch

Instead of placing the slice function on the type could it live on the iterator? Values returns an iterator and then you can construct a slice from the iterator.

@bcmills

@leighmcculloch, it could, but then we would have to define the iterator API before we could extract the values from a Set.

I think extracting the values as a slice is a common enough operation to merit its own method, and I don't think that this proposal necessarily needs to be gated on figuring out an entire ecosystem of “iterator” APIs.

In Java, it is often annoying that the various container types' add() methods are void. Although some of that annoyance is solved here by it being variadic, would it be possible for Add(), Remove(), and the like to return their own receiver? They'd still be in place, but that way something could be added and the set could be passed to something else in a single expression.

1 reply
@ianlancetaylor

That is not standard Go style, though. And speaking purely personally, I would find it misleading that code like F(s.Add(1)) would modify s. It's much clearer that s is modified when the entire statement is s.Add(1).

@rsc
rsc Aug 11, 2021
Maintainer

Retracted


Having a struct that is a reference type is surprising. If we're going to make Set[Elem] a reference type perhaps we should make the connection clearer, with

type Set[Elem comparable] map[Elem]struct{}

That would also allow people with an "old-style" set s to call methods using set.Set(s).Foo().

The obvious objection is that maybe Set and map should have different representations, but (1) they won't, and (2) if there were significant optimizations to make for sets, we should make them for map[T]struct{} too, to help all the old code that won't ever be converted.
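A sketch of what exposing the representation would permit (the method body and the helper below are illustrative only):

type Set[Elem comparable] map[Elem]struct{}

func (s Set[Elem]) Contains(v Elem) bool {
	_, ok := s[v]
	return ok
}

func example(old map[string]struct{}) bool {
	// an existing "old-style" set converts directly and gains the methods
	return Set[string](old).Contains("a")
}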

28 replies
@Merovius

@jimmyfrasche

Acknowledging that while it's modeling a set it's also just another map seems perfectly fine to me.

We are going to have to agree to disagree on this. Even if the method is just a wrapper around an equivalent maps function, the code using container/set will read very differently, depending on whether it calls into maps or into container/set, and I very much don't like the way it would look in the latter case.

@ianlancetaylor

Sets are like maps but they aren't maps. I think it's clearer to separate the API, rather than let people trivially use map operations on sets. So I think it's preferable to have a struct with a single field of map type.

@ajlive

I particularly don't like that a Set struct either forces a Set.Do or forces you to crack it open to get at the gooey map goodness inside for iteration (if it's even public, which it probably wouldn't be).

Then there's #47331 (reply in thread):

Also, using a loop might help escape analysis (AFAIR there were problems with closures and inlining).

Is this true?

@rsc

Regarding "(1) they won't", the pieces to write a set independent of map do not exist.

A comment mentioned hash-based, tree-based, and list-based sets as implementation strategies.
A list-based set has O(n) operations, so that's off the table to start with.
A tree-based set depends on ordering (< and >), but these sets only require == of their elements, not general ordering.
That leaves hash-based sets as the only possible implementation.

If we were to want to write a hash-based set implementation separate from maps,
there is nothing in the language or libraries that exposes a general hash of comparable elements.
hash/maphash exists but it only accepts []byte and string arguments.

We could introduce even more API, of course, or we could acknowledge that any implementation of set is necessarily going to be based on map, which is what I meant by (1).

@rsc

rsc Aug 13, 2021
Maintainer

I retract the suggestion to expose the map.
[I wrote a comment here about docs but I will make that a separate thread.]

@rsc
rsc Aug 11, 2021
Maintainer

Retracted


I wonder whether Union, Intersection, and Difference would be clearer as Or, And, and AndNot.

set.Or(s1, s2)
set.And(s1, s2)
set.AndNot(s1, s2)

Those names might also work as methods:

s1.Or(s2)
s1.And(s2)
s1.AndNot(s2)

11 replies
@ajlive

I am pretty sure there is an unambiguous bidirectional mapping with logical operators

It is indeed a bijection, and in formal logic the set symbols are used with the logic names, e.g., "U" is pronounced "or", and it's not even all that uncommon when mathematicians talk about sets.

the main thing I see us saving is bytes in the symbol names?

The only other argument I can see is that Xor is clearer for programmers who haven't studied set theory than DifferenceSymmetric. But we're probably talking about very minor differences in terms of which is clearer to who. Should we flip a coin? Go Time Twitter poll? ;)

@colin-sitehost

Ooh, it is a bijection; yeah, my vote is for logical operators. (With the secret hope that we can do operator overloading at some point.) That said, I am a little hesitant to buck the trend, but go programmers probably get logical operators better than set theory?

@Merovius

I'm strongly against names like Or, And, etc. IMO this is a package about sets, so we should use standard set parlance.
Or, And, etc. are not at all less jargony than Union, Intersection. If anything, it requires an extra step of knowledge, to be aware that logical and set operators map to each other this way.

But really, the point of jargon is that everyone using it can easily recognize and agree on what the specific operation is that is happening. Jargon isn't bad. It's a tool, which provides a shortcut in communication. So I think it would be less clear to deviate from the established parlance around sets.

Note that we can still have the doc-comments use logical operators to explain the operations, if we're really that worried that people might not know what a "set union" is. i.e. usage of the standard names doesn't prevent us from providing multiple explanations, addressing different target audiences.

@deanveloper

With the secret hope that we can do operator overloading at some point

If we get operator overloading, I do not want it added to the standard set package.

edit - To provide some reasoning: operator overloading on stdlib collections is a gimmick. Sure it's cool, but it gets vastly overused and causes confusion (something like set += elem would be illegal, since in Go operands are always required to have the same type, and we should keep this property with operator overloading). Then you also have confusions like "does set += set2 modify set in place, or does it return a new set?". Operator overloading on collections is hard to document properly, and it's less clear than simply using a method name.

@rsc

I retract this suggestion.

Retracted

Receptive to the argument in #47331 (comment) that the top-level functions not be in the package, at least in the name of a minimal initial API


@colin-sitehost mentioned symmetric difference in #47331 (reply in thread), and I think a DifferenceSymmetric should be considered

5 replies
@colin-sitehost
// DifferenceSymmetric constructs a new set containing the elements
// which are in either set, but not both.
func DifferenceSymmetric[Elem comparable](s1, s2 Set[Elem]) Set[Elem]
@Merovius

I thought about this, but I really can't think of ever needing it in practice. Not in programming and not in my 7 years of studying mathematics at university. I don't think I ever saw it in a context that wasn't just "there's a hole in the matrix of Venn diagrams, which we can fill with the symmetric difference". So, it seems a bit superfluous to add, to me.

That being said, I also wouldn't really mind if we have it. But we should call it SymmetricDifference, because that's what the operation is called and it's standard English word order to have the adjective first.

@ajlive

Symmetric difference did come up in math for me, though maybe only in Boolean rings/algebras...

OTOH, I've used the symmetric difference regularly, if not frequently, in programming for tasks like processing and validating sparse data. One of the things I appreciate about Python is that the symmetric difference is there for me when I need it.

SymmetricDifference, because that's what the operation is called and it's standard English word order to have the adjective first.

Good point. I proposed DifferenceSymmetric to make it discoverable with Difference, and I weighed that factor a little more heavily. (If we went with the logical names proposal it would be Xor, of course.)

@ajlive

From #47331 (comment), it's concise to write:

set.Union(set.Difference(x, y), set.Difference(y, x)) // symmetric difference of x and y.
set.Difference(set.Union(x, y), set.Intersection(x, y)) // also symmetric difference of x and y

...at least if those top-level functions are retained. Writing it using only the methods is a little more painful.

I think we can probably get most of this right on the first landing, but would it be worth stuffing this behind golang.org/x/ like we did for xerrors, just to see in situ usage before committing to the interface? (Same goes for slices and maps, but at least there we know how people are currently using those types and we are just adding extant helper methods that have currently been copied around.)

4 replies
@robaho

I think this conversation provides ample evidence why the "collections interfaces" need to be defined before you discuss any implementations. I implore you to study the Java collections package and its interfaces. It will save a lot of work and provide some consistency for those that might migrate from Java to Go.

@ajlive

The slices package discussion ended up with a pretty minimal package, which is a good alternative to /x/. If #47331 (comment) goes forward and Set is just a map[T]struct{} alias (which I am strongly in favor of), I think this package becomes almost trivial, with a lot of things just falling out of maps and several functions just replicating or even calling functions in the maps package.

@ajlive

this package becomes almost trivial, with a lot of things just falling out of maps and several functions just replicating or even calling functions in the maps package.

See #47331 (reply in thread) and replies

@ajlive

Retracted since I found #47331 (comment) convincing.

Retracted

As @rsc has retracted #47331 (comment), I've retracted this. I still don't like Do, and would rather go with @leighmcculloch 's suggestion to use Values and range (#47331 (comment))


If #47331 (comment) is agreed on and a Set is just a map[T]struct{}, I propose we drop Set.Do and just use range.

for e := range mySet { ... }

I like this much better than Do. Do not want.

6 replies
@ajlive

Also agree with his general philosophy around using range wherever possible: #47331 (reply in thread)

Range is generic and works on a variety of types, even if it's only built-in types. I don't think it's so bad if we use it for sets. I think we should be leaning towards how do we use range on more collection types rather than less. Do would be harder to use.

@cristaloleg

Also, using a loop might help escape analysis (AFAIR there were problems with closures and inlining).

@deanveloper

Honestly I've started using channels every time I want an iterator now. I tried changing that for a bit with talk of #43557, but eventually went back to using channels again. People sometimes complain about efficiency with this, but I'm using it in a pathfinding application (making heavy use of channels) and I've never had any issues.

I honestly wouldn't mind seeing something like func (s Set[T]) Iter() <-chan T. This brings the benefits of using struct as well as the benefits of being range-able.

@Merovius

The problem of using channels isn't mainly efficiency, but the fact that you are leaking goroutines if you don't exhaust them.
This can be worked around with using context.Context (or another cancellation mechanism) but it's hardly ergonomic.
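A minimal sketch of the channel-based Iter being discussed, and where the leak comes from (it goes through the proposed Values method for simplicity):

func (s *Set[T]) Iter() <-chan T {
	ch := make(chan T)
	go func() {
		defer close(ch)
		for _, v := range s.Values() {
			// If the consumer breaks out of its range loop early, nothing
			// ever receives this send, and the goroutine blocks forever.
			ch <- v
		}
	}()
	return ch
}

Ranging over s.Iter() to completion is fine; breaking out early strands the sending goroutine unless some cancellation mechanism is added.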

@ajlive

Retracted since #47331 (comment) is retracted

@rogpeppe asked "Why the requirement not to modify the set during iteration?"
but that thread diverged into general iterators.

I think we should drop the requirement that the set is not modified.
The reason we allow modification of maps during range applies equally well here:
disallowing modification essentially makes modification "undefined behavior"
which means implementation-specific behaviors, portability problems, security holes, and abuse.

If modification during Do must be disallowed, then we should require an implementation to panic when it is attempted.
(Do can set a 'no modifications' bit that the other methods can check.)
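A sketch of that guard (field names invented for illustration; Add stands in for all mutating methods):

type Set[Elem comparable] struct {
	m         map[Elem]struct{}
	iterating bool
}

func (s *Set[Elem]) Do(f func(Elem) bool) {
	s.iterating = true
	defer func() { s.iterating = false }()
	for v := range s.m {
		if !f(v) {
			return
		}
	}
}

func (s *Set[Elem]) Add(v ...Elem) {
	if s.iterating {
		panic("set: Add called during Do")
	}
	if s.m == nil {
		s.m = make(map[Elem]struct{})
	}
	for _, e := range v {
		s.m[e] = struct{}{}
	}
}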

0 replies
3 replies
@rsc

It's fine to write your own list-based set for a specific use case.
That's not a plausible general implementation in this package.
Remember that type Set here is a specific implementation, not an interface.

Also note that replying via email apparently starts a new thread instead of actually replying.
Please try to reply on the web instead.

@robaho

So sorry again. Clearly the world is telling me not to respond to these.

But that raises the question: why so much focus on what the “standard” implementation is? Why is that a topic? The details should be hidden away and can be changed easily if needed. The discussion should focus on the operations.

@rsc

Agree about focusing on operations. Marking this resolved.

I thought for a while about what if people want more complex set operations like symmetric difference, union of more than two sets, and so on. Clearly that can be built out of the pieces here:

set.Union(x, set.Union(y, z)) // three-set union
set.Union(set.Difference(x, y), set.Difference(y, x)) // symmetric difference of x and y.
set.Difference(set.Union(x, y), set.Intersection(x, y)) // also symmetric difference of x and y

and so on. The problem is that all of these generate significant garbage, like Map, Filter, and Reduce in the slices API.

I wonder whether Union, Intersection, and Difference should be removed and instead we should encourage programmers to write them using temporaries they can control, or reusing existing sets. This is apparently what Java does: see "Set Interface Bulk Operations" in https://docs.oracle.com/javase/tutorial/collections/interfaces/set.html.

In Java, it was impossible to have the top-level functions because Set is an interface not an implementation. The mutating receiver version specifies the implementation. Even so, it is still also the case that they avoided a garbage-heavy API.

In Rust these operations produce distinct types that appear to serve as iterators alone.

In Swift and Python they do generate new sets.
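For comparison, with only the mutating methods a caller controls all the temporaries itself (a sketch; x and y are assumed to be existing Set variables):

u := x.Clone()
u.AddSet(y) // union of x and y

i := x.Clone()
i.Filter(y.Contains) // intersection of x and y

d := x.Clone()
d.RemoveSet(y) // x minus y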

5 replies
@ajlive

I wonder whether Union, Intersection, and Difference should be removed

I'm surprised to find myself receptive to this idea, at least in the name of a minimal initial API

@ajlive

If we remove Union, Intersection, and Difference, we should add Java's Set.RetainAll.

@ajlive

However, following the naming convention in AddSet and RemoveSet, I don't think it's anywhere near as clear what RetainSet does compared to the other two methods. Maybe we should reconsider copying Java's API there: AddAll, RemoveAll, RetainAll.

@Merovius

I think they are vital to a set library, but I think it would be good to have them take iterators. So removing them until we figure out iterators seems okay.

@ajlive

If we remove Union, Intersection, and Difference, we should add Java's Set.RetainAll.

In case this wasn't clear: If we remove the set.Union function and don't add a set.Set.RetainAll method, there will be nothing in the API to calculate an intersection, which would be bizarre.

I think the doc comment can be reduced to:

// A Set is a set of elements of some comparable type.
// The zero value of a Set is an empty set ready to use.
type Set[Elem comparable] struct {
    ... unexported fields ...
}

Calling sets "reference types" is confusing because they're not, at least as currently documented. They're just structs that happen to hold pointers, which is true of most Go types. We don't refer to them all as reference types.

Calling out that Sets cannot be modified concurrently from multiple goroutines is also confusing, because that's the default expectation for every Go data structure. We only document the deviations, when concurrent modifications are allowed.

(Compare list.List: it is just as much a "reference type" and is similarly safe to read but unsafe to modify from multiple goroutines, but we don't call attention to any of that. Or bytes.Buffer, which is also just as much a "reference type".)

Mentioning maps also encourages the reader to start thinking about questions like why isn't this a map or exactly how maps work. It is probably better to just let Sets be Sets.

0 replies

The Set as proposed, relevant components below, does not have a convenient method for iteration.

type Set[Elem comparable] struct {
	// contains filtered or unexported fields
}

// Values returns the elements in the set s as a slice.
// The values will be in an indeterminate order.
func (s *Set[Elem]) Values() []Elem

// Do calls f on every element in the set s,
// stopping if f returns false.
// f should not change s.
// f will be called on values in an indeterminate order.
func (s *Set[Elem]) Do(f func(Elem) bool)

I see two approaches to iteration given the current API:

  1. Using the Do method:

    Do allows you to provide a function that will be called for every element in the Set. This introduces a new pattern for iteration that isn't found in other types, since iteration over all other types in Go is supported with a for range loop. This different pattern has some limitations.

    • It is not possible to return immediately from within an iteration. Instead, a value must be copied to a temporary variable and returned outside the iteration function, resulting in code for iterating sets that looks significantly different than for other types. For example:

      func _() (Elem, bool) {
          set := set.Set[Elem]{}
          // ...
          found := false
          var foundE Elem
          set.Do(func(e Elem) bool {
              if condition {
                  found = true
                  foundE = e
                  return false
              }
              return true
          })
          return foundE, found
      }

      I think it would be easier to work with map[key]struct{}. For example:

      func _() (Elem, bool) {
          set := map[Elem]struct{}{}
          // ...
          for v := range set {
              if condition {
                  return v, true
              }
          }
          return Elem{}, false
      }
    • Similar to the return example, it is not possible to break out of enclosing nested for loops without a similar pattern.

  2. Using the Values method and range:

    This approach doesn't share the disadvantages of the Do approach, but it comes at a cost since the Set is copied into a slice.

Could we include in this proposal a method of iterating Set using range? I understand this would be a language change and is therefore worthy of a separate proposal. I'm happy to open a new issue for it but I was holding off because it sounded like we'd get range for free with #47331 (comment) but that proposal has been retracted.

24 replies
@robaho

Sorry - Java doesn’t allow it by design to avoid confusing race conditions.

I also think, though, that you will need a different Do signature because you need to handle the case where the index/key is made available, not just the value.

@Merovius

@robaho Yes. Presumably there will be some flexibility there (i.e. we'd allow both Do(T) and Do(K,V)). But again, the details aren't important right now.

@deanveloper

One issue with using Do as the “standard iterator” is that a poorly written Do implementation which ignores a false return will keep calling its argument

@Merovius

@deanveloper Any user-implemented iteration API will have that problem though.

In any case, given that the intention is explicitly not to discuss iteration now, but do it later, it seems we shouldn't go too deep into the ins and outs of how Do would work for that case. I think it's clear it could be made to work with range though, alleviating at least that concern (if it's a good idea to do so is a question for another day).

@leighmcculloch

Given that I've seen a couple comments above that iteration isn't something to discuss within the container/set proposal, I've opened a new issue #47707 containing the proposal to support for range with user-defined types using the ideas described above.

@bcmills
bcmills Aug 31, 2021
Maintainer

[Edit: this turns out to be confusion over naming rather than a problem with the API signature itself.]

The lack of a short-circuit mechanism for Do seems awkward to me.

The corresponding sync.Map method (Range) returns a boolean from the callback to allow the caller to stop early. That makes the calls much more efficient if the caller is looking for a property that is reasonably likely to be satisfied by an arbitrary small sample of elements from a much larger set.

It is clearly possible to write every Do call site nearly as efficiently (but a bit more verbosely) using Range, but I do not see a way to write a Range-like call (that breaks early) using Do. (The closest we have in the current API is ContainsAny, but that method requires that the “small sample” be a single element.)

5 replies
@Merovius

The signature of Do in the top-post is

// Do calls f on every element in the set s,
// stopping if f returns false.
// f should not change s.
// f will be called on values in an indeterminate order.
func (s *Set[Elem]) Do(f func(Elem) bool)

So, maybe I'm missing something, but ISTM that the short-circuit mechanism you want is already in place?

@deanveloper

I think I’ve written this a few times (sorry if I am a bit repetitive) - I really think that a standard iteration mechanism should be implemented before we add more containers. This avoids the problem of having both “the .Do way” and “the .Iter way” in the future, and it would also solve usability concerns such as this one.

@bcmills

Oh, so it is! Apparently I fail at reading comprehension today. 😩

@bcmills

Ah! I was confused by the other Do thread, in which the API is compared to other Do methods in the standard library that lack the boolean. Perhaps that suggests a change in naming for consistency: the Range name should imply the boolean, while the Do name should imply its absence.

@ajlive

I was confused by the other Do thread, in which the API is compared to other Do methods in the standard library that lack the boolean

Me too

Retracted


I think something that may be concerning is that because this is implemented using a map, this would mean that sets are not comparable. So, a Set of Sets is illegal. However, I'm not exactly sure how we could implement sets such that a set of sets would be useful, other than abandoning the map idea, and using custom hash/equality methods. I don't really like that idea though.

9 replies
@Merovius

It's theoretically possible to build variable sized data structures which are comparable. Theoretically, it might be possible to extend this technique to a reasonably efficient comparable set data-structure (probably not with O(1) access time, but maybe O(log(n))).

However, this doesn't seem very go like. I think it's better to just accept that sets are not comparable.

@ajlive

I'm not sure that it would be, unless Go someday allows for immutable maps which would be comparable.

I was imagining that a FrozenSet need not necessarily be implemented using maps. That's an implementation detail that could be hidden?

In Java, this is done by allowing classes to override the int hashCode() and boolean equals(Object other) methods.
In Rust, this is done by implementing the Eq and Hash traits.

There has been discussion elsewhere here about extracting some kind of interface for sets in the future, with more or less controversy.

@ajlive

I hit "reply" before @Merovius 's comment showedd up on my screen.

I think it's better to just accept that sets are not comparable.

Agree. Haven't ever used frozenset in Python, and someone could build their own if they really needed one.

@deanveloper

It's theoretically possible to build variable sized data structures which are comparable. Theoretically, it might be possible to extend this technique to a reasonably efficient comparable set data-structure (probably not with O(1) access time, but maybe O(log(n))).

That's a really neat example. I wouldn't really say that the example "isn't Go-like" (I wouldn't mind seeing this as a third-party library), but I agree that something like this shouldn't be the standard implementation or even exist in the standard library. However I would love to use something like this to convert a set into an immutable set, then use that to make a set of sets. Going to retract this.

Edit - Definitely can't have O(log(n)) time without the elements being orderable, unfortunately; the "generic case" is definitely O(n)

@Merovius

@deanveloper I tend to be careful with using "definitely" when applied to negatives :) Just because the obvious idea doesn't work, doesn't mean there aren't any non-obvious ideas. For example, you could definitely use reflect and hash/maphash to build a custom hash-function for arbitrary comparable types and then build a binary search tree using those hashes as keys, to get O(log(n)) lookups. You can't easily use it to build a hash-map, as that needs a variable number of randomly accessible buckets, but if you get creative with skip-lists you might well get something O(log(log(n))). I can even imagine someone being even more creative and find an O(1) implementation.

It's always harder to prove that something is not possible, than it is to prove that it is. So I prefer to be conservative about such claims :)

3 replies
@deanveloper

I don't think efficiency is the main reason why we wouldn't want to use custom hash functions, mainly just that it doesn't feel very Go-like

@Merovius

@robaho Please use the web interface when participating in discussions, for proper threading. You've been reminded a bunch of times about this.

As to your point: This might be something a custom implementation of a hash-set might do/allow via an API, but it doesn't address the actual point of having a set representation which can be used as a key for the builtin map. The builtin map does not allow overriding the equality operator or the hash, so you actually need to have a comparable type. Obviously, putting a cached hash into a struct doesn't help, because a) the rest of that struct would still have to support the equality operator just the same and b) the hash would then be part of the identity of the value as well, meaning it would have to be re-computed on every operation, which clearly loses all efficiency benefit.

@robaho

Sorry again. Why not fix this - it’s stupid. If I reply to a particular message it should know where the reply belongs.

Maybe this is a nitpick, but Intersection and Difference seem to be much longer than I'd expect.

Is there any appetite for the verb forms of Union, Intersect, and Diff?