Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: container: add tree-based ordered.Map/Set #60630

Open
leaxoy opened this issue Jun 6, 2023 · 22 comments
Open

proposal: container: add tree-based ordered.Map/Set #60630

leaxoy opened this issue Jun 6, 2023 · 22 comments
Labels
Milestone

Comments

@leaxoy
Copy link

leaxoy commented Jun 6, 2023

There are at least several thousand related implementations OrderedMap/OrderedSet, although we have generics, there are no related containers in the standard library, so people have to implement their own sorting containers.

So I propose add ordered.Map/Set into std, maybe container package is a suitable place.

The main API are:

type Map[K, V any] struct{ /* private fields */ }

// Constructor function
func NewMapFunc[K, V any](cmp func(K, K) int) *Map[K, V] {}
func NewMap[K cmp.Ordered, V any]() *Map[K, V] {}

// Method set
// Insert new pair and return previous value if present.
func (m *Map[K, V]) Insert(k K, v V) (V, bool) {}
// Get get value associated with k, else return empty and false.
func (m *Map[K, V]) Get(k K) (V, bool) {}
// Contains judge k in OrderedMap ?
func (m *Map[K, V]) Contains(k K) bool {}
// Delete entry by key and return stored value if present.
func (m *Map[K, V]) Delete(k K) (V, bool) {}
// Keys return all keys in ascending order.
func (m *Map[K, V]) Keys() iter.Seq[K] {}
// Values return all values in ascending order by key.
func (m *Map[K, V]) Values() iter.Seq[V] {}
func (m *Map[K, V]) Entries() iter.Seq2[K, V] {}
// Clear clear the whole map.
func (m *Map[K, V]) Clear() {}
// Retain will keep those entry which satisfy the given predicate function.
func (m *Map[K, V]) Retain(f func(K, V) bool) {}
// Len return number of entries of map
func (m *Map[K, V]) Len() int {}
// First return the minimum entry.
func (m *Map[K, V]) First() (Entry[K, V], bool) {}
// Last return the maximum entry.
func (m *Map[K, V]) Last() (Entry[K, V], bool) {}
// Reverse reverse order of all entries.
func (m *Map[K, V]) Reverse() iter.Seq2[K, V] {}
func (m *Map[K, V]) Reversed() *Map[K, V] {}
// Merge with another OrderedMap, if key exists, replace it
func (m *Map[K, V]) Merge(o *Map[K, V]) {}
// SubMap copy entries between start and end to a new Map.
func (m *Map[K, V]) SubMap(start K, end K) *Map[K, V] {}

// Wrapper of Map with unit value.
type Set[T any] struct{ inner *Map[T, struct{}] }

// Constrcutor function
func NewSetFunc[T any](cmp func(T, T) int) *Set[T] {}
func NewSet[T cmp.Ordered]() *Set[T] {}

// Method set
func (s *Set[T]) Contains(v T) bool {}
func (s *Set[T]) ContainsAll(seq iter.Seq[T]) bool {}
func (s *Set[T]) ContainsAny(seq iter.Seq[T]) bool {}
func (s *Set[T]) Add(v T) {}
func (s *Set[T]) AddSeq(seq iter.Seq[T]) {}
func (s *Set[T]) Delete(elem T) {}
func (s *Set[T]) DeleteSeq(seq iter.Seq[T]) {}
func (s *Set[T]) First() (T, bool) {} // first element or none
func (s *Set[T]) Last() (T, bool) {} // last element or none
func (s *Set[T]) All() iter.Seq[T] {}
// or
func (s *Set[T]) Values() iter.Seq[T] {} // all elements iterator
func (s *Set[T]) Reverse() iter.Seq[T] {} // reverse order iterator
func (s *Set[T]) Reversed() *Set[T] {} // reverse order then write into another set
func (s *Set[T]) Len() int {} 
func (s *Set[T]) Retain(f func(T) bool) {}
func (s *Set[T]) SupersetOf(o *Set[T]) bool {}
func (s *Set[T]) SubsetOf(o *Set[T]) bool {}
func (s *Set[T]) InterSection(o *Set[T]) *Set[T] {}
func (s *Set[T]) UnionInPlace(o *Set[T]) {}
func (s *Set[T]) Union(o *Set[T]) *Set[T] {}
func (s *Set[T]) Difference(o *Set[T]) *Set[T] {}
func (s *Set[T]) SymmetricDifference(o *Set[T]) *Set[T] {}
func (s *Set[T]) SubSet(start T, end T) *Set[T] {}

Appendix A

  1. In C++, there are ordered_set/ordered_set based on rb-tree
  2. In Java, there are SortedMap and SortedSet.
  3. In Rust, there are BTreeMap and BTreeSet

Appendix B

If we land sum type in go, those return a bool flag would be instead by Option or Maybe, it's intuitive and more suitable.
Sum type is more suitable than Product Type here.

Appendix C

Since Iter should be a separate issue, so this proposal would not include Iter api. So move Iter API to end of section.

@leaxoy leaxoy added the Proposal label Jun 6, 2023
@gopherbot gopherbot added this to the Proposal milestone Jun 6, 2023
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Jun 6, 2023
@ianlancetaylor
Copy link
Member

  1. We aren't going to add any general purpose containers until we have an idea of how to implement iterators, aka how to find all the elements in the container. Iterators are pending resolution of user-defined iteration using range over func values #56413, which itself is pending the forward/backward compatibility work that will be part of Go 1.21.
  2. This should perhaps be a discussion first.

@earthboundkid
Copy link
Contributor

There's already a set proposal that is snagged on iteration. This seems related to that. This could be containers/ordered.Set and containers/ordered.Map and have overlapping methods with containers/set.

@leaxoy
Copy link
Author

leaxoy commented Jun 7, 2023

  1. We aren't going to add any general purpose containers until we have an idea of how to implement iterators, aka how to find all the elements in the container. Iterators are pending resolution of user-defined iteration using range over func values #56413, which itself is pending the forward/backward compatibility work that will be part of Go 1.21.
  2. This should perhaps be a discussion first.

@ianlancetaylor I'm agree that Iter is a important feature to iterate over all element in container as you explained, but there also many features does't relay on Iter, such as: map.Get/Insert/Delete/Contains/Clear/IsEmpty/First/Last and set.Insert(Many)/Contains(All/Any)/Delete/Clear/IsEmpty. They are useful in many scenarios.

A benefit is map's Key and set's Elem can any type, not limited to cmp.Ordered.

For those internal maybe relay on Iter, like set.SuperSetOf/SubSetOf/Difference/Union/InterSection and map.Merge/Keys/Values, not exposing implementation details can leave room for the introduction of Iter later.

The rest are Iter related, like: Iter can be ecognized by the compiler and can be used in for-loop, and iterate operations can based on this, including but not limited to ForEach/Map/Filter/Reduce/Fold/Max/Min/AllOf/AnyOf/Take/Skip/Cmp/Zip/Unzip/Chunk/GroupBy ...

At least for CPP, operations on containers can be split into several parts, Iterators/Element Access/Capacity/Modifiers/List Operations/Lookup.. at Containers, Iterators works with std::ranges.

If I understand correctly, what you care about above is the Iterators part

@petkostas
Copy link

petkostas commented Aug 2, 2024

I am curious why an ordered map cannot be introduced or considered. As stated, there are many different implementations of it, which are increasingly being adopted. Still, they are painful to implement when using third-party libraries (especially when those involve range iterations).
Because many languages have already adopted a similar functionality (some examples):

To provide a more concrete example, I am using libopenapi, which uses a custom orderedmap, rendering specifications requires the ordering to be in place, especially when visualizing the results. With the current lack of an ordered map solution, a developer must introduce another solution compatible with the native supporting packages (e.g., templates and range functionality).

@petkostas
Copy link

Coming back, with the use of yield things become a bit easier. I gave yield spin today over an iteration of an ordered map that previously required me to convert to a slice and loop over the slice and then over the map.
I still believe that adding support for an ordered map/set makes sense.

@adonovan
Copy link
Member

adonovan commented Sep 3, 2024

The Keys and Values methods (and a new All method) should definitely use go1.23 iterators rather than materializing a slice.

func (*Map[K, V]) Keys() iter.Seq[K]
func (*Map[K, V]) Values() iter.Seq[V]
func (*Map[K, V]) All() iter.Seq2[K, V]

Reverse too could return an iterator rather than a clone:

func (*Map[K, V]) Reverse() iter.Seq2[K, V]

Similarly, SubMap (renamed Range) could be an iterator for a portion of the key space. If you need to construct a new map from a Range (or All), combine the iterator with a InsertAll method that consumes the iterator.

func (*Map[K, V]) Range(from, to K) iter.Seq2[K, V]
func (*Map[K, V]) InsertAll(iter.Seq2[K, V]) bool

What is Entry? Just a pair? Or a reference to a part of the representation? (Java's Map.Entry took the latter choice and IIRC it was a major barrier to optimized special-purpose maps, because they still had to allocate one object per entry.) Is it necessary?

I like that all the operations return the maximum amount of information in a single access (e.g. Insert reports whether there was a previous entry), as these can really improve efficiency.

Retain should be DeleteFunc, with sense inverted.

We should try to harmonize the APIs of related collections types such as sets, lists, maps, queues, and so on. Java 1.5's collections did a good job of establishing conventions across a range of data types so that, for example, they all have an add(T) bool method that reports whether the size of the collection changed (though the API has grown to the point where I can no longer hold in memory).

Naming suggestion: package ordered; type Map; func NewMap{,Func}? Or package maps; type Ordered; func NewOrdered{,Func}?

@jimmyfrasche
Copy link
Member

I don't like Range since it clashes with range in a confusing way though I'm not sure what a better name is.

InsertAll should just be Insert because of maps.Insert.

👍 to container/ordered. maps is for native maps. If there's a type in maps, it should be for wrapping a native map in something with all the usual methods for other container types.

Also, SuperSetOf should be SupersetOf.

@DeedleFake
Copy link

I feel like container/ordered is ever so mildly confusing because there are lot ordered container types that make no sense in it, such as just a regular linked list, though I see the appeal of ordered.Map[K, V].

@seankhliao seankhliao changed the title proposal: add tree-based OrderedMap/OrderedSet proposal: container: add tree-based OrderedMap/OrderedSet Sep 3, 2024
@jimmyfrasche
Copy link
Member

@DeedleFake I would expect inserting into an ordered.List to perform the insertion at an index that maintains sort order. So an ordered.Set that allows duplicates.

@leaxoy
Copy link
Author

leaxoy commented Sep 4, 2024

The Keys and Values methods (and a new All method) should definitely use go1.23 iterators rather than materializing a slice.

func (*Map[K, V]) Keys() iter.Seq[K]
func (*Map[K, V]) Values() iter.Seq[V]
func (*Map[K, V]) All() iter.Seq2[K, V]

Reverse too could return an iterator rather than a clone:

func (*Map[K, V]) Reverse() iter.Seq2[K, V]

Similarly, SubMap (renamed Range) could be an iterator for a portion of the key space. If you need to construct a new map from a Range (or All), combine the iterator with a InsertAll method that consumes the iterator.

func (*Map[K, V]) Range(from, to K) iter.Seq2[K, V]
func (*Map[K, V]) InsertAll(iter.Seq2[K, V]) bool

What is Entry? Just a pair? Or a reference to a part of the representation? (Java's Map.Entry took the latter choice and IIRC it was a major barrier to optimized special-purpose maps, because they still had to allocate one object per entry.) Is it necessary?

I like that all the operations return the maximum amount of information in a single access (e.g. Insert reports whether there was a previous entry), as these can really improve efficiency.

Retain should be DeleteFunc, with sense inverted.

We should try to harmonize the APIs of related collections types such as sets, lists, maps, queues, and so on. Java 1.5's collections did a good job of establishing conventions across a range of data types so that, for example, they all have an add(T) bool method that reports whether the size of the collection changed (though the API has grown to the point where I can no longer hold in memory).

Naming suggestion: package ordered; type Map; func NewMap{,Func}? Or package maps; type Ordered; func NewOrdered{,Func}?

@adonovan

Agreed, when this proposal was made, iterators had not yet been implemented. Now that they have been implemented, it makes sense to use iterator syntax.

The existence of Entry is a personal preference. Since seq2 has never been seen in other languages, tuple is used instead, but since go uses seq2, there is no better reason not to use it here.

Short names are indeed better than long names, changed in original proposal.

In my personal implementation it has indeed been the short name for some time

Although I still think seq2 is a less successful abstraction. ^_^

@leaxoy
Copy link
Author

leaxoy commented Sep 4, 2024

I don't like Range since it clashes with range in a confusing way though I'm not sure what a better name is.

InsertAll should just be Insert because of maps.Insert.

👍 to container/ordered. maps is for native maps. If there's a type in maps, it should be for wrapping a native map in something with all the usual methods for other container types.

Also, SuperSetOf should be SupersetOf.

@jimmyfrasche

Since the native map supports direct insertion of a key-value pair at the syntactic level, but this does not work here. If you insert a sequence using insert, then I cannot think of any other suitable naming for inserting the key-value.

How about use InsertSeq or InsertIter instead.

@DeedleFake
Copy link

InsertSeq() fits well with slices.AppendSeq().

@leaxoy leaxoy changed the title proposal: container: add tree-based OrderedMap/OrderedSet proposal: container: add tree-based ordered.Map/Set Sep 23, 2024
@jba
Copy link
Contributor

jba commented Oct 28, 2024

I've been looking into this topic a bit and I'd like to share some of that work. This comment is about the leading ordered map implementation, github.com/google/btree. I think it's relevant to talk about, because if it's great, or maybe even good enough, then it can satisfy the ecosystem's need for an ordered map and there's no compelling reason to add one to the standard library. Python seems to have taken a similar route: there is no ordered map in the Python standard library, but there is at least one high-quality third-party implementation.

The strengths of google/btree (as I'll call it throughout) are:

  • It is stable, with few active changes and few or no outstanding bugs.
  • It is mature, having been around since 2014.
  • It is widely used: over 2,000 importers according to pkg.go.dev.
  • It is battle-tested: it is imported by major packages like Trillian, CockroachDB, and etcd.

Its weaknesses are:

  • It supports generics, but not range-over-func.
  • Its API is expressed in terms of a single Item type that includes both the key and value, instead of the more typical separate key and value types. That means that to look up a key, for instance, you have to pass an entire item with the key portion set.
  • It does not support range deletion (deleting a contiguous subsequence of keys). Range deletion can be done in logarithmic time, asymptotically faster than doing the deletions individually, so it is more than just a convenience. The issue to add it has been open since 2017.
  • The iteration API is clumsy: there are eight methods for iteration.
  • It provides no iteration guarantees when there is interleaved modification. That is, if the tree is modified while it is being iterated over, unexpected things can happen: deleted keys can appear, inserted keys may appear twice, and so on. (To be clear, this is about a single goroutine writing during an iteration, not about concurrency: like most Go data structures, google/btrees are not safe for use by multiple goroutines unless all of them are reading.)

Last but not least, google/btree uses copy-on-write. One way to think about that is that cloning, instead of copying all at once, happens incrementally. Each clone copies a few of the shared tree nodes during each modification, until both are fully copied and no nodes are shared. Of course, if those modifications never happen, no copying happens either.

Copy-on-write provides a kind of iteration guarantee. Clone a btree, then iterate over the clone while confining all modifications to the original. The iteration will observe a snapshot of the original's items. Furthermore, the iteration is concurrency-safe (though not the individual keys and values), because the tree nodes are never written—they are always copied first.

Is copy-on-write a strength or a weakness? I would say both. It makes some operations, like iteration and keeping a snapshot of a tree, quite inexpensive. But its performance can be surprising if you are not aware of what's happening inside the implementation. For example, even after the iteration over a clone is finished and the clone becomes garbage, the original will still be copied each time a "shared" node is modified; the implementation has no good way to tell that the nodes are no longer shared.

google/btree is arguably good enough to serve as the ecosystem's ordered map of choice. On the other hand, its quirks and lack of ranged delete leave room to argue that there is a place for a more idiomatic implementation.

@jba
Copy link
Contributor

jba commented Oct 28, 2024

In this second comment (the first is #60630 (comment)), I'd like to suggest a modern, idiomatic API for ordered maps. The github.com/jba/omap package is based on Russ Cox's rsc.io/omap but adds a few common operations and an improved range API.

In comparison with github.com/google/btree, this implemention:

  • takes two generic parameters, for the key type and the value type
  • supports range-over-func
  • offers range deletion
  • provides a uniform, easy-to-understand way to handle iteration over and deletion of ranges
  • has a strong iteration guarantee
  • provides a way to ask for the i th key and value in key order

If we were to adopt an ordered map implementation, I think it would be a good starting point.

@DeedleFake
Copy link

@jba

It does support range-over-func, actually, just not explicitly: https://go.dev/play/p/pc574L-uHyf

@leaxoy
Copy link
Author

leaxoy commented Oct 30, 2024

@jba Most agree, But why is there a MapFunc type? Can it write to Map[K, V any], then use two different constructor variants

func NewMap[K, V any](cmp func(K, K) int) *Map[K, V]
func NewOrderedMap[K Ordered, V any]() *Map[K, V] {
    return NewMap[K, V](cmp.Compare[K])
}

@jba
Copy link
Contributor

jba commented Oct 30, 2024

why is there a MapFunc type

Map may be much faster.

@hgl
Copy link

hgl commented Nov 1, 2024

@jba, could Map be an interface?

If any method wants to take another map as an argument, the current two-struct approach is going to make it awkward to implement.

func (m *Map[K, V]) Merge(o *Map[K, V]) {}

Maybe an argument can be make that Map doesn't need such methods, but if Set and SetFunc are implemented on top of them, strong arguments can be made that they should have methods like Union, Join etc. It would ideal if Map and Set could use the same API style.

@jba
Copy link
Contributor

jba commented Nov 4, 2024

@hgl Yes, we could define the interface and have two constructor functions, one for the cmp.Ordered case and one for the comparison function. That would halve the size of the API.

As for set operations, I think it's better to define them in terms of sorted iteration sequences, as in #70140.

@fumin
Copy link

fumin commented Nov 16, 2024

@jba I wonder if you can confirm that github.com/jba/omap is strictly superior to github.com/google/btree?
If not, what are the things that jba/omap need to improve over google/btree?

P.S. I am a Go user needing an OrderedMap for backing sparse matrices.
I am not trying to troll by pitting libraries against each other, just trying to figure out what is the best library at the moment in the absence of a standard library implementation.
I also suggest if we decide to leave OrderedMap outside of the standard library, perhaps we could have a Wiki page or blog post to list the recommended/canonical third-party implementations for common data structures.
Currently, information on gonuts is too old, and google search results give very confusing picture, with the top search results containing extremely buggy poorly written code.
It was only until I stumbled upon this issue did I find a correct path forward for OrderedMaps.

@jba
Copy link
Contributor

jba commented Nov 16, 2024

I am not trying to troll by pitting libraries against each other

On the contrary, I think it's important to compare these packages.

I don't think there is a clear winner. In #60630 (comment) I talk in detail about the strengths and weaknesses of google/btree. The main weaknesses are the clumsy API, the lack of range delete, and the lack of iteration guarantees without doing a possibly expensive clone.

Google/btree is probably faster because jba/omap is unoptimized. For example, jba/omap allocates all new nodes on the heap instead of using a sync.Pool. There are other implementation tricks too that I haven't bothered with because the main purpose of jba/omap is to serve as a reference implementation for this proposal.

So I think it depends on your particular use case. From my point of view, I'd prefer you use jba/omap to help shake out bugs and serious performance issues, and perhaps to contribute improvements. I'd say if you're working on a personal or student project, it would be a more forward-looking choice. If you need something for serious production work, then certainly google/btree has a lot more mileage on it.

perhaps we could have a Wiki page or blog post to list the recommended/canonical third-party implementations for common data structures

The Go team talks about this from time to time, but the problems are that it's a lot of work, and also we want to remain impartial.

@fumin
Copy link

fumin commented Nov 16, 2024

@jba Thanks for your helpful comparison!
It looks like omap is preferred except for performance critical cases.
I will be using it, and would report bugs or new ideas that may be useful.

The Go team talks about this from time to time, but the problems are that it's a lot of work, and also we want to remain impartial.

Understood, and thanks to the entire Go team for this great work all these years!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests