Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic count for stdlib use outside len #217

Closed
metagn opened this issue Apr 30, 2020 · 10 comments
Closed

Generic count for stdlib use outside len #217

metagn opened this issue Apr 30, 2020 · 10 comments

Comments

@metagn
Copy link
Contributor

metagn commented Apr 30, 2020

Originally discussed in nim-lang/Nim#14162. Writing so many RFCs is taxing but this needs its own place to be discussed.

Problem: cstring is a victim of improper use by templates/generic procs that check for a len overload on a type. cstrings len is O(n) with n pointer dereferences which is needlessly complex for anything named len. The length of some data structures (lists, cstrings) are non-constant. This can cause some algorithms that use the length but don't need it, like generating newSeq(x.len) for efficiency instead of an empty seq and adding to it, to be needlessly slow.

Proposal: len should be conventionally defined for procs that have constant/fast access to their length, and there should be a generic unary count (maybe go in sequtils? can debate name) that iterates over the type. len for cstring should be deprecated outside of JS.

# template so as to work for iterators
template count(x: untyped): int =
  var res = 0
  for _ in x:
    inc res
  res

This works on top of sequtils.count since that one takes 2 arguments, same with countIt. It's also in line with how other languages implement and apply meaning to count. An alternative implementation could be:

from sequtils import countIt
template count(x: untyped): int = countIt(x, true)

After defining this original count, we can extend it to better support types that have len defined, like so:

type Lengthable = concept x
  x.len is int

template count(x: Lengthable) = x.len
template count(x: typed) = # ...

This implementation would break the use of count for iterators, but the point of count is to write better algorithms for data structures, you can just use toSeq at that point. countIt works otherwise.

Flaws & backwards compatibility: The name count should be backwards compatible, other options include getLen, countLen. Deprecating cstring.len is the crux of this issue.

I have an idea as to how you can deprecate it for JS too. JS cstring actually counts unicode characters as 1 character (nim-lang/Nim#10911), so we might use a new overload instead of len here, but I don't know what the name would be.

If not, and JS cstring len should be kept, then that's also doable, but it would be hard to document. Declaration branching between different backends is never fun and when it's for a deprecated symbol that's even worse JS cstring is a topic for another day, not deprecating len for cstrings is fine, c_strlen should stay in system anyway

The main backwards compatibility issue is porting standard library procs to count, this would only happen when we start using concepts though and that's going to be backwards incompatible in of itself

@disruptek
Copy link

I like sequtils.count. Another example is linkedlists, but I would rather that collection simply offered a type that hid all the machinery and stored the size of the list. Still, count should be a thing.

@bung87
Copy link

bung87 commented Apr 30, 2020

for js unicode there's https://nim-lang.org/docs/unicode.html#runeLen%2Cstring

@Araq
Copy link
Member

Araq commented May 4, 2020

Why do we need count? If it's allowed to be expensive in O(N) terms generic code should avoid it anyway.

@metagn
Copy link
Contributor Author

metagn commented May 4, 2020

The point here is to make sure generic code doesn't use it. If we have a concept like Indexable that uses [] and len then it should warn or error when using it for a type like cstring. Since we can't just delete len for cstring (also we need it for JS), we have to deprecate it on C backend, so a proc that uses Indexable will create a warning for now. count is merely a last resort in case you really need the length of something, you should be inclined to do let L = x.count first.

Also, this doesn't close nim-lang/Nim#14162, I miscommunicated, count is separate from isEmpty. @disruptek you can reopen if you want with Araq's suggestion

@Araq
Copy link
Member

Araq commented May 4, 2020

Well so len for cstring needs a deprecation period and instead count but why do we need sequtils.count for everything that defines items?

More importantly, the premise is wrong: In practice length information in a generic context can be valuable and speed up computations even if len is O(N):

        var result = newSeq[type(iter)](iter2.len) # ok, need to iterate over the C string once
        var i = 0
        for x in iter2: # ok, need to iterate over it once again, but now it's in the cache
          result[i] = x
          inc i
        result

@metagn
Copy link
Contributor Author

metagn commented May 4, 2020

why do we need sequtils.count for everything that defines items

The point here is we separate everything that has len vs everything that can have len calculated. We could do:

template count(x: Indexable): int = x.len

when Indexable is implemented, or for now just compiles(x.len). If a proc, say, in strutils, needs the length of a parameter (rfind), they would use count, but everything else should check for len for the most efficiency (if you believe cstring is efficient with len then we can simply not deprecate it) then fall back to a generic solution that uses a cap or something.

I'll add a when compiles(x.len) branch in my PR if you want but it's going to be messy like toSeq is since it has to account for iterators.

@metagn metagn changed the title Deprecate len for cstring and use generic count instead Generic count for stdlib use outside len May 4, 2020
@Araq
Copy link
Member

Araq commented May 4, 2020

As I said, the issue is unconvincing for today's computers. Preallocated sizes are often more important to have than avoiding O(N) traversals and if findIt is bad for type T even though T has a len, findIt could be specialized for T. Currently with our untyped design that's unfeasible to do, so maybe we should put our focus on this "untyped it templates" issue instead.

@disruptek
Copy link

The issue has less to do with the hardware and more to do with the programmers being able to communicate with one another as to the nature of the operation. You admit that O(n) won't always be avoided, so you can't also argue that count shouldn't be exposed to users that need it. With count, they know what they are getting when they need to ask for it.

@mratsim
Copy link
Collaborator

mratsim commented Jun 9, 2020

For ensembles it can be card (for cardinality) otherwise a slightly longer countItems might convey more the O(n) nature?

In any case, trees, graphs and other linked data-structure might not store counts of a subgraph/subtree and a count convention might be useful there.

@metagn
Copy link
Contributor Author

metagn commented Sep 28, 2020

Weird RFC, closing since there's no real point here

@metagn metagn closed this as completed Sep 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants