Some useful math/statistics functions are missing #4

Open
AlexDaniel opened this Issue Mar 20, 2019 · 11 comments

6 participants
AlexDaniel commented Mar 20, 2019

Some examples of things that are missing:

  • clamp or clip https://stackoverflow.com/questions/55250700/is-there-a-clamp-method-sub-for-ranges-num-etc-in-perl6
  • “One final observation about Perl 6 and math: although Perl 6 has all the usual functions from math.h, it could certainly use a few more.” https://www.evanmiller.org/statistical-shortcomings-in-standard-math-libraries.html
    • double incbet(double a, double b, double x); # Regularized incomplete beta function
    • double incbi(double a, double b, double y); # Inverse of incomplete beta integral
    • double igam(double a, double x); # Regularized incomplete gamma integral
    • double igamc(double a, double x); # Complemented incomplete gamma integral
    • double igami(double a, double p); # Inverse of complemented incomplete gamma integral
    • double ndtr(double x); # Normal distribution function
    • double ndtri(double y); # Inverse of Normal distribution function
    • double jv(double v, double x); # Bessel function of non-integer order
  • prod. It's easy to do it yourself but if we have sum then why not have prod too (for example, numpy has both)
  • mean
  • median
  • mode ?
  • peak-to-peak (range) – (numpy example)
  • standard-deviation
  • histogram
  • and so on…
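Two of the functions listed above, ndtr and ndtri, can be expressed with pieces that already exist in a typical standard library. The following is a minimal Python sketch, not the Cephes implementation: it assumes the standard normal distribution, reuses the Cephes names purely for illustration, and relies on Python's math.erfc and statistics.NormalDist.

```python
import math
from statistics import NormalDist


def ndtr(x: float) -> float:
    """Standard normal CDF, written in terms of the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def ndtri(p: float) -> float:
    """Inverse of the standard normal CDF; p must lie strictly between 0 and 1."""
    return NormalDist().inv_cdf(p)


if __name__ == "__main__":
    # Round trip: ndtri should undo ndtr up to floating-point error.
    print(ndtr(1.5), ndtri(ndtr(1.5)))
```

The incomplete beta and gamma functions have no such short formulation, which is part of the argument for providing them in a library.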

AlexDaniel added the language label Mar 20, 2019

AlexDaniel changed the title from "Some useful math functions are missing" to "Some useful math/statistics functions are missing" Mar 20, 2019

AlexDaniel commented Mar 20, 2019

@moritz any thoughts on this?

japhb commented Mar 20, 2019

Just as a side note about stats on non-scalar data -- if you need more than one statistic, there's often a large performance advantage to calculating some or all of them at once in a single pass through the data. Certainly for known-immutable data it would be easy to cache the results for some statistics while calculating others, but in the general case it would be useful to have some way to request calculation of several stats (particularly commonly used ones) at once, without having to hand-roll one's own calculations -- the latter being frankly an easy way for non-experts to fall prey to all sorts of numerical stability issues.
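The single-pass idea can be sketched with Welford's online algorithm, which updates the count, mean, variance, minimum, and maximum together as each value arrives. This is a Python illustration under stated assumptions, not a proposed interface: the names RunningStats and summarize are hypothetical, and the variance uses the sample (n - 1) denominator.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Accumulates several statistics in one pass (Welford's algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0          # running sum of squared deviations from the mean
    lo: float = math.inf
    hi: float = -math.inf

    def push(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the already-updated mean
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    @property
    def variance(self) -> float:
        """Sample variance; requires at least two values."""
        return self.m2 / (self.n - 1)

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


def summarize(values: Iterable[float]) -> RunningStats:
    """Consume the input exactly once, so it also works on generators."""
    stats = RunningStats()
    for x in values:
        stats.push(x)
    return stats
```

Updating the mean incrementally avoids the large intermediate sums of the textbook "sum of squares minus square of sum" formula, which is one of the stability traps mentioned above.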

moritz commented Mar 20, 2019

IMHO these belong in a statistics module.

The naming is not obvious (please don't tell me you want a function called ndtr in the default setting in Perl 6; and I don't know whether average, avg, or mean is the best name), and neither are the performance issues that @japhb mentioned.

Has anybody written such a module? This is a perfect use case of something that can be prototyped and ironed out outside the core language.

If there's a really well-working module, we might consider inclusion in core (though I still think it's out of scope for Perl 6).

AlexDaniel commented Mar 20, 2019

> though I still think it's out of scope for Perl 6

Well, Evan Miller makes a point that these should be part of the standard library. Then there's also:

> ☞ Math just is. Don’t make people declare it.

And also it makes me wonder why something like acosech is in core but a commonly needed mean is not.

I agree, however, that the first implementation of all that can be done in a module.

> don't tell me you want a function called ndtr by default in the setting in Perl 6, please

Of course not. From the article:

> The Cephes folks seem to be stingy when it comes to doling out letters in function names, so the C committee may want to add a few characters to the above for clarity

moritz commented Mar 20, 2019

> ☞ Math just is. Don’t make people declare it.

Yet none of us are trying to turn Perl 6 into a fully-featured Computer Algebra System.

(Side note, people have, in fact, proposed that in the past, but @TimToady has stopped them).

We have to draw a boundary somewhere. For me, the boundary excludes the beta and gamma-related functions.

We can argue about mean, if you want, but then please be more precise about its semantics (what will it return for the empty list, for example?). Why "mean" as the name (when there is a Geometric Mean as well as the "normal" arithmetic mean), why not "average"?
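The two open questions here, which mean and what to do with an empty list, can be made concrete with a small Python sketch. Returning None for empty input is only one possible choice, picked here for illustration (raising an error or returning NaN are equally defensible), and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def arithmetic_mean(xs: Sequence[float]) -> Optional[float]:
    """Sum divided by count; None for empty input (an assumed convention)."""
    if not xs:
        return None
    return sum(xs) / len(xs)


def geometric_mean(xs: Sequence[float]) -> Optional[float]:
    """n-th root of the product, computed through logarithms; positive values only."""
    if not xs:
        return None
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


if __name__ == "__main__":
    data = [1, 4, 16]
    # The two means differ on the same data, so the bare name "mean" is ambiguous.
    print(arithmetic_mean(data), geometric_mean(data))
```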

mean/average and standard deviation suffer from the performance penalty of multi-pass calculations, which is why I think that a regular function interface might not be the best, and why somebody should first come up with a working design in the form of a module.

lizmat commented Mar 22, 2019

FWIW, I think a clamp method should take a Range (or 2 values) as parameter. This would allow it to be used on e.g. a List, a Supply, etc:

42.clamp(^10);   # 9
(10,20,30).clamp( 25..35 );  # (25,25,30)

etc. etc.
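The intended semantics of the two examples above can be sketched in Python. This is an illustration only, not the Perl 6 method: clamp and clamp_to_range are hypothetical names, raising on an empty or inverted range is an assumption, and the range variant mirrors ^10 by treating the upper end as exclusive.

```python
def clamp(x, lo, hi):
    """Return x limited to the closed interval [lo, hi]."""
    if lo > hi:
        raise ValueError("lower bound exceeds upper bound")
    return max(lo, min(x, hi))


def clamp_to_range(x: int, r: range) -> int:
    """Clamp an integer to a non-empty, step-1 range (upper end exclusive)."""
    if len(r) == 0 or r.step != 1:
        raise ValueError("expected a non-empty range with step 1")
    return clamp(x, r[0], r[-1])


if __name__ == "__main__":
    print(clamp_to_range(42, range(10)))
    # The list case is an element-wise map of the scalar operation.
    print([clamp(v, 25, 35) for v in (10, 20, 30)])
```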

jnthn commented Mar 23, 2019

There's a wide variety of suggested additions here. In principle I'm not opposed to adding things to CORE.setting, but there should be a good argument for each one we do add, as well as a lack of strong counter-arguments against adding it.

A general counter-argument is that everyone pays for the things we put into CORE.setting: its compiled form is over 14 MB by now, which everyone has to download, store, have mapped into memory, and so forth. While there will be technical measures we can take to make it more compact, and try to further reduce the impact the setting size has on startup time, additions there will never be free. (Some argue "it makes the language bigger and so more to learn", but for things in CORE.setting I don't really buy that argument; you don't have to know all of a language's standard library in order to use the language. Or at least, I sure hope not, or I should stop programming. :-))

One consideration that has not yet been mentioned here is whether there is a significant performance benefit to be had from providing the operation as a built-in. If, for example, some platforms provide for doing the operation at CPU level, or there exists a means to implement it more efficiently than would be possible through the composition of other operations, then there's a case for having it in CORE.setting so we can JIT it into something good. I've no idea if this is the case for any of those suggested here; research is needed.

It's also worth considering how widely used something would be. For example, there's probably a quite strong case for average, which for most people means sub average(@xs) { @xs.sum / @xs }, even if there are many other kinds of average. I suspect that's been defined by quite a few folks by now (and it's so short/simple to write, it's not really worth a module dependency).

A few assorted notes on various of the proposals:

  • We've tried to avoid abbreviations, so prod - if we were to add it - would want to be product.
  • clip seems a more evocative name to me than clamp. Also, I think (10,20,30).clamp( 25..35 ); should just be done using a map over the list, applying it to each element. It's arguably a useful enough thing to have in CORE.setting, but it's not an obvious list operation.

As for a way forward:

  • I think that it's worth making a more detailed proposal (perhaps with a prototype implementation) for these ones to go into CORE.setting:
    • clip (or clamp, or bound, or whatever we end up calling it)
    • average (with the semantics that the typical punter expects); I know that if you're doing other statistical things then it's more efficient to do it in a single pass, but my feeling is that - for better or worse - simple averaging is overwhelmingly the most commonly done thing. Even if we did later decide a means to calculate a bunch of statistical things at once belonged in CORE.setting, that'd still not take much from the value of a convenient average built-in.
  • I'm not sure product pulls its weight, especially since @x.product is more to type than [*](@x), and unlike sum, I don't see any obvious optimization opportunities (we ended up with .sum because, if done on a Range, you can calculate the answer without iterating the Range). However, I'd entertain arguments for why it should be included.
  • For other statistical things, my feeling is "module first".
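The optimization mentioned for .sum on a Range can be sketched as follows. This Python version is an illustration, not Rakudo's implementation, and range_sum is a hypothetical name: the sum of an integer range has a closed form, whereas a product over the same range still has to visit every element.

```python
import math


def range_sum(r: range) -> int:
    """Sum of an integer range in constant time, via the arithmetic-series formula."""
    n = len(r)
    if n == 0:
        return 0
    # count * (first + last) is always even for an arithmetic progression,
    # so the integer division below is exact.
    return n * (r[0] + r[-1]) // 2


if __name__ == "__main__":
    print(range_sum(range(1, 1_000_001)))   # no iteration over the million elements
    print(math.prod(range(1, 6)))           # product: iterates element by element
```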

AlexDaniel commented Mar 23, 2019

To clarify the situation in this ticket: there has been no proposal yet; the original post simply states that some functions may be missing. If somebody wants to make a proposal, see @jnthn's comment.

japhb commented Mar 23, 2019

jnthn: Aside from the pure performance implications of using builtins, there's also a matter of numerical accuracy and stability; some of these functions may need to be calculated using the processor's extended precision (e.g. 80, 96, or 128 bits) in order to be accurate to one ULP (Unit in the Last Place) of their 64-bit output across their domain. Which is to say that some of them we'd just want to implement as VM ops or NativeCall to a math lib anyway, because we can't efficiently fake that extra precision in NQP space.
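The effect of intermediate precision on the last bit of a result can be seen at small scale with plain summation. The Python sketch below swaps in a different technique from the one described above: instead of hardware extended precision, it uses math.fsum, which tracks exact partial sums in software, as a stand-in for a higher-precision accumulator. The helper naive_sum is hypothetical.

```python
import math
from typing import Iterable


def naive_sum(values: Iterable[float]) -> float:
    """Accumulate in ordinary 64-bit floats, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    values = [0.1] * 10
    # The two results differ in the last place of the 64-bit output.
    print(naive_sum(values), math.fsum(values))
```

The special functions discussed in this thread face the same problem in a harder form, since their intermediate terms can cancel or overflow long before the final result is rounded.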

MattOates commented Mar 26, 2019

This annoyed me enough to create https://github.com/MattOates/Stats/blob/master/lib/Stats.pm6

The Evan Miller article is really great. The point of the functions he defines is that they are a core set of operations on which most of the rest of scientific programming is based at a higher level.

If this were in CORE.setting, though, I think it makes more sense to really limit what gets exported or provided; mean/median/mode/stddev are the common ones. A core module loaded with use Maths for the more science/analysis end of the spectrum feels sensible. Outside of Rakudo Star do we have properly core level modules?

Really it would be great if these were optimised implementations from a maths library. Otherwise it's hard to see how this might not stunt or slow down development in the ecosystem in this space.

AlexDaniel commented Mar 26, 2019

> Outside of Rakudo Star do we have properly core level modules?

Yes, Telemetry comes to mind.

> This annoyed me enough …

@MattOates, by any chance can you come up with a detailed proposal (discussing what should be available to everyone, what needs to be in a potential maths module and what is left for the ecosystem), and later an implementation? I think nobody objects that at least some things need to be added, we just need a knowledgeable person with enough tuits to think this through.
