use array_like instead of numeric/array-like #765
Comments
I think that using a single word like `array_like` creates ambiguity. I would prefer pvlib documentation to be more explicit, which might mean putting more than one word.
Hey @adriesse, @wholmgren, I am new to this library and would like to contribute. I have experience with documentation in Sphinx. Kindly guide me in solving this issue.
@percygautam thanks for your interest! Maybe you could run a couple of find/replace-all commands. Note the contributing documentation contains some general suggestions for getting started with git/github.
Thanks @wholmgren, I'll start exploring the documentation and start working on issues.
There are performance and syntactic differences between an ndarray of shape (1, 1, ...) and a Python scalar, so I don't think the terms are interchangeable if `numeric` includes true native scalars.
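A minimal sketch of both differences, assuming only numpy (illustrative, not from the thread):

```python
import timeit
import numpy as np

x = 1.0            # native Python scalar
y = np.array(1.0)  # 0-d ndarray

# Syntactic difference: the ndarray carries array attributes the scalar lacks.
print(y.shape, y.ndim)  # () 0
# print(x.shape)        # AttributeError: 'float' object has no attribute 'shape'

# Performance difference: numpy dispatch adds per-operation overhead,
# so arithmetic on the 0-d array is noticeably slower than on the float.
print(timeit.timeit("x * x", globals={"x": x}, number=1_000_000))
print(timeit.timeit("y * y", globals={"y": y}, number=1_000_000))
```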
The numpy definition of `array_like` includes scalars: any object that can be interpreted as an ndarray.
pvlib's definition of `array-like` excludes scalars, which is exactly where the two conventions collide.
I just did a quick search and there seem to be only 4 occurrences of the underscore version `array_like` in the docstrings. As @adriesse pointed out in an earlier comment, it's confusing to use the numpy spelling when pvlib means something narrower by it. Some further investigation reveals that all 4 usages of the underscore version belong to a single function.

If that would cause an error, then maybe the types of those variables should be the pvlib `array-like` instead. Actually, I just went ahead and tried it. Yes, there seems to be an error if you input all scalars. The error is raised partway into the function, and the message makes sense: the scalar inputs cannot be treated as arrays.

So I'm thinking those 4 (and only those) usages of the underscore (numpy) version `array_like` should be changed to pvlib's hyphenated `array-like`. EDIT: This still doesn't solve this issue, but at least it normalises all usages so that we don't have both spellings in the docs.
@ColmBhandal thanks for this report. As author of that function, I am 99.99% certain that I was not mindful of the distinction between `array_like` and `array-like`. That function should not work with scalar inputs. There probably is a minimum length of each subset as well.
Good idea to take a small bite out of this one @ColmBhandal! Perhaps numpy is not the best example to follow. I suspect pvlib has a lot more variety in the types of function arguments overall, and I also suspect pvlib functions are far from uniform in their support for potentially equivalent data structures.

So I will stick with my initial comment that it is better for pvlib to be explicit in the documentation as to the data types and data structures required or supported by each function. We could supplement this with programming (not documentation) guidelines on what range of equivalent structures pvlib functions should strive to support.

Personally I often throw in an np.asanyarray() at the beginning of functions, which saves me a lot of pain; but it's not always the right thing to do. Come to think of it, I may be following examples from numpy on that technique...
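A sketch of that coercion technique, with a hypothetical function name (not a real pvlib function):

```python
import numpy as np

def airmass_sketch(zenith):
    # Coerce up front so the body can assume array semantics;
    # scalars, lists, and Series all become (or pass through as) arrays.
    zenith = np.asanyarray(zenith)
    return 1.0 / np.cos(np.radians(zenith))

airmass_sketch(30.0)          # scalar in, numpy scalar out
airmass_sketch([0.0, 30.0])   # list in, ndarray out
```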
@adriesse I tend to agree with you. I think the ideal short/medium-term solution is to be explicit in the docs, using our own descriptors for the supported types.
That's an achievable short/medium-term solution, as long as we choose descriptors that don't clash with numpy. We can then leave the discussion of whether we want to use numpy descriptors to a separate issue / future discussion and not let it get in the way of resolving this issue.
FWIW: I've been tripped up in the past by the ambiguity between these terms.
I have been thinking about this and I wonder if the long term solution is static duck typing, i.e. structural types. Minimally, static duck typing would provide much clearer documentation of the expected type via a type annotation vs. a documentation comment, which is subject to typos. Down the line, static duck typing can also be coupled with static analysers run as part of the standard build steps to ensure all statically duck typed portions of the code are used in a type safe way.

The problem with Python's standard dynamic duck typing is that you don't know that something is not a duck until you ask it to quack, at runtime, and it says "sorry, I don't know how", possibly deep into the function. Of course, that's what these documentation descriptors try to tackle. But it's up to a human to write and interpret them. There's no machine support (or is there?).

The nice thing about static duck typing is that you still don't have to nominally type everything you put into a function, which would violate the Python ethos. But by statically declaring the structural type (protocol) that a function accepts, you define a clear contract to users. So it's a good balance between strictly nominally typed languages like Java and dynamically typed languages like Python.
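For illustration, a minimal sketch of a structural type (Python 3.8+; the `SupportsArray` protocol and `mean_sketch` function are hypothetical, not pvlib or numpy API):

```python
from typing import Protocol
import numpy as np

class SupportsArray(Protocol):
    """Structural type: anything numpy can convert via __array__."""
    def __array__(self) -> np.ndarray: ...

def mean_sketch(values: SupportsArray) -> float:
    # A static checker such as mypy accepts any argument whose class
    # defines __array__ (ndarray, pandas Series, xarray DataArray),
    # with no inheritance from a pvlib base class required.
    return float(np.mean(np.asanyarray(values)))
```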
I confess I have only briefly skimmed PEP 544, but I don't see how static duck typing is clearer to the user than docstring comments are. From the user's perspective, isn't it essentially just moving the type information from the docstring into the signature?
@kanderso-nrel Definitely an example, or many, would be needed to justify such a significant introduction to the code base. I was holding off on the above comment for the past few days as I wanted a solid example to justify it, but I may or may not have time, so I figured I'd just throw the idea out there. Certainly if I have time I'd be passionate about exploring this opportunity more, with concrete code examples. And if I do I'll post my findings here :)
Sorry for commenting glibly, but it seems to me the heart of this issue comes down to 2 concerns:
Native Python sequences: I think this only becomes an issue if a user tries to invoke a numpy or pandas attribute on a sequence and gets an attribute error (illustrated in the sketch below). We should be explicit in the documentation about whether a function requires a pandas series or dataframe, a numpy array, or whether a sequence will do. Pandas series and dataframes have a lot of the same attributes as numpy arrays, and many numpy functions will accept a sequence, so we may need to be diligent in checking contributions to see if they will raise an attribute error with a native Python sequence. Otherwise, I think that in general it's okay to tell users that pvlib only uses numpy arrays or pandas series and dataframes, and not native Python sequences, and just be explicit about that in the docs. I don't know what word we should use to describe such a class of data types; I guess we can't use "array-like" or "array_like".

Handling scalar ambiguity: using numpy's `array_like` implies that scalars are accepted, which is the main source of confusion here. So there's my recommendation: be explicit in each docstring about the structures a function supports, and state explicitly when a function requires array input.
Now of course this doesn't handle xarray or other `array_like` datatypes, but I think that's out of scope for this issue.
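A small illustration of the sequence concern above (assuming only numpy and pandas):

```python
import numpy as np
import pandas as pd

seq = [1.0, 2.0, 3.0]

np.sin(seq)            # fine: numpy coerces the native list itself
pd.Series(seq).mean()  # fine: Series shares many ndarray attributes
# seq.mean()           # AttributeError: plain lists have no .mean()
```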
I'm coming around to the view that it may be better to just list the specific types in the docstring, e.g.,

scalar, np.array, pd.Series

rather than

numeric

where `numeric` is shorthand whose definition has to be looked up elsewhere.
The ambiguity that arises from discussions like these might be eliminated by structural types, aka static duck types, because their semantics are rigorously defined. I agree that it's a safer approach not to assume that users have read the numpy docs when interpreting type declarations in the docs. Structural types may be clearer. I've briefly seen them in action in the Gosu language and I thought they were really cool. I'd love to investigate whether they'd be applicable to this problem in Python.
@mikofski I'll be even more glib: "explicit is better than implicit"! I agree with @cwhanse's comment; the only change I'd make is a small tweak to the type names. I'm also all for adding type hints to pvlib, but I really think that's a separate conversation because they are not easily integrated with documentation at this point in time.
@wholmgren For what it's worth, I think I'm with you that type hints are probably a separate, much larger thing. For this issue to get closed as a practical matter, I think just merging my PR (after I add the tests) and then analysing the 26 remaining usages of array_like and updating the docs appropriately should do it. Of course, conversations are likely to follow on what those updates will be.
Questions:

- Which types and data structures should each function accept?
- Do particular input types guarantee particular output types (e.g., does a pd.Series in guarantee a pd.Series out)?
Readability counts. Having to explicitly spell out every type is fine, but possibly ugly. I don't think it's unreasonable to expect pvlib users to be fluent with numpy. Leveraging the numpy API is a common practice among Python packages, and I think we can benefit from it too. IMO `array_like` with its numpy meaning is fine. On the other hand, I think `numeric` is ambiguous. By that same token, I don't think we should use "scalar" as a datatype. I think any function that's meant to be used with arrays will work fine with an array of length 1, and any function that genuinely requires an array of length > 1 should say so. We should consider if we want to wrap inputs meant to be arrays with `np.asanyarray()`.
Another perspective is that having to explicitly wrap inputs into arrays is ugly. It delegates type coercion to the function's computational part rather than its type signature, so it sort of muddies the responsibility of the function body, which is to compute rather than type check. I'm relatively inexperienced in Python, and from what I'm reading this seems to be common enough practice, but I think that's just a trade-off that comes with a dynamically duck typed language. The flip side is that types can get pretty ugly, especially when they get as complicated as @campanelli-sunpower's second bullet point, e.g. different input types guaranteeing different outputs.
I'll disagree on this point. Rarely do we design and test pvlib functions to accept any data type included in numpy's `array_like` definition. There are many functions which users will want to work with either scalars or arrays.
I think that expectation is reasonable. However, numpy arrays have attributes and methods that native Python scalars and sequences lack, so functions written against arrays can still break on other inputs.
We're all learning all the time here, so your contributions have been really helpful. For example, I'd never heard of PEP 544 or structural typing, so thanks for that! In my experience Python does differ significantly from other languages like Java and C# in that it is dynamically typed and relies on duck typing.
Both of these features are embodied in a pythonic idiom called EAFP, or "easier to ask forgiveness than permission". The opposite is "look before you leap", or LBYL. Microsoft developer Brett Cannon explains this in his post: EAFP versus LBYL. I think these reasons are what draw analysts to Python in the first place, but they may cause headaches for software developers and probably carry performance penalties compared to more statically typed languages. I'm curious to see what's going on in other packages. Surely we're not the only ones to have to handle this. What's going on in statsmodels, astropy, scikit-learn, etc.? I'll have to do some research.
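A minimal sketch of the two idioms side by side (hypothetical helpers, not from the thread):

```python
# EAFP: try the operation, handle the failure if it comes.
def mean_eafp(values):
    try:
        return values.mean()              # ndarray / Series path
    except AttributeError:
        return sum(values) / len(values)  # plain-sequence fallback

# LBYL: inspect before acting.
def mean_lbyl(values):
    if hasattr(values, "mean"):
        return values.mean()
    return sum(values) / len(values)
```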
Yep - I figured this was the case reading bits here and there about Python. Typing is quite a subtle issue, though, and there's a whole spectrum of conventions to choose from, running from fully dynamic duck typing, through documented types and optional type hints, to full static nominal typing.
Then to throw a spanner in the works, you have @campanelli-sunpower's mention of type contracts. I can't say that this lies on a strictly linear spectrum of non-typed to nominally typed, but on the broader spectrum of weakly vs. strongly typed it's definitely on the strongly typed side of things. Now you start talking not just about what type a function accepts and what type it outputs, but you start saying things like: if the function accepts an np.ndarray, then it returns an np.ndarray. Anyway, I'm excited to see where the broader discussion goes on this. In the meantime, I'll have a look at my open PR and try to progress it somewhat.
If we wrap inputs we expect to be arrays with `np.asanyarray()`, then native sequences and scalars are converted for us. Also, most NumPy functions do this conversion for us already, so we might not even need it.
Do you mean `np.asarray()` or `np.asanyarray()`?
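The practical difference between the two, sketched (illustrative only):

```python
import numpy as np
import pandas as pd

m = np.ma.masked_array([1.0, 2.0], mask=[False, True])
print(type(np.asarray(m)))     # ndarray: subclass (and its mask) dropped
print(type(np.asanyarray(m)))  # MaskedArray: ndarray subclasses pass through

# pandas Series is not an ndarray subclass, so both calls return a
# plain ndarray for Series input; neither preserves the index.
s = pd.Series([1.0, 2.0])
print(type(np.asanyarray(s)))  # ndarray
```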
My understanding is that neither PEP 544 (protocols) nor PEP 484 (type hints) actually does any type checking unless you invoke an external tool like mypy. There is nothing in Python itself that, at runtime or any other time, actually checks that inputs match the type annotations. The runtime semantics are duck typing regardless of whatever type fanciness we put in the signature. So to me, "typing it" is just another form of documentation, not an alternative to it. Sorry for the noise if everyone already knows this (or if I'm misinformed, very possible since I don't use these features myself!). Also, I think PEP 544 is Python 3.8+, so it is not available to us in the near term anyway (3.7 EOL is June 2023).
If this turns out to be the best solution, so be it, but to @campanelli-sunpower's point I do expect a Series out if I put a Series in, which I think means basically every function will start and end with typing boilerplate. Or maybe we could be clever with a decorator. I see value in being consistent with other pydata packages, but I'm not sure any of them has this fully figured out either.
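A sketch of that decorator idea (hypothetical names, assuming the first argument carries the data):

```python
import functools
import numpy as np
import pandas as pd

def series_in_series_out(func):
    """Coerce the first argument to an array; if the caller passed a
    pandas Series, rebuild a Series with the same index on the way out."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        index = data.index if isinstance(data, pd.Series) else None
        result = func(np.asanyarray(data), *args, **kwargs)
        return result if index is None else pd.Series(result, index=index)
    return wrapper

@series_in_series_out
def clip_sketch(ghi, limit=1000.0):  # hypothetical function
    return np.minimum(ghi, limit)
```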
@kanderso-nrel are you okay with adopting numpy's `array_like` convention?
I think I'm against it? If we (1) don't want to be using `array_like` with numpy's full meaning, and (2) don't want to redefine an established numpy term, then I'd rather we come up with our own abstract types (e.g. our current `numeric`) whose definitions we control.
I’m -1 for inventing yet another name. PEP 20: “There should be one— and preferably only one —obvious way to do it.” IMHO the proliferation of more arbitrary names will only confuse our intentions even more. In lieu of using an established term I would prefer to be as explicit as possible.
I only expect this for dataframes, series, or arrays. If the input is a Python tuple, list, or nested sequence, getting an array as output is fine for me. I believe this is the expectation with numpy or SciPy functions that accept `array_like` input.
From what I've read, my understanding concurs with this. I still see the merit in types over documentation, as long as both are roughly the same effort to implement/maintain: annotations are machine-checkable and less subject to typos than free-form comments.
But yeah, the proposal to use type hints and/or static duck types essentially boils down to a proposal for changing the way the types of functions are documented.
I guess that makes the decision simpler for now :)
Typing is still immature in the pydata world (arguably the larger Python world too). Check out the numpy typing documentation for some issues, and that's just numpy! Types are not a replacement for good manually written documentation at this time.
Totally agree.
I'll just make a small correction here, although based on @wholmgren's reply I don't think I was misquoted, and I think my meaning was correctly interpreted. Still, let me clarify. I only proposed to replace the typing portions of the docs with typing annotations, and at that, with structural types, not nominal ones. I don't think types can or should replace documentation in general. For example, documentation includes stuff like references to research papers; those definitely don't belong in the types. Even the description of what the function does is often messy to try to encode in the types, and if you tried to encode it 100% in the type, then the type itself would practically be as complex as the function body. This also isn't something I'd support.

Also, even my proposal to surgically replace only the type declarations in the docs with type annotations is really a proposal to use structural types. Why? Because with nominal types you lose something: you're asking for explicitly declared types instead of things that fit the right mold, and that violates the ethos of duck typing. AFAIK PEP 484 type hints are nominal* rather than structural (maybe they're both, I'm not 100% sure), and if so, then I wouldn't even suggest typing along the lines of the numpy typing anyway.

*Note: it looks like maybe numpy works around the fact that types are nominal by using union types to define things like `ArrayLike`.

Anyway, this all seems academic, since I would say the real death knell of the typing idea came when @kanderso-nrel noted that PEP 544 is Python 3.8+ and so not available to us in the near term anyway.
I don't know to what extent this is relevant, but I find that function signatures can become very hard to parse for mere mortals like me when too much information, especially information intended for machines, is squeezed in.
@ColmBhandal I note that numpy 1.22+ does not support Python 3.7, so dropping support for 3.7 may be closer than we think. I see this discussion as an expected result of trying to build a production-worthy codebase with a duck-typed language. I think a policy is worth sorting out as best we can, and I do appreciate you bringing PEP 544 into the discussion (which I still need to learn about!), because although I find type hints enormously valuable, they feel "tacked on" to a language that was not designed for them.
@ColmBhandal we were on the same page, but thanks for clarifying. Given that we've all agreed that we're not going to add any type declarations at this time, how do we close this issue with only standard documentation changes? There are a handful of proposals buried above, but I don't think we've reached a consensus.
Skimming the comment history, it seems like replacing the composite types (`numeric`, `array-like`) with explicit lists of the supported types in each docstring is the closest thing we have to a consensus.
This was a nice intro to how PEP 544 (protocols, or static duck typing) works in concert with PEP 484 (type hints) on polymorphic functions/methods. It provides good context and works through a concrete example with mypy.
That's a great video and well worth the watch. The coolest thing I saw in that video was TypeVars; I'd never seen them before. Particularly bounded TypeVars. They seem to answer the earlier question posed by @campanelli-sunpower about whether particular input types guarantee particular output types.
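A sketch of that idea with a constrained TypeVar (a close cousin of the bounded form; the `scale` function is hypothetical):

```python
from typing import TypeVar
import numpy as np
import pandas as pd

# T may be either ndarray or Series, and is the same on both ends.
T = TypeVar("T", np.ndarray, pd.Series)

def scale(values: T, factor: float) -> T:
    # mypy infers: ndarray in -> ndarray out, Series in -> Series out.
    return values * factor
```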
Typeshed: I believe this means you can totally separate the files that define the types from the files that define the code, which means you don't have to clutter your code with types. That sort of addresses @adriesse's earlier point that signatures become hard to parse when too much machine-oriented information is squeezed in.
Although I am admittedly someone who probably sways towards static typing, I do appreciate the "centrist" message the presenter gives at the end of the video.
Even if PEP 544 can't be used due to version incompatibility, it would be worth exploring TypeVars, and maybe typeshed-style stub files, as an alternative to spelling the types out in the docstrings. Although maybe it's best to wait until PEP 544 can be applied just as is done in the video.
The following will not always be appropriate, but it could perhaps be used in many places. What do you think about making a copy of the first 1D input and replacing its values to create a 1D output of the same class? That should work transparently for numpy, pandas and xarray, right?
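A sketch of that copy-and-replace pattern (assuming the template supports `.copy()` and slice assignment, which ndarray, Series, and DataArray all do):

```python
import numpy as np
import pandas as pd

def copy_with_values(template, values):
    # The copy keeps the input's class and metadata (index for Series,
    # coords for DataArray); only the data is swapped out.
    out = template.copy()
    out[:] = values
    return out

s = pd.Series([1.0, 2.0], index=["a", "b"])
print(copy_with_values(s, np.array([3.0, 4.0])))  # still a Series, same index
```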
The pvlib parameters/returns type documentation style uses `numeric` to mean scalar, np.array, or pd.Series (implying that scalar input works) and `array-like` to mean np.array or pd.Series (implying that it does not).
However, numpy defines `array_like` as "Any sequence that can be interpreted as an ndarray. This includes nested lists, tuples, scalars and existing arrays."
I think we should follow the numpy standard. We lose the implied specification for whether or not a function works with scalar input, but this seems to be a source of confusion anyway. Instead, the docstrings for functions that require array input with length > 1 could explicitly state this.
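For reference, a hypothetical docstring fragment in the current style (the function and parameter names are invented for illustration):

```python
def sketch(ghi, zenith):
    """
    Parameters
    ----------
    ghi : numeric
        Global horizontal irradiance. Scalar, np.array, or pd.Series.
    zenith : array-like
        Solar zenith angle. np.array or pd.Series; scalars not supported.
    """
```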