Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: require scalar result from the scalar .at indexer #33153

Open
jorisvandenbossche opened this issue Mar 30, 2020 · 21 comments
Open

API: require scalar result from the scalar .at indexer #33153

jorisvandenbossche opened this issue Mar 30, 2020 · 21 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 30, 2020

The .at indexer is documented as the fast, restricted version of loc to "access a scalar value" (https://pandas.pydata.org/docs/user_guide/indexing.html#fast-scalar-value-getting-and-setting).

However, currently, there is not always a guarantee that the result will actually be a scalar. For example, with a Series with duplicate index values:

In [3]: s = pd.Series([1, 2, 3], index=['a', 'b', 'a'])  

In [4]: s.at['a'] 
Out[4]: 
a    1
a    3
dtype: int64

There are some other possible cases that could give a non-scalar value as well:

However, those cases currently all fail (error or produce wrong result, see below for examples).
So a question that can be posed: should we consider those cases as bugs, or should we rather actually require a scalar result and thus see the series behaviour above as too liberal?

Since those linked issues have open PRs to "fix" them, I think we should first decide on what guarantees on the .at API we want to provide. And personally, I think taking the documented intention of accessing a scalar would be a plus (it gives a certainty about the type of the result, i.e. always a scalar, and not sometimes a series or dataframe depening on the value of the label passed to at).

If we prefer the more strict scalar result requirement, we can deprecate the Series case (but only if accessing a duplicated label) and provide better error messages for the error cases.


Examples of the failing behaviours (on pandas 1.0.1):

# duplicated row label -> you get a numpy array (which I would say is always wrong)
>>> df = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'a'], columns=['A', 'B'])  
>>> df.at['a', 'A'] 
array([-0.02914828,  0.2856617 ])

# duplicated column label -> error
>>> df = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'], columns=['A', 'A'])  
>>> df.at['a', 'A']  
...
AttributeError: 'BlockManager' object has no attribute 'T'

# Selecting one level of a MultiIndex -> error
>>> s = pd.Series(np.random.randn(4), index=pd.MultiIndex.from_product([['A', 'B'], [1, 2]]))  
>>> s.at['A']  
...
KeyError: 'A'
During handling of the above exception, another exception occurred:
...
AttributeError: 'numpy.ndarray' object has no attribute '_values'

# selecting all MultiIndex levels -> this should actually work (#26989)
>>> s.at['A', 1] 
...
TypeError: _get_value() got multiple values for argument 'takeable'

# DataFrame with all MultiIndex levels specified works OK
>>>s.to_frame('col').at[('A', 1), 'col']
1.341421652269149
@jorisvandenbossche jorisvandenbossche added API Design Needs Discussion Requires discussion from core team before further action labels Mar 30, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Mar 30, 2020
@jorisvandenbossche
Copy link
Member Author

cc @pandas-dev/pandas-core

@jbrockmendel
Copy link
Member

"access a scalar value"

The actual .at API is to require scalar indexer(s). At the very least the docs should be clarified on that count.

@rhshadrach
Copy link
Member

Seems to me #26989 does not address using .at when not all levels are specified. Perhaps you meant another issue?

@jorisvandenbossche
Copy link
Member Author

The actual .at API is to require scalar indexer(s). At the very least the docs should be clarified on that count.

That might be how it is coded, but for example also the reference docstring of at speaks about single values: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html

@jreback
Copy link
Contributor

jreback commented Mar 31, 2020

thinking this through again, would be ok with raising on a return value that is non -scalar (eg we have duplicates)

@jbrockmendel
Copy link
Member

As a user, I think of .at as ".loc but faster because im passing scalars". having an always-scalar return would be nice, but then we're forcing users to do more checks before calling .at, which seems self-defeating.

@TomAugspurger thoughts?

@TomAugspurger
Copy link
Contributor

Agreed with Brock.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Apr 3, 2020

but then we're forcing users to do more checks before calling .at

Or otherwise they might need to do checks afterwards because you don't know whether you got back a scalar, series or dataframe

Indexing inherently can always generate errors depending on your data (eg the key you are checking is not present), so in general your code might always need to be robust against that. But I think being certain here about the return type would be nice.

@TomAugspurger
Copy link
Contributor

.at feels like the wrong place for ensuring the result is a scalar though. That sounds to me more like a problem of duplicate labels, rather than an issue with .at.

@jorisvandenbossche
Copy link
Member Author

That sounds to me more like a problem of duplicate labels, rather than an issue with .at.

It can be a problem of the duplicate labels, but not all our methods therefore need to be able to handle such duplicate labels. We have other methods that raise an error in such a case.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 3, 2020

Hmm maybe I'm missing something but my feeling is that .at is clearly for indexing with a scalar, and the docstring author didn't consider duplicate labels when it was written.

@jbrockmendel
Copy link
Member

jbrockmendel commented Apr 3, 2020

As an implementation matter, if we wanted to disallow non-scalar results, this would require a deprecation cycle, since ATM Series.at specifically supports non-unique index. That deprecation wouldn't be enforced until 2.0 which is a ways off.

In the interim, raising AttributeError for DataFrame.at is clearly a bug and should be fixed (#33047). Fixing that now doesn't preclude later deciding to deprecate the behavior for both Series/DataFrame.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Apr 3, 2020

the docstring author didn't consider duplicate labels when it was written.

Yes, I suppose that indeed happened like that. But the initial implementation also didn't consider duplicates (I checked a couple of version back, and then .at with a duplicate label returned an ndarray, so didn't even bother to check for it, I suppose for performance reasons as it assumed scalars).

So now we are actually considering duplicate labels, we can make a choice.

We have a lot of places in pandas where the return type of a method can be all kinds of things, while in general it is nice to have stricter typing (eg -> Scalar vs -> Union[Scalar, Series, DataFrame] in this case).
So I mainly thought this is actually a rather easy place to be more strict.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2020

I think the inconsistency with .loc is a big deal here. If we want to deprecate, ok, but that would be for both .loc and .at. Trying to shoe-horn this change to behavior here seems like a bad idea.

So I mainly thought this is actually a rather easy place to be more strict.

this is not easy at all to be more strict here.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2020

Since this is blocking a number of PRs I would say @jorisvandenbossche pls make an issue for discussion if you think we ought to consider fixing duplicate label indexing (for both .loc and .at).

@jorisvandenbossche
Copy link
Member Author

pls make an issue for discussion

This is that issue?

@jorisvandenbossche
Copy link
Member Author

Ah, read over "both loc and at". But, I don't think those necessarily need to be handled / discussed at the same time. As there is quite some difference: .loc by definion can return different types (dataframe, series, scalar) depending on what you pass to loc.
For loc, there could be a discussion about having a consistent return type depending on the type of the indexer. For that, there is already #9519. .loc also doesn't necessarily need to raise an error (as what I proposed here), as it can also consistently return a Series/DataFrame instead of a scalar when duplicates are present.

Now, if no one agrees with me, I am of course fine to go with the majority. I only think this is a chance to clean up a part of our indexing API.
If we would be designing and adding .at now (without any historical baggage or comparison to other indexers), I would personally restrict it to return only scalars.

@jorisvandenbossche
Copy link
Member Author

So on the call we discussed this more, and there was not much appetetite for restricting the behaviour for indexes with duplicates keys, so I think the conclusion is that we can fix the DataFrame cases to return a Series/DataFrame instead of raising an error for duplicate keys.

@rhshadrach
Copy link
Member

rhshadrach commented Apr 11, 2020

@jorisvandenbossche Is the return type in various cases of a possible non-scalar result also part of this issue, or should it be separated off into another issue?

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche I think all the raising cases have been fixed. Can you double check?

@TomAugspurger TomAugspurger removed this from the 1.1 milestone Jul 7, 2020
@TomAugspurger
Copy link
Contributor

Moved off the 1.1 milestone.

@mroeschke mroeschke added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

6 participants