API: require scalar result from the scalar .at indexer #33153

jorisvandenbossche · 2020-03-30T19:21:42Z

The .at indexer is documented as the fast, restricted version of loc to "access a scalar value" (https://pandas.pydata.org/docs/user_guide/indexing.html#fast-scalar-value-getting-and-setting).

However, currently, there is not always a guarantee that the result will actually be a scalar. For example, with a Series with duplicate index values:

In [3]: s = pd.Series([1, 2, 3], index=['a', 'b', 'a'])  

In [4]: s.at['a'] 
Out[4]: 
a    1
a    3
dtype: int64

There are some other possible cases that could give a non-scalar value as well:

MultiIndex if not all levels are indexed (Series with MultiIndex "at" function raises "TypeError" #26989)
DataFrame with duplicate index labels in index and/or columns (BUG: DataFrame.at implicitly assumes unique axes #33041)

However, those cases currently all fail (error or produce wrong result, see below for examples).
So a question that can be posed: should we consider those cases as bugs, or should we rather actually require a scalar result and thus see the series behaviour above as too liberal?

Since those linked issues have open PRs to "fix" them, I think we should first decide on what guarantees on the .at API we want to provide. And personally, I think taking the documented intention of accessing a scalar would be a plus (it gives a certainty about the type of the result, i.e. always a scalar, and not sometimes a series or dataframe depening on the value of the label passed to at).

If we prefer the more strict scalar result requirement, we can deprecate the Series case (but only if accessing a duplicated label) and provide better error messages for the error cases.

Examples of the failing behaviours (on pandas 1.0.1):

# duplicated row label -> you get a numpy array (which I would say is always wrong)
>>> df = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'a'], columns=['A', 'B'])  
>>> df.at['a', 'A'] 
array([-0.02914828,  0.2856617 ])

# duplicated column label -> error
>>> df = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'], columns=['A', 'A'])  
>>> df.at['a', 'A']  
...
AttributeError: 'BlockManager' object has no attribute 'T'

# Selecting one level of a MultiIndex -> error
>>> s = pd.Series(np.random.randn(4), index=pd.MultiIndex.from_product([['A', 'B'], [1, 2]]))  
>>> s.at['A']  
...
KeyError: 'A'
During handling of the above exception, another exception occurred:
...
AttributeError: 'numpy.ndarray' object has no attribute '_values'

# selecting all MultiIndex levels -> this should actually work (#26989)
>>> s.at['A', 1] 
...
TypeError: _get_value() got multiple values for argument 'takeable'

# DataFrame with all MultiIndex levels specified works OK
>>>s.to_frame('col').at[('A', 1), 'col']
1.341421652269149

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2020-03-30T19:31:27Z

cc @pandas-dev/pandas-core

jbrockmendel · 2020-03-30T19:36:59Z

"access a scalar value"

The actual .at API is to require scalar indexer(s). At the very least the docs should be clarified on that count.

rhshadrach · 2020-03-30T22:15:10Z

Seems to me #26989 does not address using .at when not all levels are specified. Perhaps you meant another issue?

jorisvandenbossche · 2020-03-31T06:11:58Z

The actual .at API is to require scalar indexer(s). At the very least the docs should be clarified on that count.

That might be how it is coded, but for example also the reference docstring of at speaks about single values: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html

jreback · 2020-03-31T10:55:22Z

thinking this through again, would be ok with raising on a return value that is non -scalar (eg we have duplicates)

jbrockmendel · 2020-04-03T17:35:50Z

As a user, I think of .at as ".loc but faster because im passing scalars". having an always-scalar return would be nice, but then we're forcing users to do more checks before calling .at, which seems self-defeating.

@TomAugspurger thoughts?

TomAugspurger · 2020-04-03T17:40:22Z

Agreed with Brock.

jorisvandenbossche · 2020-04-03T17:41:25Z

but then we're forcing users to do more checks before calling .at

Or otherwise they might need to do checks afterwards because you don't know whether you got back a scalar, series or dataframe

Indexing inherently can always generate errors depending on your data (eg the key you are checking is not present), so in general your code might always need to be robust against that. But I think being certain here about the return type would be nice.

TomAugspurger · 2020-04-03T18:05:42Z

.at feels like the wrong place for ensuring the result is a scalar though. That sounds to me more like a problem of duplicate labels, rather than an issue with .at.

jorisvandenbossche · 2020-04-03T18:53:54Z

That sounds to me more like a problem of duplicate labels, rather than an issue with .at.

It can be a problem of the duplicate labels, but not all our methods therefore need to be able to handle such duplicate labels. We have other methods that raise an error in such a case.

TomAugspurger · 2020-04-03T19:00:21Z

Hmm maybe I'm missing something but my feeling is that .at is clearly for indexing with a scalar, and the docstring author didn't consider duplicate labels when it was written.

jbrockmendel · 2020-04-03T19:10:52Z

As an implementation matter, if we wanted to disallow non-scalar results, this would require a deprecation cycle, since ATM Series.at specifically supports non-unique index. That deprecation wouldn't be enforced until 2.0 which is a ways off.

In the interim, raising AttributeError for DataFrame.at is clearly a bug and should be fixed (#33047). Fixing that now doesn't preclude later deciding to deprecate the behavior for both Series/DataFrame.

jorisvandenbossche · 2020-04-03T19:11:13Z

the docstring author didn't consider duplicate labels when it was written.

Yes, I suppose that indeed happened like that. But the initial implementation also didn't consider duplicates (I checked a couple of version back, and then .at with a duplicate label returned an ndarray, so didn't even bother to check for it, I suppose for performance reasons as it assumed scalars).

So now we are actually considering duplicate labels, we can make a choice.

We have a lot of places in pandas where the return type of a method can be all kinds of things, while in general it is nice to have stricter typing (eg -> Scalar vs -> Union[Scalar, Series, DataFrame] in this case).
So I mainly thought this is actually a rather easy place to be more strict.

jreback · 2020-04-06T21:19:14Z

I think the inconsistency with .loc is a big deal here. If we want to deprecate, ok, but that would be for both .loc and .at. Trying to shoe-horn this change to behavior here seems like a bad idea.

So I mainly thought this is actually a rather easy place to be more strict.

this is not easy at all to be more strict here.

jreback · 2020-04-06T21:20:06Z

Since this is blocking a number of PRs I would say @jorisvandenbossche pls make an issue for discussion if you think we ought to consider fixing duplicate label indexing (for both .loc and .at).

jorisvandenbossche · 2020-04-07T07:33:48Z

pls make an issue for discussion

This is that issue?

jorisvandenbossche · 2020-04-07T07:46:03Z

Ah, read over "both loc and at". But, I don't think those necessarily need to be handled / discussed at the same time. As there is quite some difference: .loc by definion can return different types (dataframe, series, scalar) depending on what you pass to loc.
For loc, there could be a discussion about having a consistent return type depending on the type of the indexer. For that, there is already #9519. .loc also doesn't necessarily need to raise an error (as what I proposed here), as it can also consistently return a Series/DataFrame instead of a scalar when duplicates are present.

Now, if no one agrees with me, I am of course fine to go with the majority. I only think this is a chance to clean up a part of our indexing API.
If we would be designing and adding .at now (without any historical baggage or comparison to other indexers), I would personally restrict it to return only scalars.

jorisvandenbossche · 2020-04-08T19:13:45Z

So on the call we discussed this more, and there was not much appetetite for restricting the behaviour for indexes with duplicates keys, so I think the conclusion is that we can fix the DataFrame cases to return a Series/DataFrame instead of raising an error for duplicate keys.

rhshadrach · 2020-04-11T17:29:56Z

@jorisvandenbossche Is the return type in various cases of a possible non-scalar result also part of this issue, or should it be separated off into another issue?

TomAugspurger · 2020-06-17T18:57:04Z

@jorisvandenbossche I think all the raising cases have been fixed. Can you double check?

TomAugspurger · 2020-07-07T20:59:02Z

Moved off the 1.1 milestone.

jorisvandenbossche added API Design Needs Discussion Requires discussion from core team before further action labels Mar 30, 2020

jorisvandenbossche added this to the 1.1 milestone Mar 30, 2020

jorisvandenbossche mentioned this issue Mar 30, 2020

BUG: DataFrame.at with non-unique axes #33047

Merged

5 tasks

rhshadrach mentioned this issue Mar 31, 2020

BUG: Multiindexed series .at fix #32520

Merged

5 tasks

TomAugspurger removed this from the 1.1 milestone Jul 7, 2020

mroeschke added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 31, 2021

jbrockmendel mentioned this issue Jul 14, 2022

Make output of label-based indexing (e.g. df.loc) jittable Bodo-inc/Bodo-Pandas-Collaboration#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: require scalar result from the scalar .at indexer #33153

API: require scalar result from the scalar .at indexer #33153

jorisvandenbossche commented Mar 30, 2020 •

edited

jorisvandenbossche commented Mar 30, 2020

jbrockmendel commented Mar 30, 2020

rhshadrach commented Mar 30, 2020

jorisvandenbossche commented Mar 31, 2020

jreback commented Mar 31, 2020

jbrockmendel commented Apr 3, 2020

TomAugspurger commented Apr 3, 2020

jorisvandenbossche commented Apr 3, 2020 •

edited

TomAugspurger commented Apr 3, 2020

jorisvandenbossche commented Apr 3, 2020

TomAugspurger commented Apr 3, 2020 •

edited

jbrockmendel commented Apr 3, 2020 •

edited

jorisvandenbossche commented Apr 3, 2020 •

edited

jreback commented Apr 6, 2020

jreback commented Apr 6, 2020

jorisvandenbossche commented Apr 7, 2020

jorisvandenbossche commented Apr 7, 2020

jorisvandenbossche commented Apr 8, 2020

rhshadrach commented Apr 11, 2020 •

edited

TomAugspurger commented Jun 17, 2020

TomAugspurger commented Jul 7, 2020

API: require scalar result from the scalar .at indexer #33153

API: require scalar result from the scalar .at indexer #33153

Comments

jorisvandenbossche commented Mar 30, 2020 • edited

jorisvandenbossche commented Mar 30, 2020

jbrockmendel commented Mar 30, 2020

rhshadrach commented Mar 30, 2020

jorisvandenbossche commented Mar 31, 2020

jreback commented Mar 31, 2020

jbrockmendel commented Apr 3, 2020

TomAugspurger commented Apr 3, 2020

jorisvandenbossche commented Apr 3, 2020 • edited

TomAugspurger commented Apr 3, 2020

jorisvandenbossche commented Apr 3, 2020

TomAugspurger commented Apr 3, 2020 • edited

jbrockmendel commented Apr 3, 2020 • edited

jorisvandenbossche commented Apr 3, 2020 • edited

jreback commented Apr 6, 2020

jreback commented Apr 6, 2020

jorisvandenbossche commented Apr 7, 2020

jorisvandenbossche commented Apr 7, 2020

jorisvandenbossche commented Apr 8, 2020

rhshadrach commented Apr 11, 2020 • edited

TomAugspurger commented Jun 17, 2020

TomAugspurger commented Jul 7, 2020

jorisvandenbossche commented Mar 30, 2020 •

edited

jorisvandenbossche commented Apr 3, 2020 •

edited

TomAugspurger commented Apr 3, 2020 •

edited

jbrockmendel commented Apr 3, 2020 •

edited

jorisvandenbossche commented Apr 3, 2020 •

edited

rhshadrach commented Apr 11, 2020 •

edited