ENH/API: allow different values for labels/index in DF.reindex OR raise error #21685

h-vetinari · 2018-06-29T17:23:28Z

Currently, DataFrame.reindex has three overlapping keywords:

labels
index
columns

I (naively) expected it to work to pass different values to labels/index (motivating example below), but this does not work. I'm going to make a proposal of how this could be incorporated, but independently from that -- in the current state -- at least an error should be raised on conflicting values to labels/index (or even just using both kwargs).

2018-10-08 EDIT: This is as far as necessary for the purpose of raising errors.

Alternatively (or maybe complementarily), one could consider the following use case for allowing different values for labels/index, since .reindex (at least by name) has two interpretations:

selecting an index
assigning an index

[end of EDIT]

The example is related to what I'm working on in #21645, where I want to construct an inverse to .duplicated, making it possible to reconstruct the original object from the deduplicated one.

As a toy example:

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})
df
#    A  B
# 0  0  a
# 1  1  b
# 2  1  b
# 3  2  c
# 4  0  a

isdup, inv = df.duplicated(keep='last', return_inverse=True)
isdup
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

inv
# 0    4
# 1    2
# 2    2
# 3    3
# 4    4
# dtype: int64

unique = df.loc[~isdup]
unique
#    A  B
# 2  1  b
# 3  2  c
# 4  0  a

unique.reindex(inv)
#    A  B
# 4  0  a
# 2  1  b
# 2  1  b
# 3  2  c
# 4  0  a

This is obviously not identical to the original object yet: while we have read the correct indexes from unique, we haven't assigned them to the correct output indexes yet.

I had long been working with .loc[] until v0.23 started telling me to use .reindex, so I wasn't very familiar with it. I started by trying the following, which would conceptually make sense to me (as opposed to interpreting .reindex(inv) directly, which would break heaps of code):

unique.reindex(labels=inv.values, index=inv.index)
#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

This was surprising, because labels is completely ignored (even though it is the first argument in the call signature), and no warning is raised for swallowing contradictory arguments.

In any case, this is not very high priority, as a more-or-less simple work-around exists, but it is still something to consider, IMO.

## the workaround
unique.reindex(inv.values).set_index(inv.index).equals(df)
# True
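For readers who want to run the round trip on a pandas build without `return_inverse` (the keyword proposed in #21645), here is a minimal sketch. It assumes the inverse can be emulated with a groupby over all columns; the names `labels` and `restored` are illustrative only and are not part of any pandas API.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})

# Plain duplicated() call, without the proposed return_inverse keyword.
isdup = df.duplicated(keep='last')

# Stand-in for the proposed inverse (an assumption, not the #21645 code):
# for each row, the index label of the last row holding identical values.
labels = pd.Series(df.index, index=df.index)
inv = labels.groupby([df[col] for col in df.columns]).transform('last')

unique = df.loc[~isdup]

# Step 1: select rows from `unique` by label (the "selecting" meaning).
# Step 2: assign the original index (the "assigning" meaning).
restored = unique.reindex(inv.values).set_index(inv.index)

print(restored)
print(restored.equals(df))
```

The two chained calls correspond to the two interpretations of .reindex described in the edit note at the top: `reindex` does the selection, `set_index` does the assignment.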


TomAugspurger · 2018-06-29T18:18:21Z

Looks like we should raise in

pandas/pandas/util/_validators.py

Lines 291 to 292 in dc45fba

        else:
            out[ax] = v

when ax in out. In this case, out[ax] comes from labels=, and we overwrite it with index=.
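A rough, standalone sketch of what such a check could look like. The function `resolve_axis_kwargs` and its parameters are hypothetical and only mimic the shape of the logic under discussion; this is not the actual pandas validator.

```python
def resolve_axis_kwargs(kwargs, arg_name="labels", axis_names=("index", "columns")):
    """Hypothetical stand-in: map labels/axis/index/columns to {axis_name: value}."""
    out = {}

    # The generic argument (e.g. labels=) is resolved to an axis name first.
    if arg_name in kwargs:
        axis = kwargs.get("axis", 0)
        ax = axis_names[axis] if isinstance(axis, int) else axis
        out[ax] = kwargs[arg_name]

    # Explicit axis keywords come second; refuse to overwrite silently.
    for key in axis_names:
        if key not in kwargs:
            continue
        if key in out:
            raise TypeError(
                f"Cannot specify both '{arg_name}' and '{key}' for the same axis"
            )
        out[key] = kwargs[key]
    return out


# labels= fills the index slot, columns= fills the other one: no conflict.
print(resolve_axis_kwargs({"labels": [4, 2, 3], "columns": ["A"]}))

# labels= and index= target the same axis: the sketch raises instead of overwriting.
try:
    resolve_axis_kwargs({"labels": [4, 2, 3], "index": [0, 1, 2]})
except TypeError as exc:
    print("raised:", exc)
```

Raising a TypeError here is a design assumption, chosen because it mirrors how Python reports multiple values for the same argument; a warning would be the softer alternative.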

machar94 · 2018-07-09T03:49:32Z

Hello. Is anyone working on this? I would like to take this as my first issue.

TomAugspurger · 2018-07-12T19:21:02Z

Sorry for the delay @machar94! I don't think anyone is currently working on it. Let us know if you need help getting started.

rcromo · 2018-08-29T01:46:21Z

Hello @TomAugspurger I know another user wanted to work on the issue but is the issue still open?

TomAugspurger · 2018-08-29T02:03:06Z

@rcromo I don't see any open PRs addressing this issue. Please feel free to take it.

machar94 · 2018-08-29T14:49:48Z

@rcromo Yes, please feel free to take it. Unfortunately I wasn't able to follow up on this myself.

adamshamsudeen · 2018-10-08T11:15:35Z

Is this issue still open?

TomAugspurger · 2018-10-08T11:21:28Z

Still open, and AFAICT no one is actively working on it.


adamshamsudeen · 2018-10-08T11:28:57Z

@TomAugspurger I couldn't replicate this issue, as df.duplicated does not currently support return_inverse.

TomAugspurger · 2018-10-08T11:31:23Z

@h-vetinari do you know if the original issue has been addressed, and if so which issue fixed it?

h-vetinari · 2018-10-08T16:36:30Z

@TomAugspurger @adamshamsudeen

The issue is neither outdated nor closed, though I guess I should really separate the proposal that:

.reindex should raise when contradicting labels/index are passed
the idea (inspired by the reconstruction problem in ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645) to allow passing separate values for labels/index to .reindex

The first one is the one that @adamshamsudeen can easily tackle, and that has nothing to do with .duplicated.

shuaggar-sys · 2021-02-05T02:30:31Z

take

shuaggar-sys · 2021-02-05T04:07:52Z

@h-vetinari I found an interesting thing while debugging: the line below never returns a dict containing the key "labels"; it always returns a dict with the key "index".

pandas/pandas/core/frame.py

Line 4241 in e1a9b78

axes = validate_axis_style_args(self, args, kwargs, "labels", "reindex")

Also, the following line removes "labels" from the kwargs, effectively making the labels argument useless:

pandas/pandas/core/frame.py

Line 4245 in e1a9b78

kwargs.pop("labels", None)

Example:

unq.reindex(labels=0, index=inv.index)
#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

unq.reindex(labels=["abcdefghi"], index=inv.index)

#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

The output remains the same for any label you throw at it.
Please let me know how I should proceed with fixing this.
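Since the handling of conflicting keywords may differ between pandas versions, a small probe like the following can show what a given installation does. It is a sketch only: the frame is rebuilt by hand to mirror `unq` from the examples above, and `target` is an illustrative name.

```python
import pandas as pd

# Hand-built copy of the deduplicated frame from the thread.
unq = pd.DataFrame({'A': [1, 2, 0], 'B': ['b', 'c', 'a']}, index=[2, 3, 4])
target = pd.RangeIndex(5)

# Reference result: only index= is passed.
baseline = unq.reindex(index=target)

# Conflicting call: a nonsensical labels= alongside index=.
try:
    both = unq.reindex(labels=["abcdefghi"], index=target)
except (TypeError, ValueError) as exc:
    print("raised:", exc)
else:
    print("labels ignored:", both.equals(baseline))
```

No expected output is given on purpose, because the result depends on the installed pandas version.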

This was referenced Jun 29, 2018

ENH: set_index for Series #21684

Open

ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645

Closed

TomAugspurger added the Error Reporting (Incorrect or improved errors from pandas), Effort Low, and good first issue labels Jun 29, 2018

TomAugspurger added this to the Next Major Release milestone Jun 29, 2018

jbrockmendel removed the Effort Low label Oct 21, 2019

github-actions bot assigned shuaggar-sys Feb 5, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
