ENH/API: allow different values for labels/index in DF.reindex __OR__ raise error #21685

Open
h-vetinari opened this issue Jun 29, 2018 · 13 comments
Labels
Error Reporting Incorrect or improved errors from pandas good first issue

Comments

@h-vetinari
Contributor

h-vetinari commented Jun 29, 2018

Currently, DataFrame.reindex has three overlapping keywords:

  • labels
  • index
  • columns

I (naively) expected that passing different values to labels/index would work (motivating example below), but it does not. I'm going to propose how this could be incorporated, but independently of that -- in the current state -- an error should at least be raised on conflicting values for labels/index (or even just when both kwargs are used).
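For reference, a small sketch of how the keywords overlap in the two documented calling conventions: labels combined with axis targets one axis, while index and columns name the axis directly. The frame and the label values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])
new_rows = ["z", "x", "w"]  # "w" is not in df, so its row is filled with NaN

# labels style: `labels` plus `axis` selects which axis to conform.
by_labels = df.reindex(labels=new_rows, axis=0)

# keyword style: `index` names the row axis directly.
by_index = df.reindex(index=new_rows)

# Both spellings describe the same operation on the row axis.
assert by_labels.equals(by_index)

# The same holds for columns: labels + axis=1 vs. columns=.
new_cols = ["A", "B"]
assert df.reindex(labels=new_cols, axis=1).equals(df.reindex(columns=new_cols))
```

The ambiguity discussed in this issue only arises when labels is combined with index or columns in the same call.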

2018-10-08 EDIT: This is as far as necessary for the purpose of raising errors.

Alternatively (or maybe complementarily), one could consider the following use case for allowing different values for labels/index, since .reindex (at least by name) has two interpretations:

  • selecting an index
  • assigning an index

[end of EDIT]

The example is related to what I'm working on in #21645, where I want to construct an inverse to .duplicated that allows reconstructing the original object from the deduplicated one.

As a toy example:

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})
df
#    A  B
# 0  0  a
# 1  1  b
# 2  1  b
# 3  2  c
# 4  0  a

isdup, inv = df.duplicated(keep='last', return_inverse=True)
isdup
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

inv
# 0    4
# 1    2
# 2    2
# 3    3
# 4    4
# dtype: int64

unique = df.loc[~isdup]
unique
#    A  B
# 2  1  b
# 3  2  c
# 4  0  a

unique.reindex(inv)
#    A  B
# 4  0  a
# 2  1  b
# 2  1  b
# 3  2  c
# 4  0  a

This is obviously not identical to the original object yet: while we have read the correct indexes from unique, we haven't assigned them to the correct output indexes yet.

I had long been working with .loc[] until v0.23 started telling me to use .reindex, and consequently I wasn't very familiar with it. I started by trying the following, which would conceptually make sense to me (as opposed to interpreting .reindex(inv) directly, which would break heaps of code):

unique.reindex(labels=inv.values, index=inv.index)
#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

This was surprising, because labels is completely ignored (even though it is the first argument in the call signature), and no warning is raised when the contradictory arguments are silently swallowed.

In any case, this is not very high priority, as a more-or-less simple work-around exists, but it is still something to consider, IMO.

## the workaround
unique.reindex(inv.values).set_index(inv.index).equals(df)
# True
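Since return_inverse comes from the proposal in #21645 and may not be available in an installed pandas, here is a self-contained sketch of the same workaround. The groupby-based computation of inv is a stand-in I am assuming for illustration; it is not the implementation from that PR.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 1, 2, 0], "B": ["a", "b", "b", "c", "a"]})

# Mark all but the last occurrence of each row as duplicate.
isdup = df.duplicated(keep="last")
unique = df.loc[~isdup]

# Stand-in for the proposed `return_inverse`: for every row, the index label
# of the last row with the same values (the one that is kept in `unique`).
row_labels = pd.Series(df.index, index=df.index)
inv = row_labels.groupby([df[col] for col in df.columns]).transform("last")

# Step 1 selects rows from `unique` by label; step 2 assigns the original index.
restored = unique.reindex(inv.values).set_index(inv.index)
assert restored.equals(df)
```

The two chained calls make the "select" and "assign" roles explicit, which is exactly what a single reindex call with separate labels/index values would have to express.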
@TomAugspurger
Contributor

Looks like we should raise in

    else:
        out[ax] = v

when ax in out. In this case, out[ax] comes from labels=, and we overwrite it with index=.
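A toy model of that overwrite and of the suggested check, in plain Python. This is not pandas' actual validation code: the function name merge_axis_kwargs, its signature, and the error message are all made up for illustration.

```python
def merge_axis_kwargs(kwargs, axis_names=("index", "columns")):
    """Toy model: collect axis-style arguments into one dict, raising on conflict."""
    out = {}

    # `labels` is resolved to an axis name first (axis 0 -> "index" by default).
    if "labels" in kwargs:
        axis = axis_names[kwargs.get("axis", 0)]
        out[axis] = kwargs["labels"]

    # Then the explicit `index=` / `columns=` arguments are filled in.
    for key, value in kwargs.items():
        if key not in axis_names:
            continue
        if key in out:
            # Without this check, `out[key] = value` would silently
            # discard the value that came from `labels=`.
            raise TypeError(f"Cannot specify both 'labels' and '{key}'")
        out[key] = value
    return out


# Non-conflicting input is merged as before.
merged = merge_axis_kwargs({"labels": [4, 2], "axis": 0})

# Conflicting input now fails loudly instead of dropping `labels`.
try:
    merge_axis_kwargs({"labels": [4, 2], "index": [0, 1]})
    conflict_raised = False
except TypeError:
    conflict_raised = True
```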

@TomAugspurger TomAugspurger added Error Reporting Incorrect or improved errors from pandas Effort Low good first issue labels Jun 29, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jun 29, 2018
@machar94

machar94 commented Jul 9, 2018

Hello. Is anyone working on this? I would like to take this as my first issue.

@TomAugspurger
Contributor

Sorry for the delay @machar94! I don't think anyone is currently working on it. Let us know if you need help getting started.

@rcromo

rcromo commented Aug 29, 2018

Hello @TomAugspurger I know another user wanted to work on the issue but is the issue still open?

@TomAugspurger
Contributor

@rcromo I don't see any open PRs addressing this issue. Please feel free to take it.

@machar94

@rcromo Yes, please feel free to take it. Unfortunately I wasn't able to follow up on this myself.

@adamshamsudeen

Is this issue still open?

@TomAugspurger
Contributor

TomAugspurger commented Oct 8, 2018 via email

@adamshamsudeen

@TomAugspurger I couldn't replicate this issue, as df.duplicated does not currently support return_inverse?

@TomAugspurger
Contributor

@h-vetinari do you know if the original issue has been addressed, and if so which issue fixed it?

@h-vetinari
Contributor Author

@TomAugspurger @adamshamsudeen

The issue is neither outdated nor closed, though I guess I should really separate the two proposals:

  1. .reindex should raise when contradicting labels/index are passed
  2. the idea (inspired by the reconstruction problem in ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645) to allow passing separate values for labels/index to .reindex

The first one is the one that @adamshamsudeen can easily tackle, and that has nothing to do with .duplicated.
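To make the second proposal concrete without touching .reindex itself, here is a sketch of the "select, then assign" semantics as a standalone function. The name reindex_select_assign and its parameters are hypothetical and only meant to illustrate what separate labels/index values could mean.

```python
import pandas as pd


def reindex_select_assign(obj, select, assign):
    """Hypothetical helper: pick rows by `select`, then relabel them with `assign`."""
    if len(select) != len(assign):
        raise ValueError("`select` and `assign` must have the same length")
    # reindex reads rows by label; set_axis then replaces the resulting row labels.
    return obj.reindex(select).set_axis(assign, axis=0)


frame = pd.DataFrame({"v": [1.5, 2.5]}, index=["a", "b"])
out = reindex_select_assign(frame, select=["b", "b", "a"], assign=[10, 11, 12])
```

Here select plays the role that labels would have under the proposal, and assign the role of index.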

@shuaggar-sys

take

@shuaggar-sys

shuaggar-sys commented Feb 5, 2021

@h-vetinari I found something interesting while debugging: the line below never returns a dict containing the key "labels"; it always returns a dict with the key "index".

axes = validate_axis_style_args(self, args, kwargs, "labels", "reindex")

Also, the following line removes "labels" from the kwargs, effectively making the labels argument useless:

kwargs.pop("labels", None)

Example:

unq.reindex(labels=0, index=inv.index)
#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

unq.reindex(labels=["abcdefghi"], index=inv.index)

#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

The output remains the same for any labels value you throw at it.
Please let me know how I should proceed with fixing this.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
8 participants