Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

Conversation

johnzangwill
Copy link
Contributor

@johnzangwill johnzangwill commented Dec 29, 2021

Alternative to #44755 to resolve #44410 and #44992 following @jreback's review requests.
The default for DataFrame.insert is now to use the flag, which defaults to allow duplicate column labels.
Comments please @rhshadrach, @phofl, @jbrockmendel

Some tests no longer raise and I have xfailed all of these these for the time being.
This is Draft and I have changed #44755 to Draft until there is agreement!

@pep8speaks
Copy link

pep8speaks commented Dec 29, 2021

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-29 18:10:07 UTC

@johnzangwill johnzangwill marked this pull request as draft December 29, 2021 12:14
@phofl
Copy link
Member

phofl commented Dec 29, 2021

This is an API change which is not easy to see but has a big impact on users. We definitely can not do this without a deprecation cycle.

That said, I don't like duplicated columns in DataFrames as a user and as a maintainer. This causes lots of problems at various places. Imho we should not make this the default. I am a big fan that insert raises if somehow you try to insert the column twice.

Could you find out where the flag has an impact right now?

@phofl
Copy link
Member

phofl commented Dec 29, 2021

Even if you set the flag to False when creating your DataFrames, the information may be lost somewhere along the way, because we do not call finalize everywhere, for example:

left = pd.DataFrame({"a": [1, 2, 3]})
left.flags.allows_duplicate_labels = False

right = pd.DataFrame({"b": [1, 2, 3]})
right.flags.allows_duplicate_labels = False

result = left.merge(right, left_on="a", right_on="b")
result.flags.allows_duplicate_labels

@johnzangwill
Copy link
Contributor Author

Plus, my issue #44958 shows that the flag is not completely implemented.

@jbrockmendel
Copy link
Member

Plus, my issue #44958 shows that the flag is not completely implemented.

allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2021

allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely.

sure but we need to incrementally get there.

@phofl
Copy link
Member

phofl commented Dec 29, 2021

If we want to support it fully, we should make sure that the flag is not lost when calling something (e.g. finish the finalize issue) before using it more actively I think

@johnzangwill
Copy link
Contributor Author

allows_duplicate_labels/flags is a neat idea

I'm not so sure. I believe that it was intended to move towards defaulting to False, changing DataFrame's well-documented default behavior. But the value of True was precisely so that data could be loaded from the outside world, with or without unique column labels.

So if, in real-world practice, it is always going to need to be True, what purpose would a default of False serve?

The current situation is that flags/allows_duplicate_labels=False is broken and/or untested in a variety of ways, some quite difficult to fix.

@rhshadrach and @phofl have argued that reset_index() and insert() should default to raising on duplicates, regardless of the status of flags/allows_duplicate_labels. That was what my #44755 was based on, and it does not require fixing flags.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2021

@rhshadrach and @phofl have argued that reset_index() and insert() should default to raising on duplicates, regardless of the status of flags/allows_duplicate_labels. That was what my #44755 was based on, and it does not require fixing flags.

that's not correct. pandas does support duplicate columns names. sure a lot of times you want to actively prevent that but there are many times where you do allow this. hence the flag.

@johnzangwill
Copy link
Contributor Author

Please note:

flag = df.flags.allows_duplicate_labels
df.set_flag(allows_duplicate_labels=True)
df.insert(...)
df.set_flag(allows_duplicate_labels=flag)

is different from the current

df.insert(..., allow_duplicates=True)

which raises if flags.allows_duplicate_labels is False.

Which do we prefer?

@jbrockmendel
Copy link
Member

allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely. [me]

sure but we need to incrementally get there. [jreback]

If we want to support it fully, we should make sure that the flag is not lost when calling something (e.g. finish the finalize issue) before using it more actively I think [phofl]

I guess I'm less optimistic than others about this actually happening. ATM a third of the finalize tests and a quarter of the duplicate_labels tests are xfailed. Together these constitute ~18% of our overall xfails (which is a stat I'm definitely more-than-averagely focused on).

To be clear, I'm not advocating any particular change here. Mostly signaling that I'm open to it if it comes into conflict with bugfixes or other priorities.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2021

Please note:

flag = df.flags.allows_duplicate_labels
df.set_flag(allows_duplicate_labels=True)
df.insert(...)
df.set_flag(allows_duplicate_labels=flag)

is different from the current

df.insert(..., allow_duplicates=True)

which raises if flags.allows_duplicate_labels is False.

Which do we prefer?

we prefer allowing both the option AND having it handle the flag. My point is until we can do that (reliabley / w/o xfailing the world etc). we cannot change any code here.

@johnzangwill
Copy link
Contributor Author

This PR was to demonstrate to @jreback a solution using the frame flags to provide default insert behavior. Since neither I nor other reviewers believe that this is a valid or acceptable solution, I am closing it

@johnzangwill johnzangwill deleted the Using-flags.allows_duplicate_labels branch January 30, 2022 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError
5 participants