Pandas inconsistently handles identically named columns in csv export and merging #3468
@jreback actually brought this up a few days ago, and I talked him down because
Name mangling
My guess is that the column name mangling was put there before dupe columns
As it turns out, it's very easy to undo this behavior and it only breaks one test, the one
Merge behavior
The merge behavior is naturally more subtle. Though the error message is unclear
What I will concede is that:
Is really terrible usability, because once a frame with dupe cols is created the user can't
If we can fix that somehow, I think that should be a reasonable compromise. @jreback, what do you think?
The mangling code sits in a place that's reused by several io paths,
jreback referenced this issue Apr 29, 2013
Closed: BUG: GH3468 Fix assigning a new index to a duplicate index in a DataFrame would fail #3483
darindillon commented Apr 29, 2013
Do we really need an option to make read_csv name-mangle the columns? Yeah, we want to be backwards compatible and all, but really, is there even one person in the world who has a CSV that has duplicate columns and does NOT want pandas to create a dataframe with the exact same column names as in the file?
Some users will have existing code that depends on this behavior, when users
What the default behavior should be (opt-in to new, or activate "legacy" mode) is a bike shed topic
This was referenced Apr 29, 2013
#3511 merged, unmangled will become the default in 0.12.
jreback referenced this issue Jun 28, 2013
Closed: Assign column values to non-unique DataFrame #4067
@y-p I think this can be closed? (or wait to reverse mangle default in 0.13?)
definitely, you nailed this one long ago.
darindillon commented Apr 26, 2013
Using pandas 0.10.1
Pandas allows creating a dataframe with two columns with the same name. (I disagree that it should be allowed, but it is allowed, so OK). However, it doesn't handle that correctly in several cases.
Pandas ought to either completely disallow duplicate named columns or handle them everywhere. But it shouldn't handle them in some cases but not others.
Problem 1: Round-trip to a CSV
Dump the dataframe to a CSV and then read it back. Even though duplicate columns are supposed to be legal, Pandas won't allow that in the CSV import/export.
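For anyone trying to reproduce this, here is a minimal sketch of the round trip described above. The frame and its values are made up for illustration and are not the original reproduction from this report. What you see depends on the pandas version: some releases reject the export, others write the header unchanged and rename the second column on import.

```python
import io

import pandas as pd

# Hypothetical frame with two columns that share the name "a".
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# Export to an in-memory CSV rather than a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
header_line = buf.getvalue().splitlines()[0]

# Read the same text back and compare the column labels.
round_tripped = pd.read_csv(io.StringIO(buf.getvalue()))

print("original columns:     ", list(df.columns))
print("CSV header line:      ", header_line)
print("round-tripped columns:", list(round_tripped.columns))
```

If the first and last printed lists differ, the column names did not survive the round trip.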
Problem 2: Merging to a dataframe with dup columns does not work
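A rough way to probe the merge case is sketched below. The frames, the `key` column, and the `try_merge` helper are all assumptions made for illustration, not the original reproduction. The helper reports either the merged column labels or the exception text, since the result may differ between pandas versions.

```python
import pandas as pd


def try_merge(left, right, on):
    """Return the merged column labels, or the error text if the merge fails."""
    try:
        return list(left.merge(right, on=on).columns)
    except Exception as exc:  # broad on purpose: the exception type varies by version
        return f"{type(exc).__name__}: {exc}"


# Hypothetical left frame with a duplicated, non-key column name "a".
left = pd.DataFrame([[1, 2, 3]], columns=["key", "a", "a"])
right = pd.DataFrame({"key": [1], "b": [10]})

result = try_merge(left, right, on="key")
print(result)
```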
I'd be ok with almost any solution (disallowing duplicate named columns, giving you a warning when you do it, handling it correctly in the merge and csv read, etc). But it should be consistently one way or the other.