Bug in read_csv with duplicated column names #7160

Closed
rafaljozefowicz opened this Issue May 18, 2014 · 10 comments

Tested on 0.13.0, 0.13.1 and 0.14.0rc1:

from StringIO import StringIO
import pandas as pd

# this is correct
print(pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "a"]))
# and this is fine as well
# (although, changing the column names to a,b,a.1)
print(pd.read_csv(StringIO("a,b,a\n0,1,2\n3,4,5")))
# but this is not correct
print(pd.read_csv(StringIO("0,1,2\n3,4,5"), names=["a", "b", "a"]))

The last one returns the following, where the first 'a' column wrongly repeats the values of the third column:

Out[5]: 
   a  b  a
0  2  1  2
1  5  4  5

I would expect all 3 methods to return the same DataFrame. I noticed this when I wanted to read a CSV file whose header was stored in a separate file (with a duplicated column in it). BTW, is there a better way to do it than to read the header file first and pass the output into the 'names' parameter of read_csv?

jreback May 18, 2014

Contributor

read_csv needs to interpret many formats, which is why it renames columns so that there are no duplicates (and some code assumes they are unique), so this needs some work.

Marking as a bug for 0.15. Feel free to submit a PR.

jreback added Bug labels May 18, 2014

jreback added this to the 0.15.0 milestone May 18, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

gfyoung Apr 17, 2016

Member

@jreback : This bug still exists in master. What sort of output should be expected from this? Is the second one supposed to give the exact same output as the first and third (when patched) ones?

jreback Apr 17, 2016

Contributor

Yeah, I think the problem is that the names -> columns mapping is passed back as a dict and not as two lists, so the duplicate gets lost. It's in the post-processing code in Python somewhere.

I would expect what the OP suggests. Note that this is different from the case where usecols has dupes; names is merely acting as the header here.
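
To illustrate the failure mode described in this comment, here is a minimal sketch in plain Python. It is not the actual parser code, and the variable names are made up for the example. It assumes the parsed columns are keyed by name in a dict, in which case a duplicated name keeps only the last column, and rebuilding in header order reproduces the output from the report.

```python
# Hypothetical stand-in for the parser's post-processing step;
# the real pandas internals are not shown in this thread.
names = ["a", "b", "a"]
columns = [[0, 3], [1, 4], [2, 5]]  # column data for "0,1,2\n3,4,5"

# Keying by name: the second "a" overwrites the first.
by_name = dict(zip(names, columns))

# Rebuilding in header order returns the last "a" column twice.
rebuilt = [by_name[name] for name in names]
print(rebuilt)  # [[2, 5], [1, 4], [2, 5]]

# Keeping names and columns as two parallel lists preserves every column.
paired = list(zip(names, columns))
print(paired)  # [('a', [0, 3]), ('b', [1, 4]), ('a', [2, 5])]
```

The first printed result matches the wrong DataFrame in the report: the original first column (0, 3) is gone.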

gfyoung Apr 19, 2016

Member

FYI: for the second example, that output is correct because mangle_dupe_cols defaults to True, meaning that all columns become unique with .{x} labelling as expected. The bug also surfaces if you set mangle_dupe_cols=False.
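
For reference, the .{x} labelling can be sketched in a few lines of plain Python. This only illustrates the renaming scheme, under the assumption that the first occurrence keeps its name and later ones get a numeric suffix (consistent with the a, b, a.1 result in the report). `mangle_dupes` is a hypothetical helper, not a pandas function, and it ignores edge cases such as a header that already contains "a.1".

```python
def mangle_dupes(names):
    """Return names made unique by appending .1, .2, ... to repeats."""
    seen = {}  # name -> number of times it has appeared so far
    result = []
    for name in names:
        count = seen.get(name, 0)
        # The first occurrence is kept as-is; repeats get a suffix.
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return result


print(mangle_dupes(["a", "b", "a"]))  # ['a', 'b', 'a.1']
```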

gfyoung Apr 19, 2016

Member

@jreback : Question, what does the as_recarray option do exactly? And, should we be raising a ValueError (as we do now in master) in the third example if we set as_recarray=True?

jreback Apr 19, 2016

Contributor

as_recarray returns a numpy rec-array. But AFAIK it's pretty much unused, not tested very well, and somewhat buggy. It was originally for numpy compat (and of course would be important if we eventually extracted read_csv as a separate independent module, so that numpy could use it directly).

gfyoung Apr 19, 2016

Member

Oh, okay. How about my second question?

EDIT: never mind - numpy says the answer is yes.

gfyoung Apr 19, 2016

Member

@jreback : Question, what does mangle_dupe_cols mean exactly? It seems to only apply when the header is in the file but has no effect if names has duplicates (as in this issue).

jreback Apr 19, 2016

Contributor

It's a way to turn duplicates into things like A_1, A_2, etc.

We really don't need this anymore, but it's there, so leave it I guess. The main issue is supporting duplicates properly.

gfyoung Apr 19, 2016

Member

That is true...as of right now, there is ZERO support for duplicates AFAICT. 😞

jreback modified the milestones: 0.18.2, Next Major Release May 7, 2016

gfyoung added a commit to gfyoung/pandas that referenced this issue May 23, 2016

BUG, ENH: Add support for parsing duplicate columns
Deduplicates the 'names' parameter by default if
there are duplicate names. Also raises when
'mangle_dupe_cols' is False to prevent data overwrite.

Closes gh-7160.
Closes gh-9424.
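
The policy in this commit message (deduplicate by default, raise when 'mangle_dupe_cols' is False) could be sketched roughly as follows. This is an assumption-laden illustration, not the code from the commit: `find_duplicate_names` and `validate_names` are hypothetical helpers, and the exception message is invented.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the sorted list of names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def validate_names(names, mangle_dupe_cols=True):
    """Raise if duplicates would overwrite data; else report what to rename."""
    dupes = find_duplicate_names(names)
    if dupes and not mangle_dupe_cols:
        # Refuse rather than silently let one column overwrite another.
        raise ValueError(f"duplicate names with mangle_dupe_cols=False: {dupes}")
    return dupes  # names that still need deduplicating


print(validate_names(["a", "b", "a"]))  # ['a']
```

Raising instead of proceeding avoids the silent overwrite shown in the original report.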

jreback closed this in 9a6ce07 May 23, 2016

nps added a commit to nps/pandas that referenced this issue May 30, 2016

BUG, ENH: Add support for parsing duplicate columns
Closes #7160
Closes #9424

Author: gfyoung <gfyoung17@gmail.com>

Closes #12935 from gfyoung/dupe-col-names and squashes the following commits:

ef7636f [gfyoung] BUG, ENH: Add support for parsing duplicate columns