read_csv clobbers values of columns with duplicate names #9424

Closed
stevenmanton opened this Issue Feb 5, 2015 · 3 comments

stevenmanton commented Feb 5, 2015

xref #10577 (has test for duplicates with empty data)

I don't think this is the intended behavior, though it's possible I'm doing something wrong. When importing data with the names keyword, read_csv clobbers the values of columns whose names are duplicated. For example:

from StringIO import StringIO
import pandas as pd

data = """a,1
b,2
c,3"""
names = ['field', 'field']

print pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=True)
print pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=False)

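Both calls print the same frame, in which the first column's values (a, b, c) have been replaced by the second column's: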

   field  field
0      1      1
1      2      2
2      3      3
   field  field
0      1      1
1      2      2
2      3      3

However, this produces the correct result:

df = pd.read_csv(StringIO(data), header=None)
df.columns = names
print df
   field  field
0      a      1
1      b      2
2      c      3

Interestingly, it works if the field names are in the header:

data_with_header = "field,field\n" + data
print pd.read_csv(StringIO(data_with_header))
  field  field.1
0     a        1
1     b        2
2     c        3

Is this a bug or am I doing something wrong?

yoavram Apr 26, 2015

Contributor

I came across something similar. When using mangle_dupe_cols=False I get duplicate column names, but every column sharing a name holds the same data, even though the values differ in the data I read. It seems that the last column with a given name overwrites all the other columns of that name.

import pandas as pd
from StringIO import StringIO

data = """A,A,B,B,B
    1,2,3,4,5
    6,7,8,9,10
    11,12,13,14,15
    """

df1 = pd.read_table(StringIO(data), sep=',', mangle_dupe_cols=True)
df2 = pd.read_table(StringIO(data), sep=',', mangle_dupe_cols=False)

Now df1 is:

    A  A.1   B  B.1  B.2
0   1    2   3    4    5
1   6    7   8    9   10
2  11   12  13   14   15

which has the original data but non-duplicate column names;

and df2 is:

    A   A   B   B   B
0   2   2   5   5   5
1   7   7  10  10  10
2  12  12  15  15  15

which has duplicate column names, but their respective data has been overwritten.

Reproducible bug in IPython notebook: http://nbviewer.ipython.org/github/yoavram/ipython-notebooks/blob/master/pandas%20duplicate%20column%20bug.ipynb

Pandas version 0.16.0. Python 2.7.
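
One way this kind of clobbering can arise, offered only as an illustrative guess and not as a description of pandas internals, is when parsed columns are collected in a mapping keyed by column name and the result is then rebuilt by looking each name up again. The sketch below is plain Python using the standard csv module; `parse_columns_by_name` is a hypothetical helper, not a pandas function.

```python
import csv
from io import StringIO


def parse_columns_by_name(text):
    """Hypothetical parser that stores each column under its header name."""
    rows = list(csv.reader(StringIO(text)))
    header, body = rows[0], rows[1:]

    by_name = {}
    for position, name in enumerate(header):
        # A later column with the same name silently replaces the earlier one.
        by_name[name] = [row[position] for row in body]

    # Rebuilding by name yields the surviving (last) column for every duplicate.
    return [(name, by_name[name]) for name in header]


data = "A,A,B,B,B\n1,2,3,4,5\n6,7,8,9,10\n11,12,13,14,15\n"
for name, values in parse_columns_by_name(data):
    print(name, values)
```

Under that assumption both A columns come back as 2, 7, 12 and all three B columns as 5, 10, 15, which is the same pattern as df2 above. Keeping columns in a list indexed by position, rather than in a name-keyed mapping, would avoid the overwrite.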

yoavram added a commit to yoavram/ipython-notebooks that referenced this issue Apr 26, 2015

shoyer added Bug IO CSV labels Apr 27, 2015

shoyer added this to the Next Major Release milestone Apr 27, 2015

shoyer Apr 27, 2015

Member

This seems very strange to me -- I don't think there's any good reason for this behavior. I'm going to label it as a bug.

gfyoung Apr 22, 2016

Member

xref #7160

jreback modified the milestones: 0.18.2, Next Major Release May 7, 2016

gfyoung added a commit to gfyoung/pandas that referenced this issue May 23, 2016

BUG, ENH: Add support for parsing duplicate columns
Deduplicates the 'names' parameter by default if there are
duplicate names. Also raises when 'mangle_dupe_cols' is
False to prevent data overwrite.

Closes gh-7160.
Closes gh-9424.
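
The behavior described in this commit message can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code merged into pandas: `dedupe_names` is a hypothetical helper, and the `.1`, `.2` suffix scheme is taken from the mangled headers shown earlier in this thread.

```python
def dedupe_names(names, mangle_dupe_cols=True):
    """Hypothetical helper: make column names unique, or refuse duplicates."""
    if not mangle_dupe_cols and len(set(names)) != len(names):
        # Refusing is safer than letting one column overwrite another.
        raise ValueError("duplicate names with mangle_dupe_cols=False")

    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; repeats become name.1, name.2, ...
        result.append(name if count == 0 else "{}.{}".format(name, count))
        seen[name] = count + 1
    return result


print(dedupe_names(["field", "field"]))
print(dedupe_names(["A", "A", "B", "B", "B"]))
```

The sketch does not guard against a generated name such as `A.1` colliding with a column that is already called `A.1`; a real implementation would need to decide how to handle that case.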

jreback closed this in 9a6ce07 May 23, 2016

nps added a commit to nps/pandas that referenced this issue May 30, 2016

BUG, ENH: Add support for parsing duplicate columns
Closes #7160
Closes #9424

Author: gfyoung <gfyoung17@gmail.com>

Closes #12935 from gfyoung/dupe-col-names and squashes the following commits:

ef7636f [gfyoung] BUG, ENH: Add support for parsing duplicate columns