read_table / read_csv with duplicate names leads to column duplication #10496

Closed
jens-k opened this issue Jul 3, 2015 · 8 comments
Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv

Comments


jens-k commented Jul 3, 2015

When reading a table with a `names` list that contains duplicates (say, two distinct names repeated), pandas 0.16.1 copies the last two columns of the data over and over again.

I opened a thread on this here:
http://stackoverflow.com/questions/31207560/pandas-read-table-with-duplicate-names

Is this a bug or an intended behavior?

Contributor

jreback commented Jul 3, 2015

pls show a self-contained example reproducing the behavior

Author

jens-k commented Jul 3, 2015

tbl.csv:
https://klinzing.blaucloud.de/index.php/s/gfkz8Za41tQ6E12

[In:]
df = pd.read_table('tbl.csv', header=0, names=['one','two','one','two','one']) 
df

...gives you:

[Out:]
        one       two       one       two       one
0  0.132846  0.120522  0.132846  0.120522  0.132846

...rather than:

[Out:]
        one       two       one       two      one
0  0.117766  0.058881  0.127572  0.120522  0.13286

i.e. the number of columns that get repeated equals the number of unique names given, and the repeated columns are taken from the right-hand end of the data.

Contributor

jreback commented Jul 3, 2015

@forodin23 by self-contained I mean simple code, not having to download a file etc.

In [4]: df = DataFrame({'A' : [1,2], 'B' : [3,4], 'C' : [5,6]})

In [5]: df.to_csv('test.csv',mode='w')     

In [6]: pd.read_csv('test.csv',index_col=0)
Out[6]: 
   A  B  C
0  1  3  5
1  2  4  6

In [7]: pd.read_csv('test.csv',index_col=0,names=['one','two','one'])
Out[7]: 
    one two one
NaN   C   B   C
 0    5   3   5
 1    6   4   6

In answer to your question: there is nothing pandas can do about this. It's not obvious that what you are doing is wrong.

If you actually want a column hierarchy, it is much better to use a MultiIndex. If you really, really want duplicate column names (not recommended; use at your own risk), simply assign them after reading.
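For readers following along, here is a minimal sketch of both suggestions. It assumes a small in-memory CSV (`csv_text`) rather than the file from the report, and the second-level labels `0` and `1` in the MultiIndex are chosen purely for illustration.

```python
import io

import pandas as pd

# Illustrative data only; not the file from the original report.
csv_text = "a,b,c\n1,3,5\n2,4,6\n"

# Read with the file's own (unique) header first, so no duplicate
# names are ever passed to the parser.
df = pd.read_csv(io.StringIO(csv_text))

# Suggestion 1: a two-level MultiIndex keeps every column uniquely addressable.
multi = df.copy()
multi.columns = pd.MultiIndex.from_tuples([("one", 0), ("two", 0), ("one", 1)])

# Suggestion 2: assign duplicate labels after reading. The assignment is
# positional, so the underlying data is left untouched.
dup = df.copy()
dup.columns = ["one", "two", "one"]
```

With the MultiIndex, `multi[("one", 1)]` selects exactly one column, whereas `dup["one"]` returns both columns labelled `one`.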

@jreback jreback closed this as completed Jul 3, 2015
@jreback jreback added Usage Question IO CSV read_csv, to_csv labels Jul 3, 2015
Author

jens-k commented Jul 3, 2015

Why don't you just give an error message if someone tries to use duplicate names? The problem here is that pandas silently changes your data.
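The kind of check being asked for can be sketched on the user side as a small guard that runs before the names are handed to the reader. `require_unique_names` is a hypothetical helper written for this illustration and is not part of pandas.

```python
from collections import Counter


def require_unique_names(names):
    """Hypothetical guard: raise if any column name appears more than once."""
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")
    return list(names)
```

Calling `require_unique_names(['one', 'two', 'one', 'two', 'one'])` raises a `ValueError` instead of letting the data be changed silently, while a list of unique names is returned unchanged and can be passed on as `names=`.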

Contributor

jreback commented Jul 3, 2015

In [2]: df = DataFrame({'A' : [1,2], 'B' : [3,4], 'C' : [5,6]})

In [3]: df.columns=['one','two','one']

In [4]: df.to_csv('test.csv',mode='w')

In [5]: !cat test.csv
,one,two,one
0,1,3,5
1,2,4,6

In [8]: pd.read_csv('test.csv',index_col=0,names=['one','two','one'],header=0)
Out[8]: 
   one  two  one
0    5    3    5
1    6    4    6

hmm also a problem here.

Ok this might be an older bug. I suppose a case could be made for raising here (or rather assigning by position if the names match up). There are several cases to investigate.
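One way the "assign by position" idea could look from the user side is sketched below. `read_csv_by_position` is a hypothetical wrapper, not a pandas function, and for brevity it reads from an in-memory string without an index column.

```python
import io

import pandas as pd


def read_csv_by_position(text, names):
    """Hypothetical helper: label columns by position, never by name lookup."""
    # header=None with skiprows=1 drops the file's header row and yields
    # integer column labels 0..n-1, so duplicate names never reach the parser.
    df = pd.read_csv(io.StringIO(text), header=None, skiprows=1)
    if len(names) != df.shape[1]:
        raise ValueError(f"expected {df.shape[1]} names, got {len(names)}")
    # Positional assignment: the i-th name labels the i-th column.
    df.columns = list(names)
    return df


csv_text = "one,two,one\n1,3,5\n2,4,6\n"
df = read_csv_by_position(csv_text, ["one", "two", "one"])
```

Here the first and last columns both carry the label `one` but keep their own data, and a mismatch between the number of names and the number of columns raises rather than passing silently.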

pull-request?

@jreback jreback reopened this Jul 3, 2015
@jreback jreback added Difficulty Novice Error Reporting Incorrect or improved errors from pandas and removed Usage Question labels Jul 3, 2015
@jreback jreback added this to the Next Major Release milestone Jul 3, 2015
Author

jens-k commented Jul 3, 2015

A pull request by me? Sorry, I've been using Python for about a day (coming from MATLAB). I don't feel equipped to fix that myself :)

Contributor

jreback commented Jul 3, 2015

ok, np. always a good way to learn though :)

see here for guidelines

Contributor

jreback commented Jul 17, 2015

dupe of #9424

@jreback jreback closed this as completed Jul 17, 2015
2 participants