Hi,
I don't know whether this is a bug or a feature, or whether you have already noticed it and decided it isn't an issue, but it caused me some problems so I thought I should report it.
When you use the read_csv function to read a CSV file, you can pass the column(s) you want to use as the index via index_col. If those columns contain duplicated entries, the resulting index will also contain duplicated entries.
To reproduce this, read a csv file with this contents:
indexcol, anothercol, yetanothercol
1, 1, 1
1, 2, 3
2, 1, 2
with the command:
In [1]: import pandas
In [2]: pandas.read_csv('bar.csv', index_col = 'indexcol')
Out[2]:
          anothercol  yetanothercol
indexcol
1                  1              1
1                  2              3
2                  1              2
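In case it helps, here is a quick check showing that the parsed index really does contain duplicates (a sketch assuming the Index attributes and methods available in a reasonably recent pandas):

import pandas

df = pandas.read_csv('bar.csv', index_col='indexcol')
# is_unique is False because the label 1 appears twice in the index
print(df.index.is_unique)
# duplicated() flags every repetition of a label after its first occurrence
print(df.index.duplicated())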
Is this what should happen? Is there a way to prevent it (for example, keeping only the first occurrence of each duplicated index and dropping the rest)? What I did to get rid of the duplicated entries was:
In [3]: df = pandas.read_csv('bar.csv', index_col = 'indexcol')
In [4]: df.groupby(lambda x:x).first()
Out[4]:
   anothercol  yetanothercol
1           1              1
2           1              2
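For what it's worth, another approach that seems to work, assuming a pandas version where Index.duplicated is available, is to keep only the first row per index label with boolean indexing instead of the groupby:

import pandas

df = pandas.read_csv('bar.csv', index_col='indexcol')
# ~df.index.duplicated(keep='first') is True for the first occurrence of each
# label and False for the repeats, so this keeps one row per index value
deduped = df[~df.index.duplicated(keep='first')]

This avoids the group-by/aggregation step, but I don't know if it is the intended way to handle this.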