Skip to content

Reading csv with duplicate index columns generate duplicated indexes #3249

Closed
@rcalsaverini

Description

@rcalsaverini

Hi,

I don't know if this is a bug or a feature, or even if you guys noticed it already and it's not an issue, but this caused me some problems so I thought I should report it.

When you use the read_csv command to read a csv, you can pass the columns you want to set as indexes. If those columns contain duplicated entries, the index will have duplicated entries.

To reproduce this, read a csv file with this contents:

indexcol, anothercol, yetanothercol
1, 1, 1
1, 2, 3
2, 1, 2

with the command:

In [1]: import pandas

In [2]: pandas.read_csv('bar.csv', index_col = 'indexcol')
Out[2]: 
           anothercol   yetanothercol
indexcol                             
1                   1               1
1                   2               3
2                   1               2

Is this what should happen? Is there a way to prevent it (like, for example, just read the first ocurrence of each duplicated index an drop the rest)? What I did to get rid of the duplicated entries was:

In [3]: df = pandas.read_csv('bar.csv', index_col = 'indexcol')

In [4]: df.groupby(lambda x:x).first()
Out[4]: 
    anothercol   yetanothercol
1            1               1
2            1               2

Metadata

Metadata

Assignees

No one assigned

    Labels

    IO DataIO issues that don't fit into a more specific labelUsage Question

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions