pandas.DataFrame.duplicated to allow take_all #6511

Closed
socheon opened this Issue Feb 28, 2014 · 3 comments

socheon commented Feb 28, 2014

When working with external data, I often see rows with primary key violations. Currently, I cannot easily select all the violating rows. For example, suppose I have a massive file with some inconsistent data:

datecol,valuecol
...
2014-01-01,12
2014-01-01,13
2014-01-02,10
...

In this use case, it would be good if we could do df[df.duplicated('datecol', take_all=True)] to directly get the bad rows:

2014-01-01,12
2014-01-01,13

jreback commented Feb 28, 2014

You can do it like this. That said, this would not be hard to implement in lib.duplicated anyhow.

In [108]: df = DataFrame({ 'A' : [1,1,2,2,2,4,5,2,2]})

In [109]: df
Out[109]: 
   A
0  1
1  1
2  2
3  2
4  2
5  4
6  5
7  2
8  2

[9 rows x 1 columns]

In [110]: df[df.A.isin(df.A[df.A.duplicated()].unique())]
Out[110]: 
   A
0  1
1  1
2  2
3  2
4  2
7  2
8  2

[7 rows x 1 columns]
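
For reference, here is the same expression applied to the sample rows from the original report. This is only a sketch: the elided "..." rows are left out, and the variable names `csv_text`, `repeated_keys`, and `bad_rows` are made up for illustration.

```python
import io

import pandas as pd

# Sample rows from the original report (the "..." lines are omitted).
csv_text = """datecol,valuecol
2014-01-01,12
2014-01-01,13
2014-01-02,10
"""
df = pd.read_csv(io.StringIO(csv_text))

# duplicated() flags every repeat after the first occurrence,
# so this collects each key that appears more than once.
repeated_keys = df.datecol[df.datecol.duplicated()].unique()

# Select every row whose key is repeated, including the first occurrence.
bad_rows = df[df.datecol.isin(repeated_keys)]
print(bad_rows)
```

Both 2014-01-01 rows are selected, because isin matches on the key rather than on the duplicated() flag itself.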

jreback added the Algos label Feb 28, 2014

jreback added this to the 0.15.0 milestone Feb 28, 2014

sinhrks commented Nov 22, 2014

Interested in this. To cover all three patterns, how about changing the duplicated / drop_duplicates keyword as below?

duplicated:

  • take='first' (default): Mark duplicates as True, except the first occurrence.
  • take='last': Mark duplicates as True, except the last occurrence.
  • take='none': Mark all duplicates as True.

drop_duplicates:

  • take='first' (default): Drop duplicates, keeping the first occurrence.
  • take='last': Drop duplicates, keeping the last occurrence.
  • take='none': Drop all duplicates, keeping none of them.
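
To make the three patterns concrete, here is a plain-Python sketch of the flags each proposed option would produce. It is not pandas code: `mark_duplicates` is a hypothetical helper, and the `take` values simply mirror the proposal above. The input is the column from the earlier example in this thread.

```python
from collections import Counter


def mark_duplicates(values, take="first"):
    """Hypothetical helper illustrating the proposed `take` semantics."""
    if take not in ("first", "last", "none"):
        raise ValueError("take must be 'first', 'last' or 'none'")

    if take == "none":
        # Flag every value that occurs more than once.
        counts = Counter(values)
        return [counts[v] > 1 for v in values]

    # Walk forwards for 'first', backwards for 'last'; the first value
    # met in that direction is the one left unflagged.
    sequence = values if take == "first" else reversed(values)
    seen = set()
    flags = []
    for v in sequence:
        flags.append(v in seen)
        seen.add(v)
    if take == "last":
        flags.reverse()
    return flags


values = [1, 1, 2, 2, 2, 4, 5, 2, 2]
first_flags = mark_duplicates(values, take="first")
last_flags = mark_duplicates(values, take="last")
all_flags = mark_duplicates(values, take="none")
```

Under these assumptions, drop_duplicates with the same option would keep exactly the rows whose flag is False.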

shoyer commented Nov 23, 2014

@sinhrks take a look at #8505 (a duplicate issue) where we discussed this.
