
PERF: improves performance and memory usage of DataFrame.duplicated #9398

Closed
wants to merge 1 commit into from

Conversation

behzadnouri
Contributor

on master:

In [1]: np.random.seed(2718281)

In [2]: n = 1 << 20

In [3]: t = pd.date_range('2015-01-01', freq='S', periods=n // 64)

In [4]: xs = np.random.randn(n // 64).round(2)

In [5]: df = DataFrame({'a':np.random.randint(- 1 << 8, 1 << 8, n),
   ...:                 'b':np.random.choice(t, n),
   ...:                 'c':np.random.choice(xs, n)})

In [6]: %timeit df.duplicated()
1 loops, best of 3: 8.03 s per loop

In [7]: %memit df.duplicated()
peak memory: 461.79 MiB, increment: 356.10 MiB

on branch:

In [6]: %timeit df.duplicated()
1 loops, best of 3: 259 ms per loop

In [7]: %memit df.duplicated()
peak memory: 154.62 MiB, increment: 49.36 MiB
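The speedup comes from factorizing each column and combining the per-column integer labels into a single id per row, so duplicate detection reduces to one pass over integer ids. Below is a rough, hypothetical sketch of that idea using only public pandas APIs; it is not the PR's actual code, which uses the internal klib hashtables, and a real implementation also has to guard against integer overflow when the product of unique counts gets large.

```python
import numpy as np
import pandas as pd

def duplicated_sketch(df, keep='first'):
    # Factorize each column to integer codes, then combine the codes
    # into one integer id per row (mixed-radix encoding); rows with
    # equal ids are duplicate rows.
    ids = np.zeros(len(df), dtype='i8')
    for _, col in df.items():
        codes, uniques = pd.factorize(col)
        # shift codes so missing values (code -1) get their own bucket
        ids = ids * (len(uniques) + 1) + (codes.astype('i8') + 1)
    return pd.Series(ids, index=df.index).duplicated(keep=keep)
```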

benchmarks:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_duplicated                             | 308.4060 | 7809.8300 |   0.0395 |
frame_drop_duplicates                        |  16.1637 |  28.8030 |   0.5612 |
frame_drop_duplicates_na                     |  17.1696 |  28.3937 |   0.6047 |
multiindex_duplicated                        | 136.1593 | 141.6107 |   0.9615 |
series_drop_duplicates_int                   |   1.1793 |   1.1443 |   1.0306 |
series_drop_duplicates_string                |   0.7900 |   0.7540 |   1.0479 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [76459d2] : performance improvement in DataFrame.duplicated
Base   [ef48c6f] : Merge pull request #9377 from cmeeren/patch-1

DOC: Clarify how date_parser is called (GH9376)

@shoyer
Member

shoyer commented Feb 3, 2015

Nice! I noticed the other day that duplicated did not use the klib hashtable, which seemed strange, but I didn't have the time to dig into it.

Can you safely get rid of pandas.lib.duplicated with this change?


size_hint = min(len(self), _SIZE_HINT_LIMIT)

def factorize(vals):
Member

why can't you just use pandas.core.algorithms.factorize here?

Member

If it's because you don't want to bother with some calculations involving uniques, I would say:

  1. benchmark to see if it really matters
  2. if necessary, separate factorize out into two parts and leave all the private methods access in algorithms.py

Contributor Author

More so because of simplicity; it is only 4 lines of code.

Member

I would still rather reuse factorize here than duplicate these four lines which use a private API.

Contributor Author

"private" is with respect to public user api, not the library itself.

for example see the top of the same file where many private functions are imported.

Member

True, but I still find it clearer to use public APIs internally when possible (especially to avoid duplicated code).

@behzadnouri
Contributor Author

Can you safely get rid of pandas.lib.duplicated with this change?

lib.duplicated is still used in base.duplicated. You could implement a duplicated function for objects using kh_pyset_t, but in my tests the performance would be the same (expectedly so, as the underlying hash function is the same).
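For reference, the set-based approach mentioned here can be sketched in pure Python; this is a minimal analogue of what a kh_pyset_t-backed duplicated() would do for object values (the real one lives in C for speed, but the hashing work per element is the same):

```python
def duplicated_objects(values):
    # Mark a value as duplicated once it has already been seen;
    # a C implementation would use a khash set instead of Python's.
    seen = set()
    out = []
    for v in values:
        out.append(v in seen)
        seen.add(v)
    return out
```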

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance and removed Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Feb 5, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 5, 2015
(hash_klass, vec_klass), vals = \
algos._get_data_algo(vals, algos._hashtables)

uniques, table = vec_klass(), hash_klass(size_hint)
Contributor

@behzadnouri I think what @shoyer means as you are using the private cython impl of indexes. This is currently only used in algos. and does not need to be exposed to general reader of frame.py. So I would make a helper private function in algos.py that does these 3 lines (that return the labels/uniques).

Contributor

that should be refactored into a private function as well

@jreback
Contributor

jreback commented Feb 16, 2015

@behzadnouri if you could refactor this a bit so that the use of the hashtables is a separate function isolated in core/algos.py, that would be great

@behzadnouri
Contributor Author

@jreback honestly I do not see how to do this in a way I am comfortable with. This is a very special usage where only the number of unique values matters, not their actual values.
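To illustrate why only the unique counts matter: when the per-column label arrays are combined into row ids, each column's number of uniques acts as a radix and the unique values themselves are discarded. A small illustration using the public pd.factorize:

```python
import pandas as pd

# Factorize two columns; the uniques arrays are never consulted
# again except for their lengths.
codes_a, uniques_a = pd.factorize(['x', 'y', 'x', 'y'])   # codes [0, 1, 0, 1]
codes_b, uniques_b = pd.factorize([10, 10, 20, 10])       # codes [0, 0, 1, 0]

# len(uniques_b) is the radix; distinct rows get distinct ids.
row_ids = codes_a * len(uniques_b) + codes_b              # [0, 2, 1, 2]
```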

@jreback
Contributor

jreback commented Feb 24, 2015

I meant something like this: jreback@923e35c

@shoyer

@behzadnouri
Contributor Author

I still feel the current PR is cleaner and avoids the unnecessary type check/manipulation in here and here; but if you would like to make these changes, please do.

@jreback
Contributor

jreback commented Mar 3, 2015

merged via 7da9178

thanks @behzadnouri
