
PERF: improves performance and memory usage of DataFrame.duplicated #9398

Closed
wants to merge 1 commit into from

Conversation

behzadnouri
Contributor

on master:

In [1]: np.random.seed(2718281)

In [2]: n = 1 << 20

In [3]: t = pd.date_range('2015-01-01', freq='S', periods=n // 64)

In [4]: xs = np.random.randn(n // 64).round(2)

In [5]: df = DataFrame({'a':np.random.randint(- 1 << 8, 1 << 8, n),
   ...:                 'b':np.random.choice(t, n),
   ...:                 'c':np.random.choice(xs, n)})

In [6]: %timeit df.duplicated()
1 loops, best of 3: 8.03 s per loop

In [7]: %memit df.duplicated()
peak memory: 461.79 MiB, increment: 356.10 MiB

on branch:

In [6]: %timeit df.duplicated()
1 loops, best of 3: 259 ms per loop

In [7]: %memit df.duplicated()
peak memory: 154.62 MiB, increment: 49.36 MiB
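The speedup comes from factorizing each column and combining the per-column integer labels into a single id per row, so duplicate detection reduces to one pass over integer ids. Below is a rough, hypothetical sketch of that idea using only public pandas APIs; it is not the PR's actual code, which uses the internal klib hashtables, and a real implementation also has to guard against integer overflow when the product of unique counts gets large.

```python
import numpy as np
import pandas as pd

def duplicated_sketch(df, keep='first'):
    # Factorize each column to integer codes, then combine the codes
    # into one integer id per row (mixed-radix encoding); rows with
    # equal ids are duplicate rows.
    ids = np.zeros(len(df), dtype='i8')
    for _, col in df.items():
        codes, uniques = pd.factorize(col)
        # shift codes so missing values (code -1) get their own bucket
        ids = ids * (len(uniques) + 1) + (codes.astype('i8') + 1)
    return pd.Series(ids, index=df.index).duplicated(keep=keep)
```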

benchmarks:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_duplicated                             | 308.4060 | 7809.8300 |   0.0395 |
frame_drop_duplicates                        |  16.1637 |  28.8030 |   0.5612 |
frame_drop_duplicates_na                     |  17.1696 |  28.3937 |   0.6047 |
multiindex_duplicated                        | 136.1593 | 141.6107 |   0.9615 |
series_drop_duplicates_int                   |   1.1793 |   1.1443 |   1.0306 |
series_drop_duplicates_string                |   0.7900 |   0.7540 |   1.0479 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [76459d2] : performance improvement in DataFrame.duplicated
Base   [ef48c6f] : Merge pull request #9377 from cmeeren/patch-1

DOC: Clarify how date_parser is called (GH9376)

@shoyer
Member

shoyer commented Feb 3, 2015

Nice! I noticed the other day that duplicated did not use the klib hashtable, which seemed strange, but I didn't have the time to dig into it.

Can you safely get rid of pandas.lib.duplicated with this change?


size_hint = min(len(self), _SIZE_HINT_LIMIT)

def factorize(vals):
Member

why can't you just use pandas.core.algorithms.factorize here?

Member

If it's because you don't want to bother with some calculations involving uniques, I would say:

  1. benchmark to see if it really matters
  2. if necessary, separate factorize out into two parts and leave all the private methods access in algorithms.py

Contributor Author

More so because of simplicity; it is only 4 lines of code.

Member

I would still rather reuse factorize here than duplicate these four lines which use a private API.

Contributor Author

"private" is with respect to public user api, not the library itself.

for example see the top of the same file where many private functions are imported.

Member

True, but I still find it clearer to use public APIs internally when possible (especially to avoid duplicated code).

@behzadnouri
Contributor Author

Can you safely get rid of pandas.lib.duplicated with this change?

lib.duplicated is still used in base.duplicated. You could implement a duplicated function for objects using kh_pyset_t, but in my tests the performance would be the same (expectedly so, as the underlying hash function is the same).
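For reference, the set-based approach mentioned here can be sketched in pure Python; this is a minimal analogue of what a kh_pyset_t-backed duplicated() would do for object values (the real one lives in C for speed, but the hashing work per element is the same):

```python
def duplicated_objects(values):
    # Mark a value as duplicated once it has already been seen;
    # a C implementation would use a khash set instead of Python's.
    seen = set()
    out = []
    for v in values:
        out.append(v in seen)
        seen.add(v)
    return out
```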

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance and removed Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Feb 5, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 5, 2015
(hash_klass, vec_klass), vals = \
algos._get_data_algo(vals, algos._hashtables)

uniques, table = vec_klass(), hash_klass(size_hint)
Contributor

@behzadnouri I think what @shoyer means as you are using the private cython impl of indexes. This is currently only used in algos. and does not need to be exposed to general reader of frame.py. So I would make a helper private function in algos.py that does these 3 lines (that return the labels/uniques).

Contributor

that should be refactored into a private function as well

@jreback
Contributor

jreback commented Feb 16, 2015

@behzadnouri if you could refactor this a bit so that the use of the hashtables is a separate function isolated in core/algos.py, that would be great

@behzadnouri
Contributor Author

@jreback honestly I do not see how to do this in a way I am comfortable with. This is a very special usage where only the number of unique values matters, not their actual values.
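To illustrate why only the unique counts matter: when the per-column label arrays are combined into row ids, each column's number of uniques acts as a radix and the unique values themselves are discarded. A small illustration using the public pd.factorize:

```python
import pandas as pd

# Factorize two columns; the uniques arrays are never consulted
# again except for their lengths.
codes_a, uniques_a = pd.factorize(['x', 'y', 'x', 'y'])   # codes [0, 1, 0, 1]
codes_b, uniques_b = pd.factorize([10, 10, 20, 10])       # codes [0, 0, 1, 0]

# len(uniques_b) is the radix; distinct rows get distinct ids.
row_ids = codes_a * len(uniques_b) + codes_b              # [0, 2, 1, 2]
```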

@jreback
Contributor

jreback commented Feb 24, 2015

I meant something like this: jreback@923e35c

@shoyer

@behzadnouri
Contributor Author

I still feel the current PR is cleaner and avoids the unnecessary type check/manipulation in here and here; but if you would like to make these changes, please do.

@jreback
Contributor

jreback commented Mar 3, 2015

merged via 7da9178

thanks @behzadnouri
