PERF: Very slow clip performance #15400

Closed
wesm opened this Issue Feb 14, 2017 · 2 comments

Member

wesm commented Feb 14, 2017

Code Sample, a copy-pastable example if possible

In [38]: s = pd.Series(np.random.randn(30))

In [39]: timeit s.clip(0, 1)
100 loops, best of 3: 2.02 ms per loop

Problem description

There is more than a 1000x performance difference between Series.clip and numpy.clip (here arr is the underlying array, s.values):

In [43]: timeit np.clip(arr, 0, 1)
1000000 loops, best of 3: 1.06 µs per loop
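For anyone who wants to reproduce the comparison outside IPython, here is a minimal standalone sketch. It assumes a pandas version that has Series.to_numpy (0.24 or later); the repeat count n is an arbitrary choice, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the report: a 30-element float Series and its underlying array.
s = pd.Series(np.random.randn(30))
arr = s.to_numpy()

# Both paths should produce the same clipped values.
assert np.array_equal(s.clip(lower=0, upper=1).to_numpy(), np.clip(arr, 0, 1))

n = 1000  # hypothetical repeat count, chosen only for illustration
t_series = min(timeit.repeat(lambda: s.clip(lower=0, upper=1), number=n, repeat=3)) / n
t_numpy = min(timeit.repeat(lambda: np.clip(arr, 0, 1), number=n, repeat=3)) / n

print(f"Series.clip: {t_series * 1e6:.1f} µs per call")
print(f"np.clip:     {t_numpy * 1e6:.1f} µs per call")
print(f"ratio:       {t_series / t_numpy:.0f}x")
```

With the timings quoted above, the ratio works out to 2.02 ms / 1.06 µs, roughly 1900x.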

Output of pd.show_versions()

pandas 0.19.2

wesm added the Bug label Feb 14, 2017

wesm added this to the 0.20.0 milestone Feb 14, 2017

Member

jorisvandenbossche commented Feb 14, 2017

I wondered where this huge difference came from. I'm not saying the difference isn't a problem, but it seems to be a consequence of the implementation: several slower functions are used under the hood.
The clip is done in two separate steps, clip_upper and clip_lower. Each of those does a comparison to create a mask and then calls where, which in turn does an align, etc.:

In [89]: %timeit s.clip(0, 1)
100 loops, best of 3: 1.91 ms per loop

In [91]: %timeit s.clip_lower(0)
1000 loops, best of 3: 958 µs per loop

In [92]: %timeit s < 0
10000 loops, best of 3: 118 µs per loop

In [93]: mask = s < 0

In [94]: %timeit s.where(mask, 0)
1000 loops, best of 3: 395 µs per loop

In [100]: %timeit s.align(mask)
10000 loops, best of 3: 98.6 µs per loop
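To make the two-pass structure concrete, here is a rough sketch of the same decomposition using only public API. It is an illustration of the idea (mask, then where, once per bound), not the actual pandas source; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(30))
lower, upper = 0, 1

# Lower clip: keep values that are >= lower (or missing), replace the rest.
keep_lower = (s >= lower) | s.isna()
step1 = s.where(keep_lower, lower)

# Upper clip: the same pattern applied to the intermediate result.
keep_upper = (step1 <= upper) | step1.isna()
step2 = step1.where(keep_upper, upper)

# The two-pass result matches a single clip call.
pd.testing.assert_series_equal(step2, s.clip(lower=lower, upper=upper))
```

Every intermediate here (each comparison, each isna, each where result) is a full Series, which is where the per-call overhead accumulates.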

So several individual steps in the current implementation (creating the mask, the alignment, ...) each already take far longer than the actual clip in numpy. Each of those steps can probably be optimized, but I don't think that alone will give a big speed-up. To get a big speed-up in pandas' clip, we would probably need a more low-level implementation.
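One possible shape for such a low-level path, as a sketch only: clip the raw ndarray with numpy and wrap the result in a Series once. The helper name clip_values is hypothetical, and this assumes a numeric dtype with scalar bounds; it is not necessarily what a real fix in pandas would look like.

```python
import numpy as np
import pandas as pd


def clip_values(s, lower, upper):
    """Hypothetical helper: clip on the raw ndarray, then rewrap once."""
    values = s.to_numpy()
    clipped = np.clip(values, lower, upper)  # NaN entries stay NaN
    return pd.Series(clipped, index=s.index, name=s.name)


s = pd.Series([-1.5, 0.25, np.nan, 3.0], index=list("abcd"), name="x")
result = clip_values(s, 0, 1)

# Same values, index, name and dtype as the built-in method.
pd.testing.assert_series_equal(result, s.clip(lower=0, upper=1))
```

This skips the masks and the alignment entirely, so it does not cover list-like bounds that would need to be aligned with the index.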

When you look at a larger series, the difference is not that huge anymore:

In [32]: s = pd.Series(np.random.randn(100000))

In [33]: %timeit s.clip(0,1)
The slowest run took 8.48 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.27 ms per loop

In [34]: arr = s.values

In [35]: %timeit np.clip(arr,0,1)
The slowest run took 4.41 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 558 µs per loop
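The shrinking gap is consistent with a mostly fixed per-call overhead in pandas on top of work that scales with length. A small sketch to see this across sizes; the sizes and the helper per_call are arbitrary choices for illustration, and no particular numbers are implied.

```python
import timeit

import numpy as np
import pandas as pd


def per_call(fn, number=200):
    """Best-of-3 average time per call, in seconds."""
    return min(timeit.repeat(fn, number=number, repeat=3)) / number


rows = []
for size in (30, 1_000, 100_000):  # sizes chosen for illustration
    s = pd.Series(np.random.randn(size))
    arr = s.to_numpy()
    t_pd = per_call(lambda: s.clip(lower=0, upper=1))
    t_np = per_call(lambda: np.clip(arr, 0, 1))
    rows.append((size, t_pd, t_np, t_pd / t_np))

for size, t_pd, t_np, ratio in rows:
    print(f"n={size:>7}: Series.clip {t_pd * 1e6:9.1f} µs, "
          f"np.clip {t_np * 1e6:9.1f} µs, ratio {ratio:6.1f}x")
```

In the timings above, the ratio at 100000 elements is 4.27 ms / 558 µs, a bit under 8x, compared with well over 1000x at 30 elements.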

Member

wesm commented Feb 20, 2017

Profile results of 100 runs

         301103 function calls (300903 primitive calls) in 0.220 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.220    0.220 {built-in method builtins.exec}
        1    0.000    0.000    0.220    0.220 <string>:1(<module>)
      100    0.001    0.000    0.220    0.002 generic.py:3825(clip)
      100    0.001    0.000    0.109    0.001 generic.py:3913(clip_lower)
      100    0.001    0.000    0.109    0.001 generic.py:3889(clip_upper)
      200    0.001    0.000    0.092    0.000 generic.py:4806(where)
      200    0.002    0.000    0.092    0.000 generic.py:4547(_where)
2000/1800    0.013    0.000    0.074    0.000 internals.py:2978(apply)
     2800    0.012    0.000    0.065    0.000 series.py:135(__init__)
      200    0.001    0.000    0.062    0.000 ops.py:903(wrapper)
      400    0.001    0.000    0.050    0.000 ops.py:907(<lambda>)
      200    0.001    0.000    0.047    0.000 ops.py:1039(flex_wrapper)
      600    0.002    0.000    0.040    0.000 series.py:2364(fillna)
      600    0.005    0.000    0.039    0.000 generic.py:3200(fillna)
      200    0.002    0.000    0.036    0.000 ops.py:803(wrapper)
      200    0.001    0.000    0.034    0.000 internals.py:3158(where)
      600    0.002    0.000    0.032    0.000 generic.py:3007(astype)
      600    0.002    0.000    0.031    0.000 generic.py:3057(copy)
      200    0.000    0.000    0.027    0.000 series.py:2342(align)
      200    0.001    0.000    0.027    0.000 generic.py:4379(align)
      200    0.001    0.000    0.026    0.000 generic.py:4470(_align_series)
      400    0.001    0.000    0.022    0.000 series.py:2360(reindex)
      400    0.002    0.000    0.022    0.000 generic.py:2224(reindex)
      400    0.000    0.000    0.021    0.000 series.py:2326(_reindex_inde

A single call to clip calls the Series constructor 28 times (2800 calls to series.py __init__ over 100 runs). Not good. I will try to look more deeply into fixing this if no one beats me to it.
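A profile like this can be collected with the standard library's cProfile and pstats. The sketch below is one way to do it and to derive the constructor count per call; the exact command used for the listing above is not shown in the thread, so treat this setup as an assumption. The counts will differ on other pandas versions.

```python
import cProfile
import io
import os
import pstats

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(30))
runs = 100

profiler = cProfile.Profile()
profiler.enable()
for _ in range(runs):
    s.clip(lower=0, upper=1)
profiler.disable()

# Print the top entries ordered by cumulative time.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf).sort_stats("cumulative")
stats.print_stats(25)
print(buf.getvalue())

# Count Series.__init__ calls by scanning the raw stats table.
# Keys are (filename, lineno, function); values begin with
# (primitive call count, total call count, ...).
init_calls = sum(
    nc
    for (filename, _lineno, func), (_cc, nc, *_rest) in stats.stats.items()
    if func == "__init__" and os.path.basename(filename) == "series.py"
)
print(f"Series.__init__ calls per clip: {init_calls / runs:.1f}")
```

Dividing the ncalls column by the number of runs is how the figure of 28 constructor calls per clip follows from the listing.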

jreback modified the milestones: 0.20.0, 0.21.0, Next Minor Release Mar 23, 2017

jreback modified the milestones: 0.20.2, Interesting Issues May 16, 2017

jreback added a commit to jreback/pandas that referenced this issue May 16, 2017

jreback added a commit to jreback/pandas that referenced this issue May 16, 2017

jreback added a commit to jreback/pandas that referenced this issue May 16, 2017

jreback closed this in #16364 May 16, 2017

jreback added a commit that referenced this issue May 16, 2017

pcluo added a commit to pcluo/pandas that referenced this issue May 22, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 29, 2017

PERF: improved clip performance (#16364)
closes #15400
(cherry picked from commit 42e2a87)

TomAugspurger added a commit that referenced this issue May 30, 2017

PERF: improved clip performance (#16364)
closes #15400
(cherry picked from commit 42e2a87)

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017
