Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: improved clip performance #16364

Merged
merged 1 commit into from May 16, 2017
Merged

PERF: improved clip performance #16364

merged 1 commit into from May 16, 2017

Conversation

jreback
Copy link
Contributor

@jreback jreback commented May 16, 2017

closes #15400
In [1]: np.random.seed(1234)

In [2]: s = pd.Series(np.random.randn(50))

master

In [3]: %timeit s.clip(0, 1)
1.65 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR

In [3]: %timeit s.clip(0, 1)
124 µs ± 2.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

prob as good as can do for now as we still have 2 where ops (numpy does this in a single loop), and we have a mask check and fill (and final construction).

but about 15x better

@jreback jreback added the Performance Memory or execution speed performance label May 16, 2017
@jreback jreback added this to the 0.20.2 milestone May 16, 2017
@codecov
Copy link

codecov bot commented May 16, 2017

Codecov Report

Merging #16364 into master will decrease coverage by 0.01%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16364      +/-   ##
==========================================
- Coverage   90.38%   90.36%   -0.02%     
==========================================
  Files         161      161              
  Lines       50916    50933      +17     
==========================================
+ Hits        46021    46028       +7     
- Misses       4895     4905      +10
Flag Coverage Δ
#multiple 88.14% <94.11%> (ø) ⬆️
#single 40.21% <5.88%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 91.96% <94.11%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.68% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d92f06a...62843f8. Read the comment docs.

@jreback jreback merged commit 42e2a87 into pandas-dev:master May 16, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request May 16, 2017
Additional test cases for pandas-dev#16364
when upper and / or lower is nan.
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a slight change in behaviour in that the new implementation does not preserve heterogeneous data types (eg int/float).
Not that this should hold back these perf improvements, but we might consider to keep it for 0.21 (this is also no regression)

result = self.values
mask = isnull(result)
if upper is not None:
result = np.where(result >= upper, upper, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a with np.errstate, as we are working with raw array

In [8]: pd.Series([0, np.nan, 2]).clip(0, 1)
/home/joris/scipy/pandas/pandas/core/generic.py:4117: RuntimeWarning: invalid value encountered in greater_equal
  result = np.where(result >= upper, upper, result)
/home/joris/scipy/pandas/pandas/core/generic.py:4119: RuntimeWarning: invalid value encountered in less_equal
  result = np.where(result <= lower, lower, result)
Out[8]: 
0    0.0
1    NaN
2    1.0
dtype: float64

def _clip_with_scalar(self, lower, upper):

if ((lower is not None and np.any(isnull(lower))) or
(upper is not None and np.any(isnull(upper)))):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the np.any needed here? As lower/upper are already confirmed to be a scalar?

@jorisvandenbossche
Copy link
Member

new implementation does not preserve heterogeneous data types

In principle could add a check for that (at the if statement to decide to take this path or not), but not sure that is worth it ..

@jreback
Copy link
Contributor Author

jreback commented May 16, 2017

yeah this is a limitation of the current methodology

let me see what i can do

jreback pushed a commit that referenced this pull request May 16, 2017
Additional test cases for #16364
when upper and / or lower is nan.
jreback added a commit to jreback/pandas that referenced this pull request May 16, 2017
pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
Additional test cases for pandas-dev#16364
when upper and / or lower is nan.
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017
TomAugspurger pushed a commit that referenced this pull request May 30, 2017
closes #15400
(cherry picked from commit 42e2a87)
TomAugspurger pushed a commit that referenced this pull request May 30, 2017
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
Additional test cases for pandas-dev#16364
when upper and / or lower is nan.
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Very slow clip performance
3 participants