PERF: improve performance of NDFrame.describe #21274

@DataOmbudsman
Contributor

commented May 31, 2018

A one-line change that makes describe compute its percentiles more efficiently. The point is that computing all percentiles in a single pass is faster than computing each one separately.

With the default percentiles argument, describe becomes 25-30% faster than before for numeric Series and DataFrames.
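
The diff itself is not shown in this thread, so the following is only a sketch of the idea. It assumes the change amounts to passing the whole list of percentiles to a single `quantile` call instead of calling `quantile` once per percentile; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
s = pd.Series(np.random.randint(0, 100, 100000))
percentiles = [0.25, 0.5, 0.75]  # the defaults used by describe

# Separate passes: one quantile call per percentile.
separate = [s.quantile(p) for p in percentiles]

# Single pass: one quantile call for all percentiles at once.
one_pass = s.quantile(percentiles)

# Both approaches yield the same values; only the amount of work differs.
print(np.allclose(one_pass.values, separate))
```

The single call returns a Series indexed by the requested percentiles, so the results can be used without reassembling them.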

Setup

import timeit

setup = '''
import numpy as np
import pandas as pd
np.random.seed(123)
s = pd.Series(np.random.randint(0, 100, 1000000))
'''

Benchmark

min(timeit.Timer('s.describe()', setup=setup).repeat(100, 1))

Results

On master:

0.06349272100487724

With this change:

0.04745814300258644

Results are similar for DataFrames.
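
As a quick check of the claimed speedup, the two timings above can be turned into a ratio and a percentage. This is plain arithmetic on the numbers already posted; the variable names are illustrative.

```python
before = 0.06349272100487724  # best of 100 runs on master
after = 0.04745814300258644   # best of 100 runs with this change

ratio = after / before
reduction_pct = (1 - ratio) * 100
print(f"ratio: {ratio:.2f}, time reduction: {reduction_pct:.1f}%")
```

The reduction lands at the lower end of the 25-30% range stated in the description.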

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@WillAyd

Member

commented May 31, 2018

Typically for performance-related changes we look for an ASV to measure and track over time. Can you add one to asv_bench/benchmarks/frame_methods.py and post the results of the benchmark here?

@WillAyd WillAyd added the Performance label May 31, 2018

@codecov


commented May 31, 2018

Codecov Report

Merging #21274 into master will increase coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #21274      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49549       +3     
==========================================
+ Hits        45509    45512       +3     
  Misses       4037     4037

Flag         Coverage Δ
#multiple    90.25% <ø> (ø) ⬆️
#single      41.87% <ø> (ø) ⬆️

Impacted Files                     Coverage Δ
pandas/core/generic.py             96.12% <ø> (ø) ⬆️
pandas/io/formats/csvs.py          98.14% <0%> (+0.01%) ⬆️
pandas/core/indexes/interval.py    93.16% <0%> (+0.02%) ⬆️
pandas/core/indexes/interval.py 93.16% <0%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbec58e...6dda68e. Read the comment docs.

@DataOmbudsman

Contributor Author

commented Jun 1, 2018

Sure. Thanks for the suggestion. Here are my ASV benchmarks. These also show the improvement.

Setup

import numpy as np
from pandas import DataFrame


class Describe(object):

    goal_time = 0.2

    def setup(self):
        # Three integer columns of one million rows each.
        np.random.seed(123)
        self.df = DataFrame({
            'a': np.random.randint(0, 100, int(1e6)),
            'b': np.random.randint(0, 100, int(1e6)),
            'c': np.random.randint(0, 100, int(1e6)),
        })

    def time_series_describe(self):
        self.df['a'].describe()

    def time_dataframe_describe(self):
        self.df.describe()

Results

before      after      ratio   benchmark
689±10ms    495±6ms    0.72    frame_methods.Describe.time_dataframe_describe
234±9ms     166±6ms    0.71    frame_methods.Describe.time_series_describe
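
The ratio column can be reproduced from the mean timings in the table. This is a small sketch that ignores the ± spread; the dictionary layout is just for illustration.

```python
# (before_ms, after_ms, reported_ratio) copied from the ASV table above
results = {
    "time_dataframe_describe": (689, 495, 0.72),
    "time_series_describe": (234, 166, 0.71),
}

ratios = {}
for name, (before_ms, after_ms, reported) in results.items():
    ratios[name] = after_ms / before_ms
    print(f"{name}: ratio {ratios[name]:.2f} (reported {reported})")
```

Both ratios correspond to a run time reduction of a little under 30%, in line with the timeit measurements in the description.
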
@WillAyd

Member

commented Jun 2, 2018

OK thanks. Can you update your PR to include the benchmark and a whatsnew note for 0.24?

DataOmbudsman added some commits May 31, 2018

PERF: improve performance of NDFrame.describe
Calculating percentiles in one pass is faster than separately.

@DataOmbudsman DataOmbudsman force-pushed the DataOmbudsman:improve-ndframe-describe-performance branch from 724f30e to 70668a1 Jun 4, 2018

@pep8speaks


commented Jun 4, 2018

Hello @DataOmbudsman! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 05, 2018 at 12:40 Hours UTC

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jun 4, 2018

@WillAyd

WillAyd approved these changes Jun 4, 2018

    goal_time = 0.2

    def setup(self):
        np.random.seed(123)

@mroeschke

mroeschke Jun 4, 2018

Member

You can remove the random seed; this is handled when setup is imported at the top (from .pandas_vb_common import setup)
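
A sketch of what the benchmark might look like after following this suggestion. The module-level `setup` function below is a hypothetical stand-in for the one imported from `pandas_vb_common`; its body and the seed value are assumptions made so the example runs on its own, not the actual helper.

```python
import numpy as np
from pandas import DataFrame


def setup(*args, **kwargs):
    # Hypothetical stand-in for `from .pandas_vb_common import setup`:
    # assumed to seed NumPy's global random state (seed value is arbitrary).
    np.random.seed(1234)


class Describe(object):

    goal_time = 0.2

    def setup(self):
        # No explicit seed here; the module-level setup is assumed to handle it.
        self.df = DataFrame({
            'a': np.random.randint(0, 100, int(1e6)),
            'b': np.random.randint(0, 100, int(1e6)),
            'c': np.random.randint(0, 100, int(1e6)),
        })

    def time_series_describe(self):
        self.df['a'].describe()

    def time_dataframe_describe(self):
        self.df.describe()
```

Keeping the seed in one shared place means every benchmark in the module draws reproducible data without repeating the call in each class.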

@@ -63,8 +63,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

-
-
- Improved performance of :func:`Series.describe` in case of numeric dtypes

@jreback

jreback Jun 5, 2018

Contributor

Can you add the issue number? (Use this PR's number, since we don't have an issue.)

@DataOmbudsman

DataOmbudsman Jun 5, 2018

Author Contributor

OK but I'm unsure about what format is expected. Do you think a link to an external URL (such as here) would be appropriate? E.g., `pull request #21274 <https://github.com/pandas-dev/pandas/pull/21274/>`_. Or something else?

@jreback

jreback Jun 5, 2018

Contributor

same format as all the others, just use :issue:`number`

@DataOmbudsman

DataOmbudsman Jun 5, 2018

Author Contributor

I see now that the URL of the issue is translated to the URL of the PR. That's great.
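
To illustrate why this works, here is a minimal sketch of how an `:issue:` role could expand a number into a link. The URL template is an assumption modeled on an extlinks-style mapping, and `issue_url` is a hypothetical helper, not part of the docs build. GitHub redirects an issues URL to the pull request page when the number belongs to a PR.

```python
# Assumed template; the real value lives in the documentation configuration.
ISSUE_URL_TEMPLATE = "https://github.com/pandas-dev/pandas/issues/%s"


def issue_url(number):
    """Expand an issue or PR number into the URL the role would link to."""
    return ISSUE_URL_TEMPLATE % number


print(issue_url(21274))
# prints https://github.com/pandas-dev/pandas/issues/21274
```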

@jorisvandenbossche jorisvandenbossche merged commit 7dc6f70 into pandas-dev:master Jun 5, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@jorisvandenbossche

Member

commented Jun 5, 2018

@DataOmbudsman Thanks!

@DataOmbudsman DataOmbudsman deleted the DataOmbudsman:improve-ndframe-describe-performance branch Jun 8, 2018

david-liu-brattle-1 added a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
