
Delay import #17710

Merged (14 commits) into pandas-dev:master on Oct 2, 2017

Conversation

TomAugspurger
Contributor

Closes #16764

Improves performance by delaying the import of matplotlib, s3fs, pytest, and openpyxl.

Also removes our old MPL style sheet. The option was deprecated and removed, but the stylesheet hung around.
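
For context, a minimal sketch of the deferred-import pattern this PR applies (the function name and body are illustrative, not the actual diff): the heavy dependency is imported inside the function that needs it, so importing pandas itself no longer pays the cost.

def plot_values(values):
    # matplotlib is imported lazily, on first use, instead of at
    # "import pandas" time.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(values)
    return ax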

Master:

In [1]: %time import numpy
CPU times: user 36.3 ms, sys: 13.2 ms, total: 49.5 ms
Wall time: 68.4 ms

In [2]: %time import pandas
CPU times: user 306 ms, sys: 52.2 ms, total: 358 ms
Wall time: 392 ms

Head (this PR):

In [1]: %time import numpy
CPU times: user 37.7 ms, sys: 12.4 ms, total: 50 ms
Wall time: 69 ms

In [2]: %time import pandas
CPU times: user 166 ms, sys: 40.9 ms, total: 207 ms
Wall time: 245 ms

I can shave off more by hardcoding pandas.__version__. Maybe that's worthwhile for releases?

In [2]: %time import pandas
CPU times: user 133 ms, sys: 30 ms, total: 163 ms
Wall time: 173 ms

@TomAugspurger
Contributor Author

I may have messed up some of the Excel config stuff... I think all the defaults are still the same, and the tests pass, but it's a bit messy. AFAICT the defaults are (a sketch of the fallback logic follows the list):

  • xls: xlwt (even if not installed)
  • xlsm: openpyxl (even if not installed)
  • xlsx: xlsxwriter if installed, else openpyxl
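
A rough sketch of that fallback logic (a hypothetical helper mirroring the defaults listed above, not the actual pandas config code):

import importlib.util

def default_excel_writer(ext):
    # Hypothetical helper; mirrors the defaults described in the list above.
    if ext == "xls":
        return "xlwt"  # default even if xlwt is not installed
    if ext == "xlsm":
        return "openpyxl"  # default even if openpyxl is not installed
    if ext == "xlsx":
        # prefer xlsxwriter when available, otherwise fall back to openpyxl
        if importlib.util.find_spec("xlsxwriter") is not None:
            return "xlsxwriter"
        return "openpyxl"
    raise ValueError("unknown Excel extension: %r" % ext)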

@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Sep 28, 2017
@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Sep 28, 2017
@TomAugspurger
Contributor Author

TomAugspurger commented Sep 28, 2017

The next big one would be avoiding numexpr (~21 ms), since it imports pkg_resources, which takes a while. That got a little messy though.

@TomAugspurger
Contributor Author

I'm thinking about tests for this. We should be able to whip something up with ModuleFinder

@max-sixty
Contributor

I can shave off more by hardcoding pandas.__version__. Maybe that's worthwhile for releases?

I don't know versioneer in detail - at first glance it looks more feature-rich. But we've had success with setuptools_scm, and it can write a file on install (https://github.com/pypa/setuptools_scm#configuration-parameters)

@jreback
Contributor

jreback commented Sep 28, 2017

I can shave off more by hardcoding pandas.__version__. Maybe that's worthwhile for releases?

NO, this is exactly the purpose of versioneer: to avoid the hardcoding mess we had over the years.

@jreback
Contributor

jreback commented Sep 28, 2017

The next big one would be avoiding numexpr (~21 ms), since it imports pkg_resources, which takes a while. That got a little messy though.

this could certainly be deferred till first use

@TomAugspurger
Contributor Author

But we've had success with setuptools_scm, and it can write a file on install

I have as well. It seems to be the "recommended" way to do automatic versioning.

NO, this is exactly the purpose of versioneer

To be clear, this would just be done for releases, as part of making the release commit. Immediately after tagging, the release manager would add a commit going back to versioneer.

this could certainly be deferred till first use

I'm trying it out now. It's not too bad. Will push a commit in a bit.
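
Roughly, the kind of deferral being discussed looks like this (a sketch assuming a module-level cache, not the committed patch):

_NUMEXPR_INSTALLED = None  # resolved lazily, on first use

def numexpr_available():
    # Sketch: the numexpr (and pkg_resources) import cost is only paid
    # the first time an expression is actually evaluated.
    global _NUMEXPR_INSTALLED
    if _NUMEXPR_INSTALLED is None:
        try:
            import numexpr  # noqa: F401
            _NUMEXPR_INSTALLED = True
        except ImportError:
            _NUMEXPR_INSTALLED = False
    return _NUMEXPR_INSTALLED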

@codecov

codecov bot commented Sep 28, 2017

Codecov Report

Merging #17710 into master will decrease coverage by 0.04%.
The diff coverage is 72.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17710      +/-   ##
==========================================
- Coverage   91.25%    91.2%   -0.05%     
==========================================
  Files         163      163              
  Lines       49823    49831       +8     
==========================================
- Hits        45464    45450      -14     
- Misses       4359     4381      +22
Flag        Coverage Δ
#multiple   89%      <72.28%>  (-0.03%)  ⬇️
#single     40.26%   <39.75%>  (-0.13%)  ⬇️
Impacted Files                            Coverage Δ
pandas/util/testing.py                    100%     <ø>     (ø)        ⬆️
pandas/plotting/_style.py                 74.28%   <ø>     (-0.25%)   ⬇️
pandas/core/computation/expressions.py    0%       <ø>     (ø)        ⬆️
pandas/core/frame.py                      97.73%   <100%>  (-0.1%)    ⬇️
pandas/core/config_init.py                98.34%   <100%>  (+2.22%)   ⬆️
pandas/core/panel.py                      96.91%   <100%>  (ø)        ⬆️
pandas/core/internals.py                  94.37%   <100%>  (ø)        ⬆️
pandas/core/computation/eval.py           97.02%   <100%>  (+0.06%)   ⬆️
pandas/io/common.py                       68.64%   <60%>   (ø)        ⬆️
pandas/core/ops.py                        91.76%   <75%>   (-0.13%)   ⬇️
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2781b18...9d0f74a.

@chris-b1
Contributor

chris-b1 commented Sep 28, 2017

Thanks @TomAugspurger! I'll look at the Excel stuff later; I had a half-finished branch where I think I was fighting the same config stuff you are.

I'm not sure you need to do anything with the version stuff? It is very slow on dev versions, but doesn't it get effectively hardcoded in releases already?

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\_version.py

# This file was generated by 'versioneer.py' (0.15) from
# revision-control system data, or from the parent directory name of an
# unpacked source archive. Distribution tarballs contain a pre-generated copy
# of this file.

from warnings import catch_warnings
with catch_warnings(record=True):
    import json
import sys

version_json = '''
{
 "dirty": false,
 "error": null,
 "full-revisionid": "3a7f956c30528736beaae5784f509a76d892e229",
 "version": "0.20.3"
}
'''  # END VERSION_JSON


def get_versions():
    return json.loads(version_json)

@TomAugspurger
Contributor Author

TomAugspurger commented Sep 28, 2017

I'm not sure you need to do anything with the version stuff? It is very slow on dev versions, but doesn't it get effectively hardcoded in releases already?

Oh cool. I wasn't aware of that. With numexpr delayed:

In [1]: %time import numpy
CPU times: user 35.4 ms, sys: 11.1 ms, total: 46.5 ms
Wall time: 66.7 ms

In [2]: %time import pandas
CPU times: user 125 ms, sys: 24.2 ms, total: 150 ms
Wall time: 175 ms

Down from ~240ms

@pep8speaks

pep8speaks commented Sep 28, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 02, 2017 at 11:38 Hours UTC

@TomAugspurger
Contributor Author

I added a simple script in https://github.com/pandas-dev/pandas/pull/17710/files#diff-b3b48041ab2d614f95206404de64cdfe to check that importing pandas does not trigger imports of certain modules. This will run on Travis.
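
Conceptually, such a check can import pandas in a fresh interpreter and assert that none of the blacklisted modules ended up in sys.modules (a sketch, not the exact script added in the PR):

import subprocess
import sys

# Modules that importing pandas should not pull in (illustrative list).
BLACKLIST = ["matplotlib", "s3fs", "openpyxl", "pytest"]

code = (
    "import sys; import pandas; "
    "bad = [m for m in {!r} if m in sys.modules]; "
    "sys.exit(repr(bad) if bad else 0)"
).format(BLACKLIST)

# Run in a fresh interpreter so imports from this process don't leak in.
result = subprocess.run([sys.executable, "-c", code])
sys.exit(result.returncode)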

@@ -4,12 +4,6 @@

# flake8: noqa

try:  # mpl optional
    from pandas.plotting import _converter
    _converter.register()  # needs to override so set_xlim works with str/number
Member
@jorisvandenbossche Sep 28, 2017

One problem with this is that I think it will affect users who just use matplotlib for plotting but have pandas imported (at the moment they get nicer datetime formatting just from importing pandas).

Not sure this is a reason not to do this, though, as exactly this side effect has also come up before as a problem. But it would change the moment when the side effect is activated (on pandas import vs. on the first pandas plot), rather than really solving it.

Contributor Author

Yep. I was about to make a PR to update their documentation. I think this is unavoidable.

Contributor Author

I guess we could make an option, and if users really wanted they could set something in their ipython config. But I'm tempted to wait.

Member

In #16764 (comment) you had a patch with a global registered flag. Not sure how important it is to avoid registering again on each plotting call.

In, e.g., a case where a user manually removed the pandas entries from the registry, this could help.

Contributor Author

I still have that flag: https://github.com/pandas-dev/pandas/pull/17710/files#diff-2b118bda866f3d626ceb6b529c62cd1aR46

Is the desired behavior that we should only attempt to register once? So if a user removes the pandas converters, they don't get re-added on each call? That's what happens currently.
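
For reference, that register-once behavior boils down to a module-level flag, roughly like this (a sketch of the flag being discussed, not the code in the diff):

_registered = False

def register_converters():
    # Only attempt registration once per session, so converters a user
    # has deliberately removed are not silently re-added on every plot.
    global _registered
    if _registered:
        return
    from pandas.plotting import _converter  # the module touched in this PR
    _converter.register()
    _registered = True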

Member

Ah, I must have overlooked that in the diff.

Do we need to expose a user-oriented function to register the converters? (e.g. pandas.plotting.register_converters())

Member

Ah, I see that the matplotlib docs already use the one exposed through pandas.tseries.converter (https://github.com/matplotlib/matplotlib/pull/9251/files), although the location has changed.

@jreback
Contributor

jreback commented Sep 30, 2017

lgtm

maybe run some asv benchmarks (or a subset) to make sure nothing changes substantially

Member
@jorisvandenbossche left a comment

This is really nice!

Contributor
@chris-b1 left a comment

lgtm

@@ -121,6 +121,8 @@ script:
- ci/script_single.sh
- ci/script_multi.sh
- ci/lint.sh
- echo "checking imports"
Contributor

maybe make a checking_imports.sh, just to keep the style similar to the existing scripts

_NUMEXPR_INSTALLED = ver >= LooseVersion(_MIN_NUMEXPR_VERSION)

if not _NUMEXPR_INSTALLED:
    warnings.warn(
Contributor

we do the same type of thing with bottleneck

def na_op(x, y):
    import pandas.core.computation.expressions as expressions
Contributor

if these imports become an issue, I can do a global check (like you do with matplotlib)

@jreback
Contributor

jreback commented Oct 2, 2017

@TomAugspurger this needs a rebase; I had merged something else.

@jreback
Contributor

jreback commented Oct 2, 2017

@TomAugspurger lgtm. thanks!

@jreback jreback merged commit 2310faa into pandas-dev:master Oct 2, 2017
TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Oct 3, 2017
TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Oct 5, 2017
TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Oct 5, 2017
TomAugspurger added a commit to dask/dask that referenced this pull request Oct 5, 2017
* COMPAT: pandas TimeGrouper

xref pandas-dev/pandas#16747

* COMPAT: For pandas 0.21 CategoricalDtype

* COMPAT: For pandas 0.21.0 HTML repr changes

pandas-dev/pandas#16879

* COMPAT: For pandas 0.21.0 numexpr import

pandas-dev/pandas#17710

* COMPAT: Default for inplace is now False in eval

pandas-dev/pandas#11149
@TomAugspurger TomAugspurger deleted the delay-import branch October 16, 2017 14:09
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
@gerritholl

This PR includes commit 2310faa, which unfortunately triggers a problem in xarray plotting (pydata/xarray#1661). I don't know nearly enough about pandas to know what the right approach is to mitigate or resolve this.
