Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forthcoming dateutil code reorganization #18141

Closed
pganssle opened this issue Nov 6, 2017 · 28 comments
Closed

Forthcoming dateutil code reorganization #18141

pganssle opened this issue Nov 6, 2017 · 28 comments
Labels
Dependencies Required and optional dependencies

Comments

@pganssle
Copy link
Contributor

pganssle commented Nov 6, 2017

I know that pandas makes use of some of the private interfaces in dateutil (and I'm hoping to start adding features so that this is no longer necessary), so I thought I'd give you a heads up about a proposed changed to the organization of the parser module that will change the location of some of the private interfaces.

Basically parser.py will move to parser/_parser.py, and only the public parts will be imported into parser/__init__.py, as part of the introduction of a dedicated ISO-8601 parser. See PR dateutil/dateutil#489 for the current discussion.

@TomAugspurger
Copy link
Contributor

Thanks for the heads up! We test against dateutil master in one of our jobs (here), so we should be able to catch these and patch them in pandas quickly.

Do you have a rough plan for the next dateutil release?

@TomAugspurger TomAugspurger added the Dependencies Required and optional dependencies label Nov 6, 2017
@pganssle
Copy link
Contributor Author

pganssle commented Nov 6, 2017

@TomAugspurger I'm trying to finalize some things, I want to get isoparse in there and possibly a few improvements to the time zones. I am hoping a new feature release before the end of the year. I usually finalize a release on master and let it sit for a week or so, I can ping you when that happens so you'll have some notice.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 6, 2017 via email

@jbrockmendel
Copy link
Member

@pganssle I'm -1 on rushing the isoparse. The best-case scenario is that the dateutil implementation can drop-in replace the _string_to_dts pandas currently uses for parsing these formats. Since the current pandas implementation isn't fully spec compliant (as you mentioned in an issue somewhere), it isn't obvious how easy (or feasible) this will be. So yah, advocating patience and thorough in-house testing/profiling.

As for pandas accessing dateutil internals, there are a couple of workarounds pandas uses that we can probably make unnecessary: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/parsing.pyx#L207, https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/parsing.pyx#L315

Note the tail end of tslibs.parsing.dateutil_parse is almost identical to _build_tzinfo from dateutil#493.

@pganssle
Copy link
Contributor Author

pganssle commented Nov 6, 2017

I'm not really planning on "rushing" the isoparse implementation, but at the end of the day because it's spec-driven, a lot of the behavior questions are actually made ahead of time (assuming we stick to the spec). The main question then comes down to the public API, and decisions about that can always be pushed off to later by releasing a highly constrained version (e.g. the only supported interface is isoparse(dt_str) - which we know we want to support anyway).

In any case, I'd like some form of it to be in dateutil 2.7.0.

@pganssle
Copy link
Contributor Author

pganssle commented Nov 6, 2017

@jbrockmendel I would also guess that a lot of this stuff will have to be in conditional import logic or something, unfortunately. I suspect it could cause a decent number of problems if pandas bumped their dependency to the latest version of dateutil, since that could cause inconsistencies in pinned versions.

@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

@pganssle if you are moving things to private methods please deprecate first; simply moving will break lots of outstanding installs.

@pganssle
Copy link
Contributor Author

pganssle commented Nov 6, 2017

@jreback I'm not moving things to private methods, I'm moving private functions/classes (i.e. in the physical layout of the code). I've pulled out the reorganization portion of the isoparser PR into dateutil/dateutil#501 so it's easier to see what's happening.

Basically, if you're calling dateutil.parser._tzparser or something, it will have moved, but even stuff that is not in dateutil.parser.__all__ will all still be importable from dateutil.parser if it doesn't start with _

@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

what i am saying is if you are changing your publicly exposed functions you should deprecate

@jbrockmendel
Copy link
Member

@jreback pandas should not be affected by the changes pganssle is discussing, with the exception of the private import of _timelex in tslibs.parsing. I'll make sure to patch that try/except import when appropriate.

@jbrockmendel
Copy link
Member

I'm not really planning on "rushing" the isoparse implementation

@pganssle it probably only seems rushed to me because I haven't made time to look it over closely. Will do so this evening.

@jreback
Copy link
Contributor

jreback commented Nov 7, 2017

@pganssle we should patch this on the pandas side
but the issue is that the upstream needs to deprecate or completely break existing installations which is not nice

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 7, 2017 via email

@pganssle
Copy link
Contributor Author

pganssle commented Nov 7, 2017

I'm sure some people will break if they are relying on the private interface, that's a consequence of Python's philosophy of "nothing is really private", but I think there's a better than average chance that anyone who notices / pays attention to DeprecationWarning is also likely to not use private methods in the first place, and if they do use private methods, they're also likely to test against the master of their dependencies (as you guys are doing, presumably at least partially because you've been burned before).

Not saying things won't break out there, just saying that deprecation warnings (which aren't even on by default) probably won't mitigate this by much. In the short term people can pin their dependencies until they fix their fragile code. Worst case scenario, the library is BSD licensed, so there are no viral concerns from vendoring it.

That said, for some of the stuff where I'm actually planning on exposing it publicly anyway (_tzparser, _timelex), it might not be the worst thing in the world to go through a deprecation cycle. Still, the message is weird: "This private method has moved, you shouldn't be using it in the first place, but if you must, pick it up from somewhere else now."

@jreback
Copy link
Contributor

jreback commented Nov 7, 2017

@pganssle that is exactly what we did for lots of things in pandas over the last year
the entire code base was public not its much more private
my point is that you can do what you want but i would test against already released versions to see if there is breakage (then u can decide what to do)

expecting users to upgrade another package just to fix a misuse is not great

@jreback
Copy link
Contributor

jreback commented Nov 7, 2017

note that something is wrong in dateutil master already
see https://travis-ci.org/pandas-dev/pandas/jobs/298549012

we are probably using a private module and have been for quite some time

@pganssle
Copy link
Contributor Author

pganssle commented Nov 7, 2017

@jreback Well, it's not great in general that people count on private interfaces. I understand why it's done in this case (and in fact I'm working to remedy that by exposing some of these in a way that I'm comfortable supporting), but at the end of the day any such private interface calls will necessarily be fragile. They're private because they're implementation details, not to be relied upon.

In any case, my point was mostly that the situation is what it is. DeprecationWarning won't prevent you from needing to update pandas if you want the latest dateutil, and most people won't see them anyway, so it seems like a lot of work do deprecate things that people shouldn't be using anyway (for this very reason).

That said, I'll try to go the deprecation route with some of the bigger things.

@pganssle
Copy link
Contributor Author

pganssle commented Nov 7, 2017

@jreback
Copy link
Contributor

jreback commented Nov 8, 2017

@pganssle it’s not that i care about deprecation per se
more that it will just work for existing installations
iow what happens is people have a version of pandas say 0.21.0 and then they upgrade dateutil (or vice versa); would like for it just to work in a back compact way (iow we can easily try/except to achieve this); but if u can preserve the interface (private or not; it has been this way for a long time) for a version or 2 would be really helpful

further i would suggest bumping the major version number in any event

@jreback
Copy link
Contributor

jreback commented Nov 8, 2017

@pganssle

your current setup on master is not installing dateutil.parser, e.g.

#!/usr/bin/python
from os.path import isfile
import os

from setuptools import setup

from dateutil._version import VERSION

if isfile("MANIFEST"):
    os.unlink("MANIFEST")

setup(name="python-dateutil",
      version=VERSION,
      description="Extensions to the standard Python datetime module",
      author="Paul Ganssle",
      author_email="dateutil@python.org",
      url="https://dateutil.readthedocs.io",
      license="BSD 3-Clause",
      long_description="""
The dateutil module provides powerful extensions to the
datetime module available in the Python standard library.
""",
      packages=["dateutil", "dateutil.zoneinfo", "dateutil.tz"],
      package_data={"dateutil.zoneinfo": ["dateutil-zoneinfo.tar.gz"]},
      zip_safe=True,
      requires=["six"],
      install_requires=["six >=1.5"],  # XXX fix when packaging is sane again
      classifiers=[
          'Development Status :: 5 - Production/Stable',
          'Intended Audience :: Developers',
          'License :: OSI Approved :: BSD License',
          'Programming Language :: Python',
          'Programming Language :: Python :: 2',
          'Programming Language :: Python :: 2.7',
          'Programming Language :: Python :: 3',
          'Programming Language :: Python :: 3.2',
          'Programming Language :: Python :: 3.3',
          'Programming Language :: Python :: 3.4',
          'Programming Language :: Python :: 3.5',
          'Programming Language :: Python :: 3.6',
          'Topic :: Software Development :: Libraries',
      ],
      test_suite="dateutil.test"
      )
(

needs to be added here (or can use find_packages)

      packages=["dateutil", "dateutil.zoneinfo", "dateutil.tz"],

@pganssle
Copy link
Contributor Author

pganssle commented Nov 8, 2017

Thanks, good catch. I should probably set up the tests so that they install and run the installed version rather than running straight from the repo, to avoid this sort of problem.

@jreback
Copy link
Contributor

jreback commented Nov 8, 2017

yep, we do this explicity as a travis test: https://github.com/pandas-dev/pandas/blob/master/ci/install_travis.sh#L158

@jreback
Copy link
Contributor

jreback commented Nov 10, 2017

so I just merged #18172 which restores our build where we use dateutil from master (fixed by @pganssle in dateutil/dateutil#507)

We have a couple of failures. These must be some changes in dateutil as we pass vs 2.6.1. These generally look ok, and we should simply update our tests conditional on the dateutil version.

https://travis-ci.org/pandas-dev/pandas/jobs/299303619

_________ TestGuessDatetimeFormat.test_guess_datetime_format_for_array _________
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.indexes.datetimes.test_tools.TestGuessDatetimeFormat object at 0x7f81e01c6240>
    def test_guess_datetime_format_for_array(self):
        tm._skip_if_not_us_locale()
        expected_format = '%Y-%m-%d %H:%M:%S.%f'
        dt_string = datetime(2011, 12, 30, 0, 0, 0).strftime(expected_format)
    
        test_arrays = [
            np.array([dt_string, dt_string, dt_string], dtype='O'),
            np.array([np.nan, np.nan, dt_string], dtype='O'),
            np.array([dt_string, 'random_string'], dtype='O'),
        ]
    
        for test_array in test_arrays:
>           assert tools._guess_datetime_format_for_array(
                test_array) == expected_format
E           AssertionError: assert None == '%Y-%m-%d %H:%M:%S.%f'
E            +  where None = <function _guess_datetime_format_for_array at 0x7f81919e3ea0>(array(['2011-12-30 00:00:00.000000', '2011-12-30 00:00:00.000000',\n       '2011-12-30 00:00:00.000000'], dtype=object))
E            +    where <function _guess_datetime_format_for_array at 0x7f81919e3ea0> = tools._guess_datetime_format_for_array
pandas/tests/indexes/datetimes/test_tools.py:912: AssertionError
__ TestGuessDatetimeFormat.test_guess_datetime_format_with_parseable_formats ___
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.scalar.test_parsing.TestGuessDatetimeFormat object at 0x7f81895c9470>
    def test_guess_datetime_format_with_parseable_formats(self):
        tm._skip_if_not_us_locale()
        dt_string_to_format = (('20111230', '%Y%m%d'),
                               ('2011-12-30', '%Y-%m-%d'),
                               ('30-12-2011', '%d-%m-%Y'),
                               ('2011-12-30 00:00:00', '%Y-%m-%d %H:%M:%S'),
                               ('2011-12-30T00:00:00', '%Y-%m-%dT%H:%M:%S'),
                               ('2011-12-30 00:00:00.000000',
                                '%Y-%m-%d %H:%M:%S.%f'), )
    
        for dt_string, dt_format in dt_string_to_format:
>           assert parsing._guess_datetime_format(dt_string) == dt_format
E           AssertionError: assert None == '%Y%m%d'
E            +  where None = <built-in function _guess_datetime_format>('20111230')
E            +    where <built-in function _guess_datetime_format> = parsing._guess_datetime_format
pandas/tests/scalar/test_parsing.py:81: AssertionError
_______ TestGuessDatetimeFormat.test_guess_datetime_format_with_dayfirst _______
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.scalar.test_parsing.TestGuessDatetimeFormat object at 0x7f8188b1a940>
    def test_guess_datetime_format_with_dayfirst(self):
        ambiguous_string = '01/01/2011'
>       assert parsing._guess_datetime_format(
            ambiguous_string, dayfirst=True) == '%d/%m/%Y'
E       AssertionError: assert None == '%d/%m/%Y'
E        +  where None = <built-in function _guess_datetime_format>('01/01/2011', dayfirst=True)
E        +    where <built-in function _guess_datetime_format> = parsing._guess_datetime_format
pandas/tests/scalar/test_parsing.py:85: AssertionError
_________ TestGuessDatetimeFormat.test_guess_datetime_format_nopadding _________
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.scalar.test_parsing.TestGuessDatetimeFormat object at 0x7f81dae48828>
    def test_guess_datetime_format_nopadding(self):
        # GH 11142
        dt_string_to_format = (('2011-1-1', '%Y-%m-%d'),
                               ('30-1-2011', '%d-%m-%Y'),
                               ('1/1/2011', '%m/%d/%Y'),
                               ('2011-1-1 00:00:00', '%Y-%m-%d %H:%M:%S'),
                               ('2011-1-1 0:0:0', '%Y-%m-%d %H:%M:%S'),
                               ('2011-1-3T00:00:0', '%Y-%m-%dT%H:%M:%S'))
    
        for dt_string, dt_format in dt_string_to_format:
>           assert parsing._guess_datetime_format(dt_string) == dt_format
E           AssertionError: assert None == '%Y-%m-%d'
E            +  where None = <built-in function _guess_datetime_format>('2011-1-1')
E            +    where <built-in function _guess_datetime_format> = parsing._guess_datetime_format

@jreback
Copy link
Contributor

jreback commented Nov 16, 2017

https://travis-ci.org/pandas-dev/pandas/jobs/302975078

looks like more changes in master on dateutil, including deprecation of _timelex.

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Dec 8, 2017
TomAugspurger pushed a commit that referenced this issue Dec 11, 2017
@jreback
Copy link
Contributor

jreback commented Mar 26, 2018

@pganssle is there a direct replacement for _timelex, IOW taking a string and giving me back the tokens (rather than a datetime)? trying to move off this deprecated function (and its ok if its not back-compat).

@pganssle
Copy link
Contributor Author

pganssle commented Mar 26, 2018

@jreback It's not that it's deprecated so much as it was always private. It is an implementation detail of the public interface, which is the auto-magical parse function, so there is no supported replacement for it.

Before the end of the year, I'd like to make a pretty significant refactor to parser.parse (end of the year being the arbitrary time at which I'm going to switch to an LTS branch and a 3.x-only branch). I and @jbrockmendel both are very keen on exposing some of the actual details of parse's auto-magical action for people not content to just consume anything that looks like a date, and part of that effort will probably be a public lexer/tokenizer.

Until then, I think the best thing to do would be to just vendor _timelex somewhere in your source tree. Presumably if you put it in a .pyx file, that would also make the lexing part a bit faster. Once the parser refactor is done you can start using the public interface for that if it suits your needs better - I'll ping pandas' issue tracker when that happens (though I'm sure @jbrockmendel will be on it since he's very keen on all this parser stuff and also on pandas).

@jreback jreback modified the milestones: 0.23.0, 0.24.0 Apr 14, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Oct 23, 2018
@jbrockmendel
Copy link
Member

@pganssle is there still upcominig reorganization we need to watch out for here, or can this be closed?

@pganssle
Copy link
Contributor Author

@jbrockmendel I think this happened some time ago already (the parser reorg), plus pandas tests against dateutil master now, so as long as pandas has done everything it needs to do to adapt to the new code organization, feel free to close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dependencies Required and optional dependencies
Projects
None yet
Development

No branches or pull requests

4 participants