Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.Timestamp constructor ignores missing arguments #31930

Open
pganssle opened this issue Feb 12, 2020 · 10 comments
Open

pd.Timestamp constructor ignores missing arguments #31930

pganssle opened this issue Feb 12, 2020 · 10 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas Timeseries

Comments

@pganssle
Copy link
Contributor

pganssle commented Feb 12, 2020

As part of the discussions in #31563, I came across these strange semantics in pd.Timestamp, where it is apparently legal to over-specify a pd.Timestamp by specifying both a datetime (or another Timestamp) and pass the by-component construction values, and any irrelevant arguments are ignored:

>>> pd.Timestamp(datetime(2020, 12, 31),
                 year=1, month=1, day=1,
                 hour=23, minute=59, second=59, microsecond=999999)
Timestamp('2020-12-31 00:00:00')

The signature for the function is:

pd.Timestamp(
    ts_input=<object object at 0x7fd988a10760>,
    freq=None,
    tz=None,
    unit=None,
    year=None,
    month=None,
    day=None,
    hour=None,
    minute=None,
    second=None,
    microsecond=None,
    nanosecond=None,
    tzinfo=None,
)

There's actually a decent amount of redundant information in there, because pd.Timestamp is attempting to have its own constructor *in addition to being constructable like a datetime. Properly, there are two overloaded constructors here (note that I'm not sure if nanosecond belongs to both or just one):

pd.Timestamp(ts_input, freq, tz, unit[, nanosecond?])
pd.Timestamp(year, month, day
             [, hour, minute, second, microsecond, nanosecond, tzinfo])

I think that ideally the correct behavior would be to throw an error if you mix and match between the two, which is at least done in the case of specifying both tz and tzinfo:

>>> pd.Timestamp(datetime.now(), tz=timezone.utc, tzinfo=timezone.utc)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-d80c9ce6a89d> in <module>
----> 1 pd.Timestamp(datetime.now(), tz=timezone.utc, tzinfo=timezone.utc)

pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

ValueError: Can provide at most one of tz, tzinfo

Though confusingly this also fails if you specify tzinfo at all in the "by-component" constructor. I have filed a separate bug for that at #31929.

Recommendation

I think that the behavior of pandas.Timestamp should probably be brought at least mostly in-line with the concept of two overloaded constructors (possibly with tz and tzinfo being mutually-exclusive aliases for one another). Any other combination, particularly combinations where the values passed are ignored, should raise an exception.

This may be a breaking change, since it will start raising exceptions in code that didn't raise exceptions before (though I am not sure I can think of any situation where silently ignoring the values is a desirable condition), so it may be a good idea to have a deprecation period where a warning rather than an exception is raised.

@mroeschke
Copy link
Member

Definite +1.

Moreover I think the Timestamp structure should be more strict and only accept the following combination of inputs and raise otherwise if there are mixing of parameters:

# datetime string: (ts_input: str)
Timestamp('2020-01-01')
# int (epoch) + unit (ts_input: int/float + unit: str)
Timestamp(1513393355, unit='s')
# datetime object (ts_input: datetime object)
Timestamp(datetime.datetime(2020, 1, 1))
# datetime constructor like (positional & kwargs)
Timestamp(year=2020, month=1, day=1, hour=1, ..., nanosecond=1, tzinfo=timezone.utc, fold=1)

I intentionally left out the following args that I think can be deprecated/removed:

  • freq: We already have an issue deprecating/removing this argument DEPR: remove freq attribute of Timestamp? #15146
  • tz: Timestamps already have tz_localize/tz_convert methods (with more options for edge case handling) for setting & converting the timezone information. Don't think it's critical that it needs to happen in the constructor.

@AlexKirko
Copy link
Member

AlexKirko commented Feb 13, 2020

The constructor argument-handling behavior is a problem.

I, however, disagree that constructing timezone-aware Timestamps should happen in two steps unless building from components. It would no doubt be much easier for us to support, but it looks like we are cutting functionality because refactoring is painful.

Also, we should consider that this is how tz_localize currently localizes:

            value = tz_localize_to_utc(np.array([self.value], dtype='i8'), tz,
                                       ambiguous=ambiguous,
                                       nonexistent=nonexistent)[0]
            return Timestamp(value, tz=tz, freq=self.freq)

It just converts to time since epoch and calls the constructor.

@mroeschke
Copy link
Member

I tend to be more in the camp of There should be one-- and preferably only one --obvious way to do it. "Zen of Python" philosophy, so since it's possible to localize with both tz=... in the Timestamp constructor and tz_localize method, I'd prefer to only have the tz_localize methodology since it supports more edge case handling.

Refactoring/transition is painful, but I think it's worth getting to a more straightforward state.

@AlexKirko
Copy link
Member

AlexKirko commented Feb 13, 2020

It's a matter of opinion, I suppose. I believe that if something is a valid object belonging to a class, we should be able to make this object with the constructor as easily as possible unless there is a very good reason not to have this functionality.

We could argue that the user can just parse whatever input they have into components themselves and then build with our constructor, but I don't think that's an excuse. Especially because it's very common (at least in my experience) to read dates as strings from text files and then build datetime-like from that, whether the result needs to be tz-aware or not.

A bit off-topic, but since we are quoting that principle from Zen of Python. Currently, replace and tz_localize do the same stuff very differently if you use them to turn a naive Timestamp into a tz-aware one. From my experience with working with the Timestamp constructor and methods that do what the constructor does, there are many issues like this.

If we'd like to refactor the missing arguments behavior, we could benefit from discussing PR scope in advance, otherwise we risk pulling the entire class into scope.

@jreback
Copy link
Contributor

jreback commented Feb 13, 2020

pls don’t massively refactor in this PR
i think reviewing a refactor and add fold PR would never be merged

if we want to merge this with a little more review and then potentially do a refactor (likely in multiple steps) then ok with that

we have quite a large suite for Timestamp so not refacators should be easily possible

@AlexKirko
Copy link
Member

@mroeschke @pganssle
I apologize if I sounded a bit paranoid: #31563 has made me wary of us getting bogged down, so I'd prefer if we broke up the Timestamp refactor as much as it is reasonable.

Now that I reread the discussion, looks like that's what you intend too.

@pganssle
Copy link
Contributor Author

@jreback I think you are responding on the wrong issue, this is not the "add fold" issue or PR.

@ArtyomKaltovich
Copy link

ArtyomKaltovich commented Mar 11, 2020

Hello all.

Just a small notification, that

print(pd.Timestamp(2019, 1, 1, tzinfo=pytz.timezone('EST5EDT'), tz=None))

doesn't work as workaround and produce an error

TypeError: __new__() got multiple values for keyword argument 'tz'

:)

@ArtyomKaltovich
Copy link

ArtyomKaltovich commented Mar 13, 2020

Hello all.

The reason of the issue is that tz was third positional argument, so it takes value instead of the day, to prevent this behavior tz was made key only attribute.

Please review the PR I've submitted.

Regards,
Artyom.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Timeseries labels Mar 15, 2020
@mroeschke mroeschke added the Bug label Apr 2, 2020
@drorata
Copy link
Contributor

drorata commented Apr 11, 2022

FWIW - with version 1.4.2 of pandas the following two options work nicely

ts1 = pd.Timestamp("2009-01-10", tz="UTC")
ts2 = pd.Timestamp(year=2009, month=1, day=10, tz='UTC')

and ts1 == ts2 returns True. However, trying to construct the same timestamp using positional values:

pd.Timestamp(2009,1, 10, tz='UTC')

raises an error:

TypeError                                 Traceback (most recent call last)
/var/folders/hc/m_f707sd7t3ghfvxn5013sfm0000gp/T/ipykernel_39707/770468094.py in <module>
----> 1 pd.Timestamp(2009,1, 10, tz='UTC')

pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

TypeError: __new__() got multiple values for keyword argument 'tz'

I'm totally not sure whether this issue is the relevant one, but I got here thanks to #31930 (comment) by @ArtyomKaltovich.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Error Reporting Incorrect or improved errors from pandas Timeseries
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants