Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand datetime support #220

Closed
jcrist opened this issue Dec 2, 2022 · 11 comments · Fixed by #225
Closed

Expand datetime support #220

jcrist opened this issue Dec 2, 2022 · 11 comments · Fixed by #225

Comments

@jcrist
Copy link
Owner

jcrist commented Dec 2, 2022

Currently msgspec supports encoding and decoding only timezone-aware datetime.datetime objects, holding strict conformance to RFC3339. Naive datetime.datetime objects can be encoded using a custom enc_hook, but there's no way to decode a naive datetime.datetime object.

I would like to expand our builtin support for datetime types to include:

  • datetime.datetime (both aware and naive)
  • datetime.date
  • datetime.time (both aware and naive)

Here's the plan I've come up with:

Encoding

To support encoding, we add a support_naive_datetimes keyword argument to msgspec.*.decode and msgspec.*.Decoder to configure the treatment of naive datetimes. This would take one of:

  • False: the default. Naive datetime and time objects error on encoding.
  • True: allow encoding naive datetime and time objects. These will be encoded as their RFC3339 compatible counterparts, just missing the offset component
  • "UTC": naive datetime and time objects will be treated as if they have a UTC timezone.
  • a tzinfo object: naive datetime and time objects will be treated as if they have this timezone.

I'm not attached to the keyword name (or boolean options), so if someone can think of a nicer spelling I'd be happy. I think this supports all the common options.

One benefit of supporting these options builtin is that we no longer have the weird behavior of enc_hook only being called for naive datetime.datetime objects. This would admittedly be less weird if Python had different types for aware and naive datetimes.

I could hear an argument that the default should be True (encoding naive datetimes/times by default), but I'm hesitant to make that change. Having an error by default if you're using a naive datetime will force users to think about timezones early on - if they really want a naive datetime they can explicitly opt into it. Supporting naive datetimes/times by default could let programming errors slip by, since most times the user does want an aware datetime rather than a naive datetime.

Decoding

To support decoding, we want to handle the following use cases:

  • Only decode RFC3339 compatible datetimes and times (requiring a timezone)
  • Only decode naive datetimes and times (require no timezone)
  • Decode any datetime or time object (naive or aware)

Since msgspec will only ever decode an object into a datetime if type information is provided, then the natural place to enable this configuration is through our existing type annotations system. The question then is - what does an unannotated datetime.datetime mean?

I want msgspec to make it easy to do the right thing, and (within reason) possible to do the flexible thing. As such, I'd argue that raw datetime.datetime and datetime.time annotations should only decode timezone-aware objects. This means that by default APIs built with msgspec are compatible with json-schema (which lacks a naive datetime/time format), and common web languages like golang (which requires RFC3339 compatible strings in JSON by default).

To support naive-datetime or any-datetime types, we'd add a new config to Meta annotations. Something like:

from msgspec import Struct, Meta
from typing import Annotated
import datetime

class Example(Struct):
    aware_only_datetime: datetime.datetime
    aware_only_time: datetime.time
    date: datetime.date  # date objects have no timezone
    naive_only_datetime: Annotated[datetime.datetime, Meta(timezone=False)]
    naive_only_time: Annotated[datetime.time, Meta(timezone=False)]
    any_datetime: Annotated[datetime.datetime, Meta(timezone=None)]
    any_time: Annotated[datetime.time, Meta(timezone=None)]

Like above, I don't love the timezone=True (aware), timezone=False (naive), timezone=None (aware or naive) syntax, if anyone can think of a better API spelling please let me know.

We could also add type aliases in a new submodule msgspec.types to make this easier to spell (since datetimes are common):

from msgspec import Struct
from msgspec.types import NaiveDatetime

# NaiveDatetime = Annotated[datetime.datetime, Meta(timezone=False)]

class Example(Struct):
    naive_only_datetime: NaiveDatetime

Msgpack Complications

Currently we use msgpack's timestamp extension (https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type) when encoding datetimes to msgpack. This extension by design only supports timezone-aware datetimes. msgpack has no standard representation for naive datetimes (or time/date objects in general). To handle this, I plan to encode naive datetimes as strings in the same format as JSON. This is an edge case that I don't expect to affect most users. I think the main benefit of supporting it is parity between types supported by both protocols.

@jcrist
Copy link
Owner Author

jcrist commented Dec 2, 2022

cc @provinzkraut for thoughts, since you've been providing some good feedback lately. Feel free to ignore if this issue doesn't affect your use case.

@jcrist
Copy link
Owner Author

jcrist commented Dec 2, 2022

Also cc @adriangb for thoughts ^^.

@jcrist
Copy link
Owner Author

jcrist commented Dec 3, 2022

After thinking about this throughout the day, I'm now wondering if a simpler don't-try-to-stop-the-user-from-shooting-themselves-in-the-foot solution may be better.

  • Encoders would support both naive and aware datetime/time objects. Encoding an aware object would be RFC3339 compatible (no change from current behavior), while encoded naive objects would be close to compatible, but missing the timezone component. It's up to the user to ensure they're using the proper aware/naive types for their use case.
  • Decoders would interpret datetime.datetime/datetime.time annotations as meaning any object of those types (including both naive and aware). We'd still provide the annotations for ensuring an aware or naive datetime object, but the unannotated version (datetime.datetime) would mean AnyDateTime. This does make it easier to accidentally accept naive datetimes when you really wanted only aware datetimes. As with encoding, we'd take the stance that it's up to the user to configure things properly.

Pros: simpler. No encoding configuration needed. Decoding by default roundtrips all types, and users have the flexibility to mandate aware types if needed (usually this is needed).

Cons: easy to shoot yourself in the foot by using datetime.datetime when you probably wanted AwareDateTime. No error raised on encoding naive datetimes, which may be a programmer error.

Still not sure what the best option is, or if this solution is better than the one above. Just something I'm thinking about. Feedback on any of this very welcome.

@provinzkraut
Copy link

So, in general, this is very welcome! Broadened support for datetime objects would be a very useful thing to have.

Regarding the type vs. Annotated question: In general, I think the explicit custom type is a better approach. This is also the de-facto standard for similar ser/de libraries like Pydantic and marshmallow, so it's going to be most intuitive for users. On the other hand, Meta already exists, so it would only be logical to follow this pattern. I think it makes most sense if msgspec keeps this pattern.

As for the footgun. I think dropping the timezone by default isn't good, as it's very counterintuitive. If I put a timezone aware datetime object in, I expect to get something with a timezone out. So your first approach seems better to me. Strictness in validation is a good thing, especially when it comes to timezones. If the user needs lenience there, it should be specified explicitly.

@jcrist
Copy link
Owner Author

jcrist commented Dec 3, 2022

Thanks for the feedback!

I think dropping the timezone by default isn't good, as it's very counterintuitive.

I think there's some confusion between the options presented above - I'm not advocating for dropping the timezone at all. Apologies, I should have done a clearer writeup of what the design options and questions are.

There are 4 types of objects that I'm not clear on how we should handle:

import datetime

# 1. Timezone Aware Datetime: "2022-12-03T11:42:31.123+00:00"
tz_aware_datetime = datetime.datetime(
    2022, 12, 3, 11, 42, 31, 123, datetime.timezone.utc
)

# 2. Timezone Naive Datetime:  "2022-12-03T11:42:31.123"
tz_naive_datetime = datetime.datetime(2022, 12, 3, 11, 42, 31, 123)

# 3. Timezone Aware Time: "11:42:31.123+00:00"
tz_aware_time = datetime.time(11, 42, 31, 123, datetime.timezone.utc)

# 4. Timezone Naive Time: "11:42:31.123"
tz_naive_time = datetime.time(11, 42, 31, 123)

Currently only the first (timezone aware datetimes) is supported for encoding and decoding. The rest can be handled through custom enc_hook/dec_hook options. I'd like to change these to have builtin support since these are all common types (builtin support will also have much higher performance).

The reason we don't support naive datetimes currently is that RFC3339 (the RFC we follow for our datetime encoding format) doesn't support datetimes without a timezone component. Neither does json-schema (the format="date-time" is only for timezone-aware datetimes). So currently we error for both encoding and decoding to force the user to handle these cases explicitly, opting in to working with naive datetimes. But naive datetimes do have their uses, and other tools do support them by default:

  • orjson supports both naive and aware datetimes. Naive datetimes are encoded without a timezone component, unless OPT_NAIVE_UTC is passed in, in which case naive datetimes are encoded with a UTC timezone. orjson doesn't support timezone-aware time objects at all.
  • django supports both naive and aware datetimes. Naive datetimes are encoded without a timezone component. django doesn't support encoding timezone-aware time objects at all (from following the commits, it appears orjson got this preference from django's implementation).
  • flask supports both naive and aware datetimes. Naive datetimes are treated as UTC by default. Flask doesn't encode datetime.time objects by default.
  • pydantic supports both naive and aware datetime and time objects. Naive objects are encoded without a timezone component.

On the decoding side the consuming code often is written to only work with aware (or naive) datetime objects. As such, I'm a bit reluctant to make the datetime.datetime annotation decode naive datetimes by default, as most APIs will want to enforce these datetimes are aware instead.

On the other hand, a user might be confused why a datetime.datetime annotation can't decode the naive datetime.datetime object that the encoder successfully encoded. So I think for parity I think we'd want to make naive datetimes opt-in or opt-out on both the encoding and decoding side, so the APIs line up.

Given all that info, the open questions are:

  • Do we support encoding naive datetime.datetime objects by default? If so, do we encode with no timezone component, or treat them as UTC (and if we do that do we add an option to configure this behavior)?
  • Do we support encoding naive datetime.time objects by default? If so, do we encode with no timezone component, or treat them as UTC (and if we do that do we add an option to configure this behavior)? Or do we do as orjson/django do and only support timezone naive datetime.time objects?
  • When decoding, what does a bare datetime.datetime type support? Does it decode both timezone aware ( "2022-12-03T11:42:31.123+00:00") and naive ("2022-12-03T11:42:31.123") timestamps? Or just aware?
  • When decoding, what does a bare datetime.time type support? Does it decode both timezone aware ( "11:42:31.123+00:00") and naive ("11:42:31.123") timestamps? Or just aware? Or just naive?

@jcrist
Copy link
Owner Author

jcrist commented Dec 3, 2022

datetime.date support is added in #221. Holding off on adding datetime.time/reworking datetime.datetime support until the questions above are resolved.

@provinzkraut
Copy link

I think there's some confusion between the options presented above - I'm not advocating for dropping the timezone at all.

I think it was me who was being unclear actually, and I misread something 🙃

I think in general, it would be the most intuitive approach if you get out what you put in, that is, if the input includes a timezone, the output should include a timezone. If the input doesn't include a timezone, the output shouldn't include a timezone. This is the least confusing behaviour in my opinion.

Regarding the open questions this would mean:

Do we support encoding naive datetime.datetime objects by default? If so, do we encode with no timezone component, or treat them as UTC (and if we do that do we add an option to configure this behavior)?

Naive objects should be supported by default. They should not include a timezone component. This mimics the behaviour of datetime.datetime.isoformat().

Do we support encoding naive datetime.time objects by default? If so, do we encode with no timezone component, or treat them as UTC (and if we do that do we add an option to configure this behavior)? Or do we do as orjson/django do and only support timezone naive datetime.time objects?

Naive objects should be supported by default. They should not include a timezone component. This mimics the behaviour of datetime.time.isoformat().

When decoding, what does a bare datetime.datetime type support? Does it decode both timezone aware ( "2022-12-03T11:42:31.123+00:00") and naive ("2022-12-03T11:42:31.123") timestamps? Or just aware?

Both naive and aware should be supported. If the input data includes a timezone, the result should be an aware datetime.datetime object. If the input does not include a timezone, the result should be a naive datetime.datetime obejct.

When decoding, what does a bare datetime.time type support? Does it decode both timezone aware ( "11:42:31.123+00:00") and naive ("11:42:31.123") timestamps? Or just aware? Or just naive?

Both naive and aware should be supported. If the input data includes a timezone, the result should be an aware datetime.time object. If the input does not include a timezone, the result should be a naive datetime.time obejct.

Handling of constraints

Further constraints could be addressed with Meta.

  • Meta(timezone=False) will accept aware/naive objects and data including timezone information, but ignore that information
  • Meta(timezone=True) will only accept aware objects, and data including timezone information
  • Meta(timezone=None) is the default, and work like described above

@adriangb
Copy link

adriangb commented Dec 4, 2022

Further constraints could be addressed with Meta.

  • Meta(timezone=False) will accept aware/naive objects and data including timezone information, but ignore that information
  • Meta(timezone=True) will only accept aware objects, and data including timezone information
  • Meta(timezone=None) is the default, and work like described above

I think there's an option missing that only accepts naive objects and errors if any timezone info is included. I would also use literal strings or an enum instead of True/False.

  • Meta(timezone="naive-only") will accept only naive objects and error if any timezone information is included
  • Meta(timezone="aware-only") will only accept aware objects, and data including timezone information
  • Meta(timezone=None) is the default, decodes a naive timezone to naive datetime obj, an aware timezone to an aware datetime obj and does the reciprocal for encoding

To support different behaviors for encoding/decoding you could have Meta(timezone_encode=None timezone_decode="aware-only").

For what it's worth I believe Pydantic just encodes/decodes naive/aware by default and offers no options (other than custom validators). I don't think I've heard many complaints in this department, despite the possible footgun. It's not uncommon for things to pass around naive timezones and assume they're UTC so I think (hope maybe) that most developers are used to double checking timestamps for timezone handling.

@jcrist
Copy link
Owner Author

jcrist commented Dec 4, 2022

Thanks y'all, I think this all makes sense, and mostly matches the option presented in #220 (comment). This feedback is really helpful.

One last open question - since datetimes are common, I want to provide some type aliases so the user doesn't need to write Annotated[datetime.datetime, Meta(tz=True)] (or however we spell it) themselves. I'm just not sure what to call it. All of the following are possible and should work transparently with mypy/pyright, it's just a naming question:

  • DatetimeAware for aware-only datetimes, DatetimeNaive for naive-only datetimes (or should this be DateTimeAware and DateTimeNaive?)
  • Aware and Naive parametrized types, so you'd write Aware[datetime.datetime] or Naive[datetime.datetime]?
  • Maybe "aware" and "naive" aren't the clearest terms? Could do HasTZ[datetime.datetime] / NoTZ[datetime.datetime]?

Not critical, just curious what names seem clearest. All of these would be implemented using python type hint features, Aware for example would be something like:

from typing import TypeVar, Annotated
from msgspec import Meta

T = TypeVar("T")
Aware = Annotated[T, Meta(tz=True)]

@jcrist
Copy link
Owner Author

jcrist commented Dec 5, 2022

Ok, this is mostly done in #224. Added a new tz consraint in Meta that indicates whether a datetime must be aware (tz=True), naive (tz=False), or either (tz=None, the default).

Remaining todos:

  • datetime.time support
  • Any nicer type aliases for aware or naive datetimes as described above.

@jcrist
Copy link
Owner Author

jcrist commented Dec 5, 2022

Closing this now that datetime.time support is in in #225. I'd still like to add some nicer type aliases, but that can be done in the future. If anyone has thoughts on the spellings for those (described in #220 (comment)) please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants