Fix crash during normalization of time zones #1407

Open
wants to merge 1 commit into `master`
13 changes: 13 additions & 0 deletions docs/mkdocs/docs/faq.md
@@ -183,3 +183,16 @@ The handling of `NaN` in ArcticDB depends on the type of the column under consideration:
* For string columns, `NaN`, as well as Python `None`, are fully supported.
* For floating-point numeric columns, `NaN` is also fully supported.
* For integer numeric columns, `NaN` is not supported. A column that otherwise contains only integers is treated as a floating-point column if ArcticDB encounters a `NaN`, at which point [the usual rules](api/arctic.md#LibraryOptions) around type promotion apply, whether or not the library is configured with dynamic schema.
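The promotion step can be seen in pandas itself; a minimal sketch (this shows the pandas-level dtype change only, not ArcticDB's library-level promotion rules):

```python
import numpy as np
import pandas as pd

# An all-integer column keeps its integer dtype
s = pd.Series([1, 2, 3], dtype="int64")
assert s.dtype == np.int64

# Introducing a NaN forces promotion to a floating-point dtype,
# since NaN has no integer representation
s2 = pd.concat([s, pd.Series([np.nan])], ignore_index=True)
assert s2.dtype == np.float64
```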

### How does ArcticDB handle `time zone` information?

ArcticDB takes a stricter approach to time zone handling than Pandas. While Pandas supports multiple types for storing time zone information, ArcticDB coerces all of them to `datetime.timezone`. This ensures that time zone information is stored consistently and preserved when data is written to and read from ArcticDB.

When writing data with time zone information to ArcticDB, we preserve the time zone offset and name. When reading data with time zone information from ArcticDB, this data is used to create a `datetime.timezone` object.

!!! note

    This means that regardless of the time zone type written to ArcticDB, all time zones are normalised to `datetime.timezone` on read.
    If you would like another time zone type back, `tzname` can be used to recreate it:

    - For `pytz` you can use: `pytz.timezone(dt.tzname())`
    - For `ZoneInfo` (Python 3.9+) you can use: `ZoneInfo(dt.tzname())`
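The round trip can be sketched with the stdlib `zoneinfo` module (illustrative only; ArcticDB performs the equivalent normalisation internally):

```python
from datetime import timezone
from zoneinfo import ZoneInfo
import pandas as pd

# A timestamp written with a zoneinfo time zone
ts = pd.Timestamp("2024-01-15 12:00", tz=ZoneInfo("Europe/London"))

# What is preserved on write: the UTC offset and the time zone name
offset = ts.tzinfo.utcoffset(ts)
name = ts.tzname()

# On read, the time zone comes back as a fixed-offset datetime.timezone
normalized = pd.Timestamp(ts.value, tz=timezone(offset, name))
assert normalized == ts                         # same instant in time
assert isinstance(normalized.tzinfo, timezone)  # normalised type
```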
44 changes: 36 additions & 8 deletions python/arcticdb/version_store/_normalization.py
@@ -7,9 +7,9 @@
"""
import copy
import datetime
from datetime import timedelta
from datetime import timedelta, timezone
import math

import pytz
import numpy as np
import os
import sys
@@ -21,7 +21,12 @@
from arcticc.pb2.descriptors_pb2 import UserDefinedMetadata, NormalizationMetadata, MsgPackSerialization
from arcticc.pb2.storage_pb2 import VersionStoreConfig
from collections import Counter
from arcticdb.exceptions import ArcticNativeException, ArcticDbNotYetImplemented, NormalizationException, SortingException
from arcticdb.exceptions import (
ArcticNativeException,
ArcticDbNotYetImplemented,
NormalizationException,
SortingException,
)
from arcticdb.supported_types import DateRangeInput, time_types as supported_time_types
from arcticdb.util._versions import IS_PANDAS_TWO, IS_PANDAS_ZERO
from arcticdb.version_store.read_result import ReadResult
@@ -911,8 +916,15 @@ def denormalize(self, obj, meta):

def _custom_pack(self, obj):
if isinstance(obj, pd.Timestamp):
tz = obj.tz.zone if obj.tz is not None else None
return ExtType(MsgPackSerialization.PD_TIMESTAMP, packb([obj.value, tz]))
offset = obj.tzinfo.utcoffset(obj).total_seconds() if obj.tzinfo else 0
if obj.tz and hasattr(obj.tz, "zone"):
tz = obj.tz.zone
else:
tz = obj.tzname()

return ExtType(
MsgPackSerialization.PD_TIMESTAMP, packb([obj.value, tz, offset])
)

if isinstance(obj, datetime.datetime):
return ExtType(MsgPackSerialization.PY_DATETIME, packb(_to_tz_timestamp(obj)))
@@ -928,7 +940,24 @@ def _custom_pack(self, obj):
def _ext_hook(self, code, data):
if code == MsgPackSerialization.PD_TIMESTAMP:
data = unpackb(data, raw=False)
return pd.Timestamp(data[0], tz=data[1]) if data[1] is not None else pd.Timestamp(data[0])
if len(data) == 2:
# We used to store only the value and the timezone as a string
# This is covering this legacy case, where we don't have the offset
val, tz = data
offset = 0
if tz:
# We need to convert the string tz to a timedelta to denormalize to a datetime.timezone
obj = pd.Timestamp(val, tz=tz)
offset = obj.tzinfo.utcoffset(obj).total_seconds()
Review comment (Collaborator): Tests for this case, both asserting zero and non-zero offset please.

else:
# This covers the current case where we store the value, the timezone and the offset
# Should be changed if the _custom_pack method is changed
val, tz, offset = data[:3]

if tz is None:
return pd.Timestamp(val)

return pd.Timestamp(val, tz=timezone(timedelta(seconds=offset), tz))

if code == MsgPackSerialization.PY_DATETIME:
data = unpackb(data, raw=False)
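The unpacking branch above can be exercised in isolation; a sketch, where `denormalize_timestamp` is our stand-in name for the `PD_TIMESTAMP` branch of `_ext_hook` (the payload layouts match the diff; the function name is ours):

```python
from datetime import timezone, timedelta
import pandas as pd

def denormalize_timestamp(payload):
    if len(payload) == 2:
        # Legacy payload: [value, tz-name]; derive the offset from the name
        val, tz = payload
        offset = 0
        if tz:
            obj = pd.Timestamp(val, tz=tz)
            offset = obj.tzinfo.utcoffset(obj).total_seconds()
    else:
        # Current payload: [value, tz-name, offset-in-seconds]
        val, tz, offset = payload[:3]
    if tz is None:
        return pd.Timestamp(val)
    return pd.Timestamp(val, tz=timezone(timedelta(seconds=offset), tz))

# Legacy and current payloads for the same instant denormalize identically
legacy = denormalize_timestamp([0, "UTC"])
current = denormalize_timestamp([0, "UTC", 0.0])
assert legacy == current
assert isinstance(current.tzinfo, timezone)
```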
@@ -1248,8 +1277,7 @@ def _strip_tz(s, e):
if not getattr(data, "timezone", None):
start, end = _strip_tz(start, end)
data = data[
start.to_pydatetime(warn=False)
- timedelta(microseconds=1) : end.to_pydatetime(warn=False)
start.to_pydatetime(warn=False) - timedelta(microseconds=1) : end.to_pydatetime(warn=False)
+ timedelta(microseconds=1)
]
return data