Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Timestamp UTC offset incorrect for dateutil tz in ambiguous DST time #31338

Closed
AlexKirko opened this issue Jan 27, 2020 · 12 comments · Fixed by #31563
Closed

BUG: Timestamp UTC offset incorrect for dateutil tz in ambiguous DST time #31338

AlexKirko opened this issue Jan 27, 2020 · 12 comments · Fixed by #31563
Assignees
Labels
Bug Timezones Timezone data dtype
Milestone

Comments

@AlexKirko
Copy link
Member

AlexKirko commented Jan 27, 2020

Code Sample, a copy-pastable example if possible

If we make a Timestamp in an ambiguous DST period while specifying via the offset (or by supplying Timestamp.value directly) that the time is before DST switch, the representation then shows that this is after DST switch. This is backed up by calling Timestamp.tz.utcoffset(Timestamp).

IN:
t1 = pd.Timestamp(1382837400000000000, tz='dateutil/Europe/London')
t1

OUT:
Timestamp('2013-10-27 01:30:00+0100', tz='dateutil/GB-Eire')

IN:
t2 = pd.Timestamp(1382837400000000000, tz='Europe/London')
t2

OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='Europe/London')

Problem description

The reason for this bug looks to be buried deep in the interaction of pandas and dateutil.

So this is what I've been able to dig up. When we try to determine whether we are in DST or not, we rely on timezone.utcoffset of the underlying timezone package. What gets executed in dateutil is this:

def utcoffset(self, dt):
	...

	return self._find_ttinfo(dt).delta
	
def _find_ttinfo(self, dt):
	idx = self._resolve_ambiguous_time(dt)
	...

def _resolve_ambiguous_time(self, dt):
	idx = self._find_last_transition(dt)

	# If we have no transitions, return the index
	_fold = self._fold(dt)
	if idx is None or idx == 0:
		return idx

	# If it's ambiguous and we're in a fold, shift to a different index.
	idx_offset = int(not _fold and self.is_ambiguous(dt, idx))

	return idx - idx_offset

dateutil is expecting an ordinary datetime.timedelta object here, so this is what it does:

  1. Use _find_last_transition to get the index of the last DST transition before dt. This is done by computing timedelta.total_seconds since epoch time. Our pandas.Timedelta.total_seconds is smart, and returns different total_seconds for before and after DST, since we basically return Timedelta.value which is the same as Timestamp.value when counting since epoch time (because of how _Timestamp.__sub__ in c_timestamp.pyx is implemented).

This is what we do (doesn't care about dt.replace(tzinfo=None)):

def total_seconds(self):
	"""
	Total duration of timedelta in seconds (to microsecond precision).
	"""
	# GH 31043
	# Microseconds precision to avoid confusing tzinfo.utcoffset
	return (self.value - self.value % 1000) / 1e9

This is what datetime.timedelta does (loses DST awareness after dt.replace(tzinfo=None)):

def total_seconds(self):
	"""Total seconds in the duration."""
	return ((self.days * 86400 + self.seconds) * 10**6 +
			self.microseconds) / 10**6
  1. The remainder of _resolve_ambiguous_time corrects for ambiguous times, since datetime.timedelta.total_seconds after dt.replace(tzinfo=None) isn't DST-aware. It checks if we are in an ambiguous period and if this is the first time this time has occured: this is what self._fold is for. fold is 0 for the first time, and 1 for the second time. If it's the first time, dateutil shifts the relevant transition index back by 1, since it thinks that total_seconds always returns the number of seconds calculated using the second time.

I'd like to discuss how we are going to approach this. From what I see, there isn't much we can do on our end. Making total_seconds non-DST-aware by default is bad, because that would be making our implementation less precise unless the user passes a parameter. Scratch that. The problem isn't so much the total_seconds implementation as it is the Timestamp.__sub__ implementation which preserves value when we subtract epoch time.

Another approach is to go to dateutil with this and implement a check there to avoid running the correction if they are dealing with a pandas.Timedelta. Might be tricky to do without introducing a dependency on pandas, though.

First came across this while solving #24329 in #30995

Expected Output

IN:
t1
OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='dateutil/GB-Eire')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+1947.gca3bfcc54.dirty
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0

@jreback
Copy link
Contributor

jreback commented Jan 27, 2020

cc @pganssle

@jreback
Copy link
Contributor

jreback commented Jan 27, 2020

other options

  • dateutil expose another method
  • move utcoffset computation go pandas (calling private functions if needed)

@AlexKirko AlexKirko changed the title BUG: Timestamp repr doesn't match value for dateutil tz in ambiguous DST time BUG: Timestamp UTC offset incorrect for dateutil tz in ambiguous DST time Jan 27, 2020
@AlexKirko
Copy link
Member Author

AlexKirko commented Jan 27, 2020

dateutil could also expose a setting to signal if total_seconds of the timedelta implementation it's working with is DST-aware or not, and we could set it on our end. A number of ways this can be done, some less bad than others.

@AlexKirko
Copy link
Member Author

AlexKirko commented Jan 27, 2020

Just in case it's useful.

The reason that correction in _resolve_ambiguous_time is necessary is because dateutil replaces tzinfo of the datetime with None before calculating total_seconds. This makes the datetime.timedelta calculation ignore the fold parameter. So the time is assumed to be non-DST (after the shift). Maybe something could be done at that stage (without bouncing calls between datetime.__sub__ and tzinfo.utcoffset). I'll look into it further to see if I get any ideas.

@pganssle
Copy link
Contributor

I don't think you've quite diagnosed the problem correctly. It is true that pandas apparently uses non-standard datetime arithmetic semantics, which could cause problems, but in this case the issue seems to be that pandas.Timestamp does not hold a fold attribute for some reason:

>>> t1.replace(fold=1)
Timestamp('2013-10-27 01:30:00+0100', tz='dateutil//usr/share/zoneinfo/Europe/London')
>>> t1.replace(fold=1).fold
0
>>> t1 = pd.Timestamp(1382837400000000000, tz='dateutil/Europe/London')
>>> t1.fold
0
>>> t1.replace(fold=1)
Timestamp('2013-10-27 01:30:00+0100', tz='dateutil//usr/share/zoneinfo/Europe/London')
>>> t1.replace(fold=1).fold
0

pytz doesn't have this problem because it is old and apparently semi-unmaintained and thus does not support PEP 495, so it uses another mechanism to keep track of ambiguous datetimes. It is expected that if you throw away the fold information on a datetime that you will have trouble in ambiguous datetimes.

@mroeschke
Copy link
Member

xref #25057 for Timestamp not supporting fold

@AlexKirko
Copy link
Member Author

AlexKirko commented Jan 28, 2020

@pganssle
Thanks. I did some more testing and you are absolutely right:

IN:
EPOCH = datetime.datetime.utcfromtimestamp(0)
t1 = pd.Timestamp(1382833800000000000, tz='dateutil/Europe/London')
t1.value

OUT:
1382833800000000000

IN:
t1.replace(tzinfo=None).value

OUT:
1382837400000000000

So we can't tell the difference between the first and the second time an ambiguous time occurs after dropping the tzinfo information.

My hypothesis that pandas arithmetic somehow incorporates fold into it was wrong, and this is good for us. In theory, once we add fold support to the Timestamp constructor, this bug should be fixed. If so, we can mark this issue as a duplicate for #25057.
I'll see if I can come up with a solution.

@AlexKirko
Copy link
Member Author

take

@AlexKirko
Copy link
Member Author

@pganssle
Does dateutil's handling of the problem depend on Python >= 3.6 then?

@pganssle
Copy link
Contributor

@AlexKirko No, dateutil has a backport that returns a datetime subclass with a fold attribute if the normal datetime object doesn't have a fold attribute.

In this case it shouldn't matter much, because Timestamp is itself a datetime subclass, and so the main thing that matters is that it has a meaningful fold attribute.

@AlexKirko
Copy link
Member Author

AlexKirko commented Jan 29, 2020

Thanks, that clears things up.

Started looking into the issue. Caught a cold, but I still hope to have something by the end of the week.

@AlexKirko
Copy link
Member Author

Well, I do have an MVP, but there is still stuff to be done, and I can't seem to find the source of the performance impact that the changes introduce.
If anyone wants to take a look at the PR and help me out, it would be appreciated.

@jreback jreback added this to the 1.1 milestone Feb 9, 2020
@jreback jreback added Bug Timezones Timezone data dtype labels Feb 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants