Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: to_datetime preserves UTC offsets when parsing datetime strings #21822

Merged
merged 43 commits into from
Jul 30, 2018

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Jul 8, 2018

This PR makes to_datetime(ts_string_with_offset) and DatetimeIndex([ts_string_with_offset]) to now match Timestamp(ts_string_with_offset)

This branch

In [2]: Timestamp("2015-11-18 15:30:00+05:30")
Out[2]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')

In [3]: to_datetime("2015-11-18 15:30:00+05:30")
Out[3]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')

In [4]: DatetimeIndex(["2015-11-18 15:30:00+05:30"])
Out[4]: DatetimeIndex(['2015-11-18 15:30:00+05:30'], dtype='datetime64[ns, pytz.FixedOffset(330)]', freq=None)

@mroeschke mroeschke added the Timezones Timezone data dtype label Jul 8, 2018
if not is_same_offsets:
raise TypeError
else:
# Open question: should this return dateutil offset or pytz offset?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default to dateutil

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR then, is it okay that parsing through Timestamp will produce a pytz.FixedOffset and to_datetime will producea dateutil.tz.tzoffset?

I should probably start a larger discussion whether we should be migrating from pytz to dateutil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, forgot that Timestamp defaulted to pytz.FixedOffset. Sharing code with the Timestamp constructor is definitely a higher priority than defaulting to dateutil.tz.

arg,
errors=errors,
utc=tz == 'utc',
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601
)
if tz_parsed is not None:
return DatetimeIndex._simple_new(result, name=name,
tz=tz_parsed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case with multiple tzs that has to get wrapped in object-dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That case will result in tz_parsed = None so this branch will not be hit.

@jbrockmendel
Copy link
Member

Looking over the test failures, most of them are clearly the-test-is-wrong.

@codecov
Copy link

codecov bot commented Jul 12, 2018

Codecov Report

Merging #21822 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21822      +/-   ##
==========================================
+ Coverage   92.07%   92.07%   +<.01%     
==========================================
  Files         170      170              
  Lines       50690    50696       +6     
==========================================
+ Hits        46672    46678       +6     
  Misses       4018     4018
Flag Coverage Δ
#multiple 90.48% <100%> (ø) ⬆️
#single 42.3% <60%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 88.52% <ø> (ø) ⬆️
pandas/core/tools/datetimes.py 85.07% <100%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d30c4a0...1cbd9b9. Read the comment docs.

2) datetime.datetime objects, if OutOfBoundsDatetime or TypeError
is encountered

Also returns a pytz.FixedOffset if an array of strings with the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle other tzinfos could be returned, specifically if it falls back to dateutil

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifying that array_to_datetime function itself can return a pytz.FixedOffset or None from the C parser output. If it goes through the dateutil parser in the non-ValueError branch, I don't think there's a way to recover the timezone?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parse_datetime_string is called if there's a timezone it should return a tz-aware datetime object, so the tzinfo can be pulled off that can't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right that's true, good catch. Sure I will try to add some tests and functionality to hit the dateutil parser before the object branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am using a set to store the parsed timezone offsets (which should be more performant that using an array in theory / I was having some trouble using an array due to duplicates), however dateutil.tz.tzoffsets cannot be hashed: dateutil/dateutil#792

So instead, I am saving the total_seconds() of the dateutil tzoffset in the set instead and reconstructing the offsets as pytz.FixedOffsets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add Parameters here


Returns
-------
(ndarray, timezone offset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underscore between timezone and offset?

# the parsed dates to (maybe) pass to DatetimeIndex
# 2) If the offsets are different, then force the parsing down the
# object path where an array of datetimes
# (with individual datutil.tzoffsets) are returned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo datutil

# Faster to compare integers than to compare objects
is_same_offsets = (out_tzoffset_vals[0] == out_tzoffset_vals).all()
if not is_same_offsets:
raise TypeError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (pre-existing) pattern is pretty opaque to a first-time reader. What if instead of raising TypeError the fallback block became its own function that gets called from here?

assert is_datetime64_ns_dtype(i)

# tz coerceion
result = pd.to_datetime(i, errors='coerce', cache=cache)
tm.assert_index_equal(result, i)

result = pd.to_datetime(i, errors='coerce', utc=True, cache=cache)
expected = pd.DatetimeIndex(['2000-01-01 13:00:00'],
expected = pd.DatetimeIndex(['2000-01-01 08:00:00'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reason for this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The +00:00 in i above required a tz_convert post construction while I think the original intention was to tz_localize it to the passed psycopg.tz in the constructor (leading to this change)

But to reflect the intension of the original test, I just removed the +00:00 instead


# TODO: Appears that parsing non-ISO strings adjust the date to UTC
# but don't return the offset. Not sure if this is the intended
# behavior of non-iso strings in np_datetime_strings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np_datetime_strings doesn't handle non-ISO. That case ends up going through dateutil (via parse_datetime_string). I'm not totally sure what the TODO is for. '01-01-2013 00:00:00' comes back tz-naive, right?

# (with individual datutil.tzoffsets) are returned

# Faster to compare integers than to compare objects
is_same_offsets = (out_tzoffset_vals[0] == out_tzoffset_vals).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a perf tradeoff here, specifically in the case where we have all-strings, all of which are ISO, but that don't have matching timezones. Going through the parse_datetime_string path below is much slower than _string_to_dts. Going through the python path entails a big hit.

The various paths (including require_iso8859 ugh) make this a giant hassle. @jreback one way to simplify this hassle would be to strengthen the requirement on require_iso8859. ATM it raises if it sees a non-ISO string, but is fine with datetime/np.datetime64 objects. If it were strings-only, then a bunch of logic could be simplified (not necessarily this PR). Thoughts?

@jbrockmendel
Copy link
Member

A few comments, mostly about some tough perf tradeoffs. Generally this looks great; I'm really psyched to see this get fixed.

@mroeschke
Copy link
Member Author

Addressed your comments @jreback & @jbrockmendel. Open for a final look.

yearfirst=yearfirst)
pydatetime_to_dt64(oresult[i], &dts)
check_dts_bounds(&dts)
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be more specific. parse_datetime_string could raise ValueError or OverflowError, check_dts_bounds raises OutOfBoundsDatetime (which subclasses ValueError), and I think thats it.

@jbrockmendel
Copy link
Member

Does this make Timestamp("now") match to_datetime("now")? Or "today"?

@mroeschke
Copy link
Member Author

This doesn't fix "now". ("today" looks like it was handled in #19937)

@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

can you rebase

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments. @jbrockmendel any comments?


*Previous Behavior*:

.. code-block:: ipython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this directiive you just be python (rather than ipython here) @TomAugspurger ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at previous whatsnews, looks like either ipython or python works for this directive.

@@ -123,11 +147,11 @@ def test_coerce_of_invalid_datetimes(self):

# Without coercing, the presence of any invalid dates prevents
# any values from being converted
result = tslib.array_to_datetime(arr, errors='ignore')
result = tslib.array_to_datetime(arr, errors='ignore')[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like writing these like

result, _ = ...... I think its more clear

@jreback jreback added this to the 0.24.0 milestone Jul 29, 2018
@jreback
Copy link
Contributor

jreback commented Jul 29, 2018

@mroeschke can you also run some timeseries benchmarks generally to see if any effects (I would expected some small slowdown for parsing but but not significant). You may need to add some benchmarks to capture the new code paths.

@jbrockmendel
Copy link
Member

any comments?

Just that I’m pretty stoked to see this fixed.

@mroeschke
Copy link
Member Author

mroeschke commented Jul 30, 2018

From one new ASV I added, parsing strings with different offests will be slower, but I think it's acceptable given the new (more correct) behavior

       before           after         ratio
     [d30c4a06]       [807a2513]
+         1.17±0s       2.37±0.02s     2.02  timeseries.ToDatetimeNONISO8601.time_different_offset

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback merged commit 9a8cebc into pandas-dev:master Jul 30, 2018
@jreback
Copy link
Contributor

jreback commented Jul 30, 2018

ncie patch @mroeschke !

@mroeschke mroeschke deleted the parse_tz_offsets branch July 30, 2018 15:22
# is comparison fails unlike other dateutil.tz
# objects. Also, dateutil.tz.tzlocal has no
# _offset attribute like tzoffset
offset_seconds = tz._dst_offset.total_seconds()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is using an internal method and is fragile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Suggestion for a better way to do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this function is supposed to do, but from what I do understand it looks like a mistake.

Copy link
Member Author

@mroeschke mroeschke Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original intent with this section was to collect the dateutil timezone objects in a set and determine if more than 1 distinct timezone object was collected in the end.

However since dateutil timezones cannot be hashed (thanks for starting progress on the associated issue btw @pganssle), I instead store the timezone's UTC offsets in seconds instead (which is what I essentially care about).

It wasn't too apparent to me if there was a public way to access a dateutil timezone's UTC offset, hence why I am use a private method here. Definitely open to a better method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you can't find a public method you probably shouldn't dive into the private API. There are a few assumptions in here that won't survive future versions of dateutil for sure. The tzlocal thing also actually gives you the wrong answer, because tzlocal is not a fixed offset and you're taking the DST offset.

If you want the offset, you should do dt.utcoffset(), regardless of whether it's dateutil or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are significantly worried about speed, you can probably use the fact that in dateutil >= 2.7.0, dateutil.tz.tzutc() returns a singleton, and dateutil.tz.tzoffset(*args1) is the same object as any other dateutil.tz.tzoffset(*args2) where args1 == args2. So you can probably store a mapping between id(obj) and the result of obj.tzoffset(datetime(1970, 1, 1)) for that specific subclass (since it is also guaranteed to have no DST).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
4 participants