Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Parse %z and %Z directive in format for to_datetime #19979

Merged
merged 38 commits into from
May 29, 2018

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Mar 3, 2018

The implimentiontion is near identical to https://github.com/python/cpython/blob/master/Lib/_strptime.py an currently works as datetime.strptime would:

In [3]: f = '%Y-%m-%d %H:%M:%S %Z %z'

# datetime will parse the case below even though UTC with +0100 should be bogus
In [4]: d = ['2010-01-01 12:00:00 UTC +0100'] * 10

In [5]: pd._libs.tslibs.strptime.array_strptime(np.array(d, dtype='object'), f)
Out[5]:
(array(['2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000'],
       dtype='datetime64[ns]'),
 array([datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC')],
       dtype=object))

In [29]: datetime.strptime('2010-01-01 12:00:00 UTC +0100', '%Y-%m-%d %H:%M:%S %Z %z')
Out[29]: datetime.datetime(2010, 1, 1, 12, 0, tzinfo=datetime.timezone(datetime.timedelta(0, 3600), 'UTC'))

Currently, an offset needs to get passed (%z) in order for the tzname used be used (%Z).

I'd like to get feedback of what this function should return having parsed %z or %Z. It may be difficult to return a normal DatimeIndex/Series/array given the following edge cases:

  1. User passes strings with different tz offsets ([date] +0100, [date]. -0600, [date] +1530)
  2. User passes strings with different tz names ([date] UTC, [date]. EST, [date] CET)
  3. User passes strings with incompatable tz name and offset (see example above)

I suppose the most agnostic thing to is to return an array of Timestamps?

In [27]: [pd.Timestamp(val, tzinfo=tz) for val, tz in zip(*pd._libs.tslibs.strptime.array_strptime(np.array(d, dtype='object'), f))]
Out[27]:
[Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC')]

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need tests

@@ -344,7 +344,7 @@ def _convert_listlike(arg, box, format, name=None, tz=tz):
if result is None:
try:
result = array_strptime(arg, format, exact=exact,
errors=errors)
errors=errors)[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this changed?

z = z[:3] + z[4:]
if len(z) > 5:
if z[5] != ':':
msg = "Unconsistent use of : in {0}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent

@codecov
Copy link

codecov bot commented Mar 9, 2018

Codecov Report

Merging #19979 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19979      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49506    49516      +10     
==========================================
+ Hits        45467    45477      +10     
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.24% <100%> (ø) ⬆️
#single 41.87% <23.07%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/tools/datetimes.py 84.98% <100%> (+0.54%) ⬆️
pandas/core/arrays/categorical.py 95.67% <0%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36c1f6b...757458d. Read the comment docs.

@mroeschke
Copy link
Member Author

Summary of the logic thus far and open to feedback:

array_strptime will now return 3 arrays of the datetime64 value, tzname, and tzoffset separately.

  1. If the tzname directive is only passed, %Z, dispatch to Datimetime's tz arg or Timestamp's tz arg if there are multiple different tznames are parsed.

  2. If the tzoffset directive is only passed, %z, create a pytz.FixedOffset and dispatch to Datimetime's tz arg or Timestamp's tz arg if there are multiple different tzoffset are parsed.

  3. If a tzname and tzoffet directives are passed, create a datetime.timezone object (similar to the cpython implementation) and just dispatch to Timestamp. One question here is if we should invalidate bogus timezones (e.g. UTC +0600, PST -1500, etc)

@mroeschke
Copy link
Member Author

I should be able to transfer this logic into cython on the next iteration.

@mroeschke mroeschke force-pushed the strftime_timezone branch 2 times, most recently from a6c61d8 to 94641dc Compare March 13, 2018 04:00
@jreback jreback added Enhancement Datetime Datetime data dtype labels Mar 13, 2018
if z == 'Z':
gmtoff = 0
else:
if z[3] == ':':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a check on the len upfront here

seconds = int(z[5:7] or 0)
gmtoff = (hours * 60 * 60) + (minutes * 60) + seconds
gmtoff_remainder = z[8:]
# Pad to always return microseconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

@@ -281,6 +290,32 @@ def array_strptime(ndarray[object] values, object fmt,
else:
tz = value
break
elif parse_code == 19:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this whole parse to a function and just all it here (and return the values as a tuple)

@@ -343,8 +346,74 @@ def _convert_listlike(arg, box, format, name=None, tz=tz):
# fallback
if result is None:
try:
result = array_strptime(arg, format, exact=exact,
errors=errors)
parsing_tzname = '%Z' in format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, what do you need all this for???

@@ -183,6 +183,63 @@ def test_to_datetime_format_weeks(self, cache):
for s, format, dt in data:
assert to_datetime(s, format=format, cache=cache) == dt

@pytest.mark.skipif(not PY3,
reason="datetime.timezone not supported in PY2")
def test_to_datetime_parse_timezone(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import at the top

tm.assert_numpy_array_equal(result, expected)

# %z and %Z parsing
dates = ['2010-01-01 12:00:00 UTC +0100'] * 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need more checking for invalid (partially formed such as +0, +1foo, UTCbar

tm.assert_index_equal(result, expected)

result = pd.to_datetime(dates, format=fmt, box=False)
expected = np.array(expected_dates, dtype=object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert_index_equal always

@mroeschke
Copy link
Member Author

@jreback I was able to simplify a lot of my logic ( I underestimated how Index would get cast to DatetimeIndex when passed Timestamps with the same tz).

I created separate functions to parse the timezone directive and then box the result.

@@ -632,3 +655,48 @@ cdef _calc_julian_from_U_or_W(int year, int week_of_year,
else:
days_to_week = week_0_length + (7 * (week_of_year - 1))
return 1 + days_to_week + day_of_week

cdef _parse_timezone_directive(object z):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be de-privatized (no leading _); these modules are all private.


if z == 'Z':
gmtoff = 0
gmtoff_fraction = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would just directly return for this case (0, 0)

then don't need the else

gmtoff = 0
gmtoff_fraction = 0
else:
if z[3] == ':':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might need to wrap this entire block in a try/except if the string is not long enough (or check lengths for each sub-section) and the raise the appropriate error

ts = tslib.Timestamp(res)
ts = ts.tz_localize(tzoffset)
tz_results.append(ts)
tz_results = np.array(tz_results)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you need all of this for, this is jumping thru a lot of hoops here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This elif branch is:

  1. Creating a pytz.FixedOffset from the parsed offset
  2. Creating a naive Timestamp, then localizing it to the pytz.FixedOffset (can't do it directly like Timezone(res, tz=pytz.FixedOffset(...)) because of my realization from DOC: Clarify passing epoch timestamp to Timestamp with timezone. #20257)

@@ -25,6 +25,12 @@
from pandas import (isna, to_datetime, Timestamp, Series, DataFrame,
Index, DatetimeIndex, NaT, date_range, compat)

if PY3:
from datetime import timezone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a fixture for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a fixture for timezone.utc, but I am testing parsing a custom timezone.

def test_to_datetime_parse_tzname_or_tzoffset(self, box, const,
assert_equal, fmt,
dates, expected_dates):
# %z or %Z parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number

assert_equal, dates,
expected_dates):
# %z and %Z parsing
fmt = '%Y-%m-%d %H:%M:%S %Z %z'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

noemielteto and others added 7 commits March 18, 2018 16:40
@mroeschke mroeschke changed the title [WIP] ENH: Parse %z and %Z directive in format for to_datetime ENH: Parse %z and %Z directive in format for to_datetime Mar 30, 2018
@mroeschke
Copy link
Member Author

I've changed a couple of things after your review @jreback

  1. Either %z or %Z can be parsed (not together)

  2. %z will return a pytz.FixedOffset

In [3]: pd.to_datetime(['2010-01-01 12:00:00 +0100'], format='%Y-%m-%d %H:%M:%S %z')
Out[3]: DatetimeIndex(['2010-01-01 12:00:00+01:00'], dtype='datetime64[ns, pytz.FixedOffset(60)]', freq=None)
  1. %Z can parse any timezone specified by pytz.all_timezones
In [2]: pd.to_datetime(['2010-01-01 12:00:00 US/Pacific'], format='%Y-%m-%d %H:%M:%S %Z')
Out[2]: DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

dict found_key
bint is_raise = errors=='raise'
bint is_ignore = errors=='ignore'
bint is_coerce = errors=='coerce'
int ordinal
dict _parse_code_table = {'y': 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could make this a module level variable

raise ValueError("Cannot pass a tz argument when "
"parsing strings with timezone "
"information.")
result, timezones = array_strptime(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather do the error handling in the _return_parsed_timezone_results. This block is just very complicated and hard to grok

@jreback jreback added this to the 0.23.1 milestone May 24, 2018
@jreback
Copy link
Contributor

jreback commented May 24, 2018

lgtm @mroeschke

merge on green.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.1, 0.24.0 May 24, 2018
@jorisvandenbossche
Copy link
Member

New features are for 0.24.0, @mroeschke can you move the whatsnew?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition!
Added some comments

[pd.Timestamp('2010-01-01 12:00:00', tz='UTC'),
pd.Timestamp('2010-01-01 12:00:00', tz='GMT'),
pd.Timestamp('2010-01-01 12:00:00', tz='US/Pacific')]],
['%Y-%m-%d %H:%M:%S %z',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you one of them, eg this one, without the space before the tz?

['%Y-%m-%d %H:%M:%S %z',
['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'],
[pd.Timestamp('2010-01-01 12:00:00',
tzinfo=pytz.FixedOffset(0)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be UTC or a fixed offset of 0 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytz coerces a fixed offset of 0 to UTC

In [2]: pytz.FixedOffset(0)
Out[2]: <UTC>

But making it explicit here that %z should return pytz.FixedOffset(0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the actual DatetimeIndex you get here has UTC timezone? OK, that's good! (but maybe add a small comment since I would not expect that)

pd.Timestamp('2010-01-01 12:00:00',
tzinfo=pytz.FixedOffset(-60))]],
['%Y-%m-%d %H:%M:%S %z',
['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also work with %Z?
It seems that with datetime.datetime.strptime it does not work with either

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK (that's probably a newer addition to python), then it makes sense to follow upstream python to be consistent

pd.to_datetime(dates, format=fmt, box=box, utc=True)

@pytest.mark.parametrize('offset', [
'+0', '-1foo', 'UTCbar', ':10', '+01:000:01'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an empty string here as well?

@jreback jreback merged commit 7b1f9bf into pandas-dev:master May 29, 2018
@jreback
Copy link
Contributor

jreback commented May 29, 2018

thanks @mroeschke nice patch!

betcha didn't think it would be this long when you first put it up! hahah. tests and code look great!

@mroeschke mroeschke deleted the strftime_timezone branch May 29, 2018 04:50
@mroeschke
Copy link
Member Author

ha no problem, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: bad directive in to_datetime format - this uses std. strptime zone offset
4 participants