Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/DEPR: DatetimeIndex constructor #23675

Merged
merged 40 commits into from
Dec 3, 2018
Merged

Conversation

jbrockmendel
Copy link
Member

This is preliminary to implementing DatetimeArray._from_sequence. The attempt to implement that directly (i.e. without existing circularity) failed, so trying it as a 2-parter.

Similar to #23539, this deprecates the option of passing timedelta64 data to DatetimeIndex.

There are a few places where I've made TODO notes of the "is this really intentional?" form; I'll point this out in inline comments.

@mroeschke I edited a couple of docstrings for the recently-implemented "nonexistent" kwarg, can you confirm that I didn't mess them up?

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for submitting the PR.

# as wall-times instead of UTC timestamps.
data = data.astype(_NS_DTYPE)
copy = False
# TODO: Why do we treat this differently from integer dtypes?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback any idea why we treat floats differently from ints here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea, i would try to remove this special handling. only thing i can think of maybe this could have some odd rounding in the astype if its out of range of an int64 .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would try to remove this special handling

Yah, the immediate goal is to pass fewer cases to to_datetime since it is painfully circular and hides weird behavior like this float-dtype behavior.

If If we just cast floats to int64 (and mask with iNaT) then exactly one test fails in tests.dtypes.test_missing. The behavior (in master) that is different between float vs int is that floats are treated as wall-times instead of UTC times. eg:

iarr = np.array([0], dtype='i8')
farr = np.array([0], dtype='f8')

>>> pd.DatetimeIndex(iarr)._data
array(['1970-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> pd.DatetimeIndex(iarr, tz='US/Eastern')._data
array(['1970-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

>>> pd.DatetimeIndex(farr)._data
array(['1970-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> pd.DatetimeIndex(farr, tz='US/Eastern')._data
array(['1970-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

I don't see any especially good reason why it should work this way, save for keeping this test working and not introducing breaking changes.

@jbrockmendel
Copy link
Member Author

azure failure is a Hypothesis failure

@codecov
Copy link

codecov bot commented Nov 13, 2018

Codecov Report

Merging #23675 into master will increase coverage by <.01%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23675      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         161      161              
  Lines       51562    51586      +24     
==========================================
+ Hits        47599    47622      +23     
- Misses       3963     3964       +1
Flag Coverage Δ
#multiple 90.71% <94.73%> (ø) ⬆️
#single 42.44% <63.15%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 96.55% <100%> (+0.02%) ⬆️
pandas/core/tools/datetimes.py 85.27% <92.3%> (-0.27%) ⬇️
pandas/core/arrays/datetimes.py 98.43% <94.73%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 022f458...03d5b35. Read the comment docs.

@gfyoung gfyoung added Datetime Datetime data dtype Deprecate Functionality to remove in pandas labels Nov 14, 2018
# as wall-times instead of UTC timestamps.
data = data.astype(_NS_DTYPE)
copy = False
# TODO: Why do we treat this differently from integer dtypes?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea, i would try to remove this special handling. only thing i can think of maybe this could have some odd rounding in the astype if its out of range of an int64 .

pandas/core/indexes/datetimes.py Show resolved Hide resolved
pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved
pandas/core/tools/datetimes.py Outdated Show resolved Hide resolved
pandas/tests/indexes/datetimes/test_construction.py Outdated Show resolved Hide resolved
pandas/tests/indexes/datetimes/test_construction.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

Merged master to fix the merge conflict.

@jbrockmendel
Copy link
Member Author

Travis error looks unrelated

@jorisvandenbossche
Copy link
Member

@jbrockmendel Can you give a bit more context here? Eg, why are you adding so much new code (while there are actually no user facing changes, I suppose, except from the deprecation) ?
What is the rationale of not using to_datetime (if we refactor things, I would rather think it is to share more implemenation) ?

@jbrockmendel
Copy link
Member Author

Can you give a bit more context here? Eg, why are you adding so much new code (while there are actually no user facing changes, I suppose, except from the deprecation) ?
What is the rationale of not using to_datetime (if we refactor things, I would rather think it is to share more implemenation) ?

Sharing implementation is fine, circularity is not. The fact that to_datetime calls the DatetimeIndex constructor and vice-versa makes it very difficult to sort out (in fact I spent a big chunk of Monday trying to untangle it in one pass and couldn't) what happens in which case. The addition of new code is to make these behaviors more clear/explicit.

Moreover I am not convinced that the float_dtype behavior is intentional.

There is an issue (need to dig it up) reporting a bug in passing a Categorical to DatetimeIndex. Point being: there is value in being more explicit about what we are doing and why.

@jorisvandenbossche
Copy link
Member

Sharing implementation is fine, circularity is not.

Yes, but that circularity already existed. What changes did you try to make that it became more problematic?

The addition of new code is to make these behaviors more clear/explicit.

But it also means duplicating in DatetimeIndex things that are already handled in to_datetime. Not sharing it will much more easily lead to a future change leading to inconsistencies between both?

For the TimedeltaIndex PR, you made some common functions that were used by both. Is a similar pattern not possible here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I find the logic of what to do by each dtype (as you are introducing here) nice and clear to follow, but mainly wondering why not doing that in to_datetime to avoid divergence between both

pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved
pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

For the TimedeltaIndex PR, you made some common functions that were used by both. Is a similar pattern not possible here?

That's the goal behind #23702.

@jbrockmendel
Copy link
Member Author

Azure fail looks unrelated

@jbrockmendel
Copy link
Member Author

yeah i agree this [treating floats as wall-times, ints as UTC] seems a little odd, these should just be cast to integer i think.

Change behavior here, deprecation cycle, or follow-up?

@jbrockmendel jbrockmendel mentioned this pull request Nov 30, 2018
12 tasks
@jbrockmendel
Copy link
Member Author

gentle ping. Getting the constructors done will noticeably tighten the scope of #24024.

# If tzaware, these values represent unix timestamps, so we
# return them as i8 to distinguish from wall times
return values.view('i8'), tz_parsed
except (ValueError, TypeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel is correct here.

side-note, we could clarify all this with python-3 style

except (ValueError, TypeError) as e2:
    raise e2 from e

or six's https://pythonhosted.org/six/#six.raise_from, but it's probably just easier to wait until next month.

Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to verify: getting an you'll get an np.int64-dtype result if and only if inferred_tz is not None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

# If tzaware, these values represent unix timestamps, so we
# return them as i8 to distinguish from wall times
return values.view('i8'), tz_parsed
except (ValueError, TypeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a "no" on the resolving miscommunication. Did I at least accurately summarize your suggested change?

not resolved, but I see from @TomAugspurger which what i was getting at. i guess ok for now.

return result, tz_parsed
elif is_object_dtype(result):
if allow_object:
# allowed by to_datetime, not by DatetimeIndex constructor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right and that is the correct behavior, actually DTI should just return a Index in that case, otherwise this is very confusing. I believe you have an issue talking about mixed-timestatmps. These should always just return an Index of objects. No conversion / inference should happen at all.

if allow_object:
# allowed by to_datetime, not by DatetimeIndex constructor
return result, tz_parsed
raise TypeError(result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just remove the else and do the raise TypeError(result) at the end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its worth distinguishing between hittable TypeError and defensive TypeError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment more then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

yearfirst=yearfirst)

if is_object_dtype(data) or is_string_dtype(data):
# TODO: We do not have tests specific to string-dtypes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could just write this as

try:
      data = datas.astype(np.int64)
except:
     ...

might be more clear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with that is np.array(['20160405']) becomes np.array([20160405]) instead of 2016-04-05.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok sure

raise e
except ValueError as e:
# Fallback to try to convert datetime objects
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cam you be slightly more specific on 'fallback', e.g. when would this be triggered

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll flesh out the comment. The gist of it is if tz-aware datetime objects are passed and utc=True is not passed, then array_to_datetime will raise a ValueError and datetime_to_datetime64 will handle this case.

It's a mess and is a big reason why I've been making array_to_datetime PRs: #24006 is an attempt to fix it, but there are some design questions @mroeschke and I are still batting around.

@jbrockmendel
Copy link
Member Author

actually DTI should just return a Index in that case, otherwise this is very confusing. I believe you have an issue talking about mixed-timestatmps. These should always just return an Index of objects. No conversion / inference should happen at all.

I'm -0.5 on having DatetimeIndex.new return an object Index, would rather have it raise. Either way, that is a design change that should be orthogonal to this PR.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

I'm -0.5 on having DatetimeIndex.new return an object Index, would rather have it raise. Either way, that is a design change that should be orthogonal to this PR.

sure (and raising is prob correct)

return result, tz_parsed
raise TypeError(result)
else: # pragma: no cover
# GH#23675 this TypeError should never be hit, whereas the TypeError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not more clear.

Can you put the cases where each of these would be hit

@jbrockmendel
Copy link
Member Author

This is a blocker for DatetimeArray._from_sequence, so probably merits a 0.24.0 milestone tag

@jreback jreback added this to the 0.24.0 milestone Dec 3, 2018
@jreback jreback merged commit e7991b3 into pandas-dev:master Dec 3, 2018
@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@jbrockmendel ok thanks. I realy really want to limit scope to things that we absolutely need to pushout the Datetime and associated EA's. sure there are other things to do, but theses need much more consideration that we have time for. (basically I am saying, this code IMHO is not strictly necessary for this refactor, though its a nice to do).

@jbrockmendel jbrockmendel deleted the pre-fixes4 branch December 3, 2018 15:18
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants