BUG: Cannot create third-party ExtensionArrays for datetime types (xfail) #34987

Merged
merged 2 commits into from Jan 14, 2021

Conversation

xhochy
Contributor

@xhochy xhochy commented Jun 25, 2020

@xhochy
Contributor Author

xhochy commented Jun 25, 2020

This is just the failing test for now, happy to implement a fix if someone could tell me the location where this should be fixed.

@WillAyd
Member

WillAyd commented Jun 25, 2020

@jbrockmendel

from .arrays import ArrowTimestampUSArray # isort:skip


def test_constructor_extensionblock():
Member

should be xfailed?

Contributor Author

I can xfail this, so this can be merged. I would prefer to fix this myself though.

Contributor Author

I just need a pointer at which code section I should apply a fix. Should I change the order in pandas/pandas/core/internals/blocks.py so that we only create a DatetimeTZBlock for pandas-provided datetime-based ExtensionArrays or shouldn't is_datetime64tz_dtype return True for my ExtensionDtype?

Member

I would prefer to fix this myself though.

Sounds good. Is this a use case you have a need to get working near-term, or more of a Principle Of The Thing? I ask because...

I just need a pointer at which code section I should apply a fix.

This is pretty daunting, as I expect this is scattered across the code. There are lots of places where we either a) implicitly assume nanoseconds or b) check dtype.kind in ["M", "m"] (much more performant than the is_foo_dtype checks)

Should I change the order in pandas/pandas/core/internals/blocks.py so that we only create a DatetimeTZBlock for pandas-provided datetime-based ExtensionArrays

That will probably be part of a solution.

or shouldn't is_datetime64tz_dtype return True for my ExtensionDtype?

I'd be very reluctant to make that change, since I think a lot of code expects that to imply it's getting our DatetimeTZDtype. Maybe an is_3rd_party_ea_dtype that we would check for before checking for any 1st-party dtypes? That runs into the "ideally we should treat 3rd party EAs symmetrically with 1st-party" problems.
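
To make the ordering question concrete, here is a minimal pure-Python sketch of the two dispatch strategies being weighed. It is not pandas's actual block-selection code: the two dtype classes and both functions are hypothetical stand-ins, and the sketch assumes the third-party dtype reports `kind == "M"`, as the discussion of the `dtype.kind` check suggests. The block names are used only as labels.

```python
class FirstPartyTZDtype:
    """Hypothetical stand-in for the built-in tz-aware, nanosecond dtype."""
    kind = "M"
    unit = "ns"


class ThirdPartyTimestampDtype:
    """Hypothetical stand-in for an Arrow-backed, microsecond dtype."""
    kind = "M"  # assumption: the third-party dtype also reports datetime kind
    unit = "us"


FIRST_PARTY_DTYPES = (FirstPartyTZDtype,)


def block_type_by_kind(dtype) -> str:
    # Kind-only dispatch: anything with kind "M" is treated as first-party.
    if dtype.kind == "M":
        return "DatetimeTZBlock"
    return "ExtensionBlock"


def block_type_third_party_first(dtype) -> str:
    # Rule out third-party dtypes before applying any first-party checks.
    if not isinstance(dtype, FIRST_PARTY_DTYPES):
        return "ExtensionBlock"
    if dtype.kind == "M":
        return "DatetimeTZBlock"
    return "ExtensionBlock"


third_party = ThirdPartyTimestampDtype()
print(block_type_by_kind(third_party))            # misrouted by the kind check
print(block_type_third_party_first(third_party))  # kept as a generic extension block
```

The second function is the asymmetric treatment the comment warns about: it works, but only by special-casing which dtypes count as first-party.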

So getting back to the motivation: how high a priority is this?

One thing I can unambiguously encourage is more tests, even if xfailed:

  • what happens if you pass one of these to the DatetimeIndex constructor? vice-versa?
  • what happens if I do DatetimeIndex.astype(this_new_ea_dtype)?
  • addition/subtraction with the gamut of datetime/timedelta scalars/arrays we already support?
  • How does this behave if you stuff it inside a Categorical/CategoricalIndex?

Contributor Author

I would prefer to fix this myself though.

Sounds good. Is this a use case you have a need to get working near-term, or more of a Principle Of The Thing? I ask because...

More in the next-6-months range, so I'm definitely going to add an xfail here; the points below indicate that this needs more thought rather than a quick fix.

I would love to have a nullable, non-nanosecond timestamp (actually I desperately need it, but e.g. having a performant string is more important to me), but there are several other places that either assume that all timestamps are nanoseconds or backed by a numpy-array, so this is going to be a major effort.

or shouldn't is_datetime64tz_dtype return True for my ExtensionDtype?

I'd be very reluctant to make that change, since I think a lot of code expects that to imply it's getting our DatetimeTZDtype. Maybe an is_3rd_party_ea_dtype that we would check for before checking for any 1st-party dtypes? That runs into the "ideally we should treat 3rd party EAs symmetrically with 1st-party" problems.

So getting back to the motivation: how high a priority is this?

As already pointed out: Less than other things I want to contribute to pandas, so xfailing and adding more (possibly) xfailing tests is the way to go.

Member

so xfailing and adding more (possibly) xfailing tests is the way to go.

Sounds good.

actually I desperately need [...] that either assume that all timestamps are nanoseconds or backed by a numpy-array

Would your need be solved if we get numpy-backed non-nano in place? There's a reasonable chance of that happening in the next 6 months.

Contributor Author

Would your need be solved if we get numpy-backed non-nano in place? There's a reasonable chance of that happening in the next 6 months.

For now: Yes.

Member

For now: Yes.

I'm slowly tackling this from the cython side of the code. The parallelizable step is to comb through the rest of the code to find all the places where we implicitly/explicitly assume nanos. I'd start with pandas/plotting and pandas/io.
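
As an illustration of why an implicit nanosecond assumption matters, here is a small standard-library sketch (no pandas or NumPy involved). The helper `from_epoch` and the sample value are hypothetical; the sketch only shows what happens when an integer stored in microseconds is read as if it were nanoseconds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch(value: int, unit: str) -> datetime:
    """Convert an integer offset from the Unix epoch, given its unit."""
    if unit == "us":
        return EPOCH + timedelta(microseconds=value)
    if unit == "ns":
        # datetime only has microsecond resolution, so floor to microseconds.
        return EPOCH + timedelta(microseconds=value // 1000)
    raise ValueError(f"unsupported unit: {unit!r}")


# A timestamp stored with microsecond resolution.
stored_us = 1_600_000_000_000_000

correct = from_epoch(stored_us, "us")  # read with the unit it was stored in
misread = from_epoch(stored_us, "ns")  # read under a nanosecond assumption

print(correct.isoformat())
print(misread.isoformat())
```

The misread value lands a factor of 1000 closer to the epoch, which is the kind of silent error that a code path assuming nanoseconds would produce for a microsecond-based array.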

Member

Let's see if we can at least get this one working.

I think we'll need to edit the dtype.kind check in is_datetime64tz_dtype, and possibly the issubclass(vtype, np.datetime64) check in internals.blocks.get_block_type.

@xhochy
Contributor Author

xhochy commented Jul 1, 2020

xfail added, CI is now happy.

@jreback jreback changed the title BUG: Add failing unit test for GH#34986 BUG: Cannot create third-party ExtensionArrays for datetime types (xfail) Jul 2, 2020
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jul 2, 2020
@@ -67,6 +68,26 @@ def construct_array_type(cls) -> Type["ArrowStringArray"]:
        return ArrowStringArray


@register_extension_dtype
Contributor

can you put these in the test file for now as I am not sure we agree on these names (and is just used for testing ATM).

Contributor Author

I can move them but I wanted to keep the dtype here as done for the other test-Arrow-dtypes.

Contributor Author

Done. CI passed except for the Docs job, but the warnings about missing sparse methods are unrelated to this PR.

@simonjayhawkins simonjayhawkins added the Needs Review Waiting for review/response from a maintainer. label Sep 15, 2020
@jbrockmendel
Member

can you merge master and we'll see if we can get this in

@xhochy xhochy force-pushed the issue-34986 branch 3 times, most recently from fd562df to 6d92caa Compare October 14, 2020 11:26
@xhochy
Contributor Author

xhochy commented Oct 14, 2020

@jbrockmendel Rebased and all green except one Windows job that timed out.

@jbrockmendel
Member

I think the edit to get_block_type in #34683 might fix the test that fails here. can you confirm? if that is fixed, presumably the rest of the EA test suite still needs to be enabled for this EA?

@xhochy
Contributor Author

xhochy commented Nov 11, 2020

I think the edit to get_block_type in #34683 might fix the test that fails here. can you confirm? if that is fixed, presumably the rest of the EA test suite still needs to be enabled for this EA?

Yes, merging in #34683 fixes the test.

I'm not sure whether it would really be worth getting the full suite running for this test EA. It is basically here to check for the regression, and getting the whole suite to pass would be a lot more work that I don't see as worthwhile currently.

@jbrockmendel
Member

I'm not sure whether it would really be worth getting the full suite running for this test EA. It is basically here to check for the regression, and getting the whole suite to pass would be a lot more work that I don't see as worthwhile currently.

totally reasonable. i guess we can merge this now and then if/when #34683 makes this pass we can revisit getting other bits working.

cc @jreback

@jbrockmendel
Member

@xhochy can you merge master, hopefully we'll get the CI green and can get this in

@pep8speaks

pep8speaks commented Dec 9, 2020

Hello @xhochy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-14 13:50:47 UTC

@xhochy xhochy force-pushed the issue-34986 branch 3 times, most recently from 877c401 to e393ca6 Compare December 10, 2020 08:21
Member

@jbrockmendel jbrockmendel left a comment

LGTM

@simonjayhawkins simonjayhawkins removed the Needs Review Waiting for review/response from a maintainer. label Dec 10, 2020
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 14, 2021
@xhochy
Contributor Author

xhochy commented Jan 14, 2021

@jbrockmendel @jreback Rebased and removed xfail as it is working now.

@jreback jreback added this to the 1.3 milestone Jan 14, 2021
import pandas as pd
from pandas.api.extensions import ExtensionDtype, register_extension_dtype

pytest.importorskip("pyarrow", minversion="0.13.0")
Contributor

this could technically be later but ok for now

@jreback jreback merged commit de8fd00 into pandas-dev:master Jan 14, 2021
@jreback
Contributor

jreback commented Jan 14, 2021

thanks @xhochy

@xhochy xhochy deleted the issue-34986 branch January 14, 2021 18:59
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Cannot create third-party ExtensionArrays for datetime types
6 participants