ENH: add in extension dtype registry #21185

Merged
jreback merged 22 commits into pandas-dev:master from jreback:eat on Jul 3, 2018

5 participants
@jreback
Contributor

commented May 24, 2018

precursor to #21160

@jreback jreback added this to the 0.24.0 milestone May 24, 2018

@codecov


commented May 24, 2018

Codecov Report

Merging #21185 into master will increase coverage by <.01%.
The diff coverage is 92.98%.


@@            Coverage Diff             @@
##           master   #21185      +/-   ##
==========================================
+ Coverage    91.9%   91.91%   +<.01%     
==========================================
  Files         154      154              
  Lines       49659    49673      +14     
==========================================
+ Hits        45640    45657      +17     
+ Misses       4019     4016       -3
Flag Coverage Δ
#multiple 90.29% <92.98%> (ø) ⬆️
#single 41.95% <75.43%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/format.py 98.25% <ø> (ø) ⬆️
pandas/core/algorithms.py 94.85% <100%> (ø) ⬆️
pandas/core/dtypes/base.py 92.3% <100%> (+0.2%) ⬆️
pandas/core/series.py 94.19% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 93.13% <100%> (ø) ⬆️
pandas/core/dtypes/common.py 95.07% <100%> (+0.44%) ⬆️
pandas/core/arrays/base.py 87.85% <100%> (+0.26%) ⬆️
pandas/core/dtypes/cast.py 88.36% <50%> (-0.13%) ⬇️
pandas/core/internals.py 95.52% <81.81%> (-0.07%) ⬇️
pandas/core/dtypes/dtypes.py 95.98% <96.77%> (+0.03%) ⬆️
... and 2 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad76ffc...d2c91d7. Read the comment docs.

^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instaniate a registered ``DecimalDtype`` (:issue:`21185`)

@jbrockmendel

jbrockmendel May 24, 2018

Member

typo instantiate

@Dr-Irv
Contributor

left a comment

It seems to me that there are changes in this PR that go beyond just adding in the extension dtype registry. Wasn't your goal to just make this PR about the registry?

@jreback jreback force-pushed the jreback:eat branch from 78abef9 to 9b2cdb0 May 24, 2018

@jreback jreback referenced this pull request May 24, 2018

Closed

ENH: extension ops #21191

@jreback

Contributor Author

commented May 24, 2018

split out ops to a separate PR #21191

@jorisvandenbossche
Member

left a comment

Didn't look at all the tests yet.

In addition:

  • We need to discuss what is the exact contract for this registry and array_type: where is it used, and how is it used?
    I think now it is used in constructors or astype if you specify a (string) dtype? But then it calls the class constructor of the extension array? Shouldn't this rather be the _from_sequence method ?
  • Can you add some explanation of the registry to the extension array documentation?
  • I think this also gives an enhancement for the existing types (interval, datetimetz, period, ..) that you can now specify string dtypes? (since they are added to the registry) In that case, can you add tests for this and a whatsnew notice?
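One way to read the contract being asked about is sketched below in plain Python. This is not pandas code: `ToyRegistry`, `ToyDecimalDtype`, `ToyDecimalArray`, and `array_from_string` are hypothetical names invented for illustration. Only the ideas of a registry, `construct_from_string`, `construct_array_type`, and `_from_sequence` come from this thread, and the sketch assumes the string lookup ends in a call to `_from_sequence` rather than the array's class constructor.

```python
from decimal import Decimal


class ToyDecimalArray:
    """Hypothetical array type; stands in for an ExtensionArray subclass."""

    def __init__(self, values):
        self._values = list(values)

    @classmethod
    def _from_sequence(cls, scalars, copy=False):
        # Coerce every scalar to the dtype's scalar type. Building a new
        # list always copies here, so `copy` only mirrors the signature.
        return cls(Decimal(str(s)) for s in scalars)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        return self._values[index]


class ToyDecimalDtype:
    """Hypothetical dtype that can be looked up by its string name."""

    name = "decimal"

    @classmethod
    def construct_from_string(cls, string):
        if string == cls.name:
            return cls()
        raise TypeError(f"cannot construct a {cls.__name__} from {string!r}")

    @classmethod
    def construct_array_type(cls):
        return ToyDecimalArray


class ToyRegistry:
    """Hypothetical registry: asks each registered dtype about a string."""

    def __init__(self):
        self.dtypes = []

    def register(self, dtype_cls):
        self.dtypes.append(dtype_cls)

    def find(self, string):
        for dtype_cls in self.dtypes:
            try:
                return dtype_cls.construct_from_string(string)
            except TypeError:
                continue
        return None


def array_from_string(registry, values, dtype_string):
    # string -> dtype instance -> array class -> _from_sequence
    dtype = registry.find(dtype_string)
    if dtype is None:
        raise TypeError(f"no registered dtype understands {dtype_string!r}")
    return dtype.construct_array_type()._from_sequence(values)


registry = ToyRegistry()
registry.register(ToyDecimalDtype)
arr = array_from_string(registry, [1, "2.5"], "decimal")
```

In this reading the registry is consulted only when the user passes a string; passing a dtype instance directly would skip `find` altogether.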
ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`)

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

Is the addition of dropna and append orthogonal to the registry, or are they in some way needed for it?

I think we should be hesitant in adding more and more methods to the ExtensionArray if they are not strictly needed

@jreback

jreback May 24, 2018

Author Contributor

No, these are necessary for follow-on PRs. The API is not fleshed out enough.

"""Construct a new ExtensionArray from a sequence of scalars.

Parameters
----------
scalars : Sequence
    Each element will be an instance of the scalar type for this
    array, ``cls.dtype.type``.
copy : boolean, default True
    if True, copy the underlying data

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

What does this exactly mean?
Often, the scalar objects are not stored as is, and then copying or not copying makes no difference. Or, if they are stored as is, they don't necessarily have a 'copy' method.

@jreback

jreback May 24, 2018

Author Contributor

Same guarantee with copy that we already have: if it's True, then copy if possible.

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

Can you add that to the docstring?

* _concat_same_type
* array_type

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

This is not an attribute of the array class I think?


# dispatch on extension dtype if needed
if is_extension_array_dtype(dtype):
    return dtype.array_type._from_sequence(arr, copy=copy)

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

should there be any validation about arr ? Eg that it is 1D?

@jreback

jreback May 24, 2018

Author Contributor

this validation is a contract of the Array itself
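A minimal sketch of what "a contract of the Array itself" could look like: the dispatching code stays free of checks, and the array type's `_from_sequence` rejects input that is not one-dimensional. `ToyIntArray` and `dispatch` are hypothetical names, and the nesting check is a deliberately crude assumption about how 1-D input might be detected.

```python
class ToyIntArray:
    """Hypothetical 1-D array type; a stand-in for an ExtensionArray subclass."""

    def __init__(self, values):
        self._values = values

    @classmethod
    def _from_sequence(cls, scalars, copy=False):
        values = []
        for scalar in scalars:
            # The array type, not the caller, rejects nested (non 1-D) input.
            if isinstance(scalar, (list, tuple)):
                raise ValueError("ToyIntArray requires a 1-D sequence of scalars")
            values.append(int(scalar))
        return cls(values)

    def __len__(self):
        return len(self._values)


def dispatch(arr, array_type, copy=False):
    # Mirrors the quoted dispatch: no validation here, the array type owns it.
    return array_type._from_sequence(arr, copy=copy)


ok = dispatch([1, 2, 3], ToyIntArray)

try:
    dispatch([[1, 2], [3, 4]], ToyIntArray)
except ValueError as exc:
    error_message = str(exc)
```

The trade-off is that every array implementation has to repeat its own checks, but the caller does not need to know what a valid input looks like for a given type.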

"""
Parameters
----------
dtype : PandasExtension Dtype

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

Can you document the constructor parameter?
and "PandasExtension Dtype" -> "ExtensionDtype"

----------
dtype : PandasExtension Dtype
"""
if not issubclass(dtype, (PandasExtensionDtype, ExtensionDtype)):

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

Shouldn't ExtensionDtype be enough? (Our internal ones subclass both, I think?)

@jreback

jreback May 24, 2018

Author Contributor

Not yet, our extension dtypes are not all ExtensionDtypes yet (this is why we have the whole PandasExtensionDtype business).

if string.startswith('period[') or string.startswith('Period['):
    # do not parse string like U as period[U]
    return PeriodDtype.construct_from_string(string)
raise TypeError("could not construct PeriodDtype")

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

Should we consider rather changing construct_from_string itself? As those don't really adhere to the ExtensionDtype requirements I think?

If we keep both, I would make this one private.
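To make the suggestion concrete, here is a sketch of a `construct_from_string` that is strict on its own, so that no second, stricter wrapper is needed. `ToyPeriodDtype` is a hypothetical class, not the pandas `PeriodDtype`, and treating the text inside the brackets as an opaque frequency string is an assumption made to keep the example short.

```python
class ToyPeriodDtype:
    """Hypothetical dtype that only accepts the explicit period[...] spelling."""

    _prefixes = ("period[", "Period[")

    def __init__(self, freq):
        self.freq = freq

    @classmethod
    def construct_from_string(cls, string):
        # A bare frequency such as "U" is rejected on purpose, so it is
        # never silently read as period[U].
        if (
            isinstance(string, str)
            and string.startswith(cls._prefixes)
            and string.endswith("]")
        ):
            freq = string[len("period["):-1]  # both prefixes have equal length
            if freq:
                return cls(freq)
        raise TypeError(f"could not construct ToyPeriodDtype from {string!r}")


daily = ToyPeriodDtype.construct_from_string("period[D]")

try:
    ToyPeriodDtype.construct_from_string("U")
except TypeError:
    bare_freq_rejected = True
else:
    bare_freq_rejected = False
```

Raising `TypeError` for strings the dtype does not own is what lets a registry try several dtypes in turn and move on when one of them declines.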

@jreback

jreback May 24, 2018

Author Contributor

done


class BaseOpsTests(BaseExtensionTests):
    """Various Series and DataFrame ops methods."""
    pass

@jorisvandenbossche

jorisvandenbossche May 24, 2018

Member

This is a left-over from splitting it from the previous PR?

@jreback jreback force-pushed the jreback:eat branch from 9b2cdb0 to a854f06 May 24, 2018

@jreback

Contributor Author

commented May 24, 2018

all fixed up @jorisvandenbossche (except for docs on Registry in Extension section)

@jreback jreback force-pushed the jreback:eat branch from a854f06 to f5a0c24 May 25, 2018

@jreback

Contributor Author

commented May 25, 2018

See if I can make this dispatch to create_array_type as a function.

@jreback jreback force-pushed the jreback:eat branch 3 times, most recently from 92a1322 to 10aab3c May 29, 2018

@jreback jreback force-pushed the jreback:eat branch from 10aab3c to ff34f01 Jun 3, 2018

@jreback

Contributor Author

commented Jun 3, 2018

updated docs

@jorisvandenbossche @TomAugspurger if any final comments.

@jorisvandenbossche
Member

left a comment

Added some more comments.

From my previous review:

We need to discuss what is the exact contract for this registry and array_type: where is it used, and how is it used?

Can you add some explanation about this to the docs/docstrings? (eg that it relies on _from_sequence)

type
"""
if array is None:
    return cls

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

I think cls is never the correct return? (since it is a dtype class, while this should return an array class?)

""" Registry for dtype inference
We can directly construct dtypes in pandas_dtypes if they are
a type; the registry allows us to register an extension dtype

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

The sentence "We can directly construct dtypes in pandas_dtypes if they are a type" is not really clear to me. What does "if they are a type" mean? To which context does this refer?

'dtype, expected',
[('int64', None),
('interval', IntervalDtype()),
('interval[int64]', IntervalDtype()),

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

can you replace this (or also add) one that is not the default?
Eg ('interval[datetime64[ns]]', IntervalDtype('datetime64[ns]')),

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

(although that is maybe covered with the period or datetime64 one below. Is 'interval[datetime64[ns]]' already tested for IntervalDtype?)

dtype = data.dtype

expected = pd.Series(data)
result = pd.Series(np.array(data), dtype=dtype)

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

Do we have a restriction on what the result of __array__ should be? Maybe not necessarily the scalars?

So maybe better to do list(data) or data.astype(object)

@pytest.mark.xfail(reason="not implemented constructor from dtype")
def test_from_dtype(self, data):
    # construct from our dtype & string dtype
    pass

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

Can you actually implement this? (only change needed would be to register DecimalDtype I think?)
As that way we actually have a test for external dtypes registering?

@jreback

jreback Jun 19, 2018

Author Contributor

I do that in subsequent PRs; I'm trying to keep this diff down.

# TODO(extension)
@pytest.mark.xfail(reason=(
"raising AssertionError as this is not implemented, "
"though easy enough to do"))

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

I think you can change the AssertionError to a ValueError in the code, and then we can still test this.

@jreback

jreback Jun 19, 2018

Author Contributor

this is ok for now

@pytest.mark.xfail(reason="not implemented constructor from dtype")
def test_from_dtype(self, data):
    # construct from our dtype & string dtype
    pass

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

Can we test here that an appropriate error is raised (instead of skipping the test) ?

@jreback

jreback Jun 19, 2018

Author Contributor

see above

@@ -109,6 +109,11 @@ class ExtensionDtype(_DtypeOpsMixin):
* name
* construct_from_string
Optionally one can override construct_array_type for construction
with the name of this dtype via the Registry

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

I would give some additional explanation of what the Registry is, because now this is not explained here?

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

As I said before, I am not convinced we should necessarily add those methods (certainly append), so would personally leave this for another PR to discuss

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

Agreed with not adding append.

Could go either way on dropna.

@jreback

jreback Jun 19, 2018

Author Contributor

I need them for subsequent PRs; I guess I will move them.


- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore
the dtype has gained the ``construct_array_type`` (:issue:`21185`)

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

'the dtype' -> ExtensionDtype has gained :meth:`ExtensionDtype.construct_array_type`

.. _whatsnew_0240.api.other:

Other API Changes
^^^^^^^^^^^^^^^^^

-
- Invalid consruction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`)

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

consruction -> construction.

@@ -379,6 +383,16 @@ def fillna(self, value=None, method=None, limit=None):
new_values = self.copy()
return new_values

def dropna(self):
""" Return ExtensionArray without NA values

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

Remove the leading space. Add a trailing .

Parameters
----------
array : array-like, optional

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

@jorisvandenbossche by "this" you mean the array argument right? I'm also wondering that.

@@ -8,6 +8,64 @@
from .base import ExtensionDtype, _DtypeOpsMixin


class Registry(object):

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

Without looking at the uses yet, could we simplify this a bit by just allowing string lookup? Ideally, the registry would be a simple class holding a dict mapping dtype.name -> ExtensionDtype.

@jorisvandenbossche

jorisvandenbossche Jun 4, 2018

Member

I think the current code also supports finding the dtype for eg 'interval[int64]' and not just interval (so parametrized versions of the strings)

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

Oh, and I suppose we want that to support .astype('interval[int64]'). That's fair...
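The point of this exchange can be illustrated with a small sketch: a dict keyed on `dtype.name` only knows the bare name, while asking the dtype to parse the string also handles parametrized spellings. `ToyIntervalDtype` and `find` are hypothetical and are not the pandas implementation; the regular expression is an assumption about how such strings might be parsed.

```python
import re


class ToyIntervalDtype:
    """Hypothetical parametrized dtype; not the pandas IntervalDtype."""

    name = "interval"
    _pattern = re.compile(r"interval(\[(?P<subtype>.+)\])?$")

    def __init__(self, subtype=None):
        self.subtype = subtype

    @classmethod
    def construct_from_string(cls, string):
        match = cls._pattern.match(string)
        if match is None:
            raise TypeError(f"cannot construct an interval dtype from {string!r}")
        return cls(match.group("subtype"))


# A name-keyed dict misses every parametrized spelling of the dtype.
by_name = {ToyIntervalDtype.name: ToyIntervalDtype}
dict_hit = by_name.get("interval[int64]")


def find(dtype_classes, string):
    # Ask each dtype in turn whether it can build itself from the string.
    for dtype_cls in dtype_classes:
        try:
            return dtype_cls.construct_from_string(string)
        except TypeError:
            continue
    return None


scan_hit = find([ToyIntervalDtype], "interval[int64]")
nested_hit = find([ToyIntervalDtype], "interval[datetime64[ns]]")
```

The scan costs one parse attempt per registered dtype, which is the price of supporting calls like `.astype('interval[int64]')` without teaching the registry each dtype's string grammar.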

@@ -800,7 +800,7 @@ def astype(self, dtype, copy=True):
     @cache_readonly
     def dtype(self):
         """Return the dtype object of the underlying data"""
-        return IntervalDtype.construct_from_string(str(self.left.dtype))
+        return IntervalDtype(str(self.left.dtype))

@TomAugspurger

TomAugspurger Jun 4, 2018

Contributor

left.dtype.name? I'm not sure when these differ, but using .name seems safer.

@jreback jreback force-pushed the jreback:eat branch from ff34f01 to 558a639 Jun 19, 2018

@jreback

Contributor Author

commented Jun 19, 2018

ok all fixed up. @jorisvandenbossche @TomAugspurger if you want to look.

@Dr-Irv

Contributor

commented Jun 19, 2018

@jreback I'm wondering if there is a meta issue with the registry that should be thought about. Let's say that pandas registers 10 types - they are always in the registry for all pandas users. Now let's say you have 2 different people who independently extend pandas with the EA capabilities creating libraries ABC and DEF. But they happen to both call their extension array type MyEAType . Now if someone wants to use both libraries ABC and DEF, there will be a collision in the registry of MyEAType. So does there need to be some "master" registry in the documentation containing the names of all EA types that people have registered?

I realize this is a highly unlikely occurrence.

Put another way, shouldn't there be code for the registry that says "this type has already been registered"?

Also, the implementation of the registry is using a list. Shouldn't you use a dict or set?
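For illustration only, a guard of the "this type has already been registered" kind might look like the sketch below. It is not what this PR implements, and all names (`GuardedRegistry`, `AbcDtype`, `DefDtype`) are hypothetical; the sketch assumes a collision is defined as two different classes sharing the same `name`.

```python
class GuardedRegistry:
    """Hypothetical registry that refuses a second dtype with the same name."""

    def __init__(self):
        self._dtypes = []

    def register(self, dtype_cls):
        for existing in self._dtypes:
            if existing is dtype_cls:
                return dtype_cls  # registering the same class twice is a no-op
            if existing.name == dtype_cls.name:
                raise ValueError(
                    f"dtype name {dtype_cls.name!r} is already registered "
                    f"by {existing.__module__}.{existing.__qualname__}"
                )
        self._dtypes.append(dtype_cls)
        return dtype_cls


class AbcDtype:
    """Stands in for the dtype shipped by library ABC."""
    name = "MyEAType"


class DefDtype:
    """Stands in for the dtype shipped by library DEF."""
    name = "MyEAType"


registry = GuardedRegistry()
registry.register(AbcDtype)

try:
    registry.register(DefDtype)
except ValueError:
    collision_detected = True
else:
    collision_detected = False
```

A name check like this only catches identical names; two parametrized dtypes that both accept the same string would still need to be resolved by registration order.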

@jreback

Contributor Author

commented Jun 19, 2018

@Dr-Irv you don't need to register the EA types; this is only for the purpose of translating a string dtype name -> EA dtype (e.g. 'categorical' -> Categorical), basically for convenience.

So sure, you could have a collision, but then the user would have to deal with that (this wouldn't show up in pandas itself because we won't have colliding names).

@jreback

Contributor Author

commented Jun 19, 2018

@Dr-Irv I had this as an OrderedDict, but it's actually not necessary, and is actually not correct. If we have a string, we have to ask the dtypes whether they can construct a type from it. The issue is that multiple strings can map to a dtype when it's parameterized, e.g. datetime64[ns, US/Eastern] is a valid string dtype, as is datetime64[ns, Asia/Tokyo], and without imbuing the Registry with the knowledge of which strings are 'good' (a bad idea), we can't know something works unless we duck type it.

@Dr-Irv

Contributor

commented Jun 19, 2018

@jreback Thanks for clarifying that registering EA types is optional rather than required.

On the implementation, wouldn't OrderedDict be faster once there are a lot of types?

@TomAugspurger
Contributor

left a comment

Aside from adding EA.append, I'm +1 on this.

Given that NumPy doesn't implement it, I don't think we should either.

@jreback

Contributor Author

commented Jun 20, 2018

I had removed append, FYI (it should no longer be in this PR).

@jreback jreback force-pushed the jreback:eat branch from dd74832 to 5fabd51 Jul 2, 2018

@jreback

Contributor Author

commented Jul 2, 2018

Any final comments? Going to rebase.

@jreback

Contributor Author

commented Jul 2, 2018

@jreback

Contributor Author

commented Jul 3, 2018

merging.

@jreback jreback merged commit 2f14faf into pandas-dev:master Jul 3, 2018

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

alimcmaster1 added a commit to alimcmaster1/pandas that referenced this pull request Aug 12, 2018

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Aug 14, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
