Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Internal / External values #19558

Merged
merged 42 commits into from
Feb 13, 2018

Conversation

TomAugspurger
Copy link
Contributor

Closes #19548

Just changed Index so far. Need to

  • Run an ASV
  • add Series._ndarray_values
  • Start cleaning up now redundant if conditions.

@TomAugspurger TomAugspurger added Refactor Internal refactoring of code Clean labels Feb 6, 2018
always an ``ndarray`` or ``ExtensionArray``).

So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is
the underlying ndarray.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what Series[categorical]._ndarray_values should be, categories or codes? PeriodIndex._ndarray_values is the ordinals, so maybe codes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a) this should remove .get_values() as well completely, adding something w/o removing the old to me is a non-starter here.

I see lots of special cases. This points to a problem in the interface. if you are doing is_......_ then it needs to be fixed in ExtensenionArray (IOW have a dispatch for that function) and overrides in the subclasses. calling out Categorical here.

Values
~~~~~~

Pandas extends NumPy's type system in a few places, so we have multiple notions of "values" floating around.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence is totally not clear to a new reader


* ``cls._ndarray_values`` is *always* and ``ndarray``
* ``cls._values`` refers is the "best possible" array. This could be an ``ndarray``, ``ExtensionArray``, or
in ``Index`` subclass (note: we're in the process of removing the index subclasses here so that it's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are using internal (pun intended) jargon here

always an ``ndarray`` or ``ExtensionArray``).

So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is
the underlying ndarray.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -973,8 +999,12 @@ def value_counts(self, normalize=False, sort=True, ascending=False,
@Appender(_shared_docs['unique'] % _indexops_doc_kwargs)
def unique(self):
values = self._values

if isinstance(values, ABCDatetimeIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you special casing?

return self._shallow_copy_with_infer(np.concatenate(to_concat), **attribs)
arrays = []
for x in to_concat:
if is_categorical_dtype(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a special case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger don't we want np.asarray(x, dtype=object) for all extension dtypes in this case?

So for Period and Interval _values (used some lines below for Index types) currently still returns object array, but I think the purpose is that those will also return their own Array class, so using np.asarray(x, dtype=object) will do the same now and is more future proof?

@@ -2208,27 +2208,37 @@ def union(self, other):
other = other.astype('O')
return this.union(other)

if is_categorical_dtype(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these special caes?

I think eventually we'll want to ndarray_values for this, but it'll
require a bit more work to support. Currently, using ndarary_values
causes occasional failures on categorical.
(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a)
(cherry picked from commit fbf0a06)
(cherry picked from commit b20e12c)
@codecov
Copy link

codecov bot commented Feb 8, 2018

Codecov Report

Merging #19558 into master will increase coverage by 0.01%.
The diff coverage is 97.32%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19558      +/-   ##
==========================================
+ Coverage   91.57%   91.58%   +0.01%     
==========================================
  Files         150      150              
  Lines       48817    48864      +47     
==========================================
+ Hits        44704    44752      +48     
+ Misses       4113     4112       -1
Flag Coverage Δ
#multiple 89.95% <97.32%> (+0.01%) ⬆️
#single 41.75% <43.75%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/io/pytables.py 92.45% <100%> (ø) ⬆️
pandas/core/base.py 96.79% <100%> (+0.01%) ⬆️
pandas/core/indexes/period.py 93.04% <100%> (+0.05%) ⬆️
pandas/core/indexes/datetimes.py 95.32% <100%> (+0.06%) ⬆️
pandas/core/indexes/base.py 96.46% <100%> (+0.02%) ⬆️
pandas/core/dtypes/cast.py 87.98% <100%> (ø) ⬆️
pandas/core/indexes/numeric.py 97.24% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 99.14% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 92.44% <100%> (+0.04%) ⬆️
pandas/core/arrays/categorical.py 94.89% <100%> (+0.01%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d9551c8...3af8a21. Read the comment docs.

@pep8speaks
Copy link

pep8speaks commented Feb 8, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 13, 2018 at 11:54 Hours UTC

@TomAugspurger
Copy link
Contributor Author

Still a WIP, but this should hopefully pass.

I'm going to make a pass through the library to remove things that are currently doing checks for Series vs. Index, which should clean things up.

I've also cherry picked ExtensionArray.astype stuff from #19604, so the diff in lines in this PR is a bit smaller than it looks.

Setops are even more of a mess now. I have a biggish refactor started (with nice perf boosts), but didn't finish it today. It's tricky since the current implementation was relying on self._inner_indexer(self._values, ...) raising in some cases that now won't raise (period or categorical I think). I think that it's OK to accept a bit of technical debt in this case, and I'll pay it down later.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should remove .get_values() as well completely, adding something w/o removing the old to me is a non-starter here.

@jreback if you are speaking about Series.get_values and Index.get_values, then I don't think this related. Yes, ideally we can implement get_values cleanly with one of the other ones, but get_values is a public function that should not just be removed or changed. We can certainly discuss deprecating it, but IMO that is separate of this internal cleaning up.

index | values | _values | _ndarray_values |
----------------- | -------------- -| ----------- | --------------- |
CategoricalIndex | Categorical | Categorical | codes |
DatetimeIndex[tz] | ndarray[M8ns] | DTI[tz] | datetime@UTC |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime@UTC -> is the same as ndarray[M8ns] in the 'values' column ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll fix that to be consistent.

index | values | _values | ndarray_values |
----------------- | --------------- | ----------- | -------------- |
PeriodIndex | ndarray[object] | PeriodArray
IntervalIndex | IntervalArray | ndarray[Interval]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the other way around?
and for PeriodIndex, _ndarray_values is ndarrary[ordinals] ? (for Interval there is not really a native ndarray representation possible, so that is also object array ?

@@ -293,6 +304,23 @@ def values(self):
""" return the underlying data, which is a Categorical """
return self._data

@property
def _values(self):
return self._data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between ._data and .values ? As below you use .values for itemsize and nbytes, but both seems to point to the underlying Categorical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just me being inconsistent.

@@ -769,7 +797,7 @@ def _evaluate_compare(self, other):

def _delegate_method(self, name, *args, **kwargs):
""" method delegation to the ._values """
method = getattr(self._values, name)
method = getattr(self.values, name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be unnecessary. Reverting.

@@ -799,9 +799,11 @@ def values(self):
box = hasattr(lev, '_box_values')
# Try to minimize boxing.
if box and len(lev) > len(lab):
taken = lev._box_values(algos.take_1d(lev._values, lab))
taken = lev._box_values(algos.take_1d(lev._values,
lab))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't the algos functions always assume ndarray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed (the only one that should be ._values is the one on line 809 below).

elif isinstance(l_values, pd.Index):
tm.assert_index_equal(l_values, r_values)
elif pd.api.types.is_categorical(l_values):
tm.assert_categorical_equal(l_values, r_values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we should maybe add an assert_array_equal that handles the different extension arrays as well, then such a special case should not be needed here?

assert r_values.dtype == dtype


def test_values_periodindex():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also tested in the test above?

@jreback
Copy link
Contributor

jreback commented Feb 9, 2018

.get_values() should be deprecated, if you want to implement in terms of an internal function, ok. but the external API needs fixing. This was never meant to be exposed and is useless. let's clean that up first

@jorisvandenbossche
Copy link
Member

external API needs fixing

What needs to be fixed? Or do just mean deprecating it?

let's clean that up first

That doesn't need to happen first, it can be done independently of this. I will open an issue for it.

@TomAugspurger TomAugspurger changed the title [WIP]REF: Internal / External values REF: Internal / External values Feb 12, 2018
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, but in general this is looking good to me.

For Index, we now also have the _data attribute. Is it the idea that in the future this will become the same as _values ?

@@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None):

def _concat_index_same_dtype(indexes, klass=None):
klass = klass if klass is not None else indexes[0].__class__
return klass(np.concatenate([x._values for x in indexes]))
return klass(np.concatenate([x._ndarray_values for x in indexes]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is only used for numeric indices, so _values or _ndarray_values should not matter.
I was mainly thinking that for those cases where _ndarray_values actually differs from _values, like a categorical, the above code will not work anyway, as klass(codes) is not enough to reconstruct the categorical. So _values seems safer to me.

# type: () -> np.ndarray
"""Internal pandas method for lossy conversion to a NumPy ndarray.

This method is not part of the pandas interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.

If it is not part of the interface, we could also define it on our own subclasses without defining it here.
But on the other hand, that will raise errors when an external extension array does not have this method. Eg some attributes on Series will call into this, which makes it somehow part of the interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.

This simplifies the implementation since

a.) we don't have to have a PandasExtensionArray that subclasses ExtensionArray, just to define this method.
b.) we can safely call ._values anywhere in our code without checking whether it's an extension array or a pandas extension array.

My preference is to leave it out of the interface until someone sees an actual need for it. I suspect this could come up if / when we start allowing custom indexes with their own indexing engines, but that seems like a ways down the road...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand those reasons, but that means it is part of the interface (in the sense that if somebody would be stupid to implement a _ndarray_values property that returns something differently, it will break code?)
It's just a part of the interface that we ask not to implement?

To give an example, for GeoPandas, I was wondering if I would overwrite this property to return my self.data (the integer pointers) instead of a materialized array (which can be costly).
For things like Series.strides, Series.itemsize, .. this will be OK (and in principle no user should call this anyway .. so maybe I should not worry about it), but if it is used in other places as well that I am not really sure about, such an implementation might actually give a wrong result (but at the same time, if this is called in certain places for my column with an extension array, I want to know that, because materializing can be costly and I want to avoid this as much as possible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Your GeoPandas concern is a good one. But I'm not sure how to proceeded :/ Do you have a preference for

a.) Adding it to the interface
b.) Leaving it out of the interface, with a default implementation on EA
c.) Making a pandas-specific EA that defines _ndarray_values, so uses of _ndarray_values will need to check for that attr.

._values / ._ndarray_values are meant to be internal, and I don't want to limit future refactorings with backwards compatibility concerns. I'm also not even sure how to document it beyond what's already there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the best option is to leave it for now? :)

Ideally, the Series machinery should not use _ndarray_values too often, as this would require knowledge of what those values mean (codes, ordinals, ...), which Series does not need to know (if there is knowledge required, it should somehow dispatch the operation to the underlying (extension) array itself).

And this is already the case I think. From a quick scan, currently in the PR, I see ndarray_values is mainly used for:

  • indexing (values used in the engine) -> for now we don't support extension arrays to be the data for Index (apart from our own), so this is a worry for later if we want to tackle that
  • Index object operations (eg the setops) -> same as above
  • in some specific cases where we know the parent object (eg we know it is a PeriodIndex, we just want to get the ordinals)
  • for algos (eg take, unique) -> either they dispatch to EA (eg take) or for now the fallback is object array anyhow (eg value_counts), and in general this is something we still need to discuss how to handle in general (factorize, unique, ..)
  • some Series attributes (shape, itemsize, strides, size) -> shape can also be derived from _values directly (like nbytes) as it is part of the EA interface. And from the others, I think size would be nice to not have to materialize the EA (itemsize and strides I don't care about). Can we let this return len(self._values), which is part of the interface? (I am not sure to what extent .size and __len__ on numpy arrays are different / have different performance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your assessment looks correct. Nothing that's series specific currently uses _ndarray_values, so it's really just code that is Index-specific or index / series agnostic.

I can change .shape and .size to use ._values instead of instead of ._ndarary values. I'll just have to override it in DatetimeIndex to avoid a recursion error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.


@property
def nbytes(self):
""" return the number of bytes in the underlying data """
return self._values.nbytes
return self._ndarray_values.nbytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should nbytes be called on self._values instead? (as we require a nbytes implementation on the Extension Array, and numpy arrays already have it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this caused issues for CI, but re-running tests now with this change.

@@ -293,6 +293,15 @@ def values(self):
""" return the underlying data, which is a Categorical """
return self._data

@property
def itemsize(self):
return self.values.itemsize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a comment why you are overwriting this here (default is to use _ndarray_values, which are codes, but Categorical defines itemsize as the itemsize of the categories).
Also, would it be more consistent to do self._values.itemsize? (not that it would differ in this case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More consistent with the parent's implementation, yes, but I'd prefer to use the public API where possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, but I'd prefer to use the public API where possible.

But the public api is not meant to be consistent, but to maintain backwards compatibility :-) (at least for now)

@@ -1317,7 +1319,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
arrays = [[]] * len(names)
elif isinstance(tuples, (np.ndarray, Index)):
if isinstance(tuples, Index):
tuples = tuples._values
tuples = tuples._ndarray_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ndarray_values in this case? Because I think this code path can only work if it is an object index that contains tuples, so should not be different in this case.

Although that maybe raises the question: in such cases, where we know both _values and _ndarray_values will return the same ndarray (eg in all code only for numeric indexes), which one of the two do we take? (might be good to be consistent on this)

@@ -1897,7 +1897,7 @@ def _format(x):

vals = self.values
if isinstance(vals, Index):
vals = vals._values
vals = vals._ndarray_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, an extension array will end up here (using the GenericArrayFormatter), so should this rather be np.array(vals._values) ?
(the implementation of _ndarray_values on an ExtensionArray should not influence its formatting?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like by this point we always have a regular Index, not an extension array, so it doesn't matter in terms of output. Changing to _values since that's clearer.

result = arr.unique()

if isinstance(expected, np.ndarray):
tm.assert_numpy_array_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected is always a DatetimeIndex in the above parametrization

(pd.DatetimeIndex(['2017-01-01T00:00:00'], tz="US/Eastern"),
np.array(['2017-01-01T05:00:00'], dtype='M8[ns]')),
(pd.TimedeltaIndex([10**10]), np.array([10**10], dtype='m8[ns]')),
pytest.mark.xfail(reason='PeriodArray not implemented')((
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why fail? As I think _ndarray_values already are the ordinals, and will stay that way? (it is _values that will change from object array to PeriodArray?)

Copy link
Contributor Author

@TomAugspurger TomAugspurger Feb 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an xfail since Series[period]._ndarray_values will return an object ndarray with teh Periods, while PeriodIndex._ndarray_values returns the ordinals.

Passing this now would require modifying Series._ndarray_values, which is currently just self._internal_values, to special case object dtype to do inference to see if we have periods. I don't think that's worth it since Periods will (hopefully) be a first-class extension dtype by the next release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah missed that the test below tested it for both Series and Index.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 12, 2018

._data seems even more inconsistent

class output
Index ndarray
PeriodIndex ndarray[int] (ordinals)
DatetimeIndex DatetimeIndex (?)
DatetimeIndex[tz] ndarray[datetime64[ns]]
CategoricalIndex Categorical
IntervalIndex None

It hopefully wouldn't be much work to convert uses of ._data to ._values, but I don't have any plans to do that right now. I'll open up a followup issue though. #19658

@TomAugspurger
Copy link
Contributor Author

Thanks for the review. I implemented your suggestions where I didn't comment in reply, and tests pass locally.

@jorisvandenbossche
Copy link
Member

It hopefully wouldn't be much work to convert uses of ._data to ._values, but I don't have any plans to do that right now. I'll open up a followup issue though. #19658

+ 1, that only just got me thinking. If you create an index from an extension array (directly, or eg by doing set_index on a column backed by extension block with extension array), I suppose this creates an object Index (with array of scalars) right now? Might be good to add a test for that? (but that probably belongs in the other PR that implements storing an EA in Series)

@TomAugspurger
Copy link
Contributor Author

I suppose this creates an object Index (with array of scalars) right now? Might be good to add a test for that? (but that probably belongs in the other PR that implements storing an EA in Series)

Yes, and agreed (we hit this indirectly though ._value_counts), but it'd be good to have an explicit test.

@TomAugspurger
Copy link
Contributor Author

Implemented the changes dicussed in #19558 (comment): .shape and .size use self._values.

I implemented those on PeriodIndex and DatetimeIndex, else we would have materialized the array of objects (which is what we do on master), so we get some performance gains there I guess

master:

In [6]: idx = pd.interval_range(0, 100_000)

In [8]: %timeit idx.size
308 ns ± 9.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

pr:

In [3]: %timeit idx.size
1.05 µs ± 9.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

@jorisvandenbossche
Copy link
Member

do 'master' and 'pr' need to be swapped?

@TomAugspurger
Copy link
Contributor Author

Not sure what happened on that timing I posted earlier. I can't reproduce it.

In [1]: import pandas as pd

In [2]: idx = pd.interval_range(0, 100_000)

In [3]: %time idx.shape
CPU times: user 177 ms, sys: 10.1 ms, total: 188 ms
Wall time: 187 ms
Out[3]: (100000,)
In [1]: import pandas as pd

In [2]: idx = pd.interval_range(0, 100_000)

In [3]: %time idx.shape
CPU times: user 18 µs, sys: 2 µs, total: 20 µs
Wall time: 22.9 µs
Out[3]: (100000,)

@TomAugspurger
Copy link
Contributor Author

CI is green again.

``ndarray`` or ``ExtensionArray``).

So, for example, ``Series[category]._values`` is a ``Categorical``, while
``Series[category]._ndarray_values`` is the underlying codes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas.
So I would maybe split this in two: external vs internal details, where the future documentation on how to use and define ExtensionArrays would also go into the external details part (maybe "Extending Pandas" would be a good title, with then information about the different strategies: accessor, extension array, subclassing, ..)

But that's for a separate PR (I can start with that), so for here the above is fine for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal-internals.rst and external-internals.rst ;)

I think that "How to subclass", and the eventual "extending pandas with custom array types" would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas.".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I didn't see that the accessor documentation is actually already there (although, I personally don't find the parquet section that fitting in there, as it is not something typical you need to know when extending pandas. I can already start with moving it to the bottom of the file :-))

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small comments. ok to merge. But before adding new things, a refactor of the existing is in order to avoid all of this special casing. When this is out of sight (meaning not in a current PR), it is easy to forget the special casing. Maybe need a TODO detector . but already:
pandas)

bash-3.2$ grep -r TODO pandas --include '*.py' |wc
     279    2233   22968

maybe put a special TODO(EA) or something to differentiate these

# type: () -> np.ndarray
"""Internal pandas method for lossy conversion to a NumPy ndarray.

This method is not part of the pandas interface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.


@property
def nbytes(self):
""" return the number of bytes in the underlying data """
return self._values.nbytes
return self.values.nbytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not ._values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on what we do with PeriodIndex.values and IntervalIndex.values.

If we change the public .values to be an ExtensionArray then there's no difference.

If we leave the public .values as an ndarray of object, then we prefer ._values since it doesn't materialize the ndarray.

So far we've put off that decision, but we'll need to make a choice soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to override this in DatetimeIndex (tz) then for now, this is the reason CI is failing (recursion error)

On the original comment: I think it is better for now to use ._values instead of .values in such cases. .values is the public one that might not always be consistent for backwards compatibility, so let's internally use always _values to have something certainly consistent? (and if we decide .values to return extension arrays, it is just a public wrapper around ._values, and if we decide not, we don't have to change anything internally)

@property
def _values(self):
# type: () -> Union[ExtensionArray, Index]
# TODO: remove index types as they become is extension arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove 'is'


index | values | _values | _ndarray_values |
----------------- | --------------- | ------------ | --------------- |
PeriodIndex | ndarray[object] | ndarray[obj] | ndarray[int] |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory these should return an Index type for ._values for PI and II, and it might work, but would be disruptive w/o a Block type to hold it. This is why for example we have the is_period_arraylike methods to detect an array of Periods. The holder ATM is a ndarray[object].

@@ -2228,27 +2264,37 @@ def union(self, other):
other = other.astype('O')
return this.union(other)

# TODO: setops-refactor, clean all this up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's do this next then. these if/then/else are really smelly.

@@ -2228,27 +2264,37 @@ def union(self, other):
other = other.astype('O')
return this.union(other)

# TODO: setops-refactor, clean all this up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if might be worth it to put this in a single method now (not sure if its possible) to isolate this code, maybe

def _get_internal_type(other):
     if is_period_dtype(other) ......
          ...
    else:
         ...

@@ -678,13 +678,36 @@ def _assert_tzawareness_compat(self, other):
raise TypeError('Cannot compare tz-naive and tz-aware '
'datetime-like objects')

@property
def _values(self):
# tz-naive -> ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we really ought to fix this

# So we extract the tz-naive DatetimeIndex, unique that, and wrap the
# result with out TZ.
if self.tz is not None:
naive = type(self)(self._ndarray_values, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this hacking around is not great. The problem is these temporary work-around become more and more. Really need to bite the bullet and fix up the implementations to avoid this.

@TomAugspurger
Copy link
Contributor Author

Agreed entirely about the hackiness, especially in the set-ops. I should be able to do it in parallel with my Interval ExtensionArray PR.

@TomAugspurger
Copy link
Contributor Author

All green. Merging.

@TomAugspurger TomAugspurger merged commit df38f66 into pandas-dev:master Feb 13, 2018
@TomAugspurger TomAugspurger deleted the index-values branch February 13, 2018 14:50
@jorisvandenbossche
Copy link
Member

Great!

@jreback
Copy link
Contributor

jreback commented Feb 13, 2018

@TomAugspurger this broke the build. what is going on here?

@jreback
Copy link
Contributor

jreback commented Feb 14, 2018

actually looks like just a network failure ok. But before merge anything else all of the other builds need to work.

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
* REF/Clean: Internal / External values

* Move to index base

* Cleanup unique handling

* Simplify object concat

* Use values for intersection

I think eventually we'll want to ndarray_values for this, but it'll
require a bit more work to support. Currently, using ndarary_values
causes occasional failures on categorical.

* hmm

* Additional testing

* More tests

* ndarray_values

* API: Default ExtensionArray.astype

(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a)
(cherry picked from commit fbf0a06)

* Simplify concat_as_object

* Py2 compat

(cherry picked from commit b20e12c)

* Set-ops ugliness

* better docstrings

* tolist

* linting

* Moved dtypes

(cherry picked from commit d136227)

* clean

* cleanup

* NumPy compat

* Use base _values for CategoricalIndex

* Update dev docs

* cleanup

* Linting

* Precision in tests

* Push _ndarray_values to ExtensionArray

Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA.
Subclasses like Categorical can override it as they see fit.

* Clean up tolist

* Move test locations

* Fixed test

* REF: Update per comments

* lint

* REF: Use _values for size and shape

* PERF: Implement size, shape for IntervalIndex

* PERF: Avoid materializing values for PeriodIndex shape, size

* Cleanup

* Override nbytes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants