REF: Internal / External values #19558

TomAugspurger · 2018-02-06T20:26:15Z

Closes #19548

Just changed Index so far. Need to

Run an ASV
add Series._ndarray_values
Start cleaning up now redundant if conditions.

TomAugspurger · 2018-02-06T20:28:20Z

doc/source/internals.rst

+  always an ``ndarray`` or ``ExtensionArray``).
+
+So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is
+the underlying ndarray.


Not sure what Series[categorical]._ndarray_values should be, categories or codes? PeriodIndex._ndarray_values is the ordinals, so maybe codes?

jreback

so a) this should remove .get_values() as well completely, adding something w/o removing the old to me is a non-starter here.

I see lots of special cases. This points to a problem in the interface. if you are doing is_......_ then it needs to be fixed in ExtensenionArray (IOW have a dispatch for that function) and overrides in the subclasses. calling out Categorical here.

jreback · 2018-02-07T11:23:44Z

doc/source/internals.rst

+Values
+~~~~~~
+
+Pandas extends NumPy's type system in a few places, so we have multiple notions of "values" floating around.


The first sentence is totally not clear to a new reader

jreback · 2018-02-07T11:24:53Z

doc/source/internals.rst

+
+* ``cls._ndarray_values`` is *always* and ``ndarray``
+* ``cls._values`` refers is the "best possible" array. This could be an ``ndarray``, ``ExtensionArray``, or
+  in ``Index`` subclass (note: we're in the process of removing the index subclasses here so that it's


you are using internal (pun intended) jargon here

jreback · 2018-02-07T11:25:08Z

doc/source/internals.rst

+  always an ``ndarray`` or ``ExtensionArray``).
+
+So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is
+the underlying ndarray.


jreback · 2018-02-07T11:26:37Z

pandas/core/base.py

@@ -973,8 +999,12 @@ def value_counts(self, normalize=False, sort=True, ascending=False,
    @Appender(_shared_docs['unique'] % _indexops_doc_kwargs)
    def unique(self):
        values = self._values
-
+        if isinstance(values, ABCDatetimeIndex):


why are you special casing?

jreback · 2018-02-07T11:26:53Z

pandas/core/dtypes/concat.py

-    return self._shallow_copy_with_infer(np.concatenate(to_concat), **attribs)
+    arrays = []
+    for x in to_concat:
+        if is_categorical_dtype(x):


why is this a special case?

@TomAugspurger don't we want np.asarray(x, dtype=object) for all extension dtypes in this case?

So for Period and Interval _values (used some lines below for Index types) currently still returns object array, but I think the purpose is that those will also return their own Array class, so using np.asarray(x, dtype=object) will do the same now and is more future proof?

jreback · 2018-02-07T11:27:06Z

pandas/core/indexes/base.py

@@ -2208,27 +2208,37 @@ def union(self, other):
            other = other.astype('O')
            return this.union(other)

+        if is_categorical_dtype(self):


why are these special caes?

I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical.

(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06)

(cherry picked from commit b20e12c)

codecov · 2018-02-08T23:03:59Z

Codecov Report

Merging #19558 into master will increase coverage by 0.01%.
The diff coverage is 97.32%.

@@            Coverage Diff             @@
##           master   #19558      +/-   ##
==========================================
+ Coverage   91.57%   91.58%   +0.01%     
==========================================
  Files         150      150              
  Lines       48817    48864      +47     
==========================================
+ Hits        44704    44752      +48     
+ Misses       4113     4112       -1

Flag	Coverage Δ
#multiple	`89.95% <97.32%> (+0.01%)`	⬆️
#single	`41.75% <43.75%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
pandas/io/pytables.py	`92.45% <100%> (ø)`	⬆️
pandas/core/base.py	`96.79% <100%> (+0.01%)`	⬆️
pandas/core/indexes/period.py	`93.04% <100%> (+0.05%)`	⬆️
pandas/core/indexes/datetimes.py	`95.32% <100%> (+0.06%)`	⬆️
pandas/core/indexes/base.py	`96.46% <100%> (+0.02%)`	⬆️
pandas/core/dtypes/cast.py	`87.98% <100%> (ø)`	⬆️
pandas/core/indexes/numeric.py	`97.24% <100%> (ø)`	⬆️
pandas/core/dtypes/concat.py	`99.14% <100%> (ø)`	⬆️
pandas/core/indexes/interval.py	`92.44% <100%> (+0.04%)`	⬆️
pandas/core/arrays/categorical.py	`94.89% <100%> (+0.01%)`	⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d9551c8...3af8a21. Read the comment docs.

pep8speaks · 2018-02-08T23:03:59Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 13, 2018 at 11:54 Hours UTC

TomAugspurger · 2018-02-08T23:10:16Z

Still a WIP, but this should hopefully pass.

I'm going to make a pass through the library to remove things that are currently doing checks for Series vs. Index, which should clean things up.

I've also cherry picked ExtensionArray.astype stuff from #19604, so the diff in lines in this PR is a bit smaller than it looks.

Setops are even more of a mess now. I have a biggish refactor started (with nice perf boosts), but didn't finish it today. It's tricky since the current implementation was relying on self._inner_indexer(self._values, ...) raising in some cases that now won't raise (period or categorical I think). I think that it's OK to accept a bit of technical debt in this case, and I'll pay it down later.

jorisvandenbossche

this should remove .get_values() as well completely, adding something w/o removing the old to me is a non-starter here.

@jreback if you are speaking about Series.get_values and Index.get_values, then I don't think this related. Yes, ideally we can implement get_values cleanly with one of the other ones, but get_values is a public function that should not just be removed or changed. We can certainly discuss deprecating it, but IMO that is separate of this internal cleaning up.

jorisvandenbossche · 2018-02-09T13:13:59Z

pandas/core/indexes/base.py

+        index             | values          | _values     | _ndarray_values |
+        ----------------- | -------------- -| ----------- | --------------- |
+        CategoricalIndex  | Categorical     | Categorical | codes           |
+        DatetimeIndex[tz] | ndarray[M8ns]   | DTI[tz]     | datetime@UTC    |


datetime@UTC -> is the same as ndarray[M8ns] in the 'values' column ?

Yes, I'll fix that to be consistent.

jorisvandenbossche · 2018-02-09T13:14:02Z

pandas/core/indexes/base.py

+        index             | values          | _values     | ndarray_values |
+        ----------------- | --------------- | ----------- | -------------- |
+        PeriodIndex       | ndarray[object] | PeriodArray
+        IntervalIndex     | IntervalArray   | ndarray[Interval]


This should be the other way around?
and for PeriodIndex, _ndarray_values is ndarrary[ordinals] ? (for Interval there is not really a native ndarray representation possible, so that is also object array ?

jorisvandenbossche · 2018-02-09T13:14:09Z

pandas/core/indexes/category.py

@@ -293,6 +304,23 @@ def values(self):
        """ return the underlying data, which is a Categorical """
        return self._data

+    @property
+    def _values(self):
+        return self._data


Is there a difference between ._data and .values ? As below you use .values for itemsize and nbytes, but both seems to point to the underlying Categorical

Probably just me being inconsistent.

jorisvandenbossche · 2018-02-09T13:14:11Z

pandas/core/indexes/category.py

@@ -769,7 +797,7 @@ def _evaluate_compare(self, other):

    def _delegate_method(self, name, *args, **kwargs):
        """ method delegation to the ._values """
-        method = getattr(self._values, name)
+        method = getattr(self.values, name)


Why change this?

Seems to be unnecessary. Reverting.

jorisvandenbossche · 2018-02-09T13:14:16Z

pandas/core/indexes/multi.py

@@ -799,9 +799,11 @@ def values(self):
            box = hasattr(lev, '_box_values')
            # Try to minimize boxing.
            if box and len(lev) > len(lab):
-                taken = lev._box_values(algos.take_1d(lev._values, lab))
+                taken = lev._box_values(algos.take_1d(lev._values,
+                                                      lab))


don't the algos functions always assume ndarray?

Yes, fixed (the only one that should be ._values is the one on line 809 below).

jorisvandenbossche · 2018-02-09T13:17:45Z

pandas/tests/test_base.py

+    elif isinstance(l_values, pd.Index):
+        tm.assert_index_equal(l_values, r_values)
+    elif pd.api.types.is_categorical(l_values):
+        tm.assert_categorical_equal(l_values, r_values)


In the future, we should maybe add an assert_array_equal that handles the different extension arrays as well, then such a special case should not be needed here?

jorisvandenbossche · 2018-02-09T13:18:11Z

pandas/tests/test_base.py

+    assert r_values.dtype == dtype
+
+
+def test_values_periodindex():


this is also tested in the test above?

jreback · 2018-02-09T13:32:02Z

.get_values() should be deprecated, if you want to implement in terms of an internal function, ok. but the external API needs fixing. This was never meant to be exposed and is useless. let's clean that up first

jorisvandenbossche · 2018-02-09T13:33:55Z

external API needs fixing

What needs to be fixed? Or do just mean deprecating it?

let's clean that up first

That doesn't need to happen first, it can be done independently of this. I will open an issue for it.

jorisvandenbossche

Added some comments, but in general this is looking good to me.

For Index, we now also have the _data attribute. Is it the idea that in the future this will become the same as _values ?

jorisvandenbossche · 2018-02-12T13:21:13Z

pandas/core/dtypes/concat.py

@@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None):

 def _concat_index_same_dtype(indexes, klass=None):
    klass = klass if klass is not None else indexes[0].__class__
-    return klass(np.concatenate([x._values for x in indexes]))
+    return klass(np.concatenate([x._ndarray_values for x in indexes]))


This one is only used for numeric indices, so _values or _ndarray_values should not matter.
I was mainly thinking that for those cases where _ndarray_values actually differs from _values, like a categorical, the above code will not work anyway, as klass(codes) is not enough to reconstruct the categorical. So _values seems safer to me.

jorisvandenbossche · 2018-02-12T13:21:26Z

pandas/core/arrays/base.py

+        # type: () -> np.ndarray
+        """Internal pandas method for lossy conversion to a NumPy ndarray.
+
+        This method is not part of the pandas interface.


I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.

If it is not part of the interface, we could also define it on our own subclasses without defining it here.
But on the other hand, that will raise errors when an external extension array does not have this method. Eg some attributes on Series will call into this, which makes it somehow part of the interface.

I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.

This simplifies the implementation since

a.) we don't have to have a PandasExtensionArray that subclasses ExtensionArray, just to define this method.
b.) we can safely call ._values anywhere in our code without checking whether it's an extension array or a pandas extension array.

My preference is to leave it out of the interface until someone sees an actual need for it. I suspect this could come up if / when we start allowing custom indexes with their own indexing engines, but that seems like a ways down the road...

Yes, I understand those reasons, but that means it is part of the interface (in the sense that if somebody would be stupid to implement a _ndarray_values property that returns something differently, it will break code?)
It's just a part of the interface that we ask not to implement?

To give an example, for GeoPandas, I was wondering if I would overwrite this property to return my self.data (the integer pointers) instead of a materialized array (which can be costly).
For things like Series.strides, Series.itemsize, .. this will be OK (and in principle no user should call this anyway .. so maybe I should not worry about it), but if it is used in other places as well that I am not really sure about, such an implementation might actually give a wrong result (but at the same time, if this is called in certain places for my column with an extension array, I want to know that, because materializing can be costly and I want to avoid this as much as possible).

I see your point. Your GeoPandas concern is a good one. But I'm not sure how to proceeded :/ Do you have a preference for

a.) Adding it to the interface
b.) Leaving it out of the interface, with a default implementation on EA
c.) Making a pandas-specific EA that defines _ndarray_values, so uses of _ndarray_values will need to check for that attr.

._values / ._ndarray_values are meant to be internal, and I don't want to limit future refactorings with backwards compatibility concerns. I'm also not even sure how to document it beyond what's already there.

Maybe the best option is to leave it for now? :)

Ideally, the Series machinery should not use _ndarray_values too often, as this would require knowledge of what those values mean (codes, ordinals, ...), which Series does not need to know (if there is knowledge required, it should somehow dispatch the operation to the underlying (extension) array itself).

And this is already the case I think. From a quick scan, currently in the PR, I see ndarray_values is mainly used for:

indexing (values used in the engine) -> for now we don't support extension arrays to be the data for Index (apart from our own), so this is a worry for later if we want to tackle that

Index object operations (eg the setops) -> same as above

in some specific cases where we know the parent object (eg we know it is a PeriodIndex, we just want to get the ordinals)

for algos (eg take, unique) -> either they dispatch to EA (eg take) or for now the fallback is object array anyhow (eg value_counts), and in general this is something we still need to discuss how to handle in general (factorize, unique, ..)

some Series attributes (shape, itemsize, strides, size) -> shape can also be derived from _values directly (like nbytes) as it is part of the EA interface. And from the others, I think size would be nice to not have to materialize the EA (itemsize and strides I don't care about). Can we let this return len(self._values), which is part of the interface? (I am not sure to what extent .size and __len__ on numpy arrays are different / have different performance)

Your assessment looks correct. Nothing that's series specific currently uses _ndarray_values, so it's really just code that is Index-specific or index / series agnostic.

I can change .shape and .size to use ._values instead of instead of ._ndarary values. I'll just have to override it in DatetimeIndex to avoid a recursion error.

i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.

jorisvandenbossche · 2018-02-12T13:31:43Z

pandas/core/base.py


    @property
    def nbytes(self):
        """ return the number of bytes in the underlying data """
-        return self._values.nbytes
+        return self._ndarray_values.nbytes


Should nbytes be called on self._values instead? (as we require a nbytes implementation on the Extension Array, and numpy arrays already have it)

I think this caused issues for CI, but re-running tests now with this change.

jorisvandenbossche · 2018-02-12T13:34:26Z

pandas/core/indexes/category.py

@@ -293,6 +293,15 @@ def values(self):
        """ return the underlying data, which is a Categorical """
        return self._data

+    @property
+    def itemsize(self):
+        return self.values.itemsize


let's add a comment why you are overwriting this here (default is to use _ndarray_values, which are codes, but Categorical defines itemsize as the itemsize of the categories).
Also, would it be more consistent to do self._values.itemsize? (not that it would differ in this case)

More consistent with the parent's implementation, yes, but I'd prefer to use the public API where possible.

, but I'd prefer to use the public API where possible.

But the public api is not meant to be consistent, but to maintain backwards compatibility :-) (at least for now)

jorisvandenbossche · 2018-02-12T13:41:57Z

pandas/core/indexes/multi.py

@@ -1317,7 +1319,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
            arrays = [[]] * len(names)
        elif isinstance(tuples, (np.ndarray, Index)):
            if isinstance(tuples, Index):
-                tuples = tuples._values
+                tuples = tuples._ndarray_values


Why ndarray_values in this case? Because I think this code path can only work if it is an object index that contains tuples, so should not be different in this case.

Although that maybe raises the question: in such cases, where we know both _values and _ndarray_values will return the same ndarray (eg in all code only for numeric indexes), which one of the two do we take? (might be good to be consistent on this)

jorisvandenbossche · 2018-02-12T13:47:52Z

pandas/io/formats/format.py

@@ -1897,7 +1897,7 @@ def _format(x):

        vals = self.values
        if isinstance(vals, Index):
-            vals = vals._values
+            vals = vals._ndarray_values


Currently, an extension array will end up here (using the GenericArrayFormatter), so should this rather be np.array(vals._values) ?
(the implementation of _ndarray_values on an ExtensionArray should not influence its formatting?)

It seems like by this point we always have a regular Index, not an extension array, so it doesn't matter in terms of output. Changing to _values since that's clearer.

jorisvandenbossche · 2018-02-12T13:48:54Z

pandas/tests/indexes/datetimes/test_datetime.py

+        result = arr.unique()
+
+        if isinstance(expected, np.ndarray):
+            tm.assert_numpy_array_equal(result, expected)


The expected is always a DatetimeIndex in the above parametrization

jorisvandenbossche · 2018-02-12T13:51:49Z

pandas/tests/test_base.py

+    (pd.DatetimeIndex(['2017-01-01T00:00:00'], tz="US/Eastern"),
+     np.array(['2017-01-01T05:00:00'], dtype='M8[ns]')),
+    (pd.TimedeltaIndex([10**10]), np.array([10**10], dtype='m8[ns]')),
+    pytest.mark.xfail(reason='PeriodArray not implemented')((


Why fail? As I think _ndarray_values already are the ordinals, and will stay that way? (it is _values that will change from object array to PeriodArray?)

It's an xfail since Series[period]._ndarray_values will return an object ndarray with teh Periods, while PeriodIndex._ndarray_values returns the ordinals.

Passing this now would require modifying Series._ndarray_values, which is currently just self._internal_values, to special case object dtype to do inference to see if we have periods. I don't think that's worth it since Periods will (hopefully) be a first-class extension dtype by the next release.

Ah missed that the test below tested it for both Series and Index.

TomAugspurger · 2018-02-12T14:34:29Z

._data seems even more inconsistent

class	output
Index	ndarray
PeriodIndex	ndarray[int] (ordinals)
DatetimeIndex	DatetimeIndex (?)
DatetimeIndex[tz]	ndarray[datetime64[ns]]
CategoricalIndex	Categorical
IntervalIndex	None

It hopefully wouldn't be much work to convert uses of ._data to ._values, but I don't have any plans to do that right now. I'll open up a followup issue though. #19658

TomAugspurger · 2018-02-12T14:41:33Z

Thanks for the review. I implemented your suggestions where I didn't comment in reply, and tests pass locally.

jorisvandenbossche · 2018-02-12T14:57:10Z

It hopefully wouldn't be much work to convert uses of ._data to ._values, but I don't have any plans to do that right now. I'll open up a followup issue though. #19658

+ 1, that only just got me thinking. If you create an index from an extension array (directly, or eg by doing set_index on a column backed by extension block with extension array), I suppose this creates an object Index (with array of scalars) right now? Might be good to add a test for that? (but that probably belongs in the other PR that implements storing an EA in Series)

TomAugspurger · 2018-02-12T14:58:34Z

I suppose this creates an object Index (with array of scalars) right now? Might be good to add a test for that? (but that probably belongs in the other PR that implements storing an EA in Series)

Yes, and agreed (we hit this indirectly though ._value_counts), but it'd be good to have an explicit test.

TomAugspurger · 2018-02-12T19:15:17Z

Implemented the changes dicussed in #19558 (comment): .shape and .size use self._values.

I implemented those on PeriodIndex and DatetimeIndex, else we would have materialized the array of objects (which is what we do on master), so we get some performance gains there I guess

master:

In [6]: idx = pd.interval_range(0, 100_000)

In [8]: %timeit idx.size
308 ns ± 9.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

pr:

In [3]: %timeit idx.size
1.05 µs ± 9.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

jorisvandenbossche · 2018-02-12T19:22:56Z

do 'master' and 'pr' need to be swapped?

TomAugspurger · 2018-02-12T19:28:27Z

Not sure what happened on that timing I posted earlier. I can't reproduce it.

In [1]: import pandas as pd

In [2]: idx = pd.interval_range(0, 100_000)

In [3]: %time idx.shape
CPU times: user 177 ms, sys: 10.1 ms, total: 188 ms
Wall time: 187 ms
Out[3]: (100000,)

In [1]: import pandas as pd

In [2]: idx = pd.interval_range(0, 100_000)

In [3]: %time idx.shape
CPU times: user 18 µs, sys: 2 µs, total: 20 µs
Wall time: 22.9 µs
Out[3]: (100000,)

TomAugspurger · 2018-02-12T22:37:25Z

CI is green again.

jorisvandenbossche · 2018-02-12T23:00:14Z

doc/source/internals.rst

+  ``ndarray`` or ``ExtensionArray``).
+
+So, for example, ``Series[category]._values`` is a ``Categorical``, while
+``Series[category]._ndarray_values`` is the underlying codes.


I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas.
So I would maybe split this in two: external vs internal details, where the future documentation on how to use and define ExtensionArrays would also go into the external details part (maybe "Extending Pandas" would be a good title, with then information about the different strategies: accessor, extension array, subclassing, ..)

But that's for a separate PR (I can start with that), so for here the above is fine for me.

internal-internals.rst and external-internals.rst ;)

I think that "How to subclass", and the eventual "extending pandas with custom array types" would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas.".

Ah, yes, I didn't see that the accessor documentation is actually already there (although, I personally don't find the parquet section that fitting in there, as it is not something typical you need to know when extending pandas. I can already start with moving it to the bottom of the file :-))

jreback

some small comments. ok to merge. But before adding new things, a refactor of the existing is in order to avoid all of this special casing. When this is out of sight (meaning not in a current PR), it is easy to forget the special casing. Maybe need a TODO detector . but already:
pandas)

bash-3.2$ grep -r TODO pandas --include '*.py' |wc
     279    2233   22968

maybe put a special TODO(EA) or something to differentiate these

jreback · 2018-02-12T23:53:54Z

pandas/core/arrays/base.py

+        # type: () -> np.ndarray
+        """Internal pandas method for lossy conversion to a NumPy ndarray.
+
+        This method is not part of the pandas interface.


i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.

jreback · 2018-02-12T23:54:46Z

pandas/core/base.py


    @property
    def nbytes(self):
        """ return the number of bytes in the underlying data """
-        return self._values.nbytes
+        return self.values.nbytes


why not ._values?

It depends on what we do with PeriodIndex.values and IntervalIndex.values.

If we change the public .values to be an ExtensionArray then there's no difference.

If we leave the public .values as an ndarray of object, then we prefer ._values since it doesn't materialize the ndarray.

So far we've put off that decision, but we'll need to make a choice soon.

You have to override this in DatetimeIndex (tz) then for now, this is the reason CI is failing (recursion error)

On the original comment: I think it is better for now to use ._values instead of .values in such cases. .values is the public one that might not always be consistent for backwards compatibility, so let's internally use always _values to have something certainly consistent? (and if we decide .values to return extension arrays, it is just a public wrapper around ._values, and if we decide not, we don't have to change anything internally)

jreback · 2018-02-12T23:56:35Z

pandas/core/indexes/base.py

+    @property
+    def _values(self):
+        # type: () -> Union[ExtensionArray, Index]
+        # TODO: remove index types as they become is extension arrays


remove 'is'

jreback · 2018-02-12T23:59:48Z

pandas/core/indexes/base.py

+
+        index             | values          | _values      | _ndarray_values |
+        ----------------- | --------------- | ------------ | --------------- |
+        PeriodIndex       | ndarray[object] | ndarray[obj] | ndarray[int]    |


in theory these should return an Index type for ._values for PI and II, and it might work, but would be disruptive w/o a Block type to hold it. This is why for example we have the is_period_arraylike methods to detect an array of Periods. The holder ATM is a ndarray[object].

jreback · 2018-02-13T00:00:34Z

pandas/core/indexes/base.py

@@ -2228,27 +2264,37 @@ def union(self, other):
            other = other.astype('O')
            return this.union(other)

+        # TODO: setops-refactor, clean all this up


ok, let's do this next then. these if/then/else are really smelly.

jreback · 2018-02-13T00:01:52Z

pandas/core/indexes/base.py

@@ -2228,27 +2264,37 @@ def union(self, other):
            other = other.astype('O')
            return this.union(other)

+        # TODO: setops-refactor, clean all this up


if might be worth it to put this in a single method now (not sure if its possible) to isolate this code, maybe

def _get_internal_type(other): if is_period_dtype(other) ...... ... else: ...

jreback · 2018-02-13T00:02:36Z

pandas/core/indexes/datetimes.py

@@ -678,13 +678,36 @@ def _assert_tzawareness_compat(self, other):
            raise TypeError('Cannot compare tz-naive and tz-aware '
                            'datetime-like objects')

+    @property
+    def _values(self):
+        # tz-naive -> ndarray


yeah we really ought to fix this

jreback · 2018-02-13T00:04:02Z

pandas/core/indexes/datetimes.py

+        # So we extract the tz-naive DatetimeIndex, unique that, and wrap the
+        # result with out TZ.
+        if self.tz is not None:
+            naive = type(self)(self._ndarray_values, copy=False)


all of this hacking around is not great. The problem is these temporary work-around become more and more. Really need to bite the bullet and fix up the implementations to avoid this.

TomAugspurger · 2018-02-13T02:56:45Z

Agreed entirely about the hackiness, especially in the set-ops. I should be able to do it in parallel with my Interval ExtensionArray PR.

TomAugspurger · 2018-02-13T14:50:50Z

All green. Merging.

jorisvandenbossche · 2018-02-13T14:55:23Z

Great!

jreback · 2018-02-13T23:59:41Z

@TomAugspurger this broke the build. what is going on here?

jreback · 2018-02-14T00:01:55Z

actually looks like just a network failure ok. But before merge anything else all of the other builds need to work.

* REF/Clean: Internal / External values * Move to index base * Cleanup unique handling * Simplify object concat * Use values for intersection I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical. * hmm * Additional testing * More tests * ndarray_values * API: Default ExtensionArray.astype (cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06) * Simplify concat_as_object * Py2 compat (cherry picked from commit b20e12c) * Set-ops ugliness * better docstrings * tolist * linting * Moved dtypes (cherry picked from commit d136227) * clean * cleanup * NumPy compat * Use base _values for CategoricalIndex * Update dev docs * cleanup * Linting * Precision in tests * Push _ndarray_values to ExtensionArray Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA. Subclasses like Categorical can override it as they see fit. * Clean up tolist * Move test locations * Fixed test * REF: Update per comments * lint * REF: Use _values for size and shape * PERF: Implement size, shape for IntervalIndex * PERF: Avoid materializing values for PeriodIndex shape, size * Cleanup * Override nbytes

REF/Clean: Internal / External values

41f09d8

TomAugspurger added Refactor Internal refactoring of code Clean labels Feb 6, 2018

TomAugspurger commented Feb 6, 2018

View reviewed changes

Move to index base

29cfd7c

jreback requested changes Feb 7, 2018

View reviewed changes

TomAugspurger added 5 commits February 7, 2018 13:17

Cleanup unique handling

3185f4e

Merge remote-tracking branch 'upstream/master' into index-values

5a59591

Simplify object concat

476f75d

Use values for intersection

b15ee5a

I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical.

hmm

659073f

TomAugspurger mentioned this pull request Feb 8, 2018

ENH: Allow storing ExtensionArrays in containers #19520

Merged

TomAugspurger added 4 commits February 8, 2018 08:40

Merge remote-tracking branch 'upstream/master' into index-values

7accb67

Additional testing

9b8d2a5

More tests

9fbac29

ndarray_values

55305dc

TomAugspurger mentioned this pull request Feb 8, 2018

API: Default ExtensionArray.astype #19604

Merged

TomAugspurger added 6 commits February 8, 2018 15:13

API: Default ExtensionArray.astype

0e63708

(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06)

Simplify concat_as_object

fbbbc8a

Py2 compat

46a0a49

(cherry picked from commit b20e12c)

Set-ops ugliness

2c4445a

better docstrings

5612cda

tolist

b012c19

linting

d49e6aa

jorisvandenbossche reviewed Feb 9, 2018

View reviewed changes

TomAugspurger changed the title ~~[WIP]REF: Internal / External values~~ REF: Internal / External values Feb 12, 2018

jorisvandenbossche reviewed Feb 12, 2018

View reviewed changes

REF: Update per comments

8104ee5

TomAugspurger added 4 commits February 12, 2018 09:53

lint

f8e29b9

REF: Use _values for size and shape

0cd9faa

PERF: Implement size, shape for IntervalIndex

8fcdb70

PERF: Avoid materializing values for PeriodIndex shape, size

34a6a22

jorisvandenbossche approved these changes Feb 12, 2018

View reviewed changes

jreback requested changes Feb 13, 2018

View reviewed changes

TomAugspurger added 2 commits February 12, 2018 18:24

Merge remote-tracking branch 'upstream/master' into index-values

c233c28

Cleanup

d6e8051

jbrockmendel mentioned this pull request Feb 13, 2018

Fix docstrings, error messages #19672

Closed

jorisvandenbossche added this to the 0.23.0 milestone Feb 13, 2018

Override nbytes

3af8a21

TomAugspurger merged commit df38f66 into pandas-dev:master Feb 13, 2018

TomAugspurger deleted the index-values branch February 13, 2018 14:50

		assert r_values.dtype == dtype


		def test_values_periodindex():

REF: Internal / External values #19558

REF: Internal / External values #19558

Conversation

TomAugspurger commented Feb 6, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Feb 8, 2018 • edited Loading

Codecov Report

pep8speaks commented Feb 8, 2018 • edited Loading

Comment last updated on February 13, 2018 at 11:54 Hours UTC

TomAugspurger commented Feb 8, 2018

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Feb 9, 2018

jorisvandenbossche commented Feb 9, 2018

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger Feb 12, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Feb 12, 2018 • edited Loading

TomAugspurger commented Feb 12, 2018

jorisvandenbossche commented Feb 12, 2018

TomAugspurger commented Feb 12, 2018

TomAugspurger commented Feb 12, 2018

jorisvandenbossche commented Feb 12, 2018

TomAugspurger commented Feb 12, 2018

TomAugspurger commented Feb 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Feb 8, 2018 •

edited

Loading

pep8speaks commented Feb 8, 2018 •

edited

Loading

TomAugspurger Feb 12, 2018 •

edited

Loading

TomAugspurger commented Feb 12, 2018 •

edited

Loading

jreback left a comment •

edited

Loading