Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: deprecate get_values #26409

Merged

Conversation

jorisvandenbossche
Copy link
Member

Closes #19617

Reviving an old branch I had lying around. So this was more a quick experiment to see what it involves. The main "problem" is that we ourselves use get_values in a bunch of places (eg values_from_object which is massively used).

The current situation:

  • DataFrame.get_values simply points to .values so is (I think?) always a numpy ndarray.
  • Series.get_values points to _data.get_values() (the SingleBlockManager), which in turn does np.array(self._block.to_dense(), copy=False).
    So the Block.to_dense() in general returns a view on the array, although for ExtensionBlock it does np.array(self.values), with the exception for DatetimeTZBlock, which ensures a non-object array with np.asarray(self.values, dtype=_NS_DTYPE), and for CategoricalBlock, it returns the Categorical.get_values which can be a DatetimeIndex.
  • Index.get_values points to .values, so is a numpy ndarray with the exception of:
    • newer Index objects which are already backed by EA: IntervalIndex returns IntervalArray here (but eg DatetimeIndex with tz aware data does return a numpy array)
    • CategoricalIndex overrides the parent method to return the get_values of the underlying Categorical, which in turn mostly returns an ndarray, but an EA for interval-backed dtypes and DatetimeIndex for datetimetz aware backed categorical.

Summary: a complete mess :-)

This PR is for now just an experiment to see where it is used. Now, we should see for all those places what would be the best replacement (unless we are fine with keeping such a _internal_get_values, but it would be good to clean this up somewhat ..)

@pep8speaks

This comment has been minimized.

@jorisvandenbossche
Copy link
Member Author

(wanted to make this a draft PR, but forgot and I don't see a way to change the status once created)

@codecov

This comment has been minimized.

@pandas-dev pandas-dev deleted a comment from codecov bot May 15, 2019
@TomAugspurger
Copy link
Contributor

The main "problem" is that we ourselves use get_values in a bunch of places

Agree with putting "problem" in quotes. As you show, this is surprisingly hard to use correctly. I'd be happy to see it go away.

@jorisvandenbossche
Copy link
Member Author

I'd be happy to see it go away.

Well, that's the part I didn't do yet :-)
I currently simply replaced all internal uses with _internal_get_values. And if we want to get rid of that, we need to decide case by case what to use instead.

But the problem is that get_values has a quite specific behaviour that is not replicable with any of values, array, to_numpy(), _values or _ndarray_values.
get_values ensures the output is a numpy array; it does not preserve fully the information (eg looses timezone), but also avoids object dtype if possible. So it uses datetime64[ns] for datetimetz (while to_numpy gives object), but does convert to object for object categoricals (while _ndarray_values gives the integer codes here).

I suppose in some cases, this difference might not matter, and we can replace it with eg to_numpy(), or we can put this "special" logic into com.values_from_object (which currently simply calls get_values()) and for all cases where we can't replace it with one of the consistent attributes/methods, use this function.

@jreback
Copy link
Contributor

jreback commented May 15, 2019

you need to be extremely careful here as this may have perf implications
rather than actually deprecating this should just change to use .to_numpy() (or _ndarray_values) as appropriate

@jorisvandenbossche
Copy link
Member Author

rather than actually deprecating this should just change to use .to_numpy() (or _ndarray_values) as appropriate

Do those two aspects not go together? We try to change it to to_numpy et al, so we actually can deprecate (and eventually remove) it?

Note that it is actually the "just change to use to_numpy()" that has the performance implications (eg object dtype or not).

As explained above, the current get_values has a quite specific behaviour, so we probably need to keep some internal function or method that keeps that behaviour, exactly to avoid possible implications.

@jreback jreback added the Deprecate Functionality to remove in pandas label May 16, 2019
@@ -1505,6 +1505,9 @@ def get_values(self):
A numpy array of the same dtype as categorical.categories.dtype or
Index if datetime / periods.
"""
raise Exception("USING GET_VALUES")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be concerned that CI is passing with this here? :)

Perhaps we need a test ensuring that Categorical.get_values() raises a warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be concerned that CI is passing with this here? :)

No, that's a good sign :)
As that was my way to find all the places where it was being called, to fix those (although it is not a perfect way, as in certain places the error could also be catched)

But for sure still need to convert this into a warning and add tests that checks that.

@jorisvandenbossche
Copy link
Member Author

OK, to get this into a merge-able state, I see currently two options:

I think we need to keep for now the functionality of "get_values" (which means: give back some numpy array, special cases behaviour depending on the dtype, see above). So we can:

  • centralize the special cases in com.values_from_object. We have one place with the logic to decide what kind of ndarray to return for this function.

  • keep the special cases on their own class (like I am doing now with the _internal_get_values) which is then used by com.values_from_object.

I think I personally like the first one better, but that has possible the complication that we need to repeat the logic in the json C code (didn't check that in detail though). And also, in general we always have this logic on the object itself (eg the existing _values, _ndarary_values attributes).

@jreback
Copy link
Contributor

jreback commented Jun 18, 2019

I think I personally like the first one better, but that has possible the complication that we need to repeat the logic in the json C code (didn't check that in detail though). And also, in general we always have this logic on the object itself (eg the existing _values, _ndarary_values attributes).

slightly annoying, but you could call _values_from_object in the json code to unify this.

pandas/core/indexes/category.py Outdated Show resolved Hide resolved
pandas/core/indexes/base.py Outdated Show resolved Hide resolved
pandas/core/indexes/category.py Outdated Show resolved Hide resolved
@@ -177,6 +177,9 @@ def get_values(self, dtype=None):
return self.values.astype(object)
return self.values

def get_block_values(self, dtype=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really need this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the JSON C code does handle blocks (if I remove the get_block_values part in objToJSON.C that I introduced in this PR, a couple tests fail).
I could name this here the same as for Series/INdex (i.e. _internal_get_values), but I prefer a distinct name to make it clear that the json code is handling blocks and not series/index (and that also makes it clear that all other places where _internal_get_values is used is not handling blocks). That will also make it easier to isolate and try to remove the block handling in the json C code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #27164 for follow-up on this JSON issue

@@ -162,15 +162,15 @@ def test_values(self):

exp = np.array([], dtype=np.object)
tm.assert_numpy_array_equal(idx.values, exp)
tm.assert_numpy_array_equal(idx.get_values(), exp)
#tm.assert_numpy_array_equal(idx.get_values(), exp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove and/or use filterwarnings on the test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, this is what's failing :_.>

@jreback jreback added this to the 0.25.0 milestone Jun 28, 2019
@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

looks good, just some checks failing.

@jbrockmendel
Copy link
Member

Generally a big win, especially in sparse. Can we think of a better name than internal values? That doesn’t do much to tell me what it’s for.

@jreback
Copy link
Contributor

jreback commented Jun 29, 2019

“_dense_values”?

@jbrockmendel
Copy link
Member

_lossless_values? If I’m writing an EA and need to implement it, what would this be for eg RangeArray?

@jorisvandenbossche
Copy link
Member Author

EA's don't have to implement this. Currently in master, Categorical is the only one that does.

In general, the ExtensionBlock returns np.array(self.values) that will get used in those get_values. We only have internally the exception for DatetimeTzBlock (to return DatetimeIndex) and CategoricalBlock. That's certainly something we should try to fix, but that's a bigger endeavor than this PR.

@jorisvandenbossche
Copy link
Member Author

_lossless_values?

It's not fully lossless (eg for categorical you loose the categories and orderedness).

_internal_get_values is certainly an awful name, but on the other hand it also is more or less what it is. It is some kind of "get_values" that is for some historical reasons needed for internal use.

I hope to at some point get rid of it. But of course, that's maybe not for the short term.

@jorisvandenbossche
Copy link
Member Author

As said above (#26409 (comment)), I can also try to get rid of it entirely by concentrating the special case logic in values_from_object

@jorisvandenbossche
Copy link
Member Author

The issue about getting a "dense" array (but preserving dtype) of a Categorical is #26410

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 30, 2019

@jreback do you want to do another check, or is this good to go?

(the one failure is flaky test)

@jreback
Copy link
Contributor

jreback commented Jun 30, 2019

lgtm but a check is failing?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, question. can you create a followup issue to actually see what we can fix of the remaining uses (unless you already did).

@@ -177,6 +177,9 @@ def get_values(self, dtype=None):
return self.values.astype(object)
return self.values

def get_block_values(self, dtype=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed?

@jorisvandenbossche
Copy link
Member Author

Opened also a general follow-up issue for removing the use of _internal_get_values: #27165 (next to the specific json issue: #27164)

@jorisvandenbossche
Copy link
Member Author

Travis failure is the pyyaml issue, so ignoring that one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DEPR: .get_values()
5 participants